Why should you know about Continual Learning?

Kishan Parshotam
Jun 23, 2021
Tesla’s new supercomputer presented at CVPR21 Workshop on Autonomous Driving.

Tesla’s approach to reaching Level 5 full driving automation has focused on large-scale data collection, selection, cleaning, and model re-training. At this year’s CVPR Workshop on Autonomous Driving, Andrej Karpathy, Director of A.I. at Tesla, presented how Tesla carries out this methodical process of data collection and model re-training, and how it has been improving their Autopilot functionality.

However, this approach may not apply to some companies:

  1. User-sensitive applications. Companies may be bound by strict data privacy policies due to the nature of their activities. As a result, the original training data may no longer be available for re-training.
  2. Very large datasets. Re-training on an ever-growing dataset demands compute on the scale of a supercomputer, which most companies cannot afford. They therefore need a different way to continuously improve their A.I. models.

In this article, we explore how Continual Learning (CL) can be used to overcome these scenarios.

What is Continual Learning?

Continual learning deals with incrementally training a neural network model on new data without depending on old samples.

For example, an online retail company receives numerous new advertisements daily. Such a company often needs to delete historical user records due to privacy concerns. For applications such as identifying duplicate listings, this may affect model (re-)training.

This example benefits from a continual learning setup, where new data is presented, used for training, and then discarded. However, this introduces the challenge of catastrophic forgetting.

What is catastrophic forgetting?

Humans can learn new objects, and new instances of known objects, every day. However, this is not the case for neural networks. Previous research [1] has shown that neural networks are highly sensitive to new information. That is, when a model is trained further on a downstream task (e.g. Task B), it risks forgetting the knowledge learned on the upstream task (e.g. Task A).

Catastrophic forgetting for two tasks. A model converged on Task A will “forget” this task when trained on a new one, Task B.
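To make this concrete, here is a minimal sketch in plain Python (no libraries, so it runs anywhere). It is a toy under strong assumptions and is not taken from [1]: the "network" has a single weight, and the two tasks (y = 2x and y = -x) are made up so that they directly conflict.

```python
def mse(w, data):
    """Mean squared error of the one-weight model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def train(w, data, lr=0.05, epochs=200):
    """Plain gradient descent on the MSE, starting from weight w."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
task_a = [(x, 2.0 * x) for x in xs]   # Task A (made up): y = 2x
task_b = [(x, -1.0 * x) for x in xs]  # Task B (made up): y = -x

w_a = train(0.0, task_a)              # converge on Task A
loss_a_before = mse(w_a, task_a)

w_b = train(w_a, task_b)              # keep training, but on Task B only
loss_a_after = mse(w_b, task_a)

print(f"Task A loss before training on B: {loss_a_before:.4f}")
print(f"Task A loss after training on B:  {loss_a_after:.4f}")
```

After the second round of training the weight has moved to fit Task B, so the error on Task A jumps from practically zero to a large value. Real networks have many more weights and tasks rarely conflict this directly, but the mechanism is the same.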

This problem stems from the ‘stability-plasticity’ dilemma described in [2]. On the one hand, a system requires stability to prevent forgetting of previous knowledge. On the other hand, such a system requires plasticity to integrate new knowledge. Therefore, research in continual learning focuses on balancing these two ends of the spectrum.

Continual Learning approaches

The techniques developed so far by the research community can roughly be categorised into three distinct approaches:

  1. Regularisation: These approaches optimise a model to serve both new and old tasks without requiring extra elements in memory. This is achieved by finding a “balance” in the network’s parameters that works for all tasks jointly.
  2. Replay: This approach stores relevant subsets of previous training sets. This can be achieved by keeping hard copies, or by using the old data to train a generative model that can sample synthetic training data points. When it is time to re-train a model on a downstream task, we bring these samples “back to life” and append them to our training set. However, as the number of tasks increases, so does the number of stored samples and/or generative models.
  3. Architectural: One can also update a sub-network within the main network by creating new branches for new tasks or new sets of data. This approach is not ideal when we have many new tasks, since we would have to keep increasing our network size.
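As a rough illustration of the "hard copies" flavour of replay, the sketch below keeps a few samples per past task and appends them to each new training set. The class name `ReplayMemory`, the `per_task` parameter, and the uniformly random choice of samples are all assumptions made for this example; deciding which samples are actually "relevant" is the hard part and is left out.

```python
import random


class ReplayMemory:
    """Keeps a small random subset ("hard copies") of every past task."""

    def __init__(self, per_task, seed=0):
        self.per_task = per_task          # samples kept per task (hypothetical knob)
        self.samples = []
        self._rng = random.Random(seed)

    def remember(self, task_data):
        """Store up to `per_task` randomly chosen samples of a finished task."""
        k = min(self.per_task, len(task_data))
        self.samples.extend(self._rng.sample(task_data, k))

    def training_set(self, new_data):
        """New data plus everything replayed from memory."""
        return list(new_data) + self.samples


memory = ReplayMemory(per_task=2)
task_a = [("a", i) for i in range(10)]   # placeholder samples
task_b = [("b", i) for i in range(10)]
task_c = [("c", i) for i in range(10)]

memory.remember(task_a)
train_b = memory.training_set(task_b)    # 10 new + 2 replayed samples
memory.remember(task_b)
train_c = memory.training_set(task_c)    # 10 new + 4 replayed samples

print(len(train_b), len(train_c))
```

The memory grows with every task, which is exactly the drawback mentioned in the list above.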

This article focuses on regularisation approaches, since these achieve adequate performance without extra memory overhead and better address data privacy.

Let’s have a look at two baselines and a few regularisation-based techniques from the Machine Learning literature:


Naive

Assuming that we have a working model already trained on a dataset that is no longer available, we can re-use it to train on the new data. To do so, we initialise a model with the old weights and then train on the new data until convergence, yielding new weights. We do this without restricting weight updates.

An illustrative example of the Naive approach for a two-layer network.
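The Naive baseline can be sketched in a few lines of plain Python. Everything concrete here is an assumption made for illustration: the "two-layer network" has just one weight per layer (out = w2 * tanh(w1 * x)), and the old weights and the new data are made up.

```python
import math


def forward(w, x):
    """Tiny two-layer network: hidden h = tanh(w[0] * x), output w[1] * h."""
    h = math.tanh(w[0] * x)
    return h, w[1] * h


def loss(w, data):
    return sum((forward(w, x)[1] - y) ** 2 for x, y in data) / len(data)


def gradients(w, data):
    """Gradient of the mean squared error with respect to both weights."""
    g1 = g2 = 0.0
    for x, y in data:
        h, out = forward(w, x)
        err = 2.0 * (out - y)
        g1 += err * w[1] * (1.0 - h * h) * x   # first layer
        g2 += err * h                          # last layer
    return [g1 / len(data), g2 / len(data)]


def train_naive(old_w, new_data, lr=0.1, epochs=500):
    """Start from the old weights and update every weight freely."""
    w = list(old_w)
    for _ in range(epochs):
        g = gradients(w, new_data)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w


old_w = [0.5, 1.0]                                   # hypothetical old weights
new_data = [(x, -math.tanh(x)) for x in (-1.0, -0.5, 0.5, 1.0)]  # made-up new task

new_w = train_naive(old_w, new_data)
print("old weights:", old_w)
print("new weights:", [round(v, 3) for v in new_w])
```

Nothing ties the new weights to the old ones, so both layers drift as far as the new data asks them to.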


Fine-Tune

Another baseline approach is to restrict weight updates to the discriminative layers, say the last fully-connected layers. This is similar to transfer learning.

An illustrative example of the Fine-Tune approach for a two-layer network.
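A minimal sketch of this baseline, again in plain Python and under made-up assumptions (a network with one weight per layer, out = w2 * tanh(w1 * x), hypothetical old weights, and a synthetic new task): only the last-layer weight receives updates.

```python
import math


def hidden(w1, x):
    """Feature extractor of the tiny network: h = tanh(w1 * x)."""
    return math.tanh(w1 * x)


def fine_tune(old_w, new_data, lr=0.1, epochs=500):
    """Train only the last layer (w2); the first layer (w1) stays frozen."""
    w1, w2 = old_w
    for _ in range(epochs):
        grad_w2 = 0.0
        for x, y in new_data:
            h = hidden(w1, x)
            grad_w2 += 2.0 * (w2 * h - y) * h
        w2 -= lr * grad_w2 / len(new_data)
    return [w1, w2]


old_w = [0.5, 1.0]                                   # hypothetical old weights
new_data = [(x, -math.tanh(x)) for x in (-1.0, -0.5, 0.5, 1.0)]  # made-up new task

new_w = fine_tune(old_w, new_data)
print("first layer unchanged:", new_w[0] == old_w[0])
print("last layer:", old_w[1], "->", round(new_w[1], 3))
```

The old features are preserved exactly, but the last layer is still free to move away from what the old task needed.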

Less Forgetting Learning [3]

Less Forgetting Learning (LFL) gives us a glimpse of how the concepts of stability and plasticity are incorporated into neural networks. In [3], the authors model stability in two ways. Firstly, by using a Euclidean loss between the activations of the old model, trained on the old data, and those of the new model. Secondly, by freezing the last layer before the Softmax. Plasticity, which allows the model to learn from new data, comes from the earlier layers, which remain free to improve their feature extraction capabilities.

An illustrative example of LFL for a two-layer network.
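The two stability mechanisms can be sketched as follows. This is a toy in plain Python rather than the implementation from [3]: the network has one weight per layer (out = w2 * tanh(w1 * x)), a squared error stands in for the classification loss, and the weighting factor `lam`, the old weights, and the data are assumptions made for the example.

```python
import math


def features(w1, x):
    """Hidden activation of the tiny network: h = tanh(w1 * x)."""
    return math.tanh(w1 * x)


def train_lfl(old_w, new_data, lam, lr=0.05, epochs=2000):
    """Train the first layer only; the last layer (w2) stays frozen.

    Loss per sample: (w2 * h - y)^2 + lam * (h - h_old)^2,
    where h_old is the activation of the old model on the same new input.
    """
    old_w1, w2 = old_w
    w1 = old_w1
    for _ in range(epochs):
        grad = 0.0
        for x, y in new_data:
            h = features(w1, x)
            h_old = features(old_w1, x)
            task_term = 2.0 * (w2 * h - y) * w2       # d(task loss)/dh
            feature_term = 2.0 * lam * (h - h_old)    # d(Euclidean term)/dh
            grad += (task_term + feature_term) * (1.0 - h * h) * x
        w1 -= lr * grad / len(new_data)
    return [w1, w2]


old_w = [0.5, 1.0]                                   # hypothetical old weights
new_data = [(x, -math.tanh(x)) for x in (-1.0, -0.5, 0.5, 1.0)]  # made-up new task

w_free = train_lfl(old_w, new_data, lam=0.0)    # no feature constraint
w_lfl = train_lfl(old_w, new_data, lam=10.0)    # strong feature constraint

print("drift of w1 without the Euclidean term:", round(abs(w_free[0] - old_w[0]), 3))
print("drift of w1 with the Euclidean term:   ", round(abs(w_lfl[0] - old_w[0]), 3))
```

With `lam=0.0` the first layer moves wherever the new data pulls it. With a larger `lam` its activations stay close to those of the old model, trading some plasticity for stability.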

Learning without Forgetting [4]

A different way of controlling stability is by fixing the decision boundaries that a model inherently learns. The Learning without Forgetting (LwF) approach leverages an idea from Teacher-Student networks. Specifically, once we have collected new data points, we run them through the base (old) model and the new model. We then regularise the training with a Knowledge Distillation loss to maximise the similarity between the outputs of the two models.

An illustrative example of LwF for a two-layer network.
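Below is a small sketch of such a loss in plain Python. It follows my reading of [4], where the distillation term compares the old-task outputs of the two models and a regular cross-entropy is used on the new-task outputs. The function names, the temperature of 2, the weight `lam`, and the example logits are assumptions for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(old_logits, new_logits, temperature=2.0):
    """Cross-entropy between the softened outputs of the old and new model."""
    p_old = softmax(old_logits, temperature)
    p_new = softmax(new_logits, temperature)
    return -sum(po * math.log(pn) for po, pn in zip(p_old, p_new))


def cross_entropy(logits, label):
    """Standard classification loss on the new task."""
    return -math.log(softmax(logits)[label])


def lwf_loss(old_logits, new_old_head, new_new_head, label, lam=1.0):
    """New-task loss plus a distillation term on the old-task outputs."""
    return cross_entropy(new_new_head, label) + lam * distillation_loss(
        old_logits, new_old_head
    )


# Hypothetical outputs for one new data point.
old_logits = [2.0, 0.5, -1.0]      # old model, old-task outputs
close_logits = [1.9, 0.6, -1.0]    # new model that still agrees with the old one
drifted_logits = [-1.0, 0.5, 2.0]  # new model that has drifted away
new_head_logits = [0.2, 1.5]       # new model, new-task outputs
label = 1

print("loss, outputs preserved:", round(lwf_loss(old_logits, close_logits, new_head_logits, label), 4))
print("loss, outputs drifted:  ", round(lwf_loss(old_logits, drifted_logits, new_head_logits, label), 4))
```

The distillation term is smallest when the new model reproduces the old model's outputs, so minimising it pulls the decision boundaries of the two models together.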

Elastic Weight Consolidation [5]

This approach lets the neurons do the work. We do not define which module of the network to control in order to avoid forgetting; instead, we let the model decide. Specifically, the authors of Elastic Weight Consolidation (EWC) use the Fisher Information Matrix to estimate how important each weight is to the old task. During training, this acts as a regulariser between the old and the new model updates (L_f).

An illustrative example of EWC for a two-layer network.
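Here is a toy sketch of the idea in plain Python, not the implementation from [5]. It assumes a linear model with two weights and made-up tasks, and it adds the penalty (lam / 2) * sum_i F_i * (w_i - w_old_i)^2 to the new-task loss. For a linear model with a unit-variance Gaussian likelihood, the diagonal of the Fisher Information Matrix reduces to the mean squared input per weight, which is what `fisher_diagonal` computes.

```python
def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def mse(w, data):
    return sum((predict(w, x) - y) ** 2 for x, y in data) / len(data)


def mse_grad(w, data):
    g = [0.0] * len(w)
    for x, y in data:
        err = 2.0 * (predict(w, x) - y)
        for i, xi in enumerate(x):
            g[i] += err * xi / len(data)
    return g


def fisher_diagonal(data, n_weights):
    """Diagonal Fisher for a linear, unit-variance Gaussian model: mean of x_i^2."""
    return [sum(x[i] ** 2 for x, _ in data) / len(data) for i in range(n_weights)]


def train(w, data, lr=0.01, epochs=3000, old_w=None, fisher=None, lam=0.0):
    """Gradient descent; adds the EWC penalty gradient when a Fisher is given."""
    w = list(w)
    for _ in range(epochs):
        g = mse_grad(w, data)
        if fisher is not None:
            g = [gi + lam * f * (wi - oi)
                 for gi, f, wi, oi in zip(g, fisher, w, old_w)]
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w


xs = [-2.0, -1.0, 1.0, 2.0]
task_a = [((x, 0.0), 2.0 * x) for x in xs]   # old task: only the first input is used
task_b = [((x, x), 3.0 * x) for x in xs]     # new task: needs w[0] + w[1] == 3

w_a = train([0.0, 0.0], task_a)
fisher = fisher_diagonal(task_a, 2)          # w[0] matters for Task A, w[1] does not

w_naive = train(w_a, task_b)
w_ewc = train(w_a, task_b, old_w=w_a, fisher=fisher, lam=10.0)

print("Fisher diagonal:", fisher)
print("naive: Task A loss", round(mse(w_naive, task_a), 4), "Task B loss", round(mse(w_naive, task_b), 4))
print("EWC:   Task A loss", round(mse(w_ewc, task_a), 4), "Task B loss", round(mse(w_ewc, task_b), 4))
```

In this toy, the Fisher values mark the first weight as important and the second as free. EWC therefore solves Task B by changing the second weight, while naive training splits the change across both weights and damages Task A.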


This article provided the motivation for Continual Learning and a short insight into how academia is tackling it. Bringing these approaches into large-scale applications is still a challenge, but a few companies are pioneering this.

In my next blog post I will talk more about my research [6] and how it was applied in practice at OLX Group to detect duplicate listings!


[1] Goodfellow, Ian J., et al. “An empirical investigation of catastrophic forgetting in gradient-based neural networks.” ICLR (2013).

[2] McClelland, James L., Bruce L. McNaughton, and Randall C. O’Reilly. “Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory.” Psychological Review (1995).

[3] Jung, Heechul, et al. “Less-forgetting learning in deep neural networks.” arXiv preprint arXiv:1607.00122 (2016).

[4] Li, Zhizhong, and Derek Hoiem. “Learning without forgetting.” IEEE Transactions on Pattern Analysis and Machine Intelligence (2017).

[5] Kirkpatrick, James, et al. “Overcoming catastrophic forgetting in neural networks.” Proceedings of the National Academy of Sciences (2017).

[6] Parshotam, Kishan, and Mert Kilickaya. “Continual Learning of Object Instances.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (2020).