Continual Learning: Teaching Neural Networks New Tricks Without Forgetting Old Ones

khaled November 14, 2025 3 mins read

Human beings learn throughout their lives without erasing what they knew before. Neural networks, trained naively on sequential tasks, do the opposite: learning a new task overwrites weights that encoded the previous one — catastrophic forgetting. Continual learning is the field dedicated to solving this problem, with implications for any system that must adapt to new data without access to all historical data.

Catastrophic Forgetting: Why It Happens

Standard gradient descent updates weights to minimise loss on the current task's data. When a new task arrives, gradient updates are indifferent to previous task performance — they move weights wherever the new task's loss surface demands. Since the same weights encode both old and new knowledge, new task optimisation systematically degrades old task performance.

This is a structural consequence of a fixed-size shared weight space, not a bug. Solving it requires protecting weights that matter for old tasks, expanding the weight space, or replaying old data.
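The mechanism can be seen in a deliberately tiny example. Here a single shared weight is trained on two toy "tasks" in sequence with plain SGD (an illustrative setup, not a real benchmark): learning the second task simply overwrites the solution for the first.

```python
# Toy illustration of catastrophic forgetting: one shared weight w,
# two tasks in sequence. Task A wants w ≈ 2.0; task B wants w ≈ -2.0.
# Plain SGD on task B freely moves w wherever task B's loss demands.

def task_loss(w, target):
    return (w - target) ** 2

def sgd(w, target, lr=0.1, steps=100):
    for _ in range(steps):
        w -= lr * 2 * (w - target)  # gradient of (w - target)^2
    return w

w = sgd(0.0, target=2.0)            # learn task A
loss_a_before = task_loss(w, 2.0)   # ≈ 0: task A is solved
w = sgd(w, target=-2.0)             # learn task B, no protection for task A
loss_a_after = task_loss(w, 2.0)    # large: task A has been forgotten

print(loss_a_before, loss_a_after)
```

The same dynamic plays out in high dimensions: the new task's gradients are simply indifferent to the old task's loss surface.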

Regularisation-Based Approaches: EWC

Elastic Weight Consolidation (EWC, Kirkpatrick et al., DeepMind 2017) adds a regularisation term penalising changes to weights important for previous tasks. Importance is measured using the Fisher information matrix — which approximates the curvature of the loss surface around the previous task's solution.

Weights with high Fisher information (changing them degrades old task performance) are heavily penalised for changing. Weights with low Fisher information are free to adapt for the new task.

EWC scales to sequential multi-task scenarios and requires no storage of old data. Its main limitation is that the (typically diagonal) Fisher approximation grows increasingly imprecise as the number of tasks increases.
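A scalar sketch of the EWC penalty, continuing the toy example above (this is an illustrative simplification, not the paper's full algorithm; the Fisher value and the strength `lam` are placeholder numbers):

```python
# EWC sketch on a scalar weight: the new-task loss plus a quadratic
# penalty anchoring w near the old-task solution w_star, weighted by
# (diagonal) Fisher information. High Fisher => strong anchor.

def ewc_loss(w, new_target, w_star, fisher, lam=100.0):
    new_task = (w - new_target) ** 2
    penalty = 0.5 * lam * fisher * (w - w_star) ** 2
    return new_task + penalty

def ewc_sgd(w, new_target, w_star, fisher, lam=100.0, lr=0.01, steps=2000):
    for _ in range(steps):
        # gradient of ewc_loss with respect to w
        grad = 2 * (w - new_target) + lam * fisher * (w - w_star)
        w -= lr * grad
    return w

w_star = 2.0                          # solution found on task A
w = ewc_sgd(w_star, new_target=-2.0, w_star=w_star, fisher=1.0)
print(w)  # a compromise pulled strongly toward the task A anchor
```

With a high Fisher value the weight barely moves from the old solution; with Fisher near zero the penalty vanishes and the weight is free to adapt, which is exactly the behaviour described above.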

Architecture-Based Approaches: Progressive Networks

Progressive Neural Networks (PNNs; Rusu et al., DeepMind 2016) never modify old weights. When a new task arrives, a new column (sub-network) is instantiated with lateral connections from previous columns. Old columns are frozen; only the new column's weights are trained.

PNN completely prevents forgetting by construction, but the network grows with each task — unsuitable for very long task sequences.
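The column-plus-lateral-connection structure can be sketched with single linear layers (shapes and names here are illustrative, not from the paper):

```python
import numpy as np

# Progressive-network sketch: column 1 is frozen after task 1; column 2
# has its own weights W2 plus a lateral matrix U21 that reads column 1's
# hidden features, so task 2 can reuse task 1's representations.

rng = np.random.default_rng(0)

W1 = rng.normal(size=(4, 3))         # column 1: trained on task 1, then frozen
W2 = rng.normal(size=(4, 3))         # column 2's own weights (trainable)
U21 = rng.normal(size=(4, 4))        # lateral connection: h1 -> column 2

def forward_task2(x):
    h1 = np.tanh(W1 @ x)             # frozen features from the old column
    h2 = np.tanh(W2 @ x + U21 @ h1)  # new column combines its own and lateral input
    return h2

x = rng.normal(size=3)
print(forward_task2(x).shape)        # one hidden vector per new column
```

Because gradients for task 2 only flow into `W2` and `U21`, task 1's behaviour is untouched by construction; the price is the extra column and lateral matrices added per task.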

Replay-Based Approaches

Experience replay: maintain a fixed buffer of old examples and interleave them with new task training. Simple and effective when storing real data is permissible.
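A fixed buffer needs a policy for which old examples to keep. One common choice (an assumption here, the text does not prescribe a scheme) is reservoir sampling, which keeps a uniform sample over the whole stream:

```python
import random

# Fixed-capacity replay buffer using reservoir sampling: every example
# seen so far has equal probability of being in the buffer, regardless
# of how long the task stream is.

class ReplayBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []
        self.seen = 0

    def add(self, example):
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append(example)
        else:
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.data[j] = example   # replace with decreasing probability

    def sample(self, k):
        return random.sample(self.data, min(k, len(self.data)))

buf = ReplayBuffer(capacity=100)
for i in range(1000):
    buf.add(i)
batch = buf.sample(32)  # interleave this with the new task's minibatches
```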

Generative replay: train a generative model on old task data and generate pseudo-examples during new task training — avoiding real data storage, useful for privacy-sensitive settings.

Dark Experience Replay (DER): store the model's soft logit predictions alongside inputs, using knowledge distillation to preserve the old model's output distribution. More sample-efficient than standard experience replay.
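The DER objective is simple to state: the new-task loss plus a term penalising drift from the stored logits (DER uses a mean-squared error on logits rather than a full softened-KL distillation; the weight `alpha` and the numbers below are illustrative):

```python
import numpy as np

# DER sketch: each buffered input is stored with the logits the model
# produced at buffering time. Training penalises the current model for
# drifting away from those stored logits (MSE on logits).

def der_loss(new_task_loss, current_logits, stored_logits, alpha=0.5):
    distill = np.mean((current_logits - stored_logits) ** 2)
    return new_task_loss + alpha * distill

stored = np.array([2.0, -1.0, 0.5])   # logits saved when the input was buffered
current = np.array([1.8, -0.9, 0.7])  # logits the model produces now
loss = der_loss(new_task_loss=0.3, current_logits=current, stored_logits=stored)
print(loss)
```

Matching full logit vectors is a stronger constraint per stored example than matching a single hard label, which is why DER tends to be more sample-efficient than replaying labels alone.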

Evaluation Dimensions

  • Backward transfer: does learning task N degrade performance on tasks 1 through N-1?
  • Forward transfer: does prior learning improve performance on new tasks?
  • Final average accuracy: mean accuracy across all tasks after training on the full sequence
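These metrics are usually computed from an accuracy matrix, following the convention of Lopez-Paz and Ranzato's GEM paper: `R[i, j]` is the accuracy on task j after finishing training on task i (the numbers below are made up for illustration):

```python
import numpy as np

# R[i, j] = accuracy on task j after training through task i.
# Below the diagonal: performance on earlier tasks after later training.
R = np.array([
    [0.90, 0.10, 0.10],
    [0.70, 0.85, 0.15],
    [0.60, 0.75, 0.80],
])
T = R.shape[0]

avg_acc = R[-1].mean()  # final average accuracy over all tasks

# Backward transfer: how much training later tasks changed earlier-task
# accuracy relative to the accuracy right after each task was learned.
# Negative values indicate forgetting.
bwt = np.mean([R[-1, j] - R[j, j] for j in range(T - 1)])

print(avg_acc, bwt)
```

In this made-up matrix backward transfer is negative (the method forgets), which is the typical failure mode these benchmarks are designed to expose.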

Most methods succeed mainly at limiting negative backward transfer (forgetting) and show little forward transfer — achieving genuine positive forward transfer remains an open research challenge.

Conclusion

EWC suits scenarios where old data cannot be retained and the task count is moderate. Progressive networks suit scenarios where any forgetting is unacceptable and the growth in network size is affordable. Replay methods suit scenarios where storing representative data is feasible. The right choice depends on your task sequence length, data retention constraints, and network capacity budget.

Keywords: continual learning, catastrophic forgetting, elastic weight consolidation, EWC, progressive networks, experience replay, lifelong learning, sequential learning, neural networks