Unifying Continual Learning and Meta-Learning with Meta-Experience Replay

We have recently developed Meta-Experience Replay (MER), a new framework that integrates meta-learning and experience replay for continual learning. It combines an efficient meta-learning algorithm called Reptile with a widely successful technique for stabilizing reinforcement learning called Experience Replay. MER achieves state-of-the-art performance on continual learning benchmarks and is mathematically similar to Gradient Episodic Memory. Our algorithm optimizes for a new view of the deep continual learning problem. We hope it will motivate future research towards applying meta-learning in continual learning settings.

Continual learning and the transfer-interference trade-off

AI trained with neural networks has recently led to impressive advances in natural language processing, image processing, and complex games such as Go, chess, and poker. However, the networks trained to solve these problems are very specialized or “narrow” in their focus, working well under certain tightly controlled operating conditions, but unable to expand to new domains. This is because while neural networks have achieved great successes when training with many passes over fixed stationary distributions of data, they struggle when training over changing non-stationary distributions of data. Put differently, neural networks can effectively solve many tasks if we train them from scratch and continually sample from all tasks many times until training has converged. Meanwhile, they struggle when training incrementally if there is a time dependence to the data received, which is what we generally experience in the real world.

Unfortunately, training in this “offline” fashion is unnatural for a large variety of cases as we move from “narrow” AI with a single competency to “broad” AI that masters many competencies over a lifetime. For example, when we are training a lifelong classifier for a new use case, we would like to gradually provide it with new examples and get real-time updates to the classifier as we are teaching it; waiting hours, days, or weeks for those updates is not acceptable. In these cases, a classifier that literally starts from scratch and re-introduces itself to everything it has ever known simply to integrate a single experience within a potentially massive lifetime isn’t feasible. For AI to tackle “broad” AI use cases in natural settings like this one, we need neural networks that are capable of continual learning. Continual learning is a setting where a network is trained incrementally after each example it receives. The network only gets to see each example one time, and the distribution of data received has a time dependence.

There is a long history, dating back to at least the 1980s, of trying to understand the difficulty of performing continual learning with neural networks. Many recent papers focus on McCloskey and Cohen’s characterization of the “catastrophic forgetting” problem in neural networks, where neural networks tend to quickly unlearn old knowledge without repetition to reinforce the training. However, approaches that only consider stabilizing continual learning by reducing “forgetting” are only looking at half the picture, so it is easy to construct domains where they fail. More generally, the difficulty neural networks have with training continually can be explained by Grossberg’s “stability-plasticity dilemma”. Acknowledging this dilemma makes it clear that while reducing “forgetting” may improve network “stability”, it is not really addressing the greater problem if it comes at the cost of “plasticity,” as it so often does.

In our work, we introduce an alternative view of Grossberg’s famous dilemma that we call the “transfer-interference trade-off”. Here we are referring to the dilemma of deciding on the degree to which different examples should be learned using the same weights. If two examples are learned using different weights, there is low potential for both interference and transfer between examples while learning either example. This is optimal if the two examples are unrelated, but is sub-optimal if the two examples are related, as it eliminates the opportunity for transfer learning. In contrast, if two examples are learned using the same weights, there is high potential for both transfer and interference. This is optimal if the examples are related, but can lead to high interference if they are not related. With this view in mind, we can see that vanilla neural networks struggle with continual learning largely because they only implicitly learn a solution to this trade-off: they greedily optimize for each example without any direct supervision about pairwise interactions between the learning of examples. On the other hand, if a neural network were trained to share its weights across examples based on interactions between the gradients of examples it has seen so far, this should make it easier to perform continual learning going forward, to the extent that what we learn about balancing weight sharing generalizes. This is the central motivation of MER.
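To make the idea of "interactions between the gradients of examples" concrete, here is a minimal sketch. It assumes a toy linear model with squared loss (not the networks used in our experiments) and reads the sign of the dot product between two per-example gradients: positive suggests transfer, negative suggests interference, and zero means the examples do not share weights at that point. The function names are illustrative only.

```python
def squared_loss_grad(w, x, y):
    """Gradient of (w.x - y)^2 with respect to w for a toy linear model."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [2.0 * err * xi for xi in x]


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def interaction(w, example_i, example_j):
    """Classify how learning example_i affects example_j at parameters w."""
    g_i = squared_loss_grad(w, *example_i)
    g_j = squared_loss_grad(w, *example_j)
    d = dot(g_i, g_j)
    if d > 0:
        return "transfer"      # a step on one example also lowers the other's loss
    if d < 0:
        return "interference"  # a step on one example raises the other's loss
    return "neutral"           # the examples touch disjoint weights here


w = [0.0, 0.0]
print(interaction(w, ([1.0, 0.0], 1.0), ([1.0, 0.0], 2.0)))   # transfer
print(interaction(w, ([1.0, 0.0], 1.0), ([1.0, 0.0], -1.0)))  # interference
print(interaction(w, ([1.0, 0.0], 1.0), ([0.0, 1.0], 1.0)))   # neutral
```

The third case mirrors the "different weights" scenario described above: with no shared weights there is neither transfer nor interference.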

How MER works

There are many possible instantiations of MER, and we present three variants, all using Reptile, in our ICLR paper. In principle, many other meta-learning algorithms could be used in place of Reptile within this framework. In the following pseudocode we outline a particularly simple and effective variant that we explore in our paper:

Initialize θ, the parameter vector
Initialize M = [ ], the memory buffer
for incoming example (x_c, y_c) ∼ P(x, y, t) do
  Draw a random batch of k examples from M and interleave (x_c, y_c) into it
  Train using SGD sequentially on each of the k+1 examples, using an s times higher learning rate for (x_c, y_c), resulting in parameters θ′
  Reptile update: θ ← θ + γ(θ′ − θ)
  Add to buffer: M ← M ∪ {(x_c, y_c)} using reservoir sampling
end for
Return θ
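The loop above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the released implementation: it uses a toy linear model with squared loss, places the current example at a random position in the batch (one possible reading of "interleave"), and all names and default hyperparameters (`mer_train`, `lr`, `gamma`, `k`, `s`, `capacity`) are hypothetical.

```python
import random


def sgd_step(theta, x, y, lr):
    """One SGD step on squared loss for a toy linear model y ~ theta.x."""
    err = sum(t * xi for t, xi in zip(theta, x)) - y
    return [t - lr * 2.0 * err * xi for t, xi in zip(theta, x)]


def reservoir_add(memory, item, n_seen, capacity, rng):
    """Reservoir sampling: every item seen so far is kept with equal probability."""
    if len(memory) < capacity:
        memory.append(item)
    else:
        j = rng.randrange(n_seen)  # n_seen already counts the new item
        if j < capacity:
            memory[j] = item


def mer_train(stream, dim, lr=0.05, gamma=0.5, k=4, s=2.0, capacity=50, seed=0):
    """Single-pass MER-style training; returns the parameters and the buffer."""
    rng = random.Random(seed)
    theta = [0.0] * dim
    memory = []
    for n_seen, (x_c, y_c) in enumerate(stream, start=1):
        # Draw up to k stored examples and mix in the current one,
        # which gets an s times higher learning rate.
        batch = [(ex, lr) for ex in rng.sample(memory, min(k, len(memory)))]
        batch.append(((x_c, y_c), s * lr))
        rng.shuffle(batch)  # assumption: random position for the current example

        # Sequential SGD over the k+1 examples, starting from theta.
        theta_prime = list(theta)
        for (x, y), step in batch:
            theta_prime = sgd_step(theta_prime, x, y, step)

        # Reptile update: move part of the way towards theta_prime.
        theta = [t + gamma * (tp - t) for t, tp in zip(theta, theta_prime)]

        reservoir_add(memory, (x_c, y_c), n_seen, capacity, rng)
    return theta, memory
```

Each example in `stream` is a pair `(x, y)` with `x` a list of `dim` floats. The buffer is returned alongside θ purely for inspection; the pseudocode itself returns only θ.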


This is a fairly straightforward integration of Experience Replay with a Reptile meta-update over sampled mini-batches in place of standard mini-batch SGD. The computation is comparable to that of standard Experience Replay, except that it depends more heavily on sequential computation, which is difficult to parallelize. The added benefit of the Reptile update rule, however, is that it allows us to approximate the following regularized objective without computing expensive second derivatives, to the extent that the buffer M captures the distribution D of data seen so far:
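Written out in the notation of the accompanying ICLR paper, the objective has roughly the following form. The symbols are assumptions relative to this post: L is the per-example loss, (x_i, y_i) and (x_j, y_j) are two examples drawn from D, and α is a hyperparameter weighting the gradient-alignment term.

```latex
\theta = \arg\min_{\theta} \;
\mathbb{E}_{[(x_i, y_i), (x_j, y_j)] \sim D}
\left[
  L(x_i, y_i) + L(x_j, y_j)
  - \alpha \, \frac{\partial L(x_i, y_i)}{\partial \theta}
    \cdot \frac{\partial L(x_j, y_j)}{\partial \theta}
\right]
```

The first two terms fit each example, while the last term rewards a positive dot product between their gradients.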

This regularized objective is well suited to navigating the transfer-interference trade-off. In addition to training a network to learn each example, this objective also incentivizes the network to learn examples in a way that maximizes agreement between gradients. So, to the extent that what we learn about gradient dot products generalizes, continual learning should become easier going forward, with less interference and more transfer. In comparison, Gradient Episodic Memory also minimizes disagreement between gradients (promoting “stability”), but does not directly incentivize agreement between gradients for fast adaptation (promoting “plasticity”).

We tested MER on several datasets, including MNIST Rotations, MNIST Permutations, a more non-stationary version of MNIST Permutations with more frequent task switches, incremental learning on Omniglot, and non-stationary versions of the video games Catcher and Flappy Bird. For example, we trained our agent to navigate through pipes in Flappy Bird while making the pipe gap the bird needs to get through smaller and smaller every 25 thousand steps of training. This enabled us to test how the agent learns in a non-stationary environment. Continual reinforcement learning is very difficult, especially when there are sudden changes to the dynamics of gameplay, as it really tests the agent’s ability to detect changes in the environment without supervision. However, by learning the harder settings, an agent trained using MER (DQN-MER) got better at playing the easier settings, showing that it was able to find commonalities across the changing environment and improve its performance when a typical DQN trained with Experience Replay could not.

MER implementation

We have made an implementation of MER for our MNIST continual learning experiments available on GitHub as a fork of the GEM project in order to promote consistent evaluation with past approaches. We hope our codebase can provide a jumping-off point for others looking to test algorithms that improve continual learning for neural networks.

Our team

This work is the result of a fluid collaboration between industry and academia focused on improving continual learning for neural networks. The research team at IBM includes Matthew Riemer, Miao Liu, Irina Rish, Yuhai Tu, and Gerald Tesauro. Ignacio Cases is a PhD student in the Stanford NLP Group and a recipient of the 2019 IBM PhD Fellowship. Robert Ajemian is a Research Scientist in the Brain and Cognitive Science Department at MIT. This work is part of a larger collaboration on learning in non-stationary environments within the MIT-IBM Watson AI Lab.

Let’s chat at ICLR!

Come hear more about our work at ICLR 2019 in New Orleans. We will be presenting poster #82 on May 8th from 11:00AM to 1:00PM in Great Hall BC of the Ernest N. Morial Convention Center. Looking forward to meeting you there!

IBM Research

Ignacio Cases

PhD candidate in Computational Linguistics, Stanford University
