Paper notes: Overcoming Catastrophic Forgetting in Neural Networks




Source: PNAS (Proceedings of the National Academy of Sciences), 25 Jan 2017

Authors: the DeepMind team (the individual authors are not listed here)



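Deep neural networks today still face one hard problem: continual learning. The human brain has a finite number of neurons, so when it learns it does not start again from scratch for every new problem. Instead, it modifies existing combinations of neurons so that learning can continue.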

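This paper builds on a finding from biology, synaptic consolidation, and modifies existing deep neural network training by adding extra per-weight terms, so that artificial neural networks are better suited to continual learning.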


The ability to learn tasks in a sequential fashion is crucial to the development of artificial intelligence. Neural networks are not, in general, capable of this and it has been widely thought that catastrophic forgetting is an inevitable feature of connectionist models. We show that it is possible to overcome this limitation and train networks that can maintain expertise on tasks which they have not experienced for a long time. Our approach remembers old tasks by selectively slowing down learning on the weights important for those tasks. We demonstrate our approach is scalable and effective by solving a set of classification tasks based on the MNIST hand written digit dataset and by learning several Atari 2600 games sequentially.



(2) Catastrophic forgetting has been widely thought to be an inevitable feature of this kind of network structure (connectionist models);











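The algorithm proposed in this paper is called elastic weight consolidation (EWC). For the important part (the algorithm design), I was asked to give a complete translation while reading the paper; some of it is paraphrased rather than literal.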

In brains, synaptic consolidation enables continual learning by reducing the plasticity of synapses that are vital to previously learned tasks. We implement an algorithm that performs a similar operation in artificial neural networks by constraining important parameters to stay close to their old values. In this section we explain why we expect to find a solution to a new task in the neighborhood of an older one, how we implement the constraint, and finally how we determine which parameters are important.


In this work, we demonstrate that task-specific synaptic consolidation offers a novel solution to the continual learning problem for artificial intelligence. We develop an algorithm analogous to synaptic consolidation for artificial neural networks, which we refer to as elastic weight consolidation (EWC for short). This algorithm slows down learning on certain weights based on how important they are to previously seen tasks. We show how EWC can be used in supervised learning and reinforcement learning problems to train several tasks sequentially without forgetting older ones, in marked contrast to previous deep-learning techniques.

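In other words: task-specific synaptic consolidation offers a new route to continual learning in artificial intelligence. EWC, the analogous algorithm for artificial neural networks, slows down learning on the weights that mattered most for earlier tasks. It applies to both supervised learning and reinforcement learning, training several tasks in sequence without forgetting the older ones, unlike earlier deep-learning techniques.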

A deep neural network consists of multiple layers of linear projection followed by element-wise non-linearities. Learning a task consists of adjusting the set of weights and biases $\theta$ of the linear projections, to optimize performance. Many configurations of $\theta$ will result in the same performance; this is relevant for EWC: over-parameterization makes it likely that there is a solution for task B, $\theta^*_B$, that is close to the previously found solution for task A, $\theta^*_A$. While learning task B, EWC therefore protects the performance in task A by constraining the parameters to stay in a region of low error for task A centered around $\theta^*_A$, as shown schematically in Figure 1. This constraint is implemented as a quadratic penalty, and can therefore be imagined as a spring anchoring the parameters to the previous solution, hence the name elastic. Importantly, the stiffness of this spring should not be the same for all parameters; rather, it should be greater for those parameters that matter most to the performance during task A.


In order to justify this choice of constraint and to define which weights are most important for a task, it is useful to consider neural network training from a probabilistic perspective. From this point of view, optimizing the parameters is tantamount to finding their most probable values given some data $D$. We can compute this conditional probability $p(\theta|D)$ from the prior probability of the parameters $p(\theta)$ and the probability of the data $p(D|\theta)$ by using Bayes' rule:

$$\log p(\theta|D) = \log p(D|\theta) + \log p(\theta) - \log p(D) \qquad (1)$$


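Note that the log probability of the data given the parameters $\log p(D|\theta)$ is simply the negative of the loss function for the problem at hand, $-\mathcal{L}(\theta)$. Assume that the data is split into two independent parts, one defining task A ($D_A$) and the other task B ($D_B$). Then we can re-arrange equation 1 (the Bayes' rule expression above):

$$\log p(\theta|D) = \log p(D_B|\theta) + \log p(\theta|D_A) - \log p(D_B) \qquad (2)$$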


Note that the left hand side is still describing the posterior probability of the parameters given the entire dataset, while the right hand side only depends on the loss function for task B, $\log p(D_B|\theta)$.
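To make the rearrangement concrete, here is a minimal numerical sketch. It is not from the paper: it assumes a hypothetical toy model in which $\theta$ is a coin bias restricted to three candidate values, and the names `thetas`, `data_a` and `data_b` are illustrative. It checks that using the task-A posterior as the prior for task B gives the same result as conditioning on all the data at once.

```python
import math

# Hypothetical toy setup (an assumption for illustration): theta is a coin
# bias that can only take one of three values, with a uniform prior.
thetas = [0.2, 0.5, 0.8]
prior = [1 / 3, 1 / 3, 1 / 3]

def likelihood(data, theta):
    """p(data | theta) for independent coin flips (1 = heads, 0 = tails)."""
    p = 1.0
    for x in data:
        p *= theta if x == 1 else 1.0 - theta
    return p

def posterior(data, prior_probs):
    """Bayes' rule on the discrete grid: normalise p(data | theta) * prior."""
    unnorm = [likelihood(data, t) * p for t, p in zip(thetas, prior_probs)]
    evidence = sum(unnorm)
    return [u / evidence for u in unnorm]

data_a = [1, 1, 0, 1]  # stands in for the task A data D_A
data_b = [0, 1, 1]     # stands in for the task B data D_B

# Posterior given the entire dataset in one go.
post_all = posterior(data_a + data_b, prior)

# Posterior given D_A, then reused as the prior when D_B arrives.
post_a = posterior(data_a, prior)
post_seq = posterior(data_b, post_a)

# Log form of the rearranged rule, checked for every candidate theta.
# The normaliser (written log p(D_B) in the text) is computed here under
# the task-A posterior.
log_norm_b = math.log(
    sum(likelihood(data_b, t) * p for t, p in zip(thetas, post_a))
)
for i, t in enumerate(thetas):
    lhs = math.log(post_all[i])
    rhs = math.log(likelihood(data_b, t)) + math.log(post_a[i]) - log_norm_b
    assert abs(lhs - rhs) < 1e-9
```

The point of the check is that nothing about $D_A$ is needed once `post_a` is known, which is why the text can say all information about task A sits in that posterior.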


All the information about task A must therefore have been absorbed into the posterior distribution $p(\theta|D_A)$. This posterior probability must contain information about which parameters were important to task A and is therefore the key to implementing EWC. The true posterior probability is intractable, so, following the work on the Laplace approximation by Mackay, we approximate the posterior as a Gaussian distribution with mean given by the parameters $\theta^*_A$ and a diagonal precision given by the diagonal of the Fisher information matrix $F$. $F$ has three key properties: (a) it is equivalent to the second derivative of the loss near a minimum, (b) it can be computed from first-order derivatives alone and is thus easy to calculate even for large models, and (c) it is guaranteed to be positive semi-definite. Note that this approach is similar to expectation propagation where each subtask is seen as a factor of the posterior. Given this approximation, the function $\mathcal{L}$ that we minimize in EWC is:

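In short: everything learned about task A lives in the posterior $p(\theta|D_A)$, which therefore encodes how important each parameter is to task A and is the key to implementing EWC. Because the exact posterior is intractable, it is approximated (Laplace approximation) by a Gaussian whose mean is $\theta^*_A$ and whose diagonal precision is the diagonal of the Fisher information matrix $F$. $F$ has three key properties: (a) it matches the second derivative of the loss near a minimum; (b) it needs only first-order derivatives, so it scales to large models; (c) it is guaranteed to be positive semi-definite. The approach resembles expectation propagation, with each subtask treated as one factor of the posterior. Under this approximation, the function $\mathcal{L}$ minimized in EWC becomes: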

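$$\mathcal{L}(\theta) = \mathcal{L}_B(\theta) + \sum_i \frac{\lambda}{2} F_i (\theta_i - \theta^*_{A,i})^2 \qquad (3)$$

where $\mathcal{L}_B(\theta)$ is the loss for task B only, $\lambda$ sets how important the old task is compared to the new one and $i$ labels each parameter.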
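The two ingredients of this objective, a diagonal Fisher estimate built from first-order gradients and the quadratic penalty weighted by it, can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: it assumes a two-weight logistic regression as the "network", and `TASK_A_INPUTS`, `W_STAR_A`, `LAM` and `loss_b` are hypothetical placeholders.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_log_lik(w, x, y):
    """Gradient of log p(y | x, w) for logistic regression: (y - p) * x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return [(y - p) * xi for xi in x]

def diagonal_fisher(w, inputs):
    """Diagonal Fisher: mean over inputs of E_y[(d log p / d w_i)^2],
    with y drawn from the model's own predictive distribution.
    Only first-order gradients are used."""
    fisher = [0.0] * len(w)
    for x in inputs:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        for y, prob_y in ((1, p), (0, 1.0 - p)):
            for i, g in enumerate(grad_log_lik(w, x, y)):
                fisher[i] += prob_y * g * g
    return [f / len(inputs) for f in fisher]

def ewc_penalty(w, w_star_a, fisher, lam):
    """sum_i (lam / 2) * F_i * (w_i - w*_A,i)^2"""
    return sum(
        0.5 * lam * f * (wi - wa) ** 2
        for wi, wa, f in zip(w, w_star_a, fisher)
    )

def ewc_loss(loss_b, w, w_star_a, fisher, lam):
    """Task B loss plus the penalty anchoring w to the task A solution."""
    return loss_b(w) + ewc_penalty(w, w_star_a, fisher, lam)

# Hypothetical task A: the second input feature is always zero, so the
# second weight never affects task A predictions.
TASK_A_INPUTS = [[1.0, 0.0], [2.0, 0.0], [-1.0, 0.0]]
W_STAR_A = [0.5, -0.3]   # assumed solution found for task A
LAM = 100.0              # assumed importance of the old task

FISHER = diagonal_fisher(W_STAR_A, TASK_A_INPUTS)

def loss_b(w):
    """Placeholder task B loss (an assumption): pulls both weights to 1."""
    return sum((wi - 1.0) ** 2 for wi in w)

print(FISHER)
print(ewc_loss(loss_b, [0.5, 1.0], W_STAR_A, FISHER, LAM))
```

In this toy case the Fisher entry for the second weight is zero, so that weight can move freely toward the task B solution, while moving the first weight away from its task A value is penalized.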


When moving to a third task, task C, EWC will try to keep the network parameters close to the learned parameters of both task A and B. This can be enforced either with two separate penalties, or as one by noting that the sum of two quadratic penalties is itself a quadratic penalty.
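The last remark can be verified for a single parameter by completing the square. The helper below is a hypothetical sketch (the name `merge_penalties` and its arguments are not from the paper); it assumes each stiffness already includes the factor $\lambda$ and that the two stiffnesses do not sum to zero.

```python
def merge_penalties(f_a, theta_a, f_b, theta_b):
    """Rewrite 0.5*f_a*(t - theta_a)**2 + 0.5*f_b*(t - theta_b)**2
    as 0.5*f*(t - centre)**2 + const.

    Assumes f_a + f_b > 0. The constant does not depend on t, so it has
    no effect on gradients and can be dropped during training.
    """
    f = f_a + f_b
    centre = (f_a * theta_a + f_b * theta_b) / f
    const = 0.5 * f_a * f_b / f * (theta_a - theta_b) ** 2
    return f, centre, const

def two_penalties(t, f_a, theta_a, f_b, theta_b):
    return 0.5 * f_a * (t - theta_a) ** 2 + 0.5 * f_b * (t - theta_b) ** 2

# Illustrative numbers only: one weight, anchored at 0.0 by task A
# (stiffness 4.0) and at 1.0 by task B (stiffness 1.0).
f, centre, const = merge_penalties(4.0, 0.0, 1.0, 1.0)
for t in (-1.0, 0.0, 0.3, 2.0):
    merged = 0.5 * f * (t - centre) ** 2 + const
    assert abs(merged - two_penalties(t, 4.0, 0.0, 1.0, 1.0)) < 1e-12
```

The merged stiffness is the sum of the two, and the merged anchor is their stiffness-weighted average, so the anchor sits closer to the task for which the weight matters more.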



Figure 1: elastic weight consolidation (EWC) ensures task A is remembered whilst training on task B. Training trajectories are illustrated in a schematic parameter space, with parameter regions leading to good performance on task A (gray) and on task B (cream). After learning the first task, the parameters are at $\theta^*_A$. If we take gradient steps according to task B alone (blue arrow), we will minimize the loss of task B but destroy what we have learnt for task A. On the other hand, if we constrain each weight with the same coefficient (green arrow) the restriction imposed is too severe and we can only remember task A at the expense of not learning task B. EWC, conversely, finds a solution for task B without incurring a significant loss on task A (red arrow) by explicitly computing how important weights are for task A.
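The three arrows in the figure can be imitated with a small two-parameter experiment. The sketch below is a toy under stated assumptions, not a reproduction of the paper's experiments: both task losses are hypothetical quadratics, the diagonal curvature of the task A loss stands in for the Fisher information (property (a) above), and $\lambda = 1$.

```python
# Hypothetical quadratic losses: task A is steep in the first parameter and
# nearly flat in the second; task B is the other way round.
H_A = (10.0, 0.1)      # curvature of the task A loss (stand-in for Fisher)
THETA_A = (0.0, 0.0)   # task A solution, also the starting point
H_B = (0.1, 10.0)      # curvature of the task B loss
THETA_B = (1.0, 1.0)   # task B solution

def quad_loss(theta, h, centre):
    return sum(0.5 * hi * (t - c) ** 2 for t, hi, c in zip(theta, h, centre))

def train_on_b(stiffness, steps=5000, lr=0.05):
    """Gradient descent on the task B loss plus a per-parameter quadratic
    penalty 0.5 * stiffness[i] * (theta[i] - THETA_A[i])**2."""
    theta = list(THETA_A)
    for _ in range(steps):
        for i in range(len(theta)):
            grad_b = H_B[i] * (theta[i] - THETA_B[i])
            grad_penalty = stiffness[i] * (theta[i] - THETA_A[i])
            theta[i] -= lr * (grad_b + grad_penalty)
    return theta

results = {
    "task B only": train_on_b((0.0, 0.0)),        # blue arrow
    "uniform penalty": train_on_b((10.0, 10.0)),  # green arrow
    "EWC": train_on_b(H_A),                       # red arrow
}

for name, theta in results.items():
    loss_a = quad_loss(theta, H_A, THETA_A)
    loss_b = quad_loss(theta, H_B, THETA_B)
    print(f"{name}: loss A = {loss_a:.3f}, loss B = {loss_b:.3f}")
```

In this toy setup, training on task B alone ends with a large task A loss, the uniform penalty keeps the task A loss small but leaves the task B loss high, and the curvature-weighted penalty keeps both losses small, because it lets only the parameter that task A barely uses move toward the task B solution.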






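(2) Topics planned for tomorrow's update: dropout, early stopping, a comparison of three gradient descent algorithms, an introduction to MNIST, an introduction to Atari, reinforcement learning, conditional probability, prior probability, posterior probability, the Gaussian distribution, and so on.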
