Paper notes --- Overcoming Catastrophic Forgetting in Neural Networks


Paper notes, updated from time to time.


Overcoming Catastrophic Forgetting in Neural Networks

Published in: PNAS (Proceedings of the National Academy of Sciences), 25 Jan 2017

Authors: the DeepMind team (I won't list the individual authors here)


DeepMind, a subsidiary of Google, is among the strongest teams in applied deep learning.

Now to the paper itself, which I worked through over the three-day Qingming holiday.


Deep neural networks today face a hard, unsolved problem: continual learning. The human brain has a finite number of neurons, so over a lifetime of learning it does not build a fresh network for every new problem; instead it modifies existing combinations of neurons, and that is what makes continual learning possible.

Building on a result from biology (synaptic consolidation), this paper modifies how existing deep neural networks are trained, adding extra terms over the parameters, so that artificial neural networks become better suited to continual learning.

Abstract

The ability to learn tasks in a sequential fashion is crucial to the development of artificial intelligence. Neural networks are not, in general, capable of this and it has been widely thought that catastrophic forgetting is an inevitable feature of connectionist models. We show that it is possible to overcome this limitation and train networks that can maintain expertise on tasks which they have not experienced for a long time. Our approach remembers old tasks by selectively slowing down learning on the weights important for those tasks. We demonstrate our approach is scalable and effective by solving a set of classification tasks based on the MNIST handwritten digit dataset and by learning several Atari 2600 games sequentially.


Several points can be taken from the abstract:

(1) The ability to learn tasks sequentially is crucial to the development of artificial intelligence;

(2) Catastrophic forgetting has been widely thought to be an inevitable feature of connectionist models;

(3) A working definition of sequential learning: after training a model on task A and then on task B, the model should still retain what matters when tested on task A;

(4) The authors propose an algorithm that mitigates catastrophic forgetting;

(5) The algorithm is evaluated on two testbeds: classification tasks based on MNIST, and Atari 2600 games.


Introduction

The Introduction is too long to paste in full, so here are just the useful points:

(1) Why does continual learning break down in artificial neural networks?

Current networks learn sequential tasks by training on task A first and then on task B. The parameters learned for task B are essentially unrelated to those for task A, so once training on task B finishes, the network can no longer produce good results for task A.

(2) What is catastrophic forgetting?

When a network is trained on multiple tasks in sequence, the weights important for earlier tasks are not preserved. This is called catastrophic forgetting.
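A tiny numpy sketch makes the effect concrete. Below, two toy regression tasks prefer conflicting parameter values (the tasks, data sizes, and learning rate are all hypothetical, chosen only for illustration): after plain gradient descent on task B, the loss on task A blows up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two toy regression tasks that prefer different parameter values:
# task A wants w close to [1, 0]; task B wants w close to [0, 1].
def make_task(w_true):
    X = rng.normal(size=(200, 2))
    return X, X @ w_true

def loss(w, X, y):
    return np.mean((X @ w - y) ** 2)

def gd(w, X, y, lr=0.1, steps=200):
    # full-batch gradient descent on the mean-squared-error loss
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

XA, yA = make_task(np.array([1.0, 0.0]))
XB, yB = make_task(np.array([0.0, 1.0]))

w = np.zeros(2)
w = gd(w, XA, yA)                 # train on task A first
loss_A_before = loss(w, XA, yA)   # near zero: task A is solved
w = gd(w, XB, yB)                 # then train on task B, no protection
loss_A_after = loss(w, XA, yA)    # large: task A has been forgotten

print(loss_A_before, loss_A_after)
```

Nothing about task B's training objective refers to task A, so the optimizer freely moves the weights away from task A's solution — exactly the failure mode described above.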


Algorithm design

The algorithm proposed in this paper is called elastic weight consolidation (EWC). Since the algorithm design is the important part, I give a full translation of it as I read (partly paraphrased).

In brains, synaptic consolidation enables continual learning by reducing the plasticity of synapses that are vital to previously learned tasks. We implement an algorithm that performs a similar operation in artificial neural networks by constraining important parameters to stay close to their old values. In this section we explain why we expect to find a solution to a new task in the neighborhood of an older one, how we implement the constraint, and finally how we determine which parameters are important.

Translation: in the brain, synaptic consolidation enables continual learning by reducing the plasticity of the synapses that are vital to previously learned tasks. The authors implement an algorithm that does something similar in artificial neural networks, by constraining important parameters to stay close to their old values. This section answers three questions: why we should expect to find a solution to a new task in the neighborhood of an old one; how the constraint is implemented; and how to determine which parameters are important.

In this work, we demonstrate that task-specific synaptic consolidation offers a novel solution to the continual learning problem for artificial intelligence. We develop an algorithm analogous to synaptic consolidation for artificial neural networks, which we refer to as elastic weight consolidation (EWC for short). This algorithm slows down learning on certain weights based on how important they are to previously seen tasks. We show how EWC can be used in supervised learning and reinforcement learning problems to train several tasks sequentially without forgetting older ones, in marked contrast to previous deep-learning techniques.

Translation: this work shows that task-specific synaptic consolidation offers a novel solution to the continual-learning problem in artificial intelligence. The authors develop an analogous algorithm for artificial neural networks, named elastic weight consolidation (EWC), which slows down learning on certain weights according to how important they were for previously seen tasks. EWC can be used in both supervised and reinforcement learning to train several tasks sequentially without forgetting older ones, in marked contrast to earlier deep-learning techniques.

A deep neural network consists of multiple layers of linear projection followed by element-wise non-linearities. Learning a task consists of adjusting the set of weights and biases θ of the linear projections, to optimize performance. Many configurations of θ will result in the same performance; this is relevant for EWC: over-parameterization makes it likely that there is a solution for task B, θ*_B, that is close to the previously found solution for task A, θ*_A. While learning task B, EWC therefore protects the performance in task A by constraining the parameters to stay in a region of low error for task A centered around θ*_A, as shown schematically in Figure 1. This constraint is implemented as a quadratic penalty, and can therefore be imagined as a spring anchoring the parameters to the previous solution, hence the name elastic. Importantly, the stiffness of this spring should not be the same for all parameters; rather, it should be greater for those parameters that matter most to the performance during task A.

Translation: a deep neural network consists of multiple layers of linear projections followed by element-wise non-linearities, and learning a task means adjusting the weights and biases θ of those projections to optimize performance. Many configurations of θ yield the same performance; this is what EWC relies on: over-parameterization makes it likely that a solution for task B exists close to the one previously found for task A. While learning task B, EWC protects task-A performance by constraining the parameters to a region of low task-A error centered on θ*_A, as sketched in Figure 1. The constraint is a quadratic penalty, so it can be pictured as a spring anchoring the parameters to the previous solution, hence "elastic". Importantly, the spring's stiffness is not the same for all parameters; it is greater for the parameters that matter most to task-A performance.

In order to justify this choice of constraint and to define which weights are most important for a task, it is useful to consider neural network training from a probabilistic perspective. From this point of view, optimizing the parameters is tantamount to finding their most probable values given some data D. We can compute this conditional probability p(θ | D) from the prior probability of the parameters p(θ) and the probability of the data p(D | θ) by using Bayes' rule:

Translation: to justify this choice of constraint and to define which weights matter most for a task, it helps to view network training probabilistically. From that viewpoint, optimizing the parameters amounts to finding their most probable values given some data D; this conditional probability p(θ | D) can be computed from the prior over the parameters p(θ) and the likelihood of the data p(D | θ) via Bayes' rule:

    log p(θ | D) = log p(D | θ) + log p(θ) − log p(D)    (1)

Note that the log probability of the data given the parameters, log p(D | θ), is simply the negative of the loss function for the problem at hand, −L(θ). Assume that the data is split into two independent parts, one defining task A (D_A) and the other task B (D_B). Then we can re-arrange equation (1):

    log p(θ | D) = log p(D_B | θ) + log p(θ | D_A) − log p(D_B)    (2)

Translation: the log probability of the data given the parameters is simply the negative of the loss function for the problem. Assuming the data splits into two independent parts, D_A defining task A and D_B defining task B, the equation above can be rewritten so that the task-A data enters only through the posterior p(θ | D_A).
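The rearrangement is worth spelling out once (this derivation is mine, not quoted from the paper; it assumes D_A and D_B are independent, so that p(D) = p(D_A) p(D_B)):

```latex
\begin{align*}
\log p(\theta \mid D)
  &= \log p(D_A \mid \theta) + \log p(D_B \mid \theta)
     + \log p(\theta) - \log p(D_A) - \log p(D_B) \\
  &= \log p(D_B \mid \theta)
     + \underbrace{\bigl[\log p(D_A \mid \theta) + \log p(\theta) - \log p(D_A)\bigr]}_{%
        =\,\log p(\theta \mid D_A)\ \text{(Bayes' rule for task A alone)}}
     - \log p(D_B)
\end{align*}
```

Everything learned from D_A is thus packed into the single term log p(θ | D_A).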

Note that the left hand side is still describing the posterior probability of the parameters given the entire dataset, while the right hand side only depends on the loss function for task B, log p(D_B | θ).

Translation: the left-hand side still describes the posterior probability of the parameters given the entire dataset, while the right-hand side depends on the task-B data only through the task-B loss function.

All the information about task A must therefore have been absorbed into the posterior distribution p(θ | D_A). This posterior probability must contain information about which parameters were important to task A and is therefore the key to implementing EWC. The true posterior probability is intractable, so, following the work on the Laplace approximation by MacKay, we approximate the posterior as a Gaussian distribution with mean given by the parameters θ*_A and a diagonal precision given by the diagonal of the Fisher information matrix F. F has three key properties: (a) it is equivalent to the second derivative of the loss near a minimum, (b) it can be computed from first-order derivatives alone and is thus easy to calculate even for large models, and (c) it is guaranteed to be positive semi-definite. Note that this approach is similar to expectation propagation where each subtask is seen as a factor of the posterior. Given this approximation, the function L that we minimize in EWC is:

Translation: all information about task A must therefore have been absorbed into the posterior p(θ | D_A), which must encode which parameters were important to task A; this is the key to implementing EWC. The true posterior is intractable, so following MacKay's work on the Laplace approximation, it is approximated by a Gaussian with mean θ*_A and diagonal precision given by the diagonal of the Fisher information matrix F. F has three key properties: (a) it equals the second derivative of the loss near a minimum; (b) it can be computed from first-order derivatives alone, so it is cheap even for large models; (c) it is guaranteed to be positive semi-definite. The method resembles expectation propagation, with each subtask acting as a factor of the posterior. Under this approximation, the function L minimized by EWC becomes:
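The Laplace approximation referred to here can be written out explicitly (my reconstruction, not a quote from the paper): near θ*_A the log-posterior is roughly quadratic, with curvature given by the Fisher diagonal, and this quadratic is exactly what becomes the EWC penalty:

```latex
\log p(\theta \mid D_A) \;\approx\; \text{const}
  \;-\; \frac{1}{2} \sum_i F_i \,\bigl(\theta_i - \theta^*_{A,i}\bigr)^2
```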

    L(θ) = L_B(θ) + Σ_i (λ/2) F_i (θ_i − θ*_{A,i})²
where L_B is the loss for task B only, λ sets how important the old task is compared to the new one, and i labels each parameter.

Translation: L_B is the loss for task B alone, λ sets how important the old task is relative to the new one, and i indexes each parameter.
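To make the objective concrete, here is a minimal numpy sketch of the whole recipe on two conflicting toy regression tasks (all data, λ, and learning rates are hypothetical choices for illustration): train on task A, record θ*_A and a diagonal importance F — taken here from the curvature of the task-A loss, i.e. property (a) above — then train on task B with and without the EWC penalty.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two toy regression tasks with conflicting optimal parameters.
def make_task(w_true):
    X = rng.normal(size=(200, 2))
    return X, X @ w_true

XA, yA = make_task(np.array([1.0, 0.0]))   # task A wants w = [1, 0]
XB, yB = make_task(np.array([0.0, 1.0]))   # task B wants w = [0, 1]

def loss(w, X, y):
    return np.mean((X @ w - y) ** 2)

def grad(w, X, y):
    return 2 * X.T @ (X @ w - y) / len(y)

# 1) Train on task A and freeze the solution w*_A.
w = np.zeros(2)
for _ in range(300):
    w -= 0.1 * grad(w, XA, yA)
w_star_A = w.copy()

# 2) Diagonal importance of each parameter for task A; here the curvature
#    of the quadratic task-A loss (property (a): F ~ second derivative).
F_A = 2 * np.mean(XA ** 2, axis=0)

# 3) Train on task B, plain vs. EWC-penalized:
#    L(w) = L_B(w) + sum_i (lam/2) * F_i * (w_i - w*_A_i)^2
lam = 1.0
w_plain = w_star_A.copy()
w_ewc = w_star_A.copy()
for _ in range(300):
    w_plain -= 0.05 * grad(w_plain, XB, yB)
    g_ewc = grad(w_ewc, XB, yB) + lam * F_A * (w_ewc - w_star_A)
    w_ewc -= 0.05 * g_ewc

print(loss(w_plain, XA, yA))  # large: plain training forgets task A
print(loss(w_ewc, XA, yA))    # much smaller: the penalty kept w near w*_A
```

Because these two toy tasks genuinely conflict, EWC here settles on a compromise rather than a solution good for both; in the over-parameterized networks the paper targets, a low-error region for both tasks usually exists and the penalty steers training into it. λ is the knob trading off the old task against the new one.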

When moving to a third task, task C, EWC will try to keep the network parameters close to the learned parameters of both task A and B. This can be enforced either with two separate penalties, or as one by noting that the sum of two quadratic penalties is itself a quadratic penalty.

Translation: when a third task C arrives, EWC tries to keep the parameters close to the learned parameters of both A and B. This can be enforced either with two separate penalties, or with a single one, since the sum of two quadratic penalties is itself a quadratic penalty.
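The "sum of two quadratic penalties is itself quadratic" claim is easy to check numerically. The sketch below uses one hypothetical parameter with made-up anchors and importances: the two penalties combine into a single penalty whose anchor is the importance-weighted mean, differing from the pair only by a constant that does not affect the gradient.

```python
import numpy as np

# For one parameter w, (F1/2)(w - a)^2 + (F2/2)(w - b)^2 equals
# ((F1 + F2)/2)(w - c)^2 + const, with c the F-weighted mean of a and b.
F1, a = 2.0, 1.0    # importance and anchor from task A (hypothetical)
F2, b = 3.0, -1.0   # importance and anchor from task B (hypothetical)

c = (F1 * a + F2 * b) / (F1 + F2)   # combined anchor
Fc = F1 + F2                        # combined importance

for w in np.linspace(-2, 2, 9):
    two = 0.5 * F1 * (w - a) ** 2 + 0.5 * F2 * (w - b) ** 2
    one = 0.5 * Fc * (w - c) ** 2
    # the difference is the same constant for every w
    print(round(two - one, 6))
```

Since the constant has zero gradient, training with the single combined penalty behaves exactly like training with both penalties at once.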


[Figure 1]


Figure 1: elastic weight consolidation (EWC) ensures task A is remembered whilst training on task B. Training trajectories are illustrated in a schematic parameter space, with parameter regions leading to good performance on task A (gray) and on task B (cream). After learning the first task, the parameters are at θ*_A. If we take gradient steps according to task B alone (blue arrow), we will minimize the loss of task B but destroy what we have learnt for task A. On the other hand, if we constrain each weight with the same coefficient (green arrow) the restriction imposed is too severe and we can only remember task A at the expense of not learning task B. EWC, conversely, finds a solution for task B without incurring a significant loss on task A (red arrow) by explicitly computing how important weights are for task A.

Translation: EWC ensures task A is remembered while training on task B. The training trajectories are drawn in a schematic parameter space, with the regions giving good performance on task A (gray) and on task B (cream). After learning the first task the parameters sit at θ*_A. Taking gradient steps for task B alone (blue arrow) minimizes the task-B loss but destroys what was learned for task A. Constraining every weight with the same coefficient (green arrow) is too severe: task A is remembered only at the cost of never learning task B. EWC (red arrow) instead finds a solution for task B without a significant loss on task A, by explicitly computing how important each weight is for task A.


Algorithm summary:

(1) The design reuses the part of the parameter space that is important for task A while letting the remaining parameters move into task B's solution region. This mapping is framed with conditional probability: using Bayes' rule with the prior and posterior probabilities, the conditional-probability formula is rearranged so that task A survives as a prior on the parameters. (The formulas were hard to typeset, so the original post presented them as a handwritten image.)



Paper summary

(1) The algorithm as given is not ideal; it could be modified and extended into a more complete scheme.

(2) Coming in future updates: dropout, early stopping, a comparison of three gradient-descent variants, an introduction to MNIST and to Atari, reinforcement learning, conditional probability, prior and posterior probabilities, the Gaussian distribution, and more.


