Training Very Deep Networks (Highway Networks): Paper Notes


There is a rumor online that Microsoft's deep residual learning was copied from Highway Networks and is merely a special case of it. Highway Networks was indeed published first.

http://people.idsia.ch/~rupesh/very_deep_learning/

Open-source code is available at the link above.
References:
http://blog.csdn.net/cv_family_z/article/details/50349436
http://blog.csdn.net/l494926429/article/details/51737883

Our Highway Networks take inspiration from Long Short Term Memory (LSTM) and allow training of deep, efficient networks (even with hundreds of layers) with conventional gradient-based methods. Even when large depths are not required, highway layers can be used instead of traditional neural layers to allow the network to adaptively copy or transform representations.

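In other words: Highway Networks are inspired by LSTM and make it possible to train deep networks (even hundreds of layers) with conventional gradient-based methods. Even when great depth is not needed, highway layers can replace traditional layers so that the network adaptively chooses between copying and transforming its representations.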

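As neural networks have developed, they have grown deeper (more layers and smaller receptive fields can improve classification accuracy (Szegedy et al., 2014; Simonyan & Zisserman, 2014)), and training them has become correspondingly harder. Highway Networks is a network framework that addresses the difficulty of training deep networks. The following papers show that optimizing deep neural networks is very hard (noted here because they will be useful when writing papers): (Glorot & Bengio, 2010; Saxe et al., 2013; He et al., 2015), (Simonyan & Zisserman, 2014; Romero et al., 2014), (Szegedy et al., 2014; Lee et al., 2015).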

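Highway Networks: a learnable gating mechanism under which some information flows pass through some layers without attenuation. Networks built this way can be trained with SGD.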

2 Highway Networks
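A plain feedforward neural network typically consists of L layers, and each layer applies a nonlinear transformation to its input, which can be written as

$$y = H(x, W_H)$$

where H is the nonlinear function, W_H its weights, x the input, and y the output.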

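H is usually an affine transform followed by a nonlinear activation function, but it can take other forms, for example convolutional or recurrent.
For a highway network, a layer is defined as

$$y = H(x, W_H) \cdot T(x, W_T) + x \cdot C(x, W_C)$$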

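We refer to T as the transform gate and C as the carry gate.
T and C express how much of the output comes from transforming the input and how much from carrying the input through unchanged, respectively.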

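In this paper C is set to 1 - T, which gives

$$y = H(x, W_H) \cdot T(x, W_T) + x \cdot (1 - T(x, W_T))$$

The terms in this formula must have matching dimensions: x, y, H(x, W_H), and T(x, W_T) must all have the same dimensionality, with zero-padding where one falls short.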
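The layer formula with C = 1 - T can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' released code: the fully connected form, the tanh nonlinearity for H, the weight scales, and all variable names are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_H, b_H, W_T, b_T):
    """Compute y = H(x) * T(x) + x * (1 - T(x)) for a fully connected layer."""
    h = np.tanh(x @ W_H + b_H)     # H(x, W_H): affine map + tanh (assumed nonlinearity)
    t = sigmoid(x @ W_T + b_T)     # T(x, W_T): transform gate, each entry in (0, 1)
    return h * t + x * (1.0 - t)   # carry gate is C = 1 - T

rng = np.random.default_rng(0)
d = 4                                   # x, H(x), and T(x) must share this width
x = rng.normal(size=(2, d))             # synthetic batch of 2 inputs
W_H = rng.normal(size=(d, d)) * 0.1
W_T = rng.normal(size=(d, d)) * 0.1
b_H = np.zeros(d)
b_T = np.full(d, -1.0)                  # negative gate bias, as discussed in section 2.2

y = highway_layer(x, W_H, b_H, W_T, b_T)
print(y.shape)
```

Because T lies strictly between 0 and 1, every output element is a convex combination of the corresponding elements of H(x) and x, which is why all four quantities need the same shape.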

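We observe that for particular values of T:

$$y = \begin{cases} x, & \text{if } T(x, W_T) = \mathbf{0} \\ H(x, W_H), & \text{if } T(x, W_T) = \mathbf{1} \end{cases}$$

Similarly, for the Jacobian of the layer transform:

$$\frac{dy}{dx} = \begin{cases} I, & \text{if } T(x, W_T) = \mathbf{0} \\ H'(x, W_H), & \text{if } T(x, W_T) = \mathbf{1} \end{cases}$$

Thus, depending on the output of the transform gates, a highway layer can smoothly vary its behavior between that of H and that of a layer which simply passes its inputs through.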
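The two limiting cases of the Jacobian can be checked numerically. The sketch below is only an illustration under stated assumptions: H is taken to be tanh of a linear map, the gate is forced to a constant by using a bias alone (equivalent to W_T = 0), and `numerical_jacobian` is a hypothetical helper based on central finite differences.

```python
import numpy as np

def highway(x, W_H, b_T):
    """Highway layer on a single vector x with a constant gate T = sigmoid(b_T)."""
    h = np.tanh(W_H @ x)               # H(x, W_H); tanh is an assumed choice
    t = 1.0 / (1.0 + np.exp(-b_T))     # gate held constant (as if W_T = 0)
    return h * t + x * (1.0 - t)

def numerical_jacobian(f, x, eps=1e-5):
    """Central finite-difference estimate of dy/dx (hypothetical helper)."""
    n = x.size
    J = np.zeros((n, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = eps
        J[:, j] = (f(x + step) - f(x - step)) / (2 * eps)
    return J

rng = np.random.default_rng(0)
n = 4
x = rng.normal(size=n)
W_H = rng.normal(size=(n, n)) * 0.5

# A very negative bias drives T toward 0 (carry); a very positive one drives T toward 1.
J_carry = numerical_jacobian(lambda v: highway(v, W_H, b_T=-50.0), x)
J_transform = numerical_jacobian(lambda v: highway(v, W_H, b_T=50.0), x)

# Analytic Jacobian of H(x) = tanh(W_H x) is diag(1 - h^2) @ W_H.
h = np.tanh(W_H @ x)
J_H = (1.0 - h**2)[:, None] * W_H

print(np.allclose(J_carry, np.eye(n), atol=1e-6))    # carry limit: identity
print(np.allclose(J_transform, J_H, atol=1e-6))      # transform limit: H'(x)
```

Intermediate gate values give a Jacobian between these two extremes, which is the smooth interpolation the quoted sentence describes.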

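In this way, part of the data is transformed and part passes straight through. The final output formula is

$$y = H(x, W_H) \cdot T(x, W_T) + x \cdot (1 - T(x, W_T))$$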

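Convolutional highway layers are built in the same way as fully connected ones: both the H and T transforms use weight sharing and local receptive fields.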

The transform gate is defined as $T(x) = \sigma(W_T^{\top} x + b_T)$, where W_T is the weight matrix and b_T is the bias.

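That is more or less the whole structure. In a very deep network, if every data stream goes through every transformation, both the computational cost and the difficulty of backpropagation increase, so the network needs a way to let some of the data pass through certain layers unprocessed.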

2.1 Constructing Highway Networks
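If the dimensions of x, y, H, and T do not match, they can be made to match, for example by sub-sampling or zero-padding x, or by using a plain layer (without a highway connection) to change the dimensionality.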
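As a small illustration of the zero-padding option, the sketch below widens x so that it can be combined element-wise with a wider H(x) and T(x). The helper name `zero_pad_features` and the choice to pad only the last axis are assumptions made for this example.

```python
import numpy as np

def zero_pad_features(x, out_dim):
    """Pad the last axis of x with zeros up to out_dim (hypothetical helper)."""
    in_dim = x.shape[-1]
    if in_dim > out_dim:
        # Shrinking needs sub-sampling or a plain projection layer instead.
        raise ValueError("x is wider than out_dim; padding cannot shrink it")
    pad = [(0, 0)] * (x.ndim - 1) + [(0, out_dim - in_dim)]
    return np.pad(x, pad, mode="constant", constant_values=0.0)

x = np.ones((2, 3))               # batch of 2 inputs with 3 features
x_hat = zero_pad_features(x, 5)   # now compatible with a 5-wide H(x) and T(x)
print(x_hat.shape)
print(x_hat)
```

The padded positions carry zeros, so in those positions the output is initially driven only by the H(x) * T(x) term.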

2.2 Training Deep Highway Networks
The transform gate is defined as

$$T(x) = \sigma(W_T^{\top} x + b_T)$$

where W_T is the weight matrix and b_T is the bias vector.
This suggests a simple initialization scheme which is independent of the nature of H: b_T can be initialized with a negative value (e.g. -1, -3 etc.) such that the network is initially biased towards carry behavior. This scheme is strongly inspired by the proposal [30] to initially bias the gates in an LSTM network, to help bridge long-term temporal dependencies early in learning.

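At initialization, b_T can be set to a negative value, so that the network initially leans toward carry behavior, that is, passing its input through without any processing. This idea is mainly inspired by reference [30]. The authors' experiments also confirm that this conjecture is correct.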
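To make the effect of the negative bias concrete: the sigmoid gives about 0.27 at -1 and about 0.047 at -3, so at initialization most of each layer's output is carried input. The sketch below checks this on a synthetic batch; the batch, the layer width, and the small-random initialization of W_T are assumptions chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d = 64
x = rng.normal(size=(128, d))            # synthetic batch, for illustration only
W_T = rng.normal(size=(d, d)) * 0.01     # small random gate weights (assumed init)

mean_gate = {}
for b_init in (0.0, -1.0, -3.0):
    b_T = np.full(d, b_init)
    t = sigmoid(x @ W_T + b_T)           # transform gate values at initialization
    mean_gate[b_init] = float(t.mean())
    print(f"b_T = {b_init:+.0f}: mean T = {t.mean():.3f}, "
          f"mean carry 1 - T = {1.0 - t.mean():.3f}")
```

With a zero bias the layer starts out splitting roughly evenly between transform and carry, while more negative biases push it closer to an identity mapping at the start of training.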
