[Deep Learning Paper Notes][Image Classification] Identity Mappings in Deep Residual Networks

He, Kaiming, et al. “Identity mappings in deep residual networks.” arXiv preprint arXiv:1603.05027 (2016). (Citations: 43).


1 Identity Mapping

[Original Residual Unit and Proposed Residual Unit] See Fig. 10.
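To make the proposed unit concrete, here is a minimal PyTorch sketch of a full pre-activation Residual Unit (an illustrative re-implementation, not the authors' code; the class name and channel handling are assumptions, and downsampling shortcuts are omitted):

import torch.nn as nn

class PreActResidualUnit(nn.Module):
    """Proposed unit: BN -> ReLU -> conv -> BN -> ReLU -> conv, then identity addition.
    (The original unit instead uses conv -> BN -> ReLU -> conv -> BN, addition, ReLU.)"""
    def __init__(self, channels):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Residual branch: every activation comes *before* a weight layer.
        out = self.conv1(self.relu(self.bn1(x)))
        out = self.conv2(self.relu(self.bn2(out)))
        # Identity shortcut with nothing applied after the addition,
        # so the signal passes unchanged from unit to unit.
        return x + out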



[Forward Pass]


The feature x_L of any deeper unit L can be represented as the feature x_l of any shallower unit l plus a sum of residual functions, as written out below.
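With x_l denoting the input to the l-th Residual Unit, F the residual function, and W_l its weights, the forward relations (in LaTeX, following the paper's notation) are:

% one Residual Unit with an identity shortcut and identity after-addition mapping
x_{l+1} = x_l + \mathcal{F}(x_l, \mathcal{W}_l)

% unrolled recursively from any shallower unit l to any deeper unit L
x_L = x_l + \sum_{i=l}^{L-1} \mathcal{F}(x_i, \mathcal{W}_i)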



[Backward Pass] The gradient at any shallower unit l can be represented as the gradient at any deeper unit L, carried directly through the identity shortcuts, plus a term that propagates through the weight layers, as written out below.
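Applying the chain rule to the unrolled forward relation, with E denoting the loss:

% the additive "1" carries the gradient of the deeper unit L directly to unit l
\frac{\partial E}{\partial x_l}
  = \frac{\partial E}{\partial x_L} \frac{\partial x_L}{\partial x_l}
  = \frac{\partial E}{\partial x_L}
    \left( 1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} \mathcal{F}(x_i, \mathcal{W}_i) \right)

Because of the additive 1, the gradient from the deeper unit is never blocked, and it is unlikely to vanish: the weight-layer term would have to be exactly -1 for all samples in a mini-batch to cancel it.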



2 Experiment on Skip Connections
Replacing the identity shortcut with gating or a 1 × 1 convolution should, if anything, give stronger representational ability than the identity shortcut; in fact, shortcut-only gating and the 1 × 1 convolution cover the solution space of identity shortcuts (they can in principle learn to act as the identity). However, their training error is higher than that of identity shortcuts, indicating that the degradation of these models is caused by optimization issues rather than by representational ability. A sketch of these shortcut variants is given below.
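For reference, a minimal PyTorch sketch of the three shortcut variants mentioned above; the module and parameter names are illustrative assumptions, the residual branch F is left abstract, and details such as gate bias initialization are omitted:

import torch
import torch.nn as nn

class ShortcutVariants(nn.Module):
    # residual_branch computes F(x), e.g. two 3x3 conv layers with BN/ReLU.
    def __init__(self, channels, residual_branch):
        super().__init__()
        self.residual = residual_branch
        self.conv1x1 = nn.Conv2d(channels, channels, kernel_size=1)  # 1x1 conv shortcut
        self.gate = nn.Conv2d(channels, channels, kernel_size=1)     # g(x) for shortcut-only gating

    def forward(self, x, mode="identity"):
        f = self.residual(x)
        if mode == "identity":
            # Identity shortcut: the easiest to optimize.
            return x + f
        if mode == "conv1x1":
            # 1x1 convolutional shortcut: can represent the identity, but trains worse.
            return self.conv1x1(x) + f
        if mode == "gating":
            # Shortcut-only gating: the shortcut is scaled by 1 - g(x),
            # with g(x) a sigmoid of a 1x1 convolution.
            g = torch.sigmoid(self.gate(x))
            return (1.0 - g) * x + f
        raise ValueError(mode)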


3 On the Usage of Activation Functions
There are various ways of placing the activations (BN and ReLU) relative to the addition; see Fig. 11 and the schematic sketch below.
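The following PyTorch-style schematic summarizes the orderings of the variants discussed below; every variant uses the same components (two weight layers, two BN layers, and two ReLUs), only their order relative to the addition differs. The layer objects (conv1, bn1, ...) are assumed to be defined elsewhere, and strides/downsampling are omitted:

import torch.nn.functional as F

def original(x, conv1, bn1, conv2, bn2):
    out = bn2(conv2(F.relu(bn1(conv1(x)))))
    return F.relu(x + out)                 # ReLU applied after the addition

def bn_after_addition(x, conv1, bn1, conv2, bn2):
    out = conv2(F.relu(bn1(conv1(x))))
    return F.relu(bn2(x + out))            # BN now also acts on the shortcut signal

def relu_before_addition(x, conv1, bn1, conv2, bn2):
    out = F.relu(bn2(conv2(F.relu(bn1(conv1(x))))))
    return x + out                         # residual branch output is non-negative

def relu_only_preactivation(x, conv1, bn1, conv2, bn2):
    out = bn2(conv2(F.relu(bn1(conv1(F.relu(x))))))
    return x + out                         # ReLU moved before the first conv, without a BN

def full_preactivation(x, conv1, bn1, conv2, bn2):
    out = conv2(F.relu(bn2(conv1(F.relu(bn1(x))))))
    return x + out                         # BN and ReLU both precede each conv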


[Original] The signal is truncated by the final ReLU whenever it is negative. The impact of this ReLU is not severe when the ResNet has fewer layers: after some training, the weights are adjusted into a state such that x + F(x) is more frequently above zero, so the ReLU does not truncate it (x is always non-negative due to the previous ReLU, so x + F(x) is below zero only when F(x) is very negative). The truncation, however, is more frequent when there are 1000 layers.


[BN After Addition] The BN layer alters the signal that passes through the shortcut and impedes information propagation, as reflected by the difficulty in reducing the training loss at the beginning of training.


[ReLU Before Addition] This leads to a non-negative output from the transform F, whereas intuitively a “residual” function should take values in R. As a result, the forward-propagated signal is monotonically increasing. This may impact the representational ability, and the result is worse than the baseline.


[ReLU-Only Pre-activation] This ReLU layer is not used in conjunction with a BN layer, and may not enjoy the benefits of BN. 


[Full Pre-activation] It reaches slightly higher training loss at convergence, but produces lower test error. This is presumably caused by BN’s regularization effect. In the original Residual Unit, although BN normalizes the signal, the result is soon added to the shortcut, so the merged signal is not normalized; this unnormalized signal is then used as the input to the next weight layer. In contrast, in the pre-activation version, the inputs to all weight layers have been normalized.
