Normalizing All Layers (II): Back-Propagation
This post consists of notes extracted from http://saban.wang/2016/03/28/Normalizing-All-Layers%EF%BC%9A-Back-Propagation/
It mainly covers how to normalize layers during back-propagation.
Introduction
In the last post, we discussed how to make all neurons of a neural network have a standard Gaussian distribution. However, as the Conclusion section noted, we had not considered the back-propagation procedure. In fact, when we talk about the gradient vanishing or exploding problem, we usually refer to the gradient flow in the back-propagation procedure. Because of this, the correct way seems to be normalizing the backward gradients of the neurons instead of the forward values.
In this post, we will discuss how to normalize all the gradients using a philosophy similar to the last post's: for a given top gradient dy∼N(0,I), normalize the layer so that dx is expected to have zero mean and unit standard deviation.
Parametric Layer
Consider the back-propagation formulation of the Convolution and InnerProduct layers, dx = Wᵀdy.
We will get a strategy similar to the forward case: normalize each row of Wᵀ (that is, each column of W) to lie on the unit hypersphere, so that dx keeps unit standard deviation when dy∼N(0,I).
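To make this concrete, below is a minimal NumPy sketch (my own illustration, not code from the original post) that checks the claim by simulation: when every column of W has unit L2 norm, the backward gradient dx = Wᵀdy of an InnerProduct layer keeps roughly unit standard deviation for dy∼N(0,I). The layer sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, n_samples = 256, 512, 10_000   # arbitrary layer sizes

# Weight of an InnerProduct layer: forward y = W x, backward dx = W^T dy.
W = rng.standard_normal((n_out, n_in))

# Normalize every column of W (every row of W^T) to unit L2 norm.
W /= np.linalg.norm(W, axis=0, keepdims=True)

# Simulate the backward pass with dy ~ N(0, I).
dy = rng.standard_normal((n_samples, n_out))
dx = dy @ W                 # each row equals W^T dy for one sample

print(dx.mean(), dx.std())  # ~0 and ~1
```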
Activation Layers
One problem that cannot be avoided when deriving the formulations for activations is that we must assume not only the distribution of the gradients but also that of the forward input, because the gradient of an activation usually depends on its input. Here we assume that both the input x and the gradient dy follow the standard Gaussian distribution, independently of each other.
ReLU
Its backward gradient can be easily obtained: dx = dy when x > 0, and dx = 0 otherwise.
When x∼N(0,1) and dy∼N(0,1) are independent, the gradient is passed with probability 1/2, so dx has zero mean and a standard deviation of 1/√2 ≈ 0.707.
Here the question comes: now we have two different standard deviations, one for the forward values and one for the backward gradients. Which one should be used to normalize the ReLU layer? My tendency is to use the backward one, since controlling the gradient flow is the goal of this post.
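A quick Monte-Carlo check of this number (my own sketch, not from the original post), assuming x and dy are independent standard normal variables:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

x  = rng.standard_normal(n)   # forward input of the ReLU
dy = rng.standard_normal(n)   # gradient coming from the layer above

# ReLU backward rule: the gradient passes only where x > 0.
dx = dy * (x > 0)

print(dx.mean(), dx.std())    # ~0 and ~1/sqrt(2) ~= 0.707
```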
Sigmoid
The backward gradient of the Sigmoid activation is dx = dy·σ(x)(1 − σ(x)), where σ denotes the Sigmoid function.
From simulation, we can get its standard deviation, roughly 0.21 when both x and dy are standard normal; since σ(x)(1 − σ(x)) ≤ 1/4, the Sigmoid shrinks the gradient considerably.
The same as with ReLU, we should still subtract the mean of the forward output, and my tendency is again to scale the layer with the backward standard deviation.
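The same kind of Monte-Carlo estimate works for the Sigmoid (again my own sketch, under the same independence assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

x  = rng.standard_normal(n)
dy = rng.standard_normal(n)

sig = 1.0 / (1.0 + np.exp(-x))

# Sigmoid backward rule: dx = dy * sigmoid(x) * (1 - sigmoid(x)).
dx = dy * sig * (1.0 - sig)

print(dx.std())   # ~0.21, far below 1: the Sigmoid shrinks the gradient
```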
Pooling Layer
- Average Pooling: each input receives dy/k² from its (non-overlapping) k×k window, so for a 3x3 kernel the std of the backward gradient is 1/9, and 1/4 for 2x2.
- Max Pooling: only the maximal input of the window receives dy, so for a 3x3 kernel the std is 1/3, and 1/2 for 2x2.
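A simulation sketch of both pooling cases (my own illustration, assuming non-overlapping windows and i.i.d. standard normal inputs and top gradients):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200_000, 3              # number of pooling windows, kernel size k x k

dy = rng.standard_normal(n)    # one top gradient per (non-overlapping) window

# Average pooling backward: every input of a window receives dy / k^2.
dx_avg = np.repeat(dy / k**2, k * k)

# Max pooling backward: only the argmax input of a window receives dy.
x = rng.standard_normal((n, k * k))
mask = x == x.max(axis=1, keepdims=True)
dx_max = (mask * dy[:, None]).ravel()

print(dx_avg.std())   # ~1/9 for 3x3 (would be ~1/4 for 2x2)
print(dx_max.std())   # ~1/3 for 3x3 (would be ~1/2 for 2x2)
```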
Dropout Layer
The backward formula of the Dropout layer is almost the same as the forward one: we should still divide the preserved values by the square root of the keep probability, √p, so that the gradient keeps unit standard deviation.
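A small sketch of this scaling (my own illustration; p denotes the keep probability of Dropout):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 1_000_000, 0.5          # p: probability of keeping a unit

dy   = rng.standard_normal(n)
mask = rng.random(n) < p       # the same mask that was used in the forward pass

# Divide the surviving gradients by sqrt(p) to keep unit standard deviation.
dx = dy * mask / np.sqrt(p)

print(dx.std())   # ~1, regardless of p
```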
Conclusion
In this post, we have discussed the normalization strategy that serves the gradient flow of the backward propagation. The standard deviations of the backward gradients for the common layers of a modern CNN are recorded here. However, when we use the std of the backward gradients, the forward value scale is not controlled well. Inhomogeneous activations, such as Sigmoid and tanh, are not suitable for this method because their input range may not cover a sufficiently non-linear part of the activation.
So maybe a good choice is to use separate scaling methods for the forward and backward propagation? This idea conflicts with the back-propagation algorithm, so we should still examine it carefully through experiments.