[Paper Note] Batch Normalization (incomplete)
Paper: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Notation:
Internal covariate shift: the inputs of each layer change during training, as the parameters of the preceding layers change.
This slows down training and requires lower learning rates and careful parameter initialization.
Covariate shift: the input distribution to a learning system changes.
- It also makes it notoriously hard to train models with saturating nonlinearities, since activations drift into the saturated regime of e.g. the sigmoid function.
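As a quick numeric illustration (my own, not from the paper) of why the saturated regime is a problem: the sigmoid's derivative s(x)(1 − s(x)) is largest at 0 and collapses in the tails, so gradients through saturated units nearly vanish.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: s(x) * (1 - s(x)).
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))   # 0.25 (maximum)
print(sigmoid_grad(5.0))   # ~0.0066
print(sigmoid_grad(10.0))  # ~4.5e-05, effectively no gradient
```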
Problem:
- Internal covariate shift
- Layers need to continuously adapt to the new distributions.
BN effects
- Reduces the internal covariate shift problem, which dramatically accelerates training.
- Benefits gradient flow through the network by reducing the dependence of gradients on the scale of the parameters and on their initial values.
- Regularizes the model and reduces the need for Dropout.
- Makes it possible to use saturating nonlinearities by preventing the network from getting stuck in the saturated regime.
- Matches the performance of the previous best ImageNet model using only 7% of the training steps.
Main points
- BN is a transformation applied to layer inputs x so as to normalize their distribution over each mini-batch.
- (LeCun et al., 1998b; Wiesler & Ney, 2011) showed that networks converge faster if their inputs are whitened, i.e. linearly transformed to have zero mean and unit variance, and decorrelated.
But full whitening is costly and not everywhere differentiable, so we make two simplifications:
- Normalize each scalar feature independently, rather than whitening layer inputs and outputs jointly. This speeds up convergence even when the features are not decorrelated. Note that simply normalizing each input of a layer may change what the layer can represent. To address this, we make sure the transformation inserted in the network can represent the identity transform, by introducing a pair of learnable parameters γ(k), β(k) per activation that scale and shift the normalized value:

x̂(k) = (x(k) − E[x(k)]) / √Var[x(k)],    y(k) = γ(k) x̂(k) + β(k)
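The per-feature transform above can be sketched in NumPy (a minimal illustration, not the paper's code; the function name `batch_norm_forward` and the epsilon value are my own choices):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift.

    x: (N, D) mini-batch; gamma, beta: (D,) learnable parameters.
    """
    mu = x.mean(axis=0)                    # per-feature mini-batch mean
    var = x.var(axis=0)                    # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized activations
    return gamma * x_hat + beta            # y = gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 4))
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0))  # ~0 per feature
print(y.std(axis=0))   # ~1 per feature
```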
How do we make sure the network can still represent the identity? Setting γ(k) = √Var[x(k)] and β(k) = E[x(k)] recovers the original activations, so the learnable pair preserves the network's representational power.
- Since we use SGD, the normalization can only be applied to a mini-batch of data, not to the whole training set. So in our case, each mini-batch produces estimates of the mean and variance of each activation.
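A quick numerical check (my own sketch, not the paper's code) that the learnable pair can represent the identity: choosing γ = √(Var[x] + ε) and β = E[x] exactly inverts the normalization.

```python
import numpy as np

def bn(x, gamma, beta, eps=1e-5):
    # Per-feature normalization over the mini-batch, then scale and shift.
    mu, var = x.mean(axis=0), x.var(axis=0)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(1)
x = rng.normal(5.0, 3.0, size=(32, 2))

# Choosing gamma/beta from the batch statistics undoes the normalization.
gamma = np.sqrt(x.var(axis=0) + 1e-5)
beta = x.mean(axis=0)
y = bn(x, gamma, beta)
print(np.allclose(y, x))  # True: the identity transform is recovered
```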
Let x be a layer input, treated as a vector, and X the set of these inputs over the training dataset. The normalization can then be written as a transformation x̂ = Norm(x, X),
which depends not only on x itself but on all training examples, each of which depends on the model parameters if x is produced by an earlier layer.
In a typical neural network, each layer's output can be viewed as an n-dimensional vector, where n is the number of neurons. As the derivation above shows, BN operates on individual neurons: we may apply BN to all neurons or only to a subset of them. Each normalized neuron gets its own learnable parameter pair.
In a CNN, however, if every neuron had its own learnable parameter pair, the number of parameters would grow dramatically: suppose the input layer has size
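A back-of-the-envelope count (with hypothetical sizes of my choosing) showing why BN for conv layers shares one (γ, β) pair per feature map rather than per activation:

```python
# Hypothetical conv feature map: C channels of spatial size H x W.
C, H, W = 64, 28, 28

per_activation_params = 2 * C * H * W  # one (gamma, beta) pair per neuron
per_channel_params = 2 * C             # one pair per feature map, as BN does for conv layers

print(per_activation_params)  # 100352
print(per_channel_params)     # 128
```

Sharing per channel also matches the convolutional property that all spatial locations of a feature map are normalized in the same way.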