BN and Caffe BN


Caffe Scale layer (a NumPy sketch of what it computes follows the two prototxt examples below):
http://stackoverflow.com/questions/37410996/scale-layer-in-caffe

layer {
  bottom: "res2b_branch2b"
  top: "res2b_branch2b"
  name: "scale2b_branch2b"
  type: "Scale"
  scale_param { bias_term: true }
}

layer {
  name: "scaleToUnitInt"
  type: "Scale"
  bottom: "bot"
  top: "scaled"
  param {
    lr_mult: 0
    decay_mult: 0
  }
  param {
    lr_mult: 0
  }
  scale_param {
    filler { value: 0.5 }
    bias_term: true
    bias_filler { value: -2 }
  }
}
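
To make explicit what the Scale layer computes, here is a minimal NumPy sketch of its per-channel affine transform with the default axis (channels of an NCHW blob): y = scale[c] * x + bias[c] when bias_term is true. The names scale_forward, gamma, and beta are illustrative, not Caffe identifiers.

import numpy as np

def scale_forward(x, gamma, beta=None):
    # x: activations of shape (N, C, H, W); gamma, beta: per-channel (C,)
    # Mirrors Caffe's Scale layer with axis=1: y = gamma[c] * x + beta[c]
    y = x * gamma.reshape(1, -1, 1, 1)
    if beta is not None:                    # corresponds to bias_term: true
        y = y + beta.reshape(1, -1, 1, 1)
    return y

# Mimic the "scaleToUnitInt" example above: scale fixed at 0.5, bias at -2;
# since lr_mult/decay_mult are 0 in the prototxt, these values never change.
x = np.random.randn(2, 3, 8, 8)
y = scale_forward(x, gamma=np.full(3, 0.5), beta=np.full(3, -2.0))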

A Quora answer on some properties of BN:
https://www.quora.com/Why-does-batch-normalization-help
Batch Normalization addresses this problem under some additional assumptions. The following are the properties of Batch Normalization in its mini-batch mean-and-variance form:


  1. Learning faster: the learning rate can be increased compared to the non-batch-normalized version.
  2. Increased accuracy: flexibility in the mean and variance of every dimension in every hidden layer provides better learning, and hence better accuracy.
  3. Normalization (whitening) of the inputs to each layer: zero mean and unit variance, though not necessarily decorrelated.
  4. Removal of the ill effect of internal covariate shift: successive transformations can make the data too big or too small and drift each layer's input distribution away from normalized.
  5. Not stuck in the saturation regime: the nonlinearities do not saturate, even when ReLU is not used.
  6. Whitening is integrated into the gradient-descent optimization: whitening that is decoupled from the training steps and modifies the network directly undoes part of the optimization effort; the model can even blow up when the normalization parameters are computed outside the gradient-descent step.
  7. Full whitening within gradient descent would require the inverse square root of the covariance matrix, as well as its derivatives for backpropagation.
  8. Normalization of individual dimensions: each dimension of a hidden layer is normalized independently rather than via the joint covariance, so the features are not decorrelated.
  9. Normalization per mini-batch: the mean and variance are estimated over each mini-batch rather than over the entire training set.
  10. The joint covariance is ignored, since it would be singular given the small number of training samples per mini-batch compared to the high dimensionality of the hidden layer.
  11. Learning of a scale and shift for every dimension: the scaled and shifted values are passed to the next layer, while the mean and variance are computed after collecting all mini-batch activations of the current layer; so the forward pass of all samples in the mini-batch proceeds layer by layer, and backpropagation provides gradients for the weights as well as for the scale (variance) and shift (mean).
  12. Inference: at inference time, the moving-average mean and variance accumulated during mini-batch training are used (see the sketch after this list).
  13. Convolutional neural networks: whitening of intermediate layers, before or after the nonlinearity, opens up many new avenues of investigation [11-15].
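
The mini-batch computation described in points 8-12 can be summarized in a short NumPy sketch. This is a minimal illustration for a fully connected layer's (N, D) activations, assuming exponential moving averages for the inference statistics; names such as batchnorm_forward, gamma, beta, and running_mean are illustrative, not tied to any particular framework.

import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      training=True, momentum=0.9, eps=1e-5):
    # x: (N, D) activations of one hidden layer; each dimension is
    # normalized independently (no joint covariance is estimated).
    if training:
        mu = x.mean(axis=0)            # per-dimension mini-batch mean
        var = x.var(axis=0)            # per-dimension mini-batch variance
        # moving averages that will be used later at inference time
        running_mean[:] = momentum * running_mean + (1 - momentum) * mu
        running_var[:] = momentum * running_var + (1 - momentum) * var
    else:
        mu, var = running_mean, running_var   # inference: stored averages
    x_hat = (x - mu) / np.sqrt(var + eps)     # normalize each dimension
    return gamma * x_hat + beta               # learned scale and shift

# Usage: a mini-batch of 8 samples, hidden layer of width 4.
D = 4
gamma, beta = np.ones(D), np.zeros(D)
running_mean, running_var = np.zeros(D), np.ones(D)
out = batchnorm_forward(np.random.randn(8, D), gamma, beta,
                        running_mean, running_var, training=True)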

[1] What is buckling of a column?
[2] Why does buckling occur in columns?
[3] What is buckling?
[4] What is meant by buckling in engineering words?
[5] How does buckling analysis work?
[6] What is the difference between crippling and buckling?
[7] What is the difference between crushing and buckling failures of a column?
[8] What is the cylindrical buckling?
[9] What is difference between buckling and bending?
[10] batch normalization
[11] How do I apply Batch Normalization to the convolutional layer of a CNN?
[12] How does batch normalization behave differently at training time and test time?
[13] How does a person choose the best size of mini-batch in the test when the model is using batch normalization?
[14] How does a person choose the best size of mini-batch in the test when the model is using batch normalization?
[15] What is local response normalization?

A blog post introducing BN:
https://standardfrancis.wordpress.com/2015/04/16/batch-normalization/
