Understanding Batch Normalization

Source: the Internet. Editor: 程序博客网. Time: 2024/06/12 00:04

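For a long time I only half understood batch normalization. On the road of a machine learning and deep learning "alchemist", half understanding something is dangerous. I still have not studied the original Google paper carefully, but I did read a very good blog post on the topic. Rather than ramble or reinvent the wheel, here is the link: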
https://r2rt.com/implementing-batch-normalization-in-tensorflow.html

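One misconception I held for a long time: I used to think batch normalization normalizes across the neurons of a hidden layer (say the layer has ten neurons; take those ten activations, subtract their mean and divide by their standard deviation). That is not what happens. Batch normalization works on each mini-batch of input data, along the batch direction (axis 0 of the tensor in TensorFlow), and normalizes each feature separately, which is exactly what its name suggests.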
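To make the direction of the normalization concrete, here is a minimal sketch in plain Python. The function name `batch_normalize`, the `eps` value and the toy data are my own assumptions for illustration, not taken from the linked post. Each column (feature) is normalized using statistics computed over the rows (the samples in the mini-batch):

```python
import math

def batch_normalize(batch, eps=1e-5):
    """Normalize each feature (column) across the batch (rows)."""
    n = len(batch)
    columns = list(zip(*batch))  # one tuple per feature
    means = [sum(col) / n for col in columns]
    variances = [
        sum((x - m) ** 2 for x in col) / n
        for col, m in zip(columns, means)
    ]
    # Every sample uses the per-feature statistics of the whole batch.
    return [
        [(x - m) / math.sqrt(v + eps) for x, m, v in zip(row, means, variances)]
        for row in batch
    ]

# 4 samples (rows) x 2 features (columns) with very different scales.
batch = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]]
normalized = batch_normalize(batch)

for j in range(2):
    col = [row[j] for row in normalized]
    print(f"feature {j}: mean over the batch = {sum(col) / len(col):.3f}")
```

After the call, each feature has roughly zero mean and unit variance over the batch, regardless of its original scale. Nothing is computed across the features of a single sample.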

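The purpose of batch normalization is to reduce the following effect during training: because the network parameters (weights) are updated continually, changes in the previous layer's parameters make the input distribution of the current layer keep shifting (here "distribution" should be read as the distribution of the batch input received by a single neuron), which makes the weights harder to learn. The paper calls this internal covariate shift.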

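However, forcibly normalizing the mini-batch feature map received by each layer along the batch direction (which can be viewed as a form of regularization) reduces the capacity of the whole network. So after normalization, two more parameters, α and β (written γ and β in the original paper), are learned to adaptively scale and shift the result.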
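A small sketch of why the scale and shift restore capacity, again in plain Python with hypothetical names (`batch_norm_feature`, `alpha`, `beta`) and toy values of my own choosing. If training happened to drive α to the batch standard deviation and β to the batch mean, the layer would reproduce its input, so the normalized layer can still represent the identity:

```python
import math

def batch_norm_feature(values, alpha=1.0, beta=0.0, eps=1e-5):
    """Batch-normalize one feature over a mini-batch, then scale and shift."""
    n = len(values)
    mean = sum(values) / n
    var = sum((x - mean) ** 2 for x in values) / n
    normalized = [(x - mean) / math.sqrt(var + eps) for x in values]
    return [alpha * x_hat + beta for x_hat in normalized]

values = [1.0, 2.0, 3.0, 4.0]  # one feature across a mini-batch of 4

# Default alpha=1, beta=0: plain normalization.
plain = batch_norm_feature(values)

# Choosing alpha = batch std (with the same eps) and beta = batch mean
# undoes the normalization and recovers the original values.
mean = sum(values) / len(values)
std = math.sqrt(sum((x - mean) ** 2 for x in values) / len(values) + 1e-5)
restored = batch_norm_feature(values, alpha=std, beta=mean)
print(restored)
```

In a real network α and β are learned by gradient descent like any other weight; the hand-picked values here only show that the original representation stays reachable.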

Batch Normalization in TensorFlow

Recommended: https://github.com/martin-gorner/tensorflow-mnist-tutorial/blob/master/README_BATCHNORM.md

In TensorFlow I used output = tf.contrib.layers.batch_norm()
and then used tf.control_dependencies() to make the EMA (exponential moving average) updates inside batch_norm a dependency of train_op:

update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = optimizer.minimize(loss)

BN uses an EMA to estimate the population mean/variance from the batch mean/variance, and this is where the decay parameter matters. From the formula:

decay = 0.999 # use numbers closer to 1 if you have more data
train_mean = tf.assign(pop_mean, pop_mean * decay + batch_mean * (1 - decay))
train_var = tf.assign(pop_var, pop_var * decay + batch_var * (1 - decay))

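decay controls how quickly the current estimate is forgotten: the smaller decay is, the faster the old estimate fades (exponentially) and the more weight the most recent batch means receive, so the estimate warms up faster.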
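The effect of decay is easy to see in a toy simulation. This is a plain-Python sketch of the same update rule as above, not TensorFlow code; the helper `ema_track`, the zero initial value and the constant batch mean of 5.0 are assumptions made for illustration:

```python
def ema_track(batch_means, decay, init=0.0):
    """Return the running estimate after each moving-average update."""
    estimate = init
    history = []
    for batch_mean in batch_means:
        # Same rule as pop_mean * decay + batch_mean * (1 - decay).
        estimate = estimate * decay + batch_mean * (1 - decay)
        history.append(estimate)
    return history

# Hypothetical setup: the true mean is 5.0 and every batch reports it exactly.
batch_means = [5.0] * 500

for decay in (0.999, 0.99, 0.9):
    final = ema_track(batch_means, decay)[-1]
    print(f"decay={decay}: estimate after 500 steps = {final:.3f}")
```

With these assumptions, after 500 updates the estimate for decay=0.999 is still well below 5.0, while decay=0.99 and decay=0.9 are already very close to it.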

Use a smaller decay value will accelerate the warm-up phase. The default decay is 0.999, for small datasets such like MNIST, you can choose 0.99 or 0.95, and it warms up in a short time.

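However, this assumes that training itself is reliable. If training has already gone off track (the loss is large), warming up early will not help! As the TF documentation puts it: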

Lower decay value (recommend trying decay=0.9) if model experiences reasonably good training performance but poor validation and/or test performance. Try zero_debias_moving_mean=True for improved stability.

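Also, a network with BN layers uses the estimated population mean and variance at prediction time. If you rush to validate or predict after only a few iterations, the EMA estimates of the population mean/variance may not yet be accurate and stable, which leads to a large gap between training and prediction results. That is the cause of the following issue: https://github.com/tensorflow/tensorflow/issues/7469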
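A rough way to judge how many iterations are "too few" is to look at the weight the initial value still carries after n updates, which is decay**n under the EMA rule. The sketch below is my own back-of-the-envelope helper, not part of TensorFlow; `steps_to_warm_up` and the 1% threshold are assumptions:

```python
import math

def steps_to_warm_up(decay, residual=0.01):
    """Smallest n such that decay**n, the weight left on the initial value, is <= residual."""
    return math.ceil(math.log(residual) / math.log(decay))

for decay in (0.999, 0.99, 0.9):
    print(f"decay={decay}: about {steps_to_warm_up(decay)} updates")
```

Under this criterion, decay=0.999 needs thousands of updates before the initial value stops mattering, while decay=0.9 needs only a few dozen. Evaluating before that point means predicting with statistics that still lean on the initialization.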

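In fact, the earlier quote, "smaller decay value will accelerate the warm-up phase", is what the author of that issue wrote after solving the problem. The fix is this: when training results are far better than prediction results, reduce decay to warm up sooner, so that the population estimates absorb the more recent batch mean/variance. This brings us back to the passage from the TF documentation quoted above.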

References:
https://github.com/tensorflow/tensorflow/issues/7469
https://github.com/tensorflow/tensorflow/issues/1122#issuecomment-280325584
https://github.com/soloice/mnist-bn
