Deep learning: prevent overfitting && speed up training


Prevent overfitting:

1. Add L1/L2 regularization terms to the loss function

2. Dropout

3. Data augmentation

4. Early stopping

All four have corresponding implementations in Keras.
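For example, here is a minimal tf.keras sketch (assuming TensorFlow 2.x) that combines the four techniques in one model; the layer sizes, augmentation ops, and hyperparameters are illustrative choices of mine, not values from this post.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

model = tf.keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    # 3. Data augmentation as preprocessing layers (active only during training)
    layers.RandomRotation(0.1),
    layers.RandomZoom(0.1),
    layers.Conv2D(32, 3, activation="relu"),
    layers.Flatten(),
    # 1. L2 penalty added to the loss through kernel_regularizer
    layers.Dense(128, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),
    # 2. Dropout
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# 4. Early stopping: halt training when validation loss stops improving
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

# model.fit(x_train, y_train, validation_split=0.1,
#           epochs=100, callbacks=[early_stop])
```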


Speed up training:

1. Normalization (zero-center the inputs + normalize the variance). Why normalize? If the input features differ greatly in scale, the loss contours become elongated (a narrow valley rather than a round bowl). That forces a very small learning rate (a larger step jumps right out of the region around the optimum), and gradient descent ends up oscillating back and forth around the optimum instead of converging directly.
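As a minimal sketch (my own example, with hypothetical NumPy arrays standing in for real data): compute the mean and standard deviation on the training set, then apply the same statistics to both training and test data.

```python
import numpy as np

# Hypothetical matrices of shape (num_examples, num_features)
x_train = np.random.rand(1000, 20) * 100.0   # features on very different scales
x_test = np.random.rand(200, 20) * 100.0

# Zero-center, then scale to unit variance, using training-set statistics only
mean = x_train.mean(axis=0)
std = x_train.std(axis=0) + 1e-8             # epsilon guards against zero variance

x_train_norm = (x_train - mean) / std
x_test_norm = (x_test - mean) / std          # reuse the same statistics at test time
```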



Unintuitive effects and their consequences. Notice that if one of the inputs to the multiply gate is very small and the other is very big, then the multiply gate will do something slightly unintuitive: it will assign a relatively huge gradient to the small input and a tiny gradient to the large input. Note that in linear classifiers where the weights are dot producted w^T x_i (multiplied) with the inputs, this implies that the scale of the data has an effect on the magnitude of the gradient for the weights. For example, if you multiplied all input data examples x_i by 1000 during preprocessing, then the gradient on the weights will be 1000 times larger, and you'd have to lower the learning rate by that factor to compensate. This is why preprocessing matters a lot, sometimes in subtle ways! And having intuitive understanding for how the gradients flow can help you debug some of these cases.

http://cs231n.github.io/optimization-2/
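A quick numeric check of the quoted point (an illustrative toy of mine, not from the post): for a linear score s = w·x, the gradient of s with respect to w is just x, so scaling every input by 1000 scales the weight gradient by 1000.

```python
import numpy as np

x = np.array([0.5, -1.2, 3.0])
w = np.array([0.1, 0.2, -0.3])

grad_w = x                  # d(w·x)/dw = x
grad_w_scaled = 1000.0 * x  # same gradient after preprocessing x -> 1000 * x

print(grad_w)               # [ 0.5 -1.2  3. ]
print(grad_w_scaled)        # [  500. -1200.  3000.]
```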


2. Random weight initialization: this is motivated by a problem in training deep models. Because a deep model has many hidden layers, the chain-rule expression for the gradient is a product of many terms; if each factor is larger than 1 the product blows up (exploding gradients), and if each factor is smaller than 1 it shrinks toward zero (vanishing gradients).

Set the initial weights to zero mean with variance 1/n (where n is the input dimension of that layer). This keeps W*X in a reasonable range, and since the gradients depend on W*X, they are constrained accordingly.

In general, different activation functions call for different variances (e.g. 1/n for tanh, the Xavier/Glorot scheme, and 2/n for ReLU, the He scheme); see the sketch below.
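As a rough illustration (my own sketch, not code from the post), the following NumPy function draws zero-mean Gaussian weights with a variance chosen from the layer's input dimension; in Keras the same idea is available through built-in initializers such as `glorot_normal` and `he_normal`.

```python
import numpy as np

def init_weights(n_in, n_out, activation="tanh"):
    """Zero-mean Gaussian weights with variance scaled by the input dimension."""
    # Variance 1/n_in suits tanh (Xavier/Glorot); ReLU prefers 2/n_in (He).
    var = 2.0 / n_in if activation == "relu" else 1.0 / n_in
    return np.random.randn(n_out, n_in) * np.sqrt(var)

W1 = init_weights(784, 256, activation="relu")
print(W1.mean(), W1.var())   # close to 0 and 2/784
```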
