learning_rate&weight_decay&momentum


http://blog.csdn.net/u010025211/article/details/50055815 — Understanding learning rate and weight decay in Caffe

weight decay

In machine learning and pattern recognition, overfitting can occur, and as a network gradually overfits its weights tend to grow. To avoid overfitting, a penalty term is therefore added to the error function; a common choice is the sum of the squares of all the weights multiplied by a decay constant, which penalizes large weights.
The weight-decay penalty drives the weights toward small absolute values while penalizing large ones, because large weights make the system overfit and degrade its generalization performance.

  • The weight_decay parameter governs the regularization term of the neural net.

  • During training, a regularization term is added to the network’s loss to compute the backprop gradient. The weight_decay value determines how dominant this regularization term will be in the gradient computation.

  • As a rule of thumb, the more training examples you have, the weaker this term should be; the more parameters you have (i.e., a deeper net, larger filters, larger InnerProduct layers, etc.), the higher this term should be.

  • Caffe also allows you to choose between L2 regularization (default) and L1 regularization, by setting regularization_type: "L1".

  • While learning rate may (and usually does) change during training, the regularization weight is fixed throughout.
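The bullets above can be made concrete with a small sketch (the helper names are mine, not Caffe's — Caffe applies this inside its C++ solver): the decay term simply adds a multiple of each weight (or of its sign, for L1) to the data gradient before the update.

```python
import numpy as np

def l2_decayed_grad(w, data_grad, weight_decay=5e-4):
    """Gradient of loss + (weight_decay/2)*||w||^2: the data gradient
    plus a term that pulls each weight toward zero."""
    return data_grad + weight_decay * w

def l1_decayed_grad(w, data_grad, weight_decay=5e-4):
    """L1 variant: the penalty contributes weight_decay * sign(w)."""
    return data_grad + weight_decay * np.sign(w)

w = np.array([2.0, -3.0])
g = np.zeros_like(w)          # pretend the data loss is flat here
decay_only = l2_decayed_grad(w, g)   # proportional to w itself
```

Note how, with a flat data loss, the remaining gradient is proportional to the weight itself, which is exactly the "penalize large weights" behavior described above.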

momentum

Stochastic gradient descent ( solver_type: SGD ) updates the weights W by a linear combination of the negative gradient ∇L(W) and the previous weight update Vt.
The learning rate α is the weight of the negative gradient. The momentum μ is the weight of the previous update.
Formally, we have the following formulas to compute the update value Vt+1 and the updated weights Wt+1 at iteration t+1, given the previous weight update Vt and current weights Wt:

Vt+1 = μVt − α∇L(Wt)
Wt+1 = Wt + Vt+1

The learning “hyperparameters” ( α and μ ) might require a bit of tuning for best results. If you’re not sure where to start, take a look at the “Rules of thumb” below, and for further information you might refer to Leon Bottou’s Stochastic Gradient Descent Tricks (L. Bottou. Stochastic Gradient Descent Tricks. Neural Networks: Tricks of the Trade:Springer, 2012).
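The two update formulas above can be checked with a tiny NumPy sketch (the function name and the toy quadratic loss are mine): on f(W) = ½‖W‖², whose gradient is W itself, the momentum-smoothed iterates spiral in toward the minimum at 0.

```python
import numpy as np

def sgd_momentum_step(W, V, grad, lr=0.01, momentum=0.9):
    """One SGD step per the formulas above:
    V_{t+1} = mu*V_t - alpha*grad(W_t);  W_{t+1} = W_t + V_{t+1}."""
    V_next = momentum * V - lr * grad
    W_next = W + V_next
    return W_next, V_next

# Minimize f(W) = 0.5*||W||^2, whose gradient is simply W.
W = np.array([1.0, -2.0])
V = np.zeros_like(W)
for _ in range(200):
    W, V = sgd_momentum_step(W, V, grad=W)
# After 200 iterations W has contracted close to the minimum at 0.
```

With μ = 0.9 the effective step direction is an exponentially weighted average of past gradients, which is why the iterates are smoother than plain SGD.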

Rules of thumb for setting the learning rate α and momentum μ

https://www.zhihu.com/question/24529483/answer/114711446 — What do weight decay, momentum, and normalization do in a neural network? (answer by 陈永志)
Momentum is a common acceleration technique for gradient descent. Plain SGD updates x ← x − α·dx, moving x along the negative gradient. SGD with a momentum term instead takes the following form:

v = βv − α·dx
x = x + v

where β is the momentum coefficient. Intuitively, if the previous momentum term (i.e., v) points in the same direction as the current negative gradient, the step taken this time is enlarged, which is how this scheme accelerates convergence.
https://www.zhihu.com/question/24529483/answer/114711446 — What do weight decay, momentum, and normalization do in a neural network? (answer by Hzhe Xu)
My own take on momentum: it is nominally an impulse term, but a better way to understand it is as a "viscosity factor". Momentum changes SGD from directly updating the position to using SGD to update the velocity. It keeps the "ball's" velocity roughly steady, adds continuity along a given direction, and damps the fluctuations introduced by learning, which lets us train with a larger learning rate and therefore converge faster.

A good strategy for deep learning with SGD (stochastic gradient descent) is to initialize the learning rate α to a value around α ≈ 0.01 = 10⁻², and to drop it by a constant factor (e.g., 10) throughout training whenever the loss reaches an apparent "plateau", repeating this (I take this to mean the process of reducing α) several times.
Generally, you probably want to use a momentum μ = 0.9 or a similar value. By smoothing the weight updates across iterations, momentum tends to make deep learning with SGD both stabler and faster.

Here, μ = momentum and α = base_lr.
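This drop-by-a-constant-factor schedule is what Caffe calls the "step" learning-rate policy (base_lr, gamma, stepsize); a minimal sketch, with a function name of my own choosing:

```python
def step_lr(it, base_lr=0.01, gamma=0.1, stepsize=10000):
    """Drop base_lr by a factor of gamma every `stepsize` iterations,
    mirroring the rule of thumb above (it = current iteration)."""
    return base_lr * gamma ** (it // stepsize)

# The rate stays at 0.01 for the first 10000 iterations,
# then 0.001, then 0.0001, and so on.
schedule = [step_lr(i) for i in (0, 10000, 25000)]
```

In practice the drops are often triggered when the loss plateaus rather than at a fixed iteration count, which is the manual variant described above.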

http://stats.stackexchange.com/questions/29130/difference-between-neural-net-weight-decay-and-learning-rate — Difference between neural net weight decay and learning rate

The learning rate is a parameter that determines how much an updating step influences the current value of the weights, while weight decay is an additional term in the weight update rule that causes the weights to exponentially decay to zero (dropping to 0 at an exponential rate — my guess is that this is because the regularization term is the first- or second-order norm of all the weights, though that alone doesn't look like exponential decay either…), if no other update is scheduled.

So let’s say that we have a cost or error function E(w) that we want to minimize. Gradient descent tells us to modify the weights w in the direction of steepest descent in E:

wi ← wi − η ∂E/∂wi

where η is the learning rate, and if it’s large you will have a correspondingly large modification of the weights (in general it shouldn’t be too large, otherwise you’ll overshoot the local minimum in your cost function).

In order to effectively limit the number of free parameters in your model so as to avoid over-fitting, it is possible to regularize the cost function. An easy way to do that is by introducing a zero mean Gaussian prior over the weights, which is equivalent to changing the cost function to Ẽ(w) = E(w) + (λ/2)‖w‖². In practice this penalizes large weights and effectively limits the freedom in your model. The regularization parameter λ determines how you trade off the original cost E with the large weights penalization.

Applying gradient descent to this new cost function we obtain:

wi+1 = wi − η ∂E/∂wi − ηλwi

The new term, −ηλwi, coming from the regularization causes each weight to decay in proportion to its own size.
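That this regularized gradient step is the same as "shrink the weight by a constant factor, then take a plain gradient step" can be verified directly with made-up values (a sketch; the function names and numbers are mine):

```python
import numpy as np

def regularized_step(w, grad_E, eta=0.1, lam=0.01):
    # w_{i+1} = w_i - eta * dE/dw_i - eta*lam*w_i, as in the update above
    return w - eta * grad_E - eta * lam * w

def decay_then_step(w, grad_E, eta=0.1, lam=0.01):
    # Same arithmetic, read as: shrink w by (1 - eta*lam), then plain GD.
    return (1.0 - eta * lam) * w - eta * grad_E

w = np.array([1.0, -0.5])
g = np.array([0.2, 0.1])
assert np.allclose(regularized_step(w, g), decay_then_step(w, g))
```

With grad_E = 0 the weight is multiplied by (1 − ηλ) each step, which is the geometric (exponential) decay to zero mentioned above.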
