[torch] optim.sgd learning parameters


https://github.com/torch/optim/blob/master/doc/intro.md
https://stats.stackexchange.com/questions/29130/difference-between-neural-net-weight-decay-and-learning-rate
http://cs231n.github.io/neural-networks-3/#sgd
http://www.jianshu.com/p/58b3fe300ecb
http://www.jianshu.com/p/d8222a84613c

learning rate

When the learning rate is small, convergence to the minimum is slow. When the learning rate is large, the search tends to oscillate.

The learning rate is a parameter that determines how much an updating step influences the current value of the weights, while weight decay is an additional term in the weight update rule that causes the weights to exponentially decay to zero, if no other update is scheduled.

So let’s say that we have a cost or error function E(w) that we want to minimize. Gradient descent tells us to modify the weights w in the direction of steepest descent in E:

$$w_i \leftarrow w_i - \eta \frac{\partial E}{\partial w_i},$$

where $\eta$ is the learning rate, and if it's large you will have a correspondingly large modification of the weights $w_i$ (in general it shouldn't be too large, otherwise you'll overshoot the local minimum in your cost function).
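To make the effect of $\eta$ concrete, here is a minimal Python sketch (not the torch/optim API; the function name and sample values are made up for illustration) that runs plain gradient descent on $E(w) = w^2$ with different learning rates:

```python
# Plain gradient descent on E(w) = w^2, i.e. w <- w - lr * dE/dw.
def gradient_descent(w0, lr, steps):
    w = w0
    for _ in range(steps):
        grad = 2.0 * w      # dE/dw for E(w) = w^2
        w = w - lr * grad   # step in the direction of steepest descent
    return w

print(gradient_descent(w0=5.0, lr=0.01, steps=100))  # small lr: still far from 0 after 100 steps
print(gradient_descent(w0=5.0, lr=0.9,  steps=100))  # large lr: oscillates around 0 while converging
print(gradient_descent(w0=5.0, lr=1.1,  steps=100))  # too large: overshoots and diverges
```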

weight decay

In order to effectively limit the number of free parameters in your model so as to avoid over-fitting, it is possible to regularize the cost function. An easy way to do that is by introducing a zero mean Gaussian prior over the weights, which is equivalent to changing the cost function to $\tilde{E}(w) = E(w) + \frac{\lambda}{2} w^2$. In practice this penalizes large weights and effectively limits the freedom in your model. The regularization parameter $\lambda$ determines how you trade off the original cost $E$ with the large weights penalization.

Applying gradient descent to this new cost function we obtain:

$$w_i \leftarrow w_i - \eta \frac{\partial E}{\partial w_i} - \eta \lambda w_i.$$

The new term $-\eta \lambda w_i$ coming from the regularization causes the weight to decay in proportion to its size.
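As a rough Python sketch of what this extra term does (again not the torch/optim API; `lam` stands for the regularization parameter $\lambda$ and the values are illustrative), compare the update with and without the decay term for $E(w) = w^2$:

```python
# Regularized update w <- w - lr * dE/dw - lr * lam * w, for E(w) = w^2.
def sgd_with_weight_decay(w0, lr, lam, steps):
    w = w0
    for _ in range(steps):
        grad = 2.0 * w                      # gradient of the original cost E(w) = w^2
        w = w - lr * grad - lr * lam * w    # the extra -lr*lam*w term shrinks w toward 0
    return w

print(sgd_with_weight_decay(5.0, lr=0.01, lam=0.0, steps=100))  # no decay
print(sgd_with_weight_decay(5.0, lr=0.01, lam=0.5, steps=100))  # weights pulled toward 0 faster
```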

learning rate decay

When using gradient descent to minimize the objective func(x) = x * x, the update rule is x += v, where each update v of x is v = - dx * lr and dx is the first derivative of func(x) with respect to x. Intuitively, if lr is allowed to decay as the iterations proceed, the step size of the search keeps shrinking, which damps the oscillation. This is where the learning rate decay factor comes in:

$$lr_i = lr_{start} \cdot \frac{1.0}{1.0 + decay \cdot i}$$

The formula above is the learning rate decay schedule, where $lr_i$ is the learning rate at iteration $i$, $lr_{start}$ is the original learning rate, and $decay$ is a decimal in [0.0, 1.0].
From the formula we can see:

The smaller decay is, the slower the learning rate decays; with decay = 0 the learning rate stays constant. The larger decay is, the faster the learning rate decays; with decay = 1 it decays the fastest.
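A tiny Python sketch of this schedule (the names `lr_start` and `decay` and the sample values are just for illustration):

```python
# lr_i = lr_start * 1.0 / (1.0 + decay * i)
def decayed_lr(lr_start, decay, i):
    return lr_start * 1.0 / (1.0 + decay * i)

for i in (0, 10, 100, 1000):
    print(i, decayed_lr(lr_start=0.1, decay=0.01, i=i))
# With decay = 0.0 the learning rate would stay at 0.1 forever;
# a larger decay makes it shrink faster with the iteration count i.
```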

momentum

The concept of momentum comes from mechanics in physics, where it describes the accumulated effect of a force over time.

In plain gradient descent x += v, each update v of x is v = - dx * lr, where dx is the first derivative of the objective func(x) with respect to x.
With momentum, the update v of x is instead taken to be the sum of this step's gradient descent term - dx * lr and the previous update v scaled by a factor momentum in [0, 1], i.e.

v = - dx * lr + v * momentum

From the formula we can see:

When this step's gradient descent term - dx * lr points in the same direction as the previous update v, the previous update accelerates the current search. When this step's - dx * lr points in the opposite direction to the previous update v, the previous update slows the current search down.
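Below is a minimal Python sketch of this rule applied to func(x) = x * x (not the torch/optim implementation; the function name and values are illustrative):

```python
# Momentum update: v = -dx * lr + v * momentum, then x += v, for func(x) = x^2.
def sgd_momentum(x0, lr, momentum, steps):
    x, v = x0, 0.0
    for _ in range(steps):
        dx = 2.0 * x                    # first derivative of x^2
        v = -dx * lr + v * momentum     # this step's descent plus the scaled previous update
        x += v
    return x

print(sgd_momentum(5.0, lr=0.01, momentum=0.0, steps=100))  # plain gradient descent
print(sgd_momentum(5.0, lr=0.01, momentum=0.9, steps=100))  # ends up much closer to the minimum
```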