Deep Learning: Regularization (Part 2)
Source: Internet. Editor: 程序博客网. Time: 2024/06/09 19:43
Norm Penalties as Constrained Optimization
Consider the cost function regularized by a parameter norm penalty:
$$\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha \Omega(\theta)$$
We can minimize a function subject to constraints by constructing a generalized Lagrange function, consisting of the original objective function plus a set of penalties.
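Each penalty is a product between a coefficient, called a Karush–Kuhn–Tucker (KKT) multiplier, and a function representing whether the constraint is satisfied. If we wanted to constrain Ω(θ) to be less than some constant k, we could construct a generalized Lagrange function
$$\mathcal{L}(\theta, \alpha; X, y) = J(\theta; X, y) + \alpha \left( \Omega(\theta) - k \right)$$
The solution to the constrained problem is given by
$$\theta^* = \arg\min_{\theta} \max_{\alpha,\, \alpha \ge 0} \mathcal{L}(\theta, \alpha)$$
Solving this problem requires adjusting both θ and α: whatever procedure is used, α must increase whenever Ω(θ) > k and decrease whenever Ω(θ) < k. All positive α encourage Ω(θ) to shrink. The optimal value α∗ will encourage Ω(θ) to shrink, but not so strongly as to make Ω(θ) become less than k.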
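To gain some insight into the effect of the constraint, we can fix α∗ and view the problem as just a function of θ:
$$\theta^* = \arg\min_{\theta} \mathcal{L}(\theta, \alpha^*) = \arg\min_{\theta} J(\theta; X, y) + \alpha^* \Omega(\theta)$$
This is exactly the same as the regularized training problem of minimizing the norm-penalized cost function $\tilde{J}$ with α = α∗. We can thus think of a parameter norm penalty as imposing a constraint on the weights.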
(1) If Ω is the L2 norm, then the weights are constrained to lie in an L2 ball.
(2) If Ω is the L1 norm, then the weights are constrained to lie in a region of limited L1 norm.
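The min-max formulation can be made concrete with a minimal sketch. It assumes that J is a mean squared error, that Ω(θ) is the squared L2 norm, and that plain gradient steps are taken on both θ and α. The function name, step sizes, and iteration count are illustrative choices and do not come from the original text. The multiplier α is raised when Ω(θ) > k, lowered when Ω(θ) < k, and never allowed to go negative.

```python
import numpy as np

def constrained_least_squares(X, y, k, steps=5000, lr_theta=0.01, lr_alpha=0.05):
    """Approximately solve min_theta J(theta) subject to Omega(theta) <= k.

    Assumptions (illustrative only): J is the mean squared error and
    Omega(theta) = ||theta||_2^2. Uses gradient descent on theta and
    gradient ascent on the KKT multiplier alpha.
    """
    n, d = X.shape
    theta = np.zeros(d)
    alpha = 0.0
    for _ in range(steps):
        grad_J = (2.0 / n) * X.T @ (X @ theta - y)   # gradient of the data term
        grad_omega = 2.0 * theta                      # gradient of ||theta||^2
        # Descent step on the Lagrangian J + alpha * (Omega - k) w.r.t. theta.
        theta = theta - lr_theta * (grad_J + alpha * grad_omega)
        # Ascent step on alpha: grows if Omega > k, shrinks if Omega < k,
        # and is clipped at zero because KKT multipliers are non-negative.
        alpha = max(0.0, alpha + lr_alpha * (theta @ theta - k))
    return theta, alpha

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = X @ np.array([2.0, -1.0, 0.5])
    theta, alpha = constrained_least_squares(X, y, k=1.0)
    print("Omega(theta) =", theta @ theta, "alpha =", alpha)
```

When the unconstrained minimizer already satisfies the constraint, α should stay at zero in this sketch and the procedure reduces to ordinary gradient descent on J.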
Usually we do not know the size of the constraint region that we impose by using weight decay with coefficient α∗ because the value of α∗ does not directly tell us the value of k.
In principle, one can solve for k, but the relationship between k and α∗ depends on the form of J. While we do not know the exact size of the constraint region, we can control it roughly by increasing or decreasing α in order to grow or shrink the constraint region.
Larger α will result in a smaller constraint region. Smaller α will result in a larger constraint region.
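The inverse relationship between α and the size of the constraint region can be checked numerically in one special case. The sketch below assumes a linear least-squares cost with a squared L2 penalty (ridge regression), for which the minimizer has a closed form. The helper names `ridge_solution` and `implied_radius` are hypothetical. The norm of the minimizer serves as the radius of the L2 ball implied by a given α.

```python
import numpy as np

def ridge_solution(X, y, alpha):
    """Closed-form minimizer of ||X w - y||^2 + alpha * ||w||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

def implied_radius(X, y, alpha):
    """L2 norm of the ridge solution: the radius of the ball it lies on."""
    return np.linalg.norm(ridge_solution(X, y, alpha))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)
    for alpha in (0.1, 1.0, 10.0, 100.0):
        print(f"alpha={alpha:7.1f}  implied radius={implied_radius(X, y, alpha):.4f}")
```

The mapping from α to the radius depends on X and y, which is the sense in which the relationship between k and α∗ depends on the form of J.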
Sometimes we may wish to use explicit constraints rather than penalties:
(1) We can modify algorithms such as stochastic gradient descent to take a step downhill on J(θ) and then project θ back to the nearest point that satisfies Ω(θ) < k. This can be useful if we have an idea of what value of k is appropriate and do not want to spend time searching for the value of α that corresponds to this k.
(2) Another reason to use explicit constraints and reprojection rather than enforcing constraints with penalties is that penalties can cause non-convex optimization procedures to get stuck in local minima corresponding to small θ. When training neural networks, this usually manifests as neural networks that train with several “dead units.” These are units that do not contribute much to the behavior of the function learned by the network because the weights going into or out of them are all very small.
(3) Explicit constraints implemented by re-projection can work much better in these cases because they do not encourage the weights to approach the origin. Explicit constraints implemented by re-projection only have an effect when the weights become large and attempt to leave the constraint region.
(4) Finally, explicit constraints with reprojection can be useful because they impose some stability on the optimization procedure. When using high learning rates, it is possible to encounter a positive feedback loop in which large weights induce large gradients which then induce a large update to the weights. If these updates consistently increase the size of the weights, then θ rapidly moves away from the origin until numerical overflow occurs. Explicit constraints with reprojection prevent this feedback loop from continuing to increase the magnitude of the weights without bound.
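The step-then-project procedure described above can be sketched as follows, assuming the constraint is an L2 ball of a given radius and using full-batch gradient steps in place of stochastic ones. The function names, learning rate, and step count are illustrative assumptions. Projection here lands on the closed ball (norm ≤ radius), which is the usual practical reading of the constraint.

```python
import numpy as np

def project_l2_ball(theta, radius):
    """Return the nearest point to theta with L2 norm at most radius."""
    norm = np.linalg.norm(theta)
    if norm <= radius:
        return theta                     # already feasible: projection is a no-op
    return theta * (radius / norm)       # rescale onto the boundary of the ball

def projected_gradient_descent(grad_fn, theta0, radius, lr=0.1, steps=200):
    """Take a gradient step on J, then project back into the constraint region."""
    theta = project_l2_ball(np.asarray(theta0, dtype=float), radius)
    for _ in range(steps):
        theta = theta - lr * grad_fn(theta)       # step downhill on J
        theta = project_l2_ball(theta, radius)    # re-project if we left the ball
    return theta

if __name__ == "__main__":
    target = np.array([3.0, 4.0])
    grad_fn = lambda th: 2.0 * (th - target)      # gradient of ||theta - target||^2
    print(projected_gradient_descent(grad_fn, np.zeros(2), radius=1.0))
```

Because the projection only acts when θ tries to leave the ball, weights inside the region are not pulled toward the origin, and even a step size that would diverge without the constraint cannot push the norm past the radius.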
In particular, Hinton et al. (2012c) recommend a strategy introduced by Srebro and Shraibman (2005): constraining the norm of each column of the weight matrix of a neural net layer, rather than constraining the Frobenius norm of the entire weight matrix. Constraining the norm of each column separately prevents any one hidden unit from having very large weights.
If we converted this constraint into a penalty in a Lagrange function, it would be similar to L2 weight decay but with a separate KKT multiplier for the weights of each hidden unit. Each of these KKT multipliers would be dynamically updated separately to make each hidden unit obey the constraint. In practice, column norm limitation is always implemented as an explicit constraint with reprojection.
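A minimal sketch of the column-norm reprojection, assuming that each column of the weight matrix holds the weights of one hidden unit and that the function would be called after every parameter update. The name `clip_column_norms` and the `eps` guard against division by zero are illustrative assumptions, not taken from the cited papers.

```python
import numpy as np

def clip_column_norms(W, max_norm, eps=1e-12):
    """Rescale each column of W so that its L2 norm is at most max_norm.

    Columns already within the limit are returned unchanged.
    """
    col_norms = np.linalg.norm(W, axis=0, keepdims=True)        # shape (1, n_units)
    scale = np.minimum(1.0, max_norm / np.maximum(col_norms, eps))
    return W * scale

if __name__ == "__main__":
    W = np.array([[3.0, 0.1],
                  [4.0, 0.2]])
    # First column (norm 5) is rescaled to norm 1; second column is untouched.
    print(clip_column_norms(W, max_norm=1.0))
```

Each column is handled independently, so one hidden unit with large weights is shrunk without affecting the others, in contrast to a single constraint on the Frobenius norm of the whole matrix.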