Deep Learning: Regularization (Part 2)

Source: Internet. Editor: 程序博客网. Time: 2024/06/09 19:43

Norm Penalties as Constrained Optimization

Consider the cost function regularized by a parameter norm penalty:

$$ \tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha \Omega(\theta) $$

We can minimize a function subject to constraints by constructing a generalized Lagrange function, consisting of the original objective function plus a set of penalties.
Each penalty is a product between a coefficient, called a Karush–Kuhn–Tucker (KKT) multiplier, and a function representing whether the constraint is satisfied. If we wanted to constrain Ω(θ) to be less than some constant k, we could construct a generalized Lagrange function

$$ \mathcal{L}(\theta, \alpha; X, y) = J(\theta; X, y) + \alpha (\Omega(\theta) - k) $$

The solution to the constrained problem is given by

$$ \theta^* = \arg\min_{\theta} \max_{\alpha, \alpha \geq 0} \mathcal{L}(\theta, \alpha) $$

All positive α encourage Ω(θ) to shrink. The optimal value α∗ will encourage Ω(θ) to shrink, but not so strongly as to make Ω(θ) become less than k.
To gain some insight into the effect of the constraint, we can fix α∗ and view the problem as just a function of θ:

$$ \theta^* = \arg\min_{\theta} \mathcal{L}(\theta, \alpha^*) = \arg\min_{\theta} J(\theta; X, y) + \alpha^* \Omega(\theta) $$

This is exactly the same as the regularized training problem of minimizing J̃. We can thus think of a parameter norm penalty as imposing a constraint on the weights.
(1) If Ω is the L2 norm, then the weights are constrained to lie in an L2 ball.
(2) If Ω is the L1 norm, then the weights are constrained to lie in a region of limited L1 norm.
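
As a hedged one-dimensional illustration of this correspondence, the following worked example assumes a quadratic cost J(θ) = ½(θ − c)² with c > 0 and the penalty Ω(θ) = θ². The cost and the symbol c are assumptions chosen for the example, not taken from the text. The penalized problem has a closed-form minimizer:

$$ \tilde{J}(\theta) = \tfrac{1}{2}(\theta - c)^2 + \alpha \theta^2, \qquad \frac{d\tilde{J}}{d\theta} = (\theta - c) + 2\alpha\theta = 0 \;\Rightarrow\; \theta(\alpha) = \frac{c}{1 + 2\alpha} $$

The constrained problem, minimizing ½(θ − c)² subject to θ² ≤ k with 0 < k < c², is solved on the boundary of the constraint region, at θ∗ = √k. Equating the two solutions gives the multiplier that reproduces the constraint:

$$ \frac{c}{1 + 2\alpha^*} = \sqrt{k} \;\Rightarrow\; \alpha^* = \frac{1}{2}\left(\frac{c}{\sqrt{k}} - 1\right) > 0 $$

If instead k ≥ c², the unconstrained minimizer θ = c already satisfies the constraint, so the constraint is inactive and α∗ = 0.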

Usually we do not know the size of the constraint region that we impose by using weight decay with coefficient α∗ because the value of α∗ does not directly tell us the value of k.
In principle, one can solve for k, but the relationship between k and α∗ depends on the form of J. While we do not know the exact size of the constraint region, we can control it roughly by increasing or decreasing α in order to shrink or grow the constraint region.
Larger α will result in a smaller constraint region. Smaller α will result in a larger constraint region.
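
The relationship between α and the size of the constraint region can be checked numerically. The sketch below is a minimal illustration, not part of the original text: it assumes a least-squares cost J(θ) = ½‖Xθ − y‖² on synthetic data and the squared L2 norm as Ω, and the names ridge_solution and implied_k are hypothetical helpers. For each α it solves the penalized problem in closed form and reports the squared norm of the solution, which is the k that this α implicitly imposes.

```python
import numpy as np

def ridge_solution(X, y, alpha):
    """Minimize 0.5 * ||X @ theta - y||^2 + alpha * ||theta||^2 in closed form."""
    n_features = X.shape[1]
    # Setting the gradient to zero gives (X^T X + 2*alpha*I) theta = X^T y.
    return np.linalg.solve(X.T @ X + 2.0 * alpha * np.eye(n_features), X.T @ y)

def implied_k(X, y, alpha):
    """Squared L2 norm of the penalized solution: the k implied by this alpha."""
    theta = ridge_solution(X, y, alpha)
    return float(theta @ theta)

# Synthetic data, assumed purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = rng.normal(size=50)

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
ks = [implied_k(X, y, a) for a in alphas]
for a, k in zip(alphas, ks):
    print(f"alpha={a:>6}: implied k = {k:.6f}")
```

For this particular cost the implied k decreases as α grows, matching the statement that a larger α corresponds to a smaller constraint region. For other forms of J the mapping from α to k would have to be worked out separately.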

Sometimes we may wish to use explicit constraints rather than penalties:
(1) We can modify algorithms such as stochastic gradient descent to take a step downhill on J(θ) and then project θ back to the nearest point that satisfies Ω(θ) < k. This can be useful if we have an idea of what value of k is appropriate and do not want to spend time searching for the value of α that corresponds to this k.
(2) Another reason to use explicit constraints and reprojection rather than enforcing constraints with penalties is that penalties can cause non-convex optimization procedures to get stuck in local minima corresponding to small θ. When training neural networks, this usually manifests as neural networks that train with several “dead units.” These are units that do not contribute much to the behavior of the function learned by the network because the weights going into or out of them are all very small. Explicit constraints implemented by reprojection can work much better in these cases because they do not encourage the weights to approach the origin; they only have an effect when the weights become large and attempt to leave the constraint region.
(3) Finally, explicit constraints with reprojection can be useful because they impose some stability on the optimization procedure. When using high learning rates, it is possible to encounter a positive feedback loop in which large weights induce large gradients, which then induce a large update to the weights. If these updates consistently increase the size of the weights, then θ rapidly moves away from the origin until numerical overflow occurs. Explicit constraints with reprojection prevent this feedback loop from continuing to increase the magnitude of the weights without bound.
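
The step-then-project procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: Ω is taken to be the L2 norm, so the constraint region is a ball whose radius plays the role of k; the update uses the full gradient of a toy quadratic cost rather than a stochastic minibatch gradient; and the names project_l2_ball and projected_gradient_descent are hypothetical, not from the text.

```python
import numpy as np

def project_l2_ball(theta, radius):
    """Return the nearest point to theta with L2 norm at most radius."""
    norm = np.linalg.norm(theta)
    if norm <= radius:
        return theta  # already inside the constraint region: no change
    return theta * (radius / norm)  # rescale onto the boundary of the ball

def projected_gradient_descent(grad_fn, theta0, radius, lr=0.1, steps=200):
    """Alternate a downhill step on J with reprojection onto the L2 ball."""
    theta = project_l2_ball(np.asarray(theta0, dtype=float), radius)
    for _ in range(steps):
        theta = theta - lr * grad_fn(theta)      # step downhill on J
        theta = project_l2_ball(theta, radius)   # project back if we left the region
    return theta

# Toy cost J(theta) = 0.5 * ||theta - target||^2, assumed for illustration.
# Its unconstrained minimum lies outside the unit ball.
target = np.array([3.0, 4.0])
grad = lambda theta: theta - target
theta_hat = projected_gradient_descent(grad, np.zeros(2), radius=1.0)
print(theta_hat, np.linalg.norm(theta_hat))
```

Note that the projection leaves θ untouched while it stays inside the ball, which is the sense in which reprojection does not pull the weights toward the origin. In a stochastic setting, grad_fn would presumably be replaced by a minibatch gradient estimate.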

In particular, Hinton et al. (2012c) recommend a strategy introduced by Srebro and Shraibman (2005): constraining the norm of each column of the weight matrix of a neural net layer, rather than constraining the Frobenius norm of the entire weight matrix. Constraining the norm of each column separately prevents any one hidden unit from having very large weights.

If we converted this constraint into a penalty in a Lagrange function, it would be similar to L2 weight decay but with a separate KKT multiplier for the weights of each hidden unit. Each of these KKT multipliers would be dynamically updated separately to make each hidden unit obey the constraint. In practice, column norm limitation is always implemented as an explicit constraint with reprojection.
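
A per-column reprojection of this kind might look like the sketch below. It is an illustration only: it assumes, following the text, that each column of the weight matrix holds the weights of one hidden unit (some frameworks store weights transposed), and the function name clip_column_norms and the max_norm parameter are hypothetical.

```python
import numpy as np

def clip_column_norms(W, max_norm, eps=1e-12):
    """Rescale each column of W whose L2 norm exceeds max_norm back to max_norm."""
    norms = np.linalg.norm(W, axis=0, keepdims=True)  # one norm per column
    # Scale factor is 1 for columns inside the constraint, < 1 for those outside.
    scale = np.minimum(1.0, max_norm / np.maximum(norms, eps))
    return W * scale

# Column 0 has norm 5 and is rescaled; column 1 is already small and is left alone.
W = np.array([[3.0, 0.1],
              [4.0, 0.2]])
clipped = clip_column_norms(W, max_norm=1.0)
print(clipped)
```

In a training loop this function would typically be applied to each layer's weight matrix after every parameter update, so that each hidden unit is constrained independently of the others.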
