Deep Learning: Regularization (5)
Noise Robustness
The earlier discussion of dataset augmentation motivated applying noise to the inputs as an augmentation strategy. For some models, the addition of noise with infinitesimal variance at the input of the model is equivalent to imposing a penalty on the norm of the weights.
- In the general case, it is important to remember that noise injection can be much more powerful than simply shrinking the parameters, especially when the noise is added to the hidden units.
- Another way that noise has been used in the service of regularizing models is by adding it to the weights. This technique has been used primarily in the context of recurrent neural networks (Jim et al., 1996; Graves, 2011). It can be interpreted as a stochastic implementation of Bayesian inference over the weights: the Bayesian treatment of learning considers the model weights to be uncertain, representable via a probability distribution that reflects this uncertainty.
- It can also be interpreted as equivalent (under some assumptions) to a more traditional form of regularization. Adding noise to the weights has been shown to be an effective regularization strategy in the context of recurrent neural networks.
We study the regression setting, where we wish to train a function $\hat{y}(x)$ that maps a set of features $x$ to a scalar, using the least-squares cost between the model prediction and the true value $y$:
$$J = \mathbb{E}_{p(x,y)}\left[(\hat{y}(x) - y)^2\right].$$
The training set consists of m labeled examples $\{(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})\}$.
We now assume that with each input presentation we also include a random perturbation $\epsilon_W \sim \mathcal{N}(\epsilon; 0, \eta I)$ of the network weights, and denote the perturbed model as $\hat{y}_{\epsilon_W}(x)$. Despite the injection of noise, we are still interested in minimizing the squared error of the output of the network. The objective function thus becomes:
$$\tilde{J}_W = \mathbb{E}_{p(x,y,\epsilon_W)}\left[(\hat{y}_{\epsilon_W}(x) - y)^2\right].$$
For small $\eta$, the minimization of $\tilde{J}_W$ with added weight noise (with covariance $\eta I$) is equivalent to minimization of $J$ with an additional regularization term:
$$\eta\, \mathbb{E}_{p(x,y)}\left[\lVert \nabla_W \hat{y}(x) \rVert^2\right].$$
This form of regularization encourages the parameters to go to regions of parameter space where small perturbations of the weights have a relatively small influence on the output.
In other words, it pushes the model into regions where the model is relatively insensitive to small variations in the weights, finding points that are not merely minima, but minima surrounded by flat regions.
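The equivalence above can be checked numerically. The sketch below is our own example, not from the text: it assumes a linear model $\hat{y}(x) = w^\top x$, for which $\nabla_w \hat{y}(x) = x$, so the expected squared error under weight noise should equal $J + \eta\,\mathbb{E}[\lVert x \rVert^2]$.

```python
import numpy as np

# Monte Carlo check of the weight-noise regularizer for a linear model:
# with eps ~ N(0, eta*I), E[(yhat_{w+eps}(x) - y)^2] = J + eta * E[||x||^2],
# because grad_w yhat(x) = x for a linear model.
rng = np.random.default_rng(0)

m, d = 500, 5
X = rng.normal(size=(m, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=m)

w = rng.normal(size=d)            # current (untrained) parameters
eta = 1e-2                        # variance of the weight noise

J = np.mean((X @ w - y) ** 2)     # plain squared-error objective

# Estimate the noisy-weight objective by averaging over noise draws.
n_draws = 20_000
noisy = 0.0
for _ in range(n_draws):
    eps = rng.normal(scale=np.sqrt(eta), size=d)
    noisy += np.mean((X @ (w + eps) - y) ** 2)
noisy /= n_draws

# Analytic prediction: J plus the gradient-norm penalty.
penalty = eta * np.mean(np.sum(X ** 2, axis=1))
print(noisy, J + penalty)         # the two values should nearly agree
```

For this linear model the identity holds exactly in expectation; for a deep network the equivalence stated above holds only to first order in $\eta$.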
Injecting Noise at the Output Targets
Most datasets have some amount of mistakes in the y labels. It can be harmful to maximize log p(y | x) when y is a mistake. One way to prevent this is to explicitly model the noise on the labels.
- For example, we can assume that for some small constant ϵ, the training set label y is correct with probability 1 − ϵ, and otherwise any of the other possible labels might be correct. This assumption is easy to incorporate into the cost function analytically, rather than by explicitly drawing noise samples.
- Label smoothing, for instance, regularizes a model based on a softmax with k output values by replacing the hard 0 and 1 classification targets with targets of ϵ/(k − 1) and 1 − ϵ, respectively.
- The standard cross-entropy loss may then be used with these soft targets. Maximum likelihood learning with a softmax classifier and hard targets may actually never converge—the softmax can never predict a probability of exactly 0 or exactly 1, so it will continue to learn larger and larger weights, making more extreme predictions forever.
- It is possible to prevent this scenario using other regularization strategies like weight decay.
- Label smoothing has the advantage of preventing the pursuit of hard probabilities without discouraging correct classification.
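The soft targets described above can be sketched as follows (the values of ϵ and k here are illustrative, not taken from the text):

```python
import numpy as np

# A minimal sketch of label smoothing: hard one-hot targets are replaced
# by 1 - eps for the true class and eps / (k - 1) for each other class.
def smooth_labels(labels, k, eps=0.1):
    """labels: integer class labels; returns (len(labels), k) soft targets."""
    targets = np.full((len(labels), k), eps / (k - 1))
    targets[np.arange(len(labels)), labels] = 1.0 - eps
    return targets

def cross_entropy(probs, targets):
    """Average cross-entropy between predicted probabilities and soft targets."""
    return -np.mean(np.sum(targets * np.log(probs), axis=1))

labels = np.array([0, 2, 1])
T = smooth_labels(labels, k=3, eps=0.1)
# First row of T: [0.9, 0.05, 0.05]. Every row still sums to 1, but the
# softmax is no longer pushed toward probabilities of exactly 0 or 1.
```

Recent versions of PyTorch expose the same idea directly via the `label_smoothing` argument of `nn.CrossEntropyLoss`.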