How to improve neural network learning


Human intuition suggests that we learn more quickly when we are badly wrong, i.e., when our errors are large. But this is not the case for a neural network trained with the quadratic cost function: an artificial neuron has a lot of difficulty learning when it is badly wrong, far more difficulty than when it is just a little wrong.


Saying that "learning is slow" is really the same as saying that $\partial C/\partial w$ is small. Take the quadratic cost function

$$C = \frac{(y-a)^2}{2},$$

where $a = f(wx)$ is the output and $y$ is the desired output. Then

$$\frac{\partial C}{\partial w} = (a-y)\, f'(wx)\, x.$$

Suppose the activation function $f$ is the sigmoid.

(Figure: the sigmoid function)

From the graph we can see that when the neuron's output is close to 0 or 1 (i.e., $wx$ is very negative or very positive), the curve becomes very flat, which means $f'(wx)$ is very small. Therefore $\partial C/\partial w$ is small whenever the neuron saturates, even when it is badly wrong, and learning is slow.
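As a rough numerical illustration (a minimal numpy sketch, not from the original text; the variable names and values are made up), we can compare the quadratic-cost gradient for a saturated neuron with that of an unsaturated one:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def quadratic_grad_w(w, x, y):
    # dC/dw = (a - y) * f'(wx) * x  for C = (y - a)^2 / 2, a = sigmoid(wx)
    a = sigmoid(w * x)
    return (a - y) * a * (1 - a) * x

x, y = 1.0, 0.0                     # desired output is 0
print(quadratic_grad_w(2.0, x, y))  # a ≈ 0.88: wrong, gradient ≈ 0.093
print(quadratic_grad_w(6.0, x, y))  # a ≈ 0.998: badly wrong, yet gradient ≈ 0.0025
```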



1. Introduction to the cross-entropy cost function

In order to address the learning slowdown, we can replace the quadratic cost function with cross-entropy.

$$C = -\frac{1}{n}\sum_x \bigl[\, y \ln a + (1-y)\ln(1-a) \,\bigr]$$

where $y$ is the desired output and $a$ is the actual output.

It’s not obvious that the expression fixes the learning slowdown problem. In fact, frankly, it’s not even obvious that it makes sense to call this a cost function!

Before addressing the learning slowdown, let’s see in what sense the cross-entropy can be interpreted as a cost function.


Two properties in particular make it reasonable to interpret the cross-entropy as a cost function.
- $C > 0$ (note that $a \in [0,1]$, so both logarithm terms are negative and the leading minus sign makes the sum positive)
- If the neuron's actual output is close to the desired output for all training inputs $x$, then the cross-entropy will be close to zero

Summing up, the cross-entropy is positive, and tends toward zero as the neuron gets better at computing the desired output.


Let's compute the partial derivative of the cross-entropy cost with respect to the weights:

$$\frac{\partial C}{\partial w} = -\frac{1}{n}\sum_x \left[ \frac{y}{f(wx)} - \frac{1-y}{1-f(wx)} \right] f'(wx)\, x = \frac{1}{n}\sum_x \frac{f'(wx)\, x}{f(wx)\bigl(1-f(wx)\bigr)}\,\bigl(f(wx) - y\bigr)$$

For the sigmoid we have $f'(wx) = f(wx)\bigl(1-f(wx)\bigr)$, so the $f'(wx)$ and $f(wx)(1-f(wx))$ terms cancel in the equation just above, and it simplifies to become:

$$\frac{\partial C}{\partial w} = \frac{1}{n}\sum_x x\,\bigl(f(wx) - y\bigr)$$

It tells us that the rate at which the weight learns is controlled by $f(wx) - y$, i.e., by the error in the output. The larger the error, the faster the neuron learns. This is just what we'd intuitively expect.
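Continuing the small numerical sketch from above (again a hypothetical numpy illustration, not code from the original text), the cross-entropy gradient no longer contains the $f'(wx)$ factor, so a saturated-but-wrong neuron still receives a large gradient:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_grad_w(w, x, y):
    # dC/dw = x * (a - y): no f'(wx) factor, unlike the quadratic cost
    a = sigmoid(w * x)
    return x * (a - y)

x, y = 1.0, 0.0                         # desired output is 0
print(cross_entropy_grad_w(2.0, x, y))  # a ≈ 0.88:  gradient ≈ 0.88
print(cross_entropy_grad_w(6.0, x, y))  # a ≈ 0.998: badly wrong -> gradient ≈ 1.0, no slowdown
```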

2. Softmax

The idea of softmax is to define a new type of output layer for our neural networks. It begins in the same way as with a sigmoid layer, by forming the weighted inputs.

$$z^L_j = \sum_k w^L_{jk}\, a^{L-1}_k$$

However, we don't apply the sigmoid function to get the output. Instead, in a softmax layer we apply the so-called softmax function to the $z^L_j$:

$$a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}}$$

The output activations are guaranteed to always sum up to 1. As a result, if one output increases, then the other output activations must decrease by the same total amount. In other words, the output from the softmax layer can be thought of as a probability distribution.
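A minimal softmax implementation (a hypothetical numpy sketch; the shift by the maximum is a standard numerical-stability trick, not something discussed above) makes the sum-to-one property easy to check:

```python
import numpy as np

def softmax(z):
    # Subtract the max before exponentiating to avoid overflow; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
a = softmax(z)
print(a)          # ≈ [0.659 0.242 0.099]
print(a.sum())    # 1.0
```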


How does a softmax layer address the learning slowdown problem? Here we use the log-likelihood cost function.

We’ll use x to denote a training input to the network, and y to denote the corresponding desired output. Then the log-likelihood cost associated to this training input is:

$$C = -\ln a^L_y$$

(If the input is an image of a 7, the cost is $-\ln a^L_7$; only the 7th output neuron is considered.)
To see that this makes intuitive sense, consider the case when the network is doing a good job, that is, it is confident the input is a 7. In that case it will estimate a value for the corresponding probability $a^L_7$ which is close to 1, and so the cost $-\ln a^L_7$ will be small.

The partial derivative of this cost function with respect to the weights is:

$$\frac{\partial C}{\partial w^L_{jk}} = a^{L-1}_k\,\bigl(a^L_j - y_j\bigr)$$

The detailed proof uses the derivative of the softmax function:

$$\frac{\partial a_j}{\partial z_i} =
\begin{cases}
a_j(1-a_j) & \text{if } i = j \\
-\,a_i a_j & \text{if } i \neq j
\end{cases}$$

In fact, it’s useful to think of a softmax output layer with log-likelihood cost as being quite similar to a sigmoid output layer with cross-entropy cost.
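As a quick sanity check (a hypothetical numpy sketch, not part of the original text), we can verify numerically that for a softmax layer with the log-likelihood cost the gradient with respect to the weighted inputs is simply $a^L_j - y_j$, which together with $\partial z^L_j/\partial w^L_{jk} = a^{L-1}_k$ gives the weight derivative above:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def log_likelihood_cost(z, y_index):
    # C = -ln a_y for the softmax output a = softmax(z)
    return -np.log(softmax(z)[y_index])

z = np.array([0.5, -1.2, 2.0])
y_index = 0                       # desired class
y = np.eye(len(z))[y_index]       # one-hot desired output

# Analytic gradient dC/dz_j = a_j - y_j
analytic = softmax(z) - y

# Finite-difference check
eps = 1e-6
numeric = np.array([
    (log_likelihood_cost(z + eps * np.eye(len(z))[j], y_index)
     - log_likelihood_cost(z - eps * np.eye(len(z))[j], y_index)) / (2 * eps)
    for j in range(len(z))
])
print(np.allclose(analytic, numeric, atol=1e-6))  # True
```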

Softmax is a 'soft' version of argmax, i.e., of picking out the largest value in a group. (An analogue is softplus, a smoothed version of max(0, x).) This is one explanation for the name 'softmax'.

3. Regularization

3.1 L2 regularization

L2 regularization, also called weight decay, adds an extra term to the cost function:

$$C = -\frac{1}{n}\sum_x \sum_j \bigl[\, y_j \ln a^L_j + (1-y_j)\ln(1-a^L_j) \,\bigr] + \frac{\lambda}{2n}\sum_w w^2$$

It’s also worth noting that the regularization term doesn’t include the biases.

Empirically, doing this often doesn’t change the results very much, so to some extent it’s merely a convention whether to regularize the biases or not. However, it’s worth noting that having a large bias doesn’t make a neuron sensitive to its inputs in the same way as having large weights. And so we don’t need to worry about large biases enabling our network to learn the noise in our training data. At the same time, allowing large biases gives our networks more flexibility in behaviour - in particular, large biases make it easier for neurons to saturate, which is sometimes desirable.


General Form

$$C = C_0 + \frac{\lambda}{2n}\sum_w w^2$$

Intuitively, the effect of regularization is to make it so the network prefers to learn small weights, all other things being equal. Large weights will only be allowed if they considerably improve the first part of the cost function.

Put another way, regularization can be viewed as a way of compromising between finding small weights and minimizing the original cost function. The relative importance of the two elements of the compromise depends on the value of λ: when λ is small we prefer to minimize the original cost function, but when λ is large we prefer small weights.

But why should this kind of compromise help reduce overfitting?

First, take the partial derivative of C with respect to the weights:

$$\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} w$$

The learning rule for the weights becomes:

$$w \to w - \eta\frac{\partial C_0}{\partial w} - \frac{\eta\lambda}{n} w = \left(1 - \frac{\eta\lambda}{n}\right) w - \eta\frac{\partial C_0}{\partial w}$$

This is exactly the same as the usual gradient descent learning rule, except we first rescale the weight $w$ by a factor $\left(1 - \frac{\eta\lambda}{n}\right)$.

This rescaling is sometimes referred to as weight decay, since it makes the weights smaller.
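A minimal sketch of this update rule (hypothetical numpy code with made-up parameter values; grad_C0 stands for the gradient of the unregularized cost on the current mini-batch):

```python
import numpy as np

def sgd_step_l2(w, grad_C0, eta=0.5, lam=5.0, n=50000):
    """One gradient-descent step with L2 weight decay:
    w -> (1 - eta*lam/n) * w - eta * dC0/dw."""
    return (1 - eta * lam / n) * w - eta * grad_C0

w = np.array([1.0, -2.0, 0.5])
grad_C0 = np.array([0.1, -0.3, 0.0])
print(sgd_step_l2(w, grad_C0))  # weights are shrunk slightly, then moved down the gradient
```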

The smallness of the weights means that the behaviour of the network won't change too much if we change a few random inputs here and there. That makes it difficult for a regularized network to learn the effects of local noise in the data. Think of it as a way of making it so single pieces of evidence don't matter too much to the output of the network. (This is a preference for simpler models: Occam's Razor.)

3.2 L1 regularization

$$C = C_0 + \frac{\lambda}{n}\sum_w |w|$$

Let’s try to understand how the behaviour of a network trained using L1 regularization differs from a network trained using L2 regularization.

First, we'll look at the partial derivative of the cost function C:

$$\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n}\,\mathrm{sgn}(w)$$

$$\mathrm{sgn}(w) =
\begin{cases}
+1 & \text{if } w > 0 \\
-1 & \text{if } w < 0 \\
0 & \text{if } w = 0
\end{cases}$$

The resulting update rule for an L1 regularized network is:

$$w \to w - \eta\frac{\partial C_0}{\partial w} - \frac{\eta\lambda}{n}\,\mathrm{sgn}(w)$$

In L1 regularization, the weights shrink by a constant amount toward 0. In L2 regularization, the weights shrink by an amount which is proportional to w. And so when a particular weight has a large magnitude, |w|, L1 regularization shrinks the weight much less than L2 regularization does. By contrast, when |w| is small, L1 regularization shrinks the weight much more than L2 regularization.

The net result is that L1 regularization tends to concentrate the weight of the network in a relatively small number of high-importance connections, while the other weights are driven toward zero.
In other words, L1 regularization tends to retain the large weights and drive the small weights toward zero (because when $|w|$ is small, L1 regularization shrinks the weight much more than L2 regularization does).
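The difference in how the two penalties shrink weights is easy to see numerically (again a hypothetical sketch; the values of eta, lam and n are arbitrary):

```python
import numpy as np

eta, lam, n = 0.5, 5.0, 50000
w = np.array([3.0, 0.001])               # one large weight, one small weight

shrink_l2 = (eta * lam / n) * w           # shrinkage proportional to |w|
shrink_l1 = (eta * lam / n) * np.sign(w)  # constant shrinkage
print(shrink_l2)  # [1.5e-04 5.0e-08] -> L2 shrinks the large weight more
print(shrink_l1)  # [5.0e-05 5.0e-05] -> L1 shrinks the small weight much more than L2 does
```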

3.3 Dropout

Dropout is a radically different technique for regularization. Unlike L1 and L2 regularization, dropout doesn’t rely on modifying the cost function. Instead, in dropout we modify the network itself.

Ordinarily, we’d train by forward-propagating x through the network, and then backpropagating to determine the contribution to the gradient. With dropout, this process is modified. We start by randomly (and temporarily) deleting half the hidden neurons in the network, while leaving the input and output neurons untouched.

We forward-propagate the input x through the modified network, and then backpropagate the result, also through the modified network. Next, we update the appropriate weights and biases.

We then repeat the process, first restoring the dropout neurons, then choosing a new random subset of hidden neurons to delete, estimating the gradient for a different mini-batch, and updating the weights and biases in the network.

By repeating this process over and over, our network will learn a set of weights and biases. Of course, those weights and biases will have been learnt under conditions in which half the hidden neurons were dropped out. When we actually run the full network that means that twice as many hidden neurons will be active. To compensate for that, we halve the weights outgoing from the hidden neurons.
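A sketch of the idea for a single hidden layer (hypothetical numpy code; it drops half the hidden neurons during training and halves the outgoing weights at test time, as described above):

```python
import numpy as np

rng = np.random.default_rng(0)

def hidden_to_output(h, w_out, train=True, drop_prob=0.5):
    """Apply dropout to the hidden activations h before the output weights w_out."""
    if train:
        mask = rng.random(h.shape) >= drop_prob   # keep each hidden neuron with prob 0.5
        return (h * mask) @ w_out
    # At test time all hidden neurons are active, so halve the outgoing weights.
    return h @ (w_out * (1 - drop_prob))

h = np.array([0.2, 0.9, 0.5, 0.7])           # hidden activations
w_out = rng.normal(size=(4, 3))               # hidden-to-output weights
print(hidden_to_output(h, w_out, train=True))
print(hidden_to_output(h, w_out, train=False))
```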


We can think of dropout as training several different neural networks all using the same training data. Of course, the networks may not start out identical, and as a result after training they may sometimes give different results. When that happens we could use some kind of averaging or voting scheme to decide which output to accept.

This kind of averaging scheme is often found to be a powerful (though expensive) way of reducing overfitting. The reason is that the different networks may overfit in different ways, and averaging may help eliminate that kind of overfitting.

3.4 Artificially expanding the training data

Rotating an image by a small amount generates a new training sample; small translations and skews do the same.

Another option is “elastic distortions”, a special type of image distortion intended to emulate the random oscillations found in hand muscles.

All of these methods can be used to generate more training data.

Variations on this idea can be used to improve performance on many learning tasks, not just handwriting recognition. The general principle is to expand the training data by applying operations that reflect real-world variation.
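For example (a hypothetical sketch using scipy.ndimage; the 28x28 image shape is just the MNIST convention assumed here):

```python
import numpy as np
from scipy import ndimage

def augment(image, rng):
    """Create a new training sample by slightly rotating and translating an image."""
    angle = rng.uniform(-15, 15)                 # degrees
    shift = rng.uniform(-2, 2, size=2)           # pixels, (rows, cols)
    rotated = ndimage.rotate(image, angle, reshape=False, mode="constant")
    return ndimage.shift(rotated, shift, mode="constant")

rng = np.random.default_rng(0)
image = rng.random((28, 28))                     # stand-in for an MNIST digit
new_sample = augment(image, rng)
print(new_sample.shape)                          # (28, 28)
```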

4. How to choose hyper-parameters

In this section I explain some heuristics which can be used to set the hyper-parameters in a neural network.

Broad strategy: When using neural networks to attack a new problem the first challenge is to get any non-trivial learning, i.e., for the network to achieve results better than chance.


Learning rate: if η is too large, then the steps will be so large that they may actually overshoot the minimum. On the other hand, if η is too small, it slows down SGD.

With this picture in mind, we can set η as follows. First, we estimate the threshold value for η at which the cost on the training data immediately begins decreasing, instead of oscillating or increasing. Obviously, the actual value of η that you use should be no larger than the threshold value.
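One simple way to run this search (a hypothetical sketch; train_for_a_few_epochs is an assumed helper that, given a learning rate, trains briefly and returns the list of per-epoch training costs):

```python
def find_eta_threshold(train_for_a_few_epochs, candidates=(0.01, 0.1, 1.0, 10.0)):
    """Estimate the largest learning rate at which the training cost
    decreases right away, instead of oscillating or increasing."""
    threshold = None
    for eta in candidates:                         # candidates in increasing order
        costs = train_for_a_few_epochs(eta)        # per-epoch training costs
        if all(a > b for a, b in zip(costs, costs[1:])):
            threshold = eta                        # cost decreased monotonically
        else:
            break                                  # larger rates will only be worse
    return threshold
```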


Use early stopping to determine the number of training epochs
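A common way to implement early stopping (a hypothetical sketch; train_one_epoch and validation_accuracy are assumed helpers) is the no-improvement-in-n-epochs rule:

```python
def train_with_early_stopping(train_one_epoch, validation_accuracy, patience=10):
    """Train until the validation accuracy has not improved for `patience` epochs."""
    best_acc, epochs_since_best, epoch = 0.0, 0, 0
    while epochs_since_best < patience:
        train_one_epoch()
        epoch += 1
        acc = validation_accuracy()
        if acc > best_acc:
            best_acc, epochs_since_best = acc, 0
        else:
            epochs_since_best += 1
    return epoch, best_acc
```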


Automated techniques: Random search for hyper-parameter optimization | Practical Bayesian optimization of machine learning algorithms
