Improving the way neural networks learn



Why is learning slow with a sigmoid neuron and the quadratic cost function?

The quadratic cost function is given by

$$C = \frac{(y-a)^2}{2} \tag{1}$$

where a is the neuron’s output, a = σ(z), with z = wx + b for a single neuron with one input x, weight w, and bias b. Using the chain rule to differentiate with respect to the weight and bias, we get
$$\frac{\partial C}{\partial w} = (a - y)\,\sigma'(z)\,x = a\,\sigma'(z) \tag{2}$$

$$\frac{\partial C}{\partial b} = (a - y)\,\sigma'(z) = a\,\sigma'(z) \tag{3}$$

where I have substituted x=1 and y=0.
Recall the shape of the σ function:
[Figure: plot of the sigmoid function σ(z)]
We can see from this graph that when the neuron’s output is close to 1, the curve gets very flat, and so σ′(z) gets very small. Equations (2) and (3) then tell us that ∂C/∂w and ∂C/∂b get very small.
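
To make the slowdown concrete, here is a small numerical sketch (my own illustration, with hypothetical starting weights) that evaluates the gradient a·σ′(z) from Equations (2) and (3) for an unsaturated and a saturated neuron:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Single sigmoid neuron with input x = 1 and target y = 0, as in the text.
# By Equations (2) and (3), dC/dw = dC/db = a * sigma'(z).
x, y = 1.0, 0.0
for w, b in [(0.6, 0.9), (2.0, 2.0)]:  # the second neuron starts badly saturated
    z = w * x + b
    a = sigmoid(z)
    grad = a * sigmoid_prime(z)
    print(f"w={w}, b={b}: a={a:.3f}, dC/dw = dC/db = {grad:.4f}")
```

The saturated neuron’s gradient comes out several times smaller than the unsaturated one’s, which is exactly the learning slowdown described above.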

Consider using the quadratic cost when we have linear neurons in the output layer. Suppose that we have a many-layer, multi-neuron network, and that all the neurons in the final layer are linear neurons, meaning that the sigmoid activation function is not applied and the outputs are simply a^L_j = z^L_j. Show that if we use the quadratic cost function, then the output error δ^L for a single training example x is given by

$$\delta^L = a^L - y.$$

As in the previous problem, use this expression to show that the partial derivatives with respect to the weights and biases in the output layer are given by
$$\frac{\partial C}{\partial w^L_{jk}} = \frac{1}{n}\sum_x a^{L-1}_k\,(a^L_j - y_j), \qquad \frac{\partial C}{\partial b^L_j} = \frac{1}{n}\sum_x (a^L_j - y_j).$$

This shows that if the output neurons are linear neurons then the quadratic cost will not give rise to any problems with a learning slowdown. In this case the quadratic cost is, in fact, an appropriate cost function to use.
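
For reference, here is a short sketch of where δ^L = a^L − y comes from, using the standard definition δ^L_j ≡ ∂C/∂z^L_j and the single-example quadratic cost C = ½ Σ_j (a^L_j − y_j)²:

$$\delta^L_j \equiv \frac{\partial C}{\partial z^L_j} = \frac{\partial C}{\partial a^L_j}\,\frac{\partial a^L_j}{\partial z^L_j} = (a^L_j - y_j)\cdot 1, \qquad\text{i.e.}\qquad \delta^L = a^L - y,$$

since a^L_j = z^L_j for linear output neurons. The weight and bias formulas then follow from the usual backpropagation relations ∂C/∂w^L_{jk} = a^{L−1}_k δ^L_j and ∂C/∂b^L_j = δ^L_j, averaged over the n training examples.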

Sigmoid + cross-entropy cost function

The cross-entropy cost function

$$C = -\frac{1}{n}\sum_x \bigl[\, y \ln a + (1-y)\ln(1-a)\, \bigr] \tag{4}$$

where n is the total number of items of training data, the sum is over all training inputs, x, and y is the corresponding desired output.
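
As a quick sanity check on Equation (4), here is a minimal numpy sketch for a single-input sigmoid neuron (the data and parameter values are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_cost(w, b, xs, ys):
    """Equation (4): C = -(1/n) * sum_x [ y ln a + (1 - y) ln(1 - a) ],
    with a = sigma(w*x + b) for a single-input sigmoid neuron."""
    a = sigmoid(w * xs + b)
    return -np.mean(ys * np.log(a) + (1.0 - ys) * np.log(1.0 - a))

xs = np.array([0.0, 1.0])
ys = np.array([0.0, 1.0])
print(cross_entropy_cost(w=5.0, b=-2.5, xs=xs, ys=ys))   # small: outputs match the targets
print(cross_entropy_cost(w=-5.0, b=2.5, xs=xs, ys=ys))   # large: outputs are badly wrong
```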

Let us compute the partial derivative of the cross-entropy cost with respect to the weights. We substitute a = σ(z) into (4) and apply the chain rule twice, obtaining:

$$\frac{\partial C}{\partial w_j} = -\frac{1}{n}\sum_x \left( \frac{y}{\sigma(z)} - \frac{1-y}{1-\sigma(z)} \right) \frac{\partial \sigma}{\partial w_j} \tag{5}$$

$$\frac{\partial C}{\partial w_j} = -\frac{1}{n}\sum_x \left( \frac{y}{\sigma(z)} - \frac{1-y}{1-\sigma(z)} \right) \sigma'(z)\, x_j. \tag{6}$$

Putting everything over a common denominator and simplifying this becomes:
$$\frac{\partial C}{\partial w_j} = \frac{1}{n}\sum_x \frac{\sigma'(z)\, x_j}{\sigma(z)\,(1-\sigma(z))}\,\bigl(\sigma(z) - y\bigr). \tag{7}$$

Using the definition of the sigmoid function, σ(z) = 1/(1 + e^(-z)), and a little algebra, we can show that σ′(z) = σ(z)(1 − σ(z)).
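For completeness, the algebra is only a couple of lines:

$$\sigma'(z) = \frac{d}{dz}\,\frac{1}{1 + e^{-z}} = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \sigma(z)\,\bigl(1 - \sigma(z)\bigr).$$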
We see that the σ′(z) and σ(z)(1 − σ(z)) terms cancel in the equation just above, and it simplifies to become:
$$\frac{\partial C}{\partial w_j} = \frac{1}{n}\sum_x x_j\,\bigl(\sigma(z) - y\bigr). \tag{8}$$

This is a beautiful expression. It tells us that the rate at which the weight learns is controlled by σ(z) − y, i.e., by the error in the output. The larger the error, the faster the neuron will learn. In particular, it avoids the learning slowdown caused by the σ′(z) term in the analogous equation for the quadratic cost, Equation (2).
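
Here is a tiny sketch (for a single training example, so the 1/n average is dropped, and the saturated starting point is hypothetical) comparing the weight gradient given by the quadratic cost, Equation (2), with the one given by the cross-entropy cost, Equation (8):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A badly saturated neuron: input x = 1, target y = 0, but output a close to 1.
x, y = 1.0, 0.0
w, b = 2.0, 2.0
z = w * x + b
a = sigmoid(z)

quadratic_grad = (a - y) * a * (1.0 - a) * x   # Equation (2): contains the sigma'(z) factor
cross_entropy_grad = x * (a - y)               # Equation (8): no sigma'(z) factor

print(f"a = {a:.3f}")
print(f"quadratic cost:     dC/dw = {quadratic_grad:.4f}")    # tiny -> learning slowdown
print(f"cross-entropy cost: dC/dw = {cross_entropy_grad:.4f}")  # proportional to the error
```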

In a similar way, we can compute the partial derivative for the bias.

$$\frac{\partial C}{\partial b} = \frac{1}{n}\sum_x \bigl(\sigma(z) - y\bigr). \tag{9}$$

It’s easy to generalize the cross-entropy to many-neuron, multi-layer networks. In particular, suppose y = y_1, y_2, … are the desired values at the output neurons, i.e., the neurons in the final layer, while a^L_1, a^L_2, … are the actual output values. Then we define the cross-entropy by

$$C = -\frac{1}{n}\sum_x \sum_j \bigl[\, y_j \ln a^L_j + (1-y_j)\ln(1-a^L_j)\,\bigr].$$
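
In numpy, the many-neuron cost above can be written in a couple of lines (a sketch: I assume the activations and targets are stored as arrays of shape (number of output neurons, number of training examples), and the variable names are mine):

```python
import numpy as np

def cross_entropy_cost(output_activations, targets):
    """C = -(1/n) * sum over examples x and output neurons j of
           [ y_j ln a^L_j + (1 - y_j) ln(1 - a^L_j) ].
    Both arguments have shape (output_neurons, n_examples)."""
    n = targets.shape[1]
    return -np.sum(targets * np.log(output_activations)
                   + (1.0 - targets) * np.log(1.0 - output_activations)) / n

# Toy example: two output neurons, two training examples.
a_L = np.array([[0.9, 0.2],
                [0.1, 0.8]])
y = np.array([[1.0, 0.0],
              [0.0, 1.0]])
print(cross_entropy_cost(a_L, y))
```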

Softmax + log-likelihood cost

In a softmax layer we apply the so-called softmax function to the z^L_j. According to this function, the activation a^L_j of the jth output neuron is

$$a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}}, \tag{10}$$

where in the denominator we sum over all the output neurons.
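
A minimal implementation of Equation (10) might look as follows (subtracting the maximum z before exponentiating is a standard numerical-stability trick, not something the text requires, and it does not change the result):

```python
import numpy as np

def softmax(z):
    """Equation (10): a^L_j = exp(z^L_j) / sum_k exp(z^L_k).
    Subtracting max(z) leaves the ratios unchanged but avoids overflow."""
    exps = np.exp(z - np.max(z))
    return exps / np.sum(exps)

z = np.array([1.0, 2.0, 3.0])
a = softmax(z)
print(a)          # roughly [0.09, 0.245, 0.665]
print(a.sum())    # the activations are positive and sum to 1
```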

The log-likelihood cost:

$$C \equiv -\ln a^L_y. \tag{11}$$

The partial derivatives:

$$\frac{\partial C}{\partial b^L_j} = a^L_j - y_j \tag{12}$$

$$\frac{\partial C}{\partial w^L_{jk}} = a^{L-1}_k\,(a^L_j - y_j) \tag{13}$$

These expressions ensure that we will not encounter a learning slowdown. In fact, it’s useful to think of a softmax output layer with log-likelihood cost as being quite similar to a sigmoid output layer with cross-entropy cost.
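
To check Equations (11) and (12) numerically, here is a sketch comparing the analytic gradient a^L_j − y_j with a finite-difference estimate for a single training example (the z values and names are made up for illustration):

```python
import numpy as np

def softmax(z):
    exps = np.exp(z - np.max(z))
    return exps / np.sum(exps)

def log_likelihood_cost(z, true_class):
    """Equation (11): C = -ln a^L_y, where y is the index of the true class."""
    return -np.log(softmax(z)[true_class])

z = np.array([0.5, 1.5, -0.3])   # hypothetical weighted inputs z^L_j
true_class = 1
y = np.eye(len(z))[true_class]   # one-hot target vector

# Equation (12): dC/db^L_j = a^L_j - y_j.  Since the bias enters only through
# z^L_j (with coefficient 1), this is the same as dC/dz^L_j, which is what the
# finite difference below estimates.
analytic = softmax(z) - y

eps = 1e-6
numeric = np.array([
    (log_likelihood_cost(z + eps * e_j, true_class)
     - log_likelihood_cost(z - eps * e_j, true_class)) / (2 * eps)
    for e_j in np.eye(len(z))
])

print(analytic)
print(numeric)   # should agree with the analytic gradient to several decimals
```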

Given this similarity, should you use a sigmoid output layer and cross-entropy, or a softmax output layer and log-likelihood? In fact, in many situations both approaches work well. As a more general point of principle, softmax plus log-likelihood is worth using whenever you want to interpret the output activations as probabilities. That’s not always a concern, but can be useful with classification problems (like MNIST) involving disjoint classes.

Overfitting

In general, one of the best ways of reducing overfitting is to increase the size of the training data. With enough training data it is difficult for even a very large network to overfit.
