Improving the way neural networks learn


The cross-entropy cost function

Introducing the cross-entropy cost function

Using the quadratic cost function, learning slows down when the neuron saturates, because σ′(z^L) becomes very small and so ∂C/∂w^L_{jk} becomes very small. With the cross-entropy cost the σ′(z^L) factor cancels out of the gradient, so learning is faster when the neuron is unambiguously wrong than it is later on: the greater the initial error, the faster the neuron learns.
We define the cross-entropy by

C = -\frac{1}{n} \sum_x \sum_j \left[ y_j \ln a_j^L + (1-y_j) \ln(1-a_j^L) \right]. \qquad (1)

If the output neurons are linear neurons then the quadratic cost will not give rise to any problems with a learning slowdown. In this case the quadratic cost is, in fact, an appropriate cost function to use.
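To make the slowdown concrete, here is a small numerical sketch (my own illustration, not from the book) comparing the output-layer error for the two costs on a single badly saturated sigmoid neuron. The quadratic cost's error carries a σ′(z) factor, while the cross-entropy's does not:

import numpy as np

def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z)*(1-sigmoid(z))

z = 5.0                   # saturated neuron: a = sigmoid(5) is about 0.993
a, y = sigmoid(z), 0.0    # desired output is 0, so the neuron is unambiguously wrong

quadratic_delta = (a - y)*sigmoid_prime(z)  # suppressed by sigmoid_prime(z), about 0.0066
cross_entropy_delta = a - y                 # no sigmoid_prime factor, about 0.993
print(quadratic_delta, cross_entropy_delta)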

Using the cross-entropy to classify MNIST digits

What does the cross-entropy mean? Where does it come from?

Softmax

The idea of softmax is to define a new type of output layer for our neural networks. In a softmax layer we apply the so-called softmax function to the weighted inputs z_j^L. According to this function, the activation a_j^L of the j-th output neuron is

a_j^L = \frac{e^{z_j^L}}{\sum_k e^{z_k^L}}, \qquad (2)

The output from the softmax layer can be thought of as a probability distribution.
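A minimal sketch (illustrative, not from the book) confirming that the softmax outputs are positive and sum to 1:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtracting the max is for numerical stability only
    return e/np.sum(e)

z = np.array([3.0, 1.0, 0.2])
a = softmax(z)
print(a)         # roughly [0.836, 0.113, 0.051]
print(a.sum())   # 1.0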

The learning slowdown problem:
We’ll use x to denote a training input to the network, and y to denote the corresponding desired output. Then the log-likelihood cost associated to this training input is

C \equiv -\ln a_y^L. \qquad (3)

So, for instance, if we're training with MNIST images, and input an image of a 7, then the log-likelihood cost is −ln a_7^L.
What about the learning slowdown problem? To analyze that, recall that the key to the learning slowdown is the behaviour of the quantities ∂C/∂w^L_{jk} and ∂C/∂b^L_j.
\frac{\partial C}{\partial b_j^L} = a_j^L - y_j, \qquad (4)
\frac{\partial C}{\partial w_{jk}^L} = a_k^{L-1}(a_j^L - y_j). \qquad (5)

Just as in the earlier analysis, these expressions ensure that we will not encounter a learning slowdown. In fact, it’s useful to think of a softmax output layer with log-likelihood cost as being quite similar to a sigmoid output layer with cross-entropy cost.
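As a quick check (my own sketch, with an assumed 3-class example), the log-likelihood cost and the output error a^L − y for a softmax layer:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e/np.sum(e)

z = np.array([2.0, -1.0, 0.5])
y = np.array([1.0, 0.0, 0.0])    # one-hot desired output: the true class is class 0

a = softmax(z)
cost = -np.log(a[np.argmax(y)])  # log-likelihood cost -ln a_y, Equation (3)
delta = a - y                    # output error, matching Equations (4) and (5)
print(cost, delta)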

Overfitting and regularization

Overfitting is a major problem in neural networks. This is especially true in modern networks, which often have very large numbers of weights and biases. To train effectively, we need a way of detecting when overfitting is going on, so we don’t overtrain. And we’d like to have techniques for reducing the effects of overfitting.
The obvious way to detect overfitting is to use the approach above, keeping track of accuracy on the test data as our network trains. If we see that the accuracy on the test data is no longer improving, then we should stop training. Of course, strictly speaking, this is not necessarily a sign of overfitting. It might be that accuracy on the test data and the training data both stop improving at the same time. Still, adopting this strategy will prevent overfitting.
In fact, we’ll use a variation on this strategy. We’ll compute the classification accuracy on the validation_data at the end of each epoch. Once the classification accuracy on the validation_data has saturated, we stop training. This strategy is called early stopping. Of course, in practice we won’t immediately know when the accuracy has saturated. Instead, we continue training until we’re confident that the accuracy has saturated.
Why use the validation_data to prevent overfitting, rather than the test_data? In fact, this is part of a more general strategy, which is to use the validation_data to evaluate different trial choices of hyper-parameters such as the number of epochs to train for, the learning rate, the best network architecture, and so on. We use such evaluations to find and set good values for the hyper-parameters.
You can think of the validation data as a type of training data that helps us learn good hyper-parameters. This approach to finding good hyper-parameters is sometimes known as the hold out method, since the validation_data is kept apart or “held out” from the training_data.

In general, one of the best ways of reducing overfitting is to increase the size of the training data. With enough training data it is difficult for even a very large network to overfit. Unfortunately, training data can be expensive or difficult to acquire, so this is not always a practical option.

Regularization

weight decay or L2 regularization

The idea of L2 regularization is to add an extra term to the cost function, a term called the regularization term. Here’s the regularized cross-entropy:

C = -\frac{1}{n} \sum_x \sum_j \left[ y_j \ln a_j^L + (1-y_j) \ln(1-a_j^L) \right] + \frac{\lambda}{2n} \sum_w w^2, \qquad (6)

where λ>0 is known as the regularization parameter, and n is, as usual, the size of our training set.
Of course, it’s possible to regularize other cost functions, such as the quadratic cost. This can be done in a similar way:
C = \frac{1}{2n} \sum_x \|y - a^L\|^2 + \frac{\lambda}{2n} \sum_w w^2. \qquad (7)

In both cases we can write the regularized cost function as
C = C_0 + \frac{\lambda}{2n} \sum_w w^2, \qquad (8)

where C0 is the original, unregularized cost function.
Taking the partial derivatives of Equation (8) gives
\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} w, \qquad (9)
\frac{\partial C}{\partial b} = \frac{\partial C_0}{\partial b}. \qquad (10)

The ∂C0/∂w and ∂C0/∂b terms can be computed using backpropagation.
The partial derivatives with respect to the biases are unchanged, and so the gradient descent learning rule for the biases doesn’t change from the usual rule:
b \to b - \eta \frac{\partial C_0}{\partial b}. \qquad (11)

The learning rule for the weights becomes:
w \to w - \eta \frac{\partial C_0}{\partial w} - \frac{\eta\lambda}{n} w \qquad (12)
\quad = \left(1 - \frac{\eta\lambda}{n}\right) w - \eta \frac{\partial C_0}{\partial w}. \qquad (13)

This is exactly the same as the usual gradient descent learning rule, except we first rescale the weight w by a factor 1 − ηλ/n. This rescaling is sometimes referred to as weight decay, since it makes the weights smaller.
The regularized learning rule for stochastic gradient descent becomes
w \to \left(1 - \frac{\eta\lambda}{n}\right) w - \frac{\eta}{m} \sum_x \frac{\partial C_x}{\partial w}, \qquad (14)

where the sum is over the training examples x in the mini-batch, and m is the mini-batch size.
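A minimal numpy sketch of update (14), with assumed values for η, λ, n, and m; this is the same line that appears in update_mini_batch in network2.py below:

import numpy as np

eta, lmbda, n, m = 0.5, 5.0, 50000, 10       # assumed hyper-parameters
w = np.random.randn(30, 784)                 # a layer's weight matrix
grad_sum = np.random.randn(30, 784)          # stands in for the summed dC_x/dw over the mini-batch

w = (1 - eta*lmbda/n)*w - (eta/m)*grad_sum   # weight decay, then the usual gradient step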

L1 regularization

In this approach we modify the unregularized cost function by adding the sum of the absolute values of the weights:

C = C_0 + \frac{\lambda}{n} \sum_w |w|. \qquad (15)

Differentiating, we obtain

\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n}\,\mathrm{sgn}(w), \qquad (16)

where sgn(w) is the sign of w, that is, +1 if w is positive and −1 if w is negative. The resulting update rule for the weights is

w \to w' = w - \frac{\eta\lambda}{n}\,\mathrm{sgn}(w) - \eta \frac{\partial C_0}{\partial w}. \qquad (17)
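A sketch (illustrative, assumed values) contrasting the two penalties: L1 shrinks every weight by the constant amount ηλ/n, while L2 shrinks each weight in proportion to its size:

import numpy as np

eta, lmbda, n = 0.5, 5.0, 50000
w = np.array([2.0, -0.001, 0.3])
grad = np.zeros_like(w)     # set dC_0/dw to zero, to isolate the regularization effect

w_l1 = w - (eta*lmbda/n)*np.sign(w) - eta*grad   # constant shrinkage, Equation (17)
w_l2 = (1 - eta*lmbda/n)*w - eta*grad            # proportional shrinkage, Equation (13)
print(w_l1)
print(w_l2)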

Dropout

Dropout is a radically different technique for regularization. Unlike L1 and L2 regularization, dropout doesn’t rely on modifying the cost function. Instead, in dropout we modify the network itself.
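network2.py doesn't implement dropout, but a minimal sketch of the usual "inverted dropout" forward step, with an assumed keep-probability of 0.5 for the hidden neurons, looks like this:

import numpy as np

def dropout_forward(a, p_keep=0.5, training=True):
    # During training, keep each neuron with probability p_keep and scale the
    # survivors by 1/p_keep, so no rescaling is needed at test time.
    if not training:
        return a
    mask = (np.random.rand(*a.shape) < p_keep)/p_keep
    return a*mask

a = np.random.rand(30, 1)    # a hidden layer's activations
print(dropout_forward(a))    # roughly half the entries are zeroed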

Artificially expanding the training data

Rotating, translating, and skewing the images to expand the training data.
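A sketch of the idea using scipy (the 15-degree rotation and 3-pixel shift are arbitrary illustrative choices; the random array stands in for an MNIST digit):

import numpy as np
from scipy import ndimage

image = np.random.rand(28, 28)                      # stands in for a 28x28 MNIST image

rotated = ndimage.rotate(image, 15, reshape=False)  # rotate by 15 degrees
shifted = ndimage.shift(image, (3, 0))              # translate down by 3 pixels

# Each transformed image is added to the training set with the original label.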

Weight initialization

Just to remind you, that prescription was to choose both the weights and biases using independent Gaussian random variables, normalized to have mean 0 and standard deviation 1.
While this approach has worked well, it was quite ad hoc, and it’s worth revisiting to see if we can find a better way of setting our initial weights and biases, and perhaps help our neural networks learn faster.
Suppose we have a neuron with n_in input weights. Then we shall initialize those weights as Gaussian random variables with mean 0 and standard deviation 1/√n_in; we'll continue to initialize the biases as before, as Gaussian random variables with a mean of 0 and a standard deviation of 1.
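A quick numerical check (my own sketch, bias omitted) of why this helps. With n_in = 1000 inputs all set to 1, the weighted sum z has standard deviation about √1000 ≈ 31.6 under the old scheme, so the neuron almost always saturates; under the new scheme z stays in the sensitive region of the sigmoid:

import numpy as np

n_in = 1000
x = np.ones((n_in, 1))                             # suppose every input is 1

w_old = np.random.randn(1000, n_in)                # 1000 sampled neurons, std-1 weights
w_new = np.random.randn(1000, n_in)/np.sqrt(n_in)  # std 1/sqrt(n_in) weights

print(np.std(w_old.dot(x)))   # about 31.6
print(np.std(w_new.dot(x)))   # about 1.0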

Handwriting recognition revisited: the code

"""network2.py~~~~~~~~~~~~~~An improved version of network.py, implementing the stochasticgradient descent learning algorithm for a feedforward neural network.Improvements include the addition of the cross-entropy cost function,regularization, and better initialization of network weights.  Notethat I have focused on making the code simple, easily readable, andeasily modifiable.  It is not optimized, and omits many desirablefeatures."""#### Libraries# Standard libraryimport jsonimport randomimport sys# Third-party librariesimport numpy as np#### Define the quadratic and cross-entropy cost functionsclass QuadraticCost(object):    @staticmethod    def fn(a, y):        """Return the cost associated with an output ``a`` and desired output        ``y``.        """        return 0.5*np.linalg.norm(a-y)**2    @staticmethod    def delta(z, a, y):        """Return the error delta from the output layer."""        return (a-y) * sigmoid_prime(z)class CrossEntropyCost(object):    @staticmethod    def fn(a, y):        """Return the cost associated with an output ``a`` and desired output        ``y``.  Note that np.nan_to_num is used to ensure numerical        stability.  In particular, if both ``a`` and ``y`` have a 1.0        in the same slot, then the expression (1-y)*np.log(1-a)        returns nan.  The np.nan_to_num ensures that that is converted        to the correct value (0.0).        """        return np.sum(np.nan_to_num(-y*np.log(a)-(1-y)*np.log(1-a)))    @staticmethod    def delta(z, a, y):        """Return the error delta from the output layer.  Note that the        parameter ``z`` is not used by the method.  It is included in        the method's parameters in order to make the interface        consistent with the delta method for other cost classes.        """        return (a-y)#### Main Network classclass Network(object):    def __init__(self, sizes, cost=CrossEntropyCost):        """The list ``sizes`` contains the number of neurons in the respective        layers of the network.  For example, if the list was [2, 3, 1]        then it would be a three-layer network, with the first layer        containing 2 neurons, the second layer 3 neurons, and the        third layer 1 neuron.  The biases and weights for the network        are initialized randomly, using        ``self.default_weight_initializer`` (see docstring for that        method).        """        self.num_layers = len(sizes)        self.sizes = sizes        self.default_weight_initializer()        self.cost=cost    def default_weight_initializer(self):        """Initialize each weight using a Gaussian distribution with mean 0        and standard deviation 1 over the square root of the number of        weights connecting to the same neuron.  Initialize the biases        using a Gaussian distribution with mean 0 and standard        deviation 1.        Note that the first layer is assumed to be an input layer, and        by convention we won't set any biases for those neurons, since        biases are only ever used in computing the outputs from later        layers.        """        self.biases = [np.random.randn(y, 1) for y in self.sizes[1:]]        self.weights = [np.random.randn(y, x)/np.sqrt(x)                        for x, y in zip(self.sizes[:-1], self.sizes[1:])]    def large_weight_initializer(self):        """Initialize the weights using a Gaussian distribution with mean 0        and standard deviation 1.  Initialize the biases using a        Gaussian distribution with mean 0 and standard deviation 1.        
Note that the first layer is assumed to be an input layer, and        by convention we won't set any biases for those neurons, since        biases are only ever used in computing the outputs from later        layers.        This weight and bias initializer uses the same approach as in        Chapter 1, and is included for purposes of comparison.  It        will usually be better to use the default weight initializer        instead.        """        self.biases = [np.random.randn(y, 1) for y in self.sizes[1:]]        self.weights = [np.random.randn(y, x)                        for x, y in zip(self.sizes[:-1], self.sizes[1:])]    def feedforward(self, a):        """Return the output of the network if ``a`` is input."""        for b, w in zip(self.biases, self.weights):            a = sigmoid(np.dot(w, a)+b)        return a    def SGD(self, training_data, epochs, mini_batch_size, eta,            lmbda = 0.0,            evaluation_data=None,            monitor_evaluation_cost=False,            monitor_evaluation_accuracy=False,            monitor_training_cost=False,            monitor_training_accuracy=False):        """Train the neural network using mini-batch stochastic gradient        descent.  The ``training_data`` is a list of tuples ``(x, y)``        representing the training inputs and the desired outputs.  The        other non-optional parameters are self-explanatory, as is the        regularization parameter ``lmbda``.  The method also accepts        ``evaluation_data``, usually either the validation or test        data.  We can monitor the cost and accuracy on either the        evaluation data or the training data, by setting the        appropriate flags.  The method returns a tuple containing four        lists: the (per-epoch) costs on the evaluation data, the        accuracies on the evaluation data, the costs on the training        data, and the accuracies on the training data.  All values are        evaluated at the end of each training epoch.  So, for example,        if we train for 30 epochs, then the first element of the tuple        will be a 30-element list containing the cost on the        evaluation data at the end of each epoch. Note that the lists        are empty if the corresponding flag is not set.        
"""        if evaluation_data: n_data = len(evaluation_data)        n = len(training_data)        evaluation_cost, evaluation_accuracy = [], []        training_cost, training_accuracy = [], []        for j in xrange(epochs):            random.shuffle(training_data)            mini_batches = [                training_data[k:k+mini_batch_size]                for k in xrange(0, n, mini_batch_size)]            for mini_batch in mini_batches:                self.update_mini_batch(                    mini_batch, eta, lmbda, len(training_data))            print "Epoch %s training complete" % j            if monitor_training_cost:                cost = self.total_cost(training_data, lmbda)                training_cost.append(cost)                print "Cost on training data: {}".format(cost)            if monitor_training_accuracy:                accuracy = self.accuracy(training_data, convert=True)                training_accuracy.append(accuracy)                print "Accuracy on training data: {} / {}".format(                    accuracy, n)            if monitor_evaluation_cost:                cost = self.total_cost(evaluation_data, lmbda, convert=True)                evaluation_cost.append(cost)                print "Cost on evaluation data: {}".format(cost)            if monitor_evaluation_accuracy:                accuracy = self.accuracy(evaluation_data)                evaluation_accuracy.append(accuracy)                print "Accuracy on evaluation data: {} / {}".format(                    self.accuracy(evaluation_data), n_data)            print        return evaluation_cost, evaluation_accuracy, \            training_cost, training_accuracy    def update_mini_batch(self, mini_batch, eta, lmbda, n):        """Update the network's weights and biases by applying gradient        descent using backpropagation to a single mini batch.  The        ``mini_batch`` is a list of tuples ``(x, y)``, ``eta`` is the        learning rate, ``lmbda`` is the regularization parameter, and        ``n`` is the total size of the training data set.        """        nabla_b = [np.zeros(b.shape) for b in self.biases]        nabla_w = [np.zeros(w.shape) for w in self.weights]        for x, y in mini_batch:            delta_nabla_b, delta_nabla_w = self.backprop(x, y)            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]        self.weights = [(1-eta*(lmbda/n))*w-(eta/len(mini_batch))*nw                        for w, nw in zip(self.weights, nabla_w)]        self.biases = [b-(eta/len(mini_batch))*nb                       for b, nb in zip(self.biases, nabla_b)]    def backprop(self, x, y):        """Return a tuple ``(nabla_b, nabla_w)`` representing the        gradient for the cost function C_x.  
``nabla_b`` and        ``nabla_w`` are layer-by-layer lists of numpy arrays, similar        to ``self.biases`` and ``self.weights``."""        nabla_b = [np.zeros(b.shape) for b in self.biases]        nabla_w = [np.zeros(w.shape) for w in self.weights]        # feedforward        activation = x        activations = [x] # list to store all the activations, layer by layer        zs = [] # list to store all the z vectors, layer by layer        for b, w in zip(self.biases, self.weights):            z = np.dot(w, activation)+b            zs.append(z)            activation = sigmoid(z)            activations.append(activation)        # backward pass        delta = (self.cost).delta(zs[-1], activations[-1], y)        nabla_b[-1] = delta        nabla_w[-1] = np.dot(delta, activations[-2].transpose())        # Note that the variable l in the loop below is used a little        # differently to the notation in Chapter 2 of the book.  Here,        # l = 1 means the last layer of neurons, l = 2 is the        # second-last layer, and so on.  It's a renumbering of the        # scheme in the book, used here to take advantage of the fact        # that Python can use negative indices in lists.        for l in xrange(2, self.num_layers):            z = zs[-l]            sp = sigmoid_prime(z)            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp            nabla_b[-l] = delta            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())        return (nabla_b, nabla_w)    def accuracy(self, data, convert=False):        """Return the number of inputs in ``data`` for which the neural        network outputs the correct result. The neural network's        output is assumed to be the index of whichever neuron in the        final layer has the highest activation.        The flag ``convert`` should be set to False if the data set is        validation or test data (the usual case), and to True if the        data set is the training data. The need for this flag arises        due to differences in the way the results ``y`` are        represented in the different data sets.  In particular, it        flags whether we need to convert between the different        representations.  It may seem strange to use different        representations for the different data sets.  Why not use the        same representation for all three data sets?  It's done for        efficiency reasons -- the program usually evaluates the cost        on the training data and the accuracy on other data sets.        These are different types of computations, and using different        representations speeds things up.  More details on the        representations can be found in        mnist_loader.load_data_wrapper.        """        if convert:            results = [(np.argmax(self.feedforward(x)), np.argmax(y))                       for (x, y) in data]        else:            results = [(np.argmax(self.feedforward(x)), y)                        for (x, y) in data]        return sum(int(x == y) for (x, y) in results)    def total_cost(self, data, lmbda, convert=False):        """Return the total cost for the data set ``data``.  The flag        ``convert`` should be set to False if the data set is the        training data (the usual case), and to True if the data set is        the validation or test data.  See comments on the similar (but        reversed) convention for the ``accuracy`` method, above.        
"""        cost = 0.0        for x, y in data:            a = self.feedforward(x)            if convert: y = vectorized_result(y)            cost += self.cost.fn(a, y)/len(data)        cost += 0.5*(lmbda/len(data))*sum(            np.linalg.norm(w)**2 for w in self.weights)        return cost    def save(self, filename):        """Save the neural network to the file ``filename``."""        data = {"sizes": self.sizes,                "weights": [w.tolist() for w in self.weights],                "biases": [b.tolist() for b in self.biases],                "cost": str(self.cost.__name__)}        f = open(filename, "w")        json.dump(data, f)        f.close()#### Loading a Networkdef load(filename):    """Load a neural network from the file ``filename``.  Returns an    instance of Network.    """    f = open(filename, "r")    data = json.load(f)    f.close()    cost = getattr(sys.modules[__name__], data["cost"])    net = Network(data["sizes"], cost=cost)    net.weights = [np.array(w) for w in data["weights"]]    net.biases = [np.array(b) for b in data["biases"]]    return net#### Miscellaneous functionsdef vectorized_result(j):    """Return a 10-dimensional unit vector with a 1.0 in the j'th position    and zeroes elsewhere.  This is used to convert a digit (0...9)    into a corresponding desired output from the neural network.    """    e = np.zeros((10, 1))    e[j] = 1.0    return edef sigmoid(z):    """The sigmoid function."""    return 1.0/(1.0+np.exp(-z))def sigmoid_prime(z):    """Derivative of the sigmoid function."""    return sigmoid(z)*(1-sigmoid(z))

How to choose a neural network’s hyper-parameters?

In this section I explain some heuristics which can be used to set the hyper-parameters in a neural network.
Broad strategy:
When using neural networks to attack a new problem the first challenge is to get any non-trivial learning, i.e., for the network to achieve results better than chance. This can be surprisingly difficult, especially when confronting a new class of problem. Let’s look at some strategies you can use if you’re having this kind of trouble.
Suppose, for example, that you’re attacking MNIST for the first time. You start out enthusiastic, but are a little discouraged when your first network fails completely, as in the example above. The way to go is to strip the problem down. Get rid of all the training and validation images except images which are 0s or 1s. Then try to train a network to distinguish 0s from 1s. That enables much more rapid experimentation, and so gives you more rapid insight into how to build a good network.
You can further speed up experimentation by stripping your network down to the simplest network likely to do meaningful learning. If you believe a [784, 10] network can likely do better-than-chance classification of MNIST digits, then begin your experimentation with such a network. It’ll be much faster than training a [784, 30, 10] network, and you can build back up to the latter.
You can get another speed up in experimentation by increasing the frequency of monitoring. We can get feedback more quickly by monitoring the validation accuracy more often, say, after every 1,000 training images. Furthermore, instead of using the full 10,000 image validation set to monitor performance, we can get a much faster estimate using just 100 validation images. All that matters is that the network sees enough images to do real learning, and to get a pretty good rough estimate of performance.
Intuitively, it may seem as though simplifying the problem and the architecture will merely slow you down. In fact, it speeds things up, since you much more quickly find a network with a meaningful signal. Once you’ve got such a signal, you can often get rapid improvements by tweaking the hyper-parameters. As with many things in life, getting started can be the hardest thing to do.

Learning rate:
If η is too large then the steps will be so large that they may actually overshoot the minimum, causing the algorithm to climb up out of the valley instead.
First, we estimate the threshold value for η at which the cost on the training data immediately begins decreasing, instead of oscillating or increasing. This estimate doesn’t need to be too accurate. You may optionally refine your estimate, to pick out the largest value of η at which the cost decreases during the first few epochs. This gives us an estimate for the threshold value of η.
Obviously, the actual value of η that you use should be no larger than the threshold value. In fact, if the value of η is to remain usable over many epochs then you likely want to use a value for η that is smaller, say, a factor of two below the threshold. Such a choice will typically allow you to train for many epochs, without causing too much of a slowdown in learning.
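A sketch of the threshold search (illustrative; it assumes training_data has been loaded with mnist_loader, as in the usage sketch above):

import network2

for eta in [0.01, 0.1, 1.0, 10.0]:           # coarse sweep by factors of 10
    net = network2.Network([784, 30, 10])
    _, _, training_cost, _ = net.SGD(
        training_data[:1000], 5, 10, eta,    # few images, few epochs: a cheap probe
        monitor_training_cost=True)
    print("eta = {}: {}".format(eta, training_cost))

# The threshold is roughly the largest eta whose cost decreases from the first
# epochs; then train with a value a factor of two or so below it.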
Use early stopping to determine the number of training epochs:
Early stopping means that at the end of each epoch we should compute the classification accuracy on the validation data. When that stops improving, terminate. This makes setting the number of epochs very simple.
We might elect to terminate if the classification accuracy hasn’t improved during the last ten epochs. This ensures that we don’t stop too soon, in response to bad luck in training, but also that we’re not waiting around forever for an improvement that never comes. This no-improvement-in-ten rule is good for initial exploration of MNIST. I suggest using the no-improvement-in-ten rule for initial experimentation, and gradually adopting more lenient rules, as you better understand the way your network trains: no-improvement-in-twenty, no-improvement-in-fifty, and so on.
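The rule itself is easy to sketch (my own illustration, written around the per-epoch validation accuracies that SGD in network2.py can report):

def no_improvement_in_n(accuracies, n=10):
    # True once the best accuracy so far is more than n epochs old.
    if len(accuracies) <= n:
        return False
    best_epoch = accuracies.index(max(accuracies))
    return len(accuracies) - 1 - best_epoch >= n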

Learning rate schedule:
We’ve been holding the learning rate η constant. However, it’s often advantageous to vary the learning rate. Early on during the learning process it’s likely that the weights are badly wrong. And so it’s best to use a large learning rate that causes the weights to change quickly. Later, we can reduce the learning rate as we make more fine-tuned adjustments to our weights.
How should we set our learning rate schedule? Many approaches are possible. One natural approach is to use the same basic idea as early stopping. The idea is to hold the learning rate constant until the validation accuracy starts to get worse. Then decrease the learning rate by some amount, say a factor of two or ten. We repeat this many times, until, say, the learning rate is a factor of 1,024 (or 1,000) times lower than the initial value. Then we terminate.
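A skeleton of that schedule (illustrative; train_one_epoch and validation_accuracy are assumed callables, and no_improvement_in_n is the helper sketched above):

def train_with_schedule(eta0, train_one_epoch, validation_accuracy,
                        patience=10, factor=2.0, floor=1024.0):
    eta, accuracies = eta0, []
    while eta > eta0/floor:                    # stop once eta is 1024x below eta0
        train_one_epoch(eta)                   # one epoch of SGD at the current rate
        accuracies.append(validation_accuracy())
        if no_improvement_in_n(accuracies, patience):
            eta = eta/factor                   # drop the learning rate...
            accuracies = []                    # ...and restart the patience window
    return eta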

The regularization parameter, λ:
I suggest starting initially with no regularization (λ=0.0), and determining a value for η, as above. Using that choice of η, we can then use the validation data to select a good value for λ. Start by trialling λ=1.0, and then increase or decrease by factors of 10, as needed to improve performance on the validation data. Once you’ve found a good order of magnitude, you can fine tune your value of λ. That done, you should return and re-optimize η again.

Mini-batch size:
With these factors in mind, choosing the best mini-batch size is a compromise. Too small, and you don’t get to take full advantage of the benefits of good matrix libraries optimized for fast hardware. Too large and you’re simply not updating your weights often enough. What you need is to choose a compromise value which maximizes the speed of learning. Fortunately, the choice of mini-batch size at which the speed is maximized is relatively independent of the other hyper-parameters (apart from the overall architecture), so you don’t need to have optimized those hyper-parameters in order to find a good mini-batch size.

Automated techniques:
I’ve been describing these heuristics as though you’re optimizing your hyper-parameters by hand. Hand-optimization is a good way to build up a feel for how neural networks behave. However, and unsurprisingly, a great deal of work has been done on automating the process. A common technique is grid search, which systematically searches through a grid in hyper-parameter space.
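A minimal grid search sketch; evaluate is a hypothetical stand-in for training a network with the given hyper-parameters and returning its validation accuracy:

from itertools import product

def evaluate(eta, lmbda):
    # Hypothetical: train a network2.Network with these hyper-parameters
    # and return its accuracy on the validation data.
    raise NotImplementedError

etas = [0.025, 0.25, 2.5]
lmbdas = [0.1, 1.0, 10.0]
best_eta, best_lmbda = max(product(etas, lmbdas),
                           key=lambda params: evaluate(*params))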

Summing up:
In particular, I’ve discussed the hyper-parameters largely independently. In practice, there are relationships between the hyper-parameters. You may experiment with η, feel that you’ve got it just right, then start to optimize for λ, only to find that it’s messing up your optimization for η. In practice, it helps to bounce backward and forward, gradually closing in on good values.

Variations on stochastic gradient descent

Stochastic gradient descent by backpropagation has served us well in attacking the MNIST digit classification problem. However, there are many other approaches to optimizing the cost function, and sometimes those other approaches offer performance superior to mini-batch stochastic gradient descent. In this section I sketch two such approaches, the Hessian and momentum techniques.

Momentum-based gradient descent:

We introduce velocity variables v = v_1, v_2, …, one for each corresponding w_j variable. Then we replace the gradient descent update rule w → w′ = w − η∇C by

v \to v' = \mu v - \eta \nabla C, \qquad (18)
w \to w' = w + v'. \qquad (19)

In these equations, μ is a hyper-parameter which controls the amount of damping or friction in the system; it's called the momentum co-efficient.
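A sketch of the momentum update on a toy one-dimensional cost C(w) = w²/2 (the cost and the values of η and μ are my own illustrative choices):

def grad(w):
    return w                     # gradient of the toy cost C(w) = w**2/2

eta, mu = 0.1, 0.9               # learning rate and momentum coefficient
w, v = 5.0, 0.0
for step in range(100):
    v = mu*v - eta*grad(w)       # Equation (18): the velocity accumulates gradients
    w = w + v                    # Equation (19): the velocity moves the weight
print(w)                         # close to the minimum at w = 0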

Other models of artificial neuron

the tanh neuron:
The output of a tanh neuron with input x, weight vector w, and bias b is given by

\tanh(w \cdot x + b), \qquad (20)

where the tanh function is defined by

\tanh(z) \equiv \frac{e^z - e^{-z}}{e^z + e^{-z}}. \qquad (21)

With a little algebra it can easily be verified that
\sigma(z) = \frac{1 + \tanh(z/2)}{2}, \qquad (22)

We can also see graphically that the tanh function has the same shape as the sigmoid function. One difference between tanh neurons and sigmoid neurons is that the output from tanh neurons ranges from -1 to 1, not 0 to 1.
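A one-line numerical check of identity (22) (my own sketch):

import numpy as np

z = np.linspace(-5, 5, 11)
sigma = 1.0/(1.0+np.exp(-z))
print(np.allclose(sigma, (1 + np.tanh(z/2))/2))   # True: Equation (22) holds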

the rectified linear neuron or rectified linear unit:
The output of a rectified linear unit with input x, weight vector w, and bias b is given by

\max(0, w \cdot x + b). \qquad (23)
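A corresponding sketch (illustrative): for positive weighted input the rectified linear unit's gradient is constant, so it doesn't saturate the way a sigmoid or tanh neuron does for large z:

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_prime(z):
    return (z > 0).astype(float)   # gradient is 1 for z > 0, 0 for z < 0

z = np.array([-2.0, 0.5, 3.0])
print(relu(z))         # [0.  0.5 3. ]
print(relu_prime(z))   # [0. 1. 1.]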
