Machine Learning - Neural Networks Learning: Cost Function and Backpropagation
This series of articles is the study notes of "Machine Learning," by Prof. Andrew Ng, Stanford University. This article covers the notes of week 5, Neural Networks Learning, on the topics of the cost function and the backpropagation algorithm.
Cost Function and Backpropagation
Neural networks are one of the most powerful learning algorithms that we have today. In this and the next few sections, we're going to start talking about a learning algorithm for fitting the parameters of a neural network given a training set. As with the discussion of most of our learning algorithms, we're going to begin by talking about the cost function for fitting the parameters of the network.
1. Cost function
I'm going to focus on the application of neural networks to classification problems. So suppose we have a network like that shown in the picture, and suppose we have a training set $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$ of $m$ training examples.
- $L$ = total no. of layers in the network; for the network in the picture, $L = 4$.
- $s_l$ = no. of units (not counting the bias unit) in layer $l$; here $s_1 = 3$, $s_2 = 5$, $s_4 = s_L = 4$.
Binary classification: $y = 0$ or $1$, so there is 1 output unit.

Multi-class classification ($K$ classes): $y \in \mathbb{R}^K$ (e.g. a one-hot vector), so there are $K$ output units.
Cost function

For logistic regression, the regularized cost function was:

$$J(\theta) = -\frac{1}{m}\left[\sum_{i=1}^{m} y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)})\log\left(1-h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$

For a neural network, whose hypothesis $h_\Theta(x) \in \mathbb{R}^K$ outputs $K$ values and where $(h_\Theta(x))_k$ denotes the $k$-th output, the cost function is a generalization that also sums over the $K$ output units:

$$J(\Theta) = -\frac{1}{m}\left[\sum_{i=1}^{m}\sum_{k=1}^{K} y_k^{(i)} \log\left(h_\Theta(x^{(i)})\right)_k + (1-y_k^{(i)})\log\left(1-\left(h_\Theta(x^{(i)})\right)_k\right)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(\Theta_{ji}^{(l)}\right)^2$$

The regularization term sums over all the weights $\Theta_{ji}^{(l)}$ except those multiplying the bias units ($i = 0$).
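As a concrete illustration of this formula, here is a minimal NumPy sketch (not from the course; the names `nn_cost` and `Thetas`, and the one-hot label matrix `Y`, are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nn_cost(Thetas, X, Y, lam):
    """Neural-network cost J(Theta).

    Thetas : list of weight matrices; Thetas[l] has shape (s_{l+1}, s_l + 1)
    X      : (m, n) inputs;  Y : (m, K) one-hot labels;  lam : lambda
    """
    m = X.shape[0]
    A = X
    for Theta in Thetas:
        A = np.hstack([np.ones((m, 1)), A])   # add the bias unit a_0 = 1
        A = sigmoid(A @ Theta.T)              # forward propagate one layer
    H = A                                     # h_Theta(x), shape (m, K)
    # Cross-entropy term, summed over examples and output units
    J = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    # Regularization: all weights except the bias column (j = 0)
    J += lam / (2 * m) * sum(np.sum(Theta[:, 1:] ** 2) for Theta in Thetas)
    return J
```

For binary classification ($K = 1$), `Y` is simply the $(m, 1)$ column of 0/1 labels.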
2. Backpropagation algorithm
Gradient computation

To minimize $J(\Theta)$ we need code to compute:
- $J(\Theta)$
- $\dfrac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta)$
What we need to do, therefore, is write code that takes as input the parameters $\Theta$ and computes $J(\Theta)$ and these partial derivative terms. Remember that the parameters of the neural network are these $\Theta_{ij}^{(l)}$, each of which is a real number, and these are the partial derivative terms we need to compute. To compute the cost function $J(\Theta)$ we just use the formula above, so most of this section focuses on how we can compute the partial derivative terms.
Given one training example $(x, y)$, forward propagation computes the activations of the network layer by layer:

$$\begin{aligned}
a^{(1)} &= x \\
z^{(2)} &= \Theta^{(1)} a^{(1)}, \quad a^{(2)} = g(z^{(2)}) \quad (\text{add } a_0^{(2)}) \\
z^{(3)} &= \Theta^{(2)} a^{(2)}, \quad a^{(3)} = g(z^{(3)}) \quad (\text{add } a_0^{(3)}) \\
z^{(4)} &= \Theta^{(3)} a^{(3)}, \quad a^{(4)} = h_\Theta(x) = g(z^{(4)})
\end{aligned}$$
So this is our vectorized implementation of forward propagation and it allows us to compute the activation values for all of the neurons in our neural network.
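A minimal sketch of this vectorized forward pass for a single example, assuming `x` is a 1-D NumPy array and `Thetas` is the list of weight matrices as above (illustrative names, not from the course):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagate(Thetas, x):
    """Return the lists of z and a values for every layer of the network."""
    a = x
    zs, activations = [], [a]
    for Theta in Thetas:
        a = np.concatenate([[1.0], a])   # prepend the bias unit a_0 = 1
        z = Theta @ a                    # z^(l+1) = Theta^(l) a^(l)
        a = sigmoid(z)                   # a^(l+1) = g(z^(l+1))
        zs.append(z)
        activations.append(a)
    return zs, activations               # activations[-1] is h_Theta(x)
```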
Gradient computation: Backpropagation algorithm

Next, in order to compute the derivatives, we're going to use an algorithm called backpropagation. The intuition behind the backpropagation algorithm is that for each node we compute a term $\delta_j^{(l)}$ that somehow represents the "error" of node $j$ in layer $l$.
Intuition: $\delta_j^{(l)}$ = "error" of node $j$ in layer $l$.

For each output unit (layer $L = 4$):

$$\delta_j^{(4)} = a_j^{(4)} - y_j$$

If you think of $\delta$, $a$, and $y$ as vectors, you can also come up with a vectorized implementation of this, which is just

$$\delta^{(4)} = a^{(4)} - y$$

where each of $\delta^{(4)}$, $a^{(4)}$, and $y$ is a vector whose dimension is equal to the number of output units in our network.
What we do next is compute the $\delta$ terms for the earlier layers in our network. The formula for computing $\delta^{(3)}$ is

$$\delta^{(3)} = (\Theta^{(3)})^T \delta^{(4)} \odot g'(z^{(3)})$$

where $\odot$ denotes the element-wise multiplication operation (the `.*` we know from MATLAB), and the derivative term works out to $g'(z^{(3)}) = a^{(3)} \odot (1 - a^{(3)})$. Similarly, $\delta^{(2)} = (\Theta^{(2)})^T \delta^{(3)} \odot g'(z^{(2)})$. There is no $\delta^{(1)}$ term, because the first layer is the input layer: those values are the observed features, so there is no error associated with them.
Backpropagation algorithm

Given a training set $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$:

- Set $\Delta_{ij}^{(l)} = 0$ for all $l, i, j$ (these will accumulate the gradient).
- For $i = 1$ to $m$:
  - Set $a^{(1)} = x^{(i)}$
  - Perform forward propagation to compute $a^{(l)}$ for $l = 2, 3, \ldots, L$
  - Using $y^{(i)}$, compute $\delta^{(L)} = a^{(L)} - y^{(i)}$
  - Compute $\delta^{(L-1)}, \delta^{(L-2)}, \ldots, \delta^{(2)}$
  - $\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)} (a^{(l)})^T$
- $D_{ij}^{(l)} := \frac{1}{m}\Delta_{ij}^{(l)} + \lambda \Theta_{ij}^{(l)}$ if $j \neq 0$
- $D_{ij}^{(l)} := \frac{1}{m}\Delta_{ij}^{(l)}$ if $j = 0$

It can then be shown that $\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)}$.
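A hedged NumPy sketch of this accumulation loop, reusing the `sigmoid` and `forward_propagate` helpers from the sketch above (the name `backprop_gradients` is illustrative, not from the course):

```python
import numpy as np
# Reuses forward_propagate from the forward-propagation sketch above.

def backprop_gradients(Thetas, X, Y, lam):
    """Accumulate the Delta terms over all m examples and return D = dJ/dTheta."""
    m = X.shape[0]
    Deltas = [np.zeros_like(Theta) for Theta in Thetas]
    for x, y in zip(X, Y):
        zs, activations = forward_propagate(Thetas, x)
        delta = activations[-1] - y                    # delta^(L) = a^(L) - y
        for l in range(len(Thetas) - 1, -1, -1):
            a_with_bias = np.concatenate([[1.0], activations[l]])
            Deltas[l] += np.outer(delta, a_with_bias)  # Delta^(l) += delta^(l+1) (a^(l))^T
            if l > 0:
                a = activations[l]
                # delta^(l) = (Theta^(l))^T delta^(l+1) .* g'(z^(l)),
                # dropping the bias component (index 0)
                delta = (Thetas[l].T @ delta)[1:] * a * (1 - a)
    # D^(l): average, then add regularization (not for the bias column j = 0)
    Ds = []
    for Theta, Delta in zip(Thetas, Deltas):
        D = Delta / m
        D[:, 1:] += lam / m * Theta[:, 1:]
        Ds.append(D)
    return Ds
```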
3. Backpropagation intuition
Forward Propagation
In order to illustrate forward propagation, I'm going to draw this network a little differently, with the nodes drawn as very fat ellipses so that I can write text inside them. When performing forward propagation, we might have some particular example, say $(x^{(i)}, y^{(i)})$, and it is this $x^{(i)}$ that we feed into the input layer.
When we forward propagate to the first hidden layer, what we do is compute $z_1^{(2)}$ and $z_2^{(2)}$, the weighted sums of the inputs from the input units, and then apply the sigmoid (logistic) activation function to those $z$ values. These are the activation values, which gives us $a_1^{(2)}$ and $a_2^{(2)}$. We then forward propagate again to get $z_1^{(3)}$, apply the activation function to get $a_1^{(3)}$, and continue similarly until we get $z_1^{(4)}$. Applying the activation function gives $a_1^{(4)}$, which is the final output value of the neural network. Looking more closely at how one of these values, $z_1^{(3)}$, is computed:

$$z_1^{(3)} = \Theta_{10}^{(2)} \times 1 + \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)}$$
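For instance, with hypothetical values $\Theta_{10}^{(2)} = 0.1$, $\Theta_{11}^{(2)} = 0.5$, $\Theta_{12}^{(2)} = -0.3$ and activations $a_1^{(2)} = 0.8$, $a_2^{(2)} = 0.4$ (numbers chosen purely for illustration, not from the lecture):

$$z_1^{(3)} = 0.1 \times 1 + 0.5 \times 0.8 + (-0.3) \times 0.4 = 0.38, \qquad a_1^{(3)} = g(0.38) \approx 0.594$$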
What is backpropagation doing?
Focusing on a single example $(x^{(i)}, y^{(i)})$, the case of one output unit ($K = 1$), and ignoring regularization ($\lambda = 0$), the cost of example $i$ can be written as

$$\text{cost}(i) = -\left[ y^{(i)} \log h_\Theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\Theta(x^{(i)})\right) \right]$$

so that $J(\Theta) = \frac{1}{m}\sum_{i=1}^{m} \text{cost}(i)$. You can think of $\text{cost}(i) \approx (h_\Theta(x^{(i)}) - y^{(i)})^2$, i.e. a measure of how well the network is doing on example $i$.
More formally, what the $\delta$ terms actually are is this: they are the partial derivatives of the cost with respect to $z_j^{(l)}$, the weighted sums of inputs that we are computing:

$$\delta_j^{(l)} = \frac{\partial}{\partial z_j^{(l)}} \text{cost}(i)$$

Concretely, the cost function is a function of the label $y$ and of $h_\Theta(x)$, the output value of the neural network. If we could go inside the neural network and change those $z_j^{(l)}$ values a little bit, that would affect the values the network outputs, and that would end up changing the cost function.
Note that we don't compute $\delta$ terms for the bias units.
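To make this interpretation concrete, one can verify numerically that $\delta_j^{(2)}$ matches the partial derivative of $\text{cost}(i)$ with respect to $z_j^{(2)}$. A minimal sketch, assuming a hypothetical tiny 3-layer network and reusing the `sigmoid` and `forward_propagate` helpers above (all names here are illustrative):

```python
import numpy as np
# Reuses sigmoid and forward_propagate from the sketches above.

def example_cost_perturbed(Thetas, x, y, j, eps):
    """Single-example cost with z_j^(2) nudged by eps (3-layer network)."""
    a1 = np.concatenate([[1.0], x])
    z2 = Thetas[0] @ a1
    z2[j] += eps                                   # perturb z_j^(2)
    a2 = np.concatenate([[1.0], sigmoid(z2)])
    h = sigmoid(Thetas[1] @ a2)
    return float(-(y * np.log(h) + (1 - y) * np.log(1 - h)).sum())

# Hypothetical tiny network: 2 inputs, 2 hidden units, 1 output
rng = np.random.default_rng(0)
Thetas = [rng.normal(size=(2, 3)), rng.normal(size=(1, 3))]
x, y = np.array([0.5, -1.0]), np.array([1.0])

# Analytic deltas: delta^(3) = a^(3) - y, then propagate back to layer 2
_, activations = forward_propagate(Thetas, x)
delta3 = activations[-1] - y
a2 = activations[1]
delta2 = (Thetas[1].T @ delta3)[1:] * a2 * (1 - a2)

eps = 1e-6
for j in range(2):
    numeric = (example_cost_perturbed(Thetas, x, y, j, eps)
               - example_cost_perturbed(Thetas, x, y, j, -eps)) / (2 * eps)
    print(j, numeric, delta2[j])    # the two values should agree closely
```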