How the backpropagation algorithm works


I’ll explain a fast algorithm for computing the gradients of the cost function, an algorithm known as backpropagation.

Warm up: a fast matrix-based approach to computing the output from a neural network

We’ll use $w^l_{jk}$ to denote the weight for the connection from the kth neuron in the (l−1)th layer to the jth neuron in the lth layer. Similarly, we use $b^l_j$ for the bias of the jth neuron in the lth layer, and $a^l_j$ for the activation of the jth neuron in the lth layer. With these notations, the activation $a^l_j$ of the jth neuron in the lth layer is related to the activations in the (l−1)th layer by the equation

$$a^l_j = \sigma\left( \sum_k w^l_{jk} a^{l-1}_k + b^l_j \right), \tag{1}$$

where the sum is over all neurons k in the (l−1)th layer.

To rewrite this expression in a matrix form we define a weight matrix $w^l$ for each layer, $l$. The entries of the weight matrix $w^l$ are just the weights connecting to the lth layer of neurons, that is, the entry in the jth row and kth column is $w^l_{jk}$. Similarly, for each layer $l$ we define a bias vector, $b^l$. You can probably guess how this works - the components of the bias vector are just the values $b^l_j$, one component for each neuron in the lth layer. And finally, we define an activation vector $a^l$ whose components are the activations $a^l_j$. With these notations in mind, Equation (1) can be rewritten in the beautiful and compact vectorized form

$$a^l = \sigma(w^l a^{l-1} + b^l). \tag{2}$$

We also define $z^l \equiv w^l a^{l-1} + b^l$; we call $z^l$ the weighted input to the neurons in layer $l$.
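To make the vectorized form concrete, here is a minimal numpy sketch of the feedforward computation in Equation (2). The layer sizes, random parameters, and the `sigmoid` helper are assumptions chosen purely for illustration:

```python
import numpy as np

def sigmoid(z):
    """The sigmoid activation, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical layer sizes: 3 inputs, 4 hidden neurons, 2 outputs.
sizes = [3, 4, 2]
rng = np.random.default_rng(0)
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((m, 1)) for m in sizes[1:]]

a = rng.standard_normal((sizes[0], 1))  # input activation a^1 as a column vector
for w, b in zip(weights, biases):
    z = np.dot(w, a) + b   # weighted input z^l = w^l a^{l-1} + b^l
    a = sigmoid(z)         # activation a^l = sigma(z^l)
print(a.shape)             # (2, 1): the output activations a^L
```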

The two assumptions we need about the cost function

The goal of backpropagation is to compute the partial derivatives ∂C/∂w and ∂C/∂b of the cost function C with respect to any weight w or bias b in the network. For backpropagation to work we need to make two main assumptions about the form of the cost function. The first assumption is that the cost function can be written as an average $C = \frac{1}{n} \sum_x C_x$ over cost functions $C_x$ for individual training examples, $x$. The second assumption is that the cost can be written as a function of the outputs from the neural network: $C = C(a^L)$.
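As a concrete example, the quadratic cost (which is also what the code later in this post assumes, via `cost_derivative`) satisfies both assumptions:

$$C = \frac{1}{2n} \sum_x \| y(x) - a^L(x) \|^2, \qquad C_x = \frac{1}{2} \| y(x) - a^L(x) \|^2,$$

and each $C_x$ depends on the network only through the output activations $a^L$.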

The Hadamard product, s⊙t

We use s⊙t to denote the elementwise product of the two vectors. Thus the components of s⊙t are just $(s \odot t)_j = s_j t_j$.
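In numpy this is just the elementwise `*` operator on arrays of the same shape, for example:

```python
import numpy as np

s = np.array([[1.0], [2.0]])
t = np.array([[3.0], [4.0]])
print(s * t)  # [[3.] [8.]] -- the components of s ⊙ t are s_j * t_j
```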

The four fundamental equations behind backpropagation

We define the error $\delta^l_j$ of neuron $j$ in layer $l$ by

$$\delta^l_j \equiv \frac{\partial C}{\partial z^l_j}. \tag{3}$$

An equation for the error in the output layer, $\delta^L$: The components of $\delta^L$ are given by

$$\delta^L_j = \frac{\partial C}{\partial a^L_j} \, \sigma'(z^L_j). \tag{BP1}$$

or, in matrix-based form,

$$\delta^L = \nabla_a C \odot \sigma'(z^L). \tag{BP1a}$$

An equation for the error $\delta^l$ in terms of the error in the next layer, $\delta^{l+1}$: In particular

$$\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l), \tag{BP2}$$

An equation for the rate of change of the cost with respect to any bias in the network: In particular:

$$\frac{\partial C}{\partial b^l_j} = \delta^l_j. \tag{BP3}$$

An equation for the rate of change of the cost with respect to any weight in the network: In particular:

$$\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \, \delta^l_j. \tag{BP4}$$
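Stated in code, the four equations map almost line-for-line onto numpy operations. The following is a minimal, self-contained sketch for a single training example, with hypothetical layer sizes, random data, and the quadratic cost assumed (so that $\nabla_a C = a^L - y$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))

rng = np.random.default_rng(1)
sizes = [3, 4, 2]                       # hypothetical layer sizes: input, hidden, output
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((m, 1)) for m in sizes[1:]]
x = rng.standard_normal((3, 1))         # a single training input
y = rng.standard_normal((2, 1))         # a stand-in desired output

# Forward pass, recording the weighted inputs z^l and activations a^l.
activation, activations, zs = x, [x], []
for w, b in zip(weights, biases):
    z = np.dot(w, activation) + b
    zs.append(z)
    activation = sigmoid(z)
    activations.append(activation)

# (BP1): error in the output layer, with nabla_a C = a^L - y for the quadratic cost.
delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
nabla_b_out = delta                                          # (BP3)
nabla_w_out = np.dot(delta, activations[-2].transpose())     # (BP4)

# (BP2): propagate the error back to the hidden layer, then apply (BP3)/(BP4) again.
delta = np.dot(weights[-1].transpose(), delta) * sigmoid_prime(zs[-2])
nabla_b_hidden = delta                                       # (BP3)
nabla_w_hidden = np.dot(delta, activations[-3].transpose())  # (BP4)
```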

The backpropagation algorithm

The backpropagation equations provide us with a way of computing the gradient of the cost function. Let’s explicitly write this out in the form of an algorithm:

  1. Input x: Set the corresponding activation $a^1$ for the input layer.
  2. Feedforward: For each $l = 2, 3, \ldots, L$ compute $z^l = w^l a^{l-1} + b^l$ and $a^l = \sigma(z^l)$.
  3. Output error $\delta^L$: Compute the vector $\delta^L = \nabla_a C \odot \sigma'(z^L)$.
  4. Backpropagate the error: For each $l = L-1, L-2, \ldots, 2$ compute $\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$.
  5. Output: The gradient of the cost function is given by $\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$ and $\frac{\partial C}{\partial b^l_j} = \delta^l_j$.

Examining the algorithm you can see why it’s called backpropagation: the error vectors $\delta^l$ are computed backward, starting from the final layer.

As I’ve described it above, the backpropagation algorithm computes the gradient of the cost function for a single training example, $C = C_x$. In practice, it’s common to combine backpropagation with a learning algorithm such as stochastic gradient descent, in which we compute the gradient for many training examples. In particular, given a mini-batch of m training examples, the following algorithm applies a gradient descent learning step based on that mini-batch:

  1. Input a set of training examples.
  2. For each training example x: Set the corresponding input activation $a^{x,1}$, and perform the following steps:
    • Feedforward: For each $l = 2, 3, \ldots, L$ compute $z^{x,l} = w^l a^{x,l-1} + b^l$ and $a^{x,l} = \sigma(z^{x,l})$.
    • Output error $\delta^{x,L}$: Compute the vector $\delta^{x,L} = \nabla_a C_x \odot \sigma'(z^{x,L})$.
    • Backpropagate the error: For each $l = L-1, L-2, \ldots, 2$ compute $\delta^{x,l} = ((w^{l+1})^T \delta^{x,l+1}) \odot \sigma'(z^{x,l})$.
  3. Gradient descent: For each $l = L, L-1, \ldots, 2$ update the weights according to the rule $w^l \rightarrow w^l - \frac{\eta}{m} \sum_x \delta^{x,l} (a^{x,l-1})^T$, and the biases according to the rule $b^l \rightarrow b^l - \frac{\eta}{m} \sum_x \delta^{x,l}$.

The code for backpropagation

```python
class Network(object):
    ...
    def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying
        gradient descent using backpropagation to a single mini batch.
        The "mini_batch" is a list of tuples "(x, y)", and "eta"
        is the learning rate."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]
```
```python
class Network(object):
    ...
    def backprop(self, x, y):
        """Return a tuple "(nabla_b, nabla_w)" representing the
        gradient for the cost function C_x.  "nabla_b" and
        "nabla_w" are layer-by-layer lists of numpy arrays, similar
        to "self.biases" and "self.weights"."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x] # list to store all the activations, layer by layer
        zs = [] # list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation)+b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # backward pass
        delta = self.cost_derivative(activations[-1], y) * \
            sigmoid_prime(zs[-1])
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        # Note that the variable l in the loop below is used a little
        # differently to the notation in Chapter 2 of the book.  Here,
        # l = 1 means the last layer of neurons, l = 2 is the
        # second-last layer, and so on.  It's a renumbering of the
        # scheme in the book, used here to take advantage of the fact
        # that Python can use negative indices in lists.
        for l in xrange(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)

    ...
    def cost_derivative(self, output_activations, y):
        """Return the vector of partial derivatives \partial C_x /
        \partial a for the output activations."""
        return (output_activations-y)

def sigmoid(z):
    """The sigmoid function."""
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))
```
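As a rough usage sketch (an assumption, not part of the excerpt above): with the full Network class from the book's network.py importable, and running under Python 2 since the listing uses xrange, a single mini-batch update might look like this:

```python
import numpy as np
from network import Network   # the book's network.py, assumed to be on the path

net = Network([784, 30, 10])              # MNIST-sized network used in the book
x = np.random.randn(784, 1)               # stand-in input column vector
y = np.zeros((10, 1)); y[3] = 1.0         # stand-in one-hot label
net.update_mini_batch([(x, y)], 3.0)      # one gradient descent step, mini-batch of size 1, eta = 3.0
```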

In what sense is backpropagation a fast algorithm?

Let’s consider another approach to computing the gradient:

$$\frac{\partial C}{\partial w_j} \approx \frac{C(w + \epsilon e_j) - C(w)}{\epsilon}, \tag{4}$$

where $\epsilon > 0$ is a small positive number, and $e_j$ is the unit vector in the jth direction. In other words, we can estimate $\partial C/\partial w_j$ by computing the cost $C$ for two slightly different values of $w_j$, and then applying Equation (4). The same idea will let us compute the partial derivatives ∂C/∂b with respect to the biases.
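To make this concrete, here is a minimal sketch of the estimate (4) applied to a tiny network with hypothetical layer sizes and random data; note that it needs one extra forward pass per weight, which is exactly the cost discussed next:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(weights, biases, x, y):
    """Quadratic cost C_x = 0.5 * ||y - a^L||^2 for a single example."""
    a = x
    for w, b in zip(weights, biases):
        a = sigmoid(np.dot(w, a) + b)
    return 0.5 * np.sum((y - a) ** 2)

rng = np.random.default_rng(2)
sizes = [3, 4, 2]                        # hypothetical layer sizes
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((m, 1)) for m in sizes[1:]]
x = rng.standard_normal((3, 1))
y = rng.standard_normal((2, 1))

eps = 1e-6
base = cost(weights, biases, x, y)       # C(w): one forward pass
numeric_grads = []                       # finite-difference estimate of dC/dw
for w in weights:
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        w[idx] += eps                    # nudge a single weight: w + eps * e_j
        grad[idx] = (cost(weights, biases, x, y) - base) / eps
        w[idx] -= eps                    # restore it
    numeric_grads.append(grad)
```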

Unfortunately, while this approach appears promising, when you implement the code it turns out to be extremely slow. To understand why, imagine we have a million weights in our network. Then for each distinct weight $w_j$ we need to compute $C(w + \epsilon e_j)$ in order to compute $\partial C/\partial w_j$. That means that to compute the gradient we need to compute the cost function a million different times, requiring a million forward passes through the network (per training example). We need to compute $C(w)$ as well, so that’s a total of a million and one passes through the network.

What’s clever about backpropagation is that it enables us to simultaneously compute all the partial derivatives $\partial C/\partial w_j$ using just one forward pass through the network, followed by one backward pass through the network. Roughly speaking, the computational cost of the backward pass is about the same as the forward pass. And so the total cost of backpropagation is roughly the same as making just two forward passes through the network.
Compare that to the million and one forward passes we needed for the approach based on (4)! And so even though backpropagation appears superficially more complex than the approach based on (4), it’s actually much, much faster.

Backpropagation: the big picture
