How the backpropagation algorithm works
I’ll explain a fast algorithm for computing the gradients of the cost function, an algorithm known as backpropagation.
Warm up: a fast matrix-based approach to computing the output from a neural network
We'll use $w^l_{jk}$ to denote the weight for the connection from the $k$th neuron in the $(l-1)$th layer to the $j$th neuron in the $l$th layer, $b^l_j$ for the bias of the $j$th neuron in the $l$th layer, and $a^l_j$ for its activation. With this notation, the activation $a^l_j$ is related to the activations in the previous layer by

$a^l_j = \sigma\left(\sum_k w^l_{jk} a^{l-1}_k + b^l_j\right)$,

where the sum is over all neurons $k$ in the $(l-1)$th layer. To rewrite this expression in a matrix form we define a weight matrix $w^l$ for each layer, whose entry in row $j$ and column $k$ is $w^l_{jk}$, together with a bias vector $b^l$ and an activation vector $a^l$. The equation above then becomes the compact vectorized form

$a^l = \sigma(w^l a^{l-1} + b^l)$,

where the sigmoid function $\sigma$ is applied elementwise.
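Concretely, here is a minimal NumPy sketch of this matrix-based feedforward pass; the layer sizes and the randomly initialized weights and biases are illustrative placeholders, not the book's code:

import numpy as np

def sigmoid(z):
    """The sigmoid function, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(a, weights, biases):
    """Compute the network's output by applying
    a^l = sigmoid(w^l a^{l-1} + b^l) one layer at a time."""
    for w, b in zip(weights, biases):
        a = sigmoid(np.dot(w, a) + b)
    return a

# Illustrative 2-3-1 network with random parameters.
np.random.seed(0)
sizes = [2, 3, 1]
weights = [np.random.randn(j, k) for k, j in zip(sizes[:-1], sizes[1:])]
biases = [np.random.randn(j, 1) for j in sizes[1:]]
print(feedforward(np.ones((2, 1)), weights, biases))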
The two assumptions we need about the cost function
The goal of backpropagation is to compute the partial derivatives ∂C/∂w and ∂C/∂b of the cost function C with respect to any weight w or bias b in the network. For backpropagation to work we need to make two main assumptions about the form of the cost function. The first assumption is that the cost function can be written as an average $C = \frac{1}{n}\sum_x C_x$ over cost functions $C_x$ for individual training examples $x$. The second assumption is that the cost $C_x$ can be written as a function of the outputs $a^L$ from the neural network.
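The quadratic cost from Chapter 1 has exactly this form. As a minimal sketch (the function names here are illustrative, not from the book's code):

import numpy as np

def cost_single(a_L, y):
    """C_x = 0.5 * ||y - a^L||^2: a function of the network output a^L alone."""
    return 0.5 * np.linalg.norm(y - a_L)**2

def cost(outputs, ys):
    """C = (1/n) * sum over x of C_x: an average of per-example costs."""
    return sum(cost_single(a_L, y) for a_L, y in zip(outputs, ys)) / len(ys)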
The Hadamard product, s⊙t
We use s⊙t to denote the elementwise product of the two vectors. Thus the components of s⊙t are just $(s \odot t)_j = s_j t_j$. For example, $[1, 2] \odot [3, 4] = [1 \cdot 3,\; 2 \cdot 4] = [3, 8]$.
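In NumPy, the Hadamard product is simply the elementwise `*` operator, which is how the code later in this chapter computes it:

import numpy as np
s = np.array([1, 2])
t = np.array([3, 4])
print(s * t)   # [3 8], the components of s ⊙ t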
The four fundamental equations behind backpropagation
First, we define the error $\delta^l_j$ of neuron $j$ in layer $l$ by $\delta^l_j \equiv \partial C / \partial z^l_j$, where $z^l_j$ is the weighted input to the neuron. Backpropagation rests on four fundamental equations relating these errors to the gradient of the cost:

An equation for the error in the output layer, $\delta^L$: componentwise, $\delta^L_j = \frac{\partial C}{\partial a^L_j}\, \sigma'(z^L_j)$, or, in matrix-based form, $\delta^L = \nabla_a C \odot \sigma'(z^L)$.

An equation for the error $\delta^l$ in terms of the error in the next layer, $\delta^{l+1}$: In particular: $\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$.

An equation for the rate of change of the cost with respect to any bias in the network: In particular: $\frac{\partial C}{\partial b^l_j} = \delta^l_j$.

An equation for the rate of change of the cost with respect to any weight in the network: In particular: $\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$.
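To make the equations concrete, here is a minimal sketch applying them at the output layer of a toy network, assuming the quadratic cost so that $\nabla_a C = a^L - y$; all names are illustrative:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))

np.random.seed(0)
a_prev = np.random.randn(3, 1)        # activations a^{L-1} from the previous layer
w = np.random.randn(2, 3)             # output-layer weights w^L
b = np.random.randn(2, 1)             # output-layer biases b^L
y = np.zeros((2, 1))                  # desired output

z = np.dot(w, a_prev) + b             # weighted input z^L
a = sigmoid(z)                        # output activation a^L
delta = (a - y) * sigmoid_prime(z)    # delta^L = grad_a C ⊙ sigma'(z^L), quadratic cost
grad_b = delta                        # dC/db^L_j = delta^L_j
grad_w = np.dot(delta, a_prev.T)      # dC/dw^L_{jk} = a^{L-1}_k * delta^L_j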
The backpropagation algorithm
The backpropagation equations provide us with a way of computing the gradient of the cost function. Let’s explicitly write this out in the form of an algorithm:
- Input $x$: Set the corresponding activation $a^1$ for the input layer.
- Feedforward: For each $l = 2, 3, \ldots, L$ compute $z^l = w^l a^{l-1} + b^l$ and $a^l = \sigma(z^l)$.
- Output error $\delta^L$: Compute the vector $\delta^L = \nabla_a C \odot \sigma'(z^L)$.
- Backpropagate the error: For each $l = L-1, L-2, \ldots, 2$ compute $\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$.
- Output: The gradient of the cost function is given by $\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$ and $\frac{\partial C}{\partial b^l_j} = \delta^l_j$.
Examining the algorithm you can see why it's called backpropagation: we compute the error vectors $\delta^l$ backward, starting from the final layer.
As I’ve described it above, the backpropagation algorithm computes the gradient of the cost function for a single training example, C=Cx. In practice, it’s common to combine backpropagation with a learning algorithm such as stochastic gradient descent, in which we compute the gradient for many training examples. In particular, given a mini-batch of m training examples, the following algorithm applies a gradient descent learning step based on that mini-batch:
- Input a set of training examples.
- For each training example $x$: Set the corresponding input activation $a^{x,1}$, and perform the following steps:
  - Feedforward: For each $l = 2, 3, \ldots, L$ compute $z^{x,l} = w^l a^{x,l-1} + b^l$ and $a^{x,l} = \sigma(z^{x,l})$.
  - Output error $\delta^{x,L}$: Compute the vector $\delta^{x,L} = \nabla_a C_x \odot \sigma'(z^{x,L})$.
  - Backpropagate the error: For each $l = L-1, L-2, \ldots, 2$ compute $\delta^{x,l} = ((w^{l+1})^T \delta^{x,l+1}) \odot \sigma'(z^{x,l})$.
- Gradient descent: For each $l = L, L-1, \ldots, 2$ update the weights according to the rule $w^l \to w^l - \frac{\eta}{m} \sum_x \delta^{x,l} (a^{x,l-1})^T$, and the biases according to the rule $b^l \to b^l - \frac{\eta}{m} \sum_x \delta^{x,l}$.
The code for backpropagation
class Network(object):
...
    def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying
        gradient descent using backpropagation to a single mini batch.
        The "mini_batch" is a list of tuples "(x, y)", and "eta"
        is the learning rate."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]
class Network(object):
...
    def backprop(self, x, y):
        """Return a tuple "(nabla_b, nabla_w)" representing the
        gradient for the cost function C_x.  "nabla_b" and
        "nabla_w" are layer-by-layer lists of numpy arrays, similar
        to "self.biases" and "self.weights"."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x] # list to store all the activations, layer by layer
        zs = [] # list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation)+b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # backward pass
        delta = self.cost_derivative(activations[-1], y) * \
            sigmoid_prime(zs[-1])
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        # Note that the variable l in the loop below is used a little
        # differently to the notation in Chapter 2 of the book.  Here,
        # l = 1 means the last layer of neurons, l = 2 is the
        # second-last layer, and so on.  It's a renumbering of the
        # scheme in the book, used here to take advantage of the fact
        # that Python can use negative indices in lists.
        for l in xrange(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)
...
    def cost_derivative(self, output_activations, y):
        """Return the vector of partial derivatives \partial C_x /
        \partial a for the output activations."""
        return (output_activations-y)

def sigmoid(z):
    """The sigmoid function."""
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))
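As a quick usage sketch: assuming the rest of the Network class from the book's network.py, whose constructor takes a list of layer sizes and randomly initializes self.weights and self.biases, a single training step on a mini-batch of one example might look like:

import numpy as np

net = Network([784, 30, 10])          # e.g. the MNIST architecture from Chapter 1
x = np.random.randn(784, 1)           # placeholder input column vector
y = np.zeros((10, 1)); y[3] = 1.0     # placeholder one-hot desired output
net.update_mini_batch([(x, y)], 3.0)  # one gradient descent step with eta = 3.0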
In what sense is backpropagation a fast algorithm?
Let's consider another approach to computing the gradient: estimate each partial derivative numerically, using the approximation

$\frac{\partial C}{\partial w_j} \approx \frac{C(w + \epsilon e_j) - C(w)}{\epsilon}, \qquad (4)$

where ϵ>0 is a small positive number, and ej is the unit vector in the jth direction. In other words, we can estimate ∂C/∂wj by computing the cost C for two slightly different values of wj, and then applying Equation (4). The same idea lets us compute the partial derivatives ∂C/∂b with respect to the biases.
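To make Equation (4) concrete, here is a minimal sketch of this numerical approach; cost and w are placeholders standing in for a full cost evaluation and the network's flattened weights:

import numpy as np

def numerical_gradient(cost, w, eps=1e-5):
    """Estimate the gradient of `cost` at `w` by applying Equation (4)
    once per coordinate: one extra cost evaluation per weight."""
    base = cost(w)                  # C(w): one evaluation shared by all coordinates
    grad = np.zeros_like(w)
    for j in range(w.size):
        w_step = w.copy()
        w_step.flat[j] += eps       # w + eps * e_j
        grad.flat[j] = (cost(w_step) - base) / eps
    return grad

# With a million weights, the loop above calls `cost` a million
# times: exactly the slowdown described below.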
Unfortunately, while this approach appears promising, when you implement the code it turns out to be extremely slow. To understand why, imagine we have a million weights in our network. Then for each distinct weight wj we need to compute C(w+ϵej) in order to compute ∂C/∂wj. That means that to compute the gradient we need to compute the cost function a million different times, requiring a million forward passes through the network (per training example). We need to compute C(w) as well, so that’s a total of a million and one passes through the network.
What’s clever about backpropagation is that it enables us to simultaneously compute all the partial derivatives ∂C/∂wj using just one forward pass through the network, followed by one backward pass through the network. Roughly speaking, the computational cost of the backward pass is about the same as the forward pass. And so the total cost of backpropagation is roughly the same as making just two forward passes through the network.
Compare that to the million and one forward passes we needed for the approach based on (4)! And so even though backpropagation appears superficially more complex than the approach based on (4), it’s actually much, much faster.
Backpropagation: the big picture