Linear Neural Networks
Multiple regression
Our car example showed how we could discover an optimal linear function for predicting one variable (fuel consumption) from one other (weight). Suppose now that we are also given one or more additional variables which could be useful as predictors. Our simple neural network model can easily be extended to this case by adding more input units (Fig. 1).
Similarly, we may want to predict more than one variable from the data that we're given. This can easily be accommodated by adding more output units (Fig. 2). The loss function for a network with multiple outputs is obtained simply by adding the loss for each output unit together. The network now has a typical layered structure: a layer of input units (and the bias), connected by a layer of weights to a layer of output units.
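As a sketch of the layered structure described above (the weights, bias values, and input below are made up for illustration), a linear network with n inputs and m outputs is just an m×n weight matrix plus one bias weight per output unit:

```python
import numpy as np

# A linear network with 3 inputs and 2 outputs: a weight matrix plus a
# bias weight per output unit.  All numbers here are illustrative.
W = np.array([[0.5, -1.0, 2.0],    # weights into output unit 1
              [1.0,  0.0, 0.5]])   # weights into output unit 2
b = np.array([0.1, -0.2])          # bias weights

def forward(x):
    """Each output unit is a weighted sum of all input units plus its bias."""
    return W @ x + b

y = forward(np.array([1.0, 2.0, 3.0]))
print(y)  # one value per output unit
```

Adding another predictor variable is just another column of W; adding another predicted variable is another row of W and another bias entry.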
(Fig. 1) (Fig. 2)

Computing the gradient
In order to train neural networks such as the ones shown above by gradient descent, we need to be able to compute the gradient G of the loss function with respect to each weight w_ij of the network. It tells us how a small change in that weight will affect the overall error E. We begin by splitting the loss function into separate terms for each point p in the training data:

E = Σ_p E^p,  with  E^p = ½ Σ_i (t_i^p − y_i^p)²

where i ranges over the output units.
First use the chain rule to decompose the gradient into two factors:

∂E^p/∂w_ij = (∂E^p/∂y_i) (∂y_i/∂w_ij)

The first factor is −(t_i − y_i); for a linear network the second factor is just y_j, the value of input unit j, giving ∂E^p/∂w_ij = −(t_i − y_i) y_j.
The Gradient Descent Algorithm

- Initialize all weights to small random values.
- REPEAT until done
  - For each weight w_ij set Δw_ij = 0
  - For each data point (x, t)^p
    - set input units to x
    - compute value of output units
    - For each weight w_ij set Δw_ij = Δw_ij + µ (t_i − y_i) y_j
  - For each weight w_ij set w_ij = w_ij + Δw_ij
The algorithm terminates once we are at, or sufficiently near to, the minimum of the error function, where G = 0. We say then that the algorithm has converged.
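A minimal sketch of batch gradient descent for a linear network, following the pseudocode above (the data, learning rate, and iteration count are made up; the accumulated update is averaged over the data points so the step size does not depend on the training-set size):

```python
import numpy as np

# Batch gradient descent for a linear network: accumulate the weight
# changes over the whole training set, then update once per pass.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))          # 100 data points, 2 input units
W_true = np.array([[2.0, -1.0]])       # mapping used to generate targets
T = X @ W_true.T                       # targets t, one output unit

W = np.zeros((1, 2))                   # weights (zero init for reproducibility)
mu = 0.05                              # learning rate

for epoch in range(200):
    delta = np.zeros_like(W)           # for each weight: set delta = 0
    for x, t in zip(X, T):
        y = W @ x                      # compute value of output units
        delta += np.outer(t - y, x)    # accumulate (t_i - y_i) * y_j
    W += mu * delta / len(X)           # update once per pass (averaged)
```

After training, W should be close to W_true; in practice one would also test |delta| against a tolerance to decide when the algorithm has converged.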
In summary:

|                               | general case          | linear network          |
|-------------------------------|-----------------------|-------------------------|
| Training data                 | (x, t)                | (x, t)                  |
| Model parameters              | w                     | w                       |
| Model                         | y = g(w, x)           | y = w x                 |
| Error function                | E(y, t)               | ½ Σ_i (t_i − y_i)²      |
| Gradient with respect to w_ij | ∂E/∂w_ij              | − (t_i − y_i) y_j       |
| Weight update rule            | Δw_ij = −µ ∂E/∂w_ij   | Δw_ij = µ (t_i − y_i) y_j |
The Learning Rate
An important consideration is the learning rate µ, which determines by how much we change the weights w at each step. If µ is too small, the algorithm will take a long time to converge (Fig. 3).
(Fig. 3)

Conversely, if µ is too large, we may end up bouncing around the error surface out of control - the algorithm diverges (Fig. 4). This usually ends with an overflow error in the computer's floating-point arithmetic.

(Fig. 4)

Batch vs. Online Learning
Above we have accumulated the gradient contributions for all data points in the training set before updating the weights. This method is often referred to as batch learning. An alternative approach is online learning, where the weights are updated immediately after seeing each data point. Since the gradient for a single data point can be considered a noisy approximation to the overall gradient G (Fig. 5), this is also called stochastic (noisy) gradient descent.
(Fig. 5) Online learning has a number of advantages:
- it is often much faster, especially when the training set is redundant (contains many similar data points),
- it can be used when there is no fixed training set (new data keeps coming in),
- it is better at tracking nonstationary environments (where the best model gradually changes over time),
- the noise in the gradient can help to escape from local minima (which are a problem for gradient descent in nonlinear models).
These advantages are, however, bought at a price: many powerful optimization techniques (such as conjugate and second-order gradient methods, support vector machines, Bayesian methods, etc.) - which we will not talk about in this course! - are batch methods that cannot be used online. (Of course this also means that in order to implement batch learning really well, one has to learn an awful lot about these rather complicated methods!)
A compromise between batch and online learning is the use of "mini-batches": the weights are updated after every n data points, where n is greater than 1 but smaller than the training set size.
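The three regimes differ only in how many points are seen between weight updates. As a sketch (data and hyperparameters are invented), a single training routine can cover online learning, mini-batches, and full batch by varying the batch size:

```python
import numpy as np

# Online (batch_size=1) vs. mini-batch updates for a linear model y = w.x.
# Data, learning rate, and epoch count are illustrative.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
w_true = np.array([1.5, -0.5])
t = X @ w_true                          # noiseless targets

def train(batch_size, mu=0.05, epochs=50):
    w = np.zeros(2)
    for _ in range(epochs):
        for i in range(0, len(X), batch_size):
            xb, tb = X[i:i + batch_size], t[i:i + batch_size]
            y = xb @ w
            # average gradient over the batch; batch_size=1 reduces to
            # stochastic (online) gradient descent
            w = w + mu * (tb - y) @ xb / len(xb)
    return w

w_online = train(batch_size=1)          # update after every data point
w_mini = train(batch_size=20)           # compromise: update every 20 points
```

Setting `batch_size=len(X)` would recover the batch rule from the previous section.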
In order to keep things simple, we will focus very much on online learning, where plain gradient descent is among the best available techniques. Online learning is also highly suitable for implementing things such as reactive control strategies in adaptive agents, and should thus fit in well with the rest of your course.
Optimal Weight and Learning Rates for Linear Networks
Regression Revisited
Suppose we are given a set of data (x(1), y(1)), (x(2), y(2)), ..., (x(p), y(p)):
If we assume that g is linear, then finding the best line that fits the data (linear regression) can be done algebraically:
The solution is based on minimizing the squared error (Cost) between the network output and the data:

E = ½ Σ_p (y(p) − w x(p))²

where the network output is y = w x.
Finding the best set of weights
1-input, 1 output, 1 weight
But the derivative of E is zero at the minimum, so we can solve for w_opt:

dE/dw = −Σ_p x(p) (y(p) − w x(p)) = 0   ⟹   w_opt = Σ_p x(p) y(p) / Σ_p x(p)²
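The 1-weight case can be checked numerically in a few lines (the toy data below are made up):

```python
import numpy as np

# 1 input, 1 output, 1 weight: setting dE/dw = 0 for
# E = 0.5 * sum((y - w*x)**2) gives w_opt = sum(x*y) / sum(x*x).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x                          # data generated with w = 2 exactly
w_opt = np.sum(x * y) / np.sum(x * x)
print(w_opt)  # recovers 2.0

# sanity check: the derivative of E really is zero at w_opt
dE_dw = -np.sum(x * (y - w_opt * x))
```

Because the toy data lie exactly on a line, the recovered weight matches the generating one and the gradient at w_opt is zero.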
n-inputs, m outputs: nm weights
The same analysis can be done in the multi-dimensional case, except that now everything becomes matrices:

w_opt = Φ H⁻¹,  with H = ⟨x xᵀ⟩ and Φ = ⟨y xᵀ⟩

where w_opt is an m×n matrix, H is an n×n matrix, and Φ is an m×n matrix.
Matrix inversion is an expensive operation. Also, if the input dimension, n, is very large, then H is huge and may not even be possible to compute. If we are not able to compute the inverse Hessian, or if we don't want to spend the time, then we can use gradient descent.
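A sketch of the matrix solution, assuming Φ denotes the input–output correlation matrix ⟨y xᵀ⟩ and H the input correlation ⟨x xᵀ⟩ (the data below are generated for illustration):

```python
import numpy as np

# Closed-form linear regression in the multi-dimensional case:
# H = <x x^T> (n x n input correlation), Phi = <y x^T> (m x n),
# and the optimal weights are w_opt = Phi @ inv(H).
rng = np.random.default_rng(3)
n, m, p = 3, 2, 500                  # inputs, outputs, data points
X = rng.normal(size=(p, n))
W_true = rng.normal(size=(m, n))
Y = X @ W_true.T                     # noiseless targets

H = (X.T @ X) / p                    # n x n correlation matrix
Phi = (Y.T @ X) / p                  # m x n cross-correlation
W_opt = Phi @ np.linalg.inv(H)       # m x n weight matrix

print(np.allclose(W_opt, W_true))    # True for noiseless data
```

In practice `np.linalg.solve` (or least squares) is preferred over forming the explicit inverse, which is exactly the cost concern raised above.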
Gradient Descent: Picking the Best Learning Rate
For linear networks, E is quadratic, so we can write

E(w) = c + bᵀ w + ½ wᵀ H w

so that we have

E(w) = E(w0) + (w − w0)ᵀ E′(w0) + ½ (w − w0)ᵀ H (w − w0)

But this is just a Taylor series expansion of E(w) about w0. Now, suppose we want to determine the optimal weight, w_opt. We can differentiate E(w) and evaluate the result at w_opt, noting that E′(w_opt) is zero:

0 = E′(w_opt) = E′(w0) + H (w_opt − w0)
Solving for w_opt we obtain:

w_opt = w0 − H⁻¹ E′(w0)
Comparing this to the update equation, we find that the learning "rate" that takes us directly to the minimum is equal to the inverse Hessian, which is a matrix and not a scalar. Why do we need a matrix?
2-D Example
Curvature axes aligned with the coordinate axes:

Δw1 = −µ1 ∂E/∂w1,  Δw2 = −µ2 ∂E/∂w2

or in matrix form:

Δw = −M ∇E,  with M = diag(µ1, µ2)

µ1 and µ2 are inversely related to the size of the curvature along each axis. Using the above learning rate matrix has the effect of scaling the gradient differently along each axis, to make the surface "look" spherical.
If the axes are not aligned with the coordinate axes, then we need a full matrix of learning rates. This matrix is just the inverse Hessian. In general, H⁻¹ is not diagonal. We can obtain the curvature along each axis, however, by computing the eigenvalues of H. Anyone remember what eigenvalues are??
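A small numerical sketch of this idea (the quadratic surface below is invented): with the inverse Hessian as a matrix learning rate, a single update lands exactly on the minimum, even when the curvature axes are not aligned with the coordinate axes.

```python
import numpy as np

# On a quadratic surface E(w) = 0.5 (w - w_opt)^T H (w - w_opt), one
# update with the matrix "learning rate" H^-1 jumps straight to w_opt.
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])           # Hessian; off-diagonals mean the
                                     # curvature axes are rotated
w_opt = np.array([1.0, -1.0])

def grad(w):
    # gradient of E at w
    return H @ (w - w_opt)

curvatures = np.linalg.eigvalsh(H)   # eigenvalues: curvature along each
                                     # principal axis of the surface
w = np.array([5.0, 5.0])             # arbitrary starting point
w = w - np.linalg.inv(H) @ grad(w)   # single step with matrix rate H^-1
print(w)                             # lands on w_opt
```

With a scalar learning rate, the same surface would require many steps, limited by the largest eigenvalue of H.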
Taking a Step Back
We have been spending a lot of time on some pretty tough math. Why? Because training a network can take a long time if you just blindly apply the basic algorithms. There are techniques that can improve the rate of convergence by orders of magnitude. However, understanding these techniques requires a deep understanding of the underlying characteristics of the problem (i.e. the mathematics). Knowing which speed-up techniques to apply can make the difference between a net that takes 100 iterations to train and one that takes 10,000 iterations to train (assuming it trains at all).
The previous slides are trying to make the following point for linear networks (i.e. those networks whose cost function is a quadratic function of the weights):
- The shape of the cost surface has a significant effect on how fast a net can learn. Ideally, we want a spherically symmetric surface.
- The correlation matrix is defined as the average over all inputs of x xᵀ.
- The Hessian is the second derivative of E with respect to w. For linear nets, the Hessian is the same as the correlation matrix.
- The Hessian tells you about the shape of the cost surface:
- The eigenvalues of H are a measure of the steepness of the surface along the curvature directions.
- a large eigenvalue => steep curvature => need small learning rate
- the learning rate should be proportional to 1/eigenvalue
- if we are forced to use a single learning rate for all weights, then we must use a learning rate that will not cause divergence along the steep directions (large eigenvalue directions). Thus, we must choose a learning rate µ that is on the order of 1/λmax, where λmax is the largest eigenvalue.
- If we can use a matrix of learning rates, this matrix is proportional to H⁻¹.
- For real problems (i.e. nonlinear), you don't know the eigenvalues, so you just have to guess. Of course, there are algorithms that will estimate λmax... We won't be considering these here.
- An alternative solution to speeding up learning is to transform the inputs (that is, x -> Px, for some transformation matrix P) so that the resulting correlation matrix, (Px)(Px)T, is equal to the identity.
- The above suggestions are only really true for linear networks. However, the cost surface of nonlinear networks can be modeled as a quadratic in the vicinity of the current weight. We can then apply similar techniques, although they will only be approximations.
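The input-transformation suggestion above can be sketched numerically. Assuming the standard whitening construction from the eigendecomposition of the correlation matrix (the data below are made up):

```python
import numpy as np

# Whitening: transform x -> P x so the input correlation becomes the
# identity, making the linear cost surface spherical.  P is built from
# the eigendecomposition of the correlation matrix C = E D E^T, with
# P = D^(-1/2) E^T.
rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 2)) @ np.array([[2.0, 0.5],
                                           [0.0, 1.0]])  # correlated inputs

C = (X.T @ X) / len(X)                # input correlation matrix <x x^T>
evals, evecs = np.linalg.eigh(C)
P = np.diag(evals ** -0.5) @ evecs.T  # whitening transform

Xw = X @ P.T                          # apply x -> P x to every data point
Cw = (Xw.T @ Xw) / len(Xw)            # correlation of transformed inputs
print(np.allclose(Cw, np.eye(2)))     # True: correlation is now identity
```

After this transformation all eigenvalues of the (linear-network) Hessian equal 1, so a single scalar learning rate works equally well in every direction.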
from: http://www.willamette.edu/~gorr/classes/cs449/LearningRates/LearningRates.html