Technical Details in Neural Networks


http://statsoft.com/textbook/stathome.html

 

Pre- and Post-processing:

Scaling. Numeric values have to be scaled into a range that is appropriate for the network. Typically, raw variable values are scaled linearly. In some circumstances, non-linear scaling may be appropriate.
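As a minimal sketch of the linear (min-max) scaling described above, in Python; the function name and target range are illustrative, not from the original text:

    import numpy as np

    def linear_scale(x, lo=0.0, hi=1.0):
        """Linearly rescale a numeric array into the range [lo, hi]."""
        x = np.asarray(x, dtype=float)
        xmin, xmax = x.min(), x.max()
        span = xmax - xmin
        if span == 0.0:                      # constant input: map to lower bound
            return np.full_like(x, lo)
        return lo + (x - xmin) * (hi - lo) / span

    raw = np.array([12.0, 45.0, 7.0, 33.0])
    print(linear_scale(raw))                 # raw values mapped linearly into [0, 1]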

Nominal variables. Nominal variables may be two-state (e.g., Gender={Male,Female}) or many-state (i.e., more than two states). A two-state nominal variable is easily represented by transformation into a numeric value (e.g., Male=0, Female=1).
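As a sketch, the two-state transformation in Python, together with the usual one-of-N (one-hot) convention for many-state variables (an assumption here; the text above covers only the two-state case):

    def encode_two_state(value, states=("Male", "Female")):
        """Two-state nominal variable -> a single 0/1 numeric input."""
        return float(states.index(value))

    def encode_one_of_n(value, states):
        """Many-state nominal variable -> one-of-N (one-hot) vector."""
        vec = [0.0] * len(states)
        vec[states.index(value)] = 1.0
        return vec

    print(encode_two_state("Female"))                          # 1.0
    print(encode_one_of_n("Green", ["Red", "Green", "Blue"]))  # [0.0, 1.0, 0.0]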

 

Multilayer Perceptrons:

Prediction problems may be divided into two main categories:

Classification. In classification, the objective is to determine to which of a number of discrete classes a given input case belongs. Examples include credit assignment (is this person a good or bad credit risk?), cancer detection (tumor, clear), and signature recognition (forgery, true). In all these cases, the output required is clearly a single nominal variable. The most common classification tasks are (as above) two-state, although many-state tasks are not unknown.

Regression. In regression, the objective is to predict the value of a (usually) continuous variable: tomorrow's stock price, the fuel consumption of a car, next year's profits. In this case, the output required is a single numeric variable.


Multilayer Perceptrons Training Algorithms:

  • Back propagation.
  • Levenberg-Marquardt.
  • Conjugate gradient descent.
  • Quasi-Newton.
  • Delta-bar-Delta.

Back propagation. Back propagation is the best known training algorithm for neural networks and still one of the most useful. It was devised independently by Rumelhart et al. (1986), Werbos (1974), and Parker (1985). The on-line version of back propagation calculates the local gradient of each weight with respect to each case during training; weights are updated once per training case.

The update formula is:

$$\Delta w_{ij}(t) = \eta \, \delta_j \, o_i + \alpha \, \Delta w_{ij}(t-1)$$

where

η - the learning rate,

δ - the local error gradient,

α - the momentum coefficient,

o_i - the output of the i'th unit.

Thresholds are treated as weights with o_i = -1.

The local error gradient calculation depends on whether the unit into which the weights feed is in the output layer or the hidden layers.

Local gradients in output layers are the product of the derivatives of the network's error function and the units' activation functions.

Local gradients in hidden layers are the product of the derivative of the unit's activation function and the weighted sum of the unit's outgoing weights with the local gradients of the units to which those weights connect.
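A minimal sketch of one on-line step, for a one-hidden-layer network with logistic units and sum-squared error; the weight shapes, learning rate, and momentum values are illustrative, and threshold inputs are omitted for brevity:

    import numpy as np

    rng = np.random.default_rng(0)
    eta, alpha = 0.1, 0.9              # learning rate and momentum coefficient

    def logistic(a):
        return 1.0 / (1.0 + np.exp(-a))

    x = np.array([0.2, 0.7])           # one training case (already scaled)
    t = np.array([1.0])                # its target value

    W1 = rng.normal(0.0, 0.5, (3, 2))  # weights into 3 hidden units
    W2 = rng.normal(0.0, 0.5, (1, 3))  # weights into 1 output unit
    dW1 = np.zeros_like(W1)            # previous updates, for the momentum term
    dW2 = np.zeros_like(W2)

    # Forward pass.
    h = logistic(W1 @ x)
    y = logistic(W2 @ h)

    # Output layer: local gradient = error derivative * activation derivative.
    delta_out = (t - y) * y * (1.0 - y)
    # Hidden layer: local gradient = activation derivative * weighted sum of
    # outgoing weights and the local gradients of the units they feed.
    delta_hid = (W2.T @ delta_out) * h * (1.0 - h)

    # Update each weight: eta * delta_j * o_i plus the momentum term.
    dW2 = eta * np.outer(delta_out, h) + alpha * dW2
    dW1 = eta * np.outer(delta_hid, x) + alpha * dW1
    W2 += dW2
    W1 += dW1
    print("error:", float(np.sum((t - logistic(W2 @ logistic(W1 @ x))) ** 2)))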

Levenberg-Marquardt. Levenberg-Marquardt (Levenberg, 1944; Marquardt, 1963; Bishop, 1995; Shepherd, 1997; Press et al., 1992) is an advanced non-linear optimization algorithm. It can be used to train the weights in a network just as back propagation would be. It is reputedly the fastest algorithm available for such training. However, its use is restricted as follows:

Single output networks. Levenberg-Marquardt can only be used on networks with a single output unit.

Small networks. Levenberg-Marquardt has space requirements proportional to the square of the number of weights in the network. This effectively precludes its use in networks of any great size (more than a few hundred weights).

Sum-squared error function. Levenberg-Marquardt is only defined for the sum squared error function. If you select a different error function for your network, it will be ignored during Levenberg-Marquardt training. It is usually therefore only appropriate for regression networks.

The Levenberg-Marquardt algorithm is designed specifically to minimize the sum-of-squares error function, using a formula that (partly) assumes that the underlying function modeled by the network is linear. Close to a minimum this assumption is approximately true, and the algorithm can make very rapid progress. Further away it may be a very poor assumption. Levenberg-Marquardt therefore compromises between the linear model and a gradient-descent approach. A move is only accepted if it improves the error, and if necessary the gradient-descent model is used with a sufficiently small step to guarantee downhill movement.

Levenberg-Marquardt uses the update formula:

$$\Delta w = -(Z^T Z + \lambda I)^{-1} Z^T \varepsilon$$

where ε is the vector of case errors, and Z is the matrix of partial derivatives of these errors with respect to the weights:

$$Z_{ki} = \frac{\partial \varepsilon_k}{\partial w_i}$$

The first term in the Levenberg-Marquardt formula represents the linearized assumption; the second a gradient-descent step. The control parameter λ governs the relative influence of these two approaches. Each time Levenberg-Marquardt succeeds in lowering the error, it decreases the control parameter by a factor of 10, thus strengthening the linear assumption and attempting to jump directly to the minimum. Each time it fails to lower the error, it increases the control parameter by a factor of 10, giving more influence to the gradient-descent step and also making the step size smaller. This is guaranteed to make downhill progress at some point.
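A minimal sketch of this scheme on a toy least-squares problem; the update formula and the factor-of-10 schedule follow the description above, while the model, names (eps for the case errors, Z for their Jacobian, lam for the control parameter), and constants are illustrative:

    import numpy as np

    def lm_step(w, eps, Z, lam):
        # Delta w = -(Z^T Z + lam I)^{-1} Z^T eps
        return w - np.linalg.solve(Z.T @ Z + lam * np.eye(len(w)), Z.T @ eps)

    def lm_train(w, errors, jacobian, lam=1e-3, epochs=50):
        for _ in range(epochs):
            sse = float(errors(w) @ errors(w))
            w_new = lm_step(w, errors(w), jacobian(w), lam)
            if float(errors(w_new) @ errors(w_new)) < sse:
                w, lam = w_new, lam / 10.0   # success: trust the linear model more
            else:
                lam *= 10.0                  # failure: fall back toward gradient descent
        return w

    # Toy usage: fit y = w0 * exp(w1 * x) by minimizing the sum-squared error.
    x = np.linspace(0.0, 1.0, 20)
    y = 2.0 * np.exp(-1.5 * x)
    errors = lambda w: y - w[0] * np.exp(w[1] * x)           # case errors eps
    jacobian = lambda w: np.column_stack([                   # Z = d(eps)/dw
        -np.exp(w[1] * x),
        -w[0] * x * np.exp(w[1] * x)])
    print(lm_train(np.array([1.0, 0.0]), errors, jacobian))  # approaches [2.0, -1.5]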

Conjugate gradient descent. Conjugate gradient descent (Bishop, 1995; Shepherd, 1997) is an advanced method of training multilayer perceptrons. It usually performs significantly better than back propagation, and can be used wherever back propagation can be. It is the recommended technique for any network with a large number of weights (more than a few hundred) and/or multiple output units. For smaller networks, either Quasi-Newton or Levenberg-Marquardt may be better, the latter being preferred for low-residual regression problems.

Conjugate gradient descent is a batch update algorithm: whereas back propagation adjusts the network weights after each case, conjugate gradient descent calculates the error gradient as the sum of the error gradients on each training case, and updates the weights once at the end of each epoch.

The initial search direction is given by the direction of steepest descent:

$$d_0 = -g_0$$

where g_t denotes the error gradient on epoch t. Subsequently, the search direction is updated using the Polak-Ribiere formula:

$$\beta_t = \frac{g_{t+1}^T (g_{t+1} - g_t)}{g_t^T g_t}, \qquad d_{t+1} = -g_{t+1} + \beta_t d_t$$

If the search direction is not downhill, the algorithm restarts using the line of steepest descent. It restarts anyway after W directions (where W is the number of weights), as at that point the conjugacy has been exhausted.

Line searches are conducted using Brent's iterative line search procedure, which uses parabolic interpolation to locate the line minimum very quickly.
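A minimal sketch of the whole scheme on a quadratic error surface; a crude backtracking search stands in for Brent's parabolic line search, and all names are illustrative:

    import numpy as np

    def line_minimize(f, w, d, step=1.0, shrink=0.5, tries=30):
        """Crude backtracking stand-in for Brent's line search."""
        f0 = f(w)
        for _ in range(tries):
            if f(w + step * d) < f0:
                return step
            step *= shrink
        return 0.0

    def conjugate_gradient(f, grad, w, epochs=100):
        g = grad(w)
        d = -g                                # initial direction: steepest descent
        for epoch in range(1, epochs + 1):
            if g @ d >= 0:                    # not downhill: restart
                d = -g
            w = w + line_minimize(f, w, d) * d
            g_new = grad(w)
            if g_new @ g_new < 1e-20:         # gradient vanished: converged
                break
            # Polak-Ribiere: beta = g_new . (g_new - g) / (g . g)
            beta = (g_new @ (g_new - g)) / (g @ g)
            d = -g_new + beta * d
            if epoch % len(w) == 0:           # restart after W directions
                d = -g_new
            g = g_new
        return w

    # Toy usage: minimize a quadratic error surface with minimum at the origin.
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    f = lambda w: 0.5 * w @ A @ w
    grad = lambda w: A @ w
    print(conjugate_gradient(f, grad, np.array([4.0, -3.0])))  # approaches [0, 0]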

Quasi-Newton. Quasi-Newton (Bishop, 1995; Shepherd, 1997) is a batch update algorithm.

It maintains an approximation to the inverse Hessian matrix, called H below. The error gradient is called g below (so that -g is the direction of steepest descent), and the weight vector on the i'th epoch is referred to as w_i. H is initialized to the identity matrix, so that the first step is in the direction -g (i.e., the same direction as that chosen by back propagation). On each epoch, a back-tracking line search is performed in the direction:

$$d = -Hg$$

Subsequently, the inverse Hessian approximation is updated using the BFGS (Broyden-Fletcher-Goldfarb-Shanno) formula:

$$H' = H + \left(1 + \frac{y^T H y}{s^T y}\right) \frac{s s^T}{s^T y} - \frac{s y^T H + H y s^T}{s^T y}$$

where s = w_{i+1} - w_i is the change in the weight vector and y = g_{i+1} - g_i is the change in the gradient.

This is "guaranteed" to maintain a positive-definite approximation (i.e. it will always indicate a descent direction), and to converge to the true inverse Hessian in W steps, where W is the number of weights, on a quadratic error surface.  In practice, numerical errors may violate these theoretical guarantees and lead to divergence of weights or other modes of failure. In this case, run the algorithm again, or choose a different training algorithm.

Delta-bar-Delta. Delta-bar-Delta is inspired by the observation that the error surface may have a different gradient along each weight direction, and that consequently each weight should have its own learning rate (i.e. step size).

In Delta-bar-Delta, the individual learning rates for each weight are altered on each epoch to satisfy two important heuristics:

  • If the derivative has the same sign for several iterations, the learning rate is increased (the error surface has a low curvature, and so is likely to continue sloping the same way for some distance);

  • If the sign of the derivative alternates for several iterations, the learning rate is rapidly decreased (otherwise the algorithm may oscillate across points of high curvature).

Weights are updated using the same formula as in back propagation, except that momentum is not used, and each weight has its own time-dependent learning rate.

All learning rates are initially set to the same starting value; subsequently, they are adapted on each epoch using the formulae below.

The bar-Delta value is calculated as:

$$\bar\delta(t) = (1 - \theta) \, \delta(t) + \theta \, \bar\delta(t-1)$$

where

δ(t) - the derivative of the error surface with respect to the weight,

θ - the smoothing constant.

The learning rate of each weight is updated using:

$$\eta(t+1) = \begin{cases} \eta(t) + \kappa & \text{if } \bar\delta(t-1)\,\delta(t) > 0 \\ \eta(t)\,\phi & \text{if } \bar\delta(t-1)\,\delta(t) < 0 \\ \eta(t) & \text{otherwise} \end{cases}$$

where

κ - the linear increment factor,

φ - the exponential decay factor.
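A minimal sketch of this adaptation for a vector of weights; the constants and the quadratic toy problem are illustrative:

    import numpy as np

    def delta_bar_delta_step(w, d, dbar, eta, kappa=0.01, phi=0.7, theta=0.7):
        """One epoch: adapt each weight's learning rate, then step downhill."""
        eta = np.where(dbar * d > 0, eta + kappa, eta)   # same sign: linear increase
        eta = np.where(dbar * d < 0, eta * phi, eta)     # sign flip: exponential decay
        dbar = (1.0 - theta) * d + theta * dbar          # update the bar-Delta trace
        w = w - eta * d                                  # gradient step, no momentum
        return w, dbar, eta

    # Toy usage: minimize the quadratic error surface from the earlier sketches.
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    w = np.array([4.0, -3.0])
    dbar, eta = np.zeros(2), np.full(2, 0.05)            # shared starting rate
    for _ in range(200):
        w, dbar, eta = delta_bar_delta_step(w, A @ w, dbar, eta)
    print(w, eta)   # w approaches [0, 0]; the two rates have adapted separately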

 

Error functions:

The error function is used in training the network and in reporting the error. The error function used can have a profound effect on the performance of training algorithms (Bishop, 1995).

The following four error functions are available.

Sum-squared. The error is the sum of the squared differences between the target and actual output values on each output unit. This is the standard error function used in regression problems. It can also be used for classification problems, giving robust performance in estimating discriminant functions, although arguably entropy functions are more appropriate for classification, as they correspond to maximum likelihood decision making (on the assumption that the generating distribution is drawn from the exponential family), and allow outputs to be interpreted as probabilities.

City-block. The error is the sum of the differences between the target and actual output values on each output unit; differences are always taken to be positive. The city-block error function is less sensitive to outlying points than the sum-squared error function (where a disproportionate amount of the error can be accounted for by the worst-behaved cases). Consequently, networks trained with this metric may perform better on regression problems if there are a few wide-flung outliers (either because the data naturally has such a structure, or because some cases may be mislabeled).

Cross-entropy (single & multiple). This error is the negative sum, over the output units, of the products of the target value and the logarithm of the actual output value. There are two versions: one for single-output (two-class) networks, the other for multiple-output networks. The cross-entropy error function is specially designed for classification problems, where it is used in combination with the logistic (single output) or softmax (multiple output) activation functions in the output layer of the network. This is equivalent to maximum likelihood estimation of the network weights. An MLP with no hidden layers, a single output unit, and the cross-entropy error function is equivalent to a standard logistic regression model (logit classification).

Kohonen. The Kohonen error assumes that the second layer of the network consists of radial units representing cluster centers. The error is the distance from the input case to the nearest of these. The Kohonen error function is intended for use with Kohonen networks and Cluster networks only.
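A minimal sketch of these four error functions for a single case; t is the target vector, y the actual output vector, and the one-hot example values are illustrative:

    import numpy as np

    def sum_squared(t, y):
        return np.sum((t - y) ** 2)

    def city_block(t, y):
        return np.sum(np.abs(t - y))       # absolute differences

    def cross_entropy_multiple(t, y):
        return -np.sum(t * np.log(y))      # softmax / multiple-output form

    def cross_entropy_single(t, y):
        # two-class form for a single logistic output
        return -(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

    def kohonen_error(x, centers):
        # distance from the input case to the nearest cluster center
        return np.min(np.linalg.norm(centers - x, axis=1))

    t = np.array([1.0, 0.0, 0.0])          # one-hot target
    y = np.array([0.7, 0.2, 0.1])          # network outputs
    print(sum_squared(t, y), city_block(t, y), cross_entropy_multiple(t, y))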
