Deep Learning: Deep Feedforward Networks (Part 1)

  1. Deep feedforward networks, also called feedforward neural networks or multilayer perceptrons (MLPs), aim to approximate some function f*. For example, for a classifier, y = f*(x) maps an input x to a category y. A feedforward network defines a mapping y = f(x; θ) and learns the value of the parameters θ that result in the best function approximation.
  2. The model is associated with a directed acyclic graph describing how the functions are composed together. For example, we might have three functions f^(1), f^(2), and f^(3) connected in a chain, to form f(x) = f^(3)(f^(2)(f^(1)(x))). The overall length of the chain gives the depth of the model.
  3. It is best to think of feedforward networks as function approximation machines that are designed to achieve statistical generalization, occasionally drawing some insights from what we know about the brain, rather than as models of brain function.
  4. To extend linear models to represent nonlinear functions of x, we can apply the linear model not to x itself but to a transformed input φ(x), where φ is a nonlinear transformation. The question is then how to choose the mapping φ:
    (1) One option is to use a very generic φ, such as the infinite-dimensional φ that is implicitly used by kernel machines based on the RBF kernel. If φ(x) is of high enough dimension, we can always have enough capacity to fit the training set, but generalization to the test set often remains poor. Very generic feature mappings are usually based only on the principle of local smoothness and do not encode enough prior information to solve advanced problems.
    (2) Another option is to manually engineer φ. Until the advent of deep learning, this was the dominant approach. This approach requires decades of human effort for each separate task, with practitioners specializing in different domains such as speech recognition or computer vision, and with little transfer between domains.
    (3) The strategy of deep learning is to learn φ. In this approach, we have a model y = f(x; θ, w) = φ(x; θ)^T w. We now have parameters θ that we use to learn φ from a broad class of functions, and parameters w that map from φ(x) to the desired output. This approach is the only one of the three that gives up on the convexity of the training problem, but the benefits outweigh the harms.
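The strategy in (3) can be sketched with a one-hidden-layer network: a ReLU layer plays the role of the learned feature map φ(x; θ), and a linear readout w maps φ(x) to the output. Everything below (the target function, layer sizes, learning rate, iteration count) is an illustrative choice, not something prescribed by the text.

```python
import numpy as np

# Sketch of y = f(x; theta, w) = phi(x; theta)^T w, with phi a learned
# one-hidden-layer ReLU feature map, trained by full-batch gradient
# descent on mean squared error. All hyperparameters are arbitrary.

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X[:, 0])                       # a nonlinear target

n_hidden = 32
theta_W = rng.normal(0, 0.5, size=(1, n_hidden))  # parameters theta of phi
theta_b = np.zeros(n_hidden)
w = rng.normal(0, 0.5, size=n_hidden)             # linear readout w

lr = 0.3
for _ in range(3000):
    h = np.maximum(0.0, X @ theta_W + theta_b)    # phi(x; theta)
    err = h @ w - y                               # phi(x)^T w - y
    # gradients of the mean squared error
    g_w = h.T @ err / len(X)
    g_h = np.outer(err, w) * (h > 0)
    theta_W -= lr * (X.T @ g_h / len(X))
    theta_b -= lr * g_h.mean(axis=0)
    w -= lr * g_w

mse = np.mean((np.maximum(0.0, X @ theta_W + theta_b) @ w - y) ** 2)
```

Because φ is learned jointly with w, the training objective is non-convex, which is exactly the trade-off the paragraph above describes.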

Gradient-Based Learning

  • The largest difference between the linear models we have seen so far and neural networks is that the nonlinearity of a neural network causes most interesting loss functions to become non-convex. This means that neural networks are usually trained by using iterative, gradient-based optimizers that merely drive the cost function to a very low value, rather than the linear equation solvers used to train linear regression models or the convex optimization algorithms with global convergence guarantees used to train logistic regression or SVMs.
  • Stochastic gradient descent applied to non-convex loss functions has no such convergence guarantee, and is sensitive to the values of the initial parameters.
  • For feedforward neural networks, it is important to initialize all weights to small random values.
  • The biases may be initialized to zero or to small positive values.
  • For the moment, it suffices to understand that the training algorithm is almost always based on using the gradient to descend the cost function in one way or another.
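The initialization advice above can be sketched as follows: small random weights (to break symmetry between hidden units) and zero or small positive biases. The scale 0.01 and the layer sizes are illustrative choices, not values from the text.

```python
import numpy as np

# Hedged sketch of the recommended initialization: small random weights,
# biases at zero or a small positive value. Sizes and scales are examples.

rng = np.random.default_rng(42)

def init_layer(n_in, n_out, weight_scale=0.01, bias_value=0.0):
    """Return (W, b) for one fully connected layer."""
    W = weight_scale * rng.normal(size=(n_in, n_out))  # small random weights
    b = np.full(n_out, bias_value)                     # zero or small positive
    return W, b

W1, b1 = init_layer(784, 256)                   # biases at zero
W2, b2 = init_layer(256, 10, bias_value=0.1)    # small positive biases
```

If all weights started at the same value, every hidden unit would compute the same function and receive the same gradient; the random scale keeps them distinct while leaving activations near the non-saturated regime.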

Cost Function

  • In most cases, our parametric model defines a distribution p(y|x;θ) and we simply use the principle of maximum likelihood. This means we use the cross-entropy between the training data and the model’s predictions as the cost function.
  • The total cost function used to train a neural network will often combine one of the primary cost functions described here with a regularization term.

Learning Conditional Distributions with Maximum Likelihood

Most modern neural networks are trained using maximum likelihood. This means that the cost function is simply the negative log-likelihood, equivalently described as the cross-entropy between the training data and the model distribution. This cost function is given by

J(θ) = −E_{x,y∼p̂_data} log p_model(y | x)

The specific form of the cost function changes from model to model, depending on the specific form of log p_model.
For example, if p_model(y | x) = N(y; f(x; θ), I), then we recover the mean squared error cost,

J(θ) = (1/2) E_{x,y∼p̂_data} ||y − f(x; θ)||² + const

up to a scaling factor of 1/2 and a term that does not depend on θ.
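This equivalence is easy to check numerically: for a unit-covariance Gaussian, the negative log-likelihood of y differs from half the squared error only by (d/2) log 2π, regardless of the prediction. The vectors below are arbitrary stand-ins for y and f(x; θ).

```python
import numpy as np

# Numerical check: if p_model(y|x) = N(y; f(x; theta), I), then
# -log p_model(y|x) = (1/2)||y - f(x; theta)||^2 + (d/2) log(2*pi),
# and the constant does not depend on the prediction.

def neg_log_gaussian(y, mean):
    # -log N(y; mean, I), from the density of a unit-variance Gaussian
    # in each coordinate
    return np.sum(0.5 * (y - mean) ** 2 + 0.5 * np.log(2 * np.pi))

rng = np.random.default_rng(1)
d = 3
y = rng.normal(size=d)
f1, f2 = rng.normal(size=d), rng.normal(size=d)  # two stand-in predictions

const1 = neg_log_gaussian(y, f1) - 0.5 * np.sum((y - f1) ** 2)
const2 = neg_log_gaussian(y, f2) - 0.5 * np.sum((y - f2) ** 2)
# const1 == const2 == (d/2) * log(2*pi), independent of the prediction
```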
Previously, we saw that the equivalence between maximum likelihood estimation with a Gaussian output distribution and minimization of mean squared error holds for a linear model, but in fact, the equivalence holds regardless of the f(x; θ) used to predict the mean of the Gaussian.

  • One recurring theme throughout neural network design is that the gradient of the cost function must be large and predictable enough to serve as a good guide for the learning algorithm.
  • Functions that saturate (become very flat) undermine this objective because they make the gradient become very small. In many cases this happens because the activation functions used to produce the output of the hidden units or the output units saturate.
  • The negative log-likelihood helps to avoid this problem for many models. Many output units involve an exp function that can saturate when its argument is very negative. The log function in the negative log-likelihood cost function undoes the exp of some output units.
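The "log undoes the exp" point can be illustrated with a sigmoid output unit (a standard example, not taken from the text): when the unit is saturated on the wrong side, squared error gives a vanishing gradient, while the negative log-likelihood keeps a gradient of nearly unit magnitude.

```python
import numpy as np

# For a sigmoid output sigma(z) = 1/(1+exp(-z)) with target y = 1:
# - d/dz of (sigma(z) - 1)^2 vanishes when z is very negative, so
#   a confidently wrong unit barely learns under squared error;
# - d/dz of -log sigma(z) equals sigma(z) - 1, close to -1, because
#   the log cancels the exp inside the sigmoid.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = -10.0                      # strongly saturated, confidently wrong
s = sigmoid(z)

grad_mse = 2 * (s - 1) * s * (1 - s)   # tiny: learning stalls
grad_nll = s - 1                       # close to -1: learning proceeds
```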

Learning Conditional Statistics

Instead of learning a full probability distribution p(y | x; θ), we often want to learn just one conditional statistic of y given x.
For example, we may have a predictor f(x; θ) with which we wish to predict the mean of y.
We can thus think of learning as choosing a function rather than merely choosing a set of parameters.
Solving an optimization problem with respect to a function requires a mathematical tool called calculus of variations. At the moment, it is only necessary to understand that calculus of variations may be used to derive the following two results.
Our first result derived using calculus of variations is that solving the optimization problem

f* = argmin_f E_{x,y∼p_data} ||y − f(x)||²

yields
f*(x) = E_{y∼p_data(y|x)}[y]

so long as this function lies within the class we optimize over. In other words, if we could train on infinitely many samples from the true data-generating distribution, minimizing the mean squared error cost function gives a function that predicts the mean of y for each value of x.
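A finite-sample version of this result is easy to demonstrate: for a fixed x, the constant prediction c that minimizes the average squared error over samples of y is the sample mean. The exponential distribution below is an arbitrary choice standing in for p_data(y | x).

```python
import numpy as np

# For a fixed x, scan constant predictions c and find the one that
# minimizes the empirical mean squared error; it lands on the sample
# mean of y, matching f*(x) = E[y | x].

rng = np.random.default_rng(7)
y = rng.exponential(scale=2.0, size=100_000)  # samples of y given some x

cs = np.linspace(0.0, 5.0, 501)
mse = [np.mean((y - c) ** 2) for c in cs]
best_c = cs[int(np.argmin(mse))]

# best_c sits next to the sample mean of y (about 2.0 here)
```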

A second result derived using calculus of variations is that

f* = argmin_f E_{x,y∼p_data} ||y − f(x)||_1

yields a function that predicts the median value of y for each x, so long as such a function may be described by the family of functions we optimize over. This cost function is commonly called mean absolute error.
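The companion demonstration for this second result: minimizing mean absolute error over a constant prediction recovers the sample median, which for a skewed distribution like the one below differs visibly from the mean. The distribution is again an arbitrary illustrative choice.

```python
import numpy as np

# Same setup as before, but scanning constant predictions c under mean
# absolute error; the minimizer lands on the sample median of y.

rng = np.random.default_rng(7)
y = rng.exponential(scale=2.0, size=100_000)

cs = np.linspace(0.0, 5.0, 501)
mae = [np.mean(np.abs(y - c)) for c in cs]
best_c = cs[int(np.argmin(mae))]

# best_c lands near the sample median (about 2*ln 2 ≈ 1.386 for this
# distribution), well below the mean (about 2.0)
```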
Unfortunately, mean squared error and mean absolute error often lead to poor results when used with gradient-based optimization. Some output units that saturate produce very small gradients when combined with these cost functions. This is one reason that the cross-entropy cost function is more popular than mean squared error or mean absolute error, even when it is not necessary to estimate an entire distribution p(y | x).
