Deep Learning: Deep Feedforward Networks (Part 1)
- Deep feedforward networks, also called feedforward neural networks or multilayer perceptrons (MLPs), aim to approximate some function f*. For example, for a classifier, y = f*(x) maps an input x to a category y. A feedforward network defines a mapping y = f(x; θ) and learns the value of the parameters θ that result in the best function approximation.
- The model is associated with a directed acyclic graph describing how the functions are composed together. For example, we might have three functions f^(1), f^(2), and f^(3) connected in a chain, to form f(x) = f^(3)(f^(2)(f^(1)(x))). The overall length of the chain gives the depth of the model.
- It is best to think of feedforward networks as function approximation machines that are designed to achieve statistical generalization, occasionally drawing some insights from what we know about the brain, rather than as models of brain function.
- To extend linear models to represent nonlinear functions of x, we can apply the linear model not to x itself but to a transformed input φ(x), where φ is a nonlinear transformation. The question is then how to choose the mapping φ:
(1) One option is to use a very generic φ, such as the infinite-dimensional φ that is implicitly used by kernel machines based on the RBF kernel. If φ(x) is of high enough dimension, we can always have enough capacity to fit the training set, but generalization to the test set often remains poor. Very generic feature mappings are usually based only on the principle of local smoothness and do not encode enough prior information to solve advanced problems.
(2) Another option is to manually engineer φ. Until the advent of deep learning, this was the dominant approach. This approach requires decades of human effort for each separate task, with practitioners specializing in different domains such as speech recognition or computer vision, and with little transfer between domains.
(3) The strategy of deep learning is to learn φ. In this approach, we have a model y = f(x; θ, w) = φ(x; θ)^T w. We now have parameters θ that we use to learn φ from a broad class of functions, and parameters w that map from φ(x) to the desired output. This approach is the only one of the three that gives up on the convexity of the training problem, but the benefits outweigh the harms.
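A minimal sketch of the learned-φ view, with an assumed one-hidden-layer architecture (the sizes and the tanh nonlinearity are my choices, not from the text): the hidden layer plays the role of φ(x; θ) with θ = (W, b), and a final linear readout w produces y = φ(x; θ)^T w.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x, W, b):
    """Learned nonlinear feature map phi(x; theta), with theta = (W, b)."""
    return np.tanh(W @ x + b)

# Parameters theta of the feature map, and the linear readout w.
W = rng.normal(scale=0.1, size=(4, 3))   # hidden-layer weights
b = np.zeros(4)                          # hidden-layer biases
w = rng.normal(scale=0.1, size=4)        # output weights

x = np.array([1.0, -2.0, 0.5])
y = phi(x, W, b) @ w                     # y = phi(x; theta)^T w
print(float(y))
```

Training would adjust θ and w jointly by gradient descent; the kernel-machine and hand-engineering options fix φ and learn only w.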
Gradient-Based Learning
- The largest difference between the linear models we have seen so far and neural networks is that the nonlinearity of a neural network causes most interesting loss functions to become non-convex. This means that neural networks are usually trained by using iterative, gradient-based optimizers that merely drive the cost function to a very low value, rather than the linear equation solvers used to train linear regression models or the convex optimization algorithms with global convergence guarantees used to train logistic regression or SVMs.
- Stochastic gradient descent applied to non-convex loss functions has no such convergence guarantee, and is sensitive to the values of the initial parameters.
- For feedforward neural networks, it is important to initialize all weights to small random values.
- The biases may be initialized to zero or to small positive values.
- For the moment, it suffices to understand that the training algorithm is almost always based on using the gradient to descend the cost function in one way or another.
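The points above can be combined into one runnable sketch: small random weight initialization, zero biases, and an iterative gradient-descent loop that merely drives a non-convex cost to a low value. The architecture, learning rate, and toy task are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 2))
y = X[:, 0] * X[:, 1]                    # toy nonlinear regression target

W1 = rng.normal(scale=0.1, size=(8, 2))  # small random initial weights
b1 = np.zeros(8)                         # biases initialized to zero
w2 = rng.normal(scale=0.1, size=8)

mse0 = float(np.mean((np.tanh(X @ W1.T + b1) @ w2 - y) ** 2))  # initial cost

lr = 0.1
for step in range(500):
    h = np.tanh(X @ W1.T + b1)           # hidden activations
    pred = h @ w2
    err = pred - y                       # d(cost)/d(pred) for 0.5 * MSE
    # Backpropagate the mean squared error cost through the network.
    gw2 = h.T @ err / len(X)
    gh = np.outer(err, w2) * (1 - h ** 2)
    gW1 = gh.T @ X / len(X)
    gb1 = gh.mean(axis=0)
    # Gradient descent step: drive the cost toward a low value.
    w2 -= lr * gw2
    W1 -= lr * gW1
    b1 -= lr * gb1

mse = float(np.mean((np.tanh(X @ W1.T + b1) @ w2 - y) ** 2))
print(mse0, mse)
```

There is no global convergence guarantee here: a different random seed gives a different initialization and, in general, a different final set of parameters.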
Cost Function
- In most cases, our parametric model defines a distribution p(y | x; θ), and we simply use the principle of maximum likelihood. This means we use the cross-entropy between the training data and the model's predictions as the cost function.
- The total cost function used to train a neural network will often combine one of the primary cost functions described here with a regularization term.
Learning Conditional Distributions with Maximum Likelihood
Most modern neural networks are trained using maximum likelihood. This means that the cost function is simply the negative log-likelihood, equivalently described as the cross-entropy between the training data and the model distribution. This cost function is given by

J(θ) = −E_{x,y∼p̂_data} log p_model(y | x).

The specific form of the cost function changes from model to model, depending on the specific form of log p_model. For example, if p_model(y | x) = N(y; f(x; θ), I), then we recover the mean squared error cost,

J(θ) = (1/2) E_{x,y∼p̂_data} ‖y − f(x; θ)‖² + const,

up to a scaling factor of 1/2 and a term that does not depend on θ.
Previously, we saw that the equivalence between maximum likelihood estimation with a Gaussian output distribution and minimization of mean squared error holds for a linear model, but in fact, the equivalence holds regardless of the f(x; θ) used to predict the mean of the Gaussian.
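The equivalence can be verified in a couple of lines by writing out the Gaussian density and taking the negative log (here m is the dimension of y):

```latex
p_{\text{model}}(y \mid x) = \mathcal{N}\bigl(y;\, f(x;\theta),\, I\bigr)
  = (2\pi)^{-m/2} \exp\!\Bigl(-\tfrac{1}{2}\,\lVert y - f(x;\theta)\rVert^{2}\Bigr),

-\log p_{\text{model}}(y \mid x)
  = \tfrac{1}{2}\,\lVert y - f(x;\theta)\rVert^{2} + \tfrac{m}{2}\log(2\pi).
```

The second term is constant in θ, so minimizing the negative log-likelihood is the same as minimizing mean squared error, no matter what function f(x; θ) predicts the mean.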
- One recurring theme throughout neural network design is that the gradient of the cost function must be large and predictable enough to serve as a good guide for the learning algorithm.
- Functions that saturate (become very flat) undermine this objective because they make the gradient become very small. In many cases this happens because the activation functions used to produce the output of the hidden units or the output units saturate.
- The negative log-likelihood helps to avoid this problem for many models. Many output units involve an exp function that can saturate when its argument is very negative. The log function in the negative log-likelihood cost function undoes the exp of some output units.
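A small numerical illustration of this point (the toy numbers are my own): consider a sigmoid output unit that is saturated and confidently wrong. The gradient of a mean squared error cost with respect to the pre-activation z nearly vanishes, while the negative log-likelihood (cross-entropy) gradient stays large, because the log undoes the exp inside the sigmoid.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = -10.0          # strongly saturated pre-activation
y = 1.0            # true label: the unit is confidently wrong
p = sigmoid(z)     # ~ 4.5e-5

# d/dz of the MSE cost 0.5*(p - y)^2 is (p - y) * p * (1 - p): nearly zero.
grad_mse = (p - y) * p * (1 - p)
# d/dz of the NLL cost -[y log p + (1-y) log(1-p)] is (p - y): close to -1.
grad_nll = p - y

print(grad_mse, grad_nll)
```

With MSE the learning signal dies exactly when the unit is most wrong; with the negative log-likelihood it does not.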
Learning Conditional Statistics
Instead of learning a full probability distribution p(y | x; θ), we often want to learn just one conditional statistic of y given x.
For example, we may have a predictor f(x; θ) that we wish to use to predict the mean of y.
We can thus think of learning as choosing a function rather than merely choosing a set of parameters.
Solving an optimization problem with respect to a function requires a mathematical tool called calculus of variations. At the moment, it is only necessary to understand that calculus of variations may be used to derive the following two results.
Our first result derived using calculus of variations is that solving the optimization problem

f* = argmin_f E_{x,y∼p_data} ‖y − f(x)‖²

yields

f*(x) = E_{y∼p_data(y|x)}[y],
so long as this function lies within the class we optimize over. In other words, if we could train on infinitely many samples from the true data-generating distribution, minimizing the mean squared error cost function gives a function that predicts the mean of y for each value of x.
A second result derived using calculus of variations is that

f* = argmin_f E_{x,y∼p_data} ‖y − f(x)‖₁
yields a function that predicts the median value of y for each x, so long as such a function may be described by the family of functions we optimize over. This cost function is commonly called mean absolute error.
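Both results can be checked empirically on a skewed toy sample standing in for the true data-generating distribution (the sample and the candidate grid are my own assumptions): over constant predictors, squared error is minimized by the mean and absolute error by the median.

```python
import numpy as np

y = np.array([0.0, 0.0, 1.0, 2.0, 10.0])   # skewed toy sample
cands = np.linspace(-5, 15, 20001)          # candidate constant predictors

mse = [np.mean((y - c) ** 2) for c in cands]
mae = [np.mean(np.abs(y - c)) for c in cands]

best_mse = cands[int(np.argmin(mse))]
best_mae = cands[int(np.argmin(mae))]

print(best_mse, np.mean(y))    # both ~ 2.6 (the mean)
print(best_mae, np.median(y))  # both ~ 1.0 (the median)
```

Note how the single outlier at 10 pulls the squared-error minimizer far above the absolute-error minimizer, which is one practical reason for choosing between these costs.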
Unfortunately, mean squared error and mean absolute error often lead to poor results when used with gradient-based optimization. Some output units that saturate produce very small gradients when combined with these cost functions. This is one reason that the cross-entropy cost function is more popular than mean squared error or mean absolute error, even when it is not necessary to estimate an entire distribution p(y | x).