Notes: Some basic questions about caffe and deep learning


Rectified Linear Units
Excerpted from http://www.douban.com/note/348196265/
Sigmoid and tanh are the familiar activation functions for neural networks; today I took a look at ReLU, a piecewise-linear activation function. Clearly, such a linear activation is much cheaper to compute, and many works have shown that ReLU helps improve results [1].
sigmoid:
g(x) = 1 / (1 + exp(-x)), g'(x) = g(x)(1 - g(x)).
tanh:
g(x) = sinh(x)/cosh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))
Rectifier (ReL):
- hard ReLU: g(x) = max(0, x)
- noisy ReLU: g(x) = max(0, x + N(0, σ(x)))
softplus:
g(x) = log(1 + exp(x)); its derivative is the logistic function.
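
For reference, here is a minimal NumPy sketch of the activations listed above and the sigmoid derivative; the function names and the constant noise scale are my own choices, not from the source note.

```python
import numpy as np

def sigmoid(x):
    # g(x) = 1 / (1 + exp(-x)); g'(x) = g(x) * (1 - g(x))
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh(x):
    # g(x) = sinh(x) / cosh(x)
    return np.tanh(x)

def hard_relu(x):
    # g(x) = max(0, x)
    return np.maximum(0.0, x)

def noisy_relu(x, sigma=1.0):
    # g(x) = max(0, x + N(0, sigma)); sigma is taken as a constant here
    return np.maximum(0.0, x + np.random.normal(0.0, sigma, size=np.shape(x)))

def softplus(x):
    # g(x) = log(1 + exp(x)); its derivative is the logistic (sigmoid) function
    return np.log1p(np.exp(x))
```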

[The following is excerpted from Quora]
The major differences between the sigmoid and ReL functions are:
The sigmoid function has range (0, 1), whereas the ReL function has range [0, ∞). Hence the sigmoid can be used to model a probability, whereas ReL can be used to model a positive real number.
The gradient of the sigmoid function vanishes as we increase or decrease x, whereas the gradient of the ReL function does not vanish as we increase x (the sigmoid's vanishing-gradient problem is very bad for training neural networks). In fact, for the max function the gradient is defined as

g'(x) = 0 if x < 0, and g'(x) = 1 if x > 0.
[End of Quora excerpt]
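
To make the vanishing-gradient contrast concrete, here is a tiny numeric check (my own illustration, using the definitions above):

```python
import numpy as np

x = 10.0
sig = 1.0 / (1.0 + np.exp(-x))
grad_sigmoid = sig * (1.0 - sig)    # ~4.5e-5: nearly vanished for large x
grad_relu = 1.0 if x > 0 else 0.0   # stays 1 no matter how large x gets

print(grad_sigmoid, grad_relu)
```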

The advantages of ReLU mentioned in [1]:
- Hard ReLU naturally enforces sparsity.
- The derivative of ReLU is constant over each of its two ranges.

A possible problem with ReLU is the dead zone (units whose output is always 0): the derivative of hard ReLU is constant over the two ranges x < 0 and x >= 0 (g' = 1 for x > 0 and g' = 0 for x < 0), so a unit that always falls in the negative range gets "stuck". Small tricks can mitigate this, for example initializing the bias to a positive number, although some papers [2] report that the problem has little real impact. One can of course also switch to other activation functions, such as maxout or softplus.
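
A minimal sketch of the dead-zone effect and the positive-bias trick described above; the toy data and the single unit below are hypothetical and only meant to illustrate the idea:

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(1000, 10)   # toy inputs
W = np.random.randn(10)         # weights of one ReLU unit

def frac_dead(bias):
    # fraction of inputs for which the unit outputs 0 (and thus passes zero gradient)
    pre_activation = X.dot(W) + bias
    return np.mean(pre_activation <= 0)

print(frac_dead(bias=-10.0))  # output is almost always 0: a "stuck" (dead) unit
print(frac_dead(bias=0.0))    # roughly half the inputs give zero gradient
print(frac_dead(bias=1.0))    # a small positive bias keeps the unit active more often
```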

[1] Rectifier Nonlinearities Improve Neural Network Acoustic Models
[2] Deep Sparse Rectifier Neural Networks
[3] http://www.quora.com/Deep-Learning/What-is-special-about-rectifier-neural-units-used-in-NN-learning
On the origin of ReLU:
Geoff Hinton gave a lecture in the summer of 2013 that I found very helpful in understanding ReLUs and the rest. Essentially he claimed that the original activation function was chosen arbitrarily and that ReLUs work "better", but aren't the be-all and end-all. Also interesting was that the ReLU is an approximation to the summation of an infinite number of sigmoids with varying offsets - I wrote a blog post showing this is the case. They arrived at this function by experimenting with a deep network where they varied the offsets of the activation functions at random until it "just worked" without pretraining. Based on this observation Hinton decided to try a network that essentially tried all the offsets at once - hence the ReLU.
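
A quick numeric check of the "sum of shifted sigmoids" claim; the specific offsets (0.5, 1.5, 2.5, ...) and the softplus comparison follow the usual derivation and are my reconstruction, not part of the quoted lecture:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 11)
offsets = np.arange(0.5, 50.0, 1.0)   # offsets 0.5, 1.5, 2.5, ...
sum_of_sigmoids = sigmoid(x[:, None] - offsets[None, :]).sum(axis=1)
softplus = np.log1p(np.exp(x))        # smooth approximation to max(0, x)
relu = np.maximum(0.0, x)

# the sum of shifted sigmoids tracks softplus, which in turn tracks ReLU
print(np.abs(sum_of_sigmoids - softplus).max())
print(np.abs(softplus - relu).max())  # largest gap is log(2), at x = 0
```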
