cs231n Lecture Notes

Source: Internet · Editor: 程序博客网 · Time: 2024/05/24 03:19

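I recently signed up for a Deep Learning course on Udacity. Unfortunately the course emphasizes practice over theory, and I found the assignments and projects hard going. So I turned to the YouTube lectures of this spring's Stanford cs231n (Convolutional Neural Networks for Visual Recognition) to shore up the fundamentals, took brief notes, and added links to the papers mentioned for future reference. The assignments will be added on GitHub later.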

Lecture1.introduction

The lecture slides are here

Images have been called the "dark matter" of the Internet
David Marr, 1970s, stages of visual representation:
1. input image
2. primal sketch(edge image)
3. 2 1/2-D sketch
4. 3-D model

Face Detection,2001
Histogram of Oriented Gradients (HoG), Dalal & Triggs, 2005
PASCAL Visual Object Challenge(20 object categories)

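Modern image recognition problems have many features and high dimensionality, so algorithms often overfit

Evolution of the algorithms used in the ImageNet Challenge
- Lin, CVPR 2011: SVM
- Krizhevsky, NIPS 2012: CNN, SuperVision (AlexNet)
- Szegedy, arXiv 2014 / Simonyan, arXiv 2014: GoogLeNet / VGG
- Microsoft Research Asia, 2015: 152-layer Residual Networks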

Lecture2.Image Classification

The lecture slides are here

Data-Driven Approach

collect a dataset of images and labels
use machine learning to train a classifier
evaluate the classifier on new images

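A machine learning method for classification generally consists of two functions (steps):

Input function → train
Output function → predict

Common classifier 1: Nearest Neighbor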

memorize all data and labels
predict the label of the most similar train image

Distance Metric to compare images
L1 distance(Manhattan distance):

$d_1(I_1, I_2) = \sum_p |I_1^p - I_2^p|$

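Training takes O(1) time and prediction takes O(N). What we want is the opposite: training may be slow, but prediction should be as fast as possible
Drawback: inaccurate; noisy points cause misclassification
Improvement: K-Nearest Neighbor. Given K, take a majority vote over the labels of the K nearest training samples as the prediction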

Using Euclidean distance as the distance metric
L2 (Euclidean) distance:

$d_2(I_1, I_2) = \sqrt{\sum_p (I_1^p - I_2^p)^2}$
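The two distance metrics and the K-nearest-neighbor vote can be sketched in NumPy roughly as follows. This is a minimal illustration, not the course's reference code: the class name `KNearestNeighbor`, the helper names, and the toy two-pixel "images" are assumptions made up for the example.

```python
import numpy as np

def l1_distance(train, query):
    """Manhattan distance from one query row to every training row."""
    return np.sum(np.abs(train - query), axis=1)

def l2_distance(train, query):
    """Euclidean distance from one query row to every training row."""
    return np.sqrt(np.sum((train - query) ** 2, axis=1))

class KNearestNeighbor:
    def train(self, X, y):
        # "Training" only memorizes the data and labels.
        self.X_train = X
        self.y_train = y

    def predict(self, X, k=1, distance=l1_distance):
        # Each query is compared against all N training rows.
        y_pred = np.empty(X.shape[0], dtype=self.y_train.dtype)
        for i, query in enumerate(X):
            dists = distance(self.X_train, query)
            nearest = np.argsort(dists)[:k]           # indices of the k closest rows
            votes = np.bincount(self.y_train[nearest])
            y_pred[i] = np.argmax(votes)              # majority vote (ties go to the smaller label)
        return y_pred

# Toy data (hypothetical): four flattened "images" with two pixels each.
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 10.0]])
y_train = np.array([0, 0, 1, 1])
queries = np.array([[0.5, 0.2], [9.0, 9.5]])

knn = KNearestNeighbor()
knn.train(X_train, y_train)
pred_l1 = knn.predict(queries, k=3, distance=l1_distance)
pred_l2 = knn.predict(queries, k=3, distance=l2_distance)
```

Both `k` and `distance` are passed in as arguments because they are hyperparameters to be chosen, not learned.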

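L1 vs. L2

L1 depends on the choice of coordinate system: rotating the coordinate frame changes the L1 distance
L2 does not depend on the coordinate system
If the individual input features carry a specific meaning, L1 tends to be the better choice; if the features are interchangeable, L2 is the more natural one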

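Setting the hyperparameters of K-Nearest Neighbor

× Choose the hyperparameters that work best on the training data (BAD: K=1 always fits the training set best)
× Split into a training set and a test set and pick what works best on the test set (BAD: no idea how the model performs on unseen data)
√ Split into training, validation, and test sets
√ Cross-validation (very useful on small datasets, but not common in deep learning)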

K-Nearest Neighbor on images never used
- Very slow at test time
- Distance metrics on pixels are not informative
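L2 distance on raw pixels does not reflect the kind of change in an image (e.g. occlusion or shifting)
- curse of dimensionality
As the dimensionality grows, the number of samples needed to cover the space grows exponentially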

summary

In image classification we start with a training set of images and labels, and must predict labels on the test set
The K-Nearest Neighbors classifier predicts labels based on the nearest training examples
Distance metric and K are hyperparameters
Choose hyperparameters using the validation set; only run on the test set once at the very end!

Common classifier 2: Linear classification

Parametric Approach

An image of size 32 (pixels) × 32 (pixels) × 3 (RGB) is flattened into an array of 3072 numbers and mapped by the weight matrix W to scores for the 10 given classes

f (x) = W x + b
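To make the shapes concrete, here is a small sketch of the score computation. The random image and the randomly initialized `W` and `b` are stand-ins invented for the example, so the predicted class is meaningless; only the shapes matter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for one 32x32x3 image with pixel values in [0, 255].
image = rng.integers(0, 256, size=(32, 32, 3))
x = image.reshape(-1).astype(np.float64)       # flatten to 3072 numbers

W = 0.001 * rng.standard_normal((10, 3072))    # one row of weights per class
b = np.zeros(10)                               # one bias per class

scores = W.dot(x) + b                          # f(x) = Wx + b, one score per class
predicted_class = int(np.argmax(scores))       # the highest score wins
```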

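Once trained, only the parameters are needed at test time; the training set no longer has to be kept

This approach tries to separate the classes with linear boundaries in a high-dimensional space, so it fails on data that is not linearly separable

How to judge the quality of W with a cost function is covered in the next lecture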

Lecture3.Loss Functions and Optimization

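The lecture slides are here

How to evaluate the weight matrix W

Define a loss function that quantifies how good the classification is
Find the parameters that minimize that function (optimization)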

Loss function

General form:

Suppose the dataset is $\{(x_i, y_i)\}_{i=1}^{N}$,
where $x_i$ is an image and $y_i$ is its (integer) label
The loss over the dataset is:

$L = \frac{1}{N} \sum_i L_i(f(x_i, W), y_i)$

Multiclass SVM loss(Hinge Loss)：

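$L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1)$

where $s = f(x_i, W)$ is the vector of class scores, $s_{y_i}$ is the score of the correct class, and $s_j$ is the score of any other class $j$.

so:

$L = \frac{1}{N} \sum_{i=1}^{N} L_i$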

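Some properties:
- Hinge loss does not care about the exact score values, only about how the scores compare with one another
- Hinge loss takes values in $[0, +\infty)$
- If the initial weight matrix gives s≈0 for every class, the loss should be about the number of classes minus one, which is a useful sanity check
- To penalize misclassification more heavily, use the squared hinge loss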

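Hinge loss code

    import numpy as np

    def L_i_vectorized(x, y, W):
        # Class scores for one example, one per row of W.
        scores = W.dot(x)
        # Margin of every class against the correct class y.
        margins = np.maximum(0, scores - scores[y] + 1)
        # The correct class contributes no loss.
        margins[y] = 0
        loss_i = np.sum(margins)
        return loss_i

Setting margins[y] = 0 zeroes out the correct class's own term, which implements the $j \neq y_i$ condition of the sum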

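Regularization term

$L = \frac{1}{N} \sum_{i=1}^{N} \sum_{j \neq y_i} \max(0, f(x_i; W)_j - f(x_i; W)_{y_i} + 1) + \lambda R(W)$

Common regularizers

L2 regularization $R(W) = \sum_k \sum_l W_{k,l}^2$
L1 regularization $R(W) = \sum_k \sum_l |W_{k,l}|$
Elastic net (L1+L2) $R(W) = \sum_k \sum_l \beta W_{k,l}^2 + |W_{k,l}|$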
Max norm regularization
Dropout
Batch normalization
stochastic depth

Softmax Classifier(Multinomial Logistic Regression)

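The softmax function is a normalized exponential, i.e. $\mathrm{softmax}(x) = \mathrm{normalize}(e^x)$
The classifier's loss function is:

$P(Y = k \mid X = x_i) = \frac{e^{s_k}}{\sum_j e^{s_j}}$, where $s = f(x_i; W)$

$L_i = -\log P(Y = y_i \mid X = x_i)$

so: $L_i = -\log\left(\frac{e^{s_{y_i}}}{\sum_j e^{s_j}}\right)$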
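The softmax loss for a single example can be sketched in the same style as the hinge loss code. Subtracting the maximum score before exponentiating is a common numerical-stability trick that is not part of the formula above; it leaves the probabilities unchanged. The function name `softmax_loss_i` and the toy inputs are assumptions for illustration.

```python
import numpy as np

def softmax_loss_i(x, y, W):
    """Cross-entropy loss -log P(Y = y | X = x) for one example."""
    scores = W.dot(x)
    # Shifting all scores by a constant does not change the softmax,
    # but keeps np.exp from overflowing on large scores.
    scores = scores - np.max(scores)
    log_probs = scores - np.log(np.sum(np.exp(scores)))
    return -log_probs[y]

x = np.array([1.0, 2.0, 3.0, 4.0])

# With all-zero weights every class gets the same score,
# so each of the 3 classes has probability 1/3.
loss_uniform = softmax_loss_i(x, 0, np.zeros((3, 4)))

# Very large scores still give a finite loss thanks to the shift.
loss_large = softmax_loss_i(x, 2, 1000.0 * np.eye(3, 4))
```

The uniform case gives a loss of log(3), which mirrors the initial-loss sanity check for the hinge loss.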

Optimization

Follow the slope

In 1-dimension,the slope is the derivative of a function
In multiple dimensions, the slope is the gradient

Analytic Gradient

Fast and efficient, but error-prone; debug it by checking it against the numerical gradient (which is slow and approximate, but easy to write)
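A gradient check along these lines might look like the sketch below. The toy loss function and its hand-derived gradient are made up for the example; in practice `loss` would be the classifier loss and `analytic_gradient` the backpropagated gradient.

```python
import numpy as np

def numerical_gradient(f, w, h=1e-5):
    """Centered finite differences: slow and approximate, but easy to write."""
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + h
        f_plus = f(w)
        w[idx] = old - h
        f_minus = f(w)
        w[idx] = old                                  # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * h)
    return grad

# Hypothetical toy loss with a known analytic gradient.
def loss(w):
    return np.sum(w ** 2) + np.sum(np.sin(w))

def analytic_gradient(w):
    return 2 * w + np.cos(w)

w = np.array([[0.5, -1.2], [2.0, 0.1]])
num = numerical_gradient(loss, w)
ana = analytic_gradient(w)

# A relative error is more informative than an absolute one.
rel_error = np.max(np.abs(num - ana) / np.maximum(1e-8, np.abs(num) + np.abs(ana)))
```

A small `rel_error` suggests the analytic gradient is implemented correctly; a large one points to a bug in the derivation or the code.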

Gradient Descent

    while True:
        weights_grad = evaluate_gradient(loss_fun, data, weights)
        weights += - step_size * weights_grad  # perform parameter update

Stochastic Gradient Descent(SGD)

Full sum expensive when N is large
Approximate sum using a minibatch of examples(32/64/128 common size)

    while True:
        data_batch = sample_training_data(data, 256)  # sample 256 examples
        weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
        weights += - step_size * weights_grad  # perform parameter update

Linear Classification Loss Visualization

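An interactive demo animates the training process: you can adjust the hyperparameters and watch how, and how fast, it converges
Link

When image features are not linearly separable

The following feature transforms can be applied:
1. Color Histogram
Count the number of pixels of each color into the corresponding color bin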
2. Histogram of Oriented Gradients(HoG)
3. Bag of Words
- Step1.Build codebook
- Step2.Encode images

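Lecture4.Introduction to Neural Networks

The lecture slides are here

Backpropagation

Chain rule

The key tool in backpropagation: the chain rule is used to compute gradients backward from the output nodes to the input nodes.
Consider a three-layer neural network (one input layer, two hidden layers). The input matrix $X = [x_1, x_2, ..., x_m]$ passes through Hidden layer 1 (weight matrix $\theta_1$, activation $f(z)$) and Hidden layer 2 (weight matrix $\theta_2$, activation $g(z)$) to reach the output layer (weight matrix $\theta_3$, activation $h(z)$), which produces the output vector $y = [y_1, y_2, ..., y_n]$.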

That is:

$Z^{(1)} = \theta_1 \cdot X$

$Z^{(2)} = \theta_2 \cdot f(Z^{(1)})$

$Z^{(3)} = \theta_3 \cdot g(Z^{(2)})$

$y = h(Z^{(3)})$

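Then, once the loss δ at the output nodes has been computed, the chain rule propagates it back to Hidden layer 2 as follows:

$\frac{\partial y}{\partial Z^{(2)}} = \frac{\partial y}{\partial Z^{(3)}} \cdot \frac{\partial Z^{(3)}}{\partial Z^{(2)}}$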

forward

Compute the output from the input, caching the intermediate values ($Z^{(i)}$)

backward

Apply the chain rule to the cached intermediate values to compute, step by step, the gradient with respect to each step's inputs
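The forward and backward passes for the three-layer network described above can be sketched as follows. The choices here are assumptions for illustration only: `tanh` for both hidden activations f and g, the identity for h, a squared-error loss, and small random data with samples stored as columns.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 5))               # 4 input features, 5 samples (columns)
target = rng.standard_normal((2, 5))          # made-up regression targets
theta1 = 0.5 * rng.standard_normal((3, 4))
theta2 = 0.5 * rng.standard_normal((3, 3))
theta3 = 0.5 * rng.standard_normal((2, 3))

def forward(theta1, theta2, theta3):
    A1 = np.tanh(theta1 @ X)                  # f(Z1)
    A2 = np.tanh(theta2 @ A1)                 # g(Z2)
    y = theta3 @ A2                           # h is the identity here
    loss = 0.5 * np.sum((y - target) ** 2)
    return loss, (A1, A2, y)                  # cache intermediates for backward

def backward(cache):
    A1, A2, y = cache
    dZ3 = y - target                          # dL/dZ3
    dtheta3 = dZ3 @ A2.T
    dZ2 = (theta3.T @ dZ3) * (1 - A2 ** 2)    # chain rule; tanh'(z) = 1 - tanh(z)^2
    dtheta2 = dZ2 @ A1.T
    dZ1 = (theta2.T @ dZ2) * (1 - A1 ** 2)
    dtheta1 = dZ1 @ X.T
    return dtheta1, dtheta2, dtheta3

loss, cache = forward(theta1, theta2, theta3)
dtheta1, dtheta2, dtheta3 = backward(cache)

# Numerical check of one entry of dtheta1 with centered differences.
h = 1e-5
bumped = theta1.copy()
bumped[0, 0] += h
loss_plus, _ = forward(bumped, theta2, theta3)
bumped[0, 0] -= 2 * h
loss_minus, _ = forward(bumped, theta2, theta3)
numeric = (loss_plus - loss_minus) / (2 * h)
```

Each backward line reuses a value cached by the forward pass, which is why the intermediates have to be stored.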

Neural Network

Some activation functions

Sigmoid $\sigma(x) = \frac{1}{1 + e^{-x}}$
tanh $\tanh(x)$
ReLU (Rectified Linear Units), Yann LeCun, 2009 $\max(0, x)$
Leaky ReLU $\max(0.1x, x)$
ELU $\begin{cases} x & x \geq 0 \\ \alpha(e^x - 1) & x < 0 \end{cases}$
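For reference, the activation functions listed above translate to NumPy roughly as follows. The default `alpha=1.0` for ELU is an assumption, since the notes leave it unspecified; `np.tanh` already covers tanh.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.1):
    return np.maximum(slope * x, x)

def elu(x, alpha=1.0):
    # np.minimum keeps expm1 from overflowing on large positive inputs,
    # whose branch is discarded by np.where anyway.
    return np.where(x >= 0, x, alpha * np.expm1(np.minimum(x, 0.0)))

x = np.array([-2.0, 0.0, 3.0])
outputs = {
    "sigmoid": sigmoid(x),
    "tanh": np.tanh(x),
    "relu": relu(x),
    "leaky_relu": leaky_relu(x),
    "elu": elu(x),
}
```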

Lecture5.Convolutional Neural Networks

The lecture slides are here

History

Frank Rosenblatt, 1957, Perceptron
Widrow and Hoff, 1960, Adaline/Madaline
Rumelhart et al., 1986, First time back-propagation became popular
Hinton and Salakhutdinov, 2006, Reinvigorated research in Deep Learning
Hinton Krizhevsky and Sutskever, 2012, First strong results:AlexNet

today

Hinton, 2012, Reproduced with permission
Ren He and Girshick, 2015, Faster R-CNN
Taigman, 2014, face recognition

Convolution

Definition of convolution: Baidu Baike
Wikipedia
Multiply the filter and the image element by element and sum:

$f[x, y] * g[x, y] = \sum_{n_1 = -\infty}^{\infty} \sum_{n_2 = -\infty}^{\infty} f[n_1, n_2] \cdot g[x - n_1, y - n_2]$

Filter and padding

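Sliding a 3×3 filter over a 7×7 image
- stride=1 gives a 5×5 output
- stride=2 gives a 3×3 output
- stride=3 doesn't fit!
- In general, the output size is (N-F)/stride+1
Sliding a 3×3 filter over a 7×7 image whose border is zero-padded by 1 pixel
- stride=1 gives a 7×7 output
- stride=3 gives a 3×3 output
- To get an output of the same size as the input:
  - with a 3×3 filter, zero pad with 1 pixel
  - with a 5×5 filter, zero pad with 2
  - with a 7×7 filter, zero pad with 3
  - in general, zero pad with (F-1)/2 to keep the size (F is the filter size)
The padding in the first case is called 'valid padding'; the padding in the second case is called 'same padding' (it does not throw away edge and corner information)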
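The size arithmetic above can be checked with a small helper. The function names are made up for this sketch; the formula is (N - F)/stride + 1 with the padded width N + 2·pad substituted for N.

```python
def conv_output_size(n, f, stride=1, pad=0):
    """Output width for an n-wide input, f-wide filter, given stride and zero padding."""
    span = n + 2 * pad - f
    if span < 0 or span % stride != 0:
        raise ValueError("filter does not fit: (N + 2*pad - F) must be a non-negative multiple of stride")
    return span // stride + 1

def same_padding(f):
    """Zero padding that preserves the input size at stride 1 (odd filter sizes)."""
    return (f - 1) // 2

sizes_no_pad = [conv_output_size(7, 3, stride=s) for s in (1, 2)]
sizes_padded = [conv_output_size(7, 3, stride=s, pad=1) for s in (1, 3)]
paddings = [same_padding(f) for f in (3, 5, 7)]
```

Calling `conv_output_size(7, 3, stride=3)` raises `ValueError`, matching the "doesn't fit" case.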

Pooling layer

makes the representations smaller and more manageable
operates over each activation map independently

max pooling
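As an illustration, max pooling with a 2×2 window and stride 2 on a single activation map could be written like this. The window size and the input values are assumptions chosen for the example.

```python
import numpy as np

def max_pool_2x2(activation_map):
    """2x2 max pooling with stride 2 on one (H, W) activation map."""
    h, w = activation_map.shape
    if h % 2 != 0 or w % 2 != 0:
        raise ValueError("height and width must be even")
    # Split each axis into (block index, offset inside block), then
    # take the maximum over the two offset axes.
    blocks = activation_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

a = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])
pooled = max_pool_2x2(a)
```

A 4×4 map shrinks to 2×2, and each activation map would be pooled independently in this way.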

summary

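ConvNets are stacks of convolutional (CONV), pooling (POOL), and fully connected (FC) layers
The trend is toward smaller filters and deeper architectures
The trend is toward more convolutional layers and fewer pooling and fully connected layers
Typical architecture: [(CONV-RELU)×N-POOL]×M-(FC-RELU)×K, SOFTMAX
- N is usually up to 5, M is large, K is usually up to 2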

Lecture6.Training Neural Networks1

The lecture slides are here

Mini-batch SGD

Loop:
- Randomly sample a batch of data
- Forward propagate to get the loss
- Backpropagate to compute the gradients
- Update the parameters using the gradients

Activation Functions

常见激活函数：

sigmoid

Squashes numbers to range[0,1]
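Historically the classic activation function: it is easy to interpret and has convenient properties (exponential form, simple derivative), but it works poorly with backpropagation
Drawbacks:
It makes the gradient 'vanish'. In backpropagation the local gradient is $\frac{\partial \sigma}{\partial x} = \sigma(x)(1 - \sigma(x))$:
- when x is a large negative number, the gradient approaches 0
- when x is near 0, the gradient is usable and keeps flowing
- when x is a large positive number, the gradient approaches 0
Sigmoid outputs are not zero-centered
- if the inputs to a neuron are all positive or all negative, the gradients on its weights all share the same sign, which forces zig-zag updates
Computing the exponential is slow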

tanh(x)

Squashes numbers to range[-1,1]
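Zero-centered output (better than sigmoid)
Still kills the gradient at large positive or negative values

ReLU (the most commonly used in CNNs)

Does not 'saturate' in the positive region
Cheap to compute
Converges faster than tanh(x) and sigmoid (typically about 6x)
Drawbacks:
Output is not zero-centered
Still saturates in the negative region, where the gradient is 0
When the data falls in the inactive (negative) region, the corresponding weights are never updated
- Workaround: initialize ReLU neurons with a small positive bias (e.g. 0.01)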

Leaky ReLU

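Does not 'saturate'
Cheap to compute
Converges faster than tanh(x) and sigmoid (typically about 6x)
Does not share the drawbacks of ReLU (it will not 'die')
Parametric Rectifier (PReLU): $f(x) = \max(\alpha x, x)$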

ELU

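Has the benefits of ReLU
Output mean is close to 0
Compared with Leaky ReLU, its negative saturation regime is more robust to noise

Maxout 'Neuron', Goodfellow, 2013

Recommendations

Use ReLU by default, and choose the learning rate carefully
Try Leaky ReLU, Maxout, and ELU
Try tanh (but don't expect much)
Don't use sigmoid!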

Data Preprocessing

Before preprocessing, the classification loss is very sensitive to changes in the weight matrix, which makes it hard to optimize

preprocess the data

zero-mean

X -= np.mean(X, axis=0)

normalized (common in machine learning, but usually unnecessary for images)

X /= np.std(X, axis=0)

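PCA/Whitening: like normalization, not commonly used for images

Example

For CIFAR-10 images of size 32×32×3, there are two common approaches:
- Subtract the mean image computed over all samples (AlexNet)
(the mean image is a 32×32×3 array)
- Subtract the per-channel mean (VGGNet)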

Weight initialization

small random numbers (Gaussian with mean 0 and standard deviation 0.01)

W = 0.01 * np.random.randn(D,H)

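- Works for small networks but not for deep ones (the variance decays layer by layer until all activations become 0)
- A better approach: [‘Xavier initialization’](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.207.2059&rep=rep1&type=pdf)
- Another approach: [He et al. 2015](https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf)

- What happens when the initialization is off:
- Too large: activations saturate (with tanh) and gradients are 0
- Too small: activations go to 0 and gradients are 0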
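The effect of the initialization scale can be seen in a small experiment: push a random batch through a stack of tanh layers and look at the spread of the final activations. The layer count, width, batch size, and the helper name `final_activation_std` are arbitrary choices for this sketch; the Xavier scale used is 1/sqrt(fan_in).

```python
import numpy as np

def final_activation_std(scale_fn, n_layers=10, width=256, seed=0):
    """Std of the activations after n_layers of tanh(a @ W), with W ~ scale * N(0, 1)."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((200, width))             # unit-Gaussian input batch
    for _ in range(n_layers):
        W = scale_fn(width) * rng.standard_normal((width, width))
        a = np.tanh(a @ W)
    return a.std()

# Fixed small scale: activations shrink at every layer.
std_small = final_activation_std(lambda fan_in: 0.01)

# Xavier scale: the spread of the activations is roughly preserved.
std_xavier = final_activation_std(lambda fan_in: 1.0 / np.sqrt(fan_in))
```

With the small fixed scale the activations collapse toward zero after ten layers, whereas the Xavier scale keeps them in a usable range.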

Batch Normalization

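Steps:
1. Compute the mean and variance independently for each dimension
2. Normalize
3. (Usually) inserted after fully connected and convolutional layers
Advantages:
- Improves gradient flow through the network
- Allows higher learning rates
- Reduces the dependence on (weight) initialization
Note:
- At test time the mean is not recomputed from the data; the values estimated during training are used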
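A forward pass for batch normalization might be sketched as below. Several pieces are assumptions that go beyond the notes: the learnable scale `gamma` and shift `beta`, the `eps` stabilizer, and the use of running averages (with a hypothetical `momentum` of 0.9) as the statistics reused at test time.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running, train=True, momentum=0.9, eps=1e-5):
    """Normalize each feature (column) of x; x has shape (batch, features)."""
    if train:
        mean = x.mean(axis=0)                 # per-dimension statistics
        var = x.var(axis=0)
        # Track running estimates for use at test time.
        running["mean"] = momentum * running["mean"] + (1 - momentum) * mean
        running["var"] = momentum * running["var"] + (1 - momentum) * var
    else:
        # At test time, reuse the statistics collected during training.
        mean, var = running["mean"], running["var"]
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = 3.0 + 2.0 * rng.standard_normal((64, 4))          # made-up activations
running = {"mean": np.zeros(4), "var": np.ones(4)}

out_train = batchnorm_forward(x, np.ones(4), np.zeros(4), running, train=True)
saved_mean = running["mean"].copy()
out_test = batchnorm_forward(x, np.ones(4), np.zeros(4), running, train=False)
```

In training mode each output column has mean close to 0 and standard deviation close to 1; in test mode the running statistics are read but not modified.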

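Some problems that show up during training

Loss does not go down, or goes down very slowly (while accuracy keeps rising)
- The learning rate is too low. Because the prediction only depends on which class has the highest softmax score, accuracy can rise while the loss barely changes
Loss becomes NaN (explodes)
- The learning rate is too high
Learning rates between 1e-3 and 1e-5 are a typical range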

Hyperparameter Optimization

Cross-validation

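First run only a few iterations to find a rough range of workable hyperparameters
Then run more iterations to narrow the search to precise values
- If the cost ever exceeds 3 times the initial cost, the setting is bad: stop early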

Lecture7.Training Neural Networks2

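The lecture slides are here

Fancier optimization

Problems with SGD:
- If the loss changes slowly along one direction and quickly along another, gradient descent zig-zags along the fast direction while making little progress along the slow one; in high-dimensional feature spaces this oscillation is very inefficient
- Local minima and saddle points: at a saddle point some directions go down while others go up; this is most common in high-dimensional spaces
- 'Stochastic': the minibatch gradients are noisy, which makes progress inefficient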

SGD+Momentum (an improvement on SGD)

Vanilla SGD:

$x_{t+1} = x_t - \alpha \nabla f(x_t)$

    while True:
        dx = compute_gradient(x)
        x -= learning_rate * dx  # step against the gradient

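SGD+Momentum:

Define v as the "velocity", a running mean of the gradients
Define ρ as the "friction", typically 0.9 or 0.99

$v_{t+1} = \rho v_t + \nabla f(x_t)$

$x_{t+1} = x_t - \alpha v_{t+1}$

    vx = 0
    while True:
        dx = compute_gradient(x)
        vx = rho * vx + dx
        x -= learning_rate * vx  # step against the velocity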

A variant of momentum: Nesterov Momentum

Includes a correction term; near a local minimum it finds the right position faster than regular momentum
$v_{t+1} = \rho v_t - \alpha \nabla f(\tilde{x}_t)$
$\tilde{x}_{t+1} = \tilde{x}_t - \rho v_t + (1 + \rho) v_{t+1} = \tilde{x}_t + v_{t+1} + \rho (v_{t+1} - v_t)$

    dx = compute_gradient(x)
    old_v = v
    v = rho * v - learning_rate * dx
    x += -rho * old_v + (1 + rho) * v

AdaGrad (not commonly used)

    grad_squared = 0
    while True:
        dx = compute_gradient(x)
        grad_squared += dx * dx
        x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)

RMSProp (a variant of AdaGrad)

Does not overshoot the way momentum does

    grad_squared = 0
    while True:
        dx = compute_gradient(x)
        grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
        x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)

Adam(AdaGrad + Momentum)

Combines the strengths of both methods
Adds a bias correction, which compensates for the first and second moment estimates starting at 0
beta1 = 0.9, beta2 = 0.999, learning_rate = 1e-3 or 5e-4 (a good starting point for most models)

    first_moment = 0
    second_moment = 0
    t = 0
    while True:
        t += 1
        dx = compute_gradient(x)
        # Momentum
        first_moment = beta1 * first_moment + (1 - beta1) * dx
        # AdaGrad/RMSProp
        second_moment = beta2 * second_moment + (1 - beta2) * dx * dx
        # Bias correction
        first_unbias = first_moment / (1 - beta1 ** t)
        second_unbias = second_moment / (1 - beta2 ** t)
        # AdaGrad/RMSProp
        x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)

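How should the learning rate be set for the methods above?

Decay the learning rate as the number of iterations grows!
- step decay: lower the learning rate after every fixed number of steps
- exponential decay: $\alpha = \alpha_0 e^{-kt}$
- 1/t decay: $\alpha = \alpha_0 / (1 + kt)$
But don't set up decay from the start: tune the other hyperparameters first, and add learning rate decay at the end when refining the model
Commonly used with SGD+Momentum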
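The three schedules can be written as small functions of the iteration (or epoch) counter t. The parameter names and the default values for `drop` and `every` in the step schedule are assumptions for this sketch.

```python
import math

def step_decay(alpha0, t, drop=0.5, every=10):
    """Multiply the learning rate by `drop` once every `every` steps."""
    return alpha0 * drop ** (t // every)

def exponential_decay(alpha0, t, k):
    """alpha = alpha0 * exp(-k * t)"""
    return alpha0 * math.exp(-k * t)

def inverse_time_decay(alpha0, t, k):
    """alpha = alpha0 / (1 + k * t)"""
    return alpha0 / (1 + k * t)

# Learning rates at a few checkpoints, starting from alpha0 = 0.1.
schedule = [
    (t, step_decay(0.1, t), exponential_decay(0.1, t, 0.05), inverse_time_decay(0.1, t, 0.05))
    for t in (0, 10, 25)
]
```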

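First-order optimization

Use the gradient to form a linear approximation
Step along the gradient direction to approximately minimize the loss function

Second-order optimization: Newton step (Newton's method, Wikipedia)

Use the gradient and the Hessian to form a quadratic approximation
Step directly to the minimum of that approximation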

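Newton step (Newton's method, Wikipedia)

Second-order Taylor expansion:
$J(\theta) \approx J(\theta_0) + (\theta - \theta_0)^\top \nabla_\theta J(\theta_0) + \frac{1}{2} (\theta - \theta_0)^\top H (\theta - \theta_0)$
- Solving for the critical point gives the Newton parameter update
  $\theta^* = \theta_0 - H^{-1} \nabla_\theta J(\theta_0)$
- The update has no learning rate; it solves for the minimum directly. The problem is that the Hessian has $O(N^2)$ elements, and inverting it takes $O(N^3)$ time.
- So this method is not commonly used in deep learning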
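On an exactly quadratic loss the Taylor expansion is exact, so a single Newton step lands on the minimum. The small example below illustrates this; the matrix `A`, the vector `b`, and the starting point are made up, and `np.linalg.solve` is used in place of an explicit inverse.

```python
import numpy as np

# Quadratic loss J(theta) = 0.5 * theta^T A theta - b^T theta, whose Hessian is A.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])                    # symmetric positive definite
b = np.array([1.0, -1.0])

def J(theta):
    return 0.5 * theta @ A @ theta - b @ theta

def grad(theta):
    return A @ theta - b

theta0 = np.array([5.0, -3.0])

# Newton update: theta* = theta0 - H^{-1} grad(theta0).
# Solving the linear system avoids forming H^{-1} explicitly.
theta_star = theta0 - np.linalg.solve(A, grad(theta0))
```

Even with `solve`, the cost grows cubically with the number of parameters, which is the obstacle mentioned above for networks with millions of weights.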

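Quasi-Newton methods (improved Newton)

BFGS: Broyden–Fletcher–Goldfarb–Shanno algorithm
L-BFGS: Limited-memory BFGS

In summary, Adam is usually a good choice and works for most models; if you can train on the full dataset in every update, try L-BFGS (and take care to remove sources of noise)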

Model Ensembles:Tips and Tricks

Train several independent models and average their results at test time (typically about 2% extra accuracy)
trick: snapshot ensembles
trick: Polyak averaging

    while True:
        data_batch = dataset.sample_data_batch()
        loss = network.forward(data_batch)
        dx = network.backward()
        x += - learning_rate * dx
        # use for test set
        x_test = 0.995 * x_test + 0.005 * x
