cs231n Lecture 3


Note

Unless otherwise noted, all figures in this post and in this series come from the Stanford cs231n course.
If you repost, please credit the source and include this link: http://vision.stanford.edu/teaching/cs231n/syllabus.html
Feel free to contact me or leave a comment.

Abstract

The last lecture walked through the overall image classification pipeline and then introduced a linear score function.

Today's main topics: loss functions + optimization.

Loss function

Last lecture we defined a linear score function, but we saw that some images get classified correctly and others do not. How do we measure how bad the wrong results are, i.e. how do we quantify our unhappiness with the scores across the training data? We introduce a loss function.

Multiclass SVM loss (hinge loss)

Given an example $(x_i, y_i)$, where $x_i$ is the image and $y_i$ is the (integer) label, and using the shorthand $s = f(x_i, W)$ for the vector of class scores,

the SVM loss has the form $L_i = \sum_{j \neq y_i} \max(0,\, s_j - s_{y_i} + 1)$, and the full training loss is the mean over all examples in the training data: $L = \frac{1}{N} \sum_{i=1}^{N} L_i$.
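A minimal NumPy sketch of this loss (the function names are my own, not from the lecture):

    import numpy as np

    def svm_loss_single(scores, y, delta=1.0):
        # scores: 1-D array of class scores s = f(x_i, W); y: index of the correct class
        margins = np.maximum(0, scores - scores[y] + delta)
        margins[y] = 0                      # the sum runs over j != y_i only
        return np.sum(margins)

    def svm_loss_full(all_scores, labels):
        # full training loss: the mean of L_i over all N examples
        return np.mean([svm_loss_single(s, y) for s, y in zip(all_scores, labels)])

For example, scores [3.2, 5.1, -1.7] with the correct class first give L_i = max(0, 5.1 - 3.2 + 1) + max(0, -1.7 - 3.2 + 1) = 2.9 + 0 = 2.9.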

Q: what if the sum was instead over all classes? (including j = y_i)
The loss of every example increases by exactly 1, since the $j = y_i$ term contributes $\max(0, s_{y_i} - s_{y_i} + 1) = 1$; the whole loss is just shifted by a constant.

Q: what if we used a mean instead of a sum here?
Nothing important changes: the loss is only rescaled by a constant factor, so the same W is optimal.

Q: what if we used the square, $\max(0, s_j - s_{y_i} + 1)^2$, instead?
That gives a different (nonlinear) loss, the squared hinge loss, which penalizes large margin violations much more heavily and sometimes works better in practice. (I do not fully understand why it is sometimes better.)

Q: what is the min/max possible loss?
Zero (when every margin is satisfied); infinity (the scores can be arbitrarily wrong).

Q: usually at initialization W are small numbers, so all s ~= 0. What is the loss?
Number of classes minus 1: with all scores near zero, each of the C − 1 incorrect classes contributes $\max(0, 0 - 0 + 1) = 1$. This is a useful sanity check at the start of training.

So now we have the loss function for the multiclass SVM. However, this formulation has an obvious bug: the W that achieves the minimum loss is not unique (for example, if W gives zero loss, then 2W does too). So which W should we prefer?

Weight Regularization

The regularized loss has two parts: $L = \frac{1}{N} \sum_{i=1}^{N} L_i + \lambda R(W)$, a data loss plus a regularization loss (a common choice is the L2 penalty $R(W) = \sum_k \sum_l W_{k,l}^2$).

The first part tries to fit the training data; the second part tries to make W "nice" (small and simple). The two parts fight each other: training performance may get slightly worse, but generalization to the test set improves.
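A sketch of the combined objective, assuming an L2 penalty for R(W) and a made-up regularization strength reg:

    import numpy as np

    def full_loss(W, X, y, reg=1e-3):
        # W: D x C weights, X: N x D images, y: N correct-class indices
        scores = X.dot(W)                                   # N x C, s = f(x_i, W)
        correct = scores[np.arange(len(y)), y][:, None]     # score of the true class, N x 1
        margins = np.maximum(0, scores - correct + 1.0)
        margins[np.arange(len(y)), y] = 0                   # drop the j == y_i terms
        data_loss = np.sum(margins) / len(y)                # mean SVM loss over the data
        reg_loss = reg * np.sum(W * W)                      # L2 regularization, lambda * sum W^2
        return data_loss + reg_loss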

Softmax loss (cross-entropy loss)

Softmax classifier: interpret the class scores as unnormalized log probabilities, $P(Y = k \mid X = x_i) = \frac{e^{s_k}}{\sum_j e^{s_j}}$, and minimize the negative log likelihood of the correct class: $L_i = -\log P(Y = y_i \mid X = x_i)$.

Q: What is the min/max possible loss L_i?
Zero; infinity (approached as the probability of the correct class goes to 1 or to 0, respectively).

Q: usually at initialization W are small numbers, so all s ~= 0. What is the loss?
$-\log(1/\text{number of classes}) = \log(\text{number of classes})$, another useful sanity check at the start of training.
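A sketch of the cross-entropy loss for one example; the max-subtraction line is the usual trick to avoid overflow in the exponentials (the function name is mine):

    import numpy as np

    def softmax_loss_single(scores, y):
        # scores: unnormalized log probabilities s = f(x_i, W); y: index of the correct class
        shifted = scores - np.max(scores)                    # shifting does not change the probabilities
        log_probs = shifted - np.log(np.sum(np.exp(shifted)))
        return -log_probs[y]                                 # L_i = -log P(Y = y_i | X = x_i)

With all scores near zero this returns log(number of classes), matching the sanity check just above.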

We have now seen both the hinge loss and the cross-entropy loss; an interactive visualization of the two is available here: Linear Classification Loss Visualization.

Optimization

Optimization means finding the parameters W that minimize the loss function.

Strategy #2: Follow the slope

In multiple dimensions, the gradient is the vector of (partial derivatives).
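Written out, these are the standard calculus definitions being used:

    \frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h},
    \qquad
    \nabla_W L = \left[ \frac{\partial L}{\partial W_1}, \frac{\partial L}{\partial W_2}, \ldots \right]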

Evaluating the gradient numerically

  • approximate
  • very slow to evaluate
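A sketch of a centered-difference numerical gradient; it loops over every entry of W and evaluates the loss twice per entry, which is exactly why it is so slow (the names and the step h are my choices):

    import numpy as np

    def numerical_gradient(f, W, h=1e-5):
        # f: a function of the whole weight array, e.g. lambda W: full_loss(W, X, y)
        grad = np.zeros_like(W)
        it = np.nditer(W, flags=['multi_index'], op_flags=['readwrite'])
        while not it.finished:
            idx = it.multi_index
            old = W[idx]
            W[idx] = old + h
            fxph = f(W)                           # f(W + h) along this one coordinate
            W[idx] = old - h
            fxmh = f(W)                           # f(W - h)
            W[idx] = old                          # restore the original value
            grad[idx] = (fxph - fxmh) / (2 * h)   # centered difference approximation
            it.iternext()
        return grad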

Calculus: derive the analytic gradient with calculus instead of approximating it numerically.

In summary:

  • Numerical gradient: approximate, slow, easy to write
  • Analytic gradient: exact, fast, error-prone

=>In practice:
Always use analytic gradient, but check implementation with numerical gradient. This is called a gradient check.
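A common way to do the check is to compare the two gradients with a relative error (a sketch; the 1e-7 threshold is a frequently used rule of thumb, not a hard rule):

    import numpy as np

    def gradient_check(analytic_grad, numeric_grad, tol=1e-7):
        # maximum elementwise relative error between the two gradients
        num = np.abs(analytic_grad - numeric_grad)
        den = np.maximum(np.abs(analytic_grad) + np.abs(numeric_grad), 1e-12)
        rel_error = np.max(num / den)
        return rel_error, rel_error < tol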

Gradient Descent

Repeatedly take a step in the negative gradient direction, which is the direction of steepest local decrease of the loss.
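A runnable sketch of the update loop on a toy quadratic loss (the toy problem, names, and step size are mine, standing in for the real classifier loss):

    import numpy as np

    def loss_fun(w):
        return np.sum((w - 3.0) ** 2)              # toy loss, minimized at w = 3

    def evaluate_gradient(w):
        return 2.0 * (w - 3.0)                     # analytic gradient of the toy loss

    weights = np.zeros(5)
    step_size = 0.1                                # the learning rate
    for _ in range(100):
        weights_grad = evaluate_gradient(weights)
        weights += -step_size * weights_grad       # parameter update in the negative gradient direction

    final_loss = loss_fun(weights)                 # should be close to zero after the loop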

Mini-batch Gradient Descent

  • only use a small portion of the training set to compute the gradient (see the sketch after this list).
  • Common mini-batch sizes are 32/64/128 examples. e.g. Krizhevsky ILSVRC ConvNet used 256 examples
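The same loop, but each update uses only a sampled mini-batch; everything below (the synthetic data, the toy regression loss, the batch size of 256) is made up for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(1024, 10))          # synthetic training set: N x D
    true_w = rng.normal(size=10)
    y_train = X_train.dot(true_w)                  # synthetic regression targets

    weights = np.zeros(10)
    step_size, batch_size = 1e-2, 256
    for _ in range(500):
        idx = rng.choice(len(y_train), size=batch_size, replace=False)
        X_batch, y_batch = X_train[idx], y_train[idx]                            # small portion of the data
        grad = 2 * X_batch.T.dot(X_batch.dot(weights) - y_batch) / batch_size    # gradient on the batch only
        weights += -step_size * grad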


The effects of step size (or “learning rate”)

If the step size is too large, the loss can bounce around or blow up; if it is too small, progress is very slow. The learning rate is therefore one of the first hyperparameters to get right.

Next class: Becoming a backprop ninja and Neural Networks (part 1).
