20161206#cs231n#2. Linear Classifiers, Assignment1--SVM&Softmax

Source: Internet | Published by: 多功能机软件下载 | Editor: 程序博客网 | Time: 2024/05/23 01:13

Course website

Linear classification: Support Vector Machine, Softmax

Linear Classifier

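A linear classifier is essentially just a linear mapping.
Score function:

$$f(x_i, W, b) = W x_i + b$$

f is the prediction for $x_i$ (a vector of class scores, from which the label $y_i$ is predicted). W is called the weights and b the bias vector, where $x_i$ is a column vector.
Here is a simple example, quoted from the course notes: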

In the above equation, we are assuming that the image xi has all of its pixels flattened out to a single column vector of shape [D x 1]. The matrix W (of size [K x D]), and the vector b (of size [K x 1]) are the parameters of the function. In CIFAR-10, xi contains all pixels in the i-th image flattened into a single [3072 x 1] column, W is [10 x 3072] and b is [10 x 1], so 3072 numbers come into the function (the raw pixel values) and 10 numbers come out (the class scores).
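The shapes in the quoted passage can be checked with a few lines of NumPy. This is only a minimal sketch: the random values stand in for a real CIFAR-10 image and for learned parameters, and the variable names are chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 10, 3072                            # number of classes, pixels per flattened image

x_i = rng.random((D, 1))                   # one image as a [D x 1] column (placeholder values)
W = 0.001 * rng.standard_normal((K, D))    # weights, [K x D] (placeholder, not trained)
b = np.zeros((K, 1))                       # bias vector, [K x 1]

scores = W @ x_i + b                       # f(x_i, W, b) = W x_i + b, shape [K x 1]
print(scores.shape)
```

So 3072 numbers go in and 10 class scores come out, as the passage describes.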

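The advantage is that the training set is only needed to learn W and b. After that it can be discarded, and plugging the test set data into the function gives the predictions $y_i$.

A major problem with linear classifiers is that W is determined rather rigidly by the training set, which easily leads to misclassification; neural networks are needed to address this.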
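Usually $f(x_i, W, b) = W x_i + b$ is rewritten as

$$f(x_i, W) = W x_i$$

by appending to $x_i$ one extra dimension holding the constant 1 in place of the bias, which simplifies the equation.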

With our CIFAR-10 example, xi is now [3073 x 1] instead of [3072 x 1] - (with the extra dimension holding the constant 1), and W is now [10 x 3073] instead of [10 x 3072]. The extra column that W now has corresponds to the bias b.
See the course web page for a concrete example.

This way, adding a single dimension means only one matrix W has to be learned, instead of learning both the matrix that stores W and the vector that stores b.
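The equivalence behind the bias trick can be verified numerically. The sketch below assumes random placeholder values for W, b and x_i; the names `x_ext` and `W_ext` are introduced here only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
K, D = 10, 3072
x_i = rng.random((D, 1))
W = rng.standard_normal((K, D))
b = rng.standard_normal((K, 1))

# Append the constant 1 to x_i and append b as an extra column of W.
x_ext = np.vstack([x_i, np.ones((1, 1))])   # [3073 x 1]
W_ext = np.hstack([W, b])                   # [10 x 3073]

# Both formulations give the same class scores.
same = np.allclose(W @ x_i + b, W_ext @ x_ext)
print(x_ext.shape, W_ext.shape, same)
```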

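Image pixels should be preprocessed with mean normalization and feature scaling.
Concretely, [0,255] → [-127,127] → [-1, 1]. This seems slightly off to me, since it would drop some values; the middle range should be [-128,127].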
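Both steps can be sketched as follows. The data here is a hypothetical batch of random uint8 images, and two variants are shown: the fixed offset discussed above, and subtracting the per-pixel mean of the training set. Which one a given pipeline uses is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical mini "training set": 5 images of 32x32x3 uint8 pixels.
X_train = rng.integers(0, 256, size=(5, 32, 32, 3), dtype=np.uint8)

# Flatten each image into a row and convert to float before doing arithmetic.
X = X_train.reshape(X_train.shape[0], -1).astype(np.float64)

# Fixed-offset variant: [0, 255] -> [-128, 127] -> roughly [-1, 1].
X_fixed = (X - 128.0) / 128.0

# Mean-image variant: subtract the per-pixel mean computed over the training set.
mean_image = X.mean(axis=0)
X_centered = X - mean_image

print(X_fixed.min(), X_fixed.max())
print(np.abs(X_centered.mean(axis=0)).max())
```

Note that with the offset 128 the scaled values lie between -1 and 127/128, so the upper end is slightly below 1.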

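Loss function

The loss function is also called the cost function or the objective. The smaller the loss, the better the predictions. Two losses commonly used with linear classifiers are introduced below: the Multiclass Support Vector Machine loss and the Softmax classifier.

Multiclass Support Vector Machine (SVM) loss

This is a common way to define the loss function.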

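$$s_j = f(x_i; W)_j$$

$$L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + \Delta)$$

$$L = \frac{1}{N} \sum_i \sum_{j \neq y_i} \left[\max(0, f(x_i; W)_j - f(x_i; W)_{y_i} + \Delta)\right] + \lambda \sum_k \sum_l W_{k,l}^2$$

Hinge loss: max(0, ...), where the ... stands for some expression; it is simply a function thresholded at 0 (nothing special, it seems).
Regularization penalty: $R(W) = \sum_k \sum_l W_{k,l}^2$
Data loss: $\frac{1}{N} \sum_i L_i$
Regularization loss: $\lambda R(W)$
Margin: Δ, usually set to 1.0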
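The definitions above can be sketched for a single example. The scores, the label, Δ = 10, the small matrix W and λ = 0.1 below are all illustrative values chosen here, and `svm_loss_single` is a hypothetical helper name.

```python
import numpy as np

def svm_loss_single(scores, y, delta=1.0):
    """L_i = sum over j != y of max(0, s_j - s_y + delta)."""
    margins = np.maximum(0.0, scores - scores[y] + delta)
    margins[y] = 0.0               # the correct class is excluded from the sum
    return margins.sum()

scores = np.array([13.0, -7.0, 11.0])
loss = svm_loss_single(scores, y=0, delta=10.0)
print(loss)  # max(0, -7-13+10) + max(0, 11-13+10) = 0 + 8 = 8.0

# Regularization loss lambda * R(W) for a small illustrative W.
lam = 0.1
W = np.array([[1.0, -1.0], [0.5, 0.0]])
reg_loss = lam * np.sum(W * W)
print(reg_loss)
```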

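The SVM's classification is directional, as the arrows in the cs231n figure show. See also:
知乎: the answer by 靠靠靠谱
CSDN: SVM, an overview of the support vector machine algorithm
Binary Support Vector Machines: see the explanation in CS231n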

Softmax Classifier

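The Softmax classifier simply generalizes the logistic regression classifier to the multiclass setting.
First define f as the vector of scores, analogous to the vector s above.
Softmax function: $f_j(z) = \frac{e^{z_j}}{\sum_k e^{z_k}}$
Cross-entropy loss:

$$L_i = -\ln\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right) \quad \text{or equivalently} \quad L_i = -f_{y_i} + \ln \sum_j e^{f_j}$$

The smaller the cross-entropy loss, the better.
The closer the cross-entropy loss is to 0, the closer $\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}$ is to 1, i.e. the higher the probability the softmax classifier assigns to the correct class $y_i$.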

The probability assigned to each class is

$$P(y_i \mid x_i; W) = \frac{e^{f_{y_i}}}{\sum_j e^{f_j}}$$

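Clearly the exponentials can become very large, so both the numerator and the denominator can be huge, and we need a suitable way to keep the computation numerically stable.
Since

$$\frac{e^{f_{y_i}}}{\sum_j e^{f_j}} = \frac{C e^{f_{y_i}}}{C \sum_j e^{f_j}} = \frac{e^{f_{y_i} + \ln C}}{\sum_j e^{f_j + \ln C}}$$

we can set $\ln C = -\max_j f_j$. The largest exponent $f_j + \ln C$ in the numerator and denominator is then 0, which keeps $e^{f_j + \ln C}$ from becoming too large.

Summed over all classes, the probabilities $P(\cdot \mid x_i; W)$ add up to 1.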
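The shift by $\ln C = -\max_j f_j$ can be sketched as below. The score vector is an illustrative choice with values large enough that exponentiating them directly would overflow in double precision, and `softmax_stable` is a name introduced here.

```python
import numpy as np

def softmax_stable(f):
    """Softmax with the shift ln C = -max_j f_j applied before exponentiating."""
    shifted = f - np.max(f)        # largest exponent becomes 0
    exp = np.exp(shifted)
    return exp / exp.sum()

f = np.array([123.0, 456.0, 789.0])   # np.exp(789.0) alone would overflow to inf
p = softmax_stable(f)
print(p, p.sum())
```

The shift does not change the result mathematically, since the factor C cancels between numerator and denominator.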

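SVM vs. Softmax Classifier

Compared with the SVM, the softmax classifier provides a confidence for every class, whereas the SVM only gives a raw score.
λ has a strong effect on softmax: when λ is very small, some probabilities can become very large, while a larger λ tends to make the class probabilities closer to each other.
λ directly affects the scores and thereby indirectly affects the final probabilities.

$[1, -2, 0] \rightarrow [e^{1}, e^{-2}, e^{0}] = [2.71, 0.14, 1] \rightarrow [0.7, 0.04, 0.26]$
Increasing λ penalizes W more, which shrinks the scores (to [0.5, −1, 0]) and in the end makes the probabilities closer to each other:
$[0.5, -1, 0] \rightarrow [e^{0.5}, e^{-1}, e^{0}] = [1.65, 0.37, 1] \rightarrow [0.55, 0.12, 0.33]$

For the SVM (with Δ=1), however, the scores [10, -100, -100] and [10, 9, 9] make no difference (taking the first class as the correct one), because the loss is 0 in both cases.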
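The two comparisons above can be reproduced numerically. This is a sketch under the assumption that class 0 is the correct class in the SVM example; the helper names are chosen here.

```python
import numpy as np

def softmax(f):
    e = np.exp(f - np.max(f))      # shifted for numerical stability
    return e / e.sum()

def svm_loss(scores, y, delta=1.0):
    margins = np.maximum(0.0, scores - scores[y] + delta)
    margins[y] = 0.0
    return margins.sum()

# Softmax: halving the scores (stronger regularization) flattens the probabilities.
p_weak_reg = softmax(np.array([1.0, -2.0, 0.0]))
p_strong_reg = softmax(np.array([0.5, -1.0, 0.0]))
print(np.round(p_weak_reg, 2), np.round(p_strong_reg, 2))

# SVM with delta = 1: both score vectors already satisfy the margin.
loss_a = svm_loss(np.array([10.0, -100.0, -100.0]), y=0)
loss_b = svm_loss(np.array([10.0, 9.0, 9.0]), y=0)
print(loss_a, loss_b)
```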

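Task: explain the linear classifier

A linear classifier uses $f(x_i, W) = W x_i$ to define hyperplanes that separate the points of different classes onto the two sides of the plane. One side of a plane is designated the positive direction, and the points lying on that side are the ones the classifier predicts as belonging to the class.
The hyperplanes are learned from the training set, and the key ingredient is W. W can be viewed as a set of templates: each row holds the parameters used to score one class, and the inner products with $x_i$, i.e. $f(x_i, W)$, are the scores of the different classes estimated from those parameters. The loss function is applied to the scores during training, and the class with the highest score is picked as the prediction.

Compared with kNN, the advantage of this parametric approach is that the training set does not have to be traversed again for every prediction: once training has produced the parameters W, the training set can be discarded. Afterwards, a matrix multiplication between the test data and W is enough to estimate the scores.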
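Prediction with a parametric model can be sketched as a single matrix product followed by an argmax. The weights below are random placeholders standing in for learned parameters, and the row-wise layout (X of shape N x D, W of shape D x C) follows the convention used in the assignment code later in this post.

```python
import numpy as np

rng = np.random.default_rng(2)
N_test, D, C = 4, 3073, 10
W = 0.001 * rng.standard_normal((D, C))   # placeholder for learned weights
X_test = rng.random((N_test, D))
X_test[:, -1] = 1.0                       # constant-1 column from the bias trick

scores = X_test @ W                       # one row of class scores per test image
y_pred = scores.argmax(axis=1)            # predicted class = highest score
print(y_pred)
```

No training data is touched at this point, which is the contrast with kNN described above.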

How to find the parameters that minimize the loss is the subject of optimization (Optimization).

Assignment1–SVM

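This involves matrix derivatives,
so I looked up some matrix-derivative formulas; what is used here is the key result from item 7 there, the derivative of a scalar y with respect to a matrix X.

In the Assignment provided by cs231n (this definition looks odd, but that is really how it is written there):

$$L_{ij} = X_i \cdot W_j - X_i \cdot W_{y_i} + 1, \qquad X \in N \times D, \quad W \in D \times C, \quad X_i \in 1 \times D, \quad W_j, W_{y_i} \in D \times 1$$

where D is the dimensionality of the flattened input image, C is the number of classes, and N is the number of samples in a minibatch.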

At initialization,

$$dW = [0, 0, \ldots, 0]$$

(where each 0 is a D×1 column vector).

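When the margin $L_{ij}$ is positive, for $k = j$ (with $j \neq y_i$) we have

$$\frac{\partial L_{ij}}{\partial W_k} = X_i^T$$

and for $k = y_i$ we have

$$\frac{\partial L_{ij}}{\partial W_k} = -X_i^T$$

For every other k the derivative is 0. (Here the columns of W, such as $W_j$, are treated as the variables being differentiated.)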

$$\frac{\partial L_{ij}}{\partial W} = \frac{\partial (X_i \cdot W_j - X_i \cdot W_{y_i} + 1)}{\partial [W_1, W_2, \ldots, W_{Num\_Classes}]} = [0, 0, \ldots, X_i^T, \ldots, -X_i^T, \ldots, 0]$$

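Accumulate $dW \mathrel{+}= \frac{\partial L_{ij}}{\partial W}$ using the expression above, looping over all i∗j combinations; this yields the final gradient dW.
See the code on my computer for details.
References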
http://blog.csdn.net/zengdong_1991/article/details/51346201
http://blog.csdn.net/yc461515457/article/details/51921607
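The accumulation of dW over all (i, j) pairs described in this section can be sketched as follows. This is not the author's code, only a minimal illustration with Δ = 1. It assumes the `0.5 * reg` regularization convention used in the softmax.py listing later in this post, and it checks the result against a centered-difference numerical gradient on small random data.

```python
import numpy as np

def svm_loss_naive(W, X, y, reg):
    """Multiclass SVM loss and gradient with explicit loops (Delta = 1)."""
    num_train, num_classes = X.shape[0], W.shape[1]
    loss = 0.0
    dW = np.zeros_like(W)                      # D x C, starts as all zeros
    for i in range(num_train):
        scores = X[i].dot(W)
        for j in range(num_classes):
            if j == y[i]:
                continue
            margin = scores[j] - scores[y[i]] + 1.0   # L_ij
            if margin > 0:                     # only positive margins contribute
                loss += margin
                dW[:, j] += X[i]               # +X_i^T in column j
                dW[:, y[i]] -= X[i]            # -X_i^T in column y_i
    loss = loss / num_train + 0.5 * reg * np.sum(W * W)
    dW = dW / num_train + reg * W
    return loss, dW

rng = np.random.default_rng(0)
N, D, C = 5, 4, 3
X = rng.standard_normal((N, D))
y = rng.integers(0, C, size=N)
W = 0.01 * rng.standard_normal((D, C))

loss, dW = svm_loss_naive(W, X, y, reg=0.1)

# Centered-difference check of every entry of dW.
h = 1e-5
num_grad = np.zeros_like(W)
for idx in np.ndindex(*W.shape):
    old = W[idx]
    W[idx] = old + h
    loss_plus, _ = svm_loss_naive(W, X, y, 0.1)
    W[idx] = old - h
    loss_minus, _ = svm_loss_naive(W, X, y, 0.1)
    W[idx] = old
    num_grad[idx] = (loss_plus - loss_minus) / (2 * h)

max_err = np.abs(num_grad - dW).max()
print(loss, max_err)
```

The numerical check is only reliable away from the kinks of the hinge; with the small random W used here the margins stay close to 1, so that is not an issue.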

Assignment2–Softmax

$W \in D \times C$, $X \in N \times D$
The loss function is defined as

$$L_i = -f_{y_i} + \log \sum_j e^{f_j}$$

so

$$\frac{\partial L_i}{\partial W} = \frac{\partial \left(-f_{y_i} + \log \sum_j e^{f_j}\right)}{\partial W}$$
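One way to carry this derivative through, assuming $f_j = X_i \cdot W_j$ with $W_j$ the j-th column of W. The softmax output $p_j$ and the indicator $\mathbb{1}[\cdot]$ are notation introduced here, not taken from the original notes.

```latex
\begin{aligned}
p_j &= \frac{e^{f_j}}{\sum_k e^{f_k}}, \qquad f_j = X_i \cdot W_j \\
\frac{\partial L_i}{\partial f_j} &= \frac{e^{f_j}}{\sum_k e^{f_k}} - \mathbb{1}[j = y_i] = p_j - \mathbb{1}[j = y_i] \\
\frac{\partial L_i}{\partial W_j} &= \frac{\partial L_i}{\partial f_j}\,\frac{\partial f_j}{\partial W_j} = \left(p_j - \mathbb{1}[j = y_i]\right) X_i^T
\end{aligned}
```

This agrees with the code in softmax.py, where each sample adds $X_i^T p_j$ to column j of dW and subtracts $X_i^T$ from column $y_i$.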

http://blog.csdn.net/yc461515457/article/details/51924604
http://blog.csdn.net/xieyi4650/article/details/53332988

softmax.py

    import numpy as np
    from random import shuffle

    def softmax_loss_naive(W, X, y, reg):
      """
      Softmax loss function, naive implementation (with loops)

      Inputs have dimension D, there are C classes, and we operate on minibatches
      of N examples.

      Inputs:
      - W: A numpy array of shape (D, C) containing weights.
      - X: A numpy array of shape (N, D) containing a minibatch of data.
      - y: A numpy array of shape (N,) containing training labels; y[i] = c means
        that X[i] has label c, where 0 <= c < C.
      - reg: (float) regularization strength

      Returns a tuple of:
      - loss as single float
      - gradient with respect to weights W; an array of same shape as W
      """
      # Initialize the loss and gradient to zero.
      loss = 0.0
      dW = np.zeros_like(W)

      #############################################################################
      # TODO: Compute the softmax loss and its gradient using explicit loops.     #
      # Store the loss in loss and the gradient in dW. If you are not careful     #
      # here, it is easy to run into numeric instability. Don't forget the        #
      # regularization!                                                           #
      #############################################################################
      scores = X.dot(W)
      num_trains = scores.shape[0]
      # Subtract each row's maximum so the largest exponent is 0 (numeric stability).
      scores_max = np.max(scores, axis=1)
      scores -= scores_max[:, np.newaxis]
      scores_exp = np.exp(scores)
      scores_exp_sum = np.sum(scores_exp, axis=1)
      p = np.zeros(scores.shape)
      for i in range(num_trains):  # range instead of Python 2's xrange
        p[i, :] = scores_exp[i, :] / scores_exp_sum[i]
        loss -= np.log(p[i, y[i]])
      for i in range(num_trains):
        # Outer product X_i^T p_i, then subtract X_i^T from the correct-class column.
        dW += (X[i][:, np.newaxis]) * p[i]
        dW[:, y[i]] -= X[i, :].T
      loss /= num_trains
      loss += 0.5 * reg * np.sum(W * W)
      dW = dW / num_trains + reg * W
      #############################################################################
      #                          END OF YOUR CODE                                 #
      #############################################################################

      return loss, dW


    def softmax_loss_vectorized(W, X, y, reg):
      """
      Softmax loss function, vectorized version.

      Inputs and outputs are the same as softmax_loss_naive.
      """
      # Initialize the loss and gradient to zero.
      loss = 0.0
      dW = np.zeros_like(W)

      #############################################################################
      # TODO: Compute the softmax loss and its gradient using no explicit loops.  #
      # Store the loss in loss and the gradient in dW. If you are not careful     #
      # here, it is easy to run into numeric instability. Don't forget the        #
      # regularization!                                                           #
      #############################################################################
      scores = X.dot(W)
      num_trains = scores.shape[0]
      scores_max = np.max(scores, axis=1)
      scores -= scores_max[:, np.newaxis]
      scores_exp = np.exp(scores)
      scores_exp_sum = np.sum(scores_exp, axis=1)
      p = scores_exp / scores_exp_sum[:, np.newaxis]
      loss = -np.log(p[np.arange(num_trains), y]).sum()
      loss /= num_trains
      loss += 0.5 * reg * np.sum(W * W)
      # dL/dscores = p - 1 at the correct class, p elsewhere.
      p[np.arange(num_trains), y] -= 1
      dW = (X.T).dot(p) / num_trains + reg * W
      #############################################################################
      #                          END OF YOUR CODE                                 #
      #############################################################################

      return loss, dW
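Two quick sanity checks are often useful for code like the listing above: with very small random weights the loss should be close to log(C), and the analytic gradient should match a centered-difference estimate. The sketch below is self-contained, so it defines its own compact `softmax_loss` (a hypothetical helper that follows the same conventions as softmax.py) rather than importing the listing; the data is random and purely illustrative.

```python
import numpy as np

def softmax_loss(W, X, y, reg):
    """Compact vectorized softmax loss and gradient (same conventions as softmax.py)."""
    n = X.shape[0]
    scores = X.dot(W)
    scores -= scores.max(axis=1, keepdims=True)      # numeric stability shift
    p = np.exp(scores)
    p /= p.sum(axis=1, keepdims=True)
    loss = -np.log(p[np.arange(n), y]).mean() + 0.5 * reg * np.sum(W * W)
    p[np.arange(n), y] -= 1
    dW = X.T.dot(p) / n + reg * W
    return loss, dW

rng = np.random.default_rng(0)
N, D, C = 20, 6, 10
X = rng.standard_normal((N, D))
y = rng.integers(0, C, size=N)
W = 1e-4 * rng.standard_normal((D, C))

# Check 1: near-zero weights give near-uniform probabilities, so loss is about log(C).
loss, dW = softmax_loss(W, X, y, reg=0.0)
print(loss, np.log(C))

# Check 2: centered difference on one arbitrarily chosen entry of W.
h = 1e-5
idx = (2, 3)
W_plus, W_minus = W.copy(), W.copy()
W_plus[idx] += h
W_minus[idx] -= h
numeric = (softmax_loss(W_plus, X, y, 0.0)[0] - softmax_loss(W_minus, X, y, 0.0)[0]) / (2 * h)
print(numeric, dW[idx])
```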

Note

      scores_max=np.max(scores, axis=1)
      scores -= scores_max[:, np.newaxis]

Pay attention to the np.newaxis here.
For example:

    >>> a=np.arange(12).reshape(3,4)
    >>> a
    array([[ 0,  1,  2,  3],
           [ 4,  5,  6,  7],
           [ 8,  9, 10, 11]])
    >>> d=np.max(a,axis=1)
    >>> d
    array([ 3,  7, 11])
    >>> a-d[:,np.newaxis]
    array([[-3, -2, -1,  0],
           [-3, -2, -1,  0],
           [-3, -2, -1,  0]])
    >>> d[:,np.newaxis]
    array([[ 3],
           [ 7],
           [11]])
    >>> d.reshape(3,1)
    array([[ 3],
           [ 7],
           [11]])

So keep this in mind when vectorizing with numpy in the future.

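Applying .T to a 1-D array (a "row vector") does not turn it into a column vector; use np.newaxis (or reshape, as above) instead, as shown here:

    >>> b=np.arange(12)
    >>> b
    array([ 0,  1,  2, ...,  9, 10, 11])
    >>> b.T
    array([ 0,  1,  2, ...,  9, 10, 11])
    >>> b[:,np.newaxis]
    array([[ 0],
           [ 1],
           [ 2],
           ...,
           [ 9],
           [10],
           [11]])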
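As an alternative to np.newaxis, NumPy reductions accept `keepdims=True`, which keeps the reduced axis with length 1 so the result broadcasts against the original array directly. The sketch below compares the two approaches and also shows the shapes involved for a 1-D array.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)
row_max = np.max(a, axis=1)                   # shape (3,)
col_max = np.max(a, axis=1, keepdims=True)    # shape (3, 1), ready for broadcasting

shifted_newaxis = a - row_max[:, np.newaxis]
shifted_keepdims = a - col_max
print(np.array_equal(shifted_newaxis, shifted_keepdims))

b = np.arange(12)
# .T leaves a 1-D array unchanged; newaxis and reshape both produce a column.
print(b.T.shape, b[:, np.newaxis].shape, b.reshape(-1, 1).shape)
```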

For the code, see the reference URLs.

0 0