CS231n A1: KNN


Working through cs231n...

Assignment A1's starter code is written in Python 2, which differs from Python 3 in a few places, so some modifications are needed.

Since I'm just getting started, the code mostly follows this post: http://blog.csdn.net/zhyh1435589631/article/details/54236643. Flowers and thanks! o(* ̄▽ ̄*)ブ

I added a pile of comments; brief notes follow.



0. Dataset: CIFAR-10

10 classes; 50,000 images for training and 10,000 for testing, with the training set split into 5 batches.

Each image is 32*32*3.

To run the notebook, the downloaded dataset needs to be extracted into the datasets folder.

Each batch of the dataset is a dict with four keys: dict_keys([b'batch_label', b'labels', b'data', b'filenames']).

dict[b'data'].shape = (10000, 3072)

len(dict[b'labels']) = 10000
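
To poke at this structure yourself, a quick check along these lines works (the path below is just an example; point it at wherever you extracted the batches):

import pickle

# Example path; adjust to your own datasets folder.
path = 'cs231n/datasets/cifar-10-batches-py/data_batch_1'
with open(path, 'rb') as f:
    batch = pickle.load(f, encoding='bytes')  # Python 3 needs encoding='bytes'

print(batch.keys())           # dict_keys([b'batch_label', b'labels', b'data', b'filenames'])
print(batch[b'data'].shape)   # (10000, 3072)
print(len(batch[b'labels']))  # 10000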

import os
import pickle
import numpy as np

def load_CIFAR_batch(filename):
  """ load single batch of cifar """
  with open(filename, 'rb') as f:
    datadict = pickle.load(f, encoding='bytes')
    X = datadict[b'data']
    Y = datadict[b'labels']
    X = X.reshape(10000, 3, 32, 32).transpose(0, 2, 3, 1).astype("float")
    Y = np.array(Y)
    return X, Y

def load_CIFAR10(ROOT):
  """ load all of cifar """
  xs = []
  ys = []
  for b in range(1, 6):
    f = os.path.join(ROOT, 'data_batch_%d' % (b, ))
    X, Y = load_CIFAR_batch(f)  # one batch at a time: raw shape 10000*3*32*32, transposed to 10000*32*32*3
    xs.append(X)
    ys.append(Y)
  Xtr = np.concatenate(xs)
  Ytr = np.concatenate(ys)
  del X, Y
  Xte, Yte = load_CIFAR_batch(os.path.join(ROOT, 'test_batch'))
  return Xtr, Ytr, Xte, Yte

These two functions load the data.

Data structures: Xtr, Ytr: the 10000*5 = 50000 training images and their class indices

                  Xte, Yte: the 10000 test images and their class indices



1. KNN


# Load the raw CIFAR-10 data.
cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'
X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)

# As a sanity check, we print out the size of the training and test data.
print('Training data shape: ', X_train.shape)
print('Training labels shape: ', y_train.shape)
print('Test data shape: ', X_test.shape)
print('Test labels shape: ', y_test.shape)

Result:

Training data shape:  (50000, 32, 32, 3)
Training labels shape:  (50000,)
Test data shape:  (10000, 32, 32, 3)
Test labels shape:  (10000,)

X_train, y_train, X_test, y_test correspond to Xtr, Ytr, Xte, Yte, respectively.


Next, 7 images from each class are randomly selected and displayed.
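
The visualization cell is roughly the sketch below; I'm assuming matplotlib is imported as plt and the usual ten CIFAR-10 class names:

import numpy as np
import matplotlib.pyplot as plt

classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
num_classes = len(classes)
samples_per_class = 7
for y, cls in enumerate(classes):
    idxs = np.flatnonzero(y_train == y)  # all training images of class y
    idxs = np.random.choice(idxs, samples_per_class, replace=False)  # pick 7 at random
    for i, idx in enumerate(idxs):
        plt_idx = i * num_classes + y + 1  # fill the grid column by column
        plt.subplot(samples_per_class, num_classes, plt_idx)
        plt.imshow(X_train[idx].astype('uint8'))
        plt.axis('off')
        if i == 0:
            plt.title(cls)
plt.show()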


# Subsample the data for more efficient code execution in this exercise
num_training = 5000
mask = range(num_training)
X_train = X_train[mask]
y_train = y_train[mask]

num_test = 500
mask = range(num_test)
X_test = X_test[mask]
y_test = y_test[mask]

Subsampling: keep the first 5000 training images and the first 500 test images so the code runs faster.


# Reshape the image data into rows
X_train = np.reshape(X_train, (X_train.shape[0], -1))
X_test = np.reshape(X_test, (X_test.shape[0], -1))
print(X_train.shape, X_test.shape)

Result:

(5000, 3072) (500, 3072)

X_train: 5000*3072

X_test: 500*3072


from cs231n.classifiers import KNearestNeighbor

# Create a kNN classifier instance.
# Remember that training a kNN classifier is a noop:
# the Classifier simply remembers the data and does no further processing
classifier = KNearestNeighbor()
classifier.train(X_train, y_train)


class KNearestNeighbor(object):
  """ a kNN classifier with L2 distance """

  def __init__(self):
    pass

  def train(self, X, y):
    """
    Train the classifier. For k-nearest neighbors this is just
    memorizing the training data.

    Inputs:
    - X: A numpy array of shape (num_train, D) containing the training data
      consisting of num_train samples each of dimension D.
    - y: A numpy array of shape (N,) containing the training labels, where
         y[i] is the label for X[i].
    """
    self.X_train = X
    self.y_train = y

The data is loaded into the kNN classifier.



# Open cs231n/classifiers/k_nearest_neighbor.py and implement
# compute_distances_two_loops.

# Test your implementation:
dists = classifier.compute_distances_two_loops(X_test)  # matrix of L2 distances between test and training images
print(dists.shape)

Result:
(500, 5000)

Here you have to fill in the two-loop function of the kNN class in that file yourself.

(sqrt takes the square root and square squares its input; somehow it took me a moment to tell them apart, so silly =_=)
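
A two-line sanity check for anyone equally confused:

import numpy as np

print(np.square(3.0))  # 9.0: square() squares its input
print(np.sqrt(9.0))    # 3.0: sqrt() takes the square root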

  def compute_distances_two_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using a nested loop over both the training data and the
    test data.

    Inputs:
    - X: A numpy array of shape (num_test, D) containing test data.

    Returns:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      is the Euclidean distance between the ith test point and the jth training
      point.
    """
    num_test = X.shape[0]              # number of test images: 500
    num_train = self.X_train.shape[0]  # number of training images: 5000
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
      for j in range(num_train):
        #####################################################################
        # TODO:                                                             #
        # Compute the l2 distance between the ith test point and the jth    #
        # training point, and store the result in dists[i, j]. You should   #
        # not use a loop over dimension.                                    #
        #####################################################################
        dists[i][j] = np.sqrt(np.sum(np.square(self.X_train[j, :] - X[i, :])))
        #####################################################################
        #                       END OF YOUR CODE                            #
        #####################################################################
    return dists


Let's look at the accuracy of this method.

The classifier's predict_labels function also has to be written by hand.

  def predict_labels(self, dists, k=1):
    """
    Given a matrix of distances between test points and training points,
    predict a label for each test point.

    Inputs:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      gives the distance between the ith test point and the jth training point.

    Returns:
    - y: A numpy array of shape (num_test,) containing predicted labels for the
      test data, where y[i] is the predicted label for the test point X[i].
    """
    num_test = dists.shape[0]
    y_pred = np.zeros(num_test)
    for i in range(num_test):
      # A list of length k storing the labels of the k nearest neighbors to
      # the ith test point.
      closest_y = []
      #########################################################################
      # TODO:                                                                 #
      # Use the distance matrix to find the k nearest neighbors of the ith    #
      # testing point, and use self.y_train to find the labels of these       #
      # neighbors. Store these labels in closest_y.                           #
      # Hint: Look up the function numpy.argsort.                             #
      #########################################################################
      sorted_num = np.argsort(dists[i])          # indices sorted by ascending distance
      closest_y = self.y_train[sorted_num[:k]]   # labels of the k nearest neighbors
      #########################################################################
      # TODO:                                                                 #
      # Now that you have found the labels of the k nearest neighbors, you    #
      # need to find the most common label in the list closest_y of labels.   #
      # Store this label in y_pred[i]. Break ties by choosing the smaller     #
      # label.                                                                #
      #########################################################################
      label = 0  # predicted class index
      count = 0  # vote count for that class
      for j in closest_y:
          tmp = 0  # vote count for class j
          for v in closest_y:  # do NOT name this loop variable k, it would shadow the parameter! T_T
              tmp += (v == j)  # (v == j) contributes 1 when true
          # break ties toward the smaller label, as the TODO asks
          if tmp > count or (tmp == count and j < label):
              count = tmp
              label = j
      y_pred[i] = label
      # np.bincount solves this in one line (argmax also breaks ties
      # toward the smaller label):
      # y_pred[i] = np.argmax(np.bincount(closest_y))
      #########################################################################
      #                           END OF YOUR CODE                            #
      #########################################################################
    return y_pred
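
For context, predict_labels is normally reached through the class's predict method, which dispatches to one of the distance functions and then votes on labels; in my copy of the skeleton it looks roughly like this:

  def predict(self, X, k=1, num_loops=0):
    """
    Predict labels for test data using this classifier.
    num_loops selects which distance implementation to use.
    """
    if num_loops == 0:
      dists = self.compute_distances_no_loops(X)
    elif num_loops == 1:
      dists = self.compute_distances_one_loop(X)
    elif num_loops == 2:
      dists = self.compute_distances_two_loops(X)
    else:
      raise ValueError('Invalid value %d for num_loops' % num_loops)
    return self.predict_labels(dists, k=k)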

Accuracy with k=1.

# Now implement the function predict_labels and run the code below:
# We use k = 1 (which is Nearest Neighbor).
# i.e., kNN with k = 1 is equivalent to the nearest-neighbor classifier
y_test_pred = classifier.predict_labels(dists, k=1)

# Compute and print the fraction of correctly predicted examples
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))

Result:

Got 137 / 500 correct => accuracy: 0.274000

Accuracy is around 27%.

Accuracy with k=5.

y_test_pred = classifier.predict_labels(dists, k=5)
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))


Result:


Got 145 / 500 correct => accuracy: 0.290000

Accuracy rises to 29%.


Now a small improvement: collapse the two loops into one.

  def compute_distances_one_loop(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using a single loop over the test data.

    Input / Output: Same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
      #######################################################################
      # TODO:                                                               #
      # Compute the l2 distance between the ith test point and all training #
      # points, and store the result in dists[i, :].                        #
      #######################################################################
      dists[i, :] = np.sqrt(np.sum(np.square(self.X_train - X[i]), axis=1))
      #######################################################################
      #                         END OF YOUR CODE                            #
      #######################################################################
    return dists

Use the Frobenius norm to check whether the two implementations produce the same distance matrix.
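
A toy check of what the norm computes (my own example, not from the notebook):

import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[1.0, 2.5], [2.0, 4.0]])
# The Frobenius norm of A - B equals the Euclidean distance between the flattened matrices.
print(np.linalg.norm(A - B, ord='fro'))   # 1.118034
print(np.sqrt(np.sum(np.square(A - B))))  # same value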

# Now lets speed up distance matrix computation by using partial vectorization
# with one loop. Implement the function compute_distances_one_loop and run the
# code below:
dists_one = classifier.compute_distances_one_loop(X_test)

# To ensure that our vectorized implementation is correct, we make sure that it
# agrees with the naive implementation. There are many ways to decide whether
# two matrices are similar; one of the simplest is the Frobenius norm. In case
# you haven't seen it before, the Frobenius norm of two matrices is the square
# root of the squared sum of differences of all elements; in other words, reshape
# the matrices into vectors and compute the Euclidean distance between them.
difference = np.linalg.norm(dists - dists_one, ord='fro')
print('Difference was: %f' % (difference, ))
if difference < 0.001:
  print('Good! The distance matrices are the same')
else:
  print('Uh-oh! The distance matrices are different')


Result:

Difference was: 0.000000
Good! The distance matrices are the same


Improving further: drop the loop entirely and compute all the distances in one shot, using the expansion ||x - y||^2 = ||x||^2 + ||y||^2 - 2*x.y, which turns the whole distance matrix into one matrix product plus two broadcast sums.

Here I referred to this post: http://blog.csdn.net/zhyh1435589631/article/details/54236643

and this one: http://blog.csdn.net/geekmanong/article/details/51524402

  def getNormMatrix(self, x, lines_num):
    """
    Get a lines_num x size(x, 1) matrix whose rows all contain the squared norms of x
    """
    return np.ones((lines_num, 1)) * np.sum(np.square(x), axis=1)

  def compute_distances_no_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using no explicit loops.

    Input / Output: Same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    #########################################################################
    # TODO:                                                                 #
    # Compute the l2 distance between all test points and all training     #
    # points without using any explicit loops, and store the result in     #
    # dists.                                                                #
    #                                                                       #
    # You should implement this function using only basic array operations;#
    # in particular you should not use functions from scipy.               #
    #                                                                       #
    # HINT: Try to formulate the l2 distance using matrix multiplication   #
    #       and two broadcast sums.                                        #
    #########################################################################
    # Alternative using the helper above:
    # dists = np.sqrt(self.getNormMatrix(X, num_train).T
    #                 + self.getNormMatrix(self.X_train, num_test)
    #                 - 2 * np.dot(X, self.X_train.T))
    dists = np.multiply(np.dot(X, self.X_train.T), -2)  # the -2*x.y term
    sq1 = np.sum(np.square(X), axis=1, keepdims=True)   # ||x||^2 as a column vector
    print(sq1.shape)
    sq2 = np.sum(np.square(self.X_train.T), axis=0, keepdims=True)  # ||y||^2 as a row vector; watch the axis here!!
    print(sq2.shape)
    dists = np.add(dists, sq1)  # broadcasts down the columns
    dists = np.add(dists, sq2)  # broadcasts across the rows
    dists = np.sqrt(dists)
    #########################################################################
    #                         END OF YOUR CODE                              #
    #########################################################################
    return dists

Result:

(500, 1)
(1, 5000)
Difference was: 0.000000
Good! The distance matrices are the same

It really is much faster!!


Next, a detailed comparison of the running times of the three versions.
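
The notebook drives the comparison with a small timing helper; a sketch of that cell, reconstructed from memory, so check it against your own copy:

import time

def time_function(f, *args):
    """Call f with args and return the elapsed time in seconds."""
    tic = time.time()
    f(*args)
    toc = time.time()
    return toc - tic

two_loop_time = time_function(classifier.compute_distances_two_loops, X_test)
print('Two loop version took %f seconds' % two_loop_time)

one_loop_time = time_function(classifier.compute_distances_one_loop, X_test)
print('One loop version took %f seconds' % one_loop_time)

no_loop_time = time_function(classifier.compute_distances_no_loops, X_test)
print('No loop version took %f seconds' % no_loop_time)

Result: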

Two loop version took 36.144591 seconds
One loop version took 66.494846 seconds
No loop version took 0.369892 seconds

Scary. Clearly this kind of optimization matters enormously.....



Next comes a cross-validation step.

Try different values of k and see how each performs.

This part took me forever: the sizes passed to reshape must be wrapped in int(), because floats cannot be used as reshape dimensions.
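
A quick illustration of the pitfall (my own toy example):

import numpy as np

a = np.arange(12)
n = 12 / 2               # == 6.0; Python 3 "/" always returns a float
# a.reshape((n, -1))     # TypeError: 'float' object cannot be interpreted as an integer
print(a.reshape((int(n), -1)).shape)  # (6, 2)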

num_folds = 5
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]

X_train_folds = []
y_train_folds = []
################################################################################
# TODO:                                                                        #
# Split up the training data into folds. After splitting, X_train_folds and    #
# y_train_folds should each be lists of length num_folds, where                #
# y_train_folds[i] is the label vector for the points in X_train_folds[i].     #
# Hint: Look up the numpy array_split function.                                #
################################################################################
X_train_folds = np.array_split(X_train, num_folds)
y_train_folds = np.array_split(y_train, num_folds)
################################################################################
#                                 END OF YOUR CODE                             #
################################################################################

# A dictionary holding the accuracies for different values of k that we find
# when running cross-validation. After running cross-validation,
# k_to_accuracies[k] should be a list of length num_folds giving the different
# accuracy values that we found when using that value of k.
k_to_accuracies = {}

################################################################################
# TODO:                                                                        #
# Perform k-fold cross validation to find the best value of k. For each        #
# possible value of k, run the k-nearest-neighbor algorithm num_folds times,   #
# where in each case you use all but one of the folds as training data and the #
# last fold as a validation set. Store the accuracies for all folds and all    #
# values of k in the k_to_accuracies dictionary.                               #
################################################################################
for k in k_choices:
    k_to_accuracies[k] = np.zeros(num_folds)
    for i in range(num_folds):
        # pick out the validation fold and the remaining training folds
        Xtr = np.array(X_train_folds[:i] + X_train_folds[i+1:])
        ytr = np.array(y_train_folds[:i] + y_train_folds[i+1:])
        Xte = np.array(X_train_folds[i])
        yte = np.array(y_train_folds[i])
        # reshape; the sizes must be cast to int
        num_train_x = X_train.shape[0]
        num_train_y = y_train.shape[0]
        Xtr = np.reshape(Xtr, (int(num_train_x * 4 / 5), -1))  # this is the spot!! It tripped me up for ages!!
        ytr = np.reshape(ytr, (int(num_train_y * 4 / 5), -1))
        Xte = np.reshape(Xte, (int(num_train_x / 5), -1))
        yte = np.reshape(yte, (int(num_train_x / 5), -1))
        # train on the folds and evaluate on the held-out fold
        classifier.train(Xtr, ytr)
        yte_pred = classifier.predict(Xte, k)
        yte_pred = np.reshape(yte_pred, (yte_pred.shape[0], -1))
        num_correct = np.sum(yte_pred == yte)
        accuracy = float(num_correct) / len(yte)
        k_to_accuracies[k][i] = accuracy
################################################################################
#                                 END OF YOUR CODE                             #
################################################################################

# Print out the computed accuracies
for k in sorted(k_to_accuracies):
    for accuracy in k_to_accuracies[k]:
        print('k = %d, accuracy = %f' % (k, accuracy))

Result:

k = 1, accuracy = 0.263000
k = 1, accuracy = 0.257000
k = 1, accuracy = 0.264000
k = 1, accuracy = 0.278000
k = 1, accuracy = 0.266000
k = 3, accuracy = 0.257000
k = 3, accuracy = 0.263000
k = 3, accuracy = 0.273000
k = 3, accuracy = 0.282000
k = 3, accuracy = 0.270000
k = 5, accuracy = 0.265000
k = 5, accuracy = 0.275000
k = 5, accuracy = 0.295000
k = 5, accuracy = 0.298000
k = 5, accuracy = 0.284000
k = 8, accuracy = 0.272000
k = 8, accuracy = 0.295000
k = 8, accuracy = 0.284000
k = 8, accuracy = 0.298000
k = 8, accuracy = 0.290000
k = 10, accuracy = 0.272000
k = 10, accuracy = 0.303000
k = 10, accuracy = 0.289000
k = 10, accuracy = 0.292000
k = 10, accuracy = 0.285000
k = 12, accuracy = 0.271000
k = 12, accuracy = 0.305000
k = 12, accuracy = 0.285000
k = 12, accuracy = 0.289000
k = 12, accuracy = 0.281000
k = 15, accuracy = 0.260000
k = 15, accuracy = 0.302000
k = 15, accuracy = 0.292000
k = 15, accuracy = 0.292000
k = 15, accuracy = 0.285000
k = 20, accuracy = 0.268000
k = 20, accuracy = 0.293000
k = 20, accuracy = 0.291000
k = 20, accuracy = 0.287000
k = 20, accuracy = 0.286000
k = 50, accuracy = 0.273000
k = 50, accuracy = 0.291000
k = 50, accuracy = 0.274000
k = 50, accuracy = 0.267000
k = 50, accuracy = 0.273000
k = 100, accuracy = 0.261000
k = 100, accuracy = 0.272000
k = 100, accuracy = 0.267000
k = 100, accuracy = 0.260000
k = 100, accuracy = 0.267000


Next, plot the results to see them at a glance.
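
The plotting cell is roughly the following sketch (assuming matplotlib.pyplot is imported as plt):

import numpy as np
import matplotlib.pyplot as plt

# scatter the raw accuracies for each k
for k in k_choices:
    accuracies = k_to_accuracies[k]
    plt.scatter([k] * len(accuracies), accuracies)

# plot the trend line, with error bars showing one standard deviation
accuracies_mean = np.array([np.mean(v) for k, v in sorted(k_to_accuracies.items())])
accuracies_std = np.array([np.std(v) for k, v in sorted(k_to_accuracies.items())])
plt.errorbar(k_choices, accuracies_mean, yerr=accuracies_std)
plt.title('Cross-validation on k')
plt.xlabel('k')
plt.ylabel('Cross-validation accuracy')
plt.show()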



From the plot, k=10 looks like the best performer, but actually testing shows it isn't!!

k=8 turns out to be a little better than k=10!!

At k=8: Got 147 / 500 correct => accuracy: 0.294000

At k=10: Got 144 / 500 correct => accuracy: 0.288000
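
For completeness, the final test run is just the earlier evaluation code with the chosen k plugged in; a sketch, with best_k = 8 per the numbers above:

best_k = 8  # chosen from the cross-validation results above

classifier = KNearestNeighbor()
classifier.train(X_train, y_train)
y_test_pred = classifier.predict(X_test, k=best_k)

num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))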



OK!! Done, wrapping up!!

See you tomorrow : )

