CS231n A1: KNN


Working through cs231n...

Assignment A1's starter code is written in Python 2, which differs from Python 3 in a few places, so some modifications are needed.

Since I'm just getting started, the code mostly follows this post: http://blog.csdn.net/zhyh1435589631/article/details/54236643. Flowers and thanks! o(* ̄▽ ̄*)ブ

I added a pile of comments; brief notes follow.



0. Dataset: CIFAR-10

10 classes; 50,000 images for training and 10,000 for testing, with the training set split into 5 batches.

Each image is 32*32*3.

To run the notebook, the downloaded dataset needs to be extracted into the datasets folder.

Each batch of the dataset is a dict with four keys: dict_keys([b'batch_label', b'labels', b'data', b'filenames']).

dict[b'data'].shape = (10000, 3072)

len(dict[b'labels']) = 10000
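
To poke at this structure yourself, a quick check along these lines works (the path below is just an example; point it at wherever you extracted the batches):

import pickle

# Example path; adjust to your own datasets folder.
path = 'cs231n/datasets/cifar-10-batches-py/data_batch_1'
with open(path, 'rb') as f:
    batch = pickle.load(f, encoding='bytes')  # Python 3 needs encoding='bytes'

print(batch.keys())           # dict_keys([b'batch_label', b'labels', b'data', b'filenames'])
print(batch[b'data'].shape)   # (10000, 3072)
print(len(batch[b'labels']))  # 10000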

import os
import pickle
import numpy as np

def load_CIFAR_batch(filename):
  """ load single batch of cifar """
  with open(filename, 'rb') as f:
    datadict = pickle.load(f, encoding='bytes')
    X = datadict[b'data']
    Y = datadict[b'labels']
    X = X.reshape(10000, 3, 32, 32).transpose(0, 2, 3, 1).astype("float")
    Y = np.array(Y)
    return X, Y

def load_CIFAR10(ROOT):
  """ load all of cifar """
  xs = []
  ys = []
  for b in range(1, 6):
    f = os.path.join(ROOT, 'data_batch_%d' % (b, ))
    X, Y = load_CIFAR_batch(f)  # one batch at a time: raw shape 10000*3*32*32, transposed to 10000*32*32*3
    xs.append(X)
    ys.append(Y)
  Xtr = np.concatenate(xs)
  Ytr = np.concatenate(ys)
  del X, Y
  Xte, Yte = load_CIFAR_batch(os.path.join(ROOT, 'test_batch'))
  return Xtr, Ytr, Xte, Yte

These two functions load the data.

Data structures: Xtr, Ytr: the 10000*5 = 50000 training images and their class indices

                  Xte, Yte: the 10000 test images and their class indices



1. KNN


# Load the raw CIFAR-10 data.
cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'
X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)

# As a sanity check, we print out the size of the training and test data.
print('Training data shape: ', X_train.shape)
print('Training labels shape: ', y_train.shape)
print('Test data shape: ', X_test.shape)
print('Test labels shape: ', y_test.shape)

Result:

Training data shape:  (50000, 32, 32, 3)
Training labels shape:  (50000,)
Test data shape:  (10000, 32, 32, 3)
Test labels shape:  (10000,)

X_train, y_train, X_test, y_test correspond to Xtr, Ytr, Xte, Yte, respectively.


Next, 7 images from each class are randomly selected and displayed.
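
The visualization cell is roughly the sketch below; I'm assuming matplotlib is imported as plt and the usual ten CIFAR-10 class names:

import numpy as np
import matplotlib.pyplot as plt

classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
num_classes = len(classes)
samples_per_class = 7
for y, cls in enumerate(classes):
    idxs = np.flatnonzero(y_train == y)  # all training images of class y
    idxs = np.random.choice(idxs, samples_per_class, replace=False)  # pick 7 at random
    for i, idx in enumerate(idxs):
        plt_idx = i * num_classes + y + 1  # fill the grid column by column
        plt.subplot(samples_per_class, num_classes, plt_idx)
        plt.imshow(X_train[idx].astype('uint8'))
        plt.axis('off')
        if i == 0:
            plt.title(cls)
plt.show()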


# Subsample the data for more efficient code execution in this exercise
num_training = 5000
mask = range(num_training)
X_train = X_train[mask]
y_train = y_train[mask]

num_test = 500
mask = range(num_test)
X_test = X_test[mask]
y_test = y_test[mask]

Subsampling: keep the first 5000 training images and the first 500 test images so the code runs faster.


# Reshape the image data into rows
X_train = np.reshape(X_train, (X_train.shape[0], -1))
X_test = np.reshape(X_test, (X_test.shape[0], -1))
print(X_train.shape, X_test.shape)

Result:

(5000, 3072) (500, 3072)

X_train: 5000*3072

X_test: 500*3072


from cs231n.classifiers import KNearestNeighbor

# Create a kNN classifier instance.
# Remember that training a kNN classifier is a noop:
# the Classifier simply remembers the data and does no further processing
classifier = KNearestNeighbor()
classifier.train(X_train, y_train)


class KNearestNeighbor(object):
  """ a kNN classifier with L2 distance """

  def __init__(self):
    pass

  def train(self, X, y):
    """
    Train the classifier. For k-nearest neighbors this is just
    memorizing the training data.

    Inputs:
    - X: A numpy array of shape (num_train, D) containing the training data
      consisting of num_train samples each of dimension D.
    - y: A numpy array of shape (N,) containing the training labels, where
         y[i] is the label for X[i].
    """
    self.X_train = X
    self.y_train = y

The data is loaded into the kNN classifier.



# Open cs231n/classifiers/k_nearest_neighbor.py and implement
# compute_distances_two_loops.

# Test your implementation:
dists = classifier.compute_distances_two_loops(X_test)  # matrix of L2 distances between test and training images
print(dists.shape)

Result:
(500, 5000)

Here you have to fill in the two-loop function of the kNN class in that file yourself.

(sqrt takes the square root and square squares its input; somehow it took me a moment to tell them apart, so silly =_=)
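
A two-line sanity check for anyone equally confused:

import numpy as np

print(np.square(3.0))  # 9.0: square() squares its input
print(np.sqrt(9.0))    # 3.0: sqrt() takes the square root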

  def compute_distances_two_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using a nested loop over both the training data and the
    test data.

    Inputs:
    - X: A numpy array of shape (num_test, D) containing test data.

    Returns:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      is the Euclidean distance between the ith test point and the jth training
      point.
    """
    num_test = X.shape[0]              # number of test images: 500
    num_train = self.X_train.shape[0]  # number of training images: 5000
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
      for j in range(num_train):
        #####################################################################
        # TODO:                                                             #
        # Compute the l2 distance between the ith test point and the jth    #
        # training point, and store the result in dists[i, j]. You should   #
        # not use a loop over dimension.                                    #
        #####################################################################
        dists[i][j] = np.sqrt(np.sum(np.square(self.X_train[j, :] - X[i, :])))
        #####################################################################
        #                       END OF YOUR CODE                            #
        #####################################################################
    return dists


Let's look at the accuracy of this method.

The classifier's predict_labels function also has to be written by hand.

  def predict_labels(self, dists, k=1):
    """
    Given a matrix of distances between test points and training points,
    predict a label for each test point.

    Inputs:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      gives the distance between the ith test point and the jth training point.

    Returns:
    - y: A numpy array of shape (num_test,) containing predicted labels for the
      test data, where y[i] is the predicted label for the test point X[i].
    """
    num_test = dists.shape[0]
    y_pred = np.zeros(num_test)
    for i in range(num_test):
      # A list of length k storing the labels of the k nearest neighbors to
      # the ith test point.
      closest_y = []
      #########################################################################
      # TODO:                                                                 #
      # Use the distance matrix to find the k nearest neighbors of the ith    #
      # testing point, and use self.y_train to find the labels of these       #
      # neighbors. Store these labels in closest_y.                           #
      # Hint: Look up the function numpy.argsort.                             #
      #########################################################################
      sorted_num = np.argsort(dists[i])          # indices sorted by ascending distance
      closest_y = self.y_train[sorted_num[:k]]   # labels of the k nearest neighbors
      #########################################################################
      # TODO:                                                                 #
      # Now that you have found the labels of the k nearest neighbors, you    #
      # need to find the most common label in the list closest_y of labels.   #
      # Store this label in y_pred[i]. Break ties by choosing the smaller     #
      # label.                                                                #
      #########################################################################
      label = 0  # predicted class index
      count = 0  # vote count for that class
      for j in closest_y:
          tmp = 0  # vote count for class j
          for v in closest_y:  # do NOT name this loop variable k, it would shadow the parameter! T_T
              tmp += (v == j)  # (v == j) contributes 1 when true
          # break ties toward the smaller label, as the TODO asks
          if tmp > count or (tmp == count and j < label):
              count = tmp
              label = j
      y_pred[i] = label
      # np.bincount solves this in one line (argmax also breaks ties
      # toward the smaller label):
      # y_pred[i] = np.argmax(np.bincount(closest_y))
      #########################################################################
      #                           END OF YOUR CODE                            #
      #########################################################################
    return y_pred
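
For context, predict_labels is normally reached through the class's predict method, which dispatches to one of the distance functions and then votes on labels; in my copy of the skeleton it looks roughly like this:

  def predict(self, X, k=1, num_loops=0):
    """
    Predict labels for test data using this classifier.
    num_loops selects which distance implementation to use.
    """
    if num_loops == 0:
      dists = self.compute_distances_no_loops(X)
    elif num_loops == 1:
      dists = self.compute_distances_one_loop(X)
    elif num_loops == 2:
      dists = self.compute_distances_two_loops(X)
    else:
      raise ValueError('Invalid value %d for num_loops' % num_loops)
    return self.predict_labels(dists, k=k)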

Accuracy with k=1.

# Now implement the function predict_labels and run the code below:
# We use k = 1 (which is Nearest Neighbor).
# i.e., kNN with k = 1 is equivalent to the nearest-neighbor classifier
y_test_pred = classifier.predict_labels(dists, k=1)

# Compute and print the fraction of correctly predicted examples
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))

Result:

Got 137 / 500 correct => accuracy: 0.274000

Accuracy is around 27%.

Accuracy with k=5.

y_test_pred = classifier.predict_labels(dists, k=5)
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))


Result:


Got 145 / 500 correct => accuracy: 0.290000

Accuracy rises to 29%.


Now a small improvement: collapse the two loops into one.

  def compute_distances_one_loop(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using a single loop over the test data.

    Input / Output: Same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
      #######################################################################
      # TODO:                                                               #
      # Compute the l2 distance between the ith test point and all training #
      # points, and store the result in dists[i, :].                        #
      #######################################################################
      dists[i, :] = np.sqrt(np.sum(np.square(self.X_train - X[i]), axis=1))
      #######################################################################
      #                         END OF YOUR CODE                            #
      #######################################################################
    return dists

Use the Frobenius norm to check whether the two implementations produce the same distance matrix.
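
A toy check of what the norm computes (my own example, not from the notebook):

import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[1.0, 2.5], [2.0, 4.0]])
# The Frobenius norm of A - B equals the Euclidean distance between the flattened matrices.
print(np.linalg.norm(A - B, ord='fro'))   # 1.118034
print(np.sqrt(np.sum(np.square(A - B))))  # same value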

# Now lets speed up distance matrix computation by using partial vectorization
# with one loop. Implement the function compute_distances_one_loop and run the
# code below:
dists_one = classifier.compute_distances_one_loop(X_test)

# To ensure that our vectorized implementation is correct, we make sure that it
# agrees with the naive implementation. There are many ways to decide whether
# two matrices are similar; one of the simplest is the Frobenius norm. In case
# you haven't seen it before, the Frobenius norm of two matrices is the square
# root of the squared sum of differences of all elements; in other words, reshape
# the matrices into vectors and compute the Euclidean distance between them.
difference = np.linalg.norm(dists - dists_one, ord='fro')
print('Difference was: %f' % (difference, ))
if difference < 0.001:
  print('Good! The distance matrices are the same')
else:
  print('Uh-oh! The distance matrices are different')


Result:

Difference was: 0.000000
Good! The distance matrices are the same


Improving further: drop the loop entirely and compute all the distances in one shot, using the expansion ||x - y||^2 = ||x||^2 + ||y||^2 - 2*x.y, which turns the whole distance matrix into one matrix product plus two broadcast sums.

Here I referred to this post: http://blog.csdn.net/zhyh1435589631/article/details/54236643

and this one: http://blog.csdn.net/geekmanong/article/details/51524402

  def getNormMatrix(self, x, lines_num):
    """
    Get a lines_num x size(x, 1) matrix whose rows all contain the squared norms of x
    """
    return np.ones((lines_num, 1)) * np.sum(np.square(x), axis=1)

  def compute_distances_no_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using no explicit loops.

    Input / Output: Same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    #########################################################################
    # TODO:                                                                 #
    # Compute the l2 distance between all test points and all training     #
    # points without using any explicit loops, and store the result in     #
    # dists.                                                                #
    #                                                                       #
    # You should implement this function using only basic array operations;#
    # in particular you should not use functions from scipy.               #
    #                                                                       #
    # HINT: Try to formulate the l2 distance using matrix multiplication   #
    #       and two broadcast sums.                                        #
    #########################################################################
    # Alternative using the helper above:
    # dists = np.sqrt(self.getNormMatrix(X, num_train).T
    #                 + self.getNormMatrix(self.X_train, num_test)
    #                 - 2 * np.dot(X, self.X_train.T))
    dists = np.multiply(np.dot(X, self.X_train.T), -2)  # the -2*x.y term
    sq1 = np.sum(np.square(X), axis=1, keepdims=True)   # ||x||^2 as a column vector
    print(sq1.shape)
    sq2 = np.sum(np.square(self.X_train.T), axis=0, keepdims=True)  # ||y||^2 as a row vector; watch the axis here!!
    print(sq2.shape)
    dists = np.add(dists, sq1)  # broadcasts down the columns
    dists = np.add(dists, sq2)  # broadcasts across the rows
    dists = np.sqrt(dists)
    #########################################################################
    #                         END OF YOUR CODE                              #
    #########################################################################
    return dists

Result:

(500, 1)
(1, 5000)
Difference was: 0.000000
Good! The distance matrices are the same

It really is much faster!!


Next, a detailed comparison of the running times of the three versions.
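
The notebook drives the comparison with a small timing helper; a sketch of that cell, reconstructed from memory, so check it against your own copy:

import time

def time_function(f, *args):
    """Call f with args and return the elapsed time in seconds."""
    tic = time.time()
    f(*args)
    toc = time.time()
    return toc - tic

two_loop_time = time_function(classifier.compute_distances_two_loops, X_test)
print('Two loop version took %f seconds' % two_loop_time)

one_loop_time = time_function(classifier.compute_distances_one_loop, X_test)
print('One loop version took %f seconds' % one_loop_time)

no_loop_time = time_function(classifier.compute_distances_no_loops, X_test)
print('No loop version took %f seconds' % no_loop_time)

Result: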

Two loop version took 36.144591 seconds
One loop version took 66.494846 seconds
No loop version took 0.369892 seconds

Scary. Clearly this kind of optimization matters enormously.....



Next comes a cross-validation step.

Try different values of k and see how each performs.

This part took me forever: the sizes passed to reshape must be wrapped in int(), because floats cannot be used as reshape dimensions.
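
A quick illustration of the pitfall (my own toy example):

import numpy as np

a = np.arange(12)
n = 12 / 2               # == 6.0; Python 3 "/" always returns a float
# a.reshape((n, -1))     # TypeError: 'float' object cannot be interpreted as an integer
print(a.reshape((int(n), -1)).shape)  # (6, 2)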

num_folds = 5
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]

X_train_folds = []
y_train_folds = []
################################################################################
# TODO:                                                                        #
# Split up the training data into folds. After splitting, X_train_folds and    #
# y_train_folds should each be lists of length num_folds, where                #
# y_train_folds[i] is the label vector for the points in X_train_folds[i].     #
# Hint: Look up the numpy array_split function.                                #
################################################################################
X_train_folds = np.array_split(X_train, num_folds)
y_train_folds = np.array_split(y_train, num_folds)
################################################################################
#                                 END OF YOUR CODE                             #
################################################################################

# A dictionary holding the accuracies for different values of k that we find
# when running cross-validation. After running cross-validation,
# k_to_accuracies[k] should be a list of length num_folds giving the different
# accuracy values that we found when using that value of k.
k_to_accuracies = {}

################################################################################
# TODO:                                                                        #
# Perform k-fold cross validation to find the best value of k. For each        #
# possible value of k, run the k-nearest-neighbor algorithm num_folds times,   #
# where in each case you use all but one of the folds as training data and the #
# last fold as a validation set. Store the accuracies for all folds and all    #
# values of k in the k_to_accuracies dictionary.                               #
################################################################################
for k in k_choices:
    k_to_accuracies[k] = np.zeros(num_folds)
    for i in range(num_folds):
        # pick out the validation fold and the remaining training folds
        Xtr = np.array(X_train_folds[:i] + X_train_folds[i+1:])
        ytr = np.array(y_train_folds[:i] + y_train_folds[i+1:])
        Xte = np.array(X_train_folds[i])
        yte = np.array(y_train_folds[i])
        # reshape; the sizes must be cast to int
        num_train_x = X_train.shape[0]
        num_train_y = y_train.shape[0]
        Xtr = np.reshape(Xtr, (int(num_train_x * 4 / 5), -1))  # this is the spot!! It tripped me up for ages!!
        ytr = np.reshape(ytr, (int(num_train_y * 4 / 5), -1))
        Xte = np.reshape(Xte, (int(num_train_x / 5), -1))
        yte = np.reshape(yte, (int(num_train_x / 5), -1))
        # train on the folds and evaluate on the held-out fold
        classifier.train(Xtr, ytr)
        yte_pred = classifier.predict(Xte, k)
        yte_pred = np.reshape(yte_pred, (yte_pred.shape[0], -1))
        num_correct = np.sum(yte_pred == yte)
        accuracy = float(num_correct) / len(yte)
        k_to_accuracies[k][i] = accuracy
################################################################################
#                                 END OF YOUR CODE                             #
################################################################################

# Print out the computed accuracies
for k in sorted(k_to_accuracies):
    for accuracy in k_to_accuracies[k]:
        print('k = %d, accuracy = %f' % (k, accuracy))

Result:

k = 1, accuracy = 0.263000
k = 1, accuracy = 0.257000
k = 1, accuracy = 0.264000
k = 1, accuracy = 0.278000
k = 1, accuracy = 0.266000
k = 3, accuracy = 0.257000
k = 3, accuracy = 0.263000
k = 3, accuracy = 0.273000
k = 3, accuracy = 0.282000
k = 3, accuracy = 0.270000
k = 5, accuracy = 0.265000
k = 5, accuracy = 0.275000
k = 5, accuracy = 0.295000
k = 5, accuracy = 0.298000
k = 5, accuracy = 0.284000
k = 8, accuracy = 0.272000
k = 8, accuracy = 0.295000
k = 8, accuracy = 0.284000
k = 8, accuracy = 0.298000
k = 8, accuracy = 0.290000
k = 10, accuracy = 0.272000
k = 10, accuracy = 0.303000
k = 10, accuracy = 0.289000
k = 10, accuracy = 0.292000
k = 10, accuracy = 0.285000
k = 12, accuracy = 0.271000
k = 12, accuracy = 0.305000
k = 12, accuracy = 0.285000
k = 12, accuracy = 0.289000
k = 12, accuracy = 0.281000
k = 15, accuracy = 0.260000
k = 15, accuracy = 0.302000
k = 15, accuracy = 0.292000
k = 15, accuracy = 0.292000
k = 15, accuracy = 0.285000
k = 20, accuracy = 0.268000
k = 20, accuracy = 0.293000
k = 20, accuracy = 0.291000
k = 20, accuracy = 0.287000
k = 20, accuracy = 0.286000
k = 50, accuracy = 0.273000
k = 50, accuracy = 0.291000
k = 50, accuracy = 0.274000
k = 50, accuracy = 0.267000
k = 50, accuracy = 0.273000
k = 100, accuracy = 0.261000
k = 100, accuracy = 0.272000
k = 100, accuracy = 0.267000
k = 100, accuracy = 0.260000
k = 100, accuracy = 0.267000


Next, plot the results to see them at a glance.
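
The plotting cell is roughly the following sketch (assuming matplotlib.pyplot is imported as plt):

import numpy as np
import matplotlib.pyplot as plt

# scatter the raw accuracies for each k
for k in k_choices:
    accuracies = k_to_accuracies[k]
    plt.scatter([k] * len(accuracies), accuracies)

# plot the trend line, with error bars showing one standard deviation
accuracies_mean = np.array([np.mean(v) for k, v in sorted(k_to_accuracies.items())])
accuracies_std = np.array([np.std(v) for k, v in sorted(k_to_accuracies.items())])
plt.errorbar(k_choices, accuracies_mean, yerr=accuracies_std)
plt.title('Cross-validation on k')
plt.xlabel('k')
plt.ylabel('Cross-validation accuracy')
plt.show()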



From the plot, k=10 looks like the best performer, but actually testing shows it isn't!!

k=8 turns out to be a little better than k=10!!

At k=8: Got 147 / 500 correct => accuracy: 0.294000

At k=10: Got 144 / 500 correct => accuracy: 0.288000
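
For completeness, the final test run is just the earlier evaluation code with the chosen k plugged in; a sketch, with best_k = 8 per the numbers above:

best_k = 8  # chosen from the cross-validation results above

classifier = KNearestNeighbor()
classifier.train(X_train, y_train)
y_test_pred = classifier.predict(X_test, k=best_k)

num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))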



OK!! Done, wrapping up!!

See you tomorrow : )

