cs231n Assignment #1 (1): k-Nearest Neighbor (kNN) exercise -- code walkthrough notes
Before reading this code I had already installed the latest Anaconda, which ships with Jupyter Notebook. Typing jupyter notebook in a cmd window opens it in the browser.
(A note on downloading the course assignment code: the link on the Zhihu homepage is dead, so go straight to the official site http://cs231n.stanford.edu/index.html and follow it to the course notes at http://cs231n.github.io/.)
Click Upload to load knn.ipynb.
# Run some setup code for this notebook.
from __future__ import print_function  # must precede every other statement in the cell, so moved up here

import sys
# I added these lines myself, otherwise the data_utils module cannot be found;
# the appended path is wherever the downloaded assignment folder lives.
sys.path.append('E:\\CZU\\assignment1\\cs231n')
print(sys.path)

import random
import numpy as np
from data_utils import load_CIFAR10  # also changed to import the module by its bare name
import matplotlib.pyplot as plt

# This is a bit of magic to make matplotlib figures appear inline in the notebook
# rather than in a new window.
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0)  # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# Some more magic so that the notebook will reload external python modules;
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2
# Load the raw CIFAR-10 data.
# Change this path to wherever you unpacked the CIFAR-10 dataset
# (backslashes doubled so none is accidentally read as an escape sequence).
cifar10_dir = 'E:\\CZU\\cifar-10-python\\cifar-10-batches-py'
X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)

# As a sanity check, we print out the size of the training and test data.
print('Training data shape: ', X_train.shape)
print('Training labels shape: ', y_train.shape)
print('Test data shape: ', X_test.shape)
print('Test labels shape: ', y_test.shape)
Before this I had already explored the CIFAR-10 dataset with some standalone code; each batch file, once read, is a Python dictionary.
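To see that for yourself, here is a minimal sketch of inspecting one raw batch file directly (assuming the same dataset path as above; on Python 3 the pickle needs encoding='bytes', and the dictionary keys come back as byte strings):

import pickle

# Open one raw CIFAR-10 batch file (path assumed from the setup above).
with open('E:\\CZU\\cifar-10-python\\cifar-10-batches-py\\data_batch_1', 'rb') as f:
    batch = pickle.load(f, encoding='bytes')  # the batches were pickled under Python 2

print(type(batch))           # <class 'dict'>
print(batch.keys())          # dict_keys([b'batch_label', b'labels', b'data', b'filenames'])
print(batch[b'data'].shape)  # (10000, 3072): 10000 images of 32*32*3 = 3072 bytes each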
# Visualize some examples from the dataset.
# We show a few examples of training images from each class.
classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
num_classes = len(classes)
samples_per_class = 7
for y, cls in enumerate(classes):  # loop over classes as (index, name) pairs
    # flatnonzero returns an ndarray holding the indices of all nonzero entries,
    # i.e. the indices of every training example belonging to class y.
    idxs = np.flatnonzero(y_train == y)
    idxs = np.random.choice(idxs, samples_per_class, replace=False)
    for i, idx in enumerate(idxs):
        plt_idx = i * num_classes + y + 1
        plt.subplot(samples_per_class, num_classes, plt_idx)
        plt.imshow(X_train[idx].astype('uint8'))
        plt.axis('off')
        if i == 0:
            plt.title(cls)
plt.show()
Running this shows 7 random images of each class, one class per column -- very slick.
# Subsample the data for more efficient code execution in this exercise.
# To make things run faster I shrank these to 500 and 50.
num_training = 500
mask = list(range(num_training))
X_train = X_train[mask]
y_train = y_train[mask]

num_test = 50
mask = list(range(num_test))
X_test = X_test[mask]
y_test = y_test[mask]
# Reshape the image data into rows
X_train = np.reshape(X_train, (X_train.shape[0], -1))
X_test = np.reshape(X_test, (X_test.shape[0], -1))
print(X_train.shape, X_test.shape)
Running this prints:
(500, 3072) (50, 3072)
# Added one more path here; the import below is also changed to point at the .py file directly.
sys.path.append('E:\\CZU\\assignment1\\cs231n\\classifiers')
# At this point I had to go pip install future, because the source file imports the past module.
from k_nearest_neighbor import KNearestNeighbor

# Create a kNN classifier instance.
# Remember that training a kNN classifier is a noop:
# the Classifier simply remembers the data and does no further processing.
# In other words, the classifier keeps every training example as a reference,
# and later computes the distance between each test image and those training images.
classifier = KNearestNeighbor()
classifier.train(X_train, y_train)
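Inside k_nearest_neighbor.py, train really is just memorization. Roughly, it looks like this (a sketch from memory of the skeleton code, not the graded part of the assignment):

class KNearestNeighbor(object):
    """A kNN classifier with L2 distance."""

    def train(self, X, y):
        # "Training" only memorizes the data:
        # X is (num_train, D), y is (num_train,) with y[i] the label of X[i].
        self.X_train = X
        self.y_train = y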
# Open cs231n/classifiers/k_nearest_neighbor.py and implement
# compute_distances_two_loops.

# Test your implementation:
print(X_test[0, :5])
print(classifier.X_train[0, :5])
# This distance function is the assignment part you write yourself inside
# k_nearest_neighbor.py; a sketch of it follows the output below.
dists = classifier.compute_distances_two_loops(X_test)
print(dists.shape)
# I added several prints here to debug: at first nothing came out, and the cause
# turned out to be mixed tabs and spaces that kept the code from parsing --
# better to use a proper editor.
print(dists[0, :5])
Here is what the prints above produced:
[ 158.  112.   49.  159.  111.]
[ 59.  62.  63.  43.  46.]
(50, 500)
[ 3803.92350081  4210.59603857  5504.0544147   3473.88960677  4371.58632535]
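For reference, a minimal sketch of how I understand compute_distances_two_loops (my own take on the assignment code, not the official solution): it fills a (num_test, num_train) matrix with the L2 distance between every test/training pair.

# method of KNearestNeighbor in k_nearest_neighbor.py
def compute_distances_two_loops(self, X):
    """Compute the L2 distance between each test point in X (num_test, D)
    and each training point in self.X_train (num_train, D) with two loops."""
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
        for j in range(num_train):
            # Euclidean (L2) distance between test image i and training image j.
            dists[i, j] = np.sqrt(np.sum((X[i] - self.X_train[j]) ** 2))
    return dists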
# We can visualize the distance matrix: each row is a single test example and
# its distances to training examples.
plt.imshow(dists, interpolation='none')
plt.show()
The notebook then renders dists as an image, as shown in the figure: black indicates low distances while white indicates high distances.
# Now implement the function predict_labels and run the code below:
# We use k = 1 (which is Nearest Neighbor).
# This step predicts labels from the distance matrix computed above, using the
# k nearest neighbors; the internals are also assignment code you write yourself.
y_test_pred = classifier.predict_labels(dists, k=1)

# Compute and print the fraction of correctly predicted examples.
# The expected accuracy is about 27%; I actually got about 26%.
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))

# Switch k to 5: in principle accuracy should be a bit higher, but since I
# shrank the sample size mine actually dropped, which is normal.
y_test_pred = classifier.predict_labels(dists, k=5)
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))

# Now lets speed up distance matrix computation by using partial vectorization
# with one loop. Implement the function compute_distances_one_loop and run the
# code below:
dists_one = classifier.compute_distances_one_loop(X_test)

# To ensure that our vectorized implementation is correct, we make sure that it
# agrees with the naive implementation. There are many ways to decide whether
# two matrices are similar; one of the simplest is the Frobenius norm. In case
# you haven't seen it before, the Frobenius norm of two matrices is the square
# root of the squared sum of differences of all elements; in other words, reshape
# the matrices into vectors and compute the Euclidean distance between them.
difference = np.linalg.norm(dists - dists_one, ord='fro')
print('Difference was: %f' % (difference, ))
if difference < 0.001:
    print('Good! The distance matrices are the same')
else:
    print('Uh-oh! The distance matrices are different')

# Now implement the fully vectorized version inside compute_distances_no_loops
# and run the code
dists_two = classifier.compute_distances_no_loops(X_test)

# check that the distance matrix agrees with the one we computed before:
difference = np.linalg.norm(dists - dists_two, ord='fro')
print('Difference was: %f' % (difference, ))
if difference < 0.001:
    print('Good! The distance matrices are the same')
else:
    print('Uh-oh! The distance matrices are different')

# Let's compare how fast the implementations are
def time_function(f, *args):
    """
    Call a function f with args and return the time (in seconds) that it took to execute.
    """
    import time
    tic = time.time()
    f(*args)
    toc = time.time()
    return toc - tic

two_loop_time = time_function(classifier.compute_distances_two_loops, X_test)
print('Two loop version took %f seconds' % two_loop_time)

one_loop_time = time_function(classifier.compute_distances_one_loop, X_test)
print('One loop version took %f seconds' % one_loop_time)

no_loop_time = time_function(classifier.compute_distances_no_loops, X_test)
print('No loop version took %f seconds' % no_loop_time)

# you should see significantly faster performance with the fully vectorized implementation

The above compares the performance of the three distance-computation functions (double loop, single loop, direct matrix arithmetic).
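The one-loop and no-loop versions both lean on broadcasting; the fully vectorized one uses the expansion ||x - y||^2 = ||x||^2 - 2 x·y + ||y||^2, which turns all the pairwise distances into two squared-norm vectors plus a single matrix product. Here are my sketches of the remaining pieces (again hand-written assignment code, methods of KNearestNeighbor, not the official solutions):

# methods of KNearestNeighbor in k_nearest_neighbor.py

def compute_distances_one_loop(self, X):
    """One loop over the test points; broadcasting handles the training axis."""
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
        # (self.X_train - X[i]) broadcasts to (num_train, D); sum over axis 1.
        dists[i, :] = np.sqrt(np.sum((self.X_train - X[i]) ** 2, axis=1))
    return dists

def compute_distances_no_loops(self, X):
    """Fully vectorized via ||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2."""
    test_sq = np.sum(X ** 2, axis=1, keepdims=True)   # (num_test, 1)
    train_sq = np.sum(self.X_train ** 2, axis=1)      # (num_train,)
    cross = X.dot(self.X_train.T)                     # (num_test, num_train)
    # Clip tiny negative values caused by floating-point error before the sqrt.
    return np.sqrt(np.maximum(test_sq - 2 * cross + train_sq, 0))

def predict_labels(self, dists, k=1):
    """For each test point, majority-vote among the labels of its k nearest neighbors."""
    num_test = dists.shape[0]
    y_pred = np.zeros(num_test, dtype=self.y_train.dtype)
    for i in range(num_test):
        closest_y = self.y_train[np.argsort(dists[i])[:k]]  # labels of the k nearest
        y_pred[i] = np.argmax(np.bincount(closest_y))       # ties go to the smaller label
    return y_pred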
Cross-validation:

num_folds = 5
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]

X_train_folds = []
y_train_folds = []
################################################################################
# TODO:                                                                        #
# Split up the training data into folds. After splitting, X_train_folds and   #
# y_train_folds should each be lists of length num_folds, where               #
# y_train_folds[i] is the label vector for the points in X_train_folds[i].    #
# Hint: Look up the numpy array_split function.                               #
################################################################################
X_train_folds = np.array_split(X_train, num_folds)
y_train_folds = np.array_split(y_train, num_folds)
################################################################################
#                                 END OF YOUR CODE                            #
################################################################################

# A dictionary holding the accuracies for different values of k that we find
# when running cross-validation. After running cross-validation,
# k_to_accuracies[k] should be a list of length num_folds giving the different
# accuracy values that we found when using that value of k.
k_to_accuracies = {}

################################################################################
# TODO:                                                                        #
# Perform k-fold cross validation to find the best value of k. For each       #
# possible value of k, run the k-nearest-neighbor algorithm num_folds times,  #
# where in each case you use all but one of the folds as training data and the#
# last fold as a validation set. Store the accuracies for all folds and all   #
# values of k in the k_to_accuracies dictionary.                              #
################################################################################
for k in k_choices:
    k_to_accuracies[k] = np.zeros(num_folds)
    for i in range(num_folds):
        Xtr = np.array(X_train_folds[:i] + X_train_folds[i+1:])
        ytr = np.array(y_train_folds[:i] + y_train_folds[i+1:])
        Xte = np.array(X_train_folds[i])
        yte = np.array(y_train_folds[i])

        Xtr = np.reshape(Xtr, (int(X_train.shape[0] * 4 / 5), -1))
        ytr = np.reshape(ytr, (int(y_train.shape[0] * 4 / 5), -1))
        Xte = np.reshape(Xte, (int(X_train.shape[0] / 5), -1))
        yte = np.reshape(yte, (int(y_train.shape[0] / 5), -1))

        classifier.train(Xtr, ytr)
        yte_pred = classifier.predict(Xte, k)
        yte_pred = np.reshape(yte_pred, (yte_pred.shape[0], -1))
        num_correct = np.sum(yte_pred == yte)
        accuracy = float(num_correct) / len(yte)
        k_to_accuracies[k][i] = accuracy
################################################################################
#                                 END OF YOUR CODE                            #
################################################################################

# Print out the computed accuracies
for k in sorted(k_to_accuracies):
    for accuracy in k_to_accuracies[k]:
        print('k = %d, accuracy = %f' % (k, accuracy))
The middle of this code I took from someone else's solution; I plan to reread it carefully and rewrite it myself, along the lines of the sketch below.
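When I rewrite it, a tidier version of the inner loop might look like this (a sketch assuming the same classifier object and 1-D label arrays; np.concatenate replaces the reshape gymnastics above):

for k in k_choices:
    k_to_accuracies[k] = []
    for i in range(num_folds):
        # Fold i is the validation set; the remaining folds form the training set.
        X_val, y_val = X_train_folds[i], y_train_folds[i]
        X_tr = np.concatenate(X_train_folds[:i] + X_train_folds[i+1:])
        y_tr = np.concatenate(y_train_folds[:i] + y_train_folds[i+1:])

        classifier.train(X_tr, y_tr)
        y_val_pred = classifier.predict(X_val, k=k)
        k_to_accuracies[k].append(np.mean(y_val_pred == y_val))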
Then comes the visualization, which produces a line chart (scatter points plus a trend line with error bars):
# plot the raw observations
for k in k_choices:
    accuracies = k_to_accuracies[k]
    plt.scatter([k] * len(accuracies), accuracies)

# plot the trend line with error bars that correspond to standard deviation
accuracies_mean = np.array([np.mean(v) for k, v in sorted(k_to_accuracies.items())])
accuracies_std = np.array([np.std(v) for k, v in sorted(k_to_accuracies.items())])
plt.errorbar(k_choices, accuracies_mean, yerr=accuracies_std)
plt.title('Cross-validation on k')
plt.xlabel('k')
plt.ylabel('Cross-validation accuracy')
plt.show()

# Based on the cross-validation results above, choose the best value for k,
# retrain the classifier using all the training data, and test it on the test
# data. You should be able to get above 28% accuracy on the test data.
best_k = 1

classifier = KNearestNeighbor()
classifier.train(X_train, y_train)
y_test_pred = classifier.predict(X_test, k=best_k)

# Compute and display the accuracy
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))
That is the end of the ipynb.
Quite a bit of the later code was taken from others' solutions. I need to understand it afresh and rewrite it myself, mainly to build familiarity with the Python language itself and with numpy; the algorithmic ideas are clear at this point.