CS231n Assignment 1: Notes on KNN
Source: Internet. Editor: 程序博客网. Time: 2024/05/17 02:34
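Preface
I have been studying the CS231n course on my own for a while, but since I am not yet proficient with Python, I cannot complete the assignments independently for now, so some of the code here is borrowed from others.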
k-Nearest Neighbor
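1. KNN: k-Nearest Neighbor classification. Compute the distance between each test sample and every labeled training sample. Many distance measures exist; for example, flatten two images into vectors I_1 and I_2, subtract them, take the absolute values, and sum them to get the L1 distance (the code below uses the Euclidean, i.e. L2, distance). Then collect the labels of the k training samples nearest to the test sample and assign the label with the most votes to that test sample. A larger k makes the classifier more robust to outliers. The code for each module used to process the data follows.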
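Before the full modules, the voting idea can be sketched on a tiny example. This is only an illustration under assumptions: the helper names `l1_distance` and `knn_predict` and the toy points are made up here and are not part of the assignment code. It uses the L1 distance described above and shows how a larger k can outvote a single mislabeled outlier.

```python
import numpy as np

def l1_distance(a, b):
    """Sum of absolute element-wise differences between two vectors."""
    return np.sum(np.abs(a - b))

def knn_predict(train_X, train_y, query, k):
    """Label `query` by majority vote among its k nearest training points."""
    dists = np.array([l1_distance(x, query) for x in train_X])
    nearest = np.argsort(dists)[:k]        # indices of the k smallest distances
    votes = np.bincount(train_y[nearest])  # number of votes per label
    return int(np.argmax(votes))           # ties go to the smaller label

# Toy data (hypothetical): two clusters plus one outlier labeled 1 near cluster 0.
train_X = np.array([[0, 0], [1, 0], [0, 1], [10, 10], [11, 10], [1, 1]])
train_y = np.array([0, 0, 0, 1, 1, 1])
query = np.array([1, 2])

pred_k1 = knn_predict(train_X, train_y, query, k=1)  # decided by the outlier alone
pred_k3 = knn_predict(train_X, train_y, query, k=3)  # outlier is outvoted 2 to 1
print(pred_k1, pred_k3)
```

With k=1 the single nearest point (the outlier) decides the label; with k=3 the two neighbors from cluster 0 win the vote.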
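- Data loading functions
import os
import pickle

import numpy as np

# Load a single CIFAR-10 batch file
def load_CIFAR_batch(filename):
    """ load single batch of cifar """
    with open(filename, 'rb') as f:  # open the file in binary mode
        # pickle returns a dict; encoding='latin1' is needed under Python 3
        # because the batches were pickled with Python 2
        datadict = pickle.load(f, encoding='latin1')
        X = datadict['data']
        Y = datadict['labels']
        # each file holds 10000 samples, each a 32*32 color image; reshape to
        # (N, 3, 32, 32), then reorder the axes to (N, 32, 32, 3)
        X = X.reshape(10000, 3, 32, 32).transpose(0, 2, 3, 1).astype("float")
        Y = np.array(Y)
        return X, Y

# Load every batch file and merge the data
def load_CIFAR10(ROOT):
    """ load all of cifar """
    xs = []
    ys = []
    for b in range(1, 6):
        # build the path of each of the five training batch files in turn
        f = os.path.join(ROOT, 'data_batch_%d' % (b, ))
        X, Y = load_CIFAR_batch(f)
        xs.append(X)
        ys.append(Y)
    Xtr = np.concatenate(xs)  # stack the five batches along the first axis
    Ytr = np.concatenate(ys)
    del X, Y
    Xte, Yte = load_CIFAR_batch(os.path.join(ROOT, 'test_batch'))
    return Xtr, Ytr, Xte, Yte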
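The reshape and transpose step in the loader is easy to misread, so here is a small sketch of what it does. It assumes the CIFAR-10 row layout (all red values, then all green, then all blue) and uses a made-up 2x2 image instead of a real 32x32 one; the variable names are illustrative only.

```python
import numpy as np

# One hypothetical 2x2 RGB image stored channel by channel in a flat row.
flat = np.array([[1, 2, 3, 4,        # red channel, row-major
                  11, 12, 13, 14,    # green channel
                  21, 22, 23, 24]])  # blue channel

channels_first = flat.reshape(1, 3, 2, 2)             # (N, C, H, W)
channels_last = channels_first.transpose(0, 2, 3, 1)  # (N, H, W, C)

top_left_pixel = channels_last[0, 0, 0]      # RGB triple of pixel (row 0, col 0)
bottom_right_pixel = channels_last[0, 1, 1]  # RGB triple of pixel (row 1, col 1)
print(channels_last.shape, top_left_pixel, bottom_right_pixel)
```

After the transpose, indexing a pixel position yields its three color values together, which is the layout image-plotting functions usually expect.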
- KNN classifier code
import numpy as np

# Custom KNN classifier
class KNearestNeighbor(object):
    def __init__(self):
        pass

    def train(self, X, y):
        """
        Train the classifier. For k-nearest neighbors this is just
        memorizing the training data.
        Inputs:
        - X: A numpy array of shape (num_train, D) containing the training data
          consisting of num_train samples each of dimension D.
        - y: A numpy array of shape (num_train,) containing the training labels,
          where y[i] is the label for X[i].
        """
        self.X_train = X
        self.y_train = y

    def predict(self, X, k=1, num_loops=0):
        """
        Predict labels for test data using this classifier.
        Inputs:
        - X: A numpy array of shape (num_test, D) containing test data consisting
          of num_test samples each of dimension D.
        - k: The number of nearest neighbors that vote for the predicted labels.
        - num_loops: Determines which implementation to use to compute distances
          between training points and testing points.
        Returns:
        - y: A numpy array of shape (num_test,) containing predicted labels for the
          test data, where y[i] is the predicted label for the test point X[i].
        """
        if num_loops == 0:
            dists = self.compute_distances_no_loops(X)
        elif num_loops == 1:
            dists = self.compute_distances_one_loop(X)
        elif num_loops == 2:
            dists = self.compute_distances_two_loops(X)
        else:
            raise ValueError('Invalid value %d for num_loops' % num_loops)
        return self.predict_labels(dists, k=k)

    def compute_distances_two_loops(self, X):
        """
        Compute the distance between each test point in X and each training point
        in self.X_train using a nested loop over both the training data and the
        test data.
        Inputs:
        - X: A numpy array of shape (num_test, D) containing test data.
        Returns:
        - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
          is the Euclidean distance between the ith test point and the jth
          training point.
        """
        num_test = X.shape[0]
        num_train = self.X_train.shape[0]
        # every test point is compared with every training point,
        # so there are num_test * num_train distances
        dists = np.zeros((num_test, num_train))
        for i in range(num_test):
            for j in range(num_train):
                train = self.X_train[j, :]
                test = X[i, :]
                # Euclidean distance between test point i and training point j
                distance = np.sqrt(np.sum((test - train) ** 2))
                dists[i, j] = distance  # stored in row i, column j
        return dists

    def compute_distances_one_loop(self, X):
        """
        Compute the distance between each test point in X and each training point
        in self.X_train using a single loop over the test data.
        Input / Output: Same as compute_distances_two_loops
        """
        num_test = X.shape[0]
        num_train = self.X_train.shape[0]
        dists = np.zeros((num_test, num_train))
        for i in range(num_test):
            # broadcasting subtracts test point i from every training row
            dis_array = X[i, :] - self.X_train
            # sum over the feature axis to get one distance per training point
            dists[i, :] = np.sqrt(np.sum(dis_array ** 2, axis=1))
        return dists

    def compute_distances_no_loops(self, X):
        """
        Compute the distance between each test point in X and each training point
        in self.X_train using no explicit loops (matrix operations only).
        Input / Output: Same as compute_distances_two_loops
        """
        M = np.dot(X, self.X_train.T)             # cross terms, (num_test, num_train)
        te = np.square(X).sum(axis=1)             # squared norm of each test point
        tr = np.square(self.X_train).sum(axis=1)  # squared norm of each training point
        # ||te - tr||^2 = ||te||^2 - 2 te.tr + ||tr||^2;
        # clip tiny negative values caused by rounding before the square root
        dists = np.sqrt(np.maximum(-2 * M + tr + te[:, np.newaxis], 0))
        return dists

    # Label prediction
    def predict_labels(self, dists, k=1):
        """
        Given a matrix of distances between test points and training points,
        predict a label for each test point.
        Inputs:
        - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
          gives the distance between the ith test point and the jth training point.
        Returns:
        - y: A numpy array of shape (num_test,) containing predicted labels for the
          test data, where y[i] is the predicted label for the test point X[i].
        """
        num_test = dists.shape[0]
        y_pred = np.zeros(num_test)
        # decide the label of each test point in turn
        for i in range(num_test):
            # argsort returns the indices that sort the row in ascending order
            # (argsort is described in more detail below)
            idx = np.argsort(dists[i, :], -1)
            # labels of the k nearest training points
            closest_y = self.y_train[idx[:k]]
            # majority vote: count each label and take the most frequent one
            # (labels must be non-negative integers; ties go to the smaller label)
            y_pred[i] = np.argmax(np.bincount(closest_y))
        return y_pred
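The no-loop version rests on the identity ||a - b||^2 = ||a||^2 - 2 a·b + ||b||^2. As a sanity check, the sketch below compares that vectorized form against an explicit double loop on small random arrays. The array sizes and variable names are arbitrary choices for illustration, not values from the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)
X_test = rng.standard_normal((4, 5))   # 4 hypothetical test points, D = 5
X_train = rng.standard_normal((6, 5))  # 6 hypothetical training points

# Reference result: explicit double loop.
naive = np.zeros((4, 6))
for i in range(4):
    for j in range(6):
        naive[i, j] = np.sqrt(np.sum((X_test[i] - X_train[j]) ** 2))

# Vectorized result: expand the squared difference and rely on broadcasting.
te = np.sum(X_test ** 2, axis=1)[:, np.newaxis]   # shape (4, 1)
tr = np.sum(X_train ** 2, axis=1)[np.newaxis, :]  # shape (1, 6)
cross = X_test @ X_train.T                        # shape (4, 6)
vectorized = np.sqrt(np.maximum(te - 2 * cross + tr, 0.0))

max_abs_diff = np.max(np.abs(naive - vectorized))
print(max_abs_diff)
```

The two results should agree up to floating-point rounding, which is a convenient way to test any faster distance routine against the slow but obvious one.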
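About argsort
numpy.argsort(a, axis=-1, kind='quicksort', order=None)
a: the array to sort
axis: the axis along which to sort; -1 means the last axis
kind: the sorting algorithm to use (for example 'quicksort')
order: for arrays with named fields, which fields to compare first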
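A short sketch of how argsort behaves on a row of distances, using made-up numbers. Note that it returns indices rather than sorted values, which is why the classifier can use the result to look up training labels.

```python
import numpy as np

dists_row = np.array([0.9, 0.1, 0.5, 0.3])
order = np.argsort(dists_row)  # indices that would sort the row in ascending order
two_nearest = order[:2]        # indices of the two smallest distances

dists = np.array([[0.9, 0.1, 0.5],
                  [0.2, 0.8, 0.4]])
per_row = np.argsort(dists, axis=-1)  # each row is sorted independently
print(order, two_nearest, per_row)
```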
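- Cross-validation code
Cross-validation is used to find a suitable value of k. It is carried out entirely within the training set; once the best k has been chosen, the test set is used for the final measurement of classification accuracy.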
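How the folds are formed can be seen on a toy array. In this sketch the data (10 samples, 2 features) and the names `X_folds`, `y_folds` and `splits` are made up for illustration: each fold takes one turn as the validation set while the remaining folds are stacked back together as training data.

```python
import numpy as np

X = np.arange(20).reshape(10, 2)  # 10 hypothetical samples, 2 features each
y = np.arange(10) % 2             # hypothetical labels
num_folds = 5

X_folds = np.array_split(X, num_folds)  # list of 5 arrays, 2 samples each
y_folds = np.array_split(y, num_folds)

splits = []
for i in range(num_folds):
    # every fold except fold i becomes training data
    X_tr = np.vstack(X_folds[:i] + X_folds[i + 1:])
    y_tr = np.hstack(y_folds[:i] + y_folds[i + 1:])
    # fold i is held out for validation
    X_val, y_val = X_folds[i], y_folds[i]
    splits.append((X_tr, y_tr, X_val, y_val))

print([(s[0].shape, s[2].shape) for s in splits])
```

Every sample is used for validation exactly once, so the averaged accuracy does not depend on one particular split.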
import numpy as np
import matplotlib.pyplot as plt

# X_train, y_train and KNearestNeighbor come from the code above
num_folds = 5
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]

# split the training data into five folds
X_train_folds = np.array_split(X_train, num_folds)
y_train_folds = np.array_split(y_train, num_folds)

classifier = KNearestNeighbor()
k_to_accuracies = {}
for k in k_choices:
    k_to_accuracies[k] = []

for k in k_choices:  # find the best value of k
    # build the training and validation sets for each round of cross-validation
    for i in range(num_folds):
        X_train_cv = np.vstack(X_train_folds[:i] + X_train_folds[i + 1:])
        X_test_cv = X_train_folds[i]
        y_train_cv = np.hstack(y_train_folds[:i] + y_train_folds[i + 1:])  # size: 4000
        y_test_cv = y_train_folds[i]
        classifier.train(X_train_cv, y_train_cv)  # train
        dists_cv = classifier.compute_distances_no_loops(X_test_cv)  # choose the distance routine
        y_test_pred = classifier.predict_labels(dists_cv, k)  # predicted labels for the held-out fold
        num_correct = np.sum(y_test_pred == y_test_cv)  # number of correctly labeled samples
        accuracy = float(num_correct) / y_test_cv.shape[0]  # accuracy
        k_to_accuracies[k].append(accuracy)  # store the accuracy in the list for this k

# print the accuracy for each value of k
for k in sorted(k_to_accuracies):
    for accuracy in k_to_accuracies[k]:
        print('k = %d, accuracy = %f' % (k, accuracy))

# plot the raw observations
for k in k_choices:
    accuracies = k_to_accuracies[k]
    plt.scatter([k] * len(accuracies), accuracies)

# plot the trend line with error bars that correspond to standard deviation
accuracies_mean = np.array([np.mean(v) for k, v in sorted(k_to_accuracies.items())])  # mean accuracy per k
accuracies_std = np.array([np.std(v) for k, v in sorted(k_to_accuracies.items())])  # standard deviation per k
plt.errorbar(k_choices, accuracies_mean, yerr=accuracies_std)
plt.title('Cross-validation on k')
plt.xlabel('k')
plt.ylabel('Cross-validation accuracy')
plt.show()
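From this, k = 1 gives the best classification results. Running the actual classification on the test set with that value of k completes the whole KNN pipeline.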
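Rather than reading the best k off the plot, it can also be picked programmatically by comparing mean accuracies. The sketch below uses made-up accuracy values purely to show the mechanics; they are not results from this post, and `mean_accuracy` is a name introduced here for illustration.

```python
import numpy as np

# Made-up accuracies for illustration only; not results from the post.
k_to_accuracies = {
    1: [0.5, 0.6, 0.4],
    3: [0.7, 0.8, 0.6],
    5: [0.6, 0.6, 0.6],
}

# average the per-fold accuracies for every k, then take the k with the highest mean
mean_accuracy = {k: float(np.mean(v)) for k, v in k_to_accuracies.items()}
best_k = max(mean_accuracy, key=mean_accuracy.get)
print(best_k, mean_accuracy[best_k])
```

Applied to the dictionary produced by the cross-validation code, the same two lines would return whichever k has the highest mean accuracy across folds.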
best_k = 1
classifier = KNearestNeighbor()
classifier.train(X_train, y_train)
y_test_pred = classifier.predict(X_test, k=best_k)

# Compute and display the accuracy
num_test = X_test.shape[0]
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))
Note: to reduce running time, I ran the code on only part of the training and test data; the classification accuracy was around 27%.
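The post does not show how the subset was taken, so the following is only a sketch of one common way to do it. Random arrays stand in for the CIFAR-10 data, and the subset sizes (`num_training`, `num_test`) are placeholders rather than the values actually used. The final reshape flattens each image into a row, which is the (N, D) shape the classifier expects.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for the arrays returned by load_CIFAR10 (random data, not CIFAR-10).
X_train = rng.random((200, 32, 32, 3))
y_train = rng.integers(0, 10, size=200)
X_test = rng.random((50, 32, 32, 3))
y_test = rng.integers(0, 10, size=50)

num_training, num_test = 100, 10  # hypothetical subset sizes

# keep only the first num_training / num_test samples
X_train, y_train = X_train[:num_training], y_train[:num_training]
X_test, y_test = X_test[:num_test], y_test[:num_test]

# flatten each 32x32x3 image into a row vector of length 3072
X_train = X_train.reshape(num_training, -1)
X_test = X_test.reshape(num_test, -1)
print(X_train.shape, X_test.shape)
```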
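Summary
This is my first blog post of this kind, and since I am not yet proficient with Python, there are bound to be many inaccuracies in wording. The code above is also not original; it only brings together other people's work, though it is fairly complete. Steady accumulation while studying is what leads to progress.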