k-Nearest Neighbors (kNN) Algorithm

Source: Internet | Editor: 程序博客网 | Time: 2024/06/10 19:10

@(Machine Learning in Action)
The companion code for the book is on GitHub: Machine Learning in Action (《机器学习实战》).

While reading the k-nearest neighbors code, I came across a function I did not fully understand:

    import numpy as np
    np.tile(A, reps)

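This function repeats the array A the number of times given by reps. reps can be an integer, or a tuple or array giving the repeat count along each axis.
Example: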

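    import numpy as np
    a = np.array([0, 1, 2])
    np.tile(a, 2)       # repeat a twice along the column direction
    # >>> array([0, 1, 2, 0, 1, 2])
    np.tile(a, (2, 3))  # repeat 2 times along rows and 3 times along columns
    # >>> array([[0, 1, 2, 0, 1, 2, 0, 1, 2],
    #            [0, 1, 2, 0, 1, 2, 0, 1, 2]])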
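
As a small sketch of why np.tile shows up in the kNN code later on: repeating a row vector until it matches the shape of a matrix lets you subtract it from every row. NumPy broadcasting should give the same result without the explicit copy. The arrays `data` and `point` below are made-up values for illustration, not data from the book.

```python
import numpy as np

data = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # made-up 3x2 matrix
point = np.array([1.0, 1.0])                             # made-up query row

# Explicit repetition: stack `point` once per row of `data`, giving shape (3, 2).
tiled = np.tile(point, (data.shape[0], 1))
diff_tile = tiled - data

# Broadcasting: NumPy stretches `point` across the rows automatically.
diff_broadcast = point - data

same = np.array_equal(diff_tile, diff_broadcast)
```

Both forms compute the same differences; the tiled version just makes the intermediate shapes visible, which can help when reading the book's code.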

The sum function in numpy.

    import numpy as np
    a = np.array([[0, 1, 2], [2, 1, 3]])
    np.sum(a)          # 9, the sum of all elements of a
    np.sum(a, axis=0)  # array([2, 2, 5]), the sum of each column
    np.sum(a, axis=1)  # array([3, 6]), the sum of each row

Without axis, sum adds up all elements; axis=0 sums down each column; axis=1 sums across each row.
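
To connect this with the distance computation used later, here is a minimal sketch, assuming a made-up matrix `diff` whose rows are the differences between a query point and each training sample. Summing the squares with axis=1 gives one squared distance per row.

```python
import numpy as np

# Made-up row differences between a query point and three samples.
diff = np.array([[3.0, 4.0], [0.0, 0.0], [6.0, 8.0]])

sq_dist = np.sum(diff ** 2, axis=1)  # one sum per row
dist = np.sqrt(sq_dist)              # Euclidean distance per row
```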

The argsort function in numpy.
It returns the indices that would sort the array in ascending order, and leaves the original array unchanged.

    import numpy as np
    x = np.array([3, 1, 2])
    np.argsort(x)          # returns array([1, 2, 0])
    x = np.array([[3, 1, 2], [4, 0, 1]])
    np.argsort(x, axis=0)  # sort within each column, returns array([[0, 1, 1], [1, 0, 0]])
    np.argsort(x, axis=1)  # sort within each row, returns array([[1, 2, 0], [1, 2, 0]])
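
In kNN, argsort is typically what picks out the nearest neighbors. A minimal sketch, assuming a made-up `distances` array and k = 2: the first k entries of the argsort result are the indices of the k smallest distances.

```python
import numpy as np

distances = np.array([0.9, 0.1, 0.5, 0.3])  # made-up distances to four samples
k = 2

nearest = np.argsort(distances)[:k]  # indices of the k smallest distances
```

The `distances` array itself keeps its original order, so the indices can be used directly to look up the matching labels.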

dict.get(key) versus dict[key]: in Python, prefer dict.get(key) whenever the key may be missing,
because:

    dictionary.get("bogus", default_value)  # returns default_value when the key "bogus" is missing (None if default_value is omitted)
    dictionary["bogus"]                     # raises KeyError when the key "bogus" is missing
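
This is the pattern the kNN voting step relies on. A small sketch with a made-up list of labels: `counts.get(label, 0)` lets the first occurrence of a label start from zero instead of raising KeyError.

```python
labels = ["A", "B", "A", "C", "A"]  # made-up votes

counts = {}
for label in labels:
    # get() returns 0 for a label that has not been seen yet.
    counts[label] = counts.get(label, 0) + 1

winner = max(counts, key=counts.get)  # label with the most votes
```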

numpy.zeros(shape, dtype=float) creates an array filled with zeros:

    numpy.zeros(5)             # array([0., 0., 0., 0., 0.])
    numpy.zeros((1, 3))        # array([[0., 0., 0.]])
    numpy.zeros(3, dtype=int)  # array([0, 0, 0])

Determine the type of an object (from Stack Overflow)
To check the type of a variable or object in Python, use the built-in function type():

    type(variable_name)
    type(object_name)
    type({})  # <class 'dict'>  (Python 2 prints <type 'dict'>)
    type([])  # <class 'list'>  (Python 2 prints <type 'list'>)

type() can give a misleading answer when inheritance is involved:

    class Test1(object):
        pass

    class Test2(Test1):
        pass

    a = Test1()
    b = Test2()
    type(a) is Test1  # True
    type(b) is Test2  # True
    type(b) is Test1  # False: b is also a Test1 through inheritance, but type() does not show that

So isinstance(object, type) is recommended instead:

    isinstance(b, Test1)  # True
    isinstance(b, Test2)  # True
    isinstance(a, Test1)  # True
    isinstance(a, Test2)  # False

When processing data with Python, you often need to load data from txt or csv files. The usual flow looks like this:

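    # excerpt from the body of file2matrix(filename)
    fr = open(filename)
    for line in fr.readlines():
        line = line.strip()                      # drop leading and trailing whitespace
        listFromLine = line.split('\t')          # split the line into tab-separated fields
        returnMat[index, :] = listFromLine[0:3]  # first three fields are the features
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector  # the sample matrix and the matching labels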

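.strip() and .split() come up all the time.
.strip() removes leading and trailing whitespace.
.split() splits a string on a separator and returns a list. With no argument it splits on any run of whitespace (spaces, tabs and so on); with an argument such as '\t' it splits only on that separator.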
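
A short sketch of both calls on a made-up line in the same tab-separated layout (three numeric fields followed by a label). The values are invented for illustration.

```python
line = "1000\t2.5\t0.5\tlargeDoses\n"  # made-up line: three features, then a label

fields = line.strip().split("\t")      # strip() removes the trailing newline first
features = [float(v) for v in fields[0:3]]
label = fields[-1]

# With no argument, split() treats any run of whitespace as one separator.
loose = "a  b\tc\n".split()
```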

Using Matplotlib

    import matplotlib
    import matplotlib.pyplot as plt
    fig = plt.figure()         # create a blank figure
    ax = fig.add_subplot(111)  # add a subplot; 111 means a 1*1 grid, first subplot
    ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2])
    ax.set(xlabel='Xlabel', ylabel='Ylabel')
    plt.show()

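fig = plt.figure(): creates a blank figure.
ax = fig.add_subplot(111):

The argument to add_subplot() is the subplot grid encoded as an integer: 111 means a 1*1 grid, first subplot; 234 means a 2*3 grid, fourth subplot.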
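
To make the encoding concrete, here is a tiny sketch that splits such a three-digit code into its parts. `decode_subplot_code` is a hypothetical helper written for this note, not part of matplotlib.

```python
def decode_subplot_code(code):
    """Hypothetical helper: split a 3-digit code such as 234 into (rows, cols, index)."""
    rows, rest = divmod(code, 100)  # hundreds digit: number of grid rows
    cols, index = divmod(rest, 10)  # tens digit: grid columns, ones digit: subplot index
    return rows, cols, index

single = decode_subplot_code(111)
grid = decode_subplot_code(234)
```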

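ax.scatter(datingDataMat[:,1], datingDataMat[:,2]): uses the columns at index 1 and index 2 of the data set (the second and third columns) as the x-axis and y-axis data; scatter() draws a scatter plot.
For a short matplotlib tutorial and common terminology, see this tutorial on Jianshu (简书).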

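The scatter plot above is entirely blue, so it does not show how the classes are distributed. To reproduce Figure 2-5 of Machine Learning in Action, I wrote the following program: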

    #!/usr/bin/env python3
    import numpy as np
    import matplotlib.pyplot as plt


    def file2matrix(filename):
        with open(filename) as fr:
            lines = fr.readlines()
        numberOfLines = len(lines)  # get the number of lines in the file
        returnMat = np.zeros((numberOfLines, 3))  # prepare matrix to return
        classLabelVector = []  # prepare labels to return
        index = 0
        for line in lines:
            line = line.strip()
            listFromLine = line.split('\t')
            returnMat[index, :] = listFromLine[0:3]
            # int(listFromLine[-1]) fails here because the labels in this file are strings
            if listFromLine[-1] == 'largeDoses':
                classLabelVector.append(3)
            elif listFromLine[-1] == 'smallDoses':
                classLabelVector.append(2)
            else:
                classLabelVector.append(1)
            index += 1
        return returnMat, classLabelVector


    def main():
        type1_x = []
        type1_y = []
        type2_x = []
        type2_y = []
        type3_x = []
        type3_y = []
        datingDataMat, datingLabels = file2matrix("datingTestSet.txt")
        # collect the first two features of each sample, grouped by label
        for i in range(len(datingLabels)):
            if datingLabels[i] == 1:
                type1_x.append(datingDataMat[i][0])
                type1_y.append(datingDataMat[i][1])
            elif datingLabels[i] == 2:
                type2_x.append(datingDataMat[i][0])
                type2_y.append(datingDataMat[i][1])
            else:
                type3_x.append(datingDataMat[i][0])
                type3_y.append(datingDataMat[i][1])
        fig = plt.figure()
        fig.add_subplot(111)
        # one scatter call per class so that each class gets its own legend entry
        plt.scatter(type1_x, type1_y, c='r', label='didntLike')
        plt.scatter(type2_x, type2_y, c='b', label='smallDoses')
        plt.scatter(type3_x, type3_y, c='g', label='largeDoses')
        plt.xlabel('Flight miles')
        plt.ylabel('video game time percent')
        plt.legend(loc=1)  # loc=1 is the upper-right corner
        plt.show()


    if __name__ == '__main__':
        main()


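According to some notes I read, a legend entry is created per scatter() call, so type1, type2 and type3 are plotted separately, and plt.legend(loc=1) then places the legend in the upper-right corner.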
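
The per-class lists can also be built without an explicit loop. A minimal sketch, assuming the features and labels are NumPy arrays (the `data` and `labels` values below are made up): a boolean mask such as `labels == 1` selects the rows of one class, and each group can then be passed to its own scatter() call.

```python
import numpy as np

# Made-up samples: two features per row, with a class label for each row.
data = np.array([[10.0, 1.0], [20.0, 2.0], [30.0, 3.0], [40.0, 4.0]])
labels = np.array([1, 2, 1, 3])

# labels == lab is a boolean array; indexing with it keeps only matching rows.
groups = {lab: data[labels == lab] for lab in (1, 2, 3)}

class1_x = groups[1][:, 0]  # x values for class 1
class1_y = groups[1][:, 1]  # y values for class 1
```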

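After loading the data, some preprocessing is needed. The sample features in this example are: percentage of time spent playing video games, flight miles, and ice cream consumed per week. The flight-mile values are far larger than the other two features, so they would dominate the distance calculation. The data therefore needs to be normalized. The usual normalization formula is:
$$\text{newValue} = \frac{\text{oldValue} - \min}{\max - \min}$$
The normalization code used in this section is: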

    import numpy as np

    def autoNorm(dataSet):
        minVals = dataSet.min(0)  # minimum of each column
        maxVals = dataSet.max(0)  # maximum of each column
        ranges = maxVals - minVals
        normDataSet = np.zeros(np.shape(dataSet))
        m = dataSet.shape[0]
        normDataSet = dataSet - np.tile(minVals, (m, 1))
        normDataSet = normDataSet / np.tile(ranges, (m, 1))  # element-wise divide
        return normDataSet, ranges, minVals

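I have only just started with numpy and some functions are still unclear to me, so I am writing these notes down for later reference. I would also be glad if readers point out anything I missed.
The statement minVals = dataSet.min(0) computes the minimum of each column. I note this because I still think in Java and C++ terms: seeing the trailing 0, I assumed it returned a single element. In numpy, operations on an ndarray usually work along rows or columns. The 0 here means axis=0, that is, per column; 1 would mean per row. In the same way, maxVals holds the maximum of each column. With three features, each result is an ndarray of 3 values. In short, (0) after a method name passes axis=0 as an argument, whereas [0] indexes the first element.
Here: minVals: [0, 0, 0.001156], an ndarray
maxVals: [91273, 20.919349, 1.6955], an ndarray
normDataSet starts out as a 1000*3 all-zero ndarray
dataSet.shape returns the tuple (1000, 3), so m = 1000
tile() was explained above. After the first assignment that uses tile(), normDataSet holds the numerator of the normalization formula, (oldValue - min).
When reading code, writing down the dimensions of the data behind each variable makes the code easier to follow.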
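
For comparison, the same min-max scaling can be sketched with broadcasting instead of tile(). This is not the book's code; `auto_norm` and the `sample` matrix are names and values made up for this note, and the sketch assumes no column is constant (a constant column would make its range zero).

```python
import numpy as np

def auto_norm(data_set):
    """Min-max scale each column to [0, 1] using broadcasting (a sketch)."""
    min_vals = data_set.min(axis=0)           # shape (n_features,)
    ranges = data_set.max(axis=0) - min_vals  # shape (n_features,)
    # (m, n) minus (n,) broadcasts the vector across all m rows.
    normed = (data_set - min_vals) / ranges
    return normed, ranges, min_vals

sample = np.array([[0.0, 10.0], [50.0, 20.0], [100.0, 30.0]])  # made-up data
normed, ranges, min_vals = auto_norm(sample)
```

The shapes follow the advice above: `sample` is (3, 2), `min_vals` and `ranges` are (2,), and `normed` is again (3, 2).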

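With preprocessing done, it is time for the k-nearest neighbors algorithm itself: build the model and use it to classify new samples. The test code is included here as well.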

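A little kNN theory first:
How kNN works: given a test sample, find the k training samples closest to it under some distance metric, then predict from the labels of those k neighbors. For classification, use voting: the class label that occurs most often among the k samples is the prediction. For regression, use averaging: the mean of the k samples' labels is the prediction.
The distance metric is usually the Euclidean distance between two samples x and y, $$d(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \dots + (x_n - y_n)^2}$$ The k nearest neighbors can all carry the same weight, or they can be weighted by distance, with weight 1/d for a neighbor at distance d.
As classify0() shows, kNN does no work until it is handed a test sample (computing distances, finding the k closest points and so on). This is called "lazy learning": there is no real training phase, and prediction just computes distances from the test sample. Its counterpart, "eager learning", processes the samples during a training phase.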
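
The voting, distance-weighting and averaging rules described above can be sketched as follows. This is an illustration under stated assumptions, not the book's classify0(): the function names, the tiny training set and the small constant added to avoid dividing by zero are all made up for this note.

```python
import numpy as np

def nearest_k(query, train_x, k):
    """Return indices and distances of the k training rows closest to query."""
    dists = np.sqrt(((train_x - query) ** 2).sum(axis=1))  # Euclidean distance
    order = np.argsort(dists)[:k]
    return order, dists[order]

def knn_classify(query, train_x, train_y, k=3, weighted=False):
    """Majority vote, optionally weighting each vote by 1/d."""
    idx, dists = nearest_k(query, train_x, k)
    votes = {}
    for i, d in zip(idx, dists):
        weight = 1.0 / (d + 1e-12) if weighted else 1.0  # 1e-12 guards against d == 0
        votes[train_y[i]] = votes.get(train_y[i], 0.0) + weight
    return max(votes, key=votes.get)

def knn_regress(query, train_x, train_values, k=3):
    """Average the values of the k nearest neighbors."""
    idx, _ = nearest_k(query, train_x, k)
    return float(np.mean(train_values[idx]))

# Made-up data: one "a" sample very close to the query, two "b" samples far away.
train_x = np.array([[0.0, 0.0], [3.0, 0.0], [3.0, 1.0]])
train_y = ["a", "b", "b"]
train_values = np.array([1.0, 3.0, 5.0])
query = np.array([0.1, 0.0])

plain = knn_classify(query, train_x, train_y, k=3)
weighted = knn_classify(query, train_x, train_y, k=3, weighted=True)
estimate = knn_regress(query, train_x, train_values, k=2)
```

With equal weights the two distant "b" samples outvote the single nearby "a"; with 1/d weights the nearby sample wins, which is the point of distance weighting.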

We use 90% of the sample data as training data and the remaining 10% for testing.

    import operator
    import numpy as np

    def classify0(inX, dataSet, labels, k):
        dataSetSize = dataSet.shape[0]
        diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet  # difference to every training row
        sqDiffMat = diffMat ** 2
        sqDistances = sqDiffMat.sum(axis=1)
        distances = sqDistances ** 0.5                      # Euclidean distances
        sortedDistIndicies = distances.argsort()
        classCount = {}
        for i in range(k):                                  # vote among the k nearest
            voteIlabel = labels[sortedDistIndicies[i]]
            classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
        sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
        return sortedClassCount[0][0]

    def datingClassTest():
        hoRatio = 0.10  # hold out 10%
        datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')  # load data set from file
        normMat, ranges, minVals = autoNorm(datingDataMat)
        m = normMat.shape[0]
        numTestVecs = int(m * hoRatio)
        errorCount = 0.0
        for i in range(numTestVecs):
            classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :], datingLabels[numTestVecs:m], 3)
            print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i]))
            if classifierResult != datingLabels[i]:
                errorCount += 1.0
        print("the total error rate is: %f" % (errorCount / float(numTestVecs)))
        print(errorCount)

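classify0() implements the kNN algorithm and classifies unknown data. inX is one row of the 10% test samples, dataSet is the 90% training data, labels holds the labels for that 90% of the data (a Python list), and k is the number of nearest neighbors that take part in the vote, usually 3.
datingClassTest() runs the test on the 10% of the data, prints each prediction, and prints the error rate.
Note: test data should normally be selected at random. In this example the sample data was never sorted in any way, in other words it is already in random order, so any 10% will do. That is why the test code above simply takes the first 10% of the samples.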
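
When the file order cannot be trusted to be random, the hold-out set can be drawn after shuffling. A minimal sketch, assuming the data and labels are NumPy arrays; `shuffled_holdout`, its `seed` parameter and the toy arrays are made up for this note.

```python
import numpy as np

def shuffled_holdout(data, labels, ho_ratio=0.10, seed=0):
    """Shuffle row indices, then hold out the first ho_ratio share as test data."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(data))  # random ordering of row indices
    num_test = int(len(data) * ho_ratio)
    test_idx, train_idx = order[:num_test], order[num_test:]
    return data[train_idx], labels[train_idx], data[test_idx], labels[test_idx]

# Toy data: row i is [2*i, 2*i + 1] and its label is i, so pairs are easy to check.
data = np.arange(40, dtype=float).reshape(20, 2)
labels = np.arange(20)
train_x, train_y, test_x, test_y = shuffled_holdout(data, labels)
```

Indexing both arrays with the same index array keeps every row paired with its own label.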

Recognizing handwritten digits with kNN

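The handwritten digit data set consists of 32*32 black-and-white images. Each image first has to be turned into a vector: the 32*32 binary image is converted into a 1*1024 vector.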

    import numpy as np

    def img2vector(filename):  # convert the image stored in filename into a vector
        returnVect = np.zeros((1, 1024))
        fr = open(filename)
        for i in range(32):
            lineStr = fr.readline()  # one text line per image row
            for j in range(32):
                returnVect[0, 32 * i + j] = int(lineStr[j])
        fr.close()
        return returnVect
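
To see the row-major layout without needing a data file, here is a sketch that takes the lines of '0'/'1' text directly. `text_image_to_vector` is a hypothetical variant written for this note, and `fake_image` is a made-up image whose last column is all ones.

```python
import numpy as np

def text_image_to_vector(lines, size=32):
    """Hypothetical variant of img2vector that takes lines of '0'/'1' characters."""
    rows = [[int(ch) for ch in line.strip()[:size]] for line in lines[:size]]
    # reshape flattens row by row, so pixel (i, j) lands at position size*i + j.
    return np.array(rows, dtype=float).reshape(1, size * size)

fake_image = ["0" * 31 + "1\n"] * 32  # made-up image: only the last column is set
vec = text_image_to_vector(fake_image)
```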

Test code for handwritten digit recognition:

    from os import listdir
    import numpy as np

    def handwritingClassTest():
        hwLabels = []
        trainingFileList = listdir('trainingDigits')  # load the training set
        m = len(trainingFileList)
        trainingMat = np.zeros((m, 1024))
        for i in range(m):
            fileNameStr = trainingFileList[i]
            fileStr = fileNameStr.split('.')[0]       # take off .txt
            classNumStr = int(fileStr.split('_')[0])  # the digit is the part before '_'
            hwLabels.append(classNumStr)
            trainingMat[i, :] = img2vector('trainingDigits/%s' % fileNameStr)
        testFileList = listdir('testDigits')          # iterate through the test set
        errorCount = 0.0
        mTest = len(testFileList)
        for i in range(mTest):
            fileNameStr = testFileList[i]
            fileStr = fileNameStr.split('.')[0]       # take off .txt
            classNumStr = int(fileStr.split('_')[0])
            vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
            classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
            print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, classNumStr))
            if classifierResult != classNumStr:
                errorCount += 1.0
        print("\nthe total number of errors is: %d" % errorCount)
        print("\nthe total error rate is: %f" % (errorCount / float(mTest)))

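This code is very similar to the earlier test code, since the algorithm is the same. The only differences are how the data set is processed and how the data is read.
In trainingFileList = listdir('trainingDigits'), trainingFileList is a list that holds the names of the files in the trainingDigits folder, each stored as a str.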

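However, running the code shows that classifying a new handwritten digit with kNN is not efficient: every test digit has to be compared against the 2000 vectors in the training set, and each distance calculation involves floating-point operations over 1024 dimensions.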
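
One common way to reduce the Python-level overhead (it does not reduce the number of arithmetic operations) is to compute all test-to-training distances in one matrix expression, using the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b. The sketch below is an illustration, not part of the book's code; `pairwise_sq_dists` and the random 0/1 data are made up for this note.

```python
import numpy as np

def pairwise_sq_dists(test_x, train_x):
    """Squared Euclidean distances between every test row and every training row."""
    test_sq = (test_x ** 2).sum(axis=1)[:, None]    # shape (n_test, 1)
    train_sq = (train_x ** 2).sum(axis=1)[None, :]  # shape (1, n_train)
    d2 = test_sq + train_sq - 2.0 * test_x @ train_x.T
    return np.maximum(d2, 0.0)  # clip tiny negatives caused by rounding

rng = np.random.default_rng(0)
train_x = rng.integers(0, 2, size=(50, 1024)).astype(float)  # made-up binary images
test_x = rng.integers(0, 2, size=(5, 1024)).astype(float)

fast = pairwise_sq_dists(test_x, train_x)
# Reference: the row-by-row computation that classify0() performs per sample.
slow = np.array([((train_x - row) ** 2).sum(axis=1) for row in test_x])

nearest = np.argsort(fast, axis=1)[:, :3]  # 3 nearest training rows per test row
```

Taking the square root is unnecessary for ranking neighbors, since it does not change the order of the distances.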
