k-Nearest Neighbors Algorithm (kNN)

Source: Internet | Editor: 程序博客网 | Time: 2024/06/05 23:46

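I. How it works

Prerequisite: there is a sample data set, also called the training set, and every sample in it carries a label; that is, we know which class each sample in the set belongs to.

Procedure: when a new, unlabeled data point arrives, each of its features is compared with the corresponding feature of the samples in the training set. The features are the numeric attribute values of a sample, and similarity is measured by the distance between feature vectors. The algorithm then extracts the class labels of the most similar samples. In general only the k most similar samples are used, where k is usually an integer no greater than 20, and the class that occurs most often among them becomes the prediction.

Input and output: the input is an unlabeled sample; it is compared against the training set, and the output is a label taken from the training set.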

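II. General workflow of the k-nearest neighbors algorithm

1. Collect data: any method can be used.
2. Prepare data: numeric values are needed for the distance calculation (the distance is how samples with similar features are found); a structured data format is best.
3. Analyze data: any method can be used.
4. Train the algorithm: this step does not apply to k-nearest neighbors.
5. Test the algorithm: compute the error rate.
6. Use the algorithm: first supply the sample data and a structured output format, then run the k-nearest neighbors algorithm to decide which class each input sample belongs to, and finally carry out any follow-up processing on the computed classes.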

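1. Prepare data (creating the data in Python)

from numpy import *   # import everything from the NumPy scientific computing package
import operator       # import the operator module

def createDataSet():       # define the createDataSet function
    group = array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]])  # create four points
    labels = ['A','A','B','B']         # the four labels
    return group,labels

The code above assigns [1.0,1.1] and [1.0,1.0] to class A, and [0,0] and [0,0.1] to class B.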

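2. Analyze data
The pseudocode is:

1. Compute the distance between every point in the labeled data set and the current point;
2. Sort the distances in ascending order;
3. Take the k points closest to the current point;
4. Count how often each class occurs among those k points;
5. Return the most frequent class among the k points as the predicted class of the current point.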

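The distance between two points can be computed with the 2-norm $||A-B||_{2}$, which is the Euclidean distance formula:

$d=\sqrt{(xA_{0}-xB_{0})^{2}+(xA_{1}-xB_{1})^{2}}$

For the points (1,0) and (2,1), the Euclidean distance is: $d=\sqrt{(1-2)^{2}+(0-1)^{2}}=\sqrt{2}$

If the data set has four features, the distance between the points (1,0,0,1) and (7,6,9,4) is:

$d=\sqrt{(7-1)^{2}+(6-0)^{2}+(9-0)^{2}+(4-1)^{2}}$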
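The four-feature distance above can be checked numerically. This is a minimal sketch in Python 3 with NumPy (the listings in this post are Python 2); the variable names are chosen for illustration only.

```python
import numpy as np

# The two four-feature points from the text.
a = np.array([1, 0, 0, 1], dtype=float)
b = np.array([7, 6, 9, 4], dtype=float)

# Euclidean distance step by step: difference, square, sum, square root.
d_manual = np.sqrt(((a - b) ** 2).sum())

# The same value as the 2-norm of the difference vector.
d_norm = np.linalg.norm(a - b)

print(d_manual, d_norm)  # both equal sqrt(162), about 12.73
```

Both routes give the same number, which is why the 2-norm and the Euclidean distance formula can be used interchangeably here.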

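Once all distances are computed, the data can be sorted from smallest to largest, and the classes of the k elements with the smallest distances can be determined. The code is:

from numpy import  *  # import the NumPy scientific computing package
import operator   # import the operator module

def createDataSet():  # create the data set
    group = array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]])
    labels = ['A','A','B','B']
    return group,labels

def kmeans(index,dataset,labels,k):  # despite the name, this is the kNN classifier, not k-means clustering
    datasetsize=dataset.shape[0]   # number of rows, i.e. the number of training samples
    diffMat = tile(index,(datasetsize,1))-dataset # repeat the query point once per training row, then subtract to get the per-feature differences
    sqDiffMat=diffMat**2 # square each element
    sqDistances=sqDiffMat.sum(axis=1) # sum across each row to get the squared distance to each sample
    distance=sqDistances**0.5  # the square root gives the distance
    sortedDistindicies=distance.argsort() # indices that sort the distances in ascending order, e.g. [2,3,1,0]
    classcount={}
    # vote among the k closest points
    for i in range(k):
        votelabel=labels[sortedDistindicies[i]]  # label of the i-th closest sample, e.g. 'A' or 'B'
        classcount[votelabel]=classcount.get(votelabel,0)+1 # get returns the current count for votelabel, or 0 if it is absent; then add 1
    # sort the (label, count) pairs by count, largest first
    sortedclasscount=sorted(classcount.iteritems(),key=operator.itemgetter(1),reverse=True) # a dict has no order, but the sorted list of its items does
    return sortedclasscount[0][0]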

Open a shell and enter:

>>> import kNN
>>> group,labels=kNN.createDataSet()
>>> kNN.kmeans([0,0],group,labels,3)

Output:

'B'
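For readers on Python 3, where dict.iteritems() no longer exists, the same classifier can be sketched as follows. The function name knn_classify and the use of collections.Counter are my own choices for illustration, not part of the original listing.

```python
from collections import Counter

import numpy as np


def knn_classify(query, dataset, labels, k):
    """Return the majority label among the k training points closest to query."""
    dataset = np.asarray(dataset, dtype=float)
    # Broadcasting subtracts the query from every row of the data set.
    diff = dataset - np.asarray(query, dtype=float)
    distances = np.sqrt((diff ** 2).sum(axis=1))
    # Indices of the k smallest distances, nearest first.
    nearest = distances.argsort()[:k]
    votes = Counter(labels[i] for i in nearest)
    # most_common(1) returns [(label, count)] for the top-voted label.
    return votes.most_common(1)[0][0]


group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(knn_classify([0, 0], group, labels, 3))      # B
print(knn_classify([1.0, 1.2], group, labels, 3))  # A
```

With k=3 and the query [0,0], the two nearest points are in class B and the third is in class A, so B wins the vote, matching the shell session above.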

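Notes on the code syntax:
1. The tile function: tile(inx, i) repeats inx i times along its last axis; in tile(inx, (i, j)), i is the number of repetitions along the rows and j the number of repetitions along the columns.

>>> arrary=([[0,0],[1,2]])    # create the array
>>> tile(arrary,1)     # one repetition, so the array is unchanged
array([[0, 0],
       [1, 2]])
>>> tile(arrary,2)  # two repetitions along the columns, so each row is doubled
array([[0, 0, 0, 0],
       [1, 2, 1, 2]])
>>> tile(arrary,(4,2)) # four repetitions along the rows, two along the columns
array([[0, 0, 0, 0],
       [1, 2, 1, 2],
       [0, 0, 0, 0],
       [1, 2, 1, 2],
       [0, 0, 0, 0],
       [1, 2, 1, 2],
       [0, 0, 0, 0],
       [1, 2, 1, 2]])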
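As a side note, the difference matrix that the classifier builds with tile can also be obtained through NumPy broadcasting, which avoids the explicit copy. A small Python 3 sketch, using the four sample points from this post:

```python
import numpy as np

dataset = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
query = np.array([0.0, 0.0])

# Explicit copy of the query, one row per training sample, as in the listing.
diff_tile = np.tile(query, (dataset.shape[0], 1)) - dataset

# Broadcasting stretches the 1-D query across the rows without the copy.
diff_broadcast = query - dataset

print(np.array_equal(diff_tile, diff_broadcast))  # True
```

Either form works; tile makes the shapes explicit, while broadcasting is shorter.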

2. For the query point [0,0], diffMat is:

[[-1.  -1.1]
 [-1.  -1. ]
 [ 0.   0. ]
 [ 0.  -0.1]]

sqDiffMat=diffMat**2 gives:

[[ 1.    1.21]
 [ 1.    1.  ]
 [ 0.    0.  ]
 [ 0.    0.01]]

sqDistances=sqDiffMat.sum(axis=1) gives the result below; taking the square root of it yields the Euclidean distances:

[ 2.21  2.    0.    0.01]

3. Using Python's sorted

>>> import operator
>>> L=[5,2,3,4]
>>> print sorted(L)
[2, 3, 4, 5]
>>> L={"b":2,"c":1,"d":4,"a":3}
>>> print L
{'a': 3, 'c': 1, 'b': 2, 'd': 4}
>>> print sorted(L.iteritems(),key=operator.itemgetter(1),reverse=True)
[('d', 4), ('a', 3), ('b', 2), ('c', 1)]
>>> print sorted(L.iteritems(),key=operator.itemgetter(1))
[('c', 1), ('b', 2), ('a', 3), ('d', 4)]
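The session above is Python 2. In Python 3, iteritems() is gone and items() takes its place; the sorting itself is unchanged. A short sketch (the variable names are mine):

```python
import operator

counts = {"b": 2, "c": 1, "d": 4, "a": 3}

# items() yields (key, value) pairs; itemgetter(1) sorts them by value.
descending = sorted(counts.items(), key=operator.itemgetter(1), reverse=True)
ascending = sorted(counts.items(), key=operator.itemgetter(1))

print(descending)  # [('d', 4), ('a', 3), ('b', 2), ('c', 1)]
print(ascending)   # [('c', 1), ('b', 2), ('a', 3), ('d', 4)]
```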

4. The get() method of a Python dictionary returns the value for the given key, or a default value if the key is not in the dictionary

>>> L={"b":2,"c":1,"d":4,"a":3}
>>> print L.get("c")
1
>>> print L.get("l")
None
>>> print L.get("l",0)
0
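This is exactly how the classifier counts votes: get(label, 0) supplies 0 the first time a label is seen. The Python 3 sketch below shows that pattern next to collections.Counter, which does the same bookkeeping; the list neighbor_labels is a hypothetical example, not data from the post.

```python
from collections import Counter

neighbor_labels = ['B', 'B', 'A']  # hypothetical labels of the 3 nearest points

# Manual counting: get() returns 0 for a label that has no entry yet.
classcount = {}
for label in neighbor_labels:
    classcount[label] = classcount.get(label, 0) + 1

# Counter performs the same counting in one call.
counter = Counter(neighbor_labels)

print(classcount)                    # {'B': 2, 'A': 1}
print(dict(counter) == classcount)   # True
```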

III. Case 1: matching results on a dating site

1. Case analysis:

[Figure: case description; image not available]

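2. The approach:
Test the algorithm: use part of the data provided by Helen as test samples.
Test samples differ from the other samples in that they are already classified, so each prediction can be checked: if the predicted class differs from the actual class, it is counted as an error.

The data looks like this (the download link is missing in the source):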

[Figure: sample rows of the data file; image not available]

3. Load the file with Python and store it in an array, as shown below:

def filematrix(filename):
    fr=open(filename)     # open the text file
    arrayOlines=fr.readlines()     # read all lines
    numberOfLines=len(arrayOlines)     # number of lines
    returnMat=zeros((numberOfLines,3))      # zero matrix with numberOfLines rows and 3 columns
    classLabelVector=[]      # empty list for the class labels
    index=0
    for line in arrayOlines:    # one iteration per line; this should be 1000 here
        line=line.strip()      # remove the trailing \n and surrounding whitespace
        listfromline=line.split('\t')       # split the line on tabs
        returnMat[index,:]=listfromline[0:3]     # store the first three fields in row index of returnMat
        classLabelVector.append(int(listfromline[-1]))       # append the last field, as an int, to classLabelVector
        index +=1
    return returnMat,classLabelVector

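Notes on the code syntax:
1. line.strip() removes leading and trailing whitespace, including the newline; the line is then split on the tab character \t into a list of fields.
2. listfromline[-1] uses the index -1 to take the last field of the line; negative indexing makes it easy to collect the last column into the vector classLabelVector.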

The output is:

>>> kNN.filematrix("/home/hansry/python/ml/Ch02/datingTestSet2.txt")
(array([[  4.09200000e+04,   8.32697600e+00,   9.53952000e-01],
       [  1.44880000e+04,   7.15346900e+00,   1.67390400e+00],
       [  2.60520000e+04,   1.44187100e+00,   8.05124000e-01],
       ...,
       [  2.65750000e+04,   1.06501020e+01,   8.66627000e-01],
       [  4.81110000e+04,   9.13452800e+00,   7.28045000e-01],
       [  4.37570000e+04,   7.88260100e+00,   1.33244600e+00]]), [3, 2, 1, 1, 1, 1, 3, 3, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1, 2, 3, 2, 1, 2, 3, 2, 3, 2, 3, 2, 1, 3, 1, 3, 1, 2, 1, 1, 2, 3, 3, 1, 2, 3, 3, 3, 1, 1, 1, 1, 2, 2, 1, 3, 2, 2, 2, 2, 3, ...])
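The same parsing can be sketched in Python 3 with numpy.loadtxt. Because the data file itself is not reproduced in this post, the snippet reads three tab-separated rows from an in-memory buffer instead; the values are the first three rows of the output above, and using a buffer in place of the file is my assumption for illustration.

```python
import io

import numpy as np

# Three tab-separated rows copied from the output above, standing in for the file.
sample = io.StringIO(
    "40920\t8.326976\t0.953952\t3\n"
    "14488\t7.153469\t1.673904\t2\n"
    "26052\t1.441871\t0.805124\t1\n"
)

data = np.loadtxt(sample, delimiter="\t")
features = data[:, :3]                      # first three columns: the features
labels = data[:, -1].astype(int).tolist()   # last column: the class labels

print(features.shape)  # (3, 3)
print(labels)          # [3, 2, 1]
```

Passing a file path to loadtxt in place of the buffer should work the same way, provided every row has the same four tab-separated numeric fields.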

———————————————————————————————-
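NumPy arrays and Python arrays:
The book (Machine Learning in Action) uses NumPy arrays extensively. Python also ships its own array type, which does not support the NumPy array methods, so take care to use the NumPy one when writing code.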

———————————————————————————————-
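A quick way to see the difference described in the note above is to compare the standard-library array with a NumPy array. A minimal Python 3 sketch; the alias py_array is mine:

```python
from array import array as py_array

import numpy as np

py_arr = py_array('d', [1.0, 2.0, 3.0])   # array from the standard library
np_arr = np.array([1.0, 2.0, 3.0])        # NumPy array

# The standard-library array has no NumPy attributes such as shape.
print(hasattr(py_arr, 'shape'), hasattr(np_arr, 'shape'))  # False True

# Multiplication repeats the standard-library array but scales the NumPy one.
print(len(py_arr * 2))  # 6
print(np_arr * 2)       # [2. 4. 6.]
```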

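Plot the data above with the following code:

import matplotlib
import matplotlib.pyplot as plt

def plotfigure(returnMat,datelabels):
    fig=plt.figure()  # create the figure
    ax=fig.add_subplot(111) # set up the axes; add_subplot(349) would split the canvas into 3 rows and 4 columns and draw in the 9th cell, counting left to right and top to bottom
    ax.scatter(returnMat[:,1],returnMat[:,2],15.0*array(datelabels),15.0*array(datelabels)) # plot columns 2 and 3, with marker size and color scaled by the class label
    plt.show()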

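Input:

>>> reload(kNN)
<module 'kNN' from 'kNN.pyc'>
>>> returnMat,labelMat=kNN.filematrix("/home/hansry/python/ml/Ch02/datingTestSet2.txt")
>>> kNN.plotfigure(returnMat,labelMat)

The output is:

[Figure: scatter plot of the second and third feature columns, with marker size and color set by the class label]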

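4. Normalizing the data
Why normalize? As an example, frequent flier miles and time spent playing video games differ by orders of magnitude, so in an unnormalized distance calculation the frequent flier miles would dominate the result. For Helen, however, the features are equally important, so the data must be normalized, rescaling every feature to the range [0,1].

The formula is: newvalue = (oldvalue - min) / (max - min)
where min and max are the smallest and largest values of that feature in the data set. Changing the value range makes the classifier somewhat more complex, but it improves accuracy.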

The code is:

def autonorm(dataset):
    minval=dataset.min(0)   # column-wise minimum of the n x 3 matrix dataset
    maxval=dataset.max(0)  # column-wise maximum
    rangeval=maxval-minval   # range of each column
    datasize=dataset.shape[0]   # number of rows of dataset, i.e. n
    normdata=zeros(dataset.shape) # placeholder with the same shape as dataset
    normdata=dataset-tile(minval,(datasize,1))
    normdata=normdata/tile(rangeval,(datasize,1)) # normalized data, with values in [0,1]
    return normdata,rangeval,minval

Running it (the full output is not reproduced here):

>>> reload(kNN)
<module 'kNN' from 'kNN.py'>
>>> returnMatrix,matrixlabels=kNN.filematrix("/home/hansry/python/ml/Ch02/datingTestSet2.txt")
>>> array,ranges,minvals=kNN.autonorm(returnMatrix)
>>> array
array([[ 0.44832535,  0.39805139,  0.56233353],
       [ 0.15873259,  0.34195467,  0.98724416],
       [ 0.28542943,  0.06892523,  0.47449629],
       ...,
       [ 0.29115949,  0.50910294,  0.51079493],
       [ 0.52711097,  0.43665451,  0.4290048 ],
       [ 0.47940793,  0.3768091 ,  0.78571804]])
>>> ranges
array([  9.12730000e+04,   2.09193490e+01,   1.69436100e+00])
>>> minvals
array([ 0.      ,  0.      ,  0.001156])
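The normalization can also be written with broadcasting, and the returned minimum and range are what a new query point must be scaled with before classification. The Python 3 sketch below uses only three rows from the data shown earlier, so its minimum and range differ from the full-file values above; the query point is hypothetical.

```python
import numpy as np

# Three rows taken from the data shown earlier in the post.
data = np.array([
    [40920.0, 8.326976, 0.953952],
    [14488.0, 7.153469, 1.673904],
    [26052.0, 1.441871, 0.805124],
])

min_vals = data.min(axis=0)            # per-column minimum
ranges = data.max(axis=0) - min_vals   # per-column range
norm = (data - min_vals) / ranges      # broadcasting replaces tile

# A new point must be scaled with the same min_vals and ranges.
query = np.array([20000.0, 5.0, 1.0])  # hypothetical query point
norm_query = (query - min_vals) / ranges

print(norm.min(axis=0))  # [0. 0. 0.]
print(norm.max(axis=0))  # [1. 1. 1.]
```

One caveat with this formula: a feature that has the same value in every row has a range of zero, so the division would need a guard in that case.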

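5. Test the algorithm: validating the classifier as a complete program

Error rate: after the program finishes, the number of misclassified points divided by the total number of test points is the error rate.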
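As a tiny worked example of that definition, in Python 3 and with made-up predictions and labels:

```python
import numpy as np

predicted = np.array([3, 2, 1, 1, 3])  # hypothetical classifier outputs
actual = np.array([3, 2, 1, 1, 1])     # hypothetical true labels

errors = int((predicted != actual).sum())  # number of mismatches
error_rate = errors / len(actual)          # mismatches / number of test points

print(errors, error_rate)  # 1 0.2
```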

Write another module, classify.py, with the code below. Its main purpose is to test the classifier on data that is already classified and see what the error rate is:

# -*- coding: utf-8 -*-
import kNN   # import the kNN module written above

def datintTest():
    hoRatio=0.1   # fraction of the data held out for testing
    datingDataMat,datingLabels=kNN.filematrix("/home/hansry/python/ml/Ch02/datingTestSet2.txt") # load the data from datingTestSet2.txt
    normData,ranges,minvals=kNN.autonorm(datingDataMat)  # normalize
    normsize=normData.shape[0]  # number of rows of normData
    numTestVecs=int(normsize*hoRatio)   # use one tenth of normData for testing
    errorcount=0
    for i in range(numTestVecs): # loop over the numTestVecs test rows
        # normData[i,:] is the i-th test row; rows numTestVecs to normsize-1 of normData
        # and the matching entries of datingLabels form the training set; k=3
        classresult=kNN.kmeans(normData[i,:],normData[numTestVecs:normsize,:],\
                             datingLabels[numTestVecs:normsize],3)
        print "the classifier came back with ：%d,the real answer is : %d"  \
                           %(classresult,datingLabels[i])
        if(classresult !=datingLabels[i]):
            errorcount +=1.0
    print "the total error rate is :%f"  %(errorcount/(float(numTestVecs)))

>>> import classify
>>> classify.datintTest()
the classifier came back with ：3,the real answer is : 3
the classifier came back with ：2,the real answer is : 2
the classifier came back with ：1,the real answer is : 1
the classifier came back with ：1,the real answer is : 1
the classifier came back with ：1,the real answer is : 1
the classifier came back with ：1,the real answer is : 1
the classifier came back with ：3,the real answer is : 3
the classifier came back with ：3,the real answer is : 3
...
the classifier came back with ：1,the real answer is : 1
the classifier came back with ：3,the real answer is : 1
the total error rate is :0.050000       # an error rate of 5%, which seems acceptable to me
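Note that datintTest holds out the first 10% of the rows, which is only a fair test if the file is not ordered by class. If it were, one option would be to shuffle the row indices before splitting. This is an assumption on my part rather than something the code above does; a Python 3 sketch with a fixed seed and the 1000 rows mentioned earlier:

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the split is reproducible
n_samples = 1000                 # number of rows in the data file
ho_ratio = 0.1                   # fraction held out for testing

perm = rng.permutation(n_samples)          # shuffled row indices
n_test = int(n_samples * ho_ratio)
test_idx, train_idx = perm[:n_test], perm[n_test:]

print(len(test_idx), len(train_idx))  # 100 900
```

The index arrays can then be used to select the test and training rows of the normalized matrix, along with the matching labels.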

Prediction function for the dating site:

def classifyperson():
    resultlist=["not at all","in small doses","in large doses"]  # how much Helen is likely to like the person
    ffMiles=float(raw_input("frequent flier miles earned per year?"))  # raw_input reads from standard input
    percentTats=float(raw_input("percent of time spent playing video game?"))
    icecream=float(raw_input("liters of ice cream consumed per year?"))
    datingDataMat,datingLabels=kNN.filematrix("/home/hansry/python/ml/Ch02/datingTestSet2.txt")
    normMat,ranges,minVals=kNN.autonorm(datingDataMat)
    inArr=[ffMiles,percentTats,icecream]  # the order must match the columns of the data file, or the result will be wrong
    classifierResult=kNN.kmeans((inArr-minVals)/ranges,normMat,datingLabels,3)  # inArr is normalized in the same way as the training data
    print "you will probably like this person:",resultlist[classifierResult-1]

Output:

>>> import classify
>>> reload(classify)
<module 'classify' from 'classify.py'>
>>> classify.classifyperson()
frequent flier miles earned per year?40920
percent of time spent playing video game?8.326976
liters of ice cream consumed per year?0.953952
you will probably like this person: in large doses        # 40920  8.32697  0.953952  3: the input is one of the rows in the data set, and the prediction is correct

IV. Case 2: a handwriting recognition system

The handwriting recognition system has the following prerequisites:
1. It only recognizes the digits 0 to 9.
2. The digits to recognize have already been processed with image software to the same color and size: black-and-white images of 32x32 pixels. Storing them as text does not use memory efficiently, but for ease of understanding the images are converted to text format anyway.

from numpy import zeros   # imports needed by this module
from os import listdir
import kNN

def matToVector(filename):
    returnVect = zeros((1,1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVect[0,32*i+j] = int(lineStr[j])  # store every element of each row in returnVect, giving a 1x1024 matrix
    return returnVect

def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir('/home/hansry/python/ml/Ch02/digits/trainingDigits')           # load the training set
    m = len(trainingFileList)
    trainingMat = zeros((m,1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]     # split fileNameStr on "." and take the first part
        classNumStr = int(fileStr.split('_')[0])   # the number before "_" is the class label
        hwLabels.append(classNumStr)
        trainingMat[i,:] = matToVector('/home/hansry/python/ml/Ch02/digits/trainingDigits/%s' % fileNameStr)  # store the vector of the i-th image in row i of trainingMat
    testFileList = listdir('/home/hansry/python/ml/Ch02/digits/testDigits')
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = matToVector('/home/hansry/python/ml/Ch02/digits/testDigits/%s' % fileNameStr)
        classifierResult = kNN.kmeans(vectorUnderTest, trainingMat, hwLabels, 3)
        print "the classifier came back with: %d, the real answer is: %d" % (classifierResult, classNumStr)
        if (classifierResult != classNumStr): errorCount += 1.0
    print "\nthe total number of errors is: %d" % errorCount
    print "\nthe total error rate is: %f" % (errorCount/float(mTest))

The output is:

>>> import source
>>> source.handwritingClassTest()
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 5, the real answer is: 5
the classifier came back with: 8, the real answer is: 8
...
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 4, the real answer is: 4
the classifier came back with: 5, the real answer is: 5

the total number of errors is: 12

the total error rate is: 0.012685
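To make the two preprocessing steps concrete, the Python 3 sketch below flattens a 32x32 text image into a 1x1024 vector and extracts the class label from a file name, following the same logic as matToVector and the split calls above. The image is synthetic and the file name "3_12.txt" is a hypothetical example; neither comes from the real data set.

```python
import io

import numpy as np

# Synthetic 32x32 "image": all zeros except row 5, which is all ones.
lines = ["0" * 32 for _ in range(32)]
lines[5] = "1" * 32
fake_file = io.StringIO("\n".join(lines) + "\n")

vect = np.zeros((1, 1024))
for i in range(32):
    line = fake_file.readline()
    for j in range(32):
        vect[0, 32 * i + j] = int(line[j])  # row i, column j -> position 32*i+j

# The class label is the number before "_" in the file name.
file_name = "3_12.txt"  # hypothetical file name
label = int(file_name.split('.')[0].split('_')[0])

print(vect.shape)                  # (1, 1024)
print(int(vect.sum()))             # 32
print(vect[0, 160], vect[0, 159])  # 1.0 0.0
print(label)                       # 3
```

Row 5 of the image lands in positions 160 to 191 of the vector, which is why position 160 is 1 and position 159 is 0.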

Reference: Machine Learning in Action (《机器学习实战》)
Common data sets: http://archive.ics.uci.edu/ml/index.php
