A K-Nearest Neighbors Classifier in Python

Source: Internet. Published by: 网易足球数据库. Editor: 程序博客网. Time: 2024/05/30 23:02


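This post builds a basic nearest-neighbor classifier in Python. In this example (taken from the book Machine Learning in Action), we want to help Helen find suitable dates on a dating site. Helen's dates fall into three categories: people she finds very charming, people she finds somewhat charming, and people she does not like. Helen collected 1,000 records in total, one per line, and each record has three features followed by a class label.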

The code is as follows:

# -*- coding: utf-8 -*-
from numpy import *
import operator
import matplotlib.pyplot as plt


def createDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels


def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    sortedDistIndices = distances.argsort()  # indices that sort the distances in ascending order
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndices[i]]  # label of the i-th nearest neighbor
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1  # count votes per label
    # sort the labels by vote count, descending (items() replaces Python 2's iteritems())
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]  # the label with the most votes


def file2matrix(filename):  # load the data from the original txt file into a matrix
    with open(filename) as fr:
        arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)
    returnMat = zeros((numberOfLines, 3))
    classLabelVector = []
    index = 0
    for line in arrayOLines:
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))  # negative index: the last field is the label
        index += 1
    return returnMat, classLabelVector

#fig=plt.figure(1)
#ax=fig.add_subplot(111)
#ax.scatter(datingDataMat[:,1],datingDataMat[:,2])


def autoNorm(dataSet):  # min-max normalization of the features
    minVal = dataSet.min(0)
    maxVal = dataSet.max(0)
    ranges = maxVal - minVal
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVal, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))  # divide the shifted data, not the raw data
    return normDataSet, ranges, minVal


def datingClassTest(kvalue):  # run the classifier on the test set
    hoRatio = 0.50  # hold out 50% of the records as the test set
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')  # load data set from file
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], kvalue)
        #print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    #print("the total error rate is: %f" % (errorCount / float(numTestVecs)))
    return 100 * (errorCount / float(numTestVecs))  # error rate in percent


def visiualizeDiffK(klist):
    kvaluelist = []
    for i in klist:
        errorCount = datingClassTest(i)
        kvaluelist.append(errorCount)
    return kvaluelist, klist  # return only after every K has been evaluated


kvaluelist, klist = visiualizeDiffK(range(1, 200))
fig4 = plt.figure(4)
plt.plot(klist, kvaluelist, 'o')
plt.show()
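As a cross-check on classify0, here is a minimal sketch of the same voting logic written with NumPy broadcasting and collections.Counter. The name knn_classify is hypothetical and not part of the original code; the toy data is the one returned by createDataSet.

```python
from collections import Counter

import numpy as np


def knn_classify(query, samples, labels, k):
    """Return the majority label among the k samples nearest to query.

    Hypothetical helper: same idea as classify0, different spelling.
    """
    samples = np.asarray(samples, dtype=float)
    query = np.asarray(query, dtype=float)
    # Broadcasting subtracts query from every row, so tile() is not needed.
    distances = np.sqrt(((samples - query) ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]  # indices of the k smallest distances
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(knn_classify([0.0, 0.0], group, labels, 3))  # prints B
```

The three nearest neighbors of (0, 0) carry the labels B, B and A, so B wins the vote.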

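The choice of K affects the final result. The book does not analyze different values of K in any detail, so I wrote my own function, visiualizeDiffK. Its input klist holds the K values to analyze, and it returns the list of error rates (in percent) together with the list of K values.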

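Here I tested K from 1 to 199 (range(1,200)) and plotted the error rate for each value. The plot shows that when K is very small the error rate is relatively high, which may be caused by overfitting. The best value of K appears to be around 7; once K grows beyond 7, the error rate starts to climb quickly again.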
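The sweep over K can be sketched without datingTestSet2.txt. The example below is only an illustration under stated assumptions: the data is synthetic (two Gaussian clusters generated with a fixed seed, shuffled because the clusters are generated class by class), and knn_predict and error_rate are hypothetical names. Its numbers say nothing about Helen's data set.

```python
import numpy as np


def knn_predict(query, samples, labels, k):
    """Majority vote among the k nearest samples (labels are small ints)."""
    distances = np.sqrt(((samples - query) ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]
    return np.bincount(labels[nearest]).argmax()


def error_rate(samples, labels, k, holdout=0.5):
    """Percentage of held-out records that are misclassified."""
    n_test = int(len(samples) * holdout)
    train_x, train_y = samples[n_test:], labels[n_test:]
    errors = sum(
        knn_predict(samples[i], train_x, train_y, k) != labels[i]
        for i in range(n_test)
    )
    return 100.0 * errors / n_test


# Synthetic stand-in data (an assumption, not the dating data set).
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(2.0, 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
order = rng.permutation(len(x))  # mix the two classes before splitting
x, y = x[order], y[order]

k_values = [1, 3, 7, 15, 31, 99]
rates = [error_rate(x, y, k) for k in k_values]
for k, rate in zip(k_values, rates):
    print(k, rate)
```

With half of the 200 records held out, the training half has 100 records, so K must stay at or below 100.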

This is my first machine learning program in Python, so I am noting down how some of the functions and modules it uses work:

shape

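shape is a NumPy function that returns the dimensions of a matrix.

For example, a=[[1,2,3],[2,3,4]]

a=mat(a) #convert a to a matrix

shape(a)

(2, 3)

Can we get the dimensions directly, without converting a to a matrix? Experiment shows that we can: shape(a) also accepts a nested list. But suppose we do the following:

a=[[1,2,3],[2,3,4]]

a.shape[0] #size along one dimension

This raises an AttributeError, because a list has no shape attribute. In that case a must be converted to a matrix first. Note also that shape on a matrix is an attribute holding a tuple, not a method, so it is indexed with square brackets rather than called:

a=mat(a)

a.shape[0]

2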
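A short sketch of the points above, using `import numpy as np` and a plain ndarray instead of the star import and mat (that substitution is my choice, not the original post's):

```python
import numpy as np

a = [[1, 2, 3], [2, 3, 4]]

# The function form accepts a nested list directly.
print(np.shape(a))  # (2, 3)

# A plain list has no shape attribute.
print(hasattr(a, "shape"))  # False

# After conversion, shape is a tuple attribute that is indexed, not called.
arr = np.array(a)
print(arr.shape)     # (2, 3)
print(arr.shape[0])  # 2  (number of rows)
print(arr.shape[1])  # 3  (number of columns)
```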

tile

The tile function is similar to repmat in MATLAB: it replicates a vector.

a=[1,2,3]

tile(a,(2,1)) #treat a as one block and repeat it 2 times down and 1 time across

array([[1, 2, 3], [1, 2, 3]])
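The sketch below shows how the repetition tuple controls the layout, and that subtracting a tiled vector gives the same result as NumPy broadcasting. The array named data is made up for illustration.

```python
import numpy as np

a = [1, 2, 3]

tiled = np.tile(a, (2, 1))  # 2 copies stacked vertically
print(tiled)
# [[1 2 3]
#  [1 2 3]]

wide = np.tile(a, (1, 2))   # 2 copies side by side, shape (1, 6)
print(wide)
# [[1 2 3 1 2 3]]

# In classify0, tile() is used to subtract one vector from every row.
# Broadcasting performs the same subtraction without building the copy.
data = np.array([[1.0, 1.0, 1.0], [4.0, 4.0, 4.0]])
diff_tile = np.tile(a, (2, 1)) - data
diff_broadcast = np.array(a) - data
print(np.array_equal(diff_tile, diff_broadcast))  # True
```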

sum

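I have used sum often, but usually just called it directly without paying attention to its parameters.

Take this array, for example:

a=[[1,2,3],[2,3,4]]

a=mat(a) #note: a must be converted to a matrix first, since a list has no sum method

a.sum(axis=1)

matrix([[6], [9]]) #axis=1 adds up the entries within each row, giving one total per row; by analogy, axis=0 adds up each column, giving matrix([[3, 5, 7]])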
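A minimal sketch of the axis argument on a plain ndarray (using np.array rather than mat is my assumption here; the totals are the same):

```python
import numpy as np

a = np.array([[1, 2, 3], [2, 3, 4]])

row_sums = a.sum(axis=1)   # collapse the column axis: one total per row
col_sums = a.sum(axis=0)   # collapse the row axis: one total per column
total = a.sum()            # no axis: sum of every entry

print(row_sums)  # [6 9]
print(col_sums)  # [3 5 7]
print(total)     # 15

# keepdims=True keeps the result two-dimensional, like the matrix output.
print(a.sum(axis=1, keepdims=True).shape)  # (2, 1)
```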

argsort

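This function is a little awkward to explain in words: it returns the indices that would put the elements in ascending order. An example makes it clearer:

a=array([1,2,3])

a.argsort() #works on an array

array([0, 1, 2])

a=[4,6,1]

a=mat(a)

a.argsort() #works on a matrix too, but not on a list, which has no argsort method

matrix([[2, 0, 1]])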
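The following sketch shows what the returned indices mean and how classify0 uses them to pick the nearest neighbors. The variable names are illustrative only.

```python
import numpy as np

distances = np.array([4, 6, 1])

order = distances.argsort()  # positions of the elements, smallest first
print(order)                 # [2 0 1]
print(distances[order])      # [1 4 6]  (indexing with order sorts the array)

# The k nearest neighbors are simply the first k indices.
k = 2
print(order[:k])             # [2 0]

# The function form np.argsort also accepts a plain list.
print(np.argsort([4, 6, 1])) # [2 0 1]
```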

get

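The get method of a dictionary returns the value stored under a given key. Looking up a missing key with square brackets, d[key], raises a KeyError; get does not raise, and returns None by default. Passing a second argument to get sets the value returned when the key is missing.

For example, in the K-nearest neighbors classifier above:

classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1. If the dictionary does not yet contain the key voteIlabel, get returns 0, so the count starts at 1.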
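A small sketch of the counting pattern and of the difference between get and square-bracket lookup (the label list is made up):

```python
votes = {}
for label in ['A', 'B', 'A']:
    # A missing key counts as 0, so no KeyError on the first occurrence.
    votes[label] = votes.get(label, 0) + 1

print(votes)               # {'A': 2, 'B': 1}
print(votes.get('C'))      # None  (default when no second argument is given)
print(votes.get('C', 0))   # 0

try:
    votes['C']             # square brackets do raise for a missing key
except KeyError:
    print("KeyError")
```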

sorted

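The sorted function and iterables are a bit more involved, so here I quote the usage notes compiled on the blog of 天马行空W (written for Python 2).

1. First, a word on iterable. An iterable is an object that can be iterated over (this is not the same thing as an iterator).

The Python documentation describes an iterable as an object capable of returning its members one at a time. Iterables fall into three main groups:

The first group is all sequence types, such as list, str and tuple.

The second group is some non-sequence types, such as dict and file objects.

The third group is objects of any class you define with an __iter__() or __getitem__() method.

2. The Python 2 documentation for sorted:

sorted(iterable[, cmp[, key[, reverse]]])

Purpose: Return a new sorted list from the items in iterable.

The first argument is an iterable; the return value is a list containing the sorted elements of the iterable.

There are three optional arguments: cmp, key and reverse. (In Python 3, cmp was removed, and key and reverse must be passed as keyword arguments.)

1) cmp specifies a custom comparison function of two arguments (elements of the iterable). It returns a negative number if the first argument is smaller than the second, zero if they are equal, and a positive number if the first is larger. The default is None.

2) key specifies a function of one argument that extracts a comparison key from each element. The default is None.

3) reverse is a boolean. If set to True, the elements are sorted in descending order. The default is False.

In general, key and reverse are faster than an equivalent cmp function. This is because cmp is called many times for each list element, while the key function is called only once per element.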
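Since cmp is not available in Python 3, a comparison function written in the style described above can still be used by wrapping it with functools.cmp_to_key. The sketch below uses a made-up comparison (by string length) and shows the equivalent key function next to it.

```python
from functools import cmp_to_key


def compare_length(x, y):
    """cmp-style function: negative, zero or positive, as described above."""
    return len(x) - len(y)


words = ["pear", "fig", "banana"]

by_cmp = sorted(words, key=cmp_to_key(compare_length))
by_key = sorted(words, key=len)  # the same ordering with a plain key function

print(by_cmp)  # ['fig', 'pear', 'banana']
print(by_key)  # ['fig', 'pear', 'banana']
```

The key version is usually preferable: len is called once per word, while compare_length is called once per comparison.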

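3. Usage in detail:

1) Sorting basics

A simple ascending sort is easy: just call sorted(). It returns a new sorted list:

>>> sorted([5,2,3,1,4])
[1, 2, 3, 4, 5]

You can also use the list.sort() method of a list. It modifies the list in place (and returns None). It is usually less convenient than sorted(), but if you do not need the original list, list.sort() is slightly more efficient.

>>> a=[5,2,3,1,4]
>>> a.sort()
>>> a
[1, 2, 3, 4, 5]

Another difference is that list.sort() is defined only for lists, whereas sorted() accepts any iterable. Sorting a dictionary, for instance, sorts its keys:

>>> sorted({1: 'D', 2: 'B', 3: 'B', 4: 'E', 5: 'A'})
[1, 2, 3, 4, 5]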
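The differences between sorted() and list.sort() can be checked with a few lines. This is only an illustrative sketch; the sample values are arbitrary.

```python
original = [5, 2, 3, 1, 4]

new_list = sorted(original)   # returns a new list
print(new_list)               # [1, 2, 3, 4, 5]
print(original)               # [5, 2, 3, 1, 4]  (unchanged)

result = original.sort()      # sorts in place
print(result)                 # None
print(original)               # [1, 2, 3, 4, 5]

# sorted() accepts any iterable and always returns a list.
print(sorted((3, 1, 2)))      # [1, 2, 3]
print(sorted("cab"))          # ['a', 'b', 'c']
print(sorted({2: 'x', 1: 'y'}))  # [1, 2]  (dictionary keys)

# Tuples have no sort() method.
print(hasattr((3, 1, 2), "sort"))  # False
```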

2) Key Functions

Starting with Python 2.4, both list.sort() and sorted() take a key parameter that specifies a function to be called on each list element before comparisons are made.

For example, here is a case-insensitive string comparison:

>>> sorted("This is a test string from Andrew".split(), key=str.lower)
['a', 'Andrew', 'from', 'is', 'string', 'test', 'This']

The value of key should be a function that takes one argument and returns a key to use for comparison. This technique is fast because the key function is called exactly once for each input record.

A common pattern is to sort complex objects using some of the object's indices as the key. For example:

>>> student_tuples = [ ('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10), ]
>>> sorted(student_tuples, key=lambda student: student[2]) # sort by age
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]

The same technique works for objects with named attributes. For example:

>>> class Student:
...     def __init__(self, name, grade, age):
...         self.name = name
...         self.grade = grade
...         self.age = age
...     def __repr__(self):
...         return repr((self.name, self.grade, self.age))

>>> student_objects = [Student('john', 'A', 15),Student('jane', 'B', 12),Student('dave', 'B', 10), ]
>>> sorted(student_objects, key=lambda student: student.age) # sort by age
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]

3) Operator Module Functions

The key-function patterns shown above are very common, so Python provides convenience functions that make accessor functions easier and faster to write. The operator module has itemgetter, attrgetter and, starting with Python 2.6, methodcaller.

Using these functions, the examples above become simpler and faster:

>>> from operator import itemgetter, attrgetter

>>> sorted(student_tuples, key=itemgetter(2))
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]

>>> sorted(student_objects, key=attrgetter('age'))
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]

The operator module functions allow multiple levels of sorting. For example, to sort by grade and then by age:

>>> sorted(student_tuples, key=itemgetter(1,2))
[('john', 'A', 15), ('dave', 'B', 10), ('jane', 'B', 12)]

>>> sorted(student_objects, key=attrgetter('grade', 'age'))
[('john', 'A', 15), ('dave', 'B', 10), ('jane', 'B', 12)]
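The section mentions methodcaller without showing it. As a brief sketch: methodcaller(name, *args) builds a function that calls the named method on its argument, so it can serve as a key. The message list below is just sample data.

```python
from operator import methodcaller

messages = ['critical!!!', 'hurry!', 'standby', 'immediate!!']

# Key for each string s is s.count('!'), the number of exclamation marks.
by_urgency = sorted(messages, key=methodcaller('count', '!'))
print(by_urgency)
# ['standby', 'hurry!', 'immediate!!', 'critical!!!']

# The same ordering with a lambda, for comparison.
print(sorted(messages, key=lambda s: s.count('!')) == by_urgency)  # True
```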

4) Ascending and descending

Both list.sort() and sorted() accept a reverse parameter, a boolean flag that requests a descending sort. For example, to get the students in descending order of age:

>>> sorted(student_tuples, key=itemgetter(2), reverse=True)
[('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)]

>>> sorted(student_objects, key=attrgetter('age'), reverse=True)
[('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)]

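5) Sort stability and complex sorts

Starting with Python 2.2, sorts are guaranteed to be stable. This means that when several records have the same key, their original order is preserved.

>>> data = [('red', 1), ('blue', 1), ('red', 2), ('blue', 2)]
>>> sorted(data, key=itemgetter(0))
[('blue', 1), ('blue', 2), ('red', 1), ('red', 2)]

Notice that the two 'blue' records keep their original order, so ('blue', 1) is guaranteed to come before ('blue', 2). This useful property lets you build complex sorts in several passes. For example, to sort the student records by grade in descending order and then by age in ascending order, sort by age first and then by grade:

>>> s=sorted(student_objects,key=attrgetter('age')) # sort on secondary key
>>> sorted(s,key=attrgetter('grade'),reverse=True) # now sort on primary key, descending
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]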
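The two-pass idea generalizes to any number of keys: apply stable sorts from the least significant key to the most significant one. The helper below, multisort, is a hypothetical name used for illustration and works on tuples by index.

```python
from operator import itemgetter


def multisort(records, specs):
    """Sort by several keys; specs is a list of (index, reverse) pairs,
    most significant key first. Relies on the sort being stable."""
    result = list(records)
    for index, reverse in reversed(specs):  # least significant key first
        result.sort(key=itemgetter(index), reverse=reverse)
    return result


students = [('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)]

# Grade descending, then age ascending.
ranked = multisort(students, [(1, True), (2, False)])
print(ranked)
# [('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]
```

Within the 'B' group, dave stays ahead of jane because the earlier sort by age is preserved by the later sort by grade.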

Original post: http://www.cnblogs.com/woshitianma/p/3222989.html

0 0