Machine Learning in Action: A First Look at the kNN Algorithm and Understanding Its Python Code

Source: Internet | Published by: 上海淘宝厂家 | Editor: 程序博客网 | Time: 2024/05/24 06:40

This is a kNN project that I modified so that it runs cleanly under Python 3; you can use it directly for study:
http://download.csdn.net/download/qq_36396104/10142842

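What follows combines what I found by searching with my own understanding. If anything here infringes on your work, contact me and I will remove it after checking (forgive me, I am just a beginner~)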
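(I)
Python basics:
numpy:
1. shape is a function defined in numpy.core.fromnumeric (exposed as numpy.shape). It returns the dimensions of an array as a tuple; for example, shape[0] is the length of the first dimension (the number of rows of a matrix). Its argument is an array or anything array-like, such as a nested list; a scalar such as an integer gives the empty tuple ().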
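To make the shape description concrete, here is a small sketch. The array `a` is made up for illustration; the only calls used are `ndarray.shape` and `numpy.shape`.

```python
import numpy as np

# A made-up 2x3 array used only for illustration
a = np.array([[1, 2, 3], [4, 5, 6]])

print(a.shape)       # the full shape tuple: (rows, columns)
print(a.shape[0])    # length of the first dimension, i.e. the number of rows
print(a.shape[1])    # length of the second dimension, i.e. the number of columns

# np.shape also accepts array-like input such as nested lists
nested_shape = np.shape([[1, 2, 3], [4, 5, 6]])

# A scalar has no dimensions, so its shape is the empty tuple
scalar_shape = np.shape(7)
```

In the kNN code later on, `dataSet.shape[0]` is used in exactly this way to get the number of samples.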
2. tile is a function in the module numpy.lib.shape_base (exposed as numpy.tile). It repeats an array: tile(A, n) repeats the array A n times to build a new array. Specifically:
[Images missing: three examples, yielding an array, a one-dimensional array, and a two-dimensional array.]

Note that this is not the same thing as
matrix([[0,1,2],
[0,1,2]]), even though the two look identical when printed.
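Since the original screenshots are gone, here is a minimal sketch of what tile does with different repetition arguments, and of how an ndarray differs from a matrix that prints the same way. The input `A = [0, 1, 2]` is taken from the matrix shown above; the variable names are my own.

```python
import numpy as np

A = [0, 1, 2]

flat = np.tile(A, 2)          # repeat twice along the only axis, giving a 1-D array of length 6
one_row = np.tile(A, (1, 2))  # 1 row, contents repeated twice, giving shape (1, 6)
stacked = np.tile(A, (2, 1))  # 2 rows, each a copy of A, giving shape (2, 3)

print(flat)
print(one_row)
print(stacked)

# The 2-D ndarray and a matrix built from it print alike but behave differently:
mat = np.matrix(stacked)
elementwise = stacked * stacked   # ndarray: * multiplies element by element
mat_product = mat * mat.T         # matrix: * is the matrix product
```

The form `tile(inX, (dataSetSize, 1))` used in the kNN code corresponds to the `stacked` case: one copy of the input vector per row.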
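3. sum:
1. Python's built-in sum()
The first argument is an iterable such as a list []; the optional second argument is a start value that is added to the total. If the start value is a list, the addition fails, because a list and an int cannot be added: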

>>> sum([0,1,2])
3
>>> sum([0,1,2],3)
6
>>> sum([0,1,2],[3,2,1])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can only concatenate list (not "int") to list

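2. sum in numpy
These days most data processing is done with numpy. Without the axis argument, all elements are added together; axis=0 adds down each column (one result per column); axis=1 adds along each row (one result per row).
The input here can be a matrix: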

>>> import numpy as np
>>> a=np.sum([[0,1,2],[2,1,3]])
>>> a
9
>>> a.shape
()
>>> a=np.sum([[0,1,2],[2,1,3]],axis=0)
>>> a
array([2, 2, 5])
>>> a.shape
(3,)
>>> a=np.sum([[0,1,2],[2,1,3]],axis=1)
>>> a
array([3, 6])
>>> a.shape
(2,)

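4. How argsort() works:
In [4,5,1], the index of 4 is 0, the index of 5 is 1, and the index of 1 is 2.
Sorted in ascending order the values are 1, 4, 5, so the corresponding indices are 2, 0, 1. (The 0 here means element 0 of the list [4,5,1]. At first I thought the 0 was some built-in position that belonged to the number 4. Don't laugh at me >_<)
PS: argsort returns the indices that would sort the array's values in ascending order.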
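The same example, written out as a short sketch. The array [4, 5, 1] comes from the text above; `k` and the variable names are my own additions to show how the result is used to pick nearest neighbours.

```python
import numpy as np

values = np.array([4, 5, 1])

order = values.argsort()      # indices that sort the values in ascending order
print(order)                  # [2 0 1]
print(values[order])          # [1 4 5], the values in sorted order

# Taking the first k indices gives the positions of the k smallest values,
# which is how kNN finds the nearest neighbours from a distance array.
k = 2
nearest = order[:k]
```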
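5. Lists, tuples, and dictionaries: differences and usage:
(1) List: an ordered, mutable sequence, written [a, b, c].
(2) Tuple: an ordered, immutable sequence, written (a, b, c).
(3) Dictionary: a mapping from keys to values, written {key: value}.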
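A brief sketch of the three types as they show up in the kNN code. The values are made up; the counting loop mirrors the `classCount.get(voteIlabel, 0) + 1` idiom used in classify0.

```python
# List: ordered and mutable, so elements can be appended or replaced
labels = ['A', 'A', 'B']
labels.append('B')

# Tuple: ordered but immutable, so item assignment raises TypeError
shape = (4, 2)
try:
    shape[0] = 5
except TypeError as exc:
    tuple_error = str(exc)

# Dictionary: maps keys to values; get(key, default) avoids a KeyError
# the first time a key is seen
class_count = {}
for label in labels:
    class_count[label] = class_count.get(label, 0) + 1

print(labels)
print(tuple_error)
print(class_count)
```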
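6. The range() function (in Python 3, range() returns a lazy range object rather than a list, so list() is used here to show the values):

>>> list(range(1,5))    # from 1 up to 5 (5 not included)
[1, 2, 3, 4]
>>> list(range(1,5,2))  # from 1 up to 5 in steps of 2 (5 not included)
[1, 3]
>>> list(range(5))      # from 0 up to 5 (5 not included)
[0, 1, 2, 3, 4]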

A simple kNN algorithm:

from numpy import tile
import operator

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]                    # number of training samples
    diffMat = tile(inX, (dataSetSize,1)) - dataSet    # difference between inX and every sample
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5                      # Euclidean distance to every sample
    sortedDistIndicies = distances.argsort()          # sample indices ordered by ascending distance
    classCount = {}
    #print(distances)
    #print(0 , sortedDistIndicies)
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        #print(1 , voteIlabel)
        # look up voteIlabel in classCount: add 1 if present, otherwise create it with count 1
        classCount[voteIlabel] = classCount.get(voteIlabel,0)+1
        #print(2 ,classCount)
    # sort the (label, count) pairs by count, largest first
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    #print(sortedDistIndicies)
    return sortedClassCount[0][0]
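To see the algorithm run end to end, here is a self-contained sketch of the same idea. It is not the code above: `knn_classify` is a name I made up, it uses NumPy broadcasting instead of tile and `collections.Counter` instead of a sorted dictionary, and the four sample points with labels 'A' and 'B' are toy data invented for illustration.

```python
from collections import Counter

import numpy as np


def knn_classify(in_x, data_set, labels, k):
    """Return the majority label among the k training points nearest to in_x."""
    # Broadcasting subtracts in_x from every row, so tile is not needed here
    distances = np.sqrt(((data_set - in_x) ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]            # indices of the k closest samples
    votes = Counter(labels[i] for i in nearest)  # count the labels of those samples
    return votes.most_common(1)[0][0]


# Toy data (an assumption for illustration): two points near (1, 1), two near (0, 0)
group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']

near_origin = knn_classify(np.array([0.0, 0.0]), group, labels, 3)
near_one = knn_classify(np.array([1.0, 1.2]), group, labels, 3)
print(near_origin)   # B
print(near_one)      # A
```

With k=3 and two classes there can be no tie in the vote, which is one reason an odd k is a common choice for two-class problems.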

(II)
Python basics:
1. zeros()

>>> np.zeros(5)
array([0., 0., 0., 0., 0.])
>>> np.zeros((5,), dtype=int)
array([0, 0, 0, 0, 0])
>>> np.zeros((2, 1))
array([[0.],
       [0.]])
>>> s = (2,2)
>>> np.zeros(s)
array([[0., 0.],
       [0., 0.]])
>>> np.zeros((2,), dtype=[('x', 'i4'), ('y', 'i4')]) # custom dtype
array([(0, 0), (0, 0)],
      dtype=[('x', '<i4'), ('y', '<i4')])

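2. strip():
The Python strip() method removes the given characters from the beginning and end of a string (by default it removes whitespace: spaces, tabs, and newlines).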
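A short sketch of strip() as it is used when parsing a data file. The sample line is made up, in the tab-separated layout that the parsing code expects.

```python
# A made-up line in the tab-separated format: three numbers, then a label
line = "1.5\t2.0\t0.3\tdidntLike\n"

cleaned = line.strip()          # drops the trailing newline (and any outer spaces or tabs)
fields = cleaned.split('\t')    # split on tabs into a list of strings
print(fields)                   # ['1.5', '2.0', '0.3', 'didntLike']

# With an argument, strip() removes any of the listed characters from both ends
trimmed = "xxhelloxx".strip('x')
print(trimmed)                  # hello
```

Note that strip() only touches the two ends of the string; the tabs between the fields are left alone, which is why split('\t') still works afterwards.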
3. list

Parsing the data from the given text file:

from numpy import zeros

def file2matrix(filename):
    # integer codes for the text labels in the last column
    labelCodes = {"didntLike": 1, "smallDoses": 2, "largeDoses": 3}
    with open(filename) as fr:
        lines = fr.readlines()
    numberOfLines = len(lines)                   # get the number of lines in the file
    returnMat = zeros((numberOfLines,3))         # prepare matrix to return
    classLabelVector = []                        # prepare labels to return
    for index, line in enumerate(lines):
        listFromLine = line.strip().split('\t')
        # copy the first three fields of the line into row `index` of returnMat
        returnMat[index,:] = [float(value) for value in listFromLine[0:3]]
        # the last field of the line is the class label
        classLabelVector.append(labelCodes[listFromLine[-1]])
    return returnMat, classLabelVector

# Test:
import matplotlib.pyplot as plt
from numpy import array
import CreateDateSet

datingDataMat, datingLabels = CreateDateSet.file2matrix('datingTestSet.txt')
# print(datingDataMat)
fig = plt.figure()
ax = fig.add_subplot(111)
# print(datingDataMat[:,1])
# print(datingDataMat[:,2])
# plot column 1 against column 2; marker size and color both scale with the class label
ax.scatter(datingDataMat[:,1], datingDataMat[:,2], 15.0*array(datingLabels), 15.0*array(datingLabels))
plt.show()
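If you do not have datingTestSet.txt at hand, the parsing step can be tried on in-memory text. This is only a sketch: `parse_rows` and `LABEL_CODES` are names I made up, the two sample rows are invented, and I am assuming the same layout as above (three tab-separated numbers followed by one of the three text labels).

```python
import io

import numpy as np

# Assumed mapping from text labels to integer class codes, as in file2matrix
LABEL_CODES = {"didntLike": 1, "smallDoses": 2, "largeDoses": 3}


def parse_rows(handle):
    """Parse tab-separated lines into a feature matrix and a list of label codes."""
    features, labels = [], []
    for line in handle:
        fields = line.strip().split('\t')
        if len(fields) < 4:          # skip blank or incomplete lines
            continue
        features.append([float(value) for value in fields[:3]])
        labels.append(LABEL_CODES[fields[-1]])
    return np.array(features), labels


# Two made-up rows standing in for the real file
sample = io.StringIO("1.0\t2.0\t3.0\tdidntLike\n4.0\t5.0\t6.0\tlargeDoses\n")
sample_mat, sample_labels = parse_rows(sample)
print(sample_mat)
print(sample_labels)
```

Because `parse_rows` only iterates over lines, it works the same way on an open file object as on the StringIO used here.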

(III)

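from numpy import tile

def autoNorm(dataSet):
    """Normalize the values of each feature column."""
    minVals = dataSet.min(0)                        # minimum of each column
    maxVals = dataSet.max(0)                        # maximum of each column
    ranges = maxVals - minVals
    # normDataSet = zeros(shape(dataSet))           # redundant line from the book, but it helps in understanding the algorithm
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m,1))    # subtract the minimum, in preparation for normalizing
    normDataSet = normDataSet/tile(ranges, (m,1))   # element-wise divide: the normalization itself
    return normDataSet, ranges, minVals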
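The normalization above computes newValue = (oldValue - min) / (max - min) for each column, which maps every feature into the range 0 to 1. Here is a worked sketch on a made-up 3x2 array; it uses NumPy broadcasting in place of tile, which gives the same result for this case.

```python
import numpy as np

# Made-up data: two features on very different scales
data = np.array([[10.0, 200.0],
                 [20.0, 400.0],
                 [30.0, 300.0]])

min_vals = data.min(axis=0)             # [10, 200]
ranges = data.max(axis=0) - min_vals    # [20, 200]

# Broadcasting applies the per-column min and range to every row
norm = (data - min_vals) / ranges
print(norm)
# [[0.  0. ]
#  [0.5 1. ]
#  [1.  0.5]]

# A new sample must be scaled with the same min_vals and ranges
new_sample = (np.array([15.0, 250.0]) - min_vals) / ranges
```

This is why autoNorm returns ranges and minVals along with the normalized matrix: they are needed later to scale any new input in the same way. Note that a column whose values are all equal has a range of 0, and the division would then fail to give a useful number.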

(IV) Testing the algorithm and the dataset with the hold-out method

from CreateDateSet import file2matrix
from HandleDate import autoNorm
from kNN import classify0

hoRatio = 0.10      # hold-out method: use 10% of the data as the test set
datingDataMat, datingLabels = file2matrix('datingTestSet.txt')   # load data set from file
normMat, ranges, minVals = autoNorm(datingDataMat)               # normalize
m = normMat.shape[0]
numTestVecs = int(m*hoRatio)    # number of test samples (100 when the file has 1000 rows)
errorCount = 0.0
for i in range(numTestVecs):    # the first numTestVecs rows are the test set
    # classify with kNN, using the remaining rows as training data
    classifierResult = classify0(normMat[i,:], normMat[numTestVecs:m,:], datingLabels[numTestVecs:m], 3)
    print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i]))
    if classifierResult != datingLabels[i]:
        errorCount += 1.0
print('the total error rate is: %f' % (errorCount / float(numTestVecs)))
print(errorCount)
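The hold-out procedure can also be tried without the data file. The sketch below is an illustration under my own assumptions, not the project's code: the data is synthetic (two clusters generated with a fixed random seed), `knn_predict` is a made-up helper, and normalization is skipped because both synthetic features are already on the same scale.

```python
import numpy as np


def knn_predict(x, train_x, train_y, k):
    """Majority vote among the k training points closest to x (integer labels)."""
    order = ((train_x - x) ** 2).sum(axis=1).argsort()[:k]
    votes = np.bincount(train_y[order])   # votes[c] = number of neighbours with class c
    return int(votes.argmax())


# Synthetic data: classes 0 and 1 alternate, so the held-out slice contains both
rng = np.random.default_rng(0)
n = 100
labels = np.arange(n) % 2
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
points = centers[labels] + rng.normal(scale=0.5, size=(n, 2))

ho_ratio = 0.10                     # hold out the first 10% as the test set
num_test = int(n * ho_ratio)
errors = 0
for i in range(num_test):
    # train on everything after the held-out slice, exactly as in the loop above
    predicted = knn_predict(points[i], points[num_test:], labels[num_test:], 3)
    if predicted != labels[i]:
        errors += 1

error_rate = errors / num_test
print('the total error rate is: %f' % error_rate)
```

As in the original loop, the test rows are never part of the training slice, so the error rate is measured on data the classifier has not seen.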
