K-Nearest Neighbors (kNN) Algorithm

Source: Internet · Editor: 程序博客网 · Date: 2024/05/21 11:16

Characteristics of kNN

Pros: high accuracy, insensitive to outliers, no assumptions about the input data.
Cons: high computational complexity and high space complexity.
Applicable data types: numeric and nominal.

kNN Pseudocode

Input: training set D = {(x1, y1), (x2, y2), ..., (xm, ym)};
   number of neighbors k.
Output: predicted class label for each query point.

For each point in the dataset whose class is unknown, do the following:
(1) compute the distance between the current point and every point in the labeled dataset;
(2) sort the distances in increasing order;
(3) take the k points closest to the current point;
(4) count how often each class appears among those k points;
(5) return the most frequent class among the k points as the predicted class of the current point.
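The five steps above can be sketched in a few lines of NumPy; the toy points and labels below are made up for illustration:

```python
import numpy as np
from collections import Counter

# made-up 2-D training points with two classes
train = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']

def knn_predict(x, train, labels, k):
    dists = np.sqrt(((train - x) ** 2).sum(axis=1))  # step (1): distances
    order = dists.argsort()                          # steps (2)-(3): sort, take k
    votes = Counter(labels[i] for i in order[:k])    # step (4): class frequencies
    return votes.most_common(1)[0][0]                # step (5): majority class

print(knn_predict(np.array([0.1, 0.1]), train, labels, 3))  # prints: B
```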

Tasks

  1. Use the k-nearest neighbors algorithm to classify candidate matches recommended by a dating website into suitable categories.
  2. Use the k-nearest neighbors algorithm to build a simple handwritten-digit recognition system.

Datasets

Task 1 dataset: stored in datingTestSet2.txt, one sample per line, 1000 lines in total. Each sample carries the following 3 features:

| No. | Feature                                      |
|-----|----------------------------------------------|
| 1   | frequent flyer miles earned per year         |
| 2   | percentage of time spent playing video games |
| 3   | liters of ice cream consumed per week        |

Labels: didn't like / liked in small doses / liked in large doses.

Task 2 dataset: the training set is in trainingDigits and contains about 2000 examples; the test set is in testDigits and contains about 900 examples.

| No. | Feature                  | Label |
|-----|--------------------------|-------|
| 1   | all pixels of each image | 0 - 9 |

Python implementation of kNN

Task 1

kNN.classifyPerson

```python
# Module-level imports assumed throughout kNN.py (book style):
from numpy import *
import operator
from os import listdir

def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    percentTats = float(input("percentage of time spent playing video games?"))
    ffMiles = float(input("frequent flyer miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = array([ffMiles, percentTats, iceCream])
    classifierResult = classify0((inArr - minVals) / ranges, normMat, datingLabels, 3)
    print("You will probably like this person:", resultList[classifierResult - 1])
```

kNN.file2matrix

```python
# input: dataset file name
# output: feature matrix, class label vector
def file2matrix(filename):
    fr = open(filename)
    numberOfLines = len(fr.readlines())      # get the number of lines in the file
    returnMat = zeros((numberOfLines, 3))    # prepare matrix to return
    classLabelVector = []                    # prepare labels to return
    fr = open(filename)
    index = 0
    for line in fr.readlines():
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector
```

kNN.autoNorm

Min-Max normalization: x' = (x - min) / (max - min)
Purpose: prevent features with larger numeric ranges from dominating the distance computation and biasing the classification.

```python
# input: dataset
# output: normalized dataset, per-column ranges, per-column minimums
# algorithm: Min-Max normalization
def autoNorm(dataSet):
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    normDataSet = zeros(shape(dataSet))
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))   # element-wise divide
    return normDataSet, ranges, minVals
```
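For reference, the same min-max scaling can be written with NumPy broadcasting instead of tile; the data matrix below is a made-up example:

```python
import numpy as np

# made-up dataset: two features on very different scales
data = np.array([[10.0, 400.0],
                 [20.0, 200.0],
                 [30.0,   0.0]])

min_vals = data.min(axis=0)            # per-column minimum
ranges = data.max(axis=0) - min_vals   # per-column (max - min)
norm = (data - min_vals) / ranges      # broadcasting replaces tile()
print(norm)                            # every column now lies in [0, 1]
```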

kNN.classify0

```python
# input:
#   inX: input vector to classify
#   dataSet: full matrix of training examples
#   labels: vector of labels of the training examples
#   k: number of nearest neighbors to use in the voting
# output: predicted label of inX
# algorithm: kNN
def classify0(inX, dataSet, labels, k):
    # calculate the distances between inX and each row of dataSet
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    # find the k nearest neighbors and take the most frequent label among them
    # (vote counting via a dict / hash map)
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
```

Task 1 results

[result screenshot omitted]

Task 2

kNN.handwritingClassTest

```python
def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir('trainingDigits')   # load the training set
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    # construct the training matrix and corresponding labels
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]        # take off .txt
        classNumStr = int(fileStr.split('_')[0])
        hwLabels.append(classNumStr)
        trainingMat[i, :] = img2vector('trainingDigits/%s' % fileNameStr)
    testFileList = listdir('testDigits')           # iterate through the test set
    errorCount = 0.0
    mTest = len(testFileList)
    # classify each test vector and tally the errors
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]        # take off .txt
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, classNumStr))
        if classifierResult != classNumStr:
            errorCount += 1.0
    print("\nthe total number of errors is: %d" % errorCount)
    print("\nthe total error rate is: %f" % (errorCount / float(mTest)))
```

kNN.img2vector

```python
# input: image file name (32x32 text image of '0'/'1' characters)
# output: 1x1024 vector of the image
def img2vector(filename):
    returnVect = zeros((1, 1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVect[0, 32 * i + j] = int(lineStr[j])
    return returnVect
```

kNN.classify0 is the same as above.

Task 2 results

[result screenshot omitted]

Known issues and improvements

Finding the k nearest neighbors by a linear scan over the training set is computationally expensive: every query must compute a distance against all n training samples. A common improvement is to index the training set with a kd-tree, which lowers the average query cost substantially.
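As a sketch of the kd-tree approach, SciPy's cKDTree (assuming SciPy is available; the training points below are made up) builds the index once and then answers each k-nearest-neighbor query in roughly logarithmic average time instead of a full scan:

```python
import numpy as np
from scipy.spatial import cKDTree

# made-up 2-D training points
train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0]])

tree = cKDTree(train)                      # build once: O(n log n)
dists, idx = tree.query([0.1, 0.1], k=3)   # k nearest neighbors of the query point
print(idx)                                 # indices of the 3 nearest training points
```

The returned indices can then be fed into the same majority vote that classify0 performs over its k candidates.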
