Machine Learning Notes: The k-Nearest Neighbors Algorithm

Source: Internet  Editor: 程序博客网  Date: 2024/04/29 23:08

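The k-Nearest Neighbors algorithm (kNN)

The basic idea of kNN

Compute the Euclidean distance from the input point to every point in the data set, sort the distances in ascending order, and take the k smallest. Among those k nearest points, find the majority class, that is, the label that occurs most often. That label is the predicted class of the input.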

General approach to kNN

Collect: Any method.
Prepare: Numeric values are needed for a distance calculation. A structured data format is best.
Analyze: Any method.
Train: Does not apply to the kNN algorithm.
Test: Calculate the error rate.
Use: This application needs to get some input data and output structured numeric values. Next, the application runs the kNN algorithm on this input data and determines which class the input data should belong to. The application then takes some action on the calculated class.
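
The "Test" step can be made concrete with a short sketch. The helper names `knn_predict` and `error_rate` and the small test set below are hypothetical, chosen only for illustration; the last test label is deliberately wrong so that the error rate is not zero.

```python
import numpy as np
from collections import Counter

def knn_predict(x, data, labels, k):
    """Classify x by majority vote among its k nearest rows of data."""
    distances = np.sqrt(((data - x) ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

def error_rate(test_data, test_labels, train_data, train_labels, k):
    """Fraction of test points whose predicted label differs from the true one."""
    errors = sum(
        knn_predict(x, train_data, train_labels, k) != y
        for x, y in zip(test_data, test_labels)
    )
    return errors / len(test_labels)

# Hypothetical toy data, not taken from a real data set
train_data = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
train_labels = ['A', 'A', 'B', 'B']
test_data = np.array([[0.9, 0.9], [0.1, 0.2], [0.2, 0.0]])
test_labels = ['A', 'B', 'A']  # the last label is deliberately wrong

rate = error_rate(test_data, test_labels, train_data, train_labels, 3)
print(rate)  # one of the three test points is misclassified
```

An error rate of 0 means every test point was classified correctly, and 1 means every one was wrong.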

Worked example

Generate the data at the coordinates shown in the figure below.
[Figure: the sample data points]

import numpy as np

def createDataSet():
    group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

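NumPy is an open-source third-party Python library for fast data processing. It performs especially well on large matrix computations.
A simple implementation of the kNN algorithm: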

import operator

def classify0(inX, dataSet, labels, k):
    # Number of rows (samples) in the data set
    dataSetSize = dataSet.shape[0]
    # Repeat inX once per row, then subtract the data set
    # For tile() usage see: http://blog.csdn.net/april_newnew/article/details/44176059
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    # Euclidean distance: square, sum along each row, take the square root
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5
    # argsort() returns the indices that sort the array in ascending order
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # items() yields the (label, count) pairs (iteritems() in Python 2)
    # itemgetter(1) picks the count as the sort key
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

Test call:

print(classify0([0.6,0.8], createDataSet()[0],createDataSet()[1], 3))

With input [0.6,0.8] and k=3, the result is A.
With input [0.3,0.5] and k=3, the result is B.
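
To see where the first result comes from, the distance step inside `classify0` can be run on its own. The sketch below also compares `np.tile` with plain NumPy broadcasting, which gives the same differences without building the repeated matrix. The variable names are chosen for illustration only.

```python
import numpy as np

data_set = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
in_x = np.array([0.6, 0.8])

# np.tile repeats in_x once per row so both operands have the same shape
tiled_diff = np.tile(in_x, (data_set.shape[0], 1)) - data_set
# Broadcasting stretches in_x across the rows implicitly
broadcast_diff = in_x - data_set
same = np.array_equal(tiled_diff, broadcast_diff)
print(same)

# Distances to the four points, then the row indices from nearest to farthest
distances = np.sqrt((broadcast_diff ** 2).sum(axis=1))
order = distances.argsort()
print(distances)
print(order)

# Labels of the k=3 nearest points: two votes for 'A', one for 'B'
nearest_labels = [labels[i] for i in order[:3]]
print(nearest_labels)
```

The two nearest points both carry label A, so the vote for k=3 is 2 to 1 in favor of A.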

Computing distance with the Euclidean distance

For two-dimensional coordinates, given two points (xA0, xB0) and (xA1, xB1), the distance between them is

$$d = \sqrt{(x_{A1} - x_{A0})^2 + (x_{B1} - x_{B0})^2}$$

For example, to compute the distance between (0,0) and (1,2), the formula gives

$$d = \sqrt{(1 - 0)^2 + (2 - 0)^2}$$

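For coordinates with more dimensions, subtract along each corresponding dimension, square, and sum. For example, the distance between (1,0,0,1) and (7,6,8,4) is: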

$$d = \sqrt{(7 - 1)^2 + (6 - 0)^2 + (8 - 0)^2 + (4 - 1)^2}$$
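
The two worked examples above can be checked numerically. This is a minimal sketch; the function name `euclidean_distance` is an illustrative choice, not something defined elsewhere in these notes.

```python
import numpy as np

def euclidean_distance(p, q):
    """Euclidean distance between two points with the same number of dimensions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Subtract per dimension, square, sum, then take the square root
    return np.sqrt(((q - p) ** 2).sum())

d2 = euclidean_distance((0, 0), (1, 2))              # sqrt(1 + 4) = sqrt(5)
d4 = euclidean_distance((1, 0, 0, 1), (7, 6, 8, 4))  # sqrt(36 + 36 + 64 + 9) = sqrt(145)
print(d2)
print(d4)
```

The first distance is the square root of 5, about 2.236, and the second is the square root of 145, about 12.042.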

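Supplement

When one feature of the sample data has a much larger numeric range than the others, it dominates the computed distances, so the data needs to be normalized. Scaling brings all values into a common range, usually 0 to 1 or -1 to 1. The scaling formula is: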

$$newValue = \frac{oldValue - min}{max - min}$$

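Here min is the minimum value and max is the maximum value of the feature. This formula maps values into the range 0 to 1.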
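
A minimal sketch of this scaling applied column by column with NumPy. The function name `min_max_normalize` and the sample values are hypothetical. The guard for a constant column (where max equals min) is an added assumption that avoids division by zero.

```python
import numpy as np

def min_max_normalize(data):
    """Scale each column of data to [0, 1] using (value - min) / (max - min)."""
    data = np.asarray(data, dtype=float)
    col_min = data.min(axis=0)
    col_max = data.max(axis=0)
    ranges = col_max - col_min
    # Assumption: a constant column has range 0; use 1 so it maps to 0 instead of NaN
    ranges[ranges == 0] = 1.0
    normalized = (data - col_min) / ranges
    return normalized, col_min, ranges

# Hypothetical samples: the first feature is thousands of times larger than the second
samples = np.array([[1000.0, 0.5],
                    [3000.0, 0.1],
                    [2000.0, 0.9]])
normalized, col_min, ranges = min_max_normalize(samples)
print(normalized)
```

Returning `col_min` and `ranges` as well lets a new input point be scaled the same way before it is classified.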

1 0