Machine Learning: KNN Implementation

Source: Internet | Published by: 淘宝鹊桥报名入口 | Editor: 程序博客网 | Time: 2024/06/01 21:48

I. KNN (K-Nearest Neighbors) Overview

KNN is a distance-based method for classification and regression.

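Its main steps are:

Compute the distance between the test sample and every training sample (common distance metrics include Euclidean distance, Mahalanobis distance, etc.);
Sort all of these distances in ascending order;
Select the k samples with the smallest distances;
Let the labels of these k samples vote to obtain the final class (for regression, average their target values instead);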
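The four steps above can be sketched in a few lines of NumPy. This is only a minimal illustration, assuming Euclidean distance and a plain majority vote; the function name knn_classify and the toy data are made up for this example.

```python
from collections import Counter

import numpy as np


def knn_classify(x, train_X, train_y, k):
    """Classify one point x by majority vote among its k nearest training samples."""
    # Step 1: Euclidean distance from x to every training sample (broadcasting).
    distances = np.sqrt(((train_X - x) ** 2).sum(axis=1))
    # Step 2: indices that sort the distances in ascending order.
    order = np.argsort(distances)
    # Step 3: labels of the k closest samples.
    nearest_labels = [train_y[i] for i in order[:k]]
    # Step 4: majority vote.
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy data (hypothetical): two well-separated groups.
train_X = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
train_y = ["A", "A", "B", "B"]

label_near_origin = knn_classify(np.array([0.1, 0.1]), train_X, train_y, k=3)
label_near_ones = knn_classify(np.array([0.9, 1.0]), train_X, train_y, k=3)
print(label_near_origin, label_near_ones)
```

For regression, the last step would average the targets of the k nearest samples instead of counting votes.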

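Advantages:

The theory is mature and the idea is simple; it can be used for both classification and regression;
It can handle non-linear classification;
Training is cheap: brute-force KNN only stores the training set, which is O(n), and the real cost is paid at prediction time;
It makes no assumptions about the data distribution, is often accurate, and is fairly insensitive to outliers;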

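Disadvantages:

Heavy computation (mainly the distance calculations);
Performs poorly on imbalanced data (some classes have many samples while others have very few);
Requires a lot of memory;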
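The class-imbalance problem can be shown on a tiny example. The sketch below uses made-up 1-D data in which the minority class sits right next to the query but is outnumbered inside the neighborhood; it compares a plain vote with distance weighting, which is one possible mitigation rather than a general fix.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 1-D data: 2 minority samples (class 1) close to the query,
# 5 majority samples (class 0) farther away.
X = [[0.1], [0.2], [1.0], [1.1], [1.2], [1.3], [1.4]]
y = [1, 1, 0, 0, 0, 0, 0]
query = [[0.0]]

# With k=5 the neighborhood holds 2 minority and 3 majority samples.
uniform = KNeighborsClassifier(n_neighbors=5, weights="uniform").fit(X, y)
weighted = KNeighborsClassifier(n_neighbors=5, weights="distance").fit(X, y)

pred_uniform = uniform.predict(query)[0]    # plain vote: the majority class wins 3 to 2
pred_weighted = weighted.predict(query)[0]  # 1/distance weights favor the close samples
print(pred_uniform, pred_weighted)
```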

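II. Implementation with sklearn

1. sklearn.neighbors

All classes related to nearest-neighbor methods live in the sklearn.neighbors package.

The KNN classifier class is KNeighborsClassifier, and the KNN regressor class is KNeighborsRegressor.

Beyond these there are extensions of KNN: the radius-limited nearest-neighbor classifier RadiusNeighborsClassifier, the radius-limited nearest-neighbor regressor RadiusNeighborsRegressor, and the nearest-centroid classification algorithm NearestCentroid.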
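As a rough illustration of how some of these classes are used, the sketch below fits KNeighborsRegressor, RadiusNeighborsClassifier, and NearestCentroid on small made-up 1-D datasets. The data, the radius, and the query points are assumptions chosen only to keep the example easy to follow.

```python
from sklearn.neighbors import (
    KNeighborsRegressor,
    NearestCentroid,
    RadiusNeighborsClassifier,
)

# Hypothetical 1-D regression data: the target equals the feature.
X_reg = [[0.0], [1.0], [2.0], [3.0]]
y_reg = [0.0, 1.0, 2.0, 3.0]

# KNN regression: the prediction is the mean target of the k nearest samples.
reg = KNeighborsRegressor(n_neighbors=2).fit(X_reg, y_reg)
reg_pred = reg.predict([[1.4]])[0]  # neighbors are x=1 and x=2

# Hypothetical 1-D classification data: two groups.
X_cls = [[0.0], [0.5], [3.0], [3.5]]
y_cls = [0, 0, 1, 1]

# Radius-based classification: vote among all samples within `radius` of the query.
rad = RadiusNeighborsClassifier(radius=1.0).fit(X_cls, y_cls)
rad_pred = rad.predict([[0.2]])[0]

# Nearest centroid: assign the class whose mean is closest to the query.
cen = NearestCentroid().fit(X_cls, y_cls)
cen_pred = cen.predict([[3.0]])[0]

print(reg_pred, rad_pred, cen_pred)
```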

2. KNN classification

(1) Generating random data

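import numpy as np
import matplotlib.pyplot as plt
# %matplotlib inline  # uncomment when running in a Jupyter notebook
# Note: the old sklearn.datasets.samples_generator path is gone in recent scikit-learn releases
from sklearn.datasets import make_classification

# X holds the sample features and Y the class labels: 1000 samples in total,
# 2 features per sample, 3 output classes, no redundant features, one cluster per class
X, Y = make_classification(n_samples=1000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, n_classes=3)
plt.scatter(X[:, 0], X[:, 1], marker='o', c=Y)
plt.show()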

The result is shown in the figure below:
[Figure: scatter plot of the generated samples, colored by class]

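The make_classification function

from sklearn.datasets import make_classification

sklearn.datasets.make_classification(n_samples=100, n_features=20, n_informative=2, n_redundant=2, n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None)

It is typically used to generate data for classification algorithms.
n_features: total number of features = n_informative + n_redundant + n_repeated, plus any remaining features, which are filled with random noise
n_informative: number of informative features
n_redundant: number of redundant features, generated as random linear combinations of the informative features
n_repeated: number of duplicated features, drawn at random from the informative and redundant features
n_classes: number of classes
n_clusters_per_class: number of clusters that make up each class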
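To make the feature bookkeeping concrete, here is a small sketch that requests 6 features made of 2 informative, 2 redundant, and 1 repeated feature, which leaves 1 pure-noise column. The specific counts and the seed are arbitrary choices for illustration; shuffle=False is set so the columns keep the order described in the scikit-learn documentation (informative, redundant, repeated, noise).

```python
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=200,
    n_features=6,
    n_informative=2,
    n_redundant=2,
    n_repeated=1,
    n_classes=2,
    n_clusters_per_class=2,
    shuffle=False,     # keep the documented column order
    random_state=0,    # arbitrary seed, for reproducibility
)

# 6 features = 2 informative + 2 redundant + 1 repeated + 1 noise column.
print(X.shape, np.unique(y))

# With shuffle=False, column 4 is a copy of one of the first four columns.
is_duplicate = any(np.array_equal(X[:, 4], X[:, j]) for j in range(4))
print(is_duplicate)
```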

(2) Fitting the model

We fit a KNN model with K=15 and distance-based weights, so closer neighbors count for more. The code is as follows:

from sklearn import neighbors

clf = neighbors.KNeighborsClassifier(n_neighbors=15, weights='distance')
clf.fit(X, Y)
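The fit above uses all of the generated data. If a quality estimate is wanted, one common option, which is not part of the original walkthrough, is to hold out part of the data. The sketch below assumes a 75/25 split and fixed seeds purely for reproducibility. With weights='distance', training accuracy is usually 1.0 because each training point is its own nearest neighbor at distance 0, so the held-out score is the more informative number.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same kind of data as in the walkthrough; random_state is an added assumption.
X, Y = make_classification(n_samples=1000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, n_classes=3, random_state=0)

# Hold out 25% of the samples for evaluation.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=0)

clf = KNeighborsClassifier(n_neighbors=15, weights='distance')
clf.fit(X_train, Y_train)

train_acc = clf.score(X_train, Y_train)  # accuracy on the data the model has stored
test_acc = clf.score(X_test, Y_test)     # accuracy on unseen data
print(train_acc, test_acc)
```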

(3) Prediction

from matplotlib.colors import ListedColormap

cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

# Determine the bounds of the training set
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1

# Build a dense grid of points over that region to serve as the test set, then predict
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                     np.arange(y_min, y_max, 0.02))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

# Plot the predictions for the test grid
Z = Z.reshape(xx.shape)
plt.figure()
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

# Also plot all of the training samples
plt.scatter(X[:, 0], X[:, 1], c=Y, cmap=cmap_bold)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("3-Class classification (k = 15, weights = 'distance')")
plt.show()

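The result is shown in the figure below:
[Figure: decision regions predicted by the KNN classifier with the training samples overlaid, titled "3-Class classification (k = 15, weights = 'distance')"]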
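Besides coloring a whole grid, it can be useful to inspect the prediction for a single point. The sketch below is an illustration only: it regenerates similar data with a fixed seed (an assumption), then queries a hypothetical point with predict, predict_proba, and kneighbors.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Regenerate comparable data; random_state is an added assumption.
X, Y = make_classification(n_samples=1000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, n_classes=3, random_state=0)
clf = KNeighborsClassifier(n_neighbors=15, weights='distance').fit(X, Y)

new_point = np.array([[0.0, 0.0]])      # hypothetical query point

pred = clf.predict(new_point)           # predicted class label
proba = clf.predict_proba(new_point)    # weighted vote share per class, shape (1, 3)
dist, idx = clf.kneighbors(new_point)   # distances and row indices of the 15 neighbors

print(pred[0], proba[0])
print(Y[idx[0]])                        # labels of the neighbors that voted
```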

III. KNN source code

import operator
from numpy import tile

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]  # the number of samples
    # tile works like MATLAB's "replicate" function;
    # this trick avoids an explicit loop
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet  # replicate inX into dataSetSize rows
    sqDiffMat = diffMat**2  # square element-wise
    sqDistances = sqDiffMat.sum(axis=1)  # sum along each row
    distances = sqDistances**0.5  # square root gives the distance
    sortedDistIndicies = distances.argsort()  # argsort returns the indices that sort the values in ascending order
    classCount = {}
    # vote
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]  # label of the i-th nearest sample
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1  # dict.get returns 0 if the key is absent
    # operator.itemgetter(1) sorts by value; key=lambda item: item[1] works too
    # sorted() leaves the original classCount unchanged
    sortedClassCount = sorted(classCount.items(),  # key-value pairs (iteritems() exists only in Python 2)
                              key=operator.itemgetter(1), reverse=True)  # descending order
    return sortedClassCount[0][0]  # the label with the most votes among the k nearest neighbors
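One way to sanity-check classify0 is to run it on a tiny dataset and compare it with sklearn's brute-force KNN. So that the snippet runs on its own, it repeats a condensed Python 3 version of the function; the toy data and query points are assumptions made for this check.

```python
import operator

import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def classify0(inX, dataSet, labels, k):
    """Condensed Python 3 version of the function above."""
    diffMat = np.tile(inX, (dataSet.shape[0], 1)) - dataSet
    distances = (diffMat ** 2).sum(axis=1) ** 0.5
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # Label with the highest vote count.
    return max(classCount.items(), key=operator.itemgetter(1))[0]


# Toy data (hypothetical): two groups labeled 'A' and 'B'.
group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
queries = np.array([[0.0, 0.0], [1.0, 1.2]])

manual = [classify0(q, group, labels, 3) for q in queries]

# Reference: sklearn with the same k, an unweighted vote, and brute-force search.
sk = KNeighborsClassifier(n_neighbors=3, weights='uniform', algorithm='brute')
sk_pred = sk.fit(group, labels).predict(queries)

print(manual, list(sk_pred))
```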
