K-Nearest Neighbors Algorithm (Part 1)
Simply put, the k-nearest neighbors (kNN) algorithm classifies a sample by measuring the distances between its feature values and those of known samples.
1. Principle
In kNN, once the training set, the number of neighbors k, the distance metric, and the decision rule are fixed, the algorithm in effect uses the training set to partition the feature space into cells, with each training sample claiming a region of the space. In the nearest-neighbor case (k = 1), a test sample that falls inside some training sample's neighborhood is assigned that sample's class.
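The implementation later in this post uses plain Euclidean distance as the metric. For concreteness, a minimal standalone sketch of that metric (the helper name euclidean is mine, not from the original code):

import numpy as np

def euclidean(a, b):
    # Straight-line (L2) distance between two feature vectors.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.sqrt(((a - b) ** 2).sum())

print(euclidean([0, 0], [3, 4]))  # 5.0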
2. Algorithm
Given a training set whose data and labels are already known, we take a test sample, compare its features with the corresponding features of every training sample, and find the K training samples most similar to it; the predicted class of the test sample is then the class that occurs most often among those K samples. The algorithm can be described as follows (a toy walk-through of the five steps appears after the list):
1) Compute the distance between the test sample and each training sample;
2) Sort the distances in ascending order;
3) Select the K points with the smallest distances;
4) Count the frequency of each class among these K points;
5) Return the most frequent class among the K points as the predicted class of the test sample.
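Before the full implementation, here is a toy walk-through of the five steps with k = 3; the 2-D points and labels are made up for illustration:

import numpy as np
from collections import Counter

train = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])  # made-up training features
labels = ['A', 'A', 'B', 'B']                                       # made-up training labels
test = np.array([0.1, 0.1])                                         # made-up test sample

distances = np.sqrt(((train - test) ** 2).sum(axis=1))  # 1) distance to every training sample
order = distances.argsort()                             # 2) sort distances ascending
kNearest = [labels[i] for i in order[:3]]               # 3) take the K = 3 closest points
votes = Counter(kNearest)                               # 4) count class frequencies
print(votes.most_common(1)[0][0])                       # 5) most frequent class -> prints 'B'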
3. Python Implementation
(1) Predicting match outcomes for a dating site
Experimental data
The datingTestSet2.txt file can be downloaded from: http://download.csdn.net/detail/jay_xio/8543027
from numpy import *
import operator


def file2matrix(filename):
    # Parse the tab-delimited data file into a feature matrix and a label list.
    fr = open(filename)
    arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)
    returnMat = zeros((numberOfLines, 3))
    classLabelVector = []
    index = 0
    for line in arrayOLines:
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index, :] = listFromLine[0:3]          # first three columns are features
        classLabelVector.append(int(listFromLine[-1]))   # last column is the class label
        index += 1
    return returnMat, classLabelVector


def classify0(inX, dataSet, labels, k):
    # Classify inX by majority vote among its k nearest neighbors in dataSet.
    dataSetSize = dataSet.shape[0]
    # Step 1: Euclidean distance from inX to every training sample
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    # Step 2: sort the distances in ascending order
    sortedDistIndicies = distances.argsort()
    ClassCount = {}
    # Steps 3-4: tally the labels of the k closest samples
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        ClassCount[voteIlabel] = ClassCount.get(voteIlabel, 0) + 1
    # Step 5: return the most frequent label
    sortedClassCount = sorted(ClassCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]


def autoNorm(dataSet):
    # Scale every feature to the [0, 1] range: (x - min) / (max - min).
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    normDataSet = zeros(shape(dataSet))
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))
    return normDataSet, ranges, minVals


def datingClassTest():
    # Hold out the first 10% of the data as a test set and report the error rate.
    hoRatio = 0.10
    datingDataMat, datingLabels = file2matrix(r'E:\机器学习\machinelearninginaction\Ch02\datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    print(normMat[numTestVecs:m, :])   # debug: show the normalized training portion
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 5)
        print("the classifier came back with : %d,the real answer is :%d" % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))


datingClassTest()
Output:
the classifier came back with : 3,the real answer is :3
the classifier came back with : 2,the real answer is :2
the classifier came back with : 1,the real answer is :1
the classifier came back with : 1,the real answer is :1
...
the classifier came back with : 3,the real answer is :3
the classifier came back with : 3,the real answer is :3
the classifier came back with : 2,the real answer is :2
the classifier came back with : 2,the real answer is :1
the classifier came back with : 1,the real answer is :1
the total error rate is: 0.050000
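To classify a brand-new sample rather than a held-out row, its features must be scaled with the same ranges and minVals that autoNorm computed from the training data. A minimal sketch, assuming the functions above are in scope; classifyNewSample and the example feature values are hypothetical, not part of the original post:

def classifyNewSample(features, dataMat, labels, k=5):
    # Hypothetical helper: normalize one new sample exactly as the
    # training data was normalized, then run the kNN vote.
    normMat, ranges, minVals = autoNorm(dataMat)
    normSample = (array(features) - minVals) / ranges
    return classify0(normSample, normMat, labels, k)

# Example with made-up feature values:
# datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
# print(classifyNewSample([40000, 8.0, 0.95], datingDataMat, datingLabels))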
Plotting the raw data:
Method 1:
import kNN
import matplotlib
import matplotlib.pyplot as plt
from numpy import *

datingDataMat, datingLabels = kNN.file2matrix(r"E:\机器学习\machinelearninginaction\Ch02\datingTestSet2.txt")
fig = plt.figure()
ax = fig.add_subplot(111)
# Scale both marker size and marker color by the class label (1, 2, or 3),
# so the three classes can be told apart in a single scatter call.
ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2],
           15.0 * array(datingLabels), 15.0 * array(datingLabels))
plt.show()
Output: a scatter plot of the second and third features, with marker size and color varying by class label.
Method 2:
import kNN
import matplotlib
import matplotlib.pyplot as plt
from numpy import *

datingDataMat, datingLabels = kNN.file2matrix(r"E:\机器学习\machinelearninginaction\Ch02\datingTestSet2.txt")
fig = plt.figure()
# add_subplot(nmi) divides the figure into an n*m grid of subplots and
# places the following plot at position i.
ax = fig.add_subplot(111)
l = datingDataMat.shape[0]
# Coordinate lists for class 1, class 2, and class 3
X1 = []
Y1 = []
X2 = []
Y2 = []
X3 = []
Y3 = []
for i in range(l):
    if int(datingLabels[i]) == 1:
        X1.append(datingDataMat[i, 1])
        Y1.append(datingDataMat[i, 2])
    elif int(datingLabels[i]) == 2:
        X2.append(datingDataMat[i, 1])
        Y2.append(datingDataMat[i, 2])
    else:
        X3.append(datingDataMat[i, 1])
        Y3.append(datingDataMat[i, 2])
# Draw one scatter series per class from columns 1 and 2 of datingDataMat;
# c='color' sets the point color.
type1 = ax.scatter(X1, Y1, c='red')
type2 = ax.scatter(X2, Y2, c='green')
type3 = ax.scatter(X3, Y3, c='blue')
ax.axis([-2, 20, -0.2, 1.75])
ax.legend([type1, type2, type3],
          ["Did Not Like", "Liked in Small Doses", "Liked in Large Doses"],
          loc=2)
plt.xlabel('Percentage of Time Spent Playing Video Games')
plt.ylabel('Liters of Ice Cream Consumed Per Week')
plt.show()
Output: the same scatter plot, now drawn with one color per class and a legend naming the three classes.
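The trade-off between the two methods: Method 1 encodes the class in a single scatter call by scaling marker size and color with the label, which is compact but cannot attach class names to a legend; Method 2 splits the samples into three separate series, which takes more code but gives each class its own color and a readable legend.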