k-Nearest Neighbors (kNN)
Source: Internet · Editor: 程序博客网 · Date: 2024/06/16 10:51
Definition of kNN:
Given a new sample x, find the k training points closest to x and take a majority vote among their class labels; the winning class is assigned to x as its predicted class y.
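The definition above can be worked through on a tiny made-up example (four training points, Euclidean distance, k = 3); this is only an illustrative sketch, not the article's implementation:

```python
import numpy as np

# hypothetical toy data: four training points, two classes
train = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
x = np.array([0.1, 0.1])                # new sample

# Euclidean distance from x to every training point
dist = np.sqrt(((train - x) ** 2).sum(axis=1))
nearest3 = dist.argsort()[:3]           # indices of the 3 closest points
votes = [labels[i] for i in nearest3]   # classes of those neighbors
# majority vote: 'B' appears twice among the 3 neighbors, so x -> 'B'
print(max(set(votes), key=votes.count))
```

Here the two 'B' points sit much closer to x than the 'A' points, so the vote is 2-to-1 for 'B'.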
Advantages: high accuracy, insensitive to outliers, no assumptions about the input data.
Disadvantages: high computational complexity, high space complexity.
Applicable data: numeric and nominal values.
Usually k is an integer no greater than 20; the k training samples most similar to the new point are selected.
Decreasing k makes the overall model more complex and more prone to overfitting.
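The overfitting effect of a small k can be seen on a toy dataset with one deliberately mislabeled point (all data here is made up for illustration):

```python
import numpy as np

def knn(x, X, y, k):
    """Minimal kNN vote: Euclidean distance, majority class of k neighbors."""
    idx = np.sqrt(((X - x) ** 2).sum(axis=1)).argsort()[:k]
    vals, counts = np.unique(y[idx], return_counts=True)
    return vals[counts.argmax()]

# class-0 cluster near the origin, class-1 cluster near (1, 1),
# plus one noisy class-1 label planted inside the class-0 cluster
X = np.array([[0, 0], [0.1, 0], [0, 0.1], [0.05, 0.05], [1, 1], [1.1, 1]])
y = np.array([0, 0, 0, 1, 1, 1])      # [0.05, 0.05] is the noisy point

q = np.array([0.06, 0.04])            # query right next to the noisy point
print(knn(q, X, y, 1))  # 1 : k=1 just copies the noisy neighbor (overfit)
print(knn(q, X, y, 3))  # 0 : k=3 lets the surrounding cluster outvote it
```

With k = 1 the decision boundary bends around every single training point, noise included; a larger k averages over a neighborhood and smooths it out.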
Pseudocode
1. Compute the distance between every point in the labeled dataset and the current point.
2. Sort the points by increasing distance.
3. Take the k points closest to the current point.
4. Count how often each class appears among those k points.
5. Return the most frequent class among those k points as the prediction for the current point.
Code
# coding=utf-8
# __author__ = Eshter Yuu
import numpy as np
from os import listdir
import operator   # operator.itemgetter is used below to sort by vote count
import matplotlib.pyplot as plt


def createDataSet():
    group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels


# k-nearest-neighbor classifier
def classify0(inX, dataSet, labels, k):
    dataSetSize = np.shape(dataSet)[0]
    # np.tile replicates inX, like MATLAB's repmat
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteLabel = labels[sortedDistIndicies[i]]
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]


# parse a text file of tab-separated records into a NumPy matrix
def file2matrix(filename):
    fr = open(filename)
    arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)
    returnMat = np.zeros((numberOfLines, 3))
    classLabelsVector = []
    index = 0
    for line in arrayOLines:
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index, :] = listFromLine[0:3]
        classLabelsVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelsVector


# min-max normalization of feature values
def autoNorm(dataSet):
    minVals = dataSet.min(0)   # column-wise minimum; min(1) would be row-wise
    maxVals = dataSet.max(0)   # column-wise maximum
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    normDataSet = dataSet - np.tile(minVals, (m, 1))
    normDataSet = normDataSet / np.tile(ranges, (m, 1))
    return normDataSet, ranges, minVals


def datingClassTest():
    hoRatio = 0.10
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %d, the real answer is: %d"
              % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))


# dating-site example: prompt the user for three features and classify
# (the original used Python 2's raw_input; input() is its Python 3 equivalent)
def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    percentTats = float(input("percentage of time spent playing video games?"))
    ffMiles = float(input("frequent flier miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = np.array([ffMiles, percentTats, iceCream])
    classifierResult = classify0((inArr - minVals) / ranges,
                                 normMat, datingLabels, 3)
    print("you will probably like this person:",
          resultList[classifierResult - 1])


# handwriting recognition: read a 32x32 digit image file into a 1x1024 vector
def img2vector(filename):
    returnVect = np.zeros((1, 1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVect[0, 32 * i + j] = int(lineStr[j])
    return returnVect


'''recognize handwritten digits with kNN'''
def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir('trainingDigits')
    m = len(trainingFileList)
    trainingMat = np.zeros((m, 1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        hwLabels.append(classNumStr)
        trainingMat[i, :] = img2vector('trainingDigits/%s' % fileNameStr)
    testFileList = listdir('testDigits')
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print("the classifier came back with: %d, the real answer is: %d"
              % (classifierResult, classNumStr))
        if classifierResult != classNumStr:
            errorCount += 1.0
    print("\nthe total number of errors is: %d" % errorCount)
    print("\nthe total error rate is: %f" % (errorCount / float(mTest)))


group, labels = createDataSet()
print(group, '\n')
print(labels)
# b = classify0([0, 0], group, labels, 3)
# print('predicted class:', b)

# datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
# fig = plt.figure()
# ax = fig.add_subplot(111)
# ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2],
#            15.0 * np.array(datingLabels), 15.0 * np.array(datingLabels))
# plt.show()

# normMat, ranges, minVals = autoNorm(datingDataMat)
# datingClassTest()

'''handwriting recognition'''
# testVector = img2vector('testDigits/1_13.txt')
# print(testVector[0, 0:31])
# handwritingClassTest()
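The min-max normalization step in the code above matters because Euclidean distance is dominated by whichever feature has the largest numeric range. A small standalone sketch with invented numbers (mimicking the frequent-flier-miles vs. ice-cream scales of the dating data):

```python
import numpy as np

# made-up rows: column 0 spans tens of thousands, column 1 is around 1,
# so without scaling column 0 would decide every distance by itself
data = np.array([[40000.0, 0.8],
                 [ 8000.0, 1.2],
                 [65000.0, 0.5]])
minVals = data.min(0)                  # column-wise minimum
ranges = data.max(0) - minVals         # column-wise range
norm = (data - minVals) / ranges       # every column now lies in [0, 1]
print(norm)
```

After this transform both features contribute comparably to the distance, which is exactly what autoNorm does with np.tile (plain NumPy broadcasting, as used here, gives the same result).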