K-Nearest Neighbors (kNN) Algorithm
Source: Internet · Editor: 程序博客网 · Date: 2024/05/21 11:16
Characteristics of kNN
Pros: high accuracy, insensitive to outliers, no assumptions about the input data.
Cons: high computational complexity, high space complexity.
Applicable data types: numeric and nominal values.
kNN Pseudocode
Input: training set
D = {(x1, y1), (x2, y2), ..., (xm, ym)};
number of nearest neighbors k.
Output: the predicted class label for each query point. For each point in the dataset whose class is unknown, perform the following steps:
(1) compute the distance between the query point and every point in the labeled dataset;
(2) sort the distances in ascending order;
(3) select the k points closest to the query point;
(4) count how often each class occurs among these k points;
(5) return the most frequent class among the k points as the prediction for the query point.
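The five steps above can be sketched on a toy dataset (the function and variable names below are illustrative, not from the source):

```python
import numpy as np
from collections import Counter

def knn_predict(query, X, y, k=3):
    """Predict the class of `query` following the five steps above."""
    dists = np.sqrt(((X - query) ** 2).sum(axis=1))  # (1) distances to all points
    order = dists.argsort()                          # (2) sort ascending
    nearest = y[order[:k]]                           # (3) labels of k nearest points
    counts = Counter(nearest)                        # (4) class frequencies
    return counts.most_common(1)[0][0]               # (5) most frequent class

X = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
y = np.array(['A', 'A', 'B', 'B'])
print(knn_predict(np.array([0.1, 0.2]), X, y, k=3))  # -> B
```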
Tasks
- Use kNN to classify candidate matches recommended by a dating website into appropriate categories.
- Use kNN to build a simple handwritten-digit recognition system.
Datasets
Task 1 dataset
Stored in datingTestSet2.txt, one sample per line, 1000 lines in total. Each sample has the following 3 features: frequent flyer miles earned per year, percentage of time spent playing video games, and liters of ice cream consumed per year.
Task 2 dataset
The training set is stored in trainingDigits and contains about 2000 examples; the test set is stored in testDigits and contains about 900 examples.
Python implementation of kNN
Task 1
kNN.classifyPerson
```python
def classifyPerson():
    """Classify a person based on three interactively entered features."""
    resultList = ['not at all', 'in small doses', 'in large doses']
    percentTats = float(input("percentage of time spent playing video games?"))
    ffMiles = float(input("frequent flyer miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    # normalize the input with the same min/range used for the training data
    inArr = array([ffMiles, percentTats, iceCream])
    classifierResult = classify0((inArr - minVals) / ranges, normMat, datingLabels, 3)
    print("You will probably like this person:", resultList[classifierResult - 1])
```
kNN.file2matrix
```python
# input:  dataSet file
# output: dataSet matrix, class label vector
def file2matrix(filename):
    fr = open(filename)
    numberOfLines = len(fr.readlines())     # get the number of lines in the file
    returnMat = zeros((numberOfLines, 3))   # prepare matrix to return
    classLabelVector = []                   # prepare labels to return
    fr = open(filename)                     # reopen to read from the beginning
    index = 0
    for line in fr.readlines():
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector
```
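file2matrix can be checked against a small synthetic tab-separated file; the helper below is a self-contained equivalent (the temp file and sample values are illustrative, not the real dataset):

```python
import os
import tempfile
import numpy as np

def parse_dating_file(filename):
    """Self-contained equivalent of file2matrix: 3 tab-separated
    features per line, followed by an integer class label."""
    features, labels = [], []
    with open(filename) as f:
        for line in f:
            parts = line.strip().split('\t')
            features.append([float(x) for x in parts[0:3]])
            labels.append(int(parts[-1]))
    return np.array(features), labels

# write two sample lines in the dataset's format
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write('40920\t8.3\t0.95\t3\n')
    f.write('14488\t7.1\t1.67\t2\n')
    path = f.name

mat, labels = parse_dating_file(path)
print(mat.shape)   # (2, 3)
print(labels)      # [3, 2]
os.remove(path)
```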
kNN.autoNorm
Min-Max Normalization:
x* = (x − min) / (max − min)
Purpose: remove the effect of differently scaled features on the classification result, so that no single feature dominates the distance computation.
```python
# input:  dataSet
# output: normalized dataSet, per-column ranges, per-column minimums
# algorithm: Min-Max Normalization
def autoNorm(dataSet):
    minVals = dataSet.min(0)    # column-wise minimum
    maxVals = dataSet.max(0)    # column-wise maximum
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))   # element-wise divide
    return normDataSet, ranges, minVals
```
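As a quick sanity check, min-max scaling maps each column's minimum to 0 and its maximum to 1; a self-contained version of the same computation (the function name is illustrative):

```python
import numpy as np

def min_max_normalize(data):
    """Column-wise min-max normalization, equivalent to autoNorm."""
    min_vals = data.min(axis=0)
    ranges = data.max(axis=0) - min_vals
    return (data - min_vals) / ranges, ranges, min_vals

data = np.array([[10.0, 0.5],
                 [20.0, 1.0],
                 [30.0, 1.5]])
norm, ranges, min_vals = min_max_normalize(data)
print(norm)       # each column now spans [0, 1]
print(ranges)     # [20.  1.]
print(min_vals)   # [10.  0.5]
```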
kNN.classify0
```python
# input:
#   inX:     input vector to classify
#   dataSet: full matrix of training examples
#   labels:  vector of labels of the training examples
#   k:       number of nearest neighbors to use in the voting
# output: predicted label for inX
# algorithm: kNN
def classify0(inX, dataSet, labels, k):
    # compute the distance between inX and every element in dataSet
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    # find the k nearest neighbors and the most frequent label among them
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
```
Task 1 results
Task 2
kNN.handwritingClassTest
```python
def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir('trainingDigits')   # load the training set
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    # build the training matrix and the corresponding labels
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]        # take off .txt
        classNumStr = int(fileStr.split('_')[0])   # the digit is encoded in the file name
        hwLabels.append(classNumStr)
        trainingMat[i, :] = img2vector('trainingDigits/%s' % fileNameStr)
    testFileList = listdir('testDigits')           # iterate through the test set
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]        # take off .txt
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print("the classifier came back with: %d, the real answer is: %d"
              % (classifierResult, classNumStr))
        if classifierResult != classNumStr:
            errorCount += 1.0
    print("\nthe total number of errors is: %d" % errorCount)
    print("\nthe total error rate is: %f" % (errorCount / float(mTest)))
```
kNN.img2vector
```python
# input:  image filename (a 32x32 text image of 0s and 1s)
# output: 1x1024 vector of the image
def img2vector(filename):
    returnVect = zeros((1, 1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVect[0, 32 * i + j] = int(lineStr[j])
    return returnVect
```
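img2vector can be exercised without the real dataset by writing a synthetic 32x32 text image first; the version below is a self-contained equivalent, and the temp-file setup is illustrative:

```python
import os
import tempfile
import numpy as np

def img2vector(filename):
    """Flatten a 32x32 text image of 0s and 1s into a 1x1024 vector."""
    vect = np.zeros((1, 1024))
    with open(filename) as f:
        for i in range(32):
            line = f.readline()
            for j in range(32):
                vect[0, 32 * i + j] = int(line[j])
    return vect

# build a synthetic image: first row all 1s, remaining 31 rows all 0s
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write('1' * 32 + '\n')
    for _ in range(31):
        f.write('0' * 32 + '\n')
    path = f.name

vec = img2vector(path)
print(vec.shape)        # (1, 1024)
print(int(vec.sum()))   # 32  (only the first row is set)
os.remove(path)
```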
kNN.classify0 is the same as in Task 1.
Task 2 results
Remaining issues and improvements
Scanning every training sample to find the k nearest neighbors is computationally expensive, since each query requires a distance computation against the full training set. A common improvement is to index the training set with a kd-tree.
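A kd-tree recursively splits the training set along alternating axes, which lets a query skip subtrees that cannot contain a closer point than the current k-th best. A minimal sketch, not from the source (function names and the node layout are illustrative):

```python
import heapq

def build_kdtree(points, depth=0):
    """Build a kd-tree from ((features...), label) pairs,
    splitting on one axis per level in turn."""
    if not points:
        return None
    axis = depth % len(points[0][0])
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2
    return {'point': points[mid],
            'left': build_kdtree(points[:mid], depth + 1),
            'right': build_kdtree(points[mid + 1:], depth + 1),
            'axis': axis}

def knn_search(node, query, k, heap=None):
    """Collect the k nearest neighbors, pruning subtrees that cannot
    contain a closer point than the current k-th best."""
    if heap is None:
        heap = []                      # max-heap of size k via negated distances
    if node is None:
        return heap
    features, label = node['point']
    dist = sum((f - q) ** 2 for f, q in zip(features, query)) ** 0.5
    heapq.heappush(heap, (-dist, label))
    if len(heap) > k:
        heapq.heappop(heap)            # drop the current farthest
    axis = node['axis']
    diff = query[axis] - features[axis]
    near, far = (node['left'], node['right']) if diff < 0 else (node['right'], node['left'])
    knn_search(near, query, k, heap)
    # visit the far side only if the splitting plane is closer than the k-th best
    if len(heap) < k or abs(diff) < -heap[0][0]:
        knn_search(far, query, k, heap)
    return heap

data = [((1.0, 1.1), 'A'), ((1.0, 1.0), 'A'), ((0.0, 0.0), 'B'), ((0.0, 0.1), 'B')]
tree = build_kdtree(data)
neighbors = [lab for _, lab in knn_search(tree, (0.1, 0.2), k=3)]
# majority vote over the 3 nearest labels
print(max(set(neighbors), key=neighbors.count))  # -> B
```

For production use, library implementations such as scipy.spatial.cKDTree are the usual choice; the sketch only illustrates the pruning idea.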