Machine Learning in Action (《机器学习实战》) Study Notes 1


K-Nearest Neighbors (KNN)

1. KNN is one of the most basic classification algorithms: it classifies a sample by measuring distances to known examples.
The algorithm works from a labeled training set. For a new data point, it computes the distance between the point's features and those of every sample in the training set, then reads off the class labels of the most similar samples. Why "K"-nearest neighbors? Because the algorithm takes the K samples with the smallest distances and assigns the new point whichever label occurs most often among them. The distance used here is the plain Euclidean distance.
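The distance-and-vote procedure described above can be sketched in a few lines of NumPy. This is a minimal standalone illustration (the function name `knn_predict` and the toy query points are my own, not from the book); the book's fuller implementation follows in the next section.

```python
import numpy as np

def knn_predict(x, X_train, y_train, k=3):
    """Classify x by majority vote among its k nearest training samples."""
    # Euclidean distance from x to every row of X_train
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # indices of the k smallest distances
    nearest = dists.argsort()[:k]
    # majority vote over the neighbours' labels
    votes = {}
    for i in nearest:
        votes[y_train[i]] = votes.get(y_train[i], 0) + 1
    return max(votes, key=votes.get)

# the same toy dataset the book builds in createDataset()
X = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
y = ['A', 'A', 'B', 'B']
print(knn_predict(np.array([0.9, 1.0]), X, y, k=3))  # → A
```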

2. The code from the book

```python
from numpy import *
from os import listdir
import operator

def createDataset():
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

def classify0(inx, dataset, labels, k):
    # Euclidean distance from inx to every row of dataset
    dataSetSize = dataset.shape[0]
    diffMat = tile(inx, (dataSetSize, 1)) - dataset
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    sortedDistIndicies = distances.argsort()
    # vote among the k nearest neighbours
    classCount = {}
    for i in range(k):
        voteLabel = labels[sortedDistIndicies[i]]
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def file2matrix(filename):
    # parse a file of tab-separated lines: three features, then the label
    fr = open(filename)
    arrayOfLines = fr.readlines()
    numberOfLines = len(arrayOfLines)
    returnMat = zeros((numberOfLines, 3))
    classLabelVector = []
    index = 0
    for line in arrayOfLines:
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector

def autoNorm(dataSet):
    # scale every feature column to [0, 1]: (x - min) / (max - min)
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))
    return normDataSet, ranges, minVals

def datingClassTest():
    hoRatio = 0.1  # hold out 10% of the data for testing
    datingDataMat, datingLabels = file2matrix('datingTestSet.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(hoRatio * m)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 3)
        print("test:%d, real:%d" % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("ratio: %f" % (errorCount / float(numTestVecs)))

def img2Vector(filename):
    # flatten a 32x32 text image of 0/1 characters into a 1x1024 vector
    returnVector = zeros((1, 1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVector[0, 32 * i + j] = int(lineStr[j])
    return returnVector

def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir('trainingDigits')
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]
        # the label is encoded in the file name, e.g. "3_45.txt" -> digit 3
        classNumStr = int(fileStr.split('_')[0])
        hwLabels.append(classNumStr)
        trainingMat[i, :] = img2Vector('trainingDigits/%s' % fileNameStr)
    testFileList = listdir('testDigits')
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2Vector('testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 50)
        print("test = %d, real = %d" % (classifierResult, classNumStr))
        if classifierResult != classNumStr:
            errorCount += 1.0
    print("\ntotalerror: %d" % errorCount)
    print("\nratio : %f" % (errorCount / float(mTest)))
```
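`autoNorm` in the listing above is plain per-column min-max scaling, which keeps one large-valued feature (such as frequent-flyer miles in the dating example) from dominating the Euclidean distance. Its effect is easy to verify on a tiny matrix. This standalone sketch uses the same formula but my own toy data, and lets NumPy broadcasting do the work of the book's `tile()` calls:

```python
import numpy as np

data = np.array([[10.0, 0.5],
                 [20.0, 1.0],
                 [30.0, 1.5]])
min_vals = data.min(axis=0)            # per-column minima: [10.0, 0.5]
ranges = data.max(axis=0) - min_vals   # per-column ranges: [20.0, 1.0]
# broadcasting subtracts/divides column-wise, replacing tile(minVals, (m, 1))
norm = (data - min_vals) / ranges
print(norm)  # every column now spans exactly [0, 1]
```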

PS: Since switching to a Mac, setting up all these environments has felt much simpler; on Windows it would probably have taken a long time.
