Machine Learning in Action — k-Nearest Neighbors [2: Improving the Matching Results of a Dating Site]


Machine Learning in Action study-notes series
Machine Learning in Action — k-Nearest Neighbors [1: Parsing Data from a Text File and Visualizing It]
Machine Learning in Action — k-Nearest Neighbors [2: Improving the Matching Results of a Dating Site]

  1. First, prepare the data: normalize the feature values

The following formula converts a feature value from an arbitrary range into a value in the 0~1 interval:
NewValue = (oldValue - min) / (max - min)
where min and max are the smallest and largest values of that feature in the dataset. For example, with min = 0 and max = 91273 (the range of the first feature shown in the output below), a raw value of about 40920 normalizes to 40920/91273 ≈ 0.448.
The normalization code is:

from numpy import *   # kNN.py imports NumPy at module level (zeros, shape, tile, ...)

def autoNorm(dataSet):
    minVals = dataSet.min(0)                 # column-wise minima (1x3)
    maxVals = dataSet.max(0)                 # column-wise maxima (1x3)
    ranges = maxVals - minVals
    normDataSet = zeros(shape(dataSet))
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))   # element-wise divide
    return normDataSet, ranges, minVals

In autoNorm(), we place the minimum of each column in the variable minVals and the maximum in maxVals; the argument 0 in dataSet.min(0) makes the function take the minimum over each column rather than over the current row. The function then computes the possible range of values and creates the new matrix to return. As the formula above states, to normalize a feature value we subtract the minimum from the current value and then divide by the range. Note that the feature matrix has 1000*3 values, while minVals and ranges are both 1*3. To resolve this, we use NumPy's tile() function to replicate them into matrices the same size as the input matrix. Also note that this is an element-wise division of feature values; in some numerical packages "/" may mean matrix division, but in NumPy, matrix division is done with the function linalg.solve(matA, matB).
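To make the tile() replication concrete, here is a minimal sketch (using a small made-up 3*3 array rather than the dating data) showing that tile(minVals, (m, 1)) stacks the 1*3 row vector m times, so that the subtraction and the division are carried out element by element, exactly as autoNorm() does:

# Minimal sketch with a made-up 3x3 array (not the dating data).
from numpy import array, tile

dataSet = array([[10.0, 2.0, 0.5],
                 [20.0, 4.0, 1.5],
                 [30.0, 8.0, 2.5]])
minVals = dataSet.min(0)          # column-wise minima: [10.  2.  0.5]
maxVals = dataSet.max(0)          # column-wise maxima: [30.  8.  2.5]
ranges = maxVals - minVals        # [20.  6.  2.]
m = dataSet.shape[0]
normDataSet = (dataSet - tile(minVals, (m, 1))) / tile(ranges, (m, 1))
print(normDataSet)                # every column now lies in [0, 1]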

Run the normalization function:

>>> imp.reload(kNN)
<module 'kNN' from 'D:\\Program Files\\Python35\\machinelearninginaction\\Ch02\\kNN.py'>
>>> normMat, ranges, minVals = kNN.autoNorm(datingDataMat)
>>> normMat
array([[ 0.44832535,  0.39805139,  0.56233353],
       [ 0.15873259,  0.34195467,  0.98724416],
       [ 0.28542943,  0.06892523,  0.47449629],
       ...,
       [ 0.29115949,  0.50910294,  0.51079493],
       [ 0.52711097,  0.43665451,  0.4290048 ],
       [ 0.47940793,  0.3768091 ,  0.78571804]])
>>> ranges
array([  9.12730000e+04,   2.09193490e+01,   1.69436100e+00])
>>> minVals
array([ 0.      ,  0.      ,  0.001156])

  2. Test and validate the classifier

Create the function datingClassTest() to test the classifier on the dating-site data.

import operator   # used by classify0 to sort the vote counts; imported at the top of kNN.py

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet   # difference to every training point
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5                      # Euclidean distances
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):                                # vote with the k nearest neighbors
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def datingClassTest():
    hoRatio = 0.50      # hold out 50% of the data as the test set
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')   # load data set from file
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m*hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :], datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i]))
        if (classifierResult != datingLabels[i]): errorCount += 1.0
    print("the total error rate is: %f" % (errorCount/float(numTestVecs)))
    print(errorCount)

Note that in the original book the code reads
sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True)
where iteritems() is Python 2.x syntax; calling it directly under Python 3.x raises

AttributeError: 'dict' object has no attribute 'iteritems'

Simply change iteritems to items, i.e.:
sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
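
As a quick check of what the fixed line returns under Python 3, here is a tiny example with a made-up vote dictionary (the counts are hypothetical): items() yields (label, count) pairs, and sorting them by the count in reverse order puts the majority label first.

import operator

classCount = {1: 1, 2: 4, 3: 2}   # hypothetical vote counts for k = 7
sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
print(sortedClassCount)           # [(2, 4), (3, 2), (1, 1)]
print(sortedClassCount[0][0])     # 2 -- the predicted label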

Reload kNN.py and the function can be called:

>>> imp.reload(kNN)
<module 'kNN' from 'D:\\Program Files\\Python35\\machinelearninginaction\\Ch02\\kNN.py'>
>>> kNN.datingClassTest()
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the total error rate is: 0.064000

The classifier's error rate on the dating dataset is 6.4%.

We can change the values of the variables hoRatio and k inside datingClassTest() to see how the error rate varies as these values change; a rough sketch for sweeping both settings follows below. Depending on the classification algorithm, the dataset, and the program settings, the classifier's output can differ considerably.
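As a rough sketch of such an experiment (assuming kNN.py with file2matrix, autoNorm and classify0 is importable and datingTestSet2.txt is in the working directory; the helper name datingClassTestGrid is made up here), the loop below re-runs the hold-out test for several combinations of hoRatio and k and prints each error rate:

# Hypothetical helper: sweep hoRatio and k and report the error rate of each run.
import kNN

def datingClassTestGrid(hoRatios=(0.10, 0.20, 0.50), ks=(3, 5, 7)):
    datingDataMat, datingLabels = kNN.file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = kNN.autoNorm(datingDataMat)
    m = normMat.shape[0]
    for hoRatio in hoRatios:
        numTestVecs = int(m * hoRatio)            # first numTestVecs rows are the test set
        for k in ks:
            errorCount = 0.0
            for i in range(numTestVecs):
                result = kNN.classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                       datingLabels[numTestVecs:m], k)
                if result != datingLabels[i]:
                    errorCount += 1.0
            print("hoRatio=%.2f, k=%d, error rate: %f"
                  % (hoRatio, k, errorCount / float(numTestVecs)))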
