Machine Learning in Action, 02: k-Nearest Neighbors

Source: Internet | Published by: 大型网络部署需求分析 | Edited by: 程序博客网 | Time: 2024/06/03 20:24

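2.1 The k-Nearest Neighbors Algorithm

The k-nearest neighbors (kNN) algorithm classifies a sample by measuring the distances between its feature values and those of labeled samples.

Pros: high accuracy, insensitive to outliers, no assumptions about the input data.

Cons: computationally expensive and memory-intensive.

Works with: numeric and nominal values.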

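How it works:

We start with a sample dataset, also called the training set, in which every record carries a label; that is, we know which class each record belongs to.

When new data without a label arrives, each of its features is compared with the corresponding feature of the records in the training set, and the algorithm extracts the class labels of the most similar records (the nearest neighbors).

Only the top k most similar records are considered, which is where the k in k-nearest neighbors comes from. k is usually an integer smaller than 20.

Finally, the class that occurs most often among those k records is taken as the class of the new data.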
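The steps above can be sketched on a toy dataset. This is only a minimal illustration, not the listing used in the rest of this post: the function name `knn_predict` and the four sample points are made up for the example, and Euclidean distance is assumed as the similarity measure.

```python
from collections import Counter

import numpy as np


def knn_predict(query, samples, labels, k):
    """Return the majority label among the k samples closest to query."""
    # Euclidean distance from the query to every training sample
    distances = np.sqrt(((samples - query) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of those k neighbors
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training set: two points per class
samples = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ["A", "A", "B", "B"]

print(knn_predict(np.array([0.1, 0.2]), samples, labels, k=3))  # B
print(knn_predict(np.array([0.9, 1.2]), samples, labels, k=3))  # A
```

With k=3, the first query has two "B" neighbors and one "A" neighbor, so the vote goes to "B"; the second query is the mirror case.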

Code

# -*- coding: UTF-8 -*-
from numpy import *
import operator
import matplotlib
import matplotlib.pyplot as plt
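'''
k-nearest neighbors algorithm:
   (1) Compute the distance between every point in the labeled dataset and the current point
   (2) Sort the distances in increasing order
   (3) Take the k points closest to the current point
   (4) Count how often each class occurs among those k points
   (5) Return the most frequent class among those k points as the prediction for the current point
'''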
def classify(inX, dataSet, label, k):
   dataSetSize = dataSet.shape[0]                  # number of training samples (rows)
   diffMat = tile(inX, (dataSetSize,1)) - dataSet  # tile(A, reps) repeats A; here inX is stacked once per row
   sqDiffMat = diffMat**2                          # element-wise square
   sqDistances = sqDiffMat.sum(axis=1)             # axis=1 sums along each row; axis=0 would sum down each column
   distances = sqDistances**0.5                    # Euclidean distance to every training sample
   # argsort() returns the indices that would sort the array in ascending order
   sortedDistIndicies = distances.argsort()
   classCount={}
   for i in range(k):                              # vote among the k nearest samples
      voteIlabel = label[sortedDistIndicies[i]]
      classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1
   # sort the (label, count) pairs by count, largest first
   sortedClassCount=sorted(classCount.items(),key=operator.itemgetter(1), reverse=True)
   return sortedClassCount[0][0]

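'''
Read the data file and convert it into a feature matrix and a list of labels
'''

def file2matrix(filename):
   with open(filename) as fr:                       # close the file once it has been read
      arrayLines = fr.readlines()                   # read the whole file as a list of lines
   numberOfLines = len(arrayLines)                  # n lines
   returnMat = zeros((numberOfLines,3))             # n rows x 3 columns, initialized to 0
   classLabelVector = []
   index = 0
   for line in arrayLines:
      line = line.strip()                           # strip leading and trailing whitespace
      listFormLine = line.split('\t')               # split on tabs
      returnMat[index,:] = listFormLine[0:3]        # the first 3 fields are the features of this row
      classLabelVector.append(int(listFormLine[-1]))  # the last field is the integer class label
      index += 1
   return returnMat,classLabelVector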

'''
Normalize every feature to the range [0, 1]
'''
def autoNorm(dataSet):
   minVals = dataSet.min(0)    # minimum of each column
   maxVals = dataSet.max(0)    # maximum of each column
   ranges = maxVals - minVals
   m = dataSet.shape[0]        # number of rows
   normDataSet = dataSet - tile(minVals,(m,1))       # subtract the column minimums, repeated for all m rows
   normDataSet = normDataSet/(tile(ranges, (m, 1)))  # element-wise division by the column ranges
   return normDataSet,ranges , minVals
'''
Measure the error rate on a held-out test set
'''
def datingClassTest():
   hoRatio = 0.10              # fraction of the data held out for testing
   datingDataMat,datingLabels = file2matrix('datingTestSet2.txt')
   normMat,ranges , minVals = autoNorm(datingDataMat)  # normalize
   m = normMat.shape[0]
   numTestVecs = int(m*hoRatio)
   errorCount = 0.0
   for i in range(numTestVecs):  # the first numTestVecs rows are test samples, the rest are training data
      classifierResult = classify(normMat[i,:],normMat[numTestVecs:m,:],
                                  datingLabels[numTestVecs:m],3)
      print ("the guess :%d ,the real answer is :%d" %(classifierResult ,datingLabels[i]))
      if (classifierResult != datingLabels[i] ):
         errorCount +=1.0
   print ("the total error rate is : %f" %(errorCount/float(numTestVecs)))  # reported once, after the loop

# Classify a person from user input and print the result
def classifyPerson():
   resultList =['not at all','in small doses','in large doses']
   percentTats = float(input('percentage of time spent playing video games?'))
   ffMiles = float(input('frequent flier miles earned per year?'))
   iceCream = float(input('liters of ice cream consumed per year?'))
   datingDataMat ,datingLabels =file2matrix('datingTestSet2.txt')
   normMat, ranges, minVals = autoNorm(datingDataMat)
   inArr =array([ffMiles,percentTats,iceCream])  # pack the inputs into one feature vector
   classifierResult = classify((inArr-minVals)/ranges,normMat,datingLabels,3) # scale like the training data, then classify
   print ("you will probably like this person :",resultList[classifierResult -1 ])  # labels 1..3 map to indices 0..2
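As a rough illustration of why the features are normalized before classification: a feature with large raw values, such as flier miles, would otherwise dominate the Euclidean distance. The sketch below applies the same min-max scaling, newValue = (oldValue - min) / (max - min), to a few made-up rows. The numbers are hypothetical, and broadcasting is used here in place of `tile`.

```python
import numpy as np

# Hypothetical rows: [flier miles, % time gaming, liters of ice cream]
data = np.array([[40000.0, 8.0, 1.0],
                 [10000.0, 2.0, 0.5],
                 [70000.0, 5.0, 1.5]])

minVals = data.min(axis=0)
ranges = data.max(axis=0) - minVals
normalized = (data - minVals) / ranges  # broadcasting instead of tile

print(normalized.min(axis=0).tolist())  # [0.0, 0.0, 0.0]
print(normalized.max(axis=0).tolist())  # [1.0, 1.0, 1.0]

# A new sample must be scaled with the same minVals and ranges
new_sample = np.array([25000.0, 5.0, 1.0])
print(((new_sample - minVals) / ranges).tolist())  # [0.25, 0.5, 0.5]
```

This is why classifyPerson scales the user's input with (inArr-minVals)/ranges before passing it to classify.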

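Code walkthrough

Parameters of the classify function:

inX: the input vector to classify

dataSet: the training sample set

label: the vector of labels

k: the k in k-nearest neighbors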

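shape: an attribute of an array that describes the dimensions of a multi-dimensional array; dataSet.shape[0] is the number of rows.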

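tile(inX, (dataSetSize, 1)): expands inX into a 2-D array. dataSetSize is the number of rows in the result, and 1 is the number of times inX is repeated along the columns. The whole line then subtracts dataSet element by element from this repeated array, giving the per-feature difference between inX and every training sample in a single matrix subtraction.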
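A small sketch of what `tile` produces, using made-up values for inX and dataSet. It also shows that NumPy broadcasting yields the same differences without building the repeated array; that alternative is not part of the original listing.

```python
import numpy as np

inX = np.array([1.0, 2.0])
dataSet = np.array([[1.0, 1.0], [3.0, 5.0], [0.0, 2.0]])

# tile repeats inX once per row of dataSet, giving a (3, 2) array
tiled = np.tile(inX, (dataSet.shape[0], 1))
diff_tile = tiled - dataSet

# Broadcasting gives the same result without the explicit repetition
diff_broadcast = inX - dataSet

print(tiled.shape)                                # (3, 2)
print(diff_tile.tolist())                         # [[0.0, 1.0], [-2.0, -3.0], [1.0, 0.0]]
print(np.array_equal(diff_tile, diff_broadcast))  # True
```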

axis=1: sums the values within each row, producing one total per row; axis=0 sums the values within each column.
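To make the axis argument concrete, here is a short sketch with a made-up matrix of squared differences. Each row stands for one training sample, so summing along axis=1 gives one squared distance per sample.

```python
import numpy as np

# Hypothetical squared differences: 3 samples, 2 features
sqDiffMat = np.array([[1.0, 4.0], [9.0, 16.0], [0.0, 25.0]])

row_sums = sqDiffMat.sum(axis=1)  # one value per row (per sample)
col_sums = sqDiffMat.sum(axis=0)  # one value per column (per feature)

print(row_sums.tolist())  # [5.0, 25.0, 25.0]
print(col_sums.tolist())  # [10.0, 45.0]

# Taking the square root of the row sums gives the Euclidean distances
distances = row_sums ** 0.5
print(distances.shape)    # (3,)
```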

argsort(): returns the indices that would sort an array in non-decreasing order; the array itself is not modified.
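A quick sketch with made-up distances shows the difference between sorting the values and getting the sorting indices:

```python
import numpy as np

distances = np.array([0.9, 0.1, 0.5, 0.3])
order = distances.argsort()

print(order.tolist())             # [1, 3, 2, 0]
print(distances.tolist())         # [0.9, 0.1, 0.5, 0.3] (unchanged)
print(distances[order].tolist())  # [0.1, 0.3, 0.5, 0.9]
```

The first k entries of the index array are the positions of the k nearest samples, which is why classify uses them to look up labels.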

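classCount.get(voteIlabel, 0) + 1: a neat one-liner. get() looks up a key in a dictionary: it returns the value stored under voteIlabel, or the default 0 if there is no such key yet. Adding 1 and assigning the result back increments that label's count, so counting votes takes a single line in Python.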
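The counting idiom can be tried in isolation. The list of neighbor labels below is hypothetical. `collections.Counter` from the standard library is shown as an alternative way to get the same winner; it is not what the listing uses.

```python
from collections import Counter

neighbor_labels = [3, 1, 3, 2, 3]  # hypothetical labels of the k nearest points

classCount = {}
for voteIlabel in neighbor_labels:
    # get() returns 0 the first time a label is seen
    classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1

print(classCount)  # {3: 3, 1: 1, 2: 1}

# Same majority vote using Counter
winner = Counter(neighbor_labels).most_common(1)[0][0]
print(winner)      # 3
```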
