Getting Started with Data Mining on Kaggle: the KNN Algorithm -- Digit Recognizer

Source: Internet  Published by: 军工复合体知乎  Editor: 程序博客网  Date: 2024/06/13 05:48

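I am still a beginner in data mining myself, so what follows is only my own practice experience and opinions. If anything here is wrong, I would be glad to discuss it.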

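Before starting, here is what you should prepare or already know:

1. You have learned some Python. You do not need to be an expert, but you should at least know Python's basic data structures and its overall framework.

2. You understand the basic principle of the KNN (k-nearest neighbors) algorithm. I will not go into it here; a quick web search turns up plenty of explanations, and I doubt I could explain it better than they do.

3. You have a conceptual understanding of data mining, or at least know what data mining is meant to accomplish.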

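Something concrete to start with: this post has a fairly detailed, easy-to-follow simple Python implementation of KNN: http://blog.csdn.net/zouxy09/article/details/16955347

Now let us begin the exercise.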

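1. Getting the resources

First, of course, go to Kaggle to get the resources, which means registering an account. A note on registering: if you can get past the firewall, there is no problem. If you cannot, register a Yahoo mailbox first and sign up on Kaggle with it; otherwise the captcha window does not appear during email verification and you cannot complete it.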

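After registering and logging in, scroll all the way down. The projects at the top are paid competitions that experts take part in; the exercise we want is 'Digit Recognizer' under the '101' tag further down. Everything in that section is meant for beginners, and quite a few individuals and teams have contributed source code there that is worth studying. OK, that is the one we are looking for.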

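Once you open it, there are three main parts: Competition Details, Get the Data, and Make a submission. Competition Details is especially important and should be read carefully, because it describes the project in detail and states its specific requirements.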

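Next we need the data, which is found under Get the Data. Two files are available: train (the training set) and test (the test set), both CSV (comma-separated values) files. CSV is a common data file format, similar in role to JSON and XML. Again, read the description below the files carefully: it covers basic information about the data and the format required for the submission. I overlooked the submission format when I did this and wasted a good while as a result. Blame my English!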

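2. Starting the exercise

Open the downloaded files yourself to see what the data looks like. The train file is 42001*785, which includes the header row at the top and the label column as the first column. The test file is 28001*784, which includes the header row. Unlike train, test has no label column, and that is exactly our task: use the KNN algorithm to compare the test data against the train data and predict the label for every row of test.

OK, time to write code. I am using Python 2.7 on Windows. One thing I forgot to mention: the numpy library must be installed beforehand. Download the build that matches your Python version; on Windows the download is an .exe installer, so just click Next until the installation finishes.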

'''
Data-mining practice on Kaggle: the Digit Recognizer task.
The training file is train.csv and the test file is test.csv.
Both are read from the directory E:\\kaggle hard-coded below.
'''
from numpy import *
import operator
import csv, logging, time
import logging.config


def loadTrainData():
    '''Load the training data from train.csv and split it into labels and pixel data.'''
    I = []
    with open('E:\\kaggle\\train.csv', 'rb') as file:
        lines = csv.reader(file)
        for line in lines:
            I.append(line)      # 42001*785
    I.remove(I[0])              # Remove the header row
    I = array(I)                # Turn the list of rows into an array
    label = I[:, 0]             # Label column, 42000 entries
    data = I[:, 1:]             # 42000*784 pixel block
    # toInt returns the labels as a 1*42000 array
    return normalizing(toInt(data)), toInt(label)


def loadTestData():
    '''Load the test data from test.csv and drop the header row.'''
    I = []
    with open('E:\\kaggle\\test.csv', 'rb') as file:
        lines = csv.reader(file)
        for line in lines:
            I.append(line)      # 28001*784
    I.remove(I[0])              # Remove the header row
    array_I = array(I)          # 28000*784
    return normalizing(toInt(array_I))


def toInt(array):
    '''Convert the elements of an array from str to int.'''
    array = mat(array)
    rows, lines = shape(array)
    newArray = zeros((rows, lines))
    for i in xrange(rows):
        for j in xrange(lines):
            newArray[i, j] = int(array[i, j])
    return newArray


def normalizing(array):
    '''Normalize the input array in place: every non-zero value becomes 1.'''
    rows, lines = shape(array)
    for i in xrange(rows):
        for j in xrange(lines):
            if array[i, j] != 0:
                array[i, j] = 1
    return array

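1. Initializing the data

loadTrainData and loadTestData load the training set and the test set respectively, using Python's csv module. I will not describe the csv module in detail, only the parts used here. On Python 2.7, which is what I use, the mode argument when opening the file must be 'rb' or 'wb'; otherwise you get bizarre errors. I forgot the b at first and lost a lot of time over it.

The csv reader function: reader(iterable [, dialect='excel'][, optional keyword args]). It returns an iterator; each step yields one row of the input as a list of strings.

Parameters: iterable: the object being read must be an iterable that returns one line at a time, such as a file object or a list;

dialect: the formatting style, 'excel' by default, which means comma-separated; the csv module also lets you define your own dialects;

optional keyword args: formatting parameters; any parameter given here overrides the corresponding setting in the dialect.

The return statements perform a type conversion, because everything read from the file is a string and the later computation needs numbers. The data block is also normalized to simplify the computation: every non-zero value is set to 1.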
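As a small illustration of these parameters, here is a sketch that feeds csv.reader a plain list of strings and overrides the delimiter with a keyword argument. It is written for Python 3, unlike the Python 2.7 code in this post, and the sample rows are made up.

```python
import csv

# Any iterable that yields one line per step works; a list of strings is enough.
sample_lines = ["label,pixel0,pixel1", "1,0,255", "0,12,0"]  # made-up data

rows = list(csv.reader(sample_lines))   # default dialect='excel' (comma-separated)
header, records = rows[0], rows[1:]

# A keyword argument overrides the corresponding dialect setting.
semicolon_rows = list(csv.reader(["a;b;c"], delimiter=";"))

print(header)
print(records)
print(semicolon_rows)
```

Note that every cell comes back as a string, which is why the loading functions convert the values afterwards.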
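The conversion and normalization steps can be sketched on a tiny example. This is a minimal Python 3 illustration using plain lists rather than numpy; the helper names to_int and binarize and the sample values are hypothetical, not part of the original code.

```python
def to_int(rows):
    """Convert every string cell to an int."""
    return [[int(cell) for cell in row] for row in rows]


def binarize(rows):
    """Map every non-zero value to 1 and keep zeros as 0."""
    return [[1 if value != 0 else 0 for value in row] for row in rows]


raw = [["0", "37", "255"], ["12", "0", "0"]]   # made-up pixel strings
pixels = binarize(to_int(raw))
print(pixels)   # [[0, 1, 1], [1, 0, 0]]
```

With numpy the same result can probably be had without explicit loops, for example with something like (data.astype(int) != 0).astype(int), which should also be much faster on 42000*784 values.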

2. Processing the data

def classify(inX, dataSet, labels, k):
    '''Classify one sample with the K-NN algorithm.'''
    inX = mat(inX)
    dataSet = mat(dataSet)
    labels = mat(labels)
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet   # Difference between the test row and every training row
    sqDiffMat = array(diffMat) ** 2                   # Square each difference
    sqDistance = sqDiffMat.sum(axis=1)                # Sum along each row
    distance = sqDistance ** 0.5
    sortedDistIndecies = distance.argsort()
    classCount = {}
    for i in xrange(k):
        votellabel = labels[0, sortedDistIndecies[i]]
        classCount[votellabel] = classCount.get(votellabel, 0) + 1
    sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

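The tile function repeats the sample to be predicted, inX, until it has the same shape as the training data, so that the distance between the sample and every row of the training set can be computed. Squaring each entry of diffMat, summing along the row, and taking the square root gives the Euclidean distance. Squaring removes the sign of each difference, and because the square root is monotonic, it does not change the order of the distances, so it has no effect on the comparison that follows.

The argsort method sorts the distances and returns the index of each value in ascending order. k is the parameter of the k-nearest-neighbors algorithm, usually chosen between 1 and 20: the k nearest neighbors of the sample are taken into account. The for loop counts the labels of those k neighbors, and the most frequent label is returned as the prediction.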
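To make the point about ordering concrete, here is a small Python 3 sketch with plain lists instead of numpy. It ranks a few made-up binarized rows by squared distance and by Euclidean distance and shows that both give the same order. The function names and the data are illustrative assumptions.

```python
import math


def squared_distance(a, b):
    """Sum of squared element-wise differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Square root of the squared distance."""
    return math.sqrt(squared_distance(a, b))


train = [[0, 0, 0, 0], [1, 0, 1, 1], [1, 0, 1, 0]]   # made-up binarized rows
query = [1, 0, 1, 0]

# Indices of the training rows, nearest first, under each measure.
by_squared = sorted(range(len(train)), key=lambda i: squared_distance(query, train[i]))
by_euclid = sorted(range(len(train)), key=lambda i: euclidean_distance(query, train[i]))

print(by_squared)   # [2, 1, 0]
print(by_euclid)    # [2, 1, 0]
```

Since the ranking is identical, the square root could be skipped when only the nearest neighbors are needed.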
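The sort-then-vote step can be sketched in a few lines of Python 3. This version uses plain lists and collections.Counter instead of numpy and a hand-built dictionary; the helper names argsort and vote and the sample numbers are hypothetical.

```python
from collections import Counter


def argsort(values):
    """Return the indices that would sort values in ascending order."""
    return sorted(range(len(values)), key=lambda i: values[i])


def vote(distances, labels, k):
    """Return the most common label among the k nearest neighbors."""
    nearest = argsort(distances)[:k]
    counts = Counter(labels[i] for i in nearest)
    return counts.most_common(1)[0][0]


distances = [2.0, 0.5, 1.0, 3.0, 0.7]   # made-up distances to five training rows
labels = [7, 3, 3, 7, 9]                # labels of those training rows

print(argsort(distances))           # [1, 4, 2, 0, 3]
print(vote(distances, labels, 3))   # 3
```

With k = 3 the nearest rows are 1, 4 and 2, whose labels are 3, 9 and 3, so label 3 wins the vote.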

3. Writing the file

def saveResult(result):
    '''Write the result to the result file, one prediction per row.'''
    with open('E:\\kaggle\\result.csv', 'wb') as myFile:
        myWriter = csv.writer(myFile)
        for i in result:
            tmp = []
            tmp.append(i)
            myWriter.writerow(tmp)

Remember that the mode argument needs the b.
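The 'rb'/'wb' rule applies to Python 2. On Python 3 the csv module expects files opened in text mode with newline='' instead. The following sketch writes a few made-up predictions to a throwaway temporary file and reads them back; the path and the data are assumptions for illustration only.

```python
import csv
import os
import tempfile

predictions = [2, 0, 9]   # made-up predicted labels

path = os.path.join(tempfile.mkdtemp(), "result.csv")   # throwaway location

# Python 3: text mode plus newline='' takes the place of Python 2's 'wb'.
with open(path, "w", newline="") as out_file:
    writer = csv.writer(out_file)
    for label in predictions:
        writer.writerow([label])

# Python 3: text mode plus newline='' takes the place of Python 2's 'rb'.
with open(path, "r", newline="") as in_file:
    read_back = [row for row in csv.reader(in_file)]

print(read_back)   # [['2'], ['0'], ['9']]
```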

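4. Putting it together

This is roughly the equivalent of the main function in C: it calls all the functions written above to complete the prediction.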

def handwritingClassTest():
    '''Classify every test row with K-NN, log the progress, and save the result.'''
    start_time = time.time()
    logger = logging.getLogger('example01')
    logging.config.fileConfig('E:\\python\\log.conf')

    loadTrainDataTime_start = time.time()
    traintData, traintLabel = loadTrainData()
    loadTrainDataTime_end = time.time()
    logger.info('Train data loaded successfully! Load time: ' + str(loadTrainDataTime_end - loadTrainDataTime_start))

    loadTestDataTime_start = time.time()
    testData = loadTestData()
    loadTestDataTime_end = time.time()
    logger.info('Test data loaded successfully! Load time: ' + str(loadTestDataTime_end - loadTestDataTime_start))

    m, n = shape(testData)
    resultList = []
    logger.info('Digit distinguish start!')
    for i in xrange(m):
        classifierResult = classify(testData[i], traintData, traintLabel, 5)   # k = 5
        resultList.append(classifierResult)
        if (i + 1) % 1000 == 0:
            # Report progress every 1000 rows
            logger.info(str(i + 1) + ' rows were processed successfully.')
    saveResult(resultList)
    lost_time = (time.time() - start_time) / 60   # Elapsed time in minutes
    logger.info('The process finished successfully! Time: ' + str(lost_time))


if __name__ == '__main__':
    handwritingClassTest()

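I set k to 5 here and used the entire training set. The whole run took about 4 hours on a single machine, as far as I remember. Because it ran for so long, and I happened to be learning about logging at the time, I added logging so that I could at least see how far it had got and stop fretting. The log configuration is attached at the end.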
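The idea of reporting progress at a fixed interval can be shown on its own. Below is a minimal Python 3 sketch that uses logging.basicConfig rather than a configuration file; the logger name, the process_all helper, and the doubling step that stands in for the slow classification call are all made up for illustration.

```python
import logging

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("progress_demo")   # hypothetical logger name


def process_all(items, report_every=1000):
    """Run a stand-in for the slow step on every item, logging progress periodically."""
    results = []
    for count, item in enumerate(items, start=1):
        results.append(item * 2)   # placeholder for the expensive classify call
        if count % report_every == 0:
            logger.info("%d rows processed", count)
    return results


doubled = process_all(range(5), report_every=2)
print(doubled)   # [0, 2, 4, 6, 8]
```

Here a progress line is logged after rows 2 and 4; in the real run the interval of 1000 keeps the log short while still showing that the job is alive.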

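Submission format: the file needs a header row.

Once the submission succeeds, the ranking is given automatically after a short wait.

Log configuration: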
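A header row can be written before the predictions as in the Python 3 sketch below. The column names ImageId and Label and the 1-based row id are assumptions here, based on what the competition's sample submission is believed to look like; check the competition page for the exact format. The predictions are made up, and an in-memory buffer stands in for the real output file.

```python
import csv
import io

predictions = [2, 0, 9]   # made-up predicted labels

buffer = io.StringIO()    # stands in for a file opened with open(path, 'w', newline='')
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["ImageId", "Label"])   # header row; the column names are an assumption
for image_id, label in enumerate(predictions, start=1):
    writer.writerow([image_id, label])

submission_text = buffer.getvalue()
print(submission_text)
```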

# log.conf
####################################################
[loggers]
keys=root,example01,example02

[handlers]
keys=hand01,hand02,hand03

[formatters]
keys=form01,form02

[formatter_form01]
format=%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s
datefmt=%a, %d %b %Y %H:%M:%S

[formatter_form02]
format=%(name)-12s: %(levelname)-8s %(message)s
datefmt=%a, %d %b %Y %H:%M:%S
####################################################
[logger_root]
level=DEBUG
handlers=hand01,hand02

[logger_example01]
#level
handlers=hand01,hand02
qualname=example01
propagate=0

[logger_example02]
#level
handlers=hand01,hand03
qualname=example02
propagate=0
####################################################
[handler_hand01]
class=StreamHandler
level=INFO
formatter=form02
args=(sys.stderr,)

[handler_hand02]
class=FileHandler
level=DEBUG
formatter=form01
args=('E:\\kaggle\\DigitDistinguish.log','a')

[handler_hand03]
class=handlers.RotatingFileHandler
level=INFO
formatter=form02
args=('myapp.log','a',10*1024*1024,5)
####################################################

0 0