Machine Learning in Action Study Notes (2): Classification with the ID3 Decision Tree Algorithm (Python 3 Implementation)

Source: Internet | Published by: spss软件安装 | Editor: 程序博客网 | Time: 2024/05/16 18:27

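Overview

Decision trees are among the most widely used data mining algorithms. Their strengths are low computational cost, output that is easy to interpret, tolerance of missing intermediate values, and the ability to handle irrelevant features. Their main weakness is a tendency to overfit. The algorithm applies to numeric and nominal data, although the ID3 code in these notes splits on discrete feature values, so numeric features need to be discretized first.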

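How the Algorithm Works

1. Information gain

The first problem in building a decision tree is deciding which feature plays the decisive role when the data set is split. The guiding principle behind every splitting method is to turn disordered data into more ordered data. One way to do this is to measure information with information theory. The change in information before and after a split is called the information gain, and the feature with the largest information gain is the one to split on. Computing the information gain requires the (Shannon) entropy, which measures the information in a set and is defined as the expected value of the information. Rather than writing out the formula, these notes compute the entropy directly in Python 3.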
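For readers who want the formulas themselves, the standard textbook definitions can be written as follows. The symbols are introduced here for illustration and do not appear in the original notes: $D$ is the data set, $p_k$ is the fraction of records in $D$ that belong to class $k$, and $D_v$ is the subset of $D$ in which feature $A$ takes the value $v$.

```latex
H(D) = -\sum_{k=1}^{K} p_k \log_2 p_k
\qquad
\mathrm{Gain}(D, A) = H(D) - \sum_{v \in \mathrm{Values}(A)} \frac{|D_v|}{|D|}\, H(D_v)
```

The first expression is the quantity computed by calcShannonEnt, and the second is the quantity that chooseBestFeatureToSplit maximizes.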

"""计算香农熵输入：特征值和标签组成的矩阵输出：该数据集的香农熵"""def calcShannonEnt(dataSet):    numEntries=len(dataSet)    labelCounts={}    #统计各标签及其所占的次数    for featVec in dataSet:        currentLabel=featVec[-1]        if currentLabel not in labelCounts.keys():            labelCounts[currentLabel]=0        labelCounts[currentLabel]+=1    shannonEnt=0.0    #计算香农熵    for key in labelCounts:        prob=float(labelCounts[key])/numEntries        shannonEnt-=prob*log(prob,2)    return shannonEnt

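2. Splitting the data set

The following Python 3 code splits the data set on a given feature:

def splitDataSet(dataSet, axis, value):
    """Split the data set on a given feature.
    Input: the data set to split, the index of the feature, the feature value to match
    Output: the records that match, with the matched feature column removed
    """
    # Collect the records whose feature value matches the given one
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            # Copy the record without column `axis`
            temp = featVec[:axis]
            temp.extend(featVec[axis+1:])
            retDataSet.append(temp)
    return retDataSet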

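3. Choosing the best feature to split on

Python 3 code:

def chooseBestFeatureToSplit(dataSet):
    """Choose the best feature to split the data set on.
    Input: the data set matrix
    Output: the index of the best feature (-1 if no split gives a positive gain)
    """
    numFeatures = len(dataSet[0]) - 1
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        # All values taken by feature i
        featList = [example[i] for example in dataSet]
        # The distinct values of feature i
        uniqueVals = set(featList)
        # Entropy after the split
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            # Weight each subset's entropy by its share of the records
            newEntropy += prob * calcShannonEnt(subDataSet)
        # Information gain = old entropy - new entropy. Entropy measures how
        # disordered the data set is, so the gain is how much tidier it became.
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature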

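Implementation: building the decision tree recursively

The algorithm proceeds as follows: take the data set, split it on the best feature, and pass each resulting subset down to the next node. The recursion continues until every feature has been used for splitting or all instances under a branch share the same class. If the class labels are still not unique after all features have been used, the class of the leaf node is usually decided by majority vote. The Python 3 code for the majority vote is:

import operator

def majorityCnt(classList):
    """Majority vote.
    Input: the list of class labels that reached a leaf
    Output: the class label that occurs most often
    """
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    # Sort only after every vote is counted; Python 3 uses items(), not iteritems()
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]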

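Next we build the decision tree. Python 3 code:

def createTree(dataSet, labels):
    """Build the tree.
    Input: the data set and the list of feature labels
    Output: the tree represented as nested dictionaries
    """
    classList = [example[-1] for example in dataSet]
    # Stop when every record has the same class
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # Stop when no features are left; fall back to a majority vote
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    # Build the remaining labels as a new list so the caller's list is not modified
    subLabels = labels[:bestFeat] + labels[bestFeat+1:]
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree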

The dictionary representation of the tree can be drawn with Matplotlib. Python 3 code:

import matplotlib.pyplot as plt

# boxstyle sets the outline of the annotation box; fc sets its gray level
decisionNode = dict(boxstyle="sawtooth", fc="0.8")
leafNode = dict(boxstyle="round4", fc="0.8")
# Arrow style
arrow_args = dict(arrowstyle="<-")

def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    createPlot.ax1.annotate(nodeTxt,
                            xy=parentPt,  # starting point (the parent node)
                            xycoords='axes fraction',
                            xytext=centerPt,  # position of the annotation box
                            textcoords='axes fraction',
                            va="center",
                            ha="center",
                            bbox=nodeType,
                            arrowprops=arrow_args)

def getNumLeafs(myTree):
    # Count the leaves; this determines the width of the plot
    numLeafs = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if isinstance(secondDict[key], dict):
            numLeafs += getNumLeafs(secondDict[key])
        else:
            numLeafs += 1
    return numLeafs

def getTreeDepth(myTree):
    # Count the levels of decision nodes; this determines the height of the plot
    maxDepth = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if isinstance(secondDict[key], dict):
            thisDepth = 1 + getTreeDepth(secondDict[key])
        else:
            thisDepth = 1
        if thisDepth > maxDepth:
            maxDepth = thisDepth
    return maxDepth

def plotMidText(cntrPt, parentPt, txtString):
    # Write the feature value halfway along the edge between parent and child
    xMid = (parentPt[0] - cntrPt[0]) / 2.0 + cntrPt[0]
    yMid = (parentPt[1] - cntrPt[1]) / 2.0 + cntrPt[1]
    createPlot.ax1.text(xMid, yMid, txtString)

def plotTree(myTree, parentPt, nodeTxt):
    numLeafs = getNumLeafs(myTree)
    firstStr = list(myTree.keys())[0]
    # Center the decision node above the leaves it covers
    cntrPt = (plotTree.x0ff + (1.0 + float(numLeafs)) / 2.0 / plotTree.totalW,
              plotTree.y0ff)
    plotMidText(cntrPt, parentPt, nodeTxt)
    plotNode(firstStr, cntrPt, parentPt, decisionNode)
    secondDict = myTree[firstStr]
    # Move down one level before drawing the children
    plotTree.y0ff = plotTree.y0ff - 1.0 / plotTree.totalD
    for key in secondDict:
        if isinstance(secondDict[key], dict):
            plotTree(secondDict[key], cntrPt, str(key))
        else:
            plotTree.x0ff = plotTree.x0ff + 1.0 / plotTree.totalW
            plotNode(secondDict[key], (plotTree.x0ff, plotTree.y0ff),
                     cntrPt, leafNode)
            plotMidText((plotTree.x0ff, plotTree.y0ff), cntrPt, str(key))
    # Move back up once, after all children of this node are drawn
    plotTree.y0ff = plotTree.y0ff + 1.0 / plotTree.totalD

def createPlot(inTree):
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)
    plotTree.totalW = float(getNumLeafs(inTree))
    plotTree.totalD = float(getTreeDepth(inTree))
    plotTree.x0ff = -0.5 / plotTree.totalW
    plotTree.y0ff = 1.0
    plotTree(inTree, (0.5, 1.0), '')
    plt.show()

Testing the algorithm: running it
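One way to exercise a finished tree is to classify new records with it. The sketch below walks the nested-dictionary representation until it reaches a leaf. The function name `classify`, the hand-written tree, and the feature labels are all assumptions for illustration and are not taken from the notes.

```python
def classify(tree, feat_labels, sample):
    """Walk the nested-dict tree until a leaf (a non-dict value) is reached."""
    feature = next(iter(tree))                   # the single key at this level
    branches = tree[feature]
    value = sample[feat_labels.index(feature)]   # the sample's value for that feature
    child = branches[value]                      # raises KeyError for unseen values
    if isinstance(child, dict):
        return classify(child, feat_labels, sample)
    return child

# Hypothetical tree and feature names
tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
labels = ['no surfacing', 'flippers']

first = classify(tree, labels, [1, 0])    # follows 1, then 0: 'no'
second = classify(tree, labels, [1, 1])   # follows 1, then 1: 'yes'
print(first, second)
```

A feature value that never appeared in the training data has no branch, so this sketch lets the `KeyError` surface instead of guessing a class.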

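Storing the decision tree

To avoid the cost of rebuilding the decision tree every time, the tree for a given problem can be serialized with the pickle module and read back when it is needed. By contrast, kNN has no trained model to persist in this way; it has to keep the entire training set around for every prediction. The Python 3 code for serializing and storing the decision tree is:

import pickle

def storeTree(inputTree, filename):
    """Store a decision tree with the pickle module."""
    # pickle writes bytes, so the file must be opened in binary mode
    with open(filename, 'wb') as fw:
        pickle.dump(inputTree, fw)

def grabTree(filename):
    """Load a decision tree previously stored with storeTree."""
    with open(filename, 'rb') as fr:
        return pickle.load(fr)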

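Example: predicting contact lens types with an ID3 decision tree

The example itself needs very little code, since it only reuses the modules above. The most important step is converting the text data into a suitable matrix format. Overfitting is left for deeper study later.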
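The conversion step might look like the sketch below. It assumes a tab-separated text layout with the class label in the last column; the three sample rows, the feature names in `lens_labels`, and the use of an in-memory `io.StringIO` in place of a real file are all illustrative assumptions, not data from the notes.

```python
import io

# Hypothetical sample in the assumed layout: tab-separated, class label last
raw = io.StringIO(
    "young\tmyope\tno\treduced\tno lenses\n"
    "young\tmyope\tno\tnormal\tsoft\n"
    "young\tmyope\tyes\tnormal\thard\n"
)

# One list per record: feature values followed by the class label
lenses = [line.strip().split('\t') for line in raw if line.strip()]
lens_labels = ['age', 'prescript', 'astigmatic', 'tearRate']  # assumed feature names

print(lenses[0])
```

The resulting list of lists has the same shape as the data sets passed to createTree above, with `lens_labels` playing the role of the labels argument.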
