Python Implementations of Machine Learning Algorithms (2): ID3 Decision Tree

Source: Internet | Posted by: 找老婆知乎 | Editor: 程序博客网 | Time: 2024/05/01 23:13

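The data in this post comes from the decision tree chapter of Zhou Zhihua's book Machine Learning (机器学习). The post can serve as an answer to exercise 3 of that chapter.

The code follows the book Machine Learning in Action (机器学习实战), with some modifications.

The Python libraries used in this post include: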

numpy
pandas
math
operator
matplotlib

The data used in this post is as follows:

Idx,色泽,根蒂,敲声,纹理,脐部,触感,密度,含糖率,label
1,青绿,蜷缩,浊响,清晰,凹陷,硬滑,0.697,0.46,1
2,乌黑,蜷缩,沉闷,清晰,凹陷,硬滑,0.774,0.376,1
3,乌黑,蜷缩,浊响,清晰,凹陷,硬滑,0.634,0.264,1
4,青绿,蜷缩,沉闷,清晰,凹陷,硬滑,0.608,0.318,1
5,浅白,蜷缩,浊响,清晰,凹陷,硬滑,0.556,0.215,1
6,青绿,稍蜷,浊响,清晰,稍凹,软粘,0.403,0.237,1
7,乌黑,稍蜷,浊响,稍糊,稍凹,软粘,0.481,0.149,1
8,乌黑,稍蜷,浊响,清晰,稍凹,硬滑,0.437,0.211,1
9,乌黑,稍蜷,沉闷,稍糊,稍凹,硬滑,0.666,0.091,0
10,青绿,硬挺,清脆,清晰,平坦,软粘,0.243,0.267,0
11,浅白,硬挺,清脆,模糊,平坦,硬滑,0.245,0.057,0
12,浅白,蜷缩,浊响,模糊,平坦,软粘,0.343,0.099,0
13,青绿,稍蜷,浊响,稍糊,凹陷,硬滑,0.639,0.161,0
14,浅白,稍蜷,沉闷,稍糊,凹陷,硬滑,0.657,0.198,0
15,乌黑,稍蜷,浊响,清晰,稍凹,软粘,0.36,0.37,0
16,浅白,蜷缩,浊响,模糊,平坦,硬滑,0.593,0.042,0
17,青绿,蜷缩,沉闷,稍糊,稍凹,硬滑,0.719,0.103,0

Because I could not get matplotlib to render Chinese text, I replaced all Chinese strings with English ones, as follows:

Idx,color,root,knocks,texture,navel,touch,density,sugar_ratio,label
1,dark_green,curl_up,little_heavily,distinct,sinking,hard_smooth,0.697,0.46,1
2,black,curl_up,heavily,distinct,sinking,hard_smooth,0.774,0.376,1
3,black,curl_up,little_heavily,distinct,sinking,hard_smooth,0.634,0.264,1
4,dark_green,curl_up,heavily,distinct,sinking,hard_smooth,0.608,0.318,1
5,light_white,curl_up,little_heavily,distinct,sinking,hard_smooth,0.556,0.215,1
6,dark_green,little_curl_up,little_heavily,distinct,little_sinking,soft_stick,0.403,0.237,1
7,black,little_curl_up,little_heavily,little_blur,little_sinking,soft_stick,0.481,0.149,1
8,black,little_curl_up,little_heavily,distinct,little_sinking,hard_smooth,0.437,0.211,1
9,black,little_curl_up,heavily,little_blur,little_sinking,hard_smooth,0.666,0.091,0
10,dark_green,stiff,clear,distinct,even,soft_stick,0.243,0.267,0
11,light_white,stiff,clear,blur,even,hard_smooth,0.245,0.057,0
12,light_white,curl_up,little_heavily,blur,even,soft_stick,0.343,0.099,0
13,dark_green,little_curl_up,little_heavily,little_blur,sinking,hard_smooth,0.639,0.161,0
14,light_white,little_curl_up,heavily,little_blur,sinking,hard_smooth,0.657,0.198,0
15,black,little_curl_up,little_heavily,distinct,little_sinking,soft_stick,0.36,0.37,0
16,light_white,curl_up,little_heavily,blur,even,hard_smooth,0.593,0.042,0
17,dark_green,curl_up,heavily,little_blur,little_sinking,hard_smooth,0.719,0.103,0

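The meaning of each English value can be found by comparing the two tables above.

The tree-building code follows Chapter 3 of Machine Learning in Action. The code in that chapter only handles discrete features, so the program below modifies it to work on data sets that contain both discrete and continuous features.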
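The treatment of a continuous feature can be illustrated on the density column alone. The following is only a minimal sketch, not the code used in this post. It assumes the same bi-partition scheme as the code in this post: candidate thresholds are the midpoints of adjacent sorted values, and the threshold with the largest information gain is kept. The helper names `entropy` and `best_threshold` are made up for this illustration.

```python
from math import log2

# Density and label columns copied from the table above.
density = [0.697, 0.774, 0.634, 0.608, 0.556, 0.403, 0.481, 0.437, 0.666,
           0.243, 0.245, 0.343, 0.639, 0.657, 0.36, 0.593, 0.719]
label = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    result = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        result -= p * log2(p)
    return result


def best_threshold(values, labels):
    """Return (threshold, information gain) of the best binary split."""
    base = entropy(labels)
    ordered = sorted(values)
    # n-1 candidate thresholds: midpoints of adjacent sorted values
    candidates = [(a + b) / 2.0 for a, b in zip(ordered, ordered[1:])]
    best_t, best_gain = None, -1.0
    for t in candidates:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        # Entropy after the split, weighted by the size of each side
        cond = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = base - cond
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain


threshold, gain = best_threshold(density, label)
print(threshold, gain)
```

For the 17 density values above, this should select the midpoint between 0.36 and 0.403, with an information gain of roughly 0.262.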

The tree-building code is as follows:

# -*- coding: utf-8 -*-
# Note: the wildcard "from numpy import *" was dropped, since it can shadow built-in names.
import numpy as np
import pandas as pd
from math import log
import operator


# Compute the Shannon entropy of a data set (the class label is the last column)
def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    # Count the samples in each class
    for featVec in dataSet:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    # Shannon entropy with base-2 logarithm
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt


# Split on a discrete feature: return all samples whose feature equals value,
# with that feature column removed
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis + 1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet


# Split on a continuous feature. direction selects the side of the split:
# 0 returns the samples greater than value, otherwise those less than or equal to value
def splitContinuousDataSet(dataSet, axis, value, direction):
    retDataSet = []
    for featVec in dataSet:
        if direction == 0:
            keep = featVec[axis] > value
        else:
            keep = featVec[axis] <= value
        if keep:
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis + 1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet


# Choose the best feature to split on
def chooseBestFeatureToSplit(dataSet, labels):
    numFeatures = len(dataSet[0]) - 1
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    bestSplitDict = {}
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        # Continuous feature
        if isinstance(featList[0], (int, float)):
            # Generate n-1 candidate split points (midpoints of adjacent sorted values)
            sortfeatList = sorted(featList)
            splitList = []
            for j in range(len(sortfeatList) - 1):
                splitList.append((sortfeatList[j] + sortfeatList[j + 1]) / 2.0)
            bestSplitEntropy = 10000
            slen = len(splitList)
            # Compute the entropy obtained with the j-th candidate split point
            # and remember the best split point
            for j in range(slen):
                value = splitList[j]
                newEntropy = 0.0
                subDataSet0 = splitContinuousDataSet(dataSet, i, value, 0)
                subDataSet1 = splitContinuousDataSet(dataSet, i, value, 1)
                prob0 = len(subDataSet0) / float(len(dataSet))
                newEntropy += prob0 * calcShannonEnt(subDataSet0)
                prob1 = len(subDataSet1) / float(len(dataSet))
                newEntropy += prob1 * calcShannonEnt(subDataSet1)
                if newEntropy < bestSplitEntropy:
                    bestSplitEntropy = newEntropy
                    bestSplit = j
            # Record the best split point of this feature in a dictionary
            bestSplitDict[labels[i]] = splitList[bestSplit]
            infoGain = baseEntropy - bestSplitEntropy
        # Discrete feature
        else:
            uniqueVals = set(featList)
            newEntropy = 0.0
            # Accumulate the weighted entropy of each value of this feature
            for value in uniqueVals:
                subDataSet = splitDataSet(dataSet, i, value)
                prob = len(subDataSet) / float(len(dataSet))
                newEntropy += prob * calcShannonEnt(subDataSet)
            infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    # If the best feature at this node is continuous, binarize it in place around
    # the recorded split point, i.e. 1 if value <= bestSplitValue, else 0.
    # (bestFeature stays -1 when no feature gives a positive gain.)
    if bestFeature != -1 and isinstance(dataSet[0][bestFeature], (int, float)):
        bestSplitValue = bestSplitDict[labels[bestFeature]]
        labels[bestFeature] = labels[bestFeature] + '<=' + str(bestSplitValue)
        for i in range(np.shape(dataSet)[0]):
            if dataSet[i][bestFeature] <= bestSplitValue:
                dataSet[i][bestFeature] = 1
            else:
                dataSet[i][bestFeature] = 0
    return bestFeature


# If all features are used up and the samples at a node still disagree, take a vote
def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    # Sort by count so the most frequent class comes first
    # (max(classCount) would return the largest key, not the most frequent class)
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]


# Main routine: build the decision tree recursively
def createTree(dataSet, labels, data_full, labels_full):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet, labels)
    # No feature reduces the entropy: fall back to a vote
    if bestFeat == -1:
        return majorityCnt(classList)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    isDiscrete = isinstance(dataSet[0][bestFeat], str)
    if isDiscrete:
        # Collect every value this feature takes in the full training set
        currentlabel = labels_full.index(labels[bestFeat])
        featValuesFull = [example[currentlabel] for example in data_full]
        uniqueValsFull = set(featValuesFull)
    del(labels[bestFeat])
    # Build one subtree for each value of bestFeat
    for value in uniqueVals:
        subLabels = labels[:]
        if isDiscrete:
            uniqueValsFull.remove(value)
        myTree[bestFeatLabel][value] = createTree(
            splitDataSet(dataSet, bestFeat, value), subLabels, data_full, labels_full)
    if isDiscrete:
        # Values absent from the current subset get the majority class as a leaf
        for value in uniqueValsFull:
            myTree[bestFeatLabel][value] = majorityCnt(classList)
    return myTree
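The selection criterion for discrete features can be checked by hand at the root node. The following sketch is independent of the code above and only assumes the standard definition of information gain (entropy before the split minus the weighted entropy after it). The names `entropy` and `info_gain` are hypothetical, and the rows are the discrete columns of the English table.

```python
from math import log2

# Discrete columns and label copied from the English table above:
# (color, root, knocks, texture, navel, touch, label)
names = ['color', 'root', 'knocks', 'texture', 'navel', 'touch']
rows = [
    ('dark_green', 'curl_up', 'little_heavily', 'distinct', 'sinking', 'hard_smooth', 1),
    ('black', 'curl_up', 'heavily', 'distinct', 'sinking', 'hard_smooth', 1),
    ('black', 'curl_up', 'little_heavily', 'distinct', 'sinking', 'hard_smooth', 1),
    ('dark_green', 'curl_up', 'heavily', 'distinct', 'sinking', 'hard_smooth', 1),
    ('light_white', 'curl_up', 'little_heavily', 'distinct', 'sinking', 'hard_smooth', 1),
    ('dark_green', 'little_curl_up', 'little_heavily', 'distinct', 'little_sinking', 'soft_stick', 1),
    ('black', 'little_curl_up', 'little_heavily', 'little_blur', 'little_sinking', 'soft_stick', 1),
    ('black', 'little_curl_up', 'little_heavily', 'distinct', 'little_sinking', 'hard_smooth', 1),
    ('black', 'little_curl_up', 'heavily', 'little_blur', 'little_sinking', 'hard_smooth', 0),
    ('dark_green', 'stiff', 'clear', 'distinct', 'even', 'soft_stick', 0),
    ('light_white', 'stiff', 'clear', 'blur', 'even', 'hard_smooth', 0),
    ('light_white', 'curl_up', 'little_heavily', 'blur', 'even', 'soft_stick', 0),
    ('dark_green', 'little_curl_up', 'little_heavily', 'little_blur', 'sinking', 'hard_smooth', 0),
    ('light_white', 'little_curl_up', 'heavily', 'little_blur', 'sinking', 'hard_smooth', 0),
    ('black', 'little_curl_up', 'little_heavily', 'distinct', 'little_sinking', 'soft_stick', 0),
    ('light_white', 'curl_up', 'little_heavily', 'blur', 'even', 'hard_smooth', 0),
    ('dark_green', 'curl_up', 'heavily', 'little_blur', 'little_sinking', 'hard_smooth', 0),
]


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    result = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        result -= p * log2(p)
    return result


def info_gain(rows, col):
    """Information gain obtained by splitting rows on the discrete column col."""
    cond = 0.0
    for v in set(r[col] for r in rows):
        subset = [r[-1] for r in rows if r[col] == v]
        cond += len(subset) / len(rows) * entropy(subset)
    return entropy([r[-1] for r in rows]) - cond


gains = {name: info_gain(rows, i) for i, name in enumerate(names)}
best = max(gains, key=gains.get)
for name in names:
    print(name, round(gains[name], 3))
print('best:', best)
```

On these 17 samples, texture should come out on top with a gain of about 0.381.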

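It is called with the following statements:

df = pd.read_csv('watermelon_4_3.csv')
data = df.values[:, 1:].tolist()
# Copy every row, so that binarizing a continuous feature in data
# does not also modify data_full (data[:] would share the row lists)
data_full = [row[:] for row in data]
labels = df.columns.values[1:-1].tolist()
labels_full = labels[:]
myTree = createTree(data, labels, data_full, labels_full)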

This produces the following result (the "L" suffix is how Python 2 prints long integers):

>>> myTree
{'texture': {'distinct': {'density<=0.3815': {0: 1L, 1: 0L}}, 'little_blur': {'touch': {'hard_smooth': 0L, 'soft_stick': 1L}}, 'blur': 0L}}
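A nested dictionary of this shape can be used to classify a sample by walking down from the root. The function below is a sketch, not part of the original code. `classify` and the sample dictionaries are hypothetical, and the sketch assumes that a continuous node is labelled "feature<=threshold" with branch 1 meaning the test holds and branch 0 meaning it does not, as in the result above.

```python
# Tree copied from the result above (without the Python 2 "L" suffix).
my_tree = {'texture': {'distinct': {'density<=0.3815': {0: 1, 1: 0}},
                       'little_blur': {'touch': {'hard_smooth': 0, 'soft_stick': 1}},
                       'blur': 0}}


def classify(tree, sample):
    """Follow the branches of a nested-dict tree until a leaf is reached.

    sample is a dict mapping feature names to raw feature values.
    """
    while isinstance(tree, dict):
        node_label = next(iter(tree))  # a node dict has exactly one key
        branches = tree[node_label]
        if '<=' in node_label:
            # Continuous node: branch 1 if the test holds, branch 0 otherwise
            feature, threshold = node_label.split('<=')
            key = 1 if sample[feature] <= float(threshold) else 0
        else:
            # Discrete node: the branch is keyed by the feature value itself
            key = sample[node_label]
        tree = branches[key]
    return tree


print(classify(my_tree, {'texture': 'distinct', 'density': 0.697}))
print(classify(my_tree, {'texture': 'little_blur', 'touch': 'soft_stick'}))
```

Both calls correspond to training samples labelled 1 (samples 1 and 7), so they should print 1. A value with no branch in the tree would raise a KeyError in this sketch.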

The plotting code is as follows:

import matplotlib.pyplot as plt

decisionNode = dict(boxstyle="sawtooth", fc="0.8")
leafNode = dict(boxstyle="round4", fc="0.8")
arrow_args = dict(arrowstyle="<-")


# Count the leaf nodes of the tree
def getNumLeafs(myTree):
    numLeafs = 0
    # list(...) is needed because dict.keys() cannot be indexed in Python 3
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if isinstance(secondDict[key], dict):
            numLeafs += getNumLeafs(secondDict[key])
        else:
            numLeafs += 1
    return numLeafs


# Compute the maximum depth of the tree
def getTreeDepth(myTree):
    maxDepth = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if isinstance(secondDict[key], dict):
            thisDepth = 1 + getTreeDepth(secondDict[key])
        else:
            thisDepth = 1
        if thisDepth > maxDepth:
            maxDepth = thisDepth
    return maxDepth


# Draw a node with an arrow coming from its parent
def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    # str() because leaf labels are integers
    createPlot.ax1.annotate(str(nodeTxt), xy=parentPt, xycoords='axes fraction',
                            xytext=centerPt, textcoords='axes fraction',
                            va="center", ha="center", bbox=nodeType,
                            arrowprops=arrow_args)


# Draw the text on an arrow
def plotMidText(cntrPt, parentPt, txtString):
    lens = len(txtString)
    xMid = (parentPt[0] + cntrPt[0]) / 2.0 - lens * 0.002
    yMid = (parentPt[1] + cntrPt[1]) / 2.0
    createPlot.ax1.text(xMid, yMid, txtString)


# Draw the tree recursively
def plotTree(myTree, parentPt, nodeTxt):
    numLeafs = getNumLeafs(myTree)
    firstStr = list(myTree.keys())[0]
    cntrPt = (plotTree.x0ff + (1.0 + float(numLeafs)) / 2.0 / plotTree.totalW, plotTree.y0ff)
    plotMidText(cntrPt, parentPt, nodeTxt)
    plotNode(firstStr, cntrPt, parentPt, decisionNode)
    secondDict = myTree[firstStr]
    # Move down one level before drawing the children
    plotTree.y0ff = plotTree.y0ff - 1.0 / plotTree.totalD
    for key in secondDict.keys():
        if isinstance(secondDict[key], dict):
            plotTree(secondDict[key], cntrPt, str(key))
        else:
            plotTree.x0ff = plotTree.x0ff + 1.0 / plotTree.totalW
            plotNode(secondDict[key], (plotTree.x0ff, plotTree.y0ff), cntrPt, leafNode)
            plotMidText((plotTree.x0ff, plotTree.y0ff), cntrPt, str(key))
    # Move back up one level when this subtree is finished
    plotTree.y0ff = plotTree.y0ff + 1.0 / plotTree.totalD


def createPlot(inTree):
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)
    plotTree.totalW = float(getNumLeafs(inTree))
    plotTree.totalD = float(getTreeDepth(inTree))
    plotTree.x0ff = -0.5 / plotTree.totalW
    plotTree.y0ff = 1.0
    plotTree(inTree, (0.5, 1.0), '')
    plt.show()

It is called as follows:

createPlot(myTree)

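The tree-building code and the plotting code above can be placed in separate files and imported, or put directly into a single .py file.

The resulting decision tree is shown in the figure below:

It matches the figure on page 85 of the Machine Learning textbook.

If you find any mistakes in the text or the code, please point them out; I would be very grateful.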

Update:

2016.4.3: the createTree function that builds the decision tree was updated (the code above is already the updated version).

The original code was:

# Main routine: build the decision tree recursively
def createTree(dataSet,labels):
    classList=[example[-1] for example in dataSet]
    if classList.count(classList[0])==len(classList):
        return classList[0]
    if len(dataSet[0])==1:
        return majorityCnt(classList)
    bestFeat=chooseBestFeatureToSplit(dataSet,labels)
    bestFeatLabel=labels[bestFeat]
    myTree={bestFeatLabel:{}}
    del(labels[bestFeat])
    featValues=[example[bestFeat] for example in dataSet]
    uniqueVals=set(featValues)
    # Build one subtree for each value of bestFeat
    for value in uniqueVals:
        subLabels=labels[:]
        myTree[bestFeatLabel][value]=createTree(splitDataSet\
         (dataSet,bestFeat,value),subLabels)
    return myTree

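For example, color takes three values (dark_green, black, light_white) and texture takes three values (distinct, little_blur, blur). If the data is split on texture first, a sub-sample set further down the tree, such as the one for distinct, may no longer contain any sample whose color is light_white. The resulting tree may then be unable to classify new data (for example, a melon with texture: distinct; color: light_white). The recursion therefore has to carry the complete training set along, so that a complete tree is produced. (A branch for a value that is missing from the current subset is assigned the majority class of the current data set, i.e. by voting.)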
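The idea of filling in missing branches can be sketched in a few lines. This is an illustration only, not the createTree code itself: `build_branches` is a hypothetical helper, and the three samples are numbers 6, 8 and 15 from the table (texture = distinct and root = little_curl_up), a subset in which light_white does not occur.

```python
from collections import Counter

# Every value that color takes in the full training set.
all_colors = {'dark_green', 'black', 'light_white'}

# (color, label) pairs of samples 6, 8 and 15 from the table above.
subset = [('dark_green', 1), ('black', 1), ('black', 0)]


def build_branches(subset, all_values):
    """Group labels by feature value, then add a leaf for every unseen value."""
    branches = {}
    for value, label in subset:
        # Seen values keep their labels; a real tree would recurse on them
        branches.setdefault(value, []).append(label)
    # Majority class of the current subset, used as the leaf for unseen values
    majority = Counter(label for _, label in subset).most_common(1)[0][0]
    for value in all_values - set(branches):
        branches[value] = majority
    return branches


branches = build_branches(subset, all_colors)
print(branches)
```

Here the light_white branch, which has no samples of its own, receives class 1, the majority class of the three samples.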

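For example, take Table 4.2 from the book (the earlier table with the density and sugar-ratio columns removed). The earlier code produces the following figure:

Below is the figure produced by the modified code; it matches Figure 4.4 on page 78 of the book.

As can be seen, after the modification the split on the color feature on the left is complete.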

3 1