Building a Decision Tree Step by Step


Original article: http://blog.csdn.net/xuelabizp/article/details/50979469

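1. What is a decision tree

A decision tree is a basic method for classification and regression; this article focuses on decision trees used for classification. A decision tree is a tree-shaped structure that classifies an input by testing a sequence of conditions. For example, the process by which a high-end dating site decides whether to arrange a meeting between a female client and a prospective date can be drawn as a decision tree.

Building a decision tree from a given data set is the machine learning process. Building the tree can take a fair amount of time, but using a finished tree is very fast.

The key question when building a decision tree is which feature to split on. A good splitting feature separates the data set as cleanly as possible, turning disorder into order. This raises a question: how do we describe how ordered a data set is? In information theory and probability and statistics, entropy is the measure of the uncertainty of a random variable, and it serves here as the measure of how ordered a data set is.

Consider the following set D, which is used as the running example throughout this article: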

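| No. | Can survive without surfacing | Has flippers | Is a fish |
|-----|-------------------------------|--------------|-----------|
| 1   | yes                           | yes          | yes       |
| 2   | yes                           | yes          | yes       |
| 3   | yes                           | no           | no        |
| 4   | no                            | yes          | no        |
| 5   | no                            | yes          | no        |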

The code that creates this set is as follows:

```python
def create_data_set():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']  # can survive without surfacing, has flippers
    return dataSet, labels
```

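2. Entropy, information gain, and information gain ratio

2.1 Entropy

I first met the word "entropy" in a high-school chemistry class, and it seems to me that its meaning in chemistry is much the same as in information theory: both describe the degree of disorder. The larger the entropy, the greater the disorder; for example, a glass of muddy water has higher entropy than a glass of pure water.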

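In information theory and probability and statistics, let X be a discrete random variable taking finitely many values, with probability distribution:

$$P(X = x_i) = p_i, \quad i = 1, 2, \ldots, n \tag{1}$$

The entropy of the random variable X is then defined as:

$$H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i \tag{2}$$

If $p_i = 0$, we define $0 \log 0 = 0$. Note that entropy depends only on the distribution of X, not on the values X takes. Using (2), the entropy of the set D given above is: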

$$H(D) = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} = 0.971$$
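As a quick numerical check of (2), here is a minimal sketch that computes the entropy of a discrete distribution given as a list of probabilities. The function name `entropy` is introduced only for this illustration and is not part of the article's script; zero probabilities are skipped, which is one way to apply the 0 log 0 = 0 convention.

```python
from math import log2

def entropy(probs):
    """Shannon entropy (base 2) of a discrete distribution."""
    # Skipping p == 0 applies the convention 0 * log(0) = 0.
    return -sum(p * log2(p) for p in probs if p > 0)

# Set D contains 2 "yes" samples and 3 "no" samples out of 5.
h_d = entropy([2 / 5, 3 / 5])
print(round(h_d, 3))  # 0.971
```

Only the probabilities enter the calculation, which illustrates the remark that entropy depends on the distribution of X and not on its values.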

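Now write a function that computes the entropy, where dataSet is the data set used to build the decision tree and the last element of each row is the class: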

```python
from math import log

def cal_Ent(dataSet):  # compute the entropy of the given data set
    num = len(dataSet)
    labels = {}
    for row in dataSet:  # count how often each class label occurs
        label = row[-1]
        if label not in labels:
            labels[label] = 0
        labels[label] += 1
    Ent = 0.0
    for key in labels:  # accumulate -p * log2(p) over the classes
        prob = float(labels[key]) / num
        Ent -= prob * log(prob, 2)
    return Ent
```

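2.2 Information gain

Information gain measures how much knowing feature X reduces the uncertainty about class Y.

To put it another way: a glass of muddy water Y has entropy $H_1$. If one kind of suspended matter X is removed from it, the entropy of the water drops to $H_2$. The information gain of X for this glass of water is then $H_1 - H_2$.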

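The information gain of feature X on data set D is written g(D, X) and is computed as:

$$g(D, X) = H(D) - H(D|X) \tag{3}$$

where H(D|X) is the empirical conditional entropy of D given feature X.
First, what conditional entropy is:

The conditional entropy H(Y|X) expresses the uncertainty of random variable Y given that random variable X is known. It is defined as the expectation, over X, of the entropy of the conditional distribution of Y given X.

The conditional entropy is computed as:

$$H(Y|X) = \sum_{i=1}^{n} p_i H(Y|X = x_i), \quad p_i = P(X = x_i) \tag{4}$$

When the probabilities in the entropy and the conditional entropy are estimated from data, the resulting quantities are called the empirical entropy and the empirical conditional entropy, respectively.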

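A decision tree chooses as its splitting feature the one with the largest information gain on the set, that is, the feature whose split leaves the set most ordered. Continuing with the set D, we select the best splitting feature by the information gain criterion.

Let $X_1$ denote "can survive without surfacing". Then

$$\begin{aligned} g(D, X_1) &= H(D) - \left[\frac{3}{5}H(D_1) + \frac{2}{5}H(D_2)\right] \\ &= 0.971 - \left[\frac{3}{5}\left(-\frac{2}{3}\log_2\frac{2}{3} - \frac{1}{3}\log_2\frac{1}{3}\right) + \frac{2}{5}\left(-\frac{2}{2}\log_2\frac{2}{2}\right)\right] = 0.420 \end{aligned}$$

where $D_1$ and $D_2$ are the subsets of D in which $X_1$ takes the values "yes" and "no", respectively.
Let $X_2$ denote "has flippers". Then

$$\begin{aligned} g(D, X_2) &= H(D) - \left[\frac{4}{5}H(D_1) + \frac{1}{5}H(D_2)\right] \\ &= 0.971 - \left[\frac{4}{5}\left(-\frac{2}{4}\log_2\frac{2}{4} - \frac{2}{4}\log_2\frac{2}{4}\right) + \frac{1}{5}\left(-\frac{1}{1}\log_2\frac{1}{1}\right)\right] = 0.171 \end{aligned}$$

where $D_1$ and $D_2$ are the subsets of D in which $X_2$ takes the values "yes" and "no", respectively.
Comparing the information gain of the two features, $X_1$ has the larger gain, so $X_1$ is chosen as the best splitting feature.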
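The two hand calculations above can be reproduced with a short sketch. The helper names `entropy_of` and `info_gain` are assumptions made for this illustration, and `collections.Counter` is used for counting; the article's own functions are written differently.

```python
from collections import Counter
from math import log2

def entropy_of(labels):
    """Empirical entropy of a list of class labels."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def info_gain(rows, feature_index):
    """g(D, X) = H(D) - H(D|X) for the feature stored in column feature_index."""
    total = len(rows)
    # Group the class labels (last column) by the value of the chosen feature.
    groups = {}
    for row in rows:
        groups.setdefault(row[feature_index], []).append(row[-1])
    # Empirical conditional entropy: subset entropies weighted by subset size.
    cond_entropy = sum(len(g) / total * entropy_of(g) for g in groups.values())
    return entropy_of([row[-1] for row in rows]) - cond_entropy

D = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
gain_x1 = info_gain(D, 0)  # "can survive without surfacing"
gain_x2 = info_gain(D, 1)  # "has flippers"
print(round(gain_x1, 3), round(gain_x2, 3))  # 0.42 0.171
```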

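Now write a function that selects the best splitting feature, where dataSet is the data set used to build the decision tree and the last element of each row is the class: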

```python
# Split the data set on a given feature: return all rows whose feature at
# index axis equals value, with that feature column removed
def split_data_set(dataSet, axis, value):
    retDataSet = []
    for row in dataSet:
        if row[axis] == value:
            reducedRow = row[:axis]
            reducedRow.extend(row[axis+1:])
            retDataSet.append(reducedRow)
    return retDataSet

# Choose the best splitting feature
def choose_best_feature(dataSet):
    num = len(dataSet[0]) - 1  # number of features
    baseEnt = cal_Ent(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(num):
        featlist = [example[i] for example in dataSet]  # walk the data set by column: all values of one feature
        uniqueVals = set(featlist)  # the distinct values this feature takes
        newEnt = 0.0
        for value in uniqueVals:
            subDataSet = split_data_set(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEnt += prob * cal_Ent(subDataSet)
        infoGain = baseEnt - newEnt  # information gain
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
```

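The ID3 decision tree selects features by information gain as it is grown.

2.3 Information gain ratio

Using information gain as the criterion for choosing the splitting feature is biased toward features that take many distinct values. The information gain ratio corrects for this bias.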

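The information gain ratio $g_R(D, X)$ of feature X on training data set D is defined as the ratio of its information gain g(D, X) to the entropy $H_X(D)$ of D with respect to the values of feature X.

The information gain ratio is computed as:

$$g_R(D, X) = \frac{g(D, X)}{H_X(D)} \tag{5}$$

where

$$H_X(D) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|}\log_2\frac{|D_i|}{|D|} \tag{6}$$

Here $D_i$ is the subset of D in which X takes its i-th value, and n is the number of distinct values of X.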

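Taking the set D as the example again, the information gain ratios are:

$$H_{X_1}(D) = -\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5} = 0.971$$

$$g_R(D, X_1) = \frac{g(D, X_1)}{H_{X_1}(D)} = \frac{0.420}{0.971} = 0.433$$

$$H_{X_2}(D) = -\frac{4}{5}\log_2\frac{4}{5} - \frac{1}{5}\log_2\frac{1}{5} = 0.722$$

$$g_R(D, X_2) = \frac{g(D, X_2)}{H_{X_2}(D)} = \frac{0.171}{0.722} = 0.237$$

By the information gain ratio, $X_1$ is chosen as the best splitting feature.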
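The gain ratio figures can be checked with a small sketch of (5) and (6). The names `entropy_of` and `gain_ratio` are hypothetical helpers for this illustration, and the guard against a zero denominator is an assumption of the sketch rather than something the article specifies.

```python
from collections import Counter
from math import log2

def entropy_of(values):
    """Empirical entropy of a list of discrete values."""
    total = len(values)
    return -sum(c / total * log2(c / total) for c in Counter(values).values())

def gain_ratio(rows, feature_index):
    """g_R(D, X) = g(D, X) / H_X(D) for the feature in column feature_index."""
    total = len(rows)
    feature_values = [row[feature_index] for row in rows]
    class_labels = [row[-1] for row in rows]
    # H(D|X): entropy of the class labels inside each feature-value subset.
    cond = 0.0
    for v in set(feature_values):
        subset = [row[-1] for row in rows if row[feature_index] == v]
        cond += len(subset) / total * entropy_of(subset)
    gain = entropy_of(class_labels) - cond
    # H_X(D), equation (6): entropy of the feature's own value distribution.
    split_info = entropy_of(feature_values)
    if split_info == 0:  # assumption: a single-valued feature gets ratio 0
        return 0.0
    return gain / split_info

D = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print(round(gain_ratio(D, 0), 3), round(gain_ratio(D, 1), 3))  # 0.433 0.237
```

Note that the denominator is computed from the feature column alone; the class labels play no part in $H_X(D)$.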

The C4.5 decision tree selects features by information gain ratio as it is grown.

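3. Implementing a decision tree

3.1 Creating or loading a data set

First, create or load the training data set. Section 1 created the data set directly in code, but it is more common to open a file with open() and load the data set from it.

3.2 Growing the decision tree

Decision trees are usually grown recursively.

A good habit when writing a recursive function is to think about the termination conditions first. Growing a decision tree stops in two cases: first, when all the data in the current partition belongs to one class; second, when every feature has already been used. In the second case the data in the partition may not all belong to one class, so the class of that subset is decided by majority vote.

When neither termination condition holds, first select the feature with the largest information gain and split on it. At the start of the split, record the splitting feature in the tree, then delete it from the list of feature labels to mark it as used. Split the data set into subsets according to the chosen feature, then recursively build a tree for each subset; the result is a complete decision tree.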

```python
import operator

# Majority vote
def majorityCnt(classList):
    classCount = {}
    for vote in classList:  # count the occurrences of each class
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    # sort (class, count) pairs by count, largest first
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

# Grow the decision tree
def create_tree(dataSet, labels):
    labelsCloned = labels[:]
    classList = [example[-1] for example in dataSet]  # e.g. ['yes', 'yes', 'no', 'no', 'no']
    if classList.count(classList[0]) == len(classList):  # only one class left: stop splitting
        return classList[0]
    if len(dataSet[0]) == 1:  # no features left: stop splitting
        return majorityCnt(classList)
    bestFeat = choose_best_feature(dataSet)
    bestFeatLabel = labelsCloned[bestFeat]  # name of the best feature
    myTree = {bestFeatLabel: {}}
    del labelsCloned[bestFeat]
    featValues = [example[bestFeat] for example in dataSet]  # all values of the best feature
    uniqueVals = set(featValues)
    for value in uniqueVals:  # build one subtree per value
        subLabels = labelsCloned[:]  # copy: every subtree needs its own list, so the original must not change
        myTree[bestFeatLabel][value] = create_tree(split_data_set(dataSet, bestFeat, value), subLabels)
    return myTree
```

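For the set D, the generated decision tree is the nested dictionary {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}.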
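The recursion described in this section can be exercised end to end with a compact, self-contained sketch. It is a re-implementation for illustration, not the article's code: the names `entropy_of`, `best_feature`, and `build_tree` are assumptions, majority vote uses `collections.Counter`, and ties in information gain are broken by taking the first feature.

```python
from collections import Counter
from math import log2

def entropy_of(labels):
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def best_feature(rows):
    """Index of the feature with the largest information gain (first one on ties)."""
    base = entropy_of([r[-1] for r in rows])

    def gain(i):
        groups = {}
        for r in rows:
            groups.setdefault(r[i], []).append(r[-1])
        return base - sum(len(g) / len(rows) * entropy_of(g) for g in groups.values())

    return max(range(len(rows[0]) - 1), key=gain)

def build_tree(rows, names):
    classes = [r[-1] for r in rows]
    if len(set(classes)) == 1:   # stop: every sample has the same class
        return classes[0]
    if len(rows[0]) == 1:        # stop: no features left, fall back to majority vote
        return Counter(classes).most_common(1)[0][0]
    i = best_feature(rows)
    rest = names[:i] + names[i + 1:]   # new list, so the caller's names are untouched
    tree = {names[i]: {}}
    for value in sorted(set(r[i] for r in rows)):
        # Keep the rows with this value and drop the used feature column.
        subset = [r[:i] + r[i + 1:] for r in rows if r[i] == value]
        tree[names[i]][value] = build_tree(subset, rest)
    return tree

D = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
tree = build_tree(D, ['no surfacing', 'flippers'])
print(tree)
# {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
```

Each internal node is a dictionary with a single key (the feature name) whose value maps feature values to subtrees or class labels.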

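3.3 Using the decision tree

The function that classifies an input with the decision tree is also recursive. It takes three arguments: the decision tree, the list of feature names, and the data to classify. The feature list is the bridge between the tree and the data: a feature in the tree is looked up in the feature list to get its index, and that index is then used to read the value of the feature from the data to classify.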

```python
def classify(tree, featLabels, testVec):
    firstJudge = list(tree.keys())[0]  # the feature tested at this node
    secondDict = tree[firstJudge]
    featIndex = featLabels.index(firstJudge)  # index of that feature in the input
    classLabel = None  # stays None if the feature value has no branch in the tree
    for key in secondDict:  # find the branch matching the input
        if key == testVec[featIndex]:  # branch on the feature value
            if isinstance(secondDict[key], dict):  # a dict means classification must continue
                classLabel = classify(secondDict[key], featLabels, testVec)
            else:  # not a dict: this is a leaf and classification is finished
                classLabel = secondDict[key]
    return classLabel
```
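To make the lookup through the feature list concrete, here is a hedged, self-contained sketch of the same walk written as a loop instead of recursion. The function name `classify_iter` is hypothetical, the tree literal is the one obtained for the set D, and returning None for an unseen feature value is a choice made for this sketch.

```python
def classify_iter(tree, feat_labels, test_vec):
    """Walk the nested-dict tree until a leaf (a non-dict value) is reached."""
    node = tree
    while isinstance(node, dict):
        feature = next(iter(node))                     # the single key is the feature name
        value = test_vec[feat_labels.index(feature)]   # feature name -> index -> input value
        branches = node[feature]
        if value not in branches:                      # value never seen during training
            return None
        node = branches[value]
    return node

tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
labels = ['no surfacing', 'flippers']
print(classify_iter(tree, labels, [1, 1]))  # yes
print(classify_iter(tree, labels, [1, 0]))  # no
print(classify_iter(tree, labels, [0, 1]))  # no
```

The full, original feature list is passed at every level, because the indices refer to positions in the input vector rather than in the shrinking training subsets.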

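3.4 Saving and loading the decision tree

Growing a decision tree is relatively time-consuming, so once a tree has been built it is worth storing it and simply reading it back when it is needed.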

```python
import pickle

def store_tree(tree, fileName):  # save the tree
    with open(fileName, 'wb') as fw:  # pickle data is binary, so open in binary mode
        pickle.dump(tree, fw)

def grab_tree(fileName):  # load the tree
    with open(fileName, 'rb') as fr:
        return pickle.load(fr)
```
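A quick round trip shows that a stored tree comes back unchanged. This is a sketch under stated assumptions: the file name tree.pkl and the temporary directory are placeholders, and the JSON comparison at the end is an aside about an alternative format, not something the article uses.

```python
import json
import os
import pickle
import tempfile

tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'tree.pkl')   # placeholder file name
    with open(path, 'wb') as fw:           # binary mode for writing pickle data
        pickle.dump(tree, fw)
    with open(path, 'rb') as fr:           # binary mode for reading it back
        restored = pickle.load(fr)

print(restored == tree)  # True

# JSON object keys are always strings, so the integer branch keys 0 and 1
# come back as '0' and '1' and the tree no longer compares equal.
as_json = json.loads(json.dumps(tree))
print(as_json == tree)  # False
```

Pickle keeps the integer keys intact, which matters because classification compares branch keys against the raw feature values. Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.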

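4. Visualizing the decision tree

A decision tree represented as a dictionary is still somewhat abstract for a human reader; it is much more convenient to see it as a picture. Fortunately there is a ready-made script, treePlotter.py, that uses matplotlib to draw decision trees. It is not explained here; we simply use it as is. Copy treePlotter.py into the directory that contains our script, import treePlotter, and call treePlotter.createPlot():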

```python
import treePlotter
treePlotter.createPlot(tree)
```

Calling it on the decision tree above draws that tree as a diagram.
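When no plotting environment is at hand, the nested dictionary can also be inspected as indented text. The following is a minimal sketch of that idea; the function `render` is an assumption made for this illustration and has nothing to do with treePlotter.py.

```python
def render(tree, indent=0):
    """Return the nested-dict tree as a list of indented text lines."""
    pad = '  ' * indent
    if not isinstance(tree, dict):           # leaf: a class label
        return [pad + '-> ' + str(tree)]
    lines = []
    for feature, branches in tree.items():   # one feature per internal node
        for value, subtree in branches.items():
            lines.append(pad + '{} = {}'.format(feature, value))
            lines.extend(render(subtree, indent + 1))
    return lines

tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
print('\n'.join(render(tree)))
# no surfacing = 0
#   -> no
# no surfacing = 1
#   flippers = 0
#     -> no
#   flippers = 1
#     -> yes
```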

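5. Predicting contact lens types with a decision tree

The contact lens data set contains information about each patient's eyes together with the type of contact lens the doctor recommended. Each patient is described by 4 features: age, prescription type, whether the patient is astigmatic, and eye condition (tear rate). There are 3 lens classes: soft lenses, hard lenses, and not suited to contact lenses.

Want to apply our script to a different data set? No problem: only the function that loads the data set needs to change, and the other functions stay the same. Specifically: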

```python
# Load the data
def file2matrix():
    dataSet = []
    with open("lenses.data.txt") as f:
        for line in f:
            line = line.strip()
            listFromLine = line.split()  # split the line on whitespace
            dataSet.append(listFromLine)
    labels = ['age', 'prescription', 'astigmatic', 'tear rate']  # age, prescription type, astigmatic or not, eye condition
    return dataSet, labels
```

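The resulting decision tree can be visualized in the same way.

I also tried other data sets, but the resulting trees were far too complex and too large, so densely packed that nothing could be made out. Anyone who is interested is welcome to try other data sets.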

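6. Summary

  • The source code is in my GitHub, including the visualization script and the data set
  • The MachineLearningAction repository contains examples of common machine learning algorithms applied to common data sets; you are welcome to visit
  • Advantages of decision trees
    • Decision trees are easy to understand and interpret; a visualized tree is especially intuitive
    • Classification with a decision tree is very fast
  • Disadvantages of decision trees
    • They overfit easily
    • Data sets with missing values are hard to handle
    • Correlations between features in the data set are ignored
  • Common decision trees are ID3, C4.5, and CART
    • C4.5 improves on ID3: the information gain ratio addresses information gain's bias toward features with many values
    • C4.5 prunes the tree as part of its construction, which reduces overfitting
    • C4.5 can handle numeric data, whereas ID3 handles only nominal data
  • Decision trees can be applied to loan approval, arranging dates, and similar tasks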