Machine Learning in Action (2): Decision Trees

Source: Internet. Editor: 程序博客网. Time: 2024/05/21 21:44

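1. About the decision tree algorithm:

The kNN algorithm in chapter 2 did a great job of classifying, but it didn’t lead to any major insights about the data. One of the best things about decision trees is that humans can easily understand the data. In other words, kNN cannot explain the internal structure of the data, whereas a decision tree is very easy for people to understand (good interpretability).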

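2. The general process of constructing a decision tree

When constructing a decision tree, the first question is which feature of the current dataset is most decisive in splitting the data into classes. To find that feature and obtain the best split, we must evaluate every feature.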

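The general approach to decision trees, i.e. the process of building decision trees from nothing but a pile of data, is as follows:

(1) Collect: gather the data.

(2) Prepare: the tree-construction algorithm used here works only on nominal values, so numeric values must be discretized.

(3) Analyze: after the tree is constructed, we should check that the plotted tree matches expectations.

(4) Train: construct the tree data structure.

(5) Test: compute the error rate with the learned tree.

(6) Use: apply the algorithm.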
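Step (2) can be illustrated with a small sketch of discretization. The helper name `discretize`, the thresholds, and the bin names below are assumptions made for illustration; they do not come from the book or from this post.

```python
def discretize(value, thresholds, names):
    """Map a numeric value to a nominal bin name.

    thresholds must be sorted in ascending order, and names must have
    len(thresholds) + 1 entries (one per bin).
    """
    for threshold, name in zip(thresholds, names):
        if value < threshold:
            return name
    return names[-1]


# Hypothetical numeric feature (ages) turned into three nominal values.
ages = [12, 25, 47, 70]
binned = [discretize(a, [18, 60], ['young', 'adult', 'senior']) for a in ages]
print(binned)  # ['young', 'adult', 'adult', 'senior']
```

After this step each feature takes only a handful of distinct values, which is what the splitting code later in this post expects.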

3. Tree Construction

Information theory: We’ll first discuss the mathematics that decide how to split a dataset using something called information theory.

(1) Which feature is used to split the data:

To determine this, you try every feature and measure which split will give you the best results. In other words, to decide which feature to split on, choose the one that gives the best result.

Then what does "best result" mean? ==> We choose to split our dataset in a way that makes our unorganized data more organized. From the viewpoint of information theory, this means making the entropy, i.e. the degree of disorder, smaller.

(2) Information Gain

def: The change in information before and after the split is known as the information gain; that is, it is defined as the change in entropy before and after the split.

A. Entropy

The entropy H is defined as the expected value of the information.
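Written out, these two definitions take the following standard form. The notation is an assumption made here: $p(x_i)$ is the fraction of records with class $x_i$, $D$ is the dataset, $A$ is a feature, and $D_v$ is the subset of $D$ where $A$ takes the value $v$.

```latex
\begin{align}
  l(x_i) &= -\log_2 p(x_i)
    && \text{information of class } x_i \\
  H(D) &= -\sum_{i=1}^{n} p(x_i)\,\log_2 p(x_i)
    && \text{entropy: expected value of } l \\
  \mathrm{Gain}(D, A) &= H(D) - \sum_{v \in \mathrm{Values}(A)} \frac{|D_v|}{|D|}\, H(D_v)
    && \text{information gain of splitting on } A
\end{align}
```

The weighted sum in the last line is the entropy after the split, so a larger gain means a more organized result.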

The code to compute the entropy is as follows:

    from math import log

    def calcShannonEnt(dataSet):
        '''
        @abstract: calculate the Shannon entropy of the input dataSet
        @input: dataSet ==> type(list of lists)
        @output: the Shannon entropy of the input dataSet ==> type(float)
        '''
        numEntries = len(dataSet)
        # A dictionary must be used here, so that the labels can be strings;
        # a dictionary is also convenient for counting.
        # e.g. dataSet = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
        labelCounts = {}  # each class label and its count
        for featVecs in dataSet:
            currentLabel = featVecs[-1]
            if currentLabel not in labelCounts:
                labelCounts[currentLabel] = 0
            labelCounts[currentLabel] += 1
        shannonEnt = 0.0
        for key in labelCounts:
            prob = float(labelCounts[key]) / numEntries
            shannonEnt -= prob * log(prob, 2)
        return shannonEnt

B. Splitting the data

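Imagine a two-dimensional plot of the data in which you want to draw a line to separate two classes. Should you split along the X axis or the Y axis? (In essence, a decision tree repeatedly partitions the feature space; for an n-dimensional feature space, the question is in what order the feature dimensions should be used for splitting.)

To know which feature gives the largest information gain, we have to actually perform each split and measure it.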

The code for splitting the dataset on a given feature is as follows:

    def splitDataSet(dataSet, axis, value):
        '''
        @abstract: dataset splitting on a given feature
        @input: dataSet ==> the dataset to be split (type: a list of lists)
                axis ==> the index of the feature to split on
                value ==> the value of the feature to return
        @output: the records whose split feature equals value,
                 with the split feature removed from each record
        '''
        retDataSet = []
        for featVec in dataSet:
            if featVec[axis] == value:
                # reducedFeatvec is featVec without the feature that was split on
                reducedFeatvec = featVec[:axis]
                reducedFeatvec.extend(featVec[axis+1:])
                retDataSet.append(reducedFeatvec)
                # Both are list methods, but pay attention to the
                # difference between extend() and append(), e.g.:
                #   a = [1, 2, 3]; b = [4, 5, 6]
                #   a.append(b) gives [1, 2, 3, [4, 5, 6]]
                #   a.extend(b) gives [1, 2, 3, 4, 5, 6]
        return retDataSet
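To see what such a split returns, here is a minimal self-contained sketch on the sample dataset. The name `split_rows` and the list-comprehension form are assumptions for this example; it is meant to behave like the function above, not to replace it.

```python
def split_rows(rows, axis, value):
    """Rows whose feature `axis` equals `value`, with that feature removed."""
    return [row[:axis] + row[axis + 1:] for row in rows if row[axis] == value]


my_dat = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print(split_rows(my_dat, 0, 1))  # [[1, 'yes'], [1, 'yes'], [0, 'no']]
print(split_rows(my_dat, 0, 0))  # [[1, 'no'], [1, 'no']]
```

Slicing creates new lists, so the original records are left untouched, which matters because the same dataset is split repeatedly while features are compared.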

C. Choosing the best feature to split on

As you can guess, it chooses the feature that, when split on, best organizes your data. The basic idea is to iterate over all features and find the one with the largest information gain. The code is as follows:

    def chooseBestFeatureToSplit(dataSet):
        # number of features; the last column is used for the labels
        numFeatures = len(dataSet[0]) - 1
        # entropy of the original dataset
        baseEntropy = calcShannonEnt(dataSet)
        # initialize the best information gain and the corresponding feature
        bestInfoGain = 0.0
        bestFeatures = -1
        # iterate over all features to find the largest information gain
        for i in range(numFeatures):
            featList = [example[i] for example in dataSet]
            # use set() to remove duplicate values from the list
            uniqueVals = set(featList)
            newEntropy = 0.0
            for value in uniqueVals:
                # entropy after the split: compute the entropy of the subset for
                # each value of this feature, then take the weighted sum
                subDataSet = splitDataSet(dataSet, i, value)
                # dataSet and subDataSet are both lists of lists
                prob = len(subDataSet) / float(len(dataSet))
                newEntropy += prob * calcShannonEnt(subDataSet)
            infoGain = baseEntropy - newEntropy
            if infoGain > bestInfoGain:
                bestInfoGain = infoGain
                bestFeatures = i
        # return type ==> integer (stays -1 if no split has a positive gain)
        return bestFeatures

From the code above, we can see some assumptions that we make about the dataset:

a. The dataset is a list of lists, and all of the inner lists have the same length;

b. The last column in the data, i.e. the last item in each instance, is the class label of that instance.

D. A Test Example

Next, let's run the code to find the feature with the largest information gain for a given dataset:

>>> myDat = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
>>> decisionTree.chooseBestFeatureToSplit(myDat)
0

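We can see that for the dataset myDat, splitting on feature 0 gives the largest information gain. Let's try to understand this through the two-dimensional split mentioned earlier (even though the features here are discrete).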

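In the plot of this dataset, splitting on the x coordinate (a split line perpendicular to the x axis) should give a larger information gain than splitting on the y coordinate (a split line perpendicular to the y axis); that is, the subsets produced by the split on x are less disordered. One can imagine that as points like (1, 1, 'yes') and (0, 1, 'no') become more and more numerous (the dataset is large enough), a point like (1, 0, 'no') can be regarded as noise.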
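The comparison between the two features can be made concrete by computing both gains. This is a self-contained sketch; `entropy` and `info_gain` are names chosen for this example, and the dataset is the one used in the test above.

```python
from collections import Counter
from math import log2


def entropy(rows):
    """Entropy (in bits) of the class labels in the last column."""
    counts = Counter(row[-1] for row in rows)
    return -sum((n / len(rows)) * log2(n / len(rows)) for n in counts.values())


def info_gain(rows, axis):
    """Entropy before the split minus the weighted entropy of the subsets."""
    after = 0.0
    for value in set(row[axis] for row in rows):
        subset = [row for row in rows if row[axis] == value]
        after += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - after


my_dat = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
for axis in range(2):
    print(axis, round(info_gain(my_dat, axis), 3))
# 0 0.42
# 1 0.171
```

Feature 0 leaves one pure subset and one subset with a single odd record, while feature 1 leaves a subset that is half 'yes' and half 'no', which is why feature 0 wins.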

(3) Recursively building the tree

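In general, a recursive algorithm consists of a recursive step and a termination condition:

a. Recursive step: Once split, the data will traverse down the branches of the tree to another node.

b. Termination condition: you run out of attributes on which to split or all the instances in a branch are the same class. That is, there are two termination conditions: either the subset belongs entirely to one class, or all features have already been used for splitting.

The two termination cases are handled differently. When the whole subset belongs to one class, we simply create a leaf node. When the features are exhausted but the class labels in the subset are still not all the same, we decide the class label of that leaf node by majority vote.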
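The majority vote for the second termination case can be sketched in a few lines. The name `majority_label` and the use of `collections.Counter` are assumptions for this example, since the post does not show its own version.

```python
from collections import Counter


def majority_label(class_list):
    """Return the most frequent class label in class_list."""
    return Counter(class_list).most_common(1)[0][0]


print(majority_label(['yes', 'no', 'no']))  # no
```

A leaf created this way can misclassify some of its own training records, which is unavoidable once no features are left to separate them.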

Pseudocode:

    Check if every item in the dataset is in the same class:
        If so return the class label
        Else
            find the best feature to split the data
            split the dataset
            create a branch node
                for each split
                    call createBranch and add the result to the branch node
            return branch node

If you don’t meet the stopping conditions, then you choose the best feature. And next, you create your tree.

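Then the question is how to store the tree? ==> Here we use the Python dictionary to store the tree.

How a dictionary stores the tree structure: a lot of nested dictionaries. The structure of a decision tree is uniquely determined by its nodes, the branches leaving each node, and the nodes those branches connect to. Our decision tree is obtained by recursion, so its structure can naturally be stored in a recursive structure. We define the tree as something like {bestFeatLabel: {value1: {}, value2: {}, value3: {}, ...}}, where each value points to a node.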
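A small hand-written example may make the nested-dictionary layout easier to picture. The feature labels 'f0' and 'f1' and the helpers `count_leaves` and `tree_depth` are hypothetical, introduced only for illustration.

```python
# A hypothetical tree: split on 'f0' first, then on 'f1' when f0 == 1.
tree = {'f0': {0: 'no', 1: {'f1': {0: 'no', 1: 'yes'}}}}


def count_leaves(node):
    """A non-dict node is a leaf; a dict node is {label: {value: child}}."""
    if not isinstance(node, dict):
        return 1
    label = next(iter(node))
    return sum(count_leaves(child) for child in node[label].values())


def tree_depth(node):
    """Number of decision nodes on the longest path from the root to a leaf."""
    if not isinstance(node, dict):
        return 0
    label = next(iter(node))
    return 1 + max(tree_depth(child) for child in node[label].values())


print(count_leaves(tree))  # 3
print(tree_depth(tree))    # 2
```

Every dictionary has exactly one key (the feature tested at that node), and its value maps each feature value either to a class label or to another dictionary of the same shape.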

The implementation is as follows:

    from collections import Counter

    def majorityCnt(classList):
        # not shown in the original post; a minimal majority vote, assumed here
        return Counter(classList).most_common(1)[0][0]

    def createTree(dataSet, labels):
        '''
        @abstract: Tree-building code
        @input: dataSet ==> type(a list of lists)
                labels ==> the names of the features (type: list)
        @output: myTree ==> type(dictionary)
        '''
        classList = [example[-1] for example in dataSet]
        # stop if all the classes in classList are the same
        if classList.count(classList[0]) == len(classList):
            return classList[0]
        # stop if there are no features left, only the class column
        if len(dataSet[0]) == 1:
            return majorityCnt(classList)
        # If you don't meet the stopping conditions, then you choose the best feature.
        # And next, you create your tree.
        bestFeat = chooseBestFeatureToSplit(dataSet)
        bestFeatLabel = labels[bestFeat]
        # use a Python dictionary to store the tree
        myTree = {bestFeatLabel: {}}
        # labels of the remaining features; building a new list instead of
        # del(labels[bestFeat]) leaves the caller's list unchanged
        subLabels = labels[:bestFeat] + labels[bestFeat+1:]
        featValues = [example[bestFeat] for example in dataSet]
        # use set() to remove duplicate values from the list
        uniqueVals = set(featValues)
        # iterate over all the unique values of the chosen feature
        # and recursively call createTree() for each split of the dataset
        for value in uniqueVals:
            myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
        return myTree
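To see the whole procedure produce a tree, here is a compact self-contained sketch of the same idea run on the sample dataset. It is not the post's code: the function names, the feature labels 'f0' and 'f1', and the use of `max` to pick the feature are all assumptions made for this example.

```python
from collections import Counter
from math import log2


def entropy(rows):
    counts = Counter(row[-1] for row in rows)
    return -sum((n / len(rows)) * log2(n / len(rows)) for n in counts.values())


def split(rows, axis, value):
    return [row[:axis] + row[axis + 1:] for row in rows if row[axis] == value]


def best_feature(rows):
    """Index of the feature with the largest information gain."""
    def gain(axis):
        after = 0.0
        for value in set(row[axis] for row in rows):
            subset = split(rows, axis, value)
            after += len(subset) / len(rows) * entropy(subset)
        return entropy(rows) - after
    return max(range(len(rows[0]) - 1), key=gain)


def build_tree(rows, labels):
    classes = [row[-1] for row in rows]
    if len(set(classes)) == 1:       # pure subset: create a leaf
        return classes[0]
    if len(rows[0]) == 1:            # no features left: majority vote
        return Counter(classes).most_common(1)[0][0]
    axis = best_feature(rows)
    rest = labels[:axis] + labels[axis + 1:]
    # sorted() only makes the order of the branches deterministic
    values = sorted(set(row[axis] for row in rows))
    return {labels[axis]: {v: build_tree(split(rows, axis, v), rest) for v in values}}


my_dat = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print(build_tree(my_dat, ['f0', 'f1']))
# {'f0': {0: 'no', 1: {'f1': {0: 'no', 1: 'yes'}}}}
```

The first split uses feature 0, as found in the test example; the branch with f0 == 0 is already pure, and the other branch needs one more split on the remaining feature.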

4. Plotting trees in Python with Matplotlib annotations

Now that you’ve properly constructed the tree, you need to display it so that humans can properly understand the information.

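The programming details in this part are rather tedious, so I'll skip them for now. This part of the book uses matplotlib to draw the tree structure as a diagram.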
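As a lightweight stand-in for the matplotlib plot (a plain-text rendering, which is a different technique from the book's annotations), a nested-dictionary tree can be turned into indented lines. The function `render_tree` and the example tree with labels 'f0' and 'f1' are assumptions for illustration.

```python
def render_tree(node, indent=0):
    """Return the lines of an indented text rendering of a nested-dict tree."""
    pad = '  ' * indent
    if not isinstance(node, dict):
        return [pad + '-> ' + str(node)]     # leaf: the class label
    label = next(iter(node))
    lines = []
    for value, child in node[label].items():
        lines.append(pad + '{} = {}'.format(label, value))
        lines.extend(render_tree(child, indent + 1))
    return lines


tree = {'f0': {0: 'no', 1: {'f1': {0: 'no', 1: 'yes'}}}}
print('\n'.join(render_tree(tree)))
# f0 = 0
#   -> no
# f0 = 1
#   f1 = 0
#     -> no
#   f1 = 1
#     -> yes
```

This is enough to check by eye that a small tree matches expectations, which is the purpose of the analysis step.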

5. Testing and storing the classifier

(1) Test: using the tree for classification

Classifying with the decision tree: The code will then take the data under test and compare it against the values in the decision tree. It will do this recursively until it hits a leaf node; then it will stop because it has arrived at a conclusion.

Since the tree was built recursively, classification can be done recursively too! The code is as follows:

    def classify(inputTree, featLabels, testVec):
        '''
        @abstract: classification function for an existing decision tree
        @input: inputTree ==> type(dictionary)
                featLabels ==> type(list)
                testVec ==> type(list)
        @output: classLabel ==> the predicted class label
        '''
        # next(iter(...)) works in Python 2 and 3; inputTree.keys()[0] fails in Python 3
        firstStr = next(iter(inputTree))
        secondDict = inputTree[firstStr]
        featIndex = featLabels.index(firstStr)
        for key in secondDict.keys():
            if testVec[featIndex] == key:
                if isinstance(secondDict[key], dict):
                    classLabel = classify(secondDict[key], featLabels, testVec)
                else:
                    classLabel = secondDict[key]
        # note: classLabel is never assigned if testVec holds a value
        # that does not appear in the tree, which raises an error here
        return classLabel
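A variant that walks the tree with a loop and returns a default for feature values the tree has never seen can be sketched as follows. The name `classify_or_default`, the `default` parameter, and the example tree with labels 'f0' and 'f1' are assumptions for this sketch, not part of the post's code.

```python
def classify_or_default(tree, feat_labels, test_vec, default=None):
    """Follow test_vec down the tree; return default on an unseen value."""
    node = tree
    while isinstance(node, dict):
        label = next(iter(node))                    # feature tested at this node
        value = test_vec[feat_labels.index(label)]
        if value not in node[label]:
            return default
        node = node[label][value]
    return node                                     # a leaf holds the class label


tree = {'f0': {0: 'no', 1: {'f1': {0: 'no', 1: 'yes'}}}}
labels = ['f0', 'f1']
print(classify_or_default(tree, labels, [1, 1]))                     # yes
print(classify_or_default(tree, labels, [1, 0]))                     # no
print(classify_or_default(tree, labels, [2, 0], default='unknown'))  # unknown
```

The list of feature labels is what connects a feature name stored in the tree to a position in the test vector, so it must list the features in the same order as the columns of the data.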

(2) Use: persisting the decision tree

Now that you’ve built a classifier, it would be nice to be able to store this so you don’t have to rebuild the tree every time you want to do classification.

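When the dataset is large enough, building a decision tree can take a long time. Rebuilding the tree every time we classify would waste a great deal of time.

Solution: use the Python module pickle to serialize objects. A serialized object can be saved to disk and read back when needed. Most Python objects, including dictionaries like our tree, can be serialized. ==> persistence

Note: Python's pickle module implements basic data serialization and deserialization. Through pickle's serialization we can save the objects of a running program to a file for permanent storage; through deserialization we can recreate from the file the objects saved by a previous run. For details, see http://www.cnblogs.com/pzxbc/archive/2012/03/18/2404715.html

Once the constructed decision tree is stored, later classification can use it directly without learning it again. This is one advantage of decision trees over kNN: kNN is an instance-based method with no training phase, so there is no learned model to store.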

The code for storing and loading the decision tree is as follows:

    # methods for persisting the decision tree with pickle
    import pickle

    def storeTree(inputTree, filename):
        # pickle data is binary, so open the file in 'wb' mode (required in Python 3)
        with open(filename, 'wb') as fw:
            pickle.dump(inputTree, fw)

    def grabTree(filename):
        with open(filename, 'rb') as fr:
            return pickle.load(fr)
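A round trip through pickle can be checked with a short self-contained sketch. The example tree, the file name `classifier.pkl`, and the temporary directory are assumptions for illustration.

```python
import os
import pickle
import tempfile

# Hypothetical tree with illustrative feature labels.
tree = {'f0': {0: 'no', 1: {'f1': {0: 'no', 1: 'yes'}}}}

# Write the tree to a file in a fresh temporary directory, then read it back.
path = os.path.join(tempfile.mkdtemp(), 'classifier.pkl')
with open(path, 'wb') as f:
    pickle.dump(tree, f)
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == tree)  # True
```

Loading a pickle file can execute arbitrary code, so only unpickle files that come from a source you trust.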

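Note: decision trees come in different algorithms such as ID3, C4.5, and CART; here we implemented only the most basic one, ID3. The differences between these algorithms, and the principles of ID3 in more detail, will be discussed later.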

0 0