Decision Trees

Source: Internet. Editor: 程序博客网. Time: 2024/06/10 03:53

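- 1 Basic concepts
  - 1 What is a decision tree
  - 2 Definition of information
  - 3 Entropy (Shannon entropy)
  - 4 Information gain
- 2 Characteristics of decision trees
  - Advantages
  - Disadvantages
  - Applicable data types
- 3 Hands-on machine learning code
- 4 lenses.txt data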

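1. Basic concepts

1.1 What is a decision tree

In decision analysis, a decision tree is a method that starts from the known probabilities of various outcomes and builds a tree to find the probability that the expected net present value is greater than or equal to zero, in order to assess project risk and judge feasibility. It is a graphical method that applies probability analysis in an intuitive way. Because the decision branches, when drawn, look like the limbs of a tree, it is called a decision tree. In machine learning, a decision tree is a predictive model that represents a mapping between object attributes and object values. Entropy measures how disordered a system is. The tree-building algorithms ID3, C4.5 and C5.0 use entropy, a measure based on the concept of entropy in information theory.
A decision tree is a tree structure in which each internal node represents a test on an attribute, each branch represents one outcome of that test, and each leaf node represents a class.
Classification trees (decision trees) are a very common classification method. They are a form of supervised learning: given a set of samples, each with a set of attributes and a class that is determined in advance, learning produces a classifier that is then used to classify new objects.
The algorithm presented here is ID3.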
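To make the node, branch and leaf structure described above concrete, here is a minimal sketch that stores a tree as nested dictionaries and follows the branches for one sample. The tree literal is the first tree used in the code of section 3. The helper name `walk` and the dictionary form of the sample are assumptions made for this illustration.

```python
# A small tree: each dict key is the attribute tested at an internal node,
# each key one level down is a test outcome (branch), and each string is a leaf (class).
tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

def walk(node, sample):
    """Follow branches until a leaf (a non-dict value) is reached."""
    while isinstance(node, dict):
        attribute = next(iter(node))               # attribute tested at this node
        node = node[attribute][sample[attribute]]  # take the matching branch
    return node

print(walk(tree, {'no surfacing': 1, 'flippers': 1}))  # reaches the leaf 'yes'
print(walk(tree, {'no surfacing': 0, 'flippers': 1}))  # reaches the leaf 'no'
```

A sample whose attribute value has no matching branch raises a `KeyError` in this sketch.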

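1.2 Definition of information

If the items to be classified may fall into several classes, the information of the symbol x_i is defined as:

$$l(x_i) = -\log_2 p(x_i)$$

where p(x_i) is the probability of choosing that class.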
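As a quick numeric check of this definition, the sketch below evaluates the information of two outcomes in bits. The function name `self_information` is an assumption chosen for this example.

```python
from math import log2

def self_information(p):
    """Information, in bits, of an outcome with probability p (0 < p <= 1)."""
    return -log2(p)

print(self_information(0.5))    # an outcome with probability 1/2 carries 1 bit
print(self_information(0.125))  # an outcome with probability 1/8 carries 3 bits
```

The rarer the outcome, the more information it carries.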

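1.3 Entropy (Shannon entropy)

Entropy is defined as the expected value of the information:

$$H = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$$

where n is the number of classes.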
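The formula can be evaluated directly on a list of class labels. The following is a minimal sketch, assuming the five labels of the toy dataset from section 3 (two 'yes', three 'no'). The function name `entropy` is an assumption for this example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    # p(x_i) is estimated as the fraction of samples that carry label x_i.
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

labels = ['yes', 'yes', 'no', 'no', 'no']
print(round(entropy(labels), 3))
```

For these labels the result is about 0.971 bits. A list with a single class has entropy 0, and two equally frequent classes give exactly 1 bit.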

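1.4 Information gain

The change in information before and after a dataset is split is called the information gain: the entropy of the original dataset minus the weighted entropy of the subsets produced by the split.
Compute the information gain of splitting the dataset on each feature; the feature with the highest information gain is the best choice.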
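The selection rule above can be worked through on the toy dataset from section 3. This is a minimal sketch, not the implementation from section 3: the names `entropy` and `info_gain` are assumptions for this example, and the last column of each row is taken to be the class label.

```python
from collections import Counter
from math import log2

def entropy(rows):
    """Shannon entropy of the class labels (last column) of the rows."""
    n = len(rows)
    counts = Counter(row[-1] for row in rows)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def info_gain(rows, axis):
    """Entropy before the split minus the weighted entropy after it."""
    subsets = {}
    for row in rows:
        subsets.setdefault(row[axis], []).append(row)  # group rows by feature value
    weighted = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy(rows) - weighted

rows = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
for axis in (0, 1):
    print(axis, round(info_gain(rows, axis), 3))
```

On this data the first feature has a gain of about 0.420 and the second about 0.171, so ID3 would split on the first feature.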

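2. Characteristics of decision trees

Advantages:

Low computational complexity, output that is easy to interpret, insensitivity to missing intermediate values, and the ability to handle irrelevant features.

Disadvantages:

Prone to overfitting.

Applicable data types:

Numeric and nominal.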
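The ID3 code in section 3 creates one branch per distinct feature value, so a continuous numeric feature is usually mapped to a few nominal values first. The sketch below shows one simple way to do that with fixed cut points. The function name `to_nominal`, the cut points and the bin names are all hypothetical values chosen for illustration.

```python
from bisect import bisect_right

def to_nominal(value, cut_points, names):
    """Map a number to a bin name; cut_points must be sorted,
    and names must have one more entry than cut_points."""
    return names[bisect_right(cut_points, value)]

cut_points = [30, 60]            # hypothetical bin boundaries
names = ['low', 'mid', 'high']   # hypothetical nominal values
print([to_nominal(v, cut_points, names) for v in (12, 30, 75)])
```

A value equal to a cut point falls into the upper bin in this sketch.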

3. Hands-on machine learning code

# encoding: utf-8
from math import log
import operator

import treePlotter  # separate plotting module, not listed in this post


def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    # change to discrete values
    return dataSet, labels


# Compute the Shannon entropy of the class labels (last column)
def calcShannonEnt(dataSet):
    numEntries = len(dataSet)  # number of samples
    labelCounts = {}  # number of samples in each class
    for featVec in dataSet:
        curlabel = featVec[-1]
        if curlabel not in labelCounts:
            labelCounts[curlabel] = 0
        labelCounts[curlabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)  # accumulate the entropy
    return shannonEnt


# Split the dataset on a feature: axis is the column index of the feature,
# value is the feature value whose rows are kept
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            # drop the column at index axis
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis + 1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet


# Choose the best feature to split the dataset on
def choseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1  # number of features
    baseEntroy = calcShannonEnt(dataSet)  # entropy before splitting
    bestFeautre = -1  # index of the best feature so far
    bestEntroy = 0.0  # best information gain so far
    for i in range(numFeatures):  # try every feature
        featList = [example[i] for example in dataSet]  # values of this feature in all samples
        uniqueVals = set(featList)  # distinct values of this feature
        newEntory = 0.0
        for value in uniqueVals:  # one subset per distinct value
            subDataSet = splitDataSet(dataSet, i, value)
            # My understanding: each subset is only a part of the whole dataset
            # (the split produces one subset per value in uniqueVals), but its
            # entropy would otherwise count as much as that of the whole.
            # Multiplying by the fraction of samples in the subset scales it
            # down to its share of the whole.
            prob = len(subDataSet) / float(len(dataSet))
            newEntory += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntroy - newEntory  # information gain
        if infoGain > bestEntroy:  # keep the best split
            bestEntroy = infoGain
            bestFeautre = i
    return bestFeautre  # index of the best feature


# Majority vote: return the class that occurs most often
def majorCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    # sort the (class, count) pairs by count, descending
    sc = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sc[0][0]


# Build the decision tree recursively
def createTree(dataSet, label):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        # 1. all samples share one class: stop recursing
        return classList[0]
    if len(dataSet[0]) == 1:
        # no features left, only the class column: return the majority class
        return majorCnt(classList)
    bestFeat = choseBestFeatureToSplit(dataSet)  # best feature to split on
    bestLabel = label[bestFeat]  # its name in the label list
    myTree = {bestLabel: {}}
    # build the remaining labels as a new list so the caller's list is not modified
    subLabels = label[:bestFeat] + label[bestFeat + 1:]
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)  # distinct values of the best feature
    for value in uniqueVals:  # one subtree per value
        myTree[bestLabel][value] = createTree(
            splitDataSet(dataSet, bestFeat, value), subLabels[:])
    return myTree


# Classifier (recursive)
def classify(inputTree, featLable, testVec):
    fiststr = next(iter(inputTree))  # the root node of this (sub)tree
    secondstr = inputTree[fiststr]  # all children of that node
    # Position of the attribute in the label list, i.e. which column of the
    # test vector holds it (for example, whether 'no surfacing' is first or second)
    index = featLable.index(fiststr)
    classable = None  # stays None if no branch matches the test value
    for key in secondstr:  # look for the child that matches the test value
        if testVec[index] == key:
            if isinstance(secondstr[key], dict):  # a dict child is a subtree: recurse
                classable = classify(secondstr[key], featLable, testVec)
            else:
                classable = secondstr[key]
    return classable


def retrieveTree(i):
    listOfTrees = [{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}},
                   {'no surfacing': {0: 'no', 1: {'flippers': {0: {'head': {0: 'no', 1: 'yes'}}, 1: 'no'}}}}
                   ]
    return listOfTrees[i]


if __name__ == '__main__':
    a, b = createDataSet()
    # print(createTree(a, b))
    # mytree = retrieveTree(0)
    # print(classify(mytree, b, [0, 1]))
    with open('lenses.txt') as fr:
        lenses = [inst.strip().split('\t') for inst in fr if inst.strip()]
    lensesLabel = ['age', 'prescipt', 'astigmatic', 'tearRate']
    lensesTree = createTree(lenses, lensesLabel)
    print(lensesTree)
    treePlotter.createPlot(lensesTree)

4. lenses.txt data

Fields are separated by tabs in the file; the last field is the class label.

young       myope  no   reduced  no lenses
young       myope  no   normal   soft
young       myope  yes  reduced  no lenses
young       myope  yes  normal   hard
young       hyper  no   reduced  no lenses
young       hyper  no   normal   soft
young       hyper  yes  reduced  no lenses
young       hyper  yes  normal   hard
pre         myope  no   reduced  no lenses
pre         myope  no   normal   soft
pre         myope  yes  reduced  no lenses
pre         myope  yes  normal   hard
pre         hyper  no   reduced  no lenses
pre         hyper  no   normal   soft
pre         hyper  yes  reduced  no lenses
pre         hyper  yes  normal   no lenses
presbyopic  myope  no   reduced  no lenses
presbyopic  myope  no   normal   no lenses
presbyopic  myope  yes  reduced  no lenses
presbyopic  myope  yes  normal   hard
presbyopic  hyper  no   reduced  no lenses
presbyopic  hyper  no   normal   soft
presbyopic  hyper  yes  reduced  no lenses
presbyopic  hyper  yes  normal   no lenses
