Machine Learning Algorithms: Applying a Decision Tree


1. Create the data set

def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels
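Each sample is a list of two binary features followed by a class label, and `labels` holds the two feature names. A quick look at what the function returns:

dataSet, labels = createDataSet()
print(dataSet)  # [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print(labels)   # ['no surfacing', 'flippers']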

2. Compute the Shannon entropy of the data set

# Compute the Shannon entropy in two steps: first count the frequency of
# each class label, then apply the entropy formula to those frequencies.
from math import log

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for feaVec in dataSet:
        currentLabel = feaVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt
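Shannon entropy is H = -sum_k p_k * log2(p_k), where p_k is the fraction of samples in class k. With 2 'yes' and 3 'no' samples, H = -(2/5)*log2(2/5) - (3/5)*log2(3/5) ≈ 0.971, which the function reproduces:

dataSet, labels = createDataSet()
print(calcShannonEnt(dataSet))  # 0.9709505944546686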

3. Split the data set on a given feature

# Split the data set: collect every sample with featVec[axis] == value into a
# new list, dropping the axis-th feature itself since it has already been used.
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
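For example, splitting the sample data on feature 0 with value 1 keeps the three samples whose first feature equals 1 and removes that column:

dataSet, labels = createDataSet()
print(splitDataSet(dataSet, 0, 1))  # [[1, 'yes'], [1, 'yes'], [0, 'no']]
print(splitDataSet(dataSet, 0, 0))  # [[1, 'no'], [1, 'no']]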

4. Choose the best feature to split on

# Choose the best feature to split on: try splitting on every feature and keep
# the one with the highest information gain. A set is used to collect the
# unique values of each feature.
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1  # the last column of each sample is the class label yes/no
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
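On the sample data the information gain works out to about 0.420 for feature 0 ('no surfacing') and about 0.171 for feature 1 ('flippers'), so the best split is feature 0:

dataSet, labels = createDataSet()
print(chooseBestFeatureToSplit(dataSet))  # 0, i.e. split on 'no surfacing'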

5. Build the decision tree

# The tree is built by consuming one feature per level, so the features can
# run out before every subset is pure. In that case the node's class is
# decided by majority vote.
def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount.keys():
            classCount[vote] = 0
        classCount[vote] += 1
    # Return the label with the most votes.
    return max(classCount, key=classCount.get)

# Build the decision tree recursively.
def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):  # all samples share one class: stop splitting
        return classList[0]
    if len(dataSet[0]) == 1:  # all features have been used up
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]  # copy so the recursive call does not modify the original list
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
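The finished tree is just nested dictionaries, keyed alternately by feature name and feature value. As a minimal sketch of how such a tree can be used, here is a small classifier (the `classify` helper below is my own illustration, not part of the original code):

def classify(inputTree, featLabels, testVec):
    # Look up the feature tested at this node, follow the branch that
    # matches the sample's value, and recurse until a leaf label is reached.
    firstStr = list(inputTree.keys())[0]
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)
    valueOfFeat = secondDict[testVec[featIndex]]
    if isinstance(valueOfFeat, dict):
        return classify(valueOfFeat, featLabels, testVec)
    return valueOfFeat

# Usage: note that createTree deletes entries from the label list it is
# given, so pass it a copy and keep the original for classification.
# dataSet, labels = createDataSet()
# myTree = createTree(dataSet, labels[:])
# print(classify(myTree, labels, [1, 0]))  # 'no'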
import time

def main():
    data, label = createDataSet()
    t1 = time.perf_counter()  # time.clock() was removed in Python 3.8
    myTree = createTree(data, label)
    t2 = time.perf_counter()
    print(myTree)
    print('execute for ', t2 - t1)

if __name__ == '__main__':
    main()

Result:
[Output screenshot: the tree {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}} plus the execution time]