Statistical Learning Methods, Exercise 5.1: A C4.5 Implementation

Source: Internet · Site: 程序博客网 · Posted: 2024/06/06 21:41

Problem statement: given the training data set, generate a decision tree using the information gain ratio (the C4.5 algorithm).


The information gain ratio criterion (C4.5) is a refinement of the ID3 algorithm: ID3 splits on raw information gain, which is biased toward features with many distinct values, and the ratio corrects for that.

Definition of the information gain ratio:

    g_R(D, A) = g(D, A) / H_A(D)

where H_A(D) = -sum_i (|D_i|/|D|) * log2(|D_i|/|D|) is the entropy of the data set D with respect to the values of feature A (D_i being the subset of D on which A takes its i-th value).

Supplement: the information gain itself is computed as

    g(D, A) = H(D) - H(D|A)

i.e. the empirical entropy of D minus the empirical conditional entropy of D given A.

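To make the definitions concrete, here is a minimal sketch (function names are my own) that computes g(D, A), H_A(D), and g_R(D, A) for the 'house' feature of the 15-sample loan data used below:

```python
from math import log2

def entropy(values):
    """Empirical entropy of a list of discrete values."""
    n = len(values)
    return -sum((values.count(v) / n) * log2(values.count(v) / n)
                for v in set(values))

# 'house' column and class column of the 15-sample loan data
house = [0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0]
cls   = [0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0]

n = len(cls)
# H(D|A): class entropy within each feature value, weighted by frequency
cond = sum((house.count(v) / n) *
           entropy([c for h, c in zip(house, cls) if h == v])
           for v in set(house))
gain = entropy(cls) - cond          # g(D, A)   ~ 0.420
split_info = entropy(house)         # H_A(D)    ~ 0.971
gain_ratio = gain / split_info      # g_R(D, A) ~ 0.433
print(round(gain, 3), round(split_info, 3), round(gain_ratio, 3))
```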
Code implementation (adapted from Machine Learning in Action; save as tree.py):

from math import log2
import operator


def createDataSet():
    # Loan data from exercise 5.1: [age, job, house, credit, class]
    dataSet = [[1, 0, 0, 1, 0],
               [1, 0, 0, 2, 0],
               [1, 1, 0, 2, 1],
               [1, 1, 1, 1, 1],
               [1, 0, 0, 1, 0],
               [2, 0, 0, 1, 0],
               [2, 0, 0, 2, 0],
               [2, 1, 1, 2, 1],
               [2, 0, 1, 3, 1],
               [2, 0, 1, 3, 1],
               [3, 0, 1, 3, 1],
               [3, 0, 1, 2, 1],
               [3, 1, 0, 2, 1],
               [3, 1, 0, 3, 1],
               [3, 0, 0, 1, 0]]
    labels = ['age', 'job', 'house', 'credit']
    return dataSet, labels


def calcShannonEnt(dataSet):
    # Empirical entropy H(D) of the class label (last column)
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]
        labelCounts[currentLabel] = labelCounts.get(currentLabel, 0) + 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = labelCounts[key] / numEntries
        shannonEnt -= prob * log2(prob)
    return shannonEnt


def splitDataSet(dataSet, axis, value):
    # Rows where feature `axis` equals `value`, with that column removed
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            retDataSet.append(featVec[:axis] + featVec[axis + 1:])
    return retDataSet


def chooseBestFeatureToSplit(dataSet):
    # C4.5: pick the feature with the largest information gain ratio
    # g_R(D, A) = g(D, A) / H_A(D)
    numFeatures = len(dataSet[0]) - 1
    baseEntropy = calcShannonEnt(dataSet)
    bestGainRatio = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        splitInfo = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / len(dataSet)
            newEntropy += prob * calcShannonEnt(subDataSet)  # H(D|A)
            splitInfo -= prob * log2(prob)                   # H_A(D)
        if splitInfo == 0.0:
            continue  # feature takes a single value; cannot split on it
        infoGain = baseEntropy - newEntropy                  # g(D, A)
        gainRatio = infoGain / splitInfo                     # g_R(D, A)
        if gainRatio > bestGainRatio:
            bestGainRatio = gainRatio
            bestFeature = i
    return bestFeature


def majorityCnt(classList):
    # Most frequent class label, used when no features remain
    classCount = {}
    for vote in classList:
        classCount[vote] = classCount.get(vote, 0) + 1
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]


def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]            # all samples share one class: leaf
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)  # no features left: majority vote
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del labels[bestFeat]
    featValues = [example[bestFeat] for example in dataSet]
    for value in set(featValues):
        subLabels = labels[:]          # copy so recursion does not clobber it
        myTree[bestFeatLabel][value] = createTree(
            splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
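The code above only builds the tree; to actually predict with it, a lookup helper is needed. A sketch of such a helper (the `classify` name follows the Machine Learning in Action convention, but this version is my own and assumes every test value appears in the tree):

```python
def classify(tree, featLabels, testVec):
    """Descend the nested-dict tree and return the predicted class.
    featLabels is the full feature-name list, e.g.
    ['age', 'job', 'house', 'credit']; testVec holds one value
    per feature, in the same order."""
    if not isinstance(tree, dict):    # leaf: a class label
        return tree
    feat = next(iter(tree))           # feature tested at this node
    value = testVec[featLabels.index(feat)]
    return classify(tree[feat][value], featLabels, testVec)

# Example on the tree this data produces (house at the root, then job)
myTree = {'house': {0: {'job': {0: 0, 1: 1}}, 1: 1}}
labels = ['age', 'job', 'house', 'credit']
print(classify(myTree, labels, [1, 0, 1, 1]))   # has a house -> class 1
print(classify(myTree, labels, [2, 0, 0, 2]))   # no house, no job -> class 0
```

Note that createTree mutates the labels list it is given (del labels[bestFeat]), so classify should be passed a fresh copy of the original label list.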

Usage:

import tree

mydat, mylab = tree.createDataSet()
mytree = tree.createTree(mydat, mylab)
print(mytree)
On this data the result is the same tree the ID3 algorithm produces: both split on 'house' at the root and on 'job' below it, giving {'house': {0: {'job': {0: 0, 1: 1}}, 1: 1}}.
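The agreement at the root can be checked numerically: both criteria select feature 2 ('house'). A standalone sketch, independent of tree.py (names are my own):

```python
from math import log2

# Loan data from the exercise: [age, job, house, credit, class]
data = [[1,0,0,1,0], [1,0,0,2,0], [1,1,0,2,1], [1,1,1,1,1], [1,0,0,1,0],
        [2,0,0,1,0], [2,0,0,2,0], [2,1,1,2,1], [2,0,1,3,1], [2,0,1,3,1],
        [3,0,1,3,1], [3,0,1,2,1], [3,1,0,2,1], [3,1,0,3,1], [3,0,0,1,0]]

def entropy(values):
    n = len(values)
    return -sum((values.count(v) / n) * log2(values.count(v) / n)
                for v in set(values))

def gain(data, i):
    """Information gain g(D, A_i) = H(D) - H(D | A_i)."""
    n = len(data)
    col = [row[i] for row in data]
    cond = sum((col.count(v) / n) *
               entropy([row[-1] for row in data if row[i] == v])
               for v in set(col))
    return entropy([row[-1] for row in data]) - cond

def gain_ratio(data, i):
    """Gain ratio g_R(D, A_i) = g(D, A_i) / H_{A_i}(D)."""
    return gain(data, i) / entropy([row[i] for row in data])

best_id3 = max(range(4), key=lambda i: gain(data, i))
best_c45 = max(range(4), key=lambda i: gain_ratio(data, i))
print(best_id3, best_c45)   # both pick feature 2 ('house')
```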

