Shannon Entropy and Splitting a Data Set

Source: Internet · Editor: 程序博客网 · Date: 2024/05/19 14:34

The guiding principle for splitting a data set: make disordered data more ordered.

Information gain: the change in information before versus after splitting a data set. Once we know how to compute information gain, we can measure the gain obtained by splitting the data set on each feature; the feature that yields the highest information gain is the best one to split on.

Shannon entropy: measures how disordered (or ordered) a data set is. The more ordered the data set, the lower its Shannon entropy, and vice versa. It is computed as H = -Σ p(x_i) · log2 p(x_i), where p(x_i) is the proportion of samples belonging to class x_i.
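As a quick sanity check of that formula (my own minimal sketch, not from the book): a data set whose labels are all identical is perfectly ordered and has entropy 0, while an even 50/50 split of two labels is maximally disordered and has entropy 1 bit.

```python
from math import log2

def entropy(labels):
    # H = -sum(p_i * log2(p_i)) over the distinct labels
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

print(entropy(['yes', 'yes', 'yes', 'yes']))  # 0.0  (perfectly ordered)
print(entropy(['yes', 'yes', 'no', 'no']))    # 1.0  (maximally disordered)
```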

What follows are my own notes from working through the book, with annotations added in places:

# coding:utf-8
from math import log

def clcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        # each key in labelCounts maps to the count of that label, {'key': value}
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt

def creatDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels

# Split the data set on a given feature.
# Parameters: (data set to split, index of the splitting feature, feature value to keep)
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reduceFeatVec = featVec[:axis]
            # extend() appends each element of the second list to the first
            reduceFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reduceFeatVec)
    return retDataSet

# Choose the best feature to split the data set on
def chooseBestFeatureToSplit(dataSet):
    # the last element of each row is the class label ('yes'/'no'), so exclude it
    numFeatures = len(dataSet[0]) - 1          # 2
    # Shannon entropy of the whole, unsplit data set
    baseEntropy = clcShannonEnt(dataSet)       # about 0.97
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]   # e.g. [1, 1, 1, 0, 0]
        uniqueVals = set(featList)                       # e.g. {0, 1}
        newEntropy = 0.0
        for value in uniqueVals:
            # split dataSet on feature i at this value; returns a list of rows
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            # weighted entropy of the subset
            newEntropy += prob * clcShannonEnt(subDataSet)
        # the lower the post-split entropy, the higher the information gain
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    # return the index of the best splitting feature: 0
    return bestFeature

dataSet, labels = creatDataSet()
# dataSet.append([1, 1, 'maybe'])
print(dataSet)
# shannonEnt = clcShannonEnt(dataSet)
# retDataSet = splitDataSet(dataSet, 1, 1)
# print(retDataSet)
bestFeature = chooseBestFeatureToSplit(dataSet)
print(bestFeature)
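To see why feature 0 wins, the information gain can be checked by hand. This is my own verification sketch on the same five-sample data set, using a compact helper `H` (a hypothetical name, not from the book):

```python
from math import log2

dataSet = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

def H(rows):
    # Shannon entropy of the class labels (last column) in the given rows
    counts = {}
    for row in rows:
        counts[row[-1]] = counts.get(row[-1], 0) + 1
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())

base = H(dataSet)  # about 0.971 bits for 2 'yes' vs 3 'no'
gains = {}
for i in (0, 1):
    weighted = 0.0
    for v in {row[i] for row in dataSet}:
        subset = [row for row in dataSet if row[i] == v]
        weighted += len(subset) / len(dataSet) * H(subset)
    gains[i] = base - weighted
    print(f"feature {i}: info gain = {gains[i]:.3f}")
```

Splitting on feature 0 leaves one pure subset ({'no', 'no'}) and one mostly pure subset, for a gain of about 0.420 bits, while feature 1 only reaches about 0.171 bits, which matches `chooseBestFeatureToSplit` returning 0.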

