C4.5算法

来源:互联网 发布:json测试 编辑:程序博客网 时间:2024/05/21 14:56

C4.5算法是对ID3算法的改进,在决策树的生成过程中,使用了信息增益率作为属性选择的方法,其具体的算法步骤如下:

DAe
T
1.DCTCT
2.A=TDCT
3.ADAk
4.AkeTDCT
5.AkaiAk=aiDDiAkaiTT
6.iDiA{Ak}(1)(5)TiTi

通过上述算法步骤可以发现,ID3算法和C4.5算法步骤基本一致,唯一的变化就是,在第四步时将ID3算法中的信息增益,改成了C4.5算法中的信息增益率。其他步骤两种算法完全一致。

代码实现

# 加载数据def loadDataSet(dataPath):      dataset = []      with open(dataPath) as file:          lines = file.readlines()          for line in lines:              values = line.strip().split(' ')              dataset.append(values)      return dataset  # 根据属性值,分割数据集def splitDataSet(dataset, axis, value):      retDataSet = []      for featVec in dataset:          if featVec[axis] == value:              reducedFeatVec = featVec[:axis]              reducedFeatVec.extend(featVec[axis+1:])              retDataSet.append(reducedFeatVec)      return retDataSet  # 计算数据集的信息熵def calShannonEnt(dataset):      numEntries = len(dataset) * 1.0      labelCounts = dict()      for featVec in dataset:          currentLabel = featVec[-1]          if currentLabel not in labelCounts.keys():              labelCounts[currentLabel] = 0          labelCounts[currentLabel] += 1      shannonEnt = 0.0      for key in labelCounts:          prob = labelCounts[key] / numEntries          import math          shannonEnt -= prob * math.log(prob, 2)      return shannonEnt  # 计算分割后的数据集相较于原数据集的信息增益def InfoGain(dataset, axis, baseShannonEnt):      featList = [example[axis] for example in dataset]      uniqueVals = set(featList)      newShannonEnt = 0.0      numEntries = len(dataset) * 1.0      for value in uniqueVals:          subDataSet = splitDataSet(dataset, axis, value)          ent = calShannonEnt(subDataSet)          prob = len(subDataSet) / numEntries          newShannonEnt += prob * ent      infoGain = baseShannonEnt - newShannonEnt      return infoGain  # 计算属性的分裂信息值def SplitInfo(dataset, axis):      numEntries = len(dataset) * 1.0      labelsCount = dict()      ent = 0.0      for featVec in dataset:          value = featVec[axis]          if value not in labelsCount:              labelsCount[value] = 0          labelsCount[value] += 1      for key in labelsCount:          prob = labelsCount[key] / numEntries          import math          ent -= prob * math.log(prob, 2)      return ent  # 计算属性的信息增益率def GainRate(dataset, baseset, axis, baseShannonEnt):      infoGain = InfoGain(dataset, axis, baseShannonEnt)      splitInfo = SplitInfo(baseset, axis)      return infoGain / splitInfo# 根据信息增益率,来选择属性def ChooseBestFeatureByGainRate(dataset, baseset):      numFeature = len(dataset[0]) - 1      baseShannonEnt = calShannonEnt(dataset)      bestGainRate = 0.0      bestFeature = -1      for i in range(numFeature):          gainRate = GainRate(dataset, baseset, i, baseShannonEnt)          if gainRate > bestGainRate:              bestGainRate = gainRate              bestFeature = i      return bestFeature# 构建决策树def createTree(dataset, baseset, labels):      classList = [example[-1] for example in dataset]      if classList.count(classList[0]) == len(classList):          return classList[0]      if len(dataset[0]) == 1:          return majorityCnt(classList)      bestFeature = ChooseBestFeatureByGainRate(dataset, baseset)      bestFeatureLabel = labels[bestFeature]      myTree = {bestFeatureLabel:{}}      del(labels[bestFeature])      featValues = [example[bestFeature] for example in dataset]      uniqueVals = set(featValues)      for value in uniqueVals:          subLabels = labels[:]          myTree[bestFeatureLabel][value] = \         createTree(splitDataSet(dataset, bestFeature, value), baseset, subLabels)      return myTree
0 0
原创粉丝点击