The ID3 Decision Tree Algorithm
I. Steps for constructing a decision tree
1. Data preparation
Discretize the data (ID3 handles only nominal feature values)
2. Splitting the data
- Compute the Shannon entropy of the dataset
- Compute the information gain of each feature; the feature with the largest gain gives the best split (see the formulas after this list)
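For reference, the Shannon entropy of a dataset D with n classes is H(D) = -∑_{i=1}^{n} p(x_i) · log₂ p(x_i), where p(x_i) is the fraction of examples belonging to class i. The information gain of splitting on a feature A is Gain(D, A) = H(D) − ∑_{v ∈ Values(A)} (|D_v| / |D|) · H(D_v), where D_v is the subset of examples on which A takes value v. These are the two quantities the code below computes.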
II. Code modules
1. Computing the Shannon entropy:
from math import log

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:                 # count the occurrences of each class label
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:                 # H = -sum_{i=1}^{n} p(x_i) * log2(p(x_i))
        prob = float(labelCounts[key]) / numEntries   # p(x_i): probability of this class
        shannonEnt -= prob * log(prob, 2)   # log base 2
    return shannonEnt
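A quick sanity check, using the five-sample toy dataset from Machine Learning in Action (the dataset is an assumption for illustration; any list of feature vectors whose last element is the class label works):

myDat = [[1, 1, 'yes'],
         [1, 1, 'yes'],
         [1, 0, 'no'],
         [0, 1, 'no'],
         [0, 0, 'no']]

print(calcShannonEnt(myDat))   # ~0.9710 (2/5 'yes' vs 3/5 'no')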
2. Splitting the dataset (by a given feature value)
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]           # chop out the axis used for splitting
            reducedFeatVec.extend(featVec[axis + 1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
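Continuing with the toy dataset above, splitting on feature 0 keeps only the matching rows and removes that column:

print(splitDataSet(myDat, 0, 1))   # [[1, 'yes'], [1, 'yes'], [0, 'no']]
print(splitDataSet(myDat, 0, 0))   # [[1, 'no'], [0, 'no']]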
3. Choosing the best split for the dataset
- Enumerate each feature and its values
- Split the data on each value
- Pick the feature whose split yields the largest information gain
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1        # the last column holds the class labels
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):             # iterate over all features
        featList = [example[i] for example in dataSet]   # all values of this feature
        uniqueVals = set(featList)           # the unique values of this feature
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy  # information gain = reduction in entropy
        if infoGain > bestInfoGain:          # keep the best gain seen so far
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature                       # index of the best feature to split on
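On the toy dataset this returns index 0. (Both features happen to produce the same gain of about 0.42 here; since the comparison is strict, the lowest such index wins.)

print(chooseBestFeatureToSplit(myDat))   # 0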
4. Recursively building the decision tree
Recursion stops when either:
- All class labels in the current subset are identical: return that label directly
- All features have been consumed but the subset still contains more than one class: return the most frequent class (via the majorityCnt helper, sketched after the code)
def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]      # all class labels
    if classList.count(classList[0]) == len(classList):
        return classList[0]          # stop splitting when all classes are identical
    if len(dataSet[0]) == 1:         # no features left: return the majority class
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]        # copy labels so recursion doesn't mutate the caller's list
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
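createTree relies on a majorityCnt helper that the post does not list. A minimal sketch consistent with the exit condition described above (return the most frequent class label) could be:

import operator

def majorityCnt(classList):
    classCount = {}
    for vote in classList:                       # tally each class label
        classCount[vote] = classCount.get(vote, 0) + 1
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]                # most frequent label

Building the tree on the toy dataset (the feature names are illustrative, not part of the original post):

labels = ['no surfacing', 'flippers']
print(createTree(myDat, labels))
# {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

Note that createTree deletes entries from labels, so pass in a copy if you need the list afterwards.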