Decision Tree Learning Notes
Contents
- Introduction to Decision Trees
- Building a Decision Tree
  - Information Gain
  - Splitting the Dataset
  - Building the Tree Recursively
- Demo: Predicting Contact Lens Type
- Summary
1 Introduction to Decision Trees
An important task of decision trees is to extract the knowledge contained in data. A decision tree can be applied to an unfamiliar dataset and distill from it a series of rules; this process of the machine creating rules from a dataset is the machine-learning process.
Advantages
- Computationally simple; the output is easy to interpret
- Insensitive to missing intermediate values, so it handles samples with missing attribute values well; can cope with irrelevant features
Disadvantages
- Prone to overfitting
Applicable data types
- Numeric and nominal
2 Building a Decision Tree
The first question to answer when constructing a decision tree is: on the current dataset, which feature plays the decisive role in splitting the data into classes?
2.1 Information Gain
The guiding principle when splitting data: make disordered data as ordered as possible. In information theory, entropy quantifies the amount of information. The Shannon entropy of a dataset with n classes is

H = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)

where p(x_i) is the probability that a sample belongs to class x_i. The following function computes the Shannon entropy of a given dataset:
```python
from math import log

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    # build a dictionary counting every possible class label
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    # apply the entropy formula
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)  # log base 2
    return shannonEnt
```
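As a quick sanity check, here is a self-contained sketch of the same function applied to a hypothetical toy dataset (two binary features plus a yes/no class label; the data values are illustrative, not from the original text). With 2 'yes' and 3 'no' records, the entropy should be about 0.971:

```python
from math import log

def calcShannonEnt(dataSet):
    # count each class label (the last column of every record)
    labelCounts = {}
    for featVec in dataSet:
        labelCounts[featVec[-1]] = labelCounts.get(featVec[-1], 0) + 1
    # H = -sum(p * log2(p)) over all classes
    ent = 0.0
    for count in labelCounts.values():
        prob = count / float(len(dataSet))
        ent -= prob * log(prob, 2)
    return ent

dataSet = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'],
           [0, 1, 'no'],  [0, 1, 'no']]
print(round(calcShannonEnt(dataSet), 4))  # 0.971
```

The hand calculation agrees: -(2/5)·log2(2/5) - (3/5)·log2(3/5) ≈ 0.971.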
2.2 Splitting the Dataset
- Split the dataset on a given feature
```python
def splitDataSet(dataSet, axis, value):
    # dataSet: dataset to split; axis: feature to split on;
    # value: feature value to keep
    # create a new list so the original dataset is not modified
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            # extract the record, dropping the split feature
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
```
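On the same hypothetical toy dataset used above, splitting on feature 0 behaves as follows (a self-contained sketch repeating the function so it can be run directly):

```python
def splitDataSet(dataSet, axis, value):
    # keep records whose feature `axis` equals `value`,
    # with that column removed from each record
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reduced = featVec[:axis] + featVec[axis+1:]
            retDataSet.append(reduced)
    return retDataSet

dataSet = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'],
           [0, 1, 'no'],  [0, 1, 'no']]
print(splitDataSet(dataSet, 0, 1))  # [[1, 'yes'], [1, 'yes'], [0, 'no']]
print(splitDataSet(dataSet, 0, 0))  # [[1, 'no'], [1, 'no']]
```

Note that the split column itself disappears from the returned records, which is what lets `createTree` recurse on progressively narrower datasets.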
- Choose the best way to split the dataset
```python
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        # build the list of unique values this feature takes
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        # compute the weighted entropy of each possible split
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        # keep the split with the best information gain
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
```
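Putting the pieces together on the hypothetical toy dataset: feature 0 separates the classes better than feature 1, so it should win. This is a self-contained sketch that condenses the helpers above so the selection can be run end to end:

```python
from math import log

def calcShannonEnt(dataSet):
    # Shannon entropy of the class labels (last column)
    labelCounts = {}
    for featVec in dataSet:
        labelCounts[featVec[-1]] = labelCounts.get(featVec[-1], 0) + 1
    ent = 0.0
    for count in labelCounts.values():
        prob = count / float(len(dataSet))
        ent -= prob * log(prob, 2)
    return ent

def splitDataSet(dataSet, axis, value):
    # records where feature `axis` equals `value`, with that column removed
    return [vec[:axis] + vec[axis+1:] for vec in dataSet if vec[axis] == value]

def chooseBestFeatureToSplit(dataSet):
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain, bestFeature = 0.0, -1
    for i in range(len(dataSet[0]) - 1):
        # weighted entropy after splitting on feature i
        newEntropy = 0.0
        for value in set(example[i] for example in dataSet):
            subDataSet = splitDataSet(dataSet, i, value)
            newEntropy += len(subDataSet) / float(len(dataSet)) \
                          * calcShannonEnt(subDataSet)
        if baseEntropy - newEntropy > bestInfoGain:
            bestInfoGain, bestFeature = baseEntropy - newEntropy, i
    return bestFeature

dataSet = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'],
           [0, 1, 'no'],  [0, 1, 'no']]
print(chooseBestFeatureToSplit(dataSet))  # 0
```

Splitting on feature 0 yields gain ≈ 0.420 versus ≈ 0.171 for feature 1, so index 0 is returned.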
2.3 Building the Tree Recursively
```python
import operator

def majorityCnt(classList):
    # return the class that occurs most often in classList
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def createTree(dataSet, labels):
    # dataSet: dataset; labels: list of feature names
    classList = [example[-1] for example in dataSet]
    # stop when all classes are identical
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # when every feature has been used, return the majority class
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    # collect every value the best feature takes
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]
        # recurse on each branch
        myTree[bestFeatLabel][value] = createTree(
            splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
```
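Running the recursion end to end on the hypothetical toy dataset produces a nested dict, where each inner dict maps a feature value to either a subtree or a class label. The sketch below is self-contained (it condenses the helper functions, and the feature names 'no surfacing' and 'flippers' are illustrative):

```python
from math import log

def calcShannonEnt(dataSet):
    labelCounts = {}
    for featVec in dataSet:
        labelCounts[featVec[-1]] = labelCounts.get(featVec[-1], 0) + 1
    ent = 0.0
    for count in labelCounts.values():
        prob = count / float(len(dataSet))
        ent -= prob * log(prob, 2)
    return ent

def splitDataSet(dataSet, axis, value):
    return [vec[:axis] + vec[axis+1:] for vec in dataSet if vec[axis] == value]

def chooseBestFeatureToSplit(dataSet):
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain, bestFeature = 0.0, -1
    for i in range(len(dataSet[0]) - 1):
        newEntropy = 0.0
        for value in set(e[i] for e in dataSet):
            sub = splitDataSet(dataSet, i, value)
            newEntropy += len(sub) / float(len(dataSet)) * calcShannonEnt(sub)
        if baseEntropy - newEntropy > bestInfoGain:
            bestInfoGain, bestFeature = baseEntropy - newEntropy, i
    return bestFeature

def majorityCnt(classList):
    # most frequent class label
    return max(set(classList), key=classList.count)

def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]            # all records share one class
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)  # no features left to split on
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del labels[bestFeat]
    for value in set(example[bestFeat] for example in dataSet):
        myTree[bestFeatLabel][value] = createTree(
            splitDataSet(dataSet, bestFeat, value), labels[:])
    return myTree

# hypothetical toy data: two binary features and a yes/no class label
dataSet = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'],
           [0, 1, 'no'],  [0, 1, 'no']]
myTree = createTree(dataSet, ['no surfacing', 'flippers'])
print(myTree)  # {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
```

The root splits on the highest-gain feature; the branch for value 0 is already pure ('no'), while the value-1 branch recurses on the one remaining feature.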
3 Demo: Predicting Contact Lens Type
```python
import matplotlib.pyplot as plt
import trees  # module containing the createTree function above

def createPlot(inTree):
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)
    plotTree.totalW = float(getNumLeafs(inTree))
    plotTree.totalD = float(getTreeDepth(inTree))
    plotTree.xOff = -0.5 / plotTree.totalW
    plotTree.yOff = 1.0
    plotTree(inTree, (0.5, 1.0), '')
    plt.show()

fr = open('lenses.txt')
lenses = [inst.strip().split('\t') for inst in fr.readlines()]
lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']
lensesTree = trees.createTree(lenses, lensesLabels)
createPlot(lensesTree)
```
- Helper functions
```python
#!/usr/bin/python
# -*- coding: utf8 -*-
import matplotlib.pyplot as plt

decisionNode = dict(boxstyle="sawtooth", fc="0.8")
leafNode = dict(boxstyle="round4", fc="0.8")
arrow_args = dict(arrowstyle="<-")

# count the leaf nodes of the tree
def getNumLeafs(myTree):
    numLeafs = 0
    # find the first key (the root feature)
    firstSides = list(myTree.keys())
    firstStr = firstSides[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        # a dict value means an internal node, otherwise a leaf
        if type(secondDict[key]).__name__ == 'dict':
            numLeafs += getNumLeafs(secondDict[key])
        else:
            numLeafs += 1
    return numLeafs

# compute the depth of the tree
def getTreeDepth(myTree):
    maxDepth = 0
    # find the first key (the root feature)
    firstSides = list(myTree.keys())
    firstStr = firstSides[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':
            thisDepth = 1 + getTreeDepth(secondDict[key])
        else:
            thisDepth = 1
        if thisDepth > maxDepth:
            maxDepth = thisDepth
    return maxDepth

def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    createPlot.ax1.annotate(nodeTxt, xy=parentPt, xycoords='axes fraction',
                            xytext=centerPt, textcoords='axes fraction',
                            va="center", ha="center", bbox=nodeType,
                            arrowprops=arrow_args)
```
```python
# annotate the midpoint between parent and child nodes
def plotMidText(cntrPt, parentPt, txtString):
    xMid = (parentPt[0] - cntrPt[0]) / 2.0 + cntrPt[0]
    yMid = (parentPt[1] - cntrPt[1]) / 2.0 + cntrPt[1]
    createPlot.ax1.text(xMid, yMid, txtString, va="center", ha="center",
                        rotation=30)

def plotTree(myTree, parentPt, nodeTxt):
    # compute the width and height of the tree
    numLeafs = getNumLeafs(myTree)
    depth = getTreeDepth(myTree)
    # find the first key (the root feature)
    firstSides = list(myTree.keys())
    firstStr = firstSides[0]
    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs)) / 2.0 / plotTree.totalW,
              plotTree.yOff)
    # label the branch with the child's attribute value
    plotMidText(cntrPt, parentPt, nodeTxt)
    plotNode(firstStr, cntrPt, parentPt, decisionNode)
    secondDict = myTree[firstStr]
    # move the y offset down one level
    plotTree.yOff = plotTree.yOff - 1.0 / plotTree.totalD
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':
            plotTree(secondDict[key], cntrPt, str(key))
        else:
            plotTree.xOff = plotTree.xOff + 1.0 / plotTree.totalW
            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff),
                     cntrPt, leafNode)
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
    plotTree.yOff = plotTree.yOff + 1.0 / plotTree.totalD
```
4 Summary
Matplotlib is very powerful; for this kind of plotting it is arguably even better than MATLAB.
end