Machine Learning in Action (3): Decision Trees
Constructing a Decision Tree
- Advantages
  - Low computational complexity; results are easy to interpret; insensitive to missing intermediate values; can handle irrelevant features.
- Disadvantages
  - Prone to overfitting.
- Applicable data types
  - Numeric and nominal values.
Information gain: the change in information before and after splitting the data set.
- Notation
  The information of $x_i$ is defined as $l(x_i) = -\log_2 p(x_i)$, where $p(x_i)$ is the probability of choosing that class.
- The measure of information over a whole set is called Shannon entropy, or simply entropy:
  $H = -\sum_{i=1}^{n} p(x_i)\log_2 p(x_i)$
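A quick worked example: for five records of which two are labeled 'yes' and three 'no' (these counts match the toy fish data set used below),

$H = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} \approx 0.971$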
Computing the Shannon entropy of a data set
```python
from math import log

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:  # count the occurrences of each class label
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)  # log base 2
    return shannonEnt
```
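To exercise calcShannonEnt, here is a minimal sketch assuming the book's toy fish data set (the helper name createDataSet and the exact rows follow Machine Learning in Action; treat them as an assumption):

```python
def createDataSet():
    # toy data: [can survive without surfacing?, has flippers?, is it a fish?]
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels

myDat, labels = createDataSet()
print(calcShannonEnt(myDat))  # ~0.9710, matching the worked example above
```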
The higher the entropy, the more mixed the data.
Splitting the data set on a given feature
```python
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]        # chop out the axis used for splitting
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
```
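A brief usage sketch on the assumed toy data set: splitting on feature 0 keeps only records with the requested value and removes that column.

```python
myDat, labels = createDataSet()
print(splitDataSet(myDat, 0, 1))  # [[1, 'yes'], [1, 'yes'], [0, 'no']]
print(splitDataSet(myDat, 0, 0))  # [[1, 'no'], [1, 'no']]
```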
Choosing the best feature to split on
```python
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1        # the last column holds the class labels
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):             # iterate over all the features
        featList = [example[i] for example in dataSet]  # all values of this feature
        uniqueVals = set(featList)           # the unique values of this feature
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy  # information gain: reduction in entropy
        if infoGain > bestInfoGain:          # keep the best gain seen so far
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature                       # index of the best feature
```
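On the assumed toy data set, feature 0 ('no surfacing') yields the larger information gain (about 0.420 versus 0.171 for 'flippers'):

```python
myDat, labels = createDataSet()
print(chooseBestFeatureToSplit(myDat))  # 0
```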
Building the tree recursively
The ID3 algorithm
```python
import operator

def majorityCnt(classList):
    # return the class label that occurs most often in classList
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]            # stop splitting when all classes are equal
    if len(dataSet[0]) == 1:           # stop splitting when no features remain
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]          # copy labels so recursion doesn't mutate them
        myTree[bestFeatLabel][value] = createTree(
            splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
```
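Running createTree on the assumed toy data produces a nested dict whose keys alternate between feature labels and feature values:

```python
myDat, labels = createDataSet()
myTree = createTree(myDat, labels[:])  # pass a copy: createTree deletes used labels
print(myTree)
# {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
```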
Plotting the tree with Matplotlib annotations
```python
import matplotlib.pyplot as plt

decisionNode = dict(boxstyle="sawtooth", fc="0.8")
leafNode = dict(boxstyle="round4", fc="0.8")
arrow_args = dict(arrowstyle="<-")

def getNumLeafs(myTree):
    numLeafs = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if isinstance(secondDict[key], dict):  # a dict is a subtree, otherwise a leaf
            numLeafs += getNumLeafs(secondDict[key])
        else:
            numLeafs += 1
    return numLeafs

def getTreeDepth(myTree):
    maxDepth = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if isinstance(secondDict[key], dict):
            thisDepth = 1 + getTreeDepth(secondDict[key])
        else:
            thisDepth = 1
        if thisDepth > maxDepth:
            maxDepth = thisDepth
    return maxDepth

def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    createPlot.ax1.annotate(nodeTxt, xy=parentPt, xycoords='axes fraction',
                            xytext=centerPt, textcoords='axes fraction',
                            va="center", ha="center", bbox=nodeType,
                            arrowprops=arrow_args)

def plotMidText(cntrPt, parentPt, txtString):
    # label the branch between parent and child with the feature value
    xMid = (parentPt[0] - cntrPt[0]) / 2.0 + cntrPt[0]
    yMid = (parentPt[1] - cntrPt[1]) / 2.0 + cntrPt[1]
    createPlot.ax1.text(xMid, yMid, txtString, va="center", ha="center", rotation=30)

def plotTree(myTree, parentPt, nodeTxt):
    numLeafs = getNumLeafs(myTree)        # determines the x-width of this subtree
    depth = getTreeDepth(myTree)
    firstStr = list(myTree.keys())[0]     # the first key is the feature split on here
    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs)) / 2.0 / plotTree.totalW,
              plotTree.yOff)
    plotMidText(cntrPt, parentPt, nodeTxt)
    plotNode(firstStr, cntrPt, parentPt, decisionNode)
    secondDict = myTree[firstStr]
    plotTree.yOff = plotTree.yOff - 1.0 / plotTree.totalD
    for key in secondDict.keys():
        if isinstance(secondDict[key], dict):  # a dict means another subtree
            plotTree(secondDict[key], cntrPt, str(key))  # recurse
        else:  # it's a leaf node: print the leaf node
            plotTree.xOff = plotTree.xOff + 1.0 / plotTree.totalW
            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
    plotTree.yOff = plotTree.yOff + 1.0 / plotTree.totalD

def createPlot(inTree):
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)  # no ticks
    # createPlot.ax1 = plt.subplot(111, frameon=False)  # ticks, for demo purposes
    plotTree.totalW = float(getNumLeafs(inTree))
    plotTree.totalD = float(getTreeDepth(inTree))
    plotTree.xOff = -0.5 / plotTree.totalW
    plotTree.yOff = 1.0
    plotTree(inTree, (0.5, 1.0), '')
    plt.show()

def retrieveTree(i):
    # canned trees for testing the plotting code
    listOfTrees = [
        {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}},
        {'no surfacing': {0: 'no', 1: {'flippers': {0: {'head': {0: 'no', 1: 'yes'}},
                                                    1: 'no'}}}},
    ]
    return listOfTrees[i]
```
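A short usage sketch: retrieveTree supplies a canned tree, and createPlot draws it.

```python
myTree = retrieveTree(0)
createPlot(myTree)  # opens a window with the annotated tree
```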
Classifying with the decision tree
```python
def classify(inputTree, featLabels, testVec):
    firstStr = list(inputTree.keys())[0]
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)  # translate the feature label to an index
    key = testVec[featIndex]
    valueOfFeat = secondDict[key]
    if isinstance(valueOfFeat, dict):       # internal node: keep descending
        classLabel = classify(valueOfFeat, featLabels, testVec)
    else:                                   # leaf node: this is the answer
        classLabel = valueOfFeat
    return classLabel
```
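Tried on the assumed toy data and the canned tree:

```python
myDat, labels = createDataSet()
myTree = retrieveTree(0)
print(classify(myTree, labels, [1, 0]))  # 'no'
print(classify(myTree, labels, [1, 1]))  # 'yes'
```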
Storing the decision tree
In Python, the pickle module is generally used to serialize Python objects, and cPickle provides a faster drop-in implementation of the same interface; as the Python 2 documentation puts it, "cPickle – A faster pickle". (In Python 3, cPickle was folded into the standard pickle module.)

cPickle can serialize almost any kind of Python object: lists, dicts, even class instances. Serialization, as I roughly understand it, means saving an object completely so that it can later be restored exactly. cPickle exposes four main functions for this, introduced by example below.
1 dump: serialize a Python object and save it to a local file.

```python
import cPickle

data = range(1000)
cPickle.dump(data, open("test\\data.pkl", "wb"))
```
dump takes two arguments: the Python object to serialize and a local file object. Note that the file must be opened with open in write ("wb") mode.
2 load: read a local file and restore the Python object.

```python
data = cPickle.load(open("test\\data.pkl", "rb"))
```
As with dump, open the local file with open, this time in read ("rb") mode.
3 dumps: serialize a Python object into a string variable.

```python
data_string = cPickle.dumps(data)
```
4 loads: restore a Python object from a string variable.

```python
data = cPickle.loads(data_string)
```
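Applied to this chapter's trees, here is a minimal sketch of store/load helpers. The names storeTree/grabTree are assumptions (the book defines similar helpers), and it is written against Python 3's pickle, which absorbed cPickle:

```python
import pickle

def storeTree(inputTree, filename):
    # serialize the nested-dict tree to disk; pickle requires binary mode
    with open(filename, 'wb') as fw:
        pickle.dump(inputTree, fw)

def grabTree(filename):
    # restore the tree from disk
    with open(filename, 'rb') as fr:
        return pickle.load(fr)

storeTree(retrieveTree(0), 'classifierStorage.pkl')
print(grabTree('classifierStorage.pkl'))
```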