Machine Learning: Applying a Decision Tree
Source: Internet. Editor: 程序博客网. Time: 2024/05/21 09:10
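1. Python
2. Python machine learning library: scikit-learn
2.1 Features:
Simple and efficient tools for data mining and machine learning analysis
Open to all users and highly reusable across different needs
Built on NumPy, SciPy, and matplotlib
Open source and usable commercially under the BSD license
Install Graphviz to convert .dot files to PDF and visualize the decision tree: dot -Tpdf input.dot -o output.pdf
2.2 Problem areas covered
Classification, regression, clustering, dimensionality reduction
Model selection, preprocessing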
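3. Using scikit-learn
Install scikit-learn:
Install the required packages first: NumPy, SciPy, and matplotlib.
Two good articles on sklearn:
《使用sklearn进行集成学习——理论》 (Ensemble Learning with sklearn: Theory)
《使用sklearn进行集成学习——实践》 (Ensemble Learning with sklearn: Practice)
4. Example
5. Implementation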
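scikit-learn works on numeric arrays, so the categorical attributes in this example are one-hot encoded before training. The following is a minimal pure-Python sketch of that idea, not DictVectorizer's actual implementation; the helper name one_hot and the two sample rows are hypothetical and chosen only for illustration.

```python
def one_hot(rows):
    """Encode a list of {attribute: value} dicts as rows of 0.0/1.0.

    Hypothetical helper for illustration; it mimics the idea behind
    DictVectorizer for string-valued features only.
    """
    # One column per distinct 'attribute=value' pair, sorted alphabetically.
    names = sorted({f"{key}={value}" for row in rows for key, value in row.items()})
    index = {name: i for i, name in enumerate(names)}

    matrix = []
    for row in rows:
        encoded = [0.0] * len(names)
        for key, value in row.items():
            encoded[index[f"{key}={value}"]] = 1.0
        matrix.append(encoded)
    return names, matrix


# Two made-up rows with only two attributes, to keep the result small.
rows = [
    {"age": "youth", "student": "no"},
    {"age": "senior", "student": "yes"},
]
names, matrix = one_hot(rows)
print(names)
print(matrix)
```

Each row ends up with exactly one 1.0 per attribute, which is the same layout as the dummyX matrix printed by the script.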
import csv

from sklearn import preprocessing, tree
from sklearn.feature_extraction import DictVectorizer

# sklearn has format requirements for its input, so the data must be preprocessed first.
# Read in the csv file and put the features into a list of dicts and the class labels into a list.

# Python 2.x:
# allElectronicsData = open(r'AllElectronics.csv', 'rb')
# reader = csv.reader(allElectronicsData)
# headers = reader.next()
# The statements above fail in Python 3.x with:
# '_csv.reader' object has no attribute 'next'
# In Python 3.x use the following instead.
allElectronicsData = open(r'AllElectronics.csv', 'rt')
reader = csv.reader(allElectronicsData)
headers = next(reader)
print(headers)
# ['RID', 'age', 'income', 'student', 'credit_rating', 'class_buys_computer']

featureList = []
labelList = []
for row in reader:
    # The last column is the class label.
    labelList.append(row[len(row)-1])
    # Skip RID (column 0) and the label; keep the attributes in between.
    rowDict = {}
    for i in range(1, len(row)-1):
        rowDict[headers[i]] = row[i]
    featureList.append(rowDict)
allElectronicsData.close()
print(featureList)
'''
[{'age': 'youth', 'credit_rating': 'fair', 'income': 'high', 'student': 'no'},
 {'age': 'youth', 'credit_rating': 'excellent', 'income': 'high', 'student': 'no'},
 {'age': 'middle_aged', 'credit_rating': 'fair', 'income': 'high', 'student': 'no'},
 {'age': 'senior', 'credit_rating': 'fair', 'income': 'medium', 'student': 'no'},
 {'age': 'senior', 'credit_rating': 'fair', 'income': 'low', 'student': 'yes'},
 {'age': 'senior', 'credit_rating': 'excellent', 'income': 'low', 'student': 'yes'},
 {'age': 'middle_aged', 'credit_rating': 'excellent', 'income': 'low', 'student': 'yes'},
 {'age': 'youth', 'credit_rating': 'fair', 'income': 'medium', 'student': 'no'},
 {'age': 'youth', 'credit_rating': 'fair', 'income': 'low', 'student': 'yes'},
 {'age': 'senior', 'credit_rating': 'fair', 'income': 'medium', 'student': 'yes'},
 {'age': 'youth', 'credit_rating': 'excellent', 'income': 'medium', 'student': 'yes'},
 {'age': 'middle_aged', 'credit_rating': 'excellent', 'income': 'medium', 'student': 'no'},
 {'age': 'middle_aged', 'credit_rating': 'fair', 'income': 'high', 'student': 'yes'},
 {'age': 'senior', 'credit_rating': 'excellent', 'income': 'medium', 'student': 'no'}]
'''
# As the output shows, each row is stored as a dict, so the attributes are unordered.

# Vectorize features
vec = DictVectorizer()
dummyX = vec.fit_transform(featureList).toarray()
print("dummyX: " + str(dummyX))
# Each row is converted to a one-hot format of this kind:
# youth middle_aged senior high medium low yes no fair excellent buy
#   1       0         0     1     0     0   0  1   1       0      0
'''
dummyX: [[ 0. 0. 1. 0. 1. 1. 0. 0. 1. 0.]
 [ 0. 0. 1. 1. 0. 1. 0. 0. 1. 0.]
 [ 1. 0. 0. 0. 1. 1. 0. 0. 1. 0.]
 [ 0. 1. 0. 0. 1. 0. 0. 1. 1. 0.]
 [ 0. 1. 0. 0. 1. 0. 1. 0. 0. 1.]
 [ 0. 1. 0. 1. 0. 0. 1. 0. 0. 1.]
 [ 1. 0. 0. 1. 0. 0. 1. 0. 0. 1.]
 [ 0. 0. 1. 0. 1. 0. 0. 1. 1. 0.]
 [ 0. 0. 1. 0. 1. 0. 1. 0. 0. 1.]
 [ 0. 1. 0. 0. 1. 0. 0. 1. 0. 1.]
 [ 0. 0. 1. 1. 0. 0. 0. 1. 0. 1.]
 [ 1. 0. 0. 1. 0. 0. 0. 1. 1. 0.]
 [ 1. 0. 0. 0. 1. 1. 0. 0. 0. 1.]
 [ 0. 1. 0. 1. 0. 0. 0. 1. 1. 0.]]
'''
# vec.feature_names_ lists the column names in the same order as dummyX.
# (vec.get_feature_names() returned the same list but was removed in scikit-learn 1.2.)
print(vec.feature_names_)
'''
['age=middle_aged', 'age=senior', 'age=youth', 'credit_rating=excellent', 'credit_rating=fair',
 'income=high', 'income=low', 'income=medium', 'student=no', 'student=yes']
'''
print("labelList: " + str(labelList))
# labelList:
# ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no']

# Vectorize class labels
lb = preprocessing.LabelBinarizer()
dummyY = lb.fit_transform(labelList)
print("dummyY: " + str(dummyY))
'''
dummyY: [[0] [0] [1] [1] [1] [0] [1] [0] [1] [1] [1] [1] [1] [0]]
'''

# Using decision tree for classification
# clf = tree.DecisionTreeClassifier()
'''
clf is the resulting decision tree. The criterion parameter selects the measure
used to choose splits; 'entropy' uses information gain, the measure used by ID3.
'''
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf = clf.fit(dummyX, dummyY)
print("clf: " + str(clf))

# Visualize model
'''
Create a .dot file that holds the data for visualizing the tree. The tree was
trained on numeric columns, so to show the original attribute names in the
output, pass them in through feature_names.
'''
with open("allElectronicInformationGainOri.dot", 'w') as f:
    tree.export_graphviz(clf, feature_names=vec.feature_names_, out_file=f)
'''
Finally, convert the generated .dot file into a PDF for viewing:
dot -Tpdf input.dot -o output.pdf
'''

# With the tree built, predict the result for a demo instance.
# Take the first row of data and change it slightly.
oneRowX = dummyX[0, :]
print("oneRowX: " + str(oneRowX))
# oneRowX: [ 0. 0. 1. 0. 1. 1. 0. 0. 1. 0.]
# Copy the row so that editing newRowX does not also modify dummyX.
newRowX = oneRowX.copy()
newRowX[0] = 1
newRowX[2] = 0
print("newRowX: " + str(newRowX))
# newRowX: [ 1. 0. 0. 0. 1. 1. 0. 0. 1. 0.]

# predictedY = clf.predict(newRowX)
'''
Running the line above directly raises the following error:
 "if it contains a single sample.".format(array))
ValueError: Expected 2D array, got 1D array instead:
array=[ 0. 0. 1. 0. 1. 1. 0. 0. 1. 0.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
The message says the input must be reshaped, so pass newRowX.reshape(1,-1) instead.
For what reshape does, see http://www.cnblogs.com/iamxyq/p/6683147.html
'''
predictedY = clf.predict(newRowX.reshape(1,-1))
print("predictedY: " + str(predictedY))
# predictedY: [1]
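Two NumPy details matter in the prediction step: slicing a row out of a 2D array returns a view that shares memory with the original, and clf.predict expects a 2D array with one row per sample. A small sketch of both points, using a made-up 2x3 array X rather than the real dummyX:

```python
import numpy as np

# Made-up data standing in for dummyX.
X = np.array([[0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])

# A row slice is a view: writing through it also changes X.
alias = X[0, :]
alias[0] = 1.0
print(X[0])          # first element of row 0 in X is now 1.0

# .copy() gives an independent row that can be edited safely.
safe = X[1, :].copy()
safe[0] = 0.0
print(X[1])          # row 1 of X is unchanged

# A single sample has shape (3,); reshape(1, -1) makes it shape (1, 3),
# that is, one row with as many columns as there are features.
sample = safe.reshape(1, -1)
print(safe.shape, sample.shape)
```

The -1 in reshape(1, -1) tells NumPy to infer the number of columns from the array's length.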
The generated decision tree is as follows:
[Figure: generated decision tree]
Training data (AllElectronics.csv):
RID  age          income  student  credit_rating  class_buys_computer
1    youth        high    no       fair           no
2    youth        high    no       excellent      no
3    middle_aged  high    no       fair           yes
4    senior       medium  no       fair           yes
5    senior       low     yes      fair           yes
6    senior       low     yes      excellent      no
7    middle_aged  low     yes      excellent      yes
8    youth        medium  no       fair           no
9    youth        low     yes      fair           yes
10   senior       medium  yes      fair           yes
11   youth        medium  yes      excellent      yes
12   middle_aged  medium  no       excellent      yes
13   middle_aged  high    yes      fair           yes
14   senior       medium  no       excellent      no
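The 14 rows above are enough to compute by hand the quantity behind criterion='entropy'. The sketch below calculates the entropy of the class label and the information gain of each attribute as a multiway split, in the style of ID3. It is an illustration under that assumption: scikit-learn evaluates binary splits on the one-hot columns, so the numbers it works with internally are not the same. The names ROWS, entropy, and information_gain are introduced here for illustration.

```python
from collections import Counter
from math import log2

HEADERS = ["age", "income", "student", "credit_rating"]
# The 14 training rows, without RID; the last field is class_buys_computer.
ROWS = [
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle_aged", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle_aged", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("middle_aged", "medium", "no", "excellent", "yes"),
    ("middle_aged", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, column):
    """Entropy of the labels minus the weighted entropy after splitting on column."""
    groups = {}
    for row in rows:
        groups.setdefault(row[column], []).append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - remainder


base = entropy([row[-1] for row in ROWS])
gains = {name: information_gain(ROWS, i) for i, name in enumerate(HEADERS)}
print(f"entropy of class label: {base:.3f}")
for name, gain in gains.items():
    print(f"gain({name}) = {gain:.3f}")
```

With 9 "yes" and 5 "no" labels, the label entropy is about 0.940 bits, and age has the largest gain of the four attributes (about 0.247).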