Decision Trees (Practice)

Source: Internet  Editor: 程序博客网  Time: 2024/05/18 03:32

Decision tree experiment

1. Prepare the data (E:\MachineLearning-data\AllElectronics.csv)

RID  age          Income  student  credit_rating  Class_buys_computer
1    youth        high    no       fair           no
2    youth        high    no       excellent      no
3    middle_aged  high    no       fair           yes
4    senior       medium  no       fair           yes
5    senior       low     yes      fair           yes
6    senior       low     yes      excellent      no
7    middle_aged  low     yes      excellent      yes
8    youth        medium  no       fair           no
9    youth        low     yes      fair           yes
10   senior       medium  yes      fair           yes
11   youth        medium  yes      excellent      yes
12   middle_aged  medium  no       excellent      yes
13   middle_aged  high    yes      fair           yes
14   senior       medium  no       excellent      no


2. Experiment code

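# -*- coding: utf-8 -*-
# Build a decision tree and use it for prediction
import csv

from sklearn import preprocessing
from sklearn import tree
from sklearn.feature_extraction import DictVectorizer

# 1. Read the data. In 'rt' mode Python converts \r\n to \n while reading text.
#    The encoding must match the file's encoding. This file starts with a UTF-8
#    BOM, so the first header prints as '\ufeffRID'; encoding="utf-8-sig" would strip it.
allElectronicsData = open(r'E:\MachineLearning-data\AllElectronics.csv', 'rt', encoding="utf-8")
reader = csv.reader(allElectronicsData)
headers = next(reader)
# Print the attribute names
print(headers)

# 2. Store the data
# featuresList: one dict per row with the values of age, Income, student, credit_rating
# labelList: the class label of each row (last column)
featuresList = []
labelList = []
for row in reader:
    labelList.append(row[len(row) - 1])
    rowDict = {}
    # Skip column 0 (RID) and the last column (class label)
    for i in range(1, len(row) - 1):
        rowDict[headers[i]] = row[i]
    featuresList.append(rowDict)
allElectronicsData.close()

# 3. Vectorize the features (one 0/1 column per attribute=value pair)
vec = DictVectorizer()
dummyX = vec.fit_transform(featuresList).toarray()
print("dummyX:" + str(dummyX))
# Print the feature names
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there)
print(vec.get_feature_names())
# Print the class labels of the training set
print("labelList:" + str(labelList))

# 4. Convert the class labels to 0/1
lb = preprocessing.LabelBinarizer()
dummyY = lb.fit_transform(labelList)
print("dummyY:" + str(dummyY))

# 5. Build the decision tree, using entropy as the split criterion
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf = clf.fit(dummyX, dummyY)
print("clf:" + str(clf))

# 6. Write the tree to a dot file
with open(r"E:\MachineLearning-data\AllElectronicInformationGainOri.dot", 'w') as f:
    tree.export_graphviz(clf, feature_names=vec.get_feature_names(), out_file=f)

# 7. Build a sample to predict: take the first row of the training data
oneRowX = dummyX[0, :]
print("oneRowX: " + str(oneRowX))
# Modify some values on a copy, so that dummyX itself is not changed.
# Per the feature names printed above, index 0 is Income=high and index 2 is Income=medium.
newRowX = oneRowX.copy()
newRowX[0] = 1
newRowX[2] = 1
print("newRowX: " + str(newRowX))

# 8. Predict. predict() expects a 2-D array (one row per sample); passing the 1-D row
#    gives a DeprecationWarning in scikit-learn 0.17/0.18 and a ValueError from 0.19 on.
predictedY = clf.predict(newRowX.reshape(1, -1))
print("predictedY: " + str(predictedY))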
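To see what criterion='entropy' measures, the information gain of each attribute can be worked out by hand on the 14 rows above. The following is only a rough sketch using the standard library: it computes the multiway (ID3-style) gain per attribute, while scikit-learn splits on the individual 0/1 columns, so the numbers are for intuition and are not what DecisionTreeClassifier computes internally. The names ROWS, entropy and information_gain are made up for this example.

```python
import math
from collections import Counter

HEADERS = ["age", "Income", "student", "credit_rating"]
# The 14 training rows from section 1; the last field is Class_buys_computer.
ROWS = [
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle_aged", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle_aged", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("middle_aged", "medium", "no", "excellent", "yes"),
    ("middle_aged", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, column):
    """Entropy of all labels minus the weighted entropy after grouping by one column."""
    labels = [row[-1] for row in rows]
    groups = {}
    for row in rows:
        groups.setdefault(row[column], []).append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


gains = {name: information_gain(ROWS, i) for i, name in enumerate(HEADERS)}
for name, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {gain:.3f}")
```

With this data, age has the largest gain, which is why an ID3-style tree would split on age first.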
3. Experiment results
"D:\Program Files\Python\Anaconda\python.exe" E:/Python/machinelearning/01.py
['\ufeffRID', 'age', 'Income', 'student', 'credit_rating', 'Class_buys_computer']
dummyX:[[ 1.  0.  0.  0.  0.  1.  0.  1.  1.  0.]
 [ 1.  0.  0.  0.  0.  1.  1.  0.  1.  0.]
 [ 1.  0.  0.  1.  0.  0.  0.  1.  1.  0.]
 [ 0.  0.  1.  0.  1.  0.  0.  1.  1.  0.]
 [ 0.  1.  0.  0.  1.  0.  0.  1.  0.  1.]
 [ 0.  1.  0.  0.  1.  0.  1.  0.  0.  1.]
 [ 0.  1.  0.  1.  0.  0.  1.  0.  0.  1.]
 [ 0.  0.  1.  0.  0.  1.  0.  1.  1.  0.]
 [ 0.  1.  0.  0.  0.  1.  0.  1.  0.  1.]
 [ 0.  0.  1.  0.  1.  0.  0.  1.  0.  1.]
 [ 0.  0.  1.  0.  0.  1.  1.  0.  0.  1.]
 [ 0.  0.  1.  1.  0.  0.  1.  0.  1.  0.]
 [ 1.  0.  0.  1.  0.  0.  0.  1.  0.  1.]
 [ 0.  0.  1.  0.  1.  0.  1.  0.  1.  0.]]
['Income=high', 'Income=low', 'Income=medium', 'age=middle_aged', 'age=senior', 'age=youth', 'credit_rating=excellent', 'credit_rating=fair', 'student=no', 'student=yes']
labelList:['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no']
dummyY:[[0]
 [0]
 [1]
 [1]
 [1]
 [0]
 [1]
 [0]
 [1]
 [1]
 [1]
 [1]
 [1]
 [0]]
clf:DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
oneRowX: [ 1.  0.  0.  0.  0.  1.  0.  1.  1.  0.]
newRowX: [ 1.  0.  1.  0.  0.  1.  0.  1.  1.  0.]
predictedY: [0]
D:\Program Files\Python\Anaconda\lib\site-packages\sklearn\utils\validation.py:395: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  DeprecationWarning)
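The column order of dummyX follows the sorted "attribute=value" names printed above. The sketch below imitates that encoding in plain Python for string-valued features, to make it clear where each 0/1 comes from. It is an illustration and not DictVectorizer's actual implementation; one_hot_rows and SAMPLE are names invented for this example.

```python
def one_hot_rows(feature_dicts):
    """Return (sorted 'name=value' feature names, list of 0/1 rows)."""
    names = sorted({f"{k}={v}" for d in feature_dicts for k, v in d.items()})
    index = {name: i for i, name in enumerate(names)}
    matrix = []
    for d in feature_dicts:
        row = [0.0] * len(names)
        for k, v in d.items():
            row[index[f"{k}={v}"]] = 1.0  # mark the value this row actually has
        matrix.append(row)
    return names, matrix


# Rows 1, 7 and 4 of the data set; together they cover every attribute value.
SAMPLE = [
    {"age": "youth", "Income": "high", "student": "no", "credit_rating": "fair"},
    {"age": "middle_aged", "Income": "low", "student": "yes", "credit_rating": "excellent"},
    {"age": "senior", "Income": "medium", "student": "no", "credit_rating": "fair"},
]

names, matrix = one_hot_rows(SAMPLE)
print(names)
for row in matrix:
    print(row)
```

Uppercase letters sort before lowercase ones, which is why the Income columns come before the age columns.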
4. Convert the dot file to PDF (command: dot -Tpdf E:\MachineLearning-data\AllElectronicInformationGainOri.dot -o E:\MachineLearning-data\AllElectronics.pdf)
Graphviz, the software that converts dot to PDF, is described in section 6.

5. Error summary

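1) Error 1
Reading the file in Python fails with "UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 205: illegal multibyte sequence"

Cause: no encoding was passed to open(), so Python used the system default (gbk here), which does not match the file.
Solution 1: open the file with an explicit encoding.
FILE_OBJECT = open('order.log', 'r', encoding='UTF-8')
Solution 2: open the file in binary mode and decode the bytes yourself.
FILE_OBJECT = open('order.log', 'rb')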
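2) Error 2
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
Cause: csv.reader must iterate over text, not binary data. The file was opened in 'rb' mode (binary mode also does not accept an encoding argument):
open(r'E:\MachineLearning-data\AllElectronics.csv', 'rb')
Solution: open the file in text mode.
open(r'E:\MachineLearning-data\AllElectronics.csv', 'rt', encoding="utf-8")
Notes: rb opens the file read-only in binary format.
rt opens the file for reading as text; Python converts \r\n to \n while reading.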
3) Error 3
The encoding of the .csv file must match the encoding used when reading or writing it.
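The three errors above can be reproduced on a small temporary file. This is a minimal sketch using only the standard library; the file name sample.csv and the helpers read_header and demo are made up for the example. It writes a CSV that starts with a UTF-8 BOM, then reads the header with utf-8, with utf-8-sig, and in binary mode.

```python
import csv
import os
import tempfile


def read_header(path, encoding):
    """Read the first CSV row in text mode with the given encoding."""
    with open(path, "rt", encoding=encoding, newline="") as f:
        return next(csv.reader(f))


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.csv")
        # A UTF-8 file that starts with a BOM, as some Windows tools save it.
        with open(path, "wb") as f:
            f.write(b"\xef\xbb\xbfRID,age\r\n1,youth\r\n")

        plain = read_header(path, "utf-8")         # BOM stays in the first header
        stripped = read_header(path, "utf-8-sig")  # BOM is removed

        # Binary mode hands bytes to csv.reader, which raises csv.Error.
        try:
            with open(path, "rb") as f:
                next(csv.reader(f))
            binary_error = None
        except csv.Error as exc:
            binary_error = str(exc)
    return plain, stripped, binary_error


plain, stripped, binary_error = demo()
print(plain)
print(stripped)
print(binary_error)
```

Reading with utf-8-sig is a convenient choice when the file may or may not have a BOM, since it also reads plain UTF-8 files.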

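6. Install graphviz

1) Download: the first package is the installer version, the second is the portable (no-install) version


2) Install and configure the environment variable

a. Configure the environment variable (add the graphviz directory to the system variable PATH)


b. Check that the installation is correct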
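One way to check the installation from Python is to look for the dot executable on PATH and ask it for its version. This is a small sketch, not part of the original experiment; graphviz_version and build_dot_command are hypothetical helper names, and the command built is the same dot -Tpdf call used in step 4.

```python
import shutil
import subprocess


def graphviz_version():
    """Return the version text printed by `dot -V`, or None if dot is not on PATH."""
    if shutil.which("dot") is None:
        return None
    result = subprocess.run(["dot", "-V"], capture_output=True, text=True)
    # dot normally prints its version on stderr; fall back to stdout just in case.
    return (result.stderr or result.stdout).strip()


def build_dot_command(dot_file, pdf_file):
    """Build the argument list for converting a dot file to PDF."""
    return ["dot", "-Tpdf", dot_file, "-o", pdf_file]


version = graphviz_version()
if version is None:
    print("dot was not found on PATH; check the environment variable")
else:
    print(version)
print(build_dot_command("tree.dot", "tree.pdf"))
```

If the version prints, the command list can be passed to subprocess.run to do the conversion.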



