Supervised Learning Models: Classification
Source: Internet · Editor: 程序博客网 · Date: 2024/06/06
1. Logistic Regression
```python
# Import the pandas and numpy toolkits
import pandas as pd
import numpy as np

# Build the feature-name list for the Wisconsin breast-cancer data
column_names = ['sample code number', 'clump thickness', 'uniformity of cell size',
                'uniformity of cell shape', 'marginal adhesion',
                'single epithelial cell size', 'bare nuclei', 'bland chromatin',
                'normal nucleoli', 'mitoses', 'class']

# Read the data from the UCI repository with pandas.read_csv
data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/'
                   'breast-cancer-wisconsin/breast-cancer-wisconsin.data',
                   names=column_names)
# Print the shape of the full data set
print(data.shape)

# Replace '?' with the standard missing-value marker
data = data.replace(to_replace='?', value=np.nan)
# Drop every row that has a missing value in any dimension
data = data.dropna(how='any')
# Print the shape of the cleaned data
print(data.shape)

# Prepare training and test data; as usual, 25% of the data is held out for
# testing and the remaining 75% is used for training
# (in scikit-learn >= 0.18 this lives in sklearn.model_selection instead)
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    data[column_names[1:10]], data[column_names[10]],
    test_size=0.25, random_state=33)

# Inspect the size and class distribution of the training and test samples
print(y_train.value_counts())
print(y_test.value_counts())

# Import StandardScaler, LogisticRegression and SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier

# Standardize every feature to zero mean and unit variance so that dimensions
# with large values do not dominate the prediction
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)  # reuse the statistics fitted on the training set

# Initialize LogisticRegression and SGDClassifier
lr = LogisticRegression()
sgdc = SGDClassifier()

# Fit the logistic-regression model, then predict on X_test;
# the results are stored in lr_y_predict
lr.fit(X_train, y_train)
lr_y_predict = lr.predict(X_test)

# Fit the SGD classifier, then predict on X_test;
# the results are stored in sgdc_y_predict
sgdc.fit(X_train, y_train)
sgdc_y_predict = sgdc.predict(X_test)

# Analyze the performance of both linear classifiers
from sklearn.metrics import classification_report
# Accuracy via each model's own score function, then the other three metrics
# (precision, recall, F1) via classification_report
print('Accuracy of LR Classifier:', lr.score(X_test, y_test))
print(classification_report(y_test, lr_y_predict, target_names=['benign', 'malignant']))
print('Accuracy of SGD Classifier:', sgdc.score(X_test, y_test))
print(classification_report(y_test, sgdc_y_predict, target_names=['benign', 'malignant']))
```
```
(699, 11)
(683, 11)
2    344
4    168
Name: class, dtype: int64
2    100
4     71
Name: class, dtype: int64
Accuracy of LR Classifier: 0.970760233918
             precision    recall  f1-score   support

     benign       0.96      0.99      0.98       100
  malignant       0.99      0.94      0.96        71

avg / total       0.97      0.97      0.97       171

Accuracy of SGD Classifier: 0.976608187135
             precision    recall  f1-score   support

     benign       0.97      0.99      0.98       100
  malignant       0.99      0.96      0.97        71

avg / total       0.98      0.98      0.98       171
```
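The precision, recall and f1-score columns in the report above come directly from the confusion matrix. As an illustrative aside (using toy labels, not the breast-cancer data), the same numbers can be recomputed by hand:

```python
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only; 1 is the positive class
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

# ravel() flattens the 2x2 matrix into (tn, fp, fn, tp)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)                            # 3 / 4 = 0.75
recall = tp / (tp + fn)                               # 3 / 4 = 0.75
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean
print(precision, recall, f1)                          # 0.75 0.75 0.75
```

This makes the trade-off visible: the SGD model above trades a little precision for recall on the malignant class, which classification_report reports per class.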
2. Support Vector Classification
```python
# Load the handwritten-digits data set from sklearn.datasets
from sklearn.datasets import load_digits
digits = load_digits()
# Check the number of samples and feature dimensions
print(digits.data.shape)

# Split the digits data: 75% for training, 25% for testing
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=33)
# Check the size of the training and test sets
print(y_train.shape)
print(y_test.shape)

# Recognize the digit images with a support vector classifier
from sklearn.preprocessing import StandardScaler
# LinearSVC is the support vector classifier with a linear hypothesis
from sklearn.svm import LinearSVC

# Standardize the training and test features
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)

# Initialize and train LinearSVC, then predict the digit classes of the
# test samples; the results are stored in y_predict
lsvc = LinearSVC()
lsvc.fit(X_train, y_train)
y_predict = lsvc.predict(X_test)

# Accuracy via the model's own score function
print('The Accuracy of Linear SVC is', lsvc.score(X_test, y_test))
# As before, use classification_report for a more detailed analysis
from sklearn.metrics import classification_report
print(classification_report(y_test, y_predict,
                            target_names=digits.target_names.astype(str)))
```
```
(1797, 64)
(1347,)
(450,)
The Accuracy of Linear SVC is 0.953333333333
             precision    recall  f1-score   support

          0       0.92      1.00      0.96        35
          1       0.96      0.98      0.97        54
          2       0.98      1.00      0.99        44
          3       0.93      0.93      0.93        46
          4       0.97      1.00      0.99        35
          5       0.94      0.94      0.94        48
          6       0.96      0.98      0.97        51
          7       0.92      1.00      0.96        35
          8       0.98      0.84      0.91        58
          9       0.95      0.91      0.93        44

avg / total       0.95      0.95      0.95       450
```
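LinearSVC is restricted to a linear decision boundary. As a hedged side note (not part of the original experiment), the kernelized SVC from the same module usually scores somewhat higher on these digits; the sketch below uses the newer `sklearn.model_selection` import path:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split  # replaces sklearn.cross_validation
from sklearn.svm import SVC

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=33)

# RBF-kernel SVC; the pixel values all share the same 0-16 scale,
# so the model also behaves reasonably without standardization
svc = SVC(kernel='rbf', gamma='scale')
svc.fit(X_train, y_train)
print('RBF SVC accuracy:', svc.score(X_test, y_test))
```

The kernel trick maps the 64-dimensional pixel vectors into a higher-dimensional space where the ten classes separate more easily, at the cost of slower prediction.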
3. Naive Bayes
```python
# Import the news fetcher from sklearn.datasets; unlike the bundled data sets,
# fetch_20newsgroups downloads the data from the internet on demand
from sklearn.datasets import fetch_20newsgroups
news = fetch_20newsgroups(subset='all')
# Inspect the size of the data and one sample document
print(len(news.data))
print(news.data[1])

# Split the 20-class news text data
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    news.data, news.target, test_size=0.25, random_state=33)

# Predict the news categories with a naive Bayes classifier
# First convert the raw text into count-based feature vectors
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
X_train = vec.fit_transform(X_train)
X_test = vec.transform(X_test)

# Import the naive Bayes model with its default configuration,
# estimate its parameters from the training data, and predict the
# test categories; the results are stored in y_predict
from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB()
mnb.fit(X_train, y_train)
y_predict = mnb.predict(X_test)

# Evaluate the naive Bayes classifier on the news text data
from sklearn.metrics import classification_report
print('The accuracy of naive bayes classifier is', mnb.score(X_test, y_test))
print(classification_report(y_test, y_predict, target_names=news.target_names))
```
```
18846
From: mblawson@midway.ecn.uoknor.edu (Matthew B Lawson)
Subject: Which high-performance VLB video card?
Summary: Seek recommendations for VLB video card
Nntp-Posting-Host: midway.ecn.uoknor.edu
Organization: Engineering Computer Network, University of Oklahoma, Norman, OK, USA
Keywords: orchid, stealth, vlb
Lines: 21

My brother is in the market for a high-performance video card that supports
VESA local bus with 1-2MB RAM.  Does anyone have suggestions/ideas on:

  - Diamond Stealth Pro Local Bus
  - Orchid Farenheit 1280
  - ATI Graphics Ultra Pro
  - Any other high-performance VLB card

Please post or email.  Thank you!

  - Matt

--
|  Matthew B. Lawson <------------> (mblawson@essex.ecn.uoknor.edu)  |
|  "Now I, Nebuchadnezzar, praise and exalt and glorify the King     |
|   of heaven, because everything he does is right and all his ways  |
|   are just." - Nebuchadnezzar, king of Babylon, 562 B.C.           |

The accuracy of naive bayes classifier is 0.839770797963
                          precision    recall  f1-score   support

             alt.atheism       0.86      0.86      0.86       201
           comp.graphics       0.59      0.86      0.70       250
 comp.os.ms-windows.misc       0.89      0.10      0.17       248
comp.sys.ibm.pc.hardware       0.60      0.88      0.72       240
   comp.sys.mac.hardware       0.93      0.78      0.85       242
          comp.windows.x       0.82      0.84      0.83       263
            misc.forsale       0.91      0.70      0.79       257
               rec.autos       0.89      0.89      0.89       238
         rec.motorcycles       0.98      0.92      0.95       276
      rec.sport.baseball       0.98      0.91      0.95       251
        rec.sport.hockey       0.93      0.99      0.96       233
               sci.crypt       0.86      0.98      0.91       238
         sci.electronics       0.85      0.88      0.86       249
                 sci.med       0.92      0.94      0.93       245
               sci.space       0.89      0.96      0.92       221
  soc.religion.christian       0.78      0.96      0.86       232
      talk.politics.guns       0.88      0.96      0.92       251
   talk.politics.mideast       0.90      0.98      0.94       231
      talk.politics.misc       0.79      0.89      0.84       188
      talk.religion.misc       0.93      0.44      0.60       158

             avg / total       0.86      0.84      0.82      4712
```
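CountVectorizer weights every occurrence of a word equally, so frequent but uninformative words can dominate. A common alternative is TF-IDF weighting; the sketch below demonstrates the transformation on a tiny made-up corpus (a stand-in for `news.data`, not the actual news text):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for news.data; each document becomes one row vector
corpus = ['the cat sat', 'the dog sat', 'the cat ran']
vec = TfidfVectorizer()
X = vec.fit_transform(corpus)

print(sorted(vec.vocabulary_))   # ['cat', 'dog', 'ran', 'sat', 'the']
print(X.shape)                   # (3, 5): 3 documents, 5 distinct terms
```

Swapping `CountVectorizer` for `TfidfVectorizer` in the pipeline above is a one-line change, since both expose the same `fit_transform`/`transform` interface.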
4. KNN
```python
# Read the Iris data set and its description
from sklearn.datasets import load_iris
iris = load_iris()
# Check the size of the data
print(iris.data.shape)
# Reading the data set description is a good habit for any practitioner
print(iris.DESCR)

# Split the Iris data
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=33)

# Predict the Iris classes with the K-nearest-neighbor classifier
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Standardize the training and test features
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)

# Fit the classifier and predict the classes of the test samples;
# the results are stored in y_predict
knc = KNeighborsClassifier()
knc.fit(X_train, y_train)
y_predict = knc.predict(X_test)

# Evaluate KNN on the Iris data: accuracy on the held-out test set
print('the accuracy of K-Nearest Neighbor Classifier is', knc.score(X_test, y_test))
# As before, use classification_report for a more detailed analysis
from sklearn.metrics import classification_report
print(classification_report(y_test, y_predict, target_names=iris.target_names))
```
```
(150, 4)
Iris Plants Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris datasets.
http://archive.ics.uci.edu/ml/datasets/Iris

The famous Iris database, first used by Sir R.A Fisher

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

References
----------
   - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments".  IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al's AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...

the accuracy of K-Nearest Neighbor Classifier is 0.894736842105
             precision    recall  f1-score   support

     setosa       1.00      1.00      1.00         8
 versicolor       0.73      1.00      0.85        11
  virginica       1.00      0.79      0.88        19

avg / total       0.92      0.89      0.90        38
```
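KNeighborsClassifier defaults to k = 5, but the number of neighbors is the model's main hyperparameter. As a small extension of the experiment above (same split and standardization, not part of the original text), the test accuracy can be compared across several values of k:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split  # newer import path
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=33)

ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)

# Test accuracy for several choices of k (the library default is 5)
scores = {k: KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
                                                .score(X_test, y_test)
          for k in (1, 3, 5, 7, 9)}
print(scores)
```

Small k fits the training data more closely (low bias, high variance); large k smooths the decision boundary. In practice k is chosen by cross-validation rather than on the test set.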
5. Decision Tree
```python
# Inspect the Titanic passenger data
import pandas as pd
# Read the passenger data directly from the internet with pandas.read_csv
titanic = pd.read_csv('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt')
# The first rows show a mix of numeric, categorical and missing values
print(titanic.head())
# pandas stores the table in its DataFrame format; info() summarizes each column
print(titanic.info())

# Feature selection is a crucial step; background knowledge about the disaster
# suggests that sex, age and pclass are likely the key determinants of survival
X = titanic[['pclass', 'age', 'sex']].copy()  # copy to avoid SettingWithCopyWarning
y = titanic['survived']
# Inspect the selected features
print(X.info())

# The output above implies two data-preparation tasks:
# (1) the age column has only 633 values and must be completed - here with the mean
# (2) pclass and sex are categorical and must become numeric (0/1) features

# Fill the missing ages first; mean or median imputation biases the model least
X['age'].fillna(X['age'].mean(), inplace=True)
# Re-inspect the completed data
print(X.info())
print(X['age'])
print(X['sex'])
print(X['pclass'])

# Split the data and extract the features
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=33)

# Use the DictVectorizer feature transformer from sklearn.feature_extraction
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False)
# After the transformation each categorical value becomes its own 0/1 column,
# while numeric features stay unchanged
X_train = vec.fit_transform(X_train.to_dict(orient='records'))
print(vec.feature_names_)
# The test features must be transformed the same way
X_test = vec.transform(X_test.to_dict(orient='records'))
print(X_train)
print('training samples is', X_train.shape)
print('testing samples is', X_test.shape)

# Train a decision tree classifier on the training split and use it to
# predict the test split
from sklearn.tree import DecisionTreeClassifier
# Initialize the decision tree with its default configuration
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
y_predict = dtc.predict(X_test)
print(y_predict)
print(y_test)

# Evaluate the decision tree's survival predictions
from sklearn.metrics import classification_report
print('the accuracy of Decision Tree is', dtc.score(X_test, y_test))
print(classification_report(y_predict, y_test, target_names=['died', 'survived']))
```
```
   row.names pclass  survived                                             name      age     embarked
0          1    1st         1                     Allen, Miss Elisabeth Walton  29.0000  Southampton
1          2    1st         0                      Allison, Miss Helen Loraine   2.0000  Southampton
2          3    1st         0              Allison, Mr Hudson Joshua Creighton  30.0000  Southampton
3          4    1st         0  Allison, Mrs Hudson J.C. (Bessie Waldo Daniels)  25.0000  Southampton
4          5    1st         1                    Allison, Master Hudson Trevor   0.9167  Southampton

                         home.dest room      ticket  boat     sex
0                     St Louis, MO  B-5  24160 L221     2  female
1  Montreal, PQ / Chesterville, ON  C26         NaN   NaN  female
2  Montreal, PQ / Chesterville, ON  C26         NaN (135)    male
3  Montreal, PQ / Chesterville, ON  C26         NaN   NaN  female
4  Montreal, PQ / Chesterville, ON  C22         NaN    11    male

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1313 entries, 0 to 1312
Data columns (total 11 columns):
row.names    1313 non-null int64
pclass       1313 non-null object
survived     1313 non-null int64
name         1313 non-null object
age          633 non-null float64
embarked     821 non-null object
home.dest    754 non-null object
room         77 non-null object
ticket       69 non-null object
boat         347 non-null object
sex          1313 non-null object
dtypes: float64(1), int64(2), object(8)
memory usage: 112.9+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1313 entries, 0 to 1312
Data columns (total 3 columns):
pclass    1313 non-null object
age       633 non-null float64
sex       1313 non-null object
dtypes: float64(1), object(2)
memory usage: 30.9+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1313 entries, 0 to 1312
Data columns (total 3 columns):
pclass    1313 non-null object
age       1313 non-null float64
sex       1313 non-null object
dtypes: float64(1), object(2)
memory usage: 30.9+ KB
None
0       29.000000
1        2.000000
2       30.000000
3       25.000000
4        0.916700
          ...
1312    31.194181
Name: age, dtype: float64
0       female
1       female
2         male
          ...
1312      male
Name: sex, dtype: object
0       1st
1       1st
2       1st
        ...
1312    3rd
Name: pclass, dtype: object
['age', 'pclass=1st', 'pclass=2nd', 'pclass=3rd', 'sex=female', 'sex=male']
[[ 31.19418104   0.   0.   1.   0.   1.]
 [ 31.19418104   1.   0.   0.   1.   0.]
 [ 31.19418104   0.   0.   1.   0.   1.]
 ...
 [ 12.           0.   1.   0.   1.   0.]
 [ 18.           0.   1.   0.   0.   1.]
 [ 31.19418104   0.   0.   1.   1.   0.]]
training samples is (984, 6)
testing samples is (329, 6)
[0 1 0 0 0 1 1 0 0 1 ... 0 1 0 1 0]
386    0
89     1
183    1
       ...
937    0
Name: survived, dtype: int64
the accuracy of Decision Tree is 0.781155015198
             precision    recall  f1-score   support

       died       0.91      0.78      0.84       236
   survived       0.58      0.80      0.67        93

avg / total       0.81      0.78      0.79       329
```
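The one-hot expansion performed by DictVectorizer is easiest to see in isolation. The sketch below uses two hypothetical passenger records (illustrative values, not rows from the actual Titanic file) in the same shape produced by `to_dict(orient='records')`:

```python
from sklearn.feature_extraction import DictVectorizer

# Two hypothetical records in the shape produced by X.to_dict(orient='records')
rows = [{'pclass': '1st', 'age': 29.0, 'sex': 'female'},
        {'pclass': '3rd', 'age': 31.19, 'sex': 'male'}]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(rows)

# Categorical values each get a 0/1 column; the numeric 'age' stays as one column
print(vec.feature_names_)  # ['age', 'pclass=1st', 'pclass=3rd', 'sex=female', 'sex=male']
print(X)
```

Only categories seen during `fit_transform` get columns, which is why the test set must go through `transform` with the same fitted vectorizer.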
6. Ensemble Classification
```python
import pandas as pd

# Read the Titanic passenger data from the internet into the variable titanic
titanic = pd.read_csv('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt')

# Manually select pclass, age and sex as the survival features
X = titanic[['pclass', 'age', 'sex']].copy()  # copy to avoid SettingWithCopyWarning
y = titanic['survived']

# Replace missing ages with the mean age of all passengers; this lets us keep
# training on the full data while disturbing the prediction task as little as possible
X['age'].fillna(X['age'].mean(), inplace=True)

# Split the original data: 25% of the passengers for testing
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=33)

# Convert the categorical features into one-hot feature vectors
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False)
X_train = vec.fit_transform(X_train.to_dict(orient='records'))
X_test = vec.transform(X_test.to_dict(orient='records'))

# Train and predict with a single decision tree
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
dtc_y_pred = dtc.predict(X_test)

# Train and predict with a random-forest ensemble
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
rfc_y_pred = rfc.predict(X_test)

# Train and predict with gradient-boosted decision trees
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier()
gbc.fit(X_train, y_train)
gbc_y_pred = gbc.predict(X_test)

# Compare the ensemble models' survival predictions
from sklearn.metrics import classification_report
# Test-set accuracy plus precision/recall/F1 for the single decision tree
print('the accuracy of decision tree is', dtc.score(X_test, y_test))
print(classification_report(dtc_y_pred, y_test, target_names=['died', 'survived']))
# ... for the random forest
print('the accuracy of random forest classifier is', rfc.score(X_test, y_test))
print(classification_report(rfc_y_pred, y_test, target_names=['died', 'survived']))
# ... for gradient boosting
print('the accuracy of gradient boosting is', gbc.score(X_test, y_test))
print(classification_report(gbc_y_pred, y_test, target_names=['died', 'survived']))
```
```
the accuracy of decision tree is 0.781155015198
             precision    recall  f1-score   support

       died       0.91      0.78      0.84       236
   survived       0.58      0.80      0.67        93

avg / total       0.81      0.78      0.79       329

the accuracy of random forest classifier is 0.775075987842
             precision    recall  f1-score   support

       died       0.90      0.77      0.83       236
   survived       0.57      0.78      0.66        93

avg / total       0.81      0.78      0.78       329

the accuracy of gradient boosting is 0.790273556231
             precision    recall  f1-score   support

       died       0.92      0.78      0.84       239
   survived       0.58      0.82      0.68        90

avg / total       0.83      0.79      0.80       329
```
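A single train/test split like the one above can be noisy, so small accuracy gaps between the three models are not conclusive. K-fold cross-validation gives a steadier comparison; the sketch below is illustrative and uses scikit-learn's bundled breast-cancer data as a stand-in (the Titanic URL may no longer resolve), since the comparison technique is identical:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Bundled data set as a stand-in for the Titanic features
X, y = load_breast_cancer(return_X_y=True)

# Mean accuracy over 5 folds for each of the three tree-based models
results = {}
for name, clf in [('decision tree', DecisionTreeClassifier(random_state=33)),
                  ('random forest', RandomForestClassifier(random_state=33)),
                  ('gradient boosting', GradientBoostingClassifier(random_state=33))]:
    results[name] = cross_val_score(clf, X, y, cv=5).mean()
print(results)
```

Averaging over folds typically shows the ensembles beating the single tree by a clearer margin than any one split does.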