Reposted: Learn scikit-learn's basic classification methods (decision trees, SVM, KNN) and ensemble methods (random forests, AdaBoost, and GBRT) in 30 minutes


For regression methods, please see my other post, "Learn scikit-learn's basic regression methods (linear, decision trees, SVM, KNN) and ensemble methods (random forests, AdaBoost, and GBRT) in 30 minutes".
This article mainly draws on the official scikit-learn website.

1. Data preparation

For classification we use the Iris dataset, which ships with scikit-learn.
The Iris dataset is a classic benchmark for classification experiments, collected by Fisher in 1936. Also known as the iris flower dataset, it is a standard multivariate dataset. It contains 150 samples split into 3 classes of 50 samples each, and each sample has 4 attributes: sepal length, sepal width, petal length, and petal width. The task is to predict, from these 4 attributes, which of the three species (Setosa, Versicolour, Virginica) a flower belongs to.
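The numbers above are easy to verify directly from the copy bundled with scikit-learn; a quick sanity check (not part of the original post):

```python
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)    # 150 samples, 4 features
print(iris.target.shape)  # one class label (0, 1, or 2) per sample
print(iris.target[:5])    # the first 50 labels are all class 0
```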

Note that the Iris dataset is ordered by class: samples 1-50 are class 0, samples 51-100 are class 1, and samples 101-150 are class 2, so we must shuffle the data before splitting it into training and test sets.
Here we introduce a function that shuffles two arrays in unison: it takes two arrays x and y and shuffles them with the same permutation, so each x stays paired with its y.

import numpy

def shuffle_in_unison(a, b):
    assert len(a) == len(b)
    shuffled_a = numpy.empty(a.shape, dtype=a.dtype)
    shuffled_b = numpy.empty(b.shape, dtype=b.dtype)
    permutation = numpy.random.permutation(len(a))
    for old_index, new_index in enumerate(permutation):
        shuffled_a[new_index] = a[old_index]
        shuffled_b[new_index] = b[old_index]
    return shuffled_a, shuffled_b
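As an aside not in the original post, scikit-learn already ships a utility that shuffles several arrays with one shared permutation, so the hand-written function above can be replaced; a minimal sketch:

```python
import numpy as np
from sklearn.utils import shuffle  # shuffles several arrays with one permutation

a = np.arange(10).reshape(5, 2)  # rows: [0, 1], [2, 3], ...
b = np.arange(5)                 # row i of a is paired with b[i] == i
a_s, b_s = shuffle(a, b, random_state=0)
# the pairing survives: a_s[i, 0] == 2 * b_s[i] for every row
```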

Next we load the Iris data, shuffle it, and split it into 100 training samples and 50 test samples.

from sklearn.datasets import load_iris

iris = load_iris()

def load_data():
    iris.data, iris.target = shuffle_in_unison(iris.data, iris.target)
    x_train, x_test = iris.data[:100], iris.data[100:]
    y_train, y_test = iris.target[:100].reshape(-1, 1), iris.target[100:].reshape(-1, 1)
    return x_train, y_train, x_test, y_test
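As a side note (not part of the original post), the shuffle-then-slice step above can also be done with scikit-learn's built-in train_test_split, which shuffles and splits in one call; a minimal sketch:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
# test_size=50 reproduces the 100/50 split used in the post;
# shuffling is on by default, and random_state makes it repeatable.
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=50, random_state=0)
```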

2. Trying out the different methods

Commonly used classification methods include decision trees, SVM, kNN, and naive Bayes; common ensemble methods include random forests, AdaBoost, and GBDT.
The complete code is as follows:

from sklearn.datasets import load_iris
from sklearn import tree, svm, naive_bayes, neighbors
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier, RandomForestClassifier, GradientBoostingClassifier
import numpy

iris = load_iris()

def shuffle_in_unison(a, b):
    # Shuffle two arrays with the same permutation so pairs stay aligned.
    assert len(a) == len(b)
    shuffled_a = numpy.empty(a.shape, dtype=a.dtype)
    shuffled_b = numpy.empty(b.shape, dtype=b.dtype)
    permutation = numpy.random.permutation(len(a))
    for old_index, new_index in enumerate(permutation):
        shuffled_a[new_index] = a[old_index]
        shuffled_b[new_index] = b[old_index]
    return shuffled_a, shuffled_b

def load_data():
    # Shuffle, then take the first 100 samples for training and the last 50 for testing.
    iris.data, iris.target = shuffle_in_unison(iris.data, iris.target)
    x_train, x_test = iris.data[:100], iris.data[100:]
    y_train, y_test = iris.target[:100].reshape(-1, 1), iris.target[100:].reshape(-1, 1)
    return x_train, y_train, x_test, y_test

x_train, y_train, x_test, y_test = load_data()

clfs = {
    'svm': svm.SVC(),
    'decision_tree': tree.DecisionTreeClassifier(),
    'naive_gaussian': naive_bayes.GaussianNB(),
    'naive_mul': naive_bayes.MultinomialNB(),
    'K_neighbor': neighbors.KNeighborsClassifier(),
    'bagging_knn': BaggingClassifier(neighbors.KNeighborsClassifier(), max_samples=0.5, max_features=0.5),
    'bagging_tree': BaggingClassifier(tree.DecisionTreeClassifier(), max_samples=0.5, max_features=0.5),
    'random_forest': RandomForestClassifier(n_estimators=50),
    'adaboost': AdaBoostClassifier(n_estimators=50),
    'gradient_boost': GradientBoostingClassifier(n_estimators=50, learning_rate=1.0, max_depth=1, random_state=0),
}

def try_different_method(clf):
    clf.fit(x_train, y_train.ravel())
    score = clf.score(x_test, y_test.ravel())
    print('the score is :', score)

for clf_key in clfs.keys():
    print('the classifier is :', clf_key)
    clf = clfs[clf_key]
    try_different_method(clf)

The results are as follows:

the classifier is : svm
the score is : 0.94
the classifier is : decision_tree
the score is : 0.88
the classifier is : naive_gaussian
the score is : 0.96
the classifier is : naive_mul
the score is : 0.8
the classifier is : K_neighbor
the score is : 0.94
the classifier is : gradient_boost
the score is : 0.88
the classifier is : adaboost
the score is : 0.62
the classifier is : bagging_tree
the score is : 0.94
the classifier is : bagging_knn
the score is : 0.94
the classifier is : random_forest
the score is : 0.92
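With only 50 test samples and a random shuffle, these scores fluctuate from run to run (AdaBoost's 0.62 in particular can look quite different on another split). As a supplement not in the original post, k-fold cross-validation averages over several splits and gives a more stable estimate; a minimal sketch for the SVM:

```python
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score

iris = load_iris()
# 5-fold cross-validation: each fold serves as the test set once,
# so the mean score uses every sample for testing exactly once.
scores = cross_val_score(svm.SVC(), iris.data, iris.target, cv=5)
print(scores.mean())
```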