Pipeline in sklearn


In general, building a model with sklearn involves the following steps:


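0. Start
1. Split the data into a training set and a test set (and, optionally, a validation set)
2. Preprocess the data
3. Select features
4. Choose a model
5. Tune the hyperparameters with GridSearchCV
6. End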
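Steps 1 to 5 above can be sketched in one short script. This is only an illustration, not the author's code: the digits dataset, `SelectKBest` with `chi2` as the feature selector, the step name `select`, the grid values and `random_state=0` are all assumptions chosen to keep the example small.

```python
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Step 1: split into training and test sets (dataset choice is an assumption).
x, y = load_digits(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0
)

# Steps 2-4: preprocessing, feature selection and the model, chained in order.
pipe = Pipeline([
    ("norm", MinMaxScaler()),           # step 2: scale features to [0, 1]
    ("select", SelectKBest(chi2)),      # step 3: keep the k best-scoring features
    ("clf", SVC(kernel="linear")),      # step 4: the classifier
])

# Step 5: grid search over parameters addressed as <step name>__<parameter>.
param_grid = {"select__k": [20, 40], "clf__C": [0.1, 1.0]}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(x_train, y_train)

best_params = search.best_params_
test_score = search.score(x_test, y_test)
print(best_params, test_score)
```

Because the scaler and the selector live inside the pipeline, each cross-validation fold refits them on that fold's training portion only, so nothing from the held-out portion leaks into preprocessing.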


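In this workflow, the preprocessing step usually means calling fit_transform on the training set and then transform on the test set, which gets tedious. A Pipeline chains these steps together so that they run as a single estimator.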
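To make the fit_transform/transform bookkeeping concrete, here is a minimal sketch that does the scaling by hand and then repeats it with a Pipeline. The digits dataset, the fixed `random_state=0` and the SVC settings are assumptions for illustration; both routes are expected to give the same predictions.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

x, y = load_digits(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0
)

# Manual route: fit the scaler on the training set, then reuse it on the test set.
scaler = MinMaxScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)
model = SVC(C=0.1, kernel="linear")
model.fit(x_train_scaled, y_train)
manual_pred = model.predict(x_test_scaled)

# Pipeline route: fit() calls fit_transform on each transformer and fit on the
# final estimator; predict() calls transform and then predict.
pipe = Pipeline([("norm", MinMaxScaler()), ("clf", SVC(C=0.1, kernel="linear"))])
pipe.fit(x_train, y_train)
pipe_pred = pipe.predict(x_test)

same = np.array_equal(manual_pred, pipe_pred)
print(same)
```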
The code below walks through an example:

# Import the required packages
In [296]: import numpy as np
In [297]: from sklearn.datasets import load_digits
In [299]: from sklearn.svm import SVC
In [300]: from sklearn.preprocessing import MinMaxScaler
In [301]: from sklearn.pipeline import Pipeline
In [304]: from sklearn.model_selection import train_test_split

# Note: the session as published never defines x and y. They presumably come
# from load_digits, e.g. x, y = load_digits(return_X_y=True)

# Split into training and test sets
In [307]: x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

# Data preprocessing
In [308]: scaler = MinMaxScaler()

# Model selection
In [309]: model = SVC(probability=True)

# A Pipeline is built from a list of 2-tuples: the first element of each tuple
# is a name you choose, the second is the object that does the processing.
# The steps must be listed in the order in which they should run.
In [310]: pipe = Pipeline([('norm', scaler), ('clf', model)])

# Set (or tune) a parameter of a step with the name chosen above + '__'
# (double underscore) + the parameter name of that step's object.
In [311]: pipe.set_params(clf__C=0.1, clf__kernel='linear')
Out[311]:
Pipeline(steps=[('norm', MinMaxScaler(copy=True, feature_range=(0, 1))), ('clf', SVC(C=0.1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

In [312]: pipe.fit(x_train, y_train)
Out[312]:
Pipeline(steps=[('norm', MinMaxScaler(copy=True, feature_range=(0, 1))), ('clf', SVC(C=0.1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

In [313]: pipe.score(x_train, y_train)
Out[313]: 0.98488464598249803

In [314]: pipe.score(x_test, y_test)
Out[314]: 0.97222222222222221

In [315]: pipe.predict_proba(x_test)
Out[315]:
array([[  4.69962015e-04,   4.97407244e-03,   9.79898456e-01, ...,
          5.75429245e-04,   2.65709061e-03,   1.17111690e-03],
       [  2.40777940e-02,   1.27367464e-02,   2.99985439e-03, ...,
          7.33476986e-01,   1.10373410e-02,   1.16970592e-01],
       [  1.24980459e-03,   1.13328865e-03,   1.69594075e-04, ...,
          1.40292171e-02,   4.47836232e-03,   9.48609308e-01],
       ...,
       [  7.41032366e-04,   7.42471241e-04,   2.00989211e-03, ...,
          6.78079248e-03,   9.75634942e-01,   7.76885127e-03],
       [  1.10805460e-03,   1.64272554e-03,   1.62070859e-03, ...,
          3.38955798e-04,   9.84667618e-03,   3.12873187e-03],
       [  7.83652624e-01,   2.92113219e-03,   3.39037889e-02, ...,
          5.34675567e-03,   1.22726588e-02,   5.98495463e-02]])

# Turn off scientific notation so the probabilities are easier to read
In [317]: np.set_printoptions(suppress=True)

In [318]: pipe.predict_proba(x_test)
Out[318]:
array([[ 0.00046996,  0.00497407,  0.97989846, ...,  0.00057543,
         0.00265709,  0.00117112],
       [ 0.02407779,  0.01273675,  0.00299985, ...,  0.73347699,
         0.01103734,  0.11697059],
       [ 0.0012498 ,  0.00113329,  0.00016959, ...,  0.01402922,
         0.00447836,  0.94860931],
       ...,
       [ 0.00074103,  0.00074247,  0.00200989, ...,  0.00678079,
         0.97563494,  0.00776885],
       [ 0.00110805,  0.00164273,  0.00162071, ...,  0.00033896,
         0.00984668,  0.00312873],
       [ 0.78365262,  0.00292113,  0.03390379, ...,  0.00534676,
         0.01227266,  0.05984955]])
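The name + '__' + parameter convention used with set_params can be checked by inspecting the pipeline itself. The sketch below rebuilds the same two-step pipeline without fitting it; the variable names `clf`, `params` and `tunable` are only illustrative.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

pipe = Pipeline([("norm", MinMaxScaler()), ("clf", SVC(probability=True))])
pipe.set_params(clf__C=0.1, clf__kernel="linear")

# named_steps gives access to each step by the name chosen in the constructor.
clf = pipe.named_steps["clf"]
print(clf.C, clf.kernel)

# get_params() lists every parameter; keys containing '__' are the ones that
# set_params (and a GridSearchCV parameter grid) can address on a step.
params = pipe.get_params()
tunable = sorted(name for name in params if "__" in name)
print(tunable)
```

Listing the keys this way is a quick check on spelling before passing a parameter grid to GridSearchCV, since a misspelled key raises an error only when the search is run.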