A Summary of Using the sklearn Pipeline Module

Source: Internet | Editor: 程序博客网 | Time: 2024/06/05 02:05

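Pipeline chains several estimators into one. This is useful because a typical machine learning workflow is a fixed sequence of steps, for example feature extraction, normalization and classification. A pipeline serves two purposes:

Convenience: you call fit and predict only once on your data to fit the whole sequence of estimators

Joint parameter selection: combined with grid search, the parameters of every estimator in the pipeline can be selected at once

All estimators in a pipeline except the last one must be transformers (that is, they must have a transform method). The last estimator may be a transformer or a classifier

A Pipeline is built from a list of (key, value) tuples. The key is a string that names the step and can be chosen freely; the value is an estimator object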
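The convenience point can be illustrated with a small sketch. It assumes the iris dataset and a StandardScaler followed by LogisticRegression, neither of which comes from the original text; it compares fitting each stage by hand with a single fit call on a pipeline.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data set (an assumption, not part of the original article).
X, y = load_iris(return_X_y=True)

# Manual workflow: fit and apply each stage by hand.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_scaled, y)
manual_pred = clf.predict(scaler.transform(X))

# Pipeline workflow: one fit call fits every stage in order,
# one predict call transforms the data and then predicts.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)
pipe_pred = pipe.predict(X)

# Both workflows run the same computations, so the predictions should agree.
same = bool(np.array_equal(manual_pred, pipe_pred))
print(same)
```

Besides being shorter, the pipeline version makes it harder to forget the transform step at prediction time.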

① Instantiating a pipeline object with Pipeline

In [1]: from sklearn.pipeline import Pipeline
   ...: from sklearn.svm import SVC
   ...: from sklearn.decomposition import PCA
   ...: estimators = [('reduce_dim', PCA()), ('clf', SVC())]
   ...: pipe = Pipeline(estimators)
   ...: pipe
   ...:
Out[1]:
Pipeline(steps=[('reduce_dim', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('clf', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

② Constructing a Pipeline object with make_pipeline

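sklearn.pipeline.make_pipeline(*steps): a shorthand constructor that neither requires nor permits naming the estimators. Each step is named automatically after the lowercase name of its estimator type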

In [2]: from sklearn.naive_bayes import GaussianNB
   ...: from sklearn.preprocessing import StandardScaler
   ...: from sklearn.pipeline import make_pipeline
   ...: make_pipeline(StandardScaler(), GaussianNB(priors=None))
   ...:
Out[2]:
Pipeline(steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('gaussiannb', GaussianNB(priors=None))])
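The automatic naming can be checked directly. The sketch below also shows what happens when the same estimator type appears twice; the second pipeline, with two PCA steps, is a hypothetical example chosen only to show the numbered suffixes.

```python
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Step names are derived from the lowercase class names.
pipe = make_pipeline(StandardScaler(), GaussianNB())
names = [name for name, _ in pipe.steps]
print(names)  # ['standardscaler', 'gaussiannb']

# Hypothetical pipeline with a repeated estimator type:
# the generated names get a numeric suffix so that they stay unique.
dup = make_pipeline(PCA(n_components=3), PCA(n_components=2))
dup_names = [name for name, _ in dup.steps]
print(dup_names)  # ['pca-1', 'pca-2']
```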

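Ways to access the estimators in a pipeline:

① The estimators are stored as a list of tuples in the steps attribute, so a specific estimator can be reached by list index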

In [4]: pipe.steps
Out[4]:
[('reduce_dim',
  PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)),
 ('clf', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False))]

In [5]: pipe.steps[0]
Out[5]:
('reduce_dim', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
   svd_solver='auto', tol=0.0, whiten=False))

② The estimators are also stored as a dict in the named_steps attribute, so a specific estimator can be reached by its name

In [2]: pipe.named_steps
Out[2]:
{'clf': SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
   decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
   max_iter=-1, probability=False, random_state=None, shrinking=True,
   tol=0.001, verbose=False),
 'reduce_dim': PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
   svd_solver='auto', tol=0.0, whiten=False)}

In [3]: pipe.named_steps['reduce_dim']
Out[3]:
PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)
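Both access styles return the very same estimator objects, which matters after fitting because the learned attributes live on those objects. A minimal sketch, assuming the iris dataset and n_components=2 (neither is in the original):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Illustrative data and settings (assumptions for this sketch).
X, y = load_iris(return_X_y=True)
pipe = Pipeline([("reduce_dim", PCA(n_components=2)), ("clf", SVC())])
pipe.fit(X, y)

# Access by position: each entry of steps is a (name, estimator) tuple.
name, first = pipe.steps[0]

# Access by name: named_steps maps step names to the same objects.
pca = pipe.named_steps["reduce_dim"]
same_object = first is pca

# Fitted attributes are available on the retrieved steps.
components_shape = pca.components_.shape  # (n_components, n_features)
n_inputs_to_clf = pipe.named_steps["clf"].support_vectors_.shape[1]
print(name, same_object, components_shape, n_inputs_to_clf)
```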

③ Estimator parameters can be set using the <estimator>__<parameter> syntax

In [6]: pipe.set_params(clf__C=10)
Out[6]:
Pipeline(steps=[('reduce_dim', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('clf', SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

In [7]: # get_params only takes the boolean argument `deep`; it returns every parameter as a dict
   ...: pipe.get_params()
Out[7]:
{'clf': SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
   decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
   max_iter=-1, probability=False, random_state=None, shrinking=True,
   tol=0.001, verbose=False),
 'clf__C': 10,
 'clf__cache_size': 200,
 'clf__class_weight': None,
 'clf__coef0': 0.0,
 'clf__decision_function_shape': None,
 'clf__degree': 3,
 'clf__gamma': 'auto',
 'clf__kernel': 'rbf',
 'clf__max_iter': -1,
 'clf__probability': False,
 'clf__random_state': None,
 'clf__shrinking': True,
 'clf__tol': 0.001,
 'clf__verbose': False,
 'reduce_dim': PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
   svd_solver='auto', tol=0.0, whiten=False),
 'reduce_dim__copy': True,
 'reduce_dim__iterated_power': 'auto',
 'reduce_dim__n_components': None,
 'reduce_dim__random_state': None,
 'reduce_dim__svd_solver': 'auto',
 'reduce_dim__tol': 0.0,
 'reduce_dim__whiten': False,
 'steps': [('reduce_dim',
   PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
     svd_solver='auto', tol=0.0, whiten=False)),
  ('clf', SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
     decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
     max_iter=-1, probability=False, random_state=None, shrinking=True,
     tol=0.001, verbose=False))]}

In [9]: pipe.get_params()['clf__C']
Out[9]: 10
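As a compact check of the double-underscore syntax, the sketch below sets two nested parameters and reads them back in several ways. The value n_components=2 is an arbitrary choice made for illustration.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipe = Pipeline([("reduce_dim", PCA()), ("clf", SVC())])

# <step name>__<parameter name> routes the value to the right estimator.
pipe.set_params(clf__C=10, reduce_dim__n_components=2)

# Read back through the flat parameter dict ...
c_value = pipe.get_params()["clf__C"]
# ... or directly on the step objects.
c_on_step = pipe.named_steps["clf"].C
n_components = pipe.named_steps["reduce_dim"].n_components

# deep=False returns only the pipeline's own constructor arguments,
# without the nested <step>__<parameter> entries.
shallow_keys = set(pipe.get_params(deep=False))
nested_names = sorted(k for k in pipe.get_params() if k.startswith("clf__"))
print(c_value, c_on_step, n_components)
```

The same flat names are what a parameter grid uses as its keys.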

④ Tuning parameters with grid search (GridSearchCV)

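# GridSearchCV must be imported as well; it lives in sklearn.model_selection
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Whole steps (reduce_dim, clf) and nested parameters (clf__C) can be searched together
param_grid = dict(reduce_dim=[None, PCA(5), PCA(10)],
                  clf=[SVC(), LogisticRegression()],
                  clf__C=[0.1, 10, 100])
grid_search = GridSearchCV(pipe, param_grid=param_grid)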
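The grid above is only constructed, not fit. A runnable variant might look like the following sketch. It makes several assumptions that are not in the original: the iris dataset (which has only 4 features, hence PCA(n_components=2) instead of PCA(5) and PCA(10)), cv=3, a smaller C grid, and the string 'passthrough' to skip the reduction step, which recent scikit-learn releases accept in place of None.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Illustrative data set (an assumption for this sketch).
X, y = load_iris(return_X_y=True)
pipe = Pipeline([("reduce_dim", PCA()), ("clf", SVC())])

param_grid = {
    # Either skip dimensionality reduction or project onto 2 components.
    "reduce_dim": ["passthrough", PCA(n_components=2)],
    # Swap the final estimator itself ...
    "clf": [SVC(), LogisticRegression(max_iter=1000)],
    # ... and tune a parameter that both candidate classifiers share.
    "clf__C": [0.1, 10],
}
grid_search = GridSearchCV(pipe, param_grid=param_grid, cv=3)
grid_search.fit(X, y)

best_params = grid_search.best_params_
best_score = grid_search.best_score_
n_candidates = len(grid_search.cv_results_["params"])  # 2 * 2 * 2 combinations
print(n_candidates, best_params, round(best_score, 3))
```

Because the whole pipeline is refit inside every cross-validation fold, the preprocessing steps never see the held-out data.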

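FeatureUnion combines several transformers into a single new transformer. It is built from a list of transformer objects. During fitting, each transformer is fit to the data independently. During transformation, the transformers are applied in parallel and the sample matrices they output are concatenated side by side into one larger matrix. FeatureUnion serves the same purposes as Pipeline (convenience and joint parameter selection), and the two can be combined to build complex models.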
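The side-by-side concatenation is easy to verify from the output shape. In this sketch the iris dataset, the component counts and the rbf kernel are all illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, KernelPCA
from sklearn.pipeline import FeatureUnion

# Illustrative data set (an assumption for this sketch).
X, _ = load_iris(return_X_y=True)

combined = FeatureUnion([
    ("linear_pca", PCA(n_components=2)),
    ("kernel_pca", KernelPCA(n_components=3, kernel="rbf")),
])
X_combined = combined.fit_transform(X)

# 2 columns from PCA plus 3 columns from KernelPCA, same number of rows.
combined_shape = X_combined.shape

# The first block of columns is what PCA alone would have produced.
X_pca_alone = PCA(n_components=2).fit_transform(X)
first_block_matches = bool(np.allclose(X_combined[:, :2], X_pca_alone))
print(combined_shape, first_block_matches)
```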

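A FeatureUnion is built from a list of (key, value) tuples. The key is a freely chosen string naming the transformation step; the value is an estimator object

① Instantiating a FeatureUnion object with FeatureUnion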

In [12]: from sklearn.pipeline import FeatureUnion
    ...: from sklearn.decomposition import PCA
    ...: from sklearn.decomposition import KernelPCA
    ...: estimators = [('linear_pca', PCA()), ('kernel_pca', KernelPCA())]
    ...: combined = FeatureUnion(estimators)
    ...: combined
    ...:
Out[12]:
FeatureUnion(n_jobs=1,
       transformer_list=[('linear_pca', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('kernel_pca', KernelPCA(alpha=1.0, coef0=1, copy_X=True, degree=3, eigen_solver='auto',
     fit_inverse_transform=False, gamma=None, kernel='linear',
     kernel_params=None, max_iter=None, n_components=None, n_jobs=1,
     random_state=None, remove_zero_eig=False, tol=0))],
       transformer_weights=None)

② Instantiating a FeatureUnion object with make_union

In [14]: from sklearn.pipeline import make_union
    ...: from sklearn.decomposition import PCA
    ...: from sklearn.decomposition import KernelPCA
    ...: make_union(PCA(), KernelPCA())
    ...:
Out[14]:
FeatureUnion(n_jobs=1,
       transformer_list=[('pca', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('kernelpca', KernelPCA(alpha=1.0, coef0=1, copy_X=True, degree=3, eigen_solver='auto',
     fit_inverse_transform=False, gamma=None, kernel='linear',
     kernel_params=None, max_iter=None, n_components=None, n_jobs=1,
     random_state=None, remove_zero_eig=False, tol=0))],
       transformer_weights=None)
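Since a FeatureUnion is itself a transformer, it can serve as one step of a Pipeline. The sketch below is one possible way to combine the two; the dataset, the component counts, the rbf kernel and the linear SVC are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, KernelPCA
from sklearn.pipeline import make_pipeline, make_union
from sklearn.svm import SVC

# Illustrative data set (an assumption for this sketch).
X, y = load_iris(return_X_y=True)

# Feature stage: two decompositions computed side by side.
features = make_union(PCA(n_components=2),
                      KernelPCA(n_components=2, kernel="rbf"))

# Full model: the union feeds its concatenated output to the classifier.
model = make_pipeline(features, SVC(kernel="linear"))
model.fit(X, y)

step_names = [name for name, _ in model.steps]
n_features_seen = model.named_steps["svc"].support_vectors_.shape[1]
train_accuracy = model.score(X, y)
print(step_names, n_features_seen, round(train_accuracy, 3))
```

Nested parameters stay reachable through chained names such as featureunion__pca__n_components.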
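As with Pipeline, a step can be removed with the set_params method by setting the corresponding parameter to None (recent scikit-learn releases use the string 'drop' for this instead of None)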

In [15]: combined.set_params(kernel_pca=None)
Out[15]:
FeatureUnion(n_jobs=1,
       transformer_list=[('linear_pca', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('kernel_pca', None)],
       transformer_weights=None)
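The effect of removing a step shows up in the width of the transformed matrix. This sketch assumes a recent scikit-learn release, where the removed transformer is marked with the string 'drop' rather than None as in the session above; the dataset and component counts are also illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, KernelPCA
from sklearn.pipeline import FeatureUnion

# Illustrative data set (an assumption for this sketch).
X, _ = load_iris(return_X_y=True)

combined = FeatureUnion([
    ("linear_pca", PCA(n_components=2)),
    ("kernel_pca", KernelPCA(n_components=2)),
])
width_before = combined.fit_transform(X).shape[1]

# Disable the second transformer; only the PCA columns remain afterwards.
combined.set_params(kernel_pca="drop")
width_after = combined.fit_transform(X).shape[1]
print(width_before, width_after)
```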