scikit-learn: 4.1. Pipeline and FeatureUnion: combining estimators (combining features with a predictor; combining features with features)


Written while sick in an internet café... a little encouragement would be appreciated.

http://scikit-learn.org/stable/modules/pipeline.html

 

 

1. What are Pipeline and FeatureUnion for?

Pipeline was introduced earlier: it chains transformers together with a final estimator.

FeatureUnion, as the name suggests, concatenates the output vectors of several transformers into one larger vector.



2. The difference between the two:

The former is serial feature processing: each transformer operates on the feature output of the previous one;

the latter is parallel feature processing: the outputs of all transformers are concatenated into one large feature vector.
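A minimal sketch of the contrast (the particular transformers here are just illustrative):

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

# serial: PCA sees StandardScaler's output, not the raw input
serial = Pipeline([('scale', StandardScaler()), ('pca', PCA(n_components=2))])

# parallel: PCA and SelectKBest both see the raw input,
# and their outputs are concatenated side by side
parallel = FeatureUnion([('pca', PCA(n_components=2)), ('kbest', SelectKBest(k=2))])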



3. Pipeline: chaining estimators

Pipeline can be used to chain multiple estimators into one. This is useful because processing data usually follows a fixed sequence of steps, e.g. feature selection, normalization, classification. Pipeline therefore serves two purposes:

Convenience: a single fit or predict call runs every estimator in the chain.

Joint parameter selection: a single grid search can tune the parameters of all estimators at once.


All estimators in a pipeline, except the last, must be transformers (i.e. have a transform method); the last estimator may be of any type (transformer, classifier, etc.).


Usage: a Pipeline is built from a list of (key, value) pairs that chains all the estimators together; the key is an arbitrary name you give the step, and the value is an estimator object. For example:

>>> from sklearn.pipeline import Pipeline
>>> from sklearn.svm import SVC
>>> from sklearn.decomposition import PCA
>>> estimators = [('reduce_dim', PCA()), ('svm', SVC())]
>>> clf = Pipeline(estimators)
>>> clf
Pipeline(steps=[('reduce_dim', PCA(copy=True, n_components=None,
    whiten=False)), ('svm', SVC(C=1.0, cache_size=200, class_weight=None,
    coef0=0.0, degree=3, gamma=0.0, kernel='rbf', max_iter=-1,
    probability=False, random_state=None, shrinking=True, tol=0.001,
    verbose=False))])
The estimators of each stage are stored in the steps attribute, and each one can be retrieved by index:

>>> clf.steps[0]
('reduce_dim', PCA(copy=True, n_components=None, whiten=False))
or by name, since named_steps exposes them as a dict:

>>> clf.named_steps['reduce_dim']
PCA(copy=True, n_components=None, whiten=False)
Want to change an estimator's parameter values? Use the <estimator>__<parameter> syntax, for example:

>>> clf.set_params(svm__C=10)
Pipeline(steps=[('reduce_dim', PCA(copy=True, n_components=None,
    whiten=False)), ('svm', SVC(C=10, cache_size=200, class_weight=None,
    coef0=0.0, degree=3, gamma=0.0, kernel='rbf', max_iter=-1,
    probability=False, random_state=None, shrinking=True, tol=0.001,
    verbose=False))])

The ultimate goal, grid searches:

>>> from sklearn.grid_search import GridSearchCV
>>> params = dict(reduce_dim__n_components=[2, 5, 10],
...               svm__C=[0.1, 10, 100])
>>> grid_search = GridSearchCV(clf, param_grid=params)
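A small sketch of actually running that search (load_digits stands in for real data here because its 64 input dimensions keep n_components up to 10 valid; best_params_ and best_score_ are standard GridSearchCV attributes):

from sklearn.datasets import load_digits

digits = load_digits()
grid_search.fit(digits.data, digits.target)  # fits the whole pipeline for every parameter combination
print(grid_search.best_params_)              # e.g. {'reduce_dim__n_components': 10, 'svm__C': 10}
print(grid_search.best_score_)               # mean cross-validated score of that combination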

And here comes the classic text-classification example:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.grid_search import GridSearchCV
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# define a pipeline combining a text feature extractor with a simple
# classifier
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])

# uncommenting more parameters will give better exploring power but will
# increase processing time in a combinatorial way
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    #'vect__max_features': (None, 5000, 10000, 50000),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    #'tfidf__use_idf': (True, False),
    #'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
    #'clf__n_iter': (10, 50, 80),
}

if __name__ == "__main__":
    # multiprocessing requires the fork to happen in a __main__ protected
    # block

    # find the best parameters for both the feature extraction and the
    # classifier
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)
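For completeness, a sketch of feeding that search some data; the example script this block comes from drives it with the 20 newsgroups corpus, and the two categories here are just to keep the run small:

    # (continuing inside the __main__ block above)
    from sklearn.datasets import fetch_20newsgroups

    data = fetch_20newsgroups(subset='train',
                              categories=['alt.atheism', 'talk.religion.misc'])
    grid_search.fit(data.data, data.target)  # raw text goes in; CountVectorizer handles it
    print(grid_search.best_score_)
    print(grid_search.best_params_)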


Notes (important, so quoted verbatim from the docs):

Calling fit on the pipeline is the same as calling fit on each estimator in turn, transforming the input and passing it on to the next step.

The pipeline has all the methods that the last estimator in the pipeline has, i.e. if the last estimator is a classifier, the Pipeline can be used as a classifier. If the last estimator is a transformer, again, so is the pipeline.
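A minimal hand-written sketch of what those notes mean (conceptual only, not the library's actual implementation):

# roughly what clf.fit(X, y) does:
Xt = X
for name, transformer in clf.steps[:-1]:
    Xt = transformer.fit(Xt, y).transform(Xt)  # fit each transformer, pass its output on
clf.steps[-1][1].fit(Xt, y)                    # only the last step sees the final features

# and because the last step of clf is a classifier, clf.predict(X) works too:
# it pushes X through every transform, then calls the classifier's predict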




4. FeatureUnion: composite feature spaces

The FeatureUnion description, quoted from the docs because it matters:

FeatureUnion combines several transformer objects into a new transformer that combines their output. A FeatureUnion takes a list of transformer objects. During fitting, each of these is fit to the data independently. For transforming data, the transformers are applied in parallel, and the sample vectors they output are concatenated end-to-end into larger vectors.
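A minimal sketch of that behavior on dense data (the real implementation also handles sparse output, transformer weights, and n_jobs):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

X, y = load_iris().data, load_iris().target

union = FeatureUnion([('pca', PCA(n_components=2)), ('kbest', SelectKBest(k=1))])
X_union = union.fit(X, y).transform(X)  # each transformer was fit to X independently

# the output is just the per-transformer outputs glued end-to-end:
manual = np.hstack([trans.transform(X) for name, trans in union.transformer_list])
assert X_union.shape == (len(X), 2 + 1)  # 2 PCA components + 1 selected feature
assert np.allclose(X_union, manual)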


Like Pipeline, FeatureUnion exists for convenience and for joint parameter estimation; the two can also be combined to build more complex models.


(FeatureUnion does not care whether two transformers produce the same features; it simply concatenates everything, so checking for duplicates is up to you...)
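A quick illustration of that caveat, using two deliberately identical transformers:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion

X = load_iris().data

# two identical transformers -> every output feature appears twice
union = FeatureUnion([('pca_a', PCA(n_components=2)), ('pca_b', PCA(n_components=2))])
print(union.fit_transform(X).shape)  # (150, 4): the duplicate columns are kept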



Usage: a FeatureUnion is built from a list of (key, value) pairs that combines all the transformers; the key is an arbitrary name you give each step, and the value is an estimator object. For example:

>>> from sklearn.pipeline import FeatureUnion
>>> from sklearn.decomposition import PCA
>>> from sklearn.decomposition import KernelPCA
>>> estimators = [('linear_pca', PCA()), ('kernel_pca', KernelPCA())]
>>> combined = FeatureUnion(estimators)
>>> combined
FeatureUnion(n_jobs=1, transformer_list=[('linear_pca', PCA(copy=True,
    n_components=None, whiten=False)), ('kernel_pca', KernelPCA(alpha=1.0,
    coef0=1, degree=3, eigen_solver='auto', fit_inverse_transform=False,
    gamma=None, kernel='linear', kernel_params=None, max_iter=None,
    n_components=None, remove_zero_eig=False, tol=0))],
    transformer_weights=None)
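The <name>__<parameter> syntax from section 3 reaches into the union's members as well; a small sketch (the parameter values are arbitrary):

from sklearn.datasets import load_iris

# set parameters on individual members of the union by name:
combined.set_params(linear_pca__n_components=2, kernel_pca__n_components=2)

X = load_iris().data
print(combined.fit_transform(X).shape)  # (150, 4): two columns from each PCA, side by side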

Finally, an example:

http://scikit-learn.org/stable/auto_examples/feature_stacker.html#example-feature-stacker-py

Thanks to

Author: Andreas Mueller <amueller@ais.uni-bonn.de>

# Author: Andreas Mueller <amueller@ais.uni-bonn.de>
#
# License: BSD 3 clause

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

iris = load_iris()

X, y = iris.data, iris.target

# This dataset is way too high-dimensional. Better do PCA:
pca = PCA(n_components=2)

# Maybe some original features were good, too?
selection = SelectKBest(k=1)

# Build estimator from PCA and univariate selection:
combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])

# Use combined features to transform dataset:
X_features = combined_features.fit(X, y).transform(X)

svm = SVC(kernel="linear")

# Do grid search over k, n_components and C:
pipeline = Pipeline([("features", combined_features), ("svm", svm)])

param_grid = dict(features__pca__n_components=[1, 2, 3],
                  features__univ_select__k=[1, 2],
                  svm__C=[0.1, 1, 10])

grid_search = GridSearchCV(pipeline, param_grid=param_grid, verbose=10)
grid_search.fit(X, y)
print(grid_search.best_estimator_)
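Note the doubly nested parameter names in param_grid: features__pca__n_components reaches through the pipeline step named "features" into the FeatureUnion member named "pca". The <name>__<parameter> convention composes to arbitrary depth, which is what lets one grid search tune the PCA dimensionality, the univariate selection, and the SVM's C together.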


Done. Looks like feature extraction is going to save a lot of work from now on...



