Pipeline usage example


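Pipelines are not used very often, but they are very useful: they combine multiple steps into a single object. This makes it more convenient and flexible to tune and control the configuration of the whole model, instead of adjusting each step one at a time.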

Below, a pipeline combines several data-processing steps into one object: missing values are imputed first, and then the dataset is standardized.

Construct a dataset with missing values:

In [3]: from sklearn import datasets
   ...: import numpy as np
   ...: mat = datasets.make_spd_matrix(5)
   ...: masking_array = np.random.binomial(1, 0.1, mat.shape).astype(bool)
   ...: mat[masking_array] = np.nan
   ...: mat
   ...:
Out[3]:
array([[ 2.52711026,         nan, -1.04043074,  2.06389549,  0.47661882],
       [        nan,         nan,  0.15508676,  0.01415386, -0.06226449],
       [-1.04043074,  0.15508676,         nan, -1.19658827, -0.08381655],
       [ 2.06389549,         nan, -1.19658827,  3.12278438,  0.66169001],
       [ 0.47661882, -0.06226449, -0.08381655,  0.66169001,  0.5308833 ]])

Processing without a pipeline (the session uses preprocessing.Imputer, which was removed in scikit-learn 0.22; its replacement is sklearn.impute.SimpleImputer):

In [4]: from sklearn import preprocessing
   ...: impute = preprocessing.Imputer()
   ...: scaler = preprocessing.StandardScaler()
   ...: mat_imputed = impute.fit_transform(mat)
   ...: mat_imputed_scaled = scaler.fit_transform(mat_imputed)
   ...: mat_imputed_scaled
   ...:
Out[4]:
array([[  1.20941553e+00,   1.00955064e-16,  -9.52312512e-01,
          7.44689675e-01,   5.47324361e-01],
       [  0.00000000e+00,   1.00955064e-16,   1.32929273e+00,
         -6.05279457e-01,  -1.16750001e+00],
       [ -1.62858091e+00,   1.58113883e+00,   0.00000000e+00,
         -1.40267971e+00,  -1.23608259e+00],
       [  8.40925896e-01,   1.00955064e-16,  -1.25033395e+00,
          1.44207870e+00,   1.13625444e+00],
       [ -4.21760517e-01,  -1.58113883e+00,   8.73353728e-01,
         -1.78809211e-01,   7.20003790e-01]])

Processing with a pipeline:

In [5]: from sklearn import pipeline
   ...: pipe = pipeline.Pipeline([('impute', impute), ('scaler', scaler)])
   ...: pipe.fit_transform(mat)
   ...:
Out[5]:
array([[  1.20941553e+00,   1.00955064e-16,  -9.52312512e-01,
          7.44689675e-01,   5.47324361e-01],
       [  0.00000000e+00,   1.00955064e-16,   1.32929273e+00,
         -6.05279457e-01,  -1.16750001e+00],
       [ -1.62858091e+00,   1.58113883e+00,   0.00000000e+00,
         -1.40267971e+00,  -1.23608259e+00],
       [  8.40925896e-01,   1.00955064e-16,  -1.25033395e+00,
          1.44207870e+00,   1.13625444e+00],
       [ -4.21760517e-01,  -1.58113883e+00,   8.73353728e-01,
         -1.78809211e-01,   7.20003790e-01]])
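
For readers on a current scikit-learn release, here is a minimal sketch of the same comparison as a plain script. It assumes SimpleImputer from sklearn.impute (mean strategy by default) stands in for the older Imputer class, and it fixes random seeds for reproducibility; neither of those is part of the original session, so the numbers will differ from the output above.

```python
import numpy as np
from sklearn import datasets
from sklearn.impute import SimpleImputer  # assumption: stand-in for preprocessing.Imputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Seeds are an assumption added here; the original session was unseeded.
rng = np.random.default_rng(0)
mat = datasets.make_spd_matrix(5, random_state=0)
mask = rng.binomial(1, 0.1, mat.shape).astype(bool)
mat[mask] = np.nan

# Manual route: impute first, then standardize.
manual = StandardScaler().fit_transform(SimpleImputer().fit_transform(mat))

# Pipeline route: the same two steps wrapped in one object.
pipe = Pipeline([("impute", SimpleImputer()), ("scaler", StandardScaler())])
piped = pipe.fit_transform(mat)

# Report whether both routes produce the same array.
print(np.allclose(manual, piped))
```

Fresh estimator instances are used for each route so that neither one reuses state fitted by the other.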

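If a pipeline contains N objects, the first N-1 must implement both the fit and transform methods, and the Nth must implement at least fit; otherwise an error is raised. Once the pipeline's steps are set up, they run in order: each step is fitted and applied with fit and transform, and its result is passed on to the next transformation.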
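
The chaining described above can be sketched by hand and compared against a real Pipeline. This is only an illustration of the idea, not scikit-learn's internal code: the LinearRegression final step, the synthetic data, and the hand-picked missing entries are all assumptions introduced for the example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression  # illustrative final step
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for this sketch) with a few missing entries.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
X[0, 0] = X[5, 1] = X[7, 2] = np.nan

steps = [
    ("impute", SimpleImputer()),
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
]
pipe = Pipeline(steps).fit(X, y)

# Hand-rolled equivalent: the first N-1 steps are fitted and applied in order,
# each passing its output to the next; the last step only needs fit.
Xt = X
for _, transformer in steps[:-1]:
    Xt = clone(transformer).fit_transform(Xt)
final_model = clone(steps[-1][1]).fit(Xt, y)

# Report whether the pipeline and the manual chain give the same predictions.
print(np.allclose(pipe.predict(X), final_model.predict(Xt)))
```

clone is used so the manual chain starts from unfitted copies of the same estimators.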

Advantages of using a pipeline:

① First, convenience: the code is more concise, with no need to call fit and transform repeatedly.

② It can be combined with grid search to select model parameters.
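
As a rough sketch of the second point, a pipeline can be handed to GridSearchCV, with each parameter addressed as step name, two underscores, then parameter name. The Ridge model, the synthetic regression data, and the grid values below are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge  # illustrative model, not from the original
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption) with a few values knocked out in one column.
X, y = make_regression(n_samples=60, n_features=4, noise=1.0, random_state=0)
X[::10, 0] = np.nan

pipe = Pipeline([
    ("impute", SimpleImputer()),
    ("scaler", StandardScaler()),
    ("model", Ridge()),
])

# Keys follow the "<step name>__<parameter name>" convention.
param_grid = {
    "impute__strategy": ["mean", "median"],
    "model__alpha": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

Because the imputer and scaler sit inside the pipeline, they are refitted on each training fold, so the preprocessing and the model are tuned together.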





Reference: "sklearn-cookbook"