Sklearn Study Notes 1: Feature Engineering, Preprocessing

Source: Internet  Editor: 程序博客网  Date: 2024/06/03 13:25

I. Preprocessing

1. Binarizer: binarization

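from sklearn.preprocessing import Binarizer
import numpy as np

# Binarization:
#   Use cases: estimators that assume Bernoulli-distributed (boolean) features; text data
#   Behavior: thresholds each numeric feature (values > threshold become 1, all others 0)
x_train = np.array([[1, 2, -1],
                    [2, 3, -2],
                    [1, -1, 1]])
bina = Binarizer(threshold=1.0, copy=True)
bina.fit(x_train)
bina.transform(x_train)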
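To make the thresholding rule concrete, here is a minimal sketch that compares Binarizer with a plain NumPy comparison on the same x_train. The variable names binarized and manual are only for illustration. Note that a value exactly equal to the threshold maps to 0, because the comparison is strict.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

x_train = np.array([[1, 2, -1],
                    [2, 3, -2],
                    [1, -1, 1]])

# Binarizer is stateless: fit() learns nothing, so fit_transform is enough.
binarized = Binarizer(threshold=1.0).fit_transform(x_train)

# Hand-written equivalent: strictly greater than the threshold -> 1, else 0.
manual = (x_train > 1.0).astype(int)

print(binarized)
# [[0 1 0]
#  [1 1 0]
#  [0 0 0]]
```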

2. Imputer: filling in missing values

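# Note: Imputer was deprecated in scikit-learn 0.20 and removed in 0.22
# (its replacement is sklearn.impute.SimpleImputer).
from sklearn.preprocessing import Imputer
import numpy as np

# Missing-value imputation:
#   Strategies: "mean", "median", "most_frequent"
x_train = np.array([[1, np.nan, -1],
                    [2, 3, -2],
                    [1, -1, 1]])
# axis=1 imputes along rows; axis=0 (the default) imputes along columns.
imp = Imputer(missing_values='NaN', strategy='mean', axis=1, verbose=0, copy=True)
imp.fit(x_train)
imp.transform(x_train)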
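On scikit-learn 0.22 and later the same idea can be sketched with SimpleImputer. This is not the code from the notes, just a hedged modern equivalent: SimpleImputer only works column-wise, so the row-wise behavior of axis=1 is approximated here by transposing before and after. The names filled_by_column and filled_by_row are illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

x_train = np.array([[1, np.nan, -1],
                    [2, 3, -2],
                    [1, -1, 1]], dtype=float)

# Column-wise mean imputation (the only direction SimpleImputer supports).
col_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled_by_column = col_imputer.fit_transform(x_train)

# Row-wise mean imputation (like the old axis=1): transpose, impute, transpose back.
filled_by_row = SimpleImputer(strategy="mean").fit_transform(x_train.T).T

print(filled_by_column[0, 1])  # mean of column 1: (3 + -1) / 2 = 1.0
print(filled_by_row[0, 1])     # mean of row 0:    (1 + -1) / 2 = 0.0
```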

3. Normalizer: normalization

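from sklearn.preprocessing import Normalizer
import numpy as np

# Normalization (each sample, i.e. each row, is rescaled to unit norm):
#   Use case: the inner product of two L2-normalized TF-IDF vectors is their
#   cosine similarity. The closer the cosine is to 1, the more closely the two
#   directions agree and the more similar the vectors are.
x_train = np.array([[1, -5, -1],
                    [2, 3, -2],
                    [1, -1, 1]])
# Norm options: 'l1', 'l2', 'max'
normalizer = Normalizer(norm='l2', copy=True)
normalizer.fit(x_train)
normalizer.transform(x_train)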
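The cosine-similarity remark can be checked directly. The sketch below reuses x_train in place of real TF-IDF vectors (an assumption made for brevity) and compares the inner product of the normalized rows with the textbook cosine formula. The names unit_rows, cosine and manual are illustrative.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

x_train = np.array([[1, -5, -1],
                    [2, 3, -2],
                    [1, -1, 1]], dtype=float)

# Rescale every row to unit L2 norm.
unit_rows = Normalizer(norm="l2").fit_transform(x_train)

# Pairwise inner products of unit vectors are pairwise cosine similarities.
cosine = unit_rows @ unit_rows.T

# Textbook cosine similarity between row 0 and row 1, for comparison.
a, b = x_train[0], x_train[1]
manual = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(np.isclose(cosine[0, 1], manual))  # True
```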

4. OneHotEncoder: one-hot encoding

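from sklearn.preprocessing import OneHotEncoder
import numpy as np

# One-hot encoding: encodes categorical features in one-of-K form
x_train = np.array([1, 3, 4]).reshape(-1, 1)
# n_values: number of values per feature
# categorical_features: which features to encode ('all', an array of indices, or a mask)
# dtype: output data type
# sparse: whether to return a sparse matrix
# handle_unknown: what to do when an unknown category appears during transform
# Note: n_values and categorical_features were deprecated in scikit-learn 0.20
# and removed in 0.22.
onehot = OneHotEncoder(n_values='auto', categorical_features='all', dtype=np.float64, sparse=True, handle_unknown='error')
onehot.fit(x_train)
print(onehot.transform(x_train).toarray())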
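For newer scikit-learn releases, a rough equivalent can be written without n_values and categorical_features. This is a sketch under the assumption of scikit-learn 0.20 or later: categories are inferred from the data, and selecting a subset of columns is normally left to ColumnTransformer. The sparse flag is deliberately left at its default, since its name differs between versions. The second encoder shows what handle_unknown='ignore' does with a value never seen during fit.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

x_train = np.array([1, 3, 4]).reshape(-1, 1)

# Categories are learned from the data; the default output is a sparse matrix.
encoder = OneHotEncoder(dtype=np.float64, handle_unknown="error")
encoded = encoder.fit_transform(x_train).toarray()
print(encoder.categories_)  # learned categories for the single feature: 1, 3, 4
print(encoded)
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]

# With handle_unknown="ignore", an unseen category becomes an all-zero row
# instead of raising an error.
lenient = OneHotEncoder(handle_unknown="ignore").fit(x_train)
unknown_row = lenient.transform(np.array([[2]])).toarray()
print(unknown_row)  # [[0. 0. 0.]]
```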

5. StandardScaler and MinMaxScaler: standardization and scaling

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
import numpy as np

# StandardScaler (standardization):
#   Use cases: e.g. PCA, SVM with an RBF kernel
#   Caveat: do not fit the scaler separately on the training set and the test set.
#   Fit it on the training set, then apply the same transform to the test set:
#       X_train = scaler.fit_transform(X_train)
#       X_test = scaler.transform(X_test)
x_train = np.array([[1, 2, -1],
                    [2, 3, -2],
                    [1, -1, 1]])
stan = StandardScaler(copy=True, with_mean=True, with_std=True)
stan.fit(x_train)
stan.transform(x_train)

# feature_range: the target range the features are scaled into
maxmin = MinMaxScaler(feature_range=(0, 1), copy=True)
maxmin.fit(x_train)
maxmin.transform(x_train)
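The train/test caveat is easier to see with numbers. In the sketch below X_test is a made-up array (not from the notes). The test set is scaled with the mean and standard deviation learned from the training set, which matches the hand-written formula. The same rule applies to MinMaxScaler, where test values outside the training range land outside feature_range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_train = np.array([[1, 2, -1],
                    [2, 3, -2],
                    [1, -1, 1]], dtype=float)
X_test = np.array([[0, 1, 0],
                   [3, 4, -3]], dtype=float)  # hypothetical test data

# Fit on the training set only, then reuse the learned statistics.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Same thing by hand: subtract the training mean, divide by the training std.
manual = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)
print(np.allclose(X_test_scaled, manual))  # True

# MinMaxScaler follows the same fit-on-train rule; test values beyond the
# training min/max fall outside (0, 1).
minmax = MinMaxScaler(feature_range=(0, 1)).fit(X_train)
X_test_minmax = minmax.transform(X_test)
print(X_test_minmax[:, 0])  # [-1.  2.]
```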

6. RobustScaler: robust scaling

RobustScaler(with_centering=True, with_scaling=True, quantile_range=(25.0, 75.0), copy=True)
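The notes give only the constructor here, so the following is a small sketch on made-up data with one outlier. With the default settings shown above, RobustScaler subtracts the median and divides by the interquartile range (75th minus 25th percentile), so a single extreme value barely affects how the remaining points are scaled. The array x and the name manual are illustrative.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical single feature with one obvious outlier.
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

robust = RobustScaler(with_centering=True, with_scaling=True,
                      quantile_range=(25.0, 75.0))
scaled = robust.fit_transform(x)

# Same computation by hand: (x - median) / (Q75 - Q25).
median = np.median(x, axis=0)
q25, q75 = np.percentile(x, [25, 75], axis=0)
manual = (x - median) / (q75 - q25)

print(robust.center_, robust.scale_)  # [3.] [2.]
print(scaled.ravel())                 # [-1.  -0.5  0.   0.5 48.5]
```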