[Sklearn Application 5] Feature Selection (Part 1): SelectFromModel


Reference: the sklearn documentation at http://scikit-learn.org/stable/modules/feature_selection.html
sklearn version: 0.18.2

sklearn.feature_selection

  The module can be used for feature selection/dimensionality reduction on sample sets, either to improve estimators’ accuracy scores or to boost their performance on very high-dimensional datasets.
  The module is used for feature selection / dimensionality reduction, which can improve both accuracy and performance.

There are several approaches to feature selection; this post covers the first one: selection with SelectFromModel.

sklearn.feature_selection.SelectFromModel

 SelectFromModel is a meta-transformer that can be used along with any estimator that has a coef_ or feature_importances_ attribute after fitting. The features are considered unimportant and removed, if the corresponding coef_ or feature_importances_ values are below the provided threshold parameter. Apart from specifying the threshold numerically, there are built-in heuristics for finding a threshold using a string argument. Available heuristics are “mean”, “median” and float multiples of these like “0.1*mean”.
 After fitting, the estimator exposes coef_ or feature_importances_; features whose values fall below the chosen threshold are considered unimportant and are removed. coef_ (coefficients) applies to linear models, while non-linear models without coefficients use feature_importances_.
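The threshold can be given as a number or as one of the string heuristics listed above. A minimal sketch, not from the original post, with synthetic data used purely for illustration:

# Minimal sketch: using a heuristic threshold (synthetic data for illustration)
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

X, y = make_regression(n_samples=100, n_features=10, n_informative=3, random_state=0)
lasso = Lasso(alpha=0.1).fit(X, y)

# keep only features whose |coef_| is at least the mean absolute coefficient
selector = SelectFromModel(lasso, threshold="mean", prefit=True)
X_new = selector.transform(X)
print(X.shape, "->", X_new.shape)          # fewer columns remain after selection

# "median" or float multiples such as "0.5*mean" can be passed the same way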

SelectFromModel is typically used in two ways: L1-based feature selection and tree-based feature selection.

L1-based feature selection

Linear models penalized with the L1 norm have sparse solutions: many of their estimated coefficients are zero. When the goal is to reduce the dimensionality of the data to use with another classifier, they can be used along with feature_selection.SelectFromModel to select the non-zero coefficients. In particular, sparse estimators useful for this purpose are linear_model.Lasso for regression, and linear_model.LogisticRegression and svm.LinearSVC for classification.

  For linear models with coefficients, L1 regularization produces sparse solutions (many coefficients become exactly zero), which makes it convenient for feature selection. For the details of why L1 regularization induces sparsity, see: http://blog.csdn.net/jinping_shi/article/details/52433975

# class sklearn.feature_selection.SelectFromModel(estimator, threshold=None, prefit=False)
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso              # Lasso as the example L1-penalized linear model
lasso = Lasso()                                      # model parameters could be set here; defaults are used
lasso.fit(X, y)                                      # fit on X and y; the data must not contain missing values
model = SelectFromModel(lasso, prefit=True)          # prefit=True because lasso is already fitted
X_new = model.transform(X)                           # drops features with zero coefficients; for get_dummies data,
                                                     # an original feature is gone only if all its dummy columns are dropped
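For classification, an L1-penalized linear_model.LogisticRegression or svm.LinearSVC works the same way. A minimal sketch along the lines of the sklearn docs, with the iris data used purely for illustration:

from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)   # the L1 penalty requires dual=False
model = SelectFromModel(lsvc, prefit=True)
X_new = model.transform(X)
print(X.shape, "->", X_new.shape)          # (150, 4) -> (150, 3) with these settings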

Tree-based feature selection

 Tree-based estimators (see the sklearn.tree module and forest of trees in the sklearn.ensemble module) can be used to compute feature importances, which in turn can be used to discard irrelevant features (when coupled with the sklearn.feature_selection.SelectFromModel meta-transformer)
  For non-linear models without coefficients, feature importances are computed after fitting and used to discard irrelevant features.

from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor      # random forest as the example tree-based model
rf = RandomForestRegressor()                             # default parameters
rf.fit(X, y)
model = SelectFromModel(rf, prefit=True)                 # prefit=True because rf is already fitted
X_new = model.transform(X)
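To see which columns were kept, the selector's get_support() mask and the estimator's feature_importances_ can be inspected. A minimal self-contained sketch (synthetic data, for illustration only):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

X, y = make_regression(n_samples=200, n_features=8, n_informative=3, random_state=0)
rf = RandomForestRegressor(random_state=0).fit(X, y)
model = SelectFromModel(rf, prefit=True)

print(rf.feature_importances_)                             # one importance score per input feature
print("kept columns:", np.where(model.get_support())[0])   # indices of the selected features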