sklearn learning: Dimensionality reduction - feature selection


This post condenses the corresponding scikit-learn documentation, with an emphasis on code examples.

Source documentation: http://scikit-learn.org/stable/modules/feature_selection.html#feature-selection


1.13. Feature selection

Purpose of feature selection: improve an estimator's accuracy score, and boost its performance on high-dimensional datasets.

  • 1.13.1. Removing features with low variance

>>> from sklearn.feature_selection import VarianceThreshold
>>> X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
>>> sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
>>> sel.fit_transform(X)
array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])
    Notes on VarianceThreshold:
  • Default: removes only zero-variance features, i.e. features that take the same value in every sample.
  • VarianceThreshold(threshold=(.8 * (1 - .8))): the example assumes boolean features (taking values 0 or 1), and the threshold parameter is a variance; a boolean feature follows a Bernoulli distribution with variance p(1 - p) = 0.8 * (1 - 0.8), so features that are 0 or 1 in more than 80% of the samples are removed.
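As a minimal sketch of the default behavior (my example, not from the linked docs): with no threshold argument, only a constant column is dropped.

>>> from sklearn.feature_selection import VarianceThreshold
>>> X = [[0, 2, 1], [0, 1, 4], [0, 1, 1]]  # first column is constant
>>> VarianceThreshold().fit_transform(X)   # default threshold=0.0
array([[2, 1],
       [1, 4],
       [1, 1]])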
  • 1.13.2. Univariate feature selection
    Univariate feature selection works by selecting the best features based on univariate statistical tests. It proceeds in two steps: score each feature, then select features. The selection routines are:
  • SelectKBest: keeps the K best features according to the scoring function
  • SelectPercentile: keeps a user-specified percentage of the highest-scoring features
  • SelectFpr: selects based on a false positive rate test
  • SelectFdr: selects based on an estimated false discovery rate
  • SelectFwe: selects based on the family-wise error rate
  • GenericUnivariateSelect: univariate selection with a configurable strategy

>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectKBest
>>> from sklearn.feature_selection import chi2
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
>>> X_new.shape
(150, 2)
class sklearn.feature_selection.SelectKBest(score_func=<function f_classif>, k=10)
Attributes:
scores_ : array-like, shape=(n_features,). Scores of features.
pvalues_ : array-like, shape=(n_features,). p-values of feature scores.
Scoring functions:
  • For regression: f_regression
  • For classification: chi2 or f_classif
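A minimal sketch (my addition) showing f_regression on a synthetic regression task, together with the scores_ and pvalues_ attributes listed above:

from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# synthetic regression data: 100 samples, 10 features, 3 of them informative
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       random_state=0)
selector = SelectKBest(f_regression, k=3)
X_new = selector.fit_transform(X, y)
print(X_new.shape)        # (100, 3)
print(selector.scores_)   # one F-score per feature
print(selector.pvalues_)  # one p-value per feature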

1.13.3. Recursive feature elimination (RFE)
Recursive feature elimination:
First, the estimator is trained on the initial set of features and weights are assigned to each one of them.
Then, the features whose absolute weights are the smallest are pruned from the current set of features.
That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.
RFECV performs RFE in a cross-validation loop to find the optimal number of features.

RFE example:
from sklearn.svm import SVC
from sklearn.datasets import load_digits
from sklearn.feature_selection import RFE
import matplotlib.pyplot as plt

digits = load_digits()
X = digits.images.reshape((len(digits.images), -1))
y = digits.target
svc = SVC(kernel="linear", C=1)
rfe = RFE(estimator=svc, n_features_to_select=1, step=1)  # recursive feature elimination
rfe.fit(X, y)
ranking = rfe.ranking_.reshape(digits.images[0].shape)
Note: the SVM can be replaced with logistic regression (or any other estimator that exposes feature weights, e.g. via coef_).
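For instance, a minimal sketch of the same ranking with LogisticRegression (my substitution, not from the original post):

from sklearn.datasets import load_digits
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

digits = load_digits()
X = digits.images.reshape((len(digits.images), -1))
y = digits.target
# LogisticRegression exposes coef_, so RFE can use it to rank features
rfe = RFE(estimator=LogisticRegression(), n_features_to_select=1, step=1)
rfe.fit(X, y)
print(rfe.ranking_.reshape(digits.images[0].shape))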

RFECV example:
from sklearn.svm import SVC
from sklearn.cross_validation import StratifiedKFold  # sklearn.model_selection in recent versions
from sklearn.feature_selection import RFECV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=25, n_informative=3,
                           n_redundant=2, n_repeated=0, n_classes=8,
                           n_clusters_per_class=1, random_state=0)
svc = SVC(kernel="linear")
rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(y, 2),  # 2-fold stratified CV
              scoring='accuracy')
rfecv.fit(X, y)
print("Optimal number of features : %d" % rfecv.n_features_)
  • 1.13.4. Feature selection using SelectFromModel (see the sketch after this list)
    • 1.13.4.1. L1-based feature selection
    • 1.13.4.2. Randomized sparse models
    • 1.13.4.3. Tree-based feature selection
  • 1.13.5. Feature selection as part of a pipeline (also sketched below)
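The original post stops at the section titles above. As a minimal sketch of the remaining topics (L1-based selection, tree-based selection, and a pipeline; the estimator choices and the C value are my assumptions, following the patterns in the linked documentation):

from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel, SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

iris = load_iris()
X, y = iris.data, iris.target

# 1.13.4.1 L1-based: an L1-penalized linear model drives some coefficients
# to exactly zero; SelectFromModel keeps the features with nonzero weights
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
X_l1 = SelectFromModel(lsvc, prefit=True).transform(X)
print(X_l1.shape)  # fewer than 4 columns remain

# 1.13.4.3 Tree-based: uses feature_importances_ from an ensemble of trees
clf = ExtraTreesClassifier().fit(X, y)
X_tree = SelectFromModel(clf, prefit=True).transform(X)
print(X_tree.shape)

# 1.13.5 Pipeline: feature selection as a preprocessing step before a classifier
pipe = Pipeline([
    ('select', SelectKBest(chi2, k=2)),
    ('clf', LinearSVC()),
])
pipe.fit(X, y)
print(pipe.score(X, y))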