104 Missing-value preprocessing

Source: Internet. Editor: 程序博客网. Time: 2024/05/17 06:32

http://scikit-learn.org/stable/auto_examples/missing_values.html#example-missing-values-py
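How missing values are handled can, to some extent, determine how well a model performs. Common approaches are to fill them with the mean, the median, or the most frequent value; these correspond to the three strategies ("mean", "median", "most_frequent") of sklearn.preprocessing.Imputer. The example linked above suggests that when the values vary over a wide range, the median seems to be the more robust choice.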
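To make the three strategies concrete, here is a minimal standard-library sketch. The impute helper and the sample column are made up for illustration and are not scikit-learn's implementation; None stands in for a missing entry. It also shows why the median is steadier when one value is far from the rest.

```python
from collections import Counter
from statistics import mean, median


def impute(values, strategy="mean"):
    """Replace None entries in a list of numbers using one fill strategy."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    elif strategy == "most_frequent":
        fill = Counter(observed).most_common(1)[0][0]
    else:
        raise ValueError("unknown strategy: %r" % strategy)
    return [fill if v is None else v for v in values]


# Hypothetical column with one extreme value and one missing entry.
column = [1, 2, 2, 3, 1000, None]
filled_mean = impute(column, "mean")            # fill value is pulled up by 1000
filled_median = impute(column, "median")        # fill value stays near the bulk
filled_mode = impute(column, "most_frequent")   # fill value is the commonest one
```

With this column the mean fill is 201.6 while the median and most-frequent fills are both 2, so a single outlier changes the mean-based fill a lot and the other two not at all.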

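1. Function overview

The functions are introduced in the order they appear in the example.

1.1 shape[0]

Takes the first element of shape; for multi-dimensional data this is the number of rows.

1.2 np.floor

Returns, for a float or an array of floats, the largest integer less than or equal to each value. The result is still a float, e.g. np.floor(-0.2) -> -1.0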
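A small sketch of how shape[0] and np.floor combine when computing how many rows should get a missing value. The array here is a placeholder of zeros; its 506 x 13 shape is only an assumption chosen to resemble the Boston housing data.

```python
import numpy as np

X = np.zeros((506, 13))          # placeholder data, assumed shape
n_samples = X.shape[0]           # number of rows
missing_rate = 0.75

floored = np.floor(n_samples * missing_rate)   # 379.5 -> 379.0, still a float
n_missing = int(floored)                       # cast before using it as a size

neg = np.floor(-0.2)             # floor rounds toward minus infinity
trunc = int(-0.2)                # int() truncates toward zero instead

mask = np.ones(n_missing, dtype=bool)          # needs an integer length
```

The int() cast matters because array constructors such as np.ones expect an integer size, and np.floor alone does not give one.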

1.3 np.hstack

Stacks arrays horizontally (h = horizontally). For arrays with two or more dimensions this is equivalent to np.concatenate(tup, axis=1); 1-D arrays are simply joined end to end.
http://docs.scipy.org/doc/numpy-1.6.0/reference/generated/numpy.hstack.html?highlight=hstack#numpy.hstack
Example:

    >>> a = np.array([[1],[2],[3]])
    >>> b = np.array([[2],[3],[4]])
    >>> np.hstack((a,b))
    array([[1, 2],
           [2, 3],
           [3, 4]])
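The 1-D case is the one the missing-value example actually relies on, so here is a short sketch of both cases. The sizes 8 and 6 are illustrative values, not taken from the article.

```python
import numpy as np

# 2-D inputs: columns are placed side by side (same as axis=1).
a = np.array([[1], [2], [3]])
b = np.array([[2], [3], [4]])
side_by_side = np.hstack((a, b))
same_as_concat = np.concatenate((a, b), axis=1)

# 1-D inputs: there is no axis 1, so hstack joins along axis 0.
n_samples, n_missing = 8, 6      # illustrative sizes
mask = np.hstack((np.zeros(n_samples - n_missing, dtype=bool),
                  np.ones(n_missing, dtype=bool)))
```

The resulting mask holds False for rows that stay complete and True for rows that will receive a missing value; shuffling it afterwards decides which rows those are.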

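1.4 random.mtrand.RandomState.shuffle

Shuffles a sequence in place. Note that for multi-dimensional arrays only the order along the first index is shuffled; the contents of each row stay together. For example (one possible result):

    >>> arr = np.arange(9).reshape((3, 3))
    >>> np.random.shuffle(arr)
    >>> arr
    array([[3, 4, 5],
           [6, 7, 8],
           [0, 1, 2]])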
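A quick way to convince yourself that only the row order changes is to compare the rows before and after. This sketch uses a seeded RandomState so the run is repeatable; the seed value 0 is an arbitrary choice.

```python
import numpy as np

rng = np.random.RandomState(0)            # fixed seed for repeatability
arr = np.arange(9).reshape((3, 3))
original_rows = arr.tolist()              # copy of the rows before shuffling

result = rng.shuffle(arr)                 # shuffles in place and returns None
shuffled_rows = arr.tolist()

# Every row survives intact; only the order of the rows may differ.
rows_preserved = sorted(shuffled_rows) == sorted(original_rows)
```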

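1.5 Imputer

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html#sklearn.preprocessing.Imputer
Fills in missing values. The example uses the default strategy, the mean. The signature is sklearn.preprocessing.Imputer(missing_values='NaN', strategy='mean', axis=0, verbose=0, copy=True)
Here axis=0 means "impute along columns": the statistic is computed over each column, i.e. across all samples for one feature. If a column consists entirely of missing values, that feature is dropped from the output. With axis=1, a row that consists entirely of missing values raises an exception.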
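The axis=0 behaviour described above can be sketched in plain NumPy. This is a hand-rolled illustration, not Imputer's actual implementation: the function name impute_column_mean is hypothetical, and NaN is assumed to be the missing-value marker.

```python
import numpy as np


def impute_column_mean(X):
    """Fill NaNs with the column mean; drop columns that are entirely NaN."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    keep = ~missing.all(axis=0)            # columns with at least one value
    X, missing = X[:, keep], missing[:, keep]
    col_means = np.nanmean(X, axis=0)      # one mean per remaining column
    # Broadcast the per-column means into the missing positions.
    return np.where(missing, col_means, X)


X = np.array([[1.0,    np.nan, np.nan],
              [3.0,    4.0,    np.nan],
              [np.nan, 8.0,    np.nan]])
X_imputed = impute_column_mean(X)
```

The third column has no observed values, so it is dropped and the output has two features instead of three; the remaining gaps are filled with 2.0 and 6.0, the means of their columns.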

2. Code

The code from the example:

    # Note: these imports match the older scikit-learn release the example
    # was written for.
    import numpy as np

    from sklearn.datasets import load_boston
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import Imputer
    from sklearn.cross_validation import cross_val_score

    rng = np.random.RandomState(0)

    dataset = load_boston()
    X_full, y_full = dataset.data, dataset.target
    n_samples = X_full.shape[0]
    n_features = X_full.shape[1]

    # Step 1: score on the entire dataset, with no missing values
    estimator = RandomForestRegressor(random_state=0, n_estimators=100)
    score = cross_val_score(estimator, X_full, y_full).mean()
    print("Score with the entire dataset = %.2f" % score)

    # Add missing values in 75% of the lines
    missing_rate = 0.75
    # np.floor returns a float; cast it so it can be used as an array size
    n_missing_samples = int(np.floor(n_samples * missing_rate))
    missing_samples = np.hstack((np.zeros(n_samples - n_missing_samples,
                                          dtype=bool),
                                 np.ones(n_missing_samples,
                                         dtype=bool)))
    rng.shuffle(missing_samples)
    missing_features = rng.randint(0, n_features, n_missing_samples)

    # Step 2: score without the lines containing missing values
    X_filtered = X_full[~missing_samples, :]
    y_filtered = y_full[~missing_samples]
    estimator = RandomForestRegressor(random_state=0, n_estimators=100)
    score = cross_val_score(estimator, X_filtered, y_filtered).mean()
    print("Score without the samples containing missing values = %.2f" % score)

    # Step 3: impute the missing values, then score
    X_missing = X_full.copy()
    X_missing[np.where(missing_samples)[0], missing_features] = 0
    y_missing = y_full.copy()
    estimator = Pipeline([("imputer", Imputer(missing_values=0,
                                              strategy="mean",
                                              axis=0)),
                          ("forest", RandomForestRegressor(random_state=0,
                                                           n_estimators=100))])
    score = cross_val_score(estimator, X_missing, y_missing).mean()
    print("Score after imputation of the missing values = %.2f" % score)
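The least obvious line in the listing is the indexed assignment that plants the missing values. Here is a toy-sized sketch of that step. The 6 x 4 array and the hand-written mask are made up for illustration; only the indexing pattern is the same as in the example.

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples, n_features = 6, 4               # toy sizes, not the article's data
# Values 1..24, so no entry is zero before we plant the markers.
X_full = np.arange(1, n_samples * n_features + 1, dtype=float)
X_full = X_full.reshape(n_samples, n_features)

missing_samples = np.array([False, True, True, False, True, True])
rng.shuffle(missing_samples)               # decide which rows get damaged
n_missing = int(missing_samples.sum())
missing_features = rng.randint(0, n_features, n_missing)  # one column per row

X_missing = X_full.copy()
rows = np.where(missing_samples)[0]        # indices of the rows to damage
# Paired indexing: entry (rows[i], missing_features[i]) is set for each i.
X_missing[rows, missing_features] = 0

X_filtered = X_full[~missing_samples, :]   # only the untouched rows
```

Because the row and column index arrays have the same length, NumPy pairs them element by element, so each selected row loses exactly one feature. Note that using 0 as the marker means any value that is genuinely 0 in the data would also be treated as missing by an imputer configured with missing_values=0.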

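The final scores: the original dataset scores 0.56.
Dropping the samples with missing values gives 0.48.
Filling the gaps with the mean gives 0.55.
So mean imputation looks like a very good choice here: it recovers almost all of the score that is lost by discarding the incomplete samples.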
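The listing in section 2 depends on Imputer, sklearn.cross_validation and load_boston, which have since been removed from scikit-learn. A rough equivalent for current releases might look like the sketch below. It is an adaptation under assumptions, not the original experiment: the data is synthetic (make_regression), NaN is used as the missing marker, the forest is smaller to keep it quick, and SimpleImputer has no axis parameter (it always works column-wise). The scores it prints are therefore not comparable with the numbers above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.RandomState(0)
# Synthetic stand-in data; the Boston dataset is not used here.
X_full, y_full = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)

# Blank out one randomly chosen feature in 75% of the rows.
n_samples, n_features = X_full.shape
n_missing = int(np.floor(n_samples * 0.75))
rows = rng.permutation(n_samples)[:n_missing]     # distinct row indices
cols = rng.randint(0, n_features, n_missing)
X_missing = X_full.copy()
X_missing[rows, cols] = np.nan

estimator = Pipeline([
    ("imputer", SimpleImputer(missing_values=np.nan, strategy="mean")),
    ("forest", RandomForestRegressor(random_state=0, n_estimators=20)),
])
score = cross_val_score(estimator, X_missing, y_full).mean()
print("Score after imputation of the missing values = %.2f" % score)
```

Putting the imputer inside the Pipeline, as the original does, means the column means are learned from the training folds only during cross-validation.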

0 0