Python: Min-Max Scaling, Standardization, and Normalization

Source: Internet  Editor: 程序博客网  Date: 2024/05/16 01:23

In the previous article we analyzed some of the features. The summary statistics returned by describe show that many of them are sparse.

# First drop rows with missing values, then look at the summary statistics
ads = ads.dropna(axis=0)
print(ads.describe())
# The quantiles show that a large number of features are zero for most rows, i.e. the features are sparse.
#              0            1            2            3            4     \
# count  2359.000000  2359.000000  2359.000000  2359.000000  2359.000000
# mean     63.912251   155.631624     3.912982     0.759644     0.002120
# std      54.881130   130.237867     6.047220     0.427390     0.045999
# min       1.000000     1.000000     0.001500     0.000000     0.000000
# 25%      25.000000    80.500000     1.033450     1.000000     0.000000
# 50%      51.000000   110.000000     2.111100     1.000000     0.000000
# 75%      84.000000   184.000000     5.333300     1.000000     0.000000
# max     640.000000   640.000000    60.000000     1.000000     1.000000
#
#          5            6            7            8            9     \
# count  2359.0  2359.000000  2359.000000  2359.000000  2359.000000
# mean      0.0     0.006359     0.004663     0.004663     0.014837
# std       0.0     0.079504     0.068141     0.068141     0.120925
# min       0.0     0.000000     0.000000     0.000000     0.000000
# 25%       0.0     0.000000     0.000000     0.000000     0.000000
# 50%       0.0     0.000000     0.000000     0.000000     0.000000
# 75%       0.0     0.000000     0.000000     0.000000     0.000000
# max       0.0     1.000000     1.000000     1.000000     1.000000
#
#           ...              1549         1550         1551         1552  \
# count     ...       2359.000000  2359.000000  2359.000000  2359.000000
# mean      ...          0.003815     0.001272     0.002120     0.002543
# std       ...          0.061662     0.035646     0.045999     0.050379
# min       ...          0.000000     0.000000     0.000000     0.000000
# 25%       ...          0.000000     0.000000     0.000000     0.000000
# 50%       ...          0.000000     0.000000     0.000000     0.000000
# 75%       ...          0.000000     0.000000     0.000000     0.000000
# max       ...          1.000000     1.000000     1.000000     1.000000
#
#               1553         1554         1555        1556         1557  \
# count  2359.000000  2359.000000  2359.000000  2359.00000  2359.000000
# mean      0.008478     0.013989     0.014837     0.00975     0.000848
# std       0.091705     0.117470     0.120925     0.09828     0.029111
# min       0.000000     0.000000     0.000000     0.00000     0.000000
# 25%       0.000000     0.000000     0.000000     0.00000     0.000000
# 50%       0.000000     0.000000     0.000000     0.00000     0.000000
# 75%       0.000000     0.000000     0.000000     0.00000     0.000000
# max       1.000000     1.000000     1.000000     1.00000     1.000000
#
#               1558
# count  2359.000000
# mean      0.161509
# std       0.368078
# min       0.000000
# 25%       0.000000
# 50%       0.000000
# 75%       0.000000
# max       1.000000
#
# [8 rows x 1559 columns]
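The sparsity that describe hints at can also be measured directly. The following is a minimal sketch, not part of the original analysis: the small `toy` DataFrame and its column names are hypothetical stand-ins for `ads`, and the 75% threshold is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the `ads` DataFrame: two dense columns
# and three indicator columns that are mostly zero.
toy = pd.DataFrame({
    "height": [125.0, 57.0, 33.0, 60.0],
    "width": [125.0, 468.0, 230.0, 468.0],
    "flag_a": [0.0, 0.0, 1.0, 0.0],
    "flag_b": [0.0, 0.0, 0.0, 0.0],
    "flag_c": [1.0, 0.0, 0.0, 0.0],
})

# Fraction of entries equal to zero in each column.
zero_fraction = (toy == 0).mean(axis=0)
print(zero_fraction)

# Columns in which at least 75% of the rows are zero (threshold is an assumption).
sparse_columns = zero_fraction[zero_fraction >= 0.75].index.tolist()
print(sparse_columns)
```

Applied to the real data, the same two lines would use `ads` in place of `toy`.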

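Preprocessing the features can improve the results of a model. The three main techniques are min-max scaling (归一化), standardization (标准化), and per-sample normalization (正则化, which here means rescaling each sample to unit norm, not the regularization term of a loss function). Python's sklearn.preprocessing provides a class for each of them, and they are very convenient to use.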

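# Import the data preprocessing utilities from sklearn.preprocessing
from sklearn.preprocessing import MinMaxScaler

# Split the array: the last column becomes y, all other columns form the feature matrix X
df_all = ads.values
X = df_all[:, :-1]
y = df_all[:, -1]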
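# Min-max scaling: removes the differences in units and scale between features, so they can be
# compared and processed together. Zero entries stay zero for features whose minimum is 0,
# so the sparsity of the data is preserved.
# In neural networks, for example, scaling the inputs can speed up convergence during training.
X_scaler = MinMaxScaler().fit_transform(X)
print(X_scaler)
# [[ 0.19405321  0.19405321  0.01664208 ...,  0.          0.          0.        ]
#  [ 0.08763693  0.73082942  0.13682009 ...,  0.          0.          0.        ]
#  [ 0.05007825  0.35837246  0.1161379  ...,  0.          0.          0.        ]
#  ...,
#  [ 0.15649452  0.21752739  0.02307724 ...,  0.          0.          0.        ]
#  [ 0.03442879  0.18622848  0.08693217 ...,  0.          0.          0.        ]
#  [ 0.06103286  0.06103286  0.01664208 ...,  0.          0.          0.        ]]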
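To make the min-max step concrete, here is a minimal sketch that reproduces it by hand as (x - min) / (max - min) per column and compares the result with MinMaxScaler. The array `X_toy` is made-up illustration data, not taken from `ads`. Replacing a zero range with 1 for constant columns is how this sketch avoids dividing by zero, and it agrees with MinMaxScaler on this input.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up data: two dense columns, one binary indicator, one constant column.
X_toy = np.array([
    [1.0, 10.0, 0.0, 7.0],
    [3.0, 20.0, 0.0, 7.0],
    [5.0, 40.0, 1.0, 7.0],
])

# Per-column minimum and range.
col_min = X_toy.min(axis=0)
col_range = X_toy.max(axis=0) - col_min

# A constant column has range 0; use 1 instead so the division is defined.
col_range[col_range == 0] = 1.0

# Manual min-max scaling to the interval [0, 1].
X_manual = (X_toy - col_min) / col_range

# The same transformation with scikit-learn.
X_sklearn = MinMaxScaler().fit_transform(X_toy)

print(np.allclose(X_manual, X_sklearn))
print(X_sklearn)
```

Note that the zeros in the indicator column remain zeros after scaling, because that column's minimum is already 0.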
# Standardization: gives every feature a mean of 0 and a variance of 1.
# This makes it easier to apply methods that rely on properties of the standard normal distribution.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X)
X_scaler = scaler.transform(X)
# scale_ holds the per-feature standard deviation (the older std_ attribute was removed in newer scikit-learn releases)
print(scaler.mean_, scaler.scale_)
print(X_scaler)
# [[ 1.11332804 -0.23524739 -0.48180809 ..., -0.12272017 -0.09922646
#   -0.02912965]
#  [-0.12597621  2.39895364  0.71081076 ..., -0.12272017 -0.09922646
#   -0.02912965]
#  [-0.5633777   0.57114068  0.50556553 ..., -0.12272017 -0.09922646
#   -0.02912965]
#  ...,
#  [ 0.67592654 -0.12004909 -0.41794703 ..., -0.12272017 -0.09922646
#   -0.02912965]
#  [-0.74562833 -0.27364682  0.21573459 ..., -0.12272017 -0.09922646
#   -0.02912965]
#  [-0.43580227 -0.88803773 -0.48180809 ..., -0.12272017 -0.09922646
#   -0.02912965]]

# Normalization (per sample): unlike the two methods above, which work column by column,
# this rescales each sample (row) so that its norm is 1 (the L2 norm by default).
# It is useful when computing the similarity between samples.
from sklearn.preprocessing import Normalizer
scaler = Normalizer().fit(X)
X_scaler = scaler.transform(X)
print(X_scaler)
# [[ 0.70693714  0.70693714  0.0056555  ...,  0.          0.          0.        ]
#  [ 0.12088013  0.99248947  0.01741204 ...,  0.          0.          0.        ]
#  [ 0.14193112  0.98921689  0.02997585 ...,  0.          0.          0.        ]
#  ...,
#  [ 0.58495049  0.81082246  0.00802772 ...,  0.          0.          0.        ]
#  [ 0.18799975  0.98086824  0.0426457  ...,  0.          0.          0.        ]
#  [ 0.70589457  0.70589457  0.01764736 ...,  0.          0.          0.        ]]
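The difference between the column-wise and row-wise methods is easy to check on a small example. The sketch below uses made-up data (`X_toy` is an assumption, not taken from `ads`) and verifies that StandardScaler gives each column mean 0 and standard deviation 1, while Normalizer gives each row an L2 norm of 1.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Made-up data: 4 samples (rows) and 3 features (columns).
X_toy = np.array([
    [1.0, 10.0, 0.0],
    [3.0, 20.0, 0.0],
    [5.0, 40.0, 1.0],
    [7.0, 30.0, 0.0],
])

# Standardization works per column (feature).
X_std = StandardScaler().fit_transform(X_toy)
col_means = X_std.mean(axis=0)
col_stds = X_std.std(axis=0)  # population standard deviation (ddof=0)

# The same result by hand: subtract the column mean, divide by the column std.
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# Normalization works per row (sample): divide each row by its L2 norm.
X_norm = Normalizer(norm="l2").fit_transform(X_toy)
row_norms = np.linalg.norm(X_norm, axis=1)

print(col_means, col_stds)
print(row_norms)
```

Because Normalizer treats every row independently, it learns nothing from the data in fit, whereas StandardScaler and MinMaxScaler store per-column statistics that are reused in transform.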
