Python Data Processing (work in progress...)

Source: Internet. Posted by: 大数据平台竞品分析. Editor: 程序博客网. Time: 2024/05/28 23:21

Using the pandas module

Importing a CSV file

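import pandas as pd
from pandas import read_csv

url = "https://goo.gl/vhm1eU"
# Eight names are supplied here; the file appears to have nine columns
# (see the output), so pandas uses the first one as the row index.
names = ['preg', 'plas', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(url, names=names)
print(data.shape)      # (rows, columns)
peek = data.head(10)   # first 10 rows
print(peek)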

(768, 8)
    preg  plas  skin  test  mass   pedi  age  class
6    148    72    35     0  33.6  0.627   50      1
1     85    66    29     0  26.6  0.351   31      0
8    183    64     0     0  23.3  0.672   32      1
1     89    66    23    94  28.1  0.167   21      0
0    137    40    35   168  43.1  2.288   33      1
5    116    74     0     0  25.6  0.201   30      0
3     78    50    32    88  31.0  0.248   26      1
10   115     0     0     0  35.3  0.134   29      0
2    197    70    45   543  30.5  0.158   53      1
8    125    96     0     0   0.0  0.232   54      1

Viewing summary statistics for each column (describe)

Count
Mean
Standard deviation
Minimum
25th percentile
50th percentile (median)
75th percentile
Maximum

pd.set_option('display.width', 100)    # set a preferred display width
pd.set_option('display.precision', 3)  # change output precision
des = data.describe()
print(des)

          preg     plas     skin     test     mass     pedi      age    class
count  768.000  768.000  768.000  768.000  768.000  768.000  768.000  768.000
mean   120.895   69.105   20.536   79.799   31.993    0.472   33.241    0.349
std     31.973   19.356   15.952  115.244    7.884    0.331   11.760    0.477
min      0.000    0.000    0.000    0.000    0.000    0.078   21.000    0.000
25%     99.000   62.000    0.000    0.000   27.300    0.244   24.000    0.000
50%    117.000   72.000   23.000   30.500   32.000    0.372   29.000    0.000
75%    140.250   80.000   32.000  127.250   36.600    0.626   41.000    1.000
max    199.000  122.000   99.000  846.000   67.100    2.420   81.000    1.000
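To make the rows of the describe() output concrete, the following minimal sketch computes the same eight statistics by hand using only the standard library. The sample is the last feature column of the ten rows printed earlier; the variable names are illustrative, and the choice of method="inclusive" assumes linear interpolation between data points, which is how pandas computes quantiles by default as far as I know.

```python
import statistics

# Last feature column of the ten rows shown in the head(10) output.
ages = [50, 31, 32, 21, 33, 30, 26, 29, 53, 54]

# method="inclusive" interpolates linearly between the sorted data points.
q25, q50, q75 = statistics.quantiles(ages, n=4, method="inclusive")

summary = {
    "count": len(ages),
    "mean": statistics.mean(ages),
    "std": statistics.stdev(ages),   # sample standard deviation (divides by n - 1)
    "min": min(ages),
    "25%": q25,
    "50%": q50,
    "75%": q75,
    "max": max(ages),
}

for name, value in summary.items():
    print(f"{name:>5}  {value:.3f}")
```

Because only ten rows are used, the numbers differ from the full 768-row table; the point is only to show what each row of the table measures.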

Class distribution (using groupby)

class_counts = data.groupby("class").size()  # number of rows per class label
print(class_counts)

class
0    500
1    268
dtype: int64
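As a rough illustration of what groupby("class").size() counts, here is the same tally done in plain Python with collections.Counter. The labels are the class values of the ten rows printed earlier; this is only a sketch of the idea, not how pandas implements it.

```python
from collections import Counter

# Class labels of the ten rows shown in the head(10) output.
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 1]

class_counts = Counter(labels)   # maps each label to its number of occurrences
for label in sorted(class_counts):
    print(label, class_counts[label])
```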

Data preprocessing

Processing the data with sklearn's preprocessing module

from sklearn.preprocessing import MinMaxScaler
import numpy as np

array = data.values   # convert to a 2-D array
X = array[:, 0:7]     # feature columns
Y = array[:, 7]       # class labels
scaler = MinMaxScaler(feature_range=(0, 1))  # scale the data into the given range; the default is 0 to 1
rescaledX = scaler.fit_transform(X)
np.set_printoptions(precision=3)
print(rescaledX[0:5, :])

[[ 0.744  0.59   0.354  0.     0.501  0.234  0.483]
 [ 0.427  0.541  0.293  0.     0.396  0.117  0.167]
 [ 0.92   0.525  0.     0.     0.347  0.254  0.183]
 [ 0.447  0.541  0.232  0.111  0.419  0.038  0.   ]
 [ 0.688  0.328  0.354  0.199  0.642  0.944  0.2  ]]
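The scaling above can be checked by hand. The sketch below applies the usual min-max formula, (x - min) / (max - min), stretched to the requested range, to the first feature of the five rows printed above. The helper name min_max_scale is hypothetical, and the column minimum (0) and maximum (199) are taken from the describe() output shown earlier.

```python
def min_max_scale(x, col_min, col_max, feature_range=(0.0, 1.0)):
    """Map x from [col_min, col_max] onto feature_range."""
    lo, hi = feature_range
    return lo + (x - col_min) * (hi - lo) / (col_max - col_min)

# First feature of the first five rows; min and max come from the describe() table.
values = [148, 85, 183, 89, 137]
scaled = [round(min_max_scale(v, 0.0, 199.0), 3) for v in values]
print(scaled)  # [0.744, 0.427, 0.92, 0.447, 0.688]
```

These values agree with the first column of the printed array.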

Standardizing the data

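Standard-deviation standardization, also called Z-score standardization, rescales the data so that each feature has mean 0 and variance 1. The result follows a standard normal distribution only if the original data were normally distributed. The transformation is:
x* = (x - μ) / σ
In this formula the standardized value x* is the original value x minus the mean μ of the original data, divided by the standard deviation σ of the original data. The resulting data have mean 0 and standard deviation (and variance) 1.
Note: whether to standardize depends on the experiment at hand. If a feature is very sparse and contains a large number of zeros (many features in real applications look like this), its standard deviation can be close to zero, so Z-score standardization comes close to dividing by zero and the results are unpredictable.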

from sklearn import preprocessing

scaler = preprocessing.StandardScaler().fit(X)  # learn each column's mean and standard deviation
rescaledX = scaler.transform(X)                 # apply (x - mean) / std column by column
print(rescaledX[0:5, :])

[[ 0.848  0.15   0.907 -0.693  0.204  0.468  1.426]
 [-1.123 -0.161  0.531 -0.693 -0.684 -0.365 -0.191]
 [ 1.944 -0.264 -1.288 -0.693 -1.103  0.604 -0.106]
 [-0.998 -0.161  0.155  0.123 -0.494 -0.921 -1.042]
 [ 0.504 -1.505  0.907  0.766  1.41   5.485 -0.02 ]]
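The formula x* = (x - μ) / σ can also be sketched in plain Python. The helper name standardize is hypothetical, and the sketch assumes the population standard deviation (dividing by n), which is my understanding of what StandardScaler uses. The guard for σ = 0 is a choice made for this sketch to avoid the division-by-zero problem mentioned in the note on sparse features.

```python
import statistics

def standardize(values):
    """Return z-scores (x - mean) / std, using the population standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        # Choice made for this sketch: a constant feature maps to all zeros
        # instead of dividing by zero.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]

values = [148, 85, 183, 89, 137]   # first feature of the five rows printed above
z = standardize(values)
print([round(v, 3) for v in z])
print(standardize([3, 3, 3]))      # constant feature: handled by the guard
```

The z-scores here differ from the printed array because the mean and standard deviation are computed from these five values only, not from all 768 rows.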

Normalization

# Normalizer rescales each row (sample) to unit length; the default norm is L2.
scaler = preprocessing.Normalizer().fit(X)
normalized = scaler.transform(X)
# print(X[0:5, :])
print(normalized[0:5, :])

[[ 0.828  0.403  0.196  0.     0.188  0.004  0.28 ]
 [ 0.716  0.556  0.244  0.     0.224  0.003  0.261]
 [ 0.925  0.323  0.     0.     0.118  0.003  0.162]
 [ 0.588  0.436  0.152  0.622  0.186  0.001  0.139]
 [ 0.596  0.174  0.152  0.731  0.188  0.01   0.144]]
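Unlike the two scalers above, which work column by column, normalization works row by row: each sample is divided by its own Euclidean length. The sketch below reproduces the first printed row by hand. The helper name l2_normalize is hypothetical, and leaving an all-zero row unchanged is a choice made for this sketch.

```python
import math

def l2_normalize(row):
    """Scale one sample so that its Euclidean (L2) norm is 1."""
    norm = math.sqrt(sum(x * x for x in row))
    if norm == 0:
        return list(row)   # choice made for this sketch: leave an all-zero row unchanged
    return [x / norm for x in row]

# Feature values of the first row in the head(10) output.
first_row = [148, 72, 35, 0, 33.6, 0.627, 50]
unit_row = l2_normalize(first_row)
print([round(x, 3) for x in unit_row])
# [0.828, 0.403, 0.196, 0.0, 0.188, 0.004, 0.28]
```

The squares of the normalized values sum to 1, which is what "unit length" means here.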
