python 之 pandas 详解

来源:互联网 发布:安卓底部导航栏源码 编辑:程序博客网 时间:2024/05/18 13:23

一、导入库

import pandas as pdimport numpy as npimport matplotlib.pyplot as plt

二、创建对象
1、通过传递一个list对象来创建一个 Series,pandas会默认创建整型索引:

import pandas as pdimport numpy as npimport matplotlib.pyplot as plts= pd.Series([1,3,5,np.nan,6,8])print(s)//输出0    1.01    3.02    5.03    NaN4    6.05    8.0dtype: float64

2、通过传递一个numpy array,时间索引以及列标签来创建一个DataFrame:

import pandas as pdimport numpy as npimport matplotlib.pyplot as plts= pd.Series([1,3,5,np.nan,6,8])# print(s)dates = pd.date_range('20170801',periods=6)df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))print(df)//                   A         B         C         D2017-08-01 -0.352464  0.207259 -0.379499 -0.0400172017-08-02 -1.400304 -1.609741  1.149181  0.3538922017-08-03 -0.244790  0.343552 -2.269876 -0.8329832017-08-04  0.817258 -0.256027  0.917888 -0.2407812017-08-05  0.312557 -1.103468  1.140370 -0.0975152017-08-06 -0.945031  0.178114  0.099680 -0.993974

3、通过传递一个能够被转换成类似序列结构的字典对象来创建一个DataFrame:

df2 = pd.DataFrame({    'A': 1,    'B':pd.Timestamp('20170801'),    'C':pd.Series(1,index=list(range(4)),dtype='float64'),    'D':np.array([3]*4,dtype='int64'),    'E':pd.Categorical(["test","train","test","train"]),    'F':'fooo'})print(df2)//   A          B    C  D      E     F0  1 2017-08-01  1.0  3   test  fooo1  1 2017-08-01  1.0  3  train  fooo2  1 2017-08-01  1.0  3   test  fooo3  1 2017-08-01  1.0  3  train  fooo

4、查看不同列的数据类型

print(df2.dtype)//A             int64B    datetime64[ns]C           float64D             int64E          categoryF            objectdtype: object

三、查看数据:
查看frame中头部和尾部的行

df.head(3)df.tail(3)

2、显示索引、列和底层的numpy数据。

print(df.index)print(df.columns)print(df.values)//DatetimeIndex(['2017-08-01', '2017-08-02', '2017-08-03', '2017-08-04',               '2017-08-05', '2017-08-06'],              dtype='datetime64[ns]', freq='D')Index(['A', 'B', 'C', 'D'], dtype='object')[[-0.99340137 -0.12548833 -1.08581538  1.17112124] [ 0.29427792  0.20182606 -0.84427908 -1.42783808] [-0.06528316  1.55255719  1.15818104  0.05286832] [ 1.49223138 -1.11517738  2.01386582  0.3496034 ] [ 0.23697223  0.43241392 -0.97061103  0.79105226] [ 2.96518919  0.8754193  -1.26479748  0.29930427]]

3、describe()函数对数据快速的统计汇总:

print(describe())//              A         B         C         Dcount  6.000000  6.000000  6.000000  6.000000mean   0.396478  0.089271 -0.006982  0.281260std    1.342037  1.158896  0.513877  0.997442min   -1.395445 -1.287098 -1.047989 -1.06265125%   -0.390341 -0.907016  0.116073 -0.07381850%    0.210131  0.248303  0.193435  0.05850975%    1.401632  0.856122  0.235431  0.655625max    2.145753  1.559510  0.279513  1.896458

4、对数据的转置,按轴进行排序,按值进行排序

df.T    df.sort_index(axis=1,ascending=False)df.sort(columns='B')

四、选择:
虽然标准的python/numpy的选择和设置表达式都能够直接派上用场,但是作为工程使用的代码,我们推荐使用经过优化的pandas数据访问 方式:.at .iat .iloc .loc .ix
获取:
1、选择一个单独的列,这将会返回一个Series,等同于df.A:

print(df['A'])//2017-08-01    0.5591562017-08-02   -1.3812632017-08-03    0.1272862017-08-04    0.7713172017-08-05    0.7636902017-08-06   -0.123249Freq: D, Name: A, dtype: float64

2、通过多个标签在多个轴上进行选择 用loc 两个中括号

print(df.loc[:,['A','B']])//                   A         B2017-08-01  0.040603  1.5194482017-08-02 -2.429148  0.8748732017-08-03 -0.315154  0.2856032017-08-04  0.878968 -0.1307172017-08-05  0.201083 -0.2804192017-08-06  1.089989  0.754818

3、标签切片,类似于2

print(df.loc['20170802':'20170804',['A','B']])//                   A         B2017-08-02  1.923047  0.3247152017-08-03 -0.349817  0.0853592017-08-04 -0.675602  1.166312

4、对于返回的对象进行维度缩减

print(df.loc['20170803',['A','B']])//A   -1.133357B   -1.008605Name: 2017-08-03 00:00:00, dtype: float64

5、获取一个标量以及快速访问一个标量。(意思等价)

df.loc(dates[0],'A')df.at[dates[0],'A']

五、通过位置选择。
1、通过传递数值进行 位置选择(选择的是行)

print(df)print(df.iloc[3])# 取第四行的数据//2017-08-01  0.150731 -0.167577  1.524639 -0.3246762017-08-02 -0.118888 -0.100348  0.666961  0.6189532017-08-03  0.634666 -0.937274  0.280727 -2.5047982017-08-04  0.395685 -0.650699 -0.573055 -2.5973232017-08-05  1.403470  0.636823  0.289136  1.2940452017-08-06 -1.337154  1.718215 -0.093115 -1.529378A    0.395685B   -0.650699C   -0.573055D   -2.597323Name: 2017-08-04 00:00:00, dtype: float64

2、通过数值进行切片,与numpy /python 中的情况类似

print(df.iloc[3:5,0:2])//                   A         B2017-08-04  1.064645  1.3464982017-08-05 -0.253172  0.713125Process finished with exit code 0

3、通过指定一个位置的列表,与numpy/python 中的情况类似

print(df.iloc[[1,2,4],[0,2]])//                   A         C2017-08-02 -0.987798 -0.0812092017-08-03  0.039192 -0.2445562017-08-05 -0.043044 -0.337542Process finished with exit code 0

4、对行和列进行切片

print(df.iloc[1:3,:])#对行进行切片print(df.iloc[:,1:3]) #对列进行切片//                   A         B         C         D2017-08-02  0.598266  0.410904 -0.587577  0.9795112017-08-03 -0.206643 -0.859047 -0.110966 -0.696556                   B         C2017-08-01 -0.341683 -0.3784772017-08-02  0.410904 -0.5875772017-08-03 -0.859047 -0.1109662017-08-04 -0.486971  0.2483432017-08-05  2.641393 -1.2293952017-08-06  1.023585  0.630160Process finished with exit code 0

5、获取特定的值

print(df.iloc[1,1])print(df.iat[1,1])//0.4267748277390.426774827739Process finished with exit code 0

六 布尔索引
1、使用一个单独列的值来选择数据:

print(df[df.A > 0])//                   A         B         C         D2017-08-01  1.113008 -1.663732 -0.321016  0.0791692017-08-03  0.000043  0.217510  0.929183  0.6871622017-08-04  0.167007 -2.045754  0.318775 -0.1687352017-08-05  0.721842  0.982646  0.367562  0.4682502017-08-06  0.853404 -0.344893  2.266556 -0.304485

2、使用where操作来选择数据:

print(df[df>0])//                   A         B         C         D2017-08-01  0.183377       NaN  0.411767  0.5356072017-08-02  0.169412  1.331423  0.815747       NaN2017-08-03  0.807875       NaN       NaN       NaN2017-08-04  0.559794       NaN  1.384336  0.0114422017-08-05  0.250613       NaN       NaN       NaN2017-08-06       NaN       NaN       NaN       NaN

3、使用isin()方法来过滤:

df2 = df.copy()df2['E'] = ['one','one','two','three','four','three']print(df2)print(df2[df2['E'].isin(['one','four'])])//                   A         B         C         D      E2017-08-01  0.812330  1.333924 -1.267516  0.536965    one2017-08-02  0.775528  1.071335 -1.344108  2.075409    one2017-08-03 -0.826466  2.613913  0.153449  0.475277    two2017-08-04 -0.190586 -0.613647 -0.044821 -1.224435  three2017-08-05 -0.025830 -0.432906  0.990122  0.583328   four2017-08-06  1.258012 -1.513605  2.128874  0.615266  three                   A         B         C         D     E2017-08-01  0.812330  1.333924 -1.267516  0.536965   one2017-08-02  0.775528  1.071335 -1.344108  2.075409   one2017-08-05 -0.025830 -0.432906  0.990122  0.583328  fourProcess finished with exit code 0

七、设置
1、设置新的列:

s1 = pd.Series([1,2,3,4,5,6],index=pd.date_range('20170801',periods=6))print(s1)//2017-08-01    12017-08-02    22017-08-03    32017-08-04    42017-08-05    52017-08-06    6Freq: D, dtype: int64

2、通过标签设置新的值

pdf.at[dates[0],'A']=0

3、通过位置设置新的值:

df.iat[0,1]=0

4、通过一个numpy数组设置一组新值

df.loc[:'D'] = np.array([5]*len(df))

5、通过where操作来设置新的值:

df2 =df.copy()df2[df2>0]=-df2print(df2)//                   A         B         C         D2017-08-01 -0.994060 -0.915353 -0.103166 -0.4003632017-08-02 -0.483559 -0.775021 -0.549397 -0.5259692017-08-03 -1.394320 -1.430371 -1.592913 -0.4799442017-08-04 -0.164794 -0.728599 -0.187677 -1.9501282017-08-05 -0.827711 -0.678634 -1.089334 -1.2240332017-08-06 -0.159199 -0.204388 -0.697539 -0.193927

八、缺失值的处理
在pandas中,使用np.nan来代替缺失值, 这些值将默认不会包含在计算中,
1、reindex()方法可以对指定轴上的索引进行改变/增加、删除操作 这将返回原始数据的一个拷贝

df2 = df.reindex(index=dates[0:4],columns=list(df.columns)+['E'])df2.loc[0:2,'E'] = 1print(df2)//                   A         B         C         D    E2017-08-01 -0.631494 -0.427917 -2.263502  0.499206  1.02017-08-02 -0.363730 -1.641072  0.251015  1.127523  1.02017-08-03  1.284590 -0.249052 -0.410623  0.304096  NaN2017-08-04  1.166190  1.371307 -0.636533  0.959298  NaN

2、去掉包含缺失值的行

df2 = df.reindex(index=dates[0:4],columns=list(df.columns)+['E'])df2.loc[0:2,'E'] = 1print(df2)print(df2.dropna(how='any'))//                   A         B         C         D    E2017-08-01  0.844367  0.282881 -0.596537  1.106986  1.02017-08-02 -0.297101 -0.935122  0.100942  2.362901  1.02017-08-03  0.775431  0.187812  1.244166  1.549058  NaN2017-08-04  0.146374  0.604248 -0.275827 -1.339286  NaN                   A         B         C         D    E2017-08-01  0.844367  0.282881 -0.596537  1.106986  1.02017-08-02 -0.297101 -0.935122  0.100942  2.362901  1.0

3、对缺失值进行补充:

print(df2.fillna(value=5))//                   A         B         C         D    E2017-08-01  0.134828 -0.409695  0.398769 -0.607453  1.02017-08-02  0.461754 -0.555881  1.073785  2.268998  1.02017-08-03  2.494864  0.894457 -1.012622  0.225400  5.02017-08-04 -0.789087 -0.632234 -1.368574 -1.554802  5.0

4、对数据进行布尔填充:

print(pd.isnull(df2))//                A      B      C      D      E2017-08-01  False  False  False  False  False2017-08-02  False  False  False  False  False2017-08-03  False  False  False  False   True2017-08-04  False  False  False  False   True

九、相关操作
1、执行描述性统计:

print(df)print(df.mean())print(df.mean(1))//                   A         B         C         D2017-08-01 -0.211556  0.151164  0.827327 -1.0558082017-08-02 -0.181771 -0.388966  2.045035  0.1186862017-08-03 -1.790196 -1.234819  0.432210  0.3909792017-08-04  0.240327  0.018650  0.558570  0.0446622017-08-05 -0.336993 -0.496467  2.512642  0.1984162017-08-06  0.824033 -0.083081  0.021884  3.237294A   -0.242693B   -0.338920C    1.066278D    0.489038dtype: float642017-08-01   -0.0722182017-08-02    0.3982462017-08-03   -0.5504572017-08-04    0.2155532017-08-05    0.4694002017-08-06    1.000032Freq: D, dtype: float64

2、对于拥有不同维度,需要对齐的对象进行操作。pandas会自动的沿着指定的维度进行广播:

s = pd.Series([1,3,5,np.nan,6,8],index=dates).shift(2)print(s)//2017-08-01    NaN2017-08-02    NaN2017-08-03    1.02017-08-04    3.02017-08-05    5.02017-08-06    NaNFreq: D, dtype: float64s = pd.Series([1,3,5,np.nan,6,8],index=dates).shift(2)print(s)print(df.sub(s,axis='index'))//2017-08-01    NaN2017-08-02    NaN2017-08-03    1.02017-08-04    3.02017-08-05    5.02017-08-06    NaNFreq: D, dtype: float64                   A         B         C         D2017-08-01       NaN       NaN       NaN       NaN2017-08-02       NaN       NaN       NaN       NaN2017-08-03 -1.830701  0.504959 -0.709195 -0.8476862017-08-04 -3.172816 -2.310923 -1.233996 -1.9667262017-08-05 -4.368654 -4.300136 -3.945498 -3.2712502017-08-06       NaN       NaN       NaN       NaN

3、对数据应用函数

print(df)print(df.apply(np.cumsum))# 累计求和//                   A         B         C         D2017-08-01  2.643719 -1.955916 -0.625852 -0.2001302017-08-02  0.510827 -1.552670 -0.669396  1.1461572017-08-03 -0.168740  0.055871 -0.108548  1.1981552017-08-04 -1.524010  0.844605 -0.157555  1.3879632017-08-05  0.220436 -1.345371  0.038241 -0.0553062017-08-06 -1.299420 -1.663129 -2.586310  1.218230                   A         B         C         D2017-08-01  2.643719 -1.955916 -0.625852 -0.2001302017-08-02  3.154545 -3.508586 -1.295248  0.9460272017-08-03  2.985805 -3.452715 -1.403796  2.1441822017-08-04  1.461795 -2.608110 -1.561352  3.5321442017-08-05  1.682231 -3.953481 -1.523111  3.4768382017-08-06  0.382811 -5.616610 -4.109421  4.695068print(df.apply(lambda x:x.max() - x.min()))//A    1.510451B    3.455243C    1.772878D    2.703384dtype: float64

4、value_counts()

s = pd.Series(np.random.randint(0,7,size=10))print(s)print(s.value_counts())print(type(s.value_counts()))print(pd.value_counts(s))//0    11    12    33    54    25    46    27    08    29    6dtype: int322    31    26    15    14    13    10    1dtype: int64<class 'pandas.core.series.Series'>2    31    26    15    14    13    10    1dtype: int64Process finished with exit code 0

5、字符串方法
Series对象在其str属性中配备了一组字符串处理方法,可以很容易的应用到数组中的每个元素,如下段代码所示。

 s.str.lower() #将s Series 字符串变成小写。

十、合并
Pandas 提供了大量的方法能够轻松的对Series,DataFrame和Panel对象进行各种符合各种逻辑关系的合并 。
1、concat

df = pd.DataFrame(np.random.randn(10,4))print("is--",df)pieces = [df[:3],df[3:7],df[7:]]df1 =  pd.concat(pieces)print("is==",df1)//is--           0         1         2         30 -1.033090 -0.884844 -0.868251  0.5048891  0.264918 -2.486587 -1.985965 -1.7731312 -1.056175 -0.647278 -2.123045  0.2922243  0.664369 -0.185612 -0.628173  0.2149244  0.740074  0.515744 -2.547639 -0.1164055 -0.578900 -0.966765  0.455904 -0.6230746 -2.000682  1.712143  0.532171  0.8075207 -1.604986  1.015452  2.270513  0.7345738 -0.817223  0.211418  0.543868  0.9300369 -0.288155 -1.231245 -1.162904  0.372432is==           0         1         2         30 -1.033090 -0.884844 -0.868251  0.5048891  0.264918 -2.486587 -1.985965 -1.7731312 -1.056175 -0.647278 -2.123045  0.2922243  0.664369 -0.185612 -0.628173  0.2149244  0.740074  0.515744 -2.547639 -0.1164055 -0.578900 -0.966765  0.455904 -0.6230746 -2.000682  1.712143  0.532171  0.8075207 -1.604986  1.015452  2.270513  0.7345738 -0.817223  0.211418  0.543868  0.9300369 -0.288155 -1.231245 -1.162904  0.372432

2、Join类似与SQL类型的合并

left = pd.DataFrame({'key':['foo','foo'],'lval':[1,2]})right = pd.DataFrame({'key':['foo','foo'],'rval':[4,5]})print(left)print(right)print(pd.merge(left,right,on='key))//   key  lval0  foo     11  foo     2   key  rval0  foo     41  foo     5   key  lval  rval0  foo     1     41  foo     1     52  foo     2     43  foo     2     5

3、append将一行连接到一个DataFrame上。

print(df)s =df.iloc[3]print(df.append(s,ignore_index = True))//                   A         B         C         D2017-08-01  0.040061 -0.552461 -1.442001 -0.4434292017-08-02  1.850017 -0.526213  0.548660 -0.7601852017-08-03  1.233839 -0.338561 -1.441615  0.0084512017-08-04 -0.622002 -1.308038 -0.926464 -0.8901232017-08-05 -0.271971  0.994797 -0.806631 -0.1916092017-08-06  0.078366 -0.014979  0.603089 -1.145712          A         B         C         D0  0.040061 -0.552461 -1.442001 -0.4434291  1.850017 -0.526213  0.548660 -0.7601852  1.233839 -0.338561 -1.441615  0.0084513 -0.622002 -1.308038 -0.926464 -0.8901234 -0.271971  0.994797 -0.806631 -0.1916095  0.078366 -0.014979  0.603089 -1.1457126 -0.622002 -1.308038 -0.926464 -0.890123

4、分组
对于”group by”操作,我们通常是指以下一个或多个操作步骤:

  • (Splitting)按照一些规则将数据分为不同的组;
  • (Applying)对于每组数据分别执行一个函数;
  • (Combining)将结果组合到一个数据结构中;

1、 分组并对每个分组执行sum函数:
2、 通过多个列进行分组形成一个层次索引,然后执行函数:

详细参考:http://www.cnblogs.com/chaosimple/p/4153083.html

原创粉丝点击