十分钟搞定pandas

来源：互联网发布：国外买淘宝编辑：程序博客网时间：2024/06/05 02:05

原文出处： pandas.pydata.org 译文出处：石卓林

这是关于pandas的简短介绍，主要面向新用户。可以参阅Cookbook了解更复杂的使用方法。

习惯上,我们做以下导入

In [1]: import pandas as pdIn [2]: import numpy as npIn [3]: import matplotlib.pyplot as plt

In[1]:import pandasas pd

In [2]:import numpyas np

In[3]:import matplotlib.pyplotas plt

创建对象

使用传递的值列表序列创建序列, 让pandas创建默认整数索引

In [4]: s = pd.Series([1,3,5,np.nan,6,8])In [5]: sOut[5]: 0     11     32     53   NaN4     65     8dtype: float64

In[4]:s =pd.Series([1,3,5,np.nan,6,8])

In [5]:s

Out[5]:

0 1

1 3

2 5

3 NaN

4 6

5 8

dtype:float64

使用传递的numpy数组创建数据帧,并使用日期索引和标记列.

In [6]: dates = pd.date_range('20130101',periods=6)In [7]: datesOut[7]: <class 'pandas.tseries.index.DatetimeIndex'>[2013-01-01, ..., 2013-01-06]Length: 6, Freq: D, Timezone: NoneIn [8]: df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))In [9]: dfOut[9]:                    A         B         C         D2013-01-01  0.469112 -0.282863 -1.509059 -1.1356322013-01-02  1.212112 -0.173215  0.119209 -1.0442362013-01-03 -0.861849 -2.104569 -0.494929  1.0718042013-01-04  0.721555 -0.706771 -1.039575  0.2718602013-01-05 -0.424972  0.567020  0.276232 -1.0874012013-01-06 -0.673690  0.113648 -1.478427  0.524988

In[6]:dates =pd.date_range('20130101',periods=6)

In [7]:dates

Out[7]:

<class'pandas.tseries.index.DatetimeIndex'>

[2013-01-01,...,2013-01-06]

Length:6,Freq:D,Timezone:None

In [8]:df =pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))

In[9]:df

Out[9]:

A B C D

2013-01-01 0.469112-0.282863-1.509059-1.135632

2013-01-02 1.212112-0.173215 0.119209-1.044236

2013-01-03-0.861849-2.104569-0.494929 1.071804

2013-01-04 0.721555-0.706771-1.039575 0.271860

2013-01-05-0.424972 0.567020 0.276232-1.087401

2013-01-06-0.673690 0.113648-1.478427 0.524988

使用传递的可转换序列的字典对象创建数据帧.

In [10]: df2 = pd.DataFrame({ 'A' : 1.,   ....:                      'B' : pd.Timestamp('20130102'),   ....:                      'C' : pd.Series(1,index=list(range(4)),dtype='float32'),   ....:                      'D' : np.array([3] * 4,dtype='int32'),   ....:                      'E' : pd.Categorical(["test","train","test","train"]),   ....:                      'F' : 'foo' })   ....: In [11]: df2Out[11]:    A          B  C  D      E    F0  1 2013-01-02  1  3   test  foo1  1 2013-01-02  1  3  train  foo2  1 2013-01-02  1  3   test  foo3  1 2013-01-02  1  3  train  foo

In[10]:df2 =pd.DataFrame({'A' :1.,

....: 'B': pd.Timestamp('20130102'),

....: 'C': pd.Series(1,index=list(range(4)),dtype='float32'),

....: 'D': np.array([3]* 4,dtype='int32'),

....: 'E': pd.Categorical(["test","train","test","train"]),

....: 'F': 'foo'})

....:

In [11]:df2

Out[11]:

A B C D E F

0 12013-01-02 1 3 test foo

1 12013-01-02 1 3 train foo

2 12013-01-02 1 3 test foo

3 12013-01-02 1 3 train foo

所有明确类型

In [12]: df2.dtypesOut[12]: A           float64B    datetime64[ns]C           float32D             int32E          categoryF            objectdtype: object

In[12]:df2.dtypes

Out[12]:

A float64

B datetime64[ns]

C float32

D int32

E category

F object

dtype:object

如果你这个正在使用IPython，标签补全列名（以及公共属性）将自动启用。这里是将要完成的属性的子集：

In [13]: df2.<TAB>df2.A                  df2.boxplotdf2.abs                df2.Cdf2.add                df2.clipdf2.add_prefix         df2.clip_lowerdf2.add_suffix         df2.clip_upperdf2.align              df2.columnsdf2.all                df2.combinedf2.any                df2.combineAdddf2.append             df2.combine_firstdf2.apply              df2.combineMultdf2.applymap           df2.compounddf2.as_blocks          df2.consolidatedf2.asfreq             df2.convert_objectsdf2.as_matrix          df2.copydf2.astype             df2.corrdf2.at                 df2.corrwithdf2.at_time            df2.countdf2.axes               df2.covdf2.B                  df2.cummaxdf2.between_time       df2.cummindf2.bfill              df2.cumproddf2.blocks             df2.cumsumdf2.bool               df2.D

In[13]:df2.<TAB>

df2.A df2.boxplot

df2.abs df2.C

df2.add df2.clip

df2.add_prefix df2.clip_lower

df2.add_suffix df2.clip_upper

df2.align df2.columns

df2.all df2.combine

df2.any df2.combineAdd

df2.append df2.combine_first

df2.apply df2.combineMult

df2.applymap df2.compound

df2.as_blocks df2.consolidate

df2.asfreq df2.convert_objects

df2.as_matrix df2.copy

df2.astype df2.corr

df2.at df2.corrwith

df2.at_time df2.count

df2.axes df2.cov

df2.B df2.cummax

df2.between_time df2.cummin

df2.bfill df2.cumprod

df2.blocks df2.cumsum

df2.bool df2.D

如你所见, 列 A, B, C, 和 D 也是自动完成标签. E 也是可用的; 为了简便起见,后面的属性显示被截断.

查看数据

参阅基础部分

查看帧顶部和底部行

In [14]: df.head()Out[14]:                    A         B         C         D2013-01-01  0.469112 -0.282863 -1.509059 -1.1356322013-01-02  1.212112 -0.173215  0.119209 -1.0442362013-01-03 -0.861849 -2.104569 -0.494929  1.0718042013-01-04  0.721555 -0.706771 -1.039575  0.2718602013-01-05 -0.424972  0.567020  0.276232 -1.087401In [15]: df.tail(3)Out[15]:                    A         B         C         D2013-01-04  0.721555 -0.706771 -1.039575  0.2718602013-01-05 -0.424972  0.567020  0.276232 -1.0874012013-01-06 -0.673690  0.113648 -1.478427  0.524988

In[14]:df.head()

Out[14]:

A B C D

2013-01-01 0.469112-0.282863-1.509059-1.135632

2013-01-02 1.212112-0.173215 0.119209-1.044236

2013-01-03-0.861849-2.104569-0.494929 1.071804

2013-01-04 0.721555-0.706771-1.039575 0.271860

2013-01-05-0.424972 0.567020 0.276232-1.087401

In [15]:df.tail(3)

Out[15]:

A B C D

2013-01-04 0.721555-0.706771-1.039575 0.271860

2013-01-05-0.424972 0.567020 0.276232-1.087401

2013-01-06-0.673690 0.113648-1.478427 0.524988

显示索引,列,和底层numpy数据

In [16]: df.indexOut[16]: <class 'pandas.tseries.index.DatetimeIndex'>[2013-01-01, ..., 2013-01-06]Length: 6, Freq: D, Timezone: NoneIn [17]: df.columnsOut[17]: Index([u'A', u'B', u'C', u'D'], dtype='object')In [18]: df.valuesOut[18]: array([[ 0.4691, -0.2829, -1.5091, -1.1356],       [ 1.2121, -0.1732,  0.1192, -1.0442],       [-0.8618, -2.1046, -0.4949,  1.0718],       [ 0.7216, -0.7068, -1.0396,  0.2719],       [-0.425 ,  0.567 ,  0.2762, -1.0874],       [-0.6737,  0.1136, -1.4784,  0.525 ]])

In[16]:df.index

Out[16]:

<class'pandas.tseries.index.DatetimeIndex'>

[2013-01-01,...,2013-01-06]

Length:6,Freq:D,Timezone:None

In[17]:df.columns

Out[17]:Index([u'A',u'B',u'C',u'D'],dtype='object')

In [18]:df.values

Out[18]:

array([[0.4691,-0.2829,-1.5091,-1.1356],

[ 1.2121,-0.1732, 0.1192,-1.0442],

[-0.8618,-2.1046,-0.4949, 1.0718],

[ 0.7216,-0.7068,-1.0396, 0.2719],

[-0.425, 0.567, 0.2762,-1.0874],

[-0.6737, 0.1136,-1.4784, 0.525]])

描述显示数据快速统计摘要

In [19]: df.describe()Out[19]:               A         B         C         Dcount  6.000000  6.000000  6.000000  6.000000mean   0.073711 -0.431125 -0.687758 -0.233103std    0.843157  0.922818  0.779887  0.973118min   -0.861849 -2.104569 -1.509059 -1.13563225%   -0.611510 -0.600794 -1.368714 -1.07661050%    0.022070 -0.228039 -0.767252 -0.38618875%    0.658444  0.041933 -0.034326  0.461706max    1.212112  0.567020  0.276232  1.071804

In[19]:df.describe()

Out[19]:

A B C D

count 6.000000 6.000000 6.000000 6.000000

mean 0.073711 -0.431125-0.687758-0.233103

std 0.843157 0.922818 0.779887 0.973118

min -0.861849-2.104569-1.509059-1.135632

25% -0.611510-0.600794-1.368714-1.076610

50% 0.022070-0.228039-0.767252-0.386188

75% 0.658444 0.041933-0.034326 0.461706

max 1.212112 0.567020 0.276232 1.071804

转置数据

In [20]: df.TOut[20]:    2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06A    0.469112    1.212112   -0.861849    0.721555   -0.424972   -0.673690B   -0.282863   -0.173215   -2.104569   -0.706771    0.567020    0.113648C   -1.509059    0.119209   -0.494929   -1.039575    0.276232   -1.478427D   -1.135632   -1.044236    1.071804    0.271860   -1.087401    0.524988

In[20]:df.T

Out[20]:

2013-01-01 2013-01-02 2013-01-03 2013-01-04 2013-01-05 2013-01-06

A 0.469112 1.212112 -0.861849 0.721555 -0.424972 -0.673690

B -0.282863 -0.173215 -2.104569 -0.706771 0.567020 0.113648

C -1.509059 0.119209 -0.494929 -1.039575 0.276232 -1.478427

D -1.135632 -1.044236 1.071804 0.271860 -1.087401 0.524988

按轴排序

In [21]: df.sort_index(axis=1, ascending=False)Out[21]:                    D         C         B         A2013-01-01 -1.135632 -1.509059 -0.282863  0.4691122013-01-02 -1.044236  0.119209 -0.173215  1.2121122013-01-03  1.071804 -0.494929 -2.104569 -0.8618492013-01-04  0.271860 -1.039575 -0.706771  0.7215552013-01-05 -1.087401  0.276232  0.567020 -0.4249722013-01-06  0.524988 -1.478427  0.113648 -0.673690

In[21]:df.sort_index(axis=1,ascending=False)

Out[21]:

D C B A

2013-01-01-1.135632-1.509059-0.282863 0.469112

2013-01-02-1.044236 0.119209-0.173215 1.212112

2013-01-03 1.071804-0.494929-2.104569-0.861849

2013-01-04 0.271860-1.039575-0.706771 0.721555

2013-01-05-1.087401 0.276232 0.567020-0.424972

2013-01-06 0.524988-1.478427 0.113648-0.673690

按值排序

In [22]: df.sort(columns='B')Out[22]:                    A         B         C         D2013-01-03 -0.861849 -2.104569 -0.494929  1.0718042013-01-04  0.721555 -0.706771 -1.039575  0.2718602013-01-01  0.469112 -0.282863 -1.509059 -1.1356322013-01-02  1.212112 -0.173215  0.119209 -1.0442362013-01-06 -0.673690  0.113648 -1.478427  0.5249882013-01-05 -0.424972  0.567020  0.276232 -1.087401

In[22]:df.sort(columns='B')

Out[22]:

A B C D

2013-01-03-0.861849-2.104569-0.494929 1.071804

2013-01-04 0.721555-0.706771-1.039575 0.271860

2013-01-01 0.469112-0.282863-1.509059-1.135632

2013-01-02 1.212112-0.173215 0.119209-1.044236

2013-01-06-0.673690 0.113648-1.478427 0.524988

2013-01-05-0.424972 0.567020 0.276232-1.087401

选择器

注释: 标准Python / Numpy表达式可以完成这些互动工作, 但在生产代码中, 我们推荐使用优化的pandas数据访问方法, .at, .iat, .loc, .iloc 和 .ix.

参阅索引文档索引和选择数据 and 多索引/高级索引

读取

选择单列, 这会产生一个序列, 等价df.A

In [23]: df['A']Out[23]: 2013-01-01    0.4691122013-01-02    1.2121122013-01-03   -0.8618492013-01-04    0.7215552013-01-05   -0.4249722013-01-06   -0.673690Freq: D, Name: A, dtype: float64

In[23]:df['A']

Out[23]:

2013-01-01 0.469112

2013-01-02 1.212112

2013-01-03 -0.861849

2013-01-04 0.721555

2013-01-05 -0.424972

2013-01-06 -0.673690

Freq:D,Name:A,dtype:float64

使用[]选择行片断

In [24]: df[0:3]Out[24]:                    A         B         C         D2013-01-01  0.469112 -0.282863 -1.509059 -1.1356322013-01-02  1.212112 -0.173215  0.119209 -1.0442362013-01-03 -0.861849 -2.104569 -0.494929  1.071804In [25]: df['20130102':'20130104']Out[25]:                    A         B         C         D2013-01-02  1.212112 -0.173215  0.119209 -1.0442362013-01-03 -0.861849 -2.104569 -0.494929  1.0718042013-01-04  0.721555 -0.706771 -1.039575  0.271860

In[24]:df[0:3]

Out[24]:

A B C D

2013-01-01 0.469112-0.282863-1.509059-1.135632

2013-01-02 1.212112-0.173215 0.119209-1.044236

2013-01-03-0.861849-2.104569-0.494929 1.071804

In [25]:df['20130102':'20130104']

Out[25]:

A B C D

2013-01-02 1.212112-0.173215 0.119209-1.044236

2013-01-03-0.861849-2.104569-0.494929 1.071804

2013-01-04 0.721555-0.706771-1.039575 0.271860

使用标签选择

更多信息请参阅按标签选择

使用标签获取横截面

In [26]: df.loc[dates[0]]Out[26]: A    0.469112B   -0.282863C   -1.509059D   -1.135632Name: 2013-01-01 00:00:00, dtype: float64

In[26]:df.loc[dates[0]]

Out[26]:

A 0.469112

B -0.282863

C -1.509059

D -1.135632

Name:2013-01-0100:00:00,dtype:float64

使用标签选择多轴

In [27]: df.loc[:,['A','B']]Out[27]:                    A         B2013-01-01  0.469112 -0.2828632013-01-02  1.212112 -0.1732152013-01-03 -0.861849 -2.1045692013-01-04  0.721555 -0.7067712013-01-05 -0.424972  0.5670202013-01-06 -0.673690  0.113648

In[27]:df.loc[:,['A','B']]

Out[27]:

A B

2013-01-01 0.469112-0.282863

2013-01-02 1.212112-0.173215

2013-01-03-0.861849-2.104569

2013-01-04 0.721555-0.706771

2013-01-05-0.424972 0.567020

2013-01-06-0.673690 0.113648

显示标签切片, 包含两个端点

In [28]: df.loc['20130102':'20130104',['A','B']]Out[28]:                    A         B2013-01-02  1.212112 -0.1732152013-01-03 -0.861849 -2.1045692013-01-04  0.721555 -0.706771

In[28]:df.loc['20130102':'20130104',['A','B']]

Out[28]:

A B

2013-01-02 1.212112-0.173215

2013-01-03-0.861849-2.104569

2013-01-04 0.721555-0.706771

降低返回对象维度

In [29]: df.loc['20130102',['A','B']]Out[29]: A    1.212112B   -0.173215Name: 2013-01-02 00:00:00, dtype: float64

In[29]:df.loc['20130102',['A','B']]

Out[29]:

A 1.212112

B -0.173215

Name:2013-01-0200:00:00,dtype:float64

获取标量值

In [30]: df.loc[dates[0],'A']Out[30]: 0.46911229990718628

In[30]:df.loc[dates[0],'A']

Out[30]:0.46911229990718628

快速访问并获取标量数据 (等价上面的方法)

In [31]: df.at[dates[0],'A']Out[31]: 0.46911229990718628

In[31]:df.at[dates[0],'A']

Out[31]:0.46911229990718628

按位置选择

更多信息请参阅按位置参阅

传递整数选择位置

In [32]: df.iloc[3]Out[32]: A    0.721555B   -0.706771C   -1.039575D    0.271860Name: 2013-01-04 00:00:00, dtype: float64

In[32]:df.iloc[3]

Out[32]:

A 0.721555

B -0.706771

C -1.039575

D 0.271860

Name:2013-01-0400:00:00,dtype:float64

使用整数片断,效果类似numpy/python

In [33]: df.iloc[3:5,0:2]Out[33]:                    A         B2013-01-04  0.721555 -0.7067712013-01-05 -0.424972  0.567020

In[33]:df.iloc[3:5,0:2]

Out[33]:

A B

2013-01-04 0.721555-0.706771

2013-01-05-0.424972 0.567020

使用整数偏移定位列表,效果类似 numpy/python 样式

In [34]: df.iloc[[1,2,4],[0,2]]Out[34]:                    A         C2013-01-02  1.212112  0.1192092013-01-03 -0.861849 -0.4949292013-01-05 -0.424972  0.276232

In[34]:df.iloc[[1,2,4],[0,2]]

Out[34]:

A C

2013-01-02 1.212112 0.119209

2013-01-03-0.861849-0.494929

2013-01-05-0.424972 0.276232

显式行切片

In [35]: df.iloc[1:3,:]Out[35]:                    A         B         C         D2013-01-02  1.212112 -0.173215  0.119209 -1.0442362013-01-03 -0.861849 -2.104569 -0.494929  1.071804

In[35]:df.iloc[1:3,:]

Out[35]:

A B C D

2013-01-02 1.212112-0.173215 0.119209-1.044236

2013-01-03-0.861849-2.104569-0.494929 1.071804

显式列切片

In [36]: df.iloc[:,1:3]Out[36]:                    B         C2013-01-01 -0.282863 -1.5090592013-01-02 -0.173215  0.1192092013-01-03 -2.104569 -0.4949292013-01-04 -0.706771 -1.0395752013-01-05  0.567020  0.2762322013-01-06  0.113648 -1.478427

In[36]:df.iloc[:,1:3]

Out[36]:

B C

2013-01-01-0.282863-1.509059

2013-01-02-0.173215 0.119209

2013-01-03-2.104569-0.494929

2013-01-04-0.706771-1.039575

2013-01-05 0.567020 0.276232

2013-01-06 0.113648-1.478427

显式获取一个值

In [37]: df.iloc[1,1]Out[37]: -0.17321464905330861

In[37]:df.iloc[1,1]

Out[37]:-0.17321464905330861

快速访问一个标量（等同上个方法）

In [38]: df.iat[1,1]Out[38]: -0.17321464905330861

In[38]:df.iat[1,1]

Out[38]:-0.17321464905330861

布尔索引

使用单个列的值选择数据.

In [39]: df[df.A > 0]Out[39]:                    A         B         C         D2013-01-01  0.469112 -0.282863 -1.509059 -1.1356322013-01-02  1.212112 -0.173215  0.119209 -1.0442362013-01-04  0.721555 -0.706771 -1.039575  0.271860

In[39]:df[df.A> 0]

Out[39]:

A B C D

2013-01-01 0.469112-0.282863-1.509059-1.135632

2013-01-02 1.212112-0.173215 0.119209-1.044236

2013-01-04 0.721555-0.706771-1.039575 0.271860

where 操作.

In [40]: df[df > 0]Out[40]:                    A         B         C         D2013-01-01  0.469112       NaN       NaN       NaN2013-01-02  1.212112       NaN  0.119209       NaN2013-01-03       NaN       NaN       NaN  1.0718042013-01-04  0.721555       NaN       NaN  0.2718602013-01-05       NaN  0.567020  0.276232       NaN2013-01-06       NaN  0.113648       NaN  0.524988

In[40]:df[df> 0]

Out[40]:

A B C D

2013-01-01 0.469112 NaN NaN NaN

2013-01-02 1.212112 NaN 0.119209 NaN

2013-01-03 NaN NaN NaN 1.071804

2013-01-04 0.721555 NaN NaN 0.271860

2013-01-05 NaN 0.567020 0.276232 NaN

2013-01-06 NaN 0.113648 NaN 0.524988

使用 isin() 筛选：

In [41]: df2 = df.copy()In [42]: df2['E']=['one', 'one','two','three','four','three']In [43]: df2Out[43]:                    A         B         C         D      E2013-01-01  0.469112 -0.282863 -1.509059 -1.135632    one2013-01-02  1.212112 -0.173215  0.119209 -1.044236    one2013-01-03 -0.861849 -2.104569 -0.494929  1.071804    two2013-01-04  0.721555 -0.706771 -1.039575  0.271860  three2013-01-05 -0.424972  0.567020  0.276232 -1.087401   four2013-01-06 -0.673690  0.113648 -1.478427  0.524988  threeIn [44]: df2[df2['E'].isin(['two','four'])]Out[44]:                    A         B         C         D     E2013-01-03 -0.861849 -2.104569 -0.494929  1.071804   two2013-01-05 -0.424972  0.567020  0.276232 -1.087401  four

In[41]:df2 =df.copy()

In [42]:df2['E']=['one','one','two','three','four','three']

In [43]:df2

Out[43]:

A B C D E

2013-01-01 0.469112-0.282863-1.509059-1.135632 one

2013-01-02 1.212112-0.173215 0.119209-1.044236 one

2013-01-03-0.861849-2.104569-0.494929 1.071804 two

2013-01-04 0.721555-0.706771-1.039575 0.271860 three

2013-01-05-0.424972 0.567020 0.276232-1.087401 four

2013-01-06-0.673690 0.113648-1.478427 0.524988 three

In [44]:df2[df2['E'].isin(['two','four'])]

Out[44]:

A B C D E

2013-01-03-0.861849-2.104569-0.494929 1.071804 two

2013-01-05-0.424972 0.567020 0.276232-1.087401 four

赋值

赋值一个新列，通过索引自动对齐数据

In [45]: s1 = pd.Series([1,2,3,4,5,6],index=pd.date_range('20130102',periods=6))In [46]: s1Out[46]: 2013-01-02    12013-01-03    22013-01-04    32013-01-05    42013-01-06    52013-01-07    6Freq: D, dtype: int64In [47]: df['F'] = s1

In[45]:s1 =pd.Series([1,2,3,4,5,6],index=pd.date_range('20130102',periods=6))

In [46]:s1

Out[46]:

2013-01-02 1

2013-01-03 2

2013-01-04 3

2013-01-05 4

2013-01-06 5

2013-01-07 6

Freq:D,dtype:int64

In [47]:df['F']= s1

按标签赋值

In [48]: df.at[dates[0],'A'] = 0

In[48]:df.at[dates[0],'A']= 0

按位置赋值

In [49]: df.iat[0,1] = 0

In[49]:df.iat[0,1]= 0

通过numpy数组分配赋值

In [50]: df.loc[:,'D'] = np.array([5] * len(df))

In[50]:df.loc[:,'D']= np.array([5]* len(df))

之前的操作结果

In [51]: dfOut[51]:                    A         B         C  D   F2013-01-01  0.000000  0.000000 -1.509059  5 NaN2013-01-02  1.212112 -0.173215  0.119209  5   12013-01-03 -0.861849 -2.104569 -0.494929  5   22013-01-04  0.721555 -0.706771 -1.039575  5   32013-01-05 -0.424972  0.567020  0.276232  5   42013-01-06 -0.673690  0.113648 -1.478427  5   5

In[51]:df

Out[51]:

A B C D F

2013-01-01 0.000000 0.000000-1.509059 5NaN

2013-01-02 1.212112-0.173215 0.119209 5 1

2013-01-03-0.861849-2.104569-0.494929 5 2

2013-01-04 0.721555-0.706771-1.039575 5 3

2013-01-05-0.424972 0.567020 0.276232 5 4

2013-01-06-0.673690 0.113648-1.478427 5 5

where 操作赋值.

In [52]: df2 = df.copy()In [53]: df2[df2 > 0] = -df2In [54]: df2Out[54]:                    A         B         C  D   F2013-01-01  0.000000  0.000000 -1.509059 -5 NaN2013-01-02 -1.212112 -0.173215 -0.119209 -5  -12013-01-03 -0.861849 -2.104569 -0.494929 -5  -22013-01-04 -0.721555 -0.706771 -1.039575 -5  -32013-01-05 -0.424972 -0.567020 -0.276232 -5  -42013-01-06 -0.673690 -0.113648 -1.478427 -5  -5

In[52]:df2 =df.copy()

In [53]:df2[df2> 0]= -df2

In[54]:df2

Out[54]:

A B C D F

2013-01-01 0.000000 0.000000-1.509059-5NaN

2013-01-02-1.212112-0.173215-0.119209-5 -1

2013-01-03-0.861849-2.104569-0.494929-5 -2

2013-01-04-0.721555-0.706771-1.039575-5 -3

2013-01-05-0.424972-0.567020-0.276232-5 -4

2013-01-06-0.673690-0.113648-1.478427-5 -5

丢失的数据

pandas主要使用np.nan替换丢失的数据. 默认情况下它并不包含在计算中. 请参阅 Missing Data section

重建索引允许更改/添加/删除指定轴索引,并返回数据副本.

In [55]: df1 = df.reindex(index=dates[0:4],columns=list(df.columns) + ['E'])In [56]: df1.loc[dates[0]:dates[1],'E'] = 1In [57]: df1Out[57]:                    A         B         C  D   F   E2013-01-01  0.000000  0.000000 -1.509059  5 NaN   12013-01-02  1.212112 -0.173215  0.119209  5   1   12013-01-03 -0.861849 -2.104569 -0.494929  5   2 NaN2013-01-04  0.721555 -0.706771 -1.039575  5   3 NaN

In[55]:df1 =df.reindex(index=dates[0:4],columns=list(df.columns)+ ['E'])

In [56]:df1.loc[dates[0]:dates[1],'E']= 1

In[57]:df1

Out[57]:

A B C D F E

2013-01-01 0.000000 0.000000-1.509059 5NaN 1

2013-01-02 1.212112-0.173215 0.119209 5 1 1

2013-01-03-0.861849-2.104569-0.494929 5 2 NaN

2013-01-04 0.721555-0.706771-1.039575 5 3 NaN

删除任何有丢失数据的行.

In [58]: df1.dropna(how='any')Out[58]:                    A         B         C  D  F  E2013-01-02  1.212112 -0.173215  0.119209  5  1  1

In[58]:df1.dropna(how='any')

Out[58]:

A B C D F E

2013-01-02 1.212112-0.173215 0.119209 5 1 1

填充丢失数据

In [59]: df1.fillna(value=5)Out[59]:                    A         B         C  D  F  E2013-01-01  0.000000  0.000000 -1.509059  5  5  12013-01-02  1.212112 -0.173215  0.119209  5  1  12013-01-03 -0.861849 -2.104569 -0.494929  5  2  52013-01-04  0.721555 -0.706771 -1.039575  5  3  5

In[59]:df1.fillna(value=5)

Out[59]:

A B C D F E

2013-01-01 0.000000 0.000000-1.509059 5 5 1

2013-01-02 1.212112-0.173215 0.119209 5 1 1

2013-01-03-0.861849-2.104569-0.494929 5 2 5

2013-01-04 0.721555-0.706771-1.039575 5 3 5

获取值是否nan的布尔标记

In [60]: pd.isnull(df1)Out[60]:                 A      B      C      D      F      E2013-01-01  False  False  False  False   True  False2013-01-02  False  False  False  False  False  False2013-01-03  False  False  False  False  False   True2013-01-04  False  False  False  False  False   True

In[60]:pd.isnull(df1)

Out[60]:

A B C D F E

2013-01-01 False False False False True False

2013-01-02 False False False False False False

2013-01-03 False False False False False True

2013-01-04 False False False False False True

运算

参阅二元运算基础

统计

计算时一般不包括丢失的数据

执行描述性统计

In [61]: df.mean()Out[61]: A   -0.004474B   -0.383981C   -0.687758D    5.000000F    3.000000dtype: float64

In[61]:df.mean()

Out[61]:

A -0.004474

B -0.383981

C -0.687758

D 5.000000

F 3.000000

dtype:float64

在其他轴做相同的运算

In [62]: df.mean(1)Out[62]: 2013-01-01    0.8727352013-01-02    1.4316212013-01-03    0.7077312013-01-04    1.3950422013-01-05    1.8836562013-01-06    1.592306Freq: D, dtype: float64

In[62]:df.mean(1)

Out[62]:

2013-01-01 0.872735

2013-01-02 1.431621

2013-01-03 0.707731

2013-01-04 1.395042

2013-01-05 1.883656

2013-01-06 1.592306

Freq:D,dtype:float64

用于运算的对象有不同的维度并需要对齐.除此之外，pandas会自动沿着指定维度计算.

In [63]: s = pd.Series([1,3,5,np.nan,6,8],index=dates).shift(2)In [64]: sOut[64]: 2013-01-01   NaN2013-01-02   NaN2013-01-03     12013-01-04     32013-01-05     52013-01-06   NaNFreq: D, dtype: float64In [65]: df.sub(s,axis='index')Out[65]:                    A         B         C   D   F2013-01-01       NaN       NaN       NaN NaN NaN2013-01-02       NaN       NaN       NaN NaN NaN2013-01-03 -1.861849 -3.104569 -1.494929   4   12013-01-04 -2.278445 -3.706771 -4.039575   2   02013-01-05 -5.424972 -4.432980 -4.723768   0  -12013-01-06       NaN       NaN       NaN NaN NaN

In[63]:s =pd.Series([1,3,5,np.nan,6,8],index=dates).shift(2)

In [64]:s

Out[64]:

2013-01-01 NaN

2013-01-02 NaN

2013-01-03 1

2013-01-04 3

2013-01-05 5

2013-01-06 NaN

Freq:D,dtype:float64

In [65]:df.sub(s,axis='index')

Out[65]:

A B C D F

2013-01-01 NaN NaN NaN NaN NaN

2013-01-02 NaN NaN NaN NaN NaN

2013-01-03-1.861849-3.104569-1.494929 4 1

2013-01-04-2.278445-3.706771-4.039575 2 0

2013-01-05-5.424972-4.432980-4.723768 0 -1

2013-01-06 NaN NaN NaN NaN NaN

Apply

在数据上使用函数

In [66]: df.apply(np.cumsum)Out[66]:                    A         B         C   D   F2013-01-01  0.000000  0.000000 -1.509059   5 NaN2013-01-02  1.212112 -0.173215 -1.389850  10   12013-01-03  0.350263 -2.277784 -1.884779  15   32013-01-04  1.071818 -2.984555 -2.924354  20   62013-01-05  0.646846 -2.417535 -2.648122  25  102013-01-06 -0.026844 -2.303886 -4.126549  30  15In [67]: df.apply(lambda x: x.max() - x.min())Out[67]: A    2.073961B    2.671590C    1.785291D    0.000000F    4.000000dtype: float64

In[66]:df.apply(np.cumsum)

Out[66]:

A B C D F

2013-01-01 0.000000 0.000000-1.509059 5 NaN

2013-01-02 1.212112-0.173215-1.389850 10 1

2013-01-03 0.350263-2.277784-1.884779 15 3

2013-01-04 1.071818-2.984555-2.924354 20 6

2013-01-05 0.646846-2.417535-2.648122 25 10

2013-01-06-0.026844-2.303886-4.126549 30 15

In[67]:df.apply(lambdax:x.max()- x.min())

Out[67]:

A 2.073961

B 2.671590

C 1.785291

D 0.000000

F 4.000000

dtype:float64

直方图

请参阅直方图和离散化

In [68]: s = pd.Series(np.random.randint(0,7,size=10))In [69]: sOut[69]: 0    41    22    13    24    65    46    47    68    49    4dtype: int32In [70]: s.value_counts()Out[70]: 4    56    22    21    1dtype: int64

In[68]:s =pd.Series(np.random.randint(0,7,size=10))

In [69]:s

Out[69]:

0 4

1 2

2 1

3 2

4 6

5 4

6 4

7 6

8 4

9 4

dtype:int32

In [70]:s.value_counts()

Out[70]:

4 5

6 2

2 2

1 1

dtype:int64

字符串方法

序列可以使用一些字符串处理方法很轻易操作数据组中的每个元素,比如以下代码片断。注意字符匹配方法默认情况下通常使用正则表达式（并且大多数时候都如此）. 更多信息请参阅字符串向量方法.

In [71]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])In [72]: s.str.lower()Out[72]: 0       a1       b2       c3    aaba4    baca5     NaN6    caba7     dog8     catdtype: object

In[71]:s =pd.Series(['A','B','C','Aaba','Baca',np.nan,'CABA','dog','cat'])

In [72]:s.str.lower()

Out[72]:

0 a

1 b

2 c

3 aaba

4 baca

5 NaN

6 caba

7 dog

8 cat

dtype:object

合并

连接

pandas提供各种工具以简便合并序列,数据桢,和组合对象, 在连接/合并类型操作中使用多种类型索引和相关数学函数.

请参阅合并部分

把pandas对象连接到一起

In [73]: df = pd.DataFrame(np.random.randn(10, 4))In [74]: dfOut[74]:           0         1         2         30 -0.548702  1.467327 -1.015962 -0.4830751  1.637550 -1.217659 -0.291519 -1.7455052 -0.263952  0.991460 -0.919069  0.2660463 -0.709661  1.669052  1.037882 -1.7057754 -0.919854 -0.042379  1.247642 -0.0099205  0.290213  0.495767  0.362949  1.5481066 -1.131345 -0.089329  0.337863 -0.9458677 -0.932132  1.956030  0.017587 -0.0166928 -0.575247  0.254161 -1.143704  0.2158979  1.193555 -0.077118 -0.408530 -0.862495# break it into piecesIn [75]: pieces = [df[:3], df[3:7], df[7:]]In [76]: pd.concat(pieces)Out[76]:           0         1         2         30 -0.548702  1.467327 -1.015962 -0.4830751  1.637550 -1.217659 -0.291519 -1.7455052 -0.263952  0.991460 -0.919069  0.2660463 -0.709661  1.669052  1.037882 -1.7057754 -0.919854 -0.042379  1.247642 -0.0099205  0.290213  0.495767  0.362949  1.5481066 -1.131345 -0.089329  0.337863 -0.9458677 -0.932132  1.956030  0.017587 -0.0166928 -0.575247  0.254161 -1.143704  0.2158979  1.193555 -0.077118 -0.408530 -0.862495

In[73]:df =pd.DataFrame(np.random.randn(10,4))

In [74]:df

Out[74]:

0 1 2 3

0-0.548702 1.467327-1.015962-0.483075

1 1.637550-1.217659-0.291519-1.745505

2-0.263952 0.991460-0.919069 0.266046

3 -0.709661 1.669052 1.037882-1.705775

4-0.919854-0.042379 1.247642-0.009920

5 0.290213 0.495767 0.362949 1.548106

6-1.131345-0.089329 0.337863-0.945867

7 -0.932132 1.956030 0.017587-0.016692

8-0.575247 0.254161-1.143704 0.215897

9 1.193555-0.077118-0.408530-0.862495

# break it into pieces

In[75]:pieces =[df[:3],df[3:7],df[7:]]

In [76]:pd.concat(pieces)

Out[76]:

0 1 2 3

0-0.548702 1.467327-1.015962-0.483075

1 1.637550-1.217659-0.291519-1.745505

2-0.263952 0.991460-0.919069 0.266046

3 -0.709661 1.669052 1.037882-1.705775

4-0.919854-0.042379 1.247642-0.009920

5 0.290213 0.495767 0.362949 1.548106

6-1.131345-0.089329 0.337863-0.945867

7 -0.932132 1.956030 0.017587-0.016692

8-0.575247 0.254161-1.143704 0.215897

9 1.193555-0.077118-0.408530-0.862495

连接

SQL样式合并. 请参阅数据库style联接

In [77]: left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})In [78]: right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})In [79]: leftOut[79]:    key  lval0  foo     11  foo     2In [80]: rightOut[80]:    key  rval0  foo     41  foo     5In [81]: pd.merge(left, right, on='key')Out[81]:    key  lval  rval0  foo     1     41  foo     1     52  foo     2     43  foo     2     5

In[77]:left =pd.DataFrame({'key':['foo','foo'],'lval':[1,2]})

In [78]:right =pd.DataFrame({'key':['foo','foo'],'rval':[4,5]})

In[79]:left

Out[79]:

key lval

0 foo 1

1 foo 2

In[80]:right

Out[80]:

key rval

0 foo 4

1 foo 5

In[81]:pd.merge(left,right,on='key')

Out[81]:

key lval rval

0 foo 1 4

1 foo 1 5

2 foo 2 4

3 foo 2 5

添加

添加行到数据增. 参阅添加

In [82]: df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])In [83]: dfOut[83]:           A         B         C         D0  1.346061  1.511763  1.627081 -0.9905821 -0.441652  1.211526  0.268520  0.0245802 -1.577585  0.396823 -0.105381 -0.5325323  1.453749  1.208843 -0.080952 -0.2646104 -0.727965 -0.589346  0.339969 -0.6932055 -0.339355  0.593616  0.884345  1.5914316  0.141809  0.220390  0.435589  0.1924517 -0.096701  0.803351  1.715071 -0.708758In [84]: s = df.iloc[3]In [85]: df.append(s, ignore_index=True)Out[85]:           A         B         C         D0  1.346061  1.511763  1.627081 -0.9905821 -0.441652  1.211526  0.268520  0.0245802 -1.577585  0.396823 -0.105381 -0.5325323  1.453749  1.208843 -0.080952 -0.2646104 -0.727965 -0.589346  0.339969 -0.6932055 -0.339355  0.593616  0.884345  1.5914316  0.141809  0.220390  0.435589  0.1924517 -0.096701  0.803351  1.715071 -0.7087588  1.453749  1.208843 -0.080952 -0.264610

In[82]:df =pd.DataFrame(np.random.randn(8,4),columns=['A','B','C','D'])

In [83]:df

Out[83]:

A B C D

0 1.346061 1.511763 1.627081-0.990582

1 -0.441652 1.211526 0.268520 0.024580

2-1.577585 0.396823-0.105381-0.532532

3 1.453749 1.208843-0.080952-0.264610

4-0.727965-0.589346 0.339969-0.693205

5 -0.339355 0.593616 0.884345 1.591431

6 0.141809 0.220390 0.435589 0.192451

7 -0.096701 0.803351 1.715071-0.708758

In [84]:s =df.iloc[3]

In[85]:df.append(s,ignore_index=True)

Out[85]:

A B C D

0 1.346061 1.511763 1.627081-0.990582

1-0.441652 1.211526 0.268520 0.024580

2 -1.577585 0.396823-0.105381-0.532532

3 1.453749 1.208843-0.080952-0.264610

4 -0.727965-0.589346 0.339969-0.693205

5-0.339355 0.593616 0.884345 1.591431

6 0.141809 0.220390 0.435589 0.192451

7-0.096701 0.803351 1.715071-0.708758

8 1.453749 1.208843-0.080952-0.264610

分组

对于“group by”指的是以下一个或多个处理

将数据按某些标准分割为不同的组
在每个独立组上应用函数
组合结果为一个数据结构

请参阅分组部分

In [86]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',   ....:                          'foo', 'bar', 'foo', 'foo'],   ....:                    'B' : ['one', 'one', 'two', 'three',   ....:                          'two', 'two', 'one', 'three'],   ....:                    'C' : np.random.randn(8),   ....:                    'D' : np.random.randn(8)})   ....: In [87]: dfOut[87]:      A      B         C         D0  foo    one -1.202872 -0.0552241  bar    one -1.814470  2.3959852  foo    two  1.018601  1.5528253  bar  three -0.595447  0.1665994  foo    two  1.395433  0.0476095  bar    two -0.392670 -0.1364736  foo    one  0.007207 -0.5617577  foo  three  1.928123 -1.623033

In[86]:df =pd.DataFrame({'A': ['foo','bar','foo','bar',

....: 'foo','bar','foo','foo'],

....: 'B': ['one','one','two','three',

....: 'two','two','one','three'],

....: 'C': np.random.randn(8),

....: 'D': np.random.randn(8)})

....:

In [87]:df

Out[87]:

A B C D

0 foo one-1.202872-0.055224

1 bar one-1.814470 2.395985

2 foo two 1.018601 1.552825

3 bar three-0.595447 0.166599

4 foo two 1.395433 0.047609

5 bar two-0.392670-0.136473

6 foo one 0.007207-0.561757

7 foo three 1.928123-1.623033

分组然后应用函数统计总和存放到结果组

In [88]: df.groupby('A').sum()Out[88]:             C        DA                     bar -2.802588  2.42611foo  3.146492 -0.63958

In[88]:df.groupby('A').sum()

Out[88]:

C D

bar-2.802588 2.42611

foo 3.146492-0.63958

按多列分组为层次索引,然后应用函数

In [89]: df.groupby(['A','B']).sum()Out[89]:                   C         DA   B                        bar one   -1.814470  2.395985    three -0.595447  0.166599    two   -0.392670 -0.136473foo one   -1.195665 -0.616981    three  1.928123 -1.623033    two    2.414034  1.600434

In[89]:df.groupby(['A','B']).sum()

Out[89]:

C D

A B

barone -1.814470 2.395985

three-0.595447 0.166599

two -0.392670-0.136473

foo one -1.195665-0.616981

three 1.928123-1.623033

two 2.414034 1.600434

重塑

请参阅章节分层索引和重塑.

堆叠

In [90]: tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',   ....:                      'foo', 'foo', 'qux', 'qux'],   ....:                     ['one', 'two', 'one', 'two',   ....:                      'one', 'two', 'one', 'two']]))   ....: In [91]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])In [92]: df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])In [93]: df2 = df[:4]In [94]: df2Out[94]:                      A         Bfirst second                    bar   one     0.029399 -0.542108      two     0.282696 -0.087302baz   one    -1.575170  1.771208      two     0.816482  1.100230

In[90]:tuples =list(zip(*[['bar','bar','baz','baz',

....: 'foo','foo','qux','qux'],

....: ['one','two','one','two',

....: 'one','two','one','two']]))

....:

In [91]:index =pd.MultiIndex.from_tuples(tuples,names=['first','second'])

In[92]:df =pd.DataFrame(np.random.randn(8,2),index=index,columns=['A','B'])

In [93]:df2 =df[:4]

In[94]:df2

Out[94]:

A B

first second

bar one 0.029399-0.542108

two 0.282696 -0.087302

baz one -1.575170 1.771208

two 0.816482 1.100230

堆叠函数 “压缩” 数据桢的列一个级别.

In [95]: stacked = df2.stack()In [96]: stackedOut[96]: first  second   bar    one     A    0.029399               B   -0.542108       two     A    0.282696               B   -0.087302baz    one     A   -1.575170               B    1.771208       two     A    0.816482               B    1.100230dtype: float64

In[95]:stacked =df2.stack()

In [96]:stacked

Out[96]:

first second

bar one A 0.029399

B -0.542108

two A 0.282696

B -0.087302

baz one A -1.575170

B 1.771208

two A 0.816482

B 1.100230

dtype:float64

被“堆叠”数据桢或序列(有多个索引作为索引), 其堆叠的反向操作是未堆栈, 上面的数据默认反堆叠到上一级别:

In [97]: stacked.unstack()Out[97]:                      A         Bfirst second                    bar   one     0.029399 -0.542108      two     0.282696 -0.087302baz   one    -1.575170  1.771208      two     0.816482  1.100230In [98]: stacked.unstack(1)Out[98]: second        one       twofirst                      bar   A  0.029399  0.282696      B -0.542108 -0.087302baz   A -1.575170  0.816482      B  1.771208  1.100230In [99]: stacked.unstack(0)Out[99]: first          bar       bazsecond                      one    A  0.029399 -1.575170       B -0.542108  1.771208two    A  0.282696  0.816482       B -0.087302  1.100230

In[97]:stacked.unstack()

Out[97]:

A B

first second

bar one 0.029399-0.542108

two 0.282696 -0.087302

baz one -1.575170 1.771208

two 0.816482 1.100230

In [98]:stacked.unstack(1)

Out[98]:

second one two

first

bar A 0.029399 0.282696

B-0.542108-0.087302

baz A-1.575170 0.816482

B 1.771208 1.100230

In[99]:stacked.unstack(0)

Out[99]:

first bar baz

second

one A 0.029399-1.575170

B-0.542108 1.771208

two A 0.282696 0.816482

B-0.087302 1.100230

数据透视表

查看数据透视表.

In [100]: df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,   .....:                    'B' : ['A', 'B', 'C'] * 4,   .....:                    'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,   .....:                    'D' : np.random.randn(12),   .....:                    'E' : np.random.randn(12)})   .....: In [101]: dfOut[101]:         A  B    C         D         E0     one  A  foo  1.418757 -0.1796661     one  B  foo -1.879024  1.2918362     two  C  foo  0.536826 -0.0096143   three  A  bar  1.006160  0.3921494     one  B  bar -0.029716  0.2645995     one  C  bar -1.146178 -0.0574096     two  A  foo  0.100900 -1.4256387   three  B  foo -1.035018  1.0240988     one  C  foo  0.314665 -0.1060629     one  A  bar -0.773723  1.82437510    two  B  bar -1.170653  0.59597411  three  C  bar  0.648740  1.167115

In[100]:df =pd.DataFrame({'A': ['one','one','two','three']* 3,

.....: 'B': ['A','B','C']* 4,

.....: 'C': ['foo','foo','foo','bar','bar','bar']* 2,

.....: 'D': np.random.randn(12),

.....: 'E': np.random.randn(12)})

.....:

In[101]:df

Out[101]:

A B C D E

0 one A foo 1.418757-0.179666

1 one B foo-1.879024 1.291836

2 two C foo 0.536826-0.009614

3 three A bar 1.006160 0.392149

4 one B bar-0.029716 0.264599

5 one C bar-1.146178-0.057409

6 two A foo 0.100900-1.425638

7 three B foo-1.035018 1.024098

8 one C foo 0.314665-0.106062

9 one A bar-0.773723 1.824375

10 two B bar-1.170653 0.595974

11 three C bar 0.648740 1.167115

我们可以从此数据非常容易的产生数据透视表:

In [102]: pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])Out[102]: C             bar       fooA     B                    one   A -0.773723  1.418757      B -0.029716 -1.879024      C -1.146178  0.314665three A  1.006160       NaN      B       NaN -1.035018      C  0.648740       NaNtwo   A       NaN  0.100900      B -1.170653       NaN      C       NaN  0.536826

In[102]:pd.pivot_table(df,values='D',index=['A','B'],columns=['C'])

Out[102]:

C bar foo

A B

one A -0.773723 1.418757

B-0.029716-1.879024

C-1.146178 0.314665

three A 1.006160 NaN

B NaN -1.035018

C 0.648740 NaN

two A NaN 0.100900

B-1.170653 NaN

C NaN 0.536826

时间序列

pandas有易用,强大且高效的函数用于高频数据重采样转换操作(例如,转换秒数据到5分钟数据), 这是很普遍的情况，但并不局限于金融应用, 请参阅时间序列章节

In [103]: rng = pd.date_range('1/1/2012', periods=100, freq='S')In [104]: ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)In [105]: ts.resample('5Min', how='sum')Out[105]: 2012-01-01    25083Freq: 5T, dtype: int32

In[103]:rng =pd.date_range('1/1/2012',periods=100,freq='S')

In [104]:ts =pd.Series(np.random.randint(0,500,len(rng)),index=rng)

In[105]:ts.resample('5Min',how='sum')

Out[105]:

2012-01-01 25083

Freq:5T,dtype:int32

时区表示

In [106]: rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')In [107]: ts = pd.Series(np.random.randn(len(rng)), rng)In [108]: tsOut[108]: 2012-03-06    0.4640002012-03-07    0.2273712012-03-08   -0.4969222012-03-09    0.3063892012-03-10   -2.290613Freq: D, dtype: float64In [109]: ts_utc = ts.tz_localize('UTC')In [110]: ts_utcOut[110]: 2012-03-06 00:00:00+00:00    0.4640002012-03-07 00:00:00+00:00    0.2273712012-03-08 00:00:00+00:00   -0.4969222012-03-09 00:00:00+00:00    0.3063892012-03-10 00:00:00+00:00   -2.290613Freq: D, dtype: float64

In[106]:rng =pd.date_range('3/6/2012 00:00',periods=5,freq='D')

In [107]:ts =pd.Series(np.random.randn(len(rng)),rng)

In[108]:ts

Out[108]:

2012-03-06 0.464000

2012-03-07 0.227371

2012-03-08 -0.496922

2012-03-09 0.306389

2012-03-10 -2.290613

Freq:D,dtype:float64

In [109]:ts_utc =ts.tz_localize('UTC')

In[110]:ts_utc

Out[110]:

2012-03-0600:00:00+00:00 0.464000

2012-03-0700:00:00+00:00 0.227371

2012-03-0800:00:00+00:00 -0.496922

2012-03-0900:00:00+00:00 0.306389

2012-03-1000:00:00+00:00 -2.290613

Freq:D,dtype:float64

转换到其它时区

In [111]: ts_utc.tz_convert('US/Eastern')Out[111]: 2012-03-05 19:00:00-05:00    0.4640002012-03-06 19:00:00-05:00    0.2273712012-03-07 19:00:00-05:00   -0.4969222012-03-08 19:00:00-05:00    0.3063892012-03-09 19:00:00-05:00   -2.290613Freq: D, dtype: float64

In[111]:ts_utc.tz_convert('US/Eastern')

Out[111]:

2012-03-0519:00:00-05:00 0.464000

2012-03-0619:00:00-05:00 0.227371

2012-03-0719:00:00-05:00 -0.496922

2012-03-0819:00:00-05:00 0.306389

2012-03-0919:00:00-05:00 -2.290613

Freq:D,dtype:float64

转换不同的时间跨度

In [112]: rng = pd.date_range('1/1/2012', periods=5, freq='M')In [113]: ts = pd.Series(np.random.randn(len(rng)), index=rng)In [114]: tsOut[114]: 2012-01-31   -1.1346232012-02-29   -1.5618192012-03-31   -0.2608382012-04-30    0.2819572012-05-31    1.523962Freq: M, dtype: float64In [115]: ps = ts.to_period()In [116]: psOut[116]: 2012-01   -1.1346232012-02   -1.5618192012-03   -0.2608382012-04    0.2819572012-05    1.523962Freq: M, dtype: float64In [117]: ps.to_timestamp()Out[117]: 2012-01-01   -1.1346232012-02-01   -1.5618192012-03-01   -0.2608382012-04-01    0.2819572012-05-01    1.523962Freq: MS, dtype: float64

In[112]:rng =pd.date_range('1/1/2012',periods=5,freq='M')

In [113]:ts =pd.Series(np.random.randn(len(rng)),index=rng)

In[114]:ts

Out[114]:

2012-01-31 -1.134623

2012-02-29 -1.561819

2012-03-31 -0.260838

2012-04-30 0.281957

2012-05-31 1.523962

Freq:M,dtype:float64

In [115]:ps =ts.to_period()

In[116]:ps

Out[116]:

2012-01 -1.134623

2012-02 -1.561819

2012-03 -0.260838

2012-04 0.281957

2012-05 1.523962

Freq:M,dtype:float64

In [117]:ps.to_timestamp()

Out[117]:

2012-01-01 -1.134623

2012-02-01 -1.561819

2012-03-01 -0.260838

2012-04-01 0.281957

2012-05-01 1.523962

Freq:MS,dtype:float64

转换时段并且使用一些运算函数, 下例中, 我们转换年报11月到季度结束每日上午9点数据

In [118]: prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')In [119]: ts = pd.Series(np.random.randn(len(prng)), prng)In [120]: ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9In [121]: ts.head()Out[121]: 1990-03-01 09:00   -0.9029371990-06-01 09:00    0.0681591990-09-01 09:00   -0.0578731990-12-01 09:00   -0.3682041991-03-01 09:00   -1.144073Freq: H, dtype: float64

In[118]:prng =pd.period_range('1990Q1','2000Q4',freq='Q-NOV')

In [119]:ts =pd.Series(np.random.randn(len(prng)),prng)

In[120]:ts.index= (prng.asfreq('M','e')+ 1).asfreq('H','s')+ 9

In [121]:ts.head()

Out[121]:

1990-03-0109:00 -0.902937

1990-06-0109:00 0.068159

1990-09-0109:00 -0.057873

1990-12-0109:00 -0.368204

1991-03-0109:00 -1.144073

Freq:H,dtype:float64

分类

自版本0.15起, pandas可以在数据桢中包含分类. 完整的文档, 请查看分类介绍 and theAPI文档.

In [122]: df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})

In[122]:df =pd.DataFrame({"id":[1,2,3,4,5,6],"raw_grade":['a','b','b','a','a','e']})

转换原始类别为分类数据类型.

In [123]: df["grade"] = df["raw_grade"].astype("category")In [124]: df["grade"]Out[124]: 0    a1    b2    b3    a4    a5    eName: grade, dtype: categoryCategories (3, object): [a, b, e]

In[123]:df["grade"]= df["raw_grade"].astype("category")

In [124]:df["grade"]

Out[124]:

0 a

1 b

2 b

3 a

4 a

5 e

Name:grade,dtype:category

Categories(3,object):[a,b,e]

重命令分类为更有意义的名称 (分配到Series.cat.categories对应位置!)

In [125]: df["grade"].cat.categories = ["very good", "good", "very bad"]

In[125]:df["grade"].cat.categories= ["very good","good","very bad"]

重排顺分类,同时添加缺少的分类(序列 .cat方法下返回新默认序列)

In [126]: df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])In [127]: df["grade"]Out[127]: 0    very good1         good2         good3    very good4    very good5     very badName: grade, dtype: categoryCategories (5, object): [very bad, bad, medium, good, very good]

In[126]:df["grade"]= df["grade"].cat.set_categories(["very bad", "bad","medium","good","very good"])

In [127]:df["grade"]

Out[127]:

0 verygood

1 good

2 good

3 verygood

4 verygood

5 very bad

Name:grade,dtype:category

Categories(5,object):[very bad,bad,medium,good,very good]

排列分类中的顺序,不是按词汇排列.

In [128]: df.sort("grade")Out[128]:    id raw_grade      grade5   6         e   very bad1   2         b       good2   3         b       good0   1         a  very good3   4         a  very good4   5         a  very good

In[128]:df.sort("grade")

Out[128]:

id raw_grade grade

5 6 e verybad

1 2 b good

2 3 b good

0 1 a verygood

3 4 a verygood

4 5 a verygood

类别列分组,并且也显示空类别.

In [129]: df.groupby("grade").size()Out[129]: gradevery bad      1bad         NaNmedium      NaNgood          2very good     3dtype: float64

In[129]:df.groupby("grade").size()

Out[129]:

grade

very bad 1

bad NaN

medium NaN

good 2

very good 3

dtype:float64

绘图

绘图文档.

In [130]: ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))In [131]: ts = ts.cumsum()In [132]: ts.plot()Out[132]: <matplotlib.axes._subplots.AxesSubplot at 0xb02091ac>

In[130]:ts =pd.Series(np.random.randn(1000),index=pd.date_range('1/1/2000',periods=1000))

In [131]:ts =ts.cumsum()

In[132]:ts.plot()

Out[132]:<matplotlib.axes._subplots.AxesSubplotat 0xb02091ac>

在数据桢中,可以很方便的绘制带标签列:

In [133]: df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,   .....:                   columns=['A', 'B', 'C', 'D'])   .....: In [134]: df = df.cumsum()In [135]: plt.figure(); df.plot(); plt.legend(loc='best')Out[135]: <matplotlib.legend.Legend at 0xb01c9cac>

In[133]:df =pd.DataFrame(np.random.randn(1000,4),index=ts.index,

.....: columns=['A','B','C','D'])

.....:

In [134]:df =df.cumsum()

In[135]:plt.figure();df.plot();plt.legend(loc='best')

Out[135]:<matplotlib.legend.Legendat 0xb01c9cac>

获取数据输入/输出

CSV

写入csv文件

In [136]: df.to_csv('foo.csv')

In[136]:df.to_csv('foo.csv')

读取csv文件

In [137]: pd.read_csv('foo.csv')Out[137]:      Unnamed: 0          A          B         C          D0    2000-01-01   0.266457  -0.399641 -0.219582   1.1868601    2000-01-02  -1.170732  -0.345873  1.653061  -0.2829532    2000-01-03  -1.734933   0.530468  2.060811  -0.5155363    2000-01-04  -1.555121   1.452620  0.239859  -1.1568964    2000-01-05   0.578117   0.511371  0.103552  -2.4282025    2000-01-06   0.478344   0.449933 -0.741620  -1.9624096    2000-01-07   1.235339  -0.091757 -1.543861  -1.084753..          ...        ...        ...       ...        ...993  2002-09-20 -10.628548  -9.153563 -7.883146  28.313940994  2002-09-21 -10.390377  -8.727491 -6.399645  30.914107995  2002-09-22  -8.985362  -8.485624 -4.669462  31.367740996  2002-09-23  -9.558560  -8.781216 -4.499815  30.518439997  2002-09-24  -9.902058  -9.340490 -4.386639  30.105593998  2002-09-25 -10.216020  -9.480682 -3.933802  29.758560999  2002-09-26 -11.856774 -10.671012 -3.216025  29.369368[1000 rows x 5 columns]

In[137]:pd.read_csv('foo.csv')

Out[137]:

Unnamed:0 A B C D

0 2000-01-01 0.266457 -0.399641-0.219582 1.186860

1 2000-01-02 -1.170732 -0.345873 1.653061 -0.282953

2 2000-01-03 -1.734933 0.530468 2.060811 -0.515536

3 2000-01-04 -1.555121 1.452620 0.239859 -1.156896

4 2000-01-05 0.578117 0.511371 0.103552 -2.428202

5 2000-01-06 0.478344 0.449933-0.741620 -1.962409

6 2000-01-07 1.235339 -0.091757-1.543861 -1.084753

.. ... ... ... ... ...

993 2002-09-20-10.628548 -9.153563-7.883146 28.313940

994 2002-09-21-10.390377 -8.727491-6.399645 30.914107

995 2002-09-22 -8.985362 -8.485624-4.669462 31.367740

996 2002-09-23 -9.558560 -8.781216-4.499815 30.518439

997 2002-09-24 -9.902058 -9.340490-4.386639 30.105593

998 2002-09-25-10.216020 -9.480682-3.933802 29.758560

999 2002-09-26-11.856774-10.671012-3.216025 29.369368

[1000rows x5 columns]

HDF5

读写HDF存储

写入HDF5存储

In [138]: df.to_hdf('foo.h5','df')

In[138]:df.to_hdf('foo.h5','df')

读取HDF5存储

In [139]: pd.read_hdf('foo.h5','df')Out[139]:                     A          B         C          D2000-01-01   0.266457  -0.399641 -0.219582   1.1868602000-01-02  -1.170732  -0.345873  1.653061  -0.2829532000-01-03  -1.734933   0.530468  2.060811  -0.5155362000-01-04  -1.555121   1.452620  0.239859  -1.1568962000-01-05   0.578117   0.511371  0.103552  -2.4282022000-01-06   0.478344   0.449933 -0.741620  -1.9624092000-01-07   1.235339  -0.091757 -1.543861  -1.084753...               ...        ...       ...        ...2002-09-20 -10.628548  -9.153563 -7.883146  28.3139402002-09-21 -10.390377  -8.727491 -6.399645  30.9141072002-09-22  -8.985362  -8.485624 -4.669462  31.3677402002-09-23  -9.558560  -8.781216 -4.499815  30.5184392002-09-24  -9.902058  -9.340490 -4.386639  30.1055932002-09-25 -10.216020  -9.480682 -3.933802  29.7585602002-09-26 -11.856774 -10.671012 -3.216025  29.369368[1000 rows x 4 columns]

In[139]:pd.read_hdf('foo.h5','df')

Out[139]:

A B C D

2000-01-01 0.266457 -0.399641-0.219582 1.186860

2000-01-02 -1.170732 -0.345873 1.653061 -0.282953

2000-01-03 -1.734933 0.530468 2.060811 -0.515536

2000-01-04 -1.555121 1.452620 0.239859 -1.156896

2000-01-05 0.578117 0.511371 0.103552 -2.428202

2000-01-06 0.478344 0.449933-0.741620 -1.962409

2000-01-07 1.235339 -0.091757-1.543861 -1.084753

... ... ... ... ...

2002-09-20-10.628548 -9.153563-7.883146 28.313940

2002-09-21-10.390377 -8.727491-6.399645 30.914107

2002-09-22 -8.985362 -8.485624-4.669462 31.367740

2002-09-23 -9.558560 -8.781216-4.499815 30.518439

2002-09-24 -9.902058 -9.340490-4.386639 30.105593

2002-09-25-10.216020 -9.480682-3.933802 29.758560

2002-09-26-11.856774-10.671012-3.216025 29.369368

[1000rows x4 columns]

Excel

读写MS Excel

写入excel文件

In [140]: df.to_excel('foo.xlsx', sheet_name='Sheet1')

In[140]:df.to_excel('foo.xlsx',sheet_name='Sheet1')

读取excel文件

In [141]: pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])Out[141]:                     A          B         C          D2000-01-01   0.266457  -0.399641 -0.219582   1.1868602000-01-02  -1.170732  -0.345873  1.653061  -0.2829532000-01-03  -1.734933   0.530468  2.060811  -0.5155362000-01-04  -1.555121   1.452620  0.239859  -1.1568962000-01-05   0.578117   0.511371  0.103552  -2.4282022000-01-06   0.478344   0.449933 -0.741620  -1.9624092000-01-07   1.235339  -0.091757 -1.543861  -1.084753...               ...        ...       ...        ...2002-09-20 -10.628548  -9.153563 -7.883146  28.3139402002-09-21 -10.390377  -8.727491 -6.399645  30.9141072002-09-22  -8.985362  -8.485624 -4.669462  31.3677402002-09-23  -9.558560  -8.781216 -4.499815  30.5184392002-09-24  -9.902058  -9.340490 -4.386639  30.1055932002-09-25 -10.216020  -9.480682 -3.933802  29.7585602002-09-26 -11.856774 -10.671012 -3.216025  29.369368[1000 rows x 4 columns]

In[141]:pd.read_excel('foo.xlsx','Sheet1',index_col=None,na_values=['NA'])

Out[141]:

A B C D

2000-01-01 0.266457 -0.399641-0.219582 1.186860

2000-01-02 -1.170732 -0.345873 1.653061 -0.282953

2000-01-03 -1.734933 0.530468 2.060811 -0.515536

2000-01-04 -1.555121 1.452620 0.239859 -1.156896

2000-01-05 0.578117 0.511371 0.103552 -2.428202

2000-01-06 0.478344 0.449933-0.741620 -1.962409

2000-01-07 1.235339 -0.091757-1.543861 -1.084753

... ... ... ... ...

2002-09-20-10.628548 -9.153563-7.883146 28.313940

2002-09-21-10.390377 -8.727491-6.399645 30.914107

2002-09-22 -8.985362 -8.485624-4.669462 31.367740

2002-09-23 -9.558560 -8.781216-4.499815 30.518439

2002-09-24 -9.902058 -9.340490-4.386639 30.105593

2002-09-25-10.216020 -9.480682-3.933802 29.758560

2002-09-26-11.856774-10.671012-3.216025 29.369368

[1000rows x4 columns]

陷阱

如果尝试这样操作可能会看到像这样的异常:

>>> if pd.Series([False, True, False]):    print("I was true")Traceback    ...ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().

>>>if pd.Series([False,True,False]):

print("I was true")

Traceback

...

ValueError:The truth valueof an arrayis ambiguous.Use a.empty,a.any()or a.all().

查看对照获取解释和怎么做的帮助

也可以查看陷阱.

阅读全文

0 0