pandas in Ten Minutes (Summary)

Source: Internet  Editor: 程序博客网  Time: 2024/06/05 11:23

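The original article is on the official pandas website, and the link there always points to the latest version. For anyone who is not a native English speaker, I doubt it can really be read in ten minutes. Plenty of translations already exist online, so here I have only reorganized the article's structure and listed the points it covers, as something to memorize.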

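- Object creation
  - Creating a Series
  - Creating a time-based index
  - Creating a DataFrame
- Viewing data
  - Head and tail
  - Index, columns, values
  - Summary statistics
  - Transpose
  - Sorting by index
  - Sorting by column values
- Selection
  - Selecting columns
  - Selecting rows
  - Selection by label: loc
  - Selection by position: iloc
  - Boolean indexing
- Setting values
- Handling missing data
- Operations
  - Statistics
  - Applying functions
  - String methods
- merge
  - concat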
  - join
  - Append
  - Grouping
- Reshaping
  - stack
  - Pivot Tables
- TimeSeries
- Categoricals
- Plotting

Object creation

Creating a Series

s = pd.Series([1,3,5,np.nan,6,8])

Creating a time-based index

dates = pd.date_range('20130101', periods=6)

Creating a DataFrame

df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
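The one-liners in these notes assume the usual `np` and `pd` aliases. A minimal, self-contained sketch of the three creation steps might look like this; the seed value is an arbitrary choice added here only to make the random data repeatable.

```python
import numpy as np
import pandas as pd

# A Series gets a default integer index; np.nan forces a float dtype.
s = pd.Series([1, 3, 5, np.nan, 6, 8])

# Six consecutive days starting on 2013-01-01.
dates = pd.date_range("20130101", periods=6)

# Seed is an assumption for reproducibility, not part of the original notes.
np.random.seed(0)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

print(s.dtype)
print(df.shape)
```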

Viewing data

Head and tail

df.head()
df.tail(5)

Index, columns, values

df.index
df.columns
df.values

Summary statistics

df.describe()

Transpose

df.T

Sorting by index

df.sort_index(axis=1, ascending=False)

Sorting by column values

df.sort_values(by='B')
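To make the difference between the two sorts concrete, here is a small sketch on a hand-made frame (the data is invented for illustration). `sort_index(axis=1)` reorders the column labels, while `sort_values(by=...)` reorders the rows; both return a new object and leave the original untouched.

```python
import pandas as pd

df = pd.DataFrame({"A": [3, 1, 2], "B": [9, 7, 8]}, index=["x", "y", "z"])

# Sort the column labels in descending order: B comes before A.
by_cols = df.sort_index(axis=1, ascending=False)

# Sort the rows by the values in column B (ascending by default).
by_b = df.sort_values(by="B")

first_two = df.head(2)  # first two rows
last_one = df.tail(1)   # last row

print(list(by_cols.columns))
print(list(by_b.index))
```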

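Selection

Selecting columns with []

df['A']
df[['A','B']]

Selecting rows with []

df[0:3]
df['20130102':'20130104']

A slice such as '20130102':'20130104' is converted automatically to match the DatetimeIndex.
A single value such as '20130102' is instead looked up as a column name.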
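The following sketch, using made-up integer data, shows the two slicing forms of `[]` side by side. Note that the string slice is label-based and includes both endpoints, whereas the integer slice is positional and excludes its end.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# A list of names selects several columns at once.
cols = df[["A", "B"]]

# Integer slice: positions 0, 1, 2.
first3 = df[0:3]

# String slice: parsed as dates, both 2013-01-02 and 2013-01-04 are included.
rows = df["20130102":"20130104"]

print(rows.index[0], rows.index[-1])
```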

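Selection by label: loc

df.loc[dates[0]]
df.loc[:,['A','B']]
df.loc['20130102':'20130104',['A','B']]
df.loc[dates[0],'A']
df.at[dates[0],'A']

Selecting by label means selecting by the actual index and column values rather than by position (semantics: the '2013-01-02' row of column 'A').
For fetching a single value, at is faster than loc.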
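As a quick check of the label semantics, the sketch below (with invented data) selects a block by label and then reads the same scalar through both `loc` and `at`.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# Rows 2013-01-02 through 2013-01-04 (inclusive), columns A and B.
sub = df.loc["20130102":"20130104", ["A", "B"]]

# Both calls return the scalar at row dates[0], column 'A'.
v_loc = df.loc[dates[0], "A"]
v_at = df.at[dates[0], "A"]

print(sub.shape, v_loc, v_at)
```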

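Selection by position: iloc

df.iloc[3]
df.iloc[3:5,0:2]
df.iloc[[1,2,4],[0,2]]
df.iloc[1:3,:]
df.iloc[:,1:3]
df.iloc[1,1]
df.iat[1,1]

Positions are always integers, and iloc[] takes integer positions rather than labels (semantics: the 3rd row of the 4th column).
For fetching a single value, iat is faster than iloc.
If the index is the default integer index (0, 1, 2, ...), loc and iloc return the same row for a single integer, but iloc is faster because no key matching is needed (semantics: the 3rd row of the 4th column versus the row labeled '3' of the column labeled '4'). Slices still differ: a loc slice includes its end label, while an iloc slice excludes its end position.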
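The label-versus-position distinction is easiest to see when the integer labels are not in positional order. The frame below is invented for this purpose: its index is deliberately reversed.

```python
import pandas as pd

# Labels run 3, 2, 1, 0 while positions run 0, 1, 2, 3.
df = pd.DataFrame({"v": [10, 20, 30, 40]}, index=[3, 2, 1, 0])

by_label = df.loc[3, "v"]     # row whose label is 3 -> 10
by_position = df.iloc[3, 0]   # row at position 3 -> 40

# With the default index the single-row lookups agree, but slices do not.
d = pd.DataFrame({"v": [10, 20, 30, 40]})
loc_slice = d.loc[1:2]    # labels 1 and 2, end included
iloc_slice = d.iloc[1:2]  # position 1 only, end excluded

print(by_label, by_position, len(loc_slice), len(iloc_slice))
```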

Boolean indexing

df[df.A > 0] # filter rows
df[df > 0] # values that fail the condition become NaN
df2[df2['E'].isin(['two','four'])] # assumes df2 has a string column 'E'
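The three forms behave quite differently, which a small invented frame can illustrate. Here `df2` is simply a copy of `df` with an extra string column `E`; that construction is an assumption made so the `isin` example has something to work on.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, -2.0, 3.0], "B": [-1.0, 5.0, -6.0]})

# A boolean Series keeps only the rows where it is True.
rows = df[df.A > 0]

# A boolean DataFrame keeps the shape and puts NaN where the test fails.
masked = df[df > 0]

# isin() builds a boolean Series from membership in a list.
df2 = df.copy()
df2["E"] = ["one", "two", "four"]
picked = df2[df2["E"].isin(["two", "four"])]

print(list(rows.index), list(picked.index))
```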

Setting values

df['F'] = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6)) # aligned on the index
df.at[dates[0],'A'] = 0
df.iat[0,1] = 0
df.loc[:,'D'] = np.array([5] * len(df))
df[df > 0] = -df # set every value greater than 0 to its negative
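One detail worth seeing in action is that assigning a Series as a new column aligns on the index, so rows without a matching label receive NaN. The sketch below uses a three-row frame invented for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=3)
df = pd.DataFrame({"A": [1.0, -2.0, 3.0]}, index=dates)

# The new Series starts one day later, so 2013-01-01 gets NaN
# and the value for 2013-01-04 is dropped.
df["F"] = pd.Series([10, 20, 30], index=pd.date_range("20130102", periods=3))

# Set a single cell by label.
df.at[dates[0], "A"] = 0

# Flip the sign of every positive value; NaN and non-positive cells are untouched.
df[df > 0] = -df

print(df)
```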

Handling missing data

df.dropna(how='any')
df.fillna(value=5)
pd.isnull(df)
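A short sketch of the three calls on an invented frame with two missing cells. Each call returns a new object, so the original frame keeps its NaN values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [4.0, 5.0, np.nan]})

# Drop every row that contains at least one NaN.
dropped = df.dropna(how="any")

# Replace every NaN with 5.
filled = df.fillna(value=5)

# Boolean frame, True where a value is missing.
mask = pd.isnull(df)

print(list(dropped.index), int(mask.sum().sum()))
```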

Operations

Statistics

df.mean()
df.mean(1) # mean along axis 1, one value per row
s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)
df.sub(s, axis='index')
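The `shift` and `sub` pair is the least obvious part of this group, so here is a sketch with small invented numbers. `shift(2)` moves the values down two rows and fills the top with NaN; `sub(..., axis='index')` then subtracts the Series from every column, matching on row labels.

```python
import pandas as pd

dates = pd.date_range("20130101", periods=4)
df = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0, 4.0], "B": [10.0, 20.0, 30.0, 40.0]}, index=dates
)

col_means = df.mean()        # one value per column
row_means = df.mean(axis=1)  # one value per row

# After the shift the Series holds NaN, NaN, 1, 3.
s = pd.Series([1, 3, 5, 7], index=dates).shift(2)

# Row-wise subtraction; rows where s is NaN become NaN in every column.
diff = df.sub(s, axis="index")

print(diff)
```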

Applying functions

By default the function is applied along axis=0, that is, to each whole column; axis can be set to change this.

df.apply(np.cumsum)
df.apply(lambda x: x.max()-x.min())
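The effect of the `axis` argument can be sketched on a tiny invented frame. The cumulative sum is written here with a lambda calling `Series.cumsum`, which is one way to express the same per-column running total.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 30, 20]})

# Default axis=0: the function receives each column as a Series.
col_range = df.apply(lambda x: x.max() - x.min())

# axis=1: the function receives each row as a Series.
row_range = df.apply(lambda x: x.max() - x.min(), axis=1)

# Running total down each column.
running = df.apply(lambda x: x.cumsum())

print(col_range.to_dict(), list(row_range))
```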

String methods

These are reached through the .str accessor of a Series (or Index); a DataFrame has no such accessor.

s.str.lower()
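A minimal sketch with invented strings, showing that the `.str` methods work element by element and pass missing values through unchanged.

```python
import numpy as np
import pandas as pd

s = pd.Series(["A", "Bb", np.nan, "CCC"])

# Lower-case every string; the missing entry stays missing.
lowered = s.str.lower()

print(lowered)
```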

merge

concat

pieces = [df[:3], df[3:7], df[7:]]
pd.concat(pieces)
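`concat` is a top-level pandas function rather than a DataFrame method. As a sketch, cutting a ten-row frame (invented data) into three pieces and concatenating them restores the original.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(10, 2), columns=["A", "B"])

# Three row slices that together cover the whole frame.
pieces = [df[:3], df[3:7], df[7:]]

# Stack the pieces vertically, keeping their original index labels.
rebuilt = pd.concat(pieces)

print(rebuilt.equals(df))
```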

join

pd.merge(left, right, on='key')
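The names `left` and `right` are not defined in these notes, so the sketch below invents two small frames sharing a `key` column. It also shows what happens when the key is duplicated on both sides: every left row is paired with every matching right row.

```python
import pandas as pd

# Hypothetical inputs for illustration.
left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})

# Inner join on the shared column.
merged = pd.merge(left, right, on="key")

# Duplicate keys: 2 left rows x 2 right rows -> 4 result rows.
left_dup = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
right_dup = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})
merged_dup = pd.merge(left_dup, right_dup, on="key")

print(merged)
print(len(merged_dup))
```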

Append

s = df.iloc[3]
pd.concat([df, s.to_frame().T], ignore_index=True) # DataFrame.append was removed in pandas 2.0
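`DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0, so on current versions the same effect is usually obtained with `pd.concat`. The sketch below, on invented data, turns the row Series into a one-row frame and concatenates it.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]})

# A single row comes back as a Series indexed by column name.
s = df.iloc[3]

# to_frame().T turns that Series into a one-row DataFrame;
# ignore_index=True renumbers the result 0..n-1.
appended = pd.concat([df, s.to_frame().T], ignore_index=True)

print(appended)
```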

Grouping

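groupby() does not return a DataFrame but a DataFrameGroupBy object; further steps are needed to get the values you want:
- Splitting: divide the data into groups according to some criteria
- Applying: apply a function to each group independently
- Combining: merge the results from each group into a data structure

df.groupby('A').sum()
df.groupby(['A','B']).sum()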
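The split, apply, combine steps above can be sketched on a small invented frame. The numeric column `C` is selected explicitly before summing so that only numbers are aggregated.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "B": ["one", "one", "two", "one"],
        "C": [1, 2, 3, 4],
    }
)

# Splitting: nothing is computed yet, this is only a grouping object.
grouped = df.groupby("A")

# Applying and combining: sum C within each group, collected into a Series.
by_a = grouped["C"].sum()

# Grouping by two columns gives a result with a two-level index.
by_ab = df.groupby(["A", "B"])["C"].sum()

print(type(grouped).__name__)
print(by_a.to_dict())
```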

Reshaping

stack

In [95]: tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
   ....:                      'foo', 'foo', 'qux', 'qux'],
   ....:                     ['one', 'two', 'one', 'two',
   ....:                      'one', 'two', 'one', 'two']]))
   ....:

In [96]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

In [97]: df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])

In [98]: df2 = df[:4]

In [99]: df2
Out[99]:
                     A         B
first second
bar   one     0.029399 -0.542108
      two     0.282696 -0.087302
baz   one    -1.575170  1.771208
      two     0.816482  1.100230

In [100]: stacked = df2.stack()

In [101]: stacked
Out[101]:
first  second
bar    one     A    0.029399
               B   -0.542108
       two     A    0.282696
               B   -0.087302
baz    one     A   -1.575170
               B    1.771208
       two     A    0.816482
               B    1.100230
dtype: float64

In [102]: stacked.unstack()
Out[102]:
                     A         B
first second
bar   one     0.029399 -0.542108
      two     0.282696 -0.087302
baz   one    -1.575170  1.771208
      two     0.816482  1.100230

In [103]: stacked.unstack(1)
Out[103]:
second        one       two
first
bar   A  0.029399  0.282696
      B -0.542108 -0.087302
baz   A -1.575170  0.816482
      B  1.771208  1.100230

In [104]: stacked.unstack(0)
Out[104]:
first          bar       baz
second
one    A  0.029399 -1.575170
       B -0.542108  1.771208
two    A  0.282696  0.816482
       B -0.087302  1.100230
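For a version of the session that can be checked by hand, the sketch below replaces the random numbers with the values 0 to 7. `stack()` moves the column labels into a new innermost index level, and `unstack()` moves an index level back out to the columns (the innermost one by default).

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("bar", "one"), ("bar", "two"), ("baz", "one"), ("baz", "two")],
    names=["first", "second"],
)
df2 = pd.DataFrame(np.arange(8.0).reshape(4, 2), index=index, columns=["A", "B"])

# 4 rows x 2 columns become a Series of 8 values with a three-level index.
stacked = df2.stack()

# Undo the stack: the innermost level (A/B) returns to the columns.
restored = stacked.unstack()

# Move a different level to the columns instead.
by_second = stacked.unstack(1)  # columns: one, two
by_first = stacked.unstack(0)   # columns: bar, baz

print(len(stacked), list(by_second.columns), list(by_first.columns))
```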

Pivot Tables

In [105]: df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
   .....:                    'B' : ['A', 'B', 'C'] * 4,
   .....:                    'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
   .....:                    'D' : np.random.randn(12),
   .....:                    'E' : np.random.randn(12)})
   .....:

In [106]: df
Out[106]:
        A  B    C         D         E
0     one  A  foo  1.418757 -0.179666
1     one  B  foo -1.879024  1.291836
2     two  C  foo  0.536826 -0.009614
3   three  A  bar  1.006160  0.392149
4     one  B  bar -0.029716  0.264599
5     one  C  bar -1.146178 -0.057409
6     two  A  foo  0.100900 -1.425638
7   three  B  foo -1.035018  1.024098
8     one  C  foo  0.314665 -0.106062
9     one  A  bar -0.773723  1.824375
10    two  B  bar -1.170653  0.595974
11  three  C  bar  0.648740  1.167115

In [107]: pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
Out[107]:
C             bar       foo
A     B
one   A -0.773723  1.418757
      B -0.029716 -1.879024
      C -1.146178  0.314665
three A  1.006160       NaN
      B       NaN -1.035018
      C  0.648740       NaN
two   A       NaN  0.100900
      B -1.170653       NaN
      C       NaN  0.536826
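A smaller pivot with invented, non-random values makes the behavior easier to verify. Combinations that never occur in the data come out as NaN, and when several rows fall into the same cell they are aggregated with the mean, which is the default `aggfunc`.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["one", "one", "two", "two", "two"],
        "B": ["x", "y", "x", "y", "y"],
        "C": ["foo", "bar", "foo", "foo", "foo"],
        "D": [1.0, 2.0, 3.0, 4.0, 6.0],
    }
)

# Rows are (A, B) pairs, columns are the distinct values of C,
# and each cell holds the mean of D for that combination.
table = pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"])

print(table)
```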

TimeSeries

Time Series section

In [108]: rng = pd.date_range('1/1/2012', periods=100, freq='S')

In [109]: ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

In [110]: ts.resample('5Min').sum()
Out[110]:
2012-01-01    25083
Freq: 5T, dtype: int64
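Recent pandas releases prefer the lowercase frequency aliases `'s'` and `'min'` over `'S'` and `'T'`, so the sketch below uses those spellings. The random integers are replaced with 0 to 99 so the result can be checked: all 100 seconds fall inside the first five-minute bin.

```python
import numpy as np
import pandas as pd

# 100 timestamps, one per second, starting at midnight.
rng = pd.date_range("1/1/2012", periods=100, freq="s")
ts = pd.Series(np.arange(100), index=rng)

# Downsample from 1-second data to 5-minute bins, summing within each bin.
five_min = ts.resample("5min").sum()

print(five_min)
```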

Categoricals

categorical introduction

Plotting

Plotting docs.
