[pandas] Study Notes on Data Types

Source: Internet. Published by: picasa3 for mac. Editor: 程序博客网. Time: 2024/05/16 15:02

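After getting comfortable with NumPy, the next step is pandas. pandas is built on top of NumPy and is both powerful and convenient. My study material is the documentation on the official pandas site, and this post records my notes.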

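pandas data structures

pandas provides Series (one-dimensional), DataFrame (two-dimensional), Panel (three-dimensional), Panel4D (four-dimensional), PanelND (n-dimensional) and other data structures. Series and DataFrame are by far the most widely used, so these notes focus on those two. (Panel and its higher-dimensional variants have since been removed from pandas.)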

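Series

A Series is a one-dimensional labelled array. It can hold any data type: integers, strings, floating-point numbers, Python objects and so on. Elements of a Series can be located by their labels.

Creating a Series

s = pd.Series(data, index=index)

data can be:

a Python dict
a NumPy ndarray
a scalar value

Creating from an ndarray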

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: s = pd.Series(np.random.randn(5), index=list('ABCDE'))

In [4]: s
Out[4]:
A   -1.130657
B   -1.539251
C    1.503126
D    1.266908
E    0.335561
dtype: float64

Creating from a dict

In [19]: d = {'a': 1, 'b': 2, 'c': 3}

In [20]: pd.Series(d)
Out[20]:
a    1
b    2
c    3
dtype: int64

In [21]: pd.Series(d, index=['b', 'c', 'd', 'a'])
Out[21]:
b     2
c     3
d   NaN
a     1
dtype: float64

Creating from a scalar

In [22]: pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
Out[22]:
a    5
b    5
c    5
d    5
e    5
dtype: float64
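The three creation paths above can be put side by side with fixed data, so the result does not depend on random numbers. This is only a small sketch; the variable names `from_array`, `from_dict` and `from_scalar` are made up for illustration.

```python
import numpy as np
import pandas as pd

# From an ndarray: the index must have the same length as the data.
from_array = pd.Series(np.arange(3, dtype="float64"), index=["a", "b", "c"])

# From a dict: keys become labels. A label requested in `index` but
# missing from the dict ("z" here) is filled with NaN, which forces
# the dtype to float64.
from_dict = pd.Series({"a": 1, "b": 2}, index=["b", "a", "z"])

# From a scalar: the value is repeated once for every label.
from_scalar = pd.Series(5.0, index=["x", "y"])

print(from_array)
print(from_dict)
print(from_scalar)
```

Note how the dict case follows the order of `index`, not the order of the dict keys.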

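Series operations

A Series behaves like an ndarray

Note: the s used from here on has the labels 'a' to 'e' and comes from a different session than the one above. In recent pandas versions, integer positions on a labelled Series are better written with .iloc (for example s.iloc[0] and s.iloc[[4, 2, 1]]).

In [24]: s[0]
Out[24]: -0.06036422206791571

In [25]: s[:3]
Out[25]:
a   -0.060364
b    0.315560
c   -0.520548
dtype: float64

In [26]: s[s > s.median()]
Out[26]:
a   -0.060364
b    0.315560
dtype: float64

In [27]: s[[4, 2, 1]]
Out[27]:
e   -1.900474
c   -0.520548
b    0.315560
dtype: float64

In [28]: np.exp(s)
Out[28]:
a    0.941422
b    1.371027
c    0.594195
d    0.912396
e    0.149498
dtype: float64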
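The ndarray-style operations above can be reproduced with fixed values. The sketch below uses `.iloc` for positional access, which is my own choice rather than what the original session typed; the data and variable names are invented for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=list("abcde"))

first = s.iloc[0]                  # positional scalar access
head = s.iloc[:3]                  # positional slice, labels are kept
above_median = s[s > s.median()]   # boolean mask, like an ndarray
picked = s.iloc[[4, 2, 1]]         # list of positions, in that order
squared = np.square(s)             # NumPy ufuncs return a Series

print(head)
print(above_median)
print(picked)
print(squared)
```

In every case the labels travel with the values, which is the main difference from a bare ndarray.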

A Series behaves like a dictionary

In [30]: s['a']
Out[30]: -0.06036422206791571

In [31]: 'e' in s
Out[31]: True

In [32]: s.get('f')

In [33]: s.get('f', np.nan)
Out[33]: nan

In [35]: s['f'] = 3.

In [36]: s
Out[36]:
a   -0.060364
b    0.315560
c   -0.520548
d   -0.091681
e   -1.900474
f    3.000000
dtype: float64

If the label does not exist, get returns None, so nothing is displayed. We can also pass a default value such as nan to be returned for a missing label.
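The difference between `get` and square-bracket access for a missing label is easy to check. A minimal sketch with made-up data:

```python
import pandas as pd

s = pd.Series([1.0, 2.0], index=["a", "b"])

missing = s.get("f")         # no such label and no default: None
fallback = s.get("f", 0.0)   # no such label: the default is returned

# Square brackets raise KeyError for a missing label, like a dict.
try:
    s["f"]
    raised = False
except KeyError:
    raised = True

# Assigning to a new label appends it, again like a dict.
s["f"] = 3.0

print(missing, fallback, raised)
print(s)
```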

Arithmetic

A Series supports +, -, *, /, exp and other NumPy operations.

In [6]: s + s
Out[6]:
a    0.648688
b   -2.729308
c   -0.919524
d    0.876880
e    5.863378
f    6.000000
dtype: float64

In [7]: s * 2
Out[7]:
a    0.648688
b   -2.729308
c   -0.919524
d    0.876880
e    5.863378
f    6.000000
dtype: float64

In [8]: np.exp(s)
Out[8]:
a     1.383123
b     0.255469
c     0.631434
d     1.550287
e    18.759286
f    20.085537
dtype: float64

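When two Series with different indexes are combined, the result is aligned on the labels, and any label present in only one of them gets nan:

In [9]: s[1:] + s[:-1]
Out[9]:
a         NaN
b   -2.729308
c   -0.919524
d    0.876880
e    5.863378
f         NaN
dtype: float64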
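Label alignment, and two common ways of dealing with the resulting NaN values, can be sketched as follows. The data is invented, and `dropna` and `add(..., fill_value=0)` are options I am adding here, not something the original session used.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0], index=list("abcd"))

left = s.iloc[1:]     # labels b, c, d
right = s.iloc[:-1]   # labels a, b, c

# Addition aligns on labels: a and d exist on one side only, so NaN.
aligned = left + right

# Option 1: keep only the labels both operands share.
overlap = aligned.dropna()

# Option 2: treat a label missing on one side as 0 before adding.
filled = left.add(right, fill_value=0)

print(aligned)
print(overlap)
print(filled)
```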

A Series can also be given a name:

In [10]: s = pd.Series(np.random.randn(5), name='something')

In [11]: s
Out[11]:
0    1.447208
1   -0.546760
2    0.858622
3    0.648803
4   -0.667612
Name: something, dtype: float64

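DataFrame

A DataFrame is a two-dimensional labelled data structure. Data can be located by its row and column labels, which plain NumPy arrays do not offer.

Creating a DataFrame

The data can come from many kinds of input:

a dict of one-dimensional ndarrays, lists, dicts, or Series
a two-dimensional ndarray
a Series
external data such as csv or excel files
another DataFrame
and so on

I only show how to create one from a dict of Series. The other methods are much the same; see the documentation.

Creating from a dict of Series

In [17]: d = {'one': pd.Series([1, 2, 3], index=list('abc')), 'two': pd.Series([1, 2, 3, 4], index=list('abcd'))}

In [18]: df = pd.DataFrame(d)

In [19]: df
Out[19]:
   one  two
a    1    1
b    2    2
c    3    3
d  NaN    4

In [20]: df.index
Out[20]: Index(['a', 'b', 'c', 'd'], dtype='object')

In [21]: df.columns
Out[21]: Index(['one', 'two'], dtype='object')

In [28]: df.index = ['A', 'B', 'C', 'D']  # the index can be replaced

In [29]: df
Out[29]:
   one  two
A    1    1
B    2    2
C    3    3
D  NaN    4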
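For comparison, here is a small sketch of a few of the other input types from the list above. The data and variable names are invented for illustration; the official documentation covers the full set of constructors.

```python
import numpy as np
import pandas as pd

# Dict of lists: every list must have the same length.
from_lists = pd.DataFrame(
    {"one": [1, 2, 3], "two": [4, 5, 6]}, index=list("abc")
)

# List of dicts: each dict is one row; missing keys become NaN.
from_records = pd.DataFrame([{"one": 1, "two": 4}, {"one": 2}])

# Two-dimensional ndarray: row and column labels are supplied separately.
from_array = pd.DataFrame(np.zeros((2, 3)), columns=list("ABC"))

print(from_lists)
print(from_records)
print(from_array)
```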

Selection and arithmetic

A DataFrame can be handled much like a Series. Reading, setting, and deleting columns works like the corresponding dict operations.

In [22]: df['one']  # column access: select by column label
Out[22]:
a     1
b     2
c     3
d   NaN
Name: one, dtype: float64

In [23]: df['three'] = df['one'] + df['two']  # create a new column, just like a dict

In [24]: df['flag'] = df['one'] > 2

In [25]: df
Out[25]:
   one  two  three   flag
a    1    1      2  False
b    2    2      4  False
c    3    3      6   True
d  NaN    4    NaN  False

In [26]: del df['two']

In [27]: three = df.pop('three')  # pop column three out of df into the variable three

In [28]: df
Out[28]:
   one   flag
a    1  False
b    2  False
c    3   True
d  NaN  False

In [29]: three
Out[29]:
a     2
b     4
c     6
d   NaN
Name: three, dtype: float64

In [30]: df['foo'] = 'bar'

In [31]: df
Out[31]:
   one   flag  foo
a    1  False  bar
b    2  False  bar
c    3   True  bar
d  NaN  False  bar

In [32]: df['one_trunc'] = df['one'][:2]  # rows without a value are filled with nan

In [33]: df
Out[33]:
   one   flag  foo  one_trunc
a    1  False  bar          1
b    2  False  bar          2
c    3   True  bar        NaN
d  NaN  False  bar        NaN

In [34]: df.insert(1, 'bar', df['one'])  # choose where the new column goes: insert bar at column position 1

In [35]: df
Out[35]:
   one  bar   flag  foo  one_trunc
a    1    1  False  bar          1
b    2    2  False  bar          2
c    3    3   True  bar        NaN
d  NaN  NaN  False  bar        NaN

In [36]: df.assign(ratio = df['one'] / df['bar'])  # assign returns a new DataFrame; df itself is not modified
Out[36]:
   one  bar   flag  foo  one_trunc  ratio
a    1    1  False  bar          1      1
b    2    2  False  bar          2      1
c    3    3   True  bar        NaN      1
d  NaN  NaN  False  bar        NaN    NaN

In [37]: df
Out[37]:
   one  bar   flag  foo  one_trunc
a    1    1  False  bar          1
b    2    2  False  bar          2
c    3    3   True  bar        NaN
d  NaN  NaN  False  bar        NaN

In [38]: df.loc['b']  # loc selects a row by its label
Out[38]:
one              2
bar              2
flag         False
foo            bar
one_trunc        2
Name: b, dtype: object

In [39]: df.iloc[2]  # iloc selects by integer position: iloc[rows, columns]
Out[39]:
one             3
bar             3
flag         True
foo           bar
one_trunc     NaN
Name: c, dtype: object

In [40]: df.iloc[2, :-1]  # row at position 2, all columns except the last
Out[40]:
one             3
bar             3
flag         True
foo           bar
Name: c, dtype: object

In [44]: df = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))

In [46]: df2 = pd.DataFrame(np.random.randn(7, 3), columns=list('ABC'))

In [47]: df + df2  # adding two DataFrames: positions whose labels do not match get nan
Out[47]:
          A         B         C   D
0  1.239121  2.705995  1.365740 NaN
1  1.507655 -1.092202  0.083471 NaN
2 -0.485961 -0.131136 -1.677334 NaN
3 -0.858146  0.319006 -1.995003 NaN
4 -1.487327  2.030991 -0.565237 NaN
5  0.239241 -0.713864 -1.635968 NaN
6 -1.656484 -0.420657  0.125534 NaN
7       NaN       NaN       NaN NaN
8       NaN       NaN       NaN NaN
9       NaN       NaN       NaN NaN

In [48]: df - df.iloc[0]  # subtract a row from every row
Out[48]:
          A         B         C         D
0  0.000000  0.000000  0.000000  0.000000
1 -1.969524 -1.957223 -0.781471  0.278686
2 -2.215996 -0.172781 -1.736314 -1.050741
3 -2.264761 -1.402786 -2.713273 -0.247084
4 -1.157636 -1.445320 -1.985973  1.485799
5 -1.689059 -1.160161 -1.453136  0.588097
6 -3.359694 -1.415710 -0.493772 -1.002543
7 -0.889769  0.220577 -0.023013  0.024337
8 -2.223337 -0.068570 -1.117682 -0.875048
9 -0.678439 -1.591324  0.107048 -0.880545

In [49]: df - df['A']  # subtracting a column behaves differently: the Series index is matched against the column labels
Out[49]:
    A   B   C   D   0   1   2   3   4   5   6   7   8   9
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

In [52]: index = pd.date_range('2000-01-01', periods=8)  # assumed definition; this line is missing from the original session

In [53]: df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=list('ABC'))

In [54]: df
Out[54]:
                   A         B         C
2000-01-01  0.581713  0.229262 -0.174359
2000-01-02  1.355298 -0.901488  1.082112
2000-01-03 -0.963151  0.285010 -1.275164
2000-01-04 -0.104592  0.744454 -0.722504
2000-01-05 -0.794036  0.268566  1.721357
2000-01-06 -1.415143 -0.863292 -0.674650
2000-01-07  0.505573  0.451317 -0.390972
2000-01-08 -1.341107  0.549922  0.120314

In [59]: df.sub(df['A'], axis=0)  # use sub with axis=0 for a true column-wise subtraction
Out[59]:
            A         B         C
2000-01-01  0 -0.352452 -0.756072
2000-01-02  0 -2.256785 -0.273186
2000-01-03  0  1.248161 -0.312013
2000-01-04  0  0.849046 -0.617911
2000-01-05  0  1.062602  2.515393
2000-01-06  0  0.551851  0.740493
2000-01-07  0 -0.054256 -0.896545
2000-01-08  0  1.891028  1.461421

In [60]: df
Out[60]:
                   A         B         C
2000-01-01  0.581713  0.229262 -0.174359
2000-01-02  1.355298 -0.901488  1.082112
2000-01-03 -0.963151  0.285010 -1.275164
2000-01-04 -0.104592  0.744454 -0.722504
2000-01-05 -0.794036  0.268566  1.721357
2000-01-06 -1.415143 -0.863292 -0.674650
2000-01-07  0.505573  0.451317 -0.390972
2000-01-08 -1.341107  0.549922  0.120314

In [70]: df * 5 + 2  # arithmetic with scalars
Out[70]:
                   A         B          C
2000-01-01  4.908567  3.146308   1.128205
2000-01-02  8.776488 -2.507439   7.410558
2000-01-03 -2.815756  3.425049  -4.375820
2000-01-04  1.477038  5.722268  -1.612518
2000-01-05 -1.970178  3.342832  10.606786
2000-01-06 -5.075717 -2.316460  -1.373250
2000-01-07  4.527865  4.256584   0.045139
2000-01-08 -4.705535  4.749608   2.601572

In [71]: 1 / df
Out[71]:
                   A         B         C
2000-01-01  1.719060  4.361830 -5.735293
2000-01-02  0.737845 -1.109277  0.924119
2000-01-03 -1.038259  3.508651 -0.784213
2000-01-04 -9.560923  1.343267 -1.384076
2000-01-05 -1.259389  3.723473  0.580937
2000-01-06 -0.706642 -1.158357 -1.482250
2000-01-07  1.977954  2.215738 -2.557726
2000-01-08 -0.745653  1.818441  8.311556

In [74]: df1 = pd.DataFrame({'a': [1, 0, 1], 'b': [0, 1, 1]}, dtype=bool)

In [75]: df2 = pd.DataFrame({'a': [0, 1, 1], 'b': [1, 1, 0]}, dtype=bool)

In [82]: df1
Out[82]:
       a      b
0   True  False
1  False   True
2   True   True

In [83]: df2
Out[83]:
       a      b
0  False   True
1   True   True
2   True  False

### The following demonstrates Boolean operations

In [84]: df1 & df2
Out[84]:
       a      b
0  False  False
1  False   True
2   True  False

In [85]: df1 | df2
Out[85]:
      a     b
0  True  True
1  True  True
2  True  True

In [86]: df1 ^ df2
Out[86]:
       a      b
0   True   True
1   True  False
2  False   True

In [87]: ~df1  # logical NOT; the original session wrote -df1, which older pandas accepted for Boolean data
Out[87]:
       a      b
0  False   True
1   True  False
2  False  False

In [112]: df = pd.DataFrame({'foot1': np.random.randn(5), 'foot2': np.random.randn(5)})

In [113]: df
Out[113]:
      foot1     foot2
0  0.953419 -0.901983
1 -0.155681 -0.143213
2 -0.164418  1.519970
3  0.699752 -0.398224
4 -0.550058  2.115899

In [114]: df.T  # transpose
Out[114]:
              0         1         2         3         4
foot1  0.953419 -0.155681 -0.164418  0.699752 -0.550058
foot2 -0.901983 -0.143213  1.519970 -0.398224  2.115899

In [116]: df.T.dot(df)  # matrix multiplication
Out[116]:
          foot1     foot2
foot1  1.752495 -2.530108
foot2 -2.530108  7.780001

In [120]: np.exp(df)  # NumPy functions work as well
Out[120]:
      foot1     foot2
0  2.594566  0.405764
1  0.855832  0.866569
2  0.848387  4.572087
3  2.013253  0.671512
4  0.576916  8.297037
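The row-versus-column subtraction behaviour in the session is easier to follow on a tiny frame with fixed values. The sketch below uses invented data and variable names; it only illustrates the alignment rule and is not taken from the original session.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0], "B": [10.0, 20.0]}, index=["x", "y"])

# A row is a Series indexed by column labels, so it lines up with the
# columns and is subtracted from every row.
row_diff = df - df.iloc[0]

# A column is a Series indexed by row labels. With the plain operator
# those labels are matched against the column labels, nothing matches,
# and the result is all NaN with the union of both label sets as columns.
mismatched = df - df["A"]

# sub with axis=0 aligns the Series on the row index instead.
col_diff = df.sub(df["A"], axis=0)

print(row_diff)
print(mismatched)
print(col_diff)
```

The general rule is that a DataFrame combined with a Series through an operator aligns the Series index on the columns; the `axis` argument of `sub`, `add`, `mul` and `div` is the way to change that.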

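Summary

That was a brief introduction to the two main pandas data structures, Series and DataFrame. These notes covered how to create them and the common operations for arithmetic, selection, adding, deleting, and modifying data.
The data model is fairly easy to grasp. Compared with an ordinary array, a Series simply adds labels, so we can think of it as a "dictionary": the labels correspond to the keys and the values to the dictionary values. Likewise, a DataFrame adds row and column labels to an ordinary matrix. Each column is a Series, so a column label is needed to identify it. We can view a DataFrame as a "dictionary of dictionaries" in which the keys of the outer dictionary are the column labels.
By extension, the n-dimensional data of a Panel resembles n nested dictionaries, with the labels of the outermost dimension corresponding to the keys of the outermost dictionary.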
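The dictionary analogy in the summary can be checked directly with `to_dict`. A minimal sketch with made-up data:

```python
import pandas as pd

s = pd.Series([1, 2], index=["a", "b"])
df = pd.DataFrame({"one": s, "two": s * 10})

# A Series maps labels to values, like a flat dict.
as_dict = s.to_dict()

# A DataFrame maps column labels to columns, and each column maps
# row labels to values: a dict of dicts.
nested = df.to_dict()

# Looking up a column label gives back a Series.
column = df["one"]

print(as_dict)
print(nested)
print(type(column).__name__)
```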

1 0