Getting Started with pandas, Part 3

Source: Internet  Editor: 程序博客网  Time: 2024/06/07 05:47

These are partial notes from studying "Python for Data Analysis" (《利用Python进行数据分析》). Thanks to the author.

Basic pandas functionality

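1: Reindexing:

Calling the reindex method returns a new object whose data is rearranged to match the new index. Labels that were not in the old index get a missing value (NaN):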

from pandas import Series, DataFrame
import numpy as np

obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj
Out[18]: 
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
obj2
Out[19]: 
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN

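For ordered data such as time series, reindexing often needs to fill in the new positions. The method option does this: ffill fills forward from the previous value, and bfill fills backward from the next value.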

obj3 = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3.reindex(range(6), method='ffill')
Out[21]: 
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
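To make the difference between the two fill directions concrete, here is a minimal sketch that runs both on the same Series. It uses the `pd.Series` spelling rather than the bare `Series` import from the notes, and the variable names `forward` and `backward` are only illustrative.

```python
import pandas as pd

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# Forward fill: each new label takes the last known value before it.
forward = obj3.reindex(range(6), method='ffill')

# Backward fill: each new label takes the next known value after it.
# Label 5 has no later value, so it stays missing.
backward = obj3.reindex(range(6), method='bfill')

print(forward)
print(backward)
```

With bfill the last position has nothing after it to copy from, so it remains missing.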

For a DataFrame, reindex can change the row index, the columns, or both. Passing just a sequence reindexes the rows. Use the columns keyword to reindex the columns:

frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],
                  columns=['Ohio', 'Texas', 'California'])
frame
Out[6]: 
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2
Out[8]: 
   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0
state = ['Texas', 'Utah', 'California']
frame.reindex(columns=state)
Out[10]: 
   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8

Rows and columns can also be reindexed in a single call, but in this example the fill (method) is applied only along the rows:

frame.reindex(index=['a', 'b', 'c', 'd'], method='ffill', columns=state)
Out[12]: 
   Texas  Utah  California
a      1   NaN           2
b      1   NaN           2
c      4   NaN           5
d      7   NaN           8
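Some pandas versions may also try to apply method to the columns and reject column labels that are not sorted, so the combined call is not guaranteed to behave the same everywhere. A two-step form, sketched below under that assumption, states the intent explicitly: fill along the rows first, then reindex the columns with no fill. The names `filled_rows` and `result` are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
state = ['Texas', 'Utah', 'California']

# Step 1: forward-fill along the rows only (the row index is sorted).
filled_rows = frame.reindex(['a', 'b', 'c', 'd'], method='ffill')

# Step 2: reindex the columns without any fill method; 'Utah' is new
# and therefore comes back as NaN.
result = filled_rows.reindex(columns=state)
print(result)
```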

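With the label indexing of ix, the same reindexing can be written more concisely (note that ix was deprecated in pandas 0.20 and removed in pandas 1.0):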

frame.ix[['a', 'b', 'd', 'c'], state]
Out[14]: 
   Texas  Utah  California
a    1.0   NaN         2.0
b    NaN   NaN         NaN
d    7.0   NaN         8.0
c    4.0   NaN         5.0
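Since ix is no longer available in current pandas, and loc in recent releases raises a KeyError when a requested label is missing, one way to get the same table is to pass both index and columns to reindex. This is a sketch of an equivalent, not the notes' original code.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
state = ['Texas', 'Utah', 'California']

# Labels that do not exist yet ('b' and 'Utah') are created and filled with NaN.
result = frame.reindex(index=['a', 'b', 'd', 'c'], columns=state)
print(result)
```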

The fill_value option (a parameter, not a method) sets the value used for the missing entries that reindexing introduces:

frame.reindex(index=['a', 'b', 'c', 'd'], fill_value='1', columns=state)
Out[17]: 
  Texas Utah California
a     1    1          2
b     1    1          1
c     4    1          5
d     7    1          8
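The example above passes the string '1', so the filled cells hold strings next to the original integers. If numeric results are wanted, a numeric fill value is the more usual choice. The sketch below uses 0 purely as an illustrative value.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
state = ['Texas', 'Utah', 'California']

# Every cell that reindexing would leave missing is set to 0 instead of NaN.
result = frame.reindex(index=['a', 'b', 'c', 'd'], columns=state, fill_value=0)
print(result)
```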

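2: Dropping entries from an axis:
The drop method takes a single label, or an array or list of labels, and returns a new object without them: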

data = DataFrame(np.arange(16).reshape((4, 4)),
                 index=['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns=['one', 'two', 'three', 'four'])
data.drop(['Colorado', 'Ohio'])
Out[19]: 
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15

drop has an axis parameter. The default, axis=0, drops rows, and axis=1 drops columns.

data.drop('two', axis=1)
Out[21]: 
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15
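A short sketch of the two axis settings side by side. It also shows the columns keyword, which newer pandas versions accept as an alternative to axis=1, and confirms that drop leaves the original DataFrame untouched. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

without_rows = data.drop(['Colorado', 'Ohio'])   # axis=0 is the default
without_col = data.drop('two', axis=1)           # drop a column by axis
same_thing = data.drop(columns=['two'])          # keyword form of the same drop

# drop returns a new object; data still has all rows and columns.
print(data.shape, without_rows.shape, without_col.shape)
```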

3: Indexing, selection, and filtering:
Indexing a DataFrame with a column name, or a list of names, retrieves one or more columns.

data['two']
Out[23]: 
Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32

To select several columns, pass a list:

data[['two', 'three']]
Out[25]: 
          two  three
Ohio        1      2
Colorado    5      6
Utah        9     10
New York   13     14

Rows are selected with a slice or with a boolean expression:

data[:2]
Out[28]: 
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
data[data['three'] > 5]
Out[29]: 
          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
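The boolean expression is itself a Series of True/False values, one per row, and it can be built separately and combined with other conditions. A minimal sketch follows. The second condition on column one is an added illustration, not part of the notes.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# One boolean per row: is the value in column 'three' greater than 5?
mask = data['three'] > 5
selected = data[mask]

# Conditions combine with & (and) or | (or); each needs its own parentheses.
both = data[(data['three'] > 5) & (data['one'] < 12)]

print(mask)
print(both)
```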

Another option is to index with a boolean DataFrame, for example to assign to every cell that meets a condition:

data[data < 5] = 0
data
Out[32]: 
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
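The assignment above changes data in place. When the original values should be kept, the mask method gives the same replacement on a copy. This is an alternative sketch, not the approach used in the notes, and `cleaned` is an illustrative name.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# data < 5 is a boolean DataFrame with the same shape as data.
below_five = data < 5

# mask replaces the cells where the condition is True and returns a new object.
cleaned = data.mask(below_five, 0)

print(cleaned)
print(data.loc['Colorado', 'one'])  # the original value is still there
```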

Rows can also be selected with the .ix indexer mentioned earlier:

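data.ix['Colorado', ['two', 'three']]
Out[33]: 
two      5
three    6

# The input line for the next output is missing in the source; the output
# shows rows Colorado and Utah with columns four, one, two.
          four  one  two
Colorado     7    0    5
Utah        11    8    9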

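In short, ix takes two arguments: the first selects rows and the second selects columns. Each can be a single label, a list, or a slice. In newer pandas versions, loc (labels) and iloc (integer positions) take its place.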
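As a sketch of the same row-then-column pattern with the indexers available in current pandas, the example below uses loc for labels and iloc for integer positions. It starts from a fresh copy of data, so column one for Colorado is 4 here rather than the 0 left by the earlier in-place assignment. The integer positions passed to iloc are an assumption about how the second table above was produced.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Label-based: one row, two columns.
by_label = data.loc['Colorado', ['two', 'three']]

# Position-based: rows 1 and 2, columns 3, 0 and 1 (four, one, two).
by_position = data.iloc[[1, 2], [3, 0, 1]]

# Label slices with loc include the end label.
label_slice = data.loc[:'Utah', 'two']

print(by_label)
print(by_position)
print(label_slice)
```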

4: Arithmetic and data alignment

df1 = DataFrame(np.arange(9).reshape((3, 3)), columns=list('bde'),
                index=['Ohio', 'Texas', 'Colorado'])
df2 = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df1
Out[7]: 
          b  d  e
Ohio      0  1  2
Texas     3  4  5
Colorado  6  7  8
df2
Out[8]: 
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
df1 + df2
Out[9]: 
            b     d     e
Colorado  NaN   NaN   NaN
Ohio      3.0   5.0   7.0
Oregon    NaN   NaN   NaN
Texas     9.0  11.0  13.0
Utah      NaN   NaN   NaN

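The arithmetic methods add, sub, div, and mul each accept a fill_value argument. A value that is missing from one of the operands is replaced by fill_value before the operation, so the result is not NaN.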

df1.add(df2, fill_value=0)
Out[10]: 
            b     d     e
Colorado  6.0   7.0   8.0
Ohio      3.0   5.0   7.0
Oregon    9.0  10.0  11.0
Texas     9.0  11.0  13.0
Utah      0.0   1.0   2.0
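The following sketch puts the operator and the method form next to each other on the same two frames, to show which rows change when fill_value is supplied. The names `plain` and `filled` are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9).reshape((3, 3)), columns=list('bde'),
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Operator form: rows present in only one frame become NaN.
plain = df1 + df2

# Method form: a row missing from one frame is treated as 0 there,
# so the other frame's values pass through unchanged.
filled = df1.add(df2, fill_value=0)

print(plain.loc['Colorado'])
print(filled.loc['Colorado'])
```

Rows present in both frames (Ohio and Texas) are the same in both results.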

Operations between a DataFrame and a Series (here frame holds the same values as df2, and series is its first row, Utah):

frame
Out[13]: 
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
series
Out[14]: 
b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64
frame - series
Out[15]: 
          b    d    e
Utah    0.0  0.0  0.0
Ohio    3.0  3.0  3.0
Texas   6.0  6.0  6.0
Oregon  9.0  9.0  9.0

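As shown, arithmetic between a DataFrame and a Series by default matches the Series index against the DataFrame's columns and broadcasts down the rows, so the Series is subtracted from every row.
To match on the row index and broadcast across the columns instead, use one of the arithmetic methods and pass axis=0 (here series is column d of frame):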

frame
Out[17]: 
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
series
Out[18]: 
Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64
frame.sub(series, axis=0)
Out[19]: 
          b    d    e
Utah   -1.0  0.0  1.0
Ohio   -1.0  0.0  1.0
Texas  -1.0  0.0  1.0
Oregon -1.0  0.0  1.0
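The notes do not show how frame and the two Series were created. The sketch below reconstructs them from the printed output, which is an assumption, and then runs both broadcasting directions. The names `row`, `col`, `by_row`, and `by_col` are illustrative.

```python
import numpy as np
import pandas as pd

# Assumed construction, inferred from the printed tables above.
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

row = frame.iloc[0]   # Series named 'Utah', indexed by b, d, e
col = frame['d']      # Series named 'd', indexed by the row labels

# Default: match the Series index on the columns, repeat for every row.
by_row = frame - row

# axis=0: match the Series index on the row labels, repeat for every column.
by_col = frame.sub(col, axis=0)

print(by_row)
print(by_col)
```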
