Python Data Analysis: pandas Data Structures and Operations

Source: Internet | Editor: 程序博客网 | Date: 2024/05/21 17:14

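pandas has two commonly used data structures: Series and DataFrame.
We will study both of them here.
A Series is an object that resembles a one-dimensional array and also resembles a dictionary: it consists of a sequence of values together with an associated sequence of index labels.
We can create a Series object simply by passing in some data.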

In [1]: from pandas import *

In [2]: test = Series([1, 2, 3, 4])

In [3]: test
Out[3]: 
0    1
1    2
2    3
3    4
dtype: int64

The index does not have to consist of integers, of course. We can also supply our own index labels when creating the Series.

In [4]: test2 = Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

In [5]: test2
Out[5]: 
a    1
b    2
c    3
d    4
dtype: int64

In [6]: test2['a']
Out[6]: 1

In [7]: test2.index
Out[7]: Index(['a', 'b', 'c', 'd'], dtype='object')

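Indexing also works much like it does for a dictionary, and many of the operations we learned earlier for NumPy arrays apply to Series objects as well.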
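As a minimal sketch of the NumPy-style operations mentioned above (the variable names are chosen here for illustration, and pandas is imported as pd rather than with a star import), filtering, arithmetic, and universal functions all keep the index labels attached to the values:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

# Boolean filtering keeps the labels of the selected elements.
big = s[s > 2]

# Scalar arithmetic is applied element by element.
doubled = s * 2

# NumPy universal functions accept a Series and return a Series.
roots = np.sqrt(s)

# Like a dictionary, a Series supports membership tests on its labels.
has_b = 'b' in s

print(big)
print(doubled)
print(roots)
print(has_b)
```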

Because the mapping works in a similar way, a Python dictionary can be used directly to create a Series from its data.

In [12]: height = {'tom': 170, 'david': 175, 'harry': 180, 'mary': 170}

In [13]: test3 = Series(height)

In [14]: test3
Out[14]: 
david    175
harry    180
mary     170
tom      170
dtype: int64
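The output above lists the keys alphabetically, which is how older pandas versions behaved. Recent versions keep the dictionary's insertion order instead. A minimal sketch, assuming a recent pandas imported as pd, of getting the alphabetical ordering explicitly with sort_index:

```python
import pandas as pd

height = {'tom': 170, 'david': 175, 'harry': 180, 'mary': 170}

# Build the Series from the dictionary; keys become index labels.
s = pd.Series(height)

# sort_index returns a new Series ordered by label.
s_sorted = s.sort_index()

print(s_sorted)
```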

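We can also pass the index in directly as a list. Labels in the list with no matching dictionary key (jack below) get NaN, and dictionary keys that are not in the list (mary) are left out.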

In [16]: name = ['tom', 'david', 'harry', 'jack']

In [17]: test4 = Series(height, index=name)

In [18]: test4
Out[18]: 
tom      170.0
david    175.0
harry    180.0
jack       NaN
dtype: float64

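Missing values can be detected with the pandas functions isnull and notnull. The session below calls them through the pd alias, so it assumes import pandas as pd has been run as well.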

In [19]: pd.isnull(test4)
Out[19]: 
tom      False
david    False
harry    False
jack      True
dtype: bool

In [20]: pd.notnull(test4)
Out[20]: 
tom       True
david     True
harry     True
jack     False
dtype: bool
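The boolean results above are typically used as masks. The following is a small sketch (variable names are illustrative) that rebuilds the same Series, counts the missing entries, and keeps only the values that are present:

```python
import pandas as pd

height = {'tom': 170, 'david': 175, 'harry': 180, 'mary': 170}
name = ['tom', 'david', 'harry', 'jack']

# 'jack' has no entry in the dictionary, so his value is NaN.
s = pd.Series(height, index=name)

# isnull is also available as a method on the Series itself.
mask = s.isnull()

# Summing a boolean mask counts the True entries.
n_missing = int(mask.sum())

# Indexing with the notnull mask keeps only the present values.
present = s[pd.notnull(s)]

print(n_missing)
print(present)
```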

Both the Series object itself and its index have a name attribute, which makes it feel a bit like the Excel spreadsheets we use day to day.

In [21]: test3.name = 'height'

In [22]: test3.index.name = 'Name'

In [23]: test3
Out[23]: 
Name
david    175
harry    180
mary     170
tom      170
Name: height, dtype: int64

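The index can also be changed by assignment. The new list must have the same length as the Series, and its labels do not have to be unique.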

In [24]: test3.index = ['bob', 'bob', 'bob', 'bob']

In [25]: test3
Out[25]: 
bob    175
bob    180
bob    170
bob    170
Name: height, dtype: int64

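DataFrame is the table-like data structure in pandas. It holds an ordered collection of columns, each of which can have a different value type, and it can be thought of as an extended Series that stores data in two dimensions.
There are many ways to build a DataFrame. One is to create it from a dictionary of equal-length lists.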

In [3]: data = {'name': ['tom', 'harry', 'jack', 'mary'], 'height': ['176', '170', '180', '160'], 'weight': ['100', '80', '90', '50']}

In [4]: frame = DataFrame(data)

In [5]: frame
Out[5]: 
  height   name weight
0    176    tom    100
1    170  harry     80
2    180   jack     90
3    160   mary     50

The columns of a DataFrame can also be arranged in a specified order, and the rows can be given custom index labels.

In [7]: DataFrame(data, columns=['name', 'height', 'weight'], index=['first', 'second', 'third', 'four'])
Out[7]: 
         name height weight
first     tom    176    100
second  harry    170     80
third    jack    180     90
four     mary    160     50

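If a requested column has no corresponding values in the data, NA values are generated for it, just as with Series.
When a column sequence is specified, the columns are arranged in exactly that order.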
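The NA behaviour described above can be sketched as follows. The column name age is hypothetical and was picked only because it is not a key in data. pandas is imported as pd here rather than with a star import:

```python
import pandas as pd

data = {
    'name': ['tom', 'harry', 'jack', 'mary'],
    'height': ['176', '170', '180', '160'],
    'weight': ['100', '80', '90', '50'],
}

# 'age' (a hypothetical column) is not a key in data,
# so every value in that column comes out as NaN.
frame = pd.DataFrame(
    data,
    columns=['name', 'height', 'weight', 'age'],
    index=['first', 'second', 'third', 'four'],
)

print(frame)
print(frame['age'].isnull().all())
```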

A column of a DataFrame can be retrieved as a Series by indexing with its name.

In [9]: frame = DataFrame(data, columns=['name', 'height', 'weight'], index=['first', 'second', 'third', 'four'])

In [10]: frame['name']
Out[10]: 
first       tom
second    harry
third      jack
four       mary
Name: name, dtype: object

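An entire row can be retrieved by label through an indexing attribute. The original session used ix for this, but ix was deprecated in pandas 0.20 and removed in 1.0, so the label-based loc is used here. It returns the same result.

In [12]: frame.loc['third']
Out[12]: 
name      jack
height     180
weight      90
Name: third, dtype: object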
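For readers on a current pandas release, here is a minimal sketch of row access with loc (by label) and iloc (by integer position), the two accessors that took over from ix. The variable names are illustrative only:

```python
import pandas as pd

data = {
    'name': ['tom', 'harry', 'jack', 'mary'],
    'height': ['176', '170', '180', '160'],
    'weight': ['100', '80', '90', '50'],
}
frame = pd.DataFrame(
    data,
    columns=['name', 'height', 'weight'],
    index=['first', 'second', 'third', 'four'],
)

# Label-based selection: the row whose index label is 'third'.
row_by_label = frame.loc['third']

# Position-based selection: the third row (positions start at 0).
row_by_pos = frame.iloc[2]

# loc also accepts a (row label, column label) pair for a single cell.
cell = frame.loc['third', 'name']

print(row_by_label)
print(cell)
```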

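Another common form of data is the nested dictionary. When one is passed to DataFrame, the keys of the outer dictionary become the columns and the keys of the inner dictionaries become the row index.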

In [13]: pop = {'Nevada': {2001: 2.4, 2002: 2.9}, 'ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

In [14]: frame = DataFrame(pop)

In [15]: frame
Out[15]: 
      Nevada  ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6

We can also transpose it, swapping rows and columns.

In [16]: frame.T
Out[16]: 
        2000  2001  2002
Nevada   NaN   2.4   2.9
ohio     1.5   1.7   3.6

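The keys of the inner dictionaries are combined (and, in the pandas version used for this session, sorted) to form the final row index. If an index is specified explicitly, this does not happen and only the given labels are used.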

In [18]: DataFrame(pop, index=[2001, 2002, 2003])
Out[18]: 
      Nevada  ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN
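The nested-dictionary rules can be checked with a short sketch. Whether the combined inner keys come out sorted may depend on the pandas version, so the example calls sort_index explicitly rather than relying on it. Names other than pop are illustrative:

```python
import pandas as pd

pop = {
    'Nevada': {2001: 2.4, 2002: 2.9},
    'ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6},
}

# Outer keys become columns; the union of inner keys becomes the row index.
frame = pd.DataFrame(pop).sort_index()

# Transposing swaps the row index and the columns.
transposed = frame.T

# With an explicit index, only the requested labels appear;
# 2003 is in neither inner dictionary, so its row is all NaN.
restricted = pd.DataFrame(pop, index=[2001, 2002, 2003])

print(frame)
print(transposed)
print(restricted)
```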

0 0