Getting Started with pandas: Data Structures (1)

Source: Internet. Posted by: 万梓良心事谁人知. Editor: 程序博客网. Time: 2024/06/05 04:35

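Series: a one-dimensional array, similar to a one-dimensional array in NumPy. Both are also close to Python's basic list data structure. The difference is that the elements of a list can each have a different data type, whereas an array or a Series stores all of its elements under a single data type, which uses memory more efficiently and speeds up computation.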
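
The difference in element types can be seen with a small sketch. The variable names (mixed_list, arr, ser) are made up for illustration and are not from the original article:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: a list mixing int and float elements.
mixed_list = [1, 2.5, 3]

arr = np.array(mixed_list)    # NumPy converts every element to one dtype
ser = pd.Series(mixed_list)   # a Series likewise ends up with a single dtype

# The list keeps each element's own Python type.
print([type(x).__name__ for x in mixed_list])  # ['int', 'float', 'int']
print(arr.dtype)  # float64
print(ser.dtype)  # float64
```

The integers are upcast to floating point so that every element fits one dtype.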

Time-Series: a Series indexed by time.
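
As a minimal sketch of a time-indexed Series, the following uses pd.date_range to build the index. The values and the names dates and ts are illustrative assumptions:

```python
import pandas as pd

# Three consecutive daily timestamps starting on 2016-01-01.
dates = pd.date_range('20160101', periods=3)

# A Series whose index is those timestamps (values are made up).
ts = pd.Series([10, 20, 30], index=dates)

# Look up a value by its timestamp label.
value = ts.loc[pd.Timestamp('2016-01-02')]
print(type(ts.index).__name__)  # DatetimeIndex
print(value)                    # 20
```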

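DataFrame: a two-dimensional, tabular data structure. Much of its functionality resembles data.frame in R. A DataFrame can be thought of as a container of Series. The rest of this article focuses mainly on DataFrame.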

For more features, see the official pandas documentation: http://pandas-docs.github.io/pandas-docs-travis/index.html

In [1]: # First, import the libraries
        import pandas as pd
        import numpy as np
        import matplotlib.pyplot as plt

I. Introduction to the Data Structures
1. Series

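A Series consists of a set of data (of any NumPy data type) together with an associated set of labels (the index). The simplest Series can be built from data alone: pass a list object and pandas creates a default integer index. For more on Series, see the official documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html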

In [2]: s = pd.Series([1,3,5,np.nan,6,8])
        s
Out[2]:
0     1
1     3
2     5
3   NaN
4     6
5     8
dtype: float64

A Series can also be created from a dictionary:

In [3]: sdata = {'a':1,'b':2,'c':3}
        s1 = pd.Series(sdata)
        s1
Out[3]:
a    1
b    2
c    3
dtype: int64

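When an index is passed along with the dictionary, the labels that match keys in the dictionary are found and placed in the corresponding positions. No value can be found for 'd', so its result is NaN (not a number). Missing data can be detected with isnull and notnull.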

In [4]: s11 = pd.Series(sdata,index=['a','b','c','d'])
        s11
Out[4]:
a     1
b     2
c     3
d   NaN
dtype: float64

In [5]: pd.isnull(s11)  # or s11.isnull()
Out[5]:
a    False
b    False
c    False
d     True
dtype: bool

In [6]: s11.notnull()  # or pd.notnull(s11)
Out[6]:
a     True
b     True
c     True
d    False
dtype: bool

One of the most important features of Series is that arithmetic operations automatically align differently indexed data by label.

In [7]: s1+s11
Out[7]:
a     2
b     4
c     6
d   NaN
dtype: float64
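
Alignment is easier to see when the two indexes only partly overlap. The sketch below uses two made-up Series (left and right, not from the original article) and also shows Series.add with fill_value, which treats a label missing on one side as 0 instead of producing NaN:

```python
import pandas as pd

# Hypothetical Series with partly overlapping labels.
left = pd.Series({'a': 1, 'b': 2, 'c': 3})
right = pd.Series({'b': 10, 'c': 20, 'd': 30})

# Plain addition aligns on labels; labels present on only one side give NaN.
total = left + right

# add() with fill_value substitutes 0 for the side where a label is missing.
filled = left.add(right, fill_value=0)

print(total)   # 'a' and 'd' are NaN, 'b' is 12.0, 'c' is 23.0
print(filled)  # a 1.0, b 12.0, c 23.0, d 30.0
```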

Both the Series object itself and its index have a name attribute.

In [8]: s1.name = 'population'
        s1.index.name = 'state'
        s1
Out[8]:
state
a    1
b    2
c    3
Name: population, dtype: int64

The index of a Series can be modified in place by assignment.

In [9]: s1.index = ['Join','Quant','JQ']
        s1
Out[9]:
Join     1
Quant    2
JQ       3
Name: population, dtype: int64

The values and index attributes return the array representation and the index object, respectively.

In [10]: s.values
Out[10]: array([  1.,   3.,   5.,  nan,   6.,   8.])

In [11]: s.index
Out[11]: Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')

A Series can also be given an explicit index that labels each data point.

In [12]: s2 = pd.Series([1,3,-6,8],index=['a','d','e','c'])
         s2
Out[12]:
a    1
d    3
e   -6
c    8
dtype: int64

Index labels can be used to select a single value or a set of values from a Series.

In [13]: s2['a']
Out[13]: 1

In [14]: s2[['a','c']]
Out[14]:
a    1
c    8
dtype: int64

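NumPy array operations (such as filtering with a boolean array, scalar multiplication, or applying math functions) all preserve the link between index and value.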

In [15]: s2[s2>0]
Out[15]:
a    1
d    3
c    8
dtype: int64

In [16]: s2*2
Out[16]:
a     2
d     6
e   -12
c    16
dtype: int64

In [17]: np.exp(s2)
Out[17]:
a       2.718282
d      20.085537
e       0.002479
c    2980.957987
dtype: float64

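A Series can also be viewed as a fixed-length, ordered dictionary, since it is a mapping from index values to data values. It can be used in many functions that would otherwise expect a dictionary argument: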

In [18]: 'a' in s2
Out[18]: True

In [26]: s2
Out[26]:
a    1
d    3
e   -6
c    8
dtype: int64

In [25]: s2.add(1)
Out[25]:
a    2
d    4
e   -5
c    9
dtype: int64
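
The dictionary analogy can be pushed a little further. The sketch below rebuilds the same s2 and uses get and to_dict, which are standard Series methods. Choosing them as the illustration is an assumption of this example, not something taken from the original article:

```python
import pandas as pd

s2 = pd.Series([1, 3, -6, 8], index=['a', 'd', 'e', 'c'])

# Membership tests look at the index labels (the "keys"), not the values.
print('a' in s2)   # True
print(-6 in s2)    # False: -6 is a value, not a label

# get() returns a default for a missing label, like dict.get().
missing = s2.get('z', 0)
print(missing)     # 0

# to_dict() converts the label-to-value mapping into a plain dict.
as_dict = s2.to_dict()
print(as_dict)     # {'a': 1, 'd': 3, 'e': -6, 'c': 8}
```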

2. DataFrame

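A DataFrame is a tabular data structure containing an ordered set of columns. All values within one column share the same data type, while different columns can have different types (numeric, string, boolean, and so on). By analogy with a database, each row of a DataFrame is a record, identified by one element of the index, and each column is a field, that is, an attribute of the record. A DataFrame has both a row index and a column index, and can be seen as a dictionary of Series that all share the same index.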

For more, see: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html

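There are several ways to create a DataFrame:
1. Build the DataFrame from a dictionary of dictionaries or a dictionary of Series. The keys of the outermost dictionary become the columns of the DataFrame, and the nested dictionaries or Series supply the values of each column.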

In [19]: d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
              'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
         df = pd.DataFrame(d)
         df
Out[19]:
   one  two
a    1    1
b    2    2
c    3    3
d  NaN    4

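Here d is a dictionary in which the value of 'one' is a Series with 3 values and the value of 'two' is a Series with 4 values. The DataFrame built from d has 4 rows and 2 columns. Because 'one' has only 3 values, the cell at row 'd', column 'one' is NaN.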

In [20]: df2 = pd.DataFrame({ 'A' : 1.,
   ....:       'B' : pd.Timestamp('20160101'),
   ....:       'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
   ....:       'D' : np.array([3] * 4,dtype='int32'),
   ....:       'E' : pd.Categorical(["test","train","test","train"]),
   ....:       'F' : 'foo' })
         df2
Out[20]:
   A          B  C  D      E    F
0  1 2016-01-01  1  3   test  foo
1  1 2016-01-01  1  3  train  foo
2  1 2016-01-01  1  3   test  foo
3  1 2016-01-01  1  3  train  foo

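2. Build the DataFrame from a dictionary of lists, where each nested list represents one column and the dictionary keys are the column labels. Note that every list should contain the same number of elements. A DataFrame can also be created by passing a numpy array together with a time index and column labels: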

In [21]: dates = pd.date_range('20160101',periods=6)
         df3 = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
         df3
Out[21]:
                   A         B         C         D
2016-01-01 -1.113506  0.823869  1.334366 -1.228612
2016-01-02 -0.452546 -0.380858  0.471212 -0.553034
2016-01-03 -0.958349  0.528585  0.742589  0.057017
2016-01-04  1.209820  1.099186 -0.841838  1.445381
2016-01-05 -0.425561 -1.152818 -0.172490 -0.516070
2016-01-06 -0.107563 -0.094983  0.102025 -0.524834
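
The example above builds the DataFrame from a numpy array. For the dictionary-of-lists form itself, a minimal sketch might look like the following. The data, the row labels, and the names data and df_lists are made up for illustration:

```python
import pandas as pd

# Hypothetical data: each key is a column label, each list is one column.
data = {'one': [1., 2., 3., 4.],
        'two': [4., 3., 2., 1.]}
df_lists = pd.DataFrame(data, index=['a', 'b', 'c', 'd'])
print(df_lists)

# Lists of different lengths are rejected with a ValueError.
try:
    pd.DataFrame({'one': [1, 2, 3], 'two': [1, 2]})
    unequal_ok = True
except ValueError as exc:
    unequal_ok = False
    print('ValueError:', exc)
```

Unlike a dictionary of Series, plain lists carry no labels to align on, which is why their lengths have to match.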

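3. Build the DataFrame from a list of dictionaries, where each dictionary represents one record (one row of the DataFrame) and each value in the dictionary is an attribute of that record.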

In [22]: d = [{'one' : 1,'two':1},{'one' : 2,'two' : 2},
              {'one' : 3,'two' : 3},{'two' : 4}]
         df = pd.DataFrame(d,index=['a','b','c','d'],columns=['one','two'])
         df.index.name='index'
         df
Out[22]:
       one  two
index
a        1    1
b        2    2
c        3    3
d      NaN    4

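The statements above produce the same DataFrame as the one created from a dictionary of Series. Only the approach differs slightly. One builds the table column by column, turning the attributes of all records into several Series, so the row labels are repeated. The other builds it row by row, turning each record into a dictionary, so the column labels are repeated. With the row-by-row approach, if the column order is not specified through columns, the order of the columns is not guaranteed (on older versions of Python and pandas it can appear arbitrary).

Assigning to a column that does not exist creates a new column. The del keyword deletes a column.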

In [25]: df['quant'] = 6
         df
Out[25]:
       one  two  quant
index
a        1    1      6
b        2    2      6
c        3    3      6
d      NaN    4      6

In [26]: del df['quant']
         df
Out[26]:
       one  two
index
a        1    1
b        2    2
c        3    3
d      NaN    4

Check the data type of each column:

In [27]: df.dtypes
Out[27]:
one    float64
two      int64
dtype: object

Converting a DataFrame to other types

Values accepted by the orient parameter of to_dict include 'dict', 'list', 'series', and 'records'.

In [28]: df.to_dict(orient='dict')
Out[28]:
{'one': {'a': 1.0, 'b': 2.0, 'c': 3.0, 'd': nan},
 'two': {'a': 1, 'b': 2, 'c': 3, 'd': 4}}
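
To compare orient values, the sketch below rebuilds an equivalent small DataFrame and converts it with orient='list' and orient='records'. The names as_lists and as_records are illustrative only:

```python
import numpy as np
import pandas as pd

# A small DataFrame equivalent to the one used in the text.
df = pd.DataFrame({'one': [1., 2., 3., np.nan],
                   'two': [1, 2, 3, 4]},
                  index=['a', 'b', 'c', 'd'])

# 'list': one list of values per column; the row labels are dropped.
as_lists = df.to_dict(orient='list')
print(as_lists['two'])   # [1, 2, 3, 4]

# 'records': one dictionary per row, keyed by column label.
as_records = df.to_dict(orient='records')
print(as_records[0])     # {'one': 1.0, 'two': 1}
```

The 'records' form is the inverse of building a DataFrame from a list of dictionaries, apart from the row labels, which it does not keep.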

There are two ways to get a column of a DataFrame as a Series:

In [29]: df['one']
Out[29]:
index
a     1
b     2
c     3
d   NaN
Name: one, dtype: float64

In [30]: df.one
Out[30]:
index
a     1
b     2
c     3
d   NaN
Name: one, dtype: float64

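The returned Series has the same index as the original DataFrame, and its name attribute is set accordingly. Rows can also be retrieved, either by position or by label.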
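
The original text does not show how rows are retrieved. One common way, offered here as an assumption rather than the author's method, is the loc (label-based) and iloc (position-based) indexers:

```python
import numpy as np
import pandas as pd

# A small DataFrame equivalent to the one used in the text.
df = pd.DataFrame({'one': [1., 2., 3., np.nan],
                   'two': [1, 2, 3, 4]},
                  index=['a', 'b', 'c', 'd'])

row_by_label = df.loc['b']   # select the row whose index label is 'b'
row_by_pos = df.iloc[1]      # select the second row by integer position

# Both return the same row as a Series indexed by the column labels.
print(row_by_label)
print(row_by_label.name)     # b
```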

0 0