Learning Python (12): The pandas Library, Part 1

Source: Internet. Editor: 程序博客网. Time: 2024/06/05 16:20

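Basic features of pandas:

(1) Data structures with automatic or explicit data alignment along their axes;
(2) Integrated time series functionality;
(3) The same data structures handle both time series and non-time-series data;
(4) Arithmetic operations and reductions (such as summing along an axis) are performed according to the metadata (axis labels);
(5) Flexible handling of missing data;
(6) Merges and other relational operations found in common databases (SQL-based ones, for example).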
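To make features (1), (4) and (5) concrete, here is a minimal sketch. The labels and numbers are made-up sample data, not taken from the article.

```python
import pandas as pd

# Two Series with partly overlapping labels (hypothetical sample data).
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

# Feature (1): addition aligns on labels; "x" has no partner, so it becomes NaN.
total = a + b

# Feature (5): missing values can be replaced, here with 0.0.
filled = total.fillna(0.0)

# Feature (4): a reduction along one axis keeps the labels of the other axis.
frame = pd.DataFrame({"p": [1, 2], "q": [3, 4]}, index=["r1", "r2"])
column_sums = frame.sum(axis=0)  # one value per column, labelled p and q
row_sums = frame.sum(axis=1)     # one value per row, labelled r1 and r2

print(total)
print(filled)
print(column_sums)
print(row_sums)
```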

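pandas data structures:

1. Creating a Series

A Series is a one-dimensional array-like object made up of an array of data (of any NumPy data type) and an associated array of data labels, called its index.
When a Series is printed, the index appears on the left and the values on the right.

First, import the class: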

from pandas import Series

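① Creating a Series from an array

(By default the index starts at 0, as with a one-dimensional array.)

ser1 = Series([111, 222, 333, -444])
print(ser1)
print(ser1.values)  # the underlying NumPy array
print(ser1.index)   # the default integer index

Output:

0    111
1    222
2    333
3   -444
dtype: int64
[ 111  222  333 -444]
RangeIndex(start=0, stop=4, step=1)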

② Specifying the index of a Series

(With an explicit index, a Series resembles the key-value storage of a dict.)

ser2 = Series([10, 20, 30, 40], index=['fir', 'sec', 'thi', 'fou'])
print(ser2)
print(ser2.index)

Output:

fir    10
sec    20
thi    30
fou    40
dtype: int64
Index([u'fir', u'sec', u'thi', u'fou'], dtype='object')

③ Creating a Series from a dict

d = {'ShangHai': 21, 'TianJin': 22, 'ChongQing': 23, 'ShenYang': 24, 'NanJing': 25,
     'GuangZhou': 20, 'WuHan': 27, 'ChengDu': 28, 'XiAn': 29}
print(d)
ser3 = Series(d)  # the dict keys become the index
print(ser3)

Output:

{'WuHan': 27, 'GuangZhou': 20, 'ShenYang': 24, 'ChengDu': 28, 'TianJin': 22, 'ShangHai': 21, 'XiAn': 29, 'NanJing': 25, 'ChongQing': 23}
ChengDu      28
ChongQing    23
GuangZhou    20
NanJing      25
ShangHai     21
ShenYang     24
TianJin      22
WuHan        27
XiAn         29
dtype: int64

④ When a Series is built from a dict and an index is also given, index labels with no matching dict key get the value NaN (not a number):

d = {'ShangHai': 21, 'TianJin': 22, 'ChongQing': 23, 'ShenYang': 24, 'NanJing': 25,
     'GuangZhou': 20, 'WuHan': 27, 'ChengDu': 28, 'XiAn': 29}
ser3 = Series(d)
# 'HaErBin' is not a key of d, so its value will be NaN
city = ['HaErBin', 'ShangHai', 'TianJin', 'ChongQing', 'ShenYang', 'NanJing',
        'GuangZhou', 'WuHan', 'ChengDu', 'XiAn']
ser4 = Series(d, index=city)
print(ser4)

Output:

HaErBin       NaN
ShangHai     21.0
TianJin      22.0
ChongQing    23.0
ShenYang     24.0
NanJing      25.0
GuangZhou    20.0
WuHan        27.0
ChengDu      28.0
XiAn         29.0
dtype: float64
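The output above switches from int64 to float64 because NaN is a floating-point value. The sketch below uses a shortened version of the dict to show how the missing entries can be detected with isnull and notnull; it is an illustration, not part of the original walkthrough.

```python
import pandas as pd

# Shortened version of the dict above; 'HaErBin' is deliberately missing.
d = {"ShangHai": 21, "TianJin": 22}
ser = pd.Series(d, index=["HaErBin", "ShangHai", "TianJin"])

missing = ser.isnull()        # boolean Series, True where the value is NaN
present = ser[ser.notnull()]  # keep only the labels that have a value

print(ser.dtype)  # float64, because NaN forces a floating-point dtype
print(missing)
print(present)
```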

2. Reading and writing a Series

① Reading and writing a single element by index label

ser2 = Series([10, 20, 30, 40], index=['fir', 'sec', 'thi', 'fou'])
print(ser2['thi'])  # read
ser2['thi'] = 666   # write
print(ser2['thi'])

Output:

30
666

② Reading several elements by passing a list of index labels

ser2 = Series([10, 20, 30, 40], index=['fir', 'sec', 'thi', 'fou'])
print(ser2[['fir', 'sec', 'thi']])

Output:

fir    10
sec    20
thi    30
dtype: int64

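③ Reading Series elements with a boolean mask:

ser2 = Series([10, 20, 30, 40], index=['fir', 'sec', 'thi', 'fou'])
ser2['thi'] = 666
print('Elements less than 600')
print(ser2[ser2 < 600])  # ser2 < 600 is a boolean Series used as a filter

Output:

Elements less than 600
fir    10
sec    20
fou    40
dtype: int64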

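④ Checking whether an index label exists

This works like testing for a key in a dict: the result is True if the label exists and False otherwise.

ser2 = Series([10, 20, 30, 40], index=['fir', 'sec', 'thi', 'fou'])
print('thi' in ser2)
print('no' in ser2)

Output:

True
False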
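One point that is easy to miss: the in operator tests index labels only, never values. A short sketch, using the same Series as above, of how values can be checked instead:

```python
import pandas as pd

ser2 = pd.Series([10, 20, 30, 40], index=["fir", "sec", "thi", "fou"])

label_found = "thi" in ser2      # True: 'thi' is an index label
value_as_label = 30 in ser2      # False: 30 is a value, not a label
value_found = 30 in ser2.values  # True: search the underlying array instead
mask = ser2.isin([10, 40])       # element-wise membership test on the values

print(label_found, value_as_label, value_found)
print(mask)
```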

3. Series arithmetic

When two Series are added or subtracted, values with the same index label are combined; a label with no usable counterpart produces the missing value NaN:

d = {'ShenYang': 24, 'NanJing': 25, 'GuangZhou': 20, 'WuHan': 27, 'ChengDu': 28, 'XiAn': 29}
ser3 = Series(d)
city = ['HaErBin', 'ShenYang', 'NanJing', 'GuangZhou', 'WuHan', 'ChengDu', 'XiAn']
ser4 = Series(d, index=city)  # 'HaErBin' is NaN here
print(ser4 + ser3)
print(ser4 - 0.5 * ser3)

Output:

ChengDu      56.0
GuangZhou    40.0
HaErBin       NaN
NanJing      50.0
ShenYang     48.0
WuHan        54.0
XiAn         58.0
dtype: float64
ChengDu      14.0
GuangZhou    10.0
HaErBin       NaN
NanJing      12.5
ShenYang     12.0
WuHan        13.5
XiAn         14.5
dtype: float64
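If NaN is not wanted for labels that appear in only one operand, the add method with fill_value is one option. This is a sketch with illustrative numbers, not the article's data:

```python
import pandas as pd

# Illustrative values; each Series has one label the other lacks.
ser3 = pd.Series({"ShenYang": 24, "NanJing": 25})
ser4 = pd.Series({"HaErBin": 451, "ShenYang": 24})

plain = ser3 + ser4                     # unmatched labels become NaN
filled = ser3.add(ser4, fill_value=0)   # a missing operand is treated as 0

print(plain)
print(filled)
```

Note that fill_value only helps when a label is missing on one side; if the value is missing on both sides, the result is still NaN.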

4. Naming a Series and its index

Both a Series and its index can be given a name, which makes code easier to read.

d = {'ShenYang': 24, 'NanJing': 25, 'GuangZhou': 20, 'WuHan': 27, 'ChengDu': 28, 'XiAn': 29}
ser3 = Series(d)
ser3.name = 'area_code'        # name of the Series itself
ser3.index.name = 'city name'  # name of its index
print(ser3)
print(ser3.index)

Output:

city name
ChengDu      28
GuangZhou    20
NanJing      25
ShenYang     24
WuHan        27
XiAn         29
Name: area_code, dtype: int64
Index([u'ChengDu', u'GuangZhou', u'NanJing', u'ShenYang', u'WuHan', u'XiAn'], dtype='object', name=u'city name')

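5. Replacing the index by assignment

The whole index can be replaced by assigning a new sequence of labels. The assignment is positional: the new labels are applied in the current order of the Series. In the output below that order is the alphabetical order of the dict keys, so 'SY' ends up attached to ChengDu's value 28 rather than ShenYang's 24. Check the existing order before assigning.

d = {'ShenYang': 24, 'NanJing': 25, 'GuangZhou': 20, 'WuHan': 27, 'ChengDu': 28, 'XiAn': 29}
ser3 = Series(d)
ser3.index = ['SY', 'NJ', 'GZ', 'WH', 'CD', 'XA']  # applied by position, not by city
print(ser3)

Output:

SY    28
NJ    20
GZ    25
WH    24
CD    27
XA    29
dtype: int64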
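Because assigning to index works by position, a label-based alternative can be safer when the order of the Series is not known. The sketch below uses Series.rename with a mapping; the abbreviations dict is a hypothetical helper introduced for illustration.

```python
import pandas as pd

d = {"ShenYang": 24, "NanJing": 25, "GuangZhou": 20}
ser3 = pd.Series(d)

# Hypothetical mapping from old labels to new ones.
abbreviations = {"ShenYang": "SY", "NanJing": "NJ", "GuangZhou": "GZ"}

# rename returns a new Series; each label is translated individually,
# so the result does not depend on the order of the Series.
renamed = ser3.rename(index=abbreviations)

print(renamed)
```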

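1. Constructing a DataFrame

A DataFrame is a tabular data structure containing an ordered collection of columns, each of which can hold a different data type.
It has both a row index and a column index, and can be viewed as a dict of Series that all share the same index.
pandas combines NumPy's high-performance array computation with the flexible data manipulation of spreadsheets and relational databases (such as SQL).

First, import the class: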

from pandas import DataFrame

Build a DataFrame from a dict; the outer keys become the column names:

data = {'ShenYang': {'AreaCode': 24, 'GDP': 2412.2},
        'NanJing': {'AreaCode': 25, 'GDP': 5488.73},
        'GuangZhou': {'AreaCode': 20, 'GDP': 9891.48},
        'WuHan': {'AreaCode': 27, 'GDP': 6019.08}}
dfame = DataFrame(data)  # the inner keys become the row index
print(dfame)

Output:

          GuangZhou  NanJing  ShenYang    WuHan
AreaCode      20.00    25.00      24.0    27.00
GDP         9891.48  5488.73    2412.2  6019.08

Or:

data = {'city': ['ShenYang', 'NanJing', 'GuangZhou', 'WuHan', 'ChengDu', 'XiAn'],
        'AreaCode': [24, 25, 20, 27, 28, 29],
        'GDP': [2412.2, 5488.73, 9891.48, 6019.08, 6111.4, 3304.08]}
dfame = DataFrame(data)  # each list becomes one column
print(dfame)

Output:

   AreaCode      GDP       city
0        24  2412.20   ShenYang
1        25  5488.73    NanJing
2        20  9891.48  GuangZhou
3        27  6019.08      WuHan
4        28  6111.40    ChengDu
5        29  3304.08       XiAn

As you can see, the dict keys are unordered in the Python version used here, so the column order cannot be guaranteed: the input order was 'city', 'AreaCode', 'GDP', but the output order is AreaCode, GDP, city.
When a specific column order is needed, pass it to DataFrame through the columns argument.

data = {'city': ['ShenYang', 'NanJing', 'GuangZhou', 'WuHan', 'ChengDu', 'XiAn'],
        'AreaCode': [24, 25, 20, 27, 28, 29],
        'GDP': [2412.2, 5488.73, 9891.48, 6019.08, 6111.4, 3304.08]}
dfame = DataFrame(data, columns=['city', 'AreaCode', 'GDP'])
print(dfame)

Output:

        city  AreaCode      GDP
0   ShenYang        24  2412.20
1    NanJing        25  5488.73
2  GuangZhou        20  9891.48
3      WuHan        27  6019.08
4    ChengDu        28  6111.40
5       XiAn        29  3304.08

If a requested column does not exist in the dict data, it is filled entirely with NaN:

data = {'city': ['ShenYang', 'NanJing', 'GuangZhou', 'WuHan', 'ChengDu', 'XiAn'],
        'AreaCode': [24, 25, 20, 27, 28, 29],
        'GDP': [2412.2, 5488.73, 9891.48, 6019.08, 6111.4, 3304.08]}
# 'Population' is not a key of data
dfame = DataFrame(data, columns=['city', 'AreaCode', 'GDP', 'Population'])
print(dfame)

Output:

        city  AreaCode      GDP Population
0   ShenYang        24  2412.20        NaN
1    NanJing        25  5488.73        NaN
2  GuangZhou        20  9891.48        NaN
3      WuHan        27  6019.08        NaN
4    ChengDu        28  6111.40        NaN
5       XiAn        29  3304.08        NaN

If only some of the columns are listed, only those columns appear in the result:

data = {'city': ['ShenYang', 'NanJing', 'GuangZhou', 'WuHan', 'ChengDu', 'XiAn'],
        'AreaCode': [24, 25, 20, 27, 28, 29],
        'GDP': [2412.2, 5488.73, 9891.48, 6019.08, 6111.4, 3304.08]}
# 'AreaCode' is left out, so it is dropped
dfame = DataFrame(data, columns=['city', 'GDP', 'Population'])
print(dfame)

Output:

        city      GDP Population
0   ShenYang  2412.20        NaN
1    NanJing  5488.73        NaN
2  GuangZhou  9891.48        NaN
3      WuHan  6019.08        NaN
4    ChengDu  6111.40        NaN
5       XiAn  3304.08        NaN

You can also specify the index of the DataFrame (by default it is 0, 1, 2, 3, ...):

data = {'city': ['ShenYang', 'NanJing', 'GuangZhou', 'WuHan', 'ChengDu', 'XiAn'],
        'AreaCode': [24, 25, 20, 27, 28, 29],
        'GDP': [2412.2, 5488.73, 9891.48, 6019.08, 6111.4, 3304.08]}
dfame = DataFrame(data, columns=['city', 'AreaCode', 'GDP', 'Population'],
                  index=['line1', 'line2', 'line3', 'line4', 'line5', 'line6'])
print(dfame)

Output:

            city  AreaCode      GDP Population
line1   ShenYang        24  2412.20        NaN
line2    NanJing        25  5488.73        NaN
line3  GuangZhou        20  9891.48        NaN
line4      WuHan        27  6019.08        NaN
line5    ChengDu        28  6111.40        NaN
line6       XiAn        29  3304.08        NaN

The index and the columns can also be named, via dfame.index.name = 'line' and dfame.columns.name = 'brief':

data = {'city': ['ShenYang', 'NanJing', 'GuangZhou', 'WuHan', 'ChengDu', 'XiAn'],
        'AreaCode': [24, 25, 20, 27, 28, 29],
        'GDP': [2412.2, 5488.73, 9891.48, 6019.08, 6111.4, 3304.08]}
dfame = DataFrame(data, columns=['city', 'AreaCode', 'GDP'])
dfame.index.name = 'line'
dfame.columns.name = 'brief'
print(dfame)

Output:

brief       city  AreaCode      GDP
line
0       ShenYang        24  2412.20
1        NanJing        25  5488.73
2      GuangZhou        20  9891.48
3          WuHan        27  6019.08
4        ChengDu        28  6111.40
5           XiAn        29  3304.08

2. Reading and writing a DataFrame

① Reading column information:

data = {'city': ['ShenYang', 'NanJing', 'GuangZhou', 'WuHan', 'ChengDu', 'XiAn'],
        'AreaCode': [24, 25, 20, 27, 28, 29],
        'GDP': [2412.2, 5488.73, 9891.48, 6019.08, 6111.4, 3304.08]}
dfame = DataFrame(data, columns=['city', 'AreaCode', 'GDP'],
                  index=['line1', 'line2', 'line3', 'line4', 'line5', 'line6'])
print(dfame.columns)

Output:

Index([u'city', u'AreaCode', u'GDP'], dtype='object')

A column of a DataFrame can be read with dfame['AreaCode'],
or as an attribute, for example dfame.city:

data = {'city': ['ShenYang', 'NanJing', 'GuangZhou', 'WuHan', 'ChengDu', 'XiAn'],
        'AreaCode': [24, 25, 20, 27, 28, 29],
        'GDP': [2412.2, 5488.73, 9891.48, 6019.08, 6111.4, 3304.08]}
dfame = DataFrame(data, columns=['city', 'AreaCode', 'GDP', 'Population'],
                  index=['line1', 'line2', 'line3', 'line4', 'line5', 'line6'])
print(dfame['AreaCode'])  # access by key
print(dfame.city)         # access by attribute

Output:

line1    24
line2    25
line3    20
line4    27
line5    28
line6    29
Name: AreaCode, dtype: int64
line1     ShenYang
line2      NanJing
line3    GuangZhou
line4        WuHan
line5      ChengDu
line6         XiAn
Name: city, dtype: object
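The two column-access forms above return the same Series. Rows are read differently; the sketch below assumes a pandas version that provides the loc (label-based) and iloc (position-based) indexers, and uses a three-row subset of the data.

```python
import pandas as pd

dfame = pd.DataFrame(
    {"city": ["ShenYang", "NanJing", "GuangZhou"],
     "AreaCode": [24, 25, 20],
     "GDP": [2412.2, 5488.73, 9891.48]},
    columns=["city", "AreaCode", "GDP"],
    index=["line1", "line2", "line3"],
)

by_key = dfame["AreaCode"]        # column by key
by_attr = dfame.AreaCode          # same column by attribute
row = dfame.loc["line2"]          # one row by label, as a Series
first = dfame.iloc[0]             # one row by position
cell = dfame.loc["line3", "GDP"]  # a single value by row and column label

print(row)
print(cell)
```

Attribute access only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute, so the key form is the more general one.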

The values attribute returns the data as a two-dimensional array, without the row and column labels:

data = {'city': ['ShenYang', 'NanJing', 'GuangZhou', 'WuHan', 'ChengDu', 'XiAn'],
        'AreaCode': [24, 25, 20, 27, 28, 29],
        'GDP': [2412.2, 5488.73, 9891.48, 6019.08, 6111.4, 3304.08]}
dfame = DataFrame(data, columns=['city', 'AreaCode', 'GDP'])
print(dfame.values)

Output:

[['ShenYang' 24L 2412.2]
 ['NanJing' 25L 5488.73]
 ['GuangZhou' 20L 9891.48]
 ['WuHan' 27L 6019.08]
 ['ChengDu' 28L 6111.4]
 ['XiAn' 29L 3304.08]]

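② Modifying a column by direct assignment:

Assigning to a column replaces the values of the whole column:

# dfame is the DataFrame from the column-access example above, with the
# 'Population' column and the index line1..line6
dfame['Population'] = 7000000  # a scalar is broadcast to every row
print(dfame)
dfame['Population'] = [111, 222, 333, 444, 555, 666]  # length must match the row count
print(dfame)

Output:

            city  AreaCode      GDP  Population
line1   ShenYang        24  2412.20     7000000
line2    NanJing        25  5488.73     7000000
line3  GuangZhou        20  9891.48     7000000
line4      WuHan        27  6019.08     7000000
line5    ChengDu        28  6111.40     7000000
line6       XiAn        29  3304.08     7000000
            city  AreaCode      GDP  Population
line1   ShenYang        24  2412.20         111
line2    NanJing        25  5488.73         222
line3  GuangZhou        20  9891.48         333
line4      WuHan        27  6019.08         444
line5    ChengDu        28  6111.40         555
line6       XiAn        29  3304.08         666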

③ Modifying a column with a NumPy array:

import numpy as np

dfame['Population'] = np.arange(100, 700, 100)  # 100, 200, ..., 600
print(dfame)

Output:

            city  AreaCode      GDP  Population
line1   ShenYang        24  2412.20         100
line2    NanJing        25  5488.73         200
line3  GuangZhou        20  9891.48         300
line4      WuHan        27  6019.08         400
line5    ChengDu        28  6111.40         500
line6       XiAn        29  3304.08         600

④ Modifying a column with a Series:

Assigning a Series sets values by index label, so individual rows of a DataFrame column can be set; rows whose label is not in the Series get NaN.

data = {'city': ['ShenYang', 'NanJing', 'GuangZhou', 'WuHan', 'ChengDu', 'XiAn'],
        'AreaCode': [24, 25, 20, 27, 28, 29],
        'GDP': [2412.2, 5488.73, 9891.48, 6019.08, 6111.4, 3304.08]}
dfame = DataFrame(data, columns=['city', 'AreaCode', 'GDP', 'Population'],
                  index=['line1', 'line2', 'line3', 'line4', 'line5', 'line6'])
# 'line2' is missing from the Series, so that row becomes NaN
ser = Series([111, 333, 444, 555, 666], index=['line1', 'line3', 'line4', 'line5', 'line6'])
dfame['Population'] = ser
print(dfame)

Output:

            city  AreaCode      GDP  Population
line1   ShenYang        24  2412.20       111.0
line2    NanJing        25  5488.73         NaN
line3  GuangZhou        20  9891.48       333.0
line4      WuHan        27  6019.08       444.0
line5    ChengDu        28  6111.40       555.0
line6       XiAn        29  3304.08       666.0

⑤ Adding a new column:

Assigning to a column name that does not exist yet creates the column:

data = {'city': ['ShenYang', 'NanJing', 'GuangZhou', 'WuHan', 'ChengDu', 'XiAn'],
        'AreaCode': [24, 25, 20, 27, 28, 29],
        'GDP': [2412.2, 5488.73, 9891.48, 6019.08, 6111.4, 3304.08]}
dfame = DataFrame(data, columns=['city', 'AreaCode', 'GDP'],
                  index=['line1', 'line2', 'line3', 'line4', 'line5', 'line6'])
dfame['Temperature'] = [-2, 10, 30, 18, 25, 5]
print(dfame)

Output:

            city  AreaCode      GDP  Temperature
line1   ShenYang        24  2412.20           -2
line2    NanJing        25  5488.73           10
line3  GuangZhou        20  9891.48           30
line4      WuHan        27  6019.08           18
line5    ChengDu        28  6111.40           25
line6       XiAn        29  3304.08            5

3. DataFrame operations

① Transposing a DataFrame:

As with a matrix transpose, rows and columns are swapped.

data = {'city': ['ShenYang', 'NanJing', 'GuangZhou', 'WuHan', 'ChengDu', 'XiAn'],
        'AreaCode': [24, 25, 20, 27, 28, 29],
        'GDP': [2412.2, 5488.73, 9891.48, 6019.08, 6111.4, 3304.08]}
dfame = DataFrame(data, columns=['city', 'AreaCode', 'GDP'],
                  index=['line1', 'line2', 'line3', 'line4', 'line5', 'line6'])
print(dfame.T)

Output:

             line1    line2      line3    line4    line5    line6
city      ShenYang  NanJing  GuangZhou    WuHan  ChengDu     XiAn
AreaCode        24       25         20       27       28       29
GDP         2412.2  5488.73    9891.48  6019.08   6111.4  3304.08

② Slicing a DataFrame:

data = {'city': ['ShenYang', 'NanJing', 'GuangZhou', 'WuHan', 'ChengDu', 'XiAn'],
        'AreaCode': [24, 25, 20, 27, 28, 29],
        'GDP': [2412.2, 5488.73, 9891.48, 6019.08, 6111.4, 3304.08]}
dfame = DataFrame(data, columns=['city', 'AreaCode', 'GDP'])
print(dfame['city'][2:6])  # rows at positions 2 to 5 of the 'city' column

Output:

2    GuangZhou
3        WuHan
4      ChengDu
5         XiAn
Name: city, dtype: object
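Slicing by position and slicing by label treat the end point differently. The sketch below shows this, assuming a pandas version that provides the iloc and loc indexers:

```python
import pandas as pd

data = {"city": ["ShenYang", "NanJing", "GuangZhou", "WuHan", "ChengDu", "XiAn"],
        "AreaCode": [24, 25, 20, 27, 28, 29],
        "GDP": [2412.2, 5488.73, 9891.48, 6019.08, 6111.4, 3304.08]}
dfame = pd.DataFrame(data, columns=["city", "AreaCode", "GDP"])

by_position = dfame["city"].iloc[2:4]  # positions 2 and 3; the end is excluded
by_label = dfame["city"].loc[2:4]      # labels 2, 3 and 4; the end is included
rows = dfame.iloc[1:3]                 # whole rows at positions 1 and 2

print(by_position)
print(by_label)
print(rows)
```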

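Index objects

pandas Index objects hold the axis labels and other metadata (such as the axis name). Any array or other sequence of labels passed when constructing a Series or DataFrame is converted to an Index.
Index objects are immutable, which makes it safe to share them among several data structures.

① Creating an Index

First, import the class: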

from pandas import Index

Create an Index directly from an array:

import numpy as np

index = Index(np.arange(5))
print(index)

Output:

Int64Index([0, 1, 2, 3, 4], dtype='int64')

The Index created this way can be used as the index of a Series.
The expression ser.index is index tells you whether the two are the same object:

index = Index(np.arange(5))
ser = Series(['one', 'two', 'three', 'four', 'five'], index=index)
print(ser)
print(ser.index is index)

Output:

0      one
1      two
2    three
3     four
4     five
dtype: object
True
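The immutability mentioned earlier can be checked directly: assigning to a single element of an Index raises a TypeError. A minimal sketch:

```python
import numpy as np
import pandas as pd

index = pd.Index(np.arange(5))

# Element assignment is rejected because Index objects are immutable.
try:
    index[1] = 99
    mutated = True
except TypeError:
    mutated = False

print(mutated)      # False: the assignment was refused
print(list(index))  # the labels are unchanged
```

Replacing the whole index of a Series (ser.index = [...]) is still allowed, since that swaps in a new Index object rather than changing an existing one.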

② Getting the Index

ser = Series(range(5), index=['one', 'two', 'three', 'four', 'five'])
index = ser.index
print(index)
print(index[2:5])  # slicing an Index returns a new Index

Output:

Index([u'one', u'two', u'three', u'four', u'five'], dtype='object')
Index([u'three', u'four', u'five'], dtype='object')

③ Checking whether a label exists in an index

data = {'ShenYang': {'AreaCode': 24, 'GDP': 2412.2},
        'NanJing': {'AreaCode': 25, 'GDP': 5488.73},
        'GuangZhou': {'AreaCode': 20, 'GDP': 9891.48},
        'WuHan': {'AreaCode': 27, 'GDP': 6019.08}}
dfame = DataFrame(data)
print(dfame)
print('WuHan' in dfame.columns)  # column labels
print('GDP' in dfame.index)      # row labels

Output:

          GuangZhou  NanJing  ShenYang    WuHan
AreaCode      20.00    25.00      24.0    27.00
GDP         9891.48  5488.73    2412.2  6019.08
True
True

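④ Index methods and properties:

1) append: concatenates another Index object, producing a new Index;
2) diff: computes the set difference (named difference in current pandas);
3) union: computes the set union;
4) isin: computes a boolean array indicating whether each value is contained in the passed collection;
5) delete: returns a new Index with the element at the given position removed;
6) drop: returns a new Index with the passed values removed;
7) insert: returns a new Index with an element inserted at the given position;
8) unique: computes the array of unique values in the Index;
9) is_monotonic: returns True if each element is greater than or equal to the previous element (is_monotonic_increasing in current pandas);
10) is_unique: returns True if the Index has no duplicate values.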
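The methods listed above can be tried out on two small Index objects. This sketch uses the names found in current pandas (difference rather than diff, is_monotonic_increasing rather than is_monotonic) and also shows intersection, which the list does not mention; the labels are made-up sample data.

```python
import pandas as pd

a = pd.Index(["one", "two", "three"])
b = pd.Index(["three", "four"])

appended = a.append(b)          # concatenation; duplicates are kept
difference = a.difference(b)    # labels in a but not in b
union = a.union(b)              # labels in a or b
intersection = a.intersection(b)  # labels in both a and b
mask = a.isin(["one", "three"])   # boolean array, one entry per label
deleted = a.delete(0)           # drop the element at position 0
dropped = a.drop("two")         # drop the label 'two'
inserted = a.insert(1, "x")     # insert 'x' at position 1
uniques = appended.unique()     # distinct labels, in order of appearance

nums = pd.Index([1, 2, 2, 3])
print(nums.is_monotonic_increasing)  # each element >= the previous one
print(nums.is_unique)                # False: 2 appears twice
print(list(inserted))
```

Every call returns a new object; a and b themselves are never modified.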

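The main Index classes in pandas:

1) Index: the most general Index object, representing axis labels in a NumPy array of Python objects;
2) Int64Index: a specialized Index for integer values;
3) MultiIndex: a "hierarchical" index object representing multiple levels of indexing on a single axis; it can be thought of as an array of tuples;
4) DatetimeIndex: stores nanosecond-resolution timestamps;
5) PeriodIndex: a specialized Index for Period data.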
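As a rough illustration of three of these classes, the sketch below builds a MultiIndex, a DatetimeIndex and a PeriodIndex. The labels, level names and dates are arbitrary sample values chosen for the example.

```python
import pandas as pd

# MultiIndex: each label is a tuple, one entry per level.
multi = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("b", 1)], names=["outer", "inner"]
)
ser = pd.Series([10, 20, 30], index=multi)
outer_a = ser.loc["a"]  # select everything under the outer label 'a'

# DatetimeIndex: three consecutive days.
dates = pd.date_range("2017-01-01", periods=3, freq="D")

# PeriodIndex: three consecutive months.
periods = pd.period_range("2017-01", periods=3, freq="M")

print(ser)
print(outer_a)
print(dates)
print(periods)
```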

Reading a CSV file with pandas

Generate the CSV file:

#!/usr/bin/python
# -*- coding: utf-8 -*-
# Write data to a CSV file
import csv

# Column names for the header row
title = ['ID', 'age', 'job', 'house', 'credit', 'class']
data = [[1,  'youth',      'no',  'no',  'fair',      'no' ],
        [2,  'youth',      'no',  'no',  'good',      'no' ],
        [3,  'youth',      'yes', 'no',  'good',      'yes'],
        [4,  'youth',      'yes', 'yes', 'fair',      'yes'],
        [5,  'youth',      'no',  'no',  'fair',      'no' ],
        [6,  'middle_age', 'no',  'no',  'fair',      'no' ],
        [7,  'middle_age', 'no',  'no',  'good',      'no' ],
        [8,  'middle_age', 'yes', 'yes', 'good',      'yes'],
        [9,  'middle_age', 'no',  'yes', 'excellent', 'yes'],
        [10, 'middle_age', 'no',  'yes', 'excellent', 'yes'],
        [11, 'senior',     'no',  'yes', 'excellent', 'yes'],
        [12, 'senior',     'no',  'yes', 'good',      'yes'],
        [13, 'senior',     'yes', 'no',  'good',      'yes'],
        [14, 'senior',     'yes', 'no',  'excellent', 'yes'],
        [15, 'senior',     'no',  'no',  'fair',      'no' ]]

# Python 2: the csv module expects a file opened in binary mode.
# (On Python 3, use open('credit.csv', 'w', newline='') instead.)
with open('credit.csv', 'wb') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(title)  # header row
    for row in data:        # data rows
        writer.writerow(row)

Read the CSV file with pandas:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import pandas as pd

p = pd.read_csv('credit.csv')  # the first row is used as the column names
print(p)

Output:

    ID         age  job house     credit class
0    1       youth   no    no       fair    no
1    2       youth   no    no       good    no
2    3       youth  yes    no       good   yes
3    4       youth  yes   yes       fair   yes
4    5       youth   no    no       fair    no
5    6  middle_age   no    no       fair    no
6    7  middle_age   no    no       good    no
7    8  middle_age  yes   yes       good   yes
8    9  middle_age   no   yes  excellent   yes
9   10  middle_age   no   yes  excellent   yes
10  11      senior   no   yes  excellent   yes
11  12      senior   no   yes       good   yes
12  13      senior  yes    no       good   yes
13  14      senior  yes    no  excellent   yes
14  15      senior   no    no       fair    no
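read_csv accepts options that shape the result while loading. The sketch below reads the first three rows of the same data from an in-memory string (so it runs without credit.csv), uses the ID column as the index, and loads only selected columns.

```python
import io

import pandas as pd

# First three data rows of credit.csv, held in memory for the example.
csv_text = (
    "ID,age,job,house,credit,class\n"
    "1,youth,no,no,fair,no\n"
    "2,youth,no,no,good,no\n"
    "3,youth,yes,no,good,yes\n"
)

p = pd.read_csv(
    io.StringIO(csv_text),
    index_col="ID",                  # use the ID column as the row index
    usecols=["ID", "age", "class"],  # load only these columns
)
approved = p[p["class"] == "yes"]    # boolean filtering works as for a Series

print(p)
print(approved)
```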
