Learning Python (12): The pandas Library, Part 1
Source: Internet  Editor: 程序博客网  Time: 2024/06/05 16:20
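Core features of pandas:
(1) Data structures with automatic or explicit data alignment along an axis;
(2) Integrated time series functionality;
(3) The same data structures handle both time series and non-time-series data;
(4) Arithmetic operations and reductions (such as summing along an axis) that take the metadata (axis labels) into account;
(5) Flexible handling of missing data;
(6) Merges and other relational operations found in common databases (SQL-based ones, for example).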
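pandas data structures:
1. Creating a Series
A Series is a one-dimensional array-like object made up of a set of data (of any NumPy data type) and an associated set of data labels (its index).
When printed, a Series shows the index on the left and the values on the right.
First, import the class:
from pandas import Series
① Creating a Series from a list or array
(The default index starts at 0, like the positions of a one-dimensional array.)
ser1=Series([111,222,333,-444])
print ser1
print ser1.values
print ser1.index
Output:
0    111
1    222
2    333
3   -444
dtype: int64
[ 111  222  333 -444]
RangeIndex(start=0, stop=4, step=1)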
② Specifying the index of a Series
(With an explicit index, a Series stores key-value pairs much like a dict.)
ser2=Series([10,20,30,40],index=['fir','sec','thi','fou'])
print ser2
print ser2.index
Output:
fir    10
sec    20
thi    30
fou    40
dtype: int64
Index([u'fir', u'sec', u'thi', u'fou'], dtype='object')
③ Creating a Series from a dict
d={'ShangHai':21,'TianJin':22,'ChongQing':23,'ShenYang':24,'NanJing':25,
   'GuangZhou':20,'WuHan':27,'ChengDu':28,'XiAn':29}
print d
ser3=Series(d)
print ser3
Output:
{'WuHan': 27, 'GuangZhou': 20, 'ShenYang': 24, 'ChengDu': 28, 'TianJin': 22, 'ShangHai': 21, 'XiAn': 29, 'NanJing': 25, 'ChongQing': 23}
ChengDu      28
ChongQing    23
GuangZhou    20
NanJing      25
ShangHai     21
ShenYang     24
TianJin      22
WuHan        27
XiAn         29
dtype: int64
④ When a Series is built from a dict and an index is also given, index labels with no matching key get the value NaN (not a number). Because NaN is a floating-point value, the dtype becomes float64:
d={'ShangHai':21,'TianJin':22,'ChongQing':23,'ShenYang':24,'NanJing':25,
   'GuangZhou':20,'WuHan':27,'ChengDu':28,'XiAn':29}
ser3=Series(d)
city=['HaErBin','ShangHai','TianJin','ChongQing','ShenYang','NanJing',
      'GuangZhou','WuHan','ChengDu','XiAn']
ser4=Series(d,index=city)
print ser4
Output:
HaErBin       NaN
ShangHai     21.0
TianJin      22.0
ChongQing    23.0
ShenYang     24.0
NanJing      25.0
GuangZhou    20.0
WuHan        27.0
ChengDu      28.0
XiAn         29.0
dtype: float64
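As a small follow-up to the NaN behaviour above, here is a sketch of how the unmatched labels could be located and filled. It assumes Python 3 and a recent pandas release (unlike the Python 2 listings in this post); the shortened dictionary and the variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical, shortened version of the area-code dictionary used above.
codes = {'ShangHai': 21, 'TianJin': 22, 'ChongQing': 23}
cities = ['HaErBin', 'ShangHai', 'TianJin', 'ChongQing']

# 'HaErBin' has no key in the dict, so its value is NaN and the dtype is float64.
ser = pd.Series(codes, index=cities)

# isnull() marks the missing entries; their labels can be collected from the index.
missing = ser[ser.isnull()].index.tolist()

# fillna() replaces NaN, after which the values can be cast back to integers.
filled = ser.fillna(0).astype(int)

print(ser)
print(missing)
print(filled)
```

Whether 0 is a sensible fill value depends on the data; dropping the rows with dropna() is the other common choice.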
2. Reading and writing a Series
① Reading and writing by index label
ser2=Series([10,20,30,40],index=['fir','sec','thi','fou'])
print ser2['thi']
ser2['thi']=666
print ser2['thi']
Output:
30
666
② Selecting several index labels at once
ser2=Series([10,20,30,40],index=['fir','sec','thi','fou'])
print ser2[['fir','sec','thi']]
Output:
fir    10
sec    20
thi    30
dtype: int64
③ Reading Series elements with a boolean index:
ser2=Series([10,20,30,40],index=['fir','sec','thi','fou'])
ser2['thi']=666
print 'Elements less than 600'
print ser2[ser2<600]
Output:
Elements less than 600
fir    10
sec    20
fou    40
dtype: int64
④ Checking whether an index label exists
This works like testing for a key in a dict: the result is True if the label exists and False otherwise.
ser2=Series([10,20,30,40],index=['fir','sec','thi','fou'])
print 'thi' in ser2
print 'no' in ser2
Output:
True
False
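The four access patterns above can be put side by side in one short sketch. It is written for Python 3 and a recent pandas (an assumption; the listings above are Python 2), and it uses .loc for label access, which the post itself does not use.

```python
import pandas as pd

ser = pd.Series([10, 20, 30, 40], index=['fir', 'sec', 'thi', 'fou'])

ser.loc['thi'] = 666                # write a single value by label
subset = ser.loc[['fir', 'sec']]    # read several labels at once
small = ser[ser < 600]              # a boolean mask keeps only the matching rows

# `in` tests the index labels, not the stored values.
has_thi = 'thi' in ser
has_666 = 666 in ser

print(subset)
print(small)
print(has_thi, has_666)
```

The last line is the detail that is easy to miss: 666 is stored in the Series, yet the membership test fails because it only looks at labels.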
3. Series arithmetic
When two Series are added or subtracted, values with the same index label are combined; a label with no counterpart in the other Series yields the missing value NaN:
d={'ShenYang':24,'NanJing':25,'GuangZhou':20,'WuHan':27,'ChengDu':28,'XiAn':29}
ser3=Series(d)
city=['HaErBin','ShenYang','NanJing','GuangZhou','WuHan','ChengDu','XiAn']
ser4=Series(d,index=city)
print ser4+ser3
print ser4-0.5*ser3
Output:
ChengDu      56.0
GuangZhou    40.0
HaErBin       NaN
NanJing      50.0
ShenYang     48.0
WuHan        54.0
XiAn         58.0
dtype: float64
ChengDu      14.0
GuangZhou    10.0
HaErBin       NaN
NanJing      12.5
ShenYang     12.0
WuHan        13.5
XiAn         14.5
dtype: float64
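If NaN is not the result you want for unmatched labels, the add method accepts a fill_value. The sketch below assumes Python 3 and a recent pandas, and uses two small made-up Series rather than the ones above.

```python
import pandas as pd

a = pd.Series({'ShenYang': 24, 'NanJing': 25})
b = pd.Series({'NanJing': 25, 'WuHan': 27})

# The + operator aligns on labels; labels present on one side only give NaN.
plain = a + b

# add() with fill_value=0 treats the missing operand as 0 instead.
filled = a.add(b, fill_value=0)

print(plain)
print(filled)
```

sub, mul and div take the same fill_value argument.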
4. A Series and its index can be given names:
Names make the code and its output easier to read.
d={'ShenYang':24,'NanJing':25,'GuangZhou':20,'WuHan':27,'ChengDu':28,'XiAn':29}
ser3=Series(d)
# city=['HaErBin','ShenYang','NanJing','GuangZhou','WuHan','ChengDu','XiAn']
ser3.name='area_code'
ser3.index.name='city name'
print ser3
print ser3.index
Output:
city name
ChengDu      28
GuangZhou    20
NanJing      25
ShenYang     24
WuHan        27
XiAn         29
Name: area_code, dtype: int64
Index([u'ChengDu', u'GuangZhou', u'NanJing', u'ShenYang', u'WuHan', u'XiAn'], dtype='object', name=u'city name')
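5. The index can be replaced by assigning a new one:
d={'ShenYang':24,'NanJing':25,'GuangZhou':20,'WuHan':27,'ChengDu':28,'XiAn':29}
ser3=Series(d)
ser3.index=['SY','NJ','GZ','WH','CD','XA']
print ser3
Output:
SY    28
NJ    20
GZ    25
WH    24
CD    27
XA    29
dtype: int64
Note that the new labels are assigned by position. In this run the Series built from the dict is ordered by key (ChengDu, GuangZhou, NanJing, ...), so SY ends up on ChengDu's value 28. To keep labels and values matched, list the new labels in the Series' current order.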
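Since assigning a list to index works purely by position, relabelling through a mapping is one way to keep each new label tied to the right value. This is a sketch for Python 3 and a recent pandas; the abbreviations in the mapping are illustrative, and rename is a technique swapped in here, not something the post uses.

```python
import pandas as pd

ser = pd.Series({'ShenYang': 24, 'NanJing': 25, 'GuangZhou': 20})

# Hypothetical abbreviations: each old label is mapped to its new label,
# so the result does not depend on the order of the Series.
short = {'ShenYang': 'SY', 'NanJing': 'NJ', 'GuangZhou': 'GZ'}
renamed = ser.rename(index=short)

print(renamed)
```

rename returns a new Series and leaves ser unchanged.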
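1. Constructing a DataFrame
A DataFrame is a tabular data structure containing an ordered collection of columns, each of which can have a different data type.
It has both a row index and a column index, and can be thought of as a dict of Series that share one index.
pandas combines NumPy's high-performance array computing with the flexible data manipulation of spreadsheets and relational databases (such as SQL).
First, import the class:
from pandas import DataFrame
Building a DataFrame from a dict; the keys become the column names (with a nested dict, the outer keys become the columns and the inner keys the row index):
data={'ShenYang':{'AreaCode':24,'GDP':2412.2},
      'NanJing':{'AreaCode':25,'GDP':5488.73},
      'GuangZhou':{'AreaCode':20,'GDP':9891.48},
      'WuHan':{'AreaCode':27,'GDP':6019.08}}
dfame=DataFrame(data)
print dfame
Output:
          GuangZhou  NanJing  ShenYang    WuHan
AreaCode      20.00    25.00      24.0    27.00
GDP         9891.48  5488.73    2412.2  6019.08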
Or, from a dict of equal-length lists:
data={'city':['ShenYang','NanJing','GuangZhou','WuHan','ChengDu','XiAn'],
      'AreaCode':[24,25,20,27,28,29],
      'GDP':[2412.2,5488.73,9891.48,6019.08,6111.4,3304.08]}
dfame=DataFrame(data)
print dfame
Output:
   AreaCode      GDP       city
0        24  2412.20   ShenYang
1        25  5488.73    NanJing
2        20  9891.48  GuangZhou
3        27  6019.08      WuHan
4        28  6111.40    ChengDu
5        29  3304.08       XiAn
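As you can see, dict keys are unordered in this Python 2 environment, so the column order is not guaranteed (the input order was 'city', 'AreaCode', 'GDP', but the output order is AreaCode, GDP, city). Python 3.7 and later keep dict insertion order, and recent pandas versions preserve it.
When the column order matters, pass it explicitly through the columns argument.
data={'city':['ShenYang','NanJing','GuangZhou','WuHan','ChengDu','XiAn'],
      'AreaCode':[24,25,20,27,28,29],
      'GDP':[2412.2,5488.73,9891.48,6019.08,6111.4,3304.08]}
dfame=DataFrame(data,columns=['city','AreaCode','GDP'])
print dfame
Output:
        city  AreaCode      GDP
0   ShenYang        24  2412.20
1    NanJing        25  5488.73
2  GuangZhou        20  9891.48
3      WuHan        27  6019.08
4    ChengDu        28  6111.40
5       XiAn        29  3304.08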
If a requested column does not exist in data, it is filled entirely with NaN:
data={'city':['ShenYang','NanJing','GuangZhou','WuHan','ChengDu','XiAn'],
      'AreaCode':[24,25,20,27,28,29],
      'GDP':[2412.2,5488.73,9891.48,6019.08,6111.4,3304.08]}
dfame=DataFrame(data,columns=['city','AreaCode','GDP','Population'])
print dfame
Output:
        city  AreaCode      GDP  Population
0   ShenYang        24  2412.20         NaN
1    NanJing        25  5488.73         NaN
2  GuangZhou        20  9891.48         NaN
3      WuHan        27  6019.08         NaN
4    ChengDu        28  6111.40         NaN
5       XiAn        29  3304.08         NaN
If only some of the columns are listed, only those columns appear in the result:
data={'city':['ShenYang','NanJing','GuangZhou','WuHan','ChengDu','XiAn'],
      'AreaCode':[24,25,20,27,28,29],
      'GDP':[2412.2,5488.73,9891.48,6019.08,6111.4,3304.08]}
dfame=DataFrame(data,columns=['city','GDP','Population'])
print dfame
Output:
        city      GDP  Population
0   ShenYang  2412.20         NaN
1    NanJing  5488.73         NaN
2  GuangZhou  9891.48         NaN
3      WuHan  6019.08         NaN
4    ChengDu  6111.40         NaN
5       XiAn  3304.08         NaN
You can also specify the DataFrame's index (the default is 0, 1, 2, 3, ...):
data={'city':['ShenYang','NanJing','GuangZhou','WuHan','ChengDu','XiAn'],
      'AreaCode':[24,25,20,27,28,29],
      'GDP':[2412.2,5488.73,9891.48,6019.08,6111.4,3304.08]}
dfame=DataFrame(data,columns=['city','AreaCode','GDP','Population'],
                index=['line1','line2','line3','line4','line5','line6'])
print dfame
Output:
            city  AreaCode      GDP  Population
line1   ShenYang        24  2412.20         NaN
line2    NanJing        25  5488.73         NaN
line3  GuangZhou        20  9891.48         NaN
line4      WuHan        27  6019.08         NaN
line5    ChengDu        28  6111.40         NaN
line6       XiAn        29  3304.08         NaN
The index and the columns can also be given names, through dfame.index.name='line' and dfame.columns.name='brief':
data={'city':['ShenYang','NanJing','GuangZhou','WuHan','ChengDu','XiAn'],
      'AreaCode':[24,25,20,27,28,29],
      'GDP':[2412.2,5488.73,9891.48,6019.08,6111.4,3304.08]}
dfame=DataFrame(data,columns=['city','AreaCode','GDP'])
dfame.index.name='line'
dfame.columns.name='brief'
print dfame
Output:
brief       city  AreaCode      GDP
line
0       ShenYang        24  2412.20
1        NanJing        25  5488.73
2      GuangZhou        20  9891.48
3          WuHan        27  6019.08
4        ChengDu        28  6111.40
5           XiAn        29  3304.08
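The constructor options covered in this section (columns, index, and the two name attributes) can be combined in a single call. The following is a minimal sketch for Python 3 and a recent pandas, using a three-row subset of the data above; the variable name df is arbitrary.

```python
import pandas as pd

data = {
    'city': ['ShenYang', 'NanJing', 'GuangZhou'],
    'AreaCode': [24, 25, 20],
    'GDP': [2412.2, 5488.73, 9891.48],
}

# columns selects and orders the columns; 'Population' is not in data,
# so that column is created and filled with NaN.
df = pd.DataFrame(
    data,
    columns=['city', 'GDP', 'Population'],
    index=['line1', 'line2', 'line3'],
)
df.index.name = 'line'
df.columns.name = 'brief'

print(df)
```

'AreaCode' is left out simply because it was not listed in columns.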
2. Reading and writing a DataFrame
① Reading column information:
data={'city':['ShenYang','NanJing','GuangZhou','WuHan','ChengDu','XiAn'],
      'AreaCode':[24,25,20,27,28,29],
      'GDP':[2412.2,5488.73,9891.48,6019.08,6111.4,3304.08]}
dfame=DataFrame(data,columns=['city','AreaCode','GDP'],
                index=['line1','line2','line3','line4','line5','line6'])
print dfame.columns
Output:
Index([u'city', u'AreaCode', u'GDP'], dtype='object')
A column can be read with dfame['AreaCode'],
or as an attribute, for example dfame.city:
data={'city':['ShenYang','NanJing','GuangZhou','WuHan','ChengDu','XiAn'],
      'AreaCode':[24,25,20,27,28,29],
      'GDP':[2412.2,5488.73,9891.48,6019.08,6111.4,3304.08]}
dfame=DataFrame(data,columns=['city','AreaCode','GDP','Population'],
                index=['line1','line2','line3','line4','line5','line6'])
print dfame['AreaCode']
print dfame.city
Output:
line1    24
line2    25
line3    20
line4    27
line5    28
line6    29
Name: AreaCode, dtype: int64
line1     ShenYang
line2      NanJing
line3    GuangZhou
line4        WuHan
line5      ChengDu
line6         XiAn
Name: city, dtype: object
The values attribute returns the data as a two-dimensional array, without the row and column labels:
data={'city':['ShenYang','NanJing','GuangZhou','WuHan','ChengDu','XiAn'],
      'AreaCode':[24,25,20,27,28,29],
      'GDP':[2412.2,5488.73,9891.48,6019.08,6111.4,3304.08]}
dfame=DataFrame(data,columns=['city','AreaCode','GDP'])
print dfame.values
Output:
[['ShenYang' 24L 2412.2]
 ['NanJing' 25L 5488.73]
 ['GuangZhou' 20L 9891.48]
 ['WuHan' 27L 6019.08]
 ['ChengDu' 28L 6111.4]
 ['XiAn' 29L 3304.08]]
② Modifying a column by direct assignment:
Assigning a scalar or a list overwrites the whole column:
# dfame is the DataFrame with a 'Population' column and index line1..line6 built earlier
dfame['Population']=7000000
print dfame
dfame['Population']=[111,222,333,444,555,666]
print dfame
Output:
            city  AreaCode      GDP  Population
line1   ShenYang        24  2412.20     7000000
line2    NanJing        25  5488.73     7000000
line3  GuangZhou        20  9891.48     7000000
line4      WuHan        27  6019.08     7000000
line5    ChengDu        28  6111.40     7000000
line6       XiAn        29  3304.08     7000000
            city  AreaCode      GDP  Population
line1   ShenYang        24  2412.20         111
line2    NanJing        25  5488.73         222
line3  GuangZhou        20  9891.48         333
line4      WuHan        27  6019.08         444
line5    ChengDu        28  6111.40         555
line6       XiAn        29  3304.08         666
③ Modifying a column with a NumPy array:
import numpy as np
dfame['Population']=np.arange(100,700,100)
print dfame
Output:
            city  AreaCode      GDP  Population
line1   ShenYang        24  2412.20         100
line2    NanJing        25  5488.73         200
line3  GuangZhou        20  9891.48         300
line4      WuHan        27  6019.08         400
line5    ChengDu        28  6111.40         500
line6       XiAn        29  3304.08         600
④ Modifying a column with a Series:
A Series carries both the index labels to set and their values, so it can assign individual rows of a DataFrame column; rows whose label is missing from the Series become NaN.
data={'city':['ShenYang','NanJing','GuangZhou','WuHan','ChengDu','XiAn'],
      'AreaCode':[24,25,20,27,28,29],
      'GDP':[2412.2,5488.73,9891.48,6019.08,6111.4,3304.08]}
dfame=DataFrame(data,columns=['city','AreaCode','GDP','Population'],
                index=['line1','line2','line3','line4','line5','line6'])
ser=Series([111,333,444,555,666],index=['line1','line3','line4','line5','line6'])
dfame['Population']=ser
print dfame
Output:
            city  AreaCode      GDP  Population
line1   ShenYang        24  2412.20       111.0
line2    NanJing        25  5488.73         NaN
line3  GuangZhou        20  9891.48       333.0
line4      WuHan        27  6019.08       444.0
line5    ChengDu        28  6111.40       555.0
line6       XiAn        29  3304.08       666.0
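The difference between assigning a list and assigning a Series is worth seeing in isolation: a Series is matched by label, while a list is matched by position and must have the right length. A sketch under the assumption of Python 3 and a recent pandas, with a made-up three-row frame:

```python
import pandas as pd

df = pd.DataFrame({'city': ['ShenYang', 'NanJing', 'GuangZhou']},
                  index=['line1', 'line2', 'line3'])

# The Series is deliberately out of order and has no entry for 'line2'.
pop = pd.Series([333, 111], index=['line3', 'line1'])
df['Population'] = pop   # matched by label, so order does not matter

# A plain list carries no labels; its length must equal the number of rows.
try:
    df['Bad'] = [1, 2]
    length_error = False
except ValueError:
    length_error = True

print(df)
print(length_error)
```

Here 'line2' receives NaN, and the two-element list is rejected because the frame has three rows.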
⑤ Adding a new column:
Assigning to a column name that does not exist yet creates the column.
data={'city':['ShenYang','NanJing','GuangZhou','WuHan','ChengDu','XiAn'],
      'AreaCode':[24,25,20,27,28,29],
      'GDP':[2412.2,5488.73,9891.48,6019.08,6111.4,3304.08]}
dfame=DataFrame(data,columns=['city','AreaCode','GDP'],
                index=['line1','line2','line3','line4','line5','line6'])
dfame['Temperature']=[-2,10,30,18,25,5]
print dfame
Output:
            city  AreaCode      GDP  Temperature
line1   ShenYang        24  2412.20           -2
line2    NanJing        25  5488.73           10
line3  GuangZhou        20  9891.48           30
line4      WuHan        27  6019.08           18
line5    ChengDu        28  6111.40           25
line6       XiAn        29  3304.08            5
3. DataFrame operations
① Transposing a DataFrame:
As with a matrix transpose, rows and columns are swapped.
data={'city':['ShenYang','NanJing','GuangZhou','WuHan','ChengDu','XiAn'],
      'AreaCode':[24,25,20,27,28,29],
      'GDP':[2412.2,5488.73,9891.48,6019.08,6111.4,3304.08]}
dfame=DataFrame(data,columns=['city','AreaCode','GDP'],
                index=['line1','line2','line3','line4','line5','line6'])
print dfame.T
Output:
             line1    line2      line3    line4    line5    line6
city      ShenYang  NanJing  GuangZhou    WuHan  ChengDu     XiAn
AreaCode        24       25         20       27       28       29
GDP         2412.2  5488.73    9891.48  6019.08   6111.4  3304.08
② Slicing a DataFrame:
data={'city':['ShenYang','NanJing','GuangZhou','WuHan','ChengDu','XiAn'],
      'AreaCode':[24,25,20,27,28,29],
      'GDP':[2412.2,5488.73,9891.48,6019.08,6111.4,3304.08]}
dfame=DataFrame(data,columns=['city','AreaCode','GDP'])
print dfame['city'][2:6]
Output:
2    GuangZhou
3        WuHan
4      ChengDu
5         XiAn
Name: city, dtype: object
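The same rows can be selected with the explicit indexers iloc (by position) and loc (by label). This is a sketch for Python 3 and a recent pandas, not the form used in the post; it mainly shows that a label slice includes its end point while a positional slice does not.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['ShenYang', 'NanJing', 'GuangZhou', 'WuHan', 'ChengDu', 'XiAn'],
    'AreaCode': [24, 25, 20, 27, 28, 29],
})

# Positional slice: rows at positions 2, 3, 4, 5 (the end position 6 is excluded).
by_pos = df['city'].iloc[2:6]

# Label slice on the default 0..5 index: the end label 5 is included.
by_label = df.loc[2:5, 'city']

print(by_pos)
print(by_label)
```

With the default integer index both forms return the same four cities; they differ once the index holds other labels.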
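Index objects
pandas Index objects hold the axis labels and other metadata (such as the axis name). When a Series or DataFrame is constructed, any array or other sequence of labels that is used is converted to an Index.
Index objects are immutable, which makes it safe to share one Index among several data structures.
① Creating an Index
First, import the class:
from pandas import Index
Create an Index directly from an array:
import numpy as np
index=Index(np.arange(5))
print index
Output:
Int64Index([0, 1, 2, 3, 4], dtype='int64')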
The resulting Index can be used as the index of a Series.
The expression ser.index is index tests whether the two are the same object.
index=Index(np.arange(5))
ser=Series(['one','two','three','four','five'],index=index)
print ser
print ser.index is index
Output:
0      one
1      two
2    three
3     four
4     five
dtype: object
True
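The immutability mentioned above can be checked directly: assigning to an element of an Index raises an error. The sketch assumes Python 3 and a recent pandas; it compares the labels with equals() instead of the identity test used above, since whether the Series keeps the very same object is an implementation detail I would not rely on across versions.

```python
import numpy as np
import pandas as pd

index = pd.Index(np.arange(5))
ser = pd.Series(['one', 'two', 'three', 'four', 'five'], index=index)

# An Index does not support item assignment.
try:
    index[1] = 99
    mutated = True
except TypeError:
    mutated = False

# equals() compares the labels element by element.
same_labels = ser.index.equals(index)

print(mutated, same_labels)
```

To change labels, build a new Index (or assign a new one to ser.index) rather than editing one in place.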
② Getting the Index
ser=Series(range(5),index=['one','two','three','four','five'])
index=ser.index
print index
print index[2:5]
Output:
Index([u'one', u'two', u'three', u'four', u'five'], dtype='object')
Index([u'three', u'four', u'five'], dtype='object')
③ Checking whether a label exists
data={'ShenYang':{'AreaCode':24,'GDP':2412.2},
      'NanJing':{'AreaCode':25,'GDP':5488.73},
      'GuangZhou':{'AreaCode':20,'GDP':9891.48},
      'WuHan':{'AreaCode':27,'GDP':6019.08}}
dfame=DataFrame(data)
print dfame
print 'WuHan' in dfame.columns
print 'GDP' in dfame.index
Output:
          GuangZhou  NanJing  ShenYang    WuHan
AreaCode      20.00    25.00      24.0    27.00
GDP         9891.48  5488.73    2412.2  6019.08
True
True
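④ Index methods and properties:
1) append: concatenate another Index object, producing a new Index;
2) diff: compute the set difference;
3) union: compute the union (intersection computes the set intersection);
4) isin: compute a boolean array indicating whether each value is contained in the passed collection;
5) delete: remove the element at the given position and return a new Index;
6) drop: remove the passed values and return a new Index;
7) insert: insert an element at the given position and return a new Index;
8) unique: compute the array of unique values in the Index;
9) is_monotonic: True if each element is greater than or equal to the previous element;
10) is_unique: True if the Index contains no duplicate values.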
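Most of the methods in this list can be tried on two small Index objects. The sketch targets Python 3 and a recent pandas, where two names differ from the list above: diff is called difference, and is_monotonic has been replaced by is_monotonic_increasing. The sample labels are made up.

```python
import pandas as pd

a = pd.Index(['one', 'two', 'three'])
b = pd.Index(['three', 'four'])

appended = a.append(b)            # concatenation, duplicates are kept
both = a.union(b)                 # set union
common = a.intersection(b)        # set intersection
only_a = a.difference(b)          # set difference (older releases: diff)
mask = a.isin(['one', 'four'])    # boolean array, one entry per label
deleted = a.delete(0)             # drop the element at position 0
dropped = a.drop('two')           # drop by value
inserted = a.insert(1, 'x')       # insert 'x' at position 1

print(appended.is_unique, appended.unique())
print(pd.Index([1, 2, 2, 3]).is_monotonic_increasing)
```

Every call returns a new object; a and b themselves are never modified.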
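The main Index classes in pandas:
1) Index: the most general Index object, representing axis labels as a NumPy array of Python objects;
2) Int64Index: a specialized Index for integer values;
3) MultiIndex: a "hierarchical" index object representing multiple levels of indexing on a single axis; it can be thought of as an array of tuples;
4) DatetimeIndex: stores nanosecond-resolution timestamps;
5) PeriodIndex: a specialized Index for Period data.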
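For orientation, here is one way each kind of Index might be created in Python 3 with a recent pandas. Two caveats: pandas 2.0 removed the separate Int64Index class, so an integer Index is now a plain Index with an integer dtype, and the dates and level names below are arbitrary examples.

```python
import pandas as pd

plain = pd.Index(['a', 'b', 'c'])      # general-purpose Index of Python objects
ints = pd.Index([1, 2, 3])             # integer labels

# A MultiIndex built from tuples: two levels on a single axis.
multi = pd.MultiIndex.from_tuples(
    [('a', 1), ('a', 2), ('b', 1)],
    names=['letter', 'number'],        # hypothetical level names
)

dates = pd.date_range('2017-01-01', periods=3, freq='D')    # DatetimeIndex
periods = pd.period_range('2017-01', periods=3, freq='M')   # PeriodIndex

print(multi)
print(dates)
print(periods)
```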
Reading a CSV file with pandas
Generating the CSV file:
#!/usr/bin/python
# -*-coding:utf-8-*-
# Write data to a CSV file
import csv
# Column names for the header row
title = ['ID', 'age', 'job', 'house', 'credit', 'class']
data = [[1, 'youth', 'no', 'no', 'fair', 'no' ],
        [2, 'youth', 'no', 'no', 'good', 'no' ],
        [3, 'youth', 'yes', 'no', 'good', 'yes'],
        [4, 'youth', 'yes', 'yes', 'fair', 'yes'],
        [5, 'youth', 'no', 'no', 'fair', 'no' ],
        [6, 'middle_age', 'no', 'no', 'fair', 'no' ],
        [7, 'middle_age', 'no', 'no', 'good', 'no' ],
        [8, 'middle_age', 'yes', 'yes', 'good', 'yes'],
        [9, 'middle_age', 'no', 'yes', 'excellent', 'yes'],
        [10, 'middle_age', 'no', 'yes', 'excellent', 'yes'],
        [11, 'senior', 'no', 'yes', 'excellent', 'yes'],
        [12, 'senior', 'no', 'yes', 'good', 'yes'],
        [13, 'senior', 'yes', 'no', 'good', 'yes'],
        [14, 'senior', 'yes', 'no', 'excellent', 'yes'],
        [15, 'senior', 'no', 'no', 'fair', 'no' ],
]
# Open in 'wb' mode, as the Python 2 csv module expects; the with block
# closes the file so that everything is flushed before it is read back
with open('credit.csv', 'wb') as f:
    writer = csv.writer(f)
    # Write the header row, then the data rows
    writer.writerow(title)
    for row in data:
        writer.writerow(row)
Reading the CSV file with pandas:
#!/usr/bin/python
# -*-coding:utf-8-*-
import pandas as pd
p = pd.read_csv('credit.csv')
print p
Output:
    ID         age  job  house     credit  class
0    1       youth   no     no       fair     no
1    2       youth   no     no       good     no
2    3       youth  yes     no       good    yes
3    4       youth  yes    yes       fair    yes
4    5       youth   no     no       fair     no
5    6  middle_age   no     no       fair     no
6    7  middle_age   no     no       good     no
7    8  middle_age  yes    yes       good    yes
8    9  middle_age   no    yes  excellent    yes
9   10  middle_age   no    yes  excellent    yes
10  11      senior   no    yes  excellent    yes
11  12      senior   no    yes       good    yes
12  13      senior  yes     no       good    yes
13  14      senior  yes     no  excellent    yes
14  15      senior   no     no       fair     no
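read_csv also accepts file-like objects, which makes it easy to experiment without touching the disk. The sketch below (Python 3, recent pandas) parses a three-row excerpt of the table above from an in-memory string, uses the ID column as the index, and filters the rows; the variable names are arbitrary.

```python
import io
import pandas as pd

# Three rows copied from the credit table above, held in memory.
csv_text = (
    "ID,age,job,house,credit,class\n"
    "1,youth,no,no,fair,no\n"
    "3,youth,yes,no,good,yes\n"
    "8,middle_age,yes,yes,good,yes\n"
)

# index_col turns the ID column into the row index instead of 0, 1, 2, ...
df = pd.read_csv(io.StringIO(csv_text), index_col='ID')

# Boolean indexing works on the parsed columns as usual.
youth = df[df['age'] == 'youth']

# to_csv() without a path returns the CSV text as a string.
round_trip = df.to_csv()

print(df)
print(youth)
```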