Pandas Study Notes (1)

Source: Internet | Editor: 程序博客网 | Time: 2024/05/17 03:40

I. Introduction to pandas data structures
>>> from pandas import Series,DataFrame
>>> import pandas as pd
>>> import numpy as np
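1. Series
Series: a one-dimensional array-like object made up of a sequence of values (of any NumPy data type) and an associated array of data labels, called its index.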
>>> obj=Series([1,2,3,4])
# If no index is specified, an integer index from 0 to N-1 is created automatically
>>> obj
0 1
1 2
2 3
3 4
dtype: int64
>>> obj.values
array([1, 2, 3, 4])
>>> obj.index
RangeIndex(start=0, stop=4, step=1)
# NumPy-style array operations preserve the link between index labels and values
>>> obj[obj>2]
2 3
3 4
dtype: int64
>>> obj*2
0 2
1 4
2 6
3 8
dtype: int64
>>> np.exp(obj)
0 2.718282
1 7.389056
2 20.085537
3 54.598150
dtype: float64
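The same operations work with an explicit, non-integer index. The following is a minimal sketch; the values and the labels 'a' to 'd' are made up for illustration and do not come from the notes above.

```python
import pandas as pd

# Hypothetical data: values and labels are invented for this sketch.
s = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

single = s["a"]                # select one value by label: -5
subset = s[["c", "a", "d"]]    # a list of labels returns a Series in that order
positive = s[s > 0]            # boolean mask; surviving values keep their labels
doubled = s * 2                # arithmetic returns a new Series with the same index

print(positive)
print(list(doubled.index))
```

Whatever the operation, each value stays attached to its label, which is the main difference from a plain NumPy array.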
# If the data is already stored in a Python dict, a Series can be created directly from that dict
>>> score={"Tom":99,"Lucy":90,"John":80,"Green":58}
>>> score
{'John': 80, 'Green': 58, 'Lucy': 90, 'Tom': 99}
>>> obj_score=Series(score)
>>> obj_score
Green 58
John 80
Lucy 90
Tom 99
dtype: int64
# A Series can be viewed as a fixed-length, ordered dict, so it can be used with many functions that expect a dict
>>> "Green" in obj_score
True
>>> "yaoxq" in obj_score
False
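# Passing a collection of labels as the index (here a set, so the label order is arbitrary) returns the matching values; "NaN" marks a missing (NA) value.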
>>> name={"A","B","C","Tom"}
>>> obj_score_new=Series(obj_score,index=name)
>>> obj_score_new
A NaN
C NaN
B NaN
Tom 99.0
dtype: float64
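A list of labels gives a predictable order, and reindex produces the same result as passing index to the constructor. This is a small sketch; the label "Amy" is a made-up name that is deliberately absent from the dict.

```python
import pandas as pd

score = {"Tom": 99, "Lucy": 90, "John": 80, "Green": 58}
# "Amy" is a hypothetical label that has no entry in score.
labels = ["Tom", "Amy", "Lucy"]

by_ctor = pd.Series(score, index=labels)        # look up each label in the dict
by_reindex = pd.Series(score).reindex(labels)   # same idea on an existing Series

print(by_ctor)
```

Because NaN is a floating-point value, the result is float64 even though every score in the dict is an integer.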
# isnull and notnull can be used to detect missing data
>>> pd.isnull(obj_score)
Green False
John False
Lucy False
Tom False
dtype: bool
>>> pd.isnull(obj_score_new)
A True
C True
B True
Tom False
dtype: bool
>>> pd.notnull(obj_score)
Green True
John True
Lucy True
Tom True
dtype: bool
>>> pd.notnull(obj_score_new)
A False
C False
B False
Tom True
dtype: bool
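The same checks are also available as Series methods, and the boolean result can be counted or used as a mask. A minimal sketch, with the Series rebuilt by hand so the block runs on its own:

```python
import numpy as np
import pandas as pd

# Rebuilt by hand to mirror obj_score_new from the transcript.
s = pd.Series([np.nan, np.nan, np.nan, 99.0], index=["A", "B", "C", "Tom"])

n_missing = int(s.isnull().sum())   # True counts as 1, so this counts the NaNs
present = s[s.notnull()]            # keep only the non-missing entries
filled = s.fillna(0)                # replace NaN with a default value

print(n_missing)
print(present)
```

Masking with notnull gives the same result as calling s.dropna().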
# pandas automatically aligns data by index label in arithmetic operations
>>> obj_score+obj_score_new
A NaN
B NaN
C NaN
Green NaN
John NaN
Lucy NaN
Tom 198.0
dtype: float64
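If a NaN for every unmatched label is not wanted, the add method accepts a fill_value that stands in for the missing side. A short sketch, with two small Series built by hand (the label "A" and its value are made up):

```python
import pandas as pd

a = pd.Series({"Green": 58, "John": 80, "Lucy": 90, "Tom": 99})
b = pd.Series({"Tom": 99.0, "A": 1.0})   # "A" is a hypothetical label

plain = a + b                     # NaN wherever a label exists on one side only
filled = a.add(b, fill_value=0)   # the missing side is treated as 0

print(plain)
print(filled)
```

With fill_value=0, a label present in only one Series keeps its own value instead of becoming NaN.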
# Both the Series itself and its index have a name attribute, which ties in closely with other pandas features
>>> obj_score.name = "score"
>>> obj_score.index.name = "name"
>>> obj_score
name
Green 58
John 80
Lucy 90
Tom 99
Name: score, dtype: int64
# The index of a Series can be changed in place by assignment
>>> obj_score.index=["","","",""]
>>> obj_score
58
80
90
99
Name: score, dtype: int64
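Assigning a new index relabels the Series without touching the values, but the new list must have the same length as the data. A minimal sketch; the short labels are made up for illustration:

```python
import pandas as pd

s = pd.Series([58, 80, 90, 99],
              index=["Green", "John", "Lucy", "Tom"], name="score")

s.index = ["g", "j", "l", "t"]   # hypothetical labels; values are unchanged

try:
    s.index = ["only", "three", "labels"]   # wrong length for four values
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True

# rename returns a new Series and leaves s as it was.
renamed = s.rename({"g": "Green"})

print(mismatch_rejected)
print(renamed)
```

When only a few labels need to change, rename with a mapping is usually more convenient than rebuilding the whole list.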

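2. DataFrame
A DataFrame is a tabular data structure. It holds an ordered collection of columns, each of which can have a different value type. A DataFrame has both a row index and a column index, and can be thought of as a dict of Series that all share the same index.
Internally, the data in a DataFrame is stored as one or more two-dimensional blocks.
There are many ways to create a DataFrame; the most common is to pass a dict of equal-length lists or NumPy arrays: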
>>> data={'name':['Tom','Tom','Tom','Lucy','Lucy','John'],'year':[2014,2015,2016,2015,2016,2016],'score':[80,85,90,86,88,83]}
>>> frame=DataFrame(data)
>>> frame
name score year
0 Tom 80 2014
1 Tom 85 2015
2 Tom 90 2016
3 Lucy 86 2015
4 Lucy 88 2016
5 John 83 2016
# The column order can be specified
>>> DataFrame(data,columns=['year','score','name'])
year score name
0 2014 80 Tom
1 2015 85 Tom
2 2016 90 Tom
3 2015 86 Lucy
4 2016 88 Lucy
5 2016 83 John
# A column can be retrieved as a Series either by attribute access or by dict-style indexing (its name attribute is already set)
>>> frame.year
0 2014
1 2015
2 2016
3 2015
4 2016
5 2016
Name: year, dtype: int64
>>> frame['score']
0 80
1 85
2 90
3 86
4 88
5 83
Name: score, dtype: int64
# Assigning to a column that does not exist creates a new column; del removes a column.
>>> frame['isgirl']= frame.name == 'Lucy'
>>> frame
name score year isgirl
0 Tom 80 2014 False
1 Tom 85 2015 False
2 Tom 90 2016 False
3 Lucy 86 2015 True
4 Lucy 88 2016 True
5 John 83 2016 False
>>> del frame['isgirl']
>>> frame.columns
Index([u'name', u'score', u'year'], dtype='object')
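When a Series is assigned to a column, it is aligned on the row index, and rows with no matching label are filled with NaN. The sketch below uses a smaller frame; the passed and bonus columns and their values are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame({
    "name": ["Tom", "Lucy", "John"],
    "score": [80, 86, 83],
})

# New column from a boolean Series (hypothetical pass mark of 85).
frame["passed"] = frame["score"] >= 85
# The Series is aligned on the index; row 1 has no match and becomes NaN.
frame["bonus"] = pd.Series([5, 10], index=[0, 2])

trimmed = frame.drop(columns=["passed"])   # returns a copy; frame is unchanged
del frame["bonus"]                         # removes the column in place

print(frame)
print(trimmed)
```

drop is the non-destructive alternative to del: it leaves the original frame alone and returns a new one.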

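Another way is a nested dict (a dict of dicts); the outer keys become the columns and the inner keys become the row index: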
>>> data={'Tom':{2000:80,2001:85,2002:90},'Lucy':{2001:90,2002:99},'John':{2002:100}}
>>> frame=DataFrame(data)
>>> frame
John Lucy Tom
2000 NaN NaN 80
2001 NaN 90.0 85
2002 100.0 99.0 90
# Use .T to transpose a DataFrame
>>> frame.T
2000 2001 2002
John NaN NaN 100.0
Lucy NaN 90.0 99.0
Tom 80.0 85.0 90.0
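In the example above, the keys of the inner dicts are combined and sorted to form the final index. If an index is specified explicitly, pandas uses exactly those labels instead: a label with no data is filled with NaN, and data under any other label is dropped.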
>>> frame1=DataFrame(data,index=[1999,2000,2001])
>>> frame1
John Lucy Tom
1999 NaN NaN NaN
2000 NaN NaN 80.0
2001 NaN 90.0 85.0
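The nested-dict behaviour can be checked directly. This sketch reuses the data from above and also shows DataFrame.from_dict with orient="index", which turns the outer keys into rows. Note that the column order may differ from the transcript: recent pandas releases keep the dict's insertion order, while the transcript shows sorted columns.

```python
import pandas as pd

data = {
    "Tom": {2000: 80, 2001: 85, 2002: 90},
    "Lucy": {2001: 90, 2002: 99},
    "John": {2002: 100},
}

frame = pd.DataFrame(data)                              # outer keys -> columns
by_row = pd.DataFrame.from_dict(data, orient="index")   # outer keys -> rows
subset = pd.DataFrame(data, index=[1999, 2000, 2001])   # only these row labels

print(frame)
print(subset)
```

In subset, the 1999 row is all NaN because no inner dict has that key, and the 2002 data is left out because that label was not requested.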

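3. Index objects
pandas Index objects are responsible for holding the axis labels and other metadata (such as the axis name).
When a Series or DataFrame is built, any array or other sequence of labels that is used is converted to an Index.
Index objects are immutable: their elements cannot be modified in place.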
>>> frame.index[1]
2001
>>> frame.index[1:]
Int64Index([2001, 2002], dtype='int64')
>>> frame.index[1]=2009
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/indexes/base.py", line 1245, in __set
raise TypeError("Index does not support mutable operations")
TypeError: Index does not support mutable operations
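Since an Index cannot be modified element by element, changing a label means building a new Index. One way to do that, sketched here with a small hand-built frame, is rename with a mapping:

```python
import pandas as pd

frame = pd.DataFrame({"Tom": [80, 85, 90]}, index=[2000, 2001, 2002])

try:
    frame.index[1] = 2009        # item assignment on an Index is not allowed
    mutated = True
except TypeError:
    mutated = False

# rename builds a new DataFrame with a new Index; frame itself is unchanged.
relabeled = frame.rename(index={2001: 2009})

print(mutated)
print(relabeled)
```

Immutability is what makes it safe for several Series and DataFrame objects to share one Index.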

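# Index methods and properties

Method/property   Description
append            Concatenate another Index object, producing a new Index
diff              Compute the set difference as a new Index (named difference in current pandas releases)
intersection      Compute the set intersection
union             Compute the set union
isin              Compute a boolean array indicating whether each value is contained in the passed collection
delete            Delete the element at position i, producing a new Index
drop              Delete the passed values, producing a new Index
insert            Insert an element at position i, producing a new Index
is_monotonic      Returns True if each element is greater than or equal to the previous element (is_monotonic_increasing in current pandas releases)
is_unique         Returns True if the Index has no duplicate values
unique            Returns the array of unique values in the Index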
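A quick way to get a feel for these methods is to try them on two small Index objects. This sketch assumes a recent pandas release, where the set difference is spelled difference and the monotonic check is is_monotonic_increasing; the year values are made up.

```python
import pandas as pd

a = pd.Index([2000, 2001, 2002])
b = pd.Index([2002, 2003])

appended = a.append(b)          # plain concatenation, duplicates are kept
union = a.union(b)              # set union
common = a.intersection(b)      # set intersection
only_a = a.difference(b)        # labels in a that are not in b
mask = a.isin([2001, 2002])     # boolean NumPy array, one entry per label
dropped = a.drop(2000)          # remove by value
deleted = a.delete(0)           # remove by position
inserted = a.insert(1, 1999)    # insert 1999 at position 1

print(list(appended), appended.is_unique)
print(list(inserted), inserted.is_monotonic_increasing)
```

Every call returns a new object; a and b are never modified.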

0 0