Getting Started with pandas



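from pandas import Series, DataFrame
import pandas as pd
import numpy as np

## Series: a one-dimensional array of data together with an associated array of data labels, called its index.
obj = Series([4, 7, -5, 3])
print(obj)  # index on the left, values on the right
0    4
1    7
2   -5
3    3
dtype: int64

Series attributes: values and index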
obj.values
array([ 4,  7, -5,  3], dtype=int64)
obj.index
RangeIndex(start=0, stop=4, step=1)

Setting the index:
obj2 = Series([4, 7, -5, 3], index=['d', 'e', 'f', 'g'])
print(obj2)
d    4
e    7
f   -5
g    3
dtype: int64

Getting a single value or a set of values by index label:
obj2['d']
4
obj2[['d', 'g']]
d    4
g    3
dtype: int64

NumPy-style operations keep each value attached to its index label:
obj2[obj2 > 0]
d    4
e    7
g    3
dtype: int64
obj2 * 2
d     8
e    14
f   -10
g     6
dtype: int64
import numpy as np
np.exp(obj2)
d      54.598150
e    1096.633158
f       0.006738
g      20.085537
dtype: float64

A Series can be treated like a dictionary:
'b' in obj2
False

Creating a Series from a dictionary:
zhengchu = {'name': 'zhengchu', 'age': 23, 'girlfriend': 'No'}
obj3 = Series(zhengchu)
print(obj3)
age                 23
girlfriend          No
name          zhengchu
dtype: object

NaN ("not a number") marks missing or NA values. Index labels with no matching dictionary key get NaN:
info = ['name', 'zheng', 'girlfriend', 'age']
obj4 = Series(zhengchu, index=info)
print(obj4)
name          zhengchu
zheng              NaN
girlfriend          No
age                 23
dtype: object

The pd.isnull and pd.notnull functions detect missing data:
pd.isnull(obj4)
name          False
zheng          True
girlfriend    False
age           False
dtype: bool
pd.notnull(obj4)
name           True
zheng         False
girlfriend     True
age            True
dtype: bool
# Or use the method form
obj4.isnull()
name          False
zheng          True
girlfriend    False
age           False
dtype: bool

Arithmetic operations align automatically on index labels:
obj3 + obj4
age                         46
girlfriend                NoNo
name          zhengchuzhengchu
zheng                      NaN
dtype: object

Both the Series object and its index have a name attribute:
obj4.name = 'YourName'
obj4.index.name = 'Indddx'
print(obj4)
Indddx
name          zhengchu
zheng              NaN
girlfriend          No
age                 23
Name: YourName, dtype: object

The index of a Series can be changed in place by assignment:
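obj.index = ['a', 'b', 'c', 'd']
print(obj)
a    4
b    7
c   -5
d    3
dtype: int64

# DataFrame
### A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can hold a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index; it can be thought of as a dict of Series that all share the same index. Compared with other DataFrame-like structures you may have used (such as R's data.frame), row-oriented and column-oriented operations in a DataFrame are treated roughly symmetrically. Under the hood, the data is stored as one or more two-dimensional blocks rather than as a list, dict, or other collection of one-dimensional arrays. The exact details of DataFrame's internals are beyond the scope of the book.
#### Although a DataFrame stores its data in a two-dimensional format, hierarchical indexing lets you represent higher-dimensional data in tabular form. Hierarchical indexing is the topic of a later chapter and a key ingredient in many of the more advanced data-handling features in pandas.

Initializing a DataFrame from a dictionary: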
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
frame
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002
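
The "dict of Series sharing one index" picture from the DataFrame description can be checked directly. The following is only a minimal sketch with made-up data; the names idx, year, state, and df are illustrative and do not come from the original notes.

```python
import pandas as pd

# Two Series that share the same index labels (hypothetical sample data).
idx = ["a", "b", "c"]
year = pd.Series([2000, 2001, 2002], index=idx)
state = pd.Series(["Ohio", "Ohio", "Nevada"], index=idx)

# A dict of Series becomes a DataFrame: one column per key, one shared row index.
df = pd.DataFrame({"year": year, "state": state})

col = df["year"]    # selecting a column gives back a Series
row = df.loc["b"]   # selecting a row by label also gives back a Series

print(type(col).__name__)
print(col.index.equals(df.index))   # the column carries the frame's row index
print(row["state"])
print(df.dtypes)                    # each column keeps its own dtype
```

Both the column and the row come back as Series, which is one way to see why row-oriented and column-oriented operations feel roughly symmetric.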

Specifying the column order:

DataFrame(data, columns=['year', 'state', 'pop'])
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9

If a requested column is not present in the data, it appears in the result filled with NaN:

frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'zhengchu'],
                   index=['o', 'p', 'q', 'x', 'y'])
frame2
   year   state  pop zhengchu
o  2000    Ohio  1.5      NaN
p  2001    Ohio  1.7      NaN
q  2002    Ohio  3.6      NaN
x  2001  Nevada  2.4      NaN
y  2002  Nevada  2.9      NaN
# Attribute-style lookup
frame.year
0    2000
1    2001
2    2002
3    2001
4    2002
Name: year, dtype: int64
# Dictionary-style lookup
frame['year']
0    2000
1    2001
2    2002
3    2001
4    2002
Name: year, dtype: int64
# Select a row by label (the original used frame2.ix['y']; .ix was removed in pandas 1.0, so .loc is used here)
frame2.loc['y']
year          2002
state       Nevada
pop            2.9
zhengchu       NaN
Name: y, dtype: object
# Assigning to a column replaces the NaN values
frame2['zhengchu'] = 'god '
frame2
   year   state  pop zhengchu
o  2000    Ohio  1.5     god
p  2001    Ohio  1.7     god
q  2002    Ohio  3.6     god
x  2001  Nevada  2.4     god
y  2002  Nevada  2.9     god
frame2['zhengchu'] = np.arange(5)
frame2
   year   state  pop  zhengchu
o  2000    Ohio  1.5         0
p  2001    Ohio  1.7         1
q  2002    Ohio  3.6         2
x  2001  Nevada  2.4         3
y  2002  Nevada  2.9         4
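
One caveat on the attribute-style lookup used above (frame.year): it only reaches a column when the column name does not collide with an existing DataFrame attribute. The sample data here happens to contain a column called pop, which is also the name of a DataFrame method. A minimal sketch with a small made-up frame (the variable names by_key and by_attr are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({"state": ["Ohio", "Nevada"],
                      "year": [2000, 2001],
                      "pop": [1.5, 2.4]})

by_key = frame["pop"]   # bracket lookup always returns the column
by_attr = frame.pop     # attribute lookup finds the DataFrame.pop method instead

print(type(by_key).__name__)
print(callable(by_attr))
print(frame.year.equals(frame["year"]))   # no name clash for "year"
```

For that reason bracket lookup is the safer default, with attribute lookup kept as an interactive convenience.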

When a list or array is assigned to a column, its length must match the length of the DataFrame.
If a Series is assigned instead, it is aligned exactly to the DataFrame's index, and missing values are inserted in every hole:

val = Series([-1, -2, -3], index=['x', 'y', 'p'])
frame2['zhengchu'] = val
frame2
   year   state  pop  zhengchu
o  2000    Ohio  1.5       NaN
p  2001    Ohio  1.7      -3.0
q  2002    Ohio  3.6       NaN
x  2001  Nevada  2.4      -1.0
y  2002  Nevada  2.9      -2.0
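
The two assignment rules above can be contrasted in one place. This is a hedged sketch on a small made-up frame (df, the extra column, and the label "z" are hypothetical): a list of the wrong length is rejected, while a Series is aligned on labels.

```python
import pandas as pd

df = pd.DataFrame({"year": [2000, 2001, 2002]}, index=["o", "p", "q"])

# A list or array must match the number of rows exactly.
try:
    df["extra"] = [1, 2]          # 2 values for 3 rows
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

# A Series is aligned on index labels instead: unmatched rows become NaN,
# and labels that are not in the DataFrame ("z") are ignored.
df["extra"] = pd.Series([-1, -2], index=["q", "z"])

print(type(mismatch_error).__name__)
print(df)
```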

Assigning to a column that does not exist creates a new column:

frame2['useness'] = frame2.state == 'Ohio'
frame2
   year   state  pop  zhengchu  useness
o  2000    Ohio  1.5       NaN     True
p  2001    Ohio  1.7      -3.0     True
q  2002    Ohio  3.6       NaN     True
x  2001  Nevada  2.4      -1.0    False
y  2002  Nevada  2.9      -2.0    False

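As with a dict, the del keyword deletes a column:

del frame2['useness']
frame2.columns
Index(['year', 'state', 'pop', 'zhengchu'], dtype='object')

### The column returned when indexing a DataFrame is a view on the underlying data, not a copy.
### Thus, any in-place modification to the Series is reflected in the DataFrame. A column can be copied explicitly with the Series copy method. (This describes the pandas version used in these notes; with Copy-on-Write enabled in newer releases, such modifications no longer propagate.)

Passing a nested dict of dicts: the outer keys become the column index and the inner keys become the row index: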
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = DataFrame(pop)
frame3
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6
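
The dict of lists and the nested dict used so far are two of several inputs the DataFrame constructor accepts. As a rough sketch with made-up data (the names from_array, from_series, and from_records are illustrative only), a few other common inputs look like this:

```python
import numpy as np
import pandas as pd

# 2D ndarray with optional row and column labels.
from_array = pd.DataFrame(np.arange(6).reshape(2, 3),
                          index=["r1", "r2"], columns=["a", "b", "c"])

# Dict of Series: the row index is the union of the Series indexes.
from_series = pd.DataFrame({"x": pd.Series([1, 2], index=["a", "b"]),
                            "y": pd.Series([3, 4], index=["b", "c"])})

# List of dicts: each dict is one row; the union of keys gives the columns.
from_records = pd.DataFrame([{"a": 1, "b": 2}, {"a": 3, "c": 4}])

print(from_array.shape)
print(list(from_series.index))
print(sorted(from_records.columns))
```

In the last two cases, positions with no value are filled with NaN, just as in the nested-dict example.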

Transposing the result:

frame3.T
        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6

Specifying an explicit index changes the result: only the requested labels are kept, and labels absent from the data get NaN:

DataFrame(pop, index=[2001, 2002, 2003])
      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN
# Show the index name and the columns name
frame3.index.name = 'year'
frame3.columns.name = 'state'
frame3
state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6
# values returns the data as a 2D ndarray
frame3.values
array([[ nan,  1.5],
       [ 2.4,  1.7],
       [ 2.9,  3.6]])
# If the columns have different dtypes, the returned array uses a dtype chosen to accommodate all of them
frame2.values
array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -3.0],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.0],
       [2002, 'Nevada', 2.9, -2.0]], dtype=object)

## Possible data inputs to the DataFrame constructor
#### 2D ndarray
A matrix of data, with optional row and column labels.
#### Dict of arrays, lists, or tuples
Each sequence becomes a column in the DataFrame. All sequences must be the same length.
#### NumPy structured/record array
Treated as the "dict of arrays" case.
#### Dict of Series
Each value becomes a column. If no explicit index is passed, the indexes of the Series are unioned to form the row index of the result.
#### Dict of dicts
Each inner dict becomes a column. Keys are unioned to form the row index, as in the "dict of Series" case.
#### List of dicts or Series
Each item becomes a row in the DataFrame. The union of the dict keys or Series indexes forms the DataFrame's column labels.
#### List of lists or tuples
Treated as the "2D ndarray" case.
#### Another DataFrame
The DataFrame's indexes are used unless different ones are passed.
#### NumPy MaskedArray
Like the "2D ndarray" case, except that masked values become NA/missing in the DataFrame.

## 2.1.3 Index objects: any array or other sequence of labels used when constructing a Series or DataFrame is converted internally to an Index:
obj = Series(range(3),index = ['a', 'b', 'c'])
obj.index
Index(['a', 'b', 'c'], dtype='object')
# An Index is immutable
obj.index[1] = 'd'
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
 in ()
      1 # An Index is immutable
----> 2 obj.index[1] = 'd'

~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in __setitem__(self, key, value)
   1668
   1669     def __setitem__(self, key, value):
-> 1670         raise TypeError("Index does not support mutable operations")
   1671
   1672     def __getitem__(self, key):

TypeError: Index does not support mutable operations
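
Because single labels cannot be assigned, relabeling has to produce a new Index. The sketch below is one possible way to do it, using Series.rename with a label mapping; the mapping {"b": "d"} and the names raised and renamed are illustrative.

```python
import pandas as pd

obj = pd.Series(range(3), index=["a", "b", "c"])

# Item assignment on an Index raises TypeError.
try:
    obj.index[1] = "d"
    raised = False
except TypeError:
    raised = True

# Relabeling instead returns a new object with a new Index.
renamed = obj.rename({"b": "d"})

print(raised)
print(list(renamed.index))
print(list(obj.index))   # the original Series keeps its labels
```

Immutability is what makes it safe for several data structures to share one Index object.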
index = pd.Index(np.arange(3))
obj2 = Series([1.5, -2.5, 0], index=index)
obj2.index is index
True

Reindexing with reindex: labels that do not match the original index get NaN
obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'x', 'y'])
obj2 = obj.reindex(['r', 'w', 'x', 'o', 'y'])
print(obj, '\n', obj2)
d    4.5
b    7.2
x   -5.3
y    3.6
dtype: float64
 r    NaN
w    NaN
x   -5.3
o    NaN
y    3.6
dtype: float64
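
What reindex does to the labels can be seen more plainly with integer data. This is a minimal sketch with a made-up Series (s and r are illustrative names, not objects from the session above): requested labels keep their values, labels that were not requested are dropped, and new labels get NaN, which also forces an integer Series to float.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])
r = s.reindex(["c", "a", "z"])   # reorder, drop "b", add the unknown label "z"

print(r)
print(s.dtype, "->", r.dtype)
print(list(s.index))             # reindex returns a new object; s is unchanged
```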
obj.reindex(['r', 'w', 'o', 'x', 'y'], fill_value=0.0)  # fill missing values with 0.0
r    0.0
w    0.0
o    0.0
x   -5.3
y    3.6
dtype: float64

The method option fills values while reindexing: ffill fills forward and bfill fills backward:
obj3 = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3.reindex(range(6), method="ffill")
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
# Reindexing the rows
frame = DataFrame(np.arange(9).reshape((3, 3)),
                  index=['a', 'b', 'c'],
                  columns=['wo', 'shi', 'shui'])
print(frame)
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
print(frame2)
   wo  shi  shui
a   0    1     2
b   3    4     5
c   6    7     8
    wo  shi  shui
a  0.0  1.0   2.0
b  3.0  4.0   5.0
c  6.0  7.0   8.0
d  NaN  NaN   NaN
# Reindexing the columns
states = ['wo', 'ai', 'ni']
frame3 = frame.reindex(columns=states)
print(frame3)
# New labels that were not in the original all map to NaN
   wo  ai  ni
a   0 NaN NaN
b   3 NaN NaN
c   6 NaN NaN

#### Deleting entries:
obj = Series(np.arange(5), index=['a', 'b', 'c', 'd', 'e'])
new_obj = obj.drop('c')
print(new_obj)
a    0
b    1
d    3
e    4
dtype: int32
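
Note that drop returns a new object and leaves the original untouched. The same method works on a DataFrame, where an axis argument selects rows or columns. A small sketch, with a made-up frame (frame, no_row, and no_col are illustrative names):

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5), index=["a", "b", "c", "d", "e"])
new_obj = obj.drop("c")              # returns a new Series

frame = pd.DataFrame(np.arange(6).reshape(2, 3),
                     index=["r1", "r2"], columns=["x", "y", "z"])
no_row = frame.drop("r1")            # rows are the default axis
no_col = frame.drop("y", axis=1)     # axis=1 targets columns

print(list(obj.index))               # the original still has "c"
print(list(new_obj.index))
print(no_row.shape, no_col.shape)
```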
obj.drop(['d', 'e'])
a    0
b    1
c    2
dtype: int32

Slicing with labels includes the endpoint:
obj = Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj['b':'c']
b    1.0
c    2.0
dtype: float64
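
The inclusive endpoint is specific to label slices; ordinary positional slices still exclude the stop position. A minimal side-by-side sketch (by_label and by_position are illustrative names):

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=["a", "b", "c", "d"])

by_label = obj.loc["b":"c"]     # label slice: both endpoints are included
by_position = obj.iloc[1:2]     # positional slice: the stop is excluded

print(list(by_label.index))
print(list(by_position.index))
```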
data = DataFrame(np.arange(16).reshape((4, 4)),
                 index=['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns=['one', 'two', 'three', 'four'])
print(data)
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
data < 5
            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False
data[data < 5] = 0
print(data)
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

Indexing by label and by position. The original session used the ix indexer, which already raised a DeprecationWarning pointing to .loc (label based) and .iloc (positional), and which was removed in pandas 1.0. The calls below use the replacements:
data.loc['Colorado', ['two', 'three']]  # was data.ix['Colorado', ['two', 'three']]
two      5
three    6
Name: Colorado, dtype: int32
data.iloc[2]  # was data.ix[2]
one       8
two       9
three    10
four     11
Name: Utah, dtype: int32
data.loc[data.three > 5, data.columns[:3]]  # was data.ix[data.three > 5, :3]
          one  two  three
Colorado    0    5      6
Utah        8    9     10
New York   12   13     14

References: http://pda.readthedocs.io/en/latest/chp5.html#id15
Python for Data Analysis (《利用Python进行数据分析》)
