Getting Started with pandas

Source: Internet | Published by: 四川广电网络电视台 | Editor: 程序博客网 | Date: 2024/06/05 00:10


from pandas import Series, DataFrame
import pandas as pd
import numpy as np

## Series: a one-dimensional array of data together with an associated array of data labels, called its index.

obj = Series([4, 7, -5, 3])

print(obj)  # index on the left, values on the right

0    4
1    7
2   -5
3    3
dtype: int64

Series attributes: values and index

obj.values

array([ 4, 7, -5, 3], dtype=int64)

obj.index

RangeIndex(start=0, stop=4, step=1)

Setting a custom index:

obj2 = Series([4, 7, -5, 3], index=['d', 'e', 'f', 'g'])
print(obj2)

d    4
e    7
f   -5
g    3
dtype: int64

Selecting a single value or a set of values by index label:

obj2['d']

obj2[['d','g']]

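d    4
g    3
dtype: int64

NumPy-style operations (boolean filtering, scalar multiplication, math functions) preserve the link between index and value: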

obj2[ obj2 > 0 ]

d    4
e    7
g    3
dtype: int64

obj2 * 2

d     8
e    14
f   -10
g     6
dtype: int64

import numpy as np
np.exp(obj2)

d      54.598150
e    1096.633158
f       0.006738
g      20.085537
dtype: float64

A Series can be treated like a dict that maps index labels to values:

'b' in obj2

False

Creating a Series from a dict:

zhengchu = {'name':'zhengchu','age': 23,'girlfriend':'No'}

obj3 = Series(zhengchu)
print(obj3)

age                 23
girlfriend          No
name          zhengchu
dtype: object

NaN (not a number) is how pandas marks missing or NA values:

info = ['name', 'zheng', 'girlfriend', 'age']
obj4 = Series(zhengchu, index=info)
print(obj4)

name          zhengchu
zheng              NaN
girlfriend          No
age                 23
dtype: object

The pd.isnull and pd.notnull functions detect missing data:

pd.isnull(obj4)

name          False
zheng          True
girlfriend    False
age           False
dtype: bool
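The boolean result of isnull can be used directly, for example to count the missing entries or to filter them out. The following is only a minimal sketch; the names `person`, `labels`, and `s` are made up for this illustration and mirror the dict used above.

```python
import pandas as pd

# Hypothetical data mirroring the example above
person = {'name': 'zhengchu', 'age': 23, 'girlfriend': 'No'}
labels = ['name', 'zheng', 'girlfriend', 'age']
s = pd.Series(person, index=labels)   # 'zheng' has no value, so it becomes NaN

missing_mask = s.isnull()             # boolean Series, True where the value is missing
n_missing = int(missing_mask.sum())   # True counts as 1 when summed
present = s[~missing_mask]            # keep only the labels that have a value

print(n_missing)
print(list(present.index))
```

Inverting the mask with `~` gives the same selection as the notnull check.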

pd.notnull(obj4)

name           True
zheng         False
girlfriend     True
age            True
dtype: bool

# Or use the equivalent instance method
obj4.isnull()

name          False
zheng          True
girlfriend    False
age           False
dtype: bool

Arithmetic operations automatically align on index labels:

obj3 + obj4

age                         46
girlfriend                NoNo
name          zhengchuzhengchu
zheng                      NaN
dtype: object

Both the Series object itself and its index have a name attribute:

obj4.name = 'YourName'
obj4.index.name = 'Indddx'
print(obj4)

Indddx
name          zhengchu
zheng              NaN
girlfriend          No
age                 23
Name: YourName, dtype: object

A Series' index can be replaced in place by assignment:

obj.index = ['a', 'b', 'c', 'd']
print(obj)

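a    4
b    7
c   -5
d    3
dtype: int64

# DataFrame

### A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can hold a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index; it can be thought of as a dict of Series that all share the same index. Compared with other DataFrame-like structures you may have used before (such as R's data.frame), row-oriented and column-oriented operations in a DataFrame are treated roughly symmetrically. Under the hood, the data is stored as one or more two-dimensional blocks rather than as a list, dict, or some other collection of one-dimensional arrays. The exact details of DataFrame's internals are beyond the scope of this book.

#### Because a DataFrame stores its data internally in a two-dimensional format, you can use hierarchical indexing to represent higher-dimensional data in tabular form. Hierarchical indexing is the topic of a later chapter and is a key ingredient in many of the more advanced data-handling features in pandas.

Initializing a DataFrame from a dict: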

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)

frame

   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002
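To make the "dict of Series sharing one index" description concrete, here is a small sketch. The variable names (`index`, `cols`, `df`, `year_col`) and the labels 'a', 'b', 'c' are assumptions chosen for this illustration only.

```python
import pandas as pd

# Two Series built on the same (hypothetical) index
index = ['a', 'b', 'c']
cols = {
    'pop': pd.Series([1.5, 1.7, 3.6], index=index),
    'year': pd.Series([2000, 2001, 2002], index=index),
}
df = pd.DataFrame(cols)

year_col = df['year']                     # a single column comes back as a Series
print(type(year_col).__name__)
print(year_col.index.equals(df.index))    # the column carries the frame's row index
print(df.dtypes)                          # each column keeps its own dtype
```

Each column keeps its own dtype, which is what allows numbers and strings to sit side by side in one table.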

Specifying the column order:

DataFrame(data, columns=['year', 'state', 'pop'])

   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9

If you pass a column name that is not contained in the data, that column appears filled with NaN:

frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'zhengchu'],
                   index=['o', 'p', 'q', 'x', 'y'])
frame2

   year   state  pop zhengchu
o  2000    Ohio  1.5      NaN
p  2001    Ohio  1.7      NaN
q  2002    Ohio  3.6      NaN
x  2001  Nevada  2.4      NaN
y  2002  Nevada  2.9      NaN

# Retrieve a column by attribute access
frame.year

0    2000
1    2001
2    2002
3    2001
4    2002
Name: year, dtype: int64

# Retrieve a column with dict-like notation
frame['year']

0    2000
1    2001
2    2002
3    2001
4    2002
Name: year, dtype: int64
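One caveat worth illustrating: attribute access only reaches a column when the name does not collide with an existing DataFrame attribute or method. The 'pop' column used in this post is such a case, because DataFrame already has a pop method. A minimal sketch (the two-row frame here is a shortened stand-in, not the original data):

```python
import pandas as pd

# Shortened stand-in for the frame used above
frame = pd.DataFrame({'state': ['Ohio', 'Nevada'],
                      'year': [2000, 2001],
                      'pop': [1.5, 2.4]})

by_key = frame['pop']    # dict-style lookup always returns the column
by_attr = frame.pop      # attribute lookup finds the DataFrame.pop method instead

print(type(by_key).__name__)
print(callable(by_attr))
print(frame.year.equals(frame['year']))   # no collision here, so both forms agree
```

For this reason the dict-style form is the safer default when column names are not known in advance.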

# Retrieve a row by label (.ix was removed in pandas 1.0; .loc is the label-based replacement)
frame2.loc['y']

year          2002
state       Nevada
pop            2.9
zhengchu       NaN
Name: y, dtype: object

# Assigning a value to a column overwrites its NaN entries
frame2['zhengchu'] = 'god '
frame2

   year   state  pop zhengchu
o  2000    Ohio  1.5     god 
p  2001    Ohio  1.7     god 
q  2002    Ohio  3.6     god 
x  2001  Nevada  2.4     god 
y  2002  Nevada  2.9     god 

frame2['zhengchu'] = np.arange(5)
frame2

   year   state  pop  zhengchu
o  2000    Ohio  1.5         0
p  2001    Ohio  1.7         1
q  2002    Ohio  3.6         2
x  2001  Nevada  2.4         3
y  2002  Nevada  2.9         4

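When you assign a list or array to a column, its length must match the length of the DataFrame.
If you assign a Series instead, its values are matched to the DataFrame's index by label, and missing values are inserted in all the holes: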

val = Series([-1, -2, -3], index=['x', 'y', 'p'])
frame2['zhengchu'] = val
frame2

   year   state  pop  zhengchu
o  2000    Ohio  1.5       NaN
p  2001    Ohio  1.7      -3.0
q  2002    Ohio  3.6       NaN
x  2001  Nevada  2.4      -1.0
y  2002  Nevada  2.9      -2.0
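The difference between the two assignment rules can be checked side by side. This is a hedged sketch on a small made-up frame (`df`, the 'extra' column, and the labels are assumptions for illustration): a list of the wrong length is rejected, while a shorter Series is accepted and aligned by label.

```python
import pandas as pd

# Small hypothetical frame with three labelled rows
df = pd.DataFrame({'year': [2000, 2001, 2002]}, index=['o', 'p', 'q'])

# A list must have exactly one value per row
try:
    df['extra'] = [1, 2]              # 2 values for 3 rows
    list_error = None
except ValueError as exc:
    list_error = exc

# A Series is matched by label; rows without a matching label get NaN
df['extra'] = pd.Series([10, 30], index=['o', 'q'])

print(type(list_error).__name__)
print(df)
```

Because of the NaN in row 'p', the new column is stored as floating point even though the Series held integers.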

Assigning to a column that does not exist creates a new column:

frame2['useness'] = frame2.state == 'Ohio'
frame2

   year   state  pop  zhengchu  useness
o  2000    Ohio  1.5       NaN     True
p  2001    Ohio  1.7      -3.0     True
q  2002    Ohio  3.6       NaN     True
x  2001  Nevada  2.4      -1.0    False
y  2002  Nevada  2.9      -2.0    False

As with a dict, the del keyword deletes a column:

del frame2['useness']
frame2.columns

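Index(['year', 'state', 'pop', 'zhengchu'], dtype='object')

### The column returned when indexing a DataFrame is a view on the underlying data, not a copy.
### Therefore, any in-place modification of the Series is reflected in the DataFrame. A column can be copied explicitly with the Series' copy method.

Passing a nested dict of dicts: the outer keys become the column index and the inner keys become the row index: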

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = DataFrame(pop)
frame3

      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6
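The nested dict is only one of several inputs the DataFrame constructor accepts. As a rough sketch of two others, the block below builds one frame from a list of dicts (each dict becomes a row) and one from a 2D ndarray with explicit labels. All names and values here (`rows`, `from_rows`, `from_array`, 'r1', 'r2') are invented for the illustration.

```python
import numpy as np
import pandas as pd

# A list of dicts: each dict becomes one row, the union of keys becomes the columns
rows = [{'state': 'Ohio', 'pop': 1.5},
        {'state': 'Nevada', 'year': 2001}]
from_rows = pd.DataFrame(rows)

# A 2D ndarray with optional row and column labels
from_array = pd.DataFrame(np.arange(6).reshape((2, 3)),
                          index=['r1', 'r2'],
                          columns=['a', 'b', 'c'])

print(from_rows)     # keys missing from a dict show up as NaN in that row
print(from_array)
```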

Transposing the result:

frame3.T

        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6

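If an explicit index is specified, the result is different, because the inner keys are no longer combined to form the rows: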

DataFrame(pop,index=[2001,2002,2003])

      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN

# Set the index name and the columns name; both are shown in the output
frame3.index.name = 'year'
frame3.columns.name = 'state'
frame3

state  Nevada  Ohio
year               
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6

# The values attribute returns the data as a 2D ndarray
frame3.values

array([[ nan,  1.5],
       [ 2.4,  1.7],
       [ 2.9,  3.6]])

# If the DataFrame's columns have different dtypes, the returned array gets a dtype chosen to accommodate all of the columns
frame2.values

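array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -3.0],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.0],
       [2002, 'Nevada', 2.9, -2.0]], dtype=object)

## Possible data inputs to the DataFrame constructor

#### 2D ndarray
A matrix of data, with optional row and column labels.

#### Dict of arrays, lists, or tuples
Each sequence becomes a column in the DataFrame. All sequences must be the same length.

#### NumPy structured/record array
Treated the same as the "dict of arrays" case.

#### Dict of Series
Each value becomes a column. If no index is passed explicitly, the indexes of the Series are combined to form the row index of the result.

#### Dict of dicts
Each inner dict becomes a column. The keys are combined to form the row index, as in the "dict of Series" case.

#### List of dicts or Series
Each item becomes a row in the DataFrame. The union of the dict keys or Series indexes forms the DataFrame's column labels.

#### List of lists or tuples
Treated the same as the "2D ndarray" case.

#### Another DataFrame
The DataFrame's indexes are used unless different ones are passed.

#### NumPy MaskedArray
Like the "2D ndarray" case, except that masked values become NA/missing in the DataFrame.

## 2.1.3 Index objects

Any array or other sequence of labels used when constructing a Series or DataFrame is converted internally to an Index: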

obj = Series(range(3),index = ['a', 'b', 'c'])

obj.index

Index(['a', 'b', 'c'], dtype='object')

# Index objects are immutable, so this assignment raises a TypeError
obj.index[1] = 'd'

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
 in ()
      1 # Index objects are immutable, so this assignment raises a TypeError
----> 2 obj.index[1] = 'd'

~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in __setitem__(self, key, value)
   1668 
   1669     def __setitem__(self, key, value):
-> 1670         raise TypeError("Index does not support mutable operations")
   1671 
   1672     def __getitem__(self, key):

TypeError: Index does not support mutable operations
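Since a single label cannot be overwritten in place, relabelling has to go through a new Index. The sketch below shows two ways this is commonly done, under the assumption that a small three-element Series like the one above is all we need; the names `raised` and `renamed` are made up for the example.

```python
import pandas as pd

obj = pd.Series(range(3), index=['a', 'b', 'c'])

# Item assignment on an Index is rejected
try:
    obj.index[1] = 'd'
    raised = False
except TypeError:
    raised = True

# Option 1: rename returns a new Series with the mapped labels
renamed = obj.rename({'b': 'd'})

# Option 2: replace the whole index at once
obj.index = ['a', 'd', 'c']

print(raised)
print(list(renamed.index))
print(list(obj.index))
```

Immutability is what makes it safe for several objects to share one Index.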

index = pd.Index(np.arange(3))
obj2 = Series([1.5, -2.5, 0], index=index)
obj2.index is index

True

Reindexing with reindex: labels that are not present in the original index get NaN

obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'x', 'y'])
obj2 = obj.reindex(['r', 'w', 'x', 'o', 'y'])
print(obj, '\n', obj2)

d    4.5
b    7.2
x   -5.3
y    3.6
dtype: float64 
 r    NaN
w    NaN
x   -5.3
o    NaN
y    3.6
dtype: float64

obj.reindex(['r', 'w', 'o', 'x', 'y'], fill_value=0.0)  # fill missing labels with 0.0 instead of NaN

r    0.0
w    0.0
o    0.0
x   -5.3
y    3.6
dtype: float64

The method option: ffill fills values forward and bfill fills values backward:

obj3 = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3.reindex(range(6), method="ffill")

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
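Only ffill is demonstrated above, so here is a small comparison with bfill on the same data. This is just a sketch; the names `colors`, `forward`, and `backward` are chosen for the example.

```python
import pandas as pd

colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

forward = colors.reindex(range(6), method='ffill')   # carry the last known value forward
backward = colors.reindex(range(6), method='bfill')  # pull the next known value backward

print(forward.tolist())
print(backward.tolist())
```

With bfill the last position (5) has no later value to borrow from, so it stays NaN, whereas with ffill every position after the first known label is filled.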

# Reindexing the rows
frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'b', 'c'],
                  columns=['wo', 'shi', 'shui'])
print(frame)
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
print(frame2)

   wo  shi  shui
a   0    1     2
b   3    4     5
c   6    7     8
    wo  shi  shui
a  0.0  1.0   2.0
b  3.0  4.0   5.0
c  6.0  7.0   8.0
d  NaN  NaN   NaN

# Reindexing the columns
states = ['wo', 'ai', 'ni']
frame3 = frame.reindex(columns=states)
print(frame3)
# Columns that did not exist before are filled with NaN

   wo  ai  ni
a   0 NaN NaN
b   3 NaN NaN
c   6 NaN NaN

#### Dropping entries:

obj = Series(np.arange(5), index=['a', 'b', 'c', 'd', 'e'])
new_obj = obj.drop('c')
print(new_obj)

a    0
b    1
d    3
e    4
dtype: int32
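Note that drop returns a new object and leaves the original untouched, which is why the result was bound to new_obj. A minimal check, assuming the same five-element Series (the flag `missing_raises` is a name invented for this sketch):

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5), index=['a', 'b', 'c', 'd', 'e'])
new_obj = obj.drop('c')

print('c' in obj)       # the original Series still has the label
print('c' in new_obj)   # only the returned Series lacks it

# Dropping a label that does not exist raises a KeyError
try:
    obj.drop('z')
    missing_raises = False
except KeyError:
    missing_raises = True
print(missing_raises)
```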

obj.drop(['d','e'])

a    0
b    1
c    2
dtype: int32

Slicing with labels includes the endpoint:

obj = Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj['b':'c']

b    1.0
c    2.0
dtype: float64
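The inclusive endpoint is specific to label-based slicing; positional slicing follows the usual Python convention and excludes the stop position. A short sketch contrasting the two, using .loc and .iloc to make the intent explicit (the names `by_label` and `by_position` are assumptions for the example):

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_label = obj.loc['b':'c']     # label slice: both 'b' and 'c' are included
by_position = obj.iloc[1:2]     # positional slice: position 2 is excluded

print(list(by_label.index))
print(list(by_position.index))
```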

data = DataFrame(np.arange(16).reshape((4, 4)),
                 index=['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns=['one', 'two', 'three', 'four'])
print(data)

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

data < 5

            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False

data[data < 5] = 0
print(data)

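          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

The ix indexing field (deprecated since pandas 0.20 and removed in pandas 1.0, so the calls below only run on older versions):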

data.ix['Colorado', ['two', 'three']]  # not recommended: .ix is deprecated

C:\Users\Xiaowang Zhang\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  """Entry point for launching an IPython kernel.

two      5
three    6
Name: Colorado, dtype: int32

data.ix[2]

one       8
two       9
three    10
four     11
Name: Utah, dtype: int32

data.ix[data.three >5, :3]

          one  two  three
Colorado    0    5      6
Utah        8    9     10
New York   12   13     14
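On pandas versions where .ix is no longer available, the three selections above can be written with .loc (label-based) and .iloc (position-based), as the deprecation warning suggests. The block below is one possible translation, not the only one; the result names `row_cols`, `third_row`, and `filtered` are made up for the sketch.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data[data < 5] = 0   # same masking step as in the text above

# data.ix['Colorado', ['two', 'three']]  ->  label-based row and columns
row_cols = data.loc['Colorado', ['two', 'three']]

# data.ix[2]  ->  third row by position
third_row = data.iloc[2]

# data.ix[data.three > 5, :3]  ->  boolean row mask plus the first three column labels
filtered = data.loc[data.three > 5, data.columns[:3]]

print(row_cols)
print(third_row)
print(filtered)
```

Passing `data.columns[:3]` turns the positional column slice into labels, so that a single .loc call can combine it with the boolean row mask.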

References: http://pda.readthedocs.io/en/latest/chp5.html#id15
Python for Data Analysis (《利用python进行数据分析》)
