Pandas Basics Review: Series

Source: Internet. Editor: 程序博客网. Time: 2024/05/21 06:25

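Pandas (panel data & data analysis): a Python library for data analysis.

Pandas is a third-party Python library built on NumPy and dedicated to data analysis. It is best suited to large, structured, tabular data.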

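Pandas was originally developed at a hedge fund as a Python library for quantitative financial data analysis
Pandas borrows its data structures from R
Pandas is built on NumPy and supports most of the computations NumPy defines
Pandas provides many interfaces to other technologies, such as I/O tools (CSV, XLSX, HDF5, …) and visualization (a wrapper around pyplot), which makes it easy to interoperate with other languages and tools and to extend its functionality
Pandas uses Cython and C under the hood for speed, which greatly improves execution efficiency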

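Data types in the pandas library:

Series: one-dimensional
DataFrame: two-dimensional, a container of Series; the most commonly used
Panel: three-dimensional, a container of DataFrames (deprecated in pandas 0.20 and removed in 0.25)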
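To make the "container of Series" relationship concrete, here is a minimal sketch. The column names "x" and "y" are made up for illustration and do not come from the original post.

```python
import pandas as pd

# Hypothetical two-column table; "x" and "y" are illustrative names.
df = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})

col = df["x"]                      # selecting one column yields a Series
print(type(col).__name__)          # Series
print(col.name)                    # x
print(col.index.equals(df.index))  # True: the column shares the DataFrame's index
```

Each column keeps its own dtype, which is why a DataFrame can mix integer and float columns.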

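Python's list, NumPy's ndarray, and pandas' Series

list: Python's built-in type; simple in functionality, cumbersome to operate on, and inefficient
ndarray (NumPy): the basic data type; focuses on data structure, computation, and dimensions (the relationships among the data)
Series (DataFrame): an extended data type; focuses on practical use of the data and on the relationship between data and index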

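Differences among the three data types

The values in a list, Series, or DataFrame may have different types (a Series with mixed values is stored with dtype object), whereas all elements of an ndarray share a single dtype
In terms of practicality, functionality, and ease of use: list < ndarray < Series (DataFrame). In practice, prefer the pandas data types where possible.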
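The difference in value types can be checked directly. The following is a small sketch with illustrative values that are not taken from the original post.

```python
import numpy as np
import pandas as pd

mixed = [1, 2.5, "a"]        # a list can hold values of any type

arr = np.array([1, 2.5])     # ndarray: one dtype for all elements
print(arr.dtype)             # float64 (the integer 1 is upcast)

ser = pd.Series(mixed)       # mixed values fall back to dtype object
print(ser.dtype)             # object

num = pd.Series([1, 2.5])    # homogeneous numeric data keeps a numeric dtype
print(num.dtype)             # float64
```

Keeping a Series homogeneous is usually worthwhile, since vectorized numeric operations are not available on an object-dtype Series in the same way.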

# Import the pandas library
import pandas as pd

# Create a Series
se = pd.Series([2, 4, 6, 8, 10])
se

0     2
1     4
2     6
3     8
4    10
dtype: int64

# Create a DataFrame
da = pd.DataFrame([
    [2, 4, 6, 8, 10],
    [12, 14, 16, 18, 20]
])
da

    0   1   2   3   4
0   2   4   6   8  10
1  12  14  16  18  20

Creating a Series from a Python list

# Default index
a = pd.Series([1, 2, 3, 4])
a

0    1
1    2
2    3
3    4
dtype: int64

# Custom index
b = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
b

a    1
b    2
c    3
d    4
dtype: int64

# Value types: mixed values give dtype object
s = pd.Series([True, 1, 2.3, 'a', '你好'])
s

0    True
1       1
2     2.3
3       a
4      你好
dtype: object

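Creating a Series from a scalar

# Pass an index so the scalar is repeated for every label
# (without one, the result is a single-element Series)
c = pd.Series(10, index=['a', 'b', 'c'])
c

a    10
b    10
c    10
dtype: int64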

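# With an index but no data, every value is NaN. The dtype is given explicitly
# because the default dtype of an empty Series differs between pandas versions.
c_null = pd.Series(index=['a', 'b', 'c'], dtype='float64')
c_null

a   NaN
b   NaN
c   NaN
dtype: float64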

s = pd.Series([True, 1, 2.3, 'a', '你好'], index=['a', 'b', 'c', 'd', 'e'])
s

a    True
b       1
c     2.3
d       a
e      你好
dtype: object

Creating a Series from a Python dict

# The dict keys become the index labels
d = pd.Series({'a': 9, 'b': 8, 'c': 7})
d

a    9
b    8
c    7
dtype: int64
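When a dict is combined with an explicit index argument, the index selects and orders the keys. The sketch below assumes an extra label "x" that is deliberately absent from the dict to show what happens to missing keys.

```python
import pandas as pd

data = {"a": 9, "b": 8, "c": 7}

# The index picks keys from the dict in the given order; "x" has no value.
picked = pd.Series(data, index=["c", "a", "x"])
print(picked)
```

The missing label is filled with NaN, which also turns the integer values into floats.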

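Creating a Series from an ndarray. Both the index and the data can be built from ndarrays.

import numpy as np

# The dtype follows NumPy's default integer type
# (int32 in this run; int64 on many platforms)
n = pd.Series(np.arange(5))
n

0    0
1    1
2    2
3    3
4    4
dtype: int32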

m = pd.Series(np.arange(5), index=np.arange(9, 4, -1))
m

9    0
8    1
7    2
6    3
5    4
dtype: int32

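Creating a Series with other functions

# Any iterable such as range works; the dtype shown may be int64
# depending on platform and pandas version
n = pd.Series(range(10))
n

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int32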

Basic Series operations

Index and values

b = pd.Series([9, 8, 7, 6, 5, 4, 3],
              ['a', 'b', 'c', 'd', 'e', 'f', 'g'])
b

a    9
b    8
c    7
d    6
e    5
f    4
g    3
dtype: int64

# Get the index; the result is an Index object, pandas' own index type
b.index

Index(['a', 'b', 'c', 'd', 'e', 'f', 'g'], dtype='object')

# Get the data; the result is a NumPy ndarray
b.values

array([9, 8, 7, 6, 5, 4, 3], dtype=int64)
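As a small sketch of how the two parts fit together, the index and the values can be pulled apart and reassembled. It uses `to_numpy()`, an alternative to `.values` that the original post does not mention, and a shortened three-element Series for brevity.

```python
import pandas as pd

b = pd.Series([9, 8, 7], index=["a", "b", "c"])

labels = b.index.tolist()     # plain Python list of labels
values = b.to_numpy()         # NumPy array holding the data

# Putting the two halves back together gives an equal Series.
rebuilt = pd.Series(values, index=labels)
print(labels)
print(values)
print(rebuilt.equals(b))
```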

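Indexing

# By index label (returns 8)
b['b']

# By position (b[1] also works in older pandas, but integer keys on a
# label-indexed Series are deprecated since pandas 2.1, so .iloc is safer)
b.iloc[1]

# As an attribute, when the label is a valid Python identifier
b.b

# By a list of labels
b[['c', 'd', 'a']]

c    7
d    6
a    9
dtype: int64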

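# Wrong: the two kinds of index coexist, but they cannot be mixed in one list.
# Older pandas returned NaN for the missing label 0, as shown below;
# pandas 1.0 and later raise a KeyError instead.
b[['c', 'd', 0]]

c    7.0
d    6.0
0    NaN
dtype: float64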
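Because the two kinds of index are easy to confuse, the explicit accessors `.loc` (labels) and `.iloc` (positions) remove the ambiguity. The original post uses plain `[]` throughout; this is a sketch of the explicit alternative on the same data.

```python
import pandas as pd

b = pd.Series([9, 8, 7, 6, 5, 4, 3], index=list("abcdefg"))

by_label = b.loc["b"]               # label-based lookup
by_position = b.iloc[1]             # position-based lookup
several = b.loc[["c", "d", "a"]]    # list of labels, in the requested order
first_last = b.iloc[[0, -1]]        # list of positions; -1 is the last element

print(by_label, by_position)
print(several.tolist())
print(first_last.tolist())
```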

Slicing

# Slice by position
b[1:]

b    8
c    7
d    6
e    5
f    4
g    3
dtype: int64

b[:3]

a    9
b    8
c    7
dtype: int64

# Slice by index label; unlike a positional slice, the end label is included
b[:'d']

a    9
b    8
c    7
d    6
dtype: int64

b[::2]

a    9
c    7
e    5
g    3
dtype: int64

# Slice the whole Series backwards with step -1, the simplest way to reverse it, just as with a list
b[::-1]

g    3
f    4
e    5
d    6
c    7
b    8
a    9
dtype: int64
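The end-point rule differs between the two slicing styles, which is a common source of off-by-one errors. Here is a short sketch on the same data, using the explicit `.iloc` and `.loc` accessors, which the original post does not use.

```python
import pandas as pd

b = pd.Series([9, 8, 7, 6, 5, 4, 3], index=list("abcdefg"))

by_position = b.iloc[1:3]       # positions 1 and 2; the stop position is excluded
by_label = b.loc["b":"d"]       # labels b, c and d; the stop label is included

print(by_position.index.tolist())
print(by_label.index.tolist())
```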

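ndarray-like operations

Indexing works the same way: both use []
NumPy operations and functions can be applied to a Series
A list of custom index labels can be used for selection
Slicing by the automatic (positional) index also works; if a custom index exists, it is sliced along with the data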

b.iloc[3]  # the value at position 3 (the fourth element); the result is a scalar

b[:3]  # positions 0 to 2; the result is still a Series

a    9
b    8
c    7
dtype: int64

b[b > b.median()]  # all values greater than the median

a    9
b    8
c    7
dtype: int64
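The points above can be sketched as follows: NumPy functions and arithmetic apply element-wise and keep the index, and a boolean Series acts as a mask. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

b = pd.Series([9, 8, 7, 6, 5, 4, 3], index=list("abcdefg"))

roots = np.sqrt(b)         # a NumPy ufunc returns a Series with the same index
doubled = b * 2            # arithmetic is element-wise
mask = b > b.median()      # boolean Series, True where the value exceeds the median
above = b[mask]            # boolean mask selects the matching rows

print(type(roots).__name__)
print(doubled["g"])
print(above.index.tolist())
```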

Dictionary-like operations

Access by custom index label
The in keyword
The .get() method

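b['b']

'c' in b  # is this key among b's index labels?

True

0 in b  # in does not check the automatic (positional) index

False

b.get('f', 100)  # return the value at label 'f' if it exists, otherwise the default 100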
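A compact sketch of the dictionary-like behaviour, on a shortened Series. Note that `in` tests the labels, not the values; the label "z" is a made-up missing key.

```python
import pandas as pd

b = pd.Series([9, 8, 7], index=["a", "b", "c"])

found = b.get("b", 100)       # label exists, so its value is returned
missing = b.get("z", 100)     # hypothetical missing label, so the default is returned
has_label = "c" in b          # True: "c" is an index label
has_value = 9 in b            # False: 9 is a value, not a label
in_values = 9 in b.values     # True: test the values explicitly
as_dict = b.to_dict()         # convert to a plain dict

print(found, missing, has_label, has_value, in_values)
print(as_dict)
```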

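Index alignment

Series + Series

a = pd.Series([1, 2, 3], ['c', 'd', 'e'])
b = pd.Series([9, 8, 7, 6], ['a', 'b', 'c', 'd'])

# The result index is the union of the two indexes. Values whose labels match
# are added; a label present in only one Series has no partner, so it gives NaN.
a + b

a    NaN
b    NaN
c    8.0
d    8.0
e    NaN
dtype: float64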

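In arithmetic, Series automatically align data by index label
ndarray operations work by dimension (position), while Series operations work by index, which is more precise and less error-prone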
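If NaN for unmatched labels is not wanted, the method form `add` accepts a `fill_value`. The original post only shows the `+` operator; this is a sketch of that alternative on the same two Series.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["c", "d", "e"])
b = pd.Series([9, 8, 7, 6], index=["a", "b", "c", "d"])

plain = a + b                       # unmatched labels become NaN
filled = a.add(b, fill_value=0)     # a missing side is treated as 0

print(plain)
print(filled)
```

With `fill_value=0`, the labels a, b and e keep the value from whichever Series contains them.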

The name attribute of a Series

Both the Series object and its index can be given a name, stored in the .name attribute

b = pd.Series([9, 8, 7, 6], ['a', 'b', 'c', 'd'])

print(b.name)  # no name by default

None

b.name = 'Series对象'  # name the Series object
print(b.name)

Series对象

b.index.name = '索引列'  # name the index
b

索引列
a    9
b    8
c    7
d    6
Name: Series对象, dtype: int64
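One place the two names become visible is when a Series is turned into a DataFrame. In this sketch the names "score" and "key" are illustrative stand-ins, not the names used in the post.

```python
import pandas as pd

# "score" and "key" are made-up names for illustration.
b = pd.Series([9, 8, 7, 6], index=["a", "b", "c", "d"], name="score")
b.index.name = "key"

frame = b.to_frame()       # the Series name becomes the column name
flat = b.reset_index()     # the index name becomes a column name as well

print(list(frame.columns), frame.index.name)
print(list(flat.columns))
```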

Modifying a Series

A Series object can be modified at any time, and the change takes effect immediately

b

索引列
a    9
b    8
c    7
d    6
Name: Series对象, dtype: int64

b['a'] = 15
b

索引列
a    15
b     8
c     7
d     6
Name: Series对象, dtype: int64

b.name = 'Series'
b

索引列
a    15
b     8
c     7
d     6
Name: Series, dtype: int64

b.name = 'new series'
b[['b', 'c']] = 20  # assign to several labels at once by passing a list of labels
b

索引列
a    15
b    20
c    20
d     6
Name: new series, dtype: int64
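Since modifications happen in place, it can help to keep a copy when the original values are still needed. The sketch below also shows that assigning to a label that does not yet exist appends a new entry; the label "e" and the name `backup` are illustrative.

```python
import pandas as pd

b = pd.Series([9, 8, 7, 6], index=["a", "b", "c", "d"])

backup = b.copy()            # independent copy, unaffected by later edits
b.loc[["b", "c"]] = 20       # overwrite two existing labels
b["e"] = 1                   # assigning to a new label appends it

print(b.tolist())
print(backup.tolist())
```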
