pandas模块简明教程

来源:互联网 发布:淘宝蓝海市场 编辑:程序博客网 时间:2024/04/29 18:17

  • pandas模块简介
  • 创建对象
    • Series
    • DataFrame
  • 读取文件
  • 查看数据
    • 查看头尾
    • 行列筛选
      • 序号筛选
      • 索引筛选
      • 标签筛选
      • 布尔值筛选
  • 修改数据
    • 设置新值
    • 缺失值处理
      • 删除缺失值所在的行
      • 填补缺失值
        • 给定值填充
        • 前向填充
        • 后向填充
  • 删除数据
  • 数学运算
    • 加法运算
    • 减乘除运算
    • 其他运算
  • 其他相关操作
    • 快速统计
    • 行列排序
    • 转置
  • 参考资料

pandas模块简介

pandas是基于NumPy的一个常用的数据分析库,拥有丰富的方法,能够完成数据读取、数据清洗和数据统计和数据可视化等一系列功能,大大提升开发效率,下面介绍一些常用的功能,更多功能可以自行查阅官方文档。

创建对象

pandas有两种书籍格式
- Series
- DataFrame

简单理解的话,Series是值的集合,DataFrameSeries的集合

Series

import pandas as pdfrom pandas import Series,DataFrame
data = [5,6,7,8,9]obj = Series(data)
obj
 0    5 1    6 2    7 3    8 4    9 dtype: int64

左边的一列是索引index,右边的一列是值value

obj.index
RangeIndex(start=0, stop=5, step=1)
obj.values
array([5, 6, 7, 8, 9], dtype=int64)

可以使用指定索引的方法来创建Series

obj2 = Series(data, index = ['d','a','m','p','c'])
obj2
d    5a    6m    7p    8c    9dtype: int64

也可以直接使用字典对象来创建

dataDict = {'History':72, 'Math':96, 'Physices':93}obj3 = Series(dataDict)
obj3
History     72Math        96Physices    93dtype: int64

DataFrame

import numpy as np

直接根据 np.array 创建

frame = DataFrame(np.random.randn(5,3))
frame
0 1 2 0 -0.845079 -1.547513 -1.103420 1 -0.180325 -0.124631 -0.456067 2 0.066418 0.934352 1.889571 3 1.987873 -0.603892 -0.446530 4 1.825149 0.920741 0.744808

指定行列

frame1 = DataFrame(np.random.randn(5,3),index = ['a', 'b', 'c', 'd', 'e'],columns = ['f', 'g', 'h'])
frame1
f g h a 0.382079 0.851372 -0.187187 b -0.568949 -0.867452 0.650982 c 1.127272 -1.008783 1.464607 d -1.225070 -1.528853 0.831197 e -3.215866 1.588226 1.500100

通过字典来创建

dataDict = {'History':[72,84,81],            'Math':[94,67,89],             'Physices':[83,98,79]}frame2 = DataFrame(dataDict, index = ['Eric', 'Toms', 'Peter'])
frame2
History Math Physices Eric 72 94 83 Toms 84 67 98 Peter 81 89 79

读取文件

pandas支持读取SQL、CSV、HTML、SAS和EXCEL等多种格式的文件,使用自带的read_xxx方法就可以,下面展示CSV文件的读取

path = 'G:\\o2o_coupon_prediction\\ccf_offline_stage1_train.csv'testTable = pd.read_csv(path)
testTable
User_id Merchant_id Coupon_id Discount_rate Distance Date_received Date 0 1439408 2632 null null 0 null 20160217 1 1439408 4663 11002 150:20 1 20160528 null 2 1439408 2632 8591 20:1 0 20160217 null 3 1439408 2632 1078 20:1 0 20160319 null 4 1439408 2632 8591 20:1 0 20160613 null 5 1439408 2632 null null 0 null 20160516 6 1439408 2632 8591 20:1 0 20160516 20160613 7 1832624 3381 7610 200:20 0 20160429 null 8 2029232 3381 11951 200:20 1 20160129 null 9 2029232 450 1532 30:5 0 20160530 null 10 2029232 6459 12737 20:1 0 20160519 null 11 2029232 6459 null null 0 null 20160626 12 2029232 6459 null null 0 null 20160519 13 2747744 6901 1097 50:10 null 20160606 null 14 196342 1579 null null 1 null 20160606 15 196342 1579 10698 20:1 1 20160606 null 16 2223968 3381 9776 10:5 2 20160129 null 17 73611 2099 12034 100:10 null 20160207 null 18 163606 1569 5054 200:30 10 20160421 null 19 3273056 4833 7802 200:20 10 20160130 null 20 94107 3381 7610 200:20 2 20160412 null 21 253750 8390 null null 0 null 20160327 22 253750 6901 2366 30:5 0 20160518 null 23 253750 8390 7531 20:5 0 20160327 null 24 343660 4663 11002 150:20 null 20160528 null 25 376492 1041 13490 30:5 2 20160127 null 26 1964720 7884 null null 10 null 20160115 27 1964720 7884 6704 20:1 10 20160215 null 28 4191584 3051 null null 0 null 20160519 29 4191584 3051 null null 0 null 20160516 … … … … … … … … 1754854 179830 6284 4567 50:10 1 20160124 null 1754855 179830 6284 7379 50:10 1 20160124 null 1754856 179830 3710 13056 150:10 4 20160124 null 1754857 179830 2099 12034 100:10 1 20160124 null 1754858 179830 4660 1480 100:10 0 20160125 null 1754859 179830 5341 7751 50:10 0 20160123 null 1754860 179830 7555 17 30:5 7 20160123 null 1754861 179830 760 3237 20:5 0 20160124 null 1754862 179830 2099 12034 100:10 1 20160203 null 1754863 343276 4206 null null 5 null 20160425 1754864 653784 195 null null 0 null 20160418 1754865 653784 195 null null 0 null 20160416 1754866 3795296 5341 111 30:5 0 20160129 null 1754867 188086 6568 null null 0 null 20160415 1754868 188086 6568 null null 0 null 20160626 1754869 188086 6568 4723 30:1 0 20160415 null 1754870 188086 6568 null null 0 null 20160417 1754871 188086 6568 null null 0 null 20160604 1754872 212662 3532 null null 1 null 20160308 1754873 212662 2934 5686 30:5 2 20160321 20160330 1754874 212662 2934 null null 2 null 20160513 1754875 212662 2934 null null 2 null 20160512 1754876 212662 3532 5267 30:5 1 20160322 null 1754877 212662 3021 3739 30:1 6 20160504 20160508 1754878 212662 2934 5686 30:5 2 20160321 20160322 1754879 212662 3532 null null 1 null 20160322 1754880 212662 3021 3739 30:1 6 20160508 20160602 1754881 212662 2934 null null 2 null 20160321 1754882 752472 7113 1633 50:10 6 20160613 null 1754883 752472 3621 2705 20:5 0 20160523 null

1754884 rows × 7 columns

查看数据

查看头尾

缺省值均为5

testTable.head(3)
User_id Merchant_id Coupon_id Discount_rate Distance Date_received Date 0 1439408 2632 null null 0 null 20160217 1 1439408 4663 11002 150:20 1 20160528 null 2 1439408 2632 8591 20:1 0 20160217 null
testTable.tail(3)
User_id Merchant_id Coupon_id Discount_rate Distance Date_received Date 1754881 212662 2934 null null 2 null 20160321 1754882 752472 7113 1633 50:10 6 20160613 null 1754883 752472 3621 2705 20:5 0 20160523 null

为了缩小数据量,方便后面使用介绍,只取整个CSV文件的前五行并修改索引

usingTable = testTable.head(5).copy()usingTable.index = ['a', 'b', 'c', 'd', 'e']
usingTable
User_id Merchant_id Coupon_id Discount_rate Distance Date_received Date a 1439408 2632 null null 0 null 20160217 b 1439408 4663 11002 150:20 1 20160528 null c 1439408 2632 8591 20:1 0 20160217 null d 1439408 2632 1078 20:1 0 20160319 null e 1439408 2632 8591 20:1 0 20160613 null

行列筛选

因为pandasNumpy的上层库,NumPy的选择和设置表达式都能够使用,但是为了代码的健壮性,我们推荐使用pandas的数据访问方式:.loc和.iloc,更多内容可以查看http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

序号筛选

直接根据序号筛选

usingTable[:2]
User_id Merchant_id Coupon_id Discount_rate Distance Date_received Date a 1439408 2632 null null 0 null 20160217 b 1439408 4663 11002 150:20 1 20160528 null

iloc方法可以根据序号筛选出想要的范围

usingTable.iloc[2]
User_id           1439408Merchant_id          2632Coupon_id            8591Discount_rate        20:1Distance                0Date_received    20160217Date                 nullName: c, dtype: object
usingTable.iloc[3:5]
User_id Merchant_id Coupon_id Discount_rate Distance Date_received Date d 1439408 2632 1078 20:1 0 20160319 null e 1439408 2632 8591 20:1 0 20160613 null
usingTable.iloc[3:5, 1:3]
Merchant_id Coupon_id d 2632 1078 e 2632 8591

索引筛选

根据索引值筛选出特定的行

usingTable.loc['b']
User_id           1439408Merchant_id          4663Coupon_id           11002Discount_rate      150:20Distance                1Date_received    20160528Date                 nullName: b, dtype: object

标签筛选

usingTable['User_id']
a    1439408b    1439408c    1439408d    1439408e    1439408Name: User_id, dtype: int64

加上索引就可以筛选出某一个标量

usingTable['User_id']['b']
1439408

布尔值筛选

bool = usingTable['Date_received'] != 'null'
bool
a    Falseb     Truec     Trued     Truee     TrueName: Date_received, dtype: bool
usingTable[bool]
User_id Merchant_id Coupon_id Discount_rate Distance Date_received Date b 1439408 4663 11002 150:20 1 20160528 null c 1439408 2632 8591 20:1 0 20160217 null d 1439408 2632 1078 20:1 0 20160319 null e 1439408 2632 8591 20:1 0 20160613 null # 增加数据增加新的列,`pandas`可以根据标签自动创建
usingTable['Nation'] = 'China'
usingTable
User_id Merchant_id Coupon_id Discount_rate Distance Date_received Date Nation a 1439408 2632 null null 0 null 20160217 China b 1439408 4663 11002 150:20 1 20160528 null China c 1439408 2632 8591 20:1 0 20160217 null China d 1439408 2632 1078 20:1 0 20160319 null China e 1439408 2632 8591 20:1 0 20160613 null China

修改数据

设置新值

使用上面提到的筛选的方法,选中区域后就可以进行赋值操作

usingTable.loc[['b','c']] = None 
usingTable
User_id Merchant_id Coupon_id Discount_rate Distance Date_received Date Nation a 1439408.0 2632.0 null null 0 null 20160217 China b NaN NaN None None None None None None c NaN NaN None None None None None None d 1439408.0 2632.0 1078 20:1 0 20160319 null China e 1439408.0 2632.0 8591 20:1 0 20160613 null China

缺失值处理

usingTable2 = DataFrame(np.random.randn(5,3),index = list('abcde'))for i in range(3):    usingTable2.iloc[np.random.randint(0,4),np.random.randint(0,1)] = np.nanusingTable2
0 1 2 a NaN 1.324122 0.474714 b -1.037397 -0.541610 -0.543801 c NaN -0.528810 1.151078 d -0.870800 NaN -0.050644 e -1.214881 0.554842 -0.250637

可以看到上表中一共3个缺失的值,下面介绍几种常用的数据缺失的处理办法
isnullnotnull方法可以判断是否缺失

usingTable2.isnull()
0 1 2 a True False False b False False False c True False False d False True False e False False False

删除缺失值所在的行

usingTable2.dropna()
0 1 2 b -1.037397 -0.541610 -0.543801 e -1.214881 0.554842 -0.250637

填补缺失值

给定值填充

usingTable2.fillna(usingTable2.mean())
0 1 2 a -1.041026 1.324122 0.474714 b -1.037397 -0.541610 -0.543801 c -1.041026 -0.528810 1.151078 d -0.870800 0.202136 -0.050644 e -1.214881 0.554842 -0.250637

不同的列用不同的值填充

usingTable2.fillna({0:'f0', 1:'f1'})
0 1 2 a f0 1.32412 0.474714 b -1.0374 -0.54161 -0.543801 c f0 -0.52881 1.151078 d -0.8708 f1 -0.050644 e -1.21488 0.554842 -0.250637

前向填充

使用缺失值的前一个值填充

usingTable2.fillna(method = 'ffill')
0 1 2 a NaN 1.324122 0.474714 b -1.037397 -0.541610 -0.543801 c -1.037397 -0.528810 1.151078 d -0.870800 -0.528810 -0.050644 e -1.214881 0.554842 -0.250637

后向填充

usingTable2.fillna(method = 'bfill')
0 1 2 a -1.037397 1.324122 0.474714 b -1.037397 -0.541610 -0.543801 c -0.870800 -0.528810 1.151078 d -0.870800 0.554842 -0.050644 e -1.214881 0.554842 -0.250637

删除数据

使用drop方法舍弃行列
根据索引舍弃某一行

usingTable.drop('a')
User_id Merchant_id Coupon_id Discount_rate Distance Date_received Date Nation b NaN NaN None None None None None None c NaN NaN None None None None None None d 1439408.0 2632.0 1078 20:1 0 20160319 null China e 1439408.0 2632.0 8591 20:1 0 20160613 null China

根据标签舍弃某一列

usingTable.drop('Distance', axis = 1)
User_id Merchant_id Coupon_id Discount_rate Date_received Date Nation a 1439408.0 2632.0 null null null 20160217 China b NaN NaN None None None None None c NaN NaN None None None None None d 1439408.0 2632.0 1078 20:1 20160319 null China e 1439408.0 2632.0 8591 20:1 20160613 null China

数学运算

usingFrame1 = DataFrame(np.random.randn(5,3))usingFrame2 = DataFrame(np.random.randn(5,3), index = [3, 4, 5, 6, 7])

加法运算

usingFrame1
0 1 2 0 0.956881 -0.114052 -0.712592 1 1.163835 -0.983432 0.674851 2 -0.598014 -0.606926 -0.065567 3 -0.607522 0.172754 0.391103 4 -2.008146 0.554840 -0.727477
usingFrame2
0 1 2 3 -0.746694 0.989952 -0.715514 4 0.706863 -1.605593 0.037387 5 -1.642772 -0.622830 0.505732 6 2.168925 0.406054 0.484112 7 0.676034 -0.701310 1.761908

可以直接使用+进行计算,只计算交叉部分,剩余的补上NaN

usingFrame1 + usingFrame2
0 1 2 0 NaN NaN NaN 1 NaN NaN NaN 2 NaN NaN NaN 3 -1.354216 1.162705 -0.324412 4 -1.301284 -1.050753 -0.690090 5 NaN NaN NaN 6 NaN NaN NaN 7 NaN NaN NaN

也可以使用add方法,并且可以在每个DataFrame里面补上相应的空缺的值

usingFrame1.add(usingFrame2, fill_value = 123)
0 1 2 0 123.956881 122.885948 122.287408 1 124.163835 122.016568 123.674851 2 122.401986 122.393074 122.934433 3 -1.354216 1.162705 -0.324412 4 -1.301284 -1.050753 -0.690090 5 121.357228 122.377170 123.505732 6 125.168925 123.406054 123.484112 7 123.676034 122.298690 124.761908

减乘除运算

类似地,还有subdivmul对应-*/,这里只展示下除法

usingFrame1/usingFrame2
0 1 2 0 NaN NaN NaN 1 NaN NaN NaN 2 NaN NaN NaN 3 0.813617 0.174507 -0.546603 4 -2.840929 -0.345567 -19.457934 5 NaN NaN NaN 6 NaN NaN NaN 7 NaN NaN NaN
usingFrame1.mul(usingFrame2, fill_value = 1)
0 1 2 0 0.956881 -0.114052 -0.712592 1 1.163835 -0.983432 0.674851 2 -0.598014 -0.606926 -0.065567 3 0.453633 0.171018 -0.279839 4 -1.419483 -0.890847 -0.027198 5 -1.642772 -0.622830 0.505732 6 2.168925 0.406054 0.484112 7 0.676034 -0.701310 1.761908

其他运算

pandas还提供了absminmax等一系列数值运算的方法,这里就不进行罗列了

其他相关操作

快速统计

usingFrame1.describe()
0 1 2 count 5.000000 5.000000 5.000000 mean -0.218593 -0.195363 -0.087937 std 1.302936 0.611425 0.634632 min -2.008146 -0.983432 -0.727477 25% -0.607522 -0.606926 -0.712592 50% -0.598014 -0.114052 -0.065567 75% 0.956881 0.172754 0.391103 max 1.163835 0.554840 0.674851

行列排序

根据行标签排序

usingTable.sort_index(ascending = False)
User_id Merchant_id Coupon_id Discount_rate Distance Date_received Date Nation e 1439408.0 2632.0 8591 20:1 0 20160613 null China d 1439408.0 2632.0 1078 20:1 0 20160319 null China c NaN NaN None None None None None None b NaN NaN None None None None None None a 1439408.0 2632.0 null null 0 null 20160217 China

根据列标签排序

usingTable.sort_index(axis = 1)
Coupon_id Date Date_received Discount_rate Distance Merchant_id Nation User_id a null 20160217 null null 0 2632.0 China 1439408.0 b None None None None None NaN None NaN c None None None None None NaN None NaN d 1078 null 20160319 20:1 0 2632.0 China 1439408.0 e 8591 null 20160613 20:1 0 2632.0 China 1439408.0

也可以根据多个列进行排序

usingTable.sort_values(by = ['Date', 'Distance'])
User_id Merchant_id Coupon_id Discount_rate Distance Date_received Date Nation a 1439408.0 2632.0 null null 0 null 20160217 China d 1439408.0 2632.0 1078 20:1 0 20160319 null China e 1439408.0 2632.0 8591 20:1 0 20160613 null China b NaN NaN None None None None None None c NaN NaN None None None None None None

转置

usingTable.T
a b c d e User_id 1.43941e+06 NaN NaN 1.43941e+06 1.43941e+06 Merchant_id 2632 NaN NaN 2632 2632 Coupon_id null None None 1078 8591 Discount_rate null None None 20:1 20:1 Distance 0 None None 0 0 Date_received null None None 20160319 20160613 Date 20160217 None None null null Nation China None None China China

写到这里,大部分常用功能都介绍完了,更多功能大家可以自行摸索

参考资料

  • Python for Data Analysis
  • 十分钟搞定pandas

由于本人水平有限,难免出现错漏之处,欢迎批评指正

原创粉丝点击