- pandas模块简介
- 创建对象
- 读取文件
- 查看数据
- 修改数据
- 删除数据
- 数学运算
- 其他相关操作
- 参考资料
pandas模块简介
pandas
是基于NumPy
的一个常用的数据分析库,拥有丰富的方法,能够完成数据读取、数据清洗和数据统计和数据可视化等一系列功能,大大提升开发效率,下面介绍一些常用的功能,更多功能可以自行查阅官方文档。
创建对象
pandas
有两种书籍格式
- Series
- DataFrame
简单理解的话,Series
是值的集合,DataFrame
是Series
的集合
Series
import pandas as pdfrom pandas import Series,DataFrame
data = [5,6,7,8,9]obj = Series(data)
obj
0 5 1 6 2 7 3 8 4 9 dtype: int64
左边的一列是索引index
,右边的一列是值value
obj.index
RangeIndex(start=0, stop=5, step=1)
obj.values
array([5, 6, 7, 8, 9], dtype=int64)
可以使用指定索引的方法来创建Series
obj2 = Series(data, index = ['d','a','m','p','c'])
obj2
d 5a 6m 7p 8c 9dtype: int64
也可以直接使用字典对象来创建
dataDict = {'History':72, 'Math':96, 'Physices':93}obj3 = Series(dataDict)
obj3
History 72Math 96Physices 93dtype: int64
DataFrame
import numpy as np
直接根据 np.array
创建
frame = DataFrame(np.random.randn(5,3))
frame
| 0 | 1 | 2 | 0 | -0.845079 -1.547513 -1.103420 1 | -0.180325 -0.124631 -0.456067 2 | 0.066418 0.934352 1.889571 3 | 1.987873 -0.603892 -0.446530 4 | 1.825149 0.920741 0.744808 指定行列
frame1 = DataFrame(np.random.randn(5,3),index = ['a', 'b', 'c', 'd', 'e'],columns = ['f', 'g', 'h'])
frame1
| f | g | h | a | 0.382079 0.851372 -0.187187 b | -0.568949 -0.867452 0.650982 c | 1.127272 -1.008783 1.464607 d | -1.225070 -1.528853 0.831197 e | -3.215866 1.588226 1.500100 通过字典来创建
dataDict = {'History':[72,84,81], 'Math':[94,67,89], 'Physices':[83,98,79]}frame2 = DataFrame(dataDict, index = ['Eric', 'Toms', 'Peter'])
frame2
| History | Math | Physices | Eric | 72 94 83 Toms | 84 67 98 Peter | 81 89 79 读取文件
pandas
支持读取SQL、CSV、HTML、SAS和EXCEL等多种格式的文件,使用自带的read_xxx
方法就可以,下面展示CSV文件的读取
path = 'G:\\o2o_coupon_prediction\\ccf_offline_stage1_train.csv'testTable = pd.read_csv(path)
testTable
| User_id | Merchant_id | Coupon_id | Discount_rate | Distance | Date_received | Date | 0 | 1439408 2632 null null 0 null 20160217 1 | 1439408 4663 11002 150:20 1 20160528 null 2 | 1439408 2632 8591 20:1 0 20160217 null 3 | 1439408 2632 1078 20:1 0 20160319 null 4 | 1439408 2632 8591 20:1 0 20160613 null 5 | 1439408 2632 null null 0 null 20160516 6 | 1439408 2632 8591 20:1 0 20160516 20160613 7 | 1832624 3381 7610 200:20 0 20160429 null 8 | 2029232 3381 11951 200:20 1 20160129 null 9 | 2029232 450 1532 30:5 0 20160530 null 10 | 2029232 6459 12737 20:1 0 20160519 null 11 | 2029232 6459 null null 0 null 20160626 12 | 2029232 6459 null null 0 null 20160519 13 | 2747744 6901 1097 50:10 null 20160606 null 14 | 196342 1579 null null 1 null 20160606 15 | 196342 1579 10698 20:1 1 20160606 null 16 | 2223968 3381 9776 10:5 2 20160129 null 17 | 73611 2099 12034 100:10 null 20160207 null 18 | 163606 1569 5054 200:30 10 20160421 null 19 | 3273056 4833 7802 200:20 10 20160130 null 20 | 94107 3381 7610 200:20 2 20160412 null 21 | 253750 8390 null null 0 null 20160327 22 | 253750 6901 2366 30:5 0 20160518 null 23 | 253750 8390 7531 20:5 0 20160327 null 24 | 343660 4663 11002 150:20 null 20160528 null 25 | 376492 1041 13490 30:5 2 20160127 null 26 | 1964720 7884 null null 10 null 20160115 27 | 1964720 7884 6704 20:1 10 20160215 null 28 | 4191584 3051 null null 0 null 20160519 29 | 4191584 3051 null null 0 null 20160516 … | … … … … … … … 1754854 | 179830 6284 4567 50:10 1 20160124 null 1754855 | 179830 6284 7379 50:10 1 20160124 null 1754856 | 179830 3710 13056 150:10 4 20160124 null 1754857 | 179830 2099 12034 100:10 1 20160124 null 1754858 | 179830 4660 1480 100:10 0 20160125 null 1754859 | 179830 5341 7751 50:10 0 20160123 null 1754860 | 179830 7555 17 30:5 7 20160123 null 1754861 | 179830 760 3237 20:5 0 20160124 null 1754862 | 179830 2099 12034 100:10 1 20160203 null 1754863 | 343276 4206 null null 5 null 20160425 1754864 | 653784 195 null null 0 null 20160418 1754865 | 653784 195 null null 0 null 20160416 1754866 | 3795296 5341 111 30:5 0 20160129 null 1754867 | 188086 6568 null null 0 null 20160415 1754868 | 188086 6568 null null 0 null 20160626 1754869 | 188086 6568 4723 30:1 0 20160415 null 1754870 | 188086 6568 null null 0 null 20160417 1754871 | 188086 6568 null null 0 null 20160604 1754872 | 212662 3532 null null 1 null 20160308 1754873 | 212662 2934 5686 30:5 2 20160321 20160330 1754874 | 212662 2934 null null 2 null 20160513 1754875 | 212662 2934 null null 2 null 20160512 1754876 | 212662 3532 5267 30:5 1 20160322 null 1754877 | 212662 3021 3739 30:1 6 20160504 20160508 1754878 | 212662 2934 5686 30:5 2 20160321 20160322 1754879 | 212662 3532 null null 1 null 20160322 1754880 | 212662 3021 3739 30:1 6 20160508 20160602 1754881 | 212662 2934 null null 2 null 20160321 1754882 | 752472 7113 1633 50:10 6 20160613 null 1754883 | 752472 3621 2705 20:5 0 20160523 null 1754884 rows × 7 columns
查看数据
查看头尾
缺省值均为5
testTable.head(3)
| User_id | Merchant_id | Coupon_id | Discount_rate | Distance | Date_received | Date | 0 | 1439408 2632 null null 0 null 20160217 1 | 1439408 4663 11002 150:20 1 20160528 null 2 | 1439408 2632 8591 20:1 0 20160217 null testTable.tail(3)
| User_id | Merchant_id | Coupon_id | Discount_rate | Distance | Date_received | Date | 1754881 | 212662 2934 null null 2 null 20160321 1754882 | 752472 7113 1633 50:10 6 20160613 null 1754883 | 752472 3621 2705 20:5 0 20160523 null 为了缩小数据量,方便后面使用介绍,只取整个CSV文件的前五行并修改索引
usingTable = testTable.head(5).copy()usingTable.index = ['a', 'b', 'c', 'd', 'e']
usingTable
| User_id | Merchant_id | Coupon_id | Discount_rate | Distance | Date_received | Date | a | 1439408 2632 null null 0 null 20160217 b | 1439408 4663 11002 150:20 1 20160528 null c | 1439408 2632 8591 20:1 0 20160217 null d | 1439408 2632 1078 20:1 0 20160319 null e | 1439408 2632 8591 20:1 0 20160613 null 行列筛选
因为pandas
是Numpy
的上层库,NumPy
的选择和设置表达式都能够使用,但是为了代码的健壮性,我们推荐使用pandas
的数据访问方式:.loc和.iloc,更多内容可以查看http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
序号筛选
直接根据序号筛选
usingTable[:2]
| User_id | Merchant_id | Coupon_id | Discount_rate | Distance | Date_received | Date | a | 1439408 2632 null null 0 null 20160217 b | 1439408 4663 11002 150:20 1 20160528 null iloc
方法可以根据序号筛选出想要的范围
usingTable.iloc[2]
User_id 1439408Merchant_id 2632Coupon_id 8591Discount_rate 20:1Distance 0Date_received 20160217Date nullName: c, dtype: object
usingTable.iloc[3:5]
| User_id | Merchant_id | Coupon_id | Discount_rate | Distance | Date_received | Date | d | 1439408 2632 1078 20:1 0 20160319 null e | 1439408 2632 8591 20:1 0 20160613 null usingTable.iloc[3:5, 1:3]
| Merchant_id | Coupon_id | d | 2632 1078 e | 2632 8591 索引筛选
根据索引值筛选出特定的行
usingTable.loc['b']
User_id 1439408Merchant_id 4663Coupon_id 11002Discount_rate 150:20Distance 1Date_received 20160528Date nullName: b, dtype: object
标签筛选
usingTable['User_id']
a 1439408b 1439408c 1439408d 1439408e 1439408Name: User_id, dtype: int64
加上索引就可以筛选出某一个标量
usingTable['User_id']['b']
1439408
布尔值筛选
bool = usingTable['Date_received'] != 'null'
bool
a Falseb Truec Trued Truee TrueName: Date_received, dtype: bool
usingTable[bool]
| User_id | Merchant_id | Coupon_id | Discount_rate | Distance | Date_received | Date | b | 1439408 4663 11002 150:20 1 20160528 null c | 1439408 2632 8591 20:1 0 20160217 null d | 1439408 2632 1078 20:1 0 20160319 null e | 1439408 2632 8591 20:1 0 20160613 null # 增加数据增加新的列,`pandas`可以根据标签自动创建
usingTable['Nation'] = 'China'
usingTable
| User_id | Merchant_id | Coupon_id | Discount_rate | Distance | Date_received | Date | Nation | a | 1439408 2632 null null 0 null 20160217 China b | 1439408 4663 11002 150:20 1 20160528 null China c | 1439408 2632 8591 20:1 0 20160217 null China d | 1439408 2632 1078 20:1 0 20160319 null China e | 1439408 2632 8591 20:1 0 20160613 null China 修改数据
设置新值
使用上面提到的筛选的方法,选中区域后就可以进行赋值操作
usingTable.loc[['b','c']] = None
usingTable
| User_id | Merchant_id | Coupon_id | Discount_rate | Distance | Date_received | Date | Nation | a | 1439408.0 2632.0 null null 0 null 20160217 China b | NaN NaN None None None None None None c | NaN NaN None None None None None None d | 1439408.0 2632.0 1078 20:1 0 20160319 null China e | 1439408.0 2632.0 8591 20:1 0 20160613 null China 缺失值处理
usingTable2 = DataFrame(np.random.randn(5,3),index = list('abcde'))for i in range(3): usingTable2.iloc[np.random.randint(0,4),np.random.randint(0,1)] = np.nanusingTable2
| 0 | 1 | 2 | a | NaN 1.324122 0.474714 b | -1.037397 -0.541610 -0.543801 c | NaN -0.528810 1.151078 d | -0.870800 NaN -0.050644 e | -1.214881 0.554842 -0.250637 可以看到上表中一共3个缺失的值,下面介绍几种常用的数据缺失的处理办法
isnull
和notnull
方法可以判断是否缺失
usingTable2.isnull()
| 0 | 1 | 2 | a | True False False b | False False False c | True False False d | False True False e | False False False 删除缺失值所在的行
usingTable2.dropna()
| 0 | 1 | 2 | b | -1.037397 -0.541610 -0.543801 e | -1.214881 0.554842 -0.250637 填补缺失值
给定值填充
usingTable2.fillna(usingTable2.mean())
| 0 | 1 | 2 | a | -1.041026 1.324122 0.474714 b | -1.037397 -0.541610 -0.543801 c | -1.041026 -0.528810 1.151078 d | -0.870800 0.202136 -0.050644 e | -1.214881 0.554842 -0.250637 不同的列用不同的值填充
usingTable2.fillna({0:'f0', 1:'f1'})
| 0 | 1 | 2 | a | f0 1.32412 0.474714 b | -1.0374 -0.54161 -0.543801 c | f0 -0.52881 1.151078 d | -0.8708 f1 -0.050644 e | -1.21488 0.554842 -0.250637 前向填充
使用缺失值的前一个值填充
usingTable2.fillna(method = 'ffill')
| 0 | 1 | 2 | a | NaN 1.324122 0.474714 b | -1.037397 -0.541610 -0.543801 c | -1.037397 -0.528810 1.151078 d | -0.870800 -0.528810 -0.050644 e | -1.214881 0.554842 -0.250637 后向填充
usingTable2.fillna(method = 'bfill')
| 0 | 1 | 2 | a | -1.037397 1.324122 0.474714 b | -1.037397 -0.541610 -0.543801 c | -0.870800 -0.528810 1.151078 d | -0.870800 0.554842 -0.050644 e | -1.214881 0.554842 -0.250637 删除数据
使用drop
方法舍弃行列
根据索引舍弃某一行
usingTable.drop('a')
| User_id | Merchant_id | Coupon_id | Discount_rate | Distance | Date_received | Date | Nation | b | NaN NaN None None None None None None c | NaN NaN None None None None None None d | 1439408.0 2632.0 1078 20:1 0 20160319 null China e | 1439408.0 2632.0 8591 20:1 0 20160613 null China 根据标签舍弃某一列
usingTable.drop('Distance', axis = 1)
| User_id | Merchant_id | Coupon_id | Discount_rate | Date_received | Date | Nation | a | 1439408.0 2632.0 null null null 20160217 China b | NaN NaN None None None None None c | NaN NaN None None None None None d | 1439408.0 2632.0 1078 20:1 20160319 null China e | 1439408.0 2632.0 8591 20:1 20160613 null China 数学运算
usingFrame1 = DataFrame(np.random.randn(5,3))usingFrame2 = DataFrame(np.random.randn(5,3), index = [3, 4, 5, 6, 7])
加法运算
usingFrame1
| 0 | 1 | 2 | 0 | 0.956881 -0.114052 -0.712592 1 | 1.163835 -0.983432 0.674851 2 | -0.598014 -0.606926 -0.065567 3 | -0.607522 0.172754 0.391103 4 | -2.008146 0.554840 -0.727477 usingFrame2
| 0 | 1 | 2 | 3 | -0.746694 0.989952 -0.715514 4 | 0.706863 -1.605593 0.037387 5 | -1.642772 -0.622830 0.505732 6 | 2.168925 0.406054 0.484112 7 | 0.676034 -0.701310 1.761908 可以直接使用+
进行计算,只计算交叉部分,剩余的补上NaN
usingFrame1 + usingFrame2
| 0 | 1 | 2 | 0 | NaN NaN NaN 1 | NaN NaN NaN 2 | NaN NaN NaN 3 | -1.354216 1.162705 -0.324412 4 | -1.301284 -1.050753 -0.690090 5 | NaN NaN NaN 6 | NaN NaN NaN 7 | NaN NaN NaN 也可以使用add方法,并且可以在每个DataFrame
里面补上相应的空缺的值
usingFrame1.add(usingFrame2, fill_value = 123)
| 0 | 1 | 2 | 0 | 123.956881 122.885948 122.287408 1 | 124.163835 122.016568 123.674851 2 | 122.401986 122.393074 122.934433 3 | -1.354216 1.162705 -0.324412 4 | -1.301284 -1.050753 -0.690090 5 | 121.357228 122.377170 123.505732 6 | 125.168925 123.406054 123.484112 7 | 123.676034 122.298690 124.761908 减乘除运算
类似地,还有sub
、div
、mul
对应-
、*
、/
,这里只展示下除法
usingFrame1/usingFrame2
| 0 | 1 | 2 | 0 | NaN NaN NaN 1 | NaN NaN NaN 2 | NaN NaN NaN 3 | 0.813617 0.174507 -0.546603 4 | -2.840929 -0.345567 -19.457934 5 | NaN NaN NaN 6 | NaN NaN NaN 7 | NaN NaN NaN usingFrame1.mul(usingFrame2, fill_value = 1)
| 0 | 1 | 2 | 0 | 0.956881 -0.114052 -0.712592 1 | 1.163835 -0.983432 0.674851 2 | -0.598014 -0.606926 -0.065567 3 | 0.453633 0.171018 -0.279839 4 | -1.419483 -0.890847 -0.027198 5 | -1.642772 -0.622830 0.505732 6 | 2.168925 0.406054 0.484112 7 | 0.676034 -0.701310 1.761908 其他运算
pandas
还提供了abs
、min
和max
等一系列数值运算的方法,这里就不进行罗列了
其他相关操作
快速统计
usingFrame1.describe()
| 0 | 1 | 2 | count | 5.000000 5.000000 5.000000 mean | -0.218593 -0.195363 -0.087937 std | 1.302936 0.611425 0.634632 min | -2.008146 -0.983432 -0.727477 25% | -0.607522 -0.606926 -0.712592 50% | -0.598014 -0.114052 -0.065567 75% | 0.956881 0.172754 0.391103 max | 1.163835 0.554840 0.674851 行列排序
根据行标签排序
usingTable.sort_index(ascending = False)
| User_id | Merchant_id | Coupon_id | Discount_rate | Distance | Date_received | Date | Nation | e | 1439408.0 2632.0 8591 20:1 0 20160613 null China d | 1439408.0 2632.0 1078 20:1 0 20160319 null China c | NaN NaN None None None None None None b | NaN NaN None None None None None None a | 1439408.0 2632.0 null null 0 null 20160217 China 根据列标签排序
usingTable.sort_index(axis = 1)
| Coupon_id | Date | Date_received | Discount_rate | Distance | Merchant_id | Nation | User_id | a | null 20160217 null null 0 2632.0 China 1439408.0 b | None None None None None NaN None NaN c | None None None None None NaN None NaN d | 1078 null 20160319 20:1 0 2632.0 China 1439408.0 e | 8591 null 20160613 20:1 0 2632.0 China 1439408.0 也可以根据多个列进行排序
usingTable.sort_values(by = ['Date', 'Distance'])
| User_id | Merchant_id | Coupon_id | Discount_rate | Distance | Date_received | Date | Nation | a | 1439408.0 2632.0 null null 0 null 20160217 China d | 1439408.0 2632.0 1078 20:1 0 20160319 null China e | 1439408.0 2632.0 8591 20:1 0 20160613 null China b | NaN NaN None None None None None None c | NaN NaN None None None None None None 转置
usingTable.T
| a | b | c | d | e | User_id | 1.43941e+06 NaN NaN 1.43941e+06 1.43941e+06 Merchant_id | 2632 NaN NaN 2632 2632 Coupon_id | null None None 1078 8591 Discount_rate | null None None 20:1 20:1 Distance | 0 None None 0 0 Date_received | null None None 20160319 20160613 Date | 20160217 None None null null Nation | China None None China China 写到这里,大部分常用功能都介绍完了,更多功能大家可以自行摸索
参考资料
- Python for Data Analysis
- 十分钟搞定pandas
由于本人水平有限,难免出现错漏之处,欢迎批评指正