[pandas] Data Types Study Notes
Source: Internet | Editor: 程序博客网 | Date: 2024/05/16 15:02
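Once you are comfortable with NumPy, pandas is the natural next step. pandas is built on top of NumPy and is both powerful and easy to use. The main learning resource is the documentation on the official pandas site; this post records my own study notes.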
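pandas data structures
The main pandas data structures are Series (a one-dimensional array) and DataFrame (a two-dimensional array). Older releases also provided Panel (three-dimensional), Panel4D (four-dimensional) and PanelND (N-dimensional), but these have since been removed from pandas. Series and DataFrame are by far the most widely used, so the notes below focus on these two.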
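Series
A Series is a one-dimensional labeled array that can hold any data type: integers, strings, floating-point numbers, Python objects, and so on. Elements of a Series can be located by label.
Creating a Series
s = pd.Series(data, index=index)
data can be:
- a Python dict
- a NumPy ndarray
- a scalar value
From an ndarray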
In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: s = pd.Series(np.random.randn(5), index=list('ABCDE'))

In [4]: s
Out[4]:
A   -1.130657
B   -1.539251
C    1.503126
D    1.266908
E    0.335561
dtype: float64
From a dict
In [19]: d = {'a': 1, 'b': 2, 'c': 3}

In [20]: pd.Series(d)
Out[20]:
a    1
b    2
c    3
dtype: int64

In [21]: pd.Series(d, index=['b', 'c', 'd', 'a'])
Out[21]:
b     2
c     3
d   NaN
a     1
dtype: float64
From a scalar
In [22]: pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
Out[22]:
a    5
b    5
c    5
d    5
e    5
dtype: float64
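The three construction routes above can be collected into one runnable script. This is only a sketch: the variable names and the seeded random generator are my own choices for reproducibility, not part of the original session.

```python
import numpy as np
import pandas as pd

# From an ndarray: the index must have the same length as the data.
rng = np.random.default_rng(0)  # seeded generator (an assumption, for repeatable runs)
s_arr = pd.Series(rng.standard_normal(5), index=list("ABCDE"))

# From a dict: keys become labels; a requested label missing from the dict becomes NaN.
d = {"a": 1, "b": 2, "c": 3}
s_dict = pd.Series(d, index=["b", "c", "d", "a"])

# From a scalar: the value is repeated for every label in the index.
s_scalar = pd.Series(5.0, index=["a", "b", "c", "d", "e"])

print(s_arr)
print(s_dict)
print(s_scalar)
```

Note that the dict-based Series switches to float64 once a NaN has to be stored, which matches the dtype change visible in the transcript above.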
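Series operations
A Series behaves like an ndarray
In [24]: s[0]
Out[24]: -0.06036422206791571

In [25]: s[:3]
Out[25]:
a   -0.060364
b    0.315560
c   -0.520548
dtype: float64

In [26]: s[s > s.median()]
Out[26]:
a   -0.060364
b    0.315560
dtype: float64

In [27]: s[[4, 2, 1]]
Out[27]:
e   -1.900474
c   -0.520548
b    0.315560
dtype: float64

In [28]: np.exp(s)
Out[28]:
a    0.941422
b    1.371027
c    0.594195
d    0.912396
e    0.149498
dtype: float64
Note: in recent pandas releases, integer positional access such as s[0] or s[[4, 2, 1]] on a label-indexed Series is deprecated; s.iloc[0] and s.iloc[[4, 2, 1]] are the explicit positional forms.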
A Series behaves like a dictionary
In [30]: s['a']
Out[30]: -0.06036422206791571

In [31]: 'e' in s
Out[31]: True

In [32]: s.get('f')

In [33]: s.get('f', np.nan)
Out[33]: nan

In [35]: s['f'] = 3.

In [36]: s
Out[36]:
a   -0.060364
b    0.315560
c   -0.520548
d   -0.091681
e   -1.900474
f    3.000000
dtype: float64
If the label does not exist, s.get returns None, so nothing is printed. We can also pass a default value such as NaN to be returned for a missing label.
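As a small illustration of the dictionary-style behaviour described above, the sketch below contrasts get with plain indexing for a missing label. The Series contents and the labels "z" and "q" are made up for the example.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])

print("a" in s)             # membership tests the labels, not the values
print(s.get("z"))           # missing label: get returns None instead of raising
print(s.get("z", np.nan))   # missing label with an explicit default

try:
    s["z"]                  # plain indexing raises for a missing label
except KeyError:
    print("KeyError for missing label")

s["z"] = 3.0                # assigning to a new label appends it to the Series
print(s)
```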
Arithmetic
A Series supports +, -, *, / as well as NumPy functions such as exp.
In [6]: s + s
Out[6]:
a    0.648688
b   -2.729308
c   -0.919524
d    0.876880
e    5.863378
f    6.000000
dtype: float64

In [7]: s * 2
Out[7]:
a    0.648688
b   -2.729308
c   -0.919524
d    0.876880
e    5.863378
f    6.000000
dtype: float64

In [8]: np.exp(s)
Out[8]:
a     1.383123
b     0.255469
c     0.631434
d     1.550287
e    18.759286
f    20.085537
dtype: float64
When two Series with different indexes are combined, the result is NaN at every label that does not appear in both:
In [9]: s[1:] + s[:-1]
Out[9]:
a         NaN
b   -2.729308
c   -0.919524
d    0.876880
e    5.863378
f         NaN
dtype: float64
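The alignment rule can be checked on a small Series with known values. The sketch below also shows two common ways of dealing with the resulting NaN values; the choice of add with fill_value=0 and of dropna is mine, not something the original notes cover.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0], index=list("abcd"))

# iloc makes the positional slicing explicit; the operands carry labels b..d and a..c.
total = s.iloc[1:] + s.iloc[:-1]
print(total)                 # 'a' and 'd' exist on one side only, so they are NaN

# Series.add with fill_value treats the missing side as 0 instead of producing NaN.
filled = s.iloc[1:].add(s.iloc[:-1], fill_value=0)
print(filled)

# dropna() discards the labels that did not match.
print(total.dropna())
```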
A Series can also be given a name:
In [10]: s = pd.Series(np.random.randn(5), name='something')

In [11]: s
Out[11]:
0    1.447208
1   -0.546760
2    0.858622
3    0.648803
4   -0.667612
Name: something, dtype: float64
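A short sketch of what the name is useful for, assuming illustrative values: rename returns a renamed copy, and the name becomes the column label when the Series is converted to a DataFrame.

```python
import pandas as pd

s = pd.Series([1, 2, 3], name="something")

s2 = s.rename("different")   # returns a new Series; s keeps its own name
print(s.name, s2.name)

# The name is used as the column label when the Series becomes a DataFrame.
frame = s.to_frame()
print(frame.columns.tolist())
```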
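DataFrame
A DataFrame is a two-dimensional labeled data structure. Data can be located by row and column labels, which a plain NumPy array does not offer.
Creating a DataFrame
The data can come from many kinds of input:
- a dict of 1-D ndarrays, lists, dicts, or Series
- a 2-D ndarray
- a Series
- external data such as CSV or Excel files
- another DataFrame
- and so on
I will only show how to build one from a dict of Series; the other methods are much the same and are covered in the documentation.
From a dict of Series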
In [17]: d = {'one': pd.Series([1, 2, 3], index=list('abc')), 'two': pd.Series([1, 2, 3, 4], index=list('abcd'))}

In [18]: df = pd.DataFrame(d)

In [19]: df
Out[19]:
   one  two
a    1    1
b    2    2
c    3    3
d  NaN    4

In [20]: df.index
Out[20]: Index(['a', 'b', 'c', 'd'], dtype='object')

In [21]: df.columns
Out[21]: Index(['one', 'two'], dtype='object')

In [28]: df.index = ['A', 'B', 'C', 'D']  # the index can be reassigned

In [29]: df
Out[29]:
   one  two
A    1    1
B    2    2
C    3    3
D  NaN    4
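For comparison with the dict-of-Series route, here is a sketch of three of the other inputs listed earlier. The data values and variable names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Dict of equal-length lists: keys become column labels.
df_lists = pd.DataFrame({"one": [1, 2, 3], "two": [4, 5, 6]}, index=list("abc"))

# 2-D ndarray: row and column labels are passed separately.
df_array = pd.DataFrame(
    np.arange(6).reshape(3, 2), index=list("abc"), columns=["one", "two"]
)

# List of dicts: each dict is one row; a key missing from a row becomes NaN.
df_records = pd.DataFrame([{"one": 1, "two": 2}, {"one": 3}])

print(df_lists)
print(df_array)
print(df_records)
```

Unlike the dict-of-Series case, lists and arrays carry no labels of their own, so they must all have the same length and the index is supplied explicitly or defaults to 0, 1, 2, and so on.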
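Selection and arithmetic
A DataFrame can be manipulated much like a Series. Getting, setting, and deleting columns works the same way as the corresponding dict operations.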
In [22]: df['one']  # column selection by column label
Out[22]:
a     1
b     2
c     3
d   NaN
Name: one, dtype: float64

In [23]: df['three'] = df['one'] + df['two']  # create a new column, just like a dict

In [24]: df['flag'] = df['one'] > 2

In [25]: df
Out[25]:
   one  two  three   flag
a    1    1      2  False
b    2    2      4  False
c    3    3      6   True
d  NaN    4    NaN  False

In [26]: del df['two']

In [27]: three = df.pop('three')  # pop column 'three' out of df into the variable three

In [28]: df
Out[28]:
   one   flag
a    1  False
b    2  False
c    3   True
d  NaN  False

In [29]: three
Out[29]:
a     2
b     4
c     6
d   NaN
Name: three, dtype: float64

In [30]: df['foo'] = 'bar'

In [31]: df
Out[31]:
   one   flag  foo
a    1  False  bar
b    2  False  bar
c    3   True  bar
d  NaN  False  bar

In [32]: df['one_trunc'] = df['one'][:2]  # rows without a matching label are filled with NaN

In [33]: df
Out[33]:
   one   flag  foo  one_trunc
a    1  False  bar          1
b    2  False  bar          2
c    3   True  bar        NaN
d  NaN  False  bar        NaN

In [34]: df.insert(1, 'bar', df['one'])  # choose where the new column goes: insert 'bar' at column position 1

In [35]: df
Out[35]:
   one  bar   flag  foo  one_trunc
a    1    1  False  bar          1
b    2    2  False  bar          2
c    3    3   True  bar        NaN
d  NaN  NaN  False  bar        NaN

In [36]: df.assign(ration=df['one'] / df['bar'])  # assign returns a new DataFrame; df itself is not modified
Out[36]:
   one  bar   flag  foo  one_trunc  ration
a    1    1  False  bar          1       1
b    2    2  False  bar          2       1
c    3    3   True  bar        NaN       1
d  NaN  NaN  False  bar        NaN     NaN

In [37]: df
Out[37]:
   one  bar   flag  foo  one_trunc
a    1    1  False  bar          1
b    2    2  False  bar          2
c    3    3   True  bar        NaN
d  NaN  NaN  False  bar        NaN

In [38]: df.loc['b']  # loc selects a row by its label
Out[38]:
one              2
bar              2
flag         False
foo            bar
one_trunc        2
Name: b, dtype: object

In [39]: df.iloc[2]  # iloc selects by integer position: iloc[row list, column list]
Out[39]:
one             3
bar             3
flag         True
foo           bar
one_trunc     NaN
Name: c, dtype: object

In [40]: df.iloc[2, :-1]  # row at position 2, every column except the last
Out[40]:
one        3
bar        3
flag    True
foo      bar
Name: c, dtype: object

In [44]: df = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))

In [46]: df2 = pd.DataFrame(np.random.randn(7, 3), columns=list('ABC'))

In [47]: df + df2  # adding two DataFrames: positions whose labels do not match become NaN
Out[47]:
          A         B         C   D
0  1.239121  2.705995  1.365740 NaN
1  1.507655 -1.092202  0.083471 NaN
2 -0.485961 -0.131136 -1.677334 NaN
3 -0.858146  0.319006 -1.995003 NaN
4 -1.487327  2.030991 -0.565237 NaN
5  0.239241 -0.713864 -1.635968 NaN
6 -1.656484 -0.420657  0.125534 NaN
7       NaN       NaN       NaN NaN
8       NaN       NaN       NaN NaN
9       NaN       NaN       NaN NaN

In [48]: df - df.iloc[0]  # subtract a row from every row
Out[48]:
          A         B         C         D
0  0.000000  0.000000  0.000000  0.000000
1 -1.969524 -1.957223 -0.781471  0.278686
2 -2.215996 -0.172781 -1.736314 -1.050741
3 -2.264761 -1.402786 -2.713273 -0.247084
4 -1.157636 -1.445320 -1.985973  1.485799
5 -1.689059 -1.160161 -1.453136  0.588097
6 -3.359694 -1.415710 -0.493772 -1.002543
7 -0.889769  0.220577 -0.023013  0.024337
8 -2.223337 -0.068570 -1.117682 -0.875048
9 -0.678439 -1.591324  0.107048 -0.880545

In [49]: df - df['A']  # subtracting a column does not behave like subtracting a row: the Series index is matched against the column labels
Out[49]:
    A   B   C   D   0   1   2   3   4   5   6   7   8   9
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

In [52]: index = pd.date_range('1/1/2000', periods=8)  # not shown in the original session; assumed from the date labels below

In [53]: df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=list('ABC'))

In [54]: df
Out[54]:
                   A         B         C
2000-01-01  0.581713  0.229262 -0.174359
2000-01-02  1.355298 -0.901488  1.082112
2000-01-03 -0.963151  0.285010 -1.275164
2000-01-04 -0.104592  0.744454 -0.722504
2000-01-05 -0.794036  0.268566  1.721357
2000-01-06 -1.415143 -0.863292 -0.674650
2000-01-07  0.505573  0.451317 -0.390972
2000-01-08 -1.341107  0.549922  0.120314

In [59]: df.sub(df['A'], axis=0)  # sub with axis=0 performs the real column-wise subtraction
Out[59]:
            A         B         C
2000-01-01  0 -0.352452 -0.756072
2000-01-02  0 -2.256785 -0.273186
2000-01-03  0  1.248161 -0.312013
2000-01-04  0  0.849046 -0.617911
2000-01-05  0  1.062602  2.515393
2000-01-06  0  0.551851  0.740493
2000-01-07  0 -0.054256 -0.896545
2000-01-08  0  1.891028  1.461421

In [60]: df
Out[60]:
                   A         B         C
2000-01-01  0.581713  0.229262 -0.174359
2000-01-02  1.355298 -0.901488  1.082112
2000-01-03 -0.963151  0.285010 -1.275164
2000-01-04 -0.104592  0.744454 -0.722504
2000-01-05 -0.794036  0.268566  1.721357
2000-01-06 -1.415143 -0.863292 -0.674650
2000-01-07  0.505573  0.451317 -0.390972
2000-01-08 -1.341107  0.549922  0.120314

In [70]: df * 5 + 2  # element-wise arithmetic
Out[70]:
                   A         B          C
2000-01-01  4.908567  3.146308   1.128205
2000-01-02  8.776488 -2.507439   7.410558
2000-01-03 -2.815756  3.425049  -4.375820
2000-01-04  1.477038  5.722268  -1.612518
2000-01-05 -1.970178  3.342832  10.606786
2000-01-06 -5.075717 -2.316460  -1.373250
2000-01-07  4.527865  4.256584   0.045139
2000-01-08 -4.705535  4.749608   2.601572

In [71]: 1 / df
Out[71]:
                   A         B         C
2000-01-01  1.719060  4.361830 -5.735293
2000-01-02  0.737845 -1.109277  0.924119
2000-01-03 -1.038259  3.508651 -0.784213
2000-01-04 -9.560923  1.343267 -1.384076
2000-01-05 -1.259389  3.723473  0.580937
2000-01-06 -0.706642 -1.158357 -1.482250
2000-01-07  1.977954  2.215738 -2.557726
2000-01-08 -0.745653  1.818441  8.311556

In [74]: df1 = pd.DataFrame({'a': [1, 0, 1], 'b': [0, 1, 1]}, dtype=bool)

In [75]: df2 = pd.DataFrame({'a': [0, 1, 1], 'b': [1, 1, 0]}, dtype=bool)

In [82]: df1
Out[82]:
       a      b
0   True  False
1  False   True
2   True   True

In [83]: df2
Out[83]:
       a      b
0  False   True
1   True   True
2   True  False

# The following cells demonstrate Boolean operations

In [84]: df1 & df2
Out[84]:
       a      b
0  False  False
1  False   True
2   True  False

In [85]: df1 | df2
Out[85]:
      a     b
0  True  True
1  True  True
2  True  True

In [86]: df1 ^ df2
Out[86]:
       a      b
0   True   True
1   True  False
2  False   True

In [87]: -df1  # logical negation; ~df1 is the more common spelling
Out[87]:
       a      b
0  False   True
1   True  False
2  False  False

In [112]: df = pd.DataFrame({'foot1': np.random.randn(5), 'foot2': np.random.randn(5)})

In [113]: df
Out[113]:
      foot1     foot2
0  0.953419 -0.901983
1 -0.155681 -0.143213
2 -0.164418  1.519970
3  0.699752 -0.398224
4 -0.550058  2.115899

In [114]: df.T  # matrix transpose
Out[114]:
              0         1         2         3         4
foot1  0.953419 -0.155681 -0.164418  0.699752 -0.550058
foot2 -0.901983 -0.143213  1.519970 -0.398224  2.115899

In [116]: df.T.dot(df)  # matrix multiplication
Out[116]:
          foot1     foot2
foot1  1.752495 -2.530108
foot2 -2.530108  7.780001

In [120]: np.exp(df)  # NumPy functions work here too
Out[120]:
      foot1     foot2
0  2.594566  0.405764
1  0.855832  0.866569
2  0.848387  4.572087
3  2.013253  0.671512
4  0.576916  8.297037
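The difference between row-wise and column-wise subtraction is easier to see on a tiny DataFrame with known values. The sketch below uses made-up data to compare the three forms used in the session above.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0]})

# Subtracting a row: the Series index (A, B) lines up with the column labels.
by_row = df - df.iloc[0]

# Subtracting a column with '-': the Series index (0, 1, 2) is matched against
# the column labels (A, B); nothing lines up, so every cell is NaN.
misaligned = df - df["A"]

# sub(..., axis=0) matches the Series index against the row labels instead.
by_col = df.sub(df["A"], axis=0)

print(by_row)
print(misaligned)
print(by_col)
```

In other words, the operator form always aligns a Series with the columns, and the axis argument of sub (and of add, mul, div) is how the alignment is switched to the rows.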
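Summary
That concludes this introduction to the two main pandas data structures, Series and DataFrame: how to create them, and the common arithmetic, selection, insertion, deletion, and modification operations.
The data model is easy to understand. Compared with a plain array, a Series simply adds a label to each element, so it can be thought of as a "dict": the labels correspond to the keys and the data to the values. Likewise, a DataFrame adds row and column labels to a plain matrix. Each column is a Series, so a column label is needed on top of the row labels. A DataFrame can therefore be viewed as a "dict of dicts", where the keys of the outer dict correspond to the column labels.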
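The "dict of dicts" picture can be tried out directly. The following sketch, with invented keys and values, builds a DataFrame from a nested dict and converts it back.

```python
import pandas as pd

# Outer keys are column labels, inner keys are row labels.
nested = {"one": {"a": 1, "b": 2}, "two": {"a": 3, "b": 4}}
df = pd.DataFrame(nested)

# Column first, then row, in the same order as nested["two"]["b"].
value = df["two"]["b"]
print(value)

# A single column converts back to a plain dict of label -> value.
column_as_dict = df["one"].to_dict()
print(column_as_dict)

# The whole frame converts back to the nested form.
round_trip = df.to_dict()
print(round_trip == nested)
```

The analogy is only a mental model: unlike a real nested dict, every column of a DataFrame shares the same set of row labels.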
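By extension, the n-dimensional data of a Panel resembles n nested dicts, with the labels of the outermost dimension corresponding to the keys of the outermost dict.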