Ten Minutes to pandas: Study Notes
Source: Internet | Editor: 程序博客网 | Date: 2024/06/10 08:39
Creating Objects
- Create a Series by passing a list of values
    import pandas as pd
    import numpy as np

    # default integer index
    s = pd.Series([1, 3, 5, np.nan, 6, 8])
    print(s)

    output:
    0      1
    1      3
    2      5
    3    NaN
    4      6
    5      8
    dtype: float64
- Create a datetime index
    pd.date_range(start=None, end=None, periods=None, freq='D', tz=None, normalize=False, name=None, closed=None)

    purpose: build a DatetimeIndex
    start: first timestamp of the index
    end: last timestamp of the index
    periods: number of timestamps to generate (with closed=None the result has exactly this many entries; with closed='left' or 'right' it has one fewer)
    freq: spacing between timestamps, default 'D' (one day)
    normalize: normalize the start/end dates to midnight before generating the range
    name: name given to the index
    closed: 'left' keeps the left endpoint and drops the right one, 'right' keeps the right endpoint and drops the left one; the default None keeps both (pandas 1.4 introduced inclusive as the replacement, and pandas 2.0 removed closed)

    s = pd.date_range('20130101', periods=6, freq='5H', normalize=True, name='time', closed='right')
    print(s)

    output:
    DatetimeIndex(['2013-01-01 05:00:00', '2013-01-01 10:00:00',
                   '2013-01-01 15:00:00', '2013-01-01 20:00:00',
                   '2013-01-02 01:00:00'],
                  dtype='datetime64[ns]', name=u'time', freq='5H')
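To make the interaction between `start`, `end`, and `periods` concrete, here is a minimal sketch that builds the same daily index two ways. The dates and the index name are illustrative, and `closed` is left out on purpose so the snippet should run on both older and newer pandas releases.

```python
import pandas as pd

# Six daily timestamps starting on 2013-01-01 (freq="D" is the default).
by_periods = pd.date_range("2013-01-01", periods=6, freq="D", name="time")

# The same range expressed with explicit start and end dates.
by_end = pd.date_range(start="2013-01-01", end="2013-01-06", name="time")

print(by_periods)
print(list(by_periods) == list(by_end))
```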
- np.random.randn(d0, d1, …, dn) returns one or more samples drawn from the standard normal distribution
    np.random.randn(3, 2)

    output:
    [[-0.66203849  0.82071427]
     [ 0.0292031   0.01885139]
     [-0.24398997  0.30936218]]
- np.random.rand(d0, d1, …, dn) returns one or more samples drawn from the uniform distribution over [0, 1)
    np.random.rand(3, 2)

    output:
    [[ 0.03652049  0.87310609]
     [ 0.62958535  0.46013806]
     [ 0.36548056  0.13320911]]
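The difference between the two generators can be checked directly. The sketch below assumes a fixed seed, which is not part of the original notes and is only there to make the run repeatable.

```python
import numpy as np

np.random.seed(0)  # assumed seed, only to make the run repeatable

normal = np.random.randn(3, 2)   # standard normal samples, shape (3, 2)
uniform = np.random.rand(3, 2)   # uniform samples on [0, 1), shape (3, 2)

print(normal.shape, uniform.shape)
# Every uniform sample lies in the half-open interval [0, 1).
print(bool(((uniform >= 0) & (uniform < 1)).all()))
```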
- Create a DataFrame by passing a NumPy array, with a datetime index and labeled columns
    df = pd.DataFrame(np.random.randn(6, 4),
                      index=pd.date_range('20130101', periods=6),
                      columns=list('ABCD'))
    print(df)

    output:
                       A         B         C         D
    2013-01-01  0.318379  0.417183 -1.302340  0.336256
    2013-01-02  1.275668 -1.024254 -2.260727 -0.137999
    2013-01-03 -1.178882 -1.158869  1.729817 -0.382903
    2013-01-04  1.669776  0.685672 -0.150533 -0.158356
    2013-01-05 -0.688819 -0.641449 -0.192594 -0.827298
    2013-01-06  1.783666  0.106125 -0.303890 -0.818119
- Create a DataFrame by passing a dict of objects that can be converted to series-like values
    df2 = pd.DataFrame({
        'A': 1.,
        'B': pd.Timestamp('20130102'),
        'C': pd.Series(1, index=list(range(4)), dtype='float32'),
        'D': np.array([3] * 4, dtype='int32'),
        'E': pd.Categorical(["test", "train", "test", "train"]),
        'F': 'foo'
    })
    print(df2)

    output:
       A          B  C  D      E    F
    0  1 2013-01-02  1  3   test  foo
    1  1 2013-01-02  1  3  train  foo
    2  1 2013-01-02  1  3   test  foo
    3  1 2013-01-02  1  3  train  foo
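Viewing Data
- df.dtypes shows the dtype of each column
- df.head() shows the first rows of the DataFrame
- df.tail() shows the last rows of the DataFrame
- df.index shows the index
- df.columns shows the column names
- df.values returns the DataFrame's values as a NumPy array
- df.describe() shows a quick statistical summary of the data
- df.T transposes the data, so rows become columns and columns become rows
- df.sort_index(axis=0, ascending=False) sorts by the labels along an axis
    df.sort_index parameters:
    axis: 0 sorts by the row index, 1 sorts by the column labels
    ascending: True for ascending order, False for descending order
    inplace: defaults to False, which returns a new sorted object; True sorts the existing object in place
- df.sort_values(by='B', ascending=True) sorts by values, where 'B' is a column name
- df['A'] selects a single column and yields a Series; it is equivalent to df.A
- df[0:3] slices rows by position
- df['20130102':'20130104'] slices rows by index value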
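The sorting and slicing calls listed above can be tried on a small frame with fixed values. This is only a sketch; the frame and the variable names are made up for illustration.

```python
import pandas as pd

dates = pd.date_range("2013-01-01", periods=4)
frame = pd.DataFrame(
    {"A": [3.0, 1.0, 4.0, 2.0], "B": [0.5, -1.0, 2.0, 0.0]},
    index=dates,
)

newest_first = frame.sort_index(axis=0, ascending=False)  # sort rows by index label
by_b = frame.sort_values(by="B")                          # sort rows by the values in B
first_two = frame[0:2]                                    # positional slice, end excluded
by_label = frame["2013-01-02":"2013-01-03"]               # label slice, both ends included

print(newest_first)
print(by_b)
print(first_two)
print(by_label)
```

Note that the positional slice excludes its end point while the label slice includes both end points.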
Selection by Label
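- df.loc[s[0]] gets a cross-section by label, where s is the date index, e.g. s = pd.date_range('20130101', periods=6)
- df.loc[:, ['A', 'B']] selects on multiple axes by label
- df.loc['20130102':'20130104', ['A', 'B']] slices by label; both endpoints are included
    print(df.loc['20130102':'20130104', ['A', 'B']])

    output:
                       A         B
    2013-01-02  0.383816  1.474808
    2013-01-03  0.874606  0.094928
    2013-01-04  1.437224  0.761042
- Reduce the dimensions of the returned object
    print(df.loc['20130102', ['A', 'B']])

    output:
    A    1.212112
    B   -0.173215
- Get a scalar value
    print(df.loc['20130102', 'A'])

    output:
    0.46911229990718628
- Fast access to a scalar (equivalent to the previous method)
    s = pd.date_range('20130101', periods=6)
    print(df.at[s[0], 'A'])  # passing the string '20130101' as the first argument raises an exception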
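A compact sketch of the label-based selections above, on a small frame with known values so the results are easy to verify. The frame and names are illustrative.

```python
import pandas as pd

dates = pd.date_range("2013-01-01", periods=4)
frame = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0, 4.0], "B": [10.0, 20.0, 30.0, 40.0], "C": [0.1, 0.2, 0.3, 0.4]},
    index=dates,
)

row = frame.loc[dates[0]]                                  # one row, returned as a Series
block = frame.loc["2013-01-02":"2013-01-03", ["A", "B"]]   # label slice, both ends included
pair = frame.loc[dates[1], ["A", "B"]]                     # dimension reduced to a Series
scalar = frame.loc[dates[1], "A"]                          # single value
fast = frame.at[dates[1], "A"]                             # same value through .at

print(row)
print(block)
print(pair)
print(scalar, fast)
```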
Selection by Position
- Select by integer position
    df.iloc[3]        # the row at position 3, returned as a Series
    df.iloc[3][0]     # the first value of that row
    df.iloc[3:5, 0:2]

    output:
                       A         B
    2013-01-04  0.467402  0.052802
    2013-01-05 -0.318575  0.549730

    df.iloc[[1, 2, 4], [0, 2]]   # select by lists of integer positions

    output:
                       A         C
    2013-01-02  1.212112  0.119209
    2013-01-03 -0.861849 -0.494929
    2013-01-05 -0.424972  0.276232
- Fast access to a scalar by position (equivalent to df.iloc[1, 1])
    In [38]: df.iat[1, 1]
    Out[38]: -0.17321464905330858
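The positional selections can be checked on a frame whose values are simply 0 to 11, so each result is predictable. The frame is an illustrative assumption.

```python
import numpy as np
import pandas as pd

# Values 0..11 laid out in 4 rows and 3 columns.
frame = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["A", "B", "C"])

row = frame.iloc[3]                    # row at position 3
block = frame.iloc[1:3, 0:2]           # rows 1-2, columns 0-1 (end positions excluded)
picked = frame.iloc[[0, 2], [0, 2]]    # explicit lists of row and column positions
scalar = frame.iat[1, 1]               # fast scalar access by position

print(row.tolist())
print(block)
print(picked)
print(scalar)
```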
Boolean Indexing
- Select rows using a condition on the values of a single column
    df[df.A > 0]

    output:
                       A         B         C         D
    2013-01-01  0.509229  0.313454  0.535203 -1.537080
    2013-01-06  1.609375  1.464626 -0.737054  0.093372

    print(df[df > 0])   # values that do not satisfy the condition are replaced with NaN

    output:
                       A         B         C         D
    2013-01-01  0.980258  0.338144       NaN  0.845370
    2013-01-02       NaN  1.104085  0.341050  1.075445
    2013-01-03  0.171871       NaN  0.797777  0.536634
    2013-01-04       NaN       NaN       NaN  0.758390
    2013-01-05       NaN       NaN  1.801255  1.780775
    2013-01-06       NaN       NaN  2.266597  1.060540
- Filter with isin
    df2 = df.copy()
    df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
    print(df2)
    print(df2[df2['E'].isin(['two', 'four'])])

    output (filtered rows):
                       A         B         C         D     E
    2013-01-03 -1.001488  0.465646 -0.330277  0.722562   two
    2013-01-05  0.258179 -0.727527  0.860856 -0.171767  four
- Set a value by position: df.iat[0, 1] = 11
- Set a value by label: df.at[s[0], 'A'] = 0
- Negate the positive values, so that every value ends up non-positive. The output below includes a column F that the code shown here does not create.
    df2 = df.copy()
    df2[df2 > 0] = -df2
    print(df2)

    output:
                       A          B         C         D   F
    2013-01-01 -1.470583 -11.000000 -2.021309 -0.395401 NaN
    2013-01-02 -0.297632  -0.628088 -0.286551 -0.400202  -1
    2013-01-03 -0.584846  -0.520314 -0.795003 -1.253678  -2
    2013-01-04 -0.571857  -0.030617 -1.073085 -0.407656  -3
    2013-01-05 -0.409181  -0.217922 -1.856937 -0.681493  -4
    2013-01-06 -0.909665  -0.031711 -0.021490 -1.221956  -5
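The boolean operations in this section can be sketched on a tiny frame with hand-picked values. The frame and names are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({"A": [1, -2, 3], "B": [-1, 5, -6], "E": ["one", "two", "four"]})
numbers = frame[["A", "B"]]

positive_a = frame[frame["A"] > 0]                 # rows where column A is positive
masked = numbers[numbers > 0]                      # element-wise mask, failures become NaN
chosen = frame[frame["E"].isin(["two", "four"])]   # rows whose E value is in the list

flipped = numbers.copy()
flipped[flipped > 0] = -flipped                    # negate only the positive entries

print(positive_a)
print(masked)
print(chosen)
print(flipped)
```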
Missing Data
- Reindexing lets you change, add, or delete the index on a specified axis. It returns a copy of the data.
    d = pd.date_range('20130101', periods=6)
    df1 = df.reindex(index=d, columns=list(df.columns))
    df1.loc[d[0]:d[1], 'E'] = 1
    print(df1)

    output:
                       A         B         C         D    E
    2013-01-01  0.188434  1.338949 -0.085884  1.882977    1
    2013-01-02  0.396632  0.944758 -0.721702 -1.666582    1
    2013-01-03  0.359807 -0.648172 -0.510065  1.429356  NaN
    2013-01-04  0.409378  1.320434 -0.293386 -0.159756  NaN
    2013-01-05  0.604146  0.071139  1.170985 -0.482204  NaN
    2013-01-06  0.658903 -0.003649 -0.204679  0.472076  NaN
- dropna drops any rows that contain missing data
    d = pd.date_range('20130101', periods=6)
    df.loc[d[0]:d[1], 'D'] = np.nan
    print(df.dropna(how='any'))

    output:
                       A         B         C         D
    2013-01-03  0.972982 -1.777415  1.550535 -0.222959
    2013-01-04 -0.271872 -0.713687  0.034684  0.768660
    2013-01-05 -0.427849 -1.112800  1.592027 -0.870796
    2013-01-06 -0.333351  0.064402 -0.523787  0.939407
- df.fillna(value=5) fills missing values
- pd.isnull(df) returns a boolean mask that is True where a value is NaN
    print(pd.isnull(df))

    output:
                    A      B      C      D
    2013-01-01  False  False  False   True
    2013-01-02  False  False  False   True
    2013-01-03  False  False  False  False
    2013-01-04  False  False  False  False
    2013-01-05  False  False  False  False
    2013-01-06  False  False  False  False
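The three missing-data calls above can be compared side by side on a small frame with two NaN entries. The frame is illustrative; none of the calls modify it in place.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"A": [1.0, 2.0, 3.0], "D": [np.nan, 5.0, np.nan]})

complete = frame.dropna(how="any")   # keep only rows without any NaN
filled = frame.fillna(value=5)       # replace every NaN with 5
mask = pd.isnull(frame)              # True where a value is NaN

print(complete)
print(filled)
print(mask)
```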
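- Descriptive statistics
    df.mean()         # mean of each column
    df.mean(axis=1)   # computed along the other axis: mean of each row
- Operating on objects that have different dimensionality and need alignment. The Series below is shifted by one position, so it no longer lines up value for value with its index.
    d = pd.date_range('20130101', periods=6)
    res = pd.Series([1, 3, 5, np.nan, 6, 8], index=d).shift(1, axis=0)
    print(res)

    output:
    2013-01-01    NaN
    2013-01-02      1
    2013-01-03      3
    2013-01-04      5
    2013-01-05    NaN
    2013-01-06      6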
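The notes stop at the shifted Series, so here is a hedged sketch of one way such a Series can then be combined with a DataFrame: `DataFrame.sub` with `axis="index"` aligns on the row labels and subtracts the Series from every column. The values are made up for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2013-01-01", periods=4)
frame = pd.DataFrame({"A": [10.0, 20.0, 30.0, 40.0], "B": [1.0, 2.0, 3.0, 4.0]}, index=dates)

# Shift by one position: the first entry becomes NaN, the rest move down.
shifted = pd.Series([1.0, 2.0, 3.0, 4.0], index=dates).shift(1)

# Subtract the Series from each column, aligned on the row index.
result = frame.sub(shifted, axis="index")
print(result)
```

Rows where the shifted Series holds NaN produce NaN in every column of the result.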
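- apply applies a function to the data
    1. df.apply(np.cumsum) returns the cumulative sum of each column (for example, the value in the second row, first column of the result equals the first-row, first-column value of df plus its second-row, first-column value)
    2. df.apply(lambda x: x.max() - x.min()) computes the maximum minus the minimum of each column

    output:
    A    3.457406
    B    2.542839
    C    2.566534
    D    3.116148
    dtype: float64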
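Both uses of `apply` can be verified by hand on small integers. In this sketch the cumulative sum is written as a lambda calling `Series.cumsum`, which is an equivalent spelling chosen for illustration rather than the form used in the notes.

```python
import pandas as pd

frame = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 40]})

# Column-wise running total: each row holds the sum of that column so far.
running = frame.apply(lambda col: col.cumsum())

# Column-wise range: maximum minus minimum, one number per column.
spread = frame.apply(lambda col: col.max() - col.min())

print(running)
print(spread)
```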
- Count occurrences
    s = pd.Series(np.random.randint(0, 7, size=10))
    print(s.value_counts())

    output:
    5    3
    0    3
    6    2
    2    1
    1    1
    dtype: int64
- String methods; non-string values come back as NaN
    s = pd.Series(['A', 32, 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat', 1])
    print(s.str.lower())

    output:
    0        a
    1      NaN
    2        b
    3        c
    4     aaba
    5     baca
    6      NaN
    7     caba
    8      dog
    9      cat
    10     NaN
- Concat: concatenating objects
    df = pd.DataFrame(np.random.randn(2, 4))
    df3 = pd.DataFrame(np.random.randn(2, 4))
    pieces = [df, df3]
    print(pd.concat(pieces))

    output:
              0         1         2         3
    0  0.612235 -1.470256  0.155577  0.559023
    1 -1.375064  1.246578  1.024598  0.985405
    0  0.223544  0.489176  0.700047  1.123587
    1 -0.110190  1.703503 -1.339290 -2.199537
- Join: SQL-style merges
    left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
    right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
    print(left)
    print(right)
    print(pd.merge(left, right, on='key'))

    output:
       key  lval
    0  foo     1
    1  foo     2
       key  rval
    0  foo     4
    1  foo     5
       key  lval  rval
    0  foo     1     4
    1  foo     1     5
    2  foo     2     4
    3  foo     2     5
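The four-row result above comes from the duplicated key: every matching left row is paired with every matching right row. The sketch below contrasts that with unique keys; the second pair of frames is an illustrative addition.

```python
import pandas as pd

# Duplicate keys: 2 left rows x 2 right rows with key "foo" give 4 rows.
left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})
many = pd.merge(left, right, on="key")

# Unique keys: each key matches exactly once, so the result has 2 rows.
left_unique = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right_unique = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})
one_to_one = pd.merge(left_unique, right_unique, on="key")

print(many)
print(one_to_one)
```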
- Append: append rows to a DataFrame. DataFrame.append was removed in pandas 2.0; on current versions pd.concat does the same job.
    df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
    s = df.iloc[3]
    print(df.append(s, ignore_index=True))

    output:
              A         B         C         D
    0 -1.321258  0.045675  0.381518  1.744968
    1 -1.289457 -1.232328  0.518603 -0.439726
    2  1.580407  1.170709  1.670016 -2.203799
    3  0.639870  0.111603 -0.518480  0.828924
    4  1.138179  0.917267 -1.619596  0.825710
    5 -0.463726  1.555098  1.622742  0.398459
    6  0.776347  1.068682 -0.468541 -0.346588
    7  0.665365  1.946883 -0.012843  0.119017
    8  0.639870  0.111603 -0.518480  0.828924   # the appended row
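For pandas versions without `DataFrame.append`, a minimal sketch of the same row-append using `pd.concat` follows. Selecting the row with `iloc[[1]]` (a list) keeps it as a one-row DataFrame; the frame itself is illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list("ABCD"))

# A list selector keeps the result two-dimensional (a one-row DataFrame).
row = frame.iloc[[1]]

# Concatenate and renumber the index, as ignore_index=True did with append.
appended = pd.concat([frame, row], ignore_index=True)
print(appended)
```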
- Grouping
    df = pd.DataFrame({
        'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
        'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
        'C': np.random.randn(8),
        'D': np.random.randn(8)
    })
    print(df.groupby('A').sum())
    print(df.groupby(['A', 'B']).sum())

    output:
                C         D
    A
    bar -0.907288  1.336085
    foo  4.172039  0.385459
                      C         D
    A   B
    bar one    0.292497 -0.401558
        three -1.501987  0.455004
        two    0.302203  1.282639
    foo one    1.641715 -2.024059
        three -0.741582  0.566795
        two    3.271907  1.842723
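A small grouping sketch with fixed numbers, so the sums can be checked by hand. Selecting the numeric columns explicitly before calling `sum()` is a choice made here so that the string column is not aggregated; the data is illustrative.

```python
import pandas as pd

frame = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "B": ["one", "one", "two", "one"],
    "C": [1.0, 2.0, 3.0, 4.0],
    "D": [10.0, 20.0, 30.0, 40.0],
})

# Sum the numeric columns within each value of A.
by_a = frame.groupby("A")[["C", "D"]].sum()

# Grouping by two columns gives a hierarchical (MultiIndex) result.
by_ab = frame.groupby(["A", "B"])[["C", "D"]].sum()

print(by_a)
print(by_ab)
```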
- stack() "compresses" a level of the DataFrame's columns into the index; unstack() does the reverse
    tuples = list(zip(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
                      ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']))
    print(tuples)
    index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
    print(index)
    df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
    df2 = df[:4]
    stacked = df2.stack()
    print(stacked)

    output:
    first  second
    bar    one     A    0.757833
                   B   -1.218229
           two     A   -0.240841
                   B    1.111270
    baz    one     A   -0.899588
                   B   -0.643713
           two     A    0.865628
                   B   -0.316734
    dtype: float64

    print(stacked.unstack())   # by default, unstacks the last level

    output:
                         A         B
    first second
    bar   one    -0.289143 -0.393785
          two     0.207235  0.307347
    baz   one    -0.924273  2.437802
          two    -0.607268 -0.491990

    print(stacked.unstack(1))   # unstacks level 1 ('second')

    output:
    second        one       two
    first
    bar   A -0.618240 -0.768593
          B -1.642071 -0.264534
    baz   A  2.519232 -0.510234
          B  0.651714 -0.476661

    print(stacked.unstack(0))   # unstacks level 0 ('first')

    output:
    first          bar       baz
    second
    one    A  0.378821 -0.147900
           B  0.521936  1.079642
    two    A -0.462910  0.550019
           B -0.879458 -1.254567
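Because the printed outputs above come from different random draws, the round trip is easier to follow with fixed numbers. This sketch uses values 0 to 7 and `MultiIndex.from_product`, both of which are illustrative choices.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
frame = pd.DataFrame(np.arange(8.0).reshape(4, 2), index=index, columns=["A", "B"])

stacked = frame.stack()          # column labels become the innermost index level
restored = stacked.unstack()     # default: move the last level back to columns
by_first = stacked.unstack(0)    # move level 0 ("first") to columns instead

print(stacked)
print(restored)
print(by_first)
```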
Pivot Tables
pd.pivot_table aggregates the values for the given row and column keys. The default aggregation is the mean; another function can be passed through aggfunc.
    df = pd.DataFrame({
        'A': ['one', 'one', 'two', 'three'] * 3,
        'B': ['A', 'B', 'C'] * 4,
        'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
        'D': np.random.randn(12),
        'E': np.random.randn(12)
    })
    print(df)
    print('---------------------------')
    print(pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C']))

    output:
            A  B    C         D         E
    0     one  A  foo  0.129400  0.314343
    1     one  B  foo -0.349982  0.446082
    2     two  C  foo -1.963735  0.469098
    3   three  A  bar  0.484424 -0.609018
    4     one  B  bar  0.886764 -1.369589
    5     one  C  bar -0.717089 -1.277729
    6     two  A  foo  1.036215 -0.961780
    7   three  B  foo -0.026899  0.550838
    8     one  C  foo  0.482682 -0.563240
    9     one  A  bar -0.779486 -0.314299
    10    two  B  bar -0.743693 -0.001082
    11  three  C  bar  0.899123  0.000721
    ---------------------------
    C             bar       foo
    A     B
    one   A -0.779486  0.129400
          B  0.886764 -0.349982
          C -0.717089  0.482682
    three A  0.484424       NaN
          B       NaN -0.026899
          C  0.899123       NaN
    two   A       NaN  1.036215
          B -0.743693       NaN
          C       NaN -1.963735
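In the example above every (A, B, C) combination occurs only once, so the aggregation is invisible. The sketch below uses a frame where one cell receives two values, which shows the default mean next to an explicit sum. The data is illustrative.

```python
import pandas as pd

frame = pd.DataFrame({
    "A": ["one", "one", "two", "two"],
    "C": ["foo", "foo", "foo", "bar"],
    "D": [1.0, 3.0, 5.0, 7.0],
})

# Default aggregation: the mean of the values that fall into each cell.
mean_table = pd.pivot_table(frame, values="D", index="A", columns="C")

# Same layout, but summing the values instead.
sum_table = pd.pivot_table(frame, values="D", index="A", columns="C", aggfunc="sum")

print(mean_table)
print(sum_table)
```

The cell for ("one", "foo") receives 1.0 and 3.0, so the two tables differ there; the cell for ("one", "bar") has no data and is NaN.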
- Plotting
Requires python-tk: apt-get install python-tk
    from pandas import Series, DataFrame
    import numpy as np
    import matplotlib
    import matplotlib.pyplot as plt

    matplotlib.style.use('ggplot')
    s = Series(np.random.randn(10).cumsum(), index=np.arange(0, 100, 10))
    df = DataFrame(np.random.randn(10, 4).cumsum(0),
                   columns=['A', 'B', 'C', 'D'],
                   index=np.arange(0, 100, 10))
    df.plot()
    plt.show()  # call this method to display the plotted figure
- Time Series
ts.resample() regroups a time series at a new frequency; follow it with an aggregation such as mean() to get values.
    rng = pd.date_range('1/1/2012', periods=5, freq='S')
    ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
    print(ts)
    print(ts.resample('5Min').mean())

    output:
    2012-01-01 00:00:00    358
    2012-01-01 00:00:01     20
    2012-01-01 00:00:02     64
    2012-01-01 00:00:03    135
    2012-01-01 00:00:04    446
    Freq: S, dtype: int32
    2012-01-01    204.6
    Freq: 5T, dtype: float64
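The resampled value 204.6 above is just the mean of the five readings, which the sketch below reproduces with the same numbers. Building the index from explicit timestamps and writing the bucket as "5min" are choices made here for illustration.

```python
import pandas as pd

index = pd.to_datetime([
    "2012-01-01 00:00:00",
    "2012-01-01 00:00:01",
    "2012-01-01 00:00:02",
    "2012-01-01 00:00:03",
    "2012-01-01 00:00:04",
])
ts = pd.Series([358, 20, 64, 135, 446], index=index)

# All five seconds fall into a single 5-minute bucket, which is then averaged.
five_min = ts.resample("5min").mean()
print(five_min)
```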
- Time zone representation, and conversion between periods and timestamps
    rng = pd.date_range('1/1/2012', periods=5, freq='S')
    ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
    ts_utc = ts.tz_localize('UTC')
    print(ts_utc)

    output:
    2012-01-01 00:00:00+00:00    157
    2012-01-01 00:00:01+00:00     39
    2012-01-01 00:00:02+00:00    347
    2012-01-01 00:00:03+00:00      1
    2012-01-01 00:00:04+00:00    455
    Freq: S, dtype: int32

    Convert to another time zone:
    ts_utc.tz_convert('US/Eastern')

    output:
    2011-12-31 19:00:00-05:00    494
    2011-12-31 19:00:01-05:00    262
    2011-12-31 19:00:02-05:00    377
    2011-12-31 19:00:03-05:00    296
    2011-12-31 19:00:04-05:00    370
    Freq: S, dtype: int32
Converting between time span representations
    rng = pd.date_range('1/1/2012', periods=5, freq='M')
    ts = pd.Series(np.random.randn(len(rng)), index=rng)
    ps = ts.to_period(freq='S')
    print(ps)

    output:
    2012-01-31 00:00:00   -1.026838
    2012-02-29 00:00:00    1.335602
    2012-03-31 00:00:00    0.676003
    2012-04-30 00:00:00   -0.621631
    2012-05-31 00:00:00   -1.031011

    ps.to_timestamp()   # convert back to timestamps
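A small round trip between timestamps and periods, using monthly periods rather than the per-second periods above because the effect is easier to see. The dates and values are illustrative.

```python
import pandas as pd

# Month-end timestamps, written out explicitly.
index = pd.to_datetime(["2012-01-31", "2012-02-29", "2012-03-31"])
ts = pd.Series([1.0, 2.0, 3.0], index=index)

ps = ts.to_period("M")      # each timestamp becomes the month that contains it
back = ps.to_timestamp()    # each period becomes its starting timestamp

print(ps)
print(back)
```

The round trip does not restore the original index: month-end dates come back as the first day of each month, since `to_timestamp()` uses the start of the period by default.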
- Specify dtype="category" when constructing a Series
    s = pd.Series(["a", "b", "c", "a"], dtype="category")
- Convert an existing Series or column to the category dtype:
    df = pd.DataFrame({"A": ["a", "b", "c", "a"]})
    df["B"] = df["A"].astype('category')
    print(df)

    output:
       A  B
    0  a  a
    1  b  b
    2  c  c
    3  a  a
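The printed frame looks identical before and after the conversion, so the sketch below inspects what actually changes: a categorical column stores a list of distinct categories plus an integer code per row. The names are illustrative.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

frame = pd.DataFrame({"A": ["a", "b", "c", "a"]})
frame["B"] = frame["A"].astype("category")

print(list(s.cat.categories))   # the distinct categories
print(s.cat.codes.tolist())     # integer code of each row's category
print(frame["B"].dtype)
```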
- Reading and writing CSV
    pd.read_csv('foo.csv')
    df.to_csv('foo.csv')
- Reading and writing HDF5
    df.to_hdf('foo.h5', 'df')
    pd.read_hdf('foo.h5', 'df')
- Reading and writing Excel
    df.to_excel('foo.xlsx', sheet_name='Sheet1')
    pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
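A CSV round trip can be tried without touching the file system by writing to an in-memory buffer. Using `io.StringIO` in place of 'foo.csv' and passing `index=False` are assumptions made for this sketch.

```python
import io

import pandas as pd

frame = pd.DataFrame({"name": ["a", "b"], "value": [1, 2]})

buffer = io.StringIO()
frame.to_csv(buffer, index=False)   # index=False keeps the row index out of the file
buffer.seek(0)                      # rewind before reading the text back

loaded = pd.read_csv(buffer)
print(loaded)
```

Without `index=False`, the index is written as an extra unnamed column and shows up as such when the file is read back.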