Ten Minutes to pandas: Study Notes

Source: Internet  Editor: 程序博客网  Time: 2024/06/10 08:39

Creating Objects

  • Create a Series by passing a list of values
    import pandas as pd
    import numpy as np

    # A default integer index is created
    s = pd.Series([1, 3, 5, np.nan, 6, 8])
    print(s)

    output:
    0     1
    1     3
    2     5
    3   NaN
    4     6
    5     8
    dtype: float64
  • Create a datetime index
    pd.date_range(start=None, end=None, periods=None, freq='D', tz=None,
                  normalize=False, name=None, closed=None)

    function: generates a fixed-frequency DatetimeIndex
    start: first timestamp of the index
    end: last timestamp of the index
    periods: number of timestamps to generate (with closed=None the result has exactly this many entries; in the example below closed='right' drops the start point, so the result has one fewer)
    freq: spacing between timestamps, default 'D' (one day)
    normalize: normalize the start/end dates to midnight before generating the range
    name: name given to the index
    closed: 'left' keeps only the left endpoint, 'right' keeps only the right endpoint; the default None keeps both
            (this keyword was deprecated in pandas 1.4 in favor of inclusive and removed in pandas 2.0)

    s = pd.date_range('20130101', periods=6, freq='5H', normalize=True, name='time', closed='right')
    print(s)

    output:
    DatetimeIndex(['2013-01-01 05:00:00', '2013-01-01 10:00:00',
                   '2013-01-01 15:00:00', '2013-01-01 20:00:00',
                   '2013-01-02 01:00:00'],
                  dtype='datetime64[ns]', name=u'time', freq='5H')
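As a small sketch of the parameters above, the following builds a six-entry index and then drops the left endpoint by slicing. That yields the same five timestamps as closed='right' in the example, without relying on a keyword that newer pandas versions no longer accept. The variable names idx and right_only are illustrative only.

```python
import pandas as pd

# Six timestamps, 5 hours apart, starting at midnight on 2013-01-01.
# A Timedelta is passed as freq so the sketch does not depend on alias spelling.
idx = pd.date_range("2013-01-01", periods=6, freq=pd.Timedelta(hours=5), name="time")

# Dropping the first entry mimics excluding the left endpoint.
right_only = idx[1:]

print(len(idx), len(right_only))
print(right_only[0], right_only[-1])
```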
  • np.random.randn(d0, d1, …, dn) returns one or more samples drawn from the standard normal distribution
    np.random.randn(3, 2)

    output:
    [[-0.66203849  0.82071427]
     [ 0.0292031   0.01885139]
     [-0.24398997  0.30936218]]
  • np.random.rand(d0, d1, …, dn) returns one or more samples drawn uniformly from [0, 1)
    np.random.rand(3, 2)

    output:
    [[ 0.03652049  0.87310609]
     [ 0.62958535  0.46013806]
     [ 0.36548056  0.13320911]]
  • Create a DataFrame by passing a NumPy array, with a datetime index and labeled columns
    df = pd.DataFrame(np.random.randn(6, 4), index=pd.date_range('20130101', periods=6), columns=list('ABCD'))
    print(df)

    output:
                       A         B         C         D
    2013-01-01  0.318379  0.417183 -1.302340  0.336256
    2013-01-02  1.275668 -1.024254 -2.260727 -0.137999
    2013-01-03 -1.178882 -1.158869  1.729817 -0.382903
    2013-01-04  1.669776  0.685672 -0.150533 -0.158356
    2013-01-05 -0.688819 -0.641449 -0.192594 -0.827298
    2013-01-06  1.783666  0.106125 -0.303890 -0.818119
  • Create a DataFrame by passing a dict of objects that can be converted to series-like structures
    df2 = pd.DataFrame({
        'A': 1.,
        'B': pd.Timestamp('20130102'),
        'C': pd.Series(1, index=list(range(4)), dtype='float32'),
        'D': np.array([3] * 4, dtype='int32'),
        'E': pd.Categorical(["test", "train", "test", "train"]),
        'F': 'foo'})
    print(df2)

    output:
       A          B  C  D      E    F
    0  1 2013-01-02  1  3   test  foo
    1  1 2013-01-02  1  3  train  foo
    2  1 2013-01-02  1  3   test  foo
    3  1 2013-01-02  1  3  train  foo

Viewing Data

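  • df.dtypes shows the dtype of each column
  • df.head() shows the first rows of the DataFrame
  • df.tail() shows the last rows of the DataFrame
  • df.index shows the index
  • df.columns shows the column names
  • df.values returns the DataFrame's values as a NumPy array
  • df.describe() shows a quick statistical summary of the data
  • df.T transposes the data, so rows become columns and columns become rows
  • df.sort_index(axis=0, ascending=False) sorts by the labels along an axis
    df.sort_index parameters:
    axis: 0 sorts by the index (row labels), 1 sorts by the column labels
    ascending: True for ascending order, False for descending order
    inplace: default False, which returns a new sorted object; True modifies the object in place
  • df.sort_values(by='B', ascending=True) sorts by values; 'B' is a column name
  • df['A'] selects a single column and yields a Series; it is equivalent to df.A
  • df[0:3] slices rows by position
  • df['20130102':'20130104'] slices rows by index label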
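The sorting and slicing calls listed above can be tried on a small deterministic frame. This is only a sketch; the data and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2013-01-01", periods=4)
df = pd.DataFrame(
    np.arange(16, dtype=float).reshape(4, 4)[::-1],  # rows stored in descending order
    index=dates,
    columns=list("ABCD"),
)

by_columns = df.sort_index(axis=1, ascending=False)  # columns become D, C, B, A
by_value = df.sort_values(by="B")                    # rows ordered by the values in B
first_rows = df[0:2]                                 # positional slice, end excluded
by_label = df["2013-01-02":"2013-01-03"]             # label slice, both ends included

print(by_columns.columns.tolist())
print(by_value["B"].tolist())
```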

Selection by Label

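  • df.loc[s[0]] gets a cross-section (one row) by label; s is the index object, for example s = df.index
  • df.loc[:, ['A', 'B']] selects on multiple axes by label
  • df.loc['20130102':'20130104', ['A', 'B']] slices by label; both endpoints are included
    print(df.loc['20130102':'20130104', ['A', 'B']])

    output:
                       A         B
    2013-01-02  0.383816  1.474808
    2013-01-03  0.874606  0.094928
    2013-01-04  1.437224  0.761042
  • Reducing the dimensions of the returned object
    print(df.loc['20130102', ['A', 'B']])

    Out[29]:
    A    1.212112
    B   -0.173215
  • Getting a scalar value
    print(df.loc['20130102', 'A'])

    Out[30]: 0.46911229990718628
  • Fast access to a scalar (equivalent to the previous method)
    s = pd.date_range('20130101', periods=6)
    print(df.at[s[0], 'A'])  # passing the string '20130101' as the first argument raises an exception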
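To make the shapes returned by label-based selection concrete, here is a minimal sketch on a frame with known values. The data and names are illustrative assumptions, not taken from the runs above.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2013-01-01", periods=6)
df = pd.DataFrame(np.arange(24, dtype=float).reshape(6, 4),
                  index=dates, columns=list("ABCD"))

block = df.loc["2013-01-02":"2013-01-04", ["A", "B"]]  # 3 rows: both endpoints are kept
row = df.loc["2013-01-02", ["A", "B"]]                 # a single row label gives a Series
value = df.loc["2013-01-02", "A"]                      # row label and column label give a scalar
fast = df.at[dates[1], "A"]                            # the same scalar through .at

print(block.shape, type(row).__name__, value, fast)
```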

Selection by Position

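  • Selecting by integer position
    df.iloc[3]         # the row at position 3, returned as a Series (displayed vertically)
    df.iloc[3][0]      # the first value of that row
    df.iloc[3:5, 0:2]

    output:
                       A         B
    2013-01-04  0.467402  0.052802
    2013-01-05 -0.318575  0.549730

    df.iloc[[1, 2, 4], [0, 2]]  # select data by lists of integer positions

    output:
                       A         C
    2013-01-02  1.212112  0.119209
    2013-01-03 -0.861849 -0.494929
    2013-01-05 -0.424972  0.276232
  • Fast access to a scalar (df.iat[1, 1] is equivalent to df.iloc[1, 1])
    In [38]: df.iat[1, 1]
    Out[38]: -0.17321464905330858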
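The positional selectors can be checked the same way. The sketch below uses made-up data so that each result is easy to verify by eye.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24, dtype=float).reshape(6, 4),
                  index=pd.date_range("2013-01-01", periods=6),
                  columns=list("ABCD"))

row = df.iloc[3]                      # the fourth row, as a Series
corner = df.iloc[3:5, 0:2]            # rows 3-4 and columns 0-1 (ends excluded)
picked = df.iloc[[1, 2, 4], [0, 2]]   # explicit lists of positions
scalar = df.iat[1, 1]                 # fast scalar access by position

print(corner.shape, picked.columns.tolist(), scalar)
```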

Boolean Indexing

  • Select data using a condition on the values of a single column
    df[df.A > 0]

                       A         B         C         D
    2013-01-01  0.509229  0.313454  0.535203 -1.537080
    2013-01-06  1.609375  1.464626 -0.737054  0.093372

    print(df[df > 0])  # values that do not satisfy the condition are replaced with NaN

                       A         B         C         D
    2013-01-01  0.980258  0.338144       NaN  0.845370
    2013-01-02       NaN  1.104085  0.341050  1.075445
    2013-01-03  0.171871       NaN  0.797777  0.536634
    2013-01-04       NaN       NaN       NaN  0.758390
    2013-01-05       NaN       NaN  1.801255  1.780775
    2013-01-06       NaN       NaN  2.266597  1.060540
  • Filtering with isin
    df2 = df.copy()
    df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
    print(df2)
    print(df2[df2['E'].isin(['two', 'four'])])

    output:
                       A         B         C         D     E
    2013-01-03 -1.001488  0.465646 -0.330277  0.722562   two
    2013-01-05  0.258179 -0.727527  0.860856 -0.171767  four
  • Set a value by position: df.iat[0, 1] = 11
  • Set a value by label: df.at[dates[0], 'A'] = 0 (dates is the DataFrame's datetime index)
  • Negate the positive values, so that no positive values remain
    df2 = df.copy()
    df2[df2 > 0] = -df2
    print(df2)

    output (from a run in which df also had a column F):
                       A          B         C         D   F
    2013-01-01 -1.470583 -11.000000 -2.021309 -0.395401 NaN
    2013-01-02 -0.297632  -0.628088 -0.286551 -0.400202  -1
    2013-01-03 -0.584846  -0.520314 -0.795003 -1.253678  -2
    2013-01-04 -0.571857  -0.030617 -1.073085 -0.407656  -3
    2013-01-05 -0.409181  -0.217922 -1.856937 -0.681493  -4
    2013-01-06 -0.909665  -0.031711 -0.021490 -1.221956  -5
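A compact sketch of the boolean-indexing operations in this section, using a tiny hand-written frame (the values and names are assumptions chosen so the effect of each mask is obvious):

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, -2.0, 3.0], "B": [-4.0, 5.0, -6.0]})

positive_a = df[df["A"] > 0]   # keeps the rows where A is positive
masked = df[df > 0]            # same shape; entries that are not > 0 become NaN

df2 = df.copy()
df2["E"] = ["one", "two", "three"]
picked = df2[df2["E"].isin(["two", "three"])]

flipped = df.copy()
flipped[flipped > 0] = -flipped  # only the positive entries are overwritten

print(flipped)
```

After the last assignment every entry of flipped is non-positive, because negative entries are left alone and positive ones are replaced by their negation.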

Missing Data

  • Reindexing lets you change, add, or delete the index on a specified axis. It returns a copy of the data
    d = pd.date_range('20130101', periods=6)
    df1 = df.reindex(index=d, columns=list(df.columns))
    df1.loc[d[0]:d[1], 'E'] = 1
    print(df1)

    output:
                       A         B         C         D   E
    2013-01-01  0.188434  1.338949 -0.085884  1.882977   1
    2013-01-02  0.396632  0.944758 -0.721702 -1.666582   1
    2013-01-03  0.359807 -0.648172 -0.510065  1.429356 NaN
    2013-01-04  0.409378  1.320434 -0.293386 -0.159756 NaN
    2013-01-05  0.604146  0.071139  1.170985 -0.482204 NaN
    2013-01-06  0.658903 -0.003649 -0.204679  0.472076 NaN
  • dropna drops every row that contains missing data
    d = pd.date_range('20130101', periods=6)
    df.loc[d[0]:d[1], 'D'] = np.nan
    print(df.dropna(how='any'))

    output:
                       A         B         C         D
    2013-01-03  0.972982 -1.777415  1.550535 -0.222959
    2013-01-04 -0.271872 -0.713687  0.034684  0.768660
    2013-01-05 -0.427849 -1.112800  1.592027 -0.870796
    2013-01-06 -0.333351  0.064402 -0.523787  0.939407
  • df.fillna(value=5) fills missing values with 5
  • pd.isnull(df) returns a boolean mask that is True where a value is NaN
    print(pd.isnull(df))

    output:
                    A      B      C      D
    2013-01-01  False  False  False   True
    2013-01-02  False  False  False   True
    2013-01-03  False  False  False  False
    2013-01-04  False  False  False  False
    2013-01-05  False  False  False  False
    2013-01-06  False  False  False  False
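The three missing-data helpers can be compared side by side on a small frame. This is a sketch with invented values; only dropna, fillna, and pd.isnull come from the text.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "D": [np.nan, np.nan, 6.0]})

complete = df.dropna(how="any")  # keeps only the row that has no NaN
filled = df.fillna(value=5)      # every NaN is replaced by 5
mask = pd.isnull(df)             # True where a value is missing

print(complete)
print(filled)
print(mask)
```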
  • Descriptive statistics
    df.mean()        # mean of each column
    df.mean(axis=1)  # axis=1: mean of each row, computed across the columns
  • Operating on objects that have different dimensionality and need alignment
    d = pd.date_range('20130101', periods=6)
    res = pd.Series([1, 3, 5, np.nan, 6, 8], index=d).shift(1, axis=0)
    print(res)

    output:
    2013-01-01   NaN
    2013-01-02     1
    2013-01-03     3
    2013-01-04     5
    2013-01-05   NaN
    2013-01-06     6
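  • apply applies a function to the data
    1. df.apply(np.cumsum) returns the cumulative sum of each column (for example, the value in row 2, column 1 of the result equals the value in row 1, column 1 of df plus the value in row 2, column 1)
    2. df.apply(lambda x: x.max() - x.min()) computes the maximum minus the minimum of each column

    output:
    A    3.457406
    B    2.542839
    C    2.566534
    D    3.116148
    dtype: float64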
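The column-wise behaviour of apply is easiest to see with small integers. The frame below is a made-up example, not the random data used above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 4.0], "B": [10.0, 20.0, 5.0]})

running = df.apply(np.cumsum)                         # cumulative sum down each column
spread = df.apply(lambda col: col.max() - col.min())  # one value per column

print(running)
print(spread)
```

Each function receives one column at a time, so running has the same shape as df while spread collapses every column to a single number.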
  • Counting occurrences of each value
    s = pd.Series(np.random.randint(0, 7, size=10))
    print(s.value_counts())

    output:
    5    3
    0    3
    6    2
    2    1
    1    1
    dtype: int64
  • String methods; non-string values become NaN
    s = pd.Series(['A', 32, 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat', 1])
    print(s.str.lower())

    0        a
    1      NaN
    2        b
    3        c
    4     aaba
    5     baca
    6      NaN
    7     caba
    8      dog
    9      cat
    10     NaN
  • Concat: concatenating objects
    df = pd.DataFrame(np.random.randn(2, 4))
    df3 = pd.DataFrame(np.random.randn(2, 4))
    pieces = [df, df3]
    print(pd.concat(pieces))

    output:
              0         1         2         3
    0  0.612235 -1.470256  0.155577  0.559023
    1 -1.375064  1.246578  1.024598  0.985405
    0  0.223544  0.489176  0.700047  1.123587
    1 -0.110190  1.703503 -1.339290 -2.199537
  • Join: SQL-style merges
    left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
    right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
    print(left)
    print(right)
    print(pd.merge(left, right, on='key'))

    output:
       key  lval
    0  foo     1
    1  foo     2
       key  rval
    0  foo     4
    1  foo     5
       key  lval  rval
    0  foo     1     4
    1  foo     1     5
    2  foo     2     4
    3  foo     2     5
  • Append: add rows to a DataFrame
    df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
    s = df.iloc[3]
    # DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0;
    # on newer versions use pd.concat([df, s.to_frame().T], ignore_index=True)
    print(df.append(s, ignore_index=True))

    output:
              A         B         C         D
    0 -1.321258  0.045675  0.381518  1.744968
    1 -1.289457 -1.232328  0.518603 -0.439726
    2  1.580407  1.170709  1.670016 -2.203799
    3  0.639870  0.111603 -0.518480  0.828924
    4  1.138179  0.917267 -1.619596  0.825710
    5 -0.463726  1.555098  1.622742  0.398459
    6  0.776347  1.068682 -0.468541 -0.346588
    7  0.665365  1.946883 -0.012843  0.119017
    8  0.639870  0.111603 -0.518480  0.828924  # the appended row
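On pandas 2.0 and later, where DataFrame.append is no longer available, the same row-appending result can be obtained with pd.concat. The following is a minimal sketch of that substitution with illustrative data; it is one possible replacement, not the method used in the original run.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12, dtype=float).reshape(3, 4), columns=list("ABCD"))
row = df.iloc[1]  # the row to append, as a Series

# Turn the Series into a one-row DataFrame, then concatenate and renumber the index.
extended = pd.concat([df, row.to_frame().T], ignore_index=True)

print(extended)
```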
  • Grouping
    df = pd.DataFrame({
        'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
        'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
        'C': np.random.randn(8),
        'D': np.random.randn(8)})
    print(df.groupby('A').sum())
    print(df.groupby(['A', 'B']).sum())

                C         D
    A
    bar -0.907288  1.336085
    foo  4.172039  0.385459
                      C         D
    A   B
    bar one    0.292497 -0.401558
        three -1.501987  0.455004
        two    0.302203  1.282639
    foo one    1.641715 -2.024059
        three -0.741582  0.566795
        two    3.271907  1.842723
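The grouping example can be reproduced with fixed numbers so the sums are checkable. In this sketch the numeric columns are selected explicitly before summing, which is an added assumption that keeps the result independent of how a given pandas version treats the string column B.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "B": ["one", "one", "two", "two"],
    "C": [1.0, 2.0, 3.0, 4.0],
    "D": [10.0, 20.0, 30.0, 40.0],
})

by_a = df.groupby("A")[["C", "D"]].sum()          # one row per value of A
by_ab = df.groupby(["A", "B"])[["C", "D"]].sum()  # one row per (A, B) pair

print(by_a)
print(by_ab)
```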
  • stack() "compresses" a level in the DataFrame's columns into the index; unstack() does the reverse
    tuples = list(zip(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
                      ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']))
    print(tuples)
    index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
    print(index)
    df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
    df2 = df[:4]
    stacked = df2.stack()
    print(stacked)

    output:
    first  second
    bar    one     A    0.757833
                   B   -1.218229
           two     A   -0.240841
                   B    1.111270
    baz    one     A   -0.899588
                   B   -0.643713
           two     A    0.865628
                   B   -0.316734
    dtype: float64

    print(stacked.unstack())  # unstacks the last level by default

    output:
                         A         B
    first second
    bar   one    -0.289143 -0.393785
          two     0.207235  0.307347
    baz   one    -0.924273  2.437802
          two    -0.607268 -0.491990

    print(stacked.unstack(1))  # unstacks level 1 ('second')

    output:
    second        one       two
    first
    bar   A -0.618240 -0.768593
          B -1.642071 -0.264534
    baz   A  2.519232 -0.510234
          B  0.651714 -0.476661

    print(stacked.unstack(0))  # unstacks level 0 ('first')

    output:
    first          bar       baz
    second
    one    A  0.378821 -0.147900
           B  0.521936  1.079642
    two    A -0.462910  0.550019
           B -0.879458 -1.254567
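A short round-trip sketch shows that unstack() undoes stack(), and which level ends up in the columns for each argument. The data is deterministic and invented for this illustration.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([["bar", "baz"], ["one", "two"]],
                                   names=["first", "second"])
df = pd.DataFrame(np.arange(8, dtype=float).reshape(4, 2),
                  index=index, columns=["A", "B"])

stacked = df.stack()             # column labels move into the innermost index level
restored = stacked.unstack()     # last level moves back into the columns
by_second = stacked.unstack(1)   # level 'second' becomes the columns
by_first = stacked.unstack(0)    # level 'first' becomes the columns

print(by_second.columns.tolist(), by_first.columns.tolist())
```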

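Pivot Tables

pd.pivot_table aggregates the values for the specified rows and columns. The default aggregation is the mean; a different function, such as a sum or a custom function, can be passed via aggfunc.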

    df = pd.DataFrame({
        'A': ['one', 'one', 'two', 'three'] * 3,
        'B': ['A', 'B', 'C'] * 4,
        'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
        'D': np.random.randn(12),
        'E': np.random.randn(12)})
    print(df)
    print('---------------------------')
    print(pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C']))

    output:
            A  B    C         D         E
    0     one  A  foo  0.129400  0.314343
    1     one  B  foo -0.349982  0.446082
    2     two  C  foo -1.963735  0.469098
    3   three  A  bar  0.484424 -0.609018
    4     one  B  bar  0.886764 -1.369589
    5     one  C  bar -0.717089 -1.277729
    6     two  A  foo  1.036215 -0.961780
    7   three  B  foo -0.026899  0.550838
    8     one  C  foo  0.482682 -0.563240
    9     one  A  bar -0.779486 -0.314299
    10    two  B  bar -0.743693 -0.001082
    11  three  C  bar  0.899123  0.000721
    ---------------------------
    C             bar       foo
    A     B
    one   A -0.779486  0.129400
          B  0.886764 -0.349982
          C -0.717089  0.482682
    three A  0.484424       NaN
          B       NaN -0.026899
          C  0.899123       NaN
    two   A       NaN  1.036215
          B -0.743693       NaN
          C       NaN -1.963735
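In the example above every (A, B, C) combination occurs once, so the aggregation is not visible. The sketch below uses invented data with a repeated combination to contrast the default mean with an explicit sum.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["one", "one", "one", "two"],
    "C": ["foo", "foo", "bar", "foo"],
    "D": [1.0, 3.0, 5.0, 7.0],
})

# Default aggregation: the mean of D within each (A, C) cell.
mean_table = pd.pivot_table(df, values="D", index="A", columns="C")

# The same layout with a sum instead.
sum_table = pd.pivot_table(df, values="D", index="A", columns="C", aggfunc="sum")

print(mean_table)
print(sum_table)
```

Combinations that never occur in the data, here ('two', 'bar'), show up as NaN in both tables.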
  • Plotting
    On Debian/Ubuntu, Tk may need to be installed first: apt-get install python-tk
    from pandas import Series, DataFrame
    import numpy as np
    import matplotlib
    import matplotlib.pyplot as plt

    matplotlib.style.use('ggplot')
    s = Series(np.random.randn(10).cumsum(), index=np.arange(0, 100, 10))
    df = DataFrame(np.random.randn(10, 4).cumsum(0), columns=['A', 'B', 'C', 'D'], index=np.arange(0, 100, 10))
    df.plot()
    plt.show()  # call this method to display the plot

Time Series

  • ts.resample() converts a time series to a different frequency; an aggregation such as mean() is then applied to each bin
    rng = pd.date_range('1/1/2012', periods=5, freq='S')
    ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
    print(ts)
    print(ts.resample('5Min').mean())

    output:
    2012-01-01 00:00:00    358
    2012-01-01 00:00:01     20
    2012-01-01 00:00:02     64
    2012-01-01 00:00:03    135
    2012-01-01 00:00:04    446
    Freq: S, dtype: int32
    2012-01-01    204.6
    Freq: 5T, dtype: float64
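The resampling step can be replayed with the five values printed above instead of random numbers. In this sketch a Timedelta is passed as freq so that it does not depend on how a given pandas version spells the seconds alias; that choice and the variable names are assumptions.

```python
import pandas as pd

rng = pd.date_range("2012-01-01", periods=5, freq=pd.Timedelta(seconds=1))
ts = pd.Series([358, 20, 64, 135, 446], index=rng)

# All five seconds fall into a single 5-minute bin.
five_min_mean = ts.resample("5min").mean()
five_min_sum = ts.resample("5min").sum()

print(five_min_mean)
print(five_min_sum)
```

The mean of the single bin matches the 204.6 shown in the output above, since (358 + 20 + 64 + 135 + 446) / 5 = 204.6.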
  • Time zone representation, and converting between periods and timestamps
    rng = pd.date_range('1/1/2012', periods=5, freq='S')
    ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
    ts_utc = ts.tz_localize('UTC')
    print(ts_utc)

    output:
    2012-01-01 00:00:00+00:00    157
    2012-01-01 00:00:01+00:00     39
    2012-01-01 00:00:02+00:00    347
    2012-01-01 00:00:03+00:00      1
    2012-01-01 00:00:04+00:00    455
    Freq: S, dtype: int32

    Converting to another time zone:
    ts_utc.tz_convert('US/Eastern')

    output:
    2011-12-31 19:00:00-05:00    494
    2011-12-31 19:00:01-05:00    262
    2011-12-31 19:00:02-05:00    377
    2011-12-31 19:00:03-05:00    296
    2011-12-31 19:00:04-05:00    370
    Freq: S, dtype: int32

Converting Between Time Span Representations

    rng = pd.date_range('1/1/2012', periods=5, freq='M')
    ts = pd.Series(np.random.randn(len(rng)), index=rng)
    ps = ts.to_period(freq='S')
    print(ps)

    output:
    2012-01-31 00:00:00   -1.026838
    2012-02-29 00:00:00    1.335602
    2012-03-31 00:00:00    0.676003
    2012-04-30 00:00:00   -0.621631
    2012-05-31 00:00:00   -1.031011

    ps.to_timestamp()  # convert back to timestamps
  • Specify dtype="category" when constructing a Series
    s = pd.Series(["a", "b", "c", "a"], dtype="category")
  • Convert an existing Series or column to the category dtype:
    df = pd.DataFrame({"A": ["a", "b", "c", "a"]})
    df["B"] = df["A"].astype('category')
    print(df)

    output:
       A  B
    0  a  a
    1  b  b
    2  c  c
    3  a  a
  • Reading and writing CSV
    pd.read_csv('foo.csv')
    df.to_csv('foo.csv')
  • Reading and writing HDF5
    df.to_hdf('foo.h5', 'df')
    pd.read_hdf('foo.h5', 'df')
  • Reading and writing Excel
    df.to_excel('foo.xlsx', sheet_name='Sheet1')
    pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
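A CSV round trip can be sketched without touching the file system by writing to an in-memory buffer. The buffer and the sample data are assumptions for illustration; with a real file the calls are the same, with a path in place of the buffer.

```python
import io

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"]})

buffer = io.StringIO()
df.to_csv(buffer)   # the index is written as the first, unnamed column
buffer.seek(0)      # rewind before reading

# index_col=0 turns that first column back into the index.
restored = pd.read_csv(buffer, index_col=0)

print(restored)
```

Without index_col=0 the saved index would come back as an extra column named 'Unnamed: 0'.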