pandas官方文档cookbook（5）中MissValue&groupby翻译

来源：互联网发布：淘宝可爱耳钉店铺推荐编辑：程序博客网时间：2024/05/22 06:14

文档版本：0.20.3
这些例子是用python3.4写出来的。对于较早的python版本需要对代码做些相应的调整。
Pandas（pd）和Numpy（np）是唯一两个默认导入的包。其余的包会显示导入给新用户看。
若有翻译不当的地方，请多多指教。

这份文档中的例子都是从Stack-Overflow和Github中别人提问的比较经典的问题，作者从中进行提炼与总结。

缺失值

反转时间序列把缺失值替换成前一个值。df.index[::-1]表示把df的index进行倒序处理，::表示选择全列，-1表示依次递减逆序。

In [82]: df = pd.DataFrame(np.random.randn(6,1), index=pd.date_range('2013-08-01', periods=6, freq='B'), columns=list('A'))In [83]: df.loc[df.index[3], 'A'] = np.nanIn [84]: dfOut[84]:                    A2013-08-01 -1.0548742013-08-02 -0.1796422013-08-05  0.6395892013-08-06       NaN2013-08-07  1.9066842013-08-08  0.104050In [85]: df.reindex(df.index[::-1]).ffill()Out[85]:                    A2013-08-08  0.1040502013-08-07  1.9066842013-08-06  1.9066842013-08-05  0.6395892013-08-02 -0.1796422013-08-01 -1.054874

分组

不同于agg函数，apply函数是可以在一个数据框的子集中获取所有列进行处理的函数。

In [86]: df = pd.DataFrame({'animal': 'cat dog cat fish dog cat cat'.split(),   ....:                    'size': list('SSMMMLL'),   ....:                    'weight': [8, 10, 11, 1, 20, 12, 12],   ....:                    'adult' : [False] * 5 + [True] * 2}); df   ....: Out[86]:    adult animal size  weight0  False    cat    S       81  False    dog    S      102  False    cat    M      113  False   fish    M       14  False    dog    M      205   True    cat    L      126   True    cat    L      12#List the size of the animals with the highest weight.In [87]: df.groupby('animal').apply(lambda subf: subf['size'][subf['weight'].idxmax()])Out[87]: animalcat     Ldog     Mfish    Mdtype: object

get_group的使用方法

In [88]: gb = df.groupby(['animal'])In [89]: gb.get_group('cat')Out[89]:    adult animal size  weight0  False    cat    S       82  False    cat    M      115   True    cat    L      126   True    cat    L      12

在分组后对一个列中不同的分类变量进行不同的操作

In [90]: def GrowUp(x):   ....:    avg_weight =  sum(x[x['size'] == 'S'].weight * 1.5)   ....:    avg_weight += sum(x[x['size'] == 'M'].weight * 1.25)   ....:    avg_weight += sum(x[x['size'] == 'L'].weight)   ....:    avg_weight /= len(x)   ....:    return pd.Series(['L',avg_weight,True], index=['size', 'weight', 'adult'])   ....: In [91]: expected_df = gb.apply(GrowUp)In [92]: expected_dfOut[92]:        size   weight  adultanimal                     cat       L  12.4375   Truedog       L  20.0000   Truefish      L   1.2500   True

expanding与apply的组合使用

In [93]: S = pd.Series([i / 100.0 for i in range(1,11)])In [94]: def CumRet(x,y):....:    return x * (1 + y)....: In [95]: def Red(x):....:    return functools.reduce(CumRet,x,1.0)....: In [96]: S.expanding().apply(Red)Out[96]: 0    1.0100001    1.0302002    1.0611063    1.1035504    1.1587285    1.2282516    1.3142297    1.4193678    1.5471109    1.701821dtype: float64

在分组后替换一些值为每个分组的平均值（这里注意apply与transform的差别）

In [97]: df = pd.DataFrame({'A' : [1, 1, 2, 2], 'B' : [1, -1, 1, 2]})In [98]: gb = df.groupby('A')In [99]: def replace(g):   ....:    mask = g < 0   ....:    g.loc[mask] = g[~mask].mean()   ....:    return g   ....: In [100]: gb.transform(replace)Out[100]:      B0  1.01  1.02  1.03  2.0

对聚合的数据进行分组排序

In [101]: df = pd.DataFrame({'code': ['foo', 'bar', 'baz'] * 2,   .....:                    'data': [0.16, -0.21, 0.33, 0.45, -0.59, 0.62],   .....:                    'flag': [False, True] * 3})   .....: In [102]: code_groups = df.groupby('code')In [103]: agg_n_sort_order = code_groups[['data']].transform(sum).sort_values(by='data')In [104]: sorted_df = df.loc[agg_n_sort_order.index]In [105]: sorted_dfOut[105]:   code  data   flag1  bar -0.21   True4  bar -0.59  False0  foo  0.16  False3  foo  0.45   True2  baz  0.33  False5  baz  0.62   True

创建多重聚合列

In [106]: rng = pd.date_range(start="2014-10-07",periods=10,freq='2min')In [107]: ts = pd.Series(data = list(range(10)), index = rng)In [108]: def MyCust(x):   .....:    if len(x) > 2:   .....:       return x[1] * 1.234   .....:    return pd.NaT   .....: In [109]: mhc = {'Mean' : np.mean, 'Max' : np.max, 'Custom' : MyCust}In [110]: ts.resample("5min").apply(mhc)Out[110]: Custom  2014-10-07 00:00:00    1.234        2014-10-07 00:05:00      NaT        2014-10-07 00:10:00    7.404        2014-10-07 00:15:00      NaTMax     2014-10-07 00:00:00        2        2014-10-07 00:05:00        4        2014-10-07 00:10:00        7        2014-10-07 00:15:00        9Mean    2014-10-07 00:00:00        1        2014-10-07 00:05:00      3.5        2014-10-07 00:10:00        6        2014-10-07 00:15:00      8.5dtype: objectIn [111]: tsOut[111]: 2014-10-07 00:00:00    02014-10-07 00:02:00    12014-10-07 00:04:00    22014-10-07 00:06:00    32014-10-07 00:08:00    42014-10-07 00:10:00    52014-10-07 00:12:00    62014-10-07 00:14:00    72014-10-07 00:16:00    82014-10-07 00:18:00    9Freq: 2T, dtype: int64

创建一个计数的列并且放置在数据框中

In [112]: df = pd.DataFrame({'Color': 'Red Red Red Blue'.split(),   .....:                    'Value': [100, 150, 50, 50]}); df   .....: Out[112]:   Color  Value0   Red    1001   Red    1502   Red     503  Blue     50In [113]: df['Counts'] = df.groupby(['Color']).transform(len)In [114]: dfOut[114]:   Color  Value  Counts0   Red    100       31   Red    150       32   Red     50       33  Blue     50       1

在index的基础上进行分组后对某一个组的值位移

In [115]: df = pd.DataFrame(   .....:    {u'line_race': [10, 10, 8, 10, 10, 8],   .....:     u'beyer': [99, 102, 103, 103, 88, 100]},   .....:     index=[u'Last Gunfighter', u'Last Gunfighter', u'Last Gunfighter',   .....:            u'Paynter', u'Paynter', u'Paynter']); df   .....: Out[115]:                  beyer  line_raceLast Gunfighter     99         10Last Gunfighter    102         10Last Gunfighter    103          8Paynter            103         10Paynter             88         10Paynter            100          8In [116]: df['beyer_shifted'] = df.groupby(level=0)['beyer'].shift(1)In [117]: dfOut[117]:                  beyer  line_race  beyer_shiftedLast Gunfighter     99         10            NaNLast Gunfighter    102         10           99.0Last Gunfighter    103          8          102.0Paynter            103         10            NaNPaynter             88         10          103.0Paynter            100          8           88.0

从每个分组中选取最大值的行

In [118]: df = pd.DataFrame({'host':['other','other','that','this','this'],   .....:                    'service':['mail','web','mail','mail','web'],   .....:                    'no':[1, 2, 1, 2, 1]}).set_index(['host', 'service'])   .....: In [119]: mask = df.groupby(level=0).agg('idxmax')In [120]: df_count = df.loc[mask['no']].reset_index()In [121]: df_countOut[121]:     host service  no0  other     web   21   that    mail   12   this    mail   2

像Python的itertools.groupby一样分组

In [122]: df = pd.DataFrame([0, 1, 0, 1, 1, 1, 0, 1, 1], columns=['A'])In [123]: df.A.groupby((df.A != df.A.shift()).cumsum()).groupsOut[123]: {1: Int64Index([0], dtype='int64'), 2: Int64Index([1], dtype='int64'), 3: Int64Index([2], dtype='int64'), 4: Int64Index([3, 4, 5], dtype='int64'), 5: Int64Index([6], dtype='int64'), 6: Int64Index([7, 8], dtype='int64')}In [124]: df.A.groupby((df.A != df.A.shift()).cumsum()).cumsum()Out[124]: 0    01    12    03    14    25    36    07    18    2Name: A, dtype: int64

阅读全文

0 0