Chapter 9: Data Grouping and Aggregation (Part 1)
Source: Internet | Editor: 程序博客网 | 2024/05/16 01:46
```python
import numpy as np
from pandas import DataFrame

df = DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                'key2': ['one', 'two', 'one', 'two', 'one'],
                'data1': np.random.randn(5),
                'data2': np.random.randn(5)})
df
      data1     data2 key1 key2
0  0.867596  0.739047    a  one
1 -0.271038 -0.621842    a  two
2  1.809116 -0.479957    b  one
3 -1.204141 -0.176988    b  two
4  0.008490  2.401128    a  one

# To group by key1 and compute the mean of data1: select data1,
# then call groupby with the key1 column
grouped = df['data1'].groupby(df['key1'])
grouped.mean()
key1
a    0.201683
b    0.302487
Name: data1, dtype: float64
```
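The transcript above uses random data, so its numbers are not reproducible. A minimal sketch of the same pattern with fixed, made-up values lets the group means be checked by hand:

```python
import pandas as pd

df_demo = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                        'data1': [1.0, 2.0, 10.0, 20.0, 3.0]})
# Group the data1 Series by the key1 column and take the per-group mean
means = df_demo['data1'].groupby(df_demo['key1']).mean()
print(means)
# 'a' rows are 1, 2, 3 -> mean 2.0; 'b' rows are 10, 20 -> mean 15.0
```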
```python
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means
Out[8]:
key1  key2
a     one     0.438043
      two    -0.271038
b     one     1.809116
      two    -1.204141
Name: data1, dtype: float64

# Grouping by two keys yields a Series with a hierarchical index
means.unstack()
Out[9]:
key2       one       two
key1
a     0.438043 -0.271038
b     1.809116 -1.204141
```
```python
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
years = np.array([2005, 2005, 2006, 2005, 2006])
df['data1'].groupby([states, years]).mean()
Out[12]:
California  2005   -0.271038
            2006    1.809116
Ohio        2005   -0.168272
            2006    0.008490
Name: data1, dtype: float64
```
```python
df.groupby('key1').mean()
Out[13]:
         data1     data2
key1
a     0.201683  0.839444
b     0.302487 -0.328473

df.groupby(['key1', 'key2']).mean()
Out[14]:
              data1     data2
key1 key2
a    one   0.438043  1.570087
     two  -0.271038 -0.621842
b    one   1.809116 -0.479957
     two  -1.204141 -0.176988
```
```python
df.groupby(['key1', 'key2']).size()   # size() returns a Series of group sizes
Out[15]:
key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64
```

Iterating over groups: groupby produces a sequence of 2-tuples consisting of the group name and the chunk of data.
```python
for name, group in df.groupby('key1'):
    print(name)
    print(group)
a
      data1     data2 key1 key2
0  0.867596  0.739047    a  one
1 -0.271038 -0.621842    a  two
4  0.008490  2.401128    a  one
b
      data1     data2 key1 key2
2  1.809116 -0.479957    b  one
3 -1.204141 -0.176988    b  two
```

In the case of multiple keys, the first element of each tuple is itself a tuple of key values:
```python
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print(k1, k2)
    print(group)
a one
      data1     data2 key1 key2
0  0.867596  0.739047    a  one
4  0.008490  2.401128    a  one
a two
      data1     data2 key1 key2
1 -0.271038 -0.621842    a  two
b one
      data1     data2 key1 key2
2  1.809116 -0.479957    b  one
b two
      data1     data2 key1 key2
3 -1.204141 -0.176988    b  two
```
```python
pieces = dict(list(df.groupby('key1')))
pieces['b']
Out[19]:
      data1     data2 key1 key2
2  1.809116 -0.479957    b  one
3 -1.204141 -0.176988    b  two
```

By default groupby operates on axis=0, but by setting the axis you can group on any axis; for example, grouping the columns by dtype:
```python
df.dtypes
Out[20]:
data1    float64
data2    float64
key1      object
key2      object
dtype: object
```
```python
grouped = df.groupby(df.dtypes, axis=1)
dict(list(grouped))
Out[22]:
{dtype('float64'):       data1     data2
 0  0.867596  0.739047
 1 -0.271038 -0.621842
 2  1.809116 -0.479957
 3 -1.204141 -0.176988
 4  0.008490  2.401128,
 dtype('O'):   key1 key2
 0    a  one
 1    a  two
 2    b  one
 3    b  two
 4    a  one}
```
```python
df.groupby(['key1', 'key2'])[['data2']].mean()
Out[26]:
              data2
key1 key2
a    one   1.570087
     two  -0.621842
b    one  -0.479957
     two  -0.176988
```

Indexing a GroupBy object this way returns a grouped DataFrame (for a list of columns) or a grouped Series (for a single column):
```python
s_grouped = df.groupby(['key1', 'key2'])['data2']
s_grouped
Out[28]: <pandas.core.groupby.SeriesGroupBy object at 0x095467F0>

s_grouped.mean()
Out[29]:
key1  key2
a     one     1.570087
      two    -0.621842
b     one    -0.479957
      two    -0.176988
Name: data2, dtype: float64
```

Grouping with dicts and Series:
```python
people = DataFrame(np.random.randn(5, 5),
                   columns=['a', 'b', 'c', 'd', 'e'],
                   index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
people.iloc[2, [1, 2]] = np.nan   # add a few NA values (b and c for Wes)
people
Out[32]:
               a         b         c         d         e
Joe    -1.349415 -0.034864 -0.041473  0.316972 -1.077930
Steve  -0.928486  1.348135  0.648762 -1.265573 -1.798529
Wes    -0.221656       NaN       NaN  0.844571  0.249980
Jim    -0.048006 -0.207574 -0.465525 -0.888653  1.646979
Travis -1.190065  0.113572  0.680029 -1.015694  1.728276
```
```python
# Dict mapping column names to group names ('f' is unused, which is fine)
mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f': 'orange'}

by_column = people.groupby(mapping, axis=1)   # axis=1 groups the columns; the default axis=0 groups the rows
by_column.sum()
Out[35]:
            blue       red
Joe     0.275499 -2.462209
Steve  -0.616811 -1.378880
Wes     0.844571  0.028324
Jim    -1.354177  1.391399
Travis -0.335665  0.651783
```
```python
from pandas import Series
map_series = Series(mapping)
map_series
Out[39]:
a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object
```
```python
people.groupby(map_series, axis=1).count()
Out[40]:
        blue  red
Joe        2    3
Steve      2    3
Wes        1    2
Jim        2    3
Travis     2    3
```

Grouping with functions:
The index values are people's names; to group by the length of each name, just pass the len function.
```python
people.groupby(len).sum()
Out[41]:
          a         b         c         d         e
3 -1.619077 -0.242438 -0.506997  0.272891  0.819028
5 -0.928486  1.348135  0.648762 -1.265573 -1.798529
6 -1.190065  0.113572  0.680029 -1.015694  1.728276
```

Functions can be mixed with arrays, lists, dicts, and Series:
```python
key_list = ['one', 'one', 'one', 'two', 'two']
people.groupby([len, key_list]).min()
Out[43]:
              a         b         c         d         e
3 one -1.349415 -0.034864 -0.041473  0.316972 -1.077930
  two -0.048006 -0.207574 -0.465525 -0.888653  1.646979
5 one -0.928486  1.348135  0.648762 -1.265573 -1.798529
6 two -1.190065  0.113572  0.680029 -1.015694  1.728276
```
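A fixed-data sketch of this mixed grouping (names and numbers here are made up for illustration): len is called once per index label, and pairing it with a list gives a two-level grouping of (name length, list label).

```python
import pandas as pd
import numpy as np

people_demo = pd.DataFrame(np.arange(10).reshape(5, 2),
                           columns=['x', 'y'],
                           index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
# Joe, Wes and Jim have 3-letter names; Steve has 5; Travis has 6.
labels = ['one', 'one', 'one', 'two', 'two']
sums = people_demo.groupby([len, labels]).sum()
print(sums)
# (3, 'one') sums Joe and Wes; (3, 'two') is Jim alone, and so on
```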
```python
import pandas as pd

columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                     [1, 3, 5, 1, 3]],
                                    names=['city', 'tenor'])
hier_df = DataFrame(np.random.randn(4, 5), columns=columns)
hier_df
Out[48]:
city         US                            JP
tenor         1         3         5         1         3
0     -0.905468  1.703159  0.239218  1.090464 -0.677526
1     -1.152784 -0.610877  1.476141  0.270902  0.118338
2      0.535665 -0.551337 -1.060250  0.890812  0.923537
3     -0.204244 -0.249316  0.886220 -1.238991 -0.793501
```

Grouping by index level: the most convenient thing about hierarchically indexed datasets is that they can be aggregated by index level. Pass the level number or name via the level keyword.
```python
hier_df.groupby(level='city', axis=1).count()
Out[49]:
city  JP  US
0      2   3
1      2   3
2      2   3
3      2   3
```

Data aggregation:
```python
df
Out[50]:
      data1     data2 key1 key2
0  0.867596  0.739047    a  one
1 -0.271038 -0.621842    a  two
2  1.809116 -0.479957    b  one
3 -1.204141 -0.176988    b  two
4  0.008490  2.401128    a  one

grouped = df.groupby('key1')
grouped['data1'].quantile(0.9)   # quantile computes sample quantiles of a Series or DataFrame
Out[52]:
key1
a    0.695775
b    1.507790
Name: data1, dtype: float64
```
```python
def peak_to_peak(arr):
    return arr.max() - arr.min()

grouped.agg(peak_to_peak)
Out[54]:
         data1     data2
key1
a     1.138635  3.022970
b     3.013257  0.302969
```
```python
tips = pd.read_csv('e:/tips.csv')
tips['tip_pct'] = tips['tip'] / tips['total_bill']
tips[:6]
Out[57]:
   total_bill   tip     sex smoker  day    time  size   tip_pct
0       16.99  1.01  Female     No  Sun  Dinner     2  0.059447
1       10.34  1.66    Male     No  Sun  Dinner     3  0.160542
2       21.01  3.50    Male     No  Sun  Dinner     3  0.166587
3       23.68  3.31    Male     No  Sun  Dinner     2  0.139780
4       24.59  3.61  Female     No  Sun  Dinner     4  0.146808
5       25.29  4.71    Male     No  Sun  Dinner     4  0.186240
```

Column-wise and multiple function application: group tips by sex and smoker.
```python
grouped = tips.groupby(['sex', 'smoker'])
grouped_pct = grouped['tip_pct']
grouped_pct.agg('mean')
Out[60]:
sex     smoker
Female  No        0.156921
        Yes       0.182150
Male    No        0.160669
        Yes       0.152771
Name: tip_pct, dtype: float64
```

If you instead pass a list of functions or function names, the columns of the resulting DataFrame are named after the functions:
```python
grouped_pct.agg(['mean', 'std', peak_to_peak])
Out[61]:
                   mean       std  peak_to_peak
sex    smoker
Female No      0.156921  0.036421      0.195876
       Yes     0.182150  0.071595      0.360233
Male   No      0.160669  0.041849      0.220186
       Yes     0.152771  0.090588      0.674707
```

If you pass a list of (name, function) tuples, the first element of each tuple is used as the DataFrame column name:
```python
grouped_pct.agg([('foo', 'mean'), ('bar', 'std')])
Out[62]:
                    foo       bar
sex    smoker
Female No      0.156921  0.036421
       Yes     0.182150  0.071595
Male   No      0.160669  0.041849
       Yes     0.152771  0.090588
```
```python
functions = ['count', 'mean', 'max']
result = grouped[['tip_pct', 'total_bill']].agg(functions)
result
Out[65]:
              tip_pct                     total_bill
                count      mean       max      count       mean    max
sex    smoker
Female No          54  0.156921  0.252672         54  18.105185  35.83
       Yes         33  0.182150  0.416667         33  17.977879  44.30
Male   No          97  0.160669  0.291990         97  19.791237  48.33
       Yes         60  0.152771  0.710345         60  22.284500  50.81
```
```python
result['tip_pct']
Out[66]:
               count      mean       max
sex    smoker
Female No         54  0.156921  0.252672
       Yes        33  0.182150  0.416667
Male   No         97  0.160669  0.291990
       Yes        60  0.152771  0.710345
```

You can also pass a list of tuples with custom names:
```python
ftuples = [('Durchschnitt', 'mean'), ('Abweichung', 'var')]
grouped[['tip_pct', 'total_bill']].agg(ftuples)
Out[68]:
                   tip_pct              total_bill
              Durchschnitt Abweichung Durchschnitt Abweichung
sex    smoker
Female No         0.156921   0.001327    18.105185  53.092422
       Yes        0.182150   0.005126    17.977879  84.451517
Male   No         0.160669   0.001751    19.791237  76.152961
       Yes        0.152771   0.008206    22.284500  98.244673
```

To apply different functions to different columns, pass agg a dict mapping column names to functions:
```python
grouped.agg({'tip': np.max, 'size': 'sum'})
Out[69]:
                tip  size
sex    smoker
Female No       5.2   140
       Yes      6.5    74
Male   No       9.0   263
       Yes     10.0   150

grouped.agg({'tip_pct': ['min', 'max', 'mean', 'std'], 'size': 'sum'})
Out[70]:
                tip_pct                               size
                    min       max      mean       std  sum
sex    smoker
Female No      0.056797  0.252672  0.156921  0.036421  140
       Yes     0.056433  0.416667  0.182150  0.071595   74
Male   No      0.071804  0.291990  0.160669  0.041849  263
       Yes     0.035638  0.710345  0.152771  0.090588  150
```

Passing as_index=False to groupby disables using the group keys as the result's index; they come back as ordinary columns instead.
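A small sketch of as_index=False with made-up data: the group keys stay as regular columns rather than becoming a (hierarchical) row index.

```python
import pandas as pd

bills = pd.DataFrame({'sex': ['F', 'F', 'M', 'M'],
                      'smoker': ['No', 'Yes', 'No', 'Yes'],
                      'tip': [1.0, 2.0, 3.0, 4.0]})
# With as_index=False, 'sex' and 'smoker' remain columns of the result
flat = bills.groupby(['sex', 'smoker'], as_index=False)['tip'].mean()
print(flat)
```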
Group-wise operations and transformations
To add a column to a DataFrame holding the mean of each group, one approach is to aggregate first and then merge:
```python
k1_means = df.groupby('key1').mean().add_prefix('mean_')
k1_means
Out[73]:
      mean_data1  mean_data2
key1
a       0.201683    0.839444
b       0.302487   -0.328473

pd.merge(df, k1_means, left_on='key1', right_index=True)
Out[74]:
      data1     data2 key1 key2  mean_data1  mean_data2
0  0.867596  0.739047    a  one    0.201683    0.839444
1 -0.271038 -0.621842    a  two    0.201683    0.839444
4  0.008490  2.401128    a  one    0.201683    0.839444
2  1.809116 -0.479957    b  one    0.302487   -0.328473
3 -1.204141 -0.176988    b  two    0.302487   -0.328473
```

A more direct way is the transform method on groupby: transform applies a function to each group and places the results in the appropriate locations.
```python
key = ['one', 'two', 'one', 'two', 'one']
people.groupby(key).mean()
Out[76]:
            a         b         c         d         e
one -0.920379  0.039354  0.319278  0.048616  0.300109
two -0.488246  0.570281  0.091618 -1.077113 -0.075775

people.groupby(key).transform(np.mean)
Out[77]:
               a         b         c         d         e
Joe    -0.920379  0.039354  0.319278  0.048616  0.300109
Steve  -0.488246  0.570281  0.091618 -1.077113 -0.075775
Wes    -0.920379  0.039354  0.319278  0.048616  0.300109
Jim    -0.488246  0.570281  0.091618 -1.077113 -0.075775
Travis -0.920379  0.039354  0.319278  0.048616  0.300109
```

Create a demeaning function and pass it to transform:
```python
def demean(arr):
    return arr - arr.mean()

demeaned = people.groupby(key).transform(demean)
demeaned
Out[80]:
               a         b         c         d         e
Joe    -0.429036 -0.074218 -0.360751  0.268356 -1.378039
Steve  -0.440240  0.777855  0.557143 -0.188460 -1.722754
Wes     0.698722       NaN       NaN  0.795955 -0.050129
Jim     0.440240 -0.777855 -0.557143  0.188460  1.722754
Travis -0.269686  0.074218  0.360751 -1.064310  1.428167
```
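A fixed-data sketch of transform-based demeaning (illustrative values, easy to check by hand): the result is aligned with the original index, and each group's demeaned values average to zero.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 10.0, 20.0])
key = ['one', 'one', 'one', 'two', 'two']
# Subtract each group's mean from its members; the result keeps s's index
demeaned = s.groupby(key).transform(lambda x: x - x.mean())
print(demeaned.tolist())
# group 'one' mean is 2.0 -> [-1, 0, 1]; group 'two' mean is 15.0 -> [-5, 5]
```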
```python
demeaned.groupby(key).mean()
Out[81]:
                a             b             c             d             e
one  7.401487e-17  0.000000e+00  2.775558e-17 -7.401487e-17  7.401487e-17
two  2.775558e-17 -5.551115e-17  0.000000e+00  5.551115e-17  0.000000e+00
```

The apply method: split-apply-combine. To select the rows with the top 5 tip_pct values in each group, first write a function that selects the rows with the largest values in a given column:
```python
def top(df, n=5, column='tip_pct'):
    return df.sort_values(by=column)[-n:]

top(tips, n=6)
Out[84]:
     total_bill   tip     sex smoker  day    time  size   tip_pct
109       14.31  4.00  Female    Yes  Sat  Dinner     2  0.279525
183       23.17  6.50    Male    Yes  Sun  Dinner     4  0.280535
232       11.61  3.39    Male     No  Sat  Dinner     2  0.291990
67         3.07  1.00  Female    Yes  Sat  Dinner     1  0.325733
178        9.60  4.00  Female    Yes  Sun  Dinner     2  0.416667
172        7.25  5.15    Male    Yes  Sun  Dinner     2  0.710345
```

Now group by smoker and call apply with this function:
```python
tips.groupby('smoker').apply(top)
Out[85]:
            total_bill   tip     sex smoker   day    time  size   tip_pct
smoker
No     88        24.71  5.85    Male     No  Thur   Lunch     2  0.236746
       185       20.69  5.00    Male     No   Sun  Dinner     5  0.241663
       51        10.29  2.60  Female     No   Sun  Dinner     2  0.252672
       149        7.51  2.00    Male     No  Thur   Lunch     2  0.266312
       232       11.61  3.39    Male     No   Sat  Dinner     2  0.291990
Yes    109       14.31  4.00  Female    Yes   Sat  Dinner     2  0.279525
       183       23.17  6.50    Male    Yes   Sun  Dinner     4  0.280535
       67         3.07  1.00  Female    Yes   Sat  Dinner     1  0.325733
       178        9.60  4.00  Female    Yes   Sun  Dinner     2  0.416667
       172        7.25  5.15    Male    Yes   Sun  Dinner     2  0.710345
```
```python
tips.groupby(['smoker', 'day']).apply(top, n=1, column='total_bill')
Out[86]:
                 total_bill    tip     sex smoker   day    time  size   tip_pct
smoker day
No     Fri  94        22.75   3.25  Female     No   Fri  Dinner     2  0.142857
       Sat  212       48.33   9.00    Male     No   Sat  Dinner     4  0.186220
       Sun  156       48.17   5.00    Male     No   Sun  Dinner     6  0.103799
       Thur 142       41.19   5.00    Male     No  Thur   Lunch     5  0.121389
Yes    Fri  95        40.17   4.73    Male    Yes   Fri  Dinner     4  0.117750
       Sat  170       50.81  10.00    Male    Yes   Sat  Dinner     3  0.196812
       Sun  182       45.35   3.50    Male    Yes   Sun  Dinner     3  0.077178
       Thur 197       43.11   5.00  Female    Yes  Thur   Lunch     4  0.115982
```
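The same split-apply-combine pattern with tiny made-up data (no tips.csv needed); passing group_keys=False keeps the original row index instead of prepending the group key:

```python
import pandas as pd

df_demo = pd.DataFrame({'g': ['x', 'x', 'x', 'y', 'y'],
                        'v': [3, 1, 2, 5, 4]})

def top(frame, n=1, column='v'):
    # rows with the n largest values in `column`
    return frame.sort_values(by=column)[-n:]

best = df_demo.groupby('g', group_keys=False).apply(top)
print(best)
# keeps row 0 (max of group 'x') and row 3 (max of group 'y')
```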
```python
result = tips.groupby('smoker')['tip_pct'].describe()
result
Out[88]:
smoker
No      count    151.000000
        mean       0.159328
        std        0.039910
        min        0.056797
        25%        0.136906
        50%        0.155625
        75%        0.185014
        max        0.291990
Yes     count     93.000000
        mean       0.163196
        std        0.085119
        min        0.035638
        25%        0.106771
        50%        0.153846
        75%        0.195059
        max        0.710345
Name: tip_pct, dtype: float64

result.unstack('smoker')
Out[89]:
smoker          No        Yes
count   151.000000  93.000000
mean      0.159328   0.163196
std       0.039910   0.085119
min       0.056797   0.035638
25%       0.136906   0.106771
50%       0.155625   0.153846
75%       0.185014   0.195059
max       0.291990   0.710345
```

Quantile and bucket analysis:
```python
frame = DataFrame({'data1': np.random.randn(1000),
                   'data2': np.random.randn(1000)})
factor = pd.cut(frame.data1, 4)
factor[:10]
Out[92]:
0    (-1.307, 0.37]
1    (0.37, 2.0461]
2    (-1.307, 0.37]
3    (0.37, 2.0461]
4    (-1.307, 0.37]
5    (-1.307, 0.37]
6    (-1.307, 0.37]
7    (0.37, 2.0461]
8    (-1.307, 0.37]
9    (-1.307, 0.37]
Name: data1, dtype: category
Categories (4, object): [(-2.99, -1.307] < (-1.307, 0.37] < (0.37, 2.0461] < (2.0461, 3.723]]
```

The factor object returned by cut can be passed directly to groupby.
```python
def get_stats(group):
    return {'min': group.min(), 'max': group.max(),
            'count': group.count(), 'mean': group.mean()}

grouped = frame.data2.groupby(factor)
grouped.apply(get_stats).unstack()
Out[95]:
                 count       max      mean       min
data1
(-2.99, -1.307]  101.0  2.730091 -0.206504 -2.922759
(-1.307, 0.37]   531.0  2.773289 -0.034422 -2.892153
(0.37, 2.0461]   344.0  3.078922  0.046008 -3.127290
(2.0461, 3.723]   24.0  1.671815 -0.020857 -3.019921
```
```python
# qcut with labels=False returns the quantile (bucket) number for each value,
# producing equal-size buckets
grouping = pd.qcut(frame.data1, 10, labels=False)
grouped = frame.data2.groupby(grouping)
grouped.apply(get_stats).unstack()
Out[98]:
       count       max      mean       min
data1
0      100.0  2.730091 -0.212662 -2.922759
1      100.0  2.230289 -0.076260 -2.521450
2      100.0  2.404481  0.051184 -2.369298
3      100.0  2.773289  0.016575 -2.284056
4      100.0  2.328424 -0.043627 -2.892153
5      100.0  1.996065 -0.104378 -2.032999
6      100.0  2.166334  0.015241 -2.291139
7      100.0  2.687426 -0.057435 -2.409512
8      100.0  2.883604  0.267017 -3.127290
9      100.0  3.078922 -0.093742 -3.019921
```
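A deterministic sketch of qcut bucketing (values chosen for illustration): with labels=False each value gets its quantile number, and the buckets come out equal-sized.

```python
import pandas as pd
import numpy as np

data = pd.Series(np.arange(12, dtype=float))   # 0.0 .. 11.0
grouping = pd.qcut(data, 4, labels=False)      # quartile number 0-3 for each value
counts = data.groupby(grouping).size()
print(counts.tolist())
# 12 values into 4 quantile buckets -> 3 values per bucket
```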
```python
s = Series(np.random.randn(6))
s[::2] = np.nan          # positions 0, 2, 4 become NaN
s
a = Series(np.random.randn(9))
a[::3] = np.nan          # positions 0, 3, 6 become NaN
a
a.dropna()               # drop the missing values
s.fillna(s.mean())       # fill the missing values with the overall mean
```

Filling missing values with group-specific values:
```python
states = ['Ohio', 'New York', 'Vermont', 'Florida',
          'Oregon', 'Nevada', 'California', 'Idaho']
group_key = ['East'] * 4 + ['West'] * 4
data = Series(np.random.randn(8), index=states)
data[['Vermont', 'Nevada', 'Idaho']] = np.nan
data
Out[117]:
Ohio          0.326110
New York      0.136178
Vermont            NaN
Florida       0.392480
Oregon       -0.177571
Nevada             NaN
California    0.061360
Idaho              NaN
dtype: float64

data.groupby(group_key).mean()
Out[118]:
East    0.284923
West   -0.058105
dtype: float64
```

Fill the NA values with the group means:
```python
fill_mean = lambda g: g.fillna(g.mean())
data.groupby(group_key).apply(fill_mean)
Out[120]:
Ohio          0.326110
New York      0.136178
Vermont       0.284923
Florida       0.392480
Oregon       -0.177571
Nevada       -0.058105
California    0.061360
Idaho        -0.058105
dtype: float64
```
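The same group-mean fill can be written with transform, which guarantees the result stays aligned on the original index. A small sketch with made-up values (the 'east'/'west' labels and numbers are illustrative):

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0, 10.0, np.nan],
              index=['a1', 'a2', 'a3', 'b1', 'b2'])
group_key = ['east', 'east', 'east', 'west', 'west']
# fillna with each group's own mean: east mean = 2.0, west mean = 10.0
filled = s.groupby(group_key).transform(lambda g: g.fillna(g.mean()))
print(filled)
```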
```python
fill_values = {'East': 0.5, 'West': -1}
fill_func = lambda g: g.fillna(fill_values[g.name])
data.groupby(group_key).apply(fill_func)
Out[123]:
Ohio          0.326110
New York      0.136178
Vermont       0.500000
Florida       0.392480
Oregon       -0.177571
Nevada       -1.000000
California    0.061360
Idaho        -1.000000
dtype: float64
```

Random sampling: drawing cards from a deck
```python
# Deck setup: A, 2-10, J, K, Q in each of the four suits; face cards count 10
base_names = ['A'] + list(range(2, 11)) + ['J', 'K', 'Q']
card_val = (list(range(1, 11)) + [10] * 3) * 4
cards = []
for suit in ['H', 'S', 'C', 'D']:
    # extend() appends the elements of an iterable to the end of the list
    cards.extend(str(num) + suit for num in base_names)

deck = Series(card_val, index=cards)   # a Series of length 52, indexed by card name
deck[:20]

# Draw 5 cards from the deck
def draw(deck, n=5):
    return deck.take(np.random.permutation(len(deck))[:n])

draw(deck)

# To draw two cards from each suit: the suit is the last character of the
# card name, so group on it and use apply
get_suit = lambda card: card[-1]   # last letter is the suit
deck.groupby(get_suit).apply(draw, n=2)
deck.groupby(get_suit, group_keys=False).apply(draw, n=2)
```

Group weighted average and correlation:
```python
df = DataFrame({'category': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                'data': np.random.randn(8),
                'weights': np.random.rand(8)})
grouped = df.groupby('category')
get_wavg = lambda g: np.average(g['data'], weights=g['weights'])
grouped.apply(get_wavg)
Out[146]:
category
a    0.858361
b   -0.098089
dtype: float64
```
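A checkable version of the group weighted average with fixed numbers (illustrative data; the frame above is random):

```python
import pandas as pd
import numpy as np

df_w = pd.DataFrame({'category': ['a', 'a', 'b', 'b'],
                     'data': [1.0, 3.0, 2.0, 4.0],
                     'weights': [1.0, 3.0, 1.0, 1.0]})
get_wavg = lambda g: np.average(g['data'], weights=g['weights'])
wavg = df_w.groupby('category').apply(get_wavg)
print(wavg)
# 'a': (1*1 + 3*3) / (1 + 3) = 2.5 ; 'b': (2 + 4) / 2 = 3.0
```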