Problem solved: slow double count(distinct) queries in Hive

Source: Internet | Posted by: struts2遍历标签数据加 | Editor: 程序博客网 | Time: 2024/06/18 03:09

"Double count(distinct)" here means a statement like the following:

select day,count(distinct session_id),count(distinct user_id) from log a group by day;

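To run a statement like this at all, you must first set the parameter: set hive.groupby.skewindata=false; (with the value true, Hive rejects a query that applies DISTINCT to different columns.)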

We can solve the speed problem by trading space for time:

select day,
       count(case when type='session' then 1 else null end) as session_cnt,
       count(case when type='user' then 1 else null end) as user_cnt
from (
    select day, session_id, type
    from (
        select day, session_id, 'session' as type from log
        union all
        select day, user_id, 'user' as type from log
    ) t0
    group by day, session_id, type
) t1
group by day;

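The type column here is entirely our own invention. Its purpose is to spend extra space so that the three operations (looking up the values, deduplicating them, and adding 1 per distinct value) are spread across different MR jobs, which speeds the query up. The inner group by day, session_id, type does the deduplication, and the outer count only has to count the rows that survive.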
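As a quick sanity check that the rewritten query returns the same numbers as the original double count(distinct), here is a minimal sketch. It runs both statements against an in-memory SQLite database through Python's sqlite3 module, which is only a stand-in for Hive: it checks the logic of the rewrite, not its performance. The sample rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite table standing in for the Hive table `log` (sample data is invented).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (day TEXT, session_id TEXT, user_id TEXT)")
conn.executemany(
    "INSERT INTO log VALUES (?, ?, ?)",
    [
        ("2024-01-01", "s1", "u1"),
        ("2024-01-01", "s1", "u1"),  # exact duplicate row
        ("2024-01-01", "s2", "u1"),
        ("2024-01-01", "s3", "u2"),
        ("2024-01-02", "s4", "u1"),
        ("2024-01-02", "s5", "u3"),
        ("2024-01-02", "s5", "u3"),  # exact duplicate row
    ],
)

# The original statement with two count(distinct) aggregates.
DIRECT = """
SELECT day, COUNT(DISTINCT session_id), COUNT(DISTINCT user_id)
FROM log
GROUP BY day
ORDER BY day
"""

# The "space for time" rewrite: tag each value with a type, dedupe, then count.
REWRITTEN = """
SELECT day,
       COUNT(CASE WHEN type = 'session' THEN 1 ELSE NULL END) AS session_cnt,
       COUNT(CASE WHEN type = 'user' THEN 1 ELSE NULL END) AS user_cnt
FROM (
    SELECT day, session_id, type
    FROM (
        SELECT day, session_id, 'session' AS type FROM log
        UNION ALL
        SELECT day, user_id, 'user' AS type FROM log
    ) t0
    GROUP BY day, session_id, type
) t1
GROUP BY day
ORDER BY day
"""

direct_result = conn.execute(DIRECT).fetchall()
rewritten_result = conn.execute(REWRITTEN).fetchall()
print(direct_result)
print(rewritten_result)
```

One caveat: count(distinct x) ignores NULL values, while the rewrite counts a NULL id as one group. If session_id or user_id can be NULL, filter those rows out inside each union all branch.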

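Note that the number of type values equals the number of count(distinct) expressions in the original statement. It has nothing to do with how many distinct values session_id or user_id take.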
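To illustrate that one type value is needed per count(distinct) expression, the sketch below generates the rewrite for any list of columns. The helper build_rewrite, the column alias val, and the extra ip column are hypothetical names introduced here, not part of the original post. The result is again checked on SQLite rather than Hive. The helper pastes identifiers straight into the SQL text, so it is only suitable for trusted, hard-coded names.

```python
import sqlite3


def build_rewrite(table, group_col, distinct_cols):
    """Hypothetical helper: build the union-all rewrite, one branch per column."""
    # One UNION ALL branch per column; the column name doubles as its type tag.
    branches = "\n    UNION ALL\n".join(
        f"    SELECT {group_col}, {col} AS val, '{col}' AS type FROM {table}"
        for col in distinct_cols
    )
    # One conditional count per type tag in the outer query.
    counters = ",\n".join(
        f"       COUNT(CASE WHEN type = '{col}' THEN 1 ELSE NULL END) AS {col}_cnt"
        for col in distinct_cols
    )
    return (
        f"SELECT {group_col},\n{counters}\n"
        f"FROM (\n  SELECT {group_col}, val, type\n  FROM (\n{branches}\n  ) t0\n"
        f"  GROUP BY {group_col}, val, type\n) t1\n"
        f"GROUP BY {group_col}\nORDER BY {group_col}"
    )


# Three count(distinct) columns, so the generated query has three type values.
sql = build_rewrite("log", "day", ["session_id", "user_id", "ip"])
print(sql)

# Check the generated query on an in-memory SQLite table with invented rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (day TEXT, session_id TEXT, user_id TEXT, ip TEXT)")
conn.executemany(
    "INSERT INTO log VALUES (?, ?, ?, ?)",
    [
        ("d1", "s1", "u1", "ip1"),
        ("d1", "s2", "u1", "ip1"),
        ("d1", "s2", "u2", "ip2"),
        ("d2", "s3", "u3", "ip1"),
    ],
)
result = conn.execute(sql).fetchall()
print(result)
```

Because every branch feeds the same val column, the columns being counted need compatible types, or they would have to be cast to a common type such as string first.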
