where top1000(X)<=10等条件谓词提前问题记录

来源:互联网 发布:nginx fastcgi temp 编辑:程序博客网 时间:2024/03/29 09:56

今天,遇到这样一个问题,在下面sql中:

add jar /opt/hadoop/yuanfeng/bdiudf.jar;
          create temporary function top1000 as 'com.bfd.id.Id';
          select a.state_date,a.customer,a.taskid,a.step_id,a.exit_title,a.pv,top1000(a.only_id)
          from
                (  select t1.state_date,t1.customer,t1.taskid,t1.step_id,t1.exit_title,t1.pv,t1.only_id
                  from 
                      ( select t11.state_date,
                               t11.customer,
                               t11.taskid,
                               t11.step_id,
                               t11.exit_title,
                               t11.pv,
                               concat(t11.customer,t11.taskid,t11.step_id) as only_id
                       from
                          (  select state_date,customer,taskid,step_id,exit_title,count(*) as pv
                             from bdi_fact2.mid_url_step
                             where exit_url!='-1'
                             and exit_title !='-1'
                             and l_date='2015-08-31'
                             group by state_date,customer,taskid,step_id,exit_title
                            )t11
                       )t1
                       order by t1.only_id,t1.pv desc
                 )a
          where  a.customer='Cdianyingwang'
          and a.taskid='33'
          and a.step_id='0' 
          and top1000(a.only_id)<=10;

在0.9和0.14的hive中查询结果不一样,究其根本,是0.9hive没有谓词提前功能(ppd),而0.14有这个功能。

问题就出现在这里,请看执行计划:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1
  Stage-0 is a root stage
可以看到,一共两个运算阶段,stage-1,stage-2
stage-1:
//即子查询a.t1表的内容 
                      ( select t11.state_date,
                               t11.customer,
                               t11.taskid,
                               t11.step_id,
                               t11.exit_title,
                               t11.pv,
                               concat(t11.customer,t11.taskid,t11.step_id) as only_id
                       from
                          (  select state_date,customer,taskid,step_id,exit_title,count(*) as pv
                             from bdi_fact2.mid_url_step
                             where exit_url!='-1'
                             and exit_title !='-1'
                             and l_date='2015-08-31'
                             group by state_date,customer,taskid,step_id,exit_title
                            )t11
                       )t1

来看看这个阶段做了什么:

看到红框里的内容了吗?编译器在这里做了一个愚蠢的事情(其实也不能怪编译器),在错误的时间错误的地点做了错误的事。

——在这个时候提前把数据过滤了,为什么说是提前呢,是因为原语句的意图是在所有记录根据only_id和pv全排序后再取top1000,而不是在t1表

这个中间结果这里,而就是这样的动作导致了后续的一些错误:

看下这个图:

这个是stage-1的执行日志,可以看到reduce有两个。因为在这个阶段过早的top1000了(top1000(only_id)<=10),每个reduce取了10条记录。

好,stage-1先说到这里,我们来看看stage-2:

select a.state_date,a.customer,a.taskid,a.step_id,a.exit_title,a.pv,top1000(a.only_id)
          from
                (  select t1.state_date,t1.customer,t1.taskid,t1.step_id,t1.exit_title,t1.pv,t1.only_id
                  from 
                      
                        ...
                       )t1
                       order by t1.only_id,t1.pv desc
                 )a
          where  a.customer='Cdianyingwang'
          and a.taskid='33'
          and a.step_id='0' 
          and top1000(a.only_id)<=10;

其实,stage-2就只有一个reduce阶段:

可以看到,本来在这个阶段应该做top1000(only_id)<=10操作的,但是在谓词条件这一块是空的!因为这个条件已经被提前了!

好的,已经比较清晰了,为什么top1000(only_id)<=10明明限定了10条,结果会出20条——因为stage-1有2个reduce.

为什么会这样?

因为对于编译器来说,top1000(only_id)<=10只是一个谓词条件,跟a=1,b="abc"是没有差异的,编译器会毫不犹豫的把这个条件下推到语句的最内层(t1)。

所以,这不是bug,编译器无法感知谓词条件或者说是udf函数是什么意图,只会简单的把他下推到底。

所以该怎么做?
可以:set hive.optimize.ppd=false;
0 0