Hive Analytic Functions LEAD and LAG: A Worked Example

Source: Internet  Editor: 程序博客网  Time: 2024/06/05 21:12

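Overview

Hive's analytic functions, also called window functions, are used mainly for statistical analysis of data; Oracle offers the same kind of functions.
Within a single query, the LAG and LEAD analytic functions return the value of a column from the row N rows before (LAG) or N rows after (LEAD) the current row, as a separate column.
This can replace a self-join on the table, and LAG and LEAD are generally more efficient than the join. The OVER() clause defines the window over the query's result set, and the clauses inside its parentheses (such as PARTITION BY and ORDER BY) specify how that result set is partitioned and ordered before the function is applied.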

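Function reference

LAG

LAG(col, n, DEFAULT) returns the value of col from the row n rows above the current row within the window.
Argument 1 is the column name. Argument 2 is the number of rows to look back (optional, default 1). Argument 3 is the default value, returned when there is no row n rows above the current row in the window (if omitted, NULL is returned).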
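As a minimal sketch of the three-argument form, the query below looks back along each user's visit history. It assumes the test.user_log table created in the data preparation section of this article; the aliases prev_url and prev2_url and the 'N/A' placeholder are illustrative choices, not part of the original.

```sql
-- Sketch only. `time` is quoted with backticks because newer Hive releases
-- may treat TIME as a reserved word; on older versions the quotes are optional.
select userid,
       `time`,
       url,
       -- previous page of the same user (offset defaults to 1, default to NULL)
       lag(url) over (partition by userid order by `time`) as prev_url,
       -- page two visits earlier; 'N/A' when the partition has no such row
       lag(url, 2, 'N/A') over (partition by userid order by `time`) as prev2_url
  from test.user_log;
```

For each user's first visit prev_url is NULL because no default was given, while prev2_url shows 'N/A' for the first two visits.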

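LEAD

The opposite of LAG.
LEAD(col, n, DEFAULT) returns the value of col from the row n rows below the current row within the window.
Argument 1 is the column name. Argument 2 is the number of rows to look ahead (optional, default 1). Argument 3 is the default value, returned when there is no row n rows below the current row in the window (if omitted, NULL is returned).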
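The same idea in the forward direction, again as a sketch against the article's test.user_log table (the aliases and the 'END' placeholder are assumptions made for illustration):

```sql
-- Sketch only. `time` is quoted in case the Hive version reserves the word.
select userid,
       `time`,
       url,
       -- next page the same user opens (offset defaults to 1, default to NULL)
       lead(url) over (partition by userid order by `time`) as next_url,
       -- page opened two visits later; 'END' when the partition has no such row
       lead(url, 2, 'END') over (partition by userid order by `time`) as next2_url
  from test.user_log;
```

Here the last row of each user gets NULL in next_url, and the last two rows get 'END' in next2_url.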

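Scenario

Problem

A user, Peter, is browsing a website. At some moment he opens one page, some time later he opens another, and so on. How can we compute how long Peter stayed on a particular page, or the total time that users spent on a given page?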

Data preparation

The users' actions have been collected, transformed, and loaded into a Hive table with the following structure:
create table test.user_log(
    userid string,
    time string,
    url string
) row format delimited fields terminated by ',';
Sample records:
+------------------+----------------------+---------------+--+
| user_log.userid  |    user_log.time     | user_log.url  |
+------------------+----------------------+---------------+--+
| Peter            | 2015-10-12 01:10:00  | url1          |
| Peter            | 2015-10-12 01:15:10  | url2          |
| Peter            | 2015-10-12 01:16:40  | url3          |
| Peter            | 2015-10-12 02:13:00  | url4          |
| Peter            | 2015-10-12 03:14:30  | url5          |
| Marry            | 2015-11-12 01:10:00  | url1          |
| Marry            | 2015-11-12 01:15:10  | url2          |
| Marry            | 2015-11-12 01:16:40  | url3          |
| Marry            | 2015-11-12 02:13:00  | url4          |
| Marry            | 2015-11-12 03:14:30  | url5          |
+------------------+----------------------+---------------+--+

Analysis steps

Get the start and end time of each user's stay on a page
select userid,
       time stime,
       lead(time) over(partition by userid order by time) etime,
       url
  from test.user_log;
Result:
+---------+----------------------+----------------------+-------+--+
| userid  |        stime         |        etime         |  url  |
+---------+----------------------+----------------------+-------+--+
| Marry   | 2015-11-12 01:10:00  | 2015-11-12 01:15:10  | url1  |
| Marry   | 2015-11-12 01:15:10  | 2015-11-12 01:16:40  | url2  |
| Marry   | 2015-11-12 01:16:40  | 2015-11-12 02:13:00  | url3  |
| Marry   | 2015-11-12 02:13:00  | 2015-11-12 03:14:30  | url4  |
| Marry   | 2015-11-12 03:14:30  | NULL                 | url5  |
| Peter   | 2015-10-12 01:10:00  | 2015-10-12 01:15:10  | url1  |
| Peter   | 2015-10-12 01:15:10  | 2015-10-12 01:16:40  | url2  |
| Peter   | 2015-10-12 01:16:40  | 2015-10-12 02:13:00  | url3  |
| Peter   | 2015-10-12 02:13:00  | 2015-10-12 03:14:30  | url4  |
| Peter   | 2015-10-12 03:14:30  | NULL                 | url5  |
+---------+----------------------+----------------------+-------+--+
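For comparison, the self-join that LEAD replaces could look roughly like the sketch below. It numbers each user's rows and joins every row to the one numbered one higher. The CTE name ranked and the column rn are hypothetical, and this version still relies on row_number() to do the numbering, so it is an illustration of the join shape rather than a window-free rewrite.

```sql
-- Sketch only: pair each row with the same user's next row through a self-join.
with ranked as (
    select userid,
           `time`,
           url,
           -- position of the visit within the user's history
           row_number() over (partition by userid order by `time`) as rn
      from test.user_log
)
select a.userid,
       a.`time` as stime,
       b.`time` as etime,   -- NULL for each user's last visit (no matching row)
       a.url
  from ranked a
  left join ranked b
    on a.userid = b.userid
   and b.rn = a.rn + 1;
```

Under these assumptions it should produce the same stime/etime pairs as the LEAD query, but it needs an extra join to get there.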

Compute how long the user stayed on each page, in seconds. (In a real analysis the data must be cleaned at this step: a record showing that a user stayed on one page for 4 or 5 hours is certainly not usable.)
select userid,
       time stime,
       lead(time) over(partition by userid order by time) etime,
       UNIX_TIMESTAMP(lead(time) over(partition by userid order by time),'yyyy-MM-dd HH:mm:ss')
         - UNIX_TIMESTAMP(time,'yyyy-MM-dd HH:mm:ss') period,
       url
  from test.user_log;
Result:
+---------+----------------------+----------------------+---------+-------+--+
| userid  |        stime         |        etime         | period  |  url  |
+---------+----------------------+----------------------+---------+-------+--+
| Marry   | 2015-11-12 01:10:00  | 2015-11-12 01:15:10  | 310     | url1  |
| Marry   | 2015-11-12 01:15:10  | 2015-11-12 01:16:40  | 90      | url2  |
| Marry   | 2015-11-12 01:16:40  | 2015-11-12 02:13:00  | 3380    | url3  |
| Marry   | 2015-11-12 02:13:00  | 2015-11-12 03:14:30  | 3690    | url4  |
| Marry   | 2015-11-12 03:14:30  | NULL                 | NULL    | url5  |
| Peter   | 2015-10-12 01:10:00  | 2015-10-12 01:15:10  | 310     | url1  |
| Peter   | 2015-10-12 01:15:10  | 2015-10-12 01:16:40  | 90      | url2  |
| Peter   | 2015-10-12 01:16:40  | 2015-10-12 02:13:00  | 3380    | url3  |
| Peter   | 2015-10-12 02:13:00  | 2015-10-12 03:14:30  | 3690    | url4  |
| Peter   | 2015-10-12 03:14:30  | NULL                 | NULL    | url5  |
+---------+----------------------+----------------------+---------+-------+--+
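The cleaning step mentioned above could be sketched as a filter on the computed interval. The 4-hour cutoff is an assumption taken from the article's "4 or 5 hours" remark, not a recommended value, and the subquery alias t is arbitrary.

```sql
-- Sketch only: keep intervals that have an end time and are not implausibly long.
select userid, stime, etime, period, url
  from (select userid,
               `time` as stime,
               lead(`time`) over (partition by userid order by `time`) as etime,
               unix_timestamp(lead(`time`) over (partition by userid order by `time`),
                              'yyyy-MM-dd HH:mm:ss')
                 - unix_timestamp(`time`, 'yyyy-MM-dd HH:mm:ss') as period,
               url
          from test.user_log) t
 where period is not null      -- drop each user's last page, which has no end time
   and period <= 4 * 3600;     -- hypothetical cutoff: ignore stays longer than 4 hours
```

With the sample data, only the two url5 rows are removed, since every other interval is well under the cutoff.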

Compute the total time spent on each page, and the total time each user spent on each page
select nvl(url,'-1') url,
       nvl(userid,'-1') userid,
       sum(period) total_period
  from (select userid,
               time stime,
               lead(time) over(partition by userid order by time) etime,
               UNIX_TIMESTAMP(lead(time) over(partition by userid order by time),'yyyy-MM-dd HH:mm:ss')
                 - UNIX_TIMESTAMP(time,'yyyy-MM-dd HH:mm:ss') period,
               url
          from test.user_log) a
 group by url, userid with rollup;
Result:
+-------+---------+---------------+--+
|  url  | userid  | total_period  |
+-------+---------+---------------+--+
| -1    | -1      | 14940         |
| url1  | -1      | 620           |
| url1  | Marry   | 310           |
| url1  | Peter   | 310           |
| url2  | -1      | 180           |
| url2  | Marry   | 90            |
| url2  | Peter   | 90            |
| url3  | -1      | 6760          |
| url3  | Marry   | 3380          |
| url3  | Peter   | 3380          |
| url4  | -1      | 7380          |
| url4  | Marry   | 3690          |
| url4  | Peter   | 3690          |
| url5  | -1      | NULL          |
| url5  | Marry   | NULL          |
| url5  | Peter   | NULL          |
+-------+---------+---------------+--+
In the result, -1 marks the subtotal rows produced by WITH ROLLUP, and url5 has NULL totals because it is each user's last page and therefore has no end time.
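To answer the first question from the scenario directly (how long Peter stayed on one particular page), the same subquery can be filtered in the outer query. This is a sketch; 'Peter' and 'url3' are only example values. The filter is deliberately placed outside the subquery: filtering test.user_log before LEAD runs would remove the following row that supplies the end time.

```sql
-- Sketch only: total seconds one user spent on one page.
select userid,
       url,
       sum(period) as total_period
  from (select userid,
               url,
               unix_timestamp(lead(`time`) over (partition by userid order by `time`),
                              'yyyy-MM-dd HH:mm:ss')
                 - unix_timestamp(`time`, 'yyyy-MM-dd HH:mm:ss') as period
          from test.user_log) a
 where userid = 'Peter'   -- example user
   and url = 'url3'       -- example page
 group by userid, url;
```

With the sample data above this yields 3380 seconds.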



