hive之窗口函数理解与实践

来源:互联网 发布:python就业班视频 编辑:程序博客网 时间:2024/05/31 19:37

快过年了,总算没那么忙了,可以完善一下这篇文章了。之前只能说是备忘罢了,很多细节没说清楚。

首先,我们要知道什么是窗口子句:

需要指定一个窗口的边界,语法是这样的:

ROWS betweenCURRENT ROW | UNBOUNDED PRECEDING | [num] PRECEDING AND  UNBOUNDED FOLLOWING | [num] FOLLOWING| CURRENT ROW

RANGE between [num] PRECEDING  AND [num]FOLLOWING

如下图:



ROWS是物理窗口,从行数上控制窗口的尺寸的;

RANGE是逻辑窗口,从列值上控制窗口的尺寸。这个比较难理解,但说白了就简单了,具体解释如下面栗子1.

结合order by子句使用,如果在order by子句后面没有指定窗口子句,则默认为:range between unbounded preceding and current row

------------------------------------------------------------

rt_data表存放了当天每半小时的店铺销售数据

字段stat_date格式yyMMddHHmm

------------------------------------------------------------

1.统计截止到当前时间段的店铺累计销售金额

select shop_id, stat_date, ordamt, sum(ordamt) over(partition by shop_id order by stat_date) from rt_data where dt = '2015-01-11' and shop_id = 10026;1002620150112003051705170100262015011201005669108391002620150112013023961323510026201501120200149814733100262015011202301997167301002620150112030011881791810026201501120330598185161002620150112040047918995100262015011204301587205821002620150112053079921381

这里order by 后面省去了窗口子句,还原起来就是order by stat_date range between unbounded preceding and current row. 这是什么意思?

就是先按stat_date排序,窗口就是比当前stat_date小的行到当前stat_date的行。

其实,这里写成order by stat_date rows between unbounded preceding and current row是一样的。

2.统计每个时间段的销售占比

select shop_id, stat_date, ordamt, ordamt / sum(ordamt) over(partition by shop_id) from rt_data where dt = '2015-01-11' and shop_id = 10026order by stat_date;1002620150112090054060.01422710026201501120930130980.03447110026201501121000156790.04126310026201501121030121340.03193410026201501121100262590.06910710026201501121130165160.04346610026201501121200201270.052969

partition by之后没有使用order by,窗口大小默认就是整个分组。

3.找出2点的销售金额及前半小时的销售金额和后1个小时的销售金额

select * from (     select         shop_id,         stat_date,        ordamt,         lag(stat_date,1) over(partition by shop_id order by stat_date),         lag(ordamt,1) over(partition by shop_id order by stat_date),         lead(stat_date,2) over(partition by shop_id order by stat_date),        lead(ordamt,2) over(partition by shop_id order by stat_date)     from rt_data     where dt = '2015-01-11' and shop_id = 10026 ) twhere stat_date = 201501120200;10026201501120200149820150112013023962015011203001188

lag(field, N)是取前N行的值,lead(field, N)是取后N行的值。

4.按照销售金额进行排名,金额最大的排最前(limit可以取topn的数)

select shop_id, stat_date, ordamt, dense_rank() over(partition by shop_id order by ordamt desc) from rt_data where dt = '2015-01-11' and shop_id = 10026;1002620150112110026259.011002620150112120020127.021002620150112130018138.031002620150112113016516.041002620150112140015956.051002620150112100015679.061002620150112220014290.071002620150112143014203.081002620150112133013290.091002620150112093013098.010

dense_rank和rank都是排名函数,区别在于dense_rank是连续排名,rank遇到排名并列时,下一列排名跳空。

percent_rank,介于0和1直接的小数形式表示,计算方法是(rank - 1) / (n-1),其中rank为行的序号,n为组的行数,这个暂时不知道有什么地方可以运用,欢迎指导。

5.按销售金额排序,取出前20%的时间段和相应金额

select * from (    select shop_id, stat_date, ordamt, ntile(5) over(partition by shop_id order by ordamt desc) sorted     from rt_data     where dt = '2015-01-11' and shop_id = 10026 ) t where sorted = 1;1002620150112110026259.011002620150112120020127.011002620150112130018138.011002620150112113016516.011002620150112140015956.011002620150112100015679.011002620150112220014290.011002620150112143014203.011002620150112133013290.01
NTILE这个很强大,以前要获取一定比例的数据是非常困难的,NTILE就是把有序分区中的行分发到指定数据的组中,各个组有编号,编号从1开始,对于每一行,NTILE返回此行所属的组的编号。

---------------------------------------------------------------------

over语句还可以独立出来,用window重写,但需要注意的是,如下sort by使用了多个字段,如果用range指定窗口的话会出错,需要用rows来指定窗口,因为range是对列的比较,不支持多列比较。

select shop_id, stat_date, ordamt, sum(ordamt) over win as tfrom rt_data where dt = '2015-01-11' and shop_id = 10026window win as (distribute by shop_id sort by shop_id, ordamt desc rows between unbounded preceding and current row);


1 0
原创粉丝点击