hive之窗口函数理解与实践
来源:互联网 发布:python就业班视频 编辑:程序博客网 时间:2024/05/31 19:37
快过年了,总算没那么忙了,可以完善一下这篇文章了。之前只能说是备忘罢了,很多细节没说清楚。
首先,我们要知道什么是窗口子句:
需要指定一个窗口的边界,语法是这样的:
ROWS betweenCURRENT ROW | UNBOUNDED PRECEDING | [num] PRECEDING AND UNBOUNDED FOLLOWING | [num] FOLLOWING| CURRENT ROW
或
RANGE between [num] PRECEDING AND [num]FOLLOWING
如下图:
ROWS是物理窗口,从行数上控制窗口的尺寸的;
RANGE是逻辑窗口,从列值上控制窗口的尺寸。这个比较难理解,但说白了就简单了,具体解释如下面栗子1.
结合order by子句使用,如果在order by子句后面没有指定窗口子句,则默认为:range between unbounded preceding and current row
------------------------------------------------------------
rt_data表存放了当天每半小时的店铺销售数据
字段stat_date格式yyMMddHHmm
------------------------------------------------------------
1.统计截止到当前时间段的店铺累计销售金额
select shop_id, stat_date, ordamt, sum(ordamt) over(partition by shop_id order by stat_date) from rt_data where dt = '2015-01-11' and shop_id = 10026;1002620150112003051705170100262015011201005669108391002620150112013023961323510026201501120200149814733100262015011202301997167301002620150112030011881791810026201501120330598185161002620150112040047918995100262015011204301587205821002620150112053079921381
这里order by 后面省去了窗口子句,还原起来就是order by stat_date range between unbounded preceding and current row. 这是什么意思?
就是先按stat_date排序,窗口就是比当前stat_date小的行到当前stat_date的行。
其实,这里写成order by stat_date rows between unbounded preceding and current row是一样的。
2.统计每个时间段的销售占比
select shop_id, stat_date, ordamt, ordamt / sum(ordamt) over(partition by shop_id) from rt_data where dt = '2015-01-11' and shop_id = 10026order by stat_date;1002620150112090054060.01422710026201501120930130980.03447110026201501121000156790.04126310026201501121030121340.03193410026201501121100262590.06910710026201501121130165160.04346610026201501121200201270.052969
partition by之后没有使用order by,窗口大小默认就是整个分组。
3.找出2点的销售金额及前半小时的销售金额和后1个小时的销售金额
select * from ( select shop_id, stat_date, ordamt, lag(stat_date,1) over(partition by shop_id order by stat_date), lag(ordamt,1) over(partition by shop_id order by stat_date), lead(stat_date,2) over(partition by shop_id order by stat_date), lead(ordamt,2) over(partition by shop_id order by stat_date) from rt_data where dt = '2015-01-11' and shop_id = 10026 ) twhere stat_date = 201501120200;10026201501120200149820150112013023962015011203001188
lag(field, N)是取前N行的值,lead(field, N)是取后N行的值。
4.按照销售金额进行排名,金额最大的排最前(limit可以取topn的数)
select shop_id, stat_date, ordamt, dense_rank() over(partition by shop_id order by ordamt desc) from rt_data where dt = '2015-01-11' and shop_id = 10026;1002620150112110026259.011002620150112120020127.021002620150112130018138.031002620150112113016516.041002620150112140015956.051002620150112100015679.061002620150112220014290.071002620150112143014203.081002620150112133013290.091002620150112093013098.010
dense_rank和rank都是排名函数,区别在于dense_rank是连续排名,rank遇到排名并列时,下一列排名跳空。
percent_rank,介于0和1直接的小数形式表示,计算方法是(rank - 1) / (n-1),其中rank为行的序号,n为组的行数,这个暂时不知道有什么地方可以运用,欢迎指导。
5.按销售金额排序,取出前20%的时间段和相应金额
select * from ( select shop_id, stat_date, ordamt, ntile(5) over(partition by shop_id order by ordamt desc) sorted from rt_data where dt = '2015-01-11' and shop_id = 10026 ) t where sorted = 1;1002620150112110026259.011002620150112120020127.011002620150112130018138.011002620150112113016516.011002620150112140015956.011002620150112100015679.011002620150112220014290.011002620150112143014203.011002620150112133013290.01NTILE这个很强大,以前要获取一定比例的数据是非常困难的,NTILE就是把有序分区中的行分发到指定数据的组中,各个组有编号,编号从1开始,对于每一行,NTILE返回此行所属的组的编号。
---------------------------------------------------------------------
over语句还可以独立出来,用window重写,但需要注意的是,如下sort by使用了多个字段,如果用range指定窗口的话会出错,需要用rows来指定窗口,因为range是对列的比较,不支持多列比较。
select shop_id, stat_date, ordamt, sum(ordamt) over win as tfrom rt_data where dt = '2015-01-11' and shop_id = 10026window win as (distribute by shop_id sort by shop_id, ordamt desc rows between unbounded preceding and current row);
- hive之窗口函数理解与实践
- hive之窗口函数理解与实践
- hive之窗口函数理解与实践
- hive之窗口函数理解与实践
- hive窗口函数理解加实践
- Hive 之 分析窗口函数
- [Hive]窗口函数与分析函数
- Hive 窗口与分析型函数
- 9.16 hive基本命令与窗口函数
- HIVE窗口函数
- Hive窗口函数
- Hive窗口函数总结
- Hive窗口函数
- Hive 窗口函数
- Hive的窗口函数
- Hive分析窗口函数
- hive函数 regexp_extract实践
- Hive介绍与实践
- request.getRemoteAddr()如何获取用户真实的IP地址
- scala入门-08 apply方法和单例对象的使用
- URL和URI
- IOS SDK百度地图不能正常显示,只显示网格
- C++ Virtual详解
- hive之窗口函数理解与实践
- Unity Optimisation Basics Part 1
- Xcode5修改APP的名称与包名
- java中web.xml配置文件
- Eclipse 项目出现红 x
- 奇偶个数
- IDEA 改变编码方式
- Hibernate的优缺点
- php 之apache rewrite伪静态的简单配置和使用