HiveSQL 进阶指南

来源:互联网 发布:逆回购 知乎 编辑:程序博客网 时间:2024/06/06 08:38

SQL进阶指南

数据准备

环境:
hadoop 2.4.1 +
hive 1.2.1+

create table tmp.yws_example as select 1 as showing_id,'A' as broker_id ,'c001' as cust_id,'h001' as housedel_id ,1 as biz_type ,20160101 as start_time   union all select 2 as showing_id,'A' as broker_id ,'c002' as cust_id,'h001' as housedel_id ,1 as biz_type ,20160105 as start_time   union all select 3 as showing_id,'A' as broker_id ,'c003' as cust_id,'h002' as housedel_id ,2 as biz_type ,20160108 as start_time   union all select 4 as showing_id,'A' as broker_id ,'c005' as cust_id,'h002' as housedel_id ,2 as biz_type ,20160209 as start_time   union all select 5 as showing_id,'A' as broker_id ,'c001' as cust_id,'h003' as housedel_id ,1 as biz_type ,20160218 as start_time   union all select 6 as showing_id,'A' as broker_id ,'c004' as cust_id,'h004' as housedel_id ,2 as biz_type ,20160214 as start_time   union all select 7 as showing_id,'A' as broker_id ,'c008' as cust_id,'h004' as housedel_id ,2 as biz_type ,20160307 as start_time   union all select 8 as showing_id,'A' as broker_id ,'c007' as cust_id,'h001' as housedel_id ,1 as biz_type ,20160306 as start_time   union all select 9 as showing_id,'A' as broker_id ,'c004' as cust_id,'h001' as housedel_id ,1 as biz_type ,20160329 as start_time   union all select 10 as showing_id,'A' as broker_id ,'c002' as cust_id,'h002' as housedel_id ,2 as biz_type ,20160411 as start_time  union all select 11 as showing_id,'A' as broker_id ,'c002' as cust_id,'h002' as housedel_id ,2 as biz_type ,20160419 as start_time  union all select 12 as showing_id,'A' as broker_id ,'c002' as cust_id,'h003' as housedel_id ,1 as biz_type ,20160425 as start_time  union all select 13 as showing_id,'B' as broker_id ,'c009' as cust_id,'h002' as housedel_id ,2 as biz_type ,20160523 as start_time  union all select 14 as showing_id,'B' as broker_id ,'c005' as cust_id,'h002' as housedel_id ,2 as biz_type ,20160724 as start_time  union all select 15 as showing_id,'B' as broker_id ,'c007' as cust_id,'h003' as housedel_id ,1 as biz_type ,20160829 as start_time  union all select 16 as showing_id,'B' as broker_id ,'c006' as cust_id,'h004' as housedel_id ,2 as biz_type ,20160917 as start_time  union all select 17 as showing_id,'B' as broker_id ,'c007' as cust_id,'h001' as housedel_id ,1 as biz_type ,20160323 as start_time  union all select 18 as showing_id,'B' as broker_id ,'c009' as cust_id,'h001' as housedel_id ,1 as biz_type ,20160422 as start_time  union all select 19 as showing_id,'B' as broker_id ,'c004' as cust_id,'h003' as housedel_id ,1 as biz_type ,20161011 as start_time  union all select 20 as showing_id,'B' as broker_id ,'c003' as cust_id,'h002' as housedel_id ,2 as biz_type ,20161112 as start_time  union all select 21 as showing_id,'B' as broker_id ,'c001' as cust_id,'h002' as housedel_id ,2 as biz_type ,20161213 as start_time  union all select 22 as showing_id,'B' as broker_id ,'c007' as cust_id,'h003' as housedel_id ,1 as biz_type ,20160530 as start_time  union all select 23 as showing_id,'B' as broker_id ,'c006' as cust_id,'h001' as housedel_id ,1 as biz_type ,20160809 as start_time  union all select 24 as showing_id,'B' as broker_id ,'c002' as cust_id,'h001' as housedel_id ,1 as biz_type ,20160907 as start_time ;

CASE WHEN

例子

select     case when biz_type=1 then '买卖'         when biz_type=2 then '租赁'         else '未知'    end as case_name   ,if(biz_type=1,'买卖',if(biz_type=2,'租赁','未知')) as if_name   ,count(1)   ,sum(1)   ,max(start_time)   ,min(start_time)    -- avgfrom tmp.yws_examplegroup by     case when biz_type=1 then '买卖'         when biz_type=2 then '租赁'         else '未知'    end   ,if(biz_type=1,'买卖',if(biz_type=2,'租赁','未知'))

CASE WHEN与IF()\NVL()比较

   1、case when 语法中when 语句可以多次使用,最后可以使用else ,类似if --else if --else if --else .   2、if()可以写成与case when 等价。但条理性比较差,不容易组织代码   3、nvl() 更简单,但功能有限。完全可以使用case when /if()替换

常见用法

1、编码转换/异常转换 —- 如上例代码,不多表述

2、连续数据离散化

举例如房子面积来分小/中/大户型select     case when build_area>0 and build_area<90 then '小户型'         when build_area>=90 and build_area<144 then '中户型'         when build_area>=144 then '大户型'         else '未知'    end as build_type_name    .....from tab_namewhere pt='20170424000000'and dp in ('4','5','6')

3、分类合并

-- 22:本科(实习生),50:本科(统招),40:专科(统招),30:大专以下-- ,60:硕士研究生,45:45-本科,17:专科(非统招),70:博士研究生-- ,20:硕士研究生(实习生),10:初中及以下,18:专科(实习生),07:未知,A:未知,''::未知select     case when highest_degree in (22,50,45)  then '本科'         when highest_degree in (40,17,18)  then '专科'         when highest_degree in (60,20) then '硕士研究生'         when highest_degree=70 then '博士研究生'         else '其它'    end as degree_type_name    .......from table_name  where pt='20170424000000'

4、case when 经常与聚合函数配合使用,实现分支计算

select     pt   ,sum(case when request_url regexp '/ershoufang/' then 1 end) ershoufang_pv    ,sum(case when request_url regexp '^http://xx.faxxxng.xxxxx.cxxxom/' then 1 end) xinfang_pv    ,sum(case when request_url regexp '/zufang/' then 1 end ) zufang_pv    .......from log_table_name where pt='20170424000000'group by pt 

5、与聚合函数配合使用,实现行转列功能

select     broker_id    ,max(case when biz_type=1 then start_time else 10000101 end )   ,min(case when biz_type=1 then start_time else 99990101 end )   ,max(case when biz_type=2 then start_time else 10000101 end )   ,min(case when biz_type=2 then start_time else 99990101 end )from  tmp.yws_examplegroup by broker_id

窗口函数

URL: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics

LEAD/LAG–相对取值

ead 向后取指定位置列值,如果没有则填默认值
lag 与lead向反,是向前取值
lead(取值列,窗口内偏移量,默认值)
lag(取值列,窗口内偏移量,默认值)
over(partition by 分窗口的列 order by 排序列 )

lead/lag示例

first_value/last_value–绝对取值

fist_value/last_value示例

count/sum/min/max/avg–相对聚合

count/sum/min/max/avg示例

分析函数

RANK/ROW_NUMBER/DENSE_RANK

常用分析函数

cume_dist/percent_rank/ntile

常用分析函数2

聚合函数、case when 与row_number结合

这里写图片描述

列拆行 行合列

参考URL:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
行与列互操作
注意:
如果要求合并的字段顺序和原来拆分时相同就得写自定义函数了

数据立方体 rollup/cube/grouping sets

参考链接:
https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C+Grouping+and+Rollup
这里写图片描述
注意:上图中多个group by 同时只能存在一个

0 0