Hive的分析函数操作

来源:互联网 发布:jenkins python pdf 编辑:程序博客网 时间:2024/05/16 09:32
分析函数
--------------
0.11之后支持的,扫描多个输入的行计算每行的结果。通常和OVER, PARTITION BY, ORDER BY,windowing
配合使用。和传统分组结果不一样,传统结果没组中只有一个结果(max)。

分析函数的结果会出现多次,和每条记录都连接输出。

Function (arg1,..., argn) OVER ([PARTITION BY <...>] [ORDER BY <....>] [<window_clause>])

SELECT name, dept_num, salary,
COUNT(*) OVER (PARTITION BY dept_num) AS row_cnt,
SUM(salary) OVER(PARTITION BY dept_num ORDER BY dept_num) AS deptTotal,
SUM(salary) OVER(ORDER BY dept_num) AS runningTotal1, 
SUM(salary) OVER(ORDER BY dept_num, name rows unbounded preceding) AS runningTotal2
FROM employee_contract
ORDER BY dept_num, name;

//宏观使用cid排序整个数据集,在分区内按照id降序排列。
SELECT id, orderno, price,cid ,
COUNT(*) OVER (PARTITION BY cid) AS cnt  , 
min(price) over (partition by orderno order by id desc) FROM orders ORDER BY cid;

//
SELECT id, orderno, price,cid ,
min(price) over (partition by orderno) FROM orders ORDER BY cid;

//order by每条记录内取.
SELECT id, orderno, price,cid ,
min(price) over (order by price desc) FROM orders ORDER BY cid;

//分区都是独立分区,不是嵌套再分区
SELECT id, orderno, price,cid ,
COUNT(*) OVER (PARTITION BY cid) AS cnt  , 
min(price) over (partition by orderno) FROM orders ORDER BY cid;

//分区内排序
SELECT id, orderno, price,cid ,
min(price) over (partition by cid order by price desc) FROM orders ORDER BY cid;

//rank
SELECT id, orderno, price,cid ,
RANK() OVER (PARTITION BY cid ORDER BY price) FROM orders ORDER BY cid;

//dense_rank
SELECT id, orderno, price,cid ,
dense_rank() over (partition by cid) FROM orders ORDER BY cid;

//row_number()
SELECT id, orderno, price,cid ,
row_number() over (partition by cid) FROM orders ORDER BY cid;

//CUME_DIST:

//PERCENT_RANK
currow-1 / totalrow - 1
1: 1 - 1 / 3 - 1 = 0
2: 2 - 1 / 3 - 1 = 0.5 
3: 3 - 1 / 3 - 1 = 1

//NTILE:

创建表

........................................

CREATE TABLE employee
(
name string,
dept_num int,
salary float 
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

SELECT name, dept_num, salary,
RANK()       OVER (PARTITION BY dept_num ORDER BY salary) AS rank,
DENSE_RANK() OVER (PARTITION BY dept_num ORDER BY salary) AS dense_rank, 
ROW_NUMBER() OVER () AS row_num,
ROUND((CUME_DIST() OVER (PARTITION BY dept_num ORDER BY salary)), 2) AS cume_dist,
PERCENT_RANK() OVER(PARTITION BY dept_num ORDER BY salary) AS percent_rank,
NTILE(4) OVER(PARTITION BY dept_num ORDER BY salary) AS ntile
FROM employee ORDER BY dept_num;

CUME_DIST:累加分布
--------------------
current row_num/ total rows,如果重复行,都取相同末尾行的行号。
例如:
1: 2 / 3 = 0.67
1: 2 / 3 = 0.67
2:  3 / 3 = 1


1: 1 / 3 = 0.33
2: 3 / 3 = 1
2:  3 / 3 = 1


1: 3 / 3 = 1
1: 3 / 3 = 1
1: 3 / 3 = 1

percent_rank
---------------
currentrow - 1 / totalrow - 1
类似于cume_dist,但是提取相同rank的首行行号。
1: 1 - 1 / 3 - 1 = 0
1: 1 - 1 / 3 - 1 = 0
2: 3 - 1 / 3 - 1 = 1 

1: 1 - 1 / 3 - 1 = 0
2: 2 - 1 / 3 - 1 = 0.5
2: 2 - 1 / 3 - 1 = 0.5 

NTile
----------------
对每条记录分配桶的编号,桶的个数.指定桶的数。
原创粉丝点击