Windowing Functions In Hive
来源:互联网 发布:java线程状态 编辑:程序博客网 时间:2024/06/05 20:45
感谢分享原文:https://acadgild.com/blog/windowing-functions-in-hive/
另外可参考:
http://www.cnblogs.com/skyEva/p/5730531.html
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics#LanguageManualWindowingAndAnalytics-WINDOWclause
Windowing allows you to create a window on a set of data further allowing aggregation surrounding that data. Windowing in Hive is introduced from Hive 0.11. In this blog, we will be giving a demo on the windowing functions available in Hive.
Windowing in Hive includes the following functions
Lead
The number of rows to lead can optionally be specified. If the number of rows to lead is not specified, the lead is one row.
Returns null when the lead for the current row extends beyond the end of the window.
Lag
The number of rows to lag can optionally be specified. If the number of rows to lag is not specified, the lag is one row.
Returns null when the lag for the current row extends before the beginning of the window.
FIRST_VALUE
LAST_VALUE
The OVER clause
- OVER with standard aggregates:
- COUNT
- SUM
- MIN
- MAX
- AVG
OVER with a PARTITION BY statement with one or more partitioning columns.
- OVER with PARTITION BY and ORDER BY with one or more partitioning and/or ordering columns.
Analytics functions
- RANK
- ROW_NUMBER
- DENSE_RANK
- CUME_DIST
- PERCENT_RANK
- NTILE
To give you a brief idea of these windowing functions in Hive, we will be using stock market data. You can download the sample stocks data from here and load into your stocks table.
Now we will create a table to load this stock market data as shown below.
Let us dive deeper into the window functions in Hive.
Lag
This function returns the values of the previous row. You can specify an integer offset which designates the row position else it will take the default integer offset as 1.
Here is the sample function for lag
Here using lag we can display the yesterday’s closing price of the ticker. Lag is to be used with over function, inside the over function you can use partition or order by classes.
In the below screenshot, you can see the closing price of the stock for the day and the yesterday’s price.
Lead
This function returns the values from the following rows. You can specify an integer offset which designates the row position else it will take the default integer offset as 1.
Here is the sample function for lead
Now using the lead function, we will find that whether the following day’s closing price is higher or lesser than today’s and that can be done as follows.
In the below screenshot, you can see the result.
FIRST_VALUE
It returns the value of the first row from that window. With the below query, you can see the first row high price of the ticker for all the days.
LAST_VALUE
It is the reverse of FIRST_VALUE. It returns the value of the last row from that window. With the below query, you can see the last row high price value of the ticker for all the days.
Let us now see the usage of the aggregate function using Over.
Count
It returns the count of all the values for the expression written in the over clause. From the below query, we can find the number of rows present for each ticker.
For each partition, the count of ticker will be calculated, you can see the same in the below screen shot.
Sum
It returns the sum of all the values for the expression written in the over clause. From the below query, we can find the sum of all the closing stock prices for that particular ticker.
For each ticker, the sum of all the closing prices will be calculated, you can see the same in the below screen shot.
select
dim,
sum(num) over(partition by dim) as total,
sum(num) over(partition by dim,dim_2) as total_2
from
(
select
'A' as dim,
'AA' as dim_2,
1 as num
union all
select
'A' as dim,
'CC' as dim_2,
1 as num
union all
select
'B' as dim,
'BB' as dim_2,
1 as num
)a
Finding running total
For suppose let us take if you want to get running total of the volume_for_the_day for all the days for every ticker then you can do this with the below query.
In the above screenshot, you can see the volume_for_the_day for each day and the running total is the sum of volume_for_the_day’s that are elapsed.
Finding the percentage of each row value
Now let’s take a scenario where you need to find the percentage of the volume_for_the_day on the total volumes for that particular ticker and that can be done as follows.
In the above screenshot, you can see that the percentage contribution of the volumes for the day is found based on the total volume for that ticker.
Min
It returns the minimum value of the column for the rows in that over clause. From the below query, we can find the minimum closing stock price for each particular ticker.
Max
It returns the maximum value of the column for the rows in that over clause. From the below query, we can find the maximum closing stock price for each particular ticker.
AVG
It returns the average value of the column for the rows that over clause returns. From the below query, we can find the average closing stock price for each particular ticker.
Now let us work on some Analytic functions.
Rank
The rank function will return the rank of the values as per the result set of the over clause. If two values are same then it will give the same rank to those 2 values and then for the next value, the sub-sequent rank will be skipped.
The below query will rank the closing prices of the stock for each ticker. The same you can see in the below screenshot.
Row_number
Row number will return the continuous sequence of numbers for all the rows of the result set of the over clause.
From the below query, you will get the ticker, closing price and its row number for each ticker.
Dense_rank
It is same as the rank() function but the difference is if any duplicate value is present then the rank will not be skipped for the subsequent rows. Each unique value will get the ranks in a sequence.
The below query will rank the closing prices of the stock for each ticker. The same you can see in the below screenshot.
Cume_dist
It returns the cumulative distribution of a value. It results from 0 to 1. For suppose if the total number of records are 10 then for the 1st row the cume_dist will be 1/10 and for the second 2/10 and so on till 10/10.
This cume_dist will be calculated in accordance with the result set returned by the over clause. The below query will result in the cumulative of each record for every ticker.
Percent_rank
It returns the percentage rank of each row within the result set of over clause. Percent_rank is calculated in accordance with the rank of the row and the calculation is as follows (rank-1)/(total_rows_in_group – 1). If the result set has only one row then the percent_rank will be 0.
The below query will calculate the percent_rank for every row in each partition and you can see the same in the below screen shot.
Ntile
It returns the bucket number of the particular value. For suppose if you say Ntile(5) then it will create 5 buckets based on the result set of the over clause after that it will place the first 20% of the records in the 1st bucket and so on till 5th bucket.
The below query will create 5 buckets for every ticker and the first 20% records for every ticker will be in the 1st bucket and so on.
In the below screenshot, you can see that 5 buckets will be created for every ticker and the least 20% closing prices will be in the first bucket and the next 20% will be in the second bucket and so on till 5th bucket for all the tickers.
This is how we can perform windowing operations in Hive.
- Windowing Functions In Hive
- hive --Windowing and Analytics Functions
- Hive 窗口函数(Windowing Functions)
- Hive Projects_1. SQL Windowing & 分区表函数 -- 带完善
- Hive之Functions(七)
- storm 1.0新功能--Windowing Support in Core Storm滑动窗口
- hive中使用case、if:一个region统计业务(hive条件函数case、if、COALESCE语法介绍:CONDITIONAL FUNCTIONS IN HIVE)
- some functions in assembly
- Useful functions in PHP
- python build-in functions
- Inline Functions in C++
- Callback Functions in JavaScript
- Functions in R
- Overloading Functions in C
- Built-in Functions
- Python Built-in Functions
- awk built-in functions
- [Built-in Functions] - A
- 扫雷
- Nginx 做JavaWeb负载均衡
- 微服务架构中模块划分和服务识别
- 微信小程序一个你可能需要的功能
- Arduino语音天气预报(一)
- Windowing Functions In Hive
- Spark学习笔记1-RDD编程
- 信数金服决策引擎分享(三):如何实现决策的协同维护?
- 移动web(三)touch事件详解
- C语言实现动态顺序表
- 两个人聪明人的空城——《司马懿之虎啸龙吟》
- 学习笔记—bootstrap(2)
- HTTP方法
- Long类型转String JS丢失