Windowing Functions In Hive

来源:互联网 发布:java线程状态 编辑:程序博客网 时间:2024/06/05 20:45

感谢分享原文:https://acadgild.com/blog/windowing-functions-in-hive/  

另外可参考:

http://www.cnblogs.com/skyEva/p/5730531.html

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics#LanguageManualWindowingAndAnalytics-WINDOWclause


Windowing allows you to create a window on a set of data further allowing aggregation surrounding that data. Windowing in Hive is introduced from Hive 0.11. In this blog, we will be giving a demo on the windowing functions available in Hive.

Windowing in Hive includes the following functions

  • Lead

    • The number of rows to lead can optionally be specified. If the number of rows to lead is not specified, the lead is one row.

    • Returns null when the lead for the current row extends beyond the end of the window.

  • Lag

    The number of rows to lag can optionally be specified. If the number of rows to lag is not specified, the lag is one row.

    Returns null when the lag for the current row extends before the beginning of the window.

  • FIRST_VALUE

  • LAST_VALUE

The OVER clause

  • OVER with standard aggregates:
    • COUNT
    • SUM
    • MIN
    • MAX
    • AVG

OVER with a PARTITION BY statement with one or more partitioning columns.

  • OVER with PARTITION BY and ORDER BY with one or more partitioning and/or ordering columns.

Analytics functions

  • RANK
  • ROW_NUMBER
  • DENSE_RANK
  • CUME_DIST
  • PERCENT_RANK
  • NTILE

To give you a brief idea of these windowing functions in Hive, we will be using stock market data. You can download the sample stocks data from here and load into your stocks table.

Now we will create a table to load this stock market data as shown below.

Let us dive deeper into the window functions in Hive.

Lag

This function returns the values of the previous row. You can specify an integer offset which designates the row position else it will take the default integer offset as 1.

Here is the sample function for lag

Here using lag we can display the yesterday’s closing price of the ticker. Lag is to be used with over function, inside the over function you can use partition or order by classes.

In the below screenshot, you can see the closing price of the stock for the day and the yesterday’s price.

Lead

This function returns the values from the following rows. You can specify an integer offset which designates the row position else it will take the default integer offset as 1.

Here is the sample function for lead

Now using the lead function, we will find that whether the following day’s closing price is higher or lesser than today’s and that can be done as follows.

In the below screenshot, you can see the result.

FIRST_VALUE

It returns the value of the first row from that window. With the below query, you can see the first row high price of the ticker for all the days.

LAST_VALUE

It is the reverse of FIRST_VALUE. It returns the value of the last row from that window. With the below query, you can see the last row high price value of the ticker for all the days.

Let us now see the usage of the aggregate function using Over.

Count

It returns the count of all the values for the expression written in the over clause. From the below query, we can find the number of rows present for each ticker.

For each partition, the count of ticker will be calculated, you can see the same in the below screen shot.

Sum

It returns the sum of all the values for the expression written in the over clause. From the below query, we can find the sum of all the closing stock prices for that particular ticker.

For each ticker, the sum of all the closing prices will be calculated, you can see the same in the below screen shot.

select
dim,
sum(num) over(partition by dim) as total,
sum(num) over(partition by dim,dim_2) as total_2
from
(
select
'A' as dim,
'AA' as dim_2,
1 as num
union all
select
'A' as dim,
'CC' as dim_2,
1 as num
union all
select
'B' as dim,
'BB' as dim_2, 
1 as num

)a

dim1dim2num输出-->dim1totaltotal_2AAA1 A21ACC1 A21BBB1 A11


Finding running total

For suppose let us take if you want to get running total of the volume_for_the_day for all the days for every ticker then you can do this with the below query.

In the above screenshot, you can see the volume_for_the_day for each day and the running total is the sum of volume_for_the_day’s that are elapsed.

Finding the percentage of each row value

Now let’s take a scenario where you need to find the percentage of the volume_for_the_day on the total volumes for that particular ticker and that can be done as follows.

In the above screenshot, you can see that the percentage contribution of the volumes for the day is found based on the total volume for that ticker.

Min

It returns the minimum value of the column for the rows in that over clause. From the below query, we can find the minimum closing stock price for each particular ticker.

Max

It returns the maximum value of the column for the rows in that over clause. From the below query, we can find the maximum closing stock price for each particular ticker.

AVG

It returns the average value of the column for the rows that over clause returns. From the below query, we can find the average closing stock price for each particular ticker.

Now let us work on some Analytic functions.

Rank

The rank function will return the rank of the values as per the result set of the over clause. If two values are same then it will give the same rank to those 2 values and then for the next value, the sub-sequent rank will be skipped.

The below query will rank the closing prices of the stock for each ticker. The same you can see in the below screenshot.

Row_number

Row number will return the continuous sequence of numbers for all the rows of the result set of the over clause.

From the below query, you will get the ticker, closing price and its row number for each ticker.

Dense_rank

It is same as the rank() function but the difference is if any duplicate value is present then the rank will not be skipped for the subsequent rows. Each unique value will get the ranks in a sequence.

The below query will rank the closing prices of the stock for each ticker. The same you can see in the below screenshot.

Cume_dist

It returns the cumulative distribution of a value. It results from 0 to 1. For suppose if the total number of records are 10 then for the 1st row the cume_dist will be 1/10 and for the second 2/10 and so on till 10/10.

This cume_dist will be calculated in accordance with the result set returned by the over clause. The below query will result in the cumulative of each record for every ticker.

Percent_rank

It returns the percentage rank of each row within the result set of over clause. Percent_rank is calculated in accordance with the rank of the row and the calculation is as follows (rank-1)/(total_rows_in_group – 1). If the result set has only one row then the percent_rank will be 0.

The below query will calculate the percent_rank for every row in each partition and you can see the same in the below screen shot.

Ntile

It returns the bucket number of the particular value. For suppose if you say Ntile(5) then it will create 5 buckets based on the result set of the over clause after that it will place the first 20% of the records in the 1st bucket and so on till 5th bucket.

The below query will create 5 buckets for every ticker and the first 20% records for every ticker will be in the 1st bucket and so on.

In the below screenshot, you can see that 5 buckets will be created for every ticker and the least 20% closing prices will be in the first bucket and the next 20% will be in the second bucket and so on till 5th bucket for all the tickers.

This is how we can perform windowing operations in Hive.


原创粉丝点击