分析函数

来源:互联网 发布:java递归代码 编辑:程序博客网 时间:2024/04/29 10:18

原文地址:

http://www.cnblogs.com/skyEva/p/5730531.html

http://blog.csdn.net/qq_34941023/article/details/51773367

分析函数是什么?

分析函数是Oracle专门用于解决复杂报表统计需求的功能强大的函数,它可以在数据中进行分组然后计算基于组的某种统计值,并且每一组的每一行都可以返回一个统计值。

分析函数和聚合函数的不同之处是什么?

普通的聚合函数用group by分组,每个分组返回一个统计值,而分析函数采用partition by分组,并且每组每行都可以返回一个统计值。

分析函数的形式

分析函数带有一个开窗函数over(),包含三个分析子句:分组(partition by), 排序(order by), 窗口(rows) ,他们的使用形式如下:over(partition by xxx order by yyy rows between zzz)。


-----first_value()与last_value():求最值对应的其他属性

FIRST_VALUE(AREA_CODE)  OVER(PARTITION BY BILL_MONTH ORDER BY SUM(LOCAL_FARE) DESC 
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FIRSTVAL, 

LAST_VALUE(AREA_CODE)  OVER(PARTITION BY BILL_MONTH ORDER BY SUM(LOCAL_FARE) DESC 
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) LASTVAL 


-----rank(),dense_rank()与row_number():求排序

rank()值相同时排名相同,其后排名跳跃不连续

RANK() OVER(PARTITION BY DEPTNO ORDER BY SAL DESC) RW


dense_rank()值相同时排名相同,其后排名连续不跳跃

DENSE_RANK() OVER(PARTITION BY DEPTNO ORDER BY SAL DESC) RW


row_number()值相同时排名不相等,其后排名连续不跳跃

ROW_NUMBER() OVER(PARTITION BY DEPTNO ORDER BY SAL DESC) RW


-----------lag()与lead():求之前或之后的第N行 

lag和lead函数可以在一次查询中取出同一字段的前n行的数据和后n行的值。这种操作可以使用对相同表的表连接来实现,不过使用lag和lead有更高的效率。
lag(arg1,arg2,arg3)
第一个参数是列名,
第二个参数是偏移的offset,
第三个参数是超出记录窗口时的默认值。


举例如下:
SQL> select *  from kkk;                                          
                                                                  
        ID NAME                                                   
---------- --------------------                                   
         1 1name                                                  
         2 2name                                                  
         3 3name                                                  
         4 4name                                                  
         5 5name                                                  
SQL> select id,name,lag(name,1,0) over(order by id) from kkk; 
                                                                  
        ID NAME                 LAG(NAME,1,0)OVER(ORDERBYID)      
---------- -------------------- ----------------------------      
         1 1name                0                                 
         2 2name                1name                             
         3 3name                2name                             
         4 4name                3name                             
         5 5name                4name

SQL> select id,name,lead(name,1,0) over(order by id) from kkk;
                                                                  
        ID NAME                 LEAD(NAME,1,0)OVER(ORDERBYID)     
---------- -------------------- -----------------------------     
         1 1name                2name                             
         2 2name                3name                             
         3 3name                4name                             
         4 4name                5name                             
         5 5name                0

SQL> select id,name,lead(name,2,0) over(order by id) from kkk;                                                                                                               
        ID NAME                 LEAD(NAME,2,0)OVER(ORDERBYID)     
---------- -------------------- -----------------------------     
         1 1name                3name                             
         2 2name                4name                             
         3 3name                5name                             
         4 4name                0                                 
         5 5name                0  
SQL> select id,name,lead(name,1,'linjiqin') over(order by id) from kkk;                                  
                                                                                  
        ID NAME                 LEAD(NAME,1,'ALSDFJLASDJFSAF')                    
---------- -------------------- ------------------------------                    
         1 1name                2name                                             
         2 2name                3name                                             
         3 3name                4name                                             
         4 4name                5name                                             
         5 5name                linjiqin  


-------------rollup()与cube():排列组合分组 

1)、group by rollup(a, b, c):
首先会对(a、b、c)进行group by,
然后再对(a、b)进行group by,
其后再对(a)进行group by,
最后对全表进行汇总操作。

2)、group by cube(a, b, c):
则首先会对(a、b、c)进行group by,
然后依次是(a、b),(a、c),(a),(b、c),(b),(c),
最后对全表进行汇总操作。


-----------max(),min(),sum()与avg():求移动的最值总和与平均值

SUM(LOCAL_FARE) OVER(PARTITION BY AREA_CODE  ORDER BY TO_NUMBER(BILL_MONTH)  RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING) 
AVG(LOCAL_FARE) OVER(PARTITION BY AREA_CODE ORDER BY TO_NUMBER(BILL_MONTH)  RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING) 
MAX(LOCAL_FARE) OVER(PARTITION BY AREA_CODE  ORDER BY TO_NUMBER(BILL_MONTH)   RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING) 
MIN(LOCAL_FARE) OVER(PARTITION BY AREA_CODE ORDER BY TO_NUMBER(BILL_MONTH)  RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING) 


----unbounded preceding and unbouned following针对当前所有记录的前一条、后一条记录,也就是表中的所有记录

--unbounded:不受控制的,无限的

--preceding:在...之前

--following:在...之后

----ROWS BETWEEN unbounded preceding AND current row  是指第一行至当前行的汇总

----ROWS BETWEEN current row AND unbounded following 指当前行到最后一行的汇总

----ROWS BETWEEN 1 preceding AND current row 是指当前行的上一行(rownum-1)到当前行的汇总

----ROWS BETWEEN 1 preceding AND 1 following 是指当前行的上一行(rownum-1)到当前行的下辆行(rownum+2)的汇总


----------累加


SELECT AREA_CODE, 

       BILL_MONTH,

       LOCAL_FARE,

       SUM(LOCAL_FARE)OVER(PARTITIONBY AREA_CODE 

                            ORDERBY BILL_MONTHASC) "last_sum_value" 

  FROM (SELECT T.AREA_CODE, T.BILL_MONTH,SUM(T.LOCAL_FARE) LOCAL_FARE 

          FROM

         GROUPBY T.AREA_CODE, T.BILL_MONTH) 

 ORDERBY AREA_CODE, BILL_MONTH




------------排序

ROW_NUMBER() over (PARTITION byhexuncookieid order by cycookied) order

-------------无需排序给行号

select row_number() over () as rowid, cust_idfrom dms.fund_trans_log2 limit 10;


1 分析函数:用于等级、百分点、n分片等

Ntile 是Hive很强大的一个分析函数。

  • 可以看成是:它把有序的数据集合 平均分配 到 指定的数量(num)个桶中, 将桶号分配给每一行。如果不能平均分配,则优先分配较小编号的桶,并且各个桶中能放的行数最多相差1。
  • 语法是:

     ntile (num)  over ([partition_clause]  order_by_clause)  as your_bucket_num

  •    然后可以根据桶号,选取前或后 n分之几的数据。

例子:

    给了用户和每个用户对应的消费信息表, 计算花费前50%的用户的平均消费;

 


-- 把用户和消费表,按消费下降顺序平均分成2份

drop tableifexists test_by_payment_ntile;

create table test_by_payment_ntileas

select 

      nick, 

      payment ,

      NTILE(2)OVER(ORDERBY payment desc)AS rn 

from test_nick_payment;


-- 分别对每一份计算平均值,就可以得到消费靠前50%和后50%的平均消费

select 

   'avg_payment'as inf,

   t1.avg_payment_up_50 as avg_payment_up_50,

   t2.avg_payment_down_50 as avg_payment_down_50

from

 (select

         avg(payment)as avg_payment_up_50 

  from test_by_payment_ntile 

  where rn=1

)t1

   join

(select 

          avg(payment)as avg_payment_down_50 

 from test_by_payment_ntile 

 where rn=2

)t2

on (t1.dp_id=t2.dp_id);


 

Rank,Dense_Rank, Row_Number

SQL很熟悉的3个组内排序函数了。语法一样:

R()  over  (partion  by  col1...  order  by  col2...  desc/asc)


select 

   class1,

   score,

   rank() over(partitionby class1orderby scoredesc) rk1,

   dense_rank() over(partitionby class1orderby scoredesc) rk2,

   row_number() over(partitionby class1orderby scoredesc) rk3

from zyy_test1;



如上图所示,rank  会对相同数值,输出相同的序号,而且下一个序号不间断;

       dense_rank  会对相同数值,输出相同的序号,但下一个序号,间断

       row_number 会对所有数值输出不同的序号,序号唯一连续;

2. 窗口函数 Lag, Lead, First_value,Last_value

Lag, Lead

LAG(col,n,DEFAULT) 用于统计窗口内往上第n行值

LEAD(col,n,DEFAULT) 用于统计窗口内往下第n行值, 与LAG相反



-- 组内排序后,向后或向前偏移

-- 如果省略掉第三个参数,默认为NULL,否则补上。

select

    dp_id,

    mt,

    payment,

    LAG(mt,2)over(partitionby dp_idorderby mt) mt_new

from test2;



 



-- 组内排序后,向后或向前偏移

-- 如果省略掉第三个参数,默认为NULL,否则补上。

select

   dp_id,

   mt,

   payment,

   LEAD(mt,2,'1111-11')over(partitionby dp_idorderby mt) mt_new

from test2;



 

FIRST_VALUE, LAST_VALUE

first_value:  取分组内排序后,截止到当前行,第一个值

last_value:  取分组内排序后,截止到当前行,最后一个值


-- FIRST_VALUE      获得组内当前行往前的首个值

-- LAST_VALUE       获得组内当前行往前的最后一个值

-- FIRST_VALUE(DESC) 获得组内全局的最后一个值

select

   dp_id,

   mt,

   payment,

   FIRST_VALUE(payment) over(partitionby dp_idorderby mt) payment_g_first,

   LAST_VALUE(payment) over(partitionby dp_idorderby mt) payment_g_last,

  FIRST_VALUE(payment) over(partitionby dp_idorderby mtdesc) payment_g_last_global

from test2

ORDER BY dp_id,mt;


####

Windowing and Analytics Functions


  • Windowing and Analytics Functions
    • Enhancements to Hive QL
    • Examples
      • PARTITION BY with one partitioning column, no ORDER BY or window specification
      • PARTITION BY with two partitioning columns, no ORDER BY or window specification
      • PARTITION BY with one partitioning column, one ORDER BY column, and no window specification
      • PARTITION BY with two partitioning columns, two ORDER BY columns, and no window specification
      • PARTITION BY with partitioning, ORDER BY, and window specification
      • WINDOW clause
      • LEAD using default 1 row lead and not specifying default value
      • LAG specifying a lag of 3 rows and default value of 0
      • Distinct counting for each partition


Enhancements to Hive QL

Version


Introduced in Hive version 0.11.

This section introduces the Hive QL enhancements for windowing and analytics functions. See "Windowing Specifications in HQL" (attached to HIVE-4197) for details. HIVE-896 has more information, including links to earlier documentation in the initial comments.

All of the windowing and analytics functions operate as per the SQL standard.

The current release supports the following functions for windowing and analytics:

  1. Windowing functions
    • LEAD
      • The number of rows to lead can optionally be specified. If the number of rows to lead is not specified, the lead is one row.
      • Returns null when the lead for the current row extends beyond the end of the window.
    • LAG
      • The number of rows to lag can optionally be specified. If the number of rows to lag is not specified, the lag is one row.
      • Returns null when the lag for the current row extends before the beginning of the window.
    • FIRST_VALUE
    • LAST_VALUE
  2. The OVER clause
    • OVER with standard aggregates:
      • COUNT
      • SUM
      • MIN
      • MAX
      • AVG
    • OVER with a PARTITION BY statement with one or more partitioning columns of any primitive datatype.
    • OVER with PARTITION BY and ORDER BY with one or more partitioning and/or ordering columns of any datatype.
      • OVER with a window specification. Windows can be defined separately in a WINDOW clause. Window specifications support the following formats:
        (ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
      • (ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
      • (ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING


        When ORDER BY is specified with missing WINDOW clause, the WINDOW specification defaults to 
        RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
        When both ORDER BY and WINDOW clauses are missing, the WINDOW specification defaults to ROW BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.

        The OVER clause supports the following functions, but it does not support a window with them (seeHIVE-4797):
        Ranking functions: Rank, NTile, DenseRank, CumeDist, PercentRank.
        Lead and Lag functions.

  3. Analytics functions
    • RANK
    • ROW_NUMBER
    • DENSE_RANK
    • CUME_DIST
    • PERCENT_RANK
    • NTILE
  4. Distinct support in Hive 2.1.0 and later (seeHIVE-9534)
    Distinct is supported for aggregation functions including SUM, COUNT and AVG, which aggregate over the distinct values within each partition. Current implementation has the limitation that no ORDER BY or window specification can be supported in the partitioning clause for performance reason. The supported syntax is as follows.
  5. COUNT(DISTINCTa) OVER (PARTITION BYc)




  6. ORDER BY and window specification is supported for distinct in Hive 2.2.0 (see HIVE-13453). An example is as follows.
  7. COUNT(DISTINCTa) OVER (PARTITION BYc ORDERBY dROWS BETWEEN1 PRECEDING AND1 FOLLOWING)



  8. Aggregate functions in OVER clause support in Hive 2.1.0 and later (see HIVE-13475)
    Support to reference aggregate functions within the OVER clause has been added. For instance, currently we can use the SUM aggregation function within the OVER clause as follows.
  9. SELECTrank() OVER (ORDERBY sum(b))
    FROM T
    GROUP BYa;



  10. Examples

    This section provides examples of how to use the Hive QL windowing and analytics functions in SELECT statements. SeeHIVE-896 for additional examples.

    PARTITION BY with one partitioning column, no ORDER BY or window specification

    SELECTa, COUNT(b) OVER (PARTITIONBY c)

    FROM T;

    PARTITION BY with two partitioning columns, no ORDER BY or window specification

    SELECTa, COUNT(b) OVER (PARTITIONBY c, d)

    FROM T;

    PARTITION BY with one partitioning column, one ORDER BY column, and no window specification

    SELECTa, SUM(b) OVER (PARTITIONBY cORDER BYd)

    FROM T;

    PARTITION BY with two partitioning columns, two ORDER BY columns, and no window specification

    SELECTa, SUM(b) OVER (PARTITIONBY c, dORDER BYe, f)

    FROM T;

    PARTITION BY with partitioning, ORDER BY, and window specification

    SELECTa, SUM(b) OVER (PARTITIONBY cORDER BYd ROWSBETWEEN UNBOUNDED PRECEDINGAND CURRENTROW)

    FROM T;

    SELECTa, AVG(b) OVER (PARTITIONBY cORDER BYd ROWSBETWEEN 3 PRECEDINGAND CURRENTROW)

    FROM T;

    SELECTa, AVG(b) OVER (PARTITIONBY cORDER BYd ROWSBETWEEN 3 PRECEDINGAND 3 FOLLOWING)

    FROM T;

    SELECTa, AVG(b) OVER (PARTITIONBY cORDER BYd ROWSBETWEEN CURRENTROW ANDUNBOUNDED FOLLOWING)

    FROM T;

     

    There can be multiple OVER clauses in a single query. A single OVER clause only applies to the immediately preceding function call. In this example, the first OVER clause applies to COUNT(b) and the second OVER clause applies to SUM(b):

    SELECT 

     a,

     COUNT(b) OVER (PARTITIONBY c),

     SUM(b) OVER (PARTITIONBY c)

    FROM T;

     

    Aliases can be used as well, with or without the keyword AS:

    SELECT 

     a,

     COUNT(b) OVER (PARTITIONBY c)AS b_count,

     SUM(b) OVER (PARTITIONBY c) b_sum

    FROM T;

    WINDOW clause

    SELECTa, SUM(b) OVER w

    FROM T

    WINDOW w AS(PARTITION BYc ORDERBY dROWS UNBOUNDED PRECEDING);

    LEAD using default 1 row lead and not specifying default value

    SELECTa, LEAD(a) OVER (PARTITION BYb ORDERBY CROWS BETWEENCURRENT ROW AND 1 FOLLOWING)

    FROM T;

    LAG specifying a lag of 3 rows and default value of 0

    SELECTa, LAG(a, 3, 0) OVER (PARTITION BYb ORDERBY CROWS 3 PRECEDING)

    FROM T;

    Distinct counting for each partition

    SELECTa, COUNT(distincta) OVER (PARTITION BYb)

    FROM T;

    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/"> <rdf:Description rdf:about="https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics" dc:identifier="https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics" dc:title="LanguageManual WindowingAndAnalytics" trackback:ping="https://cwiki.apache.org/confluence/rpc/trackback/31819589"/> </rdf:RDF>

    4 people like this


0 0
原创粉丝点击