大数据学习24：hive_sql_累计求和小案例

来源：互联网发布：游戏里被广为人知的梗编辑：程序博客网时间：2024/06/01 09:20

环境：
hive

需求：
输入数据，每天用户的流量分隔符 \t

user    date    accesspanda   2017-1-01   5tgifshow 2017-1-01   3tyy  2017-1-01   2tlaifeng 2017-1-01   2tpanda   2017-1-02   5tgifshow 2017-1-02   3tyy  2017-1-02   2tlaifeng 2017-1-02   2tpanda   2017-2-01   4tgifshow 2017-2-01   3tyy  2017-2-01   1tlaifeng 2017-2-01   4tpanda   2017-2-02   4tgifshow 2017-2-02   3tyy  2017-2-02   1tlaifeng 2017-2-02   4tpanda   2017-3-01   4tgifshow 2017-3-01   3tyy  2017-3-01   1tlaifeng 2017-3-01   4tpanda   2017-3-02   4tgifshow 2017-3-02   3tyy  2017-3-02   1tlaifeng 2017-3-02   4t=============>求出下表，每个用户按月统计当月数值，并新增按月累加字段user    date    acc     acc_sumgifshow 2017-1  6       6gifshow 2017-2  6       12gifshow 2017-3  6       18laifeng 2017-1  4       4laifeng 2017-2  8       12laifeng 2017-3  8       20panda   2017-1  10      10panda   2017-2  8       18panda   2017-3  8       26yy      2017-1  4       4yy      2017-2  2       6yy      2017-3  2       8

分析：
由于原输入数据为文本，所以从本地导入hive 建 shipin_origin 的时候，hive表中字段均为String。
那么结果需要按月统计，则必须对 date 进行 substr() 。
在本案例中，有两种方式实现这个功能。
1)通过 hive 自带的函数 sum()over() 进行求解，效果好，时间短
2)通过标准 sql 进行 inner join 进行求解，效果差，但是更能理解inner join 和实现原理

操作：
方法一：
1、先在hive上建表，并load数据

create table shipin_origin (user String,date String,access String)ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';LOAD DATA LOCAL INPATH '/root/data/shipin_origin' OVERWRITE INTO TABLE shipin_origin;

2、这个查询是用来将原始表进行按月累计，形成一个每个月总量的中间表

select  user,substr(date,1,6) c1 ,sum(cast(substr(access,1,1) as INT )) c2 from shipin_origin  a group by user,substr(date,1,6)结果：gifshow 2017-1  6gifshow 2017-2  6gifshow 2017-3  6laifeng 2017-1  4laifeng 2017-2  8laifeng 2017-3  8panda   2017-1  10panda   2017-2  8panda   2017-3  8yy      2017-1  4yy      2017-2  2yy      2017-3  2

3、通过这个中间表，采用sum()over() 函数

select A.user ,A.c1 ,A.c2 ,sum(A.c2)over(partition by A.user order by A.c1 ) acc_sum from (select  user,substr(date,1,6) c1 ,sum(cast(substr(access,1,1) as INT )) c2 from shipin_origin  a group by user,substr(date,1,6)) A;结果：gifshow 2017-1  6       6gifshow 2017-2  6       12gifshow 2017-3  6       18laifeng 2017-1  4       4laifeng 2017-2  8       12laifeng 2017-3  8       20panda   2017-1  10      10panda   2017-2  8       18panda   2017-3  8       26yy      2017-1  4       4yy      2017-2  2       6yy      2017-3  2       8Time taken: 23.988 seconds, Fetched: 12 row(s)

注意：在上面这个sql中，因为使用了嵌套查询，一定要把嵌套内的查询结果进行别名处理，否则会出现以下的报错：

FAILED: ParseException line 3:53 cannot recognize input near '<EOF>' '<EOF>' '<EOF>' in subquery source

方法二：
1、先在hive上建表，并load数据

create table shipin_origin (user String,date String,access String)ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';LOAD DATA LOCAL INPATH '/root/data/shipin_origin' OVERWRITE INTO TABLE shipin_origin;

2、同样将原始表进行按月累计，形成一个每个月总量的中间表，但是这里我们创建一个中间表去存储
（为什么这里进行了中间表存储，是为了简便sql写法，不建表其实也可以，但是逻辑上面要注意！！）

hive (default)>create table mid_shipin ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' as select  user,substr(date,1,6) c1 ,sum (cast(substr(access,1,1) as INT ))  c2 from shipin_origin  a group by user,substr(date,1,6) ; 结果：gifshow 2017-1  6gifshow 2017-2  6gifshow 2017-3  6laifeng 2017-1  4laifeng 2017-2  8laifeng 2017-3  8panda   2017-1  10panda   2017-2  8panda   2017-3  8yy      2017-1  4yy      2017-2  2yy      2017-3  2hive (default)> desc mid_shipin;OKuser                    string                                      c1                      string                                      c2                      bigint                                      Time taken: 0.326 seconds, Fetched: 3 row(s)

3、开始对这个表进行操作，并获得结果：

select A.user,A.c1,max(A.c2) c2 ,sum(B.c2) accumulate from (select user,c1,sum(c2) c2  from  mid_shipin group by user,c1)  A inner join (select user,c1,sum(c2)  c2  from  mid_shipin group by user,c1 ) B on A.user = B.userwhere B.c1<=A.c1group by A.user,A.c1order by A.user,A.c1====>结果：gifshow 2017-1  6       6gifshow 2017-2  6       12gifshow 2017-3  6       18laifeng 2017-1  4       4laifeng 2017-2  8       12laifeng 2017-3  8       20panda   2017-1  10      10panda   2017-2  8       18panda   2017-3  8       26yy      2017-1  4       4yy      2017-2  2       6yy      2017-3  2       8Time taken: 114.184 seconds, Fetched: 12 row(s)

这个过程中，用到了4对 mapreduce 和 1次 map 时间消耗很大！

过程分析：

#为了测试，放到mysql 里面看sql逻辑create table sp1 (user varchar(10),date varchar(10),acc int);insert into  sp1 values ('gifshow', '2017-1',  6    );insert into  sp1 values ('gifshow', '2017-2' , 6    );insert into  sp1 values ('gifshow', '2017-3' , 6    );insert into  sp1 values ('laifeng', '2017-1' , 4    );insert into  sp1 values ('laifeng', '2017-2' , 8    );insert into  sp1 values ('laifeng', '2017-3' , 8    );insert into  sp1 values ('panda' ,  '2017-1' , 10   );insert into  sp1 values ('panda' ,  '2017-2' , 8    );insert into  sp1 values ('panda' ,  '2017-3' , 8    );insert into  sp1 values ('yy'    ,  '2017-1' , 4    );insert into  sp1 values ('yy'    ,  '2017-2' , 2    );insert into  sp1 values ('yy'    ,  '2017-3' , 2    );

select * from (select user,date ,acc from sp1)  A inner join (select user,date ,acc from sp1) B on A.user = B.user ;

注意后面带 * 的行

+---------+--------+------+---------+--------+------+| user    | date   | acc  | user    | date   | acc  |+---------+--------+------+---------+--------+------+| gifshow | 2017-1 |    6 | gifshow | 2017-1 |    6 || gifshow | 2017-1 |    6 | gifshow | 2017-2 |    6 |*| gifshow | 2017-1 |    6 | gifshow | 2017-3 |    6 |*| gifshow | 2017-2 |    6 | gifshow | 2017-1 |    6 || gifshow | 2017-2 |    6 | gifshow | 2017-2 |    6 || gifshow | 2017-2 |    6 | gifshow | 2017-3 |    6 |*| gifshow | 2017-3 |    6 | gifshow | 2017-1 |    6 || gifshow | 2017-3 |    6 | gifshow | 2017-2 |    6 || gifshow | 2017-3 |    6 | gifshow | 2017-3 |    6 || laifeng | 2017-1 |    4 | laifeng | 2017-1 |    4 || laifeng | 2017-1 |    4 | laifeng | 2017-2 |    8 |*| laifeng | 2017-1 |    4 | laifeng | 2017-3 |    8 |*| laifeng | 2017-2 |    8 | laifeng | 2017-1 |    4 || laifeng | 2017-2 |    8 | laifeng | 2017-2 |    8 || laifeng | 2017-2 |    8 | laifeng | 2017-3 |    8 |*| laifeng | 2017-3 |    8 | laifeng | 2017-1 |    4 || laifeng | 2017-3 |    8 | laifeng | 2017-2 |    8 || laifeng | 2017-3 |    8 | laifeng | 2017-3 |    8 || panda   | 2017-1 |   10 | panda   | 2017-1 |   10 || panda   | 2017-1 |   10 | panda   | 2017-2 |    8 |*| panda   | 2017-1 |   10 | panda   | 2017-3 |    8 |*| panda   | 2017-2 |    8 | panda   | 2017-1 |   10 || panda   | 2017-2 |    8 | panda   | 2017-2 |    8 || panda   | 2017-2 |    8 | panda   | 2017-3 |    8 |*| panda   | 2017-3 |    8 | panda   | 2017-1 |   10 || panda   | 2017-3 |    8 | panda   | 2017-2 |    8 || panda   | 2017-3 |    8 | panda   | 2017-3 |    8 || yy      | 2017-1 |    4 | yy      | 2017-3 |    2 |*| yy      | 2017-1 |    4 | yy      | 2017-1 |    4 || yy      | 2017-1 |    4 | yy      | 2017-2 |    2 |*| yy      | 2017-2 |    2 | yy      | 2017-3 |    2 |*| yy      | 2017-2 |    2 | yy      | 2017-1 |    4 || yy      | 2017-2 |    2 | yy      | 2017-2 |    2 || yy      | 2017-3 |    2 | yy      | 2017-3 |    2 || yy      | 2017-3 |    2 | yy      | 2017-1 |    4 || yy      | 2017-3 |    2 | yy      | 2017-2 |    2 |+---------+--------+------+---------+--------+------+36 rows in set (0.00 sec)

mysql> select * from (select user,date ,acc from sp1)  A inner join (select user,date ,acc from sp1) B on A.user = B.user where B.date <= A.date ;

上表后缀星号被过滤,采用inner join 的方式，过滤掉不用累加的行！！

+---------+--------+------+---------+--------+------+| user    | date   | acc  | user    | date   | acc  |+---------+--------+------+---------+--------+------+| gifshow | 2017-1 |    6 | gifshow | 2017-1 |    6 || gifshow | 2017-2 |    6 | gifshow | 2017-1 |    6 || gifshow | 2017-2 |    6 | gifshow | 2017-2 |    6 || gifshow | 2017-3 |    6 | gifshow | 2017-1 |    6 || gifshow | 2017-3 |    6 | gifshow | 2017-2 |    6 || gifshow | 2017-3 |    6 | gifshow | 2017-3 |    6 || laifeng | 2017-1 |    4 | laifeng | 2017-1 |    4 || laifeng | 2017-2 |    8 | laifeng | 2017-1 |    4 || laifeng | 2017-2 |    8 | laifeng | 2017-2 |    8 || laifeng | 2017-3 |    8 | laifeng | 2017-1 |    4 || laifeng | 2017-3 |    8 | laifeng | 2017-2 |    8 || laifeng | 2017-3 |    8 | laifeng | 2017-3 |    8 || panda   | 2017-1 |   10 | panda   | 2017-1 |   10 || panda   | 2017-2 |    8 | panda   | 2017-1 |   10 || panda   | 2017-2 |    8 | panda   | 2017-2 |    8 || panda   | 2017-3 |    8 | panda   | 2017-1 |   10 || panda   | 2017-3 |    8 | panda   | 2017-2 |    8 || panda   | 2017-3 |    8 | panda   | 2017-3 |    8 || yy      | 2017-1 |    4 | yy      | 2017-1 |    4 || yy      | 2017-2 |    2 | yy      | 2017-1 |    4 || yy      | 2017-2 |    2 | yy      | 2017-2 |    2 || yy      | 2017-3 |    2 | yy      | 2017-1 |    4 || yy      | 2017-3 |    2 | yy      | 2017-2 |    2 || yy      | 2017-3 |    2 | yy      | 2017-3 |    2 |+---------+--------+------+---------+--------+------+24 rows in set (0.00 sec)

按照user,date 进行分类，用 max 取出 acc (因为每个月的都一样) ,用sum 进行累加。

mysql> select A.user,A.date,max(a.acc) acc_month, sum(B.acc) acc_sum from (select user,date ,acc from sp1) A inner join (select user,date ,acc from sp1) B on A.user = B.user where B.date <= A.date group by A.user,A.date order by A.user,A.date;

+---------+--------+-----------+---------+| user    | date   | acc_month | acc_sum |+---------+--------+-----------+---------+| gifshow | 2017-1 |         6 |       6 || gifshow | 2017-2 |         6 |      12 || gifshow | 2017-3 |         6 |      18 || laifeng | 2017-1 |         4 |       4 || laifeng | 2017-2 |         8 |      12 || laifeng | 2017-3 |         8 |      20 || panda   | 2017-1 |        10 |      10 || panda   | 2017-2 |         8 |      18 || panda   | 2017-3 |         8 |      26 || yy      | 2017-1 |         4 |       4 || yy      | 2017-2 |         2 |       6 || yy      | 2017-3 |         2 |       8 |+---------+--------+-----------+---------+12 rows in set (0.00 sec)

sql分析结束！

阅读全文

0 0