大数据学习24:hive_sql_累计求和小案例
来源:互联网 发布:游戏里被广为人知的梗 编辑:程序博客网 时间:2024/06/01 09:20
环境:
hive
需求:
输入数据,每天用户的流量 分隔符 \t
user date accesspanda 2017-1-01 5tgifshow 2017-1-01 3tyy 2017-1-01 2tlaifeng 2017-1-01 2tpanda 2017-1-02 5tgifshow 2017-1-02 3tyy 2017-1-02 2tlaifeng 2017-1-02 2tpanda 2017-2-01 4tgifshow 2017-2-01 3tyy 2017-2-01 1tlaifeng 2017-2-01 4tpanda 2017-2-02 4tgifshow 2017-2-02 3tyy 2017-2-02 1tlaifeng 2017-2-02 4tpanda 2017-3-01 4tgifshow 2017-3-01 3tyy 2017-3-01 1tlaifeng 2017-3-01 4tpanda 2017-3-02 4tgifshow 2017-3-02 3tyy 2017-3-02 1tlaifeng 2017-3-02 4t=============>求出下表,每个用户按月统计当月数值,并新增按月累加字段user date acc acc_sumgifshow 2017-1 6 6gifshow 2017-2 6 12gifshow 2017-3 6 18laifeng 2017-1 4 4laifeng 2017-2 8 12laifeng 2017-3 8 20panda 2017-1 10 10panda 2017-2 8 18panda 2017-3 8 26yy 2017-1 4 4yy 2017-2 2 6yy 2017-3 2 8
分析:
由于原输入数据为文本,所以从本地导入hive 建 shipin_origin 的时候,hive表中字段均为String。
那么结果需要按月统计,则必须对 date 进行 substr() 。
在本案例中,有两种方式实现这个功能。
1)通过 hive 自带的函数 sum()over() 进行求解,效果好,时间短
2)通过 标准 sql 进行 inner join 进行求解,效果差,但是更能理解inner join 和 实现原理
操作:
方法一:
1、先在hive上建表,并load数据
create table shipin_origin (user String,date String,access String)ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';LOAD DATA LOCAL INPATH '/root/data/shipin_origin' OVERWRITE INTO TABLE shipin_origin;
2、这个查询是用来将原始表进行按月累计,形成一个每个月总量的中间表
select user,substr(date,1,6) c1 ,sum(cast(substr(access,1,1) as INT )) c2 from shipin_origin a group by user,substr(date,1,6)结果:gifshow 2017-1 6gifshow 2017-2 6gifshow 2017-3 6laifeng 2017-1 4laifeng 2017-2 8laifeng 2017-3 8panda 2017-1 10panda 2017-2 8panda 2017-3 8yy 2017-1 4yy 2017-2 2yy 2017-3 2
3、通过这个中间表,采用sum()over() 函数
select A.user ,A.c1 ,A.c2 ,sum(A.c2)over(partition by A.user order by A.c1 ) acc_sum from (select user,substr(date,1,6) c1 ,sum(cast(substr(access,1,1) as INT )) c2 from shipin_origin a group by user,substr(date,1,6)) A;结果:gifshow 2017-1 6 6gifshow 2017-2 6 12gifshow 2017-3 6 18laifeng 2017-1 4 4laifeng 2017-2 8 12laifeng 2017-3 8 20panda 2017-1 10 10panda 2017-2 8 18panda 2017-3 8 26yy 2017-1 4 4yy 2017-2 2 6yy 2017-3 2 8Time taken: 23.988 seconds, Fetched: 12 row(s)
注意:在上面这个sql中,因为使用了嵌套查询,一定要把嵌套内的查询结果进行别名处理,否则会出现以下的报错:
FAILED: ParseException line 3:53 cannot recognize input near '<EOF>' '<EOF>' '<EOF>' in subquery source
方法二:
1、先在hive上建表,并load数据
create table shipin_origin (user String,date String,access String)ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';LOAD DATA LOCAL INPATH '/root/data/shipin_origin' OVERWRITE INTO TABLE shipin_origin;
2、同样将原始表进行按月累计,形成一个每个月总量的中间表,但是这里我们创建一个中间表去存储
(为什么这里进行了中间表存储,是为了简便sql写法,不建表其实也可以,但是逻辑上面要注意!!)
hive (default)>create table mid_shipin ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' as select user,substr(date,1,6) c1 ,sum (cast(substr(access,1,1) as INT )) c2 from shipin_origin a group by user,substr(date,1,6) ; 结果:gifshow 2017-1 6gifshow 2017-2 6gifshow 2017-3 6laifeng 2017-1 4laifeng 2017-2 8laifeng 2017-3 8panda 2017-1 10panda 2017-2 8panda 2017-3 8yy 2017-1 4yy 2017-2 2yy 2017-3 2hive (default)> desc mid_shipin;OKuser string c1 string c2 bigint Time taken: 0.326 seconds, Fetched: 3 row(s)
3、开始对这个表进行操作,并获得结果:
select A.user,A.c1,max(A.c2) c2 ,sum(B.c2) accumulate from (select user,c1,sum(c2) c2 from mid_shipin group by user,c1) A inner join (select user,c1,sum(c2) c2 from mid_shipin group by user,c1 ) B on A.user = B.userwhere B.c1<=A.c1group by A.user,A.c1order by A.user,A.c1====>结果:gifshow 2017-1 6 6gifshow 2017-2 6 12gifshow 2017-3 6 18laifeng 2017-1 4 4laifeng 2017-2 8 12laifeng 2017-3 8 20panda 2017-1 10 10panda 2017-2 8 18panda 2017-3 8 26yy 2017-1 4 4yy 2017-2 2 6yy 2017-3 2 8Time taken: 114.184 seconds, Fetched: 12 row(s)
这个过程中,用到了4对 mapreduce 和 1次 map 时间消耗很大!
过程分析:
#为了测试,放到mysql 里面看sql逻辑create table sp1 (user varchar(10),date varchar(10),acc int);insert into sp1 values ('gifshow', '2017-1', 6 );insert into sp1 values ('gifshow', '2017-2' , 6 );insert into sp1 values ('gifshow', '2017-3' , 6 );insert into sp1 values ('laifeng', '2017-1' , 4 );insert into sp1 values ('laifeng', '2017-2' , 8 );insert into sp1 values ('laifeng', '2017-3' , 8 );insert into sp1 values ('panda' , '2017-1' , 10 );insert into sp1 values ('panda' , '2017-2' , 8 );insert into sp1 values ('panda' , '2017-3' , 8 );insert into sp1 values ('yy' , '2017-1' , 4 );insert into sp1 values ('yy' , '2017-2' , 2 );insert into sp1 values ('yy' , '2017-3' , 2 );
select * from (select user,date ,acc from sp1) A inner join (select user,date ,acc from sp1) B on A.user = B.user ;
注意后面带 * 的行
+---------+--------+------+---------+--------+------+| user | date | acc | user | date | acc |+---------+--------+------+---------+--------+------+| gifshow | 2017-1 | 6 | gifshow | 2017-1 | 6 || gifshow | 2017-1 | 6 | gifshow | 2017-2 | 6 |*| gifshow | 2017-1 | 6 | gifshow | 2017-3 | 6 |*| gifshow | 2017-2 | 6 | gifshow | 2017-1 | 6 || gifshow | 2017-2 | 6 | gifshow | 2017-2 | 6 || gifshow | 2017-2 | 6 | gifshow | 2017-3 | 6 |*| gifshow | 2017-3 | 6 | gifshow | 2017-1 | 6 || gifshow | 2017-3 | 6 | gifshow | 2017-2 | 6 || gifshow | 2017-3 | 6 | gifshow | 2017-3 | 6 || laifeng | 2017-1 | 4 | laifeng | 2017-1 | 4 || laifeng | 2017-1 | 4 | laifeng | 2017-2 | 8 |*| laifeng | 2017-1 | 4 | laifeng | 2017-3 | 8 |*| laifeng | 2017-2 | 8 | laifeng | 2017-1 | 4 || laifeng | 2017-2 | 8 | laifeng | 2017-2 | 8 || laifeng | 2017-2 | 8 | laifeng | 2017-3 | 8 |*| laifeng | 2017-3 | 8 | laifeng | 2017-1 | 4 || laifeng | 2017-3 | 8 | laifeng | 2017-2 | 8 || laifeng | 2017-3 | 8 | laifeng | 2017-3 | 8 || panda | 2017-1 | 10 | panda | 2017-1 | 10 || panda | 2017-1 | 10 | panda | 2017-2 | 8 |*| panda | 2017-1 | 10 | panda | 2017-3 | 8 |*| panda | 2017-2 | 8 | panda | 2017-1 | 10 || panda | 2017-2 | 8 | panda | 2017-2 | 8 || panda | 2017-2 | 8 | panda | 2017-3 | 8 |*| panda | 2017-3 | 8 | panda | 2017-1 | 10 || panda | 2017-3 | 8 | panda | 2017-2 | 8 || panda | 2017-3 | 8 | panda | 2017-3 | 8 || yy | 2017-1 | 4 | yy | 2017-3 | 2 |*| yy | 2017-1 | 4 | yy | 2017-1 | 4 || yy | 2017-1 | 4 | yy | 2017-2 | 2 |*| yy | 2017-2 | 2 | yy | 2017-3 | 2 |*| yy | 2017-2 | 2 | yy | 2017-1 | 4 || yy | 2017-2 | 2 | yy | 2017-2 | 2 || yy | 2017-3 | 2 | yy | 2017-3 | 2 || yy | 2017-3 | 2 | yy | 2017-1 | 4 || yy | 2017-3 | 2 | yy | 2017-2 | 2 |+---------+--------+------+---------+--------+------+36 rows in set (0.00 sec)
mysql> select * from (select user,date ,acc from sp1) A inner join (select user,date ,acc from sp1) B on A.user = B.user where B.date <= A.date ;
上表后缀星号被过滤,采用inner join 的方式,过滤掉 不用累加的行!!
+---------+--------+------+---------+--------+------+| user | date | acc | user | date | acc |+---------+--------+------+---------+--------+------+| gifshow | 2017-1 | 6 | gifshow | 2017-1 | 6 || gifshow | 2017-2 | 6 | gifshow | 2017-1 | 6 || gifshow | 2017-2 | 6 | gifshow | 2017-2 | 6 || gifshow | 2017-3 | 6 | gifshow | 2017-1 | 6 || gifshow | 2017-3 | 6 | gifshow | 2017-2 | 6 || gifshow | 2017-3 | 6 | gifshow | 2017-3 | 6 || laifeng | 2017-1 | 4 | laifeng | 2017-1 | 4 || laifeng | 2017-2 | 8 | laifeng | 2017-1 | 4 || laifeng | 2017-2 | 8 | laifeng | 2017-2 | 8 || laifeng | 2017-3 | 8 | laifeng | 2017-1 | 4 || laifeng | 2017-3 | 8 | laifeng | 2017-2 | 8 || laifeng | 2017-3 | 8 | laifeng | 2017-3 | 8 || panda | 2017-1 | 10 | panda | 2017-1 | 10 || panda | 2017-2 | 8 | panda | 2017-1 | 10 || panda | 2017-2 | 8 | panda | 2017-2 | 8 || panda | 2017-3 | 8 | panda | 2017-1 | 10 || panda | 2017-3 | 8 | panda | 2017-2 | 8 || panda | 2017-3 | 8 | panda | 2017-3 | 8 || yy | 2017-1 | 4 | yy | 2017-1 | 4 || yy | 2017-2 | 2 | yy | 2017-1 | 4 || yy | 2017-2 | 2 | yy | 2017-2 | 2 || yy | 2017-3 | 2 | yy | 2017-1 | 4 || yy | 2017-3 | 2 | yy | 2017-2 | 2 || yy | 2017-3 | 2 | yy | 2017-3 | 2 |+---------+--------+------+---------+--------+------+24 rows in set (0.00 sec)
按照user,date 进行分类,用 max 取出 acc (因为每个月的都一样) ,用sum 进行累加。
mysql> select A.user,A.date,max(a.acc) acc_month, sum(B.acc) acc_sum from (select user,date ,acc from sp1) A inner join (select user,date ,acc from sp1) B on A.user = B.user where B.date <= A.date group by A.user,A.date order by A.user,A.date;
+---------+--------+-----------+---------+| user | date | acc_month | acc_sum |+---------+--------+-----------+---------+| gifshow | 2017-1 | 6 | 6 || gifshow | 2017-2 | 6 | 12 || gifshow | 2017-3 | 6 | 18 || laifeng | 2017-1 | 4 | 4 || laifeng | 2017-2 | 8 | 12 || laifeng | 2017-3 | 8 | 20 || panda | 2017-1 | 10 | 10 || panda | 2017-2 | 8 | 18 || panda | 2017-3 | 8 | 26 || yy | 2017-1 | 4 | 4 || yy | 2017-2 | 2 | 6 || yy | 2017-3 | 2 | 8 |+---------+--------+-----------+---------+12 rows in set (0.00 sec)
sql分析结束!
阅读全文
0 0
- 大数据学习24:hive_sql_累计求和小案例
- SQL开发--树形层次累计求和案例
- 小编程题之大数据求和
- 存储过程获得小部门月累计-小部门年累计-大部门年累计
- 大数据学习21:sqoop 语法和常用命令和小案例
- oracle 累计求和
- oracle 累计求和
- 递归法累计求和
- oracle累计求和
- 递归累计求和
- 大数据预科班3_小案例
- Sql Server累计求和问题
- Oracle按月份累计求和
- Oracle按日期累计求和
- oracle 下实现累计求和
- 加载数据小案例
- 大数据学习的小概念
- 小白如何学习大数据
- 手慢无 | 年薪30万也离职学的Linux运维课?到底教什么?
- 谷歌员工吐槽:在谷歌工作的烦恼
- centos 7搭建SS服务器代理
- 【VNPY入门教程之一】VNPY开源交易平台WINDOWS下环境配置
- String类为什么是final
- 大数据学习24:hive_sql_累计求和小案例
- rebuild index
- jQuery Validate验证详解
- 简单用宏定义实现max(a,b)
- RMySQL encoding issue on Windows
- RecyclerView关于notifyItemRemoved的那点小事
- #MD5 加密算法及实现
- Elasticsearch 数据搜索篇·【入门级干货】
- 分苹果