sparksql优化1(小表大表关联优化 & union替换or)
来源:互联网 发布:火猫解说 收入 知乎 编辑:程序博客网 时间:2024/06/05 09:45
----原语句(运行18min)
INSERT into TABLE schema.dstable
SELECT bb.ip FROM
(SELECT ip, sum(click) click_num, round(sum(click)/sum(imp),4) user_click_rate FROM schema.srctable1
WHERE date = '20171020' AND ip IS NOT NULL AND imp>0 GROUP BY ip) bb
LEFT OUTER JOIN (SELECT round(sum(click)/sum(imp),4) avg_click_rate FROM schema.srctable1 WHERE date = '20171020') aa
LEFT OUTER JOIN schema.dstable cc on cc.ip = bb.ip
WHERE cc.ip is null AND
(bb.user_click_rate > aa.avg_click_rate * 3 AND click_num > 500) OR (click_num > 1000)
分析:
1、aa表存放的就是一个指标数据,1条记录,列为小表
2、bb表存放的是按ip聚合的明细数据,记录很多,列为大表
3、cc表用来过滤ip,数量也很小,列为过滤表,作用很小。
查看执行计划,发现bb与aa进行left outer join时,引发了shuffle过程,造成大量的磁盘及网络IO,影响性能。
优化方案1:调整大小表位置,将小表放在左边后,提升至29s (该方案一直不太明白为啥会提升,执行计划里显示的也就是大小表位置调换下而已,跟之前的没其他区别)
优化方案2: 将 or 改成 union,提升至35s(各种调整,一直怀疑跟or有关系,后面调整成union其他不变,果真效率不一样;但方案1只是调整了下大小表顺序,并未调整其他,其效率同样提升很大;不太明白sparksql内部到底走了什么优化机制,后面继续研究);
优化方案3: 采用cache+broadcast方式,提升至20s(该方案将小表缓存至内存,进行map侧关联)
----方案2:or 改成 union(运行35s)
INSERT into TABLE schema.dstable
select aa.ip
from
(
SELECT bb.ip ip FROM
(SELECT ip, sum(click) click_num, round(sum(click)/sum(imp),4) user_click_rate FROM schema.srctable1 WHERE date = '20171020' AND ip IS NOT NULL AND imp>0 GROUP BY ip) bb
LEFT OUTER JOIN(SELECT round(sum(click)/sum(imp),4) avg_click_rate FROM schema.srctable1 WHERE date = '20171020') aa
WHERE
(bb.user_click_rate > aa.avg_click_rate * 3 AND click_num > 20)
union
SELECT bb.ip ip FROM
(SELECT ip, sum(click) click_num, round(sum(click)/sum(imp),4) user_click_rate FROM schema.srctable1 WHERE date = '20171020' AND ip IS NOT NULL AND imp>0 GROUP BY ip) bb
LEFT OUTER JOIN (SELECT round(sum(click)/sum(imp),4) avg_click_rate FROM schema.srctable1 WHERE date = '20171020') aa
WHERE click_num > 40
) aa
LEFT OUTER JOIN schema.dstable cc on aa.ip=cc.ip
where cc.ip is null
-----cache+broadcast方式(20s)
原理:使用broadcast将会把小表分发到每台执行节点上,因此,关联操作都在本地完成,基本就取消了shuffle的过程,运行效率大幅度提高。
cache table cta as SELECT round(sum(click)/sum(imp),4) avg_click_rate FROM schema.srctable1 WHERE date = '20171020';
INSERT into TABLE schema.dstable
SELECT bb.ip FROM
(SELECT ip, sum(click) click_num, round(sum(click)/sum(imp),4) user_click_rate FROM schema.srctable1
WHERE date = '20171020' AND ip IS NOT NULL AND imp>0 GROUP BY ip) bb
LEFT OUTER JOIN cta aa
LEFT OUTER JOIN schema.dstable cc on cc.ip = bb.ip
WHERE cc.ip is null AND
(bb.user_click_rate > aa.avg_click_rate * 3 AND click_num > 500) OR (click_num > 1000)
注意:
cache 表不一定会被广播到Executor,执行map side join,还受另外一个参数:spark.sql.autoBroadcastJoinThreshold影响,该参数判断是否将该表广播;
spark.sql.autoBroadcastJoinThreshold参数默认值是10M,所以只有cache的表小于10M的才被广播到Executor上去执行map side join。
INSERT into TABLE schema.dstable
SELECT bb.ip FROM
(SELECT ip, sum(click) click_num, round(sum(click)/sum(imp),4) user_click_rate FROM schema.srctable1
WHERE date = '20171020' AND ip IS NOT NULL AND imp>0 GROUP BY ip) bb
LEFT OUTER JOIN (SELECT round(sum(click)/sum(imp),4) avg_click_rate FROM schema.srctable1 WHERE date = '20171020') aa
LEFT OUTER JOIN schema.dstable cc on cc.ip = bb.ip
WHERE cc.ip is null AND
(bb.user_click_rate > aa.avg_click_rate * 3 AND click_num > 500) OR (click_num > 1000)
分析:
1、aa表存放的就是一个指标数据,1条记录,列为小表
2、bb表存放的是按ip聚合的明细数据,记录很多,列为大表
3、cc表用来过滤ip,数量也很小,列为过滤表,作用很小。
查看执行计划,发现bb与aa进行left outer join时,引发了shuffle过程,造成大量的磁盘及网络IO,影响性能。
优化方案1:调整大小表位置,将小表放在左边后,提升至29s (该方案一直不太明白为啥会提升,执行计划里显示的也就是大小表位置调换下而已,跟之前的没其他区别)
优化方案2: 将 or 改成 union,提升至35s(各种调整,一直怀疑跟or有关系,后面调整成union其他不变,果真效率不一样;但方案1只是调整了下大小表顺序,并未调整其他,其效率同样提升很大;不太明白sparksql内部到底走了什么优化机制,后面继续研究);
优化方案3: 采用cache+broadcast方式,提升至20s(该方案将小表缓存至内存,进行map侧关联)
----方案2:or 改成 union(运行35s)
INSERT into TABLE schema.dstable
select aa.ip
from
(
SELECT bb.ip ip FROM
(SELECT ip, sum(click) click_num, round(sum(click)/sum(imp),4) user_click_rate FROM schema.srctable1 WHERE date = '20171020' AND ip IS NOT NULL AND imp>0 GROUP BY ip) bb
LEFT OUTER JOIN(SELECT round(sum(click)/sum(imp),4) avg_click_rate FROM schema.srctable1 WHERE date = '20171020') aa
WHERE
(bb.user_click_rate > aa.avg_click_rate * 3 AND click_num > 20)
union
SELECT bb.ip ip FROM
(SELECT ip, sum(click) click_num, round(sum(click)/sum(imp),4) user_click_rate FROM schema.srctable1 WHERE date = '20171020' AND ip IS NOT NULL AND imp>0 GROUP BY ip) bb
LEFT OUTER JOIN (SELECT round(sum(click)/sum(imp),4) avg_click_rate FROM schema.srctable1 WHERE date = '20171020') aa
WHERE click_num > 40
) aa
LEFT OUTER JOIN schema.dstable cc on aa.ip=cc.ip
where cc.ip is null
-----cache+broadcast方式(20s)
原理:使用broadcast将会把小表分发到每台执行节点上,因此,关联操作都在本地完成,基本就取消了shuffle的过程,运行效率大幅度提高。
cache table cta as SELECT round(sum(click)/sum(imp),4) avg_click_rate FROM schema.srctable1 WHERE date = '20171020';
INSERT into TABLE schema.dstable
SELECT bb.ip FROM
(SELECT ip, sum(click) click_num, round(sum(click)/sum(imp),4) user_click_rate FROM schema.srctable1
WHERE date = '20171020' AND ip IS NOT NULL AND imp>0 GROUP BY ip) bb
LEFT OUTER JOIN cta aa
LEFT OUTER JOIN schema.dstable cc on cc.ip = bb.ip
WHERE cc.ip is null AND
(bb.user_click_rate > aa.avg_click_rate * 3 AND click_num > 500) OR (click_num > 1000)
注意:
cache 表不一定会被广播到Executor,执行map side join,还受另外一个参数:spark.sql.autoBroadcastJoinThreshold影响,该参数判断是否将该表广播;
spark.sql.autoBroadcastJoinThreshold参数默认值是10M,所以只有cache的表小于10M的才被广播到Executor上去执行map side join。
阅读全文
0 0
- sparksql优化1(小表大表关联优化 & union替换or)
- [索引优化] -- in or替换为union all
- MYSQL索引优化: IN 和 OR 替换为 union all
- SQL优化实例:OR -- UNION
- 【sql 优化】union 、union all、or使用
- 关于MYSQL索引优化和in or替换为union all
- oracle 优化or 替换为in、exists、union all的几种写法,测试没有问题!
- 关于MYSQL索引优化和in or替换为union all
- 关于MYSQL索引优化和in or替换为union all的问题
- 25 union代替or --优化主题系列
- SparkSQL查询优化
- 【SQL优化】B树索引位图转换及OR到UNION(ALL)的改写
- MySQL的or/in/union与索引优化
- MySQL的or/in/union与索引优化
- 26 union代替or-续 --优化主题系列
- 学习笔记1:数据库优化(优化的关联方面)
- SQL索引优化2(MySQL的or/in/union与索引优化)
- SparkSQL之优化器Catalyst系统
- Java虚拟机第二章2.3 学习笔记
- 利用 iPhone X 的脸部识别能力为内容制作工作服务
- iOS开发-使用Runloop实现线程保活、线程常驻
- jlink gdb server with arm-none-eabi-gdb
- 配置方法数超过64k应用的问题
- sparksql优化1(小表大表关联优化 & union替换or)
- 详解python中的浅拷贝和深拷贝
- Web Scraping with Python: 使用 Python 下载 CSDN 博客图片
- android加载网络图片之前获取图片信息
- node.js-ORM数据库框架sequelize使用总结 超时查询、缓存查询、多表关联查询
- OpenCV中的videowriter
- ViewPage怎么更新数据
- UVA 11752 The Super Powers
- Vim常用操作总结及课后习题