解决spark sql关联(join)查询使用“or“缓慢的问题

来源:互联网 发布:国外发展人工智能吗 编辑:程序博客网 时间:2024/05/22 12:44

1.需求描述

将a表的数据与b表的两个字段进行关联,输出结果

a表数据约24亿条

b表数据约30万条

2.优化效果

优化后执行时间从数天减少到数分钟

3.资源配置

spark 1.4.1

200core,600G RAM

4.代码简化版(优化前)

sqlContext.sql("name,ip1,ip2 as ip from table_A where name is not null and ip2 is not null or ip2 is not null) group by name,ip1,ip2").registerTempTable("a")sqlContext.read.parquet("table_B").registerTempTable("b")sqlContext.sql('''select         ip,        count(1) as cnt from         (select                 bb.ip as ip,                aa.name as name         from                 (select * from b where ip != '')bb left join (select * from a)aa on (bb.ip=aa.ip2 or bb.ip=aa.ip1)        group by                 bb.ip,                aa.name)group by         ip''').write.json("result")

5.代码简化版(优化后)

后来经过排查发现是使用or语句导致的运行缓慢,于是将两个条件查询注册成两张表,然后union成一张表,union操作其实只是合并两个rdd的分区,基本没有什么开销。然后在对这张表进行关联操作

代码如下:

//查询出需要的字段并进行缓存,因为下面要查询2次sqlContext.sql("CACHE TABLE all AS select name,ip1,ip2 from table_A where name is not null and (ip1 is not null or ip2 is not null) group by name,ip1,ip2")sqlContext.sql("select name,ip1 from all group by name,ip1").registerTempTable("temp1")sqlContext.sql("select name,ip2 from all group by name,ip2").registerTempTable("temp2")sqlContext.sql("select name,ip from (select * from temp1 union all select * from temp2)a group by name,ip").registerTempTable("a")sqlContext.read.parquet("table_B").registerTempTable("b")sqlContext.sql('''select         ip,        count(1) as cnt from         (select                 bb.ip as ip,                aa.name as name        from                 (select * from b where ip != '')bb left join (select * from a)aa on bb.ip=aa.ip        group by                 bb.ip,                aa.name)group by         ip''').write.json("result")
0 0