hive sql优化-join Mapjoin Group by

来源：互联网发布：c语言加法编程自定义编辑：程序博客网时间：2024/05/16 06:14

join

按照key进行分发，key的合并在map阶段，而在join左边的表，也就是主表，会首先读入内存，当然它不是全部读入内存，而是部分读入内存，如果左边的表的key相对分散（或少，分散的意思就是相同key的数据量小），读入内存的数据会比较小，join任务执行会比较快。而如果左边的表的key比较集中，而这张表的数据量很大，那么数据倾斜会比较严重。

map阶段同一key数据分发给一个reduce

join原则：

小表join大表，原因是在join操作的reduce阶段（不是map阶段），位于join左边的表的内容会被部分加载进内存，如果数据量比较少，就是全部加载到内存。将条目少的表放在左边，可以有效减少发生内存溢出的几率。

多个表关联时，最多分拆成小段，避免大sql（无法控制中间job）

多表join on条件相同时合并为一个map-reduce，做outer join的时候也是一样，查看执行计划explain

比如查询三个表关联：

select pt.page_id,count(t.url) PV

from rpt_page_type pt

join

(select url_page_id,url from trackinfo where ds='2013-10-11') t

on pt.page_id=t.url_page_id

join

(select page_id from rpt_page_kpi_new where ds='2013-10-11') r

on t.url_page_id=r.page_id

group by pt.page_id;

where条件最好不要放在最后，弄个中间子查询放进去，减少数据分发。

关联三个表，通常就是生成三个mapreduce，一个表一个。

hive的执行计划和oracle不太一样，oracle会生成一个cost，通过这个cost可以知道脚本的执行性能，但hive里面没有这个东西。

优化的话就是mapreduce的数量越少越好

比较两个表关联：

select pt.page_id,count(t.url) PV

from rpt_page_type pt

join

(select url_page_id,url from trackinfo where ds='2013-10-11') t

on pt.page_id=t.url_page_id

group by pt.page_id

上面两个select有很多地方相同，利用这个特性，可以把相同join on条件的放在一个job处理

当然很多情况下要看怎么做，但是要有这个意识，很多表关联时，把相同的条件抽出来

在小表关联大表时，如果join的条件不相同，如：

insert overwrite table page_pv

select pt.page_id,count(t.url) PV

from rpt_page_type pt

join

(select url_page_id,url,province_id from trackinfo where ds='2013-10-11') t

on pt.page_id=t.url_page_id

join

(select page_id,province_id from rpt_page_kpi_new where ds='2013-10-11') r

on t.province_id=r.province_id

group by pt.page_id

大表join小表

访户未登录时，日志中userid是空，在用user_id进行hash分桶时，会将日志中userid为空的数据分到一起，导致了过大空key造成倾斜。

解决办法：

由于相同的key只会分发到一个reduce去处理，所以就可以把空值的key变成一个字符串加上一个随机数，把倾斜的数据分到不同的reduce商，由于null值关联不上，处理后并不影响最终结果。

案例：

End_user 5000万，纬度表

Trackinfo 每天两个亿，按天增量表

原写法：

select u.id,url,t.track_time

from end_user u

join

(select end_user_id,url,track_time from trackinfo where ds='2013-12-01')t

on u.id=t.end_user_id limit 2;

其中end_user_id是很多为空的

调整为：

select u.id,url,track_time

from end_user u

join

{select case when end_user_id='null' or end_user_id is null

then cast (concat('00000000',floor(rand()*1000000)) as bigint)

else end_user_id end end_user_id,

url,track_time

from trackinfo where ds='2013-12-01') t

on u.id=t.end_user_id limit 2;

实验发现时间是原时间的一半。也就是说性能提高了一倍。

MapJoin：

join操作在map阶段完成，如果需要的数据在map的过程中可以访问则不再需要reduce

小表关联一个超大表时，容易发生数据倾斜，可以用MapJoin把小表全部加载到内存，在map端进行join，避免reduce处理

如：

insert overwrite table page_pv

select /*+ MAPJOIN(pt)*/

pt.page_id,count(t.url) PV

from rpt_page_type pt

join

(select url_page_id,url from trackinfo where ds='2013-10-11') t

on pt.page_id=t.url_page_id;

这个小表如果是25MB，25000行，放到内存中是比较合适的。

如果是小表，能否自动选择mapjoin？

set hive.auto.convert.join = true;默认为false

该参数为true时，hive自动对左边的表统计量，如果是小表则加入内存中，也就是对小表进行map join

大表小表的阈值：

set hive.mapjoin.smalltable.filesize;

hive.mapjoin.smalltable.filesize=25000000

默认是25MB

hive.mapjoin.cache.numrows

说明：mapjoin存放在内存里的数据量，也就是行数

默认是25000

hive.mapjoin.followby.gby.localtask.max.memory.usage

说明：map join做group by 操作时，可以使用多大的内存来存储数据，如果数据量太大，则不会保存在内存里

默认值：0.53

hive.mapjoin.localtask.max.memory.usage

说明：本地任务可以使用内存的百分比

默认值为0.90

group by：

map端部分聚合：

并不是所有的聚合操作都需要在reduce端完成，很多聚合操作都可以现在map端进行聚合，最后在reduce端得出最终结果。

基于hash

参数包括：

hive.map.aggr = true就是说也会在map端进行部分聚合，默认为true

hive.groupby.mapaggr.checkinterval = 100000在map端进行聚合操作的条目数量，也就是行数

默认情况下，map端同一个key数据分发给一个reduce，当一个key数量过大时就会倾斜了。数据倾斜大多数情况就是map阶段很快，reduce阶段很慢。

有数据倾斜的时候进行负载均衡

hive.groupby.skewindata = false

当选项设定为true,生成的查询计划会有两个MR job，第一个MR job中，Map的输出结果会随机分布到Reduce中，每个Reduce做部分聚合操作，并输出结果，这样处理的结果是相同的group by key有可能被分发到不同的reduce中，从而达到负载均衡的目的，第二个MR job再根据预处理的数据结果按照Group by key分不到reduce中，（这个过程可以保证相同的group by key被分布到同一个reduce中，最后完成最终的绝活操作。

Count(distinct)也是容易数据倾斜，当该字段存在大量值为null或空的记录

原因就是执行计划也是按照key去进行分发

解决思路：

count distinct时，将值为空的数据在where里过滤掉，在最后结果加1

如：

count(distinct end_user_id) as user_num

修正为：

cast(count(distinct end_user_id)+1 as bigint) as user_num

where end_user_id is not null and query<>

0 0