Notes on multi_distinct in Hive

Source: Internet | Published by: 数据库双机备份 | Editor: 程序博客网 | Date: 2024/05/28 03:03

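Current versions of Hive support multi-distinct, that is, several DISTINCT aggregates in a single query. This is convenient, but while it is in use the data-skew safeguard (set hive.groupby.skewindata=true) cannot be enabled: that parameter only takes effect for a single DISTINCT, where Hive uses an additional job to keep the data from skewing. The convenience of multi-distinct can therefore come at a performance cost. Log analysis, for example, routinely computes PV, UV, distinct IP count, and distinct session count, and the latter three all require deduplicated counting. Consider a query that computes each browser's share: it may run for 20 to 30 minutes (depending on the cluster and the volume of log data), because browser_core has only 10 distinct values and the reduce stage is therefore under heavy pressure. After optimization the improvement is 50%-70%.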
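The browser-share query itself is not shown, so the following is only a sketch of what such a multi-distinct query might look like, together with one common rewrite. The table `access_log`, its columns `uid`, `ip`, `session_id`, `dt`, and the date literal are hypothetical; only `browser_core` comes from the text. The rewrite is a general technique and not necessarily the optimization the author applied.

```sql
-- Hypothetical schema: access_log(browser_core STRING, uid STRING,
--                                 ip STRING, session_id STRING, dt STRING)

-- Multi-distinct form: rows are partitioned by browser_core only, so
-- about 10 reduce keys carry all of the deduplication work.
SELECT browser_core,
       COUNT(1)                   AS pv,
       COUNT(DISTINCT uid)        AS uv,
       COUNT(DISTINCT ip)         AS ip_cnt,
       COUNT(DISTINCT session_id) AS session_cnt
FROM access_log
WHERE dt = '2024-05-27'
GROUP BY browser_core;

-- Possible rewrite: deduplicate each dimension in its own GROUP BY,
-- where the key (browser_core, value) has high cardinality and spreads
-- across many reducers, then count the already-distinct rows.
SELECT browser_core,
       SUM(pv)                                AS pv,
       COUNT(IF(kind = 'uid', val, NULL))     AS uv,
       COUNT(IF(kind = 'ip', val, NULL))      AS ip_cnt,
       COUNT(IF(kind = 'session', val, NULL)) AS session_cnt
FROM (
  -- pv is carried by the uid branch only, so it is not counted three times
  SELECT browser_core, 'uid' AS kind, uid AS val, COUNT(1) AS pv
  FROM access_log
  WHERE dt = '2024-05-27'
  GROUP BY browser_core, uid
  UNION ALL
  SELECT browser_core, 'ip' AS kind, ip AS val, CAST(0 AS BIGINT) AS pv
  FROM access_log
  WHERE dt = '2024-05-27'
  GROUP BY browser_core, ip
  UNION ALL
  SELECT browser_core, 'session' AS kind, session_id AS val, CAST(0 AS BIGINT) AS pv
  FROM access_log
  WHERE dt = '2024-05-27'
  GROUP BY browser_core, session_id
) t
GROUP BY browser_core;
```

`COUNT(IF(..., val, NULL))` skips NULL values, which matches how `COUNT(DISTINCT ...)` ignores NULLs. The outer query still groups on only 10 keys, but by then each key has just one row per distinct value, so the final reduce is light.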

The following shows a way to improve performance by using an external table.

set hive.map.aggr=true;
set hive.groupby.skewindata=true;

-- Note: Hive has three storage formats; choose one according to your situation.
insert overwrite local directory '/home/xxx.txt'
row format delimited
fields terminated by '\t'
select
  distinct(concat(id,"\t",lt,"\t",ln,"\t",d))
from
  abcdistinct where src='new' or src='old';

-- Deduplicate the latitude/longitude data
drop table if exists abcright;

create external table abcright(
  id String,
  lt String,
  ln String,
  d String)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

LOAD DATA LOCAL INPATH '/home/xxx.txt'
OVERWRITE INTO TABLE abcright;
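For comparison, the same deduplication can probably be done without the round trip through a local file, by writing the distinct rows straight into the target table. This is a sketch, not the author's method. It assumes that `abcdistinct` exposes the columns `id`, `lt`, `ln`, `d` and `src`, and that `abcright` already exists with the four string columns defined above.

```sql
-- Assumption: abcdistinct has columns id, lt, ln, d, src;
-- abcright has already been created with columns id, lt, ln, d.
SET hive.map.aggr=true;
SET hive.groupby.skewindata=true;

-- GROUP BY on all four columns keeps one row per distinct combination.
INSERT OVERWRITE TABLE abcright
SELECT id, lt, ln, d
FROM abcdistinct
WHERE src = 'new' OR src = 'old'
GROUP BY id, lt, ln, d;
```

One difference from the concat-based version: `concat` returns NULL as soon as any argument is NULL, so rows with a NULL column collapse into a single NULL line there, while the `GROUP BY` form keeps them as separate combinations.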

0 0

HIVE&nbsp;中&nbsp;multi_distinct的注意事项

HIVE 中 multi_distinct的注意事项