[Hive]Hive调优:让任务并行执行
来源:互联网 发布:ni 高速数据采集卡 编辑:程序博客网 时间:2024/05/20 05:07
业务背景
extract_trfc_page_kpi的hive sql如下:
set mapred.job.queue.name=pms;set hive.exec.reducers.max=8;set mapred.reduce.tasks=8;set mapred.job.name=extract_trfc_page_kpi;insert overwrite table pms.extract_trfc_page_kpi partition(ds='$yesterday')select distinct page_type_id, pv, uv, '$yesterday' update_time from( --针对PC、H5 select page_type_id, sum(pv) as pv, sum(uv) as uv from dw.rpt_trfc_page_kpi where ds = '$yesterday' and stat_type = 1 group by page_type_id union all --PC搜索页特殊处理 select 5 as page_type_id, sum(pv) as pv, sum(uv) as uv from dw.rpt_trfc_page_kpi where ds = '$yesterday' and stat_type = 1 and page_type_id in (51, 52)union all --针对APP select a.page_type_id, sum(pv) as pv, sum(uv) as uv from dw.rpt_trfc_page_kpi a left outer join ( select distinct page_type_id, old_page_type_id from tandem.mobile_backend_page_url_rule where is_delete = 0 ) b on (a.page_type_id = b.old_page_type_id) where a.ds = '$yesterday' and stat_type = 1 group by a.page_type_id ) t;
上面的sql中存在两个union all操作,顺序执行下来的话,需要耗时20分钟。
优化策略
分析以上的sql,其中union all前后的三个查询操作并无直接关联,因此没有必要顺序执行,因此优化的思路是让这三个查询操作并行执行,hive提供了如下参数实现job的并行操作:
// 开启任务并行执行set hive.exec.parallel=true;// 同一个sql允许并行任务的最大线程数set hive.exec.parallel.thread.number=8;
方案一
在执行sql时加上上面的两个hive参数,如:
set mapred.job.queue.name=pms;set hive.exec.reducers.max=8;set mapred.reduce.tasks=8;set hive.exec.parallel=true;set hive.exec.parallel.thread.number=8;set mapred.job.name=extract_trfc_page_kpi;insert overwrite table pms.extract_trfc_page_kpi partition(ds='$yesterday')select distinct page_type_id, pv, uv, '$yesterday' update_time from( --针对PC、H5 select page_type_id, sum(pv) as pv, sum(uv) as uv from dw.rpt_trfc_page_kpi where ds = '$yesterday' and stat_type = 1 group by page_type_id union all --PC搜索页特殊处理 select 5 as page_type_id, sum(pv) as pv, sum(uv) as uv from dw.rpt_trfc_page_kpi where ds = '$yesterday' and stat_type = 1 and page_type_id in (51, 52)union all --针对APP select a.page_type_id, sum(pv) as pv, sum(uv) as uv from dw.rpt_trfc_page_kpi a left outer join ( select distinct page_type_id, old_page_type_id from tandem.mobile_backend_page_url_rule where is_delete = 0 ) b on (a.page_type_id = b.old_page_type_id) where a.ds = '$yesterday' and stat_type = 1 group by a.page_type_id ) t;
方案二
在hive-site.xml中进行设置,查看当前版本hive的配置参数:
hive> set -v;...hive.exec.orc.zerocopy=falsehive.exec.parallel=falsehive.exec.parallel.thread.number=8hive.exec.perf.logger=org.apache.hadoop.hive.ql.log.PerfLoggerhive.exec.rcfile.use.explicit.header=truehive.exec.rcfile.use.sync.cache=truehive.exec.reducers.bytes.per.reducer=1000000000hive.exec.reducers.max=999hive.exec.rowoffset=falsehive.exec.scratchdir=/tmp/hive-pmshive.exec.script.allow.partial.consumption=falsehive.exec.script.maxerrsize=100000hive.exec.script.trust=falsehive.exec.show.job.failure.debug.info=true...
这些参数是配置在$HIVE_HOME/conf/hive-site.xml中的,现在在这个配置文件中加入:
<property> <name>hive.exec.parallel</name> <value>true</value></property><property> <name>hive.exec.parallel.thread.number</name> <value>16</value></property>
重新启动hive,看到刚刚配置的参数已经生效了:
hive> set -v;...hive.exec.orc.skip.corrupt.data=falsehive.exec.orc.zerocopy=falsehive.exec.parallel=truehive.exec.parallel.thread.number=16hive.exec.perf.logger=org.apache.hadoop.hive.ql.log.PerfLoggerhive.exec.rcfile.use.explicit.header=truehive.exec.rcfile.use.sync.cache=truehive.exec.reducers.bytes.per.reducer=1000000000hive.exec.reducers.max=999hive.exec.rowoffset=falsehive.exec.scratchdir=/tmp/hive-pmshive.exec.script.allow.partial.consumption=false...
结论
经过测试,添加了这两个参数以后,extract_trfc_page_kpi脚本执行时间从耗时20分钟,优化为耗时3分钟。
3 0
- [Hive]Hive调优:让任务并行执行
- hive实现任务并行执行
- hive并行执行job
- hive并行执行job
- hive并行执行job
- Hive优化--并行执行
- hive 执行时reduce 任务个数 设置
- contab与hive自动执行任务
- 通过Hive执行任务,异常——Hive:Connection refused;Hadoop:too many open files
- Hive任务优化基础
- Hive Tez任务失败
- Hive提交任务内存不足
- HIVE 设置任务名称
- hive sql执行计划
- hive执行源码分析
- Hive执行流程分析
- hive之脚本执行
- 执行hive内存溢出
- SVN服务器搭建和使用
- JS格式金额,每3位加一个,号
- 黑马程序员————IO流操作的基本规律和使用方法
- Setting up a Linux build environment
- 常见的linux命令
- [Hive]Hive调优:让任务并行执行
- “全员编程,广泛嵌入”(三)—— 物联网时代(一)
- 实现一个简单的滚动视图
- 爬虫scrapy抓取w3school课程列表
- service 生命周期的小探究
- 【floyd求最小环】【Vijos 1046】【观光旅游】
- Java高新技术---反射&动态代理
- iOS开发滚动视图UIScrollView
- Emacs