Hive-为什么没有启动MapReduce任务

来源:互联网 发布:mysql存储过程有什么用 编辑:程序博客网 时间:2024/06/05 08:14

Hive-为什么没有启动MapReduce任务

涉及到关键参数:

  • hive.fetch.task.conversion
  • hive.fetch.task.aggr
  • hive.fetch.task.conversion.threshold

问题的背景:
(1)查询sql
select * from dim_beeper.dim_beeper_trans_match_driver_bid where p_day=’2017-06-06’ and source_id = 1594007;
(2)dim_beeper_trans_match_driver_bid是按照p_day来进行的分区表
(3)dim_beeper_trans_match_driver_bid的存储格式为parquet
(4)dim_beeper_trans_match_driver_bid在p_day=2017-06-06分区下,包含两个文件,每个文件的大小如下:
这里写图片描述

(5)执行上面的sql会出现OutOfMemoryError错误

=====
在执行上面的sql时,由于是简单的select *, hive内部会进行优化,并不会启动MapReduce任务,而会只有一个Fetch Task 来在执行hive的客户端进行数据的读取和解析操作,此时 由于Hive客户端的内存设置限制,会出现上面提到的 OutOfMemoryError 错误。

之后,尝试将select * 改为 select 所有字段,按照之前的经验,此时会启动MapReduce任务,但是此时还是未启动MapReduce任务,结果还是会出现上面的 OutOfMemoryError 错误。

难道 hive的内部机制调整了???
通过查看Hive的相关文档,找到了文章开头提到的Hive内部针对fetch.task的优化参数,接下来就详细的介绍一下上面的三个参数吧。

1、hive.fetch.task.conversion

This parameter controls which kind of simple query can be converted to a single fetch task.
It was added in Hive 0.10 per HIVE-2925,可选值如下:
–“none” is added in Hive 0.14 to disable this feature, per HIVE-8389.
–“minimal” means SELECT *, FILTER on partition columns (WHERE and HAVING clauses), LIMIT only.
– “more” means SELECT, FILTER, LIMIT only (including TABLESAMPLE, virtual columns).”more” can take any kind of expressions in the SELECT clause, including UDFs.(UDTFs and lateral views are not yet supported – see HIVE-5718.)

1.2版本的hive,默认为 more,即会出现该文章所遇到的问题,通过将其修改为none,即会关闭fetch 优化。

2、hive.fetch.task.aggr

Aggregation queries with no group-by clause (for example, select count(*) from src) execute final aggregations in a single reduce task. If this parameter is set to true, Hive delegates the final aggregation stage to a fetch task, possibly decreasing the query time.
It was added in Hive 0.12 per HIVE-4002.
It can help when the output of the mappers are very small, by reducing the time for shuffle and launching reduce task.

对于没有 group by 的汇总查询(select count(*)),hive将不会启动最终的Reduce任务,而是在客户端本地启动Fetch任务来完成Reduce的工作,从而避免了shuffle,实现了一定程度的优化

3、hive.fetch.task.conversion.threshold

This parameter controls input threshold (in bytes) for applying hive.fetch.task.conversion.
It was added in Hive 0.13 per HIVE-3990, default is “-1” to disable the feature;
The default value was changed in Hive 0.14 to 1GB(1073741824) per HIVE-7397.

该参数决定在输入大小为多少时,hive.fetch.task.conversion生效,默认为 1G

==============
参考:http://www.openkb.info/2015/01/how-to-enable-fetch-task-instead-of.html

原创粉丝点击