Hive-为什么没有启动MapReduce任务

来源：互联网发布：mysql存储过程有什么用编辑：程序博客网时间：2024/06/05 08:14

Hive-为什么没有启动MapReduce任务

涉及到关键参数：

hive.fetch.task.conversion
hive.fetch.task.aggr
hive.fetch.task.conversion.threshold

问题的背景：
(1)查询sql
select * from dim_beeper.dim_beeper_trans_match_driver_bid where p_day=’2017-06-06’ and source_id = 1594007;
(2)dim_beeper_trans_match_driver_bid是按照p_day来进行的分区表
(3)dim_beeper_trans_match_driver_bid的存储格式为parquet
(4)dim_beeper_trans_match_driver_bid在p_day＝2017-06-06分区下，包含两个文件，每个文件的大小如下：
这里写图片描述

(5)执行上面的sql会出现OutOfMemoryError错误

＝＝＝＝＝
在执行上面的sql时，由于是简单的select *， hive内部会进行优化，并不会启动MapReduce任务，而会只有一个Fetch Task 来在执行hive的客户端进行数据的读取和解析操作，此时由于Hive客户端的内存设置限制，会出现上面提到的 OutOfMemoryError 错误。

之后，尝试将select * 改为 select 所有字段，按照之前的经验，此时会启动MapReduce任务，但是此时还是未启动MapReduce任务，结果还是会出现上面的 OutOfMemoryError 错误。

难道 hive的内部机制调整了？？？
通过查看Hive的相关文档，找到了文章开头提到的Hive内部针对fetch.task的优化参数，接下来就详细的介绍一下上面的三个参数吧。

1、hive.fetch.task.conversion

This parameter controls which kind of simple query can be converted to a single fetch task.
It was added in Hive 0.10 per HIVE-2925,可选值如下:
–“none” is added in Hive 0.14 to disable this feature, per HIVE-8389.
–“minimal” means SELECT *, FILTER on partition columns (WHERE and HAVING clauses), LIMIT only.
– “more” means SELECT, FILTER, LIMIT only (including TABLESAMPLE, virtual columns).”more” can take any kind of expressions in the SELECT clause, including UDFs.(UDTFs and lateral views are not yet supported – see HIVE-5718.)

1.2版本的hive，默认为 more，即会出现该文章所遇到的问题，通过将其修改为none，即会关闭fetch 优化。

2、hive.fetch.task.aggr

Aggregation queries with no group-by clause (for example, select count(*) from src) execute final aggregations in a single reduce task. If this parameter is set to true, Hive delegates the final aggregation stage to a fetch task, possibly decreasing the query time.
It was added in Hive 0.12 per HIVE-4002.
It can help when the output of the mappers are very small, by reducing the time for shuffle and launching reduce task.

对于没有 group by 的汇总查询（select count(*)），hive将不会启动最终的Reduce任务，而是在客户端本地启动Fetch任务来完成Reduce的工作，从而避免了shuffle，实现了一定程度的优化

3、hive.fetch.task.conversion.threshold

This parameter controls input threshold (in bytes) for applying hive.fetch.task.conversion.
It was added in Hive 0.13 per HIVE-3990, default is “-1” to disable the feature;
The default value was changed in Hive 0.14 to 1GB(1073741824) per HIVE-7397.

该参数决定在输入大小为多少时，hive.fetch.task.conversion生效，默认为 1G

＝＝＝＝＝＝＝＝＝＝＝＝＝＝
参考：http://www.openkb.info/2015/01/how-to-enable-fetch-task-instead-of.html

阅读全文

1 0