Hive Performance Tuning and Optimization

Source: Internet | Published by: 最优化方法张薇答案 | Editor: 程序博客网 | Date: 2024/05/16 11:01


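1. explain _query and explain extended _query

These statements show how Hive parses and plans an HQL query, including the execution stages, the tasks in each stage, and the task attributes.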

explain select name from test

explain extended select name from test

...

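2. limit

When querying Hive from a client, LIMIT is often used to cap the number of rows returned. In many cases Hive still scans the whole table and then returns only a small part of the data, which wastes time. This behavior can be optimized with the following property: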

<property>
<name>hive.limit.optimize.enable</name>
<description>Whether to enable to optimization to try a smaller subset of data for simple LIMIT first.</description>
</property>

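With this parameter enabled, Hive samples the data for a LIMIT query instead of scanning the whole table, which can save a lot of time. The drawback is that, because the input is sampled, some rows that should have been returned may be skipped.

The following two parameters are used together with it:

hive.limit.row.max.size: the maximum size of a single row

hive.limit.optimize.limit.file: the number of data files to sample from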
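
As a rough sketch, the three settings can be applied for a single session as shown below. The values given for the two companion parameters are illustrative assumptions rather than recommendations from the text, so check the defaults of your Hive version before relying on them.

```sql
-- Enable sampled execution for simple LIMIT queries (session scope only)
set hive.limit.optimize.enable=true;
-- Assumed illustrative values; tune them for your own data
set hive.limit.row.max.size=100000;
set hive.limit.optimize.limit.file=10;

-- A simple LIMIT query on the test table used earlier
select name from test limit 10;
```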

3. Local Mode

Local mode runs the job on the local machine. When a query processes only a small data set, running it locally is preferred.

<property>
<name>hive.exec.mode.local.auto</name>
<description>Let hive determine whether to run in local mode automatically</description>
</property>
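
A minimal per-session sketch of this setting follows. The input-size threshold parameter is not mentioned in the text above and is included here as an assumption about how Hive decides that a data set is "small"; its value is illustrative.

```sql
-- Let Hive choose local mode automatically for small inputs
set hive.exec.mode.local.auto=true;
-- Assumed threshold (not from the text): total input size, in bytes,
-- below which a job is eligible for local mode
set hive.exec.mode.local.auto.inputbytes.max=134217728;

select name from test limit 10;
```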

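4. Parallel Execution

Hive converts each query into multiple stages. Stages that do not depend on one another can run in parallel, which shortens the total execution time.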

<property>
<name>hive.exec.parallel</name>
<description>Whether to execute jobs in parallel</description>
</property>
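
For a single session, parallel execution might be switched on as sketched below. The thread-count parameter is not part of the text above; it is shown as an assumption about how the degree of parallelism is capped, with an illustrative value.

```sql
-- Allow independent stages of one query to run at the same time
set hive.exec.parallel=true;
-- Assumed companion setting (not from the text): maximum number of
-- stages that may run in parallel
set hive.exec.parallel.thread.number=8;
```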

5. Strict Mode

If hive.mapred.mode is set to strict, three kinds of queries are not allowed:

(1) When querying a partitioned table, Hive rejects the query if the WHERE clause has no filter on a partition column.

hive> SELECT DISTINCT(planner_id) FROM fracture_ins WHERE planner_id=5;

FAILED: Error in semantic analysis: No Partition Predicate Found for

Alias "fracture_ins" Table "fracture_ins"

(2) A query with an ORDER BY clause must also have a LIMIT clause. ORDER BY sends all results to a single reducer for sorting, so adding LIMIT reduces the execution time.

hive> SELECT * FROM fracture_ins WHERE hit_date>2012 ORDER BY planner_id

> LIMIT 100000;

... normal results ...

(3) Cartesian products are not allowed. A JOIN must state its join condition in an ON clause; a WHERE clause cannot be used in its place.

hive> SELECT * FROM fracture_act JOIN fracture_ads

> ON (fracture_act.planner_id = fracture_ads.planner_id);

... normal results ...
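
The sketch below shows how the rejected query from case (1) might be made acceptable. It assumes that hit_date is a partition column of fracture_ins, which the text does not state explicitly, and the date value is illustrative.

```sql
-- Turn strict mode on for the current session
set hive.mapred.mode=strict;

-- Case (1): add a filter on the (assumed) partition column hit_date
SELECT DISTINCT(planner_id) FROM fracture_ins
WHERE planner_id=5 AND hit_date=20120101;

-- Return to the permissive behaviour when needed
set hive.mapred.mode=nonstrict;
```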

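6. Adjusting the Number of Mappers and Reducers

In Hive, some queries need only map tasks and no reduce tasks, so the number of reducers should be adjusted to the specific query. A clause such as GROUP BY, for example, requires reduce tasks.

hive> SELECT pixel_id, count(1) FROM fracture_ins WHERE hit_date=20120119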

> GROUP BY pixel_id;

Total MapReduce jobs = 1

Launching Job 1 out of 1

Number of reduce tasks not specified. Estimated from input data size: 3

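By default, the number of reduce tasks is calculated from the input data size. Each reducer processes 1 GB by default (hive.exec.reducers.bytes.per.reducer). Changing this value to 750 MB for the example above gives the following output: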

hive> set hive.exec.reducers.bytes.per.reducer=750000000;

hive> SELECT pixel_id,count(1) FROM fracture_ins WHERE hit_date=20120119

> GROUP BY pixel_id;

Total MapReduce jobs = 1

Launching Job 1 out of 1

Number of reduce tasks not specified. Estimated from input data size: 4

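The estimated number of reducers is now 4.

mapred.reduce.tasks is the parameter that sets the default number of reducers. The number of reducers should not be too small, or the cluster's parallelism will not be fully used. It should not be too large either, because initializing and scheduling reduce tasks takes considerable time. To keep one large job from taking up too much of the cluster's resources, the following parameter limits the maximum number of reducers a job may have: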
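
The change from 3 to 4 is consistent with the estimate being the input size divided by the bytes-per-reducer value, rounded up. The input size of 2.7 GB used below is an assumed figure chosen only to illustrate the arithmetic; the actual size of the fracture_ins partition is not given.

```latex
\[
\left\lceil \frac{2.7\times10^{9}}{1\times10^{9}} \right\rceil = \lceil 2.7 \rceil = 3,
\qquad
\left\lceil \frac{2.7\times10^{9}}{7.5\times10^{8}} \right\rceil = \lceil 3.6 \rceil = 4
\]
```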

hive.exec.reducers.max

A suggested way to calculate its value:

(Total Cluster Reduce Slots * 1.5) / (avg number of queries running)
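
As a worked illustration of this formula, assume a hypothetical cluster with 100 reduce slots and an average of 5 queries running at once; neither figure comes from the text. That gives (100 * 1.5) / 5 = 30, which could be applied as follows.

```sql
-- Hypothetical cluster: 100 reduce slots, 5 concurrent queries on average
-- (100 * 1.5) / 5 = 30
set hive.exec.reducers.max=30;
```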

7. JVM Reuse

<property>
<name>mapred.job.reuse.jvm.num.tasks</name>
<description>How many tasks to run per jvm. If set to -1, there is no limit.</description>
</property>

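JVM reuse lets a job hold on to its slots for a long time, until the job finishes. This is very useful for jobs with many tasks or many small files, and it reduces execution time. The value should not be set too high, however: some jobs have reduce tasks, and until those reduce tasks finish, the slots held by the map tasks cannot be released, so other jobs may have to wait.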
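
A minimal sketch of enabling JVM reuse for one session is shown below. The value 10 is an assumed, illustrative choice and not a recommendation from the text.

```sql
-- Let each JVM run up to 10 tasks of the same job before exiting
-- (illustrative value; -1 would mean no limit)
set mapred.job.reuse.jvm.num.tasks=10;
```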

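8. Indexes

9. Dynamic Partition Tuning

10. Virtual Columns

Hive tables provide two virtual columns: the name of the input file (split) and the offset within the file block.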

hive> set hive.exec.rowoffset=true;

hive> SELECT INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE, line

> FROM hive_text WHERE line LIKE '%hive%' LIMIT 2;

har://file/user/hive/warehouse/hive_text/folder=docs/

data.har/user/hive/warehouse/hive_text/folder=docs/README.txt 2243

http://hive.apache.org/

har://file/user/hive/warehouse/hive_text/folder=docs/

data.har/user/hive/warehouse/hive_text/folder=docs/README.txt 3646

- Hive 0.8.0 ignores the hive-default.xml file, though we continue

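With these two virtual columns you can identify the file that contains a problematic record and its position within that file.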
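
One possible way to narrow down which files are affected is to count the matching rows per input file, as sketched here against the hive_text table from the example above. The alias matches is a name introduced only for illustration.

```sql
-- Count rows matching the pattern in each input file
SELECT INPUT__FILE__NAME, count(1) AS matches
FROM hive_text
WHERE line LIKE '%hive%'
GROUP BY INPUT__FILE__NAME;
```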

0 0

hive 性能 调优、优化

hive 性能调优、优化