Programming Hive (10): Tuning

Source: Internet  Editor: 程序博客网  Time: 2024/06/05 14:11

10.1 Using EXPLAIN

hive> DESCRIBE onecol; 
number int

hive> SELECT * FROM onecol;
5
5
4

hive> SELECT SUM(number) FROM onecol;
14

Run the same query with EXPLAIN to see its execution plan:

hive> EXPLAIN SELECT SUM(number) FROM onecol;

10.2 EXPLAIN EXTENDED

EXPLAIN EXTENDED prints more complete information about the plan.

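10.3 LIMIT Tuning

In many cases a LIMIT clause still executes the entire query and then returns only part of the result.

Hive provides the following setting to avoid this for simple LIMIT queries: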

<property> 
<name>hive.limit.optimize.enable</name> 
<value>true</value> 
<description>Whether to enable to optimization to 
try a smaller subset of data for simple LIMIT first.</description> 
</property>

When hive.limit.optimize.enable is true, two more properties control how LIMIT is handled:

hive.limit.row.max.size

<property> 
<name>hive.limit.row.max.size</name> 
<value>100000</value> 
<description>When trying a smaller subset of data for simple LIMIT, 
how much size we need to guarantee each row to have at least. 
</description> 
</property>

hive.limit.optimize.limit.file

<property> 
<name>hive.limit.optimize.limit.file</name> 
<value>10</value> 
<description>When trying a smaller subset of data for simple LIMIT, 
maximum number of files we can sample.</description> 
</property>

10.4 Join Optimization

Put the largest table on the right-hand side of the JOIN clause, that is, last in the join order.
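The reason for this rule is that, in a reduce-side join, Hive buffers the rows of the earlier tables and streams the last table through. The Python sketch below only illustrates that buffer-then-stream idea; it is not Hive's implementation, and `stream_join` and the sample rows are hypothetical names and data chosen for the illustration.

```python
from collections import defaultdict


def stream_join(buffered_rows, streamed_rows):
    """Join two sequences of (key, value) rows on key.

    buffered_rows is loaded fully into memory (the small side);
    streamed_rows is read one row at a time (the large side).
    """
    buffer = defaultdict(list)
    for key, value in buffered_rows:
        buffer[key].append(value)
    for key, value in streamed_rows:
        # Only the buffered side is held in memory while streaming.
        for left_value in buffer.get(key, []):
            yield (key, left_value, value)


# Hypothetical data: a small table and a larger one produced lazily.
small = [(1, "planner_a"), (2, "planner_b")]
large = ((i % 3, f"hit_{i}") for i in range(6))
joined = list(stream_join(small, large))
```

Memory use grows with the buffered side only, so the smaller the buffered tables, the cheaper the join.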

10.5 Local Mode

hive> set mapred.job.tracker=local;

hive> set mapred.tmp.dir=/home/edward/tmp;

hive> SELECT * from people WHERE firstname='bob';

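You can also set hive.exec.mode.local.auto to true so that Hive decides automatically when to run a query in local mode. This setting usually goes in $HOME/.hiverc.

To apply it globally, add it to $HIVE_HOME/conf/hive-site.xml: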

<property> 
<name>hive.exec.mode.local.auto</name> 
<value>true</value> 
<description> 
Let hive determine whether to run in local mode automatically 
</description> 
</property>

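10.6 Parallel Execution

By default Hive executes one stage at a time. Setting hive.exec.parallel to true lets stages that do not depend on each other run in parallel.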

<property> 
<name>hive.exec.parallel</name> 
<value>true</value> 
<description>Whether to execute jobs in parallel</description> 
</property>
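To make the idea of independent stages concrete, here is a small Python sketch that groups the stages of a plan into "waves" that could run together. It is only an illustration, not how Hive schedules jobs; `parallel_waves` and the stage dependencies in `plan` are hypothetical.

```python
def parallel_waves(dependencies):
    """Group stages into waves; stages in one wave do not depend on each other.

    dependencies maps each stage name to the stages it must wait for.
    """
    remaining = {stage: set(deps) for stage, deps in dependencies.items()}
    done = set()
    waves = []
    while remaining:
        # A stage is ready once everything it depends on has finished.
        ready = sorted(s for s, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("cyclic stage dependencies")
        waves.append(ready)
        done.update(ready)
        for stage in ready:
            del remaining[stage]
    return waves


# Hypothetical plan: Stage-1 and Stage-2 are independent of each other.
plan = {
    "Stage-1": [],
    "Stage-2": [],
    "Stage-3": ["Stage-1", "Stage-2"],
    "Stage-0": ["Stage-3"],
}
waves = parallel_waves(plan)
```

With one stage at a time this plan takes four steps; with parallel execution Stage-1 and Stage-2 share the first wave, so it takes three.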

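10.7 Strict Mode

Hive provides a strict mode that prevents users from running queries with unintended or undesirable effects.

Setting hive.mapred.mode to strict prohibits three types of queries.

Queries on partitioned tables

A query on a partitioned table is rejected unless its WHERE clause filters on a partition column; in other words, scanning all partitions is not allowed. For example: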

hive> SELECT DISTINCT(planner_id) FROM fracture_ins WHERE planner_id=5; 
FAILED: Error in semantic analysis: No Partition Predicate Found for 
Alias "fracture_ins" Table "fracture_ins"

The following statement adds a partition filter and runs normally:

hive> SELECT DISTINCT(planner_id) FROM fracture_ins 
> WHERE planner_id=5 AND hit_date=20120101;

Queries that use ORDER BY

A query that uses ORDER BY must also use a LIMIT clause. For example:

hive> SELECT * FROM fracture_ins WHERE hit_date>2012 ORDER BY planner_id; 
FAILED: Error in semantic analysis: line 1:56 In strict mode, 
limit must be specified if ORDER BY is present planner_id

The following statement adds a LIMIT and runs normally:

hive> SELECT * FROM fracture_ins WHERE hit_date>2012 ORDER BY planner_id 
> LIMIT 100000;

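Cartesian products

A JOIN with no ON clause is rejected, even if the WHERE clause relates the two tables. For example: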

hive> SELECT * FROM fracture_act JOIN fracture_ads 
> WHERE fracture_act.planner_id = fracture_ads.planner_id; 
FAILED: Error in semantic analysis: In strict mode, cartesian product 
is not allowed. If you really want to perform the operation, 
set hive.mapred.mode=nonstrict

The following statement uses an ON clause and runs normally:

hive> SELECT * FROM fracture_act JOIN fracture_ads 
> ON (fracture_act.planner_id = fracture_ads.planner_id);
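The three strict-mode rules above can be summarized as a small checker. This Python sketch is a simplification for illustration only: Hive performs these checks during semantic analysis of the real query, whereas the hypothetical function `strict_mode_violations` just takes boolean flags describing a query.

```python
def strict_mode_violations(partitioned=False, has_partition_filter=False,
                           has_order_by=False, has_limit=False,
                           has_join=False, has_on_clause=False):
    """Return the strict-mode rules that a described query would break."""
    violations = []
    if partitioned and not has_partition_filter:
        violations.append("no partition predicate")
    if has_order_by and not has_limit:
        violations.append("ORDER BY without LIMIT")
    if has_join and not has_on_clause:
        violations.append("cartesian product")
    return violations


# The three failing queries from this section, described as flags.
no_filter = strict_mode_violations(partitioned=True)
no_limit = strict_mode_violations(partitioned=True, has_partition_filter=True,
                                  has_order_by=True)
no_on = strict_mode_violations(has_join=True)
# The corrected ORDER BY query: partition filter plus LIMIT.
fixed = strict_mode_violations(partitioned=True, has_partition_filter=True,
                               has_order_by=True, has_limit=True)
```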

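10.8 Tuning the Number of Mappers and Reducers

Setting hive.exec.reducers.max keeps a single query from consuming too many reduce resources. It is worth setting this property in $HIVE_HOME/conf/hive-site.xml. A suggested formula for its value is:

(total number of reduce slots in the cluster * 1.5) / (average number of running queries)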
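As a worked example of this formula, the Python sketch below computes a value for a made-up cluster. The function name `suggested_reducers_max` and the numbers (200 reduce slots, 4 concurrent queries on average) are assumptions chosen for illustration, not measurements.

```python
def suggested_reducers_max(total_reduce_slots, avg_running_queries):
    """Apply (total reduce slots * 1.5) / (average running queries)."""
    if avg_running_queries <= 0:
        raise ValueError("avg_running_queries must be positive")
    # Round down so the result can be used as an integer property value.
    return int(total_reduce_slots * 1.5 / avg_running_queries)


# Hypothetical cluster: 200 reduce slots, about 4 queries running at once.
value = suggested_reducers_max(total_reduce_slots=200, avg_running_queries=4)
# 200 * 1.5 / 4 = 75
```

The result would then be used as the value of hive.exec.reducers.max.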

In a cluster environment

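10.9 JVM Reuse

10.10 Indexes

Indexes can be used to speed up GROUP BY queries.

10.11 Dynamic Partition Tuning

10.12 Speculative Execution

10.13 Multiple GROUP BY in a Single MapReduce Job

10.14 Virtual Columns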
