Hive Indexes
Indexes are a standard database technique; Hive supports them as of version 0.7.
The purpose of a Hive index is to speed up queries against specific columns of a Hive table.
Without an index, a query such as WHERE tab1.col1 = 10 forces Hive to load the entire table or partition and process every row. If an index exists on column col1, only a portion of the underlying files needs to be loaded and processed.
As in traditional databases, an index speeds up queries at a cost: extra resources to build it and extra disk space to store it.
Hive offers only limited indexing. Unlike a traditional RDBMS, it has no concept of keys, but users can create indexes on certain columns to accelerate some operations; the index data for a table is stored in a separate table. Hive's indexing support is still relatively young and offers few options, but index handlers are designed as pluggable Java code, so users can extend them to fit their own needs. Not every query benefits from a Hive index; you can use EXPLAIN to analyze whether a HiveQL statement can use one to improve query performance. As with RDBMS indexes, you need to evaluate whether creating an index is justified: it consumes extra disk space, and creating and maintaining it carries a cost, so the benefits must be weighed against that cost.
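To see whether a given query could use an index, inspect its plan with EXPLAIN. A minimal sketch (the table and column names here are illustrative, not from a real schema):

```sql
-- Hypothetical table/column names; shows how to inspect the query plan.
-- If a compact index is picked up, the plan gains an extra stage that
-- scans the index table to compute the input paths for the main job.
EXPLAIN SELECT * FROM tab1 WHERE tab1.col1 = 10;
```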
Common index operations
The following examples cover the common usage patterns:
A. Create/build, show, and drop an index
create index table01_index on table table01(column2) as 'COMPACT' with deferred rebuild;
show index on table01;
drop index table01_index on table01;
B. Create, then build, show formatted, and drop an index
create index table02_index on table table02(column3) as 'compact' with deferred rebuild;
alter index table02_index on table02 rebuild;
show formatted index on table02;
drop index table02_index on table02;
C. Create a bitmap index, then build, show, and drop it
create index table03_index on table table03 (column4) as 'bitmap' with deferred rebuild;
alter index table03_index on table03 rebuild;
show formatted index on table03;
drop index table03_index on table03;
D. Store the index data in a named table
create index table04_index on table table04 (column5) as 'compact' with deferred rebuild in table table04_index_table;
E. Create an index stored as RCFile
create index table05_index on table table05 (column6) as 'compact' with deferred rebuild stored as rcfile;
F. Create an index stored as TextFile
create index table06_index on table table06 (column7) as 'compact' with deferred rebuild row format delimited fields terminated by '\t' stored as textfile;
G. Create an index with index properties
create index table07_index on table table07 (column8) as 'compact' with deferred rebuild idxproperties("prop1"="value1", "prop2"="value2");
H. Create an index with table properties
create index table08_index on table table08 (column9) as 'compact' with deferred rebuild tblproperties("prop3"="value3", "prop4"="value4");
I. Drop an index only if it exists
drop index if exists table09_index on table09;
J. Rebuild an index on a partition
alter index table10_index on table10 partition (columnx='valueq', columny='valuer') rebuild;
Now let's walk through creating an index step by step.
1. First, create a table:
hive> create table user(id int, name string)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '\t'
    > STORED AS TEXTFILE;
2. Load the data:
hive> load data local inpath '/export1/tmp/wyp/row.txt'
    > overwrite into table user;
3. Run a test query before creating the index:
hive> select * from user where id = 500000;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Cannot run job locally: Input Size (= 356888890) is larger than
hive.exec.mode.local.auto.inputbytes.max (= 134217728)
Starting Job = job_1384246387966_0247, Tracking URL =
http://l-datalogm1.data.cn1:9981/proxy/application_1384246387966_0247/
Kill Command = /home/q/hadoop/bin/hadoop job -kill job_1384246387966_0247
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2013-11-13 15:09:53,336 Stage-1 map = 0%, reduce = 0%
2013-11-13 15:09:59,500 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 2.0 sec
2013-11-13 15:10:00,531 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.63 sec
2013-11-13 15:10:01,560 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.63 sec
MapReduce Total cumulative CPU time: 5 seconds 630 msec
Ended Job = job_1384246387966_0247
MapReduce Jobs Launched:
Job 0: Map: 2   Cumulative CPU: 5.63 sec   HDFS Read: 361084006 HDFS Write: 357 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 630 msec
OK
500000  wyp.
Time taken: 14.107 seconds, Fetched: 1 row(s)
The query took 14.107 seconds in total.
4. Create an index on the user table:
hive> create index user_index on table user(id)
    > as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
    > with deferred rebuild
    > IN TABLE user_index_table;
Populate the index data:
hive> alter index user_index on user rebuild;
You can inspect the data stored in the index table:
hive> select * from user_index_table limit 5;
0   hdfs://mycluster/user/hive/warehouse/table02/000000_0   [0]
1   hdfs://mycluster/user/hive/warehouse/table02/000000_0   [352]
2   hdfs://mycluster/user/hive/warehouse/table02/000000_0   [704]
3   hdfs://mycluster/user/hive/warehouse/table02/000000_0   [1056]
4   hdfs://mycluster/user/hive/warehouse/table02/000000_0   [1408]
Time taken: 0.244 seconds, Fetched: 5 row(s)
This creates an index on the user table.
5. Run the same test query after creating the index:
hive> select * from user where id = 500000;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Cannot run job locally: Input Size (= 356888890) is larger than
hive.exec.mode.local.auto.inputbytes.max (= 134217728)
Starting Job = job_1384246387966_0247, Tracking URL =
http://l-datalogm1.data.cn1:9981/proxy/application_1384246387966_0247/
Kill Command = /home/q/hadoop/bin/hadoop job -kill job_1384246387966_0247
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2013-11-13 15:23:12,336 Stage-1 map = 0%, reduce = 0%
2013-11-13 15:23:53,240 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 2.0 sec
2013-11-13 15:24:00,253 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.27 sec
2013-11-13 15:24:01,650 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.27 sec
MapReduce Total cumulative CPU time: 5 seconds 630 msec
Ended Job = job_1384246387966_0247
MapReduce Jobs Launched:
Job 0: Map: 2   Cumulative CPU: 5.63 sec   HDFS Read: 361084006 HDFS Write: 357 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 630 msec
OK
500000  wyp.
Time taken: 13.042 seconds, Fetched: 1 row(s)
The query took 13.042 seconds, which is about the same as without the index.
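One likely reason the timing barely changes is that Hive does not use an index automatically unless the optimizer is told to; in the Hive versions this article covers, the relevant flags default to off. A hedged sketch (these property names come from the Hive configuration; defaults and behavior vary by version):

```sql
-- Assumption: the optimizer must be asked to rewrite predicates using indexes.
SET hive.optimize.index.filter=true;
-- Minimum input size (bytes) before a compact index is considered; 0 = always.
SET hive.optimize.index.filter.compact.minsize=0;
-- With the flags set, Hive can consult user_index_table to prune the
-- input files instead of scanning the whole table.
SELECT * FROM user WHERE id = 500000;
```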
Note that on some versions, CREATE INDEX with a COMMENT clause triggers a bug:
hive> CREATE INDEX employees_index
    > ON TABLE employees (country)
    > AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
    > WITH DEFERRED REBUILD
    > IDXPROPERTIES ('creator' = 'me', 'created_at' = 'some_time')
    > IN TABLE employees_index_table
    > COMMENT 'Employees indexed by country and name.';
FAILED: Error in metadata: java.lang.RuntimeException: \
Check the index columns, they should appear in the table being indexed.
FAILED: Execution Error, return code 1 from \
org.apache.hadoop.hive.ql.exec.DDLTask
This bug exists in Hive 0.10.0, 0.10.1, and 0.11.0 and was fixed in Hive 0.12.0; see https://issues.apache.org/jira/browse/HIVE-4251 for details.
Summary: an index table basically contains three kinds of columns: (1) the indexed column(s) from the source table; (2) _bucketname, the path of the HDFS file containing the row; (3) the offsets of the indexed values within that HDFS file. The principle is that by recording where each indexed value sits in HDFS, Hive can fetch the matching data directly and avoid a full table scan.
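The structure described above can be inspected directly on the index table created earlier; a minimal sketch (column names are those generated by the compact index handler, which may vary by version):

```sql
-- Shows the index table schema: the indexed column (id),
-- _bucketname (the HDFS file holding the row), and _offsets
-- (offsets of matching rows within that file).
DESCRIBE user_index_table;
```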
Index test
(1) Count the rows in the table:
hive (hive)> select count(*) from userbook;
4409365
(2) Query before the index is created:
hive (hive)> select * from userbook where book_id = '15999998838';
Query ID = hadoop_20150627165551_595da79a-0e27-453b-9142-7734912934c4
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1435392961740_0012, Tracking URL = http://gpmaster:8088/proxy/application_1435392961740_0012/
Kill Command = /home/hadoop/hadoop-2.6.0/bin/hadoop job -kill job_1435392961740_0012
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2015-06-27 16:56:04,666 Stage-1 map = 0%, reduce = 0%
2015-06-27 16:56:28,974 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 4.36 sec
2015-06-27 16:56:31,123 Stage-1 map = 78%, reduce = 0%, Cumulative CPU 6.21 sec
2015-06-27 16:56:34,698 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 7.37 sec
MapReduce Total cumulative CPU time: 7 seconds 370 msec
Ended Job = job_1435392961740_0012
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2   Cumulative CPU: 7.37 sec   HDFS Read: 348355875 HDFS Write: 76 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 370 msec
OK
userbook.book_id userbook.book_name userbook.author userbook.public_date userbook.address
15999998838 uviWfFJ KwCrDOA 2009-12-27 3b74416d-eb69-48e2-9d0d-09275064691b
Time taken: 45.678 seconds, Fetched: 1 row(s)
(3) Create the index:
hive (hive)> create index userbook_bookid_idx on table userbook(book_id) as 'COMPACT' WITH DEFERRED REBUILD;
(4) Run the same query after creating the index:
hive (hive)> select * from userbook where book_id = '15999998838';
Query ID = hadoop_20150627170019_5bb5514a-4c8e-4c47-9347-ed0657e1f2ff
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1435392961740_0013, Tracking URL = http://gpmaster:8088/proxy/application_1435392961740_0013/
Kill Command = /home/hadoop/hadoop-2.6.0/bin/hadoop job -kill job_1435392961740_0013
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2015-06-27 17:00:30,429 Stage-1 map = 0%, reduce = 0%
2015-06-27 17:00:54,003 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 7.43 sec
2015-06-27 17:00:56,181 Stage-1 map = 78%, reduce = 0%, Cumulative CPU 9.66 sec
2015-06-27 17:00:58,417 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 10.83 sec
MapReduce Total cumulative CPU time: 10 seconds 830 msec
Ended Job = job_1435392961740_0013
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2   Cumulative CPU: 10.83 sec   HDFS Read: 348356271 HDFS Write: 76 SUCCESS
Total MapReduce CPU Time Spent: 10 seconds 830 msec
OK
userbook.book_id userbook.book_name userbook.author userbook.public_date userbook.address
15999998838 uviWfFJ KwCrDOA 2009-12-27 3b74416d-eb69-48e2-9d0d-09275064691b
Time taken: 40.549 seconds, Fetched: 1 row(s)
As you can see, the query is somewhat faster after the index is created.
In fact, for a simple query like this, Hive can be configured to skip launching a MapReduce job entirely and instead run a fetch task, which filters the required rows directly from the HDFS files. This requires the following parameter:
set hive.fetch.task.conversion=more;
hive (hive)> select * from userbook where book_id = '15999998838';
OK
userbook.book_id userbook.book_name userbook.author userbook.public_date userbook.address
15999998838 uviWfFJ KwCrDOA 2009-12-27 3b74416d-eb69-48e2-9d0d-09275064691b
Time taken: 0.093 seconds, Fetched: 1 row(s)
As you can see, this is much faster: skipping the MapReduce job startup improves execution efficiency considerably.
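For reference, a sketch of the values hive.fetch.task.conversion accepts (per the Hive configuration documentation; the default differs across versions):

```sql
-- hive.fetch.task.conversion values:
--   none    : always launch a MapReduce job
--   minimal : fetch task only for SELECT *, filters on partition columns, LIMIT
--   more    : fetch task also for column projection, row filters, and LIMIT
SET hive.fetch.task.conversion=more;
SELECT * FROM userbook WHERE book_id = '15999998838';
```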