Integrating Hive with HBase



# by coco
# 2014-07-25


 This article demonstrates the following:
   1. A table created in Hive can be stored directly in HBase.
   2. Data inserted into the Hive table is synchronized to the corresponding HBase table.
   3. Changes to column-family values in HBase are reflected in the corresponding Hive table.
   4. Mapping of multiple columns and multiple column families (example: 3 Hive columns mapped to 2 HBase column families).
   


 Integrating Hive and HBase
 1. Create an HBase-backed table:
hive>  CREATE TABLE hbase_table_1(key int, value string)    
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'  
    > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")   
    > TBLPROPERTIES ("hbase.table.name" = "xyz");
OK
Time taken: 1.833 seconds
hbase.table.name sets the name of the table in HBase.
hbase.columns.mapping defines the mapping of Hive columns to HBase column families.
The table as seen in HBase:
hbase(main):007:0> list
TABLE                                                                                                                        
hivetest                                                                                                                     
student                                                                                                                      
test                                                                                                                         
xyz                                                                                                                          
4 row(s) in 0.1050 seconds


=> ["hivetest", "student", "test", "xyz"]
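One hedged caution before loading data (standard HBaseStorageHandler behavior, though not exercised in this session): hbase_table_1 above is a managed table, so dropping it in Hive also deletes the backing HBase table.

```sql
-- Sketch only, do not run mid-tutorial: dropping a managed
-- HBase-backed Hive table also deletes the "xyz" table in HBase.
DROP TABLE hbase_table_1;
-- To keep the HBase table when the Hive table is dropped,
-- declare it with CREATE EXTERNAL TABLE instead (used later for "student").
```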


2. Load data with SQL
i. Prepare the data
a) Create a Hive staging table
hive> create table ccc(foo int,bar string) row format delimited fields terminated by '\t' lines terminated by '\n' stored as textfile;
OK
Time taken: 2.563 seconds
[root@db96 ~]# cat kv1.txt 
1       val_1
2       val_2
The file sits in root's home directory: /root/kv1.txt
  
[root@db96 ~]# 
hive> load data local inpath '/root/kv1.txt' overwrite into table ccc;
Copying data from file:/root/kv1.txt
Copying file: file:/root/kv1.txt
Loading data to table default.ccc
rmr: DEPRECATED: Please use 'rm -r' instead.
Deleted hdfs://db96:9000/hive/warehousedir/ccc
[Warning] could not update stats.
OK
Time taken: 2.796 seconds
hive> select * from ccc;
OK
1       val_1
2       val_2
NULL    NULL
Time taken: 0.348 seconds, Fetched: 3 row(s)
hive>
The trailing NULL,NULL row most likely comes from an empty last line in kv1.txt; lines that cannot be parsed load as NULLs.
Load the data into hbase_table_1 with an INSERT:
hive> insert overwrite table hbase_table_1 select * from ccc where foo=1;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1406161997851_0002, Tracking URL = http://db96:8088/proxy/application_1406161997851_0002/
Kill Command = /usr/local/hadoop//bin/hadoop job  -kill job_1406161997851_0002
Hadoop job information for Stage-0: number of mappers: 1; number of reducers: 0
2014-07-24 16:04:48,938 Stage-0 map = 0%,  reduce = 0%
2014-07-24 16:04:57,571 Stage-0 map = 100%,  reduce = 0%, Cumulative CPU 2.54 sec
MapReduce Total cumulative CPU time: 2 seconds 540 msec
Ended Job = job_1406161997851_0002
MapReduce Jobs Launched: 
Job 0: Map: 1   Cumulative CPU: 2.54 sec   HDFS Read: 217 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 540 msec
OK
Time taken: 27.648 seconds


Check the data in Hive; the row just inserted is returned:
hive> select * from hbase_table_1;
OK
1       val_1
Time taken: 1.143 seconds, Fetched: 1 row(s)


Log in to the HBase shell and check the loaded data:
hbase(main):008:0> scan "xyz"
ROW                           COLUMN+CELL                                                                          
 1                            column=cf1:val, timestamp=1406189096793, value=val_1                                 
1 row(s) in 0.1090 seconds


hbase(main):009:0> 
The row inserted through Hive is now in HBase.
Add rows directly in HBase:
hbase(main):009:0> put 'xyz','100','cf1:val','www.gongchang.com'
hbase(main):011:0> put 'xyz','200','cf1:val','hello,word!'
hbase(main):012:0> scan "xyz"
ROW                           COLUMN+CELL                                                                          
 1                            column=cf1:val, timestamp=1406189096793, value=val_1                                 
 100                          column=cf1:val, timestamp=1406189669476, value=www.gongchang.com                     
 200                          column=cf1:val, timestamp=1406189704742, value=hello,word!                           
3 row(s) in 0.0240 seconds


Back in Hive, check the data:
hive> select * from hbase_table_1;
OK
1       val_1
100     www.gongchang.com
200     hello,word!
Time taken: 1.097 seconds, Fetched: 3 row(s)
hive> 
The rows just inserted through HBase are now visible in Hive.


Hive access to an existing HBase table
Prepare data in the existing HBase table:
hbase(main):014:0> describe "student"
DESCRIPTION                                                                ENABLED                                 
 'student', {NAME => 'info', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => true                                    
  'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE',                                         
  MIN_VERSIONS => '0', TTL => '2147483647', KEEP_DELETED_CELLS => 'false',                                         
  BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}                                                
1 row(s) in 0.1380 seconds


hbase(main):015:0> put "student",'1','info:name','tom'
hbase(main):017:0> put "student",'2','info:name','lily'
hbase(main):018:0> put "student",'3','info:name','wwn'
hbase(main):019:0> scan "student"
ROW                           COLUMN+CELL                                                                          
 1                            column=info:name, timestamp=1406189948888, value=tom                                 
 2                            column=info:name, timestamp=1406190005724, value=lily                                
 3                            column=info:name, timestamp=1406190016967, value=wwn                                 
3 row(s) in 0.0420 seconds


Hive access to an existing HBase table
Use CREATE EXTERNAL TABLE:
CREATE EXTERNAL TABLE hbase_table_3(key int, value string)    
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'  
WITH SERDEPROPERTIES ("hbase.columns.mapping" = "info:name")   
TBLPROPERTIES("hbase.table.name" = "student"); 
hive> CREATE EXTERNAL TABLE hbase_table_3(key int, value string)    
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'  
    > WITH SERDEPROPERTIES ("hbase.columns.mapping" = "info:name")   
    > TBLPROPERTIES("hbase.table.name" = "student"); 
OK
Time taken: 1.21 seconds
hive> select * from hbase_table_3;
OK
1       tom
2       lily
3       wwn
Time taken: 0.107 seconds, Fetched: 3 row(s)
As shown above, Hive can now read the data that already exists in HBase.
Note: if the data in the mapped info:name column changes in HBase, Hive query results change accordingly;
    updates to column families that are not mapped in the Hive table definition do not show up in Hive query results.
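The note above can be illustrated with a hypothetical update (the row key and value here are illustrative, not from the original session):

```sql
-- In the HBase shell, change a mapped cell:
--   put 'student', '1', 'info:name', 'tommy'
-- Because hbase_table_3 is an external table reading live HBase data,
-- a subsequent Hive query reflects the change with no reload step:
SELECT * FROM hbase_table_3 WHERE key = 1;
```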
    
3. Multiple Columns and Column Families
1. Create the table


CREATE TABLE hbase_table_add1(key int, value1 string, value2 int, value3 int)    
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'  
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:col1,info:col2,city:nu")
TBLPROPERTIES("hbase.table.name" = "student_info");   
Run it in the Hive shell:
hive> CREATE TABLE hbase_table_add1(key int, value1 string, value2 int, value3 int)    
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'  
    > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:col1,info:col2,city:nu")
    > TBLPROPERTIES("hbase.table.name" = "student_info"); 
OK
Time taken: 2.957 seconds
hive> select * from hbase_table_2;                   
OK
Time taken: 1.16 seconds
hive> select * from hbase_table_3;
OK
1       tom
2       lily
3       wwn
4       marry
Time taken: 0.117 seconds, Fetched: 4 row(s)
hive> set hive.cli.print.header=true;                
hive> select * from hbase_table_3;   
OK
hbase_table_3.key       hbase_table_3.value
1       tom
2       lily
3       wwn
4       marry
Time taken: 1.132 seconds, Fetched: 4 row(s)
hive> desc hbase_table_3;
OK
col_name        data_type       comment
key                     int                     from deserializer   
value                   string                  from deserializer   
Time taken: 0.19 seconds, Fetched: 2 row(s)
hive> insert overwrite table hbase_table_add1 select key,value,key+1,value from hbase_table_3;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1406161997851_0003, Tracking URL = http://db96:8088/proxy/application_1406161997851_0003/
Kill Command = /usr/local/hadoop//bin/hadoop job  -kill job_1406161997851_0003
Hadoop job information for Stage-0: number of mappers: 1; number of reducers: 0
2014-07-25 08:42:46,068 Stage-0 map = 0%,  reduce = 0%
2014-07-25 08:42:56,218 Stage-0 map = 100%,  reduce = 0%, Cumulative CPU 2.77 sec
MapReduce Total cumulative CPU time: 2 seconds 770 msec
Ended Job = job_1406161997851_0003
MapReduce Jobs Launched: 
Job 0: Map: 1   Cumulative CPU: 2.77 sec   HDFS Read: 239 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 770 msec
OK
_col0   _col1   _col2   _col3
Time taken: 28.01 seconds
hive> select * from  hbase_table_add1;
OK
hbase_table_add1.key    hbase_table_add1.value1 hbase_table_add1.value2 hbase_table_add1.value3
1       tom     2       NULL
2       lily    3       NULL
3       wwn     4       NULL
4       marry   5       NULL
Time taken: 1.105 seconds, Fetched: 4 row(s)
value3 is NULL here because the first INSERT selected the string column value for it, and a string such as 'tom' cannot be cast to int; the next INSERT uses key+100 instead.
hive> insert overwrite table hbase_table_add1 select key,value,key+1,key+100 from hbase_table_3;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1406161997851_0004, Tracking URL = http://db96:8088/proxy/application_1406161997851_0004/
Kill Command = /usr/local/hadoop//bin/hadoop job  -kill job_1406161997851_0004
Hadoop job information for Stage-0: number of mappers: 1; number of reducers: 0
2014-07-25 08:45:15,164 Stage-0 map = 0%,  reduce = 0%
2014-07-25 08:45:25,609 Stage-0 map = 100%,  reduce = 0%, Cumulative CPU 2.69 sec
MapReduce Total cumulative CPU time: 2 seconds 690 msec
Ended Job = job_1406161997851_0004
MapReduce Jobs Launched: 
Job 0: Map: 1   Cumulative CPU: 2.69 sec   HDFS Read: 239 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 690 msec
OK
key     value   _c2     _c3
Time taken: 25.587 seconds
hive> select * from hbase_table_add1;
OK
hbase_table_add1.key    hbase_table_add1.value1 hbase_table_add1.value2 hbase_table_add1.value3
1       tom     2       101
2       lily    3       102
3       wwn     4       103
4       marry   5       104
Time taken: 1.122 seconds, Fetched: 4 row(s)


Log in to HBase and check:
hbase(main):001:0> list
TABLE                                                                                                              
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hbase-0.96.2-hadoop2/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop2/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
hivetest
student                                                                                                            
student_info                                                                                                       
test                                                                                                               
xyz                                                                                                                
5 row(s) in 2.4090 seconds


=> ["hivetest", "student", "student_info", "test", "xyz"]
hbase(main):002:0> scan "student_info"
ROW                           COLUMN+CELL                                                                          
 1                            column=city:nu, timestamp=1406249125147, value=101                                   
 1                            column=info:col1, timestamp=1406249125147, value=tom                                 
 1                            column=info:col2, timestamp=1406249125147, value=2                                   
 2                            column=city:nu, timestamp=1406249125147, value=102                                   
 2                            column=info:col1, timestamp=1406249125147, value=lily                                
 2                            column=info:col2, timestamp=1406249125147, value=3                                   
 3                            column=city:nu, timestamp=1406249125147, value=103                                   
 3                            column=info:col1, timestamp=1406249125147, value=wwn                                 
 3                            column=info:col2, timestamp=1406249125147, value=4                                   
 4                            column=city:nu, timestamp=1406249125147, value=104                                   
 4                            column=info:col1, timestamp=1406249125147, value=marry                               
 4                            column=info:col2, timestamp=1406249125147, value=5                                   
4 row(s) in 0.1110 seconds


hbase(main):003:0> 


Here three Hive columns (value1, value2, value3) map to two HBase column families (info and city):
two Hive columns (value1 and value2) map to the info family (as HBase columns col1 and col2),
and the remaining column (value3) maps to column nu in the city family.
This shows how many Hive columns can be stored in a small, fixed set of HBase column families.
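To make the positional mapping explicit, here is a recap plus a hedged sample query (assuming the data loaded above):

```sql
-- Each Hive column pairs positionally with an entry in
-- "hbase.columns.mapping" = ":key,info:col1,info:col2,city:nu":
--   key    -> :key       (the HBase row key)
--   value1 -> info:col1
--   value2 -> info:col2
--   value3 -> city:nu
-- Columns from different families can be mixed freely in one query:
SELECT key, value1, value3 FROM hbase_table_add1;
```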