Hive映射Hbase表

来源:互联网 发布:vsftpd 源码编译安装 编辑:程序博客网 时间:2024/06/04 20:00

Hive是基于Hadoop的一个数据仓库工具,可以将结构化的数据文件映射为一张数据库表,并提供完整的sql查询功能,可以将sql语句转换为MapReduce任务进行运行。其优点是学习成本低,可以通过类SQL语句快速实现简单的MapReduce统计,不必开发专门的MapReduce应用,十分适合数据仓库的统计分析。
hive
Hbase的整合功能的实现是利用两者本身对外的API接口互相进行通信,相互通信主要是依靠hive_hbase-handler.jar工具类,大致意思如图所示:

Hadoop框架安装完后,安装与配置hbasehive
Hbase
的安装路径为/usr/local/hbase
Hive
的安装路径为/usr/local/hive
整合hivehbase的过程如下:
1
将文件

/usr/lib/hbase/lib/hbase-client-1.0.0-cdh5.4.3.jarln-s /usr/lib/hbase/lib/hbase-common-1.0.0-cdh5.4.3.jar

/usr/lib/hbase/lib/hbase-hadoop2-compat-1.0.0-cdh5.4.3.jar

/usr/lib/hbase/lib/hbase-hadoop-compat-1.0.0-cdh5.4.3.jarln-s /usr/lib/hbase/lib/hbase-protocol-1.0.0-cdh5.4.3.jar

/usr/lib/hbase/lib/hbase-server-1.0.0-cdh5.4.3.jar

/usr/lib/hbase/lib/htrace-core-3.0.4.jar

/usr/lib/zookeeper/zookeeper.jar

拷贝到/opt/mapr/hive/hive-0.7.1/lib文件夹下面
注意:如果hive/lib下已经存在这两个文件的其他版本(例如zookeeper-3.3.1.jar),建议删除后使用hbase下的相关版本
2
修改hive/confhive-site.xml文件,在底部添加如下内容:
<property>

<name>hive.querylog.location</name>

<value>/usr/local/hive/hive-1.2.2/logs</value>

</property>            

<property>

<name>hive.aux.jars.path</name>

<value>file:///usr/local/hive/hive-1.2.2/lib/hbase-protocol-0.96.2-hadoop2.jar,file:///usr/local/hive/hive-1.2.2/lib/hive-hbase-handler-1.2.2.jar,file:///usr/local/hive/hive-1.2.2/lib/hbase-server-0.96.2-hadoop2.jar,file:///usr/local/hive/hive-1.2.2/lib/hbase-common-0.96.2-hadoop2.jar,file:///usr/local/hive/hive-1.2.2/lib/zookeeper-3.4.5.jar,file:///usr/local/hive/hive-1.2.2/lib/hbase-client-0.96.2-hadoop2.jar,file:///usr/local/hive/hive-1.2.2/lib/htrace-core-2.04.jar</value>


注意,配置 文件中的jar包在Hive/lib都要有,没有的从Hbase/lib中 通过scp拷贝到Hive/lib下。 


注意:如果hive-site.xml不存在则自行创建,或者把hive-default.xml.template文件改名后使用。
3.
拷贝hbase-common-0.96.2-hadoop2.jar到所有hadoop节点(包括master)hadoop/lib下。
4.
拷贝hbase/conf下的hbase-site.xml文件到所有hadoop节点(包括master)的${HADOOP_HOME}/etc/hadoop/下。

注意,如果3,4两步跳过的话,运行hive时很可能出现如下错误:
org.apache.hadoop.hbase.ZooKeeperConnectionException:HBase is able to connect to ZooKeeper but the connection closes immediately.

This could be a sign that the server has toomany connections (30 is the default). Consider inspecting your ZK server logsfor that error and
then make sure you are reusingHBaseConfiguration as often as you can. See HTable's javadoc for moreinformation. at org.apache.hadoop.
hbase.zookeeper.ZooKeeperWatcher.

5 启动hive
单节点启动
bin/hive -hiveconf hbase.master=master:60000
集群启动
bin/hive -hiveconf hbase.zookeeper.quorum=node1,node2,node3   (
所有的zookeeper节点)
如果hive-site.xml文件中没有配置hive.aux.jars.path,则可以按照如下方式启动。
hive --auxpath /opt/mapr/hive/hive-0.7.1/lib/hive-hbase-handler-0.7.1.jar,/opt/mapr/hive/hive-0.7.1/lib/hbase-0.90.4.jar,/opt/mapr/hive/hive-0.7.1/lib/zookeeper-3.3.2.jar-hiveconf hbase.master=localhost:60000

测试修改hive的配置文件hive-site.xml

<property>
  <name>hive.zookeeper.quorum</name>
  <value>node1,node2,node3</value>
  <description>The list of zookeeper servers to talk to. This isonly needed for read/write locks.</description>
</property>

不用增加参数启动hive就可以联合hbase

6 启动后进行测试
1创建hbase识别的表
CREATE TABLE hbase_table_1(key int, value string) STORED BY'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES("hbase.columns.mapping" = "
:key,cf1:val") TBLPROPERTIES ("hbase.table.name" ="xyz");
hbase.table.name
定义在hbasetable名称,多列时,data:1data:2;多列族时,data1:1,data2:1
hbase.columns.mapping 定义在hbase的列族,里面的:key是固定值而且要保证在表pokes中的foo字段是唯一值

创建有分区的表

CREATE TABLE hbase_table_1(key int, value string)  partitioned by(day string) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITHSERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")TBLPROPERTIES ("hbase.table.name" = "xyz");

 

不支持表的修改
会提示不能修改非本地表。
hive> ALTER TABLE hbase_table_1 ADD PARTITION (day = '2012-09-22');
FAILED: Error in metadata: Cannot use ALTER TABLE on a non-native table FAILED:Execution Error, return code 1 fromorg.apache.hadoop.hive.ql.exec.DDLTask 


2使用sql导入数据
新建hive的数据表
create table pokes(foo int,bar string)row format delimited fields terminated by',';
批量导入数据
load data local inpath '/home/1.txt' overwrite into table pokes;

1.txt文件的内容为 
1,hello 
2,pear 
3,world

使用sql导入hbase_table_1

SET hive.hbase.bulk=true;


insert overwrite table hbase_table_1 select *from pokes;

导入有分区的表

insert overwrite table hbase_table_1  partition(day='2012-01-01') select * from pokes;

(3) 查看数据 
hive> select * from hbase_table_1;
OK
1 hello
2 pear
3 world

(注:与hbase整合的有分区的表存在个问题  select * from table查询不到数据,selectkey,value from table可以查到数据)

4)登录Hbase去查看数据
hbase shell

hbase(main):002:0> describe 'xyz' 
DESCRIPTION ENABLED {NAME => 'xyz', FAMILIES => [{NAME => 'cf1',BLOOMFILTER => 'NONE', REPLICATION_S true 
COPE => '0', COMPRESSION => 'NONE', VERSIONS => '3', TTL =>'2147483647', BLOCKSI 
ZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]} 
1 row(s) in 0.0830 seconds
hbase(main):003:0> scan 'xyz'
ROW COLUMN+CELL 
1 column=cf1:val, timestamp=1331002501432, value=hello 
2 column=cf1:val, timestamp=1331002501432, value=pear 
3 column=cf1:val, timestamp=1331002501432, value=world

这时在Hbase中可以看到刚才在hive中插入的数据了。

7 对于在hbase已经存在的表,在hive中使用CREATE EXTERNAL TABLE来建立
例如hbase中的表名称为test1,字段为 a: , b: ,c:hive中建表语句为

create external table hive_test (key int,gid map<string,string>,sidmap<string,string>,uid map<string,string>) STORED BY'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES("hbase.columns.mapping" ="a:,b:,c:") TBLPROPERTIES  ("hbase.table.name" = "test1");
hive中建立好表后,查询hbasetest1表内容
Select * from hive_test;

OK
1 {"":"qqq"} {"":"aaa"}{"":"bbb"}
2 {"":"qqq"} {} {"":"bbb"}

查询gid字段中value值的方法为
select gid[''] from hbase2;
得到查询结果
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201203052222_0017, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201203052222_0017
Kill Command = /opt/mapr/hadoop/hadoop-0.20.2/bin/../bin/hadoop job-Dmapred.job.tracker=maprfs:/// -kill job_201203052222_0017
2012-03-06 14:38:29,141 Stage-1 map = 0%, reduce = 0%
2012-03-06 14:38:33,171 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201203052222_0017
OK
qqq
qqq

如果hbasetest1中的字段为user:gid,user:sid,info:uid,info:level,在hive中建表语句为
create external table hive_test(key int,user map<string,string>,infomap<string,string>) STORED BY'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES("hbase.columns.mapping" ="user:,info:") TBLPROPERTIES  (g"hbase.table.name" ="test1");

查询hbase表的方法为
select user['gid'] from hbase2;

 注:hive连接hbase优化,将HADOOP_HOME/conf中的hbase-site.xml文件中增加配置

 <property>
   <name>hbase.client.scanner.caching</name>
   <value>10000</value>
 </property>

或者在执行hive语句之前执行hive>set hbase.client.scanner.caching=10000;

 

原创粉丝点击