Data Integration Between Hive and HBase

Source: Internet | Editor: 程序博客网 | Time: 2024/05/20 17:23

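I have recently been studying two very important tools in the Hadoop family: Hive and HBase. Hive provides SQL-like querying over Hadoop data, and HBase serves as Hadoop's database. Both are built on top of Hadoop's HDFS.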

A fairly typical Hadoop data-processing flow looks like this:

[Figure: typical Hadoop data-processing flow]

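As the flow shows, the step in which data is stored in HBase and then statistically analyzed by Hive involves integrating Hive with HBase. It is therefore worth understanding how data is shared between these two tools.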

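1. Hive

Hive is a data warehouse infrastructure built on top of Hadoop, a batch-processing system designed to reduce the work of writing MapReduce programs by hand. Hive itself neither stores nor computes data; it relies entirely on HDFS and MapReduce. Hive can be thought of as a client tool that translates our SQL operations into the corresponding MapReduce jobs and then runs them on Hadoop.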

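2. HBase

HBase is short for Hadoop Database: it is Hadoop's database, a distributed storage system. HBase uses Hadoop's HDFS as its file storage system, uses Hadoop's MapReduce to process the massive data held in HBase, and uses ZooKeeper as its coordination service.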

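The two tools are fundamentally different. What is meant by integrating Hive and HBase is the following:

The drawback of HBase is its unusual syntax: it has no SQL-like query interface, so operating on and computing over the data is very inconvenient in real business scenarios. Hive is different. It supports a SQL-like query language (HiveQL), so we would like to use Hive as a client tool to operate on and query the data in HBase and to carry out the corresponding data mining.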

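We therefore want to integrate Hive and HBase in the following way:

[Figure: integration of Hive and HBase]

The rest of this article describes in detail how the two tools map data to each other. It focuses on mapping an HBase table to a Hive table; the key point is how the HBase table and the Hive table are mapped at the column level.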

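Suppose we have an HBase table named users:

[Figure: the HBase table users]

For this HBase table, we want to create a corresponding table hive_users in Hive:

[Figure: the Hive table hive_users]

In Hive, a statement of the following form sets up the mapping: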

hive> create external table hivetable(rowkey string, column1 string, column2 string, column3 string)
    > stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > with serdeproperties("hbase.columns.mapping" = ":key,columnfamily1:column1,columnfamily1:column2,columnfamily2:column3")
    > tblproperties("hbase.table.name"="hbasetable");

What the syntax means:
The statement above creates an external table named hivetable in Hive and maps it to the HBase table named hbasetable. The mapping is as follows:
hivetable <-> hbasetable
rowkey <-> :key (the rowkey column in Hive maps to the row key in HBase)
column1 <-> columnfamily1:column1 (column1 in hivetable maps to column1 in column family columnfamily1 of hbasetable)
column2 <-> columnfamily1:column2 (column2 in hivetable maps to column2 in column family columnfamily1 of hbasetable)
column3 <-> columnfamily2:column3 (column3 in hivetable maps to column3 in column family columnfamily2 of hbasetable)
The clause stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' means the following: the integration (communication) between Hive and HBase is implemented mainly by the hive-hbase-handler jar, and HBaseStorageHandler in the statement is the handler class provided by that jar.
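The positional pairing described above can be sketched in a few lines of Python. This is only an illustration of how the entries in hbase.columns.mapping line up one-to-one with the Hive column list; the helper name pair_columns is hypothetical and is not part of Hive or HBase.

```python
def pair_columns(hive_columns, mapping):
    """Pair each Hive column with its HBase target by position.

    Hypothetical helper for illustration only (not a Hive API).
    Returns (hive_column, column_family, qualifier) tuples; the row key
    entry ":key" is returned with family and qualifier set to None.
    """
    targets = mapping.split(",")
    if len(targets) != len(hive_columns):
        raise ValueError("the mapping needs exactly one entry per Hive column")
    pairs = []
    for hive_col, target in zip(hive_columns, targets):
        if target == ":key":
            pairs.append((hive_col, None, None))  # maps to the HBase row key
        else:
            family, _, qualifier = target.partition(":")
            pairs.append((hive_col, family, qualifier))
    return pairs


# Column list and mapping string taken from the create statement above.
hive_columns = ["rowkey", "column1", "column2", "column3"]
mapping = ":key,columnfamily1:column1,columnfamily1:column2,columnfamily2:column3"

for hive_col, family, qualifier in pair_columns(hive_columns, mapping):
    target = "row key" if family is None else f"{family}:{qualifier}"
    print(f"{hive_col} <-> {target}")
```

The order of the entries matters: the first entry in the mapping string belongs to the first Hive column, the second to the second, and so on.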

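Next, a concrete example shows how to bind a table in HBase through Hive.
First we create a table named customer in HBase. Its data model and contents are as follows:

[Figure: data model and contents of the customer table]

After creating the table, we check the contents of customer to confirm that it was created successfully: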

hbase(main):001:0> scan 'customer'
ROW                            COLUMN+CELL
 xiaoming                      column=address:city, timestamp=1465142056815, value=hangzhou
 xiaoming                      column=address:country, timestamp=1465142078267, value=china
 xiaoming                      column=address:province, timestamp=1465142041784, value=zhejiang
 xiaoming                      column=info:age, timestamp=1465142102017, value=24
 xiaoming                      column=info:company, timestamp=1465142114558, value=baidu
 zhangyifei                    column=address:city, timestamp=1465142154995, value=shenzhen
 zhangyifei                    column=address:country, timestamp=1465142167587, value=china
 zhangyifei                    column=address:province, timestamp=1465142138872, value=guangdong
 zhangyifei                    column=info:age, timestamp=1465142183538, value=28
 zhangyifei                    column=info:company, timestamp=1465142200569, value=alibaba
2 row(s) in 0.7090 seconds

Then, following the syntax above, we create the corresponding table hive_customer in Hive:

hive> create external table hive_customer(rowkey string, city string, country string, province string, age string, company string)
    > stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > with serdeproperties("hbase.columns.mapping" = ":key,address:city,address:country,address:province,info:age,info:company")
    > tblproperties("hbase.table.name"="customer");

This statement creates the corresponding table hive_customer in Hive. Now let's look at the table structure:

hive> describe hive_customer;
OK
rowkey  string  from deserializer
city    string  from deserializer
country string  from deserializer
province        string  from deserializer
age     string  from deserializer
company string  from deserializer
Time taken: 0.068 seconds

[Figure: structure of the hive_customer table]

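As the table structure shows, integrating a Hive table with an HBase table amounts to nothing more than establishing a mapping.
Next, we query the full contents of the hive_customer table in Hive.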
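To make the expected result concrete, the following Python sketch pivots the cells from the scan 'customer' output into one record per row key, using the column mapping from the create statement for hive_customer. It only simulates the mapping in memory and does not connect to Hive or HBase; the function name to_hive_rows is hypothetical.

```python
# Cells copied from the scan output: (row key, "family:qualifier", value).
cells = [
    ("xiaoming", "address:city", "hangzhou"),
    ("xiaoming", "address:country", "china"),
    ("xiaoming", "address:province", "zhejiang"),
    ("xiaoming", "info:age", "24"),
    ("xiaoming", "info:company", "baidu"),
    ("zhangyifei", "address:city", "shenzhen"),
    ("zhangyifei", "address:country", "china"),
    ("zhangyifei", "address:province", "guangdong"),
    ("zhangyifei", "info:age", "28"),
    ("zhangyifei", "info:company", "alibaba"),
]

# Column list and mapping string from the hive_customer create statement.
hive_columns = ["rowkey", "city", "country", "province", "age", "company"]
mapping = ":key,address:city,address:country,address:province,info:age,info:company"


def to_hive_rows(cells, hive_columns, mapping):
    """Pivot cell-oriented HBase data into row-oriented records (illustration only)."""
    targets = mapping.split(",")
    by_row = {}
    for row_key, column, value in cells:
        by_row.setdefault(row_key, {})[column] = value
    rows = []
    for row_key in sorted(by_row):  # rows ordered by row key, as in the scan
        values = by_row[row_key]
        row = {}
        for hive_col, target in zip(hive_columns, targets):
            # ":key" maps to the row key; a missing cell becomes None (NULL in Hive).
            row[hive_col] = row_key if target == ":key" else values.get(target)
        rows.append(row)
    return rows


for row in to_hive_rows(cells, hive_columns, mapping):
    print("\t".join(str(row[col]) for col in hive_columns))
```

Each HBase row key becomes one Hive row, and each mapped family:qualifier cell becomes one column value in that row.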

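Clearly, the contents of the Hive table are exactly what we expected. Because the SQL statement above reads the whole table, it did not launch a MapReduce job. Next, we run a SQL statement that does go through MapReduce:
Query the information for xiaoming in the hive_customer table.
The statement: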

hive> select * from hive_customer
    > where rowkey="xiaoming";

The output shows that the query is executed through MapReduce. With that, a simple mapping from an HBase table to a Hive table is complete.
