Integrating Hive with HBase, with a Test Walkthrough


Background reading:
Querying massive data on a Hadoop + Hive architecture: http://blog.csdn.net/kunshan_shenbin/article/details/7105319
HBase 0.90.5 + Hadoop 1.0.0 integration: http://blog.csdn.net/kunshan_shenbin/article/details/7209990


Step 1: start Hadoop: ./start-all.sh

Step 2: start Hive.

Copy hbase-0.94.12.jar and zookeeper-3.4.2.jar into hive/lib.
Note: if hive/lib already contains other versions of these jars (for example zookeeper-3.3.1.jar), delete them and use the versions that ship with your HBase.
Edit hive/conf/hive-site.xml and append the following at the bottom:

<!--
<property>
  <name>hive.exec.scratchdir</name>
  <value>/usr/local/hive/tmp</value>
</property>
-->
<property>
  <name>hive.querylog.location</name>
  <value>/usr/local/hadoop/hive/logs</value>
</property>
<property>
  <name>hive.aux.jars.path</name>
  <value>file:///usr/local/hadoop/hive/lib/hive-hbase-handler-0.9.0.jar,file:///usr/local/hadoop/hive/lib/hbase-0.94.12.jar,file:///usr/local/hadoop/hive/lib/zookeeper-3.4.2.jar</value>
</property>
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>nistest.node1,nistest.node2,nistest.node3,nistest.node4</value>
</property>

 

Note: if hive-site.xml does not exist, create it, or rename the hive-default.xml.template file and use that.

 

With hbase.zookeeper.quorum configured, no extra options are needed at startup; just run bin/hive. (show tables will list nothing at first: Hive and HBase keep separate catalogs, so HBase tables do not appear until they are mapped.)

If the quorum is not configured, it must be passed on the command line at startup:

bin/hive --hiveconf hbase.zookeeper.quorum=nistest.node1,nistest.node2,nistest.node3,nistest.node4

 

Step 3: start HBase: ./start-hbase.sh

Step 4: create the mapped table. The table we want to query already exists in HBase, so we use CREATE EXTERNAL TABLE.

First create the HBase table (in the HBase shell, started with hbase shell):

create 'hbase_table_1','cf1'

Then insert a row:

put 'hbase_table_1','1','cf1:val','fjsh1'

Then, in Hive, create the external table mapped to it:

CREATE EXTERNAL TABLE hbase_table_2(key string, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = "cf1:val")
TBLPROPERTIES("hbase.table.name" = "hbase_table_1");

1. hbase.columns.mapping maps each Hive column to an HBase column family:qualifier; for multiple columns in one family, use data:1,data:2; for multiple families, data1:1,data2:1.

2. hbase.table.name names the HBase table being mapped.

3. hbase_table_2(key string, value string) is the mapped (external) Hive table.

4. Let's look at the structure of the HBase table we want to query:

DESCRIPTION                                                            ENABLED
 'hbase_table_1', {NAME => 'cf1', DATA_BLOCK_ENCODING => 'NONE',       true
 BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3',
 COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2147483647',
 KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536',
 IN_MEMORY => 'false', ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true'}
1 row(s) in 0.1380 seconds

And the data in the table:

ROW                    COLUMN+CELL
 1                     column=cf1:val, timestamp=1397729052180, value=value1
 2                     column=cf1:val, timestamp=1397533269763, value=fjsh
 3                     column=cf1:val, timestamp=1397729085322, value=value3
3 row(s) in 0.1000 seconds
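The hbase.columns.mapping rules described above can be sketched in a few lines of Java: the value is a comma-separated list with one family:qualifier entry per Hive column. The mapping string below is illustrative, not from the walkthrough:

```java
// Sketch: split an hbase.columns.mapping value into (family, qualifier)
// pairs, one entry per Hive column. The mapping string is an example.
public class MappingDemo {
    public static void main(String[] args) {
        String mapping = "cf1:val,data1:1,data2:1";
        for (String entry : mapping.split(",")) {
            // Split on the first ':' only: family, then qualifier.
            String[] parts = entry.split(":", 2);
            System.out.println("family=" + parts[0] + ", qualifier=" + parts[1]);
        }
    }
}
```

A three-column Hive table would need exactly three such entries, matched to its columns in declaration order.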

 

(An explanatory diagram from the web appeared here in the original post.)


First, query hbase_table_2 from the Hive command line:

hive> select * from hbase_table_2;

OK

1       value1

2       fjsh

3       value3

Time taken: 1.045 seconds

This matches the data in HBase. Next, insert one more row on the HBase side:

put 'hbase_table_1','4','cf1:val','fjsh4'

Then check the result in Hive:

select * from hbase_table_2;

OK

1       value1

2       fjsh

3       value3

4       fjsh4

Time taken: 0.222 seconds

Again, correct.

Now query hbase_table_2 for the row whose value is fjsh4:

select * from hbase_table_2 where value='fjsh4';

Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201404151531_0011, Tracking URL = http://nistest.master:50030/jobdetails.jsp?jobid=job_201404151531_0011
Kill Command = /usr/local/hadoop/libexec/../bin/hadoop job -Dmapred.job.tracker=nistest.master:8021 -kill job_201404151531_0011
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-04-17 20:18:53,597 Stage-1 map = 0%,  reduce = 0%
2014-04-17 20:19:00,670 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.03 sec
2014-04-17 20:19:01,682 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.03 sec
2014-04-17 20:19:02,692 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.03 sec
2014-04-17 20:19:03,702 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.03 sec
2014-04-17 20:19:04,710 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 2.03 sec
MapReduce Total cumulative CPU time: 2 seconds 30 msec
Ended Job = job_201404151531_0011
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 2.03 sec   HDFS Read: 267 HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 30 msec

OK

4       fjsh4

Time taken: 51.066 seconds

 

The above were queries from the command line; our real goal is to query out useful data from Java code, which is also straightforward.

First, Hive must be started a little differently, as a server:

./hive --service hiveserver -p 10000

or, using the default port (10000):

./hive --service hiveserver

The metastore configuration can be checked in hive-site.xml. (Hive defaults to the embedded Derby database, but this particular setup points the metastore at MySQL:)

<property>
  <name>hive.metastore.local</name>
  <value>true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://slave1:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepass</value>
</property>

The database connection URL can then use the default: jdbc:hive://localhost:10000/default
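The default URL has the shape jdbc:hive://&lt;host&gt;:&lt;port&gt;/&lt;database&gt;; as a small sketch, it can be assembled from its parts (host and port here match the server startup above, but adjust for your own deployment):

```java
// Sketch: assemble the Hive JDBC URL from host, port, and database.
// The values are the defaults used in this walkthrough, not requirements.
public class HiveUrlDemo {
    public static void main(String[] args) {
        String host = "localhost"; // HiveServer host (example)
        int port = 10000;          // port passed via --service hiveserver -p
        String database = "default";
        String url = "jdbc:hive://" + host + ":" + port + "/" + database;
        System.out.println(url);   // jdbc:hive://localhost:10000/default
    }
}
```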

With that preparation done, we can read the data from Java, as follows:

 

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

import org.apache.log4j.Logger;

public class HiveTest {
    private static final String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";
    private static final String url = "jdbc:hive://192.168.2.161:10000/default";
    private static final String user = "sa";
    private static final String password = "sa";
    private static final Logger log = Logger.getLogger(HiveTest.class);
    private static Connection con;

    public static void main(String[] args) throws Exception {
        setUp();
        testSelect();
    }

    public static void testSelect() throws SQLException {
        Statement stmt = con.createStatement();
        ResultSet res = stmt.executeQuery("select * from hbase_table_2");
        while (res.next()) {
            System.out.println(res.getString(1) + "," + res.getString(2));
        }
    }

    protected static void setUp() throws Exception {
        // Load the Hive JDBC driver, then open the connection.
        Class.forName(driverName);
        con = DriverManager.getConnection(url, user, password);
    }
}

Output:

1,value1

2,fjsh

3,value3                                            

4,fjsh4

 
