HadoopDB集群配置方法

来源：互联网发布：java类库大全有哪些编辑：程序博客网时间：2024/04/28 13:23

Author:谢

本文介绍在2个节点上安装配置HadoopDB方法。各节点的名称分别为Cluster01(192.168.0.1)，Cluster02(192.168.0.2)。

以下若是用 hadoop@Cluster0X:~ 代表 Cluster01 ~ Cluster02 都要做的.

参考资料1：HadoopDB Quick Start Guide

参考资料2：HadoopDB安装使用

1 首先在各节点上安装Hadoop-0.20.2，参考此文。

2 各节点安裝设置 PostgreSQL：

安裝并为数据库建立 hadoop 帐号，假定使用密碼为 1234
hadoop@Cluster0X:~$ sudo apt-get install postgresql
hadoop@Cluster0X:~$ sudo vim /etc/postgresql/8.4/main/pg_hba.conf
- #local   all         all                               ident
  local   all         all                               trust
  # IPv4 local connections:
  #host    all         all         127.0.0.1/32          md5
  host    all         all         127.0.0.1/32          password
  host    all         all         192.168.0.1/24          password            # 加上Cluster 機器 IP 範圍
  # IPv6 local connections:
  #host    all         all         ::1/128               md5
  host    all         all         ::1/128               password
hadoop@Cluster0X:~$ sudo /etc/init.d/postgresql-8.4 restart
hadoop@Cluster0X:~$ sudo su - postgres
postgres@Cluster0X:~$ createuser hadoop
- Shall the new role be a superuser? (y/n) y
  postgres@Cluster01:~$ psql
  psql (8.4.2)
  Type "help" for help.
  
  postgres=# alter user hadoop with password '1234';
  ALTER ROLE
  postgres=# /q
測試其他機器可否連線
- hadoop@Cluster01:~$ createdb testdb
- hadoop@Cluster02:~$ psql -h Cluster01 testdb
  - 如果能够连接上，则出现以下提示：
    - 口令:
      psql (8.4.2)
      SSL connection (cipher: DHE-RSA-AES256-SHA, bits: 256)
      Type "help" for help.
      
      testdb=#

3 设置HadoopDB

首先从http://sourceforge.net/projects/hadoopdb/files/下载hadoopdb，解压后，其中包含hadoopdb.jar。

然后再从http://jdbc.postgresql.org/download.html下载postgresql-8.4-701.jdbc4.jar。

hadoop@Cluster0X:~$ cp hadoopdb.jar HADOOP_HOME/lib/

hadoop@Cluster0X:~$ cp postgresql-8.4-701.jdbc4.jar HADOOP_HOME/lib/

hadoop@Cluster0X:~$ vim HADOOP_HOME/conf/core-site.xml

core-site.xml 文件中添加以下内容：
<property>
<name>hadoopdb.config.file</name>
<value>HadoopDB.xml</value>
<description>The name of the HadoopDB cluster configuration file</description>
</property>

<property>
<name>hadoopdb.fetch.size</name>
<value>1000</value>
<description>The number of records fetched from JDBC ResultSet at once</description>
</property>

<property>
<name>hadoopdb.config.replication</name>
<value>false</value>
<description>Tells HadoopDB Catalog whether replication is enabled.
Replica locations need to be specified in the catalog.
False causes replica information to be ignored.</description>
</property>

hadoop@Cluster01:~$ vim nodes.txt

将集群中的所有节点IP写入此文件：

192.168.0.1
192.168.0.2

hadoop@Cluster01:~$ vim Catalog.properties

#Properties for Catalog Generation
##################################
nodes_file=nodes.txt
# Relations Name and Table Name are the same
relations_unchunked=raw
relations_chunked=poi
catalog_file=HadoopDB.xml
##
#DB Connection Parameters
##
port=5432
username=hadoop
password=1234
driver=org.postgresql.Driver
url_prefix=jdbc/:postgresql/://
##
#Chunking properties
##
# the number of databases on a node
chunks_per_node=2
# for udb0 ,udb1 ( 2 nodes = 0 ~ 1 )
unchunked_db_prefix=udb
# for cdb0 ,cdb1, ... , cdb5 (2 nodes x 3 chunks = 0~5 )
chunked_db_prefix=cdb
##
#Replication Properties
##
dump_script_prefix=/root/dump_
replication_script_prefix=/root/load_replica_
dump_file_u_prefix=/mnt/dump_udb
dump_file_c_prefix=/mnt/dump_cdb
##
#Cluster Connection
##
ssh_key=id_rsa-gsg-keypair

hadoop@Cluster01:~$ java -cp lib/hadoopdb.jar edu.yale.cs.hadoopdb.catalog.SimpleCatalogGenerator Catalog.properties

产生的 HadoopDB.xml 类似下面：
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<DBClusterConfiguration xmlns="http://edu.yale.cs.db.hadoop/DBConfigurationSchema">
    <Nodes Password="1234" Username="hadoop" Driver="org.postgresql.Driver" Location=" 192.168.0.1 ">
        <Relations id="raw">
            <Partitions url="jdbc:postgresql:// 192.168.0.1 :5432/udb0" id="0"/>
        </Relations>
        <Relations id="poi">
            <Partitions url="jdbc:postgresql:// 192.168.0.1 :5432/cdb0" id="0"/>
            <Partitions url="jdbc:postgresql:// 192.168.0.1 :5432/cdb1" id="1"/>
        </Relations>
    </Nodes>
    <Nodes Password="1234" Username="hadoop" Driver="org.postgresql.Driver" Location=" 192.168.0.2 ">
        <Relations id="raw">
            <Partitions url="jdbc:postgresql:// 192.168.0.2 :5432/udb1" id="1"/>
        </Relations>
        <Relations id="poi">
            <Partitions url="jdbc:postgresql:// 192.168.0.2 :5432/cdb2" id="2"/>
            <Partitions url="jdbc:postgresql:// 192.168.0.2 :5432/cdb3" id="3"/>
        </Relations>
    </Nodes>
</DBClusterConfiguration>

将HadoopDB.xml放入HDFS中：

hadoop@Cluster01:~$ hadoop dfs -put HadoopDB.xml HadoopDB.xml

4 数据加载

在此以 raw 这个 talbe 当作例子进行讲解。根据前面HadoopDB.xml的内容，集群中的两个节点，第一个节点的数据库名称为udb0 ，第二个节点为udb1，首先分别创建这两个数据库：

hadoop@Cluster01:~$ createdb udb0
hadoop@Cluster02:~$ createdb udb1

在对应数据库上建立raw 这个table：

hadoop@Cluster01:~$ psql udb0
udb0=#
CREATE TABLE raw (
ID int,
NAME varchar(300)
);
同理如 Cluster02

先将本地文件系统中的数据put到HDFS中：

$ hadoop fs -put localfile input_in_hdfs/

数据切割：使用 HadoopDB 所提供的切割工具将数据切割成n份(一般，n是集群有节点数目)，以下例子中：n=2，每一份数据的分割符是'/n':

$ hadoop jar lib/hadoopdb.jar edu.yale.cs.hadoopdb.dataloader.GlobalHasher input_in_hdfs out_in_hdfs 2 '/n' 0

在各个节点上，将切割后的数据，从hdfs中传到各个节点的本地文件系统中:

$ hadoop fs -get out_in_hdfs/part-00000 /home/hadoop/p0

将本地文件载入到数据库中：

hadoop@Cluster01:~$ psql udb0
udb0=# COPY RAW FROM '/home/hadoop/p0' WITH DELIMITER E'/t' ;

同理如 Cluster02

5 SMS安装配置

第3步下载的hadoopdb解压后包含SMS_dist.tar.gz文件夹。

SMS的安装方法很简单：首先将其解压，然后打开bin目录下的hive-config.sh文件，再最后加上一句：

export HADOOP_HOME="此处写hadoop的安装路径"。

然后在hive上建立上面raw这个table对应的数据库：

假設 Hive 使用的数据將儲存在 HDFS 的 /db

hadoop@Cluster01:~ $ hadoop dfs -mkdir /db

hadoop@Cluster01:~ $ SMS_dist/bin/hive
CREATE EXTERNAL TABLE raw (
ID int,
NAME string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS
INPUTFORMAT 'edu.yale.cs.hadoopdb.sms.connector.SMSInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/db/raw';

/db/raw中的raw是指table的名称

5 以上设定完后，即可在一台机器上(Ex: Cluster1) 執行 $ SMS_dist/bin/hive 看看成果

hadoop@Cluster01:~ $ SMS_dist/bin/hive
hive> show tables;
hive> select name from raw;