HadoopDB Installation and Usage

Source: Internet | Editor: 程序博客网 | Time: 2024/06/05 10:01

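HadoopDB was developed against Hadoop 0.19.x, so I set it up on Hadoop 0.19.2. For the cluster setup itself, see "[Linux] Installing Hadoop 0.20.1 Multi-Node Cluster @ Ubuntu 9.10". Installing 0.19.2 differs only slightly from 0.20.1: the contents that article puts in hadoop-0.20.1/conf/core-site.xml and hadoop-0.20.1/conf/mapred-site.xml simply go into hadoop-0.19.2/conf/hadoop-site.xml instead. The rest of this walkthrough continues from that tutorial and uses a 3-node cluster whose machines are named Cluster01, Cluster02, and Cluster03, each operated under the hadoop account.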

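First, set up a 3-node cluster on Hadoop 0.19.x
- HadoopDB Quick Start Guide
- JDBC Driver: for Java 1.6, use postgresql-8.4-701.jdbc4.jar
- Another good installation guide: HadoopDB 實做 (HadoopDB in practice)
In the steps below, a prompt of hadoop@Cluster0X:~ means the command must be run on every machine, Cluster01 through Cluster03
Install and configure PostgreSQL on each machine
- Install PostgreSQL and create a hadoop account for the database; the password is assumed to be 1234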
- hadoop@Cluster0X:~$ sudo apt-get install postgresql
- hadoop@Cluster0X:~$ sudo vim /etc/postgresql/8.4/main/pg_hba.conf
  - #local   all         all                               ident
    local   all         all                               trust
    # IPv4 local connections:
    #host    all         all         127.0.0.1/32          md5
    host    all         all         127.0.0.1/32          password
    host    all         all         192.168.0.1/16          password            # add the IP range of the cluster machines
    # IPv6 local connections:
    #host    all         all         ::1/128               md5
    host    all         all         ::1/128               password
- hadoop@Cluster0X:~$ sudo /etc/init.d/postgresql-8.4 restart
- hadoop@Cluster0X:~$ sudo su - postgres
- postgres@Cluster0X:~$ createuser hadoop
  - Shall the new role be a superuser? (y/n) y
    postgres@Cluster01:~$ psql
    psql (8.4.2)
    Type "help" for help.
    
    postgres=# alter user hadoop with password '1234';
    ALTER ROLE
    postgres=# \q
- Test whether the other machines can connect
  - hadoop@Cluster01:~$ createdb testdb
  - hadoop@Cluster02:~$ psql -h Cluster01 testdb
    - Error message
      - psql: FATAL: no pg_hba.conf entry for host "192.168.56.168", user "hadoop", database "testdb", SSL on
        FATAL: no pg_hba.conf entry for host "192.168.56.168", user "hadoop", database "testdb", SSL off
    - Expected output on success
      - Password:
        psql (8.4.2)
        SSL connection (cipher: DHE-RSA-AES256-SHA, bits: 256)
        Type "help" for help.
        
        testdb=#
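The "no pg_hba.conf entry" error above means that no host line in pg_hba.conf covers the connecting client. As a minimal sketch (not part of PostgreSQL or HadoopDB; the names HOST_RULES and first_matching_rule are illustrative assumptions), the following Python snippet checks which of the address ranges from the pg_hba.conf shown earlier contains a given client IP. PostgreSQL also matches on connection type, database, and user, which this sketch ignores.

```python
import ipaddress

# Address ranges of the host lines in the pg_hba.conf above (illustrative copy).
HOST_RULES = ["127.0.0.1/32", "192.168.0.1/16", "::1/128"]


def first_matching_rule(client_ip, rules=HOST_RULES):
    """Return the first CIDR range that contains client_ip, or None."""
    address = ipaddress.ip_address(client_ip)
    for cidr in rules:
        # strict=False accepts ranges written with host bits set, e.g. 192.168.0.1/16.
        network = ipaddress.ip_network(cidr, strict=False)
        if address in network:
            return cidr
    return None


print(first_matching_rule("192.168.56.168"))  # 192.168.0.1/16
print(first_matching_rule("10.0.0.5"))        # None
```

If the range is correct but the connection is still refused before any password prompt, the listen_addresses setting in postgresql.conf is worth checking, since it defaults to localhost.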
Configure HadoopDB
- hadoop@Cluster0X:~$ cp hadoopdb.jar $HADOOP_HOME/lib/
- hadoop@Cluster0X:~$ cp postgresql-8.4-701.jdbc4.jar $HADOOP_HOME/lib/
- hadoop@Cluster0X:~$ vim $HADOOP_HOME/conf/hadoop-site.xml
  - <property>
    <name>hadoopdb.config.file</name>
    <value>HadoopDB.xml</value>
    <description>The name of the HadoopDB cluster configuration file</description>
    </property>
    
    <property>
    <name>hadoopdb.fetch.size</name>
    <value>1000</value>
    <description>The number of records fetched from JDBC ResultSet at once</description>
    </property>
    
    <property>
    <name>hadoopdb.config.replication</name>
    <value>false</value>
    <description>Tells HadoopDB Catalog whether replication is enabled.
    Replica locations need to be specified in the catalog.
    False causes replica information to be ignored.</description>
    </property>
- hadoop@Cluster01:~$ vim nodes.txt
  - 192.168.56.168
    192.168.56.169
    192.168.56.170
- hadoop@Cluster01:~$ vim Catalog.properties
  - #Properties for Catalog Generation
    ##################################
    nodes_file=nodes.txt
    # Relations Name and Table Name are the same
    relations_unchunked=raw
    relations_chunked=poi
    catalog_file=HadoopDB.xml
    ##
    #DB Connection Parameters
    ##
    port=5432
    username=hadoop
    password=1234
    driver=org.postgresql.Driver
    url_prefix=jdbc\:postgresql\://
    ##
    #Chunking properties
    ##
    # the number of databases on a node
    chunks_per_node=3
    # for udb0 ,udb1, udb2 ( 3 nodes = 0 ~ 2 )
    unchunked_db_prefix=udb
    # for cdb0 ,cdb1, ... , cdb8 ( 3 nodes x 3 chunks = 0~8 )
    chunked_db_prefix=cdb
    ##
    #Replication Properties
    ##
    dump_script_prefix=/root/dump_
    replication_script_prefix=/root/load_replica_
    dump_file_u_prefix=/mnt/dump_udb
    dump_file_c_prefix=/mnt/dump_cdb
    ##
    #Cluster Connection
    ##
    ssh_key=id_rsa-gsg-keypair
- hadoop@Cluster01:~$ java -cp lib/hadoopdb.jar edu.yale.cs.hadoopdb.catalog.SimpleCatalogGenerator Catalog.properties
  - The generated HadoopDB.xml looks similar to this:
    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <DBClusterConfiguration xmlns="http://edu.yale.cs.db.hadoop/DBConfigurationSchema">
        <Nodes Password="1234" Username="hadoop" Driver="org.postgresql.Driver" Location="192.168.56.168">
            <Relations id="raw">
                <Partitions url="jdbc:postgresql://192.168.56.168:5432/udb0" id="0"/>
            </Relations>
            <Relations id="poi">
                <Partitions url="jdbc:postgresql://192.168.56.168:5432/cdb0" id="0"/>
                <Partitions url="jdbc:postgresql://192.168.56.168:5432/cdb1" id="1"/>
                <Partitions url="jdbc:postgresql://192.168.56.168:5432/cdb2" id="2"/>
            </Relations>
        </Nodes>
        <Nodes Password="1234" Username="hadoop" Driver="org.postgresql.Driver" Location="192.168.56.169">
            <Relations id="raw">
                <Partitions url="jdbc:postgresql://192.168.56.169:5432/udb1" id="1"/>
            </Relations>
            <Relations id="poi">
                <Partitions url="jdbc:postgresql://192.168.56.169:5432/cdb3" id="3"/>
                <Partitions url="jdbc:postgresql://192.168.56.169:5432/cdb4" id="4"/>
                <Partitions url="jdbc:postgresql://192.168.56.169:5432/cdb5" id="5"/>
            </Relations>
        </Nodes>
        <Nodes Password="1234" Username="hadoop" Driver="org.postgresql.Driver" Location="192.168.56.170">
            <Relations id="raw">
                <Partitions url="jdbc:postgresql://192.168.56.170:5432/udb2" id="2"/>
            </Relations>
            <Relations id="poi">
                <Partitions url="jdbc:postgresql://192.168.56.170:5432/cdb6" id="6"/>
                <Partitions url="jdbc:postgresql://192.168.56.170:5432/cdb7" id="7"/>
                <Partitions url="jdbc:postgresql://192.168.56.170:5432/cdb8" id="8"/>
            </Relations>
        </Nodes>
    </DBClusterConfiguration>
- hadoop@Cluster01:~$ hadoop dfs -put HadoopDB.xml HadoopDB.xml
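The way the sample HadoopDB.xml assigns databases to nodes can be sketched in a few lines of Python. This is inferred only from the sample output above (one unchunked database per node, chunks_per_node chunked databases per node, numbered consecutively across nodes), not from the SimpleCatalogGenerator source; the function name catalog_layout and its parameters are hypothetical.

```python
def catalog_layout(nodes, chunks_per_node, unchunked_prefix="udb",
                   chunked_prefix="cdb", port=5432):
    """Map each node to the JDBC URLs of its unchunked and chunked databases."""
    layout = {}
    for node_index, host in enumerate(nodes):
        base = f"jdbc:postgresql://{host}:{port}/"
        # Chunk numbers continue from where the previous node stopped.
        first_chunk = node_index * chunks_per_node
        layout[host] = {
            "unchunked": base + f"{unchunked_prefix}{node_index}",
            "chunked": [base + f"{chunked_prefix}{first_chunk + offset}"
                        for offset in range(chunks_per_node)],
        }
    return layout


nodes = ["192.168.56.168", "192.168.56.169", "192.168.56.170"]
for host, dbs in catalog_layout(nodes, chunks_per_node=3).items():
    print(host, dbs["unchunked"], *dbs["chunked"])
```

With the three addresses from nodes.txt and chunks_per_node=3, this reproduces the udb0 to udb2 and cdb0 to cdb8 assignment shown in the sample catalog, which is a quick way to see which databases must exist on which machine.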
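Create the tables, import the test data into each machine's database, and create the corresponding table in Hive
- Take the raw table as an example. HadoopDB.xml describes three partitions for raw (udb0, udb1, and udb2 in the example above), so each database must be created on the machine the catalog assigns it to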
  - hadoop@Cluster01:~$ createdb udb0
    hadoop@Cluster02:~$ createdb udb1
    hadoop@Cluster03:~$ createdb udb2
- Then create a table that matches the input data
  - hadoop@Cluster01:~$ psql udb0
    udb0=#
    CREATE TABLE raw (
    ID int,
    NAME varchar(300)
    );
  - Do the same on Cluster02 and Cluster03
- Import the data
  - hadoop@Cluster01:~$ psql udb0
    udb0=# COPY RAW FROM '/home/hadoop/p0' WITH DELIMITER E'\t' ;
  - The data in /home/hadoop/p0 comes from splitting the original large file with the partitioning tool that HadoopDB provides
    - $ hadoop jar lib/hadoopdb.jar edu.yale.cs.hadoopdb.dataloader.GlobalHasher src_in_hdfs out_in_hdfs 3 '\n' 0
    - $ hadoop fs -get out_in_hdfs/part-00000 /home/hadoop/p0
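  - This assumes the data is stored in /home/hadoop/p0 with tab-separated fields
  - Repeat the same steps on Cluster02 and Cluster03
- Finally, create the corresponding table in Hive (run this on one machine only)
  - Assume the tables used by Hive are stored under /db in HDFS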
  - hadoop@Cluster01:~ $ hadoop dfs -mkdir /db
  - hadoop@Cluster01:~ $ SMS_dist/bin/hive
    CREATE EXTERNAL TABLE raw (
    ID int,
    NAME string
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '|'
    STORED AS
    INPUTFORMAT 'edu.yale.cs.hadoopdb.sms.connector.SMSInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION '/db/raw';
  - The basename of /db/raw must match the table name (both the table in each node's database and the table created in Hive). Also remember to convert the column types accordingly
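The GlobalHasher splitting step used for the data import can be pictured as plain hash partitioning: each row goes to one of N output files based on a hash of one field. The sketch below is only a conceptual illustration under assumptions. It reads the arguments 3, delimiter, and 0 as partition count, field delimiter, and key field index, and it uses crc32 as a stand-in hash; neither is confirmed from the HadoopDB source. The name hash_partition is hypothetical.

```python
import zlib


def hash_partition(lines, num_partitions, delimiter="\t", key_index=0):
    """Assign each line to a partition by hashing one of its fields."""
    partitions = [[] for _ in range(num_partitions)]
    for line in lines:
        key = line.split(delimiter)[key_index]
        # crc32 is a stand-in hash; GlobalHasher's actual hash function may differ.
        bucket = zlib.crc32(key.encode("utf-8")) % num_partitions
        partitions[bucket].append(line)
    return partitions


rows = ["1\talice", "2\tbob", "3\tcarol", "4\tdave", "1\talice-again"]
for index, part in enumerate(hash_partition(rows, num_partitions=3)):
    print(f"p{index}: {part}")
```

The useful property is that rows sharing a key always land in the same partition, so each node's database holds a disjoint slice of the data.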
With the setup above complete, run $ SMS_dist/bin/hive on one machine (e.g., Cluster01) to see the result
- hadoop@Cluster01:~ $ SMS_dist/bin/hive
  hive> show tables;
  hive> select name from raw;
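The column-type conversion between the PostgreSQL table and the Hive table (varchar(300) in PostgreSQL versus string in Hive for the raw table) can be kept consistent with a small lookup. This is a minimal sketch, assuming an illustrative and incomplete mapping; PG_TO_HIVE and hive_columns are hypothetical names, not part of HadoopDB or Hive.

```python
import re

# Illustrative PostgreSQL -> Hive type mapping; extend it for the types you use.
PG_TO_HIVE = {
    "smallint": "smallint",
    "int": "int",
    "integer": "int",
    "bigint": "bigint",
    "real": "float",
    "double precision": "double",
    "boolean": "boolean",
    "varchar": "string",
    "char": "string",
    "text": "string",
}


def hive_columns(pg_columns):
    """Translate (name, PostgreSQL type) pairs into Hive column definitions."""
    columns = []
    for name, pg_type in pg_columns:
        # Drop a length or precision suffix such as "(300)" before the lookup.
        base_type = re.sub(r"\s*\(.*\)", "", pg_type).strip().lower()
        columns.append(f"{name} {PG_TO_HIVE[base_type]}")
    return columns


print(",\n".join(hive_columns([("ID", "int"), ("NAME", "varchar(300)")])))
```

For the raw table this yields the same two column definitions used in the CREATE EXTERNAL TABLE statement. An unmapped type raises a KeyError, which is deliberate: it forces an explicit decision rather than a silent guess.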