HAWQ Study Notes --- How to access HDFS data via GPDB external table with gphdfs protocol
Environment
- Greenplum database 4.2.x, 4.3.x
- Pivotal HD 1.x, 2.x
Prerequisites
- Download the JDK package (1.7 is recommended) and install it on all servers of the GPDB cluster (a quick verification is sketched after this list)
- Download the Pivotal HD installation package (same version as the PHD cluster to be accessed) and put it on the GPDB master host
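Before installing anything else, it may help to confirm that the JDK is actually visible on every host. A minimal check with gpssh, assuming a host file (here called hostfile_all, a hypothetical name) that lists all GPDB hosts:
# Run on the GPDB master as gpadmin; prints the JDK version reported by each host
gpssh -f hostfile_all -e 'java -version'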
Install required PHD packages on GPDB cluster
Option 1
1. Unpack the PHD installation tarball. Here the PHD-2.0.1 release is taken as an example
[root@admin phd201]# tar xvfz PHD-2.0.1.0-148.tar.gz
... ...
Then locate the following rpm packages in the unpacked directory
utility/rpm/bigtop-jsvc-1.0.15_gphd_2_0_1_0-43.x86_64.rpm
utility/rpm/bigtop-utils-0.4.0_gphd_2_0_1_0-43.noarch.rpm
zookeeper/rpm/zookeeper-3.4.5_gphd_2_0_1_0-43.noarch.rpm
hadoop/rpm/hadoop-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm
hadoop/rpm/hadoop-yarn-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm
hadoop/rpm/hadoop-mapreduce-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm
hadoop/rpm/hadoop-hdfs-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm
2. Install the rpm packages (follow the order listed above to avoid dependency check errors). This needs to be done on all segment servers of the GPDB cluster
[root@admin phd201]# cd PHD-2.0.1.0-148
[root@admin PHD-2.0.1.0-148]#rpm -ivh \
utility/rpm/bigtop-jsvc-1.0.15_gphd_2_0_1_0-43.x86_64.rpm \
utility/rpm/bigtop-utils-0.4.0_gphd_2_0_1_0-43.noarch.rpm \
zookeeper/rpm/zookeeper-3.4.5_gphd_2_0_1_0-43.noarch.rpm \
hadoop/rpm/hadoop-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm \
hadoop/rpm/hadoop-yarn-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm \
hadoop/rpm/hadoop-mapreduce-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm \
hadoop/rpm/hadoop-hdfs-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm
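To confirm that the client packages landed on a host after step 2, a quick check (a minimal sketch):
# List the PHD-related packages now installed on this host
rpm -qa | grep -E 'gphd|bigtop|zookeeper|hadoop'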
3. Configure the Hadoop configuration files and ensure that HDFS works.
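At a minimum, the Hadoop client configuration on the GPDB hosts must point at the PHD NameNode. A minimal core-site.xml sketch, assuming the NameNode hdm2:8020 used later in this article and the default PHD configuration directory /etc/gphd/hadoop/conf (verify the path on your installation):
<!-- /etc/gphd/hadoop/conf/core-site.xml (assumed path) -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hdm2:8020</value>
  </property>
</configuration>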
Option 2
1. If there is an admin node (where Pivotal Command Center is running) available on the target PHD cluster, the required rpm packages can be found under /usr/lib/gphd/rpms on the admin node.
2. Install the rpm packages on all segment hosts of the GPDB cluster in either of the following ways.
a) Copy those rpm packages to each segment server and install them manually with the "rpm -ivh" command
b) Add a repo file (like gphd.repo) under /etc/yum.repos.d on every segment server with the content below. Note that "admin.hadoop.local" is the hostname of the admin node on your site and needs to be modified accordingly.
[gphd]
name=PHD Admin Node Repo
baseurl=http://admin.hadoop.local/gphd_yum_repo
enabled=1
gpgcheck=0
metadata_expire=0
Then run "yum install <rpm package name>" to complete the installation (a sketch follows this option).
3. Configure the Hadoop configuration files and ensure that HDFS works.
GPDB Configuration
1. Set the environment variable JAVA_HOME correctly for the gpadmin user on all segment servers, as illustrated below. It is better to set it in .bashrc or .bash_profile
[gpadmin@admin ~]$echo $JAVA_HOME
/usr/java/default
[gpadmin@admin ~]$ls -l /usr/java/default
lrwxrwxrwx 1 root root 16 Jul 18 2013 /usr/java/default -> /usr/java/latest
[gpadmin@admin ~]$ls -l /usr/java/latest
lrwxrwxrwx 1 root root 21 Dec 15 2013 /usr/java/latest -> /usr/java/jdk1.7.0_25
[gpadmin@admin ~]$cat .bash_profile | grep JAVA_HOME
export JAVA_HOME=/usr/java/default
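To confirm the variable is visible to gpadmin on every segment server, gpssh can be used (hostfile_segments is a hypothetical host file listing the segment hosts):
# Should print the same JAVA_HOME path on every segment server
gpssh -f hostfile_segments -e 'echo $JAVA_HOME'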
2. Set the parameters for GPDB
[gpadmin@admin ~]$ gpconfig -c gp_hadoop_home -v "'/usr/lib/gphd'"
[gpadmin@admin ~]$ gpconfig -c gp_hadoop_target_version -v "'gphd-2.0'"
Run "gpstop -u" to take the change into effect.
Test
1. Check that HDFS is accessible from any of the segment servers
[gpadmin@sdw1 ~]$hdfs dfs -ls hdfs://hdm2:8020/
Found 7 items
drwxr-xr-x - hdfs hadoop 0 2014-06-14 21:20 hdfs://hdm2:8020/apps
drwxr-xr-x - postgres gpadmin 0 2014-06-16 04:55 hdfs://hdm2:8020/hawq_data
drwxr-xr-x - hdfs hadoop 0 2014-06-14 21:21 hdfs://hdm2:8020/hive
drwxr-xr-x - mapred hadoop 0 2014-06-14 21:20 hdfs://hdm2:8020/mapred
drwxrwxrwx - hdfs hadoop 0 2014-07-10 22:29 hdfs://hdm2:8020/tmp
drwxrwxrwx - hdfs hadoop 0 2014-06-16 18:11 hdfs://hdm2:8020/user
drwxr-xr-x - hdfs hadoop 0 2014-06-14 21:21 hdfs://hdm2:8020/yarn
2. Create a temporary text file and put it into HDFS
[gpadmin@admin ~]$cat test1.txt
15,west
25,east
[gpadmin@admin ~]$hdfs dfs -put test1.txt hdfs://hdm2:8020/tmp/
[gpadmin@admin ~]$hdfs dfs -ls hdfs://hdm2:8020/tmp
Found 1 items
-rw-r--r-- 3 gpadmin hadoop 16 2014-07-06 22:38 hdfs://hdm2:8020/tmp/test1.txt
3. Create a readable external table in GPDB pointing to the sample file (test1.txt) in HDFS
NOTE: The LOCATION field in the example below is for a single-NameNode deployment. In the case of two NameNodes (High Availability), the LOCATION field would instead look like "gphdfs:///tmp/test1.txt": no port is included, and the hostname is replaced with the name configured for the "fs.defaultFS" parameter in core-site.xml. A sketch of the HA variant follows the single-NameNode example in this step.
initdb=# create external table test_hdfs (age int, name text) location('gphdfs://hdm2:8020/tmp/test1.txt') format 'text' (delimiter ',');
CREATE EXTERNAL TABLE
initdb=# \d test_hdfs
External table "public.test_hdfs"
Column | Type | Modifiers
--------+---------+-----------
age | integer |
name | text |
Type: readable
Encoding: UTF8
Format type: text
Format options: delimiter ',' null '\N' escape '\'
External location: gphdfs://hdm2:8020/tmp/test1.txt
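For the HA deployment described in the note above, the DDL would take the following shape (a sketch; test_hdfs_ha is a hypothetical table name, and the empty authority in the URL defers to fs.defaultFS from core-site.xml):
-- hypothetical HA variant of the readable external table
create external table test_hdfs_ha (age int, name text)
location('gphdfs:///tmp/test1.txt')
format 'text' (delimiter ',');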
4. Try querying data from the external table
initdb=# select * from test_hdfs;
 age | name
-----+------
15 | west
25 | east
(2 rows)
5. Create a writable external table in GPDB pointing to a file in HDFS
initdb=# select * from myt1;
id | name
------+-------
1000 | Jason
(1 row)
initdb=# create writable external table test_hdfs2 (like myt1) location('gphdfs://hdm2:8020/tmp/test2.txt') format 'text' (delimiter ',');
NOTICE: Table doesn't have 'distributed by' clause, defaulting to distribution columns from LIKE table
CREATE EXTERNAL TABLE
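The NOTICE above appears because no DISTRIBUTED BY clause was given, so the distribution key is inherited from myt1. A hypothetical variant that states the key explicitly (test_hdfs3 and test3.txt are made-up names, and id is assumed to be the desired distribution column):
-- hypothetical: declare the distribution key instead of inheriting it
create writable external table test_hdfs3 (like myt1)
location('gphdfs://hdm2:8020/tmp/test3.txt')
format 'text' (delimiter ',')
distributed by (id);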
6. Insert data to the writable external table
initdb=# insert into test_hdfs2 select * from myt1;
INSERT 0 1
7. Check existence and content of file in HDFS
[gpadmin@admin ~]$hdfs dfs -ls hdfs://hdm2:8020/tmp/test2.txt
Found 1 items
-rw-r--r-- 3 gpadmin hadoop 11 2014-07-13 23:37 hdfs://hdm2:8020/tmp/test2.txt/0_1402800555-0000000098
[gpadmin@admin ~]$hdfs dfs -cat hdfs://hdm2:8020/tmp/test2.txt/0_1402800555-0000000098
1000,Jason
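To close the loop from within GPDB, the file written by the writable external table can be mapped back through a readable external table and queried (a sketch; test_hdfs2_read is a hypothetical name, the column types are assumed from myt1, and depending on the GPDB version the location may need a trailing wildcard such as /tmp/test2.txt/*):
-- hypothetical read-back check of the data written in step 6
create external table test_hdfs2_read (id int, name text)
location('gphdfs://hdm2:8020/tmp/test2.txt')
format 'text' (delimiter ',');
select * from test_hdfs2_read;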