Pivotal Knowledge Base

Follow

How to access HDFS data via GPDB external table with gphdfs protocol

Environment

  • Greenplum database 4.2.x, 4.3.x
  • Pivotal HD 1.x, 2.x
  • HDP 2.x

Prerequisites

  • Download jdk (1.7 is recommended) package and install it on all servers of the GPDB cluster 
  • Download Pivotal HD installation package (same version as that of PHD cluster to be access) and put it on GPDB master host
  • Link to download HDP installation package could be found in release note of each HDP release. Access here to choose one HDP release then read the release note

Install required PHD packages on GPDB cluster
Option 1

1. Unpack PHD installation tallball. Here take PHD-2.0.1 release as an example

[root@admin phd201]# tar xvfz PHD-2.0.1.0-148.tar.gz

... ...

And find out the following rpm packages in the unpacked directory

utility/rpm/bigtop-jsvc-1.0.15_gphd_2_0_1_0-43.x86_64.rpm
utility/rpm/bigtop-utils-0.4.0_gphd_2_0_1_0-43.noarch.rpm
zookeeper/rpm/zookeeper-3.4.5_gphd_2_0_1_0-43.noarch.rpm
hadoop/rpm/hadoop-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm
hadoop/rpm/hadoop-yarn-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm
hadoop/rpm/hadoop-mapreduce-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm
hadoop/rpm/hadoop-hdfs-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm

2. Install the rpm packages (follow the order listed above to avoid dependency check error). This needs to be done on all segment servers of GPDB cluster

[root@admin phd201]# cd PHD-2.0.1.0-148

[root@admin PHD-2.0.1.0-148]#rpm -ivh \
utility/rpm/bigtop-jsvc-1.0.15_gphd_2_0_1_0-43.x86_64.rpm \
utility/rpm/bigtop-utils-0.4.0_gphd_2_0_1_0-43.noarch.rpm \
zookeeper/rpm/zookeeper-3.4.5_gphd_2_0_1_0-43.noarch.rpm \
hadoop/rpm/hadoop-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm \
hadoop/rpm/hadoop-yarn-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm \
hadoop/rpm/hadoop-mapreduce-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm \
hadoop/rpm/hadoop-hdfs-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm

3. Configure hadoop configuration files & ensure that hdfs works.

Option 2

1. If there is admin node (where Pivotal Commander Center is running) available on the target PHD cluster, then the required rpm packages could be found under /usr/lib/gphd/rpms on admin node. 

2. Install the rpm packages on all segment hosts of GPDB cluster through either of the following ways.

    a)  Copy those rpm packages to each segment server and install them with "rpm -ivh" command manually

    b) Add a repo file (like gphd.repo) under /etc/yum.repos.d on every segment server with the content below. Note that "admin.hadoop.local" is the hostname of admin node on your site, which needs to be modified accordingly.

[gphd]
name=PHD Admin Node Repo
baseurl=http://admin.hadoop.local/gphd_yum_repo
enabled=1
gpgcheck=0
metadata_expire=0

3. Configure hadoop configuration files & ensure that hdfs works.

Then run "yum install <rpm package name>" to complete installation.

GPDB Configuration

1. Set environment variable JAVA_HOME for gpadmin user correctly on all segment servers, as illustrated below. Better to set it in .bashrc or .bash_profile

[gpadmin@admin ~]$echo $JAVA_HOME

/usr/java/default
[gpadmin@admin ~]$ls -l /usr/java/default
lrwxrwxrwx 1 root root 16 Jul 18 2013 /usr/java/default -> /usr/java/latest
[gpadmin@admin ~]$ls -l /usr/java/latest
lrwxrwxrwx 1 root root 21 Dec 15 2013 /usr/java/latest -> /usr/java/jdk1.7.0_25

[gpadmin@admin ~]$cat .bash_profile | grep JAVA_HOME
export JAVA_HOME=/usr/java/default


2. Set the parameters for GPDB

[gpadmin@admin ~]$ gpconfig -c gp_hadoop_home -v "'/usr/lib/gphd'"
[gpadmin@admin ~]$ gpconfig -c gp_hadoop_target_version -v "'gphd-2.0'"

Run "gpstop -u" to take the change into effect.

Test

1. Check HDFS is accessible from any of the segment servers

[gpadmin@sdw1 ~]$hdfs dfs -ls hdfs://hdm2:8020/

Found 7 items
drwxr-xr-x - hdfs hadoop 0 2014-06-14 21:20 hdfs://hdm2:8020/apps
drwxr-xr-x - postgres gpadmin 0 2014-06-16 04:55 hdfs://hdm2:8020/hawq_data
drwxr-xr-x - hdfs hadoop 0 2014-06-14 21:21 hdfs://hdm2:8020/hive
drwxr-xr-x - mapred hadoop 0 2014-06-14 21:20 hdfs://hdm2:8020/mapred
drwxrwxrwx - hdfs hadoop 0 2014-07-10 22:29 hdfs://hdm2:8020/tmp
drwxrwxrwx - hdfs hadoop 0 2014-06-16 18:11 hdfs://hdm2:8020/user
drwxr-xr-x - hdfs hadoop 0 2014-06-14 21:21 hdfs://hdm2:8020/yarn

2. Create a temporary text file and put it to HDFS

[gpadmin@admin ~]$cat test1.txt

15,west
25,east

[gpadmin@admin ~] hdfs dfs -put test1.txt hdfs://hdm2:8020/tmp/

[gpadmin@admin ~]$hdfs dfs -ls hdfs://hdm2:8020/tmp
Found 1 items
-rw-r--r-- 3 gpadmin hadoop 16 2014-07-06 22:38 hdfs://hdm2:8020/tmp/test1.txt

3. Create a readable external table in GPDB pointing it to sample file (test1.txt) in HDFS

NOTE: The blow example location field is for a single namenode deployment. In the case of two Namenodes (High Availability ) location field would be as followed "gphdfs:///tmp/test1.txt". We do not include a port and replace hostname with the name configured in core-site.xml for param "fs.defaultFS"

initdb=# create external table test_hdfs (age int, name text) location('gphdfs://hdm2:8020/tmp/test1.txt') format 'text' (delimiter ',');

CREATE EXTERNAL TABLE

initdb=# \d test_hdfs
External table "public.test_hdfs"
Column | Type | Modifiers
--------+---------+-----------
age | integer |
name | text |
Type: readable
Encoding: UTF8
Format type: text
Format options: delimiter ',' null '\N' escape '\'
External location: gphdfs://hdm2:8020/tmp/test1.txt

4. Try query data from the external table

initdb=# select * from test_hdfs;
age | name
-----+------
15 | west
25 | east
(2 rows)

5. Create a writable external table in GPDB pointing to a file in HDFS

initdb=# select * from myt1;

id | name
------+-------
1000 | Jason
(1 row)

initdb=# create writable external table test_hdfs2 (like myt1) location('gphdfs://hdm2:8020/tmp/test2.txt') format 'text' (delimiter ',');

NOTICE: Table doesn't have 'distributed by' clause, defaulting to distribution columns from LIKE table
CREATE EXTERNAL TABLE

6. Insert data to the writable external table

initdb=# insert into test_hdfs2 select * from myt1;

INSERT 0 1

7. Check existence and content of file in HDFS

[gpadmin@admin ~]$hdfs dfs -ls hdfs://hdm2:8020/tmp/test2.txt

Found 1 items
-rw-r--r-- 3 gpadmin hadoop 11 2014-07-13 23:37 hdfs://hdm2:8020/tmp/test2.txt/0_1402800555-0000000098
[gpadmin@admin ~]$hdfs dfs -cat hdfs://hdm2:8020/tmp/test2.txt/0_1402800555-0000000098
1000,Jason

Comments

Powered by Zendesk