Pivotal Knowledge Base


Reference for using Sqoop with HDFS and HBase in a Pivotal Hadoop cluster deployed by ICM

Sqoop is a command-line tool typically used to transfer data between a relational database and Hadoop. This article provides examples of how to use Sqoop to transfer data to HDFS and HBase in a Pivotal Hadoop cluster deployed using ICM.


Step 1: As root, run the following on the Pivotal hdb-master node (this node has PostgreSQL installed by default):

yum install sqoop
Step 2: Then run the following: 
su - gpadmin
sqoop list-databases --connect jdbc:postgresql://RRR.RRR.R.RRR:5432 --username gpadmin --password yourpassword

Some notes about the sqoop command above:

Here RRR.RRR.R.RRR represents the IP address of the HAWQ master node, which can be found in /etc/hosts, as shown below. In this example the HAWQ master is on vm5, so the corresponding IP address is used.

[gpadmin@rhel64-5 ~]$ cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6

XXX.XXX.X.XXX rhel64-1.localdomain rhel64-1 vm1 # NIC <eth0>
YYY.YYY.Y.YYY rhel64-2.localdomain rhel64-2 vm2 # NIC <eth0>
ZZZ.ZZZ.Z.ZZZ rhel64-3.localdomain rhel64-3 vm3 # NIC <eth0>
MMM.MMM.M.MMM rhel64-4.localdomain rhel64-4 vm4 # NIC <eth0>
RRR.RRR.R.RRR rhel64-5.localdomain rhel64-5 vm5 # NIC <eth0>

The port number (5432) in the connect string is the default PostgreSQL port.
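The lookup described above (finding the HAWQ master's IP by its host alias) can be sketched in plain Python; this is only an illustration, and the aliases and placeholder IPs come from the /etc/hosts excerpt above:

```python
def host_ip(hosts_text, alias):
    """Return the IP from the first /etc/hosts line whose aliases include `alias`."""
    for line in hosts_text.splitlines():
        fields = line.split("#")[0].split()  # drop trailing comments, then tokenize
        if fields and alias in fields[1:]:   # fields[0] is the IP, the rest are names
            return fields[0]
    return None

hosts = """XXX.XXX.X.XXX rhel64-1.localdomain rhel64-1 vm1
RRR.RRR.R.RRR rhel64-5.localdomain rhel64-5 vm5"""

print(host_ip(hosts, "vm5"))  # → RRR.RRR.R.RRR
```

On the cluster itself, `grep vm5 /etc/hosts` achieves the same thing.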

Step 3: Create a table in HAWQ that you can later use to load data into different Hadoop services:

gpadmin# source /usr/local/hawq/greenplum_path.sh
gpadmin# psql -p 5432
psql (8.2.15)
Type "help" for help.
gpadmin=# \d
No relations found.
gpadmin=# \l
List of databases
Name | Owner | Encoding | Access privileges
gpadmin | gpadmin | UTF8 |
postgres | gpadmin | UTF8 |
template0 | gpadmin | UTF8 |
template1 | gpadmin | UTF8 |
(4 rows)
gpadmin=# \c gpadmin
You are now connected to database "gpadmin" as user "gpadmin".
gpadmin=# create table test (a int, b text);
NOTICE: Table doesn't have 'DISTRIBUTED BY' clause --
Using column named 'a' as the Greenplum Database data
distribution key for this table.
HINT: The 'DISTRIBUTED BY' clause determines the distribution
of data. Make sure column(s) chosen are the optimal data
distribution key to minimize skew.
gpadmin=# insert into test values (1, '435252345');
gpadmin=# select * from test;
 a |     b
---+-----------
 1 | 435252345
(1 row)

Importing from Postgres to HDFS using Sqoop

To import the table into HDFS, run the following command:

sqoop import --direct --connect jdbc:postgresql://RRR.RRR.R.RRR:5432 --username gpadmin --password yourpassword --table test -m 1
The expected output should look something like this:
14/06/04 23:27:44 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
14/06/04 23:27:44 INFO manager.SqlManager: Using default fetchSize of 1000
14/06/04 23:27:44 INFO tool.CodeGenTool: Beginning code generation
14/06/04 23:27:45 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "test" AS t LIMIT 1
14/06/04 23:27:46 INFO orm.CompilationManager: HADOOP_HOME is /usr/lib/gphd/hadoop
Note: /tmp/sqoop-gpadmin/compile/6d62fa759efb1d1392458e959b9bb2a3/test.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
14/06/04 23:27:48 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-gpadmin/compile/6d62fa759efb1d1392458e959b9bb2a3/test.jar
14/06/04 23:27:48 INFO manager.DirectPostgresqlManager: Beginning psql fast path import
14/06/04 23:27:49 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "test" AS t LIMIT 1
14/06/04 23:27:49 INFO manager.DirectPostgresqlManager: Performing import of table test from database
14/06/04 23:27:52 INFO manager.DirectPostgresqlManager: Transfer loop complete.
14/06/04 23:27:52 INFO manager.DirectPostgresqlManager: Transferred 12 bytes in 1.2663 seconds (9.4768 bytes/sec)

You can verify that the data landed in HDFS by listing the target directory:

[gpadmin@rhel64-5 ~]$ hdfs dfs -ls /user/gpadmin
Found 2 items
drwx------ - gpadmin hadoop 0 2014-06-04 23:26 /user/gpadmin/.staging
drwxr-xr-x - gpadmin hadoop 0 2014-06-04 23:27 /user/gpadmin/test
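For reference, the import invocation above can be assembled programmatically from its parts. This is a hypothetical helper, not part of Sqoop; the host and credentials are this article's placeholders:

```python
def sqoop_import_args(host, port, user, password, table, mappers=1, direct=True):
    """Build the argv for a `sqoop import` against a PostgreSQL source."""
    args = ["sqoop", "import"]
    if direct:
        args.append("--direct")  # use the database's native fast path (psql COPY)
    args += ["--connect", f"jdbc:postgresql://{host}:{port}",
             "--username", user, "--password", password,
             "--table", table, "-m", str(mappers)]
    return args

cmd = sqoop_import_args("RRR.RRR.R.RRR", 5432, "gpadmin", "yourpassword", "test")
print(" ".join(cmd))
```

This reproduces the exact command line shown above; in practice, note the log's warning that `-P` (prompt for password) is safer than `--password` on the command line.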

Importing from Postgres to HBase using Sqoop

To import the table into HBase, run the following command:

sqoop import --connect jdbc:postgresql://RRR.RRR.R.RRR:5432/gpadmin --username gpadmin --password yourpassword --table test --hbase-table test --column-family world --hbase-create-table --hbase-row-key a -m 1
To verify the import, open the HBase shell and scan the table:
[gpadmin@rhel64-5 ~]$ hbase shell
14/06/04 23:37:46 INFO Configuration.deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
HBase Shell; enter 'help' for list of supported commands.
Type "exit" to leave the HBase Shell
Version 0.96.0-hadoop2-gphd-, r40412e69ceba293d9aa1e940da1f32e442431e0b, Mon Apr 14 12:09:37 CST 2014
hbase(main):001:0> list
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/gphd/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/gphd/hadoop/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
1 row(s) in 3.3040 seconds
=> ["test"]
hbase(main):002:0> scan 'test'
ROW                                 COLUMN+CELL                                                                                        
 435252345                          column=world:b, timestamp=1401925031715, value=1                                                   
1 row(s) in 0.5200 seconds

Note above that the HBase table 'test' stores only one of the two columns as a cell (under the 'world' column family); the other column's value serves as the row key.

Whereas the Pivotal HDB table has both columns:

gpadmin=# select * from test;
 a |     b
---+-----------
 1 | 435252345
(1 row)

This is because column a is used as the HBase row key, and Sqoop transfers everything except the row-key column: its values become the row keys themselves rather than cells. As such, this behavior is expected and correct.
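The mapping described above can be sketched in plain Python. This is only an illustration of the rule, not Sqoop's actual implementation; the column names and values match this article's test table:

```python
def to_hbase_cells(row, row_key_col, column_family):
    """Map a relational row to an HBase (row key, cells) pair the way the
    article describes: the row-key column's value becomes the row key, and
    every other column becomes a cell under the given column family."""
    row_key = str(row[row_key_col])
    cells = {f"{column_family}:{col}": str(val)
             for col, val in row.items() if col != row_key_col}
    return row_key, cells

print(to_hbase_cells({"a": 1, "b": "435252345"}, "a", "world"))
# → ('1', {'world:b': '435252345'})
```

The row-key column never appears among the cells, which is why the HBase scan shows fewer columns than the source table.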





