Pivotal Knowledge Base

Follow

How to read parquet files on HDFS from HDB using PXF

Environment

Product Version
Pivotal HDB 1.2.x 1.3.x 2.x
OS RHEL 6.x
Hadoop PHD3.x HDP2.x


Purpose

This article will show you an example of using PXF to read parquet files on Hadoop cluster(HDFS).

Procedure

1. Prepare parquet files on your HDFS filesystem.

$ hdfs dfs -ls/tmp/sample
Found 7 items
-rwxr--r--   3 gpadmin hdfs     235841 2016-05-11 00:52 /tmp/sample/part-m-00000.gz.parquet
-rwxr--r--   3 gpadmin hdfs     235841 2016-05-11 00:52 /tmp/sample/part-m-00001.gz.parquet
-rwxr--r--   3 gpadmin hdfs     235841 2016-05-11 00:52 /tmp/sample/part-m-00002.gz.parquet
-rwxr--r--   3 gpadmin hdfs     235841 2016-05-11 00:52 /tmp/sample/part-m-00003.gz.parquet
-rwxr--r--   3 gpadmin hdfs     235841 2016-05-11 00:52 /tmp/sample/part-m-00004.gz.parquet
-rwxr--r--   3 gpadmin hdfs     235841 2016-05-11 00:52 /tmp/sample/part-m-00005.gz.parquet
-rwxr--r--   3 gpadmin hdfs     235841 2016-05-11 00:52 /tmp/sample/part-m-00006.gz.parquet

2. Using hive command line, create a Hive external table pointing to the parquet files. You first need to know table structure of your parquet file to create an external table.

hive> CREATE EXTERNAL TABLE my_hive_parquet
(
   MONTH0 string,
   CUSTOMER_ID string,
   SERVICE string,
   SERVICE_NUMBER string,
   TYPE_OF_SHAREPLUS__MOBILE string,
   SHAREPLUS_PARENT_NUMBER__MOBILE string,
   STUDENT_DISCOUNT_COMPONENT__MOBILE string,
   CIS_STATUS___SUB__MOBILE string,
   HANDSET__MOBILE string,
   HUBCLUB_MEMBER string,
   GENDER string,
   RACE string,
   AGE_BAND string,
   RESIDENT_STATUS string,
   ADDRESS_ID___HH string,
 POSTAL_CODE___HH string,
   SMART_CARD_NUMBER string
)
STORED AS PARQUET LOCATION '/tmp/sample' ;

3. Create a Hawq external table pointing to the Hive table you just created using PXF.

gpadmin=# CREATE EXTERNAL TABLE my_hdb_parquet(
   MONTH0 text,
   CUSTOMER_ID text,
   SERVICE text,
   SERVICE_NUMBER text,
   TYPE_OF_SHAREPLUS__MOBILE text,
   SHAREPLUS_PARENT_NUMBER__MOBILE text,
   STUDENT_DISCOUNT_COMPONENT__MOBILE text,
   CIS_STATUS___SUB__MOBILE text,
   HANDSET__MOBILE text,
   HUBCLUB_MEMBER text,
   GENDER text,
   RACE text,
   AGE_BAND text,
   RESIDENT_STATUS text,
   ADDRESS_ID___HH text,
   POSTAL_CODE___HH text,
   SMART_CARD_NUMBER text
)
LOCATION ('pxf://hdm1:51200/my_hive_parquet?PROFILE=Hive')
FORMAT 'custom' (formatter='pxfwritable_import');

4. Read the data through the external table from HDB

 gpadmin=# select count(*) from my_hdb_parquet;
 count
--------
 700000
(1 row)
 
Time: 6536.905 ms
 
gpadmin=# select * from my_hdb_parquet limit 5;

Please make sure pxf-service is running on the namenodes and all HDB segment nodes before following this example. 

Comments

Powered by Zendesk