Pivotal Knowledge Base


How to Import Fixed Length Data into HAWQ with PXF

Environment

Product: HAWQ
Version: 2.0.x
OS: RHEL 6.x

Purpose

This article discusses how to load fixed length data from a file into a HAWQ table.

Cause 

The file the user wants to load into HAWQ is a control file stored on Isilon HDFS. Its content uses fixed-length fields to record the file timestamp, the data start and end timestamps, a file qualifier, and the number of records. Once loaded, the user can query the data.

Below is the control file format that we would like PXF to use when loading data into HAWQ. Here, the file name is "PAYROLL_MASTER_2014-10-31_000000.000000.ctl":

2015-07-10 18:36:38.6300002014-10-31 00:00:00.0000002014-10-31 23:59:59.00000000000000010000025270
2015-07-11 18:36:38.6300002014-10-31 00:00:00.0000002014-10-31 23:59:59.00000000000000010000025270

Column layout (1-based positions, matching the substring calls below):

FileTimestamp       CHAR(26)  positions  1-26  format 'YYYY-MM-DD HH:MI:SS.S(6)'  2015-07-10 18:36:38.630000
DataStartTimestamp  CHAR(26)  positions 27-52  format 'YYYY-MM-DD HH:MI:SS.S(6)'  2014-10-31 00:00:00.000000
DataEndTimestamp    CHAR(26)  positions 53-78  format 'YYYY-MM-DD HH:MI:SS.S(6)'  2014-10-31 23:59:59.000000
FileQualifier       CHAR(10)  positions 79-88                                     0000000001
NbrRecords          CHAR(10)  positions 89-98                                     0000025270
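As a quick sanity check of the offsets, the fixed-length layout above can be sketched in Python (this helper is illustrative only and not part of the load procedure):

```python
# Sketch: split one fixed-length control-file line into its columns.
# Offsets mirror the column layout above, using the same 1-based
# positions that SQL substring(s1 from N for L) uses.
LAYOUT = [
    ("FileTimestamp", 1, 26),
    ("DataStartTimestamp", 27, 26),
    ("DataEndTimestamp", 53, 26),
    ("FileQualifier", 79, 10),
    ("NbrRecords", 89, 10),
]

def parse_line(line):
    # Python slices are 0-based, so subtract 1 from the 1-based start.
    return {name: line[start - 1:start - 1 + length]
            for name, start, length in LAYOUT}

# First line of the sample control file, concatenated for readability.
line = ("2015-07-10 18:36:38.630000"
        "2014-10-31 00:00:00.000000"
        "2014-10-31 23:59:59.000000"
        "0000000001"
        "0000025270")
row = parse_line(line)
```

Each record is exactly 98 characters (26 + 26 + 26 + 10 + 10), which is why the substring offsets in step 3 start at 1, 27, 53, 79, and 89.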

Procedure

The approach is to create a HAWQ external table that reads the raw lines from HDFS, then split them into typed columns by inserting into a HAWQ internal table. Follow these steps:

1. Point at the file PAYROLL_MASTER_2014-10-31_000000.000000.ctl on HDFS and create an external table in HAWQ:

CREATE EXTERNAL TABLE payroll (s1 text)
LOCATION ('pxf://hdm2.hadoop.local:51200/tmp/PAYROLL_MASTER_2014-10-31_000000.000000.ctl?PROFILE=HdfsTextSimple')
FORMAT 'TEXT';

gpadmin=# select * from payroll;
                                                 s1
----------------------------------------------------------------------------------------------------
 2015-07-10 18:36:38.6300002014-10-31 00:00:00.0000002014-10-31 23:59:59.00000000000000010000025270
 2015-07-11 18:36:38.6300002014-10-31 00:00:00.0000002014-10-31 23:59:59.00000000000000010000025270
(2 rows)

2. Create an internal table pay1 in HAWQ:

gpadmin=# create table pay1 (
    FileTimestamp      CHAR(26),
    DataStartTimestamp CHAR(26),
    DataEndTimestamp   CHAR(26),
    FileQualifier      CHAR(10),
    NbrRecords         CHAR(10)
);

gpadmin=# \d+ pay1
                     Append-Only Table "public.pay1"
       Column       |     Type      | Modifiers | Storage  | Description
--------------------+---------------+-----------+----------+-------------
 filetimestamp      | character(26) |           | extended |
 datastarttimestamp | character(26) |           | extended |
 dataendtimestamp   | character(26) |           | extended |
 filequalifier      | character(10) |           | extended |
 nbrrecords         | character(10) |           | extended |
Compression Type: None
Compression Level: 0
Block Size: 32768
Checksum: f
Has OIDs: no
Options: appendonly=true
Distributed randomly

3. Insert the data into pay1, splitting each line into columns with the substring function (the offsets match the fixed-length column layout above):

gpadmin=# insert into pay1
    select substring(s1 from 1 for 26),
           substring(s1 from 27 for 26),
           substring(s1 from 53 for 26),
           substring(s1 from 79 for 10),
           substring(s1 from 89 for 10)
    from payroll;
INSERT 0 2

gpadmin=#  select * from pay1;
       filetimestamp        |     datastarttimestamp     |      dataendtimestamp      | filequalifier | nbrrecords
----------------------------+----------------------------+----------------------------+---------------+------------
 2015-07-11 18:36:38.630000 | 2014-10-31 00:00:00.000000 | 2014-10-31 23:59:59.000000 | 0000000001    | 0000025270
 2015-07-10 18:36:38.630000 | 2014-10-31 00:00:00.000000 | 2014-10-31 23:59:59.000000 | 0000000001    | 0000025270
(2 rows) 

Additional Information

If you want the external table to handle the fixed-length format directly, you can create a custom profile in /etc/pxf/conf/pxf-profiles.xml and define the format with plugins (fragmenter, accessor, and resolver classes).
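As a rough sketch of what such a profile entry could look like, assuming you implement your own resolver class (the profile name and the com.example.* class below are hypothetical placeholders, not classes shipped with PXF):

```xml
<!-- Sketch only: FixedWidth and com.example.pxf.FixedWidthResolver are
     placeholders; the fragmenter and accessor shown are the standard
     HDFS text plugins used by the built-in HdfsTextSimple profile. -->
<profiles>
    <profile>
        <name>FixedWidth</name>
        <description>Reads fixed-length text records from HDFS</description>
        <plugins>
            <fragmenter>org.apache.hawq.pxf.plugins.hdfs.HdfsDataFragmenter</fragmenter>
            <accessor>org.apache.hawq.pxf.plugins.hdfs.LineBreakAccessor</accessor>
            <resolver>com.example.pxf.FixedWidthResolver</resolver>
        </plugins>
    </profile>
</profiles>
```

The external table would then reference it with ?PROFILE=FixedWidth in the pxf:// LOCATION URL.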

For more information, you can refer to this link.
