Pivotal Knowledge Base

Follow

Error message "missing data for column" seen when querying a CSV with a header stored in HDFS via gphdfs

Environment

Product Version
Pivotal Greenplum (GPDB) 4.3.x
Pivotal HD/HDP 3.0 / 2.x
gphdfs  

Symptom

When querying an external table based on a CSV file with a header stored in HDFS via gphdfs such as the following:

gpadmin=# CREATE EXTERNAL TABLE test_demo12_dos
LOCATION (
'gphdfs://hawq20/gphdfs2/test-demo12.csv'
)
FORMAT 'CSV' (HEADER); NOTICE: HEADER means that each one of the data files has a header row. CREATE EXTERNAL TABLE

Queries will intermittently fail, as shown below:

gpadmin=# select * from test_demo1_dos;
 entity_name | raw_dk
-------------+--------
 demo1test   |      1
 demo2test   |      2
 demo3test  |      3
 demo4test   |      4
(4 rows)

gpadmin=# select * from test_demo12_dos;
 entity_name | raw_dk
-------------+--------
 demo1test   |      1
 demo2test   |      2
 demo3test  |      3
 demo4test   |      4
(4 rows)

gpadmin=# select * from test_demo12_dos;
ERROR:  missing data for column "raw_dk"  (seg0 slice1 gpdb-sandbox.localdomain:40000 pid=100917)
DETAIL:  External table test_demo12_dos, line 1 of gphdfs://hawq20/gphdfs2/test-demo12.csv.dos: ""
gpadmin=#

Error Message:

The master log in "debug5" logs will show the following:

2016-09-08 05:41:57.244720 PDT,"gpadmin","gpadmin",p121013,th679020288,"[local]",,2016-09-08 05:41:39 PDT,82689,con47,cmd6,seg-1,,dx89,x82689,sx1,"LOG","00000","processResults.  isBusy = 1",,,,,,,0,,,,
2016-09-08 05:41:57.244727 PDT,"gpadmin","gpadmin",p121013,th679020288,"[local]",,2016-09-08 05:41:39 PDT,82689,con47,cmd6,seg-1,,dx89,x82689,sx1,"LOG","00000","PQgetResult",,,,,,,0,,,,
2016-09-08 05:41:57.244734 PDT,"gpadmin","gpadmin",p121013,th679020288,"[local]",,2016-09-08 05:41:39 PDT,82689,con47,cmd6,seg-1,,dx89,x82689,sx1,"LOG","00000","seg0 slice1 gpdb-sandbox.localdomain:40000 pid=1210
23 -> PGRES_FATAL_ERROR 22P04  ERROR:  missing data for column ""raw_dk""
CONTEXT:  External table demo1_csv, line 1 of gphdfs://hawq20/gphdfs2/test-demo12.csv: """"
",,,,,,,0,,,,
2016-09-08 05:41:57.244765 PDT,"gpadmin","gpadmin",p121013,th679020288,"[local]",,2016-09-08 05:41:39 PDT,82689,con47,cmd6,seg-1,,dx89,x82689,sx1,"LOG","00000","processResults says we have more to do with 1: seg0
 slice1 gpdb-sandbox.localdomain:40000 pid=121023",,,,,,,0,,,,

Cause

This is a software defect in Greenplum 4.3.x reported to engineering via MPP-26577. 

Resolution

This issue is still be worked on. Review the release notes to determine if the issue has been resolved in a current release of Pivotal Greenplum or contact Pivotal support.

In the meantime, the header should be removed from the CSV before putting the file into HDFS and the external table should be created with no header to work around this issue.

 

 

Comments

Powered by Zendesk