Pivotal Knowledge Base

Failed to Load Empty String from External Table into Numeric Field

Environment

Product: Pivotal HDB 2.x
OS: RHEL 6.x

Symptom

When loading empty strings ("") from a data file into numeric fields (e.g. integer) through a PXF-based external table, the load fails with "data formatting errors."

Error Message:

gpadmin=# insert into a1 select * from a1_ext1;
NOTICE: Found 2 data formatting errors (2 or more input rows). Rejected related input data.

Cause

The failure occurs because an empty string ("") is invalid input syntax for the target integer type.
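The underlying behavior can be reproduced with a simple cast in any HDB/PostgreSQL session (a minimal illustration; casting NULL, by contrast, succeeds):

```sql
-- An empty string cannot be cast to integer:
SELECT ''::integer;
-- ERROR:  invalid input syntax for integer: ""

-- A NULL cast to integer is valid and returns NULL:
SELECT NULL::integer;
```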

RCA

The internal (append-only) table and the external table are defined as follows:

gpadmin=# \d a1
Append-Only Table "public.a1"
Column | Type | Modifiers
--------+---------+-----------
c1 | text |
c2 | integer |
c3 | text |

gpadmin=# \d a1_ext1
External table "public.a1_ext1"
Column | Type | Modifiers
--------+---------+-----------
c1 | text |
c2 | integer |
c3 | text |
Type: readable
Encoding: UTF8
Format type: csv
Format options: delimiter ',' null '' escape '"' quote '"' fill missing fields
External location: pxf://hdm1.hadoop.local:51200//tmp/ea1.dat/?profile=HdfsTextSimple
Segment reject limit: 10000 rows
Error table: err_a1_ext1

The raw data in the source file (ea1.dat) looks like this:

[gpadmin@hdm2 50074]$ cat ea1.dat
"12354.0","","N/A"

Because the rows are rejected, they are logged into the error table, from which the following information can be queried:

gpadmin=# select * from err_a1_ext1;
            cmdtime            | relname |                              filename                              | linenum | bytenum |                             errmsg                             |      rawdata       | rawbytes
-------------------------------+---------+--------------------------------------------------------------------+---------+---------+----------------------------------------------------------------+--------------------+----------
 2017-02-28 10:30:09.57121+08  | a1_ext1 | pxf://hdm1.hadoop.local:51200//tmp/ea1.dat/?profile=HdfsTextSimple |       1 |         | invalid input syntax for integer: "", column c2                | "12354.0","","N/A" |
 2017-02-28 10:30:09.57121+08  | a1_ext1 | pxf://hdm1.hadoop.local:51200//tmp/ea1.dat/?profile=HdfsTextSimple |       2 |         | missing data for column "c2", found empty data line, column c2 |                    |
(2 rows)

The rejection reason is invalid input syntax for integer: "". In other words, HDB has no built-in rule for converting an empty string into a valid numeric value.

Resolution

Since HDB cannot perform this type conversion automatically, convert the empty string explicitly before it is cast to the numeric type:

1. Define the column that may contain empty strings as "text" in the external table definition:

gpadmin=# \d a1_ext
External table "public.a1_ext"
Column | Type | Modifiers
--------+------+-----------
c1 | text |
c2 | text |
c3 | text |
Type: readable
Encoding: UTF8
Format type: csv
Format options: delimiter ',' null '' escape '"' quote '"' fill missing fields
External location: pxf://hdm1.hadoop.local:51200//tmp/ea1.dat/?profile=HdfsTextSimple
Segment reject limit: 10000 rows
Error table: err_a1_ext
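For reference, a definition like the one above can be created with a statement along these lines (a sketch reconstructed from the `\d` output; the PXF host, port, and file path are taken from this example and will differ in your environment):

```sql
CREATE READABLE EXTERNAL TABLE a1_ext (c1 text, c2 text, c3 text)
LOCATION ('pxf://hdm1.hadoop.local:51200//tmp/ea1.dat/?profile=HdfsTextSimple')
FORMAT 'CSV' (DELIMITER ',' NULL '' ESCAPE '"' QUOTE '"' FILL MISSING FIELDS)
LOG ERRORS INTO err_a1_ext
SEGMENT REJECT LIMIT 10000 ROWS;
```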

2. Convert the empty string to NULL with the NULLIF() function when inserting data into the internal table:

gpadmin=# insert into a1 select c1, nullif(c2,'')::integer, c3 from a1_ext;
INSERT 0 1
gpadmin=# select * from a1;
c1 | c2 | c3
---------+----+-----
12354.0 | | N/A
(1 row)

Alternatively, instead of using NULLIF() inline, write a UDF that first converts an empty string to NULL or 0.
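Such a UDF can be sketched as a simple SQL function (the function name empty_to_int is hypothetical, chosen for illustration):

```sql
-- Hypothetical helper: map '' to NULL, otherwise cast the text value to integer
CREATE OR REPLACE FUNCTION empty_to_int(text) RETURNS integer AS $$
    SELECT NULLIF($1, '')::integer;
$$ LANGUAGE SQL IMMUTABLE;

-- Usage:
-- INSERT INTO a1 SELECT c1, empty_to_int(c2), c3 FROM a1_ext;
```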
