HAWQ / PXF - "invalid byte sequence for encoding "UTF8": 0x00" when reading data from an external table in HAWQ via PXF

Environment

Product                     Version
Pivotal HDB                 1.3.x
PXF                         2.5
Pivotal Greenplum (GPDB)

Overview

While reading data from an external table in HAWQ via PXF, the following error may be seen:

retail_demo=# select * from test.inventory_orders_items_expected_pxf;
ERROR:  Segment reject limit reached. Aborting operation. Last error was: invalid byte sequence for encoding "UTF8": 0x00  (seg1 slice1 ac1hdw1.dca:40000 pid=238396)
DETAIL:  External table inventory_orders_items_expected_pxf, line 4148808 of file pxf://ac1hdw3.dca:50070/retail_demo/inventory/OrderItems-1.csv.gz?Profile=HdfsTextSimple
retail_demo=#

Cause

The error reports the byte sequence 0x00, which is the NULL character. This means a NULL character is being inserted into a character field in HAWQ. NULL characters are not supported in character fields in HAWQ or Pivotal Greenplum.

If the reported byte sequence is something other than 0x00, the problem may instead be an encoding issue in the file.
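
If an encoding problem is suspected, one quick check is to look at the file's reported encoding and, if necessary, convert it to UTF-8. This is only a sketch: the source encoding used below (LATIN1) and the output file name are assumptions and should be replaced with the encoding the file actually uses.

# Report the file's apparent MIME encoding
file -bi OrderItems-1.csv

# Convert to UTF-8; LATIN1 is an assumed source encoding - adjust as needed
iconv -f LATIN1 -t UTF-8 OrderItems-1.csv > OrderItems-UTF8-1.csv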

To check whether NULL characters are present in the file, the following command can be used. In this example, there are 535 lines containing a NULL character:

[root@phd3 ~]# grep -Pa '\x00' OrderItems-1.csv | wc -l
535
[root@phd3 ~]#
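
The file referenced in the error is gzip-compressed and stored in HDFS. Assuming the hdfs client, zcat, and GNU grep are available on the host, a similar check can be run against the compressed file directly; the HDFS path is taken from the error message above, and grep -c returns the count of matching lines without needing wc -l:

hdfs dfs -cat /retail_demo/inventory/OrderItems-1.csv.gz | zcat | grep -Pac '\x00'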

Resolution

Ideally, the NULL characters that are being loaded into character fields in HAWQ should be removed at the source, by modifying the process that creates the data.

If the file cannot be re-created, one way of removing the NULL characters is shown below.

IMPORTANT NOTE - this method will remove ALL NULL characters and could potentially modify the data in an unwanted way; use it with caution.
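
The steps below operate on a local copy of the file. If the file only exists in HDFS, it can be fetched and decompressed first; the HDFS path is taken from the error message above and this assumes the hdfs client is installed on the host:

# Copy the compressed file out of HDFS and decompress it locally
hdfs dfs -get /retail_demo/inventory/OrderItems-1.csv.gz .
gunzip OrderItems-1.csv.gz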

1. Confirm NULL is present and count the total number of lines in the file:

[root@phd3 ~]# grep -Pa '\x00' OrderItems-1.csv | wc -l
535
[root@phd3 ~]# wc -l OrderItems-1.csv
6318000 OrderItems-1.csv
[root@phd3 ~]# 

2. Remove the NULL characters using tr:

[root@phd3 ~]# tr -d '\000' < OrderItems-1.csv > OrderItems-Corrected-1.csv
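
Alternatively, if working with the gzip-compressed copy of the file, the same cleanup can be done in a single pipeline; the output file name below is only an example:

# Decompress, strip NUL bytes, and recompress in one step
zcat OrderItems-1.csv.gz | tr -d '\000' | gzip > OrderItems-Corrected-1.csv.gz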

3. Confirm that all NULL characters have been removed and that the total number of lines is unchanged:

[root@phd3 ~]# grep -Pa '\x00' OrderItems-Corrected-1.csv | wc -l
0
[root@phd3 ~]# wc -l OrderItems-Corrected-1.csv
6318000 OrderItems-Corrected-1.csv
[root@phd3 ~]#

4. Reload the data into HDFS / PXF / HAWQ.
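
One possible way to do this is to compress the corrected file and replace the original object in HDFS under the same name, so the external table LOCATION still resolves. This is only a sketch: the HDFS path is taken from the error message above, and file names should be adjusted to your environment.

# Compress the corrected file and replace the original file in HDFS
gzip OrderItems-Corrected-1.csv
hdfs dfs -rm /retail_demo/inventory/OrderItems-1.csv.gz
hdfs dfs -put OrderItems-Corrected-1.csv.gz /retail_demo/inventory/OrderItems-1.csv.gz

Once the corrected file is in place, re-run the query against the external table to confirm the rows are read without the reject error.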

As mentioned above, this resolution is only a workaround. Ideally, the process that produces the source data should ensure that NULL characters are not written into character fields.

 
