Pivotal Knowledge Base

Data loading fails with "Invalid byte sequence for encoding UTF8"

Environment

All Pivotal Greenplum (GPDB) versions

Problem

When data is loaded from external sources with gpfdist, gpload, or COPY, the load might fail with the error 'invalid byte sequence for encoding "UTF8": 0xc942':

msong=# copy source_address from '/data/msong_env/cases/SR59883770/source_address.dat.0001' WITH DELIMITER '|' ;
ERROR: invalid byte sequence for encoding "UTF8": 0xc942
HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
CONTEXT: COPY source_address, line 39097

Cause

Files from external sources might use an encoding other than UTF8, while the target GPDB is configured to accept UTF8 from the client:

msong=# show client_encoding;
client_encoding
-----------------
UTF8
(1 row)

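In that case, the database cannot decode the bytes in the external files as UTF8, and the load fails with the error above. In the example, 0xc9 starts a two-byte UTF8 sequence, but the following byte 0x42 is not a valid continuation byte.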
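To confirm that the file itself contains bytes that are not valid UTF8, and to look at the line named in the COPY error context, something like the following sketch may help. The helper names is_valid_utf8 and dump_line are hypothetical and not part of Greenplum; the sketch only assumes that iconv, sed, and od are available on the host holding the file.

```shell
# Hypothetical helper: succeeds only if the file decodes cleanly as UTF-8.
# iconv exits non-zero when it hits an invalid input sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Hypothetical helper: print one line of a file as hex bytes so the
# offending sequence (for example "c9 42") can be seen directly.
# $1 = file, $2 = line number reported in the COPY error context
dump_line() {
    # LC_ALL=C makes sed treat the input as raw bytes.
    LC_ALL=C sed -n "${2}p" "$1" | od -An -tx1
}

# Example usage with the file and line number from the error above:
#   is_valid_utf8 /data/msong_env/cases/SR59883770/source_address.dat.0001 || echo "not valid UTF-8"
#   dump_line /data/msong_env/cases/SR59883770/source_address.dat.0001 39097
```

If is_valid_utf8 fails, the problem is in the file rather than in the database configuration, and the hex dump shows which characters need to be identified before choosing a source charset for conversion.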

Solution

As a workaround, convert the external source files to UTF8 with iconv:

iconv -f original_charset -t utf-8 originalfile > newfile
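As an illustration only, the conversion can be wrapped so that a failed run never leaves a half-written output file. The helper name convert_to_utf8 is hypothetical, and the source charset ISO-8859-1 in the usage line is an assumption for the example; the real charset of the source files has to be confirmed with whoever produced them. The charset names that the local iconv accepts can be listed with iconv -l.

```shell
# Hypothetical helper: convert a file to UTF-8 and leave the original untouched.
# $1 = source charset (must be known, e.g. ISO-8859-1)
# $2 = input file
# $3 = output file
convert_to_utf8() {
    # Write to a temporary file first; only move it into place when iconv
    # succeeds, so a failed conversion never leaves a truncated output file.
    if iconv -f "$1" -t UTF-8 "$2" > "$3.tmp"; then
        mv "$3.tmp" "$3"
    else
        rm -f "$3.tmp"
        return 1
    fi
}

# Example usage (the charset here is an assumption, not taken from the case):
#   convert_to_utf8 ISO-8859-1 source_address.dat.0001 source_address.utf8.dat
```

Note that iconv converts whatever charset it is told to assume. If the wrong source charset is given, the conversion may still succeed but produce wrong characters, so a spot check of the converted file is worthwhile before loading.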

As a permanent solution, always encode the external source files with the same encoding that the target database is configured for. Check that the following three outputs match:
1. Check the client encoding of the target database:

show client_encoding;

2. Check the server encoding of the target database:

show server_encoding;

3. Check the encoding of the external files:

file <filename>
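The three checks can be compared in a small script, sketched below. The helper names are hypothetical. The sketch assumes that the file utility supports the --mime-encoding option and that psql is on the PATH and connects to the target database. The database reports names such as UTF8 while file reports names such as utf-8, so the names are normalised before comparison.

```shell
# Hypothetical helper: normalise an encoding name so that "UTF8" and
# "utf-8" compare equal (lower-case, hyphens and underscores removed).
normalize_encoding() {
    printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr -d '_-'
}

# Hypothetical helper: succeeds when two encoding names refer to the
# same encoding after normalisation.
same_encoding() {
    [ "$(normalize_encoding "$1")" = "$(normalize_encoding "$2")" ]
}

# Hypothetical helper: print the encoding that file guesses for a data file.
detect_file_encoding() {
    file -b --mime-encoding "$1"
}

# Example usage (assumes psql can reach the target database):
#   client=$(psql -At -c 'show client_encoding')
#   server=$(psql -At -c 'show server_encoding')
#   data=$(detect_file_encoding source_address.dat.0001)
#   same_encoding "$client" "$data" || echo "file is $data, client expects $client"
```

The result from file is a heuristic guess based on the bytes it sees, not a guarantee. A file containing only ASCII characters is reported as us-ascii, which loads into a UTF8 database without problems because ASCII is a subset of UTF8.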
