Pivotal Knowledge Base

PXF Location String (Path) Doesn't Support Unicode, Resulting in Error: "java.io.IOException: File Does not Exist"

Environment

 Product      Version
 Pivotal HDB  1.x, 2.0.x, 2.1.x
 OS           RHEL 6.x

Symptom

If an external table is defined with a PXF location string that contains Unicode (non-ASCII) characters, querying the table returns an error instead of the table's data.

Error Message:

gpadmin=# select * from test_hdfs_jp;
ERROR:  remote component error (500) from '172.28.21.193:51200':  
type Exception report
message File does not exist: /tmp/????/ABC-????-001.csv
description The server encountered an internal error that prevented it from fulfilling this request.
exception java.io.IOException: File does not exist: /tmp/?????/passwd (libchurl.c:897) (seg10 hdw2.hdp.local:40000 pid=389911) (dispatcher.c:1801)
DETAIL: External table test_hdfs_jp, file pxf://hdm1:51200/tmp/テスト試験/passwd?profile=HdfsTextSimple

How to reproduce the issue:

1. Put a sample .csv file in HDFS under two paths: one using plain ASCII, and the other containing Unicode (e.g. Japanese, Korean, Chinese) characters:

$ cat test.csv
15,west
25,east

$ hdfs dfs -put test.csv /tmp/test.csv
$ hdfs dfs -mkdir -p /tmp/テスト試験
$ hdfs dfs -put test.csv /tmp/テスト試験/passwd

2. From psql, define two PXF external tables, one for each of these paths:

gpadmin=# create external table test_hdfs_en (age int, name text) 
LOCATION ('pxf://hdm1:51200/tmp/*.csv?profile=HdfsTextSimple')
FORMAT 'csv' (delimiter ',' null '' escape '"' quote '"' newline 'LF')
ENCODING 'UTF8';

gpadmin=# create external table test_hdfs_jp (age int, name text) 
LOCATION ('pxf://hdm1:51200/tmp/テスト試験/passwd?profile=HdfsTextSimple')
FORMAT 'csv' (delimiter ',' null '' escape '"' quote '"' newline 'LF')
ENCODING 'UTF8';

3. Query both external tables from psql. The table with the ASCII path returns rows, while the table with the Unicode path fails:

gpadmin=# select * from test_hdfs_en;
 age | name
-----+------
  15 | west
  25 | east
(2 rows)

gpadmin=# select * from test_hdfs_jp;
ERROR:  remote component error (500) from '172.28.21.193:51200':  type  Exception report   message   File does not exist: /tmp/?????/passwd    description   The server encountered an internal error that prevented it from fulfilling this request.    exception   java.io.IOException: File does not exist: /tmp/??????/passwd (libchurl.c:897)  (seg10 hdw2.hdp.local:40000 pid=389911) (dispatcher.c:1801)
DETAIL:  External table test_hdfs_jp, file pxf://hdm1:51200/tmp/テスト試験/passwd?profile=HdfsTextSimple 

Cause

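PXF currently supports only ASCII characters in the location string. When the location string contains multi-byte Unicode characters, PXF cannot resolve the path (the non-ASCII characters appear as "?" in the error message) and the external table cannot be read.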
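
Since only ASCII location strings are read successfully, it may help to check a location string for non-ASCII bytes before defining the external table. The following is a minimal shell sketch of such a check. The function name has_non_ascii is hypothetical and is not part of PXF or HDB.

```shell
# Hypothetical helper, not part of PXF or HDB.
# Succeeds (exit status 0) if the argument contains any non-ASCII byte.
has_non_ascii() {
    # Delete every ASCII byte (octal 000-177); whatever is left is non-ASCII.
    # LC_ALL=C makes tr operate on raw bytes rather than multi-byte characters.
    rest=$(printf '%s' "$1" | LC_ALL=C tr -d '\000-\177')
    [ -n "$rest" ]
}

if has_non_ascii 'pxf://hdm1:51200/tmp/テスト試験/passwd?profile=HdfsTextSimple'; then
    echo "location string contains non-ASCII characters"
fi
```

The check works on bytes, so it flags any multi-byte UTF-8 character regardless of language.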

Resolution

There is currently no fix for this limitation. The only workaround at this stage is to use a pure ASCII string for the PXF location string, which means the HDFS path itself must contain only ASCII characters.
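
One way to apply the workaround is to copy the data to an ASCII-only HDFS path and point the external table at the copy. The sketch below assumes the hdfs command is on the PATH. The function name relocate_to_ascii_path and the target directory /tmp/test_jp are hypothetical examples, not names used by PXF or HDB.

```shell
# Hypothetical helper: copy a file from an HDFS path that may contain
# Unicode characters to an ASCII-only HDFS path.
relocate_to_ascii_path() {
    src=$1   # existing HDFS path, may contain Unicode
    dst=$2   # new HDFS path, must be ASCII-only
    # Refuse a destination that still contains non-ASCII bytes.
    if [ -n "$(printf '%s' "$dst" | LC_ALL=C tr -d '\000-\177')" ]; then
        echo "destination is not pure ASCII: $dst" >&2
        return 1
    fi
    # Create the parent directory, then copy (the original file is kept).
    hdfs dfs -mkdir -p "$(dirname "$dst")" &&
    hdfs dfs -cp "$src" "$dst"
}

# Example call (the target path is an assumption):
# relocate_to_ascii_path '/tmp/テスト試験/passwd' '/tmp/test_jp/passwd'
```

After the copy, the external table would be defined with an ASCII location string such as pxf://hdm1:51200/tmp/test_jp/passwd?profile=HdfsTextSimple.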

 
