Pivotal Knowledge Base

HDB/PXF Fails to Read Hive External Table using Multibyte Delimiter

Environment

Product: Pivotal HDB 2.x
OS: RHEL 6.x

Symptom

HDB fails to query a Hive external table that uses a multibyte delimiter through PXF.

Error Message:

gpadmin=# select * from multi;
ERROR: remote component error (500) from '172.28.21.189:51200': 
type Exception report message java.lang.Exception: java.lang.ClassNotFoundException: org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe
description The server encountered an internal error that prevented it from fulfilling this request.
exception javax.servlet.ServletException: java.lang.Exception:
java.lang.ClassNotFoundException: org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe (libchurl.c:897) (seg4 hdw1.hdp.local:40000 pid=672562) (dispatcher.c:1801)
DETAIL: External table multi

Cause 

The class "org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe", which is required for multibyte delimiter processing, is shipped in hive-contrib.jar. However, hive-contrib.jar is not on the default Java classpath of the PXF service.
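
You can verify that the class is indeed packaged in hive-contrib.jar by listing the jar's contents (the jar path here is the HDP 2.4 location used in the Resolution below; adjust it for your distribution):

# jar tf /usr/hdp/2.4.0.0-169/hive/lib/hive-contrib.jar | grep MultiDelimitSerDe
org/apache/hadoop/hive/contrib/serde2/MultiDelimitSerDe.class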

Resolution

1. Locate the path for hive-contrib.jar: 

# locate jar | grep hive | grep contri
/usr/hdp/2.4.0.0-169/hive/lib/hive-contrib-1.2.1000.2.4.0.0-169.jar
/usr/hdp/2.4.0.0-169/hive/lib/hive-contrib.jar ### <<<< Use this one; it is a symbolic link to the versioned jar above
/usr/hdp/2.4.0.0-169/storm/contrib/storm-hive/storm-hive-0.10.0.2.4.0.0-169.jar
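
If the locate database has not been built on the host, a find over the HDP installation tree returns the same paths:

# find /usr/hdp -name 'hive-contrib*.jar' 2>/dev/null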

2. From Ambari -> PXF -> Configs -> Advanced pxf-public-classpath, add the identified path:

:
/usr/hdp/current/hive-client/lib/datanucleus*
/usr/hdp/2.4.0.0-169/hive/lib/derby-10.10.2.0.jar
/usr/hdp/2.4.0.0-169/hive/lib/hive-contrib.jar   # <<< Add this line, using the path identified in step 1.
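
On clusters not managed by Ambari, the same list typically lives in /etc/pxf/conf/pxf-public.classpath on every PXF host (file location assumed from the standard HDB 2.x PXF layout; verify it on your nodes). Appending the jar there has the same effect:

# echo '/usr/hdp/2.4.0.0-169/hive/lib/hive-contrib.jar' >> /etc/pxf/conf/pxf-public.classpath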

3. Restart the PXF services from Ambari, and you should now be able to access the Hive external table:

gpadmin=# select count(*) from multi;
count
-------
    16
(1 row)

Additional Information 

Example of reproducing and resolving the issue:

1. Create a sample file with a multibyte delimiter (in this case '||') for the Hive external table:

$ hdfs dfs -cat /tmp/hivetest/sample.txt
1 || AAAAAAAAA
2 || BBBBBBBBB
3 || AAAAAAAAA
4 || BBBBBBBBB
5 || AAAAAAAAA
6 || BBBBBBBBB
7 || AAAAAAAAA
8 || BBBBBBBBB
9 || AAAAAAAAA
12 || BBBBBBBBB
11 || AAAAAAAAA
14 || BBBBBBBBB
15 || AAAAAAAAA
16 || BBBBBBBBB
17 || AAAAAAAAA
18 || BBBBBBBBB
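
To reproduce this from scratch, the sample file can be generated and uploaded along these lines (a minimal sketch; only the first two of the sixteen rows are written out, repeat the pattern for the rest):

$ printf '1 || AAAAAAAAA\n2 || BBBBBBBBB\n' > sample.txt
$ hdfs dfs -mkdir -p /tmp/hivetest
$ hdfs dfs -put sample.txt /tmp/hivetest/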

2. Create an external table from Hive to access the file created above using MultiDelimitSerDe:

CREATE EXTERNAL TABLE tableex(id STRING, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES ("field.delim"="||")
STORED AS TEXTFILE
LOCATION '/tmp/hivetest';
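
Before querying, you can confirm that the table was registered with the intended SerDe; this is a standard Hive command, and the "SerDe Library" field of its output should name MultiDelimitSerDe:

hive> describe formatted tableex;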

3. Check whether you see a similar error from Hive. You may or may not see this error in Hive, depending on your current Java classpath:

hive> select count(*) from tableex;
Query ID = hive_20170305184023_778fafe2-a9e9-4b51-a9e3-e6731f748b69
:
:
:
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe not found
     at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2120)
     at org.apache.hadoop.hive.ql.plan.PartitionDesc.getDeserializer(PartitionDesc.java:143)
     at org.apache.hadoop.hive.ql.exec.MapOperator.getConvertedOI(MapOperator.java:316)
     ... 24 more
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec

4. Locate the path to hive-contrib.jar:

# locate jar | grep hive | grep contri
/usr/hdp/2.4.0.0-169/hive/lib/hive-contrib-1.2.1000.2.4.0.0-169.jar
/usr/hdp/2.4.0.0-169/hive/lib/hive-contrib.jar ### <<<< Use this one; it is a symbolic link to the versioned jar above
/usr/hdp/2.4.0.0-169/storm/contrib/storm-hive/storm-hive-0.10.0.2.4.0.0-169.jar

5. Add hive-contrib.jar to the Hive session classpath:

hive> add JAR /usr/hdp/2.4.0.0-169/hive/lib/hive-contrib.jar;
Added [/usr/hdp/2.4.0.0-169/hive/lib/hive-contrib.jar] to class path
Added resources: [/usr/hdp/2.4.0.0-169/hive/lib/hive-contrib.jar]
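
Note that ADD JAR only lasts for the current Hive session. To make the jar available to every session, one standard Hive option (not specific to this article; adjust to how your cluster manages hive-env) is to export HIVE_AUX_JARS_PATH:

# In hive-env.sh, e.g. via Ambari -> Hive -> Configs -> Advanced hive-env:
export HIVE_AUX_JARS_PATH=/usr/hdp/2.4.0.0-169/hive/lib/hive-contrib.jar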

6. Confirm that the Hive query is working against the external table:

hive> select count(*) from tableex;
:
:
2017-03-05 18:39:00,573 Stage-1 map = 0%,  reduce = 0%
2017-03-05 18:39:10,445 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.76 sec
2017-03-05 18:39:20,294 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 7.13 sec
MapReduce Total cumulative CPU time: 7 seconds 130 msec
Ended Job = job_1485381284246_0029
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 7.13 sec   HDFS Read: 7391 HDFS Write: 3 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 130 msec
OK
16
Time taken: 37.338 seconds, Fetched: 1 row(s)

7. Create an external table on the HAWQ side for the Hive table created above:

CREATE EXTERNAL TABLE multi
(
   id text,
   name text
)
LOCATION ('pxf://hdm1:51200/tableex?PROFILE=Hive')
FORMAT 'custom' (formatter='pxfwritable_import');
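
In the LOCATION URL, hdm1:51200 is the host and port of the PXF service (the same port that appears in the error above), tableex is the Hive table created in step 2, and PROFILE=Hive selects PXF's generic Hive profile. Once created, the definition can be checked from psql:

gpadmin=# \d multi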

8. The same ClassNotFoundException now occurs from HAWQ:

gpadmin=# select * from multi;
ERROR:  remote component error (500) from '172.28.21.189:51200':  type  Exception report   message   java.lang.Exception: java.lang.ClassNotFoundException: org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe    description   The server encountered an internal error that prevented it from fulfilling this request.    exception   javax.servlet.ServletException: java.lang.Exception: java.lang.ClassNotFoundException: org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe (libchurl.c:897)  (seg4 hdw1.hdp.local:40000 pid=672562) (dispatcher.c:1801)
DETAIL:  External table multi

9. From "Ambari -> PXF -> Configs -> Advanced pxf-public-classpath", add the same jar path:

:
/usr/hdp/current/hive-client/lib/datanucleus*
/usr/hdp/2.4.0.0-169/hive/lib/derby-10.10.2.0.jar
/usr/hdp/2.4.0.0-169/hive/lib/hive-contrib.jar   # <<< Add this line, using the path identified in step 4.

10. Restart all PXF services from Ambari.
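
If you prefer the shell over Ambari, HDB 2.x typically installs a pxf-service init script on each PXF node (service name is an assumption; confirm it on your hosts):

# service pxf-service restart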

11. Try the same query from HAWQ:

gpadmin=# select * from multi;
id  |    name
-----+------------
1   |  AAAAAAAAA
2   |  BBBBBBBBB
3   |  AAAAAAAAA
4   |  BBBBBBBBB
5   |  AAAAAAAAA
6   |  BBBBBBBBB
7   |  AAAAAAAAA
8   |  BBBBBBBBB
9   |  AAAAAAAAA
12  |  BBBBBBBBB
11  |  AAAAAAAAA
14  |  BBBBBBBBB
15  |  AAAAAAAAA
16  |  BBBBBBBBB
17  |  AAAAAAAAA
18  |  BBBBBBBBB
(16 rows)

gpadmin=# select count(*) from multi;
count
-------
    16
(1 row)

12. Try querying the same table directly through HCatalog:

gpadmin=# select * from hcatalog.default.tableex;
id  |    name
-----+------------
1   |  AAAAAAAAA
2   |  BBBBBBBBB
3   |  AAAAAAAAA
4   |  BBBBBBBBB
5   |  AAAAAAAAA
6   |  BBBBBBBBB
7   |  AAAAAAAAA
8   |  BBBBBBBBB
9   |  AAAAAAAAA
12  |  BBBBBBBBB
11  |  AAAAAAAAA
14  |  BBBBBBBBB
15  |  AAAAAAAAA
16  |  BBBBBBBBB
17  |  AAAAAAAAA
18  |  BBBBBBBBB
(16 rows)

 
