HDB query becomes hung when reading large parquet file(s) through PXF service

Environment

Product Version: Pivotal HDB 1.3.x
OS: RHEL 6.x

Symptom

When attempting to read large Parquet file(s) through the PXF Hive profile, the query hangs.
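For reference, the symptom is typically triggered by a scan of a PXF external table defined with the Hive profile. Below is a minimal sketch of such a table; the database, PXF host/port, Hive table, and column names here are all hypothetical placeholders, not taken from the affected environment.

psql -d postgres <<'SQL'
-- Hypothetical names throughout: hive_db.parquet_table is a Hive table
-- backed by large Parquet files, and namenode:51200 is the PXF endpoint.
CREATE EXTERNAL TABLE ext_parquet_example (id int, payload text)
LOCATION ('pxf://namenode:51200/hive_db.parquet_table?PROFILE=Hive')
FORMAT 'CUSTOM' (formatter='pxfwritable_import');

-- The scan below is what hangs when the PXF JVM exhausts its heap.
SELECT count(*) FROM ext_parquet_example;
SQL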


Error Message:

/var/gphd/pxf/pxf-service/logs/catalina.out

Exception in thread "tomcat-http--7" java.lang.OutOfMemoryError: Java heap space
at parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:599)
at parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:360)
at parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:100)
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:172)
at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:130)
at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:95)
at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:66)
at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:72)
at com.pivotal.pxf.plugins.hive.HiveAccessor.getReader(HiveAccessor.java:94)
at com.pivotal.pxf.plugins.hdfs.HdfsSplittableDataAccessor.getNextSplit(HdfsSplittableDataAccessor.java:87)
at com.pivotal.pxf.plugins.hdfs.HdfsSplittableDataAccessor.openForRead(HdfsSplittableDataAccessor.java:61)
at com.pivotal.pxf.plugins.hive.HiveAccessor.openForRead(HiveAccessor.java:83)
at com.pivotal.pxf.service.ReadBridge.beginIteration(ReadBridge.java:50)
at com.pivotal.pxf.service.rest.BridgeResource$1.write(BridgeResource.java:100)
:

Cause 

Parquet files are compressed on disk and expand when read, so the PXF service requires considerably more memory than the on-disk file sizes suggest. When reading many large Parquet files, the default Java heap space for the PXF service (512MB) is not enough to process them.
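You can confirm the heap ceiling the running PXF instance was started with by inspecting the Tomcat JVM's command line. A quick check (the grep pattern assumes the instance runs out of /var/gphd/pxf/pxf-service; under the defaults this should print Xmx512M):

# ps -ef | grep '[p]xf-service' | grep -o 'Xmx[0-9]*[MG]'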

Resolution

On ALL PXF service nodes, increase the Java heap space (-Xmx) for the PXF service by editing /var/gphd/pxf/pxf-service/bin/setenv.sh as shown below, and then restart the PXF service.

# vi /var/gphd/pxf/pxf-service/bin/setenv.sh
JAVA_HOME="/usr/java/default"
AGENT_PATHS=""
JAVA_AGENTS=""
JAVA_LIBRARY_PATH=""
JVM_OPTS="-Xmx1024M -Xss256K"   # heap raised from the default 512M
JAVA_OPTS="$JVM_OPTS $AGENT_PATHS $JAVA_AGENTS $JAVA_LIBRARY_PATH"
# service pxf-service restart

The 1024M here is not an absolute value for all situations; depending on the number of Parquet files and their sizes, this value should be tuned accordingly.
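Because the setting must match on every PXF node, a small loop can apply the restart and verify the running heap ceiling cluster-wide. A sketch, assuming a hypothetical pxf_hosts file with one PXF node hostname per line and passwordless SSH between nodes:

# Restart PXF on each node, then report the heap flag of the running JVM.
for host in $(cat pxf_hosts); do
    ssh "$host" "service pxf-service restart"
    ssh "$host" "ps -ef | grep '[p]xf-service' | grep -o 'Xmx[0-9]*[MG]'"
done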

Additional Information 

As an example, a Parquet file of the following size caused this symptom when the PXF service was running with the default Java heap space of 512MB.

# hdfs dfs -ls /tmp/parquets
Found 1 items
-rw-r--r-- 3 gpadmin hdfs 525282287 2016-07-24 19:50 /tmp/parquets/1.gz.parquet
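Note that 525,282,287 bytes is roughly 501MB of gzip-compressed data, so this single file approaches the 512MB default heap even before decompression. To review on-disk sizes in human-readable form before tuning -Xmx:

# hdfs dfs -du -h /tmp/parquets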

Internal Comments

Strictly for Pivotal Support only:

https://jira-pivotal.atlassian.net/browse/GPSQL-3282
