Pivotal Knowledge Base

Understanding GPHDFS Configurations

Environment

  • Pivotal Greenplum (GPDB)
  • A Hadoop distribution (CDH, MapR, PHD)

Purpose

  The most common error observed when setting up gphdfs is the infamous Java "class not found". The cause is usually a configuration issue with either the GPDB server configuration parameters (GUCs) or an environment issue sourcing the right Hadoop JAR files. This article will help you understand how GPHDFS works and give the end user the tools needed to troubleshoot configuration issues.

Basic GPDB configuration details

gp_hadoop_target_version

Determines which family of Hadoop GPDB will be connecting to. This changes which binaries GPDB uses when accessing external HDFS data. Refer to our documentation for acceptable values.

gp_hadoop_home

The value is the Hadoop binary installation path.

$GPHOME/lib/hadoop/hadoop_env.sh

The above two parameters are plugged into this file, which determines how the Hadoop Java classes are sourced.
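
These parameters are typically set with gpconfig and picked up with a configuration reload. A minimal sketch, assuming a Pivotal HD 2.x cluster installed under /usr/lib/gphd (substitute the values that match your distribution; exact quoting can vary by GPDB version):

gpconfig -c gp_hadoop_target_version -v "'gphd-2.0'"
gpconfig -c gp_hadoop_home -v "'/usr/lib/gphd'"
gpstop -u     # reload the configuration without restarting the cluster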

High level query execution

The GPDB Administration Guide explains in detail how a query gets executed and provides some best practices to help maximize performance. The key point for GPHDFS is that each segment instance running on a segment server will need to launch a new JVM process for each external table query.
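
One quick way to see this on a running system (an optional check, not a required step) is to look for the connector JVMs on a segment host while a gphdfs query is executing:

ps -ef | grep hdfsconnector | grep -v grep     # one java process per segment instance participating in the query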

Example Classes required by GPHDFS read/write

The Java classes required for GPHDFS to function are sourced as follows:

  • $GPHOME/lib/hadoop/hadoop_env.sh is used to source the vendor's Hadoop classes as well as the Greenplum HDFS reader/writer classes
  • All Oracle Java dependencies are sourced based on the bash $JAVA_HOME setting
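
A quick sanity check of both points from the gpadmin shell, assuming default install locations:

echo $JAVA_HOME                # should point at the JDK you intend gpadmin to use
$JAVA_HOME/bin/java -version   # confirms the JVM actually resolves
ls $GPHOME/lib/hadoop/         # hadoop_env.sh and the gphdfs connector JARs live here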

Do not be afraid of hadoop_env.sh

It makes sense that the end user might be a little hesitant to update the contents of this file given that it is located under the $GPHOME/lib directory. But realistically, hadoop_env.sh controls your external query destiny. A GPDB sysadmin needs to understand what this file is and how to bend it to meet their application needs.

There are two very important environment variables:

export GP_JAVA_OPT=-Xmx1000m
export PATH=$JAVA_HOME/bin:$PATH

The first, GP_JAVA_OPT, controls how much memory can be allocated to each JVM process launched when reading from or writing to external HDFS tables. Each GPDB segment instance will need to launch its own JVM, so consider the following situation:

  • 10 segments on one server
  • GP_JAVA_OPT=-Xmx1000m

Given the above environment, each external table query will consume 10 GB of kernel virtual memory. See this article for more information regarding how the kernel allocates memory.
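
To estimate the impact on your own cluster, count the primary segments per host and multiply by the -Xmx value. A sketch using the gp_segment_configuration catalog:

[gpadmin@gpdb ~]$ psql -tAc "SELECT hostname, count(*) FROM gp_segment_configuration WHERE role='p' AND content >= 0 GROUP BY hostname"
# e.g. 10 primaries per host x 1000 MB (-Xmx1000m) ~= 10 GB of JVM virtual memory per host for each query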


The second is JAVA_HOME. Without JAVA_HOME set in ~gpadmin/.bashrc, hadoop_env.sh will fail to source Java or might source the wrong version of Java. It is important to explicitly define which Java version you want the gpadmin user to use.
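
A minimal sketch of the relevant ~gpadmin/.bashrc entries, assuming a JDK installed under /usr/java/latest (adjust the path to the JDK on your hosts):

export JAVA_HOME=/usr/java/latest   # example path; point this at the JDK gpadmin should use
export PATH=$JAVA_HOME/bin:$PATH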


The rest of $GPHOME/lib/hadoop/hadoop_env.sh iterates over a number of subdirectories in ${gp_hadoop_home} to build the correct classpath needed to launch the gphdfs JVM process that reads data from HDFS.
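
The exact contents differ between GPDB versions and Hadoop distributions, but the classpath assembly is conceptually similar to the following sketch (illustrative only, not the shipped script):

# append every JAR found under the vendor's Hadoop installation to the classpath
for jar in ${gp_hadoop_home}/lib/*.jar; do
    CLASSPATH=${CLASSPATH}:${jar}
done
export CLASSPATH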

Troubleshooting Techniques

Manually test connectivity on an individual segment

Each gp_hadoop_target_version value maps to a specific gphdfs connector JAR:

gp_hadoop_target_version   JAR File
gphd-1.0                   gphd-1.0-gnet-1.0.0.1.jar
gphd-1.1                   gphd-1.1-gnet-1.1.0.0.jar
gphd-1.2                   gphd-1.2-gnet-1.1.0.0.jar
gphd-2.0                   gphd-2.0.2-gnet-1.2.0.0.jar
gpmr-1.0                   gpmr-1.0-gnet-1.0.0.1.jar
gpmr-1.2                   gpmr-1.2-gnet-1.0.0.1.jar
cdh3u2                     cdh3u2-gnet-1.1.0.0.jar
cdh4.1                     cdh4.1-gnet-1.2.0.0.jar
hdp2                       cdh4.1-gnet-1.2.0.0.jar
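
To confirm that the connector JAR matching your gp_hadoop_target_version is actually installed, list the JARs shipped with GPDB (assuming the default $GPHOME/lib/hadoop location):

[gpadmin@gpdb ~]$ ls $GPHOME/lib/hadoop/ | grep gnet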

GphdfsTest.sh will launch the GPDB Java class com.emc.greenplum.gpdb.hdfsconnector.HDFSReader and read data from HDFS the same way GPDB does. You may want to use this for quick connectivity troubleshooting between GPDB and gphdfs. This can also be helpful in cases where there are Kerberos-related errors and you need to enable Kerberos debugging.

  1. Download GphdfsTest.sh and upload it to any segment host in the cluster on which you wish to test connectivity
  2. Ensure the following bash variables are set
    	$JAVA_HOME
    	$GPHOME
  3. Example usage when reading a text file from HDFS
    [gpadmin@gpdb ~]$ ./GphdfsTest.sh gphd-2.0 /usr/lib/gphd TEXT gphdfs://hdm1:8020/tmp/t1
    15/12/18 08:46:11 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    15/12/18 08:46:13 WARN hdfs.BlockReaderLocal: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
    15/12/18 08:46:13 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 76 for gpadmin on 172.28.17.21:8020
    15/12/18 08:46:13 INFO security.TokenCache: Got dt for hdfs://hdm1:8020; Kind: HDFS_DELEGATION_TOKEN, Service: 172.28.17.21:8020, Ident: (HDFS_DELEGATION_TOKEN token 76 for gpadmin)
    15/12/18 08:46:13 INFO input.FileInputFormat: Total input paths to process : 1
    3423
    234
    324
    23542
    342
    35
    234
    234
    324
    324
    235
    235
    32
    423
    432
  4. Example usage with Kerberos debugging enabled using the "-v" option
    [gpadmin@gpdb ~]$ ./GphdfsTest.sh gphd-2.0 /usr/lib/gphd TEXT gphdfs://hdm1:8020/tmp/t1 -v
    Java config name: null
    Native config name: /etc/krb5.conf
    Loaded from native config
    15/12/18 08:42:53 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    >>>KinitOptions cache name is /tmp/krb5cc_500
    >>>DEBUG   client principal is gpadmin@PHD.LOCAL
    >>>DEBUG  server principal is krbtgt/PHD.LOCAL@PHD.LOCAL
    >>>DEBUG  key type: 18
    >>>DEBUG  auth time: Fri Dec 18 08:39:13 PST 2015
    >>>DEBUG  start time: Fri Dec 18 08:39:13 PST 2015
    >>>DEBUG  end time: Sat Dec 19 08:39:09 PST 2015
    >>>DEBUG  renew_till time: Fri Dec 18 08:39:13 PST 2015
    >>> CCacheInputStream: readFlags()  FORWARDABLE; RENEWABLE; INITIAL;
    >>>DEBUG   client principal is gpadmin@PHD.LOCAL
    >>>DEBUG  server principal is X-CACHECONF:/krb5_ccache_conf_data/fast_avail/krbtgt/PHD.LOCAL@PHD.LOCAL
    >>>DEBUG  key type: 0
    .
    .
    .
    .

Note: When reading other binary formats like Parquet, Avro, or GPDBWritable, the output of this tool might include non-ASCII characters. The purpose of this tool is to successfully read data from HDFS, so as long as data is returned, the test is a success.
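
For example, a run against a file written in GPDBWritable format might look like the following; the format keyword and HDFS path here are assumptions, so check the script usage for the exact spelling on your version:

[gpadmin@gpdb ~]$ ./GphdfsTest.sh gphd-2.0 /usr/lib/gphd GPDBWritable gphdfs://hdm1:8020/tmp/t2   # hypothetical example path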

Some other useful links

Example of how to set up and test GPHDFS

 
