Pivotal Knowledge Base

Concurrent HDB queries resulting in Namenode JVM crash, problematic frame: [libnss_ldap.so.2+0x8ae6] _nss_ldap_getspnam_r+0x616

Environment

Product        Version
Pivotal HDB    1.3.x
OS             RHEL 6.x

Symptom
The Namenode may die abruptly without reporting any error message in the Namenode daemon log file (.log). The output file (.out) may report a JVM crash with the error message below. This issue was experienced while running multiple queries from HDB concurrently, but it is not limited to HDB.

# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f599e1ea6bf, pid=679039, tid=140022985860864
#
# JRE version: 6.0_34-b04
# Java VM: Java HotSpot(TM) 64-Bit Server VM (20.9-b04 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C [libnss_ldap.so.2+0x36bf] _nss_ldap_sethostent+0x25f
#
# An error report file with more information is saved as:
# /tmp/hs_err_pid679039.log
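
As a minimal sketch of how the problematic frame can be pulled out of a .out or hs_err_pid*.log file, the Python snippet below looks for the "Problematic frame:" marker and splits the following line into library, function and offsets. The function name find_problematic_frame and the regular expression are illustrative assumptions written against the output shown above; other JVM versions may format the frame line differently, in which case the function returns None.

```python
import re

# Matches a frame line such as:
#   # C [libnss_ldap.so.2+0x36bf] _nss_ldap_sethostent+0x25f
# Assumption: the frame carries both a library offset and a symbol offset.
FRAME_RE = re.compile(
    r"^#\s*(?P<kind>\w)\s+"
    r"\[(?P<lib>.+)\+(?P<lib_off>0x[0-9a-fA-F]+)\]\s+"
    r"(?P<func>.+)\+(?P<func_off>0x[0-9a-fA-F]+)$"
)


def find_problematic_frame(text):
    """Return a dict describing the problematic frame, or None if not found."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if "Problematic frame:" in line:
            # The frame itself is expected on the line after the marker.
            for candidate in lines[i + 1:i + 3]:
                match = FRAME_RE.match(candidate.strip())
                if match:
                    return match.groupdict()
    return None


# Sample text copied from the .out excerpt in this article.
SAMPLE = """\
# SIGSEGV (0xb) at pc=0x00007f599e1ea6bf, pid=679039, tid=140022985860864
#
# Problematic frame:
# C [libnss_ldap.so.2+0x36bf] _nss_ldap_sethostent+0x25f
#
"""

frame = find_problematic_frame(SAMPLE)
print(frame)
```

For a real crash, the text would be read from the .out file or from the hs_err_pid*.log file named at the end of the crash output.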

Cause
There can be multiple reasons for such an error. A good starting point for debugging is the problematic frame reported in the .out file, which names the *.so library and the function that caused the crash. Also refer to the error report file named hs_err_pid*.log, which has more details such as thread information, register values, and Java frames. Reviewing hs_err_pid*.log further may reveal more about the call that was made. For instance, in this particular case, the hs_err_pid log file reported the stack and Java frames below:

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j org.apache.hadoop.security.JniBasedUnixGroupsMapping.getGroupForUser(Ljava/lang/String;)[Ljava/lang/String;+0
j org.apache.hadoop.security.JniBasedUnixGroupsMapping.getGroups(Ljava/lang/String;)Ljava/util/List;+7
j org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback.getGroups(Ljava/lang/String;)Ljava/util/List;+5
J org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(Lcom/google/protobuf/Descriptors$MethodDescriptor;Lcom/google/protobuf/RpcController;Lcom/google/protobuf/Message;)Lcom/google/protobuf/Message;
J org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(Lorg/apache/hadoop/ipc/RPC$Server;Ljava/lang/String;Lorg/apache/hadoop/io/Writable;J)Lorg/apache/hadoop/io/Writable;
J org.apache.hadoop.ipc.Server$Handler$1.run()Ljava/lang/Object;
v ~StubRoutines::call_stub
J java.security.AccessController.doPrivileged(Ljava/security/PrivilegedExceptionAction;Ljava/security/AccessControlContext;)Ljava/lang/Object;
J org.apache.hadoop.security.UserGroupInformation.doAs(Ljava/security/PrivilegedExceptionAction;)Ljava/lang/Object;
J org.apache.hadoop.ipc.Server$Handler.run()V
v ~StubRoutines::call_stub

If you look at the above stack, you will notice the call to JniBasedUnixGroupsMappingWithFallback. The Hadoop configuration parameter that selects this class is hadoop.security.group.mapping, whose value is set to org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback by default.

org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback is a class for user-to-group mapping (getting the groups for a given user) for ACLs. This default implementation determines whether the Java Native Interface (JNI) is available. If JNI is available, the implementation uses the API within Hadoop to resolve a list of groups for a user. If JNI is not available, the shell implementation, ShellBasedUnixGroupsMapping, is used instead. That implementation shells out to the Linux/Unix environment with the "bash -c groups" command to resolve a list of groups for a user.
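
To illustrate what a shell-based lookup amounts to, here is a rough Python sketch that resolves groups by running the groups command through bash in a separate process, so no NSS library code runs inside the calling process. It is not Hadoop's actual code: the function names shell_groups_for_user and parse_groups_output are hypothetical, and the parsing assumes the usual "user : group1 group2" or "group1 group2" output formats.

```python
import subprocess


def parse_groups_output(output):
    """Turn the output of the `groups` command into a list of group names."""
    text = output.strip()
    # `groups <user>` commonly prints "user : g1 g2"; plain `groups` prints "g1 g2".
    if ":" in text:
        text = text.split(":", 1)[1]
    return text.split()


def shell_groups_for_user(user):
    """Resolve groups for `user` by shelling out, in the spirit of "bash -c groups"."""
    # The user name is passed as a positional argument ($1) rather than
    # interpolated into the command string, to avoid shell injection.
    result = subprocess.run(
        ["bash", "-c", 'groups "$1"', "bash", user],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_groups_output(result.stdout)
```

Calling shell_groups_for_user("gpadmin") on a host where that user exists would return that user's group names as a list; the user name here is only an example. Because each lookup runs in a child process, a fault in a native lookup library would affect that child rather than the calling process, at the cost of spawning a process per lookup.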

Further investigation revealed that libnss_ldap.so.2 may not be able to handle the concurrent requests made by HDB, which causes the Namenode JVM to crash. However, this is not limited to HDB and may be triggered by concurrent requests made to the Namenode from any application.

Workaround
To avoid the JNI-based calls made by org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback (the default value), set the parameter hadoop.security.group.mapping to org.apache.hadoop.security.ShellBasedUnixGroupsMapping in core-site.xml on the Namenode and restart the Namenode. This should work around the abrupt Namenode JVM crash.

<property>
  <name>hadoop.security.group.mapping</name>
  <value>org.apache.hadoop.security.ShellBasedUnixGroupsMapping</value>
</property>
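
Before restarting, it can help to confirm that the property was actually picked up in core-site.xml. The following Python sketch parses a core-site.xml document and reports the configured group mapping class. The function name group_mapping_from_xml and the sample XML are illustrative assumptions; the sketch also assumes the property is defined at most once in the file.

```python
import xml.etree.ElementTree as ET

PROPERTY = "hadoop.security.group.mapping"
SHELL_MAPPING = "org.apache.hadoop.security.ShellBasedUnixGroupsMapping"


def group_mapping_from_xml(xml_text):
    """Return the configured group mapping class, or None if the property is absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name == PROPERTY:
            return (prop.findtext("value") or "").strip()
    # Property not set: Hadoop falls back to its built-in default.
    return None


# Hypothetical core-site.xml content, used for illustration only.
SAMPLE_CORE_SITE = """\
<configuration>
  <property>
    <name>hadoop.security.group.mapping</name>
    <value>org.apache.hadoop.security.ShellBasedUnixGroupsMapping</value>
  </property>
</configuration>
"""

configured = group_mapping_from_xml(SAMPLE_CORE_SITE)
if configured == SHELL_MAPPING:
    print("workaround applied")
else:
    print("workaround not applied, current value:", configured)
```

In practice the XML text would be read from the core-site.xml file on the Namenode host instead of the sample string.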

For reference, below are the other related problematic frames that were observed with JniBasedUnixGroupsMappingWithFallback and that resulted in a Namenode JVM crash.

1) # C [libnss_ldap.so.2+0x36bf] _nss_ldap_sethostent+0x25f
2) # C [libc.so.6+0x7af1c] char+0x1c
3) # C [libnss_ldap.so.2+0x8a70] _nss_ldap_getspnam_r+0x5a0 

Internal Comments

Notes: Pivotal internal employees can reference JIRA GPSQL-1626.

 
