Pivotal Knowledge Base

hdfs dfs -ls commands hang indefinitely when secure HDFS is enabled

Environment

  • PHD 1.x
  • PHD 2.x

Symptom

[gpadmin@etl1 ~]$ hdfs dfs -ls /
Java config name: null
Native config name: /etc/krb5.conf
Loaded from native config
>>>KinitOptions cache name is /tmp/krb5cc_500
>>>DEBUG   client principal is gpadmin/etl1.phd.local@PHD.LOCAL
>>>DEBUG  server principal is krbtgt/PHD.LOCAL@PHD.LOCAL
>>>DEBUG  key type: 18
>>>DEBUG  auth time: Mon Aug 18 11:57:12 PDT 2014
>>>DEBUG  start time: Mon Aug 18 11:57:12 PDT 2014
>>>DEBUG  end time: Tue Aug 19 11:57:12 PDT 2014
>>>DEBUG  renew_till time: Mon Aug 18 11:57:12 PDT 2014
>>> CCacheInputStream: readFlags()  FORWARDABLE; RENEWABLE; INITIAL;
>>>DEBUG   client principal is gpadmin/etl1.phd.local@PHD.LOCAL
>>>DEBUG  server principal is X-CACHECONF:/krb5_ccache_conf_data/fast_avail/krbtgt/PHD.LOCAL@PHD.LOCAL
>>>DEBUG  key type: 0
>>>DEBUG  auth time: Wed Dec 31 16:00:00 PST 1969
>>>DEBUG  start time: null
>>>DEBUG  end time: Wed Dec 31 16:00:00 PST 1969
>>>DEBUG  renew_till time: null
>>> CCacheInputStream: readFlags()
14/08/18 12:12:48 WARN retry.RetryInvocationHandler: Exception while invoking class org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo. Not retrying because failovers (15) exceeded maximum allowed (15)
org.apache.hadoop.net.ConnectTimeoutException: Call From etl1.phd.local/10.110.127.23 to hdw1.phd.local:8020 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending local=/10.110.127.23:45865 remote=hdw1.phd.local/172.28.17.4:8020]; For more details see:  http://wiki.apache.org/hadoop/SocketTimeout
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
	at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:749)
	at org.apache.hadoop.ipc.Client.call(Client.java:1351)
	at org.apache.hadoop.ipc.Client.call(Client.java:1300)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
	at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:688)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
	at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1796)
	at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1116)
	at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1112)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1112)
	at org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:1701)
	at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1647)
	at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1622)
	at org.apache.hadoop.fs.shell.PathData.expandAsGlob(PathData.java:326)
	at org.apache.hadoop.fs.shell.Command.expandArgument(Command.java:224)
	at org.apache.hadoop.fs.shell.Command.expandArguments(Command.java:207)
	at org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:190)
	at org.apache.hadoop.fs.shell.Command.run(Command.java:154)
	at org.apache.hadoop.fs.FsShell.run(FsShell.java:255)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
	at org.apache.hadoop.fs.FsShell.main(FsShell.java:305)
Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending local=/10.110.127.23:45865 remote=hdw1.phd.local/172.28.17.4:8020]
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:532)
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
	at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:547)
	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:642)
	at org.apache.hadoop.ipc.Client$Connection.access$2600(Client.java:314)
	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1399)
	at org.apache.hadoop.ipc.Client.call(Client.java:1318)
	... 28 more
ls: Call From etl1.phd.local/10.110.127.23 to hdw1.phd.local:8020 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending local=/10.110.127.23:45865 remote=hdw1.phd.local/172.28.17.4:8020]; For more details see:  http://wiki.apache.org/hadoop/SocketTimeout

Cause

In this case the user's Kerberos principal included the hostname "etl1.phd.local". When a Kerberos principal includes a hostname, HDFS resolves that hostname to an IP address and binds the client socket to the interface that owns the resolved address, regardless of which network the NameNode hostname resolves to. This is by design as per HDFS-7215.

[gpadmin@etl1 ~]$ klist
Ticket cache: FILE:/tmp/krb5cc_500
Default principal: gpadmin/etl1.phd.local@PHD.LOCAL

Valid starting     Expires            Service principal
08/18/14 11:57:12  08/19/14 11:57:12  krbtgt/PHD.LOCAL@PHD.LOCAL
	renew until 08/18/14 11:57:12
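
To see which local interface the secure HDFS client will bind to, compare how the principal's hostname and the NameNode hostname resolve. The commands below use the hostnames and addresses taken from the error above; getent queries the local name service, so actual output depends on the DNS and /etc/hosts configuration:

[gpadmin@etl1 ~]$ getent hosts etl1.phd.local
10.110.127.23   etl1.phd.local
[gpadmin@etl1 ~]$ getent hosts hdw1.phd.local
172.28.17.4     hdw1.phd.local

The client binds to the first address, while the NameNode listens on the second.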

In this environment the ETL client attempts to connect to the NameNode on the public 172.x network (VLAN0) from the private 10.x network (VLAN123), because the client socket is bound to the interface that the principal hostname resolves to. The initial connection attempt reaches the NameNode, but the NameNode sends the TCP SYN/ACK back through its public default route, which has no route back to the client's 10.x address on VLAN123, so the call eventually times out.
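
One way to confirm the asymmetric return path is to ask the NameNode host which route it would use to reply to the client's bound address. The address below comes from the error above; the prompt and the selected route depend on the actual host and its routing table:

[gpadmin@hdw1 ~]$ ip route get 10.110.127.23

If the chosen route points at the public default gateway rather than at a gateway or interface on the 10.x network, the SYN/ACK never reaches the client and the hdfs command times out as shown in the Symptom section.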

Network Environment Details

  • Interface 1: VLAN123, 10.x private network
  • Interface 2: VLAN0, 172.x public network
  • The NameNode hostname hdw1.phd.local resolves to the 172.x public network
  • The client hostname etl1.phd.local resolves to the 10.x private network (see the interface check below)
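
To confirm which networks the client is attached to, list its IPv4 addresses on the ETL node; interface names and addresses will differ per system. The address that etl1.phd.local resolves to (10.110.127.23 in this example) is the one the secure HDFS client binds to:

[gpadmin@etl1 ~]$ ip -4 addr show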

Fix

Make sure all nodes in the cluster use the same network, or that traffic can be routed between the networks. As a workaround, use a user principal without a hostname, such as gpadmin@PHD.LOCAL, which forces HDFS to skip hostname resolution when security is enabled.
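
For example, assuming a host-less principal such as gpadmin@PHD.LOCAL already exists in the KDC (it can be added with kadmin's addprinc if it does not), re-authenticate with it and retry the listing:

[gpadmin@etl1 ~]$ kdestroy
[gpadmin@etl1 ~]$ kinit gpadmin@PHD.LOCAL
Password for gpadmin@PHD.LOCAL:
[gpadmin@etl1 ~]$ hdfs dfs -ls /

With no hostname in the principal, the client does not bind itself to the address of etl1.phd.local, and normal routing to the NameNode applies.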
