ResourceManager is down in an HDFS HA configuration

Environment

Product           Version
Pivotal HD (PHD)  3.0.x
ZooKeeper         3.4.6
YARN              2.6.x

Overview

In a Pivotal HD Hadoop Distributed File System (HDFS) high availability (HA) configuration, all ResourceManagers and NodeManagers are down.

Symptom 

The ResourceManager logs show that the connection to the ZooKeeper nodes has failed. In this example, there are two ResourceManagers, and each ResourceManager log shows that the connection to ZooKeeper was lost at the same time:

From ResourceManager1 - /var/log/hadoop-yarn/yarn/yarn-yarn-resourcemanager-<hostname>.log:

2016-03-21 00:42:49,116 INFO zookeeper.ClientCnxn (ClientCnxn.java:run(1096)) - Client session timed out, have not heard from server in 6667ms for sessionid 0x1537efac1080000, closing socket connection and attempting reconnect
2016-03-21 00:42:49,219 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:processWatchEvent(558)) - Session disconnected. Entering neutral mode...
<...>
2016-03-21 00:43:24,552 FATAL resourcemanager.ResourceManager (ResourceManager.java:handle(753)) - Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type EMBEDDED_ELECTOR_FAILED. Cause:
Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors.
2016-03-21 00:43:24,554 INFO util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1
2016-03-21 00:43:24,554 INFO zookeeper.ClientCnxn (ClientCnxn.java:run(512)) - EventThread shut down

From ResourceManager2 - /var/log/hadoop-yarn/yarn/yarn-yarn-resourcemanager-<hostname>.log:

2016-03-21 00:42:47,202 INFO zookeeper.ClientCnxn (ClientCnxn.java:run(1096)) - Client session timed out, have not heard from server in 6666ms for sessionid 0x3537ef770710000, closing socket connection and attempting reconnect
2016-03-21 00:42:47,303 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:processWatchEvent(558)) - Session disconnected. Entering neutral mode...
<...>
2016-03-21 00:43:51,060 ERROR ha.ActiveStandbyElector (ActiveStandbyElector.java:waitForZKConnectionEvent(1044)) - Connection timed out: couldn't connect to ZooKeeper in 10000 milliseconds
2016-03-21 00:43:51,982 INFO zookeeper.ZooKeeper (ZooKeeper.java:close(684)) - Session: 0x0 closed
2016-03-21 00:43:51,982 WARN ha.ActiveStandbyElector (ActiveStandbyElector.java:reEstablishSession(748)) - org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
2016-03-21 00:43:51,983 INFO zookeeper.ClientCnxn (ClientCnxn.java:run(512)) - EventThread shut down
2016-03-21 00:43:56,983 FATAL ha.ActiveStandbyElector (ActiveStandbyElector.java:fatalError(642)) - Failed to reEstablish connection with ZooKeeper
2016-03-21 00:43:56,984 FATAL resourcemanager.ResourceManager (ResourceManager.java:handle(753)) - Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type EMBEDDED_ELECTOR_FAILED. Cause:
Failed to reEstablish connection with ZooKeeper
2016-03-21 00:43:56,986 WARN ha.ActiveStandbyElector (ActiveStandbyElector.java:isStaleClient(1006)) - Ignoring stale result from old client with sessionId 0x35396a046590000
2016-03-21 00:43:56,987 INFO zookeeper.ClientCnxn (ClientCnxn.java:run(512)) - EventThread shut down
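
To confirm that both ResourceManagers failed for the same ZooKeeper-related reason, the key errors can be pulled out of the logs with a quick search. The log path below matches the examples above; the actual file name includes the local hostname:

# Run on each ResourceManager host.
grep -E 'Client session timed out|CONNECTIONLOSS|EMBEDDED_ELECTOR_FAILED' /var/log/hadoop-yarn/yarn/yarn-yarn-resourcemanager-*.log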

Cause

In an HA configuration, the ResourceManagers rely on ZooKeeper for leader election, that is, to determine which node should be in control of assigning resources. If ZooKeeper becomes unresponsive or shuts down, neither ResourceManager can tell which node is the active master, and the YARN services shut down automatically.
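
The ZooKeeper quorum and session timeout used by the ResourceManagers come from yarn-site.xml; the 10000 millisecond connection timeout in the log above typically corresponds to yarn.resourcemanager.zk-timeout-ms (default 10000 ms). A quick way to confirm these settings, assuming the usual Ambari-managed configuration directory /etc/hadoop/conf:

# -A1 prints the <value> line that follows each matching <name> line in an
# Ambari-generated yarn-site.xml; adjust the path if your layout differs.
grep -A1 -E 'yarn.resourcemanager.ha.enabled|yarn.resourcemanager.zk-address|yarn.resourcemanager.zk-timeout-ms' /etc/hadoop/conf/yarn-site.xml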

Although it is the ResourceManager that is down, ZooKeeper most likely caused the YARN outage. Investigate why ZooKeeper became unresponsive for that period of time, for example with the health checks below.
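
A quick health check of each ZooKeeper server can be done with the standard four-letter-word commands and zkServer.sh. The hostname zk1.example.com and port 2181 below are placeholders for your ZooKeeper hosts and client port:

# From any node that can reach the ZooKeeper client port:
echo ruok | nc zk1.example.com 2181    # a healthy server replies "imok"
echo srvr | nc zk1.example.com 2181    # reports mode (leader/follower), latency, and connection counts

# On each ZooKeeper host (the script location varies by distribution):
zkServer.sh status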

Resolution

1. Review /var/log/zookeeper/zookeeper.out and /var/log/messages to understand why ZooKeeper became unresponsive (example searches follow this list). The most common reasons are:

  • ZooKeeper services were manually shut down or restarted.
  • Network issues arose between the ZooKeeper nodes and the ResourceManager nodes.
  • I/O issues arose on the ZooKeeper nodes. ZooKeeper is very sensitive to I/O latency, so we recommend that each ZooKeeper instance have its own dedicated disk for reads and writes to avoid I/O contention.
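
For example, to pull the ZooKeeper and system log entries around the time of the session loss (the timestamps below match the example logs above; adjust them to your incident window):

# ZooKeeper server log around the incident.
grep '2016-03-21 00:4' /var/log/zookeeper/zookeeper.out

# System log (syslog timestamp format) for the same window; look for OOM
# killer activity, disk errors, or network flaps.
grep 'Mar 21 00:4' /var/log/messages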

2. Once the ZooKeeper issues have been resolved and ZooKeeper is running correctly, restart the ResourceManagers via Ambari.
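
After the restart, verify the HA state of each ResourceManager from the command line. The rm-ids rm1 and rm2 below are common values for yarn.resourcemanager.ha.rm-ids and may differ on your cluster:

# One ResourceManager should report "active" and the other "standby".
yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2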

3. Restart NodeManagers via Ambari.
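
Once the NodeManagers are back up, confirm that they have re-registered with the active ResourceManager:

# Lists NodeManagers in the RUNNING state; the count should match the number
# of NodeManager hosts in the cluster.
yarn node -list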

 

 
