Pivotal Knowledge Base

Ambari Agent Doesn't Start and No Error Is Reported in the Logs

Environment

 Product                                      Version
 Hortonworks Data Platform (HDP)              All supported versions
 OS                                           RHEL 6.x
 Ambari                                       2.x
 NFS mount points present in affected host    All

Symptom

When you attempt to start the Ambari agent, the command returns no error, but subsequent ambari-agent status calls return the error "Agent not running."

Analysis Process

After running ambari-agent start at least once, the /var/log/ambari-agent/ambari-agent.log file shows only a few lines for each attempted start:

INFO 2017-03-24 10:37:53,103 main.py:90 - loglevel=logging.INFO
INFO 2017-03-24 10:37:53,103 main.py:90 - loglevel=logging.INFO
INFO 2017-03-24 10:37:53,103 main.py:90 - loglevel=logging.INFO
INFO 2017-03-24 10:37:53,104 DataCleaner.py:39 - Data cleanup thread started
INFO 2017-03-24 10:37:53,106 DataCleaner.py:120 - Data cleanup started
INFO 2017-03-24 10:37:53,108 DataCleaner.py:122 - Data cleanup finished
  1. The logs show no errors, but they also do not show a complete startup sequence.
  2. Confirm whether NFS mount points or other network storage are attached to this machine.
  3. Search for fuser processes in uninterruptible sleep using the command ps -flye | grep fuser, and check whether the output resembles the example below (multiple fuser processes in the 'D' state):
D root 513702 513701 0 80 0 2300 25605 rpc_wa 13:09 pts/14 00:00:00 fuser 8670 tcp
S root 521174 1 0 80 0 1264 2825 wait 13:47 pts/14 00:00:00 /bin/sh -c fuser 8670/tcp 2>/dev/null | awk '{print $2}'
D root 521175 521174 0 80 0 2296 1396 rpc_wa 13:47 pts/14 00:00:00 fuser 8670 tcp
S root 521929 521921 0 80 0 1264 2825 wait 13:56 pts/14 00:00:00 /bin/sh -c fuser 8670/tcp 2>/dev/null | awk '{print $2}'
D root 521930 521929 0 80 0 2292 1396 rpc_wa 13:56 pts/14 00:00:00 fuser 8670 tcp
S gpadmin 523226 523026 0 80 0 904 25812 pipe_w 14:11 pts/15 00:00:00 grep fuser
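
Steps 2 and 3 can also be scripted. The following is a minimal sketch, not part of the original procedure: the function names list_nfs_mounts and filter_d_state_fuser are hypothetical, and it assumes a Linux host where /proc/mounts is readable and ps accepts the -eo stat,pid,ppid,comm format.

```shell
#!/bin/sh
# Hypothetical helpers for steps 2 and 3 (names are illustrative only).

# Step 2: print NFS mount points from a mounts table.
# Reads /proc/mounts by default; fields are: device mountpoint fstype ...
list_nfs_mounts() {
    awk '$3 == "nfs" || $3 == "nfs4" {print $2 " (" $3 ")"}' "${1:-/proc/mounts}"
}

# Step 3: read "ps -eo stat,pid,ppid,comm" output on stdin and print the
# PIDs of fuser processes whose state starts with D (uninterruptible sleep).
filter_d_state_fuser() {
    awk '$1 ~ /^D/ && $4 == "fuser" {print $2}'
}
```

For example, running list_nfs_mounts prints one line per NFS mount, and ps -eo stat,pid,ppid,comm | filter_d_state_fuser prints the PID of each stuck fuser process. Empty output from the second command suggests this article does not apply to the host.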

This behavior indicates an OS-level issue related to NFS. In this scenario, the affected host was experiencing NFS problems.

Cause

The Ambari Agent startup process relies on the fuser command to obtain the PID of the Agent (the ps output above shows the call fuser 8670/tcp). According to a Red Hat article, a bug causes this command to get stuck in an infinite loop, so the startup process for the agent never completes.

Resolution

The fix for this issue is to reboot the affected server; this is the only way to clear the stuck fuser processes and recover. During the reboot, expect some errors while unmounting the NFS or other network-attached storage. A hard reboot of the host may be needed.
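
After the reboot, the agent can be started and checked again. The sketch below is illustrative only: reports_not_running is a hypothetical helper, and it assumes the status output contains the words "not running" as quoted in the Symptom section. Adjust the pattern if your Ambari version words the message differently.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) when the status text on stdin
# contains "not running". The wording is an assumption based on the
# error quoted in the Symptom section.
reports_not_running() {
    grep -qi 'not running'
}
```

A possible use is: ambari-agent start, wait a few seconds, then run ambari-agent status 2>&1 | reports_not_running. If that pipeline succeeds, the agent still reports that it is not running and the fuser check from the Analysis Process section should be repeated.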
