Pivotal Knowledge Base

Bosh commands are returning 'unresponsive agent'

Environment

Pivotal Cloud Foundry (PCF) Elastic Runtime all versions

Symptom

  • lsof -p <bosh agent pid> shows an established connection to the director on port 25777 from the agent instance. This TCP session should close immediately after the director returns the HTTP response. If the connection lingers for an extended period, the agent is likely hung.
bosh-agen 2707 root 6u IPv4 248913 0t0 TCP mysql-proxy-0.node.dc1.cf.internal:44934->192.168.10.41:25777 (ESTABLISHED)
  • BOSH agent logs show no updates for several hours or days, depending on when the hang started.
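
The first symptom can be checked with a small filter over the lsof output. The following is a minimal sketch, not part of the original procedure: the function name filter_director_sessions and the DIRECTOR_PORT variable are made up for illustration, and the port defaults to 25777 as stated above.

```shell
#!/bin/sh
# Hypothetical helper: reads lsof output on stdin and prints only the lines
# that show an ESTABLISHED TCP session whose remote end is the director port.
# DIRECTOR_PORT is an illustrative variable; it defaults to 25777.
filter_director_sessions() {
    grep -E -e "->[^ ]+:${DIRECTOR_PORT:-25777} \(ESTABLISHED\)"
}

# Usage on the affected VM, as root:
#   lsof -p <bosh agent pid> | filter_director_sessions
```

Running the check twice, some time apart, and seeing the same local source port in both results suggests that the session is lingering rather than being reopened for each request.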

Cause

The BOSH agent hangs when sending an API request to the director. The HTTP request sent by the agent never receives a response, so the agent waits indefinitely, which blocks agent heartbeats from being sent to the director.

From the OS perspective, the kernel on the agent VM still reports the TCP socket to port 25777 as established. On the director side, however, the session no longer exists, so the HTTP transport waits indefinitely for the TCP session to close or for a response to be returned.

Resolution

SSH directly to each affected VM from the Ops Manager VM and kill the BOSH agent process as follows:

ssh vcap@
sudo -i
ps aux |grep agent
ps -ef |grep bosh-agent| grep -v grep
root       866   857  0 Jul04 ?        00:26:38 /var/vcap/bosh/bin/bosh-agent -P ubuntu -C /var/vcap/bosh/agent.json
kill -9 866

The BOSH agent is recovered automatically and starts with a new PID.
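
In ps -ef output the second column is the PID and the third is the parent PID, so it is the second column that identifies the agent process. As a rough sketch, the PID can be extracted with awk and compared before and after the kill to confirm the restart. The function name agent_pids_from_ps is made up for illustration, and the sketch assumes the standard ps -ef column layout in which the command is the eighth field.

```shell
#!/bin/sh
# Hypothetical helper: reads `ps -ef` output on stdin and prints the PID
# (second column) of every process whose executable path ends in bosh-agent.
# Assumes the usual layout: UID PID PPID C STIME TTY TIME CMD.
agent_pids_from_ps() {
    awk '$8 ~ /bosh-agent$/ { print $2 }'
}

# Usage on the affected VM, as root:
#   ps -ef | agent_pids_from_ps     # note the PID before the kill
#   ps -ef | agent_pids_from_ps     # run again afterwards; the PID should differ
```

Matching on the eighth field rather than the whole line also excludes the grep process itself, so no extra grep -v grep is needed.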

Additional Information

NATS is the message bus used for cross-component communication between the director and the agents.

Messages sent over NATS are responsible for the following:

  1. Performing provisioning instructions on the VMs
  2. Informing the Health Monitor about changes in the health of monitored processes

For further information, please refer to the following resources: 

https://bosh.io/docs/bosh-components.html#nats
