Pivotal Cloud Foundry (PCF) Elastic Runtime, all versions
- lsof -p <bosh agent pid> shows an established connection from the agent instance to the director on port 25777. This TCP session should close immediately after the director returns the HTTP response. If the connection lingers for an extended period, the agent is likely hung.
bosh-agent 2707 root 6u IPv4 248913 0t0 TCP mysql-proxy-0.node.dc1.cf.internal:44934->192.168.10.41:25777 (ESTABLISHED)
- BOSH agent logs show no updates for several hours or days, depending on when the hang started.
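The two symptoms above can be checked with a couple of small shell helpers. This is a minimal sketch, not an official tool: the function names `director_conns` and `log_is_stale` are hypothetical, and the agent log path shown in the usage comments (`/var/vcap/bosh/log/current`) is an assumption that should be confirmed on your stemcell.

```shell
#!/bin/sh
# director_conns: read lsof output on stdin and print only the lines showing
# an ESTABLISHED TCP session whose remote end is port 25777.
director_conns() {
  grep -E -e '->[^ ]*:25777 \(ESTABLISHED\)'
}

# log_is_stale FILE MINUTES: succeed if FILE was last modified more than
# MINUTES minutes ago. A missing file is reported as "not stale".
log_is_stale() {
  [ -n "$(find "$1" -mmin "+$2" 2>/dev/null)" ]
}

# Example usage on the affected VM, as root (assumes one bosh-agent process):
#   lsof -nP -a -i TCP -p "$(pgrep -o -x bosh-agent)" | director_conns
#   sleep 60
#   lsof -nP -a -i TCP -p "$(pgrep -o -x bosh-agent)" | director_conns
#   log_is_stale /var/vcap/bosh/log/current 60 && echo "agent log is stale"
```

If both lsof samples show the same local port still connected to port 25777, the session has lingered for at least the sampling interval.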
The BOSH agent hangs while sending an API request to the director. For an undetermined reason, the HTTP request sent by the agent never receives a response, and the agent waits indefinitely, which blocks agent heartbeats from being sent to the director.
From the OS perspective, the kernel on the agent VM still reports the TCP socket to port 25777 as established. On the director side, however, the session no longer exists, so the HTTP transport waits indefinitely for the TCP session to close or for a response to be returned.
This is fixed in the latest stemcell release. The BOSH agent now times out aggressively on NATS connection failure after 5 minutes.
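The fix itself lives inside the agent, but the general idea of bounding a wait instead of blocking forever can be illustrated in shell. The sketch below is only an analogy, not the agent's implementation: `bounded_call` is a hypothetical wrapper around the GNU coreutils `timeout` command, which exits with status 124 when the time limit is reached.

```shell
#!/bin/sh
# bounded_call SECONDS COMMAND [ARGS...]
# Run COMMAND, but give up after SECONDS instead of waiting indefinitely.
# Returns the command's own exit status, or 124 if the limit was hit.
bounded_call() {
  limit=$1
  shift
  status=0
  timeout "$limit" "$@" || status=$?
  if [ "$status" -eq 124 ]; then
    echo "gave up after ${limit}s: $*" >&2
  fi
  return "$status"
}

# Example: allow a blocking command at most 300 seconds (5 minutes).
#   bounded_call 300 some_blocking_command
```

A caller that sees status 124 can tear down the connection and retry, rather than holding up unrelated work such as heartbeats.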
As a workaround, SSH directly to each affected VM from the Ops Manager VM and kill the BOSH agent process as follows:
ssh vcap@
sudo -i
ps aux |grep agent
ps -ef |grep bosh-agent| grep -v grep
root 866 857 0 Jul04 ? 00:26:38 /var/vcap/bosh/bin/bosh-agent -P ubuntu -C /var/vcap/bosh/agent.json
kill -9 857
The BOSH agent is recovered automatically and starts under a new PID.
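To confirm that the agent really did come back under a new PID, the process IDs can be pulled out of the ps -ef output before and after the kill. The helper below is a minimal sketch with a hypothetical name, `agent_pid_ppid`. It assumes the standard Linux ps -ef column order (UID, PID, PPID, C, STIME, TTY, TIME, CMD), in which the second column is the PID and the third is the parent PID, so in the sample line above 866 is the PID and 857 is the parent PID. It is worth confirming which of the two a kill command targets before running it.

```shell
#!/bin/sh
# agent_pid_ppid: read `ps -ef` output on stdin and print "PID PPID" for each
# process whose command (8th column) is the bosh-agent binary.
# Matching on the command column keeps this awk process itself out of the result.
agent_pid_ppid() {
  awk '$8 == "/var/vcap/bosh/bin/bosh-agent" { print $2, $3 }'
}

# Example usage on the affected VM:
#   before=$(ps -ef | agent_pid_ppid)
#   ... kill the agent as described above ...
#   after=$(ps -ef | agent_pid_ppid)
#   [ "$before" != "$after" ] && echo "agent restarted: $after"
```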
NATS is the message bus used for cross-component communication.
BOSH agents are responsible for the following:
- Performing provisioning instructions on the VMs
- Informing the Health Monitor about changes in the health of monitored processes
For further information, please refer to the following resources:
For similar issues, refer to the article Bosh Deployment Fails with "unresponsive agent" Error due to Unresponsive VMs.