Pivotal Cloud Foundry (PCF) all versions
BOSH commands appear to hang indefinitely or time out. Running the following command reveals hundreds or thousands of scan and fix tasks:
bosh -e director tasks --no-filter
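With a long task list, counting the matching rows can be quicker than reading the table. The helper below is a minimal sketch: `count_scan_and_fix` is a hypothetical name, and it assumes each task is printed on one line with the words "scan and fix" in its description.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count task rows that mention "scan and fix".
# Reads the task listing on stdin and prints a single number.
count_scan_and_fix() {
  # grep -c prints the count; "|| true" keeps a zero count from
  # being treated as a failure by the calling shell.
  grep -ci 'scan and fix' || true
}

# Usage, assuming the "director" environment alias used in this article:
#   bosh -e director tasks --no-filter | count_scan_and_fix
```

A count in the hundreds or thousands matches the symptom described above.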
This scenario typically appears during a BOSH deployment that generates many tasks while a BOSH agent is intermittently missing heartbeats.
When a BOSH agent misses a heartbeat, the Health Monitor (which runs on the BOSH Director) triggers the creation of a scan and fix task. By the time the scan and fix task executes, the agent has successfully sent a heartbeat again, so the task skips resurrection of the instance. This cycle can repeat hundreds of times, causing the Director task queue to build up. When many long-running deployment tasks are executing at the same time, this can lead to a race condition in which the task queue grows too large.
NOTE: Running bosh stop, start, restart, or recreate may result in undesirable behavior if deployment changes are in progress. When troubleshooting this type of issue, avoid these commands. Instead, use the IaaS or bosh cck to perform these actions.
- SSH into the BOSH Director VM and stop the Health Monitor. This prevents it from creating new tasks. The in-flight scan and fix tasks should time out after about 10 minutes.
monit stop health_monitor
- Next, identify which VM is triggering the scan and fix tasks. This is not always apparent, because bosh vms may report the agent as responsive most of the time and as unresponsive only intermittently.
- Check the output of bosh vms to see if you can quickly identify unresponsive agents.
bosh -e director vms --details
- If the output of bosh vms does not reveal which VM is responsible (for example, because there are many VMs), SSH into the BOSH Director and review the Health Monitor logs. Search for warnings like the following:
WARN : (Resurrector) notifying director to recreate unresponsive VM: cf-300c3738aa8b3ad21fca router/07a82795-821f-4466-9509-f19ac2caf927
- On the suspect VM, basic resource checks may help explain the missed heartbeats, for example:
- df -h
- ps aux
- ps -ef
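When the Health Monitor log contains many Resurrector warnings, tallying them per instance can show which VM triggers most of the scan and fix tasks. The following is a minimal sketch: `tally_unresponsive` is a hypothetical helper, it assumes the instance name is the last field of each warning (as in the sample warning above), and the log path in the usage comment is an assumption that may differ on your Director.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count Resurrector "recreate unresponsive VM"
# warnings per instance. Reads log lines on stdin and prints
# "count instance", most frequent first.
tally_unresponsive() {
  # Keep only the Resurrector warnings and extract the last field,
  # which is assumed to be the instance (for example router/<guid>).
  awk '/notifying director to recreate unresponsive VM/ { print $NF }' \
    | sort | uniq -c | sort -rn \
    | awk '{ print $1, $2 }'   # strip the padding added by uniq -c
}

# Usage on the BOSH Director (log path is an assumption):
#   tally_unresponsive < /var/vcap/sys/log/health_monitor/health_monitor.log
```

The instance at the top of the output is the most likely source of the repeated scan and fix tasks.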
BOSH will soon have a task rate limiting capability, which will prevent this type of issue from happening in the future.