Product: Pivotal Cloud Foundry® (PCF)
After applying changes during an upgrade from PCF 1.7 to PCF 1.8, you see the following error:
```
Started updating job cloud_controller_worker > cloud_controller_worker/0 (b10c2083-b161-4817-9ae9-834537c16a17) (canary). Failed: Action Failed get_task: Task 3b027294-f309-43e4-4910-2632bf732cb3 result: Unmonitoring services: Unmonitoring service nfs_mounter: Sending unmonitor request to monit: Post http://127.0.0.1:2822/nfs_mounter: net/http: request canceled (00:05:21)
```
After seeing this error, you might see the Cloud Controller VM or Cloud Controller Worker VM in a failed/unresponsive state when you run `bosh vms`. To confirm, do one of the following:

- Run `bosh instances --ps`
- `bosh ssh` into either of those VMs, then run `sudo monit summary`
After doing so, you'll see the Monit processes in a state similar to this example:
```
root@7518265d-d108-4c53-9150-28e901e0cab2:/var/vcap/bosh_ssh/bosh_bcb81anuv# monit summary
The Monit daemon 5.2.5 uptime: 4d 20h 48m
Process 'consul_agent'                     running - unmonitor pending
Process 'cloud_controller_ng'              running - unmonitor pending
Process 'cloud_controller_worker_local_1'  running - unmonitor pending
Process 'cloud_controller_worker_local_2'  running - unmonitor pending
Process 'nginx_cc'                         running - unmonitor pending
Process 'cloud_controller_migration'       running - unmonitor pending
Process 'metron_agent'                     running - unmonitor pending
Process 'route_registrar'                  running - unmonitor pending
File 'nfs_mounter'                         accessible - unmonitor pending
Process 'statsd-injector'                  running - unmonitor pending
System 'system_localhost'                  running
```
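To spot stuck processes quickly, you can grep the `monit summary` output for `unmonitor pending`. The snippet below is a minimal sketch that inlines a shortened sample of the output for illustration; on a real VM you would pipe `sudo monit summary` into the same grep instead.

```shell
#!/bin/sh
# Shortened sample of `monit summary` output, inlined for illustration.
# On a real VM, replace this with:  summary=$(sudo monit summary)
summary=$(cat <<'EOF'
Process 'consul_agent' running - unmonitor pending
Process 'cloud_controller_ng' running - unmonitor pending
File 'nfs_mounter' accessible - unmonitor pending
System 'system_localhost' running
EOF
)

# Count the lines stuck in "unmonitor pending".
stuck=$(printf '%s\n' "$summary" | grep -c 'unmonitor pending')
echo "stuck=$stuck"
```

A non-zero count on the Cloud Controller or Cloud Controller Worker VM indicates the hang described in this article.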
In certain cases, the nfs_mounter job can cause Monit to enter the state shown above when the NFS server is unavailable. Monit is single-threaded, so it hangs waiting on an NFS mount that no longer exists. Normally Monit would recover once the server became available again, but in PCF 1.8 it cannot, because the NFS server is no longer a PCF component.
To get the VM Monit processes out of the `running - unmonitor pending` state, perform the following steps:
- Disable the resurrector for the installation by running `bosh vm resurrection off`.
- Power off the afflicted VM(s) in your IaaS.
- Delete the afflicted VM(s) in your IaaS.
- Run `bosh cloudcheck`. Cloudcheck scans the deployment for problematic VM(s); the VM(s) you powered off and deleted in the IaaS should appear in its list. Choose the `delete reference` option when prompted for the VM(s) you removed in the IaaS.
- Re-enable the resurrector for the installation by running `bosh vm resurrection on`.
The afflicted VM(s) are now completely removed from your environment and can be recreated. To do so, go into Ops Manager and click Apply Changes so that BOSH recreates the problematic VM(s) in a healthy state.
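Putting the steps together, the sequence can be sketched as a small helper that only *prints* the plan for review before you run anything for real. The `print_recovery_plan` function is illustrative, not part of BOSH or PCF; the manual IaaS and Ops Manager steps appear as comments in the printed output.

```shell
#!/bin/sh
# Illustrative helper: prints the recovery command sequence so it can be
# reviewed (and the manual IaaS steps done by hand) before execution.
print_recovery_plan() {
  cat <<'EOF'
bosh vm resurrection off
# -> power off, then delete, the afflicted VM(s) in your IaaS console
bosh cloudcheck
# -> choose the "delete reference" option for the VM(s) removed in the IaaS
bosh vm resurrection on
# -> in Ops Manager, Apply Changes to recreate the VM(s)
EOF
}

print_recovery_plan
```

Keeping the resurrector disabled until cloudcheck finishes matters: otherwise BOSH may try to resurrect the VM(s) you are deliberately deleting.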
This issue was observed in a PCF 1.7 -> PCF 1.8 upgrade on AWS.
If Monit is hanging for a different job, see the KB article "How to recover Monit processes from a `running - unmonitor pending` or hung status" for general recovery instructions.