Pivotal Knowledge Base

Upgrade from PCF 1.7 to PCF 1.8 Fails with "Sending unmonitor request to monit" Error

Environment

 Product                        Version
 Pivotal Cloud Foundry® (PCF)   1.8.x

Symptoms

After applying changes when attempting to upgrade PCF 1.7 to PCF 1.8, you see the following error:

Started updating job cloud_controller_worker > cloud_controller_worker/0 (b10c2083-b161-4817-9ae9-834537c16a17) (canary). Failed: Action Failed get_task: Task 3b027294-f309-43e4-4910-2632bf732cb3 result: Unmonitoring services: Unmonitoring service nfs_mounter: Sending unmonitor request to monit: Post http://127.0.0.1:2822/nfs_mounter: net/http: request canceled (00:05:21)

After seeing this error, you might see the Cloud Controller VM or Cloud Controller Worker VM in a failed or unresponsive state when you run `bosh vms`. To inspect the Monit processes on those VMs, do one of the following:

  • Run `bosh instances --ps`
  • Run `bosh ssh` into either of those VMs, then run `sudo monit summary`

After doing so, you'll see the Monit processes in a state similar to this example:

root@7518265d-d108-4c53-9150-28e901e0cab2:/var/vcap/bosh_ssh/bosh_bcb81anuv# monit summary

The Monit daemon 5.2.5 uptime: 4d 20h 48m

Process 'consul_agent' running - unmonitor pending
Process 'cloud_controller_ng' running - unmonitor pending
Process 'cloud_controller_worker_local_1' running - unmonitor pending
Process 'cloud_controller_worker_local_2' running - unmonitor pending
Process 'nginx_cc' running - unmonitor pending
Process 'cloud_controller_migration' running - unmonitor pending
Process 'metron_agent' running - unmonitor pending
Process 'route_registrar' running - unmonitor pending
File 'nfs_mounter' accessible - unmonitor pending
Process 'statsd-injector' running - unmonitor pending
System 'system_localhost' running
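
To check a VM for this condition quickly, the `monit summary` output can be filtered for the affected entries. The following is a minimal sketch, not an official tool: the function name `pending_services` is hypothetical, and it assumes the output format matches the example above, with each service name wrapped in single quotes.

```shell
#!/usr/bin/env bash
# Print the names of Monit entries whose status contains "unmonitor pending".
# Reads `monit summary` output on stdin. The function name is hypothetical.
pending_services() {
  # Split each line on single quotes so that field 2 is the service name.
  awk -F"'" '/unmonitor pending/ { print $2 }'
}

# On an affected VM (after `bosh ssh`), usage might look like:
#   sudo monit summary | pending_services
```

If the function prints nothing, no entry is in the `unmonitor pending` state and this article likely does not apply to that VM.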

Cause

In certain cases, the NFS Mounter job can cause Monit to enter the state shown in the example above when the NFS server is unavailable. Monit is single-threaded, so it hangs while waiting on an NFS mount that no longer exists. Normally Monit recovers once the NFS server becomes available again, but that does not happen during this upgrade because the NFS server is not a component of PCF 1.8.

Resolution

To get the VM Monit processes out of the `running - unmonitor pending` state, perform the following steps:

  • Disable the resurrector for the installation by running `bosh vm resurrection disable`
  • Power off the afflicted VM(s) in your IaaS
  • Delete the afflicted VM(s) in your IaaS
  • Run `bosh cloudcheck`. It scans the deployment for problematic VM(s), and the VM(s) you powered off and deleted in the IaaS should appear in the list it reports. When prompted, pick the `delete reference` option for each VM you removed in the IaaS.
  • Enable the resurrector for the installation by running `bosh vm resurrection enable`

The afflicted VM(s) are now completely removed from your environment and can be recreated. To do so, go to Ops Manager and apply changes so that BOSH recreates the problematic VM(s) in a healthy state.
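
The order of the steps above can be sketched as a small script. This is an illustration under stated assumptions, not a supported tool: the helpers `run`, `pause`, and `recover_vms` and the `DRY_RUN` variable are hypothetical, and only the `bosh` commands themselves come from this article. By default the sketch prints the commands instead of executing them. The IaaS steps and the cloudcheck prompts remain manual.

```shell
#!/usr/bin/env bash
# Sketch of the recovery sequence. DRY_RUN=1 (the default) only prints the
# commands; set DRY_RUN=0 to execute them. All helper names are hypothetical.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Announce a manual step and, when executing for real, wait for the operator.
pause() {
  echo "MANUAL: $1"
  if [ "$DRY_RUN" != "1" ]; then
    read -r -p "Press Enter when done: " _
  fi
}

recover_vms() {
  run bosh vm resurrection disable || return 1
  pause "Power off the afflicted VM(s) in your IaaS"
  pause "Delete the afflicted VM(s) in your IaaS"
  # cloudcheck is interactive: pick `delete reference` for the removed VM(s).
  run bosh cloudcheck || return 1
  run bosh vm resurrection enable || return 1
  echo "MANUAL: Apply changes in Ops Manager to recreate the VM(s)"
}

# Usage: recover_vms            (prints the plan)
#        DRY_RUN=0 recover_vms  (executes the bosh commands)
```

Running it in dry-run mode first gives a checklist to review before touching the deployment. Note that if a step fails midway in a real run, the resurrector may still be disabled and must be re-enabled by hand.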

Additional Information

This issue was observed during a PCF 1.7 to PCF 1.8 upgrade on AWS.

If Monit is hanging for a different job, see this KB for general instructions on recovering Monit: How to recover Monit processes from a `running - unmonitor pending` or hung status

 
