| Product | Affected Versions |
| --- | --- |
| Pivotal Cloud Foundry® (PCF) | 1.5.x, 1.6.x (fixed in 1.7.x) |
This KB article provides general guidance on how to recover a pre-1.7.x PCF environment that has been vMotioned through vSphere.
This situation arises when a customer uses storage vMotion to move existing persistent disks to a new datastore in vSphere. PCF versions prior to 1.7 will continue running until any of the VMs are restarted or changes are applied through Ops Manager. The root cause is that running a storage vMotion on a datastore containing PCF persistent disks causes vSphere to change the disk IDs, so the BOSH Director loses its reference to each disk because the ID it recorded no longer matches.
Other symptoms we have observed include a vMotioned environment showing a duplicate of every VM in the vSphere dashboard, and `bosh cck` runs that could neither finish the cloudcheck nor resolve the reported issues. In those cases the system was not down, but no changes could be applied to it successfully.
If you run `bosh cck` in an environment and see the following error, it may have been vMotioned:
```
Problem 1 of 1: Disk `disk-73a26ebb-8bd6-4e42-95d1-6ec3a9774601' (maximus-partition-808c7e9f411bf6c948be/0, 1024M) is missing.
  - Skip for now (PICK THIS OPTION)
  - Delete disk reference (DANGEROUS!)
```
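Before working through the recovery, it can help to collect all of the missing disk CIDs in one place. The sketch below (assuming the `bosh cck --report` output format shown above, with a sample report inlined for illustration) pulls the CIDs out of a saved cloudcheck report so they can be matched against `bosh instances --details`:

```shell
# Sample cloudcheck output; in practice, capture it with:
#   bosh cck --report > report.txt
report=$(cat <<'EOF'
Problem 1 of 1: Disk `disk-73a26ebb-8bd6-4e42-95d1-6ec3a9774601' (maximus-partition-808c7e9f411bf6c948be/0, 1024M) is missing.
EOF
)

# Extract every CID that looks like `disk-…' from "is missing" lines
missing_disks=$(printf '%s\n' "$report" \
  | grep "is missing" \
  | sed -n "s/.*Disk \`\(disk-[0-9a-f-]*\)'.*/\1/p")

echo "$missing_disks"
```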
The following steps have been used to recover PCF 1.5 and 1.6 environments:
DISCLAIMER: This recovery method requires VM shutdown (i.e., it is disruptive).
- Run `bosh cck --report` to see the name of the missing disks
- Run `bosh instances --details` to get the disk names for your VMs
- Run `bosh vm resurrection disabled` to prevent bosh from trying to recreate the VMs you will power off in vSphere. Optionally, you can disable resurrection only for the individual VMs you will power off (more time-consuming)
- Locate the disks among the datastores in vSphere. To do this:
- Locate the migrated VM
- Verify that vSphere isn't configured to terminate powered-off VMs!
- Power off the VM
- Locate its persistent disk (right-click the VM, select “Edit Settings”, go to the “Hardware” section, select “Hard disk 3”, and check the value of the “Disk File” field)
- Move it back to original persistent disk folder (which is configured at Ops Manager -> Ops Manager Director)
- Rename the disk back to the original name that the BOSH Director knows (a disk can only be renamed via the vSphere CLI; see the related KB article for this)
- `bosh recreate` the job.
- Find the disks that bosh claims are missing, and shut down the VMs corresponding to those disks.
- Once the VMs are shut down, you should be able to move the disks back to their original datastore.
- Once the disks are back in their original datastore, you can bring the VMs back online.
- Run `bosh cck` again and it will reboot/recreate the VMs that were down because of missing disks.
- Run `bosh vm resurrection enabled` to restore VM resurrection to the environment.
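The disk-rename step above can be sketched as follows. The folder path, datastore names, and file names here are hypothetical placeholders; substitute the persistent-disk folder configured in Ops Manager Director and the CID reported by `bosh cck`. Note that `vmkfstools` must be run from the ESXi shell, and its `-E` flag renames a virtual disk together with its descriptor file:

```shell
# Hypothetical helper: build the path the BOSH Director expects for a
# persistent disk, given the persistent-disk folder and the disk CID
# reported by `bosh cck`.
expected_disk_path() {
  local persistent_dir="$1" disk_cid="$2"
  printf '%s/%s.vmdk\n' "$persistent_dir" "$disk_cid"
}

# From the ESXi shell, rename the migrated disk back to the name BOSH
# expects (paths below are examples only):
#   vmkfstools -E "/vmfs/volumes/orig_ds/pcf_disk/migrated_name.vmdk" \
#     "$(expected_disk_path /vmfs/volumes/orig_ds/pcf_disk disk-73a26ebb-8bd6-4e42-95d1-6ec3a9774601)"
```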