Pivotal Knowledge Base

Azure Reboot Maintenance for Pivotal Cloud Foundry

Environment

  • Pivotal Cloud Foundry Application Service
  • Pivotal Services, e.g. MySQL, RabbitMQ, Redis, Cloud Cache

Purpose

Microsoft Azure occasionally initiates planned maintenance of the physical hosts in its data center regions. Microsoft describes the process here: https://azure.microsoft.com/en-us/blog/a-new-planned-maintenance-experience-for-your-virtual-machines/

During most maintenance events, VMs are paused in place for a few seconds and not rebooted. Some kinds of maintenance may require the VM to be rebooted in place or migrated to a new host. A migrated VM can be down for as long as 15 minutes, although the expected downtime is much less. VMs retain all of their configured state, such as IP addresses.

PCF on Azure

Pivotal has two categories of customers on Azure:

  1. Customers with the BOSH Resurrector turned on, using the default Availability Sets generated by the CPI
  2. Customers with the BOSH Resurrector turned off, using the default Availability Sets generated by the CPI

Most customers fall into category 1 (Resurrector turned on).

Procedure

Current Recommended Approach

Our current recommended approach is to disable the BOSH Resurrector Plugin during the maintenance window.

Typical backup procedures should be conducted.

The process steps are as follows:

1. Disable the BOSH Resurrector Plugin. This prevents BOSH from triggering VM operations via the CPI that might conflict with the Azure update/maintenance.

a. To disable the BOSH Resurrector Plugin, use this command:

bosh2 -e <your_env> update-resurrection off

2. Singleton jobs can safely be left for Azure to migrate automatically, although the service will be unavailable during the migration. To perform the maintenance manually, at a time of your choosing, use the following steps:

a. Run the following:

bosh2 -e <your_env> -d <deployment> stop <vm>

b. Start the maintenance on the VM using the Azure CLI:

az vm perform-maintenance -g <resource_group> -n <vm_name>

More information about Azure Maintenance commands can be found here: https://docs.microsoft.com/en-us/azure/virtual-machines/linux/maintenance-notifications

c. Power on your VM through the Azure Portal or with the Azure CLI:

az vm start -g <resource_group> -n <vm_name>

BOSH will detect the VM health once the agent reports to the BOSH Director.

3. The rest of the platform components/jobs should already be scaled for HA. In that case, Azure's automatic VM hardware migration (VM reboot) is expected to proceed without any impact to PCF: the HA jobs, using Azure's Availability Sets, roll through using Azure's Update Domains.

4. Turn the BOSH Resurrector Plugin back on after the expected completion of the Azure data center upgrade. The upgrade could take several days.
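
The steps above can be strung together in a small wrapper script. The following is a minimal sketch, not an official Pivotal tool: it assumes bosh2 and az are installed and already authenticated, and the function names, the run helper, and the DRY_RUN variable are all hypothetical names introduced for illustration. Only the bosh2 and az commands themselves come from the procedure.

```shell
#!/usr/bin/env bash
# Sketch only. Assumes bosh2 and az are on PATH and already logged in.
# DRY_RUN and every function name here are illustrative, not part of either CLI.

# Print the command instead of executing it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Step 1: disable the Resurrector before the maintenance window.
disable_resurrector() {
  run bosh2 -e "$1" update-resurrection off
}

# Step 4: re-enable the Resurrector once Azure has finished.
enable_resurrector() {
  run bosh2 -e "$1" update-resurrection on
}

# Steps 2a to 2c: manual maintenance of one singleton VM.
# Arguments: BOSH environment, deployment, BOSH instance,
#            Azure resource group, Azure VM name.
maintain_singleton() {
  local bosh_env="$1"
  local deployment="$2"
  local instance="$3"
  local resource_group="$4"
  local vm_name="$5"

  # 2a. Stop the instance through BOSH.
  run bosh2 -e "$bosh_env" -d "$deployment" stop "$instance" || return 1
  # 2b. Ask Azure to perform the pending maintenance now.
  run az vm perform-maintenance -g "$resource_group" -n "$vm_name" || return 1
  # 2c. Power the VM back on.
  run az vm start -g "$resource_group" -n "$vm_name" || return 1
}
```

With placeholder values, `DRY_RUN=1 maintain_singleton my-env cf mysql/0 my-rg my-vm` prints the three commands without running them, which is a reasonable way to check the arguments before a real maintenance window.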

Customers who are using the Resurrector may experience a race condition in BOSH after a VM reboots. In this situation, BOSH attempts to create a new VM after determining that the rebooting VM is unavailable. When the original VM then completes its reboot, BOSH can potentially enter a deadlock, where a deployment never completes and the lock is not removed. If this is the case, take these steps:

  1. Disable the resurrector plugin: bosh2 -e <your_env> update-resurrection off
  2. Get the task number with bosh2 -e <your_env> tasks
  3. Run bosh2 -e <your_env> cancel-task <task_number>
  4. Run bosh2 -e <your_env> -d <deployment> cloud-check for each deployment to detect any errors.

If that does not properly resolve the BOSH locking issue, please contact Pivotal Support.
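
The recovery steps can be sketched as a single function once the stuck task number is known from `bosh2 -e <your_env> tasks`. This is a minimal sketch under assumptions: recover_bosh_lock, run, and DRY_RUN are hypothetical names, and passing -d to cloud-check is an assumption based on the bosh2 CLI scoping that command to one deployment.

```shell
#!/usr/bin/env bash
# Sketch only. Assumes bosh2 is on PATH and already logged in.
# DRY_RUN, run, and recover_bosh_lock are illustrative names.

# Print the command instead of executing it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Arguments: BOSH environment, number of the stuck task, deployment to check.
recover_bosh_lock() {
  local bosh_env="$1"
  local task_number="$2"
  local deployment="$3"

  # Stop the Resurrector from scheduling further VM recreation.
  run bosh2 -e "$bosh_env" update-resurrection off || return 1
  # Cancel the task that is holding the deployment lock.
  run bosh2 -e "$bosh_env" cancel-task "$task_number" || return 1
  # Scan the deployment for problems (assumes cloud-check takes -d).
  run bosh2 -e "$bosh_env" -d "$deployment" cloud-check
}
```

For example, `recover_bosh_lock my-env 1234 cf` would cancel a hypothetical task 1234 and then check the cf deployment; repeat the cloud-check for any other affected deployments.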
