Pivotal Knowledge Base

Redis Tile Upgrade to 1.11.0 or 1.11.1 may Result in Loss of the Persistent Data

Environment

Pivotal Cloud Foundry Redis Tile v1.11.0 and v1.11.1 only.

Purpose

PCF-Redis tile versions 1.11.0 and 1.11.1 include a change in their Ops Manager configuration that may result in cf-redis data loss for some customers. The data loss occurs because persistent disks become detached when the cf-redis broker is moved between availability zones. Shared-VM Redis instances lose their data immediately. Dedicated Redis instances can also be affected.

Affected Redis Tiles:

  • Pivotal Cloud Foundry Redis Tile: Both shared VM and dedicated VM service plans are affected.
  • Upgrading from 1.11.0 or 1.11.1 to 1.11.2 requires some manual steps to remove the risk of data loss.
  • Tiles upgraded to 1.11.0 or 1.11.1 from a 1.10.* tile may lose data.

Unaffected Redis Tiles:

  • On-demand Redis tiles are not affected.
  • 1.10.* tiles and earlier are not affected.
  • Fresh installations of 1.11.0 and 1.11.1 tiles are not affected immediately. However, it may be unsafe to upgrade from these versions. Do not make any new installations of 1.11.0 or 1.11.1.
  • There is no risk in upgrading from 1.11.0 to 1.11.1.

Affected Availability Zone (AZ) Configurations:*

  • Tiles using only one AZ in 'Balance other jobs in AZ' and the same AZ in 'Singleton jobs in AZ' are not affected.
  • Tiles with a singleton AZ configured that is not in 'Balance other jobs in AZ' will always be affected.
  • Tiles whose singleton AZ is one of several AZs selected in 'Balance other jobs in AZ' may or may not be affected.

* availability zones are configured in the Redis Tile > Assign AZs and Networks tab
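
The three rules above can be expressed as a small check. This is a minimal sketch for reasoning about a configuration, not a tool shipped with the tile; the function name `az_risk` and its string results are hypothetical.

```python
def az_risk(balance_azs, singleton_az):
    """Classify an AZ configuration using the three rules listed above.

    balance_azs:  AZs ticked under 'Balance other jobs in AZ'
    singleton_az: AZ selected under 'Singleton jobs in AZ'
    """
    balance = set(balance_azs)
    if singleton_az not in balance:
        # Singleton AZ is not one of the balance AZs.
        return "always affected"
    if balance == {singleton_az}:
        # Exactly one AZ, used for both settings.
        return "not affected"
    # Singleton AZ is one of several balance AZs.
    return "may be affected"


print(az_risk(["europe-west1-c"], "europe-west1-c"))
print(az_risk(["europe-west1-b"], "europe-west1-c"))
print(az_risk(["europe-west1-b", "europe-west1-c"], "europe-west1-c"))
```

The last case stays ambiguous because the outcome depends on which of the balance AZs the broker was deployed into before the upgrade.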

Affected IaaS:

  • All IaaSes may be affected.

Suggested Actions

We list actions depending on the current state of your PCF Redis installation. If you are unsure of your state, see the Diagnostic Actions section. The following screenshot shows tile configuration for upgrading to 1.11.2. Note that europe-west1-c is selected in both categories.

Note: Download and upgrade to the latest Redis tile, 1.11.2, from https://network.pivotal.io/products/p-redis/.

redis.png

State 1 - Current version: 1.10.* or earlier. Upgraded from: any previous. Broker moved AZ: N/A.

  • Do not upgrade to 1.11.0 or 1.11.1. Upgrading to 1.11.2 will have no adverse effect.

State 2 - Current version 1.11.0 or 1.11.1. Upgraded from: 1.10.*. Broker moved AZ: Yes.

  • Upgrading to 1.11.2 may cause the broker to move out of the singleton AZ into one of the 'Balance other jobs in AZ' AZs. This will again cause disk detachment and data loss.
    • If data loss is of low concern for your use case, you should upgrade to 1.11.2.
    • If the 'Balance other jobs in AZ' configuration includes the AZ selected above in the 'Place singleton jobs in' radio-field, it is safe to upgrade to 1.11.2.
    • If data loss is a concern, we currently recommend not upgrading away from 1.11.0 or 1.11.1. Please contact Pivotal Support for advice.
  • If data recovery from the 1.10.* state is required, see the Data Recovery section.
  • We recommend disabling creation of dedicated VMs to avoid further data loss: cf disable-service-access p-redis

State 3 - Current version 1.11.0 or 1.11.1. Upgraded from: 1.10.*. Broker moved AZ: No.

  • It is safe to upgrade to 1.11.2+ without any changes to your AZ configuration.

State 4 - Current version 1.11.0 or 1.11.1. Freshly installed.

  • If the 'Balance other jobs in AZ' configuration includes the AZ selected above in the 'Place singleton jobs in' radio-field, this is similar to State 3 (Broker moved AZ: No). Please see the guidance above.
  • If the 'Balance other jobs in AZ' configuration does not include the AZ selected above in the 'Place singleton jobs in' radio-field, this is similar to State 2 (Broker moved AZ: Yes). Please see the guidance above.
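
The four states can be summarized as a lookup. The sketch below only restates the guidance above; the function name `suggested_action`, its parameters, and the wording of the returned advice are hypothetical, and it does not replace contacting Pivotal Support where data loss is a concern.

```python
def suggested_action(version, fresh_install=False, broker_moved=False,
                     singleton_in_balance=False):
    """Return (state number, one-line advice) for a PCF-Redis tile.

    version:              current tile version, e.g. "1.10.3" or "1.11.1"
    fresh_install:        True if 1.11.0/1.11.1 was installed fresh
    broker_moved:         True if the broker changed AZ during the upgrade
    singleton_in_balance: True if the singleton AZ is also ticked under
                          'Balance other jobs in AZ'
    """
    parts = tuple(int(p) for p in version.split("."))
    if parts < (1, 11):
        return 1, "Do not upgrade to 1.11.0 or 1.11.1; upgrade to 1.11.2."
    if parts not in ((1, 11, 0), (1, 11, 1)):
        raise ValueError("this guidance only covers tiles up to 1.11.1")

    safe = "Safe to upgrade to 1.11.2."
    risky = ("Upgrading to 1.11.2 may move the broker again; "
             "contact Pivotal Support if data loss is a concern.")
    if fresh_install:
        return 4, (safe if singleton_in_balance else risky)
    if not broker_moved:
        return 3, safe
    return 2, (safe if singleton_in_balance else risky)


print(suggested_action("1.11.0", broker_moved=True))
```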

Cause

  • In 1.10.* and earlier, the broker may be deployed into any of the AZs selected under the “Balance other jobs in” field in Ops Manager.
  • Upgrading to 1.11.0 or 1.11.1 switches single_az_only to true. This causes the broker VM to be moved into the singleton AZ (unless it is there already).
  • Moving the broker causes its disk to be orphaned.
  • After the move, the broker has no record of any of the dedicated nodes it has previously allocated or bound.
  • After the move, the broker has no record of any shared VMs, nor any shared-VM Redis data.
  • Any dedicated nodes will continue to function, but the broker will not be aware of them and may cause their data to be overwritten.

Detailed Description of the Problem

The PCF-Redis tile deploys a cf-redis broker on a single VM. This VM hosts any shared-vm redis services and records the lifecycle of dedicated-node service instances.

  • The Redis Tile in Ops Manager contains a selector “Place singleton jobs in:”. This requires that an AZ be selected as the singleton AZ.
  • Ops Manager exposes a configuration flag called single_az_only. This has two effects:
    • 1) all VMs in that group will be allocated to the same AZ
    • 2) this will always be the singleton AZ (even if there is more than one instance in the group)
  • Ops Manager does not consider all instance_groups containing only one VM to be singleton.

In PCF-Redis 1.10.* and earlier, the tile set single_az_only: false. In versions 1.11.0 and 1.11.1, this was changed to single_az_only: true.

The reason for this change was to allow tile operators to turn off the legacy cf-redis service broker by configuring that instance group to have zero VMs. Constraining an instance group to have a minimum of 0 and a maximum of 1 VMs causes Ops Manager to require that single_az_only be true.

An unforeseen side effect of changing to single_az_only: true is that if the broker has been previously deployed to another AZ, it will be moved into the singleton AZ. A patch to Ops Manager removing the requirement to toggle ‘single_az_only’ to true will be published in a few weeks.

When BOSH moves a VM between availability zones, any attached persistent disks are orphaned if they cannot be moved. New disks are created in the new AZ and attached to the new VM. The data is not migrated.
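
The effect on the disk can be pictured with a toy model. This is an illustration only and does not use any BOSH API; the `Disk` class and `recreate_vm_in_az` function are hypothetical, and the sketch assumes a persistent disk cannot follow its VM to another AZ.

```python
from dataclasses import dataclass, field


@dataclass
class Disk:
    az: str
    data: dict = field(default_factory=dict)


def recreate_vm_in_az(disk, new_az):
    """Return (attached_disk, orphaned_disk) after the VM is placed in new_az."""
    if disk.az == new_az:
        # Same AZ: the existing disk stays attached, nothing is orphaned.
        return disk, None
    # Different AZ: a new, empty disk is created; the old one is orphaned.
    return Disk(az=new_az), disk


old = Disk("europe-west1-c", {"statefile": "allocations and bindings"})
attached, orphaned = recreate_vm_in_az(old, "europe-west1-b")
print(attached.data)     # {}
print(orphaned is old)   # True
```

The broker comes back up with an empty disk, while the original data remains only on the orphaned disk.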

How dedicated-nodes are Affected

When the broker loses its persistent disk, the statefile that records dedicated-node allocation and bindings is lost. Requests from Cloud Foundry to provision a new dedicated-node (cf create-service) will cause the broker to allocate a VM from its pool of available VMs. As the broker has no record of which VMs it has previously allocated, it may clean and reallocate a dedicated VM that is currently in use. This will destroy the Redis data on that dedicated-node, and the existing app bindings will no longer be valid.

When the smoke-test errand runs, it provisions a dedicated instance, which may overwrite a pre-existing instance. Attempts to bind to any service instance created prior to the upgrade will fail.
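
The following toy model illustrates why a lost statefile leads to an in-use node being handed out again. It is a simplified sketch, not the cf-redis broker's actual code; the `ToyBroker` class, the node names, and the first-free allocation order are all assumptions made for illustration.

```python
class ToyBroker:
    def __init__(self, nodes):
        self.nodes = nodes     # shared reference: node name -> Redis data
        self.statefile = {}    # instance id -> node name (kept on the broker's disk)

    def provision(self, instance_id):
        in_use = set(self.statefile.values())
        for name in self.nodes:
            if name not in in_use:
                self.nodes[name].clear()   # the node is cleaned before it is handed out
                self.statefile[instance_id] = name
                return name
        raise RuntimeError("no free dedicated node")


nodes = {"node-0": {}, "node-1": {}}

broker = ToyBroker(nodes)
first = broker.provision("app-a")
nodes[first]["cart"] = "42 items"      # the app stores data on its node

broker = ToyBroker(nodes)              # broker recreated with an empty statefile
second = broker.provision("smoke-test")

print(second == first)   # True: the same node was handed out again
print(nodes[first])      # {}: the app's data was wiped
```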

Symptoms

  1. Data stored in any shared-vms will be lost. Apps with connections to shared-vms will lose their connection to Redis.
  2. The record of shared-vm allocations and bindings will be lost.
  3. The record of dedicated-node allocations and bindings will be lost.
  4. Redis data on dedicated nodes will not be lost immediately. However, because the broker has lost the record of allocating these VMs, provisioning new dedicated-node service instances may result in total data loss from existing dedicated-node service instances.

Procedure

Diagnostic Tasks

  1. Determine your current version of the PCF-Redis tile.
    1. If it is earlier than 1.11, do not upgrade to 1.11.0 or 1.11.1, unless you only use a single AZ and have no plans to add AZs.
  2. If the current version is 1.11.0 or 1.11.1, determine whether this installation was upgraded from a 1.10.* tile. The Ops Manager installation Change Log will hold a record of previous upgrades.
  3. If an at-risk upgrade has been performed, determine whether the broker VM has changed its AZ:
    1. Check the installation change logs to see if the cf-redis-broker instance_group has been changed to a different AZ.
    2. The image below shows the broker being moved. europe-west1-b has been added, so it must be the singleton AZ. The broker will be moved into this AZ, causing disk loss.
    3. If the diff shows only the removal of AZs, the broker has not been moved.
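
The change-log check in step 3 can be written as a simple rule. This sketch assumes you have copied the broker instance group's AZ lists before and after the upgrade out of the change log by hand; the function name `broker_moved` is hypothetical and it does not parse Ops Manager output.

```python
def broker_moved(azs_before, azs_after):
    """Apply the rule above: an added AZ means the broker was moved into it.

    A diff that only removes AZs (or changes nothing) means the broker
    stayed where it was.
    """
    added = set(azs_after) - set(azs_before)
    return bool(added)


# An AZ was added, so the broker moved into it.
print(broker_moved(["europe-west1-c"], ["europe-west1-b"]))
# Only a removal, so the broker did not move.
print(broker_moved(["europe-west1-b", "europe-west1-c"], ["europe-west1-c"]))
```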

Data Recovery

Please contact the Redis for PCF team to discuss and recover your data. Please do this immediately, as data is recoverable for only 5 days.

If you have previously disabled the cf-redis broker

The Redis On-Demand Broker cf.redis offers a more flexible way to manage service instances. All PCF-Redis tiles ship with both brokers: the legacy cf-redis and the newer on-demand cf.redis. Tile versions 1.11.0 and 1.11.1 allow the operator to disable the cf-redis broker by setting the instance_group to have zero instances. It is also necessary to disable the cf-redis broker errands.

The current fix for changing single_az_only back to false unfortunately requires the tile to force the cf-redis broker instance_group to have exactly one VM. This means that operators who previously reduced this to zero will have the cf-redis broker recreated.

It is currently not possible to remove this VM; however, it can be effectively ignored by:

  • turning off the register-broker errand (it may already be off)
  • using the cf CLI to disable or remove the cf-redis service, if the errand has already run.
