Pivotal Knowledge Base

Pivotal Cloud Foundry® Redis full disk scaling issue

Environment

Product: Pivotal Cloud Foundry® Redis
Version: all

Symptom

When the shared-vm persistent disk is full and a BOSH operation such as a scale is performed, the BOSH task may hang in the preparing deployment phase and eventually time out, causing a failure. You can confirm this by checking that a drain has been triggered in /var/vcap/sys/log/cf-redis-broker/drain.log, and then checking the server logs located in /var/vcap/sys/log/redis/{instance-guid}/redis-server.log for failed attempts to shut down.

Possible timeout error:

[2016-07-29 12:26:11 #1318] [canary_update(cf-redis-broker-partition-aa5545bd553f206250e/0)] ERROR -- DirectorJobRunner: Error updating canary instance: #<Bosh::Director::RpcTimeout: Timed out sending `prepare' to 41c8421b-baa9-4afd-9afd-a38b643219fd after 45 seconds>
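
To confirm this failure mode, inspect the logs on the cf-redis-broker VM. The commands below are a sketch: the job name and index are examples, and the bosh ssh syntax depends on your BOSH CLI version.

# SSH onto the broker VM (job name and index are examples)
bosh ssh cf-redis-broker 0

# Check whether a drain was triggered
tail -n 50 /var/vcap/sys/log/cf-redis-broker/drain.log

# Check an instance's server log for failed shutdown attempts
tail -n 50 /var/vcap/sys/log/redis/{instance-guid}/redis-server.log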

Cause

This happens because Redis lacks sufficient disk space to complete a BGSAVE, which will have been triggered either by the BOSH operation invoking a drain on the Redis broker or by Redis's own backup schedule.
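
Before choosing a resolution, it can help to confirm that the broker's persistent disk really is full. A minimal check on the broker VM, assuming the standard BOSH persistent disk mount point:

# Persistent disk usage on the cf-redis-broker VM
df -h /var/vcap/store

# Per-instance data directory sizes for the shared-vm plan
du -sh /var/vcap/store/cf-redis-broker/redis-data/*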

Resolution

There are two solutions, depending on whether data loss is acceptable:

Solution 1:

If data loss is acceptable, the workaround is as follows (an example session is shown after the steps):

1. On the cf-redis-broker VM, run pkill -9 redis-server to stop all shared-vm Redis instances.

2. At this point, Ops Manager should be able to upgrade without issue. It may take up to 10 minutes for Ops Manager to make progress.
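
For example, on the cf-redis-broker VM (the bosh ssh invocation is illustrative and depends on your CLI version; sudo may be needed depending on the user you log in as):

bosh ssh cf-redis-broker 0
sudo pkill -9 redis-server

# Verify that no shared-vm redis-server processes remain
pgrep -l redis-server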

Solution 2:

If data loss is unacceptable, the system can be in one of two states.

 

State 1) Some Redis instances have backed up and stopped successfully.

In this circumstance, do the following (an example is shown after the steps):

1. Move the dump.rdb files off the disk (e.g., via scp) for those instances that have successfully backed up, freeing space for the other instances to back up.

2. At this point, Ops Manager should be able to upgrade without issue.

3. Refer to the restore documentation for restoring each individual instance.
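
For example, from the cf-redis-broker VM, for each instance that has a complete dump (the destination host and path are placeholders for any location with enough free space):

# Copy the saved dump off the persistent disk
scp /var/vcap/store/cf-redis-broker/redis-data/{instance-guid}/db/dump.rdb user@backup-host:/backups/{instance-guid}/

# Then remove the local copy to free space for the remaining instances
rm /var/vcap/store/cf-redis-broker/redis-data/{instance-guid}/db/dump.rdb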

 

State 2) At least one Redis instance is too large to be backed up on the available disk, or multiple backups are contending with each other for space. In this circumstance, do the following (a consolidated command example follows the steps):

1. Create a new volume large enough to contain the Redis backup files and attach it to the broker VM.

2. Identify the device name of the new volume with lsblk; this gives you the {volume-name} used below.

3. Create a mount directory with sudo mkdir /{mount-dir}.

4. Find out whether a file system already exists on the new volume with sudo file -s {volume-name}. If one exists, skip step 5.

5. If necessary, create a file system on the volume with sudo mkfs -t ext4 {volume-name}.

6. Mount the volume on this directory with sudo mount {volume-name} {mount-dir}.

7. Create a directory for your instance named after its GUID, i.e. {mount-dir}/{instance-guid}. You can obtain the instance GUID by running cf service {service-name} --guid.

8. Set the permissions and ownership of {mount-dir}/{instance-guid} with sudo chmod 755 {mount-dir}/{instance-guid} and sudo chown vcap:vcap {mount-dir}/{instance-guid}.

9. Get the Redis CONFIG command alias ({config-alias}) with grep CONFIG /var/vcap/store/cf-redis-broker/redis-data/{instance-guid}/redis.conf.

10. Make a note of the existing backup dir with /var/vcap/packages/redis/bin/redis-cli -a {instance-password} -p {instance-port} {config-alias} get dir. The password and port can be obtained either from a service key bound to your service or by inspecting /var/vcap/store/cf-redis-broker/redis-data/{instance-guid}/redis.conf for requirepass and port.

11. Set the Redis backup directory to the instance directory on the newly attached volume with /var/vcap/packages/redis/bin/redis-cli -a {instance-password} -p {instance-port} {config-alias} set dir {mount-dir}/{instance-guid}.

12. Wait for this Redis instance to stop - this indicates a successful backup.

13. Delete the dump.rdb and appendonly.aof files located in /var/vcap/store/cf-redis-broker/redis-data/{instance-guid}/db to make room on disk.

14. Give the remaining Redis servers a few minutes to attempt a successful backup. If no more redis-server processes are running, go to step 16.

15. If there is still not enough disk space for the other instances, repeat steps 7 through 13 for each remaining affected instance.

16. At this point, Ops Manager should be able to upgrade without issue. It may take up to 10 minutes for Ops Manager to make progress.

17. Refer to the restore documentation for restoring each individual instance, remembering to restore the dir config prior to starting that instance up.
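
Putting the disk-related steps together, a consolidated session on the broker VM might look like the following. This is a sketch: the device name /dev/xvdf and mount directory /mnt/redis-backup are examples only, and the values in braces come from the steps above.

# Steps 2-6: identify, prepare, and mount the new volume
lsblk
sudo file -s /dev/xvdf
sudo mkfs -t ext4 /dev/xvdf        # only if no file system exists yet
sudo mkdir /mnt/redis-backup
sudo mount /dev/xvdf /mnt/redis-backup

# Steps 7-8: per-instance directory, owned by vcap
sudo mkdir /mnt/redis-backup/{instance-guid}
sudo chmod 755 /mnt/redis-backup/{instance-guid}
sudo chown vcap:vcap /mnt/redis-backup/{instance-guid}

# Step 9: find the CONFIG command alias for this instance
grep CONFIG /var/vcap/store/cf-redis-broker/redis-data/{instance-guid}/redis.conf

# Steps 10-11: record the current backup dir, then point it at the new volume
/var/vcap/packages/redis/bin/redis-cli -a {instance-password} -p {instance-port} {config-alias} get dir
/var/vcap/packages/redis/bin/redis-cli -a {instance-password} -p {instance-port} {config-alias} set dir /mnt/redis-backup/{instance-guid}

# Step 13: once the instance has stopped, free the original disk
rm /var/vcap/store/cf-redis-broker/redis-data/{instance-guid}/db/dump.rdb
rm /var/vcap/store/cf-redis-broker/redis-data/{instance-guid}/db/appendonly.aof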

 
