Pivotal Cloud Foundry (PCF) versions 1.12.x or earlier
We have encountered failures in Bosh where a task is queued and then moves immediately to the canceled state instead of being processed. Customers can reproduce the issue in busy environments running PCF versions 1.12.x or earlier.
An example of what an application developer will see if they encounter this failure is:
Instance deletion failed: There was a problem completing your request. Please contact your operations team providing the following information: service: service-offering-ab7a08b8-5a43-47c8-a1a9-29806cc3f7f8, service-instance-guid: 56c00032-03e0-485e-bfe9-907038641a77, broker-request-id: fc96e137-c7a6-43ee-aab0-434650a3d752, task-id: 78, operation: delete
The operator would then view the Bosh task and see that it is in the canceled state, for example:
Acting as user 'director' on 'p-bosh'
RSA 1024 bit CA certificates are loaded due to old openssl compatibility
Director task 78
Started deleting instances > redis-server/2d95d493-0d42-4360-b0fd-572982356902 (0). Failed: Task 78 cancelled (00:00:00)
Error 10001: Task 78 cancelled
Task 78 cancelled
Our working theory is that this is caused by a race condition when a task moves from queued to processing: if the task was queued for more than 90 seconds and its status is checked before the checkpoint time is updated, the task is canceled.
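The theory above can be illustrated with a toy model. This is only a sketch of the suspected ordering problem, not Bosh code: the Task class, the check_status and start_processing functions, and the use of the checkpoint time as a staleness test are all assumptions made for illustration. Only the 90 second figure comes from the description above.

```python
from dataclasses import dataclass

# Threshold taken from the working theory; everything else here is hypothetical.
STALE_AFTER_SECONDS = 90


@dataclass
class Task:
    state: str               # "queued", "processing", or "cancelled"
    checkpoint_time: float   # last time the task recorded progress, in seconds


def check_status(task: Task, now: float) -> str:
    """Hypothetical status check: cancel a processing task with a stale checkpoint."""
    if task.state == "processing" and now - task.checkpoint_time > STALE_AFTER_SECONDS:
        task.state = "cancelled"
    return task.state


def start_processing(task: Task, now: float, refresh_checkpoint_first: bool) -> None:
    """Move a queued task to processing, optionally refreshing the checkpoint first."""
    if refresh_checkpoint_first:
        task.checkpoint_time = now
    task.state = "processing"


# A task queued at t=0 is picked up at t=120, after more than 90 seconds in the queue.
racy = Task(state="queued", checkpoint_time=0.0)
start_processing(racy, now=120.0, refresh_checkpoint_first=False)
racy_result = check_status(racy, now=120.0)   # checkpoint still stale: "cancelled"

safe = Task(state="queued", checkpoint_time=0.0)
start_processing(safe, now=120.0, refresh_checkpoint_first=True)
safe_result = check_status(safe, now=120.0)   # checkpoint refreshed: "processing"
```

In this model the task is canceled only when the status check runs in the window between the state change and the checkpoint update, which matches the elapsed time of 00:00:00 in the task output.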
The permanent solution is to upgrade to PCF v2.0 or later, which uses Bosh v264.1 or later.
As a workaround, retry the failing operation.
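For automation that drives these operations, the retry workaround might be wrapped as in the following minimal sketch. The function name retry_if_cancelled, the assumption that a failure surfaces as a RuntimeError whose message contains the "Task N cancelled" text shown above, and the attempt count and delay are all illustrative choices, not part of any Bosh or PCF API.

```python
import re
import time

# Matches the error text shown in the task output above, e.g. "Task 78 cancelled".
CANCELLED_PATTERN = re.compile(r"Task \d+ cancelled")


def retry_if_cancelled(operation, max_attempts=3, delay_seconds=30.0, sleep=time.sleep):
    """Call operation() again while it fails with a 'Task N cancelled' error.

    `operation` is a hypothetical zero-argument callable that performs the
    failing request (for example, a service instance deletion) and raises
    RuntimeError with the error text on failure.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except RuntimeError as error:
            cancelled = CANCELLED_PATTERN.search(str(error)) is not None
            # Re-raise unrelated errors immediately, and give up after the last attempt.
            if not cancelled or attempt == max_attempts:
                raise
            sleep(delay_seconds)
```

Only errors that match the cancellation text are retried, so other failures still surface on the first attempt.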