Pivotal Knowledge Base

BOSH Tasks go straight to Canceled State if Queued for more than 90 Seconds

Environment

Pivotal Cloud Foundry (PCF) versions 1.12.x or earlier

Symptom

BOSH tasks can fail by being queued and then moving straight to the canceled state instead of being processed. The issue can be reproduced in busy environments running PCF versions 1.12.x or earlier.

An example of what an application developer will see if they encounter this failure is:

Instance deletion failed: There was a problem completing your request. Please contact your operations team providing the following information: service: service-offering-ab7a08b8-5a43-47c8-a1a9-29806cc3f7f8, service-instance-guid: 56c00032-03e0-485e-bfe9-907038641a77, broker-request-id: fc96e137-c7a6-43ee-aab0-434650a3d752, task-id: 78, operation: delete

The operator would then view the BOSH task and see that it is in the canceled state, for example:

Acting as user 'director' on 'p-bosh'
RSA 1024 bit CA certificates are loaded due to old openssl compatibility

Director task 78
Started deleting instances > redis-server/2d95d493-0d42-4360-b0fd-572982356902 (0). Failed: Task 78 cancelled (00:00:00)

Error 10001: Task 78 cancelled

Task 78 cancelled

Cause

The working theory is that this is caused by a race condition when a task moves from queued to processing. If the task was queued for more than 90 seconds and its status is checked before its checkpoint time is updated, the task is canceled.
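
The theory above can be illustrated with a toy model. This is only a sketch of the suspected ordering problem, not the BOSH Director's actual code: the `Task` class, the `check_status` and `simulate` functions, and the idea that the checkpoint is last set when the task is enqueued are all assumptions made for illustration. Only the 90-second threshold comes from this article.

```python
from dataclasses import dataclass

STALE_AFTER_SECONDS = 90  # threshold taken from the article


@dataclass
class Task:
    state: str              # "queued", "processing", or "cancelled"
    checkpoint_time: float  # last checkpoint, in seconds (hypothetical field)


def check_status(task: Task, now: float) -> str:
    """Hypothetical status check: cancel a processing task with a stale checkpoint."""
    if task.state == "processing" and now - task.checkpoint_time > STALE_AFTER_SECONDS:
        task.state = "cancelled"
    return task.state


def simulate(queued_for: float, check_before_checkpoint_update: bool) -> str:
    """Model a task that waits `queued_for` seconds and is then picked up."""
    task = Task(state="queued", checkpoint_time=0.0)  # checkpoint set at enqueue time
    now = queued_for

    task.state = "processing"  # step 1: the worker picks up the task
    if check_before_checkpoint_update:
        # A status check lands between step 1 and step 2 and sees the old checkpoint.
        check_status(task, now)
    if task.state != "cancelled":
        task.checkpoint_time = now  # step 2: the checkpoint is refreshed
    return check_status(task, now)


if __name__ == "__main__":
    print(simulate(120, check_before_checkpoint_update=True))
    print(simulate(120, check_before_checkpoint_update=False))
    print(simulate(30, check_before_checkpoint_update=True))
```

In this model, only the combination of a wait longer than 90 seconds and a badly timed status check cancels the task, which would be consistent with the failure appearing intermittently in busy environments.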

Resolution

The permanent solution is to upgrade to PCF v2.0 or later, which uses BOSH v264.1 or above.

As a workaround, re-attempt the failing operation.
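
Where the failing operation is driven by a script, the re-attempt can be automated. The following is a minimal sketch, assuming the operation is wrapped in a Python callable that raises `RuntimeError` with the word "cancelled" in its message when the task is canceled. The `retry_if_cancelled` helper and that error convention are hypothetical and are not part of BOSH or PCF.

```python
import time


def retry_if_cancelled(operation, attempts=3, delay_seconds=5.0, sleep=time.sleep):
    """Call `operation`, retrying only when it fails with a 'cancelled' error.

    `operation` is any zero-argument callable (hypothetical wrapper around the
    failing request). Errors that do not mention 'cancelled' are re-raised at once.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except RuntimeError as error:
            if "cancelled" not in str(error):
                raise  # unrelated failure: do not hide it behind retries
            last_error = error
            if attempt < attempts:
                sleep(delay_seconds)  # give the queue time to drain
    raise last_error
```

Retrying only on the cancellation error keeps unrelated failures visible, and the bounded number of attempts avoids adding unlimited load to an already busy environment.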
