Pivotal Knowledge Base

In some vSphere Environments, Upgrade/Installation to 1.9 fails with diego_cell Errors

Environment

 Product  Version
 Pivotal Cloud Foundry  1.9
 Elastic Runtime  >1.9.0

Symptom

In some vSphere environments, upgrade/installation to Pivotal Cloud Foundry version 1.9 fails with the following error:

Director task 10718 
Started preparing deployment > Preparing deployment. Done (00:00:01)

Started preparing package compilation > Finding packages to compile. Done (00:00:00)

Started updating instance diego_cell 
Started updating instance diego_cell > diego_cell/8007640d-569b-46e4-bd3a-ece29bad8cc5 (0) (canary). Failed: 'diego_cell/0 (8007640d-569b-46e4-bd3a-ece29bad8cc5)' is not running after update. Review logs for failed jobs: rep (00:05:45)

Error 400007: 'diego_cell/0 (8007640d-569b-46e4-bd3a-ece29bad8cc5)' is not running after update. Review logs for failed jobs: rep

Task 10718 error

The key symptom is that the rep process is failing. The output of `monit summary` from the affected diego_cell, shown below, reports `rep` in an `unknown` state:

| diego_cell/0 (8007640d-569b-46e4-bd3a-ece29bad8cc5)* | failing | AZ1 | xlarge.disk | 10.2.15.20 |
|   consul_agent                                       | running |     |             |            |
|   rep                                                | unknown |     |             |            |
|   garden                                             | running |     |             |            |
|   metron_agent                                       | running |     |             |            |

Restarting the rep process does not fix the issue.

Cause 

The issue is caused by the following line in the `/var/vcap/jobs/rep/bin/rep_as_vcap` file:

azure_fd=$(curl -f  --connect-timeout 5 --silent http://169.254.169.254/metadata/v1/InstanceInfo/FD)

In some vSphere environments, the curl command above does not return within 30 seconds, so the `rep_as_vcap` script is still running when the 30-second mark is reached. Monit is configured to terminate a process whose startup script has not exited after 30 seconds, so in this case monit terminates the `rep` process. This is why `monit summary` reports the rep process as `unknown`.
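To check whether a cell is affected, it can help to time the metadata request under a hard deadline. The following is a minimal sketch, not part of the Diego release: the helper name `run_with_deadline` is hypothetical, and it assumes GNU coreutils `timeout` is available on the cell.

```shell
#!/usr/bin/env bash
# Hypothetical diagnostic helper (an assumption, not shipped with Diego).
# Runs a command under a hard wall-clock deadline and reports how long it took.
run_with_deadline() {
  local limit="$1"; shift
  local start end status
  start=$(date +%s)
  # coreutils `timeout` exits with status 124 when it has to kill the command.
  timeout "$limit" "$@" >/dev/null 2>&1
  status=$?
  end=$(date +%s)
  echo "elapsed=$((end - start))s status=${status}"
  return "$status"
}

# Example invocation on a diego_cell; 30 matches the monit limit described above:
# run_with_deadline 30 curl -f --connect-timeout 5 --silent \
#   http://169.254.169.254/metadata/v1/InstanceInfo/FD
```

A status of 124 means the request was still running when the deadline expired. Note that curl's `--connect-timeout` limits only the connection phase, while `--max-time` limits the whole transfer; whether a stalled response after a successful connection is what happens in the affected environments is an assumption, not something this article confirms.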

Note: The reference to Azure in the control script above relates to querying metadata on the Azure IaaS. This IaaS dependency in the Diego control scripts is required to enable some features on Azure, but it has unintended consequences in vSphere environments. See the Resolution section below for the fix.

Resolution

This issue is fixed in Elastic Runtime version 1.9.18. Upgrade to Elastic Runtime 1.9.18 or later. The release notes list the fix as: Adds Azure Fault-Domain detection failure logic to rep.
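When checking several environments, the version comparison can be scripted. This is a minimal sketch under stated assumptions: the helper name `version_at_least` and the `installed` value are hypothetical, and the comparison relies on `sort -V` from GNU coreutils. How the installed Elastic Runtime version is obtained is left to the operator.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when version $1 is at or above version $2.
version_at_least() {
  local have="$1" want="$2"
  # `sort -V` orders version strings numerically per component,
  # so 1.9.9 sorts before 1.9.18.
  [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)" = "$want" ]
}

installed="1.9.17"   # hypothetical value; substitute the deployed version
if version_at_least "$installed" "1.9.18"; then
  echo "fix included"
else
  echo "upgrade needed"
fi
```

A plain string comparison would not work here, because it would rank 1.9.9 above 1.9.18.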
