|Pivotal Cloud Foundry (PCF)|
|Elastic Runtime||1.6.x, 1.7.x|
The upgrade of Pivotal Cloud Foundry may fail due to Consul issues.
The upgrade fails with the following error message:
Started updating job consul_server-partition-260de9892e7d24109dfe > consul_server-partition-260de9892e7d24109dfe/0 (canary).
Failed: `consul_server-partition-260de9892e7d24109dfe/0' is not running after update (00:05:57) Error 400007: `consul_server-partition-260de9892e7d24109dfe/0' is not running after update
This error message is generic. It indicates only that there is a problem with the software running on the VM. This article concerns the consul_server VM in particular, so here it means that the Consul software is failing to start. The message alone does not identify the specific cause; see Debugging Instructions below for details on how to investigate further.
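When scripting around a deploy, it can help to pull the failing job name and index out of the BOSH output so you know which VM to inspect. The following is a minimal sketch, assuming the failure text has the same shape as the message quoted above; the function name failing_instances is hypothetical and not part of BOSH or Ops Manager.

```python
import re

# Matches the failure text quoted in this article:
#   `job-name/index' is not running after update
NOT_RUNNING = re.compile(
    r"`(?P<job>[^`'/]+)/(?P<index>\d+)' is not running after update"
)


def failing_instances(output):
    """Return a sorted list of unique (job, index) pairs reported as not running."""
    found = {
        (m.group("job"), int(m.group("index")))
        for m in NOT_RUNNING.finditer(output)
    }
    return sorted(found)


# Sample text copied from the error message shown in this article.
sample = (
    "Failed: `consul_server-partition-260de9892e7d24109dfe/0' is not running "
    "after update (00:05:57) Error 400007: "
    "`consul_server-partition-260de9892e7d24109dfe/0' is not running after update"
)

for job, index in failing_instances(sample):
    print(f"{job}/{index}")
```

The message repeats the same job twice, so the sketch de-duplicates before reporting.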
In many cases, we have found that Consul server failures in PCF can be corrected by wiping the data from the nodes and resetting them. This process essentially gives the cluster a fresh start. Because no persistent data is stored on the Consul server, the operation is harmless.
Because this process is quick, non-destructive and has a high success rate for fixing Consul problems, Pivotal recommends trying this process first, before doing any additional debugging.
To perform this process, follow the instructions in the Failed Deploys, Upgrades, Split-Brain Scenarios, etc section of the following link.
If you need assistance with these instructions, please open a support ticket. If performing the steps at the link above does not help, please proceed to the next section.
When this problem occurs, you can debug further by performing the following steps:
- Capture the logs from the failing VM. This can be done through Ops Manager on the Status page for the Elastic Runtime tile. It can also be done by running bosh logs or by manually copying the /var/vcap/sys/log directory off the VM.
- SSH into the failing VM and run monit summary as the root user. This command lists the processes that are deployed to the VM and indicates which ones are not running properly.
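If you save the monit summary output to a file, a short script can pick out the processes that are not running. This is a rough sketch only: it assumes each process is reported on a line of the form Process 'name' followed by a status, and the sample text and process names below are illustrative, not captured from a real VM. Adjust the pattern if your monit version prints a different layout.

```python
import re

# Assumed line shape of `monit summary` output: Process 'name'   status text
PROCESS_LINE = re.compile(r"^Process '(?P<name>[^']+)'\s+(?P<status>.+?)\s*$")


def not_running(summary_text):
    """Return (name, status) for each process whose status is not 'running'."""
    problems = []
    for line in summary_text.splitlines():
        match = PROCESS_LINE.match(line.strip())
        if match and match.group("status").lower() != "running":
            problems.append((match.group("name"), match.group("status")))
    return problems


# Illustrative sample only; process names and statuses are hypothetical.
sample_summary = """The Monit daemon 5.2.4 uptime: 1h 2m

Process 'consul_agent'              Execution failed
Process 'metron_agent'              running
System 'system_localhost'           running
"""

for name, status in not_running(sample_summary):
    print(f"{name}: {status}")
```

Any process listed by the script is a good starting point when reading through the logs captured in the first step.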
Once you have captured the information above, you can review it to better understand the problem, or open a support ticket and Pivotal Support will help diagnose the issue.
As documented in "Upgrading to PCF 1.6", Pivotal recommends scaling the number of Consul servers down to 1 instance before upgrading to PCF 1.6. This can help avoid some common issues with the Consul server.