Pivotal Knowledge Base

Consul fails to start during upgrade in Cloud Foundry

Environment

Product                        Version
Pivotal Cloud Foundry (PCF)
Elastic Runtime                1.6.x, 1.7.x

Symptom

An upgrade of Pivotal Cloud Foundry can fail because the Consul server job does not start after it is updated.

The upgrade fails with the following error message:

Started updating job consul_server-partition-260de9892e7d24109dfe > consul_server-partition-260de9892e7d24109dfe/0 (canary). 
Failed: `consul_server-partition-260de9892e7d24109dfe/0' is not running after update (00:05:57) Error 400007: `consul_server-partition-260de9892e7d24109dfe/0' is not running after update
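
When scanning a long deployment log, the failing job name and index can be pulled out of a line like the one above. The following is a minimal Python sketch, not part of any Pivotal tooling. It assumes the message keeps the quoting shown above (a backtick before the job name, an apostrophe after the index), and the function name failing_instance is hypothetical.

```python
import re

# Assumes the quoting shown in the message above: a backtick before the
# job name and an apostrophe after the instance index.
_NOT_RUNNING = re.compile(r"`([\w-]+)/(\d+)' is not running after update")


def failing_instance(log_line):
    """Return (job_name, index) from a 'not running after update' line, or None."""
    match = _NOT_RUNNING.search(log_line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


line = ("Failed: `consul_server-partition-260de9892e7d24109dfe/0' is not running "
        "after update (00:05:57)")
print(failing_instance(line))
```

The job name and index returned here identify the VM to inspect in the Debugging Instructions section.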

Cause

This is a generic error message.  It indicates that there is a problem with the software running on the VM.  In this article the VM in question is consul_server, so the message means that the Consul software failed to start.  The message alone does not identify the specific problem; see Debugging Instructions below for ways to investigate further.

Resolution

Quick Fix

In many cases, Consul server failures in PCF can be corrected by wiping the data from the nodes and resetting them.  This process gives the cluster a fresh start, and because no persistent data is stored on the Consul server, the operation is harmless.

Because this process is quick and non-destructive and has a high success rate for fixing Consul problems, Pivotal recommends trying it first, before doing any additional debugging.

To perform this process, follow the instructions in the "Failed Deploys, Upgrades, Split-Brain Scenarios, etc." section at the following link:

https://github.com/cloudfoundry-incubator/consul-release/tree/master#failure-recovery

If you need assistance with these instructions, please open a support ticket.  If performing the steps at the link above does not help, please proceed to the next section.

Debugging Instructions

When this problem occurs, you can debug further by performing the following steps:

  • Capture the logs from the failing VM.  This can be done through Ops Manager on the Status page for the Elastic Runtime Tile.  It can also be done by running bosh logs or by manually copying the /var/vcap/sys/logs directory off the VM.
  • SSH into the failing VM and run monit summary as the root user.  This command lists the processes that are deployed to the VM and indicates which one is not running properly.
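
If the monit summary output has been saved to a file or pasted into a ticket, the entries that are not running can be filtered out with a short script. The following is a minimal Python sketch. It assumes the older monit layout of one entry per line in the form Process 'name' status. The SAMPLE text is illustrative rather than captured from a real VM, and the process names in it and the function name find_unhealthy are hypothetical.

```python
import re

# Matches lines such as:  Process 'consul_agent'    running
# Assumption: one entry per line, name in single quotes, status at the end.
_STATUS_LINE = re.compile(r"^\s*(\w+)\s+'([^']+)'\s+(.+?)\s*$")


def find_unhealthy(summary_text):
    """Return (name, status) for every entry whose status is not 'running'."""
    unhealthy = []
    for line in summary_text.splitlines():
        match = _STATUS_LINE.match(line)
        if match is None:
            continue  # header or blank line
        _kind, name, status = match.groups()
        if status.lower() != "running":
            unhealthy.append((name, status))
    return unhealthy


# Illustrative sample only; real output will list the jobs on your VM.
SAMPLE = """\
The Monit daemon 5.2.4 uptime: 2h 13m

Process 'consul_agent'              not monitored
Process 'metron_agent'              running
System 'system_localhost'           running
"""

if __name__ == "__main__":
    for name, status in find_unhealthy(SAMPLE):
        print(f"{name}: {status}")
```

Any entry the script reports is a good starting point when reading through the matching log directory under /var/vcap/sys/logs.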

Once you have captured this information, you can review it to better understand the problem, or open a support ticket and Pivotal Support will help diagnose the issue.

Additional Information

The "Upgrading to PCF 1.6" documentation recommends scaling the number of Consul servers down to 1 instance before upgrading to PCF 1.6.  This can help to avoid some common issues with the Consul server.
