Pivotal Knowledge Base

How to Recover ETCD Cluster after Failure

Environment

  • Pivotal Cloud Foundry (PCF) 1.6.x, 1.7.x and 1.8.x
  • Component: Elastic Runtime

Symptom

Metron and Doppler use the etcd cluster for discovery (from Pivotal CF 1.8, TCP router group and route data are also kept in etcd). In some failure scenarios, a node incorrectly separates from the etcd cluster and forms its own single-node cluster, causing Metrons either to fail to find Dopplers or to swarm a small subset of the Dopplers.

Resolution

Etcd cluster failures in PCF can be corrected by wiping the data from the nodes and restarting them. This process essentially gives the cluster a fresh start, and because no persistent data is stored on the etcd cluster, the operation is harmless (but see the Impact note below if TCP routing is enabled).

Because this process is quick, non-destructive and has a high success rate for fixing etcd problems, Pivotal recommends trying this process first, before doing any additional debugging.

To perform this process, follow the instructions in the "Failed Deploys, Upgrades, Split-Brain Scenarios, etc." section of the etcd-release documentation:

https://github.com/cloudfoundry-incubator/etcd-release#failure-recovery

$ monit stop etcd (on all nodes in etcd cluster)
$ rm -rf /var/vcap/store/etcd/* (on all nodes in etcd cluster)
$ monit start etcd (one-by-one on each node in etcd cluster)
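The three steps above can be sketched as a dry-run plan that prints, in order, the commands to run against each etcd node over `bosh ssh`. This is a minimal sketch, not an official tool: the instance names (`etcd/0`, `etcd/1`, `etcd/2`) are assumptions and should be adjusted to match your deployment, and the generated commands should be reviewed before running them for real.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the recovery commands in the required
# order instead of executing them. NODES is an assumption; list your actual
# etcd instances (e.g. from `bosh instances`).
NODES="etcd/0 etcd/1 etcd/2"

plan() {
  # Phase 1: stop etcd on ALL nodes first
  for n in $NODES; do
    echo "bosh ssh $n -c 'monit stop etcd'"
  done
  # Phase 2: wipe the etcd data store on ALL nodes
  for n in $NODES; do
    echo "bosh ssh $n -c 'rm -rf /var/vcap/store/etcd/*'"
  done
  # Phase 3: start etcd back up one node at a time
  for n in $NODES; do
    echo "bosh ssh $n -c 'monit start etcd'"
  done
}

plan
```

The ordering matters: every node must be stopped and wiped before any node is started, and the starts are done one-by-one so the first node can bootstrap the new cluster before the others join it.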

If you need assistance with these instructions, please open a support ticket.

Impact

If TCP routing is enabled, do not remove the etcd data store during failure recovery, since router group data added by the Routing API is not ephemeral.
