Pivotal Knowledge Base


How to recover ETCD cluster after failure

Environment

Product: Pivotal Cloud Foundry (PCF)
Version: 1.6.x, 1.7.x, 1.8.x
Component: Elastic Runtime

Symptom

Metron and Doppler use the etcd cluster for discovery (from Pivotal CF 1.8, TCP routing group and route data is also kept in etcd). In some failure scenarios, a node separates from the etcd cluster and incorrectly forms its own one-node cluster, causing Metrons either to fail to find Dopplers or to swarm a small group of Dopplers.
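You can confirm a split by comparing the cluster view from each etcd VM; a node that reports only itself as a member has split off into its own cluster. The following is a sketch only: the etcdctl binary path and client port (4001) are assumptions based on typical etcd-release deployments and may differ in your environment.

```shell
#!/bin/sh
# Run on each etcd VM. ETCDCTL path below is an assumption; locate the
# real binary with, for example: find /var/vcap -name etcdctl
: "${ETCDCTL:=/var/vcap/packages/etcd/etcdctl}"

check_etcd() {
  # Every node should report the same set of members. A node whose
  # output lists only itself has separated into a one-node cluster.
  "$ETCDCTL" --endpoints http://127.0.0.1:4001 cluster-health
  "$ETCDCTL" --endpoints http://127.0.0.1:4001 member list
}
```

Run `check_etcd` on each node in the cluster and compare the member lists.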

Resolution

Etcd cluster failures in PCF can be corrected by wiping the data from the nodes and resetting them. This process essentially gives the cluster a fresh start, and because no persistent data is stored on the etcd cluster, the operation is harmless.

Because this process is quick, non-destructive, and has a high success rate for fixing etcd problems, Pivotal recommends trying it first, before doing any additional debugging.

To perform this process, follow the instructions in the "Failed Deploys, Upgrades, Split-Brain Scenarios, etc." section of the following link:

https://github.com/cloudfoundry-incubator/etcd-release#failure-recovery

$ monit stop etcd (on all nodes in etcd cluster)
$ rm -rf /var/vcap/store/etcd/* (on all nodes in etcd cluster)
$ monit start etcd (one-by-one on each node in etcd cluster)
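The three steps above can be scripted from a jumpbox with `bosh ssh` rather than logging in to each VM by hand. This is a sketch under stated assumptions: the deployment name (`cf`), job name (`etcd`), three-node cluster size, and BOSH v2 CLI syntax are all placeholders to adjust for your environment.

```shell
#!/bin/sh
# Hypothetical deployment name and BOSH binary; override via environment.
: "${BOSH:=bosh}"
: "${DEPLOYMENT:=cf}"

recover_etcd() {
  # 1. Stop etcd on all nodes (order does not matter here).
  "$BOSH" -d "$DEPLOYMENT" ssh etcd -c 'sudo /var/vcap/bosh/bin/monit stop etcd'

  # 2. Wipe the (non-persistent) etcd data store on all nodes.
  "$BOSH" -d "$DEPLOYMENT" ssh etcd -c 'sudo rm -rf /var/vcap/store/etcd/*'

  # 3. Restart etcd one node at a time so a fresh cluster forms cleanly.
  for node in etcd/0 etcd/1 etcd/2; do
    "$BOSH" -d "$DEPLOYMENT" ssh "$node" -c 'sudo /var/vcap/bosh/bin/monit start etcd'
  done
}
```

To preview the commands without executing them, run `BOSH=echo recover_etcd`. Restarting the nodes one by one (rather than all at once) is what lets the first node bootstrap a new cluster that the others then join.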

If you need assistance with these instructions, please open a support ticket.

Impact/Risk

If you choose to enable TCP routing, do not remove etcd data stores during failure recovery procedures since router group data added by the routing API is not ephemeral.
