Pivotal Knowledge Base

Consul failing "x/y nodes reported success"

Environment

 Product                  Version
 Pivotal Cloud Foundry®   Prior to 1.9.42, 1.10.31, 1.11.17, 1.12.5

Symptom

In some cases (sometimes during a platform upgrade), Consul will not reach a quorum and it will fail with the following message:

{"timestamp":"1507476742.803297520","source":"confab","message":"confab.agent-client.set-keys.install-key.request.failed","log_level":2,"data":{"error":"Unexpected response code: 500 (1 error(s) occurred:\n\n* 1/89 nodes reported failure)","key":"jm3apuzmw868klczs2xv
=="}}
{"timestamp":"1507476743.803513527","source":"confab","message":"confab.agent-client.set-keys.list-keys.request","log_level":1,"data":{}}
2017/10/08 15:32:23 [INFO] serf: Received list-keys query
2017/10/08 15:32:23 [INFO] serf: Received list-keys query
{"timestamp":"1507476744.513558626","source":"confab","message":"confab.agent-client.set-keys.list-keys.response","log_level":1,"data":{"keys":["jm3apuzmw868klczs2xv
=="]}}
{"timestamp":"1507476744.513611794","source":"confab","message":"confab.agent-client.set-keys.install-key.request","log_level":1,"data":{"key":"jm3apuzmw868klczs2xv
=="}}
2017/10/08 15:32:24 [INFO] serf: Received install-key query
2017/10/08 15:32:24 [INFO] serf: Received install-key query
2017/10/08 15:32:25 [ERR] http: Request POST /v1/operator/keyring, error: 1 error(s) occurred:017/05/04 12:31:39 [INFO] serf: EventMemberFailed: dedicated-node-8 192.0.2.26 {"timestamp":"1493901099.471777678","source":"confab","message":"confab.agent-client.set-keys.list-keys.request.failed","log_level":2,"data":{"error":"155/179 nodes reported success"}}

The important part to note is at the end; the numbers (1 and 89 here) will vary depending on the size of your environment.

{"error":"1/89 nodes reported success"}

This will also result in the consul/0 instance being reported as down or failing. This is the canary instance, and its failure halts the deployment so that the other Consul nodes remain up and working.
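
A quick way to confirm this (a minimal sketch, assuming the BOSH v1 CLI that this article uses elsewhere) is to list the VMs and look for the failing canary:

$ bosh vms | grep consul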

Cause

This error indicates that Consul cannot communicate with some subset of the Consul agents. It typically happens when Consul is propagating new keys, or rotating the existing keys it uses to communicate securely, and something goes wrong during this process. One common problem is that port 8301 is blocked, which prevents Consul from distributing the keys.
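
You can see the same x/y tally outside the logs by querying the keyring yourself. This is a rough sketch, run from a Consul server VM; the binary path assumes a typical consul-release layout and may differ in your environment:

$ /var/vcap/packages/consul/bin/consul keyring -list

Each serf member is asked to report its installed encryption keys, so any nodes that fail to respond here are the same ones dragging down the success count.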

Resolution

Upgrade to version 1.9.42, 1.10.31, 1.11.17, or 1.12.5, all of which avoid this issue.

This is typically not a problem with the Consul server but with specific Consul agents. To resolve it, you need to fix whatever is wrong on the affected agent; exactly how depends on the problem affecting that agent.
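
A hedged starting point (the job name and index below are placeholders for whichever VM hosts the failing agent) is to inspect the agent's process state directly:

$ bosh ssh <job>/<index>
$ sudo monit summary
$ sudo monit restart consul_agent    # only if the process shows as stopped or failing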

Pivotal recommends that you check the problem Consul agent VMs for the following (example commands for both checks follow this list):

  • Make sure that there is adequate disk space on the node (run bosh vms --vitals to see disk usage on the virtual machines). If the ephemeral or persistent disks are more than 90% full, increase their size or remove files to free up space.
  • Confirm that the Consul server is able to communicate with the VMs in question on port 8301, over both TCP and UDP.
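
A minimal sketch of the disk check, assuming the standard BOSH mount points for the ephemeral (/var/vcap/data) and persistent (/var/vcap/store) disks:

$ bosh vms --vitals                        # per-VM disk, CPU, and memory from the director
$ df -h /var/vcap/data /var/vcap/store     # or directly on the agent VM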
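
And a rough sketch of the port check, run from a Consul server VM (192.0.2.26 is a placeholder agent IP taken from the log excerpt above). Note that UDP is connectionless, so nc can only confirm that a probe was sent, not that it was answered:

$ nc -zv 192.0.2.26 8301     # TCP check of the serf LAN port
$ nc -zuv 192.0.2.26 8301    # best-effort UDP probe of the same port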

In some cases, these steps might not be sufficient to resolve the issue.  In that case, please contact Pivotal Support for additional assistance troubleshooting and resolving the problem.

Impact/Risks

This situation is not the same as when Consul goes split brain. In fact, following the instructions to repair a split-brain Consul cluster will make this situation worse and can cause application downtime. If you are seeing the error message indicated in the Symptom section above, make sure you resolve it before restarting or recreating any of the Consul server nodes.
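
If you need to tell the two situations apart, one hedged check (again assuming a typical consul-release binary path) is to list serf membership from a Consul server VM. In this situation the servers still agree on a leader, while individual agents show up as failed:

$ /var/vcap/packages/consul/bin/consul members    # look for agents in the "failed" state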

Additional Information

Related article: 

Consul fails to start during upgrade in Cloud Foundry

How to enable debug mode for Consul

Consul-release: X/Y nodes reported success

Comments

  • Jim Worrell

    As an addendum:

    It is a good idea to check processes across all VMs using `bosh vms --details` to look for evidence of high disk usage.

    It is also recommended to check all deployments, not just the CF deployment.
