Pivotal Knowledge Base


Cloud Foundry Tiles Installed on a Separate Subnet from ERT Appear to Be Running and Failing Randomly

Environment

 Product                       Version
 Pivotal Cloud Foundry® (PCF)  All

Symptom

ERT and services tiles are installed on different subnets.

If you see issues with logs reported by JMX Bridge or the OpenTSDB Firehose Nozzle, first make sure that all services on Pivotal Cloud Foundry (PCF) are running correctly.

You can check this in the Ops Manager WebGUI. When you refresh the page, you may see some service tile node(s) appear to be running and failing randomly. 

Issue

This can also be checked from the Ops Manager CLI by running 'bosh vms' (make sure the CF deployment is set first).

SSH to Ops Manager:

  • 'ssh ubuntu@<Ops_Manager_IP>'

Target the Director IP:

  • 'bosh --ca-cert /var/tempest/workspaces/default/root_ca_certificate target <Director_IP_Address>'

Set CF Deployment:

  • 'bosh deployment /var/tempest/workspaces/default/deployments/cf-<deployment-id>.yml'

Check the VM status:

  • 'bosh vms --vitals'

Next, select one of the service tile deployments, for example RabbitMQ:

  • 'bosh deployment /var/tempest/workspaces/default/deployments/p-rabbitmq-<unique-ID>.yml'

Check the per-process status of the instances:

  • 'bosh instances --ps'

It will show metron_agent randomly alternating between the running and failing states.
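When triaging by hand, it can help to filter the table output down to just the failing rows. The helper below is a hypothetical convenience, not part of the BOSH CLI, and it assumes the pipe-delimited table layout that 'bosh instances --ps' prints:

```shell
# Hypothetical helper: print only the rows that `bosh instances --ps`
# reports as failing. Assumes the classic pipe-delimited table output,
# where column 2 is the instance/process name and column 3 its state.
list_failing() {
  awk -F'|' '$3 ~ /failing/ { gsub(/^ +| +$/, "", $2); print $2 }'
}

# Usage: bosh instances --ps | list_failing
```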

See the example below for RabbitMQ-Server and RabbitMQ-Broker nodes failing:

+------------------------------+---------+-----+----------------------------+---------+
| Instance                     | State   | AZ  | VM Type                    | IPs     |
+------------------------------+---------+-----+----------------------------+---------+
| rabbitmq-broker-partition/0  | failing | n/a | rabbitmq-broker-partition  | 1.2.3.4 |
|   rabbitmq-broker            | running |     |                            |         |
|   broker-route-registrar     | running |     |                            |         |
|   metron_agent               | failing |     |                            |         |
|   service-metrics            | running |     |                            |         |
+------------------------------+---------+-----+----------------------------+---------+
| rabbitmq-haproxy-partition/0 | running | n/a | rabbitmq-haproxy-partition | 1.2.3.5 |
|   rabbitmq-haproxy           | running |     |                            |         |
|   management-route-registrar | running |     |                            |         |
|   metron_agent               | running |     |                            |         |
|   service-metrics            | running |     |                            |         |
+------------------------------+---------+-----+----------------------------+---------+
| rabbitmq-server-partition/0  | running | n/a | rabbitmq-server-partition  | 1.2.3.6 |
|   rabbitmq-server            | running |     |                            |         |
|   metron_agent               | running |     |                            |         |
|   service-metrics            | running |     |                            |         |
+------------------------------+---------+-----+----------------------------+---------+
| rabbitmq-server-partition/1  | running | n/a | rabbitmq-server-partition  | 1.2.3.7 |
|   rabbitmq-server            | running |     |                            |         |
|   metron_agent               | running |     |                            |         |
|   service-metrics            | running |     |                            |         |
+------------------------------+---------+-----+----------------------------+---------+
| rabbitmq-server-partition/2  | failing | n/a | rabbitmq-server-partition  | 1.2.3.8 |
|   rabbitmq-server            | running |     |                            |         |
|   metron_agent               | failing |     |                            |         |
|   service-metrics            | running |     |                            |         |
+------------------------------+---------+-----+----------------------------+---------+

In this example, the set of processes reported as failing changes randomly each time 'bosh instances --ps' is run against the RabbitMQ deployment.

Log in to one of the failing nodes to check the logs:

  1. Select one of the services deployments, for example RabbitMQ
  2. 'bosh deployment /var/tempest/workspaces/default/deployments/p-rabbitmq-<unique-ID>.yml'
  3. Perform 'bosh ssh' to one of the nodes
  4. 'sudo su -'
  5. 'cd /var/vcap/sys/log/metron_agent/'

metron_agent.stdout.log

       "metron","log_level":"warn","message":"Failed to connect to etcd.

metron_agent.stdout.log

       RuntimeStats: failed to emit: EventWriter: No envelope writer set (see SetWriter)  
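To spot these messages quickly across all the rotated files in the job's log directory, a simple grep is enough. The function below is a hypothetical helper; it assumes the standard /var/vcap/sys/log/metron_agent layout used above, and takes a directory argument only so it is easy to try elsewhere:

```shell
# Hypothetical helper: scan metron_agent logs for etcd connection
# failures. Defaults to the standard BOSH job log directory.
scan_metron_logs() {
  local dir="${1:-/var/vcap/sys/log/metron_agent}"
  grep -h 'Failed to connect to etcd' "$dir"/*.log 2>/dev/null
}

# Usage (on the failing node): scan_metron_logs
```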

Cause

Service tiles such as RabbitMQ, Pivotal Redis, Pivotal MySQL, etc., need to be able to communicate with the ETCD nodes on port 4001.

The metron_agent process on every node in Cloud Foundry communicates with ETCD, a highly available key-value store, for shared configuration and service discovery. This communication happens on port 4001.

Please see the ETCD source code for details.

If the service tiles are installed on a different subnet than the Elastic Runtime installation, port 4001 MUST be open between the services subnet and the ERT subnet.
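This requirement can be verified from any service tile VM with a plain TCP probe. The sketch below uses bash's /dev/tcp redirection instead of nc, so it works even where nc is not installed; check_etcd_port is a hypothetical name, not a stock utility:

```shell
# Hypothetical helper: probe TCP connectivity to an etcd node on
# port 4001 using bash's /dev/tcp pseudo-device (no nc required).
check_etcd_port() {
  local ip="$1" port="${2:-4001}"
  if timeout 2 bash -c "exec 3<>/dev/tcp/$ip/$port" 2>/dev/null; then
    echo "open"
  else
    echo "closed"
  fi
}

# Usage: check_etcd_port <IP_Address_of_etcd_node>
```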

Resolution

Open port 4001 between the Elastic Runtime subnet and the services tiles subnet.

As described in the README of the ETCD code, all nodes in Cloud Foundry establish communication with the ETCD nodes on port 4001 for shared configuration and service discovery.

The issue described above is caused by failed communication between the service tile nodes and the ETCD nodes on port 4001.

Please follow the steps below to troubleshoot and fix the issue:

  1. Log in to the Ops Manager CLI.
  2. Select the deployment for CF installation:
    • 'bosh deployment /var/tempest/workspaces/default/deployments/cf-<deployment-id>.yml'
    • Run 'bosh vms' to find the IP Address of ETCD node(s)
    • Perform 'bosh ssh' to login to one of the ETCD nodes.
  3. Make sure the ETCD cluster is healthy:
    • 'sudo su -'
    • Find the etcdctl tool: 'find / -name etcdctl'
    • Make a note of the path as we'll use it in our commands below.
    • Run health-check: '/var/vcap/data/packages/etcd/<long-ID>/etcdctl cluster-health'
    • Check the member list: '/var/vcap/data/packages/etcd/<long-ID>/etcdctl member list'
  4. Once it is confirmed the ETCD cluster is fine, exit out of the ETCD VM.
  5. Select one of the services deployments, for example RabbitMQ, and test connectivity on port 4001:
    • 'bosh deployment /var/tempest/workspaces/default/deployments/p-rabbitmq-<unique-ID>.yml'
    • Perform 'bosh ssh' to one of the rabbitmq-server nodes
    • 'sudo su -'
    • Check the communication with etcd node IP (from Step 3 above) on port 4001:
    • 'nc -v <IP_Address_of_etcd_node> 4001'
    • Make sure the connection is successful.
  6. Once communication on port 4001 succeeds, the metron_agent errors should be resolved and the tiles should report a stable running state.
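For step 3, 'etcdctl cluster-health' prints one line per member followed by an overall verdict. A small helper can gate on that verdict in a script; the function name is hypothetical and the output format assumed is etcd v2's:

```shell
# Hypothetical helper: return success only if `etcdctl cluster-health`
# (etcd v2 output format, as assumed here) reports a healthy cluster.
cluster_healthy() {
  grep -qx 'cluster is healthy'
}

# Usage:
# /var/vcap/data/packages/etcd/<long-ID>/etcdctl cluster-health | cluster_healthy \
#   && echo "etcd cluster OK"
```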

 
