Pivotal Knowledge Base

What information to collect when a Pivotal Cloud Foundry RabbitMQ issue occurs?

Environment

  • Pivotal Cloud Foundry (PCF) - all versions
  • PCF RabbitMQ - all versions

Checklist

When a RabbitMQ deployment fails unexpectedly, answering the following questions helps Pivotal support turn the issue around and resolve it faster:

1) Which versions are you running?

  • PCF Operations Manager version
  • PCF Elastic Runtime version
  • PCF RabbitMQ version

2) How is PCF RabbitMQ configured? Go to the RabbitMQ tile and collect the following:

  • PCF RabbitMQ > Settings > Assign AZs and Networks > Balance other jobs in (number of AZs used)
  • PCF RabbitMQ > Settings > RabbitMQ > External load balancer DNS name
  • PCF RabbitMQ > Settings > RabbitMQ > Metrics polling interval
  • PCF RabbitMQ > Settings > Resource Config > HAProxy for RabbitMQ (only if you are using the default load balancer)

3) Attach the output from Ops Manager's debug page.

e.g. https://pcf.TEST.pivotal.io/debug/files where pcf.TEST.pivotal.io is the Ops Manager's FQDN.

4) Collect information from RabbitMQ's Overview page.

Log in to RabbitMQ's Management Dashboard (e.g. https://pivotal-rabbitmq.TEST.pivotal.io) as the service broker admin user and collect the following:

  • Number of RabbitMQ nodes
  • Uptime of each RabbitMQ node
  • Number of Connections
  • Number of Exchanges
  • Number of Queues
  • Number of Consumers
  • Ready messages
  • Unacked messages
  • Total Messages
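
The same figures can also be captured as raw JSON from the RabbitMQ Management HTTP API, which is convenient for attaching to a support ticket. The following is a minimal sketch, not an official Pivotal tool: the function names, the RMQ_USER / RMQ_PASS variables and the output file names are all hypothetical, and it assumes the HTTP API is reachable under /api on the same host as the Management Dashboard.

```shell
#!/usr/bin/env bash
# Sketch only: function, variable and file names below are hypothetical.

# Build a Management HTTP API URL from the dashboard host and an API path.
api_url() {
  printf 'https://%s/api/%s\n' "$1" "$2"
}

# Save the raw JSON returned for one API path into a local file.
# Credentials are read from RMQ_USER and RMQ_PASS (assumed to be the
# service broker admin user mentioned above).
fetch_api() {
  curl --silent --show-error --fail \
    --user "${RMQ_USER}:${RMQ_PASS}" \
    --output "$3" \
    "$(api_url "$1" "$2")"
}

# /api/overview: object_totals (connections, exchanges, queues, consumers)
#                and queue_totals (messages_ready, messages_unacknowledged, messages).
# /api/nodes:    one entry per node, including its uptime in milliseconds.
collect_overview() {
  fetch_api "$1" overview rmq-overview.json &&
    fetch_api "$1" nodes rmq-nodes.json
}

# Example (not executed here):
#   RMQ_USER=admin RMQ_PASS=secret collect_overview pivotal-rabbitmq.TEST.pivotal.io
```

The number of entries in rmq-nodes.json gives the node count, so the two files together cover every item in the list above.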

5) Collect the following diagnostic information from all RabbitMQ nodes.

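First, log in to the BOSH Director and target your RabbitMQ deployment.

Then run the following diagnostic command from the PCF Ops Manager VM against every rabbitmq-server instance. The example below targets rabbitmq-server 0 only; repeat it for every RabbitMQ node, usually rabbitmq-server {0..3}. Because the heredoc is unquoted, $DIAGNOSTIC_DIR is expanded by your local shell before the commands are sent, so it must be set first. The path below is only an example; use any directory that is writable on the RabbitMQ VMs:

DIAGNOSTIC_DIR=/tmp/rabbitmq-diagnostics

sudo -i bosh ssh rabbitmq-server 0 <<EOF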

set +e

mkdir -p $DIAGNOSTIC_DIR

pstree -panl &> $DIAGNOSTIC_DIR/pstree.log

ps e -p \$(pgrep beam) &> $DIAGNOSTIC_DIR/ps-env.log

free -h &> $DIAGNOSTIC_DIR/free.log

vmstat -S M 1 10 &> $DIAGNOSTIC_DIR/vmstat.log

iostat -txm 1 10 &> $DIAGNOSTIC_DIR/iostat.log

df -h &> $DIAGNOSTIC_DIR/df.log

lsblk -a &> $DIAGNOSTIC_DIR/lsblk.log

lsof -nPi TCP &> $DIAGNOSTIC_DIR/tcp.log

ss -panit &> $DIAGNOSTIC_DIR/ss.log

export PATH=\$PATH:/var/vcap/bosh/bin

monit status &> $DIAGNOSTIC_DIR/monit-status.log

export PATH=\$PATH:/var/vcap/packages/rabbitmq-server/bin:/var/vcap/packages/erlang/bin

timeout 60 rabbitmqctl status &> $DIAGNOSTIC_DIR/rmq-status.log

timeout 60 rabbitmqctl environment &> $DIAGNOSTIC_DIR/rmq-environment.log

timeout 60 rabbitmqctl cluster_status &> $DIAGNOSTIC_DIR/rmq-cluster_status.log

timeout 60 rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().' &> $DIAGNOSTIC_DIR/rmq-maybe_stuck.log

timeout 60 rabbitmqctl report &> $DIAGNOSTIC_DIR/rmq-report.log

cp -f /var/log/syslog* /var/vcap/monit/monit.log $DIAGNOSTIC_DIR/

chown -fR vcap:vcap $DIAGNOSTIC_DIR

EOF
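
Repeating the command by hand for every node is error-prone, so it may help to wrap it in a loop. The following is a rough sketch under stated assumptions: remote_ssh and collect_from_all_nodes are hypothetical names, the default of four nodes simply mirrors the "usually {0..3}" note above, and the heredoc body is a shortened placeholder that should be replaced with the full set of diagnostic commands.

```shell
#!/usr/bin/env bash
# Sketch only: the function names below are hypothetical.

# Transport hook: wraps the same command used in the example above,
# and can be redefined for a dry run.
remote_ssh() {
  sudo -i bosh ssh "$@"
}

# Send the same remote commands to rabbitmq-server 0 .. (count - 1).
collect_from_all_nodes() {
  count="${1:-4}"   # assumption: four nodes unless told otherwise
  i=0
  while [ "$i" -lt "$count" ]; do
    echo "Collecting from rabbitmq-server $i" >&2
    # Placeholder body: substitute the full diagnostic commands here.
    remote_ssh rabbitmq-server "$i" <<EOF
set +e
export PATH=\$PATH:/var/vcap/packages/rabbitmq-server/bin:/var/vcap/packages/erlang/bin
timeout 60 rabbitmqctl cluster_status
EOF
    i=$((i + 1))
  done
}

# Example (not executed here):
#   collect_from_all_nodes 4
```

Keeping the transport in its own function makes it possible to check which nodes would be contacted, for example by temporarily redefining remote_ssh to print its arguments, before touching a production cluster.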

6) Generate logs for every RabbitMQ node. This can be done via the 'Status' page on the RabbitMQ tile. Logs can then be downloaded from the 'Logs' tab on the RabbitMQ tile.

7) It would be useful to answer the following questions:

  • When did the issue first occur? e.g. 2017-05-02T07:11:41+00:00
  • Is it a recurring issue? If so, does it recur at regular intervals?
  • Which protocol(s) are you using? e.g. AMQP / MQTT / STOMP
  • Which RabbitMQ client & version are you using? e.g. rabbitmq-java 4.1.0
  • What exceptions is the RabbitMQ client returning?
  • Which connection / channel / queue / exchange are the exceptions for?
  • Is a specific RabbitMQ node failing?
  • Are multiple RabbitMQ nodes failing at the same time?
  • Is the Erlang VM PID changing?
  • Did any rabbitmqctl commands fail during the incident? In which way?
  • What actions did you take to remedy the failures?
  • In which way did the observed behavior change after your actions?
  • Did you restart all RabbitMQ nodes at once or only the affected node(s)?
  • Did any RabbitMQ node remain running during restarts?
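
Two of the questions above, the timestamp of the incident and whether the Erlang VM PID is changing, can be answered with a small script run on a RabbitMQ node. This is an illustrative sketch only: the function names are hypothetical, and it assumes, as the diagnostic command in step 5 does, that pgrep beam finds the Erlang VM process.

```shell
#!/usr/bin/env bash
# Sketch only: the function names below are hypothetical.

# Print the current time in UTC, in the same shape as the example
# timestamp above (2017-05-02T07:11:41+00:00).
timestamp() {
  date -u +%Y-%m-%dT%H:%M:%S+00:00
}

# Print the PID(s) of the Erlang VM on one line; empty if it is not running.
beam_pids() {
  # Unquoted on purpose: joins multiple PIDs with single spaces.
  echo $(pgrep beam)
}

# Sample the Erlang VM PID a fixed number of times and report every change.
#   $1: seconds between samples (default 10)
#   $2: total number of samples  (default 6)
watch_beam_pid() {
  interval="${1:-10}"
  samples="${2:-6}"
  previous="$(beam_pids)"
  echo "$(timestamp) initial beam pid(s): ${previous:-none}"
  n=1
  while [ "$n" -lt "$samples" ]; do
    sleep "$interval"
    current="$(beam_pids)"
    if [ "$current" != "$previous" ]; then
      echo "$(timestamp) beam pid(s) changed: ${previous:-none} -> ${current:-none}"
      previous="$current"
    fi
    n=$((n + 1))
  done
}

# Example (not executed here): sample every 10 seconds for one minute.
#   watch_beam_pid 10 6
```

A changed PID between samples means the Erlang VM was restarted during the observation window, which is worth noting alongside the timestamps printed by the script.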
