Pivotal Knowledge Base

How to report RabbitMQ issues

Environment

Pivotal RabbitMQ 1.8.x and above

Purpose

When a RabbitMQ deployment behaves unexpectedly, it is very hard to locate the problem without sufficient information. The underlying hardware might be under pressure, the network might be saturated or experiencing high latency, RabbitMQ might be misconfigured, or you might have hit an edge case. Fast support turnaround and incident resolution start with answering the basic but very important questions in the procedure below.

Procedure

Work through the following steps to understand the problem:

1. Which versions are you running?

Check the versions in Ops Manager.

2. How is PCF RabbitMQ configured? 

Go to PCF RabbitMQ > Credentials > Service Broker Rabbitmq Admin Credentials > Link to Credential and copy the credentials. They are needed for the next step.

All of the above, and a lot more, is available from the Ops Manager debug page, e.g. https://pcf.fqdn.pivotal.io/debug/files, where pcf.fqdn.pivotal.io stands for your Ops Manager hostname. Including this extra output in your support request is very helpful.

3. Gather information from RabbitMQ's overview page:

  • Log in to RabbitMQ's Management Dashboard as the service broker admin user, e.g. at https://pivotal-rabbitmq.fqdn.pivotal.io, and record the following:

    • Number of RabbitMQ nodes
    • Uptime of each RabbitMQ node
    • Connections
    • Channels
    • Exchanges
    • Queues
    • Consumers
    • Ready Messages
    • Unacked Messages
    • Total Messages
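
The same figures can usually be captured as JSON from the management plugin's HTTP API, which is served under /api on the dashboard host. The following is a minimal sketch, not an official procedure: the variable names RMQ_URL, RMQ_USER and RMQ_PASS and both helper functions are hypothetical, and the host is the example one used in this article. It assumes curl is installed and that the dashboard's TLS certificate is trusted by your workstation.

```shell
#!/usr/bin/env bash
# Sketch: fetch the overview figures over the management plugin's HTTP API.
# RMQ_URL, RMQ_USER and RMQ_PASS are hypothetical names; set them to your
# dashboard URL and the service broker admin credentials from step 2.
RMQ_URL="${RMQ_URL:-https://pivotal-rabbitmq.fqdn.pivotal.io}"
RMQ_USER="${RMQ_USER:-}"
RMQ_PASS="${RMQ_PASS:-}"

rmq_api_url() {
  # $1 = API path such as "overview" or "nodes".
  # A trailing slash on RMQ_URL is stripped before appending /api/.
  printf '%s/api/%s\n' "${RMQ_URL%/}" "$1"
}

rmq_get() {
  # Fetch one API resource as JSON; --fail turns HTTP errors into a
  # non-zero exit status instead of saving an error page.
  curl --silent --show-error --fail \
       --user "$RMQ_USER:$RMQ_PASS" "$(rmq_api_url "$1")"
}
```

For example, `rmq_get overview > overview.json` saves the object and message totals (connections, channels, exchanges, queues, consumers, ready, unacknowledged and total messages), and `rmq_get nodes > nodes.json` saves one entry per node, including its uptime. Attaching both files to the support request complements the screenshots.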

4. Run a diagnostic on every RabbitMQ node

  • First, log in to your PCF Ops Manager VM, e.g.:
    • ssh -i pcf-private-ssh-key.pem ubuntu@pcf.fqdn.pivotal.io
  • Log in to your BOSH Director (the credentials are in PCF Ops Manager) and target the correct deployment:
    • bosh login
      Email: director
      Password: [OPS-MANAGER-DIRECTOR-PASSWORD]
      bosh deployment /var/tempest/workspaces/default/deployments/p-rabbitmq-x.x.x.x.x.yml
  • Now run this diagnostic command from the PCF Ops Manager VM against every rabbitmq-server instance:
    • DIAGNOSTIC_DIR="/var/vcap/sys/log/rabbitmq-server/diagnostic.$(date +'%Y%m%d.%H%M')"
      bosh ssh rabbitmq-server 0 <<EOF
      sudo -i
      set +e
      mkdir -p $DIAGNOSTIC_DIR
      pstree -panl &> $DIAGNOSTIC_DIR/pstree.log
      ps e -p \$(pgrep beam) &> $DIAGNOSTIC_DIR/ps-env.log
      free -h &> $DIAGNOSTIC_DIR/free.log
      vmstat -S M 1 10 &> $DIAGNOSTIC_DIR/vmstat.log
      iostat -txm 1 10 &> $DIAGNOSTIC_DIR/iostat.log
      df -h &> $DIAGNOSTIC_DIR/df.log
      lsblk -a &> $DIAGNOSTIC_DIR/lsblk.log
      lsof -nPi TCP &> $DIAGNOSTIC_DIR/tcp.log
      ss -panit &> $DIAGNOSTIC_DIR/ss.log
      export PATH=\$PATH:/var/vcap/bosh/bin
      monit status &> $DIAGNOSTIC_DIR/monit-status.log
      export PATH=\$PATH:/var/vcap/packages/rabbitmq-server/bin:/var/vcap/packages/erlang/bin
      timeout 60 rabbitmqctl status &> $DIAGNOSTIC_DIR/rmq-status.log
      timeout 60 rabbitmqctl environment &> $DIAGNOSTIC_DIR/rmq-environment.log
      timeout 60 rabbitmqctl cluster_status &> $DIAGNOSTIC_DIR/rmq-cluster_status.log
      timeout 60 rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().' &> $DIAGNOSTIC_DIR/rmq-maybe_stuck.log
      timeout 60 rabbitmqctl report &> $DIAGNOSTIC_DIR/rmq-report.log
      cp -f /var/log/syslog* /var/vcap/monit/monit.log $DIAGNOSTIC_DIR/
      chown -fR vcap:vcap $DIAGNOSTIC_DIR
      EOF

The example above runs against rabbitmq-server 0 only; repeat it for every RabbitMQ node, usually rabbitmq-server {0..3}.
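
Repeating the command by hand is error-prone, so a small loop can help. The sketch below is an illustration under stated assumptions: BOSH_SSH, run_on_node and run_on_all_nodes are hypothetical names, and the remote commands are read from a local file that you prepare yourself rather than from an inline heredoc.

```shell
#!/usr/bin/env bash
# Sketch: run the same remote commands against several rabbitmq-server
# instances. BOSH_SSH is a hypothetical override, so the loop can be
# dry-run with BOSH_SSH=echo before touching any node.
BOSH_SSH="${BOSH_SSH:-bosh ssh}"

run_on_node() {
  # $1 = instance index; stdin carries the commands to run on that instance
  $BOSH_SSH rabbitmq-server "$1"
}

run_on_all_nodes() {
  # $1 = local file holding the remote commands
  # remaining arguments = instance indexes, e.g. 0 1 2
  local script="$1"
  shift
  local index
  for index in "$@"; do
    echo "== rabbitmq-server $index =="
    # Keep going if one node fails, so the healthy nodes are still covered
    run_on_node "$index" < "$script" ||
      echo "rabbitmq-server $index: command failed" >&2
  done
}
```

To use it, save the lines between `<<EOF` and `EOF` from the diagnostic command to a file such as diagnostic-commands.sh (a hypothetical name), replacing $DIAGNOSTIC_DIR with the expanded directory path and removing the backslash in front of each `\$`, since nothing expands the file locally. Then call `run_on_all_nodes diagnostic-commands.sh 0 1 2` with the instance indexes that exist in your deployment.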

5. Generate logs for every RabbitMQ node

  • Once the diagnostic command has run on every rabbitmq-server instance, generate the PCF RabbitMQ logs from your PCF Ops Manager.

6. Download the logs from every RabbitMQ node

  • Once the logs have been generated, download them to your local workstation.

7. Create a support request using the below information as a template:

  • When did the issue first occur? e.g. 2017-05-02T07:11:41+00:00
  • Is it a recurring issue? If so, does it recur at regular intervals?
  • Which protocols are being used? e.g. AMQP/MQTT/STOMP
  • Which RabbitMQ client & version is being used? e.g. rabbitmq-java 4.1.0
  • What exceptions is the RabbitMQ client returning?
  • Which connection / channel / queue / exchange are the exceptions for?
  • Is a specific RabbitMQ node failing?
  • Are multiple RabbitMQ nodes failing at the same time?
  • Is the Erlang VM PID changing?
  • Did any rabbitmqctl commands fail during the incident? If so, how?
  • What actions did you take to remedy the failures?
  • How did the observed behavior change after those actions were taken?
  • Did you restart all RabbitMQ nodes at once or only the affected node(s)?
  • Did any RabbitMQ node remain running during restarts?
  • If you suspect a particular cause, please describe it

8. Upload all information and screenshots gathered in the previous steps, together with all generated logs, to your support request. This helps everyone involved with the request act more efficiently. Upload location: https://securefiles.pivotal.io/dropzone/customer-service
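
If you have collected everything in one local directory, bundling it into a single archive can make the upload easier to handle. This is only a sketch: the function name, directory and archive name are hypothetical, and nothing in this article requires a particular archive format.

```shell
#!/usr/bin/env bash
# Sketch: bundle locally collected files into one archive before upload.
# The function name and both paths are hypothetical; use the directory
# where you saved screenshots, answers and downloaded logs.

bundle_support_files() {
  # $1 = directory holding the collected files, $2 = archive to create
  local src_dir="$1" archive="$2"
  if [ ! -d "$src_dir" ]; then
    echo "not a directory: $src_dir" >&2
    return 1
  fi
  # -C keeps paths in the archive relative to the collection directory
  tar -czf "$archive" -C "$src_dir" . || return 1
  # List the contents so you can confirm nothing is missing
  tar -tzf "$archive"
}
```

For example, `bundle_support_files ~/rmq-incident /tmp/rmq-incident.tar.gz` creates the archive and prints its contents. Write the archive outside the directory being bundled so that tar does not try to include the archive in itself.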

 
