Pivotal Knowledge Base

Investigating Ghost queues on RabbitMQ

Environment

RabbitMQ for PCF

OSS RabbitMQ

Symptom 

A queue with a NaN status is seen in the RabbitMQ management UI.

A NaN queue indicates that a record of the queue exists in the Mnesia database, but the queue itself could not be found.

This can be verified by running rabbitmqctl list_queues: a queue in the NaN state is not shown in the listing.
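
The comparison above can be sketched as a small shell helper. It assumes the queue names shown in the management UI and the names printed by rabbitmqctl list_queues have each been saved to a file, one name per line. The function name and the file layout are hypothetical, not part of RabbitMQ.

```shell
# Sketch only: print queue names that appear in the first list but not the second.
# $1: file with one queue name per line, as shown in the management UI
# $2: file with one queue name per line, taken from `rabbitmqctl list_queues`
ghost_queue_candidates() {
  ui_sorted=$(mktemp) || return 1
  ctl_sorted=$(mktemp) || return 1
  # comm requires sorted input; -u also drops duplicate names
  sort -u "$1" > "$ui_sorted"
  sort -u "$2" > "$ctl_sorted"
  # -23 suppresses lines unique to file 2 and lines common to both,
  # leaving only names the UI shows but rabbitmqctl does not
  comm -23 "$ui_sorted" "$ctl_sorted"
  rm -f "$ui_sorted" "$ctl_sorted"
}
```

Any name it prints is only a candidate: confirm that it also shows a NaN status in the management UI before treating it as a ghost queue.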

Sample error from the rabbitmq-server logs (see RabbitMQ Logging for log locations):

Caused by: com.rabbitmq.client.ShutdownSignalException: channel error;
protocol method: #method<channel.close>(reply-code=404, reply-text=NOT_FOUND
- failed to perform operation on queue '' in vhost '' due to timeout, class-id=50, method-id=10)

Information to collect

  1. rabbitmqctl report
  2. rabbitmqctl list_queues
  3. Screenshot from the RabbitMQ Management UI showing the NaN queues
  4. Logs from all server nodes
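
Items 1 and 2 can be captured in one step with a helper like the one below. This is a minimal sketch: the function name and the default output directory are hypothetical, and the screenshot and the logs from each node still have to be gathered by hand.

```shell
# Sketch only: save the output of the two rabbitmqctl commands listed above.
# $1: output directory (optional; the default name is an arbitrary choice)
collect_rabbitmq_info() {
  out_dir=${1:-rabbitmq-diagnostics}
  mkdir -p "$out_dir" || return 1
  # Capture stderr as well, so any CLI errors end up in the collected files
  rabbitmqctl report > "$out_dir/report.txt" 2>&1
  rabbitmqctl list_queues > "$out_dir/list_queues.txt" 2>&1
  # Print the directory so the caller knows where the files were written
  echo "$out_dir"
}
```

Run it on a node where rabbitmqctl can reach the broker, for example collect_rabbitmq_info /tmp/ghost-queue-case.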

Resolution

  • Do not mirror non-durable queues.
  • If a queue needs to be mirrored, make it durable.
  • If non-durable queues are mirrored anyway, ensure the ha-sync-mode policy is set to automatic for them.
  • To maximize the availability of non-durable mirrored queues and always promote a queue slave to master, use ha-promote-on-shutdown: always. This has the inherent risk of losing messages when unsynchronized queue slaves get promoted to queue masters. If this trade-off is not acceptable, use mirrored durable queues.
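
The two policy keys named above could be applied with rabbitmqctl set_policy, as in the sketch below. The function name, the policy name ha-transient, and the "ha-mode":"all" setting are assumptions made for illustration. Only ha-sync-mode and ha-promote-on-shutdown come from this article, so adjust the rest to match the existing mirroring policy.

```shell
# Sketch only: apply a mirroring policy to queues whose names match a pattern.
# $1: vhost name
# $2: regular expression matching the non-durable queues to cover
apply_transient_ha_policy() {
  vhost=$1
  pattern=$2
  # "ha-transient" is a hypothetical policy name and "ha-mode":"all" is an
  # assumed mirroring mode; the other two keys are the ones discussed above.
  rabbitmqctl set_policy -p "$vhost" --apply-to queues \
    ha-transient "$pattern" \
    '{"ha-mode":"all","ha-sync-mode":"automatic","ha-promote-on-shutdown":"always"}'
}
```

For example, apply_transient_ha_policy VHOST_NAME '^transient\.' would cover queues whose names start with "transient." in that vhost. The message-loss trade-off of ha-promote-on-shutdown: always still applies.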

Refer to https://github.com/rabbitmq/rabbitmq-server/issues/1501

Impact

Changing the ha-promote-on-shutdown policy to always carries the inherent risk of losing messages when unsynchronized queue slaves are promoted to queue masters. If this trade-off is not acceptable, use mirrored durable queues.

Additional Information 

  • Queues in this state can normally be deleted using:
rabbitmqctl eval 'Q = {resource, <<"VHOST_NAME">>, queue, <<"QUEUE_NAME">>}, rabbit_amqqueue:internal_delete(Q).'
  • When queues are synchronized automatically, keep them as empty as possible. Syncing millions of messages or GBs of data on every node restart is known to trigger alarms and block all publishers for extended periods of time (hours, days, even forever in extreme scenarios).
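
When several ghost queues need to be removed, the expression passed to rabbitmqctl eval can be generated instead of edited by hand. The helper below is a hypothetical sketch that only builds the expression shown above. It refuses names containing a double quote or backslash, because those would break the Erlang binary literal.

```shell
# Sketch only: print the Erlang expression used above for one queue.
# $1: vhost name
# $2: queue name
ghost_delete_expr() {
  # Reject characters that would need escaping inside <<"...">>
  case "$1$2" in
    *\"*|*\\*) echo "unsupported character in vhost or queue name" >&2; return 1 ;;
  esac
  printf 'Q = {resource, <<"%s">>, queue, <<"%s">>}, rabbit_amqqueue:internal_delete(Q).\n' "$1" "$2"
}
```

Review the generated expression before running it, for example with rabbitmqctl eval "$(ghost_delete_expr VHOST_NAME QUEUE_NAME)", and confirm that the internal_delete call matches the RabbitMQ version in use.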
