This blog post provides advice and recommendations for best practices when using an IBM Cloud Messages for RabbitMQ deployment.
It summarises questions our customers have when running RabbitMQ deployments on the IBM Cloud, and we hope it will be useful for developers and administrators who are going to use IBM Cloud Messages for RabbitMQ. We assume that the reader is familiar with RabbitMQ, as the post mainly deals with the differences between a self-hosted RabbitMQ cluster and one deployed and managed by IBM Cloud.
- An IBM Cloud Messages for RabbitMQ deployment consists of three members (nodes) in three availability zones.
- The deployment exposes one connection string, which will redirect to one of the nodes in the deployment in a round-robin fashion.
- Each Messages for RabbitMQ service instance is a single RabbitMQ cluster within a single IBM Cloud region.
Recommendation: If your use-case has a requirement for inter-cluster message operation, you can use the Shovel plugin, which is supported. But when deciding whether to use the service, please note that IBM Cloud Messages for RabbitMQ does not support the Federation plugin and associated federated exchanges and queues.
- All Messages for RabbitMQ connections use TLS/SSL encryption for data in transit.
- All data is encrypted at rest by default.
- Messages for RabbitMQ can be set up with access via public and/or private endpoints.
- Additionally, IP allowlists can be set up to further restrict allowable access.
- More details on security and compliance can be found here.
RabbitMQ supports many use cases for different types of applications. Describing the various types of queues available and their potential uses is beyond the scope of this document as there are many online resources that deal with that (you can find an example here).
Recommendation: We encourage you to use quorum queues on the IBM Cloud Messages for RabbitMQ deployment. The quorum queue is RabbitMQ's strategic queue type for high availability and data safety, and it aims to replace classic mirrored queues, which are due to be deprecated with RabbitMQ version 4.0. Using quorum queues will help reduce operational problems and keep your message data highly available. There is more information on quorum queues here. We also have more information in this blog post announcing version 3.8 of RabbitMQ.
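As a minimal sketch of the recommendation above: a quorum queue is requested when the queue is declared, by setting the x-queue-type queue argument to quorum. The helper below assumes a channel object whose queue_declare method is shaped like the one in the pika Python client. The helper name is hypothetical.

```python
def declare_quorum_queue(channel, name):
    """Declare a durable quorum queue on an already-open channel.

    Assumption: `channel` exposes a pika-style
    queue_declare(queue=..., durable=..., arguments=...) method.
    Quorum queues must be declared as durable.
    """
    return channel.queue_declare(
        queue=name,
        durable=True,
        arguments={"x-queue-type": "quorum"},
    )
```

The queue type is fixed at declaration time, so an existing classic queue has to be re-created under a new declaration to become a quorum queue.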
A large number of messages in a queue puts a heavy load on RAM. RabbitMQ starts flushing messages to disk as queues get big. That process usually takes time and blocks the queue from processing messages, which affects queueing speed. For best performance, queue size should stay as close to 0 as possible.
Recommendation: Keep your queues short. To shorten queues, ensure that you can scale up your message consumers when there are peaks of messages generated by the message producers. RabbitMQ functions best when used as a messaging system. It is not advisable to use it as a persistent store for messages.
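One way to reason about scaling consumers during a peak is to estimate how many are needed to drain the current backlog within a target time. The sketch below is an illustrative heuristic, not a RabbitMQ feature: the function name, its parameters and the numbers in the usage note are all assumptions.

```python
import math


def consumers_needed(backlog, publish_rate, per_consumer_rate, drain_seconds):
    """Estimate how many consumers drain `backlog` messages in `drain_seconds`
    while producers keep publishing at `publish_rate` messages per second.

    `per_consumer_rate` is the measured throughput of one consumer (msg/s).
    """
    if per_consumer_rate <= 0 or drain_seconds <= 0:
        raise ValueError("per_consumer_rate and drain_seconds must be positive")
    # Consumers must absorb new traffic plus the share of backlog per second.
    required_rate = publish_rate + backlog / drain_seconds
    return max(1, math.ceil(required_rate / per_consumer_rate))
```

For example, with a backlog of 6000 messages, producers publishing 200 messages per second and consumers that each handle 50 messages per second, consumers_needed(6000, 200, 50, 60) asks for 6 consumers to clear the backlog in a minute.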
High availability (replication)
The default v-host in an IBM Cloud Messages for RabbitMQ deployment is created with a policy definition (ha-all) that is optimised for high availability. The key parameters in the definition are as follows:
- ha-mode: all
- ha-promote-on-shutdown: when-synced
- ha-sync-mode: automatic
For ha-promote-on-shutdown, the option of "always" can also be used. These are the differences:
- when-synced will prioritise data consistency over availability. If you choose this option, your queue might be unavailable if the last node to be synced goes offline. It should recover once the node comes back online, which in very rare scenarios can take a long time. Also, based on our experience, this setting makes it more likely that a queue will be corrupted when there is no promotable leader. Learn more here.
- always will prioritise availability. Even if there isn't a synced replica, one will be promoted and your queue won’t stop taking traffic, but data might be lost if there were any unsynced messages in the queue.
RabbitMQ is very flexible and you can create different v-hosts with different policy configurations. If you create separate v-hosts, you will have to handle replication/availability policies for them.
Recommendation: If you want high availability, you should make sure all your important v-hosts have the above settings in their policies, with one of the two ha-promote-on-shutdown options.
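The policy parameters listed above can be written down as a definition and applied to another v-host. The sketch below builds a request body in the shape used by the RabbitMQ management HTTP API policy endpoint (PUT /api/policies/{vhost}/{name}). The pattern ".*" (match every queue) and the helper name are assumptions, and the request itself is not sent.

```python
import json


def ha_policy_body(promote_on_shutdown="when-synced", pattern=".*"):
    """Return a JSON policy body mirroring the ha-all settings described above.

    `pattern` is an assumption: ".*" applies the policy to every queue
    in the v-host the policy is created in.
    """
    if promote_on_shutdown not in ("when-synced", "always"):
        raise ValueError("ha-promote-on-shutdown must be 'when-synced' or 'always'")
    body = {
        "pattern": pattern,
        "apply-to": "queues",
        "definition": {
            "ha-mode": "all",
            "ha-promote-on-shutdown": promote_on_shutdown,
            "ha-sync-mode": "automatic",
        },
    }
    return json.dumps(body, sort_keys=True)
```

Passing "always" instead of the default trades consistency for availability, as described in the list above.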
Recommendation: Familiarise yourself with the different queue types and policies to make sure that you are using the ones that are optimised for your use case.
In an IBM Cloud Messages for RabbitMQ deployment, you are provided one connection string per protocol (https, amqps, stomp-ssl, mqtts) that automatically routes connection requests to an available node.
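As a rough illustration of what an amqps connection string carries, the sketch below splits one into its parts using only the Python standard library. The host, port and credentials in the usage note are hypothetical placeholders. The fallback port 5671 is the conventional AMQP-over-TLS port and is only used if the string does not specify one.

```python
from urllib.parse import unquote, urlsplit


def parse_amqps_url(url):
    """Split an amqps:// connection string into host, port, credentials, v-host."""
    parts = urlsplit(url)
    if parts.scheme != "amqps":
        raise ValueError("expected an amqps:// connection string")
    path = parts.path
    return {
        "host": parts.hostname,
        "port": parts.port or 5671,
        "username": unquote(parts.username) if parts.username else None,
        "password": unquote(parts.password) if parts.password else None,
        # In AMQP URIs the v-host is percent-encoded, so "%2f" means "/".
        "vhost": unquote(path[1:]) if path else None,
    }
```

For example, parse_amqps_url("amqps://admin:s3cret@rabbit.example.com:31234/%2f") yields the host rabbit.example.com, port 31234 and the v-host "/".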
Recommendations on managing connections:
- If a node you are connected to becomes unavailable, you will need to recreate the connection. Therefore, make sure your application is designed to reconnect and retry in case of connection failures.
- RabbitMQ also has a resource called a channel. Channels are linked to a connection, but individual channels can fail independently of the Messages for RabbitMQ instance. Your application needs to handle channel failures as well, or you might experience issues communicating with RabbitMQ.
- Try to avoid creating and closing connections and channels because these are computationally expensive operations. It is better to try to re-use connections and channels (see here and here).
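The reconnect-and-retry advice above can be sketched as a small helper that retries a connection attempt with exponential backoff and jitter. This is a minimal sketch under stated assumptions: connect is any caller-supplied function that opens a connection with your client library, and retry_on should be set to the exception types that library raises on connection failure (the defaults here are Python built-ins). All names and default values are hypothetical.

```python
import random
import time


def connect_with_retry(connect, max_attempts=5, base_delay=0.5, max_delay=30.0,
                       retry_on=(ConnectionError, OSError), sleep=time.sleep):
    """Call `connect()` until it succeeds, backing off between attempts.

    The delay doubles after every failure, is capped at `max_delay`, and is
    scaled by a random factor so that many clients do not reconnect in step.
    The last failure is re-raised once `max_attempts` is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except retry_on:
            if attempt == max_attempts:
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.0))
```

Because the connection string routes each new attempt to an available node, a retry after a node failure would normally land on a healthy member. The returned connection and its channels can then be re-used rather than re-created per message.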
You may need to scale your deployment if you are running out of memory, disk space or available IOPS. You can change these for your IBM Cloud Messages for RabbitMQ deployment either manually or automatically based on certain triggers.
RabbitMQ defines a memory watermark — when memory usage reaches this point, the system will stop receiving new messages. In other words "publishing" will be blocked until memory usage goes down (usually by messages being consumed). See more information here.
In an IBM Cloud Messages for RabbitMQ deployment, the memory watermark is set to 40% by default and cannot be changed. This number may look low, but the watermark applies only to the memory used by the RabbitMQ server itself, and the Erlang garbage collector can consume up to double the amount used by the server. The deployment therefore has to leave sufficient headroom for all memory-using processes.
Recommendation: Monitor your RabbitMQ deployment to make sure memory usage stays below the watermark, and provision more memory if you are running close to it. Another way of reducing memory usage is to consume messages faster (i.e., be able to scale up the number of processes that are consuming the messages from the queues).
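To make the 40% figure concrete, the sketch below computes the watermark for a given amount of provisioned memory and flags usage that is getting close to it. It assumes the watermark applies to the memory provisioned for each member. The function names and the 80% warning ratio are hypothetical.

```python
WATERMARK_PERCENT = 40  # default stated above for Messages for RabbitMQ


def watermark_bytes(provisioned_bytes):
    """Memory level at which publishing is blocked, in bytes."""
    return provisioned_bytes * WATERMARK_PERCENT // 100


def near_watermark(used_bytes, provisioned_bytes, warn_ratio=0.8):
    """True when server memory use has reached `warn_ratio` of the watermark."""
    return used_bytes >= watermark_bytes(provisioned_bytes) * warn_ratio
```

With 10,000 MB provisioned, for instance, the watermark sits at 4,000 MB, and the default warning ratio would flag usage from 3,200 MB upwards.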
See this page for more information on monitoring memory usage. Any RAM provisioned for your deployment remains for your future needs or until you scale down your deployment manually.
Disk space and IOPS
The number of Input-Output Operations per second (IOPS) is limited by the type of storage volume. Storage volumes for IBM Cloud Messages for RabbitMQ deployments are provisioned on Block Storage Endurance Volumes in the 10 IOPS per GB tier (see here). It's possible for very busy instances to exceed the IOPS for the disk size, and increasing disk can alleviate a performance bottleneck. Note that disk cannot be scaled down.
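Since the volumes are in the 10 IOPS per GB tier, the relationship between disk size and available IOPS is simple arithmetic. The sketch below shows it in both directions. The function names are hypothetical, and any minimum or maximum disk sizes imposed by the service are not modelled.

```python
import math

IOPS_PER_GB = 10  # storage tier stated above


def iops_for_disk(disk_gb):
    """IOPS available for a volume of `disk_gb` gigabytes."""
    return disk_gb * IOPS_PER_GB


def disk_gb_for_iops(target_iops):
    """Smallest whole-GB volume that provides at least `target_iops`."""
    return math.ceil(target_iops / IOPS_PER_GB)
```

A 20 GB volume therefore provides 200 IOPS, and a workload that needs 2,505 IOPS needs at least 251 GB of disk. Because disk cannot be scaled down, it is worth estimating the target before scaling up.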
Recommendation: Scaling can be done manually or automatically (based on certain threshold triggers). You should familiarise yourself with how scaling works (see here) to make appropriate decisions about whether your workloads should be scaled manually or automatically.
Recommendation: We recommend that you monitor your disk usage and scale accordingly to avoid "disk full" failures. You should use disk auto-scaling based on IO utilization or disk space remaining.
Processing speed is dependent on CPU availability. You need to allocate sufficient CPU so that queues and other RabbitMQ processes are not waiting for processing time.
Recommendation: Check the Throttling metric in Sysdig to monitor your CPU usage. If your CPU usage is very high (saturated) and your message processing is slowing down, you probably need more CPU to handle your messaging throughput. Alternatively, you may want to consider using dedicated CPU cores. There is further information on scaling CPU here.
In the context of Messages for RabbitMQ, service availability and/or downtime can be affected for several reasons:
- Loss of connectivity: As with any cloud deployment, network events can affect access to the service.
- Node failure: In the case of a RabbitMQ node failure, the cluster will redirect requests to another available node. This process is seamless for new connections, but open connections will not get redirected to the new node.
- Planned maintenance: IBM performs regular (weekly) security and other updates to keep infrastructure safe and compliant. When performing maintenance on a node, existing connections to that node may fail.
Recommendation: You should design your applications to handle a temporary loss in connectivity to your deployment or to IBM Cloud. See this article for more information.
Messages for RabbitMQ has the following plugins enabled:
Recommendation: If you need other plugins, please consult with us before deploying an IBM Cloud Messages for RabbitMQ instance.
Backup and restore
In the context of IBM Cloud Messages for RabbitMQ, only the system configuration data is backed up. Messages in the queues are not included.
Automatic backups are performed daily and kept with a simple retention schedule of 30 days. You can trigger on-demand backups from the UI, the CLI or the API. Restoring is done into a new IBM Cloud Messages for RabbitMQ deployment.
For more information on backups and restoring see here.
Logging and monitoring
You can monitor your instance metrics (and add alerts for specific thresholds) using IBM's monitoring service. See here for more details.
Recommendation: Monitor your disk usage, IOPS, RAM and CPU, and create high-usage alerts that give you enough time to scale your deployment if required.
You can also monitor your instance logs using LogDNA. See here for more details.
Recommendation: Set log alerts for log entries that are "Critical" so that you can react to unusual events, such as a high memory watermark alarm or high levels of CPU usage.
Minor version upgrades are done as part of the regular IBM maintenance cycles. See the note above about availability.
Major version upgrades have to be done manually by clients. For more information see here.