Clustering: Availability, multi-instance, and disaster recovery

This topic provides guidance for planning and administering IBM® MQ clusters. This information is a guide based on testing and feedback from customers.

IBM MQ Clustering itself is not a High Availability solution, but in some circumstances it can be used to improve the availability of services using IBM MQ, for example by having multiple instances of a queue on different queue managers. This section gives guidance on ensuring that the IBM MQ infrastructure is as highly available as possible so that it can be used in such an architecture.

Availability of cluster resources
The usual recommendation to maintain two full repositories exists so that the loss of one is not critical to the smooth running of the cluster. Even if both become unavailable, there is a 60 day grace period during which partial repositories can continue to work from the knowledge they already hold, although new or not previously accessed resources (queues, for example) are not available in this event.
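As a minimal sketch, assuming a cluster named INVENTORY with full repositories on queue managers FR1 and FR2 (all names and connection names here are illustrative), the two full repositories could be configured with MQSC commands along these lines:

    * On FR1: make this queue manager a full repository for the cluster
    ALTER QMGR REPOS(INVENTORY)
    * Cluster-receiver channel advertising how other queue managers reach FR1
    DEFINE CHANNEL(INVENTORY.FR1) CHLTYPE(CLUSRCVR) TRPTYPE(TCP) CONNAME('fr1.example.com(1414)') CLUSTER(INVENTORY)
    * Cluster-sender channel pointing at the other full repository, FR2
    DEFINE CHANNEL(INVENTORY.FR2) CHLTYPE(CLUSSDR) TRPTYPE(TCP) CONNAME('fr2.example.com(1414)') CLUSTER(INVENTORY)

The mirror-image definitions are made on FR2, and each partial repository defines its own cluster-receiver channel plus a cluster-sender channel to one of the full repositories.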
Using clusters to improve application availability
A cluster can help in designing highly available applications (for example, a request/response type server application) by using multiple instances of the queue and application. If needed, priority attributes can give preference to the 'live' application unless a queue manager or channel, for example, becomes unavailable. This is powerful for switching over quickly to continue processing new messages when a problem occurs.
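As a sketch of this pattern (queue, cluster, and priority values are illustrative), the same request queue might be defined on two queue managers in the cluster, with the cluster workload priority attribute preferring one instance while its channel is available:

    * On the preferred queue manager
    DEFINE QLOCAL(REQUEST.Q) CLUSTER(INVENTORY) DEFBIND(NOTFIXED) CLWLPRTY(9)
    * On the standby queue manager: same queue name, lower priority
    DEFINE QLOCAL(REQUEST.Q) CLUSTER(INVENTORY) DEFBIND(NOTFIXED) CLWLPRTY(1)

While both instances are reachable, the workload algorithm routes messages to the higher priority instance; if its channel becomes unavailable, new messages flow to the lower priority instance instead.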
However, messages that have already been delivered to a particular queue manager in a cluster are held only on that queue instance, and are not available for processing until that queue manager is recovered. For this reason, for true high availability of the message data you might want to consider other technologies such as multi-instance queue managers.
Multi-instance queue managers
Software High Availability (multi-instance) is the best built-in offering for keeping your existing messages available. See Using IBM MQ with high availability configurations, Create a multi-instance queue manager, and the following section for more information. Any queue manager in a cluster may be made highly available using this technique, as long as all queue managers in the cluster are running at least IBM WebSphere® MQ 7.0.1. If any queue managers in the cluster are at previous levels, they might lose connectivity with the multi-instance queue managers if they fail over to a secondary IP.
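The following is a sketch of that setup rather than a complete procedure; the queue manager name, host roles, and shared storage paths are assumptions, and the shared directories must be on networked storage accessible to both servers:

    # On server A: create the queue manager with its data and logs on shared storage
    crtmqm -ld /shared/logs -md /shared/qmgrs QM1

    # On server A: display the addmqinf command needed to define this instance elsewhere
    dspmqinf -o command QM1

    # On server B: define the second instance, using the exact values printed by dspmqinf, for example:
    addmqinf -s QueueManager -v Name=QM1 -v Directory=QM1 -v Prefix=/var/mqm -v DataPath=/shared/qmgrs/QM1

    # On both servers: start permitting a standby; the first strmqm becomes active, the second becomes the standby
    strmqm -x QM1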
As discussed previously in this topic, as long as two full repositories are configured they are, almost by their nature, highly available. If required, IBM MQ software High Availability (multi-instance queue managers) can be used for full repositories, but there is no strong reason to do so; in fact, for temporary outages these methods might add performance cost during the failover. Using software HA instead of running two full repositories is discouraged because, in the event of a single channel outage for example, it would not necessarily fail over, but might leave partial repositories unable to query for cluster resources.
Disaster recovery
Disaster recovery, for example recovering from a situation where the disks storing a queue manager's data become corrupt, is difficult to do well; IBM MQ can help, but it cannot do it automatically. The only 'true' disaster recovery option in IBM MQ (excluding any operating system or other underlying replication technologies) is restoration from a backup. There are some cluster-specific points to consider in these situations:
  • Take care when testing disaster recovery scenarios. For example, if testing the operation of backup queue managers, be careful when bringing them online in the same network, because it is possible to accidentally join the live cluster and start 'stealing' messages by hosting queues with the same names as those on the live cluster queue managers.
  • Disaster recovery testing must not interfere with a running live cluster. Techniques to avoid interference include:
    • Complete network separation or separation at the firewall level.
    • Not starting channel initiators or, on z/OS®, the channel initiator (CHINIT) address space.
    • Not issuing live TLS certificates to the disaster recovery system until, or unless, an actual disaster recovery scenario occurs.
  • When restoring a backup of a queue manager in the cluster, it is possible that the backup is out of step with the rest of the cluster. The REFRESH CLUSTER command can resolve updates and synchronize with the cluster, but it must be used only as a last resort; see Clustering: Using REFRESH CLUSTER best practices. Review any in-house process documentation and the IBM MQ documentation to see whether a simple step was missed before resorting to the command (a sketch of the command is shown after this list).
  • As with any recovery, applications must deal with replay and loss of data. It must be decided whether to clear the queues down to a known state, or whether there is enough information elsewhere to manage replays.
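If, after those checks, a refresh really is needed on the restored queue manager, a sketch of the command (cluster name illustrative) issued in runmqsc is:

    * Discard locally held cluster information and rebuild it from the full repositories
    REFRESH CLUSTER(INVENTORY) REPOS(NO)

REPOS(YES) additionally refreshes the locally held information about the full repository queue managers, and cannot be used on a queue manager that is itself a full repository.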