Configuring high availability, recovery and restart

You can make your applications highly available by maintaining queue availability if a queue manager fails, and by recovering messages after server or storage failure.

About this task

[z/OS]On z/OS®, high availability is built into the platform. See Shared queues and queue sharing groups.

[UNIX, Linux, Windows, IBM i]On Multiplatforms, you can improve client application availability by using client reconnection to switch a client automatically between a group of queue managers, or to the new active instance of a multi-instance queue manager after a queue manager failure. Automatic client reconnect is not supported by IBM® MQ classes for Java. A multi-instance queue manager is configured to run as a single queue manager on multiple servers. You deploy server applications to this queue manager. If the server running the active instance fails, execution is automatically switched to a standby instance of the same queue manager on a different server. If you configure server applications to run as queue manager services, they are restarted when a standby instance becomes the actively running queue manager instance.
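The reconnection behavior described above, where a client switches between a group of queue managers, can be pictured with a small model. The following is a minimal sketch only: it does not use the IBM MQ client API, and the names `connect_to_first_available`, `make_fake_connect`, and `QueueManagerUnavailable` are hypothetical helpers introduced for illustration.

```python
"""Illustrative model of client reconnection; this is not the IBM MQ client API."""


class QueueManagerUnavailable(Exception):
    """Raised by the hypothetical connect function when a queue manager is down."""


def connect_to_first_available(queue_managers, connect):
    """Try each queue manager in order; return (name, connection) for the first success."""
    failures = []
    for name in queue_managers:
        try:
            return name, connect(name)
        except QueueManagerUnavailable as exc:
            failures.append(f"{name}: {exc}")
    raise ConnectionError("no queue manager available: " + "; ".join(failures))


def make_fake_connect(available):
    """Build a stand-in connect function backed by a set of running queue managers."""
    def connect(name):
        if name not in available:
            raise QueueManagerUnavailable("not running")
        return f"connection-to-{name}"
    return connect


if __name__ == "__main__":
    group = ["QM1", "QM2", "QM3"]          # hypothetical queue manager names
    running = {"QM1", "QM2", "QM3"}
    connect = make_fake_connect(running)
    print(connect_to_first_available(group, connect))
    running.discard("QM1")                  # simulate a queue manager failure
    print(connect_to_first_available(group, connect))
```

In a real deployment the reconnection is performed by the IBM MQ client itself; the sketch only shows the ordering idea of moving to the next candidate when the current one is unavailable.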

Another way to increase server application availability on Multiplatforms is to deploy server applications to multiple computers in a queue manager cluster. From IBM WebSphere® MQ 7.1 onwards, cluster error recovery reruns operations that caused problems until the problems are resolved. See Changes to cluster error recovery on servers other than z/OS. You can also configure IBM MQ for Multiplatforms as part of a platform-specific clustering solution such as:
  • Microsoft Cluster Server
  • [IBM i]HA clusters on IBM i
  • [AIX][Linux]PowerHA® for AIX® (formerly HACMP on AIX) and other UNIX and Linux® clustering solutions

[Linux]On Linux systems, you can configure replicated data queue managers (RDQMs) to implement high availability or disaster recovery solutions. For high availability, instances of the same queue manager are configured on each node in a group of three Linux servers. One of the three instances is the active instance. Data from the active queue manager is synchronously replicated to the other two instances, so that either of them can take over if the active instance fails. For disaster recovery, a queue manager runs on a primary node at one site, with a secondary instance of that queue manager located on a recovery node at a different site. Data is replicated between the primary instance and the secondary instance. If the primary node is lost, the secondary instance can be made into the primary instance and started.

[IBM Cloud Pak for Integration]Native HA is a high availability solution aimed at containers. Native HA uses log replication to keep three instances of a queue manager, running on different nodes, up to date. One instance is active at any one time and processes messages. The active queue manager sends its log updates to the other two instances to keep them updated. If the active instance fails, one of the replica instances automatically takes over the active role.
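The log replication idea described above can be sketched as a toy model. This is an illustration under stated assumptions, not Native HA code: the `Instance` and `ReplicatedGroup` classes are hypothetical, and the rule used here to pick the new active instance (the running replica with the longest log) is an assumption made for the sketch, not a description of how the product chooses.

```python
"""Toy model of log replication across three instances; this is not IBM MQ code."""


class Instance:
    """One queue manager instance with its own copy of the log."""

    def __init__(self, name):
        self.name = name
        self.log = []
        self.running = True


class ReplicatedGroup:
    """A group of instances in which exactly one is active at a time."""

    def __init__(self, names):
        self.instances = [Instance(n) for n in names]
        self.active = self.instances[0]

    def write(self, record):
        """Append a record on the active instance and copy it to every running replica."""
        if not self.active.running:
            raise RuntimeError("no active instance")
        for inst in self.instances:
            if inst.running:
                inst.log.append(record)

    def fail_active(self):
        """Stop the active instance and promote a running replica.

        Assumption for this sketch: the replica with the longest log is promoted.
        """
        self.active.running = False
        candidates = [i for i in self.instances if i.running]
        if not candidates:
            raise RuntimeError("no replica available to take over")
        self.active = max(candidates, key=lambda i: len(i.log))
        return self.active


if __name__ == "__main__":
    group = ReplicatedGroup(["node-a", "node-b", "node-c"])  # hypothetical node names
    group.write("put m1")
    promoted = group.fail_active()
    print(promoted.name, promoted.log)
```

Because every running replica receives each log record before the failure, the promoted instance already holds the data it needs to continue processing.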

[MQ Appliance]Another option for a high availability or disaster recovery solution is to deploy a pair of IBM MQ appliances. See High Availability and Disaster Recovery in the IBM MQ Appliance documentation.

A messaging system ensures that messages entered into the system are delivered to their destination. IBM MQ can trace the route of a message as it moves from one queue manager to another using the dspmqrte command. If a system fails, messages can be recovered in various ways depending on the type of failure, and the way a system is configured. IBM MQ maintains recovery logs of the activities of the queue managers that handle the receipt, transmission, and delivery of messages. It uses these logs for three types of recovery:
  1. Restart recovery, when you stop IBM MQ in a planned way.
  2. Failure recovery, when a failure stops IBM MQ.
  3. Media recovery, to restore damaged objects.
In all cases, the recovery restores the queue manager to the state it was in when the queue manager stopped, except that any in-flight transactions are rolled back, removing from the queues any updates that were in-flight at the time the queue manager stopped. Recovery restores all persistent messages; nonpersistent messages might be lost during the process.
CAUTION:
You cannot move recovery logs to a different operating system.
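The recovery outcome described above (persistent messages restored, in-flight transactions rolled back, nonpersistent messages possibly lost) can be illustrated with a small replay model. This is a sketch under assumptions, not the IBM MQ log format: the record layout and the `recover` function are hypothetical, and the model assumes that nonpersistent messages are never written to the log.

```python
"""Toy model of log-based recovery; this is not the IBM MQ log format."""


def recover(log):
    """Rebuild queue contents from a list of log records.

    Each record is one of these hypothetical tuples:
      ("put", txn_id, queue, message)   persistent put inside a transaction
      ("get", txn_id, queue, message)   persistent get inside a transaction
      ("commit", txn_id)                the transaction completed
    Nonpersistent messages are assumed never to be logged, so they cannot be restored.
    """
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    queues = {}
    for rec in log:
        if rec[0] == "commit" or rec[1] not in committed:
            continue  # skip commit markers; uncommitted (in-flight) work is rolled back
        op, _txn, queue, message = rec
        contents = queues.setdefault(queue, [])
        if op == "put":
            contents.append(message)
        elif op == "get":
            contents.remove(message)
    return queues


if __name__ == "__main__":
    log = [
        ("put", "t1", "ORDERS", "m1"),
        ("commit", "t1"),
        ("put", "t2", "ORDERS", "m2"),   # t2 never commits: in flight at the failure
        ("put", "t3", "ORDERS", "m3"),
        ("get", "t3", "ORDERS", "m1"),
        ("commit", "t3"),
    ]
    print(recover(log))
```

In this example the put from transaction `t2` does not appear after recovery because `t2` was still in flight when the queue manager stopped, while the committed work of `t1` and `t3` is reapplied in log order.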