Changes to cluster error recovery (on servers other than z/OS)

From IBM® WebSphere® MQ 7.1 onwards, the queue manager reruns operations that caused problems, until the problems are resolved. If, after five days, the problems are not resolved, the queue manager shuts down to prevent the cache becoming more out of date.

Before IBM WebSphere MQ 7.1, if a queue manager detected a problem with the local repository manager managing a cluster, it updated the error log. In some cases, it then stopped managing clusters. The queue manager continued to exchange application messages with the cluster, relying on its increasingly out-of-date cache of cluster definitions. As the cache became more out of date, it caused a greater number of problems. The changed behavior regarding cluster errors in 7.1 or later does not apply to z/OS®.

Every aspect of cluster management is handled for a queue manager by the local repository manager process, amqrrmfa. The process runs on all queue managers, even if there are no cluster definitions.
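
If the queue manager belongs to a cluster, one way to see the cluster information that the repository manager maintains is to display the locally cached view of the cluster queue managers. For example, the following MQSC command, run in a runmqsc session against the queue manager, lists every cluster queue manager known to the local repository:

      DISPLAY CLUSQMGR(*)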

Before IBM WebSphere MQ 7.1, if the queue manager detected a problem in the local repository manager, it stopped the repository manager after a short interval. The queue manager kept running, processing application messages and requests to open queues, and publish or subscribe to topics.

With the repository manager stopped, the cache of cluster definitions available to the queue manager became more out of date. Over time, messages were routed to the wrong destination, and applications failed when they attempted to open cluster queues or publication topics that had not been propagated to the local queue manager.

Unless an administrator checked the error log for repository messages, the problems with the cluster configuration might go unnoticed. If the failure went unrecognized for longer still, and the queue manager did not renew its cluster membership, even more problems occurred. The resulting instability affected all queue managers in the cluster.

From IBM WebSphere MQ 7.1 onwards, IBM MQ takes a different approach to cluster error handling. Rather than stop the repository manager and keep going without it, the repository manager reruns failed operations. If the queue manager detects a problem with the repository manager, it follows one of two courses of action.

  1. If the error does not compromise the operation of the queue manager, the queue manager writes a message to the error log. It reruns the failed operation every 10 minutes until the operation succeeds. By default, you have five days to deal with the error; if the error is not resolved in that time, the queue manager writes a message to the error log and shuts down. You can postpone the five-day shutdown.
  2. If the error compromises the operation of the queue manager, the queue manager writes a message to the error log, and shuts down immediately.

An error that compromises the operation of the queue manager is an error that the queue manager has not been able to diagnose, or an error that might have unforeseeable consequences. This type of error often results in the queue manager writing an FFST file. Errors that compromise the operation of the queue manager might be caused by a bug in IBM MQ, or by an administrator, or a program, doing something unexpected, such as ending an IBM MQ process.

The point of the change in error recovery behavior is to limit the time the queue manager continues to run with a growing number of inconsistent cluster definitions. As the number of inconsistencies in cluster definitions grows, the chance of abnormal application behavior grows with it.

The default choice of shutting down the queue manager after five days is a compromise between limiting the number of inconsistencies and keeping the queue manager available until the problems are detected and resolved.

You can postpone the shutdown indefinitely while you fix the problem, or while you wait for a planned queue manager shutdown. The five-day stay keeps the queue manager running through a long weekend, giving you time to react to any problems or to prolong the time before restarting the queue manager.

Corrective actions

You have a choice of actions to deal with the problems of cluster error recovery. The first choice is to monitor and fix the problem, the second is to monitor and postpone fixing the problem, and the third is to revert the queue manager to the cluster error recovery behavior of releases before IBM WebSphere MQ 7.1.

  1. Monitor the queue manager error log for the error messages AMQ9448 and AMQ5008, and fix the problem.
    • AMQ9448 indicates that the repository manager has returned an error after running a command. This error marks the start of trying the command again every 10 minutes, and eventually stopping the queue manager after five days, unless you postpone the shutdown.
    • AMQ5008 indicates that the queue manager was stopped because an IBM MQ process is missing. AMQ5008 results from the repository manager stopping after five days. If the repository manager stops, the queue manager stops.
  2. Monitor the queue manager error log for the error message AMQ9448, and postpone fixing the problem.
    • If you disable getting messages from SYSTEM.CLUSTER.COMMAND.QUEUE, the repository manager stops trying to run commands, and continues indefinitely without processing any work. However, any handles that the repository manager holds to queues are released. Because the repository manager does not stop, the queue manager is not stopped after five days.
    • Run an MQSC command to disable getting messages from SYSTEM.CLUSTER.COMMAND.QUEUE (see the worked example after this list):
      ALTER QLOCAL(SYSTEM.CLUSTER.COMMAND.QUEUE) GET(DISABLED)
    • To resume receiving messages from SYSTEM.CLUSTER.COMMAND.QUEUE, run an MQSC command:
      ALTER QLOCAL(SYSTEM.CLUSTER.COMMAND.QUEUE) GET(ENABLED)
  3. Revert the queue manager to the same cluster error recovery behavior as before IBM WebSphere MQ 7.1.
    • You can set a queue manager tuning parameter to keep the queue manager running if the repository manager stops.
    • The tuning parameter is TolerateRepositoryFailure, in the TuningParameters stanza of the qm.ini file. To prevent the queue manager from stopping if the repository manager stops, set TolerateRepositoryFailure to TRUE; see Figure 1.
    • Restart the queue manager to enable the TolerateRepositoryFailure option.
    • If a cluster error prevents the repository manager from starting successfully, and hence prevents the queue manager from starting, set TolerateRepositoryFailure to TRUE to start the queue manager without the repository manager.
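
The following worked example applies the second corrective action; it assumes a queue manager named QM1 (an illustrative name; substitute your own queue manager name). Issue the ALTER command from a runmqsc session, and confirm the change by displaying the GET attribute of the queue:

      runmqsc QM1
      ALTER QLOCAL(SYSTEM.CLUSTER.COMMAND.QUEUE) GET(DISABLED)
      DISPLAY QLOCAL(SYSTEM.CLUSTER.COMMAND.QUEUE) GET
      END

When you are ready to let the repository manager resume processing commands, run the same sequence with GET(ENABLED).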

Special consideration

Before IBM WebSphere MQ 7.1, some administrators managing queue managers that were not part of a cluster stopped the amqrrmfa process. Stopping amqrrmfa did not affect the queue manager.

Stopping amqrrmfa in IBM WebSphere MQ 7.1 or later causes the queue manager to stop, because the missing process is regarded as a queue manager failure. You must not stop the amqrrmfa process in 7.1 or later, unless you set the queue manager tuning parameter TolerateRepositoryFailure to TRUE.

Example

Figure 1. Set TolerateRepositoryFailure to TRUE in qm.ini

TuningParameters:
        TolerateRepositoryFailure=TRUE