Changes to cluster error recovery on servers on Multiplatforms
The queue manager reruns operations that caused problems, until the problems are resolved. If, after five days, the problems are not resolved, the queue manager shuts down to prevent the cache becoming more out of date. As the cache becomes more out of date, it causes a greater number of problems.
Every aspect of cluster management is handled for a queue manager by the local repository manager process, amqrrmfa. The process runs on all queue managers, even if there are no cluster definitions.
In IBM® MQ, rather than stopping the repository manager and continuing without it, the queue manager reruns failed operations. If the queue manager detects a problem with the repository manager, it follows one of two courses of action.
- If the error does not compromise the operation of the queue manager, the queue manager writes a message to the error log. It reruns the failed operation every 10 minutes until the operation succeeds. By default, you have five days to deal with the error. If the error is not resolved in that time, the queue manager writes a message to the error log and shuts down. You can postpone the five-day shutdown.
- If the error compromises the operation of the queue manager, the queue manager writes a message to the error log, and shuts down immediately.
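The default timing described above can be worked through with a short sketch. It assumes only the two figures stated in the text (a retry every 10 minutes and a five-day limit). The function name `recovery_window` is hypothetical and is not part of IBM MQ, and the actual shutdown time on a given queue manager may differ from this simple projection.

```python
from datetime import datetime, timedelta
from typing import Tuple

RETRY_INTERVAL = timedelta(minutes=10)  # retry cadence stated in the text
SHUTDOWN_AFTER = timedelta(days=5)      # default limit stated in the text


def recovery_window(first_failure: datetime) -> Tuple[datetime, int]:
    """Project the default shutdown time and the number of retry slots before it.

    Hypothetical helper for planning only; it does not query the queue manager.
    """
    deadline = first_failure + SHUTDOWN_AFTER
    retries = SHUTDOWN_AFTER // RETRY_INTERVAL  # whole 10-minute slots in five days
    return deadline, retries


if __name__ == "__main__":
    deadline, retries = recovery_window(datetime(2024, 1, 1, 9, 0))
    print(deadline, retries)
```

With these defaults there are 720 ten-minute retry slots between the first failure and the projected shutdown.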
An error that compromises the operation of the queue manager is an error that the queue manager has not been able to diagnose, or an error that might have unforeseeable consequences. This type of error often results in the queue manager writing an FFST file. Errors that compromise the operation of the queue manager might be caused by a bug in IBM MQ, or by an administrator, or a program, doing something unexpected, such as ending an IBM MQ process.
The point of the change in error recovery behavior is to limit the time the queue manager continues to run with a growing number of inconsistent cluster definitions. As the number of inconsistencies in cluster definitions grows, the chance of abnormal application behavior grows with it.
The default choice of shutting down the queue manager after five days is a compromise between limiting the number of inconsistencies and keeping the queue manager available until the problems are detected and resolved.
You can postpone the shutdown indefinitely while you fix the problem or wait for a planned queue manager shutdown. The five-day grace period keeps the queue manager running through a long weekend, giving you time to react to any problems or to prolong the time before restarting the queue manager.
Corrective actions
You have two ways to deal with cluster error recovery problems. The first is to monitor and fix the problem; the second is to monitor and postpone fixing the problem.
- Monitor the queue manager error log for the error messages AMQ9448 and AMQ5008, and fix the problem.
  - AMQ9448 indicates that the repository manager has returned an error after running a command. This error marks the start of trying the command again every 10 minutes, and eventually stopping the queue manager after five days, unless you postpone the shutdown.
  - AMQ5008 indicates that the queue manager was stopped because an IBM MQ process is missing. AMQ5008 results from the repository manager stopping after five days. If the repository manager stops, the queue manager stops.
- Monitor the queue manager error log for the error message AMQ9448, and postpone fixing the problem.
  - If you disable getting messages from SYSTEM.CLUSTER.COMMAND.QUEUE, the repository manager stops trying to run commands, and continues indefinitely without processing any work. However, any handles that the repository manager holds to queues are released. Because the repository manager does not stop, the queue manager is not stopped after five days.
  - Run an MQSC command to disable getting messages from SYSTEM.CLUSTER.COMMAND.QUEUE:
      ALTER QLOCAL(SYSTEM.CLUSTER.COMMAND.QUEUE) GET(DISABLED)
  - To resume receiving messages from SYSTEM.CLUSTER.COMMAND.QUEUE, run an MQSC command:
      ALTER QLOCAL(SYSTEM.CLUSTER.COMMAND.QUEUE) GET(ENABLED)
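The monitoring step in both corrective actions could be automated along the following lines. This is a minimal sketch, not an IBM-supplied tool: the function name `find_cluster_recovery_messages` is hypothetical, and the pattern assumes that a message identifier appears in the log either bare (AMQ9448) or followed by a single severity letter, which may vary by IBM MQ version.

```python
import re
from typing import Iterable, List, Tuple

# Match AMQ9448 or AMQ5008, optionally followed by one severity letter (assumption).
PATTERN = re.compile(r"\b(AMQ9448|AMQ5008)[A-Z]?\b")


def find_cluster_recovery_messages(lines: Iterable[str]) -> List[Tuple[int, str, str]]:
    """Return (line number, message id, line text) for each matching log line."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = PATTERN.search(line)
        if match:
            hits.append((number, match.group(1), line.rstrip()))
    return hits


def scan_log_file(path: str) -> List[Tuple[int, str, str]]:
    """Scan one error log file; the caller supplies the path."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return find_cluster_recovery_messages(handle)
```

For example, you might call `scan_log_file` on the current queue manager error log (on Linux this is typically `AMQERR01.LOG` under the queue manager's `errors` directory; treat the exact location as an assumption for your installation) and raise an alert when the first AMQ9448 appears, because that message marks the start of the five-day period.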
Special consideration
Stopping amqrrmfa in IBM MQ causes the queue manager to stop, because it is regarded as a queue manager failure. You must not stop the amqrrmfa process unless you set the queue manager tuning parameter, TolerateRepositoryFailure.