Monitoring the WMQ Cluster Repository Process
Anthony_Beardsmore
Part of the 'Clustering FAQ' series, looking at how to keep an eye on the 'health' of your cluster tasks. (Note – this information is intended as guidance for use with current WMQ versions 7.1 and 7.5. While many of the details are the same for all versions of WMQ Clusters, there will be some differences, so use caution if referring to this document when working with earlier product versions.)
What is the repository process?
All queue managers participating in a cluster maintain a local cache of information about the cluster, whether they are full repositories with a complete picture of the cluster, or partial repositories which keep just a working subset. See previous blog posts and the Infocenter for more information about the contents of this cache. In this post I’m going to be discussing the repository manager process, which is the component of the queue manager responsible for keeping the cache up to date and sharing information with other queue managers in the cluster.
All queue managers, even those not using clustering, will have a repository manager process, though if they are doing no cluster work this process should be almost entirely idle and have a very small footprint. On platforms other than z/OS, the repository manager is a standalone process connected to the queue manager, called amqrrmfa. On WMQ for z/OS, the repository manager runs as a task within the CHIN address space. Therefore, when the chinit is not running on z/OS, the cluster cache will not be kept up to date.
If you have a significant cluster deployment, alongside other monitoring of your system it is a very good idea to monitor the health of the repository manager – problems here could end up causing application issues ranging from not getting the workload balancing you expect to complete non-delivery of messages. There are a number of ways to look at what the repository manager is doing, and I’ll go through some of the main ones here.
As a very basic starting point, you may wish to use system tools to watch the CPU usage of the repository task. The repository usually spends most of its time idle, with the following exceptions.
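As a rough illustration of the 'system tools' approach on a Unix-style distributed platform, the sketch below parses `ps` output and picks out the amqrrmfa process. It assumes a `ps` that accepts the POSIX `-eo pid,pcpu,comm` options; the function names are my own and not part of any MQ tooling. Note also that on some systems the `pcpu` figure is an average over the process lifetime rather than an instantaneous reading, so treat it as a trend indicator only.

```python
import subprocess


def parse_ps(ps_output, name="amqrrmfa"):
    """Return (pid, cpu_percent) tuples for processes whose command is `name`."""
    rows = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue
        pid, pcpu, comm = fields
        # comm may be reported as a full path; compare the basename only
        if comm.strip().split("/")[-1] == name:
            rows.append((int(pid), float(pcpu)))
    return rows


def sample_repository_cpu():
    """Run ps once and return the CPU figures for any repository manager processes."""
    out = subprocess.run(
        ["ps", "-eo", "pid,pcpu,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_ps(out)
```

Calling `sample_repository_cpu()` from a scheduler every minute or so and recording the result gives a baseline, against which a sustained rise is easy to spot.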
The performance of the repository task itself is not normally something that requires much consideration, even in substantial cluster deployments. One exception to this, which can be very visible to end users, is when a repository task is so overloaded that cluster queries on behalf of applications cannot be processed in a timely manner. This will be seen as an MQRC
On distributed platforms, the cluster cache resides in one or more blocks of shared memory referenced by the repository manager process, the queue manager execution controller, and any other MQ processes which need to query information about the cluster. Typically this will be by far the largest memory usage associated with amqrrmfa, though of course there will also be a certain amount of local storage use, as for any application. Similarly, on WMQ for z/OS the cluster cache storage is mapped into both the CHIN and MSTR address spaces. Note that today this is 31-bit addressed 'below the bar' storage in both cases, so a large local cache will reduce the memory available for other uses, such as buffer pools in the queue manager address space.
The ‘System Resources’ section of previous post http
The Repository Queues
All of the work done by the repository manager affects a small set of local queues, and by monitoring these queues we can learn a lot about what the repository manager is currently doing. It is a good idea to configure queue statistics on some of these queues if you are concerned/curious about the activities and performance of the repository process. For example, to gather statistics about the repository command queue at 10 minute intervals you could issue:
ALTER QLOCAL(SYSTEM.CLUSTER.COMMAND.QUEUE) STATQ(ON)
ALTER QMGR STATINT(600)
You will need to restart the queue manager to release handles on the queues before these commands take effect. The amqsmon sample (see Infocenter) can be used as a simple means to format the statistics messages generated; your preferred monitoring tools may have options to process these in a variety of more advanced ways.
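Alongside statistics messages, a simple poll of the current depth of the SYSTEM.CLUSTER.* queues can show whether work is backing up. The sketch below is one possible way to do this on a distributed platform: it pipes a DISPLAY QSTATUS command into runmqsc and parses the KEYWORD(value) pairs from the response. The function names and the depth threshold of 100 are illustrative assumptions, not recommended values, and the parser assumes the usual runmqsc response layout.

```python
import re
import subprocess

# Matches MQSC response attributes such as CURDEPTH(12) or QUEUE(NAME)
ATTR = re.compile(r"(\w+)\(([^)]*)\)")


def parse_qstatus(mqsc_output):
    """Map queue name -> {attribute: value} from DISPLAY QSTATUS output."""
    queues = {}
    current = None
    for line in mqsc_output.splitlines():
        for key, value in ATTR.findall(line):
            if key == "QUEUE":
                # A QUEUE(...) attribute starts a new response record
                current = queues.setdefault(value.strip(), {})
            elif current is not None:
                current[key] = value.strip()
    return queues


def deep_queues(queues, threshold=100):
    """Names of queues whose CURDEPTH exceeds threshold (an arbitrary example)."""
    return sorted(
        name for name, attrs in queues.items()
        if int(attrs.get("CURDEPTH", 0)) > threshold
    )


def query_cluster_queues(qmgr):
    """Ask runmqsc for the status of the SYSTEM.CLUSTER.* queues on qmgr."""
    cmd = "DISPLAY QSTATUS(SYSTEM.CLUSTER.*) TYPE(QUEUE) CURDEPTH IPPROCS\n"
    result = subprocess.run(
        ["runmqsc", qmgr], input=cmd, capture_output=True, text=True
    )
    return parse_qstatus(result.stdout)
```

For example, `deep_queues(query_cluster_queues("QM1"))` would list any cluster queues deeper than the threshold on a queue manager called QM1 (a made-up name). A depth that stays high on the command queue, rather than spiking briefly, is the pattern worth investigating.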
Here are brief descriptions of the important queues to consider:
Error Handling and Troubleshooting
The cluster repository process can be interrupted for a number of reasons: local system resource problems (e.g. lack of memory), configuration problems (such as someone accidentally disabling access to one of the SYSTEM.CLUSTER queues), bad messages being sent to the input COMMAND queue maliciously or due to queue manager errors elsewhere, or of course program failures in the repository process itself.
If the repository process has stopped processing for one of these reasons, the cluster cache will not be kept up to date, and eventually applications will no longer be able to put messages to cluster destinations. Even before then, there is a risk that updates will not have been received and that out-of-date routing information will be used. This is therefore a serious situation, and it is very important to monitor the error logs (or the chinit log on z/OS) for your queue manager for repository errors and take the appropriate action.
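If you do not already have a monitoring tool watching the error logs, a small script can act as a stopgap on distributed platforms. The sketch below scans error log text for a watch-list of message identifiers. The two default identifiers are only examples of repository manager messages as I recall them, so check them against the messages reference for your version before relying on them. The function names are also my own.

```python
import re

# An error log message line starts with an identifier such as "AMQ9448:";
# an optional trailing severity letter (e.g. "AMQ9448E:") is tolerated.
MSG_LINE = re.compile(r"^(AMQ\d{4})[A-Z]?:\s*(.*)$")


def repository_errors(log_text, watch_ids=("AMQ9409", "AMQ9448")):
    """Return (message_id, text) pairs for watched messages found in log_text.

    watch_ids is an illustrative default, not a complete list.
    """
    hits = []
    for line in log_text.splitlines():
        match = MSG_LINE.match(line.strip())
        if match and match.group(1) in watch_ids:
            hits.append((match.group(1), match.group(2)))
    return hits


def scan_log(path, watch_ids=("AMQ9409", "AMQ9448")):
    """Read one error log file and return any watched messages it contains."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return repository_errors(handle.read(), watch_ids)
```

On a typical Unix installation the queue manager error log would be something like /var/mqm/qmgrs/QM1/errors/AMQERR01.LOG, where QM1 is a placeholder queue manager name. Any non-empty result from `scan_log` is a prompt to go and read the full log entry, not a diagnosis in itself.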
For most types of error there is a ‘grace period’ of retrying cluster operations before the repository manager and eventually the queue manager will shut down. See the clustering and best practices section of the Infocenter for more details of what to look for and what actions can be taken in these situations.
Note that you should never deliberately interrupt the repository manager by attempting to kill the process unless explicitly recommended by IBM service, nor attempt to restart it ‘manually’ (independently from queue manager restart). The process is tightly coupled to the queue manager (via shared memory etc.) and this can cause serious problems, particularly in more recent versions of the product, as well as leaving the cache in an unknown state and out of sync with the rest of the cluster.