Part of the 'Clustering FAQ' series, looking at how to keep an eye on the 'health' of your cluster tasks. (Note – this information is intended as guidance for use with current WMQ versions 7.1 and 7.5. While many of the details are the same for all versions of WMQ Clusters, there will be some differences, so you should use caution if referring to this document when working with earlier product versions.)
What is the repository process?
All queue managers participating in a cluster maintain a local cache of information about the cluster, whether they are full repositories with a complete picture of the cluster, or partial repositories which keep just a working subset. See previous blog posts and the Infocenter for more information about the contents of this cache. In this post I’m going to be discussing the repository manager process, which is the component of the queue manager responsible for keeping the cache up to date and sharing information with other queue managers in the cluster.
All queue managers, even those not using clustering, will have a repository manager process – though if they are doing no cluster work, this process should be almost entirely idle and have a very small footprint. On platforms other than z/OS, the repository manager is a standalone process connected to the queue manager, called amqrrmfa. On WMQ for z/OS, the repository manager runs as a task within the CHIN (channel initiator) address space. Therefore, when the chinit is not running on z/OS, the cluster cache will not be kept up to date.
If you have a significant cluster deployment, alongside other monitoring of your system it is a very good idea to monitor the health of the repository manager – problems here could end up causing application issues ranging from not getting the workload balancing you expect to complete non-delivery of messages. There are a number of ways to look at what the repository manager is doing, and I’ll go through some of the main ones here.
As a very basic starting point, you may wish to use system tools to watch the CPU usage of the repository task. The repository usually spends most of its time idle, with the following exceptions.
Periodically, the repository manager runs a ‘maintenance’ of the local cache. This includes checking all locally defined objects to see if they should be readvertised to the cluster, and looking for remotely defined entries which are no longer in use and can be garbage collected. Typically this should take a matter of seconds or in very large clusters a few minutes. After running a maintenance cycle, the repository manager schedules another to run 1 hour later. So although this will normally be seen as an hourly ‘heartbeat’ of activity, in large environments where a maintenance run takes a little longer this will not be completely regular.
The repository manager is always listening for updates from other queue managers, or local requests for information about the cluster. These are normally handled very quickly, so CPU usage will barely register in most cases. The exception might be when bootstrapping a new queue manager into a cluster, or carrying out administrative work such as a REFRESH – at these times there may be a large amount of work to do causing a period of increased activity.
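To see why the hourly 'heartbeat' is not perfectly regular, here is a small Python sketch of the scheduling rule described above, where the next maintenance run is scheduled one hour after the previous run finishes. The function name and the three-minute run duration are illustrative assumptions, not measurements from a real cluster.

```python
from datetime import datetime, timedelta

def maintenance_start_times(first_start, run_durations, gap=timedelta(hours=1)):
    """Start time of each maintenance cycle when the next one is
    scheduled `gap` after the previous one completes."""
    starts = []
    current = first_start
    for duration in run_durations:
        starts.append(current)
        # The next cycle is scheduled relative to the END of this one.
        current = current + duration + gap
    return starts

if __name__ == "__main__":
    # Hypothetical: every cycle takes 3 minutes in a large cluster.
    runs = [timedelta(minutes=3)] * 4
    for start in maintenance_start_times(datetime(2024, 1, 1, 0, 0), runs):
        print(start.strftime("%H:%M"))
```

With three-minute runs the start times slip by three minutes per cycle (00:00, 01:03, 02:06, 03:09), so a CPU graph will show the activity gradually moving around the clock rather than sitting at a fixed minute past the hour.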
The performance of the Repository task itself is not normally something that requires much consideration, even in substantial cluster deployments. One exception to this, which can be very visible to end users, is when a repository task is so overloaded that cluster queries on behalf of applications cannot be processed in a timely manner. This will be seen as an MQRC_CLUSTER_RESOLUTION_ERROR in the application. If these are occurring frequently, gathering CPU information and looking at what is happening on the repository queues (see below) may help track down the source of the problem (or ultimately be useful documentation if raising a PMR with IBM service is required).
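As one way of gathering that CPU information on Linux or UNIX, the following Python sketch picks the amqrrmfa process out of `ps -eo pid,pcpu,args` output. The helper names are hypothetical, and this does not apply to z/OS, where the repository manager is a task inside the CHIN address space rather than a separate process.

```python
import subprocess

def repository_cpu(ps_output, process_name="amqrrmfa"):
    """Return [(pid, cpu_percent)] for each repository manager process
    found in `ps -eo pid,pcpu,args` style output."""
    found = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) < 3 or not fields[0].isdigit():
            continue  # skip the header line and anything malformed
        pid, cpu, args = fields
        # Compare the executable name only, so "grep amqrrmfa" is ignored.
        executable = args.split()[0].rsplit("/", 1)[-1]
        if executable == process_name:
            found.append((int(pid), float(cpu)))
    return found

def sample_repository_cpu():
    """Run ps once and report CPU usage for amqrrmfa."""
    result = subprocess.run(["ps", "-eo", "pid,pcpu,args"],
                            capture_output=True, text=True, check=True)
    return repository_cpu(result.stdout)
```

Calling `sample_repository_cpu()` from a scheduler every minute or so and logging the result is usually enough to spot sustained activity. Note that on Linux the `pcpu` column is averaged over the lifetime of the process, so short bursts are easier to see by comparing cumulative CPU time between samples.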
On distributed platforms, the cluster cache resides in one or more blocks of shared memory referenced by the repository manager process, the queue manager execution controller, and any other MQ processes which need to query information about the cluster. Typically this will be by far the largest memory usage associated with amqrrmfa, though of course there will also be a certain amount of local storage use as for any application. Similarly on WMQ for z/OS, the cluster cache storage is mapped into both the CHIN and MSTR address spaces – note that today this is 31-bit addressed ‘below the bar’ storage in both cases, so a large local cache will affect the memory available for other usage such as buffer pools in the queue manager address space.
The ‘System Resources’ section of the previous post http://tinyurl.com/bigclusters gives some guidance on how much memory may be consumed for a particular cluster configuration.
The Repository Queues
All of the work done by the repository manager affects a small set of local queues, and by monitoring these queues we can learn a lot about what the repository manager is currently doing. It is a good idea to configure queue statistics on some of these queues if you are concerned/curious about the activities and performance of the repository process. For example, to gather statistics about the repository command queue at 10 minute intervals you could issue:
ALTER QL(SYSTEM.CLUSTER.COMMAND.QUEUE) STATQ(ON)
ALTER QMGR STATINT(600)
You will need to restart the queue manager to release handles on the queues before these commands take effect. The amqsmon sample (see Infocenter) can be used as a simple means to format the statistics messages generated – your preferred monitoring tools may have options to process these in a variety of more advanced ways.
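As a rough illustration of driving amqsmon from a script, the Python sketch below builds a command line that formats the queue statistics messages for the repository command queue. It assumes amqsmon is on the PATH (it ships with the samples), and the queue manager name QM1, the helper names and the 60-second timeout are hypothetical.

```python
import subprocess

def amqsmon_command(qmgr, queue, browse=True):
    """Build an amqsmon command line that formats queue statistics
    messages for a single queue."""
    cmd = ["amqsmon", "-m", qmgr, "-t", "statistics", "-q", queue]
    if browse:
        cmd.append("-b")  # browse, leaving the statistics messages in place
    return cmd

def show_command_queue_stats(qmgr):
    """Run amqsmon and return its formatted text output."""
    cmd = amqsmon_command(qmgr, "SYSTEM.CLUSTER.COMMAND.QUEUE")
    result = subprocess.run(cmd, capture_output=True, text=True,
                            check=True, timeout=60)
    return result.stdout

if __name__ == "__main__":
    print(show_command_queue_stats("QM1"))  # QM1 is a placeholder name
```

Browsing with `-b` keeps the messages available for other tools. Drop it if this script is the only consumer and you want the statistics queue to be drained as it is read.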
Here are brief descriptions of the important queues to consider:
The SYSTEM.CLUSTER.REPOSITORY.QUEUE – the contents of the local cache are persisted to this queue so that if the queue manager is restarted at any time it does not need to rediscover everything it knew about the cluster. This queue will be updated every time the cache is modified for any reason – the size of the data on the queue is not that interesting as checkpointing mechanisms mean that this will jump around, but high frequency of PUTs and GETs to the queue indicates a lot of activity in the cache.
The SYSTEM.CLUSTER.COMMAND.QUEUE – all work except periodic maintenance is queued for the repository process here. This includes local requests for information (the first time an application accesses a cluster resource) and data sent from other repositories in the cluster. Most of the time this queue should be empty – a backlog of messages either indicates a very busy period (perhaps someone else in the cluster has issued a REFRESH, in which case high CPU would be another symptom) or possibly a problem with the local repository manager task.
The SYSTEM.CLUSTER.TRANSMIT.QUEUE – traditionally, all messages destined for other queue managers in the cluster flowed through this queue, and this includes repository maintenance messages exchanged between queue managers. At times when application traffic is low, monitoring this queue can give an indication of the communication happening between repositories. If you have chosen to configure multiple transmission queues (available from version 7.5) you will know which additional queues need monitoring – although this may be more work, it can allow finer levels of detail to be seen.
The SYSTEM.CLUSTER.HISTORY.QUEUE is mentioned here only for completeness. This queue is primarily to assist the IBM service team in diagnosing problems and it is not usually useful to configure monitoring here.
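For a quick check between statistics intervals, the depth of these queues can be read with the MQSC command DISPLAY QSTATUS. The Python sketch below pipes that command into runmqsc and pulls out the CURDEPTH values. The helper names and the alert threshold are assumptions for illustration, and the parsing relies only on the QUEUE(...) and CURDEPTH(...) tokens rather than on the exact layout of the output.

```python
import re
import subprocess

MQSC = "DISPLAY QSTATUS(SYSTEM.CLUSTER.*) TYPE(QUEUE) CURDEPTH\n"

def parse_depths(runmqsc_output):
    """Map queue name -> current depth from DISPLAY QSTATUS output."""
    depths = {}
    queue = None
    for match in re.finditer(r"\b(QUEUE|CURDEPTH)\(([^)]*)\)", runmqsc_output):
        keyword, value = match.group(1), match.group(2).strip()
        if keyword == "QUEUE":
            queue = value  # remember which queue the next depth belongs to
        elif queue is not None and value.isdigit():
            depths[queue] = int(value)
            queue = None
    return depths

def has_backlog(depths, queue="SYSTEM.CLUSTER.COMMAND.QUEUE", threshold=0):
    """True if the queue currently holds more than `threshold` messages."""
    return depths.get(queue, 0) > threshold

def current_depths(qmgr):
    """Ask a local queue manager for the depth of its cluster queues."""
    result = subprocess.run(["runmqsc", qmgr], input=MQSC,
                            capture_output=True, text=True)
    return parse_depths(result.stdout)
```

A single non-zero reading on the command queue is not a problem in itself, since work arrives in bursts. What matters is a depth that stays high across several samples, so in practice you would raise the threshold or require a few consecutive hits before alerting.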
Error Handling and Troubleshooting
The cluster repository process can be interrupted for a number of reasons: local system resource problems (e.g. lack of memory), configuration problems (such as someone accidentally disabling access to one of the SYSTEM.CLUSTER queues), bad messages being sent to the input COMMAND queue maliciously or due to queue manager errors elsewhere, or of course program failures in the repository process itself.
If the repository process has stopped processing for one of these reasons, the cluster cache will not be kept up to date, and eventually applications will stop being able to put messages to cluster destinations – even before then there is the risk that updates will not have been received and the wrong routing information will be used. This is therefore seen as a serious situation, and it is very important to monitor the error logs/chinit log for your queue manager for repository errors and take the appropriate action.
For most types of error there is a ‘grace period’ of retrying cluster operations before the repository manager and eventually the queue manager will shut down. See the clustering and best practices section of the Infocenter for more details of what to look for and what actions can be taken in these situations.
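If you do not already have tooling watching the error logs, a very simple scan along the following lines may be a starting point on distributed platforms. This Python sketch looks for AMQ message lines whose text mentions the repository. The helper names are hypothetical, the log path shown is the usual UNIX default and may differ on your system, and the sample message in the comment is illustrative, so check the messages reference for your version to decide which IDs really need action.

```python
import re

# First line of a message in the error log, for example
# "AMQ9409: Repository manager ended abnormally." (illustrative).
MESSAGE_LINE = re.compile(r"^(AMQ\d{4}[A-Z]?):\s*(.*)$")

def repository_errors(log_text):
    """Return [(message_id, text)] for messages mentioning the repository."""
    hits = []
    for line in log_text.splitlines():
        match = MESSAGE_LINE.match(line.strip())
        if match and "repository" in match.group(2).lower():
            hits.append((match.group(1), match.group(2)))
    return hits

def scan_error_log(path):
    """Scan one error log file, tolerating any odd bytes in it."""
    with open(path, errors="replace") as log:
        return repository_errors(log.read())

if __name__ == "__main__":
    # Assumed default location for a queue manager named QM1.
    for msg_id, text in scan_error_log("/var/mqm/qmgrs/QM1/errors/AMQERR01.LOG"):
        print(msg_id, text)
```

This only inspects the first line of each message and will also report informational entries such as normal start and stop, so treat it as a filter to narrow down what to read rather than as an alerting rule.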
Note that you should never deliberately interrupt the repository manager by attempting to kill the process unless explicitly recommended by IBM service, nor attempt to restart it ‘manually’ (independently from queue manager restart). The process is tightly coupled to the queue manager (via shared memory etc.) and this can cause serious problems, particularly in more recent versions of the product, as well as leaving the cache in an unknown state and out of sync with the rest of the cluster.