When the IBM MQ Appliance was introduced in 2015, it added the ability to view monitoring and statistics information through charts in the MQ Console. Underpinning this was a new MQ feature: statistics are published to new system topics (under $SYS/MQ/INFO/QMGR), and the MQ Console simply subscribes to those topics as necessary. There is therefore no need to reconfigure queue managers to enable and disable statistics as they are needed. These statistics help users monitor resources and diagnose possible performance problems.
The same simple publish/subscribe mechanism has now been enabled across all distributed platforms in MQ v9. This has many benefits:
- Dynamic enabling/disabling of statistics without needing to administratively modify queue manager configuration
- The support for multiple subscribers to the same set of information, allowing more than one monitoring tool to be in place
- The possibility of allowing non-MQ admins to subscribe to a subset of information, specific to their application resources
The MQ Console is not yet available for other distributed v9 platforms (see the statement of direction on that here http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?subtype=ca&infotype=an&supplier=877&letternum=ENUSZP15-0069#h2-sodx) but there is a sample with MQ v9, amqsrua, to show you how you can take advantage of this feature now. Or you can write an application that subscribes to the resource monitoring system topic in a similar way to amqsrua. More information about amqsrua can be found at https://www.ibm.com/support/knowledgecenter/SSFKSJ_9.0.0/com.ibm.mq.mon.doc/mo00013_.htm and https://www.ibm.com/support/knowledgecenter/SSFKSJ_9.0.0/com.ibm.mq.mon.doc/mo00014_.htm .
Prior to v9, statistics messages were put to the SYSTEM.ADMIN.STATISTICS.QUEUE queue and could be output using the amqsmon sample. These pre-v9 statistics continue to be supported in v9. Some v9 statistics published to the system topic are the same as the pre-v9 statistics, whereas others are new in v9 and are only published to the system topic. The statistics messages put to SYSTEM.ADMIN.STATISTICS.QUEUE are different from those published to the system topic.
This article describes these new v9 statistics in more detail and compares them to the pre-v9 statistics put to SYSTEM.ADMIN.STATISTICS.QUEUE. The new statistics published in v9 include "log write latency", "queue avoidance" and "lock contention", which can help you maximise the performance of your system by diagnosing performance problems.
Which statistics a queue manager publishes is dynamic and can be discovered by subscribing to metadata on a system topic. In future releases, new statistics may be added, and/or existing statistics removed, so applications should not rely on particular statistics being present, but instead should discover which statistics are available. A simple way of discovering which statistics are currently published is by running amqsrua against a current queue manager.
Statistics are organised by class and type. A class is a broad classification; the current classes are CPU, DISK, STATMQI and STATQ. A type subdivides a class so that statistics can be more fine-grained, and several statistics are published for each type.
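As a rough illustration of how class and type map onto topic strings, the sketch below builds the strings an application might subscribe to. The layout after $SYS/MQ/INFO/QMGR (the Monitor and METADATA/CLASSES segments, and the position of the object name) is an assumption based on the MQ 9.0 documentation, and the helper names, QM1 and Q1 are hypothetical. Check the result against the metadata your own queue manager publishes.

```python
# Root of the resource monitoring system topics, as named in this article.
ROOT = "$SYS/MQ/INFO/QMGR"


def metadata_topic(qmgr):
    """Topic assumed to describe the available classes (layout is an assumption)."""
    return f"{ROOT}/{qmgr}/Monitor/METADATA/CLASSES"


def monitor_topic(qmgr, stat_class, stat_type, obj=None):
    """Build a topic string for one class/type pair.

    `obj` is an optional object name (for example a queue name for STATQ).
    The segment order is an assumption, not taken from this article.
    """
    parts = [ROOT, qmgr, "Monitor", stat_class]
    if obj is not None:
        parts.append(obj)
    parts.append(stat_type)
    return "/".join(parts)


# Hypothetical queue manager QM1 and queue Q1.
print(metadata_topic("QM1"))
print(monitor_topic("QM1", "CPU", "SystemSummary"))
print(monitor_topic("QM1", "STATQ", "PUT", obj="Q1"))
```

Keeping the string construction in one place makes it easier to adjust if the classes and types discovered from the metadata differ from those hard-coded here.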
A design goal for the generation and collection of these statistics is that they should have as low a performance impact as possible, even at the expense of being 100% exact. They should be seen as a "broad-brush" approach. Other mechanisms, such as activity trace, are available if 100% accuracy is required at the expense of performance. The put count and put bytes count are an example. If a single application puts 1000-byte messages, it might be assumed that the put bytes count would be a multiple of 1000. However, this isn't necessarily the case: to improve performance, no lock is taken around updating the put bytes count statistic, so the put bytes count may not be a multiple of 1000. This makes the statistic less obtrusive, but less exact.
To subscribe to the system topic to get these statistics, you need to be a member of the mqm group, by default. Alternatively, you can authorise other users to the correct branch of the $SYS/MQ subtree so they can get these statistics as well. This is described in https://www.ibm.com/support/knowledgecenter/SSFKSJ_9.0.0/com.ibm.mq.mon.doc/mo00040_.htm .
Controlling the generation of the v9 published statistics
The pre-v9 statistics were controlled using the STAT queue manager attributes. Except for STATINT, these attributes have no effect on the statistics published to the system topic. Instead statistics are only generated when an application subscribes to the system topic. However, the STATINT queue manager attribute continues to control the interval over which the subscription high and low watermarks are taken. Messages are published to the relevant system topic approximately every 10 seconds and this rate is not controlled by STATINT.
Each message published contains the interval over which the statistics apply. This interval is output by amqsrua at the start of each message. The first time amqsrua is run, the interval will be the time since the queue manager was started. Every time you invoke amqsrua, the queue manager resets the interval for all the statistics. So alternately running amqsrua -c STATQ -t PUT and amqsrua -c STATQ -t GET will appear to miss messages, because the PUT statistics interval is reset as a side effect of requesting the GET statistics, and vice versa. To overcome this problem, request both types in a single invocation: amqsrua -c STATQ -t GET -t PUT. This works as expected after the first few messages. However, because amqsrua has to subscribe to two topic strings in succession, the first statistics for the second subscription cover only the tiny interval since the first subscription, not the time since the queue manager was started. This applies to all classes and types of statistics. Statistics are expected to be monitored over long periods, so frequent changes of interval are not expected.
Although statistics messages are generated every 10 seconds by default, this should not be relied on. The interval may be shorter if another subscriber starts, since all subscribers will get the first message generated as a result of the new subscriber. The interval may be longer if, for instance, the machine is busy due to a high workload.
Pre-v9 statistics that are now published to the system topic as well
The data in the STATMQI statistics overlaps with the data in the pre-v9 statistics put to SYSTEM.ADMIN.STATISTICS.QUEUE. So, for instance, the published statistic "MQCONN/MQCONNX count" is the same as the pre-v9 statistic "ConnCount". For more information about each of these statistics, see http://www.ibm.com/support/knowledgecenter/en/SSFKSJ_9.0.0/com.ibm.mq.mon.doc/q037480_.htm
The counts of MQI calls include the internal MQI calls that the queue manager makes to put and get messages on system queues, and so on, in the same way as the pre-v9 statistics do. Consequently, some statistics are higher than can be accounted for by application workload alone.
Some pre-v9 statistics are returned as a list of counts. When published in v9, in some cases this list of counts is totalled, for instance DiscCount and OpenCount. In other cases the counts in the list are published as separate statistics, for instance PutBytes, where the persistent bytes that are put are returned separately from the nonpersistent bytes.
After some statistics (such as "Interval total MQPUT/MQPUT1 byte count") a rate is returned, for example 265/sec. This is the rate averaged over the previous interval, which is returned in the statistics message and is usually approximately 10 seconds.
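The arithmetic behind such a rate can be sketched as follows. This is only an illustration: the function name and the sample values are hypothetical, and the interval is taken as a number of seconds for simplicity. The point is that the divisor should be the interval carried in the message, not an assumed 10 seconds.

```python
def rate_per_second(count, interval_seconds):
    """Average rate over one statistics interval.

    `interval_seconds` should come from the statistics message itself,
    because the interval is not guaranteed to be exactly 10 seconds.
    """
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    return count / interval_seconds


# Hypothetical sample: 2650 bytes put during an interval of exactly 10 seconds.
print(f"{rate_per_second(2650, 10.0):.0f}/sec")  # prints "265/sec"
```

If the same 2650 bytes arrived in a shortened interval of 5 seconds, the rate would be 530/sec, which is why the reported interval matters.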
"Expired message count" is equivalent to the ExpiredMsgCount statistic. Messages are counted as expired and removed when the queue manager considers them for expiry. For instance, when an application tries to get from the queue. This may be sometime after the expiration time has expired.
New statistics published in v9
One of the new statistics that is published in v9 is log write latency. This is a rolling average that represents the time that a single write to disk takes. This rolling average is calculated at the time the message is generated. Decreasing log write latency is likely to increase the maximum throughput of your system. If log write latency increases, it can significantly impact the performance of your system. It is a good idea to note your log write latency when your system is performing well, so if your performance deteriorates you can compare your log write latency to your previous good value, thereby discovering whether this is the cause of the problem. Log write latency can vary considerably depending on whether your disks are local or network attached, your RAID configuration or your network performance, and your disk type.
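One way to apply this advice is to record a baseline and flag later readings that exceed it by some margin. The sketch below is a minimal illustration, not part of MQ: the function name, the 50% tolerance and the sample values are all hypothetical, and both readings simply need to be in the same unit.

```python
def latency_regressed(baseline, current, tolerance=0.5):
    """Return True if `current` exceeds `baseline` by more than `tolerance`.

    `tolerance` is a fraction (0.5 means 50% above the baseline). The
    default is an arbitrary illustration, not a recommended threshold.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return current > baseline * (1 + tolerance)


# Hypothetical readings, both in the same unit.
good_day = 400
print(latency_regressed(good_day, 500))  # within 50% of the baseline
print(latency_regressed(good_day, 700))  # more than 50% above the baseline
```

A suitable tolerance depends on how much the latency normally varies on your own disks and network.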
All the CPU and DISK statistics are new in v9 and only published to the system topic. The other statistics that are only published to the system topic are "lock contention", "queue avoided puts" and "queue avoided bytes" for class STATQ type PUT. Not all statistics are published on all platforms, for example many of the CPU SystemSummary statistics aren't published on Windows.
Where you have access to the machine that the queue manager is running on, the operating system tools are likely to give more detail and better accuracy than the CPU and DISK statistics.
Here is some additional explanation of the new statistics that need further description:
- CPU - SystemSummary - User CPU time - the average (taken over the last 10 second interval) percentage of time used by the CPU when it was in non-privileged code.
- CPU - SystemSummary - System CPU time - the average (taken over the last 10 second interval) percentage of time used by the CPU when it was in privileged code.
- CPU - CPU load - one/five/fifteen minute average - the load average. "Load average" is an industry-wide term, but the exact value reported may differ across platforms.
- CPU - SystemSummary - RAM free percentage
- CPU - SystemSummary - RAM total bytes
- CPU - QMgrSummary - User CPU time - the average (taken over the last 10 second interval) percentage of time used by the CPU when this queue manager's processes were in non-privileged code
- CPU - QMgrSummary - System CPU time - the average (taken over the last 10 second interval) percentage of time used by the CPU when this queue manager's processes were in privileged code
- CPU - QMgrSummary - RAM total bytes - this is an approximation of the memory used by the queue manager.
- DISK - Log bytes max refers to the maximum number of bytes that could be written to the log if all the primary and secondary extents were full. This will be less than the size of the log filesystem
- DISK - Log physical bytes written / logical bytes written - Where LogWriteIntegrity=TripleWrite, the physical number of bytes written to disk will be greater than the logical bytes written.
- DISK - Log write latency - a rolling average that represents the time that a single write to disk takes.
- STATQ - PUT - "Lock contention" is the percentage of attempts to lock the queue that resulted in waiting for another process to release the lock first. Decreasing lock contention is likely to increase the maximum throughput of your system because taking a lock that is not currently locked is a much cheaper operation than waiting for a lock to be released.
- STATQ - PUT - If a message is put to a queue when there is a waiting getter, the message may not need to be queued as it may be possible for it to be passed to the getter immediately. So this message is said to have avoided the queue, and "queue avoided puts" and "queue avoided bytes" are the count of such messages and bytes. Increasing queue avoidance is likely to increase the maximum throughput of your system because it avoids the cost of putting the message onto the queue and getting it off again.
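The percentages described in the last two items can be sketched as simple ratios. The queue manager publishes lock contention as a percentage already, so the first calculation only illustrates the definition given above. The second derives a share of puts that avoided the queue, which is a derived figure and not a published statistic. All names and sample numbers are hypothetical.

```python
def percentage(part, whole):
    """Return `part` as a percentage of `whole`, or 0.0 when `whole` is zero."""
    if whole == 0:
        return 0.0
    return 100.0 * part / whole


# Hypothetical counts for one interval.
lock_attempts = 200      # attempts to lock the queue
lock_waits = 5           # attempts that had to wait for another process
total_puts = 1000        # messages put in the interval
queue_avoided_puts = 300 # messages passed straight to a waiting getter

print(f"lock contention: {percentage(lock_waits, lock_attempts)}%")            # 2.5%
print(f"puts avoiding the queue: {percentage(queue_avoided_puts, total_puts)}%")  # 30.0%
```

Tracking both figures over time shows whether a change to the application, such as keeping getters waiting on the queue, moves them in the right direction.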
Failed MQI counts
The STATMQI class of v9 statistics outputs a count of the number of failed MQI calls. Not every failed MQI call will appear in these statistics, and not every MQI call has its failures recorded at all. This is because many of the reasons that MQI calls fail are diagnosed before the MQI call reaches the internals of the queue manager, where the statistics are recorded. An example of this is MQRC_HCONN_ERROR returned to a client application. If a client application passes a bad hconn, the MQ client will diagnose that error and return MQRC_HCONN_ERROR without passing the MQI call on to the queue manager. Hence the failed MQI call will never appear in the statistics recorded by the queue manager. Even some failures that are diagnosed by the outer layers of the queue manager will not be recorded in the statistics, for the same reason. MQI calls involving message properties and message handles are processed entirely in the client, so their failure counts aren't recorded either.
Statistics of failed MQI calls are interesting because they enable customers to troubleshoot poorly-written applications that generate unnecessary failed MQI calls, thereby impacting performance.
Some examples of failures of various MQI calls that would be recorded in the statistics are:
- when MQCONN/MQCONNX/MQOPEN returns 2035 MQRC_NOT_AUTHORIZED, provided it is diagnosed by the queue manager and not the client. For example, running amqsput as the user nobody.
- when MQPUT/MQPUT1 returns 2053 MQRC_Q_FULL because MAXDEPTH has been exceeded.
- when MQGET returns 2033 MQRC_NO_MSG_AVAILABLE when browsing or destructively getting from an empty queue
- when MQSUBRQ returns 2437 MQRC_NO_RETAINED_MSG because there is no retained message