Over the past year I have been involved in several MQ performance problems and have seen a variety of approaches to resolving them. Some work - some don't. I thought I would share some of these, and give some suggestions on how to identify and resolve performance problems.
There are different sorts of performance problems
1. Too expensive:
For example, an application is using too much CPU. On z/OS you can use products like APA and Strobe to profile applications and see where the hot spots are.
Some examples I have blogged about previously are processing many messages on a deep queue (perhaps using MQ message selectors), or DB2 processing involving many rows.
I will not address these problems here.
2. Too slow / Maximum throughput:
Long response time or low throughput.
Some people say they want maximum throughput. This could mean at least one server is running at full capacity - which could mean work queues up on this server, resulting in the overall transaction response time increasing due to the waiting.
Most people mean they want maximum throughput whilst maintaining a round trip response time less than a specific duration.
The first steps
- Use platform tools to make sure there are no delays due to lack of CPU, paging, or slow IO
- Find the queues with deep queue depths, or where messages stay on the queue for a long time.
- Look at the applications processing those queues, and investigate them.
I'll start with an analogy that came to me whilst sitting waiting at an airport.
There is a long one way corridor with a door at each end. The doors allow one person per second to pass through them. Halfway along the corridor is another door which allows two people per second to pass through.
1. Your boss comes in and tells you there is a bottleneck, because only one person per second comes out of the exit door.
2. You look into the problem, see that there are people queuing up at the entrance door. You work late into the night and change the door to allow up to 10 people per second.
3. Next morning, your boss comes in with a cup of coffee - saying good job - here is a cup of coffee as an award.
4. Your boss comes back a minute later, takes back the coffee and says there is still only one person a second coming out of the exit door.
5. You go and look into the problem, and see there are people queuing up at the exit door. You work late into the night(again) and fix the exit door to allow 10 people per second.
6. Next day your boss comes in, gives you the (cold) cup of coffee again and says well done.
7. Your boss comes back 10 minutes later, takes back the cup of coffee and says the throughput has improved, but it is still only two people per second.
8. You go and see what the problem is - and see the door in the middle is causing the bottleneck. You work late into the night (again) and fix the door to allow 10 people per second.
9. Next afternoon, your boss comes in very happy as there are many people coming out of the exit door - but says "we spent the award budget on buying you the coffee earlier in the week, so take the rest of the evening off!!"
What have we learnt?
1. You work hard to resolve a bottleneck, it seems to have no effect, and you get no thanks.
2. You may need to fix several bottlenecks before the throughput improves. You had to fix both the entrance and exit doors to get any improvement.
3. An area (eg the middle doors in the above story) that was not a bottleneck initially can become a bottleneck later, once other bottlenecks have been removed.
A real scenario:
Consider a scenario where there are MQ clients on Windows putting messages to a queue manager on z/OS. These messages subsequently flow to a Linux system running IIB (IBM Integration Bus). An IIB flow gets a message and does an insert into a remote database.
The complaint is that the Windows transactions are taking too long. One symptom is that they sometimes see the transmission queues (XMITQs) on z/OS building up.
Some customer-like conversations we have had:
We upgraded MQ on z/OS. We exploited 64 bit buffer pools, we moved the active logs to faster DASD. We upgraded MQ on Linux, and got our networking people to improve the network. It still did not help. The customer had read lots of documents and blog posts, but had not investigated where the problem was.
We can't get the doc you asked for
From a packet trace on z/OS, IBM can see that the MQ end of batch flows are taking a long time on the Linux system.
This is most probably due to the commits taking a long time (slow disks), or the queues filling up on Linux - so the MQPUT is retried, and there is a very long time (100 milliseconds) before the MQPUT is successful. This then causes the messages to build up on z/OS. So clearly a problem on Linux. We cannot get any information from the Linux system, so how do we tune MQ on z/OS?
It is almost impossible to diagnose a problem on a Linux box from z/OS. You can tell there is a problem - but not what the problem is.
This is like making the entry door support 20 people a second - while the exit door can still only support one person a second.
We haven't changed anything on z/OS, we haven't spoken to the Linux people or the network people, but the problem is not on z/OS. This is not my problem.
The root cause of the problem was the hop from IIB to the remote database, so people were not looking in the right area.
Before I give my suggestions for identifying where the performance problems are, what can cause delays?
- Your application can be using lots of CPU, for example you have enabled encryption of data or you have to process deep MQ queues looking for a message
- Your application is waiting for disk IO, for example a commit.
- Your application has to wait for storage e.g. paging - or the buffer pool is too small and the MQ message needs to be read from page set to memory before it can be used.
- Your application issued a TCP send request - but because of network, or downstream delays, the send is delayed
- Your application put a request on a work queue (either MQ queue, or a work list in memory), then waited for the server to process the request
- Work is building up as you do not have enough threads processing the work.
- Too many threads processing the work - this can happen if you have a hot database record which every instance has to update. The more threads you have, the more deadlocks or contention you get for the hot row.
- Your application has issued a wait. For example, "I expect the replies to be back within 10 milliseconds, so I'll put in a wait of 100 milliseconds regardless of how long the request takes". This 100 ms was from 10 years ago when a disk IO took 10 ms - disk IO is now below 1 ms. (There is a sketch of a better approach after this list.)
- Your application is waiting on a latch - a latch is typically held for a very short interval - for example while updating 10 fields in a control block.
- Your application is waiting on a lock - a lock is typically held for a longer period of time, perhaps 10s of milliseconds. For example you need exclusive access to a database record - and this locked time includes the time while a commit is written to disk.
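As a small illustration of the "fixed wait" point above, it is usually better to let MQ do the waiting with a get-with-wait than to sleep for a fixed time and then poll. This is a minimal sketch using the MQ classes for Java; the queue manager name, queue name and wait interval are invented for the example.

import com.ibm.mq.MQException;
import com.ibm.mq.MQGetMessageOptions;
import com.ibm.mq.MQMessage;
import com.ibm.mq.MQQueue;
import com.ibm.mq.MQQueueManager;
import com.ibm.mq.constants.CMQC;

public class GetWithWait {
  public static void main(String[] args) throws Exception {
    MQQueueManager qMgr = new MQQueueManager("QM1");            // hypothetical queue manager
    MQQueue replyQ = qMgr.accessQueue("APP.REPLY",               // hypothetical reply queue
        CMQC.MQOO_INPUT_AS_Q_DEF | CMQC.MQOO_FAIL_IF_QUIESCING);

    MQGetMessageOptions gmo = new MQGetMessageOptions();
    // Instead of sleeping for 100 ms and then polling, ask MQ to wait.
    // The get returns as soon as the reply arrives (typically within 10 ms),
    // and only waits the full 100 ms when there really is no message.
    gmo.options = CMQC.MQGMO_WAIT | CMQC.MQGMO_SYNCPOINT | CMQC.MQGMO_FAIL_IF_QUIESCING;
    gmo.waitInterval = 100;                                      // upper bound, in milliseconds

    MQMessage msg = new MQMessage();
    try {
      replyQ.get(msg, gmo);
      qMgr.commit();
    } catch (MQException e) {
      if (e.reasonCode == CMQC.MQRC_NO_MSG_AVAILABLE) {
        // no reply within 100 ms - handle the timeout
      } else {
        throw e;
      }
    } finally {
      replyQ.close();
      qMgr.disconnect();
    }
  }
}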
At a simplistic level to identify where the performance problems are, you need to:
- Draw a picture of each major step in the end to end transaction. Bear in mind that a request may go down a path, and the response come back the same way but in the opposite direction, or a response may come back a different path.
- Measure the time spent in each step. This is easy to say - but often hard to do.
You could try to measure the time for each step, but this gets difficult.
For example, continuing the scenario of MQ clients on Windows putting messages to queues on a z/OS queue manager, which are subsequently processed on a Linux system running IIB to insert the messages into a database:
- The Windows client program does an MQPUT followed by an MQCommit.
- The client program could be instrumented to report the duration of the MQ verbs (there is a sketch of this after this list), or you could do a network trace using tools like Wireshark running on Windows to look at the flows to/from z/OS.
- MQ Accounting class(3) on z/OS can be used to display the MQ API response times as seen by the queue manager, but it does not include the network time.
- Network tools like PING and netstat can give useful information on the time to get from Windows to z/OS.
- On the z/OS queue manager, messages are queued up on an XMIT queue for a channel.
DIS CHS XQTIME can be used to display the average time messages spent on the transmission queue before being sent.
DIS CHS NETTIME gives a measure of the network round trip time from z/OS to the Linux partner machine and back.
- On the Linux queue manager you can use DIS QSTATUS to get the age of the oldest message on the IIB flow's input queue.
- IIB tooling can be used to display information about the IIB activity
- Network trace such as Wireshark can be used to measure the requests from IIB to the remote database.
- Use similar techniques on the way back to the original Windows client
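For the bullet above about instrumenting the client program, here is a minimal sketch of wrapping the MQPUT and the commit with timers, using the MQ classes for Java. The connection details, queue manager and queue names are invented for the example; the same idea applies to any MQ verb.

import com.ibm.mq.MQEnvironment;
import com.ibm.mq.MQMessage;
import com.ibm.mq.MQPutMessageOptions;
import com.ibm.mq.MQQueue;
import com.ibm.mq.MQQueueManager;
import com.ibm.mq.constants.CMQC;

public class TimedPut {
  public static void main(String[] args) throws Exception {
    // Client connection to the z/OS queue manager - values are examples only.
    MQEnvironment.hostname = "zos.example.com";
    MQEnvironment.port = 1414;
    MQEnvironment.channel = "APP.SVRCONN";

    MQQueueManager qMgr = new MQQueueManager("MQPA");                 // hypothetical queue manager
    MQQueue queue = qMgr.accessQueue("APP.REQUEST",                   // hypothetical queue
        CMQC.MQOO_OUTPUT | CMQC.MQOO_FAIL_IF_QUIESCING);

    MQMessage msg = new MQMessage();
    msg.writeString("hello");
    MQPutMessageOptions pmo = new MQPutMessageOptions();
    pmo.options = CMQC.MQPMO_SYNCPOINT;                               // put within a unit of work

    long t0 = System.nanoTime();
    queue.put(msg, pmo);
    long t1 = System.nanoTime();
    qMgr.commit();
    long t2 = System.nanoTime();

    // The times seen by the application include the network time,
    // which MQ accounting class(3) on z/OS does not.
    System.out.printf("MQPUT %.3f ms, commit %.3f ms%n",
        (t1 - t0) / 1e6, (t2 - t1) / 1e6);

    queue.close();
    qMgr.disconnect();
  }
}

Collecting these times over many requests, and comparing them with the class(3) accounting data on z/OS, shows how much of the elapsed time is in the network rather than in the queue manager.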
This is good in theory, but it may be impractical in practice as some customers do not allow networks to be sniffed, or the data may be encrypted.
A good approach is to use the MQ commands to see where work is building up or being delayed. This may identify one part of the end to end flow, then dig into the slower component.
This digging into slower components may require using commands from other products, such as TCP NETSTAT, CICS or IMS commands on z/OS etc.
Once you have fixed the first bottleneck (or worked round it) look for the next one.
Use the following MQ command on each MQ queue of interest
DIS QSTATUS(queue) CURDEPTH LGETTIME QTIME MSGAGE
If the queue has messages on it check:
LGETTIME is the time of the last get from the queue - if this is not changing then messages are not being processed.
QTIME is the average time, in microseconds, that messages were on the queue before being got - two values are shown, a short term average and a longer term average.
MSGAGE is the age, in seconds, of the oldest message on the queue.
These monitoring values are only reported if queue monitoring (MONQ) is enabled.
If the queue depth is increasing, or the messages are slow to be processed, then the application or channel processing the queue is not keeping up, and you need to investigate this.
You may be able to start more instances of the application to process the messages in parallel.
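If you want to capture this queue information repeatedly, rather than typing the command, the same data can be collected programmatically with PCF. This is a minimal sketch using the MQ PCF classes for Java; the queue manager and queue names are invented, and as noted above the monitoring values are only meaningful when MONQ is enabled.

import com.ibm.mq.MQQueueManager;
import com.ibm.mq.constants.CMQC;
import com.ibm.mq.constants.CMQCFC;
import com.ibm.mq.headers.pcf.PCFMessage;
import com.ibm.mq.headers.pcf.PCFMessageAgent;

public class QueueStatus {
  public static void main(String[] args) throws Exception {
    MQQueueManager qMgr = new MQQueueManager("QM1");          // hypothetical queue manager
    PCFMessageAgent agent = new PCFMessageAgent(qMgr);

    // The PCF equivalent of DIS QSTATUS(APP.REQUEST) CURDEPTH MSGAGE
    PCFMessage request = new PCFMessage(CMQCFC.MQCMD_INQUIRE_Q_STATUS);
    request.addParameter(CMQC.MQCA_Q_NAME, "APP.REQUEST");    // hypothetical queue

    for (PCFMessage response : agent.send(request)) {
      int depth = response.getIntParameterValue(CMQC.MQIA_CURRENT_Q_DEPTH);
      // Oldest message age is only meaningful when queue monitoring is enabled.
      int oldest = response.getIntParameterValue(CMQCFC.MQIACF_OLDEST_MSG_AGE);
      System.out.println("depth=" + depth + " oldest message age(s)=" + oldest);
    }
    agent.disconnect();
  }
}

Run something like this every few seconds and log the values; a depth that only ever grows, or an oldest message age that keeps increasing, points at the getting application or channel.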
For each sender MCA channel, enable MONCHL on the channel to collect information about the channel activity, and restart the channel so that the change takes effect.
Then (after a while), issue DIS CHS(channel) XQTIME MSGS adding attribute XQMSGSA for cluster-sender channels.
1. XQTIME is the time that messages remained on the transmission queue before being sent by the channel
2. MSGS is the number of messages sent since the channel was started
3. XQMSGSA is the number of messages on the transmission queue that are available to the cluster-sender channel
Issue this command a few times, a few seconds apart.
If messages are spending a long time on the transmission queue, or the number of messages on the transmission queue is growing, then the channel can't keep up with messages arriving on the transmission queue. If MSGS isn't increasing, then the channel isn't processing any messages, which indicates that the channel isn't running.
If the channel can't keep up with messages arriving on the transmission queue, do you know what the data rate should be?
Issue DIS CHS(channel) BYTSSENT twice.
You can calculate the approximate data rate for the channel from the difference in bytes sent, and the time between the commands.
Compare this with data from a time when there is no problem.
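For example, if BYTSSENT increases from 10,000,000 to 40,000,000 between two commands issued 60 seconds apart, the channel sent about 30,000,000 bytes in 60 seconds, which is roughly 500 KB per second, or about 4 Mbit per second (these numbers are purely illustrative).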
- If messages are building up on the IIB flow input queue, look at IIB message flow statistics and accounting data to see the route that messages take through the flow, and whether a long time is spent processing messages in any nodes.
- Connection from IIB to the database - use the IIB stats and accounting, or use Wireshark if you think there may be a network problem.
- Use database tools at the remote end to review the performance of the database requests
What about performance problems with MQ?
With MQ the application concept is simple.
- An application does an MQPUT of a message - and issues a commit
- The message is then available for getting from the queue
- Another application does an MQGET and issues a commit.
These can be broken down into different categories
- MQ API requests take a long time. For z/OS a put or a get of a short message (under 50 KB) should be around 1 ms. The commit of a unit of work with persistent messages should be 1-10 milliseconds or better.
- You can tell this by using the MQ CLASS 3 accounting which tells you the average MQ API response times, and where the time was spent.
- An MQGET-with-wait can take a long time if there are no messages.
- An MQGET can take a long time if many messages have to be searched - eg if the queue is not indexed and the get was issued specifying msgid or correlid, or the get specified message selectors, which cause the queue to be searched linearly for matching message(s). (There is a sketch of a get by correlid at the end of this post.)
- An MQGET can take longer if a message was not in a buffer pool and had to be read from a pageset
- MQ Commits of persistent messages take a long time - check the disk response time using the tools on the platform
- If you have deep queues
- Applications not processing messages fast enough, for example delays doing a database request
- Not enough application instances
- Deep XMITQ
- Slow network - TCP tuning needs to be done - eg allowing TCP windows and buffers to grow.
- Channel has problems putting message to the queue at the receiver end.
- Batch size too small
- A large MQ NETTIME can be due to network delays, or delays in processing at the remote end of the channel - or both.
- Poor application design
- For example, having IIB with 20 message flows chained together, where the path is MQGET, some processing, flatten the data, MQPUT, MQ Commit; then MQGET, unflatten the data, some more processing, flatten the data, MQPUT, MQ Commit; and so on.
- There is a lot of unnecessary work - for example there are 20 commits each taking 1-10 ms.
- When this flow was converted to one flow, with one MQGET, one MQPUT and one MQ Commit, there was a major increase in throughput.
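As a footnote to the earlier point about getting by msgid or correlid from a queue that is not indexed, here is a minimal sketch of a get by correlation id using the MQ classes for Java; the queue manager name, queue name and correlation id value are invented. On z/OS, defining the queue with INDXTYPE(CORRELID) lets the queue manager find the message directly rather than scanning a deep queue.

import com.ibm.mq.MQGetMessageOptions;
import com.ibm.mq.MQMessage;
import com.ibm.mq.MQQueue;
import com.ibm.mq.MQQueueManager;
import com.ibm.mq.constants.CMQC;

public class GetByCorrelId {
  public static void main(String[] args) throws Exception {
    MQQueueManager qMgr = new MQQueueManager("MQPA");            // hypothetical queue manager
    // On z/OS the queue would be defined with INDXTYPE(CORRELID),
    // so this get does not have to scan the whole queue.
    MQQueue replyQ = qMgr.accessQueue("APP.REPLY",               // hypothetical queue
        CMQC.MQOO_INPUT_AS_Q_DEF | CMQC.MQOO_FAIL_IF_QUIESCING);

    MQMessage msg = new MQMessage();
    msg.correlationId = "0123456789abcdef01234567".getBytes();   // the 24 byte id we are waiting for
    MQGetMessageOptions gmo = new MQGetMessageOptions();
    gmo.options = CMQC.MQGMO_WAIT | CMQC.MQGMO_SYNCPOINT;
    gmo.matchOptions = CMQC.MQMO_MATCH_CORREL_ID;                // match on correlation id only
    gmo.waitInterval = 1000;                                     // milliseconds

    replyQ.get(msg, gmo);                                        // throws MQRC 2033 if nothing matches in time
    qMgr.commit();
    replyQ.close();
    qMgr.disconnect();
  }
}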