We had a customer ask how to get the maximum throughput through MQ using shared queues.
As with most performance questions, the answer is "it depends", and your mileage may vary (meaning that on different days a slight change in your environment may have a relatively large impact on your throughput).
There is information on MQ performance, including shared queues, in the MP16 SupportPac.
One key concept is synchronous versus asynchronous IO to the Coupling Facility (CF).
An IO to DASD has the following steps:
- Start the IO
- Suspend the task
- Wait for IO
- Resume the task
This is how an Async request works for the CF.
If the time to access the CF is very small, the CPU cost of the Suspend and Resume may be relatively expensive.
The Synchronous CF IO issues an instruction to the CF - and does not suspend. In concept, this is a single instruction which uses CPU while it is executing. The duration of this instruction depends on how long it takes to get to the CF, as well as how long the request takes to be actioned in the CF.
z/OS measures the time of the Sync IO, and if this is more than the cost of Suspend + Resume, it is likely to use the Async instruction instead.
The duration of a Sync IO is of the order of 10 microseconds; the duration of an Async IO may be 100 microseconds, so Sync is better than Async.
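The selection heuristic described above can be sketched as a toy model. This is not the actual z/OS algorithm - the threshold value is an invented illustration, loosely based on the rough timings quoted above:

```python
# Simplified model (NOT the real z/OS heuristic) of the decision
# described above: if the observed synchronous CF service time exceeds
# the CPU cost of a suspend/resume pair, convert the request to async.

SUSPEND_RESUME_COST_US = 30.0  # assumed threshold, purely illustrative

def choose_cf_request_mode(observed_sync_time_us: float,
                           suspend_resume_cost_us: float = SUSPEND_RESUME_COST_US) -> str:
    """Return 'sync' while spinning is cheaper than suspending the task."""
    if observed_sync_time_us > suspend_resume_cost_us:
        return "async"   # cheaper to suspend the task and free the CP
    return "sync"        # cheaper to burn CPU for the short wait

print(choose_cf_request_mode(10.0))   # healthy CF -> sync
print(choose_cf_request_mode(120.0))  # slow or busy CF -> async
```

The point to take away is that the mode is chosen dynamically from observed times, which is why requests can flip between Sync and Async as conditions change.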
The major impact on throughput is the coupling facility: the time taken to get to the structure, and the time to process work in the CF.
- CF placement - closer is faster than remote
- CF CPU type (a dedicated ICF is better than CFCC thin interrupts, which is better than dedicated GPs, which is better than shared GPs)
- The links between z/OS and the CF - for example real cables, or internal links to a Coupling Facility within the same z processor.
- As the CPUs in the CF get busier, the response time increases - the classic performance problem.
- As the IO to the CF gets busier, the response time gets longer - again the classic performance problem.
- If there are multiple structures in the CF, Structure_A can be really busy while Structure_B is not busy. But because of the IO to Structure_A, the IO for Structure_B is impacted.
- There comes a point where it is more efficient to use Async rather than Sync requests, and so applications will see a jump in response times. You can use the z/OS performance reports for the CF to see performance information about CFs and structures.
- Duplexing tends to take longer - as there is a local CF and a remote CF, updates take longer because the operation needs to complete on both CFs before the update is complete.
- We have seen situations where it was more efficient to offload data to SMDS, because writing the small amount of remaining data to the structure was a Sync request - but writing the entire message to the structure used an Async request.
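The "classic performance problem" in the bullets above is the familiar queueing curve: response time grows non-linearly as utilization rises. A toy single-server (M/M/1) model, with invented numbers, shows the shape:

```python
# Illustrative only: average response time in a single-server queueing
# model is R = S / (1 - rho), where S is the service time and rho is
# the utilization. The 5-microsecond service time is an invented example.

def response_time_us(service_time_us: float, utilization: float) -> float:
    """M/M/1 approximation: average response time at a given utilization."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_us / (1.0 - utilization)

for rho in (0.30, 0.65, 0.90):
    print(f"{rho:.0%} busy -> {response_time_us(5.0, rho):.1f} us")
```

At 30% busy the response time is close to the raw service time; at 90% busy it is ten times worse, which is why a "slightly busier" neighbouring structure can noticeably slow your requests.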
A slight change to your environment - for example a CICS structure having slightly more activity - can make the MQ requests go from Sync to Async, and so take longer. This can change from moment to moment.
- With more data in a message, the IO takes longer. Around 32KB there may be a switch from Sync to Async. This depends on the environment.
- Commits have to go to every structure involved in the unit of work (UOW), so it is best to have all of a UOW's messages in one structure.
- Use unique (or few) msgid/correlid values. Avoid multiple (>100) messages with the same msgid or correlid.
- Persistent messages in syncpoint are good - but are persistent messages necessary?
- Use get with wait - do not poll the queue.
- Have the queue open on each LPAR to avoid the "first open on the LPAR" effect. If a queue manager does not have the queue open, an open request goes to the CF to get information about the queue and to ask to be notified of any changes to it. If the queue is already open in a queue manager, the queue manager already has this information.
- When a queue is closed, if it was the last instance on that queue manager, the queue manager goes to the CF to say it is no longer interested in the queue. Having a batch job open the queue and then sit there doing nothing will ensure the queue manager keeps the queue open.
- We ran a test and found that running 20 queues, one queue per structure, gave slightly higher throughput than 20 queues in one structure.
- Avoid deep queues. Backing up the structure causes messages to be read from the structure. The more data in the structure, the bigger the impact on application response time, due to increased IO time and CPU busy while the backup is in progress.
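The "get with wait" advice above can be sketched with Python's `queue.Queue` standing in for an MQ queue. This is not the MQI API - in C the equivalent would be MQGET with the MQGMO_WAIT option and a WaitInterval:

```python
# Sketch of "use get with wait, do not poll", using queue.Queue as a
# stand-in for an MQ queue (illustrative, not the MQI API).

import queue

q = queue.Queue()
q.put("order-1")

# Preferred: a blocking get with a wait interval. The consumer is
# parked by the runtime until a message arrives or the wait expires,
# costing no CPU while idle.
msg = q.get(timeout=5.0)
print(msg)  # -> order-1

# Anti-pattern (do not do this): calling get_nowait() in a loop with a
# sleep between attempts. Every empty poll costs CPU, and each message
# waits up to a full sleep interval before it is noticed.
```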
What can you do?
- Monitor the response times, for example using RMF to display the CF responses and the utilization every minute.
- Monitor the delayed requests in the CF report for the structures (basically, monitor the structures generally, as that includes the ratios and response times).
- Review the CF as a whole, and the structures in it, and move heavily used structures to a different CF
- Monitor the ratio of Sync to Async requests for your structures.
- Monitor the CF busy %. If it is above 65% then you may need to add more CPs.
- Use MQ accounting class(3) to see how many structures are being used.
- Use MQ accounting class(3) to see if messages are persistent or non persistent.
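As a trivial sketch of the ratio monitoring above: given Sync and Async request counts for a structure (for example taken from an RMF CF activity report - the function and its inputs are invented for illustration), the Async percentage is easy to track so a rising trend can be alerted on:

```python
# Hypothetical helper: compute what percentage of a structure's CF
# requests were converted to Async. A rising percentage suggests the
# CF is getting slower or busier. Counts are invented examples.

def async_percentage(sync_requests: int, async_requests: int) -> float:
    """Percentage of requests that went Async (0.0 when there was no traffic)."""
    total = sync_requests + async_requests
    return 0.0 if total == 0 else 100.0 * async_requests / total

print(f"{async_percentage(90_000, 10_000):.1f}% async")  # -> 10.0% async
```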
So is it all about speed?
No - speed and throughput are just part of the overall picture. You need to think about the business requirements, the cost, and problem scenarios.
My manager has a car which can do 150 miles per hour - but his aged mother with a hip problem cannot get in and out of the car - and so the car does not meet the "business requirements".
A customer running on distributed MQ was having throughput problems due to erratic disk response times. They made the decision to change from persistent messages to non persistent - to make MQ go faster! Many customers do use just non persistent messages, but their applications have logic to handle failures, such as a lost message - and this is designed in. You need to understand the business requirements and the application design to ensure that the applications can meet the business needs.
If you have two sites, then the site closest to the Coupling Facility may process most of the messages - because the CF is closer. If you switch over to the other site, it may be worth failing over to a CF on that site - because it will then be closer. Whether you have a CF at each site may come down to a business decision and the cost of providing two CFs.
You have to consider what-if scenarios. Your systems may be running fine with only a few messages on the queues. What if a channel fails to start, or an application stops processing the queue? Will the queue grow until it fills the structure, or is there a sensible MAXDEPTH? If you are using SMDS and allow 1 million messages on the queue, how will it perform until the queues are empty? Do you want to limit the queue to 1000 messages and not use SMDS?
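The SMDS what-if above comes down to simple arithmetic: how long does a deep backlog take to drain once consumers restart? All the rates here are invented examples - plug in your own measured numbers:

```python
# Back-of-envelope for the what-if scenario: time to empty a backlog
# while producers keep putting messages. Rates are made-up examples.

def drain_seconds(backlog_msgs: int, get_rate_per_sec: float,
                  put_rate_per_sec: float = 0.0) -> float:
    """Seconds to empty the queue while puts continue at a lower rate."""
    net_rate = get_rate_per_sec - put_rate_per_sec
    if net_rate <= 0:
        raise ValueError("queue never drains at these rates")
    return backlog_msgs / net_rate

# 1 million messages, getting at 5000/sec while puts continue at 3000/sec:
print(f"{drain_seconds(1_000_000, 5_000, 3_000) / 60:.0f} minutes")  # -> 8 minutes
```

If the drain time (and the performance of SMDS reads during it) is unacceptable, a tight MAXDEPTH that fails puts early may be the better business choice.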
So, as we often say, "it depends".