I put this on the ColinPaice z/OS blog ... and thought it worth sharing with a wider audience...
I was working with a customer who had an "MQ problem" which turned out to be too many virtual machines (VMs) running on the box, which caused a lack of CPU. They fixed this and "the MQ problem went away".
There is the well-known Maslow's hierarchy of needs, which says you need air before you think about safety, a sense of belonging, etc.
So here is Colin's hierarchy of needs: fix 1), then fix 2), etc.
1. CPU - is the image short of CPU?
2. Memory (real and virtual storage) - is there any paging?
3. IO - check the IO response time is good
4. Check network response time is good
5. Check subsystems - e.g. MQ, DB2 - are giving good response times
6. Check applications
If you have fixed a problem - start your checks from the top. For example fixing the IO problem allows much more work to flow - so there may now be a CPU problem.
If you have a performance problem, go through the list to see where the problem is. It may save you time before calling for help, as the support team may assume you have gone through the list.
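As a rough sketch of "work down the list, and go back to the top after every fix", here is one way it could look in Python. The layer names come from the list above; the function names, the `checks` and `fix` callables, and the simulated system in `demo()` are all made up for illustration and do not query a real machine.

```python
from typing import Callable, Dict, List, Optional

# Layers in the order given in the list above; the names are just labels.
HIERARCHY = ["CPU", "Memory", "IO", "Network", "Subsystem", "Application"]


def first_problem(checks: Dict[str, Callable[[], bool]]) -> Optional[str]:
    """Return the first layer, in hierarchy order, whose check reports unhealthy."""
    for layer in HIERARCHY:
        if not checks[layer]():
            return layer
    return None


def triage(checks: Dict[str, Callable[[], bool]],
           fix: Callable[[str], None],
           max_rounds: int = 20) -> List[str]:
    """Fix the first failing layer, then start the checks again from the top."""
    fixed = []
    for _ in range(max_rounds):
        layer = first_problem(checks)
        if layer is None:
            break  # every layer looks healthy
        fix(layer)
        fixed.append(layer)
    return fixed


def demo() -> List[str]:
    """Simulated system (hypothetical): fixing IO lets more work flow and exposes CPU."""
    healthy = {layer: True for layer in HIERARCHY}
    healthy["IO"] = False

    def fix(layer: str) -> None:
        healthy[layer] = True
        if layer == "IO":
            healthy["CPU"] = False  # more work now flows, so CPU becomes the constraint

    checks = {layer: (lambda name=layer: healthy[name]) for layer in HIERARCHY}
    return triage(checks, fix)


if __name__ == "__main__":
    print(demo())
```

In the simulated case the IO problem is found first, and the rerun from the top then finds the CPU problem that the IO fix uncovered.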
I showed this list to a colleague who said it is really obvious - but if it is so obvious, why do we have so many problems caused by it?
As I was writing this, I was asked about another 'MQ problem' which turned out to be CPU.
Here are some real examples of problems I have dealt with. Tick the ones you have experienced.
- "The MQ performance was so bad - I could not even log on to the machine to display the MQ error log" - this was a lack of CPU in the VM
- "It cannot be a CPU problem - there are 20 cores on this machine" - yes, but the VM is only configured to have one core. Defective End User
- "This server does not have a CPU problem" - yes - but half the messages are being routed (using MQ clustering) to that server which does have a CPU problem - problem between keyboard and chair.
- "On average the CPU is only 50% busy" - yes - that is because you have peak workload where you run out of CPU followed by long periods where nothing happens.
- Whoops, I made the MQ buffer pool so big that it caused paging.
- Throughput dropped at 8pm each evening - they did backups at 8pm - and the IO response time doubled - so commits took twice as long and transaction rate halved.
- MQ distributed performance was poor - someone had reconfigured the connection to the SAN - IO problem
- "You are running MQ on that system? That SAN is due to be replaced next month as it is old and overloaded. The reason why that machine was not being used is that it is so old and about to be scrapped - and you are running production MQ on it?" - lack of planning and communications
- MQ throughput between MQ on z/OS and Linux died every Saturday - backups were taken from all the distributed machines to z/OS, which swamped the network
- MQGETs are slow since we made the messages persistent - the messages were being processed out of syncpoint, so there was IO for every message.
- MQ throughput very low - because the application is doing a remote database insert over the network. The MQGET was very quick - the database update was not.
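The "on average the CPU is only 50% busy" example is easy to show with numbers. This is a minimal sketch with made-up samples (half the intervals flat out, half idle); the function name `cpu_summary` and the 95% threshold are my own assumptions, not something from a monitoring tool.

```python
from statistics import mean
from typing import Dict, List


def cpu_summary(samples: List[float], busy_threshold: float = 95.0) -> Dict[str, float]:
    """Summarise per-interval CPU busy percentages.

    Reports the average alongside the peak and the fraction of intervals
    at or above busy_threshold, because the average alone hides saturation.
    """
    saturated = [s for s in samples if s >= busy_threshold]
    return {
        "average": mean(samples),
        "peak": max(samples),
        "saturated_fraction": len(saturated) / len(samples),
    }


# Made-up data: 30 intervals at 100% busy, then 30 intervals with nothing happening.
samples = [100.0] * 30 + [0.0] * 30
summary = cpu_summary(samples)

if __name__ == "__main__":
    print(summary)
```

With this data the average is 50%, yet the machine was out of CPU for half of the intervals, which is when the work was trying to run.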
Please share any other experiences you have.