I go to customers and do various health checks. Here are some of the questions I ask when doing a High Availability review of MQ on z/OS. I hope the questions (and answers) are obvious; if not, please let me know.
High Availability (HA) is keeping the system up even if one component fails, for example by running in a sysplex: shut down one LPAR and work should be able to move seamlessly to another LPAR in the sysplex.
Disaster Recovery (DR) is when you have lost your primary system, for example through a major power outage or a flood, and you have to go to another site.
- Do you have enough log capacity to run for a day without archiving - in case you are unable to archive?
- Are your logs large? If archiving to disk they should be < 3GB for V710, and < 4GB for V800 and later. See "How difficult can it be to allocate a 3GB log?"
- Do you have enough logs? V900 increased the number of active logs from 31 to 310.
- Are your logs duplexed?
- Are they on different DASD subsystems?
- Do you use a different HLQ for each copy so they are in different user catalogs, e.g. MQM1.MQPA.LOGCOPY1.DS01 and MQM2.MQPA.LOGCOPY2.DS01?
- Do you monitor MQ log stats?
- MQ log stats report the number of I/Os that wrote 1 page per I/O and the number that wrote more than 1 page per I/O. If the count of I/Os with more than 1 page is much higher than the count with exactly 1 page, you may be reaching log capacity limits.
- Archiving to Tape
- Do you have enough tape drives in case you have to go to the DR site and process MQ, DB2 and HSM requests at the same time?
- Archiving to disk and migration
- How long do you keep archive logs on DASD before migration? More than 24 hours is good.
- What is your archive log expiry - when are they deleted? You need to keep the logs long enough to recover page sets and CF structures.
- How often are page sets backed up?
- Some people back up 2 or more times a day. Can you use the Snapshot capability of the DASD subsystem to take an instant copy of the page sets?
- Do you use data set backups rather than volume backups? If you use volume backups they need to be consistent and taken at the same time.
- Do you allow page sets to expand?
- Allowing expansion can handle peak workload, but then you have bigger page sets, which take longer to back up.
- How do you manage buffer pools, e.g. isolation of applications and queues?
- Do you keep user queues out of PSID(0) and PSID(1)?
- Do the alternate CFs and the CFs in the DR site have enough capacity for the structures that may move to them?
- Have you checked?
- Are you using
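The two log questions above (a day of capacity without archiving, and the 1-page versus multi-page I/O counts) come down to simple arithmetic. The following is a minimal sketch in Python, not an MQ tool: the function names and the sample figures are made up for illustration, and you would plug in the logging rate and I/O counts from your own MQ log statistics.

```python
def hours_of_log_capacity(num_active_logs, log_size_gb, log_rate_mb_per_sec):
    """Hours the active logs can absorb logging if nothing is archived.

    num_active_logs is the number of log data sets in ONE copy;
    duplexing (LOGCOPY1/LOGCOPY2) does not add capacity.
    """
    total_mb = num_active_logs * log_size_gb * 1024
    return total_mb / log_rate_mb_per_sec / 3600


def multi_page_io_ratio(single_page_ios, multi_page_ios):
    """Ratio of log I/Os with more than 1 page to log I/Os with exactly 1 page."""
    if single_page_ios == 0:
        return float("inf") if multi_page_ios else 0.0
    return multi_page_ios / single_page_ios


# Hypothetical figures for illustration only.
hours = hours_of_log_capacity(num_active_logs=30, log_size_gb=4,
                              log_rate_mb_per_sec=1.0)
print(f"Active logs cover about {hours:.1f} hours without archiving")

ratio = multi_page_io_ratio(single_page_ios=1000, multi_page_ios=5000)
print(f"Multi-page to single-page log I/O ratio: {ratio:.1f}")
```

If the first number is under 24 at your peak logging rate, you cannot run for a day without archiving. What counts as a "much higher" ratio is a judgment call, so track the trend over time rather than a single value.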
Backup and recovery of the CF structures
- Do you have one queue manager solely for structure backups? This reduces the recovery time if you have to read the logs, as there is less data to read from the logs.
- How often do you back up the CF structures? Some high-volume customers do it every half hour.
- Empty structures take little time to back up. Structures with many messages, or with large messages in SMDS, take longer to back up.
- Do you know how long a backup takes?
- Have you tested CF structure recovery?
- Remove default channel connection
Do you know when changes to business applications will impact your system? For example new function or increased volumes.
- Number of messages a second or increased message size. A large increase can impact:
- Amount of data logged per second - and not being able to keep 1 day's worth of data in the logs
- Increased IO activity to logs
- Increased CF or page set IO
- Buffer pools or structures filling up - and impact on response time
- Amount of CPU used to process work - lack of CPU can impact amount of work, and response time
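To see whether a planned increase in volume still lets you keep a day's worth of data in the logs, you can project the daily log volume from message rate and message size. This is a rough sketch under stated assumptions: the per-message logging overhead is a placeholder, not an MQ constant, and the function names and figures are hypothetical. Measure your real logged bytes per second where you can.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def logged_gb_per_day(msgs_per_sec, avg_msg_bytes, overhead_bytes_per_msg):
    """Estimated GB logged per day for persistent messages.

    overhead_bytes_per_msg is an assumed allowance for log record
    headers and related logging; replace it with a measured value.
    """
    bytes_per_sec = msgs_per_sec * (avg_msg_bytes + overhead_bytes_per_msg)
    return bytes_per_sec * SECONDS_PER_DAY / 1024**3


def fits_in_logs(gb_per_day, log_capacity_gb):
    """True if one day of logging fits in the active logs."""
    return gb_per_day <= log_capacity_gb


# Hypothetical figures: 30 active logs of 4GB each in one copy.
capacity_gb = 30 * 4
today = logged_gb_per_day(msgs_per_sec=400, avg_msg_bytes=2000,
                          overhead_bytes_per_msg=1000)
planned = logged_gb_per_day(msgs_per_sec=600, avg_msg_bytes=2000,
                            overhead_bytes_per_msg=1000)

for label, gb in (("today", today), ("planned", planned)):
    verdict = "fits" if fits_in_logs(gb, capacity_gb) else "does NOT fit"
    print(f"{label}: {gb:.0f} GB/day {verdict} in {capacity_gb} GB of active logs")
```

The same projection can be repeated with a larger average message size instead of a higher message rate.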
Is your work balanced - or do you get uneven systems?
- Can your work run on any LPAR?
- Due to CF configuration, MQ work may run on one LPAR in preference to another.
- Do you experience this?
- Is this a problem?
- Having balanced systems means less impact if one system has an outage.
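To put a number on why balance matters, the sketch below compares a balanced and an uneven sysplex when one LPAR is lost. It is illustrative only: the LPAR names and message rates are invented, and it assumes the failed LPAR's work is spread evenly over the survivors, which real workload routing will not do exactly.

```python
def load_after_outage(work_by_lpar, failed_lpar):
    """Per-LPAR load if the failed LPAR's work is shared evenly by the rest."""
    survivors = {k: v for k, v in work_by_lpar.items() if k != failed_lpar}
    if not survivors:
        raise ValueError("no surviving LPAR to take the work")
    extra = work_by_lpar[failed_lpar] / len(survivors)
    return {k: v + extra for k, v in survivors.items()}


def worst_case_increase(work_by_lpar):
    """Largest factor by which any surviving LPAR's load grows
    across all single-LPAR outages."""
    worst = 0.0
    for failed in work_by_lpar:
        after = load_after_outage(work_by_lpar, failed)
        for lpar, new_load in after.items():
            worst = max(worst, new_load / work_by_lpar[lpar])
    return worst


# Hypothetical messages per second on each LPAR.
balanced = {"LPARA": 1000, "LPARB": 1000, "LPARC": 1000}
uneven = {"LPARA": 2000, "LPARB": 500, "LPARC": 500}

print("balanced worst case: x", worst_case_increase(balanced))
print("uneven worst case:   x", worst_case_increase(uneven))
```

With the same total work, the uneven configuration asks the small LPARs to take a much larger jump in load when the busy LPAR fails.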