I had a question about how to do Disaster Recovery (DR) and what a customer needs to look out for. Of course we expect it to work perfectly, but there are some things you need to be aware of.
The scenario is one where you have a production site and a Disaster Recovery site, and the DASD is mirrored asynchronously to the DR site.
What is asynchronous mirroring and how does it impact me?
When an application commits, persistent updates are written to the log data set. Only when the data has been successfully written to disk does the commit request return to the application.
If the DASD is synchronously mirrored to a remote site, then when your application commits, the commit has been done at both sites. With asynchronous mirroring, the local site does the commit and the request is then sent to the remote site. If there is a failure (for example, a broken connection) the latest I/O requests may not get to the remote site.
People tend to use synchronous mirroring within about 20 km, as the response time is acceptable. If the remote site is 1000 km away, the time to do a synchronous copy will be too long, so you need to use asynchronous mirroring.
Your DASD is configured with consistency groups. For DASD volumes in the same consistency group, the order of the I/Os is maintained. For example, if there are three I/Os, Request1, Request2, and Request3, and Request1 gets to the remote site but Request2 fails to get there, then Request3 will not be done.
DASD and data sets
You may want to archive to DASD, and have HSM migrate the archive log data sets after 24 hours. This means your archives are on disk and you do not have the added complication of ensuring your tapes are consistent with the DASD.
You need to work with your storage manager to make sure that the data sets for a queue manager are in the same consistency group. These are the BSDS, active logs, page sets, and data sets used in the JCL, such as the STEPLIB and CSQINP* data sets. You may also need your archive logs in the same consistency group if you need to recover CF structures.
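As an illustration, for a queue manager CSQ1 with a hypothetical high-level qualifier MQM, and following the naming used in the IBM installation samples, the consistency group would contain data sets such as:

    MQM.CSQ1.BSDS01, MQM.CSQ1.BSDS02           (bootstrap data sets)
    MQM.CSQ1.LOGCOPY1.DS01, ...                (active log data sets)
    MQM.CSQ1.LOGCOPY2.DS01, ...
    MQM.CSQ1.PSID00, MQM.CSQ1.PSID01, ...      (page sets)

plus the STEPLIB and CSQINP* libraries, and the archive log data sets if you need them for CF structure recovery.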
Shared queues use Db2 tables, so these tables need to be available before you start a queue manager.
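As a quick check, assuming the Db2 data sharing group member at the DR site has the command prefix -DB2A (a hypothetical name), you can verify that Db2 is up and the data sharing group is intact before starting the queue manager:

    -DB2A DISPLAY GROUP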
We have seen a couple of situations where the MQ definitions were not consistent with the z/OS XCF definitions at the DR site. You should use CSQ5PQSG to list the members of the queue sharing group (QSG) and check that all members are present. If not, use CSQ5PQSG to add the missing queue manager names.
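For example, a minimal job to verify the QSG setup, assuming a QSG called QSG1, a Db2 data sharing group DSG1 and a Db2 subsystem DB2A (all hypothetical names; change the STEPLIB to your installation's libraries):

    //VERIFY   EXEC PGM=CSQ5PQSG,REGION=4M,
    //         PARM='VERIFY QSG,QSG1,DSG1,DB2A'
    //STEPLIB  DD DISP=SHR,DSN=thlqual.SCSQANLE
    //         DD DISP=SHR,DSN=thlqual.SCSQAUTH
    //         DD DISP=SHR,DSN=db2qual.SDSNLOAD
    //SYSPRINT DD SYSOUT=*

If a queue manager (say CSQ1) is missing, a step with PARM='ADD QMGR,CSQ1,QSG1,DSG1,DB2A' adds it.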
Before you start a queue manager you should force the MQ structures out of the coupling facility to clear them, which will cause MQ to recognize that they require rebuild. Use CFSTRUCT RECAUTO(YES) to cause them to be rebuilt automatically. If you fail to do this, then the CF and the MQ logs will be inconsistent.
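As a sketch, using the hypothetical QSG name QSG1 and application structure APPL1 (the structure name in the CF is the QSG name followed by the MQ structure name), the z/OS commands to force the structures are:

    SETXCF FORCE,STRUCTURE,STRNAME=QSG1CSQ_ADMIN
    SETXCF FORCE,STRUCTURE,STRNAME=QSG1APPL1

and the MQSC command to have the application structure recovered automatically (this requires CFLEVEL(5); the administration structure is always rebuilt by the queue managers) is:

    ALTER CFSTRUCT(APPL1) RECAUTO(YES)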
When the first queue manager starts, it will detect that the CF structures are in a failed state and will rebuild them. This involves reading the logs to find the previous CF structure backup, and replaying updates from the logs of all the queue managers in the QSG.
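You can watch the progress, and recover any application structure that is not defined with RECAUTO(YES), using MQSC commands such as these (APPL1 is the hypothetical structure name from above):

    DISPLAY CFSTATUS(*) TYPE(SUMMARY)
    RECOVER CFSTRUCT(APPL1)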
You do not need to recover the SMDS data sets, as they are rebuilt as the structures are rebuilt.
Before you do a DR, you need to check that the CF structure sizes match those at your primary site, so that you have enough capacity to hold all of the messages and control information, and that the CF has sufficient CPU capacity to run the workload.
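One way to check is the z/OS display command, again with the hypothetical structure name; compare the reported structure sizes at the two sites:

    D XCF,STRUCTURE,STRNAME=QSG1APPL1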
The mover and clustering
As part of the process of moving to DR there will be network changes to route IP traffic to the DR site, so you should not have to change the CONNAME for your channels.
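It is worth checking in advance which addresses your channels actually use, for example with:

    DISPLAY CHANNEL(*) CHLTYPE(SDR) CONNAME XMITQ
    DISPLAY CHANNEL(*) CHLTYPE(CLUSSDR) CONNAME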
You should make sure your DNS is considered in any DR planning. After the move to DR, the DNS server may be thousands of kilometres away, and so have longer response times.
You may not want to start any channels during DR testing, for the following reasons:
1. It is possible to get duplicate messages. The production site sent the messages, but the commit did not get to the DR site. At the DR site the channel sees the messages on the transmission queue and sends them again.
2. A requester channel may cause a channel to be started back to your DR queue manager, and any messages flowing into the DR queue manager may be lost (ignored).
3. If the DR queue manager connects to a full repository, then updates about the cluster may flow to the DR queue manager and not to the production queue manager. This can cause inconsistencies.
If the IP address of your DR site is different from that of your production site, you can consider using CHLAUTH rules to stop channels with the 'wrong' IP address from connecting to your distributed queue managers or cluster full repositories.
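For example, on a full repository, a rule like the following blocks inbound channels from a hypothetical DR address range of 10.99.*:

    SET CHLAUTH(*) TYPE(ADDRESSMAP) ADDRESS('10.99.*') USERSRC(NOACCESS) ACTION(ADD)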
Moving back after DR
You need to have a plan to move the system back from the DR site; this is often overlooked.