Why am I interested in Active-Active and disaster recovery?
There may be times when you want to move MQ on z/OS work from one site to another, and the sites are too far apart to be in the same sysplex, for example because your data centres are in two different cities. Should you use disaster recovery or an Active-Active environment?
- You have two sites that are too far apart for coupling facilities (CFs) to be shared between them.
- Data can be mirrored, so that DASD changes on one site get copied to a different site.
- DB2 changes can be mirrored so that two databases are kept in sync within a couple of seconds, or perhaps minutes. This can be done with the IBM product Q Replication (QREP).
Disaster recovery site.
- The systems at the DR site are not doing any useful work. They may be quiesced.
- For recovery at the DR site, the DASD may need to be reconfigured and the z/OS images IPLed.
- In effect the same QSG is used: it has the same name and the data is on the same logical disk, but on a different physical disk.
- For MQ, the logs and page sets are on disks which have been mirrored.
- When the queue manager starts, the CF structures are in an undefined state. They need to be recovered from an MQ backup of the CF structure. MQ will then process the active and perhaps archive logs (from all queue managers in the QSG) to recover persistent messages to the point of failure in the QSG.
- There will be a DNS change so channels get routed to the DR system.
- Clustering is not needed.
- It may take an hour or more to get this environment up and running.
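The CF structure recovery step in the list above could look something like the following. This is a sketch only: APP1 is a hypothetical application structure name, and it assumes BACKUP CFSTRUCT has been run regularly at the primary site so that a backup exists in the mirrored logs.

```mqsc
* Run regularly at the primary site so a recent backup exists
BACKUP CFSTRUCT(APP1)
* At the DR site, once the queue manager has started
RECOVER CFSTRUCT(APP1) TYPE(NORMAL)
* Check the state of the structure afterwards
DISPLAY CFSTATUS(APP1) TYPE(SUMMARY)
```

The more recent the backup, the less log data has to be read during RECOVER CFSTRUCT, which affects how long the DR site takes to come up.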
Active-Active.
Both sites are doing production work, but one site may be running personal banking and the other site running corporate banking. Read-only activities, such as inquiring on an account balance, can be done on either system.
The CICS systems can be started, but are doing no work.
There is a QSG on each system.
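We use clustering from queue managers outside of this environment to route messages to the appropriate system, for example using cluster ranking. With CLWLRANK, messages are routed to the primary site. If the channel to the primary site is stopped, messages are then routed to the backup site. Note that MQ evaluates rank before it looks at channel status, whereas CLWLPRTY chooses among the channels that are available, so test that stopping the channel gives the failover you expect in your configuration.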
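As an illustration of the ranking described above, the cluster receiver channels at the two sites could be given different ranks. The channel names BANK.TO.QMA and BANK.TO.QMB are hypothetical, and the rank values are arbitrary choices within the allowed range of 0 to 9.

```mqsc
* On the queue manager at the primary site (A): highest rank
ALTER CHANNEL(BANK.TO.QMA) CHLTYPE(CLUSRCVR) CLWLRANK(9)
* On the queue manager at the backup site (B): lower rank
ALTER CHANNEL(BANK.TO.QMB) CHLTYPE(CLUSRCVR) CLWLRANK(1)
```

The same pattern works with CLWLPRTY in place of CLWLRANK if you want the workload algorithm to take channel availability into account when choosing a destination.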
MQ outside the environment
There are two scenarios:
1) Message sequence is important.
2) Message sequence is not important.
Message sequence is not important
Steps to switch
- Stop the cluster channel to the active queue manager (A); messages then flow to the other system (B).
- On qmgr A, run a batch job which gets from the specified queues and puts to the same queue name @B. The mover then moves these messages.
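The first step above might be driven with commands like these. This is a sketch under assumptions: BANK.TO.QMA is a hypothetical cluster receiver channel name on queue manager A, and APP.REQUEST is a hypothetical application queue.

```mqsc
* Quiesce the cluster channel that brings work into A
STOP CHANNEL(BANK.TO.QMA) MODE(QUIESCE)
* Confirm that no instances are still running
DISPLAY CHSTATUS(BANK.TO.QMA) STATUS
* See how many messages the batch job still has to move
DISPLAY QLOCAL(APP.REQUEST) CURDEPTH
```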
Message sequence is important
Steps to switch
- Stop both cluster channels
- On qmgr A, run a batch job which gets from the specified queues and puts to the same queue name @B. The mover then moves these messages.
- Once all the messages have been moved, start the channel to B
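When sequence matters, the key point in the steps above is not to restart the flow to B until A has been drained. A hedged sketch of the checks, using the hypothetical names BANK.TO.QMA, BANK.TO.QMB and APP.REQUEST:

```mqsc
* Stop the channels that carry new work to A and to B
STOP CHANNEL(BANK.TO.QMA) MODE(QUIESCE)
STOP CHANNEL(BANK.TO.QMB) MODE(QUIESCE)
* On A, after the batch job ends: expect CURDEPTH(0)
* and no uncommitted messages
DISPLAY QSTATUS(APP.REQUEST) TYPE(QUEUE) CURDEPTH UNCOM
* Only then let new work flow to B
START CHANNEL(BANK.TO.QMB)
```

Which queue manager each command is issued on depends on how your channels are defined, so treat the placement as an assumption to check against your own configuration.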
It should take tens of minutes to do the switch.
Using the same channel name on different queue managers
It is not good practice to define non-clustered channels with the same name on different queue managers and then use the RESET CHANNEL command so that the channel can start to a different queue manager, because there is a risk of losing a message or getting duplicate messages.
Somewhere the DNS will be changed to route traffic to B
Stop the client channels on A, so that they end cleanly and you do not end up with any in-doubt units of work on the channel.
The client reconnects and is now connected to B.
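Stopping the client channels cleanly could look like this. BANK.CORP.SVRCONN is a hypothetical server-connection channel name.

```mqsc
* On A: let client conversations end cleanly
STOP CHANNEL(BANK.CORP.SVRCONN) MODE(QUIESCE)
* Check whether any client instances are still running
DISPLAY CHSTATUS(BANK.CORP.SVRCONN) STATUS
```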
You have clients doing personal banking or corporate banking.
We suggest each business application has a unique port so that each can be switched independently, for example 18.104.22.168(2000) is for corporate banking, and 22.214.171.124(2001) is for personal banking. You may want to move personal banking to system B and corporate banking to system C. You just change the DNS for 126.96.36.199(2000), and personal banking is not affected.
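On z/OS, a port per business application means one listener per port on each queue manager that can host the work. A minimal sketch, assuming TCP/IP listeners and the port numbers used in the example above:

```mqsc
* Port 2000: corporate banking clients
START LISTENER TRPTYPE(TCP) PORT(2000)
* Port 2001: personal banking clients
START LISTENER TRPTYPE(TCP) PORT(2001)
* Show the channel initiator and its listeners
DISPLAY CHINIT
```

Having both listeners started on every candidate system in advance means that a switch only needs the DNS change.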
Work running within the environment
The sections above described moving messages from one QSG to another by having an application get the messages and put them to a remote queue (QREMOTE), so that they are sent to the other system.
While this is happening, you may not want the applications to be processing the messages.
There are a couple of approaches.
- Stop applications putting messages to the queues, for example by stopping receiver-type channels.
- Set the application queues to disable gets, so that the CICS transactions processing messages end normally.
- Once the transactions have stopped, enable gets again and run the program that moves the messages to the QREMOTE, so that they are sent to the remote system.
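The three steps above could be sketched as follows. All object names (EXT.TO.QMA, APP.REQUEST, APP.REQUEST.TO.B, QMB) are hypothetical, and the sketch assumes a private queue; for a shared queue you would add QSGDISP(SHARED) to the ALTER QLOCAL commands.

```mqsc
* Stop new work arriving on A
STOP CHANNEL(EXT.TO.QMA) MODE(QUIESCE)
* Ask the getting transactions to end
ALTER QLOCAL(APP.REQUEST) GET(DISABLED)
* Wait for IPPROCS(0): nothing has the queue open for input
DISPLAY QSTATUS(APP.REQUEST) TYPE(QUEUE) IPPROCS CURDEPTH
* Let the mover program read the queue
ALTER QLOCAL(APP.REQUEST) GET(ENABLED)
* Remote queue that the mover program puts to
DEFINE QREMOTE(APP.REQUEST.TO.B) RNAME(APP.REQUEST) +
       RQMNAME(QMB)
```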
More robust solution
- You disable puts for application queues, and the applications should then end cleanly
- Your applications may not be well written, or may have recovery built into them, so that if the queue is disabled for gets, the application does not end but retries after a short period.
- In this case disabling the queue for gets and re-enabling it will not work, because the application starts getting again as soon as gets are enabled. Instead, you can use two definitions for the queue.
- Define a QALIAS that points to the base queue, and have the applications open the alias. Disable gets on the alias to hold the applications off; the program which gets messages and puts them to the remote queue opens the base queue directly, where gets are still enabled. The reverse arrangement does not work: an MQGET through an alias needs gets to be enabled on both the alias and the base queue.
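A sketch of the two-definition setup follows. An MQGET through an alias succeeds only if gets are allowed on both the alias and its base queue, so here the applications open the alias and the mover program opens the base queue. The names APP.REQUEST and APP.REQUEST.BASE are hypothetical.

```mqsc
* Applications open the alias, not the base queue
DEFINE QALIAS(APP.REQUEST) TARGET(APP.REQUEST.BASE)
* During a switch, inhibit gets on the alias only
ALTER QALIAS(APP.REQUEST) GET(DISABLED)
* The mover program opens APP.REQUEST.BASE directly;
* check that gets are still enabled there
DISPLAY QLOCAL(APP.REQUEST.BASE) GET CURDEPTH
```

Applications that retry will keep failing against the alias while the mover program drains the base queue, and they resume once the alias is set back to GET(ENABLED).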
Are my definitions in sync?
You may have made configuration changes to your queue managers. You need to ensure these are copied to the other QSG.
You can use CSQUTIL to unload the definitions on system A using MAKEREP. This generates statements that can be run on system B to replace QMGR B's definitions with those from system A.
However, someone may have made changes on system B and not on system A, and these would be overwritten.
On system B you can use DISPLAY Q(*) ALTDATE ALTTIME where(ALTDATE,GT,'2013-09-24') to see whether any changes were made after a particular date, and DISPLAY Q(*) ALTTIME where(ALTDATE,EQ,'2013-09-23') to see any changes which were made on the given date. The alter time should be close to the time you last replaced the objects.
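Queues are not the only objects that can drift. The same filter can be applied to other object types that carry ALTDATE and ALTTIME. This sketch reuses the example date from above and covers only a few object types.

```mqsc
* Channels altered after the last synchronisation
DISPLAY CHANNEL(*) ALTDATE ALTTIME +
        WHERE(ALTDATE,GT,'2013-09-24')
* Process definitions altered after that date
DISPLAY PROCESS(*) ALTDATE ALTTIME +
        WHERE(ALTDATE,GT,'2013-09-24')
* Namelists altered after that date
DISPLAY NAMELIST(*) ALTDATE ALTTIME +
        WHERE(ALTDATE,GT,'2013-09-24')
```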
If you identify any differences, you need to resolve them either before or after the switch, depending on their priority.