We were testing the recovery of an MQ CF structure and trying to understand what happens during recovery. The text below may help clarify what goes on.
The logic of recovering a CF structure from MQ logs is described below.
Conceptually, the data from the previously backed up CF structure is read from the MQ log, then the log is read forward from the backup and any changes are reapplied to the restored structure.
In practice it works as follows.
- The log range to use is found from the latest backup of each structure to be recovered, to the current time. The log range is identified by LRSN values (an LRSN is the 6 most significant bytes of a 'store clock value'). Note that the whole log (back to the time the structure was created) will be read if you have not done a backup of the structure.
- The logs from each queue manager in the QSG are read for records in this LRSN range.
- The logs are read backward.
- A list of changes to each structure to be recovered is built.
- Data from the CF structure backup is read and the data is restored. If the backup was done on QMA, and the recovery is running on QMB, then QMB will read QMA's logs to restore the structure.
- When the start of the backup of the CF structure is read, an internal task is started to take the restored data for the structure and merge it with the changes read from the log.
- Processing continues for each structure being restored.
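The steps above can be pictured with a toy model. This is only a sketch of the idea, not MQ's internal implementation: the names LogRecord, collect_changes and merge_backup, and the "put"/"get" operation names, are hypothetical, and each log is assumed to be a list of records in ascending LRSN order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    lrsn: int            # orders records across all queue managers in the QSG
    structure: str       # CF structure the change applies to
    op: str              # "put" or "get" (hypothetical operation names)
    msg_id: str
    payload: str = ""


def collect_changes(logs, structures, start_lrsn, end_lrsn):
    """Read each queue manager's log backward and keep, per structure,
    the newest change seen for each message."""
    changes = {name: {} for name in structures}
    for records in logs.values():
        for rec in reversed(records):        # newest record first
            if rec.lrsn > end_lrsn:
                continue                     # after the range: skip
            if rec.lrsn < start_lrsn:
                break                        # reached the backup: stop this log
            if rec.structure not in changes:
                continue                     # structure is not being recovered
            seen = changes[rec.structure]
            if rec.msg_id not in seen or rec.lrsn > seen[rec.msg_id].lrsn:
                seen[rec.msg_id] = rec
    return changes


def merge_backup(backup, structure_changes):
    """Apply the collected changes on top of the restored backup data."""
    restored = dict(backup)
    for msg_id, rec in structure_changes.items():
        if rec.op == "put":
            restored[msg_id] = rec.payload
        else:
            restored.pop(msg_id, None)
    return restored


# Made-up data: the backup was taken at LRSN 15 and held one message.
logs = {
    "QMA": [LogRecord(10, "APP3", "put", "m1", "a"),
            LogRecord(20, "APP3", "put", "m2", "b"),
            LogRecord(40, "APP3", "get", "m2")],
    "QMB": [LogRecord(5, "APP3", "put", "m0", "old"),
            LogRecord(30, "APP3", "get", "m1"),
            LogRecord(35, "APP3", "put", "m3", "c")],
}
changes = collect_changes(logs, ["APP3"], start_lrsn=15, end_lrsn=50)
restored = merge_backup({"m1": "a"}, changes["APP3"])
print(restored)
```

Because the logs are read newest first, one pass over each log can collect the changes for every structure being recovered, which matches the single log pass described above.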
In the example below the command RECOVER CFSTRUCT(APP3) was issued, and the following messages were produced.
04:00:00 CSQE132I CDL2 CSQERRPB Structure recovery started, using log range from LRSN=CC56D01026CC to LRSN=CC56DC368924
This is the start of reading the logs backwards from each qmgr in the QSG, from the time of failure back to the structure backup. The LRSN values give the range being used. Log records for all structures being recovered (just one structure in this example) are processed at the same time.
04:02:00 CSQE133I CDL2 CSQERPLS Structure recovery reading log backwards, LRSN=CC56D0414372
This message is produced periodically to show progress. You can use it to check how far the recovery has got, and to estimate the time it will take. If the value does not change, the recovery may be waiting for an archive log from tape, which is slow.
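A rough way to turn the CSQE133I value into a progress figure is to see how far the current LRSN has moved from the end of the range back towards the start. This is a sketch only: the function names are made up, and it assumes log records are spread evenly over time, which is rarely true, so treat the result as a guide.

```python
def backward_read_progress(start_hex, end_hex, current_hex):
    """Fraction of the LRSN range already covered when reading backward."""
    start, end, current = (int(v, 16) for v in (start_hex, end_hex, current_hex))
    if not start <= current <= end:
        raise ValueError("current LRSN is outside the recovery range")
    if end == start:
        return 1.0
    return (end - current) / (end - start)


def estimate_remaining_seconds(fraction_done, elapsed_seconds):
    """Linear extrapolation; returns None until some progress has been made."""
    if fraction_done <= 0:
        return None
    return elapsed_seconds * (1 - fraction_done) / fraction_done


# Values from the CSQE132I and CSQE133I messages above, 120 seconds apart.
done = backward_read_progress("CC56D01026CC", "CC56DC368924", "CC56D0414372")
remaining = estimate_remaining_seconds(done, elapsed_seconds=120)
print(f"{done:.1%} of the range read, about {remaining:.0f} seconds to go")
```

In the example the log read actually finished 22 seconds after this message, noticeably longer than the linear estimate, which shows how rough the figure is.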
04:02:22 CSQE134I CDL2 CSQERRPB Structure recovery reading log completed
The above process of reading the logs backwards has finished.
04:02:22 CSQE130I CDL2 CSQERCF2 Recovery of structure APP3 started, using CDL1 log range from RBA=000EE86D902E to RBA=000EF5E8E4DC
The task to process the data for APP3 has been started. The last backup of CF structure APP3 was done on CDL1 within the given RBA range, so this log range has to be read.
04:02:29 CSQE131I CDL2 CSQERCF2 Recovery of structure APP3 completed
The data merge has completed. The structure is recovered.
Most of the time in RECOVER CFSTRUCT is spent reading the logs from the time of failure back to when the backup was done. You can minimise this time by backing up the structure frequently. Some customers do it every half an hour, so there is at most half an hour's worth of log data to process, and this data should still be available in active logs.
If your structures are usually empty then the amount of data backed up will be small, and so the recovery time will be small. If you have lots of messages for an extended period, then the messages have to be copied to the logs, and this could be many MB. It took 6 seconds to back up 29MB of data in the structure APP3, and about 7 seconds to recover it.
If you have to recover more than one structure, it is faster to recover them all at the same time rather than one after the other. Doing them all at the same time means the log is processed just once.
In a different test using SMDS, with a different structure with more messages on it, it took 44 seconds to back up 1067 MB.
When this structure was recovered it took 150 seconds to process the data.
The messages were recreated in SMDS, resulting in I/O to SMDS.
The SMDS space map had to be rebuilt. This took 51 seconds for 10,000 messages.
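The timings above can be turned into rough rates for comparison. This is simple arithmetic on the figures quoted in this post. The recovery rates assume the amount of data recovered equals the amount backed up, which is an assumption rather than something that was measured.

```python
def mb_per_second(megabytes, seconds):
    return megabytes / seconds


# (MB, seconds) pairs taken from the tests described above.
measurements = {
    "APP3 backup": (29, 6),
    "APP3 recovery": (29, 7),          # assumes 29 MB was recovered
    "SMDS backup": (1067, 44),
    "SMDS recovery": (1067, 150),      # assumes 1067 MB was recovered
}
rates = {name: mb_per_second(mb, secs) for name, (mb, secs) in measurements.items()}

# Space map rebuild: 51 seconds for 10,000 messages.
ms_per_message = 51 / 10_000 * 1000

for name, rate in rates.items():
    print(f"{name}: {rate:.1f} MB/s")
print(f"space map rebuild: {ms_per_message:.1f} ms per message")
```

These rates come from single runs on one system, so they are only an indication of what to expect elsewhere.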