Tips for maximizing data availability during backup and recovery

You need to develop a plan for backup and recovery. Then, you need to become familiar enough with that plan so that when an outage occurs, you can get back in operation as quickly as possible.

Consider the following factors when you develop and implement your plan:

Decide on the level of availability you need

Start by determining the primary types of outages you are likely to experience. Then, for each of those types of outages, decide on the maximum amount of time that you can spend on recovery. Consider the trade-off between cost and availability. Recovery plans for continuous availability are very costly, so you need to think about what percentage of the time your systems really need to be available.

The availability of data is affected by the availability of related objects. For example, if one object in a related set has an availability issue, the availability of the others can be affected as well. The related object set includes base table spaces and indexes, objects related by referential constraints, LOB table spaces and indexes, and XML table spaces and indexes.

Practice for recovery

You cannot know whether a backup and recovery plan is workable unless you practice it. In addition, the pressure of a recovery situation can cause mistakes. The best way to minimize mistakes is to practice your recovery scenario until you know it well. The best time to practice is outside of regular working hours, when fewer key applications are running.

Minimize preventable outages

One aspect of your backup and recovery plan should be eliminating the need to recover whenever possible. One way to do that is to prevent outages caused by errors in Db2. Be sure to check available maintenance often, and apply fixes for problems that are likely to cause outages.

Determine the required backup frequency

Use your recovery criteria to decide how often to make copies of your databases.

For example, suppose that you use image copies, the maximum acceptable recovery time after you lose a volume of data is two hours, your volumes typically hold about 4 GB of data, and you can read about 2 GB of data per hour. In that case, you should make copies after every 4 GB of data that is written. You can use the COPY option SHRLEVEL CHANGE or DFSMSdss concurrent copy to make copies while transactions and batch jobs are running. You should also make a copy after running jobs that make large numbers of changes. In addition to copying your table spaces, consider copying your indexes.
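
The arithmetic behind this rule of thumb is simple enough to script. The following sketch is purely illustrative (the function name and figures are hypothetical, not part of any Db2 interface); it computes how much written data a recovery time objective allows between copies:

    # Illustrative only: estimate how much data can be written between
    # image copies so that re-reading it during recovery still fits the
    # recovery time objective. Names and numbers are hypothetical.

    def copy_interval_gb(recovery_objective_hours: float,
                         read_rate_gb_per_hour: float) -> float:
        """Data written between copies must be readable within the objective."""
        return recovery_objective_hours * read_rate_gb_per_hour

    # The example from the text: a 2-hour objective at 2 GB per hour
    # means taking a copy after roughly every 4 GB of written data.
    print(copy_interval_gb(2.0, 2.0))  # -> 4.0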

You can take system-level backups using the BACKUP SYSTEM utility. Because the FlashCopy® technology is used, the entire system is backed up very quickly with virtually no data unavailability.

You can make additional backup image copies from a primary image copy by using the COPYTOCOPY utility. This capability is especially useful when the backup image is copied to a remote site that serves as a disaster recovery site for the local site. Applications can run concurrently with the COPYTOCOPY utility. The only utilities that cannot run concurrently with COPYTOCOPY are those that write to the SYSCOPY catalog table.

Estimate recovery time using redirected recovery

If a recovery becomes necessary, an accurate estimate of the recovery time is important. You can use the estimate to validate your recovery time objective.

Use the following formulas to calculate the recovery time estimate (RTE) from the job output of a redirected recovery. Subtract the elapsed time (ET) of the identified phases from the total RECOVER utility elapsed time to exclude processing that is not performed during a real recovery:

  • RTE for point-in-time recovery = Total ET – TRANSLAT ET
  • RTE for recovery to the current state = Total ET – LOGCSR ET – LOGUNDO ET – TRANSLAT ET

The following RECOVER utility messages report the values for these variables:

  • DSNU500I reports Total ET
  • DSNU1565I reports TRANSLAT ET
  • DSNU1552I reports LOGCSR ET
  • DSNU1557I reports LOGUNDO ET

Note: The LOGUNDO phase of redirected recovery does not write compensation log records when it backs out uncommitted work, but the LOGUNDO phase of a real recovery does. This difference does not affect the RTE calculation for recovery to the current state, because the elapsed time of the LOGUNDO phase is subtracted in that formula. It can affect the RTE calculation for a point-in-time recovery, because the elapsed time of the LOGUNDO phase might differ between redirected recovery and real recovery.
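
As a minimal sketch of the two formulas above, the following illustration applies them to elapsed times that you have already extracted from the DSNU500I, DSNU1565I, DSNU1552I, and DSNU1557I messages. The parsing of the job output itself is not shown, and the sample values are hypothetical:

    # Illustrative only: apply the RTE formulas to phase elapsed times
    # (in seconds) taken from the RECOVER utility messages listed above.

    def rte_point_in_time(total_et: float, translat_et: float) -> float:
        """RTE for point-in-time recovery = Total ET - TRANSLAT ET."""
        return total_et - translat_et

    def rte_current_state(total_et: float, logcsr_et: float,
                          logundo_et: float, translat_et: float) -> float:
        """RTE to current state = Total ET - LOGCSR ET - LOGUNDO ET - TRANSLAT ET."""
        return total_et - logcsr_et - logundo_et - translat_et

    # Hypothetical values for illustration only:
    print(rte_point_in_time(1800, 120))            # -> 1680 seconds
    print(rte_current_state(1800, 300, 240, 120))  # -> 1140 seconds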

Minimize the elapsed time of RECOVER jobs

When the RECOVER utility restores system-level backups from disk, it restores the data sets serially in the main task. When it restores system-level backups from tape, it creates multiple subtasks to restore the image copies and system-level backups for the objects.

If you are using system-level backups, be sure to have recent system-level backups on disk to reduce the recovery time.

Restoring FlashCopy image copies is very fast. However, creating a FlashCopy image copy with consistency (FLASHCOPY CONSISTENT) uses more system resources and might take longer than creating an image copy with FLASHCOPY YES, because backing out uncommitted work requires reading the logs and updating the image copy.

For point-in-time recoveries, recovering to quiesce points and SHRLEVEL REFERENCE copies can be faster than recovering to other points in time.

If you are recovering to a non-quiesce point, the following factors can have an impact on performance:

  • The duration of units of recovery (URs) that were active at the point of recovery.
  • The number of Db2 members that have active URs to be rolled back.

Minimize the elapsed time for COPY jobs

You can use the COPY utility to make image copies of a list of objects in parallel. Image copies can be made to either disk or tape.

Also, you can take FlashCopy image copies with the COPY utility. FlashCopy can reduce both the unavailability of data during the copy operation and the amount of time that is required for recovery operations.

Determine the right characteristics for your logs

Consider the following criteria when determining the right characteristics for your logs:

  • If you have enough disk space, use more and larger active logs. Recovery from active logs is quicker than from archive logs.
  • To speed recovery from archive logs, consider archiving to disk.
  • If you archive to tape, be sure that you have enough tape drives so that Db2 does not have to wait for an available drive on which to mount an archive tape during recovery.
  • Make the buffer pools and the log buffers large enough to be efficient.

Minimize Db2 restart time

Many recovery processes involve restart of Db2. You need to minimize the time that Db2 shutdown and startup take.

You can limit the backout activity during Db2 system restart. You can postpone the backout of long-running units of recovery until after the Db2 system is operational. Use the installation options LIMIT BACKOUT and BACKOUT DURATION to determine what backout work will be delayed during restart processing.

The following major factors influence the speed of Db2 shutdown:

  • Number of open Db2 data sets

    During shutdown, Db2 must close and deallocate all data sets that it uses if the fast shutdown feature has been disabled. The default is to use the fast shutdown feature. Contact IBM® Support for information about enabling and disabling the fast shutdown feature. The maximum number of concurrently open data sets is determined by the Db2 subsystem parameter DSMAX. Closing and deallocating data sets generally takes 0.1 to 0.3 seconds per data set, so closing 10,000 data sets, for example, can add roughly 1,000 to 3,000 seconds (about 17 to 50 minutes) to shutdown.

    Be aware that z/OS® global resource serialization (GRS) can increase the time to close Db2 data sets. If your Db2 data sets are not shared by more than one z/OS system, set the GRS RESMIL parameter value to OFF, or place the Db2 data sets in the SYSTEMS exclusion RNL.

  • Active threads

    Db2 cannot shut down until all threads have terminated. Issue the Db2 command DISPLAY THREAD (for example, -DISPLAY THREAD(*)) to determine whether any threads are active while Db2 is shutting down. If possible, cancel those threads by using the CANCEL THREAD command.

  • Processing of SMF data

    At Db2 shutdown, z/OS does SMF processing for all Db2 data sets that were opened since Db2 startup. You can reduce the time that this processing takes by setting the z/OS parameter DDCONS(NO).

The following major factors influence the speed of Db2 startup:

  • Db2 checkpoint interval

    The Db2 checkpoint interval indicates the number of log records that Db2 writes between successive checkpoints. You can specify the checkpoint frequency in log records, minutes, or both. For more information, see CHECKPOINT TYPE field (CHKTYPE subsystem parameter).

    You can use the LOGLOAD option, the CHKTIME option, or a combination of both options of the SET LOG command (for example, -SET LOG CHKTIME(5)) to modify the CHKFREQ value dynamically without recycling Db2. The value that you specify depends on your restart requirements. The default value for the CHKTIME option is 5 minutes.

  • Long-running units of work

    Db2 rolls back uncommitted work during startup. The amount of time for this activity is approximately double the time that the unit of work was running before Db2 shut down. For example, if a unit of work runs for two hours before a Db2 abend, it takes at least four hours to restart Db2. Decide how long you can afford for startup, and avoid units of work that run for more than half that long.

    You can use accounting traces to detect long-running units of work. For tasks that modify tables, divide the elapsed time by the number of commit operations to get the average time between commit operations. Add commit operations to applications for which this time is unacceptable; a sketch of this check appears at the end of this topic.

    Recommendation: To detect long-running units of recovery, enable the UR CHECK FREQ option of installation panel DSNTIPL. If long-running units of recovery are unavoidable, consider enabling the LIMIT BACKOUT option on installation panel DSNTIPL.

  • Size of active logs

    If you archive to tape, you can avoid unnecessary startup delays by making each active log big enough to hold the log records for a typical unit of work. This lessens the probability that Db2 will need to wait for tape mounts during startup.
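
As referenced above, the following sketch shows the commit-interval check for detecting long-running units of work. The field names and sample values are illustrative assumptions, not a real accounting-trace record layout:

    # Illustrative only: flag tasks whose average time between commits
    # exceeds half of the startup time you can afford, per the guidance
    # above. Field names and values are hypothetical.

    def avg_commit_interval(elapsed_seconds: float, commit_count: int) -> float:
        """Average time between commit operations."""
        return elapsed_seconds / max(commit_count, 1)

    def flag_long_running(tasks, max_startup_seconds: float):
        budget = max_startup_seconds / 2  # avoid URs that run longer than this
        return [t["name"] for t in tasks
                if avg_commit_interval(t["elapsed"], t["commits"]) > budget]

    tasks = [
        {"name": "BATCH01", "elapsed": 7200, "commits": 2},     # ~1 hour per commit
        {"name": "BATCH02", "elapsed": 7200, "commits": 3600},  # 2 seconds per commit
    ]
    print(flag_long_running(tasks, max_startup_seconds=3600))   # -> ['BATCH01']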