Data sharing considerations for disaster recovery

If you introduce a disaster recovery solution, it is usually based on a Db2 data sharing implementation. This information describes the most important concepts and the different options for implementing a disaster recovery strategy with data sharing, ranging from the traditional method to the most up-to-date implementation.

The options for implementing a disaster recovery strategy with data sharing are essentially the same as the options in non-data sharing environments. However, some new steps and requirements must be addressed.

Specific information about data sharing is available in Db2 13 for z/OS Data Sharing: Planning and Administration (SC28-2765-00).

Configuring the recovery site

If the distance between the primary site and the secondary site is too great to run a stretched Db2 data sharing group with members at both sites, the recovery site must have a data sharing group that is identical to the group at the local site: the same group name, the same number of members, and the same member names. The coupling facility resource management (CFRM) policies at the recovery site must define the coupling facility structures with the same names, although the sizes can differ. You can run the data sharing group on as few or as many z/OS® LPARs as you want.
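For illustration, the following sketch of an administrative data utility (IXCMIAPU) job shows how the Db2 structures might be defined in a recovery-site CFRM policy. The group name DSNDB0G, the policy name, the coupling facility name CF01, and all sizes are assumptions, and the definition of the coupling facility itself is omitted; the names and sizes of your own installation apply instead:

  //CFRMPOL  EXEC PGM=IXCMIAPU
  //SYSPRINT DD SYSOUT=*
  //SYSIN    DD *
    DATA TYPE(CFRM) REPORT(YES)
    DEFINE POLICY NAME(CFRMDR01) REPLACE(YES)
      STRUCTURE NAME(DSNDB0G_LOCK1)  SIZE(131072) PREFLIST(CF01)
      STRUCTURE NAME(DSNDB0G_SCA)    SIZE(65536)  PREFLIST(CF01)
      STRUCTURE NAME(DSNDB0G_GBP0)   SIZE(262144) PREFLIST(CF01)
      STRUCTURE NAME(DSNDB0G_GBP32K) SIZE(131072) PREFLIST(CF01)
  /*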

If you have configured SAP enqueue replication into the z/OS coupling facility (see Enqueue replication into an IBM Z coupling facility), make sure that the CFRM policies at the recovery site include the CF structures that are needed for this mechanism.

The hardware configuration can be different at the recovery site as long as it supports data sharing. Conceptually, there are two ways of running the data sharing group at the recovery site. Each way has different advantages that can influence your choice:

  • Run a multisystem data sharing group

    The local site is most likely configured this way, with a Parallel Sysplex®, which contains many CECs, z/OS LPARs, and Db2 subsystems. This configuration requires a coupling facility, the requisite coupling facility channels, and the Server Time Protocol (STP).

    The advantage of this method is that it provides the same availability and growth options as at the local site. In general, this method is recommended because it preserves the high availability characteristics of your SAP solution while it is running at the secondary site.

  • Run a single-system data sharing group

    In this configuration, all Db2 processing is centralized on a single IBM Z server that can support the expected workload. Even with a single CEC, a multi-member data sharing group that uses an internal coupling facility must be installed. After the Db2 group restart, all but one of the Db2 members are shut down, and data is accessed through that single Db2.

    Obviously, this approach loses the availability benefits of the Parallel Sysplex, but the single-system data sharing group has fewer hardware requirements:

    • STP is not required because the time-of-day clock of the single CEC can be used.
    • Any available coupling facility configuration can be used for the recovery site system, including Integrated Coupling Facilities (ICFs).

    With a single-system data sharing group, there is no longer read/write interest between Db2 members, and the coupling facility requirements are reduced to:

    • A LOCK structure (which can be smaller)
    • An SCA

    Group buffer pools are not needed to run a single-system data sharing group. However, small group buffer pools are needed for the initial start-up of the group so that Db2 can allocate them and perform damage-assessment processing. When it is time to do single-system data sharing, remove the group buffer pools by stopping all members. Then restart the member that is handling the workload at the disaster recovery site.
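    For example, assuming a three-member group with the hypothetical command prefixes -DB1A, -DB2A, and -DB3A, the switch to single-system operation might be performed as follows:

      -DB2A STOP DB2 MODE(QUIESCE)
      -DB3A STOP DB2 MODE(QUIESCE)
      -DB1A STOP DB2 MODE(QUIESCE)

    After all members are stopped and the group buffer pools are deallocated, restart only the member that is to handle the workload:

      -DB1A START DB2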

GDPS infrastructure for disaster recovery

GDPS® is an abbreviation for Geographically Dispersed Parallel Sysplex. It is a multi-site application that provides the capability to manage:

  • the remote copy configuration and storage subsystems
  • automated Parallel Sysplex tasks
  • failure recovery

Its main function is to provide an automated recovery for planned and unplanned site outages. GDPS maintains a multisite sysplex, in which some of the z/OS LPARs can be separated by a limited distance. GDPS adheres to the sysplex specification in that it is an application-independent solution.

The primary site contains some of the z/OS LPARs that support some of the data sharing group members, as well as the primary set of disks. These disks support all Db2 activity from any member of the group. At the secondary site, there are active sysplex images that support active Db2 members working with the primary set of disks. There is also a secondary set of disks, which are mirror copies of the primary set.

GDPS supports three data mirroring technologies:

  1. Metro Mirror (formerly Peer-to-Peer Remote Copy (PPRC)) with:
    • synchronous data mirroring
    • GDPS managing secondary data consistency, so that no data, or only limited data, is lost in a failover
    • exception condition monitoring performed at the production site, while GDPS initiates and processes the failover
    • support for distances between sites of up to 40 km (fiber)
    • both a continuous availability and a disaster recovery solution
  2. z/OS Global Mirror (formerly XRC) with:
    • asynchronous data mirroring
    • limited data loss is to be expected in unplanned failover
    • Global Mirror managing secondary data consistency
    • GDPS running Parallel Sysplex restart
    • supporting any distance
    • providing only a disaster recovery solution
  3. Global Mirror with:
    • asynchronous data mirroring
    • disk-based technology
    • supporting any distance
    • supporting a mix of CKD and FBA data

In addition to GDPS, there is an entry-level offering that consists of Tivoli® Storage Productivity Center for Replication (TPC-R) exploiting z/OS Basic HyperSwap®. Basic HyperSwap masks primary disk storage system failures by transparently switching to the secondary disk storage system. It is therefore non-disruptive and is designed for thousands of z/OS volumes. However, it is not intended for advanced disaster recovery scenarios; for example, it does not preserve data consistency for cascading primary volume failures. For such scenarios, use GDPS.

The following is an example of a multifunctional disaster recovery infrastructure that uses GDPS and Metro Mirror to provide all the elements of a backup and recovery architecture. It includes:

  • conventional recovery, to current and to a previous point in time
  • disaster recovery
  • fast system copy capability to clone systems for testing or reporting
  • forensic analysis system (a corrective system as a "toolbox" in case of application disaster)
  • compliance with the high availability requirements of a true 24x7 transaction environment that is based on SAP

This configuration is prepared to support stringent high availability requirements in which no quiesce points are needed.

The non-disruptive Db2 BACKUP SYSTEM utility is used to obtain backups without disrupting production; no transactions or data are lost. The infrastructure also provides for a forensic analysis system as a snapshot of production that can be taken repeatedly throughout the day.
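As an illustration, a BACKUP SYSTEM job can be as small as the following sketch, which assumes the Db2-supplied DSNUPROC procedure (as customized for your installation), a hypothetical member DB1A, and a hypothetical utility ID:

  //BACKSYS  EXEC DSNUPROC,SYSTEM='DB1A',UID='SAPBKUP',UTPROC=''
  //SYSIN    DD *
    BACKUP SYSTEM FULL
  /*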

The components of this sample solution shown in Figure 1 are IBM Z, z/OS Parallel Sysplex, Db2 for z/OS data sharing, GDPS with automation support, IBM® DS8000® disk subsystems with Metro Mirror/Global Mirror and FlashCopy® functions, and enqueue replication servers for high availability of the applications.

The following figure shows the GDPS solution landscape.

Figure 1. Example of high availability with GDPS configuration

This configuration is made up of two sites and three cells. (Cell 2 is where the corrective system is started.) The three cells are encapsulated and protected against, for example, floods and earthquakes. Based on GDPS recommendations, the distance between cell 1 and cell 3 should be about 20 km. Both cells belong to the same sysplex and contain members of the same data sharing group. Cell 2, on the other hand, is outside the sysplex so that the corrective system can keep the same Db2 data set names.

The primary and active set of DS8000 disks is located at the primary site, and the disks are mirrored to the secondary site by using Metro Mirror. Because the BACKUP SYSTEM utility is used, it is not necessary to split the mirror to get a non-disruptive backup. The design keeps symmetry between the two sites, with the same DS8000 disk capacity at each site. Therefore, if one site is not available (disaster, maintenance), the other can provide an alternate backup process.

Remote site recovery using archive logs

If you are not using GDPS, you can consider the following approach. Apart from the configuration considerations described above, enabling data sharing does not change the disaster recovery procedures that are already in place for a single Db2 subsystem. All steps are documented in the Db2 13 for z/OS Administration Guide (SC28-2761-00).

The procedure for restarting a Db2 data sharing group at the recovery site differs in that it includes steps that ensure that a group restart takes place so that the coupling facility structures are rebuilt. In addition, you must prepare each member for conditional restart, rather than just a single subsystem.

To force a Db2 group restart, you must ensure that all of the coupling facility structures for this group have been deallocated:

  1. Enter the following MVS command to display the structures for this data sharing group:
    D XCF,STRUCTURE,STRNAME=grpname*
  2. For the LOCK structure and any failed-persistent group buffer pools, enter the following command to force the connections off of those structures:
    SETXCF FORCE,CONNECTION,STRNAME=strname,CONNAME=ALL

    With group buffer pools, after the failed-persistent connection has been forced, the group buffer pool is deallocated automatically.

    The LOCK structure and the shared communications area (SCA), however, must be forced out explicitly to be deallocated, as described in the next step.

  3. Delete all of the Db2 coupling facility structures by using the following command for each structure:
    SETXCF FORCE,STRUCTURE,STRNAME=strname

    This step is necessary to clean out old information that exists in the coupling facility from your practice startup when you installed the group.
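Assuming a hypothetical group name DSNDB0G with group buffer pool GBP0, the complete sequence might look like the following; the structure names at your site depend on your group name and group buffer pool configuration:

  D XCF,STRUCTURE,STRNAME=DSNDB0G*
  SETXCF FORCE,CONNECTION,STRNAME=DSNDB0G_LOCK1,CONNAME=ALL
  SETXCF FORCE,CONNECTION,STRNAME=DSNDB0G_GBP0,CONNAME=ALL
  SETXCF FORCE,STRUCTURE,STRNAME=DSNDB0G_LOCK1
  SETXCF FORCE,STRUCTURE,STRNAME=DSNDB0G_SCA
  SETXCF FORCE,STRUCTURE,STRNAME=DSNDB0G_GBP0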

The following is a conceptual description of data sharing disaster recovery using the traditional method of recovery based on image copies and archive logs.

Be sure to have all of the information that is needed for the recovery. The required image copies of all data objects are the same as in a non-data sharing environment, but now the bootstrap data sets (BSDSs) and archive logs of all members must be provided by using one of the following three options (example commands follow the list):

  • ARCHIVE LOG MODE(QUIESCE)

    As previously explained, this command enforces a consistency point by draining new units of recovery. It is therefore restrictive with regard to continuous availability, but if it completes successfully, it establishes a group-wide point of consistency whose log record sequence number (LRSN) is recorded in the BSDS of the member that issued the command.

  • ARCHIVE LOG SCOPE(GROUP)

    With this command, the members of the group are not quiesced to establish a point of consistency, but all of them register a checkpoint for their log offload. Because you are going to conditionally restart all members of the group, you must find a common point in time on the log to provide consistency throughout the group. Find the lowest ENDLRSN of all the archive logs that were generated (see message DSNJ003I), subtract 1 from that LRSN, and prepare the conditional restart of all members by using the resulting value.

  • SET LOG SUSPEND

    If you plan to use a fast volume copy of the system, remember that the SET LOG SUSPEND command does not have group scope; it must therefore be issued on every member of the group before you split the mirror pairs or perform FlashCopy.
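For reference, the commands behind these three options look as follows, shown here with the hypothetical command prefix -DB1A:

  -DB1A ARCHIVE LOG MODE(QUIESCE)
  -DB1A ARCHIVE LOG SCOPE(GROUP)
  -DB1A SET LOG SUSPEND

After the volume copy is complete, logging is resumed on each suspended member with SET LOG RESUME.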

At the recovery site, the BSDS data sets and logs of each member must be available. The logs and the conditional restart must be defined for each member in the respective BSDS data sets, and the conditional restart LRSN must be the same for all members (see the sketch that follows). In contrast to the logs and BSDS data sets, the Db2 catalog and directory databases exist only once in the data sharing group and need to be defined and recovered only once, from any of the active members.
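The conditional restart record can be created in the BSDS of each member with the DSNJU003 (change log inventory) utility. The following sketch uses hypothetical library and BSDS data set names and a placeholder LRSN value; the same ENDLRSN value must be used for every member:

  //CRESTART EXEC PGM=DSNJU003
  //STEPLIB  DD DISP=SHR,DSN=DSN1310.SDSNLOAD
  //SYSUT1   DD DISP=OLD,DSN=DB1A.BSDS01
  //SYSUT2   DD DISP=OLD,DSN=DB1A.BSDS02
  //SYSPRINT DD SYSOUT=*
  //SYSIN    DD *
    CRESTART CREATE,ENDLRSN=00D3A8F5B2C4
  /*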

DSNJU004 and DSN1LOGP have options that allow for a complete output from all members.

After all members are successfully restarted and if you are going to run single-system data sharing at the recovery site, stop all members except one by using the STOP DB2 command with MODE(QUIESCE). If you planned to use the light mode when starting the Db2 group, add the LIGHT parameter to the START command. Start the members that run in LIGHT(NO) mode first, followed by the LIGHT(YES) members.
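For example, with hypothetical command prefixes, a member that is not needed for single-system operation is stopped with:

  -DB2A STOP DB2 MODE(QUIESCE)

and a member that is planned for a light restart is started with:

  -DB3A START DB2 LIGHT(YES)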

You can continue with all of the steps described in topic Performing remote site recovery from a disaster at a local site in Db2 13 for z/OS Administration Guide (SC28-2761-00).

Tracker site for disaster recovery

A Db2 tracker site is a separate Db2 subsystem or data sharing group that exists solely for the purpose of keeping shadow copies of your primary site data.

No independent work can be run on the tracker site. From the primary site, you transfer the BSDS and the archive logs, then the tracker site runs periodic LOGONLY recoveries to keep the shadow data up-to-date. If a disaster occurs at the primary site, the tracker site becomes the takeover site. Because the tracker site has been shadowing the activity on the primary site, you do not have to constantly ship image copies. The takeover time for the tracker site can be faster because Db2 recovery does not have to use image copies.

The general approach for tracker site recovery based on the Db2 BACKUP SYSTEM is as follows:

  1. Use BACKUP SYSTEM to establish a tracker site.
  2. Periodically send the active logs, bootstrap data sets (BSDSs), and archive logs to the tracker site (by using Metro Mirror, Global Mirror, z/OS Global Mirror, FTP, or tape).
  3. Send image copies after LOAD or REORG jobs that run with LOG NO.
  4. For each tracker recovery cycle:
    • Run RESTORE SYSTEM LOGONLY to roll the databases forward by using the logs (see the sketch after this list).
    • Use image copies to recover objects that are in RECOVER-pending state.
    • Rebuild indexes that are in REBUILD-pending state.
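The recovery-cycle utility step itself can be as small as the following sketch, assuming the Db2-supplied DSNUPROC procedure (as customized for your installation) and a hypothetical tracker member DB1A:

  //TRKCYCLE EXEC DSNUPROC,SYSTEM='DB1A',UID='TRKCYCLE',UTPROC=''
  //SYSIN    DD *
    RESTORE SYSTEM LOGONLY
  /*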

More information about setting up a tracker site and recovery procedures can be found in Db2 13 for z/OS Administration Guide (SC28-2761-00) and Db2 13 for z/OS Data Sharing: Planning and Administration (SC28-2765-00).