GDPC High Availability and Disaster Recovery

A geographically dispersed Db2® pureScale® cluster (GDPC) is a regular Db2 pureScale cluster stretched between two geographically separated sites that are tens of kilometers apart. The result is an active/active solution that provides a level of disaster recovery support suitable for many types of disasters, while retaining all of the existing single-site Db2 pureScale high availability characteristics for planned and unplanned hardware and software failures.

During normal operations, both sites are active and available for transactions. In the event of individual member failures within a site, or a total site failure, client connections are automatically redirected to surviving members by Workload Balancing (WLB) and Automatic Client Reroute (ACR). No preference is given to restarting a member in restart-light mode on another system at the same site. Although this might be the intuitive expectation, it provides no benefit in overall failure recovery time, because the restarting member must communicate equally with members and CFs at both sites. If the primary CF system fails, the primary CF role fails over to the secondary CF at the surviving site.
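Client-side WLB and ACR are typically enabled in the db2dsdriver.cfg file on the client. The following fragment is a minimal sketch only; the database name, port, and host are illustrative placeholders (the host name reuses m1host from the sample output shown later in this topic):

  <configuration>
    <dsncollection>
      <dsn alias="GDPCDB" name="GDPCDB" host="m1host" port="50000"/>
    </dsncollection>
    <databases>
      <database name="GDPCDB" host="m1host" port="50000">
        <!-- distribute work across all active members -->
        <wlb>
          <parameter name="enableWLB" value="true"/>
        </wlb>
        <!-- reroute connections to surviving members after a failure -->
        <acr>
          <parameter name="enableACR" value="true"/>
        </acr>
      </database>
    </databases>
  </configuration>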

The estimated time for a GDPC to recover from software faults is comparable to the recovery time for software faults in a single-site Db2 pureScale cluster. If SCSI-3 PR is not used, a slightly longer impact on the workload is expected for hardware failures that affect an entire system. Recovery time depends on many factors, such as the number of file systems, file sizes, and the frequency of writes to the files. Ensure that sufficient space is available for critical file systems such as /var and /tmp, because a lack of space on these file systems might affect the operation of the cluster services.
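For example, free space on these file systems can be checked periodically on every host in the cluster; this is a simple sketch, and the flags differ by platform:

df -g /var /tmp     (AIX)
df -h /var /tmp     (Linux)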

The overall health of a GDPC can be queried with the consolidated db2cluster -verify command. This command performs a comprehensive set of validations that includes, but is not limited to, the following (an example invocation follows the list):
  • Configuration settings in the peer domain and the IBM Spectrum Scale cluster (for example, that the host failure detection time is set properly for GDPC)
  • Communications between members and CFs
  • Replication setting for each file system
  • Status of each disk in each file system
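For example, the verification can be run as the instance owner, and any alerts that it raises can then be reviewed with the alert listing command described later in this topic (a sketch; the output depends on the cluster configuration):

db2cluster -verify
db2cluster -list -alert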
An alert is raised for each failed criterion and is displayed accordingly by the standard Db2 pureScale instance monitoring command, db2instance -list. Alerts with cluster-wide impact, such as replication having stopped for one or more systems, or the IBM Spectrum Scale cluster and peer domain employing different cluster quorum mechanisms, are displayed for all hosts. For example:
  
  ID     TYPE  STATE    HOME_HOST   CURRENT_HOST  ALERT  PARTITION_NUMBER   LOGICAL_PORT  NETNAME
  --     ----  -----    ---------   ------------  -----  ----------------   ------------  -------
  0    MEMBER  STARTED  m1host      m1host        NO                 0              0        -
  1    MEMBER  STARTED  m2host      m2host        NO                 0              0        -
  128  CF      PRIMARY  cf1host     cf1host       NO                 -              0        -
  129  CF      PEER     cf2host     cf2host       NO                 -              0        -
 
  HOSTNAME                   STATE                INSTANCE_STOPPED        ALERT
  --------                   -----                ----------------        -----
  m1host                    ACTIVE                              NO          YES
  m2host                    ACTIVE                              NO          YES
  cf1host                   ACTIVE                              NO          YES
  cf2host                   ACTIVE                              NO          YES

There is currently an alert for members, CFs, hosts, cluster file system, or cluster configuration 
in the data-sharing instance. For more information on the alert, its impact, and how to clear it, 
run the following command: db2cluster -list -alert

The db2cluster -list -alert command can be used to display details, possible solutions, and impacts of each alert.

Storage replica failure scenario

Consider a scenario where one complete storage replica has failed or becomes inaccessible. This type of failure is handled automatically and transparently by GDPC; however, there is a short period during which update transactions are expected to be delayed. The exact length of the period during which update transactions are affected depends on the number of disks used for all the database file systems, the nature of the workload, and the disk I/O storage controller configuration settings. Note that each disk in a storage replica is considered an independent entity; rather than detecting that an entire storage replica has failed, the IBM Spectrum Scale software is informed by the storage controller separately for each disk in the failed storage replica. Therefore, the length of time until all file system accesses return to normal depends on:
  1. How long it takes the workload to drive a file system I/O to each of the disks in the failed storage replica,
  2. The length of time for the disk I/O storage controller to report individual disk failures back to the IBM Spectrum Scale software, and the time for the IBM Spectrum Scale software to mark those affected disks as failed.

Some disk I/O accesses can return to normal while others are still delayed, waiting for a specific disk to return an error. After all disks in a storage replica have been marked as failed, file system I/O times return to normal, because IBM Spectrum Scale has stopped replicating data writes to the failed disks. Note that even though the GDPC remains operational during this entire period, after some disks or an entire storage replica has failed, only a single copy of the file system data is available, which leaves the GDPC exposed to a single point of failure until the problem has been resolved and replication has been restarted.

As mentioned earlier, it is important to note that the storage failure recovery time depends on the storage controller's configuration, in particular how quickly the storage controller returns an error up to IBM Spectrum Scale so that IBM Spectrum Scale can mark the affected disks as inaccessible. By default, some storage controllers are configured either to retry indefinitely on errors, or to delay reporting errors back up the I/O stack for a lengthy amount of time, sometimes even long enough to allow the storage controller to reboot. Although this behavior is usually desirable when only one replica of storage is available (it avoids returning a file system error if the error is possibly recoverable at the storage layer), it significantly increases the storage failure recovery time. In some cases it makes the storage layer seem unresponsive, which might be enough to cause the rest of the cluster to assume that all members and CFs are also unresponsive, causing Tivoli® System Automation MP (Tivoli SA MP) to stop and restart them, which is undesirable. With GDPC, because there is a second replica of data, and because a key requirement is automatic and transparent recovery from a wide variety of failures including storage failures, the storage controller failure detection time should be reduced. A good starting point is to set the storage failure detection time to 20 seconds; the exact mechanism depends on the type of storage and storage controller being used. For an example of how to update the failure detection time for the AIX® MPIO multipath device driver, see Configuring the cluster for high availability.
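As a hedged illustration only, on AIX the read/write timeout of an MPIO hdisk is one attribute that can be lowered for this purpose; the hdisk name below is a placeholder, the applicable attributes depend on the device driver and storage subsystem, and the documented procedure in Configuring the cluster for high availability should be followed for your environment:

chdev -l hdisk2 -a rw_timeout=20     (add -P to defer the change to the next restart if the disk is in use)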

After the storage controller is back online, the disks might still be considered down by IBM Spectrum Scale. Use the following command to check the status of the disks:
db2cluster -cfs -list -filesystem <file system name>
If any of the disks are in the 'down' state, use the following command to re-enable them:
db2cluster -cfs -start -filesystem <file system name> -disk
Once all disks in the file system are online again, data that was added or modified while the disks were unavailable needs to be replicated. Use the following command to perform the replication:
db2cluster -cfs -replicate -filesystem <file system name>
Note: The duration depends on the amount of data that needs to be replicated.
Depending on the layout of data on the existing disks, a file system can become unbalanced after the replicate action. If a subsequent rebalance action is recommended by IBM Spectrum Scale, an alert is raised at the completion of the replicate command. Run the following command to resolve the unbalanced alert:
db2cluster -cfs -rebalance -filesystem <file system name>
Note: Both the replicate and rebalance commands are I/O-intensive operations, and it is recommended that you run them during off-peak usage periods.
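For example, for a hypothetical shared file system named db2fs1, the complete recovery sequence after the storage controller comes back online might look like the following (the file system name is a placeholder, and the rebalance step is needed only if an alert recommends it):

db2cluster -cfs -list -filesystem db2fs1
db2cluster -cfs -start -filesystem db2fs1 -disk
db2cluster -cfs -replicate -filesystem db2fs1
db2cluster -cfs -rebalance -filesystem db2fs1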

Site failure recovery scenario

Consider a scenario where either site A or site B experiences a total failure, such as a localized power outage, and is expected to eventually come back online. This type of failure is handled automatically and transparently by GDPC. Systems on the surviving site independently perform restart-light member crash recovery, in parallel, for each of the members from the failed site. All members that were configured on the failed site remain in restart-light mode on guest systems on the surviving site until the members' home systems on the failed site have been recovered; that is, if only one member system on the failed site recovers, then only the member configured on that system fails back to its home system. If the site that failed contained the primary CF, the primary CF role automatically fails over to the secondary CF located on the surviving site. During recovery, there is a period of time during which all write transactions are paused. Read transactions might be paused as well, depending on whether the data being read is already cached by the member, and whether it is separate from data that was being updated at the time of the site failure. Data that is not already cached by the member must be fetched from the CF, which is delayed until recovery is complete. The length of time that transactions are paused depends mainly on the time required for IBM Spectrum Scale to perform file system recovery. File system recovery time is primarily influenced by the number of file systems as well as the frequency and size of file system write requests around the time of the failure, so workloads with a higher ratio of updates might be affected by longer file system recovery times.
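While members from the failed site are running in restart-light mode, their state can be observed with db2instance -list. The following is a hedged illustration that reuses the host names from the earlier sample output and assumes the site containing m2host and cf2host has failed; the affected member shows a CURRENT_HOST on the surviving site that differs from its HOME_HOST:

  ID     TYPE  STATE                 HOME_HOST   CURRENT_HOST  ...
  1    MEMBER  WAITING_FOR_FAILBACK  m2host      m1host        ...
  128  CF      PRIMARY               cf1host     cf1host       ...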

After one site has failed, the surviving site has the following abilities and characteristics:
  1. Have read and write access to the shared file systems (that is, there is full access to the surviving replica of data from the surviving members and CFs).
  2. Service all database transaction requests from clients from the members configured on the systems on the surviving site.
  3. Contain the primary CF (the primary role will transparently fail over to the CF host on the surviving site if the primary CF was previously running on the site that failed).
  4. Run the members from the failing site in restart-light mode.

It is important that all the hosts on the surviving site, as well as the tiebreaker host T, remain online; otherwise, quorum will not be reached (to maintain majority quorum, access to all hosts on the surviving site plus the tiebreaker host is needed).

When the failed site eventually comes back online:
  1. The shared file systems must be manually re-replicated to ensure that any data written at the surviving site is replicated to the failed site. This can be done with the mmnsddiscover and mmchdisk start commands (see the sketch after this list), or by following the procedure described in the previous section, "Storage replica failure scenario."
  2. Members will automatically fail back to their home hosts.
  3. The CF on the failed site will restart as a new secondary CF.
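A hedged sketch of the IBM Spectrum Scale commands for step 1, assuming a file system device named db2fs1 and run as root on one of the hosts (the device name is a placeholder):

root@hostA1> /usr/lpp/mmfs/bin/mmnsddiscover -a          (rediscover paths to the disks that came back online)
root@hostA1> /usr/lpp/mmfs/bin/mmchdisk db2fs1 start -a  (start all disks that are currently marked down)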

Connectivity failure between sites scenario

Consider a scenario where all connectivity is lost between site A and site B (for example, the dark fiber between the sites is damaged, or a switch fails).

If one site loses all connectivity with both the other site and the tiebreaker site, this form of connectivity failure is handled identically to a site failure. The site that can still communicate with the tiebreaker site becomes the surviving site. Until connectivity is restored, all Db2 members from the systems at the failed site are restarted in restart-light mode on hosts at the surviving site, and the primary CF role is moved over to the surviving site, if necessary.

If all connectivity between sites A and B is lost, but both sites retain connectivity with the tiebreaker site, the IBM Spectrum Scale software detects the link failure between the two sites and chooses to evict all systems at one of the sites from the IBM Spectrum Scale domain. Typically, the IBM Spectrum Scale software favors keeping the site that contains the current IBM Spectrum Scale cluster manager (the current cluster manager can be determined by running the IBM Spectrum Scale mmlsmgr command). The systems on the losing site are I/O fenced from the cluster until connectivity is restored. In the meantime, Tivoli SA MP responds to the loss of the IBM Spectrum Scale connection by restarting all Db2 members from the affected systems in restart-light mode on systems at the surviving site, and moves the primary CF role from the failed site to the remaining site if necessary. To reduce the amount of Db2 recovery work needed in the event of a connectivity failure between sites, the site containing the primary CF should be the one that remains operational. As such, if the mmlsmgr command shows that the IBM Spectrum Scale cluster manager is located on the site that does not also contain the primary CF (as reported by db2instance -list), you can move it to the same site as the primary CF by using the mmchmgr command. For example:
root@hostA1> /usr/lpp/mmfs/bin/mmchmgr -c primary_cf_system
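The cluster manager location can be confirmed before and after the move by running mmlsmgr, which reports the current cluster manager node (the prompt and path are the same as in the example above):

root@hostA1> /usr/lpp/mmfs/bin/mmlsmgr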

Because the location of the IBM Spectrum Scale cluster manager can change, especially after a node reboot, it should be monitored to ensure that it remains on the same site as the primary CF. If, instead of a connectivity loss between sites A and B, all connectivity with the tiebreaker site is lost from both sites, the tiebreaker host T is expelled from the cluster. Because no Db2 member or CF runs on host T, there is no immediate functional impact on the GDPC instance. However, in the event of a subsequent site failure, quorum will be lost.