Using roving high availability (HA) failover in partitioned database environments

When you are using an N Plus M failover policy with N active nodes and one standby node, you can enable roving HA failover.

Before you begin

The roving HA failover setting must be the same on every node in the cluster: either enabled on all nodes or disabled on all nodes.

In partitioned database environments where roving HA failover is not enabled, the designated standby node is usually the only node with access to all the disks and volume groups, including the file systems on these volume groups. In those environments, ensure that the external storage LUN mappings and the SAN zones allow every node in the cluster to see all the disks in the database instance. In addition, verify that all the volume groups controlled by the cluster are imported on all the cluster nodes. After importing the volume groups, disable the auto-varyon attribute of the volume groups and the auto-mount attribute of the file systems on all the active cluster nodes.
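
As a rough illustration of the last step, the sketch below prints the commands that would turn off auto-varyon and auto-mount, without running them. It assumes AIX LVM, where chvg -a n and chfs -A no control these attributes; print_vg_prep_commands is a hypothetical helper, and the volume group and file system names are placeholders, not values from this topic.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) the AIX LVM commands that turn off
# auto-varyon for one volume group and auto-mount for its file systems.
# print_vg_prep_commands is a hypothetical helper name, not a Db2 or AIX tool.
print_vg_prep_commands() {
    vg=$1
    shift
    # chvg -a n: do not activate the volume group automatically at startup
    printf 'chvg -a n %s\n' "$vg"
    for fs in "$@"; do
        # chfs -A no: do not mount the file system automatically at startup
        printf 'chfs -A no %s\n' "$fs"
    done
}

# Example with placeholder names (db2vg and /db2fs/part0 are assumptions):
# print_vg_prep_commands db2vg /db2fs/part0
```

Review the printed commands against your own volume group layout before running any of them on the active cluster nodes.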

If you want to use roving HA failover, you must enable it again by following these steps each time you apply a new fix pack.

About this task

When you are using an N Plus M failover policy with N active nodes and exactly one standby node, a failover operation occurs when one of the active nodes fails. As part of the failover, the standby node begins hosting the resources of the failed node. When the failed node comes back online, you would usually have to take a brief outage to move the resources back to their original active node. Instead, you can configure roving HA failover so that the most recently failed node becomes the standby node for all other partitions in the cluster, with no additional failback operation required.
Note: Roving HA failover applies only to environments where all Db2® partitions are defined to run on exactly two hosts and the passive host is the same for every Db2 partition in the cluster.
For example, consider a four-host environment with three active database partitions running on hosts A, B, and C, and host D as the standby:
Online IBM.ResourceGroup:db2_db2inst1_0-rg Nominal=Online
           |- Online IBM.Application:db2_db2inst1_0-rs
                      |- Online IBM.Application:db2_db2inst1_0-rs:hostA
                      '- Offline IBM.Application:db2_db2inst1_0-rs:hostD
Online IBM.ResourceGroup:db2_db2inst1_1-rg Nominal=Online
           |- Online IBM.Application:db2_db2inst1_1-rs
                      |- Online IBM.Application:db2_db2inst1_1-rs:hostB
                      '- Offline IBM.Application:db2_db2inst1_1-rs:hostD
Online IBM.ResourceGroup:db2_db2inst1_2-rg Nominal=Online
           |- Online IBM.Application:db2_db2inst1_2-rs
                      |- Online IBM.Application:db2_db2inst1_2-rs:hostC
                      '- Offline IBM.Application:db2_db2inst1_2-rs:hostD
After a failure of hostB, the resource model looks as follows without the roving HA failover feature:
Online IBM.ResourceGroup:db2_db2inst1_0-rg Nominal=Online
           |- Online IBM.Application:db2_db2inst1_0-rs
                      |- Online IBM.Application:db2_db2inst1_0-rs:hostA
                      '- Offline IBM.Application:db2_db2inst1_0-rs:hostD
Online IBM.ResourceGroup:db2_db2inst1_1-rg Nominal=Online
           |- Online IBM.Application:db2_db2inst1_1-rs
                      |- Offline IBM.Application:db2_db2inst1_1-rs:hostB
                      '- Online IBM.Application:db2_db2inst1_1-rs:hostD
Online IBM.ResourceGroup:db2_db2inst1_2-rg Nominal=Online
           |- Online IBM.Application:db2_db2inst1_2-rs
                      |- Online IBM.Application:db2_db2inst1_2-rs:hostC
                      '- Offline IBM.Application:db2_db2inst1_2-rs:hostD
With the roving HA failover feature enabled, the resource model instead looks as follows after a failure of hostB:
Online IBM.ResourceGroup:db2_db2inst1_0-rg Nominal=Online
           |- Online IBM.Application:db2_db2inst1_0-rs
                      |- Online IBM.Application:db2_db2inst1_0-rs:hostA
                      '- Offline IBM.Application:db2_db2inst1_0-rs:hostB
Online IBM.ResourceGroup:db2_db2inst1_1-rg Nominal=Online
           |- Online IBM.Application:db2_db2inst1_1-rs
                      |- Offline IBM.Application:db2_db2inst1_1-rs:hostB
                      '- Online IBM.Application:db2_db2inst1_1-rs:hostD
Online IBM.ResourceGroup:db2_db2inst1_2-rg Nominal=Online
           |- Online IBM.Application:db2_db2inst1_2-rs
                      |- Online IBM.Application:db2_db2inst1_2-rs:hostC
                      '- Offline IBM.Application:db2_db2inst1_2-rs:hostB

In this resource model, after the failure of hostB, hostB is the standby host for all active partitions, which now run on hosts A, C, and D.
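
The standby reassignment shown in the listings can be modelled in a few lines. The sketch below only illustrates the outcome; it is not how the cluster manager implements failover. roving_failover is a hypothetical helper, and the three-column input (partition number, active host, standby host) is an assumed format chosen for the example.

```shell
#!/bin/sh
# Model of roving HA failover: read "partition active standby" lines on stdin
# and print the layout after the named host fails.
# roving_failover is a hypothetical helper used for illustration only.
roving_failover() {
    failed=$1
    awk -v failed="$failed" '
        { part[NR] = $1; active[NR] = $2; standby[NR] = $3 }
        END {
            # Find the standby that takes over the partition of the failed host
            for (i = 1; i <= NR; i++)
                if (active[i] == failed) { takeover = standby[i]; found = 1 }
            for (i = 1; i <= NR; i++) {
                if (!found)
                    print part[i], active[i], standby[i]   # host was not active
                else if (active[i] == failed)
                    print part[i], takeover, failed        # partition moved
                else
                    print part[i], active[i], failed       # new standby only
            }
        }'
}

# Example: the four-host layout from this topic, with hostB failing.
# printf '0 hostA hostD\n1 hostB hostD\n2 hostC hostD\n' | roving_failover hostB
```

For the example input, partition 1 moves to hostD and every partition lists hostB as its standby, which matches the roving resource model above.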

Procedure

To enable the roving HA failover feature, perform the following steps on each host in the cluster:

  1. Ensure that there is no failover operation in progress.
  2. Make a backup copy of the db2V115_start.ksh or db2V121_start.ksh script (whichever matches your Db2 version) located in the /usr/sbin/rsct/sapolicies/db2/ directory.
  3. Edit the same script. Find the following line:
    ROVING_STANDBY_ENABLED=false
    and make the following change to set the ROVING_STANDBY_ENABLED variable to true:
    ROVING_STANDBY_ENABLED=true
  4. Save your changes.
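
The backup and edit steps can also be scripted. The following is a minimal sketch, assuming the variable appears on its own line exactly as shown in step 3; set_roving_standby is a hypothetical helper, not a Db2 or cluster manager command, and the .bak and .tmp suffixes are arbitrary choices.

```shell
#!/bin/sh
# Sketch: back up a Db2 HA start script and set ROVING_STANDBY_ENABLED.
# set_roving_standby is a hypothetical helper, not part of Db2.
set_roving_standby() {
    script=$1   # path to the start script
    value=$2    # "true" or "false"

    case "$value" in
        true|false) ;;
        *) echo "value must be true or false" >&2; return 2 ;;
    esac

    [ -f "$script" ] || { echo "not found: $script" >&2; return 1; }

    # Stop if the variable is missing, so nothing is changed silently
    grep -q '^[[:space:]]*ROVING_STANDBY_ENABLED=' "$script" || {
        echo "ROVING_STANDBY_ENABLED not found in $script" >&2
        return 1
    }

    # Step 2: keep a backup copy with the original permissions
    cp -p "$script" "$script.bak" || return 1

    # Step 3: rewrite the assignment through a temporary file
    sed "s/^\([[:space:]]*ROVING_STANDBY_ENABLED=\)[A-Za-z]*/\1$value/" \
        "$script.bak" > "$script.tmp" || return 1

    # Overwrite the contents in place so the owner and mode stay unchanged
    cat "$script.tmp" > "$script" && rm -f "$script.tmp"
}

# Example (run on each host, with no failover in progress):
# set_roving_standby /usr/sbin/rsct/sapolicies/db2/db2V115_start.ksh true
```

Passing false as the second argument reverses the change. The function returns a nonzero status and leaves the script untouched if the file or the variable cannot be found.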

Results

The change will take effect at the next failover operation.
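
Because the setting must match on every host, it can help to read the value back after editing. The sketch below is one possible check; roving_standby_value and roving_settings_consistent are hypothetical helpers, and the "host value" input format is an assumption made for the example.

```shell
#!/bin/sh
# Print the current ROVING_STANDBY_ENABLED value from a start script.
# Both functions are hypothetical helpers used for illustration only.
roving_standby_value() {
    sed -n 's/^[[:space:]]*ROVING_STANDBY_ENABLED=\([A-Za-z]*\).*/\1/p' "$1" |
        head -n 1
}

# Read "host value" pairs on stdin; succeed only if all values are identical.
roving_settings_consistent() {
    awk '
        { seen[$2] = 1 }
        END {
            n = 0
            for (v in seen) n++
            exit (n == 1 ? 0 : 1)
        }'
}

# Example: collect one "host value" line per host by whatever remote
# execution method you normally use, then pipe the lines to the check:
# printf 'hostA true\nhostB true\n' | roving_settings_consistent && echo consistent
```

An empty input or a mix of true and false values makes roving_settings_consistent return a nonzero status.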

What to do next

If you want to disable the roving HA failover feature, perform the following steps on each host in the cluster:
  1. Ensure that there is no failover operation in progress.
  2. Make a backup copy of the db2V115_start.ksh or db2V121_start.ksh script (whichever matches your Db2 version) located in the /usr/sbin/rsct/sapolicies/db2/ directory.
  3. Edit the same script. Find the following line:
    ROVING_STANDBY_ENABLED=true
    and make the following change to set the ROVING_STANDBY_ENABLED variable to false:
    ROVING_STANDBY_ENABLED=false
  4. Save your changes.