Scenario - Deploying a two-site multiple-standby cluster on Amazon Web Services with three hosts

This scenario provides details on the planning, configuration, and deployment of a three-host cluster on Amazon Web Services (AWS) across multiple availability zones. Pacemaker is used exclusively as the cluster manager for this scenario, and IBM® Tivoli® System Automation for Multiplatforms (SA MP) is not supported.

Another disaster recovery scenario is a two-site, multiple-standby cluster with same-site failover automation. For more information about this scenario, see Scenario - Deploying a two-site multiple standby cluster with same-site failover automation. A disaster recovery-capable HADR deployment with two standby databases is another popular option.

Important: In Db2® 11.5.8 and later, Mutual Failover high availability is supported when using Pacemaker as the integrated cluster manager. In Db2 11.5.6 and later, the Pacemaker cluster manager for automated failover to HADR standby databases is packaged and installed with Db2. In Db2 11.5.5, Pacemaker is included and available for production environments. In Db2 11.5.4, Pacemaker is included as a technology preview only, for development, test, and proof-of-concept environments.

Objective

The objective of this example scenario is to set up a three-host, multiple standby Pacemaker HADR cluster in two separate AWS availability zones (AZs) within the same region. The three-host cluster needs to have the following characteristics:
  • The principal primary host and principal standby host must be in AZ1 with HADR SYNC mode.
  • The single auxiliary standby host must be in AZ2 with HADR SUPERASYNC mode.
  • Automated failover with Pacemaker must be set up between the two hosts in AZ1.
  • No Pacemaker cluster setup can exist in AZ2 or across AZ1 and AZ2.
  • Manual takeover from the auxiliary standby is needed for disaster recovery.
  • The AWS Overlay IP must be set up in AZ1 to allow virtual IP support between the principal primary and principal standby hosts.
  • The AWS Elastic IP must be set up as the alternative server in your client configuration to allow transient failover when the auxiliary standby takes over as the primary.
Note: This example uses two availability zones (AZs), but the scenario can be extended to three AZs, with the principal primary, principal standby, and auxiliary standby each located in its own zone. Users can also configure a two-node HADR cluster that uses Pacemaker along with additional quorum and fencing configuration. For more information, see Configuring high availability with the Db2 cluster manager utility (db2cm).
The following diagram depicts the resulting setup that allows client applications to connect from within the same AWS VPC or outside of AWS.

Configure your database with Archive logging

HADR is only supported on a database that is configured with archive logging. If your database is configured with circular logging, you must first change the logarchmeth1 configuration parameter (and optionally logarchmeth2) to enable archive logging. After archive logging is enabled, the database is placed in backup pending state, and an offline backup of the database is required before it can be used.
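A minimal sketch of this change, assuming an archive log path of /db2/archive_logs and the database name hadrdb used in this scenario (replace both with values appropriate to your environment):
    db2 UPDATE DB CFG FOR hadrdb USING LOGARCHMETH1 DISK:/db2/archive_logs
    db2 BACKUP DB hadrdb TO backup_dir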

Environment

The following table shows the set of hosts, instances, ports, and intended roles that are used in the scenario:
Hostname   Instance name   HADR service port   SVCENAME   Intended role
Host_A     db2inst1        10                  20000      Principal primary
Host_B     db2inst2        20                  20000      Principal standby
Host_C     db2inst3        30                  20000      Auxiliary standby
Without loss of generality, different instance names and ports are chosen for each host in this example scenario. Only the SVCENAME ports on Host_A and Host_B need to be the same. The database hadrdb is configured as a multiple standby database with the topology outlined. All commands that are referenced in this scenario need to be updated with the arguments appropriate for your specific deployment.
Note: An Overlay IP needs to be configured on AWS, and in this example, it is 192.168.1.81.
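If the Db2 instances are not yet configured for TCP/IP communication on the SVCENAME port listed in the table, the following sketch shows how that would typically be done on each instance as the instance owner; it assumes port 20000 from the table and that restarting the instance is acceptable:
    db2set DB2COMM=TCPIP
    db2 UPDATE DBM CFG USING SVCENAME 20000
    db2stop
    db2start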

Configure a multiple standby setup

  1. Take an offline backup of the intended principal primary HADR database by using the following command.
    db2 BACKUP DB hadrdb TO backup_dir
  2. Copy the backup image to the other hosts. On each of the intended standby hosts, issue DROP DB to clean up any old databases that might exist and restore the backup image:
    db2 DROP DB hadrdb
    db2 RESTORE DB hadrdb FROM backup_dir
  3. After the databases are restored on all standby hosts, as in a regular HADR setup, the following database configuration parameters must be explicitly set.
    • hadr_local_host
    • hadr_local_svc
    • hadr_remote_host
    • hadr_remote_inst
    • hadr_remote_svc

    On the principal primary, the settings for the hadr_remote_host, hadr_remote_inst, and hadr_remote_svc configuration parameters correspond to the hostname, instance name, and port number of the principal standby. On the principal and auxiliary standby systems, the values of these configuration parameters correspond to the hostname, instance name, and port number of the principal primary.

    In addition, hostnames and port numbers are used to set the hadr_target_list configuration parameter on all the databases. The following example shows the hadr_target_list configuration parameter set for hosts A, B, and C:
     Hostname   Intended role       hadr_target_list
     Host_A     Principal primary   Host_B:20|Host_C:30
     Host_B     Principal standby   Host_A:10|Host_C:30
     Host_C     Auxiliary standby   Host_A:10|Host_B:20

     In addition to the hadr_target_list configuration settings, the hadr_syncmode parameter needs to be set to SYNC across all databases, including the auxiliary standby. Setting SYNC on the auxiliary standby is acceptable because the synchronization mode that is configured with the hadr_syncmode parameter only takes effect when a database becomes the principal primary or principal standby; until then, the auxiliary standby always operates with an effective synchronization mode of SUPERASYNC.

  4. On each of the databases, update the configuration parameters.

    On Host_A (principal primary):
    db2 "UPDATE DB CFG FOR hadrdb USING
    HADR_TARGET_LIST  Host_B:20|Host_C:30
    HADR_REMOTE_HOST  Host_B
    HADR_REMOTE_SVC   20
    HADR_LOCAL_HOST   Host_A
    HADR_LOCAL_SVC    10
    HADR_SYNCMODE     sync
    HADR_REMOTE_INST  db2inst2"
    
    db2 update alternate server for database hadrdb using hostname HOST_C port 20000
    On Host_B (principal standby):
    db2 "UPDATE DB CFG FOR hadrdb USING
    HADR_TARGET_LIST  Host_A:10|Host_C:30
    HADR_REMOTE_HOST  Host_A
    HADR_REMOTE_SVC   10
    HADR_LOCAL_HOST   Host_B
    HADR_LOCAL_SVC    20
    HADR_SYNCMODE     sync
    HADR_REMOTE_INST  db2inst1"
    
    db2 update alternate server for database hadrdb using hostname HOST_C port 20000
    On Host_C (auxiliary standby):
    db2 "UPDATE DB CFG FOR hadrdb USING
    HADR_TARGET_LIST  Host_A:10|Host_B:20
    HADR_REMOTE_HOST  Host_A
    HADR_REMOTE_SVC   10
    HADR_LOCAL_HOST   Host_C
    HADR_LOCAL_SVC    30
    HADR_SYNCMODE     sync
    HADR_REMOTE_INST  db2inst1"
    
    db2 update alternate server for database hadrdb using hostname 192.168.1.81 port 20000
     After you complete the previous updates, the configuration for each database is as follows:
     Configuration parameter    Host_A               Host_B               Host_C
     hadr_target_list           Host_B:20|Host_C:30  Host_A:10|Host_C:30  Host_A:10|Host_B:20
     hadr_remote_host           Host_B               Host_A               Host_A
     hadr_remote_svc            20                   10                   10
     hadr_remote_inst           db2inst2             db2inst1             db2inst1
     hadr_local_host            Host_A               Host_B               Host_C
     hadr_local_svc             10                   20                   30
     Configured hadr_syncmode   SYNC                 SYNC                 SYNC
     Effective hadr_syncmode    N/A                  SYNC                 SUPERASYNC
    Note: The effective hadr_syncmode parameter can be viewed by running the db2pd -db hadrdb -hadr command on each host.
    Note: Verify that the AWS Security policy allows for TCP connections between the ports that are needed for the Db2 instance ports and the HADR service ports. By default, all communications are restricted within the virtual private cloud (VPC). To allow connections, an inbound rule can be configured for the security group that belongs to the VPC. For more information, see Authorize inbound traffic for your Linux instances.
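     As a sketch of such an inbound rule, the following AWS CLI commands allow TCP traffic on the SVCENAME port and on one of the HADR service ports from within the VPC; the security group ID and CIDR block are placeholders that you must replace with your own values:
       aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 20000 --cidr 10.0.0.0/16
       aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 10 --cidr 10.0.0.0/16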

Starting the HADR databases

When your HADR configuration is complete, you need to start the HADR databases on the primary and standby hosts.
  1. Start HADR on the standby databases first by issuing the following command on both Host_B and Host_C.
    db2 START HADR ON DB hadrdb AS STANDBY
  2. Start HADR on the principal primary database. In this example, the primary host is Host_A.
    db2 START HADR ON DB hadrdb AS PRIMARY
  3. To verify that HADR is up and running, query the status of the databases from the principal primary on Host_A by running the db2pd -db hadrdb -hadr command, which returns information about all the standby databases. For example:
    Database Member 0 -- Database HADRDB -- Active -- Up 0 days 13:08:27 -- Date 2021-11-11-05.06.42.980971
    
                     HADR_ROLE = PRIMARY
                   REPLAY_TYPE = PHYSICAL
                 HADR_SYNCMODE = SYNC
                    STANDBY_ID = 1
                 LOG_STREAM_ID = 0
                    HADR_STATE = PEER
                    HADR_FLAGS = TCP_PROTOCOL
           PRIMARY_MEMBER_HOST = HOST_A
              PRIMARY_INSTANCE = db2inst1
                PRIMARY_MEMBER = 0
           STANDBY_MEMBER_HOST = HOST_B
               STANDBY_INSTANCE = db2inst2
                STANDBY_MEMBER = 0
           HADR_CONNECT_STATUS = CONNECTED
    
                     HADR_ROLE = PRIMARY
                   REPLAY_TYPE = PHYSICAL
                 HADR_SYNCMODE = SUPERASYNC
                    STANDBY_ID = 2
                 LOG_STREAM_ID = 0
                    HADR_STATE = REMOTE_CATCHUP
                    HADR_FLAGS = TCP_PROTOCOL
            PRIMARY_MEMBER_HOST = HOST_A
               PRIMARY_INSTANCE = db2inst1
                 PRIMARY_MEMBER = 0
            STANDBY_MEMBER_HOST = HOST_C
               STANDBY_INSTANCE = db2inst3
                STANDBY_MEMBER = 0
           HADR_CONNECT_STATUS = CONNECTED
     Once HADR is running, Pacemaker resources need to be created for cluster management on Host_A and Host_B, which make up the primary site.
  4. Complete the following steps as root:
    1. Create the cluster and Ethernet resource:
      db2cm -create -cluster -domain db2ha -host Host_A -publicEthernet eth0 -host Host_B -publicEthernet eth0
    2. Create the following instance resources:
      db2cm -create -instance db2inst1 -host Host_A
      db2cm -create -instance db2inst2 -host Host_B
    3. On Host_A, create the database resource:
      db2cm -create -db hadrdb -instance db2inst1
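    To confirm that the cluster, instance, and database resources were created, you can list the current cluster configuration with the db2cm utility:
      db2cm -list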

Configuring Overlay IP Address

After you configure the Pacemaker cluster, an Overlay IP needs to be configured on AWS to act as a dynamic virtual IP that applications can connect to. This Overlay IP points to either Host_A or Host_B, depending on which host is the principal primary. For more information on how to configure Overlay IPs, refer to Setting up a Db2 HADR Pacemaker cluster with Overlay IP as Virtual IP on AWS.
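On AWS, an Overlay IP is typically implemented as a VPC route table entry that directs the Overlay IP address to the network interface of the current principal primary host, and the cluster updates that route on failover. As a hedged sketch of the underlying mechanism only (the referenced article describes the full, supported setup), the initial route for the Overlay IP 192.168.1.81 used in this example could be created as follows, where the route table ID and instance ID are placeholders:
   aws ec2 create-route --route-table-id rtb-0123456789abcdef0 --destination-cidr-block 192.168.1.81/32 --instance-id i-0123456789abcdef0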

To ensure application transparency in the event of a disaster recovery takeover, an alternate server list needs to be set up for the auxiliary standby. This list points to an IP address of the auxiliary standby.

The action to take depends on the location of the clients in relation to the VPC:
  • If the clients are all located within the same VPC as the HADR cluster, no action is required. The auxiliary host's IP address can be used as the alternate server IP.
  • If the clients can connect from outside of the VPC, the auxiliary standby must be set up with a public IP address that can be accessed from outside of AWS. AWS provides multiple solutions for this, including, but not limited to, associating an Elastic IP address with the auxiliary standby host, as shown in the sketch that follows.
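As a sketch of the Elastic IP option, the following AWS CLI commands allocate an Elastic IP and associate it with the auxiliary standby's EC2 instance; the instance ID and allocation ID are placeholders. The resulting public IP address would then be used as the alternate server hostname for clients connecting from outside of AWS:
    aws ec2 allocate-address --domain vpc
    aws ec2 associate-address --instance-id i-0fedcba9876543210 --allocation-id eipalloc-0123456789abcdef0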

Performing a manual takeover for disaster recovery

If both Host_A and Host_B go down in AZ1, a manual takeover is needed.

  1. On Host_C, issue a takeover by force so that the auxiliary standby becomes the new primary:
    db2 takeover hadr on db hadrdb by force
    Note: There is no automation on this site, so this should be a temporary state for disaster recovery.
  2. Once either host in the original availability zone comes back online, run the following on either Host_A or Host_B:
    db2 takeover hadr on db hadrdb

    Pacemaker automatically detects that the database is the principal primary again and continues to manage all the resources.
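     After either takeover, you can confirm the database roles by checking the HADR status on the host that you expect to be the primary, for example:
     db2pd -db hadrdb -hadr | grep -E "HADR_ROLE|HADR_STATE|HADR_CONNECT_STATUS"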

Configuring the client connections

Under $INSTHOME/sqllib/cfg, create a db2dsdriver.cfg file to connect to the database. The following example shows a sample db2dsdriver.cfg file:
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
   <dsncollection>
      <dsn alias="HADRDB" name="HADRDB" host="192.168.1.81" port="20000" />
   </dsncollection>
   <databases>
      <database name="HADRDB" host="192.168.1.81" port="20000">
         <acr>
            <!-- Automatic Client Reroute (ACR) is already enabled by default -->
            <parameter name="enableSeamlessAcr" value="true" />
            <!--Enable server list for application first connect -->
            <parameter name="enableAlternateServerListFirstConnect" value="true" />
            <alternateserverlist>
               <server hostname="Host_C" port="20000" />
            </alternateserverlist>
         </acr>
      </database>
   </databases>
</configuration>
Note: The database host and the DSN host need to point to the Overlay IP that is configured in AWS. The Overlay IP address is the first address that the client attempts to connect to. If an outage leaves the principal primary unreachable at that Overlay IP, the client attempts to connect to the server in the alternate server list.
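One way to verify the client configuration is with the db2cli validate utility, which reads db2dsdriver.cfg and can attempt a test connection; the user ID and password shown here are placeholders:
   db2cli validate -dsn HADRDB -connect -user db2inst1 -passwd <password>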

Takeover behavior

Host_A              Host_B              Host_C
Principal primary   Principal standby   Auxiliary standby
With multiple standbys configured in this topology, Pacemaker automates recovery with the following behavior, depending on the failure scenario:
  • Principal primary database failure
    • If the principal primary database fails on Host_A, the principal standby database on Host_B automatically takes over as the principal primary. When the old principal primary database comes back online, it reintegrates as the principal standby.
      Host_A              Host_B              Host_C
      Principal standby   Principal primary   Auxiliary standby
  • Principal standby failure
    • If the principal standby database fails on Host_B, Pacemaker attempts to bring it back online as the principal standby, while all other databases keep their roles.
      Host_A              Host_B              Host_C
      Principal primary   Principal standby   Auxiliary standby
  • Auxiliary standby failure
    • If the auxiliary standby database fails on Host_C, it must be brought back online manually because there is no automation on this host.
      Host_A              Host_B              Host_C
      Principal primary   Principal standby   Down
  • Both the principal primary database and the principal standby fail
    • If a manual takeover by force is issued on Host_C, the databases on Host_A and Host_B both become unmanaged by Pacemaker, and the database on Host_C becomes the principal primary.
      Host_A   Host_B   Host_C
      Down     Down     Principal primary
    • Once either Host_A or Host_B comes back online, a manual takeover needs to be performed to bring the database back to the principal primary site and enable automation. Host_C returns to being the auxiliary standby, and once the other host comes back online in the principal primary site, that host reintegrates as the principal standby.
      Host_A              Host_B              Host_C
      Principal primary   Principal standby   Auxiliary standby