Implementing disaster recovery in IBM PureData System for Transactions

Recovery using DB2 high availability and disaster recovery feature

Learn how to set up and execute disaster recovery for IBM® DB2® V10.5 databases on the IBM PureData® System for Transactions. The solution is based on the DB2 High Availability and Disaster Recovery feature.

Enzo Cialini (ecialini@ca.ibm.com), IBM PureData Systems Chief Architect, IBM

Enzo Cialini has been working with Database Technology at the IBM Toronto Laboratory for ten years and is certified in DB2 UDB Database Administration and DB2 UDB Database Application Development. He is also an Advanced Technical Expert in DB2 for DRDA, and an Advanced Technical Expert in DB2 for Clusters. He is currently responsible for managing the DB2 UDB System Verification Test department, with a focus on high availability, and has been involved with DB2 in high availability environments for many years. His experience ranges from implementing and supporting numerous installations to consulting.



Ian D. M. Hakes (ihakes@ca.ibm.com), IBM PureData Systems Information Development Lead, IBM

Ian Hakes is the lead writer for the IBM PureData System for Transactions product. He has worked on delivering IBM database products and technologies for the last 15 years.



Richard Lubell (lubellri@ie.ibm.com), Information Developer, IBM

Richard Lubell is an Information Developer based in the IBM Dublin lab. He has worked in software development since 2000, creating instructional and educational material for financial transaction and content control applications, leadership development training, and business process mapping. He currently works on documentation for IBM Smart Analytics System and IBM database platforms DBaaS on the cloud.



Paul McInerney (paulmci@ca.ibm.com), User Experience Specialist, IBM Information Management development, IBM

Paul McInerney is a User Experience (UX) specialist with several years of experience designing features for IBM Information Management and database products. He is currently working on the development team for the IBM PureData System for Transactions product.



31 July 2013

Introduction

Enterprise applications need a continuously available data source to keep workflows operating at maximum efficiency. Any downtime due to a localized disaster can leave mission-critical databases inaccessible. Minimizing your outage time and data loss isn't just a business requirement; it's a business necessity.

This article outlines the setup and operation of a disaster recovery (DR) solution for DB2 V10.5 databases on the IBM PureData System for Transactions, Fix Pack 3 or later. The solution is based on the DB2 High Availability and Disaster Recovery (HADR) feature, and includes additional elements that are not handled directly by the HADR feature. HADR replicates any data changes from a primary database source to a standby database target. This technology is fully integrated into DB2 and uses a standard TCP/IP network interface between the primary and standby database systems. HADR provides fast takeover operations for both planned and unplanned events.

HADR in the IBM PureData System for Transactions is designed to provide disaster recovery at the system, instance, and database level. This solution requires two PureData System for Transactions systems. System takeover is implemented manually through high availability and disaster recovery commands and can be integrated into your enterprise strategy for disaster recovery.

This scenario covers the entire lifecycle: setup, normal operations, and conducting a takeover.

Disaster recovery system configuration

This article provides guidance for a disaster recovery solution that involves a specific IBM PureData System for Transactions configuration. The configuration used to illustrate the concepts and tasks consists of two systems (referred to as System A and System B). To accommodate a site-wide disaster recovery scenario, the standby system must be in a geographically separate data center. In this scenario, one system (System A) hosts all the primary databases while the other system (System B) hosts all the standby databases. System A already has application workloads running against the database before it is configured for disaster recovery. Consequently, this scenario is designed to minimize the disruption to the original System A when you add System B as the standby system.

The instructions in this paper use an environment that is described by the following configuration and settings:

Listing 1. Configuration settings
  Primary System == System A
  DB2 Instance == hdrinst1 (small)
  2CF + 2Member
  0 rack6_c2_ite0.example.com 0 rack6_c2_ite0 – MEMBER
  1 rack6_c2_ite1.example.com 0 rack6_c2_ite1 – MEMBER
  128 rack6_c2_ite0.example.com 0 rack6_c2_ite0 – CF
  129 rack6_c2_ite1.example.com 0 rack6_c2_ite1 – CF
  
  DB2 Database == foo_db
  TCPIP Client/Server port (SVCENAME) == 65100

  Standby System == System B
  DB2 Instance == hdrinst1 (small)
  2CF + 2Member
  0 rack7_c2_ite2.example.com 0 rack7_c2_ite2 – MEMBER
  1 rack7_c2_ite3.example.com 0 rack7_c2_ite3 – MEMBER
  128 rack7_c2_ite2.example.com 0 rack7_c2_ite2 – CF
  129 rack7_c2_ite3.example.com 0 rack7_c2_ite3 – CF
  
  DB2 Database == foo_db
  TCPIP Client/Server port (SVCENAME) == 65100

Disaster recovery mechanism

In IBM PureData System for Transactions, HADR is configured and operated on each database. Consistent with standard DB2 HADR, some aspects of the system environment must remain identical across both systems in order for the standby system to perform smoothly after the takeover operation. The following aspects of the primary and standby systems must remain identical:

  • DB2 database configuration
  • DB2 instance configuration
  • System configuration aspects

HADR in a PureData context

While most HADR concepts and tasks apply to HADR on the IBM PureData System for Transactions, there are some unique considerations.

First, a PureData system hosts DB2 pureScale® instances and there are some special considerations for using HADR in DB2 pureScale instances. For details, see High availability disaster recovery in DB2 pureScale environments in the DB2 Information Center.

Second, the system has built-in capabilities for deploying and managing databases. These confer certain advantages and constraints when compared to other DB2 HADR environments.

The main purpose of this article is to provide guidance that is tailored to PureData System for Transactions environments.


Tasks involving disaster recovery

The tasks we describe assume that the primary system (System A) is in service and running database workloads. A second system (System B) is set up in another data center as the standby system, in case a disaster renders System A incapable of processing database transactions.

Task: Configuring system-level aspects

For this task, you will set up a TCP/IP network connection between the System A and System B data networks so that all members in System A can access all members in System B, and vice versa.

Task: Setting up each DB2 instance on the standby system

For this task, you will do the following:

  1. Deploy an instance on System B, based on the existing instance on System A.
  2. Define a strategy to propagate instance-level changes between System A and System B to keep the primary and standby instances identical.

Task: Configuring HADR for each database in a DB2 instance

This task includes the following:

  1. Deploy a new database in the deployed instance on System B.
  2. Restore a database backup of the System A database into the newly deployed database on System B.
  3. Define a strategy to propagate database-level changes between System A and System B to ensure that the database parameters are identical.
  4. Configure and start HADR operations on System A and System B.

At this point, HADR is configured such that System A is being protected by System B.

Task: Using PureData features in an HADR environment

This task includes standard operations on the primary system (System A) or the standby system (System B).

For the primary system, introducing HADR has little or no impact on standard operations. On the standby system, database backup and connect operations are not supported because the databases are in a standby role.

Task: Performing a takeover in a PureData system

This task includes performing a takeover due to a planned or unplanned event.

There are no special considerations for HADR takeover operations on an IBM PureData System for Transactions system as compared to a typical DB2 pureScale system.


System level aspects

System level setup refers to the initial configuration required on the systems before they are ready for use by the database administrators. This section covers the hardware, networking, and data center integration tasks that are handled by system administrators.

Hardware prerequisites

Both the primary and standby systems must be IBM PureData System for Transactions systems. The standby system must have the same number of compute nodes/members as the primary system instance that requires HADR capability. This requirement ensures that all of the compute nodes/members that are used by the primary instance are covered by the capacity of the standby system.

Security prerequisites

If your systems control security through console users:

  • User accounts must be created on both the primary system and the standby system with identical passwords. This configuration is necessary to ensure user access continues after any disaster recovery event.
  • For HADR configuration tasks, the console users on both systems must have the Create new DB2 pureScale instances authority to deploy instances.

If your systems use LDAP users and groups for security processing:

  • The primary and standby systems must have access to the same LDAP server content. This configuration is necessary to ensure user access continues after any disaster recovery event.
  • You must configure and enable the LDAP plug-in through the Workload Console before deploying the instances. Click Cloud > System Plug-ins. Select the LDAP plug-in and click Configure. Contact your LDAP administrator to fill in the plug-in configuration details.
  • For HADR configuration, the LDAP accounts must have the Create new DB2 pureScale instances authority to deploy instances.

The primary system and the standby system must use the same form of authentication, either LDAP or console accounts. Mixed security methods are not supported.

User authorization levels

The user that performs the HADR configuration procedures must be the instance owner (or a designated proxy), or a database owner with SYSADM or SYSCTRL authority. This user must exist on both systems.

The instance owner, or any other console user with administrator rights on the instance, can grant the appropriate DB2 administration authorization to an instance or database level user through the Workload Console.

Instance level users are managed from the DB2 pureScale Instances panel (click Database > DB2 pureScale Instances). Select the instance and add the user account to the Access granted to list.

Database level users are managed from the Databases panel (click Database > Databases). Select the database and add the user account to the Access granted to list.

Network

A network administrator must connect the primary and standby systems to an existing high-speed, high-capacity TCP/IP network that links the data centers. Each member in the primary system must be able to access the members in the standby system.

You can validate the correct network configuration by logging into each member on the primary system instance and confirming that you can ping each member in the standby system. Similarly, log into each member on the standby system instance and confirm that you can ping each of the primary system instance members.
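
For example, a short loop such as the following (a minimal sketch that assumes the standby member host names from Listing 1) can be run on each primary member; reverse the host list when you test from the standby members:

  # Ping each standby member three times and flag any member that does not respond.
  for host in rack7_c2_ite2 rack7_c2_ite3; do
      ping -c 3 "$host" || echo "Cannot reach $host"
  done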

Configuring access to Tivoli Storage Manager

To facilitate the database restore operation on the standby system, you must ensure that the standby system has access to the database backup image taken on the primary system.

Before you begin, ensure that Tivoli® Storage Manager is already installed and configured on a central server that is accessible by both systems, or on separate Tivoli Storage Manager servers that are connected to each other via Tivoli Storage Manager node replication.

Configure and enable the TSM plug-in through the Workload Console before deploying the instances. Click Cloud > System Plug-ins. Select the TSM plug-in and click Configure.


Workload level setup

Workload level setup refers to the configuration required on database workloads in the PureData System for Transactions. This configuration is typically handled by database administrators and is divided into two levels:

  1. The setup that is required for each DB2 instance that hosts one or more standby databases.
  2. The setup that is required for each DB2 pureScale standby database.

Setting up the DB2 instance on the standby system

Deploy a DB2 Version 10.5 instance on the standby system. Any or all active databases in a DB2 Version 10.5 instance on the primary system can be configured for disaster recovery.

To set up a standby instance:

Deploy a new instance on the standby system. In the Workload Console, click Database > DB2 pureScale Instances. On this panel, click the green plus icon and then specify the options for your new instance. This instance must be the same size, with the same number of DB2 pureScale members, as the active primary instance. The standby instance must also have the same name as the instance on the primary system.

If not already completed, create and update any console-specific user accounts for the standby instance owner to provide access to this instance.

If you have updated the default instance configuration settings on the primary system, make the same changes to the new standby instance on the standby system to ensure that operations proceed smoothly after the HADR takeover process.

Appendix B contains the details of a script that automates comparing and updating the configuration settings on the primary and standby systems.
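
If you prefer to compare the instance settings manually, a minimal sketch such as the following captures the database manager configuration and the registry variables on each system so that the output files can be diffed (the file names are illustrative):

  # Run as the instance owner (hdrinst1) on a member of each system.
  db2 GET DBM CFG > dbm_cfg_$(hostname).txt
  db2set -all     > db2set_$(hostname).txt

  # Copy the files to one location, then compare them, for example:
  diff dbm_cfg_rack6_c2_ite0.txt dbm_cfg_rack7_c2_ite2.txt
  diff db2set_rack6_c2_ite0.txt  db2set_rack7_c2_ite2.txt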


HADR ports

Both the primary and standby databases require a port for the HADR service. If there is more than one database that is configured for HADR on the same instance, define one port for each database.

If there is a firewall between the primary system and the standby system, ensure that there is an exception that is defined in both directions for the ports that are used by the databases, including the HADR ports.

The port that is used for HADR service should not be the same as any port used for typical DB2 database operations. The default DB2 pureScale ports are 50001, 56000, 56001, and 60000 - 60004.
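
Before you settle on an HADR port, you can check from a member's command line that a candidate port (4000 is the value used later in this article) is neither reserved in /etc/services nor currently in use. This is a minimal sketch; no output from either command suggests that the port is available:

  grep -w 4000 /etc/services
  netstat -an | grep -w 4000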

During the HADR setup, the defined port is used in the database parameters HADR_LOCAL_SVC, HADR_REMOTE_HOST and HADR_TARGET_LIST.

Your network administrator must configure the network connectivity routing so that all members in System A can access all members in System B and vice versa.


Verifying network connectivity

To verify connectivity between the primary and standby systems:

  1. On the primary system, determine the database topology through the Workload Console by clicking Database > Databases. Select your database and record the Host and IP address values in the Show database members panel.
  2. On the standby system, confirm that the topology of the standby database is the same by repeating the previous step on the standby system. The number of hosts should be the same and all should have a status of STARTED, though the hosts and IP addresses will be different.
  3. On each member of the primary system, test connectivity by pinging each standby system member using the host name. From Member 0 on the primary system, ping Member 0 and Member 1 of the standby system:
    ping rack7_c2_ite2
    ping rack7_c2_ite3

    Repeat this step for Member 1 on the primary. If this step is successful, repeat this step from the standby system to the primary system.

  4. If the previous step is not successful, repeat the process using the IP addresses displayed in the Workload Console instead of the host names of the standby system nodes:
    ping 20.155.4.142
    ping 20.155.4.143

    If this step is successful, repeat from the standby system to the primary system, then continue with the HADR setup process. If this step is not successful, contact your network administrator to review the network routing between the primary and standby systems.


Setting up the standby database

Databases on the primary and standby systems must be configured with the same operating system version, DB2 version, table space configuration, and log file space.
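
A quick way to confirm the DB2 level and operating system level from the command line (a minimal sketch; run it on a member of each system and compare the output):

  db2level    # reports the DB2 version, fix pack, and build level
  uname -a    # reports the operating system and kernel level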

Creating the standby database

Deploy the standby database by restoring a backup image of the existing primary database to the standby system.

On the standby system:

  1. Use the Workload Console and the same instance owner account as the primary instance (for example, hdrinst1) to deploy an empty database with the same database name as the primary database (for example, foo_db). Click Database > Databases. On this panel, click the green plus icon and then specify the options for your new database.
  2. To prevent connections to the database from the database performance monitor, you must turn off the data collection. In the Workload Console, click Database > Database Performance Monitor. Click Databases and select the empty HADR database (foo_db) from the list of connections. Click the Monitor drop-down list and select Disable Automatic Collection.
  3. You must also disable the health alerts for the database. In the Database Performance Monitor console, click Health > Alerts and open the Health Alert Configuration tab. Select the empty HADR database (foo_db) and remove the checkmark from the Monitor database health and click Apply.
  4. Restore the primary system database backup image into the new empty database (foo_db) on the standby system.

To deliver the backup image of the primary database:

  1. On the primary system, back up the database to Tivoli Storage Manager by using the backup panel in the PureData System for Transactions system console.
  2. If the primary and standby systems use different Tivoli Storage Manager servers, use Tivoli Storage Manager node replication to move the backup image from the primary to the standby system.
  3. Make note of the time stamp of the most recent backup image on the primary system and confirm that an image with the same time stamp is available on the standby system.
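
If the backup image was sent to Tivoli Storage Manager, one way to list the available images and their time stamps is the db2adutl utility (a minimal sketch; run it as the instance owner on a member of each system and compare the reported time stamps):

  db2adutl query full db foo_db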

To restore the backup image, you can use the command line to restore the standby database from the primary backup.

From a command line on the DR system:

  1. Deactivate the standby database before running the restore command:
    db2 DEACTIVATE DB <mydb>
  2. Run the restore command on any node:
    db2 RESTORE DB <mydb> INTO <mydb> USE tsm TAKEN AT <timestamp> ON <datapath>

    where <mydb> is the name of the database, <timestamp> is found in the primary database backup delivery task, and <datapath> is the location on the disk where the database data resides.

    You can find the <datapath> by running the df -h command on the DR system. The <datapath> value follows the format:

    .../db2sd_data_<instance_name>_<mydb>_<generated_ID>

    For example, you can run:

    db2 DEACTIVATE DB foo_db
    db2 RESTORE DB foo_db INTO foo_db USE tsm TAKEN AT 201307251234 \ 
    ON .../db2sd_data_hdrinst1_foo_db_fe:d0:22:23:eb:35:dd

    When you restore the database on the DR system, leave the database in the rollforward pending state. Otherwise, the HADR start command cannot complete, leaving the standby database in a failed state.

  3. On the DR system, update the NEWLOGPATH and MIRRORLOGPATH parameters to match the existing file system in the standby database. You can find the existing path by running the df -h command on the DR system. The path values for log and mlog follow the format:
    .../db2sd_log_<instance_name>_<mydb>_<generated_ID>
    .../db2sd_mlog_<instance_name>_<mydb>_<generated_ID>
  4. Run the following command to update the standby database configuration file:
    db2 UPDATE DB CFG FOR <mydb> USING \
    NEWLOGPATH /db2sd_log_<instance_name>_<mydb>_<generated_ID> \
    MIRRORLOGPATH /db2sd_mlog_<instance_name>_<mydb>_<generated_ID>
                            
    db2 DEACTIVATE DB <mydb>

To complete the deployment, run the following task:

  • Re-enable the database performance monitor. On the standby system, turn the database performance monitor data collection back on. In the Workload Console, click Database > Database Performance Monitor. Click Databases and select the now restored standby database copy (foo_db) from the list of connections. Click the Monitor drop-down list and select Enable Automatic Collection.

The database performance monitor does not provide monitoring of the standby system database because the database is in a standby role and does not accept connections. However, if you enable the database performance monitor for the standby system database, monitoring operations start when the standby system database assumes primary operations.


Configuring HADR

Background

You must be the database owner or a user with SYSCTRL or SYSADM authority to perform these operations.

The HADR configuration task must be performed once for each standby database.

Appendix B contains scripts with instructions on how to verify the HADR database parameters and automate the HADR configuration processes on the primary and standby systems.

Confirm that the following database parameters are set to the recommended values for an HADR configuration. In the command line, run the command:

db2 get db cfg for foo_db
  • LOGINDEXBUILD – Set to ON. This action prevents the standby system from waiting for invalid indexes to build and provides the best performance for HADR. The default setting is OFF.
  • LOGARCHMETH1 – Set to TSM; setting to a value other than OFF disables circular logging. Circular logging reuses log space when the log fills up, potentially leading to transactions being lost if the logs fail to sync before the primary log fills and resets. This parameter is set to TSM when the system is registered with a Tivoli Storage Manager server.
  • INDEXREC – When set to RESTART, invalid indexes are rebuilt at time of takeover. The default is RESTART.

If any of these parameters are incorrect, correct the values by using the following command:

db2 UPDATE DB CFG FOR <database> USING <parameter name> <parameter value>
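
For example, a minimal check of just these three parameters (the grep pattern relies on the parameter names that appear in the GET DB CFG output), followed by an illustrative correction:

  db2 get db cfg for foo_db | grep -E "LOGINDEXBUILD|LOGARCHMETH1|INDEXREC"

  # Example correction if LOGINDEXBUILD is reported as OFF:
  db2 UPDATE DB CFG FOR foo_db USING LOGINDEXBUILD ON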

The following parameters identify the HADR hosts and services. Any port numbers that they use should not be the same as the port numbers that are dedicated to the primary and standby DB2 pureScale instances:

  • HADR_LOCAL_HOST – The name of the local host. The value of this parameter on the primary system should be the same as the value of HADR_REMOTE_HOST on the standby system. The value of this parameter on the standby system should be the same as the value of HADR_REMOTE_HOST on the primary system.
  • HADR_TARGET_LIST – The list of target host and port pairs that represent the standby database.
  • HADR_REMOTE_HOST - The name of the remote host. The value of this parameter on the primary should be the same as the value of HADR_LOCAL_HOST on the standby. The value of this parameter on the standby should be the same as the value of HADR_LOCAL_HOST on the primary system.
  • HADR_LOCAL_SVC – The TCP/IP service name or port number where the local HADR system accepts connections.
  • HADR_REMOTE_INST – The name of the remote instance. The value of this parameter should be the same on both the primary and the standby.

See the section on HADR ports for a description of which ports are needed by DB2 pureScale database operations. For example, on the primary system, the parameters might look like:

HADR_TARGET_LIST {rack7_c2_ite2:4000|rack7_c2_ite3:4000}
HADR_REMOTE_HOST {rack7_c2_ite2:4000|rack7_c2_ite3:4000}
HADR_REMOTE_INST hdrinst1
HADR_SYNCMODE ASYNC

And on the standby system, the parameters would be similar:

HADR_TARGET_LIST {rack6_c2_ite0:4000|rack6_c2_ite1:4000}
HADR_REMOTE_HOST {rack6_c2_ite0:4000|rack6_c2_ite1:4000}
HADR_REMOTE_INST hdrinst1
HADR_SYNCMODE ASYNC

Process

Global level configuration commands need to be run on only one member in the cluster. The configuration changes apply to all members.

Member level configuration commands must be applied to each member to update local IP addresses or member names.

All commands should be run from the DB2 command line.

Configure the HADR settings (database configuration values) for the database first on the primary system (System A), then on the standby system (System B).

On System A:

  • Global level parameter updates, run on any member node:
    db2 "UPDATE DB CFG FOR foo_db USING
     HADR_TARGET_LIST {rack7_c2_ite2:4000|rack7_c2_ite3:4000}
     HADR_REMOTE_HOST {rack7_c2_ite2:4000|rack7_c2_ite3:4000}
     HADR_REMOTE_INST hdrinst1
     HADR_SYNCMODE ASYNC"
  • Member level parameter updates:
    • For Member 0:
      db2 "UPDATE DB CFG FOR foo_db MEMBER 0
       USING HADR_LOCAL_HOST rack6_c2_ite0
          HADR_LOCAL_SVC 4000"
    • For Member 1:
      db2 "UPDATE DB CFG FOR foo_db MEMBER 1
       USING HADR_LOCAL_HOST rack6_c2_ite1
          HADR_LOCAL_SVC 4000"

On System B:

  • Global level parameter updates, run on any member node:
    db2 "UPDATE DB CFG FOR foo_db USING
     HADR_TARGET_LIST {rack6_c2_ite0:4000|rack6_c2_ite1:4000}
     HADR_REMOTE_HOST {rack6_c2_ite0:4000|rack6_c2_ite1:4000}
        HADR_REMOTE_INST hdrinst1
        HADR_SYNCMODE ASYNC"
  • Member level parameter updates:
    • For Member 0:
      db2 "UPDATE DB CFG FOR foo_db MEMBER 0 USING
       HADR_LOCAL_HOST rack7_c2_ite2
          HADR_LOCAL_SVC 4000"
    • For Member 1:
      db2 "UPDATE DB CFG FOR foo_db MEMBER 1 USING
       HADR_LOCAL_HOST rack7_c2_ite3
          HADR_LOCAL_SVC 4000"

At this stage, the HADR related parameters are set and the systems are ready to start the HADR process.

Start HADR

On the primary system (System A), you must start HADR on all members on the primary database. On the standby system (System B), HADR needs to be started on only one of the members. That member is designated as the preferred replay member.

On System B:

You must deactivate the database on the standby system before you start HADR:

db2 DEACTIVATE DATABASE foo_db

For Member 0, issue the following command:

db2 START HADR ON DATABASE foo_db AS STANDBY

On System A:

For ALL members, issue the following command:

db2 START HADR ON DB foo_db AS PRIMARY

Verify that HADR is up and running properly:

On System A, call the MON_GET_HADR table function:

db2 "SELECT LOG_STREAM_ID, PRIMARY_MEMBER, HADR_ROLE,
 STANDBY_MEMBER, STANDBY_ID, HADR_STATE,
 varchar(PRIMARY_MEMBER_HOST,30) AS PRMRY_MEMBER_HOST,
 varchar(STANDBY_MEMBER_HOST,30) AS STNDBY_MEMBER_HOST
    FROM TABLE (MON_GET_HADR(-2))"

Sample output:

LOG_STREAM_ID PRIMARY_MEMBER HADR_ROLE STANDBY_MEMBER
------------- -------------- --------- --------------
1             1              PRIMARY   0
0             0              PRIMARY   0

Sample output continued:

STANDBY_ID HADR_STATE PRMRY_MEMBER_HOST STNDBY_MEMBER_HOST
---------- ---------- ----------------- ------------------
1          PEER       rack6_c2_ite1     rack7_c2_ite2
1          PEER       rack6_c2_ite0     rack7_c2_ite2

Confirm that every one of the primary members is present and each one has an HADR_ROLE of PRIMARY.

Confirm that member 0 on System B, the preferred replay member, is the current replay member by looking at the STANDBY_MEMBER field. Every log stream reports the same standby member because all the primary members are connected to that standby member.

You can also use the db2pd command to return detailed information about the HADR environment:

db2pd -db foo_db -hadr
Listing 2. Sample output for the db2pd command
Database Member 0 -- Database foo_db -- Active -- 
    Up 0 days 01:00:24 -- Date 2013-07-24-20.33.16.378486

                 HADR_ROLE = PRIMARY
               REPLAY_TYPE = PHYSICAL
             HADR_SYNCMODE = ASYNC
                STANDBY_ID = 1
             LOG_STREAM_ID = 0
                HADR_STATE = PEER
                HADR_FLAGS =
       PRIMARY_MEMBER_HOST = 20.155.4.140
          PRIMARY_INSTANCE = hdrinst1
            PRIMARY_MEMBER = 0
       STANDBY_MEMBER_HOST = 20.155.2.140
          STANDBY_INSTANCE = hdrinst1
            STANDBY_MEMBER = 0
       HADR_CONNECT_STATUS = CONNECTED
  HADR_CONNECT_STATUS_TIME = 07/24/2013 20:29:39.895023 (1374697779)

At this point, you have successfully configured disaster recovery for the IBM PureData System for Transactions database using HADR.

Configure client rerouting

If you want to have client applications automatically route connection requests to the standby system after a disaster recovery event occurs on the primary system, then you can follow the steps in this section.

Member-level parameter updates:

  • On System A:

    For Member 0:

    db2 "UPDATE ALTERNATE SERVER FOR DATABASE foo_db
        USING HOSTNAME rack7_c2_ite2 PORT 65100"
  • On System B:

    For Member 0:

    db2 "UPDATE ALTERNATE SERVER FOR DATABASE foo_db
        USING HOSTNAME rack6_c2_ite0 PORT 65100"

Alternatively, you can use the ADMIN_CMD procedure:

CALL SYSPROC.ADMIN_CMD ('UPDATE ALTERNATE SERVER FOR DATABASE foo_db
    USING HOSTNAME rack7_c2_ite2 PORT 65100')
CALL SYSPROC.ADMIN_CMD ('UPDATE ALTERNATE SERVER FOR DATABASE foo_db
    USING HOSTNAME rack6_c2_ite0 PORT 65100')
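
To confirm that the alternate server information was recorded, you can inspect the local database directory on each system; the entry for foo_db should list the alternate server host name and port number (a minimal sketch):

  db2 LIST DATABASE DIRECTORY | grep -i "alternate server"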

Non-logged operations

HADR handles only logged database operations; other aspects of workload disaster recovery must be addressed separately. These aspects include the following items:

  • Setting up operating system user IDs at the system level
  • Database manager configuration parameters at the instance level
  • DB2 registry variables at the instance level
  • Database configuration parameters at the database level
  • Restoring external routines at the database level

If you use a manual process for these items, then you must discover them and propagate any changes to the standby system during normal operations.

Appendix B details a script that simplifies the task of comparing the database and instance configurations between the systems. The script produces a list of the commands that are required to adjust those settings on the standby system.
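
For example, external routine libraries are not shipped by HADR, so any routine libraries must be copied to the standby instance manually. The following is a minimal sketch that assumes the routines live in the default sqllib/function directory of the hdrinst1 instance and that the instance owner can use scp between the systems:

  # Run as the instance owner on a primary member; copy the routine
  # libraries to the corresponding member on the standby system.
  scp ~/sqllib/function/* hdrinst1@rack7_c2_ite2.example.com:~/sqllib/function/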


Normal operations

System health

The running HADR service provides several metrics to measure the health of your setup.

When the MON_GET_HADR table function is called on the primary system, it returns information about HADR functionality on both the primary and standby databases.

However, because the standby does not accept client connections, use the db2pd command to query information directly on the standby system.

The following fields, returned from either the MON_GET_HADR table function or the db2pd command, detail your connection status:

  • HEARTBEAT_MISSED – The number of heartbeat messages that are not received on time for this log stream since the database started on the local member.
  • HEARTBEAT_EXPECTED - The number of heartbeat messages that are expected on this log stream, since the database started on the local member. Comparing this value to the value of HEARTBEAT_MISSED gives you a measure of network health during a particular time interval.
  • STANDBY_SPOOL_PERCENT - The percentage of spool space that is used in relation to the configured spool limit. This value is an indicator of how much HADR log spooling space is being used.
  • STANDBY_ERROR_TIME - The last time at which the standby system encountered a major error.
  • HADR_CONNECT_STATUS - The current HADR connection status of the database. Possible values are CONGESTED, CONNECTED, and DISCONNECTED.
  • TIME_SINCE_LAST_RECV - The time, in milliseconds, that elapsed since the last message was received. When this value exceeds the defined heartbeat interval, a heartbeat is considered to be missed.
  • STANDBY_RECV_REPLAY_GAP - The average number of bytes in the gap between the standby log receive position and the standby log replay position. This field indicates whether the standby log is falling behind. When the standby log falls behind, there is a risk that the standby database stops receiving logs and blocks the peer state of the primary database.
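
For example, the following query (a minimal sketch that follows the same pattern as the earlier MON_GET_HADR example) returns these health fields for all log streams when it is run on the primary system:

  db2 "SELECT LOG_STREAM_ID, HADR_CONNECT_STATUS, HEARTBEAT_MISSED,
   HEARTBEAT_EXPECTED, STANDBY_SPOOL_PERCENT, TIME_SINCE_LAST_RECV,
   STANDBY_RECV_REPLAY_GAP
   FROM TABLE (MON_GET_HADR(-2))"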

See MON_GET_HADR table function in the DB2 Version 10.5 Information Center for a full list of monitoring information that is returned by this table function.


Takeover operations

Planned event takeover

You can switch roles between primary and standby databases for planned events by using the DB2 TAKEOVER HADR command.

To execute the planned takeover of a database, access the standby system instance (System B) and issue the following CLP command:

db2 TAKEOVER HADR ON DATABASE foo_db

Confirm that the database is accessible on System B by running the connect command:

db2 CONNECT TO foo_db

At this point, the database on System B is the primary and the database on System A is the standby.

To fail back after a planned takeover event, that is, to have the primary database running on System A and the standby database running on System B, issue a takeover from System A to perform the role switch:

db2 TAKEOVER HADR ON DATABASE foo_db

Confirm that the database is again accessible on System A by running the connect command:

db2 CONNECT TO foo_db

For further details about the TAKEOVER command, see HADR commands in the DB2 Version 10.5 Information Center.

Unplanned event takeover

Any unplanned failover must be executed manually; there is no support for automatic disaster recovery takeover.

Issue the following command from the standby system (System B):

db2 TAKEOVER HADR ON DATABASE foo_db BY FORCE
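
After the forced takeover completes, you can confirm the role change on System B with the same db2pd command that was used earlier (a minimal sketch):

  # Run on a member of System B; HADR_ROLE should now report PRIMARY.
  db2pd -db foo_db -hadr | grep -E "HADR_ROLE|HADR_STATE|HADR_CONNECT_STATUS"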

Conclusion

This article covered the setup and operation of an alternative disaster recovery solution that is specific to IBM PureData System for Transactions. The solution is based on the HADR features in DB2 Version 10.5, leveraging the industry-leading capabilities of DB2 while also addressing aspects specific to IBM PureData System for Transactions.


Acknowledgements

The authors wish to gratefully acknowledge the following people for their input, contributions, reviews, and corrections: Diaa El-Din Ali, Marco Bonezzi, Paddy Burke, Yvonne Chan, Ciaran De Buitlear, Belarmino Fernandez, Brian McKeown, Nancy Miller, Andriy Miranskyy, Ryan Paton, Ismail Rawoof, Pablo Perez Rodriguez, Armin Tabrizi, and Michael Zuliani.


Appendix A: Concepts and terminology

Disaster recovery refers to restoring database operations in the case of a catastrophic loss of the primary data system. The extent of this data loss can range widely depending on the severity of the event, and whether the outage is based on software, network, hardware, or even full site failure.

HADR solution capabilities are defined by the amount of storage that is dedicated to backups, the capacity of your network, and archive logging policies. Business needs are also a consideration:

  • RPO (Recovery Point Objective) – the maximum amount of data loss that the solution allows. The solution supports HADR in ASYNC and SUPERASYNC synchronization modes. The RPO values for these sync modes are the same as for systems other than PureData System for Transactions, namely zero data loss for planned events and near-zero data loss for unplanned events.
  • RTO (Recovery Time Objective) – the elapsed time between manually initiating a takeover and the point at which applications can connect to the database; this time is measured in seconds per database. It does not include the time between identifying a problem and an operator manually initiating the takeover. Multiple databases can be switched over simultaneously, so a complete system RTO is also measured in seconds.

HADR setup consists of a primary system that is connected through a TCP/IP connection to a disaster recovery system, which might be in another city. As changes occur on a database on the primary system, log data is shipped continuously from the primary database and reapplied to a database on the standby system. This process continuously updates the standby database, allowing for quick takeover operations. Dedicated connections allow the primary and standby databases to maintain uninterrupted contact and establish the replication state.

Standby system refers to the IBM PureData System for Transactions that is being set up to handle database failover operations in case of a catastrophic event that renders the primary system unavailable.

Failover refers to the forced, rapid change in status of the standby database to become the primary database in order to maintain data availability. This change of status should occur only when the primary database is non-functional. HADR reports a failure when heartbeat messages indicate a break in communication between the two systems, but a reported failure does not always require a failover. Because failover is initiated manually when the primary fails, an operator can confirm the failure first, which helps prevent unnecessary interruptions in service.

Takeover refers to the role change between primary and standby systems during non-emergency or planned circumstances.

Failback occurs after a takeover operation. The original primary is brought back up and returned to primary status. A role change is then made to switch the current "failover" primary back to its role as the original standby.


Appendix B: System configuration scripts

Legal Disclaimer: These programs have not been tested under all conditions and are provided by IBM without obligation of support of any kind.

IBM PROVIDES THESE PROGRAMS "AS IS" WITH NO WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING THE WARRANTIES OF TITLE, NON-INFRINGEMENT OR INTERFERENCE AND THE IMPLIED WARRANTIES AND CONDITIONS OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. Some jurisdictions do not allow for the exclusion or limitation of implied warranties, so the above limitations or exclusions may not apply to you. IBM shall not be liable for any damages you suffer as a result of using, modifying or distributing these programs or their derivatives.


Verification of system configuration settings

During the HADR setup, the database administrator needs to gather the instance and database configuration settings and the registry variables for both systems. Only when the system settings have been compared and updated is HADR initiated on the primary and standby systems. This process of comparing, contrasting, and updating the instance and database configurations is laborious and can be prone to user error.

To simplify this task, we include an 'as-is' Perl script to do the following tasks:

  • Connect to a database on the primary (System A) and standby (System B) instances
  • Retrieve and compare the instance and database configurations and registry variables
  • Produce a list of commands that are required to adjust the configuration on System B so that it is identical to the System A configuration.

This script must run on a client computer that is connected to both HADR systems and that has DB2 or DB2 Connect and Perl installed. The DB2 and Perl executable binaries must be in the $PATH environment variable. Open a command window and run the script with the following command:

Listing 3. Sample script
  perl cfg_replicator.pl -ph <primary_host> -sh <secondary_host>
    -in <instance_name> -pn <primary_host_portnumber>
    -sn <secondary_host_portnumber> -db <database_name>
    -u <username> -p <password> -d <update_direction>
    -o <output_file_name>

Where the script takes the following parameters:

  • -ph <primary_host>
    The host name or IP address of the primary host (System A)
  • -sh <secondary_host>
    The host name or IP address of the standby host (System B)
  • -in <instance_name>
    The name of the instance
  • -pn <primary_host_portnumber>
    The port number of the primary instance (on System A)
  • -sn <secondary_host_portnumber>
    The port number of the standby instance (on System B)
  • -db <database_name>
    The name of the database
  • -u <username>
    The user ID (this user must have permission to get instance and database configuration)
  • -p <password>
    The password for that user ID
  • -d <update_direction>
    The direction of "mirroring" the statements (can be either left or right). If you set the value to left, the script produces the commands that are needed to set the standby (System B) configuration identical to the primary (System A) configuration. You would use these commands in the standard scenario that is outlined in this white paper. If you want instead to set the configuration of the primary equal to the configuration of the standby system, set the parameter to right to generate the necessary commands.
  • -o <output_file_name>
    The name of the output file. This file contains the generated commands needed to align the configurations.

In addition to those parameters, the script has the following parameters:

  • -help: provides additional technical details
  • -man: provides additional technical details

The script attempts to mirror all configuration parameters. However, some parameters (for example, log paths) may differ between the primary and standby systems. Review the output of the script before you run the generated commands.
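
For example, in the scenario used throughout this article, an invocation might look like the following (a hypothetical sketch: the user name, password, and output file name are illustrative, and the port numbers assume the instance SVCENAME value 65100 from Listing 1):

  perl cfg_replicator.pl -ph rack6_c2_ite0.example.com -sh rack7_c2_ite2.example.com \
    -in hdrinst1 -pn 65100 -sn 65100 -db foo_db \
    -u hdrinst1 -p passw0rd -d left -o sync_commands.txt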


Configuring all HADR settings

The process of configuring and setting up HADR on a PureData System for Transactions is complex and the repetitive nature of some tasks can be prone to user error.

This 'as-is' script is included to assist you with the following tasks:

  • Update the HADR database configuration parameters for each database
  • Update the alternative server database configuration parameter for each database
  • Enable automatic client reroute (ACR)
  • Start HADR on the standby and primary systems
  • Show the status of the HADR setup

This script has the following prerequisites:

  • Two DB2 Version 10.5 pureScale instances (the script works for any number of members)
  • A primary database that is configured with LOGARCHMETH1 set to TSM and LOGINDEXBUILD set to ON
  • A backup of the primary database that was done with these parameters defined. The backup could be done via Tivoli Storage Manager or manually using standard DB2 backup
  • A primary database able to receive new connections
  • The same instance user for both instances
  • The user that runs the script must have SYSADM or SYSCTRL authority
  • The standby instance must be created with the same instance name as on the primary
  • The primary system must have network connectivity to the standby system

For the purposes of this script, the primary member is the member on the primary system where the script is run. The replay member is the standby member that is specified as a parameter when the script is run; this standby member receives and processes the logs from all primary members.

This script first checks all the prerequisites that must be met before it continues with the configuration of HADR. The script also checks for basic connectivity between the primary member (from where you run the script) and the replay member. After all the requirements are checked, the script proceeds with the HADR configuration on the standby instance and then on the primary instance. The standby database is configured to point to the primary database; and vice versa for the primary database.

The next step in the script is to activate and start HADR. After the main setup is done and the database configuration is updated on every member, the script deactivates the standby database and then starts HADR on the standby system using the main member (replay member) in the standby instance. After HADR is started on the standby instance, the script then starts HADR on each member of the primary instance.

At this point, HADR starts to sync and catch up all the logged information from the primary database logs to the standby database logs.

The last step in the configuration is to enable the automatic client reroute feature. This feature updates the alternative server information on each member so that when a takeover occurs, any client connections are redirected to the new primary system.

After the successful update of the alternative server settings, the script verifies the result of the configuration and checks the actual status of the HADR setup. At this point, everything should be started and configured as expected, with all primary members in peer status.

Run this script from the home directory of the database owner user on the first member of the primary database. It must be run by the database owner that is specified at database creation time, or by a user with SYSCTRL or SYSADM authority. This user must be the same on every member and must have the same password on every member.

After the file is copied to the home directory on the primary member, expand the compressed file:

tar -zxvf pdtx_hadr.tar.gz

Inside the uncompressed hadr directory, carefully read the instructions in the README.txt file, and then use the following command to run the script:

./pdtx_hadr_setup.sh -d <database> -s <standby_member> -a

In the example scenario outlined in this tutorial:

./pdtx_hadr_setup.sh -d foo_db -s 0 -a

Where the script takes the following parameters:

  • -d <database>
    The database that you want to configure with HADR
  • -s <standby_member>
    The standby member. This member is the first member in the standby instance (the replay member)
  • -a
    Apply the configuration. If the script is called without the -a parameter, it only writes the configuration steps into the configuration files that are in $HOME/hadr/cfg/ on the member where the script is running. If -a is specified, the script applies all the configuration changes, starts HADR on the standby and primary databases and finally configures the automatic client reroute feature.

With the configuration applied, the returned result of running the script is the current status of the HADR setup.


Download

  Description                      Name          Size
  Sample files for this article    samples.zip   22KB
