Set up and manage disaster recovery for pureScale using HADR

A practical guide

IBM® DB2® V10.5 includes high availability disaster recovery (HADR) support for pureScale® to provide integrated disaster recovery. This article discusses in detail how to set up HADR for pureScale and how to manage and monitor your HADR databases. Also learn about troubleshooting various problematic scenarios in the HADR for pureScale environment.


Yugandhra Rayanki (yrayanki@in.ibm.com), DB2 LUW Quality Assurance, IBM

Yugandhra Rayanki has been with IBM for over 9 years and has worked with different teams within the DB2 Quality Assurance group. He is currently working with system verification testing for DB2 pureScale. Yugandhra is an IBM DB2 Certified Advanced DBA.



Dinoop Thomas (dinthoma@in.ibm.com), DB2 LUW Quality Assurance, IBM

Dinoop Thomas has been with IBM for over 6 years and works with DB2 Quality Assurance group. Previously, he worked as a TXSeries L3 Support Engineer. Dinoop currently is focused on system verification testing for DB2 pureScale.



Tamilselvan Narayanaswamy (tnarayan@in.ibm.com), DB2 LUW Quality Assurance, IBM

Tamilselvan Narayanaswamy has been with IBM for 13 years and has worked with the DB2 Quality Assurance team for the past 3.5 years. He specializes in DB2 high availability, disaster recovery, and pureScale technologies. Previous positions within IBM include advanced support and quality for an online transaction processing product.



Aslam Nomani (aslam@ca.ibm.com), DB2 LUW Quality Assurance Test and Tooling, IBM

Aslam Nomani has spent most of his 14 years with IBM with the Quality Assurance team. His main area of focus has been high availability and disaster recovery solutions. Over the past several years, Aslam has been the quality assurance architect for the DB2 pureScale Feature.



20 March 2014

Introduction

DB2 pureScale, IBM's premium clustered database solution for transaction processing environments, provides extreme scalability and continuous availability. In DB2 V10.5, high availability disaster recovery (HADR) support has been enabled for pureScale to provide integrated disaster recovery. In this article, learn in detail how to set up and manage your HADR databases. Troubleshooting common problems in HADR for pureScale environments is also covered.


Disaster recovery and pureScale

This section discusses pureScale, the need for disaster recovery (DR), various options available for DR, and a brief introduction to the integrated DR solution.

Introduction to pureScale


DB2 pureScale has a cluster-based, shared-data architecture with multiple members working together as a single DB2 instance. Figure 1 shows an example of a DB2 pureScale instance with three members, a primary cluster caching facility (CF), and a secondary CF for redundancy. It supports continuous availability by providing access to the database even if one or more members are down. In DB2 pureScale, the Tivoli System Automation for Multiplatforms (SA MP) component detects any failure in the members. The failure is communicated to other members in the cluster, which then send details to the client about available members and the load each member can currently handle. The workload previously handled by the failed member is distributed across all available members until the failed member gets restarted.

Figure 1. A pureScale instance with 3 members and 2 CF configurations

DB2 pureScale provides extreme scalability. When needed, one or more members can be added to the DB2 pureScale cluster to handle more concurrent workloads. Essentially, customers can provision hardware for their database system with their current requirements and keep increasing as the need arises. The members can also be removed from the cluster when the business demand decreases, thereby providing excellent flexibility.

Options for major outages

DB2 pureScale provides continuous availability and extreme scalability, but there will be a database outage if the entire pureScale cluster goes down. Prior to DB2 version 10.5, there were a few ways you could enable DR in such scenarios:

  • Backup and restore
  • Geographically dispersed pureScale cluster
  • Disk mirroring
  • Q replication
  • IBM Change Data Capture (CDC)

Integrated DR solution using HADR

HADR support for DB2 pureScale, introduced in DB2 V10.5, provides an easy, integrated DR solution for your pureScale cluster.

HADR is an integrated feature that comes with DB2 and supports high availability and DR for DB2 Enterprise Server Edition (ESE). It's a replication technology that ships transaction logs from the primary database to a standby database. HADR also applies the logs on the standby to keep the primary and standby databases in sync.


Introduction to HADR for pureScale

You can configure and manage HADR for pureScale the same way you would in a DB2 ESE environment. As shown in Figure 2, the difference is that a whole pureScale cluster acts as the HADR primary at the local site and another as the HADR standby at the remote site. Set up the standby cluster the same way you would in ESE: take a backup of the database at the local site, restore it at the remote site, update the HADR parameters on both sites, and start HADR on the standby and then the primary cluster. Just as with ESE, HADR's integrated role-switch functions allow you to switch the workload over to the standby site.

Figure 2. HADR setup in pureScale environment

On the primary cluster, each member has a separate log stream that is shipped by an individual HADR engine dispatching unit (EDU) (db2hadrp) running on the member. On the standby cluster, an HADR EDU (db2hadrs) runs for each primary member to receive the logs shipped. Another set of EDUs (the log replay master and replay workers) then merges and replays the log records, as in Figure 3.

Figure 3. Logs shipped from separate log stream to replay member

The member that receives, merges, and replays the logs on the standby is called the replay member. You designate a preferred replay member by issuing the START HADR command on that particular member. There is also a preferred replay member on the primary, but that designation only comes into effect if the primary has its role switched to standby. If the preferred replay member is not available, another member is started and becomes the replay member. Figure 4 shows an example.

Figure 4. HADR standby for pureScale

Supported synchronization modes

Because HADR support for pureScale was enabled in V10.5 as a DR solution, it currently supports only the ASYNC and SUPERASYNC synchronization modes for communication between the primary and standby clusters.

Reliability aspects

This section briefly introduces the new functions that make HADR for pureScale a robust DR solution.

Assisted remote catchup

In the HADR for pureScale environment, all members on the primary cluster should be available to send logs to the standby cluster where they're merged and replayed by the replay member. What happens if one or more members on the primary are unavailable due to planned or unplanned outages, or if a member is not reachable by the replay member due to network issues? In such a scenario, one of the online members in the primary cluster works as a proxy to send the logs on behalf of the failed or unavailable member, as in Figure 5. With this activity, called assisted remote catchup, an online member helps to ship the logs for unavailable or failed members and thus the log shipping on the primary cluster is not interrupted.

Figure 5. Assisted remote catch up

An additional db2hadrp EDU runs on the member to assist the primary member that is either unavailable or not reachable by the standby. After the member becomes available and the standby replay member is able to connect to it, assisted remote catchup terminates.

High availability of replay member

On the standby cluster, only one member will receive logs from a primary cluster and replay them against the database on standby. What happens if the replay member is unavailable? The database will be activated on another member and HADR will be started to continue the replication. The log replay is not interrupted on the standby cluster, as in Figure 6.

Figure 6. Replay member switch during failure

Command execution on non-replay members

Although you can issue HADR-related commands such as START/STOP HADR and TAKEOVER HADR, and DB2 commands such as ACTIVATE/DEACTIVATE DATABASE on the non-replay members, the commands are not executed directly on these members. Instead, these commands are forwarded to the replay member to execute, as the database is active only on the replay member. This process, called command forwarding, is shown in Figure 7.

Figure 7. Command forwarding in HADR standby

Setting up HADR for pureScale

This section explains what to consider before setup, then walks through the steps to set up HADR in a DB2 pureScale environment.

Preparing your environment

There are a few prerequisites to cover before setting up your HADR for DB2 pureScale environment.

Topology

The HADR primary and standby clusters should have identical member topology (that is, the number of members in the primary cluster should be the same as in the standby cluster). The member IDs should also be the same for the primary and standby clusters.

If there are hardware resource constraints, the standby setup can have logical members sharing the same host. However, after a role switch the new primary will not be as powerful as the old primary.

Reserve a port for HADR communication

As a best practice, it is advisable to reserve a dedicated port on each of the primary and standby members for HADR communication.
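For example, on AIX or Linux you can reserve the ports in /etc/services on every host in both clusters. The service names below are illustrative, and the port numbers match the ones used later in the setup scenario:

DB2_HADR_MEM0    50622/tcp    # HADR port for member 0
DB2_HADR_MEM1    50623/tcp    # HADR port for member 1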

Sufficient space for log archive path

The log archival path should have sufficient disk space to hold the archived transaction logs. The archive path should be configured on the shared GPFS file system so it is accessible by all the members in the cluster.

The archive path can also be set to Tivoli Storage Manager (TSM).
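For instance, either of the following forms can be used; the management class name is a placeholder you would substitute with your own:

db2 update db cfg for <dbname> using LOGARCHMETH1 TSM
db2 update db cfg for <dbname> using LOGARCHMETH1 TSM:<management class>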

Choosing a SYNCMODE

HADR for DB2 pureScale currently supports only ASYNC and SUPERASYNC synchronization modes.
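For example, to choose SUPERASYNC, update the HADR_SYNCMODE database configuration parameter with the same value on both clusters; the change takes effect the next time HADR is started:

db2 update db cfg for <dbname> using HADR_SYNCMODE SUPERASYNC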

System clock

The system clocks of the HADR primary members should be synchronized across the cluster, and they should also be in sync with the clocks of the HADR standby members.

Database configuration parameters

Table 1 explains the database configuration parameters that you need to update in the primary and standby clusters while setting up HADR in the DB2 pureScale environment.

Table 1. Parameters to be updated

HADR_LOCAL_HOST
    Specifies the hostname of the primary or standby member's machine. This parameter needs to be updated for every member in the primary and the standby. For example, in a pureScale setup with two members, HADR_LOCAL_HOST needs to be updated for each member, as follows.

    db2 update db cfg for <dbname> member 0 using HADR_LOCAL_HOST <hostname of member 0>
    db2 update db cfg for <dbname> member 1 using HADR_LOCAL_HOST <hostname of member 1>

HADR_LOCAL_SVC
    Specifies the port used for HADR communication between the primary and the standby. You can update this parameter separately for each member in case the ports reserved are different on each member.

HADR_REMOTE_HOST
    Is updated automatically from HADR_TARGET_LIST. However, it is recommended that you verify this parameter using db2 get db cfg for <dbname> show detail after HADR is started.

HADR_TARGET_LIST
    Specifies the hostnames and HADR ports for each member in the remote cluster. On the primary, this parameter needs to be filled with the hostname and port information of the standby members, and vice versa.

    With a large cluster, it's not a requirement to include all the members of the cluster in the list. Listing the members most likely to be online should be sufficient, as the list is completed automatically when the HADR pair connects. It is strongly recommended that you list the preferred replay member in the HADR_TARGET_LIST of the primary cluster. Setup scenario shows how to fill in this parameter.

HADR_TIMEOUT
    Specifies the number of seconds HADR waits for communication between the primary and the standby before reporting a communication error. An optimal value for this parameter helps detect communication errors early.

HADR_SPOOL_LIMIT
    Controls log spooling, which allows transaction logs received from the primary to be stored on the standby's disk if the receive buffer of the standby is full. Spooling is enabled by default. Adequate disk space should be reserved on the active log path for the transaction logs to be spooled.
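Unlike HADR_LOCAL_HOST and HADR_LOCAL_SVC, parameters such as HADR_TIMEOUT and HADR_SPOOL_LIMIT are set once for the database rather than per member. The values below are illustrative rather than recommendations:

db2 update db cfg for <dbname> using HADR_TIMEOUT 120
db2 update db cfg for <dbname> using HADR_SPOOL_LIMIT AUTOMATIC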

Setup scenario

The following scenario walks through setting up Company XYZ's HADR environment for database HADR4SD. The primary data center is in Toronto and the DR site is in New York City, as shown in Figure 8.

Figure 8. HADR for DB2 pureScale with 2 Members and 2 CFs

On Site A – Primary data center (Toronto)

  1. Enable archive logging for the database HADR4SD using the following command, as in Listing 1.
    db2 update db cfg for HADR4SD using LOGARCHMETH1 DISK: <path>

    Note that the archival log path above should be accessible by all the members of the primary data center.

    Listing 1. Update archive log to enable database recovery
    (svtdbm8@coralxib50) /home/svtdbm8
    $ db2 update db cfg for HADR4SD using LOGARCHMETH1 DISK:/db2gpfs/ARCHLOG/Primary
    DB20000I The UPDATE DATABASE CONFIGURATION command completed successfully.
  2. Back up the database HADR4SD on the primary data center using the following command, as in Listing 2.
    db2 backup db HADR4SD compress
    Listing 2. Backup database on primary to configure standby
    (svtdbm8@coralxib50) /home/svtdbm8
    $ db2 backup db hadr4sd compress
    Backup successful. The timestamp for this backup image is : 20131203042122

    Transfer the above backup image to the standby cluster.

On Site B - DR site (New York)

  1. On the DR site, restore the backup image of the database HADR4SD that was taken at the primary data center. Use the following command, as in Listing 3.
    db2 restore db HADR4SD taken at <timestamp> on <path>
    Listing 3. Restore the backup of primary on standby server
    (svtdbm8@coralxib52) /home/svtdbm8
    $ db2 "restore db HADR4SD taken at 20131203042122 on /db2gpfs/dbdir/"
    DB20000I The RESTORE DATABASE command completed successfully.
    (svtdbm8@coralxib52) /home/svtdbm8
  2. Update the archive log path of the standby server using the LOGARCHMETH1 db cfg parameter, if required, as in Listing 4.
    Listing 4. Update archive log to enable database recovery
    (svtdbm8@coralxib52) /home/svtdbm8 
    db2 update db cfg for HADR4SD using LOGARCHMETH1 DISK:/db2gpfs/Archlog/Standby
    DB20000I The UPDATE DATABASE CONFIGURATION command completed successfully.
    (svtdbm8@coralxib52) /home/svtdbm8

On Site A – Primary data center (Toronto)

  1. Update the HADR db cfg parameters as shown in Listing 5. See Database configuration parameters for an explanation of the HADR db cfg parameters.
    Listing 5. Update HADR parameters on primary cluster
    (svtdbm8@coralxib50) /home/svtdbm8
    $ db2 update db cfg for HADR4SD member 0 using HADR_LOCAL_HOST coralxib50 \
          HADR_LOCAL_SVC 50622
    DB20000I The UPDATE DATABASE CONFIGURATION command completed successfully.
    (svtdbm8@coralxib50) /home/svtdbm8
    $ db2 update db cfg for HADR4SD member 1 using HADR_LOCAL_HOST coralxib51 \
          HADR_LOCAL_SVC 50623
    DB20000I The UPDATE DATABASE CONFIGURATION command completed successfully.
    (svtdbm8@coralxib50) /home/svtdbm8
    $ db2 "update db cfg for HADR4SD using HADR_TARGET_LIST {coralxib52:50622|coralxib53:50623} \
          HADR_REMOTE_INST svtdbm8"
    DB20000I The UPDATE DATABASE CONFIGURATION command completed successfully.

On Site B - DR site (New York)

  1. Update the HADR db cfg parameters on standby as shown in Listing 6.
    Listing 6. Update HADR parameters on standby cluster
    (svtdbm8@coralxib52) /home/svtdbm8
    $ db2 update db cfg for HADR4SD member 0 using HADR_LOCAL_HOST coralxib52 \
          HADR_LOCAL_SVC 50622
    DB20000I The UPDATE DATABASE CONFIGURATION command completed successfully.
    (svtdbm8@coralxib52) /home/svtdbm8
    $ db2 update db cfg for HADR4SD member 1 using HADR_LOCAL_HOST coralxib53 \
          HADR_LOCAL_SVC 50623
    DB20000I The UPDATE DATABASE CONFIGURATION command completed successfully.
    (svtdbm8@coralxib52) /home/svtdbm8
    $ db2 "update db cfg for HADR4SD using HADR_TARGET_LIST {coralxib50:50622|coralxib51:50623} \
          HADR_REMOTE_INST svtdbm8"
    DB20000I The UPDATE DATABASE CONFIGURATION command completed successfully.
  2. Start HADR as standby using the following command, as in Listing 7.
    db2 start hadr on db <dbname> as standby

    Note that the member from which the HADR standby is started is designated as the preferred replay member.

    Listing 7. Start HADR on standby cluster
    (svtdbm8@coralxib52) 
    $ db2 start hadr on db hadr4sd as standby
    DB20000I  The START HADR ON DATABASE command completed successfully.

On Site A – Primary data center (Toronto)

  1. Start HADR as primary using the following command, as in Listing 8.
    db2 start hadr on db <dbname> as primary

    The member from which the HADR primary is started is designated as the preferred replay member; this designation takes effect only if the primary becomes the standby after a role change.

    Listing 8. Start HADR on primary cluster
    (svtdbm8@coralxib50) /home/svtdbm8
    $ db2 start hadr on db hadr4sd as primary
    DB20000I The START HADR ON DATABASE command completed successfully.
  2. Verify that the HADR setup was successful using the following command.
    db2pd -hadr -db <dbname>

    Listing 9, taken on primary member 0, shows the following information:

    HADR_ROLE is PRIMARY, HADR_STATE is PEER, and HADR_CONNECT_STATUS is CONNECTED for member 0.

    HADR_ROLE is PRIMARY, HADR_STATE is REMOTE_CATCHUP, and HADR_CONNECT_STATUS is CONNECTED for member 1.

    Because the database is not initially active on member 1, primary member 0 performs assisted remote catchup for member 1. The assisted remote catchup state terminates when the database is activated on member 1.

    Listing 9. Verify HADR status using db2pd on primary
    (svtdbm8@coralxib50) /home/svtdbm8
    $ db2pd -hadr -db HADR4SD | egrep 'HADR_ROLE|HADR_SYNCMODE|HADR_STATE|HADR_CONNECT_STATUS|MEMBER'

                            HADR_ROLE = PRIMARY
                        HADR_SYNCMODE = ASYNC
                           HADR_STATE = PEER
                  PRIMARY_MEMBER_HOST = coralxib50
                       PRIMARY_MEMBER = 0
                  STANDBY_MEMBER_HOST = coralxib52
                       STANDBY_MEMBER = 0
                  HADR_CONNECT_STATUS = CONNECTED
             HADR_CONNECT_STATUS_TIME = 12/17/2013 03:17:39.687586 (1387268259)

                            HADR_ROLE = PRIMARY
                        HADR_SYNCMODE = ASYNC
                           HADR_STATE = REMOTE_CATCHUP
                  PRIMARY_MEMBER_HOST = coralxib50
                       PRIMARY_MEMBER = 0
                  STANDBY_MEMBER_HOST = coralxib52
                       STANDBY_MEMBER = 0
                  HADR_CONNECT_STATUS = CONNECTED
             HADR_CONNECT_STATUS_TIME = 12/17/2013 03:18:01.864910 (1387268281)

    The STANDBY_MEMBER_HOST in the output above indicates the replay member on the standby.

    Listing 10 shows information on the standby replay member (member 0). HADR_ROLE is STANDBY, HADR_STATE is PEER for member 0 and REMOTE_CATCHUP for member 1, and HADR_CONNECT_STATUS is CONNECTED.

    Listing 10. Verify HADR status using db2pd on standby
    (svtdbm8@coralxib52) /home/svtdbm8
    $ db2pd -hadr -db HADR4SD | egrep 'HADR_ROLE|HADR_SYNCMODE|HADR_STATE|HADR_CONNECT_STATUS|MEMBER'

                            HADR_ROLE = STANDBY
                        HADR_SYNCMODE = ASYNC
                           HADR_STATE = PEER
                  PRIMARY_MEMBER_HOST = coralxib50
                       PRIMARY_MEMBER = 0
                  STANDBY_MEMBER_HOST = coralxib52
                       STANDBY_MEMBER = 0
                  HADR_CONNECT_STATUS = CONNECTED
             HADR_CONNECT_STATUS_TIME = 12/17/2013 03:17:39.686040 (1387268259)

                            HADR_ROLE = STANDBY
                        HADR_SYNCMODE = ASYNC
                           HADR_STATE = REMOTE_CATCHUP
                  PRIMARY_MEMBER_HOST = coralxib50
                       PRIMARY_MEMBER = 0
                  STANDBY_MEMBER_HOST = coralxib52
                       STANDBY_MEMBER = 0
                  HADR_CONNECT_STATUS = CONNECTED
             HADR_CONNECT_STATUS_TIME = 12/17/2013 03:18:01.863571 (1387268281)

Managing your HADR databases

This section provides an overview of tasks related to managing your HADR environment such as monitoring, handling site change activity, handling disaster recovery, changing the member topology, and performing rolling updates.

Monitoring

The monitoring infrastructure in DB2 covers everything you need to watch in an HADR environment. In a DB2 pureScale environment, there are two preferred methods to monitor your HADR databases: the db2pd command and the MON_GET_HADR table function.

db2pd command
Provides information about your HADR setup including the HADR role, HADR connection status, HADR state, log positions of the primary and standby, and so forth. db2pd can be issued from a primary or a standby database.

To monitor your HADR setup, you can issue the db2pd command from the standby replay member or individual primary members. Issuing the command from the standby replay member shows all HADR-related information for all the primary members the standby is connected to. You can use the following command to monitor HADR.

db2pd -hadr -db <dbname>

Listing 11 shows some relevant portions of db2pd output issued on the standby replay member.

Listing 11. Monitor HADR status using db2pd
(svtdbm8@coralxib52) /home/svtdbm8
$ db2pd -hadr -db HADR4SD | egrep 'HADR_ROLE|HADR_SYNCMODE|HADR_STATE|HADR_CONNECT_STATUS|MEMBER'

                        HADR_ROLE = STANDBY
                    HADR_SYNCMODE = ASYNC
                       HADR_STATE = PEER
              PRIMARY_MEMBER_HOST = coralxib50
                   PRIMARY_MEMBER = 0
              STANDBY_MEMBER_HOST = coralxib52
                   STANDBY_MEMBER = 0
              HADR_CONNECT_STATUS = CONNECTED
         HADR_CONNECT_STATUS_TIME = 12/03/2013 04:59:01.997730 (1386064741)

                        HADR_ROLE = STANDBY
                    HADR_SYNCMODE = ASYNC
                       HADR_STATE = PEER
              PRIMARY_MEMBER_HOST = coralxib51
                   PRIMARY_MEMBER = 1
              STANDBY_MEMBER_HOST = coralxib52
                   STANDBY_MEMBER = 0
              HADR_CONNECT_STATUS = CONNECTED
         HADR_CONNECT_STATUS_TIME = 12/03/2013 06:11:06.850199 (1386069066)

The output shows that the standby coralxib52 is connected to two primary members: coralxib50 and coralxib51. The standby is in PEER state with both the primary members.

The db2pd command can be used on both primary and standby as it doesn't require a connection to a database.

MON_GET_HADR table function
Can be issued on the primary database to retrieve information about the standby database and the related HADR parameters.

Listing 12 shows an example of using the MON_GET_HADR function to retrieve the status of various HADR parameters.

Listing 12. Monitor HADR status using MON_GET_HADR table function
$ db2 "select substr(PRIMARY_MEMBER_HOST,1,15) as PRIMARY_HOSTNAME,substr(PRIMARY_MEMBER,1,5) a
s PRIMARY_MEMBERID , substr(HADR_ROLE,1,10) as HADR_ROLE,substr(STANDBY_MEMBER_HOST,1,15) as ST
ANDBY_HOSTNAME,HADR_CONNECT_STATUS,substr(HADR_STATE,1,16) as HADR_STATE from table (MON_GET_HADR(NULL))"

PRIMARY_HOSTNAME PRIMARY_MEMBERID HADR_ROLE STANDBY_HOSTNAME   

--------------- ---------------- --------- -----------------

HADR_CONNECT_STATUS HADR_STATE
------------------- ----------

coralxib50  0    PRIMARY   coralxib52  CONNECTED PEER

1 record(s) selected.

The MON_GET_HADR table function requires a connection to the database. The MON_GET_HADR table function can't be invoked on the standby because HADR for DB2 pureScale doesn't support Reads on Standby.

Ensuring HADR replication is in progress

Some of the fields in the db2pd -hadr command output can be useful in determining whether HADR log replication is progressing on the standby while transactions are running on the primary. Periodically monitor the parameters in Table 2 to verify replication progress.

Table 2. Parameters to monitor

HADR_CONNECT_STATUS
    Should indicate CONNECTED.

HADR_STATE
    Should be either PEER or REMOTE_CATCHUP.

PRIMARY_LOG_FILE, STANDBY_LOG_FILE
    Can be monitored from the primary to compare the logs generated on the primary with the logs received on the standby.

    If the current log position of the primary is changing with time and the receive log position of the standby doesn't change, it's an indication that the standby is not receiving logs from the primary.

    A high value of HADR_LOG_GAP is also an indication that the log receiving rate on the standby is low compared to the log generation rate on the primary.

STANDBY_REPLAY_LOG_FILE, STANDBY_RECV_REPLAY_GAP
    Can be monitored from the standby to understand the log replay progress and the gap between logs received and logs replayed.

    If the replay position of the standby doesn't change with time, it's an indication that log replay is stuck on the standby. An increasing receive-replay gap also indicates that the standby either is not able to replay the logs it received from the primary or is replaying them slowly.
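The following one-liner, a sketch built from the field names above, pulls all of these values in a single pass; run it on the primary or the standby as appropriate and compare successive outputs over time:

db2pd -hadr -db <dbname> | egrep 'HADR_CONNECT_STATUS|HADR_STATE|PRIMARY_LOG_FILE|STANDBY_LOG_FILE|HADR_LOG_GAP|STANDBY_REPLAY_LOG_FILE|STANDBY_RECV_REPLAY_GAP'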

Monitoring the HADR primary during assisted remote catchup

This section discusses the db2pd -hadr -db <dbname> output when assisted remote catchup is active for a primary member. (See Assisted remote catchup for more information.)

When the db2 start hadr on db <dbname> as primary command is issued on a primary member, the database is activated only on this member. The database will be activated and HADR will be active on other members only during a connect to the database or an explicit activate using the db2 activate db <dbname> command.

If the database has not been activated on a member, or the member is offline due to a planned or unplanned outage, the standby replay member requests assisted remote catchup for this member. The assisted remote catchup request is handled by one of the active primary members (called the assisting member), which works on behalf of the unavailable member until that member becomes available again.

Listing 13 shows ASSISTED_REMOTE_CATCHUP in HADR_FLAGS for the member being assisted.

Listing 13. Monitor for active assisted remote catchup
(svtdbm8@coralxib50) /TMP/HADR/Scripts/Tools
$ db2pd -hadr -db HADR4SD | egrep 'HADR_ROLE|HADR_SYNCMODE|HADR_STATE|HADR_CONNECT_STATUS|MEMBER|FLAGS|LOG_STREAM_ID'

                        HADR_ROLE = PRIMARY
                    HADR_SYNCMODE = ASYNC
                    LOG_STREAM_ID = 0
                       HADR_STATE = PEER
                       HADR_FLAGS =
              PRIMARY_MEMBER_HOST = coralxib50
                   PRIMARY_MEMBER = 0
              STANDBY_MEMBER_HOST = coralxib52
                   STANDBY_MEMBER = 0
              HADR_CONNECT_STATUS = CONNECTED
         HADR_CONNECT_STATUS_TIME = 12/31/2013 07:22:24.986000 (1388491360)

                        HADR_ROLE = PRIMARY
                    HADR_SYNCMODE = ASYNC
                    LOG_STREAM_ID = 1
                       HADR_STATE = REMOTE_CATCHUP
                       HADR_FLAGS = ASSISTED_REMOTE_CATCHUP
              PRIMARY_MEMBER_HOST = coralxib50
                   PRIMARY_MEMBER = 0
              STANDBY_MEMBER_HOST = coralxib52
                   STANDBY_MEMBER = 0
              HADR_CONNECT_STATUS = CONNECTED
         HADR_CONNECT_STATUS_TIME = 12/31/2013 07:22:24.927269 (1388492544)

You might encounter a scenario where a network problem prevents the standby replay member from reaching a particular primary member, even though that member is active and handling transactions. In such situations, the standby replay member requests that one of the active primary members ship the logs on behalf of the member it cannot communicate with.

Listing 14 shows ASSISTED_REMOTE_CATCHUP and ASSISTED_MEMBER_ACTIVE in HADR_FLAGS, indicating that an active member is being assisted.

Listing 14. Monitor active assisted remote catchup for an active member
(svtdbm8@coralxib50) /TMP/HADR/Scripts/Tools
$ db2pd -hadr -db HADR4SD | egrep 'HADR_ROLE|HADR_SYNCMODE|HADR_STATE|HADR_CONNECT_STATUS|MEMBER|FLAGS|LOG_STREAM_ID'

                        HADR_ROLE = PRIMARY
                    HADR_SYNCMODE = ASYNC
                    LOG_STREAM_ID = 0
                       HADR_STATE = PEER
                       HADR_FLAGS =
              PRIMARY_MEMBER_HOST = coralxib50
                   PRIMARY_MEMBER = 0
              STANDBY_MEMBER_HOST = coralxib52
                   STANDBY_MEMBER = 0
              HADR_CONNECT_STATUS = CONNECTED
         HADR_CONNECT_STATUS_TIME = 12/31/2013 07:02:40.986000 (1388491360)

                        HADR_ROLE = PRIMARY
                    HADR_SYNCMODE = ASYNC
                    LOG_STREAM_ID = 1
                       HADR_STATE = REMOTE_CATCHUP
                       HADR_FLAGS = ASSISTED_REMOTE_CATCHUP ASSISTED_MEMBER_ACTIVE
              PRIMARY_MEMBER_HOST = coralxib50
                   PRIMARY_MEMBER = 0
              STANDBY_MEMBER_HOST = coralxib52
                   STANDBY_MEMBER = 0
              HADR_CONNECT_STATUS = CONNECTED
         HADR_CONNECT_STATUS_TIME = 12/31/2013 07:36:33.425159 (1388493393)

Checking consistency of tablespaces

In an HADR setup, it is recommended that you periodically check the consistency of the tablespaces on the standby. Some operations, such as a NONRECOVERABLE load, if executed on the primary will cause the affected tablespace to be marked invalid on the standby. If the standby database later becomes the primary as part of a role switch, that tablespace will remain inaccessible.

To check the consistency of the tablespaces, use the following command from the standby.

db2pd -tablespaces -db <dbname>

Verify the State field in the output under the Tablespace Statistics section, as in Listing 15. The value 0x00000000 indicates that the tablespace state is normal.

Listing 15. Tablespace statistics using db2pd on standby

Tablespace Statistics:
Address            Id TotalPgs UsablePgs UsedPgs PndFreePgs FreePgs HWM   Max HWM State      MinRecTime NQuiescers PathsDropped TrackmodState
0x00002AFAA0A78B60 0  16384    16380     14940   0          1440    14940 14940   0x00000000 0          0          No           n/a
0x00002AFAA0A86640 1  1        1         0       0          -       -             0x00000000 0          0          No           n/a
0x00002AFAA0A96960 2  4096     4064      96      0          3968    96    96      0x00000000 0          0          No           n/a

Tablespace Autoresize Statistics:
Address            Id AS  AR  InitSize IncSize IIP MaxSize
0x00002AFAA0A78B60 0  Yes Yes 33554432 -1      No  None
0x00002AFAA0A86640 1  Yes No  0        0       No  0

Changing database configuration parameters

In an HADR setup, any update to database configuration parameters on the primary is not replicated to the standby. Database configuration parameters will need to be updated separately on the standby.

For a database configuration parameter not related to HADR, deactivating and reactivating the database is required if the parameter is not dynamic.

For HADR-specific parameters to take effect, a stop/start of HADR is sufficient.
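A sketch of that workflow on the standby, using an illustrative HADR_TIMEOUT value and the same deactivate/start sequence shown later in this article:

db2 update db cfg for <dbname> using HADR_TIMEOUT 120
db2 deactivate db <dbname>
db2 start hadr on db <dbname> as standby

Because the update is not replicated, remember to apply the same value on the primary as well.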

Handling change activity for a data center and disaster recovery site

In a DB2 pureScale environment with HADR set up, the data center is the primary site where all transactions are executed. The transaction logs are shipped to the DR site and replayed to keep both primary and standby in sync.

Data center DR drills are for checking the readiness of the disaster recovery site to act as a primary site in case of an outage in the data center. Any planned maintenance activity in the data center may require the disaster recovery site to act as the primary site until the activity is completed in the data center.

The steps below outline the sequence for a site change activity.

  1. Verify the db2pd -hadr -db <dbname> output for each primary member, as follows.
    • The HADR_CONNECT_STATUS should be CONNECTED.
    • The HADR_STATE should be PEER for the ASYNC synchronization mode and REMOTE_CATCHUP for SUPERASYNC. If any primary member is currently being assisted, the state for that member can be REMOTE_CATCHUP.
    • The HADR_LOG_GAP between the primary and the standby should be minimal.
    • The STANDBY_RECV_REPLAY_GAP should not be large. The standby needs to replay all the logs it has received before the site change activity, so a large gap leads to a longer site change.
    • The tablespace consistency check can be done on standby. See Checking consistency of tablespaces for more details.
  2. Perform role switch by issuing the following command on the standby.
    db2 takeover hadr on db <dbname>
  3. If the command completes successfully, verify that the role change succeeded from the db2pd -hadr output (see the example after this list):
    • The HADR_ROLE field shows that the primary and standby have switched roles.
    • The HADR_CONNECT_STATUS field shows CONNECTED and the HADR_STATE field shows PEER or REMOTE_CATCHUP.
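For example, a quick post-takeover check on either cluster, reusing the db2pd filters shown in the monitoring section:

db2pd -hadr -db <dbname> | egrep 'HADR_ROLE|HADR_STATE|HADR_CONNECT_STATUS'

The HADR_ROLE values should now be swapped relative to the output captured before the takeover.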

Application behavior during site change activity

Automatic client reroute (ACR) is supported with pureScale, and you can use it to reroute clients to the new primary after a role switch or failover. An alternate group is preferred over an alternate server because alternate groups can define multiple members of a cluster.

If ACR is enabled for your applications, all the applications connected to the primary will be rerouted to the standby after it takes over as primary.

See the DB2 V10.5 Information Center for details on ACR and related client features such as workload balancing (WLB) and client affinity.
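As an illustration, the simpler alternate-server form of ACR is configured on the server with the UPDATE ALTERNATE SERVER FOR DATABASE command. The hostname and port below are placeholders for the standby cluster's client connection address (the SVCENAME port, not the HADR port):

db2 update alternate server for database <dbname> using hostname <standby member host> port <standby svcename port>

Clients pick up the alternate server information when they connect, so they should connect at least once before a failover for rerouting to work.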

Handling fix pack updates

DB2 pureScale supports installing online fix pack updates to a higher code level in an HADR environment, commonly called a rolling update. Transactions can continue to run on the primary and be replicated to the standby during the update. An online fix pack update stops only the member or CF on the host being updated and doesn't require stopping the entire instance.

If the WLB feature of DB2 pureScale is enabled while updating members in the primary, the workload can be handled by the other members of the primary.

On the standby, while you update the replay member, another member in the standby cluster takes over the replay of logs once the replay member is stopped.

For details see Installing online fix pack updates to a higher code level in a HADR environment in the DB2 V10.5 Information Center.

Adding or dropping members

DB2 pureScale supports adding a member to an existing HADR setup online, without impacting the HADR setup. The member must be added to the standby first, then to the primary.

Removing a member from the HADR primary cluster requires that you stop HADR on the primary and reinitialize the standby based on the primary's updated topology.

For details about adding or dropping members in an HADR environment, see DB2 pureScale topology changes and high availability disaster recovery (HADR) in the DB2 V10.5 Information Center.

Disaster recovery during data center site failure

You might need to perform DR when the data center site fails. An outage at the data center site may be due to a network outage, a fire, an earthquake, or another catastrophic event that makes the site completely unavailable.

Use the following procedure in the event of a data center site failure.

  1. Perform a forced takeover from the standby (DR site) database using the following command.
    db2 takeover hadr on db <dbname> by force
  2. If the command completes successfully, check the db2pd -hadr output on the DR site to see whether the HADR_ROLE field changed to PRIMARY.
  3. If you want to reintegrate the old primary as standby after the data center site failure is resolved:
    • In the data center site, run the following command.
      db2 start hadr on db <dbname> as standby
    • If the command completes successfully, verify from db2pd -hadr -db <dbname> output whether the standby is catching up with the new primary. The HADR_CONNECT_STATUS field should show CONNECTED and the HADR_STATE field should show PEER or REMOTE_CATCHUP.

If the db2pd -hadr -db <dbname> output shows the standby as inactive, the reintegration of the old primary as standby might have failed because the primary and standby log streams have become incompatible. You can confirm this by checking for the following message in the db2diag.log file of any of the primary or standby members.

MESSAGE: ADM12500E The HADR standby database cannot be made consistent with the
primary database. The log stream of the standby database is incompatible with
that of the primary database. To use this database as a standby, it must be
recreated from a backup image or split mirror of the primary database.

Follow the steps below only if reintegration of the old primary (data center site) as standby has failed.

  1. Drop the database at the data center site using the following command.
    db2 drop db <dbname>
  2. Delete any archive logs present on the data center site, but only if the archive path is not shared with the DR site.
  3. Take an online backup of the database from the DR site using the following command.
    db2 backup db <dbname> online

    If the database is currently inactive, an offline backup is sufficient.

  4. Transfer the backup image to the data center site.
  5. Restore the database and update the HADR database configurations on the database at the data center site.
  6. Start the database at the data center site as a standby using the following command.
    db2 start hadr on db <dbname> as standby
  7. Verify from db2pd -hadr -db <dbname> output that the standby is connected and catching up with the primary.

Starting the primary without standby pair

In an HADR environment, activating the primary database will implicitly start HADR. If the standby is unavailable, the primary database will not be able to connect to the standby within the configured HADR_TIMEOUT value. As a result, activation of the primary database will fail with the following message.

SQL1768N Unable to start HADR. Reason code = "7"

In such a scenario, until the problem is resolved on the standby you can start the primary by force using the following command.

db2 start hadr on db <dbname> as primary by force

Starting the primary by force will put it into REMOTE_CATCHUP_PENDING state. After the standby is started and available, it will connect and catch up with the primary.
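You can confirm the state with db2pd; a sketch:

db2pd -hadr -db <dbname> | egrep 'HADR_ROLE|HADR_STATE'

HADR_STATE shows REMOTE_CATCHUP_PENDING until the standby reconnects and begins catching up.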

Changing the preferred replay member on the standby

When the START HADR command is issued from any member, that member is designated as the preferred replay member.

As explained in High availability of replay member, if a replay member is terminated or stopped in the standby, another standby member assumes the role of replay member. After the replay member that was stopped or failed becomes active, the replay doesn't automatically switch back to that member even if it is the preferred replay member.

When the replay isn't occurring on the preferred replay member, the HADR_FLAGS field will indicate STANDBY_REPLAY_NOT_ON_PREFERRED, as in Listing 16.

Listing 16. Replay member not running on preferred standby member
(svtdbm8@coralxib53) /home/svtdbm8
$ db2pd -hadr -db hadr4sd | egrep 'HADR_ROLE|HADR_SYNCMODE|HADR_STATE|HADR_CONNECT_STATUS|MEMBER|FLAGS'

                        HADR_ROLE = STANDBY
                    HADR_SYNCMODE = ASYNC
                       HADR_STATE = PEER
                       HADR_FLAGS = STANDBY_REPLAY_NOT_ON_PREFERRED
              PRIMARY_MEMBER_HOST = coralxib51
                   PRIMARY_MEMBER = 1
              STANDBY_MEMBER_HOST = coralxib53
                   STANDBY_MEMBER = 1
              HADR_CONNECT_STATUS = CONNECTED
         HADR_CONNECT_STATUS_TIME = 12/03/2013 06:36:23.546947 (1386070583)

                        HADR_ROLE = STANDBY
                    HADR_SYNCMODE = ASYNC
                       HADR_STATE = PEER
                       HADR_FLAGS = STANDBY_REPLAY_NOT_ON_PREFERRED
              PRIMARY_MEMBER_HOST = coralxib50
                   PRIMARY_MEMBER = 0
              STANDBY_MEMBER_HOST = coralxib53
                   STANDBY_MEMBER = 1
              HADR_CONNECT_STATUS = CONNECTED
         HADR_CONNECT_STATUS_TIME = 12/03/2013 06:36:23.632850 (1386070583)

To change the preferred replay member in the standby, issue the following commands from the standby member that you want to designate as the preferred replay member.

db2 deactivate db <dbname>
db2 start hadr on db <dbname> as standby

Troubleshooting and common error codes

This section covers some common troubleshooting situations and error codes.

HADR port is already in use

It's important to reserve the TCP/IP port that will be used for HADR communication in /etc/services (or the equivalent file on your platform) so that the port cannot be taken by another process.

Problem
If you do not reserve the port that you configure for HADR communication (the one specified by the HADR_LOCAL_SVC database configuration parameter), the same port might be used by another process on the server before HADR starts. HADR cannot start if the port is already being used by another process.
Symptoms
On the standby, if the HADR port specified by HADR_LOCAL_SVC is used by another process, you'll find the following message in the db2diag.log.
..
EDUID : 40034 EDUNAME: db2hadrs.0.0 (HADR4SD) 0
FUNCTION: DB2 UDB, oper system services, sqloPdbBindSocket, probe:1172
MESSAGE : ZRC=0x810F001B=-2129723365=SQLO_ADDR_IN_USE "Address already in use"         
DATA #1 : <preformatted>
Port=[50623]
..

HADR tries to bind to the port five times; if the port is still in use by another process, it returns the following message and terminates the HADR EDUs.

EDUID : 40034 EDUNAME: db2hadrs.0.0 (HADR4SD) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEdu::hdrEduS,
        probe:2145
MESSAGE : ZRC=0x87800142=-2021654206=HDR_ZRC_CONFIG_LOCAL_SVC
          "HADR_LOCAL_SVC configuration parameter is an invalid service name"

Even if the START HADR command completes successfully on the standby, you should confirm that HADR has started by using the db2pd command.

On the primary, if the HADR port is being used by another process then starting HADR will fail with the following error.

SQL1768N Unable to start HADR. Reason code = "5"
Solution
Use the netstat (or a similar) command to find out whether another process is listening on the port. Either choose a new port for HADR communication or stop the process that is currently using the port before you attempt to start HADR again.

For example, netstat -Aan on AIX or netstat -anp on Linux shows whether any process is listening on a given port.

Frequent connections and disconnections of the HADR pair

Problem
HADR protection is not dependable because of frequent disconnections of your HADR pair, which can leave the standby poorly synchronized with the primary. Frequent disconnections can occur because of a slow network between the primary and the standby, or because an operation on the standby takes longer than the configured HADR_TIMEOUT value. For example, log file creation on the standby might take longer depending on the log file size and the I/O speed.
Symptoms

You can observe frequent connections and disconnections in the HADR_CONNECT_STATUS field in the db2pd command or MON_GET_HADR table function output, as in Listing 17.

Listing 17. CONNECT/DISCONNECT of HADR pair
              HADR_CONNECT_STATUS = CONNECTED
         HADR_CONNECT_STATUS_TIME = 12/06/2013 01:48:03.379836 (1386312483)
                        HADR_ROLE = PRIMARY
                       HADR_STATE = PEER
              HADR_CONNECT_STATUS = CONGESTED
         HADR_CONNECT_STATUS_TIME = 12/06/2013 01:54:09.951183 (1386312849)
                        HADR_ROLE = PRIMARY
                       HADR_STATE = PEER
              HADR_CONNECT_STATUS = CONGESTED
         HADR_CONNECT_STATUS_TIME = 12/06/2013 01:54:09.951183 (1386312849)
                        HADR_ROLE = PRIMARY
                       HADR_STATE = DISCONNECTED
              HADR_CONNECT_STATUS = DISCONNECTED
         HADR_CONNECT_STATUS_TIME = 12/06/2013 01:54:14.058221 (1386312854)
                        HADR_ROLE = PRIMARY
                       HADR_STATE = DISCONNECTED
              HADR_CONNECT_STATUS = DISCONNECTED
         HADR_CONNECT_STATUS_TIME = 12/06/2013 01:54:14.058221 (1386312854)
                        HADR_ROLE = PRIMARY
                       HADR_STATE = DISCONNECTED
              HADR_CONNECT_STATUS = DISCONNECTED
         HADR_CONNECT_STATUS_TIME = 12/06/2013 01:54:14.058221 (1386312854)

You would also see the following message in the db2diag.log.

2013-12-06-01.06.09.092749-300 I680310E598 LEVEL: Error
PID : 8687 TID : 47049621890816 PROC : db2sysc 0
INSTANCE: svtdbm8 NODE : 000 DB : HADR4SD
HOSTNAME: coralxib50
EDUID : 85 EDUNAME: db2hadrp.0.1 (HADR4SD) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduAcceptEvent,
    probe:20200
MESSAGE : Did not receive anything through HADR connection for the duration of HADR_TIMEOUT. Closing connection.
DATA #1 : Hexdump, 4 bytes
0x00002ACA97FF82D0 : 2900 0000
Solution
Ensure that the HADR_TIMEOUT database configuration parameter is set to a minimum of 60 seconds to avoid false alarms and to give the standby enough time to complete its activities.
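For example, check the current value and raise it if needed; 120 is an illustrative value, and the same value must be set on both the primary and the standby:

db2 get db cfg for <dbname> | grep HADR_TIMEOUT
db2 update db cfg for <dbname> using HADR_TIMEOUT 120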

Connect to primary fails

Problem
A client application trying to connect to the primary, an explicit activation of the database, or starting HADR on the primary waits for the HADR_TIMEOUT period and then fails.
Symptoms
The connect or activate of the database, or the start of HADR on primary, fails with the following message.
SQL1768N Unable to start HADR. Reason code = "7".
Solution
When you activate or connect to the database on the primary, it starts HADR and tries to pair with a standby database for the HADR_TIMEOUT period, failing if no pairing occurs. In an HADR environment, always make sure the HADR standby is running before you connect to, activate, or start HADR on the primary.

Slow log replay on the standby

Problem
In some cases, the standby might be slow in replaying the logs it receives from the primary, or it might not be able to receive logs at the rate at which they are generated on the primary. In either situation, back pressure can build up on the primary if the synchronization mode is set to ASYNC.
Symptoms
The db2pd -hadr -db <dbname> command issued from the primary shows that the log gap between primary and standby is increasing, as indicated by the HADR_LOG_GAP field.

The db2pd -hadr -db <dbname> command issued from the standby shows:

  • The STANDBY_RECV_BUF_PERCENT is close to 100%
  • The STANDBY_SPOOL_PERCENT is close to 100%
  • The HADR_FLAGS field indicates STANDBY_RECV_BLOCKED
Solution
The logs shipped by the primary to the standby are stored in the standby receive buffer until they are replayed. If the receive buffer is full, the standby can still continue to receive logs in the spool.

If the standby receive buffer and the spool are full, the standby won't be able to receive any more logs from the primary. If the db2pd -hadr -db <dbname> output on the standby indicates STANDBY_RECV_BUF_PERCENT and STANDBY_SPOOL_PERCENT are close to 100%, it's an indicator that both the receive buffer and the spool are nearing capacity.

The default value of the standby receive buffer is twice the size of the LOGBUFSZ configuration parameter in the primary. On the standby, the receive buffer can be increased using the DB2_HADR_BUF_SIZE registry variable to avoid a bottleneck on the receive buffer.

If the problem persists after increasing the receive buffer size, you can consider increasing the spool limit using the HADR_SPOOL_LIMIT database configuration parameter.

You can also check for bottlenecks, such as I/O waits on the disk, that can make the replay slow.
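A sketch of the two tuning knobs discussed above; both values are illustrative, both are expressed in units of 4 KB log pages, and each change takes effect only after the database is reactivated on the standby:

db2set DB2_HADR_BUF_SIZE=16384
db2 update db cfg for <dbname> using HADR_SPOOL_LIMIT 100000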

HADR pair is not connected

Problem
Even though the START HADR command succeeded, HADR is not active on the standby.
Symptoms
The db2pd output on the standby indicates that HADR is not active, and the HADR_CONNECT_STATUS field on the primary shows DISCONNECTED.

Check the db2diag.log of all the members on the primary and the standby for any of the following messages.

  • ADM12513E Unable to establish HADR primary-standby connection because HADR configurations are incompatible on primary and standby.
  • ADM12509E HADR encountered an abnormal condition.
  • ADM12500E The HADR standby database cannot be made consistent with the primary database. The log stream of the standby database is incompatible with that of the primary database. To use this database as a standby, it must be recreated from a backup image or split mirror of the primary database.
Solution
The problem could be an HADR parameter mismatch, a log stream mismatch, or log incompatibility between the primary and the standby. You can get the precise reason for the failure by prefixing the message number with db2 ? to print the verbose message text to the screen. For example, you could enter db2 ? ADM12509E.

Common errors in the HADR environment

The following is a list of common errors in the HADR environment.

  • SQL1767N - Start HADR cannot complete
  • SQL1768N - Unable to start HADR
  • SQL1769N - Stop HADR cannot complete
  • SQL1770N - Takeover HADR cannot complete
  • SQL1771N - Non recoverable databases cannot be used as either HADR primary or HADR standby database
  • SQL1772N - Infinite active logging cannot be enabled on either HADR primary or HADR standby databases
  • SQL1773N - The statement or command requires functionality that is not supported on a read-enabled HADR standby database
  • SQL1774N - Table space restore or rollforward cannot be issued on an HADR primary or HADR standby database
  • SQL1776N - The command cannot be issued on an HADR database. Reason code = reason-code
  • SQL1777N - HADR is already started

An error message can have multiple reason codes that provide a detailed description of the scenario in which the error was triggered. The user response associated with the reason code helps you understand the steps to be taken to resolve the error.


Conclusion

In this article you learned how to set up HADR for pureScale and how to manage and monitor your HADR databases. You also learned how to troubleshoot various problematic scenarios. This article can serve as a reference guide for system or database administrators who set up and maintain DR using HADR in their pureScale environment.


Acknowledgements

Thank you to Effi Ofer and Roger Zheng for their contributions to this article.

We would also like to thank Rob Causley, Serge Boivin, and Malikarjun Choudary of the IBM Software Group - Information Management for their editorial contributions.

