Database configuration for high availability disaster recovery (HADR)

You can use database configuration parameters to help achieve optimal performance with Db2® HADR.

In most cases, use the same database configuration parameter settings and database manager configuration parameter settings on the systems where the primary and standby databases are located. If the settings for the configuration parameters on the standby database are different from the settings on the primary, the following problems might occur:

Error messages might be returned for the standby database while the log files that were shipped from the primary database are being replayed.
After a takeover operation, the new primary database might be unable to handle the workload, resulting in performance problems or in applications receiving error messages that they did not receive when they were connected to the original primary database.

Changes to the configuration parameters on the primary database are not automatically propagated to the standby database. You must manually make changes on the standby database. For dynamic configuration parameters, changes take effect without having to shut down and restart the database management system (DBMS) or the database. For non-dynamic configuration parameters, changes take effect after the standby database is restarted.

Following are sections on specific configuration topics for HADR:

Size of log files configuration parameter on the standby database

One exception to the configuration parameter behavior that is described in the previous paragraph is the behavior of the logfilsiz database configuration parameter. Although the value of this parameter is not replicated to the standby database, to help ensure that there are identical log files on both databases, the setting for the logfilsiz configuration parameter on the standby is ignored. Instead, the database creates local log files whose sizes match the size of the log files on the primary database.

After a takeover, the original standby (new primary) uses the logfilsiz parameter value that you set on the original primary until you restart the database. At that point, the new primary reverts to using the value that you set locally. In addition, the current log file is truncated and any pre-created log files are resized on the new primary.

If the databases keep switching roles as a result of a non-forced takeover and neither database is deactivated, the log file size that is used is always the one from the original primary database. However, if there is a deactivation and then a restart on the original standby (new primary), the new primary uses the log file size that you configured locally. This log file size continues to be used if the original primary takes over again. Only after a deactivation and restart on the original primary would the log file size revert to the settings on the original primary.

Database configuration parameter autorestart

The recommended configuration for the autorestart parameter on HADR systems is ON. If the autorestart parameter is set to OFF, and the server fails, your response depends on whether you want to restart or fail over to the standby:

If you want to restart, run the RESTART DATABASE command manually. If the restart fails, perform failover.
If you want to fail over, perform the following steps:
1. Shut down the old primary to prevent a split brain. Do this by either stopping the Db2 instance or powering off the host machine. If the server is not accessible for administration, fence it off from clients by disabling the client/server network.
  Note: Deactivating the database is not sufficient because client connections can bring it back online. If it failed in a consistent state, then even if the autorestart parameter is set to OFF, client connections can bring it back online.
2. After you shut down old primary, issue the TAKEOVER HADR command with the BY FORCE option on the standby.

Log receive buffer size on a standby database

By default, the log receive buffer size on a standby database is two times the value that you specify for the logbufsz configuration parameter on the primary database. This size might not be sufficient. For example, consider what might happen when the HADR synchronization mode is set to ASYNC and the primary and standby databases are in peer state. If the primary database is also experiencing a high transaction load, the log receive buffer on the standby database might fill to capacity, and the log shipping operation from the primary database might stall. To manage these temporary peaks, you can make either of the following configuration changes:

Increase the size of the log receive buffer on the standby database by modifying the value of the DB2_HADR_BUF_SIZE registry variable.
Enable log spooling on a standby database by setting the hadr_spool_limit configuration parameter.

Load operations and HADR

If you issue the LOAD command on the primary database with the COPY YES parameter, the command executes on the primary database, and the data is replicated to the standby database if the load copy can be accessed through the path or device that is specified by the command. If load copy data cannot be accessed from the standby database, the table space in which the table is stored is marked invalid on the standby database. Any future log records that pertain to this table space are skipped. To ensure that the load operation can access the load copy on the standby database, use a shared location for the output file from the COPY YES parameter. Alternatively, you can deactivate the standby database while performing the load on the primary, place a copy of the output file in the standby path, and then activate the standby database.

If you issue the LOAD command with the NONRECOVERABLE parameter on the primary database, the command executes on the primary database, and the table on the standby database is marked invalid. Any future log records that pertain to this table are skipped. You can issue the LOAD command with the COPY YES and REPLACE parameters to bring the table back with new data on both the primary and standby databases, or you can drop the table to recover the space.

Note: You cannot bring a table back using the LOAD command with the COPY YES and REPLACE options if the table has one of the following characteristics:

The table was created with the NOT LOGGED INITIALLY attribute.
The table is a multidimensional clustered (MDC) table.
The table has compression dictionaries.
The table has XML columns.

Note: A LOAD command with the REPLACE option first deletes all data from the table and then inserts only new data. This happens on both the primary and standby databases.

Because a load operation with the COPY NO parameter is not supported with HADR, the operation is automatically converted to a load operation with the NONRECOVERABLE parameter. To enable a load operation with the COPY NO parameter to be converted to a load operation with the COPY YES parameter, set the DB2_LOAD_COPY_NO_OVERRIDE registry variable on the primary database. This registry variable is ignored on the standby database. Ensure that the device or directory that you specify for the primary database can be accessed by the standby database by using the same path, device, or load library.

If you are using the Tivoli® Storage Manager (TSM) software to perform a load operation with the COPY YES parameter, you might have to set the vendoropt configuration parameter on the primary and standby databases. Depending on how you configured TSM, the values on the primary and standby databases might not be the same. Also, when using TSM to perform a load operation with the COPY YES parameter, you must issue the db2adutl command with the GRANT parameter to give the standby database read access to the files that are loaded.

If table data is replicated by a load operation with the COPY YES parameter, the indexes are replicated as follows:

If you specify the REBUILD indexing mode option with the LOAD command and the LOG INDEX BUILD table attribute is set to ON (using the ALTER TABLE statement), or if it is set to NULL and the logindexbuild database configuration parameter is set to ON, the primary database includes the rebuilt index object (that is, all of the indexes defined on the table) in the copy file to enable the standby database to replicate the index object. If the index object on the standby database is marked invalid before the load operation, it becomes usable again after the load operation as a result of the index rebuild.
If you specify the INCREMENTAL indexing mode option with the LOAD command and the LOG INDEX BUILD table attribute is set to ON (using the ALTER TABLE statement), or if it is set to NULL and the logindexbuild database configuration parameter on the primary database is set to ON, the index object on the standby database is updated only if it is not marked invalid before the load operation. Otherwise, the index is marked invalid on the standby database.

DB2_HADR_PEER_WAIT_LIMIT registry variable

Restriction: None of this section applies to auxiliary standbys because they are in SUPERASYNC synchronization mode, so they do not ever enter peer state.

If you set the DB2_HADR_PEER_WAIT_LIMIT registry variable, the HADR primary database breaks out of peer state if logging on the primary database has been blocked for the specified number of seconds because of log replication to the standby. When this limit is reached, the primary database breaks the connection to the standby database. If you disable the peer window by setting the hadr_peer_window configuration parameter to 0, the primary enters the disconnected state, and logging resumes. If you enable the peer window, the primary database enters disconnected peer state, in which logging continues to be blocked. The primary leaves disconnected peer state upon reconnection or peer window expiration. Logging resumes after the primary leaves disconnected peer state.

Note: If you set DB2_HADR_PEER_WAIT_LIMIT, use a minimum value of 10 to avoid triggering false alarms.

Honoring peer window transition when a database breaks out of peer state ensures peer window semantics for safe takeover in all cases. If the primary fails during the transition, normal peer window protection still applies: safe takeover from the standby if it is still in disconnected peer state.

On the standby side, after disconnection, the database continues replaying already received logs. After the received logs have been replayed, the standby reconnects to the primary. After replaying the received logs, the standby reconnects to the primary. Upon reconnection, normal state transition follows: first remote catchup state, then peer state.

Relationship to hadr_timeout database configuration parameter

The hadr_timeout database configuration parameter does not break the primary out of peer state if the primary keeps receiving heartbeat messages from the standby while blocked. The hadr_timeout database configuration parameter specifies a timeout value for the HADR network layer. An HADR database breaks the connection to its partner database if it has not received any message from its partner for the period that is specified by the hadr_timeout configuration parameter. The timeout does not control timeout for higher-layer operations such as log shipping and ack (acknowledgement) signals. If log replay on the standby database is stuck on a large operation such as load or reorganization, the HADR component still sends heartbeat messages to the primary database on the normal schedule. In such a scenario, the primary is blocked as long as the standby replay is blocked unless you set the DB2_HADR_PEER_WAIT_LIMIT registry variable.

The DB2_HADR_PEER_WAIT_LIMIT registry variable unblocks primary logging regardless of connection status. Even if you do not set the DB2_HADR_PEER_WAIT_LIMIT registry variable, the primary always breaks out of peer state when a network error is detected or the connection is closed, possibly as result of the hadr_timeout configuration parameter.

DB2_FAIL_RECOVERY_ON_TABLESPACE_ERROR registry variable

In an HADR environment, by default, when a standby database has a table space in an invalid or error state, the replay of transactions on this table space stops. On other valid table spaces, the replay of transactions will continue. This default behavior is preferable when the affected table space is only a small portion of the database and most applications can function with the valid table spaces.

You can specify a different behavior, by setting the DB2_FAIL_RECOVERY_ON_TABLESPACE_ERROR registry variable to ROLLFORWARD. With this setting, the standby database will shut down when encountering a table space which is invalid or in error. The standby database can be restarted after the error is fixed. However, if the error cannot be fixed, you can recover the affected table space by referring to Recovering table space errors on an HADR standby database for more information.

HADR configuration parameters

Some HADR configuration parameters are static, such as hadr_local_host and hadr_remote_host. Static parameters are loaded on database startup, and changes are ignored during run time. HADR parameters are also loaded when the START HADR command completes. On the primary database, HADR can be started and stopped dynamically, with the database remaining online. Thus, one way to refresh the effective value of an HADR configuration parameter without shutting down the database is to stop and restart HADR. In contrast, the STOP HADR brings down the database on the standby, so the standby's parameters cannot be refreshed with the database online.

Host name parameters and service and port name parameters

There are six interrelated configuration parameters that you need to set for HADR:

hadr_target_list
hadr_local_host
hadr_remote_host
hadr_local_svc
hadr_remote_svc (except in a Db2 pureScale® environment, which does not make use of this parameter)
hadr_remote_inst

The target list specifies a set of host:port pairs that act as standbys (for the primary) or the standby hosts to be used if the standby takes over as the new HADR primary database. For a detailed description of its usage, see the hadr_target_list topic in the Related links.

TCP connections are used for communication between the primary and standby databases. The local parameters specify the local address and the remote parameters specify the remote address. A primary database or primary database member listens on its local address for new connections. A standby database that is not connected to a primary database retries connection to its remote address.

The standby database also listens on its local address. In some scenarios, another HADR database can contact the standby database on this address and send it messages.

Unless the HADR_NO_IP_CHECK registry variable is set, Db2 runs the following cross-checks at connection time, of the local and remote addresses of the primary and principal standby databases:

 my local address = your remote address

and

my remote address = your local address

The check is done using the IP address and port number, rather than the literal string in the configuration parameters. You need to set the hadr_local_host and hadr_local_svc configuration parameters respectively, to specify the internal address and external address of the host. You can bypass this check by setting the DB2_HADR_NO_IP_CHECK registry variable to ON. For more information, see HADR and network address translation (NAT) support.

You can configure an HADR database to use either IPv4 or IPv6 to locate its partner database. If the host server does not support IPv6, you must use IPv4. If the server supports IPv6, whether the database uses IPv4 or IPv6 depends upon the format of the address that you specify for the hadr_local_host and hadr_remote_host configuration parameters. The database attempts to resolve the two parameters to the same IP format and use IPv6 when possible. Table 1 shows how the IP mode is determined for IPv6-enabled servers:

Table 1. How the address space used for HADR communication is determined
IP mode used for hadr_local_host parameter	IP mode used for hadr_remote_host parameter	IP mode used for HADR communications
IPv4 address	IPv4 address	IPv4
IPv4 address	IPv6 address	Error
IPv4 address	host name, maps to IPv4 only	IPv4
IPv4 address	host name, maps to IPv6 only	Error
IPv4 address	host name, maps to IPv4 and v6	IPv4
IPv6 address	IPv4 address	Error
IPv6 address	IPv6 address	IPv6
IPv6 address	host name, maps to IPv4 only	Error
IPv6 address	host name, maps to IPv6 only	IPv6
IPv6 address	host name, maps to IPv4 and IPv6	IPv6
hostname, maps to IPv4 only	IPv4 address	IPv4
hostname, maps to IPv4 only	IPv6 address	Error
hostname, maps to IPv4 only	hostname, maps to IPv4 only	IPv4
hostname, maps to IPv4 only	hostname, maps to IPv6 only	Error
hostname, maps to IPv4 only	hostname, maps to IPv4 and IPv6	IPv4
hostname, maps to IPv6 only	IPv4 address	Error
hostname, maps to IPv6 only	IPv6 address	IPv6
hostname, maps to IPv6 only	hostname, maps to IPv4 only	Error
hostname, maps to IPv6 only	hostname, maps to IPv6 only	IPv6
hostname, maps to IPv6 only	hostname, maps to IPv4 and IPv6	IPv6
hostname, maps to IPv4 and IPv6	IPv4 address	IPv4
hostname, maps to IPv4 and IPv6	IPv6 address	IPv6
hostname, maps to IPv4 and IPv6	hostname, maps to IPv4 only	IPv4
hostname, maps to IPv4 and IPv6	hostname, maps to IPv6 only	IPv6
hostname, maps to IPv4 and IPv6	hostname, maps to IPv4 and IPv6	IPv6

The primary and standby databases can make HADR connections only if they use the same IPv4 or IPv6 format. If one server is IPv6 enabled (but also supports IPv4) and the other server supports IPv4 only, at least one of the hhadr_local_host and hadr_remote_host parameters on the IPv6 server must specify an IPv4 address to force database on this server to use IPv4.

You can set the HADR local service and remote service parameters (hadr_local_svc and hadr_remote_svc) to either a port number or a service name. The values that you specify must map to ports that are not being used by any other service, including other Db2 components or other HADR databases. In particular, you cannot set either parameter value to the TCP/IP port that is used by the server to await communications from remote clients (the value of the svcename database manager configuration parameter) or the next port (the value of the svcename parameter + 1).

If the primary and standby databases are on different servers, they can use the same port number or service name; otherwise, they must have different values.

Automatic reconfiguration

The hadr_remote_host, hadr_remote_svc, and hadr_remote_inst configuration parameters are automatically reset when HADR starts if you did not correctly set them. Even though this automatic reconfiguration occurs, always try to set the correct initial values because that reconfiguration might not take effect until a connection is made between a standby and its primary. In some HADR deployments, those initial values might be needed. For example, if you are using the IBM® Tivoli System Automation for Multiplatforms software, the value for the hadr_remote_inst configuration parameter is needed to construct a resource name.

Note: If the DB2_HADR_NO_IP_CHECK registry variable is set to ON, the hadr_remote_host and hadr_remote_svc are not automatically updated.

Reconfiguration is predicated on the values of the hadr_target_list configuration parameter being correct; if anything is wrong in a target list entry, you must correct it manually.

On the primary, the reconfiguration occurs in the following manner:

If the values for the hadr_remote_host and hadr_remote_svc configuration parameters do not match the host:port pair that is the first entry of the hadr_target_list configuration parameter (namely, the principal standby), the hadr_remote_host and hadr_remote_svc configuration parameters are updated with the values from the target list.
If the value for the hadr_remote_inst configuration parameter does not match the instance name of the principal standby, the correct instance name is copied to the hadr_remote_inst configuration parameter for the primary after the principal standby connects to it.

On a standby database, the reconfiguration occurs in the following manner:

When a standby starts, it attempts to connect to the database that you specified for its hadr_remote_host, hadr_remote_inst, and hadr_remote_svc configuration parameters.
If the standby cannot connect to the primary, it waits for the primary to connect to it.
The primary attempts to connect to its standbys using addresses listed in its hadr_target_list parameter. After the primary connects to a standby, the hadr_remote_host, hadr_remote_inst, and hadr_remote_svc configuration parameters for the standby are updated with the correct values for the primary.

In a non-forced takeover, the values for the hadr_remote_host, hadr_remote_inst, and hadr_remote_svc configuration parameters on the new primary are automatically updated to its principal standby, and these parameters on the standbys listed in the new primary's hadr_target_list are automatically updated to point to the new primary. Any database that is not listed in the new primary's hadr_target_list is not updated. Those databases continue to attempt to connect to the old primary and get rejected because the old primary is now a standby. The old primary is guaranteed to be in the new primary's target list because of the requirement of mutual inclusion in the target list.

In a forced takeover, automatic update on the new primary and its standbys (excluding the old primary) work the same way as non-forced takeover. However, automatic update on the old primary does not happen until it is shut down and restarted as a standby for reintegration.

Any database that is not online during the takeover will be automatically reconfigured after it starts. Automatic reconfiguration might not take effect immediately on startup, because it relies on the new primary to periodically contact the standby. On startup, a standby might attempt to connect to the old primary and follow the log stream of the old primary, causing it to diverge from the new primary's log stream and, making that standby unable to pair with the new primary. As a result, you must shut down the old primary before takeover to avoid that kind of split brain scenario.

Synchronization mode

The setting for the hadr_syncmode configuration parameter does not have to be the same on the primary and standby databases. Whatever setting you specify for the hadr_syncmode configuration parameter on a standby is considered its configured synchronization mode; this setting has relevance only if the standby becomes a primary. The standby is assigned an effective synchronization mode. For any auxiliary standby, the effective synchronization mode is always SUPERASYNC. For the principal standby, the effective synchronization mode is the setting for the hadr_syncmode configuration parameter for the primary. For a standby, the monitoring interfaces display the effective synchronization mode as the synchronization mode.

Note: If you have set up HADR without using the hadr_target_list configuration parameter (a method that is deprecated starting in 10.5), the hadr_syncmode configuration parameter must be identical on the primary and standby databases.

For more detailed information, see Db2 high availability disaster recovery (HADR) synchronization mode.

HADR timeout and peer window

The timeout period, which you specify with the hadr_timeout configuration parameter, must be identical on the primary and standby databases. The consistency of the values of these configuration parameters is checked when an HADR pair establishes a connection.

With one exception, when the primary database starts, it waits for the longer of the two following periods for a standby to connect:

For a minimum of 30 seconds
For the number of seconds that is specified by the hadr_timeout database configuration parameter.

If the principal standby does not connect in the specified time, the startup fails; a connection to an auxiliary standby is optional. The one exception to this behavior is when you issue the START HADR command with the BY FORCE parameter. In this case, the primary database starts without waiting for the standby database to connect to it.

After an HADR pair establishes a connection, they exchange heartbeat messages. The heartbeat interval is computed from factors like the hadr_timeout and hadr_peer_window configuration parameters. It is reported by the HEARTBEAT_INTERVAL field in the MON_GET_HADR table function and the db2pd command. If one database does not receive any message from the other database within the number of seconds that is specified by the hadr_timeout configuration parameter, it initiates a disconnect. This behavior means that at most, it takes the number of seconds that is specified by the hadr_timeout configuration parameter for an HADR database to detect the failure of either its partner database or the intervening network. If you set the hadr_timeout configuration parameter too low, you might receive false alarms and frequent disconnections.

The setting for the hadr_peer_window configuration parameter does not have to be the same on the primary and standby databases; the principal standby uses the peer window setting of the primary. The exception to this is if you set up HADR without using the hadr_target_list configuration parameter (a method that is deprecated starting in 10.5), in which case, the hadr_peer_window configuration parameter must be identical on the primary and standby databases.

Peer window cannot be enabled (that is, it must be set to 0) in the following situations:

If you are using the Db2 pureScale feature
If the hadr_syncmode parameter is set to ASYNC or SUPERASYNC
If you are configuring an auxiliary standby

If you have the hadr_peer_window configuration parameter set to a nonzero value and the primary loses connection to the standby in peer state, the primary database does not commit transactions until the connection with the standby database is restored or the time value of the hadr_peer_window configuration parameter elapses, whichever happens first.

For maximal availability, the default value for the hadr_peer_window database configuration parameter is 0. When this parameter is set to 0, as soon as the connection between the primary and the standby is closed, the primary drops out of peer state to avoid blocking transactions. The connection can close because the standby closed the connection, a network error is detected, or timeout is reached. For increased data consistency, but reduced availability, you can set the hadr_peer_window database configuration parameter to a nonzero value.

For more information, see Setting the hadr_timeout and hadr_peer_window database configuration parameters.

The following sample configuration is for the primary and standby databases:

Primary database:

   HADR_TARGET_LIST          host2.ibm.com:hadr_service
   HADR_LOCAL_HOST           host1.ibm.com
   HADR_LOCAL_SVC            hadr_service
   HADR_REMOTE_HOST          host2.ibm.com
   HADR_REMOTE_SVC           hadr_service
   HADR_REMOTE_INST          dbinst2
   HADR_TIMEOUT              120
   HADR_SYNCMODE             NEARSYNC
   HADR_PEER_WINDOW          120

Standby database:

   HADR_TARGET_LIST          host1.ibm.com:hadr_service
   HADR_LOCAL_HOST           host2.ibm.com
   HADR_LOCAL_SVC            hadr_service
   HADR_REMOTE_HOST          host1.ibm.com
   HADR_REMOTE_SVC           hadr_service
   HADR_REMOTE_INST          dbinst1
   HADR_TIMEOUT              120
   HADR_SYNCMODE             NEARSYNC
   HADR_PEER_WINDOW          120