Performing an HADR failover operation

When you want the current standby database to become the new primary database because the current primary database is not available, you can perform a forced takeover, or failover.

Before you begin

A takeover operation can only take place if the standby is in peer state, disconnected peer state, remote catchup pending state, or remote catchup state. If the standby database is in any other state, an error will be returned.
Note: You can make a standby database that is in local catchup state available for normal use by converting it to a standard database. To do this, shut the database down by issuing the DEACTIVATE DATABASE command, and then issue the STOP HADR command. Once HADR has been stopped, you must complete a rollforward operation on the former standby database before it can be used. A database cannot rejoin an HADR pair after it has been converted from a standby database to a standard database. To restart HADR on the two servers, follow the procedure for initializing HADR.
If you have configured a peer window, shut down the primary before the window expires to avoid potential transaction loss in any related failover.
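
The state prerequisite above can be expressed as a small pre-flight check. The following Python sketch is illustrative only and is not part of Db2: the function name can_take_over is hypothetical, and the uppercase state labels are an assumption about how your monitoring output reports the standby's HADR state.

```python
# States from which a takeover can proceed, per the prerequisite above.
# The uppercase labels are an assumption about how the state is reported.
TAKEOVER_ALLOWED_STATES = frozenset({
    "PEER",
    "DISCONNECTED_PEER",
    "REMOTE_CATCHUP_PENDING",
    "REMOTE_CATCHUP",
})


def can_take_over(standby_state: str) -> bool:
    """Return True if the standby's HADR state permits a takeover."""
    # Accept "remote catchup" as well as "REMOTE_CATCHUP".
    normalized = standby_state.strip().upper().replace(" ", "_")
    return normalized in TAKEOVER_ALLOWED_STATES
```

A standby in local catchup state fails this check, which matches the note above: it must be converted to a standard database instead.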

About this task

The TAKEOVER HADR command with the BY FORCE option can be issued only on the standby database. In Db2® pureScale® environments, you can issue the command from any member in the standby cluster, including non-replay members.

During a failover, the following steps occur:
  1. A disabling message is sent to the primary, if it is connected.
  2. After it receives the disabling message, the primary database is shut down and log writing is stopped.
  3. Log shipping and log retrieval are stopped on the standby, which entails a risk of data loss.
  4. All received logs (that is, the logs that are stored in the log path) are replayed on the standby.
  5. Any open transactions are rolled back on the standby.
  6. The standby's role changes to primary and the new primary database is opened for client connections.
Warning:

This procedure might cause a loss of data. Review the following information before performing this emergency procedure:

  • Ensure that the primary database is no longer processing database transactions. If the primary database is still running, but cannot communicate with the standby database, executing a forced takeover operation (issuing the TAKEOVER HADR command with the BY FORCE option) could result in two primary databases. When there are two primary databases, each database will have different data, and the two databases can no longer be automatically synchronized.
    • Deactivate the primary database or stop its instance, if possible. (This might not be possible if the primary system has hung, crashed, or is otherwise inaccessible.) After a failover operation is performed, if the failed database is later restarted, it will not automatically assume the role of primary database.
  • The likelihood and extent of transaction loss depend on your specific configuration and circumstances:
    • If the primary database fails while in peer state or disconnected peer state and the synchronization mode is synchronous (SYNC), the standby database does not lose transactions that were reported committed to an application before the primary database failed.
    • If the primary database fails while in peer state or disconnected peer state and the synchronization mode is near synchronous (NEARSYNC), the standby database can only lose transactions committed by the primary database if both the primary and the standby databases fail at the same time.
    • If the primary database fails while in peer state or disconnected peer state and the synchronization mode is asynchronous (ASYNC), the standby database can lose transactions committed by the primary database if the standby database did not receive all of the log records for the transactions before the takeover operation was performed. The standby database can also lose transactions committed by the primary database if the standby database crashes before it was able to write all the received logs to disk.
      Note: Peer window is not allowed in ASYNC mode, therefore the primary database can never enter disconnected peer state in that mode.
    • If the primary database fails while in remote catchup state and the synchronization mode is super asynchronous (SUPERASYNC), the standby database can lose transactions committed by the primary database if the standby database did not receive all of the log records for the transactions before the takeover operation was performed. The standby database can also lose transactions committed by the primary database if the standby database crashes before it was able to write all the received logs to disk.
      Note: Databases can never be in peer or disconnected peer state in SUPERASYNC mode.
    • If the primary database fails while in remote catchup pending state, transactions that have not been received and processed by the standby database are lost.
      Note: Any log gap shown in the database snapshot represents the gap at the last time the primary and standby databases were communicating with each other; the primary database might have processed a very large number of transactions since that time.
  • Ensure that any application that connects to the new primary (or that is rerouted to the new primary by client reroute), is prepared to handle the following:
    • Data might be lost during failover, in which case the new primary does not have all of the transactions committed on the old primary. This can happen even when the hadr_syncmode configuration parameter is set to SYNC. Because an HADR standby applies logs sequentially, you can assume that if a transaction in an SQL session is committed on the new primary, all previous transactions in the same session have also been committed on the new primary. The commit sequence of transactions across multiple sessions can be determined only with detailed analysis of the log stream.
    • A transaction can be issued to the original primary, committed there, and replicated to the new primary (original standby), yet never be reported as committed because the original primary crashed before it could report the commit to the client. Any application you write should therefore be able to handle transactions that were issued to the original primary and not reported as committed there, but that are committed on the new primary (original standby).
    • Some operations are not replicated, such as changes to database configuration and to external UDF objects.
  • HADR does not interface with the Db2 fault monitor (db2fm), which can be used to automatically restart a failed database. If the fault monitor is enabled, be aware that it might act on a presumably failed primary database.
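
The transaction-loss cases listed in the warning above can be summarized as a lookup. This Python sketch only restates those cases; it does not query Db2. The function name transaction_loss_risk, the uppercase labels, and the wording of the returned summaries are assumptions made for illustration.

```python
from typing import Optional

PEER_STATES = {"PEER", "DISCONNECTED_PEER"}
LOG_GAP_RISK = ("possible: logs not yet received at takeover, or received "
                "but not yet written to disk when the standby crashes")


def transaction_loss_risk(syncmode: str, primary_state: str) -> Optional[str]:
    """Summarize the loss risk for a sync mode and the primary's state at failure."""
    mode, state = syncmode.upper(), primary_state.upper()
    # Combinations that the notes above rule out.
    if mode == "SUPERASYNC" and state in PEER_STATES:
        raise ValueError("SUPERASYNC never enters peer or disconnected peer state")
    if mode == "ASYNC" and state == "DISCONNECTED_PEER":
        raise ValueError("ASYNC has no peer window, so no disconnected peer state")
    if state == "REMOTE_CATCHUP_PENDING":
        return "transactions not yet received and processed by the standby are lost"
    if state in PEER_STATES:
        if mode == "SYNC":
            return "none for transactions reported committed to an application"
        if mode == "NEARSYNC":
            return "only if primary and standby fail at the same time"
        if mode == "ASYNC":
            return LOG_GAP_RISK
    if mode == "SUPERASYNC" and state == "REMOTE_CATCHUP":
        return LOG_GAP_RISK
    return None  # combination not covered by the cases above
```

Returning None for combinations the warning does not describe keeps the sketch from claiming more than the text states.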

Procedure

To fail over the primary role to the standby:

  • Use the CLP to initiate a failover operation on the standby database.
    1. Completely disable the failed primary database. When a database encounters internal errors, normal shutdown commands might not completely shut it down. You might need to use operating system commands to remove resources such as processes, shared memory, or network connections.
    2. Issue the TAKEOVER HADR command with the BY FORCE option on the standby database.
      In the following example the failover takes place on database LEAFS:
      TAKEOVER HADR ON DB LEAFS BY FORCE
      The BY FORCE option is required because the primary is expected to be offline.

      If the primary database is not completely disabled, the standby database still has a connection to the primary and sends a disabling message to the primary database, forcing it to shut down. The standby database still switches to the role of primary database whether or not it receives confirmation that the primary database has been shut down.

  • Call the db2HADRTakeover application programming interface (API) from an application.
  • Open the task assistant for the TAKEOVER HADR command in IBM® Data Studio.
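
If you script the CLP method, the command from the example above can be assembled and passed to the command line processor. The following Python sketch makes several assumptions: the helper names are hypothetical, the alias check is a conservative rule chosen for this sketch rather than the full Db2 naming rule, and the db2 executable is assumed to be on the PATH with the instance environment already set up on the standby host.

```python
import re
import subprocess


def build_forced_takeover(db_alias: str, peer_window_only: bool = False) -> str:
    """Build the CLP statement for a forced takeover of db_alias."""
    # Conservative alias check (an assumption of this sketch) so that
    # arbitrary text is never spliced into the command.
    if not re.fullmatch(r"[A-Za-z0-9_]{1,8}", db_alias):
        raise ValueError(f"unexpected database alias: {db_alias!r}")
    command = f"TAKEOVER HADR ON DB {db_alias} BY FORCE"
    if peer_window_only:
        # Suboption of BY FORCE; relevant only if a peer window is configured.
        command += " PEER WINDOW ONLY"
    return command


def run_forced_takeover(db_alias: str) -> subprocess.CompletedProcess:
    """Run the forced takeover through the CLP on the standby host."""
    return subprocess.run(
        ["db2", build_forced_takeover(db_alias)],
        capture_output=True, text=True, check=False,
    )
```

Calling build_forced_takeover("LEAFS") yields the same statement as the example in step 2. Run it only after the failed primary has been completely disabled, as described in step 1.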

Results

If, at the time of the failover, the standby has a connection to the primary (or any member on the primary in a Db2 pureScale environment), it sends a disabling message to the old primary to prevent a split brain scenario with dual primaries. You can clear the disabling message by doing one of the following:
  • starting the failed primary as a standby (that is, reintegrating it)
  • starting the failed primary as a primary using the BY FORCE option
  • stopping HADR on the failed primary
  • dropping the failed primary database
  • restoring the database
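
As a rough guide, each option in the list above corresponds to a CLP command issued against the failed primary. The Python sketch below only builds the command text; the function name and dictionary keys are hypothetical, and the commands are shown in their shortest form. In practice, RESTORE in particular needs further clauses (such as the location of the backup image) that are omitted here.

```python
from typing import Dict


def clearing_commands(db_alias: str) -> Dict[str, str]:
    """Map each way of clearing the disabling message to a CLP command."""
    return {
        "reintegrate as standby": f"START HADR ON DB {db_alias} AS STANDBY",
        "restart as primary": f"START HADR ON DB {db_alias} AS PRIMARY BY FORCE",
        "stop HADR": f"STOP HADR ON DB {db_alias}",
        "drop database": f"DROP DB {db_alias}",
        # Minimal form only; real restores take additional clauses.
        "restore database": f"RESTORE DB {db_alias}",
    }
```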

What to do next

If you want to reintegrate the old primary as the new standby, the old primary's log streams must not have diverged from the new primary's. For more information on this procedure, see the Related links.