Detecting and responding to system outages in a high availability solution

Implementing a high availability solution does not prevent hardware or software failures. However, redundant systems and a failover mechanism enable your solution to detect and respond to failures, and to reroute workload so that user applications can continue to do work.

Procedure

When a failure occurs, your database solution must do the following:

  1. Detect the failure.

    Failover software can use heartbeat monitoring to confirm the availability of system components. A heartbeat monitor listens for regular communication from all the components of the system. If the heartbeat monitor stops hearing from a component, the heartbeat monitor signals to the system that the component has failed.

  2. Respond to the failure: failover.
    1. Identify, bring online, and initialize a secondary component to take over operations for the failed component.
    2. Reroute workload to the secondary component.
    3. Remove the failed component from the system.
  3. Recover from the failure.

    If a primary database server fails, the first priority is to redirect clients to an alternate server or to fail over to a standby database so that client applications can do their work with as little interruption as possible. Once that failover succeeds, you must repair whatever went wrong on the failed database server so that it can be reintegrated into the solution. Repairing the failed database server might just mean restarting it.

  4. Return to normal operations.

    Once the failed database system is repaired, you must integrate it back into the database solution. For example, you could reintegrate the failed primary database as the standby for the database that took over as the primary when the failure occurred. Alternatively, you could force the repaired database server to take over as the primary database server again.
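
The detect, fail over, and reintegrate steps above can be illustrated with a minimal sketch. It is not Db2 code and does not reflect how any particular failover software is implemented. The class names (HeartbeatMonitor, Cluster), the server names (db-a, db-b), and the 5-second timeout are all assumptions made for illustration, and the clock is simulated so that the example is deterministic.

```python
import time
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat time of each component (hypothetical helper)."""
    timeout: float  # seconds of silence before a component is considered failed
    last_seen: dict = field(default_factory=dict)

    def beat(self, component, now=None):
        """Record a heartbeat from a component."""
        self.last_seen[component] = time.monotonic() if now is None else now

    def failed(self, now=None):
        """Return components whose last heartbeat is older than the timeout."""
        now = time.monotonic() if now is None else now
        return sorted(c for c, t in self.last_seen.items() if now - t > self.timeout)


@dataclass
class Cluster:
    """Minimal model of one primary with standbys (names are illustrative)."""
    primary: str
    standbys: list
    removed: list = field(default_factory=list)

    def fail_over(self, failed):
        """Apply the three failover sub-steps for a failed primary."""
        if failed != self.primary:
            raise ValueError("only primary failover is modeled here")
        if not self.standbys:
            raise RuntimeError("no standby available to take over")
        new_primary = self.standbys.pop(0)  # 2.1 identify a secondary to take over
        self.primary = new_primary          # 2.2 reroute workload to the secondary
        self.removed.append(failed)         # 2.3 remove the failed component
        return new_primary

    def reintegrate(self, component):
        """Step 4: return a repaired component to the system as a standby."""
        self.removed.remove(component)
        self.standbys.append(component)


monitor = HeartbeatMonitor(timeout=5.0)
cluster = Cluster(primary="db-a", standbys=["db-b"])

# Simulated clock: both servers report at t=0, but only db-b reports at t=4.
monitor.beat("db-a", now=0.0)
monitor.beat("db-b", now=0.0)
monitor.beat("db-b", now=4.0)

# At t=6, db-a has been silent for longer than the timeout.
for component in monitor.failed(now=6.0):
    if component == cluster.primary:
        cluster.fail_over(component)
        del monitor.last_seen[component]  # stop monitoring the removed component

print(cluster.primary, cluster.standbys, cluster.removed)
```

After db-a is repaired (step 3), calling cluster.reintegrate("db-a") models the first option in step 4, where the former primary rejoins as a standby. A real monitor would also have to guard against false positives, such as a slow network being mistaken for a failed component, which this sketch ignores.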

What to do next

The Db2® database product can perform some of these steps for you. For example:

  • The Db2 High Availability Disaster Recovery (HADR) heartbeat monitor element, hadr_heartbeat, can detect when a primary database has failed.

  • Db2 client reroute can transfer workload from a failed database server to a secondary one.

  • The Db2 fault monitor can restart a database instance that terminates unexpectedly.
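
The idea behind rerouting workload can be sketched from the client side as follows. This is only a conceptual illustration under stated assumptions, not how Db2 client reroute is implemented or configured. The function connect_with_reroute, the stand-in fake_connect, and the host names are hypothetical.

```python
def connect_with_reroute(servers, connect):
    """Try each server in order; return (server, connection) for the first success.

    `servers` is a list of addresses and `connect` is a callable that raises
    ConnectionError when a server is unreachable. Both are hypothetical.
    """
    errors = {}
    for server in servers:
        try:
            return server, connect(server)
        except ConnectionError as exc:
            errors[server] = exc  # remember the failure and try the next server
    raise ConnectionError(f"all servers unreachable: {sorted(errors)}")


def fake_connect(server):
    """Stand-in for a real driver call; pretends that the primary is down."""
    if server == "primary.example.com":
        raise ConnectionError("primary is down")
    return f"connection to {server}"


server, conn = connect_with_reroute(
    ["primary.example.com", "standby.example.com"], fake_connect
)
print(server)
```

The ordering of the server list encodes the preference: the primary is always tried first, and the alternate is used only when the primary cannot be reached.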