Reinitializing an HADR configuration to resolve errors in Db2

You can reinitialize a Db2 High Availability Disaster Recovery (HADR) configuration to resolve an error condition that prevents the primary and standby databases from connecting and achieving a peer state.

About this task

For various reasons, an HADR configuration can end up in an error state. In these situations, usually one copy of the database (primary or standby) is working correctly while the other copy is corrupted.

For example, after a worker node reboot, the old primary can sometimes fail to reintegrate if the peer window expires and a subsequent takeover by force is issued by the HADR automation (governor). In such a scenario, the governor log (/var/log/governor/governor.log) on the new primary (old standby) contains entries that are similar to the following example:

2020-04-01 18:35:38,832 INFO 8991-47423382027648: child(13084) executing db2 takeover hadr on db BLUDB by force peer window only
2020-04-01 18:35:39,084 INFO 8991-47423382027648: SQL1770N  Takeover HADR cannot complete. Reason code = "9".

2020-04-01 18:35:39,085 INFO 8991-47423382027648: we have the mandate to force takeover (window=300)
2020-04-01 18:35:39,086 INFO 8991-47423382027648: Result of DNS resolution of remote endpoint: 10.130.0.39
2020-04-01 18:35:40,096 INFO 8991-47423382027648: child(13151) executing db2 takeover hadr on db BLUDB by force

....

2020-04-01 18:35:54,686 INFO 8991-47423382027648: using cached role(PRIMARY) as of 0.369333982468 seconds ago (threshold 1)
2020-04-01 18:35:54,687 INFO 8991-47423382027648:
                            db2 role is PRIMARY,
                            db2 connect status is DISCONNECTED,
                            db2 state is DISCONNECTED

On the old primary (now the disconnected standby), the governor log inside the Db2 database pod contains entries that are similar to the following example:

2020-04-01 18:36:01,690 INFO 2668-47200027639168: db2 state is LOCAL_CATCHUP
2020-04-01 18:36:01,690 INFO 2668-47200027639168: startup: waiting on db2 to become peer with primary (waited 20 secs)
2020-04-01 18:36:11,701 INFO 2668-47200027639168: Calling db2pd
2020-04-01 18:36:12,260 INFO 2668-47200027639168: db2pd returned
2020-04-01 18:36:12,262 INFO 2668-47200027639168:
Database BLUDB not activated on database member 0 or this database name cannot be found in the local database directory.

Option -hadr requires -db <database> or -alldbs option and active database.

2020-04-01 18:36:12,262 INFO 2668-47200027639168: db2 state is None
2020-04-01 18:36:12,262 INFO 2668-47200027639168: startup: waiting on db2 to become peer with primary (waited 30 secs)
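
To inspect these entries yourself, you can read the governor log directly from inside the Db2 pod. The following command is a minimal sketch; <db2_pod> is a placeholder for the primary or standby pod name in your deployment, and it assumes that the tail utility is available in the container image:

    oc exec -it <db2_pod> -- tail -n 100 /var/log/governor/governor.log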

The old primary never reintegrates as the new standby after the rebooted host comes back online. In this situation, the only option is to reinitialize the HADR configuration by using the following procedure.

Procedure

  1. Stop HADR on the standby pod:
    oc exec -it <standby_pod> -- manage_hadr -stop
  2. Stop HADR on the primary pod:
    oc exec -it <primary_pod> -- manage_hadr -stop
  3. Set up HADR on the primary and standby pods by using the setup_config_hadr script.
  4. Start HADR services on the standby pod:
    oc exec -it <standby_pod> -- manage_hadr -start_as standby
  5. Start HADR services on the primary pod:
    oc exec -it <primary_pod> -- manage_hadr -start_as primary
  6. To verify that the HADR setup completed as expected, run the manage_hadr command with the -status option on either pod:
    oc exec -it <primary_or_standby_pod> -- manage_hadr -status
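
In the -status output, confirm that the primary and standby databases report a connected peer state. As an additional check, you can query the HADR state with db2pd from inside either pod. The following command is a minimal sketch that assumes the Db2 instance owner inside the pod is db2inst1 and the database name is BLUDB (as shown in the earlier log examples); adjust both values to match your deployment:

    oc exec -it <primary_pod> -- su - db2inst1 -c "db2pd -db BLUDB -hadr"

In a healthy configuration, the db2pd output shows HADR_STATE = PEER and HADR_CONNECT_STATUS = CONNECTED, with HADR_ROLE = PRIMARY on the primary pod and HADR_ROLE = STANDBY on the standby pod.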