Db2 HADR database pair both assume primary role

This topic shows how to identify and resolve a situation in which both databases in an HADR pair assume the primary HADR database role because of compounding issues.

Important: In Db2® 11.5.8 and later, Mutual Failover high availability is supported when using Pacemaker as the integrated cluster manager. In Db2 11.5.6 and later, the Pacemaker cluster manager for automated fail-over to HADR standby databases is packaged and installed with Db2. In Db2 11.5.5, Pacemaker is included and available for production environments. In Db2 11.5.4, Pacemaker is included as a technology preview only, for development, test, and proof-of-concept environments.

Identification of the problem

Confirm that both databases have "HADR database role" set to PRIMARY. Run db2 get db cfg for <dbname> | grep "HADR database role" on each host as the instance user.

[rohant@svlxtorc pacemaker]# db2 get db cfg for gtdb | grep "HADR database role"
HADR database role = PRIMARY

[rohant@svlxtord pacemaker]# db2 get db cfg for gtdb | grep "HADR database role"
HADR database role = PRIMARY
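The check above can be scripted so that the role reported on each host is extracted and compared in one step. The following is a minimal sketch, not a Db2-supplied tool: hadr_role and is_double_primary are hypothetical helper names, and the database name gtdb is taken from the example above.

```shell
# Hypothetical helper: read `db2 get db cfg` output on stdin and print
# the value of the "HADR database role" line (for example PRIMARY).
hadr_role() {
    awk -F'=' '/HADR database role/ { gsub(/[ \t]/, "", $2); print $2; exit }'
}

# Hypothetical helper: succeed (exit status 0) only when both roles passed
# as arguments are PRIMARY, which is the double-primary condition.
is_double_primary() {
    [ "$1" = "PRIMARY" ] && [ "$2" = "PRIMARY" ]
}

# Usage sketch (run on each host as the instance user, then compare):
#   role_here=$(db2 get db cfg for gtdb | hadr_role)
#   is_double_primary "$role_here" "$role_other" && echo "double primary"
```

How the role from the second host is collected (role_other in the usage sketch) is left open here, because it depends on how remote access is set up in the environment.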

Additionally, running crm status as root shows the database resource on one host in the FAILED state.

[root@svlxtorc pacemaker]# crm status
...
Clone Set: db2_rohant_rohant_GTDB-clone [db2_rohant_rohant_GTDB] (promotable)
    db2_rohant_rohant_GTDB (ocf::heartbeat:db2hadr): FAILED
    Masters: [ svltord ]

The output above might reflect a transient state. Run crm status several times to confirm that the failure persists.
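One way to tell a transient state from a persistent one is to sample crm status repeatedly and count how often the resource is reported as FAILED. The following is a minimal sketch under stated assumptions: count_failed is a hypothetical helper, the resource name comes from the example above, and the three samples at 10-second intervals are illustrative choices rather than documented values.

```shell
# Hypothetical helper: read `crm status` output on stdin and print how many
# lines mention the given resource name together with FAILED.
count_failed() {
    grep -F "$1" | grep -c 'FAILED'
}

# Usage sketch (run as root on a cluster node):
#   for i in 1 2 3; do
#       crm status | count_failed db2_rohant_rohant_GTDB
#       sleep 10
#   done
```

A nonzero count in every sample suggests that the failure is persistent rather than transient.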

Resolution

To determine which host was promoted, search for the promotion of the standby database in pacemaker.log or db2diag.log.

Example: pacemaker.log

Jun 19 14:10:52 svltord pacemaker-controld[1765] (abort_transition_graph) notice: Transition 8608 aborted by nodes-1-db2hadr-rohant_rohant_GTDB_reint doing modify db2hadr-rohant_rohant_GTDB_reint=1: Configuration change | cib=18.14477.0 source=te_update_diff_v2:465 path=/
db2hadr(db2_rohant_rohant_GTDB)[31427]: 2020/06/19_14:10:52 INFO: promote: 959: svtdbm: 0: CORAL: Debug data: "DB20000I The TAKEOVER HADR ON DATABASE command completed successfully.". db2hadr_promote() exit with rc=0.

Example: db2diag.log

2020-06-19-14.10.50.093150-420 I133204209A456 LEVEL: Info
PID : 16226 TID : 4395462813968 PROC : db2sysc 0
INSTANCE: rohant NODE : 000 DB : GTDB
HOSTNAME: svltord
EDUID : 80 EDUNAME: db2hadrs.0.0 (CORAL) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrStbyTkHandleInitialRequest, probe:46000
MESSAGE : Standby has initiated a takeover by force peer window only.....

2020-06-19-14.10.52.593838-420 I133268013A437 LEVEL: Info
PID : 16226 TID : 4395462813968 PROC : db2sysc 0
INSTANCE: rohant NODE : 000 DB : GTDB
HOSTNAME: svltord
EDUID : 80 EDUNAME: db2hadrp.0.1 (CORAL) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrStbyTkHandleDoneDrain, probe:46840
MESSAGE : Standby has completed takeover (now primary).

As the examples above show, svltord was promoted to primary, so the database on the other host, svlxtorc, must be reintegrated as standby.
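The search can be narrowed to the two message strings that appear in the examples above. The following is a minimal sketch: find_takeover is a hypothetical helper, and the log file paths are left to the caller because they vary by installation.

```shell
# Hypothetical helper: print, with line numbers, the lines in the given log
# files that record a completed takeover. The two patterns are the message
# strings from the pacemaker.log and db2diag.log examples.
find_takeover() {
    grep -n -e 'TAKEOVER HADR ON DATABASE command completed successfully' \
            -e 'Standby has completed takeover' "$@"
}

# Usage sketch (replace the paths with the actual log locations):
#   find_takeover /path/to/pacemaker.log /path/to/db2diag.log
```

The host whose logs contain these messages is the one that was promoted. The timestamps on the matching lines help correlate the two logs.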

Reintegrate the database as standby on the host that was not promoted to primary by running db2 start hadr on db <dbname> as standby.
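As a sketch of this step, the command can be wrapped so that the recorded HADR role is printed immediately afterward. reintegrate_standby is a hypothetical wrapper, not part of Db2. It assumes that it is run as the instance user on the host that was not promoted (svlxtorc in the example) and that the database name, such as gtdb, is passed as its only argument.

```shell
# Hypothetical wrapper: start HADR in the standby role for the database
# named in $1, then print the role recorded in the database configuration.
reintegrate_standby() {
    # Stop here if the start command reports a failure.
    db2 start hadr on db "$1" as standby || return 1
    # Expected to report STANDBY once reintegration has succeeded.
    db2 get db cfg for "$1" | grep "HADR database role"
}

# Usage sketch:
#   reintegrate_standby gtdb
```

If the start command fails, the wrapper returns without checking the role, so the error message from db2 remains the last output on the terminal.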

Run db2support to collect Db2 and Pacemaker diagnostics for analysis of the original conditions that led to the double primary state.