
How a TSAMP cluster gets into Db2 Split-brain

Troubleshooting


Problem

The HADR database is in the Primary role on both nodes in the cluster.

Symptom

Output of "db2pd -db <DBNAME> -hadr" shows as primary on both nodes.
The output of lssam can also show online for the local node and failed offline for the remote node when issued from both nodes.
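For illustration, using the example database TESTDB from the Environment section below, the role check looks similar to this on each node (output abridged and illustrative):

```
# Run as the instance owner on BOTH nodes; in a split-brain each node reports PRIMARY
db2pd -db TESTDB -hadr | grep -i HADR_ROLE
                            HADR_ROLE = PRIMARY
```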

Cause

Both nodes enter PENDING_QUORUM and attempt to reserve the network tiebreaker by ICMP-pinging the default gateway.
Both nodes are able to ping the default gateway, so the tiebreaker is reserved on both nodes, and both nodes are granted operational quorum.
Because both nodes hold operational quorum, the current primary continues as primary, while the standby is also converted to primary by TSAMP issuing hadrVxxx_start.ksh.
The trigger is a network issue: when the network comes back up, ICMP is re-enabled before UDP, so the network tiebreaker pings succeed while the UDP heartbeats still cannot get through.
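During such an outage, this asymmetry can be seen directly from either node; a quick sketch (the gateway address 10.37.145.1 is taken from the logs in the next section):

```
# ICMP works again, so the network tiebreaker reservation succeeds
ping -c 3 10.37.145.1

# RSCT heartbeats use UDP port 12347; the local socket is up,
# but UDP packets to the peer are still not getting through
netstat -an | grep 12347
```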

Environment

2 node cluster, node01 and node02
Single NIC/Network connecting them, NIC is "en0" on both nodes (AIX example)
Network tiebreaker using default gateway IP address as target
Db2 HADR, instance user is db2inst1, DB is TESTDB
TSAMP/RSCT/Db2/OS versions are not significant; Db2 split-brain can happen on all versions.
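For reference, a network tiebreaker of this kind is typically created with the samtb_net exec tiebreaker; a minimal sketch, where the name netTB is illustrative and the address and PostReserveWaitTime match the values seen in the logs below:

```
# Define an exec tiebreaker that reserves quorum by pinging the default gateway
mkrsrc IBM.TieBreaker Type="EXEC" Name="netTB" \
  DeviceInfo='PATHNAME=/usr/sbin/rsct/bin/samtb_net Address=10.37.145.1 Log=1' \
  PostReserveWaitTime=30

# Make it the operational quorum tiebreaker for the peer domain
chrsrc -c IBM.PeerNode OpQuorumTieBreaker="netTB"
```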

Diagnosing The Problem

The nodes lose heartbeat communication over the network on en0 (UDP packets on port 12347).
Node node01 enters PENDING_QUORUM:

```
Apr 18 15:48:59 node01 ConfigRM[33070]: (Recorded using libct_ffdc.a cv 2):::Error ID: :::Reference ID:  :::Template ID: 0:::Details File:  :::Location: RSCT,PeerDomain.C,1.99.34.1,20474             :::CONFIGRM_PENDINGQUORUM_ER#012The operational quorum state of the active peer domain has changed to PENDING_QUORUM. #012This state usually indicates that exactly half of the nodes that are defined in the #012peer domain are online.  In this state cluster resources cannot be recovered although #012none will be stopped explicitly.
```
The network tiebreaker is then tested; rc=0 indicates a successful reservation of the tiebreaker:
```
Apr 18 15:48:59 node01 samtb_net[12844]: Entered op=reserve ip=10.37.145.1 log=1 count=9
Apr 18 15:48:59 node01 samtb_net[12844]: op=reserve ip=10.37.145.1 rc=0 log=1 count=9
```
Because the tiebreaker reservation is successful, the node regains quorum: after 30 seconds (the network tiebreaker has a PostReserveWaitTime of 30 seconds), node01 gets HAS_QUORUM:
 

```
Apr 18 15:49:29 node01 ConfigRM[33070]: (Recorded using libct_ffdc.a cv 2):::Error ID: :::Reference ID:  :::Template ID: 0:::Details File:  :::Location: RSCT,PeerDomain.C,1.99.34.1,20470             :::CONFIGRM_HASQUORUM_ST#012The operational quorum state of the active peer domain has changed to HAS_QUORUM. #012In this state, cluster resources may be recovered and controlled as needed by #012management applications.
```

The same sequence occurs on node02 (currently the standby node). It enters PENDING_QUORUM at the same time as the primary (node01):
```
Apr 18 15:48:54 node02 ConfigRM[38555]: (Recorded using libct_ffdc.a cv 2):::Error ID: :::Reference ID:  :::Template ID: 0:::Details File:  :::Location: RSCT,PeerDomain.C,1.99.34.1,20474             :::CONFIGRM_PENDINGQUORUM_ER#012The operational quorum state of the active peer domain has changed to PENDING_QUORUM. #012This state usually indicates that exactly half of the nodes that are defined in the #012peer domain are online.  In this state cluster resources cannot be recovered although #012none will be stopped explicitly.
```

The network tiebreaker is tested; rc=0 indicates a successful reservation:
```
Apr 18 15:48:54 node02 samtb_net[29975]: Entered op=reserve ip=10.37.145.1 log=1 count=9
Apr 18 15:48:54 node02  samtb_net[29975]: op=reserve ip=10.37.145.1 rc=0 log=1 count=9
```
The reservation succeeds, so the node regains quorum: after 30 seconds (PostReserveWaitTime), node02 gets HAS_QUORUM:
```
Apr 18 15:49:24 node02  ConfigRM[38555]: (Recorded using libct_ffdc.a cv 2):::Error ID: :::Reference ID:  :::Template ID: 0:::Details File:  :::Location: RSCT,PeerDomain.C,1.99.34.1,20470             :::CONFIGRM_HASQUORUM_ST#012The operational quorum state of the active peer domain has changed to HAS_QUORUM. #012In this state, cluster resources may be recovered and controlled as needed by #012management applications.
```
With both nodes holding operational quorum, the current primary (node01) continues as primary, but the standby (node02) is also converted to primary by TSAMP issuing db2V111_start.ksh, so both nodes now run a primary.
The following syslog messages occur when TSAMP starts the Db2 instance.
Mounts are started (for example, /db2inst1):
 
```
Apr 18 15:49:27 node02 mountV111_start.ksh[32922]: 52: Entered (/db2inst1)
Apr 18 15:49:27 node02 mountV111_start.ksh[32922]: 278: mounting file system: /db2inst1
Apr 18 15:49:27 node02 mountV111_start.ksh[32922]: 454: Returning 0 for /db2inst1
```
Db2 partition started:
 
```
Apr 18 15:49:30 node02 db2V111_start.ksh[34204]: 226: Entered /opt/rsct/sapolicies/db2/db2V111_start.ksh, db2inst1, 0
Apr 18 15:49:30 node02 db2V111_start.ksh[34222]: 261: Able to cd to /db2home/db2inst1/sqllib : /opt/rsct/sapolicies/db2/db2V111_start.ksh, db2inst1, 0
Apr 18 15:49:30 node02 db2V111_start.ksh[34204]: 268: 1 partitions total: /opt/rsct/sapolicies/db2/db2V111_start.ksh, db2inst1, 0

Apr 18 15:49:55 node02 samtb_net[36896]: op=heartbeat ip=10.37.145.1 rc=0 log=1 count=9
Apr 18 15:50:21 node02 db2V111_start.ksh[34222]: 305: Partition was successfully started.
```
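The quorum transitions shown above can also be checked from the ConfigRM daemon status on each node; a sketch (the output line is abridged and illustrative):

```
# Shows the operational quorum state of the active peer domain
lssrc -ls IBM.ConfigRM | grep -i quorum
   Operational Quorum State: HAS_QUORUM
```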

 

Resolving The Problem

To avoid Db2 split-brain, provide multiple network paths between the nodes; this ensures that the cluster never splits into sub-clusters.
In short: avoid network outages, switch to a more reliable tiebreaker, or add a third node as an arbitrator.
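For example, on AIX the cluster can be moved to a disk tiebreaker on a shared disk that both nodes can reserve; a sketch, where the tiebreaker name diskTB and the device /dev/hdisk6 are illustrative:

```
# Create a disk tiebreaker on a shared disk visible to both nodes
mkrsrc IBM.TieBreaker Type="DISK" Name="diskTB" DeviceInfo='DEVICE=/dev/hdisk6'

# Activate it as the operational quorum tiebreaker
chrsrc -c IBM.PeerNode OpQuorumTieBreaker="diskTB"
```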

Refer to the following technote to add an arbitrator node:
 
https://www.ibm.com/support/pages/node/876032

Document Location

Worldwide

[{"Business Unit":{"code":"BU053","label":"Cloud & Data Platform"},"Product":{"code":"SSRM2X","label":"Tivoli System Automation for Multiplatforms"},"ARM Category":[{"code":"a8m0z000000bldnAAA","label":"Split Brain"}],"ARM Case Number":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"All Version(s)","Line of Business":{"code":"LOB45","label":"Automation"}}]

Product Synonym

TSAMP;tsa

Document Information

Modified date:
17 August 2020

UID

ibm16253261