Troubleshooting non-pureScale automated HA with Pacemaker

Outline of troubleshooting techniques, commands, and common issues related to the Pacemaker cluster manager in non-pureScale automated HA deployments.

Important: In Db2® 11.5.8 and later, Mutual Failover high availability is supported when using Pacemaker as the integrated cluster manager. In Db2 11.5.6 and later, the Pacemaker cluster manager for automated fail-over to HADR standby databases is packaged and installed with Db2. In Db2 11.5.5, Pacemaker is included and available for production environments. In Db2 11.5.4, Pacemaker is included as a technology preview only, for development, test, and proof-of-concept environments.

Use the following commands to help troubleshoot an HADR cluster that is managed by Pacemaker. Troubleshooting cluster manager-related issues almost always begins with checking, from the perspective of the cluster manager, the state of all hosts, all resources in the resource model, quorum information, and so on. With the integrated solution, the db2cm -list command is recommended because it provides a single view of all relevant resources. This section begins with a detailed explanation of the db2cm -list output before going through error scenarios and their resolutions.

Sample db2cm -list output and explanation:
[root@db2tea1 ~]# db2cm -list
      Cluster Status

Domain information:
Domain name               = hadom
Pacemaker version         = 2.0.2-1.db2pcmk.el8
Corosync version          = 3.0.3
Current domain leader     = db2tea1
Number of nodes           = 2
Number of resources       = 6

Node information:
Name name           State
----------------    --------
db2tea1             Online
kedge1              Online

Resource Information:

Resource Name             = db2_db2inst1_db2inst1_SAMPLE
  Resource Type                 = HADR
    DB Name                     = SAMPLE
    Managed                     = true
    HADR Primary Instance       = db2inst1
    HADR Primary Node           = db2tea1
    HADR Primary State          = Online
    HADR Standby Instance       = db2inst1
    HADR Standby Node           = kedge1
    HADR Standby State          = Online

Resource Name             = db2_db2tea1_db2inst1_0
  State                         = Online
  Managed                       = true
  Resource Type                 = Instance
    Node                        = db2tea1
    Instance Name               = db2inst1

Resource Name             = db2_db2tea1_eth1
  State                         = Online
  Managed                       = true
  Resource Type                 = Network Interface
    Node                        = db2tea1
    Interface Name              = eth1

Resource Name             = db2_kedge1_db2inst1_0
  State                         = Online
  Managed                       = true
  Resource Type                 = Instance
    Node                        = kedge1
    Instance Name               = db2inst1

Resource Name             = db2_kedge1_eth1
  State                         = Online
  Managed                       = true
  Resource Type                 = Network Interface
    Node                        = kedge1
    Interface Name              = eth1

Fencing Information:
  Not Configured
Quorum Information:
  Two-node quorum
There are five key components that users should look for in the output: Domain, Node, Resource, Fencing information, and Quorum information.
Domain
Domain information shows installed RPM versions and domain configurations.
Node
Node information shows all the configured nodes in the domain and their active states.
Resource
Resource information lists all the resources in the domain with their states and configurations. State shows the active state of the resource, and Managed shows whether the resource is enabled or disabled.
Fencing information
Fencing information describes the fencing method used in the domain.
Quorum information
Quorum information shows Two-node quorum, Qdevice, or None, depending on the quorum type that is configured.
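As a quick first check, the Node and Resource sections can be scanned for anything that is not Online. The following is a minimal sketch, not part of db2cm: the function name db2cm_not_online is hypothetical, and the parsing assumes the output layout matches the sample shown in this topic (section headings, a dashed separator above the node rows, and "key = value" resource lines).

```shell
#!/bin/sh
# Hypothetical helper: reads `db2cm -list` output on stdin and prints every
# node or resource state that is not "Online".
# Assumes the layout of the sample output in this topic.
# Exit status: 0 if every state is Online, 1 otherwise.
db2cm_not_online() {
    awk '
        /^Node information:/     { section = "node"; in_rows = 0; next }
        /^Resource Information:/ { section = "resource"; next }
        /^Fencing Information:/  { section = ""; next }

        # Node rows start after the dashed separator line.
        section == "node" && /^-+/ { in_rows = 1; next }
        section == "node" && in_rows && NF >= 2 {
            if ($2 != "Online") { print "node " $1 ": " $2; bad = 1 }
            next
        }

        # Remember which resource the following lines belong to.
        section == "resource" && /^Resource Name[ \t]*=/ {
            split($0, kv, "=")
            name = kv[2]
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            next
        }

        # Matches "State", "HADR Primary State", and "HADR Standby State".
        section == "resource" && /State[ \t]*=/ {
            split($0, kv, "=")
            key = kv[1]; val = kv[2]
            gsub(/^[ \t]+|[ \t]+$/, "", key)
            gsub(/^[ \t]+|[ \t]+$/, "", val)
            if (val != "Online") {
                print "resource " name " (" key "): " val
                bad = 1
            }
        }

        END { exit (bad ? 1 : 0) }
    '
}
```

To use it, pipe the command output into the function as root, for example: db2cm -list | db2cm_not_online. No output and an exit status of 0 mean that every node and resource reported Online.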
If a QDevice is configured as the quorum mechanism, the db2cm -list output differs from the first sample. It looks similar to the following:
[root@db2tea1 ~]# db2cm -list
      Cluster Status

Domain information:
Domain name               = hadom
Pacemaker version         = 2.0.2-1.db2pcmk.el8
Corosync version          = 3.0.3
Current domain leader     = db2tea1
Number of nodes           = 2
Number of resources       = 6

Node information:
Name name           State
----------------    --------
db2tea1             Online
kedge1              Online

Resource Information:

Resource Name             = db2_db2inst1_db2inst1_SAMPLE
  Resource Type                 = HADR
    HADR DB Name                = SAMPLE
    HADR Primary Instance       = db2inst1
    HADR Primary Node           = db2tea1
    HADR Primary State          = Online
    HADR Primary Managed        = true

    HADR Standby Instance       = db2inst1
    HADR Standby Node           = kedge1
    HADR Standby State          = Online
    HADR Standby Managed        = true

Resource Name             = db2_db2tea1_db2inst1_0
  State                         = Online
  Managed                       = true
  Resource Type                 = Instance
    Node                        = db2tea1
    Instance Name               = db2inst1

Resource Name             = db2_db2tea1_eth1
  State                         = Online
  Managed                       = true
  Resource Type                 = Network Interface
    Node                        = db2tea1
    Interface Name              = eth1

Resource Name             = db2_kedge1_db2inst1_0
  State                         = Online
  Managed                       = true
  Resource Type                 = Instance
    Node                        = kedge1
    Instance Name               = db2inst1

Resource Name             = db2_kedge1_eth1
  State                         = Online
  Managed                       = true
  Resource Type                 = Network Interface
    Node                        = kedge1
    Interface Name              = eth1

Fencing Information:
  Not Configured
Quorum Information:
  Qdevice

Qdevice information
-------------------
Model:			Net
Node ID:		1
Configured node list:
    0	Node ID = 1
    1	Node ID = 2
Membership node list:	1, 2

Qdevice-net information
----------------------
Cluster name:		hadom
QNetd host:		tierce1:5403
Algorithm:		LMS
Tie-breaker:		Node with lowest node ID
State:			Connected
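
When a QDevice is configured, the State line in the Qdevice-net information section is a useful value to check first. The sketch below extracts that value from db2cm -list output. The function name db2cm_qdevice_state is hypothetical, and the parsing assumes the section heading and "State:" label appear exactly as in the sample output above.

```shell
#!/bin/sh
# Hypothetical helper: reads `db2cm -list` output on stdin and prints the
# value of the "State:" line in the "Qdevice-net information" section.
# Exit status: 0 if a state was found, 1 if the section or line is missing
# (for example, when no QDevice is configured).
db2cm_qdevice_state() {
    awk '
        /^Qdevice-net information/ { in_qnet = 1; next }
        in_qnet && /^State:/ {
            sub(/^State:[ \t]*/, "")
            print
            found = 1
            exit
        }
        END { exit (found ? 0 : 1) }
    '
}
```

For example, db2cm -list | db2cm_qdevice_state prints Connected for the sample output above. Any other value, or a non-zero exit status on a cluster that is expected to use a QDevice, is a reason to look at the QDevice host and its connectivity.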

When engaging Db2 support to analyze an issue with Pacemaker automation, collect diagnostics as soon as possible. The db2support command collects all necessary Pacemaker diagnostics. Run it on both hosts to capture a complete picture.

If an issue occurs while running a db2cm command, also collect the db2cm logs and upload them to the support case together with the db2support output. Each db2cm run creates a new log file in the /tmp directory, and the log file name includes the timestamp of when the command was run. For example:
/tmp/db2cm.run.log.20200123
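
Because every run creates a separate file, it can be convenient to bundle all db2cm logs on a host into one archive before uploading. The following is a minimal sketch under stated assumptions: the function name collect_db2cm_logs and the default archive name are hypothetical, and the file pattern db2cm.run.log.* is inferred from the example name above.

```shell
#!/bin/sh
# Hypothetical helper: bundles db2cm run logs into a single tar.gz archive.
#   $1  directory that holds the logs (default: /tmp)
#   $2  archive to create (default name is an assumption, not a Db2 convention)
# Prints the archive path on success; returns 1 if no logs are found.
collect_db2cm_logs() {
    log_dir=${1:-/tmp}
    archive=${2:-db2cm_logs_$(hostname).tar.gz}

    # Pattern inferred from the example log name; the suffix is the timestamp.
    set -- "$log_dir"/db2cm.run.log.*
    if [ ! -e "$1" ]; then
        echo "no db2cm logs found in $log_dir" >&2
        return 1
    fi

    tar -czf "$archive" "$@" || return 1
    echo "$archive"
}
```

Run collect_db2cm_logs on each host where db2cm commands were issued, then attach the resulting archives to the support case alongside the db2support output.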

See the following for further troubleshooting of specific issues: