Example RDQM HA configurations and errors

An example RDQM HA configuration, complete with example errors and information on how to resolve them.

The example RDQM HA group consists of three nodes:

mqhavm13.gamsworthwilliam.com (referred to as vm13).
mqhavm14.gamsworthwilliam.com (referred to as vm14).
mqhavm15.gamsworthwilliam.com (referred to as vm15).

Three RDQM HA queue managers have been created:

HAQM1 (created on vm13)
HAQM2 (created on vm14)
HAQM3 (created on vm15)

Initial conditions

The initial condition on each of the nodes is given in the following listings:

vm13

[midtownjojo@mqhavm13 ~]$ rdqmstatus -m HAQM1
Node:                                   mqhavm13.gamsworthwilliam.com
Queue manager status:                   Running
CPU:                                    0.00%
Memory:                                 135MB
Queue manager file system:              51MB used, 1.0GB allocated [5%]
HA role:                                Primary
HA status:                              Normal
HA control:                             Enabled
HA current location:                    This node
HA preferred location:                  This node
HA floating IP interface:               None
HA floating IP address:                 None

Node:                                   mqhavm14.gamsworthwilliam.com
HA status:                              Normal

Node:                                   mqhavm15.gamsworthwilliam.com
HA status:                              Normal
Command '/opt/mqm/bin/rdqmstatus' run with sudo.

[midtownjojo@mqhavm13 ~]$ rdqmstatus -m HAQM2
Node:                                   mqhavm13.gamsworthwilliam.com
Queue manager status:                   Running elsewhere
HA role:                                Secondary
HA status:                              Normal
HA control:                             Enabled
HA current location:                    mqhavm14.gamsworthwilliam.com
HA preferred location:                  mqhavm14.gamsworthwilliam.com
HA floating IP interface:               None
HA floating IP address:                 None

Node:                                   mqhavm14.gamsworthwilliam.com
HA status:                              Normal

Node:                                   mqhavm15.gamsworthwilliam.com
HA status:                              Normal
Command '/opt/mqm/bin/rdqmstatus' run with sudo.

[midtownjojo@mqhavm13 ~]$ rdqmstatus -m HAQM3
Node:                                   mqhavm13.gamsworthwilliam.com
Queue manager status:                   Running elsewhere
HA role:                                Secondary
HA status:                              Normal
HA control:                             Enabled
HA current location:                    mqhavm15.gamsworthwilliam.com
HA preferred location:                  mqhavm15.gamsworthwilliam.com
HA floating IP interface:               None
HA floating IP address:                 None

Node:                                   mqhavm14.gamsworthwilliam.com
HA status:                              Normal

Node:                                   mqhavm15.gamsworthwilliam.com
HA status:                              Normal
Command '/opt/mqm/bin/rdqmstatus' run with sudo.

vm14

[midtownjojo@mqhavm14 ~]$ rdqmstatus -m HAQM1
Node:                                   mqhavm14.gamsworthwilliam.com
Queue manager status:                   Running elsewhere
HA role:                                Secondary
HA status:                              Normal
HA control:                             Enabled
HA current location:                    mqhavm13.gamsworthwilliam.com
HA preferred location:                  mqhavm13.gamsworthwilliam.com
HA floating IP interface:               None
HA floating IP address:                 None

Node:                                   mqhavm13.gamsworthwilliam.com
HA status:                              Normal

Node:                                   mqhavm15.gamsworthwilliam.com
HA status:                              Normal
Command '/opt/mqm/bin/rdqmstatus' run with sudo.

[midtownjojo@mqhavm14 ~]$ rdqmstatus -m HAQM2
Node:                                   mqhavm14.gamsworthwilliam.com
Queue manager status:                   Running
CPU:                                    0.00%
Memory:                                 135MB
Queue manager file system:              51MB used, 1.0GB allocated [5%]
HA role:                                Primary
HA status:                              Normal
HA control:                             Enabled
HA current location:                    This node
HA preferred location:                  This node
HA floating IP interface:               None
HA floating IP address:                 None

Node:                                   mqhavm13.gamsworthwilliam.com
HA status:                              Normal

Node:                                   mqhavm15.gamsworthwilliam.com
HA status:                              Normal
Command '/opt/mqm/bin/rdqmstatus' run with sudo.

[midtownjojo@mqhavm14 ~]$ rdqmstatus -m HAQM3
Node:                                   mqhavm14.gamsworthwilliam.com
Queue manager status:                   Running elsewhere
HA role:                                Secondary
HA status:                              Normal
HA control:                             Enabled
HA current location:                    mqhavm15.gamsworthwilliam.com
HA preferred location:                  mqhavm15.gamsworthwilliam.com
HA floating IP interface:               None
HA floating IP address:                 None

Node:                                   mqhavm13.gamsworthwilliam.com
HA status:                              Normal

Node:                                   mqhavm15.gamsworthwilliam.com
HA status:                              Normal
Command '/opt/mqm/bin/rdqmstatus' run with sudo.

vm15

[midtownjojo@mqhavm15 ~]$ rdqmstatus -m HAQM1
Node:                                   mqhavm15.gamsworthwilliam.com
Queue manager status:                   Running elsewhere
HA role:                                Secondary
HA status:                              Normal
HA control:                             Enabled
HA current location:                    mqhavm13.gamsworthwilliam.com
HA preferred location:                  mqhavm13.gamsworthwilliam.com
HA floating IP interface:               None
HA floating IP address:                 None

Node:                                   mqhavm13.gamsworthwilliam.com
HA status:                              Normal

Node:                                   mqhavm14.gamsworthwilliam.com
HA status:                              Normal
Command '/opt/mqm/bin/rdqmstatus' run with sudo.

[midtownjojo@mqhavm15 ~]$ rdqmstatus -m HAQM2
Node:                                   mqhavm15.gamsworthwilliam.com
Queue manager status:                   Running elsewhere
HA role:                                Secondary
HA status:                              Normal
HA control:                             Enabled
HA current location:                    mqhavm14.gamsworthwilliam.com
HA preferred location:                  mqhavm14.gamsworthwilliam.com
HA floating IP interface:               None
HA floating IP address:                 None

Node:                                   mqhavm13.gamsworthwilliam.com
HA status:                              Normal

Node:                                   mqhavm14.gamsworthwilliam.com
HA status:                              Normal
Command '/opt/mqm/bin/rdqmstatus' run with sudo.

[midtownjojo@mqhavm15 ~]$ rdqmstatus -m HAQM3
Node:                                   mqhavm15.gamsworthwilliam.com
Queue manager status:                   Running
CPU:                                    0.02%
Memory:                                 135MB
Queue manager file system:              51MB used, 1.0GB allocated [5%]
HA role:                                Primary
HA status:                              Normal
HA control:                             Enabled
HA current location:                    This node
HA preferred location:                  This node
HA floating IP interface:               None
HA floating IP address:                 None

Node:                                   mqhavm13.gamsworthwilliam.com
HA status:                              Normal

Node:                                   mqhavm14.gamsworthwilliam.com
HA status:                              Normal
Command '/opt/mqm/bin/rdqmstatus' run with sudo.

DRBD scenarios

RDQM HA configurations use DRBD for data replication. The following scenarios illustrate the following possible problems with DRBD:

Loss of DRBD quorum
Loss of a single DRBD connection
Synchronization stuck

DRBD Scenario 1: Loss of DRBD quorum

If the node running an RDQM HA queue manager loses the DRBD quorum for the DRBD resource corresponding to the queue manager, DRBD immediately starts returning errors from I/O operations, which will cause the queue manager to start producing FDCs and eventually stop.

If the remaining two nodes have a DRBD quorum for the DRBD resource then Pacemaker chooses one of the two nodes to start the queue manager. Because there were no updates on the original node from the time where the quorum was lost, it is safe to start the queue manager somewhere else.

The two main ways that you can monitor for a loss of DRBD quorum are:

By using the rdqmstatus command.
By monitoring the syslog of the node where the RDQM HA queue manager is initially running.

rdqmstatus

If you use the rdqmstatus command, if the node vm13 loses DRBD quorum for the DRBD resource for HAQM1, you might see status similar to the following example:

[midtownjojo@mqhavm13 ~]$ rdqmstatus -m HAQM1
Node:                                   mqhavm13.gamsworthwilliam.com
Queue manager status:                   Running elsewhere
HA role:                                Secondary
HA status:                              Remote unavailable
HA control:                             Enabled
HA current location:                    mqhavm14.gamsworthwilliam.com
HA preferred location:                  This node
HA floating IP interface:               None
HA floating IP address:                 None

Node:                                   mqhavm14.gamsworthwilliam.com
HA status:                              Remote unavailable
HA out of sync data:                    0KB

Node:                                   mqhavm15.gamsworthwilliam.com
HA status:                              Remote unavailable
HA out of sync data:                    0KB
Command '/opt/mqm/bin/rdqmstatus' run with sudo.

Notice that the HA status has changed to Remote unavailable, which indicates that both DRBD connections to the other nodes have been lost.

In this case the other two nodes have DRBD quorum for the DRBD resource so the RDQM is running somewhere else, on mqhavm14.gamsworthwilliam.com as shown as the value of HA current location.

monitoring syslog

If you monitor syslog, you will see that DRBD logs a message when it loses quorum for a resource:

Jul 30 09:38:36 mqhavm13 kernel: drbd haqm1/0 drbd100: quorum( yes -> no )

When quorum is restored a similar message is logged:

Jul 30 10:27:32 mqhavm13 kernel: drbd haqm1/0 drbd100: quorum( no -> yes )

DRBD Scenario 2: Loss of a single DRBD connection

If only one of the two DRBD connections from a node running an RDQM HA queue manager is lost then the queue manager does not move.

Starting from the same initial conditions as in the first scenario, after blocking just one of the DRBD replication links, the status reported by rdqmstatus on vm13 is similar to the following example:

Node:                                   mqhavm13.gamsworthwilliam.com
Queue manager status:                   Running
CPU:                                    0.01%
Memory:                                 133MB
Queue manager file system:              52MB used, 1.0GB allocated [5%]
HA role:                                Primary
HA status:                              Mixed
HA control:                             Enabled
HA current location:                    This node
HA preferred location:                  This node
HA floating IP interface:               None
HA floating IP address:                 None

Node:                                   mqhavm14.gamsworthwilliam.com

HA status:                              Remote unavailable
HA out of sync data:                    0KB

Node:                                   mqhavm15.gamsworthwilliam.com
HA status:                              Normal
Command '/opt/mqm/bin/rdqmstatus' run with sudo.

DRBD Scenario 3: Synchronization stuck

Some versions of DRBD had an issue where a synchronization would appear to be stuck and this prevented an RDQM HA queue manager from failing over to a node when the sync to that node is still in progress.

One way to see this is to use the drbdadm status command. When operating normally a response similar to the following example is output:

[midtownjojo@mqhavm13 ~]$ drbdadm status
haqm1 role:Primary
  disk:UpToDate
  mqhavm14.gamsworthwilliam.com role:Secondary
    peer-disk:UpToDate
  mqhavm15.gamsworthwilliam.com role:Secondary
    peer-disk:UpToDate

haqm2 role:Secondary
  disk:UpToDate
  mqhavm14.gamsworthwilliam.com role:Primary
    peer-disk:UpToDate
  mqhavm15.gamsworthwilliam.com role:Secondary
    peer-disk:UpToDate

haqm3 role:Secondary
  disk:UpToDate
  mqhavm14.gamsworthwilliam.com role:Secondary
    peer-disk:UpToDate
  mqhavm15.gamsworthwilliam.com role:Primary
    peer-disk:UpToDate

If synchronization gets stuck, the response is similar to the following example:

[midtownjojo@mqhavm13 ~]$ drbdadm status
haqm1 role:Primary
  disk:UpToDate
  mqhavm14.gamsworthwilliam.com role:Secondary
    peer-disk:UpToDate
  mqhavm15.gamsworthwilliam.com role:Secondary
    replication:SyncSource peer-disk:Inconsistent done:90.91

haqm2 role:Secondary
  disk:UpToDate
  mqhavm14.gamsworthwilliam.com role:Primary
    peer-disk:UpToDate
  mqhavm15.gamsworthwilliam.com role:Secondary
    peer-disk:UpToDate

haqm3 role:Secondary
  disk:UpToDate
  mqhavm14.gamsworthwilliam.com role:Secondary
    peer-disk:UpToDate
  mqhavm15.gamsworthwilliam.com role:Primary
    peer-disk:UpToDate

In this case the RDQM HA queue manager HAQM1 cannot move to vm15 as the disk on vm15 is Inconsistent.

The done value is the percentage complete. If that value is not increasing you could try disconnecting that replica then connecting it again with the following commands (run as root) on vm13:

drbdadm disconnect haqm1:mqhavm15.gamsworthwilliam.com
drbdadm connect haqm1:mqhavm15.gamsworthwilliam.com

If the replication to both Secondary nodes is stuck, you can do the disconnect and connect commands without specifying a node and that will disconnect both connections:

drbdadm disconnect haqm1
drbdadm connect haqm1

Pacemaker scenarios

RDQM HA configurations use Pacemaker to determine where an RDQM HA queue manager runs. The following scenarios illustrate the following possible problems that involve Pacemaker:

Corosync main process not scheduled
RDQM HA queue manager not running where it should

Pacemaker scenario 1: `Corosync` main process not scheduled

If you see a message in the syslog similar to the following example this indicates that the system is either too busy to schedule CPU time to the main Corosync process or, more commonly, that the system is a Virtual Machine and the Hypervisor has not scheduled any CPU time to the entire VM.

corosync[10800]:  [MAIN  ] Corosync main process was not scheduled for 2787.0891 ms (threshold is 1320.0000 ms). Consider token timeout increase.

Both Pacemaker (and Corosync) and DRBD have timers that are used to detect loss of quorum, so messages like the example indicate that the node did not run for so long that it would have been dropped from the quorum. The Corosync timeout is 1.65 seconds and the threshold of 1.32 seconds is 80% of that, so the message shown in the example is printed when the delay in the scheduling of the main Corosync process hits 80% of the timeout. In the example the process was not scheduled for nearly three seconds. Whatever is causing such a problem must be resolved. One thing that might help in a similar situation is to reduce the requirements of the VM, for example, reducing the number of vCPUs required, as this makes it easier for the Hypervisor to schedule the VM.

Pacemaker scenario 2: An RDQM HA queue manager is not running where it should be

The main tool to help troubleshooting in this scenario is the rdqmstatus command. The following example shows a response for the configuration when everything is working as expected. The commands are run on VM13:

%rdqmstatus -m HAQM1

Node:                                    mqhavm13.gamsworthwilliam.com
Queue manager status:                    Running
CPU:                                     0.00
Memory:                                  123MB
Queue manager file system:               606MB used, 1.0GB allocated [60%]
HA role:                                 Primary
HA status:                               Normal
HA control:                              Enabled
HA current location:                     This node
HA preferred location:                   This node
HA preferred location:                   This node
HA blocked location:                     None
HA floating IP interface:                eth4
HA floating IP address:                  192.0.2.4


%rdqmstatus -m HAQM2

Node:                                    mqhavm13.gamsworthwilliam.com
Queue manager status:                    Running elsewhere
HA role:                                 Secondary
HA status:                               Normal
HA control:                              Enabled
HA current location:                     mqhavm14.gamsworthwilliam.com
HA preferred location:                   mqhavm14.gamsworthwilliam.com
HA blocked location:                     None
HA floating IP interface:                eth4
HA floating IP address:                  192.0.2.6



%rdqmstatus -m HAQM3

Node:                                    mqhavm13.gamsworthwilliam.com
Queue manager status:                    Running elsewhere
HA role:                                 Secondary
HA status:                               Normal
HA control:                              Enabled
HA current location:                     mqhavm15.gamsworthwilliam.com
HA preferred location:                   mqhavm15.gamsworthwilliam.com
HA blocked location:                     None
HA floating IP interface:                eth4
HA floating IP address:                  192.0.2.8

Note the following points:

All three nodes are shown with an HA status of Normal.
Each RDQM HA queue manager is running on the node where it was created, for example, HAQM1 is running on vm13 and so on.

This scenario is constructed by preventing HAQM1 from running on vm14, and then attempting to move HAQM1 to vm14. HAQM1 cannot run on vm14 because the file /var/mqm/mqs.ini on vm14 has an invalid value for the Directory of queue manager HAQM1.

The preferred location for HAQM1 is changed to vm14 by running the following command on vm13:

rdqmadm -m HAQM1 -n mqhavm14.gamsworthwilliam.com -p

This command would normally cause HAQM1 to move to vm14 but in this case checking the status on vm13 returns the following information:

$ rdqmstatus -m HAQM1
Node:                                   mqhavm13.gamsworthwilliam.com
Queue manager status:                   Running
CPU:                                    0.15%
Memory:                                 133MB
Queue manager file system:              52MB used, 1.0GB allocated [5%]
HA role:                                Primary
HA status:                              Normal
HA control:                             Enabled
HA current location:                    This node
HA preferred location:                  mqhavm14.gamsworthwilliam.com
HA blocked location:                    mqhavm14.gamsworthwilliam.com
HA floating IP interface:               None
HA floating IP address:                 None

Node:                                   mqhavm14.gamsworthwilliam.com
HA status:                              Normal

Node:                                   mqhavm15.gamsworthwilliam.com
HA status:                              Normal

HAQM1 is still running on vm13, it has not moved to vm14 as requested and the cause needs investigating. Examining the status and including failed resource actions gives the following response:

$ rdqmstatus -m HAQM1 -a

Node:                                   mqhavm13.gamsworthwilliam.com
Queue manager status:                   Running
CPU:                                    0.15%
Memory:                                 133MB
Queue manager file system:              52MB used, 1.0GB allocated [5%]
HA role:                                Primary
HA status:                              Normal
HA control:                             Enabled
HA current location:                    This node
HA preferred location:                  mqhavm14.gamsworthwilliam.com
HA blocked location:                    mqhavm14.gamsworthwilliam.com
HA floating IP interface:               None
HA floating IP address:                 None

Node:                                   mqhavm14.gamsworthwilliam.com
HA status:                              Normal

Node:                                   mqhavm15.gamsworthwilliam.com
HA status:                              Normal

Failed resource action:                 Start
Resource type:                          Queue manager
Failure node:                           mqhavm14.gamsworthwilliam.com
Failure time:                           2022-01-01 12:00:00
Failure reason:                         Generic error
Blocked location:                       mqhavm14.gamsworthwilliam.com

Take note of the Failed resource action section that has appeared.

The entry shows that when Pacemaker tried to check the state of HAQM1 on vm14 it got an error because HAQM1 is not configured, which is because of the deliberate misconfiguration in /var/mqm/mqs.ini.

Correcting the failure

To correct the failure you must correct the underlying problem (in this case restoring the correct directory value for HAQM1 in /var/mqm/mqs.ini on vm14). Then you must clear the failed action by using the command rdqmclean on the appropriate resource, which in this case is the resource haqm1 as that is the resource mentioned in the failed action. For example:

$ rdqmclean -m HAQM1

Then check the failed resource action status again:

$ rdqmstatus -m HAQM1 -a

The failed action has disappeared and HAQM1 is now running on vm14 as expected. The following example shows the RDQM status:

$ rdqmstatus -m HAQM1
Node:                                   mqhavm13.gamsworthwilliam.com
Queue manager status:                   Running elsewhere
HA role:                                Secondary
HA status:                              Normal
HA control:                             Enabled
HA current location:                    mqhavm14.gamsworthwilliam.com
HA preferred location:                  mqhavm14.gamsworthwilliam.com
HA blocked location:                    None
HA floating IP interface:               None
HA floating IP address:                 None

Node:                                   mqhavm14.gamsworthwilliam.com
HA status:                              Normal

Node:                                   mqhavm15.gamsworthwilliam.com
HA status:                              Normal