![[IBM MQ Advanced]](ngadv.gif)
Example RDQM HA configurations and errors
An example RDQM HA configuration, complete with example errors and information on how to resolve them.
- mqhavm13.gamsworthwilliam.com (referred to as vm13).
- mqhavm14.gamsworthwilliam.com (referred to as vm14).
- mqhavm15.gamsworthwilliam.com (referred to as vm15).
- HAQM1 (created on vm13)
- HAQM2 (created on vm14)
- HAQM3 (created on vm15)
Initial conditions
- vm13
-
[midtownjojo@mqhavm13 ~]$ rdqmstatus -m HAQM1 Node: mqhavm13.gamsworthwilliam.com Queue manager status: Running CPU: 0.00% Memory: 135MB Queue manager file system: 51MB used, 1.0GB allocated [5%] HA role: Primary HA status: Normal HA control: Enabled HA current location: This node HA preferred location: This node HA floating IP interface: None HA floating IP address: None Node: mqhavm14.gamsworthwilliam.com HA status: Normal Node: mqhavm15.gamsworthwilliam.com HA status: Normal Command '/opt/mqm/bin/rdqmstatus' run with sudo. [midtownjojo@mqhavm13 ~]$ rdqmstatus -m HAQM2 Node: mqhavm13.gamsworthwilliam.com Queue manager status: Running elsewhere HA role: Secondary HA status: Normal HA control: Enabled HA current location: mqhavm14.gamsworthwilliam.com HA preferred location: mqhavm14.gamsworthwilliam.com HA floating IP interface: None HA floating IP address: None Node: mqhavm14.gamsworthwilliam.com HA status: Normal Node: mqhavm15.gamsworthwilliam.com HA status: Normal Command '/opt/mqm/bin/rdqmstatus' run with sudo. [midtownjojo@mqhavm13 ~]$ rdqmstatus -m HAQM3 Node: mqhavm13.gamsworthwilliam.com Queue manager status: Running elsewhere HA role: Secondary HA status: Normal HA control: Enabled HA current location: mqhavm15.gamsworthwilliam.com HA preferred location: mqhavm15.gamsworthwilliam.com HA floating IP interface: None HA floating IP address: None Node: mqhavm14.gamsworthwilliam.com HA status: Normal Node: mqhavm15.gamsworthwilliam.com HA status: Normal Command '/opt/mqm/bin/rdqmstatus' run with sudo. - vm14
-
[midtownjojo@mqhavm14 ~]$ rdqmstatus -m HAQM1 Node: mqhavm14.gamsworthwilliam.com Queue manager status: Running elsewhere HA role: Secondary HA status: Normal HA control: Enabled HA current location: mqhavm13.gamsworthwilliam.com HA preferred location: mqhavm13.gamsworthwilliam.com HA floating IP interface: None HA floating IP address: None Node: mqhavm13.gamsworthwilliam.com HA status: Normal Node: mqhavm15.gamsworthwilliam.com HA status: Normal Command '/opt/mqm/bin/rdqmstatus' run with sudo. [midtownjojo@mqhavm14 ~]$ rdqmstatus -m HAQM2 Node: mqhavm14.gamsworthwilliam.com Queue manager status: Running CPU: 0.00% Memory: 135MB Queue manager file system: 51MB used, 1.0GB allocated [5%] HA role: Primary HA status: Normal HA control: Enabled HA current location: This node HA preferred location: This node HA floating IP interface: None HA floating IP address: None Node: mqhavm13.gamsworthwilliam.com HA status: Normal Node: mqhavm15.gamsworthwilliam.com HA status: Normal Command '/opt/mqm/bin/rdqmstatus' run with sudo. [midtownjojo@mqhavm14 ~]$ rdqmstatus -m HAQM3 Node: mqhavm14.gamsworthwilliam.com Queue manager status: Running elsewhere HA role: Secondary HA status: Normal HA control: Enabled HA current location: mqhavm15.gamsworthwilliam.com HA preferred location: mqhavm15.gamsworthwilliam.com HA floating IP interface: None HA floating IP address: None Node: mqhavm13.gamsworthwilliam.com HA status: Normal Node: mqhavm15.gamsworthwilliam.com HA status: Normal Command '/opt/mqm/bin/rdqmstatus' run with sudo. - vm15
-
[midtownjojo@mqhavm15 ~]$ rdqmstatus -m HAQM1 Node: mqhavm15.gamsworthwilliam.com Queue manager status: Running elsewhere HA role: Secondary HA status: Normal HA control: Enabled HA current location: mqhavm13.gamsworthwilliam.com HA preferred location: mqhavm13.gamsworthwilliam.com HA floating IP interface: None HA floating IP address: None Node: mqhavm13.gamsworthwilliam.com HA status: Normal Node: mqhavm14.gamsworthwilliam.com HA status: Normal Command '/opt/mqm/bin/rdqmstatus' run with sudo. [midtownjojo@mqhavm15 ~]$ rdqmstatus -m HAQM2 Node: mqhavm15.gamsworthwilliam.com Queue manager status: Running elsewhere HA role: Secondary HA status: Normal HA control: Enabled HA current location: mqhavm14.gamsworthwilliam.com HA preferred location: mqhavm14.gamsworthwilliam.com HA floating IP interface: None HA floating IP address: None Node: mqhavm13.gamsworthwilliam.com HA status: Normal Node: mqhavm14.gamsworthwilliam.com HA status: Normal Command '/opt/mqm/bin/rdqmstatus' run with sudo. [midtownjojo@mqhavm15 ~]$ rdqmstatus -m HAQM3 Node: mqhavm15.gamsworthwilliam.com Queue manager status: Running CPU: 0.02% Memory: 135MB Queue manager file system: 51MB used, 1.0GB allocated [5%] HA role: Primary HA status: Normal HA control: Enabled HA current location: This node HA preferred location: This node HA floating IP interface: None HA floating IP address: None Node: mqhavm13.gamsworthwilliam.com HA status: Normal Node: mqhavm14.gamsworthwilliam.com HA status: Normal Command '/opt/mqm/bin/rdqmstatus' run with sudo.
DRBD scenarios
- Loss of DRBD quorum
- Loss of a single DRBD connection
- Synchronization stuck
DRBD Scenario 1: Loss of DRBD quorum
If the node running an RDQM HA queue manager loses the DRBD quorum for the DRBD resource corresponding to the queue manager, DRBD immediately starts returning errors from I/O operations, which will cause the queue manager to start producing FDCs and eventually stop.
If the remaining two nodes have a DRBD quorum for the DRBD resource then Pacemaker chooses one of the two nodes to start the queue manager. Because there were no updates on the original node from the time where the quorum was lost, it is safe to start the queue manager somewhere else.
- By using the rdqmstatus command.
- By monitoring the syslog of the node where the RDQM HA queue manager is initially running.
rdqmstatus
If you use the rdqmstatus command, if the node vm13 loses DRBD quorum for the DRBD resource for HAQM1, you might see status similar to the following example:
[midtownjojo@mqhavm13 ~]$ rdqmstatus -m HAQM1
Node: mqhavm13.gamsworthwilliam.com
Queue manager status: Running elsewhere
HA role: Secondary
HA status: Remote unavailable
HA control: Enabled
HA current location: mqhavm14.gamsworthwilliam.com
HA preferred location: This node
HA floating IP interface: None
HA floating IP address: None
Node: mqhavm14.gamsworthwilliam.com
HA status: Remote unavailable
HA out of sync data: 0KB
Node: mqhavm15.gamsworthwilliam.com
HA status: Remote unavailable
HA out of sync data: 0KB
Command '/opt/mqm/bin/rdqmstatus' run with sudo.
Notice that the HA status has changed to Remote unavailable,
which indicates that both DRBD connections to the other nodes have been lost.
In this case the other two nodes have DRBD quorum for the DRBD resource so the RDQM is running
somewhere else, on mqhavm14.gamsworthwilliam.com as shown as the value of HA current
location.
monitoring syslog
Jul 30 09:38:36 mqhavm13 kernel: drbd haqm1/0 drbd100: quorum( yes -> no )Jul 30 10:27:32 mqhavm13 kernel: drbd haqm1/0 drbd100: quorum( no -> yes )DRBD Scenario 2: Loss of a single DRBD connection
If only one of the two DRBD connections from a node running an RDQM HA queue manager is lost then the queue manager does not move.
Node: mqhavm13.gamsworthwilliam.com
Queue manager status: Running
CPU: 0.01%
Memory: 133MB
Queue manager file system: 52MB used, 1.0GB allocated [5%]
HA role: Primary
HA status: Mixed
HA control: Enabled
HA current location: This node
HA preferred location: This node
HA floating IP interface: None
HA floating IP address: None
Node: mqhavm14.gamsworthwilliam.com
HA status: Remote unavailable
HA out of sync data: 0KB
Node: mqhavm15.gamsworthwilliam.com
HA status: Normal
Command '/opt/mqm/bin/rdqmstatus' run with sudo.
DRBD Scenario 3: Synchronization stuck
Some versions of DRBD had an issue where a synchronization would appear to be stuck and this prevented an RDQM HA queue manager from failing over to a node when the sync to that node is still in progress.
drbdadm status command. When operating
normally a response similar to the following example is
output:[midtownjojo@mqhavm13 ~]$ drbdadm status
haqm1 role:Primary
disk:UpToDate
mqhavm14.gamsworthwilliam.com role:Secondary
peer-disk:UpToDate
mqhavm15.gamsworthwilliam.com role:Secondary
peer-disk:UpToDate
haqm2 role:Secondary
disk:UpToDate
mqhavm14.gamsworthwilliam.com role:Primary
peer-disk:UpToDate
mqhavm15.gamsworthwilliam.com role:Secondary
peer-disk:UpToDate
haqm3 role:Secondary
disk:UpToDate
mqhavm14.gamsworthwilliam.com role:Secondary
peer-disk:UpToDate
mqhavm15.gamsworthwilliam.com role:Primary
peer-disk:UpToDate[midtownjojo@mqhavm13 ~]$ drbdadm status
haqm1 role:Primary
disk:UpToDate
mqhavm14.gamsworthwilliam.com role:Secondary
peer-disk:UpToDate
mqhavm15.gamsworthwilliam.com role:Secondary
replication:SyncSource peer-disk:Inconsistent done:90.91
haqm2 role:Secondary
disk:UpToDate
mqhavm14.gamsworthwilliam.com role:Primary
peer-disk:UpToDate
mqhavm15.gamsworthwilliam.com role:Secondary
peer-disk:UpToDate
haqm3 role:Secondary
disk:UpToDate
mqhavm14.gamsworthwilliam.com role:Secondary
peer-disk:UpToDate
mqhavm15.gamsworthwilliam.com role:Primary
peer-disk:UpToDateIn this case the RDQM HA queue manager HAQM1 cannot move to vm15 as the disk on vm15 is
Inconsistent.
done value is the percentage complete. If that value is not increasing you
could try disconnecting that replica then connecting it again with the following commands (run as
root) on vm13:drbdadm disconnect haqm1:mqhavm15.gamsworthwilliam.com
drbdadm connect haqm1:mqhavm15.gamsworthwilliam.comdrbdadm disconnect haqm1
drbdadm connect haqm1Pacemaker scenarios
Corosyncmain process not scheduled- RDQM HA queue manager not running where it should
Pacemaker scenario 1: Corosync main process not scheduled
Corosync process or,
more commonly, that the system is a Virtual Machine and the Hypervisor has not scheduled any CPU
time to the entire
VM.corosync[10800]: [MAIN ] Corosync main process was not scheduled for 2787.0891 ms (threshold is 1320.0000 ms). Consider token timeout increase.Both Pacemaker (and Corosync) and DRBD have timers that are used to detect loss
of quorum, so messages like the example indicate that the node did not run for so long that it would
have been dropped from the quorum. The Corosync timeout is 1.65 seconds and the
threshold of 1.32 seconds is 80% of that, so the message shown in the example is printed when the
delay in the scheduling of the main Corosync process hits 80% of the timeout. In
the example the process was not scheduled for nearly three seconds. Whatever is causing such a
problem must be resolved. One thing that might help in a similar situation is to reduce the
requirements of the VM, for example, reducing the number of vCPUs required, as this makes it easier
for the Hypervisor to schedule the VM.
![[MQ 9.3.0 Jun 2022]](ng930.gif)
Pacemaker scenario 2: An RDQM HA queue manager is not running where it should be
%rdqmstatus -m HAQM1
Node: mqhavm13.gamsworthwilliam.com
Queue manager status: Running
CPU: 0.00
Memory: 123MB
Queue manager file system: 606MB used, 1.0GB allocated [60%]
HA role: Primary
HA status: Normal
HA control: Enabled
HA current location: This node
HA preferred location: This node
HA preferred location: This node
HA blocked location: None
HA floating IP interface: eth4
HA floating IP address: 192.0.2.4
%rdqmstatus -m HAQM2
Node: mqhavm13.gamsworthwilliam.com
Queue manager status: Running elsewhere
HA role: Secondary
HA status: Normal
HA control: Enabled
HA current location: mqhavm14.gamsworthwilliam.com
HA preferred location: mqhavm14.gamsworthwilliam.com
HA blocked location: None
HA floating IP interface: eth4
HA floating IP address: 192.0.2.6
%rdqmstatus -m HAQM3
Node: mqhavm13.gamsworthwilliam.com
Queue manager status: Running elsewhere
HA role: Secondary
HA status: Normal
HA control: Enabled
HA current location: mqhavm15.gamsworthwilliam.com
HA preferred location: mqhavm15.gamsworthwilliam.com
HA blocked location: None
HA floating IP interface: eth4
HA floating IP address: 192.0.2.8- All three nodes are shown with an HA status of
Normal. - Each RDQM HA queue manager is running on the node where it was created, for example, HAQM1 is running on vm13 and so on.
This scenario is constructed by preventing HAQM1 from running on vm14, and then attempting to move HAQM1 to vm14. HAQM1 cannot run on vm14 because the file /var/mqm/mqs.ini on vm14 has an invalid value for the Directory of queue manager HAQM1.
rdqmadm -m HAQM1 -n mqhavm14.gamsworthwilliam.com -p$ rdqmstatus -m HAQM1
Node: mqhavm13.gamsworthwilliam.com
Queue manager status: Running
CPU: 0.15%
Memory: 133MB
Queue manager file system: 52MB used, 1.0GB allocated [5%]
HA role: Primary
HA status: Normal
HA control: Enabled
HA current location: This node
HA preferred location: mqhavm14.gamsworthwilliam.com
HA blocked location: mqhavm14.gamsworthwilliam.com
HA floating IP interface: None
HA floating IP address: None
Node: mqhavm14.gamsworthwilliam.com
HA status: Normal
Node: mqhavm15.gamsworthwilliam.com
HA status: Normal$ rdqmstatus -m HAQM1 -a
Node: mqhavm13.gamsworthwilliam.com
Queue manager status: Running
CPU: 0.15%
Memory: 133MB
Queue manager file system: 52MB used, 1.0GB allocated [5%]
HA role: Primary
HA status: Normal
HA control: Enabled
HA current location: This node
HA preferred location: mqhavm14.gamsworthwilliam.com
HA blocked location: mqhavm14.gamsworthwilliam.com
HA floating IP interface: None
HA floating IP address: None
Node: mqhavm14.gamsworthwilliam.com
HA status: Normal
Node: mqhavm15.gamsworthwilliam.com
HA status: Normal
Failed resource action: Start
Resource type: Queue manager
Failure node: mqhavm14.gamsworthwilliam.com
Failure time: 2022-01-01 12:00:00
Failure reason: Generic error
Blocked location: mqhavm14.gamsworthwilliam.comTake note of the Failed resource action section that has appeared.
The entry shows that when Pacemaker tried to check the state of HAQM1 on vm14 it got an error because HAQM1 is not configured, which is because of the deliberate misconfiguration in /var/mqm/mqs.ini.
![[MQ 9.3.0 Jun 2022]](ng930.gif)
Correcting the failure
$ rdqmclean -m HAQM1$ rdqmstatus -m HAQM1 -a$ rdqmstatus -m HAQM1
Node: mqhavm13.gamsworthwilliam.com
Queue manager status: Running elsewhere
HA role: Secondary
HA status: Normal
HA control: Enabled
HA current location: mqhavm14.gamsworthwilliam.com
HA preferred location: mqhavm14.gamsworthwilliam.com
HA blocked location: None
HA floating IP interface: None
HA floating IP address: None
Node: mqhavm14.gamsworthwilliam.com
HA status: Normal
Node: mqhavm15.gamsworthwilliam.com
HA status: Normal