Resolving a partitioned problem in a high availability configuration
A partitioned problem occurs when the two appliances in a high availability configuration lose the ability to communicate with each other. If both the primary and secondary connections are lost, the queue manager will run on both appliances at the same time.
If the two appliances in your high availability configuration lose both primary and secondary
interface connections, then replication no longer occurs between the two appliances. After the
connection is restored, the data replication system detects that there have been independent changes
to the same resources on both appliances. This situation is described as a partitioned situation,
because the two appliances have two different views of the current state of the queue manager (it is
sometimes called a 'split-brain' situation). When the first connection is restored (either primary
or secondary), the queue manager is stopped on one appliance but continues to run on the other. The
HA status is shown as Partitioned.
Choosing the 'winner'
To resolve the situation, you must decide which of the two appliances has the data that you want
to retain, and then issue a command that identifies that appliance as the winner. Data on the
other appliance is discarded. The queue manager is then started on the winning appliance, and its
data is replicated to the other appliance.
To help you decide, you can run the status command for the affected queue manager on each appliance. The status command returns an HA status of partitioned together with a report of how much out-of-sync data the appliance has for that queue manager. See Viewing the status of a high availability queue manager.
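For example, the status report for a partitioned queue manager might look similar to the following sketch. The fields shown are the ones referred to in this topic; the queue manager name QM1, the exact layout, and the values are illustrative only:
status QM1
HA status: Partitioned
HA control: Enabled
HA last in sync: [timestamp]
HA out of sync data: [amount]
Comparing the HA last in sync and out-of-sync data values on the two appliances indicates when the data diverged and how much each appliance stands to lose.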
You can also view the actual data associated with the affected queue manager on each appliance by interrogating the state of each queue manager. Stop all your HA queue managers and disable HA control by issuing the following command for each HA queue manager:
endmqm -w qmgr
- Display the HA status to confirm that the HA control field shows disabled, and that the queue manager is ended on both appliances:
status qmgr
See Viewing the status of a high availability queue manager. (If the queue manager is running on the other appliance, end it there too.) You can use the HA last in sync field in the status reports to help determine when the queue manager data diverged and the seriousness of the partitioning.
- On the appliance that is the primary for the queue manager, restart the queue manager outside of HA control:
strmqm -ns qmgr
This command starts the queue manager without starting the listener, to prevent any applications from connecting. (If you need a listener in order to connect to the queue manager from, for example, IBM MQ Explorer, start a listener manually by using a RUNMQSC command. Start the listener on a different port to the normal one so that applications cannot connect.)
- Display the HA status again and verify that the queue manager is running and that HA control still shows disabled:
status qmgr
- Interrogate the state of the queue manager, for example, by browsing messages.
- End the queue manager once more:
endmqm -w qmgr
- Display the HA status again and verify that the queue manager is ended on both
appliances:
status qmgr
- Suspend the HA group on this appliance by entering the following
command:
sethagrp -s
- Display the HA status of this appliance to confirm that it is in the standby
state:
dsphagrp
- Log in to the other appliance and start the queue manager:
strmqm -ns qmgr
(This command starts the queue manager without the listener. If you need a listener, see the advice in step 2.)
- Check the status and confirm that the queue manager is running and that HA control is still
disabled:
status qmgr
- Interrogate the state of the queue manager on that appliance, for example, by browsing messages.
- End the queue manager:
endmqm -w qmgr
- Display the HA status and confirm that HA control is still disabled and that the queue manager is ended on both appliances.
- Resume the suspended appliance:
sethagrp -r
- Display the HA status and confirm that the suspended appliance is now active, that HA control is still disabled, and that the HA status is Partitioned on both appliances:
dsphagrp
status qmgr
Implementing your choice of winning appliance
- Make the queue manager on the winning appliance the primary:
makehaprimary HAQMName
Where HAQMName is the name of the queue manager. The data is then synchronized from the winning appliance to the other appliance.
- Check the HA status to confirm that synchronization has finished and that a status of Normal is reported on both appliances:
status qmgr
- Start the queue manager with HA control:
strmqm qmgr
Whichever appliance you start the queue manager on, it should run on the appliance that is designated as its preferred appliance.
- Display the HA status on both appliances and confirm HA control is enabled and the queue manager
is active on only the preferred appliance:
status qmgr
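As a condensed recap of the whole procedure (where qmgr stands for the name of the affected queue manager, and the winner is the appliance whose data you decided to keep):
- On both appliances, end the queue manager and disable HA control: endmqm -w qmgr
- On the current primary appliance: strmqm -ns qmgr, inspect the data, endmqm -w qmgr, then suspend the appliance with sethagrp -s.
- On the other appliance: strmqm -ns qmgr, inspect the data, endmqm -w qmgr.
- Resume the suspended appliance: sethagrp -r.
- On the winning appliance: makehaprimary qmgr, wait for status qmgr to report Normal on both appliances, then strmqm qmgr.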
Other situations
If the two appliances lose the replication interface, the HA status is reported as Remote appliance(s) unavailable. The running queue manager might accumulate out-of-sync data. The other queue manager remains in standby with no out-of-sync data. When the connection is remade, replication resumes.
If your HA queue manager is configured for disaster recovery, and failed over to the recovery appliance when your HA group went out of service, then you might have to resolve data partitioning between the HA group and the recovery appliance. After you have restored your HA group, and resolved data partitioning between the primary and secondary appliances, you must follow the procedure described in Switching back to the main appliance.