Resolving a partitioned problem in a high availability configuration

A partitioned problem occurs when the two appliances in a high availability configuration lose the ability to communicate with each other. If both the primary and secondary interface connections are lost, the queue manager runs on both appliances at the same time.

If the two appliances in your high availability configuration lose both primary and secondary interface connections, replication no longer occurs between the two appliances. After the connection is restored, the data replication system detects that there have been independent changes to the same resources on both appliances. This is described as a partitioned situation, because the two appliances have two different views of the current state of the queue manager (it is sometimes called a 'split-brain' situation). When the first connection is restored (either primary or secondary), the queue manager is stopped on one appliance but continues to run on the other. The HA status is shown as Partitioned.

Choosing the 'winner'

To resolve the situation, you must decide which of the two appliances has the data that you want to retain, and then issue a command that identifies that appliance as the winner. Data on the other appliance is discarded. The queue manager is then started on one appliance, and its data is replicated to the other appliance.

To help you decide, you can run the status command for the affected queue manager on each appliance. The status command returns an HA status of partitioned together with a report of how much out-of-sync data the appliance has for that queue manager. See Viewing the status of a high availability queue manager.
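
For example, the relevant fields in the status output on one of the appliances might look similar to the following. This is an abbreviated, illustrative sketch only; the exact fields, layout, and values depend on your appliance firmware level:

    status qmgr
    HA status:                Partitioned
    HA control:               Enabled
    HA out of sync data:      12MB
    HA last in sync:          2024-02-18 10.02.17

Comparing these values on the two appliances helps you judge when the data diverged and which set of data you want to keep.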

You can also view the actual data associated with the affected queue manager on each appliance by interrogating the state of each queue manager. Stop all your HA queue managers and disable HA control by issuing the following command for each HA queue manager, where qmgr is the name of the queue manager:

endmqm -w qmgr
Then follow these steps to run each affected queue manager on each appliance, outside of HA control, so that you can interrogate the state of each version of the queue manager:
  1. Display the HA status to confirm that the HA control field shows disabled, and that the queue manager is ended on both appliances:
    status qmgr
    See Viewing the status of a high availability queue manager. (If the queue manager is running on the other appliance, end it there too.) You can use the HA last in sync field in the status reports to help determine when the queue manager data diverged and the seriousness of the partitioning.
  2. On the appliance that is the primary for the queue manager, restart the queue manager outside of HA control:
    strmqm -ns qmgr
    This command starts the queue manager without starting the listener, to prevent any applications from connecting. (If you need a listener in order to connect to the queue manager from, for example, IBM MQ Explorer, start a listener manually by using a RUNMQSC command. Start the listener on a different port from the normal one so that applications cannot connect; an example runmqsc sequence is shown after this procedure.)
  3. Display the HA status again and verify that the queue manager is running and that HA control still shows disabled:
    status qmgr
  4. Interrogate the state of the queue manager, for example, by browsing messages (one approach is sketched after this procedure).
  5. End the queue manager once more:
    endmqm -w qmgr
  6. Display the HA status again and verify that the queue manager is ended on both appliances:
    status qmgr
  7. Suspend the HA group on this appliance by entering the following command:
    sethagrp -s
  8. Display the HA status of this appliance to confirm that it is in the standby state:
    dsphagrp
  9. Log in to the other appliance and start the queue manager:
    strmqm -ns qmgr
    (This command starts the queue manager without the listener. If you need a listener, see the advice in step 2.)
  10. Check the status and confirm that the queue manager is running and that HA control is still disabled:
    status qmgr
  11. Interrogate the state of the queue manager on that appliance, for example, by browsing messages.
  12. End the queue manager:
    endmqm -w qmgr
  13. Display the HA status and confirm that HA control is still disabled and that the queue manager is ended on both appliances.
  14. Resume the suspended appliance:
    sethagrp -r
  15. Display the HA status and confirm that the previously suspended appliance is now active, that HA control is still disabled, and that the HA status is Partitioned on both appliances:
    dsphagrp
    status qmgr
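
If you need to start a listener manually while a queue manager is running outside HA control (steps 2 and 9), you can do so from a runmqsc session. In this sketch, the listener name PARTLSR and port 2414 are hypothetical placeholders; choose a port that is different from the one your applications normally use:

    runmqsc qmgr
    DEFINE LISTENER(PARTLSR) TRPTYPE(TCP) PORT(2414)
    START LISTENER(PARTLSR)
    END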
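
One way to interrogate the state of the queue manager in steps 4 and 11, before or instead of browsing messages, is to compare queue depths on the two appliances from a runmqsc session (the generic queue name used here is just an example):

    runmqsc qmgr
    DISPLAY QLOCAL(*) CURDEPTH
    END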

Implementing your choice of winning appliance

Note: It is a good idea to take backups of both sets of data before you implement your choice of 'winner'; see Backing up a queue manager.
You identify the winner by running the following commands on the chosen appliance:
  1. Make the queue manager on the winning appliance the primary:
    makehaprimary HAQMName
    Where HAQMName is the name of the queue manager. The queue manager data on the winning appliance is then synchronized to the other appliance.
  2. Check the HA status to confirm that synchronization has finished and a status of Normal is reported on both appliances:
    status qmgr
  3. Start the queue manager with HA control:
    strmqm qmgr
    Whichever appliance you start the queue manager on, it should run on the appliance that is designated as its preferred appliance.
  4. Display the HA status on both appliances and confirm HA control is enabled and the queue manager is active on only the preferred appliance:
    status qmgr
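
When resynchronization is complete and the queue manager has been restarted under HA control, the status on the preferred appliance should include fields along the following lines (again, an illustrative sketch only; the exact output varies):

    status qmgr
    HA status:                Normal
    HA control:               Enabled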

Other situations

If the two appliances lose the connection on the replication interface, the HA status is reported as Remote appliance(s) unavailable. The running queue manager might accumulate out-of-sync data, while the other queue manager remains in standby with no out-of-sync data. When the connection is remade, replication resumes.

If your HA queue manager is configured for disaster recovery, and failed over to the recovery appliance when your HA group went out of service, then you might have to resolve data partitioning between the HA group and the recovery appliance. After you have restored your HA group, and resolved data partitioning between the primary and secondary appliances, you must follow the procedure described in Switching back to the main appliance.