Causes of HA failover

There are many potential causes of a high availability (HA) queue manager failing over from one appliance to the other.

The following list describes the potential causes, in approximate order of likelihood.
Changing preferred location
If you change the preferred location of an HA queue manager, the queue manager moves to the newly specified preferred location.
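For example, to make the appliance that you are logged on to the preferred location for a queue manager, run the sethapreferred command in the IBM MQ CLI on that appliance (QM1 is a hypothetical queue manager name):
  sethapreferred QM1
If QM1 is currently running on the other appliance, it moves to this one. You can remove the preference again by running clearhapreferred QM1.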
Suspending HA appliance
If you suspend an HA appliance, all the HA queue managers currently running on that appliance move to the other appliance.
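For example, to suspend the appliance that you are logged on to, run the following command in the IBM MQ CLI:
  sethagrp -s
All HA queue managers running on this appliance then move to the other appliance in the HA group.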
Resuming HA appliance
When you resume a previously suspended appliance, all HA queue managers that have that appliance as the preferred location move back to it.
Note: There might be a delay while each queue manager synchronizes changed data from the other appliance.
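For example, to resume the appliance that you are logged on to, and then check the state of both appliances in the HA group, run the following commands in the IBM MQ CLI:
  sethagrp -r
  dsphagrp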
Appliance shutdown or reboot
If you shut down or restart an appliance, all HA queue managers currently running on that appliance move to the other appliance.
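For example, you might restart the appliance from the command line. This is a sketch only, assuming the DataPower-style config and shutdown commands of the appliance CLI; check the command reference for your firmware level:
  config
  shutdown reboot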
Appliance startup
When you restart a previously shut down appliance, all HA queue managers that have that appliance as the preferred location move back to it.
Note:
  • There might be a delay while each queue manager synchronizes changed data from the other appliance.
  • Startup after a firmware upgrade is different: HA queue managers might need to migrate before they can run on the upgraded appliance. See Appliance migration.
Appliance migration
After some migrations to a new firmware version, HA queue managers are placed into the Migration Pending state. The following migrations are affected in this way:
  • Migration from version 8.0 or 9.0 CD to version 9.1 and later: all HA queue managers are placed in the Migration Pending state.
  • Migration from version 9.1 to 9.2 and later: only HA queue managers with DR configuration are placed in the Migration Pending state.
When migration is manually triggered for an HA queue manager in the Migration Pending state, the queue manager moves to the migrated appliance. When the other appliance is migrated to the newer firmware, all HA queue managers with that appliance set as their preferred location move back to it.
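You can check whether a queue manager is waiting to be migrated by viewing its status in the IBM MQ CLI. A minimal sketch, assuming a hypothetical queue manager named QM1; the exact wording of the output varies by firmware level:
  status QM1
The HA status field in the output reports whether the queue manager is in the Migration Pending state.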
Appliance power failure
If an appliance suffers a power failure, HA queue managers running on that appliance fail over to the other appliance in the HA pair.
Loss of HA replication connection
Losing the HA replication connection (by default, eth21) does not trigger the failover of HA queue managers, unless the HA heartbeat connection is also lost.
Loss of HA heartbeat connection
From IBM® MQ Appliance 9.2, losing the HA heartbeat connection (that is, both eth13 and eth17) does not trigger the failover of HA queue managers, unless the HA replication connection is also lost.
For earlier versions, losing the HA heartbeat connection (that is, both eth13 and eth17) will trigger the failover of HA queue managers currently running on the affected appliance.
Note:
  • There might be a delay, because the still-active HA replication connection can sometimes block the failover; in that case, the failover is re-attempted at regular intervals.
  • This scenario is likely to cause partitioned data.
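To investigate a suspected loss of the heartbeat connection, you can inspect the state of the network interfaces. A sketch, assuming the DataPower-style show interface command in the appliance CLI:
  show interface
Check the entries for eth13 and eth17 (heartbeat) and for the replication interface (by default, eth21).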
Loss of all HA connections
Losing all HA connections (that is, both heartbeat, eth13 and eth17, and replication, by default eth21) triggers the failover of HA queue managers currently running on the affected appliance.
Note: This scenario is likely to cause partitioned data.
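If partitioned data does occur, you choose which appliance holds the definitive version of the queue manager data by running the makehaprimary command on that appliance (QM1 is a hypothetical queue manager name):
  makehaprimary QM1
The data on the other appliance is then resynchronized by replication from the chosen appliance.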
Loss of DR replication connection
HA queue managers with DR configuration periodically monitor the DR replication connection from both HA appliances. If the DR replication connection for a queue manager is lost on the primary appliance, the queue manager moves to the other appliance, provided that appliance still has an active DR replication connection. This ensures that queue manager data continues to be replicated to the remote DR appliance. (If the recovery site is itself an HA pair, the connection is not monitored and cannot cause HA failover.)
Note: To avoid frequent relocation when the primary appliance suffers a transitory loss of connectivity, failover is not triggered until three consecutive connection checks have failed. The checks take place at 20-second intervals, so a connection loss must persist for roughly 60 seconds (3 × 20 seconds) before failover occurs.
Firmware reload
For versions 9.1.2 and 9.1.0.8 onwards, a firmware reload causes all HA queue managers currently running on the appliance to move to the other appliance. After the firmware has reloaded, all HA queue managers with the affected appliance set as their preferred location move back to it.
Note: There might be a delay while each queue manager synchronizes changed data from the other appliance.
For versions before 9.1.2 and 9.1.0.8, a firmware reload causes failover of all HA queue managers that currently have their secondary instance on the other appliance (that is, those running on the appliance being reloaded). After the firmware has reloaded, all HA queue managers with the affected appliance set as their preferred location fail back to that appliance.
Note: This scenario is likely to cause partitioned data for versions before 9.1.2 and 9.1.0.8.
Abrupt stop of queue manager
If an HA queue manager stops abruptly, attempts are made to restart it on the preferred appliance. If that fails, the queue manager fails over to the other appliance.
Note: Check the appliance and queue manager error logs for the cause of an abrupt stop. Potential causes include, but are not limited to, product defects, disk failure, lack of resources, and modification of the network interface configuration after the HA queue manager was created.
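For example, after an abrupt stop you can check where the queue manager is now running by using the status command (QM1 is a hypothetical queue manager name):
  status QM1
The output shows whether the queue manager restarted on the preferred appliance or failed over to the other appliance.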

You can use the status command to view details of failed resource actions that might have caused HA failover.

If an HA configuration has a failed resource action, you can view details of the action by specifying the -a option. See Failed resource actions for more information. The following information is provided (an illustrative example follows this list):
Failed resource action
The type of resource action that has failed.
Resource type
The type of resource the action was attempted on.
Failure location
Whether the failure occurred on the appliance on which you are running the status command or on the other appliance in the HA group.
Failure time
The time that the failure occurred.
Failure reason
The cause of the failure.
Blocked location
Whether the failure is preventing the queue manager from running on this appliance, or the other appliance in the HA group, or both.
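For example, running the command for a hypothetical queue manager named QM1 might report a failed resource action in the following form. The values shown are illustrative only; the exact layout and wording vary by firmware level:
  status QM1 -a

  Failed resource action:    Start
  Resource type:             Queue manager
  Failure location:          This appliance
  Failure time:              2024-01-15 10:32:07
  Failure reason:            Start operation timed out
  Blocked location:          This appliance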