Causes of HA failover
There are many potential causes of a high availability (HA) queue manager failing over from one appliance to the other.
The following list describes the potential causes, in the order that they are likely to occur.
- Changing preferred location
- If you change the preferred location of an HA queue manager, the queue manager moves to the newly specified preferred location (see the command sketch after this list).
- Suspending HA appliance
- If you suspend an HA appliance, all the HA queue managers currently running on that appliance move to the other appliance (see the suspend/resume sketch after this list).
- Resuming HA appliance
- When you resume a previously suspended appliance, all HA queue managers that have that appliance as the preferred location move back to it.
Note: There might be a delay while each queue manager synchronizes changed data from the other appliance.
- Appliance shutdown or reboot
- If you shut down or restart an appliance, all HA queue managers currently running on it move to the other appliance (see the shutdown sketch after this list).
- Appliance startup
- When you restart a previously shut down appliance, all HA queue managers that have that appliance as the preferred location move back to it.
Note:
- There might be a delay while each queue manager synchronizes changed data from the other appliance.
- Startup after a firmware upgrade is different: HA queue managers might need to migrate before they can run on the upgraded appliance, see Appliance migration.
- Appliance migration
- After some migrations to a new firmware version, HA queue managers are placed into the Migration Pending state. The following migrations are affected in this way:
  - Migration from version 8.0 or 9.0 CD to version 9.1 and later: all HA queue managers are placed in the Migration Pending state.
  - Migration from version 9.1 to 9.2 and later: only HA queue managers with DR configuration are placed in the Migration Pending state.
- Appliance power failure
- If an appliance suffers a power failure, HA queue managers running on that appliance fail over to the other appliance in the HA pair.
- Loss of HA replication connection
- Losing the HA replication connection (by default, eth21) does not trigger the failover of HA queue managers, unless the HA heartbeat connection is also lost.
- Loss of HA heartbeat connection
- From IBM® MQ Appliance 9.2, losing the HA heartbeat connection (that is, both eth13 and eth17) does not trigger the failover of HA queue managers, unless the HA replication connection is also lost.
- Loss of all HA connections
- Losing all HA connections (that is, both heartbeat connections, eth13 and eth17, and the replication connection, by default eth21) triggers the failover of HA queue managers currently running on the affected appliance.
Note: This scenario is likely to cause partitioned data.
- Loss of DR replication connection
- HA queue managers with DR configuration periodically monitor the DR replication connection from both HA appliances. If the DR replication connection for a queue manager is lost on the primary appliance, the queue manager moves to the other appliance, provided that appliance still has an active DR replication connection. This ensures that queue manager data continues to be replicated to the remote DR appliance. (If the recovery site is itself an HA pair, the connection is not monitored and cannot cause HA failover.)
Note: To avoid frequent relocation in the event of transitory loss of connectivity on the primary appliance, failover is not triggered until three consecutive connection checks have failed. The connection checks take place at 20-second intervals.
- Firmware reload
- For versions 9.1.2 and 9.1.0.8 onwards, a firmware reload (see the shutdown sketch after this list) causes all HA queue managers currently running on the appliance to move to the other appliance. After the firmware has reloaded, all HA queue managers with the affected appliance set as their preferred location move back to it.
Note: There might be a delay while each queue manager synchronizes changed data from the other appliance.
- Abrupt stop of queue manager
- If an HA queue manager stops abruptly, attempts are made to restart it on the preferred appliance. If that fails, the queue manager fails over to the other appliance.
Note: Check the appliance and queue manager error logs for the cause of an abrupt stop. Potential causes include, but are not limited to, product defects, disk failure, lack of resources, and modification of the network interface configuration after the HA queue manager was created.
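As referenced in the list above, a preferred location change is made from the appliance command line. A minimal sketch, assuming an mqcli session and a hypothetical HA queue manager named QM1:

```
mqa# mqcli
mqa(mqcli)# sethapreferred QM1      # make this appliance the preferred location for QM1
mqa(mqcli)# clearhapreferred QM1    # clear the preference so QM1 stays wherever it is running
```

If QM1 is not currently running on the appliance where you run sethapreferred, making that appliance the preferred location causes QM1 to move to it.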
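Suspending and resuming an appliance in the HA group is also an mqcli operation; a sketch using the sethagrp command:

```
mqa(mqcli)# sethagrp -s    # suspend this appliance: its HA queue managers move to the other appliance
mqa(mqcli)# sethagrp -r    # resume this appliance: queue managers that prefer it move back
```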
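Shutdown, reboot, and firmware reload are appliance CLI operations rather than mqcli commands. A sketch, assuming the DataPower-style shutdown command in global configuration mode (run one of the three alternatives, not all):

```
mqa# configure terminal
mqa(config)# shutdown reboot    # restart: running HA queue managers move to the other appliance
mqa(config)# shutdown reload    # reload the firmware (from 9.1.2/9.1.0.8, also moves HA queue managers)
mqa(config)# shutdown halt      # power off the appliance
```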
You can use the status command to view details of failed resource actions that might have caused HA failover. If an HA configuration has a failed resource action, you can view details of the action by specifying the -a option. See Failed resource actions for more information. The following information is provided (an example invocation follows the list):
- Failed resource action
- The type of resource action that has failed.
- Resource type
- The type of resource the action was attempted on.
- Failure location
- Whether the failure occurred on the appliance on which you are running the status command or on the other appliance in the HA group.
- Failure time
- The time that the failure occurred.
- Failure reason
- The cause of the failure.
- Blocked location
- Whether the failure is preventing the queue manager from running on this appliance, or the other appliance in the HA group, or both.
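A minimal sketch of checking for failed resource actions from mqcli, again using the hypothetical queue manager QM1:

```
mqa(mqcli)# status QM1        # overall queue manager and HA status
mqa(mqcli)# status QM1 -a     # also report details of any failed resource actions
```

The -a output reports the fields described above only when the HA configuration has a failed resource action.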