[MQ 9.2.2 Mar 2021]

Failed resource actions

Failed resource actions arise when the Pacemaker component of an RDQM high availability configuration encounters some problem with a resource on one of the nodes in an HA group.

The RDQM HA solution uses Pacemaker for monitoring and managing resources (see RDQM high availability). If Pacemaker encounters an error while performing an operation on a resource on a node, it records the error as a failed resource action. Some failed resource actions prevent the resource from running and must be cleared before Pacemaker can restart the resource.

You can use the rdqmstatus -m qmname command to see whether there are any failed resource actions that are stopping a queue manager from starting on one or more nodes.

You can then use the rdqmstatus -m qmname -a command to view the details of the failed resource actions that are associated with a queue manager. After reviewing them, use the rdqmclean command to clear these failed resource actions and so free up any blocked resources. You must also resolve the problems that caused the failed resource actions in the first place.
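The two status checks described above can be scripted. The following Python sketch only builds and prints the two rdqmstatus command lines named in this topic; it does not run them. The helper name status_commands and the queue manager name QM1 are illustrative assumptions, not part of the product.

```python
import shlex


def status_commands(qmname):
    """Return the two rdqmstatus invocations described in this topic.

    status_commands is a hypothetical helper; only the command lines
    themselves come from the documentation.
    """
    return [
        # Summary status: shows whether failed resource actions exist.
        ["rdqmstatus", "-m", qmname],
        # Detailed view of the failed resource actions for the queue manager.
        ["rdqmstatus", "-m", qmname, "-a"],
    ]


# QM1 is a placeholder queue manager name.
for argv in status_commands("QM1"):
    print(shlex.join(argv))
```

To run the commands instead of printing them, each argument list could be passed to subprocess.run on a node in the HA group, under a user that is authorized to run the RDQM commands.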

The following resources are controlled by Pacemaker in an RDQM HA configuration, and can be the subjects of failed resource actions:
  • Queue manager
  • Floating IP
  • RDQM control
  • Filesystem
  • DR replication (DRBD)
  • HA replication (DRBD)

Each type of resource can be subject to the following types of failure:

Soft
A soft failure is transient. Pacemaker continues to try to recover the resource until it times out or is otherwise stopped.
Hard
A hard failure requires administrative intervention. A hard failure blocks the resource from running on a particular node.
Fatal
A fatal failure requires administrative intervention. A fatal failure blocks the resource from running on any node.
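The difference between the three failure types can be pictured as the set of nodes on which the resource is blocked. The following Python sketch is a simplified model of the definitions above, not a description of how Pacemaker is implemented; the names Failure and blocked_nodes, and the node names, are hypothetical.

```python
from enum import Enum


class Failure(Enum):
    SOFT = "soft"    # transient; Pacemaker keeps trying to recover
    HARD = "hard"    # blocks the resource on one particular node
    FATAL = "fatal"  # blocks the resource on every node


def blocked_nodes(failure, failed_node, all_nodes):
    """Return the nodes on which the resource cannot run (simplified model)."""
    if failure is Failure.SOFT:
        return set()           # nothing is blocked while recovery is retried
    if failure is Failure.HARD:
        return {failed_node}   # only the node where the failure occurred
    return set(all_nodes)      # fatal: no node can run the resource


# Placeholder node names for a three-node HA group.
NODES = ["node1", "node2", "node3"]
for kind in Failure:
    print(kind.value, sorted(blocked_nodes(kind, "node1", NODES)))
```

In this model, only the hard and fatal cases return a non-empty set, which matches the statement that those failures require administrative intervention.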

See Viewing RDQM and HA group status for examples of status output that includes failed resource actions.

You can use the rdqmclean command to clear all failed resource actions associated with a specified queue manager, or all failed resource actions in the RDQM HA configuration.
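The two scopes of rdqmclean can be sketched as a small helper that picks the command line. This topic does not state the command's options, so the -m qmname and -a flags used here are assumptions that should be checked against the rdqmclean command reference. The helper name rdqmclean_argv and the queue manager name QM1 are also hypothetical.

```python
import shlex


def rdqmclean_argv(qmname=None):
    """Build an rdqmclean command line (flags are assumptions, see above)."""
    if qmname is None:
        # Assumed flag: clear all failed resource actions in the HA configuration.
        return ["rdqmclean", "-a"]
    # Assumed flag: clear failed resource actions for one queue manager.
    return ["rdqmclean", "-m", qmname]


# Print the commands rather than running them.
print(shlex.join(rdqmclean_argv("QM1")))
print(shlex.join(rdqmclean_argv()))
```

Limiting the clean-up to a single queue manager where possible keeps the failed resource actions of other queue managers visible until their causes have been investigated.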

Note: Some failed resource actions do not result in the queue manager being blocked on a node. For example, after an unexpected queue manager end, Pacemaker attempts to restart the queue manager on the node on which it was found to be not running. If the start is successful, the queue manager is not blocked from running on the node. The only way you would become aware of the failed resource action in this case is by running rdqmstatus -m qmname -a.