Advanced node failure detection

An advanced node failure detection function is provided that reduces the number of failure scenarios that result in cluster partitions.

In some failure situations, heartbeat monitoring cannot determine exactly what failed. The failure might be a communication failure between cluster nodes, or an entire cluster node might have failed. Take, for example, the case where a cluster node fails because a critical hardware component, such as a processor, fails. The whole machine can go down without giving cluster resource services on that node an opportunity to notify other cluster nodes of the failure. The other cluster nodes see only a failure in the heartbeat monitoring. They cannot tell whether it was caused by a node failure or by a failure in some part of the communication path (a line, a router, or an adapter).

When this type of failure occurs, cluster resource services assumes that the node that is not responding could still be operational and partitions the cluster.

In 7.1, an advanced node failure detection function is provided that can reduce the number of failure scenarios that result in cluster partitions. It uses an additional monitoring technique to give cluster resource services another source of information for determining when a cluster node has failed.

This advanced function uses either a Hardware Management Console (HMC), for IBM® systems that can be managed by an HMC, or a Virtual I/O Server (VIOS) partition on a server managed by an Integrated Virtualization Manager (IVM). In either case, the HMC or IVM can monitor the state of logical partitions or of the entire system and notify cluster resource services when the state of a partition or system changes. Cluster resource services can use this state change information to determine that a cluster node has failed, rather than partitioning the cluster based only on what the heartbeat monitor reports.

Consider an example in which an HMC manages two different IBM systems. The HMC can, for instance, power up each system or configure logical partitions on each system. In addition, the HMC monitors the state of each system and of the logical partitions on each system. Assume that each system is a cluster node and that cluster resource services is monitoring a heartbeat between the two cluster nodes.

With the advanced node failure detection function, cluster resource services can be configured to make use of the HMC. For example, Node A can be configured to have a cluster monitor that uses the HMC. Whenever the HMC detects that Node B has failed (either the system or the logical partition for Node B), it notifies cluster resource services on Node A of the failure. Cluster resource services on Node A then marks Node B as failed and performs failover processing rather than partitioning the cluster.

Likewise, Node B can also be configured to have a cluster monitor. In this example, then, a failure of either Node A or Node B would result in a notification from the HMC to the other node.
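The difference between the two behaviors in this example can be summarized in a small conceptual sketch. It is written in Python for illustration only and is not how cluster resource services is implemented. The names `Outcome` and `decide` are hypothetical, and the rule simply restates the text above: a lost heartbeat with no other information leads to a partition, while a failure notification from the HMC (or VIOS) leads to failover.

```python
from enum import Enum


class Outcome(Enum):
    """Possible reactions of a surviving node (hypothetical names)."""
    NO_ACTION = "no action"
    PARTITION = "partition cluster"
    FAILOVER = "mark node failed and perform failover"


def decide(heartbeat_ok: bool, monitor_reports_failure: bool) -> Outcome:
    """Illustrative decision rule for how Node A reacts to the state of Node B.

    heartbeat_ok: True while Node A still receives heartbeats from Node B.
    monitor_reports_failure: True when a configured cluster monitor has been
        notified by the HMC (or VIOS) that Node B's system or partition failed.
    """
    if monitor_reports_failure:
        # Node A has positive knowledge that Node B is down.
        return Outcome.FAILOVER
    if heartbeat_ok:
        return Outcome.NO_ACTION
    # Heartbeat lost and no notification: Node B might still be operational,
    # so the cluster is partitioned instead of failing over.
    return Outcome.PARTITION


if __name__ == "__main__":
    # Heartbeat-only detection: a lost heartbeat is ambiguous.
    print(decide(heartbeat_ok=False, monitor_reports_failure=False).value)
    # With a cluster monitor: the HMC confirms that the node failed.
    print(decide(heartbeat_ok=False, monitor_reports_failure=True).value)
```

In this sketch the notification takes precedence over the heartbeat state. That ordering is an assumption made to keep the example short.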

Refer to Managing failover outage events for more information about failure scenarios that result in a cluster partition when advanced node failure detection is not used and that result in node failure when the advanced detection is used.

Notification of failures by an HMC with a Common Information Model (CIM) server or by a VIOS depends on a CIM server running on the cluster node that is to receive the notification. If the CIM server is not running, advanced node failure detection is not aware of node failures. The CIM server must be started and left running whenever the cluster node is active. Use the STRTCPSVR *CIMOM CL command to start the CIM server.

If you are using an HMC with version V8R8.5.0 or later, you can configure a cluster monitor that uses an HMC REST server. This type of cluster monitor does not require a CIM server.