PowerHA SystemMirror use of Cluster Aware AIX

PowerHA® SystemMirror® is built on top of the core clustering capabilities that are supported in the AIX® operating system. PowerHA SystemMirror is supported for all editions of AIX that support Cluster Aware AIX (CAA) capabilities.

CAA and PowerHA SystemMirror use universal identifiers (UID and UUID) to track disks and nodes. Dynamically changing a UID or UUID is not supported. The UID and UUID normally do not change. However, in some known scenarios, such as reinstalling the operating system, the UID and UUID can change. If a UID or UUID changes, you must remove and re-create the CAA cluster to ensure that all UIDs and UUIDs are updated.

In AIX Version 7.2, or later, or in IBM® AIX 7.1 with Technology Level 4, or later, CAA detects and handles network failures after 20 seconds (default value). To change the default value from 20 seconds, run the clmgr modify cluster NETWORK_FAILURE_DETECTION_TIME=<xxx> command, where xxx is the number of seconds, in the range 5 - 590.

The following information is about the key components of Cluster Aware AIX that are used as the foundation to build a PowerHA SystemMirror solution stack:

Heartbeat management

By default, PowerHA SystemMirror uses unicast communications for heartbeat. Alternatively, you can configure multicast communications. For multicast, you can specify a multicast address while configuring the cluster, or let Cluster Aware AIX (CAA) assign one automatically during configuration based on the network environment. Cluster communication uses multiple redundant paths of communication. The following redundant paths of communication provide a robust clustering foundation that is less prone to cluster partitioning:

TCP/IP Networks: PowerHA SystemMirror and Cluster Aware AIX use all network interfaces that are available for cluster communication. All of these interfaces are discovered by default and used for health management and other cluster communication. You can use the PowerHA SystemMirror management interfaces to remove any interface that you do not want to be used for application availability. You can also define the interfaces that you do not want to be used as private interfaces with PowerHA SystemMirror.
SAN based communication: CAA supports storage area network (SAN) fabric-based cluster communication, including heartbeating, for a limited number of adapters. This type of heartbeating is optional and might not work with most environments because of network zoning requirements that allow packets to move from one client to another client by using Small Computer System Interface (SCSI) protocol.
Central cluster-repository based communication: Cluster health and other cluster communication are achieved through the central repository disk. PowerHA SystemMirror 7.2, or later, provides an Automatic Repository Disk Replacement (ARR) function that automatically replaces a failed repository disk with a backup repository disk. The ARR function is available only if you configure and identify a backup repository disk by using PowerHA SystemMirror.

Network interface failure detection time

PowerHA SystemMirror relies on CAA to monitor and detect network interface failures and node failures. In IBM AIX 7.1 with Technology Level 3, or earlier, CAA detected network failures within a fixed amount of time (5 seconds). If a hardware failure occurred in these versions of the AIX operating system, the failure was reported immediately. This type of reporting is called the quick failure detection process. This behavior differs from how PowerHA SystemMirror Version 6.1 detects and reports failures. In PowerHA SystemMirror 6.1, failures are not declared until the full network failure detection time elapses. This process is called the full wait time process, which is based on relaxed failure detection.

In AIX Version 7.2, or later, or in IBM AIX 7.1 with Technology Level 4, or later, you can use the NETWORK_FAILURE_DETECTION_TIME option with the clmgr command to set the failure detection time for the network interface. The default value for the NETWORK_FAILURE_DETECTION_TIME option is 20 seconds. In these versions of the AIX operating system, a failure is declared only after the full wait period of the failure detection time elapses. By default, these versions of the AIX operating system do not use the quick failure detection process.

To change the default value from 20 seconds for the NETWORK_FAILURE_DETECTION_TIME option, run the clmgr modify cluster NETWORK_FAILURE_DETECTION_TIME=<xxx> command, where xxx is one of the following values:

0: If you specify this value and the cluster is synchronized, network failure detection occurs after 5 seconds and uses the quick failure detection process. This behavior was used in IBM AIX 7.1 with Technology Level 3, or earlier.
5 - 590 seconds: If you specify a value in this range and if the cluster is synchronized, the network failure detection occurs after the specified value and uses the full wait time process.
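
The accepted values above can be sketched as a small Python helper that validates a value and builds the corresponding clmgr command string. This is an illustrative sketch only: the function names network_failure_detection_command and detection_behaviour are hypothetical and are not part of PowerHA SystemMirror or clmgr, and the helper only builds the command text without running it.

```python
from typing import Tuple


def network_failure_detection_command(seconds: int) -> str:
    """Build the clmgr command string for a network failure detection time.

    Accepts 0 (quick failure detection) or a value in the range 5 - 590
    (full wait time process), as listed in the text. Hypothetical helper:
    it only returns the command text and never runs clmgr.
    """
    if isinstance(seconds, bool) or not isinstance(seconds, int):
        raise TypeError("seconds must be an integer")
    if seconds != 0 and not 5 <= seconds <= 590:
        raise ValueError("seconds must be 0 or in the range 5 - 590")
    return f"clmgr modify cluster NETWORK_FAILURE_DETECTION_TIME={seconds}"


def detection_behaviour(seconds: int) -> Tuple[int, str]:
    """Return (effective detection time in seconds, detection process)."""
    network_failure_detection_command(seconds)  # reuse the validation
    if seconds == 0:
        # A value of 0 selects the 5-second quick failure detection process.
        return 5, "quick failure detection"
    return seconds, "full wait time"


if __name__ == "__main__":
    print(network_failure_detection_command(30))
    print(detection_behaviour(0))
```

As described above, the new value takes effect only when the cluster is synchronized, so the printed command is one step of the change rather than the whole procedure.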

Node failure detection time

PowerHA SystemMirror and CAA can detect failure of a partner node in a cluster when heartbeats are missing from network communication and disk communication. When these communication channels are lost, monitoring is enabled for a set period of time. This period is known as the node failure detection time.

To configure node failure detection time, you can use one of the following options:

SMIT

To configure node failure detection time, complete the following steps:

From the command line, enter smit sysmirror.
In the SMIT interface, select Custom Cluster Configurations > Cluster Nodes and Networks > Manage the Cluster > Cluster heartbeat settings, and press Enter.
Complete all required fields, and press Enter.

Command line

From the command line, run the clmgr modify cluster HEARTBEAT_FREQUENCY=<v1> GRACE_PERIOD=<v2> command, where v1 and v2 are values in seconds.

The HEARTBEAT_FREQUENCY option is the node communication time-out value. This value is the number of seconds that CAA waits to receive packets from the partner node before completing the next step in the process to determine whether the partner node has failed. Valid values for the HEARTBEAT_FREQUENCY option are 20 - 600 seconds. The default value is 30 seconds. The value for the HEARTBEAT_FREQUENCY option must be 10 seconds more than the value used for the NETWORK_FAILURE_DETECTION_TIME option.

The GRACE_PERIOD option is the additional time for which CAA waits after the value specified for the HEARTBEAT_FREQUENCY option. The default value of the GRACE_PERIOD option is 10 seconds.
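
The relationship between these options can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the "10 seconds more" rule is read here as "at least 10 seconds more" (an interpretation, not a documented fact), and no range check is applied to GRACE_PERIOD because the text does not state one. The helper only builds the command string and does not run clmgr.

```python
def heartbeat_settings_command(heartbeat_frequency: int = 30,
                               grace_period: int = 10,
                               network_failure_detection_time: int = 20) -> str:
    """Build the clmgr command string for the node failure detection settings.

    Defaults mirror the documented defaults (30, 10, and 20 seconds).
    Hypothetical helper: it validates the values and returns command text.
    """
    if not 20 <= heartbeat_frequency <= 600:
        raise ValueError("HEARTBEAT_FREQUENCY must be in the range 20 - 600")
    # Assumption: "10 seconds more" is treated as a minimum margin.
    if heartbeat_frequency < network_failure_detection_time + 10:
        raise ValueError(
            "HEARTBEAT_FREQUENCY must be 10 seconds more than "
            "NETWORK_FAILURE_DETECTION_TIME"
        )
    return (
        "clmgr modify cluster "
        f"HEARTBEAT_FREQUENCY={heartbeat_frequency} "
        f"GRACE_PERIOD={grace_period}"
    )


def total_wait_seconds(heartbeat_frequency: int = 30,
                       grace_period: int = 10) -> int:
    """Time-out plus the additional grace period that CAA waits after it."""
    return heartbeat_frequency + grace_period


if __name__ == "__main__":
    print(heartbeat_settings_command())
    print(total_wait_seconds())
```

With the default values, the time-out and the grace period together add up to 40 seconds.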

Enhanced event management

CAA generates fine-grained storage and network events that PowerHA SystemMirror uses to make better decisions for high availability management.

Manage storage across the nodes

PowerHA SystemMirror uses the storage fencing capabilities of AIX for better storage management across the nodes in the cluster. The fencing capabilities are supported only for disks that are configured with native AIX Multipath I/O (MPIO). PowerHA SystemMirror manages shared disks through the enhanced concurrent volume management method.

Note: PowerHA SystemMirror attempts to use the CAA storage framework fencing capability to prevent access of shared disks by nodes that do not have access to the owning shared volume group. This fencing capability prevents data corruption because of inadvertent access to shared disks from multiple nodes. However, the CAA storage framework fencing capability is supported only for native AIX MPIO.