Split policy
A cluster split event can occur when a group of nodes cannot communicate with the remaining nodes in a cluster. The event splits the cluster into two or more partitions.
After a cluster split event, the subdomain that has the most nodes is retained and the other subdomains are deleted. If exactly half of the nodes of a domain are online and the remaining nodes are inaccessible, PowerHA® SystemMirror® must determine which subdomain has operational quorum by using Reliable Scalable Cluster Technology (RSCT). The subdomain with operational quorum is retained and the other subdomains are deleted.
You can use PowerHA SystemMirror to configure a split policy that specifies the response to a cluster split event.
- None
-
The None split policy allows each partition that is created by the cluster split event to become an independent cluster and each partition is started independently of the other partition. A user can check the quorum status by using the following command:
lssrc -ls IBM.RecoveryRM
Before a split event, the operational quorum state is HAS_QUORUM and the configuration quorum is TRUE. A user can make both operational changes, such as bringing the cluster online or offline, and configuration changes, such as adding or deleting a resource group in the cluster.
However, after the split event, for each partition the operational quorum state remains HAS_QUORUM but the configuration quorum state becomes FALSE. Hence, PowerHA SystemMirror and the user can perform operational changes but not configuration changes.
For example, in a two-node cluster, if a cluster split event occurs, all the resources are online on both nodes. Because both nodes have quorum, if a merge event occurs, the lower priority node is rebooted if a critical resource is running on it, and only the RSCT subsystem is restarted if a non-critical resource or no resource is running on it. Use the None split policy if you want multiple instances of an application to run simultaneously after a split event.
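The quorum checks described above can be automated. The following is a minimal sketch that summarizes the partition's state from the `lssrc -ls IBM.RecoveryRM` output, assuming the output contains the `Operational Quorum State` and `In Config Quorum` lines shown in this document; the function name is illustrative, not part of the product.

```shell
# Sketch: summarize quorum state from `lssrc -ls IBM.RecoveryRM` output.
# Assumes lines of the form:
#   Operational Quorum State : HAS_QUORUM
#   In Config Quorum : TRUE
check_quorum() {
    local out="$1" op cfg
    op=$(printf '%s\n' "$out" | sed -n 's/.*Operational Quorum State *: *//p')
    cfg=$(printf '%s\n' "$out" | sed -n 's/.*In Config Quorum *: *//p')
    if [ "$op" = "HAS_QUORUM" ] && [ "$cfg" = "TRUE" ]; then
        echo "healthy: operational and configuration changes allowed"
    elif [ "$op" = "HAS_QUORUM" ]; then
        echo "split: operational changes only"
    else
        echo "no quorum: operational state is $op"
    fi
}
```

A typical call would be `check_quorum "$(lssrc -ls IBM.RecoveryRM)"` on a cluster node.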
To set the cluster policy as None, enter the following command:
clmgr modify cluster SPLIT_POLICY=none
- Manual
- The Manual split policy allows each partition that is created by the cluster split event to become an independent cluster. However, a partition cannot start a workload until the user grants ownership to that subcluster. Until then, the operational quorum state is PENDING_QUORUM for each subcluster and PowerHA SystemMirror does not perform any action. The user can grant ownership by using the following command:
runact -c IBM.PeerDomain ResolveOpQuorumTie Ownership=0/1
where
- 1
- Changes the quorum state from PENDING_QUORUM to HAS_QUORUM
- 0
- Changes the quorum state from PENDING_QUORUM to NO_QUORUM
The subcluster with the NO_QUORUM state has no authority to make any operational change on its nodes. If a critical resource is running on a NO_QUORUM node, the node is rebooted to avoid corruption of the critical resource. The subcluster with the HAS_QUORUM state has authority to make any operational change on the subcluster.
Before a split event, the operational quorum state is HAS_QUORUM and the configuration quorum is TRUE. PowerHA SystemMirror allows both operational changes and configuration changes in the cluster.
However, after the split event, for each partition the operational quorum state is PENDING_QUORUM and the configuration quorum is FALSE. PowerHA SystemMirror does not allow the user to make any operational or configuration changes.
If a cluster splits, all the resources are in the same state as they were before the split event. The user can grant ownership 1 to one subcluster and 0 to the other, so that the resources run on only one subcluster. After ownership is granted, PENDING_QUORUM changes to either HAS_QUORUM or NO_QUORUM. After the split event, a node that has a critical resource running on it is rebooted if ownership 0 is given. If a non-critical resource or no resource is running on that subcluster, it remains the same.
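Because the `Ownership` value accepts only 0 or 1, a small wrapper can guard against typing any other value. The following is a hedged sketch; the `resolve_tie` function and the `DRY_RUN` switch are illustrative conveniences, not part of PowerHA SystemMirror, and `DRY_RUN=1` only prints the command, which is useful outside a real cluster.

```shell
# Sketch: validate the ownership value before issuing the RSCT
# tie-resolution command. DRY_RUN=1 prints the command instead of running it.
resolve_tie() {
    local ownership="$1"
    case "$ownership" in
        0|1) ;;
        *) echo "usage: resolve_tie 0|1" >&2; return 1 ;;
    esac
    local cmd="runact -c IBM.PeerDomain ResolveOpQuorumTie Ownership=$ownership"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$cmd"
    else
        $cmd
    fi
}
```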
Use the Manual split policy if you want the user to grant ownership after a split event before another instance of the application is created.
If a merge event occurs, the subcluster with the HAS_QUORUM state automatically allows a user to perform any operational changes. The lower priority node is rebooted if a critical resource is running on it and its operational quorum state was HAS_QUORUM before the merge; the RSCT subsystems are restarted if a non-critical resource or no resource is running on it.
To set the cluster policy as Manual, enter the following command:
clmgr modify cluster SPLIT_POLICY=manual
- Tiebreaker
-
You can use the tiebreaker option to specify a SCSI disk or a Network File System (NFS) file that is used by the split and merge policies.
A tiebreaker disk or an NFS file is used when the sites in the cluster can no longer communicate with each other. This communication failure results in the cluster splitting the sites into two independent partitions. If failure occurs because the cluster communication links are not responding, both partitions attempt to lock the tiebreaker disk or the NFS file. The partition that acquires the tiebreaker disk continues to function, while the other partition reboots, or has cluster services restarted, depending on whether any critical resources are configured.
The disk or NFS-mounted file that is identified as the tiebreaker must be accessible to all nodes in the cluster.
The Tiebreaker split policy allows one of the partitions that is created by the cluster split event to become an independent cluster. After winning the tiebreaker, the subcluster can start the workload or allow the user to perform operational changes. Whenever a split event occurs, the operational quorum state remains PENDING_QUORUM until one of the subclusters wins the tiebreaker. The subcluster that wins the tiebreaker starts the workload or allows the user to make operational changes.
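The "first to acquire the tiebreaker wins" behavior can be illustrated in miniature. The following is a conceptual sketch only: an atomic mkdir on a local path stands in for the real tiebreaker disk or NFS file, and the function name and messages are invented for illustration.

```shell
# Conceptual sketch: "first to lock wins", using a local lock path in
# place of the real tiebreaker disk or NFS file.
try_acquire() {
    local lockpath="$1"
    # mkdir is atomic, so exactly one attempt can succeed.
    if mkdir "$lockpath" 2>/dev/null; then
        echo "won: this partition continues the workload"
    else
        echo "lost: this partition reboots or restarts cluster services"
    fi
}
```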
The following are the types of tiebreaker in PowerHA SystemMirror for Linux:
- NFS tiebreaker
-
When the Network File System (NFS) type of tiebreaker is configured, one directory of the NFS server is mounted on all nodes of the cluster. When a split event occurs because of a network failure on one node, that node loses its mount of the NFS server directory and becomes the losing node. The other node is the winning node.
Before you configure the NFS tiebreaker, a user must have a proper understanding of the permissions and accessibility of the NFS server machine and must check the directory on the NFS server that is to be mounted on the local directory of the nodes. To configure the NFS tiebreaker in PowerHA SystemMirror for Linux, enter the following command:
clmgr modify cluster clMain SPLIT_POLICY=tiebreaker TIEBREAKER=nfs NFS_SERVER=192.2.2.5 NFS_SERVER_MOUNT_POINT=/test_nfs_tie NFS_LOCAL_MOUNT_POINT=/test_nfs_local NFS_FILE_NAME=nfsTestConfig
where,
- SPLIT_POLICY=tiebreaker is the type of split policy.
- TIEBREAKER=nfs is the type of tiebreaker.
- NFS_SERVER=192.2.2.5 is the IP address of the NFS server machine.
- NFS_SERVER_MOUNT_POINT=/test_nfs_tie is the directory on the NFS server machine.
- NFS_LOCAL_MOUNT_POINT=/test_nfs_local is the directory on the nodes.
- NFS_FILE_NAME=nfsTestConfig is any file name given by the user.
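Because the clmgr invocation above takes four NFS parameters, a small helper can assemble the command and refuse to proceed if any parameter is missing. The following is a hypothetical sketch; the `nfs_tiebreaker_cmd` function is not part of the product, and it only prints the command for review rather than running it.

```shell
# Sketch: assemble the clmgr command for an NFS tiebreaker from its four
# parameters, failing early if any is missing.
nfs_tiebreaker_cmd() {
    local server="$1" server_dir="$2" local_dir="$3" file="$4"
    if [ -z "$server" ] || [ -z "$server_dir" ] || [ -z "$local_dir" ] || [ -z "$file" ]; then
        echo "usage: nfs_tiebreaker_cmd <server-ip> <server-dir> <local-dir> <file-name>" >&2
        return 1
    fi
    printf 'clmgr modify cluster SPLIT_POLICY=tiebreaker TIEBREAKER=nfs NFS_SERVER=%s NFS_SERVER_MOUNT_POINT=%s NFS_LOCAL_MOUNT_POINT=%s NFS_FILE_NAME=%s\n' \
        "$server" "$server_dir" "$local_dir" "$file"
}
```

For example, `nfs_tiebreaker_cmd 192.2.2.5 /test_nfs_tie /test_nfs_local nfsTestConfig` prints the full command from this document for inspection before it is run.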
Before the split event, the quorum state can be seen by using the following command:
lssrc -ls IBM.RecoveryRM
Operational Quorum State : HAS_QUORUM
In Config Quorum : TRUE
After the split event occurs, the quorum status will be:
Operational Quorum State : PENDING_QUORUM
In Config Quorum : FALSE
The quorum shows the following state after winning the tiebreaker on the winning node:
Operational Quorum State : HAS_QUORUM
In Config Quorum : FALSE
The quorum shows the following state after losing the tiebreaker on the losing node:
Operational Quorum State : NO_QUORUM
In Config Quorum : FALSE
The losing node is rebooted if a critical resource is running on it. Otherwise, it remains as it is.
The winning node starts the workload if it was not already running; otherwise, the workload continues to run as it is, and the user can perform operational changes. The losing node merges back into the cluster automatically after the reboot if a critical resource was running on it. If a non-critical resource is running on the node, the RSCT subsystem is restarted on that node after the merge event.
Use the NFS tiebreaker when a single node's interface can fail. If two nodes are connected to two different switches and one switch goes down, the node that is connected to that switch cannot communicate. However, the second node can still communicate because the other switch is working. The second node wins because it can reach the NFS server.
- Disk tiebreaker
-
The disk tiebreaker uses a disk to resolve the tie. When the cluster split event occurs, each node tries to lock the disk. The node that locks the disk first becomes the winning node. The other node is the losing node.
Before you configure the disk tiebreaker, the user must identify the shared disk among the nodes by using the following command:
lsrsrc IBM.Disk
To configure the disk tiebreaker in PowerHA SystemMirror for Linux, enter the following command:
clmgr modify cluster SPLIT_POLICY=tiebreaker TIEBREAKER=disk DISK_WWID=36017595807eed37r0000000000000045
where
- SPLIT_POLICY=tiebreaker is the type of split policy.
- TIEBREAKER=disk is the type of tiebreaker.
- DISK_WWID=36017595807eed37r0000000000000045 is the shared disk used for the tiebreaker.
Use the disk tiebreaker when interface failure can occur on both nodes. If both nodes are connected to the same switch and the switch goes down, the two nodes cannot reach the NFS server. The winning node is decided by the locking system on the disk. The node that locks the disk first is the winning node.
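To feed the DISK_WWID parameter, the WWIDs must be picked out of the `lsrsrc IBM.Disk` listing. The following is a sketch filter; the attribute-line format it matches (`WWID = "..."`) is an assumption about how the listing looks on your system, so adjust the pattern to your actual output.

```shell
# Sketch: extract candidate WWIDs from lsrsrc-style output.
# Assumed (hypothetical) attribute line format: WWID = "36017..."
list_wwids() {
    sed -n 's/.*WWID *= *"\([^"]*\)".*/\1/p'
}
```

A typical call would be `lsrsrc IBM.Disk | list_wwids` on a cluster node.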