Troubleshooting
Problem
This document discusses what a Cluster Partition status means and how to resolve it.
Resolving The Problem
What Is a Cluster Partition?
A cluster partition occurs whenever contact is lost between one or more nodes in the cluster and the failure of the lost nodes cannot be confirmed. This is not to be confused with a partition in a logical partition (LPAR) environment. If you receive message CPFBB20 in the history log or the QCSTCTL joblog, a cluster partition has occurred.
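As a minimal sketch, the history log can be checked for that message from a command line (QHST is the standard history log; adjust the date range to the time of the suspected failure):

  DSPLOG LOG(QHST) MSGID(CPFBB20)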
When a cluster partition condition is detected, Cluster Resource Services (CRS) limits the types of actions that you can perform on the nodes in the cluster. Functions are restricted during a partition so that CRS can merge the partitions back together once the problem that caused the partition has been resolved.
A cluster partition is not always avoidable and is not always due to communication problems. Power loss and hardware failures are examples of things that can cause a cluster partition and are not communications related. The typical network or communications-related cluster partition can best be avoided by configuring redundant communication paths between the nodes in the cluster. A redundant communications path means that you have two TCP/IP interfaces configured for each of the nodes in the cluster. If a failure on the first communications path occurs, the second communications path can take over to keep communications running between the nodes. This minimizes the conditions that could put one or more of the nodes in the cluster into a cluster partition. When configuring multiple TCP/IP interfaces, each interface should be associated with a separate line.
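As an illustration only, a redundant path might be configured by adding a second TCP/IP interface on a separate line description on each node. The addresses and line description names below (ETHLINE1, ETHLINE2) are hypothetical examples, not values taken from this document:

  ADDTCPIFC INTNETADR('1.2.3.100') LIND(ETHLINE1) SUBNETMASK('255.255.255.0')
  ADDTCPIFC INTNETADR('1.2.4.100') LIND(ETHLINE2) SUBNETMASK('255.255.255.0')

Both interface addresses would then also need to be defined for that node in the cluster configuration so that CRS can use either path.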
Identifying a Cluster Partition
Run the DSPCLUINF command from a node (system) that is currently in the cluster. If Consistent information in cluster shows *YES (as shown below), this node (system) is currently part of the cluster. If so, look further down at the individual nodes and their status. If a node's status is listed as PARTITION (see the Cluster Membership List in the example below), the cluster is in a partition state, meaning the cluster cannot communicate with that system.
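For example, using the cluster name from the sample output below (TST_CLU is an example name; substitute the name of your own cluster):

  DSPCLUINF CLUSTER(TST_CLU)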
                          Display Cluster Information

  Cluster . . . . . . . . . . . . . :   TST_CLU
  Consistent information in cluster :   *YES       <-------
  Current PowerHA version . . . . . :   5.5.2
  Current cluster version  . . . . .:   10.10
  Cluster message queue . . . . . . :   *NONE
    Library . . . . . . . . . . . . :   *NONE
  Failover wait time  . . . . . . . :   *NOWAIT
  Failover default action . . . . . :   *PROCEED

                          Cluster Membership List

  Node        Status       ------Interface Addresses------
  TSTNOD1     Active       1.2.3.100
  TSTNOD2     Active       1.2.3.101
  TSTNOD3     Partition    1.2.3.102
Recovering from a Cluster Partition
The cause of the cluster partition must be identified. Is the problem communication or network related, or was a node (system) of the cluster lost because of a hardware or power failure? If the problem is communication or network related, the communication or network problem must be resolved; the partitions should then come back together without any intervention from the customer. This can take approximately 15 minutes.
If the problem was caused by a power loss or a hardware failure, that must also be addressed. Once the problem has been identified, the customer must mark the node as FAILED by running the CHGCLUNODE command with OPTION(*CHGSTS) from a cluster node that is not partitioned. If the cluster node that is partitioned still has an active cluster, ENDCLUNOD must be run on it first, followed by the STRCLUNOD command.
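As a sketch of these recovery steps, using the example cluster and partitioned node names from the display above (TST_CLU and TSTNOD3; substitute your own names):

From a node that is not partitioned, mark the lost node as failed:

  CHGCLUNODE CLUSTER(TST_CLU) NODE(TSTNOD3) OPTION(*CHGSTS)

If cluster resource services are still active on the partitioned node itself, end and then restart the node there:

  ENDCLUNOD CLUSTER(TST_CLU) NODE(TSTNOD3)
  STRCLUNOD CLUSTER(TST_CLU) NODE(TSTNOD3)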
[{"Type":"MASTER","Line of Business":{"code":"LOB68","label":"Power HW"},"Business Unit":{"code":"BU070","label":"IBM Infrastructure"},"Product":{"code":"SWG60","label":"IBM i"},"ARM Category":[{"code":"a8m3p000000F8x5AAC","label":"High Availability-\u003ECluster"}],"ARM Case Number":"","Platform":[{"code":"PF012","label":"IBM i"}],"Version":"All Versions"}]