
Cluster: Cluster Partition State

Troubleshooting


Problem

This document discusses what a Cluster Partition status means and how to resolve it.

Resolving The Problem

What Is a Cluster Partition?

A cluster partition occurs whenever contact is lost between one or more nodes in the cluster and the failure of the lost nodes cannot be confirmed. This is not to be confused with a partition in a logical partition (LPAR) environment. If you receive message CPFBB20 in the history log or the QCSTCTL joblog, a cluster partition has occurred.
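As a quick check, the history log can be scanned for the partition message from a command line. This is a minimal sketch; additional selection parameters (for example, a time period) can be added as needed:

    DSPLOG LOG(QHST) MSGID(CPFBB20)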

When a cluster partition condition is detected, Cluster Resource Services (CRS) limits the types of actions that you can perform on the nodes in the cluster. Functions are restricted during a partition so that CRS can merge the partitions back together once the problem that caused the partition has been resolved.

A cluster partition is not always avoidable and is not always due to communication problems. Power loss and hardware failures are examples of causes that are not communications related. The typical network or communications-related cluster partition can best be avoided by configuring redundant communication paths between the nodes in the cluster. A redundant communications path means that you have two TCP/IP interfaces configured for each of the nodes in the cluster. If a failure occurs on the first communications path, the second communications path can take over to keep communications running between the nodes. This minimizes the conditions that could put one or more of the nodes in the cluster into a cluster partition. When configuring multiple TCP/IP interfaces, each interface should be associated with a separate communications line.
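As a hedged sketch of this setup, a second interface could be defined on its own line description and started with standard TCP/IP commands; the line name ETHLINE2 and the address 1.2.3.200 below are only illustrative. The new address must also be added to the node's cluster interface list (CHGCLUNODE with OPTION(*ADDIFC); prompt the command with F4 to confirm the interface-address parameter on your release):

    ADDTCPIFC INTNETADR('1.2.3.200') LIND(ETHLINE2) SUBNETMASK('255.255.255.0')
    STRTCPIFC INTNETADR('1.2.3.200')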

Identifying a Cluster Partition

Run the DSPCLUINF command from a node (system) that is currently active in the cluster. If Consistent information in cluster shows *YES (as shown below), this node (system) is currently part of the cluster. If this is the case, look further down at the individual nodes and their status. If a node's status is listed as PARTITION (see node TSTNOD3 in the membership list below), the cluster is in a partition state. This means that the cluster cannot communicate with that system.
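For example, using the cluster name from the sample output below:

    DSPCLUINF CLUSTER(TST_CLU)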
 
                        Display Cluster Information                          
                                                                               
Cluster  . . . . . . . . . . . . . :   TST_CLU                                    
Consistent information in cluster  :   *YES   <-------                                  
Current PowerHA version  . . . . . :   5.5.2   
Current cluster version  . . . . . :   10.10   
Cluster message queue  . . . . . . :   *NONE   
  Library  . . . . . . . . . . . . :     *NONE 
Failover wait time . . . . . . . . :   *NOWAIT 
Failover default action  . . . . . :   *PROCEED

                                                                               
                            Cluster Membership List                            
                                                                               
 Node         Status         ------Interface Addresses------
 TSTNOD1      Active         1.2.3.100
 TSTNOD2      Active         1.2.3.101
 TSTNOD3      Partition      1.2.3.102

Recovering from a Cluster Partition

The cause of the cluster partition must be identified. Is the problem communication/network related, or was a node (system) of the cluster lost because of a hardware or power failure? If the problem is communication/network related, resolve the communication/network problem; the partitions should then merge back together without any intervention from the customer. This can take approximately 15 minutes.

If the problem was caused by a power loss or some type of hardware failure, that must also be addressed. Once the problem has been identified, the customer must mark the node as FAILED by running the CHGCLUNODE command with OPTION(*CHGSTS) from a non-partitioned cluster node. If cluster resource services is still active on the partitioned node, ENDCLUNOD must be run on that node first, followed by the STRCLUNOD command.
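The following is a sketch of that sequence, using the cluster and node names from the example above; prompt each command with F4 to confirm the parameters on your release.

From a node that is still active in the cluster, mark the partitioned node as failed:

    CHGCLUNODE CLUSTER(TST_CLU) NODE(TSTNOD3) OPTION(*CHGSTS)

If cluster resource services is still active on the partitioned node, end it on that node first and then start it again:

    ENDCLUNOD CLUSTER(TST_CLU) NODE(TSTNOD3)
    STRCLUNOD CLUSTER(TST_CLU) NODE(TSTNOD3)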

[{"Type":"MASTER","Line of Business":{"code":"LOB68","label":"Power HW"},"Business Unit":{"code":"BU070","label":"IBM Infrastructure"},"Product":{"code":"SWG60","label":"IBM i"},"ARM Category":[{"code":"a8m3p000000F8x5AAC","label":"High Availability-\u003ECluster"}],"ARM Case Number":"","Platform":[{"code":"PF012","label":"IBM i"}],"Version":"All Versions"}]

Historical Number

353809697

Document Information

More support for:
IBM i

Component:
High Availability->Cluster

Software version:
All Versions

Operating system(s):
IBM i

Document number:
638829

Modified date:
28 September 2024

UID

nas8N1015886
