Troubleshooting repository disks

If any node in the cluster encounters errors with the repository disk or fails to access the disk, the cluster enters a restricted mode of operation. In this mode, most topology-related operations are not allowed, and any node that is restarted cannot rejoin the cluster.

When the repository disk fails, you are notified of the disk failure. PowerHA® SystemMirror® continues to notify you of the repository disk failure until it is resolved.

To determine what the problem is with the repository disk, you can view the following log files:

  • hacmp.out
  • AIX error log (using the errpt command)

Example: hacmp.out log

The following is an example of an error message in the hacmp.out log file when a repository disk fails:

ERROR: rep_disk_notify : Tue Jan 10 13:38:22 CST 2012 : Node "r6r4m32"(0x54628FEA1D0611E183EE001A64B90DF0) on Cluster r6r4m31_32_33_34 has lost access to repository disk hdisk75.
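
The following is a minimal sketch of how you might search the log for these messages. It assumes that hacmp.out is in /var/hacmp/log, which is a common default but might differ on your cluster. The function name find_repo_disk_errors is hypothetical and is not part of PowerHA SystemMirror.

```shell
#!/bin/sh
# Hypothetical helper: list repository disk notifications in a hacmp.out log.
# Assumption: the log is at /var/hacmp/log/hacmp.out unless another path is
# passed as the first argument.
find_repo_disk_errors() {
    log_file="${1:-/var/hacmp/log/hacmp.out}"
    # rep_disk_notify is the event name that appears in the example message.
    grep 'rep_disk_notify' "$log_file" | grep 'lost access to repository disk'
}

# Usage on a cluster node:
#   find_repo_disk_errors
#   find_repo_disk_errors /path/to/a/copy/of/hacmp.out
```

Each matching line names the node, the cluster, and the repository disk that the node can no longer access.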

Example: AIX error log

When a node loses access to the repository disk, an entry is made in the AIX error log of each node that has a problem.

The following is an example of an error message in the error log file when a repository disk fails.
Note: To view the AIX error log, you must use the errpt command.
LABEL:          OPMSG
IDENTIFIER:     AA8AB241

Date/Time:       Tue Jan 10 13:38:22 CST 2012
Sequence Number: 21581
Machine Id:      00CDB2C14C00
Node Id:         r6r4m32
Class:           O
Type:            TEMP
WPAR:            Global
Resource Name:   clevmgrd

Description
OPERATOR NOTIFICATION

User Causes
ERRLOGGER COMMAND

        Recommended Actions
        REVIEW DETAILED DATA

Detail Data
MESSAGE FROM ERRLOGGER COMMAND
Error: Node 0x54628FEA1D0611E183EE001A64B90DF0 has lost access to repository disk hdisk75.
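
The following is a minimal sketch of how you might narrow the error log down to the affected disk. The identifier AA8AB241 is taken from the example entry; because it is the generic operator notification identifier, other operator messages can share it, so the sketch also filters on the message text. The function name lost_repo_disks is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read detailed error log text on standard input and
# print each distinct repository disk that is named in a "lost access" message.
lost_repo_disks() {
    sed -n 's/.*has lost access to repository disk \([^ .]*\).*/\1/p' | sort -u
}

# Usage on an AIX node (errpt -a prints detailed entries, -j selects by
# error identifier):
#   errpt -a -j AA8AB241 | lost_repo_disks
```

For the example entry, the helper prints hdisk75.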

Replacing a failed or lost repository disk

If a repository disk fails, the repository disk must be recovered on a different disk to restore all cluster operations. The circumstances for your cluster environment and the type of the repository disk failure determine the possible methods for recovering the repository disk.

Automatic Repository Disk Replacement (ARR)

PowerHA SystemMirror Version 7.2.0, or later, uses the ARR capability of CAA (in AIX® Version 7.2, or later, or in AIX Version 7.1 with Technology Level 4, or later), to handle repository disk failures. ARR automatically replaces a failed repository disk with a backup repository disk. The ARR function is available only if you configure a backup repository disk by using PowerHA SystemMirror. For more information about ARR, see the Repository disk failure topic.

You must clean up the failed repository disk yourself, because ARR cannot clean a disk that it cannot access. To clean up the failed repository disk, use the following command:
CAA_FORCE_ENABLED=true rmcluster -r <disk name>
The following are two possible scenarios where a repository disk fails and the possible methods for restoring the repository disk on a new storage disk.
Repository disk fails but the cluster is still operational
In this scenario, access to the repository disk is lost on one or more nodes in the cluster. When this failure occurs, Cluster Aware AIX (CAA) continues to operate in restricted mode by using the repository disk information that it has cached in memory. If CAA remains active on at least one node in the cluster, the cached information from the previous repository disk can be used to rebuild a new repository disk.
To rebuild the repository disk after a failure, complete the following steps from any node where CAA is still active:
  1. Verify that CAA is active on the node by using the lscluster -c command and then the lscluster -m command.
  2. Replace the repository disk by completing the steps in the Replacing a repository disk with SMIT topic. PowerHA SystemMirror recognizes the problem and interacts with CAA to rebuild the repository disk on the new storage disk.
    Note: This step updates the repository information that is stored in the PowerHA SystemMirror configuration data.

    If the ARR function is available, you do not need to perform Step 1 and Step 2.

  3. Synchronize the PowerHA SystemMirror cluster configuration information by selecting Cluster Nodes and Networks > Verify and Synchronize Cluster Configuration from the SMIT interface.
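
The check in Step 1 can be sketched as a small filter over the lscluster -m output. This sketch assumes that the output contains a line of the form "State of node: UP  NODE_LOCAL" for the local node; verify that assumption against the output on your AIX level before you rely on it. The function name local_node_state is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read "lscluster -m" output on standard input and print
# the state (for example UP or DOWN) that is reported for the local node.
# Assumption: the local node's state line contains the NODE_LOCAL marker.
local_node_state() {
    awk '/State of node:/ && /NODE_LOCAL/ { print $4; exit }'
}

# Usage on a cluster node:
#   lscluster -m | local_node_state
```

If the helper prints nothing, the expected state line was not found, and you should inspect the full lscluster -m output directly.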
Repository disk fails and the nodes in the cluster rebooted
In this rare, worst-case scenario, a series of critical failures causes access to the repository disk to be lost and all nodes in the cluster to be rebooted. Because none of the nodes remained online during the failure, you cannot rebuild the repository disk from the AIX operating system's memory. When the nodes are brought back online, they cannot start CAA because a repository disk is not present in the cluster. The ideal fix is to bring back the original repository disk and allow the cluster to self-heal. If that is not possible, you must rebuild the repository disk on a new storage disk and use it to start the CAA cluster.
To rebuild the repository disk and start cluster services, complete the following steps:
  1. On a node in the cluster, rebuild the repository disk by completing the steps in the Replacing a repository disk with SMIT topic. PowerHA SystemMirror recognizes the problem and interacts with CAA to rebuild the repository disk on the new storage disk.
    Note: This step updates the repository information that is stored in the PowerHA SystemMirror configuration data and rebuilds the repository disk from the CAA cluster cache file.

    If the ARR function is available, you do not need to perform Step 1, and the disk is replaced automatically.

    After the repository disk is replaced, run the verify and synchronization operations. If some of the nodes are down, the verify and synchronization operations might fail with errors. To run the verify and synchronization operations successfully, enter the following command:
    #/usr/es/sbin/cluster/utilities/cldare -f -dr
    
    You can ignore any cl_rsh errors that are reported.
  2. Start cluster services on the node that hosts the repository disk by completing the steps in the Starting cluster services topic.
  3. All other nodes in the cluster continue to attempt to access the original repository disk. You must configure these nodes to use the new repository disk and start CAA cluster services. Verify that the CAA cluster is not active on any of these nodes by using the lscluster -m command. If the CAA cluster is inactive or the local node is in the DOWN state, enter the following commands to remove the old repository disk information:
    export CAA_FORCE_ENABLED=true
    clusterconf -fu
  4. To have other nodes join the CAA cluster, use the following command on the active node with the newly created repository disk:
    clusterconf -p

    For AIX Version 7.1 with Technology Level 4, or later, you do not need to perform Step 3 and Step 4. After you complete Step 2, all nodes that were rebooted must wait for about 10 minutes to use the new repository disk.

  5. Verify that CAA is active by first using the lscluster -c command and then the lscluster -m command.
  6. Synchronize the PowerHA SystemMirror cluster configuration information about the newly created repository disk to all other nodes by selecting Cluster Nodes and Networks > Verify and Synchronize Cluster Configuration from the SMIT interface.
  7. Start PowerHA SystemMirror cluster services on all nodes (except the first node, where the repository disk was created) by selecting System Management (C-SPOC) > PowerHA SystemMirror Services > Start Cluster Services from the SMIT interface.

Snapshot migration and repository disk

The snapshot migration process for an online cluster requires that the cluster information in the snapshot matches the online cluster information. This requirement also applies to repository disks. If you change a repository disk configuration, you must update the snapshot to reflect these changes and then complete the snapshot migration process.