Recovery of cluster after Red Hat OpenShift node failure

The IBM Storage Scale container native operator does not support the removal of a node. There have been cases where a node fails to recover after an outage, for example, after a Red Hat OpenShift update.

Before you begin

Before maintenance of the failed node begins, ensure that the node selectors, quorum and manager designations, and network configuration annotations are fully understood. These configurations are reapplied in subsequent steps after the node is removed to restore the cluster to its previous state.
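
One way to capture this information before maintenance, assuming cluster-admin access (the node name is a placeholder), is to record the labels and annotations of the nodes:

     kubectl get nodes --show-labels
     kubectl describe node REPLACE_WITH_FAILED_OCP_NODE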

Collect configuration before servicing the OpenShift Container Platform cluster

Run an IBM Storage Scale container native must-gather to collect the state of the cluster before servicing the Red Hat OpenShift cluster. For more information, see must-gather.
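
A minimal sketch of this collection, assuming the oc CLI is available; the image reference and destination directory below are placeholders, so take the exact values from the must-gather documentation:

     oc adm must-gather --image=REPLACE_WITH_MUST_GATHER_IMAGE --dest-dir=/tmp/must-gather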

Delete one Scale node from IBM Storage Scale container native cluster

To delete the Scale node from the IBM Storage Scale container native cluster, complete the following steps:

  1. Stop the running operator pod by setting the replicas in the deployment to 0.

     kubectl scale deployment ibm-spectrum-scale-controller-manager -n ibm-spectrum-scale-operator --replicas=0
    
  2. If the Scale core pod of the failed Red Hat OpenShift node still exists, delete it.

     kubectl delete pod REPLACE_WITH_CORE_POD_OF_FAILED_OCP_NODE -n ibm-spectrum-scale
    
  3. Enter a currently running Scale core pod to run the remaining commands.

     oc rsh -n ibm-spectrum-scale REPLACE_WITH_RUNNING_CORE_POD
    
  4. Note the quorum and manager designations as shown in the mmlscluster output. This information is needed when you add back a new node; illustrative output is shown after this procedure.

     mmlscluster
    
  5. Map the affected Red Hat OpenShift node to its GPFS admin interface. The affected node should report a status of unknown.

    The name of the GPFS admin interface contains the short name of the affected Red Hat OpenShift node.

     mmgetstate -a
    
  6. Ensure that the majority of Scale nodes are in the active state. Then, force delete the GPFS admin node of the failed node from the Scale cluster.

    Exercise caution when shutting down GPFS on quorum nodes or deleting quorum nodes from the GPFS cluster. If the number of remaining quorum nodes falls below the requirement for quorum, you are unable to perform file system operations.

     mmdelnode -N REPLACE_WITH_FAILED_GPFS_ADMIN_NODE --force
    
  7. Validate that the failed GPFS admin node is deleted from the Scale cluster.

     mmgetstate -a
    
  8. Exit from the Scale core pod.

     exit
    

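For reference, the following output is illustrative of steps 4 and 5; the node names are hypothetical and the columns are abbreviated, so expect your output to differ:

     # mmlscluster (step 4): note the quorum and manager designations in the Designation column
      Node  Daemon node name     IP address  Admin node name      Designation
     --------------------------------------------------------------------------
         1  worker0.example.com  10.0.0.10   worker0.example.com  quorum-manager
         2  worker1.example.com  10.0.0.11   worker1.example.com  quorum-manager
         3  worker2.example.com  10.0.0.12   worker2.example.com  quorum

     # mmgetstate -a (step 5): the failed node reports a state of unknown
      Node number  Node name  GPFS state
     --------------------------------------
                1  worker0    active
                2  worker1    active
                3  worker2    unknown
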
Add new Red Hat OpenShift node for use in existing IBM Storage Scale container native cluster

The node configurations of the failed node must be applied to the new Red Hat OpenShift node so that the IBM Storage Scale container native operator can successfully add it to the existing IBM Storage Scale container native cluster.

  1. Ensure that the quorum and manager designations match what the previous node had in place. Ensure that the GPFS cluster does not lose quorum: if the number of remaining quorum nodes falls below the requirement for quorum, you are unable to perform file system operations.

    Alternatively, if a node is removed but no replacement node is added, ensure that enough nodes are designated as quorum nodes to maintain quorum.

  2. If you are using CNI, ensure that the network configuration is valid according to the CNI documentation. Ensure that the scale.spectrum.ibm.com/daemon-network annotation is present on the new Red Hat OpenShift node; one way to check is shown after this procedure. For more information, see Container network interface (CNI) configuration.

  3. Ensure that the node selector labels (required for the Scale core pods) from the Cluster CR are present on the newly added Red Hat OpenShift node; the sketch after this procedure shows one way to review them.

  4. Scale up the operator pod by setting the replicas in the deployment to 1.

    kubectl scale deployment ibm-spectrum-scale-controller-manager -n ibm-spectrum-scale-operator --replicas=1
    
  5. The new Scale core pod is created and reaches the Running state within a short amount of time; you can watch for it as shown below.
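
The following commands are a minimal verification sketch for steps 2, 3, and 5, assuming cluster-admin access; the node name is a placeholder:

     # Confirm that the daemon-network annotation is present on the new node (step 2)
     kubectl describe node REPLACE_WITH_NEW_OCP_NODE | grep daemon-network

     # Review the labels on the new node against the node selector in the Cluster CR (step 3)
     kubectl get node REPLACE_WITH_NEW_OCP_NODE --show-labels

     # Watch the new Scale core pod until it reaches the Running state (step 5)
     kubectl get pods -n ibm-spectrum-scale -w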

Validate IBM Storage Scale container native cluster after removal and replacement of Red Hat OpenShift node

After the removal and replacement of the failed Red Hat OpenShift node is complete, perform the validation steps. For more information, see Validating installation. Ensure the following: