Recovery of cluster after Red Hat OpenShift node failure
The IBM Storage Scale container native operator does not support the removal of a node. There have been cases where a node fails to recover after an outage, for example, after a Red Hat OpenShift update. This section describes how to remove such a node from the IBM Storage Scale container native cluster and replace it with a new Red Hat OpenShift node.
- Before you begin
- Collect configuration before servicing the OpenShift Container Platform cluster
- Delete one Scale node from IBM Storage Scale container native cluster
- Add new Red Hat OpenShift node for use in existing IBM Storage Scale container native cluster
- Validate the IBM Storage Scale container native cluster after removal and replacement of Red Hat OpenShift node
Before you begin
Before the maintenance of the failed node, ensure that the node selectors, quorum and manager designations, and network configuration annotations are fully understood. These configurations are reapplied in later steps, after the node is removed, to restore the cluster to its previous state.
- Record the node selector labels (required for the Scale core pods) that are being used by the Cluster CR.
- Determine whether the node to be removed is a quorum node for the IBM Storage Scale container native cluster.
- Determine whether a CNI network configuration is in use for the IBM Storage Scale container native cluster.
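The following commands are a minimal sketch of one way to capture this information. They assume the default Cluster CR name and namespace ibm-spectrum-scale; the fully qualified resource name is used to avoid clashing with other cluster CRDs.
# Show the Cluster CR, including the daemon nodeSelector (default name and namespace assumed)
kubectl get clusters.scale.spectrum.ibm.com ibm-spectrum-scale -n ibm-spectrum-scale -o yaml
# Show the labels and annotations on the failed node, including any daemon-network annotation
kubectl get node REPLACE_WITH_FAILED_OCP_NODE --show-labels
kubectl get node REPLACE_WITH_FAILED_OCP_NODE -o jsonpath='{.metadata.annotations}'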
Collect configuration before servicing the OpenShift Container Platform cluster
Run an IBM Storage Scale container native must-gather to collect the state before servicing the Red Hat OpenShift cluster. For more information, see must-gather.
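A minimal sketch of the command, where the placeholder must be replaced with the must-gather image that is documented for your release:
oc adm must-gather --image=REPLACE_WITH_MUST_GATHER_IMAGE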
Delete one Scale node from IBM Storage Scale container native cluster
To delete the Scale node from the IBM Storage Scale container native cluster, complete the following steps:
- Stop the running operator pod by setting the replicas in the deployment to 0.
kubectl scale deployment ibm-spectrum-scale-controller-manager -n ibm-spectrum-scale-operator --replicas=0
- If the failed Red Hat OpenShift node still exists, delete the corresponding Scale core pod.
kubectl delete pod REPLACE_WITH_CORE_POD_OF_FAILED_OCP_NODE -n ibm-spectrum-scale
- Enter a currently running Scale core pod to run the remaining commands.
oc rsh -n ibm-spectrum-scale REPLACE_WITH_RUNNING_CORE_POD
- Note the quorum and manager designations that are shown by mmlscluster. This information is needed when the replacement node is added back.
mmlscluster
- Map the affected Red Hat OpenShift node to its GPFS admin interface. The GPFS admin interface has the short name of the affected Red Hat OpenShift node, and it should report a status of Unknown.
mmgetstate -a
- Ensure that the majority of Scale nodes are in the Active state. Then, force delete the failed GPFS admin node from the Scale cluster.
Exercise caution when shutting down GPFS on quorum nodes or deleting quorum nodes from the GPFS cluster. If the number of remaining quorum nodes falls below the requirement for quorum, you cannot perform file system operations. A quorum summary check is sketched after these steps.
mmdelnode -N REPLACE_WITH_FAILED_GPFS_ADMIN_NODE --force
- Validate that the failed GPFS admin node is deleted from the Scale cluster.
mmgetstate -a
- Exit from the Scale core pod.
exit
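To confirm that enough quorum nodes remain before and after the force deletion, the summary option of mmgetstate can be used from inside the core pod. This is a sketch; the -s option prints summary information that includes the quorum node counts.
mmgetstate -a -s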
Add new Red Hat OpenShift node for use in existing IBM Storage Scale container native cluster
The node configurations of the failed node must be applied to the new Red Hat OpenShift node so that the IBM Storage Scale container native operator can successfully add it to the existing IBM Storage Scale container native cluster.
- Ensure that the quorum and manager designations match what the previous node had in place. Ensure that the GPFS cluster does not lose quorum: if the number of remaining quorum nodes falls below the requirement for quorum, you cannot perform file system operations.
Alternatively, if a node is removed but no replacement node is added, ensure that enough nodes are designated as quorum nodes to satisfy quorum.
- If CNI is in use, ensure that the network configuration is valid according to the CNI documentation, and ensure that the scale.spectrum.ibm.com/daemon-network annotation is present on the new Red Hat OpenShift node. For more information, see Container network interface (CNI) configuration.
- Ensure that the node selector labels (required for the Scale core pods) from the Cluster CR are present on the newly added Red Hat OpenShift node. A sketch of the label and annotation commands follows this list.
- Scale up the operator pod by setting the replicas in the deployment to 1.
kubectl scale deployment ibm-spectrum-scale-controller-manager -n ibm-spectrum-scale-operator --replicas=1
- The new Scale core pod should be created and reach the Running state within a short amount of time.
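The following commands are a minimal sketch of how the recorded configuration might be reapplied and the new core pod watched. The REPLACE_WITH_ values are placeholders: use the node selector label from your Cluster CR and the daemon-network value that was recorded before the node was removed.
# Reapply the node selector label and daemon-network annotation recorded earlier
kubectl label node REPLACE_WITH_NEW_OCP_NODE REPLACE_WITH_NODE_SELECTOR_LABEL
kubectl annotate node REPLACE_WITH_NEW_OCP_NODE scale.spectrum.ibm.com/daemon-network=REPLACE_WITH_DAEMON_NETWORK_VALUE
# Watch for the new Scale core pod to reach the Running state
kubectl get pods -n ibm-spectrum-scale -w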
Validate the IBM Storage Scale container native cluster after removal and replacement of Red Hat OpenShift node
After the removal and replacement of the failed Red Hat OpenShift node has been completed, perform the validation steps. For more information, see Validating installation. Ensure the following:
- All pods are in a running state. Make sure that the CSI pod on the replaced node is running, which validates that the file systems are active.
- Validate that the IBM Storage Scale cluster has successfully added the node with the desired node designation.
- Validate that the nodes are active in the IBM Storage Scale cluster.
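A minimal sketch of these checks, reusing the placeholder names from the earlier steps and assuming that the CSI pods run in the ibm-spectrum-scale-csi namespace:
# Confirm that all core and CSI pods are running, including on the replaced node
kubectl get pods -n ibm-spectrum-scale -o wide
kubectl get pods -n ibm-spectrum-scale-csi -o wide
# From inside a core pod, confirm the node designation and that all nodes are active
oc rsh -n ibm-spectrum-scale REPLACE_WITH_RUNNING_CORE_POD
mmlscluster
mmgetstate -a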