Red Hat OpenShift Node issues
Adding a node fails - the node appears to already belong to a GPFS cluster
When adding a worker node into Red Hat OpenShift while using the nodeSelector node-role.kubernetes.io/worker
in the Cluster CR, the IBM Storage Scale Container native operator deploys a core pod to the newly added node and attempts
to add this node into the GPFS cluster. A situation can arise where the core pod remains in the "Init: 1/2" state with no sign of recovery.
The operator log contains entries matching "ERROR Failed to add node", and mmaddnode fails with the reason:
"The node appears to already belong to a GPFS cluster."
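To confirm the symptom, you can check the state of the core pods and search the operator log. The following is a minimal sketch; the ibm-spectrum-scale namespace for the core pods is taken from the recovery steps below, while the operator namespace and deployment names are assumptions and may differ in your installation.
# List the core pods and look for one stuck in the Init phase
oc get pods -n ibm-spectrum-scale -o wide
# Search the operator log for the failed mmaddnode attempt
# (namespace and deployment name are assumptions; adjust to your install)
oc logs deployment/ibm-spectrum-scale-controller-manager -n ibm-spectrum-scale-operator | grep "Failed to add node"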
To recover from this scenario, use the following steps:
- Create a debug pod on the node where the pod is failing to start and delete the GPFS metadata:
  oc debug node/<openshift_worker_node> -T -- chroot /host sh -c "rm -rf /var/mmfs; rm -rf /var/adm/ras"
  Example:
  oc debug node/worker0.example.com -T -- chroot /host sh -c "rm -rf /var/mmfs; rm -rf /var/adm/ras"
  Starting pod/worker0examplecom-debug ...
  To use host binaries, run `chroot /host`
  Removing debug pod ...
- Delete the core pod. If the core pod is called worker3, run:
  oc delete pod worker3 -n ibm-spectrum-scale
- The operator should reconcile, create the pod again, and succeed (see the verification sketch after this list).
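After the core pod is deleted, you can watch the operator recreate it and confirm that the node has joined the GPFS cluster. A minimal verification sketch follows; worker3 is the pod name from the example above, worker0 is a hypothetical name for any other healthy core pod, and it is assumed that the GPFS administration commands are on the core pod's PATH.
# Watch the recreated core pod move past the Init phase to Running
oc get pods -n ibm-spectrum-scale -w
# Once worker3 is Running, confirm the node is listed as a GPFS cluster member
# by running mmlscluster from a healthy core pod (worker0 is a placeholder)
oc exec worker0 -n ibm-spectrum-scale -- mmlscluster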