Red Hat OpenShift Node issues

Adding a node fails - the node appears to already belong to a GPFS cluster

When a worker node is added to Red Hat OpenShift and the Cluster CR uses the nodeSelector node-role.kubernetes.io/worker, the IBM Storage Scale container native operator deploys a core pod to the newly added node and attempts to add the node to the GPFS cluster. The core pod can become stuck in the Init:1/2 state with no sign of recovery.
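
To confirm the symptom, list the core pods and check for a pod showing Init:1/2 in the STATUS column. The namespace shown here is the default for IBM Storage Scale container native installations and might differ in a customized environment.

     oc get pods -n ibm-spectrum-scale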

The operator log contains entries matching ERROR Failed to add node, with mmaddnode failing for the following reason:

The node appears to already belong to a GPFS cluster.
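
To locate these entries, you can view the logs of the operator controller manager. The deployment and namespace names below reflect a typical default installation and are an assumption; adjust them to match your environment.

     oc logs deployment/ibm-spectrum-scale-controller-manager -n ibm-spectrum-scale-operator | grep "Failed to add node"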

To recover from this scenario, use the following steps:

  1. Create a debug pod on the node where the core pod is failing to start, and delete the GPFS metadata.

     oc debug node/<openshift_worker_node> -T -- chroot /host sh -c "rm -rf /var/mmfs; rm -rf /var/adm/ras"
    

    Example:

     oc debug node/worker0.example.com -T -- chroot /host sh -c "rm -rf /var/mmfs; rm -rf /var/adm/ras"
     Starting pod/worker0examplecom-debug ...
     To use host binaries, run `chroot /host`
     Removing debug pod ...
    
  2. Delete the core pod from the ibm-spectrum-scale namespace.

      oc delete pod <core_pod_name> -n ibm-spectrum-scale
     

     Example: if the core pod is called worker3, run:

      oc delete pod worker3 -n ibm-spectrum-scale
     

  3. The operator reconciles and re-creates the core pod, which should now start successfully and add the node to the GPFS cluster. You can verify the recovery with the commands shown after this procedure.
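
After the pod is re-created, the following commands can be used to verify the recovery. The pod name worker3 carries over from the example in step 2, and the container name gpfs is an assumption about the core pod layout; adjust both to match your cluster.

     # Watch the core pod until it reports 2/2 Running
     oc get pods -n ibm-spectrum-scale -w

     # Optionally confirm that the node has joined the GPFS cluster
     oc exec -n ibm-spectrum-scale worker3 -c gpfs -- mmgetstate -a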