Red Hat OpenShift Data Foundation storage node failure

You can replace a node proactively while it is still operational, or reactively after it has failed. For a failed node backed by local storage devices, you must replace the Red Hat® OpenShift® Data Foundation storage node.

Before you begin
Red Hat recommends that replacement nodes are configured with infrastructure, resources, and disks similar to those of the node planned for replacement.
Note: Contact IBM Support before you proceed with any of these fixes.
Do the following steps to check for the occurrence of Red Hat OpenShift Data Foundation storage node failure and identify the failed node:
  1. Set the Red Hat OpenShift Data Foundation cluster to maintenance mode:
    oc label -n ibm-spectrum-fusion-ns odfcluster ""
    Example output:
    [root@fu40 ~]# oc label -n ibm-spectrum-fusion-ns odfcluster "" labeled
  2. Identify the failed node:
    1. Log in to the IBM Fusion user interface.
    2. Go to the Data foundation page and check for warnings in the Health section for the storage cluster.
    Alternatively, you can use the oc command to identify the node:
    oc get node -l
    Sample output:
    [root@fu71-f09-vm3 ~]# oc get node -l
    NAME                               STATUS     ROLES    AGE   VERSION
    f09-prc4m-worker-cluster-b-9chb5   NotReady   worker   27d   v1.24.0+4f0dd4d
    f09-prc4m-worker-cluster-c-mfb77   Ready      worker   31d   v1.24.0+4f0dd4d
    f09-prc4m-worker-cluster-d-r5bxx   Ready      worker   27d   v1.24.0+4f0dd4d
  3. Identify the failed mon (if any) and OSD pods that are running on the node that is planned for replacement:
    In an operational storage node environment:
    oc get pods -n openshift-storage -o wide | grep -i <node_name>
  4. In a failed storage node environment, the failed pods have no node_name. Filter for the pending pods instead:
    oc get pods -n openshift-storage -o wide | grep -i pending
    Example output: The mon deployment is rook-ceph-mon-d, and the OSD deployment is rook-ceph-osd-0.
    [root@fu71-f09-vm3 ~]# oc get pods -n openshift-storage -o wide | grep -i pending
     rook-ceph-mon-d-67686857d7-zv62c                                  0/2     Pending     0          8m50s   <none>            <none>                             <none>           <none>
     rook-ceph-osd-0-75b954c9bf-62xm4                                  0/2     Pending     0          8m50s   <none>            <none>                             <none>           <none>
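The deployment names that the later scale-down commands need can be derived from the pending pod names. The following is a minimal sketch, assuming the usual Deployment pod naming of <deployment>-<replicaset-hash>-<pod-suffix>. The helper name deployment_from_pod is hypothetical and is not part of oc.

```shell
# Hypothetical helper: strip the trailing "-<replicaset-hash>-<pod-suffix>"
# from a pod name to recover the name of the owning deployment.
deployment_from_pod() {
  printf '%s\n' "$1" | sed -E 's/-[a-z0-9]+-[a-z0-9]+$//'
}

# Pod names taken from the example output above.
deployment_from_pod rook-ceph-mon-d-67686857d7-zv62c   # prints rook-ceph-mon-d
deployment_from_pod rook-ceph-osd-0-75b954c9bf-62xm4   # prints rook-ceph-osd-0
```

Confirm the derived names against oc get deployment -n openshift-storage before you scale anything down.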
  5. Remove the failed objects.
    Remove the failed node from the odfcluster CR
      autoScaleUp: false
      creator: CreatedByFusion
      - capacity: "0"
        count: 3
        name: ocs-deviceset-ibm-spectrum-fusion-local
        storageClass: ibm-spectrum-fusion-local
        keyManagementService: {}
        - disk
        - part
        size: 2Ti
      - f09-prc4m-worker-cluster-d-r5bxx
      - f09-prc4m-worker-cluster-c-mfb77
      - f09-prc4m-worker-cluster-b-9chb5 <<-- this one, remove this line.
    Remove the mon and OSD pods
    Scale down the deployments of the identified pods. The mon deployment is rook-ceph-mon-d and the OSD deployment is rook-ceph-osd-0.
    oc scale deployment rook-ceph-mon-d --replicas=0 -n openshift-storage
    oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage

    Ensure that you confirm the values of mon_id and osd_id.

    Remove the crashcollector pods
    Remove the crashcollector pods (if any) by scaling their deployment replicas to 0.
    oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name>  --replicas=0 -n openshift-storage
    Mark the failed node as unschedulable
    Mark the node as SchedulingDisabled.
    oc adm cordon <node_name>
    Example command and output:
    oc adm cordon f09-prc4m-worker-cluster-b-9chb5
    node/f09-prc4m-worker-cluster-b-9chb5 cordoned
    oc get node -l
    NAME                               STATUS                        ROLES    AGE   VERSION
    f09-prc4m-worker-cluster-b-9chb5   NotReady,SchedulingDisabled   worker   28d   v1.24.0+4f0dd4d
    f09-prc4m-worker-cluster-c-mfb77   Ready                         worker   31d   v1.24.0+4f0dd4d
    f09-prc4m-worker-cluster-d-r5bxx   Ready                         worker   27d   v1.24.0+4f0dd4d
    Remove the pods which are in Terminating state
    This step applies only to a failed storage node. You can skip it if you are removing an operational node.
    oc get pods -A -o wide | grep -i <node_name> |  awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2  " --grace-period=0 " " --force ")}'
    Example command and output:
    oc get pods -A -o wide | grep -i f09-prc4m-worker-cluster-b-9chb5 |  awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2  " --grace-period=0 " " --force ")}'
    warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
    pod "isf-data-protection-operator-controller-manager-5c7cf574d5ms4xx" force deleted
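Because a forced deletion cannot be undone, it can help to preview the delete commands before running them. The following is a sketch of a dry-run variant of the pipeline above, assuming the same column layout of oc get pods -A -o wide (namespace in column 1, pod name in column 2, status in column 4). The helper name print_force_deletes, the namespaces ns-a and ns-b, and the pod example-pod-1 are made up for illustration.

```shell
# Hypothetical helper: read `oc get pods -A -o wide` output on stdin and
# print (not run) a force-delete command for each Terminating pod whose
# line mentions the given node name.
print_force_deletes() {
  grep -i -- "$1" | awk '$4 == "Terminating" {
    print "oc -n " $1 " delete pods " $2 " --grace-period=0 --force"
  }'
}

# Sample input for illustration only; namespaces and the second pod are invented.
sample='ns-a   isf-data-protection-operator-controller-manager-5c7cf574d5ms4xx   2/2   Terminating   0   3d   10.0.0.1   f09-prc4m-worker-cluster-b-9chb5   <none>   <none>
ns-b   example-pod-1   1/1   Running   0   3d   10.0.0.2   f09-prc4m-worker-cluster-b-9chb5   <none>   <none>'

printf '%s\n' "$sample" | print_force_deletes f09-prc4m-worker-cluster-b-9chb5

# Real usage (assumes a logged-in oc session):
#   oc get pods -A -o wide | print_force_deletes <node_name>
```

Only the Terminating pod produces a command. After you review the printed lines, you can run them individually or pipe them to sh.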
    Drain the node
    oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets
    Example command and output:
    oc adm drain f09-prc4m-worker-cluster-b-9chb5 --force --delete-emptydir-data=true --ignore-daemonsets
    node/f09-prc4m-worker-cluster-b-9chb5 already cordoned
    WARNING: ignoring DaemonSet-managed Pods: openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-node-7696f, openshift-cluster-node-tuning-operator/tuned-fk949, openshift-dns/dns-default-gvv4m, openshift-dns/node-resolver-t5dk8, openshift-image-registry/node-ca-wtnp9, openshift-ingress-canary/ingress-canary-kgxts, openshift-local-storage/diskmaker-discovery-qkmh7, openshift-local-storage/diskmaker-manager-m9q42, openshift-machine-config-operator/machine-config-daemon-252j8, openshift-monitoring/node-exporter-cghwc, openshift-multus/multus-additional-cni-plugins-mkz4m, openshift-multus/multus-bz789, openshift-multus/network-metrics-daemon-57v5r, openshift-network-diagnostics/network-check-target-n6bhw, openshift-sdn/sdn-fhp47, openshift-storage/csi-cephfsplugin-5vsp9, openshift-storage/csi-rbdplugin-bfpfs
    node/f09-prc4m-worker-cluster-b-9chb5 drained
    Delete the node
    Delete the failed node:
    oc delete node <node_name>
    If you want to keep this node, for example for test purposes, you can remove the storage label instead of deleting the node:
    oc label nodes/f09-prc4m-worker-cluster-b-9chb5
  6. Add new OpenShift compute nodes.
    1. Create a new compute node and ensure that the new node is in the Ready state.
    2. Update the new node information in the odfcluster CR.
      Edit the odfcluster CR and add the new node name in StorageNodes:
      oc edit -n ibm-spectrum-fusion-ns odfcluster
          - f09-prc4m-worker-cluster-b-djplp  <<--- new node
          - f09-prc4m-worker-cluster-d-r5bxx
          - f09-prc4m-worker-cluster-c-mfb77
      Verify that the new node is labeled successfully as a storage node.
      oc get node -l
      NAME                               STATUS   ROLES    AGE   VERSION
      f09-prc4m-worker-cluster-b-djplp   Ready    worker   27d   v1.24.0+4f0dd4d <<-- this node
      f09-prc4m-worker-cluster-c-mfb77   Ready    worker   31d   v1.24.0+4f0dd4d
      f09-prc4m-worker-cluster-d-r5bxx   Ready    worker   27d   v1.24.0+4f0dd4d
      Verify whether the new PVs are created automatically. The local PV gets created automatically in a short time.
      oc get pv | grep Available
      local-pv-e97b23d7                          2Ti        RWO            Delete           Available                                                                             ibm-spectrum-fusion-local              2m49s
    3. Replace the failed OSD disks.

      Remove the failed OSD from the cluster. You can also specify multiple failed OSDs. Use the correct failed_osd_id.

      The failed_osd_id is the integer in the pod name immediately after the rook-ceph-osd prefix. To remove more than one OSD, add comma-separated OSD IDs in the command, for example, FAILED_OSD_IDS=0,1,2.
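The naming rule above can be applied mechanically. The following is a minimal sketch that extracts the OSD IDs from a list of pod names and joins them with commas; the helper name failed_osd_ids is hypothetical, and the pod names are the ones shown in the example output of this procedure.

```shell
# Hypothetical helper: read pod names on stdin, keep only OSD daemon pods
# (rook-ceph-osd-<id>-...), and join their numeric IDs with commas.
# rook-ceph-osd-prepare-* pods do not match because "prepare" is not numeric.
failed_osd_ids() {
  sed -nE 's/^rook-ceph-osd-([0-9]+)-.*/\1/p' | sort -n | uniq | paste -sd, -
}

printf '%s\n' \
  rook-ceph-mon-d-67686857d7-zv62c \
  rook-ceph-osd-0-75b954c9bf-62xm4 \
  rook-ceph-osd-prepare-24272b5641dc95baffc7932d78894e3c-zhz8m |
  failed_osd_ids   # prints 0
```

The result is in the form that the FAILED_OSD_IDS parameter expects. Check the IDs against the pending OSD pods before you start the removal job.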

      Remove the failed OSD:
      oc process -n openshift-storage ocs-osd-removal \
      -p FAILED_OSD_IDS=<failed_osd_id> FORCE_OSD_REMOVAL=true | oc create -n openshift-storage -f -
      Example output:
      [root@fu71-f09-vm3 ~]# oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=0 FORCE_OSD_REMOVAL=true | oc create -n openshift-storage -f -
      Warning: would violate PodSecurity "restricted:latest": allowPrivilegeEscalation != false (container "operator" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "operator" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "operator" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "operator" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
      job.batch/ocs-osd-removal-job created
      Check the status of the ocs-osd-removal-job pod to verify that the OSD was removed successfully. A status of Completed confirms that the OSD removal job succeeded.
      oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
      Example output:
      [root@fu71-f09-vm3 ~]# oc get pod -n openshift-storage  | grep ocs-osd-removal
      ocs-osd-removal-job-ls65l                                         0/1     Completed   0          23s
      Ensure that the OSD removal is completed:
      oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'completed removal'
      Example output:
      [root@fu71-f09-vm3 ~]# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'completed removal'
      2022-11-24 15:19:28.750910 I | cephosd: completed removal of OSD 0
      Delete the ocs-osd-removal-job:
      oc delete -n openshift-storage job ocs-osd-removal-job
      Delete the PV in the Released state that was attached to the previous node.
      oc get pv | grep -i released
      local-pv-e4a12175                          2Ti        RWO            Delete           Released    openshift-storage/ocs-deviceset-ibm-spectrum-fusion-local-0-data-1m6gnz   ibm-spectrum-fusion-local              3h34m
      [root@fu71-f09-vm3 ~]# oc delete pv local-pv-e4a12175
      persistentvolume "local-pv-e4a12175" deleted
  7. Recover the failed objects.
    1. Restart the mon deployment/pod:
      1. Update the nodeSelector in the deployment with the new node.
        oc edit deployment -n openshift-storage rook-ceph-mon-d
       f09-prc4m-worker-cluster-b-djplp <<--new node
      2. Scale the replicas to 1 and wait until the mon pods are in the Running state.
        oc scale deployment rook-ceph-mon-d --replicas=1 -n openshift-storage
        Example command and output:
        oc scale deployment rook-ceph-mon-d --replicas=1 -n openshift-storage
            deployment.apps/rook-ceph-mon-d scaled
        [root@fu71-f09-vm3 ~]# oc get pod -n openshift-storage | grep mon
            rook-ceph-mon-a-5bbb9dd98b-z54fx                                  2/2     Running     0          4m45s
            rook-ceph-mon-b-7fdd8f958b-lk9g2                                  2/2     Running     0          5m18s
            rook-ceph-mon-d-6945fbbfc5-nhhw8                                  2/2     Running     0          5m41s
    2. Verify the OSD pods.
      Wait until all the pods are in the Running state.
      oc get pods -o wide -n openshift-storage| grep osd
      [root@fu71-f09-vm3 ~]# oc get pods -o wide -n openshift-storage| grep osd
      rook-ceph-osd-0-d559cc4fb-xspr8                                   2/2     Running     0          4m58s       f09-prc4m-worker-cluster-b-djplp   <none>           <none>
      rook-ceph-osd-1-6df7f9c669-n94md                                  2/2     Running     0          5m20s       f09-prc4m-worker-cluster-d-r5bxx   <none>           <none>
      rook-ceph-osd-2-5c5d48ff7c-sdd7l                                  2/2     Running     0          5m17s      f09-prc4m-worker-cluster-c-mfb77   <none>           <none>
      rook-ceph-osd-prepare-1bf7dd3d71fe899383e625dd0c27ea37-x9vtk      0/1     Completed   0          4h8m       f09-prc4m-worker-cluster-d-r5bxx   <none>           <none>
      rook-ceph-osd-prepare-24272b5641dc95baffc7932d78894e3c-zhz8m      0/1     Completed   0          5m24s       f09-prc4m-worker-cluster-b-djplp   <none>           <none>
      rook-ceph-osd-prepare-6f3c3b4626ec9888b6fbe5597afd55ea-zh7cp      0/1     Completed   0          4h8m      f09-prc4m-worker-cluster-c-mfb77   <none>           <none>
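The check above can also be made by filtering the pod listing rather than reading it by eye. The following is a sketch, assuming the default column layout of oc get pods -o wide without -A (pod name in column 1, status in column 3). The helper name osd_pods_not_running is hypothetical, and the Pending status in the sample input is altered from the example output purely for illustration.

```shell
# Hypothetical helper: read `oc get pods -o wide -n openshift-storage` output
# on stdin and list OSD daemon pods whose STATUS column is not Running.
# rook-ceph-osd-prepare pods are skipped because they finish as Completed.
osd_pods_not_running() {
  awk '$1 ~ /^rook-ceph-osd-[0-9]+-/ && $3 != "Running" { print $1 " " $3 }'
}

# Illustrative input; the second pod is shown as Pending on purpose.
printf '%s\n' \
  'rook-ceph-osd-0-d559cc4fb-xspr8 2/2 Running 0 4m58s' \
  'rook-ceph-osd-1-6df7f9c669-n94md 0/2 Pending 0 5m20s' \
  'rook-ceph-osd-prepare-1bf7dd3d71fe899383e625dd0c27ea37-x9vtk 0/1 Completed 0 4h8m' |
  osd_pods_not_running   # prints rook-ceph-osd-1-6df7f9c669-n94md Pending

# Real usage (assumes a logged-in oc session):
#   oc get pods -o wide -n openshift-storage | osd_pods_not_running
```

Empty output means that every OSD daemon pod in the listing reports Running.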
    3. Verify the OSD encryption settings. If cluster-wide encryption is enabled, make sure that the "crypt" keyword appears beside the ocs-deviceset names:
      oc debug node/<new-node-name> -- chroot /host dmsetup ls
      Example output:
      # oc debug node/fu47 -- chroot /host dmsetup ls
      Starting pod/fu47-debug ...
      To use host binaries, run `chroot /host`
      ocs-deviceset-sc-lvs-0-data-0clwxf-block-dmcrypt	(253:0)

      If verification fails, contact IBM Support.
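The encryption check can be expressed as a filter over the dmsetup ls output. The following is a minimal sketch, assuming that encrypted device sets appear as device-mapper entries whose names start with ocs-deviceset and contain the "crypt" keyword, as in the example output. The helper name encrypted_devicesets is hypothetical.

```shell
# Hypothetical helper: read `dmsetup ls` output on stdin and print the
# ocs-deviceset mappings whose names contain the "crypt" keyword.
# Other lines, such as the oc debug banner, are ignored.
encrypted_devicesets() {
  awk '$1 ~ /^ocs-deviceset/ && $1 ~ /crypt/ { print $1 }'
}

# Sample line taken from the example output above.
printf 'ocs-deviceset-sc-lvs-0-data-0clwxf-block-dmcrypt\t(253:0)\n' |
  encrypted_devicesets   # prints ocs-deviceset-sc-lvs-0-data-0clwxf-block-dmcrypt

# Real usage (assumes a logged-in oc session):
#   oc debug node/<new-node-name> -- chroot /host dmsetup ls | encrypted_devicesets
```

If cluster-wide encryption is enabled and the filter prints nothing, treat the verification as failed.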

  8. Exit maintenance mode after all steps are completed.
    oc label -n ibm-spectrum-fusion-ns odfcluster ""
    Example output:
    [root@fu40 ~]# oc label -n ibm-spectrum-fusion-ns odfcluster "" unlabeled
  9. Go to the Data foundation page in the IBM Fusion user interface and check the health of the storage cluster in the Health section.