Replacing a failed VMware node on installer-provisioned infrastructure

Use this information to replace a failed VMware node on an installer-provisioned infrastructure.

Procedure

  1. Log in to the OpenShift Web Console, and click Compute > Nodes.
  2. Identify the faulty node that you need to replace, and click its Machine Name.
  3. Go to Actions > Edit Annotations and click Add More.
  4. Add machine.openshift.io/exclude-node-draining, and click Save.
  5. Go to Actions > Delete Machine and click Delete.
    A new machine is created automatically. Wait for the new machine to start. If you prefer the command line, the annotation and deletion in steps 3 through 5 can also be performed with oc, as sketched after this procedure.
    Important: This activity might take 5 - 10 minutes or more. Ceph errors generated during this period are temporary and resolve automatically after you label the new node and it becomes functional.
  6. Go to Compute > Nodes and confirm that the new node is in a Ready state.
  7. Apply the Fusion Data Foundation label to the new node using one of the following methods:
    From the user interface
    1. Go to Action Menu > Edit Labels.
    2. Add cluster.ocs.openshift.io/openshift-storage, and click Save.
    From the command-line interface
    Apply the Fusion Data Foundation label to the new node, where <new_node_name> specifies the name of the new node:
    oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
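
If you prefer the command line, the annotation and deletion in steps 3 through 5 can also be performed with oc instead of the web console. The following is a minimal sketch rather than the documented procedure; <machine_name> is a placeholder for the machine name identified in step 2, and the openshift-machine-api namespace is where the Machine API objects live on installer-provisioned clusters:
  # Exclude the failed node from draining, then delete its machine.
  oc annotate machine <machine_name> -n openshift-machine-api machine.openshift.io/exclude-node-draining=""
  oc delete machine <machine_name> -n openshift-machine-api
  # Watch for the replacement machine to be provisioned.
  oc get machines -n openshift-machine-api -w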

What to do next

Verify that the new node and all pods are running.
  1. Verify that the new node is present in the output of the following command:
    oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
  2. Click Workloads > Pods and confirm that at least the following pods on the new node are in a Running state (a command-line alternative is sketched after this list):
    • csi-cephfsplugin-*
    • csi-rbdplugin-*
  3. Verify that all the other required Fusion Data Foundation pods are in a Running state.
  4. Verify that the new Object Storage Device (OSD) pods are running on the replacement node:
    oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
  5. If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.

    For each node hosting the new OSD pods identified in the previous step, do the following:

    1. Create a debug pod and open a chroot environment for the selected host:
      oc debug node/<node_name>
      chroot /host
    2. Display the list of available block devices by using the lsblk command:
      lsblk

      Check for the crypt keyword beside the ocs-deviceset names (see the example after this list).

  6. If the verification steps fail, contact IBM Support.
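
As a command-line alternative to the pod checks in steps 2 through 4, you can list every openshift-storage pod scheduled on the new node with a single command. This is a sketch only; <new_node_name> is the node name used in the earlier steps:
  # List all pods in the openshift-storage namespace that run on the new node.
  oc get pods -n openshift-storage -o wide --field-selector spec.nodeName=<new_node_name>
Confirm that the csi-cephfsplugin-*, csi-rbdplugin-*, and OSD pods in the output report a Running status.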
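
For the encryption check in step 5, the same inspection can be run non-interactively from a debug pod. This is a sketch under the same assumptions; when cluster-wide encryption is enabled, the devices that back the ocs-deviceset volumes show crypt in the TYPE column of the lsblk output:
  # Run lsblk on the node through a debug pod without an interactive shell.
  oc debug node/<new_node_name> -- chroot /host lsblk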