Replacing a failed VMware node on user-provisioned infrastructure

Use this information to replace a failed VMware node on a user-provisioned infrastructure.

Before you begin

  • Ensure that the replacement nodes are configured with infrastructure and resources similar to those of the node that you are replacing.
  • You must be logged into the OpenShift Container Platform cluster.

Procedure

  1. Identify the node and its Virtual Machine (VM) that you need to replace.
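    For example, you can list the nodes to find the failed one; this assumes that the failed node reports a NotReady status:
    oc get nodes -o wide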
  2. Delete the node, where <node_name> specifies the name of the node that needs to be replaced.
    oc delete nodes <node_name>
  3. Log in to VMware vSphere, and terminate the VM that you identified.
    Important: Delete the VM only from the inventory and not from the disk.
  4. Create a new VM on VMware vSphere with the required infrastructure.
    For more information, see ../planning/platform_requirements.html.
  5. Create a new OpenShift Container Platform worker node using the new VM.
  6. Check for the Certificate Signing Requests (CSRs) related to OpenShift Container Platform that are in Pending state.
    oc get csr
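    Optionally, filter the output for CSRs that are in the Pending state:
    oc get csr | grep -i pending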
  7. Approve all the required OpenShift Container Platform CSRs for the new node, where <certificate_name> specifies the name of the CSR.
    oc adm certificate approve <certificate_name>
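    If several CSRs are pending, you can approve them with a single command; this sketch assumes that every pending CSR belongs to the new node:
    oc get csr | grep Pending | awk '{print $1}' | xargs oc adm certificate approve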
  8. Go to Compute > Nodes and confirm that the new node is in a Ready state.
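    Alternatively, confirm the node status from the command-line interface, where <new_node_name> specifies the name of the new node:
    oc get node <new_node_name>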
  9. Apply the Fusion Data Foundation label to the new node using one of the following methods:
    From the user interface
    1. Go to Action Menu > Edit Labels.
    2. Add cluster.ocs.openshift.io/openshift-storage, and click Save.
    From the command-line interface
    Apply the Fusion Data Foundation label to the new node, where <new_node_name> specifies the name of the new node:
    oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
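    Optionally, confirm that the label was applied:
    oc get node <new_node_name> --show-labels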

What to do next

Verify that the new node and all pods are running.
  1. Verify that the new node is present in the output of the following command:
    oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
  2. Go to Workloads > Pods and confirm that at least the following pods on the new node are in a Running state:
    • csi-cephfsplugin-*
    • csi-rbdplugin-*
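    From the command-line interface, you can optionally list the CSI pods that are scheduled on the new node, where <new_node_name> specifies the name of the new node:
    oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep csi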
  3. Verify that all the other required Fusion Data Foundation pods are in a Running state.
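    For example, you can list all pods and check their status; this assumes the default openshift-storage namespace for Fusion Data Foundation:
    oc get pods -n openshift-storage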
  4. Verify that the new Object Storage Device (OSD) pods are running on the replacement node:
    oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
  5. If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.

    For each of the new nodes, do the following:

    1. Create a debug pod and open a chroot environment for the selected host or hosts:
      oc debug node/<node_name>
      chroot /host
    2. Display the list of available block devices by using the lsblk command:
      lsblk

      Check for the crypt keyword beside the ocs-deviceset names.
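      For example, you can filter the lsblk output for encrypted devices; this assumes that the encrypted device-mapper names include the ocs-deviceset prefix:
      lsblk -o NAME,TYPE | egrep -i crypt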

  6. If the verification steps fail, contact IBM Support.