Replacing failed nodes on Azure installer-provisioned infrastructure

Use this information to replace a failed Azure node on an installer-provisioned infrastructure.

Procedure

  1. Log in to the OpenShift Web Console, and click Compute > Nodes.
  2. Identify the faulty node that you need to replace and click its Machine Name.
  3. Go to Actions > Edit Annotations and click Add More.
  4. Add machine.openshift.io/exclude-node-draining, and click Save.
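    If you prefer the command line, a sketch of the equivalent annotation (assuming the Machine object is in the openshift-machine-api namespace and <machine_name> is the machine name identified in step 2) is:
    oc annotate machine <machine_name> machine.openshift.io/exclude-node-draining="" -n openshift-machine-api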
  5. Go to Actions > Delete Machine and click Delete.
    A new machine is automatically created. Wait for the new machine to start.
    Important: This activity might take 5 to 10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when you label the new node and it becomes functional.
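    A command-line equivalent of this step, assuming the Machine object is in the openshift-machine-api namespace, might look like the following sketch; the second command watches the machine list so that you can see the replacement machine being provisioned:
    oc delete machine <machine_name> -n openshift-machine-api
    oc get machines -n openshift-machine-api -w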
  6. Go to Compute > Nodes and confirm that the new node is in a Ready state.
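    You can also check the node status from the command line; the following sketch waits for the new node to report Ready, using an assumed 15-minute timeout:
    oc get nodes
    oc wait --for=condition=Ready node/<new_node_name> --timeout=15m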
  7. Apply the Fusion Data Foundation label to the new node by using one of the following methods:
    From the user interface
    1. Go to Actions > Edit Labels.
    2. Add cluster.ocs.openshift.io/openshift-storage, and click Save.
    From the command-line interface
    Apply the Fusion Data Foundation label to the new node, where <new_node_name> specifies the name of the new node:
    oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
  8. Optional: If the failed Azure instance is not removed automatically, terminate the instance from the Azure console.
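    For example, with the Azure CLI you might remove the instance as follows, where <resource_group> and <vm_name> are placeholders for your cluster resource group and the failed virtual machine:
    az vm delete --resource-group <resource_group> --name <vm_name> --yes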

What to do next

Verify that the new node and all pods are running.
  1. Verify that the new node is present in the output of the following command:
    oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
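    Alternatively, a label selector returns the same nodes without the grep and cut filters:
    oc get nodes -l cluster.ocs.openshift.io/openshift-storage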
  2. Click Workloads > Pods and confirm that at least the following pods on the new node are in a Running state (a command-line check is sketched after this list):
    • csi-cephfsplugin-*
    • csi-rbdplugin-*
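    A command-line alternative to the console check is the following sketch, which filters the pods scheduled on the new node by name (assuming <new_node_name> is the node from the previous steps):
    oc get pods -n openshift-storage -o wide --field-selector spec.nodeName=<new_node_name> | egrep 'csi-cephfsplugin|csi-rbdplugin'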
  3. Verify that all the other required Fusion Data Foundation pods are in a Running state.
  4. Verify that the new Object Storage Device (OSD) pods are running on the replacement node:
    oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
  5. If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.

    For each of the new nodes identified in the previous step, do the following:

    1. Create a debug pod and open a chroot environment for the selected host or hosts:
      oc debug node/<node_name>
      chroot /host
    2. Display the list of available block devices by using the lsblk command:
      lsblk

      Check for the crypt keyword beside the ocs-deviceset names.
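      For example, the following sketch narrows the lsblk output to the ocs-deviceset devices and any crypt entries:
      lsblk | egrep 'ocs-deviceset|crypt'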

  6. If the verification steps fail, contact IBM Support.