Replacing a failed AWS node on user-provisioned infrastructure

Use this information to replace a failed AWS node on user-provisioned infrastructure.

Before you begin

  • Ensure that the replacement nodes are configured with infrastructure and resources similar to those of the node that you replace.
  • You must be logged into the OpenShift Container Platform cluster.

Procedure

  1. Identify the Amazon Web Services (AWS) machine instance of the node that you need to replace.
  2. Log in to AWS, and terminate the AWS machine instance that you identified.
  3. Create a new AWS machine instance with the required infrastructure.
    For more information, see Platform requirements.
  4. Create a new OpenShift Container Platform node using the new AWS machine instance.
  5. Check for the Certificate Signing Requests (CSRs) related to OpenShift Container Platform that are in the Pending state.
    oc get csr
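    Optionally, filter the output so that only the pending requests are shown, for example:
    oc get csr | grep -i pending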
  6. Approve all the required OpenShift Container Platform CSRs for the new node, where <certificate_name> specifies the name of the CSR.
    oc adm certificate approve <certificate_name>
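    If there are many pending CSRs, you can optionally approve every CSR that does not yet have a status in a single command. Review the pending list first, because this approves all unapproved CSRs in the cluster, not only the ones for the new node:
    oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve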
  7. Go to Compute > Nodes and confirm that the new node is in a Ready state.
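    Alternatively, confirm the node status from the command-line interface, where <new_node_name> specifies the name of the new node. The STATUS column of the output is expected to show Ready:
    oc get node <new_node_name>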
  8. Apply the Fusion Data Foundation label to the new node using one of the following steps:
    From the user interface
    1. Go to Action Menu > Edit Labels.
    2. Add cluster.ocs.openshift.io/openshift-storage, and click Save.
    From the command-line interface
    Apply the Fusion Data Foundation label to the new node, where <new_node_name> specifies the name of the new node:
    oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
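    Optionally, confirm that the label was applied to the new node:
    oc get node <new_node_name> --show-labels | grep cluster.ocs.openshift.io/openshift-storage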

What to do next

Verify that the new node and all pods are running.
  1. Run the following command and verify that the new node is present in the output:
    oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
  2. Go to Workloads > Pods and confirm that at least the following pods on the new node are in a Running state (a command-line alternative is shown after the list):
    • csi-cephfsplugin-*
    • csi-rbdplugin-*
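    Alternatively, list the CSI plugin pods that are scheduled on the new node from the command-line interface, where <new_node_name> specifies the name of the new node:
    oc get pods -n openshift-storage -o wide | grep <new_node_name> | egrep 'csi-cephfsplugin|csi-rbdplugin'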
  3. Verify that all the other required Fusion Data Foundation pods are in the Running state.
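    For example, list all pods in the openshift-storage namespace and check the STATUS column:
    oc get pods -n openshift-storage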
  4. Verify that the new Object Storage Device (OSD) pods are running on the replacement node:
    oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
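    Alternatively, you can select the OSD pods by label. This assumes that the OSD pods carry the default Rook-Ceph label app=rook-ceph-osd:
    oc get pods -n openshift-storage -l app=rook-ceph-osd -o wide | grep <new_node_name>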
  5. If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.

    For each new node identified in the previous step, do the following:

    1. Create a debug pod and open a chroot environment for the selected host:
      oc debug node/<node_name>
      chroot /host
    2. Display the list of available block devices by using the lsblk command:
      lsblk

      Check for the crypt keyword beside the ocs-deviceset names.
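      For example, you can narrow the listing to the encrypted device mappings only; each ocs-deviceset device is expected to appear with the type crypt:
      lsblk --output NAME,TYPE | grep crypt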

  6. If the verification steps fail, contact IBM Support.