Cleaning up localhost node from the IBM Storage Fusion HCI System cluster

Follow the instructions to clean up the localhost node from the IBM Storage Fusion HCI System cluster.

About this task

The localhost node must not be added to the OpenShift® cluster because it causes issues at a later stage. When the localhost node is added to the IBM Storage Fusion HCI System cluster, several OpenShift objects are created for it, and they must be cleaned up.

Procedure

Follow these steps to clean up the localhost node so that the node addition can be retried after you resolve the DHCP or DNS misconfiguration.
  1. Edit the ComputeProvisionWorker CR of the compute node that reported the error message This node might contain an invalid hostname (localhost). Update the location field to an empty string to ensure that the ComputeProvisionWorker controller is not involved while you clean up the different objects of the compute node.
    # oc edit cpw provisionworker-compute-1-ru5
    ...
    spec:
      location: ""
  2. Delete the respective machine object of the compute node.

    First identify the machine object and mark it for deletion with an annotation, then scale down the machineset object to delete the machine object.

    1. Run the following command to get the machine object name from the BMH object.
      # oc -n openshift-machine-api get bmh,machine
      Example output:
      
      NAME                                    STATE                    CONSUMER                           ONLINE   ERROR   AGE
      baremetalhost.metal3.io/compute-1-ru5   provisioned              isf-rackae6-42ps4-worker-0-r6fmw   true             22h
      baremetalhost.metal3.io/compute-1-ru6   provisioned              isf-rackae6-42ps4-worker-0-t922n   true             22h
      baremetalhost.metal3.io/compute-1-ru7   provisioned              isf-rackae6-42ps4-worker-0-g842m   true             22h
      baremetalhost.metal3.io/control-1-ru2   externally provisioned   isf-rackae6-42ps4-master-0         true             44h
      baremetalhost.metal3.io/control-1-ru3   externally provisioned   isf-rackae6-42ps4-master-1         true             44h
      baremetalhost.metal3.io/control-1-ru4   externally provisioned   isf-rackae6-42ps4-master-2         true             44h
      
      NAME                                                            PHASE     TYPE   REGION   ZONE   AGE
      machine.machine.openshift.io/isf-rackae6-42ps4-master-0         Running                          44h
      machine.machine.openshift.io/isf-rackae6-42ps4-master-1         Running                          44h
      machine.machine.openshift.io/isf-rackae6-42ps4-master-2         Running                          44h
      machine.machine.openshift.io/isf-rackae6-42ps4-worker-0-g842m   Running                          22h
      machine.machine.openshift.io/isf-rackae6-42ps4-worker-0-r6fmw   Running                          22h
      machine.machine.openshift.io/isf-rackae6-42ps4-worker-0-t922n   Running                          22h
      
      In this example, the BMH object compute-1-ru5 maps to the machine object isf-rackae6-42ps4-worker-0-r6fmw, as shown in the CONSUMER column.
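      Alternatively, you can read the mapping directly from the BMH object. This sketch relies on the spec.consumerRef field of the BareMetalHost resource and prints the name of the machine that consumes the host.
      # oc -n openshift-machine-api get bmh compute-1-ru5 -o jsonpath='{.spec.consumerRef.name}'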
    2. Mark the machine object for deletion with the special annotation machine.openshift.io/cluster-api-delete-machine: delete-me. This annotation overrides the machine deletion policy and ensures that this specific machine is the one removed when the machineset is scaled down.
      # oc -n openshift-machine-api edit machine isf-rackae6-42ps4-worker-0-r6fmw
      
      apiVersion: machine.openshift.io/v1beta1
      kind: Machine
      metadata:
        annotations:
          machine.openshift.io/cluster-api-delete-machine: delete-me
      ...
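      As an alternative to editing the machine object interactively, you can apply the same annotation non-interactively with oc annotate; this sketch is equivalent to the edit above, not an additional step.
      # oc -n openshift-machine-api annotate machine isf-rackae6-42ps4-worker-0-r6fmw machine.openshift.io/cluster-api-delete-machine=delete-me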
      
    3. Scale down the machineset object to initiate the deletion of the machine object.
      Note: After the machineset is scaled down, the machine object corresponding to compute-1-ru5 is cleaned up, and the status of the BMH object corresponding to compute-1-ru5 changes to deprovisioning.
      # oc -n openshift-machine-api get machineset
      NAME                         DESIRED   CURRENT   READY   AVAILABLE   AGE
      isf-rackae6-42ps4-worker-0   3         3         3       3           8h
      
      # oc -n openshift-machine-api get machineset -oyaml | grep replicas
       replicas: 3
      
      # oc -n openshift-machine-api scale --replicas=<old replica value - 1> machineset <machineset name>
      In this example, the current replica count is 3, so scale down to 2:
      # oc -n openshift-machine-api scale --replicas=2 machineset isf-rackae6-42ps4-worker-0
      machineset.machine.openshift.io/isf-rackae6-42ps4-worker-0 scaled
      
    4. After the machineset scale-down completes successfully, the state of the BMH object corresponding to compute-1-ru5 changes to available.
      Important: This activity might take a few minutes. Wait for the state change to be reflected on the BMH object before you proceed with the next steps.
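      To avoid checking repeatedly, you can watch the BMH object until its state settles; press Ctrl+C when the state reads available.
      # oc -n openshift-machine-api get bmh compute-1-ru5 -w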
  3. Delete the localhost node object from the OpenShift Container Platform cluster, if one is present.
    # oc get nodes
    
    NAME                                            STATUS   ROLES           AGE   VERSION
    compute-1-ru6.isf-rackae6.rtp.raleigh.ibm.com   Ready    worker          21h   v1.23.17+16bcd69
    compute-1-ru7.isf-rackae6.rtp.raleigh.ibm.com   Ready    worker          21h   v1.23.17+16bcd69
    control-1-ru2.isf-rackae6.rtp.raleigh.ibm.com   Ready    master,worker   43h   v1.23.17+16bcd69
    control-1-ru3.isf-rackae6.rtp.raleigh.ibm.com   Ready    master,worker   43h   v1.23.17+16bcd69
    control-1-ru4.isf-rackae6.rtp.raleigh.ibm.com   Ready    master,worker   43h   v1.23.17+16bcd69
    localhost                                       Ready    worker          20h   v1.23.17+16bcd69
    
    # oc delete node localhost 
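    Optionally, confirm that no localhost entry remains; the following command must return no output.
    # oc get nodes | grep -i localhost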
    
  4. Delete the BMH object of the compute node.
    # oc -n openshift-machine-api get bmh
    
    NAME            STATE                    CONSUMER                           ONLINE   ERROR   AGE
    compute-1-ru5   available                                                   false            24h
    compute-1-ru6   provisioned              isf-rackae6-42ps4-worker-0-t922n   true             24h
    compute-1-ru7   provisioned              isf-rackae6-42ps4-worker-0-g842m   true             24h
    control-1-ru2   externally provisioned   isf-rackae6-42ps4-master-0         true             46h
    control-1-ru3   externally provisioned   isf-rackae6-42ps4-master-1         true             46h
    control-1-ru4   externally provisioned   isf-rackae6-42ps4-master-2         true             46h
    
    # oc -n openshift-machine-api delete bmh compute-1-ru5
    baremetalhost.metal3.io "compute-1-ru5" deleted
    
  5. Delete all pending CertificateSigningRequests corresponding to the localhost node.
    # for i in `oc get csr --no-headers | grep -i system:node:localhost | grep -i pending | awk '{ print $1 }'`;do oc delete csr $i; done
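    To verify the cleanup, list any remaining CSRs for the localhost node; no output means that all pending requests were deleted.
    # oc get csr | grep -i system:node:localhost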
    
  6. Fix the DNS or DHCP misconfiguration so that the corresponding compute node gets its correct hostname instead of localhost.
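    One way to verify the fix is a reverse DNS lookup of the compute node IP address from the provisioning node; it must return the expected fully qualified hostname instead of localhost. The IP address here is a placeholder for your environment.
    # dig -x <compute node IP address> +short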
  7. Delete the ComputeProvisionWorker object of the compute node.
    # oc -n ibm-spectrum-fusion-ns get cpw 
    NAME                            AGE
    provisionworker-compute-1-ru5   24h
    provisionworker-compute-1-ru6   24h
    provisionworker-compute-1-ru7   24h
    
    # oc -n ibm-spectrum-fusion-ns delete cpw provisionworker-compute-1-ru5
    computeprovisionworker.install.isf.ibm.com "provisionworker-compute-1-ru5" deleted
    
  8. If the issue occurred during installation at the time of OpenShift configuration, the node conversion resumes automatically. If it occurred during a node upsize, retry the node addition.