Add node failure

Node addition failures and their resolutions.

Add node failure due to inspection error

Cause
This issue occurs because of the ipmitool version that is installed in the rack. Run the following command to turn on IPMI over LAN; the response states the required cipher suite:
portcontrol -ipmi on
Example response:
The current security settings require incoming IPMI over LAN to use cipher suite ID 17.
If you are using the IPMItool utility (prior to version 1.8.19), you must specify the option <-C 17>.
To effectively enable this service, you must ensure that 'ipmi' is selected in the '-ai' option of the 'users' command.
ok
Run the following command to check the installed ipmitool version:
ipmitool -V
Example response:
ipmitool version 1.8.18
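If the installed ipmitool version is earlier than 1.8.19, you can still reach the BMC by passing the cipher suite explicitly with the -C 17 option, as the portcontrol response states. The following command is a sketch only; <bmc-address>, <user>, and <password> are placeholders for your BMC details:
ipmitool -I lanplus -C 17 -H <bmc-address> -U <user> -P <password> chassis status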
Diagnosis
  1. Log in to the compute operator pod.
  2. Go to the /fw/onecli directory.
  3. Run the following command to set the IMM security mode to Compatibility (a variant with the --output parameter follows the steps):
    ./OneCli  config set IMM.Security_Mode Compatibility --bmc USERID:'Wo6k5tSKT0'@[fd8c:215d:178e:c0de:ea80:88ff:fe07:b66b] --never-check-trust
    Example output:
    Create output directory at /fw/onecli/logs/OneCli-20241007-062329-988988/ failed with error is 13
    
    [WARNING]: No permission to create output directory at /fw/onecli/logs/OneCli-20241007-062329-988988/
    All output file will not be saved. You can specify another directory by "--output" parameter or with higher permission.
    
    Start to connect BMC at fd8c:215d:178e:c0de:ea80:88ff:fe07:b66b to apply config set
    Connected to BMC at IP address  by REDFISH
    Invoking SET command ...
    IMM.Security_Mode=Compatibility
    /fw/onecli/logs/OneCli-988988-20241007-062329/common_result.xml: cannot open file
    
    No permission to create output directory at /fw/onecli/logs/OneCli-20241007-062329-988988/
    All output file will not be saved.
    Succeed.
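The permission warnings in this output are harmless, but if you want OneCli to save its logs, point it to a writable location with the --output parameter that the message mentions. The following variant is a sketch only; the BMC credentials and address are the same values as in step 3, and /tmp/onecli-logs is an example path:
./OneCli config set IMM.Security_Mode Compatibility --bmc USERID:'<password>'@[<bmc-ipv6-address>] --never-check-trust --output /tmp/onecli-logs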
Resolution
To resolve the issue, see IPMI check Error.

Add node failure to OpenShift Container Platform cluster

Node addition to the Red Hat® OpenShift® Container Platform cluster fails.

Severity
Warning or Error
Explanation

Node addition to the Red Hat OpenShift Container Platform cluster reports a failure. As a part of the node addition process, IBM Storage Fusion HCI System runs several validations and steps. If any of them fail, a node addition failure is reported.

Some of the reported failures resolve automatically after a few minutes because IBM Storage Fusion HCI System retries the same validations and steps in every reconciliation cycle. You can view the actual failure reason in the status of the computeprovisionworker CR.

In the following example, the message field indicates a node DNS validation failure. Determine the failure message from the computeprovisionworker CR status, and follow the steps in the Recommended actions section.

oc -n ibm-spectrum-fusion-ns get cpw provisionworker-compute-1-ru5 -o yaml
apiVersion: install.isf.ibm.com/v1
kind: ComputeProvisionWorker
metadata:
 creationTimestamp: "2024-05-01T21:39:53Z"
 generation: 1
 name: provisionworker-compute-1-ru5
 namespace: ibm-spectrum-fusion-ns
 resourceVersion: "8733225"
 uid: 3aa59cb6-c032-4b31-98ec-26284ed4588a
spec:
 location: RU5
 rackSerial: 8Y2DXXX
status:
 hostname: compute-1-ru5.mycluster.mydomain.com
 installStatus: Installing
 ipAddress: XX.YY.ZZ.15
 location: RU5
 macAddress: 08:xx:eb:ff:yy:zz
 machineSetScaled: true
 message: DNS validation for node failed.
 messageCode: IOCPW0011
 name: provisionworker-compute-1-ru5
 nodeType: storage
 ocpRole: compute-1-ru5
 oneTimeNetworkBoot: Completed
 rackName: mycluster
 startTimeStamp: "2024-05-01T21:39:54Z"
Recommended actions
DNS validation for node failed
The error indicates a node DNS validation failure, which can be intermittent because of an unsuccessful DNS call and usually succeeds in the next cycle. However, if the issue persists for more than 10 minutes, check the DHCP and DNS configurations to confirm that the correct FQDN-to-IP mapping exists for both the DNS lookup and the reverse lookup. After you fix the issue, the node addition in IBM Storage Fusion HCI System proceeds automatically to the next step.
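To check the DNS configuration, run forward and reverse lookups against the node FQDN and IP address from the computeprovisionworker status. The following commands are a sketch that uses the example hostname and IP address from the earlier output; both lookups must return matching results:
nslookup compute-1-ru5.mycluster.mydomain.com
nslookup XX.YY.ZZ.15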
Unable to add a node and failed to set boot order
Usually, this failure gets resolved in the subsequent reconciliation cycles. If it persists for more than 10 minutes, a Baseboard Management Controller (BMC) restart can resolve the issue because sometimes the IMM does not respond to IPMI or Redfish commands. Contact IBM Support to restart the BMC.

After you restart the BMC, the error gets resolved in the subsequent reconciliation cycles.
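To confirm that the BMC responds after the restart, you can query it with ipmitool. This command is a sketch only; <bmc-address>, <user>, and <password> are placeholders, and the -C 17 option applies only when the rack security settings require cipher suite 17:
ipmitool -I lanplus -C 17 -H <bmc-address> -U <user> -P <password> mc info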

Failed to read userconfig secret object
The error indicates that an intermittent API error exists in reading the OpenShift Container Platform secret. It can occur due to a delay in the API response or some other network interruption. This error self-resolves automatically in the successive reconciliation cycles.
Failed to read kickstart configmap data
The error indicates that an intermittent API error exists in reading the OpenShift Container Platform configmap. It can occur due to a delay in the API response or some other network interruption. It self-resolves automatically in the successive reconciliation cycles.
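For either of these intermittent API errors, you can confirm that the message clears in a later reconciliation cycle by rechecking the computeprovisionworker status. For example, with the CR name from the earlier output:
oc -n ibm-spectrum-fusion-ns get cpw provisionworker-compute-1-ru5 -o yaml | grep -i message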
Unable to add a node and the Red Hat OpenShift node did not transition to Ready state
Pending certificate signing requests (CSRs) are the most common reason for this error. The IBM Storage Fusion HCI System operator approves these CSRs as a part of the node addition process. In the rare case that this does not happen, manually approve them to resolve the issue. Run the following commands to check and approve the pending CSRs, and then verify the node state as shown after the steps:
  1. Run the following command to look for pending CSRs:
    oc get csr | grep -i pending
  2. Run the following command to approve pending CSRs:
    oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve
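After you approve the pending CSRs, verify that the node transitions to the Ready state. For example, with the hostname from the earlier computeprovisionworker output:
oc get node compute-1-ru5.mycluster.mydomain.com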
Failed to scale machineset during new node addition
As a part of the Red Hat OpenShift cluster expansion process, a Bare Metal host object gets created for every node that is added to the cluster. This Bare Metal host object goes through the following states: registering > inspecting > available > provisioning > provisioned.
After the Bare Metal host becomes available, the replica count of the machineset CR in the OpenShift Container Platform cluster increases by one to start the machine creation process for that host. If that step fails, an error is reported in the computeprovisionworker status. The Fusion operator retries the increase in the next reconciliation cycle, and it eventually succeeds. In the rare case that it does not succeed, do the following steps to manually increase the replica count of the machineset CR (a combined sketch follows the steps).
  1. Run the following command to get the machineset name:
    oc get machinesets -n openshift-machine-api | tail -1 | awk '{print $1}'
  2. Run the following command to get the machineset current replica count:
    oc get machinesets -n openshift-machine-api | grep worker | awk '{print $2}'
  3. Run the following command to increase the replica count by one, where <new replica count> is the value from step 2 plus one:
    oc scale --replicas=<new replica count> machineset <machineset name from step 1> -n openshift-machine-api
In the subsequent reconciliation cycles, the node addition process moves to the next steps and gets reflected in the IBM Storage Fusion HCI System user interface.
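The following shell sketch combines the previous steps into one sequence. It assumes that the machineset listed last by oc get machinesets is the worker machineset to scale; confirm the name before you run it:
# Get the machineset name and its current replica count, then scale it up by one.
MS=$(oc get machinesets -n openshift-machine-api | tail -1 | awk '{print $1}')
REPLICAS=$(oc get machineset "$MS" -n openshift-machine-api -o jsonpath='{.spec.replicas}')
oc scale machineset "$MS" -n openshift-machine-api --replicas=$((REPLICAS + 1))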
Node might contain an invalid hostname (localhost)
The error indicates that the added host did not get the expected fully qualified hostname assigned from DHCP. A common reason for this error is that the DHCP server is not configured correctly. Check the following items in the DHCP server:
  1. Hostname option is set for this host entry
  2. No mistakes exist in the MAC address that is used for this host entry
  3. DHCP service is running and it is reachable from the host
After you fix the issue on the DHCP side, restart the node by using the ComputePowerOp CR so that the node addition can proceed. To restart the node, do the following steps from the OpenShift Container Platform web console (a command-line alternative follows the steps):
  1. Go to Home > Search.
  2. Search for CRs of kind ComputePowerOp. The search lists the CRs for the nodes in the cluster.
  3. Go to the CR of the node to restart and then go to the YAML tab.
  4. Add the following line to the spec to power off the node:
    powerOP: GracefulShutdown
    If you want to power on the node, change the line to the following line:
    powerOP: On
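If you prefer the command line over the web console, the same powerOP field can be set with oc patch. The following commands are a sketch only; the ComputePowerOp CR name (compute-1-ru5 here) and its namespace are assumptions, so list the CRs first to confirm them:
oc get computepowerop -A
oc patch computepowerop compute-1-ru5 -n ibm-spectrum-fusion-ns --type merge -p '{"spec":{"powerOP":"GracefulShutdown"}}'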
Unable to add one or more nodes as they contain an invalid hostname (localhost)
This error indicates that the host you want to add did not get the expected valid fully qualified hostname assigned from DHCP. A common reason for this error is that the DHCP server is not configured correctly. In a DHCP server, check whether the following conditions exist:
  • Hostname option is set for this host entry
  • No mistakes exist in the MAC address that is used for this host entry
  • DHCP service is running and DHCP is reachable from the host
After you fix the DHCP issues, restart the node to proceed with the node addition. To restart the node, do the following steps from the Red Hat OpenShift Container Platform web console:
  1. Go to Home > Search.
  2. Search for CRs with ComputePowerOp.
  3. Go to the CR of the node to restart and then go to the YAML tab.
  4. Add the following line to the spec to power off the node:
    powerOP: GracefulShutdown
    If you want to power on the node, change the line to the following line:
    powerOP: On