Add node failure
Node addition failures and their resolutions.
Add node failure due to inspection error
- Cause
- This issue occurs because of the ipmitool version in the rack. Run the following command to check the IPMI over LAN status:
portcontrol -ipmi on
Example response:
The current security settings require incoming IPMI over LAN to use cipher suite ID 17. If you are using the IPMItool utility (prior to version 1.8.19), you must specify the option <-C 17>. To effectively enable this service, you must ensure that 'ipmi' is selected in the '-ai' option of the 'users' command. ok
Run the following command to check the ipmitool version:
ipmitool -V
Example response:
ipmitool version 1.8.18
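The version comparison above can be scripted. The following sketch decides whether the `-C 17` option is needed, based on the message in the example response (versions prior to 1.8.19 must pass the cipher suite explicitly); the function name is illustrative, not part of any product tooling.

```shell
# Sketch: decide whether this ipmitool build needs an explicit "-C 17"
# cipher-suite option. Per the message above, versions prior to 1.8.19
# must specify it; newer versions do not. Function name is hypothetical.
needs_cipher_option() {
  ver="$1"                                  # e.g. "1.8.18"
  required="1.8.19"
  # sort -V orders version strings; if ver sorts first and differs, it is older.
  oldest=$(printf '%s\n%s\n' "$ver" "$required" | sort -V | head -1)
  [ "$oldest" = "$ver" ] && [ "$ver" != "$required" ]
}

if needs_cipher_option "1.8.18"; then
  echo "use: ipmitool -C 17 ..."
else
  echo "use: ipmitool ..."
fi
```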
- Diagnosis
-
- Log in to the compute operator pod.
- Go to the /fw/onecli directory.
- Run the following command to set the IMM security mode to Compatibility:
./OneCli config set IMM.Security_Mode Compatibility --bmc USERID:'Wo6k5tSKT0'@[fd8c:215d:178e:c0de:ea80:88ff:fe07:b66b] --never-check-trust
Example output:
Create output directory at /fw/onecli/logs/OneCli-20241007-062329-988988/ failed with error is 13
[WARNING]: No permission to create output directory at /fw/onecli/logs/OneCli-20241007-062329-988988/ All output file will not be saved. You can specify another directory by "--output" parameter or with higher permission.
Start to connect BMC at fd8c:215d:178e:c0de:ea80:88ff:fe07:b66b to apply config set
Connected to BMC at IP address by REDFISH
Invoking SET command ...
IMM.Security_Mode=Compatibility
/fw/onecli/logs/OneCli-988988-20241007-062329/common_result.xml: cannot open file
No permission to create output directory at /fw/onecli/logs/OneCli-20241007-062329-988988/ All output file will not be saved.
Succeed.
- Resolution
- To resolve the issue, see IPMI check Error.
Add node failure to OpenShift Container Platform cluster
Node addition to the Red Hat® OpenShift® Container Platform cluster fails.
- Severity
- Warning or Error
- Explanation
-
Node addition to the Red Hat OpenShift Container Platform cluster reports a failure. As a part of the node addition process, IBM Storage Fusion HCI System validates and runs a few steps. If any of these validations or steps fail, a node addition failure gets reported.
Some of the reported failures can auto resolve after a few minutes because the IBM Storage Fusion HCI System retries the same validations and steps in every reconciliation cycle. You can view the actual failure reason in the status of the computeprovisionworker CR. In the following example, the message field indicates a node DNS validation failure. Determine the failure message from the computeprovisionworker CR status, and follow the steps in the Recommended actions section.
oc -n ibm-spectrum-fusion-ns get cpw provisionworker-compute-1-ru5 -o yaml
apiVersion: install.isf.ibm.com/v1
kind: ComputeProvisionWorker
metadata:
  creationTimestamp: "2024-05-01T21:39:53Z"
  generation: 1
  name: provisionworker-compute-1-ru5
  namespace: ibm-spectrum-fusion-ns
  resourceVersion: "8733225"
  uid: 3aa59cb6-c032-4b31-98ec-26284ed4588a
spec:
  location: RU5
  rackSerial: 8Y2DXXX
status:
  hostname: compute-1-ru5.mycluster.mydomain.com
  installStatus: Installing
  ipAddress: XX.YY.ZZ.15
  location: RU5
  macAddress: 08:xx:eb:ff:yy:zz
  machineSetScaled: true
  message: DNS validation for node failed.
  messageCode: IOCPW0011
  name: provisionworker-compute-1-ru5
  nodeType: storage
  ocpRole: compute-1-ru5
  oneTimeNetworkBoot: Completed
  rackName: mycluster
  startTimeStamp: "2024-05-01T21:39:54Z"
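Rather than reading the full YAML, you can pull out only the failure message. The following sketch shows the extraction against an inline sample of the status fields above; on a live cluster, a jsonpath query such as `oc -n ibm-spectrum-fusion-ns get cpw <name> -o jsonpath='{.status.message}'` would serve the same purpose (the CR name and namespace here are taken from the example).

```shell
# Sketch: extract the "message" field from CR status output.
# The sample text mirrors the status fields shown in the example above.
status='installStatus: Installing
message: DNS validation for node failed.
messageCode: IOCPW0011'

# Match only the exact "message:" key (not "messageCode:") and print its value.
printf '%s\n' "$status" | awk -F': ' '/^message:/{print $2}'
```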
- Recommended actions
-
- DNS validation for node failed
- The error indicates a node DNS validation failure, which can be intermittent because of an unsuccessful DNS call. It eventually succeeds in the next cycle. However, if the issue persists for more than 10 minutes, check the DHCP and DNS configurations to confirm that the correct FQDN to IP exists for the DNS lookup and reverse lookup. After you fix the issue, the node addition in IBM Storage Fusion HCI System proceeds automatically to the next step.
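The lookup-and-reverse-lookup check described above can be sketched as follows. The FQDN, IP address, and lookup results here are hypothetical stand-ins for what a resolver query (for example, `dig +short`) would return on the rack; the point is the consistency test, not the lookup itself.

```shell
# Sketch: check that forward and reverse DNS agree for a node.
# All values are hypothetical placeholders; on a real system, populate them
# from resolver queries, e.g. FORWARD_IP=$(dig +short "$FQDN").
FQDN="compute-1-ru5.mycluster.mydomain.com"
FORWARD_IP="10.0.0.15"                                  # forward lookup result
REVERSE_FQDN="compute-1-ru5.mycluster.mydomain.com."    # reverse lookup result

# Reverse lookups return a trailing dot; strip it before comparing.
if [ "${REVERSE_FQDN%.}" = "$FQDN" ]; then
  echo "DNS forward/reverse lookup consistent for $FQDN ($FORWARD_IP)"
else
  echo "DNS mismatch: $FQDN -> $FORWARD_IP -> $REVERSE_FQDN"
fi
```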
- Unable to add a node and failed to set boot order
- Usually, this failure gets resolved in the subsequent reconcile cycles. If it persists for more than 10 minutes, a Baseboard Management Controller (BMC) restart can resolve the issue. Sometimes, the IMM does not respond to ipmi or redfish commands. Contact IBM support to restart the BMC. After you restart the BMC, the error gets resolved in the subsequent reconciliation cycles.
- Failed to read userconfig secret object
- The error indicates that an intermittent API error exists in reading the OpenShift Container Platform secret. It can occur due to a delay in the API response or some other network interruption. This error self-resolves automatically in the successive reconciliation cycles.
- Failed to read kickstart configmap data
- The error indicates that an intermittent API error exists in reading the OpenShift Container Platform configmap. It can occur due to a delay in the API response or some other network interruption. It self-resolves automatically in the successive reconciliation cycles.
- Unable to add a node and the Red Hat OpenShift node did not transition to Ready state
- CSRs pending for approval are the most common reason for this error. The IBM Storage Fusion HCI System operator approves these CSRs as a part of the node addition process. In the rare case that this approval does not occur, manually approve the CSRs to resolve the issue. Run the following commands to check and approve the pending CSRs:
- Run the following command to look for pending CSRs:
oc get csr | grep -i pending
- Run the following command to approve pending CSRs:
oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve
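To make the filtering in these commands concrete, the following sketch applies the same `grep`/`awk` selection to sample `oc get csr` output (the CSR names and columns here are hypothetical, modeled on typical kubelet bootstrap CSRs):

```shell
# Sketch: the selection logic behind "oc get csr | grep -i pending",
# shown against hypothetical sample output so the filter is clear.
sample='csr-2bq6x   5m   kubernetes.io/kubelet-serving   system:node:compute-1   Approved,Issued
csr-7fk2p   1m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending'

# Keep only rows whose condition column says Pending, then print the CSR name.
printf '%s\n' "$sample" | grep -i pending | awk '{print $1}'
```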
- Failed to scale machineset during new node addition
- As a part of the Red Hat OpenShift cluster expansion process, a Bare Metal host object gets created for every node that is added to the cluster. This Bare Metal host object goes through the following states: registering > inspecting > available > provisioning > provisioned. After the Bare Metal host becomes available, the replica count of the Bare Metal machineset CR in the OpenShift Container Platform cluster increases by one to start the machine creation process for that host. If that step fails, you can see a reported error in the computeprovisionworker status. The Fusion operator tries to increase it again in the next reconcile cycle, and that eventually succeeds. In the rare case that it does not succeed, do the following steps to manually increase the replica count of the machineset CR.
- Run the following command to get the machineset name:
oc get machinesets -n openshift-machine-api | tail -1 | awk '{print $1}'
- Run the following command to get the machineset current replica
count:
oc get machinesets -n openshift-machine-api | grep worker | awk '{print $2}'
- Run the following command to increase the replica count by one, where <new replica count> is the value from step 2 plus 1:
oc scale --replicas=<new replica count> machineset <machineset name from step 1> -n openshift-machine-api
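Because `oc scale` takes an absolute replica count rather than an increment, the arithmetic from step 2 must happen before the call. The following sketch shows that computation; `CURRENT` stands in for the value returned in step 2, the machineset name is hypothetical, and `oc` itself is only printed, not executed, here.

```shell
# Sketch: compute the new replica count before calling "oc scale".
# CURRENT is a hypothetical stand-in for the value from step 2;
# "worker-ms" is a hypothetical machineset name from step 1.
CURRENT=3
NEW=$((CURRENT + 1))
CMD="oc scale --replicas=$NEW machineset worker-ms -n openshift-machine-api"
echo "$CMD"    # on a live cluster, run the command instead of printing it
```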
- Node might contain an invalid hostname (localhost)
- The error indicates that the added host did not get the expected fully qualified hostname assigned from DHCP. A common reason for this error is that the DHCP server is not configured correctly. Check the following content in the DHCP server:
- Hostname option is set for this host entry
- No mistakes exist in the mac address that is used for this host entry
- DHCP service is running and it is reachable from the host
After you fix the issue on the DHCP side, restart the node by using the Bare Metal host object from the Red Hat OpenShift console. To restart the node, do the following steps from the OpenShift Container Platform web console:
- Go to Home > Search.
- Search for CRs with ComputePowerOp. It lists CRs and nodes in the cluster.
- Go to the CR of the node to restart and then go to the YAML tab.
- Add the following line to Spec to power off the node:
powerOP: GracefulShutdown
If you want to power on the node, change the line to the following line:
powerOP: On
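The same Spec edit can be made from the CLI with the standard `oc patch` command instead of the console YAML tab. This is a sketch only: the CR name is hypothetical, and the assumption that the ComputePowerOp CR lives in the ibm-spectrum-fusion-ns namespace and exposes powerOP under spec is inferred from the steps above, not confirmed by product documentation. The block only builds and prints the payload.

```shell
# Sketch: CLI alternative to editing the ComputePowerOp CR in the console.
# Assumptions (not confirmed by the source): the CR is in ibm-spectrum-fusion-ns
# and powerOP sits under .spec. "compute-1-ru5" is a hypothetical CR name.
OP="GracefulShutdown"        # change to "On" to power the node back on
PATCH="{\"spec\":{\"powerOP\":\"$OP\"}}"
echo "oc -n ibm-spectrum-fusion-ns patch computepowerop compute-1-ru5 --type merge -p '$PATCH'"
```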
- Unable to add one or more nodes as they contain an invalid hostname (localhost)
- This error indicates that the host you want to add did not get the expected valid fully
qualified hostname assigned from DHCP. A common reason for this error is that the DHCP server is not
configured correctly. In a DHCP server, check whether the following conditions exist:
- Hostname option is set for this host entry
- No mistake in the mac address used for this host entry
- DHCP service is running and DHCP is reachable from the host
After you fix the DHCP issues, restart the node to proceed with the node addition. To restart the node, do the following steps from the Red Hat OpenShift Container Platform web console:
- Go to Home > Search.
- Search for CRs with ComputePowerOp.
- Go to the CR of the node to restart and then go to the YAML tab.
- Add the following line to Spec to power off the node:
powerOP: GracefulShutdown
If you want to power on the node, change the line to the following line:
powerOP: On