Compute node issues
Known issues while you work with the compute nodes.
Note: If a node does not get rebooted even after you complete the workaround steps mentioned in this
troubleshooting section, contact IBM Support.
- Sometimes, after you reboot a node, it might not change its state to Ready, and the ovs-configuration.service might fail.
- Resolution
- Reboot the node.
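If you have CLI access, the reboot can also be issued from a debug pod. A minimal sketch, assuming a hypothetical node name and cluster-admin access:

```shell
# Hypothetical node name; replace with the affected node.
NODE=compute-1-ru24

# Check the node state before rebooting.
oc get node "$NODE"

# Reboot the node from a host-chrooted debug shell (requires cluster-admin).
oc debug node/"$NODE" -- chroot /host systemctl reboot
```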
-
- AFM node failed to upsize
- After you upsize AFM nodes, the Events page displays the following error:
<node name> Unable to add a new node. Failed to set boot order at <time stamp>
This issue is also applicable when AFM nodes from factory get added to IBM Storage Fusion HCI System during stage 3 of the installation.
Resolution:
- Go to Infrastructure > Nodes.
- Click the Discovered tab.
- Select the AFM node and click Add to cluster or click the + icon.
- In the Add node window, click Add.
- Take a backup of the CPW and then run the following command:
  oc get cpw provisionworker-compute-1-ru24 -oyaml > /provisionworker-compute-1-ru24
- Delete the CPW for that node:
  oc delete cpw provisionworker-compute-1-ru24
- Edit the CPW backup copy with provisioningInterfaceLabel added to spec:
  oc edit cpw /provisionworker-compute-1-ru24
  Example:
  apiVersion: install.isf.ibm.com/v1
  kind: ComputeProvisionWorker
  metadata:
    creationTimestamp: "2023-07-11T15:37:31Z"
    generation: 1
    name: provisionworker-compute-1-ru24
    namespace: ibm-spectrum-fusion-ns
    resourceVersion: "5177873"
    uid: 6454d43-ba8d-40b7-8100-6e5efd1401a8
  spec:
    location: RU24
    provisioningInterfaceLabel: 'UEFI: SLOT3 (86/0/0) PXE IP6 Mellanox Network Adapter'
    rackSerial: RackH
- Run the following command to create the CPW:
  oc create -f /provisionworker-compute-1-ru24
- Wait for five to ten minutes, and then run the following command to verify that no boot order error exists:
  oc get cpw provisionworker-compute-1-ru24 -oyaml
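Instead of reading the whole YAML output, you can filter for the boot order message. A sketch, assuming the error text matches the message shown on the Events page:

```shell
# Filter the CPW YAML for the "Failed to set boot order" message (sketch).
# If grep finds a match, the error is still present.
oc get cpw provisionworker-compute-1-ru24 -n ibm-spectrum-fusion-ns -oyaml \
  | grep -i 'boot order' \
  && echo "boot order error still present" \
  || echo "no boot order error found"
```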
-
- Compute node moved to failed state
- If, after a power operation, the node shows the following error message, contact IBM
Support:
Failed to get storage information from the IMM of the node
-
- Compute nodes in hung state
- Steps to reboot when the compute nodes get into a hung state:
As a prerequisite, you must have administrator access to Red Hat® OpenShift® and access to the oc (Red Hat OpenShift CLI) command for the cluster.
- Log in to OpenShift UI.
- Go to Compute > Bare Metal Hosts.
- Make sure that the openshift-machine-api project is selected on the Bare Metal Hosts page.
- From the ellipsis menu, click Power Off for the node.
- Power it on again and check if the issue is resolved.
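The same power cycle can also be done from the CLI by toggling the BareMetalHost online flag (the metal3 spec.online field controls host power). A minimal sketch, assuming a hypothetical host name:

```shell
# Hypothetical BareMetalHost name; list hosts to find the hung node.
BMH=compute-1-ru24

oc get bmh -n openshift-machine-api

# Power the host off by setting spec.online to false.
oc patch bmh "$BMH" -n openshift-machine-api --type merge \
  -p '{"spec":{"online":false}}'

# After the host reports powered off, power it back on.
oc patch bmh "$BMH" -n openshift-machine-api --type merge \
  -p '{"spec":{"online":true}}'
```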
-
- Pod migration gets stuck in Terminating or ContainerCreating state
-
A controller node goes down and the migration of a pod gets stuck in Terminating or ContainerCreating state.
As a workaround, delete the pod that is stuck in ContainerCreating state so that it can be scheduled on another available controller node.
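The workaround can be sketched as the following commands, with hypothetical pod and namespace names:

```shell
# Find pods stuck in ContainerCreating or Terminating state (sketch).
oc get pods --all-namespaces | grep -E 'ContainerCreating|Terminating'

# Delete the stuck pod so that the scheduler places it on another
# available controller node (hypothetical pod and namespace names).
oc delete pod my-app-pod -n my-namespace
```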
-
- Compute node can abruptly become unreachable
- A compute node can abruptly become unreachable either due to a power outage or network disruption.
- Cause
- The pods that run on that specific compute node might not have been evacuated cleanly, and they remain stuck in Terminating or Unknown state.
- Resolution
- To ensure a complete evacuation and migration of the pod that is stuck on the failed node, force
delete the pod manually by using the following command:
oc delete pod <pod name> --grace-period=0 --force --namespace <namespace>
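Before force deleting, you can list the pods that are still bound to the failed node. A sketch, assuming a hypothetical node name:

```shell
# Hypothetical node name for the unreachable compute node.
NODE=compute-1-ru24

# List pods across all namespaces that are still scheduled on that node.
oc get pods --all-namespaces --field-selector spec.nodeName="$NODE"

# Force delete each pod that remains in Terminating or Unknown state
# (hypothetical pod and namespace names).
oc delete pod my-app-pod --grace-period=0 --force --namespace my-namespace
```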
- When management switch high availability is lost because the management switch at RU18 is down, the cluster continues to function, but autodiscovery of nodes is not supported.