Compute node issues

Known issues while you work with the compute nodes.

Note: If a node does not reboot even after you complete the workaround steps in this troubleshooting section, contact IBM Support.
  • Sometimes, after you reboot a node, it might not change its state to Ready, and the ovs-configuration.service might fail.
    Resolution
    Reboot the node again.
  • AFM node failed to upsize
    After you upsize AFM nodes, the Events page displays the following error:
    <node name> Unable to add a new node. Failed to set boot order at <time stamp> 
    

    This issue also applies when factory-shipped AFM nodes are added to IBM Storage Fusion HCI System during stage 3 of the installation.

    Resolution:
    1. Go to Infrastructure > Nodes.
  2. Click the Discovered tab.
    3. Select the AFM node and click Add to cluster or click the + icon.
    4. In the Add node window, click Add.
  5. Back up the ComputeProvisionWorker (CPW) resource by running the following command:
      oc get cpw provisionworker-compute-1-ru24 -oyaml > /provisionworker-compute-1-ru24
    6. Delete CPW for that node:
      oc delete cpw provisionworker-compute-1-ru24
  7. Edit the CPW backup file and add provisioningInterfaceLabel to the spec section, for example:
      vi /provisionworker-compute-1-ru24
      Example:
      
      apiVersion: install.isf.ibm.com/v1
      kind: ComputeProvisionWorker
      metadata:
        creationTimestamp: "2023-07-11T15:37:31Z"
        generation: 1
        name: provisionworker-compute-1-ru24
        namespace: ibm-spectrum-fusion-ns
        resourceVersion: "5177873"
        uid: 6454d43-ba8d-40b7-8100-6e5efd1401a8
      spec:
        location: RU24
        provisioningInterfaceLabel: 'UEFI: SLOT3 (86/0/0) PXE IP6 Mellanox Network Adapter'
        rackSerial: RackH
  8. Run the following command to re-create the CPW:
      oc create -f /provisionworker-compute-1-ru24
  9. Wait for five to ten minutes, and then run the following command to verify that no boot order error exists:
      oc get cpw provisionworker-compute-1-ru24 -oyaml
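
    The CPW backup, edit, and re-create sequence in steps 5 through 9 can be sketched as a single script. This is a sketch, not part of the documented procedure; the node name, backup path, and sleep duration are examples taken from or assumed for this procedure, so adjust them for your node.

```shell
#!/bin/sh
# Sketch of the CPW re-creation flow (steps 5-9).
# CPW name and backup path are examples; replace them for your node.
CPW=provisionworker-compute-1-ru24
BACKUP=/$CPW

# Step 5: back up the CPW resource.
oc get cpw "$CPW" -oyaml > "$BACKUP"

# Step 6: delete the CPW for that node.
oc delete cpw "$CPW"

# Step 7: add provisioningInterfaceLabel to the spec section of the
# backup file (shown here with vi; any text editor works).
vi "$BACKUP"

# Step 8: re-create the CPW from the edited backup.
oc create -f "$BACKUP"

# Step 9: after five to ten minutes, verify that no boot order error exists.
sleep 600
oc get cpw "$CPW" -oyaml
```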
  • Compute node moved to failed state
    If, after a power operation, the node shows the following error message, contact IBM Support:
    Failed to get storage information from the IMM of the node
  • Compute nodes in hung state
    Complete the following steps to reboot compute nodes that are in a hung state:

    As a prerequisite, you must have administrator access to Red Hat® OpenShift® and oc (Red Hat OpenShift CLI) access to the cluster.

    1. Log in to OpenShift UI.
    2. Go to Compute > Bare Metal Hosts.
    3. Make sure to select openshift-machine-api project on the Bare Metal Hosts page.
    4. From the ellipsis menu, click Power Off for the node.
    5. Power it on again and check if the issue is resolved.
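
    The power cycle in steps 4 and 5 can also be sketched from the CLI. This assumes the hosts under Compute > Bare Metal Hosts are Metal3 BareMetalHost resources, whose spec.online field controls host power; the host name below is a placeholder.

```shell
# Power off the bare metal host (replace <host-name> with the node's
# BareMetalHost name from the openshift-machine-api project).
oc patch bmh <host-name> -n openshift-machine-api \
  --type merge -p '{"spec":{"online":false}}'

# After the host powers off, power it on again.
oc patch bmh <host-name> -n openshift-machine-api \
  --type merge -p '{"spec":{"online":true}}'

# Check the host status to confirm the issue is resolved.
oc get bmh <host-name> -n openshift-machine-api
```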
  • Pod migration gets stuck in Terminating or ContainerCreating state

    A controller node goes down and the migration of the pod gets stuck in Terminating or ContainerCreating state.

    As a workaround, delete the pod that is stuck in the ContainerCreating state so that it can be scheduled on another available controller node.
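
    The workaround above can be run with oc; this is a sketch, and the pod and namespace names are placeholders.

```shell
# Find pods stuck in Terminating or ContainerCreating state.
oc get pods --all-namespaces | grep -E 'Terminating|ContainerCreating'

# Delete the stuck pod so that it is rescheduled on another
# available controller node.
oc delete pod <pod-name> -n <namespace>
```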

  • Compute node can abruptly become unreachable
    A compute node can abruptly become unreachable either due to a power outage or network disruption.
    Cause
    The pods that run on that specific compute node might not have been evacuated cleanly, and they remain stuck in the Terminating or Unknown state.
    Resolution
    To ensure a complete evacuation and migration of the pod that is stuck on the failed node, force delete the pod manually by using the following command:
    oc delete pod <pod name> --grace-period=0 --force --namespace <namespace>
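    Before force deleting, you can locate the pods that remain stuck on the failed node. This sketch uses a standard field selector on the pod's node name; the node name is a placeholder.

```shell
# List pods that were scheduled on the failed node
# (replace <failed-node> with the unreachable node's name).
oc get pods --all-namespaces \
  --field-selector spec.nodeName=<failed-node> -o wide

# Pods shown in Terminating or Unknown state can then be force
# deleted with the command above.
```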
  • If management switch high availability is lost because the management switch at RU18 is down, the cluster continues to function, but autodiscovery of nodes is not supported.