Troubleshooting
Problem
A compute node reported a "Disk Pressure" condition and pods were being evicted from it. The corruption on the node prevented pods from starting on it.
Symptom
The Ingress Operator update appears stuck, and a large number of pods are created, which causes Disk Pressure on the node.
Diagnosing The Problem
To see the stalled ingress operator Cluster Service Version (CSV), run the following command:
oc get csv -n ibm-common-services
Look for stalled CSV updates. A stalled update can create hundreds of "ibm-management-ingress-operator" pods, which in turn causes disk pressure on the nodes.
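To narrow the listing to the entries that matter, the output can be filtered. The following is a minimal sketch, assuming the default oc get csv table layout in which the CSV name is the first column and the phase is the last column. The function name stalled_ingress_csvs is hypothetical and is not part of any IBM or Red Hat tooling.

```shell
# Print the name and phase of every ibm-management-ingress-operator CSV
# that is not in the Succeeded phase (for example Pending or Replacing).
# Reads `oc get csv` table output on stdin.
# Assumption: NAME is the first column and PHASE is the last column.
stalled_ingress_csvs() {
  awk '$1 ~ /^ibm-management-ingress-operator/ && $NF != "Succeeded" { print $1, $NF }'
}

# Usage (requires a logged-in oc session):
#   oc get csv -n ibm-common-services | stalled_ingress_csvs
```

If the function prints nothing, no ingress operator CSV is waiting in an intermediate phase.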
Resolving The Problem
Procedure
- Log in to your Red Hat OpenShift Container Platform Console.
- In Red Hat OpenShift Container Platform Console, click the drop-down arrow next to your username.
- Select the option Copy login command, then click Display Token.
- Highlight and copy the oc login command. The command looks similar to the following:
oc login --token=sha256......
- Run the following oc command. If any compute node reports a "DiskPressure" status of True, proceed to the next step.
oc describe node -l node-role.kubernetes.io/worker | egrep -i diskpressure
DiskPressure False Wed, 23 Nov 20... ...kubelet has no disk pressure
DiskPressure True Wed, 23 Nov 20... ...kubelet has no disk pressure
..etc.
- Run the following command to retrieve a list of the "clusterserviceversion" (CSV) resources under the "ibm-common-services" namespace. If any version of the "ibm-management-ingress-operator" update is stuck in Pending status, the update is stalled and this issue is occurring.
oc get csv -n ibm-common-services
NAME                                  VERSION   REPLACES                                  PHASE
ibm-management-ingress-operator ...   1.16.5    ibm-management-ingress-operator.v1.16.4   Replacing
ibm-management-ingress-operator ...   1.16.6    ibm-management-ingress-operator.v1.16.5   Pending
- Run the following command to delete the stalled "CSV".
oc delete csv ibm-management-ingress-operator -n ibm-common-services
- Run the following command to cordon the node that had the "DiskPressure" state of "True" from the earlier step. Replace <node> with the name of the compute node.
oc adm cordon <node>
- Reboot the compute node.
- After the node completes the reboot, uncordon the compute node. Replace <node> with the name of the compute node.
oc adm uncordon <node>
- Delete all pods under the "ibm-common-services" namespace.
oc delete pod --all -n ibm-common-services
- Review all pods to confirm there are no Evicted pods.
oc get pod --all-namespaces | grep Evicted
You can now log in to your Red Hat OpenShift Container Platform Console UI.
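The DiskPressure check and the final Evicted check from the procedure can also be scripted. The following is a minimal sketch under stated assumptions: the function names disk_pressure_nodes and count_evicted are hypothetical, the first function expects one "node-name status" pair per line, and the second assumes that STATUS is the fourth column of the oc get pod --all-namespaces table.

```shell
# Print the names of nodes whose DiskPressure condition is True.
# Reads lines of "<node-name> <status>" on stdin.
disk_pressure_nodes() {
  awk '$2 == "True" { print $1 }'
}

# Print the number of pods whose STATUS column is Evicted.
# Reads `oc get pod --all-namespaces` table output on stdin.
# Assumption: columns are NAMESPACE NAME READY STATUS ..., so STATUS is $4.
count_evicted() {
  awk '$4 == "Evicted" { n++ } END { print n + 0 }'
}

# Usage (requires a logged-in oc session):
#   oc get nodes -l node-role.kubernetes.io/worker \
#     -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}{end}' \
#     | disk_pressure_nodes
#   oc get pod --all-namespaces | count_evicted
```

After the procedure completes, disk_pressure_nodes is expected to print nothing and count_evicted is expected to print 0.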
Related Information
Document Location
Worldwide
Document Information
Modified date:
28 November 2022
UID
ibm16838555