IBM Support

Cloud Pak for Security: Compute node on cluster has "Disk Pressure", pods display as "Evicted" and CP4S fails to start

Troubleshooting


Problem

A compute node entered the "Disk Pressure" condition and the pods on it were evicted. The corruption on the node then prevented new pods from being set up on it.

Symptom

The Ingress Operator update appears stuck, and many pods are created as a result, causing Disk Pressure on the node.

Diagnosing The Problem

To see the stalled ingress operator Cluster Service Version (CSV), run the command:
"oc get csv -n ibm-common-services"
Look for stalled CSV updates. A stalled update can create hundreds of "ibm-management-ingress-operator" pods, which can cause disk pressure on the nodes.
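
The check described above can be sketched as a small shell helper. This is an illustration only: the function names stalled_csvs and count_ingress_operator_pods are hypothetical, and the sketch assumes that PHASE is the last column of the "oc get csv" table and that the pod name is the first column of "oc get pods --no-headers".

```shell
#!/bin/sh
# Sketch only: the function names below are hypothetical, not part of the oc CLI.

# Reads "oc get csv" table output on stdin and prints the NAME of every CSV
# whose PHASE (assumed to be the last column) is anything other than Succeeded.
stalled_csvs() {
  awk 'NR > 1 && $NF != "Succeeded" { print $1 }'
}

# Reads "oc get pods --no-headers" output on stdin and counts the pods whose
# name starts with ibm-management-ingress-operator.
count_ingress_operator_pods() {
  awk '$1 ~ /^ibm-management-ingress-operator/ { n++ } END { print n + 0 }'
}

# Query the cluster only when the oc client is installed.
if command -v oc >/dev/null 2>&1; then
  oc get csv -n ibm-common-services | stalled_csvs
  oc get pods -n ibm-common-services --no-headers | count_ingress_operator_pods
fi
```

A CSV that stays in a phase such as Replacing or Pending, together with a large pod count, would be consistent with the stalled update described in this document.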

Resolving The Problem

Procedure 
  1. Log in to your Red Hat OpenShift Container Platform Console.
  2. In Red Hat OpenShift Container Platform Console, click the drop-down arrow next to your username.
  3. Select the option Copy login command, then click Display Token.
  4. Highlight and copy the oc login command. The command looks similar to the following:
    oc login --token=sha256......
  5. Run the following oc command. If any compute node reports a "DiskPressure" status of True, proceed to the next step.
    oc describe node -l node-role.kubernetes.io/worker | grep -i diskpressure
    
      DiskPressure     False   Wed, 23 Nov 20...   ...kubelet has no disk pressure
      DiskPressure     True    Wed, 23 Nov 20...   ...kubelet has disk pressure
    
    ..etc.
  6. Run the following command to list the "clusterserviceversion" (CSV) resources in the "ibm-common-services" namespace. If the output shows more than one version of "ibm-management-ingress-operator", with an update stuck in the Pending phase, the update is stalled.
    oc get csv -n ibm-common-services
    
    NAME                            ... VERSION   REPLACES                                  PHASE
    ibm-management-ingress-operator ... 1.16.5    ibm-management-ingress-operator.v1.16.4   Replacing
    ibm-management-ingress-operator ... 1.16.6    ibm-management-ingress-operator.v1.16.5   Pending
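  7. Run the following command to delete the stalled "CSV". If the NAME column in the previous output includes a version suffix, use the full name exactly as displayed.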
    oc delete csv ibm-management-ingress-operator -n ibm-common-services
  8. Run the following command to cordon the node that reported a "DiskPressure" status of "True" in the earlier step. Replace <node> with the name of the compute node.
    oc adm cordon <node>
  9. Reboot the compute node.
  10. After the node completes the reboot, uncordon the compute node. Replace <node> with the name of the compute node.
    oc adm uncordon <node>
  11. Delete all pods under the "ibm-common-services" namespace.
    oc delete pod --all -n ibm-common-services
  12. Review all pods to confirm there are no Evicted pods.
    oc get pod --all-namespaces | grep Evicted
    Results
    You can now log in to your Red Hat OpenShift Container Platform Console UI.
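
The two checks in the procedure (finding nodes under disk pressure, and confirming that no Evicted pods remain) can also be scripted. The following is a minimal sketch, not an official IBM tool: the function names nodes_under_disk_pressure and count_evicted_pods are hypothetical, and the sketch assumes that STATUS is the fourth column of "oc get pod --all-namespaces --no-headers". The node query uses the JSONPath output option of oc to print each compute node name with its DiskPressure status.

```shell
#!/bin/sh
# Sketch only: the function names below are hypothetical, not part of the oc CLI.

# Reads "<node-name><TAB><DiskPressure status>" lines on stdin and prints the
# names of the nodes whose status is True.
nodes_under_disk_pressure() {
  awk '$2 == "True" { print $1 }'
}

# Reads "oc get pod --all-namespaces --no-headers" output on stdin and counts
# the pods whose STATUS (assumed to be the fourth column) is Evicted.
count_evicted_pods() {
  awk '$4 == "Evicted" { n++ } END { print n + 0 }'
}

# Query the cluster only when the oc client is installed.
if command -v oc >/dev/null 2>&1; then
  # Names of compute nodes to cordon and reboot.
  oc get nodes -l node-role.kubernetes.io/worker \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}{end}' \
    | nodes_under_disk_pressure
  # Number of Evicted pods; 0 is the result expected after the procedure.
  oc get pod --all-namespaces --no-headers | count_evicted_pods
fi
```

Printing node names rather than only the condition lines makes it easier to pick the <node> value for the cordon and uncordon commands.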

Document Location

Worldwide

Document Information

Modified date:
28 November 2022

UID

ibm16838555