Worker nodes show a status of disk pressure

On IBM® Cloud, you might notice that pods consume more local storage than expected and that worker nodes show a status of disk pressure.

Symptoms

In a managed Red Hat® OpenShift® cluster on IBM Cloud, the worker nodes show a status of disk pressure. Because you cannot log in to the worker nodes directly, you cannot easily identify the exact cause.

Causes

Worker nodes on a managed Red Hat OpenShift cluster on IBM Cloud come with less local disk space than Cloud Pak for Data recommends. When pods with large images or pods that write to local disk storage are running, they can fill the local disk and trigger the disk pressure condition.
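
To confirm which worker nodes report the condition, you can check the DiskPressure condition in each node's status. A minimal sketch, assuming that the jq command-line JSON processor is installed:

    # Print each node name with its DiskPressure condition status
    # (True means the kubelet is reporting disk pressure on that node).
    oc get nodes -o json | jq -r '.items[] | "\(.metadata.name) DiskPressure=\(.status.conditions[] | select(.type=="DiskPressure") | .status)"'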

Diagnosing the problem

You can use the following script to see which pods are consuming the most local storage on a node.

To run the script:

  1. Log in to your Red Hat OpenShift cluster as a cluster administrator:
    oc login OpenShift_URL:port
  2. Save the following script as getStorageUsage.sh and make it executable with chmod +x getStorageUsage.sh.
    #!/bin/bash
    # Report the ephemeral-storage usage of one node by querying the kubelet
    # stats summary through the OpenShift API server proxy. Requires jq.
    node=$1
    echo "Node Ephemeral FS"
    curl -k -s -H "Authorization: Bearer $(oc whoami -t)" $(oc whoami --show-server)/api/v1/nodes/${node}/proxy/stats/summary | jq -r '"Used:\(.node.fs.usedBytes) Capacity:\(.node.fs.capacityBytes) Available:\(.node.fs.availableBytes)",  "imagefs:\(.node.runtime.imageFs.usedBytes)"'
    echo ""
    echo "Pod Ephemeral Storage"
    # List every pod on the node by ephemeral-storage usage, largest first.
    curl -k -s -H "Authorization: Bearer $(oc whoami -t)" $(oc whoami --show-server)/api/v1/nodes/${node}/proxy/stats/summary | jq -r '.pods|.[]|"\(.["ephemeral-storage"].usedBytes) \(.podRef.namespace)/\(.podRef.name)"' | sort -n -r
  3. Get a list of nodes available in your OpenShift cluster:
    oc get nodes
  4. Run the script for each worker node. To check all worker nodes in one pass, see the loop sketch after this procedure.
    ./getStorageUsage.sh <workernode>
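
If your cluster has many worker nodes, you can run the script in a loop instead of once per node. A minimal sketch, assuming that the worker nodes carry the standard node-role.kubernetes.io/worker label:

    # Run getStorageUsage.sh against every node that has the worker role label.
    for node in $(oc get nodes -l node-role.kubernetes.io/worker -o jsonpath='{.items[*].metadata.name}'); do
      echo "=== ${node} ==="
      ./getStorageUsage.sh "${node}"
    done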

Resolving the problem

You can consider the following options:
  • Use smaller worker nodes, such as 16 vCPU x 64 GB RAM, so that fewer pods are scheduled on each worker node.
  • Reboot, reload, or replace the worker node that shows a status of disk pressure; see the example commands after this list. Be careful if block storage is attached to a VPC Gen 2 worker node that must be replaced. For more information, see VPC: Updating worker nodes with Portworx volumes.
  • If you are running the Prometheus Cluster Monitoring stack, reduce the retention periods of your logs or configure the logs to be saved in persistent storage instead of local storage, as shown in the sketch after this list. For more information, see Configuring the monitoring stack.
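
For the reload or replace option, you can use the ibmcloud CLI. A minimal sketch, where the cluster and worker values are placeholders; reload applies to classic worker nodes, and replace applies to VPC worker nodes:

    # Find the ID of the worker node that reports disk pressure.
    ibmcloud ks worker ls --cluster <cluster-name-or-id>
    # Classic infrastructure: reload (reimage) the worker node.
    ibmcloud ks worker reload --cluster <cluster-name-or-id> --worker <worker-id>
    # VPC infrastructure: replace the worker node instead.
    ibmcloud ks worker replace --cluster <cluster-name-or-id> --worker <worker-id>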
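
For the monitoring option, one way to shorten the retention period is to set it in the cluster monitoring ConfigMap. A minimal sketch, assuming the cluster-monitoring-config ConfigMap that the OpenShift 4 monitoring stack reads; the 24h value is only an example, so see Configuring the monitoring stack for the supported options:

    # Create the ConfigMap if it does not exist yet, then open it for editing.
    oc -n openshift-monitoring get configmap cluster-monitoring-config || oc -n openshift-monitoring create configmap cluster-monitoring-config
    # In the editor, set, for example:
    #   data:
    #     config.yaml: |
    #       prometheusK8s:
    #         retention: 24h
    oc -n openshift-monitoring edit configmap cluster-monitoring-config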