Finding and removing hidden pods that impact the performance of your cluster
In some situations, pods are shut down but continue to run in the background. These pods are not visible from the OpenShift® Console and are not returned when you run the oc get pods command. However, the pods still consume cluster resources, which impacts the performance of your cluster. Cleaning up the hidden pods can free up resources and improve the performance of your cluster.
Symptoms
The cluster is using more resources than you expect, given the current workload. For example:
- Mounted drives respond more slowly than usual.
- The cluster is using more CPU than the current requests and limits defined by the current, known pods.
- The cluster is using more memory than the current requests and limits defined by the current, known pods.
Causes
When a pod is shut down or deleted, Red Hat® OpenShift Container Platform typically completes a set of cleanup steps, which includes removing the pod from the container runtime interface. However, if a pod is forcefully deleted or killed, the cleanup might not occur. When the cleanup does not occur, the pod continues to run and to consume cluster resources.
Environment
This issue can occur on any Red Hat OpenShift Container Platform cluster.
Diagnosing the problem
A cluster administrator can use either of the following methods to identify hidden pods:
- Manually checking each node for hidden pods
- Run the following command to get a complete list of pods that are in use and known to Red Hat OpenShift Container Platform:
oc get pods -owide
This command returns information about the node that each known pod is running on.
- Run the following command to get the node names in the format that is required for the subsequent steps:
oc get nodes -oname
- For each node in the cluster:
- Set the NODE_NAME environment variable to the name of the node. Use the node/<node-name> form that the previous command returns:
export NODE_NAME=<node-name>
- Access the node in debug mode and get the list of pods known to the container runtime interface:
oc debug -q ${NODE_NAME} -- chroot /host /bin/bash -c "crictl pods --namespace <namespace> -o yaml"
Important: Replace <namespace> before you run the command.
Repeat this step for each namespace on the cluster that you want to investigate.
- Compare the list of pods returned by the oc get pods -owide command to the list of pods returned by the crictl pods commands that you ran on each node.
- If there are no nodes with a different number of pods, no additional action is needed.
- If there are nodes with a different number of pods, continue to the next step.
- Record the list of nodes and namespaces where the number of pods returned by the crictl pods command is different from the number of pods returned by the oc get pods -owide command. Then, proceed to Resolving the problem.
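The comparison in the manual steps can be sketched as two small helpers. This is a minimal sketch, not part of the product: the function names extract_cri_pod_names and find_surplus_pods are hypothetical, and the sketch assumes that the crictl pods YAML output carries the io.kubernetes.pod.name label that the script in this topic also relies on. Names are counted rather than de-duplicated, because a stale pod in the container runtime can share its name with the pod that replaced it.

```shell
# Print the pod names found in `crictl pods -o yaml` output read on stdin.
# Names are intentionally not de-duplicated.
extract_cri_pod_names() {
  sed -n 's/^[[:space:]]*io\.kubernetes\.pod\.name:[[:space:]]*//p' | tr -d '"'
}

# Print "<pod-name> <surplus>" for each name that occurs more often in the
# runtime list (file $2) than in the API list (file $1).
# Both files hold one pod name per line.
find_surplus_pods() {
  awk '
    NF == 0 { next }                         # skip blank lines
    FILENAME == ARGV[1] { api[$1]++; next }  # names known to the API server
    { cri[$1]++ }                            # names known to the runtime
    END {
      for (n in cri)
        if (cri[n] > api[n] + 0) print n, cri[n] - api[n]
    }
  ' "$1" "$2" | sort
}
```

One way to build the API list for a single node is oc get pods -n <namespace> --field-selector spec.nodeName=<node-name> -o name, with the pod/ prefix removed from each line. Any name that find_surplus_pods prints is a candidate hidden pod on that node.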
- Running a script to check for hidden pods
You can run the script from a cluster node or from a workstation that can connect to the cluster. The workstation must have the OpenShift command-line interface (oc CLI).
Best practice: You can run many of the commands in this task exactly as written if you set up environment variables for your installation. For instructions, see Setting up installation environment variables.
Ensure that you source the environment variables before you run the commands in this task.
- Save the following script as hidden-pods.sh on a cluster node or on a workstation that can connect to the cluster:
#!/usr/bin/env bash
set -o pipefail -o noclobber -o nounset

# Log in to the cluster and switch to the namespace to inspect.
login() {
  token=$1
  server=$2
  namespace=$3
  if [[ "${token}" != *"token"* ]] || [[ "${server}" != *"server"* ]]; then
    usage
  fi
  oc login "${token}" "${server}"
  oc project "${namespace}"
}

usage() {
  echo "Usage: hidden-pods.sh [-h|--help] --token=xxxx --server=xxxx <namespace>"
  echo "Example: ./hidden-pods.sh --token=xxxxx --server=xxxxx zen"
  exit 1
}

print_warn() {
  echo "############# WARNING ###############"
  echo "Mismatch of replicas for ${1}"
  echo "Found ${2}, expected ${3}"
  echo "#####################################"
}

# Compare the replica counts of each ReplicaSet with the pods found on the nodes.
check_replicas() {
  namespace=$1
  rs=$(oc get rs --no-headers | awk '{print $1}' | wc -l)
  echo "There are ${rs} replicasets"
  # Use == to compare the columns; a single = assigns $4 to $3 instead.
  replicas=$(oc get rs --no-headers | awk '$3 == $4 {print $1}' | wc -l)
  echo "There are ${replicas} replicasets with correct replica count"
  for deploy in $(oc get rs --no-headers | awk '{print $1}'); do
    echo "ReplicaSet: ${deploy}"
    requiredReplicas=$(oc get rs ${deploy} --no-headers | awk '{print $3}')
    echo "Required replicas: ${requiredReplicas}"
    reportedReplicas=$(oc get rs ${deploy} --no-headers | awk '{print $4}')
    echo "Reported replicas: ${reportedReplicas}"
    if [ "${requiredReplicas}" -ne "${reportedReplicas}" ]; then
      print_warn ${deploy} ${reportedReplicas} ${requiredReplicas}
    fi
    actualReplicas=0
    # Count the pods on each node whose name contains the ReplicaSet name.
    for node in ${nodeList}; do
      podDetected=$(grep "${deploy}" ${node})
      numPodsDetected=$(grep -o "${deploy}" ${node} | wc -l)
      if [ ${numPodsDetected} -gt 0 ]; then
        echo "Node: ${node}"
        echo ${podDetected}
        echo "Number of pods: ${numPodsDetected}"
        ((actualReplicas = ${actualReplicas} + ${numPodsDetected}))
      fi
    done
    if [ "${actualReplicas}" -ne "${requiredReplicas}" ]; then
      print_warn ${deploy} ${actualReplicas} ${requiredReplicas}
    fi
    echo "Actual replicas: ${actualReplicas}"
    echo
  done
}

# Compare the replica counts of each StatefulSet with the pods found on the nodes.
check_sts() {
  namespace=$1
  # The READY column has the form x/y; sub() splits it into two fields.
  statefulsets=$(oc get sts --no-headers | awk '{sub(/\//," ")} $2 == $3 {print $1}' | wc -l)
  echo "There are ${statefulsets} statefulsets with correct replica count"
  for sts in $(oc get sts --no-headers | awk '{print $1}'); do
    echo "StatefulSet: ${sts}"
    requiredReplicas=$(oc get sts ${sts} --no-headers | awk '{sub(/\//," ")} {print $2}')
    echo "Required replicas: ${requiredReplicas}"
    reportedReplicas=$(oc get sts ${sts} --no-headers | awk '{sub(/\//," ")} {print $3}')
    echo "Reported replicas: ${reportedReplicas}"
    if [ "${requiredReplicas}" -ne "${reportedReplicas}" ]; then
      print_warn ${sts} ${reportedReplicas} ${requiredReplicas}
    fi
    actualReplicas=0
    for node in ${nodeList}; do
      podDetected=$(grep "${sts}" ${node})
      numPodsDetected=$(grep -o "${sts}" ${node} | wc -l)
      if [ ${numPodsDetected} -gt 0 ]; then
        echo "Node: ${node}"
        echo ${podDetected}
        echo "Number of pods: ${numPodsDetected}"
        ((actualReplicas = ${actualReplicas} + ${numPodsDetected}))
      fi
    done
    if [ "${actualReplicas}" -ne "${requiredReplicas}" ]; then
      print_warn ${sts} ${actualReplicas} ${requiredReplicas}
    fi
    echo "Actual replicas: ${actualReplicas}"
    echo
  done
}

main() {
  creds=("$@")
  if [ "${#creds[@]}" -ne 3 ] || [ "${creds[0]}" = "-h" ] || [ "${creds[0]}" = "--help" ]; then
    usage
  fi
  login "${creds[0]}" "${creds[1]}" "${creds[2]}"
  numberNodes=$(oc get nodes -oname | wc -l)
  nodeList=$(oc get nodes -oname)
  # Node names have the form node/<node-name>, so each pod list is saved in the node directory.
  mkdir node
  echo "Scanning ${numberNodes} nodes"
  for node in ${nodeList}; do
    oc debug -q ${node} -- chroot /host /bin/bash -c "crictl pods --namespace ${creds[2]} -o yaml" | grep "io.kubernetes.pod.name:" 1> ${node} 2>/dev/null
  done
  check_replicas "${creds[2]}"
  check_sts "${creds[2]}"
  # rm -rf node
}

main "$@"
- For each namespace on the cluster that you want to investigate:
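- Set the NAMESPACE environment variable to the name of the namespace to investigate.
- Run the script:
./hidden-pods.sh ${SERVER_ARGUMENTS} ${LOGIN_ARGUMENTS} ${NAMESPACE}
The script expects the token argument first and the server argument second, as shown in its usage message. If the script prints the usage message instead of running, adjust the order of the arguments.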
- Record the nodes for which the script returns a warning. For example:
Node: node/<node-name>
Number of pods: 1
############# WARNING ###############
Mismatch of replicas for zen-minio
Found 4, expected 3
#####################################
Actual replicas: 4
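The script counts pods on each node with a substring match (grep -o on the workload name), so a workload named zen-minio would also match a pod such as zen-minio-backup-0. The following sketch shows a stricter count, assuming the usual Kubernetes naming: pods owned by a ReplicaSet end in a five-character generated suffix, and pods owned by a StatefulSet end in an ordinal. The function name count_owned_pods and the pod names in the usage note are hypothetical.

```shell
# Count the pods in a saved per-node list ($1) that belong to one workload.
#   $1  file of "io.kubernetes.pod.name: <pod>" lines, as saved by the script
#   $2  name of the ReplicaSet or StatefulSet
#   $3  rs or sts
count_owned_pods() {
  file=$1
  owner=$2
  kind=$3
  case "$kind" in
    rs)  suffix='[a-z0-9]{5}' ;;   # generated suffix of a ReplicaSet pod
    sts) suffix='[0-9]+' ;;        # ordinal of a StatefulSet pod
    *)   echo "unknown kind: $kind" >&2; return 2 ;;
  esac
  # Escape dots so that the workload name is matched literally.
  owner_re=$(printf '%s' "$owner" | sed 's/\./\\./g')
  # grep -c prints 0 and exits non-zero when nothing matches; keep the 0.
  grep -E -c "io\.kubernetes\.pod\.name:[[:space:]]*\"?${owner_re}-${suffix}\"?[[:space:]]*$" "$file" || true
}
```

For example, count_owned_pods node/<node-name> zen-minio sts counts zen-minio-0 and zen-minio-1 but not zen-minio-backup-0, whereas the substring match counts all three.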
Resolving the problem
Node: node/<node-name>
Number of pods: 1
############# WARNING ###############
Mismatch of replicas for zen-minio
Found 4, expected 3
#####################################
Actual replicas: 4
Node: node/<node-name>
Number of pods: 1
############# WARNING ###############
Mismatch of replicas for zen-minio
Found 2, expected 3
#####################################
Actual replicas: 2
If you see these warnings, completing the steps in this section will not resolve those problems. You must investigate those issues separately.
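When the script reports many warnings, it can help to separate the two kinds shown above. The following sketch reads the script output and labels each warning as surplus (more pods found than expected) or shortfall (fewer pods found than expected). It rests on an assumption that this topic does not state explicitly: that only a surplus points to hidden pods, and that a shortfall is a different problem to investigate separately. The function name classify_warnings is hypothetical.

```shell
# Read hidden-pods.sh output on stdin and print one line per warning:
#   <workload> <surplus|shortfall> <found> <expected>
classify_warnings() {
  awk '
    /^Mismatch of replicas for / { name = $NF; next }
    /^Found [0-9]+, expected [0-9]+/ {
      found = $2 + 0        # "4," converts to the number 4
      expected = $4 + 0
      kind = (found > expected) ? "surplus" : "shortfall"
      print name, kind, found, expected
    }
  '
}
```

A typical use is ./hidden-pods.sh <arguments> | classify_warnings, which reduces each warning box to a single line that is easier to scan or filter.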
- Set the NODE_NAME environment variable to the name of the node. Use the node name without the node/ prefix, because the following command adds the prefix:
export NODE_NAME=<node-name>
- Access the node in debug mode and run the following command to restart the container runtime interface:
oc debug node/${NODE_NAME} -- chroot /host /bin/bash -c 'systemctl restart crio'
When the container runtime interface restarts, it contains only active, known pods.
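If you recorded several nodes, the two steps above can be repeated in a loop. The following sketch wraps the same oc debug command in a hypothetical helper named restart_crio. By default it only prints the command for each node so that you can review the list first; the RUN variable is an assumption of this sketch, not an option of oc.

```shell
# Print the restart command for each node name given as an argument.
# Set RUN=1 to run the commands instead of printing them.
restart_crio() {
  for NODE_NAME in "$@"; do
    if [ "${RUN:-0}" = "1" ]; then
      oc debug "node/${NODE_NAME}" -- chroot /host /bin/bash -c 'systemctl restart crio'
    else
      echo "oc debug node/${NODE_NAME} -- chroot /host /bin/bash -c 'systemctl restart crio'"
    fi
  done
}
```

For example, restart_crio <node-name-1> <node-name-2> prints two commands, and RUN=1 restart_crio <node-name-1> <node-name-2> runs them one node at a time.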