How do you debug cluster nodes that become "NotReady" with "PLEG is not healthy" as the reason?

Troubleshooting

Problem

Resolving The Problem

What's Happening

The cluster nodes become "NotReady" with the following reason:
PLEG is not healthy: pleg was last seen active ***h**m***s ago;
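
To find which nodes are affected, the STATUS column of kubectl get nodes can be filtered. The following is a minimal sketch, not an official procedure: the function name not_ready_nodes is hypothetical, and it assumes the default column layout in which STATUS is the second field.

```shell
# Sketch: print the names of nodes whose STATUS starts with "NotReady".
# Reads `kubectl get nodes --no-headers` output on stdin.
# Assumption: STATUS is the second whitespace-separated column.
not_ready_nodes() {
  awk '$2 ~ /^NotReady/ { print $1 }'
}

# Usage (assumes kubectl is already configured for the cluster):
#   kubectl get nodes --no-headers | not_ready_nodes
#   kubectl describe node <node-name>   # the Ready condition carries the PLEG message
```

The anchored match on "NotReady" also catches combined states such as "NotReady,SchedulingDisabled" without matching plain "Ready".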

Why it's Happening

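PLEG (Pod Lifecycle Event Generator) is the kubelet component that periodically polls the container runtime for the state of the containers on a node. Cluster nodes can become "NotReady" with this reason when they are overloaded and that polling falls behind. Checking the load on the nodes yourself helps to diagnose the issue.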

How to Fix

First determine whether the load on the nodes is contributing to the issue. The following commands show the load on the nodes:

kubectl top nodes: This command shows whether any of the nodes are getting close to CPU/memory capacity.
kubectl top pods --all-namespaces: This command shows whether any pods are consuming large amounts of CPU/memory.
kubectl get pods --all-namespaces -o=wide: This command lets you match pods that are using lots of resources with the nodes that they are running on, and shows the total number of pods in the system.
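
To see how pods are spread across nodes, the pod list can be reduced to a count per node. This is a rough sketch: the function name pods_per_node is hypothetical, and it expects one node name per line, which is an assumption about how you feed it (the custom-columns call in the usage comment is one way to produce that).

```shell
# Sketch: count pods per node, busiest node first.
# Reads one node name per line on stdin; prints "<count> <node>".
pods_per_node() {
  awk 'NF { n[$1]++ } END { for (k in n) print n[k], k }' | sort -rn
}

# Usage (assumes kubectl is already configured for the cluster):
#   kubectl get pods --all-namespaces --no-headers \
#     -o custom-columns=NODE:.spec.nodeName | pods_per_node
```

Pods that are not yet scheduled appear under the node name <none>.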

PLEG errors can also occur when there is a large number of containers on a node, even when there is plenty of spare CPU/memory. The limit depends on configuration, but problems typically occur when you exceed about 400 containers per node. The READY column in the output of the kubectl get pods command shows the number of containers in each pod.
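
Adding up containers per node by hand is tedious, so the sketch below totals them. It is illustrative only: containers_per_node is a hypothetical helper, the default limit of 400 is taken from the rough figure above rather than from any fixed Kubernetes limit, and it counts only the regular containers defined in each pod spec (not init containers).

```shell
# Sketch: total containers per node and flag nodes above a limit.
# Reads "<node> <comma-separated container names>" lines on stdin.
# Optional first argument: the limit (default 400, an approximate figure).
containers_per_node() {
  limit="${1:-400}"
  awk -v limit="$limit" '
    NF >= 2 { total[$1] += split($2, names, ",") }
    END {
      for (node in total) {
        flag = (total[node] > limit) ? " HIGH" : ""
        print node, total[node] flag
      }
    }' | sort
}

# Usage (assumes kubectl is already configured for the cluster):
#   kubectl get pods --all-namespaces --no-headers \
#     -o custom-columns='NODE:.spec.nodeName,CONTAINERS:.spec.containers[*].name' \
#     | containers_per_node
```

Nodes marked HIGH are candidates for rebalancing, but the flag is only a hint, since the real threshold depends on configuration.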

You can also see PLEG errors if there are thousands of services in the cluster. You can retrieve a list of the services in the cluster using the following command:
kubectl get services --all-namespaces
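
With a long service list, a count per namespace is easier to read than the raw output. The following sketch assumes the namespace is the first column, as it is when --all-namespaces is used; services_summary is a hypothetical helper name.

```shell
# Sketch: count services per namespace, largest first, then print a total.
# Reads `kubectl get services --all-namespaces --no-headers` output on stdin.
services_summary() {
  awk 'NF { print $1 }' | sort | uniq -c | sort -rn |
    awk '{ print $1, $2; total += $1 } END { print total + 0, "TOTAL" }'
}

# Usage (assumes kubectl is already configured for the cluster):
#   kubectl get services --all-namespaces --no-headers | services_summary
```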

Document Location

Worldwide
