Troubleshooting cluster maintenance
Identifying applications that prevent cluster maintenance
Use these debug steps to determine why a cluster maintenance action is not completing.
When do you see this problem?
A drain occurs when a core pod is selected for deletion. Many actions prompt a core pod deletion:
- Pod evictions
- Red Hat OpenShift Machine Config Operator
- User-initiated drains
- Pod spec updates
- Resource request changes, for example, changes to core pod CPU or memory requests
- Image updates prompted by a new release
What does this look like?
An update that is driven by pod spec updates presents as core pods that wait for deletion. An update that is driven by the Red Hat OpenShift Machine Config Operator (MCO) presents as the MCO ceasing to update nodes. The signatures look similar in the two scenarios. Use the following steps to determine where to direct the support case.
When do you open a support case?
To determine whether the MCO is stuck due to IBM Storage Scale container native, run the following command:
kubectl describe daemon -n ibm-spectrum-scale
Example output:
...
Status:
  ...
  Cordon And Drain:
    Nodes Cordoned:  worker0.example.ibm.com
    Nodes Draining:
      Node:  worker0.example.ibm.com
      Ongoing Pod Evictions:
        Pod Evictions Failed:
          myworkload/mypod-0
          myworkload/mypod-1
        Pod Eviction Requests:
          Pods:       worker0, worker1, worker2
          Requestor:  machine-config-controller
  ...
  Status Details:
    ...
    Pods Waiting To Be Deleted:
      Delete Reason:  Eviction of 3 pods requested by machine-config-controller
      Pods:           worker0, worker1, worker2
  ...
Events:
Type     Reason                Age    From    Message
----     ------                ----   ----    -------
Normal   PodEvictionRequested  7m9s   Daemon  Eviction of 3 pods requested by machine-config-controller.
Normal   NodeDraining          6m28s  Daemon  Node worker2.example.ibm.com drained by Storage Scale operator. Evicting 6 pods.
Warning  NodeDrainFailed       87s    Daemon  Failed to evict pod myworkload/mypod-0 on node worker0.example.ibm.com. Reason: Cannot evict pod as it would violate the pod's disruption budget.
Warning  NodeDrainFailed       87s    Daemon  Failed to evict pod myworkload/mypod-1 on node worker0.example.ibm.com. Reason: Cannot evict pod as it would violate the pod's disruption budget.
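When the describe output is long, the failure events can be pulled out directly. The following is a minimal sketch that filters the sample events shown above from captured text; on a live cluster you would save the real output of kubectl describe daemon -n ibm-spectrum-scale to a file first:

```shell
# Sketch: extract drain-failure events from saved "describe daemon" output.
# The sample text is copied from the example above; on a live cluster capture
# the real output first:  kubectl describe daemon -n ibm-spectrum-scale > daemon.txt
events='Normal   PodEvictionRequested  7m9s  Daemon  Eviction of 3 pods requested by machine-config-controller.
Warning  NodeDrainFailed       87s   Daemon  Failed to evict pod myworkload/mypod-0 on node worker0.example.ibm.com.'

# Keep only the NodeDrainFailed lines; each one names a blocking pod.
printf '%s\n' "$events" | grep NodeDrainFailed
```

Each line that this filter prints identifies one pod that could not be evicted.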
If the Pod Evictions Failed: list is empty and no NodeDrainFailed events are present, then IBM Storage Scale container native is blocking the ongoing drain. To resolve it, raise a support ticket with IBM. For more information, see Gather data about the IBM Storage Scale container native cluster.
If pods are listed in Pod Evictions Failed:, check the message that is provided in the NodeDrainFailed events. The pods that are listed in Pod Evictions Failed: block the node drain. Try to resolve the problem with those pods first. If you cannot resolve the problem, raise a support ticket with Red Hat. For more information, see Gather data about the Red Hat OpenShift Container Platform cluster.
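The triage above can be sketched as a small shell helper. This is illustrative only; support_target is not a product command, and its input is the Pod Evictions Failed: list copied from the daemon status:

```shell
# Illustrative helper (not part of the product): decide where to direct the
# support case based on the "Pod Evictions Failed" list in the daemon status.
support_target() {
  failed="$1"   # space-separated pod list; empty if no evictions failed
  if [ -z "$failed" ]; then
    # No failed evictions: Storage Scale container native blocks the drain.
    echo "IBM"
  else
    # Listed pods block the drain; investigate them, then contact Red Hat.
    echo "Red Hat"
  fi
}

support_target ""
support_target "myworkload/mypod-0 myworkload/mypod-1"
```

This mirrors the rule stated above: an empty failure list points to IBM Storage Scale container native, while named pods point to the workload and Red Hat.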
Identifying signatures of an ongoing update
Complete the following steps:
- Check the Daemon status to verify whether any pods are awaiting deletion.
kubectl describe daemon -n ibm-spectrum-scale
Example output:
Status Details:
  ...
  Pods Waiting To Be Deleted:
    Delete Reason:  Eviction of 3 pods requested by machine-config-controller.
    Pods:           worker0, worker1, worker2
    Delete Reason:  Pod worker1 will restart because image of containers config, gpfs, mmbuildgpl are updated. The node will be rebooted if the kernel modules can not be unloaded properly.
    Pods:           worker1
- Check whether any nodes are cordoned.
kubectl get nodes
Example:
NAME                      STATUS                     ROLES    AGE   VERSION
master0.example.ibm.com   Ready                      master   23d   v1.24.0+3882f8f
master1.example.ibm.com   Ready                      master   23d   v1.24.0+3882f8f
master2.example.ibm.com   Ready                      master   23d   v1.24.0+3882f8f
worker0.example.ibm.com   Ready,SchedulingDisabled   worker   23d   v1.24.0+3882f8f
worker1.example.ibm.com   Ready                      worker   23d   v1.24.0+3882f8f
worker2.example.ibm.com   Ready                      worker   23d   v1.24.0+3882f8f
kubectl describe daemon -n ibm-spectrum-scale
Example:
Cordon And Drain:
  Nodes Cordoned:  worker0.example.ibm.com
- Check events in the daemon.
kubectl describe daemon -n ibm-spectrum-scale
Example event:
Warning NodeDrainFailed 87s Daemon Failed to evict pod myworkload/mypod-0 on node worker0.example.ibm.com. Reason: Cannot evict pod as it would violate the pod's disruption budget.
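The three checks above can be combined into one pass over saved command output. The following is a sketch, assuming you have captured the output of kubectl describe daemon -n ibm-spectrum-scale and kubectl get nodes to files; the function and file names are illustrative, not part of the product:

```shell
# Sketch: scan saved command output for the update signatures covered in the
# steps above. File names are illustrative; capture the output first:
#   kubectl describe daemon -n ibm-spectrum-scale > daemon.txt
#   kubectl get nodes > nodes.txt
check_update_signatures() {
  daemon_dump="$1"
  nodes_dump="$2"
  grep -q "Pods Waiting To Be Deleted" "$daemon_dump" \
    && echo "pods are awaiting deletion"
  grep -q "SchedulingDisabled" "$nodes_dump" \
    && echo "at least one node is cordoned"
  grep -q "NodeDrainFailed" "$daemon_dump" \
    && echo "drain failures reported"
  true  # the function always succeeds; the echoed lines are the report
}
```

Any line that the function prints corresponds to one of the signatures of an ongoing update; if it prints nothing, none of the signatures are present in the captured output.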