Troubleshooting cluster maintenance

Identifying applications that prevent cluster maintenance

Use these debug steps to determine why a cluster maintenance action is not completing.

When do you see this problem?

A drain occurs when a core pod is selected for deletion. Many actions, such as a pod spec update or a node update driven by the Machine Config Operator (MCO), can prompt a core pod deletion.

What does this look like?

An update that is driven by pod spec changes presents as core pods waiting for deletion. An update that is driven by the Red Hat OpenShift Machine Config Operator (MCO) presents as the MCO ceasing to update nodes. The signatures of the two scenarios look similar. Use the following steps to determine where to direct the support case.

When to open a support case?

To determine whether the MCO is stuck due to IBM Storage Scale container native, run the following command:

kubectl describe daemon -n ibm-spectrum-scale

Example output:

...
Status:
...
    Cordon And Drain:
        Nodes Cordoned:  worker0.example.ibm.com
        Nodes Draining:
            Node:  worker0.example.ibm.com
        Ongoing Pod Evictions:
        Pod Evictions Failed:
            myworkload/mypod-0
            myworkload/mypod-1
        Pod Eviction Requests:
            Pods: worker0, worker1, worker2
            Requestor: machine-config-controller
    ...
    Status Details:
        ...
        Pods Waiting To Be Deleted:
            Delete Reason:   Eviction of 3 pods requested by machine-config-controller
            Pods:            worker0, worker1, worker2
    ...
Events:
    Type     Reason                 Age    From    Message
    ----     ------                 ----   ----    -------
    Normal   PodEvictionRequested   7m9s   Daemon  Eviction of 3 pods requested by machine-config-controller.
    Normal   NodeDraining           6m28s  Daemon  Node worker2.example.ibm.com drained by Storage Scale operator. Evicting 6 pods.
    Warning  NodeDrainFailed        87s    Daemon  Failed to evict pod myworkload/mypod-0 on node worker0.example.ibm.com. Reason: Cannot evict pod as it would violate the pod's disruption budget.
    Warning  NodeDrainFailed        87s    Daemon  Failed to evict pod myworkload/mypod-1 on node worker0.example.ibm.com. Reason: Cannot evict pod as it would violate the pod's disruption budget.

If the Pod Evictions Failed list is empty and no NodeDrainFailed events are present, then IBM Storage Scale container native is blocking the ongoing drain. To resolve the issue, open a support case with IBM. For more information, see Gather data about the IBM Storage Scale container native cluster.

If pods are listed under Pod Evictions Failed, check the messages in the NodeDrainFailed events. The pods in that list are blocking the node drain. Try to resolve the problem with those pods first. If you cannot resolve it, open a support case with Red Hat. For more information, see Gather data about the Red Hat OpenShift Container Platform cluster.
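The decision above can be sketched as a small shell check against saved describe output. This is a sketch only, not part of the product: the file name daemon.txt and the inline sample stand in for real output captured from the kubectl describe daemon command.

```shell
# Sketch: decide where to direct the support case from saved daemon status.
# Capture real output with:
#   kubectl describe daemon -n ibm-spectrum-scale > daemon.txt
# The inline sample below is illustrative.
cat > daemon.txt <<'EOF'
    Cordon And Drain:
        Nodes Cordoned:  worker0.example.ibm.com
        Pod Evictions Failed:
            myworkload/mypod-0
            myworkload/mypod-1
        Pod Eviction Requests:
            Pods: worker0, worker1, worker2
EOF

# List the pods under "Pod Evictions Failed:" (lines until the next "key:" line).
failed=$(awk '/Pod Evictions Failed:/{f=1;next} f && /:/{f=0} f{print $1}' daemon.txt)

if [ -z "$failed" ]; then
    route="IBM"        # no failed evictions: the drain is blocked by IBM Storage Scale container native
else
    route="Red Hat"    # the listed application pods are blocking the drain
fi
echo "Failed evictions: ${failed:-none}"
echo "Direct the support case to: $route"
```

Because the sample lists two failed evictions, the sketch routes this case to Red Hat; with an empty list it would route to IBM.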

Identifying signatures of an ongoing update

Complete the following steps:

  1. Check the Daemon status to verify whether any pods are awaiting deletion.

     kubectl describe daemon -n ibm-spectrum-scale
    

    Example output:

     Status Details:
         ...
         Pods Waiting To Be Deleted:
             Delete Reason:   Eviction of 3 pods requested by machine-config-controller.
             Pods:            worker0, worker1, worker2
             Delete Reason:   Pod worker1 will restart because image of containers config, gpfs, mmbuildgpl are updated. The node will be rebooted if the kernel modules can not be unloaded properly.
             Pods:            worker1
    
  2. Check whether any nodes are cordoned.

     kubectl get nodes
    

    Example:

     NAME                      STATUS                     ROLES    AGE   VERSION
     master0.example.ibm.com   Ready                      master   23d   v1.24.0+3882f8f
     master1.example.ibm.com   Ready                      master   23d   v1.24.0+3882f8f
     master2.example.ibm.com   Ready                      master   23d   v1.24.0+3882f8f
     worker0.example.ibm.com   Ready,SchedulingDisabled   worker   23d   v1.24.0+3882f8f
     worker1.example.ibm.com   Ready                      worker   23d   v1.24.0+3882f8f
     worker2.example.ibm.com   Ready                      worker   23d   v1.24.0+3882f8f
    
     kubectl describe daemon -n ibm-spectrum-scale
    

    Example:

     Cordon And Drain:
         Nodes Cordoned:  worker0.example.ibm.com
    
  3. Check events in the daemon.

     kubectl describe daemon -n ibm-spectrum-scale
    

    Example event:

     Warning  NodeDrainFailed  87s    Daemon  Failed to evict pod myworkload/mypod-0 on node worker0.example.ibm.com. Reason: Cannot evict pod as it would violate the pod's disruption budget.
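Steps 2 and 3 can also be scripted against saved command output. This is a sketch only: the file names (nodes.txt, daemon.txt) and the inline samples are illustrative stand-ins for real output from the kubectl commands shown above.

```shell
# Sketch: pull cordoned nodes and drain-failure reasons out of saved output.
# Capture real output with:
#   kubectl get nodes > nodes.txt
#   kubectl describe daemon -n ibm-spectrum-scale > daemon.txt
cat > nodes.txt <<'EOF'
NAME                      STATUS                     ROLES    AGE   VERSION
master0.example.ibm.com   Ready                      master   23d   v1.24.0+3882f8f
worker0.example.ibm.com   Ready,SchedulingDisabled   worker   23d   v1.24.0+3882f8f
worker1.example.ibm.com   Ready                      worker   23d   v1.24.0+3882f8f
EOF
cat > daemon.txt <<'EOF'
  Warning  NodeDrainFailed  87s  Daemon  Failed to evict pod myworkload/mypod-0 on node worker0.example.ibm.com. Reason: Cannot evict pod as it would violate the pod's disruption budget.
EOF

# Step 2: a node whose STATUS column includes SchedulingDisabled is cordoned.
cordoned=$(awk 'NR>1 && $2 ~ /SchedulingDisabled/ {print $1}' nodes.txt)
echo "Cordoned nodes: ${cordoned:-none}"

# Step 3: drain-failure reasons from the NodeDrainFailed events.
reasons=$(grep 'NodeDrainFailed' daemon.txt | sed 's/.*Reason: //')
echo "Drain failure reasons: ${reasons:-none}"
```

Here the sketch reports worker0.example.ibm.com as cordoned and surfaces the pod disruption budget violation as the drain-failure reason, matching the event shown in step 3.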