Engine failure status check

You can check the engine failure status by running the oc describe wxdengine command and by reviewing the ibm-lakehouse-controller-manager logs.
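The procedure in this topic does not show a command for the controller-manager logs, so the following is only a rough sketch of one way to find them. It assumes that the controller manager runs as a pod whose name contains ibm-lakehouse-controller-manager. The helper name controller_manager_pods and the <operator namespace> placeholder are illustrative assumptions, not names taken from the product.

```shell
# Keep only the resource names that contain the controller-manager name.
# Reads `oc get pods -o name` output on stdin (for example, pod/<name>).
controller_manager_pods() {
  grep 'ibm-lakehouse-controller-manager'
}

# Usage (assumes a logged-in oc session; the namespace is a placeholder):
# oc get pods -n "<operator namespace>" -o name | controller_manager_pods
# oc logs "<pod from the previous command>" -n "<operator namespace>" --tail=200
```

Filtering on the pod name avoids guessing at labels or at the kind of workload that owns the pod.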

watsonx.data Developer edition

watsonx.data on IBM Software Hub

Procedure

Run the following command to display the Status and Events information of an engine.

oc describe wxdengine <enginename> -n <operand namespace>

Example output that shows the engine pods in the Pending state:

ibm-lh-lakehouse-prestissimo57-coordinator-blue-0                 0/1     Pending     0               83m
ibm-lh-lakehouse-prestissimo57-prestissimo-worker-0               0/1     Pending     0               83m
ibm-lh-lakehouse-prestissimo57-prestissimo-worker-1               0/1     Pending     0               83m
ibm-lh-lakehouse-prestissimo57-prestissimo-worker-2               0/1     Pending     0               83m
ibm-lh-lakehouse-prestissimo57-prestissimo-worker-3               0/1     Pending     0               83m
ibm-lh-lakehouse-prestissimo57-prestissimo-worker-4               0/1     Pending     0               83m
ibm-lh-lakehouse-prestissimo57-prestissimo-worker-5               0/1     Pending     0               83m
ibm-lh-lakehouse-prestissimo57-prestissimo-worker-6               0/1     Pending     0               83m
ibm-lh-lakehouse-prestissimo57-prestissimo-worker-7               0/1     Pending     0               83m
ibm-lh-lakehouse-prestissimo57-prestissimo-worker-8               0/1     Pending     0               83m
ibm-lh-lakehouse-prestissimo57-prestissimo-worker-9               0/1     Pending     0               83m
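When an engine has many workers, it can help to list only the pods that are stuck. The following is a minimal sketch rather than product tooling: pending_pods is a hypothetical helper name, and it assumes the column layout shown above (name, ready count, status, restarts, age).

```shell
# Print the names of pods whose STATUS column (third field) is Pending.
# Reads pod listing rows on stdin: NAME READY STATUS RESTARTS AGE
pending_pods() {
  awk '$3 == "Pending" { print $1 }'
}

# Usage (assumes a logged-in oc session; the namespace is a placeholder):
# oc get pods -n "<operand namespace>" --no-headers | pending_pods
```

Each name that is printed can then be passed to oc describe pod.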

Run the oc describe command on the pod to identify the root cause of the failure.

oc describe pod <pod_name>

For example:

oc describe pod ibm-lh-lakehouse-prestissimo57-prestissimo-worker-0

Output:

Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  59m                  ibm-cpd-scheduler  0/18 nodes are available: 15 Insufficient cpu, 15 Insufficient memory, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/18 nodes are available: 15 No preemption victims found for incoming pod, 3 Preemption is not helpful for scheduling.
  Warning  FailedScheduling  13m (x550 over 59m)  ibm-cpd-scheduler  0/18 nodes are available: 15 Insufficient cpu, 15 Insufficient memory, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/18 nodes are available: 15 No preemption victims found for incoming pod, 3 Preemption is not helpful for scheduling.
  Normal   QueuePosition     59m                  ibm-cpd-scheduler  Queue Position: 2
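
In this example, the FailedScheduling events report that 15 nodes have insufficient CPU and memory and that the remaining 3 nodes carry a master taint that the pod does not tolerate, so the scheduler cannot place the pod on any of the 18 nodes. Because the same warning can repeat hundreds of times, a small filter can reduce the events to the distinct resource shortages. This is a sketch under stated assumptions: insufficient_resources is a hypothetical helper name, and it relies on the message wording shown in the output above.

```shell
# Print each distinct "<count> Insufficient <resource>" reason found in
# FailedScheduling events. Reads `oc describe pod` output on stdin.
insufficient_resources() {
  grep 'FailedScheduling' | grep -oE '[0-9]+ Insufficient [a-z-]+' | sort -u
}

# Usage (assumes a logged-in oc session; the pod name is a placeholder):
# oc describe pod "<pod_name>" | insufficient_resources
```

For the events shown above, the filter would reduce the two warnings to one line for cpu and one line for memory.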