Engine failure status check

To check the failure status of an engine, run the oc describe wxdengine command and review the ibm-lakehouse-controller-manager logs.

watsonx.data Developer edition

watsonx.data on IBM Software Hub

Procedure

  1. Run the following command to display the Status and Events information of an engine.
    oc describe wxdengine <enginename> -n <operand namespace>
    Example output, in which the engine pods remain in the Pending state:
    ibm-lh-lakehouse-prestissimo57-coordinator-blue-0                 0/1     Pending     0               83m
    ibm-lh-lakehouse-prestissimo57-prestissimo-worker-0               0/1     Pending     0               83m
    ibm-lh-lakehouse-prestissimo57-prestissimo-worker-1               0/1     Pending     0               83m
    ibm-lh-lakehouse-prestissimo57-prestissimo-worker-2               0/1     Pending     0               83m
    ibm-lh-lakehouse-prestissimo57-prestissimo-worker-3               0/1     Pending     0               83m
    ibm-lh-lakehouse-prestissimo57-prestissimo-worker-4               0/1     Pending     0               83m
    ibm-lh-lakehouse-prestissimo57-prestissimo-worker-5               0/1     Pending     0               83m
    ibm-lh-lakehouse-prestissimo57-prestissimo-worker-6               0/1     Pending     0               83m
    ibm-lh-lakehouse-prestissimo57-prestissimo-worker-7               0/1     Pending     0               83m
    ibm-lh-lakehouse-prestissimo57-prestissimo-worker-8               0/1     Pending     0               83m
    ibm-lh-lakehouse-prestissimo57-prestissimo-worker-9               0/1     Pending     0               83m
  2. Run the oc describe command on a pending pod to identify the root cause of the failure.
    oc describe pod <pod_name>
    For example:
    oc describe pod ibm-lh-lakehouse-prestissimo57-prestissimo-worker-0
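    Example output, in which the FailedScheduling events show that no node can host the pod (15 nodes have insufficient CPU and memory, and 3 nodes have a master taint that the pod does not tolerate):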
    Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
    Events:
      Type     Reason            Age                  From               Message
      ----     ------            ----                 ----               -------
      Warning  FailedScheduling  59m                  ibm-cpd-scheduler  0/18 nodes are available: 15 Insufficient cpu, 15 Insufficient memory, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/18 nodes are available: 15 No preemption victims found for incoming pod, 3 Preemption is not helpful for scheduling.
      Warning  FailedScheduling  13m (x550 over 59m)  ibm-cpd-scheduler  0/18 nodes are available: 15 Insufficient cpu, 15 Insufficient memory, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/18 nodes are available: 15 No preemption victims found for incoming pod, 3 Preemption is not helpful for scheduling.
      Normal   QueuePosition     59m                  ibm-cpd-scheduler  Queue Position: 2
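
When many pods are pending, reading each FailedScheduling message by hand gets tedious. The following is a minimal sketch, not part of the product, that splits a scheduler message like the one above into one reason per line. The function names summarize_scheduling_failure and report_pending_pods are hypothetical helpers introduced here for illustration. The sketch assumes that the scheduler events are visible through oc get events and that the message keeps the "nodes are available: ..." format shown in the example output.

```shell
#!/bin/sh
# Hypothetical helper: reads FailedScheduling messages on stdin and prints
# each scheduling reason on its own line.
# Assumes the "N/M nodes are available: <reasons>. preemption: ..." format.
summarize_scheduling_failure() {
  sed -e 's/ preemption:.*$//' \
      -e 's/^.*nodes are available: //' \
      -e 's/\.$//' |
    tr ',' '\n' |
    sed -e 's/^ *//'
}

# Hypothetical helper: lists the Pending pods in a namespace and summarizes
# their distinct FailedScheduling messages. Requires a logged-in oc session.
report_pending_pods() {
  ns="$1"
  oc get pods -n "$ns" --field-selector=status.phase=Pending -o name |
    while read -r pod; do
      name="${pod#pod/}"   # "-o name" prints "pod/<pod_name>"
      echo "== $name"
      oc get events -n "$ns" \
        --field-selector "involvedObject.name=$name,reason=FailedScheduling" \
        -o jsonpath='{range .items[*]}{.message}{"\n"}{end}' |
        sort -u |
        summarize_scheduling_failure
    done
}

# Usage (not run automatically):
#   report_pending_pods <operand namespace>
```

For the message in the example output, summarize_scheduling_failure prints the CPU, memory, and taint reasons as three separate lines, which makes it easier to see that the cluster lacks capacity for the requested engine size.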