GPU and model issues

You might encounter watsonx Orchestrate installation issues that are related to GPUs and models. Go through the following sections to resolve these problems.

wo-ai-cognitive-mapper-svc cannot be scheduled on a GPU worker node

Symptoms

wo-ai-cognitive-mapper-svc pods might not be scheduled onto GPU worker nodes.

Diagnosing the problem

The ReplicaSet wo-ai-cognitive-mapper-svc-6c44d9576d is configured to run only 2 replicas, and the corresponding pods are running normally without scheduling issues:

wo-ai-cognitive-mapper-svc-6c44d9576d-74z2c
wo-ai-cognitive-mapper-svc-6c44d9576d-cfp88

The upgrade did not require more replicas, and no GPU scheduling failures were found for the active pods.
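
To confirm the replica count on your own cluster, a minimal sketch such as the following might help. It assumes that the pod names start with wo-ai-cognitive-mapper-svc-, as in the example above. The helper name count_mapper_pods is hypothetical and is not part of the product.

```shell
# Hypothetical helper (not part of watsonx Orchestrate): counts the pods
# whose names start with the service prefix. It reads `oc get pods`
# output from standard input, so it can be checked without a cluster.
count_mapper_pods() {
  grep -c '^wo-ai-cognitive-mapper-svc-'
}

# Example usage; replace <namespace> with your project:
#   oc get pods -n <namespace> --no-headers | count_mapper_pods
```

A result of 2 matches the expected replica count that is described in this section.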

Solution

If you still observe an unscheduled or failing pod, you can:

Verify pod status again and confirm whether only two pods exist (expected).
Check pod events by using:
```
oc describe pod <pod-name> -n <namespace>
```
In the output, look specifically for GPU scheduling errors, for example, insufficient GPU resources, taints, node selectors, or missing tolerations.
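
To narrow the output of the describe command to scheduling-related lines, you might use a small filter like the following sketch. The function name filter_gpu_scheduling_events and the keyword list are assumptions for illustration. The actual event wording depends on your cluster version, so adjust the pattern as needed.

```shell
# Hypothetical helper: keeps only the lines that commonly indicate
# scheduling problems (insufficient resources, taints, tolerations,
# node selectors, or affinity rules). Reads text from standard input.
filter_gpu_scheduling_events() {
  grep -Ei 'FailedScheduling|Insufficient|taint|toleration|selector|affinity'
}

# Example usage; the placeholders are the same as in the command above:
#   oc describe pod <pod-name> -n <namespace> | filter_gpu_scheduling_events
```

If the filter returns no lines, the pod events most likely do not report a scheduling constraint.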

In summary, the service is functioning as expected with two healthy replicas. Investigate scheduling issues only if additional pods appear or if events indicate node scheduling constraints.

Agentic creation failure without models

Symptoms

If no models are defined in an agentic installation of watsonx Orchestrate, agent creation and default agent invocation fail with errors such as "Creating the agent failed. Please try again."

Solution

To resolve this issue, use one of the following approaches:

For an agentic installation of watsonx Orchestrate, specify at least one model in the install-options.yml file.
Use watsonx_ai_ifm to mirror the required models.