Known issues for watsonx Orchestrate

The following known issues and limitations apply to watsonx Orchestrate.

Known issues for 5.4.0

The following sections categorize known issues by type:

Installation issues

For a complete list of troubleshooting information for installation issues, see Installation issues.

Upgrade issues

For a complete list of troubleshooting information for upgrade issues, see Upgrade issues.

UI issues

For a complete list of troubleshooting information for UI issues, see Other issues.

Other issues

For a complete list of troubleshooting information for issues other than installation and upgrade issues, see Other issues.

Installation fails with `wo-archer-server-db-schema-job` stuck in progress

Applies to: 5.4.0

Problem

During an air-gapped installation of watsonx Orchestrate 5.3.1 Patch 5, the installation fails with the custom resource (CR) stuck in the InProgress state. The wo-archer-server-db-schema-job job does not complete for an extended period, which prevents the installation from progressing.

Solution

To resolve this issue, delete the stuck wo-archer-server-db-schema-job job. The operator will automatically recreate the job, which should then complete successfully and allow the installation to proceed.

Run the following command to delete the job:

oc delete job wo-archer-server-db-schema-job -n <namespace>

Replace <namespace> with your watsonx Orchestrate installation namespace.

After deleting the job, monitor the operator logs to verify that the job is recreated and completes successfully.
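As a rough sketch of that verification, the following shell function polls until the operator re-creates the job and then waits for the job to complete. The function name wait_for_schema_job, the polling interval, the retry count, and the 30-minute timeout are illustrative assumptions and do not come from the product.

```shell
# Hypothetical helper: wait for the operator to re-create the job, then wait
# for the job to complete. Retry counts and timeouts are assumptions.
wait_for_schema_job() {
  ns="$1"
  job="wo-archer-server-db-schema-job"
  tries=0
  # Poll until the job exists again (60 attempts, 10 seconds apart by default).
  until oc get job "$job" -n "$ns" >/dev/null 2>&1; do
    tries=$((tries + 1))
    if [ "$tries" -ge 60 ]; then
      echo "job $job was not re-created" >&2
      return 1
    fi
    sleep "${POLL_SECONDS:-10}"
  done
  # Block until the job reports the Complete condition.
  oc wait --for=condition=complete "job/$job" -n "$ns" --timeout=30m
}
```

Call it as wait_for_schema_job <namespace> after you delete the stuck job.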

Intermittent failure during watsonx Orchestrate installation

Applies to: 5.4.0

Problem

In IBM® Software Hub 5.3.1 Patch 2, the watsonx Orchestrate installation might intermittently fail. During installation, the system can time out while waiting for the watsonx Orchestrate custom resource (CR) to become ready. In this scenario, the CR remains in an InProgress state, and the install-components command exits with an error.

This issue is associated with a race condition during PostgreSQL initialization, where the PostgreSQL persistent volume claim (PVC) can remain in an initializing state. As a result, the operator can become stuck in a reconciliation loop and prevent the installation from completing.

Solution

If the installation fails, complete the following steps:

Check the PostgreSQL PVC annotations to determine whether the PVC is stuck in an initializing state. If necessary, update the PVC annotation to indicate that the PVC is ready.
If the PostgreSQL pod is crashing due to an uninitialized data directory, delete the watsonx Orchestrate PostgreSQL cluster.
Allow the operator to recreate and initialize the PostgreSQL cluster.
Verify that the watsonx Orchestrate CR transitions to a Completed or Ready state.

After these steps are completed, the installation should finish successfully.
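The steps above do not name the PVC or the annotation to inspect. As a starting point, the following sketch lists every PVC in the namespace together with its phase and annotations so that the PostgreSQL PVC can be located and reviewed. The function name list_pvc_annotations is a hypothetical helper, and the column layout is an arbitrary choice.

```shell
# Hypothetical helper: list PVC names, phases, and annotations in a namespace.
# Use the output to find the PostgreSQL PVC and review its annotations.
list_pvc_annotations() {
  ns="$1"
  oc get pvc -n "$ns" \
    -o custom-columns='NAME:.metadata.name,PHASE:.status.phase,ANNOTATIONS:.metadata.annotations'
}
```

Call it as list_pvc_annotations <namespace>.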

`wa-dialog-pods` is in `crashloop` during 5.3.1 Patch 2 installation

Applies to: 5.4.0

Problem

The Data Governor operator enters an ESTemplate error state because OpenSearch Security fails to initialize properly during the initial startup of the OpenSearch pods, which prevents successful reconciliation.

Solution

To fix this issue, do the following steps:

To confirm that the issue is caused by the Data Governor operator, run:
```
oc get dg -A
```
Verify that wo-wa-data-governor is stuck on ESTemplate status.
To confirm the root error, run:
```
oc get estemplate
```
Check for Error status on the ESTemplate resource.

To validate the OpenSearch Security state, read the elastic user password from the Data Governor search pod, and then use curl from the same pod to query the OpenSearch endpoint:

$ oc rsh wo-wa-data-governor-ibm-data-governor-all-search-000 cat internal_users/elastic
<password>

$ oc rsh wo-wa-data-governor-ibm-data-governor-all-search-000 curl -k https://elastic:<password>@localhost:9200/_cat/indices

Proceed to the next step only if the response is:

OpenSearch Security not initialized

Restart the OpenSearch pods:

$ oc delete pod wo-wa-data-governor-ibm-data-governor-all-search-000 wo-wa-data-governor-ibm-data-governor-all-search-001 wo-wa-data-governor-ibm-data-governor-all-search-002

To verify Data Governor reconciliation, run:
```
oc get dg -A && oc get estemplate -A
```
Confirm that wo-wa-data-governor progresses past ESTemplate status and reconciles successfully, and the ESTemplate resource shows Completed instead of Error.

watsonx Orchestrate installation fails on air‑gapped clusters when the wo-docproc-dpi-liquibase-init pod enters an error state

Applies to: 5.4.0

Problem: In air‑gapped clusters, the installation of watsonx Orchestrate fails because the wo-docproc-dpi-liquibase-init pod enters an error state during deployment.

Solution

Verify the error message by using the following command.

oc logs <pod name>

You might see the following error.

ERROR: Exception Details
ERROR: Exception Primary Class: LockException
ERROR: Exception Primary Reason: Could not acquire change log lock. Currently locked by wo-docproc-dpi-liquibase-init-67x74 (10.254.20.76) since 12/16/25, 11:51 AM
ERROR: Exception Primary Source: 4.23.0

Collect the logs from the affected pod (for example, wo-docproc-dpi-liquibase-init-67x74) to help identify the root cause of the lock.

If the root cause is Reason: liquibase.exception.DatabaseException: An I/O error occurred while sending to the backend., remove the lock and retry the job.
This behavior indicates possible instability in the PostgreSQL connection, and retrying has low impact. If the error persists or appears to have another cause, contact support to verify the issue before you proceed.
Create a debug pod to connect to the PostgreSQL instance, preferably by using the same pod referenced in the error.
```
oc debug wo-docproc-dpi-liquibase-init-67x74
```

From the debug pod, connect to the PostgreSQL instance.

sh-5.1$ psql
psql (16.8, server 15.14)
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, compression: off)
Type "help" for help.

Verify that the lock is still in place. The command returns one row.

wo_docproc_dpi=# select * from db_dpi_changelog_lock;

id | locked |        lockgranted         |                      lockedby
----+--------+----------------------------+----------------------------------------------------
  1 | t      | 2025-12-16 11:51:36.393442 | wo-docproc-dpi-liquibase-init-67x74 (10.254.20.76)
(1 row)

Remove the lock.

wo_docproc_dpi=# delete from db_dpi_changelog_lock where locked;
DELETE 1

Verify that the lock is removed. The command does not return any rows.

wo_docproc_dpi=# select * from db_dpi_changelog_lock;

 id | locked | lockgranted | lockedby
----+--------+-------------+------------------
(0 rows)

End the psql connection by pressing Ctrl+D, and then exit the pod by pressing Ctrl+D again.
After the debug pod is deleted, delete the liquibase init job. The docproc operator will re-create it on the next reconcile.
```
oc delete job wo-docproc-dpi-liquibase-init
```
```
job.batch "wo-docproc-dpi-liquibase-init" deleted
```

Verify that the job completes successfully after it is re-created.

oc get pods -lcomponent=dpi-liquibase-init

NAME                                  READY   STATUS      RESTARTS   AGE
wo-docproc-dpi-liquibase-init-hlxlb   0/1     Completed   0          80s

The `deploy-knative-eventing` command fails with `error: no matching resources found`

Applies to: 5.4.0

Problem

The cpd-cli manage deploy-knative-eventing command fails with error: no matching resources found after the message deployment.apps/kafka-controller condition met. The failure occurs because no pods with the label app=kafka-broker-dispatcher are present.

Solution

Open a shell in the Docker container that runs olm-utils.
```
docker exec -it olm-utils-play-v3 bash
```

Check for the line that must be removed.

cat /opt/ansible/bin/deploy-knative-eventing | grep kafka-broker-dispatcher

Output:

oc wait pods -n knative-eventing --selector app=kafka-broker-dispatcher --for condition=Ready --timeout=60s

Remove the line.

sed -i '/kafka-broker-dispatcher/d' /opt/ansible/bin/deploy-knative-eventing

Verify that the line is removed. The following command returns no output.

cat /opt/ansible/bin/deploy-knative-eventing | grep kafka-broker-dispatcher

Kafka upgrade failure in Data Governor during 5.2.x to 5.4.0 upgrade

Applies to: 5.4.0

Problem

During the watsonx Orchestrate upgrade from Version 5.2.x to 5.4.0, Kafka attempts to upgrade from 3.8.0 to 4.2.0 but fails because the metadata remains at version 3.8-IV0. This causes Kafka broker and controller connection failures and prevents identification of the active controller, which leaves the upgrade in an inconsistent and incomplete state.

Solution

Restart the Data Governor Kafka-related pods to force reconciliation.

oc delete pods -l ibmevents.ibm.com/cluster=$INSTANCE-data-governor-kafka -n $NAMESPACE

In this command, $INSTANCE is set to wo-wa and $NAMESPACE is set to cpd.

Restarting the pods enables the Kafka cluster to complete the upgrade process and return to a Ready state.
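To confirm that the restarted pods come back, a sketch along the following lines waits for the pods with the same label selector to become Ready. The function name wait_for_dg_kafka and the 600-second timeout are assumptions.

```shell
# Hypothetical helper: wait for the restarted Data Governor Kafka pods to
# become Ready. The 600-second timeout is an assumption; adjust as needed.
wait_for_dg_kafka() {
  instance="$1"
  ns="$2"
  oc wait pods -l "ibmevents.ibm.com/cluster=${instance}-data-governor-kafka" \
    -n "$ns" --for=condition=Ready --timeout=600s
}
```

For example, wait_for_dg_kafka wo-wa cpd. If the command runs before the replacement pods are created, oc wait can report that no matching resources were found; in that case, run it again after a short delay.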

watsonx Orchestrate instance in unknown status after upgrade

Applies to: 5.4.0

Problem

After upgrading watsonx Orchestrate, the instance shows an unknown status. When you check the voice-controller and archer-server deployments, they are scaled down to 0 pods, and a warning indicates that resource requests and limits are not set.

Solution

Apply the following patches to resolve the issue:

oc patch deployment wo-archer-server -n ${PROJECT_CPD_INST_OPERANDS} --type='strategic' -p='
spec:
  template:
    spec:
      initContainers:
      - name: combine-ca-certs
        resources:
          limits:
            cpu: 200m
            memory: 256Mi
          requests:
            cpu: 100m
            memory: 128Mi
'

oc patch deployment wo-voice-controller -n ${PROJECT_CPD_INST_OPERANDS} --type='strategic' -p='
spec:
  template:
    spec:
      initContainers:
      - name: setup-ca-trust
        resources:
          limits:
            cpu: 200m
            memory: 256Mi
          requests:
            cpu: 100m
            memory: 128Mi
'

watsonx Assistant CR stuck in VerifyWait due to unverified tf component

Applies to: 5.4.0

Problem

The watsonx Assistant CR gets stuck in the VerifyWait state with the tf component remaining unverified. When you check the watsonx Assistant CR status, it shows:

oc get wa
NAME    VERSION   READY   READYREASON    UPDATING   UPDATINGREASON   DEPLOYED   VERIFIED   QUIESCE       DATASTORE      AGE
wo-wa   5.4.0     False   Initializing   True       VerifyWait       20/20      12/20      DEQUIESCING   NOT_QUIESCED   143m

The CLU YAML shows that the tf component is still unverified:

oc get watsonassistantclu.assistant.watson.ibm.com/wo-wa -o yaml
failedComponents: []
unverifiedComponents: tf
verified: 6/7

When you check the tf pod status, both pods are in CrashLoopBackOff status:

oc get po | grep tf-
tf-869dbf64c4-kr4cc    1/2   CrashLoopBackOff   9 (2m39s ago)   26m
tf-869dbf64c4-lr2sp    1/2   CrashLoopBackOff   9 (3m51s ago)   26m

Solution

The tf pods might be timing out while downloading models because of slow Multicloud Object Gateway (MCG) performance. Check the existing noobaa-default-backing-store BackingStore configuration, then increase the NooBaa backing store resources and set the PV volume count to 3 by running the following command:

oc patch -n openshift-storage backingstore noobaa-default-backing-store --type=merge --patch='{"spec":{"pvPool":{"numVolumes":3,"resources":{"requests":{"memory":"4Gi","cpu":"1"},"limits":{"memory":"8Gi","cpu":"2"}}}}}'
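The solution asks you to check the existing BackingStore configuration but does not show a command for it. One possible way, sketched here, is to print the current pvPool section so that it can be compared before and after the patch. The function name show_backingstore_pvpool is hypothetical, and the sketch assumes that the backing store is of the PV pool type, as the patch above implies.

```shell
# Hypothetical helper: print the pvPool settings (volume count and resources)
# of the default NooBaa backing store.
show_backingstore_pvpool() {
  oc get backingstore noobaa-default-backing-store -n openshift-storage \
    -o jsonpath='{.spec.pvPool}{"\n"}'
}
```

Run show_backingstore_pvpool before the patch to record the current values, and again afterward to confirm that numVolumes and the resource settings changed.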

Component services not updated after upgrade to 5.3.1 Patch 5

Applies to: 5.4.0

Problem

When you upgrade from 5.2.x or 5.3.0 to 5.3.1 Patch 5, you might encounter an issue where the component services do not get updated, while the watsonx Orchestrate CR status still shows 100%. The component services get stuck in an error state when the size field is not defined in the watsonx Orchestrate CR, and the fallback logic throws an error.

Solution

Set the size value in the watsonx Orchestrate CR by applying the following patch command:

oc patch wo wo -n ${PROJECT_CPD_INST_OPERANDS} --type=merge -p '{"spec":{"size":"medium"}}'

After applying the patch, the component services should update successfully.

Chat with docs cleanup job fails due to insufficient memory after upgrade

Applies to: 5.4.0

Problem

After upgrading from 5.3.0 to 5.3.1 Patch 3 or Patch 5, the wo-chat-with-docs-expiry-cronjob pod fails with an OOMKilled error. The pod's memory limit is set to 200Mi, which may be insufficient when processing multiple knowledge bases or chat-with-docs resources that need to be deleted. The job performs several memory-intensive operations including:

Multiple Postgres queries and updates to remove knowledge bases and related sub-resources
Milvus vector store cleanup operations
S3 document deletion
HTTP requests to TRM (Tools Runtime Manager) to remove tool deployments

Solution

To resolve this issue, increase the memory limit for the chat with docs expiry cronjob to 300-400Mi. Start with 300Mi and monitor actual usage to determine if further adjustment is needed.

Run the following command to update the memory limit:

Patch command to be included

After applying the patch, monitor the pod's memory usage to ensure the new limit is sufficient.

watsonx Orchestrate instance in UNKNOWN state after upgrade to 5.3.1 Patch 5

Applies to: 5.4.0

Problem

After upgrading from 5.3.0 to 5.3.1 Patch 5, the watsonx Orchestrate instance may appear in an UNKNOWN state.

Solution

To resolve this issue, complete the following steps:

Set the namespace variable:

export PROJECT_CPD_INST_OPERANDS=<your-namespace>

Delete the watsonx Orchestrate Kafka custom resource:

oc delete kafka wo-watson-orchestrate-kafkaibm -n ${PROJECT_CPD_INST_OPERANDS}

Delete the deployments of wo-archer-server and wo-voice-controller:
```
oc delete deployment wo-archer-server wo-voice-controller -n ${PROJECT_CPD_INST_OPERANDS}
```
After completing these steps, wait for the component operator to recreate the resources. Monitor the watsonx Orchestrate instance status until it returns to a healthy state.
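One way to monitor the status is the get-cr-status command that appears elsewhere in these known issues. The wrapper below is only a sketch: the function name check_wo_status is hypothetical, and it assumes that PROJECT_CPD_INST_OPERANDS is already exported as in the first step.

```shell
# Sketch: report the watsonx Orchestrate CR status with cpd-cli.
# Assumes PROJECT_CPD_INST_OPERANDS is exported in the current shell.
check_wo_status() {
  cpd-cli manage get-cr-status \
    --cpd_instance_ns="${PROJECT_CPD_INST_OPERANDS}" \
    --components=watsonx_orchestrate
}
```

Repeat check_wo_status every few minutes until the reported status is no longer in progress or unknown.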

Agent Assist chat is stuck and LLM responses are not displayed after upgrade to 5.3.1

Applies to: 5.4.0

Problem

After upgrading to watsonx Orchestrate 5.3.1, Agent Assist does not function correctly for chat or voice interactions. Agent Assist might remain stuck in a buffering state or display the message:

There’s currently no conversation content to summarize.

Although the LLM processes the request successfully and responses are generated in the backend, the responses are not streamed back to the Agent Assist UI. This issue is caused by stale cached UI assets being served after the upgrade, combined with embedded chat security being enabled by default in 5.3.1, which is not compatible with Agent Assist. Agent Assist works only when security is disabled. As a result, some static assets are loaded incorrectly, leading to UI errors and blocked Agent Assist functionality.

Solution

To restore Agent Assist functionality, complete the following steps:

Clear caches.
- Clear the browser cache for affected users.
- Clear CDN or gateway caches to ensure the latest UI assets are served.

Restart the Agent Assist socket handler pods.

oc scale deploy/wo-socket-handler --replicas=0 -n <namespace>
oc scale deploy/wo-socket-handler --replicas=1 -n <namespace>

Disable embedded chat security to restore Agent Assist features. For more information, follow the steps mentioned in Configuring security with scripting.
Verify deployment consistency.
- Ensure that all watsonx Orchestrate components, especially wo-archer-server, are running the correct 5.3.1 image and deployment configuration. Review the wo-archer-server pod's volume mounts to check for any attached PVC from previous customizations.
- If a PVC is attached, detach it from the archer-server deployment to prevent outdated code from overriding the 5.3 image.
- Restart the pod to ensure it runs the correct 5.3 application code from the image.
- Verify that LLM functionality is restored after the pod restart.

After completing these steps, Agent Assist resumes normal operation and LLM responses are displayed correctly in the UI.

Existing tool flows fail after upgrading watsonx Orchestrate from 5.2.2 to 5.3.1 Patch 2

Applies to: 5.4.0

Problem: After upgrading watsonx Orchestrate from version 5.2.2 to 5.3.1 Patch 2, existing tool flows that executed successfully in 5.2.2 fail to execute in the upgraded environment. The issue occurs without modification to the tool flows and blocks execution of previously working agents and flows.
Solution: To fix this issue, follow the steps mentioned in Post upgrade tasks.

Downtime when upgrading from 5.1.x or 5.2.x to 5.3.0

Applies to: 5.4.0

Problem: When you upgrade from 5.1.x or 5.2.x to 5.3.0, you might experience approximately 3–4 minutes of downtime at the beginning of the upgrade.

watsonx Orchestrate bootstrap job fails to complete during upgrade or installation

Applies to: 5.4.0

Problem: During upgrade or installation, the watsonx Orchestrate bootstrap job might fail to complete because the wo-skill-sequencing pods run out of memory.

Solution

Increase the memory limit for the wo-skill-sequencing pods by applying an RSI patch, then restart the bootstrap job and validate the upgrade.

Make sure that the following prerequisites are met:

You are logged in to the OpenShift® cluster.
oc CLI and cpd-cli are installed and configured.
PROJECT_CPD_INST_OPERANDS is set to the watsonx Orchestrate operand namespace.
```
export PROJECT_CPD_INST_OPERANDS=<cpd-instance-namespace>
```

Create the RSI working directory.

mkdir -p cpd-cli-workspace/olm-utils-workspace/work/rsi

Create a patch file named skill-seq.json in the RSI directory with the following content.

[
  {
    "op": "replace",
    "path": "/spec/containers/0/resources/limits/memory",
    "value": "3Gi"
  }
]

Apply the RSI patch.

cpd-cli manage create-rsi-patch \
--cpd_instance_ns=${PROJECT_CPD_INST_OPERANDS} \
--patch_name=skill-seq-resource-limit \
--patch_type=rsi_pod_spec \
--patch_spec=/tmp/work/rsi/skill-seq.json \
--spec_format=json \
--include_labels=wo.watsonx.ibm.com/component:wo-skill-sequencing \
--state=active

Restart the bootstrap job.

oc delete job wo-watson-orchestrate-bootstrap-job

Validate the upgrade.

cpd-cli manage get-cr-status \
--cpd_instance_ns=${PROJECT_CPD_INST_OPERANDS} \
--components=watsonx_orchestrate

After the fix is applied, new wo‑skill‑sequencing pods start with a 3 Gi memory limit, the bootstrap job completes successfully, and the watsonx Orchestrate upgrade proceeds normally.
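The directory and patch file steps above can be scripted. The following sketch creates the RSI directory and writes skill-seq.json in one call. The function name write_skill_seq_patch is hypothetical; the default directory is the one from the mkdir step and is resolved relative to the current directory.

```shell
# Hypothetical helper: create the RSI working directory and write the
# skill-seq.json patch file into it. Pass a directory as the first argument
# if your cpd-cli workspace is located elsewhere.
write_skill_seq_patch() {
  rsi_dir="${1:-cpd-cli-workspace/olm-utils-workspace/work/rsi}"
  mkdir -p "$rsi_dir" || return 1
  cat > "$rsi_dir/skill-seq.json" <<'EOF'
[
  {
    "op": "replace",
    "path": "/spec/containers/0/resources/limits/memory",
    "value": "3Gi"
  }
]
EOF
}
```

The /tmp/work/rsi/skill-seq.json path in the create-rsi-patch command appears to be the same file as seen from inside the olm-utils container; treat that mapping as an assumption and confirm it against your cpd-cli workspace.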

Unable to connect or import remote MCP server

Applies to: 5.4.0

Problem: When you try to add MCP tools in the Agent builder screen, the connection to a remote MCP server fails. When you create an agent, click Add tools, select MCP server, click Add MCP server, select Remote MCP server, provide the remote MCP server details, and click Connect, the connection fails with the error Gateway creation failed: 401 {"detail":"Invalid authentication credentials"}. This prevents MCP tools from being added successfully to agents.

502 Bad Gateway error on wxochat endpoint in watsonx Orchestrate

Applies to: 5.4.0

Problem

The /wxochat/ endpoint returns a 502 Bad Gateway error. This occurs due to an SSL handshake failure between the NGINX layer (Zen) and the upstream service (wo-uiproxy). The issue is caused by incorrect SSL certificate validation directives in the /wxochat/ location block of the ZenExtension configuration that prevent proper communication between the proxy and the upstream service.

Solution

To resolve this issue, remove the problematic SSL directives from the ZenExtension configuration:

Set the namespace variable:

export PROJECT_CPD_INST_OPERANDS=<your-namespace>

Back up the existing configuration:

oc get zenextension wo-watson-orchestrate-zen-service -n ${PROJECT_CPD_INST_OPERANDS} -o yaml > zenextension-backup.yaml

Annotate the ZenExtension to prevent the operator from overwriting manual changes:

oc annotate zenextension wo-watson-orchestrate-zen-service wo.watsonx.ibm.com/hands-off="true" -n ${PROJECT_CPD_INST_OPERANDS}

Edit the ZenExtension:

oc edit zenextension wo-watson-orchestrate-zen-service -n ${PROJECT_CPD_INST_OPERANDS}

In the editor, locate the location /wxochat/ block and remove the following four lines if present:

proxy_ssl_trusted_certificate  /etc/internal-nginx-svc-tls/ca.crt;
proxy_ssl_verify               on;
proxy_ssl_certificate          /etc/internal-nginx-svc-tls/tls.crt;
proxy_ssl_certificate_key      /etc/internal-nginx-svc-tls/tls.key;

Important: Do not modify any other location blocks. Preserve the existing indentation: NGINX itself ignores indentation, but the configuration is embedded in the ZenExtension YAML, which is indentation-sensitive.

Save and exit the editor (in vi: press Esc, type :wq, press Enter).
Wait for the ZenExtension to reconcile. Monitor the status until it returns to Completed:
```
oc get zenextension wo-watson-orchestrate-zen-service -n ${PROJECT_CPD_INST_OPERANDS}
```

Restart the watsonx Orchestrate components:

oc rollout restart `oc get deployment -o name -l icpdsupport/module=components-services-orchestrate`

If the change causes issues, restore from the backup:

oc apply -f zenextension-backup.yaml
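Before and after the edit, it can help to check whether the four directives are present. The following sketch searches the ZenExtension YAML for them and prints each match with its line number. The function name find_proxy_ssl_directives is hypothetical, and the search covers the whole resource rather than only the /wxochat/ block, so review where each match belongs.

```shell
# Hypothetical helper: list the proxy_ssl_* directives named in this procedure
# wherever they occur in the ZenExtension. Matches in other location blocks
# must be left alone.
find_proxy_ssl_directives() {
  ns="$1"
  oc get zenextension wo-watson-orchestrate-zen-service -n "$ns" -o yaml \
    | grep -n -E 'proxy_ssl_(trusted_certificate|verify|certificate|certificate_key)[[:space:]]'
}
```

Call it as find_proxy_ssl_directives "${PROJECT_CPD_INST_OPERANDS}". When nothing matches, grep prints nothing and returns a nonzero exit status.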

Error displayed when saving platform language changes in agentic clusters

Applies to: 5.4.0

Problem: When an admin user logs in to watsonx Orchestrate and attempts to add and save a language by clicking on the profile and navigating to Settings > Platform languages, the system fails to save the selected languages and instead displays an error.

Error when adding an LDAP user to a watsonx Orchestrate instance

Applies to: 5.4.0

Problem: On the MyInstances screen, when you add an LDAP user to a watsonx Orchestrate instance and check the status, the following error is displayed: [POST /grant/][500] addUserToServiceInstanceInternalServerError &{MessageCode: StatusCode:0 Exception: Message:}.
Solution: Add the user a second time. The second attempt succeeds.

Reset test run in Evaluate response settings fails with 500 error

Applies to: 5.4.0

Problem: When you reset the test run from the Evaluate response settings page of AI assistant builder, the operation fails with a 500 Internal Server Error and the UI hangs.
Solution: If the page hangs after the error, go to a different page and then return to the same page, or refresh the page.

The Preview page is not available when watsonx Assistant is created by using the API

Applies to: 5.4.0

Problem: The Preview page is not accessible when watsonx Assistant is created by using the API.

Milvus standalone pod fails due to etcd backend quota exhaustion

Applies to: 5.4.0

Problem

The Milvus standalone pod (ibm-lh-lakehouse-wo-milvus-standalone) enters the CrashLoopBackOff state with the error etcdserver: mvcc: database space exceeded. The pod crashes during startup when Milvus attempts to register sessions in etcd. The backing etcd pod (ibm-lh-lakehouse-wo-milvus-etcd-0) has reached its backend quota of 2 GiB and raised a NOSPACE alarm. This issue is not caused by PVC size limitations, image pull failures, or pod scheduling problems.

Solution

To recover from this issue, perform the following steps:

Verify the Milvus pod is crashing with the mvcc: database space exceeded error.

Confirm the etcd pod has the NOSPACE alarm by running:

oc exec -n <namespace> ibm-lh-lakehouse-wo-milvus-etcd-0 -- sh -lc 'ETCDCTL_API=3 etcdctl alarm list'

Check the etcd endpoint status to confirm the database size has reached the quota:

oc exec -n <namespace> ibm-lh-lakehouse-wo-milvus-etcd-0 -- sh -lc 'ETCDCTL_API=3 etcdctl endpoint status -w table'

Compact the etcd database at the current revision (replace <revision> with the current revision number from the status output):
```
oc exec -n <namespace> ibm-lh-lakehouse-wo-milvus-etcd-0 -- sh -lc 'ETCDCTL_API=3 etcdctl compact <revision>'
```
Defragment the etcd database with an extended timeout:
```
oc exec -n <namespace> ibm-lh-lakehouse-wo-milvus-etcd-0 -- sh -lc 'ETCDCTL_API=3 etcdctl --command-timeout=120s defrag'
```
Note: The defragmentation process may take several minutes to complete.

Disarm the etcd alarm:

oc exec -n <namespace> ibm-lh-lakehouse-wo-milvus-etcd-0 -- sh -lc 'ETCDCTL_API=3 etcdctl --command-timeout=120s alarm disarm'

Verify the alarm has been cleared:
```
oc exec -n <namespace> ibm-lh-lakehouse-wo-milvus-etcd-0 -- sh -lc 'ETCDCTL_API=3 etcdctl alarm list'
```
The command should return no output if the alarm is successfully cleared.
Restart the Milvus standalone pod:
```
oc delete pod -n <namespace> <milvus-pod-name>
```
The deployment will automatically create a replacement pod that should start successfully.

After completing these steps, verify that the deployment status shows READY 1/1 and the pod is in Running state.
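The compaction step needs the current revision number. The JSON form of the endpoint status output reports it in the response header, so a sketch like the following can extract it. The grep pattern follows the approach shown in the etcd maintenance guide; the function names extract_etcd_revision and get_milvus_etcd_revision are hypothetical.

```shell
# Sketch: read etcd endpoint status JSON on stdin and print the first
# revision number that it contains.
extract_etcd_revision() {
  grep -E -o '"revision":[0-9]+' | grep -E -o '[0-9]+' | head -n 1
}

# Hypothetical wrapper: fetch the status as JSON from the Milvus etcd pod
# and print the current revision.
get_milvus_etcd_revision() {
  ns="$1"
  oc exec -n "$ns" ibm-lh-lakehouse-wo-milvus-etcd-0 -- \
    sh -lc 'ETCDCTL_API=3 etcdctl endpoint status -w json' | extract_etcd_revision
}
```

For example, rev="$(get_milvus_etcd_revision <namespace>)" stores the value that replaces <revision> in the compact command.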

Restoring the Milvus S3 backup requires a manual job after cluster restore and stabilization

Applies to: 5.4.0

Problem

After the cluster is restored and stabilized, the Milvus S3 bucket backup is not restored automatically. You must run the restore job manually.

Solution

Set the variable PROJECT_CPD_INST_OPERANDS to the watsonx Orchestrate operand namespace.

Run the appropriate restore job for your environment.

For an online restore, run the following job:

apiVersion: batch/v1
kind: Job
metadata:
  name: wo-watson-orchestrate-backup-milvus-s3-restore
  namespace: ${PROJECT_CPD_INST_OPERANDS}
  annotations:
    name: wo-watson-orchestrate-backup-milvus-s3-restore
    namespace: ${PROJECT_CPD_INST_OPERANDS}
  labels:
    app.kubernetes.io/component: backup
    app.kubernetes.io/instance: wo
    app.kubernetes.io/managed-by: ibm-watson-orchestrate-operator
    app.kubernetes.io/name: watson-orchestrate
    icpdsupport/addOnId: orchestrate
    icpdsupport/app: backup
    icpdsupport/module: backup-orchestrate
    icpdsupport/podSelector: backup
    wo.watsonx.ibm.com/application: watson-orchestrate
    wo.watsonx.ibm.com/component: backup
    wo.watsonx.ibm.com/cr-name: wo
    wo.watsonx.ibm.com/external-access: "true"
    wo.watsonx.ibm.com/operand-version: 7.1.2
spec:
  backoffLimit: 0
  template:
    metadata:
      annotations:
        productName: IBM watsonx Orchestrate
        productVersion: 5.3.3
      labels:
        app.kubernetes.io/component: backup
        app.kubernetes.io/instance: wo
        app.kubernetes.io/managed-by: ibm-watson-orchestrate-operator
        app.kubernetes.io/name: watson-orchestrate
        icpdsupport/addOnId: orchestrate
        icpdsupport/app: backup
        icpdsupport/module: backup-orchestrate
        icpdsupport/podSelector: backup
        wo.watsonx.ibm.com/application: watson-orchestrate
        wo.watsonx.ibm.com/component: backup
        wo.watsonx.ibm.com/cr-name: wo
        wo.watsonx.ibm.com/external-access: "true"
        wo.watsonx.ibm.com/operand-version: 7.1.2
    spec:
      serviceAccountName: wo-watson-orchestrate-backup-restore
      restartPolicy: Never
      containers:
        - name: restore
          image: cp.icr.io/cp/watsonx-orchestrate/ibm-watsonx-orchestrate-onprem-utils@sha256:09d6e5c703ea9ca89baec4cbc8549187072fc0b8b00c1f99a68e1b09ec0641cd
          imagePullPolicy: Always
          command:
            - ./milvus-s3-br.sh
            - restore
          env:
            - name: JOB_NAME
              value: wo-watson-orchestrate-backup-milvus-s3-restore
            - name: JOB_NAMESPACE
              value: ${PROJECT_CPD_INST_OPERANDS}
            - name: SERVER_STORAGE_BUCKET
              value: wo-server-storage-bucket-${PROJECT_CPD_INST_OPERANDS}
          volumeMounts:
            - name: milvus-secret
              mountPath: /secrets/wo-milvus-storage-bucket
            - name: s3-account
              mountPath: /secrets/s3-account
            - name: s3-uri
              mountPath: /secrets/s3-uri
            - name: milvus-configmap
              mountPath: /configmaps/wo-milvus-storage-bucket
            - name: s3-cert
              mountPath: /secrets/s3-cert
            - name: s3-backup-pvc
              mountPath: /tmp/s3-backup
      volumes:
        - name: milvus-secret
          secret:
            secretName: wo-milvus-storage-bucket
        - name: s3-account
          secret:
            secretName: noobaa-account-watsonx-orchestrate
        - name: s3-uri
          secret:
            secretName: noobaa-uri-watsonx-orchestrate
        - name: milvus-configmap
          configMap:
            name: wo-milvus-storage-bucket
        - name: s3-cert
          secret:
            secretName: noobaa-cert-watsonx-orchestrate
        - name: s3-backup-pvc
          persistentVolumeClaim:
            claimName: wo-watson-orchestrate-backup-s3

For an offline restore, run the following job:

apiVersion: batch/v1
kind: Job
metadata:
  name: wo-watson-orchestrate-backup-offline-milvus-s3-restore
  namespace: ${PROJECT_CPD_INST_OPERANDS}
  annotations:
    name: wo-watson-orchestrate-backup-offline-milvus-s3-restore
    namespace: ${PROJECT_CPD_INST_OPERANDS}
  labels:
    app.kubernetes.io/component: backup-offline
    app.kubernetes.io/instance: wo
    app.kubernetes.io/managed-by: ibm-watson-orchestrate-operator
    app.kubernetes.io/name: watson-orchestrate
    icpdsupport/addOnId: orchestrate
    icpdsupport/app: backup-offline
    icpdsupport/module: backup-offline-orchestrate
    icpdsupport/podSelector: backup-offline
    wo.watsonx.ibm.com/application: watson-orchestrate
    wo.watsonx.ibm.com/component: backup-offline
    wo.watsonx.ibm.com/cr-name: wo
    wo.watsonx.ibm.com/external-access: "true"
    wo.watsonx.ibm.com/operand-version: 7.1.2
spec:
  backoffLimit: 0
  template:
    metadata:
      annotations:
        productName: IBM watsonx Orchestrate
        productVersion: 5.3.3
      labels:
        app.kubernetes.io/component: backup-offline
        app.kubernetes.io/instance: wo
        app.kubernetes.io/managed-by: ibm-watson-orchestrate-operator
        app.kubernetes.io/name: watson-orchestrate
        icpdsupport/addOnId: orchestrate
        icpdsupport/app: backup-offline
        icpdsupport/module: backup-offline-orchestrate
        icpdsupport/podSelector: backup-offline
        wo.watsonx.ibm.com/application: watson-orchestrate
        wo.watsonx.ibm.com/component: backup-offline
        wo.watsonx.ibm.com/cr-name: wo
        wo.watsonx.ibm.com/external-access: "true"
        wo.watsonx.ibm.com/operand-version: 7.1.2
    spec:
      serviceAccountName: wo-watson-orchestrate-backup-restore
      restartPolicy: Never
      containers:
        - name: restore
          image: cp.icr.io/cp/watsonx-orchestrate/ibm-watsonx-orchestrate-onprem-utils@sha256:09d6e5c703ea9ca89baec4cbc8549187072fc0b8b00c1f99a68e1b09ec0641cd
          imagePullPolicy: Always
          command:
            - ./milvus-s3-br.sh
            - restore
          env:
            - name: JOB_NAME
              value: wo-watson-orchestrate-backup-offline-milvus-s3-restore
            - name: JOB_NAMESPACE
              value: ${PROJECT_CPD_INST_OPERANDS}
            - name: SERVER_STORAGE_BUCKET
              value: wo-server-storage-bucket-${PROJECT_CPD_INST_OPERANDS}
          volumeMounts:
            - name: milvus-secret
              mountPath: /secrets/wo-milvus-storage-bucket
            - name: s3-account
              mountPath: /secrets/s3-account
            - name: s3-uri
              mountPath: /secrets/s3-uri
            - name: milvus-configmap
              mountPath: /configmaps/wo-milvus-storage-bucket
            - name: s3-cert
              mountPath: /secrets/s3-cert
            - name: s3-backup-pvc
              mountPath: /tmp/s3-backup
      volumes:
        - name: milvus-secret
          secret:
            secretName: wo-milvus-storage-bucket
        - name: s3-account
          secret:
            secretName: noobaa-account-watsonx-orchestrate
        - name: s3-uri
          secret:
            secretName: noobaa-uri-watsonx-orchestrate
        - name: milvus-configmap
          configMap:
            name: wo-milvus-storage-bucket
        - name: s3-cert
          secret:
            secretName: noobaa-cert-watsonx-orchestrate
        - name: s3-backup-pvc
          persistentVolumeClaim:
            claimName: wo-watson-orchestrate-backup-offline-s3

SSO OBO and token exchange do not work in on‑premises deployments

Applies to: 5.4.0

Problem

In on‑premises deployments of watsonx Orchestrate, the Single Sign-On (SSO) On‑Behalf‑Of (OBO) flow and Identity Provider (IdP) token exchange do not function correctly. Users can log in to watsonx Orchestrate by using SSO, but the SSO token is not exchanged with the downstream IdP when an agent executes actions in connected applications. As a result:

User identity is not consistently propagated between the IdP, watsonx Orchestrate, and downstream applications. For these connections, you might need an alternative authentication method, such as the OAuth 2.0 authorization code flow or the client credentials flow.
The seamless SSO behavior that is supported in IBM Cloud and AWS SaaS deployments is not available in these on‑premises releases.

This issue affects enterprise users running watsonx Orchestrate in on‑premises environments that rely on SSO‑based connections.

Astra DB connection fails in air-gap environments with HTTP proxy enabled

Applies to: 5.4.0

Problem: In watsonx Orchestrate on‑premises air-gap environments with an HTTP proxy enabled, connecting to Astra DB as a knowledge source fails during agent creation. When an admin creates an agent in Agent builder, navigates to Knowledge > Choose knowledge > Astra DB, and clicks Next after entering the required details, the UI does not progress to the next screen and the connection fails. This occurs because Astra DB keyspaces cannot be retrieved in fully air-gapped configurations.

watsonx Orchestrate CR is stuck in `InProgress` state though the status shows `Completed`

Applies to: 5.4.0

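Problem: The watsonx Orchestrate custom resource (CR) remains in the InProgress state even though the status shows Completed. This happens because the data stores repeatedly enter a reconcile loop.
Solution: When the verification count shows 20/20, you can ignore the InProgress state because the pods are healthy and fully available.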
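
The check that is described in the solution can be sketched as two small shell helpers. This is a minimal sketch, assuming that the verification count is available as a ready/total string such as 20/20 and that pod health is judged from standard `oc get pods --no-headers` output. The function names `all_verified` and `unhealthy_pods` are hypothetical and are not part of watsonx Orchestrate.

```shell
# Hypothetical helper: succeeds only when a "ready/total" count such as
# "20/20" shows that every component is verified.
all_verified() {
  local count="$1" ready total part
  # The value must contain a slash.
  case "$count" in
    */*) ;;
    *) return 2 ;;
  esac
  ready="${count%%/*}"
  total="${count#*/}"
  # Both sides must be non-empty and contain digits only.
  for part in "$ready" "$total"; do
    case "$part" in
      ''|*[!0-9]*) return 2 ;;
    esac
  done
  [ "$total" -gt 0 ] && [ "$ready" -eq "$total" ]
}

# Hypothetical helper: reads `oc get pods --no-headers` output on stdin
# (columns: NAME READY STATUS ...) and prints the pods that are neither
# Completed nor Running with all containers ready.
unhealthy_pods() {
  awk '$3 == "Completed" { next }
       { split($2, r, "/"); if ($3 != "Running" || r[1] != r[2]) print $1 }'
}
```

For example, `oc get pods -n "${PROJECT_CPD_INST_OPERANDS}" --no-headers | unhealthy_pods` prints nothing when all pods are healthy, which is the condition under which the InProgress state can be ignored.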

Domain agents and tools UUIDs are not the same in two clusters

Applies to: 5.4.0

Problem: In a multi-region active deployment, the UUIDs of domain agents and tools differ between clusters. For example, if you, as an Admin user, view the UUIDs of domain agents and tools in the Catalog on cluster-A (the UUIDs are visible in the Network tab) and then view the same agents and tools on cluster-B, the UUIDs are different.

The UUID of an agent changes after you delete and reimport the same agent through ADK

Applies to: 5.4.0

Problem: In a multi-region active deployment, if you, as an Admin user, delete an imported agent in Agent builder and then import the same agent again into the same cluster through ADK, the agent is assigned a different UUID.

On‑premises traces do not display Tools Runtime or TRM spans

Applies to: 5.4.0

Problem: In watsonx Orchestrate on‑premises environments, Tools Runtime and TRM spans are not captured or displayed in traces. As a result, trace data is incomplete and agent execution cannot be fully observed.

Backup and restore utility is not included for Milvus

Applies to: 5.4.0

Problem

The backup and restore utility does not include the watsonx Orchestrate S3 bucket data for Milvus. You must back up this data manually by using a Kubernetes job before you take the backup, and restore it by using a Kubernetes job after the instance restore is completed.

Solution

To resolve the problem, run the following script before you take the backup:

oc project ${PROJECT_CPD_INST_OPERANDS}
cat <<EOF | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: wo-watson-orchestrate-backup-milvus-s3-backup
  namespace: ${PROJECT_CPD_INST_OPERANDS}
  annotations:
    name: wo-watson-orchestrate-backup-milvus-s3-backup
    namespace: ${PROJECT_CPD_INST_OPERANDS}
  labels:
    app.kubernetes.io/component: backup
    app.kubernetes.io/instance: wo
    app.kubernetes.io/managed-by: ibm-watson-orchestrate-operator
    app.kubernetes.io/name: watson-orchestrate
    icpdsupport/addOnId: orchestrate
    icpdsupport/app: backup
    icpdsupport/module: backup-orchestrate
    icpdsupport/podSelector: backup
    wo.watsonx.ibm.com/application: watson-orchestrate
    wo.watsonx.ibm.com/component: backup
    wo.watsonx.ibm.com/cr-name: wo
    wo.watsonx.ibm.com/external-access: "true"
    wo.watsonx.ibm.com/operand-version: 6.0.0
spec:
  backoffLimit: 0
  template:
    metadata:
      annotations:
        productName: IBM watsonx Orchestrate
        productVersion: 5.2.0
      labels:
        app.kubernetes.io/component: backup
        app.kubernetes.io/instance: wo
        app.kubernetes.io/managed-by: ibm-watson-orchestrate-operator
        app.kubernetes.io/name: watson-orchestrate
        icpdsupport/addOnId: orchestrate
        icpdsupport/app: backup
        icpdsupport/module: backup-orchestrate
        icpdsupport/podSelector: backup
        wo.watsonx.ibm.com/application: watson-orchestrate
        wo.watsonx.ibm.com/component: backup
        wo.watsonx.ibm.com/cr-name: wo
        wo.watsonx.ibm.com/external-access: "true"
        wo.watsonx.ibm.com/operand-version: 6.0.0
    spec:
      serviceAccountName: wo-watson-orchestrate-backup-restore
      restartPolicy: Never
      containers:
        - name: backup
          image: cp.icr.io/cp/watsonx-orchestrate/ibm-watsonx-orchestrate-onprem-utils@sha256:f2ca697cdcea2f349f9b0304a3b28f19f5d3f917b57b9076bbae43052a8a9c20
          imagePullPolicy: Always
          command:
            - ./milvus-s3-br.sh
            - backup
          env:
            - name: JOB_NAME
              value: wo-watson-orchestrate-backup-milvus-s3-backup
            - name: JOB_NAMESPACE
              value: ${PROJECT_CPD_INST_OPERANDS}
          resources: {}
          volumeMounts:
            - name: milvus-secret
              mountPath: /secrets/wo-milvus-storage-bucket
            - name: milvus-configmap
              mountPath: /configmaps/wo-milvus-storage-bucket
            - name: s3-cert
              mountPath: /secrets/s3-cert
            - name: s3-backup-pvc
              mountPath: /tmp/s3-backup
      volumes:
        - name: milvus-secret
          secret:
            secretName: wo-milvus-storage-bucket
        - name: milvus-configmap
          configMap:
            name: wo-milvus-storage-bucket
        - name: s3-cert
          secret:
            secretName: noobaa-cert-watsonx-orchestrate
        - name: s3-backup-pvc
          persistentVolumeClaim:
            claimName: wo-watson-orchestrate-backup-s3
EOF
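
Before you take the backup, it can help to confirm that the job finished. The following is a minimal sketch, assuming that `oc` is logged in to the cluster. The function name `wait_for_job` and the default timeout of 600s are assumptions and are not part of the product.

```shell
# Hypothetical helper: wait for a Job to report the Complete condition,
# then print its logs. Arguments: <job-name> <namespace> [timeout]
wait_for_job() {
  local job="$1"
  local namespace="$2"
  local timeout="${3:-600s}"   # assumed default; adjust for your data size
  if oc wait --for=condition=complete "job/${job}" -n "${namespace}" --timeout="${timeout}"; then
    oc logs "job/${job}" -n "${namespace}"
  else
    # The job failed or is still running; print the logs to stderr for diagnosis.
    echo "Job ${job} did not complete within ${timeout}" >&2
    oc logs "job/${job}" -n "${namespace}" >&2 || true
    return 1
  fi
}
```

For example: `wait_for_job wo-watson-orchestrate-backup-milvus-s3-backup "${PROJECT_CPD_INST_OPERANDS}"`. Because the job sets `backoffLimit: 0`, a failed pod is not retried, so a timeout here usually means that the logs need to be reviewed.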

Run the following script after you restore:

oc project ${PROJECT_CPD_INST_OPERANDS}
cat <<EOF | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: wo-watson-orchestrate-backup-milvus-s3-restore
  namespace: ${PROJECT_CPD_INST_OPERANDS}
  annotations:
    name: wo-watson-orchestrate-backup-milvus-s3-restore
    namespace: ${PROJECT_CPD_INST_OPERANDS}
  labels:
    app.kubernetes.io/component: backup
    app.kubernetes.io/instance: wo
    app.kubernetes.io/managed-by: ibm-watson-orchestrate-operator
    app.kubernetes.io/name: watson-orchestrate
    icpdsupport/addOnId: orchestrate
    icpdsupport/app: backup
    icpdsupport/module: backup-orchestrate
    icpdsupport/podSelector: backup
    wo.watsonx.ibm.com/application: watson-orchestrate
    wo.watsonx.ibm.com/component: backup
    wo.watsonx.ibm.com/cr-name: wo
    wo.watsonx.ibm.com/external-access: "true"
    wo.watsonx.ibm.com/operand-version: 6.0.0
spec:
  backoffLimit: 0
  template:
    metadata:
      annotations:
        productName: IBM watsonx Orchestrate
        productVersion: 5.2.0
      labels:
        app.kubernetes.io/component: backup
        app.kubernetes.io/instance: wo
        app.kubernetes.io/managed-by: ibm-watson-orchestrate-operator
        app.kubernetes.io/name: watson-orchestrate
        icpdsupport/addOnId: orchestrate
        icpdsupport/app: backup
        icpdsupport/module: backup-orchestrate
        icpdsupport/podSelector: backup
        wo.watsonx.ibm.com/application: watson-orchestrate
        wo.watsonx.ibm.com/component: backup
        wo.watsonx.ibm.com/cr-name: wo
        wo.watsonx.ibm.com/external-access: "true"
        wo.watsonx.ibm.com/operand-version: 6.0.0
    spec:
      serviceAccountName: wo-watson-orchestrate-backup-restore
      restartPolicy: Never
      containers:
        - name: restore
          image: cp.icr.io/cp/watsonx-orchestrate/ibm-watsonx-orchestrate-onprem-utils@sha256:f2ca697cdcea2f349f9b0304a3b28f19f5d3f917b57b9076bbae43052a8a9c20
          imagePullPolicy: Always
          command:
            - ./milvus-s3-br.sh
            - restore
          env:
            - name: JOB_NAME
              value: wo-watson-orchestrate-backup-milvus-s3-restore
            - name: JOB_NAMESPACE
              value: ${PROJECT_CPD_INST_OPERANDS}
          resources: {}
          volumeMounts:
            - name: milvus-secret
              mountPath: /secrets/wo-milvus-storage-bucket
            - name: milvus-configmap
              mountPath: /configmaps/wo-milvus-storage-bucket
            - name: s3-cert
              mountPath: /secrets/s3-cert
            - name: s3-backup-pvc
              mountPath: /tmp/s3-backup
      volumes:
        - name: milvus-secret
          secret:
            secretName: wo-milvus-storage-bucket
        - name: milvus-configmap
          configMap:
            name: wo-milvus-storage-bucket
        - name: s3-cert
          secret:
            secretName: noobaa-cert-watsonx-orchestrate
        - name: s3-backup-pvc
          persistentVolumeClaim:
            claimName: wo-watson-orchestrate-backup-s3

EOF
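
Both scripts use fixed job names. If a finished job with the same name still exists, applying the manifest again does not start a new run, because an existing Job's pod template cannot be changed. The following is a minimal sketch of clearing an earlier run first. The function name `reset_job` is hypothetical.

```shell
# Hypothetical helper: delete an earlier run of a Job so that its manifest
# can be applied again. Arguments: <job-name> <namespace>
reset_job() {
  local job="$1"
  local namespace="$2"
  # --ignore-not-found turns the call into a no-op when no earlier run exists.
  oc delete job "${job}" -n "${namespace}" --ignore-not-found
}
```

For example, run `reset_job wo-watson-orchestrate-backup-milvus-s3-restore "${PROJECT_CPD_INST_OPERANDS}"` before you rerun the restore script.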
