Known issues for watsonx Orchestrate
The following known issues and limitations apply to watsonx Orchestrate.
Known issues for 5.4.0
The following sections categorize known issues by type:
Installation issues
- Installation fails with wo-archer-server-db-schema-job stuck in progress
- Intermittent failure during watsonx Orchestrate installation
- wa-dialog-pods is in crashloop during 5.3.1 Patch 2 installation
- watsonx Orchestrate installation fails on air‑gapped clusters when the wo-docproc-dpi-liquibase-init pod enters an error state
- The deploy-knative-eventing fails with error: no matching resources found
For a complete list of troubleshooting information for installation issues, see Installation issues.
Upgrade issues
- Kafka upgrade failure in Data Governor during 5.2.x to 5.4.0 upgrade
- watsonx Orchestrate instance in unknown status after upgrade
- watsonx Assistant CR stuck in VerifyWait due to unverified tf component
- Component services not updated after upgrade to 5.3.1 Patch 5
- Chat with docs cleanup job fails due to insufficient memory after upgrade
- watsonx Orchestrate instance in UNKNOWN state after upgrade to 5.3.1 Patch 5
- Agent Assist chat is stuck and LLM responses are not displayed after upgrade to 5.3.1
- Existing tool flows fail after upgrading watsonx Orchestrate from 5.2.2 to 5.3.1 Patch 2
- Downtime when upgrading from 5.1.x or 5.2.x to 5.3.0
- watsonx Orchestrate bootstrap job fails to complete during upgrade or installation
For a complete list of troubleshooting information for upgrade issues, see Upgrade issues.
UI issues
- Unable to connect or import remote MCP server
- 502 Bad Gateway error on wxochat endpoint in watsonx Orchestrate
- Error Displayed When Saving Platform Language Changes in Agentic Clusters
- Issue when adding an LDAP user to a watsonx Orchestrate instance
- Reset test run in Evaluate response settings fails with 500 error
- The preview page is not available when the watsonx Assistant is created by using the API
For a complete list of troubleshooting information for UI issues, see Other issues.
Other issues
- Milvus standalone pod fails due to etcd backend quota exhaustion
- Restoring of the Milvus S3 backup requires manual job after the cluster restore and stabilization
- SSO OBO and token exchange do not work in on‑premises deployments
- Astra DB connection fails in air-gap environments with HTTP proxy enabled
- watsonx Orchestrate CR is stuck in Inprogress state though the status shows Completed
- Domain agents and tools UUIDs are not the same in two clusters
- The UUID of an agent is different on deleting and importing again the same agent via ADK
- On‑premises traces do not display Tools Runtime or TRM spans
- Backup and restore utility is not included for Milvus
For a complete list of troubleshooting information for issues other than installation and upgrade issues, see Other issues.
Installation fails with wo-archer-server-db-schema-job stuck in progress
Applies to: 5.4.0
- Problem
- During an air-gapped installation of watsonx Orchestrate 5.3.1 Patch 5, the installation fails with the custom resource (CR) stuck in the InProgress state. The wo-archer-server-db-schema-job job remains stuck and does not complete for an extended period, which prevents the installation from progressing.
- Solution
- To resolve this issue, delete the stuck wo-archer-server-db-schema-job job. The operator automatically recreates the job, which should then complete successfully and allow the installation to proceed.
  Run the following command to delete the job:
      oc delete job wo-archer-server-db-schema-job -n <namespace>
  Replace <namespace> with your watsonx Orchestrate installation namespace.
  After deleting the job, monitor the operator logs to verify that the job is recreated and completes successfully.
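As a complement to watching the operator logs, the wait for the recreated job could be scripted. The following is a minimal sketch, not part of the documented procedure: the function name wait_for_job, the retry count, the delay, and the 600s timeout are all assumptions chosen for illustration.

```shell
# Hypothetical helper: wait until the operator recreates a job, then wait
# for that job to report the Complete condition.
# Usage: wait_for_job <namespace> [job-name] [attempts] [delay-seconds]
wait_for_job() {
  ns="$1"
  job="${2:-wo-archer-server-db-schema-job}"
  attempts="${3:-30}"   # assumed retry budget
  delay="${4:-10}"      # assumed pause between checks, in seconds
  i=0
  # Poll until the job object exists again.
  while ! oc get job "$job" -n "$ns" >/dev/null 2>&1; do
    i=$((i + 1))
    if [ "$i" -ge "$attempts" ]; then
      echo "job $job was not recreated in namespace $ns" >&2
      return 1
    fi
    sleep "$delay"
  done
  # Block until the recreated job completes (timeout value is an assumption).
  oc wait --for=condition=complete "job/$job" -n "$ns" --timeout=600s
}
```

For example, after deleting the job you might run wait_for_job <namespace> and treat a non-zero exit status as a signal to inspect the operator logs.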
Intermittent failure during watsonx Orchestrate installation
Applies to: 5.4.0
- Problem
- In IBM® Software Hub 5.3.1 Patch 2, the watsonx Orchestrate installation might intermittently fail. During installation, the system can time out while waiting for the watsonx Orchestrate custom resource (CR) to become ready. In this scenario, the CR remains in an InProgress state, and the install-components command exits with an error.
  This issue is associated with a race condition during PostgreSQL initialization, where the PostgreSQL persistent volume claim (PVC) can remain in an initializing state. As a result, the operator can become stuck in a reconciliation loop and prevent the installation from completing.
- Solution
- If the installation fails, complete the following steps:
  - Check the PostgreSQL PVC annotations to determine whether the PVC is stuck in an initializing state. If necessary, update the PVC annotation to indicate that the PVC is ready.
  - If the PostgreSQL pod is crashing due to an uninitialized data directory, delete the watsonx Orchestrate PostgreSQL cluster.
  - Allow the operator to recreate and initialize the PostgreSQL cluster.
  - Verify that the watsonx Orchestrate CR transitions to a Completed or Ready state.
  After these steps are completed, the installation should finish successfully.
wa-dialog-pods is in crashloop during 5.3.1 Patch 2 installation
Applies to: 5.4.0
- Problem
- The Data Governor operator entered an ESTemplate error state because OpenSearch Security failed to initialize properly during the initial startup of the OpenSearch pods, preventing successful reconciliation.
- Solution
- To fix this issue, do the following steps:
  - To confirm the issue is due to Data Governor operator, run:
      oc get dg -A
    Verify that wo-wa-data-governor is stuck on ESTemplate status.
  - To confirm the root error, run:
      oc get estemplate
    Check for Error status on the ESTemplate resource.
  - To validate OpenSearch Security state, run a shell into the Data Governor pod and curl the OpenSearch endpoint:
      $ oc rsh wo-wa-data-governor-ibm-data-governor-all-search-000 cat internal_users/elastic
      <password>
      $ oc rsh wo-wa-data-governor-ibm-data-governor-all-search-000 curl -k https://elastic:<password>@localhost:9200/_cat/indices
    Proceed to the next step only if the response is OpenSearch Security not initialized.
  - Restart the OpenSearch pods:
      $ oc delete pod wo-wa-data-governor-ibm-data-governor-all-search-000 wo-wa-data-governor-ibm-data-governor-all-search-001 wo-wa-data-governor-ibm-data-governor-all-search-002
  - To verify Data Governor reconciliation, run:
      oc get dg -A && oc get estemplate -A
    Confirm that wo-wa-data-governor progresses past ESTemplate status and reconciles successfully, and the ESTemplate resource shows Completed instead of Error.
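The "proceed only if" condition in the validation step can be made explicit in a script. The following is a small sketch under the assumption that the curl response has been captured into a shell variable; the function name security_not_initialized is hypothetical.

```shell
# Hypothetical guard: succeed only when the captured response contains the
# "OpenSearch Security not initialized" message described in the steps.
security_not_initialized() {
  case "$1" in
    *"OpenSearch Security not initialized"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example use (<password> is the value returned by the first oc rsh command):
#   response=$(oc rsh wo-wa-data-governor-ibm-data-governor-all-search-000 \
#     curl -k "https://elastic:<password>@localhost:9200/_cat/indices")
#   if security_not_initialized "$response"; then
#     echo "Restart the OpenSearch pods"
#   fi
```

Guarding the restart this way avoids deleting the OpenSearch pods when the endpoint returns a different error.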
watsonx Orchestrate installation fails on air‑gapped clusters when the wo-docproc-dpi-liquibase-init pod enters an error state
Applies to: 5.4.0
- Problem
- In air‑gapped clusters, the installation of watsonx Orchestrate fails because the wo-docproc-dpi-liquibase-init pod enters an error state during deployment.
- Solution
-
  - Verify the error message by using the following command.
      oc logs <pod name>
    You might see the following error.
      ERROR: Exception Details
      ERROR: Exception Primary Class: LockException
      ERROR: Exception Primary Reason: Could not acquire change log lock. Currently locked by wo-docproc-dpi-liquibase-init-67x74 (10.254.20.76) since 12/16/25, 11:51 AM
      ERROR: Exception Primary Source: 4.23.0
    Collect the logs from the affected pod (for example, wo-docproc-dpi-liquibase-init-67x74) to help identify the root cause of the lock.
  - If the root cause is Reason: liquibase.exception.DatabaseException: An I/O error occurred while sending to the backend., remove the lock and retry the job.
    This behavior indicates possible instability in the PostgreSQL connection, and retrying has low impact. If the error persists or appears to have another cause, contact support to verify the issue before you proceed.
  - Create a debug pod to connect to the PostgreSQL instance, preferably by using the same pod referenced in the error.
      oc debug wo-docproc-dpi-liquibase-init-67x74
  - From the debug pod, connect to the PostgreSQL instance.
      sh-5.1$ psql
      psql (16.8, server 15.14)
      SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, compression: off)
      Type "help" for help.
  - Verify that the lock is still in place. The command returns one row.
      wo_docproc_dpi=# select * from db_dpi_changelog_lock;
       id | locked |        lockgranted         |                      lockedby
      ----+--------+----------------------------+----------------------------------------------------
        1 | t      | 2025-12-16 11:51:36.393442 | wo-docproc-dpi-liquibase-init-67x74 (10.254.20.76)
      (1 row)
  - Remove the lock.
      wo_docproc_dpi=# delete from db_dpi_changelog_lock where locked;
      DELETE 1
  - Verify that the lock is removed. The command does not return any rows.
      wo_docproc_dpi=# select * from db_dpi_changelog_lock;
       id | locked | lockgranted | lockedby
      ----+--------+-------------+------------------
      (0 rows)
  - End the psql connection by pressing Ctrl+D, and then exit the pod by pressing Ctrl+D again.
  - After the debug pod is deleted, delete the liquibase init job. The docproc operator will re-create it on the next reconcile.
      oc delete job wo-docproc-dpi-liquibase-init
      job.batch "wo-docproc-dpi-liquibase-init" deleted
  - Verify that the job completes successfully after it is re-created.
      oc get pods -lcomponent=dpi-liquibase-init
      NAME                                  READY   STATUS      RESTARTS   AGE
      wo-docproc-dpi-liquibase-init-hlxlb   0/1     Completed   0          80s
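When checking pod logs for this failure, it can help to extract the name of the pod that holds the lock rather than reading the log by eye. The sketch below assumes the log line has the same shape as the example error above; the function name lock_holder is hypothetical.

```shell
# Hypothetical helper: read pod logs on stdin and print the name of the pod
# that Liquibase reports as holding the change log lock, if any.
lock_holder() {
  sed -n 's/.*Currently locked by \([^ ]*\) .*/\1/p' | head -n 1
}

# Example use:
#   oc logs <pod name> | lock_holder
```

An empty result suggests that the pod failed for a reason other than the change log lock, in which case the lock removal steps do not apply.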
The deploy-knative-eventing fails with error: no matching resources found
Applies to: 5.4.0
- Problem
- When running the cpd-cli manage deploy-knative-eventing command, it fails with error: no matching resources found after the message deployment.apps/kafka-controller condition met. This issue arises because no pods with the label app=kafka-broker-dispatcher are present.
- Solution
-
  - Exec into the Docker container that runs olm-utils.
      docker exec -it olm-utils-play-v3 bash
  - Check for the line that must be removed.
      cat /opt/ansible/bin/deploy-knative-eventing | grep kafka-broker-dispatcher
    Output:
      oc wait pods -n knative-eventing --selector app=kafka-broker-dispatcher --for condition=Ready --timeout=60s
  - Remove the line.
      sed -i '/kafka-broker-dispatcher/d' /opt/ansible/bin/deploy-knative-eventing
  - Verify whether the line is removed.
      cat /opt/ansible/bin/deploy-knative-eventing | grep kafka-broker-dispatcher
    The command should return no output.
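Because the edit above modifies a script in place, you might prefer to keep a copy of the original first. The following sketch performs the same line removal but writes a .bak backup beforehand; the function name remove_dispatcher_wait and the .bak suffix are assumptions for illustration.

```shell
# Hypothetical helper: back up a script, then remove every line that
# mentions kafka-broker-dispatcher from it.
remove_dispatcher_wait() {
  target="$1"
  # Keep a copy of the original file next to it.
  cp -p "$target" "$target.bak" || return 1
  # Rewrite the target from the backup without the matching lines.
  sed '/kafka-broker-dispatcher/d' "$target.bak" > "$target"
}

# Example use inside the olm-utils container:
#   remove_dispatcher_wait /opt/ansible/bin/deploy-knative-eventing
```

If the deployment behaves unexpectedly afterwards, the original script can be restored by copying the .bak file back over the target.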
Kafka upgrade failure in Data Governor during 5.2.x to 5.4.0 upgrade
Applies to: 5.4.0
- Problem
- During the watsonx Orchestrate upgrade from Version 5.2.x to 5.4.0, Kafka attempts to upgrade from 3.8.0 to 4.2.0 but fails because the metadata remains at version 3.8-IV0. This causes Kafka broker and controller connection failures and prevents identification of the active controller, which leaves the upgrade in an inconsistent and incomplete state.
- Solution
- Restart the Data Governor Kafka related pods to force reconciliation.
      oc delete pods -l ibmevents.ibm.com/cluster=$INSTANCE-data-governor-kafka -n $NAMESPACE
  In this example, $INSTANCE is wo-wa and $NAMESPACE is cpd.
  Restarting the pods enables the Kafka cluster to complete the upgrade process and return to a Ready state.
watsonx Orchestrate instance in unknown status after upgrade
Applies to: 5.4.0
- Problem
- After upgrading watsonx Orchestrate, the instance shows an unknown status. When you check the deployment of voice-controller and archer-server, these deployments are scaled down to 0 pods and there is a warning related to request and limits not being set.
- Solution
- Apply the following patches to resolve the issue:
      oc patch deployment wo-archer-server -n ${PROJECT_CPD_INST_OPERANDS} --type='strategic' -p='
      spec:
        template:
          spec:
            initContainers:
            - name: combine-ca-certs
              resources:
                limits:
                  cpu: 200m
                  memory: 256Mi
                requests:
                  cpu: 100m
                  memory: 128Mi
      '

      oc patch deployment wo-voice-controller -n ${PROJECT_CPD_INST_OPERANDS} --type='strategic' -p='
      spec:
        template:
          spec:
            initContainers:
            - name: setup-ca-trust
              resources:
                limits:
                  cpu: 200m
                  memory: 256Mi
                requests:
                  cpu: 100m
                  memory: 128Mi
      '
watsonx Assistant CR stuck in VerifyWait due to unverified tf component
Applies to: 5.4.0
- Problem
- The watsonx Assistant CR gets stuck in the VerifyWait state with the tf component remaining unverified. When you check the watsonx Assistant CR status, it shows:
      oc get wa
      NAME    VERSION   READY   READYREASON    UPDATING   UPDATINGREASON   DEPLOYED   VERIFIED   QUIESCE       DATASTORE      AGE
      wo-wa   5.4.0     False   Initializing   True       VerifyWait       20/20      12/20      DEQUIESCING   NOT_QUIESCED   143m
  The CLU YAML shows that the tf component is still unverified:
      oc get watsonassistantclu.assistant.watson.ibm.com/wo-wa -o yaml
      failedComponents: []
      unverifiedComponents: tf
      verified: 6/7
  On checking the tf pod status, both pods are in CrashLoopBackOff status:
      oc get po | grep tf-
      tf-869dbf64c4-kr4cc   1/2   CrashLoopBackOff   9 (2m39s ago)   26m
      tf-869dbf64c4-lr2sp   1/2   CrashLoopBackOff   9 (3m51s ago)   26m
- Solution
- The tf pods may be timing out while downloading models due to slow MCG performance. Check the existing noobaa-default-backing-store BackingStore configuration, then increase the NooBaa backing store resources and set the PV volume count to 3 by running the following command:
      oc patch -n openshift-storage backingstore noobaa-default-backing-store --type=merge --patch='{"spec":{"pvPool":{"numVolumes":3,"resources":{"requests":{"memory":"4Gi","cpu":"1"},"limits":{"memory":"8Gi","cpu":"2"}}}}}'
Component services not updated after upgrade to 5.3.1 Patch 5
Applies to: 5.4.0
- Problem
- When you upgrade from 5.2.x or 5.3.0 to 5.3.1 Patch 5, you might encounter an issue where the component services do not get updated, while the watsonx Orchestrate CR status still shows 100%. The component services get stuck in an error state when the size field is not defined in the watsonx Orchestrate CR, and the fallback logic throws an error.
- Solution
- Set the size value in the watsonx Orchestrate CR by applying the following patch command:
      oc patch wo wo -n ${PROJECT_CPD_INST_OPERANDS} --type=merge -p '{"spec":{"size":"medium"}}'
  After applying the patch, the component services should update successfully.
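Before patching, you might want to confirm that the size field really is undefined so that an existing value is not overwritten. The sketch below is one way to do that; the function name ensure_wo_size is hypothetical, and medium is only the value used in the patch above, so choose the size that matches your deployment.

```shell
# Hypothetical helper: patch spec.size only when the CR does not define it.
# Usage: ensure_wo_size <namespace> [size]
ensure_wo_size() {
  ns="$1"
  size="${2:-medium}"   # default taken from the documented patch command
  # Read the current value; the result is empty when the field is not set.
  current=$(oc get wo wo -n "$ns" -o jsonpath='{.spec.size}')
  if [ -n "$current" ]; then
    echo "spec.size is already set to $current"
    return 0
  fi
  oc patch wo wo -n "$ns" --type=merge -p "{\"spec\":{\"size\":\"$size\"}}"
}

# Example use:
#   ensure_wo_size "${PROJECT_CPD_INST_OPERANDS}"
```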
Chat with docs cleanup job fails due to insufficient memory after upgrade
Applies to: 5.4.0
- Problem
- After upgrading from 5.3.0 to 5.3.1 Patch 3 or Patch 5, the wo-chat-with-docs-expiry-cronjob pod fails with an OOMKilled error. The pod's memory limit is set to 200Mi, which may be insufficient when processing multiple knowledge bases or chat-with-docs resources that need to be deleted. The job performs several memory-intensive operations including:
  - Multiple Postgres queries and updates to remove knowledge bases and related sub-resources
  - Milvus vector store cleanup operations
  - S3 document deletion
  - HTTP requests to TRM (Tools Runtime Manager) to remove tool deployments
- Solution
- To resolve this issue, increase the memory limit for the chat with docs expiry cronjob to 300-400Mi. Start with 300Mi and monitor actual usage to determine if further adjustment is needed.
  Run the following command to update the memory limit:
      Patch command to be included
  After applying the patch, monitor the pod's memory usage to ensure the new limit is sufficient.
watsonx Orchestrate instance in UNKNOWN state after upgrade to 5.3.1 Patch 5
Applies to: 5.4.0
- Problem
- After upgrading from 5.3.0 to 5.3.1 Patch 5, the watsonx Orchestrate instance may appear in an UNKNOWN state.
- Solution
- To resolve this issue, complete the following steps:
  - Set the namespace variable:
      export PROJECT_CPD_INST_OPERANDS=<your-namespace>
  - Delete the watsonx Orchestrate Kafka custom resource:
      oc delete kafka wo-watson-orchestrate-kafkaibm -n ${PROJECT_CPD_INST_OPERANDS}
  - Delete the deployments of wo-archer-server and wo-voice-controller:
      oc delete deployment wo-archer-server wo-voice-controller -n ${PROJECT_CPD_INST_OPERANDS}
  After completing these steps, wait for the component operator to recreate the resources. Monitor the watsonx Orchestrate instance status until it returns to a healthy state.
Agent Assist chat is stuck and LLM responses are not displayed after upgrade to 5.3.1
Applies to: 5.4.0
- Problem
- After upgrading to watsonx Orchestrate 5.3.1, Agent Assist does not function correctly for chat or voice interactions. Agent Assist might remain stuck in a buffering state or display the message:
      There’s currently no conversation content to summarize.
  Although the LLM processes the request successfully and responses are generated in the backend, the responses are not streamed back to the Agent Assist UI. This issue is caused by stale cached UI assets being served after the upgrade, combined with embedded chat security being enabled by default in 5.3.1, which is not compatible with Agent Assist. Agent Assist works only when security is disabled. As a result, some static assets are loaded incorrectly, leading to UI errors and blocked Agent Assist functionality.
- Solution
- To restore Agent Assist functionality, complete the following steps:
  - Clear caches.
    - Clear the browser cache for affected users.
    - Clear CDN or gateway caches to ensure the latest UI assets are served.
  - Restart the Agent Assist socket handler pods.
      oc scale deploy/wo-socket-handler --replicas=0 -n <namespace>
      oc scale deploy/wo-socket-handler --replicas=1 -n <namespace>
  - Disable embedded chat security to restore Agent Assist features. For more information, follow the steps mentioned in Configuring security with scripting.
  - Verify deployment consistency.
    - Ensure that all watsonx Orchestrate components, especially wo-archer-server, are running the correct 5.3.1 image and deployment configuration. Review the wo-archer-server pod's volume mounts to check for any attached PVC from previous customizations.
    - If a PVC is attached, detach it from the archer-server deployment to prevent outdated code from overriding the 5.3 image.
    - Restart the pod to ensure it runs the correct 5.3 application code from the image.
    - Verify that LLM functionality is restored after the pod restart.
Existing tool flows fail after upgrading watsonx Orchestrate from 5.2.2 to 5.3.1 Patch 2
Applies to: 5.4.0
- Problem
- After upgrading watsonx Orchestrate from version 5.2.2 to 5.3.1 Patch 2, existing tool flows that executed successfully in 5.2.2 fail to execute in the upgraded environment. The issue occurs without modification to the tool flows and blocks execution of previously working agents and flows.
- Solution
- To fix this issue, follow the steps mentioned in Post upgrade tasks.
Downtime when upgrading from 5.1.x or 5.2.x to 5.3.0
Applies to: 5.4.0
- Problem
- When you upgrade from 5.1.x or 5.2.x to 5.3.0, you might see a brief downtime of about 3–4 minutes at the beginning of the upgrade.
watsonx Orchestrate bootstrap job fails to complete during upgrade or installation
Applies to: 5.4.0
- Problem
- During upgrade or installation, the watsonx Orchestrate bootstrap job might fail to complete because the wo-skill-sequencing pods run out of memory.
- Solution
- Increase the memory limit for the wo-skill-sequencing pods by applying an RSI patch, then restart the bootstrap job and validate the upgrade.
  Make sure that the following prerequisites are met:
  - You are logged in to the OpenShift® cluster.
  - oc CLI and cpd-cli are installed and configured.
  - PROJECT_CPD_INST_OPERANDS is set to the watsonx Orchestrate operand namespace.
      export PROJECT_CPD_INST_OPERANDS=<cpd-instance-namespace>
  - Create the RSI working directory.
      mkdir -p cpd-cli-workspace/olm-utils-workspace/work/rsi
  - Create a patch file named skill-seq.json in the RSI directory with the following content.
      [
        {
          "op": "replace",
          "path": "/spec/containers/0/resources/limits/memory",
          "value": "3Gi"
        }
      ]
  - Apply the RSI patch.
      cpd-cli manage create-rsi-patch \
        --cpd_instance_ns=${PROJECT_CPD_INST_OPERANDS} \
        --patch_name=skill-seq-resource-limit \
        --patch_type=rsi_pod_spec \
        --patch_spec=/tmp/work/rsi/skill-seq.json \
        --spec_format=json \
        --include_labels=wo.watsonx.ibm.com/component:wo-skill-sequencing \
        --state=active
  - Restart the bootstrap job.
      oc delete job wo-watson-orchestrate-bootstrap-job
  - Validate the upgrade.
      cpd-cli manage get-cr-status \
        --cpd_instance_ns=${PROJECT_CPD_INST_OPERANDS} \
        --components=watsonx_orchestrate
  After the fix is applied, new wo-skill-sequencing pods start with a 3 Gi memory limit, the bootstrap job completes successfully, and the watsonx Orchestrate upgrade proceeds normally.
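The directory and patch file from the first two steps can be created in one go. The following sketch writes the same JSON content shown above; the function name write_skill_seq_patch and its optional arguments are assumptions for illustration, and the defaults mirror the documented directory and the 3Gi limit.

```shell
# Hypothetical helper: create the RSI working directory and write the
# skill-seq.json patch file with the given memory limit.
# Usage: write_skill_seq_patch [directory] [memory-limit]
write_skill_seq_patch() {
  dir="${1:-cpd-cli-workspace/olm-utils-workspace/work/rsi}"
  limit="${2:-3Gi}"
  mkdir -p "$dir" || return 1
  cat > "$dir/skill-seq.json" <<EOF
[
  {
    "op": "replace",
    "path": "/spec/containers/0/resources/limits/memory",
    "value": "$limit"
  }
]
EOF
}

# Example use from the directory that contains cpd-cli-workspace:
#   write_skill_seq_patch
```

Generating the file from a script keeps the JSON free of copy-and-paste formatting problems before you run cpd-cli manage create-rsi-patch.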
Unable to connect or import remote MCP server
Applies to: 5.4.0
- Problem
- When you try to add MCP tools in the Agent builder screen, the connection to a remote MCP server fails. When you create an agent, click Add tools, select MCP server, click Add MCP server, select Remote MCP server, provide the remote MCP server details, and click Connect, the connection fails with the error Gateway creation failed: 401 {"detail":"Invalid authentication credentials"}. This prevents MCP tools from being added successfully to agents.
502 Bad Gateway error on wxochat endpoint in watsonx Orchestrate
Applies to: 5.4.0
- Problem
- The /wxochat/ endpoint returns a 502 Bad Gateway error. This occurs due to an SSL handshake failure between the NGINX layer (Zen) and the upstream service (wo-uiproxy). The issue is caused by incorrect SSL certificate validation directives in the /wxochat/ location block of the ZenExtension configuration that prevent proper communication between the proxy and the upstream service.
- Solution
- To resolve this issue, remove the problematic SSL directives from the ZenExtension configuration:
  - Set the namespace variable:
      export PROJECT_CPD_INST_OPERANDS=<your-namespace>
  - Back up the existing configuration:
      oc get zenextension wo-watson-orchestrate-zen-service -n ${PROJECT_CPD_INST_OPERANDS} -o yaml > zenextension-backup.yaml
  - Annotate the ZenExtension to prevent the operator from overwriting manual changes:
      oc annotate zenextension wo-watson-orchestrate-zen-service wo.watsonx.ibm.com/hands-off="true" -n ${PROJECT_CPD_INST_OPERANDS}
  - Edit the ZenExtension:
      oc edit zenextension wo-watson-orchestrate-zen-service -n ${PROJECT_CPD_INST_OPERANDS}
  - In the editor, locate the location /wxochat/ block and remove the following four lines if present:
      proxy_ssl_trusted_certificate /etc/internal-nginx-svc-tls/ca.crt;
      proxy_ssl_verify on;
      proxy_ssl_certificate /etc/internal-nginx-svc-tls/tls.crt;
      proxy_ssl_certificate_key /etc/internal-nginx-svc-tls/tls.key;
    Important: Do not modify any other location blocks. Maintain correct indentation, because the NGINX configuration is embedded in the ZenExtension YAML, which is indentation-sensitive.
  - Save and exit the editor (in vi: press Esc, type :wq, press Enter).
  - Wait for the ZenExtension to reconcile. Monitor the status until it returns to Completed:
      oc get zenextension wo-watson-orchestrate-zen-service -n ${PROJECT_CPD_INST_OPERANDS}
  - Restart the watsonx Orchestrate components:
      oc rollout restart `oc get deployment -o name -l icpdsupport/module=components-services-orchestrate`
  If the change causes issues, restore from the backup:
      oc apply -f zenextension-backup.yaml
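To preview which lines the manual edit would remove, the backup file can be filtered locally before you open the editor. The sketch below is an illustration only: the function name strip_wxochat_ssl is hypothetical, and it assumes that the opening brace appears on the same line as location /wxochat/, which might not match your configuration.

```shell
# Hypothetical preview filter: print the input with the four proxy_ssl_*
# directives removed, but only inside the "location /wxochat/" block.
strip_wxochat_ssl() {
  awk '
    # Entering the target block (assumes "{" is on the same line).
    /location[ \t]+\/wxochat\/[ \t]*[{]/ { inblock = 1; depth = 0 }
    inblock {
      # Drop the four SSL directives named in the procedure.
      if ($1 ~ /^proxy_ssl_(trusted_certificate|verify|certificate|certificate_key)$/) next
      # Track brace depth so nested blocks do not end the target block early.
      depth += gsub(/[{]/, "{")
      depth -= gsub(/[}]/, "}")
      if (depth <= 0) inblock = 0
    }
    { print }
  ' "$@"
}

# Example use against the local backup (nothing is changed on the cluster):
#   strip_wxochat_ssl zenextension-backup.yaml | diff zenextension-backup.yaml -
```

The diff output should list only the four directives from the /wxochat/ block; any other difference means the filter does not fit your file and the edit should be made by hand.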
Error Displayed When Saving Platform Language Changes in Agentic Clusters
Applies to: 5.4.0
- Problem
- When an admin user logs in to watsonx Orchestrate and attempts to add and save a language by clicking on the profile and navigating to Settings > Platform languages, the system fails to save the selected languages and instead displays an error.
Issue when adding an LDAP user to a watsonx Orchestrate instance
Applies to: 5.4.0
- Problem
- On the MyInstances screen, when you add an LDAP user to a watsonx Orchestrate instance and check the status, the following error is displayed:
      [POST /grant/][500] addUserToServiceInstanceInternalServerError &{MessageCode: StatusCode:0 Exception: Message:}
- Solution
- Add the user a second time. The second attempt succeeds.
Reset test run in Evaluate response settings fails with 500 error
Applies to: 5.4.0
- Problem
- When you reset the test run from the Evaluate response settings page of AI assistant builder, the request fails with a 500 Internal Server Error, and the UI hangs.
- Solution
- If the page hangs after the error, go to a different page and then return to the same page, or refresh the page.
The preview page is not available when the watsonx Assistant is created by using the API
Applies to: 5.4.0
- Problem
- The Preview page is not accessible when watsonx Assistant is created by using the API.
Milvus standalone pod fails due to etcd backend quota exhaustion
Applies to: 5.4.0
- Problem
- The Milvus standalone pod (ibm-lh-lakehouse-wo-milvus-standalone) enters CrashLoopBackOff state with the error etcdserver: mvcc: database space exceeded. The pod crashes during startup when Milvus attempts to register sessions in etcd. The backing etcd pod (ibm-lh-lakehouse-wo-milvus-etcd-0) has reached its backend quota of 2 GiB and raised an NOSPACE alarm. This issue is not caused by PVC size limitations, image pull failures, or pod scheduling problems.
- Solution
- To recover from this issue, perform the following steps:
  - Verify the Milvus pod is crashing with the mvcc: database space exceeded error.
  - Confirm the etcd pod has the NOSPACE alarm by running:
      oc exec -n <namespace> ibm-lh-lakehouse-wo-milvus-etcd-0 -- sh -lc 'ETCDCTL_API=3 etcdctl alarm list'
  - Check the etcd endpoint status to confirm the database size has reached the quota:
      oc exec -n <namespace> ibm-lh-lakehouse-wo-milvus-etcd-0 -- sh -lc 'ETCDCTL_API=3 etcdctl endpoint status -w table'
  - Compact the etcd database at the current revision (replace <revision> with the current revision number from the status output):
      oc exec -n <namespace> ibm-lh-lakehouse-wo-milvus-etcd-0 -- sh -lc 'ETCDCTL_API=3 etcdctl compact <revision>'
  - Defragment the etcd database with an extended timeout:
      oc exec -n <namespace> ibm-lh-lakehouse-wo-milvus-etcd-0 -- sh -lc 'ETCDCTL_API=3 etcdctl --command-timeout=120s defrag'
    Note: The defragmentation process may take several minutes to complete.
  - Disarm the etcd alarm:
      oc exec -n <namespace> ibm-lh-lakehouse-wo-milvus-etcd-0 -- sh -lc 'ETCDCTL_API=3 etcdctl --command-timeout=120s alarm disarm'
  - Verify the alarm has been cleared:
      oc exec -n <namespace> ibm-lh-lakehouse-wo-milvus-etcd-0 -- sh -lc 'ETCDCTL_API=3 etcdctl alarm list'
    The command should return no output if the alarm is successfully cleared.
  - Restart the Milvus standalone pod:
      oc delete pod -n <namespace> <milvus-pod-name>
    The deployment will automatically create a replacement pod that should start successfully.
  After completing these steps, verify that the deployment status shows READY 1/1 and the pod is in Running state.
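The revision number needed for the compact step can be read from the JSON form of the endpoint status instead of the table. The sketch below assumes the status is requested with -w json and that the output contains a "revision":<number> field; the function name extract_revision is hypothetical.

```shell
# Hypothetical helper: read `etcdctl endpoint status -w json` output on stdin
# and print the first revision number found.
extract_revision() {
  grep -Eo '"revision":[0-9]+' | grep -Eo '[0-9]+' | head -n 1
}

# Example use:
#   rev=$(oc exec -n <namespace> ibm-lh-lakehouse-wo-milvus-etcd-0 -- \
#     sh -lc 'ETCDCTL_API=3 etcdctl endpoint status -w json' | extract_revision)
#   oc exec -n <namespace> ibm-lh-lakehouse-wo-milvus-etcd-0 -- \
#     sh -lc "ETCDCTL_API=3 etcdctl compact $rev"
```

Check that the extracted value is a non-empty number before you pass it to the compact command.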
Restoring of the Milvus S3 backup requires manual job after the cluster restore and stabilization
Applies to: 5.4.0
- Problem
- You need to restore the backups manually after the cluster restore and stabilization of Milvus S3 bucket.
- Solution
-
- Set the variable PROJECT_CPD_INST_OPERANDS to the watsonx Orchestrate operand namespace.
- Run the appropriate restore job for your environment.
- For an online restore, run the following
job:
apiVersion: batch/v1 kind: Job metadata: name: wo-watson-orchestrate-backup-milvus-s3-restore namespace: ${PROJECT_CPD_INST_OPERANDS} annotations: name: wo-watson-orchestrate-backup-milvus-s3-restore namespace: ${PROJECT_CPD_INST_OPERANDS} labels: app.kubernetes.io/component: backup app.kubernetes.io/instance: wo app.kubernetes.io/managed-by: ibm-watson-orchestrate-operator app.kubernetes.io/name: watson-orchestrate icpdsupport/addOnId: orchestrate icpdsupport/app: backup icpdsupport/module: backup-orchestrate icpdsupport/podSelector: backup wo.watsonx.ibm.com/application: watson-orchestrate wo.watsonx.ibm.com/component: backup wo.watsonx.ibm.com/cr-name: wo wo.watsonx.ibm.com/external-access: "true" wo.watsonx.ibm.com/operand-version: 7.1.2 spec: backoffLimit: 0 template: metadata: annotations: productName: IBM watsonx Orchestrate productVersion: 5.3.3 labels: app.kubernetes.io/component: backup app.kubernetes.io/instance: wo app.kubernetes.io/managed-by: ibm-watson-orchestrate-operator app.kubernetes.io/name: watson-orchestrate icpdsupport/addOnId: orchestrate icpdsupport/app: backup icpdsupport/module: backup-orchestrate icpdsupport/podSelector: backup wo.watsonx.ibm.com/application: watson-orchestrate wo.watsonx.ibm.com/component: backup wo.watsonx.ibm.com/cr-name: wo wo.watsonx.ibm.com/external-access: "true" wo.watsonx.ibm.com/operand-version: 7.1.2 spec: serviceAccountName: wo-watson-orchestrate-backup-restore restartPolicy: Never containers: - name: restore image: cp.stg.icr.io/cp/watsonx-orchestrate/ibm-watsonx-orchestrate-onprem-utils@sha256:09d6e5c703ea9ca89baec4cbc8549187072fc0b8b00c1f99a68e1b09ec0641cd imagePullPolicy: Always command: - ./milvus-s3-br.sh - restore env: - name: JOB_NAME value: wo-watson-orchestrate-backup-milvus-s3-restore - name: JOB_NAMESPACE value: ${PROJECT_CPD_INST_OPERANDS} - name: SERVER_STORAGE_BUCKET value: wo-server-storage-bucket-${PROJECT_CPD_INST_OPERANDS} volumeMounts: - name: milvus-secret mountPath: 
/secrets/wo-milvus-storage-bucket - name: s3-account mountPath: /secrets/s3-account - name: s3-uri mountPath: /secrets/s3-uri - name: milvus-configmap mountPath: /configmaps/wo-milvus-storage-bucket - name: s3-cert mountPath: /secrets/s3-cert - name: s3-backup-pvc mountPath: /tmp/s3-backup volumes: - name: milvus-secret secret: secretName: wo-milvus-storage-bucket - name: s3-account secret: secretName: noobaa-account-watsonx-orchestrate - name: s3-uri secret: secretName: noobaa-uri-watsonx-orchestrate - name: milvus-configmap configMap: name: wo-milvus-storage-bucket - name: s3-cert secret: secretName: noobaa-cert-watsonx-orchestrate - name: s3-backup-pvc persistentVolumeClaim: claimName: wo-watson-orchestrate-backup-s3 - For an offline restore, run the following
job:
apiVersion: batch/v1
kind: Job
metadata:
  name: wo-watson-orchestrate-backup-offline-milvus-s3-restore
  namespace: ${PROJECT_CPD_INST_OPERANDS}
  annotations:
    name: wo-watson-orchestrate-backup-offline-milvus-s3-restore
    namespace: ${PROJECT_CPD_INST_OPERANDS}
  labels:
    app.kubernetes.io/component: backup-offline
    app.kubernetes.io/instance: wo
    app.kubernetes.io/managed-by: ibm-watson-orchestrate-operator
    app.kubernetes.io/name: watson-orchestrate
    icpdsupport/addOnId: orchestrate
    icpdsupport/app: backup-offline
    icpdsupport/module: backup-offline-orchestrate
    icpdsupport/podSelector: backup-offline
    wo.watsonx.ibm.com/application: watson-orchestrate
    wo.watsonx.ibm.com/component: backup-offline
    wo.watsonx.ibm.com/cr-name: wo
    wo.watsonx.ibm.com/external-access: "true"
    wo.watsonx.ibm.com/operand-version: 7.1.2
spec:
  backoffLimit: 0
  template:
    metadata:
      annotations:
        productName: IBM watsonx Orchestrate
        productVersion: 5.3.3
      labels:
        app.kubernetes.io/component: backup-offline
        app.kubernetes.io/instance: wo
        app.kubernetes.io/managed-by: ibm-watson-orchestrate-operator
        app.kubernetes.io/name: watson-orchestrate
        icpdsupport/addOnId: orchestrate
        icpdsupport/app: backup-offline
        icpdsupport/module: backup-offline-orchestrate
        icpdsupport/podSelector: backup-offline
        wo.watsonx.ibm.com/application: watson-orchestrate
        wo.watsonx.ibm.com/component: backup-offline
        wo.watsonx.ibm.com/cr-name: wo
        wo.watsonx.ibm.com/external-access: "true"
        wo.watsonx.ibm.com/operand-version: 7.1.2
    spec:
      serviceAccountName: wo-watson-orchestrate-backup-restore
      restartPolicy: Never
      containers:
        - name: restore
          image: cp.icr.io/cp/watsonx-orchestrate/ibm-watsonx-orchestrate-onprem-utils@sha256:09d6e5c703ea9ca89baec4cbc8549187072fc0b8b00c1f99a68e1b09ec0641cd
          imagePullPolicy: Always
          command:
            - ./milvus-s3-br.sh
            - restore
          env:
            - name: JOB_NAME
              value: wo-watson-orchestrate-backup-offline-milvus-s3-restore
            - name: JOB_NAMESPACE
              value: ${PROJECT_CPD_INST_OPERANDS}
            - name: SERVER_STORAGE_BUCKET
              value: wo-server-storage-bucket-${PROJECT_CPD_INST_OPERANDS}
          volumeMounts:
            - name: milvus-secret
              mountPath: /secrets/wo-milvus-storage-bucket
            - name: s3-account
              mountPath: /secrets/s3-account
            - name: s3-uri
              mountPath: /secrets/s3-uri
            - name: milvus-configmap
              mountPath: /configmaps/wo-milvus-storage-bucket
            - name: s3-cert
              mountPath: /secrets/s3-cert
            - name: s3-backup-pvc
              mountPath: /tmp/s3-backup
      volumes:
        - name: milvus-secret
          secret:
            secretName: wo-milvus-storage-bucket
        - name: s3-account
          secret:
            secretName: noobaa-account-watsonx-orchestrate
        - name: s3-uri
          secret:
            secretName: noobaa-uri-watsonx-orchestrate
        - name: milvus-configmap
          configMap:
            name: wo-milvus-storage-bucket
        - name: s3-cert
          secret:
            secretName: noobaa-cert-watsonx-orchestrate
        - name: s3-backup-pvc
          persistentVolumeClaim:
            claimName: wo-watson-orchestrate-backup-offline-s3
- For an online restore, run the following job:
SSO OBO and token exchange do not work in on‑premises deployments
Applies to: 5.4.0
- Problem
- In on‑premises deployments of watsonx Orchestrate, the Single Sign-On (SSO) On‑Behalf‑Of (OBO) flow and Identity Provider (IdP) token exchange do not function correctly. Although users can successfully log in to watsonx Orchestrate by using SSO, the SSO token is not exchanged with the downstream IdP when an agent executes actions in connected applications. As a result:
- User identity is not consistently propagated between the IdP, watsonx Orchestrate, and downstream applications. In such cases, alternative authentication methods, such as the OAuth 2.0 authorization code flow or the client credentials flow, might be required.
- Seamless SSO behavior supported in IBM Cloud and AWS SaaS deployments is not available in these on‑premises releases.
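As an illustration of one of the alternative methods mentioned above, the following is a minimal sketch of an OAuth 2.0 client credentials token request (RFC 6749, section 4.4) made with curl. It is not a watsonx Orchestrate procedure and does not show how to configure a connection in the product. The function name `request_client_credentials_token`, the token URL, and the client ID and secret are all hypothetical placeholders for values that your own identity provider issues.

```shell
#!/usr/bin/env bash
# Hypothetical helper: requests an access token with the OAuth 2.0 client
# credentials grant. All three arguments are placeholders (assumptions) for
# values issued by your own identity provider.
request_client_credentials_token() {
  local token_url="$1"      # for example https://idp.example.com/oauth2/token
  local client_id="$2"
  local client_secret="$3"

  # The client authenticates with HTTP Basic; the grant type is sent in the
  # form-encoded POST body (--data-urlencode makes curl issue a POST).
  curl --silent --show-error --fail \
    --user "${client_id}:${client_secret}" \
    --data-urlencode "grant_type=client_credentials" \
    "${token_url}"
}

# Usage (placeholder values):
#   request_client_credentials_token https://idp.example.com/oauth2/token my-id my-secret
```

With this grant the token represents the client application rather than the logged-in user, so the user identity is still not propagated to the downstream application.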
Astra DB connection fails in air-gap environments with HTTP proxy enabled
Applies to: 5.4.0
- Problem
- In watsonx Orchestrate on‑premises air-gap environments with an HTTP proxy enabled, connecting to Astra DB as a knowledge source fails during agent creation. When an admin creates an agent in Agent builder, navigates to Knowledge > Choose knowledge > Astra DB, and clicks Next after entering the required details, the UI does not progress to the next screen and the connection fails. This occurs because Astra DB keyspaces cannot be retrieved in fully air-gapped configurations.
watsonx Orchestrate CR is stuck in InProgress state though the status shows Completed
Applies to: 5.4.0
- Problem
- The watsonx Orchestrate CR remains in the InProgress state even though the status shows Completed. This happens because the data stores repeatedly enter a reconcile loop.
- Solution
- When the verification shows 20/20, you can ignore the InProgress state because the pods are healthy and fully available.
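To double-check that the pods are healthy before you ignore the state, you can count the pods that are not ready. The following is a minimal sketch rather than an IBM-provided procedure. The `count_unready_pods` helper is a hypothetical name, and the sketch assumes the default column layout of `oc get pods --no-headers` (NAME, READY, STATUS, and so on).

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `oc get pods --no-headers` output on stdin and
# prints the number of pods that are neither fully ready nor completed.
count_unready_pods() {
  awk '
    {
      split($2, ready, "/")            # READY column, for example 1/1
      if ($3 == "Completed") next      # finished job pods are expected
      if ($3 != "Running" || ready[1] != ready[2]) unready++
    }
    END { print unready + 0 }
  '
}

# Usage (assumes PROJECT_CPD_INST_OPERANDS is set and you are logged in with oc):
#   oc get pods -n "${PROJECT_CPD_INST_OPERANDS}" --no-headers | count_unready_pods
```

A result of 0 suggests that every pod in the project is either running with all containers ready or has completed.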
Domain agents and tools UUIDs are not the same in two clusters
Applies to: 5.4.0
- Problem
- In a multi-region active deployment, the UUIDs of domain agents and tools differ between clusters. For example, if an Admin user checks the UUIDs of domain agents and tools in the Catalog on cluster-A (as shown in the Network tab) and then checks the same agents and tools on cluster-B, the UUIDs are different.
The UUID of an agent is different on deleting and importing again the same agent via ADK
Applies to: 5.4.0
- Problem
- In a multi-region active deployment, if an Admin deletes an imported agent in Agent builder and then imports the same agent again into the same cluster through the ADK, the agent is assigned a different UUID.
On‑premises traces do not display Tools Runtime or TRM spans
Applies to: 5.4.0
- Problem
- In watsonx Orchestrate on‑premises environments, Tools Runtime and TRM spans are not captured or displayed in traces, so observability of agent execution is incomplete.
Backup and restore utility is not included for Milvus
Applies to: 5.4.0
- Problem
- The backup and restore utility does not include the watsonx Orchestrate S3 bucket data for Milvus. You must back up this data manually and, after the instance restore is completed, restore it by using a Kubernetes job.
- Solution
- To resolve the problem, run the following script before you take the backup:
oc project ${PROJECT_CPD_INST_OPERANDS}
cat <<EOF | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: wo-watson-orchestrate-backup-milvus-s3-backup
  namespace: ${PROJECT_CPD_INST_OPERANDS}
  annotations:
    name: wo-watson-orchestrate-backup-milvus-s3-backup
    namespace: ${PROJECT_CPD_INST_OPERANDS}
  labels:
    app.kubernetes.io/component: backup
    app.kubernetes.io/instance: wo
    app.kubernetes.io/managed-by: ibm-watson-orchestrate-operator
    app.kubernetes.io/name: watson-orchestrate
    icpdsupport/addOnId: orchestrate
    icpdsupport/app: backup
    icpdsupport/module: backup-orchestrate
    icpdsupport/podSelector: backup
    wo.watsonx.ibm.com/application: watson-orchestrate
    wo.watsonx.ibm.com/component: backup
    wo.watsonx.ibm.com/cr-name: wo
    wo.watsonx.ibm.com/external-access: "true"
    wo.watsonx.ibm.com/operand-version: 6.0.0
spec:
  backoffLimit: 0
  template:
    metadata:
      annotations:
        productName: IBM watsonx Orchestrate
        productVersion: 5.2.0
      labels:
        app.kubernetes.io/component: backup
        app.kubernetes.io/instance: wo
        app.kubernetes.io/managed-by: ibm-watson-orchestrate-operator
        app.kubernetes.io/name: watson-orchestrate
        icpdsupport/addOnId: orchestrate
        icpdsupport/app: backup
        icpdsupport/module: backup-orchestrate
        icpdsupport/podSelector: backup
        wo.watsonx.ibm.com/application: watson-orchestrate
        wo.watsonx.ibm.com/component: backup
        wo.watsonx.ibm.com/cr-name: wo
        wo.watsonx.ibm.com/external-access: "true"
        wo.watsonx.ibm.com/operand-version: 6.0.0
    spec:
      serviceAccountName: wo-watson-orchestrate-backup-restore
      restartPolicy: Never
      containers:
        - name: backup
          image: cp.icr.io/cp/watsonx-orchestrate/ibm-watsonx-orchestrate-onprem-utils@sha256:f2ca697cdcea2f349f9b0304a3b28f19f5d3f917b57b9076bbae43052a8a9c20
          imagePullPolicy: Always
          command:
            - ./milvus-s3-br.sh
            - backup
          env:
            - name: JOB_NAME
              value: wo-watson-orchestrate-backup-milvus-s3-backup
            - name: JOB_NAMESPACE
              value: cpd-instance-1
          resources: {}
          volumeMounts:
            - name: milvus-secret
              mountPath: /secrets/wo-milvus-storage-bucket
            - name: milvus-configmap
              mountPath: /configmaps/wo-milvus-storage-bucket
            - name: s3-cert
              mountPath: /secrets/s3-cert
            - name: s3-backup-pvc
              mountPath: /tmp/s3-backup
      volumes:
        - name: milvus-secret
          secret:
            secretName: wo-milvus-storage-bucket
        - name: milvus-configmap
          configMap:
            name: wo-milvus-storage-bucket
        - name: s3-cert
          secret:
            secretName: noobaa-cert-watsonx-orchestrate
        - name: s3-backup-pvc
          persistentVolumeClaim:
            claimName: wo-watson-orchestrate-backup-s3
EOF
Run the following script after you restore:
oc project ${PROJECT_CPD_INST_OPERANDS}
cat <<EOF | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: wo-watson-orchestrate-backup-milvus-s3-restore
  namespace: ${PROJECT_CPD_INST_OPERANDS}
  annotations:
    name: wo-watson-orchestrate-backup-milvus-s3-restore
    namespace: ${PROJECT_CPD_INST_OPERANDS}
  labels:
    app.kubernetes.io/component: backup
    app.kubernetes.io/instance: wo
    app.kubernetes.io/managed-by: ibm-watson-orchestrate-operator
    app.kubernetes.io/name: watson-orchestrate
    icpdsupport/addOnId: orchestrate
    icpdsupport/app: backup
    icpdsupport/module: backup-orchestrate
    icpdsupport/podSelector: backup
    wo.watsonx.ibm.com/application: watson-orchestrate
    wo.watsonx.ibm.com/component: backup
    wo.watsonx.ibm.com/cr-name: wo
    wo.watsonx.ibm.com/external-access: "true"
    wo.watsonx.ibm.com/operand-version: 6.0.0
spec:
  backoffLimit: 0
  template:
    metadata:
      annotations:
        productName: IBM watsonx Orchestrate
        productVersion: 5.2.0
      labels:
        app.kubernetes.io/component: backup
        app.kubernetes.io/instance: wo
        app.kubernetes.io/managed-by: ibm-watson-orchestrate-operator
        app.kubernetes.io/name: watson-orchestrate
        icpdsupport/addOnId: orchestrate
        icpdsupport/app: backup
        icpdsupport/module: backup-orchestrate
        icpdsupport/podSelector: backup
        wo.watsonx.ibm.com/application: watson-orchestrate
        wo.watsonx.ibm.com/component: backup
        wo.watsonx.ibm.com/cr-name: wo
        wo.watsonx.ibm.com/external-access: "true"
        wo.watsonx.ibm.com/operand-version: 6.0.0
    spec:
      serviceAccountName: wo-watson-orchestrate-backup-restore
      restartPolicy: Never
      containers:
        - name: restore
          image: cp.icr.io/cp/watsonx-orchestrate/ibm-watsonx-orchestrate-onprem-utils@sha256:f2ca697cdcea2f349f9b0304a3b28f19f5d3f917b57b9076bbae43052a8a9c20
          imagePullPolicy: Always
          command:
            - ./milvus-s3-br.sh
            - restore
          env:
            - name: JOB_NAME
              value: wo-watson-orchestrate-backup-milvus-s3-restore
            - name: JOB_NAMESPACE
              value: cpd-instance-1
          resources: {}
          volumeMounts:
            - name: milvus-secret
              mountPath: /secrets/wo-milvus-storage-bucket
            - name: milvus-configmap
              mountPath: /configmaps/wo-milvus-storage-bucket
            - name: s3-cert
              mountPath: /secrets/s3-cert
            - name: s3-backup-pvc
              mountPath: /tmp/s3-backup
      volumes:
        - name: milvus-secret
          secret:
            secretName: wo-milvus-storage-bucket
        - name: milvus-configmap
          configMap:
            name: wo-milvus-storage-bucket
        - name: s3-cert
          secret:
            secretName: noobaa-cert-watsonx-orchestrate
        - name: s3-backup-pvc
          persistentVolumeClaim:
            claimName: wo-watson-orchestrate-backup-s3
EOF
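After you apply either job, you might want to confirm that it finished before you continue. The following is a minimal sketch under stated assumptions, not part of the documented procedure: the `wait_for_milvus_job` helper name and the 30-minute default timeout are hypothetical, and the sketch assumes that `PROJECT_CPD_INST_OPERANDS` is set and that you are logged in with `oc`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: waits for a Milvus S3 backup or restore Job to finish
# and then prints its logs. Arguments: job name, optional timeout.
wait_for_milvus_job() {
  local job_name="$1"
  local timeout="${2:-30m}"   # assumed default; adjust to your data volume
  local ns="${PROJECT_CPD_INST_OPERANDS:?set PROJECT_CPD_INST_OPERANDS first}"

  # Block until the Job reports the Complete condition or the timeout expires.
  if oc wait --for=condition=complete "job/${job_name}" -n "${ns}" --timeout="${timeout}"; then
    oc logs "job/${job_name}" -n "${ns}"
  else
    echo "Job ${job_name} did not complete within ${timeout}" >&2
    oc logs "job/${job_name}" -n "${ns}" >&2
    return 1
  fi
}

# Usage:
#   wait_for_milvus_job wo-watson-orchestrate-backup-milvus-s3-backup
#   wait_for_milvus_job wo-watson-orchestrate-backup-milvus-s3-restore 1h
```

Because the jobs set backoffLimit: 0, a failed pod is not retried. In that case the wait runs until the timeout, and the helper prints the job logs to standard error so that you can inspect the failure.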