Known issues and limitations for watsonx Assistant
Upgrade to IBM Software Hub Version 5.1 before IBM Cloud Pak for Data Version 4.8 reaches end of support. For more information, see Upgrading from IBM Cloud Pak for Data Version 4.8 to IBM Software Hub Version 5.1.
The following known issues and limitations apply to watsonx Assistant.
- Activity logs do not show up in the new watsonx Assistant installation state
- watsonx Assistant tfmm pods get stuck in Init:0/1 state
- Upgrade from Version 4.8.0 causes one of the ETCD pods to go to the CrashLoopBackOff state
- watsonx Assistant Upgrade gets stuck in Verify state
- Workspace logs and Data Governor Elasticsearch pods go to CrashLoopBackOff status after upgrade
- An etcd operator script fails while upgrading watsonx Assistant 4.8.4
- DataGovernor Elasticsearch pods in CrashLoopBackOff status
- DataGovernor does not recover after shutdown and restart
- One or more watsonx Assistant pods go to the ContainerStatusUnknown state
- EDB Postgres cluster in bad state
- The ETCD pods of watsonx Assistant are not re-created after restarting the cluster
- Elasticsearch store not getting cleaned up on upgrade
- Data Governor shutdown issue
- EDB Postgres connection errors when max connections reached
- watsonx Assistant Redis pods not starting because quota is applied to the namespace
- watsonx Assistant Redis pods not running after cluster restart
- watsonx Assistant upgrade gets stuck at apply-cr
- watsonx Assistant upgrade gets stuck at apply-cr or training does not work after the upgrade completes successfully
- Red Hat OpenShift upgrade hangs because some watsonx Assistant pods do not quiesce
- Increasing backup storage for wa-store-cronjob pods that run out of space
- Inaccurate status message from command line after upgrade
- ModelTrain or clu-training pods may not get healthy when upgrading or after installation
For a complete list of known issues and troubleshooting information for all versions of watsonx Assistant, see Troubleshooting known issues. For a complete list of known issues for Cloud Pak for Data, see Limitations and known issues in Cloud Pak for Data.
Activity logs do not show up in the new watsonx Assistant installation state
Applies to: Any release
- Problem
- Activity logs do not appear because the EDB Postgres database does not have partitions for the year 2025.
- Solution
- To fix the issue, do the following steps:

  1. Get the database password.

     Only for Versions 4.8.8 and later:

     ```
     oc get secret wa-postgres-16-admin-auth -o jsonpath='{.data.password}' | base64 -d && echo
     ```

     For versions earlier than 4.8.8:

     ```
     oc get secret wa-postgres-admin-auth -o jsonpath='{.data.password}' | base64 -d && echo
     ```

  2. Exec into one of the EDB Postgres pods.

     Only for Versions 4.8.8 and later:

     ```
     oc rsh wa-postgres-16-1
     ```

     For versions earlier than 4.8.8:

     ```
     oc rsh wa-postgres-1
     ```

  3. Connect to the database.

     Only for Versions 4.8.8 and later:

     ```
     sh-5.1$ psql -h wa-postgres-16-rw.cpd.svc -d conversation_pprd_wa -U postgres
     Password for user postgres:
     psql (12.18)
     SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)
     Type "help" for help.
     ```

     For versions earlier than 4.8.8:

     ```
     sh-5.1$ psql -h wa-postgres-rw.cpd.svc -d conversation_pprd_wa -U postgres
     Password for user postgres:
     psql (12.18)
     SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)
     Type "help" for help.
     ```

  4. Verify the event_log table schema:

     ```
     conversation_pprd_wa=# \d event_log
                         Partitioned table "public.event_log"
         Column      |            Type             | Collation | Nullable | Default
     ----------------+-----------------------------+-----------+----------+---------
      event_name     | text                        |           | not null |
      event_method   | activity_method             |           | not null |
      event_resource | activity_resource           |           | not null |
      domain         | text                        |           |          |
      status         | integer                     |           |          |
      description    | text                        |           |          |
      modified       | timestamp without time zone |           | not null | now()
      crn_id         | text                        |           |          |
      request_id     | uuid                        |           |          |
      request_uri    | text                        |           |          |
      instance_id    | uuid                        |           | not null |
     ```

  5. If event_log shows the request_id or instance_id column as uuid, change those columns to text by using the following commands:

     ```
     conversation_pprd_wa=# ALTER TABLE event_log ALTER COLUMN request_id TYPE text;
     ALTER TABLE
     conversation_pprd_wa=# ALTER TABLE event_log ALTER COLUMN instance_id TYPE text;
     ALTER TABLE
     ```

  6. Create the following partitions for December 2024 through December 2025:

     ```
     -- december 2024
     CREATE TABLE IF NOT EXISTS event_logs_y2024m12w1 PARTITION OF event_log FOR VALUES FROM ('2024-12-01T00:00:00.000Z') TO ('2024-12-08T00:00:00.000Z');
     CREATE TABLE IF NOT EXISTS event_logs_y2024m12w2 PARTITION OF event_log FOR VALUES FROM ('2024-12-08T00:00:00.000Z') TO ('2024-12-15T00:00:00.000Z');
     CREATE TABLE IF NOT EXISTS event_logs_y2024m12w3 PARTITION OF event_log FOR VALUES FROM ('2024-12-15T00:00:00.000Z') TO ('2024-12-22T00:00:00.000Z');
     CREATE TABLE IF NOT EXISTS event_logs_y2024m12w4 PARTITION OF event_log FOR VALUES FROM ('2024-12-22T00:00:00.000Z') TO ('2025-01-01T00:00:00.000Z');
     -- january 2025
     CREATE TABLE IF NOT EXISTS event_logs_y2025m01w1 PARTITION OF event_log FOR VALUES FROM ('2025-01-01T00:00:00.000Z') TO ('2025-01-08T00:00:00.000Z');
     CREATE TABLE IF NOT EXISTS event_logs_y2025m01w2 PARTITION OF event_log FOR VALUES FROM ('2025-01-08T00:00:00.000Z') TO ('2025-01-15T00:00:00.000Z');
     CREATE TABLE IF NOT EXISTS event_logs_y2025m01w3 PARTITION OF event_log FOR VALUES FROM ('2025-01-15T00:00:00.000Z') TO ('2025-01-22T00:00:00.000Z');
     CREATE TABLE IF NOT EXISTS event_logs_y2025m01w4 PARTITION OF event_log FOR VALUES FROM ('2025-01-22T00:00:00.000Z') TO ('2025-02-01T00:00:00.000Z');
     -- february 2025
     CREATE TABLE IF NOT EXISTS event_logs_y2025m02w1 PARTITION OF event_log FOR VALUES FROM ('2025-02-01T00:00:00.000Z') TO ('2025-02-08T00:00:00.000Z');
     CREATE TABLE IF NOT EXISTS event_logs_y2025m02w2 PARTITION OF event_log FOR VALUES FROM ('2025-02-08T00:00:00.000Z') TO ('2025-02-15T00:00:00.000Z');
     CREATE TABLE IF NOT EXISTS event_logs_y2025m02w3 PARTITION OF event_log FOR VALUES FROM ('2025-02-15T00:00:00.000Z') TO ('2025-02-22T00:00:00.000Z');
     CREATE TABLE IF NOT EXISTS event_logs_y2025m02w4 PARTITION OF event_log FOR VALUES FROM ('2025-02-22T00:00:00.000Z') TO ('2025-03-01T00:00:00.000Z');
     -- march 2025
     CREATE TABLE IF NOT EXISTS event_logs_y2025m03w1 PARTITION OF event_log FOR VALUES FROM ('2025-03-01T00:00:00.000Z') TO ('2025-03-08T00:00:00.000Z');
     CREATE TABLE IF NOT EXISTS event_logs_y2025m03w2 PARTITION OF event_log FOR VALUES FROM ('2025-03-08T00:00:00.000Z') TO ('2025-03-15T00:00:00.000Z');
     CREATE TABLE IF NOT EXISTS event_logs_y2025m03w3 PARTITION OF event_log FOR VALUES FROM ('2025-03-15T00:00:00.000Z') TO ('2025-03-22T00:00:00.000Z');
     CREATE TABLE IF NOT EXISTS event_logs_y2025m03w4 PARTITION OF event_log FOR VALUES FROM ('2025-03-22T00:00:00.000Z') TO ('2025-04-01T00:00:00.000Z');
     -- april 2025
     CREATE TABLE IF NOT EXISTS event_logs_y2025m04w1 PARTITION OF event_log FOR VALUES FROM ('2025-04-01T00:00:00.000Z') TO ('2025-04-08T00:00:00.000Z');
     CREATE TABLE IF NOT EXISTS event_logs_y2025m04w2 PARTITION OF event_log FOR VALUES FROM ('2025-04-08T00:00:00.000Z') TO ('2025-04-15T00:00:00.000Z');
     CREATE TABLE IF NOT EXISTS event_logs_y2025m04w3 PARTITION OF event_log FOR VALUES FROM ('2025-04-15T00:00:00.000Z') TO ('2025-04-22T00:00:00.000Z');
     CREATE TABLE IF NOT EXISTS event_logs_y2025m04w4 PARTITION OF event_log FOR VALUES FROM ('2025-04-22T00:00:00.000Z') TO ('2025-05-01T00:00:00.000Z');
     -- may 2025
     CREATE TABLE IF NOT EXISTS event_logs_y2025m05w1 PARTITION OF event_log FOR VALUES FROM ('2025-05-01T00:00:00.000Z') TO ('2025-05-08T00:00:00.000Z');
     CREATE TABLE IF NOT EXISTS event_logs_y2025m05w2 PARTITION OF event_log FOR VALUES FROM ('2025-05-08T00:00:00.000Z') TO ('2025-05-15T00:00:00.000Z');
     CREATE TABLE IF NOT EXISTS event_logs_y2025m05w3 PARTITION OF event_log FOR VALUES FROM ('2025-05-15T00:00:00.000Z') TO ('2025-05-22T00:00:00.000Z');
     CREATE TABLE IF NOT EXISTS event_logs_y2025m05w4 PARTITION OF event_log FOR VALUES FROM ('2025-05-22T00:00:00.000Z') TO ('2025-06-01T00:00:00.000Z');
     -- june 2025
     CREATE TABLE IF NOT EXISTS event_logs_y2025m06w1 PARTITION OF event_log FOR VALUES FROM ('2025-06-01T00:00:00.000Z') TO ('2025-06-08T00:00:00.000Z');
     CREATE TABLE IF NOT EXISTS event_logs_y2025m06w2 PARTITION OF event_log FOR VALUES FROM ('2025-06-08T00:00:00.000Z') TO ('2025-06-15T00:00:00.000Z');
     CREATE TABLE IF NOT EXISTS event_logs_y2025m06w3 PARTITION OF event_log FOR VALUES FROM ('2025-06-15T00:00:00.000Z') TO ('2025-06-22T00:00:00.000Z');
     CREATE TABLE IF NOT EXISTS event_logs_y2025m06w4 PARTITION OF event_log FOR VALUES FROM ('2025-06-22T00:00:00.000Z') TO ('2025-07-01T00:00:00.000Z');
     -- july 2025
     CREATE TABLE IF NOT EXISTS event_logs_y2025m07w1 PARTITION OF event_log FOR VALUES FROM ('2025-07-01T00:00:00.000Z') TO ('2025-07-08T00:00:00.000Z');
     CREATE TABLE IF NOT EXISTS event_logs_y2025m07w2 PARTITION OF event_log FOR VALUES FROM ('2025-07-08T00:00:00.000Z') TO ('2025-07-15T00:00:00.000Z');
     CREATE TABLE IF NOT EXISTS event_logs_y2025m07w3 PARTITION OF event_log FOR VALUES FROM ('2025-07-15T00:00:00.000Z') TO ('2025-07-22T00:00:00.000Z');
     CREATE TABLE IF NOT EXISTS event_logs_y2025m07w4 PARTITION OF event_log FOR VALUES FROM ('2025-07-22T00:00:00.000Z') TO ('2025-08-01T00:00:00.000Z');
     -- august 2025
     CREATE TABLE IF NOT EXISTS event_logs_y2025m08w1 PARTITION OF event_log FOR VALUES FROM ('2025-08-01T00:00:00.000Z') TO ('2025-08-08T00:00:00.000Z');
     CREATE TABLE IF NOT EXISTS event_logs_y2025m08w2 PARTITION OF event_log FOR VALUES FROM ('2025-08-08T00:00:00.000Z') TO ('2025-08-15T00:00:00.000Z');
     CREATE TABLE IF NOT EXISTS event_logs_y2025m08w3 PARTITION OF event_log FOR VALUES FROM ('2025-08-15T00:00:00.000Z') TO ('2025-08-22T00:00:00.000Z');
     CREATE TABLE IF NOT EXISTS event_logs_y2025m08w4 PARTITION OF event_log FOR VALUES FROM ('2025-08-22T00:00:00.000Z') TO ('2025-09-01T00:00:00.000Z');
     -- september 2025
     CREATE TABLE IF NOT EXISTS event_logs_y2025m09w1 PARTITION OF event_log FOR VALUES FROM ('2025-09-01T00:00:00.000Z') TO ('2025-09-08T00:00:00.000Z');
     CREATE TABLE IF NOT EXISTS event_logs_y2025m09w2 PARTITION OF event_log FOR VALUES FROM ('2025-09-08T00:00:00.000Z') TO ('2025-09-15T00:00:00.000Z');
     CREATE TABLE IF NOT EXISTS event_logs_y2025m09w3 PARTITION OF event_log FOR VALUES FROM ('2025-09-15T00:00:00.000Z') TO ('2025-09-22T00:00:00.000Z');
     CREATE TABLE IF NOT EXISTS event_logs_y2025m09w4 PARTITION OF event_log FOR VALUES FROM ('2025-09-22T00:00:00.000Z') TO ('2025-10-01T00:00:00.000Z');
     -- october 2025
     CREATE TABLE IF NOT EXISTS event_logs_y2025m10w1 PARTITION OF event_log FOR VALUES FROM ('2025-10-01T00:00:00.000Z') TO ('2025-10-08T00:00:00.000Z');
     CREATE TABLE IF NOT EXISTS event_logs_y2025m10w2 PARTITION OF event_log FOR VALUES FROM ('2025-10-08T00:00:00.000Z') TO ('2025-10-15T00:00:00.000Z');
     CREATE TABLE IF NOT EXISTS event_logs_y2025m10w3 PARTITION OF event_log FOR VALUES FROM ('2025-10-15T00:00:00.000Z') TO ('2025-10-22T00:00:00.000Z');
     CREATE TABLE IF NOT EXISTS event_logs_y2025m10w4 PARTITION OF event_log FOR VALUES FROM ('2025-10-22T00:00:00.000Z') TO ('2025-11-01T00:00:00.000Z');
     -- november 2025
     CREATE TABLE IF NOT EXISTS event_logs_y2025m11w1 PARTITION OF event_log FOR VALUES FROM ('2025-11-01T00:00:00.000Z') TO ('2025-11-08T00:00:00.000Z');
     CREATE TABLE IF NOT EXISTS event_logs_y2025m11w2 PARTITION OF event_log FOR VALUES FROM ('2025-11-08T00:00:00.000Z') TO ('2025-11-15T00:00:00.000Z');
     CREATE TABLE IF NOT EXISTS event_logs_y2025m11w3 PARTITION OF event_log FOR VALUES FROM ('2025-11-15T00:00:00.000Z') TO ('2025-11-22T00:00:00.000Z');
     CREATE TABLE IF NOT EXISTS event_logs_y2025m11w4 PARTITION OF event_log FOR VALUES FROM ('2025-11-22T00:00:00.000Z') TO ('2025-12-01T00:00:00.000Z');
     -- december 2025
     CREATE TABLE IF NOT EXISTS event_logs_y2025m12w1 PARTITION OF event_log FOR VALUES FROM ('2025-12-01T00:00:00.000Z') TO ('2025-12-08T00:00:00.000Z');
     CREATE TABLE IF NOT EXISTS event_logs_y2025m12w2 PARTITION OF event_log FOR VALUES FROM ('2025-12-08T00:00:00.000Z') TO ('2025-12-15T00:00:00.000Z');
     CREATE TABLE IF NOT EXISTS event_logs_y2025m12w3 PARTITION OF event_log FOR VALUES FROM ('2025-12-15T00:00:00.000Z') TO ('2025-12-22T00:00:00.000Z');
     CREATE TABLE IF NOT EXISTS event_logs_y2025m12w4 PARTITION OF event_log FOR VALUES FROM ('2025-12-22T00:00:00.000Z') TO ('2026-01-01T00:00:00.000Z');
     ```
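The partition list above ends at 2026-01-01, so activity logs can stop appearing again once that boundary passes. If you prefer not to hand-write the next year's DDL, a small script can generate it. This is a sketch only: the `gen_partitions` helper is hypothetical, and it assumes the weekly w1-w4 split and the `event_logs_yYYYYmMMwN` naming shown above are unchanged in your release, so review the output before running it against the database.

```shell
# gen_partitions YEAR: emit weekly partition DDL for event_log for a whole
# year, following the w1-w4 split used above (weeks start on days 01, 08,
# 15, and 22; week 4 runs to the 1st of the next month).
gen_partitions() {
  year=$1
  for month in $(seq -w 1 12); do
    # Compute the year/month that closes week 4.
    if [ "$month" = "12" ]; then
      next_year=$((year + 1)); next_month=01
    else
      next_year=$year; next_month=$(printf '%02d' $((10#$month + 1)))
    fi
    week=1
    for day in 01 08 15 22; do
      if [ "$day" = "22" ]; then
        upper="${next_year}-${next_month}-01"
      else
        upper="${year}-${month}-$(printf '%02d' $((10#$day + 7)))"
      fi
      printf "CREATE TABLE IF NOT EXISTS event_logs_y%sm%sw%s PARTITION OF event_log FOR VALUES FROM ('%s-%s-%sT00:00:00.000Z') TO ('%sT00:00:00.000Z');\n" \
        "$year" "$month" "$week" "$year" "$month" "$day" "$upper"
      week=$((week + 1))
    done
  done
}

gen_partitions 2026
```

Piping the output through psql (for example, `gen_partitions 2026 | psql -h wa-postgres-16-rw.cpd.svc -d conversation_pprd_wa -U postgres`) would create the partitions in one pass.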
watsonx Assistant tfmm pods get stuck in Init:0/1 state
Applies to: 4.8.x
- Problem
- During the installation, the watsonx Assistant tfmm pods get stuck in the Init:0/1 state, preventing the installation from completing. This issue occurs because of resource allocation problems for the model-upload init container.
- Solution
- To fix the issue, apply the following patch to adjust the resource limits and requests for the model-upload init container:

  ```
  export PROJECT_CPD_INST_OPERANDS=<namespace where watsonx Assistant is installed>
  export INSTANCE=`oc get wa -n ${PROJECT_CPD_INST_OPERANDS} | grep -v NAME | awk '{print $1}'`
  cat <<EOF | oc apply -f -
  apiVersion: assistant.watson.ibm.com/v1
  kind: TemporaryPatch
  metadata:
    name: wa-tfmm-init-container-fix
    namespace: ${PROJECT_CPD_INST_OPERANDS}
  spec:
    apiVersion: assistant.watson.ibm.com/v1
    kind: WatsonAssistantCluRuntime
    name: ${INSTANCE}
    patchType: patchStrategicMerge
    patch:
      tfmm:
        deployment:
          spec:
            template:
              spec:
                initContainers:
                  - name: model-upload
                    resources:
                      limits:
                        cpu: 5
                        memory: 512Mi
                      requests:
                        cpu: 500m
                        memory: 512Mi
  EOF
  ```
Upgrade from Version 4.8.0 causes one of the ETCD pods to go to the CrashLoopBackOff state
Applies to: 4.8.7
- Problem
- During the upgrade of watsonx Assistant from Version 4.8.0, one of the ETCD pods might go to the CrashLoopBackOff state. You might find panic: not implemented in the logs of the ETCD pod that is in the CrashLoopBackOff state:

  ```
  panic: not implemented

  goroutine 148 [running]:
  go.etcd.io/etcd/etcdserver.(*applierV3backend).Apply(0xc0001ba160, 0xc001c2ea80, 0x0)
      /tmp/etcd-release-3.4.16/etcd/release/etcd/etcdserver/apply.go:173 +0x1019
  go.etcd.io/etcd/etcdserver.(*authApplierV3).Apply(0xc0001b8050, 0xc001c2ea80, 0x0)
      /tmp/etcd-release-3.4.16/etcd/release/etcd/etcdserver/apply_auth.go:60 +0xd7
  go.etcd.io/etcd/etcdserver.(*EtcdServer).applyEntryNormal(0xc0002de600, 0xc0007234d0)
      /tmp/etcd-release-3.4.16/etcd/release/etcd/etcdserver/server.go:2224 +0x1f0
  go.etcd.io/etcd/etcdserver.(*EtcdServer).apply(0xc0002de600, 0xc000a25628, 0x51b, 0x578, 0xc0000ec140, 0xc0018c8070, 0x0, 0x10f7228)
      /tmp/etcd-release-3.4.16/etcd/release/etcd/etcdserver/server.go:2138 +0x596
  go.etcd.io/etcd/etcdserver.(*EtcdServer).applyEntries(0xc0002de600, 0xc0000ec140, 0xc0002a8400)
      /tmp/etcd-release-3.4.16/etcd/release/etcd/etcdserver/server.go:1390 +0xd4
  go.etcd.io/etcd/etcdserver.(*EtcdServer).applyAll(0xc0002de600, 0xc0000ec140, 0xc0002a8400)
      /tmp/etcd-release-3.4.16/etcd/release/etcd/etcdserver/server.go:1114 +0x88
  go.etcd.io/etcd/etcdserver.(*EtcdServer).run.func8(0x1299c80, 0xc000212ec0)
      /tmp/etcd-release-3.4.16/etcd/release/etcd/etcdserver/server.go:1059 +0x3c
  go.etcd.io/etcd/pkg/schedule.(*fifo).run(0xc00089e0c0)
      /tmp/etcd-release-3.4.16/etcd/release/etcd/pkg/schedule/schedule.go:157 +0xdb
  created by go.etcd.io/etcd/pkg/schedule.NewFIFOScheduler
      /tmp/etcd-release-3.4.16/etcd/release/etcd/pkg/schedule/schedule.go:70 +0x13d
  ```
- Solution
- To fix the issue, restart the ETCD pod that is in the CrashLoopBackOff state.
watsonx Assistant upgrade gets stuck in Verify state
Applies to: Upgrades to 4.8.6 from versions prior to 4.8.0
- Problem
- During the upgrade, watsonx Assistant gets stuck in the Verify state with Unprocessable Entity error messages in the etcd operator logs, which prevents the etcd pods from updating to the new 4.8.6 values.
- Solution
- To fix the issue, apply the following patch:

  ```
  export INSTANCE=`oc get wa -n ${PROJECT_CPD_INST_OPERANDS} | grep -v NAME | awk '{print $1}'`
  cat <<EOF | oc apply -f -
  apiVersion: assistant.watson.ibm.com/v1
  kind: TemporaryPatch
  metadata:
    name: remove-icpdsupport-module-label
    namespace: ${PROJECT_CPD_INST_OPERANDS}
  spec:
    apiVersion: assistant.watson.ibm.com/v1
    kind: WatsonAssistant
    name: ${INSTANCE}
    patchType: patchJson6902
    patch:
      etcd:
        cr:
          - op: remove
            path: /spec/labels/icpdsupport~1module
  EOF
  ```
Workspace logs and Data Governor Elasticsearch pods go to CrashLoopBackOff status after upgrade
Applies to: Upgrades up to 4.8.4
- Problem
- The Workspace logs and Data Governor Elasticsearch pods go to the CrashLoopBackOff status after the watsonx Assistant upgrade, with the error Functional test failed at SCENARIO: V2 Health Check.
- Solution
- To fix the issue, do the following steps:

  1. Apply the following patch to increase the CPU request to 500m:

     ```
     cat <<EOF | oc apply -f -
     apiVersion: assistant.watson.ibm.com/v1
     kind: TemporaryPatch
     metadata:
       name: wa-data-governor
     spec:
       apiVersion: assistant.watson.ibm.com/v1
       kind: WatsonAssistant
       name: wa
       patch:
         data-governor:
           datagovernoroverride:
             spec:
               dependencies:
                 elasticsearch:
                   esResources:
                     requests:
                       cpu: 500m
       patchType: patchStrategicMerge
     EOF
     ```

     Note: After the change, it takes approximately 15 minutes for the watsonx Assistant operator to update the CPU request.
  2. Check the health of wa-data-governor:

     ```
     oc get sts | grep wa-data-go
     wa-data-go-d7db-ib-6fb9-es-server-all   3/3   19h
     ```
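The health check in step 2 can be scripted. The helper below is a sketch that reads an `oc get sts` listing on stdin and succeeds only when every statefulset reports all replicas ready; the `NAME READY AGE` column layout is an assumption based on the sample output above, and `sts_all_ready` is a hypothetical name.

```shell
# sts_all_ready: exit 0 only if every statefulset line on stdin reports
# READY as n/n (ready replicas == desired replicas); exit non-zero otherwise.
sts_all_ready() {
  awk '{ split($2, r, "/"); if (r[1] != r[2]) bad = 1 } END { exit bad }'
}

# Example against a live cluster (use --no-headers so the header row is
# not parsed as a statefulset):
# oc get sts --no-headers | grep wa-data-go | sts_all_ready && echo healthy
```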
An etcd operator script fails while upgrading watsonx Assistant 4.8.4
Applies to: 4.8.4
- Problem
- During the watsonx Assistant upgrade from Version 4.7.x to Version 4.8.4, the Ready status shows False and the ReadyReason shows InProgress for a long time:

  ```
  # oc get wd -n zen
  NAME   VERSION   READY   READYREASON   UPDATING   UPDATINGREASON   DEPLOYED   VERIFIED   QUIESCE        DATASTOREQUIESCE   AGE
  wd     4.8.4     False   InProgress    True       VerifyWait       11/23      10/23      NOT_QUIESCED   NOT_QUIESCED       2d17h
  ```

  Also, an error message similar to one of the following is displayed:

  ```
  "msg": "An unhandled exception occurred while templating '{{ q('etcd_member', cluster_host= etcd_cluster_name + '-client.' + etcd_namespace + '.svc', cluster_port=etcd_client_port, ca_cert=tls_directory + '/etcd-ca.crt', cert_cert=tls_directory + '/etcd-client.crt', cert_key=tls_directory + '/etcd-client.key') }}'. Error was a <class 'ansible.errors.AnsibleError'>, original message: An unhandled exception occurred while running the lookup plugin 'etcd_member'. Error was a <class 'ansible.errors.AnsibleError'>, original message: Unable to fetch members. Error: 'Client' object has no attribute 'server_version_sem'. Unable to fetch members. Error: 'Client' object has no attribute 'server_version_sem'"
  ```

  Symptom:

  ```
  TASK [etcdcluster : Enable authentication when secure client] ******************
  task path: /opt/ansible/roles/etcdcluster/tasks/reconcile_pods.yaml:246
  /usr/local/lib/python3.8/site-packages/etcd3/baseclient.py:97: Etcd3Warning: cannot detect etcd server version
  1. maybe is a network problem, please check your network connection
  2. maybe your etcd server version is too low, required: 3.2.2+
  warnings.warn(Etcd3Warning("cannot detect etcd server version\n"
  fatal: [localhost]: FAILED! => {
      "msg": "An unhandled exception occurred while running the lookup plugin 'etcd_auth'. Error was a <class 'ansible.errors.AnsibleError'>, original message: Enabling authentication failed. Error: 'Client' object has no attribute 'server_version_sem'"
  }
  ```
- Cause
- A script in the etcd operator that sets up authentication might fail. When it fails, the etcd operator does not deploy with authentication: enabled in the etcdcluster CR. This failure stops other components in the service from being upgraded and verified.
- Solution
- Attempt to re-execute the etcd operator by re-creating the etcdcluster CR:

  1. Get the name of the service etcdcluster (or the name of the etcd cluster in your deployment):

     ```
     oc get etcdcluster | grep etcd
     ```
  2. Delete the CR to allow the etcd operator to re-execute its tasks:

     ```
     oc delete etcdcluster <cluster>
     ```
  3. Wait until the etcdcluster and etcd pods are re-created.
  4. Check the status of Ready, Deployed, and Verified to make sure that the upgrade is successful:

     ```
     # oc get wd
     NAME   VERSION   READY   READYREASON   UPDATING   UPDATINGREASON   DEPLOYED   VERIFIED   QUIESCE        DATASTOREQUIESCE   AGE
     wd     4.8.4     True    Stable        False      Stable           23/23      23/23      NOT_QUIESCED   NOT_QUIESCED       3d6h
     ```
DataGovernor Elasticsearch pods in CrashLoopBackOff status
Applies to: 4.8.x
- Problem
- The Elasticsearch pods of Data Governor go to the CrashLoopBackOff status after the watsonx Assistant upgrade.
- Solution
- To fix the issue, do the following steps:

  1. Add the pause annotation to Data Governor:

     ```
     oc annotate dg wa-data-governor pause-reconciliation="true" --overwrite
     ```
  2. Edit wa-data-governor:

     ```
     oc edit elasticsearchcluster wa-data-governor
     ```
  3. Add the following lines in the spec section of wa-data-governor and restart the Elasticsearch pods:

     ```
     spec:
       livenessProbe:
         failureThreshold: 5
         initialDelaySeconds: 600
         periodSeconds: 30
         successThreshold: 1
         timeoutSeconds: 300
       readinessProbe:
         failureThreshold: 5
         initialDelaySeconds: 600
         periodSeconds: 30
         successThreshold: 3
         timeoutSeconds: 300
     ```
  4. Check the status of wa-data-governor:

     ```
     oc get sts | grep wa-data-go
     wa-data-go-d7db-ib-6fb9-es-server-all   3/3   19h
     ```
  5. Scale the Elasticsearch statefulset to 0 when the first pod is Terminating:

     ```
     oc scale sts wa-data-go-d7db-ib-6fb9-es-server-all --replicas=0
     ```
  6. Scale the Elasticsearch statefulset back to 3:

     ```
     oc scale sts wa-data-go-d7db-ib-6fb9-es-server-all --replicas=3
     ```
  7. Remove the pause annotation from Data Governor after all three Elasticsearch nodes come live and are no longer in the CrashLoopBackOff status:

     ```
     oc annotate dg wa-data-governor pause-reconciliation="false" --overwrite
     ```
  8. Restart the Elasticsearch pods.

     Note: After the restart, you can expect the Elasticsearch pods to crash a few times before they become live again. Therefore, you must monitor the Elasticsearch pods to make sure that they are healthy and stable.
DataGovernor does not recover after shutdown and restart
Applies to: 4.8.0 or 4.8.2
- Problem
- Data Governor does not recover after shutdown or restart, and the Kafka CR shows the following error message:

  ```
  oc get kafka wa-data-governor-kafka -o yaml
  conditions:
  - lastTransitionTime: "2024-01-16T19:53:33.655488407Z"
    message: 'Failure executing: PATCH at: https://172.30.0.1:443/api/v1/namespaces/cpd/persistentvolumeclaims/data-wa-data-governor-kafka-zookeeper-0.
      Message: PersistentVolumeClaim "data-wa-data-governor-kafka-zookeeper-0" is invalid:
      spec.resources.requests.storage: Forbidden: field can not be less than previous value.
      Received status: Status(apiVersion=v1, code=422, details=StatusDetails(causes=[StatusCause(field=spec.resources.requests.storage,
      message=Forbidden: field can not be less than previous value, reason=FieldValueForbidden,
      additionalProperties={})], group=null, kind=PersistentVolumeClaim, name=data-wa-data-governor-kafka-zookeeper-0,
      retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=PersistentVolumeClaim
      "data-wa-data-governor-kafka-zookeeper-0" is invalid: spec.resources.requests.storage:
      Forbidden: field can not be less than previous value, metadata=ListMeta(_continue=null,
      remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}),
      reason=Invalid, status=Failure, additionalProperties={}).'
    reason: KubernetesClientException
    status: "True"
    type: NotReady
  ```
- Solution
- To fix the issue, run the following command:

  ```
  cat <<EOF | oc apply -f -
  apiVersion: assistant.watson.ibm.com/v1
  kind: TemporaryPatch
  metadata:
    name: wa-data-governor-override-custom-resource
  spec:
    apiVersion: assistant.watson.ibm.com/v1
    kind: WatsonAssistant
    name: wa
    patch:
      data-governor:
        datagovernor:
          spec:
            overrideCustomResource: wa-data-exhaust-override
    patchType: patchStrategicMerge
  EOF
  ```
One or more watsonx Assistant pods go to the ContainerStatusUnknown state
Applies to: 4.8.x
- Problem
- Some watsonx Assistant pods go to the ContainerStatusUnknown state.
- Solution
- You can delete the pods in the ContainerStatusUnknown state by doing the following steps:

  1. Get the instance name and set the INSTANCE variable to that name:

     ```
     export INSTANCE=`oc get wa -n ${PROJECT_CPD_INST_OPERANDS} | grep -v NAME | awk '{print $1}'`
     ```
  2. Get the pods that are in the ContainerStatusUnknown state:

     ```
     oc get pods | grep ${INSTANCE}- | grep ContainerStatusUnknown
     ```
  3. Delete each pod that is in the ContainerStatusUnknown state individually:

     ```
     oc delete pod <unknown-state-pod>
     ```
  4. Confirm that there are no more pods in the ContainerStatusUnknown state:

     ```
     oc get pods | grep ${INSTANCE}- | grep ContainerStatusUnknown
     ```
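When many pods are affected, deleting them one by one is tedious. The sketch below separates the filtering step into a testable helper so the pod names can be fed to `oc delete pod` in bulk; `pods_in_state` is a hypothetical name, and it assumes the default `oc get pods` columns (NAME READY STATUS RESTARTS AGE).

```shell
# pods_in_state STATE: print the names of pods on stdin whose STATUS column
# matches STATE. Assumes default `oc get pods` column order, where the
# status is the third column.
pods_in_state() {
  awk -v s="$1" '$3 == s { print $1 }'
}

# Against a live cluster, a bulk-delete pipeline might look like this
# (review the list before deleting):
# oc get pods | grep "${INSTANCE}-" | pods_in_state ContainerStatusUnknown \
#   | xargs -r oc delete pod
```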
EDB Postgres cluster in bad state
Applies to: Any release
- Problem
- The watsonx Assistant EDB Postgres cluster is unhealthy.
- Solution
- Complete the following steps to recover the watsonx Assistant EDB Postgres cluster. The steps use the cluster name for Version 4.8.8 and later (wa-postgres-16). For versions earlier than 4.8.8, run the same commands with the cluster name wa-postgres and the pod names wa-postgres-1, wa-postgres-2, and wa-postgres-3 (for example, oc cnp status wa-postgres and oc cnp destroy wa-postgres 1).

  1. Install the CloudNativePG (CNP) plugin for EnterpriseDB:

     ```
     curl -sSfL \
       https://github.com/EnterpriseDB/kubectl-cnp/raw/main/install.sh | \
       sudo sh -s -- -b /usr/local/bin
     ```
  2. Query the status of wa-postgres-16 to get the details of the EDB Postgres cluster, such as Name, Namespace, Primary instance, and Status:

     ```
     oc cnp status wa-postgres-16
     Cluster Summary
     Name:                wa-postgres-16
     Namespace:           zen
     PostgreSQL Image:    icr.io/cpopen/edb/postgresql:12.16-4.18.0-amd64@sha256:93801fc5f515ede1b243c15ca79956a100143ce21d39d92bf85ebaff99a9dbd9
     Primary instance:    wa-postgres-16-3
     Primary start time:  2023-12-13 12:24:06 +0000 UTC (uptime 5s)
     Status:              Failing over  Failing over from wa-postgres-16-2 to wa-postgres-16-3
     Instances:           3
     Ready instances:     0
     ```
  3. Check the pods that are failing:

     ```
     oc get pods | grep wa-postgres-16
     ```
  4. If wa-postgres-16-1 starts, destroy it, because wa-postgres-16-3 is the primary instance:

     ```
     oc cnp destroy wa-postgres-16 1
     ```
  5. If wa-postgres-16-2 starts, destroy it, because wa-postgres-16-3 is the primary instance:

     ```
     oc cnp destroy wa-postgres-16 2
     ```
  6. After wa-postgres-16-3 starts, the EnterpriseDB operator controller re-creates the two standby pods.
  7. Query the status of wa-postgres-16 again:

     ```
     oc cnp status wa-postgres-16
     Cluster Summary
     Name:                wa-postgres-16
     Namespace:           cpd-instance
     System ID:           7311733145743429658
     PostgreSQL Image:    icr.io/cpopen/edb/postgresql:12.16-4.18.0-amd64@sha256:93801fc5f515ede1b243c15ca79956a100143ce21d39d92bf85ebaff99a9dbd9
     Primary instance:    wa-postgres-16-3
     Primary start time:  2023-12-12 16:59:47 +0000 UTC (uptime 20h48m52s)
     Status:              Cluster in healthy state   <---- back in a healthy state
     Instances:           3
     Ready instances:     3
     ```
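For automation, the final verification can be reduced to a grep over the plugin output. This is a sketch: the "Cluster in healthy state" string is taken from the sample output above and may differ across plugin versions, and `cluster_healthy` is a hypothetical helper name.

```shell
# cluster_healthy: succeed when the `oc cnp status` output on stdin reports
# a healthy cluster (matches the Status line shown in the sample above).
cluster_healthy() {
  grep -q 'Cluster in healthy state'
}

# Example (Version 4.8.8 and later; use wa-postgres on earlier releases):
# oc cnp status wa-postgres-16 | cluster_healthy && echo "recovered"
```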
The ETCD pods of watsonx Assistant not re-created after restarting the cluster
Applies to: Upgrades to 4.8.1
- Problem
- When you do an unplanned restart of the Kubernetes cluster that runs watsonx Assistant, the ETCD pods remain in the ContainerCreating status:

  ```
  wa-etcd-0   0/1   ContainerCreating   0
  ```

  In addition, when you run oc describe on the pod, you see the following message in the event logs:

  ```
  attachdetach-controller  Multi-Attach error for volume "pvc-af57751f-245e-4e7e-8ec1-84aa6f7bac15"
  Volume is already exclusively attached to one node and can't be attached to another.
  ```

  The ETCD pods of watsonx Assistant remain in the ContainerCreating status because of the Multi-Attach error for the PersistentVolume (PV). When you restart the cluster, the PersistentVolumeClaim (PVC) tries to attach a new worker node in place of the existing worker node in the PV. Because the PV has only ReadWriteOnce permission, it denies the reattachment, which causes the Multi-Attach error for volume... message.
- Solution
- To fix the issue, do the following steps:

  1. Scale the replicas of the wa-etcd statefulset to 0:

     ```
     oc scale sts wa-etcd --replicas=0
     ```
  2. Delete the PVC that remains in use to free it to attach to a new pod:

     ```
     oc delete pvc data-wa-etcd-0
     ```
  3. Scale the statefulset back to three replicas:

     ```
     oc scale sts wa-etcd --replicas=3
     ```
  4. Check that the wa-etcd pods return to the Running state:

     ```
     oc get pods | grep wa-etcd
     wa-etcd-0   1/1   Running   0   3m53s
     wa-etcd-1   1/1   Running   0   3m25s
     wa-etcd-2   1/1   Running   0   2m55s
     ```
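Step 4 can be turned into a wait loop instead of repeated manual checks. The helper below is a sketch that counts the wa-etcd pods on stdin that are not yet Running; `count_not_running` is a hypothetical name, and the column positions assume the default `oc get pods` output shown above.

```shell
# count_not_running: count the wa-etcd pods on stdin whose STATUS column
# (third field) is anything other than Running.
count_not_running() {
  awk '/^wa-etcd/ && $3 != "Running" { n++ } END { print n + 0 }'
}

# A minimal wait loop against a live cluster might look like:
# until [ "$(oc get pods --no-headers | count_not_running)" -eq 0 ]; do
#   sleep 10
# done
```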
Elasticsearch store not getting cleaned up on upgrade
Applies to: Upgrades to 4.8.0
- Problem
- The Elasticsearch cluster that is used by the store microservice is not cleaned up after you upgrade watsonx Assistant to Version 4.8.0.
- Solution
- Clean up the Elasticsearch cluster store after upgrading to watsonx Assistant Version 4.8.0:

  1. Export the PROJECT_CPD_INST_OPERANDS variable as the project where Cloud Pak for Data and watsonx Assistant are installed:

     ```
     export PROJECT_CPD_INST_OPERANDS=<namespace where Cloud Pak for Data and Assistant is installed>
     ```
  2. Export the name of your watsonx Assistant instance as an environment variable:

     ```
     export INSTANCE=`oc get wa -n ${PROJECT_CPD_INST_OPERANDS} | grep -v NAME | awk '{print $1}'`
     ```
  3. Check the name of the Elasticsearch cluster:

     ```
     oc get elasticsearchcluster -n ${PROJECT_CPD_INST_OPERANDS}
     ```
  4. If ${INSTANCE}-es-store is listed, delete the store:

     ```
     oc delete elasticsearchcluster ${INSTANCE}-es-store -n ${PROJECT_CPD_INST_OPERANDS}
     ```
  5. Confirm the deletion of ${INSTANCE}-es-store:

     ```
     oc get elasticsearchcluster -n ${PROJECT_CPD_INST_OPERANDS}
     ```
Data Governor shutdown issue
Applies to: 4.8.0 and later
- Problem
- When you shut down, Data Governor does not stop the running pods in some Deployments.
- Solution
- Remove the Deployments that have running pods:

  1. Run the following command to list the Deployments that are safe to delete, where INSTANCE_NAME is the name of the watsonx Assistant installation:

     ```
     oc get deploy -l squad=data-exhaust,app=${INSTANCE_NAME:-wa}
     ```
  2. Manually delete each Deployment that has active pods:

     ```
     oc delete deploy -l squad=data-exhaust,app=${INSTANCE_NAME:-wa}
     ```

     Data Governor re-creates the deleted Deployments.
EDB Postgres connection errors when max connections reached
Applies to: 4.7.0 and later
- Problem
- When the watsonx Assistant installation has too many instances that are used by parallel users (over 10,000), the maximum number of concurrent Postgres connections is reached, which can cause errors.
- Solution
- Increase the maximum number of concurrent EDB Postgres connections by applying two temporary patches that modify the EDB Postgres max_connections setting.
  Before applying both patches, replace the PROJECT_CPD_INST_OPERANDS namespace with the namespace where watsonx Assistant is installed.
  1. Export the PROJECT_CPD_INST_OPERANDS variable as the project where Cloud Pak for Data and watsonx Assistant are installed:
     export PROJECT_CPD_INST_OPERANDS=<namespace where Cloud Pak for Data and Assistant is installed>
  2. Run the following commands to apply both patches:
     cat <<EOF | oc apply -f -
     apiVersion: assistant.watson.ibm.com/v1
     kind: TemporaryPatch
     metadata:
       name: wa-postgres-max-connections-1
       namespace: ${PROJECT_CPD_INST_OPERANDS}
       labels:
         type: critical-configuration
     spec:
       apiVersion: assistant.watson.ibm.com/v1
       kind: WatsonAssistantStore
       name: wa
       patch:
         postgres:
           postgres:
             spec:
               postgresql:
                 parameters:
                   max_connections: "200"
       patchType: patchStrategicMerge
     EOF

     cat <<EOF | oc apply -f -
     apiVersion: assistant.watson.ibm.com/v1
     kind: TemporaryPatch
     metadata:
       name: wa-postgres-max-connections-2
       namespace: ${PROJECT_CPD_INST_OPERANDS}
       labels:
         type: critical-configuration
     spec:
       apiVersion: assistant.watson.ibm.com/v1
       kind: WatsonAssistant
       name: wa
       patch:
         postgres:
           postgres:
             spec:
               postgresql:
                 parameters:
                   max_connections: "200"
       patchType: patchStrategicMerge
     EOF
  3. Only for Versions 4.8.8 and later: After twenty minutes, run the following commands to verify the change:
     oc exec -it wa-postgres-16-1 sh
     sh-4.4$ psql -U postgres
     psql (12.14)
     Type "help" for help.
     postgres=# show max_connections;
      max_connections
     -----------------
      200
     (1 row)
  4. For versions earlier than 4.8.8: After twenty minutes, run the following commands to verify the change:
     oc exec -it wa-postgres-1 sh
     sh-4.4$ psql -U postgres
     psql (12.14)
     Type "help" for help.
     postgres=# show max_connections;
      max_connections
     -----------------
      200
     (1 row)
  5. Check that both patches have the type: critical-configuration label so that they are not deleted when you upgrade watsonx Assistant:
     oc get temporarypatch --show-labels
     NAME                            READY   READYREASON   UPDATING   UPDATINGREASON   DEPLOYED   VERIFIED   AGE   LABELS
     wa-postgres-max-connections-1                                                                           16h   type=critical-configuration
     wa-postgres-max-connections-2                                                                           14h   type=critical-configuration
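Because the verification pod name differs by version (wa-postgres-16-1 on 4.8.8 and later, wa-postgres-1 before that), a small helper can pick the right one. This is a hypothetical sketch, not part of the product tooling:

```shell
# Hypothetical helper: print the Postgres pod to exec into for a given
# watsonx Assistant version, using sort -V for dotted-version comparison.
pg_pod_for_version() {
  if [ "$(printf '%s\n' "4.8.8" "$1" | sort -V | head -n1)" = "4.8.8" ]; then
    echo "wa-postgres-16-1"   # version is 4.8.8 or later
  else
    echo "wa-postgres-1"      # version is earlier than 4.8.8
  fi
}

pg_pod_for_version "4.8.9"
pg_pod_for_version "4.8.4"
```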
watsonx Assistant Redis pods not starting because quota is applied to the namespace
Applies to: 4.7.0 and later
- Problem
- Redis pods fail to start due to the following error:
  Warning  FailedCreate  51s (x15 over 2m22s)  statefulset-controller  create Pod c-wa-redis-m-0 in StatefulSet c-wa-redis-m failed error: pods "c-wa-redis-m-0" is forbidden: failed quota: cpd-quota: must specify limits.cpu,limits.memory,requests.cpu,requests.memory
- Cause
- Redis pods cannot start if a quota is applied to the namespace but a limitrange is not set. Because the Redis init containers do not have limits.cpu, limits.memory, requests.cpu, or requests.memory set, an error occurs.
- Solution
- Apply a limit range with defaults for limits and requests. Change the namespace in the following YAML to the namespace where Cloud Pak for Data is installed:
  apiVersion: v1
  kind: LimitRange
  metadata:
    name: cpu-resource-limits
    namespace: zen   # Change it to the namespace where CPD is installed
  spec:
    limits:
    - default:
        cpu: 300m
        memory: 200Mi
      defaultRequest:
        cpu: 200m
        memory: 200Mi
      type: Container
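One way to apply the limit range above is to save it to a file first. This is a minimal sketch: the file path is arbitrary, and the oc command is only echoed so that you can check the namespace before applying.

```shell
# Write the LimitRange manifest to a temporary file, then show the apply command.
cat > /tmp/cpu-resource-limits.yaml <<'EOF'
apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-resource-limits
  namespace: zen   # Change to the namespace where Cloud Pak for Data is installed
spec:
  limits:
  - default:
      cpu: 300m
      memory: 200Mi
    defaultRequest:
      cpu: 200m
      memory: 200Mi
    type: Container
EOF
echo "oc apply -f /tmp/cpu-resource-limits.yaml"
```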
watsonx Assistant Redis pods not running after cluster restart
Applies to: 4.7.0 and later
- Problem
- watsonx Assistant pods do not restart successfully after the cluster is restarted.
- Cause
- When the cluster is restarted, Redis does not restart properly. This issue prevents watsonx Assistant from restarting successfully.
- Solution
-
  1. Get the instance name by running oc get wa. Set the INSTANCE variable to that name.
  2. Get the unhealthy Redis pods:
     oc get pods | grep ${INSTANCE}- | grep redis | grep -v Running
  3. For each unhealthy Redis pod, restart the pod:
     oc delete pod <unhealthy-redis-pod>
  4. Confirm that there are no more unhealthy Redis pods:
     oc get pods | grep ${INSTANCE}- | grep redis | grep -v Running
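Steps 2-4 above can be read as one loop. In this sketch, get_pods is a hypothetical stand-in for the oc pipeline so that the logic is readable without a cluster, and the delete command is only echoed:

```shell
INSTANCE="${INSTANCE:-wa}"
get_pods() {
  # Stand-in for: oc get pods | grep "${INSTANCE}-" | grep redis | grep -v Running
  printf '%s\n' "c-${INSTANCE}-redis-m-0   0/1   CrashLoopBackOff"
}
# The first column is the pod name; delete (here: echo) each unhealthy pod.
UNHEALTHY="$(get_pods | awk '{print $1}')"
for pod in $UNHEALTHY; do
  echo "oc delete pod ${pod}"
done
```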
watsonx Assistant upgrade gets stuck at apply-cr
Applies to: 4.7.0 and later
- Problem
- The clu-training pods are in a Crashloopback state after apply-olm completes, watsonx Assistant hangs during the ibmcpd upgrade, or the upgrade hangs during the apply-cr command with the message:
  pre-apply-cr release patching (if any) for watson_assistant]
- Cause
- After apply-olm, ModelTrain pods might go into a bad state, causing the apply-cr command for watsonx Assistant Version 4.6.0 to stall.
- Solution
- Run the following commands after you run the apply-olm command:
  1. Export the name of your watsonx Assistant instance as an environment variable:
     export INSTANCE=`oc get wa -n ${PROJECT_CPD_INST_OPERANDS} | grep -v NAME | awk '{print $1}'`
  2. Re-create the ModelTrain training job:
     # This first command may take some time to complete. Increase the CPU allocation
     # to the ModelTrain operator if the command does not complete in a few minutes.
     oc delete modeltraindynamicworkflows.modeltrain.ibm.com ${INSTANCE}-dwf
     oc delete pvc -l release=${INSTANCE}-dwf-ibm-mt-dwf-rabbitmq
     oc delete deploy ${INSTANCE}-clu-training-${INSTANCE}-dwf
     oc delete secret/${INSTANCE}-clu-training-secret job/${INSTANCE}-clu-training-create job/${INSTANCE}-clu-training-update secret/${INSTANCE}-dwf-ibm-mt-dwf-server-tls-secret secret/${INSTANCE}-dwf-ibm-mt-dwf-client-tls-secret
     oc delete secret registry-${INSTANCE}-clu-training-${INSTANCE}-dwf-training
     Expect it to take at least 30 minutes for the new training job to take effect and the status to change to Completed.
watsonx Assistant upgrade gets stuck at apply-cr or training does not work after the upgrade completes successfully
Applies to: 4.7.0 and later
- Problem
- The etcdclusters custom resource wa-etcd shows the following patching error:
  "Failed to patch object: b''{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"StatefulSet.apps \\"wa-etcd\\" is invalid:"
- Solution
- To check whether there is an error in the etcdcluster:
  1. Export the name of your watsonx Assistant instance as an environment variable:
     export INSTANCE=`oc get wa -n ${PROJECT_CPD_INST_OPERANDS} | grep -v NAME | awk '{print $1}'`
  2. Describe the etcdcluster custom resource:
     oc describe etcdcluster ${INSTANCE}-etcd
  Complete the following steps to fix the issue:
  1. Export the name of your watsonx Assistant instance as an environment variable:
     export INSTANCE=`oc get wa -n ${PROJECT_CPD_INST_OPERANDS} | grep -v NAME | awk '{print $1}'`
  2. Delete the following components:
     oc delete job ${INSTANCE}-create-slot-job
     Expect it to take at least 20 minutes for the etcdcluster custom resource to come back and become healthy. To confirm, enter oc get etcdclusters and ensure that you get output similar to:
     NAME      AGE
     wa-etcd   120m
  3. Re-create the clu subsystem component:
     oc delete clu ${INSTANCE}
     Expect it to take at least 20 minutes for the clu subsystem to come back and become healthy. To confirm, enter oc get wa and ensure that you get output similar to:
     NAME   VERSION   READY   READYREASON   UPDATING   UPDATINGREASON   DEPLOYED   VERIFIED   AGE
     wa     4.6.0     True    Stable        False      Stable           18/18      18/18      12h
Red Hat OpenShift upgrade hangs because some watsonx Assistant pods do not quiesce
Applies to: 4.7.0 and later
Fixed in: 4.8.6
- Problem
- Some watsonx Assistant pods do not quiesce and might not automatically drain, causing the Red Hat OpenShift upgrade to pause.
- Cause
- The quiesce capability is not fully supported by watsonx Assistant, so some of the pods continue to run.
- Solution
- watsonx Assistant quiesce is optional when upgrading the Red Hat OpenShift cluster. Monitor the node that is being upgraded for any pods that do not drain automatically and cause the upgrade to hang. To allow the Red Hat OpenShift upgrade to continue, delete any pod that is not draining so that it can be rescheduled on another node.
Increasing backup storage for wa-store-cronjob pods that run out of space
Applies to: 4.7.0 and later
- Problem
- The nightly scheduled wa-store-cronjob pods eventually fail with No space left on device.
- Cause
- The size of the cronjob backup PVC needs to be increased.
- Solution
- Increase the size of the backup storage from 1Gi to 2Gi:
  1. Edit the CR:
     oc edit wa wa
  2. In the configOverrides section, add:
     store_db:
       backup:
         size: 2Gi
  After 5-10 minutes, the PVC should be resized. If the problem persists, increase the size to a larger value, for example, 4Gi.
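The same configOverrides change can be made non-interactively with oc patch. This is a sketch that assumes configOverrides sits under .spec in the wa custom resource (verify the field path against your CR first); the command is only echoed here.

```shell
# Build the merge patch for store_db.backup.size and show the oc patch command.
# The .spec.configOverrides path is an assumption -- check it with: oc get wa wa -o yaml
PATCH='{"spec":{"configOverrides":{"store_db":{"backup":{"size":"2Gi"}}}}}'
echo "oc patch wa wa --type=merge -p '${PATCH}'"
```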
Inaccurate status message from command line after upgrade
- Diagnosing the problem
- If you run the cpd-cli service-instance upgrade command from the Cloud Pak for Data command-line interface, and then use the service-instance list command to check the status of each service, the provision status for the service is listed as UPGRADE_FAILED.
- Cause of the problem
- When you upgrade the service, only the cpd-cli manage apply-cr command is supported. You cannot use the cpd-cli service-instance upgrade command to upgrade the service. After you upgrade the service with the apply-cr method, the change in version and status is not recognized by the service-instance command. However, the correct version is displayed in the Cloud Pak for Data web client.
- Resolving the problem
- No action is required. If you use the cpd-cli manage apply-cr method to upgrade the service as documented, the upgrade is successful and you can ignore the version and status information that is generated by the cpd-cli service-instance list command.
ModelTrain or clu-training pods may not get healthy when upgrading or after installation
Applies to: 4.7.0 and later
- Problem
- During the watsonx Assistant upgrade or installation, one or more pods that are listed by oc get pods | grep wa-dwf do not become healthy.
- Solution
- Note: Perform the following procedures only with the help of the technical support team.