Known issues and limitations for watsonx Assistant

Important: IBM Cloud Pak® for Data Version 4.8 will reach end of support (EOS) on 31 July 2025. For more information, see the Discontinuance of service announcement for IBM Cloud Pak for Data Version 4.X.

Upgrade to IBM Software Hub Version 5.1 before IBM Cloud Pak for Data Version 4.8 reaches end of support. For more information, see Upgrading from IBM Cloud Pak for Data Version 4.8 to IBM Software Hub Version 5.1.

The following known issues and limitations apply to watsonx Assistant.

For a complete list of known issues and troubleshooting information for all versions of watsonx Assistant, see Troubleshooting known issues. For a complete list of known issues for Cloud Pak for Data, see Limitations and known issues in Cloud Pak for Data.

Activity logs do not show up in a new watsonx Assistant installation

Applies to: Any release

Problem
Activity logs are not displayed because the EDB Postgres database does not have partitions for the year 2025.
Solution
  1. Get the database password.

    For Versions 4.8.8 and later:

    oc get secret wa-postgres-16-admin-auth -o jsonpath='{.data.password}' | base64 -d && echo
    For Versions earlier than 4.8.8:
    oc get secret wa-postgres-admin-auth -o jsonpath='{.data.password}' | base64 -d && echo
  2. Exec into one of the EDB Postgres pods.

    For Versions 4.8.8 and later:

    oc rsh wa-postgres-16-1
    For Versions earlier than 4.8.8:
    oc rsh wa-postgres-1
  3. Connect to the database.

    For Versions 4.8.8 and later:

    sh-5.1$ psql -h wa-postgres-16-rw.cpd.svc -d conversation_pprd_wa -U postgres
    Password for user postgres:
    psql (12.18)
    SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)
    Type "help" for help.
    For Versions earlier than 4.8.8:
    sh-5.1$ psql -h wa-postgres-rw.cpd.svc -d conversation_pprd_wa -U postgres
    Password for user postgres:
    psql (12.18)
    SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)
    Type "help" for help.
  4. Verify event_log table schema.
    conversation_pprd_wa=# \d event_log
                                    Partitioned table "public.event_log"
           Column        |            Type             | Collation | Nullable |         Default
    ---------------------+-----------------------------+-----------+----------+--------------------------
     event_name          | text                        |           | not null |
     event_method        | activity_method             |           | not null |
     event_resource      | activity_resource           |           | not null |
     domain              | text                        |           |          |
     status              | integer                     |           |          |
     description         | text                        |           |          |
     modified            | timestamp without time zone |           | not null | now()
     crn_id              | text                        |           |          |
     request_id          | uuid                        |           |          |
     request_uri         | text                        |           |          |
     instance_id         | uuid                        |           | not null |

    If the event_log table shows the request_id or instance_id column with type uuid, change the column type to text by using the following commands.

    conversation_pprd_wa=# ALTER TABLE event_log ALTER column request_id TYPE text;
    ALTER TABLE
    conversation_pprd_wa=# ALTER TABLE event_log ALTER column instance_id TYPE text;
    ALTER TABLE
  5. Create the following partitions for December 2024 and the year 2025.
    -- december 2024
    CREATE TABLE IF NOT EXISTS event_logs_y2024m12w1 PARTITION OF event_log 
      FOR VALUES FROM ('2024-12-01T00:00:00.000Z') TO ('2024-12-08T00:00:00.000Z');
    
    CREATE TABLE IF NOT EXISTS event_logs_y2024m12w2 PARTITION OF event_log 
      FOR VALUES FROM ('2024-12-08T00:00:00.000Z') TO ('2024-12-15T00:00:00.000Z');
    
    CREATE TABLE IF NOT EXISTS event_logs_y2024m12w3 PARTITION OF event_log
      FOR VALUES FROM ('2024-12-15T00:00:00.000Z') TO ('2024-12-22T00:00:00.000Z');
    
    CREATE TABLE IF NOT EXISTS event_logs_y2024m12w4 PARTITION OF event_log
      FOR VALUES FROM ('2024-12-22T00:00:00.000Z') TO ('2025-01-01T00:00:00.000Z');
    
    
    -- january 2025
    CREATE TABLE IF NOT EXISTS event_logs_y2025m01w1 PARTITION OF event_log 
      FOR VALUES FROM ('2025-01-01T00:00:00.000Z') TO ('2025-01-08T00:00:00.000Z');
    
    CREATE TABLE IF NOT EXISTS event_logs_y2025m01w2 PARTITION OF event_log 
      FOR VALUES FROM ('2025-01-08T00:00:00.000Z') TO ('2025-01-15T00:00:00.000Z');
    
    CREATE TABLE IF NOT EXISTS event_logs_y2025m01w3 PARTITION OF event_log 
      FOR VALUES FROM ('2025-01-15T00:00:00.000Z') TO ('2025-01-22T00:00:00.000Z');
    
    CREATE TABLE IF NOT EXISTS event_logs_y2025m01w4 PARTITION OF event_log 
      FOR VALUES FROM ('2025-01-22T00:00:00.000Z') TO ('2025-02-01T00:00:00.000Z');
    
    -- february 2025
    CREATE TABLE IF NOT EXISTS event_logs_y2025m02w1 PARTITION OF event_log 
      FOR VALUES FROM ('2025-02-01T00:00:00.000Z') TO ('2025-02-08T00:00:00.000Z');
    
    CREATE TABLE IF NOT EXISTS event_logs_y2025m02w2 PARTITION OF event_log 
      FOR VALUES FROM ('2025-02-08T00:00:00.000Z') TO ('2025-02-15T00:00:00.000Z');
    
    CREATE TABLE IF NOT EXISTS event_logs_y2025m02w3 PARTITION OF event_log 
      FOR VALUES FROM ('2025-02-15T00:00:00.000Z') TO ('2025-02-22T00:00:00.000Z');
    
    CREATE TABLE IF NOT EXISTS event_logs_y2025m02w4 PARTITION OF event_log 
      FOR VALUES FROM ('2025-02-22T00:00:00.000Z') TO ('2025-03-01T00:00:00.000Z');
    
    --march 2025
    CREATE TABLE IF NOT EXISTS event_logs_y2025m03w1 PARTITION OF event_log 
      FOR VALUES FROM ('2025-03-01T00:00:00.000Z') TO ('2025-03-08T00:00:00.000Z');
    
    CREATE TABLE IF NOT EXISTS event_logs_y2025m03w2 PARTITION OF event_log 
      FOR VALUES FROM ('2025-03-08T00:00:00.000Z') TO ('2025-03-15T00:00:00.000Z');
    
    CREATE TABLE IF NOT EXISTS event_logs_y2025m03w3 PARTITION OF event_log 
      FOR VALUES FROM ('2025-03-15T00:00:00.000Z') TO ('2025-03-22T00:00:00.000Z');
    
    CREATE TABLE IF NOT EXISTS event_logs_y2025m03w4 PARTITION OF event_log 
      FOR VALUES FROM ('2025-03-22T00:00:00.000Z') TO ('2025-04-01T00:00:00.000Z');
    
    --april 2025
    CREATE TABLE IF NOT EXISTS event_logs_y2025m04w1 PARTITION OF event_log 
      FOR VALUES FROM ('2025-04-01T00:00:00.000Z') TO ('2025-04-08T00:00:00.000Z');
    
    CREATE TABLE IF NOT EXISTS event_logs_y2025m04w2 PARTITION OF event_log 
      FOR VALUES FROM ('2025-04-08T00:00:00.000Z') TO ('2025-04-15T00:00:00.000Z');
    
    CREATE TABLE IF NOT EXISTS event_logs_y2025m04w3 PARTITION OF event_log 
      FOR VALUES FROM ('2025-04-15T00:00:00.000Z') TO ('2025-04-22T00:00:00.000Z');
    
    CREATE TABLE IF NOT EXISTS event_logs_y2025m04w4 PARTITION OF event_log 
      FOR VALUES FROM ('2025-04-22T00:00:00.000Z') TO ('2025-05-01T00:00:00.000Z');
    
    --may 2025
    CREATE TABLE IF NOT EXISTS event_logs_y2025m05w1 PARTITION OF event_log 
      FOR VALUES FROM ('2025-05-01T00:00:00.000Z') TO ('2025-05-08T00:00:00.000Z');
    
    CREATE TABLE IF NOT EXISTS event_logs_y2025m05w2 PARTITION OF event_log 
      FOR VALUES FROM ('2025-05-08T00:00:00.000Z') TO ('2025-05-15T00:00:00.000Z');
    
    CREATE TABLE IF NOT EXISTS event_logs_y2025m05w3 PARTITION OF event_log 
      FOR VALUES FROM ('2025-05-15T00:00:00.000Z') TO ('2025-05-22T00:00:00.000Z');
    
    CREATE TABLE IF NOT EXISTS event_logs_y2025m05w4 PARTITION OF event_log 
      FOR VALUES FROM ('2025-05-22T00:00:00.000Z') TO ('2025-06-01T00:00:00.000Z');
    
    --june 2025
    CREATE TABLE IF NOT EXISTS event_logs_y2025m06w1 PARTITION OF event_log 
      FOR VALUES FROM ('2025-06-01T00:00:00.000Z') TO ('2025-06-08T00:00:00.000Z');
    
    CREATE TABLE IF NOT EXISTS event_logs_y2025m06w2 PARTITION OF event_log 
      FOR VALUES FROM ('2025-06-08T00:00:00.000Z') TO ('2025-06-15T00:00:00.000Z');
    
    CREATE TABLE IF NOT EXISTS event_logs_y2025m06w3 PARTITION OF event_log 
      FOR VALUES FROM ('2025-06-15T00:00:00.000Z') TO ('2025-06-22T00:00:00.000Z');
    
    CREATE TABLE IF NOT EXISTS event_logs_y2025m06w4 PARTITION OF event_log 
      FOR VALUES FROM ('2025-06-22T00:00:00.000Z') TO ('2025-07-01T00:00:00.000Z');
    
    --july 2025
    CREATE TABLE IF NOT EXISTS event_logs_y2025m07w1 PARTITION OF event_log 
      FOR VALUES FROM ('2025-07-01T00:00:00.000Z') TO ('2025-07-08T00:00:00.000Z');
    
    CREATE TABLE IF NOT EXISTS event_logs_y2025m07w2 PARTITION OF event_log 
      FOR VALUES FROM ('2025-07-08T00:00:00.000Z') TO ('2025-07-15T00:00:00.000Z');
    
    CREATE TABLE IF NOT EXISTS event_logs_y2025m07w3 PARTITION OF event_log 
      FOR VALUES FROM ('2025-07-15T00:00:00.000Z') TO ('2025-07-22T00:00:00.000Z');
    
    CREATE TABLE IF NOT EXISTS event_logs_y2025m07w4 PARTITION OF event_log 
      FOR VALUES FROM ('2025-07-22T00:00:00.000Z') TO ('2025-08-01T00:00:00.000Z');
    
    --august 2025
    CREATE TABLE IF NOT EXISTS event_logs_y2025m08w1 PARTITION OF event_log 
      FOR VALUES FROM ('2025-08-01T00:00:00.000Z') TO ('2025-08-08T00:00:00.000Z');
    
    CREATE TABLE IF NOT EXISTS event_logs_y2025m08w2 PARTITION OF event_log 
      FOR VALUES FROM ('2025-08-08T00:00:00.000Z') TO ('2025-08-15T00:00:00.000Z');
    
    CREATE TABLE IF NOT EXISTS event_logs_y2025m08w3 PARTITION OF event_log 
      FOR VALUES FROM ('2025-08-15T00:00:00.000Z') TO ('2025-08-22T00:00:00.000Z');
    
    CREATE TABLE IF NOT EXISTS event_logs_y2025m08w4 PARTITION OF event_log 
      FOR VALUES FROM ('2025-08-22T00:00:00.000Z') TO ('2025-09-01T00:00:00.000Z');
    
    --september 2025
    CREATE TABLE IF NOT EXISTS event_logs_y2025m09w1 PARTITION OF event_log 
      FOR VALUES FROM ('2025-09-01T00:00:00.000Z') TO ('2025-09-08T00:00:00.000Z');
    
    CREATE TABLE IF NOT EXISTS event_logs_y2025m09w2 PARTITION OF event_log 
      FOR VALUES FROM ('2025-09-08T00:00:00.000Z') TO ('2025-09-15T00:00:00.000Z');
    
    CREATE TABLE IF NOT EXISTS event_logs_y2025m09w3 PARTITION OF event_log 
      FOR VALUES FROM ('2025-09-15T00:00:00.000Z') TO ('2025-09-22T00:00:00.000Z');
    
    CREATE TABLE IF NOT EXISTS event_logs_y2025m09w4 PARTITION OF event_log 
      FOR VALUES FROM ('2025-09-22T00:00:00.000Z') TO ('2025-10-01T00:00:00.000Z');
    
    --october 2025
    CREATE TABLE IF NOT EXISTS event_logs_y2025m10w1 PARTITION OF event_log 
      FOR VALUES FROM ('2025-10-01T00:00:00.000Z') TO ('2025-10-08T00:00:00.000Z');
    
    CREATE TABLE IF NOT EXISTS event_logs_y2025m10w2 PARTITION OF event_log 
      FOR VALUES FROM ('2025-10-08T00:00:00.000Z') TO ('2025-10-15T00:00:00.000Z');
    
    CREATE TABLE IF NOT EXISTS event_logs_y2025m10w3 PARTITION OF event_log 
      FOR VALUES FROM ('2025-10-15T00:00:00.000Z') TO ('2025-10-22T00:00:00.000Z');
    
    CREATE TABLE IF NOT EXISTS event_logs_y2025m10w4 PARTITION OF event_log 
      FOR VALUES FROM ('2025-10-22T00:00:00.000Z') TO ('2025-11-01T00:00:00.000Z');
    
    --november 2025
    CREATE TABLE IF NOT EXISTS event_logs_y2025m11w1 PARTITION OF event_log 
      FOR VALUES FROM ('2025-11-01T00:00:00.000Z') TO ('2025-11-08T00:00:00.000Z');
    
    CREATE TABLE IF NOT EXISTS event_logs_y2025m11w2 PARTITION OF event_log 
      FOR VALUES FROM ('2025-11-08T00:00:00.000Z') TO ('2025-11-15T00:00:00.000Z');
    
    CREATE TABLE IF NOT EXISTS event_logs_y2025m11w3 PARTITION OF event_log 
      FOR VALUES FROM ('2025-11-15T00:00:00.000Z') TO ('2025-11-22T00:00:00.000Z');
    
    CREATE TABLE IF NOT EXISTS event_logs_y2025m11w4 PARTITION OF event_log 
      FOR VALUES FROM ('2025-11-22T00:00:00.000Z') TO ('2025-12-01T00:00:00.000Z');
    
    --december 2025
    CREATE TABLE IF NOT EXISTS event_logs_y2025m12w1 PARTITION OF event_log 
      FOR VALUES FROM ('2025-12-01T00:00:00.000Z') TO ('2025-12-08T00:00:00.000Z');
    
    CREATE TABLE IF NOT EXISTS event_logs_y2025m12w2 PARTITION OF event_log 
      FOR VALUES FROM ('2025-12-08T00:00:00.000Z') TO ('2025-12-15T00:00:00.000Z');
    
    CREATE TABLE IF NOT EXISTS event_logs_y2025m12w3 PARTITION OF event_log 
      FOR VALUES FROM ('2025-12-15T00:00:00.000Z') TO ('2025-12-22T00:00:00.000Z');
    
    CREATE TABLE IF NOT EXISTS event_logs_y2025m12w4 PARTITION OF event_log 
      FOR VALUES FROM ('2025-12-22T00:00:00.000Z') TO ('2026-01-01T00:00:00.000Z');
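The 48 statements above follow a fixed weekly pattern (days 1, 8, 15, and 22, with week 4 ending on the first day of the next month). If you need partitions for another year, a small bash sketch can generate the same DDL instead of editing each statement by hand; the `emit_partitions` helper below is illustrative and not part of the product.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the weekly event_log partition DDL for one year,
# matching the pattern of the statements in the step above.
emit_partitions() {
  local year=$1 month w from to next
  for month in 01 02 03 04 05 06 07 08 09 10 11 12; do
    # The first day of the following month is the upper bound of week 4.
    if [ "$month" = "12" ]; then
      next="$((year + 1))-01-01"
    else
      next=$(printf '%d-%02d-01' "$year" $((10#$month + 1)))
    fi
    echo "-- ${year}-${month}"
    for w in 1 2 3 4; do
      case $w in
        1) from="${year}-${month}-01"; to="${year}-${month}-08" ;;
        2) from="${year}-${month}-08"; to="${year}-${month}-15" ;;
        3) from="${year}-${month}-15"; to="${year}-${month}-22" ;;
        4) from="${year}-${month}-22"; to="$next" ;;
      esac
      echo "CREATE TABLE IF NOT EXISTS event_logs_y${year}m${month}w${w} PARTITION OF event_log"
      echo "  FOR VALUES FROM ('${from}T00:00:00.000Z') TO ('${to}T00:00:00.000Z');"
    done
  done
}
```

For example, `emit_partitions 2026` prints the 48 statements for 2026, which you can then paste into the psql session from step 3.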
    

watsonx Assistant tfmm pods get stuck in Init:0/1 state

Applies to: 4.8.x

Problem
During the installation, watsonx Assistant tfmm pods get stuck in the Init:0/1 state, preventing the installation from completing. This issue occurs due to resource allocation problems for the model-upload init container.
Solution
To fix the issue, apply the following patch to adjust the resource limits and requests for the model-upload init container.
export PROJECT_CPD_INST_OPERANDS=<namespace where watsonx Assistant is installed>

export INSTANCE=`oc get wa -n ${PROJECT_CPD_INST_OPERANDS} |grep -v NAME| awk '{print $1}'`

cat <<EOF | oc apply -f -
apiVersion: assistant.watson.ibm.com/v1
kind: TemporaryPatch
metadata:
  name: wa-tfmm-init-container-fix
  namespace: ${PROJECT_CPD_INST_OPERANDS}
spec:
  apiVersion: assistant.watson.ibm.com/v1
  kind: WatsonAssistantCluRuntime
  name: ${INSTANCE}
  patchType: patchStrategicMerge
  patch:
    tfmm:
      deployment:
        spec:
          template:
            spec:
              initContainers:
              - name: model-upload
                resources:
                  limits: 
                    cpu: 5
                    memory: 512Mi
                  requests: 
                    cpu: 500m
                    memory: 512Mi
EOF

Upgrade from Version 4.8.0 causes one of the ETCD pods to go to the CrashLoopBackOff state

Applies to: 4.8.7

Problem
During the upgrade of watsonx Assistant from Version 4.8.0, one of the ETCD pods might go to the CrashLoopBackOff state. You might find panic: not implemented in the logs of the ETCD pod that is in the CrashLoopBackOff state.
panic: not implemented

goroutine 148 [running]:
go.etcd.io/etcd/etcdserver.(*applierV3backend).Apply(0xc0001ba160, 0xc001c2ea80, 0x0)
	/tmp/etcd-release-3.4.16/etcd/release/etcd/etcdserver/apply.go:173 +0x1019
go.etcd.io/etcd/etcdserver.(*authApplierV3).Apply(0xc0001b8050, 0xc001c2ea80, 0x0)
	/tmp/etcd-release-3.4.16/etcd/release/etcd/etcdserver/apply_auth.go:60 +0xd7
go.etcd.io/etcd/etcdserver.(*EtcdServer).applyEntryNormal(0xc0002de600, 0xc0007234d0)
	/tmp/etcd-release-3.4.16/etcd/release/etcd/etcdserver/server.go:2224 +0x1f0
go.etcd.io/etcd/etcdserver.(*EtcdServer).apply(0xc0002de600, 0xc000a25628, 0x51b, 0x578, 0xc0000ec140, 0xc0018c8070, 0x0, 0x10f7228)
	/tmp/etcd-release-3.4.16/etcd/release/etcd/etcdserver/server.go:2138 +0x596
go.etcd.io/etcd/etcdserver.(*EtcdServer).applyEntries(0xc0002de600, 0xc0000ec140, 0xc0002a8400)
	/tmp/etcd-release-3.4.16/etcd/release/etcd/etcdserver/server.go:1390 +0xd4
go.etcd.io/etcd/etcdserver.(*EtcdServer).applyAll(0xc0002de600, 0xc0000ec140, 0xc0002a8400)
	/tmp/etcd-release-3.4.16/etcd/release/etcd/etcdserver/server.go:1114 +0x88
go.etcd.io/etcd/etcdserver.(*EtcdServer).run.func8(0x1299c80, 0xc000212ec0)
	/tmp/etcd-release-3.4.16/etcd/release/etcd/etcdserver/server.go:1059 +0x3c
go.etcd.io/etcd/pkg/schedule.(*fifo).run(0xc00089e0c0)
	/tmp/etcd-release-3.4.16/etcd/release/etcd/pkg/schedule/schedule.go:157 +0xdb
created by go.etcd.io/etcd/pkg/schedule.NewFIFOScheduler
	/tmp/etcd-release-3.4.16/etcd/release/etcd/pkg/schedule/schedule.go:70 +0x13d
Solution
To fix the issue, restart the ETCD pod that is in the CrashLoopBackOff state.

watsonx Assistant upgrade gets stuck in the Verify state

Applies to: Upgrades to 4.8.6 from versions prior to 4.8.0

Problem
During the upgrade, watsonx Assistant gets stuck in the Verify state with Unprocessable Entity error messages in the etcd operator logs, which prevent the etcd pods from updating to the new 4.8.6 values.
Solution
To fix the issue, apply the following patch:
export INSTANCE=`oc get wa -n ${PROJECT_CPD_INST_OPERANDS} |grep -v NAME| awk '{print $1}'`
cat <<EOF | oc apply -f -
apiVersion: assistant.watson.ibm.com/v1
kind: TemporaryPatch
metadata:
  name: remove-icpdsupport-module-label
  namespace: ${PROJECT_CPD_INST_OPERANDS}
spec:
  apiVersion: assistant.watson.ibm.com/v1
  kind: WatsonAssistant
  name: ${INSTANCE}
  patchType: patchJson6902
  patch:
      etcd:
        cr:
          - op: remove
            path: /spec/labels/icpdsupport~1module
EOF

Workspace logs and Data Governor Elasticsearch pods go to CrashLoopBackOff status after upgrade

Applies to: Upgrades up to 4.8.4

Problem

The Workspace logs and Data Governor Elasticsearch pods go to the CrashLoopBackOff status after the watsonx Assistant upgrade, with the error Functional test failed at SCENARIO: V2 Health Check.

Solution
To fix the issue, do the following steps:
  1. Apply the following patch to increase the CPU request to 500m.
    cat <<EOF | oc apply -f -
    apiVersion: assistant.watson.ibm.com/v1
    kind: TemporaryPatch
    metadata:
      name: wa-data-governor
    spec:
      apiVersion: assistant.watson.ibm.com/v1
      kind: WatsonAssistant
      name: wa
      patch:
        data-governor:
          datagovernoroverride:
            spec:
              dependencies:
                elasticsearch:
                  esResources:
                    requests:
                      cpu: 500m
      patchType: patchStrategicMerge
    EOF
    Note: After the change, it takes approximately 15 minutes for the watsonx Assistant operator to update the CPU request.
  2. Check the health of wa-data-governor:
    oc get sts |grep wa-data-go
    wa-data-go-d7db-ib-6fb9-es-server-all   3/3     19h

An etcd operator script fails while upgrading watsonx Assistant 4.8.4

Applies to: 4.8.4

Problem
During the watsonx Assistant upgrade from Version 4.7.x to Version 4.8.4, the Ready status shows False and ReadyReason shows InProgress for a long time.

# oc get wd -n zen
NAME   VERSION   READY   READYREASON   UPDATING   UPDATINGREASON   DEPLOYED   VERIFIED   QUIESCE        DATASTOREQUIESCE   AGE
wd     4.8.4     False   InProgress    True       VerifyWait       11/23      10/23      NOT_QUIESCED   NOT_QUIESCED       2d17h
Also, an error message similar to one of the following is displayed:
"msg": "An unhandled exception occurred while templating '{{ q('etcd_member', cluster_host= etcd_cluster_name + '-client.'
 + etcd_namespace + '.svc', cluster_port=etcd_client_port, ca_cert=tls_directory + '/etcd-ca.crt', cert_cert=tls_directory + '/etcd-client.crt',
 cert_key=tls_directory + '/etcd-client.key') }}'. Error was a <class 'ansible.errors.AnsibleError'>, original message: An unhandled exception
 occurred while running the lookup plugin 'etcd_member'. Error was a <class 'ansible.errors.AnsibleError'>, original message: Unable to fetch
 members. Error: 'Client' object has no attribute 'server_version_sem'. Unable to fetch members. Error: 'Client' object has no attribute
 'server_version_sem'"
Symptom:
TASK [etcdcluster : Enable authentication when secure client] ******************
task path: /opt/ansible/roles/etcdcluster/tasks/reconcile_pods.yaml:246
/usr/local/lib/python3.8/site-packages/etcd3/baseclient.py:97: Etcd3Warning: cannot detect etcd server version
1. maybe is a network problem, please check your network connection
2. maybe your etcd server version is too low, required: 3.2.2+
 warnings.warn(Etcd3Warning("cannot detect etcd server version\n"
fatal: [localhost]: FAILED! => {
    "msg": "An unhandled exception occurred while running the lookup plugin 'etcd_auth'. Error was a <class 'ansible.errors.AnsibleError'>,
 original message: Enabling authentication failed. Error: 'Client' object has no attribute 'server_version_sem'"
}
Cause

A script in the etcd operator that sets authentication might fail. When it fails, the etcd operator does not deploy with authentication:enabled in the etcdcluster CR. This failure stops other components in the service from being upgraded and verified.

Solution
Re-run the etcd operator tasks by deleting and re-creating the etcdcluster CR.
  1. Get the name of the service etcdcluster.
    oc get etcdcluster | grep etcd <or name of the etcd cluster in the deployment>
  2. Delete the CR to allow the etcd operator to re-execute tasks.
    oc delete etcdcluster <cluster>
  3. Wait until the etcdcluster and etcd pods are re-created.
  4. Check the status of Ready, Deployed, and Verified to make sure that the upgrade is successful.
    # oc get wd
    NAME   VERSION   READY   READYREASON   UPDATING   UPDATINGREASON   DEPLOYED   VERIFIED   QUIESCE        DATASTOREQUIESCE   AGE
    wd     4.8.4     True    Stable        False      Stable           23/23      23/23      NOT_QUIESCED   NOT_QUIESCED       3d6h

DataGovernor Elasticsearch pods in CrashLoopBackOff status

Applies to: 4.8.x

Problem

The Elasticsearch pods of DataGovernor go to CrashLoopBackOff status after the watsonx Assistant upgrade.

Solution
To fix the issue, do the following steps:
  1. Add the pause annotation to DataGovernor:
    oc annotate dg wa-data-governor pause-reconciliation="true" --overwrite
  2. Edit wa-data-governor:
    oc edit elasticsearchcluster wa-data-governor
  3. Add the following lines in the spec section of wa-data-governor and restart the Elasticsearch pods:

      spec:
        livenessProbe:
          failureThreshold: 5
          initialDelaySeconds: 600
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 300
        readinessProbe:
          failureThreshold: 5
          initialDelaySeconds: 600
          periodSeconds: 30
          successThreshold: 3
          timeoutSeconds: 300
  4. Check the status of wa-data-governor:
    oc get sts |grep wa-data-go
    wa-data-go-d7db-ib-6fb9-es-server-all   3/3     19h
  5. Scale the Elasticsearch statefulset to 0 when the first pod is in the Terminating state:
    oc scale sts wa-data-go-d7db-ib-6fb9-es-server-all  --replicas=0
  6. Scale the Elasticsearch statefulset back to 3:
    oc scale sts wa-data-go-d7db-ib-6fb9-es-server-all  --replicas=3
  7. Remove the pause annotation from DataGovernor after all three Elasticsearch nodes are live and no longer in the CrashLoopBackOff status:

    oc annotate dg wa-data-governor pause-reconciliation="false" --overwrite
  8. Restart the Elasticsearch pods.
    Note: After restart, you can expect the Elasticsearch pods to crash a few times before they become live again. Therefore, you must monitor the Elasticsearch pods to make sure they are healthy and stable.

DataGovernor does not recover after shutdown and restart

Applies to: 4.8.0 or 4.8.2

Problem

Data Governor does not recover after shutdown or restart because the Kafka CR shows the following error message:

oc get kafka wa-data-governor-kafka -o yaml
conditions:
  - lastTransitionTime: "2024-01-16T19:53:33.655488407Z"
    message: 'Failure executing: PATCH at: https://172.30.0.1:443/api/v1/namespaces/cpd/persistentvolumeclaims/data-wa-data-governor-kafka-zookeeper-0.
      Message: PersistentVolumeClaim "data-wa-data-governor-kafka-zookeeper-0" is
      invalid: spec.resources.requests.storage: Forbidden: field can not be less than
      previous value. Received status: Status(apiVersion=v1, code=422, details=StatusDetails(causes=[StatusCause(field=spec.resources.requests.storage,
      message=Forbidden: field can not be less than previous value, reason=FieldValueForbidden,
      additionalProperties={})], group=null, kind=PersistentVolumeClaim, name=data-wa-data-governor-kafka-zookeeper-0,
      retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=PersistentVolumeClaim
      "data-wa-data-governor-kafka-zookeeper-0" is invalid: spec.resources.requests.storage:
      Forbidden: field can not be less than previous value, metadata=ListMeta(_continue=null,
      remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}),
      reason=Invalid, status=Failure, additionalProperties={}).'
    reason: KubernetesClientException
    status: "True"
    type: NotReady
Solution
To fix the issue, use the following command:
cat <<EOF | oc apply -f -
apiVersion: assistant.watson.ibm.com/v1
kind: TemporaryPatch
metadata:
  name: wa-data-governor-override-custom-resource
spec:
  apiVersion: assistant.watson.ibm.com/v1
  kind: WatsonAssistant
  name: wa
  patch:
    data-governor:
      datagovernor:
        spec:
          overrideCustomResource: wa-data-exhaust-override
  patchType: patchStrategicMerge
EOF

One or more watsonx Assistant pods go to the ContainerStatusUnknown state

Applies to: 4.8.x

Problem

Some watsonx Assistant pods go to the ContainerStatusUnknown state.

Solution
You can delete the pods in the ContainerStatusUnknown state by doing the following:
  1. Get the instance name and set the INSTANCE variable to that name:
    export INSTANCE=`oc get wa -n ${PROJECT_CPD_INST_OPERANDS} |grep -v NAME| awk '{print $1}'`
  2. Get the pods that are in the ContainerStatusUnknown state:
    oc get pods | grep ${INSTANCE}- | grep ContainerStatusUnknown
  3. Delete the pods that are in the ContainerStatusUnknown state individually:
    oc delete pod <unknown-state-pod>
  4. Confirm that there are no more pods in the ContainerStatusUnknown state:
    oc get pods | grep ${INSTANCE}- | grep ContainerStatusUnknown
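Steps 2 through 4 grep the pod list and then delete the matching pods one at a time. As a convenience, a small awk filter (a sketch, not from the product documentation) can extract the affected pod names so they can be deleted in one pass:

```shell
# Hypothetical helper: read `oc get pods` output on stdin
# (columns: NAME READY STATUS RESTARTS AGE) and print the names of pods
# whose STATUS column is ContainerStatusUnknown.
unknown_pods() {
  awk '$3 == "ContainerStatusUnknown" {print $1}'
}

# Usage against a live cluster (xargs -r skips the delete when no pod matches):
#   oc get pods | grep "${INSTANCE}-" | unknown_pods | xargs -r oc delete pod
```

This is equivalent to the manual steps; it only automates collecting the pod names.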

EDB Postgres cluster in bad state

Applies to: Any release

Problem

The watsonx Assistant EDB Postgres cluster is unhealthy.

Solution
Complete the following steps to recover the watsonx Assistant EDB Postgres cluster.

For Versions 4.8.8 and later, do the following steps:

  1. Install the CloudNativePG (CNP) plugin for EnterpriseDB.

    curl -sSfL \
    https://github.com/EnterpriseDB/kubectl-cnp/raw/main/install.sh | \
    sudo sh -s -- -b /usr/local/bin
  2. Query the status of wa-postgres-16 to get the details of the EDB Postgres cluster such as Name, Namespace, Primary instance, and Status:
    oc cnp status wa-postgres-16
    
    Cluster Summary
    Name:                wa-postgres-16
    Namespace:           zen
    PostgreSQL Image:    icr.io/cpopen/edb/postgresql:12.16-4.18.0-amd64@sha256:93801fc5f515ede1b243c15ca79956a100143ce21d39d92bf85ebaff99a9dbd9
    Primary instance:    wa-postgres-16-3
    Primary start time:  2023-12-13 12:24:06 +0000 UTC (uptime 5s)
    Status:              Failing over Failing over from wa-postgres-16-2 to wa-postgres-16-3
    Instances:           3
    Ready instances:     0
  3. Check the pods that are failing:
    oc get pods | grep wa-postgres-16
  4. If wa-postgres-16-1 starts, destroy it because wa-postgres-16-3 is the Primary instance:
    oc cnp destroy wa-postgres-16 1
  5. If wa-postgres-16-2 starts, destroy it because wa-postgres-16-3 is the Primary instance:
    oc cnp destroy wa-postgres-16 2
  6. After wa-postgres-16-3 starts, the EnterpriseDB operator controller recreates two standby pods.
  7. Query the status of wa-postgres-16:
    oc cnp status wa-postgres-16
    
    Cluster Summary
    Name:                wa-postgres-16
    Namespace:           cpd-instance
    System ID:           7311733145743429658
    PostgreSQL Image:    icr.io/cpopen/edb/postgresql:12.16-4.18.0-amd64@sha256:93801fc5f515ede1b243c15ca79956a100143ce21d39d92bf85ebaff99a9dbd9
    Primary instance:    wa-postgres-16-3
    Primary start time:  2023-12-12 16:59:47 +0000 UTC (uptime 20h48m52s)
    Status:              Cluster in healthy state                         <---- back in a healthy state.
    Instances:           3
    Ready instances:     3

For Versions earlier than 4.8.8, do the following steps:

  1. Install the CloudNativePG (CNP) plugin for EnterpriseDB.

    curl -sSfL \
    https://github.com/EnterpriseDB/kubectl-cnp/raw/main/install.sh | \
    sudo sh -s -- -b /usr/local/bin
  2. Query the status of wa-postgres to get the details of the EDB Postgres cluster such as Name, Namespace, Primary instance, and Status:
    oc cnp status wa-postgres
    
    Cluster Summary
    Name:                wa-postgres
    Namespace:           zen
    PostgreSQL Image:    icr.io/cpopen/edb/postgresql:12.16-4.18.0-amd64@sha256:93801fc5f515ede1b243c15ca79956a100143ce21d39d92bf85ebaff99a9dbd9
    Primary instance:    wa-postgres-3
    Primary start time:  2023-12-13 12:24:06 +0000 UTC (uptime 5s)
    Status:              Failing over Failing over from wa-postgres-2 to wa-postgres-3
    Instances:           3
    Ready instances:     0
  3. Check the pods that are failing:
    oc get pods | grep wa-postgres
  4. If wa-postgres-1 starts, destroy it because wa-postgres-3 is the Primary instance:
    oc cnp destroy wa-postgres 1
  5. If wa-postgres-2 starts, destroy it because wa-postgres-3 is the Primary instance:
    oc cnp destroy wa-postgres 2
  6. After wa-postgres-3 starts, the EnterpriseDB operator controller recreates two standby pods.
  7. Query the status of wa-postgres:
    oc cnp status wa-postgres
    
    Cluster Summary
    Name:                wa-postgres
    Namespace:           cpd-instance
    System ID:           7311733145743429658
    PostgreSQL Image:    icr.io/cpopen/edb/postgresql:12.16-4.18.0-amd64@sha256:93801fc5f515ede1b243c15ca79956a100143ce21d39d92bf85ebaff99a9dbd9
    Primary instance:    wa-postgres-3
    Primary start time:  2023-12-12 16:59:47 +0000 UTC (uptime 20h48m52s)
    Status:              Cluster in healthy state                         <---- back in a healthy state.
    Instances:           3
    Ready instances:     3
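The two procedures above are identical except for the cluster name, which depends on the installed version (wa-postgres-16 for 4.8.8 and later, wa-postgres for earlier versions). A small bash sketch, not part of the product, can pick the right prefix with a version-sort comparison so that the same commands work for either version:

```shell
# Hypothetical helper: echo the EDB Postgres cluster name for a given
# watsonx Assistant version. Versions 4.8.8 and later use the
# "wa-postgres-16" naming; earlier versions use "wa-postgres".
pg_prefix() {
  # sort -V orders version strings numerically; if 4.8.8 sorts first
  # (or equal), the given version is 4.8.8 or later.
  if [ "$(printf '%s\n' "4.8.8" "$1" | sort -V | head -n 1)" = "4.8.8" ]; then
    echo "wa-postgres-16"
  else
    echo "wa-postgres"
  fi
}
```

For example, against a live cluster you could run `oc cnp status "$(pg_prefix 4.8.9)"`, which expands to `oc cnp status wa-postgres-16`.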

The ETCD pods of watsonx Assistant are not re-created after restarting the cluster

Applies to: Upgrades to 4.8.1

Problem
When you do an unplanned restart of the Kubernetes cluster that hosts watsonx Assistant, the ETCD pods remain in the ContainerCreating status:
wa-etcd-0 0/1 ContainerCreating 0
In addition, when you run the oc describe, you see the following message in the event logs:
attachdetach-controller 
         Multi-Attach error for volume "pvc-af57751f-245e-4e7e-8ec1-84aa6f7bac15" Volume is already exclusively attached to one node and can't be attached to another.

The ETCD pods of watsonx Assistant remain in the ContainerCreating status because of the Multi-Attach error for the PersistentVolume (PV). When you restart the cluster, the PersistentVolumeClaim (PVC) tries to attach to a new worker node in place of the worker node that the PV is already attached to. Because the PV has the ReadWriteOnce access mode, it cannot be attached to a second node, which causes the Multi-Attach error for volume... message.

Solution
  1. Scale the replicas of the wa-etcd stateful-set to 0:

    oc scale sts wa-etcd --replicas=0
  2. Delete the PVC that remains in use to free it to attach to a new pod:
    oc delete pvc data-wa-etcd-0 
  3. Scale the stateful-set back to three replicas:
    oc scale sts wa-etcd --replicas=3
  4. Check that the wa-etcd pods return to the Running state:
    oc get pods | grep wa-etcd
    wa-etcd-0   1/1     Running             0      3m53s
    wa-etcd-1   1/1     Running             0      3m25s
    wa-etcd-2   1/1     Running             0      2m55s

Elasticsearch store not getting cleaned up on upgrade

Applies to: Upgrades to 4.8.0

Problem

The Elasticsearch cluster that is used by the store microservice is not cleaned up after upgrading watsonx Assistant to Version 4.8.0.

Solution
Clean up the Elasticsearch cluster store after upgrading to watsonx Assistant Version 4.8.0:
  1. Export the PROJECT_CPD_INST_OPERANDS variable as the project (namespace) where Cloud Pak for Data and watsonx Assistant are installed:
    export PROJECT_CPD_INST_OPERANDS=<namespace where Cloud Pak for Data and Assistant is installed>
  2. Export the name of your watsonx Assistant instance as an environment variable:
    export INSTANCE=`oc get wa -n ${PROJECT_CPD_INST_OPERANDS} |grep -v NAME| awk '{print $1}'`
  3. Check the name of the Elasticsearch cluster:
    oc get elasticsearchcluster -n ${PROJECT_CPD_INST_OPERANDS}
  4. If ${INSTANCE}-es-store is listed, delete the store:
    oc delete elasticsearchcluster ${INSTANCE}-es-store -n ${PROJECT_CPD_INST_OPERANDS}
  5. Confirm the deletion of ${INSTANCE}-es-store:
    oc get elasticsearchcluster -n ${PROJECT_CPD_INST_OPERANDS}
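The INSTANCE export in step 2 is a plain header-strip-and-first-column pipeline. As a standalone sketch (the wa_instance_name function name is illustrative only):

```shell
# Extract the instance name from `oc get wa` output: drop the header
# row, keep the first column.
wa_instance_name() {
  grep -v NAME | awk '{print $1}'
}

# Usage against a live cluster:
#   export INSTANCE=$(oc get wa -n ${PROJECT_CPD_INST_OPERANDS} | wa_instance_name)
```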

Data Governor shutdown issue

Applies to: 4.8.0 and later

Problem

When you shut down watsonx Assistant, Data Governor does not stop the running pods in some Deployments.

Solution
Remove the Deployments with the running pods:
  1. Run the following command to return the Deployments that are safe to delete:
    oc get deploy -l squad=data-exhaust,app=${INSTANCE_NAME:-wa}
    where the INSTANCE_NAME is the name of the watsonx Assistant installation.
  2. Manually delete each Deployment that has the active pods:
    oc delete deploy -l squad=data-exhaust,app=${INSTANCE_NAME:-wa}
When you restart, Data Governor re-creates the deleted Deployments.
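Step 1 lists the Deployments; to see which of them still have ready pods (the ones step 2 removes), the READY column of `oc get deploy` can be parsed. A hypothetical filter, assuming the default `NAME READY ...` column layout:

```shell
# Print deployments whose READY numerator is greater than 0, given
# `oc get deploy` output (with header) on stdin.
deploys_with_active_pods() {
  awk 'NR > 1 { split($2, r, "/"); if (r[1] + 0 > 0) print $1 }'
}

# Usage against a live cluster:
#   oc get deploy -l squad=data-exhaust,app=${INSTANCE_NAME:-wa} | deploys_with_active_pods
```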

EDB Postgres connection errors when max connections reached

Applies to: 4.7.0 and later

Problem

When a watsonx Assistant installation has too many instances that are used by parallel users (over 10,000), the maximum number of concurrent Postgres connections is reached, which can cause errors.

Solution

Increase the maximum number of concurrent EDB Postgres connections. You need to apply two temporary patches that modify the EDB Postgres max_connections parameter.

  1. Before you apply the patches, ensure that PROJECT_CPD_INST_OPERANDS is set to the namespace where watsonx Assistant is installed.
  2. Export the PROJECT_CPD_INST_OPERANDS variable as the project where Cloud Pak for Data and watsonx Assistant are installed.
    export PROJECT_CPD_INST_OPERANDS=<namespace where Cloud Pak for Data and Assistant is installed>
    
  3. Run the following to apply both patches:
    cat <<EOF |oc apply -f -
    apiVersion: assistant.watson.ibm.com/v1
    kind: TemporaryPatch
    metadata:
      name: wa-postgres-max-connections-1
      namespace: ${PROJECT_CPD_INST_OPERANDS}
      labels:
        type: critical-configuration
    spec:
      apiVersion: assistant.watson.ibm.com/v1
      kind: WatsonAssistantStore
      name: wa
      patch:
        postgres:
          postgres:
            spec:
              postgresql:
                parameters:
                  max_connections: "200"
      patchType: patchStrategicMerge
    EOF
    
    cat <<EOF |oc apply -f -
    apiVersion: assistant.watson.ibm.com/v1
    kind: TemporaryPatch
    metadata:
      name: wa-postgres-max-connections-2
      namespace: ${PROJECT_CPD_INST_OPERANDS}
      labels:
        type: critical-configuration
    spec:
      apiVersion: assistant.watson.ibm.com/v1
      kind: WatsonAssistant
      name: wa
      patch:
        postgres:
          postgres:
            spec:
              postgresql:
                parameters:
                  max_connections: "200"
      patchType: patchStrategicMerge
    EOF
    
  4. Only for Versions 4.8.8 and later:

    After twenty minutes, you can run the following to verify the changes.

    oc exec -it wa-postgres-16-1 -- sh
    sh-4.4$ psql -U postgres
    psql (12.14)
    Type "help" for help.
    
    postgres=# show max_connections;
     max_connections 
    -----------------
     200
    (1 row)
    
  5. For Versions earlier than 4.8.8:

    After twenty minutes, you can run the following to verify the changes.

    oc exec -it wa-postgres-1 -- sh
    sh-4.4$ psql -U postgres
    psql (12.14)
    Type "help" for help.
    
    postgres=# show max_connections;
     max_connections 
    -----------------
     200
    (1 row)
    
  6. Check that both patches have the type: critical-configuration label so that they are not deleted when you upgrade watsonx Assistant:

    oc get temporarypatch  --show-labels
    NAME                            READY   READYREASON   UPDATING   UPDATINGREASON   DEPLOYED   VERIFIED   AGE   LABELS
    wa-postgres-max-connections-1                                                                           16h   type=critical-configuration
    wa-postgres-max-connections-2                                                                           14h   type=critical-configuration
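The label check in step 6 can be scripted so that a patch without the label is reported explicitly. A hypothetical sketch that parses `oc get temporarypatch --show-labels` output (the function name is illustrative):

```shell
# Print any wa-postgres-max-connections patch whose LABELS column does
# not contain type=critical-configuration. Empty output means both
# patches are protected from deletion on upgrade.
patches_missing_label() {
  awk 'NR > 1 && $1 ~ /^wa-postgres-max-connections/ && $NF !~ /type=critical-configuration/ { print $1 }'
}

# Usage against a live cluster:
#   oc get temporarypatch --show-labels | patches_missing_label
```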
    

watsonx Assistant Redis pods not starting because quota is applied to the namespace

Applies to: 4.7.0 and later

Problem
Redis pods fail to start due to the following error:
Warning  FailedCreate      51s (x15 over 2m22s)  statefulset-controller  create Pod c-wa-redis-m-0 in StatefulSet c-wa-redis-m failed error: pods "c-wa-redis-m-0" is forbidden: failed quota: cpd-quota: must specify limits.cpu,limits.memory,requests.cpu,requests.memory
Cause
Redis pods cannot start when a resource quota is applied to the namespace but no LimitRange is set. Because the Redis init containers do not specify limits.cpu, limits.memory, requests.cpu, or requests.memory, the quota check fails.
Solution
Apply a LimitRange with defaults for limits and requests. Change the namespace in the following YAML to the namespace where Cloud Pak for Data is installed:
apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-resource-limits
  namespace:  zen  #Change it to the namespace where CPD is installed
spec:
  limits:
  - default:
      cpu: 300m
      memory: 200Mi
    defaultRequest:
      cpu: 200m
      memory: 200Mi
    type: Container

watsonx Assistant Redis pods not running after cluster restart

Applies to: 4.7.0 and later

Problem
watsonx Assistant pods do not restart successfully after the cluster is restarted.
Cause
When the cluster is restarted, Redis is not restarting properly. This issue prevents watsonx Assistant from restarting successfully.
Solution
  1. Get the instance name by running oc get wa. Set the INSTANCE variable to that name.
  2. Get the unhealthy redis pods:
    oc get pods | grep ${INSTANCE}- | grep redis | grep -v Running
  3. For each unhealthy redis pod, restart the pod:
    oc delete pod <unhealthy-redis-pod>
  4. Confirm that there are no more unhealthy redis pods:
    oc get pods | grep ${INSTANCE}- | grep redis | grep -v Running
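Steps 2-4 use the same pipeline twice; wrapping it in a function keeps the checks consistent. A hypothetical sketch (the function name is not part of the product) that reads `oc get pods` output from stdin, with the instance name as its argument:

```shell
# Print redis pods for the given instance that are not in the Running
# state. Empty output (step 4) means every redis pod is healthy.
unhealthy_redis_pods() {
  grep "$1-" | grep redis | grep -v Running | awk '{print $1}'
}

# Usage against a live cluster:
#   oc get pods | unhealthy_redis_pods "${INSTANCE}"
```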

watsonx Assistant upgrade gets stuck at apply-cr

Applies to: 4.7.0 and later

Problem

The clu-training pods are in CrashLoopBackOff state after apply-olm completes, watsonx Assistant hangs during the ibmcpd upgrade, or the upgrade hangs during the apply-cr command with the message:

pre-apply-cr release patching (if any) for watson_assistant]
Cause
After apply-olm, model train pods might go into a bad state causing the apply-cr command for watsonx Assistant version 4.6.0 to stall.
Solution

Run the following commands after running the apply-olm command:

  1. Export the name of your watsonx Assistant instance as an environment variable:
    export INSTANCE=`oc get wa -n ${PROJECT_CPD_INST_OPERANDS} |grep -v NAME| awk '{print $1}'`
  2. Re-create the ModelTrain training job:
    oc delete modeltraindynamicworkflows.modeltrain.ibm.com ${INSTANCE}-dwf # This command may take some time to complete. Increase CPU allocation to the model train operator if the command does not complete in a few minutes
    oc delete pvc -l release=${INSTANCE}-dwf-ibm-mt-dwf-rabbitmq
    oc delete deploy ${INSTANCE}-clu-training-${INSTANCE}-dwf
    oc delete secret/${INSTANCE}-clu-training-secret job/${INSTANCE}-clu-training-create job/${INSTANCE}-clu-training-update secret/${INSTANCE}-dwf-ibm-mt-dwf-server-tls-secret secret/${INSTANCE}-dwf-ibm-mt-dwf-client-tls-secret
    oc delete secret registry-${INSTANCE}-clu-training-${INSTANCE}-dwf-training

Expect it to take at least 30 minutes for the new training job to take effect and the status to change to Completed.

watsonx Assistant upgrade gets stuck at apply-cr or training does not work after the upgrade completes successfully

Applies to: 4.7.0 and later

Problem
The etcdclusters custom resource wa-etcd shows the following patching error:
"Failed to patch object: b''{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"StatefulSet.apps
      \\"wa-etcd\\" is invalid:"
Solution
To check whether there is an error in etcdcluster:
  1. Export the name of your watsonx Assistant instance as an environment variable:
    export INSTANCE=`oc get wa -n ${PROJECT_CPD_INST_OPERANDS} |grep -v NAME| awk '{print $1}'`
  2. Describe the etcdcluster custom resource:
    oc describe etcdcluster ${INSTANCE}-etcd

Complete the following steps to fix the issue:

  1. Export the name of your watsonx Assistant instance as an environment variable:
    export INSTANCE=`oc get wa -n ${PROJECT_CPD_INST_OPERANDS} |grep -v NAME| awk '{print $1}'`
  2. Delete the following components:
    oc delete job ${INSTANCE}-create-slot-job
    Expect it to take at least 20 minutes for the etcdcluster custom resource to come back again and become healthy.

    To confirm, enter oc get etcdclusters and ensure that you get output similar to:

    NAME      AGE
    wa-etcd   120m
  3. Re-create the clu subsystem component:
    oc delete clu ${INSTANCE}
    Expect it to take at least 20 minutes for the clu subsystem to come back again and become healthy.

    To confirm, enter oc get wa and ensure that you get output similar to:

    NAME   VERSION   READY   READYREASON   UPDATING   UPDATINGREASON   DEPLOYED   VERIFIED   AGE
    wa     4.6.0     True    Stable        False      Stable           18/18      18/18      12h
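The health check in the confirmation step can be automated by parsing the READY and DEPLOYED columns. A hypothetical sketch, assuming the `oc get wa` column order shown above (NAME VERSION READY ... DEPLOYED ...) and a single instance row:

```shell
# Exit 0 only when the instance row reports READY=True and the
# DEPLOYED column shows all components deployed (e.g. 18/18).
wa_is_healthy() {
  awk 'NR == 2 { split($7, d, "/"); ok = ($3 == "True" && d[1] == d[2]) } END { exit !ok }'
}

# Usage against a live cluster:
#   oc get wa | wa_is_healthy && echo "wa is healthy"
```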

Red Hat OpenShift upgrade hangs because some watsonx Assistant pods do not quiesce

Applies to: 4.7.0 and later

Fixed in: 4.8.6

Problem
Some watsonx Assistant pods do not quiesce and might not automatically drain, causing the Red Hat OpenShift upgrade to pause.
Cause
Quiesce capability is not fully supported by watsonx Assistant, so some pods continue running.
Solution
watsonx Assistant quiesce is optional when you upgrade the Red Hat OpenShift cluster. Monitor the node that is being upgraded for pods that do not drain automatically, which causes the upgrade to hang. To allow the upgrade to continue, delete any pod that is not draining so that the upgrade can proceed to the next node.

Increasing backup storage for wa-store-cronjob pods that run out of space

Applies to: 4.7.0 and later

Problem
The nightly scheduled wa-store-cronjob pods eventually fail with No space left on device.
Cause
The cronjob backup PVC is too small and its size needs to be increased.
Solution
Edit the size of the backup storage from 1Gi to 2Gi:
  1. Edit the CR:
    oc edit wa wa
  2. In the configOverrides section, add:
    store_db:
      backup:
        size: 2Gi

After 5-10 minutes, the PVC should be resized. If the problem persists, increase the size to a larger value, for example, 4Gi.

Inaccurate status message from command line after upgrade

Diagnosing the problem
If you run the cpd-cli service-instance upgrade command from the Cloud Pak for Data command-line interface, and then use the service-instance list command to check the status of each service, the provision status for the service is listed as UPGRADE_FAILED.
Cause of the problem
When you upgrade the service, only the cpd-cli manage apply-cr command is supported; you cannot use the cpd-cli service-instance upgrade command. After you upgrade the service with the apply-cr method, the change in version and status is not recognized by the service-instance command. However, the correct version is displayed in the Cloud Pak for Data web client.
Resolving the problem
No action is required. If you use the cpd-cli manage apply-cr method to upgrade the service as documented, the upgrade is successful and you can ignore the version and status information that is generated by the cpd-cli service-instance list command.

ModelTrain or clu-training pods might not become healthy during an upgrade or after installation

Applies to: 4.7.0 and later

Problem
During a watsonx Assistant upgrade or installation, one or more of the pods that are listed by oc get pods |grep wa-dwf do not become healthy.
Solution
Note:

Perform the following procedures only with the help of the technical support team.

Run the following script to re-create Model Train and CLU training pods:
#!/bin/bash
# Set the default values for PROJECT_CPD_INST_OPERANDS and OPERATOR_NS
PROJECT_CPD_INST_OPERANDS=$(oc get wa -A -o=jsonpath='{.items[0].metadata.namespace}' | awk '{print $1}')
if [ -z "$PROJECT_CPD_INST_OPERANDS" ]; then
  echo -e "\n##  Unable to retrieve PROJECT_CPD_INST_OPERANDS. Exiting..."
  exit 1
fi

OPERATOR_NS=""
for ns in $(oc get namespaces -o=jsonpath='{.items[*].metadata.name}'); do
  if oc get deploy -n "$ns" ibm-watson-assistant-operator >/dev/null 2>&1; then
    OPERATOR_NS=$ns
    break
  fi
done

if [ -z "$OPERATOR_NS" ]; then
  echo -e "\n##  ibm-watson-assistant-operator deployment not found in any namespace."
  exit 1
fi

# Prompt the user for input
echo -e "\n"
read -p "Please enter the namespace where Assistant is installed. [default: $PROJECT_CPD_INST_OPERANDS]: " PROJECT_CPD_INST_OPERANDS_OVERRIDE
PROJECT_CPD_INST_OPERANDS=${PROJECT_CPD_INST_OPERANDS_OVERRIDE:-$PROJECT_CPD_INST_OPERANDS}
echo -e "\n"
read -p "Please enter the namespace where assistant operator is installed. [default: $OPERATOR_NS]: " OPERATOR_NS_OVERRIDE
OPERATOR_NS=${OPERATOR_NS_OVERRIDE:-$OPERATOR_NS}

# Export the instance
export PROJECT_CPD_INST_OPERANDS
export OPERATOR_NS


export INSTANCE=$(oc get wa -n "${PROJECT_CPD_INST_OPERANDS}" | grep -v NAME | awk '{print $1}')
echo -e "\n##  Found watsonx Assistant Instance $INSTANCE"

# Scale down assistant-operator
echo -e "\n##  Scaling down ibm-watson-assistant-operator to 0 replica"
oc scale deploy ibm-watson-assistant-operator --replicas=0 -n "$OPERATOR_NS"

# Get MT cr name
export MT_CR_NAME="wa-dwf"

# Delete Modeltrain
echo -e "\n##  Deleting Assistant modeltraindynamicworkflows CR if it exists"

if oc get modeltraindynamicworkflows.modeltrain.ibm.com $MT_CR_NAME -n $PROJECT_CPD_INST_OPERANDS >/dev/null 2>&1; then
  echo -e "\nmodeltraindynamicworkflows.modeltrain.ibm.com assistant CR found. Deleting ..."
  oc delete modeltraindynamicworkflows.modeltrain.ibm.com $MT_CR_NAME -n $PROJECT_CPD_INST_OPERANDS --ignore-not-found=true &
else
  echo -e "\nmodeltraindynamicworkflows.modeltrain.ibm.com assistant CR not found."
fi

echo -e "\n##  Waiting for deletion..."
start_time=$(date +%s)
while true; do
    if ! oc get modeltraindynamicworkflows.modeltrain.ibm.com $MT_CR_NAME -n $PROJECT_CPD_INST_OPERANDS &>/dev/null; then
        break  # Exit the loop if the resource is deleted
    fi

    current_time=$(date +%s)
    elapsed_time=$((current_time - start_time))
    if [[ $elapsed_time -gt 300 ]]; then
        echo -e "\n##  Timeout: Deletion took longer than 5 minutes. Removing finalizer forcefully..."
        oc patch modeltraindynamicworkflows.modeltrain.ibm.com $MT_CR_NAME -n $PROJECT_CPD_INST_OPERANDS --type json -p='[{"op": "remove", "path": "/metadata/finalizers"}]'
        break  # Exit the loop even if the resource is not deleted
    fi

    sleep 10  # Wait for 10 seconds before checking again
done

#Delete deploy,sts and dwf rabbitmqcluster started by Model Train if not deleted
#Retrieve the names of deployments, statefulsets, and RabbitMQ clusters
resources=$(oc get deploy,sts,rabbitmqcluster -l icpdsupport/app=wa-dwf -n $PROJECT_CPD_INST_OPERANDS --no-headers | awk '{print $1}')
# Check if any resources exist
if [ -z "$resources" ]; then
  echo -e "\n##  No deployments, statefulsets, or RabbitMQ clusters with label 'icpdsupport/app=wa-dwf' found."
else
  echo -e "\n##  Deleting existing deployments, statefulsets, and RabbitMQ clusters..."
  # Loop through each name and delete the corresponding resource
  while IFS= read -r name; do
    oc delete "$name" -n $PROJECT_CPD_INST_OPERANDS
  done <<< "$resources"
  echo -e "\n##  Deletion completed."
fi

# Delete PVC
echo -e "\n## Deleting any leftover PVC"
# Delete PVC if one is found
PVC_NAMES=$(oc get pvc -l release=$MT_CR_NAME-ibm-mt-dwf-rabbitmq -n $PROJECT_CPD_INST_OPERANDS -o jsonpath='{.items[*].metadata.name}')

if [ -n "$PVC_NAMES" ]; then
  oc delete pvc $PVC_NAMES -n $PROJECT_CPD_INST_OPERANDS
fi

# Delete deployments and secrets
echo -e "\n##  Cleaning up DWF resources"
oc delete deploy $INSTANCE-clu-training-$INSTANCE-dwf -n $PROJECT_CPD_INST_OPERANDS --ignore-not-found=true
oc delete secret -l release=wa-dwf -n $PROJECT_CPD_INST_OPERANDS  --ignore-not-found=true
oc delete secret/${INSTANCE}-dwf-ibm-mt-dwf-server-tls-secret secret/${INSTANCE}-dwf-ibm-mt-dwf-client-tls-secret -n $PROJECT_CPD_INST_OPERANDS --ignore-not-found=true
oc delete secret/${INSTANCE}-clu-training-secret job/${INSTANCE}-clu-training-create -n $PROJECT_CPD_INST_OPERANDS --ignore-not-found=true
oc delete secret/${INSTANCE}-clu-training-secret job/${INSTANCE}-clu-training-update -n $PROJECT_CPD_INST_OPERANDS --ignore-not-found=true
oc delete secret registry-${INSTANCE}-clu-training-${INSTANCE}-dwf-training -n $PROJECT_CPD_INST_OPERANDS --ignore-not-found=true
oc delete hpa  $INSTANCE-clu-training-$INSTANCE-dwf -n $PROJECT_CPD_INST_OPERANDS --ignore-not-found=true

# Scale up watson-assistant-operator
echo -e "\n##  Scaling up ibm-watson-assistant-operator to 1 replica"
oc scale deploy ibm-watson-assistant-operator --replicas=1 -n $OPERATOR_NS

# Wait for the modeltraindynamicworkflow to show up
echo -e "\n##  Waiting for wa-dwf modeltraindynamicworkflow..."
while true; do
    OUTPUT=$(oc get modeltraindynamicworkflow -A -o=name | grep "wa-dwf")
    if [ -n "$OUTPUT" ]; then
        echo -e "\n##  wa-dwf modeltraindynamicworkflow found!"
        break
    fi
    echo -e "\nWaiting for modeltraindynamicworkflow CR to appear."
    sleep 60
done

# Check all dwf pods
declare -a deployments=("wa-dwf-ibm-mt-dwf-lcm" "wa-dwf-ibm-mt-dwf-trainer" "wa-clu-training-wa-dwf" )

# Function to check if a deployment is ready
check_deployments_ready() {
  local deployment=$1
  local deployment_status=$(oc get deployment "$deployment" -n $PROJECT_CPD_INST_OPERANDS -o=jsonpath='{.status.conditions[?(@.type=="Available")].status}')

  if [ "$deployment_status" = "True" ]; then
    echo -e "\nDeployment $deployment is ready."
    return 0
  else
    echo -e "\nDeployment $deployment is not ready. Waiting .."
    return 1
  fi
}

# Check the deployments and their pods
echo -e "\n##  Checking deployments and pods..."
while true; do
    all_pods_running=true

    for deployment in "${deployments[@]}"; do
        deployment_output=$(oc get deployment "$deployment" -n $PROJECT_CPD_INST_OPERANDS  2>/dev/null)
        if [ -z "$deployment_output" ]; then
            echo -e "\nDeployment $deployment does not exist. Waiting .."
            all_pods_running=false
            sleep 30
            break
        else
            if ! check_deployments_ready "$deployment"; then
                all_pods_running=false
                sleep 60
                break
            fi
        fi
    done

    if $all_pods_running; then
        echo -e "\n##  All deployments are now ready."
        break
    fi

    sleep 1
done

# Restart master pod
echo -e "\n##  Restarting  master pod"
oc rollout restart deploy wa-master -n $PROJECT_CPD_INST_OPERANDS
oc rollout status deploy wa-master -n $PROJECT_CPD_INST_OPERANDS