Troubleshooting issues in watsonx Assistant

If you encounter an installation issue with watsonx Assistant, such as a cluster node not starting as expected, access log files from the cluster to get more detail about the issue.

Permissions you need for this task:
You must be an administrator of the Red Hat® OpenShift® project.

One or more watsonx Assistant pods go to the ContainerStatusUnknown state

You can delete the pods in the ContainerStatusUnknown state by doing the following:
  1. Get the instance name and set the INSTANCE variable to that name:
    export INSTANCE=`oc get wa -n ${PROJECT_CPD_INST_OPERANDS} |grep -v NAME| awk '{print $1}'`
  2. Get the pods that are in the ContainerStatusUnknown state:
    oc get pods | grep ${INSTANCE}- | grep ContainerStatusUnknown
  3. Delete the pods that are in the ContainerStatusUnknown state individually:
    oc delete pod <unknown-state-pod>
  4. Confirm that there are no more pods in the ContainerStatusUnknown state:
    oc get pods | grep ${INSTANCE}- | grep ContainerStatusUnknown
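Steps 2 and 3 can be combined into a small filter. The following is a minimal sketch, assuming the pod listing comes from `oc get pods --no-headers` with the pod name in the first column and the status in the third. The function name `unknown_pods` is hypothetical and is not part of the product.

```shell
# Hypothetical helper: read `oc get pods --no-headers` output on stdin and
# print the names of pods that start with "$INSTANCE-" and whose STATUS
# column is ContainerStatusUnknown. Assumes $INSTANCE is set as in step 1.
unknown_pods() {
  awk -v prefix="${INSTANCE}-" \
    'index($1, prefix) == 1 && $3 == "ContainerStatusUnknown" { print $1 }'
}

# Example use against a live cluster:
#   oc get pods --no-headers | unknown_pods | while read -r pod; do
#     oc delete pod "$pod"
#   done
```

Matching on the status column rather than on the whole line avoids deleting a pod whose name happens to contain the search string.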

Postgres cluster in bad state

Complete the following steps to recover the watsonx Assistant Postgres cluster:
  1. Install the CloudNativePG (CNP) plugin for EnterpriseDB:

    curl -sSfL \
    https://github.com/EnterpriseDB/kubectl-cnp/raw/main/install.sh | \
    sudo sh -s -- -b /usr/local/bin
  2. Query the status of wa-postgres to get the details of the Postgres cluster such as Name, Namespace, Primary instance, and Status:
    oc cnp status wa-postgres
    
    Cluster Summary
    Name:                wa-postgres
    Namespace:           zen
    PostgreSQL Image:    icr.io/cpopen/edb/postgresql:12.16-4.18.0-amd64@sha256:93801fc5f515ede1b243c15ca79956a100143ce21d39d92bf85ebaff99a9dbd9
    Primary instance:    wa-postgres-3
    Primary start time:  2023-12-13 12:24:06 +0000 UTC (uptime 5s)
    Status:              Failing over Failing over from wa-postgres-2 to wa-postgres-3
    Instances:           3
    Ready instances:     0
  3. Check the pods that are failing:
    oc get pods | grep wa-postgres
  4. If wa-postgres-1 starts, destroy it because wa-postgres-3 is the Primary instance:
    oc cnp destroy wa-postgres 1
  5. If wa-postgres-2 starts, destroy it because wa-postgres-3 is the Primary instance:
    oc cnp destroy wa-postgres 2
  6. After wa-postgres-3 starts, the EnterpriseDB operator controller recreates two standby pods.
  7. Query the status of wa-postgres:
    oc cnp status wa-postgres
    
    Cluster Summary
    Name:                wa-postgres
    Namespace:           cpd-instance
    System ID:           7311733145743429658
    PostgreSQL Image:    icr.io/cpopen/edb/postgresql:12.16-4.18.0-amd64@sha256:93801fc5f515ede1b243c15ca79956a100143ce21d39d92bf85ebaff99a9dbd9
    Primary instance:    wa-postgres-3
    Primary start time:  2023-12-12 16:59:47 +0000 UTC (uptime 20h48m52s)
    Status:              Cluster in healthy state                         <---- back in a healthy state.
    Instances:           3
    Ready instances:     3
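
To compare fields such as Primary instance and Ready instances without reading the whole summary, the status output can be filtered. This is a minimal sketch, assuming the `Key:   value` layout shown in the sample output above. The function name `cnp_field` is hypothetical.

```shell
# Hypothetical helper: print the value of one field from `oc cnp status`
# output read on stdin, for example "Primary instance" or "Ready instances".
# The key must match the start of the line (after indentation) exactly.
cnp_field() {
  awk -v key="$1:" '
    {
      line = $0
      sub(/^[ \t]+/, "", line)            # drop leading indentation
      if (index(line, key) == 1) {
        value = substr(line, length(key) + 1)
        sub(/^[ \t]+/, "", value)         # drop padding before the value
        print value
        exit
      }
    }'
}

# Example use against a live cluster:
#   oc cnp status wa-postgres | cnp_field "Ready instances"
```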

Restarting watsonx Assistant deployments in the correct sequence

Run the following script to restart the watsonx Assistant pods. The script first restarts the watsonx Assistant microservices that must start sequentially, waiting for each rollout to finish. It then triggers a rolling restart of the deployments that do not need to start sequentially. The script skips deployments that are owned by watsonx Assistant dependencies. If some deployments do not start after the rolling restart, contact IBM Support, because that requires more troubleshooting.
INSTANCE="wa" # Replace with your watsonx Assistant instance name if different

# Restart the microservices that must start sequentially, one at a time.
for DEPLOYMENT in ed dragonfly-clu-mm tfmm clu-triton-serving clu-serving nlu dialog store
do
  echo "#Starting rolling restart of $INSTANCE-$DEPLOYMENT."
  oc rollout restart deployment "$INSTANCE-$DEPLOYMENT"
  oc rollout status "deployment/$INSTANCE-$DEPLOYMENT" --watch=true
  echo "#Rolling restart of $INSTANCE-$DEPLOYMENT completed successfully."
done

# Trigger the restart of the remaining deployments without waiting.
for DEPLOYMENT in analytics clu-embedding incoming-webhooks integrations recommends spellchecker-mm store-sync system-entities ui webhooks-connector gw-instance store-admin
do
  echo "#Starting rolling restart of $INSTANCE-$DEPLOYMENT"
  oc rollout restart deployment "$INSTANCE-$DEPLOYMENT"
done

# Wait for each of the remaining deployments to finish rolling out.
for DEPLOYMENT in analytics clu-embedding incoming-webhooks integrations recommends spellchecker-mm store-sync system-entities ui webhooks-connector gw-instance store-admin
do
  oc rollout status "deployment/$INSTANCE-$DEPLOYMENT" --watch=true
  echo "#Rolling restart of $INSTANCE-$DEPLOYMENT completed successfully."
done
echo "# All watsonx Assistant deployments restarted successfully."
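
The script above prints its success message whether or not `oc rollout status` succeeds. A variant that stops at the first failure might look like the following sketch. The function name `restart_and_wait`, the `ROLLOUT_TIMEOUT` variable, and the 10-minute default are assumptions for illustration and should be tuned for your cluster.

```shell
# Hypothetical helper: restart one deployment and wait for it to finish.
# Arguments: instance name, deployment suffix (for example: wa ed).
# Returns non-zero if the restart fails or the rollout does not complete
# within ROLLOUT_TIMEOUT (assumed default: 10m).
restart_and_wait() {
  deployment="$1-$2"
  echo "# Starting rolling restart of $deployment."
  oc rollout restart "deployment/$deployment" || return 1
  oc rollout status "deployment/$deployment" \
    --timeout="${ROLLOUT_TIMEOUT:-10m}" || return 1
  echo "# Rolling restart of $deployment completed successfully."
}

# Example use for the sequential group; stops at the first failure:
#   for d in ed dragonfly-clu-mm tfmm clu-triton-serving clu-serving nlu dialog store; do
#     restart_and_wait "$INSTANCE" "$d" || break
#   done
```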

Increasing watsonx Assistant etcd resources

When an out-of-memory error is received, complete the following steps to increase the etcd resources:
  1. Export the watsonx Assistant instance:
    export INSTANCE=`oc get wa -n ${PROJECT_CPD_INST_OPERANDS} |grep -v NAME| awk '{print $1}'`
  2. Check the PVC status:
    oc get pvc | grep $INSTANCE-etcd
  3. Scale down the operator:
    oc scale deploy ibm-etcd-operator --replicas=0
  4. Scale down STS for etcd:
    oc get sts | grep $INSTANCE-etcd
    oc scale sts/$INSTANCE-etcd --replicas=0
  5. Update the requested storage size of every etcd PVC:
    1. Patch each PVC:
      oc patch pvc data-$INSTANCE-etcd-0 -p '{"spec": {"resources": {"requests": {"storage": "4Gi"}}}}'
      oc patch pvc data-$INSTANCE-etcd-1 -p '{"spec": {"resources": {"requests": {"storage": "4Gi"}}}}'
      oc patch pvc data-$INSTANCE-etcd-2 -p '{"spec": {"resources": {"requests": {"storage": "4Gi"}}}}'
      
    2. Check that resources.requests.storage was updated in each PVC:
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 4Gi
  6. Scale up STS for etcd:
    oc scale sts/$INSTANCE-etcd --replicas=3
    oc get sts | grep $INSTANCE-etcd
  7. Verify that storageSize in the etcd CR has the same value as the PVC size. If the values differ, the PVC size increase is ignored.
    oc get etcdcluster $INSTANCE-etcd -o jsonpath="{.spec.storageSize}"
    4Gi
  8. If the storageSize of the current etcdCluster does not match the PVC size, change it. Start by updating the value in the watsonx Assistant CR.
    1. Patch the watsonx Assistant CR to increase storageSize:
      oc patch wa $INSTANCE --type json -p '[{ "op": "replace", "path": "/spec/datastores/etcd/storageSize", "value": "4Gi" }]'
    2. Delete the existing etcdCluster CR:
      oc delete etcdcluster $INSTANCE-etcd
    3. Scale up the etcd operator:
      oc scale deploy ibm-etcd-operator --replicas=1
    4. Scale up the etcd STS:
      oc scale sts/$INSTANCE-etcd --replicas=3
      oc get sts | grep $INSTANCE-etcd
    5. Wait for the etcdCluster CR to be re-created, and then verify storageSize again:
      oc get etcdcluster $INSTANCE-etcd -o jsonpath="{.spec.storageSize}"
      4Gi
    6. After the new storageSize is verified, delete the etcd StatefulSet:
      oc delete sts $INSTANCE-etcd
    7. Wait for the StatefulSet to be re-created. The new etcd instances should now have the appropriate storage size.
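
The PVC patch in step 5 and the size check in step 7 can be wrapped in two small functions. This is a minimal sketch, assuming three etcd replicas with PVCs named as in step 5. The function names `patch_etcd_pvcs` and `etcd_size_matches` are hypothetical.

```shell
# Hypothetical helper: request a new size on every etcd data PVC.
# Arguments: instance name, size (for example: wa 4Gi).
# Assumes three etcd replicas (indexes 0 to 2), as in step 5.
patch_etcd_pvcs() {
  instance="$1"
  size="$2"
  for i in 0 1 2; do
    oc patch pvc "data-${instance}-etcd-${i}" \
      -p "{\"spec\": {\"resources\": {\"requests\": {\"storage\": \"${size}\"}}}}" \
      || return 1
  done
}

# Hypothetical helper: succeed only if storageSize in the etcdcluster CR
# equals the size passed as the second argument.
etcd_size_matches() {
  current=$(oc get etcdcluster "$1-etcd" -o jsonpath='{.spec.storageSize}') || return 1
  [ "$current" = "$2" ]
}

# Example use against a live cluster:
#   patch_etcd_pvcs "$INSTANCE" 4Gi
#   etcd_size_matches "$INSTANCE" 4Gi || echo "storageSize differs, see step 8"
```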

If this does not fix the issue, restart the store by restarting the following CLU components in order:

  1. Find and delete the ed pod(s):
    oc get pod |grep $INSTANCE-ed
    
    wa-ed-84df869b74-dp62m 3/3     Running     0          16h
    oc delete pod wa-ed-84df869b74-dp62m
    Make sure it comes back up fully:
    watch "oc get pod |grep $INSTANCE-ed"
  2. Find and delete the tfmm pod(s):
    oc get pod |grep $INSTANCE-tfmm
    wa-tfmm-659857db48-mgnnt  3/3     Running     0          16h
    oc delete pod wa-tfmm-659857db48-mgnnt
    Make sure it comes back up fully:
    watch "oc get pod |grep $INSTANCE-tfmm"
  3. Find and delete the clu-triton-serving pod:
    oc get pod |grep $INSTANCE-clu-triton-serving
    wa-clu-triton-serving-7df477994c-m27pz 1/1     Running     0          3h52m
    oc delete pod wa-clu-triton-serving-7df477994c-m27pz
    Make sure it comes back up fully:
    watch "oc get pod |grep $INSTANCE-clu-triton-serving"
  4. Find and delete the clu-serving pod(s):
    oc get pod |grep $INSTANCE-clu-serving
    wa-clu-serving-695784897c-l84c2  2/2     Running     0          16h
    oc delete pod wa-clu-serving-695784897c-l84c2
    Make sure it comes back up fully:
    watch "oc get pod |grep $INSTANCE-clu-serving"
  5. Find and delete the nlu pod(s):
    oc get pod |grep $INSTANCE-nlu
    wa-nlu-5cf78d749d-9g9rs 1/1     Running     5          17h
    oc delete pod wa-nlu-5cf78d749d-9g9rs
    Make sure it comes back up fully:
    watch "oc get pod |grep $INSTANCE-nlu"
  6. Find and delete the dialog pod(s):
    oc get pod |grep $INSTANCE-dialog
    wa-dialog-6cc9979774-g97v6 1/1     Running     0          16h
    oc delete pod wa-dialog-6cc9979774-g97v6
    Make sure it comes back up fully:
    watch "oc get pod |grep $INSTANCE-dialog"
  7. Find and delete the store pod(s):
    oc get pod |grep $INSTANCE-store
    wa-store-6cc9979774-g97v6 1/1     Running     0          16h
    oc delete pod wa-store-6cc9979774-g97v6
    Make sure it comes back up fully:
    watch "oc get pod |grep $INSTANCE-store"
  8. Check the status of the pods. Ensure that they are all in the Running state.
  9. Retrain your old dialog skills if required.
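
A plain `grep $INSTANCE-store` also matches pods from other deployments whose names share the prefix, such as store-sync and store-admin. The following sketch selects only the pods of one deployment. It assumes the usual `<deployment>-<replicaset-hash>-<pod-suffix>` naming seen in the sample output above, and the function name `pods_for_deployment` is hypothetical.

```shell
# Hypothetical helper: read `oc get pods --no-headers` output on stdin and
# print only the pods that belong to the deployment named in "$1".
# Assumes pod names end with two dash-separated generated parts.
pods_for_deployment() {
  awk -v name="$1" '
    {
      n = split($1, parts, "-")
      # Rebuild the pod name without its last two dash-separated parts.
      base = parts[1]
      for (i = 2; i <= n - 2; i++) base = base "-" parts[i]
      if (n > 2 && base == name) print $1
    }'
}

# Example use against a live cluster:
#   oc get pods --no-headers | pods_for_deployment "$INSTANCE-store"
```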

Checking the status of EDB pods

Run the following command to check the status of EDB pods:

oc get po | grep postgres

The pods should be in the Running state. If they are not, complete the following steps:

  1. Retrieve the EDB cluster name:
    oc get cluster | grep postgres
  2. Retrieve the watsonx Assistant operator deployment:
    oc get deployments -n cpd-operators | grep assistant
  3. Scale the watsonx Assistant operator deployment in the proper namespace:
    oc scale deployment <deployment> --replicas=0 -n cpd-operators
  4. Edit the postgres cluster:
    oc edit cluster <WA cluster>
    1. Set instances to 1.
    2. Set maxSyncReplicas to 0.
    3. Set minSyncReplicas to 0.
    4. Save and quit the editor.
  5. Check for any remaining postgres pods:
    oc get pods |grep postgres
    There should be only one.
  6. Delete any pvc belonging to a standby postgres pod:
    Warning: Do not delete the remaining postgres pod or its Bound PVC, or you will lose your data. Ensure that you have taken backups, and that they are stored in a safe location, before you make any changes to postgres.
    oc delete pvc <pvc-name>
  7. Scale up the EDB cluster:
    oc edit cluster <WA Cluster>
    1. Set instances to 3.
    2. Set maxSyncReplicas to its original value.
    3. Set minSyncReplicas to its original value.
    4. Save and quit the editor.
  8. Monitor for the creation of new postgres pods and their status. This may take an extended amount of time depending on how much data is in postgres:
    oc get pods | grep postgres
  9. Scale the assistant operator deployment back to 1:
    oc scale deployment <deployment name> --replicas=1 -n cpd-operators
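
The interactive edits in steps 4 and 7 could also be applied as a merge patch. This is a sketch under assumptions, not the documented procedure: it assumes that `instances`, `minSyncReplicas`, and `maxSyncReplicas` sit directly under `spec` in the cluster resource, and the function name `edb_scale_patch` is hypothetical. Record the original values before you change them.

```shell
# Hypothetical helper: build a JSON merge patch for the EDB cluster spec.
# Arguments: instances, minSyncReplicas, maxSyncReplicas.
edb_scale_patch() {
  printf '{"spec":{"instances":%s,"minSyncReplicas":%s,"maxSyncReplicas":%s}}\n' \
    "$1" "$2" "$3"
}

# Example use against a live cluster, with <WA cluster> taken from step 1:
#   oc patch cluster <WA cluster> --type merge -p "$(edb_scale_patch 1 0 0)"
```

Building the patch in a function keeps the three values together, so scaling down and scaling back up differ only in their arguments.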