Troubleshooting issues in watsonx Assistant
If you encounter an installation issue with watsonx Assistant, such as a cluster node not starting as expected, access log files from the cluster to get more detail about the issue.
- Permissions you need for this task:
- You must be an administrator of the Red Hat® OpenShift® project.
Troubleshooting issues
One or more watsonx Assistant pods go to the ContainerStatusUnknown state
You can delete the pods in the ContainerStatusUnknown state by doing the following:
- Get the instance name and set the INSTANCE variable to that name:
  export INSTANCE=`oc get wa -n ${PROJECT_CPD_INST_OPERANDS} |grep -v NAME| awk '{print $1}'`
- Get the pods that are in the ContainerStatusUnknown state:
  oc get pods | grep ${INSTANCE}- | grep ContainerStatusUnknown
- Delete the pods that are in the ContainerStatusUnknown state individually:
  oc delete pod <unknown-state-pod>
- Confirm that there are no more pods in the ContainerStatusUnknown state:
  oc get pods | grep ${INSTANCE}- | grep ContainerStatusUnknown
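The delete step above handles one pod at a time. As a rough sketch, the affected pod names can be collected from the listing in one pass. The `unknown_pods` helper is hypothetical (it is not part of `oc`), and it assumes the default `oc get pods --no-headers` column layout, in which the pod name is the first column and STATUS is the third.

```shell
# Hypothetical helper: reads `oc get pods --no-headers` output on stdin and
# prints the names of pods for the given instance whose STATUS column is
# ContainerStatusUnknown.
unknown_pods() {
  # $1 = instance name, for example "wa"
  awk -v prefix="$1-" \
    'index($1, prefix) == 1 && $3 == "ContainerStatusUnknown" { print $1 }'
}

# Possible usage (assumes you are logged in and in the right project):
#   oc get pods --no-headers | unknown_pods "$INSTANCE" |
#     while read -r pod; do oc delete pod "$pod"; done
```

Matching on the STATUS column rather than grepping the whole line avoids picking up pods that only mention the string elsewhere in the row.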
Postgres cluster in bad state
Complete the following to recover the watsonx Assistant Postgres cluster:
- Install the CloudNativePG (CNP) plugin for EnterpriseDB:
  curl -sSfL \
    https://github.com/EnterpriseDB/kubectl-cnp/raw/main/install.sh | \
    sudo sh -s -- -b /usr/local/bin
- Query the status of wa-postgres to get the details of the Postgres cluster such as Name, Namespace, Primary instance, and Status:
  oc cnp status wa-postgres
  Cluster Summary
  Name: wa-postgres
  Namespace: zen
  PostgreSQL Image: icr.io/cpopen/edb/postgresql:12.16-4.18.0-amd64@sha256:93801fc5f515ede1b243c15ca79956a100143ce21d39d92bf85ebaff99a9dbd9
  Primary instance: wa-postgres-3
  Primary start time: 2023-12-13 12:24:06 +0000 UTC (uptime 5s)
  Status: Failing over Failing over from wa-postgres-2 to wa-postgres-3
  Instances: 3
  Ready instances: 0
- Check the pods that are failing:
  oc get pods | grep wa-postgres
- If wa-postgres-1 starts, destroy it because wa-postgres-3 is the Primary instance:
  oc cnp destroy wa-postgres 1
- If wa-postgres-2 starts, destroy it because wa-postgres-3 is the Primary instance:
  oc cnp destroy wa-postgres 2
- After wa-postgres-3 starts, the EnterpriseDB operator controller recreates two standby pods.
- Query the status of wa-postgres:
  oc cnp status wa-postgres
  Cluster Summary
  Name: wa-postgres
  Namespace: cpd-instance
  System ID: 7311733145743429658
  PostgreSQL Image: icr.io/cpopen/edb/postgresql:12.16-4.18.0-amd64@sha256:93801fc5f515ede1b243c15ca79956a100143ce21d39d92bf85ebaff99a9dbd9
  Primary instance: wa-postgres-3
  Primary start time: 2023-12-12 16:59:47 +0000 UTC (uptime 20h48m52s)
  Status: Cluster in healthy state <---- back in a healthy state.
  Instances: 3
  Ready instances: 3
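The two status listings above differ mainly in the Status and Ready instances fields. A minimal sketch of checking those fields from a script follows. The helper names `cnp_primary` and `cnp_is_healthy` are hypothetical, and the parsing assumes the plain-text `Key: value` layout shown in the sample output; the real output of your plugin version may differ.

```shell
# Both helpers read `oc cnp status <cluster>` text on stdin.

# Print the name of the primary instance.
cnp_primary() {
  awk -F': *' '/^Primary instance:/ { print $2 }'
}

# Succeed only when the Status line mentions "healthy" and the number of
# ready instances equals the number of instances.
cnp_is_healthy() {
  awk -F': *' '
    /^Status:/          { status = $2 }
    /^Instances:/       { total = $2 + 0 }
    /^Ready instances:/ { ready = $2 + 0 }
    END { exit !(status ~ /healthy/ && total > 0 && ready == total) }
  '
}

# Possible usage:
#   oc cnp status wa-postgres | cnp_is_healthy && echo "cluster is healthy"
```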
Restarting watsonx Assistant deployments in the correct sequence
Run the following script to restart the watsonx Assistant pods. First, the script restarts the watsonx Assistant microservices that must start sequentially. Then it performs a rolling restart of the watsonx Assistant deployments that do not need to start sequentially. The script skips deployments owned by watsonx Assistant dependencies. If some deployments do not start after the rolling restart, contact IBM Support, because that requires more troubleshooting.
INSTANCE="wa" # Replace watsonx Assistant instance name if different
# Restart the microservices that must start sequentially, waiting for each one.
for DEPLOYMENT in ed dragonfly-clu-mm tfmm clu-triton-serving clu-serving nlu dialog store
do
  echo "#Starting rolling restart of $INSTANCE-$DEPLOYMENT."
  oc rollout restart deployment "$INSTANCE-$DEPLOYMENT"
  oc rollout status "deployment/$INSTANCE-$DEPLOYMENT" --watch=true
  echo "#Rolling restart of $INSTANCE-$DEPLOYMENT completed successfully."
done
# Trigger the remaining restarts without waiting between them.
for DEPLOYMENT in analytics clu-embedding incoming-webhooks integrations recommends spellchecker-mm store-sync system-entities ui webhooks-connector gw-instance store-admin
do
  echo "#Starting rolling restart of $INSTANCE-$DEPLOYMENT"
  oc rollout restart deployment "$INSTANCE-$DEPLOYMENT"
done
# Wait for each of the remaining deployments to finish rolling out.
for DEPLOYMENT in analytics clu-embedding incoming-webhooks integrations recommends spellchecker-mm store-sync system-entities ui webhooks-connector gw-instance store-admin
do
  oc rollout status "deployment/$INSTANCE-$DEPLOYMENT" --watch=true
  echo "#Rolling restart of $INSTANCE-$DEPLOYMENT completed successfully."
done
echo "# All watsonx Assistant deployments restarted successfully."
Increasing watsonx Assistant etcd resources
When an out-of-memory error is received, complete the following steps to increase the etcd resources:
- Export the watsonx Assistant instance:
  export INSTANCE=`oc get wa -n ${PROJECT_CPD_INST_OPERANDS} |grep -v NAME| awk '{print $1}'`
- Check the PVC status:
  oc get pvc | grep $INSTANCE-etcd
- Scale down the operator:
  oc scale deploy ibm-etcd-operator --replicas=0
- Scale down STS for etcd:
  oc get sts | grep $INSTANCE-etcd
  oc scale sts/$INSTANCE-etcd --replicas=0
- Update PVC size in YAML for all the instances of PVC:
  - Edit PVC:
    oc patch pvc data-$INSTANCE-etcd-0 -p '{"spec": {"resources": {"requests": {"storage": "4Gi"}}}}'
    oc patch pvc data-$INSTANCE-etcd-1 -p '{"spec": {"resources": {"requests": {"storage": "4Gi"}}}}'
    oc patch pvc data-$INSTANCE-etcd-2 -p '{"spec": {"resources": {"requests": {"storage": "4Gi"}}}}'
  - Check if resources.requests.storage got updated.
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 4Gi
- Scale up STS for etcd:
  oc scale sts/$INSTANCE-etcd --replicas=3
  oc get sts | grep $INSTANCE-etcd
- Verify the storageSize in the etcd CR is the same value as the PVC size. It must be the same size or the PVC size increase is ignored.
  oc get etcdcluster $INSTANCE-etcd -o jsonpath="{.spec.storageSize}"
  4Gi
- If the storageSize of the current etcdCluster does not match the PVC size, then it needs to be changed. Start by modifying the value of the Assistant Operator.
  - Patch watsonx Assistant CR to increase storageSize:
    oc patch wa wa --type json -p '[{ "op": "replace", "path": "/spec/datastores/etcd/storageSize", "value": "4Gi" }]'
  - Delete the existing etcdCluster CR:
    oc delete etcdcluster $INSTANCE-etcd
  - Scale up the etcd operator:
    oc scale deploy ibm-etcd-operator --replicas=1
  - Scale up the etcd STS:
    oc scale sts/$INSTANCE-etcd --replicas=3
    oc get sts | grep $INSTANCE-etcd
  - Wait for the etcdCluster CR to be re-created and then reverify storageSize:
    oc get etcdcluster $INSTANCE-etcd -o jsonpath="{.spec.storageSize}"
    4Gi
  - After new storageSize is verified, delete the etcd statefulSet:
    oc delete sts $INSTANCE-etcd
  - Wait for the statefulSet to be re-created. New etcd instances should now have the appropriate storage size.
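The three PVC patch commands in the procedure above differ only in the PVC index, so they can be written as a loop. This is a sketch under assumptions: the `patch_etcd_pvcs` function and the `OC` override variable are hypothetical conveniences, and the PVC naming (`data-<instance>-etcd-<n>`) and replica count of 3 are taken from the commands shown in the steps.

```shell
OC="${OC:-oc}"   # command to run; set OC=echo to print the commands instead

# Patch the storage request of every etcd PVC of an instance.
# $1 = instance name, $2 = new size (for example 4Gi),
# $3 = number of etcd replicas (defaults to 3, as in the steps above)
patch_etcd_pvcs() {
  i=0
  while [ "$i" -lt "${3:-3}" ]; do
    "$OC" patch pvc "data-$1-etcd-$i" \
      -p "{\"spec\": {\"resources\": {\"requests\": {\"storage\": \"$2\"}}}}"
    i=$((i + 1))
  done
}

# Possible usage:
#   OC=echo patch_etcd_pvcs "$INSTANCE" 4Gi   # dry run, prints the commands
#   patch_etcd_pvcs "$INSTANCE" 4Gi           # applies the patches
```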
If this does not fix the issue, restart the store by restarting these CLU components in order:
- Find and delete the ed pod(s):
  oc get pod |grep $INSTANCE-ed
  wa-ed-84df869b74-dp62m 3/3 Running 0 16h
  oc delete pod wa-ed-84df869b74-dp62m
  Make sure it comes back up fully:
  watch "oc get pod |grep $INSTANCE-ed"
- Find and delete the tf-mm pod(s):
  oc get pod |grep $INSTANCE-tfmm
  wa-tfmm-659857db48-mgnnt 3/3 Running 0 16h
  oc delete pod wa-tfmm-659857db48-mgnnt
  Make sure it comes back up fully:
  watch "oc get pod |grep $INSTANCE-tfmm"
- Find and delete the clu-triton-serving pod:
  oc get pod |grep $INSTANCE-clu-triton-serving
  wa-clu-triton-serving-7df477994c-m27pz 1/1 Running 0 3h52m
  oc delete pod wa-clu-triton-serving-7df477994c-m27pz
  Make sure it comes back up fully:
  watch "oc get pod |grep $INSTANCE-clu-triton-serving"
- Find and delete the clu-serving pod(s):
  oc get pod |grep $INSTANCE-clu-serving
  wa-clu-serving-695784897c-l84c2 2/2 Running 0 16h
  oc delete pod wa-clu-serving-695784897c-l84c2
  Make sure it comes back up fully:
  watch "oc get pod |grep $INSTANCE-clu-serving"
- Find and delete the nlu pod(s):
  oc get pod |grep $INSTANCE-nlu
  wa-nlu-5cf78d749d-9g9rs 1/1 Running 5 17h
  oc delete pod wa-nlu-5cf78d749d-9g9rs
  Make sure it comes back up fully:
  watch "oc get pod |grep $INSTANCE-nlu"
- Find and delete the dialog pod(s):
  oc get pod |grep $INSTANCE-dialog
  wa-dialog-6cc9979774-g97v6 1/1 Running 0 16h
  oc delete pod wa-dialog-6cc9979774-g97v6
  Make sure it comes back up fully:
  watch "oc get pod |grep $INSTANCE-dialog"
- Find and delete the store pod(s):
  oc get pod |grep $INSTANCE-store
  wa-store-6cc9979774-g97v6 1/1 Running 0 16h
  oc delete pod wa-store-6cc9979774-g97v6
  Make sure it comes back up fully:
  watch "oc get pod |grep $INSTANCE-store"
- Check the status of pods. Ensure that they are all in running state.
- Retrain your old dialog skills if required.
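The find, delete, and wait steps above repeat for each component, so they can be sketched as one loop. This is an illustration under assumptions, not a supported script: the function names and the `OC` override variable are hypothetical, pod names are assumed to follow the `<instance>-<component>-<hash>-<suffix>` pattern seen in the sample output, and `oc rollout status` (as used in the restart script earlier) stands in for the manual `watch`. The stricter name pattern also keeps `store` from matching `store-sync` or `store-admin` pods, which a plain grep for `$INSTANCE-store` would include.

```shell
OC="${OC:-oc}"   # command to run; can be replaced by a stub for testing

# Components in the restart order given in the steps above.
clu_restart_order() {
  printf '%s\n' ed tfmm clu-triton-serving clu-serving nlu dialog store
}

# Read `oc get pod --no-headers` text on stdin and print the pod names that
# belong to exactly one component. $1 = instance, $2 = component
component_pods() {
  awk -v re="^$1-$2-[a-z0-9]+-[a-z0-9]+\$" '$1 ~ re { print $1 }'
}

# Delete the pods of each component in order, waiting for the deployment to
# report a complete rollout before moving on. $1 = instance
restart_in_order() {
  for c in $(clu_restart_order); do
    "$OC" get pod --no-headers | component_pods "$1" "$c" |
      while read -r pod; do
        "$OC" delete pod "$pod"
      done
    "$OC" rollout status "deployment/$1-$c" --watch=true
  done
}

# Possible usage:
#   restart_in_order "$INSTANCE"
```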
Checking the status of EDB pods
Run the following command to check the status of the EDB pods:
oc get po | grep postgres
The pods should be in the Running state. If they are not, complete the following steps:
- Retrieve the EDB cluster name:
  oc get cluster | grep postgres
- Retrieve the watsonx Assistant operator deployment:
  oc get deployments -n cpd-operators | grep assistant
- Scale down the watsonx Assistant operator deployment in the proper namespace:
  oc scale deployment <deployment> --replicas=0 -n cpd-operators
- Edit the postgres cluster:
  oc edit cluster <WA cluster>
  - Set instances to 1.
  - Set maxSyncReplicas to 0.
  - Set minSyncReplicas to 0.
  - Save and quit the editor.
- Check for any remaining postgres pods:
  oc get pods |grep postgres
  There should be only one.
- Delete any PVC belonging to a standby postgres pod:
  Warning: Do not delete the remaining BOUND postgres pod, or you will lose your data. Ensure that you have taken backups (and that they are in a safe location) before you make any changes to postgres.
  oc delete pvc <pvc-name>
- Scale up the EDB cluster:
  oc edit cluster <WA cluster>
  - Set instances to 3.
  - Set maxSyncReplicas to its original value.
  - Set minSyncReplicas to its original value.
  - Save and quit the editor.
- Monitor for the creation of new postgres pods and their status. This may take an extended amount of time depending on how much data is in postgres:
  oc get pods | grep postgres
- Scale the watsonx Assistant operator deployment back to 1:
  oc scale deployment <deployment> --replicas=1 -n cpd-operators
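Two of the checks in this procedure (are all postgres pods Running, and is exactly one postgres pod left before any PVC is deleted) can be scripted as a safeguard. The helpers below are a hypothetical sketch; they assume the default `oc get pods --no-headers` column layout, with the pod name in the first column and STATUS in the third.

```shell
# Both helpers read `oc get pods --no-headers` output on stdin.

# Print the names of postgres pods whose STATUS column is not Running.
postgres_pods_not_running() {
  awk '$1 ~ /postgres/ && $3 != "Running" { print $1 }'
}

# Succeed only when exactly one postgres pod is listed.
only_one_postgres_pod() {
  count=$(awk '$1 ~ /postgres/ { n++ } END { print n + 0 }')
  [ "$count" -eq 1 ]
}

# Possible usage before deleting a standby PVC:
#   oc get pods --no-headers | only_one_postgres_pod ||
#     echo "More than one postgres pod remains; do not delete PVCs yet."
```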