Limitations and known issues in Watson Assistant
The following limitations and known issues apply to Watson Assistant.
- Watson Assistant Redis pods not starting because quota is applied to the namespace
- Watson Assistant Redis pods not running after cluster restart
- Some Watson Assistant pods do not have annotations that are used for scheduling
- The etcd CSV status changes from Succeeded to Installing
- Watson Assistant upgrade gets stuck at apply-cr
- Watson Assistant upgrade gets stuck at apply-cr or training does not work after the upgrade completes successfully
- A few Watson Assistant pods in CrashLoopBackOff and RabbitMQ pods are missing after running apply-olm
- OpenShift upgrade hangs because some Watson Assistant pods do not quiesce
- Increasing backup storage for wa-store-cronjob pods that run out of space
- Watson Assistant air gap installation did not complete because the wa-incoming-webhooks pod did not start
- Insufficient Scope error when mirroring images
- Analytics service is not working for a large size Watson Assistant deployment
- Watson Assistant upgrade or installation fails as the Watson Assistant UI and Watson Gateway operator pods continuously restart or get evicted
- ETCD pods are on "CrashLoopBackOff" while changing Watson Assistant size from "medium" to "large"
- Inaccurate status message from command line after upgrade
- UnsupportedKafkaVersionException in Kafka CR
For a complete list of known issues and troubleshooting information for all versions of Watson Assistant, see Troubleshooting known issues. For a complete list of known issues for Cloud Pak for Data, see Limitations and known issues in Cloud Pak for Data.
Watson Assistant Redis pods not starting because quota is applied to the namespace
Applies to: 4.0 and later
- Problem
- Redis pods fail to start due to the following error:
  Warning FailedCreate 51s (x15 over 2m22s) statefulset-controller create Pod c-wa-redis-m-0 in StatefulSet c-wa-redis-m failed error: pods "c-wa-redis-m-0" is forbidden: failed quota: cpd-quota: must specify limits.cpu,limits.memory,requests.cpu,requests.memory
- Cause
- Redis pods cannot start if a quota is applied to the namespace but a limit range is not set. Because the Redis init containers do not specify limits.cpu, limits.memory, requests.cpu, or requests.memory, an error occurs.
- Solution
- Apply a limit range with defaults for limits and requests. Modify the namespace in the following YAML to the namespace where Cloud Pak for Data is installed:
  apiVersion: v1
  kind: LimitRange
  metadata:
    name: cpu-resource-limits
    namespace: zen  # Change it to the namespace where CPD is installed
  spec:
    limits:
    - default:
        cpu: 300m
        memory: 200Mi
      defaultRequest:
        cpu: 200m
        memory: 200Mi
      type: Container
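  For example, assuming you save the YAML above to a file named limitrange.yaml (the file name is illustrative), a sketch like the following applies it, confirms the defaults, and deletes the stuck pod so that the StatefulSet recreates it:
    oc apply -f limitrange.yaml
    oc describe limitrange cpu-resource-limits -n zen
    oc delete pod c-wa-redis-m-0 -n zen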
Watson Assistant Redis pods not running after cluster restart
Applies to: 4.5.0 and later
- Problem
- Watson Assistant pods do not restart successfully after the cluster is restarted.
- Cause
- When the cluster is restarted, Redis is not restarting properly. This issue prevents Watson Assistant from restarting successfully.
- Solution
-
- Get the instance name by running oc get wa. Set the INSTANCE variable to that name.
- Get the unhealthy Redis pods:
  oc get pods | grep ${INSTANCE}- | grep redis | grep -v Running
- For each unhealthy Redis pod, restart the pod:
  oc delete pod <unhealthy-redis-pod>
- Confirm that there are no more unhealthy Redis pods:
  oc get pods | grep ${INSTANCE}- | grep redis | grep -v Running
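  If several Redis pods are unhealthy, a one-line sketch such as the following deletes them all at once (assumes INSTANCE is set as in the first step):
    oc get pods | grep ${INSTANCE}- | grep redis | grep -v Running | awk '{print $1}' | xargs -r oc delete pod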
Some Watson Assistant pods do not have annotations that are used for scheduling
Applies to: 4.6.0
- Problem
- Some Watson Assistant pods are missing the cloudpakInstanceId annotation.
- Impact
- If you use the IBM Cloud Pak for Data scheduling service, any Watson Assistant pods without the cloudpakInstanceId annotation are:
  - Scheduled by the default Kubernetes scheduler rather than the scheduling service
  - Not included in the quota enforcement
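  To see which pods are missing the annotation, a sketch such as the following can help (assumes the jq utility is available and that PROJECT_CPD_INSTANCE is set to the project where the instance is installed):
    oc get pods -n ${PROJECT_CPD_INSTANCE} -o json | \
      jq -r '.items[] | select(.metadata.annotations.cloudpakInstanceId == null) | .metadata.name'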
The etcd CSV status changes from Succeeded to Installing
Applies to: 4.6.2 and later
When you run the cpd-cli manage apply-olm command, the status of the etcd CSV is Succeeded. Later, the status changes to Installing.
- Diagnosing the problem
-
- Check the status of the etcd operator pods:
  oc get pods -n ${PROJECT_CPD_OPS} | grep ibm-etcd
  If the status of the pod is not Running, review the pod logs.
- Review the etcd operator pod logs, using the pod name from the previous step:
  oc logs -n ${PROJECT_CPD_OPS} <ibm-etcd-operator-pod-name>
  Look for the following error message:
  ERROR! A worker was found in a dead state\u001b[0m\n","job":"job-ID","name":"wa-data-governor-etcd","namespace":"project-name", "error":"exit status 1","stacktrace":"github.com/operator-framework/operator-sdk/internal/ansible/runner.(*runner).Run.func1\n\t/workspace/internal/ansible/runner/runner.go:269"}
- Resolving the problem
-
Contact IBM® Support for assistance.
Watson Assistant upgrade gets stuck at apply-cr
Applies to: 4.0.x to 4.6.0 upgrade
- Problem
- The clu-training pods are in CrashLoopBackOff state after apply-olm completes, Watson Assistant hangs during the ibmcpd upgrade, or the upgrade hangs during the apply-cr command with the message: pre-apply-cr release patching (if any) for watson_assistant]
- Cause
- After apply-olm, model train pods might go into a bad state, causing the apply-cr command for Watson Assistant version 4.6.0 to stall.
command for Watson Assistant version 4.6.0 to stall. - Solution
-
Run the following commands after running the
apply-olm
command:- Export the name of your Watson
Assistant instance
as an environment
variable:
export INSTANCE=`oc get wa -n ${PROJECT_CPD_INSTANCE} |grep -v NAME| awk '{print $1}'`
- Recreate the ModelTrain training
job:
oc delete modeltraindynamicworkflows.modeltrain.ibm.com ${INSTANCE}-dwf # This command may take some time to complete. Delete the modeltraindynamicworkflows.modeltrain.ibm.com CR finalizer, if the command does not complete in few minutes oc delete pvc -l release=${INSTANCE}-dwf-ibm-mt-dwf-rabbitmq oc delete deploy ${INSTANCE}-clu-training-${INSTANCE}-dwf oc delete secret/${INSTANCE}-clu-training-secret job/${INSTANCE}-clu-training-create job/${INSTANCE}-clu-training-update secret/${INSTANCE}-dwf-ibm-mt-dwf-server-tls-secret secret/${INSTANCE}-dwf-ibm-mt-dwf-client-tls-secret oc delete secret registry-${INSTANCE}-clu-training-${INSTANCE}-dwf-training
Expect it to take at least 30 minutes for the new training job to take effect and the status to change to
Completed
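  To watch for the recreated training job to finish, a check such as the following can be used (the grep pattern is illustrative):
    oc get jobs -n ${PROJECT_CPD_INSTANCE} | grep ${INSTANCE}-clu-training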
Watson Assistant upgrade gets stuck at apply-cr or training does not work after the upgrade completes successfully
Applies to: 4.0.x to 4.6.0 upgrade
- Problem
- The etcdclusters custom resources, wa-data-governor-etcd and wa-etcd, show the following patching error:
  "Failed to patch object: b''{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"StatefulSet.apps \\"wa-etcd\\" is invalid:"
- Solution
- To check if there is an error in etcdcluster:
  - Export the name of your Watson Assistant instance as an environment variable:
    export INSTANCE=`oc get wa -n ${PROJECT_CPD_INSTANCE} |grep -v NAME| awk '{print $1}'`
  - Describe the etcdcluster custom resources:
    oc describe etcdcluster ${INSTANCE}-data-governor-etcd
    oc describe etcdcluster ${INSTANCE}-etcd
  Complete the following steps to fix the issue:
  - Export the name of your Watson Assistant instance as an environment variable:
    export INSTANCE=`oc get wa -n ${PROJECT_CPD_INSTANCE} |grep -v NAME| awk '{print $1}'`
  - Delete the following components:
    oc delete job ${INSTANCE}-create-slot-job
    oc delete etcdcontent ${INSTANCE}-data-exhaust-tenant-analytics ${INSTANCE}-data-governor ${INSTANCE}-data-governor-ibm-data-governor-data-exhaust-internal
    oc delete etcdcluster ${INSTANCE}-data-governor-etcd ${INSTANCE}-etcd
    Expect it to take at least 20 minutes for the etcdcluster custom resources to come back again and become healthy.
    To confirm, enter oc get etcdclusters and ensure you get output similar to:
    NAME                    AGE
    wa-data-governor-etcd   116m
    wa-etcd                 120m
  - Recreate the clu subsystem component:
    oc delete clu ${INSTANCE}
    Expect it to take at least 20 minutes for the clu subsystem to come back again and become healthy.
    To confirm, enter oc get wa and ensure you get output similar to:
    NAME   VERSION   READY   READYREASON   UPDATING   UPDATINGREASON   DEPLOYED   VERIFIED   AGE
    wa     4.6.0     True    Stable        False      Stable           18/18      18/18      12h
A few Watson Assistant pods in CrashLoopBackOff and RabbitMQ pods are missing after running apply-olm
Applies to: 4.0.8 to 4.6.0 upgrade
- Problem
-
  After apply-olm, the RabbitMQ cluster has no pods running and the wa-dwf-ibm-mt-dwf-rabbitmq-ibm-rabbitmq-backup-label job failed to complete. The pod started by this job has the following error:
  ***** RabbitMQ PVCs did not become ready in time *****
  /bin/bash: line 8: kubectl: command not found
  Retries=0 waiting for all PVCs to be ready
  The wa-dwf-ibm-mt-dwf-trainer and wa-clu-training-wa-dwf pods are in CrashLoopBackOff status and the RabbitMQ cluster is not stable.
status and the RabbitMQ cluster is not stable. - Solution
-
Delete the
wa-dwf-ibm-mt-dwf-rabbitmq-ibm-rabbitmq-backup-label
job:oc delete job wa-dwf-ibm-mt-dwf-rabbitmq-ibm-rabbitmq-backup-label
As a result, the RabbitMQ pods,
wa-dwf-ibm-mt-dwf-trainer
andwa-clu-training-wa-dwf pods
should recover. Continue with theapply-cr
command to upgrade to Watson Assistant version 4.6.0.
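  To confirm the recovery, a check such as the following can be used (the grep pattern matches the pod names mentioned above; adjust it if your instance name differs):
    oc get pods -n ${PROJECT_CPD_INSTANCE} | grep -E 'wa-dwf-ibm-mt-dwf|wa-clu-training'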
OpenShift upgrade hangs because some Watson Assistant pods do not quiesce
Applies to: 4.0.x and later
- Problem
- Some Watson Assistant pods do not quiesce and might not automatically drain, causing the OpenShift upgrade to pause.
- Cause
- The quiesce capability is not entirely supported by Watson Assistant, so some of the pods continue running.
- Solution
- Watson Assistant quiesce is optional when you upgrade the OpenShift cluster. Monitor the node that is being upgraded for any pods that do not drain automatically (causing the upgrade to hang). To enable the OpenShift upgrade to continue, delete any pod that is not draining so that it can be rescheduled on another node. An example follows.
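  A sketch of that check (the node and pod names are placeholders):
    oc get pods -n ${PROJECT_CPD_INSTANCE} --field-selector spec.nodeName=<node-name>
    oc delete pod <stuck-pod-name> -n ${PROJECT_CPD_INSTANCE}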
Increasing backup storage for wa-store-cronjob pods that run out of space
Applies to: 4.5.3 and later
- Problem
- The nightly scheduled wa-store-cronjob pods eventually fail with No space left on device.
pods eventually fail with No space left on device. - Cause
- The size of the cronjob backup PVC needs to be increased.
- Solution
- Edit the size of the backup storage from 1Gi to 2Gi:
  - Edit the CR:
    oc edit wa wa
  - In the configOverrides section, add:
    store_db:
      backup:
        size: 2Gi
  After 5-10 minutes, the PVC should be resized. If the problem persists, increase the size to a larger value, for example, 4Gi.
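  To confirm the resize, you can check the PVC capacity; a sketch (the grep pattern for the backup PVC name is an assumption):
    oc get pvc -n ${PROJECT_CPD_INSTANCE} | grep store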
Watson Assistant air gap installation did not complete because the wa-incoming-webhooks pod did not start
Applies to: 4.6.0
- Problem
- The Watson Assistant air gap installation did not complete because the wa-incoming-webhooks pod did not start.
- Solution
-
  - Copy the docker image cp.icr.io/cp/watson-assistant/incoming-webhooks:20230118-191010-9e76035-cp4d_startup_issue@sha256:24a5d8db910301e6bcdfeeb5f196adcd28d3ad82fa3f0ead0e21785da7232fd9 to your private registry.
    If you have a system that can connect to both the entitled registry and the private registry, the skopeo utility can download and copy the docker image to the private registry:
    skopeo copy --src-creds cp:${ENTITLED_REGISTRY_APIKEY} \
      --dest-creds ${PRIVATE_REGISTRY_USERNAME}:${PRIVATE_REGISTRY_APIKEY} \
      --dest-tls-verify=false --src-tls-verify=false \
      docker://cp.icr.io/cp/watson-assistant/incoming-webhooks:20230118-191010-9e76035-cp4d_startup_issue \
      docker://${PRIVATE_REGISTRY_NAME}/cp/watson-assistant/incoming-webhooks:20230118-191010-9e76035-cp4d_startup_issue
  - Modify the Watson Assistant CR to specify the custom incoming_webhooks image, which is built to work in air gap environments:
    configOverrides:
      container_images:
        incoming_webhooks:
          tag: 20230118-191010-9e76035-cp4d_startup_issue@sha256:24a5d8db910301e6bcdfeeb5f196adcd28d3ad82fa3f0ead0e21785da7232fd9
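  After the CR change, you can confirm that the pod started; a sketch:
    oc get pods -n ${PROJECT_CPD_INSTANCE} | grep incoming-webhooks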
Insufficient Scope error when mirroring images
- Error
- When you mirror the containers for Watson services, an InsufficientScope error is displayed.
error is displayed. - Cause
- The Watson services do not specify the EDB PostgreSQL operator details. They rely on the EDB bundle to specify them. However, the EDB bundle includes both standard and enterprise images and applies the enterprise image by default. Watson services are entitled to use the standard image only.
- Solution
- Replace the EDB PostgreSQL enterprise image with the standard image. To do so, complete the following steps:
  - Get a list of the images by using the following command:
    ./cpd-cli manage list-images --release=$VERSION --components=<components>
    Specify the components that you want to check. For example, watson_assistant or watson_discovery. To specify both, add a comma between the two component names.
  - Replace the advanced image. For example, the following command replaces the advanced image for Watson Assistant:
    sed -i -e '/edb-postgres-advanced/d' \
      ./cpd-cli-workspace/olm-utils-workspace/work/offline/$VERSION/watson_assistant/ibm-cloud-native-postgresql-*-images.csv
  - Repeat the list-images command from the first step to confirm that edb-postgres-advanced is no longer listed for the component.
Analytics service is not working for a large size Watson Assistant deployment
Applies to: 4.6.3
- Problem
- The Analytics service does not show the User conversations and overview data.
- Cause
- Creation of the Store Kafka topics fails because of a mismatch in the replica settings of the Kafka topics.
- Solution
- Create a temporary patch to fix the topic size:
  - Export the name of your Watson Assistant instance as an environment variable:
    export INSTANCE=`oc get wa -n ${PROJECT_CPD_INSTANCE} |grep -v NAME| awk '{print $1}'`
  - Create the following temporary patch:
    cat <<EOF | oc apply -f -
    apiVersion: assistant.watson.ibm.com/v1
    kind: TemporaryPatch
    metadata:
      name: ${INSTANCE}-kafka-replicas
      namespace: ${PROJECT_CPD_INSTANCE}
    spec:
      apiVersion: assistant.watson.ibm.com/v1
      kind: WatsonAssistant
      name: ${INSTANCE}
      patchType: patchStrategicMerge
      patch:
        data-governor:
          dataexhaustoverride:
            spec:
              dependencies:
                kafka:
                  offsetsTopicReplicationFactor:
                    override: 3
                  replicas:
                    override: 3
    EOF
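  To verify that the patch resource was created, a check such as the following can be used (the resource kind and name are taken from the YAML above):
    oc get temporarypatch ${INSTANCE}-kafka-replicas -n ${PROJECT_CPD_INSTANCE}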
Watson Assistant upgrade or installation fails as the Watson Assistant UI and Watson Gateway operator pods continuously restart or get evicted
Applies to: 4.6.0 to 4.6.3 (fixed in 4.6.5)
- Problem
- The Watson Assistant UI pods continuously restart because the gateway deployment is missing. The gateway-operator pod is evicted, restarting, or in CrashLoopBackOff because of an out-of-memory error, and the Watson Assistant upgrade or installation fails.
- Solution
- To diagnose the problem, determine whether the gateway-operator pods were evicted:
  - Check the status of the operator pods:
    oc get pods -n ${PROJECT_CPD_OPS} | grep gateway-operator
  - If any of the pods are in the Evicted state, get the name of the operator:
    oc get csv -n ${PROJECT_CPD_OPS} \
      -l operators.coreos.com/gateway-operator.${PROJECT_CPD_OPS}
    You will need the name to resolve the problem. The name has the following format: gateway-operator.vX.X.X. Set the OP_NAME environment variable to the name of the operator:
    export OP_NAME=<operator-name>
    Edit the operator CSV:
    oc edit csv $OP_NAME
  - If any of the pods are in the CrashLoopBackOff state and the CSV is not installed yet, edit the operator CSV:
    oc edit csv -n ${PROJECT_CPD_OPS} gateway-operator-vX.X.X
  - Increase the memory limits and requests in the operator CSV to 1Gi:
    resources:
      limits:
        cpu: 500m
        ephemeral-storage: 1Gi
        memory: 1Gi
      requests:
        cpu: 100m
        ephemeral-storage: 300Mi
        memory: 1Gi
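  After you save the CSV, OLM redeploys the operator. A sketch to confirm that the new memory limit reached the running pod:
    oc get pods -n ${PROJECT_CPD_OPS} -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].resources.limits.memory}{"\n"}{end}' | grep gateway-operator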
ETCD pods are on "CrashLoopBackOff" while changing Watson Assistant size from "medium" to "large"
Applies to: 4.6.3
- Problem
-
  While trying to increase Watson Assistant from size "medium" to size "large", the etcd pods go into CrashLoopBackOff.
- Cause
- The problem occurs when an additional etcd pod is deployed, but it does not get the root user and password applied from the ibm-etcd-operator.
- Solution
-
  Add the ETCDCTL_USER environment property back to the etcd statefulset:
  - Determine the authSecretName used in the etcd deployment:
    oc get etcdcluster wa-etcd \
      --namespace=${PROJECT_CPD_INSTANCE} \
      -o yaml | grep authSecretName
    The output of that command is: authSecretName: <auth_secret_name>.
  - Edit the statefulset:
    oc edit sts wa-etcd \
      --namespace=${PROJECT_CPD_INSTANCE}
  - Look for the env section in the statefulset:
    env:
    - name: INITIAL_CLUSTER_SIZE
      value: "5"
    - name: CLUSTER_NAME
      value: ibm-etcd-instance
    - name: ETCDCTL_API
      value: "3"
    - name: ENABLE_CLIENT_CERT_AUTH
      value: "false"
  - Add the following snippet after the ENABLE_CLIENT_CERT_AUTH key-value pair. Before you add this snippet, replace <auth_secret_name> with the value that was returned by the oc get etcdcluster command.
    - name: USERNAME
      value: root
    - name: PASSWORD
      valueFrom:
        secretKeyRef:
          key: password
          name: <auth_secret_name>
    - name: ETCDCTL_USER
      value: $(USERNAME):$(PASSWORD)
    This allows the etcd root user and password to be applied to any additional etcd pods that are added to the deployment.
  - After all the pods restart, you can edit the etcdcluster to change the size.
  If you apply this workaround, you must manually rotate the secret reference in the statefulset.
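  To watch the etcd pods come back after the statefulset edit, a check such as the following can be used:
    oc get pods --namespace=${PROJECT_CPD_INSTANCE} | grep wa-etcd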
Inaccurate status message from command line after upgrade
- Problem
- If you run the cpd-cli service-instance upgrade command from the Cloud Pak for Data command-line interface, and then use the service-instance list command to check the status of each service, the provision status for the service is listed as UPGRADE_FAILED.
- Cause of the problem
- When you upgrade the service, only the cpd-cli manage apply-cr command is supported. You cannot use the cpd-cli service-instance upgrade command to upgrade the service. After you upgrade the service with the apply-cr method, the change in version and status is not recognized by the service-instance command. However, the correct version is displayed in the Cloud Pak for Data web client.
- Resolving the problem
- No action is required. As long as you use the cpd-cli manage apply-cr method to upgrade the service as documented, the upgrade is successful, and you can ignore the version and status information that is generated by the cpd-cli service-instance list command.
UnsupportedKafkaVersionException in Kafka CR
Applies to: 4.6.x
- Problem
- During a Watson Assistant upgrade (depending on the version of Data Governor that you are upgrading from and to), the reconciliation of the Kafka custom resource (CR) is halted with the following error in the Kafka CR's status:
  Status:
    Conditions:
      Last Transition Time:  2022-11-08T02:12:23.574Z
      Message:               Unsupported Kafka.spec.kafka.version: 2.6.0. Supported versions are: [3.2.3]
      Reason:                UnsupportedKafkaVersionException
- Solution
- Edit the Kafka CR named <wa instance name>-data-governor-kafka and remove the version field:
  oc patch kafka wa-data-governor-kafka --type=json -p="[{'op': 'remove', 'path': '/spec/kafka/version'}]"
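  To confirm that reconciliation resumed, you can re-check the conditions in the CR status; a sketch:
    oc get kafka wa-data-governor-kafka -o jsonpath='{.status.conditions}'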