Limitations and known issues in Watson Assistant

The following limitations and known issues apply to Watson Assistant.

For a complete list of known issues and troubleshooting information for all versions of Watson Assistant, see Troubleshooting known issues. For a complete list of known issues for Cloud Pak for Data, see Limitations and known issues in Cloud Pak for Data.

Watson Assistant Redis pods not starting because quota is applied to the namespace

Applies to: 4.0 and later

Problem
Redis pods fail to start due to the following error:
Warning  FailedCreate      51s (x15 over 2m22s)  statefulset-controller  create Pod c-wa-redis-m-0 in StatefulSet c-wa-redis-m failed error: pods "c-wa-redis-m-0" is forbidden: failed quota: cpd-quota: must specify limits.cpu,limits.memory,requests.cpu,requests.memory
Cause
Redis pods cannot start if a resource quota is applied to the namespace but no LimitRange is set. Because the Redis init containers do not specify limits.cpu, limits.memory, requests.cpu, or requests.memory, the quota check rejects the pods and an error occurs.
Solution
Apply a LimitRange with default values for limits and requests. Change the namespace in the following YAML to the namespace where Cloud Pak for Data is installed:
apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-resource-limits
  namespace:  zen  #Change it to the namespace where CPD is installed
spec:
  limits:
  - default:
      cpu: 300m
      memory: 200Mi
    defaultRequest:
      cpu: 200m
      memory: 200Mi
    type: Container
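
For example, assuming that you save the preceding YAML as cpu-resource-limits.yaml, you can apply it and confirm the result with:

# Apply the LimitRange in the Cloud Pak for Data namespace
oc apply -f cpu-resource-limits.yaml

# Confirm that the LimitRange exists (replace zen with your namespace)
oc get limitrange cpu-resource-limits -n zen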

Watson Assistant Redis pods not running after cluster restart

Applies to: 4.5.0 and later

Problem
Watson Assistant pods do not restart successfully after the cluster is restarted.
Cause
When the cluster is restarted, Redis does not restart properly, which prevents Watson Assistant from restarting successfully.
Solution
  1. Get the instance name by running oc get wa, and set the INSTANCE environment variable to that name (a combined sketch of these steps follows the list).
  2. Get the unhealthy redis pods:
    oc get pods | grep ${INSTANCE}- | grep redis | grep -v Running
  3. For each unhealthy redis pod, restart the pod:
    oc delete pod <unhealthy-redis-pod>
  4. Confirm that there are no more unhealthy redis pods:
    oc get pods | grep ${INSTANCE}- | grep redis | grep -v Running
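
A minimal end-to-end sketch of these steps, assuming a single Watson Assistant instance in the project and that your oc context points to that project (the export command mirrors the one used elsewhere in this document):

# Step 1: set INSTANCE to the name of the Watson Assistant custom resource
export INSTANCE=`oc get wa -n ${PROJECT_CPD_INSTANCE} | grep -v NAME | awk '{print $1}'`

# Steps 2 and 3: delete every Redis pod that is not in the Running state
for pod in $(oc get pods | grep ${INSTANCE}- | grep redis | grep -v Running | awk '{print $1}'); do
  oc delete pod ${pod}
done

# Step 4: confirm that no unhealthy Redis pods remain
oc get pods | grep ${INSTANCE}- | grep redis | grep -v Running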

Some Watson Assistant pods do not have annotations that are used for scheduling

Applies to: 4.6.0

Problem
Some Watson Assistant pods are missing the cloudpakInstanceId annotation.
Impact
If you use the IBM Cloud Pak for Data scheduling service, any Watson Assistant pods without the cloudpakInstanceId annotation are:
  • Scheduled by the default Kubernetes scheduler rather than the scheduling service
  • Not included in the quota enforcement
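
To check which pods carry the annotation, you can list each pod together with its annotation value; pods that print <none> or an empty value are the ones scheduled by the default Kubernetes scheduler. This is a sketch that assumes the annotation key is exactly cloudpakInstanceId; adjust the key if your pods use a prefixed form.

# List each pod with its cloudpakInstanceId annotation (blank or <none> when missing)
oc get pods -n ${PROJECT_CPD_INSTANCE} \
  -o custom-columns='NAME:.metadata.name,CLOUDPAK_INSTANCE_ID:.metadata.annotations.cloudpakInstanceId'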

The etcd CSV status changes from Succeeded to Installing

Applies to: Version 4.6.2 and later

When you run the cpd-cli manage apply-olm command, the status of the etcd CSV is Succeeded. Later, the status changes to Installing.

Diagnosing the problem
  1. Check the status of the etcd operator pods:
    oc get pods -n ${PROJECT_CPD_OPS} | grep ibm-etcd

    If the status of the pod is not Running, review the pod logs.

  2. Review the logs of the etcd operator pod that you found in step 1:
    oc logs -n ${PROJECT_CPD_OPS} <ibm-etcd-operator-pod-name>

    Look for the following error message:

    ERROR! A worker was found in a dead state\u001b[0m\n","job":"job-ID","name":"wa-data-governor-etcd","namespace":"project-name",
    "error":"exit status 1","stacktrace":"github.com/operator-framework/operator-sdk/internal/ansible/runner.(*runner).Run.func1\n\t/workspace/internal/ansible/runner/runner.go:269"}
    
Resolving the problem

Contact IBM® Support for assistance.

Watson Assistant upgrade gets stuck at apply-cr

Applies to: 4.0.x to 4.6.0 upgrade

Problem

The clu-training pods are in the CrashLoopBackOff state after apply-olm completes, Watson Assistant hangs during the ibmcpd upgrade, or the upgrade hangs during the apply-cr command with the following message:

pre-apply-cr release patching (if any) for watson_assistant]
Cause
After apply-olm, the ModelTrain pods might go into a bad state, causing the apply-cr command for Watson Assistant version 4.6.0 to stall.
Solution

Run the following commands after running the apply-olm command:

  1. Export the name of your Watson Assistant instance as an environment variable:
    export INSTANCE=`oc get wa -n ${PROJECT_CPD_INSTANCE} |grep -v NAME| awk '{print $1}'`
  2. Recreate the ModelTrain training job:
    oc delete modeltraindynamicworkflows.modeltrain.ibm.com ${INSTANCE}-dwf # This command might take some time to complete. If it does not complete in a few minutes, delete the finalizer on the modeltraindynamicworkflows.modeltrain.ibm.com CR
    oc delete pvc -l release=${INSTANCE}-dwf-ibm-mt-dwf-rabbitmq
    oc delete deploy ${INSTANCE}-clu-training-${INSTANCE}-dwf
    oc delete secret/${INSTANCE}-clu-training-secret job/${INSTANCE}-clu-training-create job/${INSTANCE}-clu-training-update secret/${INSTANCE}-dwf-ibm-mt-dwf-server-tls-secret secret/${INSTANCE}-dwf-ibm-mt-dwf-client-tls-secret
    oc delete secret registry-${INSTANCE}-clu-training-${INSTANCE}-dwf-training

Expect it to take at least 30 minutes for the new training job to take effect and the status to change to Completed.
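
One way to monitor progress is to watch the recreated training objects until the create job completes; this is a sketch that assumes the jobs keep the clu-training naming shown in the previous commands:

# The <instance>-clu-training-create job should eventually report COMPLETIONS 1/1
# and the related training pods should reach Running or Completed status
oc get jobs,pods -n ${PROJECT_CPD_INSTANCE} | grep clu-training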

Watson Assistant upgrade gets stuck at apply-cr or training does not work after the upgrade completes successfully

Applies to: 4.0.x to 4.6.0 upgrade

Problem
The etcdclusters custom resources, wa-data-governor-etcd and wa-etcd, show the following patching error:
"Failed to patch object: b''{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"StatefulSet.apps
      \\"wa-etcd\\" is invalid:"
Solution
To check whether there is an error in the etcdcluster custom resources:
  1. Export the name of your Watson Assistant instance as an environment variable:
    export INSTANCE=`oc get wa -n ${PROJECT_CPD_INSTANCE} |grep -v NAME| awk '{print $1}'`
  2. Describe the etcdcluster custom resource:
    oc describe etcdcluster ${INSTANCE}-data-governor-etcd
    oc describe etcdcluster ${INSTANCE}-etcd

Complete the following steps to fix the issue:

  1. Export the name of your Watson Assistant instance as an environment variable:
    export INSTANCE=`oc get wa -n ${PROJECT_CPD_INSTANCE} |grep -v NAME| awk '{print $1}'`
  2. Delete the following components:
    oc delete job ${INSTANCE}-create-slot-job
    oc delete etcdcontent ${INSTANCE}-data-exhaust-tenant-analytics ${INSTANCE}-data-governor ${INSTANCE}-data-governor-ibm-data-governor-data-exhaust-internal
    oc delete etcdcluster ${INSTANCE}-data-governor-etcd ${INSTANCE}-etcd
    Expect it to take at least 20 minutes for the etcdcluster custom resources to be re-created and become healthy (a polling convenience is shown after these steps).

    To confirm, enter oc get etcdclusters and ensure you get output similar to:

    NAME                    AGE
    wa-data-governor-etcd   116m
    wa-etcd                 120m
  3. Recreate the clu subsystem component:
    oc delete clu ${INSTANCE}
    Expect it to take at least 20 minutes for the clu subsystem to be re-created and become healthy.

    To confirm, enter oc get wa and ensure you get output similar to:

    NAME   VERSION   READY   READYREASON   UPDATING   UPDATINGREASON   DEPLOYED   VERIFIED   AGE
    wa     4.6.0     True    Stable        False      Stable           18/18      18/18      12h
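
While waiting for steps 2 and 3 to settle, you can optionally poll both resource types together; this assumes the watch utility is available on your workstation:

# Poll every 30 seconds until the etcd clusters are listed again and the
# Watson Assistant CR reports READY True with the expected DEPLOYED/VERIFIED counts
watch -n 30 "oc get etcdclusters,wa -n ${PROJECT_CPD_INSTANCE}"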

A few Watson Assistant pods in CrashLoopBackOff and RabbitMQ pods are missing after running apply-olm

Applies to: 4.0.8 to 4.6.0 upgrade

Problem

After apply-olm, the RabbitMQ cluster has no pods running and the wa-dwf-ibm-mt-dwf-rabbitmq-ibm-rabbitmq-backup-label job failed to complete. The pod started by this job has the following error:

***** RabbitMQ PVCs did not become ready in time *****
/bin/bash: line 8: kubectl: command not found
Retries=0
waiting for all PVCs to be ready

The wa-dwf-ibm-mt-dwf-trainer and wa-clu-training-wa-dwf pods are in CrashLoopBackOff status and the RabbitMQ cluster is not stable.

Solution

Delete the wa-dwf-ibm-mt-dwf-rabbitmq-ibm-rabbitmq-backup-label job:

oc delete job wa-dwf-ibm-mt-dwf-rabbitmq-ibm-rabbitmq-backup-label

As a result, the RabbitMQ pods and the wa-dwf-ibm-mt-dwf-trainer and wa-clu-training-wa-dwf pods should recover. Continue with the apply-cr command to upgrade to Watson Assistant version 4.6.0.
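
To confirm the recovery before you run apply-cr, check that the affected pods are back in the Running state (this sketch assumes the default instance name wa that is used in this section):

# No RabbitMQ, trainer, or clu-training pod should remain in CrashLoopBackOff
oc get pods | grep -E 'wa-dwf-ibm-mt-dwf-rabbitmq|wa-dwf-ibm-mt-dwf-trainer|wa-clu-training-wa-dwf'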

OpenShift upgrade hangs because some Watson Assistant pods do not quiesce

Applies to: 4.0.x and later

Problem
Some Watson Assistant pods do not quiesce and might not automatically drain, causing the OpenShift upgrade to pause.
Cause
Watson Assistant does not fully support the quiesce capability, so some of its pods continue to run.
Solution
Watson Assistant quiesce is optional when you upgrade the OpenShift cluster. Monitor the node that is being upgraded for any pods that do not drain automatically and cause the upgrade to hang. To enable the OpenShift upgrade to continue, delete any pod that does not drain automatically so that it is rescheduled on another node.
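
For example, to find the pods that are still scheduled on the node that is being upgraded and to delete the pod that is blocking the drain (a sketch; substitute your own node, pod, and project names):

# List all pods that are still running on the node that refuses to drain
oc get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>

# Delete the Watson Assistant pod that is not draining so that it is rescheduled
# on another node and the upgrade can continue
oc delete pod <pod-name> -n <project-name>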

Increasing backup storage for wa-store-cronjob pods that run out of space

Applies to: 4.5.3 and later

Problem
The nightly scheduled wa-store-cronjob pods eventually fail with No space left on device.
Cause
The backup PVC for the cron job is too small and runs out of space; its size must be increased.
Solution
Increase the size of the backup storage from 1Gi to 2Gi:
  1. Edit the CR:
    oc edit wa wa
  2. In the configOverrides section, add:
    store_db:
      backup:
        size: 2Gi

After 5-10 minutes, the PVC should be resized. If the problem persists, increase the size to a larger value, for example, 4Gi.
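
A hedged way to confirm the resize, assuming that the backup PVC name contains store:

# The resized PVC should report 2Gi (or the larger value that you set)
# in the CAPACITY column
oc get pvc -n ${PROJECT_CPD_INSTANCE} | grep store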

Watson Assistant air gap installation did not complete because the wa-incoming-webhooks pod did not start

Applies to: 4.6.0

Problem
The Watson Assistant air gap installation did not complete because the wa-incoming-webhooks pod did not start.
Solution
  1. Copy the docker image cp.icr.io/cp/watson-assistant/incoming-webhooks:20230118-191010-9e76035-cp4d_startup_issue@sha256:24a5d8db910301e6bcdfeeb5f196adcd28d3ad82fa3f0ead0e21785da7232fd9 to your private registry.

    If you have a system that can connect to both the entitled registry and your private registry, you can use the skopeo utility to copy the Docker image to the private registry.

    skopeo copy --src-creds cp:${ENTITLED_REGISTRY_APIKEY} --dest-creds ${PRIVATE_REGISTRY_USERNAME}:${PRIVATE_REGISTRY_APIKEY} --dest-tls-verify=false --src-tls-verify=false docker://cp.icr.io/cp/watson-assistant/incoming-webhooks:20230118-191010-9e76035-cp4d_startup_issue docker://${PRIVATE_REGISTRY_NAME}/cp/watson-assistant/watson-assistant/incoming-webhooks:20230118-191010-9e76035-cp4d_startup_issue
  2. Modify the Watson Assistant CR to specify the custom incoming_webhooks image, built to work in air gap environments:
    configOverrides:
      container_images:
        incoming_webhooks:
          tag: 20230118-191010-9e76035-cp4d_startup_issue@sha256:24a5d8db910301e6bcdfeeb5f196adcd28d3ad82fa3f0ead0e21785da7232fd9
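
After the CR is updated, the operator rolls out the pod with the custom image. A quick check, assuming the default instance name wa:

# The incoming-webhooks pod should reach the Running state once the custom image
# is pulled from the private registry
oc get pods -n ${PROJECT_CPD_INSTANCE} | grep incoming-webhooks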

Insufficient Scope error when mirroring images

Error
When you mirror the containers for Watson services, an InsufficientScope error is displayed.
Cause
The Watson services do not specify the EDB PostgreSQL operator details. They rely on the EDB bundle to specify them. However, the EDB bundle includes both standard and enterprise images and applies the enterprise image by default. Watson services are entitled to use the standard image only.
Solution
Replace the EDB PostgreSQL enterprise image with the standard image. To do so, complete the following steps:
  1. Get a list of the images by using the following command:
    ./cpd-cli manage list-images --release=$VERSION --components=
    Specify the components that you want to check. For example, watson_assistant or watson_discovery. To specify both, add a comma between the two component names (see the example after these steps).
  2. Replace the advanced image.

    For example, the following command replaces the advanced image for Watson Assistant:

    sed -i -e '/edb-postgres-advanced/d' \
    ./cpd-cli-workspace/olm-utils-workspace/work/offline/$VERSION/watson_assistant/ibm-cloud-native-postgresql-*-images.csv
  3. Repeat the list-images command from Step 1 to confirm that edb-postgres-advanced is not listed for the component.
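
For example, to list the images for both components in step 1:

./cpd-cli manage list-images --release=$VERSION --components=watson_assistant,watson_discovery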

Analytics service is not working for a large size Watson Assistant deployment

Applies to: 4.6.3

Problem
The Analytics service does not show the User conversations and overview data.
Cause
Creation of the store Kafka topics fails because of a mismatch in the replica settings of the Kafka topics.
Solution
Create a temporary patch to fix the Kafka topic replication settings:
  1. Export the name of your Watson Assistant instance as an environment variable:
    export INSTANCE=`oc get wa -n ${PROJECT_CPD_INSTANCE} |grep -v NAME| awk '{print $1}'`
  2. Create the following temporary patch:
    cat <<EOF | oc apply -f -
    apiVersion: assistant.watson.ibm.com/v1
    kind: TemporaryPatch
    metadata:
      name: ${INSTANCE}-kafka-replicas
      namespace: ${PROJECT_CPD_INSTANCE}
    spec:
      apiVersion: assistant.watson.ibm.com/v1
      kind: WatsonAssistant
      name: ${INSTANCE}
      patchType: patchStrategicMerge
      patch:
        data-governor:
          dataexhaustoverride:
            spec:
              dependencies:
                kafka:
                  offsetsTopicReplicationFactor:
                    override: 3
                  replicas:
                    override: 3
    EOF
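
To confirm that the patch resource was created (a sketch; if the short resource name is not recognized, use the fully qualified temporarypatches.assistant.watson.ibm.com form):

# The TemporaryPatch should be listed; the operator applies it on the next
# reconcile of the WatsonAssistant CR
oc get temporarypatch ${INSTANCE}-kafka-replicas -n ${PROJECT_CPD_INSTANCE}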

Watson Assistant upgrade or installation fails as the Watson Assistant UI and Watson Gateway operator pods continuously restart or get evicted

Applies to: 4.6.0 to 4.6.3 (fixed in 4.6.5)

Problem
The Watson Assistant UI pods continuously restart because the gateway deployment is missing. The gateway-operator pod is evicted, restarting, or in CrashLoopBackOff because of an out-of-memory error, and the Watson Assistant upgrade or installation fails.
Solution
To diagnose the problem, determine whether the gateway-operator pods were evicted:
  1. Check the status of the operator pods:
    oc get pods -n ${PROJECT_CPD_OPS} | grep gateway-operator
  2. If any of the pods are in the Evicted state, get the name of the operator:
    oc get csv -n ${PROJECT_CPD_OPS} \
    -loperators.coreos.com/gateway-operator.${PROJECT_CPD_OPS}

    You will need the name to resolve the problem. The name has the following format: gateway-operator.vX.X.X. Set the OP_NAME environment variable to the name of the operator:

    export OP_NAME=<operator-name>
    Edit the operator CSV:
    oc edit csv $OP_NAME
  3. If any of the pods are in the CrashLoopBackOff state and the CSV is not yet installed, edit the operator deployment:
    oc edit csv -n ${PROJECT_CPD_OPS} gateway-operator-vX.X.X
  4. Increase the memory request and limit in the operator CSV to 1Gi:
    resources:
      limits:
        cpu: 500m
        ephemeral-storage: 1Gi
        memory: 1Gi
      requests:
        cpu: 100m
        ephemeral-storage: 300Mi
        memory: 1Gi
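
After you save the CSV or deployment, confirm that the operator pod is rescheduled with the new limits and stays in the Running state:

# The gateway-operator pod should no longer be Evicted or in CrashLoopBackOff
oc get pods -n ${PROJECT_CPD_OPS} | grep gateway-operator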

etcd pods are in "CrashLoopBackOff" while changing the Watson Assistant size from "medium" to "large"

Applies to: 4.6.3

Problem

While trying to increase the Watson Assistant deployment from size "medium" to size "large", the etcd pods go into the CrashLoopBackOff state.

Cause
The problem occurs when an additional etcd pod is deployed but does not get the root user and password applied by the ibm-etcd-operator.
Solution

Add the ETCDCTL_USER environment property back to the etcd statefulset:

  1. Determine the authSecretName used in the etcd deployment:
    oc get etcdcluster wa-etcd \
    --namespace=${PROJECT_CPD_INSTANCE} \
    -o yaml | grep authSecretName

    Output of that command will be: authSecretName: <auth_secret_name>.

  2. Edit the statefulset:
    oc edit sts wa-etcd \
    --namespace=${PROJECT_CPD_INSTANCE}
  3. Look for the env section in the statefulset.
            env:
            - name: INITIAL_CLUSTER_SIZE
              value: "5"
            - name: CLUSTER_NAME
              value: ibm-etcd-instance
            - name: ETCDCTL_API
              value: "3"
            - name: ENABLE_CLIENT_CERT_AUTH
              value: "false"
    
  4. Add the following snippet after the ENABLE_CLIENT_CERT_AUTH key-value pair.

    Before you add this snippet, replace <auth_secret_name> with the value that was returned by the oc get etcdcluster command.

            - name: USERNAME
              value: root
            - name: PASSWORD
              valueFrom:
                secretKeyRef:
                  key: password
                  name: <auth_secret_name>
            - name: ETCDCTL_USER
              value: $(USERNAME):$(PASSWORD)

    This will allow the etcd root user and password to be applied to any additional etcd pods that may be added to the deployment.

  5. Once all the pods restart, you can edit the etcdcluster to change the size.

If you apply this workaround, you must manually rotate the secret reference in the statefulset.
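
A quick way to confirm that the workaround is in place, assuming the statefulset name wa-etcd that is used above:

# Verify that the ETCDCTL_USER variable is now part of the pod template
oc get sts wa-etcd --namespace=${PROJECT_CPD_INSTANCE} -o yaml | grep -A 1 ETCDCTL_USER

# Confirm that all etcd pods reach the Running state
oc get pods --namespace=${PROJECT_CPD_INSTANCE} | grep wa-etcd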

Inaccurate status message from command line after upgrade

Problem
If you run the cpd-cli service-instance upgrade command from the Cloud Pak for Data command-line interface, and then use the service-instance list command to check the status of each service, the provision status for the service is listed as UPGRADE_FAILED.
Cause of the problem
When you upgrade the service, only the cpd-cli manage apply-cr command is supported. You cannot use the cpd-cli service-instance upgrade command to upgrade the service. After you upgrade the service with the apply-cr method, the service-instance command does not recognize the change in version and status. However, the correct version is displayed in the Cloud Pak for Data web client.
Resolving the problem
No action is required. As long as you use the cpd-cli manage apply-cr method to upgrade the service as documented, the upgrade is successful and you can ignore the version and status information that is generated by the cpd-cli service-instance list command.

UnsupportedKafkaVersionException in Kafka CR

Applies to: 4.6.x

Problem
During a Watson Assistant upgrade (depending on the version of Data Governor that you are upgrading from and to), the reconciliation of the Kafka custom resource (CR) halts with the following error in the Kafka CR's status:
Status:
  Conditions:
    Last Transition Time:  2022-11-08T02:12:23.574Z
    Message:               Unsupported Kafka.spec.kafka.version: 2.6.0. Supported versions are: [3.2.3]
    Reason:                UnsupportedKafkaVersionException
Solution
Edit the Kafka CR named <wa instance name>-data-governor-kafka and remove the version field:
oc patch kafka wa-data-governor-kafka --type=json -p="[{'op': 'remove', 'path': '/spec/kafka/version'}]"
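
After the patch, the operator reconciles the Kafka CR again. To confirm that the UnsupportedKafkaVersionException condition clears (assuming the instance name wa that is used in the command above):

# The Status conditions should no longer report UnsupportedKafkaVersionException
oc describe kafka wa-data-governor-kafka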