Limitations and known issues in Watson Discovery
The following limitations and known issues apply to the Watson Discovery service.
- Unable to apply a user-trained SDU model to a collection with documents from external data sources
- Mirroring Watson service images fails with an Insufficient Scope error
- Error installing Watson Discovery when pulling images from a Version 4.6.1 private container registry
- Inaccurate status message from command line after upgrade
- UpgradeError is shown after resizing PVC
- Errored state is shown after upgrade
- Disruption of service after upgrade or restart
- RabbitMQ gets stuck in a loop after several installation attempts
- MinIO gets stuck in a loop after several installation attempts
- Attempted upgrade from early 4.0.x versions without quiescing
- Unable to upgrade from 4.0.x to 4.6 successfully
- Cannot update operators with a dependency on etcd, MinIO, or RabbitMQ
- Unable to modify the resources of the Postgres pods associated with Watson Discovery
- Watson Discovery installation receives a wd-discovery-haywire error due to an NSX plugin being installed as a CNI plugin on OpenShift
- Watson Discovery MinIO pods not starting because quota is applied to the namespace
- ETCD error when upgrading Watson Discovery from 4.5 to 4.6
- Retrieving Watson Discovery ElasticSearch PVCs during uninstallation
Limitations
- You cannot use the Cloud Pak for Data OpenShift APIs for Data Protection (OADP) backup and restore utility to back up and restore the Watson Discovery service. Instead, use the backup and restore process that is described in the product documentation on the IBM Cloud Docs site.
- The service supports single-zone deployments; it does not support multi-zone deployments.
- Watson Discovery cannot always reconcile temporary software patches. If you are asked by IBM Support to apply a patch, you must complete an additional step to make sure that the patch gets applied properly. For more information, see Applying a temporary patch.
- You cannot upgrade the Watson Discovery service by using the service-instance upgrade command from the Cloud Pak for Data command-line interface.
Unable to apply a user-trained SDU model to a collection with documents from external data sources
Applies to: 4.6.0 only
- Problem
-
When you create a collection that crawls an external data source, and then choose to create a user-trained Smart Document Understanding (SDU) model from the Identifying fields page, the SDU tool is not displayed. Instead, a message is displayed that says, “Come back later”.
- Resolving the problem
-
For a 4.6.0 deployment only, you can apply a patch that adds an updated version of the wd-ingestion operator to your deployment. To do so, run the following command:
oc patch wd/wd --type=merge \
  --patch='{"spec":{"ingestion":{"image":{"digest":"sha256:eef24fa7d8a43a23adb2db64121d397f985c6994629f0c0a853643f04cf0420a","name":"wd-ingestion","tag":"14.6.0-11038"}}}}'
Mirroring Watson service images fails with an Insufficient Scope error
Applies to: 4.6.0 - 4.6.2
Fixed in: 4.6.3
- Problem
- When you run the cpd-cli manage mirror-images command, the command fails with an Insufficient Scope error. This problem occurs for the following services:
- Watson Assistant
- Watson Discovery
- Watson Knowledge Studio
- Watson Speech services
This problem occurs because the command is trying to mirror the images for EDB Postgres Enterprise but you do not have a license for EDB Postgres Enterprise.
- Resolving the problem
- To mirror the service images to the private container registry:
- Run the cpd-cli manage list-images command to download the EDB Postgres CASE package:
Mirroring images directly to the private container registry:
cpd-cli manage list-images \
  --components=edb_cp4d \
  --target_registry=${PRIVATE_REGISTRY_LOCATION}
Mirroring images using an intermediary container registry:
cpd-cli manage list-images \
  --components=edb_cp4d \
  --target_registry=127.0.0.1:12443
- Replace the EDB Postgres Enterprise images with the EDB Postgres images:
If the cpd-cli uses the default location for the work directory:
sed -i -e '/edb-postgres-advanced/d' \
  ./cpd-cli-workspace/olm-utils-workspace/work/offline/$VERSION/{component_name}/ibm-cloud-native-postgresql-*-images.csv
Change component_name to the appropriate component name from the following options:
watson_assistant
watson_discovery
watson_ks
watson_speech
For example, to update the files for multiple components in one command:
sed -i -e '/edb-postgres-advanced/d' \
  ./cpd-cli-workspace/olm-utils-workspace/work/offline/$VERSION/{watson_assistant,watson_discovery}/ibm-cloud-native-postgresql-*-images.csv
If the cpd-cli uses the CPD_CLI_MANAGE_WORKSPACE environment variable to determine the location of the work directory:
sed -i -e '/edb-postgres-advanced/d' \
  ${CPD_CLI_MANAGE_WORKSPACE}/work/offline/$VERSION/{component_name}/ibm-cloud-native-postgresql-*-images.csv
Change component_name to the appropriate component name from the following options:
watson_assistant
watson_discovery
watson_ks
watson_speech
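The sed filter above simply deletes every CSV row that references the Enterprise image. The following local sketch shows the effect on a made-up manifest; the file path and image rows are illustrative stand-ins, not the real CASE package contents:

```shell
# Build a sample CSV that loosely resembles an images manifest (contents are made up).
cat > /tmp/sample-images.csv <<'EOF'
registry,image_name,tag
icr.io/cpopen,edb-postgres-advanced,14.5
icr.io/cpopen,edb-postgres,14.5
EOF

# Same filter as the documented command: delete, in place, every row that
# mentions the Enterprise (edb-postgres-advanced) image.
sed -i -e '/edb-postgres-advanced/d' /tmp/sample-images.csv

cat /tmp/sample-images.csv
```

After the edit, only the standard edb-postgres rows remain, so the subsequent mirroring step skips the images that require an EDB Postgres Enterprise license.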
Error installing Watson Discovery when pulling images from a Version 4.6.1 private container registry
- Error
- When you install Watson Discovery by pulling images from a private container registry on Cloud Pak for Data Version 4.6.1, the installation does not complete successfully. When you check the pods and get the etcd pod with a command such as oc -n ${PROJECT_CPD_INSTANCE} get pod | grep wd-discovery-etcd, an ImagePullBackOff error is returned.
- Cause
- Watson Discovery did not release a 4.6.1 version of the software. Therefore, when you install Watson Discovery on Cloud Pak for Data Version 4.6.1, a 4.6.0 version of the Watson Discovery software is installed. The service defines the etcd operator to use in its custom resource (because the etcd operator doesn't always provide a default image). With the 4.6.1 release, a newer version of etcd is specified. As a result, the older version of the etcd image that is specified for Discovery is not mirrored to the private registry. This missing etcd image results in an error.
- Solution
- Mirror the etcd image to the private registry by completing the following steps:
- Set the following variables:
export VERSION=4.6.0
export COMPONENTS=opencontent_etcd
- Mirror etcd images to the registry by following the steps in the Mirroring images to a private container registry procedure.
- Delete the failed wd-discovery-etcd pods.
One or more new etcd pods, depending on your deployment type, are started.
- Confirm that the etcd pods are running successfully.
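Steps 3 and 4 can be scripted. In the sketch below, the pod listing is a fabricated stand-in for real oc -n ${PROJECT_CPD_INSTANCE} get pod --no-headers output; on a cluster, the selected names would be piped into oc delete pod:

```shell
# Sample "oc get pod --no-headers" output (made up for illustration).
listing='wd-discovery-etcd-0   0/1   ImagePullBackOff   0   5m
wd-discovery-etcd-1   1/1   Running            0   5m
wd-discovery-etcd-2   0/1   ImagePullBackOff   0   5m'

# Select only the failed etcd pods. On a cluster, pipe the result into
# "xargs oc -n ${PROJECT_CPD_INSTANCE} delete pod" and then re-run the
# listing to confirm that the replacement pods reach the Running state.
failed=$(printf '%s\n' "$listing" | awk '/wd-discovery-etcd/ && /ImagePullBackOff/ {print $1}')
printf '%s\n' "$failed"
```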
Inaccurate status message from command line after upgrade
- Problem
- If you run the cpd-cli service-instance upgrade command from the Cloud Pak for Data command-line interface, and then use the service-instance list command to check the status of each service, the provision status for the service is listed as UPGRADE_FAILED.
- Cause of the problem
- When you upgrade the service, only the cpd-cli manage apply-cr command is supported. You cannot use the cpd-cli service-instance upgrade command to upgrade the service. After you upgrade the service with the apply-cr method, the change in version and status is not recognized by the service-instance command. However, the correct version is displayed in the Cloud Pak for Data web client.
- Resolving the problem
- No action is required. As long as you use the cpd-cli manage apply-cr method to upgrade the service as documented, the upgrade is successful, and you can ignore the version and status information that is generated by the cpd-cli service-instance list command.
UpgradeError is shown after resizing PVC
- Error
- After you edit the custom resource to change the size of a persistent volume claim for a data store, an error is shown.
- Cause
- You cannot change the persistent volume claim size of a component by updating the custom resource. Instead, you must change the size directly on the persistent volume claim after it is created.
- Solution
- To prevent the error, undo the changes that were made to the YAML file. For more information about the steps to follow to change the persistent volume claim size successfully, see Scaling an existing persistent volume claim size.
Errored state is shown after upgrade
This issue applies only when you upgrade to versions 4.6.0 and 4.6.2.
- Error
- After you run the cpd-cli manage apply-olm --upgrade=true command to upgrade the service to version 4.6, the ready state shows as Failed with the reason Errored.
- Cause
- Changes that are specific to the operator between minor versions cause errors during future reconciliation loops. The instance is operational but the operator is unable to complete successfully.
- Solution
- Complete the upgrade by using the command that updates the operator and operand at the same time, which is the cpd-cli manage apply-cr command.
Disruption of service after upgrade or restart
- Error
- After an upgrade or restart, one or more pods in the cluster are in an Init state, or are intermittently in a CrashLoopBackOff or Running state.
- Cause
- The Elasticsearch component uses a quorum of pods to ensure availability when it completes search operations. However, each pod in the quorum must recognize the same pod as the leader of the quorum. The system can run into issues when more than one leader pod is identified.
- Solution
- To determine if confusion about the quorum leader pod is the cause of the issue, complete the following steps:
- Log in to the cluster, and then set the namespace to the project where the Discovery resources are installed.
- Check each of the Elasticsearch pods with the role of master to see which pod it identifies as the quorum leader. Each pod must list the same pod as the leader.
oc get pod -l icpdsupport/addOnId=discovery,app=elastic,role=master,tenant=wd \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' | while read i; do echo $i; oc exec $i \
  -c elasticsearch -- bash -c 'curl -ksS "localhost:19200/_cat/master?v"'; echo; done
For example, in the following result, two different leaders are identified. Pods 1 and 2 identify pod 2 as the leader. However, pod 0 identifies itself as the leader.
wd-ibm-elasticsearch-es-server-master-0
id host ip node
7q0kyXJkSJirUMTDPIuOHA 127.0.0.1 127.0.0.1 wd-ibm-elasticsearch-es-server-master-0
wd-ibm-elasticsearch-es-server-master-1
id host ip node
L0mqDts7Rh6HiB0aQ4LLtg 127.0.0.1 127.0.0.1 wd-ibm-elasticsearch-es-server-master-2
wd-ibm-elasticsearch-es-server-master-2
id host ip node
L0mqDts7Rh6HiB0aQ4LLtg 127.0.0.1 127.0.0.1 wd-ibm-elasticsearch-es-server-master-2
If you find that more than one pod is identified as the leader, complete the following steps to fix the problem:
- Delete all but one of the master Elasticsearch pods, and then wait until new pods are started and become available.
- Repeat the check described earlier to find out whether all Elasticsearch pods with the master role identify the same pod as the leader.
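The agreement rule can also be checked mechanically. In this sketch the pod-to-leader reports are hard-coded stand-ins for the output of the oc/curl loop shown above:

```shell
# One reported leader per master pod (sample values mirroring the example above,
# where pod 0 disagrees with pods 1 and 2).
reported_leaders='wd-ibm-elasticsearch-es-server-master-0
wd-ibm-elasticsearch-es-server-master-2
wd-ibm-elasticsearch-es-server-master-2'

# The quorum is healthy only if every master pod reports the same leader.
unique=$(printf '%s\n' "$reported_leaders" | sort -u | wc -l)
if [ "$unique" -gt 1 ]; then
  echo "split leadership: $unique distinct leaders reported"
else
  echo "quorum agrees on a single leader"
fi
```

A count greater than one is the condition that calls for deleting all but one of the master pods, as described in the fix above.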
RabbitMQ gets stuck in a loop after several installation attempts
- Error
- After an initial installation or upgrade failure and repeated attempts to retry, the common services RabbitMQ operator pod can get into a CrashLoopBackOff state. For example, the log might include the following types of messages:
"error":"failed to upgrade release: post-upgrade hooks failed: warning: Hook post-upgrade ibm-rabbitmq/templates/rabbitmq-backup-labeling-job.yaml failed: jobs.batch "{%name}-ibm-rabbitmq-backup-label" already exists"
- Cause
- Resources for the RabbitMQ operator component must be fully removed before a new installation or upgrade is started. If too many attempts occur in succession, remaining resources can cause new installations to fail.
- Solution
-
- Delete the RabbitMQ backup label job from the previous installation or upgrade attempt. Look for the name of the job in the logs. The name ends in -ibm-rabbitmq-backup-label.
oc delete job {%name}-ibm-rabbitmq-backup-label -n ${PROJECT_CPD_INSTANCE}
- Check that the pod returns a Ready state.
oc get pods -n ${PROJECT_CPFS_OPS} | grep ibm-rabbitmq
MinIO gets stuck in a loop after several installation attempts
- Error
- The message Cannot find volume "export" to mount into container "ibm-minio" is displayed during an installation or upgrade of Discovery. When you check the status of the MinIO pods by using the command oc get pods -l release=wd-minio -o wide, and then check the MinIO operator logs by using the commands oc get pods -A | grep ibm-minio-operator and oc logs -n <namespace> ibm-minio-operator-XXXXX, you see an error that is similar to the following message in the logs:
ibm-minio/templates/minio-create-bucket-job.yaml failed: jobs.batch "wd-minio-discovery-create-bucket" already exists) and failed rollback: failed to replace object"
- Cause
- A job that creates a storage bucket for MinIO, and that is normally deleted after it completes, is not being deleted properly.
- Solution
- Complete the following steps to check whether an incomplete create-bucket job for MinIO exists. If so, delete the incomplete job so that the job can be re-created and can then run successfully.
- Check for the MinIO job by using the following command:
oc get jobs | grep 'wd-minio-discovery-create-bucket'
- If an existing job is listed in the response, delete the job by using the following command:
oc delete job $(oc get jobs -oname | grep 'wd-minio-discovery-create-bucket')
- Verify that all of the MinIO pods start successfully by using the following command:
oc get pods -l release=wd-minio -o wide
Attempted upgrade from early 4.0.x versions without quiescing
- Error
- When you check the status of the upgrade, errors are shown and only 8 or so of the 24 components are ready.
- Cause
- If you upgraded Watson Discovery from version numbers 4.0.2 through 4.0.5 without first quiescing the service, you can run into issues with the upgrade process.
- Solution
- Complete the following steps to redo the upgrade:
- Revert the version of the service to the old version by using the following command:
oc patch wd wd --type='merge' --patch '{"spec":{"version": "<old_version>"}}'
- Apply a temporary patch to modify the application configuration.
- Download the patch file named app-config-override-patch.yaml from the Watson Developer Cloud repository on GitHub.
- Use the following command to apply the patch:
oc apply -f app-config-override-patch.yaml
- Upgrade the service by completing the Upgrading Watson Discovery from Version 4.0.x to Version 4.6 procedure.
Attention: Be sure to complete the step to quiesce the service, and then check the status of the service before you run the upgrade command. Wait until the QUIESCE status shows QUIESCED.
- After the upgrade, be sure to complete the step to stop the quiesce of the service.
- Run the following commands to remove the resources that were created by the temporary patch:
oc patch temporarypatch wd-app-config-override-patch \
  --type json --patch '[{ "op": "remove", "path": "/metadata/finalizers" }]'
oc delete -f app-config-override-patch.yaml
oc get crd | grep watsondiscovery | cut -d' ' -f1 | xargs -I{} oc annotate \
  --overwrite {} wd oppy.ibm.com/temporary-patches-
Unable to upgrade from 4.0.x to 4.6 successfully
- Error
-
An upgrade from a 4.0.x installation to 4.6 does not complete. The READY column shows False, the READYREASON column shows Errored, and the state does not resolve.
- Cause
-
Discovery defaults to creating a Development deployment type. However, you can override that default configuration by specifying a deployment type with the spec.shared.deploymentType setting.
In 4.0.x releases, the spec.shared.deploymentType field with a Starter value (which is equivalent to Development) was applied if you did not change it to Production. In a 4.6 installation, when you use cpd-cli manage, Discovery sets the spec.shared.deploymentType field to Production to create a Production-ready installation by default.
Deployment types cannot be changed after an initial deployment. They cannot be changed during an upgrade either. If you had a Starter or Development deployment previously, you might have inadvertently created a Production deployment during the upgrade. This configuration mismatch will not work.
If you aren't sure which deployment type was used for your 4.6 upgrade, you can check by completing the following steps:
- Run the following command to check the current deployment type:
oc get WatsonDiscovery wd -ojsonpath='{.spec.shared.deploymentType}'
If Production is returned, then you need to apply the workaround. If no value is returned, then you might not have specified a value and a Development deployment might be applied (because that is the internal default configuration).
- Run the following command to check the number of persistent volume claims (PVCs) that were created by the EDB PostgreSQL instance for your installation:
oc get pvc -lk8s.enterprisedb.io/cluster=wd-discovery-cn-postgres
If there is more than one, and the AGE of the most recent one aligns with the time of the upgrade, it means that you used a Production deployment type during the upgrade by mistake.
For example, the following response means the upgrade was configured to create a production installation because there are two PostgreSQL PVCs:
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
wd-discovery-cn-postgres-1 Bound pvc-169dd8cf-2c02-452d-8f2e-85ecf3ce31aa 30Gi RWO ocs-storagecluster-ceph-rbd 7d1h
wd-discovery-cn-postgres-2 Bound pvc-e3056a68-2603-436a-bf01-30057c34ad1a 30Gi RWO ocs-storagecluster-ceph-rbd 41h
- Solution
- To resolve the problem and continue the upgrade, complete the following steps:
- Do one of the following things:
- If Production was returned in the previous procedure, modify the spec.shared.deploymentType field in the custom resource to match the value that was used during the original installation.
oc patch WatsonDiscovery wd --type=merge \
  --patch='{"spec":{"shared":{"deploymentType": "Starter"}}}'
Confirm that the Starter deployment type is returned now by using the following command:
oc get WatsonDiscovery wd -ojsonpath='{.spec.shared.deploymentType}'
The Development and Starter types are functionally the same, and both values are accepted by the service.
- If an empty value was returned in the previous procedure, then the field was not specified in the initial installation. In this case, you must remove the field in the current custom resource. When you do so, the internal default setting of Development will be used, which is what you want in this case. To remove the field, enter the following command:
oc patch WatsonDiscovery wd --type=json \
  --patch='[{"op": "remove", "path": "/spec/shared/deploymentType"}]'
- After the patch is applied, verify that EDB PostgreSQL is running with one instance only by using the following command:
oc get Cluster wd-discovery-cn-postgres
- If more than one instance is reported, set PostgreSQL in maintenance mode by using the following command:
oc patch WatsonDiscovery wd --type=merge \
  --patch='{"spec":{"postgres":{"quiesce":{"enabled": true}}}}'
PostgreSQL is in maintenance mode when the following command returns true. You might need to wait a few minutes.
oc get Cluster wd-discovery-cn-postgres \
  -o jsonpath='{.spec.nodeMaintenanceWindow.inProgress}{"\n"}'
- Remove the additional pods and persistent volume claims (PVCs) that are associated with the instance. You got a list of these PVCs in an earlier step.
oc delete pod/wd-discovery-cn-postgres-2 pvc/wd-discovery-cn-postgres-2
- After the PVCs are removed, return PostgreSQL to normal operation by using the following command:
oc patch WatsonDiscovery wd --type=merge \
  --patch='{"spec":{"postgres":{"quiesce":{"enabled": false}}}}'
PostgreSQL is out of maintenance mode when the following command returns false.
oc get Cluster wd-discovery-cn-postgres \
  -o jsonpath='{.spec.nodeMaintenanceWindow.inProgress}{"\n"}'
- Confirm that the state of the cluster is now healthy.
oc get Cluster wd-discovery-cn-postgres
The upgrade resumes.
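Both patch payloads in this procedure are easy to get wrong: the deploymentType value must be nested under spec.shared, not attached to it. A local sanity check before sending either patch to the cluster, assuming python3 is available:

```shell
# Validate the merge patch: deploymentType must sit under spec.shared.
merge_patch='{"spec":{"shared":{"deploymentType": "Starter"}}}'
echo "$merge_patch" | python3 -c 'import json,sys; d=json.load(sys.stdin); print(d["spec"]["shared"]["deploymentType"])'

# Validate the JSON patch used to remove the field entirely.
json_patch='[{"op": "remove", "path": "/spec/shared/deploymentType"}]'
echo "$json_patch" | python3 -c 'import json,sys; p=json.load(sys.stdin); print(p[0]["op"], p[0]["path"])'
```

If either python3 invocation raises a KeyError or a JSON decode error, the payload is malformed and oc patch would either fail or silently do the wrong thing.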
Cannot update operators with a dependency on etcd, MinIO, or RabbitMQ
Applies to: Upgrades from Version 4.0.x or 4.5.x to Version 4.6
When you run the cpd-cli manage apply-olm command, the operator for one or more of the following services might get stuck in the Installing phase:
Service | MinIO | RabbitMQ | etcd |
---|---|---|---|
IBM® Match 360 | ✓ | | |
OpenPages® | ✓ | | |
Watson Assistant | ✓ | ✓ | ✓ |
Watson Discovery | ✓ | ✓ | ✓ |
Watson Knowledge Studio | ✓ | ✓ | |
Watson Speech services | ✓ | ✓ | |
This issue can occur when the upgrade of the etcd operator, MinIO operator, or RabbitMQ operator fails. These dependencies use Helm-based operators. When the upgrade of a Helm-based operator fails, the failed version is not automatically deleted. If there is insufficient memory to retry the upgrade, the operators encounter an out-of-memory error and the upgrade fails.
- Diagnosing the problem
-
To determine which operator is causing the problem:
- If you are upgrading a service that has a dependency on RabbitMQ, check the status of the RabbitMQ operator:
- Check the last state of the operator:
oc get pods -n $PROJECT_CPD_OPS \
  -lapp.kubernetes.io/instance=ibm-rabbitmq-operator \
  -ojsonpath='{.items[*].status.containerStatuses[*].lastState.terminated}'
- If the state includes "exitCode":137,..."reason":"OOMKilled", then get the name of the operator:
oc get csv -n "${PROJECT_CPD_OPS}" \
  -loperators.coreos.com/ibm-rabbitmq-operator.${PROJECT_CPD_OPS}
You will need the name to resolve the problem. The name has the following format: ibm-rabbitmq-operator.vX.X.X.
- If you are upgrading a service that has a dependency on MinIO, check the status of the MinIO operator:
- Check the last state of the operator:
oc get pods -n $PROJECT_CPD_OPS \
  -lapp.kubernetes.io/instance=ibm-minio-operator \
  -ojsonpath='{.items[*].status.containerStatuses[*].lastState.terminated}'
- If the state includes "exitCode":137,..."reason":"OOMKilled", then get the name of the operator:
oc get csv -n "${PROJECT_CPD_OPS}" \
  -loperators.coreos.com/ibm-minio-operator.${PROJECT_CPD_OPS}
You will need the name to resolve the problem. The name has the following format: ibm-minio-operator.vX.X.X.
- If you are upgrading a service that has a dependency on etcd, check the status of the etcd operator:
- Check the last state of the operator:
oc get pods -n $PROJECT_CPD_OPS \
  -lapp.kubernetes.io/instance=ibm-etcd-operator \
  -ojsonpath='{.items[*].status.containerStatuses[*].lastState.terminated}'
- If the state includes "exitCode":137,..."reason":"OOMKilled", then get the name of the operator:
oc get csv -n "${PROJECT_CPD_OPS}" \
  -loperators.coreos.com/ibm-etcd-operator.${PROJECT_CPD_OPS}
You will need the name to resolve the problem. The name has the following format: ibm-etcd-operator.vX.X.X.
- Resolving the problem
-
If one or more pods are in the CrashLoopBackOff state, complete the following steps to resolve the problem:
- Check the current limits and requests for the operator with pods that are in a poor state. If all of the operators were stuck, repeat this process for each operator.
- Set the OP_NAME environment variable to the name of the operator:
export OP_NAME=<operator-name>
- Check the current limits for the operator:
oc get csv -n ${PROJECT_CPD_OPS} ${OP_NAME} \
  -ojsonpath='{.spec.install.spec.deployments[0].spec.template.spec.containers[0].resources.limits.memory}'
- Check the current requests for the operator:
oc get csv -n ${PROJECT_CPD_OPS} ${OP_NAME} \
  -ojsonpath='{.spec.install.spec.deployments[0].spec.template.spec.containers[0].resources.requests.memory}'
- Choose the appropriate action based on the values returned by the preceding commands:
- If either the limits or requests are below 1Gi, continue to the next step.
- If both values are above 1Gi, then the cause of the problem was misdiagnosed. This solution will not resolve the issues you are seeing.
- Increase the memory limits and requests for the affected operator. If all of the operators are stuck, repeat this process for each operator.
- Create a JSON file named patch.json with the following content:
[
  { "op": "replace", "path": "/spec/install/spec/deployments/0/spec/template/spec/containers/0/resources/requests/memory", "value": "1Gi" },
  { "op": "replace", "path": "/spec/install/spec/deployments/0/spec/template/spec/containers/0/resources/limits/memory", "value": "1Gi" }
]
- Ensure that the OP_NAME environment variable is set to the correct operator name:
echo ${OP_NAME}
- Patch the operator:
oc patch csv -n ${PROJECT_CPD_OPS} ${OP_NAME} \
  --type=json --patch="$(cat patch.json)"
- Confirm that the patch was successfully applied:
oc get csv -n ${PROJECT_CPD_OPS} ${OP_NAME} \
  -ojsonpath='{.spec.install.spec.deployments[0].spec.template.spec.containers[0].resources.limits.memory}'
The command should return 1Gi.
Important: The patch is temporary. The memory settings apply only to the current deployment. The next time you update the operator, the settings are replaced by the default settings.
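Because oc patch --type=json rejects malformed payloads with an unhelpful error, it can save a retry to verify patch.json locally first. A sketch, assuming python3 is available:

```shell
# Write the same patch.json as in the step above.
cat > /tmp/patch.json <<'EOF'
[
  { "op": "replace",
    "path": "/spec/install/spec/deployments/0/spec/template/spec/containers/0/resources/requests/memory",
    "value": "1Gi" },
  { "op": "replace",
    "path": "/spec/install/spec/deployments/0/spec/template/spec/containers/0/resources/limits/memory",
    "value": "1Gi" }
]
EOF

# Confirm that the file parses and that both operations set the memory to 1Gi.
python3 - <<'EOF'
import json
ops = json.load(open("/tmp/patch.json"))
assert len(ops) == 2
assert all(o["op"] == "replace" and o["value"] == "1Gi" for o in ops)
print("patch.json OK")
EOF
```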
Unable to modify the resources of the Postgres pods associated with Watson Discovery
This issue was fixed in the 4.6.5 release.
- Error
-
In Watson Discovery version 4.6, the Postgres pod's resource requests and limits can no longer be edited. Previously, the Watson Discovery operator honored the spec.postgres.database.resources setting.
- Cause
- Because the WatsonDiscovery CustomResourceDefinition does not have .spec.postgres.resources defined, modifications to the WatsonDiscovery CR are automatically reverted. This prevents changes from rolling out to the pods.
- Solution
- Patch the WatsonDiscovery CustomResourceDefinition to include the new field before you submit a patch to modify the Postgres pod resources. To verify whether the patch is already applied, run the following command:
oc get CustomResourceDefinition watsondiscoveries.discovery.watson.ibm.com \
  --output jsonpath='{.spec.versions[].schema.openAPIV3Schema.properties.spec.properties.postgres.properties.resources}'
If the patch is already applied, the command returns {"x-kubernetes-preserve-unknown-fields":true}. If nothing is returned, run the following command to patch the CustomResourceDefinition:
oc patch CustomResourceDefinition watsondiscoveries.discovery.watson.ibm.com \
  --type json \
  --patch '[{"op":"add","path":"/spec/versions/0/schema/openAPIV3Schema/properties/spec/properties/postgres/properties/resources","value":{"x-kubernetes-preserve-unknown-fields":true}}]'
After the patch is applied, rerun the previous oc get command to verify that {"x-kubernetes-preserve-unknown-fields":true} is returned. You can then resume modifying the Postgres pod resource configuration.
Watson Discovery installation receives a wd-discovery-haywire error due to an NSX plugin being installed as a CNI plugin on OpenShift
- Error
- Installation of the Watson Discovery service is pending with the following error on wd-discovery-haywire:
wd-discovery-haywire-56bc476b76-plv84.log
-----
Name: wd-discovery-haywire-56bc476b76-plv84
Container: wd-discovery-haywire
Namespace: cpd4-main-qi1001
Logs:
{"@timestamp":"2023-02-28T05:02:24.692Z","message":"Listening on 50,051","logger_name":"com.ibm.watson.wire.notices.Server","thread_name":"main","level":"INFO"}
{"@timestamp":"2023-02-28T05:02:57.307Z","message":"*** shutting down gRPC server since JVM is shutting down","logger_name":"com.ibm.watson.wire.notices.Server$1","thread_name":"Thread-6","level":"INFO"}
{"@timestamp":"2023-02-28T05:02:57.312Z","message":"*** server shut down","logger_name":"com.ibm.watson.wire.notices.Server$1","thread_name":"Thread-6","level":"INFO"}
[INFO tini (1)] Spawned child process 'java' with pid '7'
[INFO tini (1)] Main child exited normally (with status '143')
-----
- Cause
- Installing the NSX plugin as an OpenShift CNI plug-in can prevent the ibm-nginx pod from accessing itself through its Service IP/DNS name. As a result, the Watson Discovery gateway cannot receive incoming requests.
- Apply a temporary patch to allow nginx in the gateway pod to connect to the other container without going through the Kubernetes service:
- Download the temporary patch wd-gateway-service-patch.zip.
- Extract wd-gateway-service-patch.yml from the zip file.
- Apply the temporary patch:
oc apply -f wd-gateway-service-patch.yml
Wait for the wd-discovery-gateway pod to restart.
- Create a Watson Discovery instance in Cloud Pak for Data.
If you would like to remove the temporary patch, enter the following command:
oc delete temporarypatch.oppy.ibm.com wd-gateway-service-patch
If the command does not complete, open another terminal and enter the following commands:
oc patch temporarypatch.oppy.ibm.com wd-gateway-service-patch --type json --patch '[{ "op": "remove", "path": "/metadata/finalizers" }]'
oc get crd | grep watsondiscovery | cut -d' ' -f 1 | xargs -I{} -t oc annotate {} wd --overwrite oppy.ibm.com/temporary-patches-
Watson Discovery MinIO pods not starting because quota is applied to the namespace
Applies to: 4.5.3 and later
- Problem
- The job wd-minio-discovery-create-pvc fails to complete when ResourceQuotas are applied to the namespace. When the job is described with oc describe job wd-minio-discovery-create-pvc, there is a FailedCreate event that mentions failed quota. Example:
Warning FailedCreate 31m job-controller Error creating: pods "wd-minio-discovery-create-pvc-6shj8" is forbidden: failed quota: cpd-quota: must specify limits.cpu,limits.memory,requests.cpu,requests.memory
- Cause
- The MinIO Job cannot start if a ResourceQuota is applied to the namespace but a LimitRange is not set, because the Job pod does not have resources.requests or resources.limits configured.
- Solution
- Apply a LimitRange with defaults for limits and requests. Modify the namespace in the following YAML to the namespace where Cloud Pak for Data is installed:
apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-resource-limits
  namespace: zen #Change it to the namespace where CPD is installed
spec:
  limits:
  - default:
      cpu: 300m
      memory: 200Mi
    defaultRequest:
      cpu: 200m
      memory: 200Mi
    type: Container
ETCD error when upgrading Watson Discovery from 4.5 to 4.6
Applies to: 4.6.0 and later
- Problem
- After you apply the Watson Discovery CR (cpd-cli manage apply-cr .. --version=4.6.x ..), the etcdcluster resource wd-discovery-etcd gets into a failed state due to invalid labels. You can verify this by checking the etcdcluster conditions:
oc -n ${PROJECT_CPD_INSTANCE} get etcdcluster wd-discovery-etcd -o jsonpath="{.status.conditions}"
---
[{"lastTransitionTime":"2023-03-30T12:45:58Z","message":"Failedtopatchobject:b'{\"kind\":\"Status\",\"apiVersion\":\"v1\",\"metadata\":{},\"status\":\"Failure\",\"message\":\"StatefulSet.apps\\\\\"wd-discovery-etcd\\\\\"isinvalid:spec:Forbidden:updatestostatefulsetspecforfieldsotherthan\\'replicas\\',\\'template\\',\\'updateStrategy\\',\\'persistentVolumeClaimRetentionPolicy\\'and\\'minReadySeconds\\'areforbidden\",\"reason\":\"Invalid\",\"details\":{\"name\":\"wd-discovery-etcd\",\"group\":\"apps\",\"kind\":\"StatefulSet\",\"causes\":[{\"reason\":\"FieldValueForbidden\",\"message\":\"Forbidden:updatestostatefulsetspecforfieldsotherthan\\'replicas\\',\\'template\\',\\'updateStrategy\\',\\'persistentVolumeClaimRetentionPolicy\\'and\\'minReadySeconds\\'areforbidden\",\"field\":\"spec\"}]},\"code\":422}\\n'","reason":"Failed","status":"False","type":"Failure"}]
- Cause
- The ETCD operator does not recreate the statefulset on an immutable field change.
- Solution
- Manually delete the etcd statefulset to allow the operator to recreate it:
- Delete the etcd statefulset:
oc delete sts wd-discovery-etcd
statefulset.apps "wd-discovery-etcd" deleted
- Wait for the statefulset to be re-created:
oc get sts wd-discovery-etcd
NAME READY AGE
wd-discovery-etcd 0/3 24s
---
oc get etcdcluster wd-discovery-etcd -o jsonpath="{.status.conditions}"
[{"ansibleResult":{"changed":3,"completion":"2023-05-16T18:57:17.965897","failures":0,"ok":39,"skipped":36},"lastTransitionTime":"2023-05-16T18:56:24Z","message":"Awaiting next reconciliation","reason":"Successful","status":"True","type":"Running"},{"lastTransitionTime":"2023-05-16T18:57:18Z","message":"Last reconciliation succeeded","reason":"Successful","status":"True","type":"Successful"},{"lastTransitionTime":"2023-05-16T18:56:24Z","message":"","reason":"","status":"False","type":"Failure"}]
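When reading the conditions arrays returned by the jsonpath queries above, the health signal is the Failure condition's status. A local sketch with an abbreviated, made-up payload, assuming python3 is available:

```shell
# Abbreviated stand-in for the .status.conditions output shown above.
conditions='[{"type":"Running","status":"True"},{"type":"Successful","status":"True"},{"type":"Failure","status":"False"}]'

# The etcdcluster is healthy when the Failure condition reports "False".
state=$(echo "$conditions" | python3 -c '
import json, sys
conds = {c["type"]: c["status"] for c in json.load(sys.stdin)}
print("healthy" if conds.get("Failure") == "False" else "failed")')
echo "$state"
```

In the failed state shown in the Problem section, the Failure condition's status is "True", so this check would print "failed" until the statefulset is re-created.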
Retrieving Watson Discovery ElasticSearch PVCs during uninstallation
Applies to: 4.0.x and later
- Problem
- During uninstallation, the following command fails to retrieve the Watson Discovery ElasticSearch PVCs:
oc get pvc -l 'app.kubernetes.io/name in (wd,discovery)'
- Cause
- The command fails if the labels changed while other ElasticSearch PVCs exist. The ElasticSearch operator only updates the labels of the ElasticSearch PVCs if other ElasticSearch PVCs exist with a different set of labels compared to the Watson Discovery ones. Those PVCs also include the ibm-es-data label (set to either True or False).
). - Solution
- You can retrieve the Watson Discovery ElasticSearch PVCs by entering the following command:
oc get pvc | grep wd-ibm-elasticsearch
Delete any PVCs that are listed.