Known issues and limitations for Watson Discovery
The following known issues and limitations apply to the Watson Discovery service.
- After Watson Discovery installation, one or more ranker pods are in CrashLoopBackOff
- Watson Discovery is not accessible after upgrading
- Upgrading from Version 4.8.6 or earlier to Version 5.1.0 does not complete
- Failed to restore Elasticsearch data in Watson Discovery
- Custom resources are not accessible from the Teach domain concepts section after upgrading
- During shutdown the DATASTOREQUIESCE field does not update
- Upgrade fails due to existing Elasticsearch 6.x indices
- UpgradeError is shown after resizing PVC
- Disruption of service after upgrading, restarting, or scaling by updating scaleConfig
After Watson Discovery installation, one or more ranker pods are in CrashLoopBackOff
Applies to: 5.1.0 and later
- Error
-
The ranker master or ranker serve pods, or both, are in CrashLoopBackOff. Reviewing the pods shows probe timeouts.

```
oc get pod|grep ranker
```

```
wd-discovery-ranker-master-6dbfb6bcc5-qv6nj          0/1   CrashLoopBackOff   525 (15s ago)     2d18h
wd-discovery-ranker-master-6dbfb6bcc5-t2xzw          0/1   Running            515 (3m29s ago)   2d18h
wd-discovery-ranker-monitor-agent-85bcc9bc9-b8kc9    1/1   Running            0                 4d9h
wd-discovery-ranker-monitor-agent-85bcc9bc9-g4gzq    1/1   Running            0                 4d9h
wd-discovery-ranker-rest-85bc97fd7d-8qr2w            1/1   Running            0                 2d18h
wd-discovery-ranker-rest-85bc97fd7d-fzq7h            1/1   Running            0                 2d18h
wd-discovery-serve-ranker-78bf7696bf-f596r           2/2   Running            1 (2d18h ago)     2d18h
wd-discovery-serve-ranker-78bf7696bf-mqx8t           1/2   CrashLoopBackOff   582 (100s ago)    2d18h
```

```
oc describe pod <wd-discovery-ranker-masterORserve-pod> -n ${PROJECT_CPD_INST_OPERANDS}
```

```
Events:
  Type     Reason     Age                       From     Message
  ----     ------     ----                      ----     -------
  Warning  Unhealthy  79m (x2279 over 6d12h)    kubelet  Readiness probe failed: command timed out
  Warning  Unhealthy  75m (x915 over 6d11h)     kubelet  (combined from similar events): Liveness probe failed: command timed out
  Warning  BackOff    60m (x5356 over 6d12h)    kubelet  Back-off restarting failed container wd-discovery-ranker-master in pod wd-discovery-ranker-master-5c575b659b-5wfqb_cpd-instance-upgrade(c5f5684d-fe4e-4585-aeb6-e6b1e9bf3583)
  Warning  Unhealthy  15m (x1439 over 6d12h)    kubelet  Startup probe failed: command timed out
  Warning  Unhealthy  5m28s (x2451 over 6d12h)  kubelet  Liveness probe failed: command timed out
```
- Cause
-
The ranker master or ranker serve pods, or both, did not come up properly after installation.
- Solution
-
- Check if the ranker master pod is in CrashLoopBackOff:

  ```
  oc describe pod <wd-discovery-ranker-master-pod> -n ${PROJECT_CPD_INST_OPERANDS}
  ```

- Apply the following patch command to increase the resources on the ranker master pod:

  ```
  oc patch wd wd --type=merge --patch='{"spec": {"wire": {"rankerMaster": {"resources":{"requests":{"cpu":"1","memory":"3000Mi"},"limits":{"cpu":"1","memory":"3000Mi"}}}}}}'
  ```

- Check if the ranker serve pod is in CrashLoopBackOff:

  ```
  oc describe pod <wd-discovery-ranker-serve-pod> -n ${PROJECT_CPD_INST_OPERANDS}
  ```

- Apply the following patch command to increase the resources on the ranker serve pod:

  ```
  oc patch wd wd --type=merge --patch='{"spec": {"mm": {"mmSideCar": {"resources":{"requests":{"cpu":"2","memory":"2000Mi"},"limits":{"cpu":"2","memory":"2000Mi"}}}}}}'
  ```

- Perform the following steps to increase the probe timeouts on the ranker serve pod:
  - Save the following patch as serve-ranker-patch.yaml:

    ```yaml
    apiVersion: oppy.ibm.com/v1
    kind: TemporaryPatch
    metadata:
      name: serve-ranker-patch
    spec:
      apiVersion: discovery.watson.ibm.com/v1
      kind: WatsonDiscovery
      name: wd
      patchType: patchStrategicMerge
      patch:
        wire:
          cr:
            spec:
              wire:
                serveRanker:
                  mmRuntime:
                    livenessProbe:
                      initialDelaySeconds: 180
                      periodSeconds: 60
                      timeoutSeconds: 45
                    readinessProbe:
                      initialDelaySeconds: 180
                      periodSeconds: 60
                      timeoutSeconds: 45
                    startupProbe:
                      initialDelaySeconds: 120
                      periodSeconds: 60
                      timeoutSeconds: 45
    ```

  - Apply the patch using the following command:

    ```
    oc apply -f serve-ranker-patch.yaml
    ```
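As an optional follow-up check, not part of the documented procedure, you can confirm that the temporary patch object was created and watch the ranker pods until they stabilize. This is a sketch that assumes the TemporaryPatch CRD resolves to the temporarypatch resource name and reuses the object names and namespace variable from the steps above.

```
# Confirm the TemporaryPatch object created from serve-ranker-patch.yaml (resource name assumed)
oc get temporarypatch serve-ranker-patch -n ${PROJECT_CPD_INST_OPERANDS}

# Watch the ranker pods until they report Running and stop restarting
oc get pods -n ${PROJECT_CPD_INST_OPERANDS} -w | grep ranker
```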
Watson Discovery is not accessible after upgrading
Applies to: 5.1.0 and later
- Error
-
After upgrading to 5.1.0 or later, the Watson Discovery service is healthy, but not accessible from IBM Cloud Pak for Data. This error is likely to occur if the service was originally installed as Version 4.0.x.
- Cause
-
This issue is caused by a version parameter in the spec of the Watson Discovery CR.
- Solution
-
- Check the zen extensions to ensure the following are present:

  ```
  oc get zenextensions/discovery-gw-addon zenextensions/wd-discovery-watson-gateway-gw-instance
  ```

  If these zen extensions are missing, there might be an override in the Watson Discovery CR.

- Verify the version of Watson Gateway currently in use:

  ```
  oc get watsongateways
  ```

  ```
  NAME                          VERSION   READY   READYREASON   UPDATING   UPDATINGREASON   AGE
  wd-discovery-watson-gateway   main      True    Stable        False      Stable           117d
  ```

- If the result shows the version as main, perform the following steps:
  - Edit the Watson Discovery CR:

    ```
    oc edit wd wd
    ```

  - Locate the relevant version line in the spec:

    ```
    Gateway:
      size: small
      version: main
    ```

  - Remove version: main, then save and exit.
  - Monitor the gateway object for changes. This can take several minutes. After it shows as using the default version, attempt to access Watson Discovery through IBM Cloud Pak for Data. The response should appear similar to:

    ```
    oc get watsongateways
    ```

    ```
    NAME                          VERSION   READY   READYREASON   UPDATING   UPDATINGREASON   AGE
    wd-discovery-watson-gateway   default   True    Stable        False      Stable           19d
    ```
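To monitor the gateway object as described in the last step, one option is to watch it until the VERSION column switches to default. This is a sketch, not a documented step, and assumes the instance namespace variable used elsewhere on this page.

```
# Watch the Watson Gateway object until VERSION shows "default"
oc get watsongateways -n ${PROJECT_CPD_INST_OPERANDS} -w
```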
Upgrading from Version 4.8.6 or earlier to Version 5.1.0 does not complete
Applies to: 5.1.0
Fixed in: 5.1.1
- Error
-
When upgrading Watson Discovery from Version 4.8.6 or earlier to Version 5.1.0, the upgrade does not complete and the status remains in the InProgress state.
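A quick way to confirm the stuck status, following the oc get wd pattern used elsewhere on this page (shown here as an illustrative sketch rather than a documented step):

```
# The READYREASON column stays at InProgress while the upgrade is stuck
oc get wd wd -n ${PROJECT_CPD_INST_OPERANDS}
```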
- Cause
-
This error is caused by a parameter setting in PostgreSQL.
- Solution
-
- Enable superuser access on the source cluster:

  ```
  oc patch cluster.postgresql wd-discovery-cn-postgres --type merge --patch '{"spec": {"enableSuperuserAccess": true}}'
  ```

- Delete the new PostgreSQL CR so that the Watson Discovery operator can recreate it:

  ```
  oc delete cluster.postgresql wd-discovery-cn-postgres16
  ```
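As an optional follow-up check (a sketch based on the resource names in the steps above, not a documented step), you can confirm that the operator recreates the PostgreSQL cluster and that the upgrade finishes:

```
# The wd-discovery-cn-postgres16 cluster should be recreated by the operator
oc get cluster.postgresql -n ${PROJECT_CPD_INST_OPERANDS}

# The upgrade is complete when the Watson Discovery CR reports Stable
oc get wd wd -n ${PROJECT_CPD_INST_OPERANDS}
```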
Failed to restore Elasticsearch data in Watson Discovery
Applies to: 4.8.x and later
- Error
-
After running OADP restore, sometimes Elasticsearch data is not restored in Watson Discovery. If this problem occurs, you can see an error message in the CPD-CLI*.log log file under the cpd-cli-workspace/logs directory, for example:

```
"[cloudpak:cloudpak_snapshot_2024-09-01-15-07-58/COvZbNZfTgGYBZ7OfSfOfA] cannot restore index [.ltrstore] because an open index with same name already exists in the cluster. Either close or delete the existing index or restore the index under a different name by providing a rename pattern and replacement name"
```
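To locate the message quickly, one option (an illustrative sketch that assumes the log directory shown above) is to search the cpd-cli logs directly:

```
# Search the cpd-cli logs for the Elasticsearch restore error
grep "cannot restore index" cpd-cli-workspace/logs/CPD-CLI*.log
```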
- Cause
-
This error occurs when the .ltrstore index is created by deployment/wd-discovery-training-crud before the backup data is restored.
- Solution
-
- Go to the PROJECT_CPD_INST_OPERANDS namespace:

  ```
  oc project ${PROJECT_CPD_INST_OPERANDS}
  ```

- Get an Elasticsearch pod name:

  ```
  pod=$(oc get pod -l icpdsupport/addOnId=discovery,app=elastic,role=master,tenant=wd --field-selector=status.phase=Running -o jsonpath='{.items[0].metadata.name}')
  ```

- Note the number of replicas of deployment/wd-discovery-training-crud:

  ```
  oc get deployment wd-discovery-training-crud -o jsonpath='{.spec.replicas}'
  ```

- Scale down deployment/wd-discovery-training-crud:

  ```
  oc patch wd wd --type merge --patch '{"spec": {"wire": {"trainingCrud": {"trainingCrudReplicas": 0}}}}'
  ```

- Delete the .ltrstore index:

  ```
  oc exec $pod -c elasticsearch -- bash -c 'curl -XDELETE -s -k -u ${ELASTIC_USER}:${ELASTIC_PASSWORD} "${ELASTIC_ENDPOINT}/.ltrstore"'
  ```

- Get the name of the snapshot that includes the Watson Discovery data:

  ```
  oc exec $pod -c elasticsearch -- bash -c 'curl -XGET -s -k -u ${ELASTIC_USER}:${ELASTIC_PASSWORD} "${ELASTIC_ENDPOINT}/_cat/snapshots/cloudpak?h=id&s=end_epoch"'
  ```

  The command output indicates the latest snapshot name, for example:

  ```
  cloudpak_snapshot_2024-09-01-15-07-58
  ```

- Restore using the snapshot (replace <snapshot-name> with your snapshot name):

  ```
  oc exec $pod -c elasticsearch -- bash -c 'curl -XPOST -s -k -u ${ELASTIC_USER}:${ELASTIC_PASSWORD} "${ELASTIC_ENDPOINT}/_snapshot/cloudpak/<snapshot-name>/_restore"'
  ```

- Scale deployment/wd-discovery-training-crud back up to its original state:

  ```
  oc patch wd wd --type merge --patch '{"spec": {"wire": {"trainingCrud": {"trainingCrudReplicas": <number-of-original-replicas>}}}}'
  ```
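As an optional check after the restore, you can confirm that the .ltrstore index exists again. This is a sketch that reuses the $pod variable and Elasticsearch credentials from the steps above; it is not part of the documented procedure.

```
# List the .ltrstore index; it should be present again after the restore completes
oc exec $pod -c elasticsearch -- bash -c 'curl -XGET -s -k -u ${ELASTIC_USER}:${ELASTIC_PASSWORD} "${ELASTIC_ENDPOINT}/_cat/indices/.ltrstore?v"'
```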
Custom resources are not accessible from the Teach domain concepts section after upgrading
Applies to: Upgrading from 4.7.1 and 4.7.2 to any later version
- Error
-
In rare cases, a resource clean-up job might invalidate resources in certain projects when upgrading Watson Discovery. Invalidated resources lead to issues such as dictionaries and entity extractors not being accessible from the Teach domain concepts section of the Improvement tools panel on the Improve and customize page.
- Cause
-
An issue with the resource clean-up job in 4.7.1 and 4.7.2 invalidates the project resources, resulting in this issue.
- Solution
- Scale down the wd-cnm-api pod before upgrading Watson Discovery from 4.7.1 or 4.7.2:

  ```
  oc -n ${namespace} patch wd wd --type=merge --patch '{"spec": {"cnm": {"apiServer": {"replicas": 0}}}}'
  ```

  After completing the upgrade process, either scale up the pod to its default value or scale the pod to a specific number of replicas instead of the default value.

  To scale up the pod to its default value, run the following commands:

  ```
  oc -n ${namespace} patch wd wd --type=merge --patch '{"spec": {"cnm": {"apiServer": {"replicas": 1}}}}'
  oc -n ${namespace} patch wd wd --type=json --patch '[{"op":"remove","path":"/spec/cnm"}]'
  ```

  To scale the pod to a specific number of replicas, run the following commands:

  ```
  oc -n ${namespace} patch wd wd --type=merge --patch '{"spec": {"cnm": {"apiServer": {"replicas": 1}}}}'
  oc -n ${namespace} patch wd wd --type=merge --patch "{\"spec\": {\"cnm\": {\"apiServer\": {\"replicas\": ${num_of_replicas}}}}}"
  ```
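To confirm the replica count after patching, you can check the corresponding deployment. This is a hedged example; the deployment name is assumed to match the wd-cnm-api pod prefix mentioned above.

```
# Show the current replica count of the cnm api deployment (name prefix assumed)
oc -n ${namespace} get deployment | grep cnm
```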
During shutdown the DATASTOREQUIESCE field does not update
Applies to: 5.1.0 and later
- Error
-
After successfully executing the cpd-cli manage shutdown command, the DATASTOREQUIESCE state in the Watson Discovery resource is stuck in QUIESCING:

```
# oc get WatsonDiscovery wd -n "${PROJECT_CPD_INST_OPERANDS}"
NAME   VERSION   READY   READYREASON   UPDATING   UPDATINGREASON   DEPLOYED   VERIFIED   QUIESCE    DATASTOREQUIESCE   AGE
wd     4.7.3     True    Stable        False      Stable           24/24      24/24      QUIESCED   QUIESCING          16h
```
- Cause
-
Due to the way quiescing Postgres works, the Postgres pods are still running in the background. As a result, the metadata is not updated in the Watson Discovery resource.
- Solution
- There is no fix for this. However, the state being stuck in QUIESCING does not affect the Watson Discovery operator.
Upgrade fails due to existing Elasticsearch 6.x indices
Applies to: 5.1.0 and later
- Error
- If the existing Elasticsearch cluster has indices that were created with Elasticsearch 6.x, then upgrading Watson Discovery to Version 5.0.0 and later fails.

  ```
  > oc get wd wd
  NAME   VERSION   READY   READYREASON   UPDATING   UPDATINGREASON   DEPLOYED   VERIFIED   QUIESCE        DATASTOREQUIESCE   AGE
  wd     4.8.0     False   InProgress    True       VerifyWait       2/24       1/24       NOT_QUIESCED   NOT_QUIESCED       63m
  ```

- Cause
- Watson Discovery checks for the existence of deprecated versions of indices in the Elasticsearch cluster when upgrading to Version 5.0.0 and later.
- Solution
- To determine whether existing Elasticsearch 6.x indices are the cause of the upgrade failure, verify the log of the wd-discovery-es-detect-index pod using the following command:
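The command itself is not included above. As a hedged sketch, a typical way to view that pod's log, using the pod placeholder and namespace variable conventions from earlier sections of this page, is:

```
# Find the detect-index pod, then view its log
oc get pods -n ${PROJECT_CPD_INST_OPERANDS} | grep es-detect-index
oc logs <wd-discovery-es-detect-index-pod> -n ${PROJECT_CPD_INST_OPERANDS}
```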
UpgradeError is shown after resizing PVC
Applies to: 5.1.0 and later
- Error
- After you edit the custom resource to change the size of a persistent volume claim for a data store, an error is shown.
- Cause
- You cannot change the persistent volume claim size of a component by updating the custom resource. Instead, you must change the size directly on the persistent volume claim after it is created.
- Solution
- To prevent the error, undo the changes that were made to the YAML file. For more information about the steps to follow to change the persistent volume claim size successfully, see Scaling an existing persistent volume claim size.
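For illustration only, a minimal sketch of changing the size on the persistent volume claim itself, assuming the storage class supports volume expansion; <pvc-name> and <new-size> are placeholders, and the linked topic describes the supported procedure:

```
# Patch the PVC directly rather than the custom resource
oc patch pvc <pvc-name> -n ${PROJECT_CPD_INST_OPERANDS} --type merge -p '{"spec":{"resources":{"requests":{"storage":"<new-size>"}}}}'
```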
Disruption of service after upgrading, restarting, or scaling by updating scaleConfig
Applies to: 5.1.0 and later
- Error
- After upgrading, restarting, or scaling Watson Discovery by updating the scaleConfig parameter, the Elasticsearch component might become non-functional, resulting in disruption of service and data loss.
- Cause
- The Elasticsearch component uses a quorum of pods to ensure availability when it completes search operations. However, each pod in the quorum must recognize the same pod as the leader of the quorum. The system can run into issues when more than one leader pod is identified.
- Solution
- To determine if confusion about the quorum leader pod is the cause of the issue, complete the following steps:
  - Log in to the cluster, and then set the namespace to the project where the Discovery resources are installed.
  - Check each of the Elasticsearch pods with the role of master to see which pod it identifies as the quorum leader. Each pod must list the same pod as the leader.

    ```
    oc get pod -l icpdsupport/addOnId=discovery,app=elastic,role=master,tenant=wd \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' | while read i; do echo $i; oc exec $i \
    -c elasticsearch -- bash -c 'curl -ksS "localhost:19200/_cat/master?v"'; echo; done
    ```

    For example, in the following result, two different leaders are identified. Pods 1 and 2 identify pod 2 as the leader. However, pod 0 identifies itself as the leader.

    ```
    wd-ibm-elasticsearch-es-server-master-0
    id                     host      ip        node
    7q0kyXJkSJirUMTDPIuOHA 127.0.0.1 127.0.0.1 wd-ibm-elasticsearch-es-server-master-0

    wd-ibm-elasticsearch-es-server-master-1
    id                     host      ip        node
    L0mqDts7Rh6HiB0aQ4LLtg 127.0.0.1 127.0.0.1 wd-ibm-elasticsearch-es-server-master-2

    wd-ibm-elasticsearch-es-server-master-2
    id                     host      ip        node
    L0mqDts7Rh6HiB0aQ4LLtg 127.0.0.1 127.0.0.1 wd-ibm-elasticsearch-es-server-master-2
    ```
If you find that more than one pod is identified as the leader, contact IBM Support.
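Before you contact IBM Support, an optional way to capture the overall Elasticsearch cluster health is shown below. This is a sketch that reuses the pod selector from the steps above; it is not a documented step.

```
# Query cluster health from one of the Elasticsearch master pods
pod=$(oc get pod -l icpdsupport/addOnId=discovery,app=elastic,role=master,tenant=wd -o jsonpath='{.items[0].metadata.name}')
oc exec $pod -c elasticsearch -- bash -c 'curl -ksS "localhost:19200/_cluster/health?pretty"'
```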
Limitations
- Formulas that are embedded as images, especially those containing division bars (horizontal fractions) or other complex notations, are not reliably recognized or extracted by Watson Discovery. As a result, these formulas might be omitted, misinterpreted, or rendered incorrectly in the extracted output. This limitation stems from how the SDU pipeline handles embedded images, and currently affects all versions of Watson Discovery that use SDU.
- The service supports single-zone deployments; it does not support multi-zone deployments.
- You cannot upgrade the Watson Discovery service by using the service-instance upgrade command from the Cloud Pak for Data command-line interface.
- You cannot use the Cloud Pak for Data OpenShift® APIs for Data Protection (OADP) backup and restore utility to do an offline backup and restore of the Watson Discovery service. Online backup and restore with OADP is available.