Known issues and limitations for IBM Cloud Pak for Data
Upgrade to IBM Software Hub Version 5.1 before IBM Cloud Pak for Data Version 4.7 reaches end of support. For more information, see Upgrading IBM Software Hub in the IBM Software Hub Version 5.1 documentation.
The following issues apply to IBM Cloud Pak for Data.
Each issue includes information about the releases that it applies to. If the issue was fixed in a refresh, that information is also included.
- Customer-reported issues
- General issues
- Installation and upgrade issues
- Security issues
- Backup and restore issues (all methods)
- Online backup and restore with IBM Storage Fusion issues
- Online backup and restore with the OADP backup and restore utility issues
- Offline backup and restore with the OADP backup and restore utility issues
- Cloud Pak for Data API issues
- Service issues
Customer-reported issues
Issues that are found after the release are posted on the IBM Support site.
General issues
- Services with a dependency on Db2 as a service crash due to an npm EACCES error
- Intermittent login issues when Cloud Pak for Data is integrated with the Identity Management Service
- The create-rsi-patch command fails
- Common core services is not aligned on the New diagnostics job page
- Critical alerts might appear on the home page after installation
- Elasticsearch pods shut down when they reach their ephemeral storage
- Storage volume pods cannot start when the persistent volume has a lot of files
Services with a dependency on Db2 as a service crash due to an npm EACCES error
If you have one of the following services installed on Cloud Pak for Data, you might see deployments returning the CrashLoopBackOff status:
- Db2®
- Db2 Warehouse
- OpenPages®
- Watson Knowledge Catalog
- Symptoms
-
- The zen-databases deployment returns the CrashLoopBackOff status when you try to complete one of the following procedures:
  - Install the service
  - Upgrade the service
  - Restore from an offline backup
  - Restart a zen-databases deployment
  To check the status of the zen-databases deployments, run the following command:
  oc get pods -n=${PROJECT_CPD_INST_OPERATORS} -l component=zen-databases
- For any pods that are in the CrashLoopBackOff state, check the pod logs:
  oc logs <pod-name> -n=${PROJECT_CPD_INST_OPERATORS} | grep "npm ERR!"
  Look for the following error:
npm ERR! code EACCES
npm ERR! syscall mkdir
npm ERR! path /.npm
npm ERR! errno -13
npm ERR!
npm ERR! Your cache folder contains root-owned files, due to a bug in
npm ERR! previous versions of npm which has since been addressed.
npm ERR!
npm ERR! To permanently fix this problem, please run:
npm ERR!   sudo chown -R 1000700000:0 "/.npm"
- Resolving the problem
- To resolve the problem, run the following command:
  oc patch deployment zen-databases \
    -n=${PROJECT_CPD_INST_OPERATORS} \
    --type='json' \
    --patch='[{"op": "add", "path": "/spec/template/spec/containers/0/env/-", "value": {"name": "npm_config_cache", "value": "/tmp"}}]'
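To confirm that the environment variable was added, you can inspect the deployment afterwards. This is an informal check, not part of the documented fix:
  oc get deployment zen-databases -n=${PROJECT_CPD_INST_OPERATORS} \
    -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="npm_config_cache")].value}'
The command should return /tmp, and the zen-databases pods should leave the CrashLoopBackOff state after they restart.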
Intermittent login issues when Cloud Pak for Data is integrated with the Identity Management Service
Applies to: 4.7.0 and later
Users might intermittently see one of the following errors when they try to log in:
- Error 504 - Gateway Timeout
- Internal Server Error
Some users might be directed to the Identity providers page rather than the Cloud Pak for Data home page.
- Diagnosing the problem
- If users experience one or more of the issues described in the preceding text, check the platform-identity-provider pods to determine whether the pods have been restarted multiple times:
  oc get pods -n ${PROJECT_CPD_INST_OPERANDS} | grep platform-identity-provider
  If the output indicates multiple restarts, proceed to Resolving the problem.
- Resolving the problem
-
- Restart the icp-mongodb-0 pod:
  oc delete pod icp-mongodb-0 -n ${PROJECT_CPD_INST_OPERANDS}
- Restart the icp-mongodb-1 pod:
  oc delete pod icp-mongodb-1 -n ${PROJECT_CPD_INST_OPERANDS}
- Restart the icp-mongodb-2 pod:
  oc delete pod icp-mongodb-2 -n ${PROJECT_CPD_INST_OPERANDS}
- Restart the platform-auth-service pod:
  oc get pods -n ${PROJECT_CPD_INST_OPERANDS} | grep platform-auth-service | awk '{print $1}' | xargs oc delete pod -n ${PROJECT_CPD_INST_OPERANDS}
- Restart the platform-identity-management pod:
  oc get pods -n ${PROJECT_CPD_INST_OPERANDS} | grep platform-identity-management | awk '{print $1}' | xargs oc delete pod -n ${PROJECT_CPD_INST_OPERANDS}
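If you prefer to script the sequence, the following sketch performs the same restarts in order. It assumes the default pod names and prefixes shown in the steps above:
  # Restart the MongoDB pods one at a time
  for i in 0 1 2; do
    oc delete pod icp-mongodb-$i -n ${PROJECT_CPD_INST_OPERANDS}
  done
  # Restart the authentication and identity-management pods by name prefix
  for prefix in platform-auth-service platform-identity-management; do
    oc get pods -n ${PROJECT_CPD_INST_OPERANDS} --no-headers | grep "$prefix" | awk '{print $1}' \
      | xargs -r oc delete pod -n ${PROJECT_CPD_INST_OPERANDS}
  done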
The create-rsi-patch command fails
Applies to: 4.7.3 and later
The cpd-cli manage create-rsi-patch command fails when you try to create or update a resource specification injection (RSI) patch.
- Diagnosing the problem
-
- Confirm that the command fails during the following task:
TASK [utils : Wait until zen extension Cr is in Completed state]
- Review the errors in the ibm-zen-operator pod log:
  - Set the ZEN_OPERATOR_POD environment variable to the name of the zen-operator pod in the operators project for the instance:
    export ZEN_OPERATOR_POD=$(oc get pods -n=${PROJECT_CPD_INST_OPERATORS} | grep zen-operator | awk '{print $1}')
project:
oc project ${PROJECT_CPD_INST_OPERATORS} - Open a remote shell on the
pod:
oc rsh ${ZEN_OPERATOR_POD} - In the remote shell, set the following environment variables:
- Set the PROJECT_CPD_INST_OPERANDS environment variable to the operands project for the instance:
  export PROJECT_CPD_INST_OPERANDS=<project-name>
- Set the RSI_EXTENSION environment variable to the name of the ZenExtension that was created to manage the RSI patch:
  export RSI_EXTENSION=rsi-<patch-name>
  Replace <patch-name> with the name of the patch that you attempted to create or update.
- Run the following command to look for the message We were unable to read either as JSON nor YAML in the latest log file:
  cat /tmp/ansible-operator/runner/zen.cpd.ibm.com/v1/ZenExtension/${PROJECT_CPD_INST_OPERANDS}/${RSI_EXTENSION}/artifacts/latest/stdout | grep "We were unable to read either as JSON nor YAML"
  If the command returns a non-empty response, continue to Resolving the problem.
- Resolving the problem
- If you need to use the RSI feature, contact IBM Support.
Common core services is not aligned on the New diagnostics job page
Applies to: 4.7.0 and later
- Diagnosing the problem
When you create a new diagnostics job, the Common Core Services option is not aligned with the other services on the page.

The Common Core Services row functions normally. This issue does not prevent you from creating a diagnostics job. After you create the job, you must refresh the Gather diagnostics page to see the job.
Critical alerts might appear on the home page after installation
Applies to: 4.7.0
- Diagnosing the problem
-
After installation, you might see critical alerts on the home page. However, the events that generated the alerts have cleared. The alerts will continue to display on the Alerts card for up to 3 days unless you delete the pods that generated the alerts.
The alerts are visible to users with one of the following permissions:
- Administer platform
- Manage platform health
- View platform health
- Log in to the web client as a user with the appropriate permissions to view alerts.
- On the home page, click View all on the Alerts card.
- On the Alerts and events page, confirm that the alerts were generated by
one of the following services:
- Common core services
- Watson Knowledge Catalog
This issue can occur because wkc-db2u-init pods or jdbc-driver-sync-job pods are in an Error state.
- Resolving the problem
- An instance administrator or cluster administrator must resolve the problem.
-
Log in to Red Hat® OpenShift® Container Platform as a user with sufficient permissions to complete the task.
  oc login ${OCP_URL}
- Check the status of the wkc-db2u-init pods and the jdbc-driver-sync-job pods:
  oc get pods --sort-by=.status.startTime -n ${PROJECT_CPD_INST_OPERANDS} | grep -E 'wkc-db2u-init|jdbc-driver'
- Delete any pods that are in the Error state. Replace <pod-name> with the name of the pod in the error state:
  oc delete pod <pod-name> -n ${PROJECT_CPD_INST_OPERANDS}
-
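If several pods are in the Error state, the deletions can be combined into one pass. The following sketch reuses the pod name patterns from the preceding step; it is an informal shortcut, not part of the documented procedure:
  oc get pods -n ${PROJECT_CPD_INST_OPERANDS} --no-headers \
    | grep -E 'wkc-db2u-init|jdbc-driver' | grep Error | awk '{print $1}' \
    | xargs -r oc delete pod -n ${PROJECT_CPD_INST_OPERANDS}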
Elasticsearch pods shut down when they reach their ephemeral storage
Applies to: 4.7.2 and 4.7.3
Fixed in: 4.7.4
- Diagnosing the problem
-
When the ibm-elasticsearch-operator-ibm-es-controller-manager pod reaches its ephemeral storage limit of 1Gi, the pod fails with a ContainerStatusUnknown error. Each time the pod fails, a new pod is created. This problem occurs because of debug logs that are created.
- Get the status of the ibm-elasticsearch-operator-ibm-es-controller-manager pods:
  oc get pods -n=${PROJECT_CPD_INST_OPERATORS} | grep ibm-elasticsearch-operator-ibm-es-controller-manager
- If the command returns one or more pods in the ContainerStatusUnknown state, check the ephemeral storage setting:
  oc get csv ibm-elasticsearch-operator.v1.1.1654 \
    -n=${PROJECT_CPD_INST_OPERATORS} \
    -o jsonpath='{.spec.install.spec.deployments[0].spec.template.spec.containers[0].resources.limits.ephemeral-storage}'
  If the command returns a value of 1Gi, continue to Resolving the problem.
- Get the status of the
- Resolving the problem
-
- Delete the ibm-elasticsearch-operator-ibm-es-controller-manager pods that are in the ContainerStatusUnknown state:
  oc get pods -n=${PROJECT_CPD_INST_OPERATORS} | grep ibm-elasticsearch-operator-ibm-es-controller-manager | grep ContainerStatusUnknown | awk '{print $1}' | xargs oc delete pod -n=${PROJECT_CPD_INST_OPERATORS}
- Update the ephemeral storage setting:
  oc patch csv ibm-elasticsearch-operator.v1.1.1654 \
    -n=${PROJECT_CPD_INST_OPERATORS} \
    --type=json \
    --patch='[{"op": "replace", "path": "/spec/install/spec/deployments/0/spec/template/spec/containers/0/resources/limits/ephemeral-storage", "value": "2Gi"}]'
- Confirm that the patch was successfully applied:
  oc get csv ibm-elasticsearch-operator.v1.1.1654 \
    -n=${PROJECT_CPD_INST_OPERATORS} \
    -o jsonpath='{.spec.install.spec.deployments[0].spec.template.spec.containers[0].resources.limits.ephemeral-storage}'
  The command should return a value of 2Gi.
- Wait several minutes, then confirm that no pods are in the ContainerStatusUnknown state:
  oc get pods -n=${PROJECT_CPD_INST_OPERATORS} | grep ibm-elasticsearch-operator-ibm-es-controller-manager
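If you want to wait programmatically rather than checking by hand, a simple polling loop works. This is an informal sketch that reuses the same grep patterns as the step above:
  while oc get pods -n=${PROJECT_CPD_INST_OPERATORS} \
      | grep ibm-elasticsearch-operator-ibm-es-controller-manager \
      | grep -q ContainerStatusUnknown; do
    echo "Pods still in ContainerStatusUnknown; waiting..."
    sleep 30
  done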
Storage volume pods cannot start when the persistent volume has a lot of files
Applies to: 4.7.3 and 4.7.4
If you have a storage volume that points to a persistent volume that contains a large number of files, the storage volume pods cannot start. This issue occurs on Red Hat OpenShift Data Foundation storage.
- Diagnosing the problem
- If you cannot access the storage volume:
- From the web client, ensure that the file server is not Stopped.
- From the navigation menu, select .
- Click the storage volume name and check the File server status.
- If the file server is Stopped, restart it.
- If the file server is Running, proceed to the next step.
- Set the
VOLUMES_PROJECTenvironment variable to the name of the project where the storage volume exists:- The volume is in the operands project
-
export VOLUMES_PROJECT=${PROJECT_CPD_INST_OPERANDS} - The volume is in a tethered project
-
export VOLUMES_PROJECT=${PROJECT_CPD_INSTANCE_TETHERED}
- Get the name of the
pod:
oc get pods -n ${VOLUMES_PROJECT} | grep volumes-volume - Set the
VOLUMES_PODenvironment variable to the name of the pod:export VOLUMES_POD=<pod-name> - Get the pod
logs:
oc logs ${VOLUMES_POD} -n ${VOLUMES_PROJECT}
Look for an error with the following format:
Error: kubelet may be retrying requests that are timing out in CRI-O due to system load. Currently at stage container volume configuration: context deadline exceeded: error reserving ctr name k8s_volumes-... for id <ID>: name is reserved
If the logs include the preceding error, proceed to Resolving the problem.
- From the web client, ensure that file server is not Stopped.
- Resolving the problem
- To enable the storage volume pod to start:
- Get the name of the deployment associated with the
pod:
oc get deployments -n ${VOLUMES_PROJECT} | grep volumes-volume
VOLUMES_DEPLOYMENTenvironment variable to the name of the deployment:export VOLUMES_DEPLOYMENT=<deployment-name> - Patch the deployment to add the
io.kubernetes.cri-o.TrySkipVolumeSELinuxLabel: truelabel:oc patch deploy ${VOLUMES_DEPLOYMENT}\ --namespace=${VOLUMES_PROJECT} \ --type='json' \ --patch='[{"op": "add", "path": "/spec/template/metadata/annotations/io.kubernetes.cri-o.TrySkipVolumeSELinuxLabel", "value": "true" }]'
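To confirm that the annotation was applied and that the pod starts, you can run the following informal checks (not part of the documented procedure):
  oc get deploy ${VOLUMES_DEPLOYMENT} -n ${VOLUMES_PROJECT} \
    -o jsonpath='{.spec.template.metadata.annotations.io\.kubernetes\.cri-o\.TrySkipVolumeSELinuxLabel}'
  oc get pods -n ${VOLUMES_PROJECT} | grep volumes-volume
The first command should return true, and the storage volume pod should reach the Running state shortly afterwards.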
Installation and upgrade issues
- General installation and upgrade issues
- The apply-cluster-components command fails when another IBM Cloud Pak is installed on the cluster
- Installs and upgrades fail when you use a proxy server
- The apply-cr command fails when installing the zen component
- The apply-cr command fails when installing services with a dependency on Db2U
- The apply-cr command fails if the --components list includes the scheduler component
- Upgrades from 4.5
- Cannot update operators with a dependency on etcd or RabbitMQ
- Upgrades from 4.7
- The setup-instance-topology command fails
- General upgrade issues
- Service upgrades fail when the ibm-zen-operator pod runs out of memory
- The apply-cr command fails when services with a dependency on the common core services get stuck in the InProgress state during upgrade
- After you upgrade a Red Hat OpenShift Container Platform cluster, the FoundationDB resource can become unavailable
- Inaccurate status message from the command line after you upgrade
- Secrets are not visible in connections after upgrade
The apply-cluster-components command fails when another IBM Cloud Pak is installed on the cluster
Applies to: 4.7.0, 4.7.1, 4.7.2, and 4.7.3
Fixed in: 4.7.4
If you try to install IBM Cloud Pak for Data Version 4.7 and another IBM Cloud Pak is already installed on the cluster, the apply-cluster-components command fails with the following message:
[✘] [ERROR] The max version of ibm-common-services-operator installed is X.XX.X.
Version 4.0.0 is the minimum required version.
Re-run the command with the '--migrate_from_cs_ns' option.
This problem occurs when the existing IBM Cloud Pak is running IBM Cloud Pak foundational services Version 2.x.
- Resolving the problem
- To resolve the problem, re-run the cpd-cli manage apply-cluster-components command without the --migrate_from_cs_ns option.
Installs and upgrades fail when you use a proxy server
Applies to: 4.7.0 and 4.7.1
Fixed in: 4.7.2
If you use a cluster-wide proxy for Red Hat OpenShift Container Platform, the cpd-cli manage apply-cr command fails during the zen installation or upgrade.
- Diagnosing the problem
-
- The cpd-cli manage apply-cr command times out while installing or upgrading the zen component.
- Examine the ZenService custom resource:
  oc describe ZenService \
    --namespace=${PROJECT_CPD_INST_OPERANDS}
  In the Status section, confirm that the following information is true:
  - The Progress is 66%.
  - The message specifies an unknown playbook failure.
- Examine the zen-watchdog-frontdoor-extension ZenExtension:
  oc describe ZenExtensions zen-watchdog-frontdoor-extension \
    --namespace=${PROJECT_CPD_INST_OPERANDS}
  In the Status section, confirm that one of the following messages is displayed:
  - 403 error:
    Status code was -1 and not [200, 404]: Request failed: <urlopen error Tunnel connection failed: 403 Forbidden>
  - 502 error:
    Status code was -1 and not [200, 404]: Request failed: <urlopen error Tunnel connection failed: 502 Bad Gateway>
  - 503 error:
    Status code was -1 and not [200, 404]: Request failed: <urlopen error Tunnel connection failed: 503 Service Unavailable>
  - 504 error:
    Status code was -1 and not [200, 404]: Request failed: <urlopen error Tunnel connection failed: 504 Gateway Timeout>
- Confirm that the issue is caused by your proxy settings:
- Set the ZEN_OPERATOR_POD environment variable to the name of the zen-operator pod in the operators project for the instance:
  export ZEN_OPERATOR_POD=$(oc get pods -n=${PROJECT_CPD_INST_OPERATORS} | grep zen-operator | awk '{print $1}')
- Change to the operators project:
  oc project ${PROJECT_CPD_INST_OPERATORS}
- Open a remote shell on the pod:
  oc rsh ${ZEN_OPERATOR_POD}
- In the remote shell, set PROJECT_CPD_INST_OPERANDS to the operands project for the instance:
  export PROJECT_CPD_INST_OPERANDS=<project-name>
- Run the following command to determine whether you can access the zen-core-api service:
  curl -vks "https://zen-core-api-svc.${PROJECT_CPD_INST_OPERANDS}:4444/v2/config"
  If there is an issue with the proxy configuration, the command returns one of the following error codes: 403, 502, 503, or 504.
- Resolving the problem
-
- Update the ZEN_CORE_API_URL property in the product-configmap ConfigMap:
  oc patch cm product-configmap \
    --namespace=${PROJECT_CPD_INST_OPERANDS} \
    --type=merge \
    --patch="{\"data\": {\"ZEN_CORE_API_URL\": \"https://zen-core-api-svc.${PROJECT_CPD_INST_OPERANDS}.svc:4444\"}}"
- Confirm that the patch was applied:
  oc get cm product-configmap \
    --namespace=${PROJECT_CPD_INST_OPERANDS} \
    -o yaml
  The value of the ZEN_CORE_API_URL property should be https://zen-core-api-svc.<project-name>.svc:4444, where <project-name> is the value of the PROJECT_CPD_INST_OPERANDS environment variable.
- Patch the ZenService custom resource to trigger a reconcile loop:
  oc patch ZenService lite-cr \
    --namespace=${PROJECT_CPD_INST_OPERANDS} \
    --type=merge \
    --patch='{"spec": {"patchProductConfigmap": "true"}}'
- Wait for the status of the zen component to be Completed. To check the status of the zen component, run:
  cpd-cli manage get-cr-status \
    --cpd_instance_ns=${PROJECT_CPD_INST_OPERANDS} \
    --components=zen
The apply-cr command fails when installing the zen component
Applies to: 4.7.0, 4.7.1, and 4.7.2
Fixed in: 4.7.3
When you install IBM Cloud Pak for Data Version 4.7, the apply-cr command fails when installing the zen component. This issue occurs when the PostgreSQL API server is installed on the cluster.
- Diagnosing the problem
-
- The apply-cr command fails when installing the zen component.
- Review the ibm-zen-operator pod log:
  - Get the name of the ibm-zen-operator pod:
    oc get pods -n ${PROJECT_CPD_INST_OPERATORS} | grep ibm-zen-operator
  - Check the pod log for a RecursionError:
    oc logs <ibm-zen-operator-pod-name> -n ${PROJECT_CPD_INST_OPERATORS} | grep "RecursionError: maximum recursion depth exceeded in comparison"
    If the command returns a non-empty response, continue to the next step.
- Examine the ZenService custom resource:
  oc describe ZenService -n ${PROJECT_CPD_INST_OPERANDS}
  In the Status section, confirm that the following information is true:
  - The Progress is 3% or 66%.
  - The message specifies an unknown playbook failure.
- Resolving the problem
- Contact IBM Software support to patch the following operators:
- ibm-zen-operator
- cpd-platform-operator
The apply-cr command fails when installing services with a dependency on Db2U
Applies to: 4.7.0, 4.7.1, 4.7.2, and 4.7.3
Fixed in: 4.7.4
- Diagnosing the problem
- You can specify the privileges that Db2U runs with. If you configured Db2U to run with limited privileges, the apply-cr command fails if:
  - You set DB2U_RUN_WITH_LIMITED_PRIVS: "true" in the db2u-product-cm ConfigMap.
  - The kernel parameter settings were not modified to allow Db2U to run with limited privileges.
  This issue can manifest in several ways.
  - The wkc-db2u-init job fails
    If you are installing Watson Knowledge Catalog, the apply-cr command fails with the message "WKC DB2U post install job failed ('wkc-db2u-init' job)". When you get the status of the wkc-db2u-init pods, they are in the Error state.
    oc get pods -n ${PROJECT_CPD_INST_OPERANDS} | grep wkc-db2u-init
  - The Db2uCluster resource never becomes ready
    For other services, you might notice that the Db2uCluster resource never becomes Ready.
    oc get Db2uCluster -n ${PROJECT_CPD_INST_OPERANDS}
  - You cannot provision service instances
    For services such as Db2 and Db2 Warehouse, the apply-cr command completes successfully, but the service instances never finish provisioning and the *-db2u-0 pods are stuck in Pending or SysctlForbidden.
    oc get pods -n ${PROJECT_CPD_INST_OPERANDS} | grep db2u-0
- Resolving the problem
- This problem occurs when you set DB2U_RUN_WITH_LIMITED_PRIVS: "true" in the db2u-product-cm ConfigMap but the kernel parameter settings were not modified to allow Db2U to run with limited privileges. Review Changing kernel parameter settings to confirm that you can change the kernel parameter settings.
  - If you can change the kernel parameter settings, ensure that the worker nodes are restarted after you change the settings. In some cases, when you run the cpd-cli manage apply-db2-kubelet command, the worker nodes are not restarted.
  - If you cannot or do not change the kernel parameter settings, update the db2u-product-cm ConfigMap to set DB2U_RUN_WITH_LIMITED_PRIVS: "false". For more information, see Specifying the privileges that Db2U runs with.
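Before deciding which path to take, it can help to confirm which mode is currently configured. The following is an informal check; it assumes the db2u-product-cm ConfigMap is in the operands project for the instance:
  oc get cm db2u-product-cm -n ${PROJECT_CPD_INST_OPERANDS} \
    -o jsonpath='{.data.DB2U_RUN_WITH_LIMITED_PRIVS}'
The command returns "true" if Db2U is configured to run with limited privileges.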
The apply-cr command fails if the --components list includes the scheduler component
Applies to: 4.7.0 and 4.7.1
Fixed in: 4.7.2
- Diagnosing the problem
-
When you run the cpd-cli manage apply-cr command and the --components list includes the scheduler component, the command returns the following error:
[ERROR] You cannot use the apply-cr command to install or upgrade the scheduling service (component=scheduler). To install or upgrade the scheduling service, run the apply-scheduler command.
Remove the scheduler component from the --components list. If you are using an environment variables script to populate --components, temporarily remove scheduler from the COMPONENTS environment variable so that you can run the apply-cr command. Re-add the component after the apply-cr command runs successfully.
- Resolving the problem
-
- If you are installing Cloud Pak for Data Version 4.7.2, you have a version of the olm-utils-v2 image with the fix.
- If you are installing Cloud Pak for Data Version 4.7.0 or 4.7.1, run the following command when the client workstation is connected to the internet:
  cpd-cli manage restart-container
  The command loads the latest version of the olm-utils-v2 image on the client workstation. If you pull images from a private container registry, you can use the following commands to move the latest version of the image to the private container registry:
  - cpd-cli manage save-image (required only if the client workstation cannot connect to the internet and the private container registry at the same time)
  - cpd-cli manage copy-image
Cannot update operators with a dependency on etcd or RabbitMQ
Applies to: Upgrades from Version 4.5.x
Fixed in: 4.7.2 (etcd issue)
- Diagnosing the problem
- When you run the cpd-cli manage apply-olm command, the operator for one or more services with a dependency on etcd or RabbitMQ might get stuck in the Installing phase.
- Resolving the problem
-
If one or more pods are in the CrashLoopBackOff state, complete the following steps to resolve the problem.
- Check the current limits and requests for the operator with pods that are in a poor state. If all operators were stuck, repeat this process for each operator.
  - Set the OP_NAME environment variable to the name of the operator.
    export OP_NAME=<operator-name>
  - Check the current limits for the operator.
    oc get csv -n ${PROJECT_CPD_INST_OPERATORS} ${OP_NAME} \
      -o jsonpath='{.spec.install.spec.deployments[0].spec.template.spec.containers[0].resources.limits.memory}'
  - Check the current requests for the operator.
    oc get csv -n ${PROJECT_CPD_INST_OPERATORS} ${OP_NAME} \
      -o jsonpath='{.spec.install.spec.deployments[0].spec.template.spec.containers[0].resources.requests.memory}'
  - Choose the appropriate action based on the values returned by the preceding commands.
    - If either the limits or requests are less than 1Gi, continue to the next step.
    - If both values are 1Gi or greater, then the cause of the problem was misdiagnosed. This solution will not resolve the issues that you are seeing.
- Increase the memory limits and requests for the affected operator. If all operators are stuck, repeat this process for each operator.
  - Create a JSON file named patch.json with the following content.
    [
      {
        "op": "replace",
        "path": "/spec/install/spec/deployments/0/spec/template/spec/containers/0/resources/requests/memory",
        "value": "1Gi"
      },
      {
        "op": "replace",
        "path": "/spec/install/spec/deployments/0/spec/template/spec/containers/0/resources/limits/memory",
        "value": "1Gi"
      }
    ]
  - Ensure that the OP_NAME environment variable is set to the correct operator name.
    echo ${OP_NAME}
  - Patch the operator.
    oc patch csv -n ${PROJECT_CPD_INST_OPERATORS} ${OP_NAME} \
      --type=json --patch="$(cat patch.json)"
  - Confirm that the patch was successfully applied.
    oc get csv -n ${PROJECT_CPD_INST_OPERATORS} ${OP_NAME} \
      -o jsonpath='{.spec.install.spec.deployments[0].spec.template.spec.containers[0].resources.limits.memory}'
    The command should return 1Gi.
    Important: The patch is temporary. The memory settings apply only to the current deployment. The next time that you update the operator, the settings are replaced by the default settings.
The setup-instance-topology command fails
Applies to: Upgrades from 4.7.0 and 4.7.1
Fixed in: 4.7.2
When you run the cpd-cli manage setup-instance-topology command, the command fails. The error occurs because the ibm-common-service-operator.v4.0.1 operator does not come up.
- Diagnosing the problem
- To determine whether the failure was caused by the ibm-common-service-operator operator, run the following command.
  oc get csv --namespace=${PROJECT_CPD_INST_OPERATORS} | grep ibm-common-service-operator
  - If the phase is Failed, proceed to Resolving the problem.
  - If the phase is Succeeded, use the information returned by the cpd-cli to identify the root cause of the problem.
- Resolving the problem
- To resolve the problem, follow the guidance in Operator installation or upgrade fails with exceeded progress deadline error in the IBM Cloud Pak foundational services documentation.
Service upgrades fail when the ibm-zen-operator pod runs out of memory
Applies to:
- Upgrades from Version 4.5 to Version 4.7
- Upgrades from Version 4.6 to Version 4.7
When you run the cpd-cli manage apply-cr command to upgrade a service, the command fails. In some situations, this error occurs because the ibm-zen-operator pod runs out of memory.
- Diagnosing the problem
- To determine whether the service upgrade failed because the ibm-zen-operator pod ran out of memory:
  - Set the OP_NAME environment variable to the name of the ibm-zen-operator pod:
    export OP_NAME=$(oc get pods -n ${PROJECT_CPD_INST_OPERATORS} | grep ibm-zen | awk '{print $1}')
  - Get the status of the ibm-zen-operator pod:
    oc get pods ${OP_NAME} -n ${PROJECT_CPD_INST_OPERATORS}
    If the pod ran out of memory, the pod will either be in CrashLoopBackOff or in Running with a large number of restarts.
    - If either of the preceding statements is true, continue to the next step.
    - If neither of the statements is true, review the other known issues.
  - Check the last state of the pod:
    - If the pod is in the CrashLoopBackOff state, run the following command:
      oc get pods ${OP_NAME} -n ${PROJECT_CPD_INST_OPERATORS} -o yaml
    - If the pod is in the Running state but has been restarted multiple times, run the following command:
      oc describe pods ${OP_NAME} -n ${PROJECT_CPD_INST_OPERATORS}
    Review the containerStatuses section of the output. Confirm that the last state includes "exitCode":137,..."reason":"OOMKilled".
- Resolving the problem
- To resolve the problem, patch the ibm-zen-operator pod to increase the memory limit to 2Gi:
  - Patch the pod:
    oc patch pod ${OP_NAME} \
      --namespace=${PROJECT_CPD_INST_OPERATORS} \
      --type=merge \
      --patch='{"spec":{"containers":[{"name":"ibm-zen-operator", "resources":{"limits":{"memory":"2Gi"}}}]}}'
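To confirm that the new limit is in place, you can read it back from the pod spec. This is an informal check, not part of the documented fix:
  oc get pod ${OP_NAME} -n ${PROJECT_CPD_INST_OPERATORS} \
    -o jsonpath='{.spec.containers[0].resources.limits.memory}'
The command should return 2Gi.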
The apply-cr command fails when services with a dependency on the common core services get stuck in the InProgress state during upgrade
Applies to:
- Upgrades from Version 4.5 to Version 4.7
- Upgrades from Version 4.6 to Version 4.7
When you upgrade to IBM Cloud Pak for Data Version 4.7, services with a dependency on the common core services get stuck in the InProgress state. The apply-cr command fails waiting for the components to be Completed. This problem occurs when the existing Elasticsearch pods are not successfully terminated.
For a list of services with a dependency on the common core services, see Software requirements.
- Diagnosing the problem
-
- The apply-cr command fails with the following message:
  [ERROR] Playbook failed while running 'wait_for_cr.yml' of '<cr-name>' for component '<component-ID>' in namespace '<project-name>'.
- Review the custom resource for the component to confirm that the issue is caused by common core services. Run the following command to see the contents of the custom resource:
  oc get <service-kind> <service-cr> -n ${PROJECT_CPD_INST_OPERANDS} -o yaml
  Use the following table to find the appropriate values for <service-kind> and <service-cr>. If the apply-cr command returns a different name for the custom resource, use the name returned by the apply-cr command.

  | Component | Service kind | Default custom resource (CR) name |
  |---|---|---|
  | cognos_analytics | CAService | ca-addon-cr |
  | datastage_ent | DataStage | datastage |
  | datastage_ent_plus | DataStage | datastage |
  | dv | DvService | dv-service |
  | match360 | MasterDataManagement | mdm-cr |
  | replication | ReplicationService | replicationservice-cr |
  | wkc | WKC | wkc-cr |
  | wml | WmlBase | wml-cr |
  | ws | WS | ws-cr |
  | ws_pipelines | WSPipelines | wspipelines-cr |

- Review the output for the following message:
  message: |-
    Dependency CCS failed to install
- Review the common core services custom resource:
  oc get ccs ccs-cr -n ${PROJECT_CPD_INST_OPERANDS} -o yaml
  Look for the following message:
  message: |-
    unknown playbook failure
    The playbook has failed at task - 'check if original es sts is ready'
- Check the status of the Elasticsearch StatefulSet that is created by common core services:
  oc get sts -n ${PROJECT_CPD_INST_OPERANDS} --selector=app=elasticsearch-master
  If some of the pods in the StatefulSet are not ready, proceed to Resolving the problem.
- Resolving the problem
- To resolve the issue, forcefully delete the pods that are associated with the StatefulSet:
  oc delete pods -n ${PROJECT_CPD_INST_OPERANDS} -l app=elasticsearch-master --force
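After the forced deletion, it can help to confirm that the replacement pods come up cleanly before you retry the apply-cr command. The following informal check reuses the selector from the diagnosis steps:
  oc get sts -n ${PROJECT_CPD_INST_OPERANDS} --selector=app=elasticsearch-master
  oc get pods -n ${PROJECT_CPD_INST_OPERANDS} -l app=elasticsearch-master
All pods in the StatefulSet should report a Ready status.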
After you upgrade a Red Hat OpenShift Container Platform cluster, the FoundationDB resource can become unavailable
Applies to: 4.7.0 and later
- Diagnosing the problem
- After you upgrade your cluster to a new version of Red Hat OpenShift Container Platform, the IBM FoundationDB pods can become unavailable. When this issue occurs, services that rely on FoundationDB, such as Watson Knowledge Catalog and IBM Match 360, cannot function correctly. This issue affects deployments of the following services.
- IBM Watson® Knowledge Catalog
- IBM Match 360 with Watson
- Resolving the problem
- To resolve this issue, restart the FoundationDB
pods.
Required role: To complete this task, you must be a cluster administrator.
- Restart the FoundationDB cluster pods.
  oc get fdbcluster
  oc get po | grep ${CLUSTER_NAME} | grep -v backup | awk '{print $1}' | xargs oc delete po
  Replace ${CLUSTER_NAME} in the command with the name of your fdbcluster instance.
- Restart the FoundationDB operator pods.
  oc get po | grep fdb-controller | awk '{print $1}' | xargs oc delete po
- Check the FoundationDB
status.
oc get fdbcluster -o yaml | grep fdbStatusThe returned status must be
Complete. - Check to ensure that the database is
available.
oc rsh sample-cluster-log-1 /bin/fdbcliIf the database is still not available, complete the following steps.
- Log in to the
ibm-fdb-controllerpod. - Run the
fix-coordinatorscript.kubectl fdb fix-coordinator-ips -c ${CLUSTER_NAME} -n ${PROJECT_CPD_INST_OPERATORS}Replace ${CLUSTER_NAME} in the command with the name of your
fdbclusterinstance.Note: For more information about thefix-coordinatorscript, see the workaround steps from the resolved IBM Match 360 known issue item The FoundationDB cluster can become unavailable.
- Log in to the
- Check the FoundationDB
status.
- Restart the FoundationDB cluster
pods.
Inaccurate status message from the command line after you upgrade
Applies to the following services:
- Watson Assistant
- Watson Discovery
- Watson Knowledge Studio
- Watson Speech services
- Diagnosing the problem
- If you run the cpd-cli service-instance upgrade command from the Cloud Pak for Data command-line interface, and then use the service-instance list command to check the status of each service, the provision status for the service is listed as UPGRADE_FAILED.
- Cause of the problem
- When you upgrade the service, only the cpd-cli manage apply-cr command is supported. You cannot use the cpd-cli service-instance upgrade command to upgrade the service. After you upgrade the service with the apply-cr method, the change in version and status is not recognized by the service-instance command. However, the correct version is displayed in the Cloud Pak for Data web client.
- Resolving the problem
- No action is required. If you use the cpd-cli manage apply-cr method to upgrade the service as documented, the upgrade is successful and you can ignore the version and status information that is generated by the cpd-cli service-instance list command.
Secrets are not visible in connections after upgrade
Applies to: Version 4.7.0 and later
If you use secrets when you create connections, the secrets are not visible in the connection details after you upgrade Cloud Pak for Data. This issue occurs when your vault uses a private CA signed certificate.
- Resolving the problem
- To see the secrets in the user interface:
- Change to the project where Cloud Pak for Data is
installed:
  oc project ${PROJECT_CPD_INST_OPERANDS}
- Set the following environment variables:
  oc set env deployment/zen-core-api VAULT_BRIDGE_TLS_RENEGOTIATE=true
  oc set env deployment/zen-core-api VAULT_BRIDGE_TOLERATE_SELF_SIGNED=true
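To confirm that both variables are set on the deployment, you can list its environment. This is an informal check, not part of the documented procedure:
  oc set env deployment/zen-core-api --list -n ${PROJECT_CPD_INST_OPERANDS} | grep VAULT_BRIDGE
The output should show VAULT_BRIDGE_TLS_RENEGOTIATE=true and VAULT_BRIDGE_TOLERATE_SELF_SIGNED=true.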
Security issues
Security scans return an Inadequate Account Lockout Mechanism message
Applies to: 4.7.0 and later
- Diagnosing the problem
-
If you run a security scan against Cloud Pak for Data, the scan returns the following message.
Inadequate Account Lockout Mechanism
- Resolving the problem
-
This is by design. It is strongly recommended that you use an enterprise-grade password management solution, such as SAML SSO or an LDAP provider, for password management.
Backup and restore issues (all methods)
Online backup of Watson Discovery fails at checkpoint stage
Applies to: 4.7.3 and later
- Diagnosing the problem
- When you try to create an online backup, the backup process fails at the checkpoint hook stage.
For example, if you are creating the backup with IBM Storage Fusion, the backup process fails at the
Hook: br-service-hooks-checkpoint stage in the backup sequence. In the log
file, you see an error message similar to the following
example:
download failed: s3://common-zen-wd/mt/__built-in-tenant__/fileResource/701db916-fc83-57ab-0000-000000000010.zip to tmp/s3-backup/common-zen-wd/mt/__built-in-tenant__/fileResource/701db916-fc83-57ab-0000-000000000010.zip Connection was closed before we received a valid response from endpoint URL: "https://s3.openshift-storage.svc:443/common-zen-wd/mt/__built-in-tenant__/fileResource/701db916-fc83-57ab-0000-000000000010.zip". - Cause of the problem
- Large resource files can become corrupted while they are downloaded to the backup. As a result, the wd-discovery-aux-ch-s3-backup job does not complete successfully.
- Workaround
- Delete the file that is shown in the error message and recreate it.
- Exec into the wd-discovery-support
pod:
oc exec -it deploy/wd-discovery-support -- bash - Do the following steps within the pod.
- Delete the
file:
mc rm s3://common-zen-wd/mt/__built-in tenant__/fileResource/<file_name> - Confirm that the file is not listed when you run the following
command:
mc ls s3://common-zen-wd/mt/__built-in-tenant__/fileResource/ - Exit from the pod.
- Delete the
file:
- Delete the wd-discovery-orchestrator-setup
job:
oc delete job/wd-discovery-orchestrator-setup - Wait for the wd-discovery-orchestrator-setup job to run again and complete.
Confirm that the file was successfully recreated:
- Exec into the wd-discovery-support
pod:
oc exec -it deploy/wd-discovery-support -- bash - Do the following steps within the pod.
- Copy the file to the tmp
directory:
aws-wd s3 cp s3://common-zen-wd/mt/__built-in-tenant__/fileResource/<file_name> /tmp - Confirm that the file is
copied:
ls /tmp/<file_name> - Exit from the pod.
- Copy the file to the tmp
directory:
You can now retake the backup, and the wd-discovery-aux-ch-s3-backup job will complete successfully.
Restore job fails at Db2 workload step
Applies to: 4.7.3 and later
- Diagnosing the problem
-
When restoring a Cloud Pak for Data deployment that includes Db2 or Db2 Warehouse, the restore job fails.
When backup and restore is done with IBM Storage Fusion, the restore job fails at the Hook: br-service-hooks/post-workload step.
In the cpdbr-oadp.log file, the following entry appears:
WV is not running yet...sleep 30
- Workaround
- Before you take an online backup, for each db2u deployment, run the following commands:
  clusters=$(kubectl get db2ucluster -o jsonpath='{.items[*].metadata.name}')
  for cluster in $clusters; do
    pod_name="c-${cluster}-db2u-0"
    echo "Running commands on ${pod_name}"
    # Check if wvcli exists in the pod
    if kubectl exec $pod_name -- bash -c "which wvcli 2> /dev/null"; then
      kubectl exec $pod_name -- bash -c "wvcli system mln-buffer --disable persist && sv stop wolverine && sv start wolverine"
    else
      echo "wvcli not found in $pod_name, skipping to the next pod."
    fi
  done
Online backup and restore with IBM Storage Fusion issues
Unable to run checkpoint backup post-hooks command after a failed online backup
Applies to: 4.7.0 and later
- Diagnosing the problem
- When an online backup fails, Cloud Pak for Data must
be returned to a good state before you can retry a backup. You return Cloud Pak for Data to a good state by running the checkpoint
backup post-hooks command:
cpd-cli oadp backup posthooks \
  --include-namespaces ${PROJECT_CPD_INST_OPERANDS} \
  --hook-kind=checkpoint \
  --log-level=debug \
  --verbose
Running the command results in an error like the following example:
Error: error running post-backup hooks: cannot get checkpoint id in namespace PROJECT_CPD_INST_OPERANDS, namespaces: [PROJECT_CPD_INST_OPERATORS PROJECT_CPD_INST_OPERANDS], checkpointId: , err: info is nil
[ERROR] <timestamp> RunPluginCommand:Execution error: exit status 1
- Do the following steps.
- Check whether the configmap cpdbr-ckpt-cm is
empty:
oc get cm cpdbr-ckpt-cm -o yaml
- Copy the following script:
create-checkpoint-id.sh script:
#!/bin/bash

function createUninitializedCmContent() {
  local hashedNamespaces=""
  local namespacesArr=()
  local quotedNamespacesArr=()
  local namespacesStr=""
  local namespacesJsonArr=""
  local uninitializedCheckpointLine=""
  local cpdOperatorLine=""
  local zenLine=""
  local currentTimeEpoch=$(date +%s)
  # if shasum command does not exists, resort to sha1sum command
  local hashCmd=$(command -v shasum > /dev/null && echo "shasum -a 1" || echo "sha1sum")

  # operator namespace should be appended first before the cpd namespace
  # to fully reproduce the logic in go
  if [ -n ${CPD_OPERATOR_NAMESPACE} ]; then
    namespacesArr+=(${CPD_OPERATOR_NAMESPACE})
  fi
  namespacesArr+=(${NAMESPACE})

  # do string join from array of namespaces to create hash
  IFS="_"
  namespacesStr="${namespacesArr[*]}"
  unset IFS

  # create json array string like ["ibm-common-services", "zen"]
  for namespace in "${namespacesArr[@]}"
  do
    quotedNamespacesArr+=("\"${namespace}\"")
  done
  IFS=","
  namespacesJsonArr="[${quotedNamespacesArr[*]}]"
  unset IFS

  hashedNamespaces="hk_$( echo -n $namespacesStr | eval $hashCmd | awk '{print $1}' )"
  uninitializedCheckpointId="dummy-$(uuidgen | tr A-Z a-z)"
  uninitializedCheckpointLine=$(cat << EOV
${hashedNamespaces}: '{"infos": [{"uid": "${uninitializedCheckpointId}","namespaces": ${namespacesJsonArr}, "createdAt": ${currentTimeEpoch}, "startTime":0, "completionTime": 0, "status": "", "hookInfos": null}]}'
EOV
)
  zenLine="ns_${NAMESPACE}: ${hashedNamespaces}"
  if [ ! -z ${CPD_OPERATOR_NAMESPACE} ]; then
    cpdOperatorLine="ns_${CPD_OPERATOR_NAMESPACE}: ${hashedNamespaces}"
  fi
  CONFIGMAP_DATA=$(cat << EOF
data:
  ${uninitializedCheckpointLine}
  ${zenLine}
  ${cpdOperatorLine}
EOF
)
}

function patchCheckpointData() {
  echo "writing to yaml file ${OUTPUT_FILE_PATH}..."
  cat > "${OUTPUT_FILE_PATH}" << EOF
${CONFIGMAP_DATA}
EOF
  echo "patching cpdbr-ckpt-cm with yaml file..."
  oc patch cm cpdbr-ckpt-cm --patch-file "${OUTPUT_FILE_PATH}"
  echo "all done!"
}

function help() {
  echo ""
  echo "create-checkpoint-id.sh - Tenant Backup and Restore"
  echo " SYNTAX:"
  echo " ./create-checkpoint-id.sh --namespace 'namespace' [--tenant-operator-namespace 'CPD Operators' | --out 'out yaml file' | --dry-run]"
  echo ""
  echo " COMMANDS:"
  echo " help : Display help usage"
  echo ""
  echo " PARAMETERS:"
  echo " --tenant-operator-namespace : CPD Operator namespace. Used with label-tenant command."
  echo " --namepace : zen namespace."
  echo " --out : output yaml file path. Default yaml file is 'cpdbr-ckpt-cm-uninitialized-patch.yaml'"
  echo " --dry-run : if flag is set, then do not directly patch the configmap"
  echo ""
  echo " NOTE: User must be logged into the Openshift cluster from the oc command line."
  echo ""
}

# main login
if [ $# -eq 0 ]; then
  echo "No parameters provided"
  help
  exit 1
fi

while (( $# )); do
  case "$1" in
    -n| --namespace)
      if [ -n "$2" ] && [ ${2:0:1} != "-" ]; then
        NAMESPACE=$2
        shift 2
      else
        echo "Invalid --namespace): ${2}"
        help
        exit 1
      fi
      ;;
    --tenant-operator-namespace)
      if [ -n "$2" ] && [ ${2:0:1} != "-" ]; then
        CPD_OPERATOR_NAMESPACE=$2
        shift 2
      else
        echo "Invalid --tenant-operator-namespace): ${2}"
        help
        exit 1
      fi
      ;;
    --out)
      if [ -n "$2" ] && [ ${2:0:1} != "-" ]; then
        OUTPUT_FILE_PATH=$2
        shift 2
      else
        echo "Invalid --out): ${2}"
        help
        exit 1
      fi
      ;;
    --dry-run)
      if [ -n "$1" ]; then
        DRY_RUN=1
        shift 1
      fi
      ;;
    help|-h|--h|-help|--help)
      # help
      help
      exit 0
      ;;
    -*|--*=)
      # unsupported flags
      echo "Invalid parameter $1" >&2
      help
      exit 1
      ;;
    *)
      # preserve positional arguments
      PARAMS="$PARAMS $1"
      shift
      ;;
  esac
done

if [ -z "$NAMESPACE" ]; then
  echo "--namespace has to be defined"
  help
  exit 1
fi

if [ -z "${OUTPUT_FILE_PATH}" ]; then
  OUTPUT_FILE_PATH="cpdbr-ckpt-cm-uninitialized-patch.yaml"
  echo "will defaulted to save checkpoint data at ${OUTPUT_FILE_PATH}"
fi

createUninitializedCmContent

echo "will patch configmap data with this content..."
echo "----------------------------------------------"
echo "${CONFIGMAP_DATA}"
echo "----------------------------------------------"

if [ -z "${DRY_RUN}" ]; then
  patchCheckpointData
fi
- Test the script by running the following
command:
./create-checkpoint-id.sh --namespace ${PROJECT_CPD_INST_OPERANDS} --tenant-operator-namespace ${PROJECT_CPD_INST_OPERATORS} --dry-run
- Run the script by removing the --dry-run option:
  ./create-checkpoint-id.sh --namespace ${PROJECT_CPD_INST_OPERANDS} --tenant-operator-namespace ${PROJECT_CPD_INST_OPERATORS}
- Re-run the checkpoint backup post-hooks command.
Unable to back up Cloud Pak for Data operators when OpenPages is installed
Applies to: 4.7.0, 4.7.1, and 4.7.2
Fixed in: 4.7.3
- Diagnosing the problem
- In IBM Storage Fusion, the backup status of the Cloud Pak for Data operators is Failed snapshot.
  In the log, the following entries appear:
  time=<timestamp> level=info msg=cmd stdout:
  time=<timestamp> level=info msg=cmd stderr: ksh: /mnt/backup/db2-online-backup.log: cannot create [Permission denied]
- Modify the openpages-<instance_name>-aux-ckpt-cm configmap.
- Edit the configmap:
  oc edit configmap -n ${PROJECT_CPD_INST_OPERANDS} openpages-<instance_name>-aux-ckpt-cm
- Locate the following line:
  "ksh -lc '/mnt/backup/online/db2-online-backup.sh max_no_of_days_between_full_backup=7 > /mnt/backup/db2-online-backup.log'"
- Replace the line with:
  "ksh -lc 'mkdir -p /mnt/backup/online/logs && /mnt/backup/online/db2-online-backup.sh max_no_of_days_between_full_backup=7 > /mnt/backup/online/logs/db2-online-backup.log'"
Db2 Data Management Console is not successfully restored
Applies to: 4.7.1
Fixed in: 4.7.2
- Diagnosing the problem
- When Cloud Pak for Data users open the My
instance page, the Db2 Data Management Console
instance is in a Pending state.
This problem occurs only when Cloud Pak for Data was upgraded from version 4.5.x or 4.6.x.
- Workaround
- Ask the Db2 Data Management Console administrator to delete the instance and reprovision it.
Restoring Cognos Dashboards remains in progress
Applies to: 4.7.0 and 4.7.1
Fixed in: 4.7.2
- Diagnosing the problem
- When you use IBM Storage Fusion Version 2.6 to back up a Cloud Pak for Data deployment that includes Cognos® Dashboards, and restore the deployment to a different cluster, the Cognos Dashboards restore operation remains in an InProgress state.
- Workaround
- To ensure that the Cognos Dashboards restore operation completes as expected, run the following commands before you create the backup.
-
Run the
cpd-cli manage login-to-ocpcommand to log in to the cluster as a user with sufficient permissions to complete this task. For example:cpd-cli manage login-to-ocp \ --username=${OCP_USERNAME} \ --password=${OCP_PASSWORD} \ --server=${OCP_URL}Tip: Thelogin-to-ocpcommand takes the same input as theoc logincommand. Runoc login --helpfor details. - Run the following
commands:
oc -n ${PROJECT_CPD_INST_OPERANDS} label service c-dashboard-redis-p icpdsupport/ignore-on-nd-backup=true oc -n ${PROJECT_CPD_INST_OPERANDS} label secret dashboard-redis-cert icpdsupport/ignore-on-nd-backup=true
For more information, see Prepare Cognos Dashboards.
-
Online backup and restore with the OADP backup and restore utility issues
- The ZenService custom resource is in a Failed state after restoring an online backup
- IBM Match 360 encounters errors during pre-backup and post-restore operations
- Unable to create online backup when Watson Knowledge Catalog and IBM Match 360 are installed in the same operand project
- After restoring IBM Match 360 from an online backup, the associated Redis pods can enter a CrashLoopBackOff state
- Watson Knowledge Catalog profiling operations fail after you restore an online backup
- Following an upgrade, some Db2 Data Management Console pods are not running after you restore an online backup
The ZenService custom resource is in a Failed state after restoring an online backup
Applies to: 4.7.2 and later
- Diagnosing the problem
- This problem occurs when Cloud Pak for Data is
integrated with the Identity Management Service.
Get the Identity Management Service operator logs.
- Get the pod name for ibm-iam-operator:
  oc get pod -A | grep -e ibm-iam-operator
output:
oc logs ibm-iam-operator-<pod_name> -n ${PROJECT_CPD_INST_OPERATORS}
The log contains an entry like in the following example:
{"level":"info","ts":1693307569.5649076,"logger":"leader","msg":"Leader lock configmap must have exactly one owner reference.","ConfigMap":{"namespace":"cpd-operators","name":"ibm-iam-operator-lock"}} - Get the pod name for
- Workaround
- Delete the ibm-iam-operator-lock
configmap.
oc delete cm ibm-iam-operator-lock -n ${PROJECT_CPD_INST_OPERATORS}
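To confirm the deletion, you can check for the configmap again. This is an informal check; the operator should recreate the lock configmap when it next acquires leadership:
  oc get cm ibm-iam-operator-lock -n ${PROJECT_CPD_INST_OPERATORS}
Immediately after the deletion, the command should return a NotFound error.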
IBM Match 360 encounters errors during pre-backup and post-restore operations
Applies to: 4.7.4
After a successful IBM Match 360 deployment, the CR status (mdmStatus) enters a Completed state and the operator stops its reconciliation process. During backup and restore operations, this can lead to errors in the pre-backup and post-restore steps.
- Resolving the problem
-
After installing the IBM Cloud Pak for Data platform with the IBM Match 360 service, run the provided script to resolve this issue by ensuring that the operator checks for the required backup and restore annotations. This enables reconciliation to complete after successful installations of IBM Match 360 so that the pre-backup and post-restore steps can complete.
Required role: To complete this task, you must be a cluster administrator.
To resolve this issue, complete the following steps:
- Ensure that the IBM Match 360 CR is in a Completed state:
  oc get MasterDataManagement mdm-cr -o jsonpath='{.status.mdmStatus} {"\n"}' -n ${PROJECT_CPD_INST_OPERATORS}
  The returned status should be Completed.
mdm-br-fix.sh.
mdm-br-fix.sh script:
set -e
if [ -z "$1" ]
then
  echo "ERROR: Please provide value for operator namespace"
  exit
fi
if [ -z "$2" ]
then
  echo "ERROR: Please provide value for operand namespace"
  exit
fi

cat << EOF >> verify_completed_state.yml
#
# _____{COPYRIGHT-BEGIN}_____
# IBM Confidential
# OCO Source Materials
#
# 5725-E59
#
# (C) Copyright IBM Corp. 2021-2023 All Rights Reserved.
#
# The source code for this program is not published or otherwise
# divested of its trade secrets, irrespective of what has been
# deposited with the U.S. Copyright Office.
# _____{COPYRIGHT-END}_____
#
# Switch status from Completed to InProgress if:
# 1. The instance identifier cannot be found
# 2. The targeted operand version does not match the currently installed operand (upgrade scenario)
# 3. Any dependency or service is in an unavailable state
- name: Get mdm CR
  k8s_info:
    api_version: mdm.cpd.ibm.com/v1
    kind: MasterDataManagement
    name: "{{ ansible_operator_meta.name }}"
    namespace: "{{ ansible_operator_meta.namespace }}"
  register: mdm_cr
- name: The current operand version
  debug:
    var: mdm_current_version
  vars:
    mdm_resource: "{{ (mdm_cr.resources | default([]))[0] | default({}) }}"
    mdm_current_version: "{{ (((mdm_resource.status | default({})).versions | default([]))[1] | default({})).version | default(None) }}"
- name: The targeted operand version
  debug:
    var: mdm_target_version
  vars:
    mdm_resource: "{{ (mdm_cr.resources | default([]))[0] | default({}) }}"
    mdm_target_version: "{{ (mdm_resource.spec | default({})).version | default(None) }}"
- name: Set var if Backup annotations are present
  set_fact:
    bkp_annotation: "{% if (mdm_cr.resources[0].metadata.annotations['mdm.cpd.ibm.com/backup-trigger'] is defined) %} True {% else %} False {% endif %}"
- name: Set var if Restore annotations are present
  set_fact:
    br_annotation: "{% if (mdm_cr.resources[0].metadata.annotations['mdm.cpd.ibm.com/restore-trigger'] is defined or mdm_cr.resources[0].metadata.annotations['mdm.cpd.ibm.com/restore-trigger-offline'] is defined) %} True {% else %} False {% endif %}"
- name: Check if Operator is in reconciliation state
  set_fact:
    reconcile_state: "{% if (mdm_cr.resources | length>0 and (mdm_cr.resources[0].status.conditions[2].message == 'Running reconciliation' and mdm_cr.resources[0].status.conditions[2].status == 'True')) %} True {% else %} False {% endif %}"
  when: (br_annotation is defined and br_annotation == " False ") or (bkp_annotation is defined and bkp_annotation == " False ")
- name: Set CR status to InProgress if the instance identifier cannot be found or the targeted operand version does not match the installed version or operator is in reconcile state
  operator_sdk.util.k8s_status:
    api_version: "mdm.cpd.ibm.com/v1"
    kind: "MasterDataManagement"
    name: "{{ ansible_operator_meta.name }}"
    namespace: "{{ ansible_operator_meta.namespace }}"
    status:
      mdmStatus: "InProgress"
  vars:
    mdm_resource: "{{ (mdm_cr.resources | default([]))[0] | default({}) }}"
    mdm_instance_id: "{{ (mdm_resource.status | default({})).instance_id | default(None) }}"
    mdm_target_version: "{{ (mdm_resource.spec | default({})).version | default(None) }}"
    mdm_current_version: "{{ (((mdm_resource.status | default({})).versions | default([]))[1] | default({})).version | default(None) }}"
  when: (mdm_instance_id is not defined and instance_identifier is not defined) or (mdm_target_version is defined and mdm_current_version is defined and mdm_target_version != mdm_current_version) or (reconcile_state is defined and reconcile_state == " True ")
- block:
    - name: Initialize all_services_available to true
      set_fact:
        all_services_available: true
    - name: Check availability of services post-install
      include_tasks: check_services.yml
    - name: Set CR status to InProgress if not all services are available
      operator_sdk.util.k8s_status:
        api_version: "mdm.cpd.ibm.com/v1"
        kind: "MasterDataManagement"
        name: "{{ ansible_operator_meta.name }}"
        namespace: "{{ ansible_operator_meta.namespace }}"
        status:
          mdmStatus: "InProgress"
      when: not all_services_available
  when:
    - instance_identifier is defined
EOF

cat << EOF >> skip_reconcile.yml
#
# _____{COPYRIGHT-BEGIN}_____
# IBM Confidential
# OCO Source Materials
#
# 5725-E59
#
# (C) Copyright IBM Corp. 2021-2023 All Rights Reserved.
#
# The source code for this program is not published or otherwise
# divested of its trade secrets, irrespective of what has been
# deposited with the U.S. Copyright Office.
# _____{COPYRIGHT-END}_____
#
- name: "Check saved and actual CR spec section data to determine if reconcile can be skipped"
  block:
    - name: Get mdm CR
      k8s_info:
        api_version: mdm.cpd.ibm.com/v1
        kind: MasterDataManagement
        name: "{{ ansible_operator_meta.name }}"
        namespace: "{{ ansible_operator_meta.namespace }}"
      register: mdm_cr
    - name: Get mdm-cr-cm
      k8s_info:
        api_version: v1
        kind: ConfigMap
        name: "mdm-{{ ansible_operator_meta.name }}-cm"
        namespace: "{{ ansible_operator_meta.namespace }}"
      register: mdm_cm
    - name: Compare data only when mdm-cr-cm and mdm-cr is present
      block:
        - name: Save current CR spec
          set_fact:
            current_spec: "{{ mdm_cr.resources[0].spec }}"
        - name: Get mdm-cr-cm configmap data
          set_fact:
            cm_data: "{{ mdm_cm.resources[0].data['instance.json'] }}"
        - name: Retrive Saved spec metadata from mdm-cr-cm
          set_fact:
            saved_spec: "{{ cm_data.create_arguments.metadata }}"
          when: cm_data is defined and cm_data | length>0
        - name: Set var if BR annotations are present
          set_fact:
            br_annotation: "{% if (mdm_cr.resources[0].metadata.annotations['mdm.cpd.ibm.com/restore-trigger'] is defined or mdm_cr.resources[0].metadata.annotations['mdm.cpd.ibm.com/restore-trigger-offline'] is defined) %} True {% else %} False {% endif %}"
        - name: Set var if Backup annotations are present
          set_fact:
            bkp_annotation: "{% if (mdm_cr.resources[0].metadata.annotations['mdm.cpd.ibm.com/backup-trigger'] is defined) %} True {% else %} False {% endif %}"
        - debug:
            msg: br_annotation {{ br_annotation }} bkp_annotation {{ bkp_annotation }}
      when: (mdm_cr.resources is defined and mdm_cr.resources | length>0) and (mdm_cm.resources is defined and mdm_cm.resources | length>0)
    - name: End Reconciliation if spec is unchanged and mdmStatus is Completed
      block:
        - debug:
            msg: "Previously saved and actual data from the MDM CR spec section are the same and CR status is Completed. Skipping reconcile by ending play"
        - meta: end_play
      when: (mdm_cr.resources | length>0) and (saved_spec is defined and current_spec is defined) and (saved_spec == current_spec) and (mdm_cr.resources[0].status is defined and (mdm_cr.resources[0].status.mdmStatus is defined and mdm_cr.resources[0].status.mdmStatus == "Completed")) and ((br_annotation is defined and br_annotation == " False ") or (bkp_annotation is defined and bkp_annotation == " False "))
EOF

MDM_OPERATOR_POD=`oc get pods -n $1 | grep ibm-mdm-operator-controller | awk '{print $1}'`
MDM_CR_NAME=`oc get mdm -n $2 | awk 'FNR ==2 {print $1}'`
if [ -z "$MDM_OPERATOR_POD" ]
then
  echo "ERROR: MDM operator pod is not present"
  exit
fi
if [ -z "$MDM_CR_NAME" ]
then
  echo "ERROR: MDM operand CR is not present"
  exit
fi
oc cp verify_completed_state.yml $MDM_OPERATOR_POD:/opt/ansible/roles/3.2.35/mdm_cp4d/tasks/verify_completed_state.yml -n $1
oc cp skip_reconcile.yml $MDM_OPERATOR_POD:/opt/ansible/roles/3.2.35/mdm_cp4d/tasks/skip_reconcile.yml -n $1
oc patch mdm $MDM_CR_NAME --type=merge -p '{"spec":{"onboard":{"timeout_seconds":"840"}}}' -n $2
rc=$?
if [ $rc != 0 ]; then
  echo "Error patching mdm-operator pod"
  exit $exitrc
else
  echo "Patch successful"
fi
rm -Rf verify_completed_state.yml
rm -Rf skip_reconcile.yml
- Give the script the permissions that it needs to
run:
chmod 777 mdm-br-fix.sh - Run the script using the following command, which includes two additional parameters: the
Operator namespace and the Operand namespace.
  ./mdm-br-fix.sh ${PROJECT_CPD_INST_OPERATORS} ${PROJECT_CPD_INST_OPERANDS}
  When the script is successful, the resulting message is similar to the following example:
  masterdatamanagement.mdm.cpd.ibm.com/mdm-cr patched
  Patch successful
- Proceed with the backup and restore operations. For more information, see Backing up and restoring Cloud Pak for Data.
- Ensure that the IBM Match 360 CR is in a Completed state.
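For example, you can inspect the CR status with a query like the following; the jsonpath expression is illustrative and assumes a single MasterDataManagement CR in the project:
oc get mdm -n ${PROJECT_CPD_INST_OPERANDS} -o jsonpath='{.items[0].status.mdmStatus}'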
Unable to create online backup when Watson Knowledge Catalog and IBM Match 360 are installed in the same operand project
Applies to: 4.7.0, 4.7.1, and 4.7.2
Fixed in: 4.7.3
- Diagnosing the problem
- Running the cpd-cli oadp checkpoint create command fails, and you see an error similar to the following example:
Error: error running checkpoint exec hooks: error processing configmap "wkc-foundationdb-cluster-aux-checkpoint-cm": component conflict detected for "fdb" defined in configmap "mdm-foundationdb-1691204250918697-aux-checkpoint-cm"
[ERROR] <timestamp> RunPluginCommand:Execution error: exit status 1
This problem occurs when both IBM Match 360 and Watson Knowledge Catalog are installed in the same operand project (namespace), and MANTA Automated Data Lineage is enabled for Watson Knowledge Catalog.
- Workaround
- Modify the backup and restore configmap of one of the services, such as mdm-foundationdb-<xxxxxxxxxxxxxxxx>-aux-checkpoint-cm (see the example command after these steps).
- In the aux-meta section, change the component: value to fdb-<service_name>.
- Locate and change the cpdfwk.component: value to the same value from the previous step.
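To make these edits, you can open the configmap with oc edit; the configmap name below is the example name from the first step:
oc edit cm mdm-foundationdb-<xxxxxxxxxxxxxxxx>-aux-checkpoint-cm -n ${PROJECT_CPD_INST_OPERANDS}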
After restoring IBM Match 360 from an online backup, the associated Redis pods can enter a CrashLoopBackOff state
Applies to: 4.7.1, 4.7.2, and 4.7.3
Fixed in: 4.7.4
- Diagnosing the problem
- When you restore IBM Match 360 from backup, the associated Redis pods can fail to come up, showing a status of CrashLoopBackOff.
- Workaround
-
To fix this problem, complete the following steps to clean up the Redis pods:
Required role: To complete this task, you must be a cluster administrator.
- Get all of the Redis recipes in the current namespace:
oc get recipes.redis.databases.cloud.ibm.com -n ${PROJECT_CPD_INST_OPERANDS}
- Delete each of the listed recipes:
oc delete recipes.redis.databases.cloud.ibm.com <RECIPE_NAME> -n ${PROJECT_CPD_INST_OPERANDS}
- Get the Redis CR name:
oc get redissentinel -n ${PROJECT_CPD_INST_OPERANDS}
- Back up the Redis CR and name it redissentinels.yaml:
oc get redissentinels <CR_NAME> -o yaml > redissentinels.yaml
- Delete the current Redis CR:
oc delete redissentinels <CR_NAME>
- Restore the Redis CR from the backup that you just created:
oc apply -f redissentinels.yaml
- Wait for reconciliation to complete. The Redis pods should then come up and enter a running state.
- Refresh the IBM Match 360 configuration UI pod (mdm-config-ui) to restore its Redis connection.
- Get the name of the mdm-config-ui pod:
oc get pod -n ${PROJECT_CPD_INST_OPERANDS} | grep mdm-config-ui
- Delete the mdm-config-ui pod by using the name that you retrieved in the previous step:
oc delete pod <CONFIG-UI-POD-NAME> -n ${PROJECT_CPD_INST_OPERANDS}
- Wait for the mdm-config-ui pod to be recreated and in a running state.
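After the cleanup, you can confirm that the affected pods are running with a check like the following; the name filters are illustrative:
oc get pod -n ${PROJECT_CPD_INST_OPERANDS} | grep -E 'redis|mdm-config-ui'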
Watson Knowledge Catalog profiling operations fail after you restore an online backup
Applies to: 4.7.0 and 4.7.1
Fixed in: 4.7.2
- Diagnosing the problem
- Profiling jobs fail with the following error message:
[ERROR ] {"class_name":"com.ibm.wdp.profiling.impl.messaging.consumer.DataProfileConsumer","method_name":"handleDataProfileFailure","class":"com.ibm.wdp.profiling.impl.messaging.consumer.DataProfileConsumer","method":"handleDataProfileFailure","appname":"wdp-profiling","user":"NONE","thread_ID":"db","trace_ID":"7wmei02iwgd5o70522m5ry8hm","transaction_ID":"NONE","timestamp":"2023-07-10T18:57:40.051Z","tenant":"NONE","session_ID":"NONE","perf":"false","auditLog":"false","loglevel":"SEVERE","message":"THROW. The WDPException is: Internal Server Error Failed to start Humming-Bird job..","msg_ID":"CDIWC2006E","exception":"com.ibm.wdp.service.common.exceptions.WDPException: CDIWC2006E: Internal Server Error Failed to start Humming-Bird job..\n\tat com.ibm.wdp.profiling.impl.messaging.consumer.DataProfileConsumer.handleDataProfileFailure(DataProfileConsumer.java:429)\n\tat com.ibm.wdp.profiling.impl.messaging.consumer.DataProfileConsumer.handleHummingbirdEvents(DataProfileConsumer.java:275)\n\tat com.ibm.wdp.profiling.impl.messaging.consumer.DataProfileConsumer.processDelivery(DataProfileConsumer.java:223)\n\tat com.ibm.wdp.service.common.rabbitmq.ConsumerManager.handle(ConsumerManager.java:308)\n\tat com.rabbitmq.client.impl.recovery.AutorecoveringChannel$4.handleDelivery(AutorecoveringChannel.java:642)\n\tat com.rabbitmq.client.impl.ConsumerDispatcher$5.run(ConsumerDispatcher.java:149)\n\tat com.rabbitmq.client.impl.ConsumerWorkService$WorkPoolRunnable.run(ConsumerWorkService.java:111)\n\tat java.base\/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)\n\tat java.base\/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)\n\tat java.base\/java.lang.Thread.run(Unknown Source)\n","component_ID":"wdp-profiling","message_details":"THROW. The WDPException is: Internal Server Error Failed to start Humming-Bird job.."}
CDIWC2006E: Internal Server Error Failed to start Humming-Bird job..
- Workaround
- Restart the wdp-profiling-<xxx> pod.
- Get the pod name:
oc get pod -l app=wdp-profiling --no-headers
- Restart the pod:
oc delete pod wdp-profiling-<xxx>
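If you prefer a single step, you can delete the pod by its label instead of looking up the name first; this sketch assumes the same app=wdp-profiling label that the lookup command uses:
oc delete pod -l app=wdp-profiling -n ${PROJECT_CPD_INST_OPERANDS}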
Following an upgrade, some Db2 Data Management Console pods are not running after you restore an online backup
Applies to: 4.7.0 and 4.7.1
Fixed in: 4.7.2
- Diagnosing the problem
-
After you restore an online backup, some Db2 Data Management Console pods are not running even though the custom resource status is Completed.
This problem occurs after Cloud Pak for Data is upgraded to version 4.7.0.
- Workaround
-
To work around the problem, do the following steps:
- Get the list of recipes:
oc get recipes.redis.databases.cloud.ibm.com -n ${PROJECT_CPD_INST_OPERANDS}
- Delete each recipe by running the following command:
oc delete recipes.redis.databases.cloud.ibm.com <recipe_name> -n ${PROJECT_CPD_INST_OPERANDS}
- Get the custom resource name:
oc get redissentinel -n ${PROJECT_CPD_INST_OPERANDS}
- Back up the custom resource:
oc get redissentinels <CR_NAME> -o yaml > redissentinels.yaml
- Delete the custom resource:
oc delete redissentinels <CR_NAME>
- Reapply the custom resource that you previously backed up:
oc apply -f redissentinels.yaml
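When reconciliation finishes, you can verify that the Db2 Data Management Console pods are running; the dmc name filter in this check is an assumption about the pod naming:
oc get pod -n ${PROJECT_CPD_INST_OPERANDS} | grep dmc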
Offline backup and restore with the OADP backup and restore utility issues
- Creating an offline backup in REST mode stalls
- Offline restic backup of Watson Knowledge Catalog fails
- After restoring IBM Match 360 from an offline backup, Redis pods can enter a CrashLoopBackOff state
- Restoring an offline backup can fail when MANTA Automated Data Lineage is enabled
- Cannot access the Cloud Pak for Data user interface after you restore Cloud Pak for Data
- Some Db2 Data Management Console pods are stuck during restore
Creating an offline backup in REST mode stalls
Applies to: 4.7.0 and later
- Diagnosing the problem
- This problem occurs when you try to create an offline backup in REST mode by using a custom --image-prefix value. The offline backup stalls with cpdbr-vol-mnt pods in the ImagePullBackOff state.
- Cause of the problem
- When you specify the --image-prefix option in the cpd-cli oadp backup create command, the option is ignored and the default prefix registry.redhat.io/ubi9 is always used.
- Resolving the problem
- To work around the problem, create the backup in Kubernetes mode instead. To change to this mode, run the following command:
cpd-cli oadp client config set runtime-mode=
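Before you switch modes, you can confirm that you are hitting this issue by looking for the stalled pods; the name filter matches the cpdbr-vol-mnt prefix from the diagnosis:
oc get pod -n ${PROJECT_CPD_INST_OPERANDS} | grep cpdbr-vol-mnt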
Offline restic backup of Watson Knowledge Catalog fails
Applies to: 4.7.0, 4.7.1, and 4.7.2
Fixed in: 4.7.3
- Diagnosing the problem
- The CPD-CLI*.log file has a message that shows that the wkc-unquiesce-job did not complete. A log entry for wkc-unquiesce-job has the following message:
This jobs is waiting for wkc-catalog-api-jobs:
===============================  ======  =====
<timestamp> INFO: Yet to Scale Up 1 objects
wkc-catalog-api-jobs  deploy  0/1
- Workaround
-
To fix this problem, do the following steps:
- Create a JSON file named specpatchreadiness.json with the following contents:
[
  {
    "op": "replace",
    "path": "/spec/containers/0/readinessProbe/exec/command",
    "value": [
      "sh",
      "-c",
      "curl --fail -G -sS -k --max-time 30 https://localhost:8081/actuator/health\n"
    ]
  }
]
- Copy the file to one of the following directories:
  - cpd-cli-workspace/olm-utils-workspace/work/rsi/
  - If you defined $CPD_CLI_MANAGE_WORKSPACE, $CPD_CLI_MANAGE_WORKSPACE/work/rsi
- Log in to the Cloud Pak for Data server:
cpd-cli manage login-to-ocp --server=${OCP_URL} -u ${OCP_USERNAME} -p ${OCP_PASSWORD}
- Install the resource specification injection (RSI) feature:
cpd-cli manage install-rsi --cpd_instance_ns=${PROJECT_CPD_INST_OPERANDS}
- Enable RSI:
cpd-cli manage enable-rsi --cpd_instance_ns=${PROJECT_CPD_INST_OPERANDS}
- Apply the following patch:
cpd-cli manage create-rsi-patch --cpd_instance_ns=${PROJECT_CPD_INST_OPERANDS} --patch_type=rsi_pod_spec \
  --patch_name=camsreadinessrsi \
  --description=\"This is spec patch for Catalog readiness Probe fix\" \
  --selector=app:wkc-catalog-api-jobs \
  --state=active \
  --spec_format=json \
  --patch_spec=/tmp/work/rsi/specpatchreadiness.json
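After the patch is applied and the pods restart, you can confirm that the new readiness probe is in place; the jsonpath query below is illustrative:
oc get pod -l app=wkc-catalog-api-jobs -n ${PROJECT_CPD_INST_OPERANDS} -o jsonpath='{.items[0].spec.containers[0].readinessProbe.exec.command}'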
After restoring IBM Match 360 from an offline backup, Redis pods can enter a CrashLoopBackOff state
Applies to: 4.7.0, 4.7.1, 4.7.2, and 4.7.3
Fixed in: 4.7.4
- Diagnosing the problem
- After you restore the IBM Match 360 service from an offline backup, the corresponding Redis pods can go into a CrashLoopBackOff state.
- Workaround
- To fix this problem, run the following command to delete and refresh the Redis pods:
oc get pod -n ${PROJECT_CPD_INST_OPERANDS} | grep mdm-redis | awk '{ print $1 }' | xargs oc delete pod -n ${PROJECT_CPD_INST_OPERANDS}
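After the pods are recreated, a check like the following confirms that they reach a Running state; it reuses the mdm-redis name filter from the preceding command:
oc get pod -n ${PROJECT_CPD_INST_OPERANDS} | grep mdm-redis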
Restoring an offline backup can fail when MANTA Automated Data Lineage is enabled
Applies to: 4.7.0
Fixed in: 4.7.1
- Diagnosing the problem
-
Restoring a Cloud Pak for Data offline backup that includes Watson Knowledge Catalog can fail if MANTA Automated Data Lineage is enabled.
- Workaround
-
To work around the problem, shut down the MANTA application before you create the backup, and manually start it after you restore the backup.
To shut down the MANTA application, do the following steps.
- Log in to Red Hat OpenShift Container Platform as a cluster administrator:
oc login ${OCP_URL}
- Change to the appropriate project (namespace). For example:
oc project wkc
- Edit the mantaflow custom resource:
oc edit mantaflow mantaflow-wkc
- Locate spec.
- Update the following line:
shutdown: "force"
- Wait until the MANTA pods are removed.
- Create the offline backup.
After the backup is restored, manually start the MANTA application by doing the following steps.
- Log in to Red Hat OpenShift Container Platform as a cluster administrator:
oc login ${OCP_URL}
- Change to the appropriate project (namespace). For example:
oc project wkc
- Edit the mantaflow custom resource:
oc edit mantaflow mantaflow-wkc
- Locate spec.
- Update the following line:
shutdown: "false"
- Wait until the MANTA pods are up.
Some Db2 Data Management Console pods are stuck during restore
Applies to: 4.7.0 and later
- Diagnosing the problem
- During restore, some Db2 Data Management Console pods remain stuck.
- Workaround
- Delete the Redis pods. The pods will then reconcile to a successful state.
- Get the list of Redis pods:
oc get po -n ${PROJECT_CPD_INST_OPERANDS} | grep redis
- Delete each pod by running the following command:
oc delete po <podname>
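The two steps can also be combined into a single command that follows the same pattern used elsewhere in this document; the redis name filter is unchanged:
oc get po -n ${PROJECT_CPD_INST_OPERANDS} | grep redis | awk '{print $1}' | xargs oc delete po -n ${PROJECT_CPD_INST_OPERANDS}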
Cannot access the Cloud Pak for Data user interface after you restore Cloud Pak for Data
Applies to: 4.7.0 and 4.7.1
Fixed in: 4.7.2
- Diagnosing the problem
-
When you try to access the Cloud Pak for Data user interface after you restore an offline backup to a different cluster, a 502 bad gateway error page appears. This problem occurs when Cloud Pak for Data is integrated with the Identity Management Service.
- Workaround
-
To resolve the problem, complete the following steps after you restore your Cloud Pak for Data instance.
- Save the NGINX deployment replicas as an environment variable:
NGINX_REPLICAS=`oc get deploy ibm-nginx -o jsonpath='{.spec.replicas}'`
- Scale the NGINX deployment replicas to 0:
oc scale deploy ibm-nginx --replicas=0
- Scale the NGINX deployment replicas back up to the value that you saved in step 1:
oc scale deploy ibm-nginx --replicas=${NGINX_REPLICAS}
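You can then confirm that the replicas are back; this check assumes that you are already switched to the Cloud Pak for Data project (otherwise, add -n ${PROJECT_CPD_INST_OPERANDS}):
oc get deploy ibm-nginx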
Cloud Pak for Data API issues
Calls to the users or currentUserInfo API methods without pagination might crash the zen-metastore-edb pods
Applies to: 4.7.0
- Diagnosing the problem
- When you run the /api/v1/usermgmt/v1/usermgmt/users API method without pagination in a small-scale environment, the zen-metastore-edb pods go into Crash mode or Not ready mode, and you cannot recover the pods.
- Resolving the problem
- When you run the /api/v1/usermgmt/v1/usermgmt/users API method, you must use pagination.
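For illustration only, a paginated call might look like the following sketch; the limit and offset parameter names, the ${CPD_ROUTE} host variable, and the ${TOKEN} bearer token are assumptions, so check the API reference for the exact parameters:
curl -k -H "Authorization: Bearer ${TOKEN}" "https://${CPD_ROUTE}/api/v1/usermgmt/v1/usermgmt/users?limit=100&offset=0"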
Service issues
The following issues are specific to services.