Known issues and limitations for IBM Cloud Pak for Data
Upgrade to IBM Software Hub Version 5.1 before IBM Cloud Pak for Data Version 4.8 reaches end of support. For more information, see Upgrading from IBM Cloud Pak for Data Version 4.8 to IBM Software Hub Version 5.1.
The following issues apply to IBM Cloud Pak for Data.
Each issue includes information about the releases that it applies to. If the issue was fixed in a refresh, that information is also included.
Customer-reported issues
Issues that are found after the release are posted on the IBM Support site.
General issues
- Namespace deleted when the --namespace parameter is used with the health network-performance command
- After rebooting the cluster, some services in Cloud Pak for Data on OpenShift® Data Foundation aren't functional
- You cannot activate notification forwarding from the Events tab
- The services catalog includes two IBM® Knowledge Catalog tiles
- Some text in services catalog is outdated after upgrade
- The Identity and user access card on the home page displays an error
- You are occasionally redirected to the wrong login page when Cloud Pak for Data is integrated with the Identity Management Service
- Intermittent login issues when Cloud Pak for Data is integrated with the Identity Management Service
- Warning alerts might appear on the home page after installation
- Tethered projects are not retrieved
- RSI patches are not applied to pods as expected
- Modifying existing RSI patches causes undefined label selector errors
- Cannot get the contents of the zen-metastore-edb pod as a YAML file
Namespace deleted when the --namespace parameter is used with the health network-performance command
Applies to: 4.8.4 and 4.8.5
Fixed in: 4.8.5 (v13.1.5r1) and later
To use the --namespace parameter, you must download and reinstall the latest version of cpd-cli (v13.1.5r1). For more information, see Installing the Cloud Pak for Data command-line interface (cpd-cli).
Important: With earlier versions of cpd-cli, do not use the --namespace parameter. Any namespace that you enter will be deleted.
After rebooting the cluster, some services in Cloud Pak for Data on OpenShift Data Foundation aren't functional
Applies to: 4.8.3 and later
- Diagnosing the problem
- After rebooting the cluster, some Cloud Pak for Data custom resources remain in the InProgress state. For more information about this problem, see Missing NodeStageVolume RPC call blocks new pods from going into Running state in the Red Hat® OpenShift Data Foundation 4.1.4 release notes.
- Workaround
- Do the following steps (a scripted version of these steps appears after this list):
  - Find the nodes that have pods that are in an Error state:
    oc get po -n ${PROJECT_CPD_INST_OPERANDS} -o wide | grep -v -P "Completed|(\d+)\/\1"
  - Mark each node as unschedulable:
    oc adm cordon <node_name>
  - Delete the affected pods:
    oc get pod | grep -Ev "Comp|0/0|1/1|2/2|3/3|4/4|5/5|6/6|7/7" | awk '{print $1}' | xargs oc delete po --force=true --grace-period=0
  - Mark each node as schedulable again:
    oc adm uncordon <node_name>
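The following is a minimal scripted sketch of these steps for a single affected node. It assumes the standard oc CLI, the ${PROJECT_CPD_INST_OPERANDS} variable from your installation environment file, and a placeholder node name that you must set; verify that the grep filters match your pod naming before running it.
# Sketch: cordon an affected node, force-delete its stuck pods, then uncordon it.
NODE=<node_name>   # placeholder: set to an affected node name
oc adm cordon ${NODE}
oc get pod -n ${PROJECT_CPD_INST_OPERANDS} -o wide \
  | grep ${NODE} \
  | grep -Ev "Comp|0/0|1/1|2/2|3/3|4/4|5/5|6/6|7/7" \
  | awk '{print $1}' \
  | xargs -r oc delete po -n ${PROJECT_CPD_INST_OPERANDS} --force=true --grace-period=0
oc adm uncordon ${NODE}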
You cannot activate notification forwarding from the Events tab
Applies to: 4.8.0
Fixed in: 4.8.1
To activate notification forwarding, use the tab that corresponds to the forwarding type:
- If you are forwarding email notifications, use the Activate button on the Email recipients tab.
- If you are forwarding notifications to an external application, use the Activate button on the External service tab.
The services catalog includes two IBM Knowledge Catalog tiles
Applies to: 4.8.0 and later
When you go to the Services catalog page in the web client, you see two tiles for the IBM Knowledge Catalog service.
- Resolving the problem
- To remove the extra tile from the services catalog, restart the zen-watcher pod:
  oc delete pod -n ${PROJECT_CPD_INST_OPERANDS} $(oc get pods -n ${PROJECT_CPD_INST_OPERANDS} | grep zen-watcher | awk '{print $1}')
Some text in services catalog is outdated after upgrade
Applies to: 4.8.0 and 4.8.1
Fixed in: 4.8.2
When you go to the Services catalog page in the web client, you see outdated information after you upgrade Cloud Pak for Data. For example, Watson Assistant was renamed to watsonx Assistant in Cloud Pak for Data Version 4.8. However, the services catalog still displays Watson Assistant.
- Resolving the problem
- To update the text in the services catalog:
  - Restart the zen-core-api pods:
    oc rollout restart deployment zen-core-api -n=${PROJECT_CPD_INST_OPERANDS}
  - Clear your browser cache and re-authenticate to the web client.
The Identity and user access card on the home page displays an error
Applies to: 4.8.0 and 4.8.1
Fixed in: 4.8.2
The Identity and user access card displays the following error:
The data cannot be displayed. The service might be restarting. Wait a few minutes and refresh the page. If the problem persists, contact your administrator.
However, refreshing the page does not resolve the problem.
- Resolving the problem
- To resolve the error, clear your web browser cache and cookies.
You are occasionally redirected to the wrong login page when Cloud Pak for Data is integrated with the Identity Management Service
Applies to: 4.8.0 and 4.8.1
Fixed in: 4.8.2
If your session expires and you click the Log in button on the
Logout page, you are sometimes redirected to a URL that starts with
https://cp-console rather than https://cpd. If you log
in to the https://cp-console URL, you are directed to the
Identity providers page, which you might not have access to.
- Resolving the issue
- Use the login URL provided by your administrator, or edit the URL to replace cp-console with cpd and try to log in again.
Intermittent login issues when Cloud Pak for Data is integrated with the Identity Management Service
Applies to: 4.8.0
Users might intermittently see one of the following errors when they try to log in:
- Error 504 - Gateway Timeout
- Internal Server Error
Some users might be directed to the Identity providers page rather than the Cloud Pak for Data home page.
- Diagnosing the problem
- If users experience one or more of the issues described in the preceding text, check the platform-identity-provider pods to determine whether the pods have been restarted multiple times:
  oc get pods -n ${PROJECT_CPD_INST_OPERANDS} | grep platform-identity-provider
  If the output indicates multiple restarts, proceed to Resolving the problem.
- Resolving the problem
-
  - Restart the icp-mongodb-0 pod:
    oc delete pod icp-mongodb-0 -n ${PROJECT_CPD_INST_OPERANDS}
  - Restart the icp-mongodb-1 pod:
    oc delete pod icp-mongodb-1 -n ${PROJECT_CPD_INST_OPERANDS}
  - Restart the icp-mongodb-2 pod:
    oc delete pod icp-mongodb-2 -n ${PROJECT_CPD_INST_OPERANDS}
  - Restart the platform-auth-service pod:
    oc delete pod -n ${PROJECT_CPD_INST_OPERANDS} $(oc get pods -n ${PROJECT_CPD_INST_OPERANDS} | grep platform-auth-service | awk '{print $1}')
  - Restart the platform-identity-management pod:
    oc delete pod -n ${PROJECT_CPD_INST_OPERANDS} $(oc get pods -n ${PROJECT_CPD_INST_OPERANDS} | grep platform-identity-management | awk '{print $1}')
  A scripted version of this restart sequence appears after these steps.
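A minimal sketch that performs the same restarts in order, assuming the pod names shown above and the ${PROJECT_CPD_INST_OPERANDS} variable:
# Sketch: restart the MongoDB replicas, then the auth and identity pods.
for i in 0 1 2; do
  oc delete pod icp-mongodb-$i -n ${PROJECT_CPD_INST_OPERANDS}
done
for app in platform-auth-service platform-identity-management; do
  oc get pods -n ${PROJECT_CPD_INST_OPERANDS} --no-headers \
    | grep $app | awk '{print $1}' \
    | xargs -r oc delete pod -n ${PROJECT_CPD_INST_OPERANDS}
done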
Warning alerts might appear on the home page after installation
Applies to: 4.8.0 and later
- Diagnosing the problem
-
After installation, you might see warning alerts on the home page. However, the events that generated the alerts have cleared. The alerts will continue to display on the Alerts card for up to 3 days unless you delete the pods that generated the alerts.
The alerts are visible to users with one of the following permissions:
- Administer platform
- Manage platform health
- View platform health
- Log in to the web client as a user with the appropriate permissions to view alerts.
- On the home page, click View all on the Alerts card.
- On the Alerts and events page, confirm that the alerts
were generated by one of the following services:
- Common core services
- IBM Knowledge Catalog
This issue can occur because wkc-db2u-init pods or jdbc-driver-sync-job pods are in an Error state.
- Resolving the problem
- An instance administrator or cluster administrator must resolve the problem.
-
  - Log in to Red Hat OpenShift Container Platform as a user with sufficient permissions to complete the task.
    ${OC_LOGIN}
    Remember: OC_LOGIN is an alias for the oc login command.
  - Check the status of the wkc-db2u-init pods and the jdbc-driver-sync-job pods:
    oc get pods --sort-by=.status.startTime -n ${PROJECT_CPD_INST_OPERANDS} | grep -E 'wkc-db2u-init|jdbc-driver'
  - Delete any pods that are in the Error state. Replace <pod-name> with the name of the pod in the error state:
    oc delete pod <pod-name> -n ${PROJECT_CPD_INST_OPERANDS}
  A one-pass cleanup command appears after these steps.
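If several pods are in the Error state, a hedged one-pass cleanup such as the following can delete them together; it reuses the same grep patterns as the preceding check:
# Sketch: delete every wkc-db2u-init or jdbc-driver pod that is in the Error state.
oc get pods -n ${PROJECT_CPD_INST_OPERANDS} --no-headers \
  | grep -E 'wkc-db2u-init|jdbc-driver' \
  | grep Error \
  | awk '{print $1}' \
  | xargs -r oc delete pod -n ${PROJECT_CPD_INST_OPERANDS}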
Tethered projects are not retrieved
Applies to: 4.8.0
Tethered projects are not retrieved by the /v2/namespaces API call. For example, tethered projects might not be displayed in the web client when you:
- Try to create a service instance in a tethered project
- Create a storage volume in a tethered project
- Resolving the problem
- Patch the ZenService custom resource to trigger a reconcile loop (a verification check follows):
  oc patch ZenService lite-cr \
    --namespace=${PROJECT_CPD_INST_OPERANDS} \
    --type=merge \
    --patch='{"spec": {"patchProductConfigmap": "true"}}'
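To verify that the reconcile completed, you can poll the custom resource status; a minimal check, assuming the ZenService resource reports its state in the status.zenStatus field:
# Sketch: print the current ZenService reconcile status (expect Completed when done).
oc get ZenService lite-cr -n ${PROJECT_CPD_INST_OPERANDS} \
  -o jsonpath='{.status.zenStatus}{"\n"}'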
RSI patches are not applied to pods as expected
Applies to: 4.8.0 and 4.8.1
Fixed in: 4.8.2
In some situations, RSI patches are not applied to a subset of the specified pods. This
typically occurs when you use the --select_all_pods
option when you create an RSI patch.
- Diagnosing the problem
- Check the status of the pod owners, such as StatefulSets, to determine whether the owners are stuck in the patch in progress state. (A loop that checks all of the owner kinds at once appears after this list.)
  - Check for any Deployments in the patch in progress state:
    oc get deployment \
      -n=${PROJECT_CPD_INST_OPERANDS} \
      -o=jsonpath='{.items[?(@.metadata.annotations.resourcespecinjector\.ibm\.com/injection_status=="patch in progress")].metadata.name}'
    The command returns a list of any Deployments in this state.
  - Check for any StatefulSets in the patch in progress state:
    oc get statefulset \
      -n=${PROJECT_CPD_INST_OPERANDS} \
      -o=jsonpath='{.items[?(@.metadata.annotations.resourcespecinjector\.ibm\.com/injection_status=="patch in progress")].metadata.name}'
    The command returns a list of any StatefulSets in this state.
  - Check for any ReplicaSets in the patch in progress state:
    oc get replicaset \
      -n=${PROJECT_CPD_INST_OPERANDS} \
      -o=jsonpath='{.items[?(@.metadata.annotations.resourcespecinjector\.ibm\.com/injection_status=="patch in progress")].metadata.name}'
    The command returns a list of any ReplicaSets in this state.
  - Check for any ReplicationControllers in the patch in progress state:
    oc get replicationcontroller \
      -n=${PROJECT_CPD_INST_OPERANDS} \
      -o=jsonpath='{.items[?(@.metadata.annotations.resourcespecinjector\.ibm\.com/injection_status=="patch in progress")].metadata.name}'
    The command returns a list of any ReplicationControllers in this state.
  - Check for any Jobs in the patch in progress state:
    oc get jobs \
      -n=${PROJECT_CPD_INST_OPERANDS} \
      -o=jsonpath='{.items[?(@.metadata.annotations.resourcespecinjector\.ibm\.com/injection_status=="patch in progress")].metadata.name}'
    The command returns a list of any Jobs in this state.
  - Check for any CronJobs in the patch in progress state:
    oc get cronjob \
      -n=${PROJECT_CPD_INST_OPERANDS} \
      -o=jsonpath='{.items[?(@.metadata.annotations.resourcespecinjector\.ibm\.com/injection_status=="patch in progress")].metadata.name}'
    The command returns a list of any CronJobs in this state.
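Rather than running six nearly identical commands, a minimal sketch that loops over all of the owner kinds with the same jsonpath filter:
# Sketch: list every pod owner that is stuck in the "patch in progress" state.
for kind in deployment statefulset replicaset replicationcontroller job cronjob; do
  echo "== ${kind} =="
  oc get ${kind} \
    -n=${PROJECT_CPD_INST_OPERANDS} \
    -o=jsonpath='{.items[?(@.metadata.annotations.resourcespecinjector\.ibm\.com/injection_status=="patch in progress")].metadata.name}{"\n"}'
done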
- Resolving the problem
- For each resource returned by the preceding commands, update the annotation on the resource (an example follows):
  oc annotate <resource-type> <resource-name> \
    -n=${PROJECT_CPD_INST_OPERANDS} \
    resourcespecinjector.ibm.com/injection_status-
  Replace <resource-type> with the type of the resource and <resource-name> with the name of the resource.
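For example, assuming the preceding check returned a hypothetical Deployment named zen-core, the annotation removal would look like this (the trailing hyphen deletes the annotation):
# Example: clear the injection_status annotation on a Deployment named zen-core (hypothetical name).
oc annotate deployment zen-core \
  -n=${PROJECT_CPD_INST_OPERANDS} \
  resourcespecinjector.ibm.com/injection_status-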
Modifying existing RSI patches causes undefined label selector errors
Applies to: 4.8.0, 4.8.1, and 4.8.2
Fixed in: 4.8.3
When you upgrade from IBM Cloud Pak for Data Version 4.6 or Version 4.7 to Version 4.8, you get an error when you modify existing RSI patches.
For example, if you want to make an active patch inactive, you encounter the following
error when you run the cpd-cli
manage
create-rsi-patch command:
fatal: [localhost]: FAILED! => {"msg": "The task includes an option with an undefined variable.
The error was: 'dict object' has no attribute 'exclude_labels'...
- Resolving the issue
- To resolve the problem:
- Run the following command to get the complete name of the patch:
  oc get zenextension --namespace=${PROJECT_CPD_INST_OPERANDS}
  RSI patches are prefixed with rsi-.
- Set the RSI_PATCH_NAME environment variable to the name of the patch that you want to update:
  export RSI_PATCH_NAME=<patch-name>
- Edit the patch:
  oc edit zenextension ${RSI_PATCH_NAME} \
    --namespace=${PROJECT_CPD_INST_OPERANDS}
- Add the following entries to the spec.extensions[*].pod_selector section:
  "exclude_labels":{},
  "select_all_pods":false
  For example:
  apiVersion: zen.cpd.ibm.com/v1
  kind: ZenExtension
  metadata:
    name: rsi-zen-core-rsi-pod-env-var-json
  spec:
    extensions: |
      [
        {
          "extension_point_id": "rsi_pod_env_var",
          "extension_name": "rsi-zen-core-rsi-pod-env-var-json",
          "display_name": "rsi-zen-core-rsi-pod-env-var-json",
          "description": "description",
          "meta": {},
          "details": {
            "patch_spec": [
              {"op":"add","path":"/spec/containers/0/env/-","value":{"name":"zen-core-env-json-one","value":"zen-core-env-json-one"}},
              {"op":"add","path":"/spec/containers/0/env/-","value":{"name":"zen-core-env-json-two","value":"zen-core-env-json-two"}}
            ],
            "pod_selector":{
              "selector":{
                "app.kubernetes.io/component":"zen-core",
                "component":"zen-core"
              },
              "exclude_labels":{},
              "select_all_pods":false
            },
            "state": "active",
            "type": "json"
          }
        }
      ]
- Save your changes to the patch. For example, if you are using vi, press Esc and enter :wq.
Cannot get the contents of the
zen-metastore-edb pod as a YAML file
Applies to: 4.8.0, 4.8.1, 4.8.2, and 4.8.3
Fixed in: 4.8.4
When you run the following command against any of the zen-metastore-edb
pods, such as zen-metastore-edb-0, you get an error:
oc get pods zen-metastore-edb-0 -o yaml
The command returns the following error:
error: error converting JSON to YAML: yaml: control characters are not allowed
If you want to see the contents of the pod, you must use JSON. For example:
oc get pods zen-metastore-edb-0 -o json
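If you only need specific fields from the pod, a hedged convenience is to filter the JSON output with jq (assumes jq is installed on your workstation):
# Sketch: extract the pod phase and container names from the JSON output.
oc get pods zen-metastore-edb-0 -o json \
  | jq '{phase: .status.phase, containers: [.status.containerStatuses[].name]}'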
Installation and upgrade issues
- General installation and upgrade issues
- General upgrade issues
-
- After you upgrade from Cloud Pak for Data 4.7.4, generating a bearer token fails in an IAM-enabled cluster
- After an upgrade from Cloud Pak for Data 4.7.3, FoundationDB can indicate a Failed status
- The apply-cr command fails when upgrading services with a dependency on Db2U
- The apply-cr command fails when upgrading services with a dependency on the common core services
- The Projects page does not load after upgrade
- Unable to filter instances by type after you upgrade to Cloud Pak for Data 4.8.1
- After you upgrade Red Hat OpenShift Container Platform, storage volume pods are in the CrashLoopBackOff state
- After you upgrade a Red Hat OpenShift Container Platform cluster, the FoundationDB resource can become unavailable
- Inaccurate status message from the command line after you upgrade
- Secrets are not visible in connections after upgrade
- Red Hat OpenShift Container Platform upgrade issues
- After upgrading to Red Hat OpenShift Container Platform Version 4.14, the status of the Redis custom resource fluctuates
The apply-cr command fails when installing services with
a dependency on Db2U
Applies to: 4.8.0 and later
- Diagnosing the problem
- You can specify the privileges that Db2U runs with. If you configured Db2U to run with limited privileges, the apply-cr command will fail if:
  - You set DB2U_RUN_WITH_LIMITED_PRIVS: "true" in the db2u-product-cm ConfigMap.
  - The kernel parameter settings were not modified to allow Db2U to run with limited privileges.
  This issue can manifest in several ways.
  - The wkc-db2u-init job fails
  - If you are installing IBM Knowledge Catalog, the apply-cr command fails with the message "WKC DB2U post install job failed ('wkc-db2u-init' job)". When you get the status of the wkc-db2u-init pods, they are in the Error state.
    oc get pods -n ${PROJECT_CPD_INST_OPERANDS} | grep wkc-db2u-init
  - The Db2uCluster resource never becomes ready
  - For other services, you might notice that the Db2uCluster resource never becomes Ready.
    oc get Db2uCluster -n ${PROJECT_CPD_INST_OPERANDS}
  - You cannot provision service instances
  - For services such as Db2® and Db2 Warehouse, the apply-cr command completes successfully, but the service instances never finish provisioning and the *-db2u-0 pods are stuck in Pending or SysctlForbidden.
    oc get pods -n ${PROJECT_CPD_INST_OPERANDS} | grep db2u-0
- Resolving the problem
- This problem occurs when you set DB2U_RUN_WITH_LIMITED_PRIVS: "true" in the db2u-product-cm ConfigMap but the kernel parameter settings were not modified to allow Db2U to run with limited privileges. Review Changing kernel parameter settings to confirm that you can change the kernel parameter settings. (A quick check of the current setting follows these steps.)
  - If you can change the kernel parameter settings, ensure that the worker nodes are restarted after you change the settings. In some cases, when you run the cpd-cli manage apply-db2-kubelet command, the worker nodes are not restarted.
  - If you cannot or do not change the kernel parameter settings, update the db2u-product-cm ConfigMap to set DB2U_RUN_WITH_LIMITED_PRIVS: "false". For more information, see Specifying the privileges that Db2U runs with.
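A quick, hedged way to confirm which mode is currently configured, assuming the setting is stored under the data section of the db2u-product-cm ConfigMap:
# Sketch: print the current Db2U privilege setting ("true" means limited privileges).
oc get cm db2u-product-cm -n ${PROJECT_CPD_INST_OPERANDS} \
  -o jsonpath='{.data.DB2U_RUN_WITH_LIMITED_PRIVS}{"\n"}'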
Some Cloud Pak for Data software cannot be installed when the OpenShift Virtualization Operator is installed on the cluster
Applies to: 4.8.1, 4.8.2, 4.8.3, and 4.8.4
Fixed in: 4.8.5
The following software cannot be installed when the OpenShift Virtualization Operator is installed on the cluster:
- OpenPages®
- watsonx.governance Risk and Compliance Foundation
Installing Red Hat OpenShift Container Platform for IBM Cloud Pak for Data previously stated that you could not install the OpenShift Virtualization Operator on the cluster. However, that restriction applies only to environments with OpenPages or watsonx.governance Risk and Compliance Foundation.
After you upgrade from Cloud Pak for Data 4.7.4, generating a bearer token fails in an IAM-enabled cluster
Applies to: Upgrades from Version 4.7.4 to 4.8.0 and later
If you upgrade from Cloud Pak for Data version
4.7.4 to Cloud Pak for Data
4.8.0 or later, the IAM access token API
(/idprovider/v1/auth/identitytoken) fails. You cannot log in to the
user interface when the identitytoken API fails.
- Diagnosing the problem
-
The following error is displayed in the log when you generate an IAM access token:
Failed to get access token, Liberty error: {"error_description":"CWWKS1406E: The token request had an invalid client credential. The request URI was \/oidc\/endpoint\/OP\/token.","error":"invalid_client"}" - Resolving the problem
-
-
Log in to Red Hat OpenShift Container Platform as a cluster administrator.
${OC_LOGIN}
Remember: OC_LOGIN is an alias for the oc login command.
- Run the following command to restart the oidc-client-registration job:
  oc -n ${PROJECT_CPD_INST_OPERANDS} delete job oidc-client-registration
After an upgrade from
Cloud Pak for Data 4.7.3, FoundationDB can indicate a
Failed status
Applies to: Upgrades from Version 4.7.3 to 4.8.2 and later
After upgrading Cloud Pak for Data from
Version 4.7.3 to 4.8.2 or later, the status of the FoundationDB cluster can indicate that it has
failed (fdbStatus: Failed). The Failed status can occur
even if FoundationDB is available and
working correctly. This issue occurs when the FoundationDB resources do not get properly
cleaned up by the upgrade.
This issue affects the following services:
- IBM Knowledge Catalog
- IBM Match 360
- Diagnosing the problem
-
To determine if this problem has occurred:
Required role: To complete this task, you must be a cluster administrator.
- Check the FoundationDB cluster status:
  oc get fdbcluster -o yaml | grep fdbStatus
  If the returned status is Failed, proceed to the next step to determine if the pods are available.
- Check to see if the FoundationDB pods are up and running:
  oc get pods -n ${PROJECT_CPD_INST_OPERANDS} | grep foundation
  The returned list of FoundationDB pods should all have a status of Running. If they are not running, then the problem is something other than this issue.
- Resolving the problem
-
To resolve this issue, restart the FoundationDB controller (ibm-fdb-controller):
Required role: To complete this task, you must be a cluster administrator.
- Identify your FoundationDB controllers:
  oc get pods -n ${PROJECT_CPD_INST_OPERATORS} | grep ibm-fdb-controller
  This command returns the names of two FoundationDB controllers in the following formats:
  ibm-fdb-controller-manager-<INSTANCE-ID>
  apple-fdb-controller-manager-<INSTANCE-ID>
- Delete the ibm-fdb-controller-manager pod to refresh it:
  oc delete pod ibm-fdb-controller-manager-<INSTANCE-ID> -n ${PROJECT_CPD_INST_OPERATORS}
- Wait for the controller to restart. This can take approximately one minute.
- Check the status of your FoundationDB cluster:
  oc -n ${PROJECT_CPD_INST_OPERANDS} get FdbCluster -o yaml
  Confirm that the fdbStatus is now Completed.
The apply-cr command fails when upgrading services with a
dependency on Db2U
Applies to: 4.8.0, 4.8.1, 4.8.2, and 4.8.3
Fixed in: 4.8.4
When you try to upgrade services with a dependency on Db2U, the upgrade fails when trying to remove temporary files.
- IBM Knowledge Catalog
- OpenPages with an internal database
- Diagnosing the problem
- To determine if the upgrade failed when trying to remove temporary files:
  - Set the DB2U_POD environment variable to the name of the db2u-0 pod.
    - For IBM Knowledge Catalog, run:
      export DB2U_POD=$(oc get pods -n=${PROJECT_CPD_INST_OPERANDS} | grep wkc-db2u-0 | awk '{print $1}')
    - For OpenPages:
      - Set the INSTANCE_ID environment variable:
        export INSTANCE_ID=$(oc get openpagesinstance -n=${PROJECT_CPD_INST_OPERANDS} -o jsonpath='{.items[0].spec.zenServiceInstanceId}{"\n"}')
      - Set the DB2U_POD environment variable:
        export DB2U_POD=$(oc get pods -n=${PROJECT_CPD_INST_OPERANDS} | grep ${INSTANCE_ID}-db2u-0 | awk '{print $1}')
  - Open a remote shell in the pod to examine the upgrade_update.log file:
    oc exec -it ${DB2U_POD} \
      -n=${PROJECT_CPD_INST_OPERANDS} \
      -- bash -c "cat /mnt/blumeta0/support/upgrade_update.log"
  - Look for the following strings in the error log:
    - INFO: Applying Db2 license
    - in _delete_temporary_db2inst1_files
    - OSError: [Errno 16] Device or resource busy: '.nfs
    If the log contains the preceding strings, continue to Resolving the problem.
- Resolving the problem
-
  - Remove the db2.tmp_db2inst1.* files from the db2u-0 pod:
    oc exec -it ${DB2U_POD} \
      -n=${PROJECT_CPD_INST_OPERANDS} \
      -- bash -c "rm -rf db2.tmp_db2inst1.*"
  - Restart the database update:
    oc exec -it ${DB2U_POD} \
      -n=${PROJECT_CPD_INST_OPERANDS} \
      -- bash -c "db2_update_upgrade --databases"
  - Patch the formation:
    - For IBM Knowledge Catalog, run:
      oc patch formation db2oltp-wkc \
        -n=${PROJECT_CPD_INST_OPERANDS} \
        --type=json \
        --patch='[{"op": "remove", "path": "/spec/resource_configs/m/env/upgrade~1podname"}]'
    - For OpenPages, run:
      oc patch formation db2oltp-${INSTANCE_ID} \
        -n=${PROJECT_CPD_INST_OPERANDS} \
        --type=json \
        --patch='[{"op": "remove", "path": "/spec/resource_configs/m/env/upgrade~1podname"}]'
The apply-cr command fails when upgrading services with a
dependency on the common core services
Applies to:
- Upgrades from Version 4.6.4 to 4.8.0 and later
- Upgrades from Version 4.6.5 to 4.8.0 and later
- Upgrades from Version 4.6.6 to 4.8.0 and later
When you upgrade a service with a dependency on the common core services, the apply-cr command fails because the
ccs-cr custom resource is stuck in InProgress. The
problem can occur when the connection pods try to use an outdated certificate for
inter-pod communication, which prevents the connection migration jobs from completing.
- Diagnosing the problem
-
- Get the status of the ccs-cr custom resource:
  oc get ccs -n=${PROJECT_CPD_INST_OPERANDS}
- If the status of the custom resource is InProgress, check the status of the common core services pods in the operands project for the instance (${PROJECT_CPD_INST_OPERANDS}):
  oc get pods -n=${PROJECT_CPD_INST_OPERANDS} | grep ccs
  If the status of the ccs-post-install-migration-job pods is Error, contact IBM Software Support to restart the connection migration job.
The Projects page does not load after upgrade
Applies to: Upgrades from Version 4.8.1 to Version 4.8.2
After you upgrade from Version 4.8.1 to Version 4.8.2, the Projects page does not load.
- Resolving the problem
- To enable the Projects page to load:
  - Set the PORTAL_MAIN_POD environment variable to the name of the portal-main pod:
    export PORTAL_MAIN_POD=$(oc get pods -n=${PROJECT_CPD_INST_OPERANDS} | grep portal-main | awk '{print $1}')
  - Delete the portal-main pod:
    oc delete pods ${PORTAL_MAIN_POD} -n=${PROJECT_CPD_INST_OPERANDS}
  - Wait for the pod to be Ready and try to load the Projects page again. To check the status of the pod, run the following command, or see the wait command after these steps:
    oc get pods -n=${PROJECT_CPD_INST_OPERANDS} | grep portal-main
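Instead of polling, you can let oc block until the pod is ready. This is a sketch that assumes the pod carries a component=portal-main label; verify the label on your deployment first:
# Sketch: wait up to 5 minutes for the new portal-main pod to become Ready.
oc wait pod -l component=portal-main \
  -n ${PROJECT_CPD_INST_OPERANDS} \
  --for=condition=Ready \
  --timeout=300s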
Unable to filter instances by type after you upgrade to Cloud Pak for Data 4.8.1
Applies to: 4.8.0 and 4.8.1
Fixed in: 4.8.2
- Diagnosing the problem
-
After you upgrade to Cloud Pak for Data 4.8.1, you cannot filter by type on the Instances page. You see a blank page in the user interface.
- Workaround
- If you click Filter by: and then click Type, the page becomes blank. Refresh the page to return to the Instances page.
After you upgrade
Red Hat OpenShift Container Platform, storage volume
pods are in the CrashLoopBackOff state
Applies to: 4.8.6 and later
After you upgrade Red Hat OpenShift Container Platform, storage volume pods (volumes-* pods) are in the CrashLoopBackOff state. This problem occurs when the /auth/jwtpublic URL path in ibm-nginx returns a TLS error.
- Diagnosing the problem
- To confirm that the problem is caused by a TLS error returned by the /auth/jwtpublic URL path:
  - Get the name of a volumes-* pod that is in the CrashLoopBackOff state:
    oc get pods \
      --namespace=${PROJECT_CPD_INST_OPERANDS} \
      | grep volumes
  - Get the pod logs for a volumes-* pod that is in the CrashLoopBackOff state:
    oc logs <pod-name> \
      --namespace=${PROJECT_CPD_INST_OPERANDS}
    Replace <pod-name> with the name of a pod in the CrashLoopBackOff state.
  - Confirm that the logs include the following error:
    time="<timestamp>" level=error msg="ProcessPublicKeyFromNginx - Failed receiving response from server" func=zen-core-api/source/apis/commonutils.LogErr file="/go/src/zen-core-api/source/apis/commonutils/auth_util.go:342" err="Get \"https://ibm-nginx-svc.<namespace>:443/auth/jwtpublic\": remote error: tls: illegal parameter"
    panic: Get "https://ibm-nginx-svc.<namespace>:443/auth/jwtpublic": remote error: tls: illegal parameter
- Resolving the problem
- To resolve the problem:
- Get the names of the volumes-* deployments:
  oc get deployments \
    --namespace=${PROJECT_CPD_INST_OPERANDS} \
    | grep -E 'volumes|READY'
- For each deployment where the READY column shows 0/1, run the following command to patch the deployment (a loop that patches all of them appears after these steps):
  oc patch deployment <volume-deployment-name> \
    --namespace=${PROJECT_CPD_INST_OPERANDS} \
    --type=json \
    --patch='[
      {"op": "add", "path": "/spec/template/spec/containers/0/volumeMounts/-", "value": {"mountPath": "/user-home/_global_/config/jwt", "name": "ibm-zen-secret-jwt"}},
      {"op": "add", "path": "/spec/template/spec/volumes/-", "value": {"name": "ibm-zen-secret-jwt", "secret": {"defaultMode": 420, "optional": true, "secretName": "ibm-zen-secret-jwt"}}}
    ]'
  Replace <volume-deployment-name> with the name of a deployment.
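A minimal sketch that applies the same patch to every volumes-* deployment that is not ready; it assumes the READY column format shown by the preceding command:
# Sketch: patch every volumes-* deployment whose READY column shows 0/1.
for d in $(oc get deployments -n ${PROJECT_CPD_INST_OPERANDS} --no-headers \
             | grep volumes | awk '$2=="0/1" {print $1}'); do
  oc patch deployment $d \
    --namespace=${PROJECT_CPD_INST_OPERANDS} \
    --type=json \
    --patch='[
      {"op": "add", "path": "/spec/template/spec/containers/0/volumeMounts/-", "value": {"mountPath": "/user-home/_global_/config/jwt", "name": "ibm-zen-secret-jwt"}},
      {"op": "add", "path": "/spec/template/spec/volumes/-", "value": {"name": "ibm-zen-secret-jwt", "secret": {"defaultMode": 420, "optional": true, "secretName": "ibm-zen-secret-jwt"}}}
    ]'
done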
After you upgrade a Red Hat OpenShift Container Platform cluster, the FoundationDB resource can become unavailable
Applies to: 4.8.0 and later
After you upgrade your cluster to a new version of Red Hat OpenShift Container Platform, the IBM FoundationDB pods can become unavailable. When this issue occurs, services that rely on FoundationDB such as IBM Knowledge Catalog and IBM Match 360 cannot function correctly.
This issue affects the following services:
- IBM Knowledge Catalog
- IBM Match 360 with Watson
- Diagnosing the problem
- To identify the cause of this issue, check the FoundationDB status and details.
  - Check the FoundationDB status:
    oc get fdbcluster -o yaml | grep fdbStatus
    If this command is successful, the returned status is Complete. If the status is InProgress or Failed, proceed to the workaround steps.
  - If the status is Complete but FoundationDB is still unavailable, log in to one of the FDB pods and check the status details to ensure that the database is available and all coordinators are reachable.
    oc rsh sample-cluster-log-1 /bin/fdbcli
    To check the detailed status of the FDB pod, run fdbcli to enter the FoundationDB command-line interface, then run the following command at the fdb> prompt:
    status details
    - If you get a message that is similar to Could not communicate with a quorum of coordination servers, run the coordinators command with the IP addresses specified in the error message as input. To find the IP addresses of the storage pods, run:
      oc get pod -o wide | grep storage
      Then, at the fdb> prompt, run:
      coordinators IP-ADDRESS-1:4500:tls IP-ADDRESS-2:4500:tls IP-ADDRESS-3:4500:tls
      If this step does not resolve the problem, proceed to the workaround steps.
    - If you get a different message, such as Recruiting new transaction servers, proceed to the workaround steps.
- Resolving the problem
- To resolve this issue, restart the FoundationDB pods.
  Required role: To complete this task, you must be a cluster administrator.
  - Restart the FoundationDB cluster pods:
    oc get fdbcluster
    oc get po | grep ${CLUSTER_NAME} | grep -v backup | awk '{print $1}' | xargs oc delete po
    Replace ${CLUSTER_NAME} in the command with the name of your fdbcluster instance.
  - Restart the FoundationDB operator pods:
    oc get po | grep fdb-controller | awk '{print $1}' | xargs oc delete po
  - After the pods finish restarting, check to ensure that FoundationDB is available.
    - Check the FoundationDB status:
      oc get fdbcluster -o yaml | grep fdbStatus
      The returned status must be Complete.
    - Check to ensure that the database is available:
      oc rsh sample-cluster-log-1 /bin/fdbcli
      If the database is still not available, complete the following steps:
      - Log in to the ibm-fdb-controller pod.
      - Run the fix-coordinator script:
        kubectl fdb fix-coordinator-ips -c ${CLUSTER_NAME} -n ${PROJECT_CPD_INST_OPERATORS}
        Replace ${CLUSTER_NAME} in the command with the name of your fdbcluster instance.
        Note: For more information about the fix-coordinator script, see the workaround steps from the resolved IBM Match 360 known issue item The FoundationDB cluster can become unavailable.
Inaccurate status message from the command line after you upgrade
This issue affects the following services:
- watsonx Assistant
- Watson Discovery
- Watson Knowledge Studio
- Watson Speech services
- Diagnosing the problem
- If you run the cpd-cli service-instance upgrade command from the Cloud Pak for Data command-line interface, and then use the service-instance list command to check the status of each service, the provision status for the service is listed as UPGRADE_FAILED.
- Cause of the problem
- When you upgrade the service, only the cpd-cli manage apply-cr command is supported. You cannot use the cpd-cli service-instance upgrade command to upgrade the service. And after you upgrade the service with the apply-cr method, the change in version and status is not recognized by the service-instance command. However, the correct version is displayed in the Cloud Pak for Data web client.
- Resolving the problem
- No action is required. If you use the cpd-cli manage apply-cr method to upgrade the service as documented, the upgrade is successful and you can ignore the version and status information that is generated by the cpd-cli service-instance list command.
Secrets are not visible in connections after upgrade
Applies to: Version 4.8.0, 4.8.1, 4.8.2, 4.8.3, and 4.8.4
If you use secrets when you create connections, the secrets are not visible in the connection details after you upgrade Cloud Pak for Data. This issue occurs when your vault uses a private CA signed certificate.
- Resolving the problem
- To see the secrets in the user interface:
- Change to the project where Cloud Pak for Data is installed:
  oc project ${PROJECT_CPD_INST_OPERANDS}
- Set the following environment variables (a verification command follows):
  oc set env deployment/zen-core-api VAULT_BRIDGE_TLS_RENEGOTIATE=true
  oc set env deployment/zen-core-api VAULT_BRIDGE_TOLERATE_SELF_SIGNED=true
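To confirm that both variables are now set on the deployment, a hedged check using the oc set env list mode:
# Sketch: list the VAULT_BRIDGE_* variables set on zen-core-api.
oc set env deployment/zen-core-api --list -n ${PROJECT_CPD_INST_OPERANDS} | grep VAULT_BRIDGE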
After upgrading to Red Hat OpenShift Container Platform Version 4.14, the status of the Redis custom resource fluctuates
Applies to: 4.8.0, 4.8.1, 4.8.2, and 4.8.3
Fixed in: 4.8.4
This issue affects services that include the ibm_redis_cp component:
- Cognos® Dashboards
- Db2 Data Management Console
- IBM Match 360
- Watson Query
- watsonx Assistant
After you upgrade from Red Hat OpenShift Container Platform
Version 4.12 to Version 4.14, the status of the
Redis custom resource
fluctuates between InProgress and Completed.
However, if the Redis operator
(ibm-redis-cp-operator) is running without errors, you can
ignore the fluctuation in the custom resource status.
- Diagnosing the problem
-
- To determine whether the custom resource status is fluctuating, run the following command:
  oc get pods -n=${PROJECT_CPD_INST_OPERANDS} | grep redis
  Wait several minutes and run the command again to see if the status changes.
- To determine whether the operator pods are healthy, run the following command:
  oc get pods -n=${PROJECT_CPD_INST_OPERATORS} | grep ibm-redis-cp-operator
- Resolving the problem
-
If the status of the operator pods is Running or Completed, you can ignore the fluctuation in the custom resource status.
If the status of the operator pods is not Running or Completed, follow the guidance in Troubleshooting the apply-olm command during installation or upgrade to determine the root cause of the problem.
Security issues
- Security scans return an Inadequate Account Lockout Mechanism message
- The Kubernetes version information is disclosed
Security scans return an Inadequate Account Lockout Mechanism message
Applies to: 4.8.0 and later
- Diagnosing the problem
-
If you run a security scan against Cloud Pak for Data, the scan returns the following message.
Inadequate Account Lockout Mechanism
- Resolving the problem
-
This is by design. It is strongly recommended that you use an enterprise-grade password management solution, such as SAML SSO or an LDAP provider, for password management.
The Kubernetes version information is disclosed
Applies to: 4.8.0 and later
- Diagnosing the problem
- If you run an Aqua Security scan against your cluster, the scan reports that the Kubernetes version information is disclosed.
- Resolving the problem
- This is expected based on the following solution document from Red Hat OpenShift Customer Portal: Hide kubernetes /version API endpoint in OpenShift Container Platform 4.
Backup and restore issues
- Issues that apply to several or all backup and restore methods
-
- Backup fails for the platform with error in EDB Postgres cluster
- OADP backup is missing EDB Postgres PVCs
- OADP backup precheck command fails with EDB Postgres cluster is out of sync error
- Informix® custom resource in InProgress state after restore
- After restore, watsonx Assistant is stuck on the 17/19 deployed state or custom resource is stuck in InProgress state
- Missing Identity Management Service data after a restore
- Unable to log in to Cloud Pak for Data with OpenShift cluster credentials after successfully restoring to a different cluster
- Offline and online restore with OADP backup and restore utility fails with running post-restore hooks error
- Restoring online backup of Watson Query fails
- After a restore, OperandRequest timeout error in the ZenService custom resource
- After a restore, the ZenService component remains in InProgress state
- Password for the cpadmin user changes when restoring Cloud Pak for Data
- After restoring IBM Match 360 from backup, the associated Redis pods enter a CrashLoopBackOff state
- Online backup of Watson Discovery fails at checkpoint stage
- Online restore posthooks fail with checkpoint id error
- Online restore posthooks fail with zenobjstore/ibm-zen-configuration is not empty error
- Resource file causes online backup of Watson Discovery to fail at checkpoint stage
- Watson Knowledge Catalog custom resource stuck in InProgress state after restore
- Online backup and restore with OADP backup and restore utility issues
- Online backup and restore with IBM Storage Fusion issues
-
- Backup fails at Volume group: cpd-volumes stage
- Backup of Cloud Pak for Data operators project fails at data transfer stage
- Backup fails at Hook: br-service-hooks/post-backup stage
- ZenService is in Failed state after a restore
- Restoring an online backup of Cloud Pak for Data on IBM Storage Scale Container Native storage fails
- Online backup and restore with NetApp Astra Control Center issues
-
- After a restore with NetApp Astra Control Center, post-restore hooks fail with an Authentication failed error
- After a restore with NetApp Astra Control Center, post-restore hooks fail with a zen-mongodb-restore already exists error
- After a restore with NetApp Astra Control Center, the ZenService custom resource remains stuck at 51% in progress
- After a restore with NetApp Astra Control Center, the ZenService custom resource does not reconcile
- Data replication with Portworx issues
-
- After a restore, status of OpenPages and watsonx.governance custom resources are in a Failed state
- When restoring Db2 Big SQL, post-restore hook fails
- Restore fails at the running post-restore script step
- Activating applications after migrating data with Portworx asynchronous disaster recovery fails at post-restore step with constraints not satisfiable error
- Activating applications after migrating data with Portworx asynchronous disaster recovery fails at post-restore step with subscription isn't created by ODLM error
- Offline backup and restore with the OADP backup and restore utility issues
-
- Creating an offline backup in REST mode stalls
- Common core services custom resource is in InProgress state after an offline restore to a different cluster
- Offline restore to a different cluster fails due to management-ingress-ibmcloud-cluster-info ConfigMap not found in PodVolumeRestore
- Unable to back up Watson Machine Learning Accelerator
- Offline backup and restore with the volume backup and restore utility issues
Backup fails for the platform with error in EDB Postgres cluster
Applies to: 4.8.7 and later
- Diagnosing the problem
- For example, in IBM Storage Fusion, the backup fails at the Hook: br-service-hooks/pre-backup stage in the backup sequence.
  In the cpdbr-oadp.log file, you see the following error:
  time=<timestamp> level=info msg=cmd stderr: Error: cannot take a cold backup of the primary instance or a target primary instance if the k8s.enterprisedb.io/snapshotAllowColdBackupOnPrimary annotation is not set to enabled
- Cause of the problem
- Labels and annotations in the EDB Postgres cluster resources were not updated after a switchover of the EDB Postgres cluster's primary instance and replica.
- Resolving the problem
-
Use either the automatic or manual workaround.
- Automatic workaround
-
  After you apply the YAML files, the following workaround automatically runs as a prehook every time you take a backup. The issue is automatically handled, so you do not encounter it again, which is especially useful if you have set up automatic backups.
  - Check that the ${VERSION} environment variable is set in cpd_vars.sh to the correct Cloud Pak for Data version number.
  - Download the edb-patch-resources-legacy.yaml file.
  - Run the following command:
    oc apply -n ${PROJECT_CPD_INST_OPERANDS} -f edb-patch-resources-legacy.yaml
  - Complete the steps that apply to your backup and restore method.
    Online backup and restore:
    - Download the edb-patch-aux-ckpt-cm-legacy.yaml file.
    - Run the following command:
      sed "s/VERSION_PLACEHOLDER/${VERSION}/g" edb-patch-aux-ckpt-cm-legacy.yaml | oc apply -n ${PROJECT_CPD_INST_OPERANDS} -f -
    - Retry the backup.
    Offline backup and restore:
    - Download the edb-patch-aux-br-cm-legacy.yaml file.
    - Run the following command:
      sed "s/VERSION_PLACEHOLDER/${VERSION}/g" edb-patch-aux-br-cm-legacy.yaml | oc apply -n ${PROJECT_CPD_INST_OPERANDS} -f -
    - Retry the backup.
- Manual workaround
-
Complete the following steps to manually run the workaround:
Note: If another switchover of the EDB Postgres cluster's primary instance and replica happens after you apply the manual workaround, you must complete the workaround again before you take a backup.- Download the edb-patch.sh file.
- Run the following command:
  sh edb-patch.sh ${PROJECT_CPD_INST_OPERANDS}
- Retry the backup.
OADP backup is missing EDB Postgres PVCs
Applies to: 4.8.0 and later
- Diagnosing the problem
- After an online or offline backup is taken with the OADP utility, EDB Postgres PVCs are missing in the PVC backup list.
- Cause of the problem
- EDB Postgres replica PVCs might be excluded from a backup when an EDB Postgres cluster switches primary instances.
- Resolving the problem
- Before you create a backup, run the following command:
  oc label pvc,pods -l k8s.enterprisedb.io/cluster,velero.io/exclude-from-backup=true velero.io/exclude-from-backup- -n ${PROJECT_CPD_INST_OPERANDS}
  The trailing hyphen in velero.io/exclude-from-backup- removes the exclude label so that the EDB Postgres PVCs and pods are included in the backup.
OADP backup precheck command fails with EDB Postgres cluster is out of sync error
Applies to: 4.8.7
Fixed in: 4.8.8
- Diagnosing the problem
- When you run the cpd-cli oadp backup precheck command, you see the following error:
  Error: precheck failed with error: edb in-sync check failed: edb cluster is out of sync
- Cause of the problem
- The backup precheck command is not accounting for any lag when log sequence numbers (LSNs) are received but not replayed yet.
- Resolving the problem
- Try one of the following options:
  - Delete the EDB Postgres cluster's replica pod and PVC. For more information, see the troubleshooting topic PostgreSQL cluster replicas get out of sync.
  - Wait for the LSNs to catch up.
Informix custom resource in
InProgress state after restore
Applies to: 4.8.5
Fixed in: 4.8.8
- Diagnosing the problem
- After the restore, get the status of the Informix custom resource by running the following command:
  oc get Informix informix-<xxxxxxxxxxxxxxxx> -n ${PROJECT_CPD_INST_OPERANDS} -o yaml
  In the output, the informixStatus shows InProgress.
informixStatusshowsInProgress. - Cause of the problem
- The Informix custom resource stays in pending because the IBM Global Security Kit (GSKit) doesn't complete the SSL setup before the readiness and liveness checks complete.
- Resolving the problem
- Change the behavior of GSKit. From the OpenShift console, add the environment variable ICC_SHIFT=3 to the informix-<xxxxxxxxxxxxxxxx>-cm-0 and informix-<xxxxxxxxxxxxxxxx>-server StatefulSets, and restart those pods. A command-line equivalent follows.
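If you prefer the command line to the OpenShift console, a hedged equivalent follows; the StatefulSet names contain your instance's generated suffix, which you must substitute:
# Sketch: add ICC_SHIFT=3 to both Informix StatefulSets and restart their pods.
for sts in informix-<xxxxxxxxxxxxxxxx>-cm-0 informix-<xxxxxxxxxxxxxxxx>-server; do
  oc set env statefulset/$sts ICC_SHIFT=3 -n ${PROJECT_CPD_INST_OPERANDS}
  oc rollout restart statefulset/$sts -n ${PROJECT_CPD_INST_OPERANDS}
done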
After restore, watsonx Assistant is stuck on the
17/19 deployed state or custom resource is stuck in
InProgress state
Applies to: 4.8.5
Fixed in: 4.8.6
- Diagnosing the problem
- This problem can occur after you restore an online backup to the same cluster or to a different cluster. Run the following command:
  oc get wa -n ${PROJECT_CPD_INST_OPERANDS}
  Example output:
  NAME   VERSION   READY   READYREASON    UPDATING   UPDATINGREASON   DEPLOYED   VERIFIED   QUIESCE   AGE
  wa     5.0.1     False   Initializing   True       VerifyWait       17/19      15/19                4h39m
- Resolving the problem
- Delete the wa-integrations-operand-secret and wa-integrations-datastore-connection-strings secrets by running the following commands:
  oc delete secret wa-integrations-operand-secret -n ${PROJECT_CPD_INST_OPERANDS}
  oc delete secret wa-integrations-datastore-connection-strings -n ${PROJECT_CPD_INST_OPERANDS}
  After the secrets are deleted, the watsonx Assistant operator recreates them with the correct values, and the watsonx Assistant custom resource and pods return to a good state.
Missing Identity Management Service data after a restore
Applies to: 4.8.0-4.8.4
Fixed in: 4.8.5
- Diagnosing the problem
- After you upgrade to Cloud Pak for Data 4.8.0-4.8.4, you integrate with the Identity Management Service. When you back up and restore your Cloud Pak for Data deployment, Identity Management Service data is missing. No results are returned for one or more of the following commands:
  oc get pvc -n ${PROJECT_CPD_INST_OPERANDS} ibm-zen-cs-mongo-backup
  oc get cm -n ${PROJECT_CPD_INST_OPERANDS} zen-cs-aux-br-cm
  oc get cm -n ${PROJECT_CPD_INST_OPERANDS} zen-cs-aux-ckpt-cm
- The Cloud Pak for Data deployment is missing some resources, including several backup and restore ConfigMaps.
- Resolving the problem
-
- Copy and run the Identity Management Service backup and restore scripts.
- Redo the backup and restore.
Unable to log in to Cloud Pak for Data with OpenShift cluster credentials after successfully restoring to a different cluster
Applies to: 4.8.5 and later
- Diagnosing the problem
- When Cloud Pak for Data is integrated with the Identity Management Service, you cannot log in with OpenShift cluster credentials. You might be able to log in with LDAP or as cpadmin.
- Resolving the problem
- To work around the problem, run the following commands:
  oc delete cm platform-auth-idp -n ${PROJECT_CPD_INST_OPERANDS}
  oc delete cm oauth-client-map -n ${PROJECT_CPD_INST_OPERANDS}
  oc delete cm ibm-iam-bindinfo-oauth-client-map -n ${PROJECT_CPD_INST_OPERANDS}
  oc delete cm ibm-iam-bindinfo-platform-auth-idp -n ${PROJECT_CPD_INST_OPERANDS}
  oc delete pods -n ${PROJECT_CPD_INST_OPERATORS} -l app.kubernetes.io/instance=ibm-common-service-operator
  oc delete pods -n ${PROJECT_CPD_INST_OPERANDS} -l app.kubernetes.io/instance=platform-auth-service
  oc delete pods -n ${PROJECT_CPD_INST_OPERANDS} -l app.kubernetes.io/instance=platform-identity-management
  oc delete pods -n ${PROJECT_CPD_INST_OPERANDS} -l app.kubernetes.io/instance=platform-identity-provider
Offline and online restore with OADP backup and restore utility fails with running post-restore hooks error
Applies to: 4.8.5-4.8.7
Fixed in: 4.8.8
- Diagnosing the problem
- The final cpd-cli oadp restore create command that runs post-restore hooks fails. In the CPD-CLI*.log file, you see the following error message:
  error running post-restore hooks: pod/<podname> is not supported for scaling, please define the proper postrestore hooks because of restored replicaset(s)
- A recent change in OADP 1.3.1 removed logic that prevents restoring deployment-managed ReplicaSets that shouldn't be restored.
- Workaround
- Do the following steps:
  - Find ReplicaSets that were restored from the backup:
    oc get rs -l velero.io/backup-name -n ${PROJECT_CPD_INST_OPERANDS}
  - Delete the ReplicaSets that are associated with the pods that have is not supported for scaling errors in the logs:
    oc delete rs -n ${PROJECT_CPD_INST_OPERANDS} <replicaset-name>
  - Manually run post-restore hooks.
    - If you are doing an offline backup and restore, run the following command:
      cpd-cli oadp restore posthooks --include-namespaces=${PROJECT_CPD_INST_OPERANDS},${PROJECT_CPD_INST_OPERATORS}
    - If you are doing an online backup and restore, run the following command:
      cpd-cli oadp restore posthooks --include-namespaces=${PROJECT_CPD_INST_OPERANDS},${PROJECT_CPD_INST_OPERATORS} --hook-kind=checkpoint
Restoring online backup of Watson Query fails
Applies to: 4.8.5 and later
- Diagnosing the problem
- In the CPD-CLI*.log file, you see the following error message:
time=<timestamp> level=info msg= zen/configmap/cpd-dv-aux-ckpt-cm: component=dv, op=<mode=post-restore,type=config-hook,method=rule>, status=error func=cpdbr-oadp/pkg/quiesce.logPlanResult file=/a/workspace/oadp-upload/pkg/quiesce/planexecutor.go:1137 - Workaround
- Do the following steps:
- Disable the Watson Query liveness probe in the Watson Query head pod:
  oc exec -it c-db2u-dv-db2u-0 -- bash -c "su - db2inst1 - mkdir /mnt/PV/versioned/marker_file"
  oc exec -it c-db2u-dv-db2u-0 -- bash -c "su - db2inst1 - touch /mnt/PV/versioned/marker_file/.bar"
- Disable the BigSQL restart daemon in the Watson Query head pod:
  oc exec -it c-db2u-dv-db2u-0 -- bash -c "db2uctl marker create BIGSQL_DAEMON_PAUSE"
- Stop BigSQL in the Watson Query head pod:
  oc rsh c-db2u-dv-db2u-0 bash
  su - db2inst1
  bigsql stop
- Re-enable the Hive user in the users.json file in the Watson Query head pod.
  - Edit the users.json file:
    vi /mnt/blumeta0/db2_config/users.json
  - Locate "locked":true and change it to "locked":false.
- On the hurricane pod, rename the hive-site.xml config file so that it can be reconfigured by restarting the pod:
  oc rsh $(oc get pod -o name -l formation_id=db2u-dv,role=hurricane)
  su - db2inst1
  mv /mnt/blumeta0/home/db2inst1/ibm/bigsql/hive-site.xml /mnt/blumeta0/home/db2inst1/ibm/bigsql/hive-site.xml.bak
- Exit the pod, and then run the following command to delete it.
  Note: Because the configuration file was renamed, it is regenerated with the correct settings.
  oc delete pod -l formation_id=db2u-dv,role=hurricane
- After the hurricane pod is started again, run the following commands on the hurricane pod to disable SSL so that it can be reconfigured in a later step:
  oc rsh $(oc get pod -o name -l formation_id=db2u-dv,role=hurricane)
  su - db2inst1
  bigsql-config -disableMetastoreSSL
  bigsql-config -disableSchedulerSSL
- Clean up leftover files from the hurricane pod:
  rm -rf /mnt/blumeta0/bigsql/security/*
  rm -rfv /mnt/blumeta0/bigsql/security/.* 2>/dev/null
- Run the following commands to disable SSL from the head pod:
  oc rsh c-db2u-dv-db2u-0 bash
  su - db2inst1
  rah "bigsql-config -disableMetastoreSSL"
  rah "bigsql-config -disableSchedulerSSL"
- Clean up leftover files from the head and worker pods:
  rm -rf /mnt/blumeta0/bigsql/security/*
  rm -rfv /mnt/blumeta0/bigsql/security/.* 2>/dev/null
- Run the following commands to re-enable SSL on the head pod, and restart Db2 Big SQL so that the configuration changes can take effect:
  bigsql-config -enableMetastoreSSL
  bigsql-config -enableSchedulerSSL
  bigsql stop; bigsql start
- Remove the markers that were created in steps 1 and 2 in the Watson Query head pod:
  oc exec -it c-db2u-dv-db2u-0 -- bash -c "rm -rf /mnt/PV/versioned/marker_file/.bar"
  oc exec -it c-db2u-dv-db2u-0 -- bash -c "db2uctl marker delete BIGSQL_DAEMON_PAUSE"
- If you are doing the backup and restore with the OADP backup and restore utility, run the following command:
  cpd-cli oadp restore prehooks --hook-kind=checkpoint --include-namespaces=${PROJECT_CPD_INST_OPERANDS},${PROJECT_CPD_INST_OPERATORS} --log-level debug --verbose
- If you are doing the backup and restore with IBM Storage Fusion, NetApp Astra Control Center, or Portworx data replication, run the following commands:
  CPDBR_POD=$(oc get po -l component=cpdbr-tenant -n ${PROJECT_CPD_INST_OPERATORS} --no-headers | awk '{print $1}')
  oc exec -n ${PROJECT_CPD_INST_OPERATORS} ${CPDBR_POD} -it -- /bin/sh -c "./cpdbr-oadp restore posthooks --hook-kind=checkpoint --include-namespaces=${PROJECT_CPD_INST_OPERANDS},${PROJECT_CPD_INST_OPERATORS}"
  oc exec -n ${PROJECT_CPD_INST_OPERATORS} ${CPDBR_POD} -it -- /bin/sh -c "./cpdops/files/cpd-operators.sh restore-namespacescope --foundation-namespace ${PROJECT_CPD_INST_OPERATORS} --operators-namespace ${PROJECT_CPD_INST_OPERATORS}"
After a restore, OperandRequest timeout error in the ZenService custom resource
Applies to: 4.8.5 and later
- Diagnosing the problem
- Get the status of the ZenService YAML:
  oc get zenservice lite-cr -n ${PROJECT_CPD_INST_OPERANDS} -o yaml
  In the output, you see the following error:
  ...
  zenMessage: '5.1.3/roles/0010-infra has failed with error: "OperandRequest" "zen-ca-operand-request": Timed out waiting on resource'
  ...
  Check for failing operandrequests:
  oc get operandrequests -A
  For failing operandrequests, check their conditions for constraints not satisfiable messages:
  oc describe -n ${PROJECT_CPD_INST_OPERATORS} <opreq-name>
- Cause of the problem
- Subscription wait operations timed out. The problematic subscriptions show an error similar to the following example:
  'constraints not satisfiable: clusterserviceversion ibm-db2aaservice-cp4d-operator.v5.2.0 exists and is not referenced by a subscription, @existing/cpd-operators//ibm-db2aaservice-cp4d-operator.v5.2.0 and ibm-db2aaservice-cp4d-operator-catalog/cpd-operators/v5.2/ibm-db2aaservice-cp4d-operator.v5.2.0 originate from package ibm-db2aaservice-cp4d-operator, subscription ibm-db2aaservice-cp4d-operator requires ibm-db2aaservice-cp4d-operator-catalog/cpd-operators/v5.2/ibm-db2aaservice-cp4d-operator.v5.2.0, subscription ibm-db2aaservice-cp4d-operator exists'
  This problem is a known issue with Red Hat Operator Lifecycle Manager. For details, see Failed to install or upgrade operator with warning 'clusterserviceversion is not referenced by a subscription'.
- Workaround
- Do the following steps:
- Delete the problematic clusterserviceversions and subscriptions, and restart the Operand Deployment Lifecycle Manager (ODLM) pod. For details, follow the steps in the troubleshooting document cloud-native-postgresql operator is installed with the certified-operators catalogsource.
- Delete the Cloud Pak for Data instance projects (namespaces). For details, see Preparing to restore Cloud Pak for Data.
- Retry the restore.
After a restore, the
ZenService component remains in InProgress
state
Applies to: 4.8.0, 4.8.1, 4.8.2
Fixed in: 4.8.3
- Diagnosing the problem
-
In the backup and restore log file, you see the following message:
msg: Timeout creating internal-tls-certificate, make sure that certmanager has been installed and work properly
- Workaround
- Delete all restored certificate requests. (A one-liner that deletes them all appears after these steps.)
  - Get the list of all certificate requests:
    oc get certificaterequest -n ${PROJECT_CPD_INST_OPERANDS}
  - Delete each certificate request one-by-one:
    oc delete certificaterequest -n ${PROJECT_CPD_INST_OPERANDS} <cert-request-name>
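A hedged one-liner that deletes every certificate request in the project in one pass (use with care; it removes all of them, which matches the intent of this workaround):
# Sketch: delete all certificaterequests in the operands project.
oc get certificaterequest -n ${PROJECT_CPD_INST_OPERANDS} -o name \
  | xargs -r oc delete -n ${PROJECT_CPD_INST_OPERANDS}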
Password for the
cpadmin user changes when restoring Cloud Pak for Data
Applies to: 4.8.0-4.8.4
Fixed in: 4.8.5
- Diagnosing the problem
-
This problem occurs when Cloud Pak for Data is integrated with the Identity Management Service. The problem applies to all backup and restore methods.
After a restore, the password for the cpadmin user is changed.
- Workaround
- No workaround is available. To retrieve or change the new password, see Changing the administrator password.
After restoring
IBM Match 360 from backup, the
associated Redis pods enter a
CrashLoopBackOff state
Applies to: 4.8.1, 4.8.2
Fixed in: 4.8.3
- Diagnosing the problem
-
This issue occurs after restoring IBM Match 360 from either an online or offline backup. The MDM Redis CP (mdm-redis-cp) pods fail to restore correctly and enter a CrashLoopBackOff state, which affects the functionality of the IBM Match 360 user interface services. This problem is caused by missing password fields in relevant secrets.
-
To resolve this issue, set the administrator password field for RedisCP-related secrets:
- Encode the user-defined password for Redis:
  echo -n '<USER-DEFINED-PASSWORD>' | base64
- Run the following commands to set the administrator password to the encoded password value that you just created:
  oc -n ${PROJECT_CPD_INST_OPERANDS} patch secret mdm-redis-cp-<INSTANCE-ID>-admin-secret --type=merge -p '{"data":{"admin_password": "<ENCODED-PASSWORD>"}}'
  oc -n ${PROJECT_CPD_INST_OPERANDS} patch secret mdm-redis-cp-<INSTANCE-ID>-em-ui --type=merge -p '{"data":{"auth": "<ENCODED-PASSWORD>"}}'
- Bring down the MDM RedisCP Server StatefulSet:
  oc patch sts mdm-redis-cp-<INSTANCE-ID>-server --type=merge -p '{"spec":{"replicas": 0}}'
  Wait for all MDM RedisCP pods to terminate.
- Bring up the MDM RedisCP Server StatefulSet:
  oc patch sts mdm-redis-cp-<INSTANCE-ID>-server --type=merge -p '{"spec":{"replicas": 3}}'
- Delete the MDM Redis HAProxy pod:
  oc delete pod -l redis-app=mdm-redis-cp-<INSTANCE-ID>-haproxy
Online backup of Watson Discovery fails at checkpoint stage
Applies to: 4.8.3
Fixed in: 4.8.4
- Diagnosing the problem
- When you try to create an online backup, the backup process fails at the checkpoint
hook stage. For example, if you are creating the backup with IBM Storage Fusion, the backup process
fails at the Hook: br-service-hooks-checkpoint stage in the
backup sequence. In the log file, you see an error message similar to the following
example:
download failed: s3://common-zen-wd/mt/__built-in-tenant__/fileResource/701db916-fc83-57ab-0000-000000000010.zip to tmp/s3-backup/common-zen-wd/mt/__built-in-tenant__/fileResource/701db916-fc83-57ab-0000-000000000010.zip Connection was closed before we received a valid response from endpoint URL: "https://s3.openshift-storage.svc:443/common-zen-wd/mt/__built-in-tenant__/fileResource/701db916-fc83-57ab-0000-000000000010.zip".
- Cause of the problem
- Large resource files can become corrupted while they are downloaded to the backup. As a result, the wd-discovery-aux-ch-s3-backup job does not complete successfully.
- Workaround
- Delete the file that is shown in the error message and recreate it.
- Exec into the wd-discovery-support pod:
oc exec -it deploy/wd-discovery-support -- bash
- Do the following steps within the pod.
- Delete the file:
aws-wd s3 rm s3://common-zen-wd/mt/__built-in-tenant__/fileResource/<file_name>
- Confirm that the file is not listed when you run the following command:
aws-wd s3 ls s3://common-zen-wd/mt/__built-in-tenant__/fileResource/
- Exit from the pod.
- Delete the wd-discovery-orchestrator-setup job:
oc delete job/wd-discovery-orchestrator-setup
- Wait for the wd-discovery-orchestrator-setup job to run again and complete. Confirm that the file was successfully recreated:
- Exec into the wd-discovery-support pod:
oc exec -it deploy/wd-discovery-support -- bash
- Do the following steps within the pod.
- Copy the file to the tmp directory:
aws-wd s3 cp s3://common-zen-wd/mt/__built-in-tenant__/fileResource/<file_name> /tmp
- Confirm that the file is copied:
ls /tmp/<file_name>
- Exit from the pod.
You can now retake the backup, and the wd-discovery-aux-ch-s3-backup job will complete successfully.
Resource file causes online backup of Watson Discovery to fail at checkpoint stage
Applies to: 4.8.4, 4.8.5
Fixed in: 4.8.6
- Diagnosing the problem
- When you try to create an online backup, the backup process fails at the checkpoint
hook stage. For example, if you are creating the backup with IBM Storage Fusion, the backup process
fails at the Hook: br-service-hooks-checkpoint stage in the
backup sequence. In the log file, you see an error message similar to the following
example:
download failed: s3://common-zen-wd/user/1000750000/.sparkStaging/application_1710244391677_0005/training-job.jar to tmp/s3-backup/common-zen-wd/user/1000750000/.sparkStaging/application_1710244391677_0005/training-job.jar Connection was closed before we received a valid response from endpoint URL: "https://s3.openshift-storage.svc:443/common-zen-wd/user/1000750000/.sparkStaging/application_1710244391677_0005/training-job.jar".
- Cause of the problem
- Resource files can become corrupted, and as a result, the wd-discovery-aux-ch-s3-backup job does not complete successfully.
- Workaround
- Delete the file that is shown in the error message from s3 and re-run the backup.
- Exec into the wd-discovery-support pod:
oc exec -it deploy/wd-discovery-support -- bash
- Do the following steps within the pod.
- Delete the file:
aws-wd s3 rm <file_name>
For example:
aws-wd s3 rm s3://common-zen-wd/user/1000750000/.sparkStaging/application_1710244391677_0005/training-job.jar
- Confirm that the file was deleted:
aws-wd s3 ls <file_name>
- Exit from the pod.
You can now retake the backup, and the wd-discovery-aux-ch-s3-backup job will complete successfully.
Watson Knowledge Catalog custom resource stuck in
InProgress state after restore
Applies to: 4.8.9
- Diagnosing the problem
- When you try to restore, the Watson Knowledge Catalog custom resource is stuck in an InProgress state, and you see a message similar to the following example:
NAME VERSION RECONCILED STATUS AGE wkc-cr 4.8.9 InProgress 5h36m lastTransitionTime: "2025-04-16T17:08:20Z" message: |- unknown playbook failure Failed to deploy WKC prereqs Failed at task: Wait until FdbCluster is completed The error was: Please consult the operator logs.
- Cause of the problem
- Corrupted or incomplete snapshot data is causing unrecoverable indexes in OpenSearch during the restore process.
- Workaround
- Complete the following steps to resolve the issue:
- Reduce GS replicas to 0:
oc scale deployment wkc-search --replicas=0
- Change the readiness check configuration to accept red as a valid status:
oc patch ccs ccs-cr --type merge --patch '{"spec": {"openshift_cluster_health_check_params": "wait_for_status=red&timeout=30s"}}'
- Put the OpenSearch cluster in quiesce mode:
oc patch elasticsearchcluster elasticsearch-master --type merge --patch '{"spec": {"quiesce": true}}'
- Take the OpenSearch cluster out of quiesce mode:
oc patch elasticsearchcluster elasticsearch-master --type merge --patch '{"spec": {"quiesce": false}}'
- Verify the presence of corrupted indexes, which are shards that are stuck and do not respond to recovery:
oc exec elasticsea-0ac3-ib-6fb9-es-server-esnodes-0 -c elasticsearch -- curl --request GET --url http://localhost:19200/_cat/shards --header 'content-type: application/json'
- Delete the corrupted indexes:
oc exec elasticsea-0ac3-ib-6fb9-es-server-esnodes-0 -c elasticsearch -- curl --request DELETE --url http://localhost:19200/gs-system-index-wkc-v001,semantic,wkc --header 'content-type: application/json'
- Put the OpenSearch cluster in quiesce mode:
oc patch elasticsearchcluster elasticsearch-master --type merge --patch '{"spec": {"quiesce": true}}'
- Change the readiness check configuration back to the default setting, which accepts only yellow as a valid status:
oc patch ccs ccs-cr --type merge --patch '{"spec": {"openshift_cluster_health_check_params": "wait_for_status=yellow&timeout=30s"}}'
- Take the OpenSearch cluster out of quiesce mode:
oc patch elasticsearchcluster elasticsearch-master --type merge --patch '{"spec": {"quiesce": false}}'
- Scale the wkc-search pod back to its original size:
oc scale deployment wkc-search --replicas=1
You can now complete the restore successfully.
Online restore posthooks fail with checkpoint id error
Applies to: 4.8.3, 4.8.4
Fixed in: 4.8.5
- Diagnosing the problem
- Run the restore posthooks
command:
cpd-cli oadp restore posthooks --tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} --hook-kind=checkpoint
In the output, you see a missing checkpoint id message like in the following example:
processing request... oadp namespace: oadp-operator cpd namespace: zen runtime mode: resolved namespaces [ibm-common-services] ... Error: error running post-restore hooks: cannot get checkpoint id in namespace zen, namespaces: [ibm-common-services], checkpointId: , err: info is nil - Cause of the problem
- The restore posthooks command cannot resolve the Cloud Pak for Data project (namespace) from the
--tenant-operator-namespace option.
- Workaround
- Run the restore posthooks command with the
--include-namespaces option instead:
cpd-cli oadp restore posthooks --include-namespaces=${PROJECT_CPD_INST_OPERATORS},${PROJECT_CPD_INST_OPERANDS} --hook-kind=checkpoint
Online restore posthooks
fail with zenobjstore/ibm-zen-configuration is not empty error
Applies to: 4.8.3
Fixed in: 4.8.4
- Diagnosing the problem
- Run the restore posthooks
command:
cpd-cli oadp restore posthooks --tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} --hook-kind=checkpoint
In the output, you see an error message like in the following example:
zen/configmap/cpd-zen-aux-ckpt-cm: component=zen, op=<mode=post-restore,type=config-hook,method=job>, status=error
In the cpdbr-oadp.log or cpd-cli*.log file, you see the following error:
time=<timestamp> level=info msg=logs for pod=zen-ckpt-restore-job-fv7l5-g9rfm: -------------------------------- container: zen-ckpt-restore -------------------------------- ... mc: <ERROR> `zenobjstore/ibm-zen-configuration` is not empty. Retry this command with '--force' flag if you want to remove `zenobjstore/ibm-zen-configuration` and all its contents <timestamp>: Deletion Unsuccessful... Exiting with error.
- The object store that is used to back up the platform metadata must be reset before you run the restore posthooks command.
- Workaround
- Do the following steps:
- Set up the backup location for the platform
metadata:
export BACKUP_DIR=$HOME/rstemp/zen_backup
mkdir -p $BACKUP_DIR/workspace && \
mkdir -p $BACKUP_DIR/secrets/jwks && \
mkdir -p $BACKUP_DIR/secrets/jwt && \
mkdir -p $BACKUP_DIR/secrets/jwt-private && \
mkdir -p $BACKUP_DIR/secrets/ibmid-jwk && \
mkdir -p $BACKUP_DIR/secrets/aes-key && \
mkdir -p $BACKUP_DIR/secrets/admin-user && \
mkdir -p $BACKUP_DIR/database && \
mkdir -p $BACKUP_DIR/objstorage
used to back up the platform
metadata:
export IBM_ZEN_BUCKET_NAME=$(oc -n ${PROJECT_CPD_INST_OPERANDS} get cm ibm-zen-objectstore-cm -o=jsonpath='{.data.BUCKET_ZEN_CONFIGURATION}')
OBJECTSTORE_ENDPOINT=$(oc -n ${PROJECT_CPD_INST_OPERANDS} get cm ibm-zen-objectstore-cm -o jsonpath="{.data.OBJECTSTORE_ENDPOINT}")
oc -n ${PROJECT_CPD_INST_OPERANDS} extract secret/ibm-zen-objectstore-secret --to=$BACKUP_DIR/workspace --confirm
metadata:
oc -n ${PROJECT_CPD_INST_OPERANDS} exec -t zen-minio-0 -- bash -c "rm -rf /tmp/backup && mkdir -p /tmp/backup && export HOME=/tmp && /workdir/bin/mc alias set zenobjstore ${OBJECTSTORE_ENDPOINT} $(<${BACKUP_DIR}/workspace/accesskey) $(<${BACKUP_DIR}/workspace/secretkey) --config-dir=/tmp/.mc --insecure && /workdir/bin/mc ls zenobjstore/${IBM_ZEN_BUCKET_NAME} --insecure && /workdir/bin/mc rb zenobjstore/${IBM_ZEN_BUCKET_NAME} --force --dangerous --insecure && /workdir/bin/mc mb zenobjstore/${IBM_ZEN_BUCKET_NAME} --insecure" - Rerun the restore posthooks
command:
cpd-cli oadp restore posthooks --include-namespaces=${PROJECT_CPD_INST_OPERATORS},${PROJECT_CPD_INST_OPERANDS} --hook-kind=checkpoint
Db2 Big SQL post-restore hook fails during online restore
Applies to: 4.8.8
- Diagnosing the problem
- When restoring an online backup with the OADP backup and restore utility,
the CPD-CLI*.log file shows error messages like in the
following
example:
<timestamp> INFO Attempting Write Resume/Restore..Running /db2u/scripts/write-restore.sh - retry <timestamp> ERROR Db2 status on c-bigsql-1736386855680378-db2u-1.c-bigsql-1736386855680378-db2u-internal: <timestamp> ERROR Failed to resume Db2 write in c-bigsql-<worker_pod_index>-db2u-1.c-bigsql-<worker_pod_index>-db2u-internal...Post-restore/resume hooks cannot continue func=cpdbr-oadp/pkg/quiesce.executeCommand file=/a/workspace/oadp-upload/pkg/quiesce/ruleexecutor.go:802 time=<timestamp> level=info msg=exit executeCommand func=cpdbr-oadp/pkg/quiesce.executeCommand file=/a/workspace/oadp-upload/pkg/quiesce/ruleexecutor.go:813 - Cause of the problem
- The Db2 Big SQL database on a worker node remains in write-suspend mode.
- Resolving the problem
- Take the database out of write-suspend mode by doing the following steps:
- In the error message, note the Db2 Big SQL worker pod that failed, identified by ERROR Failed to resume Db2 write in.
- Log in to the worker pod:
oc -n ${PROJECT_CPD_INST_OPERANDS} rsh $(oc -n ${PROJECT_CPD_INST_OPERANDS} get pod | grep -i c-bigsql | grep -i db2u-<worker_pod_index> | cut -d' ' -f 1) bash
- Switch to the db2inst1 user:
su - db2inst1
- Reactivate the database:
db2 restart db ${DBNAME} write resume
db2 activate db ${DBNAME}
- Verify that you can successfully connect to the Db2 Big SQL database on a worker node:
db2 list active databases | grep ${DBNAME}
db2 connect to ${DBNAME}
- Repeat steps 1-5 for any other worker pod that failed.
- Restore the namespacescope.
- Get the cpdbr-tenant-service pod ID:
oc get po -A | grep "cpdbr-tenant-service"
- Log in to the cpdbr-tenant-service pod:
oc rsh -n ${OADP_OPERATOR_NAMESPACE} <cpdbr-tenant-service pod id>
- Run the restore namespacescope script:
/cpdbr-scripts/cpdbr/cpdbr-cpd-operators.sh restore-namespacescope --operators-namespace ${PROJECT_CPD_INST_OPERATORS} --foundation-namespace ${PROJECT_CPD_INST_OPERATORS}
Watson Speech services status is stuck in
InProgress after restore
Applies to: 4.8.5 and later
- Diagnosing the problem
- After an online restore with the OADP backup and restore utility, the CPD-CLI*.log file shows that speechStatus is in the InProgress state.
- Cause of the problem
- The speechStatus is in the InProgress state due to a race condition in the stt-async component. Pods that are associated with this component are stuck in the 0/1 Running state. Run the following command to confirm this state:
oc get pods -l app.kubernetes.io/component=stt-async
Example output:
NAME READY STATUS RESTARTS AGE speech-cr-stt-async-775d5b9d55-fpj8x 0/1 Running 0 60m
If one or more pods is in the 0/1 Running state for 20 minutes or more, this problem might occur.
- Resolving the problem
- For each pod in the 0/1 Running state, run the following command:
oc delete pod <stt-async-podname>
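If several pods are stuck, a sketch that deletes every stt-async pod at once by reusing the label selector from the diagnosis step, assuming all matching pods are affected:
oc delete pod -n ${PROJECT_CPD_INST_OPERANDS} -l app.kubernetes.io/component=stt-async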
Online restore posthooks time out when restoring Watson Query or Db2 Big SQL
Applies to: 4.8.5 and later
- Diagnosing the problem
-
- Run the cpd-cli oadp restore logs <restore_name> command, replacing <restore_name> with the restore name that you specified when you did the step to restore Kubernetes resources that generate pods (for example, restore-name4). The restore log file shows the following error messages for Db2 Big SQL and Watson Query respectively:
zen/configmap/cpd-bigsql-aux-ckpt-cm: component=bigsql, op=<mode=post-restore,type=config-hook,method=rule>, status=error
zen/configmap/cpd-dv-aux-ckpt-cm: component=dv, op=<mode=post-restore,type=config-hook,method=rule>, status=error
- In the head pod, open the
/var/log/bigsql/cli/bigsql-db2ubar-hook.log file, and look
for the following error messages.
- In Watson Query, the head pod is c-db2u-dv-db2u-0.
- In Db2 Big SQL, the head pod is c-bigsql-<xxxxxxxx>-db2u-0.
For Watson Query:
/db2u/.db2u_initialized ************************************************* Restart and activate the database(s)...this may take a while SQL1032N No start database manager command was issued. SQLSTATE=57019 c-db2u-dv-db2u-0.c-db2u-dv-db2u-internal.${PROJECT_CPD_INST_OPERANDS}.svc.cluster.local: db2 restart db BIGSQL ... completed rc=4 SQL1032N No start database manager command was issued. SQLSTATE=57019 c-db2u-dv-db2u-1.c-db2u-dv-db2u-internal: db2 restart db BIGSQL ... completed rc=4 activate db BIGSQL SQL1032N No start database manager command was issued. SQLSTATE=57019For Db2 Big SQL:
/db2u/.db2u_initialized ************************************************* Restart and activate the database(s)...this may take a while SQL1032N No start database manager command was issued. SQLSTATE=57019 c-bigsql-<xxxxxxxxxxxxxxxx>-db2u-0.c-bigsql-<xxxxxxxxxxxxxxxx>-db2u-internal.${PROJECT_CPD_INST_OPERANDS}.svc.cluster.local: db2 restart db BIGSQL ... completed rc=4 SQL1032N No start database manager command was issued. SQLSTATE=57019 c-bigsql-<xxxxxxxxxxxxxxxx>-db2u-1.c-bigsql-<xxxxxxxxxxxxxxxx>-db2u-internal: db2 restart db BIGSQL ... completed rc=4 activate db BIGSQL SQL1032N No start database manager command was issued. SQLSTATE=57019 - After about 5 minutes, check these log files to see if the error messages are still there.
- If the error messages are still there, check whether the
bigsql-db2ubar-hook.sh script is still running.
- For Watson Query:
oc rsh c-db2u-dv-db2u-0 bash
su - db2inst
ps -ef |grep -v grep |grep bigsql-db2ubar-hook.sh
oc rsh c-bigsql-<xxxxxxxx>-db2u-0 bashsu - db2instps -ef |grep -v grep |grep bigsql-db2ubar-hook.sh
The error message SQL1032N No start database manager command was issued. SQLSTATE=57019 indicates that either Db2 was not running, or
db2 connect to bigsqlfailed when the Db2 process to restore writes to all databases started. - For Watson Query:
- Run the
- Resolving the problem
- Check that Db2 is running
and can connect to bigsql by doing the following steps:
- Check that Db2 is
running by running the following commands.
- For Watson Query:
oc rsh c-db2u-dv-db2u-0 bash
su - db2inst
bigsql status
- For Db2 Big SQL:
oc rsh c-bigsql-<xxxxxxxx>-db2u-0 bash
su - db2inst
bigsql status
- Confirm that Db2 is running in the head pod and all worker pods.
- If Db2 is running, check if db2 connect is working.
- For Watson Query:
oc rsh c-db2u-dv-db2u-0 bash
su - db2inst
db2 connect to bigsql
- For Db2 Big SQL:
oc rsh c-bigsql-<xxxxxxxx>-db2u-0 bash
su - db2inst
db2 connect to bigsql
If Db2 is running and can connect to bigsql, re-run the restore posthooks by doing the following steps:
- Log in to the bigsql head pod, and if the
bigsql-db2ubar-hook.sh script is still running, terminate
the process.
- For Watson Query:
oc rsh c-db2u-dv-db2u-0 bash
su - db2inst
pid=`ps -ef | grep bigsql-db2ubar-hook.sh | grep -v grep | awk '{print $2}'`
kill -9 ${pid}
exit
- For Db2 Big SQL:
oc rsh c-bigsql-<xxxxxxxx>-db2u-0 bash
su - db2inst
pid=`ps -ef | grep bigsql-db2ubar-hook.sh | grep -v grep | awk '{print $2}'`
kill -9 ${pid}
exit
- Re-run the restore posthooks
command:
cpd-cli oadp restore posthooks \
--tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} \
--hook-kind=checkpoint \
--log-level=debug \
--verbose
- Confirm that the posthooks completed successfully for Db2 Big SQL and Watson Query.
Backup fails at Volume group: cpd-volumes stage
Applies to: IBM Storage Fusion 2.7.2
Fixed in: IBM Storage Fusion 2.7.2 hotfix
- Diagnosing the problem
- In the backup sequence in IBM Storage Fusion 2.7.2, the backup
fails at the Volume group: cpd-volumes stage.
The transaction manager log shows several error messages, such as the following examples:
<timestamp>[TM_0] - Error: Processing of volume cc-home-pvc failed.\n", "<timestamp>[VOL_12] -Snapshot exception (410)\\nReason: Expired: too old resource version: 2575013 (2575014) - Workaround
- Install the IBM Storage Fusion 2.7.2 hotfix. For details, see IBM Storage Fusion and IBM Storage Fusion HCI hotfix.
Backup of Cloud Pak for Data operators project fails at data transfer stage
Applies to: IBM Storage Fusion 2.7.2
Fixed in: IBM Storage Fusion 2.7.2 hotfix
- Diagnosing the problem
- In IBM Storage Fusion 2.7.2,
the backup fails at the Data transfer stage, with the following
error:
Failed transferring data There was an error when processing the job in the Transaction Manager service - Cause
- The length of a Persistent Volume Claim (PVC) name is more than 59 characters.
- Workaround
- Install the IBM Storage Fusion
2.7.2 hotfix. For details, see IBM Storage Fusion and
IBM Storage Fusion HCI
hotfix.
With the hotfix, PVC names can be up to 249 characters long.
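To identify which PVC names exceed the 59-character limit before you apply the hotfix, a sketch (the limit value is taken from the cause above):
oc get pvc -n ${PROJECT_CPD_INST_OPERANDS} -o name | awk -F/ 'length($2) > 59 {print $2}'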
Backup fails at Hook: br-service-hooks/post-backup stage
Applies to: 4.8.2, 4.8.3
Fixed in: 4.8.4
- Diagnosing the problem
- In the backup sequence in IBM Storage Fusion, the backup fails at
the Hook: br-service-hooks/post-backup stage.
Logs from the transaction manager have an error message similar to the following example:
.... [apphooks:executeHook Line 144][ERROR] - Timeout reached before command completed. However, the operation continues because of the on-error annotation value. - Workaround
- Increase the backup hooks timeout value in the backup recipe
ibmcpd-tenant from the default 600 seconds to 1800 seconds by
running the following
command:
oc patch frcpe ibmcpd-tenant -n ${PROJECT_CPD_INST_OPERATORS} --type='json' -p='[{"op": "replace", "path": "/spec/hooks/0/ops/7/timeout", "value": 1800}]'
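To confirm that the patch was applied, a sketch that reads the timeout value back from the same recipe path:
oc get frcpe ibmcpd-tenant -n ${PROJECT_CPD_INST_OPERATORS} -o jsonpath='{.spec.hooks[0].ops[7].timeout}{"\n"}'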
ZenService is in Failed state after a
restore
Applies to: IBM Storage Fusion 2.7.1
Fixed in: IBM Storage Fusion 2.7.2
- Diagnosing the problem
- After you restore an online backup with IBM Storage Fusion 2.7.1, check the
status of the ZenService
component:
oc describe zenservice -n ${PROJECT_CPD_INST_OPERANDS} lite-cr
The command returns output like in the following example:
Status: Progress: 16% Progress Message: Finished Zen-Metastore edb Supported Operand Versions: 5.1.1 Zen Message: 5.1.1/roles/0010-infra has failed with error: All items completed Zen Operator Build Number: zen operator 5.1.1 build 37 Zen Status: Failed
- When the Cloud Pak for Data project
(namespace) has the label
pod-security.kubernetes.io/enforce, the annotation for the user ID range,openshift.io/sa.scc.uid-range, is changed during the restore process. - Workaround
- Change the annotation for the user ID range in the target cluster to match the
source cluster.
- From the source cluster, get the security context constraint (SCC) user ID (UID)
range annotations of the Cloud Pak for Data operand
project:
oc get namespace ${PROJECT_CPD_INST_OPERANDS} -o json | jq '.metadata.annotations."openshift.io/sa.scc.uid-range"'
oc get namespace ${PROJECT_CPD_INST_OPERANDS} -o json | jq '.metadata.annotations."openshift.io/sa.scc.supplemental-groups"'
Example output from both commands:
"1001130000/10000"
Note: If the source cluster is no longer available, you can also find the namespace definition from the IBM Storage Fusion transaction-manager pod logs in the ibm-backup-restore project on the target cluster. For example:
<timestamp>[TM_1][restoreguardian:createNamespace Line 2195][INFO] - Creating namespace with labels: {'kubernetes.io/metadata.name': 'cpd-instance', 'olm.operatorgroup.uid/1347e6d2-51de-4a7d-a246-a25dc85b0121': '', 'pod-security.kubernetes.io/audit': 'baseline', 'pod-security.kubernetes.io/audit-version': 'v1.24', 'pod-security.kubernetes.io/enforce': 'privileged', 'pod-security.kubernetes.io/warn': 'baseline', 'pod-security.kubernetes.io/warn-version': 'v1.24'} and annotations {'openshift.io/sa.scc.mcs': 's0:c34,c4', 'openshift.io/sa.scc.supplemental-groups': '1001130000/10000', 'openshift.io/sa.scc.uid-range': '1001130000/10000'}
the source
cluster.
oc annotate namespace ${PROJECT_CPD_INST_OPERANDS} openshift.io/sa.scc.uid-range="<range-from-source-cluster>" --overwrite
oc annotate namespace ${PROJECT_CPD_INST_OPERANDS} openshift.io/sa.scc.supplemental-groups="<range-from-source-cluster>" --overwrite
For example:
oc annotate namespace ${PROJECT_CPD_INST_OPERANDS} openshift.io/sa.scc.uid-range="1001130000/10000" --overwrite
oc annotate namespace ${PROJECT_CPD_INST_OPERANDS} openshift.io/sa.scc.supplemental-groups="1001130000/10000" --overwrite
Example output from both commands:
namespace/cpd-instance annotated
values:
oc get namespace ${PROJECT_CPD_INST_OPERANDS} -o json | jq '.metadata.annotations'
Example output:
{ "openshift.io/sa.scc.mcs": "s0:c34,c4", "openshift.io/sa.scc.supplemental-groups": "1001130000/10000", "openshift.io/sa.scc.uid-range": "1001130000/10000" }
This step forces all pods to restart with the new UID range.
oc delete pod -n ${PROJECT_CPD_INST_OPERANDS} --all
Example output:
pod "common-web-ui-7475996b49-2pfkz" deleted pod "create-secrets-job-8w6d4" deleted pod "ibm-nginx-74d964878b-6dbcv" deleted pod "ibm-nginx-74d964878b-wwfpb" deleted pod "ibm-nginx-tester-7c959cdcd-cqztg" deleted pod "ibm-zen-vault-sdk-jwt-setup-job-b7sqn" deleted pod "icp-mongodb-0" deleted ..
service reconcile process, restart all pods in the ${PROJECT_CPD_INST_OPERATORS}
project.
oc delete pod -n ${PROJECT_CPD_INST_OPERATORS} --all
Example output:
.. pod "cloud-native-postgresql-catalog-nhzq6" deleted pod "cpd-platform-dc52d" deleted pod "cpd-platform-operator-manager-7878b59476-nrs2l" deleted pod "d14e1b98ea4e0b8a0ceecd128879389e88fde2f14e11203b83128ff468cld6x" deleted pod "d44fc276aa580bfc4f7b22999ea30cf9f4263225c360428ec1219d7fb98lj4b" deleted pod "fe26f3b3326b215794f4412d56259c3232b328c4a82cd4e9f8b77b7bdekfbw2" deleted pod "ibm-common-service-operator-6b7977d75c-k44j6" deleted pod "ibm-commonui-operator-67ff844b58-b468t" deleted pod "ibm-iam-operator-cc84f8674-tmq4r" deleted pod "ibm-mongodb-operator-65d5bc4698-v87ng" deleted pod "ibm-namespace-scope-operator-cfb964d54-s8qp6" deleted pod "ibm-zen-operator-85d7f7bbdc-jkqzv" deleted ..
Cloud Pak for Data services will reconcile and come up to the Completed state. Monitor the process periodically by running:
cpd-cli manage get-cr-status --cpd_instance_ns=${PROJECT_CPD_INST_OPERANDS}
range annotations of the Cloud Pak for Data operand
project:
Restoring an online backup of Cloud Pak for Data on IBM Storage Scale Container Native storage fails
Applies to: 4.8.2 and later
- Diagnosing the problem
- When you restore an online backup with IBM Storage Fusion, the restore process fails at the Volume group: cpd-volumes step in the restore sequence.
- Workaround
- This problem occurs when you have Persistent Volume Claims (PVCs) that are smaller than 5Gi. To work around the problem, expand any PVC that is smaller than 5Gi to at least 5Gi before you create the backup. For details, see Volume Expansion in the IBM Storage Scale Container Storage Interface Driver documentation.
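A sketch that lists each PVC with its requested size so you can find candidates for expansion, and then expands one of them (the storage class must allow volume expansion; <pvc-name> is a placeholder):
oc get pvc -n ${PROJECT_CPD_INST_OPERANDS} -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.resources.requests.storage}{"\n"}{end}'
oc patch pvc <pvc-name> -n ${PROJECT_CPD_INST_OPERANDS} -p '{"spec":{"resources":{"requests":{"storage":"5Gi"}}}}'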
After a restore with
NetApp Astra Control Center,
post-restore hooks fail with an Authentication failed error
Applies to: 4.8.3, 4.8.4
Fixed in: 4.8.5
- Diagnosing the problem
- After restoring Cloud Pak for Data with
NetApp Astra Control Center,
post-restore hooks fail with the following
error:
time=<timestamp> level=info msg= zen/configmap/zen-cs-aux-ckpt-cm: component=zen-cs, op=<mode=post-restore,type=config-hook,method=job>, status=errorIn the cpdbr-oadp or cpd-cli*.log file, you see the following error:func=cpdbr-oadp/pkg/quiesce.logJob file=/go/src/cpdbr-oadp/pkg/quiesce/jobexecutor.go:222 time=<timestamp> level=info msg=logs for pod=zen-mongodb-restore-z9664: -------------------------------- container: cs-mongodb-restore -------------------------------- ------------------------------------------------------------------------- Mongo Restore output: <timestamp> Failed: error connecting to db server: server returned error on SASL authentication step: Authentication failed. ------------------------------------------------------------------------- Mongo Restore attempt: 1 return code: 1 ... ------------------------------------------------------------------------- Mongo Restore output: 2024-02-16T00:47:41.813+0000 error dialing icp-mongodb-2.icp-mongodb.zen.svc.cluster.local:27017: dial tcp 172.129.3.21:27017: connect: connection refused <timestamp> Failed: error connecting to db server: server returned error on SASL authentication step: Authentication failed. ------------------------------------------------------------------------- Mongo Restore attempt: 5 return code: 1 func=cpdbr-oadp/pkg/quiesce.logJob file=/go/src/cpdbr-oadp/pkg/quiesce/jobexecutor.go:229 time=<timestamp> level=info msg=exit logJob func=cpdbr-oadp/pkg/quiesce.logJob file=/go/src/cpdbr-oadp/pkg/quiesce/jobexecutor.go:232 time=<timestamp> level=info msg=job zen-mongodb-restore deleted func=cpdbr-oadp/pkg/quiesce.runJobX.func1 file=/go/src/cpdbr-oadp/pkg/quiesce/jobexecutor.go:124 time=<timestamp> level=error msg=job zen-mongodb-restore did not complete successfully - Cause of the problem
- The MongoDB operator ran before the icp-mongodb-admin secret was created. Services cannot connect to MongoDB.
- Workaround
- Do the following steps:
- Delete the icp-mongodb pods and
PVCs:
oc project ${PROJECT_CPD_INST_OPERANDS}
oc delete pvc mongodbdir-icp-mongodb-0&
oc delete pvc mongodbdir-icp-mongodb-1&
oc delete pvc mongodbdir-icp-mongodb-2&
oc delete po icp-mongodb-0&
oc delete po icp-mongodb-1&
oc delete po icp-mongodb-2&
state:
oc get po -n ${PROJECT_CPD_INST_OPERANDS} -w - Verify that you can connect to the mongo
shell:
oc rsh -n ${PROJECT_CPD_INST_OPERANDS} icp-mongodb-0
mongo --host localhost:$MONGODB_SERVICE_PORT --username $ADMIN_USER --password $ADMIN_PASSWORD --authenticationDatabase admin --ssl --sslCAFile=/data/configdb/tls.crt --sslPEMKeyFile=/work-dir/mongo.pem --verbose
- Retry running the post-restore
hooks:
cpd-cli oadp restore posthooks --tenant-operator-namespace=${PROJECT_CPD_INST_OPERANDS} --hook-kind=checkpoint
After a restore
with NetApp Astra Control Center,
post-restore hooks fail with a zen-mongodb-restore already exists
error
Applies to: 4.8.3, 4.8.4
Fixed in: 4.8.5
- Diagnosing the problem
- After restoring Cloud Pak for Data with
NetApp Astra Control Center,
post-restore hooks fail with the following
error:
time=<timestamp> level=info msg= zen/configmap/zen-cs-aux-ckpt-cm: component=zen-cs, op=<mode=post-restore,type=config-hook,method=job>, status=error
In the cpdbr-oadp or cpd-cli*.log file, you see the following error:
time=<timestamp> level=error msg=jobs.batch "zen-mongodb-restore" already exists
- The zen-mongodb-restore job was already running from a previously failed post-restore hooks attempt.
- Workaround
- Do the following steps:
- Delete the zen-mongodb-restore
job:
oc delete job -n ${PROJECT_CPD_INST_OPERANDS} zen-mongodb-restore
- Retry running the post-restore
hooks:
cpd-cli oadp restore posthooks --tenant-operator-namespace=${PROJECT_CPD_INST_OPERANDS} --hook-kind=checkpoint
After a restore with NetApp Astra Control Center, the ZenService custom resource remains stuck at 51% in progress
Applies to: 4.8.3, 4.8.4
Fixed in: 4.8.5
- Diagnosing the problem
- Get the status of the ZenService custom
resource:
oc get zenservice -A -o yaml
Example output of the command:
... status: Progress: 51% ProgressMessage: Finished Role 0020-core supportedOperandVersions: 5.1.1 zenMessage: 5.1.1/roles/0030-gateway has failed. See the latest operator debug log for exact error in operator pod ibm-zen-operator-85d7f7bbdc-h9xbm under /tmp/ansible-operator/runner/zen.cpd.ibm.com/v1/ZenService/zen/lite-cr/artifacts/ directory. zenOperatorBuildNumber: zen operator 5.1.1 build 37 zenStatus: Failed ...
Get the status of the oidc-client-registration job:
oc get job oidc-client-registration -o yaml
Example output of the command:
... - apiVersion: batch/v1 kind: Job namespace: zen objectName: oidc-client-registration status: NotReady ...
- Cause of the problem
- The MongoDB operator ran before the icp-mongodb-admin secret was created. Services cannot connect to MongoDB. The Auth service cannot create OAuthDBSchema for storing client registrations.
- Workaround
- Do the following steps:
- Restart the platform-auth-service pods.
- Get the list of pods:
oc get po -A | grep "platform-auth-service"
- Delete each pod:
oc delete po -n ${PROJECT_CPD_INST_OPERANDS} platform-auth-service-<pod_name>
- Wait for the ZenService custom resource to reconcile and
progress past
51%:
oc get zenservice -A -o yaml -w
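A sketch that combines the two substeps into one pass, assuming the platform-auth-service pods run in the operands project, as the delete command above implies:
oc get po -n ${PROJECT_CPD_INST_OPERANDS} --no-headers | awk '/platform-auth-service/ {print $1}' | xargs -r oc delete po -n ${PROJECT_CPD_INST_OPERANDS}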
After a restore with NetApp Astra Control Center, the ZenService custom resource does not reconcile
Applies to: 4.8.3, 4.8.4
Fixed in: 4.8.5
- Diagnosing the problem
- Check if namespacescope is missing the instance
namespace:
oc get nss -A -o yaml
Example output of the command:
validatedMembers: - ibm-common-services - Cause of the problem
- The namespacescope was not restored after the post-restore hooks successfully completed.
- Workaround
- Do the following steps:
- Restore the namespacescope.
- Get the cpdbr-tenant-service pod name:
oc get po -A | grep "cpdbr-tenant-service"
- Access the cpdbr-tenant-service pod:
oc rsh -n ${PROJECT_CPD_INST_OPERANDS} cpdbr-tenant-service-<pod_name>
- Run the following command:
/cpdbr-scripts/cpdops/files/cpd-operators.sh restore-namespacescope --foundation-namespace ${PROJECT_CPD_INST_OPERATORS} --operators-namespace ${PROJECT_CPD_INST_OPERATORS}
- Wait for the ZenService custom resource to
progress:
oc get zenservice -A -o yaml -w
After a restore, the OpenPages and watsonx.governance custom resources are in a Failed state
Applies to: 4.8.7
Fixed in: 4.8.8
- Diagnosing the problem
- After you restore Cloud Pak for Data by using Portworx asynchronous disaster recovery, you activate applications. In the Cloud Pak for Data Instances page, the status of the OpenPages custom resource is Failed. The status of the watsonx.governance custom resource alternates between Failed and Completed.
- Resolving the problem
- To work around the problem, do the following steps:
- Get the OpenPages
instance
ID:
INSTANCE_ID=$(oc get openpagesinstance ${INSTANCE_NAME} -o json | jq -r '.spec.zenServiceInstanceId')
- Exec into the Db2 container:
oc exec -it c-db2oltp-$INSTANCE_ID-db2u-0 bash
- Edit the online restore script inside the Db2 container for OpenPages:
vi /mnt/backup/online/db2-online-restore.sh
- Replace line 32. From
gsk8capicmd_64 -secretkey -add -db "/mnt/blumeta0/db2/keystore/keystore.p12" -stashed -label ${label} -format ascii -file "/tmp/${label}.kdb"
to
gsk8capicmd_64 -secretkey -add -db "/mnt/blumeta0/db2/keystore/keystore.p12" -stashed -label ${label} -format ascii -file "/tmp/${label}.kdb" || true
- Reboot the operator pod so that it reconciles.
- Rerun the restore.
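Because the change appends || true to line 32, you can make the same edit non-interactively from inside the Db2 container; a sketch (verify first that line 32 of your script version matches the gsk8capicmd_64 line shown above):
sed -i '32s/$/ || true/' /mnt/backup/online/db2-online-restore.sh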
When restoring Db2 Big SQL, post-restore hook fails
Applies to: 4.8.7 and later
- Diagnosing the problem
- When you restore Db2 Big SQL with Portworx asynchronous disaster recovery, the restore fails with an error message
like in the following
example:
zen/configmap/cpd-bigsql-aux-ckpt-cm: component=bigsql, op=<mode=post-restore,type=config-hook,method=rule>, status=error - Cause of the problem
- The error is caused by a timing issue. The post-restore hook runs while the Db2 Big SQL head pod is starting up.
- Resolving the problem
- To resolve the problem, do the following steps:
- Rerun the post-restore hook logic inside the hurricane pod.
- Log in to the hurricane pod and switch to the db2inst1 user:
oc project ${PROJECT_CPD_INST_OPERANDS}
oc rsh $(oc get pod -o name -l formation_id=db2u-dv,role=hurricane)
su - db2inst1
- Run the post-restore hook logic:
/usr/ibmpacks/current/bigsql/bigsql/bigsql-cli/BIGSQL/package/scripts/bigsql-db2ubar-hook.sh -H POST -M RESTORE -L
- Restore the namespacescope custom resource in the ${PROJECT_CPD_INST_OPERATORS} project.
- Edit the namespacescope common-service custom resource:
oc -n ${PROJECT_CPD_INST_OPERATORS} edit namespacescope common-service
- Add ${PROJECT_CPD_INST_OPERANDS} to the namespaceMembers section. The namespaceMembers section should look like the following example:
namespaceMembers: - cpd-operators - zen
Restore fails at the running post-restore script step
Applies to: 4.8.0-4.8.6
Fixed in: 4.8.7
- Diagnosing the problem
- When you use Portworx asynchronous disaster recovery, activating applications fails when
you run the post-restore script. In the
restore_post_hooks_<timestamp>.log
file, you see an error message such as in the following
example:
Time: <timestamp> level=error - cpd-tenant-restore-<timestamp>-r2 failed /cpdbr-scripts/cpdbr/cpdbr-tenant.sh post-restore exit code=1 *** cpdbr-tenant.sh post-restore failed *** command terminated with exit code 1 - Resolving the problem
- To work around the problem, prior to running the post-restore script, restore custom
resource definitions by running the following
command:
cpd-cli oadp restore create <restore-name-r2> \
--from-backup=cpd-tenant-backup-<timestamp>-b2 \
--include-resources='customresourcedefinitions' \
--include-cluster-resources=true \
--skip-hooks \
--log-level=debug \
--verbose
Activating applications
after migrating data with Portworx
asynchronous disaster recovery fails at post-restore step with constraints not
satisfiable error
Applies to: 4.8.4 and later
- Diagnosing the problem
- If the post-restore step in Activating applications in the destination cluster fails, you
see an error in the log file similar to the following
example:
Time: <timestamp> level=warning - Create OperandRequest Timeout Warning Exited with return code=0 Time: <timestamp> level=info - cpd-operators.sh restore done Time: <timestamp> level=error - OperandRequest: operandrequest.operator.ibm.com/im-service - phase: Installing /cpdbr-scripts/cpdbr/cpdbr-tenant.sh post-restore exit code=1 *** cpdbr-tenant.sh post-restore failed ***
Check for failing operandrequests:
oc get operandrequests -A
For failing operandrequests, check their conditions for constraints not satisfiable messages:
oc describe -n ${PROJECT_CPD_INST_OPERATORS} <opreq-name>
- Subscription wait operations timed out. The problematic subscriptions show an error
similar to the following
example:
'constraints not satisfiable: clusterserviceversion ibm-db2aaservice-cp4d-operator.v5.2.0 exists and is not referenced by a subscription, @existing/cpd-operators//ibm-db2aaservice-cp4d-operator.v5.2.0 and ibm-db2aaservice-cp4d-operator-catalog/cpd-operators/v5.2/ibm-db2aaservice-cp4d-operator.v5.2.0 originate from package ibm-db2aaservice-cp4d-operator, subscription ibm-db2aaservice-cp4d-operator requires ibm-db2aaservice-cp4d-operator-catalog/cpd-operators/v5.2/ibm-db2aaservice-cp4d-operator.v5.2.0, subscription ibm-db2aaservice-cp4d-operator exists'
This problem is a known issue with Red Hat Operator Lifecycle Manager. For details, see Failed to install or upgrade operator with warning 'clusterserviceversion is not referenced by a subscription'.
- Workaround
- Do the following steps:
- Delete the problematic clusterserviceversions and subscriptions,
and restart the Operand Deployment Lifecycle Manager (ODLM) pod.
For details, follow the steps in the troubleshooting document cloud-native-postgresql operator is installed with the certified-operators catalogsource.
- Re-run the post-restore step.
Activating applications
after migrating data with Portworx
asynchronous disaster recovery fails at post-restore step with subscription
isn't created by ODLM error
Applies to: 4.8.4 and later
- Diagnosing the problem
- If the post-restore step in Activating applications in the destination cluster fails with
the following
error:
Time: <timestamp> level=warning - Create OperandRequest Timeout Warning Exited with return code=0 Time: <timestamp> level=info - cpd-operators.sh restore done Time: <timestamp> level=error - OperandRequest: operandrequest.operator.ibm.com/im-service - phase: Installing /cpdbr-scripts/cpdbr/cpdbr-tenant.sh post-restore exit code=1 *** cpdbr-tenant.sh post-restore failed ***
Check the ODLM pod logs:
oc logs $(oc get po -o name -lname=operand-deployment-lifecycle-manager -n ${PROJECT_CPD_INST_OPERATORS}) -n ${PROJECT_CPD_INST_OPERATORS} | grep "isn't created by ODLM"
You see an error like in the following example:
I0323 06:30:25.293343 1 reconcile_operator.go:277] Subscription cloud-native-postgresql-stable-v1.18-cloud-native-postgresql-catalog-cpd-operators in namespace cpd-operators isn't created by ODLM. Ignore update/delete it.
- The name of the target cluster subscription is different from the source cluster subscription.
- Workaround
- Follow the steps in the troubleshooting document cloud-native-postgresql operator is installed with
the certified-operators catalogsource. However, before doing the step on
restarting the ODLM pod, do the following steps:
- Restart the Operator Lifecycle Manager (OLM)
pod:
oc delete pods -n openshift-operator-lifecycle-manager -l app=catalog-operator oc delete pods -n openshift-operator-lifecycle-manager -l app=olm-operator - Check that all operand requests reach a running
state:
oc get operandrequests -A -w - Now restart the ODLM pod.
- Re-run the post-restore step.
Creating an offline backup in REST mode stalls
Applies to: 4.8.0 and later
- Diagnosing the problem
- This problem occurs when you try to create an offline backup in REST mode by using a
custom
--image-prefix value. The offline backup stalls with cpdbr-vol-mnt pods in the ImagePullBackOff state.
- When you specify the
--image-prefixoption in thecpd-cli oadp backup createcommand, the default prefixregistry.redhat.io/ubi9is always used. - Resolving the problem
- To work around the problem, create the backup in Kubernetes mode instead. To
change to this mode, run the following
command:
cpd-cli oadp client config set runtime-mode=
Unable to back up Watson Machine Learning Accelerator
Applies to: 4.8.0
Fixed in: 4.8.1
- Diagnosing the problem
- Creating an offline backup of a Cloud Pak for Data deployment that includes
Watson Machine Learning Accelerator with the OADP utility fails.An error message like in the following example appears in the CPD-CLI*.log file:
zen/configmap/cpd-wmla-br-cm: component=wmla, op=<mode=pre-backup,type=config-hook,method=rule>, status=error - Workaround
- Do the following steps each time you create a backup:
- Before you create the backup, stop the Watson Machine Learning Accelerator
operator:
oc scale --replicas 0 deploy wmla-operator-controller-manager -n ${PROJECT_CPD_INST_OPERATORS}
- Update the Watson Machine Learning Accelerator backup
and restore
configmap.
oc project ${PROJECT_CPD_INST_OPERANDS}
WMLA_BR_CM='cpd-wmla-br-cm'
WMLA_BR_CM_YAML=${WMLA_BR_CM}.yaml
WMLA_BR_CM_YAML_ORIG=${WMLA_BR_CM_YAML}.orig
oc get cm ${WMLA_BR_CM} -o yaml > ${WMLA_BR_CM_YAML_ORIG}
line=`grep -n 'enable-maint' ${WMLA_BR_CM_YAML_ORIG}|grep -v apiVersion|awk -F: '{print $1}'`
begin_line=$((line-4))
end_line=$((line+3))
sed -e "${begin_line},${end_line}d" ${WMLA_BR_CM_YAML_ORIG} -e '/resourceVersion/d' > ${WMLA_BR_CM_YAML}
oc apply -f ${WMLA_BR_CM_YAML}
WMLA_ADD_ON_BR_CM='cpd-wmla-add-on-br-cm'
WMLA_ADD_ON_BR_CM_YAML=${WMLA_ADD_ON_BR_CM}.yaml
WMLA_ADD_ON_BR_CM_YAML_ORIG=${WMLA_ADD_ON_BR_CM_YAML}.orig
oc get cm ${WMLA_ADD_ON_BR_CM} -o yaml > ${WMLA_ADD_ON_BR_CM_YAML_ORIG}
line=`grep -n 'enable-maint' ${WMLA_ADD_ON_BR_CM_YAML_ORIG}|grep -v apiVersion|awk -F: '{print $1}'`
begin_line=$((line-7))
end_line=$((line+13))
sed -e "${begin_line},${end_line}d" ${WMLA_ADD_ON_BR_CM_YAML_ORIG} -e '/resourceVersion/d' > ${WMLA_ADD_ON_BR_CM_YAML}
oc replace -f ${WMLA_ADD_ON_BR_CM_YAML}
- After the backup is created, start the Watson Machine Learning Accelerator
operator.
oc scale --replicas 1 deploy wmla-operator-controller-manager -n ${PROJECT_CPD_INST_OPERATORS}
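To confirm the operator state before and after scaling, a sketch that prints the desired and ready replica counts of the deployment:
oc get deploy wmla-operator-controller-manager -n ${PROJECT_CPD_INST_OPERATORS} -o jsonpath='{.spec.replicas}/{.status.readyReplicas}{"\n"}'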
Common core services custom resource is
in InProgress state after an offline restore to a different
cluster
Applies to: 4.8.0-4.8.5
Fixed in: 4.8.6
- Diagnosing the problem
-
- Get the status of installed components by running the following
command.
cpd-cli manage get-cr-status \
--cpd_instance_ns=${PROJECT_CPD_INST_OPERANDS}
InProgress.
- Get the status of installed components by running the following
command.
- Cause of the problem
- The Common core services component failed to reconcile on the restored cluster, because the dsx-requisite-pre-install-job-<xxxx> pod job is failing.
- Resolving the problem
- To resolve the problem, follow the instructions that are described in the technote Failed dsx-requisite-pre-install-job during offline restore.
Offline restore to a different cluster fails due to management-ingress-ibmcloud-cluster-info ConfigMap not found in PodVolumeRestore
Applies to: 4.8.5 and later
- Diagnosing the problem
- After an offline backup is created, but before doing a restore, check if the
management-ingress-ibmcloud-cluster-info ConfigMap was backed
up by running the following
commands:
cpd-cli oadp backup status --details <backup_name1> | grep management-ingress-ibmcloud-cluster-info
cpd-cli oadp backup status --details <backup_name2> | grep management-ingress-ibmcloud-cluster-info
During or after the restore, pods that mount the missing ConfigMap show errors. For example:
oc describe po c-db2oltp-wkc-db2u-0 -n ${PROJECT_CPD_INST_OPERANDS}
Example output:
Events: Type Reason Age From Message ---- ------ ---- ---- ------- Warning FailedMount 41m (x512 over 17h) kubelet MountVolume.SetUp failed for volume "management-ingress-ibmcloud-cluster-info" : configmap "management-ingress-ibmcloud-cluster-info" not found Warning FailedMount 62s (x518 over 17h) kubelet Unable to attach or mount volumes: unmounted volumes=[management-ingress-ibmcloud-cluster-info], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition
- When a related ibmcloud-cluster-info ConfigMap gets excluded as
part of backup hooks, the
management-ingress-ibmcloud-cluster-info ConfigMap copies the
exclude labeling and unintentionally gets excluded from the backup.
- Workaround
- If IAM is enabled, apply the following patch.
-
Log in to Red Hat OpenShift Container Platform as a cluster administrator.
${OC_LOGIN}
Remember: OC_LOGIN is an alias for the oc login command.
- Check if IAM is
enabled:
oc get zenservices lite-cr -o jsonpath='{.spec.iamIntegration}{"\n"}' -n ${PROJECT_CPD_INST_OPERANDS} - If IAM is enabled, apply the following patch to ensure that the
management-ingress-ibmcloud-cluster-info ConfigMap is not
excluded from the
backup:
oc apply -n ${PROJECT_CPD_INST_OPERANDS} -f - << EOF apiVersion: v1 kind: ConfigMap metadata: name: cpdbr-management-ingress-exclude-fix-br labels: cpdfwk.aux-kind: br cpdfwk.component: cpdbr-patch cpdfwk.module: cpdbr-management-ingress-exclude-fix cpdfwk.name: cpdbr-management-ingress-exclude-fix-br-cm cpdfwk.managed-by: ibm-cpd-sre cpdfwk.vendor: ibm cpdfwk.version: 1.0.0 data: aux-meta: | name: cpdbr-management-ingress-exclude-fix-br description: | This configmap defines offline backup prehooks to prevent cases where Bedrock's management-ingress-ibmcloud-cluster-info configmap gets unexpectedly excluded when ibmcloud-cluster-info is excluded during cs-postgres configmap prehooks. This is a temporary workaround until a complete fix is implemented. version: 1.0.0 component: cpdbr-patch aux-kind: br priority-order: 99999 # This should happen at the end of backup prehooks backup-meta: | pre-hooks: exec-rules: # Remove lingering velero exclude label from offline prehooks - resource-kind: configmap name: management-ingress-ibmcloud-cluster-info actions: - builtins: name: cpdbr.cpd.ibm.com/label-resources params: action: remove key: velero.io/exclude-from-backup value: "true" timeout: 360s # Remove lingering ignore-on-nd-backup exclude label from online checkpoint operation - resource-kind: configmap name: management-ingress-ibmcloud-cluster-info actions: - builtins: name: cpdbr.cpd.ibm.com/label-resources params: action: remove key: icpdsupport/ignore-on-nd-backup value: "true" timeout: 360s post-hooks: exec-rules: - resource-kind: # do nothing for posthooks --- apiVersion: v1 kind: ConfigMap metadata: name: cpdbr-management-ingress-exclude-fix-ckpt labels: cpdfwk.aux-kind: checkpoint cpdfwk.component: cpdbr-patch cpdfwk.module: cpdbr-management-ingress-exclude-fix cpdfwk.name: cpdbr-management-ingress-exclude-fix-ckpt-cm cpdfwk.managed-by: ibm-cpd-sre cpdfwk.vendor: ibm cpdfwk.version: 1.0.0 data: aux-meta: | name: cpdbr-management-ingress-exclude-fix-ckpt description: | This configmap defines online backup prehooks to prevent cases where Bedrock's management-ingress-ibmcloud-cluster-info configmap gets unexpectedly excluded when ibmcloud-cluster-info is excluded during cs-postgres configmap checkpoint operation. This is a temporary workaround until a complete fix is implemented. version: 1.0.0 component: cpdbr-patch aux-kind: ckpt priority-order: 99999 # This should happen at the end of backup prehooks backup-meta: | pre-hooks: exec-rules: # Remove lingering velero exclude label from offline prehooks - resource-kind: configmap name: management-ingress-ibmcloud-cluster-info actions: - builtins: name: cpdbr.cpd.ibm.com/label-resources params: action: remove key: velero.io/exclude-from-backup value: "true" timeout: 360s # Remove lingering ignore-on-nd-backup exclude label from online checkpoint operation - resource-kind: configmap name: management-ingress-ibmcloud-cluster-info actions: - builtins: name: cpdbr.cpd.ibm.com/label-resources params: action: remove key: icpdsupport/ignore-on-nd-backup value: "true" timeout: 360s post-hooks: exec-rules: - resource-kind: # do nothing for posthooks checkpoint-meta: | exec-hooks: exec-rules: - resource-kind: # do nothing for checkpoint EOF
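After you apply the patch and the next backup prehooks run, you can check that the ConfigMap no longer carries either exclude label; a sketch:
oc get cm management-ingress-ibmcloud-cluster-info -n ${PROJECT_CPD_INST_OPERANDS} -o jsonpath='{.metadata.labels}{"\n"}'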
Volume backup and restore commands do not work when Identity Management Service is enabled
Applies to: 4.8.0, 4.8.1
Fixed in: 4.8.2
- Diagnosing the problem
-
This problem occurs when Cloud Pak for Data is integrated with the Identity Management Service.
- During the backup volume quiesce operation, PVCs like mongodbdir-icp-mongodb-0 are not properly excluded.
- When creating the restore, icp-mongodb-* pods fail with a
CreateContainerConfigError. The error log has an entry such as in the following example:
Failed global initialization: InvalidSSLConfiguration Can not set up PEM key file. Normal Pulled 23s (x6 over 58m) kubelet Container image "icr.io/cpopen/cpfs/ibm- mongodb@sha256:1980150bc49d215f7ff764c9437e5c15efbb67e5af6f0184fd98a8b837cf9a02" already present on machine Warning Failed 23s (x5 over 107s) kubelet Error: failed to prepare subPath for volumeMount "mongodbdir" of container "icp-mongodb"
- Workaround
- Before you run the cpd-cli backup-restore quiesce, cpd-cli backup-restore unquiesce, cpd-cli backup-restore volume-backup, or cpd-cli backup-restore volume-restore commands, check the zen-cs-aux-qu-cm configmap:
- Run the following commands:
oc get cm zen-cs-aux-qu-cm -n ${PROJECT_CPD_INST_OPERANDS} -o jsonpath='{.data.quiesce-meta}{"\n"}'
oc get cm zen-cs-aux-qu-cm -n ${PROJECT_CPD_INST_OPERANDS} -o jsonpath='{.data.unquiesce-meta}{"\n"}'
- If the output of either command contains the cpdbr.cpd.ibm.com/label-resources builtin name, replace it with cpdbr.cpd.ibm.com/annotate-resources by running the following command:
oc edit cm zen-cs-aux-qu-cm -n ${PROJECT_CPD_INST_OPERANDS}
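A sketch that checks both sections for the problematic builtin in one pass before you open the configmap in an editor:
oc get cm zen-cs-aux-qu-cm -n ${PROJECT_CPD_INST_OPERANDS} -o yaml | grep -n "cpdbr.cpd.ibm.com/label-resources"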
Flight service issues
- The Flight service
do_action command might show misleading error messages in Db2 table drop or create operations
- Passing the value "None" as the schema argument in the "do_put" PyArrow library method stops the kernel
- Can't run generated code from deprecated functionality to load data for an Informix connection
- Can't load large data sets if the notebook environment doesn't have enough memory
The Flight service
do_action command might show misleading error messages in Db2 table drop or create
operations
Applies to: 4.8.0
Fixed in: 4.8.1
- Diagnosing the problem
-
If you run the
do_action command from Jupyter Notebooks with Python, you might see error messages that drop or create statements in a Db2 table were not successful. For example:
../src/arrow/status.cc:137: DoAction result was not fully consumed: Cancelled: Flight cancelled call, with message: Cancelled. Detail: Cancelled
- Workaround
-
The behavior of
do_action has changed. You must iterate through the results of do_action to ensure that all results are collected. For example:
results = list(flight_client.do_action(action))
return [r.body.to_pybytes().decode('utf-8') for r in results]
Service issues
The following issues are specific to services.