Known issues and limitations for IBM Cloud Pak for Data

Important: IBM Cloud Pak® for Data Version 4.8 will reach end of support (EOS) on 31 July, 2025. For more information, see the Discontinuance of service announcement for IBM Cloud Pak for Data Version 4.X.

Upgrade to IBM Software Hub Version 5.1 before IBM Cloud Pak for Data Version 4.8 reaches end of support. For more information, see Upgrading from IBM Cloud Pak for Data Version 4.8 to IBM Software Hub Version 5.1.

The following issues apply to IBM Cloud Pak for Data.

Each issue includes information about the releases that it applies to. If the issue was fixed in a refresh, that information is also included.

Customer-reported issues

Issues that are found after the release are posted on the IBM Support site.

General issues

Namespace deleted when the --namespace parameter is used with the health network-performance command

Applies to: 4.8.4 and 4.8.5

Fixed in: 4.8.5 (v13.1.5r1) and later

To use the --namespace parameter, you must download and reinstall the latest version of cpd-cli (v13.1.5r1). For more information, see Installing the Cloud Pak for Data command-line interface (cpd-cli).
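
For example, after you reinstall cpd-cli 13.1.5r1 or later, you can scope the check to a specific project. The following invocation is a sketch; adjust the project to the namespace that you want to test:
cpd-cli health network-performance --namespace=${PROJECT_CPD_INST_OPERANDS}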

Important: In Version 4.8.4 (cpd-cli 13.1.4), do not use the --namespace parameter. Any namespace that you enter will be deleted.

After rebooting the cluster, some services in Cloud Pak for Data on OpenShift Data Foundation aren't functional

Applies to: 4.8.3 and later

Diagnosing the problem
After rebooting the cluster, some Cloud Pak for Data custom resources remain in the InProgress state.

For more information about this problem, see Missing NodeStageVolume RPC call blocks new pods from going into Running state in the Red Hat® OpenShift Data Foundation 4.1.4 release notes.

Workaround
Do the following steps:
  1. Find the nodes that have pods that are in an Error state:
    oc get po -n ${PROJECT_CPD_INST_OPERANDS} -o wide | grep -v -P "Completed|(\d+)\/\1"
  2. Mark each node as unschedulable.
    oc adm cordon <node_name>
  3. Delete the affected pods:
    oc get pod   | grep -Ev "Comp|0/0|1/1|2/2|3/3|4/4|5/5|6/6|7/7"|awk '{print $1}' |xargs oc delete po --force=true --grace-period=0
  4. Mark each node as schedulable:
    oc adm uncordon <node_name>

You cannot activate notification forwarding from the Events tab

Applies to: 4.8.0

Fixed in: 4.8.1

When you configure notification forwarding, the Activate button on the Events tab does not work.
  • If you are forwarding email notifications, use the Activate button on the Email recipients tab.
  • If you are forwarding notifications to an external application, use the Activate button on the External service tab.

The services catalog includes two IBM Knowledge Catalog tiles

Applies to: 4.8.0 and later

When you go to the Services catalog page in the web client, you see two tiles for the IBM Knowledge Catalog service.

Resolving the problem
To remove the extra tile from the services catalog, restart the zen-watcher pod:
oc get pods -n ${PROJECT_CPD_INST_OPERANDS} | grep zen-watcher | awk '{print $1}' | xargs oc delete pods -n ${PROJECT_CPD_INST_OPERANDS}

Some text in services catalog is outdated after upgrade

Applies to: 4.8.0 and 4.8.1

Fixed in: 4.8.2

When you go to the Services catalog page in the web client, you see outdated information after you upgrade Cloud Pak for Data. For example, Watson Assistant was renamed to watsonx Assistant in Cloud Pak for Data Version 4.8. However, the services catalog still displays Watson Assistant.

Resolving the problem
To update the text in the services catalog:
  1. Restart the zen-core-api pods:
    oc rollout restart deployment zen-core-api -n=${PROJECT_CPD_INST_OPERANDS}
  2. Clear your browser cache and re-authenticate to the web client.

The Identity and user access card on the home page displays an error

Applies to: 4.8.0 and 4.8.1

Fixed in: 4.8.2

The Identity and user access card does not show the number of logged-in users or total users. Instead, the card shows the following error:
The data cannot be displayed.
The service might be restarting. Wait a few minutes and refresh the page. 
If the problem persists, contact your administrator.

However, refreshing the page does not resolve the problem.

Resolving the problem
To resolve the error, clear your web browser cache and cookies.

You are occasionally redirected to the wrong login page when Cloud Pak for Data is integrated with the Identity Management Service

Applies to: 4.8.0 and 4.8.1

Fixed in: 4.8.2

If your session expires and you click the Log in button on the Logout page, you are sometimes redirected to a URL that starts with https://cp-console rather than https://cpd. If you log in to the https://cp-console URL, you are directed to the Identity providers page, which you might not have access to.

Resolving the issue
Use the login URL provided by your administrator, or edit the URL to replace cp-console with cpd and try to log in again.

Intermittent login issues when Cloud Pak for Data is integrated with the Identity Management Service

Applies to: 4.8.0

When Cloud Pak for Data is integrated with the Identity Management Service, users sometimes encounter an error when they log in to the web client. Users might see one of the following errors when they log in:
  • Error 504 - Gateway Timeout
  • Internal Server Error

Some users might be directed to the Identity providers page rather than the Cloud Pak for Data home page.

Diagnosing the problem
If users experience one or more of the issues described in the preceding text, check the platform-identity-provider pods to determine whether the pods have been restarted multiple times:
oc get pods -n ${PROJECT_CPD_INST_OPERANDS} | grep platform-identity-provider

If the output indicates multiple restarts, proceed to Resolving the problem.

Resolving the problem
  1. Restart the icp-mongodb-0 pod:
    oc delete pods icp-mongodb-0 -n ${PROJECT_CPD_INST_OPERANDS}
  2. Restart the icp-mongodb-1 pod:
    oc delete pods icp-mongodb-1 -n ${PROJECT_CPD_INST_OPERANDS}
  3. Restart the icp-mongodb-2 pod:
    oc delete pods icp-mongodb-2 -n ${PROJECT_CPD_INST_OPERANDS}
  4. Restart the platform-auth-service pod:
    oc get pods -n ${PROJECT_CPD_INST_OPERANDS} | grep platform-auth-service | awk '{print $1}' | xargs oc delete pods -n ${PROJECT_CPD_INST_OPERANDS}
  5. Restart the platform-identity-management pod:
    oc get pods -n ${PROJECT_CPD_INST_OPERANDS} | grep platform-identity-management | awk '{print $1}' | xargs oc delete pods -n ${PROJECT_CPD_INST_OPERANDS}

Warning alerts might appear on the home page after installation

Applies to: 4.8.0 and later

Diagnosing the problem

After installation, you might see warning alerts on the home page. However, the events that generated the alerts have cleared. The alerts will continue to display on the Alerts card for up to 3 days unless you delete the pods that generated the alerts.

The alerts are visible to users with one of the following permissions:
  • Administer platform
  • Manage platform health
  • View platform health
  1. Log in to the web client as a user with the appropriate permissions to view alerts.
  2. On the home page, click View all on the Alerts card.
  3. On the Alerts and events page, confirm that the alerts were generated by one of the following services:
    • Common core services
    • IBM Knowledge Catalog

This issue can occur because wkc-db2u-init pods or jdbc-driver-sync-job pods are in an Error state.

Resolving the problem
An instance administrator or cluster administrator must resolve the problem.
  1. Log in to Red Hat OpenShift Container Platform as a user with sufficient permissions to complete the task.
    ${OC_LOGIN}
    Remember: OC_LOGIN is an alias for the oc login command.
  2. Check the status of the wkc-db2u-init pods and the jdbc-driver-sync-job pods.
    oc get pods --sort-by=.status.startTime -n ${PROJECT_CPD_INST_OPERANDS} | grep -E 'wkc-db2u-init|jdbc-driver'
  3. Delete any pods that are in the Error state.

    Replace <pod-name> with the name of the pod in the error state.

    oc delete pod <pod-name> -n ${PROJECT_CPD_INST_OPERANDS}

Tethered projects are not retrieved

Applies to: 4.8.0

Occasionally, tethered projects are not returned by the /v2/namespaces API call. For example, tethered projects might not be displayed in the web client when you:
  • Try to create a service instance in a tethered project
  • Create a storage volume in a tethered project
Resolving the problem
Patch the ZenService custom resource to trigger a reconcile loop:
oc patch ZenService lite-cr \
--namespace=${PROJECT_CPD_INST_OPERANDS} \
--type=merge \
--patch='{"spec": {"patchProductConfigmap": "true"}}'

RSI patches are not applied to pods as expected

Applies to: 4.8.0 and 4.8.1

Fixed in: 4.8.2

In some situations, RSI patches are not applied to a subset of the specified pods. This typically occurs when you use the --select_all_pods option when you create an RSI patch.

Diagnosing the problem
Check the status of the pod owners, such as StatefulSets, to determine whether the owners are stuck in the patch in progress state:
  1. Check for any Deployments in the patch in progress state:
    oc get deployment \
    -n=${PROJECT_CPD_INST_OPERANDS} \
    -o=jsonpath='{.items[?(@.metadata.annotations.resourcespecinjector\.ibm\.com/injection_status=="patch in progress")].metadata.name}'

    The command returns a list of any Deployments in this state.

  2. Check for any StatefulSets in the patch in progress state:
    oc get statefulset \
    -n=${PROJECT_CPD_INST_OPERANDS} \
    -o=jsonpath='{.items[?(@.metadata.annotations.resourcespecinjector\.ibm\.com/injection_status=="patch in progress")].metadata.name}'

    The command returns a list of any StatefulSets in this state.

  3. Check for any ReplicaSets in the patch in progress state:
    oc get replicaset \
    -n=${PROJECT_CPD_INST_OPERANDS} \
    -o=jsonpath='{.items[?(@.metadata.annotations.resourcespecinjector\.ibm\.com/injection_status=="patch in progress")].metadata.name}'

    The command returns a list of any ReplicaSets in this state.

  4. Check for any ReplicaControllers in the patch in progress state:
    oc get replicacontroller \
    -n=${PROJECT_CPD_INST_OPERANDS} \
    -o=jsonpath='{.items[?(@.metadata.annotations.resourcespecinjector\.ibm\.com/injection_status=="patch in progress")].metadata.name}'

    The command returns a list of any ReplicaControllers in this state.

  5. Check for any Jobs in the patch in progress state:
    oc get jobs \
    -n=${PROJECT_CPD_INST_OPERANDS} \
    -o=jsonpath='{.items[?(@.metadata.annotations.resourcespecinjector\.ibm\.com/injection_status=="patch in progress")].metadata.name}'

    The command returns a list of any Jobs in this state.

  6. Check for any CronJobs in the patch in progress state:
    oc get cronjob \
    -n=${PROJECT_CPD_INST_OPERANDS} \
    -o=jsonpath='{.items[?(@.metadata.annotations.resourcespecinjector\.ibm\.com/injection_status=="patch in progress")].metadata.name}'

    The command returns a list of any CronJobs in this state.

Resolving the problem
For each resource returned by the preceding commands, update the annotation on the resource:
oc annotate <resource-type> <resource-name> \
-n=${PROJECT_CPD_INST_OPERANDS} \
resourcespecinjector.ibm.com/injection_status-

Replace <resource-type> with the type of the resource and <resource-name> with the name of the resource.
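
For example, if the preceding commands returned a Deployment named zen-core-api (a hypothetical example), you would run:
oc annotate deployment zen-core-api \
-n=${PROJECT_CPD_INST_OPERANDS} \
resourcespecinjector.ibm.com/injection_status-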

Modifying existing RSI patches causes undefined label selector errors

Applies to: 4.8.0, 4.8.1, and 4.8.2

Fixed in: 4.8.3

When you upgrade from IBM Cloud Pak for Data Version 4.6 or Version 4.7 to Version 4.8, you get an error when you modify existing RSI patches.

For example, if you want to make an active patch inactive, you encounter the following error when you run the cpd-cli manage create-rsi-patch command:

fatal: [localhost]: FAILED! => {"msg": "The task includes an option with an undefined variable. 
The error was: 'dict object' has no attribute 'exclude_labels'...
Resolving the issue
To resolve the problem:
  1. Run the following command to get the complete name of the patch:
    oc get zenextension --namespace=${PROJECT_CPD_INST_OPERANDS}

    RSI patches are prefixed with rsi-.

  2. Set the RSI_PATCH_NAME environment variable to the name of the patch that you want to update:
    export RSI_PATCH_NAME=<patch-name>
  3. Edit the patch:
    oc edit zenextension ${RSI_PATCH_NAME} \
    --namespace=${PROJECT_CPD_INST_OPERANDS}
  4. Add the following entries to the spec.extensions[*].pod_selector section:
    "exclude_labels":{},
    "select_all_pods":false
    For example:
    apiVersion: zen.cpd.ibm.com/v1
    kind: ZenExtension
    metadata:
      name: rsi-zen-core-rsi-pod-env-var-json
    spec:
      extensions: |
        [
          {
            "extension_point_id": "rsi_pod_env_var",
            "extension_name": "rsi-zen-core-rsi-pod-env-var-json",
            "display_name": "rsi-zen-core-rsi-pod-env-var-json",
            "description": "description",
            "meta": {},
            "details": {
              "patch_spec": [
                {"op":"add","path":"/spec/containers/0/env/-","value":{"name":"zen-core-env-json-one","value":"zen-core-env-json-one"}},
                {"op":"add","path":"/spec/containers/0/env/-","value":{"name":"zen-core-env-json-two","value":"zen-core-env-json-two"}}
              ], 
             "pod_selector":{
                "selector":{
                   "app.kubernetes.io/component":"zen-core",
                   "component":"zen-core"
                },
                "exclude_labels":{}, 
                "select_all_pods":false
             },
              "state": "active",
              "type": "json"
            }
          }
        ]
  5. Save your changes to the patch. For example, if you are using vi, press Esc and then enter :wq.

Cannot get the contents of the zen-metastore-edb pod as a YAML file

Applies to: 4.8.0, 4.8.1, 4.8.2, and 4.8.3

Fixed in: 4.8.4

When you run the following command against any of the zen-metastore-edb pods, such as zen-metastore-edb-0, you get an error:

oc get pods zen-metastore-edb-0 -o yaml

The command returns the following error:

error: error converting JSON to YAML: yaml: control characters are not allowed

If you want to see the contents of the pod, you must use JSON. For example:

oc get pods zen-metastore-edb-0 -o json

Installation and upgrade issues

  • General installation and upgrade issues
  • General upgrade issues
  • Red Hat OpenShift Container Platform upgrade issues
  • After upgrading to Red Hat OpenShift Container Platform Version 4.14, the status of the Redis custom resource fluctuates

The apply-cr command fails when installing services with a dependency on Db2U

Applies to: 4.8.0 and later

Diagnosing the problem
You can specify the privileges that Db2U runs with. If you configured Db2U to run with limited privileges, the apply-cr command will fail if:
  • You set DB2U_RUN_WITH_LIMITED_PRIVS: "true" in the db2u-product-cm ConfigMap.
  • The kernel parameter settings were not modified to allow Db2U to run with limited privileges.
This issue can manifest in several ways.
The wkc-db2u-init job fails
If you are installing IBM Knowledge Catalog, the apply-cr command fails with the message:
"WKC DB2U post install job failed ('wkc-db2u-init' job)
When you get the status of the wkc-db2u-init pods, they are in the Error state.
oc get pods -n ${PROJECT_CPD_INST_OPERANDS} | grep wkc-db2u-init
The Db2uCluster resource never becomes ready
For other services, you might notice that the Db2uCluster resource never becomes Ready.
oc get Db2uCluster -n ${PROJECT_CPD_INST_OPERANDS}
You cannot provision service instances
For services such as Db2® and Db2 Warehouse, the apply-cr command completes successfully, but the service instances never finish provisioning and the *-db2u-0 pods are stuck in Pending or SysctlForbidden.
oc get pods -n ${PROJECT_CPD_INST_OPERANDS} | grep db2u-0
Resolving the problem
This problem occurs when you set DB2U_RUN_WITH_LIMITED_PRIVS: "true" in the db2u-product-cm ConfigMap but the kernel parameter settings were not modified to allow Db2U to run with limited privileges.
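To confirm the current setting, you can read the value directly from the ConfigMap. This is a quick check that assumes the ConfigMap is in the operands project for the instance:
oc get configmap db2u-product-cm \
-n ${PROJECT_CPD_INST_OPERANDS} \
-o jsonpath='{.data.DB2U_RUN_WITH_LIMITED_PRIVS}{"\n"}'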
Review Changing kernel parameter settings to confirm that you can change the kernel parameter settings.
  • If you can change the kernel parameter settings, ensure that the worker nodes are restarted after you change the settings.

    In some cases, when you run the cpd-cli manage apply-db2-kubelet command, the worker nodes are not restarted.

  • If you cannot or do not change the kernel parameter settings, update the db2u-product-cm ConfigMap to set DB2U_RUN_WITH_LIMITED_PRIVS: "false". For more information, see Specifying the privileges that Db2U runs with.

Some Cloud Pak for Data software cannot be installed when the OpenShift Virtualization Operator is installed on the cluster

Applies to: 4.8.1, 4.8.2, 4.8.3, and 4.8.4

Fixed in: 4.8.5

If you are running Red Hat OpenShift Container Platform Version 4.14 or later, do not install the OpenShift Virtualization Operator on the cluster if you plan to install either of the following services:
  • OpenPages®
  • watsonx.governance Risk and Compliance Foundation

Installing Red Hat OpenShift Container Platform for IBM Cloud Pak for Data previously stated that you could not install the OpenShift Virtualization Operator on the cluster. However, that restriction applies only to environments with OpenPages or watsonx.governance Risk and Compliance Foundation.

After you upgrade from Cloud Pak for Data 4.7.4, generating a bearer token fails in an IAM-enabled cluster

Applies to: Upgrades from Version 4.7.4 to 4.8.0 and later

If you upgrade from Cloud Pak for Data version 4.7.4 to Cloud Pak for Data 4.8.0 or later, the IAM access token API (/idprovider/v1/auth/identitytoken) fails. You cannot log in to the user interface when the identitytoken API fails.

Diagnosing the problem
The following error is displayed in the log when you generate an IAM access token:
Failed to get access token, Liberty error: {"error_description":"CWWKS1406E: The token request had an invalid client credential. The request URI was \/oidc\/endpoint\/OP\/token.","error":"invalid_client"}"
Resolving the problem
  1. Log in to Red Hat OpenShift Container Platform as a cluster administrator.
    ${OC_LOGIN}
    Remember: OC_LOGIN is an alias for the oc login command.
  2. Run the following command to restart the oidc-client-registration job:
    oc delete job oidc-client-registration -n ${PROJECT_CPD_INST_OPERANDS}
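    The job is expected to be re-created after it is deleted. To confirm that the new job completes, you can check its status (a sketch that assumes the job runs in the operands project for the instance):
    oc get job oidc-client-registration -n ${PROJECT_CPD_INST_OPERANDS}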

After an upgrade from Cloud Pak for Data 4.7.3, FoundationDB can indicate a Failed status

Applies to: Upgrades from Version 4.7.3 to 4.8.2 and later

After upgrading Cloud Pak for Data from Version 4.7.3 to 4.8.2 or later, the status of the FoundationDB cluster can indicate that it has failed (fdbStatus: Failed). The Failed status can occur even if FoundationDB is available and working correctly. This issue occurs when the FoundationDB resources do not get properly cleaned up by the upgrade.

This issue affects deployments of the following services.
  • IBM Knowledge Catalog
  • IBM Match 360
Diagnosing the problem

To determine if this problem has occurred:

Required role: To complete this task, you must be a cluster administrator.

  1. Check the FoundationDB cluster status.
    oc get fdbcluster -o yaml | grep fdbStatus

    If the returned status is Failed, proceed to the next step to determine if the pods are available.

  2. Check to see if the FoundationDB pods are up and running.
    oc get pods -n ${PROJECT_CPD_INST_OPERANDS} | grep foundation

    The returned list of FoundationDB pods should all have a status of Running. If they are not running, then the problem is something other than this issue.

Resolving the problem

To resolve this issue, restart the FoundationDB controller (ibm-fdb-controller):

Required role: To complete this task, you must be a cluster administrator.

  1. Identify your FoundationDB controllers.
    oc get pods  -n ${PROJECT_CPD_INST_OPERATORS} | grep ibm-fdb-controller
    This command returns the names of two FoundationDB controllers in the following formats:
    • ibm-fdb-controller-manager-<INSTANCE-ID>
    • apple-fdb-controller-manager-<INSTANCE-ID>
  2. Delete the ibm-fdb-controller-manager to refresh it.
    oc delete pod ibm-fdb-controller-manager-<INSTANCE-ID> -n ${PROJECT_CPD_INST_OPERATORS}
  3. Wait for the controller to restart. This can take approximately one minute.
  4. Check the status of your FoundationDB cluster:
    oc -n ${PROJECT_CPD_INST_OPERANDS} get FdbCluster -o yaml
    Confirm that the fdbStatus is now Completed.

The apply-cr command fails when upgrading services with a dependency on Db2U

Applies to: 4.8.0, 4.8.1, 4.8.2, and 4.8.3

Fixed in: 4.8.4

When you try to upgrade services with a dependency on Db2U, the upgrade fails when trying to remove temporary files.

The following services might be affected:
  • IBM Knowledge Catalog
  • OpenPages with an internal database
Diagnosing the problem
To determine if the upgrade failed when trying to remove temporary files:
  1. Set the DB2U_POD environment variable to the name of the db2u-0 pod.
    • For IBM Knowledge Catalog, run:
      export DB2U_POD=$(oc get pods -n=${PROJECT_CPD_INST_OPERANDS} | grep wkc-db2u-0 | awk '{print $1}')
    • For OpenPages:
      1. Set the INSTANCE_ID environment variable:
        export INSTANCE_ID=$(oc get openpagesinstance -n=${PROJECT_CPD_INST_OPERANDS} -o jsonpath='{.items[0].spec.zenServiceInstanceId}{"\n"}')
      2. Set the DB2U_POD environment variable:
        export DB2U_POD=$(oc get pods -n=${PROJECT_CPD_INST_OPERANDS} | grep ${INSTANCE_ID}-db2u-0 | awk '{print $1}')
  2. Open a remote shell in the pod to examine the upgrade_update.log file:
    oc exec -it ${DB2U_POD} \
    -n=${PROJECT_CPD_INST_OPERANDS} \
    -- bash -c "cat /mnt/blumeta0/support/upgrade_update.log"
  3. Look for the following strings in the error log:
    • INFO: Applying Db2 license
    • in _delete_temporary_db2inst1_files
    • OSError: [Errno 16] Device or resource busy: '.nfs

    If the log contains the preceding strings, continue to Resolving the problem.

Resolving the problem
  1. Remove the db2.tmp_db2inst1.* files from the db2u-0 pod:
    oc exec -it ${DB2U_POD} \
    -n=${PROJECT_CPD_INST_OPERANDS} \
    -- bash -c "rm -rf db2.tmp_db2inst1.*"
  2. Restart the database update:
    oc exec -it ${DB2U_POD} \
    -n=${PROJECT_CPD_INST_OPERANDS} \
    -- bash -c "db2_update_upgrade --databases"
  3. Patch the formation:
    • For IBM Knowledge Catalog, run:
      oc patch formation db2oltp-wkc \
      -n=${PROJECT_CPD_INST_OPERANDS} \
      --type=json \
      --patch='[{"op": "remove", "path": "/spec/resource_configs/m/env/upgrade~1podname"}]'
    • For OpenPages, run:
      oc patch formation db2oltp-${INSTANCE_ID} \
      -n=${PROJECT_CPD_INST_OPERANDS} \
      --type=json \
      --patch='[{"op": "remove", "path": "/spec/resource_configs/m/env/upgrade~1podname"}]'

The apply-cr command fails when upgrading services with a dependency on the common core services

Applies to:
  • Upgrades from Version 4.6.4 to 4.8.0 and later
  • Upgrades from Version 4.6.5 to 4.8.0 and later
  • Upgrades from Version 4.6.6 to 4.8.0 and later

When you upgrade a service with a dependency on the common core services, the apply-cr command fails because the ccs-cr custom resource is stuck in InProgress. The problem can occur when the connection pods try to use an outdated certificate for inter-pod communication, which prevents the connection migration jobs from completing.

Diagnosing the problem
  1. Get the status of the ccs-cr custom resource:
    oc get ccs -n=${PROJECT_CPD_INST_OPERANDS}
  2. If the status of the custom resource is InProgress, check the status of the common core services pods in the operands project for the instance (${PROJECT_CPD_INST_OPERANDS}):
    oc get pods -n=${PROJECT_CPD_INST_OPERANDS} | grep ccs

    If the status of the ccs-post-install-migration-job pods is Error, contact IBM Software Support to restart the connection migration job.

The Projects page does not load after upgrade

Applies to: Upgrades from Version 4.8.1 to Version 4.8.2

After you upgrade from Version 4.8.1 to Version 4.8.2, the Projects page does not load.

Resolving the problem
To enable the Projects page to load:
  1. Set the PORTAL_MAIN_POD to the name of the portal-main pod:
    export PORTAL_MAIN_POD=$(oc get pods -n=${PROJECT_CPD_INST_OPERANDS} | grep portal-main | awk '{print $1}')
  2. Delete the portal-main pod:
    oc delete pods ${PORTAL_MAIN_POD} -n=${PROJECT_CPD_INST_OPERANDS}
  3. Wait for the pod to be Ready and try to load the Projects page again. To check the status of the pod, run:
    oc get pods -n=${PROJECT_CPD_INST_OPERANDS} | grep portal-main

Unable to filter instances by type after you upgrade to Cloud Pak for Data 4.8.1

Applies to: 4.8.0 and 4.8.1

Fixed in: 4.8.2

Diagnosing the problem

After you upgrade to Cloud Pak for Data 4.8.1, you cannot filter by type on the Instances page. You see a blank page in the user interface.

Workaround
If you click Filter by: and then click Type, the page becomes blank. Refresh the page to return to the Instances page.
To find a specific instance and its instance type, scroll through the Instances page.

After you upgrade Red Hat OpenShift Container Platform, storage volume pods are in the CrashLoopBackOff state

Applies to: 4.8.6 and later

After you upgrade Red Hat OpenShift Container Platform, storage volume pods (volumes-* pods) are in the CrashLoopBackOff state. This problem occurs when the /auth/jwtpublic URL path in ibm-nginx returns a TLS error.

Diagnosing the problem
To confirm that the problem is caused by a TLS error returned by the /auth/jwtpublic URL path:
  1. Get the name of a volumes-* pod that is in the CrashLoopBackOff state:
    oc get pods \
    --namespace=${PROJECT_CPD_INST_OPERANDS} \
    | grep volumes
  2. Get the pod logs for a volumes-* pod that is in the CrashLoopBackOff state:
    oc logs <pod-name> \
    --namespace=${PROJECT_CPD_INST_OPERANDS}

    Replace <pod-name> with the name of a pod in the CrashLoopBackOff state.

  3. Confirm that the logs include the following error:
    time="<timestamp>" level=error msg="ProcessPublicKeyFromNginx - Failed receiving response from server" func=zen-core-api/source/apis/commonutils.LogErr file="/go/src/zen-core-api/source/apis/commonutils/auth_util.go:342" err="Get \"https://ibm-nginx-svc.<namespace>:443/auth/jwtpublic\": remote error: tls: illegal parameter"
    panic: Get "https://ibm-nginx-svc.<namespace>:443/auth/jwtpublic": remote error: tls: illegal parameter
Resolving the problem
To resolve the problem:
  1. Get the name of the volumes-* deployments:
    oc get deployments \
    --namespace=${PROJECT_CPD_INST_OPERANDS} \
    | grep -E 'volumes|READY'
  2. For each deployment where the READY column shows 0/1, run the following command to patch the deployment:
    oc patch deployment <volume-deployment-name> \
    --namespace=${PROJECT_CPD_INST_OPERANDS} \
    --type=json \
    --patch='[
           {"op": "add", "path": "/spec/template/spec/containers/0/volumeMounts/-", "value": {"mountPath": "/user-home/_global_/config/jwt", "name": "ibm-zen-secret-jwt"}},
           {"op": "add", "path": "/spec/template/spec/volumes/-", "value": {"name": "ibm-zen-secret-jwt", "secret": {"defaultMode": 420, "optional": true, "secretName": "ibm-zen-secret-jwt"}}}
         ]'

    Replace <volume-deployment-name> with the name of a deployment.

After you upgrade a Red Hat OpenShift Container Platform cluster, the FoundationDB resource can become unavailable

Applies to: 4.8.0 and later

After you upgrade your cluster to a new version of Red Hat OpenShift Container Platform, the IBM FoundationDB pods can become unavailable. When this issue occurs, services that rely on FoundationDB, such as IBM Knowledge Catalog and IBM Match 360, cannot function correctly.

This issue affects deployments of the following services.
  • IBM Knowledge Catalog
  • IBM Match 360 with Watson
Diagnosing the problem
To identify the cause of this issue, check the FoundationDB status and details.
  1. Check the FoundationDB status.
    oc get fdbcluster -o yaml | grep fdbStatus

    If this command is successful, the returned status is Complete. If the status is InProgress or Failed, proceed to the workaround steps.

  2. If the status is Complete but FoundationDB is still unavailable, log in to one of the FDB pods and check the status details to ensure that the database is available and all coordinators are reachable.
    oc rsh sample-cluster-log-1 /bin/fdbcli

    To check the detailed status of the FDB pod, run fdbcli to enter the FoundationDB command-line interface, then run the following command at the fdb> prompt.

    status details
    • If you get a message that is similar to Could not communicate with a quorum of coordination servers, run the coordinators command with the IP addresses specified in the error message as input.
      oc get pod -o wide | grep storage
      > coordinators IP-ADDRESS-1:4500:tls IP-ADDRESS-2:4500:tls IP-ADDRESS-3:4500:tls 

      If this step does not resolve the problem, proceed to the workaround steps.

    • If you get a different message, such as Recruiting new transaction servers, proceed to the workaround steps.
Resolving the problem
To resolve this issue, restart the FoundationDB pods.

Required role: To complete this task, you must be a cluster administrator.

  1. Restart the FoundationDB cluster pods.
    oc get fdbcluster 
    oc get po |grep ${CLUSTER_NAME} |grep -v backup|awk '{print $1}' |xargs oc delete po

    Replace ${CLUSTER_NAME} in the command with the name of your fdbcluster instance.

  2. Restart the FoundationDB operator pods.
    oc get po |grep fdb-controller |awk '{print $1}' |xargs oc delete po
  3. After the pods finish restarting, check to ensure that FoundationDB is available.
    1. Check the FoundationDB status.
      oc get fdbcluster -o yaml | grep fdbStatus

      The returned status must be Complete.

    2. Check to ensure that the database is available.
      oc rsh sample-cluster-log-1 /bin/fdbcli

      If the database is still not available, complete the following steps.

      1. Log in to the ibm-fdb-controller pod.
      2. Run the fix-coordinator script.
        kubectl fdb fix-coordinator-ips -c ${CLUSTER_NAME} -n ${PROJECT_CPD_INST_OPERATORS}

        Replace ${CLUSTER_NAME} in the command with the name of your fdbcluster instance.

        Note: For more information about the fix-coordinator script, see the workaround steps from the resolved IBM Match 360 known issue item The FoundationDB cluster can become unavailable.

Inaccurate status message from the command line after you upgrade

Applies to: Version 4.8.0 and later of the following services.
  • watsonx Assistant
  • Watson Discovery
  • Watson Knowledge Studio
  • Watson Speech services
Diagnosing the problem
If you run the cpd-cli service-instance upgrade command from the Cloud Pak for Data command-line interface, and then use the service-instance list command to check the status of each service, the provision status for the service is listed as UPGRADE_FAILED.
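For reference, the status in question is the one that is shown by the list command, for example (a sketch; the profile name depends on your cpd-cli configuration):
cpd-cli service-instance list --profile=${CPD_PROFILE_NAME}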
Cause of the problem
When you upgrade the service, only the cpd-cli manage apply-cr command is supported. You cannot use the cpd-cli service-instance upgrade command to upgrade the service. After you upgrade the service with the apply-cr method, the change in version and status is not recognized by the service-instance command. However, the correct version is displayed in the Cloud Pak for Data web client.
Resolving the problem
No action is required. If you use the cpd-cli manage apply-cr method to upgrade the service as documented, the upgrade is successful and you can ignore the version and status information that is generated by the cpd-cli service-instance list command.

Secrets are not visible in connections after upgrade

Applies to: Version 4.8.0, 4.8.1, 4.8.2, 4.8.3, and 4.8.4

If you use secrets when you create connections, the secrets are not visible in the connection details after you upgrade Cloud Pak for Data. This issue occurs when your vault uses a private CA signed certificate.

Resolving the problem
To see the secrets in the user interface:
  1. Change to the project where Cloud Pak for Data is installed:
    oc project ${PROJECT_CPD_INST_OPERANDS}
  2. Set the following environment variables:
    oc set env deployment/zen-core-api VAULT_BRIDGE_TLS_RENEGOTIATE=true
    oc set env deployment/zen-core-api VAULT_BRIDGE_TOLERATE_SELF_SIGNED=true
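    Setting these variables triggers a new rollout of the zen-core-api deployment. To confirm that the variables were applied, you can list the deployment's environment (a quick check):
    oc set env deployment/zen-core-api --list | grep VAULT_BRIDGE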

After upgrading to Red Hat OpenShift Container Platform Version 4.14, the status of the Redis custom resource fluctuates

Applies to: 4.8.0, 4.8.1, 4.8.2, and 4.8.3

Fixed in: 4.8.4

The following services have a dependency on the ibm_redis_cp component:
  • Cognos® Dashboards
  • Db2 Data Management Console
  • IBM Match 360
  • Watson Query
  • watsonx Assistant

After you upgrade from Red Hat OpenShift Container Platform Version 4.12 to Version 4.14, the status of the Redis custom resource fluctuates between InProgress and Completed.

However, if the Redis operator (ibm-redis-cp-operator) is running without errors, you can ignore the fluctuation in the custom resource status.

Diagnosing the problem
  1. To determine whether the custom resource status is fluctuating, run the following command:
    oc get pods -n=${PROJECT_CPD_INST_OPERANDS} | grep redis

    Wait several minutes and run the command again to see if the status changes.

  2. To determine whether the operator pods are healthy, run the following command:
    oc get pods -n=${PROJECT_CPD_INST_OPERATORS} | grep ibm-redis-cp-operator
Resolving the problem

If the operator pods are in the Running or Completed state, you can ignore the fluctuation in the custom resource status.

If the operator pods are not in the Running or Completed state, follow the guidance in Troubleshooting the apply-olm command during installation or upgrade to determine the root cause of the problem.

Security issues

Security scans return an Inadequate Account Lockout Mechanism message

Applies to: 4.8.0 and later

Diagnosing the problem
If you run a security scan against Cloud Pak for Data, the scan returns the following message.
Inadequate Account Lockout Mechanism
Resolving the problem
This is by design. It is strongly recommended that you use an enterprise-grade password management solution, such as SAML SSO or an LDAP provider, for password management.

The Kubernetes version information is disclosed

Applies to: 4.8.0 and later

Diagnosing the problem
If you run an Aqua Security scan against your cluster, the scan reports that the Kubernetes version information is disclosed.
Resolving the problem
This is expected based on the following solution document from Red Hat OpenShift Customer Portal: Hide kubernetes /version API endpoint in OpenShift Container Platform 4.

Backup and restore issues

  • Issues that apply to several or all backup and restore methods
  • Online backup and restore with OADP backup and restore utility issues
  • Online backup and restore with IBM Storage Fusion issues
  • Online backup and restore with NetApp Astra Control Center issues
  • Data replication with Portworx issues
  • Offline backup and restore with the OADP backup and restore utility issues
  • Offline backup and restore with the volume backup and restore utility issues

Backup fails for the platform with error in EDB Postgres cluster

Applies to: 4.8.7 and later

Diagnosing the problem
For example, in IBM Storage Fusion, the backup fails at the Hook: br-service hooks/pre-backup stage in the backup sequence.

In the cpdbr-oadp.log file, you see the following error:

time=<timestamp> level=info msg=cmd stderr: Error: cannot take a cold backup of the primary instance 
or a target primary instance if the k8s.enterprisedb.io/snapshotAllowColdBackupOnPrimary annotation is not set to enabled
Cause of the problem
Labels and annotations in the EDB Postgres cluster resources were not updated after a switchover of the EDB Postgres cluster's primary instance and replica.
Resolving the problem

Use either the automatic or manual workaround.

Automatic workaround

After you apply the YAML files, the workaround runs automatically as a prehook every time you take a backup, so you do not encounter the issue again. This is especially useful if you have set up automatic backups.

  1. Check that the ${VERSION} environment variable is set in cpd_vars.sh to the correct Cloud Pak for Data version number.
  2. Download the edb-patch-resources-legacy.yaml file.
  3. Run the following command:
    oc apply -n ${PROJECT_CPD_INST_OPERANDS} -f edb-patch-resources-legacy.yaml
  4. Complete the steps that apply to your backup and restore method.
      Online backup and restore
      1. Download the edb-patch-aux-ckpt-cm-legacy.yaml file.
      2. Run the following command:
        sed "s/VERSION_PLACEHOLDER/${VERSION}/g" edb-patch-aux-ckpt-cm-legacy.yaml | oc apply -n ${PROJECT_CPD_INST_OPERANDS} -f -
        
      3. Retry the backup.

      Offline backup and restore
      1. Download the edb-patch-aux-br-cm-legacy.yaml file.
      2. Run the following command:
        sed "s/VERSION_PLACEHOLDER/${VERSION}/g" edb-patch-aux-br-cm-legacy.yaml | oc apply -n ${PROJECT_CPD_INST_OPERANDS} -f -
        
      3. Retry the backup.

Manual workaround

Complete the following steps to manually run the workaround:

Note: If another switchover of the EDB Postgres cluster's primary instance and replica happens after you apply the manual workaround, you must complete the workaround again before you take a backup.
  1. Download the edb-patch.sh file.
  2. Run the following command:
    sh edb-patch.sh ${PROJECT_CPD_INST_OPERANDS}
  3. Retry the backup.

OADP backup is missing EDB Postgres PVCs

Applies to: 4.8.0 and later

Diagnosing the problem
After an online or offline backup is taken with the OADP utility, EDB Postgres PVCs are missing in the PVC backup list.
Cause of the problem
EDB Postgres replica PVCs might be excluded from a backup when an EDB Postgres cluster switches primary instances.
Resolving the problem
Before you create a backup, run the following command:
oc label pvc,pods -l k8s.enterprisedb.io/cluster,velero.io/exclude-from-backup=true velero.io/exclude-from-backup- -n ${PROJECT_CPD_INST_OPERANDS}
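
To verify that the exclusion label was removed, confirm that the same selector no longer matches any resources (the command should return no results):
oc get pvc,pods -l k8s.enterprisedb.io/cluster,velero.io/exclude-from-backup=true -n ${PROJECT_CPD_INST_OPERANDS}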

OADP backup precheck command fails with EDB Postgres cluster is out of sync error

Applies to: 4.8.7

Fixed in: 4.8.8

Diagnosing the problem
When you run the cpd-cli oadp backup precheck command, you see the following error:
Error: precheck failed with error: edb in-sync check failed: edb cluster is out of sync
Cause of the problem
The backup precheck command does not account for lag when log sequence numbers (LSNs) are received but not yet replayed.
Resolving the problem
Try one of the following options:

Informix custom resource in InProgress state after restore

Applies to: 4.8.5

Fixed in: 4.8.8

Diagnosing the problem
After the restore, get the status of the Informix custom resource by running the following command:
oc get Informix informix-<xxxxxxxxxxxxxxxx>  -n ${PROJECT_CPD_INST_OPERANDS} -o yaml

In the output, the informixStatus shows InProgress.

Cause of the problem
The Informix custom resource stays in the InProgress state because the IBM Global Security Kit (GSKit) doesn't complete the SSL setup before the readiness and liveness checks complete.
Resolving the problem
Change the behavior of GSKit. From the OpenShift console, add the environment variable ICC_SHIFT=3 to the informix-<xxxxxxxxxxxxxxxx>-cm-0 and to the informix-<xxxxxxxxxxxxxxxx>-server StatefulSets, and restart those pods.
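If you prefer the command line, the following commands are a sketch of an equivalent approach; they assume the default StatefulSet names that are shown above and the operands project for the instance. Updating the pod template restarts the pods:
oc set env statefulset/informix-<xxxxxxxxxxxxxxxx>-cm-0 ICC_SHIFT=3 -n ${PROJECT_CPD_INST_OPERANDS}
oc set env statefulset/informix-<xxxxxxxxxxxxxxxx>-server ICC_SHIFT=3 -n ${PROJECT_CPD_INST_OPERANDS}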

After restore, watsonx Assistant is stuck on the 17/19 deployed state or custom resource is stuck in InProgress state

Applies to: 4.8.5

Fixed in: 4.8.6

Diagnosing the problem
This problem can occur after you restore an online backup to the same cluster or to a different cluster. Run the following command:
oc get wa -n ${PROJECT_CPD_INST_OPERANDS}
Example output:
NAME   VERSION   READY   READYREASON    UPDATING   UPDATINGREASON   DEPLOYED   VERIFIED   QUIESCE   AGE
wa     5.0.1     False   Initializing   True       VerifyWait       17/19      15/19                4h39m
Resolving the problem
Delete the wa-integrations-operand-secret and wa-integrations-datastore-connection-strings secrets by running the following commands:
oc delete secret wa-integrations-operand-secret -n ${PROJECT_CPD_INST_OPERANDS}
oc delete secret wa-integrations-datastore-connection-strings -n ${PROJECT_CPD_INST_OPERANDS}

After the secrets are deleted, the watsonx Assistant operator recreates them with the correct values, and the watsonx Assistant custom resource and pods are now in a good state.

Missing Identity Management Service data after a restore

Applies to: 4.8.0-4.8.4

Fixed in: 4.8.5

Diagnosing the problem
After you upgrade to Cloud Pak for Data 4.8.0-4.8.4 and integrate with the Identity Management Service, Identity Management Service data is missing after you back up and restore your Cloud Pak for Data deployment. No results are returned for one or more of the following commands:
oc get pvc -n ${PROJECT_CPD_INST_OPERANDS} ibm-zen-cs-mongo-backup
oc get cm -n ${PROJECT_CPD_INST_OPERANDS}  zen-cs-aux-br-cm
oc get cm -n ${PROJECT_CPD_INST_OPERANDS}  zen-cs-aux-ckpt-cm
Cause of the problem
The Cloud Pak for Data deployment is missing some resources, including several backup and restore ConfigMaps.
Resolving the problem
  1. Copy and run the Identity Management Service backup and restore scripts.
  2. Redo the backup and restore.

Unable to log in to Cloud Pak for Data with OpenShift cluster credentials after successfully restoring to a different cluster

Applies to: 4.8.5 and later

Diagnosing the problem
When Cloud Pak for Data is integrated with the Identity Management Service, you cannot log in with OpenShift cluster credentials. You might be able to log in with LDAP or as cpdadmin.
Resolving the problem
To work around the problem, run the following commands:
oc delete cm platform-auth-idp -n ${PROJECT_CPD_INST_OPERANDS}
oc delete cm oauth-client-map -n ${PROJECT_CPD_INST_OPERANDS}
oc delete cm ibm-iam-bindinfo-oauth-client-map -n ${PROJECT_CPD_INST_OPERANDS}
oc delete cm ibm-iam-bindinfo-platform-auth-idp -n ${PROJECT_CPD_INST_OPERANDS}
oc delete pods -n ${PROJECT_CPD_INST_OPERATORS} -l app.kubernetes.io/instance=ibm-common-service-operator
oc delete pods -n ${PROJECT_CPD_INST_OPERANDS} -l app.kubernetes.io/instance=platform-auth-service
oc delete pods -n ${PROJECT_CPD_INST_OPERANDS} -l app.kubernetes.io/instance=platform-identity-management
oc delete pods -n ${PROJECT_CPD_INST_OPERANDS} -l app.kubernetes.io/instance=platform-identity-provider

Offline and online restore with OADP backup and restore utility fails with running post-restore hooks error

Applies to: 4.8.5-4.8.7

Fixed in: 4.8.8

Diagnosing the problem
The final cpd-cli oadp restore create command that runs post-restore hooks fails. In the CPD-CLI*.log file, you see the following error message:
error running post-restore hooks: pod/<podname> is not supported for scaling, please 
define the proper postrestore hooks because of restored replicaset(s)
Cause
A change in OADP 1.3.1 removed the logic that prevented restoring deployment-managed ReplicaSets, which should not be restored.
Workaround
Do the following steps:
  1. Find ReplicaSets that were restored from the backup:
    oc get rs -l velero.io/backup-name -n ${PROJECT_CPD_INST_OPERANDS}
  2. Delete the ReplicaSets that are associated with the pods that have is not supported for scaling errors from the logs:
    oc delete rs -n ${PROJECT_CPD_INST_OPERANDS} <replicaset-name>
  3. Manually run post-restore hooks.
    • If you are doing an offline backup and restore, run the following command:
      cpd-cli oadp restore posthooks --include-namespaces=${PROJECT_CPD_INST_OPERANDS},${PROJECT_CPD_INST_OPERATORS}
    • If you are doing an online backup and restore, run the following command:
      cpd-cli oadp restore posthooks --include-namespaces=${PROJECT_CPD_INST_OPERANDS},${PROJECT_CPD_INST_OPERATORS} --hook-kind=checkpoint

Restoring online backup of Watson Query fails

Applies to: 4.8.5 and later

Diagnosing the problem
In the CPD-CLI*.log file, you see the following error message:
time=<timestamp> level=info msg=   zen/configmap/cpd-dv-aux-ckpt-cm: component=dv, 
op=<mode=post-restore,type=config-hook,method=rule>, status=error func=cpdbr-oadp/pkg/quiesce.logPlanResult 
file=/a/workspace/oadp-upload/pkg/quiesce/planexecutor.go:1137
Workaround
Do the following steps:
  1. Disable the Watson Query liveness probe in the Watson Query head pod:
    oc exec -it c-db2u-dv-db2u-0 -- bash -c "su - db2inst1 -c 'mkdir /mnt/PV/versioned/marker_file'"
    oc exec -it c-db2u-dv-db2u-0 -- bash -c "su - db2inst1 -c 'touch /mnt/PV/versioned/marker_file/.bar'"
  2. Disable the BigSQL restart daemon in the Watson Query head pod:
    oc exec -it c-db2u-dv-db2u-0 -- bash -c "db2uctl marker create BIGSQL_DAEMON_PAUSE"
  3. Stop BigSQL in the Watson Query head pod:
    oc rsh c-db2u-dv-db2u-0 bash
    su - db2inst1
    bigsql stop
  4. Re-enable the Hive user in the users.json file in the Watson Query head pod.
    1. Edit the users.json file:
      vi /mnt/blumeta0/db2_config/users.json
    2. Locate "locked":true and change it to "locked":false.
  5. On the hurricane pod, rename the hive-site.xml config file so that it can be reconfigured by restarting the pod:
    oc rsh $(oc get pod -o name -l formation_id=db2u-dv,role=hurricane)
    su - db2inst1
    mv /mnt/blumeta0/home/db2inst1/ibm/bigsql/hive-site.xml /mnt/blumeta0/home/db2inst1/ibm/bigsql/hive-site.xml.bak
  6. Exit the pod, and then run the following command to delete it.
    Note: Since the configuration file was renamed, it is regenerated with the correct settings.
    oc delete pod -l formation_id=db2u-dv,role=hurricane
  7. After the hurricane pod is started again, run the following commands on the hurricane pod to disable the SSL so that it can be reconfigured in a later step:
    oc rsh $(oc get pod -o name -l formation_id=db2u-dv,role=hurricane)
    su - db2inst1
    bigsql-config -disableMetastoreSSL
    bigsql-config -disableSchedulerSSL
  8. Clean up leftover files from the hurricane pod:
    rm -rf /mnt/blumeta0/bigsql/security/*
    rm -rfv /mnt/blumeta0/bigsql/security/.* 2>/dev/null
  9. Run the following commands to disable SSL from the head pod:
    oc rsh c-db2u-dv-db2u-0 bash
    su - db2inst1
    rah "bigsql-config -disableMetastoreSSL"
    rah "bigsql-config -disableSchedulerSSL"
  10. Clean up leftover files from the head and worker pods:
    rm -rf /mnt/blumeta0/bigsql/security/*
    rm -rfv /mnt/blumeta0/bigsql/security/.* 2>/dev/null
  11. Run the following commands to re-enable SSL on the head pod, and restart Db2 Big SQL so that configuration changes can take effect:
    bigsql-config -enableMetastoreSSL
    bigsql-config -enableSchedulerSSL
    bigsql stop; bigsql start
  12. Remove markers that were created in steps 1 and 2 in the Watson Query head pod:
    oc exec -it c-db2u-dv-db2u-0 -- bash -c "rm -rf /mnt/PV/versioned/marker_file/.bar"
    oc exec -it c-db2u-dv-db2u-0 -- bash -c "db2uctl marker delete BIGSQL_DAEMON_PAUSE"
  13. If you are doing the backup and restore with the OADP backup and restore utility, run the following command:
    cpd-cli oadp restore prehooks --hook-kind=checkpoint --include-namespaces=${PROJECT_CPD_INST_OPERANDS},${PROJECT_CPD_INST_OPERATORS} --log-level debug --verbose
  14. If you are doing the backup and restore with IBM Storage Fusion, NetApp Astra Control Center, or Portworx data replication, run the following commands:
    CPDBR_POD=$(oc get po -l component=cpdbr-tenant -n ${PROJECT_CPD_INST_OPERATORS} --no-headers | awk '{print $1}')
    oc exec -n ${PROJECT_CPD_INST_OPERATORS} ${CPDBR_POD} -it -- /bin/sh -c "./cpdbr-oadp restore posthooks --hook-kind=checkpoint --include-namespaces=${PROJECT_CPD_INST_OPERANDS},${PROJECT_CPD_INST_OPERATORS}"
    oc exec -n ${PROJECT_CPD_INST_OPERATORS} ${CPDBR_POD} -it -- /bin/sh -c "./cpdops/files/cpd-operators.sh restore-namespacescope --foundation-namespace ${PROJECT_CPD_INST_OPERATORS} --operators-namespace ${PROJECT_CPD_INST_OPERATORS}"

After a restore, OperandRequest timeout error in the ZenService custom resource

Applies to: 4.8.5 and later

Diagnosing the problem
Get the status of the ZenService YAML:
oc get zenservice lite-cr -n ${PROJECT_CPD_INST_OPERATORS} -o yaml

In the output, you see the following error:

...
zenMessage: '5.1.3/roles/0010-infra has failed with error: "OperandRequest" "zen-ca-operand-request":
      Timed out waiting on resource'
...
Check for failing operandrequests:
oc get operandrequests -A
For failing operandrequests, check their conditions for constraints not satisfiable messages:
oc describe -n ${PROJECT_CPD_INST_OPERATORS} <opreq-name>
Cause of the problem
Subscription wait operations timed out. The problematic subscriptions show an error similar to the following example:
'constraints not satisfiable: clusterserviceversion ibm-db2aaservice-cp4d-operator.v5.2.0
      exists and is not referenced by a subscription, @existing/cpd-operators//ibm-db2aaservice-cp4d-operator.v5.2.0
      and ibm-db2aaservice-cp4d-operator-catalog/cpd-operators/v5.2/ibm-db2aaservice-cp4d-operator.v5.2.0
      originate from package ibm-db2aaservice-cp4d-operator, subscription ibm-db2aaservice-cp4d-operator
      requires ibm-db2aaservice-cp4d-operator-catalog/cpd-operators/v5.2/ibm-db2aaservice-cp4d-operator.v5.2.0,
      subscription ibm-db2aaservice-cp4d-operator exists'

This problem is a known issue with Red Hat Operator Lifecycle Manager. For details, see Failed to install or upgrade operator with warning 'clusterserviceversion is not referenced by a subscription'.

Workaround
Do the following steps:
  1. Delete the problematic clusterserviceversions and subscriptions, and restart the Operand Deployment Lifecycle Manager (ODLM) pod.

    For details, follow the steps in the troubleshooting document cloud-native-postgresql operator is installed with the certified-operators catalogsource.

  2. Delete Cloud Pak for Data instance projects (namespaces).

    For details, see Preparing to restore Cloud Pak for Data.

  3. Retry the restore.

After a restore, the ZenService component remains in InProgress state

Applies to: 4.8.0, 4.8.1, 4.8.2

Fixed in: 4.8.3

Diagnosing the problem
In the backup and restore log file, you see the following message:
msg: Timeout creating internal-tls-certificate, make sure that certmanager has been installed and work properly
Workaround
Delete all restored certificate requests.
  1. Get the list of all certificate requests:
    oc get certificaterequest -n ${PROJECT_CPD_INST_OPERANDS}
  2. Delete each certificate request one-by-one:
    oc delete certificaterequest -n ${PROJECT_CPD_INST_OPERANDS} <cert-request-name>

Password for the cpadmin user changes when restoring Cloud Pak for Data

Applies to: 4.8.0-4.8.4

Fixed in: 4.8.5

Diagnosing the problem

This problem occurs when Cloud Pak for Data is integrated with the Identity Management Service. The problem applies to all backup and restore methods.

After a restore, the password for the cpadmin user is changed.

Workaround
No workaround is available. To retrieve or change the new password, see Changing the administrator password.

After restoring IBM Match 360 from backup, the associated Redis pods enter a CrashLoopBackOff state

Fixed in: 4.8.3

Applies to: 4.8.1, 4.8.2

Diagnosing the problem

This issue occurs after restoring IBM Match 360 from either an online or offline backup. The MDM Redis CP (mdm-redis-cp) pods fail to restore correctly and enter a CrashLoopBackOff state, which affects the functionality of the IBM Match 360 user interface services. This problem is caused by missing password fields in the relevant secrets.

Workaround
To resolve this issue, set the administrator password field for RedisCP-related secrets:
  1. Encode the user-defined password for Redis.
    echo -n '<USER-DEFINED-PASSWORD>' | base64
  2. Run the following commands to set the administrator password to the encoded password value that you just created:
    oc -n ${PROJECT_CPD_INST_OPERANDS} patch secret  mdm-redis-cp-<INSTANCE-ID>-admin-secret --type=merge -p '{"data":{"admin_password": "<ENCODED-PASSWORD>"}}'
    oc -n ${PROJECT_CPD_INST_OPERANDS} patch secret  mdm-redis-cp-<INSTANCE-ID>-em-ui --type=merge -p '{"data":{"auth": "<ENCODED-PASSWORD>"}}'
  3. Bring down the MDM RedisCP Server StatefulSet.
    oc patch sts mdm-redis-cp-<INSTANCE-ID>-server --type=merge -p '{"spec":{"replicas": 0}}'
    Wait for all MDM RedisCP pods to terminate.
  4. Bring up the MDM RedisCP Server StatefulSet.
    oc patch sts mdm-redis-cp-<INSTANCE-ID>-server --type=merge -p '{"spec":{"replicas": 3}}'
  5. Delete the MDM Redis HAProxy pod.
    oc delete pod -l redis-app=mdm-redis-cp-<INSTANCE-ID>-haproxy
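
After the pods restart, confirm that the mdm-redis-cp pods are no longer in the CrashLoopBackOff state:
oc get pods -n ${PROJECT_CPD_INST_OPERANDS} | grep mdm-redis-cp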

Online backup of Watson Discovery fails at checkpoint stage

Applies to: 4.8.3

Fixed in: 4.8.4

Diagnosing the problem
When you try to create an online backup, the backup process fails at the checkpoint hook stage. For example, if you are creating the backup with IBM Storage Fusion, the backup process fails at the Hook: br-service-hooks-checkpoint stage in the backup sequence. In the log file, you see an error message similar to the following example:
download failed: s3://common-zen-wd/mt/__built-in-tenant__/fileResource/701db916-fc83-57ab-0000-000000000010.zip 
to tmp/s3-backup/common-zen-wd/mt/__built-in-tenant__/fileResource/701db916-fc83-57ab-0000-000000000010.zip 
Connection was closed before we received a valid response from endpoint URL: "https://s3.openshift-storage.svc:443/common-zen-wd/mt/__built-in-tenant__/fileResource/701db916-fc83-57ab-0000-000000000010.zip".
Cause of the problem
Large resource files can become corrupted while they are downloaded to the backup. As a result, the wd-discovery-aux-ch-s3-backup job does not complete successfully.
Workaround
Delete the file that is shown in the error message and recreate it.
  1. Exec into the wd-discovery-support pod:
    oc exec -it deploy/wd-discovery-support -- bash
  2. Do the following steps within the pod.
    1. Delete the file:
      aws-wd s3 rm s3://common-zen-wd/mt/__built-in-tenant__/fileResource/<file_name>
    2. Confirm that the file is not listed when you run the following command:
      aws-wd s3 ls s3://common-zen-wd/mt/__built-in-tenant__/fileResource/
    3. Exit from the pod.
  3. Delete the wd-discovery-orchestrator-setup job:
    oc delete job/wd-discovery-orchestrator-setup
  4. Wait for the wd-discovery-orchestrator-setup job to run again and complete.

Confirm that the file was successfully recreated:

  1. Exec into the wd-discovery-support pod:
    oc exec -it deploy/wd-discovery-support -- bash
  2. Do the following steps within the pod.
    1. Copy the file to the tmp directory:
      aws-wd s3 cp s3://common-zen-wd/mt/__built-in-tenant__/fileResource/<file_name> /tmp
    2. Confirm that the file is copied:
      ls /tmp/<file_name>
    3. Exit from the pod.

You can now retake the backup, and the wd-discovery-aux-ch-s3-backup job will complete successfully.

Resource file causes online backup of Watson Discovery to fail at checkpoint stage

Applies to: 4.8.4, 4.8.5

Fixed in: 4.8.6

Diagnosing the problem
When you try to create an online backup, the backup process fails at the checkpoint hook stage. For example, if you are creating the backup with IBM Storage Fusion, the backup process fails at the Hook: br-service-hooks-checkpoint stage in the backup sequence. In the log file, you see an error message similar to the following example:
download failed: s3://common-zen-wd/user/1000750000/.sparkStaging/application_1710244391677_0005/training-job.jar 
to tmp/s3-backup/common-zen-wd/user/1000750000/.sparkStaging/application_1710244391677_0005/training-job.jar 
Connection was closed before we received a valid response from endpoint 
URL: "https://s3.openshift-storage.svc:443/common-zen-wd/user/1000750000/.sparkStaging/application_1710244391677_0005/training-job.jar".
Cause of the problem
Resource files can become corrupted, and as a result, the wd-discovery-aux-ch-s3-backup job does not complete successfully.
Workaround
Delete the file that is shown in the error message from s3 and re-run the backup.
  1. Exec into the wd-discovery-support pod:
    oc exec -it deploy/wd-discovery-support -- bash
  2. Do the following steps within the pod.
    1. Delete the file:
      aws-wd s3 rm <file_name>
      For example:
      aws-wd s3 rm s3://common-zen-wd/user/1000750000/.sparkStaging/application_1710244391677_0005/training-job.jar
    2. Confirm that the file was deleted:
      aws-wd s3 ls <file_name>
    3. Exit from the pod.

You can now retake the backup, and the wd-discovery-aux-ch-s3-backup job will complete successfully.

Watson Knowledge Catalog custom resource stuck in InProgress state after restore

Applies to: 4.8.9

Diagnosing the problem
When you try to restore, the Watson Knowledge Catalog custom resource is stuck in an InProgress state, and you see a message similar to the following example:
NAME     VERSION   RECONCILED   STATUS       AGE
wkc-cr   4.8.9                  InProgress   5h36m


The custom resource status includes a condition similar to the following example:

  lastTransitionTime: "2025-04-16T17:08:20Z"
  message: |-
    unknown playbook failure
    Failed to deploy WKC prereqs
    Failed at task: Wait until FdbCluster is completed
    The error was: Please consult the operator logs.
Cause of the problem
Corrupted or incomplete snapshot data is causing unrecoverable indexes in OpenSearch during the restore process.
Workaround
Complete the following steps to resolve the issue:
  1. Reduce GS replicas to 0:
    oc scale deployment wkc-search --replicas=0
  2. Change the readiness check configuration to accept red as a valid status:
    oc patch ccs ccs-cr --type merge --patch '{"spec": {"openshift_cluster_health_check_params": "wait_for_status=red&timeout=30s"}}'
  3. Put the OpenSearch cluster in quiesce mode:
    oc patch elasticsearchcluster elasticsearch-master --type merge --patch '{"spec": {"quiesce": true}}'
  4. Take the OpenSearch cluster out of quiesce mode:
    oc patch elasticsearchcluster elasticsearch-master --type merge --patch '{"spec": {"quiesce": false}}'
  5. Verify the presence of corrupted indexes, which are shards that are stuck and do not respond to recovery:
    oc exec elasticsea-0ac3-ib-6fb9-es-server-esnodes-0 -c elasticsearch -- curl --request GET --url http://localhost:19200/_cat/shards  --header 'content-type: application/json'
  6. Delete the corrupted indexes:
    oc exec elasticsea-0ac3-ib-6fb9-es-server-esnodes-0 -c elasticsearch -- curl --request DELETE --url http://localhost:19200/gs-system-index-wkc-v001,semantic,wkc --header 'content-type: application/json'
  7. Put the OpenSearch cluster in quiesce mode:
    oc patch elasticsearchcluster elasticsearch-master --type merge --patch '{"spec": {"quiesce": true}}'
  8. Change the readiness check configuration back to the default setting, which accepts only yellow as a valid status:
    oc patch ccs ccs-cr --type merge --patch '{"spec": {"openshift_cluster_health_check_params": "wait_for_status=yellow&timeout=30s"}}'
  9. Take the OpenSearch cluster out of quiesce mode:
    oc patch elasticsearchcluster elasticsearch-master --type merge --patch '{"spec": {"quiesce": false}}'
  10. Scale the wkc-search pod back to its original size:
    oc scale deployment wkc-search --replicas=1

You can now complete the restore successfully.

Online restore posthooks fail with checkpoint id error

Applies to: 4.8.3, 4.8.4

Fixed in: 4.8.5

Diagnosing the problem
Run the restore posthooks command:
cpd-cli oadp restore posthooks --tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} --hook-kind=checkpoint

In the output, you see a missing checkpoint id message like in the following example:

processing request...
oadp namespace: oadp-operator
cpd namespace: zen
runtime mode: 
resolved namespaces [ibm-common-services] ...
Error: error running post-restore hooks: cannot get checkpoint id in namespace zen, namespaces: [ibm-common-services], checkpointId: , err: info is nil
Cause of the problem
The restore posthooks command cannot resolve the Cloud Pak for Data project (namespace) from the --tenant-operator-namespace option.
Workaround
Run the restore posthooks command with the --include-namespaces option instead:
cpd-cli oadp restore posthooks --include-namespaces=${PROJECT_CPD_INST_OPERATORS},${PROJECT_CPD_INST_OPERANDS} --hook-kind=checkpoint

Online restore posthooks fail with zenobjstore/ibm-zen-configuration is not empty error

Applies to: 4.8.3

Fixed in: 4.8.4

Diagnosing the problem
Run the restore posthooks command:
cpd-cli oadp restore posthooks --tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} --hook-kind=checkpoint

In the output, you see an error message like in the following example:

zen/configmap/cpd-zen-aux-ckpt-cm: component=zen, op=<mode=post-restore,type=config-hook,method=job>, status=error
In the cpdbr-oadp.log or cpd-cli*.log file, you see the following error:
time=<timestamp> level=info msg=logs for pod=zen-ckpt-restore-job-fv7l5-g9rfm:

--------------------------------
container: zen-ckpt-restore
--------------------------------
...
mc: <ERROR> `zenobjstore/ibm-zen-configuration` is not empty. Retry this command with ‘--force’ flag if you want to remove `zenobjstore/ibm-zen-configuration` and all its contents 
<timestamp>: Deletion Unsuccessful... Exiting with error.
Cause of the problem
The object store that is used to back up the platform metadata must be reset before you run the restore posthooks command.
Workaround
Do the following steps:
  1. Set up the backup location for the platform metadata:
    export BACKUP_DIR=$HOME/rstemp/zen_backup 
    mkdir -p $BACKUP_DIR/workspace && \
    mkdir -p $BACKUP_DIR/secrets/jwks && \
    mkdir -p $BACKUP_DIR/secrets/jwt && \
    mkdir -p $BACKUP_DIR/secrets/jwt-private && \
    mkdir -p $BACKUP_DIR/secrets/ibmid-jwk && \
    mkdir -p $BACKUP_DIR/secrets/aes-key && \
    mkdir -p $BACKUP_DIR/secrets/admin-user && \
    mkdir -p $BACKUP_DIR/database && \
    mkdir -p $BACKUP_DIR/objstorage
  2. Read the object store connection and credentials of the object store that is used to back up the platform metadata:
    export IBM_ZEN_BUCKET_NAME=$(oc -n ${PROJECT_CPD_INST_OPERANDS} get cm ibm-zen-objectstore-cm -o=jsonpath='{.data.BUCKET_ZEN_CONFIGURATION}')
    OBJECTSTORE_ENDPOINT=$(oc -n ${PROJECT_CPD_INST_OPERANDS} get cm ibm-zen-objectstore-cm -o jsonpath="{.data.OBJECTSTORE_ENDPOINT}")
    oc -n ${PROJECT_CPD_INST_OPERANDS} extract secret/ibm-zen-objectstore-secret --to=$BACKUP_DIR/workspace --confirm
  3. Remove and recreate the object store bucket that is used to back up the platform metadata:
    oc -n ${PROJECT_CPD_INST_OPERANDS} exec -t zen-minio-0 -- bash -c "rm -rf /tmp/backup && mkdir -p /tmp/backup && export HOME=/tmp && /workdir/bin/mc alias set zenobjstore ${OBJECTSTORE_ENDPOINT} $(<${BACKUP_DIR}/workspace/accesskey) $(<${BACKUP_DIR}/workspace/secretkey) --config-dir=/tmp/.mc --insecure && /workdir/bin/mc ls zenobjstore/${IBM_ZEN_BUCKET_NAME} --insecure && /workdir/bin/mc rb zenobjstore/${IBM_ZEN_BUCKET_NAME} --force --dangerous --insecure && /workdir/bin/mc mb zenobjstore/${IBM_ZEN_BUCKET_NAME} --insecure"
  4. Rerun the restore posthooks command:
    cpd-cli oadp restore posthooks --include-namespaces=${PROJECT_CPD_INST_OPERATORS},${PROJECT_CPD_INST_OPERANDS} --hook-kind=checkpoint

Db2 Big SQL post-restore hook fails during online restore

Applies to: 4.8.8

Diagnosing the problem
When restoring an online backup with the OADP backup and restore utility, the CPD-CLI*.log file shows error messages like in the following example:
<timestamp>  INFO Attempting Write Resume/Restore..Running /db2u/scripts/write-restore.sh - retry
<timestamp>  ERROR Db2 status on c-bigsql-1736386855680378-db2u-1.c-bigsql-1736386855680378-db2u-internal:
<timestamp>  ERROR Failed to resume Db2 write in c-bigsql-<worker_pod_index>-db2u-1.c-bigsql-<worker_pod_index>-db2u-internal...Post-restore/resume hooks cannot continue
func=cpdbr-oadp/pkg/quiesce.executeCommand file=/a/workspace/oadp-upload/pkg/quiesce/ruleexecutor.go:802
time=<timestamp> level=info msg=exit executeCommand func=cpdbr-oadp/pkg/quiesce.executeCommand file=/a/workspace/oadp-upload/pkg/quiesce/ruleexecutor.go:813
Cause of the problem
The Db2 Big SQL database on a worker node remains in write-suspend mode.
Resolving the problem
Take the database out of write-suspend mode by doing the following steps:
  1. In the error message, note the Db2 Big SQL worker pod that failed, identified by ERROR Failed to resume Db2 write in.
  2. Log in to the worker pod:
    oc -n ${PROJECT_CPD_INST_OPERANDS} rsh $(oc -n ${PROJECT_CPD_INST_OPERANDS}  get pod | grep -i c-bigsql | grep -i db2u-<worker_pod_index> | cut -d' ' -f 1) bash
  3. Switch to the db2inst1 user:
    su - db2inst1
  4. Reactivate the database:
    db2 restart db ${DBNAME} write resume
    db2 activate db ${DBNAME}
  5. Verify that you can successfully connect to the Db2 Big SQL database on a worker node:
    db2 list active databases | grep ${DBNAME}
    db2 connect to ${DBNAME}
  6. Repeat steps 1-5 for any other worker pod that failed.
  7. Restore the namespacescope.
    1. Get the cpdbr-tenant-service pod ID:
      oc get po -A | grep "cpdbr-tenant-service"
    2. Log in to the cpdbr-tenant-service pod:
      oc rsh -n ${OADP_OPERATOR_NAMESPACE} <cpdbr-tenant-service pod id>
    3. Run the restore namespacescope script:
      /cpdbr-scripts/cpdbr/cpdbr-cpd-operators.sh restore-namespacescope --operators-namespace ${PROJECT_CPD_INST_OPERATORS} --foundation-namespace ${PROJECT_CPD_INST_OPERATORS}

Watson Speech services status is stuck in InProgress after restore

Applies to: 4.8.5 and later

Diagnosing the problem
After an online restore with the OADP backup and restore utility, the CPD-CLI*.log file shows speechStatus is in the InProgress state.
Cause of the problem
The speechStatus is in the InProgress state due to a race condition in the stt-async component. Pods that are associated with this component are stuck in 0/1 Running state. Run the following command to confirm this state:
oc get pods -l app.kubernetes.io/component=stt-async
Example output:
NAME                                   READY   STATUS    RESTARTS   AGE
speech-cr-stt-async-775d5b9d55-fpj8x   0/1     Running   0          60m

If one or more pods is in the 0/1 Running state for 20 minutes or more, this problem might occur.

Resolving the problem
For each pod in the 0/1 Running state, run the following command:
oc delete pod <stt-async-podname>
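
If several pods are affected, you can also delete every stt-async pod at once by using the component label shown above; the deleted pods are recreated automatically. A minimal sketch:

oc delete pod -n ${PROJECT_CPD_INST_OPERANDS} -l app.kubernetes.io/component=stt-async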

Online restore posthooks time out when restoring Watson Query or Db2 Big SQL

Applies to: 4.8.5 and later

Diagnosing the problem
  1. Run the cpd-cli oadp restore logs <restore_name> command, replacing <restore_name> with the restore name that you specified when you did the step to restore Kubernetes resources that generate pods (for example, restore-name4).
    The restore log file shows the following error messages for Db2 Big SQL and Watson Query respectively:
    zen/configmap/cpd-bigsql-aux-ckpt-cm: component=bigsql, op=<mode=post-restore,type=config-hook,method=rule>, status=error
    zen/configmap/cpd-dv-aux-ckpt-cm: component=dv, op=<mode=post-restore,type=config-hook,method=rule>, status=error
  2. In the head pod, open the /var/log/bigsql/cli/bigsql-db2ubar-hook.log file, and look for the following error messages.
    • In Watson Query, the head pod is c-db2u-dv-db2u-0.
    • In Db2 Big SQL, the head pod is c-bigsql-<xxxxxxxx>-db2u-0.

    For Watson Query:

    /db2u/.db2u_initialized
    *************************************************
    Restart and activate the database(s)...this may take a while
     
    SQL1032N  No start database manager command was issued.  SQLSTATE=57019
    c-db2u-dv-db2u-0.c-db2u-dv-db2u-internal.${PROJECT_CPD_INST_OPERANDS}.svc.cluster.local: db2 restart db BIGSQL ... completed rc=4
     
    SQL1032N  No start database manager command was issued.  SQLSTATE=57019
    c-db2u-dv-db2u-1.c-db2u-dv-db2u-internal: db2 restart db BIGSQL ... completed rc=4
    	
    activate db BIGSQL
    SQL1032N  No start database manager command was issued.  SQLSTATE=57019

    For Db2 Big SQL:

    /db2u/.db2u_initialized
    *************************************************
    Restart and activate the database(s)...this may take a while
     
    SQL1032N  No start database manager command was issued.  SQLSTATE=57019
    c-bigsql-<xxxxxxxxxxxxxxxx>-db2u-0.c-bigsql-<xxxxxxxxxxxxxxxx>-db2u-internal.${PROJECT_CPD_INST_OPERANDS}.svc.cluster.local: db2 restart db BIGSQL ... completed rc=4
     
    SQL1032N  No start database manager command was issued.  SQLSTATE=57019
    c-bigsql-<xxxxxxxxxxxxxxxx>-db2u-1.c-bigsql-<xxxxxxxxxxxxxxxx>-db2u-internal: db2 restart db BIGSQL ... completed rc=4
    	
    activate db BIGSQL
    SQL1032N  No start database manager command was issued.  SQLSTATE=57019
  3. After about 5 minutes, check these log files to see if the error messages are still there.
  4. If the error messages are still there, check whether the bigsql-db2ubar-hook.sh script is still running.
    • For Watson Query:
      oc rsh c-db2u-dv-db2u-0 bash
      su - db2inst1
      ps -ef |grep -v grep |grep bigsql-db2ubar-hook.sh 
    • For Db2 Big SQL:
      oc rsh c-bigsql-<xxxxxxxx>-db2u-0 bash 
      su - db2inst1
      ps -ef |grep -v grep |grep bigsql-db2ubar-hook.sh

    The error message SQL1032N No start database manager command was issued. SQLSTATE=57019 indicates that either Db2 was not running, or the db2 connect to bigsql command failed when the process that resumes write access to all databases started.

Resolving the problem
Check that Db2 is running and can connect to bigsql by doing the following steps:
  1. Check that Db2 is running by running the following commands.
    • For Watson Query:
      oc rsh c-db2u-dv-db2u-0 bash
      su - db2inst1
      bigsql status
    • For Db2 Big SQL:
      oc rsh c-bigsql-<xxxxxxxx>-db2u-0 bash
      su - db2inst1
      bigsql status
  2. Confirm that Db2 is running in the head pod and all worker pods (see the sketch after this list).
  3. If Db2 is running, check if db2 connect is working.
    • For Watson Query:
      oc rsh c-db2u-dv-db2u-0 bash
      su - db2inst1
      db2 connect to bigsql
    • For Db2 Big SQL:
      oc rsh c-bigsql-<xxxxxxxx>-db2u-0 bash
      su - db2inst1
      db2 connect to bigsql
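
To confirm that Db2 is running in the head pod and all worker pods (step 2), the following minimal sketch checks for the Db2 engine process (db2sysc) in each db2u pod. It assumes the pod naming patterns shown above (c-db2u-dv-db2u-<n> for Watson Query and c-bigsql-<xxxxxxxx>-db2u-<n> for Db2 Big SQL) and that the default container in each pod runs Db2:

for pod in $(oc get pods -n ${PROJECT_CPD_INST_OPERANDS} -o name | grep -E 'c-(db2u-dv|bigsql-.*)-db2u-[0-9]+$'); do
  echo "=== ${pod} ==="
  oc exec -n ${PROJECT_CPD_INST_OPERANDS} "${pod}" -- bash -c 'ps -ef | grep -v grep | grep db2sysc || echo "db2sysc is not running"'
done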

If Db2 is running and can connect to bigsql, re-run the restore posthooks by doing the following steps:

  1. Log in to the bigsql head pod, and if the bigsql-db2ubar-hook.sh script is still running, terminate the process.
    • For Watson Query:
      oc rsh c-db2u-dv-db2u-0 bash
      su - db2inst1
      pid=`ps -ef | grep bigsql-db2ubar-hook.sh | grep -v grep | awk '{print $2}'`
      kill -9  ${pid}
      exit
    • For Db2 Big SQL:
      oc rsh c-bigsql-<xxxxxxxx>-db2u-0 bash
      su - db2inst1
      pid=`ps -ef | grep bigsql-db2ubar-hook.sh | grep -v grep | awk '{print $2}'`
      kill -9  ${pid}
      exit
  2. Re-run the restore posthooks command:
    cpd-cli oadp restore posthooks \
    --tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} \
    --hook-kind=checkpoint \
    --log-level=debug \
    --verbose
  3. Confirm that the posthooks completed successfully for Db2 Big SQL and Watson Query.

Backup fails at Volume group: cpd-volumes stage

Applies to: IBM Storage Fusion 2.7.2

Fixed in: IBM Storage Fusion 2.7.2 hotfix

Diagnosing the problem
In the backup sequence in IBM Storage Fusion 2.7.2, the backup fails at the Volume group: cpd-volumes stage.

The transaction manager log shows several error messages, such as the following examples:

<timestamp>[TM_0] - Error: Processing of volume cc-home-pvc failed.\n", "<timestamp>[VOL_12] -
Snapshot exception (410)\\nReason: Expired: too old resource version: 2575013 (2575014)
Workaround
Install the IBM Storage Fusion 2.7.2 hotfix. For details, see IBM Storage Fusion and IBM Storage Fusion HCI hotfix.

Backup of Cloud Pak for Data operators project fails at data transfer stage

Applies to: IBM Storage Fusion 2.7.2

Fixed in: IBM Storage Fusion 2.7.2 hotfix

Diagnosing the problem
In IBM Storage Fusion 2.7.2, the backup fails at the Data transfer stage, with the following error:
Failed transferring data
There was an error when processing the job in the Transaction Manager service
Cause
The length of a Persistent Volume Claim (PVC) name is more than 59 characters.
Workaround
Install the IBM Storage Fusion 2.7.2 hotfix. For details, see IBM Storage Fusion and IBM Storage Fusion HCI hotfix.

With the hotfix, PVC names can be up to 249 characters long.
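
To check whether your deployment is affected before you apply the hotfix, the following minimal sketch lists PVCs in the instance project whose names are longer than 59 characters:

oc get pvc -n ${PROJECT_CPD_INST_OPERANDS} -o name | awk -F/ 'length($2) > 59 {print $2}'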

Backup fails at Hook: br-service-hooks/post-backup stage

Applies to: 4.8.2, 4.8.3

Fixed in: 4.8.4

Diagnosing the problem
In the backup sequence in IBM Storage Fusion, the backup fails at the Hook: br-service-hooks/post-backup stage.

Logs from the transaction manager have an error message similar to the following example:

.... [apphooks:executeHook Line 144][ERROR] - Timeout reached before command completed. 
However, the operation continues because of the on-error annotation value.
Workaround
Increase the backup hooks timeout value in the backup recipe ibmcpd-tenant from the default 600 seconds to 1800 seconds by running the following command:
oc patch frcpe ibmcpd-tenant -n ${PROJECT_CPD_INST_OPERATORS} --type='json' -p='[{"op": "replace", "path": "/spec/hooks/0/ops/7/timeout", "value": 1800}]'

ZenService is in Failed state after a restore

Applies to: IBM Storage Fusion 2.7.1

Fixed in: IBM Storage Fusion 2.7.2

Diagnosing the problem
After you restore an online backup with IBM Storage Fusion 2.7.1, check the status of the ZenService component:
oc describe zenservice -n ${PROJECT_CPD_INST_OPERANDS} lite-cr
The command returns output like in the following example:
Status:
  Progress:                    16%
  Progress Message:            Finished Zen-Metastore edb
  Supported Operand Versions:  5.1.1
  Zen Message:                 5.1.1/roles/0010-infra has failed with error: All items completed
  Zen Operator Build Number:   zen operator 5.1.1 build 37
  Zen Status:                  Failed
Cause of the problem
When the Cloud Pak for Data project (namespace) has the label pod-security.kubernetes.io/enforce, the annotation for the user ID range, openshift.io/sa.scc.uid-range, is changed during the restore process.
Workaround
Change the annotation for the user ID range in the target cluster to match the source cluster.
  1. From the source cluster, get the security context constraint (SCC) user ID (UID) range annotations of the Cloud Pak for Data operand project:
    oc get namespace ${PROJECT_CPD_INST_OPERANDS} -o json | jq '.metadata.annotations."openshift.io/sa.scc.uid-range"'
    oc get namespace ${PROJECT_CPD_INST_OPERANDS} -o json | jq '.metadata.annotations."openshift.io/sa.scc.supplemental-groups"'
    Example output from both commands:
    "1001130000/10000"
    Note: If the source cluster is no longer available, you can also find the namespace definition from the IBM Storage Fusion transaction-manager pod logs in the ibm-backup-restore project on the target cluster. For example:
    <timestamp>[TM_1][restoreguardian:createNamespace Line 2195][INFO] - Creating namespace with labels: 
    {'kubernetes.io/metadata.name': 'cpd-instance', 'olm.operatorgroup.uid/1347e6d2-51de-4a7d-a246-a25dc85b0121': '', 
    'pod-security.kubernetes.io/audit': 'baseline', 'pod-security.kubernetes.io/audit-version': 'v1.24', 'pod-security.kubernetes.io/enforce': 
    'privileged', 'pod-security.kubernetes.io/warn': 'baseline', 'pod-security.kubernetes.io/warn-version': 'v1.24'} and annotations 
    {'openshift.io/sa.scc.mcs': 's0:c34,c4', 'openshift.io/sa.scc.supplemental-groups': '1001130000/10000', 
    'openshift.io/sa.scc.uid-range': '1001130000/10000'}
  2. On the target cluster, overwrite the SCC UID range annotations of the Cloud Pak for Data operands project to match the source cluster.
    oc annotate namespace ${PROJECT_CPD_INST_OPERANDS} openshift.io/sa.scc.uid-range="<range-from-source-cluster>" --overwrite
    oc annotate namespace ${PROJECT_CPD_INST_OPERANDS} openshift.io/sa.scc.supplemental-groups="<range-from-source-cluster>" --overwrite
    For example:
    oc annotate namespace ${PROJECT_CPD_INST_OPERANDS} openshift.io/sa.scc.uid-range="1001130000/10000" --overwrite
    oc annotate namespace ${PROJECT_CPD_INST_OPERANDS} openshift.io/sa.scc.supplemental-groups="1001130000/10000" --overwrite
    Example output from both commands:
    namespace/cpd-instance annotated
  3. Verify that the project has the expected values:
    oc get namespace ${PROJECT_CPD_INST_OPERANDS} -o json | jq '.metadata.annotations'
    Example output:
    {
      "openshift.io/sa.scc.mcs": "s0:c34,c4",
      "openshift.io/sa.scc.supplemental-groups": "1001130000/10000",
      "openshift.io/sa.scc.uid-range": "1001130000/10000"
    }
  4. Restart all pods in the ${PROJECT_CPD_INST_OPERANDS} project.

    This step forces all pods to restart with the new UID range.

    oc delete pod -n ${PROJECT_CPD_INST_OPERANDS} --all
    Example output:
    pod "common-web-ui-7475996b49-2pfkz" deleted
    pod "create-secrets-job-8w6d4" deleted
    pod "ibm-nginx-74d964878b-6dbcv" deleted
    pod "ibm-nginx-74d964878b-wwfpb" deleted
    pod "ibm-nginx-tester-7c959cdcd-cqztg" deleted
    pod "ibm-zen-vault-sdk-jwt-setup-job-b7sqn" deleted
    pod "icp-mongodb-0" deleted
    ..
  5. To speed up the Cloud Pak for Data service reconcile process, restart all pods in the ${PROJECT_CPD_INST_OPERATORS} project.
    oc delete pod -n ${PROJECT_CPD_INST_OPERATORS} --all
    Example output:
    ..
    pod "cloud-native-postgresql-catalog-nhzq6" deleted
    pod "cpd-platform-dc52d" deleted
    pod "cpd-platform-operator-manager-7878b59476-nrs2l" deleted
    pod "d14e1b98ea4e0b8a0ceecd128879389e88fde2f14e11203b83128ff468cld6x" deleted
    pod "d44fc276aa580bfc4f7b22999ea30cf9f4263225c360428ec1219d7fb98lj4b" deleted
    pod "fe26f3b3326b215794f4412d56259c3232b328c4a82cd4e9f8b77b7bdekfbw2" deleted
    pod "ibm-common-service-operator-6b7977d75c-k44j6" deleted
    pod "ibm-commonui-operator-67ff844b58-b468t" deleted
    pod "ibm-iam-operator-cc84f8674-tmq4r" deleted
    pod "ibm-mongodb-operator-65d5bc4698-v87ng" deleted
    pod "ibm-namespace-scope-operator-cfb964d54-s8qp6" deleted
    pod "ibm-zen-operator-85d7f7bbdc-jkqzv" deleted
    ..
Cloud Pak for Data services will reconcile and come up to the Completed state. Monitor the process periodically by running:
cpd-cli manage get-cr-status --cpd_instance_ns=${PROJECT_CPD_INST_OPERANDS}

Restoring an online backup of Cloud Pak for Data on IBM Storage Scale Container Native storage fails

Applies to: 4.8.2 and later

Diagnosing the problem
When you restore an online backup with IBM Storage Fusion, the restore process fails at the Volume group: cpd-volumes step in the restore sequence.
Workaround
This problem occurs when you have Persistent Volume Claims (PVCs) that are smaller than 5Gi. To work around the problem, expand any PVC that is smaller than 5Gi to at least 5Gi before you create the backup. For details, see Volume Expansion in the IBM Storage Scale Container Storage Interface Driver documentation.
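
The following minimal sketch lists the requested size of each PVC in the instance project so that you can identify PVCs smaller than 5Gi, and then expands one of them; <pvc_name> is a placeholder, and the storage class must allow volume expansion:

oc get pvc -n ${PROJECT_CPD_INST_OPERANDS} -o custom-columns=NAME:.metadata.name,SIZE:.spec.resources.requests.storage
oc patch pvc <pvc_name> -n ${PROJECT_CPD_INST_OPERANDS} --type merge -p '{"spec":{"resources":{"requests":{"storage":"5Gi"}}}}'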

After a restore with NetApp Astra Control Center, post-restore hooks fail with an Authentication failed error

Applies to: 4.8.3, 4.8.4

Fixed in: 4.8.5

Diagnosing the problem
After restoring Cloud Pak for Data with NetApp Astra Control Center, post-restore hooks fail with the following error:
time=<timestamp> level=info msg=   zen/configmap/zen-cs-aux-ckpt-cm: 
component=zen-cs, op=<mode=post-restore,type=config-hook,method=job>, status=error
In the cpdbr-oadp or cpd-cli*.log file, you see the following error:
func=cpdbr-oadp/pkg/quiesce.logJob file=/go/src/cpdbr-oadp/pkg/quiesce/jobexecutor.go:222
time=<timestamp> level=info msg=logs for pod=zen-mongodb-restore-z9664:

--------------------------------
container: cs-mongodb-restore
--------------------------------
-------------------------------------------------------------------------
Mongo Restore output: <timestamp>	Failed: error connecting to db server: server returned error on SASL authentication step: Authentication failed.
-------------------------------------------------------------------------
Mongo Restore attempt: 1 return code: 1
...
-------------------------------------------------------------------------
Mongo Restore output: 2024-02-16T00:47:41.813+0000	error dialing icp-mongodb-2.icp-mongodb.zen.svc.cluster.local:27017: dial tcp 172.129.3.21:27017: connect: connection refused
<timestamp>	Failed: error connecting to db server: server returned error on SASL authentication step: Authentication failed.
-------------------------------------------------------------------------
Mongo Restore attempt: 5 return code: 1

func=cpdbr-oadp/pkg/quiesce.logJob file=/go/src/cpdbr-oadp/pkg/quiesce/jobexecutor.go:229
time=<timestamp> level=info msg=exit logJob func=cpdbr-oadp/pkg/quiesce.logJob file=/go/src/cpdbr-oadp/pkg/quiesce/jobexecutor.go:232
time=<timestamp> level=info msg=job zen-mongodb-restore deleted func=cpdbr-oadp/pkg/quiesce.runJobX.func1 file=/go/src/cpdbr-oadp/pkg/quiesce/jobexecutor.go:124
time=<timestamp> level=error msg=job zen-mongodb-restore did not complete successfully
Cause of the problem
The MongoDB operator ran before the icp-mongodb-admin secret was created. Services cannot connect to MongoDB.
Workaround
Do the following steps:
  1. Delete the icp-mongodb pods and PVCs:
    oc project ${PROJECT_CPD_INST_OPERANDS}
    oc delete pvc mongodbdir-icp-mongodb-0&
    oc delete pvc mongodbdir-icp-mongodb-1&
    oc delete pvc mongodbdir-icp-mongodb-2&
    oc delete po icp-mongodb-0&
    oc delete po icp-mongodb-1&
    oc delete po icp-mongodb-2&
  2. Wait for all icp-mongodb pods to reach a running state:
    oc get po -n ${PROJECT_CPD_INST_OPERANDS} -w
  3. Verify that you can connect to the mongo shell:
    oc rsh -n ${PROJECT_CPD_INST_OPERANDS} icp-mongodb-0
    mongo --host localhost:$MONGODB_SERVICE_PORT --username $ADMIN_USER --password $ADMIN_PASSWORD --authenticationDatabase admin --ssl --sslCAFile=/data/configdb/tls.crt --sslPEMKeyFile=/work-dir/mongo.pem --verbose
  4. Retry running the post-restore hooks:
    cpd-cli oadp restore posthooks --tenant-operator-namespace=${PROJECT_CPD_INST_OPERANDS} --hook-kind=checkpoint

After a restore with NetApp Astra Control Center, post-restore hooks fail with a zen-mongodb-restore already exists error

Applies to: 4.8.3, 4.8.4

Fixed in: 4.8.5

Diagnosing the problem
After restoring Cloud Pak for Data with NetApp Astra Control Center, post-restore hooks fail with the following error:
time=<timestamp> level=info msg=   zen/configmap/zen-cs-aux-ckpt-cm: component=zen-cs, op=<mode=post-restore,type=config-hook,method=job>, status=error
In the cpdbr-oadp or cpd-cli*.log file, you see the following error:
time=<timestamp> level=error msg=jobs.batch "zen-mongodb-restore" already exists
Cause of the problem
The zen-mongodb-restore job was already running from a previously failed post-restore hooks attempt.
Workaround
Do the following steps:
  1. Delete the zen-mongodb-restore job:
    oc delete job -n ${PROJECT_CPD_INST_OPERANDS} zen-mongodb-restore
  2. Retry running the post-restore hooks:
    cpd-cli oadp restore posthooks --tenant-operator-namespace=${PROJECT_CPD_INST_OPERANDS} --hook-kind=checkpoint

After a restore with NetApp Astra Control Center, the ZenService custom resource remains stuck at 51% in progress

Applies to: 4.8.3, 4.8.4

Fixed in: 4.8.5

Diagnosing the problem
Get the status of the ZenService custom resource:
oc get zenservice -A -o yaml

Example output of the command:

...
status:
    Progress: 51%
    ProgressMessage: Finished Role 0020-core
    supportedOperandVersions: 5.1.1
    zenMessage: 5.1.1/roles/0030-gateway has failed. See the latest operator debug
      log for exact error in operator pod ibm-zen-operator-85d7f7bbdc-h9xbm under
      /tmp/ansible-operator/runner/zen.cpd.ibm.com/v1/ZenService/zen/lite-cr/artifacts/
      directory.
    zenOperatorBuildNumber: zen operator 5.1.1 build 37
    zenStatus: Failed
...

Get the status of the oidc-client-registration job:

oc get job oidc-client-registration -o yaml

Example output of the command:

...
- apiVersion: batch/v1
  kind: Job
  namespace: zen
  objectName: oidc-client-registration
  status: NotReady
...
Cause of the problem
The MongoDB operator ran before the icp-mongodb-admin secret was created. Services cannot connect to MongoDB.

The Auth service cannot create OAuthDBSchema for storing client registrations.

Workaround
Do the following steps:
  1. Restart the platform-auth-service pods.
    1. Get the list of pods:
      oc get po -A | grep "platform-auth-service"
    2. Delete each pod (or use the one-line sketch after this list to delete them all at once):
      oc delete po -n ${PROJECT_CPD_INST_OPERANDS} platform-auth-service-<pod_name>
  2. Wait for the ZenService custom resource to reconcile and progress past 51%:
    oc get zenservice -A -o yaml -w
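
As an alternative to deleting the platform-auth-service pods one at a time, a minimal sketch that deletes all of them in one command:

oc get po -n ${PROJECT_CPD_INST_OPERANDS} -o name | grep platform-auth-service | xargs oc delete -n ${PROJECT_CPD_INST_OPERANDS}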

After a restore with NetApp Astra Control Center, the ZenService custom resource does not reconcile

Applies to: 4.8.3, 4.8.4

Fixed in: 4.8.5

Diagnosing the problem
Check if namespacescope is missing the instance namespace:
oc get nss -A -o yaml

Example output of the command:

validatedMembers:
    - ibm-common-services
Cause of the problem
The namespacescope was not restored after the post-restore hooks successfully completed.
Workaround
Do the following steps:
  1. Restore the namespacescope.
    1. Get the cpdbr-tenant-service pod name:
      oc get po -A | grep "cpdbr-tenant-service"
    2. Access the cpdbr-tenant-service pod:
      oc rsh -n ${PROJECT_CPD_INST_OPERANDS} cpdbr-tenant-service-<pod_name>
    3. Run the following command:
      /cpdbr-scripts/cpdops/files/cpd-operators.sh restore-namespacescope --foundation-namespace ${PROJECT_CPD_INST_OPERATORS} --operators-namespace ${PROJECT_CPD_INST_OPERATORS}
  2. Wait for the ZenService custom resource to progress:
    oc get zenservice -A -o yaml -w

After a restore, the OpenPages and watsonx.governance custom resources are in a Failed state

Applies to: 4.8.7

Fixed in: 4.8.8

Diagnosing the problem
After you restore Cloud Pak for Data by using Portworx asynchronous disaster recovery, you activate applications. In the Cloud Pak for Data Instances page, the status of the OpenPages custom resource is Failed. The status of the watsonx.governance custom resource alternates between Failed and Completed.
Resolving the problem
To work around the problem, do the following steps:
  1. Get the OpenPages instance ID:
    INSTANCE_ID=$(oc get openpagesinstance ${INSTANCE_NAME} -o json | jq -r '.spec.zenServiceInstanceId')
  2. Exec into the Db2 container:
    oc exec -it c-db2oltp-$INSTANCE_ID-db2u-0 -- bash
  3. Edit the online restore script inside the Db2 container for OpenPages:
    vi /mnt/backup/online/db2-online-restore.sh
  4. Replace line 32.
    From
    gsk8capicmd_64 -secretkey -add -db "/mnt/blumeta0/db2/keystore/keystore.p12" -stashed -label ${label} -format ascii -file "/tmp/${label}.kdb"
    to
    gsk8capicmd_64 -secretkey -add -db "/mnt/blumeta0/db2/keystore/keystore.p12" -stashed -label ${label} -format ascii -file "/tmp/${label}.kdb" || true
  5. Reboot the operator pod so that it reconciles (a sketch that automates steps 4 and 5 follows this list).
  6. Rerun the restore.
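
The following minimal sketch automates steps 4 and 5. The sed expression appends || true to line 32 of the script from step 3, and <openpages_operator_deployment> is a placeholder for the OpenPages operator deployment name in the operators project (check it with oc get deploy -n ${PROJECT_CPD_INST_OPERATORS}):

oc exec c-db2oltp-$INSTANCE_ID-db2u-0 -- bash -c "sed -i '32s/\$/ || true/' /mnt/backup/online/db2-online-restore.sh"
oc -n ${PROJECT_CPD_INST_OPERATORS} rollout restart deployment <openpages_operator_deployment>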

When restoring Db2 Big SQL, post-restore hook fails

Applies to: 4.8.7 and later

Diagnosing the problem
When you restore Db2 Big SQL with Portworx asynchronous disaster recovery, the restore fails with an error message like in the following example:
zen/configmap/cpd-bigsql-aux-ckpt-cm: component=bigsql, op=<mode=post-restore,type=config-hook,method=rule>, status=error
Cause of the problem
The error is caused by a timing issue. The post-restore hook runs while the Db2 Big SQL head pod is starting up.
Resolving the problem
To resolve the problem, do the following steps:
  1. Rerun the post-restore hook logic inside the hurricane pod.
    1. Log in to the hurricane pod and switch to the db2inst1 user:
      oc project ${PROJECT_CPD_INST_OPERANDS}
      oc rsh $(oc get pod -o name -l formation_id=db2u-dv,role=hurricane)
      su - db2inst1
    2. Run the post-restore hook logic:
      /usr/ibmpacks/current/bigsql/bigsql/bigsql-cli/BIGSQL/package/scripts/bigsql-db2ubar-hook.sh -H POST -M RESTORE -L
  2. Restore the namespacescope custom resource in the ${PROJECT_CPD_INST_OPERATORS} project.
    1. Edit the namespacescope common-service custom resource:
      oc -n ${PROJECT_CPD_INST_OPERATORS} edit namespacescope common-service
    2. Add ${PROJECT_CPD_INST_OPERANDS} to the namespaceMembers section.

      The namespaceMembers section should look like the following example:

      namespaceMembers:
        - cpd-operators
        - zen

Restore fails at the running post-restore script step

Applies to: 4.8.0-4.8.6

Fixed in: 4.8.7

Diagnosing the problem
When you use Portworx asynchronous disaster recovery, activating applications fails when you run the post-restore script. In the restore_post_hooks_<timestamp>.log file, you see an error message such as in the following example:
Time: <timestamp> level=error -  cpd-tenant-restore-<timestamp>-r2 failed
/cpdbr-scripts/cpdbr/cpdbr-tenant.sh post-restore exit code=1
*** cpdbr-tenant.sh post-restore failed ***
command terminated with exit code 1
Resolving the problem
To work around the problem, prior to running the post-restore script, restore custom resource definitions by running the following command:
cpd-cli oadp restore create <restore-name-r2> \
--from-backup=cpd-tenant-backup-<timestamp>-b2 \
--include-resources='customresourcedefinitions' \
--include-cluster-resources=true \
--skip-hooks \
--log-level=debug \
--verbose

Activating applications after migrating data with Portworx asynchronous disaster recovery fails at post-restore step with constraints not satisfiable error

Applies to: 4.8.4 and later

Diagnosing the problem
If the post-restore step in Activating applications in the destination cluster fails, you see an error in the log file similar to the following example:
Time: <timestamp> level=warning - Create OperandRequest Timeout Warning Exited with return code=0 
Time: <timestamp> level=info - cpd-operators.sh restore done 
Time: <timestamp> level=error - OperandRequest: operandrequest.operator.ibm.com/im-service - phase: 
Installing /cpdbr-scripts/cpdbr/cpdbr-tenant.sh post-restore exit code=1 *** cpdbr-tenant.sh post-restore failed ***
Check for failing operandrequests:
oc get operandrequests -A
For failing operandrequests, check their conditions for constraints not satisfiable messages:
oc describe -n ${PROJECT_CPD_INST_OPERATORS} <opreq-name>
Cause of the problem
Subscription wait operations timed out. The problematic subscriptions show an error similar to the following example:
'constraints not satisfiable: clusterserviceversion ibm-db2aaservice-cp4d-operator.v5.2.0
      exists and is not referenced by a subscription, @existing/cpd-operators//ibm-db2aaservice-cp4d-operator.v5.2.0
      and ibm-db2aaservice-cp4d-operator-catalog/cpd-operators/v5.2/ibm-db2aaservice-cp4d-operator.v5.2.0
      originate from package ibm-db2aaservice-cp4d-operator, subscription ibm-db2aaservice-cp4d-operator
      requires ibm-db2aaservice-cp4d-operator-catalog/cpd-operators/v5.2/ibm-db2aaservice-cp4d-operator.v5.2.0,
      subscription ibm-db2aaservice-cp4d-operator exists'

This problem is a known issue with Red Hat Operator Lifecycle Manager. For details, see Failed to install or upgrade operator with warning 'clusterserviceversion is not referenced by a subscription'.

Workaround
Do the following steps:
  1. Delete the problematic clusterserviceversions and subscriptions, and restart the Operand Deployment Lifecycle Manager (ODLM) pod.

    For details, follow the steps in the troubleshooting document cloud-native-postgresql operator is installed with the certified-operators catalogsource.

  2. Re-run the post-restore step.

Activating applications after migrating data with Portworx asynchronous disaster recovery fails at post-restore step with subscription isn't created by ODLM error

Applies to: 4.8.4 and later

Diagnosing the problem
The post-restore step in Activating applications in the destination cluster fails with the following error:
Time: <timestamp> level=warning - Create OperandRequest Timeout Warning Exited with return code=0 
Time: <timestamp> level=info - cpd-operators.sh restore done 
Time: <timestamp> level=error - OperandRequest: operandrequest.operator.ibm.com/im-service - 
phase: Installing /cpdbr-scripts/cpdbr/cpdbr-tenant.sh post-restore exit code=1 *** cpdbr-tenant.sh post-restore failed ***
Check the ODLM pod logs:
oc logs $(oc get po -o name -lname=operand-deployment-lifecycle-manager -n ${PROJECT_CPD_INST_OPERATORS}) -n ${PROJECT_CPD_INST_OPERATORS} | grep "isn't created by ODLM"
You see an error like in the following example:
I0323 06:30:25.293343 1 reconcile_operator.go:277] Subscription cloud-native-postgresql-stable-v1.18-cloud-native-postgresql-catalog-cpd-operators 
in namespace cpd-operators isn't created by ODLM. Ignore update/delete it.
Cause
The name of the target cluster subscription is different from the source cluster subscription.
Workaround
Follow the steps in the troubleshooting document cloud-native-postgresql operator is installed with the certified-operators catalogsource. However, before you do the step that restarts the ODLM pod, do the following steps:
  1. Restart the Operator Lifecycle Manager (OLM) pod:
    oc delete pods -n openshift-operator-lifecycle-manager -l app=catalog-operator
    oc delete pods -n openshift-operator-lifecycle-manager -l app=olm-operator
  2. Check that all operand requests reach a running state:
    oc get operandrequests -A -w
  3. Now restart the ODLM pod (see the sketch after this list).
  4. Re-run the post-restore step.
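
For step 3, a minimal sketch that restarts the ODLM pod by using the same label selector as the log command in the diagnosing steps:

oc delete pod -n ${PROJECT_CPD_INST_OPERATORS} -l name=operand-deployment-lifecycle-manager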

Creating an offline backup in REST mode stalls

Applies to: 4.8.0 and later

Diagnosing the problem
This problem occurs when you try to create an offline backup in REST mode by using a custom --image-prefix value. The offline backup stalls with cpdbr-vol-mnt pods in the ImagePullBackOff state.
Cause of the problem
When you specify the --image-prefix option in the cpd-cli oadp backup create command, the default prefix registry.redhat.io/ubi9 is always used.
Resolving the problem
To work around the problem, create the backup in Kubernetes mode instead. To change to this mode, run the following command:
cpd-cli oadp client config set runtime-mode=

Unable to back up Watson Machine Learning Accelerator

Applies to: 4.8.0

Fixed in: 4.8.1

Diagnosing the problem
Creating an offline backup of a Cloud Pak for Data deployment that includes Watson Machine Learning Accelerator with the OADP utility fails.
An error message like in the following example appears in the CPD-CLI*.log file:
zen/configmap/cpd-wmla-br-cm: component=wmla, op=<mode=pre-backup,type=config-hook,method=rule>, status=error
Workaround
Do the following steps each time you create a backup:
  1. Before you create the backup, stop the Watson Machine Learning Accelerator operator:
    oc scale --replicas 0 deploy wmla-operator-controller-manager -n ${PROJECT_CPD_INST_OPERATORS}
  2. Update the Watson Machine Learning Accelerator backup and restore configmap.
    oc project ${PROJECT_CPD_INST_OPERANDS}
    WMLA_BR_CM='cpd-wmla-br-cm'
    WMLA_BR_CM_YAML=${WMLA_BR_CM}.yaml
    WMLA_BR_CM_YAML_ORIG=${WMLA_BR_CM_YAML}.orig
    oc get cm ${WMLA_BR_CM} -o yaml > ${WMLA_BR_CM_YAML_ORIG}
    line=`grep -n 'enable-maint' ${WMLA_BR_CM_YAML_ORIG}|grep -v apiVersion|awk -F: '{print $1}'`
    begin_line=$((line-4))
    end_line=$((line+3))
    sed -e "${begin_line},${end_line}d" ${WMLA_BR_CM_YAML_ORIG} -e '/resourceVersion/d' > ${WMLA_BR_CM_YAML}
    oc apply -f ${WMLA_BR_CM_YAML}
    
    WMLA_ADD_ON_BR_CM='cpd-wmla-add-on-br-cm'
    WMLA_ADD_ON_BR_CM_YAML=${WMLA_ADD_ON_BR_CM}.yaml
    WMLA_ADD_ON_BR_CM_YAML_ORIG=${WMLA_ADD_ON_BR_CM_YAML}.orig
    oc get cm ${WMLA_ADD_ON_BR_CM} -o yaml > ${WMLA_ADD_ON_BR_CM_YAML_ORIG}
    line=`grep -n 'enable-maint' ${WMLA_ADD_ON_BR_CM_YAML_ORIG}|grep -v apiVersion|awk -F: '{print $1}'`
    begin_line=$((line-7))
    end_line=$((line+13))
    sed -e "${begin_line},${end_line}d" ${WMLA_ADD_ON_BR_CM_YAML_ORIG} -e '/resourceVersion/d' > ${WMLA_ADD_ON_BR_CM_YAML}
    oc replace -f ${WMLA_ADD_ON_BR_CM_YAML}
  3. After the backup is created, start the Watson Machine Learning Accelerator operator.
    oc scale --replicas 1 deploy wmla-operator-controller-manager -n ${PROJECT_CPD_INST_OPERATORS}

Common core services custom resource is in InProgress state after an offline restore to a different cluster

Applies to: 4.8.0-4.8.5

Fixed in: 4.8.6

Diagnosing the problem
  1. Get the status of installed components by running the following command.
    cpd-cli manage get-cr-status \
    --cpd_instance_ns=${PROJECT_CPD_INST_OPERANDS}
  2. Check that the status of ccs-cr is InProgress.
Cause of the problem
The Common core services component failed to reconcile on the restored cluster because the dsx-requisite-pre-install-job-<xxxx> job is failing.
Resolving the problem
To resolve the problem, follow the instructions that are described in the technote Failed dsx-requisite-pre-install-job during offline restore.
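
Before you apply the technote fix, you can confirm which job pod is failing and capture its logs. A minimal sketch, where <dsx-requisite-pre-install-job_pod_name> is a placeholder for the failing pod:

oc get pods -n ${PROJECT_CPD_INST_OPERANDS} | grep dsx-requisite-pre-install-job
oc logs -n ${PROJECT_CPD_INST_OPERANDS} <dsx-requisite-pre-install-job_pod_name>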

Offline restore to a different cluster fails due to management-ingress-ibmcloud-cluster-info ConfigMap not found in PodVolumeRestore

Applies to: 4.8.5 and later

Diagnosing the problem
After an offline backup is created, but before doing a restore, check if the management-ingress-ibmcloud-cluster-info ConfigMap was backed up by running the following commands:
cpd-cli oadp backup status --details <backup_name1> | grep management-ingress-ibmcloud-cluster-info
cpd-cli oadp backup status --details <backup_name2> | grep management-ingress-ibmcloud-cluster-info

During or after the restore, pods that mount the missing ConfigMap show errors. For example:

oc describe po c-db2oltp-wkc-db2u-0 -n ${PROJECT_CPD_INST_OPERANDS}
Example output:
Events:
  Type     Reason       Age                  From     Message
  ----     ------       ----                 ----     -------
  Warning  FailedMount  41m (x512 over 17h)  kubelet  MountVolume.SetUp failed for volume "management-ingress-ibmcloud-cluster-info" : configmap "management-ingress-ibmcloud-cluster-info" not found
  Warning  FailedMount  62s (x518 over 17h)  kubelet  Unable to attach or mount volumes: unmounted volumes=[management-ingress-ibmcloud-cluster-info], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition
Cause of the problem
When a related ibmcloud-cluster-info ConfigMap gets excluded as part of backup hooks, the management-ingress-ibmcloud-cluster-info ConfigMap copies the exclude labeling and unintentionally gets excluded from the backup.
Workaround
If IAM is enabled, apply the following patch.
  1. Log in to Red Hat OpenShift Container Platform as a cluster administrator.
    ${OC_LOGIN}
    Remember: OC_LOGIN is an alias for the oc login command.
  2. Check if IAM is enabled:
    oc get zenservices lite-cr -o jsonpath='{.spec.iamIntegration}{"\n"}' -n ${PROJECT_CPD_INST_OPERANDS}
  3. If IAM is enabled, apply the following patch to ensure that the management-ingress-ibmcloud-cluster-info ConfigMap is not excluded from the backup:
    oc apply -n ${PROJECT_CPD_INST_OPERANDS} -f - << EOF
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: cpdbr-management-ingress-exclude-fix-br
      labels:
        cpdfwk.aux-kind: br
        cpdfwk.component: cpdbr-patch
        cpdfwk.module: cpdbr-management-ingress-exclude-fix
        cpdfwk.name: cpdbr-management-ingress-exclude-fix-br-cm
        cpdfwk.managed-by: ibm-cpd-sre
        cpdfwk.vendor: ibm
        cpdfwk.version: 1.0.0
    data:
      aux-meta: |
        name: cpdbr-management-ingress-exclude-fix-br
        description: |
          This configmap defines offline backup prehooks to prevent cases where Bedrock's management-ingress-ibmcloud-cluster-info
          configmap gets unexpectedly excluded when ibmcloud-cluster-info is excluded during cs-postgres configmap prehooks.
          This is a temporary workaround until a complete fix is implemented.
        version: 1.0.0
        component: cpdbr-patch
        aux-kind: br
        priority-order: 99999 # This should happen at the end of backup prehooks
      backup-meta: |
        pre-hooks:
          exec-rules:
          # Remove lingering velero exclude label from offline prehooks
          - resource-kind: configmap
            name: management-ingress-ibmcloud-cluster-info
            actions:
              - builtins:
                  name: cpdbr.cpd.ibm.com/label-resources
                  params:
                    action: remove
                    key: velero.io/exclude-from-backup
                    value: "true"
                  timeout: 360s
          # Remove lingering ignore-on-nd-backup exclude label from online checkpoint operation
          - resource-kind: configmap
            name: management-ingress-ibmcloud-cluster-info
            actions:
              - builtins:
                  name: cpdbr.cpd.ibm.com/label-resources
                  params:
                    action: remove
                    key: icpdsupport/ignore-on-nd-backup
                    value: "true"
                  timeout: 360s
        post-hooks:
          exec-rules: 
          - resource-kind: # do nothing for posthooks
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: cpdbr-management-ingress-exclude-fix-ckpt
      labels:
        cpdfwk.aux-kind: checkpoint
        cpdfwk.component: cpdbr-patch
        cpdfwk.module: cpdbr-management-ingress-exclude-fix
        cpdfwk.name: cpdbr-management-ingress-exclude-fix-ckpt-cm
        cpdfwk.managed-by: ibm-cpd-sre
        cpdfwk.vendor: ibm
        cpdfwk.version: 1.0.0
    data:
      aux-meta: |
        name: cpdbr-management-ingress-exclude-fix-ckpt
        description: |
          This configmap defines online backup prehooks to prevent cases where Bedrock's management-ingress-ibmcloud-cluster-info
          configmap gets unexpectedly excluded when ibmcloud-cluster-info is excluded during cs-postgres configmap checkpoint operation.
          This is a temporary workaround until a complete fix is implemented.
        version: 1.0.0
        component: cpdbr-patch
        aux-kind: ckpt
        priority-order: 99999 # This should happen at the end of backup prehooks
      backup-meta: |
        pre-hooks:
          exec-rules:
          # Remove lingering velero exclude label from offline prehooks
          - resource-kind: configmap
            name: management-ingress-ibmcloud-cluster-info
            actions:
              - builtins:
                  name: cpdbr.cpd.ibm.com/label-resources
                  params:
                    action: remove
                    key: velero.io/exclude-from-backup
                    value: "true"
                  timeout: 360s
          # Remove lingering ignore-on-nd-backup exclude label from online checkpoint operation
          - resource-kind: configmap
            name: management-ingress-ibmcloud-cluster-info
            actions:
              - builtins:
                  name: cpdbr.cpd.ibm.com/label-resources
                  params:
                    action: remove
                    key: icpdsupport/ignore-on-nd-backup
                    value: "true"
                  timeout: 360s
        post-hooks:
          exec-rules: 
          - resource-kind: # do nothing for posthooks
      checkpoint-meta: |
        exec-hooks:
          exec-rules: 
          - resource-kind: # do nothing for checkpoint
    EOF

Volume backup and restore commands do not work when Identity Management Service is enabled

Applies to: 4.8.0, 4.8.1

Fixed in: 4.8.2

Diagnosing the problem

This problem occurs when Cloud Pak for Data is integrated with the Identity Management Service.

  • During the backup volume quiesce operation, PVCs like mongodbdir-icp-mongodb-0 are not properly excluded.
  • When creating the restore, icp-mongodb-* pods fail with a CreateContainerConfigError. The error log has an entry such as in the following example:
    Failed global initialization: InvalidSSLConfiguration Can not set up PEM key file.
    Normal   Pulled   23s (x6 over 58m)   kubelet  Container image "icr.io/cpopen/cpfs/ibm- 
    mongodb@sha256:1980150bc49d215f7ff764c9437e5c15efbb67e5af6f0184fd98a8b837cf9a02" already present on 
    machine
    Warning  Failed   23s (x5 over 107s)  kubelet  Error: failed to prepare subPath for volumeMount "mongodbdir" of 
    container "icp-mongodb"
Workaround
Before you run cpd-cli backup-restore quiesce, cpd-cli backup-restore unquiesce, cpd-cli backup-restore volume-backup, or cpd-cli backup-restore volume-restore commands, check the cm zen-cs-aux-qu-cm configmap:
  1. Run the following commands:
    oc get cm zen-cs-aux-qu-cm -n ${PROJECT_CPD_INST_OPERANDS} -o jsonpath='{.data.quiesce-meta}{"\n"}'
    oc get cm zen-cs-aux-qu-cm -n ${PROJECT_CPD_INST_OPERANDS} -o jsonpath='{.data.unquiesce-meta}{"\n"}'
  2. If the output of either command contains the cpdbr.cpd.ibm.com/label-resources builtin name, replace it with cpdbr.cpd.ibm.com/annotate-resources by running the following command:
    oc edit cm zen-cs-aux-qu-cm -n ${PROJECT_CPD_INST_OPERANDS}

Flight service issues

The Flight service do_action command might show misleading error messages in Db2 table drop or create operations

Applies to: 4.8.0

Fixed in: 4.8.1

Diagnosing the problem

If you run the do_action command from Jupyter Notebooks with Python, you might see error messages indicating that drop or create statements on a Db2 table were not successful. For example:

../src/arrow/status.cc:137: DoAction result was not fully consumed: Cancelled: Flight cancelled call, with message: Cancelled. Detail: Cancelled
Workaround
The behavior of do_action has changed. You must iterate through the results of do_action to ensure that all results are collected. For example,
results = list(flight_client.do_action(action))
return [r.body.to_pybytes().decode('utf-8') for r in results]

Service issues

The following issues are specific to services.