Known issues and limitations for IBM Software Hub

The following issues apply to the IBM® Software Hub platform. Each issue includes information about the releases that it applies to. If the issue was fixed in a refresh, that information is also included.

Customer-reported issues

Issues that are found after the release are posted on the IBM Support site.

General issues

After rebooting a cluster that uses OpenShift Data Foundation storage, some IBM Software Hub services aren't functional

Applies to: 5.2.0 and later

Diagnosing the problem
After rebooting the cluster, some IBM Software Hub custom resources remain in the InProgress state.

For more information about this problem, see Missing NodeStageVolume RPC call blocks new pods from going into Running state in the Red Hat® OpenShift® Data Foundation 4.1.4 release notes.

Workaround
To enable the pods to come up after a reboot:
  1. Find the nodes that have pods that are in an Error state:
    oc get po -n ${PROJECT_CPD_INST_OPERANDS} -o wide | grep -v -P "Completed|(\d+)\/\1"
  2. Mark each node as unschedulable.
    oc adm cordon <node_name>
  3. Delete the affected pods:
    oc get pod | grep -Ev "Comp|0/0|1/1|2/2|3/3|4/4|5/5|6/6|7/7" | awk '{print $1}' | xargs oc delete po --force=true --grace-period=0
  4. Mark each node as schedulable:
    oc adm uncordon <node_name>
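
After the nodes are uncordoned, you can confirm that the affected pods return to a Running or Completed state by reusing the filter from step 1. This check is a sketch; apart from the header line, it should not list any pods once the cluster recovers:
oc get po -n ${PROJECT_CPD_INST_OPERANDS} -o wide | grep -v -P "Completed|(\d+)\/\1"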

Auxiliary modules are not found when running export-import commands on ppc64le hardware

Applies to: 5.2.0

Fixed in: 5.2.1

When you run the cpd-cli export-import list aux-module command on ppc64le hardware, the command returns:

No module(s) found
Resolving the problem
To run the export-import commands on ppc64le hardware, you must edit the wkc-base-aux-exim-cm ConfigMap:
  1. Print the contents of the wkc-base-aux-exim-cm ConfigMap to a file called wkc-base-aux-exim-cm.yaml:
    oc get cm wkc-base-aux-exim-cm \
    -n ${PROJECT_CPD_INST_OPERANDS} \
    -o yaml > wkc-base-aux-exim-cm.yaml
  2. Open the wkc-base-aux-exim-cm.yaml file in a text editor.
  3. Locate the data.aux-meta section.
  4. Update the architecture parameter to ppc64le:
    data:
      aux-meta: |
        name: wkc-base-aux
        description: wkc-base auxiliary export/import module
        architecture: "ppc64le"
        ...
  5. Locate the metadata.labels section.
  6. Update the cpdfwk.arch parameter to ppc64le:
    metadata:
      labels:
        app: wkc-base
        app.kubernetes.io/instance: wkc-base
        app.kubernetes.io/managed-by: Tiller
        app.kubernetes.io/name: wkc-base
        chart: wkc-base
        cpdfwk.arch: ppc64le
        ...
  7. Save your changes to the wkc-base-aux-exim-cm.yaml file.
  8. Apply your changes to the wkc-base-aux-exim-cm ConfigMap:
    oc apply -f wkc-base-aux-exim-cm.yaml
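
To confirm that the module is now detected, you can re-run the list command from the beginning of this section. The options shown here are an assumption and mirror the --namespace and --profile values that are used for other export-import commands:
cpd-cli export-import list aux-module \
--namespace=${PROJECT_CPD_INST_OPERANDS} \
--profile=${CPD_PROFILE_NAME}
The command should now list the wkc-base-aux module instead of returning No module(s) found.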

Export jobs are not deleted if the export specification file includes incorrectly formatted JSON

Applies to: 5.2.0 and later

When you export data from IBM Software Hub, you must create an export specification file. The export specification file includes a JSON string that defines the data that you want to export.

If the JSON string is incorrectly formatted, the export will fail. In addition, if you try to run the cpd-cli export-import export delete command, the command completes without returning any errors, but the export job is not deleted.

Diagnosing the problem
  1. To confirm that the export job was not deleted, run:
    cpd-cli export-import export status \
    --namespace=${PROJECT_CPD_INST_OPERANDS} \
    --profile=${CPD_PROFILE_NAME}
Resolving the problem
  1. Use the OpenShift Container Platform command-line interface to delete the failed export job:
    1. Set the EXPORT_JOB environment variable to the name of the job that you want to delete:
      export EXPORT_JOB=<export-job-name>
    2. To remove the export job, run:
      oc delete job ${EXPORT_JOB} \
      --namespace=${PROJECT_CPD_INST_OPERANDS}
  2. If you want to use the export specification file to export data, fix the JSON formatting issues.
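
You can validate the JSON before you re-run the export. The following check is a sketch that assumes the jq utility is installed and that export-spec.json is a hypothetical name for your export specification file; the command prints nothing and exits with code 0 when the JSON is well formed:
jq empty export-spec.json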

The health service-functionality check fails at certain API endpoints for AI Factsheets, Data Product Hub, OpenPages, and watsonx.data

Applies to: 5.2.0

Fixed in: 5.2.1

The cpd-cli health service-functionality checks fail at specific API endpoints for the following services:
AI Factsheets

API endpoint: /model_inventory/model_entries/

This failure is not indicative of a problem with AI Factsheets. There is no impact on the functionality of the service if this check fails.

Data Product Hub

API endpoint: /retire

This failure is not indicative of a problem with Data Product Hub. There is no impact on the functionality of the service if this check fails.

OpenPages

API endpoint: /workflows/definitions/68/start/2511

This failure is not indicative of a problem with OpenPages. There is no impact on the functionality of the service if this check fails.

watsonx.data™

API endpoint: /lakehouse/api/v2/upload/json?engine_id=presto-01

This failure is not indicative of a problem with watsonx.data. There is no impact on the functionality of the service if this check fails.

The health service-functionality check fails on the v2/ingestion_jobs API endpoint for watsonx.data

Applies to: 5.2.2

API endpoint: /lakehouse/api/v2/ingestion_jobs

This failure is not indicative of a problem with watsonx.data. There is no impact on the functionality of the service if this check fails.

Cannot scale down IBM Knowledge Catalog

Applies to: 5.2.1

Fixed in: 5.2.2

If you scale up IBM Knowledge Catalog, for example to level_3 (medium), the number of Neo4j cluster primaries increases. However, you cannot scale the number of Neo4j cluster primaries back down after it is increased. This limitation prevents data integrity issues.

In Software Hub 5.2.1, scaling down might appear to be successful. However, in this case, the Neo4jCluster is in a Failed state even if the three Neo4j servers are up and running. This situation can cause other issues. For example, after you shut down IBM Knowledge Catalog, it can fail to restart.

To fix the issue, you can set the number of clusterPrimaries to 3 by using the following command:

oc patch neo4jcluster datalineage-neo4j \
--namespace=${PROJECT_CPD_INST_OPERANDS} \
--type=merge \
--patch='{"spec":{"clusterPrimaries": 3}}'

Installation and upgrade issues

The setup-instance command fails waiting for the ibm-namespace-scope-operator operator to be installed or upgraded

Applies to: 5.2.1

Fixed in: 5.2.2

When you run the cpd-cli manage setup-instance command, the command fails while waiting for the ibm-namespace-scope-operator operator to be installed or upgraded.

The command fails with the following error:

/tmp/work/cpfs_scripts/5.2.1/cp3pt0-deployment/common/utils.sh at line 127 in function wait_for_condition: 
Timeout after 40 minutes waiting for operator ibm-namespace-scope-operator to be upgraded

This issue occurs when your cluster pulls images from a private container registry and one or more of the following images cannot be found:

  • ibm-cpd-cloud-native-postgresql-operator-bundle@sha256:525924934f74b28783dacf630a6401ccd4c88d253aafa2d312b7993d8b9a14c1
  • ibm-cpd-cloud-native-postgresql-operator@sha256:a72c2b70a9f0a93047d47247cddc1616138cbc87337289ddb208565af6e4369d
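
To check whether these images are available in your private container registry, you can inspect them by digest. The following command is a sketch; it assumes that skopeo is installed, that PRIVATE_REGISTRY_LOCATION points to your registry, and that <image-path> is the repository path that you use for IBM operator images (add the --authfile option if your registry requires credentials):
skopeo inspect \
docker://${PRIVATE_REGISTRY_LOCATION}/<image-path>/ibm-cpd-cloud-native-postgresql-operator-bundle@sha256:525924934f74b28783dacf630a6401ccd4c88d253aafa2d312b7993d8b9a14c1
If skopeo cannot find the digest, the image is missing from your registry.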

The setup-instance command fails if the ibm-common-service-operator-service service is not found

Applies to: Upgrades from:
  • 4.8.x
  • 5.0.x
  • 5.1.x
  • 5.2.x

When you run the cpd-cli manage setup-instance command, the command fails if the ibm-common-service-operator-service service is not found in the operator project for the instance.

When this error occurs, the ibmcpd ibmcpd-cr custom resource is stuck at 35%.

Diagnosing the problem
To determine if the command failed because the ibm-common-service-operator-service service was not found:
  1. Get the .status.progress value from the ibmcpd ibmcpd-cr custom resource:
    oc get ibmcpd ibmcpd-cr \
    --namespace=${PROJECT_CPD_INST_OPERANDS} \
    -o json | jq -r '.status.progress'
    • If the command returns 35%, continue to the next step.
    • If the command returns a different value, the command failed for a different reason.
  2. Check for the Error found when checking commonservice CR in namespace error in the ibmcpd ibmcpd-cr custom resource:
    oc get ibmcpd ibmcpd-cr \
    --namespace=${PROJECT_CPD_INST_OPERANDS} \
    -o json | grep 'Error found when checking commonservice CR in namespace'
    • If the command returns a response, continue to the next step.
    • If the command does not return a response, the command failed for a different reason.
  3. Confirm that the ibm-common-service-operator-service service does not exist:
    oc get svc ibm-common-service-operator-service \
    --namespace=${PROJECT_CPD_INST_OPERATORS}
    The command should return the following response:
    Error from server (NotFound): services "ibm-common-service-operator-service" not found
Resolving the problem
To resolve the problem:
  1. Get the name of the cpd-platform-operator-manager pod:
    oc get pod \
    --namespace=${PROJECT_CPD_INST_OPERATORS} \
    | grep cpd-platform-operator-manager
  2. Delete the cpd-platform-operator-manager pod.

    Replace <pod-name> with the name of the pod returned in the previous step.

    oc delete pod <pod-name> \
    --namespace=${PROJECT_CPD_INST_OPERATORS}
  3. Wait several minutes for the operator to run a reconcile loop.
  4. Confirm that the ibm-common-service-operator-service service exists:
    oc get svc ibm-common-service-operator-service \
    --namespace=${PROJECT_CPD_INST_OPERATORS}
    The command should return a response with the following format:
    NAME                                  TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
    ibm-common-service-operator-service   ClusterIP   198.51.100.255   <none>        443/TCP   1h

After you resolve the issue, you can re-run the cpd-cli manage setup-instance command.

The Switch locations icon is not available if the apply-cr command times out

Applies to: 5.2.0 and later

If you install solutions that are available in different experiences, the Switch locations icon is not available in the web client if the cpd-cli manage apply-cr command times out.

Resolving the problem

Re-run the cpd-cli manage apply-cr command.

Upgrades fail if the Data Foundation Rook Ceph cluster is unstable

Applies to: 5.2.0 and later

If the Red Hat OpenShift Data Foundation or IBM Fusion Data Foundation Rook Ceph® cluster is unstable, upgrades fail.

One symptom is that pods will not start because of a FailedMount error. For example:

Warning  FailedMount  36s (x1456 over 2d1h)   kubelet  MountVolume.MountDevice failed for volume 
"pvc-73bf3705-43e9-40bd-87ed-c1e1656d6f12" : rpc error: code = Aborted desc = an operation with the given 
Volume ID 0001-0011-openshift-storage-0000000000000001-5e17508b-c295-4306-b684-eaa327aec2ab already exists
Diagnosing the problem
To confirm whether the Data Foundation Rook Ceph cluster is unstable:
  1. Ensure that the rook-ceph-tools pod is running.
    oc get pods -n openshift-storage | grep rook-ceph-tools
    Note: On IBM Fusion HCI System or on environments that use hosted control planes, the pods are running in the openshift-storage-client project.
  2. Set the TOOLS_POD environment variable to the name of the rook-ceph-tools pod:
    export TOOLS_POD=<pod-name>
  3. Execute into the rook-ceph-tools pod:
    oc rsh -n openshift-storage ${TOOLS_POD}
  4. Run the following command to get the status of the Rook Ceph cluster:
    ceph status
    Confirm that the output includes the following line:
    health: HEALTH_WARN
  5. Exit the pod:
    exit
Resolving the problem
To resolve the problem:
  1. Get the name of the rook-ceph-mgr pods:
    oc get pods -n openshift-storage | grep rook-ceph-mgr
  2. Set the MGR_POD_A environment variable to the name of the rook-ceph-mgr-a pod:
    export MGR_POD_A=<rook-ceph-mgr-a-pod-name>
  3. Set the MGR_POD_B environment variable to the name of the rook-ceph-mgr-b pod:
    export MGR_POD_B=<rook-ceph-mgr-b-pod-name>
  4. Delete the rook-ceph-mgr-a pod:
    oc delete pods ${MGR_POD_A} -n openshift-storage
  5. Ensure that the rook-ceph-mgr-a pod is running before you move to the next step:
    oc get pods -n openshift-storage | grep rook-ceph-mgr
  6. Delete the rook-ceph-mgr-b pod:
    oc delete pods ${MGR_POD_B} -n openshift-storage
  7. Ensure that the rook-ceph-mgr-b pod is running:
    oc get pods -n openshift-storage | grep rook-ceph-mgr

After you upgrade a Red Hat OpenShift Container Platform cluster, the FoundationDB resource can become unavailable

Applies to: 5.2.0 and later

After you upgrade your cluster to a new version of Red Hat OpenShift Container Platform, the IBM FoundationDB pods can become unavailable. When this issue occurs, services that rely on FoundationDB such as IBM Knowledge Catalog and IBM Match 360 cannot function correctly.

This issue affects deployments of the following services.
  • IBM Knowledge Catalog
  • IBM Match 360
Diagnosing the problem
To identify the cause of this issue, check the FoundationDB status and details.
  1. Check the FoundationDB status.
    oc get fdbcluster -o yaml | grep fdbStatus

    If this command is successful, the returned status is Complete. If the status is InProgress or Failed, proceed to the workaround steps.

  2. If the status is Complete but FoundationDB is still unavailable, log in to one of the FDB pods and check the status details to ensure that the database is available and all coordinators are reachable.
    oc rsh sample-cluster-log-1 /bin/fdbcli

    To check the detailed status of the FDB pod, run fdbcli to enter the FoundationDB command-line interface, then run the following command at the fdb> prompt.

    status details
    • If you get a message that is similar to Could not communicate with a quorum of coordination servers, run the coordinators command with the IP addresses specified in the error message as input.
      oc get pod -o wide | grep storage
      > coordinators IP-ADDRESS-1:4500:tls IP-ADDRESS-2:4500:tls IP-ADDRESS-3:4500:tls 

      If this step does not resolve the problem, proceed to the workaround steps.

    • If you get a different message, such as Recruiting new transaction servers, proceed to the workaround steps.
Resolving the problem
To resolve this issue, restart the FoundationDB pods.

Required role: To complete this task, you must be a cluster administrator.

  1. Restart the FoundationDB cluster pods.
    oc get fdbcluster 
    oc get po | grep ${CLUSTER_NAME} | grep -v backup | awk '{print $1}' | xargs oc delete po

    Replace ${CLUSTER_NAME} in the command with the name of your fdbcluster instance.

  2. Restart the FoundationDB operator pods.
    oc get po | grep fdb-controller | awk '{print $1}' | xargs oc delete po
  3. After the pods finish restarting, check to ensure that FoundationDB is available.
    1. Check the FoundationDB status.
      oc get fdbcluster -o yaml | grep fdbStatus

      The returned status must be Complete.

    2. Check to ensure that the database is available.
      oc rsh sample-cluster-log-1 /bin/fdbcli

      If the database is still not available, complete the following steps.

      1. Log in to the ibm-fdb-controller pod.
      2. Run the fix-coordinator script.
        kubectl fdb fix-coordinator-ips -c ${CLUSTER_NAME} -n ${PROJECT_CPD_INST_OPERATORS}

        Replace ${CLUSTER_NAME} in the command with the name of your fdbcluster instance.
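
In the preceding steps, ${CLUSTER_NAME} must be set manually. If you prefer to set it programmatically, the following sketch assumes that your current project is the instance project and that it contains a single fdbcluster instance:
export CLUSTER_NAME=$(oc get fdbcluster -o jsonpath='{.items[0].metadata.name}')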

Persistent volume claims with the WaitForFirstConsumer volume binding mode are flagged by the installation health checks

Applies to: 5.2.0 and later

When you install IBM Software Hub, the following persistent volume claims are automatically created:
  • ibm-cs-postgres-backup
  • ibm-zen-objectstore-backup-pvc

Both of these persistent volume claims are created with the WaitForFirstConsumer volume binding mode. In addition, both persistent volume claims will remain in the Pending state until you back up your IBM Software Hub installation. This behavior is expected. However, when you run the cpd-cli health operands command, the Persistent Volume Claim Healthcheck fails.

If there are more persistent volume claims returned by the health check, you must investigate further to determine why those persistent volume claims are pending. However, if only the following persistent volume claims are returned, you can ignore the Failed result:

  • ibm-cs-postgres-backup
  • ibm-zen-objectstore-backup-pvc
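
To verify that no other persistent volume claims are stuck, you can list the pending claims in the instance project. Only the two claims in the preceding list should be returned:
oc get pvc -n ${PROJECT_CPD_INST_OPERANDS} | grep Pending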

Node pinning is not applied to postgresql pods

Applies to: 5.2.0 and later

If you use node pinning to schedule pods on specific nodes, and your environment includes postgresql pods, the node affinity settings are not applied to the postgresql pods that are associated with your IBM Software Hub deployment.

The resource specification injection (RSI) webhook cannot patch postgresql pods because the EDB Postgres operator uses a PodDisruptionBudget resource to limit the number of concurrent disruptions to postgresql pods. The PodDisruptionBudget resource prevents postgresql pods from being evicted.
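
If you want to see the resources that block eviction, you can list the PodDisruptionBudget resources in the instance project. The grep pattern in this sketch is an assumption about how the budgets are named in your environment:
oc get pdb -n ${PROJECT_CPD_INST_OPERANDS} | grep -i postgres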

The ibm-nginx deployment does not scale fast enough when automatic scaling is configured

Applies to: 5.2.0 and later

If you configure automatic scaling for IBM Software Hub, the ibm-nginx deployment might not scale fast enough. Some symptoms include:

  • Slow response times
  • High CPU requests are throttled
  • The deployment scales up and down even when the workload is steady

This problem typically occurs when you install watsonx Assistant or watsonx™ Orchestrate.

Resolving the problem
If you encounter the preceding symptoms, you must manually scale the ibm-nginx deployment:
oc patch zenservice lite-cr \
--namespace=${PROJECT_CPD_INST_OPERANDS} \
--type merge \
--patch '{"spec": {
    "Nginx": {
        "name": "ibm-nginx",
        "kind": "Deployment",
        "container": "ibm-nginx-container",
        "replicas": 5,
        "minReplicas": 2,
        "maxReplicas": 11,
        "guaranteedReplicas": 2,
        "metrics": [
            {
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    "target": {
                        "type": "Utilization",
                        "averageUtilization": 529
                    }
                }
            }
        ],
        "resources": {
            "limits": {
                "cpu": "1700m",
                "memory": "2048Mi",
                "ephemeral-storage": "500Mi"
            },
            "requests": {
                "cpu": "225m",
                "memory": "920Mi",
                "ephemeral-storage": "100Mi"
            }
        },
        "containerPolicies": [
            {
                "containerName": "*",
                "minAllowed": {
                    "cpu": "200m",
                    "memory": "256Mi"
                },
                "maxAllowed": {
                    "cpu": "2000m",
                    "memory": "2048Mi"
                },
                "controlledResources": [
                    "cpu",
                    "memory"
                ],
                "controlledValues": "RequestsAndLimits"
            }
        ]
    }
}}'
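
After you apply the patch, you can confirm that the number of ibm-nginx replicas increases by checking the deployment:
oc get deployment ibm-nginx -n ${PROJECT_CPD_INST_OPERANDS}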

Uninstalling IBM watsonx services does not remove the IBM watsonx experience

Applies to: 5.2.0 and later

After you uninstall watsonx.ai™ or watsonx.governance™, the IBM watsonx experience is still available in the web client even though there are no services that are specific to the IBM watsonx experience.

Resolving the problem
To remove the IBM watsonx experience from the web client, an instance administrator must run the following command:
oc delete zenextension wx-perspective-configuration \
--namespace=${PROJECT_CPD_INST_OPERANDS}

Backup and restore issues

Issues that apply to several backup and restore methods

Backup issues
Review the following issues before you create a backup. Do the workarounds that apply to your environment.
  1. IBM Match 360 backup fails with an error stating that there are no matches for elasticsearch.opencontent.ibm.com
  2. After backing up or restoring IBM Match 360, Redis fails to return to a Completed state
  3. Backup fails due to lingering pvc-sysbench-rwo created by storage-performance health check in Data Virtualization
Restore issues
Review the following issues before you restore a backup. Do the workarounds that apply to your environment.
  1. After a restore, OperandRequest timeout error in the ZenService custom resource
  2. After backing up or restoring IBM Match 360, Redis fails to return to a Completed state
  3. SQL30081N RC 115,*,* error for Db2 selectForReceiveTimeout function after instance restore (Fixed in 5.2.1)
  4. Restore fails and displays postRestoreViaConfigHookRule error in Data Virtualization
  5. Error 404 displays after backup and restore in Data Virtualization
  6. The restore process times out while waiting for the ibmcpd status check to complete
  7. Watson OpenScale fails after restore due to Db2 (db2oltp) or Db2 Warehouse (db2wh) configuration
  8. Watson OpenScale fails after restore for some Watson Machine Learning deployments
  9. IBM Software Hub user interface does not load after restoring
  10. Restore to same cluster fails with restore-cpd-volumes error

Backup and restore issues with the OADP utility

Backup issues
Review the following issues before you create a backup. Do the workarounds that apply to your environment.
  1. Offline backup fails with PartiallyFailed error
  2. Backup fails after a service is upgraded and then uninstalled
  3. ObjectBucketClaim is not supported by the OADP utility
  4. Offline backup fails after watsonx.ai is uninstalled
  5. Db2 Big SQL backup pre-hook and post-hook fail during offline backup
  6. OpenSearch operator fails during backup
Restore issues
Review the following issues after you restore a backup. Do the workarounds that apply to your environment.
  1. Restoring Data Virtualization fails with metastore not running or failed to connect to database error
  2. Prompt tuning fails after restoring watsonx.ai

Backup and restore issues with IBM Fusion

Backup issues
Review the following issues before you create a backup. Do the workarounds that apply to your environment.
  1. Db2 backup fails at the Hook: br-service hooks/pre-backup step
  2. wd-discovery-opensearch-client pod fails during backup
  3. IBM Match 360 backup fails if it takes longer than 20 minutes (Fixed in 5.2.2)
  4. Backup service location is unavailable during backup
  5. Backup validation fails with tls: failed to verify certificate: x509 error
Restore issues
Do the workarounds that apply to your environment after you restore a backup.
  1. Watson Discovery fails after restore with opensearch and post_restore unverified components
  2. Watson Discovery fails after restore with mcg and post_restore unverified components

Backup and restore issues with Portworx

Backup issues
Review the following issues before you create a backup. Do the workarounds that apply to your environment.
  1. IBM Software Hub resources are not migrated

ObjectBucketClaim is not supported by the OADP utility

Applies to: 5.2.0 and later

Applies to: Backup and restore with the OADP utility

Diagnosing the problem
If an ObjectBucketClaim is created in an IBM Software Hub instance, it is not included when you create a backup.
Cause of the problem
OADP does not support backup and restore of ObjectBucketClaim.
Resolving the problem
Services that provide the option to use ObjectBuckets must ensure that the ObjectBucketClaim is in a separate namespace and backed up separately.
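
To identify ObjectBucketClaim resources that would not be included in a backup of the instance, you can list them in the instance project. This check is a sketch and assumes that the ObjectBucketClaim custom resource definition is installed on the cluster:
oc get objectbucketclaims -n ${PROJECT_CPD_INST_OPERANDS}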

After a restore, OperandRequest timeout error in the ZenService custom resource

Applies to: 5.2.0 and later

Applies to: All backup and restore methods

Diagnosing the problem
Get the status of the ZenService YAML:
oc get zenservice lite-cr -n ${PROJECT_CPD_INST_OPERANDS} -o yaml

In the output, you see the following error:

...
zenMessage: '5.1.3/roles/0010-infra has failed with error: "OperandRequest" "zen-ca-operand-request":
      Timed out waiting on resource'
...
Check for failing operandrequests:
oc get operandrequests -A
For failing operandrequests, check their conditions for constraints not satisfiable messages:
oc describe -n ${PROJECT_CPD_INST_OPERATORS} <opreq-name>
Cause of the problem
Subscription wait operations timed out. The problematic subscriptions show an error similar to the following example:
'constraints not satisfiable: clusterserviceversion ibm-db2aaservice-cp4d-operator.v5.2.0
      exists and is not referenced by a subscription, @existing/cpd-operators//ibm-db2aaservice-cp4d-operator.v5.2.0
      and ibm-db2aaservice-cp4d-operator-catalog/cpd-operators/v5.2/ibm-db2aaservice-cp4d-operator.v5.2.0
      originate from package ibm-db2aaservice-cp4d-operator, subscription ibm-db2aaservice-cp4d-operator
      requires ibm-db2aaservice-cp4d-operator-catalog/cpd-operators/v5.2/ibm-db2aaservice-cp4d-operator.v5.2.0,
      subscription ibm-db2aaservice-cp4d-operator exists'

This problem is a known issue with Red Hat Operator Lifecycle Manager. For details, see Failed to install or upgrade operator with warning 'clusterserviceversion is not referenced by a subscription'.

Resolving the problem
Do the following steps:
  1. Delete the problematic clusterserviceversions and subscriptions, and restart the Operand Deployment Lifecycle Manager (ODLM) pod.

    For details, follow the steps in the troubleshooting document cloud-native-postgresql operator is installed with the certified-operators catalogsource.

  2. Delete IBM Software Hub instance projects (namespaces).

    For details, see Cleaning up the cluster before a restore.

  3. Retry the restore.

After backing up or restoring IBM Match 360, Redis fails to return to a Completed state

Fixed in: 5.2.1

Applies to: 5.2.0

Applies to: All backup and restore methods

Diagnosing the problem
After a backup or restore of the IBM Match 360 service, its associated Redis pod (mdm-redis) and the IBM Redis CP fail to return to a Completed state.
Cause of the problem
This occurs when Redis CP fails to come out of maintenance mode.
Resolving the problem
To resolve this issue, manually update Redis CP to take it out of maintenance mode:
  1. Open the Redis CP YAML file for editing.
    oc edit rediscp -n ${PROJECT_CPD_INST_OPERANDS} -o yaml
  2. Update the value of the ignoreForMaintenance parameter from true to false.
    ignoreForMaintenance: false
  3. Save the file and wait for the Redis pod to reconcile.
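
As an alternative to editing the YAML interactively in step 1, you can patch the resource directly. This sketch assumes that the ignoreForMaintenance parameter is defined under spec and that <rediscp-name> is the name of your Redis CP resource:
oc patch rediscp <rediscp-name> \
-n ${PROJECT_CPD_INST_OPERANDS} \
--type=merge \
--patch='{"spec":{"ignoreForMaintenance": false}}'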

Watson OpenScale fails after restore due to Db2 (db2oltp) or Db2 Warehouse (db2wh) configuration

Applies to: 5.2.0 and later

Applies to: All backup and restore methods

Diagnosing the problem
After you restore, Watson OpenScale fails due to memory constraints. You might see Db2 (db2oltp) or Db2 Warehouse (db2wh) instances that return 404 errors and pod failures where scikit pods are unable to connect to Apache Kafka.
Cause of the problem
The root cause is typically insufficient memory or temporary table page size settings, which are critical for query execution and service stability.
Resolving the problem
Ensure that the Db2 (db2oltp) or Db2 Warehouse (db2wh) instance is configured with adequate memory resources, specifically:
  • Set the temporary table page size to at least 700 GB during instance setup or reconfiguration.
  • Monitor pod health and Apache Kafka connectivity to verify that dependent services recover after memory allocation is corrected.
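
To check whether the dependent pods recover after you adjust the memory configuration, you can watch the pods that are called out in the symptoms. The grep pattern in this sketch is an assumption about the pod names in your environment:
oc get pods -n ${PROJECT_CPD_INST_OPERANDS} | grep -Ei 'kafka|scikit'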

Watson OpenScale fails after restore for some Watson Machine Learning deployments

Applies to: 5.2.0 and later

Applies to: All backup and restore methods

Diagnosing the problem
After restoring an online backup, automatic payload logging in Watson OpenScale fails for certain Watson Machine Learning deployments.
Cause of the problem
The failure is caused by a timing issue during the restore process. Specifically, the Watson OpenScale automatic setup fails because Watson Machine Learning runtime pods are unable to connect to the underlying Apache Kafka service. This connectivity issue occurs because the pods start before Apache Kafka is fully available.
Resolving the problem
To resolve the issue, restart all wml-dep pods after you restore. This ensures proper Apache Kafka connectivity and allows Watson OpenScale automatic setup to complete successfully:
  1. List all wml-dep pods:
    oc get pods -n=${PROJECT_CPD_INST_OPERANDS} | grep wml-dep
  2. Run the following command for each wml-dep pod to delete it and trigger it to restart:
    oc delete pod <podname> -n=${PROJECT_CPD_INST_OPERANDS}
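
If you prefer to restart all of the wml-dep pods with one command, the following sketch combines the two steps by using the same pattern that is used elsewhere in this document:
oc get pods -n ${PROJECT_CPD_INST_OPERANDS} --no-headers | grep wml-dep | awk '{print $1}' | xargs oc delete pod -n ${PROJECT_CPD_INST_OPERANDS}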

IBM Software Hub user interface does not load after restoring

Fixed in: 5.2.1

Applies to: 5.2.0

Applies to: All backup and restore methods

Diagnosing the problem
The IBM Software Hub user interface doesn't start after a restore. When you try to log in, you see a blank screen.
Cause of the problem
The following pods fail and must be restarted to load the IBM Software Hub user interface:
  • platform-auth-service
  • platform-identity-management
  • platform-identity-provider
Resolving the problem
Run the following command to restart the pods:
for po in $(oc get po -l icpdsupport/module=im -n ${PROJECT_CPD_INST_OPERANDS} --no-headers | awk '{print $1}' | grep -v oidc); do oc delete po -n ${PROJECT_CPD_INST_OPERANDS} ${po};done;

Restore to same cluster fails with restore-cpd-volumes error

Fixed in: 5.2.1

Applies to: 5.2.0

Applies to: All backup and restore methods

Diagnosing the problem
IBM Software Hub fails to restore, and you see an error similar to the following example:
error: DataProtectionPlan=cpd-offline-tenant/restore-service-orchestrated-parent-workflow, Action=restore-cpd-volumes (index=1)
error: expected restore phase to be Completed, received PartiallyFailed
Cause of the problem
The ibm-streamsets-static-content pod is mounted as read-only, which causes the restore to partially fail.
Resolving the problem
Before you back up, exclude the IBM StreamSets pod:
  1. Exclude the ibm-streamsets-static-content pod:
    oc label pod -n ${PROJECT_CPD_INST_OPERANDS} -l icpdsupport/addOnId=streamsets,app.kubernetes.io/name=ibm-streamsets-static-content velero.io/exclude-from-backup=true
  2. Take another backup and follow the backup and restore procedure for your environment.

IBM Match 360 backup fails with an error stating that there are no matches for elasticsearch.opencontent.ibm.com

Fixed in: 5.2.1

Applies to: 5.2.0

Applies to: All backup and restore methods

Diagnosing the problem
When you attempt to perform an offline backup of the IBM Match 360 service, the backup will fail with the following error message:
error performing op preBackupViaConfigHookRule for resource mdm (configmap=mdm-cpd-aux-br-cm): no matches for elasticsearch.opencontent.ibm.com/, Resource=elasticsearchclusters
Cause of the problem
The IBM Match 360 backup ConfigMap is configured incorrectly.
Resolving the problem
To resolve this problem, update the IBM Match 360 backup ConfigMap to correct the issue:
  1. Get the IBM Match 360 operator version:
    export mdm_operator_pod=`oc get pods -n ${PROJECT_CPD_INST_OPERATORS}|grep mdm-operator-controller-manager|awk '{print $1}'`
    echo mdm_operator_version=`oc exec -it $mdm_operator_pod -n ${PROJECT_CPD_INST_OPERATORS} -- ls /opt/ansible/roles  |awk '{print $1}'`
  2. Patch the backup template by running the following command:
    oc exec -it $mdm_operator_pod -n ${PROJECT_CPD_INST_OPERATORS} -- sed -i 's/elasticsearchclusters.elasticsearch.opencontent.ibm.com/clusters.opensearch.cloudpackopen.ibm.com/g' /opt/ansible/roles/<mdm_operator_version>/addon/templates/mdm-backup-restore-extension-configmap.yaml.j2
  3. Complete a redundant update in the IBM Match 360 CR (mdm-cr) to trigger operator reconciliation.
    oc patch mdm mdm-cr --type=merge -p '{"spec":{"onboard":{"timeout_seconds": "800" }}}' -n ${PROJECT_CPD_INST_OPERANDS}
  4. Wait for mdm-cr to reconcile itself. Proceed to the next step when mdm-cr is in a Completed state.
  5. Start the backup process.

Db2 backup fails at the Hook: br-service hooks/pre-backup step

Applies to: 5.2.0 and later

Applies to: Backup and restore with IBM Fusion

Diagnosing the problem
In the cpdbr-oadp.log file, you see messages like in the following example:
time=<timestamp> level=info msg=podName: c-db2oltp-5179995-db2u-0, podIdx: 0, container: db2u, actionIdx: 0, commandString: ksh -lc 'manage_snapshots --action suspend --retry 3', command: [sh -c ksh -lc 'manage_snapshots --action suspend --retry 3'], onError: Fail, singlePodOnly: false, timeout: 20m0s func=cpdbr-oadp/pkg/quiesce.executeCommand file=/go/src/cpdbr-oadp/pkg/quiesce/ruleexecutor.go:767
time=<timestamp> level=info msg=cmd stdout:  func=cpdbr-oadp/pkg/quiesce.executeCommand file=/go/src/cpdbr-oadp/pkg/quiesce/ruleexecutor.go:823
time=<timestamp> level=info msg=cmd stderr: [<timestamp>] - INFO: Setting wolverine to disable
Traceback (most recent call last):
  File "/usr/local/bin/snapshots", line 33, in <module>
    sys.exit(load_entry_point('db2u-containers==1.0.0.dev1', 'console_scripts', 'snapshots')())
  File "/usr/local/lib/python3.9/site-packages/cli/snapshots.py", line 35, in main
    snap.suspend_writes(parsed_args.retry)
  File "/usr/local/lib/python3.9/site-packages/snapshots/snapshots.py", line 86, in suspend_writes
    self._wolverine.toggle_state(enable=False, message="Suspend writes")
  File "/usr/local/lib/python3.9/site-packages/utils/wolverine/wolverine.py", line 73, in toggle_state
    self._toggle_state(state, message)
  File "/usr/local/lib/python3.9/site-packages/utils/wolverine/wolverine.py", line 77, in _toggle_state
    self._cmdr.execute(f'wvcli system {state} -m "{message}"')
  File "/usr/local/lib/python3.9/site-packages/utils/command_runner/command.py", line 122, in execute
    raise CommandException(err)
utils.command_runner.command.CommandException: Command failed to run:ERROR:root:HTTPSConnectionPool(host='localhost', port=9443): Read timed out. (read timeout=15)
Cause of the problem
The Wolverine high availability monitoring process was in a RECOVERING state before the backup was taken.

Check the Wolverine status by running the following command:

wvcli system status
Example output:
ERROR:root:REST server timeout: https://localhost:9443/status
ERROR:root:Retrying Request: https://localhost:9443/status
ERROR:root:REST server timeout: https://localhost:9443/status
ERROR:root:Retrying Request: https://localhost:9443/status
ERROR:root:REST server timeout: https://localhost:9443/status
ERROR:root:Retrying Request: https://localhost:9443/status
ERROR:root:REST server timeout: https://localhost:9443/status
HA Management is RECOVERING at <timestamp>.
The Wolverine log file /mnt/blumeta0/wolverine/logs/ha.log shows errors like in the following example:
<timestamp> [ERROR] <MainProcess:11490> [wolverine.ha.loop(loop.py:414)] -  check_and_recover: unhealthy_dm_set = {('c-db2oltp-5179995-db2u-0', 'node')}
<timestamp> [ERROR] <MainProcess:11490> [wolverine.ha.loop(loop.py:416)] - (c-db2oltp-5179995-db2u-0, node) : not OK
<timestamp> [ERROR] <MainProcess:11490> [wolverine.ha.loop(loop.py:421)] -  check_and_recover: unhealthy_dm_names = {'node'}
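
The wvcli commands in this section are run inside the affected Db2 pod. The following sketch assumes the pod name from the example log output; replace it with the name of your Db2 pod:
oc rsh -n ${PROJECT_CPD_INST_OPERANDS} c-db2oltp-5179995-db2u-0
wvcli system status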
Resolving the problem
Do the following steps:
  1. Re-initialize Wolverine:
    wvcli system init --force
  2. Wait until the Wolverine status is RUNNING. Check the status by running the following command:
    wvcli system status
  3. Retry the backup.

wd-discovery-opensearch-client pod fails during backup

Fixed in: 5.2.1

Applies to: 5.2.0 and later

Applies to: Backup and restore with IBM Fusion

Diagnosing the problem
To diagnose the problem, complete the following steps:
  1. Check the backup and restore log:
    oc -n ${PROJECT_CPD_INST_OPERANDS} exec -it $(oc -n ${PROJECT_CPD_INST_OPERANDS} get pods -licpdsupport/app=br-service -oname) -- cat cpdbr-oadp.log
  2. Look for an error similar to the following example:
    [2025-05-23T22:43:31,094][WARN ][r.suppressed             ] [wd-discovery-opensearch-client-000] path: /_snapshot/cloudpak/cloudpak_snapshot_2025-05-23-22-43-31, params: {pretty=, repository=cloudpak, wait_for_completion=true, snapshot=cloudpak_snapshot_2025-05-23-22-43-31}
    org.opensearch.transport.RemoteTransportException: [wd-discovery-opensearch-master-000][10.128.7.239:9300][cluster:admin/snapshot/create]
    Caused by: org.opensearch.repositories.RepositoryException: [cloudpak] Could not read repository data because the contents of the repository do not match its expected state. This is likely the result of either concurrently modifying the contents of the repository by a process other 
    than this cluster or an issue with the repository's underlying storage. The repository has been disabled to prevent corrupting its contents. To re-enable it and continue using it please remove the repository from the cluster and add it again to make the cluster recover the known state of the repository from its physical contents.
Cause of the problem
Snapshot data might be corrupted.
Resolving the problem
Complete the following steps:
  1. Quiesce the OpenSearch cluster:
    oc patch cluster wd-discovery-opensearch --type merge --patch '{"spec": {"quiesce": true}}'
  2. Wait until OpenSearch pods are deleted. Run the following command to check the pods are deleted:
    oc get pods -l app.kubernetes.io/managed-by=ibm-opensearch-operator,job-name!=wd-discovery-opensearch-snapshot-repo
    You'll see output similar to the following example:
    oc get pods -l app.kubernetes.io/managed-by=ibm-opensearch-operator,job-name!=wd-discovery-opensearch-snapshot-repo
    wd-discovery-opensearch-client-000                             1/1     Terminating   0             18h
    wd-discovery-opensearch-client-001                             1/1     Terminating   0             18h
    wd-discovery-opensearch-data-000                               1/1     Terminating   0             18h
    wd-discovery-opensearch-data-001                               1/1     Terminating   0             18h
    wd-discovery-opensearch-master-000                             1/1     Terminating   0             18h
    wd-discovery-opensearch-master-001                             1/1     Terminating   0             18h
    wd-discovery-opensearch-master-002                             1/1     Terminating   0             18h
  3. Unquiesce the OpenSearch cluster:
    oc patch cluster -n ${PROJECT_CPD_INST_OPERANDS} wd-discovery-opensearch --type merge --patch '{"spec": {"quiesce": false}}'
  4. Wait until OpenSearch pods are running:
    oc get pods -l app.kubernetes.io/managed-by=ibm-opensearch-operator,job-name!=wd-discovery-opensearch-snapshot-repo
    You'll see output similar to the following example:
    oc get pods -l app.kubernetes.io/managed-by=ibm-opensearch-operator,job-name!=wd-discovery-opensearch-snapshot-repo
    NAME                                 READY   STATUS    RESTARTS   AGE
    wd-discovery-opensearch-client-000   1/1     Running   0          2m9s
    wd-discovery-opensearch-client-001   1/1     Running   0          2m8s
    wd-discovery-opensearch-data-000     1/1     Running   0          2m8s
    wd-discovery-opensearch-data-001     1/1     Running   0          2m8s
    wd-discovery-opensearch-master-000   1/1     Running   0          2m8s
    wd-discovery-opensearch-master-001   1/1     Running   0          2m8s
    wd-discovery-opensearch-master-002   1/1     Running   0          2m7s
  5. Retry the backup.

IBM Match 360 backup fails if it takes longer than 20 minutes

Fixed in: 5.2.2

Applies to: 5.2.1

Applies to: Backup with IBM Fusion

Diagnosing the problem
When you attempt to perform a backup of the IBM Match 360 service, the backup fails for both offline and online strategies during the post-backup phase if it does not complete within the 20-minute timeout period.
Cause of the problem
The IBM Match 360 backup does not have a long enough post-backup hook timeout value configured by default.
Resolving the problem
Before starting the backup process, resolve this problem by updating the IBM Match 360 backup configuration. To configure a longer timeout value:
  1. Patch the IBM Match 360 CR (mdm-cr) to update the offline backup timeout value:
    oc patch mdm mdm-cr --type=merge -p '{"spec":{"backup_restore:":{"br_timeout":{"offline_backup":{"post_backup":{"br_status_update": "2400s" }}}}}}' -n ${PROJECT_CPD_INST_OPERANDS}
  2. Patch the IBM Match 360 CR (mdm-cr) to update the online backup timeout value:
    oc patch mdm mdm-cr --type=merge -p '{"spec":{"backup_restore:":{"br_timeout":{"online_backup":{"post_backup":{"br_status_update": "2400s" }}}}}}' -n ${PROJECT_CPD_INST_OPERANDS}
  3. Wait for mdm-cr to reconcile itself. Proceed to the next step when mdm-cr is in a Completed state.
  4. Start the backup process.

Backup service location is unavailable during backup

Applies to: 5.2.0 and later

Applies to: Backup with IBM Fusion

If your cluster is running Red Hat OpenShift Container Platform 4.19 with IBM Fusion 2.11 and OADP 1.5.1, the backup service location might enter an Unavailable state. To resolve this issue, upgrade OADP to version 1.5.2.
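
To check which OADP version is installed before you upgrade, you can list the operator ClusterServiceVersion. This sketch assumes that ${OADP_PROJECT} is set to the project where the OADP operator is installed:
oc get csv -n ${OADP_PROJECT} | grep -i oadp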

Backup validation fails with tls: failed to verify certificate: x509 error

Fixed in: 5.2.2

Applies to: 5.2.0 and 5.2.1

Applies to: Backup with IBM Fusion

Diagnosing the problem
The backup fails during validation and you receive the following error message:
tls: failed to verify certificate: x509: certificate signed by unknown authority func=cpdbr-oadp/cmd.Execute
Cause of the problem
The backup storage location uses a self-signed certificate. IBM Fusion did not recognize the certificate authority, so the connection and the backup validation fail.
Resolving the problem
Run the following script to add the --insecure-skip-tls-verify flag to the backup validation command:
brServiceHooksIndex=$(oc get frcpe ibmcpd-tenant -n ${PROJECT_CPD_INST_OPERATORS} -o json | jq '.spec.hooks | map(.name == "br-service-hooks") | index(true)')
resourceValidationIndex=$(oc get frcpe ibmcpd-tenant -n ${PROJECT_CPD_INST_OPERATORS} -o json | jq '.spec.hooks[] | select(.name == "br-service-hooks") | .ops | map(.name == "resource-validation") | index(true)')
resourceValidationCmd=$(oc get frcpe ibmcpd-tenant -n ${PROJECT_CPD_INST_OPERATORS} -o jsonpath="{.spec.hooks[$brServiceHooksIndex].ops[$resourceValidationIndex].command}")
newResourceValidationCmd=$(echo $resourceValidationCmd " --insecure-skip-tls-verify")
echo "cmd was: $resourceValidationCmd"
echo "cmd now: $newResourceValidationCmd"
echo "press any key to continue to patch fusion recipe..."
read

oc patch frcpe ibmcpd-tenant -n ${PROJECT_CPD_INST_OPERATORS} --type json -p "[{\"path\": \"/spec/hooks/$brServiceHooksIndex/ops/$resourceValidationIndex/command\", \"value\": \"$newResourceValidationCmd\", \"op\": \"replace\"}]"

currentCmdFromRecipe=$(oc get frcpe ibmcpd-tenant -n ${PROJECT_CPD_INST_OPERATORS} -o jsonpath="{.spec.hooks[$brServiceHooksIndex].ops[$resourceValidationIndex].command}")
echo "current command now from parent-recipe = '$currentCmdFromRecipe'"
echo "note: please refrain to run this script more than once"
echo "to reset the fusion software hub recipe (if you need to after this modification), please use 'cpd-cli oadp generate plan fusion parent-recipe --tenant-operator-namespace=\${PROJECT_CPD_INST_OPERATORS} --verbose --log-level=debug' command as stated in the documentation https://www.ibm.com/docs/en/software-hub/5.2.x?topic=software-installing-configuring-backup-restore-fusion"

Watson Discovery fails after restore with opensearch and post_restore unverified components

Applies to: 5.2.0 and later

Applies to: Backup and restore with IBM Fusion

Diagnosing the problem
After you restore, Watson Discovery becomes stuck with the following components listed as unverifiedComponents:
unverifiedComponents:
- opensearch
- post_restore
Additionally, the OpenSearch client pod might show an unknown container status, similar to the following example:
NAME                                 READY   STATUS                    RESTARTS   AGE
wd-discovery-opensearch-client-000   0/1     ContainerStatusUnknown    0          11h
Cause of the problem
The post_restore component depends on the opensearch component being verified. However, the OpenSearch client pod is not running, which prevents verification and causes the restore process to stall.
Resolving the problem
Manually delete the OpenSearch client pod to allow it to restart:
oc delete -n ${PROJECT_CPD_INST_OPERANDS} pod wd-discovery-opensearch-client-000

After the pod is restarted and verified, the post_restore component should complete the verification process.

Watson Discovery fails after restore with mcg and post_restore unverified components

Fixed in: 5.2.1

Applies to: 5.2.0

Applies to: Backup and restore with IBM Fusion

Diagnosing the problem
After you restore, Watson Discovery becomes stuck with the following components listed as unverifiedComponents:
unverifiedComponents:
- ingestion
- mcg
- post_restore
Additionally, the wd-discovery-s3-bucket-job job might show a failed status, similar to the following example:
NAME                                 READY   STATUS                    RESTARTS   AGE
wd-discovery-s3-bucket-job   Failed   0/1   9d   9d
Cause of the problem
The post_restore component depends on the mcg component being verified. However, during the post restore process, the retry feature for the mcg verification job is disabled. If the job fails on its first attempt, it will not retry automatically, causing the verification process to fail.
Resolving the problem
Manually delete the wd-discovery-s3-bucket-job job to allow it to restart:
oc delete -n ${PROJECT_CPD_INST_OPERANDS} job wd-discovery-s3-bucket-job

After the job is restarted and verified, the post_restore component should complete the verification process.

IBM Software Hub resources are not migrated

Applies to: 5.2.0 and later

Applies to: Portworx asynchronous disaster recovery

Diagnosing the problem
When you use Portworx asynchronous disaster recovery, the migration finishes almost immediately, and neither the volumes nor the expected resources are migrated. Run the following command:
storkctl get migrations -n ${PX_ADMIN_NS}
Tip: ${PX_ADMIN_NS} is usually kube-system.
Example output:
NAME                                                CLUSTERPAIR       STAGE   STATUS       VOLUMES   RESOURCES   CREATED               ELAPSED                       TOTAL BYTES TRANSFERRED
cpd-tenant-migrationschedule-interval-<timestamp>   mig-clusterpair   Final   Successful   0/0       0/0         <timestamp>   Volumes (0s) Resources (3s)   0
Cause of the problem
This problem occurs starting with stork 23.11.0. Backup exec rules are not run, and expected IBM Software Hub resources are not migrated.
Resolving the problem
To resolve the problem, downgrade stork to a version prior to 23.11.0. For more information about stork releases, see the stork Releases page.
  1. Scale down the Portworx operator so that it doesn't reset manual changes to the stork deployment:
    oc scale -n ${PX_ADMIN_NS} deploy portworx-operator --replicas=0
  2. Edit the stork deployment image version to a version prior to 23.11.0:
    oc edit deploy -n ${PX_ADMIN_NS} stork
  3. If you need to scale up the Portworx operator, run the following command.
    Note: The Portworx operator will undo changes to the stork deployment and return to the original stork version.
    oc scale -n ${PX_ADMIN_NS} deploy portworx-operator --replicas=1
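
As an alternative to editing the deployment interactively in step 2, you can set the image directly. This sketch assumes that the container in the stork deployment is named stork; replace <stork-image>:<version> with a stork image tag prior to 23.11.0:
oc set image deploy/stork -n ${PX_ADMIN_NS} stork=<stork-image>:<version>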

Prompt tuning fails after restoring watsonx.ai

Applies to: 5.2.2 and later

Applies to: Backup and restore with the OADP utility

Diagnosing the problem
When you try to create a prompt tuning experiment, you see the following error message:
An error occurred while processing prompt tune training.
Resolving the problem
Do the following steps:
  1. Restart the caikit operator:
    oc rollout restart deployment caikit-runtime-stack-operator -n ${PROJECT_CPD_INST_OPERATORS}

    Wait at least 2 minutes for the cais fmaas custom resource to become healthy.

  2. Check the status of the cais fmaas custom resource by running the following command:
    oc get cais fmaas -n ${PROJECT_CPD_INST_OPERANDS}
  3. Retry the prompt tuning experiment.

Restoring Data Virtualization fails with metastore not running or failed to connect to database error

Applies to: 5.2.2 and later

Applies to: Online backup and restore with the OADP utility

Diagnosing the problem
View the status of the restore by running the following command:
cpd-cli oadp tenant-restore status ${TENANT_BACKUP_NAME}-restore --details
The output shows errors like in the following examples:
time=<timestamp>  level=INFO msg=Verifying if Metastore is listening
SERVICE              HOSTNAME                               NODE      PID  STATUS
Standalone Metastore c-db2u-dv-hurricane-dv                   -        -   Not running
time=<timestamp>  level=ERROR msg=Failed to connect to BigSQL database
* error performing op postRestoreViaConfigHookRule for resource dv (configmap=cpd-dv-aux-ckpt-cm): 1 error occurred:
* error executing command su - db2inst1 -c '/db2u/scripts/bigsql-exec.sh /usr/ibmpacks/current/bigsql/bigsql/bigsql-cli/BIGSQL/package/scripts/bigsql-db2ubar-hook.sh -H POST -M RESTORE -L' (container=db2u podIdx=0 podName=c-db2u-dv-db2u-0 namespace=<namespace-name> auxMetaName=dv-aux component=dv actionIdx=0): command terminated with exit code 1
Cause of the problem
A timing issue causes restore posthooks to fail at the step where the posthooks check for the results of the db2 connect to bigsql command. The db2 connect to bigsql command has failed because bigsql is restarting at around the same time.
Resolving the problem
Run the following command:
export CPDBR_ENABLE_FEATURES=experimental
cpd-cli oadp tenant-restore create ${TENANT_RESTORE_NAME}-cont \
--from-tenant-backup ${TENANT_BACKUP_NAME} \
--verbose \
--log-level debug \
--start-from cpd-post-restore-hooks

Offline backup fails with PartiallyFailed error

Applies to: 5.2.2 and later

Applies to: Offline backup and restore with the OADP utility

Diagnosing the problem
In the Velero logs, you see errors like in the following example:
time="<timestamp>" level=error msg="Current BackupStorageLocations available/unavailable/unknown: 0/0/1)" controller=backup-storage-location logSource="/remote-source/velero/app/pkg/controller/backup_storage_location_controller.go:180"
time="<timestamp>" level=error msg="error encountered while scanning stdout" backupLocation=oadp-operator/dpa-sample-1 cmd=/plugins/velero-plugin-for-aws controller=backup-sync error="read |0: file already closed" logSource="/remote-source
/velero/app/pkg/plugin/clientmgmt/process/logrus_adapter.go:90"
time="<timestamp>" level=error msg="Restic command fail with ExitCode: 1. Process ID is 906, Exit error is: exit status 1" logSource="/remote-source/velero/app/pkg/util/exec/exec.go:66"
time="<timestamp>" level=error msg="pod volume backup failed: data path backup failed: error running restic backup command restic backup --repo=s3:http://minio-velero.apps.jctesti23.examplehost.example.com/velero/cpdbackup/restic/cpd-instance --pa
ssword-file=/tmp/credentials/oadp-operator/velero-repo-credentials-repository-password --cache-dir=/scratch/.cache/restic . --tag=pod-uid=1ed9d52f-2f6d-4978-930a-4d8e30acced1 --tag=pvc-uid=76b76bc4-27d4-4369-886c-1272dfdf9ce9 --tag=volume=cc-home-p
vc-vol --tag=backup=cpd-tenant-vol-485eef74-efbe-11ef-b2bd-00000a0b44c3 --tag=backup-uid=b55d6323-9875-4afe-b605-646250cbd55c --tag=ns=cpd-instance --tag=pod=cpdbr-vol-mnt --host=velero --json with error: exit status 3 stderr: {\"message_type\":\"e
rror\",\"error\":{},\"during\":\"scan\",\"item\":\"/host_pods/1ed9d52f-2f6d-4978-930a-4d8e30acced1/volumes/kubernetes.io~nfs/pvc-76b76bc4-27d4-4369-886c-1272dfdf9ce9/.scripts/system\"}\n{\"message_type\":\"error\",\"error\":{},\"during\":\"archival
\",\"item\":\".scripts/system\"}\n{\"message_type\":\"error\",\"error\":{\"Op\":\"open\",\"Path\":\"_global_/security/artifacts/metakey\",\"Err\":13},\"during\":\"archival\",\"item\":\"/host_pods/1ed9d52f-2f6d-4978-930a-4d8e30acced1/volumes/kuberne
tes.io~nfs/pvc-76b76bc4-27d4-4369-886c-1272dfdf9ce9/_global_/security/artifacts/metakey\"}\nWarning: at least one source file could not be read\n" backup=oadp-operator/cpd-tenant-vol-485eef74-efbe-11ef-b2bd-00000a0b44c3 logSource="/remote-source/ve
lero/app/pkg/podvolume/backupper.go:328"
time="<timestamp>" level=error msg="pod volume backup failed: data path backup failed: error running restic backup command restic backup --repo=s3:http://minio-velero.apps.jctesti23.cp.fyre.ibm.com/velero/cpdbackup/restic/cpd-instance --pa
ssword-file=/tmp/credentials/oadp-operator/velero-repo-credentials-repository-password --cache-dir=/scratch/.cache/restic . --tag=pod=cpdbr-vol-mnt --tag=pod-uid=1ed9d52f-2f6d-4978-930a-4d8e30acced1 --tag=pvc-uid=93e9e23c-d80a-49cc-80bb-31a36524e0d
c --tag=volume=data-rabbitmq-ha-0-vol --tag=backup=cpd-tenant-vol-485eef74-efbe-11ef-b2bd-00000a0b44c3 --tag=backup-uid=b55d6323-9875-4afe-b605-646250cbd55c --tag=ns=cpd-instance --host=velero --json with error: exit status 3 stderr: {\"message_typ
e\":\"error\",\"error\":{\"Op\":\"open\",\"Path\":\".erlang.cookie\",\"Err\":13},\"during\":\"archival\",\"item\":\"/host_pods/1ed9d52f-2f6d-4978-930a-4d8e30acced1/volumes/kubernetes.io~nfs/pvc-93e9e23c-d80a-49cc-80bb-31a36524e0dc/.erlang.cookie\"}
\nWarning: at least one source file could not be read\n" backup=oadp-operator/cpd-tenant-vol-485eef74-efbe-11ef-b2bd-00000a0b44c3 logSource="/remote-source/velero/app/pkg/podvolume/backupper.go:328"
Cause of the problem
The restic folder was deleted after backups were cleaned up (deleted). This problem is a Velero known issue. For more information, see velero does not recreate restic|kopia repository from manifest if its directories are deleted on s3.
Resolving the problem
Do the following steps:
  1. Get the list of backup repositories:
    oc get backuprepositories -n ${OADP_OPERATOR_NAMESPACE} -o yaml
  2. Check for old or invalid object storage URLs.
  3. Check that the object storage path is in the backuprepositories custom resource.
  4. Check that the <objstorage>/<bucket>/<prefix>/restic/<namespace>/config file exists.

    If the file does not exist, make sure that you do not share the same <objstorage>/<bucket>/<prefix> with another cluster, and specify a different <prefix>.

  5. Delete backup repositories that are invalid for the following reasons:
    • The path does not exist anymore in the object storage.
    • The restic/<namespace>/config file does not exist.
    oc delete backuprepositories -n ${OADP_OPERATOR_NAMESPACE} <backup-repository-name>
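
To check whether the restic config object exists in your object storage, you can list it directly. This sketch assumes that the AWS CLI is installed and configured with the credentials for your object storage, and it uses the same placeholder notation as the preceding steps:
aws s3 ls s3://<bucket>/<prefix>/restic/<namespace>/config --endpoint-url <objstorage-url>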

Backup fails after a service is upgraded and then uninstalled

Applies to: 5.2.2 and later

Applies to: Backup and restore with the OADP utility

Diagnosing the problem
The problem occurs after a service was upgraded from IBM Cloud Pak® for Data 4.8.x to IBM Software Hub 5.2.1 and later uninstalled. When you try to take a backup by running the cpd-cli oadp tenant-backup command, the backup fails. In the CPD-CLI*.log file, you see an error message like in the following example:
Error: global registry check failed: 1 error occurred:
        * error from addOnId=watsonx_ai: 2 errors occurred:
        * failed to find aux configmap 'cpd-watsonxai-maint-aux-ckpt-cm' in tenant service namespace='<namespace_name>': : configmaps "cpd-watsonxai-maint-aux-ckpt-cm" not found
        * failed to find aux configmap 'cpd-watsonxai-maint-aux-br-cm' in tenant service namespace='<namespace_name>': : configmaps "cpd-watsonxai-maint-aux-br-cm" not found




[ERROR] <timestamp> RunPluginCommand:Execution error:  exit status 1
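To confirm the cause, check whether the auxiliary ConfigMaps that are named in the error message exist. The following check assumes that ${PROJECT_CPD_INST_OPERANDS} is the tenant service namespace that is shown in the error:
# Both ConfigMaps are expected to be reported as not found when this issue occurs.
oc get cm cpd-watsonxai-maint-aux-ckpt-cm cpd-watsonxai-maint-aux-br-cm \
-n ${PROJECT_CPD_INST_OPERANDS}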
Resolving the problem
Re-run the backup command with the --skip-registry-check option. For example:
cpd-cli oadp tenant-backup create ${TENANT_OFFLINE_BACKUP_NAME} \
--namespace ${OADP_PROJECT} \
--vol-mnt-pod-mem-request=1Gi \
--vol-mnt-pod-mem-limit=4Gi \
--tenant-operator-namespace ${PROJECT_CPD_INST_OPERATORS} \
--mode offline \
--skip-registry-check \
--image-prefix=registry.redhat.io/ubi9 \
--log-level=debug \
--verbose &> ${TENANT_OFFLINE_BACKUP_NAME}.log&
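Because the command runs in the background and redirects all output to a log file, you can follow its progress by tailing that file. For example:
tail -f ${TENANT_OFFLINE_BACKUP_NAME}.log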

Offline backup fails after watsonx.ai is uninstalled

Applies to: 5.2.0 and later

Applies to: Offline backup and restore with the OADP utility

Diagnosing the problem
The problem occurs when you try to take an offline backup after watsonx.ai was uninstalled. The backup process fails when post-backup hooks are run. In the CPD-CLI*.log file, you see error messages similar to the following example:
time=<timestamp> level=info msg=cmd stderr: <timestamp> [emerg] 233346#233346: host not found in upstream "wx-inference-proxyservice:18888" in /nginx_data/extensions/upstreams/latest-510_watsonxaiifm-routes_ie_226.conf:10
nginx: [emerg] host not found in upstream "wx-inference-proxyservice:18888" in /nginx_data/extensions/upstreams/latest-510_watsonxaiifm-routes_ie_226.conf:10
nginx: configuration file /usr/local/openresty/nginx/conf/nginx.conf test failed
 func=cpdbr-oadp/pkg/quiesce.executeCommand file=/a/workspace/oadp-upload/pkg/quiesce/ruleexecutor.go:824
time=<timestamp> level=warning msg=failed to get exec hook JSON result for container=ibm-nginx-container podIdx=0 podName=ibm-nginx-fd79d5686-cdpnj namespace=latest-510 auxMetaName=lite-maint-aux component=lite-maint actionIdx=0 err=could not find JSON exec hook result func=cpdbr-oadp/pkg/quiesce.executeCommand file=/a/workspace/oadp-upload/pkg/quiesce/ruleexecutor.go:852
time=<timestamp> level=warning msg=no exec hook JSON result found for container=ibm-nginx-container podIdx=0 podName=ibm-nginx-fd79d5686-cdpnj namespace=latest-510 auxMetaName=lite-maint-aux component=lite-maint actionIdx=0 func=cpdbr-oadp/pkg/quiesce.executeCommand file=/a/workspace/oadp-upload/pkg/quiesce/ruleexecutor.go:855
time=<timestamp> level=info msg=exit executeCommand func=cpdbr-oadp/pkg/quiesce.executeCommand file=/a/workspace/oadp-upload/pkg/quiesce/ruleexecutor.go:860
time=<timestamp> level=error msg=command terminated with exit code 1
Cause of the problem
After watsonx.ai is uninstalled, the nginx configuration in the ibm-nginx pod is not cleaned up, and the pod fails.
Resolving the problem
Restart all ibm-nginx pods.
oc delete pod \
-n=${PROJECT_CPD_INST_OPERANDS} \
-l component=ibm-nginx
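After you delete the pods, you can optionally wait for the replacement ibm-nginx pods to report Ready before you rerun the backup. For example:
oc wait --for=condition=Ready pod \
-n=${PROJECT_CPD_INST_OPERANDS} \
-l component=ibm-nginx \
--timeout=10m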

Db2 Big SQL backup pre-hook and post-hook fail during offline backup

Applies to: 5.2.0 and later

Applies to: Offline backup and restore with the OADP utility

Diagnosing the problem
In the db2diag logs of the Db2 Big SQL head pod, you see error messages similar to the following example when backup pre-hooks are running:
<timestamp>          LEVEL: Event
PID     : 3415135              TID : 22544119580160  PROC : db2star2
INSTANCE: db2inst1             NODE : 000
HOSTNAME: c-bigsql-<xxxxxxxxxxxxxxx>-db2u-0
FUNCTION: DB2 UDB, base sys utilities, DB2StartMain, probe:5692
MESSAGE : ZRC=0xFFFFFBD0=-1072
          SQL1072C  The request failed because the database manager resources
          are in an inconsistent state. The database manager might have been
          incorrectly terminated, or another application might be using system
          resources in a way that conflicts with the use of system resources by
          the database manager.
Cause of the problem
The Db2 database was unable to start because of error SQL1072C. As a result, the bigsql start command that runs as part of the post-backup hook hangs, which causes the post-hook to time out. The post-hook cannot succeed until Db2 is brought back to a stable state and the bigsql start command runs successfully. The Db2 Big SQL instance is left in an unstable state.
Resolving the problem
Do one or both of the following troubleshooting and cleanup procedures.
Tip: For more information about the SQL1072C error code and how to resolve it, see SQL1000-1999 in the Db2 documentation.
Remove all the database manager processes running under the Db2 instance ID
Do the following steps:
  1. Log in to the Db2 Big SQL head pod:
    oc -n ${PROJECT_CPD_INST_OPERANDS} rsh $(oc -n ${PROJECT_CPD_INST_OPERANDS} get pods | grep -i c-bigsql | grep -i db2u-0 | cut -d' ' -f 1) bash
  2. Switch to the db2inst1 user:
    su - db2inst1
  3. List all the database manager processes that are running under db2inst1:
    db2_ps
  4. Remove these processes (see the sketch after this list):
    kill -9 <process-ID>
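The following optional sketch combines steps 3 and 4. It assumes that db2_ps prints a header line followed by ps -f style output with the process ID in the second column; confirm the column position in your environment before you run it.
# Run as db2inst1 in the Db2 Big SQL head pod. Skip the header line, extract
# the assumed PID column, and force-terminate each database manager process.
db2_ps | awk 'NR>1 {print $2}' | xargs -r kill -9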
Ensure that no other application is running under the Db2 instance ID, and then remove all resources owned by the Db2 instance ID
Do the following steps:
  1. Log in to the Db2 Big SQL head pod:
    oc -n ${PROJECT_CPD_INST_OPERANDS} rsh $(oc -n ${PROJECT_CPD_INST_OPERANDS} get pods | grep -i c-bigsql | grep -i db2u-0 | cut -d' ' -f 1) bash
  2. Switch to the db2inst1 user:
    su - db2inst1
  3. List all IPC resources owned by db2inst1:
    ipcs | grep db2inst1
  4. Remove these resources by ID, where <resource-ID> is the shared memory, semaphore, or message queue ID that ipcs reports (see the sketch after this list):
    ipcrm -[q|m|s] <resource-ID>
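The following optional sketch removes all IPC resources that are owned by db2inst1 in one pass. It assumes GNU/Linux ipcs output, where the resource ID is the second column; confirm the column position on your system before you run it.
# Remove shared memory segments, semaphores, and message queues owned by
# db2inst1, using the ID column (assumed to be column 2) of the ipcs output.
ipcs -m | awk '/db2inst1/ {print $2}' | xargs -r -n1 ipcrm -m
ipcs -s | awk '/db2inst1/ {print $2}' | xargs -r -n1 ipcrm -s
ipcs -q | awk '/db2inst1/ {print $2}' | xargs -r -n1 ipcrm -q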

OpenSearch operator fails during backup

Applies to: 5.2.0 and later

Applies to: Offline backup and restore with the OADP utility

Diagnosing the problem
After a restore, check the list of restored indices in OpenSearch. The issue is present if only indices that start with a dot (for example, .ltstore) were restored and the expected non-dot-prefixed indices are missing.
Cause of the problem
During backup and restore, the OpenSearch operator restores either the indices that start with "." or the indices that do not, but not both. This behavior affects Watson Discovery deployments, where both types of indices are expected to be restored.
Resolving the problem
Complete the following steps to resolve the issue:
  1. Access the client pod:
    oc rsh -n ${PROJECT_CPD_INST_OPERANDS} wd-discovery-opensearch-client-000
  2. Set the credentials and repository:
    user=$(find /workdir/internal_users/ -mindepth 1 -maxdepth 1 | head -n 1 | xargs basename)
    password=$(cat "/workdir/internal_users/${user}")
    repo="cloudpak"
  3. Get the latest snapshot:
    last_snapshot=$(curl --retry 5 --retry-delay 5 --retry-all-errors -k -X GET "https://${user}:${password}@${OS_HOST}/_cat/snapshots/${repo}?h=id&s=end_epoch" | tail -n1)
  4. Check that the latest snapshot was saved:
    echo $last_snapshot
  5. Restore the snapshot:
    curl -k -X POST "https://${user}:${password}@${OS_HOST}/_snapshot/${repo}/${last_snapshot}/_restore?wait_for_completion=true" \
      -d '{"indices": "-.*","include_global_state": false}' \
      -H 'Content-Type: application/json'
    This command can take a while to run before any output is shown. After it completes, you will see output similar to the following example:
    {
        "snapshot": {
            "snapshot": "cloudpak_snapshot_2025-06-10-13-30-45",
            "indices": [
                "966d1979-52e8-6558-0000-019759db7bdc_notice",
                "b49b2470-70c3-4ba1-9bd2-a16d72ffe49f_curations",
                ...
            ],
            "shards": {
                "total": 142,
                "failed": 0,
                "successful": 142
            }
        }
    }

This restores all indices, including both dot-prefixed and non-dot-prefixed ones.
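To verify that the restore included the non-dot-prefixed indices, you can list all indices from the same client pod. The following check reuses the ${user}, ${password}, and ${OS_HOST} values that were set in the earlier steps:
# Non-dot-prefixed index names should now appear alongside the dot-prefixed ones.
curl -k -X GET "https://${user}:${password}@${OS_HOST}/_cat/indices?v"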

Security issues

Security scans return an Inadequate Account Lockout Mechanism message

Applies to: 5.2.0 and later

Diagnosing the problem
If you run a security scan against IBM Software Hub, the scan returns the following message:
Inadequate Account Lockout Mechanism
Resolving the problem
This behavior is by design. It is strongly recommended that you use an enterprise-grade solution, such as SAML SSO or an LDAP provider, for password management.

The Kubernetes version information is disclosed

Applies to: 5.2.0 and later

Diagnosing the problem
If you run an Aqua Security scan against your cluster, the scan reports that the Kubernetes version information is disclosed.
Resolving the problem
This behavior is expected. For more information, see the following solution document on the Red Hat Customer Portal: Hide kubernetes /version API endpoint in OpenShift Container Platform 4.
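If you want to reproduce what the scan reports, you can query the Kubernetes version endpoint directly. The following optional check assumes that you are logged in to the cluster with the oc CLI:
# The /version endpoint is typically readable without authentication, which is
# what the scan flags.
curl -ks "$(oc whoami --show-server)/version"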