Known issues and limitations for IBM Software Hub

The following issues apply to the IBM® Software Hub platform. Each issue includes information about the releases that it applies to. If the issue was fixed in a refresh, that information is also included.

Customer-reported issues

Issues that are found after the release are posted on the IBM Support site.

General issues

Known issue with 5.3.1 Patch 4

Applies to: 5.3.1 Patch 4

If Data Virtualization is installed on your environment, users might not be able to access catalogs after you apply IBM Software Hub Version 5.3.1 Patch 4.

Warning: Review the following known issue on the IBM Support site before you install, upgrade to, or apply Patch 4: DT470317: Catalog UI page fails with 502 Bad Gateway error when a missing user group is used in a catalog

After rebooting a cluster that uses OpenShift Data Foundation storage, some IBM Software Hub services aren't functional

Applies to: 5.3.0 and later

Diagnosing the problem
After rebooting the cluster, some IBM Software Hub custom resources remain in the InProgress state.

For more information about this problem, see Missing NodeStageVolume RPC call blocks new pods from going into Running state in the Red Hat® OpenShift® Data Foundation 4.1.4 release notes.

Workaround
To enable the pods to come up after a reboot:
  1. Find the nodes that have pods that are in an Error state:
    oc get po -n ${PROJECT_CPD_INST_OPERANDS} -o wide | grep -v -P "Completed|(\d+)\/\1"
  2. Mark each node as unschedulable.
    oc adm cordon <node_name>
  3. Delete the affected pods:
    oc get pod -n ${PROJECT_CPD_INST_OPERANDS} --no-headers | grep -Ev "Comp|0/0|1/1|2/2|3/3|4/4|5/5|6/6|7/7" | awk '{print $1}' | xargs oc delete po -n ${PROJECT_CPD_INST_OPERANDS} --force=true --grace-period=0
  4. Mark each node as schedulable:
    oc adm uncordon <node_name>
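
The pod filter in step 3 can also be written as a small function, which makes it easier to review the list before any pods are deleted. The following is a minimal sketch, not part of the product. The function name filter_unhealthy_pods is hypothetical, and the sketch assumes that the input uses the default oc get pods --no-headers layout, where NAME, READY, and STATUS are the first three columns.

```shell
# Hypothetical helper: reads `oc get pods --no-headers` output on stdin and
# prints the names of pods that are neither Completed nor fully ready.
filter_unhealthy_pods() {
  awk '$3 != "Completed" {
         split($2, ready, "/")            # READY column, for example 1/2
         if (ready[1] != ready[2]) print $1
       }'
}

# Example usage (requires a logged-in oc session):
#   oc get pods -n "${PROJECT_CPD_INST_OPERANDS}" --no-headers | filter_unhealthy_pods
```

Printing the names first, and piping them to oc delete only after you review them, reduces the risk of force-deleting a healthy pod.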

Export jobs are not deleted if the export specification file includes incorrectly formatted JSON

Applies to: 5.3.0 and later

When you export data from IBM Software Hub, you must create an export specification file. The export specification file includes a JSON string that defines the data that you want to export.

If the JSON string is incorrectly formatted, the export will fail. In addition, if you try to run the cpd-cli export-import export delete command, the command completes without returning any errors, but the export job is not deleted.

Diagnosing the problem
  1. To confirm that the export job was not deleted, run:
    cpd-cli export-import export status \
    --namespace=${PROJECT_CPD_INST_OPERANDS} \
    --profile=${CPD_PROFILE_NAME}
Resolving the problem
  1. Use the OpenShift Container Platform command-line interface to delete the failed export job:
    1. Set the EXPORT_JOB environment variable to the name of the job that you want to delete:
      export EXPORT_JOB=<export-job-name>
    2. To remove the export job, run:
      oc delete job ${EXPORT_JOB} \
      --namespace=${PROJECT_CPD_INST_OPERANDS}
  2. If you want to use the export specification file to export data, fix the JSON formatting issues.
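
Because a formatting error causes both the export failure and the silent delete failure, it can help to check the JSON before you run the export. The following is a minimal sketch under stated assumptions: the function name is_valid_json is hypothetical, and the sketch assumes that the JSON string has been copied into its own scratch file and that jq or python3 is installed on the workstation.

```shell
# Hypothetical helper: returns 0 if the file contains well-formed JSON.
# Uses jq when it is available and falls back to the Python standard library.
is_valid_json() {
  if command -v jq >/dev/null 2>&1; then
    jq . "$1" >/dev/null 2>&1
  elif command -v python3 >/dev/null 2>&1; then
    python3 -m json.tool "$1" >/dev/null 2>&1
  else
    echo "Neither jq nor python3 is installed" >&2
    return 2
  fi
}

# Example usage (export-data.json is a placeholder file name):
#   is_valid_json export-data.json && echo "JSON is well formed"
```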

The health service-functionality check for common core services fails

Applies to: 5.3.1 and later

Occasionally, the cpd-cli health service-functionality command fails on the Get Deleted Data Asset step with an Unexpected Asset Found response while checking the health of common core services. The failure occurs because of latency in propagating the asset deletion to the Global Search index, which causes the command to detect the deleted asset during asset validation.

This failure is not indicative of a problem with common core services or the Global Search index. There is no impact on the functionality of the service if this check fails.

The health service-functionality check for IBM Master Data Management times out

Applies to: 5.3.0 and later

Occasionally, the cpd-cli health service-functionality command times out, displaying a 401 Unauthorized response, while checking the health of the IBM Master Data Management service. If this occurs, re-run the command to complete the health check.
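
If the timeout is intermittent, the re-run can be wrapped in a small retry loop. The following is a minimal sketch, not an IBM-provided tool. The function name retry and its two arguments (number of attempts, delay in seconds) are hypothetical.

```shell
# Hypothetical helper: runs a command up to <attempts> times, pausing
# <delay> seconds between attempts, and stops at the first success.
retry() {
  local attempts=$1
  local delay=$2
  shift 2
  local n=1
  while ! "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "Command failed after ${n} attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Example usage; append the options that you normally pass to the command:
#   retry 3 30 cpd-cli health service-functionality
```

A bounded number of attempts keeps a persistent failure from looping forever, so a real problem still surfaces as a non-zero exit code.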

The health service-functionality check for watsonx Code Assistant for Z Agentic fails

Applies to: 5.3.0

Fixed in: 5.3.1

The cpd-cli health service-functionality command fails on the Agentic Chat Check step while checking the health of the watsonx Code Assistant™ for Z Agentic service.

This failure is not indicative of a problem with watsonx Code Assistant for Z Agentic. There is no impact on the functionality of the service if this check fails.

To avoid this failure:
  • Use --services to specify the list of services while running the command.
  • Do not include the watsonx Code Assistant for Z Agentic service key in the list.

The health service-functionality check for watsonx.ai fails on the Inferencing on Text Generation Models with Guardrails enabled step

Applies to: 5.3.0

Fixed in: 5.3.1

When you have watsonx Code Assistant for Z Agentic installed, the cpd-cli health service-functionality command fails on the Inferencing on Text Generation Models with Guardrails enabled step while checking the health of the watsonx.ai™ service.

This failure is not indicative of a problem with watsonx.ai. There is no impact on the functionality of the service if this check fails.

To avoid this failure:
  • Use --services to specify the list of services while running the command.
  • Do not include these service keys in the list:
    • watsonx.ai
    • watsonx Code Assistant for Z Agentic

The health service-functionality check for watsonx.ai fails on the Inferencing on Text Chat Models step

Applies to: 5.3.1

The cpd-cli health service-functionality check for watsonx.ai fails on the Inferencing on Text Chat Models step while inferencing text on the ibm/granite-20b-code-8k-ansible and ibm/granite-20b-code-javaenterprise-v2 models. The failure occurs only on these models, if they are installed. The health check continues to run successfully on the other installed models.

This failure is not indicative of a problem with watsonx.ai. There is no impact on the functionality of the service if this check fails.

The health service-functionality check for watsonx.data times out

Applies to: 5.3.1

Occasionally, the cpd-cli health service-functionality command times out while checking the health of the watsonx.data™ service. If this occurs, re-run the command to complete the health check.

The health service-functionality check fails on clusters with HAProxy configuration

Applies to: 5.3.1 and later

On clusters that have HAProxy configured, the cpd-cli health service-functionality check fails with an End-of-File (EOF) error on long-running requests. The error occurs because HAProxy reloads every 10 seconds, and each reload disrupts active connections.

As a workaround, reconfigure the cluster router to increase the reload interval from 10 seconds to 1 minute before you run the cpd-cli health service-functionality command.

  1. Define and run the patch_router function to reconfigure the cluster router:

    patch_router() {
      local timeout="5m"
      oc patch ingresscontroller/default -n openshift-ingress-operator \
      --type=merge \
      -p '{"spec":{"tuningOptions":{"reloadInterval":"1m"},"idleConnectionTerminationPolicy":"Deferred"}}'
    
      echo "Waiting for router rollout to start..."
      sleep 15
      
      echo "Waiting for router rollout to complete..."
      if ! oc -n openshift-ingress rollout status deployment/router-default --timeout="$timeout"; then
        echo "Warning: Router rollout failed or timed out." >&2
        return 1
      fi
      
      echo "Rollout completed. Waiting 60s for HAProxy config propagation and connection stabilization..."
      sleep 60
      
      echo "Router patch fully applied and stabilized."
    }
    
    # Call the function
    patch_router

    The command also restarts the deployments associated with the HAProxy server.

  2. Run the cpd-cli health service-functionality command.

This patch persists across cluster restarts and needs to be applied only once per cluster.
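
Because the patch is needed only once per cluster, you might want to check whether it is already in place before you run it again. The following is a minimal sketch. The function names are hypothetical, and the comparison assumes that the API reports the interval exactly as it was set (1m) rather than in a normalized form.

```shell
# Hypothetical helper: prints the reloadInterval that is currently set on the
# default ingress controller. Empty output means that no value is set.
router_reload_interval() {
  oc get ingresscontroller/default -n openshift-ingress-operator \
    -o jsonpath='{.spec.tuningOptions.reloadInterval}'
}

# Hypothetical helper: succeeds only if the interval matches the expected
# value (default 1m). Assumes the value is reported exactly as it was set.
router_is_patched() {
  [ "$(router_reload_interval)" = "${1:-1m}" ]
}

# Example usage (requires a logged-in oc session):
#   router_is_patched || patch_router
```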

Installation and upgrade issues

The apply-cluster-components command returns an error when a third-party certificate manager is installed

Applies to: 5.3.0

Fixed in: 5.3.1

If you install a certificate manager other than the Red Hat OpenShift Container Platform certificate manager, the apply-cluster-components command fails with the following error:

The certificate manager check returned the following result: 127
error: the server doesn't have a resource type "certmanager"
Resolving the problem
To continue the installation:
  1. Get the name of the olm-utils-v4 container:
    Docker
    docker ps
    Podman
    podman ps
  2. Execute into the olm-utils-v4 container:
    Docker
    docker exec -it <olm-utils-container-name> bash
    Podman
    podman exec -it <olm-utils-container-name> bash
  3. Run the following command to modify the bin/apply-cluster-components script:
    sed -i 's/^\([ \t]*\)rh_cert_manager_cr=.*/\1rh_cert_manager_cr=""/'  bin/apply-cluster-components
  4. Run the following command:
    grep 'rh_cert_manager_cr=' bin/apply-cluster-components
    Confirm that the command returns:
    rh_cert_manager_cr=""
  5. Type exit to exit the container.
  6. Re-run the apply-cluster-components command.
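
To see what the sed expression in step 3 does before you run it inside the container, you can try it on a scratch file. The following is a minimal sketch. The contents of the scratch file are made up for illustration and do not reproduce the real bin/apply-cluster-components script.

```shell
# Scratch file that stands in for bin/apply-cluster-components.
# The assignment below is made up for illustration only.
scratch=$(mktemp)
cat > "$scratch" <<'EOF'
check_cert_manager() {
    rh_cert_manager_cr=$(lookup_cert_manager_cr)
    echo "${rh_cert_manager_cr}"
}
EOF

# Same expression as in step 3, written to stdout instead of editing in place.
patched=$(sed 's/^\([ \t]*\)rh_cert_manager_cr=.*/\1rh_cert_manager_cr=""/' "$scratch")

# Only the assignment line changes, and its indentation is preserved.
printf '%s\n' "$patched" | grep 'rh_cert_manager_cr='
rm -f "$scratch"
```

The captured group keeps the leading whitespace, so the replacement line stays aligned with the surrounding code in the script.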

The install-components command fails when you set a resource quota on the operands project

Applies to: 5.3.0

Fixed in: 5.3.1

If you create a resource quota (ResourceQuota) on the operands project, the install-components command fails with one of the following errors:
  • During installation:
    Error: failed post-install: 1 error occurred:
    	* timed out waiting for the condition
  • During upgrade:
    Error: UPGRADE FAILED: post-upgrade hooks failed: 1 error occurred:
    	* timed out waiting for the condition
Resolving the problem
  1. Get the current requests and limits from the resource quota:
    oc get resourcequota \
    --namespace=${PROJECT_CPD_INST_OPERANDS}

    The command returns output in the following format:

    zen    cpd-quota    4s    requests.cpu: 76280m/200, requests.memory: 287349460172800m/1200Gi   limits.cpu: 207875m/800, limits.memory: 606134Mi/1800Gi
  2. Create a limit range for resources in the operands project.

    Use the following command as an example. Adjust the requests and limits based on the values set in the resource quota.

    cat << EOF | oc apply -f -
    apiVersion: v1
    kind: LimitRange
    metadata:
      name: cpu-resource-limits
      namespace: ${PROJECT_CPD_INST_OPERANDS}
    spec:
      limits:
      - default:
          cpu: 300m
          memory: 200Mi
        defaultRequest:
          cpu: 200m
          memory: 200Mi
        type: Container
    EOF

    The values in the preceding example are based on the following values:

    Type             Resource quota      Limit range
    CPU request      76280m              200m
    CPU limit        207875m             300m
    Memory request   287349460172800m    200Mi
    Memory limit     606134Mi            200Mi

The installation or upgrade completes after you create the limit range.

The Switch locations icon is not available if the apply-cr command times out

Applies to: 5.3.0 and later

If you install solutions that are available in different experiences and the cpd-cli manage apply-cr command times out, the Switch locations icon is not available in the web client.

Resolving the problem

Re-run the cpd-cli manage apply-cr command.

Upgrades fail if the Data Foundation Rook Ceph cluster is unstable

Applies to: 5.3.0 and later

If the Red Hat OpenShift Data Foundation or IBM Fusion Data Foundation Rook Ceph® cluster is unstable, upgrades fail.

One symptom is that pods will not start because of a FailedMount error. For example:

Warning  FailedMount  36s (x1456 over 2d1h)   kubelet  MountVolume.MountDevice failed for volume 
"pvc-73bf3705-43e9-40bd-87ed-c1e1656d6f12" : rpc error: code = Aborted desc = an operation with the given 
Volume ID 0001-0011-openshift-storage-0000000000000001-5e17508b-c295-4306-b684-eaa327aec2ab already exists
Diagnosing the problem
To confirm whether the Data Foundation Rook Ceph cluster is unstable:
  1. Ensure that the rook-ceph-tools pod is running.
    oc get pods -n openshift-storage | grep rook-ceph-tools
    Note: On IBM Fusion HCI System or on environments that use hosted control planes, the pods are running in the openshift-storage-client project.
  2. Set the TOOLS_POD environment variable to the name of the rook-ceph-tools pod:
    export TOOLS_POD=<pod-name>
  3. Execute into the rook-ceph-tools pod:
    oc rsh -n openshift-storage ${TOOLS_POD}
  4. Run the following command to get the status of the Rook Ceph cluster:
    ceph status
    Confirm that the output includes the following line:
    health: HEALTH_WARN
  5. Exit the pod:
    exit
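
The health value can also be read without an interactive session. The following is a minimal sketch. The function name ceph_health is hypothetical, and the sketch assumes that the ceph status output contains a line that starts with health:, as shown in step 4.

```shell
# Hypothetical helper: reads `ceph status` output on stdin and prints the
# value of the health field, for example HEALTH_OK or HEALTH_WARN.
ceph_health() {
  awk '{ sub(/\r$/, "") }                 # drop a trailing carriage return
       $1 == "health:" { print $2; exit }'
}

# Example usage (TOOLS_POD as set in step 2):
#   oc rsh -n openshift-storage "${TOOLS_POD}" ceph status | ceph_health
```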
Resolving the problem
To resolve the problem:
  1. Get the name of the rook-ceph-mgr pods:
    oc get pods -n openshift-storage | grep rook-ceph-mgr
  2. Set the MGR_POD_A environment variable to the name of the rook-ceph-mgr-a pod:
    export MGR_POD_A=<rook-ceph-mgr-a-pod-name>
  3. Set the MGR_POD_B environment variable to the name of the rook-ceph-mgr-b pod:
    export MGR_POD_B=<rook-ceph-mgr-b-pod-name>
  4. Delete the rook-ceph-mgr-a pod:
    oc delete pods ${MGR_POD_A} -n openshift-storage
  5. Ensure that the rook-ceph-mgr-a pod is running before you move to the next step:
    oc get pods -n openshift-storage | grep rook-ceph-mgr
  6. Delete the rook-ceph-mgr-b pod:
    oc delete pods ${MGR_POD_B} -n openshift-storage
  7. Ensure that the rook-ceph-mgr-b pod is running:
    oc get pods -n openshift-storage | grep rook-ceph-mgr

After you upgrade a Red Hat OpenShift Container Platform cluster, the FoundationDB resource can become unavailable

Applies to: 5.3.0 and later

After you upgrade your cluster to a new version of Red Hat OpenShift Container Platform, the IBM FoundationDB pods can become unavailable. When this issue occurs, services that rely on FoundationDB such as IBM Knowledge Catalog and IBM Master Data Management cannot function correctly.

This issue affects deployments of the following services.
  • IBM Knowledge Catalog
  • IBM Master Data Management
Diagnosing the problem
To identify the cause of this issue, check the FoundationDB status and details.
  1. Check the FoundationDB status.
    oc get fdbcluster -o yaml | grep fdbStatus

    If this command is successful, the returned status is Complete. If the status is InProgress or Failed, proceed to the workaround steps.

  2. If the status is Complete but FoundationDB is still unavailable, log in to one of the FDB pods and check the status details to ensure that the database is available and all coordinators are reachable.
    oc rsh sample-cluster-log-1 /bin/fdbcli

    To check the detailed status of the FDB pod, run fdbcli to enter the FoundationDB command-line interface, then run the following command at the fdb> prompt.

    status details
    • If you get a message that is similar to Could not communicate with a quorum of coordination servers, run the coordinators command with the IP addresses specified in the error message as input.
      oc get pod -o wide | grep storage
      > coordinators IP-ADDRESS-1:4500:tls IP-ADDRESS-2:4500:tls IP-ADDRESS-3:4500:tls 

      If this step does not resolve the problem, proceed to the workaround steps.

    • If you get a different message, such as Recruiting new transaction servers, proceed to the workaround steps.
Resolving the problem
To resolve this issue, restart the FoundationDB pods.

Required role: To complete this task, you must be a cluster administrator.

  1. Restart the FoundationDB cluster pods.
    oc get fdbcluster 
    oc get po | grep ${CLUSTER_NAME} | grep -v backup | awk '{print $1}' | xargs oc delete po

    Replace ${CLUSTER_NAME} in the command with the name of your fdbcluster instance.

  2. Restart the FoundationDB operator pods.
    oc get po | grep fdb-controller | awk '{print $1}' | xargs oc delete po
  3. After the pods finish restarting, check to ensure that FoundationDB is available.
    1. Check the FoundationDB status.
      oc get fdbcluster -o yaml | grep fdbStatus

      The returned status must be Complete.

    2. Check to ensure that the database is available.
      oc rsh sample-cluster-log-1 /bin/fdbcli

      If the database is still not available, complete the following steps.

      1. Log in to the ibm-fdb-controller pod.
      2. Run the fix-coordinator script.
        kubectl fdb fix-coordinator-ips -c ${CLUSTER_NAME} -n ${PROJECT_CPD_INST_OPERATORS}

        Replace ${CLUSTER_NAME} in the command with the name of your fdbcluster instance.
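
The pod selection in step 1 of this procedure can be previewed before anything is deleted. The following is a minimal sketch. The function name select_fdb_pods is hypothetical, and the sketch assumes that the pod name is the first column of the oc get po --no-headers output.

```shell
# Hypothetical helper: reads `oc get po --no-headers` output on stdin and
# prints the names of pods that match the given fdbcluster instance name,
# skipping backup pods. The pod name is the first column.
select_fdb_pods() {
  grep -- "$1" | grep -v backup | awk '{ print $1 }'
}

# Example usage: review the list, then pipe it to `xargs oc delete po`.
#   oc get po --no-headers | select_fdb_pods "${CLUSTER_NAME}"
```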

Persistent volume claims with the WaitForFirstConsumer volume binding mode are flagged by the installation health checks

Applies to: 5.3.0 and later

When you install IBM Software Hub, the following persistent volume claims are automatically created:
  • ibm-cs-postgres-backup
  • ibm-zen-objectstore-backup-pvc

Both of these persistent volume claims are created with the WaitForFirstConsumer volume binding mode. In addition, both persistent volume claims will remain in the Pending state until you back up your IBM Software Hub installation. This behavior is expected. However, when you run the cpd-cli health operands command, the Persistent Volume Claim Healthcheck fails.

If there are more persistent volume claims returned by the health check, you must investigate further to determine why those persistent volume claims are pending. However, if only the following persistent volume claims are returned, you can ignore the Failed result:

  • ibm-cs-postgres-backup
  • ibm-zen-objectstore-backup-pvc
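
The decision rule above can be sketched as a filter that removes the two expected names from a list of pending persistent volume claims. This is an illustrative sketch only. The function name unexpected_pending_pvcs is hypothetical, and the usage comment assumes that STATUS is the second column of the oc get pvc output.

```shell
# Hypothetical helper: reads pending PVC names on stdin (one per line) and
# prints only the names that are not expected to be pending.
unexpected_pending_pvcs() {
  grep -vxF -e 'ibm-cs-postgres-backup' \
            -e 'ibm-zen-objectstore-backup-pvc' || true
}

# Example usage: list Pending PVCs in the operands project, then filter them.
#   oc get pvc -n "${PROJECT_CPD_INST_OPERANDS}" --no-headers \
#     | awk '$2 == "Pending" { print $1 }' | unexpected_pending_pvcs
```

Empty output means that only the two backup claims are pending and the Failed result can be ignored. Any name that is printed needs further investigation.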

Node pinning is not applied to postgresql pods

Applies to: 5.3.0 and later

If you use node pinning to schedule pods on specific nodes, and your environment includes postgresql pods, the node affinity settings are not applied to the postgresql pods that are associated with your IBM Software Hub deployment.

The resource specification injection (RSI) webhook cannot patch postgresql pods because the EDB Postgres operator uses a PodDisruptionBudget resource to limit the number of concurrent disruptions to postgresql pods. The PodDisruptionBudget resource prevents postgresql pods from being evicted.

The ibm-nginx deployment does not scale fast enough when automatic scaling is configured

Applies to: 5.3.0 and later

If you configure automatic scaling for IBM Software Hub, the ibm-nginx deployment might not scale fast enough. Some symptoms include:

  • Slow response times
  • High CPU requests are throttled
  • The deployment scales up and down even when the workload is steady

This problem typically occurs when you install watsonx Assistant or watsonx™ Orchestrate.

Resolving the problem
If you encounter the preceding symptoms, you must manually scale the ibm-nginx deployment:
oc patch zenservice lite-cr \
--namespace=${PROJECT_CPD_INST_OPERANDS} \
--type merge \
--patch '{"spec": {
    "Nginx": {
        "name": "ibm-nginx",
        "kind": "Deployment",
        "container": "ibm-nginx-container",
        "replicas": 5,
        "minReplicas": 2,
        "maxReplicas": 11,
        "guaranteedReplicas": 2,
        "metrics": [
            {
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    "target": {
                        "type": "Utilization",
                        "averageUtilization": 529
                    }
                }
            }
        ],
        "resources": {
            "limits": {
                "cpu": "1700m",
                "memory": "2048Mi",
                "ephemeral-storage": "500Mi"
            },
            "requests": {
                "cpu": "225m",
                "memory": "920Mi",
                "ephemeral-storage": "100Mi"
            }
        },
        "containerPolicies": [
            {
                "containerName": "*",
                "minAllowed": {
                    "cpu": "200m",
                    "memory": "256Mi"
                },
                "maxAllowed": {
                    "cpu": "2000m",
                    "memory": "2048Mi"
                },
                "controlledResources": [
                    "cpu",
                    "memory"
                ],
                "controlledValues": "RequestsAndLimits"
            }
        ]
    }
}}'

Uninstalling IBM watsonx services does not remove the IBM watsonx experience

Applies to: 5.3.0 and later

After you uninstall watsonx.ai or watsonx.governance™, the IBM watsonx experience is still available in the web client even though there are no services that are specific to the IBM watsonx experience.

Resolving the problem
To remove the IBM watsonx experience from the web client, an instance administrator must run the following command:
oc delete zenextension wx-perspective-configuration \
--namespace=${PROJECT_CPD_INST_OPERANDS}

After running the apply-patch command, you might need to restart the wml-deployment-manager deployment

Applies to: 5.3.1 Patch 1

Fixed in: 5.3.1 Patch 2

If your IBM Software Hub Version 5.3.1 installation includes Watson Machine Learning, you might need to restart the Watson Machine Learning wml-deployment-manager deployment after you run the apply-patch command to apply Version 5.3.1 Patch 1.

This issue does not occur if IBM Software Hub Version 5.3.1 Patch 1 is applied automatically when you install or upgrade IBM Software Hub.

Diagnosing the problem
If the wml-deployment-manager pod starts before the ibm-nginx pods are in the Running state, the wml-deployment-manager pod logs include the following error:
Rejected authentication POST https://ibm-nginx-svc

To check for this error:

  1. Run the following command to search the logs:
    oc logs -n ${PROJECT_CPD_INST_OPERANDS} \
    $(oc get pods -n ${PROJECT_CPD_INST_OPERANDS} -l app=wml-deployment-manager -o NAME) \
    | grep "POST https://ibm-nginx-svc"
  2. Review the response returned by the command:
    • If the response contains only the following phrase, no action is required:
      Defaulted container "runtimemanagercontainer" out of: runtimemanagercontainer, init-container (init)
    • If the response contains the following phrase, proceed to Resolving the problem:
      Rejected authentication POST https://ibm-nginx-svc
Resolving the problem
To resolve the problem, restart the wml-deployment-manager deployment:
oc rollout restart deploy wml-deployment-manager \
-n ${PROJECT_CPD_INST_OPERANDS}
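
The diagnosis and the restart can be combined so that the deployment is restarted only when the error is present. The following is a minimal sketch. The function name has_rejected_auth_error is hypothetical, and the usage comment assumes that oc logs can resolve the deployment to its pod.

```shell
# Hypothetical helper: reads pod logs on stdin and succeeds only if they
# contain the authentication error that is described in this issue.
has_rejected_auth_error() {
  grep -qF 'Rejected authentication POST https://ibm-nginx-svc'
}

# Example usage (requires a logged-in oc session):
#   if oc logs -n "${PROJECT_CPD_INST_OPERANDS}" deploy/wml-deployment-manager \
#        | has_rejected_auth_error; then
#     oc rollout restart deploy wml-deployment-manager \
#       -n "${PROJECT_CPD_INST_OPERANDS}"
#   fi
```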

Backup and restore issues

Issues that apply to several backup and restore methods

Backup issues
Review the following issues before you create a backup. Do the workarounds that apply to your environment.
  1. IBM Master Data Management CR becomes stuck after a restore
  2. Backup pre-check fails for db2oltp, db2wh, or db2aaservice on upgraded cluster with proxy enabled
  3. Identity resources have file names that are too long
  4. Backup precheck fails for watsonx Code Assistant and watsonx Code Assistant for Red Hat Ansible Lightspeed due to ClusterRole configuration
  5. Backup fails due to lingering pvc-sysbench-rwo created by storage-performance health check in Data Virtualization
  6. Restore of IBM Master Data Management fails if RabbitMQ Helm secrets are not excluded before creating the backup (fixed in 5.3.1)
Restore issues
Review the following issues before you restore a backup. Do the workarounds that apply to your environment.
  1. Informix custom resource remains in InProgress status after restore
  2. Python tools are missing in watsonx Orchestrate after restoring to different cluster
  3. Domain agent connections are not displayed in watsonx Orchestrate after restoring to different cluster
  4. Cannot create connections in watsonx Orchestrate after restoring to a different cluster
  5. Extra connections appear on target cluster after restoring watsonx.data intelligence and IBM Manta Data Lineage
  6. Common core services CR is recreated instead of restored during restore to the same cluster
  7. IBM watsonx Orchestrate fails to deploy agents after restoring data to a different cluster
  8. Neo4j cluster in Failed status after restore
  9. During a restore, the IBM Master Data Management CR fails with an error stating that a conditional check failed
  10. After a service upgrade, restoring IBM Master Data Management fails with an error stating that mdm-cr cannot be found (Fixed in 5.3.1 Patch 1)
  11. After a restore, OperandRequest timeout error in the ZenService custom resource
  12. SQL30081N RC 115,*,* error for Db2 selectForReceiveTimeout function after instance restore (Fixed in 5.2.1)
  13. Restore fails and displays postRestoreViaConfigHookRule error in Data Virtualization
  14. Error 404 displays after backup and restore in Data Virtualization
  15. The restore process times out while waiting for the ibmcpd status check to complete
  16. Watson OpenScale fails after restore due to Db2 (db2oltp) or Db2 Warehouse (db2wh) configuration
  17. FoundationDB cluster is stuck in stage_mirror: post-restore state after restore from Online backup

Backup and restore issues with the OADP utility

Backup issues
Review the following issues before you create a backup. Do the workarounds that apply to your environment.
  1. Restore fails for IBM Knowledge Catalog because PostgreSQL cluster cannot reach a healthy state
  2. Backup fails for watsonx Assistant due to ConfigMap issues
  3. Offline backup fails for cpd-ikc-ikc-aux-br-cm ConfigMap
  4. Backup pre-check fails for Db2 Data Management Console due to timeout for status check
  5. Unstructured Data Integration backup fails at pre-backup hooks phase
  6. Backup fails on upgraded cluster due to EDB Postgres Enterprise ConfigMap timeout error
  7. Backup validation fails for Data Virtualization due to missing label in dvendpoint PVC
  8. Offline backups fail for Db2 Data Management Console due to incorrect CR names in the ConfigMap
  9. Model Gateway online backup fails at checkpoint due to missing shell in container image
  10. Backup pre-check fails for Db2 Data Management Console in REST mode
  11. OpenPages instance fails to start after restore due to missing labels on secrets
  12. Offline backup fails with PartiallyFailed error
  13. ObjectBucketClaim is not supported by the OADP utility
  14. Db2 Big SQL backup pre-hook and post-hook fail during offline backup
  15. OpenSearch operator fails during backup
Restore issues
Review the following issues after you restore a backup. Do the workarounds that apply to your environment.
  1. Custom resource for watsonx.ai IFM is created instead of restored when restoring SemanticAutomation
  2. Custom resources for watsonx.data intelligence and Data Refinery are created instead of restored
  3. Custom resource for watsonx.ai IFM is created instead of restored
  4. Custom resource for common core services fails to restore due to missing nginx routes
  5. Custom resource for common core services stuck at InProgress status after rebooting cluster or restoring
  6. User interface for IBM Software Hub displays Internal Server Error after offline restore
  7. Offline restore fails with OIDC client registration error
  8. Restoring Data Virtualization fails with metastore not running or failed to connect to database error
  9. Prompt tuning fails after restoring watsonx.ai
  10. Restic backup that contains dynamically provisioned volumes in Amazon Elastic File System fails during restore

Backup and restore issues with IBM Fusion

Backup issues
Review the following issues before you create a backup. Do the workarounds that apply to your environment.
  1. Backup of IBM Master Data Management fails intermittently with timeout error
  2. Watson OpenScale fails to reconcile due to Kafka pod
  3. Watson OpenScale fails to reconcile after restore
  4. Validation for cpd-ikc-ikc-aux-ckpt-cm ConfigMap fails
  5. Backup fails at pre-check for watsonx Orchestrate due to missing ClusterRole permissions
  6. Notification dropdown does not open after watsonx.ai restore with IBM Fusion
  7. Resource validation at the end of a backup fails with OOMKilled status
  8. Informix backup fails at pre-check due to incorrect custom resource name
  9. Db2 backup fails at the Hook: br-service hooks/pre-backup step
  10. Backup service location is unavailable during backup
Restore issues
Do the workarounds that apply to your environment after you restore a backup.
  1. Watson Discovery fails after restore with opensearch and post_restore unverified components
  2. Restore process stuck at db2u-aux-ckpt-cm-child.db2ucluster-ready/readyState
  3. Restore process stuck at zenextensions-patch-ckpt-cm step

Backup and restore issues with NetApp Trident protect

Backup issues
Review the following issues before you create a backup. Do the workarounds that apply to your environment.
  1. Restore for watsonx.data Premium completes but dependent CRs remain in a bad state
  2. OADP backup with the same name as a NetApp Trident protect backup can be deleted by cpd-trident-protect.py backup delete
  3. Neo4j config maps missing during backup
  4. Backups and restores fail because of missing SCCs
Restore issues
Review the following issues after you restore a backup. Do the workarounds that apply to your environment.
  1. Restore fails for watsonx Orchestrate and watsonx Assistant custom resources
  2. Backups and restores fail because of missing SCCs

Backup and restore issues with Portworx

Backup issues
Review the following issues before you create a backup. Do the workarounds that apply to your environment.
  1. IBM Software Hub resources are not migrated

ObjectBucketClaim is not supported by the OADP utility

Applies to: 5.3.0 and later

Applies to: Backup and restore with the OADP utility

Diagnosing the problem
If an ObjectBucketClaim is created in an IBM Software Hub instance, it is not included when you create a backup.
Cause of the problem
OADP does not support backup and restore of ObjectBucketClaim.
Resolving the problem
Services that provide the option to use ObjectBuckets must ensure that the ObjectBucketClaim is in a separate namespace and backed up separately.

After a restore, OperandRequest timeout error in the ZenService custom resource

Applies to: 5.3.0 and later

Applies to: All backup and restore methods

Diagnosing the problem
Get the YAML for the ZenService custom resource:
oc get zenservice lite-cr -n ${PROJECT_CPD_INST_OPERANDS} -o yaml

In the output, you see the following error:

...
zenMessage: '5.1.3/roles/0010-infra has failed with error: "OperandRequest" "zen-ca-operand-request":
      Timed out waiting on resource'
...
Check for failing operandrequests:
oc get operandrequests -A
For each failing OperandRequest, check its conditions for constraints not satisfiable messages:
oc describe operandrequest <opreq-name> -n ${PROJECT_CPD_INST_OPERATORS}
Cause of the problem
Subscription wait operations timed out. The problematic subscriptions show an error similar to the following example:
'constraints not satisfiable: clusterserviceversion ibm-db2aaservice-cp4d-operator.v5.2.0
      exists and is not referenced by a subscription, @existing/cpd-operators//ibm-db2aaservice-cp4d-operator.v5.2.0
      and ibm-db2aaservice-cp4d-operator-catalog/cpd-operators/v5.2/ibm-db2aaservice-cp4d-operator.v5.2.0
      originate from package ibm-db2aaservice-cp4d-operator, subscription ibm-db2aaservice-cp4d-operator
      requires ibm-db2aaservice-cp4d-operator-catalog/cpd-operators/v5.2/ibm-db2aaservice-cp4d-operator.v5.2.0,
      subscription ibm-db2aaservice-cp4d-operator exists'

This problem is a known issue with Red Hat Operator Lifecycle Manager. For details, see Failed to install or upgrade operator with warning 'clusterserviceversion is not referenced by a subscription'.

Resolving the problem
Do the following steps:
  1. Delete the problematic clusterserviceversions and subscriptions, and restart the Operand Deployment Lifecycle Manager (ODLM) pod.

    For details, follow the steps in the troubleshooting document cloud-native-postgresql operator is installed with the certified-operators catalogsource.

  2. Delete IBM Software Hub instance projects (namespaces).

    For details, see Cleaning up the cluster before a restore.

  3. Retry the restore.

Informix custom resource remains in InProgress status after restore

Applies to: 5.3.1

Applies to: All backup and restore methods

Diagnosing the problem

After restoring an Informix instance using any backup method, the custom resource (CR) for the Informix instance remains in InProgress status even though the informixservice-cr custom resource shows Completed status.

informix            Informix              informix-1770718673693260       2026-02-11T05:08:32Z  zen          10.1.0              InProgress
informix_cp4d       InformixService       informixservice-cr              2026-02-11T05:08:32Z  zen          10.1.0              Completed

However, when you check the Informix pods, they are all running and healthy:

oc get pods -n zen | grep informix-xxxxxxxxxxxxxxxxx
informix-xxxxxxxxxxxxxxxxx-cm-0                    1/1     Running     0          8h
informix-xxxxxxxxxxxxxxxxx-cp4dapi-85pqrf88-9bbpqr 1/1     Running     0          8h
informix-xxxxxxxxxxxxxxxxx-server-0                1/1     Running     0          8h
informix-xxxxxxxxxxxxxxxxx-wlistener-6pqr84pqr6    1/1     Running     0          8h

The Informix database instance is also running and accepts database connections, so database operations are not impacted.

This status mismatch causes post-restore validation tests to fail because they check for the Informix CR to be in Completed status.

Cause of the problem

This issue occurs because the controller manager for the Informix operator stops reconciling the Informix custom resource after the statefulset is restored. During normal installation and upgrade operations, the controller manager properly monitors the statefulset and updates the CR status when pods reach Ready status. However, during the backup and restore operations, the controller manager fails to detect the pod status changes after the restore completes.

This appears to be a timing issue for the controller manager. The controller manager does not automatically trigger a reconciliation event when the restored statefulset pods become Ready. As a result, the CR status remains in InProgress even though all pods are running and the database is accepting connections.

Resolving the problem

The Informix database remains functional despite the InProgress status. Client applications can continue to connect to the database even while the CR status shows InProgress. This workaround only updates the CR status to reflect the actual running state of the database.

To fix the CR status, you can manually scale down and then scale up the Informix server statefulset to force the controller manager to run a reconciliation. Forcing a reconciliation event triggers the controller manager to detect that the ready replica count matches the desired replica count and update the CR status to Completed.

  1. Scale down the Informix server statefulset to 0 replicas:
    oc scale statefulset informix-<xxxxxxxxxxxxxxxxx>-server -n zen --replicas=0

    Replace <xxxxxxxxxxxxxxxxx> with the instance identifier.

  2. Wait for the server pod to terminate, and then scale up the Informix server statefulset back to 1 replica:
    oc scale statefulset informix-<xxxxxxxxxxxxxxxxx>-server -n zen --replicas=1

    Wait for the server pod to start and reach Running state:

    informix-<xxxxxxxxxxxxxxxxx>-server-0    1/1     Running     0          2m
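
The two steps above could be scripted roughly as follows. This is a minimal sketch under stated assumptions, not a supported procedure: the function name bounce_informix_server, the default namespace zen (taken from the examples above), and the timeout values are illustrative, and oc wait and oc rollout status stand in for the manual waits.

```shell
# Sketch: scale the Informix server statefulset down and back up to force
# the controller manager to reconcile the Informix CR.
# Hypothetical usage: bounce_informix_server <instance-identifier> [namespace]
bounce_informix_server() {
  local instance_id="$1"
  local namespace="${2:-zen}"   # assumption: "zen" as in the examples above
  local sts="informix-${instance_id}-server"

  # Step 1: scale down and wait for the server pod to be deleted.
  oc scale statefulset "${sts}" -n "${namespace}" --replicas=0 || return 1
  oc wait --for=delete "pod/${sts}-0" -n "${namespace}" --timeout=300s || return 1

  # Step 2: scale back up and wait until the statefulset reports ready.
  oc scale statefulset "${sts}" -n "${namespace}" --replicas=1 || return 1
  oc rollout status "statefulset/${sts}" -n "${namespace}" --timeout=600s
}
```

After the function returns, check the Informix CR status again; if it still shows InProgress, the timeouts might need to be longer for your environment.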

Python tools are missing in watsonx Orchestrate after restoring to different cluster

Applies to: 5.3.1

Applies to: Online and offline restore to a different cluster

Diagnosing the problem

After restoring data from a source cluster to a target cluster, Python tools that were imported in the source cluster are not displayed in the restored watsonx Orchestrate environment.

When you click Manage > Agents from the watsonx Orchestrate navigation menu, the Python tools that you imported in the source cluster do not appear in the agents list. The Manage agents screen might also display an error when you try to access it.

Cause of the problem

This issue occurs because the certificates from the source cluster are included in the backup, and they are restored in the target cluster. These certificates are specific to the source cluster and are not valid in the target cluster environment. When watsonx Orchestrate components attempt to use these certificates, the Python tools fail to load and display properly.

Resolving the problem

The certificates need to be regenerated in the target cluster to match the target cluster's environment.

After restoring data to the target cluster, manually regenerate the certificates in the IBM Software Hub instance namespace.

  1. Delete the existing certificates:
    oc delete certificate `oc get certificates | grep icert | awk '{print $1}'`
  2. Restart the watsonx Orchestrate deployments to trigger the regeneration of the certificates:
    oc rollout restart deployments `oc get deployments -l icpdsupport/module=components-services-orchestrate --no-headers | awk '{print $1}'`
  3. Wait for the deployments to restart and the new certificates to be generated.
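
The certificate regeneration steps above can be combined into one helper. The following is a sketch only: the function name regenerate_wxo_certificates, the explicit namespace argument, and the 600-second timeout are assumptions, and the certificate filter (names that contain icert) and the label selector are copied from the commands above.

```shell
# Sketch: delete the restored certificates and restart the watsonx Orchestrate
# deployments so that new certificates are generated on the target cluster.
# Hypothetical usage: regenerate_wxo_certificates <instance-namespace>
regenerate_wxo_certificates() {
  local namespace="$1"
  local selector="icpdsupport/module=components-services-orchestrate"
  local cert
  local dep
  local deployments

  # Step 1: delete every certificate whose name contains "icert".
  for cert in $(oc get certificates -n "${namespace}" -o name | grep icert); do
    oc delete "${cert}" -n "${namespace}" || return 1
  done

  # Step 2: restart the deployments that carry the Orchestrate label.
  deployments=$(oc get deployments -n "${namespace}" -l "${selector}" -o name)
  for dep in ${deployments}; do
    oc rollout restart "${dep}" -n "${namespace}" || return 1
  done

  # Step 3: wait for each restarted deployment to finish rolling out.
  for dep in ${deployments}; do
    oc rollout status "${dep}" -n "${namespace}" --timeout=600s || return 1
  done
}
```

Using -o name avoids parsing table output, so no header line can leak into the list of resources.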

Domain agent connections are not displayed in watsonx Orchestrate after restoring to different cluster

Applies to: 5.3.1

Applies to: Online and offline restore to a different cluster

Diagnosing the problem

After restoring data from a source cluster to a target cluster, domain agent connections that existed in the source cluster are not displayed in the watsonx Orchestrate environment in the target cluster.

When you click Connections from the watsonx Orchestrate navigation menu, the domain agent connections from the source cluster do not appear in the connections list. The Connections window might also display an error when you try to access it.

Cause of the problem

This issue occurs because the certificates from the source cluster are included in the backup, and they are restored in the target cluster. These certificates are specific to the source cluster and are not valid in the target cluster environment. When watsonx Orchestrate components attempt to use these certificates, the domain agent connections fail to load and display properly.

Resolving the problem

The certificates need to be regenerated in the target cluster to match the target cluster's environment.

After restoring data to the target cluster, manually regenerate the certificates in the IBM Software Hub instance namespace.

  1. Delete the existing certificates:
    oc delete certificate `oc get certificates | grep icert | awk '{print $1}'`
  2. Restart the watsonx Orchestrate deployments to trigger the regeneration of the certificates:
    oc rollout restart deployments `oc get deployments -l icpdsupport/module=components-services-orchestrate --no-headers | awk '{print $1}'`
  3. Wait for the deployments to restart and the new certificates to be generated.

Cannot create connections in watsonx Orchestrate after restoring to a different cluster

Applies to: 5.3.1

Applies to: Online and offline restore to a different cluster

Diagnosing the problem

After restoring data from a source cluster to a target cluster, you cannot create new connections in the restored watsonx Orchestrate environment.

When you click Save and continue to create a connection in the "Add new connection" dialog, you see a Connection failed error. The Connections window might also display an error when you try to access it, and API calls to the connections service may fail with HTTP 500 errors.

Cause of the problem

This issue occurs because the certificates from the source cluster are included in the backup, and they are restored in the target cluster. These certificates are specific to the source cluster and are not valid in the target cluster environment. When watsonx Orchestrate components attempt to use these certificates, the connections service fails to process requests properly.

The issue can also cause errors and failures for JWT token validation and certificate verification during API calls between watsonx Orchestrate components.

Resolving the problem

The certificates need to be regenerated in the target cluster to match the target cluster's environment.

After restoring data to the target cluster, manually regenerate the certificates in the IBM Software Hub instance namespace.

  1. Delete the existing certificates:
    oc delete certificate `oc get certificates | grep icert | awk '{print $1}'`
  2. Restart the watsonx Orchestrate deployments to trigger the regeneration of the certificates:
    oc rollout restart deployments `oc get deployments -l icpdsupport/module=components-services-orchestrate --no-headers | awk '{print $1}'`
  3. Wait for the deployments to restart and the new certificates to be generated.

Extra connections appear on target cluster after restoring watsonx.data intelligence and IBM Manta Data Lineage

Applies to: 5.3.1

Fixed in: 5.3.1 Patch 5

Applies to: Online backup and restore to a different cluster

Diagnosing the problem

After restoring watsonx.data intelligence and IBM Manta Data Lineage to a different cluster, you see duplicate connections on the target cluster. Specifically, you see an extra IBM Software Hub connection on the restored cluster that was not present on the source cluster.

This issue happens intermittently, and it has occurred with both NetApp Trident protect and IBM Fusion.

Cause of the problem

This issue happens because the watsonx.data intelligence and IBM Manta Data Lineage custom resources (CRs) are being restored during the restore-pre-operators phase instead of the restore-post-namespacescope phase. When the CRs are restored prematurely, the operators create connections that should be restored later through the normal service restoration flow. This results in duplicate connections on the target cluster.

Resolving the problem

To fix the issue, the following CRs need to be excluded from the resource groups:

  • watsonxdataintelligence.cpd.ibm.com needs to be excluded from the wxdi-resources group
  • datalineage.cpd.ibm.com needs to be excluded from the datalineage-resources group

Before taking a backup, edit the ConfigMaps for watsonx.data intelligence and IBM Manta Data Lineage to exclude the CRs.

Note: This workaround must be applied before taking a backup. If you have already performed a restore and have duplicate connections, you must manually delete the extra connections from the target cluster.
  1. Edit the watsonx.data intelligence ConfigMaps to exclude the watsonxdataintelligence.cpd.ibm.com resource type.
    1. Edit the wxdi-aux-ckpt-cm and wxdi-aux-br-cm ConfigMaps.
    2. In both ConfigMaps, ensure the plan-meta section has the following configuration:
      plan-meta: |
          appType: watsonxdataintelligence
          groups:
            - name: watsonxdataintelligence-crs
              type: resource
              includedResourceTypes:
                - watsonxdataintelligence.cpd.ibm.com
              backupRef: ${RESOURCE_BACKUP}
            - name: wxdi-resources
              type: resource
              labelSelector: "icpdsupport/addOnId=watsonx-dataintelligence"
              includeClusterResources: true
              excludedResourceTypes:
                - pods
                - replicaset
                - deployment
                - watsonxdataintelligence.cpd.ibm.com
  2. Edit the IBM Manta Data Lineage ConfigMaps to exclude the datalineage.cpd.ibm.com resource type.
    1. Edit the datalineage-aux-br-cm and datalineage-aux-ckpt-cm ConfigMaps.
    2. In both ConfigMaps, ensure the plan-meta section has the following configuration:
      appType: datalineage
      groups:
        - name: datalineage-resources
          type: resource
          labelSelector: icpdsupport/addOnId=datalineage
          includeClusterResources: true
          excludedResourceTypes:
            - pods
            - replicaset
            - deployment
            - statefulset
            - service
            - job
            - cronjob
            - zenextension
            - persistentvolumeclaims
            - datalineage.cpd.ibm.com
          backupRef: ${RESOURCE_BACKUP}
  3. Proceed with the backup and restore operation.
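
Before you take the backup, you might want to confirm that each ConfigMap really lists the CR under excludedResourceTypes. The following sketch reads the plan-meta text on standard input and uses a simple line-based awk scan, not a YAML parser, so it assumes the one-item-per-line layout shown above. The function name plan_meta_excludes and the usage line in the comment (including the namespace variable and the jsonpath expression) are assumptions.

```shell
# Sketch: exit 0 if the resource type given as $1 appears in any
# excludedResourceTypes list of the plan-meta text read from stdin.
# Hypothetical usage:
#   oc get configmap wxdi-aux-br-cm -n "${PROJECT_CPD_INST_OPERANDS}" \
#     -o jsonpath='{.data.plan-meta}' \
#     | plan_meta_excludes watsonxdataintelligence.cpd.ibm.com
plan_meta_excludes() {
  awk -v type="$1" '
    # Start of an exclusion list.
    /excludedResourceTypes:/ { in_list = 1; next }
    # A "- item" line while inside an exclusion list.
    in_list && /^[ \t]*-[ \t]/ {
      item = $0
      sub(/^[ \t]*-[ \t]*/, "", item)
      sub(/[ \t]+$/, "", item)
      if (item == type) found = 1
      next
    }
    # Any other line ends the current list.
    { in_list = 0 }
    END { exit (found ? 0 : 1) }
  '
}
```

A type that appears only under includedResourceTypes is not reported as excluded, which matches the intent of the workaround.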

Common core services CR is recreated instead of restored during restore to the same cluster

Applies to: 5.3.0

Fixed in: 5.3.1

Applies to: Backup and restore to the same cluster

Diagnosing the problem

After restoring to the same cluster, the Common core services custom resource (CR) fails to reconcile properly. The Common core services CR status shows the following error:

status:
  ccsStatus: InProgress
  conditions:
  - lastTransitionTime: "2026-01-08T13:29:04Z"
    message: |-
      unknown playbook failure
      The playbook has failed at task - 'wait for create-dap-directories job'
      The error was: 'Please consult the operator logs.'
    reason: Failed
    status: "True"
    type: Failure

You might also see environment-related problems, such as CRI-O errors on worker nodes, and error messages like error reserving ctr name.

When you examine the Common core services CR metadata, you see that it lacks the expected Velero annotations or labels that would indicate it was restored from a backup.

Important: This issue also affects services that depend on Common core services.
Cause of the problem

During the restore operation, the Common core services CR is not restored from the backup. Instead, a new Common core services CR is created with the default specifications. The default configuration might not match the original configuration from the backed-up cluster.

If the original Common core services CR had custom specifications or other custom settings, these configurations are lost during the restore process. If you used the default configuration, the recreated CR should work. However, if you encounter reconciliation failures, apply the workaround to restore the original specifications.

Resolving the problem

To resolve this issue, you must manually extract the original Common core services CR specifications from the backup and reapply them to the restored cluster. The workaround depends on the backup and restore software that you use.


Workaround for NetApp Trident protect, Portworx, and the OADP utility
  1. Set the required environment variables:
    • BACKUP_NAME
    • PROJECT_CPD_INST_OPERANDS
  2. Download and extract the backup data:
    ./cpd-cli oadp tenant-backup download ${BACKUP_NAME}
    unzip ${BACKUP_NAME}-data.zip
    tar -xzf cpd-tenant-*-data.tar.gz
  3. Navigate to the Common core services resource directory and extract the Common core services CR specification:
    cd resources/ccs.ccs.cpd.ibm.com/namespaces/${PROJECT_CPD_INST_OPERANDS}/
    jq '{spec: .spec}' ccs-cr.json > ccs-spec.json
  4. Patch the Common core services CR with the original specifications:
    oc patch ccs ccs-cr -n ${PROJECT_CPD_INST_OPERANDS} --type=merge --patch-file=ccs-spec.json
  5. Wait for the Common core services CR to reconcile and reach Completed status.

Workaround for IBM Fusion
  1. Set the required environment variables:
    • FUSION_BACKUP_NAME
    • PROJECT_FUSION
    • OADP_PROJECT
  2. Retrieve the IBM Fusion job ID and Velero backup name:
    fusionJobId=$(oc get fbackup -n ${PROJECT_FUSION} ${FUSION_BACKUP_NAME} -o jsonpath='{.metadata.annotations.guardian\.ibm\.com/Id}')
    if [ -z "$fusionJobId" ]; then
      echo "Fusion backup job id for '${FUSION_BACKUP_NAME}' not found"
      return 1
    fi
    
    veleroBackupName=$(oc get backups.velero.io -n ${OADP_PROJECT} -l bnrJobId=${fusionJobId} --no-headers | head -1 | awk '{print $1}')
    if [ -z "$veleroBackupName" ]; then
      echo "Velero backup name for '${FUSION_BACKUP_NAME}' not found"
      return 1
    fi
  3. Download and extract the backup data:
    cpd-cli oadp backup download -n ${OADP_PROJECT} ${veleroBackupName} --insecure-skip-tls-verify
    mkdir ${veleroBackupName}-data
    tar -xf ${veleroBackupName}*-data.tar.gz
  4. Navigate to the Common core services resource directory and extract the Common core services CR specification:
    cd resources/ccs.ccs.cpd.ibm.com/namespaces/${PROJECT_CPD_INST_OPERANDS}/
    jq '{spec: .spec}' ccs-cr.json > ccs-spec.json
  5. Patch the Common core services CR with the original specifications:
    oc patch ccs ccs-cr -n ${PROJECT_CPD_INST_OPERANDS} --type=merge --patch-file=ccs-spec.json
  6. Wait for the Common core services CR to reconcile and reach Completed status.
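
The final wait step in either workaround can be automated with a small polling loop. This is a sketch under stated assumptions: the function name wait_for_ccs_completed, the default of 60 attempts, and the 30-second interval are illustrative. The ccsStatus field is the one shown in the status output earlier in this issue.

```shell
# Sketch: poll the Common core services CR until ccsStatus is Completed.
# Hypothetical usage: wait_for_ccs_completed <namespace> [attempts] [interval-seconds]
wait_for_ccs_completed() {
  local namespace="$1"
  local attempts="${2:-60}"   # assumption: up to 60 checks
  local interval="${3:-30}"   # assumption: 30 seconds between checks
  local status
  local i=0

  while [ "${i}" -lt "${attempts}" ]; do
    # Read only the status field; treat a failed call as an empty status.
    status=$(oc get ccs ccs-cr -n "${namespace}" \
      -o jsonpath='{.status.ccsStatus}') || status=""
    echo "ccsStatus: ${status}"
    if [ "${status}" = "Completed" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "${interval}"
  done

  # The CR did not reach Completed within the allowed attempts.
  return 1
}
```

A nonzero return code means the CR did not reach Completed in time; check the operator logs before you retry.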

IBM Master Data Management CR becomes stuck after a restore

Applies to: 5.3.1 Patch 5

Applies to: Online restore with the OADP utility and IBM Fusion

Diagnosing the problem

After performing an online restore of IBM Master Data Management, the mdm-cr custom resource becomes stuck at 97% completion during the post-restore phase. The reconciliation process fails to complete.

When you check the status of the IBM Master Data Management components, the mdm-cr shows InProgress status with a progress message indicating the post-restore phase has finished, but the overall status remains incomplete:

Components    CR Kind               CR Name    Creation Timestamp    Namespace    Reconciled Version    Expected Version    Operator Information     Progress    Progress Message                   Reconcile History                                                        Patch Id  Status
------------  --------------------  ---------  --------------------  -----------  --------------------  ------------------  -----------------------  ----------  ---------------------------------  ---------------------------------------------------------------------  ----------  ----------
match360      MasterDataManagement  mdm-cr     2026-05-08T10:36:50Z  bvt          4.11.62               4.11.62             ibm-mdm operator 4.11.0  97%         Finished installing post_restore.  MDM Config UI is unavailable. Please check the mdm-config-ui pod logs           5  InProgress
Cause of the problem

This issue occurs because the IBM Master Data Management CR specifies an incorrect version of Neo4j, causing the restore process to become stuck during Neo4j deployment.

Resolving the problem
To resolve this issue, remove the hardcoded version parameter from the IBM Master Data Management deployment's Neo4j CR:
  1. Patch the Neo4j CR to remove the Neo4j version. Replace <neo4j-cluster-name> with the name of the Neo4j CR:
    oc patch neo4j <neo4j-cluster-name> -n ${PROJECT_CPD_INST_OPERATORS} --type json -p='[{"op": "remove", "path": "/spec/version"}]'
  2. Begin the IBM Master Data Management restore procedure again.

Backup pre-check fails for db2oltp, db2wh, or db2aaservice on upgraded cluster with proxy enabled

Applies to: 5.3.1

Applies to: Online and offline backups

Diagnosing the problem

After upgrading a cluster from IBM Software Hub 5.2.1 to 5.3.1, backups fail during the pre-check hook phase. You also see errors for some or all of the following Db2 ConfigMaps:

  • db2oltp-aux-ckpt-cm
  • db2oltp-aux-br-cm
  • db2wh-aux-ckpt-cm
  • db2wh-aux-br-cm
  • db2aaservice-aux-ckpt-cm
  • db2aaservice-aux-br-cm

When you check the status of the affected services (db2oltpservices, db2whservices, and db2aaserviceservices), you see that the CR status frequently flips between Completed and InProgress statuses:

oc get db2oltpservices -n <namespace> -oyaml | grep Status
    db2oltpStatus: InProgress
# A few seconds later...
    db2oltpStatus: Completed
# A few seconds later...
    db2oltpStatus: InProgress
    progress: 10%
    progressMessage: Ibmcpd dependency satisfied
Cause of the problem

When proxy configuration resources are enabled on a cluster by using the cpd-cli manage create-proxy-config command, the command creates a resource specification injection (RSI) patch. The RSI patch injects proxy environment variables into pods. A cronjob, zen-rsi-evictor-cron-job, then runs every 30 minutes to check all eligible pods, and it evicts unpatched pods that need to have the RSI patch applied. It skips pods that have already been patched.

During the checking process, the cronjob updates the pod owner's annotation to indicate the patching status. The Db2 operator watches these deployment changes and triggers a reconciliation of the Db2 custom resource whenever the annotations are updated. This causes the CR status to temporarily change from Completed to InProgress during the reconciliation. When a backup pre-check runs during this reconciliation window, it causes the precheck to fail for the db2oltp-aux-ckpt-cm, db2wh-aux-ckpt-cm, or db2aaservice-aux-ckpt-cm ConfigMap because the pod state is not in the expected Completed status while backup is trying to capture a consistent snapshot.

Resolving the problem

If the backup fails, you can try taking a backup after the reconciliation is finished.

If you don't want zen-rsi-evictor-cron-job running while you take a backup, you can suspend this cronjob during backup and resume it after the backup finishes. For scheduled backups, consider adjusting the backup schedule to avoid the RSI patch cycle, or implement automation to suspend and resume the cron job around backup windows.
  1. Use the following command to suspend zen-rsi-evictor-cron-job before you take a backup:
    oc patch cronjob zen-rsi-evictor-cron-job -n <cpd-instance-namespace> -p '{"spec":{"suspend":true}}'
  2. Take a backup.
  3. Use the following command to resume zen-rsi-evictor-cron-job:
    oc patch cronjob zen-rsi-evictor-cron-job -n <cpd-instance-namespace> -p '{"spec":{"suspend":false}}'
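
For scheduled backups, the suspend and resume steps above could be wrapped around the backup command so that the cronjob is resumed even when the backup fails. The following is a minimal sketch: the function name with_evictor_suspended is hypothetical, and the backup command is whatever you pass after the namespace.

```shell
# Sketch: run a command while zen-rsi-evictor-cron-job is suspended.
# Hypothetical usage: with_evictor_suspended <cpd-instance-namespace> <backup command...>
with_evictor_suspended() {
  local namespace="$1"
  local rc=0
  shift

  # Suspend the cronjob before the backup starts.
  oc patch cronjob zen-rsi-evictor-cron-job -n "${namespace}" \
    -p '{"spec":{"suspend":true}}' || return 1

  # Run the caller's backup command and remember its exit code.
  "$@" || rc=$?

  # Always resume the cronjob, even if the backup failed.
  oc patch cronjob zen-rsi-evictor-cron-job -n "${namespace}" \
    -p '{"spec":{"suspend":false}}' || return 1

  return "${rc}"
}
```

Suspending the cronjob only prevents new runs from being scheduled. If a run is already in progress, wait for the Db2 CR to return to Completed before you start the backup.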

Identity resources have file names that are too long

Applies to: 5.3.0 and later

Applies to: All backup and restore methods

Diagnosing the problem

During the backup validation phase, the backup fails, and you see an error that says the file name is too long, for example:

time=2025-10-07T14:53:35.524157Z level=debug msg=exit backupValidationService.validateBackup
Error: failed to extract downloaded backup file '/tmp/backup-resources-f762cb7f-fea7-4d3a-9e13-c793147f7012-20251007145334-data.tar.gz': 
open /tmp/backup-resources-f762cb7f-fea7-4d3a-9e13-c793147f7012-20251007145334-data-20251007145334/resources/identities.user.openshift.io/cluster/xxx:yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy.json: 
file name too long
Cause of the problem

The error occurs when the backup archive is extracted. One or more OpenShift identity resources (identities.user.openshift.io in the preceding error) have very long names that exceed the file system's 255-character file name limit. This typically occurs with LDAP-generated identities that include the full distinguished name (DN) in the resource name. When the backup validation process attempts to extract the backup archive and create files for these resources, the operation fails because the file name is too long.

Resolving the problem

Manually mark the problematic identity resources so that they are excluded from the backup:

  1. Use the following command to exclude all identity resources before you take a backup:
    oc get identities.user.openshift.io --no-headers | awk '{print $1}' | xargs -I {} sh -c "oc label identities.user.openshift.io {} velero.io/exclude-from-backup=true"
    Note: You can use the same command even if you already tried to take a backup, and the backup failed.
  2. Retry the backup.
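
The command in step 1 labels every identity resource. If you prefer to label only the identities whose names are too long, a filter along the following lines might help. This is a sketch under assumptions: the function name long_identity_names is hypothetical, the limit is taken as 255 for the resource name plus the .json suffix seen in the error message, and awk counts characters, which can differ from bytes for non-ASCII names.

```shell
# Sketch: read identity names (first column) from stdin and print only those
# whose extracted file name, <name>.json, would exceed 255 characters.
# Hypothetical usage:
#   oc get identities.user.openshift.io --no-headers | long_identity_names \
#     | xargs -I {} oc label identities.user.openshift.io {} velero.io/exclude-from-backup=true
long_identity_names() {
  awk -v max=255 '{ if (length($1) + length(".json") > max) print $1 }'
}
```

Review the printed names before you apply the label, because excluded identities are not included in the backup.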

Backup precheck fails for watsonx Code Assistant and watsonx Code Assistant for Red Hat Ansible Lightspeed due to ClusterRole configuration

Applies to: 5.3.0

Fixed in: 5.3.1

Applies to: All backup and restore methods

Diagnosing the problem

This issue happens when both the watsonx Code Assistant and watsonx Code Assistant for Red Hat Ansible® Lightspeed services are installed. The backup fails at the precheck stage with the following errors:

pre-check (2):

    COMPONENT              CONFIGMAP                  METHOD STATUS DURATION     ADDONID
    code-assistant-ansible wca-ansible-aux-ckpt-cm    rule   error  197.414444ms code-assistant-ansible
    wca-code-generation    codegeneration-aux-ckpt-cm rule   error  196.501111ms code-assistant-z-code-generation

Error: precheck failed with error: pre-check backup hooks encountered one or more error(s), err=2 errors occurred:
    * backup precheck hook finished with status=error, configmap=codegeneration-aux-ckpt-cm
    * backup precheck hook finished with status=error, configmap=wca-ansible-aux-ckpt-cm

You might also see an error that shows permission issues:

wcaansibles.wca.cpd.ibm.com "wcaansible-cr" is forbidden: User "system:serviceaccount:cpd-operators:cpdbr-tenant-service-sa" cannot get resource "wcaansibles" in API group "wca.cpd.ibm.com"
Cause of the problem

Both the watsonx Code Assistant operator and the watsonx Code Assistant for Red Hat Ansible Lightspeed operator create a ClusterRole with the same name: wcaansibles.wca.cpd.ibm.com-v1beta1-$PROJECT_CPD_INST_OPERATORS-edit. Every time one service writes to the ClusterRole, it overwrites the configuration for the other service. As a result, the backup service account cannot access resources because of the incorrect ClusterRole configuration, and the backup precheck fails when it validates resources.

Resolving the problem
You must manually create or patch the ClusterRole for watsonx Code Assistant and watsonx Code Assistant for Red Hat Ansible Lightspeed to ensure both have the correct configurations:
  1. Patch the watsonx Code Assistant for Red Hat Ansible Lightspeed ClusterRole:
    oc patch clusterrole wcaansibles.wca.cpd.ibm.com-v1beta1-$PROJECT_CPD_INST_OPERATORS-edit \
      --type=json -p '[
        {"op":"replace","path":"/metadata/labels/component-id","value":"wca-ansible-cluster-scoped"},
        {"op":"replace","path":"/metadata/labels/icpdsupport~1addOnId","value":"code-assistant-ansible"},
        {"op":"replace","path":"/rules/0/resources","value":["wcaansibles"]}
      ]'
  2. Create the watsonx Code Assistant ClusterRole by using the following commands:
    oc create clusterrole wcas.wca.cpd.ibm.com-v1beta1-$PROJECT_CPD_INST_OPERATORS-edit \
      --resource wcas.wca.cpd.ibm.com \
      --verb create,get,update,patch,delete,list
    oc label clusterrole wcas.wca.cpd.ibm.com-v1beta1-$PROJECT_CPD_INST_OPERATORS-edit \
      component-id=wca-cluster-scoped \
      icpdsupport/addOnId=code-assistant \
      rbac.authorization.k8s.io/aggregate-to-edit="true"
  3. Use the following command to verify that the ClusterRole for each service exists with the correct labels:
    oc get clusterrole -l 'component-id in (wca-cluster-scoped,wca-ansible-cluster-scoped)' -L component-id
    For example, the following is the ClusterRole for each service.
    • For watsonx Code Assistant
      apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRole
      metadata:
        labels:
          component-id: wca-cluster-scoped
          icpdsupport/addOnId: code-assistant
          rbac.authorization.k8s.io/aggregate-to-edit: "true"
        name: wcas.wca.cpd.ibm.com-v1beta1-cpd-operators-edit
      rules:
      - apiGroups:
        - wca.cpd.ibm.com
        resources:
        - wcas
        verbs:
        - create
        - get
        - update
        - patch
        - delete
        - list
    • For watsonx Code Assistant for Red Hat Ansible Lightspeed
      apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRole
      metadata:
        labels:
          component-id: wca-ansible-cluster-scoped
          icpdsupport/addOnId: code-assistant-ansible
          rbac.authorization.k8s.io/aggregate-to-edit: "true"
        name: wcaansibles.wca.cpd.ibm.com-v1beta1-cpd-operators-edit
      rules:
      - apiGroups:
        - wca.cpd.ibm.com
        resources:
        - wcaansibles
        verbs:
        - create
        - get
        - update
        - patch
        - delete
        - list
  4. Retry the backup precheck:
    cpd-cli oadp backup precheck \
      --tenant-operator-namespace ${PROJECT_CPD_INST_OPERATORS} \
      --hook-kind br \
      -n ${OADP_PROJECT} \
      --spec-version=2.0.0

Restore of IBM Master Data Management fails if RabbitMQ Helm secrets are not excluded before creating the backup

Applies to: 5.3.0

Fixed in: 5.3.1

Applies to: Offline backup and restore methods

Diagnosing the problem
While performing an offline restore of the IBM Master Data Management service, the restore operation fails due to a timeout while waiting for the Rabbitmqcluster to enter a Completed state. The Rabbitmqcluster becomes stuck, which blocks the IBM Master Data Management CR (mdm-cr) from progressing.
Cause of the problem
While performing an offline restore, the Rabbitmqcluster operator pod does not get reconciled because the RabbitMQ Helm secrets were not excluded before creating the backup.
Resolving the problem
To resolve this issue, exclude the RabbitMQ Helm secrets before creating the backup.
Patch the RabbitMQ Helm secret with exclude labels by running the following commands:
oc label secret -l owner=helm,name=<rabbitmq_cr_name> velero.io/exclude-from-backup=true -n <CPD-OPERAND-NAMESPACE>
oc label role,rolebinding -l app.kubernetes.io/managed-by=Helm,release=<rabbitmq_cr_name> velero.io/exclude-from-backup=true -n <CPD-OPERAND-NAMESPACE>

IBM watsonx Orchestrate fails to deploy agents after restoring data to a different cluster

Applies to: 5.3.1

Applies to: All backup and restore methods

Diagnosing the problem

After restoring data for IBM watsonx Orchestrate to a different cluster, you create an agent with Python tools or Domain agent tools. When you try to deploy the agent, the deployment process fails, and you see the following error message:

{
  "detail": "Unexpected error during tool deployment: 500: Tool deployment failed in TRM:
  {\"error\":\"storage init: failed to create bucket \\\"wo-server-storage-bucket-cpd-instance-1\\\":
  operation error S3: CreateBucket, https response error StatusCode: 403, RequestID: mig6ggytexrd1f-1cji, HostID: mig6ggyt-exrd1f-1cji, api error InvalidAccessKeyId: The AWS access key Id you
  provided does not exist in our records.\"} "
}
Cause of the problem

After you complete the restoration, the IBM watsonx Orchestrate Tools Runtime Manager (TRM) component still retains references to the AWS credentials from the source cluster. When deploying the agents, the TRM tries to create an S3 storage bucket for the agents by using the old AWS credentials. Since these credentials do not match the storage credentials for the target cluster, the S3 storage bucket fails to be created with an InvalidAccessKeyId error.

Resolving the problem
At this time, there is no confirmed workaround for this issue.

Neo4j cluster in Failed status after restore

Applies to: 5.3.0

Fixed in: 5.3.1

Applies to: All backup and restore methods

Diagnosing the problem

After the restore operation completes, the Neo4j cluster appears to be in Failed status, and it has an error that says the configdb database does not exist.

The restore log shows the following error:

2026-01-23 19:37:45.364+0000 WARN [c.n.b.r.RestoreDatabaseCommand] Restore command failed
java.lang.IllegalArgumentException: Database with name [configdb] has replicated cluster state locally. 
Please run DROP DATABASE configdb against the system database. 
If the database is already dropped, you need to unbind the local instance using neo4j-admin server unbind. 
Note that unbind requires stopping the instance and affects all databases.

The Neo4j cluster is actually in a failed state even though the restore process looks like it completed successfully.

Cause of the problem

The Neo4j restore process requires that the configdb database is dropped before restoring. However, the drop operation does not fully complete before the restore script begins restoring the database. This leaves stale cluster metadata that prevents the restore from completing successfully.

Additionally, the Neo4j restore script is not passing the exit code to the CPDBR restore tool, which is why the restore process appears successful despite actually failing.

Resolving the problem

Manually restore the Neo4j database using the following steps:

  1. Connect to the Neo4j pod:
    oc project <project-name>
    oc rsh <neo4j-pod>
  2. Connect to the Neo4j system database using cypher-shell:
    cypher-shell -a neo4j+ssc://localhost:7687 \
      -u neo4j \
      -p $(cat /config/neo4j-auth/NEO4J_AUTH | cut -d '/' -f2) \
      -d system
  3. Create the configdb database:
    CREATE DATABASE configdb;
  4. Force the Neo4j CR reconciliation by editing the Neo4j custom resource:
    oc edit neo4j <neo4j-cluster-name>
    Add the following under spec:
    spec:
      use_force: true
  5. Wait until CR reconciliation completes successfully and pods are ready.
  6. Verify that the required backup dumps are available under the /backups directory.
  7. Restore the database using the latest backup:
    python /neo4j/scripts/db-restore.py \
      --backup_dir=/backups \
      --method=latest > /backups/restore.log
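Because the restore script does not pass its exit code back to the caller, a successful return from step 7 does not by itself confirm that the restore worked. The following is a minimal sketch that scans a restore log for the failure marker shown in the log excerpt above. The function name restore_failed is hypothetical, and the sketch assumes that the marker text is written to the log file that you check.

```shell
#!/bin/sh
# Sketch: report whether a Neo4j restore log contains the failure marker
# "Restore command failed" from the documented log excerpt.
# restore_failed is a hypothetical helper name.
restore_failed() {
  log_file=$1
  # -q: no output, -F: match a fixed string. Exit status 0 means the marker was found.
  grep -qF "Restore command failed" "$log_file"
}

# Example (inside the Neo4j pod, after the restore finishes):
#   if restore_failed /backups/restore.log; then echo "restore failed"; fi
```

The absence of the marker is not proof of success, so also confirm that the Neo4j cluster leaves the Failed status.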

During a restore, the IBM Master Data Management CR fails with an error stating that a conditional check failed

Applies to: 5.3.0 and later

Applies to: All backup and restore methods

Diagnosing the problem
A restore of the IBM Master Data Management service fails with an error similar to the following example:
The conditional check 'all_services_available and ( history_enabled | bool  or parity_enabled | bool )' failed. 
The error was: error while evaluating conditional (all_services_available and ( history_enabled | bool  or parity_enabled | bool )): 'history_enabled' is undefined

The error appears to be in '/opt/ansible/roles/4.10.23/mdm_cp4d/tasks/check_services.yml': line 202, column 5, 
but may be elsewhere in the file depending on the exact syntax problem.
Resolving the problem
No action is required. The error will be resolved automatically during subsequent operator reconciliation runs.

Watson OpenScale fails after restore due to Db2 (db2oltp) or Db2 Warehouse (db2wh) configuration

Applies to: 5.3.0 and later

Applies to: All backup and restore methods

Diagnosing the problem
After you restore, Watson OpenScale fails due to memory constraints. You might see Db2 (db2oltp) or Db2 Warehouse (db2wh) instances that return 404 errors and pod failures where scikit pods are unable to connect to Apache Kafka.
Cause of the problem
The root cause is typically insufficient memory or a temporary table page size setting that is too small. Both settings are critical for query execution and service stability.
Resolving the problem
Ensure that the Db2 (db2oltp) or Db2 Warehouse (db2wh) instance is configured with adequate memory resources, specifically:
  • Set the temporary table page size to at least 700 GB during instance setup or reconfiguration.
  • Monitor pod health and Apache Kafka connectivity to verify that dependent services recover after memory allocation is corrected.

FoundationDB cluster is stuck in stage_mirror: post-restore state after restore from online backup

Applies to: 5.3.0 and later

Applies to: All backup and restore methods

Diagnosing the problem

After you complete a restore from an online backup, the FoundationDB cluster gets stuck in the post-restore stage with an InProgress status.

oc get fdbcluster wkc-foundationdb-cluster --namespace wkc  -o json | jq  .status
{
  "backup_status": {
    "backup_phase": "",
    "backup_stage": "post-restore",
    "cpdbr_backup_index": 0,
    "cpdbr_backup_name": "",
    "cpdbr_backup_start": 0,
    "cpdbr_init_time": 0,
    "fdb_last_up_time": 0,
    "fdb_up_start_time": 0,
    "last_completion_time": 0,
    "last_update_time": 0
  },
  "coordinators_status": {
    "last_query_time": 0,
    "last_update_time": 0
  },
  "fdbStatus": "InProgress",
  "restore_phase": "New Restore job created",
  "shutdownStatus": " ",
  "stage_mirror": "post-restore"
}

The restore operation fails, and you see the following error message in the restore logs:

Restoring backup to version: 125493581420
ERROR: Attempted to restore into a non-empty destination database
Fatal Error: Attempted to restore into a non-empty destination database
Cause of the problem

A timing problem during the post-restore process causes this issue. When the restore process completes, the FoundationDB pods start and the lineage ingestion services (specifically wdp-kg-ingestion-service and wkc-data-lineage-service) begin writing to the database before the FoundationDB restore job can lock it. Because of this race condition, the database is no longer empty when the FoundationDB restore job attempts to clear and restore data, and the restore operation fails.

Resolving the problem
  1. Use the following command to put the wkc operator in maintenance mode:
    cpd-cli manage update-cr --component=wkc --cpd_instance_ns=<OPERAND_NS> --patch='{"spec":{"ignoreForMaintenance":true}}'
  2. Use the following commands to stop any lineage ingestion job:
    oc scale deploy wdp-kg-ingestion-service --replicas=0 -n <namespace>
    oc scale deploy wkc-data-lineage-service --replicas=0 -n <namespace>
  3. Edit the FoundationDB custom resource (CR):
    oc edit fdbcluster wkc-foundationdb-cluster -n <namespace>
    1. Force a shutdown of the FoundationDB cluster by setting spec.shutdown to "force".
    2. Check that the post-restore tag is present in the CR.
  4. Wait for all FoundationDB pods to stop.
  5. Start the FoundationDB cluster again by editing the FoundationDB CR and changing spec.shutdown from "force" to "false".
  6. Verify that the restore job completes successfully by checking the logs. The job should not report any "non-empty destination database" errors.
  7. Use the following command to verify that the FoundationDB status reaches Completed:
    oc get fdbcluster wkc-foundationdb-cluster -n <namespace> -o json | jq .status.fdbStatus
  8. Use the following commands to scale up the lineage ingestion services:
    oc scale deploy wdp-kg-ingestion-service --replicas=1 -n <namespace>
    oc scale deploy wkc-data-lineage-service --replicas=1 -n <namespace>
  9. Verify that the wkc CR reaches Completed status.
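The verification in step 7 can take some time, so a small polling helper might be more convenient than rerunning the command by hand. The following is a minimal sketch, not part of the documented procedure: the function name wait_for_value, the attempt count, and the delay are illustrative assumptions. It runs any command repeatedly and succeeds when the output (with surrounding double quotation marks stripped, as jq prints them) equals the expected value.

```shell
#!/bin/sh
# Sketch: poll a command until its output equals an expected value.
# wait_for_value is a hypothetical helper name, not an IBM-provided tool.
# Usage: wait_for_value <expected> <attempts> <delay-seconds> <command...>
wait_for_value() {
  expected=$1
  attempts=$2
  delay=$3
  shift 3
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Strip double quotation marks so that jq-style output ("Completed") also matches.
    actual=$("$@" 2>/dev/null | tr -d '"')
    if [ "$actual" = "$expected" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example (assumed values: 60 attempts, 30 seconds apart):
#   wait_for_value Completed 60 30 \
#     oc get fdbcluster wkc-foundationdb-cluster -n <namespace> \
#     -o jsonpath='{.status.fdbStatus}'
```

The helper returns a nonzero exit status if the value is not reached within the given number of attempts, so it can gate step 8 in a script.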

After a service upgrade, restoring IBM Master Data Management fails with an error stating that mdm-cr cannot be found

Applies to: 5.3.1, when upgraded from a version earlier than 5.3.0

Fixed in: 5.3.1 Patch 1

Applies to: Backup and restore with IBM Fusion

Diagnosing the problem

When performing a restore of the IBM Master Data Management service by using IBM Fusion, the process fails while attempting to run post-restore processing, and you are unable to complete the restore process. This issue only occurs when restoring a service instance that was upgraded from a version earlier than 5.3.0.

The failed process indicates the following error:

Error: error running post-restore hooks: Error running post-processing rules.  Check the cpdbr-oadp.log for errors.
1 error occurred:
	* error performing op postRestoreViaConfigHookRule for resource mdm (configmap=mdm-cpd-aux-checkpoint-cm): : masterdatamanagements.mdm.cpd.ibm.com "mdm-cr" not found
Resolving the problem

To avoid this issue, complete the following steps before restoring an IBM Master Data Management service instance that was upgraded:

  1. Update the IBM Master Data Management ConfigMap.
    oc patch cm mdm-cpd-aux-checkpoint-cm -n ${PROJECT_CPD_INST_OPERATORS} --type=json -p='[{"op":"remove","path":"/data/restore-meta"}]'
  2. Restore the IBM Master Data Management service.

Backup of IBM Master Data Management fails intermittently with timeout error

Applies to: 5.3.1 Patch 2 and later

Fixed in: 5.3.1 Patch 4

Applies to: Online backup with IBM Fusion

Diagnosing the problem

When you attempt to perform an online backup on a cluster with IBM Master Data Management installed, the backup fails intermittently during the pre-backup phase with a timeout error.

There was an error when processing the job in the Transaction Manager service. The underlying error was: 
'Recipe executed. Fail Count=1, rollback=True, last failed command: "ExecHook/br-service-hooks/pre-backup" Extracted error message: Success. 
Extracted error message: Success.Hook run in cpd-ops:cpdbr-tenant-service-5564868696-f2sj6 ended with internal rc 5 indicating hook reached timeout prior to completion. 
Extracted error message: Timeout reached before command completed..'.

The backup process times out while waiting for the IBM Master Data Management operator to acknowledge the backup trigger annotation and update the custom resource status.

Cause of the problem

The backup process adds a pre-backup annotation to the IBM Master Data Management custom resource. The operator must process this annotation and update the BRStatus field to "backup running".

However, the IBM Master Data Management operator is not checking for this annotation soon enough. It only checks for this annotation during its reconciliation cycle. This delay can cause the backup pre-hooks to time out before the operator completes the reconciliation and updates the status.

The issue occurs intermittently because it depends on when the backup trigger annotation is added relative to the reconciliation cycle of the IBM Master Data Management operator. If the operator is idle and takes time to start reconciliation, the backup can time out.

Resolving the problem

Manually trigger the IBM Master Data Management operator reconciliation during the backup process to ensure the operator immediately detects and processes the backup trigger annotation. You can run this command as soon as you start the backup, or you can wait and run it only if the backup appears to be stalled.

oc patch mdm mdm-cr -n <instance-namespace> --type=merge -p '{"spec":{"ignoreForReconcile":"'$(date +%s)'"}}'

This command forces the operator to start a new reconciliation cycle immediately. The operator then detects the backup trigger annotation and updates the status.
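The quoting in the patch command is easy to get wrong, because the single-quoted JSON string is closed, the output of date +%s is inserted, and the string is reopened. The following sketch builds the same merge patch body in a helper function so that the quoting is isolated in one place. The function name reconcile_patch is a hypothetical name used only for illustration. The value is the current epoch time in seconds, so it differs on each run that is at least one second apart.

```shell
#!/bin/sh
# Sketch: build the merge patch body that sets spec.ignoreForReconcile.
# reconcile_patch is a hypothetical helper name. An optional argument overrides
# the timestamp; by default the current epoch time in seconds is used.
reconcile_patch() {
  stamp=${1:-$(date +%s)}
  printf '{"spec":{"ignoreForReconcile":"%s"}}\n' "$stamp"
}

# Example:
#   oc patch mdm mdm-cr -n <instance-namespace> --type=merge -p "$(reconcile_patch)"
```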

Watson OpenScale fails to reconcile due to Kafka pod

Applies to: 5.3.1 Patch 3 and later

Applies to: Online restore with IBM Fusion

Diagnosing the problem

After you complete a restore by using IBM Fusion, the Watson OpenScale custom resource fails to reconcile and remains stuck in InProgress status.

When you check the Watson OpenScale custom resource status, you see output similar to the following:

# oc get WOService aiopenscale -n <namespace> -o yaml
status:
  phase: InProgress
  progress: 20%
  progressMessage: Installing OpenScale storage services
  reconcileHistory:
  - message: |
      Task - Wait until the Kafka statefulset is finished
      Error Details: 'Task failed: Action failed: Unknown error.'
    timestamp: "2026-04-16T15:07:09.63967Z"
  - message: |
      Task - Wait until the Kafka statefulset is finished
      Error Details: 'Task failed: Action failed: Unknown error.'
    timestamp: "2026-04-16T14:49:14.24406Z"
  - message: |
      Task - Wait until the Kafka statefulset is finished
      Error Details: 'Task failed: Action failed: Unknown error.'
    timestamp: "2026-04-16T14:30:42.43095Z"
  versions:
    reconciled: ""
  wosBuildNumber: "30"
  wosStatus: InProgress

The restore job reports that it was successful, but the Watson OpenScale custom resource has failed to reconcile.

When you check the cluster, you see that the Kafka pod is in CrashLoopBackOff state:

# oc get pod | grep aios
aiopenscale-ibm-aios-kafka-0        0/1     CrashLoopBackOff   55 (19s ago)    4h59m
Cause of the problem

When the Kafka container restarts due to slow network, storage issues, or resource exhaustion, a bug can cause the pod to enter a CrashLoopBackOff state. After the pod initially fails to start, it continues to fail to restart properly, which prevents the Watson OpenScale resource from completing reconciliation. This issue can occur during restore operations or in other scenarios where Kafka pods restart.

Resolving the problem

You must manually delete the pod so that a new pod gets created.

  1. Identify the Kafka pods that are in CrashLoopBackOff state by listing all Kafka pods for Watson OpenScale.
    oc get pod -l component=aios-kafka
  2. Delete the affected Kafka pods in CrashLoopBackOff state.

    For example, if aiopenscale-ibm-aios-kafka-0 is the only pod in CrashLoopBackOff state, then delete that pod.

    oc delete pod aiopenscale-ibm-aios-kafka-0
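If several Kafka pods exist, picking out the affected ones by eye can be error-prone. The following is a minimal sketch that filters the output of the command in step 1 and prints only the names of pods whose STATUS column is CrashLoopBackOff. The function name crashloop_pods is hypothetical, and the sketch assumes the default oc get pod column layout (NAME, READY, STATUS) without the -A option.

```shell
#!/bin/sh
# Sketch: print the names of pods that are in CrashLoopBackOff state.
# Reads `oc get pod` output on stdin. Assumes the default columns, where
# column 1 is NAME and column 3 is STATUS. crashloop_pods is a hypothetical name.
crashloop_pods() {
  awk '$3 == "CrashLoopBackOff" { print $1 }'
}

# Example:
#   oc get pod -l component=aios-kafka | crashloop_pods
```

Review the printed names before you delete any pods.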

Watson OpenScale fails to reconcile after restore

Applies to: 5.3.1 Patch 2 and later

Applies to: Online backup and restore with IBM Fusion

Diagnosing the problem

After completing a restore, Watson OpenScale fails to reconcile.

When you check the Watson OpenScale service status, the aiopenscale service is stuck in InProgress status at 40% progress.

# oc get WOService
NAME                        TYPE              SCALECONFIG   PHASE        PROGRESS   RECONCILED   STATUS
aiopenscale                 service           small         InProgress   40%                     InProgress
openscale-defaultinstance   serviceInstance   small
# oc get WOService aiopenscale -o yaml
  progress: 40%
  progressMessage: Creating Openscale micro service Deployments
  wosStatus: InProgress

If you check for pods that are not in Running or Completed status, you see that the aiopenscale-ibm-aios-configuration pod is in CrashLoopBackOff status with repeated liveness and readiness probe failures.

# oc get pods -A | egrep -v "Running | Completed"
NAMESPACE                                          NAME                                                                                    READY   STATUS             RESTARTS         AGE
bvt                                                aiopenscale-ibm-aios-configuration-75df6ddd95-q8m64                                     0/1     CrashLoopBackOff   47 (3m42s ago)   4h55m
# oc describe po aiopenscale-ibm-aios-configuration-75df6ddd95-q8m64 | tail -f
                              topology.kubernetes.io/zone:ScheduleAnyway when max skew 1 is exceeded for selector component in (aios-configuration)
Events:
  Type     Reason     Age                      From     Message
  ----     ------     ----                     ----     -------
  Normal   Created    50m (x41 over 4h55m)     kubelet  Created container: aios-configuration-service
  Warning  BackOff    14m (x477 over 4h23m)    kubelet  Back-off restarting failed container aios-configuration-service in pod aiopenscale-ibm-aios-configuration-75dfd95-q8m64_bvt(a815344c--80ef-7cae9cd)
  Normal   Pulled     11m (x47 over 4h55m)     kubelet  Container image "cp.stg.icr.io/cp/cpd/aios-configuration-service@sha256:87e0fe0bdd704" already present on machine
  Warning  Unhealthy  9m18s (x341 over 4h54m)  kubelet  Readiness probe failed: Get "https://10.3/v1/aios_configuration_service/heartbeat": dial tcp 10.3: connect: connection refused
  Warning  Unhealthy  4m26s (x287 over 4h54m)  kubelet  Liveness probe failed: Get "https://10.3/v1/aios_configuration_service/heartbeat": dial tcp 10.3: connect: connection refused
  Normal   Killing    3m56s (x48 over 4h51m)   kubelet  Container aios-configuration-service failed liveness probe, will be restarted
Cause of the problem

The issue is caused by an unhealthy etcd instance, which is likely a problem with the node where etcd is running. When the etcd service is unable to respond properly, it prevents the Watson OpenScale configuration service from starting successfully, which causes the reconciliation process to fail.

Resolving the problem

Move the etcd instance to a different node.

  1. Run the following commands to check the health of the etcd endpoints and identify any unhealthy pods:
    namespace=INSTANCE_NS
    oc exec -it aiopenscale-ibm-aios-etcd-0 -- bash
    etcdctl --insecure-skip-tls-verify \
      --cacert=/etc/internal-tls/ca.crt \
      --user root:$(cat /etc/.secrets/etcd/etcd-root-password) \
      --endpoints=aiopenscale-ibm-aios-etcd-0.aiopenscale-ibm-aios-etcd.<namespace>.svc.cluster.local:2379,aiopenscale-ibm-aios-etcd-1.aiopenscale-ibm-aios-etcd.<namespace>.svc.cluster.local:2379,aiopenscale-ibm-aios-etcd-2.aiopenscale-ibm-aios-etcd.<namespace>.svc.cluster.local:2379 endpoint health

    If you see an error message like the following example, note the name of the pod. In this example, the unhealthy pod is aiopenscale-ibm-aios-etcd-2:

    {"level":"warn","ts":"2026-04-08T13:30:24.995960Z","logger":"client","caller":"v3@v3.5.28/retry_interceptor.go:66","msg":"retrying of unary invoker failed",
    "target":"etcd-endpoints://0xc00037b680/aiopenscale-ibm-aios-etcd-2.aiopenscale-ibm-aios-etcd.bvt.svc.cluster.local:2379","peer":"Peer{Addr: <nil>, LocalAddr: <nil>, AuthInfo: <nil>}",
    "attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded while waiting for connections to become ready"}
  2. Find the node where the unhealthy pod is currently running.
    export INSTANCE_NS=<namespace>                        
    export UNHEALTHY_ETCD_POD=<pod-name> 
    oc get pod ${UNHEALTHY_ETCD_POD} -n ${INSTANCE_NS} -o wide
  3. Prevent new pods from being scheduled on that node so that the replacement pod goes to another node.
    OLD_NODE=$(oc get pod ${UNHEALTHY_ETCD_POD} -n ${INSTANCE_NS} -o jsonpath='{.spec.nodeName}')
    oc adm cordon ${OLD_NODE}
  4. Delete the unhealthy pod.
    oc delete pod ${UNHEALTHY_ETCD_POD} -n ${INSTANCE_NS}

    The StatefulSet re-creates it immediately.

  5. When the replacement pod is running, uncordon the node.
    oc adm uncordon ${OLD_NODE}

After moving etcd to a healthy node, the Watson OpenScale service reconciles successfully.

# oc get woservice
NAME                        TYPE              SCALECONFIG   PHASE   PROGRESS   RECONCILED   STATUS
aiopenscale                 service           small         Ready   100%       5.3.3        Completed
openscale-defaultinstance   serviceInstance   small         Ready   100%       5.3.3        Completed 
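The --endpoints value in step 1 repeats the namespace in every endpoint, which makes it easy to leave one entry pointing at the wrong project. The following sketch generates the comma-separated list from a single namespace argument. The function name etcd_endpoints is hypothetical, and the sketch assumes three etcd replicas (indexes 0 to 2) and port 2379, as in the example above.

```shell
#!/bin/sh
# Sketch: build the --endpoints value for etcdctl from one namespace.
# Assumes three etcd pods (0, 1, 2) and port 2379, as in the documented example.
# etcd_endpoints is a hypothetical helper name.
etcd_endpoints() {
  ns=$1
  out=""
  for i in 0 1 2; do
    ep="aiopenscale-ibm-aios-etcd-${i}.aiopenscale-ibm-aios-etcd.${ns}.svc.cluster.local:2379"
    # Add a comma separator only when the list already has an entry.
    out="${out:+${out},}${ep}"
  done
  printf '%s\n' "$out"
}

# Example:
#   etcd_endpoints "${INSTANCE_NS}"
```

Because the etcdctl command in step 1 runs inside the etcd pod, generate the list before you open the shell in the pod and paste the result into the --endpoints option.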

Validation for cpd-ikc-ikc-aux-ckpt-cm ConfigMap fails

Applies to: 5.3.1 Patch 1 and later

Fixed in: 5.3.1 Patch 3

Applies to: Online backup with IBM Fusion

Diagnosing the problem

This issue occurs only when IBM Software Hub is running on Power hardware (ppc64le clusters). When you take an online backup by using IBM Fusion, the backup fails during the validation stage with an error related to the cpd-ikc-ikc-aux-ckpt-cm ConfigMap.

The backup process reports the following error:

ERROR: Recipe failed
There was an error when processing the job in the Transaction Manager service. The underlying error was: 'Recipe executed. Fail Count=1, rollback=True, last failed command: "ExecHook/br-service-hooks/resource-validation" 
Hook run in cpd-ops:cpdbr-tenant-service-6f5f788577-5xlzh ended with internal rc 1 indicating hook execution ended in failure. 
Extracted error message: Found error in exec hook output file rc=1
Error: 1 error occurred:
	* backup validation failed for configmap: cpd-ikc-ikc-aux-ckpt-cm
Cause of the problem

The backup validation service fails to validate the specific secrets referenced in the cpd-ikc-ikc-aux-ckpt-cm and cpd-ikc-ikc-aux-br-cm ConfigMaps during the resource validation phase. These ConfigMaps are associated with IBM Knowledge Catalog.

Resolving the problem

Before taking a backup, edit the ConfigMaps to remove references to the secrets.

  1. Edit the following ConfigMaps:
    • cpd-ikc-ikc-aux-br-cm
    • cpd-ikc-ikc-aux-ckpt-cm
  2. Remove the following lines in each ConfigMap:
    - resource-kind: secret
      validation-rules:
        - type: match_names
          names:
            - manta-credentials
            - manta-keys
  3. Retry the backup.

Backup fails at pre-check for watsonx Orchestrate due to missing ClusterRole permissions

Applies to: 5.3.1 and later

Fixed in: 5.3.1 Patch 2

Applies to: Backup and restore with IBM Fusion

Diagnosing the problem

When you take a backup by using IBM Fusion, the backup fails during the pre-check phase and the pre-check hooks fail for watsonx Orchestrate.

You see errors in the IBM Fusion job similar to the following error message:

2026-02-25 15:02:27 [INFO]: Recipe executed. Fail Count=1, rollback=True, last failed command: "ExecHook/br-service-hooks/precheck-backup"  Extracted error message: Success.  Extracted error message: Success.Hook run in cpd-operators:cpdbr-tenant-service-5b6b659579-ztcc6 ended with internal rc 1 indicating hook execution ended in failure.  Extracted error message: Found error in exec hook output file rc=1
: time=2026-02-25T15:02:27.419540Z level=error msg=precheck failed with error: pre-check backup hooks encountered one or more error(s), err=1 error occurred:
	* backup precheck hook finished with status=error, configmap=wo-watson-orchestrate-backup-aux-checkpoint
 func=cpdbr-oadp/cmd.Execute file=/go/src/cpdbr-oadp/cmd/root.go:94

Or if you ran the command manually, you might see the following errors:

--------------------------------------------------------------------------------
Hook execution breakdown by status=error/timedout:
The following hooks either have errors or timed out
pre-check (count=1, duration=18ms):
    	ADDONID    	COMPONENT             	CONFIGMAP                                  	PRIORITY_ORDER	METHOD	STATUS	DURATION	START_TIME              	END_TIME                 
    	orchestrate	ibm-watson-orchestrate	wo-watson-orchestrate-backup-aux-checkpoint	30            	rule  	error 	18ms    	2026-02-25 3:18:11.97 PM	2026-02-25 3:18:11.989 PM
--------------------------------------------------------------------------------
** INFO [BACKUP PRE-CHECK HOOKS/SUMMARY/END] **********************************
time=2026-02-25T15:18:12.052544Z level=error msg=precheck failed with error: pre-check backup hooks encountered one or more error(s), err=1 error occurred:
	* backup precheck hook finished with status=error, configmap=wo-watson-orchestrate-backup-aux-checkpoint

 func=cpdbr-oadp/cmd/cli/backup.(*PreCheckCommandContext).ProcessPreCheck file=/go/src/cpdbr-oadp/cmd/cli/backup/precheck.go:462
time=2026-02-25T15:18:12.052604Z level=error msg=error running processPreCheck(): precheck failed with error: pre-check backup hooks encountered one or more error(s), err=1 error occurred:
	* backup precheck hook finished with status=error, configmap=wo-watson-orchestrate-backup-aux-checkpoint

 func=cpdbr-oadp/cmd/cli/backup.(*PreCheckCommandContext).execute file=/go/src/cpdbr-oadp/cmd/cli/backup/precheck.go:330
Error: precheck failed with error: pre-check backup hooks encountered one or more error(s), err=1 error occurred:
	* backup precheck hook finished with status=error, configmap=wo-watson-orchestrate-backup-aux-checkpoint

time=2026-02-25T15:18:12.052652Z level=error msg=precheck failed with error: pre-check backup hooks encountered one or more error(s), err=1 error occurred:
	* backup precheck hook finished with status=error, configmap=wo-watson-orchestrate-backup-aux-checkpoint

 func=cpdbr-oadp/cmd.Execute file=/go/src/cpdbr-oadp/cmd/root.go:94
Cause of the problem

The ClusterRole configuration for watsonx Orchestrate is missing the get and list permissions. To read the watsonxorchestrates custom resource, the backup pre-check process requires that get and list are defined as verbs in the ClusterRole definition for the watsonxorchestrates custom resource. The backup service cannot access resources properly when these verbs are missing, and it causes the backup precheck to fail when validating resources.

Resolving the problem

To fix this issue, update the ClusterRole configuration to add list and get to verbs.

  1. Use the following command to patch the watsonx Orchestrate ClusterRole:

    oc patch ClusterRole watsonxorchestrates.wo.watsonx.ibm.com-v1-cpd-operators-edit \
    --type='json' \
    -p='[{"op": "add", "path": "/rules/0/verbs/-", "value": "list"}, {"op": "add", "path": "/rules/0/verbs/-", "value": "get"}]'

    The updated ClusterRole should look like this:

    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      labels:
        component-id: watsonx-orchestrate-cluster-scoped
        icpdsupport/addOnId: orchestrate
        rbac.authorization.k8s.io/aggregate-to-edit: "true"
      name: watsonxorchestrates.wo.watsonx.ibm.com-v1-cpd-operators-edit
    rules:
    - apiGroups:
      - wo.watsonx.ibm.com
      resources:
      - watsonxorchestrates
      verbs:
      - create
      - update
      - patch
      - delete
      - get
      - list
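The patch command appends list and get unconditionally, so running it against a ClusterRole that already has those verbs would add duplicates. The following is a minimal sketch for checking which of the two verbs are missing before you patch. The function name missing_verbs is hypothetical. It expects the verbs as one space-separated string, which is the form that oc prints for a jsonpath expression such as {.rules[0].verbs[*]}.

```shell
#!/bin/sh
# Sketch: report which of the verbs "get" and "list" are absent.
# Input: a space-separated list of verbs. missing_verbs is a hypothetical name.
missing_verbs() {
  have=" $1 "
  missing=""
  for verb in get list; do
    case "$have" in
      *" $verb "*) ;;  # verb is already present
      *) missing="${missing:+${missing} }${verb}" ;;
    esac
  done
  printf '%s\n' "$missing"
}

# Example:
#   verbs=$(oc get clusterrole watsonxorchestrates.wo.watsonx.ibm.com-v1-cpd-operators-edit \
#     -o jsonpath='{.rules[0].verbs[*]}')
#   missing_verbs "$verbs"
```

An empty result suggests that both verbs are already defined in the first rule and that the patch is not needed.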

Notification dropdown does not open after watsonx.ai restore with IBM Fusion

Applies to: 5.3.0

Fixed in: 5.3.1

Applies to: Restore with IBM Fusion

Diagnosing the problem

After restoring a watsonx.ai instance by using IBM Fusion, the Notification bell icon in the navigation does not respond if you click it. The Notification drop-down list also fails to open, which prevents you from accessing your notifications and notification settings through the user interface.

This issue affects all perspectives in the IBM Software Hub user interface, not just watsonx.ai.

The browser console also displays the following JavaScript error:

Uncaught TypeError: can't access property "lastIndexOf", e is undefined
Cause of the problem

This scenario is a rare issue that occurs only when a notification with a specific payload structure is sent to the navigation header. It triggers a JavaScript error that prevents the Notification drop-down list from opening.

Resolving the problem

You can access the Notifications page by manually adding /notifications2 to the URL of your cluster in your browser, for example: <cluster>/notifications2.

The URL provides direct access to the Notifications page, bypassing the drop-down list in the navigation. You can view all notifications and access notification settings through the page.

Resource validation at the end of a backup fails with OOMKilled status

Applies to: 5.3.0 and later

Applies to: Backup and restore with IBM Fusion

Diagnosing the problem

When performing a backup with IBM Fusion, the backup fails during the resource validation phase at the end of the backup process. The IBM Fusion transaction manager logs indicate the following errors:

Recipe failed
BMYBR0009
There was an error when processing the job in the Transaction Manager service. 
The underlying error was: 'Recipe executed. Fail Count=1, rollback=True, last failed command: "ExecHook/br-service-hooks/resource-validation"   Extracted error message: Success.Hook run in cpd-operators:cpdbr-tenant-service-5cb9-tsj ended with internal rc 1 indicating hook execution ended in failure. Extracted error message: Running command '/cpdbr-scripts/cpdbr-oadp backup validate --tenant-operator-namespace=cpd-operators --namespace=ibm-backup-restore --spec-version=2.0.0' failed with exception: (0)\nReason: Handshake status 500 Internal Server Error -+-+- {'content-length': '28', 'content-type': 'text/plain; charset=utf-8', 'date': 'Fri, 13 Feb 2026 09:20:36 GMT'} -+-+- b'container not found ("main")'\n.'.

When you investigate, you see that the resource validation caused the cpdbr-tenant-service pod to restart with an OOMKilled (exit code 137) status.

Cause of the problem

The resource validation process requires more memory than the default memory limit that is allocated in the cpdbr-tenant-service deployment. When the memory limit is insufficient, it causes the pod to be killed by the Kubernetes OOM (Out of Memory) killer.

This issue occurs more frequently in environments with large numbers of resources being backed up, where the validation process needs to process and validate all backed-up resources.

Resolving the problem

Increase the memory limit for the cpdbr-tenant-service deployment before performing a backup with IBM Fusion.

  1. Check the memory limit of the cpdbr-tenant-service deployment.
    oc get deployment \
    -n ${PROJECT_CPD_INST_OPERATORS} cpdbr-tenant-service \
    -o jsonpath='{.spec.template.spec.containers[0].resources.limits.memory}'
  2. Run the following command to increase the memory limit for cpdbr-tenant-service from 1Gi to 4Gi:
    
    oc patch deployment cpdbr-tenant-service \
    -n ${PROJECT_CPD_INST_OPERATORS} \
    --type='json' \
    -p='[{ "op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "4Gi" }]'

    The deployment will automatically restart the pod with the new memory limit.

  3. Wait for the pod to be ready, and then try to take another backup.

    You can check if the pod is ready by using the following command:

    oc get pods -n ${PROJECT_CPD_INST_OPERATORS} | grep cpdbr-tenant-service
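Before you patch the deployment, it can be useful to confirm that the current limit from step 1 is actually lower than the target, so that a limit that was already raised is not reduced by accident. The following is a minimal sketch. The function names mem_to_mib and needs_increase are hypothetical, and the sketch handles only whole-number quantities with the Mi and Gi suffixes.

```shell
#!/bin/sh
# Sketch: compare two Kubernetes memory quantities such as 1Gi and 4Gi.
# Handles only whole numbers with the Mi or Gi suffix. Both function names
# are hypothetical helpers.
mem_to_mib() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *) return 1 ;;  # unsupported format
  esac
}

# Returns 0 if the current limit is lower than the target, 1 if it is not,
# and 2 if either value cannot be parsed.
needs_increase() {
  cur=$(mem_to_mib "$1") || return 2
  tgt=$(mem_to_mib "$2") || return 2
  [ "$cur" -lt "$tgt" ]
}

# Example:
#   current=$(oc get deployment -n ${PROJECT_CPD_INST_OPERATORS} cpdbr-tenant-service \
#     -o jsonpath='{.spec.template.spec.containers[0].resources.limits.memory}')
#   needs_increase "$current" 4Gi && echo "patch required"
```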

Informix backup fails at pre-check due to incorrect custom resource name

Applies to: 5.3.0

Fixed in: 5.3.1

Applies to: Backup and restore with IBM Fusion

Diagnosing the problem

When performing a backup of Informix using IBM Fusion 2.12, the backup fails during the pre-check phase, and you see the following error message:

The following hooks either have errors or timed out
pre-check (1):
    COMPONENT                CONFIGMAP                          METHOD  STATUS  DURATION    ADDONID 
    informix-17677369712650 informix-17677369712650-aux-ckpt-cm rule    error   188.325222ms informix

The backup pre-check hooks fail with a role-based access control (RBAC) issue when attempting to access the Informix service custom resource.

Cause of the problem

The ClusterRole configuration for Informix uses the singular form informixservice for the custom resource name instead of the plural form informixservices, for example:

rules:
- apiGroups:
  - ifx.ibm.com
  resources:
  - informixservice  # Incorrect: should be "informixservices"

The Kubernetes endpoint requires the plural form. Because the ClusterRole uses the singular form, the backup pre-check fails when IBM Fusion attempts to verify the Informix service status.

Resolving the problem

You must manually update the ClusterRole configuration with the correct plural resource name.

  1. Log in to the cluster as cluster admin.
  2. Find the Informix ClusterRole configuration:
    oc get clusterrole | grep informix
  3. Use the result from the previous command to edit the ClusterRole configuration:
    oc edit clusterrole informixservice-v1-<CPD_OPERATOR_NS>-edit
  4. Update the resources section:
    rules:
    - apiGroups:
      - ifx.ibm.com
      resources:
      - informixservices
  5. Save the change.
  6. Confirm the change was applied:
    oc describe clusterrole informixservice-v1-<CPD_OPERATOR_NS>-edit
  7. Retry the backup operation.

Db2 backup fails at the Hook: br-service hooks/pre-backup step

Applies to: 5.3.0 and later

Applies to: Backup and restore with IBM Fusion

Diagnosing the problem
In the cpdbr-oadp.log file, you see messages like the following example:
time=<timestamp> level=info msg=podName: c-db2oltp-5179995-db2u-0, podIdx: 0, container: db2u, actionIdx: 0, commandString: ksh -lc 'manage_snapshots --action suspend --retry 3', command: [sh -c ksh -lc 'manage_snapshots --action suspend --retry 3'], onError: Fail, singlePodOnly: false, timeout: 20m0s func=cpdbr-oadp/pkg/quiesce.executeCommand file=/go/src/cpdbr-oadp/pkg/quiesce/ruleexecutor.go:767
time=<timestamp> level=info msg=cmd stdout:  func=cpdbr-oadp/pkg/quiesce.executeCommand file=/go/src/cpdbr-oadp/pkg/quiesce/ruleexecutor.go:823
time=<timestamp> level=info msg=cmd stderr: [<timestamp>] - INFO: Setting wolverine to disable
Traceback (most recent call last):
  File "/usr/local/bin/snapshots", line 33, in <module>
    sys.exit(load_entry_point('db2u-containers==1.0.0.dev1', 'console_scripts', 'snapshots')())
  File "/usr/local/lib/python3.9/site-packages/cli/snapshots.py", line 35, in main
    snap.suspend_writes(parsed_args.retry)
  File "/usr/local/lib/python3.9/site-packages/snapshots/snapshots.py", line 86, in suspend_writes
    self._wolverine.toggle_state(enable=False, message="Suspend writes")
  File "/usr/local/lib/python3.9/site-packages/utils/wolverine/wolverine.py", line 73, in toggle_state
    self._toggle_state(state, message)
  File "/usr/local/lib/python3.9/site-packages/utils/wolverine/wolverine.py", line 77, in _toggle_state
    self._cmdr.execute(f'wvcli system {state} -m "{message}"')
  File "/usr/local/lib/python3.9/site-packages/utils/command_runner/command.py", line 122, in execute
    raise CommandException(err)
utils.command_runner.command.CommandException: Command failed to run:ERROR:root:HTTPSConnectionPool(host='localhost', port=9443): Read timed out. (read timeout=15)
Cause of the problem
The Wolverine high availability monitoring process was in a RECOVERING state before the backup was taken.

Check the Wolverine status by running the following command:

wvcli system status
Example output:
ERROR:root:REST server timeout: https://localhost:9443/status
ERROR:root:Retrying Request: https://localhost:9443/status
ERROR:root:REST server timeout: https://localhost:9443/status
ERROR:root:Retrying Request: https://localhost:9443/status
ERROR:root:REST server timeout: https://localhost:9443/status
ERROR:root:Retrying Request: https://localhost:9443/status
ERROR:root:REST server timeout: https://localhost:9443/status
HA Management is RECOVERING at <timestamp>.
The Wolverine log file /mnt/blumeta0/wolverine/logs/ha.log shows errors similar to the following example:
<timestamp> [ERROR] <MainProcess:11490> [wolverine.ha.loop(loop.py:414)] -  check_and_recover: unhealthy_dm_set = {('c-db2oltp-5179995-db2u-0', 'node')}
<timestamp> [ERROR] <MainProcess:11490> [wolverine.ha.loop(loop.py:416)] - (c-db2oltp-5179995-db2u-0, node) : not OK
<timestamp> [ERROR] <MainProcess:11490> [wolverine.ha.loop(loop.py:421)] -  check_and_recover: unhealthy_dm_names = {'node'}
Resolving the problem
Do the following steps:
  1. Re-initialize Wolverine:
    wvcli system init --force
  2. Wait until the Wolverine status is RUNNING. Check the status by running the following command:
    wvcli system status
  3. Retry the backup.
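
The wait in step 2 can be scripted as a polling loop. The following is a minimal sketch, not part of the product. The function names wolverine_state and wait_for_wolverine are hypothetical, and the parsing assumes that the status line has the same form as in the example output (HA Management is <STATE> at <timestamp>.), including when the state is RUNNING.

```shell
# Extract the state word from `wvcli system status` output read on stdin.
# Assumption: the status line looks like "HA Management is <STATE> at <timestamp>."
wolverine_state() {
  sed -n 's/^HA Management is \([A-Z_]*\) at .*/\1/p' | tail -n 1
}

# Poll until the state is RUNNING or the attempts run out.
# Assumption: this runs in a shell where wvcli is available (for example, the db2u container).
# Arguments: number of attempts (default 30), seconds between attempts (default 10).
wait_for_wolverine() {
  attempts=${1:-30}
  interval=${2:-10}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    state=$(wvcli system status 2>/dev/null | wolverine_state)
    [ "$state" = "RUNNING" ] && return 0
    i=$((i + 1))
    sleep "$interval"
  done
  return 1
}
```

For example, running wait_for_wolverine 30 10 after the re-initialization returns 0 as soon as the status is RUNNING, and returns 1 if the status does not change within about five minutes.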

Backup service location is unavailable during backup

Applies to: 5.3.0 and later

Applies to: Backup with IBM Fusion

If your cluster is running Red Hat OpenShift Container Platform 4.19 with IBM Fusion 2.11 and OADP 1.5.1, then the backup service location might enter Unavailable state. To resolve this issue, upgrade OADP to version 1.5.2.

Watson Discovery fails after restore with opensearch and post_restore unverified components

Applies to: 5.3.0 and later

Applies to: Backup and restore with IBM Fusion

Diagnosing the problem
After you restore, Watson Discovery becomes stuck with the following components listed as unverifiedComponents:
unverifiedComponents:
- opensearch
- post_restore
Additionally, the OpenSearch client pod might show an unknown container status, similar to the following example:
NAME                                 READY   STATUS                    RESTARTS   AGE
wd-discovery-opensearch-client-000   0/1     ContainerStatusUnknown    0          11h
Cause of the problem
The post_restore component depends on the opensearch component being verified. However, the OpenSearch client pod is not running, which prevents verification and causes the restore process to stall.
Resolving the problem
Manually delete the OpenSearch client pod to allow it to restart:
$ oc delete -n ${PROJECT_CPD_INST_OPERANDS} pod wd-discovery-opensearch-client-000

After the pod is restarted and verified, the post_restore component should complete the verification process.
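
If the client pod has a different ordinal in your environment, you can locate it by its status instead of by name. The following is a minimal sketch. The function name pods_with_unknown_status is hypothetical, and the filter assumes the default oc get pods column layout that is shown in the example above.

```shell
# Print the names of pods whose STATUS column is ContainerStatusUnknown.
# Input: the table printed by `oc get pods` (NAME READY STATUS RESTARTS AGE).
pods_with_unknown_status() {
  awk '$3 == "ContainerStatusUnknown" { print $1 }'
}

# Example usage (requires cluster access):
#   oc get pods -n "${PROJECT_CPD_INST_OPERANDS}" | pods_with_unknown_status |
#     while read -r pod; do
#       oc delete -n "${PROJECT_CPD_INST_OPERANDS}" pod "$pod"
#     done
```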

Restore process stuck at db2u-aux-ckpt-cm-child.db2ucluster-ready/readyState

Applies to: 5.3.0 and later

Applies to: Restore with IBM Fusion

Diagnosing the problem

When you use IBM Fusion to restore to the same cluster, the restore process fails when it times out while trying to verify that the db2ucluster or db2uinstance resources are in Ready status. You might receive something similar to the following error message:

There was an error when processing the job in the Transaction Manager service. 
The underlying error was: 'Execution of workflow restore of recipe ibmcpd-tenant completed. 
Number of failed commands: 1, last failed command: 
"CheckHook/db2u-aux-ckpt-cm-child.db2ucluster-ready/readyState"'.
Cause of the problem

IBM Fusion fails to restore because of a combination of factors:

  • The Db2 Velero plugin is not present in the IBM Fusion DPA.
  • The IBM Fusion check hook requires both db2uinstance and db2ucluster resources to be present, and it cannot skip checks when resources don't exist in the cluster.
Workaround
You have two options to resolve the issue:
  • Update IBM Fusion from version 2.10 to either version 2.10 with hotfixes or version 2.10.1, and then restart the backup and restore.

  • Reconfigure DataProtectionApplication (DPA) and start a fresh backup and restore.
    1. Configure DPA from the source cluster by following the process for creating a DPA custom resource in Installing and configuring software on the source cluster for backup and restore with IBM Fusion.
    2. Use IBM Fusion to create another backup.
    3. Start another restore from the new backup.

Restore process stuck at zenextensions-patch-ckpt-cm step

Applies to: 5.3.0 and later

Applies to: Backup and restore with IBM Fusion 2.10 (without hotfixes)

Diagnosing the problem
IBM Fusion fails to restore because the process gets stuck at the zenextensions-patch-ckpt-cm step. The restore fails at the following stage:
ExecHook/zenextensions-patch-ckpt-cm-zen-1-child.zenextensions-hooks/force-reconcile-zenextensions
Cause of the problem

The issue occurs because the IBM Fusion exec hook cannot handle trailing whitespace, so the restore process gets stuck waiting for the zenextensions hooks to complete.

Workarounds
You have two options to resolve the issue:
  • Update IBM Fusion from version 2.10 to either version 2.10 with hotfixes or version 2.10.1, and then restart the backup and restore.

  • Use the fusion-resume-restore script.

You can use the fusion-resume-restore.sh script to continue the restore process from the point where it failed.

  1. Check that you have the prerequisites to run the script:
  2. Download the fusion-resume-restore script from https://github.com/IBM/cpd-cli/blob/master/cpdops/files/fusion-resume-restore.sh.
  3. Ensure you have write access to /tmp/fusion-resume-restore.
  4. Run the script from the hub cluster:
    ./fusion-resume-restore.sh <fusion-restore-name> ${PROJECT_CPD_INST_OPERATORS}

    The script displays the steps required to restore and their sequence numbers.

  5. When the script asks you to Enter the key of the workflow hook/group to resume from, select the index number that excludes the zenextensions-patch-ckpt-cm step.

    For example:

    Enter the key of the workflow hook/group to resume from (0-101): 97
    Selected workflow (index=97): "hook: zenextensions-patch-ckpt-cm-zen-1-child.zenextensions-hooks/force-reconcile-zenextensions"
  6. When the script asks Resume the Fusion restore now?, choose a response based on whether you are restoring to the same or a different cluster.
    • If restoring to the same cluster, reply y.
    • If restoring to a different cluster, reply n.

      The script provides instructions to manually apply the CRs to resume the restore.

OADP 1.5.x Certificate Issue with Fusion BSL

Diagnosing the problem
If your cluster is running OpenShift Container Platform 4.19 with IBM Fusion 2.11, and OADP 1.5.1 is installed or upgraded, the backup service location (BSL) might enter the Unavailable phase. To resolve this issue, upgrade OADP to version 1.5.2.

Restore for watsonx.data Premium completes but dependent CRs remain in a bad state

Applies to: 5.3.1 and later

Fixed in: 5.3.1 Patch 3

Applies to: Online restore to different cluster with NetApp Trident protect

Diagnosing the problem

You perform an online restore by using NetApp Trident protect on a cluster with watsonx.data Premium installed. The restore operation seems to be successful, but the watsonx.data Premium custom resource (CR) and its dependent CRs remain in an InProgress state and have low completion percentages. For example:

watsonx_data              WxdAddon                 wxdaddon                     InProgress  15%
watsonx_data_premium      WxdAddonPremium          wxdaddon-premium             InProgress  56%
wml                       WmlBase                  wml-cr                       InProgress  0%
watsonx_ai                Watsonxai                watsonxai-cr                 InProgress  27%

In addition, multiple pods fail to start with ErrImagePull and ImagePullBackOff errors.

Cause of the problem

watsonx.data Premium creates dependent custom resources during its reconciliation process. When a restore operation is performed, these dependent CRs are restored from the backup with their original metadata, including fields like uid, resourceVersion, managedFields, generation, and creationTimestamp.

However, the watsonx.data Premium operator already created new instances of these dependent CRs on the target cluster during the restore process. The restored CRs conflict with the newly created CRs when their metadata and specifications differ.

Resolving the problem

You can manually patch the dependent CRs after the restore completes. To do so, remove the conflicting metadata fields from the backed-up CR files, and then reapply the files to the cluster.

  1. After the restore operation completes, identify the dependent CRs that are in a bad state:
    cpd-cli manage get-cr-status --cpd_instance_ns=<namespace>
  2. Locate the backed-up CR file in the restore data directory for one of the CRs, and copy the CR file to a new location for editing.
    BACKUP_NAME=<define your tenant-backup name here>
    cpd-cli oadp tenant-backup download ${BACKUP_NAME}
    unzip ${BACKUP_NAME}-data.zip
    
    for backupTar in *.tar.gz; do
      echo "extracting $backupTar"
      backupDir=$(basename "$backupTar" .tar.gz)
      mkdir -p "$backupDir"
      tar -xf "$backupTar" -C "$backupDir"
    done
    
    crLoc=$(find . -type f | grep <resource-name or pattern>| grep -v preferred | head -1)
  3. Edit the copied file and remove the following fields from the metadata section:
    • uid
    • resourceVersion
    • managedFields
    • generation
    • creationTimestamp

    These fields will be regenerated when you apply the custom resource.

  4. Remove the entire status section from the CR file.
  5. Delete the newly created CR:
    oc delete <custom-resource-type> <cr-name> -n <namespace>
  6. Apply the modified CR file:
    oc apply --server-side --force-conflicts -f $crLoc
  7. Verify that the CR begins reconciling, then repeat the steps for any other dependent CRs that need to be patched.
  8. Monitor the CR status until all CRs reach Completed state:
    cpd-cli manage get-cr-status --cpd_instance_ns=<namespace>
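
To see at a glance which CRs still need to be patched, you can filter the status output. The following is a minimal sketch. The function name incomplete_crs is hypothetical, and the filter assumes that each status row has the five columns that are shown in the example above (component, kind, CR name, status, and percentage). Adjust the column numbers if your output differs.

```shell
# Print "<kind> <cr-name>" for every status row that is not Completed.
# Assumption: rows look like "<component> <kind> <cr-name> <status> <percent>%".
# Rows that do not end in a percentage (headers, log lines) are ignored.
incomplete_crs() {
  awk 'NF >= 5 && $5 ~ /%$/ && $4 != "Completed" { print $2, $3 }'
}

# Example usage:
#   cpd-cli manage get-cr-status --cpd_instance_ns=<namespace> | incomplete_crs
```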

OADP backup with the same name as a NetApp Trident protect backup can be deleted by cpd-trident-protect.py backup delete

Applies to: 5.3.1

Applies to: Backup with NetApp Trident protect

Diagnosing the problem

A NetApp Trident protect backup can fail if it has the same name as an existing OADP backup.

A NetApp Trident protect backup uses the same name for the NetApp Trident protect CR and the OADP sub-backup. If you take OADP backups independently of taking backups with NetApp Trident protect, the backup process can fail because an OADP backup with the same name already exists.

Additionally, attempting to delete the failed NetApp Trident protect backup will delete both the NetApp Trident protect CR and the existing OADP backup.

Cause of the problem

This issue can occur if you use one of the cpd-cli oadp commands to take OADP backups independently of taking backups with NetApp Trident protect by using cpd-trident-protect.py.

NetApp Trident protect backups do not currently verify ownership of their related OADP sub-backup.

Resolving the problem

Avoid creating NetApp Trident protect backups that have the same name as existing OADP backups.

NetApp Trident protect backup names must be unique, and they cannot match the names of existing OADP backups. When you create an on-demand NetApp Trident protect backup, append a unique identifier, such as the current timestamp:

BACKUP_NAME="backup-$(date +%s)"
cpd-trident-protect.py backup create \
--backup_name=${BACKUP_NAME} ...
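
Before you create the backup, you can also check the candidate name against the OADP backups that already exist. The following is a minimal sketch. The function names new_backup_name and backup_name_in_use are hypothetical, and the usage comment assumes that the OADP project is available in an ${OADP_PROJECT} environment variable, which is not defined in this topic.

```shell
# Build a candidate backup name from a prefix and the current epoch time.
new_backup_name() {
  printf '%s-%s\n' "${1:-backup}" "$(date +%s)"
}

# Return 0 when the name in $1 exactly matches one of the names on stdin.
backup_name_in_use() {
  grep -Fxq -- "$1"
}

# Example usage (requires cluster access; OADP_PROJECT is an assumed variable):
#   BACKUP_NAME=$(new_backup_name backup)
#   if oc get backups.velero.io -n "${OADP_PROJECT}" -o name | sed 's,.*/,,' |
#        backup_name_in_use "${BACKUP_NAME}"; then
#     echo "An OADP backup named ${BACKUP_NAME} already exists; choose another name."
#   fi
```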

Neo4j config maps missing during backup

Applies to: 5.3.0 and later

Applies to: Backup with NetApp Trident protect

Diagnosing the problem

During the backup, the backup operation fails, and you see a pre-snapshot hook error similar to:

Error checking for hook failures in backup '<backup-name>':
command terminated with exit code 1

When you inspect the logs, you see failures related to missing Neo4j configmaps:

failed to validate dynamic online configmap '<>-aux-v2-ckpt-cm': not found
failed to validate dynamic offline configmap '<>-aux-v2-br-cm': not found

When you inspect the inventory ConfigMap, ibm-neo4j-inv-list-cm, you see that it still contains entries that reference a Neo4j instance that was uninstalled or is no longer running.

Pre-snapshot hooks attempt to execute backup commands for all entries listed in the inventory ConfigMap, so the backup fails when it encounters stale/non-existent entries.

Cause of the problem

The backup fails because the Neo4j inventory ConfigMap contains stale configmap references for Neo4j clusters, which were previously installed but later removed.

Backup hooks rely on this inventory to determine which Neo4j instances require online/offline checkpoint operations. When the inventory still includes configmaps that no longer exist in the cluster (such as <>-aux-v2-ckpt-cm and <>-aux-v2-br-cm), the hook execution fails, which causes the entire backup to fail.

This makes the backup system behave as if the missing configmaps indicate a failure, even though the corresponding Neo4j instance was intentionally removed.

Resolving the problem

To resolve the problem, remove stale Neo4j entries from the inventory ConfigMap so that the backup does not attempt to run hooks against non-existent Neo4j instances.

  1. View the inventory ConfigMap, and identify stale entries that look like the following examples:
    • - name: mdm-neo-<id>-aux-v2-ckpt-cm
    • - name: mdm-neo-<id>-aux-v2-br-cm
  2. Edit the inventory ConfigMap to remove stale entries:
    oc edit cm ibm-neo4j-inv-list-cm -n zen
    

    Remove the MDM-related blocks from both the online and offline lists, which ensures that the backup executes hooks only for active Neo4j instances.

    For example, the following structure shows the stale entries:

    online:
      - name: data-lineage-neo4j-aux-v2-ckpt-cm
        namespace: zen
        priority-order: '200'
      - name: mdm-neo-1763773800090371-aux-v2-ckpt-cm
        namespace: zen
        priority-order: '200'
    offline:
      - name: data-lineage-neo4j-aux-v2-br-cm
        namespace: zen
        priority-order: '200'
      - name: mdm-neo-1763773800090371-aux-v2-br-cm
        namespace: zen
        priority-order: '200'

    And the following example shows the revised structure:

    online:
      - name: data-lineage-neo4j-aux-v2-ckpt-cm
        namespace: zen
        priority-order: '200'
    
    offline:
      - name: data-lineage-neo4j-aux-v2-br-cm
        namespace: zen
        priority-order: '200'
    
  3. Rerun the backup process.

    The backup should now complete without pre-snapshot hook failures.
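
To identify stale entries without reading the inventory by eye, you can compare the names in the inventory with the ConfigMaps that exist in the project. The following is a minimal sketch. The function names inventory_names and stale_entries are hypothetical, and the parsing assumes that every inventory entry starts with a "- name:" line, as in the example structure above.

```shell
# Print the ConfigMap names that are listed in the inventory text on stdin.
# Assumption: each entry starts with a line of the form "- name: <configmap-name>".
inventory_names() {
  sed -n 's/^[[:space:]]*-[[:space:]]*name:[[:space:]]*//p'
}

# Print the inventory names (read from stdin) that are absent from the file in $1.
# The file must contain one existing ConfigMap name per line.
stale_entries() {
  inventory_names | grep -Fvx -f "$1"
}

# Example usage (requires cluster access; zen is the project from the example above):
#   oc get cm -n zen -o name | sed 's,.*/,,' > /tmp/existing-cms.txt
#   oc get cm ibm-neo4j-inv-list-cm -n zen -o yaml | stale_entries /tmp/existing-cms.txt
```

Any name that the sketch prints is listed in the inventory but has no matching ConfigMap, so it is a candidate for removal in step 2.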

Restore fails for watsonx Orchestrate and watsonx Assistant custom resources

Applies to: 5.3.1 Patch 2 and later

Fixed in: 5.3.1 Patch 5

Applies to: Online restore with NetApp Trident protect

Diagnosing the problem

After performing an online restore of watsonx Orchestrate by using NetApp Trident protect, the watsonx Assistant custom resource (CR) becomes stuck and fails to complete reconciliation. The watsonx Orchestrate CR also shows InProgress status with unverified components.

When you check the watsonx Assistant CR status, you see something similar to the following status:

# oc get wa -n ubr
NAME    VERSION   READY   READYREASON    UPDATING   UPDATINGREASON   DEPLOYED   VERIFIED   QUIESCE   AGE
wo-wa   5.3.2     False   Initializing   True       VerifyWait       14/20      12/20                16h
# oc get wa -n ubr -oyaml | grep -E "progress|Status"
    componentStatus:
      message: Post task of online restore is in progress. Please ensure that MCG
    progress: 60%
    progressMessage: 'Installation in Progress; Unverified components: data-governor,
    watsonAssistantStatus: InProgress

The watsonx Orchestrate CR shows a similar incomplete status:

# oc get wo -n ubr
NAME   VERSION   DEPLOYED   VERIFIED   TOTAL   INSTALLMODE         QUIESCE        RECONCILE_PROGRESS   AGE
wo     5.3.1     29         28         33      agentic_assistant   NOT_QUIESCED   84%                  4d5h

And the watsonx Orchestrate CR status message indicates unverified components:

progressMessage: 'The current operation is in progress; unverified components:
  components-services, install-metric-cronjob, skill-creation-job, skill-initial-job,
  zen-service'

Additionally, several pods are in a CrashLoopBackOff or Error state:

wo-wa-analytics-5fb7c-dntc2                    0/1     CrashLoopBackOff   136 (2m15s ago)   10h
wo-wa-analytics-5ffcb-2qgrl                    0/1     CrashLoopBackOff   207 (3m3s ago)    16h
wo-wa-cleanup-job-55                           0/1     Error              0                 16h
wo-wa-cleanup-job-wtp                          0/1     Error              0                 8h
wo-wa-dialog-6bc78-t2tvp                       0/1     CrashLoopBackOff   221 (2m40s ago)   16h
wo-wa-dialog-6fbb9-9bwr9                       0/1     CrashLoopBackOff   145 (4m29s ago)   10h
Cause of the problem

During the backup and restore operation, the OpenSearch credentials for Watson Data Governor might be rotated. When the authentication credentials for OpenSearch are rotated, the new credentials are not automatically applied on the OpenSearch server. The stale credentials prevent watsonx Assistant from connecting to Watson Data Governor services, which causes the reconciliation to fail.

Resolving the problem

You can manually rotate the Watson Data Governor credentials after the restore operation completes.

  1. Confirm that the OpenSearch secret is out of sync. Get the password from the secret, and use it in the curl command.
    Note: The password inside the container differs from the password outside the container. Use the password in the same context where you retrieved it. For example, if you run the cat command outside the container, you must also run the curl command outside the container with that password.
    $ oc rsh wo-wa-data-governor-ibm-data-governor-all-search-000 cat internal_users/elastic
    $ oc rsh wo-wa-data-governor-ibm-data-governor-all-search-000 curl -k https://elastic:<some_password>@localhost:9200/_cat/indices
    

    If the secret is out of sync, the response is Unauthorized.

  2. Set up the following environment variables:
    export PROJECT_CPD_OPERATORS=<operator namespace>
    export PROJECT_CPD_OPERAND=<operand namespace>
  3. Scale the Watson Data Governor operator down so it doesn't revert the OpenSearch changes.
    oc scale deployment -n ${PROJECT_CPD_OPERATORS} ibm-data-governor-operator --replicas=0
  4. Edit the OpenSearch cluster with Watson Data Governor and set spec.replicas to 0.
    oc edit Cluster wo-wa-data-governor-ibm-data-governor-all

    Wait for the OpenSearch pods to disappear.

  5. Edit the cluster again, and make all three of the following changes at once:
    replicas: 0 # change to 3
    
        opensearchyml: |
          plugins.security.ssl.http.enabled: true                      #<-- delete this whole line
          plugins.security.allow_default_init_securityindex: true      #<-- delete this whole line
          http.compression: false
    
      plugins:
        security:
          enabled: true # change to false

    Wait until the OpenSearch pods are started but not yet ready (0/1), which takes about three minutes. The pods might not become fully ready until you run the next command.

  6. Run the following command:
    oc rsh wo-wa-data-governor-ibm-data-governor-all-search-000 curl -X DELETE "http://localhost:9200/.opendistro_security"
    

    You should see {"acknowledged":true}

  7. Run oc get pods, and confirm that the OpenSearch pods are healthy.
  8. Edit the OpenSearch cluster with Watson Data Governor and set spec.replicas to 0.
    oc edit Cluster wo-wa-data-governor-ibm-data-governor-all

    Wait for the OpenSearch pods to completely disappear.

  9. Edit the cluster again, and revert the changes from step 5.
    replicas: 0 # change to 3
    
        opensearchyml: |
          plugins.security.ssl.http.enabled: true                      #<-- add this back
          plugins.security.allow_default_init_securityindex: true      #<-- add this back
          http.compression: false
    
      plugins:
        security:
          enabled: false # change to true
  10. Use the following commands to verify that everything is healthy and has the right credentials:
    $ oc rsh wo-wa-data-governor-ibm-data-governor-all-search-000 cat internal_users/elastic
    some_password
    $ oc rsh wo-wa-data-governor-ibm-data-governor-all-search-000 curl -k https://elastic:<some_password>@localhost:9200/_cat/indices
  11. Scale up the Watson Data Governor operator.
    oc scale deployment -n ${PROJECT_CPD_OPERATORS} ibm-data-governor-operator --replicas=1
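
The credential checks in steps 1 and 10 can be reduced to an HTTP status code instead of reading the response body. The following is a minimal sketch. The function name opensearch_auth_state is hypothetical, and the mapping assumes that the Unauthorized response corresponds to the standard HTTP status 401 and that a successful request returns 200.

```shell
# Map an HTTP status code from the OpenSearch endpoint to a credential state.
# Assumption: 401 (Unauthorized) means the secret is out of sync; 200 means it is in sync.
opensearch_auth_state() {
  case "$1" in
    200) echo "in-sync" ;;
    401) echo "out-of-sync" ;;
    *)   echo "unknown" ;;
  esac
}

# Example usage (requires cluster access; replace <some_password> with the retrieved password):
#   code=$(oc exec wo-wa-data-governor-ibm-data-governor-all-search-000 -- \
#     curl -k -s -o /dev/null -w '%{http_code}' \
#     "https://elastic:<some_password>@localhost:9200/_cat/indices")
#   opensearch_auth_state "$code"
```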

Backups and restores fail because of missing SCCs

Applies to: 5.3.0 and later

Applies to: Backup and restore with NetApp Trident protect

Diagnosing the problem

Backups and restores with NetApp Trident protect can fail because of issues with the Security Context Constraint (SCC). These SCC issues can happen when Red Hat OpenShift AI is installed. For a list of services that use Red Hat OpenShift AI, see Installing Red Hat OpenShift AI.

To diagnose this issue, use the following steps.

For backups

If backups fail, and it appears that SCCs did not get created, use these steps to diagnose the issue.

  1. Use the following command to check the state of the NetApp Trident protect custom resource:
    oc get backup.protect.trident.netapp.io -n ${PROJECT_CPD_INST_OPERATORS}

    The custom resource shows an error similar to the following error message:

    
    NAME                       STATE    ERROR                                                                                                                                                                                      AGE
    netapp-cx-gmc-120325-4-3   Failed   VolumeBackupHandler failed with permanent error kopiaVolumeBackup failed for volume trident-protect/data-aiopenscale-ibm-aios-etcd-1-6f44559e00b0b23bbb077d0e57c45de2: permanent error   124m
    
  2. Use the following commands to check the kopiavolumebackups jobs:
    oc get kopiavolumebackups -n ${PROJECT_CPD_INST_OPERATORS}
    oc get jobs -n trident-protect | grep <name> | grep "openshift.io/scc"

    If the pod's Security Context Constraint (SCC) is anything other than trident-protect-job, you must apply the workaround.

    For example, if you see openshift.io/scc: openshift-ai-llminferenceservice-multi-node-scc, you must apply the workaround.

For restores

If restores fail, and it appears that SCCs did not get created, use these steps to diagnose the issue.

  1. Use the following command to check the state of the NetApp Trident protect custom resource:
    oc get backuprestore.protect.trident.netapp.io -n ${PROJECT_CPD_INST_OPERATORS}

    The custom resource shows an error similar to the following error message:

    
    NAME                       STATE    ERROR                                                                                                                                                                                      AGE
    netapp-cx-gmc-120325-4-3   Failed   VolumeRestoreHandler failed with permanent error kopiaVolumeRestore failed for volume trident-protect/data-aiopenscale-ibm-aios-etcd-1-6f44559e00b0b23bbb077d0e57c45de2: permanent error   124m
  2. Use the following commands to check the KopiaVolumeRestore jobs:
    oc get kopiavolumerestores -n ${PROJECT_CPD_INST_OPERATORS}
    oc get jobs -n trident-protect | grep <name> | grep "openshift.io/scc"

    If the pod's Security Context Constraint (SCC) is anything other than trident-protect-job, you must apply the workaround.

    For example, if you see openshift.io/scc: openshift-ai-llminferenceservice-multi-node-scc, you must apply the workaround.

Cause of the problem

Backups and restores fail because the KopiaVolumeBackup and KopiaVolumeRestore jobs use an incorrect Security Context Constraint (SCC). Instead of using the expected trident-protect-job SCC for NetApp Trident protect, the jobs pick up a different SCC that has a higher priority. In this case, openshift-ai-llminferenceservice-multi-node-scc is used, which causes permission issues during the backup or restore.

Resolving the problem
You must be a cluster administrator to use the workaround because changing SCCs requires cluster-admin permissions.

Use the following commands to set the priority of trident-protect-job higher than the priority of openshift-ai-llminferenceservice-multi-node-scc, which is usually 11.

oc adm policy add-scc-to-group trident-protect-job system:serviceaccounts:trident-protect
oc patch scc trident-protect-job --type=merge -p '{"priority": 20}' 

If other SCCs prevent the job pods from using trident-protect-job, you might need to increase its priority further.
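
After you apply the workaround and rerun the backup or restore, you can confirm which SCC each job pod was admitted with. The following is a minimal sketch. The function name pods_with_wrong_scc is hypothetical, and the check assumes that the admitted SCC is recorded in the openshift.io/scc annotation on each pod, as the diagnosis steps above suggest.

```shell
# Print "<pod-name>: <scc>" for every pod that was not admitted with the expected SCC.
# Input: lines of "<pod-name> <scc>". The expected SCC defaults to trident-protect-job.
pods_with_wrong_scc() {
  awk -v want="${1:-trident-protect-job}" '$2 != want { print $1 ": " $2 }'
}

# Example usage (requires cluster access):
#   oc get pods -n trident-protect --no-headers \
#     -o custom-columns='NAME:.metadata.name,SCC:.metadata.annotations.openshift\.io/scc' |
#     pods_with_wrong_scc
```

An empty result means that every listed pod uses trident-protect-job.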

IBM Software Hub resources are not migrated

Applies to: 5.3.0 and later

Applies to: Portworx asynchronous disaster recovery

Diagnosing the problem
When you use Portworx asynchronous disaster recovery, the migration finishes almost immediately, and neither volumes nor the expected number of resources are migrated. To confirm, run the following command:
storkctl get migrations -n ${PX_ADMIN_NS}
Tip: ${PX_ADMIN_NS} is usually kube-system.
Example output:
NAME                                                CLUSTERPAIR       STAGE   STATUS       VOLUMES   RESOURCES   CREATED               ELAPSED                       TOTAL BYTES TRANSFERRED
cpd-tenant-migrationschedule-interval-<timestamp>   mig-clusterpair   Final   Successful   0/0       0/0         <timestamp>   Volumes (0s) Resources (3s)   0
Cause of the problem
This problem occurs starting with stork 23.11.0. Backup exec rules are not run, and expected IBM Software Hub resources are not migrated.
Resolving the problem
To resolve the problem, downgrade stork to a version prior to 23.11.0. For more information about stork releases, see the stork Releases page.
  1. Scale down the Portworx operator so that it doesn't reset manual changes to the stork deployment:
    oc scale -n ${PX_ADMIN_NS} deploy portworx-operator --replicas=0
  2. Edit the stork deployment image version to a version prior to 23.11.0:
    oc edit deploy -n ${PX_ADMIN_NS} stork
  3. If you need to scale up the Portworx operator, run the following command.
    Note: The Portworx operator will undo changes to the stork deployment and return to the original stork version.
    oc scale -n ${PX_ADMIN_NS} deploy portworx-operator --replicas=1
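
To check whether your stork deployment is affected before you edit it, you can compare the image tag with 23.11.0. The following is a minimal sketch. The function names image_tag and version_ge are hypothetical, and the comparison assumes that the image reference ends in a tag that is a purely numeric dotted version, optionally prefixed with v.

```shell
# Print the tag of an image reference, without a leading "v".
# Assumption: the reference ends in ":<tag>".
image_tag() {
  tag=${1##*:}
  printf '%s\n' "${tag#v}"
}

# Return 0 when version $1 is greater than or equal to version $2.
# Both versions must be dotted numeric strings such as 23.11.0.
version_ge() {
  a=$1
  b=$2
  while [ -n "$a" ] || [ -n "$b" ]; do
    x=${a%%.*}
    y=${b%%.*}
    [ "${x:-0}" -gt "${y:-0}" ] && return 0
    [ "${x:-0}" -lt "${y:-0}" ] && return 1
    case $a in *.*) a=${a#*.} ;; *) a= ;; esac
    case $b in *.*) b=${b#*.} ;; *) b= ;; esac
  done
  return 0
}

# Example usage (requires cluster access):
#   image=$(oc get deploy stork -n "${PX_ADMIN_NS}" \
#     -o jsonpath='{.spec.template.spec.containers[0].image}')
#   if version_ge "$(image_tag "$image")" 23.11.0; then
#     echo "stork is at 23.11.0 or later; the downgrade applies."
#   fi
```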

Prompt tuning fails after restoring watsonx.ai

Applies to: 5.3.0 and later

Applies to: Backup and restore with the OADP utility

Diagnosing the problem
When you try to create a prompt tuning experiment, you see the following error message:
An error occurred while processing prompt tune training.
Resolving the problem
Do the following steps:
  1. Restart the caikit operator:
    oc rollout restart deployment caikit-runtime-stack-operator -n ${PROJECT_CPD_INST_OPERATORS}

    Wait at least 2 minutes for the cais fmaas custom resource to become healthy.

  2. Check the status of the cais fmaas custom resource by running the following command:
    oc get cais fmaas -n ${PROJECT_CPD_INST_OPERANDS}
  3. Retry the prompt tuning experiment.

Restic backup that contains dynamically provisioned volumes in Amazon Elastic File System fails during restore

Applies to: 5.3.0 and later

Applies to: Backup and restore with the OADP utility

Diagnosing the problem

When trying to restore from an offline backup in Amazon Elastic File System, the restore process fails in the volume restore phase, and you might see something similar to the following error:

Status: Failed
Errors: 8
Warnings: 640

Action Errors:
DPP_NAME: cpd-offline-tenant/restore-service-orchestrated-parent-workflow
INDEX: 6
ACTION: restore-cpd-volumes
ERROR: error: expected restore phase to be Completed, received PartiallyFailed

In the Velero logs, you might see something similar to the following error:

time="2025-12-04T18:14:44Z" level=error msg="Restic command fail with ExitCode: 1. 
Process ID is 2077, Exit error is: exit status 1" 
PodVolumeRestore=openshift-adp/cpd-tenant-vol-r-00xda-x44x-4xx0-a9b1 
controller=PodVolumeRestore
stderr=ignoring error for /index-16: lchown /host_pods/.../mount/index-16: 
operation not permitted
Cause of the problem
Restic backups with Amazon Elastic File System are not supported by Velero. Restic cannot properly restore file ownership and permissions on Amazon Elastic File System volumes because it uses a different protocol for ownership operations.
Resolving the problem

If you want to use Amazon Elastic File System for dynamically provisioned volumes, you must use Kopia backups instead of Restic. The OADP DataProtectionApplication (DPA) must use uploaderType: kopia.

Custom resource for watsonx.ai IFM is created instead of restored when restoring SemanticAutomation

Applies to: 5.3.1 Patch 5

Applies to: Online restore with the OADP utility

Diagnosing the problem

After restoring SemanticAutomation from an online backup by using OADP, you see that the watsonx.ai Inferencing Foundation Models (IFM) custom resource (watsonx_ai_ifm) was not restored from the backup. Instead, the IBM Knowledge Catalog operator created a new watsonx.ai IFM custom resource during the post-restore reconciliation process.

If you check the watsonxaiifm custom resource (CR) after the restore completes, you see that it has a recent creation timestamp that does not match the original backup timestamp. The creationTimestamp shows a time after the restore operation started. When you get details about the watsonxaiifm CR, you see that the AGE column shows a recent time (for example, 21 hours) instead of the age from the original backup.

# oc get Watsonxaiifm -n <namespace>
NAME              VERSION   RECONCILED   STATUS      PERCENT   AGE
watsonxaiifm-cr   12.1.0    12.1.0       Completed   100%      21h

When you check the backup and restore logs, you see that the watsonxaiifm CR has the label "br_label": "NotPresent" instead of "br_label": "BackupRestore". That label indicates that it was not properly included in the backup and restore process.

The new watsonxaiifm CR does not contain the configuration and state information from the original custom resource that was backed up. However, the watsonxaiifm CR functions correctly, and the watsonx.ai IFM service and Semantic Automation are operational after the restore.

Cause of the problem

SemanticAutomation has a dependency on watsonx.ai IFM. It is also managed by the IBM Knowledge Catalog operator.

During the restore process, SemanticAutomation is restored before watsonx.ai IFM because it has higher priority than watsonx.ai IFM in the restore process. When the IBM Knowledge Catalog operator's reconciliation logic detects that SemanticAutomation has been restored but the required watsonx.ai IFM custom resource is missing, it automatically creates a new watsonx.ai IFM custom resource to satisfy the dependency requirements of SemanticAutomation. The new watsonx.ai IFM CR interferes with the restoration of the backed-up watsonx.ai IFM CR.

Resolving the problem
You can manually change the priority order for SemanticAutomation by updating its ConfigMaps. You can then restart the restore process.
  1. Run the following command to set the context to the namespace where the ConfigMaps are located:
    oc project ${CPD_OPERAND_NS}
  2. Run both of the following scripts to update the priority order to 20:
    • # update cpd-ikc-sal-aux-br-cm with PO from 100 to 20
      cm='cpd-ikc-sal-aux-br-cm'
      oc get cm $cm -o=jsonpath='{.data.aux-meta}' > /tmp/${cm}-aux-meta.yaml
      sed -i.bak 's,100,20,g' /tmp/${cm}-aux-meta.yaml
      oc set data cm/$cm aux-meta="$(cat /tmp/${cm}-aux-meta.yaml)"
    • # update cpd-ikc-sal-aux-ckpt-cm with PO from 100 to 20
      cm='cpd-ikc-sal-aux-ckpt-cm'
      oc get cm $cm -o=jsonpath='{.data.aux-meta}' > /tmp/${cm}-aux-meta.yaml
      sed -i.bak 's,100,20,g' /tmp/${cm}-aux-meta.yaml
      oc set data cm/$cm aux-meta="$(cat /tmp/${cm}-aux-meta.yaml)"
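Because the sed expression replaces every occurrence of 100 in the extracted file, it can be worth confirming which lines will change before you run oc set data. The following is a minimal sketch. The helper name preview_substitution is hypothetical, and it only prints the matching lines with their line numbers. It does not modify the file.

```shell
# Sketch: list the lines of an extracted aux-meta file that contain the
# value that sed is about to replace, so that you can confirm that only
# the priority order is affected.
preview_substitution() {
  file="$1"   # for example /tmp/cpd-ikc-sal-aux-br-cm-aux-meta.yaml
  old="$2"    # value that sed will replace, for example 100
  grep -n -F -- "$old" "$file"
}

# Example usage (not run here):
#   preview_substitution /tmp/cpd-ikc-sal-aux-br-cm-aux-meta.yaml 100
```

If more lines are listed than you expect, edit the file by hand instead of using the global substitution. Because sed -i.bak keeps a copy of the original file with the .bak suffix, you can also compare the two files with diff after the substitution.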

Custom resources for watsonx.data intelligence and Data Refinery are created instead of restored

Applies to: 5.3.1 Patch 2 and later

Fixed in: 5.3.1 Patch 4

Applies to: Online and offline restore with the OADP utility

Diagnosing the problem

When you try to restore on a cluster with watsonx.data Premium or watsonx BI Assistant installed, some custom resources (CRs) might get re-created instead of being restored from the backup. For example, the watsonx.data intelligence CR and the Data Refinery CR might be recreated instead of being restored.

If you check the watsonxdataintelligence CR after the restore completes, you see that it has a recent creation timestamp that does not match the original backup timestamp. The creationTimestamp shows a time after the restore operation started. When you get details about the watsonxdataintelligence CR, you see that the AGE column shows a recent time (for example, 3 hours) instead of the age from the original backup.

NAME                         VERSION   RECONCILED   STATUS      PERCENT   AGE
watsonxdataintelligence-cr   2.3.11    2.3.11       Completed   100%      3h

If you check the Data Refinery CR, you see a similarly recent age:

# oc get datarefinery -n bvt
NAME              VERSION   RECONCILED   STATUS      PERCENT   AGE
datarefinery-cr   12.1.4    12.1.4       Completed   100%      7h

You can also check the backup and restore logs, where you see that the watsonxdataintelligence CR has the label "br_label": "NotPresent" instead of "br_label": "BackupRestore". That label indicates that it was not properly included in the backup and restore process.

Cause of the problem

The root cause is the priority order for restoring services. CRs are restored based on their priority order. Both watsonx.data Premium and watsonx BI Assistant have a dependency on watsonx.data intelligence. However, watsonx.data intelligence and watsonx BI Assistant have the same restore priority order in the restore process, and watsonx.data Premium has a higher priority order.

Because of these priority orders, there is no guarantee that the watsonx.data intelligence CR gets restored first. When watsonx BI Assistant or watsonx.data Premium is restored before watsonx.data intelligence, their operators detect that the watsonx.data intelligence CR does not exist and automatically create a new one to satisfy their dependency. This newly created CR then prevents the actual watsonx.data intelligence CR from being restored from the backup.

Similarly, the Data Refinery CR (datarefinery-cr) might be re-created by the Watson Studio CR (ws-cr) instead of being restored from the backup. This also happens because Watson Studio has a higher restore priority order than Data Refinery. Watson Studio is a dependency of watsonx.data Premium.

Resolving the problem

If the CRs were created instead of restored, you can manually restore them from the backup after the restore operation completes.

  1. Identify the problematic custom resource by running the following command:
    cpd-cli manage get-cr-status --cpd_instance_ns=<namespace>

    For example, look for the watsonxdataintelligence and datarefinery CRs.

  2. Locate the backed-up CR file for watsonxdataintelligence or datarefinery in the restore data directory, and copy the CR file to a new location for editing.
    BACKUP_NAME=<define your tenant-backup name here>
    cpd-cli oadp tenant-backup download ${BACKUP_NAME}
    unzip ${BACKUP_NAME}-data.zip
    
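    # Extract each backup archive into a directory that is named after it
    for backupTar in *.tar.gz; do
      echo "extracting $backupTar"
      backupDir=$(basename "$backupTar" .tar.gz)
      mkdir -p "$backupDir"
      tar -xf "$backupTar" -C "$backupDir"
    done
    
    # Locate the backed-up CR file; replace the placeholder before you run this command
    crLoc=$(find . -type f | grep "<resource-name or pattern>" | grep -v preferred | head -1)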
  3. Edit the copied file and remove the following fields from the metadata section:
    • uid
    • resourceVersion
    • managedFields
    • generation
    • creationTimestamp

    These fields will be regenerated when you apply the custom resource.

  4. Remove the entire status section from the CR file.
  5. Delete the newly created CR.

    For example, to delete the watsonx.data intelligence CR, use the following command:

    oc delete WatsonxDataIntelligence <cr-name> -n <namespace>
  6. Apply the modified CR file:
    oc apply --server-side --force-conflicts -f $crLoc
  7. Repeat these steps for any other CRs that are affected.
  8. Verify that each CR begins reconciling, and wait for the CRs to reach Completed status.
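Steps 3 and 4 can be scripted rather than edited by hand. The following is a minimal sketch under two assumptions that the procedure does not state: that the backed-up CR file is JSON, and that jq is installed on the workstation. The function name strip_cr_fields and the output path are hypothetical. If your CR file is YAML, use an equivalent YAML-aware tool instead.

```shell
# Sketch: print a copy of a backed-up CR without the server-generated
# metadata fields and without the status section.
# Assumptions: the file is JSON and jq is available.
strip_cr_fields() {
  jq 'del(.metadata.uid,
          .metadata.resourceVersion,
          .metadata.managedFields,
          .metadata.generation,
          .metadata.creationTimestamp,
          .status)' "$1"
}

# Example usage (hypothetical output path; not run here):
#   strip_cr_fields "$crLoc" > /tmp/cr-edited.json
#   oc apply --server-side --force-conflicts -f /tmp/cr-edited.json
```

Writing the result to a separate file leaves the extracted backup copy untouched, so you can compare the two files before you apply the edited CR.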

Custom resource for watsonx.ai IFM is created instead of restored

Applies to: 5.3.0 and later

Fixed in: 5.3.1 Patch 5

Applies to: Online restore with the OADP utility

Diagnosing the problem

When you perform an online restore on a cluster with watsonx.data Premium and watsonx.ai Inferencing Foundation Models (IFM) installed, the watsonxaiifm custom resource (CR) is newly created instead of being restored from the backup.

You can check whether the watsonxaiifm CR was created or restored by running the following command:

oc get watsonxaiifm -o yaml | grep velero

If you see labels like restore-name for the CR, then it was restored. If the labels are missing, the CR was likely re-created instead.
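The label check above can be wrapped in a small helper. The following is a minimal sketch. The function name classify_cr is hypothetical. It relies on the velero.io/restore-name label that Velero adds to resources that it restores, and it treats a CR without that label as re-created, which is a likely but not certain conclusion.

```shell
# Sketch: classify a CR from its YAML, which is read on standard input.
# A CR that carries the velero.io/restore-name label was restored by Velero.
classify_cr() {
  if grep -q -F 'velero.io/restore-name'; then
    echo "restored"
  else
    echo "re-created"
  fi
}

# Example usage (not run here):
#   oc get watsonxaiifm watsonxaiifm-cr -n "$NAMESPACE" -o yaml | classify_cr
```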

You can also check the watsonxaiifm CR after the restore completes. You might see that it has a recent creation timestamp that does not match the original backup timestamp. The creationTimestamp shows a time after the restore operation started. When you get details about the watsonxaiifm CR, the AGE column shows a recent time (for example, 21 hours) instead of the age from the original backup.

# oc get Watsonxaiifm -n <namespace>
NAME              VERSION   RECONCILED   STATUS      PERCENT   AGE
watsonxaiifm-cr   12.1.0    12.1.0       Completed   100%      21h

When you check the backup and restore logs, you see that the watsonxaiifm CR has the label "br_label": "NotPresent" instead of "br_label": "BackupRestore". That label indicates that it was not properly included in the backup and restore process.

However, the watsonxaiifm CR functions correctly and the watsonx.ai IFM service is operational after the restore.

Cause of the problem

The watsonx.data Premium service has a dependency on watsonx.ai IFM, but both services have the same restore priority order in the restore process. CRs are restored based on their priority order.

Because both services have the same priority order, there is no guarantee which service's CRs get restored first. When watsonx.data Premium is restored before watsonx.ai IFM, the watsonx.data Premium operator detects that the watsonx.ai IFM CR does not exist and automatically creates a new one to satisfy its dependency. This newly created CR then prevents the actual watsonx.ai IFM CR from being restored from the backup.

Resolving the problem
At this time, there is no confirmed workaround for this issue.

While the newly created watsonxaiifm CR functions correctly and the watsonx.ai IFM service is operational after the restore, the CR does not retain its original creation timestamp and metadata from the backup.

Custom resource for common core services fails to restore due to missing nginx routes

Applies to: 5.3.0

Fixed in: 5.3.1

Applies to: Online restore with the OADP utility

Diagnosing the problem

During an online restore with the OADP utility, the restoration of the common core services custom resource (CR) fails to complete. When you check the status of the common core services, you see it is stuck at InProgress status:

NAME     VERSION   RECONCILED   STATUS       PERCENT   AGE
ccs-cr   12.0.0                 InProgress   56.2%     ...

You also see errors related to the dap-base-folder-global-type job timing out.

reconcileHistory:
- 'The failed task is : Create folder global asset type and the error message is: 
  "Job" "dap-base-folder-global-type": Timed out waiting on resource'

If you check the nginx logs, you might also see many errors about /v2/asset_types.

/usr/local/openresty/nginx/html/v2/asset_types" failed (2: No such file or directory)

The configuration for the nginx pod also appears to be missing the /v2/asset_types route.

Cause of the problem

The issue is caused by the backup and restore priority ordering. Some services and resources are getting restored too early, and they are causing zenextensions to be restored early as well. These services and resources are being restored before the common core services CR has completed its reconciliation.

The early restoration of these services and resources is causing the common core services CR to be recreated before the common core services CR can be properly restored. New nginx routes are created instead of restoring the routes that were properly configured.

When common core services tries to create global asset types through the dap-base-folder-global-type job, the nginx service has errors because the routes are missing. Eventually the job times out after multiple retry attempts.

This issue can be intermittent because it depends on the timing of resources being restored and the order in which operators process the restored resources.

Resolving the problem

Use the following script to force zenextensions to reconcile.

The script patches the zenextensions resource with a new timestamp. Patching the resources forces the zen operator to reconcile the resource and regenerate the nginx route configurations.

  1. Create a script file with the following content. Replace <namespace> with your IBM Software Hub instance namespace.
    #!/bin/bash
    
    RESOURCE_TYPE="zenextensions.zen.cpd.ibm.com"
    NAMESPACE="<namespace>"
    
    # Get all resource names in the namespace
    RESOURCE_NAMES=$(oc get $RESOURCE_TYPE -n $NAMESPACE -o jsonpath='{.items[*].metadata.name}')
    
    # Iterate and patch each resource with the current UTC timestamp
    for NAME in $RESOURCE_NAMES; do
      TIMESTAMP=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
      echo "Patching $RESOURCE_TYPE/$NAME with timestamp $TIMESTAMP..."
      oc patch $RESOURCE_TYPE "$NAME" -n $NAMESPACE --type=merge -p "{\"spec\": {\"lastRecon\": \"$TIMESTAMP\"}}"
    done
  2. Make the script executable and run the script.

    Wait for the zenextensions resource to reconcile.

  3. Continue the restore operation from the point where it failed, or allow it to complete if it automatically resumes.

Custom resource for common core services stuck at InProgress status after rebooting cluster or restoring

Applies to: 5.3.1

Fixed in: 5.3.1 Patch 1

Applies to: Restore with the OADP utility or after rebooting the cluster

Diagnosing the problem

This issue occurs in one of the following scenarios:

  • After completing a restore operation with the OADP utility
  • After rebooting the OpenShift cluster

The custom resource for the common core services is then stuck at 97% progress with the status InProgress:

NAME     VERSION   RECONCILED   STATUS      PERCENT   AGE
ccs-cr   12.1.0    12.1.0       InProgress  97%       ...

The common core services CR also displays the message progressMessage: Starting install of ccs-post-install role and does not complete. When you check the pods, the spaces pod is in CrashLoopBackOff status or Error status.

The failure of the common core services CR to complete also blocks all services that depend on common core services, which prevents them from reconciling. This affects the restore process or cluster recovery after reboot.

Cause of the problem

This issue occurs because the spaces pod fails to start correctly. The startup script for the spaces pod generates a password during pod startup and uses this password in OpenSSL and keytool calls. An issue with the password causes one of these calls to fail with a keystore password was incorrect error. This error prevents the spaces pod from starting, which in turn prevents the common core services operator from completing reconciliation and updating the CR status to Completed.

Resolving the problem

Manually restart the spaces pod by deleting it, which triggers the password generation process. The pod can then start successfully, and the common core services operator can complete the reconciliation.

  1. Delete the spaces pod to force a restart:
    oc delete pod spaces-<xxxxxxxxxx>-<xxxxx> -n zen

    Replace <xxxxxxxxxx>-<xxxxx> with the pod identifier. If the pod does not restart automatically, use the --force option.

  2. Wait for the new spaces pod to start and reach Running status.

    You can then check that the common core services CR updates to Completed status. Services that depend on common core services automatically reconcile once the common core services CR updates.
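To find the exact name of the failing spaces pod for step 1, you can filter the pod list. The following is a minimal sketch. The function name failing_spaces_pods is hypothetical. It assumes the default column layout of oc get pods, where the first column is the pod name and the third column is the status.

```shell
# Sketch: print the names of spaces pods that are in CrashLoopBackOff or
# Error status. Reads the output of "oc get pods --no-headers" on standard
# input (column 1 = NAME, column 3 = STATUS).
failing_spaces_pods() {
  awk '$1 ~ /^spaces-/ && ($3 == "CrashLoopBackOff" || $3 == "Error") { print $1 }'
}

# Example usage (not run here); use the namespace where the spaces pod runs:
#   oc get pods -n "$NAMESPACE" --no-headers | failing_spaces_pods
```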

User interface for IBM Software Hub displays Internal Server Error after offline restore

Applies to: 5.3.1

Applies to: Offline restore with OADP

Diagnosing the problem

After completing an offline restore with OADP, the restore operation has Completed status, but the IBM Software Hub user interface is not accessible in a web browser. When you attempt to access the platform, the IBM Software Hub user interface displays a 500 (Internal Server Error) message. Yet, when you check the restore logs, the operation shows it completed successfully:

Finished with status: Completed
...
Scenario:       RESTORE CREATE (531-nfs-gmc-f-1-restore)
Start Time:     	2026-02-17 08:06:14 -0800 PST m=+0.211399020
Completion Time:	2026-02-17 09:30:33 -0800 PST m=+5059.260548678
Time Elapsed:   1h24m19.049149658s

When you check the usermgmt pod logs, you also see authentication errors:

InternalOAuthError: failed to obtain access token
    at /usr/src/server-src/node_modules/passport-idaas-openidconnect/lib/strategy.js:266:32

You might also see database constraint errors and errors related to token and user management when you check the platform-identity-provider pod logs:

Failed to get access token, Liberty error: {"error_description":"CWOAU0029E: The token with key: PQR type: authorization_grant subType: authorization_code was not found in the token cache.","error":"invalid_grant"}

SequelizeDatabaseError: there is no unique or exclusion constraint matching the ON CONFLICT specification
error: there is no unique or exclusion constraint matching the ON CONFLICT specification

This is an intermittent issue that might not occur after every restore. In some cases, performing a second restore on the same target cluster resolves the issue.

Cause of the problem

This issue occurs because the platformdb.users table is missing its PRIMARY KEY constraint on the user_id column. This constraint is missing due to a failure to restore the database schema during the Velero restore operation.

After Velero restores the platformdb.users table data, the table contains duplicate user_id values. This duplication can occur for several reasons. For example, multiple backup snapshots might have merged, or the cleanup of data before the restore might be incomplete. When the restore process attempts to recreate the PRIMARY KEY constraint on the user_id column, the operation fails because duplicate values exist in the restored data. Without the PRIMARY KEY constraint, foreign key constraints that reference users.user_id also fail to be created. The lack of constraints then causes application errors.

Resolving the problem

You need to clean up the target cluster namespaces and retry the offline restore operation from the beginning. This clears the corrupted database and allows the restore to complete with the proper constraints. However, this workaround requires rerunning the entire offline restore process, which can take several hours depending on the size of your backup and cluster resources.

  1. On the target cluster, delete the IBM Software Hub instance and operator namespaces. Follow the steps in the Cleaning up the cluster before a restore section in Offline backup and restore to a different cluster.
  2. Wait for the namespaces to be fully deleted.
  3. Retry the offline restore operation. Start the restore procedure by following the steps in the Cleaning up the target cluster after a previous restore section in Offline backup and restore to a different cluster.

Offline restore fails with OIDC client registration error

Applies to: 5.3.1

Fixed in: 5.3.1 Patch 1

Applies to: Offline restore with OADP

Diagnosing the problem

While running an offline restore with OADP, the restore fails after running for more than two hours. You see errors related to OIDC client registration, and the restore process cannot verify that the oidc-client-registration managed resource has reached Ready status:

Error: 3 errors occurred:
* error from dpp.Execute() [traceId=..., dpp=cpd-offline-tenant/restore-service-orchestrated-parent-workflow, 
  operationKind=restore-service-orchestrated-parent-workflow]:

error executing workflow actions: workflow action execution resulted in 4 error(s):
  - encountered an error during local-exec workflowAction.Do() - action=cpd-restore-operators, 
    action-index=23, retry-attempt=0/3, err=operators restore execution failed: 
    error evaluating condition (condition={$.status.service.managedResources[?(@.objectName == 
    "oidc-client-registration")].status} == {"Ready"}, namespace=cpd-instance, 
    gvr=operator.ibm.com/v1alpha1, Resource=authentications, name=example-authentication): 
    error jsonpath FindResults(): status is not found

When you check the pods in the operator namespace, the ibm-iam-operator pod is in CrashLoopBackOff status. The pod logs show cache synchronization failures and timeout errors:

{"level":"error","ts":"...","msg":"Could not wait for Cache to sync",
"controller":"controller_authentication","source":"kind source: *v1.Certificate",
"error":"failed to wait for controller_authentication caches to sync kind source: *v1.Certificate: timed out waiting for cache to be synced for Kind *v1.Certificate",
"stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).startEventSourcesAndQueueLocked.func1.2.1\n\t/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.22.4/pkg/internal/controller/controller.go:366"}

{"level":"error","ts":"2026-01-20T13:42:10Z","msg":"Could not wait for Cache to sync",
"controller":"controller_authentication","source":"kind source: *v1.Secret",
"error":"failed to wait for controller_authentication caches to sync kind source: *v1.Secret: timed out waiting for cache to be synced for Kind *v1.Secret",
"stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).startEventSourcesAndQueueLocked.func1.2.1\n\t/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.22.4/pkg/internal/controller/controller.go:366"}

{"level":"error","ts":"2026-01-20T13:42:11Z","msg":"Could not wait for Cache to sync",
"controller":"controller_oidc_client","controllerGroup":"oidc.security.ibm.com",
"controllerKind":"Client","source":"kind source: *v1.Client",
"error":"failed to wait for controller_oidc_client caches to sync kind source: *v1.Client: timed out waiting for cache to be synced for Kind *v1.Client",
"stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).startEventSourcesAndQueueLocked.func1.2.1\n\t/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.22.4/pkg/internal/controller/controller.go:366"}

{"level":"error","ts":"2026-01-20T13:42:12Z","logger":"controller-runtime.source.Kind",
"msg":"failed to get informer from cache","error":"Timeout: failed waiting for *v1.Client Informer to sync",
"stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind[...]).Start.func1.1\n\t/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.22.4/pkg/internal/source/kind.go:80\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.f....

Cause of the problem

This issue occurs because the Identity Management Service operator fails to synchronize its cache during the restore process. The operator attempts to watch and cache various Kubernetes resources, but it encounters timeout errors when trying to list and watch all these resources.

When the cache fails to sync, the operator is unable to watch the oidc-client-registration resource status properly. The restore process then fails when it attempts to verify that this resource has reached Ready status. The operator enters a CrashLoopBackOff state, repeatedly attempting and failing to sync its cache.

Cache synchronization failures can occur because of several factors:

  • Large volumes of resources being restored simultaneously
  • Network connectivity issues between the operator and the Kubernetes API server
  • Resource constraints on the Identity Management Service operator pod
  • Timing and sequencing issues where multiple operators and resources are being restored and started concurrently
  • Service account token issues preventing proper resource access

Resolving the problem

You must delete the ibm-iam-operator-xxx pod, and then start a new restore process.

  1. Remove any partially restored components or resources that might be in an inconsistent state.

    Follow the steps in the Cleaning up the target cluster after a previous restore section in Offline backup and restore to a different cluster.

    Cleaning up the namespaces deletes the pod and triggers the generation of a new ServiceAccount token.

  2. Restart the restore process from the beginning.

    For example, follow the steps in the Restoring IBM Software Hub to the same cluster section in Offline backup and restore to a different cluster.

Restoring Data Virtualization fails with metastore not running or failed to connect to database error

Applies to: 5.3.0 and later

Applies to: Online backup and restore with the OADP utility

Diagnosing the problem
View the status of the restore by running the following command:
cpd-cli oadp tenant-restore status ${TENANT_BACKUP_NAME}-restore --details
The output shows errors like the following examples:
time=<timestamp>  level=INFO msg=Verifying if Metastore is listening
SERVICE              HOSTNAME                               NODE      PID  STATUS
Standalone Metastore c-db2u-dv-hurricane-dv                   -        -   Not running
time=<timestamp>  level=ERROR msg=Failed to connect to BigSQL database
* error performing op postRestoreViaConfigHookRule for resource dv (configmap=cpd-dv-aux-ckpt-cm): 1 error occurred:
* error executing command su - db2inst1 -c '/db2u/scripts/bigsql-exec.sh /usr/ibmpacks/current/bigsql/bigsql/bigsql-cli/BIGSQL/package/scripts/bigsql-db2ubar-hook.sh -H POST -M RESTORE -L' (container=db2u podIdx=0 podName=c-db2u-dv-db2u-0 namespace=<namespace-name> auxMetaName=dv-aux component=dv actionIdx=0): command terminated with exit code 1
Cause of the problem
A timing issue causes the restore posthooks to fail at the step where they check the result of the db2 connect to bigsql command. The db2 connect to bigsql command fails because bigsql is restarting at about the same time.
Resolving the problem
Run the following commands to continue the restore, starting from the post-restore hooks:
export CPDBR_ENABLE_FEATURES=experimental
cpd-cli oadp tenant-restore create ${TENANT_RESTORE_NAME}-cont \
--from-tenant-backup ${TENANT_BACKUP_NAME} \
--verbose \
--log-level debug \
--start-from cpd-post-restore-hooks

Restore fails for IBM Knowledge Catalog because PostgreSQL cluster cannot reach a healthy state

Applies to: 5.3.1 Patch 2 and later

Applies to: Online restore with the OADP utility

Diagnosing the problem

When you perform an online restore of IBM Knowledge Catalog, the restore fails during the operator restore phase with an error indicating that the PostgreSQL cluster ikc-dp-dps-bidata-mde-mdi-postgres cannot reach a healthy state.

error: DataProtectionPlan=cpd-tenant/restore-service-orchestrated-parent-workflow, 
       Action=cpd-restore-operators (index=24)
       operators restore execution failed: 1 error occurred:
       * condition not met (condition={$.status.phase} == {"Cluster in healthy state"}, 
         namespace=udp, gvr=postgresql.k8s.enterprisedb.io/v1, Resource=clusters, 
         name=ikc-dp-dps-bidata-mde-mdi-postgres)

When you check the PostgreSQL clusters in the namespace, you see that the ikc-dp-dps-bidata-mde-mdi-postgres cluster has no instances ready and shows the status Unable to create required cluster objects.

When you describe the cluster, you see an error message indicating that the cluster refuses to reconcile a service that it does not own:

# oc describe clusters.p -n udp ikc-dp-dps-bidata-mde-mdi-postgres
...
Status:
  Conditions:
    Last Transition Time:  2026-04-03T09:27:28Z
    Message:               Cluster Is Not Ready
    Reason:                ClusterIsNotReady
    Status:                False
    Type:                  Ready
  Image:                   cp.stg.icr.io/cp/edb/postgresql:16.13-5.31.1@sha256:c632b5ef78ab6686939e8f76543c08dac8cb77e7b30578b21eb0a0c8d3b0b020
  Latest Generated Node:   2
  Phase:                   Unable to create required cluster objects
  Phase Reason:            refusing to reconcile service: ikc-dp-dps-bidata-mde-mdi-postgres-r, not owned by the cluster
  Target Primary:          ikc-dp-dps-bidata-mde-mdi-postgres-2
Events:                    <none>

When you check the service, you see that it was restored by Velero, but it lacks the required ownerReferences field.
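The ownerReferences check can be run directly against the service. The following is a minimal sketch. The function name service_owner_check is hypothetical. It prints the kinds of any owners that are recorded on the service, or reports that none exist, and it does not change the service.

```shell
# Sketch: report whether a service has any ownerReferences entries.
service_owner_check() {
  ns="$1"    # namespace of the service
  svc="$2"   # service name, for example ikc-dp-dps-bidata-mde-mdi-postgres-r
  owners=$(oc get service "$svc" -n "$ns" \
    -o jsonpath='{.metadata.ownerReferences[*].kind}')
  if [ -n "$owners" ]; then
    echo "owned by: $owners"
  else
    echo "no ownerReferences"
  fi
}

# Example usage (not run here):
#   service_owner_check "$NAMESPACE" ikc-dp-dps-bidata-mde-mdi-postgres-r
```

A result of no ownerReferences for the ikc-dp-dps-bidata-mde-mdi-postgres-r service matches the symptom that is described in this issue.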

Cause of the problem

The cpd-ikc-enrichment-aux-ckpt-cm ConfigMap defines which resources to back up and restore. The ikc-enrichment-resources group in the ConfigMap includes a label selector that is too broad, and it unintentionally includes the PostgreSQL cluster service, ikc-dp-dps-bidata-mde-mdi-postgres-r, in the backup.

During the restore process, Velero restores this service but removes the ownerReferences field, which normally links the service to the PostgreSQL cluster custom resource. The PostgreSQL cluster operator expects to create and manage this service itself. When it detects a service with the expected name but without the correct ownerReferences, it refuses to reconcile the cluster, which causes the Unable to create required cluster objects error.

Resolving the problem
If this issue affects your deployment, contact IBM Support.

Backup fails for watsonx Assistant due to ConfigMap issues

Applies to: 5.3.1 and later

Applies to: Online and offline backups with the OADP utility

Diagnosing the problem

When you attempt to perform a backup of watsonx Assistant, the backup fails during the pre-backup hook phase because of an invalid job specification in the ConfigMap.

The backup log shows the following error:

error executing workflow actions: workflow action execution resulted in 1 error(s):
      - encountered an error during local-exec workflowAction.Do() - action=cpd-pre-backup-hooks,
      action-index=5, retry-attempt=0/0, err=online pre-backup hooks execution failed: 
      error running pre-backup hooks: Error running pre-processing rules.  
      Check the /root/br/backup/cpd-cli-workspace/logs/CPD-CLI-2026-04-14.log for errors.
  1 error occurred:
      * op error: id=12, name=preBackupViaConfigHookRule, configmap=cpd-wa-aux-ckpt-cm: 
      error performing op preBackupViaConfigHookRule for resource wa, msg: : Job.batch "" is invalid: 
      [metadata.name: Required value: name or generateName is required, spec.template.spec.containers: 
      Required value, spec.template.spec.restartPolicy: Required value: valid values: "OnFailure", "Never"]
  [ERROR] 2026-04-14T02:26:59.905426Z RunPluginCommand:Execution error:  exit status 1

When you examine the ConfigMap, you might see that the mcg-backup-job key has a null or empty value:

# oc get cm cpd-wa-aux-ckpt-cm -n <namespace> -o yaml
checkpoint-meta:
  exec-hooks:
    exec-rules:
      - actions:
          - job:
              job-key: mcg-backup-job
              timeout: 3600s
mcg-backup-job: null

Cause of the problem

The watsonx Assistant operator creates the ConfigMap, which contains the backup and restore job specifications. If the Cloud Object Storage (COS) connection is unavailable or not properly initialized during operator reconciliation (especially after a restore operation), the operator can create the ConfigMap without a valid job specification for the mcg-backup-job key. When the backup process tries to read the job specification from the ConfigMap, the backup process cannot create a job with a valid specification.

This issue occurs intermittently. It depends on the timing of the operator reconciliation and the availability of the COS connection when the ConfigMap is created or updated.

Resolving the problem

Retry the backup operation.

Because the issue occurs intermittently, the watsonx Assistant operator typically reconciles and correctly updates the ConfigMap before you retry the backup.
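Before you retry, you can confirm that the operator has populated the job specification. The following is a minimal sketch. The function name mcg_backup_job_defined is hypothetical. It assumes that mcg-backup-job is a key in the data section of the cpd-wa-aux-ckpt-cm ConfigMap, and it treats an empty value or the literal value null as not defined.

```shell
# Sketch: succeed (exit 0) only if the mcg-backup-job key of the
# cpd-wa-aux-ckpt-cm ConfigMap has a value that is neither empty nor null.
mcg_backup_job_defined() {
  ns="$1"   # namespace that contains the ConfigMap
  value=$(oc get cm cpd-wa-aux-ckpt-cm -n "$ns" \
    -o jsonpath='{.data.mcg-backup-job}')
  [ -n "$value" ] && [ "$value" != "null" ]
}

# Example usage (not run here):
#   mcg_backup_job_defined "$NAMESPACE" && echo "job specification is present"
```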

Offline backup fails for cpd-ikc-ikc-aux-br-cm ConfigMap

Applies to: 5.3.1 Patch 1 and later

Fixed in: 5.3.1 Patch 3

Applies to: Offline backup with the OADP utility

Diagnosing the problem

This issue occurs only when IBM Software Hub is running on Power hardware (ppc64le clusters). When you perform an offline backup, it might fail at the cpd-pre-backup-hooks step with timeout errors for the following ConfigMaps: cpd-ws-maint-aux-br-cm, wspipelines-aux-br-cm, or cpd-ccs-maint-aux-br-cm.

Failed with 1 error(s):
        DataProtectionPlan=cpd-offline-tenant/backup-service-orchestrated-parent-workflow, Action=cpd-pre-backup-hooks (index=4)
                error: offline pre-backup hooks execution failed: error running pre-backup hooks: Error running pre-processing rules.  Check the cpdbr-oadp.log for errors.
3 errors occurred:
        * op error: id=12, name=preBackupViaConfigHookRule, configmap=cpd-ws-maint-aux-br-cm: error performing op preBackupViaConfigHookRule for resource ws-maint-br, msg: 1 error occurred:
        * 1 error occurred:
   * timed out waiting for the condition

        * op error: id=173, name=preBackupViaConfigHookRule, configmap=wspipelines-aux-br-cm: error performing op preBackupViaConfigHookRule for resource wspipelines, msg: 1 error occurred:
        * 1 error occurred:
   * timed out waiting for the condition

        * op error: id=192, name=preBackupViaConfigHookRule, configmap=cpd-ccs-maint-aux-br-cm: error performing op preBackupViaConfigHookRule for resource ccs-maint-br, msg: 1 error occurred:
        * 1 error occurred:
   * timed out waiting for the condition

A backup might also fail at the cpd-backup-validation step for the cpd-ikc-ikc-aux-br-cm ConfigMap.

Failed with 1 error(s):
        DataProtectionPlan=cpd-offline-tenant/backup-service-orchestrated-parent-workflow, Action=cpd-backup-validation (index=8)
                error: backup validation failed: 1 error occurred:
        * backup validation failed for configmap: cpd-ikc-ikc-aux-br-cm
Cause of the problem

The backup validation service fails to validate the specific secrets referenced in the cpd-ikc-ikc-aux-ckpt-cm and cpd-ikc-ikc-aux-br-cm ConfigMaps during the cpd-pre-backup-hooks or cpd-backup-validation phase. These ConfigMaps are associated with IBM Knowledge Catalog. When backup hooks attempt to validate these services, they encounter timeout errors because the services are not in a stable state.

Resolving the problem

Before taking a backup, edit the ConfigMaps to remove references to the secrets.

  1. Edit the following ConfigMaps:
    • cpd-ikc-ikc-aux-br-cm
    • cpd-ikc-ikc-aux-ckpt-cm
  2. Remove the following lines in each ConfigMap:
          - resource-kind: secret
            validation-rules:
              - type: match_names
                names:
                  - manta-credentials
                  - manta-keys
  3. Retry the backup.
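
Because this workaround edits the ConfigMaps by hand, it can help to keep a copy of each one before you remove the lines. The following is a minimal sketch under stated assumptions: the function name, the namespace argument, and the output directory are illustrative and are not part of the product.

```shell
# Save a copy of each IBM Knowledge Catalog backup ConfigMap before editing it.
# Arguments: <namespace> <output-directory> (both are values that you supply).
save_ikc_configmaps() {
  local namespace="$1" outdir="$2" cm
  mkdir -p "$outdir" || return 1
  for cm in cpd-ikc-ikc-aux-br-cm cpd-ikc-ikc-aux-ckpt-cm; do
    # Write the current ConfigMap definition to a local YAML file.
    oc get configmap "$cm" -n "$namespace" -o yaml > "$outdir/$cm.yaml" || return 1
    echo "Saved $cm to $outdir/$cm.yaml"
  done
}
```

For example, save_ikc_configmaps <namespace> ./cm-copies writes one YAML file per ConfigMap, which you can compare against the edited version later.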

Backup pre-check fails for Db2 Data Management Console due to timeout for status check

Applies to: 5.3.1

Applies to: Offline backup with OADP

Diagnosing the problem

You have Db2 Data Management Console installed. When you perform an offline backup, the backup fails during the pre-check phase. You see an error related to the cpd-dmc-aux-br-cm ConfigMap, which contains the backup configuration for Db2 Data Management Console. The error indicates that the backup pre-check hook failed:

** PHASE [MASTER PLAN/PARENT RECIPE (APPTYPE=CPD-OFFLINE-TENANT, WORKFLOW=BACKUP-SERVICE-ORCHESTRATED-PARENT-WORKFLOW)/CPD-PRECHECK-BACKUP/END]
master plan results:
   1 DataProtectionPlan execution result(s):
   	dpp 0: cpd-offline-tenant/backup-service-orchestrated-parent-workflow (Phase: FatalError)
   		action 0: cpd-precheck-backup (Phase=FatalError)
   			error: backup precheck failed with error: pre-check backup hooks encountered one or more error(s), err=1 error occurred:
   	* backup precheck hook finished with status=error, configmap=cpd-dmc-aux-br-cm
   Failed with 1 error(s):
   error: DataProtectionPlan=cpd-offline-tenant/backup-service-orchestrated-parent-workflow, Action=cpd-precheck-backup (index=0)
   		backup precheck failed with error: pre-check backup hooks encountered one or more error(s), err=1 error occurred:
   	* backup precheck hook finished with status=error, configmap=cpd-dmc-aux-br-cm

When examining the pre-check hook execution details, you observe that the hook times out:

The following hooks either have errors or timed out
   pre-check (count=1, duration=5.413s):
   
       	ADDONID	COMPONENT	CONFIGMAP        	PRIORITY_ORDER	METHOD	STATUS	DURATION	START_TIME               	END_TIME                 
       	dmc    	dmc      	cpd-dmc-aux-br-cm	20            	rule  	error 	5.413s  	2026-02-13 8:07:12.035 AM	2026-02-13 8:07:17.449 AM

This issue occurs intermittently, and some backups might succeed while others fail.

Cause of the problem

The cpd-dmc-aux-br-cm ConfigMap contains a pre-check hook that validates that the dmcAddon status is Completed before allowing the backup to proceed. The default timeout value is set to 5 seconds in the ConfigMap, which is not enough time:

precheck-meta:
  backup-hooks:
    exec-rules:
      - resource-kind: dmcaddons.dmc.databases.ibm.com
        on-error: Fail
        actions:
          - builtins:
              name: cpdbr.cpd.ibm.com/check-condition
              params:
                condition: "{$.status.dmcAddonStatus} == {\"Completed\"}"
              timeout: 5s

When time runs out before the check completes, the pre-check hook fails, which causes the entire backup operation to fail.

Resolving the problem

You must manually increase the timeout value in the cpd-dmc-aux-br-cm ConfigMap before taking a backup.

Note: You must apply this workaround before each backup that you take because the ConfigMap may be recreated or reset during Db2 Data Management Console operator reconciliation.
  1. Patch the cpd-dmc-aux-br-cm ConfigMap to increase the timeout from 5 seconds to 1800 seconds (30 minutes):
    oc patch configmap cpd-dmc-aux-br-cm -n <namespace> --type='merge' -p='
    {
      "data": {
        "precheck-meta": "backup-hooks:\n  exec-rules:\n    - resource-kind: dmcaddons.dmc.databases.ibm.com\n      on-error: Fail\n      actions:\n        - builtins:\n            name: cpdbr.cpd.ibm.com/check-condition\n            params:\n              condition: \"{$.status.dmcAddonStatus} == {\\\"Completed\\\"}\"\n            timeout: 1800s\n"
      }
    }'
  2. Verify that the timeout value in the ConfigMap has been updated to 1800s:
    oc get configmap cpd-dmc-aux-br-cm -n <namespace> -o yaml | grep -A 10 "precheck-meta"
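
Because the workaround must be reapplied before each backup, a small check that prints only the timeout value can be easier to read than the full grep output. The following sketch assumes that the first timeout entry in the text it receives belongs to the check-condition action; the function name is hypothetical.

```shell
# Print the first "timeout: <n>s" value found in the text on standard input.
# Works whether the precheck-meta value is shown as a block or as a quoted string.
precheck_timeout() {
  grep -oE 'timeout: *[0-9]+s' | head -n 1 | sed -E 's/^timeout: *//'
}
```

One way to use it is to pipe the verification command from step 2 into the function, for example oc get configmap cpd-dmc-aux-br-cm -n <namespace> -o yaml | grep -A 10 "precheck-meta" | precheck_timeout, and confirm that the printed value is 1800s before you start the backup.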

Unstructured Data Integration backup fails at pre-backup hooks phase

Applies to: 5.3.0

Fixed in: 5.3.1

Applies to: Offline backup with OADP

Diagnosing the problem

When attempting to take an offline backup of Unstructured Data Integration by using OADP with NFS storage, the backup fails during the pre-backup hooks phase with the following error:

op error: id=63, name=quiesceViaScaling, configmap=: error waiting for op quiesceViaScaling on workload deployment/datasift-api: scale down failed

The issue is intermittent. Some backup attempts succeed while others fail.

Cause of the problem

The failure is caused by a timing problem in the backup workflow. Unstructured Data Integration resources, such as datasift-api and datasift-ui, are scaled down during the quiesce phase, but the operator reconciliation process scales them back up before the Unstructured Data Integration custom resource (CR) is put into maintenance mode. As a result, the quiesce operation fails because the scale-down cannot complete.

The default priority order in the cpd-udp-aux-br-cm ConfigMap causes the scaling operations to occur before the CR maintenance mode is set.

Additionally, the quiesce timeout may not be sufficient for UDP resources to complete the scale-down operation, especially in offline backup scenarios where resources take longer to quiesce.

Resolving the problem

To resolve the issue, update the priority order in the cpd-udp-aux-br-cm ConfigMap to ensure CRs are put in maintenance mode:

  1. Edit the cpd-udp-aux-br-cm ConfigMap:
    oc edit cm cpd-udp-aux-br-cm -n <namespace>
  2. Change the priority-order value to 25.
  3. Add the --skip-registry-check flag when you run the cpd-cli oadp backup create command.

Backup fails on upgraded cluster due to EDB Postgres Enterprise ConfigMap timeout error

Applies to: 5.3.0

Fixed in: 5.3.1

Applies to: Offline backup and restore methods

Diagnosing the problem

After upgrading watsonx.data from version 5.1.x to 5.3.0, you perform an offline backup. The backup fails, and the pre-backup hooks show timeout errors on the ConfigMaps for EDB Postgres Enterprise:

pre-backup (2):
  COMPONENT            CONFIGMAP            METHOD  STATUS  DURATION
  cpdbr-edb-patch-br   edb-patch-aux-br-cm  rule    error   6m0.025738359s
  lh-edb-br-component  lh-edb-aux-br-cm     rule    error   1m20.851377472s

Additionally, one of the EDB Postgres Enterprise pods (edb-2) might also get stuck with the message Instance could be down, will not proceed with the reconciliation loop.

The specific error indicates that the backup instance annotation cannot be removed from the EDB cluster:

Waiting for Backup Instance annotation for EDB cluster ibm-lh-postgres-edb to be refreshed...
error performing op preBackupViaConfigHookRule for resource lh-edb-br-component: command terminated with exit code 1
Cause of the problem

During the upgrade, the specification for the postgresql PVC storage size changes from 10G to 10Gi, and the resizeInUseVolume flag gets set to true. However, the underlying storage class has AllowVolumeExpansion set to false, so the PostgreSQL operator enters a continuous reconciliation loop, where the operator repeatedly tries and fails to resize the PVC. You might see the following error:

persistentvolumeclaims "ibm-lh-postgres-edb-1" is forbidden: only dynamically provisioned pvc can be resized and the storageclass that provisions the pvc must support resize

Although the PostgreSQL CR shows a Completed status, the operator is actually stuck in the resize loop and does not properly update the CR status. This prevents the pre-backup hooks from completing, because they cannot refresh the backup instance annotation while the operator is in this unhealthy state.

Resolving the problem
The workaround to use depends on whether the storage class supports volume expansion.
Can enable volume expansion for storage class
  1. Enable volume expansion by setting AllowVolumeExpansion to true in the storage class.
  2. Allow the PostgreSQL operator to complete the PVC resize operation.
  3. Verify the PostgreSQL cluster reaches Completed status.
  4. Retry the backup operation.
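
To decide which workaround applies, you can check the storage class that backs the PostgreSQL PVC. The following sketch assumes the PVC name from the error message (ibm-lh-postgres-edb-1) and uses hypothetical function names. Whether the storage backend actually supports expansion is outside the scope of these commands.

```shell
# Print the storage class name that a PVC uses.
# Arguments: <pvc-name> <namespace>
pvc_storage_class() {
  oc get pvc "$1" -n "$2" -o jsonpath='{.spec.storageClassName}'
}

# Print "true" or "false" depending on whether the storage class allows
# volume expansion. An unset allowVolumeExpansion field is treated as false.
sc_allows_expansion() {
  local value
  value=$(oc get storageclass "$1" -o jsonpath='{.allowVolumeExpansion}') || return 1
  if [ "$value" = "true" ]; then echo true; else echo false; fi
}

# Enable volume expansion on the storage class (step 1 of the first workaround).
enable_sc_expansion() {
  oc patch storageclass "$1" --type=merge -p '{"allowVolumeExpansion": true}'
}
```

For example, sc_allows_expansion "$(pvc_storage_class ibm-lh-postgres-edb-1 <namespace>)" prints false on an affected cluster. Run enable_sc_expansion only if your storage provider supports resizing.
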
Cannot enable volume expansion for storage class

You must manually recreate the PostgreSQL PVCs one at a time:

  1. Delete one PostgreSQL pod and its associated PVC.
    1. Wait for the new pod and PVC to be created.
    2. Wait for PostgreSQL to sync up and stabilize.
  2. Repeat the previous step for each of the remaining PostgreSQL pods. Delete only one pod at a time.
  3. Verify all PVCs have been refreshed with the correct size.
  4. Verify the PostgreSQL cluster is healthy.
  5. Retry the backup operation.

Backup validation fails for Data Virtualization due to missing label in dvendpoint PVC

Applies to: 5.3.0 and later

Applies to: Online and offline backup with the OADP utility

Diagnosing the problem

When you take a backup with OADP on a cluster that has a Data Virtualization instance, the backup validation fails and you see the following error:

error executing workflow actions: workflow action execution resulted in 1 error(s):
     - encountered an error during local-exec workflowAction.Do() - action=cpd-backup-validation, action-index=8, retry-attempt=0/0, err=backup validation failed: 1 error occurred:
        * backup validation failed for configmap: dv-aux-br-cm

 (description: error executing workflow for DataProtectionPlan 'cpd-offline-tenant/backup-service-orchestrated-parent-workflow')
        * error executing workflow actions: workflow action execution resulted in 1 error(s):
     - encountered an error during local-exec workflowAction.Do() - action=cpd-backup-validation, action-index=8, retry-attempt=0/0, err=backup validation failed: 1 error occurred:
        * backup validation failed for configmap: dv-aux-br-cm

 (description: finished executing 1 DataProtectionPlan(s):)
        * DataProtectionPlan=cpd-offline-tenant/backup-service-orchestrated-parent-workflow, Action=cpd-backup-validation (index=8) error: backup validation failed: 1 error occurred:
        * backup validation failed for configmap: dv-aux-br-cm

error: backup validation unsuccessful

failed rules report:

CM NAME         RESOURCE-KIND           ADDONID PATH                                                            ERR
dv-aux-br-cm    persistentvolumeclaim   dv      backup-validation-meta/backup-validations/4/validation-rules/1  validation unsuccessful, resource  '>=' '1' rule is not fulfilled, actual count: 0
Cause of the problem

The backup validation process fails because it cannot find the correct number of PersistentVolumeClaims (PVCs) with the required labels.

The backup validation process checks for the dvendpoint PVC by using specific labels defined in the dv-aux-br-cm backup ConfigMap. The validation rule expects at least one PVC with the following labels:

  - resource-kind: persistentvolumeclaim
    validation-rules:
      - type: count
        op: ">="
        val: 1
        labels: "icpdsupport/app=endpoint, icpdsupport/addOnId=dv"  

However, after upgrading from Data Virtualization 3.2.x to Data Virtualization 3.3.x, the required labels icpdsupport/app=endpoint and icpdsupport/addOnId=dv are missing from the dvendpoint PVC. The missing labels cause the validation process to fail because it cannot find any PVCs with the labels (actual count: 0).

Resolving the problem

Add the missing labels to the PVCs before taking a backup. Run the following script to add the labels to the dvendpoint PVCs.

  1. Create the script file:
    vi fix_labels.sh
  2. Copy and paste the following content into the file. Change the NAMESPACE value as needed.
    # Set NAMESPACE to match your IBM Software Hub namespace
    NAMESPACE="cpd-instance"
    
    # List PVCs before patching
    echo "PVCs before patching:"
    oc get pvc -n $NAMESPACE -l component=dvendpoint,formation_id=db2u-dv --show-labels
    
    # Patch all matching PVCs
    echo "Patching PVCs..."
    for pvc in $(oc get pvc -n $NAMESPACE -l component=dvendpoint,formation_id=db2u-dv -o name); do
      echo "Patching $pvc"
      oc patch $pvc -n $NAMESPACE --type=merge -p '{"metadata":{"labels":{"icpdsupport/app":"endpoint"}}}'
    done
    
    # Verify the patch
    echo "PVCs after patching:"
    oc get pvc -n $NAMESPACE -l component=dvendpoint,formation_id=db2u-dv --show-labels
  3. Make the script executable:
    chmod +x fix_labels.sh
  4. Execute the script:
    sh fix_labels.sh

    The script displays PVC labels before and after patching to confirm the changes. The script adds the missing label only if the label is not already present. You need to run the script before every backup that you take.

  5. Verify that the labels have been correctly applied:
    oc get pvc -n <namespace> -l component=dvendpoint,formation_id=db2u-dv --show-labels
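
The validation rule passes only when at least one PVC carries both labels, so counting the PVCs that match the rule's own selector is a direct way to predict the result. The following is a sketch with a hypothetical function name; the selector is copied from the validation rule in the dv-aux-br-cm ConfigMap.

```shell
# Print the number of PVCs in the namespace that match the labels
# required by the backup validation rule.
# Argument: <namespace>
count_labeled_dv_pvcs() {
  oc get pvc -n "$1" \
    -l 'icpdsupport/app=endpoint,icpdsupport/addOnId=dv' -o name | grep -c .
}
```

A result of 0 corresponds to the "actual count: 0" message in the failed rules report. A result of 1 or more means that the count rule should be fulfilled.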

Offline backups fail for Db2 Data Management Console due to incorrect CR names in the ConfigMap

Applies to: 5.3.0

Fixed in: 5.3.1

Applies to: Offline backup with the OADP utility

Diagnosing the problem

When performing an offline backup on a cluster that was upgraded from IBM Software Hub Version 5.1.1 to Version 5.3.0, the backup fails during the cpd-backup-validation phase with the following error:

backup validation failed: 1 error occurred:
* backup validation failed for configmap: cpd-dmc-aux-br-cm

The detailed logs show validation errors for Db2 Data Management Console custom resources with hardcoded sample names that do not exist in the backup:

object with name 'dmcaddon-sample' does not exist in the backup

When you examine the cpd-dmc-aux-br-cm ConfigMap, you see hardcoded sample custom resource (CR) names:

oc get cm -n zen cpd-dmc-aux-br-cm -o yaml | yq
...
  backup-validation-meta: |-
    backup-validations:
      - resource-kind: dmcaddons.dmc.databases.ibm.com
        validation-rules:
          - type: match_names
            names:
              - dmcaddon-sample # cluster has dmc-addon, not dmcaddon-sample
      - resource-kind: dmcs.dmc.databases.ibm.com
        validation-rules:
          - type: match_names
            names:
              - dmc-sample # cluster has data-management-console, not dmc-sample
      - resource-kind: configmap
        validation-rules:
          - type: match_names
            names:
              - ibm-dmc-addon-api-cm

However, the actual Db2 Data Management Console custom resources on the cluster have different names:

NAMESPACE   NAME                      VERSION   RECONCILED   STATUS      AGE
zen         dmc-addon                 5.3.0     5.3.0        Completed   2d23h
zen         data-management-console   5.3.0     5.3.0        Completed   35h

This mismatch causes the backup validation to fail because it cannot find the hardcoded sample CR names in the backup.

Cause of the problem

The cpd-dmc-aux-br-cm ConfigMap contains hardcoded sample CR names (dmcaddon-sample and dmc-sample) in the backup validation rules. These sample names were changed to dmc-addon and data-management-console in IBM Software Hub 5.1.2.

When a cluster is upgraded, the ConfigMap is not automatically updated with the correct CR names. The backup validation process then fails because it cannot find any current CRs with the old sample names.

Resolving the problem

Manually update the cpd-dmc-aux-br-cm ConfigMap to use the correct CR names that match your cluster.

  1. Edit the Db2 Data Management Console backup ConfigMap:
    oc edit cm -n zen cpd-dmc-aux-br-cm
  2. Locate the backup-validation-meta section, and update the CR names to match the CR names on your cluster:
    backup-validation-meta: |-
      backup-validations:
        - resource-kind: dmcaddons.dmc.databases.ibm.com
          validation-rules:
            - type: match_names
              names:
                - dmc-addon
        - resource-kind: dmcs.dmc.databases.ibm.com
          validation-rules:
            - type: match_names
              names:
                - data-management-console
        - resource-kind: configmap
          validation-rules:
            - type: match_names
              names:
                - ibm-dmc-addon-api-cm
  3. Wait for the Db2 Data Management Console custom resources to reach Completed status before taking another backup.
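
Before you edit the ConfigMap, you can list the CR names that actually exist on the cluster so that the names you enter in step 2 match. The following sketch uses a hypothetical function name; the resource kinds are the ones that appear in the backup-validation-meta section.

```shell
# Print the names of the Db2 Data Management Console custom resources
# in a namespace, one line per resource kind.
# Argument: <namespace>
list_dmc_cr_names() {
  local namespace="$1" kind
  for kind in dmcaddons.dmc.databases.ibm.com dmcs.dmc.databases.ibm.com; do
    echo "$kind: $(oc get "$kind" -n "$namespace" -o jsonpath='{.items[*].metadata.name}')"
  done
}
```

On the cluster in this example, list_dmc_cr_names zen would be expected to report dmc-addon and data-management-console, which are the values to use in the match_names rules.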

Model Gateway online backup fails at checkpoint due to missing shell in container image

Applies to: 5.3.0

Fixed in: 5.3.1

Applies to: Online backup with the OADP utility

Diagnosing the problem

When attempting to perform an online backup of Model Gateway by using OADP, the backup fails during the cpd-checkpoint phase. The logs show an error similar to the following example:

Hook execution breakdown by status=error/timedout:  
  The following hooks either have errors or timed out
    checkpoint (1):
  
      	COMPONENT               	CONFIGMAP                          	METHOD	STATUS	DURATION	ADDONID      
      	model-gateway-maint-ckpt	cpd-model-gateway-maint-aux-ckpt-cm	rule  	error 	0s      	model_gateway

error executing workflow actions: workflow action execution resulted in 1 error(s):
       - encountered an error during local-exec workflowAction.Do() - action=cpd-checkpoint, action-index=4, retry-attempt=0/0,
         checkpoint hooks execution failed: error running checkpoint exec hooks: Error running checkpoint rules.
         Check the /root/br/backup/cpd-cli-workspace/logs/CPD-CLI-2025-12-14.log for errors
1 error occurred:
   * op error: id=32, name=checkpointViaConfigHookRule, configmap=cpd-model-gateway-maint-aux-ckpt-cm: error performing op checkpointViaConfigHookRule for resource model-gateway-maint-ckpt, 
     msg: 1 error occurred:
   * error executing command (container=model-gateway podIdx=0 podName=model-gateway-55f8994bd6-x9rkj namespace=zen auxMetaName=model-gateway-maint-aux-ckpt component=model-gateway-maint-ckpt actionIdx=0) 
   - related pod stdout/stderr are available in the cpd-cli logs under logId=114c7e06-49ab-4e5b-a921-10ed8f3a8cbb: command terminated with exit code 1

If you attempt to access the Model Gateway pod by using oc rsh, you get the following error:

executable file `/bin/sh` not found: No such file or directory
command terminated with exit code 1
Cause of the problem

The Model Gateway container image does not include /bin/sh for security reasons. The checkpoint hook in the cpd-model-gateway-maint-aux-ckpt-cm tries to execute commands inside the Model Gateway pod during online backup, but it fails because no shell is available in the container. This failure then causes the entire online backup to fail.

Resolving the problem

The checkpoint hook is designed to log information about the Model Gateway state during backup. While this information is useful for diagnostics, it is not critical for the backup process itself.

You can remove the checkpoint hook from the ConfigMap, which allows the backup to proceed without attempting to execute commands in the Model Gateway pod.

  1. Before taking an online backup, patch the Model Gateway CR to set it in maintenance mode:
    oc patch modelgateway modelgateway-cr --type=merge -p '{"spec":{"ignoreForMaintenance":true}}'
    Note: The Model Gateway service remains accessible during the backup process. Setting the ignoreForMaintenance flag only prevents the operator from reverting the ConfigMap changes during backup. It does not quiesce the service nor make it unavailable to users.
  2. Patch the checkpoint ConfigMap to remove the checkpoint hook:
    oc patch cm cpd-model-gateway-maint-aux-ckpt-cm --type=merge -p '{"data":{"checkpoint-meta":"exec-hooks:\n  exec-rules: []\n"}}'
  3. Proceed with the online backup operation.
  4. After the backup completes successfully, remove the Model Gateway CR from maintenance mode:
    oc patch modelgateway modelgateway-cr --type=merge -p '{"spec":{"ignoreForMaintenance":false}}'
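
If a backup fails partway through, the ignoreForMaintenance flag can be left set to true. The following sketch wraps steps 1 through 4 so that the flag is cleared whether or not the backup command succeeds. The function names are hypothetical, and the backup command is whatever you pass as arguments. Like the commands above, the sketch assumes that your current project is the one that contains the Model Gateway CR.

```shell
# Set or clear the ignoreForMaintenance flag on the Model Gateway CR.
# Argument: true or false
set_mg_maintenance_flag() {
  oc patch modelgateway modelgateway-cr --type=merge \
    -p "{\"spec\":{\"ignoreForMaintenance\":$1}}"
}

# Set the flag, remove the checkpoint hook, run the given command, and
# then clear the flag. Returns the exit status of the given command.
run_with_mg_flag() {
  local status
  set_mg_maintenance_flag true || return 1
  if ! oc patch cm cpd-model-gateway-maint-aux-ckpt-cm --type=merge \
      -p '{"data":{"checkpoint-meta":"exec-hooks:\n  exec-rules: []\n"}}'; then
    set_mg_maintenance_flag false
    return 1
  fi
  "$@"
  status=$?
  # Always clear the flag, even when the wrapped command failed.
  set_mg_maintenance_flag false
  return "$status"
}
```

You would call it as run_with_mg_flag <your-backup-command> <arguments>. The wrapper does not change how the backup itself runs.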

Backup pre-check fails for Db2 Data Management Console in REST mode

Applies to: 5.3.0

Fixed in: 5.3.1

Applies to: Offline backup with the OADP utility

Diagnosing the problem

When you try to perform an offline backup of Db2 Data Management Console by using the REST mode for OADP, the backup fails during the cpd-precheck-backup phase with the following error:

backup precheck failed with error: pre-check backup hooks encountered one or more error(s), err=1 error occurred:
	* backup precheck hook finished with status=error, configmap=cpd-dmc-aux-br-cm

The detailed error message in the logs indicates a permission issue:

error performing op preCheckViaConfigHookRule for resource dmc (configmap=cpd-dmc-aux-br-cm): : dmcaddons.dmc.databases.ibm.com is forbidden: User "cpdbr-api" cannot list resource "dmcaddons" in API group "dmc.com" in the namespace "zen"

The issue does not occur when you use Kubernetes mode.

Cause of the problem

This is a role-based access control (RBAC) permissions issue. The admin ClusterRole includes permissions for the dmcs custom resource, but not for the dmcaddons custom resource. The ClusterRole for dmcaddons is missing a label, so its permissions, including the list verb, aren't automatically aggregated into the admin ClusterRole. As a result, the precheck hook in cpd-dmc-aux-br-cm fails because it can't list dmcaddons in the dmc.databases.ibm.com API group.

This issue doesn't occur with Kubernetes mode because the backup process uses different credentials for that mode.

Resolving the problem

To work around the issue, you can switch to the Kubernetes mode.
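
To confirm that the failure is the missing list permission, you can ask the API server whether a given identity can list dmcaddons. The following sketch uses a hypothetical function name. The identity that you pass is an assumption: use the user exactly as it appears in your error message or the full service account name. The --as option also requires that your own account is allowed to impersonate users.

```shell
# Ask whether an identity can list dmcaddons in a namespace.
# Prints "yes" or "no"; the exit status is nonzero when the answer is no.
# Arguments: <namespace> <user-or-service-account>
can_list_dmcaddons() {
  oc auth can-i list dmcaddons.dmc.databases.ibm.com -n "$1" --as="$2"
}
```

An answer of no for the identity that the REST mode uses is consistent with the forbidden error shown above.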

OpenPages instance fails to start after restore due to missing labels on secrets

Applies to: 5.3.0

Fixed in: 5.3.1

Applies to: Online and offline restore with the OADP utility

Diagnosing the problem
After upgrading to version 5.3.0, you perform a backup and restore (either online or offline). The restore seems to complete successfully, but the OpenPages instance fails to start. The creation of openpages-openpagesinstance-cr-sts-0 and op-*-user-sync-cj-*-* is blocked by a missing secret, and you see the following error:
MountVolume.SetUp failed for volume "op-platform-secret" : secret "openpages-<openpages-instance-name>-platform-secret" not found

When you check the OpenPages secrets, you notice that some secrets are missing the icpdsupport/addOnId label, for example:

labels:
  app: openpages
  app.kubernetes.io/component: opapp
  app.kubernetes.io/instance: openpages
  app.kubernetes.io/managed-by: openpages
  app.kubernetes.io/name: openpages
  component: opapp
  icpd-addon/status: "1718062918"
  icpdsupport/app: openpages
  icpdsupport/module: openpages-app
  icpdsupport/serviceInstanceId: "1718062918"
  release: openpages
Cause of the problem

During the upgrade to version 5.3.0, some OpenPages secrets lose the icpdsupport/addOnId label while retaining the icpdsupport/serviceInstanceId label. These secrets are backed up correctly. However, the OADP utility uses the icpdsupport/addOnId label to restore secrets during the restore operation. The missing labels prevent the restoration, causing the OpenPages instance to fail during startup.

The following secrets are affected:
  • openpages-<openpages-instance-name>-db-secret
  • openpages-<openpages-instance-name>-platform-secret
  • openpages-<openpages-instance-name>-initialpw-secret
Resolving the problem

You can manually add the missing label to the affected secrets.

  1. Use the following commands to set environment variables on both the source cluster and the target cluster.
    export PROJECT_CPD_INST_OPERANDS="openpages-namespace"
    
    export OPENPAGES_INSTANCE_NAME=$(oc get openpagesinstance \
    -n ${PROJECT_CPD_INST_OPERANDS} \
    -o jsonpath='{.items[*].metadata.name}')
    
    
    export OP_INSTANCE_ID=$(oc get openpagesinstance ${OPENPAGES_INSTANCE_NAME} \
    -n ${PROJECT_CPD_INST_OPERANDS} \
    -ojsonpath='{.spec.zenServiceInstanceId}')
  2. Identify OpenPages secrets from the source cluster:
    oc get secrets \
    -l "icpdsupport/serviceInstanceId=${OP_INSTANCE_ID},app.kubernetes.io/instance=openpages,"'!icpdsupport/addOnId' \
    -n ${PROJECT_CPD_INST_OPERANDS}
  3. Extract the OpenPages secrets from the backup of the source cluster.
    1. Extract the secrets YAML files from the backup data.
      ./cpd-cli oadp tenant-backup download <backup-name>
    2. Untar the downloaded file and navigate to resources/secrets/namespaces/<namespace>/
    3. List the secrets and the secrets YAML files that you collected:

      Use jq to format the JSON, for example:

      cat <openpages-openpagesinstance-with-25-platform-secret.json> | jq
  4. To avoid conflicts, remove all cluster-specific metadata fields before applying the secrets to the target cluster.
    1. Remove the following data from the backup files:
      • creationTimestamp ( .metadata.creationTimestamp)
      • managedFields (the entire section)
      • resourceVersion ( .metadata.resourceVersion)
      • uid ( .metadata.uid)
    2. Update ownerReferences so that the uid matches your current OpenPagesInstance UID (.metadata.uid).
  5. Apply the YAML file to the target cluster.
    oc apply -f <secrets>.yaml -n ${PROJECT_CPD_INST_OPERANDS}
    Note: Verify that the secrets are present in the target cluster.
  6. Apply the missing labels to all secrets.
    oc label secret \
    -l "icpdsupport/serviceInstanceId=${OP_INSTANCE_ID},app.kubernetes.io/instance=openpages,"'!icpdsupport/addOnId' 'icpdsupport/addOnId=openpages' \
    --overwrite \
    -n ${PROJECT_CPD_INST_OPERANDS}
    Note: You should also run this command on all the instances before taking any backups.
  7. Use the following command to restart the OpenPages STS pod.
    oc rollout restart statefulset openpages-${OPENPAGES_INSTANCE_NAME}-sts \
    -n ${PROJECT_CPD_INST_OPERANDS}
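
The metadata cleanup in step 4 can be scripted with jq instead of edited by hand. The following sketch assumes that each file holds a single secret in JSON format, as in the jq example in step 3. The function name is hypothetical.

```shell
# Remove cluster-specific metadata from a secret (JSON on standard input)
# and set the uid of every ownerReferences entry to the given UID.
# Argument: <current-OpenPagesInstance-UID>
clean_secret_json() {
  local owner_uid="$1"
  jq --arg uid "$owner_uid" '
    del(.metadata.creationTimestamp,
        .metadata.managedFields,
        .metadata.resourceVersion,
        .metadata.uid)
    | if .metadata.ownerReferences
      then .metadata.ownerReferences |= map(.uid = $uid)
      else . end'
}
```

For example, clean_secret_json "$(oc get openpagesinstance ${OPENPAGES_INSTANCE_NAME} -n ${PROJECT_CPD_INST_OPERANDS} -o jsonpath='{.metadata.uid}')" < <secret>.json > <secret>-clean.json produces a file that you can review before you apply it in step 5.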

Offline backup fails with PartiallyFailed error

Applies to: 5.3.0 and later

Applies to: Offline backup and restore with the OADP utility

Diagnosing the problem
In the Velero logs, you see errors similar to the following example:
time="<timestamp>" level=error msg="Current BackupStorageLocations available/unavailable/unknown: 0/0/1)" controller=backup-storage-location logSource="/remote-source/velero/app/pkg/controller/backup_storage_location_controller.go:180"
time="<timestamp>" level=error msg="error encountered while scanning stdout" backupLocation=oadp-operator/dpa-sample-1 cmd=/plugins/velero-plugin-for-aws controller=backup-sync error="read |0: file already closed" logSource="/remote-source
/velero/app/pkg/plugin/clientmgmt/process/logrus_adapter.go:90"
time="<timestamp>" level=error msg="Restic command fail with ExitCode: 1. Process ID is 906, Exit error is: exit status 1" logSource="/remote-source/velero/app/pkg/util/exec/exec.go:66"
time="<timestamp>" level=error msg="pod volume backup failed: data path backup failed: error running restic backup command restic backup --repo=s3:http://minio-velero.apps.jctesti23.examplehost.example.com/velero/cpdbackup/restic/cpd-instance --pa
ssword-file=/tmp/credentials/oadp-operator/velero-repo-credentials-repository-password --cache-dir=/scratch/.cache/restic . --tag=pod-uid=1ed9d52f-2f6d-4978-930a-4d8e30acced1 --tag=pvc-uid=76b76bc4-27d4-4369-886c-1272dfdf9ce9 --tag=volume=cc-home-p
vc-vol --tag=backup=cpd-tenant-vol-485eef74-efbe-11ef-b2bd-00000a0b44c3 --tag=backup-uid=b55d6323-9875-4afe-b605-646250cbd55c --tag=ns=cpd-instance --tag=pod=cpdbr-vol-mnt --host=velero --json with error: exit status 3 stderr: {\"message_type\":\"e
rror\",\"error\":{},\"during\":\"scan\",\"item\":\"/host_pods/1ed9d52f-2f6d-4978-930a-4d8e30acced1/volumes/kubernetes.io~nfs/pvc-76b76bc4-27d4-4369-886c-1272dfdf9ce9/.scripts/system\"}\n{\"message_type\":\"error\",\"error\":{},\"during\":\"archival
\",\"item\":\".scripts/system\"}\n{\"message_type\":\"error\",\"error\":{\"Op\":\"open\",\"Path\":\"_global_/security/artifacts/metakey\",\"Err\":13},\"during\":\"archival\",\"item\":\"/host_pods/1ed9d52f-2f6d-4978-930a-4d8e30acced1/volumes/kuberne
tes.io~nfs/pvc-76b76bc4-27d4-4369-886c-1272dfdf9ce9/_global_/security/artifacts/metakey\"}\nWarning: at least one source file could not be read\n" backup=oadp-operator/cpd-tenant-vol-485eef74-efbe-11ef-b2bd-00000a0b44c3 logSource="/remote-source/ve
lero/app/pkg/podvolume/backupper.go:328"
time="<timestamp>" level=error msg="pod volume backup failed: data path backup failed: error running restic backup command restic backup --repo=s3:http://minio-velero.apps.jctesti23.cp.fyre.ibm.com/velero/cpdbackup/restic/cpd-instance --pa
ssword-file=/tmp/credentials/oadp-operator/velero-repo-credentials-repository-password --cache-dir=/scratch/.cache/restic . --tag=pod=cpdbr-vol-mnt --tag=pod-uid=1ed9d52f-2f6d-4978-930a-4d8e30acced1 --tag=pvc-uid=93e9e23c-d80a-49cc-80bb-31a36524e0d
c --tag=volume=data-rabbitmq-ha-0-vol --tag=backup=cpd-tenant-vol-485eef74-efbe-11ef-b2bd-00000a0b44c3 --tag=backup-uid=b55d6323-9875-4afe-b605-646250cbd55c --tag=ns=cpd-instance --host=velero --json with error: exit status 3 stderr: {\"message_typ
e\":\"error\",\"error\":{\"Op\":\"open\",\"Path\":\".erlang.cookie\",\"Err\":13},\"during\":\"archival\",\"item\":\"/host_pods/1ed9d52f-2f6d-4978-930a-4d8e30acced1/volumes/kubernetes.io~nfs/pvc-93e9e23c-d80a-49cc-80bb-31a36524e0dc/.erlang.cookie\"}
\nWarning: at least one source file could not be read\n" backup=oadp-operator/cpd-tenant-vol-485eef74-efbe-11ef-b2bd-00000a0b44c3 logSource="/remote-source/velero/app/pkg/podvolume/backupper.go:328"
Cause of the problem
The restic folder was deleted after backups were cleaned up (deleted). This problem is a Velero known issue. For more information, see velero does not recreate restic|kopia repository from manifest if its directories are deleted on s3.
Resolving the problem
Do the following steps:
  1. Get the list of backup repositories:
    oc get backuprepositories -n ${OADP_OPERATOR_NAMESPACE} -o yaml
  2. Check for old or invalid object storage URLs.
  3. Check that the object storage path is in the backuprepositories custom resource.
  4. Check that the <objstorage>/<bucket>/<prefix>/restic/<namespace>/config file exists.

    If the file does not exist, make sure that you do not share the same <objstorage>/<bucket>/<prefix> with another cluster, and specify a different <prefix>.

  5. Delete backup repositories that are invalid for the following reasons:
    • The path does not exist anymore in the object storage.
    • The restic/<namespace>/config file does not exist.
    oc delete backuprepositories -n ${OADP_OPERATOR_NAMESPACE} <backup-repository-name>
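
As a rough aid for step 4, the following sketch derives the object key of the config file from a restic repository identifier. The restic_config_key function is a hypothetical helper, not part of Velero or OADP, and it assumes the s3:<scheme>://<objstorage>/<bucket>/<prefix>/restic/<namespace> layout that appears in the --repo option of the example log message.

```shell
# Hypothetical helper (not part of Velero or OADP): print the object key of
# the restic config file for a repository identifier of the assumed form
#   s3:http://<objstorage>/<bucket>/<prefix>/restic/<namespace>
restic_config_key() {
  repo="${1#s3:}"       # drop the "s3:" prefix that restic uses
  repo="${repo#*://}"   # drop "http://" or "https://" if present
  repo="${repo#*/}"     # drop the object storage host
  printf '%s/config\n' "${repo%/}"   # <bucket>/<prefix>/restic/<namespace>/config
}

# Example with a placeholder host:
#   restic_config_key "s3:http://objstorage.example.com/velero/cpdbackup/restic/cpd-instance"
```

The identifier is typically recorded in the spec.resticIdentifier field of each backuprepositories custom resource. Treat that field name as an assumption and confirm it in the YAML output from step 1 before you rely on it.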

Db2 Big SQL backup pre-hook and post-hook fail during offline backup

Applies to: 5.3.0 and later

Applies to: Offline backup and restore with the OADP utility

Diagnosing the problem
In the db2diag logs of the Db2 Big SQL head pod, you see error messages similar to the following example when the backup pre-hooks are running:
<timestamp>          LEVEL: Event
PID     : 3415135              TID : 22544119580160  PROC : db2star2
INSTANCE: db2inst1             NODE : 000
HOSTNAME: c-bigsql-<xxxxxxxxxxxxxxx>-db2u-0
FUNCTION: DB2 UDB, base sys utilities, DB2StartMain, probe:5692
MESSAGE : ZRC=0xFFFFFBD0=-1072
          SQL1072C  The request failed because the database manager resources
          are in an inconsistent state. The database manager might have been
          incorrectly terminated, or another application might be using system
          resources in a way that conflicts with the use of system resources by
          the database manager.
Cause of the problem
The Db2 database was unable to start because of the SQL1072C error. As a result, the bigsql start command that runs as part of the post-backup hook hangs, which causes the post-hook to time out. The post-hook cannot succeed until Db2 is brought back to a stable state and the bigsql start command runs successfully. Until then, the Db2 Big SQL instance is left in an unstable state.
Resolving the problem
Do one or both of the following troubleshooting and cleanup procedures.
Tip: For more information about the SQL1072C error code and how to resolve it, see SQL1000-1999 in the Db2 documentation.
Remove all the database manager processes running under the Db2 instance ID
Do the following steps:
  1. Log in to the Db2 Big SQL head pod:
    oc -n ${PROJECT_CPD_INST_OPERANDS} rsh $(oc -n ${PROJECT_CPD_INST_OPERANDS} get pods | grep -i c-bigsql | grep -i db2u-0 | cut -d' ' -f 1) bash
  2. Switch to the db2inst1 user:
    su - db2inst1
  3. List all the database manager processes that are running under db2inst1:
    db2_ps
  4. Remove these processes:
    kill -9 <process-ID>
Ensure that no other application is running under the Db2 instance ID, and then remove all resources owned by the Db2 instance ID
Do the following steps:
  1. Log in to the Db2 Big SQL head pod:
    oc -n ${PROJECT_CPD_INST_OPERANDS} rsh $(oc -n ${PROJECT_CPD_INST_OPERANDS} get pods | grep -i c-bigsql | grep -i db2u-0 | cut -d' ' -f 1) bash
  2. Switch to the db2inst1 user:
    su - db2inst1
  3. List all IPC resources owned by db2inst1:
    ipcs | grep db2inst1
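  4. Remove each resource by its ID. Use -q for a message queue, -m for a shared memory segment, or -s for a semaphore array:
    ipcrm -[q|m|s] <resource-ID>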
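
Steps 3 and 4 can be combined in a small script. The following sketch only prints the ipcrm commands so that you can review them before you run anything. The ipcrm_commands function is a hypothetical helper, and it assumes the util-linux ipcs layout, where each section has a header line, the second column is the resource ID, and the third column is the owner.

```shell
# Hypothetical helper: read `ipcs` output on stdin and print (not run) one
# ipcrm command for each IPC resource that is owned by the given user.
# Assumes the util-linux layout: column 2 = resource ID, column 3 = owner.
ipcrm_commands() {
  awk -v owner="$1" '
    /Message Queues/          { flag = "-q"; next }   # message queue section
    /Shared Memory Segments/  { flag = "-m"; next }   # shared memory section
    /Semaphore Arrays/        { flag = "-s"; next }   # semaphore section
    flag != "" && $3 == owner { print "ipcrm", flag, $2 }
  '
}

# Example, run as db2inst1 inside the head pod:
#   ipcs | ipcrm_commands db2inst1
```

Review the printed commands, confirm that no other application uses the listed resources, and then run them manually.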

OpenSearch operator fails during backup

Applies to: 5.3.0 and later

Applies to: Offline backup and restore with the OADP utility

Diagnosing the problem
After a restore, check the list of restored indices in OpenSearch. If only indices that start with a dot (for example, .ltstore) are present and the expected indices without a dot prefix are missing, your deployment is affected by this issue.
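
One way to make this check concrete is to count the index names by prefix. The following sketch is illustrative only: count_index_kinds is a hypothetical helper, and the commented curl call assumes that the user, password, and OS_HOST variables are set inside the OpenSearch client pod, as in the resolution steps for this issue.

```shell
# Hypothetical helper: read index names on stdin (one per line) and report
# how many start with a dot and how many do not. Blank lines are ignored.
count_index_kinds() {
  awk '
    /^\./ { dot++; next }    # dot-prefixed index, for example .ltstore
    NF    { other++ }        # any other non-empty line
    END   { printf "dot=%d other=%d\n", dot, other }
  '
}

# Example, assuming the variables are set in the client pod:
#   curl -s -k "https://${user}:${password}@${OS_HOST}/_cat/indices?h=index" | count_index_kinds
```

A result in which the second count is 0 matches the symptom that is described here.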
Cause of the problem
During backup and restore, the OpenSearch operator restores either the indices that start with "." or the indices that do not start with ".", but not both. This behavior affects Watson Discovery deployments, where both types of indices are expected to be restored.
Resolving the problem
Complete the following steps to resolve the issue:
  1. Access the client pod:
    oc rsh -n ${PROJECT_CPD_INST_OPERANDS} wd-discovery-opensearch-client-000
  2. Set the credentials and repository:
    user=$(find /workdir/internal_users/ -mindepth 1 -maxdepth 1 | head -n 1 | xargs basename)
    password=$(cat "/workdir/internal_users/${user}")
    repo="cloudpak"
  3. Get the latest snapshot:
    last_snapshot=$(curl --retry 5 --retry-delay 5 --retry-all-errors -k -X GET "https://${user}:${password}@${OS_HOST}/_cat/snapshots/${repo}?h=id&s=end_epoch" | tail -n1)
  4. Check that the latest snapshot was saved:
    echo $last_snapshot
  5. Restore the snapshot:
    curl -k -X POST "https://${user}:${password}@${OS_HOST}/_snapshot/${repo}/${last_snapshot}/_restore?wait_for_completion=true" \
      -d '{"indices": "-.*","include_global_state": false}' \
      -H 'Content-Type: application/json'
    This command can take a while to run before output is shown. After it completes, you'll see output similar to the following example:
    {
        "snapshot": {
            "snapshot": "cloudpak_snapshot_2025-06-10-13-30-45",
            "indices": [
                "966d1979-52e8-6558-0000-019759db7bdc_notice",
                "b49b2470-70c3-4ba1-9bd2-a16d72ffe49f_curations",
                ...
            ],
            "shards": {
                "total": 142,
                "failed": 0,
                "successful": 142
            }
        }
    }

The command restores the indices that do not start with a dot, so that both the dot-prefixed and the non-dot-prefixed indices are restored.

Security issues

Security scans return an Inadequate Account Lockout Mechanism message

Applies to: 5.3.0 and later

Diagnosing the problem
If you run a security scan against IBM Software Hub, the scan returns the following message:
Inadequate Account Lockout Mechanism
Resolving the problem
This behavior is by design. It is strongly recommended that you use an enterprise-grade password management solution, such as SAML SSO or an LDAP provider, as described in the following resources.

The Kubernetes version information is disclosed

Applies to: 5.3.0 and later

Diagnosing the problem
If you run an Aqua Security scan against your cluster, the scan returns the following issue:
Resolving the problem
This behavior is expected, based on the following solution document from the Red Hat OpenShift Customer Portal: Hide kubernetes /version API endpoint in OpenShift Container Platform 4.