Known issues and limitations for IBM Software Hub

The following issues apply to the IBM® Software Hub platform. Each issue includes information about the releases that it applies to. If the issue was fixed in a refresh, that information is also included.

Customer-reported issues

Issues that are found after the release are posted on the IBM Support site.

General issues

The Watson Knowledge Catalog monitor fails to display events on the Monitoring page

Applies to: 5.4.0

When OOTB monitoring is enabled on the IBM Software Hub instance, the wkc-enrichment-monitor command fails to post events to the history or the events APIs and, in turn, does not display events in the UI.

As a workaround, you can check the condition of the Watson Knowledge Catalog enrichment service by manually running the wkc-enrichment-monitor command.

The health service-functionality check for watsonx.ai fails on the Inferencing on Text Chat Models step

Applies to: 5.4.0

The cpd-cli health service-functionality check for watsonx.ai™ fails on the Inferencing on Text Chat Models step while inferencing text on the ibm/granite-20b-code-8k-ansible and ibm/granite-20b-code-javaenterprise-v2 models. The failure occurs only for these models, if they are installed. The health check continues to run successfully on the other installed models.

This failure is not indicative of a problem with watsonx.ai. There is no impact on the functionality of the service if this check fails.

The health service-functionality check fails on clusters with HAProxy configuration

Applies to: 5.4.0

On clusters that have HAProxy configured, the cpd-cli health service-functionality check fails with an End-of-File (EOF) error on long-running requests. The error occurs because HAProxy reloads every 10 seconds, and each reload disrupts active connections.

As a workaround, reconfigure the cluster router to increase the reload interval from 10 seconds to 1 minute before you run the cpd-cli health service-functionality command.

  1. Define and run the patch_router function to reconfigure the cluster router:

    patch_router() {
      local timeout="5m"
      oc patch ingresscontroller/default -n openshift-ingress-operator \
      --type=merge \
      -p '{"spec":{"tuningOptions":{"reloadInterval":"1m"},"idleConnectionTerminationPolicy":"Deferred"}}'
    
      echo "Waiting for router rollout to start..."
      sleep 15
      
      echo "Waiting for router rollout to complete..."
      if ! oc -n openshift-ingress rollout status deployment/router-default --timeout="$timeout"; then
        echo "Warning: Rollout failed or timed out. Continuing to prevent blocking tests"
        return 1
      fi
      
      echo "Rollout completed. Waiting 60s for HAProxy config propagation and connection stabilization..."
      sleep 60
      
      echo "Router patch fully applied and stabilized."
    }
    
    # Call the function
    patch_router

    The command also restarts the deployments associated with the HAProxy server.

  2. Run the cpd-cli health service-functionality command.

This patch persists across cluster restarts and needs to be applied only once per cluster.
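
To confirm that the setting is in place, you can read the value back from the IngressController resource. The following is a minimal sketch, not part of the documented workaround. The function name show_router_reload_interval is hypothetical, and the sketch assumes the same default ingress controller that the patch targets.

```shell
# Hypothetical helper: print the reload interval that is currently set on the
# default IngressController. Prints "unset" if the field is empty.
show_router_reload_interval() {
  local interval
  interval="$(oc get ingresscontroller/default \
    -n openshift-ingress-operator \
    -o jsonpath='{.spec.tuningOptions.reloadInterval}')"
  echo "${interval:-unset}"
}
```

Call show_router_reload_interval after the rollout completes. The API server might normalize the duration, so the returned string might not match the string in the patch exactly.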

Installation and upgrade issues

Upgrades fail if the Data Foundation Rook Ceph cluster is unstable

Applies to: 5.4.0

If the Red Hat® OpenShift® Data Foundation or IBM Fusion Data Foundation Rook Ceph® cluster is unstable, upgrades fail.

One symptom is that pods will not start because of a FailedMount error. For example:

Warning  FailedMount  36s (x1456 over 2d1h)   kubelet  MountVolume.MountDevice failed for volume 
"pvc-73bf3705-43e9-40bd-87ed-c1e1656d6f12" : rpc error: code = Aborted desc = an operation with the given 
Volume ID 0001-0011-openshift-storage-0000000000000001-5e17508b-c295-4306-b684-eaa327aec2ab already exists
Diagnosing the problem
To confirm whether the Data Foundation Rook Ceph cluster is unstable:
  1. Ensure that the rook-ceph-tools pod is running.
    oc get pods -n openshift-storage | grep rook-ceph-tools
    Note: On IBM Fusion HCI System or on environments that use hosted control planes, the pods are running in the openshift-storage-client project.
  2. Set the TOOLS_POD environment variable to the name of the rook-ceph-tools pod:
    export TOOLS_POD=<pod-name>
  3. Open a remote shell in the rook-ceph-tools pod:
    oc rsh -n openshift-storage ${TOOLS_POD}
  4. Run the following command to get the status of the Rook Ceph cluster:
    ceph status
    Confirm that the output includes the following line:
    health: HEALTH_WARN
  5. Exit the pod:
    exit
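
If you prefer a non-interactive check, steps 3 through 5 can be sketched as a single call that runs ceph status in the tools pod and extracts the health value. The function names are hypothetical, and the sketch assumes that the ceph status output contains a health: line in the format shown in step 4.

```shell
# Hypothetical helper: read `ceph status` output on stdin and print the value
# of the "health:" line (for example HEALTH_OK or HEALTH_WARN).
ceph_health_from_status() {
  awk '$1 == "health:" { print $2; exit }'
}

# Hypothetical wrapper: run `ceph status` in the tools pod without opening an
# interactive shell. Assumes TOOLS_POD is set as described in step 2.
check_ceph_health() {
  oc exec -n openshift-storage "${TOOLS_POD}" -- ceph status \
    | ceph_health_from_status
}
```

On IBM Fusion HCI System or on environments that use hosted control planes, replace openshift-storage with openshift-storage-client, as described in the note in step 1.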
Resolving the problem
To resolve the problem:
  1. Get the names of the rook-ceph-mgr pods:
    oc get pods -n openshift-storage | grep rook-ceph-mgr
  2. Set the MGR_POD_A environment variable to the name of the rook-ceph-mgr-a pod:
    export MGR_POD_A=<rook-ceph-mgr-a-pod-name>
  3. Set the MGR_POD_B environment variable to the name of the rook-ceph-mgr-b pod:
    export MGR_POD_B=<rook-ceph-mgr-b-pod-name>
  4. Delete the rook-ceph-mgr-a pod:
    oc delete pods ${MGR_POD_A} -n openshift-storage
  5. Ensure that the rook-ceph-mgr-a pod is running before you move to the next step:
    oc get pods -n openshift-storage | grep rook-ceph-mgr
  6. Delete the rook-ceph-mgr-b pod:
    oc delete pods ${MGR_POD_B} -n openshift-storage
  7. Ensure that the rook-ceph-mgr-b pod is running:
    oc get pods -n openshift-storage | grep rook-ceph-mgr
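
Steps 4 through 7 can be sketched as a small script that waits for each replacement pod instead of polling manually. This is an illustration under stated assumptions, not a supported procedure: the function names are hypothetical, and the sketch assumes that the manager pods are owned by deployments named rook-ceph-mgr-a and rook-ceph-mgr-b.

```shell
# Hypothetical helper: delete one manager pod, then block until its
# deployment reports the replacement pod as available.
# $1 = pod name, $2 = name of the owning deployment (an assumption).
restart_mgr_pod() {
  local pod="$1"
  local deployment="$2"
  local ns="openshift-storage"
  oc delete pod "${pod}" -n "${ns}" || return 1
  oc rollout status "deployment/${deployment}" -n "${ns}" --timeout=5m
}

# Restart both manager pods in order. The second pod is deleted only if the
# first one recovers.
restart_mgr_pods() {
  restart_mgr_pod "${MGR_POD_A}" rook-ceph-mgr-a \
    && restart_mgr_pod "${MGR_POD_B}" rook-ceph-mgr-b
}
```

Set MGR_POD_A and MGR_POD_B as described in steps 2 and 3 before you call restart_mgr_pods. If the deployment status lags behind the pod deletion, confirm the result with the oc get pods command from step 7.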

After you upgrade a Red Hat OpenShift Container Platform cluster, the FoundationDB resource can become unavailable

Applies to: 5.4.0

After you upgrade your cluster to a new version of Red Hat OpenShift Container Platform, the IBM FoundationDB pods can become unavailable. When this issue occurs, services that rely on FoundationDB such as IBM Knowledge Catalog and IBM Master Data Management cannot function correctly.

This issue affects deployments of the following services.
  • IBM Knowledge Catalog
  • IBM Master Data Management
Diagnosing the problem
To identify the cause of this issue, check the FoundationDB status and details.
  1. Check the FoundationDB status.
    oc get fdbcluster -o yaml | grep fdbStatus

    If this command is successful, the returned status is Complete. If the status is InProgress or Failed, proceed to the workaround steps.

  2. If the status is Complete but FoundationDB is still unavailable, log in to one of the FDB pods and check the status details to ensure that the database is available and all coordinators are reachable.
    oc rsh sample-cluster-log-1 /bin/fdbcli

    To check the detailed status of the FDB pod, run fdbcli to enter the FoundationDB command-line interface, then run the following command at the fdb> prompt.

    status details
    • If you get a message that is similar to Could not communicate with a quorum of coordination servers, run the coordinators command with the IP addresses specified in the error message as input.
      oc get pod -o wide | grep storage
      > coordinators IP-ADDRESS-1:4500:tls IP-ADDRESS-2:4500:tls IP-ADDRESS-3:4500:tls 

      If this step does not resolve the problem, proceed to the workaround steps.

    • If you get a different message, such as Recruiting new transaction servers, proceed to the workaround steps.
Resolving the problem
To resolve this issue, restart the FoundationDB pods.

Required role: To complete this task, you must be a cluster administrator.

  1. Restart the FoundationDB cluster pods.
    oc get fdbcluster 
    oc get po |grep ${CLUSTER_NAME} |grep -v backup|awk '{print $1}' |xargs oc delete po

    Replace ${CLUSTER_NAME} in the command with the name of your fdbcluster instance.

  2. Restart the FoundationDB operator pods.
    oc get po |grep fdb-controller |awk '{print $1}' |xargs oc delete po
  3. After the pods finish restarting, check to ensure that FoundationDB is available.
    1. Check the FoundationDB status.
      oc get fdbcluster -o yaml | grep fdbStatus

      The returned status must be Complete.

    2. Check to ensure that the database is available.
      oc rsh sample-cluster-log-1 /bin/fdbcli

      If the database is still not available, complete the following steps.

      1. Log in to the ibm-fdb-controller pod.
      2. Run the fix-coordinator script.
        kubectl fdb fix-coordinator-ips -c ${CLUSTER_NAME} -n ${PROJECT_CPD_INST_OPERATORS}

        Replace ${CLUSTER_NAME} in the command with the name of your fdbcluster instance.
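
Because the pods can take some time to restart, the status check in step 3 can be repeated in a loop. The following sketch polls the same oc get fdbcluster command until the status is Complete. The function name and the default attempt count and delay are hypothetical values chosen for illustration.

```shell
# Hypothetical helper: poll the fdbStatus field until it reports Complete.
# $1 = maximum number of attempts, $2 = seconds to sleep between attempts.
wait_for_fdb_complete() {
  local attempts="${1:-30}"
  local delay="${2:-10}"
  local i=1
  while [ "${i}" -le "${attempts}" ]; do
    if oc get fdbcluster -o yaml | grep fdbStatus | grep -q Complete; then
      echo "FoundationDB status is Complete."
      return 0
    fi
    sleep "${delay}"
    i=$((i + 1))
  done
  echo "FoundationDB status did not reach Complete." >&2
  return 1
}
```

For example, wait_for_fdb_complete 60 10 checks the status every 10 seconds for up to 10 minutes. A Complete status does not replace the database availability check in fdbcli.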

Node pinning is not applied to postgresql pods

Applies to: 5.4.0

If you use node pinning to schedule pods on specific nodes, and your environment includes postgresql pods, the node affinity settings are not applied to the postgresql pods that are associated with your IBM Software Hub deployment.

The resource specification injection (RSI) webhook cannot patch postgresql pods because the EDB Postgres operator uses a PodDisruptionBudget resource to limit the number of concurrent disruptions to postgresql pods. The PodDisruptionBudget resource prevents postgresql pods from being evicted.

The ibm-nginx deployment does not scale fast enough when automatic scaling is configured

Applies to: 5.4.0

If you configure automatic scaling for IBM Software Hub, the ibm-nginx deployment might not scale fast enough. Some symptoms include:

  • Slow response times
  • High CPU requests are throttled
  • The deployment scales up and down even when the workload is steady

This problem typically occurs when you install watsonx Assistant or watsonx™ Orchestrate.

Resolving the problem
If you encounter the preceding symptoms, you must manually scale the ibm-nginx deployment:
oc patch zenservice lite-cr \
--namespace=${PROJECT_CPD_INST_OPERANDS} \
--type merge \
--patch '{"spec": {
    "Nginx": {
        "name": "ibm-nginx",
        "kind": "Deployment",
        "container": "ibm-nginx-container",
        "replicas": 5,
        "minReplicas": 2,
        "maxReplicas": 11,
        "guaranteedReplicas": 2,
        "metrics": [
            {
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    "target": {
                        "type": "Utilization",
                        "averageUtilization": 529
                    }
                }
            }
        ],
        "resources": {
            "limits": {
                "cpu": "1700m",
                "memory": "2048Mi",
                "ephemeral-storage": "500Mi"
            },
            "requests": {
                "cpu": "225m",
                "memory": "920Mi",
                "ephemeral-storage": "100Mi"
            }
        },
        "containerPolicies": [
            {
                "containerName": "*",
                "minAllowed": {
                    "cpu": "200m",
                    "memory": "256Mi"
                },
                "maxAllowed": {
                    "cpu": "2000m",
                    "memory": "2048Mi"
                },
                "controlledResources": [
                    "cpu",
                    "memory"
                ],
                "controlledValues": "RequestsAndLimits"
            }
        ]
    }
}}'
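
After the ZenService custom resource is reconciled, you can compare the ready and desired replica counts of the deployment. This is a minimal sketch. The function name is hypothetical, and the sketch assumes that the ibm-nginx deployment runs in the project that is named by PROJECT_CPD_INST_OPERANDS.

```shell
# Hypothetical helper: print the ready and desired replica counts for the
# ibm-nginx deployment in the form "ready/desired".
nginx_replica_summary() {
  oc get deployment ibm-nginx \
    -n "${PROJECT_CPD_INST_OPERANDS}" \
    -o jsonpath='{.status.readyReplicas}/{.spec.replicas}'
}
```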

Uninstalling IBM watsonx services does not remove the IBM watsonx experience

Applies to: 5.4.0

After you uninstall watsonx.ai or watsonx.governance™, the IBM watsonx experience is still available in the web client even though there are no services that are specific to the IBM watsonx experience.

Resolving the problem
To remove the IBM watsonx experience from the web client, an instance administrator must run the following command:
oc delete zenextension wx-perspective-configuration \
--namespace=${PROJECT_CPD_INST_OPERANDS}

Backup and restore issues

Issues that apply to several backup and restore methods

Backup issues
Review the following issues before you create a backup. Do the workarounds that apply to your environment.
  1. Backups with IBM Fusion and NetApp Trident Protect fail when IBM StreamSets is installed
  2. Identity resources have file names that are too long
  3. Backup fails due to lingering pvc-sysbench-rwo created by storage-performance health check in Data Virtualization
Restore issues
Review the following issues before you restore a backup. Do the workarounds that apply to your environment.
  1. Python tools are missing in watsonx Orchestrate after restoring to different cluster
  2. Domain agent connections are not displayed in watsonx Orchestrate after restoring to different cluster
  3. Cannot create connections in watsonx Orchestrate after restoring to a different cluster
  4. IBM watsonx Orchestrate fails to deploy agents after restoring data to a different cluster
  5. During a restore, the IBM Master Data Management CR fails with an error stating that a conditional check failed
  6. After a restore, OperandRequest timeout error in the ZenService custom resource
  7. Restore fails and displays postRestoreViaConfigHookRule error in Data Virtualization
  8. Error 404 displays after backup and restore in Data Virtualization
  9. The restore process times out while waiting for the ibmcpd status check to complete
  10. Watson OpenScale fails after restore due to Db2 (db2oltp) or Db2 Warehouse (db2wh) configuration

Backup and restore issues with the OADP utility

Backup issues
Review the following issues before you create a backup. Do the workarounds that apply to your environment.
  1. Backup pre-check fails for Db2 Data Management Console due to timeout for status check
  2. Backup validation fails for Data Virtualization due to missing label in dvendpoint PVC
  3. Offline backup fails with PartiallyFailed error
  4. ObjectBucketClaim is not supported by the OADP utility
  5. Db2 Big SQL backup pre-hook and post-hook fail during offline backup
Restore issues
Review the following issues after you restore a backup. Do the workarounds that apply to your environment.
  1. Restoring Data Virtualization fails with metastore not running or failed to connect to database error
  2. Prompt tuning fails after restoring watsonx.ai
  3. Restic backup that contains dynamically provisioned volumes in Amazon Elastic File System fails during restore

Backup and restore issues with IBM Fusion

Backup issues
Review the following issues before you create a backup. Do the workarounds that apply to your environment.
  1. Backup fails with post-backup hook timeout for IBM Master Data Management
  2. Backup pre-check fails for db2oltp, db2wh, or db2aaservice on upgraded cluster with proxy enabled
  3. Resource validation at the end of a backup fails with OOMKilled status
  4. Db2 backup fails at the Hook: br-service hooks/pre-backup step
  5. Backup service location is unavailable during backup
Restore issues
Do the workarounds that apply to your environment after you restore a backup.
  1. Watson Discovery fails after restore with opensearch and post_restore unverified components
  2. Restore process stuck at db2u-aux-ckpt-cm-child.db2ucluster-ready/readyState
  3. Restore process stuck at zenextensions-patch-ckpt-cm step

Backup and restore issues with NetApp Trident Protect

Backup issues
Review the following issues before you create a backup. Do the workarounds that apply to your environment.
  1. Restore fails for Cognos Analytics while waiting for its CR to change status
  2. Restore for watsonx.data Premium completes but dependent CRs remain in a bad state
  3. OADP backup with the same name as a NetApp Trident Protect backup can be deleted by cpd-trident-protect.py backup delete
  4. Neo4j config maps missing during backup
  5. Backups and restores fail because of missing SCCs
Restore issues
Review the following issues after you restore a backup. Do the workarounds that apply to your environment.
  1. Backups and restores fail because of missing SCCs

Backup and restore issues with Portworx

Backup issues
Review the following issues before you create a backup. Do the workarounds that apply to your environment.
  1. IBM Software Hub resources are not migrated

ObjectBucketClaim is not supported by the OADP utility

Applies to: 5.4.0

Applies to: Backup and restore with the OADP utility

Diagnosing the problem
If an ObjectBucketClaim is created in an IBM Software Hub instance, it is not included when you create a backup.
Cause of the problem
OADP does not support backup and restore of ObjectBucketClaim.
Resolving the problem
Services that provide the option to use ObjectBuckets must ensure that the ObjectBucketClaim is in a separate namespace and backed up separately.

After a restore, OperandRequest timeout error in the ZenService custom resource

Applies to: 5.4.0

Applies to: All backup and restore methods

Diagnosing the problem
Get the ZenService custom resource in YAML format:
oc get zenservice lite-cr -n ${PROJECT_CPD_INST_OPERANDS} -o yaml

In the output, you see the following error:

...
zenMessage: '5.1.3/roles/0010-infra has failed with error: "OperandRequest" "zen-ca-operand-request":
      Timed out waiting on resource'
...
Check for failing operandrequests:
oc get operandrequests -A
For each failing operandrequest, check its conditions for a constraints not satisfiable message:
oc describe operandrequest <opreq-name> -n ${PROJECT_CPD_INST_OPERATORS}
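
On clusters with many operandrequests, it can help to filter the list down to the ones that need attention. The following sketch prints only the operandrequests whose phase is not Running. The function name is hypothetical, and the sketch assumes that each operandrequest reports its state in the .status.phase field.

```shell
# Hypothetical helper: list operandrequests whose phase is not Running.
# Assumes that the resources expose their state in .status.phase.
list_failing_operandrequests() {
  oc get operandrequests -A \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{" "}{.status.phase}{"\n"}{end}' \
    | awk '$3 != "Running" { print $1 "/" $2 " (" $3 ")" }'
}
```

Each output line has the form project/name (phase), which gives you the project and name to pass to the oc describe command.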
Cause of the problem
Subscription wait operations timed out. The problematic subscriptions show an error similar to the following example:
'constraints not satisfiable: clusterserviceversion ibm-db2aaservice-cp4d-operator.v5.2.0
      exists and is not referenced by a subscription, @existing/cpd-operators//ibm-db2aaservice-cp4d-operator.v5.2.0
      and ibm-db2aaservice-cp4d-operator-catalog/cpd-operators/v5.2/ibm-db2aaservice-cp4d-operator.v5.2.0
      originate from package ibm-db2aaservice-cp4d-operator, subscription ibm-db2aaservice-cp4d-operator
      requires ibm-db2aaservice-cp4d-operator-catalog/cpd-operators/v5.2/ibm-db2aaservice-cp4d-operator.v5.2.0,
      subscription ibm-db2aaservice-cp4d-operator exists'

This problem is a known issue with Red Hat Operator Lifecycle Manager. For details, see Failed to install or upgrade operator with warning 'clusterserviceversion is not referenced by a subscription'.

Resolving the problem
Do the following steps:
  1. Delete the problematic clusterserviceversions and subscriptions, and restart the Operand Deployment Lifecycle Manager (ODLM) pod.

    For details, follow the steps in the troubleshooting document cloud-native-postgresql operator is installed with the certified-operators catalogsource.

  2. Delete IBM Software Hub instance projects (namespaces).

    For details, see Cleaning up the cluster before a restore.

  3. Retry the restore.

Python tools are missing in watsonx Orchestrate after restoring to different cluster

Applies to: 5.4.0

Applies to: Online and offline restore to a different cluster

Diagnosing the problem

After restoring data from a source cluster to a target cluster, Python tools that were imported in the source cluster are not displayed in the restored watsonx Orchestrate environment.

When you click Manage > Agents from the watsonx Orchestrate navigation menu, the Python tools that you imported in the source cluster do not appear in the agents list. The Manage agents screen might also display an error when you try to access it.

Cause of the problem

This issue occurs because the certificates from the source cluster are included in the backup, and they are restored in the target cluster. These certificates are specific to the source cluster and are not valid in the target cluster environment. When watsonx Orchestrate components attempt to use these certificates, the Python tools fail to load and display properly.

Resolving the problem

The certificates need to be regenerated in the target cluster to match the target cluster's environment.

After restoring data to the target cluster, manually regenerate the certificates in the IBM Software Hub instance namespace.

  1. Delete the existing certificates:
    oc delete certificate `oc get certificates | grep icert | awk '{print $1}'`
  2. Restart the watsonx Orchestrate deployments to trigger the regeneration of the certificates:
    oc rollout restart deployments `oc get deployments --no-headers -l icpdsupport/module=components-services-orchestrate | awk '{print $1}'`
  3. Wait for the deployments to restart and the new certificates to be generated.
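
Step 3 can be sketched as a loop that waits for each deployment to finish rolling out and then lists the regenerated certificates. The function name and the 10 minute timeout are hypothetical. The sketch uses the same label selector and the same icert filter as the preceding commands and, like them, runs against the current project.

```shell
# Hypothetical helper: wait for every watsonx Orchestrate deployment to finish
# rolling out, then list the icert certificates so that you can review them.
wait_for_orchestrate_rollout() {
  local selector="icpdsupport/module=components-services-orchestrate"
  local name
  for name in $(oc get deployments -l "${selector}" -o name); do
    oc rollout status "${name}" --timeout=10m || return 1
  done
  oc get certificates | grep icert
}
```

The function stops at the first deployment that does not roll out within the timeout, so that you can investigate that deployment before you continue.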

Domain agent connections are not displayed in watsonx Orchestrate after restoring to different cluster

Applies to: 5.4.0

Applies to: Online and offline restore to a different cluster

Diagnosing the problem

After restoring data from a source cluster to a target cluster, domain agent connections that existed in the source cluster are not displayed in the watsonx Orchestrate environment in the target cluster.

When you click Connections from the watsonx Orchestrate navigation menu, the domain agent connections from the source cluster do not appear in the connections list. The Connections window might also display an error when you try to access it.

Cause of the problem

This issue occurs because the certificates from the source cluster are included in the backup, and they are restored in the target cluster. These certificates are specific to the source cluster and are not valid in the target cluster environment. When watsonx Orchestrate components attempt to use these certificates, the domain agent connections fail to load and display properly.

Resolving the problem

The certificates need to be regenerated in the target cluster to match the target cluster's environment.

After restoring data to the target cluster, manually regenerate the certificates in the IBM Software Hub instance namespace.

  1. Delete the existing certificates:
    oc delete certificate `oc get certificates | grep icert | awk '{print $1}'`
  2. Restart the watsonx Orchestrate deployments to trigger the regeneration of the certificates:
    oc rollout restart deployments `oc get deployments --no-headers -l icpdsupport/module=components-services-orchestrate | awk '{print $1}'`
  3. Wait for the deployments to restart and the new certificates to be generated.

Cannot create connections in watsonx Orchestrate after restoring to a different cluster

Applies to: 5.4.0

Applies to: Online and offline restore to a different cluster

Diagnosing the problem

After restoring data from a source cluster to a target cluster, you cannot create new connections in the restored watsonx Orchestrate environment.

When you click Save and continue to create a connection in the "Add new connection" dialog, you see a Connection failed error. The Connections window might also display an error when you try to access it, and API calls to the connections service may fail with HTTP 500 errors.

Cause of the problem

This issue occurs because the certificates from the source cluster are included in the backup, and they are restored in the target cluster. These certificates are specific to the source cluster and are not valid in the target cluster environment. When watsonx Orchestrate components attempt to use these certificates, the connections service fails to process requests properly.

The issue can also cause errors and failures for JWT token validation and certificate verification during API calls between watsonx Orchestrate components.

Resolving the problem

The certificates need to be regenerated in the target cluster to match the target cluster's environment.

After restoring data to the target cluster, manually regenerate the certificates in the IBM Software Hub instance namespace.

  1. Delete the existing certificates:
    oc delete certificate `oc get certificates | grep icert | awk '{print $1}'`
  2. Restart the watsonx Orchestrate deployments to trigger the regeneration of the certificates:
    oc rollout restart deployments `oc get deployments --no-headers -l icpdsupport/module=components-services-orchestrate | awk '{print $1}'`
  3. Wait for the deployments to restart and the new certificates to be generated.

Backups with IBM Fusion and NetApp Trident Protect fail when IBM StreamSets is installed

Applies to: 5.4.0

Applies to: Backup and restore with IBM Fusion and NetApp Trident Protect

Diagnosing the problem
Backups might fail for deployments that have IBM StreamSets installed. The backups often fail in the cpd-registry-check phase.
Cause of the problem

IBM StreamSets cannot currently be backed up with IBM Fusion or NetApp Trident Protect, so it must be excluded from these backups. However, IBM StreamSets is not excluded, which causes the backup to fail.

Resolving the problem
At this time, there is no confirmed workaround for this issue.

Identity resources have file names that are too long

Applies to: 5.4.0

Applies to: All backup and restore methods

Diagnosing the problem

During the backup validation phase, the backup fails, and you see an error that says the file name is too long, for example:

time=2025-10-07T14:53:35.524157Z level=debug msg=exit backupValidationService.validateBackup
Error: failed to extract downloaded backup file '/tmp/backup-resources-f762cb7f-fea7-4d3a-9e13-c793147f7012-20251007145334-data.tar.gz': 
open /tmp/backup-resources-f762cb7f-fea7-4d3a-9e13-c793147f7012-20251007145334-data-20251007145334/resources/identities.user.openshift.io/cluster/xxx:yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy.json: 
file name too long
Cause of the problem

The error occurs when the backup validation process extracts the backup archive. One or more OpenShift identity resources (identities.user.openshift.io in the preceding error) have names that exceed the file system's 255-character file name limit. This typically occurs with LDAP-generated identities that include the full distinguished name (DN) in the resource name. When the process attempts to create files for these resources, the operation fails because the file name is too long.

Resolving the problem

Manually mark the problematic identity resources so that they are excluded from the backup:

  1. Use the following command to exclude all identity resources before you take a backup:
    oc get identities.user.openshift.io --no-headers | awk '{print $1}' | xargs -I {} sh -c "oc label identities.user.openshift.io {} velero.io/exclude-from-backup=true"
    Note: You can use the same command even if you already tried to take a backup, and the backup failed.
  2. Retry the backup.
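
To see which identities trigger the error, you can compare the length of each resource name against the limit. The following is a minimal sketch. The function names are hypothetical, and the sketch assumes that the backup writes each resource to a file named after the resource with a .json suffix, as in the preceding error.

```shell
# Hypothetical helper: read resource names on stdin and print those whose
# backup file name ("<name>.json") would exceed 255 characters.
find_long_identity_names() {
  awk '{ if (length($0) + length(".json") > 255) print $0 }'
}

# Hypothetical wrapper: apply the check to the identities on the cluster.
list_long_identities() {
  oc get identities.user.openshift.io --no-headers \
    -o custom-columns=NAME:.metadata.name | find_long_identity_names
}
```

The check counts characters as awk reports them, so names that contain multibyte characters might reach the file system limit sooner than this sketch indicates.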

IBM watsonx Orchestrate fails to deploy agents after restoring data to a different cluster

Applies to: 5.4.0

Applies to: All backup and restore methods

Diagnosing the problem

After restoring data for IBM watsonx Orchestrate to a different cluster, you create an agent with Python tools or Domain agent tools. When you try to deploy the agent, the deployment process fails, and you see the following error message:

{
  "detail": "Unexpected error during tool deployment: 500: Tool deployment failed in TRM:
  {\"error\":\"storage init: failed to create bucket \\\"wo-server-storage-bucket-cpd-instance-1\\\":
  operation error S3: CreateBucket, https response error StatusCode: 403, RequestID: mig6ggytexrd1f-1cji, HostID: mig6ggyt-exrd1f-1cji, api error InvalidAccessKeyId: The AWS access key Id you
  provided does not exist in our records.\"} "
}
Cause of the problem

After you complete the restoration, the IBM watsonx Orchestrate Tools Runtime Manager (TRM) component still retains references to the AWS credentials from the source cluster. When deploying the agents, the TRM tries to create an S3 storage bucket for the agents by using the old AWS credentials. Since these credentials do not match the storage credentials for the target cluster, the S3 storage bucket fails to be created with an InvalidAccessKeyId error.

Resolving the problem
At this time, there is no confirmed workaround for this issue.

During a restore, the IBM Master Data Management CR fails with an error stating that a conditional check failed

Applies to: 5.4.0

Applies to: All backup and restore methods

Diagnosing the problem
A restore of the IBM Master Data Management service fails with an error similar to the following example:
The conditional check 'all_services_available and ( history_enabled | bool  or parity_enabled | bool )' failed. 
The error was: error while evaluating conditional (all_services_available and ( history_enabled | bool  or parity_enabled | bool )): 'history_enabled' is undefined

The error appears to be in '/opt/ansible/roles/4.10.23/mdm_cp4d/tasks/check_services.yml': line 202, column 5, 
but may be elsewhere in the file depending on the exact syntax problem.
Resolving the problem
No action is required. The error will be resolved automatically during subsequent operator reconciliation runs.

Watson OpenScale fails after restore due to Db2 (db2oltp) or Db2 Warehouse (db2wh) configuration

Applies to: 5.4.0

Applies to: All backup and restore methods

Diagnosing the problem
After you restore, Watson OpenScale fails due to memory constraints. You might see Db2 (db2oltp) or Db2 Warehouse (db2wh) instances that return 404 errors and pod failures where scikit pods are unable to connect to Apache Kafka.
Cause of the problem
The root cause is typically insufficient memory or temporary table page size settings, which are critical for query execution and service stability.
Resolving the problem
Ensure that the Db2 (db2oltp) or Db2 Warehouse (db2wh) instance is configured with adequate memory resources, specifically:
  • Set the temporary table page size to at least 700 GB during instance setup or reconfiguration.
  • Monitor pod health and Apache Kafka connectivity to verify that dependent services recover after memory allocation is corrected.

Backup fails with post-backup hook timeout for IBM Master Data Management

Applies to: 5.4.0

Applies to: Online backup with IBM Fusion

Diagnosing the problem

This issue occurs when IBM Master Data Management is installed on your deployment. After upgrading IBM Software Hub, the first backup that you take fails during the post-backup phase with the following error:

[ERROR]: Unexpected failure in hook: 'Recipe executed. Fail Count=1, rollback=True, last failed command: "ExecHook/br-service-hooks/post-backup"  
Hook run in ubr-operator:cpdbr-tenant-service-78c-c28xk ended with internal rc 5 indicating hook reached timeout prior to completion.  
Extracted error message: Timeout reached before command completed.'
Cause of the problem

The backup fails because the IBM Master Data Management custom resource takes longer to reconcile than allowed in the IBM Fusion recipe for post-backup hooks. The default amount of time allowed in the IBM Fusion recipe is 1800 seconds (30 minutes). The backup times out and fails before IBM Master Data Management can complete its reconciliation.

Resolving the problem

You can manually increase the timeout for the IBM Fusion recipe that is used for post-backup hooks.

  1. Run the following command to increase the timeout to 7200 seconds (2 hours):
    oc get frcpe ibmcpd-tenant -n "${PROJECT_CPD_INST_OPERATORS}" -o json | jq --arg timeout "7200" '(.spec.hooks[] | select(.name=="br-service-hooks") | .ops[] | select(.name=="checkpoint-inverseop" or .name=="post-backup") | .timeout) |= ($timeout | tonumber)' | oc apply -f -
  2. Take another backup.
Important: If the IBM Fusion recipe is regenerated for any reason, the timeout value can revert back to the default. You might need to change the timeout value again before taking another backup.
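
Before you take a backup, you can check whether the timeout is still set to the value that you expect. The following sketch reuses the jq selectors from the command in step 1 to print the current values. The function name is hypothetical, and the sketch requires jq.

```shell
# Hypothetical helper: read the recipe JSON on stdin and print the timeout of
# the two operations that the command in step 1 modifies.
show_hook_timeouts() {
  jq -r '.spec.hooks[] | select(.name=="br-service-hooks") | .ops[]
         | select(.name=="checkpoint-inverseop" or .name=="post-backup")
         | "\(.name)=\(.timeout)"'
}
```

To use the helper, pipe the recipe into it, for example: oc get frcpe ibmcpd-tenant -n "${PROJECT_CPD_INST_OPERATORS}" -o json | show_hook_timeouts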

Backup pre-check fails for db2oltp, db2wh, or db2aaservice on upgraded cluster with proxy enabled

Applies to: 5.4.0

Applies to: Online and offline backups

Diagnosing the problem

After upgrading a cluster from IBM Software Hub 5.2.1 to 5.4.1, backups fail during the pre-check hook phase. You also see errors for some or all of the following Db2 ConfigMaps:

  • db2oltp-aux-ckpt-cm
  • db2oltp-aux-br-cm
  • db2wh-aux-ckpt-cm
  • db2wh-aux-br-cm
  • db2aaservice-aux-ckpt-cm
  • db2aaservice-aux-br-cm

When you check the status of the affected services (db2oltpservices, db2whservices, and db2aaserviceservices), you see that the CR status frequently flips between Completed and InProgress statuses:

oc get db2oltpservices -n <namespace> -oyaml | grep Status
    db2oltpStatus: InProgress
# A few seconds later...
    db2oltpStatus: Completed
# A few seconds later...
    db2oltpStatus: InProgress
    progress: 10%
    progressMessage: Ibmcpd dependency satisfied
Cause of the problem

When proxy configuration resources are enabled on clusters by using the cpd-cli manage create-proxy-config command, the command creates a resource specification injection (RSI) patch. The RSI patch injects proxy environment variables into pods. A cron job, zen-rsi-evictor-cron-job, then runs every 30 minutes to check all eligible pods, and it evicts unpatched pods that need to have the RSI patch applied. It skips pods that have already been patched.

During the checking process, the cron job updates the pod owner's annotation to indicate the patching status. The Db2 operator watches these deployment changes and triggers a reconciliation of the Db2 custom resource whenever the annotations are updated. This causes the CR status to temporarily change from Completed to InProgress during the reconciliation. When a backup pre-check runs during this reconciliation window, the pre-check fails for the db2oltp-aux-ckpt-cm, db2wh-aux-ckpt-cm, or db2aaservice-aux-ckpt-cm ConfigMap because the pod state is not in the expected Completed status while the backup is trying to capture a consistent snapshot.

Resolving the problem

If the backup fails, you can try taking a backup after the reconciliation is finished.

If you don't want zen-rsi-evictor-cron-job running while you take a backup, you can suspend this cronjob during backup and resume it after the backup finishes. For scheduled backups, consider adjusting the backup schedule to avoid the RSI patch cycle, or implement automation to suspend and resume the cron job around backup windows.
  1. Use the following command to suspend zen-rsi-evictor-cron-job before you take a backup:
    oc patch cronjob zen-rsi-evictor-cron-job -n <cpd-instance-namespace> -p '{"spec":{"suspend":true}}'
  2. Take a backup.
  3. Use the following command to resume zen-rsi-evictor-cron-job:
    oc patch cronjob zen-rsi-evictor-cron-job -n <cpd-instance-namespace> -p '{"spec":{"suspend":false}}'
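
The suspend and resume steps above can be wrapped in a small helper so that the cron job is resumed even when the backup command fails. This is a minimal sketch, not part of the product: the function name run_with_evictor_suspended is hypothetical, and the backup command that you pass to it is whatever you normally use to take a backup.

```shell
# Run a command while zen-rsi-evictor-cron-job is suspended, then resume it.
# Hypothetical helper; the oc patch calls are the ones from the steps above.
# Usage: run_with_evictor_suspended <cpd-instance-namespace> <command> [args...]
run_with_evictor_suspended() (
  ns="$1"
  shift
  # Suspend the cron job; stop here if the patch cannot be applied.
  oc patch cronjob zen-rsi-evictor-cron-job -n "$ns" -p '{"spec":{"suspend":true}}' || exit 1
  # Run the caller's command and remember its exit status.
  rc=0
  "$@" || rc=$?
  # Resume the cron job whether or not the command succeeded.
  oc patch cronjob zen-rsi-evictor-cron-job -n "$ns" -p '{"spec":{"suspend":false}}'
  exit "$rc"
)
```

The helper returns the exit status of the wrapped command, so a scheduler can still detect a failed backup.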

Resource validation at the end of a backup fails with OOMKilled status

Applies to: 5.4.0

Applies to: Backup and restore with IBM Fusion

Diagnosing the problem

When you perform a backup with IBM Fusion, the backup fails during the resource validation phase at the end of the backup process. The IBM Fusion transaction manager logs indicate the following errors:

Recipe failed
BMYBR0009
There was an error when processing the job in the Transaction Manager service. 
The underlying error was: 'Recipe executed. Fail Count=1, rollback=True, last failed command: "ExecHook/br-service-hooks/resource-validation"   Extracted error message: Success.Hook run in cpd-operators:cpdbr-tenant-service-5cb9-tsj ended with internal rc 1 indicating hook execution ended in failure. Extracted error message: Running command '/cpdbr-scripts/cpdbr-oadp backup validate --tenant-operator-namespace=cpd-operators --namespace=ibm-backup-restore --spec-version=2.0.0' failed with exception: (0)\nReason: Handshake status 500 Internal Server Error -+-+- {'content-length': '28', 'content-type': 'text/plain; charset=utf-8', 'date': 'Fri, 13 Feb 2026 09:20:36 GMT'} -+-+- b'container not found ("main")'\n.'.

When you investigate, you see that the resource validation caused the cpdbr-tenant-service pod to restart with an OOMKilled (exit code 137) status.

Cause of the problem

The resource validation process requires more memory than the default memory limit that is allocated in the cpdbr-tenant-service deployment. When the memory limit is insufficient, it causes the pod to be killed by the Kubernetes OOM (Out of Memory) killer.

This issue occurs more frequently in environments with large numbers of resources being backed up, where the validation process needs to process and validate all backed-up resources.

Resolving the problem

Increase the memory limit for the cpdbr-tenant-service deployment before performing a backup with IBM Fusion.

  1. Check the memory limit of the cpdbr-tenant-service deployment.
    oc get deployment \
    -n ${PROJECT_CPD_INST_OPERATORS} cpdbr-tenant-service \
    -o jsonpath='{.spec.template.spec.containers[0].resources.limits.memory}'
  2. Run the following command to increase the memory limit for cpdbr-tenant-service from 1Gi to 4Gi:
    
    oc patch deployment cpdbr-tenant-service \
    -n ${PROJECT_CPD_INST_OPERATORS} \
    --type='json' \
    -p='[{ "op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "4Gi" }]'

    The deployment will automatically restart the pod with the new memory limit.

  3. Wait for the pod to be ready, and then try to take another backup.

    You can check if the pod is ready by using the following command:

    oc get pods -n ${PROJECT_CPD_INST_OPERATORS} | grep cpdbr-tenant-service
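
As an alternative to checking the pod list by hand, oc rollout status can block until the patched deployment finishes rolling out. The following is a minimal sketch: the function name wait_for_cpdbr_rollout and the 5m default timeout are illustrative assumptions, not documented values.

```shell
# Block until the cpdbr-tenant-service rollout finishes or the timeout expires.
# Hypothetical helper; 5m is an illustrative default, not a documented value.
# Usage: wait_for_cpdbr_rollout <operators-namespace> [timeout]
wait_for_cpdbr_rollout() (
  ns="$1"
  timeout="${2:-5m}"
  oc rollout status deployment/cpdbr-tenant-service -n "$ns" --timeout="$timeout"
)
```

For example, run wait_for_cpdbr_rollout "${PROJECT_CPD_INST_OPERATORS}" after the patch command, and take the backup only if it returns successfully.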

Db2 backup fails at the Hook: br-service hooks/pre-backup step

Applies to: 5.4.0

Applies to: Backup and restore with IBM Fusion

Diagnosing the problem
In the cpdbr-oadp.log file, you see messages like in the following example:
time=<timestamp> level=info msg=podName: c-db2oltp-5179995-db2u-0, podIdx: 0, container: db2u, actionIdx: 0, commandString: ksh -lc 'manage_snapshots --action suspend --retry 3', command: [sh -c ksh -lc 'manage_snapshots --action suspend --retry 3'], onError: Fail, singlePodOnly: false, timeout: 20m0s func=cpdbr-oadp/pkg/quiesce.executeCommand file=/go/src/cpdbr-oadp/pkg/quiesce/ruleexecutor.go:767
time=<timestamp> level=info msg=cmd stdout:  func=cpdbr-oadp/pkg/quiesce.executeCommand file=/go/src/cpdbr-oadp/pkg/quiesce/ruleexecutor.go:823
time=<timestamp> level=info msg=cmd stderr: [<timestamp>] - INFO: Setting wolverine to disable
Traceback (most recent call last):
  File "/usr/local/bin/snapshots", line 33, in <module>
    sys.exit(load_entry_point('db2u-containers==1.0.0.dev1', 'console_scripts', 'snapshots')())
  File "/usr/local/lib/python3.9/site-packages/cli/snapshots.py", line 35, in main
    snap.suspend_writes(parsed_args.retry)
  File "/usr/local/lib/python3.9/site-packages/snapshots/snapshots.py", line 86, in suspend_writes
    self._wolverine.toggle_state(enable=False, message="Suspend writes")
  File "/usr/local/lib/python3.9/site-packages/utils/wolverine/wolverine.py", line 73, in toggle_state
    self._toggle_state(state, message)
  File "/usr/local/lib/python3.9/site-packages/utils/wolverine/wolverine.py", line 77, in _toggle_state
    self._cmdr.execute(f'wvcli system {state} -m "{message}"')
  File "/usr/local/lib/python3.9/site-packages/utils/command_runner/command.py", line 122, in execute
    raise CommandException(err)
utils.command_runner.command.CommandException: Command failed to run:ERROR:root:HTTPSConnectionPool(host='localhost', port=9443): Read timed out. (read timeout=15)
Cause of the problem
The Wolverine high availability monitoring process was in a RECOVERING state before the backup was taken.

Check the Wolverine status by running the following command:

wvcli system status
Example output:
ERROR:root:REST server timeout: https://localhost:9443/status
ERROR:root:Retrying Request: https://localhost:9443/status
ERROR:root:REST server timeout: https://localhost:9443/status
ERROR:root:Retrying Request: https://localhost:9443/status
ERROR:root:REST server timeout: https://localhost:9443/status
ERROR:root:Retrying Request: https://localhost:9443/status
ERROR:root:REST server timeout: https://localhost:9443/status
HA Management is RECOVERING at <timestamp>.
The Wolverine log file /mnt/blumeta0/wolverine/logs/ha.log shows errors like in the following example:
<timestamp> [ERROR] <MainProcess:11490> [wolverine.ha.loop(loop.py:414)] -  check_and_recover: unhealthy_dm_set = {('c-db2oltp-5179995-db2u-0', 'node')}
<timestamp> [ERROR] <MainProcess:11490> [wolverine.ha.loop(loop.py:416)] - (c-db2oltp-5179995-db2u-0, node) : not OK
<timestamp> [ERROR] <MainProcess:11490> [wolverine.ha.loop(loop.py:421)] -  check_and_recover: unhealthy_dm_names = {'node'}
Resolving the problem
Do the following steps:
  1. Re-initialize Wolverine:
    wvcli system init --force
  2. Wait until the Wolverine status is RUNNING. Check the status by running the following command:
    wvcli system status
  3. Retry the backup.
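
The wait in step 2 can be scripted as a polling loop. This sketch assumes that it runs in the same shell where you run the wvcli commands, and that a healthy status line reads "HA Management is RUNNING", by analogy with the RECOVERING line in the example output. The function name and the default attempt count and delay are hypothetical.

```shell
# Poll `wvcli system status` until it reports RUNNING.
# Hypothetical helper; the RUNNING wording is an assumption based on the
# "HA Management is RECOVERING at <timestamp>." line in the example output.
# Usage: wait_for_wolverine_running [max_attempts] [sleep_seconds]
wait_for_wolverine_running() (
  attempts="${1:-30}"
  delay="${2:-10}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    status=$(wvcli system status 2>/dev/null || true)
    case "$status" in
      *"HA Management is RUNNING"*)
        echo "Wolverine is RUNNING"
        exit 0
        ;;
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  echo "Wolverine did not reach RUNNING after $attempts checks" >&2
  exit 1
)
```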

Backup service location is unavailable during backup

Applies to: 5.4.0

Applies to: Backup with IBM Fusion

If your cluster is running Red Hat OpenShift Container Platform 4.19 with IBM Fusion 2.11 and OADP 1.5.1, then the backup service location might enter Unavailable state. To resolve this issue, upgrade OADP to version 1.5.2.

Watson Discovery fails after restore with opensearch and post_restore unverified components

Applies to: 5.4.0

Applies to: Backup and restore with IBM Fusion

Diagnosing the problem
After you restore, Watson Discovery becomes stuck with the following components listed as unverifiedComponents:
unverifiedComponents:
- opensearch
- post_restore
Additionally, the OpenSearch client pod might show an unknown container status, similar to the following example:
NAME                                 READY   STATUS                    RESTARTS   AGE
wd-discovery-opensearch-client-000   0/1     ContainerStatusUnknown    0          11h
Cause of the problem
The post_restore component depends on the opensearch component being verified. However, the OpenSearch client pod is not running, which prevents verification and causes the restore process to stall.
Resolving the problem
Manually delete the OpenSearch client pod to allow it to restart:
oc delete -n ${PROJECT_CPD_INST_OPERANDS} pod wd-discovery-opensearch-client-000

After the pod is restarted and verified, the post_restore component should complete the verification process.

Restore process stuck at db2u-aux-ckpt-cm-child.db2ucluster-ready/readyState

Applies to: 5.4.0

Applies to: Restore with IBM Fusion

Diagnosing the problem

When you use IBM Fusion to restore to the same cluster, the restore process fails when it times out while trying to verify that the db2ucluster or db2uinstance resources are in Ready status. You might receive something similar to the following error message:

There was an error when processing the job in the Transaction Manager service. 
The underlying error was: 'Execution of workflow restore of recipe ibmcpd-tenant completed. 
Number of failed commands: 1, last failed command: 
"CheckHook/db2u-aux-ckpt-cm-child.db2ucluster-ready/readyState"'.
Cause of the problem

IBM Fusion fails to restore because of a combination of factors:

  • The Db2 Velero plug-in is not present in the IBM Fusion DPA.
  • The IBM Fusion check hook requires both db2uinstance and db2ucluster resources to be present, and it cannot skip checks when resources don't exist in the cluster.
Workaround
You have two options to resolve the issue:
  • Update IBM Fusion from version 2.10 to either version 2.10 with hotfixes or version 2.10.1, and then restart the backup and restore.

  • Reconfigure DataProtectionApplication (DPA) and start a fresh backup and restore.
    1. Configure DPA from the source cluster by following the process for creating a DPA custom resource in Installing and configuring software on the source cluster for backup and restore with IBM Fusion.
    2. Use IBM Fusion to create another backup.
    3. Start another restore from the new backup.

Restore process stuck at zenextensions-patch-ckpt-cm step

Applies to: 5.4.0

Applies to: Backup and restore with IBM Fusion 2.10 (without hotfixes)

Diagnosing the problem
IBM Fusion fails to restore because the process gets stuck at the zenextensions-patch-ckpt-cm step. The restore fails at the following stage:
ExecHook/zenextensions-patch-ckpt-cm-zen-1-child.zenextensions-hooks/force-reconcile-zenextensions
Cause of the problem

The IBM Fusion exec hook cannot handle trailing whitespace, so the restore process gets stuck waiting for the zenextensions hooks to complete.

Workarounds
You have two options to resolve the issue:
  • Update IBM Fusion from version 2.10 to either version 2.10 with hotfixes or version 2.10.1, and then restart the backup and restore.

  • Use the fusion-resume-restore script.

You can use the fusion-resume-restore.sh script to continue the restore process from the point where it failed.

  1. Check that you have the prerequisites to run the script:
  2. Download the fusion-resume-restore script from https://github.com/IBM/cpd-cli/blob/master/cpdops/files/fusion-resume-restore.sh.
  3. Ensure you have write access to /tmp/fusion-resume-restore.
  4. Run the script from hub:
    ./fusion-resume-restore.sh <fusion-restore-name> ${PROJECT_CPD_INST_OPERATORS}

    The script displays the steps required to restore and their sequence numbers.

  5. When the script asks you to Enter the key of the workflow hook/group to resume from, select the index number that excludes the zenextensions-patch-ckpt-cm step.

    For example:

    Enter the key of the workflow hook/group to resume from (0-101): 97
    Selected workflow (index=97): "hook: zenextensions-patch-ckpt-cm-zen-1-child.zenextensions-hooks/force-reconcile-zenextensions"
  6. When the script asks Resume the Fusion restore now?, choose a response based on whether you are restoring to the same or a different cluster.
    • If restoring to the same cluster, reply y.
    • If restoring to a different cluster, reply n.

      The script provides instructions to manually apply the CRs to resume the restore.

OADP 1.5.x Certificate Issue with Fusion BSL

Diagnosing the problem
If your cluster is running OpenShift Container Platform 4.19 with IBM Fusion 2.11 and OADP 1.5.1, then the backup service location might enter an unavailable phase. To resolve this issue, upgrade OADP to version 1.5.2.

Restore fails for Cognos Analytics while waiting for its CR to change status

Applies to: 5.4.0

Applies to: Online restore with NetApp Trident Protect

Diagnosing the problem

When you attempt to perform an online restore using NetApp Trident Protect on a cluster with Cognos Analytics installed, the restore fails during the post-restore phase with a timeout error. The restore process times out after several hours while waiting for the Cognos Analytics custom resource (CR) to reach Completed status.

The restore log shows errors similar to the following:

time=2026-04-08T21:31:08.395262Z level=error msg=error: error jsonpath FindResults(): caStatus is not found
time=2026-04-08T21:31:08.395306Z level=debug msg=error evaluating condition (condition={$.status.caStatus} == {"Completed"}, namespace=ubr, gvr=ca.cpd.ibm.com/v1, 
Resource=caservices, name=ca-addon-cr): error jsonpath FindResults(): caStatus is not found
...
time=2026-04-08T21:31:13.437750Z level=debug msg=result of expression: `"InProgress" == "Completed"` is false

When you check the Cognos Analytics CR status, it shows InProgress and does not progress to Completed:

oc get caserviceinstance -n <namespace>
NAME                    PROGRESS   PROGRESSMESSAGE                                                        STATUS       AGE
ca1760007983547817-cr   35%        Completed with running artifacts cpd role. Beginning to run cm role.   InProgress   2d12h
Cause of the problem

The Cognos Analytics custom resource takes longer than expected to reconcile and reach Completed status after a restore operation. The post-restore validation times out before the Cognos Analytics CR finishes reconciling and updates its status. The restore fails even though the underlying restore operation completed successfully.

Resolving the problem

After the Cognos Analytics service reaches Completed status, you can manually resume the restore process to complete the remaining post-restore steps.

  1. Wait for the Cognos Analytics custom resource to reach Completed status and for PROGRESS to show 100%. You can use the following command to check the status:
    oc get caserviceinstance -n <namespace>
  2. Resume the restore process from the restore-post-operands phase. Use the following command, changing cpd-operators in the command as needed:
    PROJECT_CPD_INST_OPERATORS=cpd-operators
    TENANT_BACKUP_NAME=$(oc get cm -n ${PROJECT_CPD_INST_OPERATORS} cpd-operators -o json | jq -r ".data.vendorBackup")
    export CPDBR_ENABLE_FEATURES=experimental
    cpd-cli oadp tenant-restore create --from-tenant-backup=${TENANT_BACKUP_NAME} --log-level=debug --verbose --start-from=restore-post-operands
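
The wait in step 1 can be automated by reading the STATUS column of the table that oc get caserviceinstance prints. This is a sketch under stated assumptions: the function name ca_instances_completed is hypothetical, and the check relies on the column layout shown in the example output (STATUS second from the end). It checks only STATUS, so confirm that PROGRESS shows 100% separately.

```shell
# Return 0 only if every Cognos Analytics service instance reports Completed.
# Hypothetical helper; assumes the column layout from the example output.
# Usage: ca_instances_completed <namespace>
ca_instances_completed() (
  ns="$1"
  # STATUS is the second-to-last column. PROGRESSMESSAGE can contain spaces,
  # so count columns from the end of each row instead of from the start.
  oc get caserviceinstance -n "$ns" --no-headers 2>/dev/null |
    awk 'NF >= 2 { rows++; if ($(NF-1) != "Completed") bad++ }
         END { exit (rows > 0 && bad == 0) ? 0 : 1 }'
)
```

A simple loop such as until ca_instances_completed <namespace>; do sleep 60; done then holds the resume command until the CR is ready.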

Restore for watsonx.data Premium completes but dependent CRs remain in a bad state

Applies to: 5.4.0

Applies to: Online restore to different cluster with NetApp Trident Protect

Diagnosing the problem

You perform an online restore by using NetApp Trident Protect on a cluster with watsonx.data™ Premium installed. The restore operation seems to be successful, but the watsonx.data Premium custom resource (CR) and its dependent CRs remain in an InProgress state and have low completion percentages, for example:

watsonx_data              WxdAddon                 wxdaddon                     InProgress  15%
watsonx_data_premium      WxdAddonPremium          wxdaddon-premium             InProgress  56%
wml                       WmlBase                  wml-cr                       InProgress  0%
watsonx_ai                Watsonxai                watsonxai-cr                 InProgress  27%

Additionally, multiple pods fail to start with ErrImagePull and ImagePullBackOff errors.

Cause of the problem

watsonx.data Premium creates dependent custom resources during its reconciliation process. When a restore operation is performed, these dependent CRs are restored from the backup with their original metadata, including fields like uid, resourceVersion, managedFields, generation, and creationTimestamp.

However, the watsonx.data Premium operator already created new instances of these dependent CRs on the target cluster during the restore process. This creates conflicts between the old CRs and the newly created CRs when their metadata and specifications differ.

Resolving the problem

You can manually patch the dependent CRs after the restore completes. You need to remove conflicting metadata fields from the backed-up CR files and reapply them to the cluster.

  1. After the restore operation completes, identify the dependent CRs that are in a bad state:
    cpd-cli manage get-cr-status --cpd_instance_ns=<namespace>
  2. For each CR, locate the backed-up CR file in the restore data directory, and copy the CR file to a new location for editing.
  3. Edit the copied file and remove the following fields from the metadata section:
    • uid
    • resourceVersion
    • managedFields
    • generation
    • creationTimestamp
  4. Remove the entire status section from the CR file.
  5. Apply the modified CR file:
    oc apply --server-side --force-conflicts -f $crLoc
  6. Verify that the CR begins reconciling.
  7. Repeat steps 2-6 for any other dependent CRs that need to be patched (such as watsonxai-cr and analyticsengine-sample).
  8. Monitor the CR status until all CRs reach Completed state:
    cpd-cli manage get-cr-status --cpd_instance_ns=<namespace>
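
The manual edits in steps 3 and 4 can be sketched with jq, which this page already uses elsewhere. The sketch assumes that the copied CR file is JSON (convert a YAML file first). The function name strip_cr_for_reapply is hypothetical.

```shell
# Print a copy of a CR (JSON on stdin) without the conflicting metadata
# fields and without the status section.
# Hypothetical helper; assumes the CR file is JSON, not YAML.
# Usage: strip_cr_for_reapply < cr.json > cr-clean.json
strip_cr_for_reapply() {
  jq 'del(.metadata.uid,
          .metadata.resourceVersion,
          .metadata.managedFields,
          .metadata.generation,
          .metadata.creationTimestamp,
          .status)'
}
```

You can then pass the cleaned file to the oc apply command from step 5. Review the output before applying it, because the sketch does not validate the CR.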

OADP backup with the same name as a NetApp Trident Protect backup can be deleted by cpd-trident-protect.py backup delete

Applies to: 5.4.0

Applies to: Backup with NetApp Trident Protect

Diagnosing the problem

A NetApp Trident Protect backup can fail if it has the same name as an existing OADP backup.

A NetApp Trident Protect backup uses the same name for the NetApp Trident Protect CR and the OADP sub-backup. If you take OADP backups independently of taking backups with NetApp Trident Protect, the backup process can fail because an OADP backup with the same name already exists.

Additionally, attempting to delete the failed NetApp Trident Protect backup will delete both the NetApp Trident Protect CR and the existing OADP backup.

Cause of the problem

This issue can occur if you use one of the cpd-cli oadp commands to take OADP backups independently of taking backups with NetApp Trident Protect by using cpd-trident-protect.py.

NetApp Trident Protect backups do not currently verify ownership of their related OADP sub-backup.

Resolving the problem

Avoid creating NetApp Trident Protect backups that have the same name as existing OADP backups.

NetApp Trident Protect backups need to be unique, and they cannot overlap with existing OADP backups. When creating an on-demand NetApp Trident Protect backup, append a unique identifier, such as the current timestamp:

BACKUP_NAME="backup-$(date +%s)"
cpd-trident-protect.py backup create \
--backup_name=${BACKUP_NAME} ...
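
Before creating the backup, you can also check that no OADP backup already uses the chosen name. The following is a minimal sketch with hypothetical function names. It assumes that OADP backups are stored as Velero Backup resources (backups.velero.io) in the namespace where OADP is installed, which you pass as the second argument.

```shell
# Build a timestamped backup name, as in the example above.
# Usage: new_backup_name [prefix]
new_backup_name() {
  echo "${1:-backup}-$(date +%s)"
}

# Return 0 if no Velero Backup with this name is found in the OADP namespace.
# Caveat: any oc failure (for example, no permission) also counts as "not found".
# Usage: backup_name_is_free <name> <oadp-namespace>
backup_name_is_free() {
  ! oc get backups.velero.io "$1" -n "$2" >/dev/null 2>&1
}
```

A timestamp suffix makes a clash unlikely, and the lookup covers names that were chosen by hand.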

Neo4j config maps missing during backup

Applies to: 5.4.0

Applies to: Backup with NetApp Trident Protect

Diagnosing the problem

The backup operation fails, and you see a pre-snapshot hook error similar to:

Error checking for hook failures in backup '<backup-name>':
command terminated with exit code 1

When you inspect the logs, you see failures related to missing Neo4j configmaps:

failed to validate dynamic online configmap '<>-aux-v2-ckpt-cm': not found
failed to validate dynamic offline configmap '<>-aux-v2-br-cm': not found

When you investigate, you see that the inventory ConfigMap, ibm-neo4j-inv-list-cm, still contains entries that reference a Neo4j instance that has already been uninstalled or is no longer running.

Pre-snapshot hooks attempt to execute backup commands for all entries listed in the inventory ConfigMap, so the backup fails when it encounters stale/non-existent entries.

Cause of the problem

The backup fails because the Neo4j inventory ConfigMap contains stale configmap references for Neo4j clusters, which were previously installed but later removed.

Backup hooks rely on this inventory to determine which Neo4j instances require online/offline checkpoint operations. When the inventory still includes configmaps that no longer exist in the cluster (such as <>-aux-v2-ckpt-cm and <>-aux-v2-br-cm), the hook execution fails, which causes the entire backup to fail.

This makes the backup system behave as if the missing configmaps indicate a failure, even though the corresponding Neo4j instance was intentionally removed.

Resolving the problem

To resolve the problem, remove stale Neo4j entries from the inventory ConfigMap so that the backup does not attempt to run hooks against non-existent Neo4j instances.

  1. View the inventory ConfigMap, and remove entries that look like the following examples:
    • - name: mdm-neo-<id>-aux-v2-ckpt-cm
    • - name: mdm-neo-<id>-aux-v2-br-cm
  2. Edit the inventory ConfigMap to remove stale entries:
    oc edit cm ibm-neo4j-inv-list-cm -n zen
    

    Remove the MDM-related blocks from both the online and offline lists, which ensures that the backup executes hooks only for active Neo4j instances.

    For example, the following structure shows the stale entries:

    online:
      - name: data-lineage-neo4j-aux-v2-ckpt-cm
        namespace: zen
        priority-order: '200'
      - name: mdm-neo-1763773800090371-aux-v2-ckpt-cm
        namespace: zen
        priority-order: '200'
    offline:
      - name: data-lineage-neo4j-aux-v2-br-cm
        namespace: zen
        priority-order: '200'
      - name: mdm-neo-1763773800090371-aux-v2-br-cm
        namespace: zen
        priority-order: '200'

    And the following example shows the revised structure:

    online:
      - name: data-lineage-neo4j-aux-v2-ckpt-cm
        namespace: zen
        priority-order: '200'
    
    offline:
      - name: data-lineage-neo4j-aux-v2-br-cm
        namespace: zen
        priority-order: '200'
    
  3. Rerun the backup process.

    The backup should now complete without pre-snapshot hook failures.
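
To find stale entries without reading the inventory by eye, you can compare each listed name against the ConfigMaps that exist in the project. This is a sketch, assuming that entries appear as "- name: <configmap>" list items as in the example structure. The function name find_stale_neo4j_entries is hypothetical.

```shell
# Read the inventory on stdin and print each listed ConfigMap that no longer exists.
# Hypothetical helper; assumes "- name: <configmap>" list items as in the example.
# Usage: oc get cm ibm-neo4j-inv-list-cm -n zen -o yaml | find_stale_neo4j_entries zen
find_stale_neo4j_entries() (
  ns="$1"
  # Extract the value of every "- name:" list item, ignoring indentation.
  sed -n 's/^[[:space:]]*-[[:space:]]*name:[[:space:]]*\([^[:space:]]*\).*$/\1/p' |
  while read -r cm; do
    # Redirect stdin so that oc cannot consume the remaining inventory lines.
    oc get cm "$cm" -n "$ns" </dev/null >/dev/null 2>&1 || echo "$cm"
  done
)
```

The function only reports names. Removing the entries is still done with oc edit as described in step 2.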

Backups and restores fail because of missing SCCs

Applies to: 5.4.0

Applies to: Backup and restore with NetApp Trident Protect

Diagnosing the problem

Backups and restores with NetApp Trident Protect can fail because of issues with the Security Context Constraint (SCC). These SCC issues can happen when Red Hat OpenShift AI is installed. For a list of services that use Red Hat OpenShift AI, see Installing Red Hat OpenShift AI.

To diagnose this issue, use the following steps.

For backups

If backups fail, and it appears that SCCs did not get created, use these steps to diagnose the issue.

  1. Use the following command to check the state of the NetApp Trident Protect custom resource:
    oc get backup.protect.trident.netapp.io -n ${PROJECT_CPD_INST_OPERATORS}

    The custom resource shows an error similar to the following error message:

    
    NAME                       STATE    ERROR                                                                                                                                                                                      AGE
    netapp-cx-gmc-120325-4-3   Failed   VolumeBackupHandler failed with permanent error kopiaVolumeBackup failed for volume trident-protect/data-aiopenscale-ibm-aios-etcd-1-6f44559e00b0b23bbb077d0e57c45de2: permanent error   124m
    
  2. Use the following commands to check the kopiavolumebackups jobs:
    oc get kopiavolumebackups -n ${PROJECT_CPD_INST_OPERATORS}
    oc get jobs -n trident-protect | grep <name> | grep "openshift.io/scc"

    If the pod's Security Context Constraint (SCC) is anything other than trident-protect-job, you must apply the workaround.

    For example, if you see openshift.io/scc: openshift-ai-llminferenceservice-multi-node-scc, you must apply the workaround.

For restores

If restores fail, and it appears that SCCs did not get created, use these steps to diagnose the issue.

  1. Use the following command to check the state of the NetApp Trident Protect custom resource:
    oc get backuprestore.protect.trident.netapp.io -n ${PROJECT_CPD_INST_OPERATORS}

    The custom resource shows an error similar to the following error message:

    
    NAME                       STATE    ERROR                                                                                                                                                                                      AGE
    netapp-cx-gmc-120325-4-3   Failed   VolumeRestoreHandler failed with permanent error kopiaVolumeRestore failed for volume trident-protect/data-aiopenscale-ibm-aios-etcd-1-6f44559e00b0b23bbb077d0e57c45de2: permanent error   124m
  2. Use the following commands to check the KopiaVolumeRestore jobs:
    oc get kopiavolumerestores -n ${PROJECT_CPD_INST_OPERATORS}
    oc get jobs -n trident-protect | grep <name> | grep "openshift.io/scc"

    If the pod's Security Context Constraint (SCC) is anything other than trident-protect-job, you must apply the workaround.

    For example, if you see openshift.io/scc: openshift-ai-llminferenceservice-multi-node-scc, you must apply the workaround.

Cause of the problem

The backup or restore fails because the kopiavolumebackups or KopiaVolumeRestore jobs use an incorrect Security Context Constraint (SCC). Instead of using the expected trident-protect-job SCC for NetApp Trident Protect, the jobs pick up a different SCC with a higher priority. In this case, openshift-ai-llminferenceservice-multi-node-scc is used, which causes permission issues during the backup or restore.

Resolving the problem
You must be a cluster administrator to use the workaround as SCCs require cluster-admin level permissions.

Use the following commands to set the priority of trident-protect-job higher than the priority of openshift-ai-llminferenceservice-multi-node-scc, which is usually 11.

oc adm policy add-scc-to-group trident-protect-job system:serviceaccounts:trident-protect
oc patch scc trident-protect-job --type=merge -p '{"priority": 20}' 

If other SCCs are preventing the job pods from using trident-protect-job, you might need to increase its priority further.
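
To confirm that the patch had the intended effect, you can compare the priority of trident-protect-job with the priority of the competing SCC. The following is a minimal sketch with a hypothetical function name. It reads the top-level priority field of each SCC and treats an unset priority as 0.

```shell
# Return 0 if trident-protect-job has a strictly higher priority than another SCC.
# Hypothetical helper for checking the result of the patch above.
# Usage: trident_scc_outranks <other-scc-name>
trident_scc_outranks() (
  other="$1"
  ours=$(oc get scc trident-protect-job -o jsonpath='{.priority}')
  theirs=$(oc get scc "$other" -o jsonpath='{.priority}')
  # An unset priority prints as an empty string; treat it as 0.
  [ "${ours:-0}" -gt "${theirs:-0}" ]
)
```

For example, trident_scc_outranks openshift-ai-llminferenceservice-multi-node-scc should succeed after the priority is set to 20, if the other SCC still has a priority of 11.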

IBM Software Hub resources are not migrated

Applies to: 5.4.0

Applies to: Portworx asynchronous disaster recovery

Diagnosing the problem
When you use Portworx asynchronous disaster recovery, the migration finishes almost immediately, and neither the volumes nor the expected number of resources are migrated. Run the following command:
storkctl get migrations -n ${PX_ADMIN_NS}
Tip: ${PX_ADMIN_NS} is usually kube-system.
Example output:
NAME                                                CLUSTERPAIR       STAGE   STATUS       VOLUMES   RESOURCES   CREATED               ELAPSED                       TOTAL BYTES TRANSFERRED
cpd-tenant-migrationschedule-interval-<timestamp>   mig-clusterpair   Final   Successful   0/0       0/0         <timestamp>   Volumes (0s) Resources (3s)   0
Cause of the problem
This problem occurs starting with stork 23.11.0. Backup exec rules are not run, and expected IBM Software Hub resources are not migrated.
Resolving the problem
To resolve the problem, downgrade stork to a version prior to 23.11.0. For more information about stork releases, see the stork Releases page.
  1. Scale down the Portworx operator so that it doesn't reset manual changes to the stork deployment:
    oc scale -n ${PX_ADMIN_NS} deploy portworx-operator --replicas=0
  2. Edit the stork deployment image version to a version prior to 23.11.0:
    oc edit deploy -n ${PX_ADMIN_NS} stork
  3. If you need to scale up the Portworx operator, run the following command.
    Note: The Portworx operator will undo changes to the stork deployment and return to the original stork version.
    oc scale -n ${PX_ADMIN_NS} deploy portworx-operator --replicas=1
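
To decide whether a cluster is affected, you can compare the stork version against 23.11.0. This is a sketch, assuming that sort supports the -V (version sort) option, as GNU coreutils does. The function name stork_is_affected is hypothetical.

```shell
# Return 0 if the given stork version is 23.11.0 or later.
# Hypothetical helper; relies on `sort -V` for version ordering.
# Usage: stork_is_affected <version>
stork_is_affected() (
  version="$1"
  # If 23.11.0 sorts first (or ties), the given version is 23.11.0 or later.
  lowest=$(printf '%s\n' "23.11.0" "$version" | sort -V | head -n 1)
  [ "$lowest" = "23.11.0" ]
)
```

The version itself can usually be read from the image tag of the stork deployment, for example with oc get deploy stork -n ${PX_ADMIN_NS} -o jsonpath='{.spec.template.spec.containers[0].image}', assuming that the tag is a plain version number.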

Prompt tuning fails after restoring watsonx.ai

Applies to: 5.4.0

Applies to: Backup and restore with the OADP utility

Diagnosing the problem
When you try to create a prompt tuning experiment, you see the following error message:
An error occurred while processing prompt tune training.
Resolving the problem
Do the following steps:
  1. Restart the caikit operator:
    oc rollout restart deployment caikit-runtime-stack-operator -n ${PROJECT_CPD_INST_OPERATORS}

    Wait at least 2 minutes for the cais fmaas custom resource to become healthy.

  2. Check the status of the cais fmaas custom resource by running the following command:
    oc get cais fmaas -n ${PROJECT_CPD_INST_OPERANDS}
  3. Retry the prompt tuning experiment.

Restic backup that contains dynamically provisioned volumes in Amazon Elastic File System fails during restore

Applies to: 5.4.0

Applies to: Backup and restore with the OADP utility

Diagnosing the problem

When trying to restore from an offline backup in Amazon Elastic File System, the restore process fails in the volume restore phase, and you might see something similar to the following error:

Status: Failed
Errors: 8
Warnings: 640

Action Errors:
DPP_NAME: cpd-offline-tenant/restore-service-orchestrated-parent-workflow
INDEX: 6
ACTION: restore-cpd-volumes
ERROR: error: expected restore phase to be Completed, received PartiallyFailed

In the Velero logs, you might see something similar to the following error:

time="2025-12-04T18:14:44Z" level=error msg="Restic command fail with ExitCode: 1. 
Process ID is 2077, Exit error is: exit status 1" 
PodVolumeRestore=openshift-adp/cpd-tenant-vol-r-00xda-x44x-4xx0-a9b1 
controller=PodVolumeRestore
stderr=ignoring error for /index-16: lchown /host_pods/.../mount/index-16: 
operation not permitted
Cause of the problem
Restic backups with Amazon Elastic File System are not supported by Velero. Restic cannot properly restore file ownership and permissions on Amazon Elastic File System volumes because it uses a different protocol for ownership operations.
Resolving the problem

If you want to use Amazon Elastic File System for dynamically provisioned volumes, you must use Kopia backups instead of Restic. The OADP DataProtectionApplication (DPA) must use uploaderType: kopia.
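Before you take a backup, you can confirm which uploader the DPA is configured to use. The following is a minimal sketch, not an official procedure. It assumes that the uploader type is stored at spec.configuration.nodeAgent.uploaderType in the DPA, that the first DPA in the project is the one in use, and that OADP_OPERATOR_NAMESPACE is set to the project where OADP is installed. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: confirm that the OADP DataProtectionApplication (DPA) uses Kopia.
# Assumption: the uploader type is stored at
# spec.configuration.nodeAgent.uploaderType in the DPA custom resource.

# Succeeds (exit code 0) only when the given uploader type is "kopia".
is_kopia_uploader() {
  [ "$1" = "kopia" ]
}

# Reads the uploader type of the first DPA in the OADP project.
# Requires an active oc login. OADP_OPERATOR_NAMESPACE is assumed to be set.
get_uploader_type() {
  oc get dataprotectionapplications.oadp.openshift.io \
    -n "${OADP_OPERATOR_NAMESPACE}" \
    -o jsonpath='{.items[0].spec.configuration.nodeAgent.uploaderType}'
}

# Example usage on a cluster:
#   if is_kopia_uploader "$(get_uploader_type)"; then
#     echo "The DPA uses Kopia."
#   else
#     echo "The DPA does not use Kopia. Update uploaderType before the backup."
#   fi
```

The comparison is kept separate from the cluster query so that the check can be reused with a value obtained some other way, for example from an exported copy of the DPA.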

Restoring Data Virtualization fails with metastore not running or failed to connect to database error

Applies to: 5.4.0

Applies to: Online backup and restore with the OADP utility

Diagnosing the problem
View the status of the restore by running the following command:
cpd-cli oadp tenant-restore status ${TENANT_BACKUP_NAME}-restore --details
The output shows errors similar to the following example:
time=<timestamp>  level=INFO msg=Verifying if Metastore is listening
SERVICE              HOSTNAME                               NODE      PID  STATUS
Standalone Metastore c-db2u-dv-hurricane-dv                   -        -   Not running
time=<timestamp>  level=ERROR msg=Failed to connect to BigSQL database
* error performing op postRestoreViaConfigHookRule for resource dv (configmap=cpd-dv-aux-ckpt-cm): 1 error occurred:
* error executing command su - db2inst1 -c '/db2u/scripts/bigsql-exec.sh /usr/ibmpacks/current/bigsql/bigsql/bigsql-cli/BIGSQL/package/scripts/bigsql-db2ubar-hook.sh -H POST -M RESTORE -L' (container=db2u podIdx=0 podName=c-db2u-dv-db2u-0 namespace=<namespace-name> auxMetaName=dv-aux component=dv actionIdx=0): command terminated with exit code 1
Cause of the problem
A timing issue causes the restore posthooks to fail at the step where they check the result of the db2 connect to bigsql command. The db2 connect to bigsql command fails because bigsql is restarting at about the same time.
Resolving the problem
Run the following commands to resume the restore from the post-restore hooks:
export CPDBR_ENABLE_FEATURES=experimental
cpd-cli oadp tenant-restore create ${TENANT_RESTORE_NAME}-cont \
--from-tenant-backup ${TENANT_BACKUP_NAME} \
--verbose \
--log-level debug \
--start-from cpd-post-restore-hooks

Backup pre-check fails for Db2 Data Management Console due to timeout for status check

Applies to: 5.4.0

Applies to: Offline backup with OADP

Diagnosing the problem

You have Db2 Data Management Console installed. When you perform an offline backup, the backup fails during the pre-check phase. You see an error related to the cpd-dmc-aux-br-cm ConfigMap, which contains the backup configuration for Db2 Data Management Console. The error indicates that the backup pre-check hook failed:

** PHASE [MASTER PLAN/PARENT RECIPE (APPTYPE=CPD-OFFLINE-TENANT, WORKFLOW=BACKUP-SERVICE-ORCHESTRATED-PARENT-WORKFLOW)/CPD-PRECHECK-BACKUP/END]
master plan results:
   1 DataProtectionPlan execution result(s):
   	dpp 0: cpd-offline-tenant/backup-service-orchestrated-parent-workflow (Phase: FatalError)
   		action 0: cpd-precheck-backup (Phase=FatalError)
   			error: backup precheck failed with error: pre-check backup hooks encountered one or more error(s), err=1 error occurred:
   	* backup precheck hook finished with status=error, configmap=cpd-dmc-aux-br-cm
   Failed with 1 error(s):
   error: DataProtectionPlan=cpd-offline-tenant/backup-service-orchestrated-parent-workflow, Action=cpd-precheck-backup (index=0)
   		backup precheck failed with error: pre-check backup hooks encountered one or more error(s), err=1 error occurred:
   	* backup precheck hook finished with status=error, configmap=cpd-dmc-aux-br-cm

When examining the pre-check hook execution details, you observe that the hook times out:

The following hooks either have errors or timed out
   pre-check (count=1, duration=5.413s):
   
       	ADDONID	COMPONENT	CONFIGMAP        	PRIORITY_ORDER	METHOD	STATUS	DURATION	START_TIME               	END_TIME                 
       	dmc    	dmc      	cpd-dmc-aux-br-cm	20            	rule  	error 	5.413s  	2026-02-13 8:07:12.035 AM	2026-02-13 8:07:17.449 AM

This issue occurs intermittently, and some backups might succeed while others fail.

Cause of the problem

The cpd-dmc-aux-br-cm ConfigMap contains a pre-check hook that validates that the dmcAddon status is Completed before allowing the backup to proceed. The default timeout value is set to 5 seconds in the ConfigMap, which is not enough time:

precheck-meta:
  backup-hooks:
    exec-rules:
      - resource-kind: dmcaddons.dmc.databases.ibm.com
        on-error: Fail
        actions:
          - builtins:
              name: cpdbr.cpd.ibm.com/check-condition
              params:
                condition: "{$.status.dmcAddonStatus} == {\"Completed\"}"
              timeout: 5s

When time runs out before the check completes, the pre-check hook fails, which causes the entire backup operation to fail.

Resolving the problem

You must manually increase the timeout value in the cpd-dmc-aux-br-cm ConfigMap before taking a backup.

Note: You must apply this workaround before each backup that you take because the ConfigMap might be re-created or reset during Db2 Data Management Console operator reconciliation.
  1. Patch the cpd-dmc-aux-br-cm ConfigMap to increase the timeout from 5 seconds to 1800 seconds (30 minutes):
    oc patch configmap cpd-dmc-aux-br-cm -n <namespace> --type='merge' -p='
    {
      "data": {
        "precheck-meta": "backup-hooks:\n  exec-rules:\n    - resource-kind: dmcaddons.dmc.databases.ibm.com\n      on-error: Fail\n      actions:\n        - builtins:\n            name: cpdbr.cpd.ibm.com/check-condition\n            params:\n              condition: \"{$.status.dmcAddonStatus} == {\\\"Completed\\\"}\"\n            timeout: 1800s\n"
      }
    }'
  2. Verify that the timeout value in the ConfigMap has been updated to 1800s:
    oc get configmap cpd-dmc-aux-br-cm -n <namespace> -o yaml | grep -A 10 "precheck-meta"
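Because the workaround must be reapplied before each backup, it can help to check the timeout value with a script instead of reading the grep output. The following sketch assumes that the precheck-meta content has the layout shown earlier, with a single timeout entry. The function name extract_timeout is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print the timeout value from the precheck-meta content of the
# cpd-dmc-aux-br-cm ConfigMap.
# Assumption: the content contains one "timeout:" entry, as in the example
# in this topic.

# Reads YAML text on stdin and prints the value of the first "timeout:" key.
extract_timeout() {
  awk '$1 == "timeout:" { print $2; exit }'
}

# Example usage on a cluster (requires an active oc login):
#   timeout=$(oc get configmap cpd-dmc-aux-br-cm -n <namespace> \
#     -o jsonpath='{.data.precheck-meta}' | extract_timeout)
#   if [ "$timeout" != "1800s" ]; then
#     echo "Timeout is ${timeout:-unset}. Reapply the patch before the backup."
#   fi
```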

Backup validation fails for Data Virtualization due to missing label in dvendpoint PVC

Applies to: 5.4.0

Applies to: Online and offline backup with the OADP utility

Diagnosing the problem

When you take a backup with OADP on a cluster that has a Data Virtualization instance, the backup validation fails and you see the following error:

error executing workflow actions: workflow action execution resulted in 1 error(s):
     - encountered an error during local-exec workflowAction.Do() - action=cpd-backup-validation, action-index=8, retry-attempt=0/0, err=backup validation failed: 1 error occurred:
        * backup validation failed for configmap: dv-aux-br-cm

 (description: error executing workflow for DataProtectionPlan 'cpd-offline-tenant/backup-service-orchestrated-parent-workflow')
        * error executing workflow actions: workflow action execution resulted in 1 error(s):
     - encountered an error during local-exec workflowAction.Do() - action=cpd-backup-validation, action-index=8, retry-attempt=0/0, err=backup validation failed: 1 error occurred:
        * backup validation failed for configmap: dv-aux-br-cm

 (description: finished executing 1 DataProtectionPlan(s):)
        * DataProtectionPlan=cpd-offline-tenant/backup-service-orchestrated-parent-workflow, Action=cpd-backup-validation (index=8) error: backup validation failed: 1 error occurred:
        * backup validation failed for configmap: dv-aux-br-cm

error: backup validation unsuccessful

failed rules report:

CM NAME         RESOURCE-KIND           ADDONID PATH                                                            ERR
dv-aux-br-cm    persistentvolumeclaim   dv      backup-validation-meta/backup-validations/4/validation-rules/1  validation unsuccessful, resource  '>=' '1' rule is not fulfilled, actual count: 0
Cause of the problem

The backup validation process fails because it cannot find the correct number of PersistentVolumeClaims (PVCs) with the required labels.

The backup validation process checks for the dvendpoint PVC by using specific labels defined in the dv-aux-br-cm backup ConfigMap. The validation rule expects at least one PVC with the following labels:

 - resource-kind: persistentvolumeclaim
    validation-rules:
      - type: count
        op: ">="
        val: 1
        labels: "icpdsupport/app=endpoint, icpdsupport/addOnId=dv"  

However, after upgrading from Data Virtualization 3.2.x to Data Virtualization 3.3.x, the required labels icpdsupport/app=endpoint and icpdsupport/addOnId=dv are missing from the dvendpoint PVC. Because the labels are missing, the validation process cannot find any matching PVCs and fails with actual count: 0.

Resolving the problem

Add the missing labels to the PVCs before taking a backup. Run the following script to add the labels to the dvendpoint PVCs.

  1. Create the script file:
    vi fix_labels.sh
  2. Copy and paste the following content into the file. Change the NAMESPACE value as needed.
    # Set NAMESPACE to match your IBM Software Hub namespace
    NAMESPACE="cpd-instance"
    
    # List PVCs before patching
    echo "PVCs before patching:"
    oc get pvc -n $NAMESPACE -l component=dvendpoint,formation_id=db2u-dv --show-labels
    
    # Patch all matching PVCs
    echo "Patching PVCs..."
    for pvc in $(oc get pvc -n $NAMESPACE -l component=dvendpoint,formation_id=db2u-dv -o name); do
      echo "Patching $pvc"
      oc patch $pvc -n $NAMESPACE --type=merge -p '{"metadata":{"labels":{"icpdsupport/app":"endpoint"}}}'
    done
    
    # Verify the patch
    echo "PVCs after patching:"
    oc get pvc -n $NAMESPACE -l component=dvendpoint,formation_id=db2u-dv --show-labels
  3. Make the script executable:
    chmod +x fix_labels.sh
  4. Execute the script:
    sh fix_labels.sh

    The script displays PVC labels before and after patching to confirm the changes. The script adds the missing label only if the label is not already present. You need to run the script before every backup that you take.

  5. Verify that the labels have been correctly applied:
    oc get pvc -n <namespace> -l component=dvendpoint,formation_id=db2u-dv --show-labels
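The --show-labels output lists labels as comma-separated key=value pairs, so the check for the two labels that the validation rule requires can be scripted. The following is a minimal sketch. The function name has_label is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: check whether a comma-separated label list, as printed by
# "oc get pvc --show-labels", contains an exact key=value label.

# has_label <label-list> <key=value>
# Succeeds only on an exact match of the whole key=value pair.
has_label() {
  case ",$1," in
    *",$2,"*) return 0 ;;
    *)        return 1 ;;
  esac
}

# Example usage on a cluster (requires an active oc login):
#   labels=$(oc get pvc <pvc-name> -n <namespace> --show-labels --no-headers \
#     | awk '{ print $NF }')
#   for required in icpdsupport/app=endpoint icpdsupport/addOnId=dv; do
#     has_label "$labels" "$required" || echo "Missing label: $required"
#   done
```

The list is wrapped in commas before matching so that a label such as icpdsupport/app=endpoint2 is not mistaken for icpdsupport/app=endpoint.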

Offline backup fails with PartiallyFailed error

Applies to: 5.4.0

Applies to: Offline backup and restore with the OADP utility

Diagnosing the problem
In the Velero logs, you see errors similar to the following example:
time="<timestamp>" level=error msg="Current BackupStorageLocations available/unavailable/unknown: 0/0/1)" controller=backup-storage-location logSource="/remote-source/velero/app/pkg/controller/backup_storage_location_controller.go:180"
time="<timestamp>" level=error msg="error encountered while scanning stdout" backupLocation=oadp-operator/dpa-sample-1 cmd=/plugins/velero-plugin-for-aws controller=backup-sync error="read |0: file already closed" logSource="/remote-source/velero/app/pkg/plugin/clientmgmt/process/logrus_adapter.go:90"
time="<timestamp>" level=error msg="Restic command fail with ExitCode: 1. Process ID is 906, Exit error is: exit status 1" logSource="/remote-source/velero/app/pkg/util/exec/exec.go:66"
time="<timestamp>" level=error msg="pod volume backup failed: data path backup failed: error running restic backup command restic backup --repo=s3:http://minio-velero.apps.jctesti23.examplehost.example.com/velero/cpdbackup/restic/cpd-instance --password-file=/tmp/credentials/oadp-operator/velero-repo-credentials-repository-password --cache-dir=/scratch/.cache/restic . --tag=pod-uid=1ed9d52f-2f6d-4978-930a-4d8e30acced1 --tag=pvc-uid=76b76bc4-27d4-4369-886c-1272dfdf9ce9 --tag=volume=cc-home-pvc-vol --tag=backup=cpd-tenant-vol-485eef74-efbe-11ef-b2bd-00000a0b44c3 --tag=backup-uid=b55d6323-9875-4afe-b605-646250cbd55c --tag=ns=cpd-instance --tag=pod=cpdbr-vol-mnt --host=velero --json with error: exit status 3 stderr: {\"message_type\":\"error\",\"error\":{},\"during\":\"scan\",\"item\":\"/host_pods/1ed9d52f-2f6d-4978-930a-4d8e30acced1/volumes/kubernetes.io~nfs/pvc-76b76bc4-27d4-4369-886c-1272dfdf9ce9/.scripts/system\"}\n{\"message_type\":\"error\",\"error\":{},\"during\":\"archival\",\"item\":\".scripts/system\"}\n{\"message_type\":\"error\",\"error\":{\"Op\":\"open\",\"Path\":\"_global_/security/artifacts/metakey\",\"Err\":13},\"during\":\"archival\",\"item\":\"/host_pods/1ed9d52f-2f6d-4978-930a-4d8e30acced1/volumes/kubernetes.io~nfs/pvc-76b76bc4-27d4-4369-886c-1272dfdf9ce9/_global_/security/artifacts/metakey\"}\nWarning: at least one source file could not be read\n" backup=oadp-operator/cpd-tenant-vol-485eef74-efbe-11ef-b2bd-00000a0b44c3 logSource="/remote-source/velero/app/pkg/podvolume/backupper.go:328"
time="<timestamp>" level=error msg="pod volume backup failed: data path backup failed: error running restic backup command restic backup --repo=s3:http://minio-velero.apps.jctesti23.cp.fyre.ibm.com/velero/cpdbackup/restic/cpd-instance --password-file=/tmp/credentials/oadp-operator/velero-repo-credentials-repository-password --cache-dir=/scratch/.cache/restic . --tag=pod=cpdbr-vol-mnt --tag=pod-uid=1ed9d52f-2f6d-4978-930a-4d8e30acced1 --tag=pvc-uid=93e9e23c-d80a-49cc-80bb-31a36524e0dc --tag=volume=data-rabbitmq-ha-0-vol --tag=backup=cpd-tenant-vol-485eef74-efbe-11ef-b2bd-00000a0b44c3 --tag=backup-uid=b55d6323-9875-4afe-b605-646250cbd55c --tag=ns=cpd-instance --host=velero --json with error: exit status 3 stderr: {\"message_type\":\"error\",\"error\":{\"Op\":\"open\",\"Path\":\".erlang.cookie\",\"Err\":13},\"during\":\"archival\",\"item\":\"/host_pods/1ed9d52f-2f6d-4978-930a-4d8e30acced1/volumes/kubernetes.io~nfs/pvc-93e9e23c-d80a-49cc-80bb-31a36524e0dc/.erlang.cookie\"}\nWarning: at least one source file could not be read\n" backup=oadp-operator/cpd-tenant-vol-485eef74-efbe-11ef-b2bd-00000a0b44c3 logSource="/remote-source/velero/app/pkg/podvolume/backupper.go:328"
Cause of the problem
The restic folder was deleted after backups were cleaned up (deleted). This problem is a Velero known issue. For more information, see velero does not recreate restic|kopia repository from manifest if its directories are deleted on s3.
Resolving the problem
Do the following steps:
  1. Get the list of backup repositories:
    oc get backuprepositories -n ${OADP_OPERATOR_NAMESPACE} -o yaml
  2. Check for old or invalid object storage URLs.
  3. Check that the object storage path is in the backuprepositories custom resource.
  4. Check that the <objstorage>/<bucket>/<prefix>/restic/<namespace>/config file exists.

    If the file does not exist, make sure that you do not share the same <objstorage>/<bucket>/<prefix> with another cluster, and specify a different <prefix>.

  5. Delete backup repositories that are invalid for the following reasons:
    • The path does not exist anymore in the object storage.
    • The restic/<namespace>/config file does not exist.
    oc delete backuprepositories -n ${OADP_OPERATOR_NAMESPACE} <backup-repository-name>
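To make the check in step 4 less error-prone, you can build the expected object key of the config file from the prefix and the project name. The following sketch only builds the key. It does not contact the object storage. The function name restic_config_key is hypothetical, and the spec.resticIdentifier field in the usage comment is an assumption about the backuprepositories custom resource.

```shell
#!/usr/bin/env bash
# Sketch: build the object key, relative to the bucket, of the restic config
# file that must exist for a backup repository to be valid.
# Pattern from step 4: <prefix>/restic/<namespace>/config

# restic_config_key <prefix> <namespace>
restic_config_key() {
  # ${1%/} drops one trailing slash from the prefix, if present.
  printf '%s/restic/%s/config\n' "${1%/}" "$2"
}

# Example usage:
#   restic_config_key cpdbackup cpd-instance
#
# To list each backup repository with its repository location on a cluster
# (assumption: the location is stored in spec.resticIdentifier):
#   oc get backuprepositories -n "${OADP_OPERATOR_NAMESPACE}" \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.resticIdentifier}{"\n"}{end}'
```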

Db2 Big SQL backup pre-hook and post-hook fail during offline backup

Applies to: 5.4.0

Applies to: Offline backup and restore with the OADP utility

Diagnosing the problem
In the db2diag logs of the Db2 Big SQL head pod, you see error messages such as in the following example when backup pre-hooks are running:
<timestamp>          LEVEL: Event
PID     : 3415135              TID : 22544119580160  PROC : db2star2
INSTANCE: db2inst1             NODE : 000
HOSTNAME: c-bigsql-<xxxxxxxxxxxxxxx>-db2u-0
FUNCTION: DB2 UDB, base sys utilities, DB2StartMain, probe:5692
MESSAGE : ZRC=0xFFFFFBD0=-1072
          SQL1072C  The request failed because the database manager resources
          are in an inconsistent state. The database manager might have been
          incorrectly terminated, or another application might be using system
          resources in a way that conflicts with the use of system resources by
          the database manager.
Cause of the problem
The Db2 database was unable to start because of the error code SQL1072C. As a result, the bigsql start command that runs as part of the post-backup hook hangs, which causes the post-hook to time out. The post-hook cannot succeed unless Db2 is brought back to a stable state and the bigsql start command runs successfully. Until then, the Db2 Big SQL instance is left in an unstable state.
Resolving the problem
Do one or both of the following troubleshooting and cleanup procedures.
Tip: For more information about the SQL1072C error code and how to resolve it, see SQL1000-1999 in the Db2 documentation.
Remove all the database manager processes running under the Db2 instance ID
Do the following steps:
  1. Log in to the Db2 Big SQL head pod:
    oc -n ${PROJECT_CPD_INST_OPERANDS} rsh $(oc -n ${PROJECT_CPD_INST_OPERANDS} get pods | grep -i c-bigsql | grep -i db2u-0 | cut -d' ' -f 1) bash
  2. Switch to the db2inst1 user:
    su - db2inst1
  3. List all the database manager processes that are running under db2inst1:
    db2_ps
  4. Remove these processes:
    kill -9 <process-ID>
Ensure that no other application is running under the Db2 instance ID, and then remove all resources owned by the Db2 instance ID
Do the following steps:
  1. Log in to the Db2 Big SQL head pod:
    oc -n ${PROJECT_CPD_INST_OPERANDS} rsh $(oc -n ${PROJECT_CPD_INST_OPERANDS} get pods | grep -i c-bigsql | grep -i db2u-0 | cut -d' ' -f 1) bash
  2. Switch to the db2inst1 user:
    su - db2inst1
  3. List all IPC resources owned by db2inst1:
    ipcs | grep db2inst1
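  4. Remove each of these resources by its ID. Use -q for a message queue, -m for a shared memory segment, and -s for a semaphore array:
    ipcrm -[q|m|s] <resource-ID>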
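
The ipcrm command removes resources by ID, so each resource in the ipcs listing needs its own command. The following sketch turns the ipcs output into a list of ipcrm commands for review. It assumes the default Linux ipcs layout, where each section starts with a Message Queues, Shared Memory Segments, or Semaphore Arrays header line and the ID and owner are the second and third columns. The function name ipcrm_commands is hypothetical. The function only prints the commands and does not run them.

```shell
#!/usr/bin/env bash
# Sketch: print one ipcrm command for each IPC resource owned by a user.
# Assumption: input is the default Linux "ipcs" output, where column 2 is
# the resource ID and column 3 is the owner.

# ipcrm_commands <owner>   (reads ipcs output on stdin)
ipcrm_commands() {
  awk -v owner="$1" '
    /Message Queues/         { flag = "-q"; next }
    /Shared Memory Segments/ { flag = "-m"; next }
    /Semaphore Arrays/       { flag = "-s"; next }
    flag != "" && $3 == owner { print "ipcrm", flag, $2 }
  '
}

# Example usage inside the pod, as the db2inst1 user:
#   ipcs | ipcrm_commands db2inst1
# Review the printed commands, and then run them only after you confirm
# that no other application is running under the Db2 instance ID.
```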

Security issues

Security scans return an Inadequate Account Lockout Mechanism message

Applies to: 5.10.0 and later

Diagnosing the problem
If you run a security scan against IBM Software Hub, the scan returns the following message.
Inadequate Account Lockout Mechanism
Resolving the problem
This is by design. It is strongly recommended that you use an enterprise-grade password management solution, such as SAML SSO or an LDAP provider.

The Kubernetes version information is disclosed

Applies to: 5.10.0 and later

Diagnosing the problem
If you run an Aqua Security scan against your cluster, the scan reports that the Kubernetes version information is disclosed.
Resolving the problem
This is expected based on the following solution document from Red Hat OpenShift Customer Portal: Hide kubernetes /version API endpoint in OpenShift Container Platform 4.