Known issues and limitations for IBM Software Hub
The following issues apply to the IBM® Software Hub platform. Each issue includes information about the releases that it applies to. If the issue was fixed in a refresh, that information is also included.
- Customer-reported issues
- General issues
- Installation and upgrade issues
- Backup and restore issues
- Security issues
The following issues apply to IBM Software Hub services.
Customer-reported issues
Issues that are found after the release are posted on the IBM Support site.
General issues
- After rebooting a cluster that uses OpenShift Data Foundation storage, some IBM Software Hub services aren't functional
- Export jobs are not deleted if the export specification file includes incorrectly formatted JSON
- The health service-functionality check fails on the v2/ingestion_jobs API endpoint for watsonx.data
After rebooting a cluster that uses OpenShift Data Foundation storage, some IBM Software Hub services aren't functional
Applies to: 5.3.0
- Diagnosing the problem
- After rebooting the cluster, some IBM Software Hub custom resources remain in the InProgress state. For more information about this problem, see Missing NodeStageVolume RPC call blocks new pods from going into Running state in the Red Hat® OpenShift® Data Foundation 4.1.4 release notes.
- Workaround
- To enable the pods to come up after a reboot:
  - Find the nodes that have pods that are in an Error state:
    oc get po -n ${PROJECT_CPD_INST_OPERANDS} -o wide | grep -v -P "Completed|(\d+)\/\1"
  - Mark each node as unschedulable:
    oc adm cordon <node_name>
  - Delete the affected pods:
    oc get pod | grep -Ev "Comp|0/0|1/1|2/2|3/3|4/4|5/5|6/6|7/7" | awk '{print $1}' | xargs oc delete po --force=true --grace-period=0
  - Mark each node as schedulable again:
    oc adm uncordon <node_name>
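To confirm that the services recover after the nodes are uncordoned, you can re-run the pod listing from the diagnosis step and check the ZenService status. A minimal sketch, assuming the ZenService custom resource is named lite-cr and is in the instance project (adjust both if your installation differs):
  # Pods that are still not fully ready or Completed, and the current ZenService status
  oc get po -n ${PROJECT_CPD_INST_OPERANDS} -o wide | grep -v -P "Completed|(\d+)\/\1"
  oc get zenservice lite-cr -n ${PROJECT_CPD_INST_OPERANDS} -o jsonpath='{.status.zenStatus}'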
Export jobs are not deleted if the export specification file includes incorrectly formatted JSON
Applies to: 5.3.0
When you export data from IBM Software Hub, you must create an export specification file. The export specification file includes a JSON string that defines the data that you want to export.
If the JSON string is incorrectly formatted, the export will fail. In addition, if you try to run the cpd-cli export-import export delete command, the command completes without returning any errors, but the export job is not deleted.
- Diagnosing the problem
- To confirm that the export job was not deleted, run:
  cpd-cli export-import export status \
  --namespace=${PROJECT_CPD_INST_OPERANDS} \
  --profile=${CPD_PROFILE_NAME}
- Resolving the problem
- Use the OpenShift Container Platform command-line interface to delete the failed export job:
  - Set the EXPORT_JOB environment variable to the name of the job that you want to delete:
    export EXPORT_JOB=<export-job-name>
  - To remove the export job, run:
    oc delete job ${EXPORT_JOB} \
    --namespace=${PROJECT_CPD_INST_OPERANDS}
- If you want to use the export specification file to export data, fix the JSON formatting issues. A quick way to validate the JSON is shown after these steps.
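Before you re-run the export, you can check that the specification file parses as valid JSON. A minimal sketch, assuming the file is named export-spec.json (a hypothetical name):
  # Hypothetical file name; replace it with the path to your export specification file
  jq empty export-spec.json && echo "JSON is well formed"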
The health service-functionality check fails on the v2/ingestion_jobs API endpoint for watsonx.data
Applies to: 5.3.0
API endpoint:
/lakehouse/api/v2/ingestion_jobs
This failure is not indicative of a problem with watsonx.data™. There is no impact on the functionality of the service if this check fails.
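If you want to confirm that the failure is limited to this health check, you can call the endpoint directly. A hedged sketch, where CPD_ROUTE is the external route of your instance and TOKEN holds a valid bearer token (both are assumed variable names, not part of the product):
  # Assumed variables; substitute your route and a valid token
  curl -k -H "Authorization: Bearer ${TOKEN}" "https://${CPD_ROUTE}/lakehouse/api/v2/ingestion_jobs"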
Installation and upgrade issues
- The Switch locations icon is not available if the apply-cr command times out
- Upgrades fail if the Data Foundation Rook Ceph cluster is unstable
- After you upgrade a Red Hat OpenShift Container Platform cluster, the FoundationDB resource can become unavailable
- Persistent volume claims with the WaitForFirstConsumer volume binding mode are flagged by the installation health checks
- Node pinning is not applied to postgresql pods
- The ibm-nginx deployment does not scale fast enough when automatic scaling is configured
- Uninstalling IBM watsonx services does not remove the IBM watsonx experience
The Switch locations icon is not available if the apply-cr command times out
Applies to: 5.3.0
If you install solutions that are available in different experiences, the Switch locations icon is not available in the web client if the cpd-cli manage apply-cr command times out.
- Resolving the problem
- Re-run the cpd-cli manage apply-cr command.
Upgrades fail if the Data Foundation Rook Ceph cluster is unstable
Applies to: 5.3.0
If the Red Hat OpenShift Data Foundation or IBM Fusion Data Foundation Rook Ceph® cluster is unstable, upgrades fail.
One symptom is that pods will not start because of a FailedMount error. For example:
Warning FailedMount 36s (x1456 over 2d1h) kubelet MountVolume.MountDevice failed for volume
"pvc-73bf3705-43e9-40bd-87ed-c1e1656d6f12" : rpc error: code = Aborted desc = an operation with the given
Volume ID 0001-0011-openshift-storage-0000000000000001-5e17508b-c295-4306-b684-eaa327aec2ab already exists
- Diagnosing the problem
- To confirm whether the Data Foundation Rook Ceph cluster is unstable:
  - Ensure that the rook-ceph-tools pod is running:
    oc get pods -n openshift-storage | grep rook-ceph-tools
    Note: On IBM Fusion HCI System or on environments that use hosted control planes, the pods are running in the openshift-storage-client project.
  - Set the TOOLS_POD environment variable to the name of the rook-ceph-tools pod:
    export TOOLS_POD=<pod-name>
  - Execute into the rook-ceph-tools pod:
    oc rsh -n openshift-storage ${TOOLS_POD}
  - Run the following command to get the status of the Rook Ceph cluster:
    ceph status
    Confirm that the output includes the following line:
    health: HEALTH_WARN
  - Exit the pod:
    exit
- Resolving the problem
- To resolve the problem:
  - Get the name of the rook-ceph-mgr pods:
    oc get pods -n openshift-storage | grep rook-ceph-mgr
  - Set the MGR_POD_A environment variable to the name of the rook-ceph-mgr-a pod:
    export MGR_POD_A=<rook-ceph-mgr-a-pod-name>
  - Set the MGR_POD_B environment variable to the name of the rook-ceph-mgr-b pod:
    export MGR_POD_B=<rook-ceph-mgr-b-pod-name>
  - Delete the rook-ceph-mgr-a pod:
    oc delete pods ${MGR_POD_A} -n openshift-storage
  - Ensure that the rook-ceph-mgr-a pod is running before you move to the next step:
    oc get pods -n openshift-storage | grep rook-ceph-mgr
  - Delete the rook-ceph-mgr-b pod:
    oc delete pods ${MGR_POD_B} -n openshift-storage
  - Ensure that the rook-ceph-mgr-b pod is running:
    oc get pods -n openshift-storage | grep rook-ceph-mgr
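After both rook-ceph-mgr pods are running again, you can confirm that the Rook Ceph cluster is healthy before you retry the upgrade. A minimal check that reuses the rook-ceph-tools pod from the diagnosis steps:
  # A healthy cluster reports HEALTH_OK instead of HEALTH_WARN
  oc rsh -n openshift-storage ${TOOLS_POD} ceph status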
After you upgrade a Red Hat OpenShift Container Platform cluster, the FoundationDB resource can become unavailable
Applies to: 5.3.0
After you upgrade your cluster to a new version of Red Hat OpenShift Container Platform, the IBM FoundationDB pods can become unavailable. When this issue occurs, services that rely on FoundationDB such as IBM Knowledge Catalog and IBM Master Data Management cannot function correctly.
- IBM Knowledge Catalog
- IBM Master Data Management
- Diagnosing the problem
- To identify the cause of this issue, check the FoundationDB status and details.
  - Check the FoundationDB status:
    oc get fdbcluster -o yaml | grep fdbStatus
    If this command is successful, the returned status is Complete. If the status is InProgress or Failed, proceed to the workaround steps.
  - If the status is Complete but FoundationDB is still unavailable, log in to one of the FDB pods and check the status details to ensure that the database is available and all coordinators are reachable.
    oc rsh sample-cluster-log-1 /bin/fdbcli
    To check the detailed status of the FDB pod, run fdbcli to enter the FoundationDB command-line interface, then run the following command at the fdb> prompt:
    status details
    - If you get a message that is similar to Could not communicate with a quorum of coordination servers, run the coordinators command with the IP addresses specified in the error message as input:
      oc get pod -o wide | grep storage
      > coordinators IP-ADDRESS-1:4500:tls IP-ADDRESS-2:4500:tls IP-ADDRESS-3:4500:tls
      If this step does not resolve the problem, proceed to the workaround steps.
    - If you get a different message, such as Recruiting new transaction servers, proceed to the workaround steps.
- Resolving the problem
- To resolve this issue, restart the FoundationDB pods.
  Required role: To complete this task, you must be a cluster administrator.
  - Restart the FoundationDB cluster pods:
    oc get fdbcluster
    oc get po | grep ${CLUSTER_NAME} | grep -v backup | awk '{print $1}' | xargs oc delete po
    Replace ${CLUSTER_NAME} in the command with the name of your fdbcluster instance.
  - Restart the FoundationDB operator pods:
    oc get po | grep fdb-controller | awk '{print $1}' | xargs oc delete po
  - After the pods finish restarting, check to ensure that FoundationDB is available.
    - Check the FoundationDB status:
      oc get fdbcluster -o yaml | grep fdbStatus
      The returned status must be Complete.
    - Check to ensure that the database is available:
      oc rsh sample-cluster-log-1 /bin/fdbcli
      If the database is still not available, complete the following steps:
      - Log in to the ibm-fdb-controller pod.
      - Run the fix-coordinator script:
        kubectl fdb fix-coordinator-ips -c ${CLUSTER_NAME} -n ${PROJECT_CPD_INST_OPERATORS}
        Replace ${CLUSTER_NAME} in the command with the name of your fdbcluster instance.
Persistent volume claims with the WaitForFirstConsumer volume binding mode are flagged by the installation health checks
Applies to: 5.3.0
The following persistent volume claims are created with the WaitForFirstConsumer volume binding mode:
- ibm-cs-postgres-backup
- ibm-zen-objectstore-backup-pvc
Both persistent volume claims will remain in the Pending state until you back up your IBM Software Hub installation. This behavior is expected. However, when you run the cpd-cli health operands command, the Persistent Volume Claim Health check fails.
If the health check returns other persistent volume claims, you must investigate further to determine why those persistent volume claims are pending. However, if only the following persistent volume claims are returned, you can ignore the Failed result:
- ibm-cs-postgres-backup
- ibm-zen-objectstore-backup-pvc
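To check which persistent volume claims are pending, you can list them directly; a minimal check:
  # Only the two backup PVCs listed above are expected to be Pending before the first backup
  oc get pvc -n ${PROJECT_CPD_INST_OPERANDS} | grep Pending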
Node pinning is not applied to postgresql pods
Applies to: 5.3.0
If you use node pinning to schedule pods on specific nodes, and your environment includes
postgresql pods, the node affinity settings are not applied to the
postgresql pods that are associated with your IBM Software Hub deployment.
The resource specification injection (RSI) webhook cannot patch postgresql pods
because the EDB Postgres operator uses a
PodDisruptionBudget resource to limit the number of concurrent disruptions to
postgresql pods. The PodDisruptionBudget resource prevents
postgresql pods from being evicted.
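To see the effect in your environment, you can list the PodDisruptionBudget resources and check which nodes the postgresql pods are currently scheduled on; a minimal sketch:
  # Inspect the budgets created by the EDB Postgres operator and the current pod placement
  oc get pdb -n ${PROJECT_CPD_INST_OPERANDS}
  oc get po -n ${PROJECT_CPD_INST_OPERANDS} -o wide | grep postgres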
The ibm-nginx deployment does not scale fast enough when automatic scaling is configured
Applies to: 5.3.0
If you configure automatic scaling for IBM Software Hub, the ibm-nginx deployment
might not scale fast enough. Some symptoms include:
- Slow response times
- High CPU requests are throttled
- The deployment scales up and down even when the workload is steady
This problem typically occurs when you install watsonx Assistant or watsonx™ Orchestrate.
- Resolving the problem
- If you encounter the preceding symptoms, you must manually scale the ibm-nginx deployment:
  oc patch zenservice lite-cr \
  --namespace=${PROJECT_CPD_INST_OPERANDS} \
  --type merge \
  --patch '{"spec": { "Nginx": { "name": "ibm-nginx", "kind": "Deployment", "container": "ibm-nginx-container", "replicas": 5, "minReplicas": 2, "maxReplicas": 11, "guaranteedReplicas": 2, "metrics": [ { "type": "Resource", "resource": { "name": "cpu", "target": { "type": "Utilization", "averageUtilization": 529 } } } ], "resources": { "limits": { "cpu": "1700m", "memory": "2048Mi", "ephemeral-storage": "500Mi" }, "requests": { "cpu": "225m", "memory": "920Mi", "ephemeral-storage": "100Mi" } }, "containerPolicies": [ { "containerName": "*", "minAllowed": { "cpu": "200m", "memory": "256Mi" }, "maxAllowed": { "cpu": "2000m", "memory": "2048Mi" }, "controlledResources": [ "cpu", "memory" ], "controlledValues": "RequestsAndLimits" } ] } }}'
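After you apply the patch, you can watch the deployment converge on the new replica settings; a minimal check:
  # Wait for the rollout to finish, then confirm the replica count
  oc rollout status deployment/ibm-nginx -n ${PROJECT_CPD_INST_OPERANDS}
  oc get deployment ibm-nginx -n ${PROJECT_CPD_INST_OPERANDS}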
Uninstalling IBM watsonx services does not remove the IBM watsonx experience
Applies to: 5.3.0
After you uninstall watsonx.ai™ or watsonx.governance™, the IBM watsonx experience is still available in the web client even though there are no services that are specific to the IBM watsonx experience.
- Resolving the problem
- To remove the IBM watsonx experience from the web client, an instance administrator must run the following command:
  oc delete zenextension wx-perspective-configuration \
  --namespace=${PROJECT_CPD_INST_OPERANDS}
Backup and restore issues
Issues that apply to several backup and restore methods
- Backup issues
- Review the following issues before you create a backup. Do the workarounds that apply to your environment.
- Restore issues
- Review the following issues before you restore a backup. Do the workarounds that apply
to your environment.
- During a restore, the IBM Master Data Management CR fails with an error stating that a conditional check failed
- After a restore, OperandRequest timeout error in the ZenService custom resource
- SQL30081N RC 115,*,* error for Db2 selectForReceiveTimeout function after instance restore (Fixed in 5.2.1)
- Restore fails and displays postRestoreViaConfigHookRule error in Data Virtualization
- Error 404 displays after backup and restore in Data Virtualization
- The restore process times out while waiting for the ibmcpd status check to complete
- Watson OpenScale fails after restore due to Db2 (db2oltp) or Db2 Warehouse (db2wh) configuration
- Watson OpenScale fails after restore for some Watson Machine Learning deployments
- FoundationDB cluster is stuck in stage_mirror: post-restore state after restore from Online backup
Backup and restore issues with the OADP utility
- Backup issues
- Review the following issues before you create a backup. Do the workarounds that apply to your environment.
- Restore issues
- Review the following issues after you restore a backup. Do the workarounds that apply to your environment.
Backup and restore issues with IBM Fusion
- Backup issues
- Review the following issues before you create a backup. Do the workarounds that apply to your environment.
- Restore issues
- Do the workarounds that apply to your environment after you restore a backup.
Backup and restore issues with NetApp Trident protect
- Backup issues
- Review the following issues before you create a backup. Do the workarounds that apply to your environment.
- Restore issues
- Review the following issues after you restore a backup. Do the workarounds that apply to your environment.
Backup and restore issues with Portworx
- Backup issues
- Review the following issues before you create a backup. Do the workarounds that apply to your environment.
ObjectBucketClaim is not supported by the OADP utility
Applies to: 5.3.0
Applies to: Backup and restore with the OADP utility
- Diagnosing the problem
- If an ObjectBucketClaim is created in an IBM Software Hub instance, it is not included when you create a backup.
- Cause of the problem
- OADP does not support backup and restore of ObjectBucketClaim.
- Resolving the problem
- Services that provide the option to use ObjectBuckets must ensure that the ObjectBucketClaim is in a separate namespace and backed up separately.
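To check whether any ObjectBucketClaims exist in the instance project before you plan a backup, you can list them; a minimal sketch, assuming the ObjectBucketClaim CRD is installed on the cluster:
  # Any claims listed here are not included in an OADP backup of the instance
  oc get objectbucketclaims -n ${PROJECT_CPD_INST_OPERANDS}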
After a restore, OperandRequest timeout error in the ZenService custom resource
Applies to: 5.3.0
Applies to: All backup and restore methods
- Diagnosing the problem
- Get the status of the ZenService YAML:
  oc get zenservice lite-cr -n ${PROJECT_CPD_INST_OPERATORS} -o yaml
  In the output, you see the following error:
  ... zenMessage: '5.1.3/roles/0010-infra has failed with error: "OperandRequest" "zen-ca-operand-request": Timed out waiting on resource' ...
  Check for failing operandrequests:
  oc get operandrequests -A
  For failing operandrequests, check their conditions for constraints not satisfiable messages:
  oc describe -n ${PROJECT_CPD_INST_OPERATORS} <opreq-name>
- Subscription wait operations timed out. The problematic subscriptions show an error
similar to the following
example:
'constraints not satisfiable: clusterserviceversion ibm-db2aaservice-cp4d-operator.v5.2.0 exists and is not referenced by a subscription, @existing/cpd-operators//ibm-db2aaservice-cp4d-operator.v5.2.0 and ibm-db2aaservice-cp4d-operator-catalog/cpd-operators/v5.2/ibm-db2aaservice-cp4d-operator.v5.2.0 originate from package ibm-db2aaservice-cp4d-operator, subscription ibm-db2aaservice-cp4d-operator requires ibm-db2aaservice-cp4d-operator-catalog/cpd-operators/v5.2/ibm-db2aaservice-cp4d-operator.v5.2.0, subscription ibm-db2aaservice-cp4d-operator exists'This problem is a known issue with Red Hat Operator Lifecycle Manager. For details, see Failed to install or upgrade operator with warning 'clusterserviceversion is not referenced by a subscription'.
- Resolving the problem
- Do the following steps:
  - Delete the problematic clusterserviceversions and subscriptions, and restart the Operand Deployment Lifecycle Manager (ODLM) pod.
    For details, follow the steps in the troubleshooting document cloud-native-postgresql operator is installed with the certified-operators catalogsource.
  - Delete the IBM Software Hub instance projects (namespaces).
    For details, see Cleaning up the cluster before a restore.
  - Retry the restore.
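To identify the problematic subscriptions before you delete anything, you can list the subscriptions and their resolved cluster service versions in the operators project; a minimal sketch (the custom-columns fields are standard OLM status fields):
  # Subscriptions whose STATE is not AtLatestKnown, or whose CSV is empty, are candidates for cleanup
  oc get subscription.operators.coreos.com -n ${PROJECT_CPD_INST_OPERATORS} \
    -o custom-columns=NAME:.metadata.name,CSV:.status.installedCSV,STATE:.status.state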
Restore of IBM Master Data Management fails if RabbitMQ Helm secrets are not excluded before creating the backup
Applies to: 5.3.0
Applies to: Offline backup and restore methods
- Diagnosing the problem
- While performing an offline restore of the IBM Master Data Management service, the restore operation fails due to a timeout while waiting for the Rabbitmqcluster to enter a Completed state. The Rabbitmqcluster becomes stuck, which blocks the IBM Master Data Management CR (mdm-cr) from progressing.
- Cause of the problem
- While performing an offline restore, the Rabbitmqcluster operator pod does not get reconciled because the RabbitMQ Helm secrets were not excluded before creating the backup.
- Resolving the problem
- To resolve this issue, exclude the RabbitMQ Helm secrets before creating the backup. Patch the RabbitMQ Helm secrets with exclude labels by running the following commands:
  oc label secret -l owner=helm,name=<rabbitmq_cr_name> velero.io/exclude-from-backup=true -n <CPD-OPERAND-NAMESPACE>
  oc label role,rolebinding -l app.kubernetes.io/managed-by=Helm,release=<rabbitmq_cr_name> velero.io/exclude-from-backup=true -n <CPD-OPERAND-NAMESPACE>
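To confirm that the exclude labels were applied before you create the backup, you can list the labeled resources; a minimal check:
  # Resources that carry this label are skipped by the backup
  oc get secret,role,rolebinding -l velero.io/exclude-from-backup=true -n <CPD-OPERAND-NAMESPACE>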
During a restore, the IBM Master Data Management CR fails with an error stating that a conditional check failed
Applies to: 5.3.0
Applies to: All backup and restore methods
- Diagnosing the problem
- A restore of the IBM Master Data Management service fails with an error similar to the following example:
  The conditional check 'all_services_available and ( history_enabled | bool or parity_enabled | bool )' failed. The error was: error while evaluating conditional (all_services_available and ( history_enabled | bool or parity_enabled | bool )): 'history_enabled' is undefined The error appears to be in '/opt/ansible/roles/4.10.23/mdm_cp4d/tasks/check_services.yml': line 202, column 5, but may be elsewhere in the file depending on the exact syntax problem.
- Resolving the problem
- No action is required. The error will be resolved automatically during subsequent operator reconciliation runs.
Watson OpenScale fails after restore due to Db2 (db2oltp) or Db2 Warehouse (db2wh) configuration
Applies to: 5.3.0
Applies to: All backup and restore methods
- Diagnosing the problem
- After you restore, Watson OpenScale fails due to memory constraints. You might see Db2 (db2oltp) or Db2 Warehouse (db2wh) instances that return 404 errors and pod failures where scikit pods are unable to connect to Apache Kafka.
- Cause of the problem
- The root cause is typically insufficient memory or temporary table page size settings, which are critical for query execution and service stability.
- Resolving the problem
- Ensure that the Db2 (db2oltp) or Db2 Warehouse (db2wh) instance is configured with adequate memory resources. Specifically:
  - Set the temporary table page size to at least 700 GB during instance setup or reconfiguration.
- Monitor pod health and Apache Kafka connectivity to verify that dependent services recover after memory allocation is corrected.
Watson OpenScale fails after restore for some Watson Machine Learning deployments
Applies to: 5.3.0
Applies to: All backup and restore methods
- Diagnosing the problem
- After restoring an online backup, automatic payload logging in Watson OpenScale fails for certain Watson Machine Learning deployments.
- Cause of the problem
- The failure is caused by a timing issue during the restore process. Specifically, the Watson OpenScale automatic setup fails because Watson Machine Learning runtime pods are unable to connect to the underlying Apache Kafka service. This connectivity issue occurs because the pods start before Apache Kafka is fully available.
- Resolving the problem
- To resolve the issue, restart all wml-dep pods after you restore. This ensures proper Apache Kafka connectivity and allows Watson OpenScale automatic setup to complete successfully:
  - List all wml-dep pods:
    oc get pods -n=${PROJECT_CPD_INST_OPERANDS} | grep wml-dep
  - Run the following command for each wml-dep pod to delete it and trigger it to restart:
    oc delete pod <podname> -n=${PROJECT_CPD_INST_OPERANDS}
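If there are many wml-dep pods, you can delete them in a single pass instead of one at a time; a one-line alternative to the two steps above:
  # The deployments recreate the pods automatically after deletion
  oc get pods -n ${PROJECT_CPD_INST_OPERANDS} --no-headers | grep wml-dep | awk '{print $1}' | xargs oc delete pod -n ${PROJECT_CPD_INST_OPERANDS}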
FoundationDB cluster is stuck in stage_mirror: post-restore state after restore from Online backup
Applies to: 5.3.0
Applies to: All backup and restore methods
- Diagnosing the problem
- After you complete a restore from an online backup, the FoundationDB cluster gets stuck in the post-restore stage with an InProgress status.
  oc get fdbcluster wkc-foundationdb-cluster --namespace wkc -o json | jq .status
  { "backup_status": { "backup_phase": "", "backup_stage": "post-restore", "cpdbr_backup_index": 0, "cpdbr_backup_name": "", "cpdbr_backup_start": 0, "cpdbr_init_time": 0, "fdb_last_up_time": 0, "fdb_up_start_time": 0, "last_completion_time": 0, "last_update_time": 0 }, "coordinators_status": { "last_query_time": 0, "last_update_time": 0 }, "fdbStatus": "InProgress", "restore_phase": "New Restore job created", "shutdownStatus": " ", "stage_mirror": "post-restore" }
  The restore operation fails, and you see the following error message in the restore logs:
  Restoring backup to version: 125493581420
  ERROR: Attempted to restore into a non-empty destination database
  Fatal Error: Attempted to restore into a non-empty destination database
-
The issue is caused by a timing issue during the post-restore process. When the restore process completes, FoundationDB pods start up and lineage ingestion services (specifically
wdp-kg-ingestion-serviceandwkc-data-lineage-service) also begin writing to the database before the FoundationDB restore job can lock it. This situation causes a race condition. It results in the database being non-empty when the FoundationDB restore job attempts to clear and restore data, which causes the restore operation to fail. - Resolving the problem
-
- Use the following commands to put the
wkcoperator in maintenance mode.cpd-cli manage update-cr --component=wkc --cpd_instance_ns=<OPERAND_NS> --patch='{"spec":{"ignoreForMaintenance":true}}' - Use the following commands to stop any lineage ingestion
job:
oc scale deploy wdp-kg-ingestion-service --replicas=0 -n <namespace> oc scale deploy wkc-data-lineage-service --replicas=0 -n <namespace> - Edit the FoundationDB
custom resource (CR):
oc edit fdbcluster wkc-foundationdb-cluster -n <namespace>- Force a shut down of the FoundationDB cluster by
setting
spec.shutdownto"force". - Check that the post-restore tag is present in the CR.
- Force a shut down of the FoundationDB cluster by
setting
- Wait for all FoundationDB pods to stop.
- Start the FoundationDB
cluster again by editing the FoundationDB CR and changing
spec.shutdownfrom"force"to"false". - Verify the restore job completes successfully by checking the logs. The job should
not have any
unclean databaseerrors. - Use the following command to verify FoundationDB status reaches
Completedstatus:oc get fdbcluster wkc-foundationdb-cluster -n <namespace> -o json | jq .status.fdbStatus - Use the following commands to scale up the lineage ingestion services:
oc scale deploy wdp-kg-ingestion-service --replicas=1 -n <namespace> oc scale deploy wkc-data-lineage-service --replicas=1 -n <namespace> - Verify that the
wkcCR reachesCompletedstatus.
- Use the following commands to put the
Db2 backup fails at the Hook: br-service hooks/pre-backup step
Applies to: 5.3.0
Applies to: Backup and restore with IBM Fusion
- Diagnosing the problem
- In the cpdbr-oadp.log file, you see messages like in the following example:
time=<timestamp> level=info msg=podName: c-db2oltp-5179995-db2u-0, podIdx: 0, container: db2u, actionIdx: 0, commandString: ksh -lc 'manage_snapshots --action suspend --retry 3', command: [sh -c ksh -lc 'manage_snapshots --action suspend --retry 3'], onError: Fail, singlePodOnly: false, timeout: 20m0s func=cpdbr-oadp/pkg/quiesce.executeCommand file=/go/src/cpdbr-oadp/pkg/quiesce/ruleexecutor.go:767 time=<timestamp> level=info msg=cmd stdout: func=cpdbr-oadp/pkg/quiesce.executeCommand file=/go/src/cpdbr-oadp/pkg/quiesce/ruleexecutor.go:823 time=<timestamp> level=info msg=cmd stderr: [<timestamp>] - INFO: Setting wolverine to disable Traceback (most recent call last): File "/usr/local/bin/snapshots", line 33, in <module> sys.exit(load_entry_point('db2u-containers==1.0.0.dev1', 'console_scripts', 'snapshots')()) File "/usr/local/lib/python3.9/site-packages/cli/snapshots.py", line 35, in main snap.suspend_writes(parsed_args.retry) File "/usr/local/lib/python3.9/site-packages/snapshots/snapshots.py", line 86, in suspend_writes self._wolverine.toggle_state(enable=False, message="Suspend writes") File "/usr/local/lib/python3.9/site-packages/utils/wolverine/wolverine.py", line 73, in toggle_state self._toggle_state(state, message) File "/usr/local/lib/python3.9/site-packages/utils/wolverine/wolverine.py", line 77, in _toggle_state self._cmdr.execute(f'wvcli system {state} -m "{message}"') File "/usr/local/lib/python3.9/site-packages/utils/command_runner/command.py", line 122, in execute raise CommandException(err) utils.command_runner.command.CommandException: Command failed to run:ERROR:root:HTTPSConnectionPool(host='localhost', port=9443): Read timed out. (read timeout=15) - Cause of the problem
- The Wolverine high availability monitoring process was in a RECOVERING state before the backup was taken.
  Check the Wolverine status by running the following command:
  wvcli system status
  Example output:
  ERROR:root:REST server timeout: https://localhost:9443/status
  ERROR:root:Retrying Request: https://localhost:9443/status
  ERROR:root:REST server timeout: https://localhost:9443/status
  ERROR:root:Retrying Request: https://localhost:9443/status
  ERROR:root:REST server timeout: https://localhost:9443/status
  ERROR:root:Retrying Request: https://localhost:9443/status
  ERROR:root:REST server timeout: https://localhost:9443/status
  HA Management is RECOVERING at <timestamp>.
  The Wolverine log file /mnt/blumeta0/wolverine/logs/ha.log shows errors like in the following example:
  <timestamp> [ERROR] <MainProcess:11490> [wolverine.ha.loop(loop.py:414)] - check_and_recover: unhealthy_dm_set = {('c-db2oltp-5179995-db2u-0', 'node')}
  <timestamp> [ERROR] <MainProcess:11490> [wolverine.ha.loop(loop.py:416)] - (c-db2oltp-5179995-db2u-0, node) : not OK
  <timestamp> [ERROR] <MainProcess:11490> [wolverine.ha.loop(loop.py:421)] - check_and_recover: unhealthy_dm_names = {'node'}
- Do the following steps:
  - Re-initialize Wolverine:
    wvcli system init --force
  - Wait until the Wolverine status is RUNNING. Check the status by running the following command:
    wvcli system status
  - Retry the backup.
Watson Discovery fails after restore with opensearch and post_restore unverified components
Applies to: 5.3.0
Applies to: Backup and restore with IBM Fusion
- Diagnosing the problem
- After you restore, Watson Discovery becomes stuck with the following components listed as unverifiedComponents:
  unverifiedComponents:
  - opensearch
  - post_restore
  Additionally, the OpenSearch client pod might show an unknown container status, similar to the following example:
  NAME READY STATUS RESTARTS AGE
  wd-discovery-opensearch-client-000 0/1 ContainerStatusUnknown 0 11h
- Cause of the problem
- The post_restore component depends on the opensearch component being verified. However, the OpenSearch client pod is not running, which prevents verification and causes the restore process to stall.
- Resolving the problem
- Manually delete the OpenSearch client pod to allow it to restart:
  $ oc delete -n ${PROJECT_CPD_INST_OPERANDS} pod wd-discovery-opensearch-client-000
  After the pod is restarted and verified, the post_restore component should complete the verification process.
Restore process stuck at db2u-aux-ckpt-cm-child.db2ucluster-ready/readyState
Applies to: 5.3.0
Applies to: Restore with IBM Fusion
- Diagnosing the problem
-
When you use IBM Fusion to restore to the same cluster, the restore process fails when it times out while trying to verify that the db2ucluster or db2uinstance resources are in Ready status. You might receive something similar to the following error message:
  There was an error when processing the job in the Transaction Manager service. The underlying error was: 'Execution of workflow restore of recipe ibmcpd-tenant completed. Number of failed commands: 1, last failed command: "CheckHook/db2u-aux-ckpt-cm-child.db2ucluster-ready/readyState"'.
-
IBM Fusion fails to restore because of a combination of factors:
- The
db2 veleroplugin is not present in IBM Fusion DPA. - The IBM Fusion check
hook requires both
db2uinstanceanddb2uclusterresources to be present, and it cannot skip checks when resources don't exist in the cluster
- The
- Workaround
-
You have two options to resolve the issue:
- Update IBM Fusion from version 2.10 to version 2.10 with hotfixes or to version 2.10.1, and then restart the backup and restore.
- Reconfigure the DataProtectionApplication (DPA) and start a fresh backup and restore:
  - Configure the DPA from the source cluster by following the process for creating a DPA custom resource in Installing and configuring software on the source cluster for backup and restore with IBM Fusion.
  - Use IBM Fusion to create another backup.
  - Start another restore from the new backup.
Restore process stuck at zenextensions-patch-ckpt-cm step
Applies to: 5.3.0
Applies to: Backup and restore with IBM Fusion 2.10 (without hotfixes)
- Diagnosing the problem
- IBM Fusion fails to restore because the process gets stuck at the zenextensions-patch-ckpt-cm step. The restore fails at the following stage:
  ExecHook/zenextensions-patch-ckpt-cm-zen-1-child.zenextensions-hooks/force-reconcile-zenextensions
- Cause of the problem
- The IBM Fusion exec hook cannot handle trailing whitespace, and the restore process gets stuck waiting for the zenextensions hooks to complete.
- Workarounds
- You have two options to resolve the issue:
  - Update IBM Fusion from version 2.10 to version 2.10 with hotfixes or to version 2.10.1, and then restart the backup and restore.
  - Use the fusion-resume-restore script.
    You can use the fusion-resume-restore.sh script to continue the restore process from the point where it failed.
    - Check that you have the prerequisites to run the script:
      - yq 4.42.1 or later. You can download and install it from: https://github.com/mikefarah/yq.
      - jq-1.7 or later
    - Download the fusion-resume-restore script from https://github.com/IBM/cpd-cli/blob/master/cpdops/files/fusion-resume-restore.sh.
    - Ensure you have write access to /tmp/fusion-resume-restore.
    - Run the script from the hub:
      ./fusion-resume-restore.sh <fusion-restore-name> ${PROJECT_CPD_INST_OPERATORS}
      The script displays the steps required to restore and their sequence numbers.
    - When the script asks you to Enter the key of the workflow hook/group to resume from, select the index number that excludes the zenextensions-patch-ckpt-cm step. For example:
      Enter the key of the workflow hook/group to resume from (0-101): 97
      Selected workflow (index=97): "hook: zenextensions-patch-ckpt-cm-zen-1-child.zenextensions-hooks/force-reconcile-zenextensions"
    - When the script asks Resume the Fusion restore now?, choose a response based on whether you are restoring to the same or a different cluster.
      - If restoring to the same cluster, reply y.
      - If restoring to a different cluster, reply n.
        The script provides instructions to manually apply the CRs to resume the restore.
OADP 1.5.x Certificate Issue with Fusion BSL
- Diagnosing the problem
- If your cluster is running OpenShift Container Platform 4.19 with IBM Fusion 2.11 and OADP 1.5.1 is installed or upgraded, the backup storage location (BSL) might enter an Unavailable phase.
- Resolving the problem
- Upgrade OADP to version 1.5.2.
Neo4j config maps missing during backup
Applies to: 5.3.0
Applies to: Backup with NetApp Trident protect
- Diagnosing the problem
- During the backup, the backup operation fails, and you see a pre-snapshot hook error similar to:
  Error checking for hook failures in backup '<backup-name>': command terminated with exit code 1
  When you inspect the logs, you see failures related to missing Neo4j configmaps:
  failed to validate dynamic online configmap '<>-aux-v2-ckpt-cm': not found
  failed to validate dynamic offline configmap '<>-aux-v2-br-cm': not found
  When you investigate the inventory ConfigMap, ibm-neo4j-inv-list-cm shows that it still contains entries referencing a Neo4j instance that has already been uninstalled or is no longer running.
  Pre-snapshot hooks attempt to execute backup commands for all entries listed in the inventory ConfigMap, so the backup fails when it encounters stale or non-existent entries.
- Cause of the problem
- The backup fails because the Neo4j inventory ConfigMap contains stale configmap references for Neo4j clusters that were previously installed but later removed.
  Backup hooks rely on this inventory to determine which Neo4j instances require online/offline checkpoint operations. When the inventory still includes configmaps that no longer exist in the cluster (such as <>-aux-v2-ckpt-cm and <>-aux-v2-br-cm), the hook execution fails, which causes the entire backup to fail.
  This makes the backup system behave as if the missing configmaps indicate a failure, even though the corresponding Neo4j instance was intentionally removed.
- Resolving the problem
- To resolve the problem, remove stale Neo4j entries from the inventory ConfigMap so that the backup does not attempt to run hooks against non-existent Neo4j instances.
  - View the inventory ConfigMap, and remove entries that look like the following examples:
    - name: mdm-neo-<id>-aux-v2-ckpt-cm
    - name: mdm-neo-<id>-aux-v2-br-cm
  - Edit the inventory ConfigMap to remove stale entries:
    oc edit cm ibm-neo4j-inv-list-cm -n zen
    Remove the MDM-related blocks from both the online and offline lists, which ensures that the backup executes hooks only for active Neo4j instances.
    For example, the following structure shows the stale entries:
    online:
    - name: data-lineage-neo4j-aux-v2-ckpt-cm
      namespace: zen
      priority-order: '200'
    - name: mdm-neo-1763773800090371-aux-v2-ckpt-cm
      namespace: zen
      priority-order: '200'
    offline:
    - name: data-lineage-neo4j-aux-v2-br-cm
      namespace: zen
      priority-order: '200'
    - name: mdm-neo-1763773800090371-aux-v2-br-cm
      namespace: zen
      priority-order: '200'
    And the following example shows the revised structure:
    online:
    - name: data-lineage-neo4j-aux-v2-ckpt-cm
      namespace: zen
      priority-order: '200'
    offline:
    - name: data-lineage-neo4j-aux-v2-br-cm
      namespace: zen
      priority-order: '200'
  - Rerun the backup process.
    The backup should now complete without pre-snapshot hook failures.
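Before you edit the inventory, you can cross-check which aux configmaps actually exist in the cluster so that you keep only entries that resolve; a minimal sketch, assuming the zen project used in the steps above:
  # List the checkpoint and backup/restore configmaps that are currently present
  oc get cm -n zen | grep -E "aux-v2-(ckpt|br)-cm"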
Backups and restores fail because of missing SCCs
Applies to: 5.3.0
Applies to: Backup and restore with NetApp Trident protect
- Diagnosing the problem
- Backups and restores with NetApp Trident protect can fail because of issues with the Security Context Constraint (SCC). These SCC issues can happen when Red Hat OpenShift AI is installed. For a list of services that use Red Hat OpenShift AI, see Installing Red Hat OpenShift AI.
  To diagnose this issue, use the following steps.
  - For backups
  - If backups fail, and it appears that SCCs did not get created, use these steps to diagnose the issue.
    - Use the following command to check the state of the NetApp Trident protect custom resource:
      oc get backup.protect.trident.netapp.io -n ${PROJECT_CPD_INST_OPERATORS}
      The custom resource shows an error similar to the following error message:
      NAME STATE ERROR AGE
      netapp-cx-gmc-120325-4-3 Failed VolumeBackupHandler failed with permanent error kopiaVolumeBackup failed for volume trident-protect/data-aiopenscale-ibm-aios-etcd-1-6f44559e00b0b23bbb077d0e57c45de2: permanent error 124m
    - Use the following commands to check the kopiavolumebackups jobs:
      oc get kopiavolumebackups -n ${PROJECT_CPD_INST_OPERATORS}
      oc get jobs -n trident-protect | grep <name> | grep "openshift.io/scc"
      If the pod's Security Context Constraint (SCC) is anything other than trident-protect-job, you must apply the workaround.
      For example, if you see openshift.io/scc: openshift-ai-llminferenceservice-multi-node-scc, you must apply the workaround.
  - For restores
  - If restores fail, and it appears that SCCs did not get created, use these steps to diagnose the issue.
    - Use the following command to check the state of the NetApp Trident protect custom resource:
      oc get backuprestore.protect.trident.netapp.io -n ${PROJECT_CPD_INST_OPERATORS}
      The custom resource shows an error similar to the following error message:
      NAME STATE ERROR AGE
      netapp-cx-gmc-120325-4-3 Failed VolumeRestoreHandler failed with permanent error kopiaVolumeRestore failed for volume trident-protect/data-aiopenscale-ibm-aios-etcd-1-6f44559e00b0b23bbb077d0e57c45de2: permanent error 124m
    - Use the following commands to check the KopiaVolumeRestore jobs:
      oc get kopiavolumerestores -n ${PROJECT_CPD_INST_OPERATORS}
      oc get jobs -n trident-protect | grep <name> | grep "openshift.io/scc"
      If the pod's Security Context Constraint (SCC) is anything other than trident-protect-job, you must apply the workaround.
      For example, if you see openshift.io/scc: openshift-ai-llminferenceservice-multi-node-scc, you must apply the workaround.
- Cause of the problem
- The restore fails because the KopiaVolumeRestore and kopiavolumebackups jobs are using an incorrect Security Context Constraint (SCC). Instead of using the expected trident-protect-job SCC for NetApp Trident protect, the jobs are picking up a different SCC with higher priority. In this case, openshift-ai-llminferenceservice-multi-node-scc gets used, which causes permission issues during the restore.
- Resolving the problem
- You must be a cluster administrator to use the workaround because SCCs require cluster-admin level permissions.
  Use the following commands to set the priority of trident-protect-job higher than the priority of openshift-ai-llminferenceservice-multi-node-scc, which is usually 11.
  oc adm policy add-scc-to-group trident-protect-job system:serviceaccounts:trident-protect
  oc patch scc trident-protect-job --type=merge -p '{"priority": 20}'
  If other SCCs are preventing the job pods from using trident-protect-job, you might need to increase its priority further.
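After you patch the SCC, you can confirm the new priority before you retry the backup or restore; a minimal check:
  # The value should match the priority that you set in the patch (20 in the example above)
  oc get scc trident-protect-job -o jsonpath='{.priority}'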
IBM Software Hub resources are not migrated
Applies to: 5.3.0
Applies to: Portworx asynchronous disaster recovery
- Diagnosing the problem
- When you use Portworx asynchronous disaster recovery, the migration finishes almost immediately, and no volumes or the expected number of resources are migrated. Run the following command:
  storkctl get migrations -n ${PX_ADMIN_NS}
  Tip: ${PX_ADMIN_NS} is usually kube-system.
  Example output:
  NAME CLUSTERPAIR STAGE STATUS VOLUMES RESOURCES CREATED ELAPSED TOTAL BYTES TRANSFERRED
  cpd-tenant-migrationschedule-interval-<timestamp> mig-clusterpair Final Successful 0/0 0/0 <timestamp> Volumes (0s) Resources (3s) 0
- Cause of the problem
- This problem occurs starting with stork 23.11.0. Backup exec rules are not run, and expected IBM Software Hub resources are not migrated.
- Resolving the problem
- To resolve the problem, downgrade stork to a version prior to 23.11.0. For more information about stork releases, see the stork Releases page.
  - Scale down the Portworx operator so that it doesn't reset manual changes to the stork deployment:
    oc scale -n ${PX_ADMIN_NS} deploy portworx-operator --replicas=0
  - Edit the stork deployment image version to a version prior to 23.11.0:
    oc edit deploy -n ${PX_ADMIN_NS} stork
  - If you need to scale up the Portworx operator, run the following command.
    Note: The Portworx operator will undo changes to the stork deployment and return to the original stork version.
    oc scale -n ${PX_ADMIN_NS} deploy portworx-operator --replicas=1
Prompt tuning fails after restoring watsonx.ai
Applies to: 5.3.0
Applies to: Backup and restore with the OADP utility
- Diagnosing the problem
- When you try to create a prompt tuning experiment, you see the following error message:
  An error occurred while processing prompt tune training.
- Resolving the problem
- Do the following steps:
  - Restart the caikit operator:
    oc rollout restart deployment caikit-runtime-stack-operator -n ${PROJECT_CPD_INST_OPERATORS}
    Wait at least 2 minutes for the cais fmaas custom resource to become healthy.
  - Check the status of the cais fmaas custom resource by running the following command:
    oc get cais fmaas -n ${PROJECT_CPD_INST_OPERANDS}
  - Retry the prompt tuning experiment.
Restic backup that contains dynamically provisioned volumes in Amazon Elastic File System fails during restore
Applies to: 5.3.0
Applies to: Backup and restore with the OADP utility
- Diagnosing the problem
- When trying to restore from an offline backup in Amazon Elastic File System, the restore process fails in the volume restore phase, and you might see something similar to the following error:
  Status: Failed Errors: 8 Warnings: 640 Action Errors: DPP_NAME: cpd-offline-tenant/restore-service-orchestrated-parent-workflow INDEX: 6 ACTION: restore-cpd-volumes ERROR: error: expected restore phase to be Completed, received PartiallyFailed
  In the Velero logs, you might see something similar to the following error:
  time="2025-12-04T18:14:44Z" level=error msg="Restic command fail with ExitCode: 1. Process ID is 2077, Exit error is: exit status 1" PodVolumeRestore=openshift-adp/cpd-tenant-vol-r-00xda-x44x-4xx0-a9b1 controller=PodVolumeRestore stderr=ignoring error for /index-16: lchown /host_pods/.../mount/index-16: operation not permitted
- Cause of the problem
- Restic backups with Amazon Elastic File System are not supported by Velero. Restic cannot properly restore file ownership and permissions on Amazon Elastic File System volumes because it uses a different protocol for ownership operations.
- Resolving the issue
- If you want to use Amazon Elastic File System for dynamically provisioned volumes, you must use Kopia backups instead of Restic. The OADP DataProtectionApplication (DPA) must use uploaderType: kopia.
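As a sketch of how that DPA setting might be applied (assumptions: OADP 1.3 or later, where the node agent configuration exposes uploaderType; a DPA named <dpa-name>; and the default openshift-adp project):
  # Assumed DPA name and project; verify the field path against your OADP version before applying
  oc patch dpa <dpa-name> -n openshift-adp --type merge -p '{"spec":{"configuration":{"nodeAgent":{"uploaderType":"kopia"}}}}'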
Restoring Data Virtualization fails with metastore not running or failed to connect to database error
Applies to: 5.3.0
Applies to: Online backup and restore with the OADP utility
- Diagnosing the problem
- View the status of the restore by running the following command:
  cpd-cli oadp tenant-restore status ${TENANT_BACKUP_NAME}-restore --details
  The output shows errors like in the following examples:
  time=<timestamp> level=INFO msg=Verifying if Metastore is listening SERVICE HOSTNAME NODE PID STATUS Standalone Metastore c-db2u-dv-hurricane-dv - - Not running
  time=<timestamp> level=ERROR msg=Failed to connect to BigSQL database * error performing op postRestoreViaConfigHookRule for resource dv (configmap=cpd-dv-aux-ckpt-cm): 1 error occurred: * error executing command su - db2inst1 -c '/db2u/scripts/bigsql-exec.sh /usr/ibmpacks/current/bigsql/bigsql/bigsql-cli/BIGSQL/package/scripts/bigsql-db2ubar-hook.sh -H POST -M RESTORE -L' (container=db2u podIdx=0 podName=c-db2u-dv-db2u-0 namespace=<namespace-name> auxMetaName=dv-aux component=dv actionIdx=0): command terminated with exit code 1
- Cause of the problem
- A timing issue causes restore posthooks to fail at the step where the posthooks check for the results of the db2 connect to bigsql command. The db2 connect to bigsql command has failed because bigsql is restarting at around the same time.
- Resolving the problem
- Run the following commands:
  export CPDBR_ENABLE_FEATURES=experimental
  cpd-cli oadp tenant-restore create ${TENANT_RESTORE_NAME}-cont \
  --from-tenant-backup ${TENANT_BACKUP_NAME} \
  --verbose \
  --log-level debug \
  --start-from cpd-post-restore-hooks
Offline backup fails with PartiallyFailed error
Applies to: 5.3.0
Applies to: Offline backup and restore with the OADP utility
- Diagnosing the problem
- In the Velero logs, you see errors like in the following example:
time="<timestamp>" level=error msg="Current BackupStorageLocations available/unavailable/unknown: 0/0/1)" controller=backup-storage-location logSource="/remote-source/velero/app/pkg/controller/backup_storage_location_controller.go:180" time="<timestamp>" level=error msg="error encountered while scanning stdout" backupLocation=oadp-operator/dpa-sample-1 cmd=/plugins/velero-plugin-for-aws controller=backup-sync error="read |0: file already closed" logSource="/remote-source /velero/app/pkg/plugin/clientmgmt/process/logrus_adapter.go:90" time="<timestamp>" level=error msg="Restic command fail with ExitCode: 1. Process ID is 906, Exit error is: exit status 1" logSource="/remote-source/velero/app/pkg/util/exec/exec.go:66" time="<timestamp>" level=error msg="pod volume backup failed: data path backup failed: error running restic backup command restic backup --repo=s3:http://minio-velero.apps.jctesti23.examplehost.example.com/velero/cpdbackup/restic/cpd-instance --pa ssword-file=/tmp/credentials/oadp-operator/velero-repo-credentials-repository-password --cache-dir=/scratch/.cache/restic . --tag=pod-uid=1ed9d52f-2f6d-4978-930a-4d8e30acced1 --tag=pvc-uid=76b76bc4-27d4-4369-886c-1272dfdf9ce9 --tag=volume=cc-home-p vc-vol --tag=backup=cpd-tenant-vol-485eef74-efbe-11ef-b2bd-00000a0b44c3 --tag=backup-uid=b55d6323-9875-4afe-b605-646250cbd55c --tag=ns=cpd-instance --tag=pod=cpdbr-vol-mnt --host=velero --json with error: exit status 3 stderr: {\"message_type\":\"e rror\",\"error\":{},\"during\":\"scan\",\"item\":\"/host_pods/1ed9d52f-2f6d-4978-930a-4d8e30acced1/volumes/kubernetes.io~nfs/pvc-76b76bc4-27d4-4369-886c-1272dfdf9ce9/.scripts/system\"}\n{\"message_type\":\"error\",\"error\":{},\"during\":\"archival \",\"item\":\".scripts/system\"}\n{\"message_type\":\"error\",\"error\":{\"Op\":\"open\",\"Path\":\"_global_/security/artifacts/metakey\",\"Err\":13},\"during\":\"archival\",\"item\":\"/host_pods/1ed9d52f-2f6d-4978-930a-4d8e30acced1/volumes/kuberne tes.io~nfs/pvc-76b76bc4-27d4-4369-886c-1272dfdf9ce9/_global_/security/artifacts/metakey\"}\nWarning: at least one source file could not be read\n" backup=oadp-operator/cpd-tenant-vol-485eef74-efbe-11ef-b2bd-00000a0b44c3 logSource="/remote-source/ve lero/app/pkg/podvolume/backupper.go:328" time="<timestamp>" level=error msg="pod volume backup failed: data path backup failed: error running restic backup command restic backup --repo=s3:http://minio-velero.apps.jctesti23.cp.fyre.ibm.com/velero/cpdbackup/restic/cpd-instance --pa ssword-file=/tmp/credentials/oadp-operator/velero-repo-credentials-repository-password --cache-dir=/scratch/.cache/restic . --tag=pod=cpdbr-vol-mnt --tag=pod-uid=1ed9d52f-2f6d-4978-930a-4d8e30acced1 --tag=pvc-uid=93e9e23c-d80a-49cc-80bb-31a36524e0d c --tag=volume=data-rabbitmq-ha-0-vol --tag=backup=cpd-tenant-vol-485eef74-efbe-11ef-b2bd-00000a0b44c3 --tag=backup-uid=b55d6323-9875-4afe-b605-646250cbd55c --tag=ns=cpd-instance --host=velero --json with error: exit status 3 stderr: {\"message_typ e\":\"error\",\"error\":{\"Op\":\"open\",\"Path\":\".erlang.cookie\",\"Err\":13},\"during\":\"archival\",\"item\":\"/host_pods/1ed9d52f-2f6d-4978-930a-4d8e30acced1/volumes/kubernetes.io~nfs/pvc-93e9e23c-d80a-49cc-80bb-31a36524e0dc/.erlang.cookie\"} \nWarning: at least one source file could not be read\n" backup=oadp-operator/cpd-tenant-vol-485eef74-efbe-11ef-b2bd-00000a0b44c3 logSource="/remote-source/velero/app/pkg/podvolume/backupper.go:328" - Cause of the problem
- The restic folder was deleted after backups were cleaned up (deleted). This problem is a Velero known issue. For more information, see velero does not recreate restic|kopia repository from manifest if its directories are deleted on s3.
- Resolving the problem
- Do the following steps:
  - Get the list of backup repositories:
    oc get backuprepositories -n ${OADP_OPERATOR_NAMESPACE} -o yaml
  - Check for old or invalid object storage URLs.
    - Check that the object storage path is in the backuprepositories custom resource.
    - Check that the <objstorage>/<bucket>/<prefix>/restic/<namespace>/config file exists.
      If the file does not exist, make sure that you do not share the same <objstorage>/<bucket>/<prefix> with another cluster, and specify a different <prefix>.
  - Delete backup repositories that are invalid for the following reasons:
    - The path does not exist anymore in the object storage.
    - The restic/<namespace>/config file does not exist.
    oc delete backuprepositories -n ${OADP_OPERATOR_NAMESPACE} <backup-repository-name>
Offline backup fails after watsonx.ai is uninstalled
Applies to: 5.3.0
Applies to: Offline backup and restore with the OADP utility
- Diagnosing the problem
- The problem occurs when you try to take an offline backup after watsonx.ai was uninstalled. The
backup process fails when post-backup hooks are run. In the
CPD-CLI*.log file, you see error messages like in the following
example:
time=<timestamp> level=info msg=cmd stderr: <timestamp> [emerg] 233346#233346: host not found in upstream "wx-inference-proxyservice:18888" in /nginx_data/extensions/upstreams/latest-510_watsonxaiifm-routes_ie_226.conf:10 nginx: [emerg] host not found in upstream "wx-inference-proxyservice:18888" in /nginx_data/extensions/upstreams/latest-510_watsonxaiifm-routes_ie_226.conf:10 nginx: configuration file /usr/local/openresty/nginx/conf/nginx.conf test failed func=cpdbr-oadp/pkg/quiesce.executeCommand file=/a/workspace/oadp-upload/pkg/quiesce/ruleexecutor.go:824 time=<timestamp> level=warning msg=failed to get exec hook JSON result for container=ibm-nginx-container podIdx=0 podName=ibm-nginx-fd79d5686-cdpnj namespace=latest-510 auxMetaName=lite-maint-aux component=lite-maint actionIdx=0 err=could not find JSON exec hook result func=cpdbr-oadp/pkg/quiesce.executeCommand file=/a/workspace/oadp-upload/pkg/quiesce/ruleexecutor.go:852 time=<timestamp> level=warning msg=no exec hook JSON result found for container=ibm-nginx-container podIdx=0 podName=ibm-nginx-fd79d5686-cdpnj namespace=latest-510 auxMetaName=lite-maint-aux component=lite-maint actionIdx=0 func=cpdbr-oadp/pkg/quiesce.executeCommand file=/a/workspace/oadp-upload/pkg/quiesce/ruleexecutor.go:855 time=<timestamp> level=info msg=exit executeCommand func=cpdbr-oadp/pkg/quiesce.executeCommand file=/a/workspace/oadp-upload/pkg/quiesce/ruleexecutor.go:860 time=<timestamp> level=error msg=command terminated with exit code 1 - Cause of the problem
- After watsonx.ai is
uninstalled, the
nginxconfiguration in the ibm-nginx pod is not cleared up, and the pod fails. - Resolving the problem
- Restart all ibm-nginx
pods.
oc delete pod \ -n=${PROJECT_CPD_INST_OPERANDS} \ -l component=ibm-nginx
Db2 Big SQL backup pre-hook and post-hook fail during offline backup
Applies to: 5.3.0
Applies to: Offline backup and restore with the OADP utility
- Diagnosing the problem
- In the db2diag logs of the Db2 Big SQL head pod, you see error messages such as in the following example when backup pre-hooks are running:
<timestamp> LEVEL: Event PID : 3415135 TID : 22544119580160 PROC : db2star2 INSTANCE: db2inst1 NODE : 000 HOSTNAME: c-bigsql-<xxxxxxxxxxxxxxx>-db2u-0 FUNCTION: DB2 UDB, base sys utilities, DB2StartMain, probe:5692 MESSAGE : ZRC=0xFFFFFBD0=-1072 SQL1072C The request failed because the database manager resources are in an inconsistent state. The database manager might have been incorrectly terminated, or another application might be using system resources in a way that conflicts with the use of system resources by the database manager. - Cause of the problem
- The Db2 database
was unable to start because of the error code
SQL1072C. As a result, thebigsql startcommand that runs as part of the post-backup hook hangs, which produces the timeout of the post-hook. The post-hook cannot succeed unless Db2 is brought back to a stable state and thebigsql startcommand runs successfully. The Db2 Big SQL instance is left in an unstable state. - Resolving the problem
- Do one or both of the following troubleshooting and cleanup procedures.Tip: For more information about the
SQL1072Cerror code and how to resolve it, see SQL1000-1999 in the Db2 documentation.- Remove all the database manager processes running under the Db2 instance ID
- Do the following steps:
- Log in to the Db2
Big SQL head
pod:
oc -n ${PROJECT_CPD_INST_OPERANDS} rsh $(oc -n ${PROJECT_CPD_INST_OPERANDS} | grep -i c-bigsql | grep -i db2u-0 | cut -d' ' -f 1) bash - Switch to the
db2inst1user:su - db2inst1 - List all the database manager processes that are running under
db2inst1:db2_ps - Remove these
processes:
kill -9 <process-ID>
- Log in to the Db2
Big SQL head
pod:
- Ensure that no other application is running under the Db2 instance ID, and then remove all resources owned by the Db2 instance ID
- Do the following steps:
- Log in to the Db2
Big SQL head
pod:
oc -n ${PROJECT_CPD_INST_OPERANDS} rsh $(oc -n ${PROJECT_CPD_INST_OPERANDS} | grep -i c-bigsql | grep -i db2u-0 | cut -d' ' -f 1) bash - Switch to the
db2inst1user:su - db2inst1 - List all IPC resources owned by
db2inst1:ipcs | grep db2inst1 - Remove these
resources:
ipcrm -[q|m|s] db2inst1
- Log in to the Db2
Big SQL head
pod:
OpenSearch operator fails during backup
Applies to: 5.3.0
Applies to: Offline backup and restore with the OADP utility
- Diagnosing the problem
- After restoring, verify the list of restored indices in OpenSearch. If only indices
that start with a dot, for example,
.ltstore, are present, and expected non-dot-prefixed indices are missing, this indicates the issue. - Cause of the problem
- During backup and restore, the OpenSearch operator restores
indices that start with either
"."or indices that do not start with".", but not both. This behavior affects Watson Discovery deployments where both types of indices are expected to be restored. - Resolving the problem
- Complete the following steps to resolve the issue:
  - Access the client pod:
    oc rsh -n ${PROJECT_CPD_INST_OPERANDS} wd-discovery-opensearch-client-000
  - Set the credentials and repository:
    user=$(find /workdir/internal_users/ -mindepth 1 -maxdepth 1 | head -n 1 | xargs basename)
    password=$(cat "/workdir/internal_users/${user}")
    repo="cloudpak"
  - Get the latest snapshot:
    last_snapshot=$(curl --retry 5 --retry-delay 5 --retry-all-errors -k -X GET "https://${user}:${password}@${OS_HOST}/_cat/snapshots/${repo}?h=id&s=end_epoch" | tail -n1)
  - Check that the latest snapshot was saved:
    echo $last_snapshot
  - Restore the snapshot:
    curl -k -X POST "https://${user}:${password}@${OS_HOST}/_snapshot/${repo}/${last_snapshot}/_restore?wait_for_completion=true" \
    -d '{"indices": "-.*","include_global_state": false}' \
    -H 'Content-Type: application/json'
    This command can take a while to run before output is shown. After it completes, you'll see output similar to the following example:
    { "snapshot": { "snapshot": "cloudpak_snapshot_2025-06-10-13-30-45", "indices": [ "966d1979-52e8-6558-0000-019759db7bdc_notice", "b49b2470-70c3-4ba1-9bd2-a16d72ffe49f_curations", ... ], "shards": { "total": 142, "failed": 0, "successful": 142 } } }
    This restores all indices, including both dot-prefixed and non-dot-prefixed ones.
Security issues
Security scans return an Inadequate Account Lockout Mechanism message
Applies to: 5.3.0 and later
- Diagnosing the problem
-
If you run a security scan against IBM Software Hub, the scan returns the following message.
Inadequate Account Lockout Mechanism
- Resolving the problem
-
This is by design. It is strongly recommended that you use an enterprise-grade password management solution, such as SAML SSO or an LDAP provider, for password management.
The Kubernetes version information is disclosed
Applies to: 5.3.0 and later
- Diagnosing the problem
- If you run an Aqua Security scan against your cluster, the scan reports that the Kubernetes version information is disclosed.
- Resolving the problem
- This is expected based on the following solution document from Red Hat OpenShift Customer Portal: Hide kubernetes /version API endpoint in OpenShift Container Platform 4.
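If you want to see the information that the scan flags, you can query the version endpoint yourself; a minimal check:
  # Returns the Kubernetes version details that scanners report as disclosed
  oc get --raw /version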