Known issues and limitations for IBM Software Hub
The following issues apply to the IBM® Software Hub platform. Each issue includes information about the releases that it applies to. If the issue was fixed in a refresh, that information is also included.
- Customer-reported issues
- General issues
- Installation and upgrade issues
- Backup and restore issues
- Security issues
The following issues apply to IBM Software Hub services.
Customer-reported issues
Issues that are found after the release are posted on the IBM Support site.
General issues
- After rebooting a cluster that uses OpenShift Data Foundation storage, some IBM Software Hub services aren't functional
- The create-physical-location command fails if your load balancer timeout settings are too low
- The delete-physical-location command fails
- The Get authorization token endpoint returns Unauthorized when the username contains special characters
- The wml service key does not work and cpd-cli health commands must use the --services option
- The cpd-cli health cluster command fails on ROSA with hosted control planes
- Health check for OpenPages fails with 500 error if service instance is installed on a watsonx.governance environment
- Exports with improper JSON formatting persist on export list despite deletion attempt
After rebooting a cluster that uses OpenShift Data Foundation storage, some IBM Software Hub services aren't functional
Applies to: 5.1.0 and later
- Diagnosing the problem
- After rebooting the cluster, some IBM Software Hub custom resources remain in the InProgress state. For more information about this problem, see Missing NodeStageVolume RPC call blocks new pods from going into Running state in the Red Hat® OpenShift® Data Foundation 4.1.4 release notes.
- Workaround
- Do the following steps:
- Find the nodes that have pods that are in an Error state:
oc get po -n ${PROJECT_CPD_INST_OPERANDS} -o wide | grep -v -P "Completed|(\d+)\/\1"
- Mark each affected node as unschedulable:
oc adm cordon <node_name>
- Delete the affected pods:
oc get pod | grep -Ev "Comp|0/0|1/1|2/2|3/3|4/4|5/5|6/6|7/7" | awk '{print $1}' | xargs oc delete po --force=true --grace-period=0
- Mark each node as schedulable again:
oc adm uncordon <node_name>
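If many nodes are affected, the following shell sketch chains the same cordon, delete, and uncordon steps. The node-name column position and the pod filter are assumptions carried over from the commands above, so review the node list before you run it.
# Collect the nodes that still host pods in an Error or partially ready state.
NODES=$(oc get po -n ${PROJECT_CPD_INST_OPERANDS} -o wide | grep -v -P "Completed|(\d+)\/\1" | awk '{print $7}' | sort -u)
# Stop scheduling on the affected nodes.
for node in ${NODES}; do oc adm cordon ${node}; done
# Force-delete the stuck pods so that they are recreated.
oc get pod -n ${PROJECT_CPD_INST_OPERANDS} | grep -Ev "Comp|0/0|1/1|2/2|3/3|4/4|5/5|6/6|7/7" | awk '{print $1}' | xargs oc delete po -n ${PROJECT_CPD_INST_OPERANDS} --force=true --grace-period=0
# Allow scheduling on the nodes again.
for node in ${NODES}; do oc adm uncordon ${node}; done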
The create-physical-location command fails if your
load balancer timeout settings are too low
Applies to: 5.1.0
Fixed in: 5.1.1
If your load balancer timeout settings are too low, the connection between the primary cluster
and the remote physical location might be terminated before the API call that is issued by the
cpd-cli
manage
create-physical-location command completes.
- Diagnosing the problem
- The cpd-cli manage create-physical-location command returns the following error:
TASK [utils : fail] ************************************************************
fatal: [localhost]: FAILED! => {"changed": false, "msg": "The <location-name> physical location was not registered. There might be a problem connecting to the hub or to the physical location service on the hub. Wait a few minutes and try to update the physical location again. If the problem persists, review the zen-core-api pods on the hub for issues related to the v1/physical_locations/<location-name> endpoint."}
In addition, the log file includes the following message:
"msg": "Nginx extension API returned: 504"
- Resolving the problem
-
- Check the current timeout settings on your cluster. For example, if you use HAProxy, run:
grep timeout /etc/haproxy/haproxy.cfg
The command returns output with the following format:
timeout http-request 10s
timeout queue 1m
timeout connect 10s
timeout client 5m
timeout server 5m
timeout http-keep-alive 10s
timeout check 10s
- If the client timeout or the server timeout is less than 5 minutes (5m), follow the directions in Changing load balancer timeout settings to increase the timeout to at least 5 minutes.
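As an illustration only, a HAProxy defaults section with the client and server timeouts raised to 5 minutes might look like the following snippet; the other values are taken from the sample output above, and your actual configuration file can differ.
defaults
    timeout http-request 10s
    timeout queue 1m
    timeout connect 10s
    # Keep the client and server timeouts at 5m or higher so that
    # long-running create-physical-location API calls are not dropped.
    timeout client 5m
    timeout server 5m
    timeout http-keep-alive 10s
    timeout check 10s
After you change the file, reload HAProxy so that the new timeouts take effect (for example, systemctl reload haproxy on a typical installation).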
The delete-physical-location command fails
Applies to: 5.1.0
Fixed in: 5.1.1
When you run the cpd-cli manage delete-physical-location command, it fails.
- Diagnosing the problem
- The cpd-cli manage delete-physical-location command returns the following error:
TASK [utils : MC Agent: Make the MC storage pod if push mode communications. Dont wait] ***
fatal: [localhost]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'physical_location_id' is undefined\n\nThe error appears to be in '/opt/ansible/ansible-play/roles/utils/tasks/initialize_mc_edge_agent_51.yml': line 203, column 5, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n - name: \"MC Agent: Make the MC storage pod if push mode communications. Dont wait\"\n ^ here\nThis one looks easy to fix. It seems that there is a value started\n with a quote, and the YAML parser is expecting to see the line ended\nwith the same kind of quote. For instance:\n\n when: \"ok\" in result.stdout\n\nCould be written as:\n\n when: '\"ok\" in result.stdout'\n\nOr equivalently:\n\n when: \"'ok' in result.stdout\"\n"}
PLAY RECAP *********************************************************************
localhost : ok=51 changed=11 unreachable=0 failed=1 skipped=84 rescued=0 ignored=0
[ERROR] ... cmd.Run() failed with exit status 2
[ERROR] ... Command exception: The delete-physical-location command failed (exit status 2). You may find output and logs in the <file-path>/cpd-cli-workspace/olm-utils-workspace/work directory.
[ERROR] ... RunPluginCommand:Execution error: exit status 1
- Resolving the problem
- From the client workstation:
- Get the container ID of the olm-utils-v3 image:
  - Docker
    docker ps
  - Podman
    podman ps
The command returns output with the following format:
CONTAINER ID   IMAGE                                     COMMAND   CREATED       STATUS       PORTS   NAMES
8204c95c0fe2   cp.stg.icr.io/cp/cpd/olm-utils-v3:5.1.0             2 hours ago   Up 2 hours           olm-utils-play-v3
- Open a bash prompt in the container:
  - Docker
    docker exec -ti <container-ID> bash
  - Podman
    podman exec -ti <container-ID> bash
- Apply the workaround:
find ./ansible-play/roles/utils/templates/phyloc-mc-51 -name *deployment* | xargs -I {} sed -i -E 's/^ *icpdsupport\/physicalLocation.*\".*$//g' {}
- Run the following command to verify that the workaround was applied:
find ./ansible-play/roles/utils/templates/phyloc-mc-51 -name \*deployment\* | xargs -I {} grep -E 'icpdsupport\/physicalLocation' {}
The command should return the following output:
fieldPath: metadata.labels['icpdsupport/physicalLocationId']
fieldPath: metadata.labels['icpdsupport/physicalLocationName']
After you apply the workaround, you can re-run the cpd-cli manage delete-physical-location command to delete the remote physical location.
The Get authorization token endpoint returns Unauthorized when the username
contains special characters
Applies to: 5.1.1 and later
If you try to generate a bearer token by calling the Get authorization token endpoint with a username that contains special characters, the call fails and an error response is returned.
- Diagnosing the problem
-
When you call the /icp4d-api/v1/authorize endpoint by using credentials that contain special characters in the username, the call fails and you receive the following error response:
{ "_statusCode_": 401, "exception": "User unauthorized to invoke this endpoint.", "message": "Unauthorized" }
This happens because the special characters in the username are not encoded properly by the /icp4d-api/v1/authorize endpoint.
- Resolving the problem
-
You can generate an access_token by calling the validateAuth endpoint:
curl -X POST \
  'https://<platform_instance_route>/v1/preauth/signin' \
  -H 'Content-Type: application/json' \
  -d '{ "username": "<username>", "password": "<password>" }'
Replace <platform_instance_route>, <username>, and <password> with your credentials and the correct values for your environment. This command returns a response that contains the authorization token:
{ "_messageCode_": "200", "message": "Success", "token": "<authorization-token>" }
Use the <authorization-token> in the authorization header of subsequent API calls.
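For example, a follow-up request that passes the token in the Authorization header might look like the following sketch; the /icp4d-api/v1/users endpoint is used here only as an illustration.
# Call a platform API with the bearer token returned by the signin request.
curl -X GET \
  'https://<platform_instance_route>/icp4d-api/v1/users' \
  -H 'Authorization: Bearer <authorization-token>' \
  -H 'Content-Type: application/json'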
The wml service key does not work and cpd-cli health commands must use the --services option
Applies to: 5.1.0 and later
The wml service key cannot be used for 5.1.0 and later versions. Using the wml service key can cause severe data loss. When Watson Machine Learning is installed on a cluster, there are two restrictions that you must follow if you want to use either the cpd-cli health service-functionality or the cpd-cli health service-functionality cleanup commands:
- You cannot use the wml service key with the --services option.
- You must use the --services option when you use either the cpd-cli health service-functionality command or the cpd-cli health service-functionality cleanup command.
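For example, a health check run that names the service keys explicitly (and omits the wml key) might look like the following sketch; the service keys shown are placeholders, and any other options that your environment requires are omitted.
# Check specific services only; never pass the wml key.
cpd-cli health service-functionality --services=wkc,ws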
The cpd-cli health cluster command fails on ROSA with hosted control planes
Applies to: 5.1.0
Fixed in: 5.1.1
An error message about missing machineconfigpools (MCP) appears when you run the cpd-cli health cluster command to check the health of a Red Hat OpenShift Service on AWS (ROSA) cluster with hosted control planes (HCP).
Health check for OpenPages fails with 500 error if service instance is installed on a watsonx.governance environment
Applies to: 5.1.1 and later versions
Fixed in: 5.1.3
If you install an OpenPages service instance on a watsonx.governance™ environment and then run a cpd-cli health service-functionality check, the OpenPages service key fails with a 500 server error.
Exports with improper JSON formatting persist on export list despite deletion attempt
Applies to: 5.1.2 and later versions
The cpd-cli export-import export delete command fails to delete an export when the job fails due to improper JSON formatting that was passed in the export's YAML file. The command returns without an error message, but the export job remains in the export list.
- Workaround
- Fix any incorrect formatting issues with the JSON string in the export YAML file. Then create an export with a different name, and rerun the cpd-cli export-import export delete command on this export.
Installation and upgrade issues
- The setup-instance command fails during upgrades
- Upgrades fail or are stuck in the InProgress state when common core services cannot be upgraded because of an empty role name
- Upgrades fail or are stuck in the InProgress state when common core services cannot be upgraded because of roles with duplicate names
- The cloud-native-postgresql-opreq operand request is in a failed state after upgrade
- The Switch locations icon is not available if the apply-cr command times out
- Upgrades fail if the Data Foundation Rook Ceph cluster is unstable
- After you upgrade a Red Hat OpenShift Container Platform cluster, the FoundationDB resource can become unavailable
- Persistent volume claims with the WaitForFirstConsumer volume binding mode are flagged by the installation health checks
- Node pinning is not applied to postgresql pods
- The ibm-nginx deployment does not scale fast enough when automatic scaling is configured
- Uninstalling IBM watsonx services does not remove the IBM watsonx experience
The setup-instance command fails during
upgrades
Applies to: Upgrades from Version 5.0 to 5.1.0
When you run the cpd-cli
manage
setup-instance command, the command fails if the
ibm-common-service-operator-service service is not found in the operator
project for the instance.
When this error occurs, the ibmcpd ibmcpd-cr custom resource is stuck at
35%.
- Diagnosing the problem
- To determine if the command failed because the ibm-common-service-operator-service service was not found:
- Get the .status.progress value from the ibmcpd ibmcpd-cr custom resource:
oc get ibmcpd ibmcpd-cr \
  --namespace=${PROJECT_CPD_INST_OPERANDS} \
  -o json | jq -r '.status.progress'
  - If the command returns 35%, continue to the next step.
  - If the command returns a different value, the command failed for a different reason.
- Check for the Error found when checking commonservice CR in namespace error in the ibmcpd ibmcpd-cr custom resource:
oc get ibmcpd ibmcpd-cr \
  --namespace=${PROJECT_CPD_INST_OPERANDS} \
  -o json | grep 'Error found when checking commonservice CR in namespace'
  - If the command returns a response, continue to the next step.
  - If the command does not return a response, the command failed for a different reason.
- Confirm that the ibm-common-service-operator-service service does not exist:
oc get svc ibm-common-service-operator-service \
  --namespace=${PROJECT_CPD_INST_OPERATORS}
The command should return the following response:
Error from server (NotFound): services "ibm-common-service-operator-service" not found
- Resolving the problem
- To resolve the problem:
- Get the name of the cpd-platform-operator-manager pod:
oc get pod \
  --namespace=${PROJECT_CPD_INST_OPERATORS} \
  | grep cpd-platform-operator-manager
- Delete the cpd-platform-operator-manager pod. Replace <pod-name> with the name of the pod returned in the previous step.
oc delete pod <pod-name> \
  --namespace=${PROJECT_CPD_INST_OPERATORS}
- Wait several minutes for the operator to run a reconcile loop.
- Confirm that the ibm-common-service-operator-service service exists:
oc get svc ibm-common-service-operator-service \
  --namespace=${PROJECT_CPD_INST_OPERATORS}
The command should return a response with the following format:
NAME                                  TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
ibm-common-service-operator-service   ClusterIP   198.51.100.255   <none>        443/TCP   1h
After you resolve the issue, you can re-run the cpd-cli manage setup-instance command.
Upgrades fail or are stuck in the InProgress state when common core services cannot be upgraded because of an empty
role name
Applies to:
- Upgrades from Version 4.8 to 5.1.0
- Upgrades from Version 4.8 to 5.1.1
Fixed in: Version 5.1.2
If you upgrade a service with a dependency on common core services, the upgrade fails or is stuck
in the InProgress state if it cannot upgrade the common core services.
If your installation includes one of the following services, you might encounter this problem when upgrading from IBM Cloud Pak® for Data Version 4.8 to IBM Software Hub Version 5.1:
- IBM Knowledge Catalog
- IBM Knowledge Catalog Premium
- IBM Knowledge Catalog Standard
When you upgrade any service with a dependency on the common core services, the common core services upgrade fails with the following error:
'"Job" "projects-ui-refresh-users": Timed out waiting on resource'
This error occurs because the wkc_reporting_administrator role is created
without a name.
- Avoiding the problem
- You can avoid this problem by updating the
wkc_reporting_administratorrole before you upgrade:- Log in to the web client as a user with the one of the following permissions:
- Administer platform
- Manage platform roles
- From the navigation menu, select .
- Open the Roles tab and look for a role with <no value> in the name column.
- Edit the role. Set the Name to Reporting Administrator and click Save.
- Diagnosing the problem
- If you started the upgrade without completing the steps in Avoiding the problem, complete the following steps to determine why the upgrade failed:
- Get the name of the common core services operator pod:
oc get pod -n=${PROJECT_CPD_INST_OPERATORS} | grep ibm-cpd-ccs-operator
- Check the ibm-cpd-ccs-operator-* pod logs for the '"Job" "projects-ui-refresh-users": Timed out waiting on resource' error:
oc logs <ibm-cpd-ccs-operator-pod-name> -n=${PROJECT_CPD_INST_OPERATORS} \
  | grep '"Job" "projects-ui-refresh-users": Timed out waiting on resource'
  - If the command returns a response, proceed to the next step.
  - If the command returns an empty response, the upgrade failed for a different reason.
- Get the name of the projects-ui-refresh-users pod:
oc get pod -n=${PROJECT_CPD_INST_OPERANDS} | grep projects-ui-refresh-users
- Check the projects-ui-refresh-users-* pod logs for the Error refreshing role with extension_name - wkc_reporting_administrator - status_code - 400 error:
oc logs <projects-ui-refresh-users-pod-name> -n=${PROJECT_CPD_INST_OPERANDS} \
  | grep "Error refreshing role with extension_name - wkc_reporting_administrator - status_code - 400"
  - If the command returns a response, proceed to Resolving the problem.
  - If the command returns an empty response, the upgrade failed for a different reason.
- Resolving the problem
- The steps for resolving the problem are the same as the steps for avoiding the problem.
Upgrades fail or are stuck in the InProgress state when common core services cannot be upgraded because of roles with
duplicate names
Applies to:
- Upgrades from Version 4.8
- Upgrades from Version 5.0
If you upgrade a service with a dependency on common core services, the upgrade fails or is stuck
in the InProgress state if it cannot upgrade the common core services. This issue can occur if your environment
includes multiple roles with the same name. (This is possible only if you use the
/usermgmt/v1/role API to create roles.)
When you upgrade any service with a dependency on the common core services, the common core services upgrade fails with the following error:
'"Job" "projects-ui-refresh-users": Timed out waiting on resource'
- Avoiding the problem
- You can avoid this problem by removing duplicate role names before you upgrade:
- Log in to the web client as a user with one of the following permissions:
- Administer platform
- Manage platform roles
- From the navigation menu, select .
- Open the Roles tab and sort the roles by name to find any roles with duplicate names.
- If you find roles with duplicate names, edit the roles to remove the duplicate names and click Save.
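If you prefer to check for duplicates from the command line before you upgrade, a rough sketch against the user management API follows. The GET /usermgmt/v1/roles call, the jq path, and the ${TOKEN} and ${CPD_ROUTE} variables are assumptions, so adjust them to match your environment and release.
# List role names and print only the names that appear more than once.
curl -sk "https://${CPD_ROUTE}/usermgmt/v1/roles" \
  -H "Authorization: Bearer ${TOKEN}" \
  | jq -r '.rows[].entity.role_name' \
  | sort | uniq -d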
- Diagnosing the problem
- If you started the upgrade without completing the steps in Avoiding the problem, complete the following steps to determine why the upgrade failed:
- Get the name of the common core services operator pod:
oc get pod -n=${PROJECT_CPD_INST_OPERATORS} | grep ibm-cpd-ccs-operator
- Check the ibm-cpd-ccs-operator-* pod logs for the '"Job" "projects-ui-refresh-users": Timed out waiting on resource' error:
oc logs <ibm-cpd-ccs-operator-pod-name> -n=${PROJECT_CPD_INST_OPERATORS} \
  | grep '"Job" "projects-ui-refresh-users": Timed out waiting on resource'
  - If the command returns a response, proceed to the next step.
  - If the command returns an empty response, the upgrade failed for a different reason.
- Get the name of the projects-ui-refresh-users pod:
oc get pod -n=${PROJECT_CPD_INST_OPERANDS} | grep projects-ui-refresh-users
- Check the projects-ui-refresh-users-* pod logs for duplicate role name errors:
oc logs <projects-ui-refresh-users-pod-name> -n=${PROJECT_CPD_INST_OPERANDS} \
  | grep "Error refreshing role with extension_name" | grep "status_code - 409"
  - If the command returns a response, proceed to Resolving the problem.
  - If the command returns an empty response, the upgrade failed for a different reason.
- Resolving the problem
- The steps for resolving the problem are the same as the steps for avoiding the problem.
The cloud-native-postgresql-opreq operand request is in a failed state after
upgrade
Applies to:
- Upgrades from Version 5.0.1 or later
- Upgrades from Version 5.1.0 or later
Fixed in: 5.1.3
After you upgrade an instance of IBM Software Hub, the cloud-native-postgresql-opreq operand request is in the
Failed state.
Even though the operand request is in the Failed state, IBM Software Hub is upgraded and works as expected. To fix the operand request, run the following command:
oc patch opreq cloud-native-postgresql-opreq \
--namespace=${PROJECT_CPD_INST_OPERANDS} \
--type=merge \
--patch '{"spec": {"requests":[{"operands":[{"name":"cloud-native-postgresql-v1.22"}],"registry":"common-service"}]}}'
The Switch locations icon is not available if the
apply-cr command times out
Applies to: 5.1.0 and later
If you install solutions that are available in different experiences, the Switch locations icon is not available in the web client if the cpd-cli manage apply-cr command times out.
- Resolving the problem
-
Re-run the cpd-cli manage apply-cr command.
Upgrades fail if the Data Foundation Rook Ceph cluster is unstable
Applies to: 5.1.0 and later
If the Red Hat OpenShift Data Foundation or IBM Fusion Data Foundation Rook Ceph® cluster is unstable, upgrades fail.
One symptom is that pods will not start because of a FailedMount error. For
example:
Warning FailedMount 36s (x1456 over 2d1h) kubelet MountVolume.MountDevice failed for volume
"pvc-73bf3705-43e9-40bd-87ed-c1e1656d6f12" : rpc error: code = Aborted desc = an operation with the given
Volume ID 0001-0011-openshift-storage-0000000000000001-5e17508b-c295-4306-b684-eaa327aec2ab already exists
- Diagnosing the problem
- To confirm whether the Data Foundation
Rook Ceph cluster is unstable:
- Ensure that the rook-ceph-tools pod is running:
oc get pods -n openshift-storage | grep rook-ceph-tools
Note: On IBM Fusion HCI System or on environments that use hosted control planes, the pods are running in the openshift-storage-client project.
- Set the TOOLS_POD environment variable to the name of the rook-ceph-tools pod:
export TOOLS_POD=<pod-name>
- Execute into the rook-ceph-tools pod:
oc rsh -n openshift-storage ${TOOLS_POD}
- Run the following command to get the status of the Rook Ceph cluster:
ceph status
Confirm that the output includes the following line:
health: HEALTH_WARN
- Exit the pod:
exit
- Resolving the problem
- To resolve the problem:
- Get the names of the rook-ceph-mgr pods:
oc get pods -n openshift-storage | grep rook-ceph-mgr
- Set the MGR_POD_A environment variable to the name of the rook-ceph-mgr-a pod:
export MGR_POD_A=<rook-ceph-mgr-a-pod-name>
- Set the MGR_POD_B environment variable to the name of the rook-ceph-mgr-b pod:
export MGR_POD_B=<rook-ceph-mgr-b-pod-name>
- Delete the rook-ceph-mgr-a pod:
oc delete pods ${MGR_POD_A} -n openshift-storage
- Ensure that the rook-ceph-mgr-a pod is running before you move to the next step:
oc get pods -n openshift-storage | grep rook-ceph-mgr
- Delete the rook-ceph-mgr-b pod:
oc delete pods ${MGR_POD_B} -n openshift-storage
- Ensure that the rook-ceph-mgr-b pod is running:
oc get pods -n openshift-storage | grep rook-ceph-mgr
After you upgrade a Red Hat OpenShift Container Platform cluster, the FoundationDB resource can become unavailable
Applies to: 5.1.0 and later
After you upgrade your cluster to a new version of Red Hat OpenShift Container Platform, the IBM FoundationDB pods can become unavailable. When this issue occurs, services that rely on FoundationDB such as IBM Knowledge Catalog and IBM Match 360 cannot function correctly.
- IBM Knowledge Catalog
- IBM Match 360
- Diagnosing the problem
- To identify the cause of this issue, check the FoundationDB status and details.
- Check the FoundationDB status:
oc get fdbcluster -o yaml | grep fdbStatus
If this command is successful, the returned status is Complete. If the status is InProgress or Failed, proceed to the workaround steps.
- If the status is Complete but FoundationDB is still unavailable, log in to one of the FDB pods and check the status details to ensure that the database is available and all coordinators are reachable.
oc rsh sample-cluster-log-1 /bin/fdbcli
To check the detailed status of the FDB pod, run fdbcli to enter the FoundationDB command-line interface, then run the following command at the fdb> prompt:
status details
  - If you get a message that is similar to Could not communicate with a quorum of coordination servers, run the coordinators command with the IP addresses specified in the error message as input:
oc get pod -o wide | grep storage
coordinators IP-ADDRESS-1:4500:tls IP-ADDRESS-2:4500:tls IP-ADDRESS-3:4500:tls
If this step does not resolve the problem, proceed to the workaround steps.
  - If you get a different message, such as Recruiting new transaction servers, proceed to the workaround steps.
- Resolving the problem
- To resolve this issue, restart the FoundationDB pods.
Required role: To complete this task, you must be a cluster administrator.
- Restart the FoundationDB cluster pods:
oc get fdbcluster
oc get po | grep ${CLUSTER_NAME} | grep -v backup | awk '{print $1}' | xargs oc delete po
Replace ${CLUSTER_NAME} in the command with the name of your fdbcluster instance.
- Restart the FoundationDB operator pods:
oc get po | grep fdb-controller | awk '{print $1}' | xargs oc delete po
- After the pods finish restarting, check to ensure that FoundationDB is available.
  - Check the FoundationDB status:
oc get fdbcluster -o yaml | grep fdbStatus
The returned status must be Complete.
  - Check to ensure that the database is available:
oc rsh sample-cluster-log-1 /bin/fdbcli
If the database is still not available, complete the following steps:
    - Log in to the ibm-fdb-controller pod.
    - Run the fix-coordinator script:
kubectl fdb fix-coordinator-ips -c ${CLUSTER_NAME} -n ${PROJECT_CPD_INST_OPERATORS}
Replace ${CLUSTER_NAME} in the command with the name of your fdbcluster instance.
Persistent volume claims with the WaitForFirstConsumer volume binding mode
are flagged by the installation health checks
Applies to: 5.1.0 and later
The installation health checks flag the following persistent volume claims:
- ibm-cs-postgres-backup
- ibm-zen-objectstore-backup-pvc
Both of these persistent volume claims are created with the WaitForFirstConsumer
volume binding mode. In addition, both persistent volume claims will remain in the
Pending state until you back up your IBM Software Hub installation. This behavior is expected.
However, when you run the cpd-cli health operands command, the Persistent Volume Claim Healthcheck fails.
If there are more persistent volume claims returned by the health check, you must investigate
further to determine why those persistent volume claims are pending. However, if only the following
persistent volume claims are returned, you can ignore the Failed result:
- ibm-cs-postgres-backup
- ibm-zen-objectstore-backup-pvc
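To confirm that only these two claims are pending before you decide whether the result can be ignored, you can list the pending persistent volume claims in the instance project, for example:
# List persistent volume claims that are still pending in the instance project.
oc get pvc --namespace=${PROJECT_CPD_INST_OPERANDS} | grep Pending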
Node pinning is not applied to postgresql pods
Applies to: 5.1.0 and later
If you use node pinning to schedule pods on specific nodes, and your environment includes
postgresql pods, the node affinity settings are not applied to the
postgresql pods that are associated with your IBM Software Hub deployment.
The resource specification injection (RSI) webhook cannot patch postgresql pods
because the EDB Postgres operator uses a
PodDisruptionBudget resource to limit the number of concurrent disruptions to
postgresql pods. The PodDisruptionBudget resource prevents
postgresql pods from being evicted.
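To see the PodDisruptionBudget resources that protect the postgresql pods on your cluster, you can run a read-only check like the following sketch; the grep filter assumes that the budget names contain the string postgres.
# Show PodDisruptionBudget resources that cover the postgresql pods.
oc get pdb --namespace=${PROJECT_CPD_INST_OPERANDS} | grep postgres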
The ibm-nginx deployment does not scale fast enough when automatic scaling
is configured
Applies to: 5.1.0 and later
If you configure automatic scaling for IBM Software Hub, the ibm-nginx deployment
might not scale fast enough. Some symptoms include:
- Slow response times
- High CPU requests are throttled
- The deployment scales up and down even when the workload is steady
This problem typically occurs when you install watsonx Assistant or watsonx™ Orchestrate.
- Resolving the problem
- If you encounter the preceding symptoms, you must manually scale the
ibm-nginxdeployment:oc patch zenservice lite-cr \ --namespace=${PROJECT_CPD_INST_OPERANDS} \ --type merge \ --patch '{"spec": { "Nginx": { "name": "ibm-nginx", "kind": "Deployment", "container": "ibm-nginx-container", "replicas": 5, "minReplicas": 2, "maxReplicas": 11, "guaranteedReplicas": 2, "metrics": [ { "type": "Resource", "resource": { "name": "cpu", "target": { "type": "Utilization", "averageUtilization": 529 } } } ], "resources": { "limits": { "cpu": "1700m", "memory": "2048Mi", "ephemeral-storage": "500Mi" }, "requests": { "cpu": "225m", "memory": "920Mi", "ephemeral-storage": "100Mi" } }, "containerPolicies": [ { "containerName": "*", "minAllowed": { "cpu": "200m", "memory": "256Mi" }, "maxAllowed": { "cpu": "2000m", "memory": "2048Mi" }, "controlledResources": [ "cpu", "memory" ], "controlledValues": "RequestsAndLimits" } ] } }}'
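After you apply the patch, you can watch the deployment until the new replica counts are in effect, for example:
# Watch the ibm-nginx deployment while the replicas scale to the patched values.
oc get deployment ibm-nginx --namespace=${PROJECT_CPD_INST_OPERANDS} -w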
Uninstalling IBM watsonx services does not remove the IBM watsonx experience
Applies to: 5.1.0 and later
After you uninstall watsonx.ai™ or watsonx.governance, the IBM watsonx experience is still available in the web client even though there are no services that are specific to the IBM watsonx experience.
- Resolving the problem
- To remove the IBM watsonx experience
from the web client, an instance administrator must run the following
command:
oc delete zenextension wx-perspective-configuration \ --namespace=${PROJECT_CPD_INST_OPERANDS}
Backup and restore issues
Issues that apply to several backup and restore methods
- Backup issues
- Review the following issues before you create a backup. Do the workarounds that apply to your environment.
- Backup precheck fails due to missing Data Refinery custom resource error message
- Backup precheck fails on upgraded deployment
- Backup fails with an ibm_neo4j error when IBM Match 360 is scaled to the x-small size
- Backup fails for the platform with error in EDB Postgres cluster
- After backing up or restoring IBM Match 360, Redis fails to return to a Completed state
- Restore issues
- Review the following issues before you restore a backup. Do the workarounds that apply to your environment.
- Relationship explorer is not working after restoring IBM Knowledge Catalog
- Execution Engine for Apache Hadoop service is inactive after a restore
- After a restore, OperandRequest timeout error in the ZenService custom resource
- After restoring the IBM Match 360 service, the onboard job fails
- After backing up or restoring IBM Match 360, Redis fails to return to a Completed state
- Post-restore hooks fail to run when restoring deployment that includes IBM Knowledge Catalog
- Unable to log in to IBM Software Hub with OpenShift cluster credentials after successfully restoring to a different cluster
- Running cpd-cli restore post-hook command after Db2 Big SQL was successfully restored times out
- After restoring Analytics Engine powered by Apache Spark, IBM Software Hub resources reference the source cluster
- After restoring IBM Software Hub, watsonx.data Presto connection still references source cluster
- Unable to log in to the IBM Cloud Pak foundational services console after restore
- watsonx Assistant stuck at System is Training after restore
- After an online restore, some service custom resources cannot reconcile
- Restore fails for watsonx Orchestrate
- IBM Software Hub user interface does not load after restoring
- ibm-lh-lakehouse-validator pods repeatedly restart after restore
- Restore to same cluster fails with restore-cpd-volumes error
- Restore fails with ErrImageNeverPull error for Analytics Engine powered by Apache Spark
- SQL30081N RC 115,*,* error for Db2 selectForReceiveTimeout function after instance restore
Backup and restore issues with the OADP utility
- Backup issues
- Review the following issues before you create a backup. Do the workarounds that apply to your environment.
- Online backup of upgraded IBM Cloud Pak for Data instance fails validation
- Offline backup of Db2 Data Management Console fails with backup validation error
- Offline backup fails with PartiallyFailed error
- OpenPages storage content is missing from offline backup
- Offline backup validation fails in IBM Software Hub deployment that includes Db2 and Informix in the same namespace
- Unable to create offline backup when IBM Software Hub deployment includes MongoDB service
- Backup fails after a service is upgraded and then uninstalled
- Offline backup prehooks fail for lite-maint resource
- Backup fails when deployment includes IBM Knowledge Catalog Standard or IBM Knowledge Catalog Premium without optional components enabled
- ObjectBucketClaim is not supported by the OADP utility
- Unable to create backup due to missing ConfigMaps
- Backup is missing EDB Postgres PVCs
- Offline backup prehooks fail when deployment includes IBM Knowledge Catalog
- Offline backup fails after watsonx.ai is uninstalled
- Db2U backup precheck fails during offline backup
- Db2 Big SQL backup pre-hook and post-hook fail during offline backup
- Restore issues
- Review the following issues after you restore a backup. Do the workarounds that apply to your environment.
- Unable to restore offline backup of RStudio Server Runtimes
- Offline post-restore hooks fail when restoring Informix
- Unable to log in to the IBM Software Hub user interface after offline restore
- Offline restore to different cluster fails with nginx error
- IBM Knowledge Catalog custom resources are not restored
- Errors when restoring IBM Software Hub operators
- Post-restore hook error when restoring offline backup of Db2
- Restoring Data Virtualization fails with metastore not running or failed to connect to database error
- After online restore, watsonx Code Assistant for Z is not running properly
- Online post-restore hooks fail to run with timed out waiting for condition error when restoring Analytics Engine powered by Apache Spark
- Restore fails with condition not met error
- Offline restore fails with getting persistent volume claim error message
- After restoring Watson Speech services online backup, unable to use service instance ID to make service REST API calls
- After restoring Watson Discovery online backup, unable to use service instance ID to make service REST API calls
- Restore posthooks fail to run when restoring Data Virtualization
- Offline restore fails with cs-postgres timeout error
- Restoring an offline backup fails with zenservice-check error
- Error running post-restore hooks during offline restore
- Prompt tuning fails after restoring watsonx.ai
- After restoring watsonx Orchestrate, Kafka controller pods in the knative-eventing project enter a CrashLoopBackOff state
- Restoring online backup of Data Virtualization fails
- Restore posthooks timeout errors during Db2U and IBM Software Hub control plane restore
- Online restore posthooks fail when restoring Db2
- Offline restore fails at post-restore hooks step
- Error running post-restore hooks when restoring an offline backup
Backup and restore issues with IBM Fusion
- Backup issues
- Review the following issues before you create a backup. Do the workarounds that apply to your environment.
- Restore issues
- Do the workarounds that apply to your environment after you restore a backup.
- Restore fails at Hook: br-service-hooks/operators-restore step
- Restore fails at Hook: br-service-hooks/operators-restore step
- IBM Fusion reports successful restore but many service custom resources are not in Completed state
- Unable to connect to Db2 database after restoring Data Virtualization to a different cluster
Backup and restore issues with NetApp Trident protect
- Restore issues
- Review the following issues after you restore a backup. Do the workarounds that apply to your environment.
Backup and restore issues with NetApp Astra Control Center
- Restore issues
- Review the following issues after you restore a backup. Do the workarounds that apply to your environment.
Backup and restore issues with Portworx
- Backup issues
- Review the following issues before you create a backup. Do the workarounds that apply to your environment.
- Restore issues
- Review the following issues after you restore a backup. Do the workarounds that apply to your environment.
Backup and restore issues with the IBM Software Hub volume backup utility
- Backup issues
- Review the following issues before you create a backup. Do the workarounds that apply to your environment.
Backup precheck fails due to missing Data Refinery custom resource error message
Applies to: 5.1.0
Applies to: All backup and restore methods
Fixed in: 5.1.1
- Diagnosing the problem
- In the CPD-CLI*.log file, you see the following error
message:
time=<timestamp> level=error msg=error performing op preCheckViaConfigHookRule for resource rshaper (configmap=cpd-datarefinery-maint-aux-ckpt-cm): : datarefinery.datarefinery.cpd.ibm.com "datarefinery-cr" not found func=cpdbr-oadp/pkg/quiesce.(*BasicPlanExecutor).applyPlanInternal file=/go/src/cpdbr-oadp/pkg/quiesce/planexecutor.go:1631 The hook is searching for datarefinery-cr, and it failed because the datarefinery-cr is not present. - Cause of the problem
- Starting in IBM Cloud Pak for Data 4.7, the Data Refinery custom resource name was changed from datarefinery-sample to datarefinery-cr. If you upgraded IBM Cloud Pak for Data from version 4.6 or earlier, the Data Refinery custom resource name is still datarefinery-sample.
- Workaround
- Update the Data Refinery custom
resource name to datarefinery-sample in the
cpd-datarefinery-maint-aux-br-cm and
cpd-datarefinery-maint-aux-ckpt-cm ConfigMaps.
- Edit the cpd-datarefinery-maint-aux-br-cm ConfigMap:
oc -n ${PROJECT_CPD_INST_OPERANDS} edit cm cpd-datarefinery-maint-aux-br-cm
- In the precheck-meta section, under backup-hooks, update the Data Refinery custom resource name to datarefinery-sample:
precheck-meta:
  backup-hooks:
    exec-rules:
      - resource-kind: datarefinery.datarefinery.cpd.ibm.com
        name: datarefinery-sample
- Repeat steps 1 and 2 in the cpd-datarefinery-maint-aux-ckpt-cm ConfigMap.
- Retry the backup.
Restoring online backup of Data Virtualization fails
Applies to: 5.1.0 and later
Applies to: All online backup and restore methods
- Diagnosing the problem
- In the CPD-CLI*.log file, you see the following error
message:
time=<timestamp> level=info msg= zen/configmap/cpd-dv-aux-ckpt-cm: component=dv, op=<mode=post-restore,type=config-hook,method=rule>, status=error func=cpdbr-oadp/pkg/quiesce.logPlanResult file=/a/workspace/oadp-upload/pkg/quiesce/planexecutor.go:1137 - Workaround
- Do the following steps:
- Disable the Data Virtualization liveness probe in the Data Virtualization head pod:
oc exec -it c-db2u-dv-db2u-0 -- bash -c "su - db2inst1 - mkdir /mnt/PV/versioned/marker_file"
oc exec -it c-db2u-dv-db2u-0 -- bash -c "su - db2inst1 - touch /mnt/PV/versioned/marker_file/.bar"
- Disable the BigSQL restart daemon in the Data Virtualization head pod:
oc exec -it c-db2u-dv-db2u-0 -- bash -c "db2uctl marker create BIGSQL_DAEMON_PAUSE"
- Stop BigSQL in the Data Virtualization head pod:
oc rsh c-db2u-dv-db2u-0 bash
su - db2inst1
bigsql stop
- Re-enable the Hive user in the users.json file in the Data Virtualization head pod.
- Edit the users.json
file:
vi /mnt/blumeta0/db2_config/users.json - Locate
"locked":trueand change it to"locked":false.
- Edit the users.json
file:
- On the hurricane pod, rename the hive-site.xml config file so that it can be reconfigured by
restarting the
pod:
oc rsh $(oc get pod -o name -l formation_id=db2u-dv,role=hurricane)su - db2inst1mv /mnt/blumeta0/home/db2inst1/ibm/bigsql/hive-site.xml /mnt/blumeta0/home/db2inst1/ibm/bigsql/hive-site.xml.bak - Exit the pod, and then run the following command to delete it.Note: Since the configuration file was renamed, it is regenerated with the correct settings.
oc delete pod -l formation_id=db2u-dv,role=hurricane - After the hurricane pod is started again, run the following commands on the hurricane pod to
disable the SSL so that it can be reconfigured in a later
step:
oc rsh $(oc get pod -o name -l formation_id=db2u-dv,role=hurricane)su - db2inst1bigsql-config -disableMetastoreSSLbigsql-config -disableSchedulerSSL - Clean up leftover files from the hurricane
pod:
rm -rf /mnt/blumeta0/bigsql/security/*rm -rfv /mnt/blumeta0/bigsql/security/.* 2>/dev/null - Run the following commands to disable SSL from the head
pod:
oc rsh c-db2u-dv-db2u-0 bashsu - db2inst1rah "bigsql-config -disableMetastoreSSL"rah "bigsql-config -disableSchedulerSSL" - Clean up leftover files from the head and worker
pods:
rm -rf /mnt/blumeta0/bigsql/security/*rm -rfv /mnt/blumeta0/bigsql/security/.* 2>/dev/null - Run the following commands to re-enable SSL on the head pod, and restart Db2
Big SQL so that configuration changes can take
effect:
bigsql-config -enableMetastoreSSLbigsql-config -enableSchedulerSSLbigsql stop; bigsql start - Remove markers that were created in steps 1 and 2 in the Data Virtualization head
pod:
oc exec -it c-db2u-dv-db2u-0 -- bash -c "rm -rf /mnt/PV/versioned/marker_file/.bar"oc exec -it c-db2u-dv-db2u-0 -- bash -c "db2uctl marker delete BIGSQL_DAEMON_PAUSE" - If you are doing the backup and restore with the OADP backup and restore utility, run the following
command:
cpd-cli oadp restore prehooks --hook-kind=checkpoint --include-namespaces=${PROJECT_CPD_INST_OPERANDS},${PROJECT_CPD_INST_OPERATORS} --log-level debug --verbose - If you are doing the backup and restore with IBM Fusion, NetApp Astra Control Center, or Portworx data replication, run the following
commands:
CPDBR_POD=$(oc get po -l component=cpdbr-tenant -n ${PROJECT_CPD_INST_OPERATORS} --no-headers | awk '{print $1}')oc exec -n ${PROJECT_CPD_INST_OPERATORS} ${CPDBR_POD} -it -- /bin/sh -c "./cpdbr-oadp restore posthooks --hook-kind=checkpoint --include-namespaces=${PROJECT_CPD_INST_OPERANDS},${PROJECT_CPD_INST_OPERATORS}"oc exec -n ${PROJECT_CPD_INST_OPERATORS} ${CPDBR_POD} -it -- /bin/sh -c "./cpdops/files/cpd-operators.sh restore-namespacescope --foundation-namespace ${PROJECT_CPD_INST_OPERATORS} --operators-namespace ${PROJECT_CPD_INST_OPERATORS}"
Unable to connect to Db2 database after restoring Data Virtualization to a different cluster
Applies to: 5.1.1
Applies to: Backup and restore to different cluster with IBM Fusion
Fixed in: 5.1.2
- Diagnosing the problem
- In the IBM Software Hub web client, users see
the following error
message:
SQL30082N Security processing failed with reason "24" ("USERNAME AND/OR PASSWORD INVALID"). SQLSTATE=08001 FAIL: connect to database with admin/password - Cause of the problem
- The Data Virtualization head pod is unable to log in to Db2 instances because Db2 is using the source cluster's IBM Software Hub route.
- Resolving the problem
- Restart the Data Virtualization head and worker pods:
-
Log in to Red Hat OpenShift Container Platform as a cluster administrator.
${OC_LOGIN}Remember:OC_LOGINis an alias for theoc logincommand. - Scale down to 0
replicas:
oc -n ${PROJECT_CPD_INST_OPERANDS} scale sts c-db2u-dv-db2u --replicas=0 - Wait for all c-db2u-dv-db2u-<xxxx> pods to be deleted.
- Scale up to x replicas, where x equals the total number of
head and worker pods.
The following example assumes that you have one head pod and one worker pod.
oc -n ${PROJECT_CPD_INST_OPERANDS} scale sts c-db2u-dv-db2u --replicas=2
-
IBM Fusion shows successful restore but Informix custom resource reports that instance is unhealthy
Applies to: 5.1.1
Applies to: Backup and restore with IBM Fusion
Fixed in: 5.1.2
- Diagnosing the problem
- The status of the informix custom resource is
InProgress. - Cause of the problem
- The informix custom resource is not reporting the proper status of the StatefulSet and pod because a state change is not reconciled. However, the Informix instance is working properly and clients can interact with the instance.
- Resolving the problem
- Run a reconciliation by restarting the Informix operator controller manager pod. When the
reconciliation is completed, the custom resource shows the proper status. Run the following
commands:
oc scale deployment informix-operator-controller-manager --replicas=0 -n ${PROJECT_CPD_INST_OPERATORS} oc scale deployment informix-operator-controller-manager --replicas=1 -n ${PROJECT_CPD_INST_OPERATORS}
Unable to log in to IBM Software Hub with OpenShift cluster credentials after successfully restoring to a different cluster
Applies to: 5.1.0, 5.1.1, 5.1.2
Applies to: All restore to different cluster scenarios
Fixed in: 5.1.3
- Diagnosing the problem
- When IBM Software Hub is integrated with the
Identity Management Service service, you cannot log in with
OpenShift cluster credentials. You might
be able to log in with LDAP or as
cpdadmin. - Resolving the problem
- To work around the problem, run the following
commands:
oc delete cm platform-auth-idp -n ${PROJECT_CPD_INST_OPERANDS} oc delete cm oauth-client-map -n ${PROJECT_CPD_INST_OPERANDS} oc delete cm ibm-iam-bindinfo-oauth-client-map -n ${PROJECT_CPD_INST_OPERANDS} oc delete cm ibm-iam-bindinfo-platform-auth-idp -n ${PROJECT_CPD_INST_OPERANDS} oc delete pods -n ${PROJECT_CPD_INST_OPERATORS} -l app.kubernetes.io/instance=ibm-common-service-operator oc delete pods -n ${PROJECT_CPD_INST_OPERANDS} -l app.kubernetes.io/instance=platform-auth-service oc delete pods -n ${PROJECT_CPD_INST_OPERANDS} -l app.kubernetes.io/instance=platform-identity-management oc delete pods -n ${PROJECT_CPD_INST_OPERANDS} -l app.kubernetes.io/instance=platform-identity-provider
Restore posthooks timeout errors during Db2U and IBM Software Hub control plane restore
Applies to: 5.1.0
Applies to: Online backup and restore with the OADP utility
Fixed in: 5.1.1
- Diagnosing the problem
- When you restore a backup, you see an error message in the CPD-CLI*.log
file like in the following
examples:
<time> Hook execution breakdown by status=error/timedout: <time> <time> The following hooks either have errors or timed out <time> <time> post-restore (2): <time> <time> COMPONENT CONFIGMAP METHOD STATUS DURATION ADDONID <time> db2u db2u-aux-ckpt-cm rule error 1h0m0.176305347s databases <time> zen-lite-patch ibm-zen-lite-patch-ckpt-cm rule error 8.55624ms zen-lite <time> <time> -------------------------------------------------------------------------------- <time> <time> Error: failed to execute masterplan: 1 error occurred: <time> * DataProtectionPlan=v1-orchestration, Action=post-restore-hooks (index=9) error: online post-restore hooks execution failed: error running post-restore hooks: Error running post-processing rules. Check the /<directory>/CPD-CLI-<date>.log for errors. <time> 2 errors occurred: <time> * error performing op postRestoreViaConfigHookRule for resource db2u (configmap=db2u-aux-ckpt-cm): 2 errors occurred: <time> * timed out waiting for the condition <time> * timed out waiting for the condition <time> <time> * error performing op postRestoreViaConfigHookRule for resource zen-lite-patch (configmap=ibm-zen-lite-patch-ckpt-cm): : clusterserviceversions.operators.coreos.com "ibm-zen-operator.v6.1.0" not foundtime=<timestamp> level=info msg=Time: <timestamp> level=info - OperandRequest: ibm-iam-request - phase: Installing func=cpdbr-oadp/pkg/cli.(*LogAndPrintWriter).Write file=/a/workspace/oadp-upload/pkg/cli/scripts.go:116 time=<timestamp> level=info msg=Time: <timestamp> level=info - sleeping for 64s... (retry attempt 10/10) func=cpdbr-oadp/pkg/cli.(*LogAndPrintWriter).Write file=/a/workspace/oadp-upload/pkg/cli/scripts.go:116 time=<timestamp> level=info msg=Time: <timestamp> level=info - OperandRequest: ibm-iam-request - phase: Installing func=cpdbr-oadp/pkg/cli.(*LogAndPrintWriter).Write file=/a/workspace/oadp-upload/pkg/cli/scripts.go:116 time=<timestamp> level=info msg=Time: <timestamp> level=warning - Create OperandRequest Timeout Warning func=cpdbr-oadp/pkg/cli.(*LogAndPrintWriter).Write file=/a/workspace/oadp-upload/pkg/cli/scripts.go:116 time=<timestamp> level=info msg=-------------------------------------------------- func=cpdbr-oadp/pkg/cli.(*LogAndPrintWriter).Write file=/a/workspace/oadp-upload/pkg/cli/scripts.go:116 - Cause of the problem
- The timeout problem is caused by a slow Kubernetes API server on the OpenShift cluster.
- Resolving the problem
- Retry the restore.
Online restore posthooks fail when restoring Db2
Applies to: 5.1.0, 5.1.1
Applies to: Online backup and restore with the OADP utility
Fixed in: 5.1.2
- Diagnosing the problem
- When you restore a backup, you see an error message in the CPD-CLI*.log
file like in the following
example:
<time> The following hooks either have errors or timed out <time> <time> post-restore (1): <time> <time> COMPONENT CONFIGMAP METHOD STATUS DURATION ADDONID <time> db2u db2u-aux-ckpt-cm rule error 44.941375556s databases <time> <time> -------------------------------------------------------------------------------- <time> <time> Error: failed to execute masterplan: 1 error occurred: <time> * DataProtectionPlan=v1-orchestration, Action=post-restore-hooks (index=9) error: online post-restore hooks execution failed: error running post-restore hooks: Error running post-processing rules. Check the /<directory>/CPD-CLI-<date_timestamp>.log for errors. <time> 1 error occurred: <time> * error performing op postRestoreViaConfigHookRule for resource db2u (configmap=db2u-aux-ckpt-cm): 1 error occurred: <time> * error executing command (container=db2u podIdx=0 podName=c-db2oltp-<xxxxxxxxxxxxxxxx>-db2u-0 namespace=${PROJECT_CPD_INST_OPERANDS} auxMetaName=db2u-aux component=db2u actionIdx=0): command terminated with exit code 1 - Resolving the problem
- This problem is intermittent. Do the following steps:
- Rerun the restore posthooks:
cpd-cli oadp restore posthooks \
  --tenant-operator-namespace ${PROJECT_CPD_INST_OPERANDS} \
  --hook-kind=posthook \
  --log-level=debug
- Reset the namespacescope by running the following commands:
oc get po -A | grep "cpdbr-tenant-service"
oc rsh -n ${PROJECT_CPD_INST_OPERANDS} <cpdbr-tenant-...>
/cpdbr-scripts/cpdbr/cpdbr-cpd-operators.sh restore-namespacescope --operators-namespace ${PROJECT_CPD_INST_OPERATORS} --foundation-namespace ${PROJECT_CPD_INST_OPERATORS}
Running cpd-cli restore post-hook command after Db2 Big SQL was successfully restored times out
Applies to: 5.1.0 and later
Applies to: Online backup and restore with either the OADP utility or Portworx
- Diagnosing the problem
- After you restore an online backup, you see the following message repeating in the
CPD-CLI*.log
file:
time=<timestamp> level=INFO msg=Waiting for marker /tmp/.ready_to_connectToDb - Resolving the problem
- Manually recreate /tmp/.ready_to_connectToDb in the Db2
Big SQL head pod. Do the following steps:
- Log in to the Db2 Big SQL head pod:
oc rsh $(oc get pods | grep -i c-bigsql | grep -i db2u-0 | cut -d' ' -f 1)
- Switch to the db2inst1 user:
su - db2inst1
- Recreate /tmp/.ready_to_connectToDb:
touch /tmp/.ready_to_connectToDb
- Rerun the cpd-cli oadp restore posthooks command:
cpd-cli oadp restore posthooks \
  --hook-kind=checkpoint \
  --namespace=${PROJECT_CPD_INST_OPERANDS}
After restoring Analytics Engine powered by Apache Spark, IBM Software Hub resources reference the source cluster
Applies to: 5.1.0 and later
Applies to: Online backup and restore to a different cluster
- Diagnosing the problem
- After you restore IBM Software Hub to a different cluster, accessing the IBM Software Hub console shows the source cluster URL instead of the target cluster.
- Cause of the problem
- After the restore, the ConfigMap spark-hb-deployment-properties references the source cluster.
- Resolving the problem
- Do the following steps:
- Delete the spark-hb-deployment-properties ConfigMap:
oc delete cm spark-hb-deployment-properties -n ${PROJECT_CPD_INST_OPERANDS}
- Reconcile the Analytics Engine powered by Apache Spark custom resource:
oc patch AnalyticsEngine analyticsengine-sample -n ${PROJECT_CPD_INST_OPERANDS} --type merge --patch '{"spec": {"forceReconcile": "'$(date +%s)'"}}'
After restoring IBM Software Hub, watsonx.data Presto connection still references source cluster
Applies to: 5.1.0
Applies to: Online backup and restore to a different cluster
Fixed in: 5.1.1
- Diagnosing the problem
- After you restore IBM Software Hub to a different cluster, in the Presto connection details, you see the source cluster hostname in the engine details.
- Cause of the problem
- After the restore, a ConfigMap might have references to the source cluster.
- Resolving the problem
- Do the following steps:
- Get the watsonx.data™ Presto engine name:
oc get wxdengine -o name -n ${PROJECT_CPD_INST_OPERANDS}
- Patch the watsonx.data Presto engine:
oc patch wxdengine.watsonxdata.ibm.com/<engine_name> \
  --type json \
  -n ${PROJECT_CPD_INST_OPERANDS} \
  -p '[ { "op": "remove", "path": "/spec/externalEngineUri" } ]'
- To accelerate the reconcile, delete the lakehouse operator pod:
oc delete pod ibm-lakehouse-controller-manager-<xxxxxxxxxx>-<yyyyy> -n ${PROJECT_CPD_INST_OPERATORS}
After a few minutes, the Presto engine repopulates with the target cluster URL, and the ConfigMap is recreated with the correct URLs.
ObjectBucketClaim is not supported by the OADP utility
Applies to: 5.1.0
Applies to: Backup and restore with the OADP utility
- Diagnosing the problem
- If an ObjectBucketClaim is created in an IBM Software Hub instance, it is not included when you create a backup.
- Cause of the problem
- OADP does not support backup and restore of ObjectBucketClaim.
- Resolving the problem
- Services that provide the option to use ObjectBuckets must ensure that the ObjectBucketClaim is in a separate namespace and backed up separately.
Unable to log in to the IBM Cloud Pak foundational services console after restore
Applies to: 5.1.0
Applies to: Online or offline backup and restore to a different cluster
Fixed in: 5.1.1
- Diagnosing the problem
- After restoring an online or offline backup of IBM Software Hub that is integrated with the
Identity Management Service to a different
cluster, users cannot log in to the IBM Cloud Pak foundational services console. The
following error message
appears:
CWOAU0061E: The OAuth service provider could not find the client because the client name is not valid. Contact your system administrator to resolve the problem. - Resolving the problem
- After the zenservice custom resource is reconciled and in a Completed state, restart the Identity Management Service operator and related resources. Run the following commands:
oc delete cm ibmcloud-cluster-info -n ${PROJECT_CPD_INST_OPERANDS}
oc delete po -n ${PROJECT_CPD_INST_OPERANDS} -l component=platform-identity-provider
oc delete po -n ${PROJECT_CPD_INST_OPERANDS} -l component=platform-identity-management
oc delete po -n ${PROJECT_CPD_INST_OPERANDS} -l component=platform-auth-service
oc delete po -n ${PROJECT_CPD_INST_OPERANDS} -l app.kubernetes.io/name=ibm-iam-operator
Backup precheck fails on upgraded deployment
Applies to: 5.1.0
Applies to: All online backup and restore methods
Fixed in: 5.1.1
- Diagnosing the problem
- When you try to create an online backup of an IBM Cloud Pak for Data 5.0.x deployment that was
upgraded to IBM Software Hub
5.1.0, the backup precheck fails. You see an error message in the log
file like in the following
example:
time=<timestamp> level=info msg=exit RunExecRules func=cpdbr-oadp/pkg/quiesce.RunExecRules file=/a/workspace/oadp-upload/pkg/quiesce/ruleexecutor.go:142 time=<timestamp> level=error msg=error performing op preCheckViaConfigHookRule for resource zen (configmap=cpd-zen-aux-ckpt-cm): 1 error occurred: * condition not met (condition={$.status.controlPlaneStatus} == {"Completed"}, namespace=thanos, gvr=cpd.ibm.com/v1, Resource=ibmcpds, name=ibmcpd-cr) func=cpdbr-oadp/pkg/quiesce.(*BasicPlanExecutor).applyPlanInternal file=/a/workspace/oadp-upload/pkg/quiesce/planexecutor.go:1631 time=<timestamp> level=info msg=op preCheckViaConfigHookRule for resource zen took 10.035602582s func=cpdbr-oadp/pkg/quiesce.(*BasicPlanExecutor).applyPlanInternal file=/a/workspace/oadp-upload/pkg/quiesce/planexecutor.go:1643 time=<timestamp> level=info msg= ** PHASE [PLAN EXECUTOR/INTERNAL/PRECHECK-BACKUP/RESOURCE/ZEN (CONFIGMAP=CPD-ZEN-AUX-CKPT-CM)/END] func=cpdbr-oadp/pkg/utils.LogMarker file=/a/workspace/oadp-upload/pkg/utils/log.go:64 pre-check (1): COMPONENT CONFIGMAP METHOD STATUS DURATION ADDONID zen cpd-zen-aux-ckpt-cm rule error 10.035602582s zen-lite - Cause of the problem
- Two EDB Postgres clusters, wa-postgres and wa-postgres-16, exist as a result of migrating the data from one EDB Postgres cluster to another EDB Postgres cluster.
- Resolving the problem
- Delete the wa-postgres cluster.
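A sketch of the deletion by using the EDB Postgres Cluster resource follows; the fully qualified resource name is an assumption, so list the clusters first and confirm that wa-postgres-16 remains the active cluster before you delete anything.
# List the EDB Postgres clusters, then delete the old wa-postgres cluster.
oc get clusters.postgresql.k8s.enterprisedb.io -n ${PROJECT_CPD_INST_OPERANDS}
oc delete clusters.postgresql.k8s.enterprisedb.io wa-postgres -n ${PROJECT_CPD_INST_OPERANDS}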
Relationship explorer is not working after restoring IBM Knowledge Catalog
Applies to: 5.1.1, 5.1.2
Applies to: All backup and restore methods
Fixed in: 5.1.3
- Diagnosing the problem
- After restoring an online or offline backup, clicking the Relationship
Explorer button gives the following error:
Error fetching canvas Not found. The resource you tried to access does not exist. - Cause of the problem
- This problem typically occurs due to a timing conflict between the restore process and a scheduled backup job. When performing an online or offline restore of the Neo4j database, the restore process might succeed, but it often restores an incorrect database bundle because of the timing conflict. As a result, data loss occurs in the Neo4j graph database.
- Resolving the problem
- Do the following steps:
- Patch the knowledgegraph-cr custom resource to disable the
data-lineage-neo4j-backups-cronjob cronjob before a backup is taken by running
the following
command:
oc patch Knowledgegraph knowledgegraph-cr -n ${PROJECT_CPD_INST_OPERANDS} --type merge --patch '{"spec": {"neo4j_backup_job_enabled": "False"}}' oc delete cronjob data-lineage-neo4j-backups-cronjob -n ${PROJECT_CPD_INST_OPERANDS} - Wait for the knowledgegraph-cr custom resource to reconcile.
- Check that the knowledgegraph-cr custom resource is in a
Completedstate:oc get knowledgegraph.wkc.cpd.ibm.com -n ${PROJECT_CPD_INST_OPERANDS} - Check that the Neo4j custom resource is in a
Completedstate:oc get neo4jclusters.neo4j.cpd.ibm.com -n ${PROJECT_CPD_INST_OPERANDS}
- Create a new backup.
Execution Engine for Apache Hadoop service is inactive after a restore
Applies to: 5.1.0
Applies to: All backup and restore methods
Fixed in: 5.1.1
- Diagnosing the problem
- After an online or offline restore, the hadoop-cr custom resource is not
active. Run the following
command:
oc get hadoop -n ${PROJECT_CPD_INST_OPERANDS}Output of the command:NAME VERSION RECONCILED STATUS AGE hadoop-cr 5.1.0 22h - Cause of the problem
- Execution Engine for Apache Hadoop subscriptions are missing from the cluster.
- Resolving the problem
- After you take a backup, check that the Execution Engine for Apache Hadoop subscriptions exist by running
the following
command:
for sub in $(oc get cm cpd-operators -o jsonpath='{.data.subscriptions}' -n ${PROJECT_CPD_INST_OPERATORS} | jq -r '.[] | .metadata.name');do oc get subscriptions.operators.coreos.com ${sub} -n ${PROJECT_CPD_INST_OPERATORS} > /dev/null; [ $? -eq 1 ] && echo "FAILED to find subscription ${sub}" || echo "found subscription ${sub} in the backup";done
After restoring watsonx Orchestrate, Kafka
controller pods in the knative-eventing project enter a
CrashLoopBackOff state
Applies to: 5.1.0
Applies to: Online backup and restore
Fixed in: 5.1.1
- Diagnosing the problem
- After restoring watsonx Orchestrate, the Kafka
controller pods in the knative-eventing project enter a
CrashLoopBackOffstate. - Cause of the problem
- The source of the problem is a known issue with restoring watsonx Assistant.
- Resolving the problem
- Do the same workaround for the watsonx Assistant known issue. For details, see watsonx Assistant stuck at System is Training after restore.
watsonx Assistant stuck at System is
Training after restore
Applies to: 5.1.0
Applies to: Backup and restore with the OADP utility or IBM Fusion
Fixed in: 5.1.1
- Diagnosing the problem
- After restoring watsonx Assistant, the Kafka
controller pods in the knative-eventing project might enter a
CrashLoopBackOffstate. As a result, watsonx Assistant is unable to complete the training process, and the service gets stuck at theSystem is Trainingstage. - Resolving the problem
- Do the following steps:
- Identify any Kafka pods in the
CrashLoopBackOffstate:oc get pod -n knative-eventing | grep -vE "Compl|1/1|2/2|3/3|4/4|5/5|6/6|7/7|8/8|9/9|10/10"Example output:NAME READY STATUS RESTARTS AGE kafka-controller-75fcdd9c4c-lk8dz 1/2 CrashLoopBackOff 245 (4m36s ago) 21h kafka-controller-75fcdd9c4c-xfddm 1/2 CrashLoopBackOff 245 (2m45s ago) 21h - Verify related resources in the
PROJECT_CPD_INST_OPERANDSproject.- Check the health of
triggers:
oc get triggersExample output:NAME BROKER SUBSCRIBER_URI AGE READY REASON wa-ke.assistant.clu-controller.v1.training-complete knative-wa-clu-broker 25h Unknown failed to reconcile consumer group wa-ke.assistant.clu-controller.v1.training-failed knative-wa-clu-broker 25h Unknown failed to reconcile consumer group wa-ke.assistant.clu-controller.v1.training-start knative-wa-clu-broker 25h Unknown failed to reconcile consumer group - Check the health of Consumer
Groups:
oc get consumergroupsExample output:NAME READY REASON SUBSCRIBER REPLICAS READY REPLICAS AGE knative-trigger-cpd-wa-ke.assistant.clu-controller.v1.training-complete 1 25h knative-trigger-cpd-wa-ke.assistant.clu-controller.v1.training-failed 1 25h knative-trigger-cpd-wa-ke.assistant.clu-controller.v1.training-start 1 25h - Check the health of
Consumers:
oc get consumersExample output:NAME READY REASON SUBSCRIBER AGE knative-trigger-cpd-wa-ke.assistant.clu-controller.v1.trai8ffsr 25h knative-trigger-cpd-wa-ke.assistant.clu-controller.v1.traikzfhh 25h knative-trigger-cpd-wa-ke.assistant.clu-controller.v1.traitfzrk 25h
- Delete the resources and let the operator create new resources by running the following
script:
#!/bin/bash # Define resource kinds resources=("trigger" "consumer" "consumergroup") # Loop through each resource kind for resource in "${resources[@]}"; do # Get the list of resources with 'assistant' in the name oc get "$resource" -o json | jq -r '.items[] | select(.metadata.name | contains("assistant")) | .metadata.name' | while read name; do # Patch the resource to remove finalizers echo "Removing finalizers from $resource: $name" oc patch "$resource" "$name" --type=merge -p '{"metadata":{"finalizers":[]}}' # Delete the resource echo "Deleting $resource: $name" oc delete "$resource" "$name" done done - Restart Kafka controller
pods:
oc rollout restart deployment/kafka-controller -n knative-eventing - Restart watsonx Assistant CLU training
pods:
oc delete pod -l app=wa-clu-training
After you complete these steps, the Kafka controller pods should recover, and watsonx Assistant will resume normal training operations.
Unable to create backup due to missing ConfigMaps
Applies to: 5.1.0
Applies to: Backup with OADP utility
Fixed in: 5.1.1
- Diagnosing the problem
- In the CPD-CLI*.log file, you see an error message like in the following
example:
Error: global registry check failed: 1 error occurred: * error from addOnId=zen-lite: 1 error occurred: * failed to find aux configmap 'ibm-cs-postgres-ckpt-cm' in tenant service namespace='${PROJECT_CPD_INST_OPERANDS}': : configmaps "ibm-cs-postgres-ckpt-cm" not found - Cause of the problem
- Backup and restore ConfigMaps can go missing after an upgrade, or they might have been deleted accidentally.
Confirm that the ConfigMaps are missing by running the following
commands:
oc get cm -n ${PROJECT_CPD_INST_OPERANDS} -l cpdfwk.aux-kind=checkpoint | grep zenoc get cm -n ${PROJECT_CPD_INST_OPERANDS} -l cpdfwk.aux-kind=checkpoint | grep cs-postgresThe output shows no ConfigMaps when you run these commands.
- Resolving the problem
- Trigger a reconcile by running the following
command:
oc patch -n ${PROJECT_CPD_INST_OPERANDS} zenservice lite-cr --type=merge --patch '{"spec": {"refresh_install": false}}'The reconcile will recreate the missing ConfigMaps.
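To confirm that the reconcile recreated the missing ConfigMaps, a hedged recheck that reuses the same label selector as the commands above:
# After the reconcile completes, the checkpoint ConfigMaps should be listed again.
oc get cm -n ${PROJECT_CPD_INST_OPERANDS} -l cpdfwk.aux-kind=checkpoint | grep -E "zen|cs-postgres"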
Backup is missing EDB Postgres PVCs
Applies to: 5.1.0
Applies to: Offline or online backup and restore with the OADP utility
Fixed in: 5.1.1
- Diagnosing the problem
- After an online or offline backup is taken with the OADP utility, EDB Postgres PVCs are missing in the PVC backup list.
- Cause of the problem
- EDB Postgres replica PVCs might be excluded from a backup when an EDB Postgres cluster switches primary instances.
- Resolving the problem
- Before you create a backup, run the following
command:
oc label pvc,pods -l k8s.enterprisedb.io/cluster,velero.io/exclude-from-backup=true velero.io/exclude-from-backup- -n ${PROJECT_CPD_INST_OPERANDS}For more information, see the following topics:
Unable to restore offline backup of RStudio® Server Runtimes
Applies to: 5.1.2 and later
Applies to: Offline backup and restore with the OADP utility
- Diagnosing the problem
- In the CPD-CLI*.log file, you see an error message like in the following
example:
1 error occurred: 07:55:55 * error performing op postRestoreViaConfigHookRule for resource rstudio-maint-br (configmap=cpd-rstudio-maint-aux-br-cm): 1 error occurred: 07:55:55 * : rstudioaddons.rstudio.cpd.ibm.com "rstudio-cr" not found - Resolving the problem
- Do the following steps:
- In the OpenShift
Console, edit the
cpd-rstudio-maint-aux-br-cm ConfigMap.Note: To ensure that the ConfigMap is correctly formatted, do not edit the ConfigMap in the vi or vim editor.
- Update the workflow:
workflows: - name: restore-pre-operators sequence: - group: rstudio-sa - group: rstudio-resources - name: restore-post-operators sequence: [] - name: restore-pre-operands sequence: [] - name: restore-operands sequence: - group: rstudio-clusterroles - group: rstudio-crs - name: restore-post-operands sequence: [] - name: restore-post-namespacescope sequence: [] - Retry the restore by running the following
commands:
export CPDBR_ENABLE_FEATURES=experimental cpd-cli oadp tenant-restore create ${TENANT_BACKUP_NAME}-restore \ --from-tenant-backup ${TENANT_BACKUP_NAME} \ --verbose \ --log-level=debug \ --disable-inverseops \ --start-from restore-operands \ --log-level=debug &> ${TENANT_BACKUP_NAME}-restore.log&
Offline post-restore hooks fail when restoring Informix
Applies to: 5.1.2 and later
Applies to: Offline backup and restore with the OADP utility
- Diagnosing the problem
- In the CPD-CLI*.log file, you see an error message like in the following
example:
1 error occurred: * error performing op postRestoreViaConfigHookRule for resource informix-1741694080250204 (configmap=informix-1741694080250204-aux-br-cm): : informixservices.ifx.ibm.com "informixservice-cr" not found - Cause of the problem
- The Zenservice custom resource is not fully reconciling.
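To confirm this cause before you apply the workaround, you can check the reconciliation status of the zenservice custom resource; this is a hedged sketch that assumes the default lite-cr name used elsewhere in this document.
# A value other than Completed indicates that the zenservice is still reconciling.
oc get zenservice lite-cr -n ${PROJECT_CPD_INST_OPERANDS} -o jsonpath='{.status.zenStatus}{"\n"}'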
- Resolving the problem
- Do the following steps:
- Rerun post-restore hooks by running the following
command:
export CPDBR_ENABLE_FEATURES=experimental cpd-cli oadp tenant-restore create ${TENANT_RESTORE_NAME}-cont \ --from-tenant-backup ${TENANT_BACKUP_NAME} \ --verbose \ --log-level debug \ --start-from restore-post-namespacescope - Re-install the Informix custom
resource:
cpd-cli manage apply-cr \ --components=informix_cp4d \ --release=${VERSION} \ --cpd_instance_ns=${PROJECT_CPD_INST_OPERANDS} \ --license_acceptance=true
Unable to log in to the IBM Software Hub user interface after offline restore
Applies to: 5.1.2 and later
Applies to: Offline backup and restore with the OADP utility
- Diagnosing the problem
- After you restore an offline backup of IBM Software Hub, you cannot log in to the user
interface. The following error message
appears:
CWOAU0061E: The OAuth service provider could not find the client because the client name is not valid. Contact your system administrator to resolve the problem. - Resolving the problem
- Delete the PostgreSQL replica
for the common-service-db cluster by running the following
command:
oc delete po,pvc -l k8s.enterprisedb.io/cluster=common-service-db,k8s.enterprisedb.io/instanceRole=replica
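After the delete command runs, the EDB operator recreates the replica. A hedged way to confirm that the replica pod is back and that the cluster reports healthy:
# The replica pod should return to a Running state.
oc get pods -n ${PROJECT_CPD_INST_OPERANDS} -l k8s.enterprisedb.io/cluster=common-service-db
# The cluster should report a healthy status once the replica rejoins.
oc get clusters.postgresql.k8s.enterprisedb.io common-service-db -n ${PROJECT_CPD_INST_OPERANDS}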
Offline restore to different cluster fails with nginx error
Applies to: 5.1.2
Applies to: Offline backup and restore to a different cluster with the OADP utility
Fixed in: 5.1.3
- Diagnosing the problem
- In the CPD-CLI*.log file, you see errors like in the following
example:
* error executing workflow actions: workflow action execution resulted in 1 error(s): - encountered an error during hook workflowAction.Do() - action=nginx-maint/disable, action-index=0, retry-attempt=0/0, err=: command terminated with exit code 1 - Resolving the problem
- This problem is intermittent. To resolve the problem, run the following
command:
export CPDBR_ENABLE_FEATURES=experimental cpd-cli oadp tenant-restore create ${TENANT_RESTORE_NAME}-cont \ --from-tenant-backup ${TENANT_BACKUP_NAME} \ --verbose \ --log-level debug \ --start-from restore-post-namespacescope
Post-restore hooks fail to run when restoring deployment that includes IBM Knowledge Catalog
Applies to: 5.1.2
Applies to: Online backup and restore with IBM Fusion, NetApp Trident protect, and Portworx asynchronous disaster recovery
Fixed in: 5.1.3
- Diagnosing the problem
- After the restore, the IBM Knowledge Catalog custom
resource remains stuck in the
InProgressstate.In the log file, you see errors like in the following examples:
1 error occurred: * error performing op postRestoreViaConfigHookRule for resource lite (configmap=cpd-lite-aux-br-cm): Timed out waiting for workloads to unquiesce: timed out waiting for the condition. Some workloads may still be in the process of scaling up, please use 'oc get deployment' to check workload statuses. - Resolving the problem
-
Reconcile the IBM Knowledge Catalog custom resource wkc-cr so that it reaches the
Completedstate by running the following command:oc delete job kg-resync-glossary -n ${PROJECT_CPD_INST_OPERANDS}Then wait for wkc-cr to reach a
Completedstate.
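A hedged way to watch for that state change is shown below; it assumes that the IBM Knowledge Catalog custom resource uses the wkc.wkc.cpd.ibm.com resource type, consistent with the knowledgegraph.wkc.cpd.ibm.com resource used elsewhere in this document.
# Wait until the STATUS column shows Completed.
oc get wkc.wkc.cpd.ibm.com wkc-cr -n ${PROJECT_CPD_INST_OPERANDS}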
IBM Knowledge Catalog custom resources are not restored
Applies to: 5.1.2
Applies to: Online and offline backup and restore with the OADP utility
Fixed in: 5.1.3
- Diagnosing the problem
- The following custom resources are not restored after IBM Knowledge Catalog Premium is successfully restored:
- When semantic automation is enabled, the ikcpremium.ikc.cpd.ibm.com custom resource is not restored.
- When semantic automation is not enabled, the ikcpremium.ikc.cpd.ibm.com and ikcstandard.ikc.cpd.ibm.com custom resources are not restored.
After IBM Knowledge Catalog Standard is successfully restored, the ikcstandard.ikc.cpd.ibm.com custom resource is not restored when semantic automation is not enabled.
- Resolving the problem
- Recreate the custom resources from the backup. Do the following steps:
- Identify the resource
backup:
resourceBackup=$(oc get backups.velero.io -n ${OADP_OPERATOR_PROJECT} -l "cpdbr.cpd.ibm.com/tenant-backup-name=${TENANT_BACKUP_NAME},cpdbr.cpd.ibm.com/backup-name=resource_backup" --no-headers | awk '{print $1}') - Download the
backup:
cpd-cli oadp tenant-backup download ${TENANT_BACKUP_NAME} unzip ${TENANT_BACKUP_NAME}-data.zip mkdir ${resourceBackup} tar -xf ${resourceBackup}-data.tar.gz -C ${resourceBackup} ikcStandardCRLoc=$(find ${resourceBackup}/resources -type f| grep ikcstandard | grep ${PROJECT_CPD_INST_OPERANDS} | grep -v preferred) ikcPremiumCRLoc=$(find ${resourceBackup}/resources -type f| grep ikcpremium | grep ${PROJECT_CPD_INST_OPERANDS} | grep -v preferred) - Recreate the missing custom resources and wait for the reconciliation to
complete:
cat ${ikcStandardCRLoc} | jq 'del(.metadata.ownerReferences) | del(.metadata.uid) | del(.metadata.creationTimestamp) | del(.status) | del(.metadata.managedFields) | del(.metadata.generation) | del(.metadata.resourceVersion)' | oc apply -f - cat ${ikcPremiumCRLoc} | jq 'del(.metadata.ownerReferences) | del(.metadata.uid) | del(.metadata.creationTimestamp) | del(.status) | del(.metadata.managedFields) | del(.metadata.generation) | del(.metadata.resourceVersion)' | oc apply -f - - Check that the custom resources were
created:
oc get ikcpremium.ikc.cpd.ibm.com -n ${PROJECT_CPD_INST_OPERANDS} oc get ikcstandard.ikc.cpd.ibm.com -n ${PROJECT_CPD_INST_OPERANDS} - If you restored an offline backup, remove the custom resources from maintenance
mode:
oc patch ikcpremium.ikc.cpd.ibm.com ikc-premium-cr -n ${PROJECT_CPD_INST_OPERANDS} --type merge --patch '{"spec": {"ignoreForMaintenance": false}}' oc patch ikcstandard.ikc.cpd.ibm.com ikc-standard-cr -n ${PROJECT_CPD_INST_OPERANDS} --type merge --patch '{"spec": {"ignoreForMaintenance": false}}'
Errors when restoring IBM Software Hub operators
Applies to: 5.1.2 and later
Applies to: Backup and restore with the OADP utility
- Diagnosing the problem
- In the CPD-CLI*log file, you see messages like in the following
example:
func=cpdbr-oadp/pkg/cli.(*LogAndPrintWriter).Write file=/a/workspace/oadp-upload/pkg/cli/scripts.go:125 time=<timestamp> level=info msg=Time: <timestamp> level=error - Postgres Cluster: common-service-db Timeout Error func=cpdbr-oadp/pkg/cli.(*LogAndPrintWriter).Write file=/a/workspace/oadp-upload/pkg/cli/scripts.go:125 time=<timestamp> level=info msg= func=cpdbr-oadp/pkg/cli.(*LogAndPrintWriter).Write file=/a/workspace/oadp-upload/pkg/cli/scripts.go:125 time=<timestamp> level=info msg=Exited with return code=1 func=cpdbr-oadp/pkg/cli.(*LogAndPrintWriter).Write file=/a/workspace/oadp-upload/pkg/cli/scripts.go:125 time=<timestamp> level=error msg=error executing /tmp/cpd-operators-3c18d4f8-0cb0-4b6b-b2a1-b7363e1f3736.sh restore --operators-namespace <namespace-name> --foundation-namespace <namespace-name>: exit status 1 func=cpdbr-oadp/pkg/cli.ExecCPDOperatorsScript file=/a/workspace/oadp-upload/pkg/cli/scripts.go:105 time=<timestamp> level=error msg=cpd-operators script execution failed (args=[restore --operators-namespace <namespace-name> --foundation-namespace <namespace-name>]): exit status 1 func=cpdbr-oadp/pkg/dpaplan.(*MasterPlan).ExecuteParentRecipe.(*MasterPlan).localExecOperatorsRestoreAction.func7 file=/a/workspace/oadp-upload/pkg/dpaplan/master_plan_local_exec_actions.go:679 - Cause of the problem
- A
BundleUnpackFailederror occurred. Inspect subscriptions by running the following command:oc get sub ibm-common-service-operator -o yamlExample output:
- message: 'bundle unpacking failed. Reason: DeadlineExceeded, and Message: Job was active longer than specified deadline' reason: BundleUnpackFailed status: "True" type: BundleUnpackFailed lastUpdated: "<timestamp>"This problem is a known issue with Red Hat. For details, see Operator installation or upgrade fails with DeadlineExceeded in RHOCP 4.
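To see whether other subscriptions in the operators project are affected, a hedged sketch that scans all subscriptions for the BundleUnpackFailed condition shown above:
# Print the name of every subscription that reports a BundleUnpackFailed condition.
oc get sub -n ${PROJECT_CPD_INST_OPERATORS} -o json | \
  jq -r '.items[] | select([.status.conditions[]? | select(.type=="BundleUnpackFailed" and .status=="True")] | length > 0) | .metadata.name'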
- Resolving the problem
- Do the following steps:
- Delete IBM Software Hub instance projects
(namespaces).
For details, see 3. Cleaning up the cluster before a restore.
- Retry the restore.
After a restore, OperandRequest timeout error in the ZenService custom resource
Applies to: 5.1.0 and later
Applies to: All backup and restore methods
- Diagnosing the problem
- Get the status of the ZenService
YAML:
oc get zenservice lite-cr -n ${PROJECT_CPD_INST_OPERATORS} -o yamlIn the output, you see the following error:
... zenMessage: '5.1.3/roles/0010-infra has failed with error: "OperandRequest" "zen-ca-operand-request": Timed out waiting on resource' ...Check for failing operandrequests:oc get operandrequests -AFor failing operandrequests, check their conditions forconstraints not satisfiablemessages:oc describe -n ${PROJECT_CPD_INST_OPERATORS} <opreq-name> - Cause of the problem
- Subscription wait operations timed out. The problematic subscriptions show an error similar to
the following
example:
'constraints not satisfiable: clusterserviceversion ibm-db2aaservice-cp4d-operator.v5.2.0 exists and is not referenced by a subscription, @existing/cpd-operators//ibm-db2aaservice-cp4d-operator.v5.2.0 and ibm-db2aaservice-cp4d-operator-catalog/cpd-operators/v5.2/ibm-db2aaservice-cp4d-operator.v5.2.0 originate from package ibm-db2aaservice-cp4d-operator, subscription ibm-db2aaservice-cp4d-operator requires ibm-db2aaservice-cp4d-operator-catalog/cpd-operators/v5.2/ibm-db2aaservice-cp4d-operator.v5.2.0, subscription ibm-db2aaservice-cp4d-operator exists'This problem is a known issue with Red Hat Operator Lifecycle Manager. For details, see Failed to install or upgrade operator with warning 'clusterserviceversion is not referenced by a subscription'.
- Resolving the problem
- Do the following steps:
- Delete the problematic clusterserviceversions and subscriptions, and restart the
Operand Deployment Lifecycle Manager (ODLM) pod.
For details, follow the steps in the troubleshooting document cloud-native-postgresql operator is installed with the certified-operators catalogsource.
- Delete IBM Software Hub instance projects
(namespaces).
For details, see 3. Cleaning up the cluster before a restore.
- Retry the restore.
After restoring the IBM Match 360 service, the onboard job fails
Fixed in: 5.1.2
Applies to: 5.1.1
Applies to: All backup and restore methods
- Diagnosing the problem
- After restoring a previously backed up IBM
Match 360 service instance, the
mdm-onboardjob can fail with aDuplicateTenantExceptionerror during model service tenant onboarding. - Cause of the problem
- This issue occurs when the operator tries to clean up resources and fails.
- Resolving the problem
- Before creating the backup, you can prevent this issue from occurring by labeling the IBM
Match 360 ConfigMap. Do the following steps:
- Get the ID of the IBM
Match 360 instance:
- From the IBM Software Hub home page, go to .
- Click the link for the IBM Match 360 instance.
- Copy the value after mdm- in the URL.
For example, if the end of the URL is
mdm-1234567891123456, the instance ID is1234567891123456.
- Create the following environment
variable:
export INSTANCE_ID=<instance-id> - Add the
mdmlabel by running the following command:oc label cm mdm-operator-${INSTANCE_ID} icpdsupport/addOnId=mdm -n ${PROJECT_CPD_INST_OPERANDS}
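If you prefer the command line to the web client, a hedged alternative to step 1 is to read the instance ID from the name of the mdm-operator-<instance-id> ConfigMap referenced above:
# The ConfigMap name ends with the IBM Match 360 instance ID.
oc get cm -n ${PROJECT_CPD_INST_OPERANDS} -o name | grep mdm-operator-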
After an online restore, some service custom resources cannot reconcile
Applies to: 5.1.0-5.1.2
Applies to: Online backup and restore with the OADP utility, IBM Fusion, NetApp Astra Control Center, and Portworx
Fixed in: 5.1.3
- Diagnosing the problem
- After an online restore completes, some service custom resources, Zenservice in particular, cannot reconcile. For example, the Zenservice custom resource remains stuck at 51% progress.
- Cause of the problem
- This problem occurs when you do an online backup and restore of a deployment that includes a
service, such as MANTA Automated Data Lineage or Data Gate, that does not support online backup and
restore.Tip: For more information about the backup and restore methods that each service supports, see Services that support backup and restore.
- Resolving the problem
- To resolve the problem, run one of the following commands:
- MANTA Automated Data Lineage
-
ZEN_METASTORE_PRIMARY_POD=$(oc get clusters.postgresql.k8s.enterprisedb.io -n ${PROJECT_CPD_INST_OPERANDS} zen-metastore-edb -o jsonpath='{.status.currentPrimary}') oc exec $ZEN_METASTORE_PRIMARY_POD -n ${PROJECT_CPD_INST_OPERANDS} -- psql -t -U postgres -c "update immutable_extensions set status = 'disabled' where extension_name like '%lineage%' and extension_point_id = 'zen_front_door';refresh materialized view extensions_view;" zen oc delete po -n ${PROJECT_CPD_INST_OPERANDS} -l 'app.kubernetes.io/component=ibm-nginx' oc delete po -n ${PROJECT_CPD_INST_OPERANDS} -l 'app.kubernetes.io/component=ibm-nginx-tester' - Data Gate
-
ZEN_METASTORE_PRIMARY_POD=$(oc get clusters.postgresql.k8s.enterprisedb.io -n ${PROJECT_CPD_INST_OPERANDS} zen-metastore-edb -o jsonpath='{.status.currentPrimary}') oc exec $ZEN_METASTORE_PRIMARY_POD -n ${PROJECT_CPD_INST_OPERANDS} -- psql -t -U postgres -c "update immutable_extensions set status = 'disabled' where extension_name like '%datagate%' and extension_point_id = 'zen_front_door';refresh materialized view extensions_view;" zen oc delete po -n ${PROJECT_CPD_INST_OPERANDS} -l 'app.kubernetes.io/component=ibm-nginx' oc delete po -n ${PROJECT_CPD_INST_OPERANDS} -l 'app.kubernetes.io/component=ibm-nginx-tester'
Restore fails for watsonx Orchestrate
Applies to: 5.1.0 and later
Applies to: All backup and restore methods
- Diagnosing the problem
- Restore fails after upgrade for the watsonx Orchestrate service.
- Cause of the problem
- An Out of Memory (OOM) error exists in a job that calls one of the pods.
- Resolving the problem
- Create and apply an RSI patch to increase the memory allocation by completing the following steps:
- Create a
directory:
mkdir cpd-cli-workspace/olm-utils-workspace/work/rsi - From the directory that you created, create a new file called
skill-seq.jsonthat contains the following information:[{"op":"replace","path":"/spec/containers/0/resources/limits/memory","value":"3Gi"}] - Run the following
commands:
podman stop olm-utils-play-v3 podman start olm-utils-play-v3 - Apply the patch:
cpd-cli manage create-rsi-patch \ --cpd_instance_ns=${PROJECT_CPD_INST_OPERATORS} \ --patch_name=skill-seq-resource-limit \ --patch_type=rsi_pod_spec \ --patch_spec=/tmp/work/rsi/skill-seq.json \ --spec_format=json \ --include_labels=wo.watsonx.ibm.com/component:wo-skill-sequencing \ --state=active - Delete the
job:
oc delete job wo-watson-orchestrate-bootstrap-job \ --namespace=${PROJECT_CPD_INST_OPERATORS}
IBM Software Hub user interface does not load after restoring
Applies to: 5.1.3
Applies to: All backup and restore methods
- Diagnosing the problem
- The IBM Software Hub user interface doesn't start after a restore. When you try to log in, you see a blank screen.
- Cause of the problem
- The following pods fail and must be restarted to load the IBM Software Hub user interface:
platform-auth-serviceplatform-identity-managementplatform-identity-provider
- Resolving the problem
- Run the following command to restart the
pods:
for po in $(oc get po -l icpdsupport/module=im -n ${PROJECT_CPD_INST_OPERANDS} --no-headers | awk '{print $1}' | grep -v oidc); do oc delete po -n ${PROJECT_CPD_INST_OPERANDS} ${po};done;
ibm-lh-lakehouse-validator pods repeatedly restart after restore
Applies to: 5.1.3
Applies to: All backup and restore methods
- Diagnosing the problem
- After restore, the
ibm-lh-lakehouse-validatorpod repeatedly restarts. - Cause of the problem
- This issue occurs because the liveness probe fails. The liveness probe has a 120-second delay.
Because of this delay, the health check fails before the application starts and the
ibm-lh-lakehouse-validatorpod repeatedly restarts. - Resolving the problem
- To resolve the problem, create a patch file that updates the initial delay of the liveness probe
to 300 seconds:
- Create the following JSON file. Save the file as
wxd-validator-patch.jsonin thecpd-cli-workspace/olm-utils-workspace/work/directory:[ { "op": "replace", "path": "/spec/containers/0/livenessProbe/initialDelaySeconds", "value": 300 } ] - Apply the patch:
cpd-cli manage create-rsi-patch \ --cpd_instance_ns=cpd-instance \ --patch_name=wxd-validator-patch-213 \ --spec_format=json \ --patch_type=rsi_pod_spec \ --patch_spec=/tmp/work/wxd-validator-patch.json \ --include_labels=icpdsupport/podSelector:ibm-lh-validator \ --state=active
Restore to same cluster fails with restore-cpd-volumes error
Applies to: 5.1.3
Applies to: All backup and restore methods
- Diagnosing the problem
- IBM Software Hub fails to restore,
and you see an error similar to the following
example:
error: DataProtectionPlan=cpd-offline-tenant/restore-service-orchestrated-parent-workflow, Action=restore-cpd-volumes (index=1) error: expected restore phase to be Completed, received PartiallyFailed - Cause of the problem
- The
ibm-streamsets-static-contentpod is mounted as read-only, which causes the restore to partially fail. - Resolving the problem
- Before you back up, exclude the IBM
StreamSets pod:
- Exclude the
ibm-streamsets-static-contentpod:oc label -l icpdsupport/addOnId=streamsets,app.kubernetes.io/name=ibm-streamsets-static-content velero.io/exclude-from-backup=true - Take another back up and follow the back up and restore procedure for your environment.
Restore fails with ErrImageNeverPull error for Analytics Engine powered by Apache Spark
Applies to: 5.1.3
Applies to: All backup and restore methods
- Diagnosing the problem
- IBM Software Hub fails to restore,
and you see an
ErrImageNeverPullerror similar to the following example:spark-history-deployment-0e7df4b9-f5ac-4e68-b7b7-518afd4048ggds 0/1 Init:ErrImageNeverPull 0 3h13m - Cause of the problem
- The Analytics Engine powered by Apache Spark history
server deployment is erroneously included in the backup. As a result, the history server deployment
is restored in the
ErrImageNeverPullstate. - Resolving the problem
- To resolve the issue, you must delete the Analytics Engine powered by Apache Spark history server deployment after you restore.
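A minimal sketch of that cleanup, assuming that the history server deployment name contains spark-history as in the example output above; verify the name before you delete it.
# Find the restored history server deployment.
oc get deploy -n ${PROJECT_CPD_INST_OPERANDS} | grep spark-history
# Delete it; substitute the deployment name returned by the previous command.
oc delete deploy <spark-history-deployment-name> -n ${PROJECT_CPD_INST_OPERANDS}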
After backing up or restoring IBM
Match 360, Redis fails to return to a
Completed state
Applies to: 5.1.3
Applies to: All backup and restore methods
- Diagnosing the problem
- After completing a backup or restoring the IBM
Match 360 service, its associated Redis
pod (
mdm-redis) and IBM Redis CP fail to return to aCompletedstate. - Cause of the problem
- This occurs when Redis CP fails to come out of maintenance mode.
- Resolving the problem
- To resolve this issue, manually update Redis CP to take it out of maintenance mode:
- Open the Redis CP YAML file for editing.
oc edit rediscp -n ${PROJECT_CPD_INST_OPERANDS} -o yaml - Update the value of the
ignoreForMaintenanceparameter fromtruetofalse.ignoreForMaintenance: false - Save the file and wait for the Redis pod to reconcile.
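If you prefer a non-interactive update instead of oc edit, a hedged sketch follows; it assumes that ignoreForMaintenance sits under spec, as it does for the IBM Knowledge Catalog patches earlier in this document, and you must substitute the RedisCP name that the first command returns.
# List the RedisCP resources for the instance.
oc get rediscp -n ${PROJECT_CPD_INST_OPERANDS}
# Take the resource out of maintenance mode; replace <rediscp-name> with the name from the previous command.
oc patch rediscp <rediscp-name> -n ${PROJECT_CPD_INST_OPERANDS} --type merge --patch '{"spec": {"ignoreForMaintenance": false}}'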
Backup fails with an ibm_neo4j error when IBM
Match 360 is scaled to the
x-small size
Applies to: 5.1.0
Applies to: All backup and restore methods
Fixed in: 5.1.1
- Diagnosing the problem
- When completing an online or offline backup of an IBM Software Hub instance that has the IBM
Match 360 service installed with the
x-smallscaling configuration, the backup will fail with the following error message:error getting inventory 'cm ibm-neo4j-inv-list-cm'To confirm that this issue is occurring:- Confirm that IBM
Match 360 is
configured with an
x-smallscaling configuration by running the following command:
A returned value ofoc get mdm -n ${PROJECT_CPD_INST_OPERANDS} -o yaml | grep scaleConfigscaleConfig: x-smallindicates thex-smallsize. - Confirm that Neo4j is configured to disable backups by running the following
command:
The following values indicate that Neo4j backups are disabled:oc get neo4j -n ${PROJECT_CPD_INST_OPERANDS} -o yamlbackupHookEnabled: false backupJobEnabled: false
- Cause of the problem
- This issue occurs when IBM
Match 360 is scaled to
x-smallbecause the Neo4J Custom Resource (neo4j) is configured to disable backups. The thex-smallsize is intended for non-production clusters, meaning that backups are not expected to be required. - Resolving the problem
- To resolve the problem, edit the IBM
Match 360 CR (
mdm-cr) to enable backups and increase the Neo4J cluster memory limits:- Edit the IBM
Match 360 CR
(
mdm-cr).oc edit mdm - Update the CR to use the following values in the
neo4jsection:neo4j: backupHookEnabled: true backupJobEnabled: true config: server.memory.heap.initial_size: 2g server.memory.heap.max_size: 2g podSpec: resources: limits: memory: 8Gi requests: memory: 4Gi - Wait for
mdm-crto reconcile itself and update the Neo4J cluster. Proceed to the next step when bothmdm-crandNeo4jClusterare in aCompletedstate. - Start the backup process.
Backup fails for the platform with error in EDB Postgres cluster
Applies to: 5.1.0 and later
Applies to: All backup and restore methods
- Diagnosing the problem
- For example, in IBM Fusion, the
backup fails at the Hook: br-service hooks/pre-backup stage in the backup
sequence.
In the cpdbr-oadp.log file, you see the following error:
time=<timestamp> level=info msg=cmd stderr: Error: cannot take a cold backup of the primary instance or a target primary instance if the k8s.enterprisedb.io/snapshotAllowColdBackupOnPrimary annotation is not set to enabled - Cause of the problem
- Labels and annotations in the EDB Postgres cluster resources were not updated after a switchover of the EDB Postgres cluster's primary instance and replica.
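To confirm this cause, you can check which pod currently carries the primary role after the switchover; a hedged sketch that uses the same EDB labels that appear in the commands in this section:
# List the EDB Postgres clusters and their reported primary instances.
oc get clusters.postgresql.k8s.enterprisedb.io -n ${PROJECT_CPD_INST_OPERANDS}
# Show the instanceRole label on each EDB pod; a mismatch with the reported primary indicates stale labels.
oc get pods -n ${PROJECT_CPD_INST_OPERANDS} -l k8s.enterprisedb.io/cluster -L k8s.enterprisedb.io/instanceRole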
- Resolving the problem
-
Automatic and manual workarounds are available. Do only one of the workarounds.
- Automatic workaround
-
The following workaround automatically runs before you create a backup. This workaround is useful if you set up automatic backups. Complete the steps that apply to your backup and restore method.
Online backup and restore
- Download the edb-patch-aux-ckpt-cm-legacy.yaml file.
- Run the following command:
oc apply -n ${PROJECT_CPD_INST_OPERATORS} -f edb-patch-aux-ckpt-cm-legacy.yaml
- Retry the backup.
Offline backup and restore
- Download the edb-patch-aux-br-cm-legacy.yaml file.
- Run the following command:
oc apply -n ${PROJECT_CPD_INST_OPERATORS} -f edb-patch-aux-br-cm-legacy.yaml
- Retry the backup.
- Manual workaround
-
If you want to manually run the workaround, complete the following steps:
- Download the edb-patch.sh file.
- Run the following
command:
sh edb-patch.sh ${PROJECT_CPD_INST_OPERATORS} - Retry the backup.
Db2 backup fails at the Hook: br-service hooks/pre-backup step
Applies to: 5.1.0 and later
Applies to: Backup and restore with IBM Fusion
- Diagnosing the problem
- In the cpdbr-oadp.log file, you see messages like in the following
example:
time=<timestamp> level=info msg=podName: c-db2oltp-5179995-db2u-0, podIdx: 0, container: db2u, actionIdx: 0, commandString: ksh -lc 'manage_snapshots --action suspend --retry 3', command: [sh -c ksh -lc 'manage_snapshots --action suspend --retry 3'], onError: Fail, singlePodOnly: false, timeout: 20m0s func=cpdbr-oadp/pkg/quiesce.executeCommand file=/go/src/cpdbr-oadp/pkg/quiesce/ruleexecutor.go:767 time=<timestamp> level=info msg=cmd stdout: func=cpdbr-oadp/pkg/quiesce.executeCommand file=/go/src/cpdbr-oadp/pkg/quiesce/ruleexecutor.go:823 time=<timestamp> level=info msg=cmd stderr: [<timestamp>] - INFO: Setting wolverine to disable Traceback (most recent call last): File "/usr/local/bin/snapshots", line 33, in <module> sys.exit(load_entry_point('db2u-containers==1.0.0.dev1', 'console_scripts', 'snapshots')()) File "/usr/local/lib/python3.9/site-packages/cli/snapshots.py", line 35, in main snap.suspend_writes(parsed_args.retry) File "/usr/local/lib/python3.9/site-packages/snapshots/snapshots.py", line 86, in suspend_writes self._wolverine.toggle_state(enable=False, message="Suspend writes") File "/usr/local/lib/python3.9/site-packages/utils/wolverine/wolverine.py", line 73, in toggle_state self._toggle_state(state, message) File "/usr/local/lib/python3.9/site-packages/utils/wolverine/wolverine.py", line 77, in _toggle_state self._cmdr.execute(f'wvcli system {state} -m "{message}"') File "/usr/local/lib/python3.9/site-packages/utils/command_runner/command.py", line 122, in execute raise CommandException(err) utils.command_runner.command.CommandException: Command failed to run:ERROR:root:HTTPSConnectionPool(host='localhost', port=9443): Read timed out. (read timeout=15) - Cause of the problem
- The Wolverine high availability monitoring process was in a
RECOVERINGstate before the backup was taken.Check the Wolverine status by running the following command:
wvcli system statusExample output:ERROR:root:REST server timeout: https://localhost:9443/status ERROR:root:Retrying Request: https://localhost:9443/status ERROR:root:REST server timeout: https://localhost:9443/status ERROR:root:Retrying Request: https://localhost:9443/status ERROR:root:REST server timeout: https://localhost:9443/status ERROR:root:Retrying Request: https://localhost:9443/status ERROR:root:REST server timeout: https://localhost:9443/status HA Management is RECOVERING at <timestamp>.The Wolverine log file /mnt/blumeta0/wolverine/logs/ha.log shows errors like in the following example:<timestamp> [ERROR] <MainProcess:11490> [wolverine.ha.loop(loop.py:414)] - check_and_recover: unhealthy_dm_set = {('c-db2oltp-5179995-db2u-0', 'node')} <timestamp> [ERROR] <MainProcess:11490> [wolverine.ha.loop(loop.py:416)] - (c-db2oltp-5179995-db2u-0, node) : not OK <timestamp> [ERROR] <MainProcess:11490> [wolverine.ha.loop(loop.py:421)] - check_and_recover: unhealthy_dm_names = {'node'} - Resolving the problem
- Do the following steps:
- Re-initialize
Wolverine:
wvcli system init --force - Wait until the Wolverine status is
RUNNING. Check the status by running the following command:wvcli system status - Retry the backup.
Backup fails at the Hook: br-service-hooks/checkpoint step
Applies to: 5.1.0
Applies to: Backup and restore with IBM Fusion
Fixed in: 5.1.1
- Diagnosing the problem
- In the IBM Fusion backup and
restore transaction manager logs, you see the following error:
time=<timestamp> level=info msg= ** PHASE [CONFIGMAP LOCK/CLEANUP/SUCCESS] ************************************* func=cpdbr-oadp/pkg/utils.LogMarker file=/go/src/cpdbr-oadp/pkg/utils/log.go:63 time=<timestamp> level=info msg=lock released func=cpdbr-oadp/cmd/cli/checkpoint.prepareContext.func1 file=/go/src/cpdbr-oadp/cmd/cli/checkpoint/create.go:279 time=<timestamp> level=error msg=error running processCreate(): error running checkpoint exec hooks: configmaps with the same component name and different module names are not allowed. cm1 name=data-lineage-neo4j-aux-ckpt-cm, cm1 component=data-lineage-neo4j, cm1 module=data-lineage-neo4j-aux, cm2 name=data-lineage-neo4j-aux-v2-ckpt-cm, cm2 component=data-lineage-neo4j, cm2 module=data-lineage-neo4j-aux-v2 func=cpdbr-oadp/cmd/cli/checkpoint.(*createCommandContext).execute file=/go/src/cpdbr-oadp/cmd/cli/checkpoint/create.go:730 Error: error running checkpoint exec hooks: configmaps with the same component name and different module names are not allowed. cm1 name=data-lineage-neo4j-aux-ckpt-cm, cm1 component=data-lineage-neo4j, cm1 module=data-lineage-neo4j-aux, cm2 name=data-lineage-neo4j-aux-v2-ckpt-cm, cm2 component=data-lineage-neo4j, cm2 module=data-lineage-neo4j-aux-v2 time=<timestamp> level=error msg=error running checkpoint exec hooks: configmaps with the same component name and different module names are not allowed. cm1 name=data-lineage-neo4j-aux-ckpt-cm, cm1 component=data-lineage-neo4j, cm1 module=data-lineage-neo4j-aux, cm2 name=data-lineage-neo4j-aux-v2-ckpt-cm, cm2 component=data-lineage-neo4j, cm2 module=data-lineage-neo4j-aux-v2 func=cpdbr-oadp/cmd.Execute file=/go/src/cpdbr-oadp/cmd/root.go:88 - Resolving the problem
- Ensure that the data-lineage-neo4j-aux-ckpt-cm and
data-lineage-neo4j-aux-v2-ckpt-cm ConfigMaps have the following matching
cpdfwk.componentandcpdfwk.modulelabels:"cpdfwk.module"is set to"data-lineage-neo4j-aux""cpdfwk.component"is set to"data-lineage-neo4j"
- To edit the data-lineage-neo4j-aux-ckpt-cm ConfigMap, run the following
command:
oc edit cm data-lineage-neo4j-aux-ckpt-cm - To edit the data-lineage-neo4j-aux-v2-ckpt-cm ConfigMap, run the following
command:
oc edit cm data-lineage-neo4j-aux-v2-ckpt-cm - Retry the backup.
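To confirm that the labels now match before the backup runs again, a simple hedged check:
# Run in the same project as the oc edit commands above.
# Both ConfigMaps should show cpdfwk.component=data-lineage-neo4j and cpdfwk.module=data-lineage-neo4j-aux.
oc get cm data-lineage-neo4j-aux-ckpt-cm data-lineage-neo4j-aux-v2-ckpt-cm --show-labels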
Restoring an online backup of IBM Software Hub on IBM Storage Scale Container Native storage fails
Applies to: IBM Fusion 2.8.2 and later
- Diagnosing the problem
- When you restore an online backup with IBM Fusion, the restore process fails at the Volume group: cpd-volumes step in the restore sequence.
- Resolving the problem
- This problem occurs when you have Persistent Volume Claims (PVCs) that are smaller than 5Gi. To
resolve the problem, expand any PVC that is smaller than 5Gi to at least 5Gi before you create the
backup. For details, see Volume Expansion in the IBM Storage Scale Container Storage Interface Driver documentation.Note: If your deployment includes Watson OpenScale, you cannot manually expand Watson OpenScale PVCs. To manage PVC sizes for Watson OpenScale, see Managing persistent volume sizes for Watson OpenScale.
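A hedged way to find PVCs that might need to be expanded before you create the backup (review the SIZE column for anything below 5Gi):
# List every PVC in the instance project with its requested size.
oc get pvc -n ${PROJECT_CPD_INST_OPERANDS} -o custom-columns=NAME:.metadata.name,SIZE:.spec.resources.requests.storage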
Restore fails at Hook: br-service-hooks/operators-restore step
Applies to: 5.1.2 and later
Applies to: Backup and restore with IBM Fusion
- Diagnosing the problem
- In the IBM Fusion backup and
restore transaction manager logs, you see errors like in the following
example:
[ERROR] <timestamp> [TM_0] KubeCalls:2093 - The execution of the application command "sh -c /cpdbr-scripts/cpdbr/cpdbr-cpd-operators.sh restore --operators-namespace watsonx-ops --foundation-namespace watsonx-ops; echo rc=$? > /tmp/fccd13fd-5a3f-4abc-afd9-df268670d099-ghysb.log" on pod cpdbr-tenant-service-86b7475cfc-nd98d in namespace watsonx-ops took longer than the specified timeout of 7200 seconds. [INFO] <timestamp> [TM_0] KubeCalls:2094 - stdout from the application command: [INFO] <timestamp> [TM_0] KubeCalls:2095 - stderr from the application command: cat: /tmp/fccd13fd-5a3f-4abc-afd9-df268670d099-ghysb.log: No such file or directory [INFO] <timestamp> [TM_0] KubeCalls:2096 - Error output from the application command: {'status': 'Failed', 'message': 'exec hook failed to write return code to a file cat: /tmp/fccd13fd-5a3f-4abc-afd9-df268670d099-ghysb.log: No such file or directory\n'} [ERROR] <timestamp> [TM_0] KubeCalls:2115 - Failed to delete file /tmp/fccd13fd-5a3f-4abc-afd9-df268670d099-ghysb.log in cpdbr-tenant-service-86b7475cfc-nd98d, 'rm: cannot remove '/tmp/fccd13fd-5a3f-4abc-afd9-df268670d099-ghysb.log': No such file or directory ' [ERROR] <timestamp> [TM_0] apphooks:148 - Timeout reached before command completed. However, the operation continues because of the on-error annotation value. [ERROR] <timestamp> [TM_0] exechook:353 - Running command '["/cpdbr-scripts/cpdbr/cpdbr-cpd-operators.sh", "restore", "--operators-namespace", "watsonx-ops", "--foundation-namespace", "watsonx-ops"]' on pod 'watsonx-opscpdbr-tenant-service-86b7475cfc-nd98d' failed with rc: 5 [ERROR] <timestamp> [TM_0] exechook:84 - Operation 'operators-restore' failed with exception: 'The execution of the application command "["/cpdbr-scripts/cpdbr/cpdbr-cpd-operators.sh", "restore", "--operators-namespace", "watsonx-ops", "--foundation-namespace", "watsonx-ops"]" on pod cpdbr-tenant-service-86b7475cfc-nd98din namespace {self.hook.namespace} took longer than the specified timeout of {self.timeout} seconds.' [ERROR] <timestamp> [TM_0] workflow:145 - cmdResults="{'br-service-hooks/operators-restore': {'watsonx-ops:cpdbr-tenant-service-86b7475cfc-nd98d': 5}}" [ERROR] <timestamp> [TM_0] workflow:206 - End execution sequence "hook/br-service-hooks/operators-restore" failed. [ERROR] <timestamp> [TM_0] recipe:589 - Execution of workflow "restore" of recipe "watsonx-ops:ibmcpd-tenant" completed with 1 failures, last failure was: "ExecHook/br-service-hooks/operators-restore"Hook run in watsonx-ops:cpdbr-tenant-service-86b7475cfc-nd98d ended with rc 5 indicating hook reached timeout prior to completion. [ERROR] <timestamp> [TM_0] restoreguardian:771 - Unexpected failure in hook: 'Execution of workflow restore of recipe ibmcpd-tenant completed. Number of failed commands: 1, last failed command: "ExecHook/br-service-hooks/operators-restore"Hook run in watsonx-ops:cpdbr-tenant-service-86b7475cfc-nd98d ended with rc 5 indicating hook reached timeout prior to completion.' - Cause of the problem
- The operators restore phase might take longer than expected due to slowness with Operator Lifecycle Manager, and fail with a timeout error.
- Resolving the problem
- Re-run the restore without cleaning up the cluster.
Restore fails at Hook: br-service-hooks/operators-restore step
Applies to: 5.1.2 and later
Applies to: Backup and restore with IBM Fusion
- Diagnosing the problem
- This problem occurs when you upgrade IBM Cloud Pak for Data 5.0.x to IBM Software Hub
5.1.2.
In the IBM Fusion backup and restore transaction manager logs, you see the following error:
Time: <timestamp> level=info - Postgres Cluster:: common-service-db - phase: null Time: <timestamp> level=error - Postgres Cluster: common-service-db Timeout Error End Time: <timestamp> -------------------------------------------------- Summary of level=warning/error messages: -------------------------------------------------- Time: <timestamp> level=error - Postgres Cluster: common-service-db Timeout Error Exited with return code=1 /cpdbr-scripts/cpdbr/cpdbr-cpd-operators.sh exit code=1 *** cpdbr-cpd-operators.sh failed *** - Cause of the problem
- The PostgreSQL operator is watching
namespaces from annotations instead of the namespace-scope ConfigMap. Run the
following
command:
oc get po -n ${PROJECT_CPD_INST_OPERATORS} $(oc get po -n ${PROJECT_CPD_INST_OPERATORS} | grep postgres | awk '{print $1}' | head -1) -o yaml | grep WATCH -A 10 -B 10Example output:- name: WATCH_NAMESPACE valueFrom: fieldRef: apiVersion: v1 fieldPath: metadata.annotations['olm.targetNamespaces']Check if the PostgreSQL clusters are not being reconciled by the PostgreSQL operator by running the following command:
oc get clusters.postgresql.k8s.enterprisedb.io -n ${PROJECT_CPD_INST_OPERANDS}Example output:NAME AGE INSTANCES READY STATUS PRIMARY common-service-db 33h wa-postgres 33h wa-postgres-16 33h zen-metastore-edb 33h - Resolving the problem
- Do the following steps:
- Log in to the source
cluster:
${OC_LOGIN} - Change to the project where IBM Software Hub is
installed:
oc project ${PROJECT_CPD_INST_OPERANDS} - Run the following patch
command:
oc patch csv $(oc get csv | grep postgres | awk '{print $1}') --type json -p '[ { "op": "replace", "path": "/spec/install/spec/deployments/0/spec/template/spec/containers/0/env/0", "value": { "name": "WATCH_NAMESPACE", "valueFrom": { "configMapKeyRef": { "key": "namespaces", "name": "namespace-scope", "optional": true } } } } ]' - Make the following update to the PostgreSQL
role:
for role in $(oc get role -n ${PROJECT_CPD_INST_OPERANDS} -l app.kubernetes.io/instance=namespace-scope,app.kubernetes.io/managed-by=ibm-namespace-scope-operator | grep cloud-native-postgresql | awk '{print $1}'); do saIdx=$(oc get role ${role} -n ${PROJECT_CPD_INST_OPERANDS} -o json | jq '.rules | to_entries | map(select(.value.resources | index("serviceaccounts") != null) | .key)[0]') eval $(echo "oc patch role ${role} -n ${PROJECT_CPD_INST_OPERANDS} --type json -p '[{\"op\": \"add\", \"path\": \"/rules/${saIdx}/verbs/-\", \"value\": \"delete\"}]'") done - Retry the restore.Note: You do not need to clean up the cluster before retrying the restore.
IBM Fusion reports successful
restore but many service custom resources are not in Completed state
Applies to: 5.1.1
Applies to: Backup and restore with IBM Fusion
Fixed in: 5.1.2
- Diagnosing the problem
- The status of service custom resources is less than 100%.
- Cause of the problem
- The ZenService custom resource is stuck. Run the following
command:
The output of the command showsoc get ZenService lite-cr -n ${PROJECT_CPD_INST_OPERANDS} -o yamlzenStatus: InProgress. - Resolving the problem
- Rerun restore posthooks and reset the namespacescope by running the following
commands:
oc rsh -n ${PROJECT_CPD_INST_OPERATORS} $(oc get po -n ${PROJECT_CPD_INST_OPERATORS} | grep cpdbr-tenant | awk '{print $1}') /cpdbr-scripts/cpdbr/checkpoint_restore_posthooks.sh --scale-wait-timeout=30m --tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} --include-namespaces=${PROJECT_CPD_INST_OPERATORS},${PROJECT_CPD_INST_OPERANDS} /cpdbr-scripts/cpdbr/cpdbr-cpd-operators.sh restore-namespacescope --operators-namespace ${PROJECT_CPD_INST_OPERATORS} --foundation-namespace ${PROJECT_CPD_INST_OPERATORS}
Watson OpenScale etcd server fails to start after restoring from a backup
Applies to: 5.1.0 and later
Applies to: Backup and restore with NetApp Astra Control Center
- Diagnosing the problem
- After restoring a backup with NetApp Astra Control Center, the Watson
OpenScale
etcd cluster is in a
Failedstate. - Resolving the problem
- To resolve the problem, do the following steps:
-
Log in to Red Hat OpenShift Container Platform as a cluster administrator.
${OC_LOGIN}Remember:OC_LOGINis an alias for theoc logincommand. - Expand the size of the etcd PersistentVolumes by 1Gi.
In the following example, the current PVC size is 10Gi, and the commands set the new PVC size to 11Gi.
operatorPod=`oc get pod -n ${PROJECT_CPD_INST_OPERATORS} -l name=ibm-cpd-wos-operator | awk 'NR>1 {print $1}'` oc exec ${operatorPod} -n ${PROJECT_CPD_INST_OPERATORS} -- roles/service/files/etcdresizing_for_resizablepv.sh -n ${PROJECT_CPD_INST_OPERANDS} -s 11Gi - Wait for the reconciliation status of the Watson
OpenScale custom resource to be in a
Completedstate:oc get WOService aiopenscale -n ${PROJECT_CPD_INST_OPERANDS} -o jsonpath='{.status.wosStatus} {"\n"}'The status of the custom resource changes to
Completedwhen the reconciliation finishes successfully.
Error when activating applications after a migration
Applies to: 5.1.0
Applies to: Portworx asynchronous disaster recovery
Fixed in: 5.1.1
- Diagnosing the problem
- In a deployment that includes Data Virtualization,
the following errors appear when you try to activate
applications:
* error performing op postRestoreViaConfigHookRule for resource dv (configmap=cpd-dv-aux-ckpt-cm): 1 error occurred: * error executing command (container=db2u podIdx=0 podName=c-db2u-dv-db2u-0 namespace=${PROJECT_CPD_INST_OPERANDS} auxMetaName=dv-aux component=dv actionIdx=0): command timed out after 40m0s: timed out waiting for the condition - Cause of the problem
- Restore hooks failed.
- Resolving the problem
- Do the following steps:
- Manually recreate /tmp/.ready_to_connectToDb in the Data Virtualization head
pod:
oc exec -n ${PROJECT_CPD_INST_OPERANDS} -t pod/c-db2u-dv-db2u-0 -- touch /tmp/.ready_to_connectToDb - Run restore
posthooks:
oc exec -n ${PROJECT_CPD_INST_OPERATORS} -t $(oc get po -n ${PROJECT_CPD_INST_OPERATORS} -l component=cpdbr-tenant -o NAME) -- /cpdbr-scripts/cpdbr-oadp restore posthooks --include-namespaces ${PROJECT_CPD_INST_OPERATORS},${PROJECT_CPD_INST_OPERANDS} --hook-kind checkpoint --log-level debug --verbose - After running restore posthooks is successfully completed, reset the
namespacescope:
oc exec -n ${PROJECT_CPD_INST_OPERATORS} -t $(oc get po -n ${PROJECT_CPD_INST_OPERATORS} -l component=cpdbr-tenant -o NAME) -- /cpdbr-scripts/cpdops/files/cpd-operators.sh restore-namespacescope --foundation-namespace ${PROJECT_CPD_INST_OPERATORS} --operators-namespace ${PROJECT_CPD_INST_OPERATORS}
Restore is taking a long time to complete
Applies to: 5.1.2
Applies to: Online backup and restore with NetApp Trident protect
Fixed in: 5.1.3
- Diagnosing the problem
- Restores are taking longer than expected to complete.
- Cause of the problem
- Processing is slow for large KopiaVolumeRestores, such as activelogs-c-db2wh-<xxxxxxxxxxxxxxxxx>-db2u-0.
- Resolving the problem
- No workaround. As long as the restore is progressing, let the KopiaVolumeRestores
finish.Best practice: Ensure that the restore location is in the same region as the restore environment.
IBM Software Hub resources are not migrated
Applies to: 5.1.0 and later
Applies to: Portworx asynchronous disaster recovery
- Diagnosing the problem
- When you use Portworx
asynchronous disaster recovery, the migration finishes almost immediately, and neither the volumes nor the expected number of resources are migrated. Run the following
command:
storkctl get migrations -n ${PX_ADMIN_NS}Tip:${PX_ADMIN_NS}is usually kube-system.Example output:NAME CLUSTERPAIR STAGE STATUS VOLUMES RESOURCES CREATED ELAPSED TOTAL BYTES TRANSFERRED cpd-tenant-migrationschedule-interval-<timestamp> mig-clusterpair Final Successful 0/0 0/0 <timestamp> Volumes (0s) Resources (3s) 0 - Cause of the problem
- This problem occurs starting with stork 23.11.0. Backup exec rules are not run, and expected IBM Software Hub resources are not migrated.
- Resolving the problem
- To resolve the problem, downgrade stork to a version prior to 23.11.0. For
more information about stork releases, see the stork Releases page.
- Scale down the Portworx operator
so that it doesn't reset manual changes to the stork
deployment:
oc scale -n ${PX_ADMIN_NS} deploy portworx-operator --replicas=0 - Edit the stork deployment image version to a version prior to
23.11.0:
oc edit deploy -n ${PX_ADMIN_NS} stork - If you need to scale up the Portworx operator, run the following command.Note: The Portworx operator will undo changes to the stork deployment and return to the original stork version.
oc scale -n ${PX_ADMIN_NS} deploy portworx-operator --replicas=1
Offline backup prehooks fail for lite-maint resource
Applies to: 5.1.0
Applies to: Offline backup and restore with the OADP utility
Fixed in: 5.1.1
- Diagnosing the problem
- In the CPD-CLI*.log file, you see error messages like in the following
example:
Error: 1 error occurred: * failed to execute masterplan: unexpected error from 'v1-orchestration' DataProtectionPlan undo action, exiting execution: error executing workflow actions: workflow action execution resulted in 1 error(s): - encountered an error during local-exec workflowAction.Undo() - action=pre-backup-hooks, action-index=0, retry-attempt=0/0, err=offline post-backup hooks execution failed: error running post-backup hooks: Error running post-processing rules. Check the /root/bar/cpd-cli-workspace/logs/CPD-CLI-2024-11-14.log for errors. 1 error occurred: * error performing op postBackupViaConfigHookRule for resource lite-maint (configmap=cpd-lite-aux-br-maint-cm): 1 error occurred: * error executing command (container=ibm-nginx-container podIdx=1 podName=ibm-nginx-5845cf4bcc-qmbl9 namespace=wkc auxMetaName=lite-maint-aux component=lite-maint actionIdx=0): command terminated with exit code 1 - Cause of the problem
- This problem is intermittent.
- Resolving the problem
- Do the following steps:
- Restart
nginx.
oc rollout restart deploy/ibm-nginx -n ${PROJECT_CPD_INST_OPERANDS} - Retry the backup.
Offline backup prehooks fail when deployment includes IBM Knowledge Catalog
Applies to: 5.1.0
Applies to: Offline backup and restore with the OADP utility
Fixed in: 5.1.1
- Diagnosing the problem
- When you try to create a backup, you see error messages in the CPD-CLI*.log
file like in the following
examples:
time=<timestamp> level=error msg=error running processCreate(): failed to execute masterplan: 2 errors occurred: * error from dpp.Execute() [traceId=97cd14a9-9bc3-494f-ba12-7e1c185fef10, dpp=v1-orchestration, operationKind=backup]: error executing workflow actions: workflow action execution resulted in 1 error(s): - encountered an error during local-exec workflowAction.Do() - action=pre-backup-hooks, action-index=2, retry-attempt=0/0, err=offline pre-backup hooks execution failed: error running pre-backup hooks: pod/wdp-profiling-cloud-native-postgresql-1 is not supported for scaling, please define the proper prebackup hooks * DataProtectionPlan=v1-orchestration, Action=pre-backup-hooks (index=2) error: offline pre-backup hooks execution failed: error running pre-backup hooks: pod/wdp-profiling-cloud-native-postgresql-1 is not supported for scaling, please define the proper prebackup hooks func=cpdbr-oadp/cmd/cli/tenantbackup.(*CreateCommandContext).execute file=/a/workspace/oadp-upload/cmd/cli/tenantbackup/create.go:1412 time=<timestamp> level=error msg=failed to execute masterplan: 2 errors occurred: * error from dpp.Execute() [traceId=97cd14a9-9bc3-494f-ba12-7e1c185fef10, dpp=v1-orchestration, operationKind=backup]: error executing workflow actions: workflow action execution resulted in 1 error(s): - encountered an error during local-exec workflowAction.Do() - action=pre-backup-hooks, action-index=2, retry-attempt=0/0, err=offline pre-backup hooks execution failed: error running pre-backup hooks: pod/wdp-profiling-cloud-native-postgresql-1 is not supported for scaling, please define the proper prebackup hooks * DataProtectionPlan=v1-orchestration, Action=pre-backup-hooks (index=2) error: offline pre-backup hooks execution failed: error running pre-backup hooks: pod/wdp-profiling-cloud-native-postgresql-1 is not supported for scaling, please define the proper prebackup hooks - Cause of the problem
- The wkc-foundationdb-cluster-aux-br-cm ConfigMap is missing the
cpdfwk.spec-version=2.0.0label.To confirm that the label is missing, run the following command:cat cpd-cli-workspace/logs/CPD-CLI-$(date '+%Y-%m-%d').log | grep "is filtered out, because there is newer spec-version"Example output when the label is missing:time=<timestamp> level=warning msg=configmap name=cpd-ikc-enrichment-aux-br-cm in namespace=wkc does not have icpdsupport/addOnId label, skipping filter func=cpdbr-oadp/pkg/registry.FilterOutNewerSpecVersionCms file=/a/workspace/oadp-upload/pkg/registry/registry.go:1016 time=<timestamp> level=debug msg=configmap name=cpd-ikc-finley-aux-br-cm namespace=wkc is filtered out, because there is newer spec-version, cm spec-version=2.0.0 wanted spec-version=1.0.0 func=cpdbr-oadp/pkg/registry.FilterOutNewerSpecVersionCms file=/a/workspace/oadp-upload/pkg/registry/registry.go:1031 time=<timestamp> level=debug msg=configmap name=cpd-ikc-glossary-aux-br-cm namespace=wkc is filtered out, because there is newer spec-version, cm spec-version=2.0.0 wanted spec-version=1.0.0 func=cpdbr-oadp/pkg/registry.FilterOutNewerSpecVersionCms file=/a/workspace/oadp-upload/pkg/registry/registry.go:1031 time=<timestamp> level=debug msg=configmap name=cpd-ikc-ikc-aux-br-cm namespace=wkc is filtered out, because there is newer spec-version, cm spec-version=2.0.0 wanted spec-version=1.0.0 func=cpdbr-oadp/pkg/registry.FilterOutNewerSpecVersionCms file=/a/workspace/oadp-upload/pkg/registry/registry.go:1031 time=<timestamp> level=debug msg=configmap name=cpd-ikc-knowledgegraph-aux-br-cm namespace=wkc is filtered out, because there is newer spec-version, cm spec-version=2.0.0 wanted spec-version=1.0.0 func=cpdbr-oadp/pkg/registry.FilterOutNewerSpecVersionCms file=/a/workspace/oadp-upload/pkg/registry/registry.go:1031 time=<timestamp> level=debug msg=configmap name=cpd-ikc-policy-aux-br-cm namespace=wkc is filtered out, because there is newer spec-version, cm spec-version=2.0.0 wanted spec-version=1.0.0 func=cpdbr-oadp/pkg/registry.FilterOutNewerSpecVersionCms file=/a/workspace/oadp-upload/pkg/registry/registry.go:1031 time=<timestamp> level=debug msg=configmap name=cpd-ikc-profiling-aux-br-cm namespace=wkc is filtered out, because there is newer spec-version, cm spec-version=2.0.0 wanted spec-version=1.0.0 func=cpdbr-oadp/pkg/registry.FilterOutNewerSpecVersionCms file=/a/workspace/oadp-upload/pkg/registry/registry.go:1031 time=<timestamp> level=debug msg=configmap name=cpd-ikc-wkcgovui-aux-br-cm namespace=wkc is filtered out, because there is newer spec-version, cm spec-version=2.0.0 wanted spec-version=1.0.0 func=cpdbr-oadp/pkg/registry.FilterOutNewerSpecVersionCms file=/a/workspace/oadp-upload/pkg/registry/registry.go:1031 time=<timestamp> level=debug msg=configmap name=cpd-ikc-workflow-aux-br-cm namespace=wkc is filtered out, because there is newer spec-version, cm spec-version=2.0.0 wanted spec-version=1.0.0 func=cpdbr-oadp/pkg/registry.FilterOutNewerSpecVersionCms file=/a/workspace/oadp-upload/pkg/registry/registry.go:1031 - Resolving the problem
- Add the missing label to the ConfigMap by running the following command:
oc label cm wkc-foundationdb-cluster-aux-br-cm cpdfwk.spec-version=2.0.0
After the label is added, test that the backup prehooks run successfully by running the following command:
cpd-cli oadp backup prehooks \ --hook-kind=br \ --tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} \ --dry-run \ --verbose \ --log-level=debug
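To confirm that the label is now present, you can list the labels on the ConfigMap (run the command in the project where the ConfigMap resides; the project name is not specified in the original steps):
oc get cm wkc-foundationdb-cluster-aux-br-cm --show-labels -n <configmap-project>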
Offline restore fails at post-restore hooks step
Applies to: 5.1.0
Applies to: Offline backup and restore with the OADP utility
Fixed in: 5.1.1
- Diagnosing the problem
- In the CPD-CLI*.log file, you see error messages like in the following
example:
Hook execution breakdown by status=error/timedout: The following hooks either have errors or timed out post-restore (1): COMPONENT CONFIGMAP METHOD STATUS DURATION ADDONID lite-maint cpd-lite-aux-br-maint-cm rule error 30.607451346s zen-lite -------------------------------------------------------------------------------- Error: failed to execute masterplan: 1 error occurred: * DataProtectionPlan=v1-orchestration, Action=post-restore-hooks (index=8) error: offline post-restore hooks execution failed: error running post-restore hooks: Error running post-processing rules. Check the /root/CPD-QA-BAR/cpd-cli-workspace/logs/CPD-CLI-2024-11-19.log for errors. 1 error occurred: * error performing op postRestoreViaConfigHookRule for resource lite-maint (configmap=cpd-lite-aux-br-maint-cm): 1 error occurred: * error executing command (container=ibm-nginx-container podIdx=0 podName=ibm-nginx-774fd7445f-2g5hw namespace=latest-510 auxMetaName=lite-maint-aux component=lite-maint actionIdx=0): command terminated with exit code 1 - Cause of the problem
- This problem is intermittent. After a restore is completed, disabling nginx maintenance mode occasionally fails because the nginx configuration file is not found. As a result, the restore appears to have failed even though it completed.
- Resolving the problem
- Do the following steps:
- Restart
nginx.
oc rollout restart deploy/ibm-nginx -n ${PROJECT_CPD_INST_OPERANDS} - Rerun post-restore
hooks:
cpd-cli oadp restore posthooks \ --tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} \ --log-level=debug \ --verbose
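To confirm that the ibm-nginx deployment finished rolling out after the restart, you can run a standard rollout status check before you rerun the post-restore hooks (this check is not part of the documented steps):
oc rollout status deploy/ibm-nginx -n ${PROJECT_CPD_INST_OPERANDS}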
Prompt tuning fails after restoring watsonx.ai
Applies to: 5.1.1
Applies to: Backup and restore with the OADP utility
- Diagnosing the problem
- When you try to create a prompt tuning experiment, you see the following error
message:
An error occurred while processing prompt tune training. - Resolving the problem
- Do the following steps:
- Restart the caikit
operator:
oc rollout restart deployment caikit-runtime-stack-operator -n ${PROJECT_CPD_INST_OPERATORS}Wait at least 2 minutes for the cais fmaas custom resource to become healthy.
- Check the status of the cais fmaas custom resource by running the following
command:
oc get cais fmaas -n ${PROJECT_CPD_INST_OPERANDS} - Retry the prompt tuning experiment.
Post-restore hook error when restoring offline backup of Db2
Applies to: 5.1.1
Applies to: Offline backup and restore with the OADP utility
- Diagnosing the problem
- This problem occurs intermittently. In the CPD-CLI*log file, you see errors
like in the following example:
Error: failed to execute masterplan: 2 errors occurred: * error executing workflow actions: workflow action execution resulted in 1 error(s): - encountered an error during local-exec workflowAction.Do() - action=cpd-post-restore-hooks, action-index=26, retry-attempt=0/0, err=offline post-restore hooks execution failed: error running post-restore hooks: Error running post-processing rules. Check the /root/cpd_cli_linux/cpd-cli-workspace/logs/CPD-CLI-<date_timestamp>.log for errors. 1 error occurred: * error performing op postRestoreViaConfigHookRule for resource db2u (configmap=db2u-aux-br-cm): 2 errors occurred: * error executing command ksh -lc 'if [[ -f /usr/bin/manage_snapshots ]]; then manage_snapshots --action restore --retry 3; else write-restore; fi' (container=db2u podIdx=0 podName=c-db2oltp-1738299739760012-db2u-0 namespace=zen-ns auxMetaName=db2u-aux component=db2u actionIdx=0): command terminated with exit code 255 - Resolving the problem
- Run the following
command:
export CPDBR_ENABLE_FEATURES=experimental
cpd-cli oadp tenant-restore create ${TENANT_RESTORE_NAME}-cont \ --from-tenant-backup ${TENANT_BACKUP_NAME} \ --verbose \ --log-level debug \ --start-from cpd-post-restore-hooks
Restoring Data Virtualization fails with metastore not running or failed to connect to database error
Applies to: 5.1.2 and later
Applies to: Online backup and restore with the OADP utility
- Diagnosing the problem
- View the status of the restore by running the following
command:
cpd-cli oadp tenant-restore status ${TENANT_BACKUP_NAME}-restore --detailsThe output shows errors like in the following examples:time=<timestamp> level=INFO msg=Verifying if Metastore is listening SERVICE HOSTNAME NODE PID STATUS Standalone Metastore c-db2u-dv-hurricane-dv - - Not runningtime=<timestamp> level=ERROR msg=Failed to connect to BigSQL database * error performing op postRestoreViaConfigHookRule for resource dv (configmap=cpd-dv-aux-ckpt-cm): 1 error occurred: * error executing command su - db2inst1 -c '/db2u/scripts/bigsql-exec.sh /usr/ibmpacks/current/bigsql/bigsql/bigsql-cli/BIGSQL/package/scripts/bigsql-db2ubar-hook.sh -H POST -M RESTORE -L' (container=db2u podIdx=0 podName=c-db2u-dv-db2u-0 namespace=<namespace-name> auxMetaName=dv-aux component=dv actionIdx=0): command terminated with exit code 1 - Cause of the problem
- A timing issue causes the restore posthooks to fail at the step where they check the result of the db2 connect to bigsql command. The db2 connect to bigsql command fails because Big SQL is restarting at about the same time.
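To confirm whether Big SQL is accepting connections again, you can optionally run the same connectivity check manually from the Db2U pod (the pod name is taken from the example output above and the instance operands project is assumed; this check is not part of the documented procedure):
oc exec -it c-db2u-dv-db2u-0 -n ${PROJECT_CPD_INST_OPERANDS} -- su - db2inst1 -c "db2 connect to bigsql"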
- Run the following
command:
export CPDBR_ENABLE_FEATURES=experimental
cpd-cli oadp tenant-restore create ${TENANT_RESTORE_NAME}-cont \ --from-tenant-backup ${TENANT_BACKUP_NAME} \ --verbose \ --log-level debug \ --start-from cpd-post-restore-hooks
After online restore, watsonx Code Assistant for Z is not running properly
Applies to: 5.1.2
Applies to: Online backup and restore with the OADP utility
Fixed in: 5.1.3
- Diagnosing the problem
- Errors appear in some tabs in the watsonx Code Assistant™ for Z user interface.
- Cause of the problem
- PVC content transfer time exceeded the maximum time allowed during the restore process. As a
result, the wcaz-cr custom resource is in an
Failedstate, and the wca-codegen-c2j pod is not ready.To check the status of the wcaz-cr custom resource, run the following command:oc get wcaz -n ${PROJECT_CPD_INST_OPERANDS}Example output:NAME VERSION RECONCILED STATUS AGE wcaz-cr 5.1.2 Failed 3h5mTo check the status of the wca-codegen-c2j pod, run the following command:oc get po -n ${PROJECT_CPD_INST_OPERANDS} | grep wcaExample output:wca-codegen-76d98b5cbf-qg5kt 1/1 Running 0 125m wca-codegen-c2j-7954f9946f-h4p88 0/1 Running 8 (2m44s ago) 136m wca-codematch-85c8b67766-fpbq2 1/1 Running 0 125m wca-codematch-init-9fqbs 0/1 Completed 0 125m - Resolving the problem
- If wca-codegen-c2j pod is not in a ready state, delete the
internal-tls-pkcs12
secret:
oc delete secret internal-tls-pkcs12 -n ${PROJECT_CPD_INST_OPERANDS}After approximately 2-4 minutes, the Common core services operator reconciles and the secret is recreated. The wca-codegen-c2j pod restarts automatically and the watsonx Code Assistant for Z service will work properly.
If the wcaz-cr custom resource has not completed reconciliation, delete the ibm-wca-z-operator-<xxx> pod:
oc delete pod ibm-wca-z-operator-<xxx> -n ${PROJECT_CPD_INST_OPERATORS}
The pod is recreated and reconciliation begins. When reconciliation is complete, the service works properly.
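To verify that the secret was recreated and that the pod is ready before you continue, you can run checks like the following (the grep pattern reuses the pod name shown in the example output above):
oc get secret internal-tls-pkcs12 -n ${PROJECT_CPD_INST_OPERANDS}
oc get po -n ${PROJECT_CPD_INST_OPERANDS} | grep wca-codegen-c2j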
Online post-restore hooks fail to run with timed out waiting for condition
error when restoring Analytics Engine powered by Apache Spark
Applies to: 5.1.2
Applies to: Online backup and restore with the OADP utility
- Diagnosing the problem
- In the CPD-CLI*.log file, you see error messages like in the following
examples:
Failed with 1 error(s): error: DataProtectionPlan=cpd-tenant/restore-service-orchestrated-parent-workflow, Action=cpd-post-restore-hooks (index=28) online post-restore hooks execution failed: timed out waiting for the conditiontime=<timestamp> level=debug msg=waiting for replicas of spark-hb-create-trust-store deployment (0/1) in namespace <cpd-tenant-namespace> func=cpdbr-oadp/pkg/kube.waitForDeployment.func1 file=/a/workspace/oadp-upload/pkg/kube/deployment.go:116 time=<timestamp> level=debug msg=waiting for replicas of spark-hb-create-trust-store deployment (0/1) in namespace <cpd-tenant-namespace> func=cpdbr-oadp/pkg/kube.waitForDeployment.func1 file=/a/workspace/oadp-upload/pkg/kube/deployment.go:116 time=<timestamp> level=error msg=failed to wait for deployment spark-hb-create-trust-store in namespace <cpd-tenant-namespace>: timed out waiting for the condition func=cpdbr-oadp/pkg/kube.waitForDeploymentPods file=/a/workspace/oadp-upload/pkg/kube/deployment.go:198 - Cause of the problem
- Analytics Engine powered by Apache Spark resources in tethered projects were not restored.
- Resolving the problem
- Do the following steps:
- Download the online-tenant-restore-tethered-namespaces-workaround.sh script onto your workstation.
- Make the script
executable:
chmod 755 online-tenant-restore-tethered-namespaces-workaround.sh - Run the
script:
./online-tenant-restore-tethered-namespaces-workaround.sh ${TENANT_BACKUP_NAME} ${PROJECT_CPD_INST_OPERATORS} ${PROJECT_CPD_INST_OPERANDS} ${OADP_OPERATOR_PROJECT} - Check that no pods are in a
Pendingstate in the tethered projects:oc get po -n ${TETHERED_NAMESPACE} - Resume the restore
process:
export CPDBR_ENABLE_FEATURES=experimental
cpd-cli oadp tenant-restore create ${TENANT_RESTORE_NAME}-cont \ --from-tenant-backup ${TENANT_BACKUP_NAME} \ --verbose \ --log-level debug \ --start-from cpd-post-restore-hooks
Restore fails with condition not met error
Applies to: 5.1.2 and later
Applies to: Backup and restore with the OADP utility
- Diagnosing the problem
- The restore is actually successful. But in the CPD-CLI*.log file, you see
error messages like in the following
example:
Error: failed to execute masterplan: 2 errors occurred: * error executing workflow actions: workflow action execution resulted in 1 error(s): - encountered an error during hook workflowAction.Do() - action=ibmcpd-check/controlPlaneCompletedStatus, action-index=0, retry-attempt=0/0, err=1 error occurred: * condition not met (condition={$.status.controlPlaneStatus} == {"Completed"}, namespace=cpd-instance, gvr=cpd.ibm.com/v1, Resource=ibmcpds, name=ibmcpd-cr) * DataProtectionPlan=configmap=cpd-zen-aux-v2-ckpt-cm, Action=ibmcpd-check/controlPlaneCompletedStatus (index=0) error: 1 error occurred: * condition not met (condition={$.status.controlPlaneStatus} == {"Completed"}, namespace=cpd-instance, gvr=cpd.ibm.com/v1, Resource=ibmcpds, name=ibmcpd-cr) - Resolving the problem
- Do the following steps:
- Check that the zenservices custom resource is in a
Completedstate:oc get zenservices lite-cr -n ${PROJECT_CPD_INST_OPERANDS} -o jsonpath='{.status}' - Check that the ibmcpd custom resource is in a
Completedstate:oc get ibmcpd ibmcpd-cr -n ${PROJECT_CPD_INST_OPERANDS} -o jsonpath='{.status}' - Run one of the following commands.Note: The zenservices and ibmcpd custom resources must be in a
Completedstate before you run one of these commands.- Offline restore
-
export CPDBR_ENABLE_FEATURES=experimental cpd-cli oadp tenant-restore create ${TENANT_RESTORE_NAME}-cont \ --from-tenant-backup ${TENANT_BACKUP_NAME} \ --verbose \ --log-level debug \ --start-from restore-operands - Online restore
-
export CPDBR_ENABLE_FEATURES=experimental cpd-cli oadp tenant-restore create ${TENANT_RESTORE_NAME}-cont \ --from-tenant-backup ${TENANT_BACKUP_NAME} \ --verbose \ --log-level debug \ --start-from restore-post-namespacescope
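If you prefer to check only the specific status fields rather than the full status objects, the following variations might be convenient (the controlPlaneStatus field name comes from the error message and the zenStatus field is used elsewhere in this document; adjust the field names if your custom resources differ):
oc get ibmcpd ibmcpd-cr -n ${PROJECT_CPD_INST_OPERANDS} -o jsonpath='{.status.controlPlaneStatus}'
oc get zenservices lite-cr -n ${PROJECT_CPD_INST_OPERANDS} -o jsonpath='{.status.zenStatus}'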
Offline restore fails with getting persistent volume claim error message
Applies to: 5.1.1
Applies to: Offline backup and restore with the OADP utility
Fixed in: 5.1.2
- Diagnosing the problem
- In the CPD-CLI*.log file, you see an error message like the following
example:
Error: failed to execute restore sequence: failed to execute 'restore-cpd-volumes' step: failed to execute masterplan: 1 error occurred: * DataProtectionPlan=configmap=ibmcpd-tenant-parent-br-cm, Action=restore-cpd-volumes (index=0) error: error: expected restore phase to be Completed, received PartiallyFailedIn the velero.log file, you see error messages like in the following example:
time="<timestamp>" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="error getting persistent volume claim for volume: persistentvolumeclaims \"data-wo-docproc-etcd-0\" not found" logSource="/remote-source/velero/app/pkg/restore/restore.go:1881" restore=openshift-adp/cpd-tenant-vol-r-3a590e1c-f001-11ef-a2e1-66a81cc1d6c7 time="<timestamp>" level=error msg="Restoring pod is not scheduled until timeout or cancel, disengage" error="client rate limiter Wait returned an error: context canceled" logSource="/remote-source/velero/app/pkg/podvolume/restorer.go:212" time="<timestamp>" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="error getting persistent volume claim for volume: persistentvolumeclaims \"data-wo-docproc-etcd-1\" not found" logSource="/remote-source/velero/app/pkg/restore/restore.go:1881" restore=openshift-adp/cpd-tenant-vol-r-3a590e1c-f001-11ef-a2e1-66a81cc1d6c7 time="<timestamp>" level=error msg="Restoring pod is not scheduled until timeout or cancel, disengage" error="client rate limiter Wait returned an error: context canceled" logSource="/remote-source/velero/app/pkg/podvolume/restorer.go:212" time="<timestamp>" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="error getting persistent volume claim for volume: persistentvolumeclaims \"data-wo-docproc-etcd-2\" not found" logSource="/remote-source/velero/app/pkg/restore/restore.go:1881" restore=openshift-adp/cpd-tenant-vol-r-3a590e1c-f001-11ef-a2e1-66a81cc1d6c7 time="<timestamp>" level=error msg="Restoring pod is not scheduled until timeout or cancel, disengage" error="client rate limiter Wait returned an error: context canceled" logSource="/remote-source/velero/app/pkg/podvolume/restorer.go:212" time="<timestamp>" level=error msg="Velero restore error: error getting persistent volume claim for volume: persistentvolumeclaims \"data-wo-docproc-etcd-0\" not found" logSource="/remote-source/velero/app/pkg/controller/restore_controller.go:604" restore=openshift-adp/cpd-tenant-vol-r-3a590e1c-f001-11ef-a2e1-66a81cc1d6c7 time="<timestamp>" level=error msg="Velero restore error: error getting persistent volume claim for volume: persistentvolumeclaims \"data-wo-docproc-etcd-1\" not found" logSource="/remote-source/velero/app/pkg/controller/restore_controller.go:604" restore=openshift-adp/cpd-tenant-vol-r-3a590e1c-f001-11ef-a2e1-66a81cc1d6c7 time="<timestamp>" level=error msg="Velero restore error: error getting persistent volume claim for volume: persistentvolumeclaims \"data-wo-docproc-etcd-2\" not found" logSource="/remote-source/velero/app/pkg/controller/restore_controller.go:604" restore=openshift-adp/cpd-tenant-vol-r-3a590e1c-f001-11ef-a2e1-66a81cc1d6c7 - Cause of the problem
- One or more watsonx Orchestrate PVCs are labeled for exclusion from the backup, but the pods that mount them are backed up.
- Resolving the problem
- Do the following steps:
- Add a label to the pods so that they are excluded from
backups:
oc label po -n ${PROJECT_CPD_INST_OPERANDS} -l "wo.watsonx.ibm.com/component=docproc,component=etcd" velero.io/exclude-from-backup=true - Create a new backup.
After restoring Watson Speech services online backup, unable to use service instance ID to make service REST API calls
Applies to: 5.1.0, 5.1.1, 5.1.2
Applies to: Online backup and restore to the same cluster with the OADP utility
Fixed in: 5.1.3
- Diagnosing the problem
- After performing a restore, when you attempt to use the Text-to-Speech REST APIs with the existing IBM Software Hub service instance token, you see an error message similar to the following example:
<html> <head><title>401 Authorization Required</title></head> <body> <center><h1>401 Authorization Required</h1></center> <hr><center>openresty</center> </body> </html> - Resolving the problem
- Create a new service instance and use the authorization details (endpoint URL and bearer token) from the new instance.
After restoring Watson Discovery online backup, unable to use service instance ID to make service REST API calls
Applies to: 5.1.2
Applies to: Online backup and restore to the same cluster with the OADP utility
Fixed in: 5.1.3
- Diagnosing the problem
- This problem occurs when Match
360 and
Watson Discovery are installed on the same
cluster.
After the restore, when you attempt to use the Watson Discovery REST APIs with the existing IBM Software Hub service instance token, you see an error message similar to the following example:
<html> <head><title>401 Authorization Required</title></head> <body> <center><h1>401 Authorization Required</h1></center> <hr><center>openresty</center> </body> </html>The following Match 360 user access groups are also present in the Watson Discovery service instance, receiving the same401errors:- DataSteward
- EntityViewer
- PublisherUser
- DataEngineer
You can see these user access groups when you select the service instance followed by Manage Access.
- Resolving the problem
- Do one of the following steps:
- Create a new service instance and use the authorization details (endpoint URL and bearer token) from the new instance.
- Remove the DataSteward, EntityViewer, PublisherUser, and DataEngineer user access groups from the Watson Discovery service instance.
Restore posthooks fail to run when restoring Data Virtualization
Applies to: 5.1.0, 5.1.1
Applies to: Backup and restore with the OADP utility
Fixed in: 5.1.2
- Diagnosing the problem
- In the CPD-CLI*.log file, you see an error message like in the following
example:
velero post-backup hooks in namespace <namespace> have one or more errors check for errors in <cpd-cli location>, and try again Error: velero post-backup hooks failed [ERROR] <timestamp> RunPluginCommand:Execution error: exit status 1 - Cause of the problem
- Velero hooks annotations are blocking the restore posthooks from running.Get the Data Virtualization addon pod definition by running a command like in the following example:
oc get po dv-addon-6fdddc4bc7-8bdlq -o jsonpath="{.metadata.annotations}" | jq .Example output that shows the Velero annotations:... "post.hook.backup.velero.io/command": "[\"bash\", \"-c\", \"echo Executing post-backup no-op hook\"]", "post.hook.restore.velero.io/command": "[\"bash\", \"-c\", \"echo Executing post-resttore no-op hook\"]", "pre.hook.backup.velero.io/command": "[\"bash\", \"-c\", \"echo Executing pre-backup no-op hook\"]", ... - Resolving the problem
- Remove the Velero hooks annotations. Because these annotations are not used, you can remove them
from all pods. Run the following
commands:
oc annotate po --all post.hook.backup.velero.io/command- -n ${PROJECT_CPD_INST_OPERANDS} oc annotate po --all post.hook.restore.velero.io/command- -n ${PROJECT_CPD_INST_OPERANDS} oc annotate po --all pre.hook.backup.velero.io/command- -n ${PROJECT_CPD_INST_OPERANDS}After the annotations are removed, rerun the restore posthooks command:
cpd-cli oadp restore posthooks \ --tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} \ --log-level=debug \ --verbose
Offline restore fails with cs-postgres timeout error
Applies to: 5.1.1 and later
Applies to: Offline backup and restore with the OADP utility
- Diagnosing the problem
- In the CPD-CLI*.log file, you see an error message like in the following
example:
* timed out performing op postRestoreViaConfigHookRule for resource cs-postgres (configmap=ibm-cs-postgres-br-cm): timed out waiting for the condition - Cause of the problem
- An offline restore posthook reports a restore failure when the default timeout of the cs-postgres-restore-job job, which is 6 minutes, is exceeded.
- Resolving the problem
- Increase the default timeout to 10 minutes (600 seconds) by doing the following steps:
- Create a YAML file named
restore-hooks.yaml:
cat << EOF > restore-hooks.yaml data: restore-meta: | post-hooks: exec-rules: - actions: - job: job-key: cs-postgres-restore-job timeout: 600s - actions: - job: job-key: cpfs-br-restore-posthooks-job timeout: 600s EOF - Patch the ibm-cs-postgres-br-cm ConfigMap by using the
restore-hooks.yaml
file:
oc patch cm ibm-cs-postgres-br-cm --patch-file restore-hooks.yaml -n ${PROJECT_CPD_INST_OPERANDS} - Re-run the restore posthooks
command:
cpd-cli oadp restore posthooks \ --tenant-operator-namespace ${PROJECT_CPD_INST_OPERATORS} \ --verbose \ --log-level debug
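To confirm that the patch was applied before you rerun the posthooks, you can print the hook definition from the ConfigMap and check the new timeout values (a generic check, not part of the documented steps):
oc get cm ibm-cs-postgres-br-cm -n ${PROJECT_CPD_INST_OPERANDS} -o jsonpath='{.data.restore-meta}'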
Restoring an offline backup fails with zenservice-check error
Applies to: 5.1.1
Applies to: Offline backup and restore with the OADP utility
Fixed in: 5.1.2
- Diagnosing the problem
- In the CPD-CLI*.log file, you see an error message like in the following
example:
Error: failed to execute masterplan: 2 errors occurred: * error executing workflow actions: workflow action execution resulted in 1 error(s): - encountered an error during hook workflowAction.Do() - action=zenservice-check/zenStatus, action-index=1, retry-attempt=0/0, err=no matches for /, Resource=zenservices * DataProtectionPlan=configmap=cpd-lite-aux-v2-br-cm, Action=zenservice-check/zenStatus (index=1) error: no matches for /, Resource=zenservices - Resolving the problem
- This problem occurs intermittently. The problem is usually resolved when you retry the restore.
Error running post-restore hooks during offline restore
Applies to: 5.1.1
Applies to: Offline backup and restore with the OADP utility
Fixed in: 5.1.2
- Diagnosing the problem
- In the CPD-CLI*.log file, you see an error message like in the following
example:
<timestamp> Failed with 1 error(s): <timestamp> error: DataProtectionPlan=platform-orchestration, Action=cpd-post-restore-hooks (index=8) <timestamp> offline post-restore hooks execution failed: error running post-restore hooks: Error running post-processing rules. Check the /root/br/restore/cpd-cli-workspace/logs/CPD-CLI-2025-02-10.log for errors. <timestamp> 1 error occurred: <timestamp> * op error: id=25, name=unquiesceViaScaling, configmap=: error performing op unquiesceViaScaling for resource servicecollection-cronjob, msg: : cronjobs.batch "servicecollection-cronjob" not found <timestamp> <timestamp> <timestamp> <timestamp> Finished with status: Failed <timestamp> saving master plan results to tenant-meta... <timestamp> <timestamp> ** INFO [MASTER PLAN RESULTS/SUMMARY/END] ************************************* <timestamp> <timestamp> ** INFO [RESTORE CREATE/DONE EXECUTING MASTERPLAN '\''RESTORE'\'' WORKFLOW (ISONLINE=FALSE)...] <timestamp> <timestamp> -------------------------------------------------------------------------------- <timestamp> <timestamp> Scenario: RESTORE CREATE (bkpnfsgrpcv110-tenant-offline-v2-b1-restore) <timestamp> Start Time: 2025-02-10 01:44:27.908033034 -0800 PST m=+0.300241262 <timestamp> Completion Time: 2025-02-10 03:00:00.054939864 -0800 PST m=+4532.447148092 <timestamp> Time Elapsed: 1h15m32.14690683s <timestamp> <timestamp> -------------------------------------------------------------------------------- <timestamp> <timestamp> Hook execution breakdown by status=error/timedout: <timestamp> <timestamp> The following hooks either have errors or timed out <timestamp> <timestamp> unquiesce (1): <timestamp> <timestamp> COMPONENT CONFIGMAP METHOD STATUS DURATION ADDONID <timestamp> servicecollection-cronjob N/A N/A error 0s N/A <timestamp> <timestamp> -------------------------------------------------------------------------------- <timestamp> <timestamp> [ERROR] <timestamp> RunPluginCommand:Execution error: exit status 1'This problem occurs intermittently to IBM Software Hub deployments when service monitors are installed.
- Resolving the problem
- Manually run the following restore posthooks
command:
cpd-cli oadp restore posthooks \ --tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} --hook-kind=br \ --spec-version="2.0.0" \ --verbose \ --log-level=trace
Afterwards, retry the restore.
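If you want to check whether the cron job exists before you rerun the hooks, you can query it directly (the resource name is taken from the error message):
oc get cronjob servicecollection-cronjob -n ${PROJECT_CPD_INST_OPERANDS}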
Online backup of upgraded IBM Cloud Pak for Data instance fails validation
Applies to: 5.1.0
Applies to: Online backup and restore with the OADP utility
- Diagnosing the problem
- This problem occurs when you upgrade an IBM Cloud Pak for Data 5.0.3 deployment to IBM Software Hub
5.1.0 and then try to create an online backup. The backup fails at the
backup validation stage. In the CPD-CLI*.log file, you see the following
error:
** INFO [SUMMARIZE BACKUP VALIDATION RESULTS/START] *************************** error: backup validation unsuccessful failed rules report: CM NAME RESOURCE-KIND ADDONID PATH ERR ibm-cs-postgres-ckpt-cm configmap cpfs backup-validation-meta/backup-validations/2/validation-rules/0 object with name 'platform-auth-idp' does not exists in the backup - Resolving the problem
- Do the following steps:
- Remove the icpdsupport/ignore-on-nd-backup label from the platform-auth-idp ConfigMap by running the following command:
oc label cm platform-auth-idp icpdsupport/ignore-on-nd-backup- - Retry the backup.
Offline backup of Db2 Data Management Console fails with backup validation error
Applies to: 5.1.1
Applies to: Offline backup and restore with the OADP utility
Fixed in: 5.1.2
- Diagnosing the problem
- In the CPD-CLI*.log file, you see the following
error:
backup validation failed for configmap: cpd-dmc-aux-br-cm - Resolving the problem
- Update the cpd-dmc-aux-br-cm ConfigMap.
- Edit the cpd-dmc-aux-br-cm ConfigMap.
- Under the
backup-validation-metasection, make the following updates.- Change
dmcaddon-sample to dmc-addon.
- resource-kind: dmcs.dmc.databases.ibm.com validation-rules: - type: match_names names: - dmc-sample
- Change
After you make these updates, check that thebackup-validation-metasection is like the following codeblock:backup-validation-meta: |- backup-validations: - resource-kind: dmcaddons.dmc.databases.ibm.com validation-rules: - type: match_names names: - dmc-addon - resource-kind: configmap validation-rules: - type: match_names names: - ibm-dmc-addon-api-cm
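A minimal way to open the ConfigMap for editing, assuming it is in the instance operands project, is:
oc edit cm cpd-dmc-aux-br-cm -n ${PROJECT_CPD_INST_OPERANDS}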
Offline backup fails with PartiallyFailed error
Applies to: 5.1.1 and later
Applies to: Offline backup and restore with the OADP utility
- Diagnosing the problem
- In the Velero logs, you see errors like in the following
example:
time="<timestamp>" level=error msg="Current BackupStorageLocations available/unavailable/unknown: 0/0/1)" controller=backup-storage-location logSource="/remote-source/velero/app/pkg/controller/backup_storage_location_controller.go:180" time="<timestamp>" level=error msg="error encountered while scanning stdout" backupLocation=oadp-operator/dpa-sample-1 cmd=/plugins/velero-plugin-for-aws controller=backup-sync error="read |0: file already closed" logSource="/remote-source /velero/app/pkg/plugin/clientmgmt/process/logrus_adapter.go:90" time="<timestamp>" level=error msg="Restic command fail with ExitCode: 1. Process ID is 906, Exit error is: exit status 1" logSource="/remote-source/velero/app/pkg/util/exec/exec.go:66" time="<timestamp>" level=error msg="pod volume backup failed: data path backup failed: error running restic backup command restic backup --repo=s3:http://minio-velero.apps.jctesti23.examplehost.example.com/velero/cpdbackup/restic/cpd-instance --pa ssword-file=/tmp/credentials/oadp-operator/velero-repo-credentials-repository-password --cache-dir=/scratch/.cache/restic . --tag=pod-uid=1ed9d52f-2f6d-4978-930a-4d8e30acced1 --tag=pvc-uid=76b76bc4-27d4-4369-886c-1272dfdf9ce9 --tag=volume=cc-home-p vc-vol --tag=backup=cpd-tenant-vol-485eef74-efbe-11ef-b2bd-00000a0b44c3 --tag=backup-uid=b55d6323-9875-4afe-b605-646250cbd55c --tag=ns=cpd-instance --tag=pod=cpdbr-vol-mnt --host=velero --json with error: exit status 3 stderr: {\"message_type\":\"e rror\",\"error\":{},\"during\":\"scan\",\"item\":\"/host_pods/1ed9d52f-2f6d-4978-930a-4d8e30acced1/volumes/kubernetes.io~nfs/pvc-76b76bc4-27d4-4369-886c-1272dfdf9ce9/.scripts/system\"}\n{\"message_type\":\"error\",\"error\":{},\"during\":\"archival \",\"item\":\".scripts/system\"}\n{\"message_type\":\"error\",\"error\":{\"Op\":\"open\",\"Path\":\"_global_/security/artifacts/metakey\",\"Err\":13},\"during\":\"archival\",\"item\":\"/host_pods/1ed9d52f-2f6d-4978-930a-4d8e30acced1/volumes/kuberne tes.io~nfs/pvc-76b76bc4-27d4-4369-886c-1272dfdf9ce9/_global_/security/artifacts/metakey\"}\nWarning: at least one source file could not be read\n" backup=oadp-operator/cpd-tenant-vol-485eef74-efbe-11ef-b2bd-00000a0b44c3 logSource="/remote-source/ve lero/app/pkg/podvolume/backupper.go:328" time="<timestamp>" level=error msg="pod volume backup failed: data path backup failed: error running restic backup command restic backup --repo=s3:http://minio-velero.apps.jctesti23.cp.fyre.ibm.com/velero/cpdbackup/restic/cpd-instance --pa ssword-file=/tmp/credentials/oadp-operator/velero-repo-credentials-repository-password --cache-dir=/scratch/.cache/restic . --tag=pod=cpdbr-vol-mnt --tag=pod-uid=1ed9d52f-2f6d-4978-930a-4d8e30acced1 --tag=pvc-uid=93e9e23c-d80a-49cc-80bb-31a36524e0d c --tag=volume=data-rabbitmq-ha-0-vol --tag=backup=cpd-tenant-vol-485eef74-efbe-11ef-b2bd-00000a0b44c3 --tag=backup-uid=b55d6323-9875-4afe-b605-646250cbd55c --tag=ns=cpd-instance --host=velero --json with error: exit status 3 stderr: {\"message_typ e\":\"error\",\"error\":{\"Op\":\"open\",\"Path\":\".erlang.cookie\",\"Err\":13},\"during\":\"archival\",\"item\":\"/host_pods/1ed9d52f-2f6d-4978-930a-4d8e30acced1/volumes/kubernetes.io~nfs/pvc-93e9e23c-d80a-49cc-80bb-31a36524e0dc/.erlang.cookie\"} \nWarning: at least one source file could not be read\n" backup=oadp-operator/cpd-tenant-vol-485eef74-efbe-11ef-b2bd-00000a0b44c3 logSource="/remote-source/velero/app/pkg/podvolume/backupper.go:328" - Cause of the problem
- The restic folder was deleted after backups were cleaned up (deleted). This problem is a Velero known issue. For more information, see velero does not recreate restic|kopia repository from manifest if its directories are deleted on s3.
- Resolving the problem
- Do the following steps:
- Get the list of backup
repositories:
oc get backuprepositories -n ${OADP-OPERATOR-NAMESPACE} -o yaml - Check for old or invalid object storage URLs.
- Check that the object storage path is in the backuprepositories custom resource.
- Check that the
<objstorage>/<bucket>/<prefix>/restic/<namespace>/config
file exists.
If the file does not exist, make sure that you do not share the same <objstorage>/<bucket>/<prefix> with another cluster, and specify a different <prefix>.
- Delete backup repositories that are invalid for the following reasons:
- The path does not exist anymore in the object storage.
- The restic/<namespace>/config file does not exist.
oc delete backuprepositories -n ${OADP_OPERATOR_NAMESPACE} <backup-repository-name>
OpenPages storage content is missing from offline backup
Applies to: 5.1.1
Applies to: Offline backup and restore to a different cluster with the OADP utility
Fixed in: 5.1.2
- Diagnosing the problem
- Check whether the OpenPages volume was
backed up by doing the following steps:
- Download the IBM Software Hub tenant volume
backup.Tip: The name of the backup starts with cpd-tenant-vol-.For example:
cpd-cli oadp backup download cpd-tenant-vol-8cebebba-f4e8-11ef-8148-22220a116473 - Uncompress the downloaded file.
For example:
tar -xvf cpd-tenant-vol-8cebebba-f4e8-11ef-8148-22220a116473-data.tar.gz - Look at the volumes that were backed up.
For example:
cd resources/pods/namespaces/zen-ns/ cat cpd-tenant-vol-8cebebba-f4e8-11ef-8148-22220a116473/resources/pods/namespaces/zen-ns/cpdbr-vol-mnt.json | jq '.spec.volumes' | grep name
The list of backups does not include the OpenPages PVC, which is named openpages-${OPENPAGES_INSTANCE_NAME}-appdata-pvc.
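For example, to look specifically for the OpenPages volume in the backed-up pod definition, you can filter the same output that you inspected above (the backup name and path reuse the earlier example):
cat cpd-tenant-vol-8cebebba-f4e8-11ef-8148-22220a116473/resources/pods/namespaces/zen-ns/cpdbr-vol-mnt.json | jq '.spec.volumes' | grep openpages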
- Resolving the problem
- Manually back up and restore all volumes from all pods by doing the following steps:
- Back up the volumes by running the following
command:
cpd-cli oadp backup create ${BACKUP_NAME} \ --include-resources 'namespaces,persistentvolumeclaims,persistentvolumes,pods,configmaps,secrets' \ --include-cluster-resources \ --include-namespaces ${PROJECT_CPD_INST_OPERANDS} \ --backup-type singleton \ --snapshot-volumes=false \ --default-volumes-to-fs-backup \ --skip-hooks \ --verbose \ --log-level debug - Restore the volumes by running the following
command:
cpd-cli oadp restore create ${RESTORE_NAME} \ --from-backup ${BACKUP_NAME} \ --include-resources 'namespaces,persistentvolumeclaims,persistentvolumes,pods,configmaps,secrets' \ --include-namespaces ${PROJECT_CPD_INST_OPERANDS} \ --skip-hooks \ --verbose \ --log-level debug
Offline backup validation fails in IBM Software Hub deployment that includes Db2 and Informix in the same namespace
Applies to: 5.1.1
Applies to: Offline backup and restore with the OADP utility
Fixed in: 5.1.2
- Diagnosing the problem
- In the CPD-CLI*.log file, you see error messages like in the following
examples:
time=<timestamp> level=info msg=>> received masterplan status update func=cpdbr-oadp/pkg/utils.LogInfo file=/a/workspace/oadp-upload/pkg/utils/log.go:83 time=<timestamp> level=info msg=>>> status description: finished executing 1 DataProtectionPlan(s): func=cpdbr-oadp/pkg/utils.LogInfo file=/a/workspace/oadp-upload/pkg/utils/log.go:83 time=<timestamp> level=info msg=>>> error: error executing workflow actions: workflow action execution resulted in 1 error(s): - encountered an error during local-exec workflowAction.Do() - action=cpd-backup-validation, action-index=7, retry-attempt=0/0, err=backup validation failed: 2 errors occurred: * backup validation failed for configmap: db2wh-aux-br-cm * backup validation failed for configmap: db2oltp-aux-br-cm func=cpdbr-oadp/pkg/utils.LogInfo file=/a/workspace/oadp-upload/pkg/utils/log.go:83time=<timestamp> level=info msg=CM NAME RESOURCE-KIND ADDONID PATH ERR db2wh-aux-br-cm deployment databases backup-validation-meta/backup-validations/3/validation-rules/0 object with name 'zen-databases' does not exists in the backup db2wh-aux-br-cm deployment databases backup-validation-meta/backup-validations/4/validation-rules/0 object with name 'zen-database-core' does not exists in the backup db2oltp-aux-br-cm deployment databases backup-validation-meta/backup-validations/3/validation-rules/0 object with name 'zen-databases' does not exists in the backup db2oltp-aux-br-cm deployment databases backup-validation-meta/backup-validations/4/validation-rules/0 object with name 'zen-database-core' does not exists in the backup func=cpdbr-oadp/pkg/core/services.(*backupValidationService).printBackupValidationResults file=/a/workspace/oadp-upload/pkg/core/services/backup_validation.go:340 time=<timestamp> level=error msg=backup validation unsuccessful func=cpdbr-oadp/pkg/utils.LogError file=/a/workspace/oadp-upload/pkg/utils/log.go:97 - Cause of the problem
- During the backup process, the zen-databases and zen-database-core pods are deleted.
- Resolving the problem
- To prevent these pods from being deleted, remove some content from the
playbooks/dbtypeservice.yml file in the
ibm-informix-cp4d-controller-manager pod.
- Get the instance ID of the ibm-informix-cp4d-controller-manager
pod:
oc -n ${PROJECT_CPD_INST_OPERATORS} get pods --selector name=ibm-informix-cp4d-operator - Open a terminal on the
ibm-informix-cp4d-controller-manager-<instance-id>
pod.
oc -n ${PROJECT_CPD_INST_OPERATORS} exec -it ibm-informix-cp4d-controller-manager-<instance-id> -- bash - Edit the playbooks/dbtypeservice.yml
file:
vi playbooks/dbtypeservice.yml - Comment out the following content:
- block: - include_role: name: zenhelper tasks_from: get_product_cm_info.yaml - include_role: name: zenhelper tasks_from: preupgrade.yml - include_role: name: zenhelper tasks_from: preupgrade-cleanup.yml rescue: - include_role: name: zenhelper tasks_from: rescuestatus.yml - fail: msg: "preupgrade failed" when: shutdown is not defined or (shutdown is defined and shutdown | lower != "true" and shutdown | lower != "force") - Save the changes.
- Retry the backup.
Unable to create offline backup when IBM Software Hub deployment includes MongoDB service
Applies to: 5.1.1
Applies to: Offline backup and restore with the OADP utility
Fixed in: 5.1.2
- Diagnosing the problem
- In the CPD-CLI*.log file, you see an error message like in the following
example:
time=<timestamp> level=error msg=global registry check failed: 1 error occurred: * error from addOnId=opsmanager: addOnId=opsmanager (state=enabled) is in zen-metastore, but does not exist in the global registryNote: The MongoDB service does not support IBM Software Hub backup and restore. The service is excluded from IBM Software Hub backups. - Resolving the problem
- Re-run the backup command with the
--registry-check-exclude-add-on-ids opsmanageroption. For example:cpd-cli oadp tenant-backup create ${TENANT_OFFLINE_BACKUP_NAME} \ --namespace ${OADP_PROJECT} \ --registry-check-exclude-add-on-ids opsmanager \ --cleanup-completed-resources=true \ --vol-mnt-pod-mem-request=1Gi \ --vol-mnt-pod-mem-limit=4Gi \ --tenant-operator-namespace ${PROJECT_CPD_INST_OPERATORS} \ --mode offline \ --image-prefix=registry.redhat.io/ubi9 \ --log-level debug \ --verbose &> ${TENANT_OFFLINE_BACKUP_NAME}.log
Backup fails after a service is upgraded and then uninstalled
Applies to: 5.1.1 and later
Applies to: Backup and restore with the OADP utility
- Diagnosing the problem
- The problem occurs after a service was upgraded from IBM Cloud Pak for Data 4.8.x to IBM Software Hub 5.1.1 or later and then uninstalled. When you try to take a backup by running the
cpd-cli oadp tenant-backupcommand, the backup fails. In the CPD-CLI*.log file, you see an error message like in the following example:Error: global registry check failed: 1 error occurred: * error from addOnId=watsonx_ai: 2 errors occurred: * failed to find aux configmap 'cpd-watsonxai-maint-aux-ckpt-cm' in tenant service namespace='<namespace_name>': : configmaps "cpd-watsonxai-maint-aux-ckpt-cm" not found * failed to find aux configmap 'cpd-watsonxai-maint-aux-br-cm' in tenant service namespace='<namespace_name>': : configmaps "cpd-watsonxai-maint-aux-br-cm" not found [ERROR] <timestamp> RunPluginCommand:Execution error: exit status 1 - Resolving the problem
- Re-run the backup command with the
--skip-registry-checkoption. For example:cpd-cli oadp tenant-backup create ${TENANT_OFFLINE_BACKUP_NAME} \ --namespace ${OADP_PROJECT} \ --vol-mnt-pod-mem-request=1Gi \ --vol-mnt-pod-mem-limit=4Gi \ --tenant-operator-namespace ${PROJECT_CPD_INST_OPERATORS} \ --mode offline \ --skip-registry-check \ --image-prefix=registry.redhat.io/ubi9 \ --log-level=debug \ --verbose &> ${TENANT_OFFLINE_BACKUP_NAME}.log&
Offline backup fails after watsonx.ai is uninstalled
Applies to: 5.1.0 and later
Applies to: Offline backup and restore with the OADP utility
- Diagnosing the problem
- The problem occurs when you try to take an offline backup after watsonx.ai was uninstalled. The backup process
fails when post-backup hooks are run. In the CPD-CLI*.log file, you see error
messages like in the following
example:
time=<timestamp> level=info msg=cmd stderr: <timestamp> [emerg] 233346#233346: host not found in upstream "wx-inference-proxyservice:18888" in /nginx_data/extensions/upstreams/latest-510_watsonxaiifm-routes_ie_226.conf:10 nginx: [emerg] host not found in upstream "wx-inference-proxyservice:18888" in /nginx_data/extensions/upstreams/latest-510_watsonxaiifm-routes_ie_226.conf:10 nginx: configuration file /usr/local/openresty/nginx/conf/nginx.conf test failed func=cpdbr-oadp/pkg/quiesce.executeCommand file=/a/workspace/oadp-upload/pkg/quiesce/ruleexecutor.go:824 time=<timestamp> level=warning msg=failed to get exec hook JSON result for container=ibm-nginx-container podIdx=0 podName=ibm-nginx-fd79d5686-cdpnj namespace=latest-510 auxMetaName=lite-maint-aux component=lite-maint actionIdx=0 err=could not find JSON exec hook result func=cpdbr-oadp/pkg/quiesce.executeCommand file=/a/workspace/oadp-upload/pkg/quiesce/ruleexecutor.go:852 time=<timestamp> level=warning msg=no exec hook JSON result found for container=ibm-nginx-container podIdx=0 podName=ibm-nginx-fd79d5686-cdpnj namespace=latest-510 auxMetaName=lite-maint-aux component=lite-maint actionIdx=0 func=cpdbr-oadp/pkg/quiesce.executeCommand file=/a/workspace/oadp-upload/pkg/quiesce/ruleexecutor.go:855 time=<timestamp> level=info msg=exit executeCommand func=cpdbr-oadp/pkg/quiesce.executeCommand file=/a/workspace/oadp-upload/pkg/quiesce/ruleexecutor.go:860 time=<timestamp> level=error msg=command terminated with exit code 1 - Cause of the problem
- After watsonx.ai is uninstalled, the nginx configuration in the ibm-nginx pod is not cleaned up, and the pod fails.
- Restart all ibm-nginx
pods.
oc delete pod \ -n=${PROJECT_CPD_INST_OPERANDS} \ -l component=ibm-nginx
Db2U backup precheck fails during offline backup
Applies to: 5.1.0 and later
Applies to: Offline backup and restore with the OADP utility
- Diagnosing the problem
- In the CPD-CLI*.log file, the following error message
appears:
You also see failed quota error messages similar to the following examples:Hook execution breakdown by status=error/timedout: The following hooks either have errors or timed out pre-check (1): COMPONENT CONFIGMAP METHOD STATUS DURATION ADDONID db2u db2u-aux-br-cm rule timedout 6m0.186233543s databases -------------------------------------------------------------------------------- Error: precheck failed with error: pre-check backup hooks encountered one or more error(s), err=1 error occurred: * backup precheck hook finished with status=timedout, configmap=db2u-aux-br-cm [ERROR] <timestamp> RunPluginCommand:Execution error: exit status 1Error creating: pods "db2-bar-precheck-jobl6482-wj6x5" is forbidden: failed quota: cpd-quota: must specify limits.cpu for: db2-bar-precheck; limits.memory for: db2-bar-precheck; requests.cpu for: db2-bar-precheck; requests.memory for: db2-bar-precheck' - Cause of the problem
- This problem occurs only if Kubernetes resource quotas are enabled in the cluster. The resource quotas prevent the backup precheck pod from being created because the precheck job does not specify CPU and memory requests and limits.
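To see which quotas are enforced in the instance project, you can list and describe them (the cpd-quota name comes from the example error message; your quota name might differ):
oc get resourcequota -n ${PROJECT_CPD_INST_OPERANDS}
oc describe resourcequota cpd-quota -n ${PROJECT_CPD_INST_OPERANDS}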
- Resolving the problem
- Do the following steps:
- Scale down the Db2 ansible
operators:
oc -n ${PROJECT_CPD_INST_OPERATORS} scale deployment ibm-db2oltp-cp4d-operator-controller-manager --replicas=0oc -n ${PROJECT_CPD_INST_OPERATORS} scale deployment ibm-db2wh-cp4d-operator-controller-manager --replicas=0oc -n ${PROJECT_CPD_INST_OPERATORS} scale deployment ibm-db2aaservice-cp4d-operator-controller-manager --replicas=0 - For each project that has Db2
instances running, turn on maintenance mode for Db2 ansible custom
resources:
export DB2_PROJECT=<Db2_project_name>oc -n $DB2_PROJECT patch db2oltpservice db2oltp-cr --type=merge -p '{"spec":{"ignoreForMaintenance":true}}'oc -n $DB2_PROJECT patch db2whservice db2wh-cr --type=merge -p '{"spec":{"ignoreForMaintenance":true}}'oc -n $DB2_PROJECT patch db2aaserviceservice db2aaservice-cr --type=merge -p '{"spec":{"ignoreForMaintenance":true}}' - Edit the db2u-aux-br-cm and db2u-aux-ckpt-cm
ConfigMaps to change precheck-meta and backup pre-hook on-error settings to
Continue.oc -n $DB2_PROJECT edit cm db2u-aux-br-cmoc -n $DB2_PROJECT edit cm db2u-aux-ckpt-cm-
In each ConfigMap, locate the
precheck-metasection:precheck-meta: | backup-hooks: exec-rules: - actions: - job: job-key: db2-bar-precheck-job timeout: 360s - In the
actionssection, addon-error: Continuelike in the following example:precheck-meta: | backup-hooks: exec-rules: - actions: - job: job-key: db2-bar-precheck-job timeout: 360s on-error: Continue
-
- Edit the db2oltp-aux-br-cm ConfigMap to change precheck-meta and backup
pre-hook on-error settings to
Continue.oc -n $DB2_PROJECT edit cm db2oltp-aux-br-cm-
In the ConfigMap, locate the
precheck-metasection:precheck-meta: | backup-hooks: exec-rules: - resource-kind: db2oltpservices.databases.cpd.ibm.com labels: app.kubernetes.io/name=db2oltp on-error: Fail actions: - builtins: name: cpdbr.cpd.ibm.com/check-condition params: condition: "{$.status.db2oltpStatus} == {\"Completed\"}" timeout: 5s - Change
on-error: Failtoon-error: Continue. - In the ConfigMap, locate the
backup-metasection:backup-meta: | pre-hooks: exec-rules: - resource-kind: db2oltpservice.databases.cpd.ibm.com on-error: Propagate actions: - builtins: name: cpdbr.cpd.ibm.com/enable-maint params: statusFieldName: db2oltpStatus timeout: 360s - Change
on-error: Propagatetoon-error: Continue.
-
- Edit the db2oltp-aux-ckpt-cm ConfigMap to change precheck-meta and backup
pre-hook on-error settings to
Continue.oc -n $DB2_PROJECT edit cm db2oltp-aux-ckpt-cm-
In the ConfigMap, locate the
precheck-metasection:precheck-meta: | backup-hooks: exec-rules: - resource-kind: db2oltpservices.databases.cpd.ibm.com labels: app.kubernetes.io/name=db2oltp on-error: Fail actions: - builtins: name: cpdbr.cpd.ibm.com/check-condition params: condition: "{$.status.db2oltpStatus} == {\"Completed\"}" timeout: 5s - Change
on-error: Failtoon-error: Continue.
-
- Edit the db2wh-aux-br-cm ConfigMap to change precheck-meta and backup
pre-hook on-error settings to
Continue.oc -n $DB2_PROJECT edit cm db2wh-aux-br-cm-
In the ConfigMap, locate the
precheck-metasection:precheck-meta: | backup-hooks: exec-rules: - resource-kind: db2whservices.databases.cpd.ibm.com labels: app.kubernetes.io/name=db2wh on-error: Fail actions: - builtins: name: cpdbr.cpd.ibm.com/check-condition params: condition: "{$.status.db2whStatus} == {\"Completed\"}" timeout: 5s - Change
on-error: Failtoon-error: Continue. - In the ConfigMap, locate the
backup-metasection:backup-meta: | pre-hooks: exec-rules: - resource-kind: db2whservice.databases.cpd.ibm.com on-error: Propagate actions: - builtins: name: cpdbr.cpd.ibm.com/enable-maint params: statusFieldName: db2whStatus timeout: 360s - Change
on-error: Propagatetoon-error: Continue.
-
- Edit the db2wh-aux-ckpt-cm ConfigMap to change precheck-meta and backup
pre-hook on-error settings to
Continue.oc -n $DB2_PROJECT edit cm db2wh-aux-ckpt-cm-
In the ConfigMap, locate the
precheck-metasection:precheck-meta: | backup-hooks: exec-rules: - resource-kind: db2whservices.databases.cpd.ibm.com labels: app.kubernetes.io/name=db2wh on-error: Fail actions: - builtins: name: cpdbr.cpd.ibm.com/check-condition params: condition: "{$.status.db2whStatus} == {\"Completed\"}" timeout: 5s - Change
on-error: Failtoon-error: Continue.
-
- Edit the db2aaservice-aux-br-cm ConfigMap to change precheck-meta and
backup pre-hook on-error settings to
Continue.oc -n $DB2_PROJECT edit cm db2aaservice-aux-br-cm-
In the ConfigMap, locate the
precheck-metasection:precheck-meta: | backup-hooks: exec-rules: - resource-kind: db2aaserviceservices.databases.cpd.ibm.com labels: app.kubernetes.io/name=db2aaservice on-error: Fail actions: - builtins: name: cpdbr.cpd.ibm.com/check-condition params: condition: "{$.status.db2aaserviceStatus} == {\"Completed\"}" timeout: 5s - Change
on-error: Failtoon-error: Continue. - In the ConfigMap, locate the
backup-metasection:backup-meta: | pre-hooks: exec-rules: - resource-kind: db2aaserviceservice.databases.cpd.ibm.com on-error: Propagate actions: - builtins: name: cpdbr.cpd.ibm.com/enable-maint params: statusFieldName: db2aaserviceStatus timeout: 360s - Change
on-error: Propagatetoon-error: Continue.
-
- Edit the db2aaservice-aux-ckpt-cm ConfigMap to change precheck-meta and
backup pre-hook on-error settings to
Continue.oc -n $DB2_PROJECT edit cm db2aaservice-aux-ckpt-cm-
In the ConfigMap, locate the
precheck-metasection:precheck-meta: | backup-hooks: exec-rules: - resource-kind: db2aaserviceservices.databases.cpd.ibm.com labels: app.kubernetes.io/name=db2aaservice on-error: Fail actions: - builtins: name: cpdbr.cpd.ibm.com/check-condition params: condition: "{$.status.db2aaserviceStatus} == {\"Completed\"}" timeout: 5s - Change
on-error: Failtoon-error: Continue.
-
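The procedure does not describe how to undo these changes. As an assumption, after the backup completes you would turn maintenance mode back off and scale the operators back up with commands like the following, repeating the patch for the db2wh and db2aaservice custom resources and the scale command for each Db2 operator (a replica count of 1 is an assumption; use your original values):
oc -n $DB2_PROJECT patch db2oltpservice db2oltp-cr --type=merge -p '{"spec":{"ignoreForMaintenance":false}}'
oc -n ${PROJECT_CPD_INST_OPERATORS} scale deployment ibm-db2oltp-cp4d-operator-controller-manager --replicas=1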
Db2 Big SQL backup pre-hook and post-hook fail during offline backup
Applies to: 5.1.0 and later
Applies to: Offline backup and restore with the OADP utility
- Diagnosing the problem
- In the db2diag logs of the Db2
Big SQL head pod, you see error messages such as in the following example when backup pre-hooks
are running:
<timestamp> LEVEL: Event PID : 3415135 TID : 22544119580160 PROC : db2star2 INSTANCE: db2inst1 NODE : 000 HOSTNAME: c-bigsql-<xxxxxxxxxxxxxxx>-db2u-0 FUNCTION: DB2 UDB, base sys utilities, DB2StartMain, probe:5692 MESSAGE : ZRC=0xFFFFFBD0=-1072 SQL1072C The request failed because the database manager resources are in an inconsistent state. The database manager might have been incorrectly terminated, or another application might be using system resources in a way that conflicts with the use of system resources by the database manager. - Cause of the problem
- The Db2 database was unable
to start because of the error code
SQL1072C. As a result, thebigsql startcommand that runs as part of the post-backup hook hangs, which produces the timeout of the post-hook. The post-hook cannot succeed unless Db2 is brought back to a stable state and thebigsql startcommand runs successfully. The Db2 Big SQL instance is left in an unstable state. - Resolving the problem
- Do one or both of the following troubleshooting and cleanup procedures.Tip: For more information about the
SQL1072Cerror code and how to resolve it, see SQL1000-1999 in the Db2 documentation.- Remove all the database manager processes running under the Db2 instance ID
- Do the following steps:
- Log in to the Db2
Big SQL head
pod:
oc -n ${PROJECT_CPD_INST_OPERANDS} rsh $(oc -n ${PROJECT_CPD_INST_OPERANDS} get pods | grep -i c-bigsql | grep -i db2u-0 | cut -d' ' -f 1) bash
db2inst1user:su - db2inst1 - List all the database manager processes that are running under
db2inst1:db2_ps - Remove these
processes:
kill -9 <process-ID>
- Log in to the Db2
Big SQL head
pod:
- Ensure that no other application is running under the Db2 instance ID, and then remove all resources owned by the Db2 instance ID
- Do the following steps:
- Log in to the Db2
Big SQL head
pod:
oc -n ${PROJECT_CPD_INST_OPERANDS} rsh $(oc -n ${PROJECT_CPD_INST_OPERANDS} get pods | grep -i c-bigsql | grep -i db2u-0 | cut -d' ' -f 1) bash
db2inst1user:su - db2inst1 - List all IPC resources owned by
db2inst1:ipcs | grep db2inst1 - Remove these
resources:
ipcrm -[q|m|s] db2inst1
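After either cleanup procedure, you can rerun the Big SQL start command as the db2inst1 user inside the Db2 Big SQL head pod to confirm that the instance is stable again (the command name is taken from the problem description; this check is not part of the documented procedure):
su - db2inst1 -c "bigsql start"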
Error running post-restore hooks when restoring an offline backup
Applies to: 5.1.0
Applies to: Offline backup and restore with the OADP utility
Fixed in: 5.1.1
- Diagnosing the problem
- After you run the restore command, you see the following
error:
The CPD-CLI-<timestamp>.log file shows following error:error: DataProtectionPlan=v1-orchestration, Action=post-restore-hooks (index=8) offline post-restore hooks execution failed: error running post-restore hooks: Error running post-processing rules. Check the ../logs/CPD-CLI-<timestamp>.log for errors.* error performing op postRestoreViaConfigHookRule for resource lite-maint (configmap=cpd-lite-aux-br-maint-cm): 1 error occurred: * error executing command (container=ibm-nginx-container podIdx=1 podName=ibm-nginx-<xxxxxxxxxx>-<yyyyy> namespace=${PROJECT_CPD_INST_OPERANDS} auxMetaName=lite-maint-aux component=lite-maint actionIdx=0): command terminated with exit code 1Also, after the restore, you are not able log in to the IBM Software Hub web client. Check the ibm-nginx-xxx pod log for the following error:<timestamp> [error] 1412#1412: *29 connect() failed (111: Connection refused) while connecting to upstream, client: <ip_address>, server: internal-nginx-svc, request: "POST /v2/catalogs/default?check_bucket_existence=false HTTP/1.1", upstream: "https://<ip_address>:<port_number>/v2/catalogs/default?check_bucket_existence=false", host: "internal-nginx-svc:12443" - Resolving the problem
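To view the ibm-nginx pod log that is mentioned above, a standard log query works (the label selector is the same one that the resolution steps below use to restart the pods):
oc logs -l app.kubernetes.io/component=ibm-nginx -n ${PROJECT_CPD_INST_OPERANDS}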
- Do the following steps:
- Restart the ibm-nginx
pods:
oc delete pod -l app.kubernetes.io/component=ibm-nginx -n ${PROJECT_CPD_INST_OPERANDS} - Rerun the post-restore
hook:
cpd-cli oadp restore posthooks \ --hook-kind=br \ --tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS}
If you are still unable to log in to the Cloud Pak for Data web client, it might be because of the following Operator Lifecycle Manager (OLM) known issue: OLM known issue: ResolutionFailed message. To resolve the problem, follow the instructions in the troubleshooting topic. Then wait for the zenservice lite-cr custom resource to reach theCompletedstate by running the following command:oc get zenservice lite-cr -n ${PROJECT_CPD_INST_OPERATORS} -o json | jq .status.zenStatus - Restart the ibm-nginx
pods:
Creating Watson Studio offline volume backup fails
Applies to: 5.1.0 and later
Applies to: Offline backup and restore with the IBM Software Hub volume backup utility
- Diagnosing the problem
- In the CPD-CLI*.log file, you see the following
error:
Error: error on quiesce: 1 error occurred: * error resolving aux config cpd-ws-maint-aux-qu-cm in namespace <namespace-name>: : pods "init-ws-runtimes-libs" not found - Resolving the problem
- Do the following steps:
- Run the following
commands:
cmName=cpd-ws-maint-aux-qu-cm auxMeta=$(oc get cm -n ${PROJECT_CPD_INST_OPERANDS} ${cmName} -o jsonpath='{.data.aux-meta}' | yq 'del(."managed-resources")') echo -e "data:\n aux-meta: |\n$(echo "$auxMeta" | sed 's/^/ /')" > patch.yaml oc patch cm -n ${PROJECT_CPD_INST_OPERANDS} ${cmName} --patch-file patch.yaml - Retry the
backup:
cpd-cli backup-restore volume-backup create <backup-name> -n ${PROJECT_CPD_INST_OPERANDS}
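Before retrying the backup, you can optionally confirm that the managed-resources entry was removed by printing the patched aux-meta (the same ConfigMap and key as in the commands above):
oc get cm -n ${PROJECT_CPD_INST_OPERANDS} cpd-ws-maint-aux-qu-cm -o jsonpath='{.data.aux-meta}'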
Security issues
Security scans return an Inadequate Account Lockout Mechanism message
Applies to: 5.1.0 and later
- Diagnosing the problem
-
If you run a security scan against IBM Software Hub, the scan returns the following message:
Inadequate Account Lockout Mechanism
- Resolving the problem
-
This behavior is by design. It is strongly recommended that you use an enterprise-grade solution for password management, such as SAML SSO or an LDAP provider.
The Kubernetes version information is disclosed
Applies to: 5.1.0 and later
- Diagnosing the problem
- If you run an Aqua Security scan against your cluster, the scan reports that the Kubernetes version information is disclosed.
- Resolving the problem
- This is expected based on the following solution document from Red Hat OpenShift Customer Portal: Hide kubernetes /version API endpoint in OpenShift Container Platform 4.