Known issues and limitations for IBM Software Hub
The following issues apply to the IBM® Software Hub platform. Each issue includes information about the releases that it applies to. If the issue was fixed in a refresh, that information is also included.
- Customer-reported issues
- General issues
- Installation and upgrade issues
- Backup and restore issues
- Security issues
The following issues apply to IBM Software Hub services.
Customer-reported issues
Issues that are found after the release are posted on the IBM Support site.
General issues
- After rebooting a cluster that uses OpenShift Data Foundation storage, some IBM Software Hub services aren't functional
- The create-physical-location command fails if your load balancer timeout settings are too low
- The delete-physical-location command fails
- The Get authorization token endpoint returns Unauthorized when the username contains special characters
- The wml service key does not work and cpd-cli health commands must use the --services option
- The cpd-cli health cluster command fails on ROSA with hosted control planes
- Health check for OpenPages fails with 500 error if service instance is installed on a watsonx.governance environment
- Exports with improper JSON formatting persist on export list despite deletion attempt
After rebooting a cluster that uses OpenShift Data Foundation storage, some IBM Software Hub services aren't functional
Applies to: 5.1.0 and later
- Diagnosing the problem
- After rebooting the cluster, some IBM Software Hub custom resources remain in the InProgress state. For more information about this problem, see Missing NodeStageVolume RPC call blocks new pods from going into Running state in the Red Hat® OpenShift® Data Foundation 4.1.4 release notes.
- Workaround
- Do the following steps:
- Find the nodes that have pods that are in an Error state:
oc get po -n ${PROJECT_CPD_INST_OPERANDS} -o wide | grep -v -P "Completed|(\d+)\/\1"
- Mark each affected node as unschedulable:
oc adm cordon <node_name>
- Delete the affected pods:
oc get pod | grep -Ev "Comp|0/0|1/1|2/2|3/3|4/4|5/5|6/6|7/7" | awk '{print $1}' | xargs oc delete po --force=true --grace-period=0
- Mark each node as schedulable again:
oc adm uncordon <node_name>
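If many nodes are affected, the following shell sketch chains the same cordon, delete, and uncordon steps. The node-name column position and the pod filter are assumptions carried over from the commands above, so review the node list before you run it.
# Collect the nodes that still host pods in an Error or partially ready state.
NODES=$(oc get po -n ${PROJECT_CPD_INST_OPERANDS} -o wide | grep -v -P "Completed|(\d+)\/\1" | awk '{print $7}' | sort -u)
# Stop scheduling on the affected nodes.
for node in ${NODES}; do oc adm cordon ${node}; done
# Force-delete the stuck pods so that they are recreated.
oc get pod -n ${PROJECT_CPD_INST_OPERANDS} | grep -Ev "Comp|0/0|1/1|2/2|3/3|4/4|5/5|6/6|7/7" | awk '{print $1}' | xargs oc delete po -n ${PROJECT_CPD_INST_OPERANDS} --force=true --grace-period=0
# Allow scheduling on the nodes again.
for node in ${NODES}; do oc adm uncordon ${node}; done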
The create-physical-location command fails if your
load balancer timeout settings are too low
Applies to: 5.1.0
Fixed in: 5.1.1
If your load balancer timeout settings are too low, the connection between the primary cluster
and the remote physical location might be terminated before the API call that is issued by the
cpd-cli
manage
create-physical-location command completes.
- Diagnosing the problem
- The cpd-cli manage create-physical-location command returns the following error:
TASK [utils : fail] ************************************************************
fatal: [localhost]: FAILED! => {"changed": false, "msg": "The <location-name> physical location was not registered. There might be a problem connecting to the hub or to the physical location service on the hub. Wait a few minutes and try to update the physical location again. If the problem persists, review the zen-core-api pods on the hub for issues related to the v1/physical_locations/<location-name> endpoint."}
In addition, the log file includes the following message:
"msg": "Nginx extension API returned: 504"
- Resolving the problem
-
- Check the current timeout settings on your cluster. For example, if you use HAProxy, run:
grep timeout /etc/haproxy/haproxy.cfg
The command returns output with the following format:
timeout http-request 10s
timeout queue 1m
timeout connect 10s
timeout client 5m
timeout server 5m
timeout http-keep-alive 10s
timeout check 10s
- If the client timeout or the server timeout is less than 5 minutes (5m), follow the directions in Changing load balancer timeout settings to increase the timeout to at least 5 minutes.
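As an illustration only, a HAProxy defaults section with the client and server timeouts raised to 5 minutes might look like the following snippet; the other values are taken from the sample output above, and your actual configuration file can differ.
defaults
    timeout http-request 10s
    timeout queue 1m
    timeout connect 10s
    # Keep the client and server timeouts at 5m or higher so that
    # long-running create-physical-location API calls are not dropped.
    timeout client 5m
    timeout server 5m
    timeout http-keep-alive 10s
    timeout check 10s
After you change the file, reload HAProxy so that the new timeouts take effect (for example, systemctl reload haproxy on a typical installation).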
The delete-physical-location command fails
Applies to: 5.1.0
Fixed in: 5.1.1
When you run the cpd-cli manage delete-physical-location command, it fails.
- Diagnosing the problem
- The cpd-cli manage delete-physical-location command returns the following error:
TASK [utils : MC Agent: Make the MC storage pod if push mode communications. Dont wait] ***
fatal: [localhost]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'physical_location_id' is undefined\n\nThe error appears to be in '/opt/ansible/ansible-play/roles/utils/tasks/initialize_mc_edge_agent_51.yml': line 203, column 5, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n - name: \"MC Agent: Make the MC storage pod if push mode communications. Dont wait\"\n ^ here\nThis one looks easy to fix. It seems that there is a value started\n with a quote, and the YAML parser is expecting to see the line ended\nwith the same kind of quote. For instance:\n\n when: \"ok\" in result.stdout\n\nCould be written as:\n\n when: '\"ok\" in result.stdout'\n\nOr equivalently:\n\n when: \"'ok' in result.stdout\"\n"}
PLAY RECAP *********************************************************************
localhost : ok=51 changed=11 unreachable=0 failed=1 skipped=84 rescued=0 ignored=0
[ERROR] ... cmd.Run() failed with exit status 2
[ERROR] ... Command exception: The delete-physical-location command failed (exit status 2). You may find output and logs in the <file-path>/cpd-cli-workspace/olm-utils-workspace/work directory.
[ERROR] ... RunPluginCommand:Execution error: exit status 1
- Resolving the problem
- From the client workstation:
- Get the container ID of the olm-utils-v3 image:
  - Docker
    docker ps
  - Podman
    podman ps
The command returns output with the following format:
CONTAINER ID   IMAGE                                     COMMAND   CREATED       STATUS       PORTS   NAMES
8204c95c0fe2   cp.stg.icr.io/cp/cpd/olm-utils-v3:5.1.0             2 hours ago   Up 2 hours           olm-utils-play-v3
- Open a bash prompt in the container:
  - Docker
    docker exec -ti <container-ID> bash
  - Podman
    podman exec -ti <container-ID> bash
- Apply the workaround:
find ./ansible-play/roles/utils/templates/phyloc-mc-51 -name *deployment* | xargs -I {} sed -i -E 's/^ *icpdsupport\/physicalLocation.*\".*$//g' {}
- Run the following command to verify that the workaround was applied:
find ./ansible-play/roles/utils/templates/phyloc-mc-51 -name \*deployment\* | xargs -I {} grep -E 'icpdsupport\/physicalLocation' {}
The command should return the following output:
fieldPath: metadata.labels['icpdsupport/physicalLocationId']
fieldPath: metadata.labels['icpdsupport/physicalLocationName']
After you apply the workaround, you can re-run the cpd-cli manage delete-physical-location command to delete the remote physical location.
The Get authorization token endpoint returns Unauthorized when the username
contains special characters
Applies to: 5.1.1 and later
If you try to generate a bearer token by calling the Get authorization token endpoint with a username that contains special characters, the call fails and an error response is returned.
- Diagnosing the problem
-
When you call the /icp4d-api/v1/authorize endpoint by using credentials that contain special characters in the username, the call fails and you receive the following error response:
{ "_statusCode_": 401, "exception": "User unauthorized to invoke this endpoint.", "message": "Unauthorized" }
This happens because the special characters in the username are not encoded properly by the /icp4d-api/v1/authorize endpoint.
- Resolving the problem
-
You can generate an access_token by calling the validateAuth endpoint:
curl -X POST \
  'https://<platform_instance_route>/v1/preauth/signin' \
  -H 'Content-Type: application/json' \
  -d '{ "username": "<username>", "password": "<password>" }'
Replace <platform_instance_route>, <username>, and <password> with your credentials and the correct values for your environment. This command returns a response that contains the authorization token:
{ "_messageCode_": "200", "message": "Success", "token": "<authorization-token>" }
Use the <authorization-token> in the authorization header of subsequent API calls.
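For example, a follow-up request that passes the token in the Authorization header might look like the following sketch; the /icp4d-api/v1/users endpoint is used here only as an illustration.
# Call a platform API with the bearer token returned by the signin request.
curl -X GET \
  'https://<platform_instance_route>/icp4d-api/v1/users' \
  -H 'Authorization: Bearer <authorization-token>' \
  -H 'Content-Type: application/json'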
The wml service key does not work and cpd-cli health commands must use the --services option
Applies to: 5.1.0 and later
The wml service key cannot be used for 5.1.0 and later versions. Using the wml service key can cause severe data loss. When Watson Machine Learning is installed on a cluster, there are two restrictions that you must follow if you want to use either the cpd-cli health service-functionality or the cpd-cli health service-functionality cleanup commands:
- You cannot use the wml service key with the --services option.
- You must use the --services option when you use either the cpd-cli health service-functionality command or the cpd-cli health service-functionality cleanup command.
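For example, a health check run that names the service keys explicitly (and omits the wml key) might look like the following sketch; the service keys shown are placeholders, and any other options that your environment requires are omitted.
# Check specific services only; never pass the wml key.
cpd-cli health service-functionality --services=wkc,ws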
The cpd-cli health cluster command fails on ROSA with hosted control planes
Applies to: 5.1.0
Fixed in: 5.1.1
An error message about missing machineconfigpools (MCP) appears when you run the cpd-cli health cluster command to check the health of a Red Hat OpenShift Service on AWS (ROSA) cluster with hosted control planes (HCP).
Health check for OpenPages fails with 500 error if service instance is installed on a watsonx.governance environment
Applies to: 5.1.1 and later versions
Fixed in: 5.1.3
If you install an OpenPages service instance on a watsonx.governance™ environment and then run a cpd-cli health service-functionality check, the OpenPages service key fails with a 500 server error.
Exports with improper JSON formatting persist on export list despite deletion attempt
Applies to: 5.1.2 and later versions
The cpd-cli export-import export delete command fails to delete an export when the job fails due to improper JSON formatting that was passed in the export's YAML file. The command returns without an error message, but the export job remains in the export list.
- Workaround
- Fix any incorrect formatting issues with the JSON string in the export YAML file. Then create an export with a different name, and rerun the cpd-cli export-import export delete command on this export.
Installation and upgrade issues
- The setup-instance command fails during upgrades
- Upgrades fail or are stuck in the InProgress state when common core services cannot be upgraded because of an empty role name
- Upgrades fail or are stuck in the InProgress state when common core services cannot be upgraded because of roles with duplicate names
- The cloud-native-postgresql-opreq operand request is in a failed state after upgrade
- The Switch locations icon is not available if the apply-cr command times out
- Upgrades fail if the Data Foundation Rook Ceph cluster is unstable
- After you upgrade a Red Hat OpenShift Container Platform cluster, the FoundationDB resource can become unavailable
- Persistent volume claims with the WaitForFirstConsumer volume binding mode are flagged by the installation health checks
- Node pinning is not applied to postgresql pods
- The ibm-nginx deployment does not scale fast enough when automatic scaling is configured
- Uninstalling IBM watsonx services does not remove the IBM watsonx experience
The setup-instance command fails during
upgrades
Applies to: Upgrades from Version 5.0 to 5.1.0
When you run the cpd-cli
manage
setup-instance command, the command fails if the
ibm-common-service-operator-service service is not found in the operator
project for the instance.
When this error occurs, the ibmcpd ibmcpd-cr custom resource is stuck at
35%.
- Diagnosing the problem
- To determine if the command failed because the ibm-common-service-operator-service service was not found:
- Get the .status.progress value from the ibmcpd ibmcpd-cr custom resource:
oc get ibmcpd ibmcpd-cr \
  --namespace=${PROJECT_CPD_INST_OPERANDS} \
  -o json | jq -r '.status.progress'
  - If the command returns 35%, continue to the next step.
  - If the command returns a different value, the command failed for a different reason.
- Check for the Error found when checking commonservice CR in namespace error in the ibmcpd ibmcpd-cr custom resource:
oc get ibmcpd ibmcpd-cr \
  --namespace=${PROJECT_CPD_INST_OPERANDS} \
  -o json | grep 'Error found when checking commonservice CR in namespace'
  - If the command returns a response, continue to the next step.
  - If the command does not return a response, the command failed for a different reason.
- Confirm that the ibm-common-service-operator-service service does not exist:
oc get svc ibm-common-service-operator-service \
  --namespace=${PROJECT_CPD_INST_OPERATORS}
The command should return the following response:
Error from server (NotFound): services "ibm-common-service-operator-service" not found
- Resolving the problem
- To resolve the problem:
- Get the name of the cpd-platform-operator-manager pod:
oc get pod \
  --namespace=${PROJECT_CPD_INST_OPERATORS} \
  | grep cpd-platform-operator-manager
- Delete the cpd-platform-operator-manager pod. Replace <pod-name> with the name of the pod returned in the previous step.
oc delete pod <pod-name> \
  --namespace=${PROJECT_CPD_INST_OPERATORS}
- Wait several minutes for the operator to run a reconcile loop.
- Confirm that the ibm-common-service-operator-service service exists:
oc get svc ibm-common-service-operator-service \
  --namespace=${PROJECT_CPD_INST_OPERATORS}
The command should return a response with the following format:
NAME                                  TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
ibm-common-service-operator-service   ClusterIP   198.51.100.255   <none>        443/TCP   1h
After you resolve the issue, you can re-run the cpd-cli manage setup-instance command.
Upgrades fail or are stuck in the InProgress state when common core services cannot be upgraded because of an empty
role name
Applies to:
- Upgrades from Version 4.8 to 5.1.0
- Upgrades from Version 4.8 to 5.1.1
Fixed in: Version 5.1.2
If you upgrade a service with a dependency on common core services, the upgrade fails or is stuck
in the InProgress state if it cannot upgrade the common core services.
If your installation includes one of the following services, you might encounter this problem when upgrading from IBM Cloud Pak® for Data Version 4.8 to IBM Software Hub Version 5.1:
- IBM Knowledge Catalog
- IBM Knowledge Catalog Premium
- IBM Knowledge Catalog Standard
When you upgrade any service with a dependency on the common core services, the common core services upgrade fails with the following error:
'"Job" "projects-ui-refresh-users": Timed out waiting on resource'
This error occurs because the wkc_reporting_administrator role is created
without a name.
- Avoiding the problem
- You can avoid this problem by updating the
wkc_reporting_administratorrole before you upgrade:- Log in to the web client as a user with the one of the following permissions:
- Administer platform
- Manage platform roles
- From the navigation menu, select .
- Open the Roles tab and look for a role with <no value> in the name column.
- Edit the role. Set the Name to Reporting Administrator and click Save.
- Diagnosing the problem
- If you started the upgrade without completing the steps in Avoiding the problem, complete the following steps to determine why the upgrade failed:
- Get the name of the common core services operator pod:
oc get pod -n=${PROJECT_CPD_INST_OPERATORS} | grep ibm-cpd-ccs-operator
- Check the ibm-cpd-ccs-operator-* pod logs for the '"Job" "projects-ui-refresh-users": Timed out waiting on resource' error:
oc logs <ibm-cpd-ccs-operator-pod-name> -n=${PROJECT_CPD_INST_OPERATORS} \
  | grep '"Job" "projects-ui-refresh-users": Timed out waiting on resource'
  - If the command returns a response, proceed to the next step.
  - If the command returns an empty response, the upgrade failed for a different reason.
- Get the name of the projects-ui-refresh-users pod:
oc get pod -n=${PROJECT_CPD_INST_OPERANDS} | grep projects-ui-refresh-users
- Check the projects-ui-refresh-users-* pod logs for the Error refreshing role with extension_name - wkc_reporting_administrator - status_code - 400 error:
oc logs <projects-ui-refresh-users-pod-name> -n=${PROJECT_CPD_INST_OPERANDS} \
  | grep "Error refreshing role with extension_name - wkc_reporting_administrator - status_code - 400"
  - If the command returns a response, proceed to Resolving the problem.
  - If the command returns an empty response, the upgrade failed for a different reason.
- Resolving the problem
- The steps for resolving the problem are the same as the steps for avoiding the problem.
Upgrades fail or are stuck in the InProgress state when common core services cannot be upgraded because of roles with
duplicate names
Applies to:
- Upgrades from Version 4.8
- Upgrades from Version 5.0
If you upgrade a service with a dependency on common core services, the upgrade fails or is stuck
in the InProgress state if it cannot upgrade the common core services. This issue can occur if your environment
includes multiple roles with the same name. (This is possible only if you use the
/usermgmt/v1/role API to create roles.)
When you upgrade any service with a dependency on the common core services, the common core services upgrade fails with the following error:
'"Job" "projects-ui-refresh-users": Timed out waiting on resource'
- Avoiding the problem
- You can avoid this problem by removing duplicate role names before you upgrade:
- Log in to the web client as a user with one of the following permissions:
- Administer platform
- Manage platform roles
- From the navigation menu, select .
- Open the Roles tab and sort the roles by name to find any roles with duplicate names.
- If you find roles with duplicate names, edit the roles to remove the duplicate names and click Save.
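If you prefer to check for duplicates from the command line before you upgrade, a rough sketch against the user management API follows. The GET /usermgmt/v1/roles call, the jq path, and the ${TOKEN} and ${CPD_ROUTE} variables are assumptions, so adjust them to match your environment and release.
# List role names and print only the names that appear more than once.
curl -sk "https://${CPD_ROUTE}/usermgmt/v1/roles" \
  -H "Authorization: Bearer ${TOKEN}" \
  | jq -r '.rows[].entity.role_name' \
  | sort | uniq -d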
- Diagnosing the problem
- If you started the upgrade without completing the steps in Avoiding the problem, complete the following steps to determine why the upgrade failed:
- Get the name of the common core services operator pod:
oc get pod -n=${PROJECT_CPD_INST_OPERATORS} | grep ibm-cpd-ccs-operator
- Check the ibm-cpd-ccs-operator-* pod logs for the '"Job" "projects-ui-refresh-users": Timed out waiting on resource' error:
oc logs <ibm-cpd-ccs-operator-pod-name> -n=${PROJECT_CPD_INST_OPERATORS} \
  | grep '"Job" "projects-ui-refresh-users": Timed out waiting on resource'
  - If the command returns a response, proceed to the next step.
  - If the command returns an empty response, the upgrade failed for a different reason.
- Get the name of the projects-ui-refresh-users pod:
oc get pod -n=${PROJECT_CPD_INST_OPERANDS} | grep projects-ui-refresh-users
- Check the projects-ui-refresh-users-* pod logs for duplicate role name errors:
oc logs <projects-ui-refresh-users-pod-name> -n=${PROJECT_CPD_INST_OPERANDS} \
  | grep "Error refreshing role with extension_name" | grep "status_code - 409"
  - If the command returns a response, proceed to Resolving the problem.
  - If the command returns an empty response, the upgrade failed for a different reason.
- Resolving the problem
- The steps for resolving the problem are the same as the steps for avoiding the problem.
The cloud-native-postgresql-opreq operand request is in a failed state after
upgrade
Applies to:
- Upgrades from Version 5.0.1 or later
- Upgrades from Version 5.1.0 or later
Fixed in: 5.1.3
After you upgrade an instance of IBM Software Hub, the cloud-native-postgresql-opreq operand request is in the
Failed state.
Even though the operand request is in the Failed state, IBM Software Hub is upgraded and works as expected. To fix the operand request, run the following command:
oc patch opreq cloud-native-postgresql-opreq \
--namespace=${PROJECT_CPD_INST_OPERANDS} \
--type=merge \
--patch '{"spec": {"requests":[{"operands":[{"name":"cloud-native-postgresql-v1.22"}],"registry":"common-service"}]}}'
The Switch locations icon is not available if the
apply-cr command times out
Applies to: 5.1.0 and later
If you install solutions that are available in different experiences, the Switch locations icon is not available in the web client if the cpd-cli manage apply-cr command times out.
- Resolving the problem
-
Re-run the cpd-cli manage apply-cr command.
Upgrades fail if the Data Foundation Rook Ceph cluster is unstable
Applies to: 5.1.0 and later
If the Red Hat OpenShift Data Foundation or IBM Fusion Data Foundation Rook Ceph® cluster is unstable, upgrades fail.
One symptom is that pods will not start because of a FailedMount error. For
example:
Warning FailedMount 36s (x1456 over 2d1h) kubelet MountVolume.MountDevice failed for volume
"pvc-73bf3705-43e9-40bd-87ed-c1e1656d6f12" : rpc error: code = Aborted desc = an operation with the given
Volume ID 0001-0011-openshift-storage-0000000000000001-5e17508b-c295-4306-b684-eaa327aec2ab already exists
- Diagnosing the problem
- To confirm whether the Data Foundation
Rook Ceph cluster is unstable:
- Ensure that the rook-ceph-tools pod is running:
oc get pods -n openshift-storage | grep rook-ceph-tools
Note: On IBM Fusion HCI System or on environments that use hosted control planes, the pods are running in the openshift-storage-client project.
- Set the TOOLS_POD environment variable to the name of the rook-ceph-tools pod:
export TOOLS_POD=<pod-name>
- Execute into the rook-ceph-tools pod:
oc rsh -n openshift-storage ${TOOLS_POD}
- Run the following command to get the status of the Rook Ceph cluster:
ceph status
Confirm that the output includes the following line:
health: HEALTH_WARN
- Exit the pod:
exit
- Resolving the problem
- To resolve the problem:
- Get the names of the rook-ceph-mgr pods:
oc get pods -n openshift-storage | grep rook-ceph-mgr
- Set the MGR_POD_A environment variable to the name of the rook-ceph-mgr-a pod:
export MGR_POD_A=<rook-ceph-mgr-a-pod-name>
- Set the MGR_POD_B environment variable to the name of the rook-ceph-mgr-b pod:
export MGR_POD_B=<rook-ceph-mgr-b-pod-name>
- Delete the rook-ceph-mgr-a pod:
oc delete pods ${MGR_POD_A} -n openshift-storage
- Ensure that the rook-ceph-mgr-a pod is running before you move to the next step:
oc get pods -n openshift-storage | grep rook-ceph-mgr
- Delete the rook-ceph-mgr-b pod:
oc delete pods ${MGR_POD_B} -n openshift-storage
- Ensure that the rook-ceph-mgr-b pod is running:
oc get pods -n openshift-storage | grep rook-ceph-mgr
After you upgrade a Red Hat OpenShift Container Platform cluster, the FoundationDB resource can become unavailable
Applies to: 5.1.0 and later
After you upgrade your cluster to a new version of Red Hat OpenShift Container Platform, the IBM FoundationDB pods can become unavailable. When this issue occurs, services that rely on FoundationDB such as IBM Knowledge Catalog and IBM Match 360 cannot function correctly.
- IBM Knowledge Catalog
- IBM Match 360
- Diagnosing the problem
- To identify the cause of this issue, check the FoundationDB status and details.
- Check the FoundationDB status:
oc get fdbcluster -o yaml | grep fdbStatus
If this command is successful, the returned status is Complete. If the status is InProgress or Failed, proceed to the workaround steps.
- If the status is Complete but FoundationDB is still unavailable, log in to one of the FDB pods and check the status details to ensure that the database is available and all coordinators are reachable.
oc rsh sample-cluster-log-1 /bin/fdbcli
To check the detailed status of the FDB pod, run fdbcli to enter the FoundationDB command-line interface, then run the following command at the fdb> prompt:
status details
  - If you get a message that is similar to Could not communicate with a quorum of coordination servers, run the coordinators command with the IP addresses specified in the error message as input:
oc get pod -o wide | grep storage
coordinators IP-ADDRESS-1:4500:tls IP-ADDRESS-2:4500:tls IP-ADDRESS-3:4500:tls
If this step does not resolve the problem, proceed to the workaround steps.
  - If you get a different message, such as Recruiting new transaction servers, proceed to the workaround steps.
- Resolving the problem
- To resolve this issue, restart the FoundationDB pods.
Required role: To complete this task, you must be a cluster administrator.
- Restart the FoundationDB cluster pods:
oc get fdbcluster
oc get po | grep ${CLUSTER_NAME} | grep -v backup | awk '{print $1}' | xargs oc delete po
Replace ${CLUSTER_NAME} in the command with the name of your fdbcluster instance.
- Restart the FoundationDB operator pods:
oc get po | grep fdb-controller | awk '{print $1}' | xargs oc delete po
- After the pods finish restarting, check to ensure that FoundationDB is available.
  - Check the FoundationDB status:
oc get fdbcluster -o yaml | grep fdbStatus
The returned status must be Complete.
  - Check to ensure that the database is available:
oc rsh sample-cluster-log-1 /bin/fdbcli
If the database is still not available, complete the following steps:
    - Log in to the ibm-fdb-controller pod.
    - Run the fix-coordinator script:
kubectl fdb fix-coordinator-ips -c ${CLUSTER_NAME} -n ${PROJECT_CPD_INST_OPERATORS}
Replace ${CLUSTER_NAME} in the command with the name of your fdbcluster instance.
Persistent volume claims with the WaitForFirstConsumer volume binding mode
are flagged by the installation health checks
Applies to: 5.1.0 and later
The installation health checks flag the following persistent volume claims:
- ibm-cs-postgres-backup
- ibm-zen-objectstore-backup-pvc
Both of these persistent volume claims are created with the WaitForFirstConsumer
volume binding mode. In addition, both persistent volume claims will remain in the
Pending state until you back up your IBM Software Hub installation. This behavior is expected.
However, when you run the cpd-cli health operands command, the Persistent Volume Claim Healthcheck fails.
If there are more persistent volume claims returned by the health check, you must investigate
further to determine why those persistent volume claims are pending. However, if only the following
persistent volume claims are returned, you can ignore the Failed result:
- ibm-cs-postgres-backup
- ibm-zen-objectstore-backup-pvc
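To confirm that only these two claims are pending before you decide whether the result can be ignored, you can list the pending persistent volume claims in the instance project, for example:
# List persistent volume claims that are still pending in the instance project.
oc get pvc --namespace=${PROJECT_CPD_INST_OPERANDS} | grep Pending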
Node pinning is not applied to postgresql pods
Applies to: 5.1.0 and later
If you use node pinning to schedule pods on specific nodes, and your environment includes
postgresql pods, the node affinity settings are not applied to the
postgresql pods that are associated with your IBM Software Hub deployment.
The resource specification injection (RSI) webhook cannot patch postgresql pods
because the EDB Postgres operator uses a
PodDisruptionBudget resource to limit the number of concurrent disruptions to
postgresql pods. The PodDisruptionBudget resource prevents
postgresql pods from being evicted.
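To see the PodDisruptionBudget resources that protect the postgresql pods on your cluster, you can run a read-only check like the following sketch; the grep filter assumes that the budget names contain the string postgres.
# Show PodDisruptionBudget resources that cover the postgresql pods.
oc get pdb --namespace=${PROJECT_CPD_INST_OPERANDS} | grep postgres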
The ibm-nginx deployment does not scale fast enough when automatic scaling
is configured
Applies to: 5.1.0 and later
If you configure automatic scaling for IBM Software Hub, the ibm-nginx deployment
might not scale fast enough. Some symptoms include:
- Slow response times
- High CPU requests are throttled
- The deployment scales up and down even when the workload is steady
This problem typically occurs when you install watsonx Assistant or watsonx™ Orchestrate.
- Resolving the problem
- If you encounter the preceding symptoms, you must manually scale the
ibm-nginxdeployment:oc patch zenservice lite-cr \ --namespace=${PROJECT_CPD_INST_OPERANDS} \ --type merge \ --patch '{"spec": { "Nginx": { "name": "ibm-nginx", "kind": "Deployment", "container": "ibm-nginx-container", "replicas": 5, "minReplicas": 2, "maxReplicas": 11, "guaranteedReplicas": 2, "metrics": [ { "type": "Resource", "resource": { "name": "cpu", "target": { "type": "Utilization", "averageUtilization": 529 } } } ], "resources": { "limits": { "cpu": "1700m", "memory": "2048Mi", "ephemeral-storage": "500Mi" }, "requests": { "cpu": "225m", "memory": "920Mi", "ephemeral-storage": "100Mi" } }, "containerPolicies": [ { "containerName": "*", "minAllowed": { "cpu": "200m", "memory": "256Mi" }, "maxAllowed": { "cpu": "2000m", "memory": "2048Mi" }, "controlledResources": [ "cpu", "memory" ], "controlledValues": "RequestsAndLimits" } ] } }}'
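After you apply the patch, you can watch the deployment until the new replica counts are in effect, for example:
# Watch the ibm-nginx deployment while the replicas scale to the patched values.
oc get deployment ibm-nginx --namespace=${PROJECT_CPD_INST_OPERANDS} -w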
Uninstalling IBM watsonx services does not remove the IBM watsonx experience
Applies to: 5.1.0 and later
After you uninstall watsonx.ai™ or watsonx.governance, the IBM watsonx experience is still available in the web client even though there are no services that are specific to the IBM watsonx experience.
- Resolving the problem
- To remove the IBM watsonx experience
from the web client, an instance administrator must run the following
command:
oc delete zenextension wx-perspective-configuration \ --namespace=${PROJECT_CPD_INST_OPERANDS}
Backup and restore issues
Issues that apply to several backup and restore methods
- Backup issues
- Review the following issues before you create a backup. Do the workarounds that apply to your environment.
- Backup precheck fails due to missing Data Refinery custom resource error message
- Backup precheck fails on upgraded deployment
- Backup fails with an ibm_neo4j error when IBM Match 360 is scaled to the x-small size
- Backup fails for the platform with error in EDB Postgres cluster
- After backing up or restoring IBM Match 360, Redis fails to return to a Completed state
- Restore issues
- Review the following issues before you restore a backup. Do the workarounds that apply to your environment.
- Relationship explorer is not working after restoring IBM Knowledge Catalog
- Execution Engine for Apache Hadoop service is inactive after a restore
- After a restore, OperandRequest timeout error in the ZenService custom resource
- After restoring the IBM Match 360 service, the onboard job fails
- After backing up or restoring IBM Match 360, Redis fails to return to a Completed state
- Post-restore hooks fail to run when restoring deployment that includes IBM Knowledge Catalog
- Unable to log in to IBM Software Hub with OpenShift cluster credentials after successfully restoring to a different cluster
- Running cpd-cli restore post-hook command after Db2 Big SQL was successfully restored times out
- After restoring Analytics Engine powered by Apache Spark, IBM Software Hub resources reference the source cluster
- After restoring IBM Software Hub, watsonx.data Presto connection still references source cluster
- Unable to log in to the IBM Cloud Pak foundational services console after restore
- watsonx Assistant stuck at System is Training after restore
- After an online restore, some service custom resources cannot reconcile
- Restore fails for watsonx Orchestrate
- IBM Software Hub user interface does not load after restoring
- ibm-lh-lakehouse-validator pods repeatedly restart after restore
- Restore to same cluster fails with restore-cpd-volumes error
- Restore fails with ErrImageNeverPull error for Analytics Engine powered by Apache Spark
- SQL30081N RC 115,*,* error for Db2 selectForReceiveTimeout function after instance restore
Backup and restore issues with the OADP utility
- Backup issues
- Review the following issues before you create a backup. Do the workarounds that apply to your environment.
- Online backup of upgraded IBM Cloud Pak for Data instance fails validation
- Offline backup of Db2 Data Management Console fails with backup validation error
- Offline backup fails with PartiallyFailed error
- OpenPages storage content is missing from offline backup
- Offline backup validation fails in IBM Software Hub deployment that includes Db2 and Informix in the same namespace
- Unable to create offline backup when IBM Software Hub deployment includes MongoDB service
- Backup fails after a service is upgraded and then uninstalled
- Offline backup prehooks fail for lite-maint resource
- Backup fails when deployment includes IBM Knowledge Catalog Standard or IBM Knowledge Catalog Premium without optional components enabled
- ObjectBucketClaim is not supported by the OADP utility
- Unable to create backup due to missing ConfigMaps
- Backup is missing EDB Postgres PVCs
- Offline backup prehooks fail when deployment includes IBM Knowledge Catalog
- Offline backup fails after watsonx.ai is uninstalled
- Db2U backup precheck fails during offline backup
- Db2 Big SQL backup pre-hook and post-hook fail during offline backup
- Restore issues
- Review the following issues after you restore a backup. Do the workarounds that apply to your environment.
- Unable to restore offline backup of RStudio Server Runtimes
- Offline post-restore hooks fail when restoring Informix
- Unable to log in to the IBM Software Hub user interface after offline restore
- Offline restore to different cluster fails with nginx error
- IBM Knowledge Catalog custom resources are not restored
- Errors when restoring IBM Software Hub operators
- Post-restore hook error when restoring offline backup of Db2
- Restoring Data Virtualization fails with metastore not running or failed to connect to database error
- After online restore, watsonx Code Assistant for Z is not running properly
- Online post-restore hooks fail to run with timed out waiting for condition error when restoring Analytics Engine powered by Apache Spark
- Restore fails with condition not met error
- Offline restore fails with getting persistent volume claim error message
- After restoring Watson Speech services online backup, unable to use service instance ID to make service REST API calls
- After restoring Watson Discovery online backup, unable to use service instance ID to make service REST API calls
- Restore posthooks fail to run when restoring Data Virtualization
- Offline restore fails with cs-postgres timeout error
- Restoring an offline backup fails with zenservice-check error
- Error running post-restore hooks during offline restore
- Prompt tuning fails after restoring watsonx.ai
- After restoring watsonx Orchestrate, Kafka controller pods in the knative-eventing project enter a CrashLoopBackOff state
- Restoring online backup of Data Virtualization fails
- Restore posthooks timeout errors during Db2U and IBM Software Hub control plane restore
- Online restore posthooks fail when restoring Db2
- Offline restore fails at post-restore hooks step
- Error running post-restore hooks when restoring an offline backup
Backup and restore issues with IBM Fusion
- Backup issues
- Review the following issues before you create a backup. Do the workarounds that apply to your environment.
- Restore issues
- Do the workarounds that apply to your environment after you restore a backup.
- Restore fails at Hook: br-service-hooks/operators-restore step
- Restore fails at Hook: br-service-hooks/operators-restore step
- IBM Fusion reports successful restore but many service custom resources are not in Completed state
- Unable to connect to Db2 database after restoring Data Virtualization to a different cluster
Backup and restore issues with NetApp Trident protect
- Restore issues
- Review the following issues after you restore a backup. Do the workarounds that apply to your environment.
Backup and restore issues with NetApp Astra Control Center
- Restore issues
- Review the following issues after you restore a backup. Do the workarounds that apply to your environment.
Backup and restore issues with Portworx
- Backup issues
- Review the following issues before you create a backup. Do the workarounds that apply to your environment.
- Restore issues
- Review the following issues after you restore a backup. Do the workarounds that apply to your environment.
Backup and restore issues with the IBM Software Hub volume backup utility
- Backup issues
- Review the following issues before you create a backup. Do the workarounds that apply to your environment.
Backup precheck fails due to missing Data Refinery custom resource error message
Applies to: 5.1.0
Applies to: All backup and restore methods
Fixed in: 5.1.1
- Diagnosing the problem
- In the CPD-CLI*.log file, you see the following error
message:
time=<timestamp> level=error msg=error performing op preCheckViaConfigHookRule for resource rshaper (configmap=cpd-datarefinery-maint-aux-ckpt-cm): : datarefinery.datarefinery.cpd.ibm.com "datarefinery-cr" not found func=cpdbr-oadp/pkg/quiesce.(*BasicPlanExecutor).applyPlanInternal file=/go/src/cpdbr-oadp/pkg/quiesce/planexecutor.go:1631 The hook is searching for datarefinery-cr, and it failed because the datarefinery-cr is not present. - Cause of the problem
- Starting in IBM Cloud Pak for Data 4.7, the Data Refinery custom resource name was changed from datarefinery-sample to datarefinery-cr. If you upgraded IBM Cloud Pak for Data from version 4.6 or earlier, the Data Refinery custom resource name is still datarefinery-sample.
- Workaround
- Update the Data Refinery custom
resource name to datarefinery-sample in the
cpd-datarefinery-maint-aux-br-cm and
cpd-datarefinery-maint-aux-ckpt-cm ConfigMaps.
- Edit the cpd-datarefinery-maint-aux-br-cm ConfigMap:
oc -n ${PROJECT_CPD_INST_OPERANDS} edit cm cpd-datarefinery-maint-aux-br-cm
- In the precheck-meta section, under backup-hooks, update the Data Refinery custom resource name to datarefinery-sample:
precheck-meta:
  backup-hooks:
    exec-rules:
      - resource-kind: datarefinery.datarefinery.cpd.ibm.com
        name: datarefinery-sample
- Repeat steps 1 and 2 in the cpd-datarefinery-maint-aux-ckpt-cm ConfigMap.
- Retry the backup.
Restoring online backup of Data Virtualization fails
Applies to: 5.1.0 and later
Applies to: All online backup and restore methods
- Diagnosing the problem
- In the CPD-CLI*.log file, you see the following error
message:
time=<timestamp> level=info msg= zen/configmap/cpd-dv-aux-ckpt-cm: component=dv, op=<mode=post-restore,type=config-hook,method=rule>, status=error func=cpdbr-oadp/pkg/quiesce.logPlanResult file=/a/workspace/oadp-upload/pkg/quiesce/planexecutor.go:1137 - Workaround
- Do the following steps:
- Disable the Data Virtualization liveness probe in the Data Virtualization head pod:
oc exec -it c-db2u-dv-db2u-0 -- bash -c "su - db2inst1 - mkdir /mnt/PV/versioned/marker_file"
oc exec -it c-db2u-dv-db2u-0 -- bash -c "su - db2inst1 - touch /mnt/PV/versioned/marker_file/.bar"
- Disable the BigSQL restart daemon in the Data Virtualization head pod:
oc exec -it c-db2u-dv-db2u-0 -- bash -c "db2uctl marker create BIGSQL_DAEMON_PAUSE"
- Stop BigSQL in the Data Virtualization head pod:
oc rsh c-db2u-dv-db2u-0 bash
su - db2inst1
bigsql stop
- Re-enable the Hive user in the users.json file in the Data Virtualization head pod.
- Edit the users.json
file:
vi /mnt/blumeta0/db2_config/users.json - Locate
"locked":trueand change it to"locked":false.
- Edit the users.json
file:
- On the hurricane pod, rename the hive-site.xml config file so that it can be reconfigured by
restarting the
pod:
oc rsh $(oc get pod -o name -l formation_id=db2u-dv,role=hurricane)su - db2inst1mv /mnt/blumeta0/home/db2inst1/ibm/bigsql/hive-site.xml /mnt/blumeta0/home/db2inst1/ibm/bigsql/hive-site.xml.bak - Exit the pod, and then run the following command to delete it.Note: Since the configuration file was renamed, it is regenerated with the correct settings.
oc delete pod -l formation_id=db2u-dv,role=hurricane - After the hurricane pod is started again, run the following commands on the hurricane pod to
disable the SSL so that it can be reconfigured in a later
step:
oc rsh $(oc get pod -o name -l formation_id=db2u-dv,role=hurricane)su - db2inst1bigsql-config -disableMetastoreSSLbigsql-config -disableSchedulerSSL - Clean up leftover files from the hurricane
pod:
rm -rf /mnt/blumeta0/bigsql/security/*rm -rfv /mnt/blumeta0/bigsql/security/.* 2>/dev/null - Run the following commands to disable SSL from the head
pod:
oc rsh c-db2u-dv-db2u-0 bashsu - db2inst1rah "bigsql-config -disableMetastoreSSL"rah "bigsql-config -disableSchedulerSSL" - Clean up leftover files from the head and worker
pods:
rm -rf /mnt/blumeta0/bigsql/security/*rm -rfv /mnt/blumeta0/bigsql/security/.* 2>/dev/null - Run the following commands to re-enable SSL on the head pod, and restart Db2
Big SQL so that configuration changes can take
effect:
bigsql-config -enableMetastoreSSLbigsql-config -enableSchedulerSSLbigsql stop; bigsql start - Remove markers that were created in steps 1 and 2 in the Data Virtualization head
pod:
oc exec -it c-db2u-dv-db2u-0 -- bash -c "rm -rf /mnt/PV/versioned/marker_file/.bar"oc exec -it c-db2u-dv-db2u-0 -- bash -c "db2uctl marker delete BIGSQL_DAEMON_PAUSE" - If you are doing the backup and restore with the OADP backup and restore utility, run the following
command:
cpd-cli oadp restore prehooks --hook-kind=checkpoint --include-namespaces=${PROJECT_CPD_INST_OPERANDS},${PROJECT_CPD_INST_OPERATORS} --log-level debug --verbose - If you are doing the backup and restore with IBM Fusion, NetApp Astra Control Center, or Portworx data replication, run the following
commands:
CPDBR_POD=$(oc get po -l component=cpdbr-tenant -n ${PROJECT_CPD_INST_OPERATORS} --no-headers | awk '{print $1}')oc exec -n ${PROJECT_CPD_INST_OPERATORS} ${CPDBR_POD} -it -- /bin/sh -c "./cpdbr-oadp restore posthooks --hook-kind=checkpoint --include-namespaces=${PROJECT_CPD_INST_OPERANDS},${PROJECT_CPD_INST_OPERATORS}"oc exec -n ${PROJECT_CPD_INST_OPERATORS} ${CPDBR_POD} -it -- /bin/sh -c "./cpdops/files/cpd-operators.sh restore-namespacescope --foundation-namespace ${PROJECT_CPD_INST_OPERATORS} --operators-namespace ${PROJECT_CPD_INST_OPERATORS}"
Unable to connect to Db2 database after restoring Data Virtualization to a different cluster
Applies to: 5.1.1
Applies to: Backup and restore to different cluster with IBM Fusion
Fixed in: 5.1.2
- Diagnosing the problem
- In the IBM Software Hub web client, users see
the following error
message:
SQL30082N Security processing failed with reason "24" ("USERNAME AND/OR PASSWORD INVALID"). SQLSTATE=08001 FAIL: connect to database with admin/password - Cause of the problem
- The Data Virtualization head pod is unable to log in to Db2 instances because Db2 is using the source cluster's IBM Software Hub route.
- Resolving the problem
- Restart the Data Virtualization head and worker pods:
-
Log in to Red Hat OpenShift Container Platform as a cluster administrator.
${OC_LOGIN}Remember:OC_LOGINis an alias for theoc logincommand. - Scale down to 0
replicas:
oc -n ${PROJECT_CPD_INST_OPERANDS} scale sts c-db2u-dv-db2u --replicas=0 - Wait for all c-db2u-dv-db2u-<xxxx> pods to be deleted.
- Scale up to x replicas, where x equals the total number of
head and worker pods.
The following example assumes that you have one head pod and one worker pod.
oc -n ${PROJECT_CPD_INST_OPERANDS} scale sts c-db2u-dv-db2u --replicas=2
-
IBM Fusion shows successful restore but Informix custom resource reports that instance is unhealthy
Applies to: 5.1.1
Applies to: Backup and restore with IBM Fusion
Fixed in: 5.1.2
- Diagnosing the problem
- The status of the informix custom resource is
InProgress. - Cause of the problem
- The informix custom resource is not reporting the proper status of the StatefulSet and pod because a state change is not reconciled. However, the Informix instance is working properly and clients can interact with the instance.
- Resolving the problem
- Run a reconciliation by restarting the Informix operator controller manager pod. When the
reconciliation is completed, the custom resource shows the proper status. Run the following
commands:
oc scale deployment informix-operator-controller-manager --replicas=0 -n ${PROJECT_CPD_INST_OPERATORS} oc scale deployment informix-operator-controller-manager --replicas=1 -n ${PROJECT_CPD_INST_OPERATORS}
Unable to log in to IBM Software Hub with OpenShift cluster credentials after successfully restoring to a different cluster
Applies to: 5.1.0, 5.1.1, 5.1.2
Applies to: All restore to different cluster scenarios
Fixed in: 5.1.3
- Diagnosing the problem
- When IBM Software Hub is integrated with the
Identity Management Service service, you cannot log in with
OpenShift cluster credentials. You might
be able to log in with LDAP or as
cpdadmin. - Resolving the problem
- To work around the problem, run the following
commands:
oc delete cm platform-auth-idp -n ${PROJECT_CPD_INST_OPERANDS} oc delete cm oauth-client-map -n ${PROJECT_CPD_INST_OPERANDS} oc delete cm ibm-iam-bindinfo-oauth-client-map -n ${PROJECT_CPD_INST_OPERANDS} oc delete cm ibm-iam-bindinfo-platform-auth-idp -n ${PROJECT_CPD_INST_OPERANDS} oc delete pods -n ${PROJECT_CPD_INST_OPERATORS} -l app.kubernetes.io/instance=ibm-common-service-operator oc delete pods -n ${PROJECT_CPD_INST_OPERANDS} -l app.kubernetes.io/instance=platform-auth-service oc delete pods -n ${PROJECT_CPD_INST_OPERANDS} -l app.kubernetes.io/instance=platform-identity-management oc delete pods -n ${PROJECT_CPD_INST_OPERANDS} -l app.kubernetes.io/instance=platform-identity-provider
Restore posthooks timeout errors during Db2U and IBM Software Hub control plane restore
Applies to: 5.1.0
Applies to: Online backup and restore with the OADP utility
Fixed in: 5.1.1
- Diagnosing the problem
- When you restore a backup, you see an error message in the CPD-CLI*.log
file like in the following
examples:
<time> Hook execution breakdown by status=error/timedout: <time> <time> The following hooks either have errors or timed out <time> <time> post-restore (2): <time> <time> COMPONENT CONFIGMAP METHOD STATUS DURATION ADDONID <time> db2u db2u-aux-ckpt-cm rule error 1h0m0.176305347s databases <time> zen-lite-patch ibm-zen-lite-patch-ckpt-cm rule error 8.55624ms zen-lite <time> <time> -------------------------------------------------------------------------------- <time> <time> Error: failed to execute masterplan: 1 error occurred: <time> * DataProtectionPlan=v1-orchestration, Action=post-restore-hooks (index=9) error: online post-restore hooks execution failed: error running post-restore hooks: Error running post-processing rules. Check the /<directory>/CPD-CLI-<date>.log for errors. <time> 2 errors occurred: <time> * error performing op postRestoreViaConfigHookRule for resource db2u (configmap=db2u-aux-ckpt-cm): 2 errors occurred: <time> * timed out waiting for the condition <time> * timed out waiting for the condition <time> <time> * error performing op postRestoreViaConfigHookRule for resource zen-lite-patch (configmap=ibm-zen-lite-patch-ckpt-cm): : clusterserviceversions.operators.coreos.com "ibm-zen-operator.v6.1.0" not foundtime=<timestamp> level=info msg=Time: <timestamp> level=info - OperandRequest: ibm-iam-request - phase: Installing func=cpdbr-oadp/pkg/cli.(*LogAndPrintWriter).Write file=/a/workspace/oadp-upload/pkg/cli/scripts.go:116 time=<timestamp> level=info msg=Time: <timestamp> level=info - sleeping for 64s... (retry attempt 10/10) func=cpdbr-oadp/pkg/cli.(*LogAndPrintWriter).Write file=/a/workspace/oadp-upload/pkg/cli/scripts.go:116 time=<timestamp> level=info msg=Time: <timestamp> level=info - OperandRequest: ibm-iam-request - phase: Installing func=cpdbr-oadp/pkg/cli.(*LogAndPrintWriter).Write file=/a/workspace/oadp-upload/pkg/cli/scripts.go:116 time=<timestamp> level=info msg=Time: <timestamp> level=warning - Create OperandRequest Timeout Warning func=cpdbr-oadp/pkg/cli.(*LogAndPrintWriter).Write file=/a/workspace/oadp-upload/pkg/cli/scripts.go:116 time=<timestamp> level=info msg=-------------------------------------------------- func=cpdbr-oadp/pkg/cli.(*LogAndPrintWriter).Write file=/a/workspace/oadp-upload/pkg/cli/scripts.go:116 - Cause of the problem
- The timeout problem is caused by a slow Kubernetes API server on the OpenShift cluster.
- Resolving the problem
- Retry the restore.
Online restore posthooks fail when restoring Db2
Applies to: 5.1.0, 5.1.1
Applies to: Online backup and restore with the OADP utility
Fixed in: 5.1.2
- Diagnosing the problem
- When you restore a backup, you see an error message in the CPD-CLI*.log
file like in the following
example:
<time> The following hooks either have errors or timed out <time> <time> post-restore (1): <time> <time> COMPONENT CONFIGMAP METHOD STATUS DURATION ADDONID <time> db2u db2u-aux-ckpt-cm rule error 44.941375556s databases <time> <time> -------------------------------------------------------------------------------- <time> <time> Error: failed to execute masterplan: 1 error occurred: <time> * DataProtectionPlan=v1-orchestration, Action=post-restore-hooks (index=9) error: online post-restore hooks execution failed: error running post-restore hooks: Error running post-processing rules. Check the /<directory>/CPD-CLI-<date_timestamp>.log for errors. <time> 1 error occurred: <time> * error performing op postRestoreViaConfigHookRule for resource db2u (configmap=db2u-aux-ckpt-cm): 1 error occurred: <time> * error executing command (container=db2u podIdx=0 podName=c-db2oltp-<xxxxxxxxxxxxxxxx>-db2u-0 namespace=${PROJECT_CPD_INST_OPERANDS} auxMetaName=db2u-aux component=db2u actionIdx=0): command terminated with exit code 1 - Resolving the problem
- This problem is intermittent. Do the following steps:
- Rerun the restore posthooks:
cpd-cli oadp restore posthooks \
  --tenant-operator-namespace ${PROJECT_CPD_INST_OPERANDS} \
  --hook-kind=posthook \
  --log-level=debug
- Reset the namespacescope by running the following commands:
oc get po -A | grep "cpdbr-tenant-service"
oc rsh -n ${PROJECT_CPD_INST_OPERANDS} <cpdbr-tenant-...>
/cpdbr-scripts/cpdbr/cpdbr-cpd-operators.sh restore-namespacescope --operators-namespace ${PROJECT_CPD_INST_OPERATORS} --foundation-namespace ${PROJECT_CPD_INST_OPERATORS}
Running cpd-cli restore post-hook command after Db2 Big SQL was successfully restored times out
Applies to: 5.1.0 and later
Applies to: Online backup and restore with either the OADP utility or Portworx
- Diagnosing the problem
- After you restore an online backup, you see the following message repeating in the
CPD-CLI*.log
file:
time=<timestamp> level=INFO msg=Waiting for marker /tmp/.ready_to_connectToDb - Resolving the problem
- Manually recreate /tmp/.ready_to_connectToDb in the Db2
Big SQL head pod. Do the following steps:
- Log in to the Db2 Big SQL head pod:
oc rsh $(oc get pods | grep -i c-bigsql | grep -i db2u-0 | cut -d' ' -f 1)
- Switch to the db2inst1 user:
su - db2inst1
- Recreate /tmp/.ready_to_connectToDb:
touch /tmp/.ready_to_connectToDb
- Rerun the cpd-cli oadp restore posthooks command:
cpd-cli oadp restore posthooks \
  --hook-kind=checkpoint \
  --namespace=${PROJECT_CPD_INST_OPERANDS}
After restoring Analytics Engine powered by Apache Spark, IBM Software Hub resources reference the source cluster
Applies to: 5.1.0 and later
Applies to: Online backup and restore to a different cluster
- Diagnosing the problem
- After you restore IBM Software Hub to a different cluster, accessing the IBM Software Hub console shows the source cluster URL instead of the target cluster.
- Cause of the problem
- After the restore, the ConfigMap spark-hb-deployment-properties references the source cluster.
- Resolving the problem
- Do the following steps:
- Delete the spark-hb-deployment-properties ConfigMap:
oc delete cm spark-hb-deployment-properties -n ${PROJECT_CPD_INST_OPERANDS}
- Reconcile the Analytics Engine powered by Apache Spark custom resource:
oc patch AnalyticsEngine analyticsengine-sample -n ${PROJECT_CPD_INST_OPERANDS} --type merge --patch '{"spec": {"forceReconcile": "'$(date +%s)'"}}'
After restoring IBM Software Hub, watsonx.data Presto connection still references source cluster
Applies to: 5.1.0
Applies to: Online backup and restore to a different cluster
Fixed in: 5.1.1
- Diagnosing the problem
- After you restore IBM Software Hub to a different cluster, in the Presto connection details, you see the source cluster hostname in the engine details.
- Cause of the problem
- After the restore, a ConfigMap might have references to the source cluster.
- Resolving the problem
- Do the following steps:
- Get the watsonx.data™ Presto engine name:
oc get wxdengine -o name -n ${PROJECT_CPD_INST_OPERANDS}
- Patch the watsonx.data Presto engine:
oc patch wxdengine.watsonxdata.ibm.com/<engine_name> \
  --type json \
  -n ${PROJECT_CPD_INST_OPERANDS} \
  -p '[ { "op": "remove", "path": "/spec/externalEngineUri" } ]'
- To accelerate the reconcile, delete the lakehouse operator pod:
oc delete pod ibm-lakehouse-controller-manager-<xxxxxxxxxx>-<yyyyy> -n ${PROJECT_CPD_INST_OPERATORS}
After a few minutes, the Presto engine repopulates with the target cluster URL, and the ConfigMap is recreated with the correct URLs.
ObjectBucketClaim is not supported by the OADP utility
Applies to: 5.1.0
Applies to: Backup and restore with the OADP utility
- Diagnosing the problem
- If an ObjectBucketClaim is created in an IBM Software Hub instance, it is not included when you create a backup.
- Cause of the problem
- OADP does not support backup and restore of ObjectBucketClaim.
- Resolving the problem
- Services that provide the option to use ObjectBuckets must ensure that the ObjectBucketClaim is in a separate namespace and backed up separately.
Unable to log in to the IBM Cloud Pak foundational services console after restore
Applies to: 5.1.0
Applies to: Online or offline backup and restore to a different cluster
Fixed in: 5.1.1
- Diagnosing the problem
- After restoring an online or offline backup of IBM Software Hub that is integrated with the
Identity Management Service to a different
cluster, users cannot log in to the IBM Cloud Pak foundational services console. The
following error message
appears:
CWOAU0061E: The OAuth service provider could not find the client because the client name is not valid. Contact your system administrator to resolve the problem. - Resolving the problem
- After the zenservice custom resource is reconciled and in a Completed state, restart the Identity Management Service operator and related resources. Run the following commands:
oc delete cm ibmcloud-cluster-info -n ${PROJECT_CPD_INST_OPERANDS}
oc delete po -n ${PROJECT_CPD_INST_OPERANDS} -l component=platform-identity-provider
oc delete po -n ${PROJECT_CPD_INST_OPERANDS} -l component=platform-identity-management
oc delete po -n ${PROJECT_CPD_INST_OPERANDS} -l component=platform-auth-service
oc delete po -n ${PROJECT_CPD_INST_OPERANDS} -l app.kubernetes.io/name=ibm-iam-operator
Backup precheck fails on upgraded deployment
Applies to: 5.1.0
Applies to: All online backup and restore methods
Fixed in: 5.1.1
- Diagnosing the problem
- When you try to create an online backup of an IBM Cloud Pak for Data 5.0.x deployment that was
upgraded to IBM Software Hub
5.1.0, the backup precheck fails. You see an error message in the log
file like in the following
example:
time=<timestamp> level=info msg=exit RunExecRules func=cpdbr-oadp/pkg/quiesce.RunExecRules file=/a/workspace/oadp-upload/pkg/quiesce/ruleexecutor.go:142 time=<timestamp> level=error msg=error performing op preCheckViaConfigHookRule for resource zen (configmap=cpd-zen-aux-ckpt-cm): 1 error occurred: * condition not met (condition={$.status.controlPlaneStatus} == {"Completed"}, namespace=thanos, gvr=cpd.ibm.com/v1, Resource=ibmcpds, name=ibmcpd-cr) func=cpdbr-oadp/pkg/quiesce.(*BasicPlanExecutor).applyPlanInternal file=/a/workspace/oadp-upload/pkg/quiesce/planexecutor.go:1631 time=<timestamp> level=info msg=op preCheckViaConfigHookRule for resource zen took 10.035602582s func=cpdbr-oadp/pkg/quiesce.(*BasicPlanExecutor).applyPlanInternal file=/a/workspace/oadp-upload/pkg/quiesce/planexecutor.go:1643 time=<timestamp> level=info msg= ** PHASE [PLAN EXECUTOR/INTERNAL/PRECHECK-BACKUP/RESOURCE/ZEN (CONFIGMAP=CPD-ZEN-AUX-CKPT-CM)/END] func=cpdbr-oadp/pkg/utils.LogMarker file=/a/workspace/oadp-upload/pkg/utils/log.go:64 pre-check (1): COMPONENT CONFIGMAP METHOD STATUS DURATION ADDONID zen cpd-zen-aux-ckpt-cm rule error 10.035602582s zen-lite - Cause of the problem
- Two EDB Postgres clusters, wa-postgres and wa-postgres-16, exist as a result of migrating the data from one EDB Postgres cluster to another EDB Postgres cluster.
- Resolving the problem
- Delete the wa-postgres cluster.
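A sketch of the deletion by using the EDB Postgres Cluster resource follows; the fully qualified resource name is an assumption, so list the clusters first and confirm that wa-postgres-16 remains the active cluster before you delete anything.
# List the EDB Postgres clusters, then delete the old wa-postgres cluster.
oc get clusters.postgresql.k8s.enterprisedb.io -n ${PROJECT_CPD_INST_OPERANDS}
oc delete clusters.postgresql.k8s.enterprisedb.io wa-postgres -n ${PROJECT_CPD_INST_OPERANDS}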
Relationship explorer is not working after restoring IBM Knowledge Catalog
Applies to: 5.1.1, 5.1.2
Applies to: All backup and restore methods
Fixed in: 5.1.3
- Diagnosing the problem
- After restoring an online or offline backup, clicking the Relationship
Explorer button gives the following error:
Error fetching canvas Not found. The resource you tried to access does not exist. - Cause of the problem
- This problem typically occurs due to a timing conflict between the restore process and a scheduled backup job. When performing an online or offline restore of the Neo4j database, the restore process might succeed, but it often restores an incorrect database bundle because of the timing conflict. As a result, data loss occurs in the Neo4j graph database.
- Resolving the problem
- Do the following steps:
- Patch the knowledgegraph-cr custom resource to disable the
data-lineage-neo4j-backups-cronjob cronjob before a backup is taken by running
the following
command:
oc patch Knowledgegraph knowledgegraph-cr -n ${PROJECT_CPD_INST_OPERANDS} --type merge --patch '{"spec": {"neo4j_backup_job_enabled": "False"}}' oc delete cronjob data-lineage-neo4j-backups-cronjob -n ${PROJECT_CPD_INST_OPERANDS} - Wait for the knowledgegraph-cr custom resource to reconcile.
- Check that the knowledgegraph-cr custom resource is in a
Completedstate:oc get knowledgegraph.wkc.cpd.ibm.com -n ${PROJECT_CPD_INST_OPERANDS} - Check that the Neo4j custom resource is in a
Completedstate:oc get neo4jclusters.neo4j.cpd.ibm.com -n ${PROJECT_CPD_INST_OPERANDS}
- Create a new backup.
Execution Engine for Apache Hadoop service is inactive after a restore
Applies to: 5.1.0
Applies to: All backup and restore methods
Fixed in: 5.1.1
- Diagnosing the problem
- After an online or offline restore, the hadoop-cr custom resource is not
active. Run the following
command:
oc get hadoop -n ${PROJECT_CPD_INST_OPERANDS}Output of the command:NAME VERSION RECONCILED STATUS AGE hadoop-cr 5.1.0 22h - Cause of the problem
- Execution Engine for Apache Hadoop subscriptions are missing from the cluster.
- Resolving the problem
- After you take a backup, check that the Execution Engine for Apache Hadoop subscriptions exist by running
the following
command:
for sub in $(oc get cm cpd-operators -o jsonpath='{.data.subscriptions}' -n ${PROJECT_CPD_INST_OPERATORS} | jq -r '.[] | .metadata.name');do oc get subscriptions.operators.coreos.com ${sub} -n ${PROJECT_CPD_INST_OPERATORS} > /dev/null; [ $? -eq 1 ] && echo "FAILED to find subscription ${sub}" || echo "found subscription ${sub} in the backup";done
After restoring watsonx Orchestrate, Kafka
controller pods in the knative-eventing project enter a
CrashLoopBackOff state
Applies to: 5.1.0
Applies to: Online backup and restore
Fixed in: 5.1.1
- Diagnosing the problem
- After restoring watsonx Orchestrate, the Kafka
controller pods in the knative-eventing project enter a
CrashLoopBackOffstate. - Cause of the problem
- The source of the problem is a known issue with restoring watsonx Assistant.
- Resolving the problem
- Do the same workaround for the watsonx Assistant known issue. For details, see watsonx Assistant stuck at System is Training after restore.
watsonx Assistant stuck at System is
Training after restore
Applies to: 5.1.0
Applies to: Backup and restore with the OADP utility or IBM Fusion
Fixed in: 5.1.1
- Diagnosing the problem
- After restoring watsonx Assistant, the Kafka
controller pods in the knative-eventing project might enter a
CrashLoopBackOffstate. As a result, watsonx Assistant is unable to complete the training process, and the service gets stuck at theSystem is Trainingstage. - Resolving the problem
- Do the following steps:
- Identify any Kafka pods in the
CrashLoopBackOffstate:oc get pod -n knative-eventing | grep -vE "Compl|1/1|2/2|3/3|4/4|5/5|6/6|7/7|8/8|9/9|10/10"Example output:NAME READY STATUS RESTARTS AGE kafka-controller-75fcdd9c4c-lk8dz 1/2 CrashLoopBackOff 245 (4m36s ago) 21h kafka-controller-75fcdd9c4c-xfddm 1/2 CrashLoopBackOff 245 (2m45s ago) 21h - Verify related resources in the
PROJECT_CPD_INST_OPERANDSproject.- Check the health of
triggers:
oc get triggersExample output:NAME BROKER SUBSCRIBER_URI AGE READY REASON wa-ke.assistant.clu-controller.v1.training-complete knative-wa-clu-broker 25h Unknown failed to reconcile consumer group wa-ke.assistant.clu-controller.v1.training-failed knative-wa-clu-broker 25h Unknown failed to reconcile consumer group wa-ke.assistant.clu-controller.v1.training-start knative-wa-clu-broker 25h Unknown failed to reconcile consumer group - Check the health of Consumer
Groups:
oc get consumergroupsExample output:NAME READY REASON SUBSCRIBER REPLICAS READY REPLICAS AGE knative-trigger-cpd-wa-ke.assistant.clu-controller.v1.training-complete 1 25h knative-trigger-cpd-wa-ke.assistant.clu-controller.v1.training-failed 1 25h knative-trigger-cpd-wa-ke.assistant.clu-controller.v1.training-start 1 25h - Check the health of
Consumers:
oc get consumersExample output:NAME READY REASON SUBSCRIBER AGE knative-trigger-cpd-wa-ke.assistant.clu-controller.v1.trai8ffsr 25h knative-trigger-cpd-wa-ke.assistant.clu-controller.v1.traikzfhh 25h knative-trigger-cpd-wa-ke.assistant.clu-controller.v1.traitfzrk 25h
- Delete the resources and let the operator create new resources by running the following
script:
#!/bin/bash # Define resource kinds resources=("trigger" "consumer" "consumergroup") # Loop through each resource kind for resource in "${resources[@]}"; do # Get the list of resources with 'assistant' in the name oc get "$resource" -o json | jq -r '.items[] | select(.metadata.name | contains("assistant")) | .metadata.name' | while read name; do # Patch the resource to remove finalizers echo "Removing finalizers from $resource: $name" oc patch "$resource" "$name" --type=merge -p '{"metadata":{"finalizers":[]}}' # Delete the resource echo "Deleting $resource: $name" oc delete "$resource" "$name" done done - Restart Kafka controller
pods:
oc rollout restart deployment/kafka-controller -n knative-eventing - Restart watsonx Assistant CLU training
pods:
oc delete pod -l app=wa-clu-training
After you complete these steps, the Kafka controller pods should recover, and watsonx Assistant will resume normal training operations.
Unable to create backup due to missing ConfigMaps
Applies to: 5.1.0
Applies to: Backup with OADP utility
Fixed in: 5.1.1
- Diagnosing the problem
- In the CPD-CLI*.log file, you see an error message like in the following
example:
Error: global registry check failed: 1 error occurred: * error from addOnId=zen-lite: 1 error occurred: * failed to find aux configmap 'ibm-cs-postgres-ckpt-cm' in tenant service namespace='${PROJECT_CPD_INST_OPERANDS}': : configmaps "ibm-cs-postgres-ckpt-cm" not found - Cause of the problem
- Backup and restore ConfigMaps can go missing after an upgrade, or they might have been deleted accidentally.
Confirm that the ConfigMaps are missing by running the following
commands:
oc get cm -n ${PROJECT_CPD_INST_OPERANDS} -l cpdfwk.aux-kind=checkpoint | grep zenoc get cm -n ${PROJECT_CPD_INST_OPERANDS} -l cpdfwk.aux-kind=checkpoint | grep cs-postgresThe output shows no ConfigMaps when you run these commands.
- Resolving the problem
- Trigger a reconcile by running the following
command:
oc patch -n ${PROJECT_CPD_INST_OPERANDS} zenservice lite-cr --type=merge --patch '{"spec": {"refresh_install": false}}'The reconcile will recreate the missing ConfigMaps.
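To confirm that the reconcile recreated the missing ConfigMaps, a hedged recheck that reuses the same label selector as the commands above:
# After the reconcile completes, the checkpoint ConfigMaps should be listed again.
oc get cm -n ${PROJECT_CPD_INST_OPERANDS} -l cpdfwk.aux-kind=checkpoint | grep -E "zen|cs-postgres"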
Backup is missing EDB Postgres PVCs
Applies to: 5.1.0
Applies to: Offline or online backup and restore with the OADP utility
Fixed in: 5.1.1
- Diagnosing the problem
- After an online or offline backup is taken with the OADP utility, EDB Postgres PVCs are missing in the PVC backup list.
- Cause of the problem
- EDB Postgres replica PVCs might be excluded from a backup when an EDB Postgres cluster switches primary instances.
- Resolving the problem
- Before you create a backup, run the following
command:
oc label pvc,pods -l k8s.enterprisedb.io/cluster,velero.io/exclude-from-backup=true velero.io/exclude-from-backup- -n ${PROJECT_CPD_INST_OPERANDS}For more information, see the following topics:
Unable to restore offline backup of RStudio® Server Runtimes
Applies to: 5.1.2 and later
Applies to: Offline backup and restore with the OADP utility
- Diagnosing the problem
- In the CPD-CLI*.log file, you see an error message like in the following
example:
1 error occurred: 07:55:55 * error performing op postRestoreViaConfigHookRule for resource rstudio-maint-br (configmap=cpd-rstudio-maint-aux-br-cm): 1 error occurred: 07:55:55 * : rstudioaddons.rstudio.cpd.ibm.com "rstudio-cr" not found - Resolving the problem
- Do the following steps:
- In the OpenShift
Console, edit the
cpd-rstudio-maint-aux-br-cm ConfigMap.Note: To ensure that the ConfigMap is correctly formatted, do not edit the ConfigMap in the vi or vim editor.
- Update the workflow:
workflows: - name: restore-pre-operators sequence: - group: rstudio-sa - group: rstudio-resources - name: restore-post-operators sequence: [] - name: restore-pre-operands sequence: [] - name: restore-operands sequence: - group: rstudio-clusterroles - group: rstudio-crs - name: restore-post-operands sequence: [] - name: restore-post-namespacescope sequence: [] - Retry the restore by running the following
commands:
export CPDBR_ENABLE_FEATURES=experimental cpd-cli oadp tenant-restore create ${TENANT_BACKUP_NAME}-restore \ --from-tenant-backup ${TENANT_BACKUP_NAME} \ --verbose \ --log-level=debug \ --disable-inverseops \ --start-from restore-operands \ --log-level=debug &> ${TENANT_BACKUP_NAME}-restore.log&
Offline post-restore hooks fail when restoring Informix
Applies to: 5.1.2 and later
Applies to: Offline backup and restore with the OADP utility
- Diagnosing the problem
- In the CPD-CLI*.log file, you see an error message like in the following
example:
1 error occurred: * error performing op postRestoreViaConfigHookRule for resource informix-1741694080250204 (configmap=informix-1741694080250204-aux-br-cm): : informixservices.ifx.ibm.com "informixservice-cr" not found - Cause of the problem
- The Zenservice custom resource is not fully reconciling.
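To confirm this cause before you apply the workaround, you can check the reconciliation status of the zenservice custom resource; this is a hedged sketch that assumes the default lite-cr name used elsewhere in this document.
# A value other than Completed indicates that the zenservice is still reconciling.
oc get zenservice lite-cr -n ${PROJECT_CPD_INST_OPERANDS} -o jsonpath='{.status.zenStatus}{"\n"}'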
- Resolving the problem
- Do the following steps:
- Rerun post-restore hooks by running the following
command:
export CPDBR_ENABLE_FEATURES=experimental cpd-cli oadp tenant-restore create ${TENANT_RESTORE_NAME}-cont \ --from-tenant-backup ${TENANT_BACKUP_NAME} \ --verbose \ --log-level debug \ --start-from restore-post-namespacescope - Re-install the Informix custom
resource:
cpd-cli manage apply-cr \ --components=informix_cp4d \ --release=${VERSION} \ --cpd_instance_ns=${PROJECT_CPD_INST_OPERANDS} \ --license_acceptance=true
Unable to log in to the IBM Software Hub user interface after offline restore
Applies to: 5.1.2 and later
Applies to: Offline backup and restore with the OADP utility
- Diagnosing the problem
- After you restore an offline backup of IBM Software Hub, you cannot log in to the user
interface. The following error message
appears:
CWOAU0061E: The OAuth service provider could not find the client because the client name is not valid. Contact your system administrator to resolve the problem. - Resolving the problem
- Delete the PostgreSQL replica
for the common-service-db cluster by running the following
command:
oc delete po,pvc -l k8s.enterprisedb.io/cluster=common-service-db,k8s.enterprisedb.io/instanceRole=replica
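After the delete command runs, the EDB operator recreates the replica. A hedged way to confirm that the replica pod is back and that the cluster reports healthy:
# The replica pod should return to a Running state.
oc get pods -n ${PROJECT_CPD_INST_OPERANDS} -l k8s.enterprisedb.io/cluster=common-service-db
# The cluster should report a healthy status once the replica rejoins.
oc get clusters.postgresql.k8s.enterprisedb.io common-service-db -n ${PROJECT_CPD_INST_OPERANDS}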
Offline restore to different cluster fails with nginx error
Applies to: 5.1.2
Applies to: Offline backup and restore to a different cluster with the OADP utility
Fixed in: 5.1.3
- Diagnosing the problem
- In the CPD-CLI*.log file, you see errors like in the following
example:
* error executing workflow actions: workflow action execution resulted in 1 error(s): - encountered an error during hook workflowAction.Do() - action=nginx-maint/disable, action-index=0, retry-attempt=0/0, err=: command terminated with exit code 1 - Resolving the problem
- This problem is intermittent. To resolve the problem, run the following
command:
export CPDBR_ENABLE_FEATURES=experimental cpd-cli oadp tenant-restore create ${TENANT_RESTORE_NAME}-cont \ --from-tenant-backup ${TENANT_BACKUP_NAME} \ --verbose \ --log-level debug \ --start-from restore-post-namespacescope
Post-restore hooks fail to run when restoring deployment that includes IBM Knowledge Catalog
Applies to: 5.1.2
Applies to: Online backup and restore with IBM Fusion, NetApp Trident protect, and Portworx asynchronous disaster recovery
Fixed in: 5.1.3
- Diagnosing the problem
- After the restore, the IBM Knowledge Catalog custom
resource remains stuck in the
InProgressstate.In the log file, you see errors like in the following examples:
1 error occurred: * error performing op postRestoreViaConfigHookRule for resource lite (configmap=cpd-lite-aux-br-cm): Timed out waiting for workloads to unquiesce: timed out waiting for the condition. Some workloads may still be in the process of scaling up, please use 'oc get deployment' to check workload statuses. - Resolving the problem
-
Reconcile the IBM Knowledge Catalog custom resource wkc-cr so that it reaches the
Completedstate by running the following command:oc delete job kg-resync-glossary -n ${PROJECT_CPD_INST_OPERANDS}Then wait for wkc-cr to reach a
Completedstate.
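A hedged way to watch for that state change is shown below; it assumes that the IBM Knowledge Catalog custom resource uses the wkc.wkc.cpd.ibm.com resource type, consistent with the knowledgegraph.wkc.cpd.ibm.com resource used elsewhere in this document.
# Wait until the STATUS column shows Completed.
oc get wkc.wkc.cpd.ibm.com wkc-cr -n ${PROJECT_CPD_INST_OPERANDS}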
IBM Knowledge Catalog custom resources are not restored
Applies to: 5.1.2
Applies to: Online and offline backup and restore with the OADP utility
Fixed in: 5.1.3
- Diagnosing the problem
- The following custom resources are not restored after IBM Knowledge Catalog Premium is successfully restored:
- When semantic automation is enabled, the ikcpremium.ikc.cpd.ibm.com custom resource is not restored.
- When semantic automation is not enabled, the ikcpremium.ikc.cpd.ibm.com and ikcstandard.ikc.cpd.ibm.com custom resources are not restored.
After IBM Knowledge Catalog Standard is successfully restored, the ikcstandard.ikc.cpd.ibm.com custom resource is not restored when semantic automation is not enabled.
- Resolving the problem
- Recreate the custom resources from the backup. Do the following steps:
- Identify the resource
backup:
resourceBackup=$(oc get backups.velero.io -n ${OADP_OPERATOR_PROJECT} -l "cpdbr.cpd.ibm.com/tenant-backup-name=${TENANT_BACKUP_NAME},cpdbr.cpd.ibm.com/backup-name=resource_backup" --no-headers | awk '{print $1}') - Download the
backup:
cpd-cli oadp tenant-backup download ${TENANT_BACKUP_NAME} unzip ${TENANT_BACKUP_NAME}-data.zip mkdir ${resourceBackup} tar -xf ${resourceBackup}-data.tar.gz -C ${resourceBackup} ikcStandardCRLoc=$(find ${resourceBackup}/resources -type f| grep ikcstandard | grep ${PROJECT_CPD_INST_OPERANDS} | grep -v preferred) ikcPremiumCRLoc=$(find ${resourceBackup}/resources -type f| grep ikcpremium | grep ${PROJECT_CPD_INST_OPERANDS} | grep -v preferred) - Recreate the missing custom resources and wait for the reconciliation to
complete:
cat ${ikcStandardCRLoc} | jq 'del(.metadata.ownerReferences) | del(.metadata.uid) | del(.metadata.creationTimestamp) | del(.status) | del(.metadata.managedFields) | del(.metadata.generation) | del(.metadata.resourceVersion)' | oc apply -f - cat ${ikcPremiumCRLoc} | jq 'del(.metadata.ownerReferences) | del(.metadata.uid) | del(.metadata.creationTimestamp) | del(.status) | del(.metadata.managedFields) | del(.metadata.generation) | del(.metadata.resourceVersion)' | oc apply -f - - Check that the custom resources were
created:
oc get ikcpremium.ikc.cpd.ibm.com -n ${PROJECT_CPD_INST_OPERANDS} oc get ikcstandard.ikc.cpd.ibm.com -n ${PROJECT_CPD_INST_OPERANDS} - If you restored an offline backup, remove the custom resources from maintenance
mode:
oc patch ikcpremium.ikc.cpd.ibm.com ikc-premium-cr -n ${PROJECT_CPD_INST_OPERANDS} --type merge --patch '{"spec": {"ignoreForMaintenance": false}}' oc patch ikcstandard.ikc.cpd.ibm.com ikc-standard-cr -n ${PROJECT_CPD_INST_OPERANDS} --type merge --patch '{"spec": {"ignoreForMaintenance": false}}'
Errors when restoring IBM Software Hub operators
Applies to: 5.1.2 and later
Applies to: Backup and restore with the OADP utility
- Diagnosing the problem
- In the CPD-CLI*log file, you see messages like in the following
example:
func=cpdbr-oadp/pkg/cli.(*LogAndPrintWriter).Write file=/a/workspace/oadp-upload/pkg/cli/scripts.go:125 time=<timestamp> level=info msg=Time: <timestamp> level=error - Postgres Cluster: common-service-db Timeout Error func=cpdbr-oadp/pkg/cli.(*LogAndPrintWriter).Write file=/a/workspace/oadp-upload/pkg/cli/scripts.go:125 time=<timestamp> level=info msg= func=cpdbr-oadp/pkg/cli.(*LogAndPrintWriter).Write file=/a/workspace/oadp-upload/pkg/cli/scripts.go:125 time=<timestamp> level=info msg=Exited with return code=1 func=cpdbr-oadp/pkg/cli.(*LogAndPrintWriter).Write file=/a/workspace/oadp-upload/pkg/cli/scripts.go:125 time=<timestamp> level=error msg=error executing /tmp/cpd-operators-3c18d4f8-0cb0-4b6b-b2a1-b7363e1f3736.sh restore --operators-namespace <namespace-name> --foundation-namespace <namespace-name>: exit status 1 func=cpdbr-oadp/pkg/cli.ExecCPDOperatorsScript file=/a/workspace/oadp-upload/pkg/cli/scripts.go:105 time=<timestamp> level=error msg=cpd-operators script execution failed (args=[restore --operators-namespace <namespace-name> --foundation-namespace <namespace-name>]): exit status 1 func=cpdbr-oadp/pkg/dpaplan.(*MasterPlan).ExecuteParentRecipe.(*MasterPlan).localExecOperatorsRestoreAction.func7 file=/a/workspace/oadp-upload/pkg/dpaplan/master_plan_local_exec_actions.go:679 - Cause of the problem
- A
BundleUnpackFailederror occurred. Inspect subscriptions by running the following command:oc get sub ibm-common-service-operator -o yamlExample output:
- message: 'bundle unpacking failed. Reason: DeadlineExceeded, and Message: Job was active longer than specified deadline' reason: BundleUnpackFailed status: "True" type: BundleUnpackFailed lastUpdated: "<timestamp>"This problem is a known issue with Red Hat. For details, see Operator installation or upgrade fails with DeadlineExceeded in RHOCP 4.
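To see whether other subscriptions in the operators project are affected, a hedged sketch that scans all subscriptions for the BundleUnpackFailed condition shown above:
# Print the name of every subscription that reports a BundleUnpackFailed condition.
oc get sub -n ${PROJECT_CPD_INST_OPERATORS} -o json | \
  jq -r '.items[] | select([.status.conditions[]? | select(.type=="BundleUnpackFailed" and .status=="True")] | length > 0) | .metadata.name'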
- Resolving the problem
- Do the following steps:
- Delete IBM Software Hub instance projects
(namespaces).
For details, see 3. Cleaning up the cluster before a restore.
- Retry the restore.
After a restore, OperandRequest timeout error in the ZenService custom resource
Applies to: 5.1.0 and later
Applies to: All backup and restore methods
- Diagnosing the problem
- Get the status of the ZenService
YAML:
oc get zenservice lite-cr -n ${PROJECT_CPD_INST_OPERATORS} -o yamlIn the output, you see the following error:
... zenMessage: '5.1.3/roles/0010-infra has failed with error: "OperandRequest" "zen-ca-operand-request": Timed out waiting on resource' ...Check for failing operandrequests:oc get operandrequests -AFor failing operandrequests, check their conditions forconstraints not satisfiablemessages:oc describe -n ${PROJECT_CPD_INST_OPERATORS} <opreq-name> - Cause of the problem
- Subscription wait operations timed out. The problematic subscriptions show an error similar to
the following
example:
'constraints not satisfiable: clusterserviceversion ibm-db2aaservice-cp4d-operator.v5.2.0 exists and is not referenced by a subscription, @existing/cpd-operators//ibm-db2aaservice-cp4d-operator.v5.2.0 and ibm-db2aaservice-cp4d-operator-catalog/cpd-operators/v5.2/ibm-db2aaservice-cp4d-operator.v5.2.0 originate from package ibm-db2aaservice-cp4d-operator, subscription ibm-db2aaservice-cp4d-operator requires ibm-db2aaservice-cp4d-operator-catalog/cpd-operators/v5.2/ibm-db2aaservice-cp4d-operator.v5.2.0, subscription ibm-db2aaservice-cp4d-operator exists'This problem is a known issue with Red Hat Operator Lifecycle Manager. For details, see Failed to install or upgrade operator with warning 'clusterserviceversion is not referenced by a subscription'.
- Resolving the problem
- Do the following steps:
- Delete the problematic clusterserviceversions and subscriptions, and restart the
Operand Deployment Lifecycle Manager (ODLM) pod.
For details, follow the steps in the troubleshooting document cloud-native-postgresql operator is installed with the certified-operators catalogsource.
- Delete IBM Software Hub instance projects
(namespaces).
For details, see 3. Cleaning up the cluster before a restore.
- Retry the restore.
After restoring the IBM Match 360 service, the onboard job fails
Fixed in: 5.1.2
Applies to: 5.1.1
Applies to: All backup and restore methods
- Diagnosing the problem
- After restoring a previously backed up IBM
Match 360 service instance, the
mdm-onboardjob can fail with aDuplicateTenantExceptionerror during model service tenant onboarding. - Cause of the problem
- This issue occurs when the operator tries to clean up resources and fails.
- Resolving the problem
- Before creating the backup, you can prevent this issue from occurring by labeling the IBM
Match 360 ConfigMap. Do the following steps:
- Get the ID of the IBM
Match 360 instance:
- From the IBM Software Hub home page, go to .
- Click the link for the IBM Match 360 instance.
- Copy the value after mdm- in the URL.
For example, if the end of the URL is
mdm-1234567891123456, the instance ID is1234567891123456.
- Create the following environment
variable:
export INSTANCE_ID=<instance-id> - Add the
mdmlabel by running the following command:oc label cm mdm-operator-${INSTANCE_ID} icpdsupport/addOnId=mdm -n ${PROJECT_CPD_INST_OPERANDS}
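If you prefer the command line to the web client, a hedged alternative to step 1 is to read the instance ID from the name of the mdm-operator-<instance-id> ConfigMap referenced above:
# The ConfigMap name ends with the IBM Match 360 instance ID.
oc get cm -n ${PROJECT_CPD_INST_OPERANDS} -o name | grep mdm-operator-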
After an online restore, some service custom resources cannot reconcile
Applies to: 5.1.0-5.1.2
Applies to: Online backup and restore with the OADP utility, IBM Fusion, NetApp Astra Control Center, and Portworx
Fixed in: 5.1.3
- Diagnosing the problem
- After an online restore completes, some service custom resources, Zenservice in particular, cannot reconcile. For example, the Zenservice custom resource remains stuck at 51% progress.
- Cause of the problem
- This problem occurs when you do an online backup and restore of a deployment that includes a
service, such as MANTA Automated Data Lineage or Data Gate, that does not support online backup and
restore.Tip: For more information about the backup and restore methods that each service supports, see Services that support backup and restore.
- Resolving the problem
- To resolve the problem, run one of the following commands:
- MANTA Automated Data Lineage
-
ZEN_METASTORE_PRIMARY_POD=$(oc get clusters.postgresql.k8s.enterprisedb.io -n ${PROJECT_CPD_INST_OPERANDS} zen-metastore-edb -o jsonpath='{.status.currentPrimary}') oc exec $ZEN_METASTORE_PRIMARY_POD -n ${PROJECT_CPD_INST_OPERANDS} -- psql -t -U postgres -c "update immutable_extensions set status = 'disabled' where extension_name like '%lineage%' and extension_point_id = 'zen_front_door';refresh materialized view extensions_view;" zen oc delete po -n ${PROJECT_CPD_INST_OPERANDS} -l 'app.kubernetes.io/component=ibm-nginx' oc delete po -n ${PROJECT_CPD_INST_OPERANDS} -l 'app.kubernetes.io/component=ibm-nginx-tester' - Data Gate
-
ZEN_METASTORE_PRIMARY_POD=$(oc get clusters.postgresql.k8s.enterprisedb.io -n ${PROJECT_CPD_INST_OPERANDS} zen-metastore-edb -o jsonpath='{.status.currentPrimary}') oc exec $ZEN_METASTORE_PRIMARY_POD -n ${PROJECT_CPD_INST_OPERANDS} -- psql -t -U postgres -c "update immutable_extensions set status = 'disabled' where extension_name like '%datagate%' and extension_point_id = 'zen_front_door';refresh materialized view extensions_view;" zen oc delete po -n ${PROJECT_CPD_INST_OPERANDS} -l 'app.kubernetes.io/component=ibm-nginx' oc delete po -n ${PROJECT_CPD_INST_OPERANDS} -l 'app.kubernetes.io/component=ibm-nginx-tester'
Restore fails for watsonx Orchestrate
Applies to: 5.1.0 and later
Applies to: All backup and restore methods
- Diagnosing the problem
- Restore fails after upgrade for the watsonx Orchestrate service.
- Cause of the problem
- An Out of Memory (OOM) error exists in a job that calls one of the pods.
- Resolving the problem
- Create and apply an RSI patch to increase the memory allocation by completing the following steps:
- Create a
directory:
mkdir cpd-cli-workspace/olm-utils-workspace/work/rsi - From the directory that you created, create a new file called
skill-seq.jsonthat contains the following information:[{"op":"replace","path":"/spec/containers/0/resources/limits/memory","value":"3Gi"}] - Run the following
commands:
podman stop olm-utils-play-v3 podman start olm-utils-play-v3 - Apply the patch:
cpd-cli manage create-rsi-patch \ --cpd_instance_ns=${PROJECT_CPD_INST_OPERATORS} \ --patch_name=skill-seq-resource-limit \ --patch_type=rsi_pod_spec \ --patch_spec=/tmp/work/rsi/skill-seq.json \ --spec_format=json \ --include_labels=wo.watsonx.ibm.com/component:wo-skill-sequencing \ --state=active - Delete the
job:
oc delete job wo-watson-orchestrate-bootstrap-job \ --namespace=${PROJECT_CPD_INST_OPERATORS}
IBM Software Hub user interface does not load after restoring
Applies to: 5.1.3
Applies to: All backup and restore methods
- Diagnosing the problem
- The IBM Software Hub user interface doesn't start after a restore. When you try to log in, you see a blank screen.
- Cause of the problem
- The following pods fail and must be restarted to load the IBM Software Hub user interface:
platform-auth-serviceplatform-identity-managementplatform-identity-provider
- Resolving the problem
- Run the following command to restart the
pods:
for po in $(oc get po -l icpdsupport/module=im -n ${PROJECT_CPD_INST_OPERANDS} --no-headers | awk '{print $1}' | grep -v oidc); do oc delete po -n ${PROJECT_CPD_INST_OPERANDS} ${po};done;
ibm-lh-lakehouse-validator pods repeatedly restart after restore
Applies to: 5.1.3
Applies to: All backup and restore methods
- Diagnosing the problem
- After restore, the
ibm-lh-lakehouse-validatorpod repeatedly restarts. - Cause of the problem
- This issue occurs because the liveness probe fails. The liveness probe has a 120-second delay.
Because of this delay, the health check fails before the application starts and the
ibm-lh-lakehouse-validatorpod repeatedly restarts. - Resolving the problem
- To resolve the problem, create a patch file that updates the initial delay of the liveness probe
to 300 seconds:
- Create the following JSON file. Save the file as
wxd-validator-patch.jsonin thecpd-cli-workspace/olm-utils-workspace/work/directory:[ { "op": "replace", "path": "/spec/containers/0/livenessProbe/initialDelaySeconds", "value": 300 } ] - Apply the patch:
cpd-cli manage create-rsi-patch \ --cpd_instance_ns=cpd-instance \ --patch_name=wxd-validator-patch-213 \ --spec_format=json \ --patch_type=rsi_pod_spec \ --patch_spec=/tmp/work/wxd-validator-patch.json \ --include_labels=icpdsupport/podSelector:ibm-lh-validator \ --state=active
Restore to same cluster fails with restore-cpd-volumes error
Applies to: 5.1.3
Applies to: All backup and restore methods
- Diagnosing the problem
- IBM Software Hub fails to restore,
and you see an error similar to the following
example:
error: DataProtectionPlan=cpd-offline-tenant/restore-service-orchestrated-parent-workflow, Action=restore-cpd-volumes (index=1) error: expected restore phase to be Completed, received PartiallyFailed - Cause of the problem
- The
ibm-streamsets-static-contentpod is mounted as read-only, which causes the restore to partially fail. - Resolving the problem
- Before you back up, exclude the IBM
StreamSets pod:
- Exclude the
ibm-streamsets-static-contentpod:oc label -l icpdsupport/addOnId=streamsets,app.kubernetes.io/name=ibm-streamsets-static-content velero.io/exclude-from-backup=true - Take another back up and follow the back up and restore procedure for your environment.
Restore fails with ErrImageNeverPull error for Analytics Engine powered by Apache Spark
Applies to: 5.1.3
Applies to: All backup and restore methods
- Diagnosing the problem
- IBM Software Hub fails to restore,
and you see an
ErrImageNeverPullerror similar to the following example:spark-history-deployment-0e7df4b9-f5ac-4e68-b7b7-518afd4048ggds 0/1 Init:ErrImageNeverPull 0 3h13m - Cause of the problem
- The Analytics Engine powered by Apache Spark history
server deployment is erroneously included in the backup. As a result, the history server deployment
is restored in the
ErrImageNeverPullstate. - Resolving the problem
- To resolve the issue, you must delete the Analytics Engine powered by Apache Spark history server deployment after you restore.
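A minimal sketch of that cleanup, assuming that the history server deployment name contains spark-history as in the example output above; verify the name before you delete it.
# Find the restored history server deployment.
oc get deploy -n ${PROJECT_CPD_INST_OPERANDS} | grep spark-history
# Delete it; substitute the deployment name returned by the previous command.
oc delete deploy <spark-history-deployment-name> -n ${PROJECT_CPD_INST_OPERANDS}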
After backing up or restoring IBM
Match 360, Redis fails to return to a
Completed state
Applies to: 5.1.3
Applies to: All backup and restore methods
- Diagnosing the problem
- After completing a backup or restoring the IBM
Match 360 service, its associated Redis
pod (
mdm-redis) and IBM Redis CP fail to return to aCompletedstate. - Cause of the problem
- This occurs when Redis CP fails to come out of maintenance mode.
- Resolving the problem
- To resolve this issue, manually update Redis CP to take it out of maintenance mode:
- Open the Redis CP YAML file for editing.
oc edit rediscp -n ${PROJECT_CPD_INST_OPERANDS} -o yaml - Update the value of the
ignoreForMaintenanceparameter fromtruetofalse.ignoreForMaintenance: false - Save the file and wait for the Redis pod to reconcile.
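If you prefer a non-interactive update instead of oc edit, a hedged sketch follows; it assumes that ignoreForMaintenance sits under spec, as it does for the IBM Knowledge Catalog patches earlier in this document, and you must substitute the RedisCP name that the first command returns.
# List the RedisCP resources for the instance.
oc get rediscp -n ${PROJECT_CPD_INST_OPERANDS}
# Take the resource out of maintenance mode; replace <rediscp-name> with the name from the previous command.
oc patch rediscp <rediscp-name> -n ${PROJECT_CPD_INST_OPERANDS} --type merge --patch '{"spec": {"ignoreForMaintenance": false}}'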
Backup fails with an ibm_neo4j error when IBM
Match 360 is scaled to the
x-small size
Applies to: 5.1.0
Applies to: All backup and restore methods
Fixed in: 5.1.1
- Diagnosing the problem
- When completing an online or offline backup of an IBM Software Hub instance that has the IBM
Match 360 service installed with the
x-smallscaling configuration, the backup will fail with the following error message:error getting inventory 'cm ibm-neo4j-inv-list-cm'To confirm that this issue is occurring:- Confirm that IBM
Match 360 is
configured with an
x-smallscaling configuration by running the following command:
A returned value ofoc get mdm -n ${PROJECT_CPD_INST_OPERANDS} -o yaml | grep scaleConfigscaleConfig: x-smallindicates thex-smallsize. - Confirm that Neo4j is configured to disable backups by running the following
command:
The following values indicate that Neo4j backups are disabled:oc get neo4j -n ${PROJECT_CPD_INST_OPERANDS} -o yamlbackupHookEnabled: false backupJobEnabled: false
- Cause of the problem
- This issue occurs when IBM
Match 360 is scaled to
x-smallbecause the Neo4J Custom Resource (neo4j) is configured to disable backups. The thex-smallsize is intended for non-production clusters, meaning that backups are not expected to be required. - Resolving the problem
- To resolve the problem, edit the IBM
Match 360 CR (
mdm-cr) to enable backups and increase the Neo4J cluster memory limits:- Edit the IBM
Match 360 CR
(
mdm-cr).oc edit mdm - Update the CR to use the following values in the
neo4jsection:neo4j: backupHookEnabled: true backupJobEnabled: true config: server.memory.heap.initial_size: 2g server.memory.heap.max_size: 2g podSpec: resources: limits: memory: 8Gi requests: memory: 4Gi - Wait for
mdm-crto reconcile itself and update the Neo4J cluster. Proceed to the next step when bothmdm-crandNeo4jClusterare in aCompletedstate. - Start the backup process.
Backup fails for the platform with error in EDB Postgres cluster
Applies to: 5.1.0 and later
Applies to: All backup and restore methods
- Diagnosing the problem
- For example, in IBM Fusion, the
backup fails at the Hook: br-service hooks/pre-backup stage in the backup
sequence.
In the cpdbr-oadp.log file, you see the following error:
time=<timestamp> level=info msg=cmd stderr: Error: cannot take a cold backup of the primary instance or a target primary instance if the k8s.enterprisedb.io/snapshotAllowColdBackupOnPrimary annotation is not set to enabled - Cause of the problem
- Labels and annotations in the EDB Postgres cluster resources were not updated after a switchover of the EDB Postgres cluster's primary instance and replica.
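To confirm this cause, you can check which pod currently carries the primary role after the switchover; a hedged sketch that uses the same EDB labels that appear in the commands in this section:
# List the EDB Postgres clusters and their reported primary instances.
oc get clusters.postgresql.k8s.enterprisedb.io -n ${PROJECT_CPD_INST_OPERANDS}
# Show the instanceRole label on each EDB pod; a mismatch with the reported primary indicates stale labels.
oc get pods -n ${PROJECT_CPD_INST_OPERANDS} -l k8s.enterprisedb.io/cluster -L k8s.enterprisedb.io/instanceRole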
- Resolving the problem
-
Automatic and manual workarounds are available. Do only one of the workarounds.
- Automatic workaround
-
The following workaround automatically runs before you create a backup. This workaround is useful if you set up automatic backups. Complete the steps that apply to your backup and restore method.
Online backup and restore
- Download the edb-patch-aux-ckpt-cm-legacy.yaml file.
- Run the following command:
oc apply -n ${PROJECT_CPD_INST_OPERATORS} -f edb-patch-aux-ckpt-cm-legacy.yaml
- Retry the backup.
Offline backup and restore
- Download the edb-patch-aux-br-cm-legacy.yaml file.
- Run the following command:
oc apply -n ${PROJECT_CPD_INST_OPERATORS} -f edb-patch-aux-br-cm-legacy.yaml
- Retry the backup.
- Manual workaround
-
If you want to manually run the workaround, complete the following steps:
- Download the edb-patch.sh file.
- Run the following
command:
sh edb-patch.sh ${PROJECT_CPD_INST_OPERATORS} - Retry the backup.
Db2 backup fails at the Hook: br-service hooks/pre-backup step
Applies to: 5.1.0 and later
Applies to: Backup and restore with IBM Fusion
- Diagnosing the problem
- In the cpdbr-oadp.log file, you see messages like in the following
example:
time=<timestamp> level=info msg=podName: c-db2oltp-5179995-db2u-0, podIdx: 0, container: db2u, actionIdx: 0, commandString: ksh -lc 'manage_snapshots --action suspend --retry 3', command: [sh -c ksh -lc 'manage_snapshots --action suspend --retry 3'], onError: Fail, singlePodOnly: false, timeout: 20m0s func=cpdbr-oadp/pkg/quiesce.executeCommand file=/go/src/cpdbr-oadp/pkg/quiesce/ruleexecutor.go:767 time=<timestamp> level=info msg=cmd stdout: func=cpdbr-oadp/pkg/quiesce.executeCommand file=/go/src/cpdbr-oadp/pkg/quiesce/ruleexecutor.go:823 time=<timestamp> level=info msg=cmd stderr: [<timestamp>] - INFO: Setting wolverine to disable Traceback (most recent call last): File "/usr/local/bin/snapshots", line 33, in <module> sys.exit(load_entry_point('db2u-containers==1.0.0.dev1', 'console_scripts', 'snapshots')()) File "/usr/local/lib/python3.9/site-packages/cli/snapshots.py", line 35, in main snap.suspend_writes(parsed_args.retry) File "/usr/local/lib/python3.9/site-packages/snapshots/snapshots.py", line 86, in suspend_writes self._wolverine.toggle_state(enable=False, message="Suspend writes") File "/usr/local/lib/python3.9/site-packages/utils/wolverine/wolverine.py", line 73, in toggle_state self._toggle_state(state, message) File "/usr/local/lib/python3.9/site-packages/utils/wolverine/wolverine.py", line 77, in _toggle_state self._cmdr.execute(f'wvcli system {state} -m "{message}"') File "/usr/local/lib/python3.9/site-packages/utils/command_runner/command.py", line 122, in execute raise CommandException(err) utils.command_runner.command.CommandException: Command failed to run:ERROR:root:HTTPSConnectionPool(host='localhost', port=9443): Read timed out. (read timeout=15) - Cause of the problem
- The Wolverine high availability monitoring process was in a
RECOVERINGstate before the backup was taken.Check the Wolverine status by running the following command:
wvcli system statusExample output:ERROR:root:REST server timeout: https://localhost:9443/status ERROR:root:Retrying Request: https://localhost:9443/status ERROR:root:REST server timeout: https://localhost:9443/status ERROR:root:Retrying Request: https://localhost:9443/status ERROR:root:REST server timeout: https://localhost:9443/status ERROR:root:Retrying Request: https://localhost:9443/status ERROR:root:REST server timeout: https://localhost:9443/status HA Management is RECOVERING at <timestamp>.The Wolverine log file /mnt/blumeta0/wolverine/logs/ha.log shows errors like in the following example:<timestamp> [ERROR] <MainProcess:11490> [wolverine.ha.loop(loop.py:414)] - check_and_recover: unhealthy_dm_set = {('c-db2oltp-5179995-db2u-0', 'node')} <timestamp> [ERROR] <MainProcess:11490> [wolverine.ha.loop(loop.py:416)] - (c-db2oltp-5179995-db2u-0, node) : not OK <timestamp> [ERROR] <MainProcess:11490> [wolverine.ha.loop(loop.py:421)] - check_and_recover: unhealthy_dm_names = {'node'} - Resolving the problem
- Do the following steps:
- Re-initialize
Wolverine:
wvcli system init --force - Wait until the Wolverine status is
RUNNING. Check the status by running the following command:wvcli system status - Retry the backup.
Backup fails at the Hook: br-service-hooks/checkpoint step
Applies to: 5.1.0
Applies to: Backup and restore with IBM Fusion
Fixed in: 5.1.1
- Diagnosing the problem
- In the IBM Fusion backup and
restore transaction manager logs, you see the following error:
time=<timestamp> level=info msg= ** PHASE [CONFIGMAP LOCK/CLEANUP/SUCCESS] ************************************* func=cpdbr-oadp/pkg/utils.LogMarker file=/go/src/cpdbr-oadp/pkg/utils/log.go:63 time=<timestamp> level=info msg=lock released func=cpdbr-oadp/cmd/cli/checkpoint.prepareContext.func1 file=/go/src/cpdbr-oadp/cmd/cli/checkpoint/create.go:279 time=<timestamp> level=error msg=error running processCreate(): error running checkpoint exec hooks: configmaps with the same component name and different module names are not allowed. cm1 name=data-lineage-neo4j-aux-ckpt-cm, cm1 component=data-lineage-neo4j, cm1 module=data-lineage-neo4j-aux, cm2 name=data-lineage-neo4j-aux-v2-ckpt-cm, cm2 component=data-lineage-neo4j, cm2 module=data-lineage-neo4j-aux-v2 func=cpdbr-oadp/cmd/cli/checkpoint.(*createCommandContext).execute file=/go/src/cpdbr-oadp/cmd/cli/checkpoint/create.go:730 Error: error running checkpoint exec hooks: configmaps with the same component name and different module names are not allowed. cm1 name=data-lineage-neo4j-aux-ckpt-cm, cm1 component=data-lineage-neo4j, cm1 module=data-lineage-neo4j-aux, cm2 name=data-lineage-neo4j-aux-v2-ckpt-cm, cm2 component=data-lineage-neo4j, cm2 module=data-lineage-neo4j-aux-v2 time=<timestamp> level=error msg=error running checkpoint exec hooks: configmaps with the same component name and different module names are not allowed. cm1 name=data-lineage-neo4j-aux-ckpt-cm, cm1 component=data-lineage-neo4j, cm1 module=data-lineage-neo4j-aux, cm2 name=data-lineage-neo4j-aux-v2-ckpt-cm, cm2 component=data-lineage-neo4j, cm2 module=data-lineage-neo4j-aux-v2 func=cpdbr-oadp/cmd.Execute file=/go/src/cpdbr-oadp/cmd/root.go:88 - Resolving the problem
- Ensure that the data-lineage-neo4j-aux-ckpt-cm and
data-lineage-neo4j-aux-v2-ckpt-cm ConfigMaps have the following matching
cpdfwk.componentandcpdfwk.modulelabels:"cpdfwk.module"is set to"data-lineage-neo4j-aux""cpdfwk.component"is set to"data-lineage-neo4j"
- To edit the data-lineage-neo4j-aux-ckpt-cm ConfigMap, run the following
command:
oc edit cm data-lineage-neo4j-aux-ckpt-cm - To edit the data-lineage-neo4j-aux-v2-ckpt-cm ConfigMap, run the following
command:
oc edit cm data-lineage-neo4j-aux-v2-ckpt-cm - Retry the backup.
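To confirm that the labels now match before the backup runs again, a simple hedged check:
# Run in the same project as the oc edit commands above.
# Both ConfigMaps should show cpdfwk.component=data-lineage-neo4j and cpdfwk.module=data-lineage-neo4j-aux.
oc get cm data-lineage-neo4j-aux-ckpt-cm data-lineage-neo4j-aux-v2-ckpt-cm --show-labels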
Restoring an online backup of IBM Software Hub on IBM Storage Scale Container Native storage fails
Applies to: IBM Fusion 2.8.2 and later
- Diagnosing the problem
- When you restore an online backup with IBM Fusion, the restore process fails at the Volume group: cpd-volumes step in the restore sequence.
- Resolving the problem
- This problem occurs when you have Persistent Volume Claims (PVCs) that are smaller than 5Gi. To
resolve the problem, expand any PVC that is smaller than 5Gi to at least 5Gi before you create the
backup. For details, see Volume Expansion in the IBM Storage Scale Container Storage Interface Driver documentation.Note: If your deployment includes Watson OpenScale, you cannot manually expand Watson OpenScale PVCs. To manage PVC sizes for Watson OpenScale, see Managing persistent volume sizes for Watson OpenScale.
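A hedged way to find PVCs that might need to be expanded before you create the backup (review the SIZE column for anything below 5Gi):
# List every PVC in the instance project with its requested size.
oc get pvc -n ${PROJECT_CPD_INST_OPERANDS} -o custom-columns=NAME:.metadata.name,SIZE:.spec.resources.requests.storage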
Restore fails at Hook: br-service-hooks/operators-restore step
Applies to: 5.1.2 and later
Applies to: Backup and restore with IBM Fusion
- Diagnosing the problem
- In the IBM Fusion backup and
restore transaction manager logs, you see errors like in the following
example:
[ERROR] <timestamp> [TM_0] KubeCalls:2093 - The execution of the application command "sh -c /cpdbr-scripts/cpdbr/cpdbr-cpd-operators.sh restore --operators-namespace watsonx-ops --foundation-namespace watsonx-ops; echo rc=$? > /tmp/fccd13fd-5a3f-4abc-afd9-df268670d099-ghysb.log" on pod cpdbr-tenant-service-86b7475cfc-nd98d in namespace watsonx-ops took longer than the specified timeout of 7200 seconds. [INFO] <timestamp> [TM_0] KubeCalls:2094 - stdout from the application command: [INFO] <timestamp> [TM_0] KubeCalls:2095 - stderr from the application command: cat: /tmp/fccd13fd-5a3f-4abc-afd9-df268670d099-ghysb.log: No such file or directory [INFO] <timestamp> [TM_0] KubeCalls:2096 - Error output from the application command: {'status': 'Failed', 'message': 'exec hook failed to write return code to a file cat: /tmp/fccd13fd-5a3f-4abc-afd9-df268670d099-ghysb.log: No such file or directory\n'} [ERROR] <timestamp> [TM_0] KubeCalls:2115 - Failed to delete file /tmp/fccd13fd-5a3f-4abc-afd9-df268670d099-ghysb.log in cpdbr-tenant-service-86b7475cfc-nd98d, 'rm: cannot remove '/tmp/fccd13fd-5a3f-4abc-afd9-df268670d099-ghysb.log': No such file or directory ' [ERROR] <timestamp> [TM_0] apphooks:148 - Timeout reached before command completed. However, the operation continues because of the on-error annotation value. [ERROR] <timestamp> [TM_0] exechook:353 - Running command '["/cpdbr-scripts/cpdbr/cpdbr-cpd-operators.sh", "restore", "--operators-namespace", "watsonx-ops", "--foundation-namespace", "watsonx-ops"]' on pod 'watsonx-opscpdbr-tenant-service-86b7475cfc-nd98d' failed with rc: 5 [ERROR] <timestamp> [TM_0] exechook:84 - Operation 'operators-restore' failed with exception: 'The execution of the application command "["/cpdbr-scripts/cpdbr/cpdbr-cpd-operators.sh", "restore", "--operators-namespace", "watsonx-ops", "--foundation-namespace", "watsonx-ops"]" on pod cpdbr-tenant-service-86b7475cfc-nd98din namespace {self.hook.namespace} took longer than the specified timeout of {self.timeout} seconds.' [ERROR] <timestamp> [TM_0] workflow:145 - cmdResults="{'br-service-hooks/operators-restore': {'watsonx-ops:cpdbr-tenant-service-86b7475cfc-nd98d': 5}}" [ERROR] <timestamp> [TM_0] workflow:206 - End execution sequence "hook/br-service-hooks/operators-restore" failed. [ERROR] <timestamp> [TM_0] recipe:589 - Execution of workflow "restore" of recipe "watsonx-ops:ibmcpd-tenant" completed with 1 failures, last failure was: "ExecHook/br-service-hooks/operators-restore"Hook run in watsonx-ops:cpdbr-tenant-service-86b7475cfc-nd98d ended with rc 5 indicating hook reached timeout prior to completion. [ERROR] <timestamp> [TM_0] restoreguardian:771 - Unexpected failure in hook: 'Execution of workflow restore of recipe ibmcpd-tenant completed. Number of failed commands: 1, last failed command: "ExecHook/br-service-hooks/operators-restore"Hook run in watsonx-ops:cpdbr-tenant-service-86b7475cfc-nd98d ended with rc 5 indicating hook reached timeout prior to completion.' - Cause of the problem
- The operators restore phase might take longer than expected due to slowness with Operator Lifecycle Manager, and fail with a timeout error.
- Resolving the problem
- Re-run the restore without cleaning up the cluster.
Restore fails at Hook: br-service-hooks/operators-restore step
Applies to: 5.1.2 and later
Applies to: Backup and restore with IBM Fusion
- Diagnosing the problem
- This problem occurs when you upgrade IBM Cloud Pak for Data 5.0.x to IBM Software Hub
5.1.2.
In the IBM Fusion backup and restore transaction manager logs, you see the following error:
Time: <timestamp> level=info - Postgres Cluster:: common-service-db - phase: null Time: <timestamp> level=error - Postgres Cluster: common-service-db Timeout Error End Time: <timestamp> -------------------------------------------------- Summary of level=warning/error messages: -------------------------------------------------- Time: <timestamp> level=error - Postgres Cluster: common-service-db Timeout Error Exited with return code=1 /cpdbr-scripts/cpdbr/cpdbr-cpd-operators.sh exit code=1 *** cpdbr-cpd-operators.sh failed *** - Cause of the problem
- The PostgreSQL operator is watching
namespaces from annotations instead of the namespace-scope ConfigMap. Run the
following
command:
oc get po -n ${PROJECT_CPD_INST_OPERATORS} $(oc get po -n ${PROJECT_CPD_INST_OPERATORS} | grep postgres | awk '{print $1}' | head -1) -o yaml | grep WATCH -A 10 -B 10Example output:- name: WATCH_NAMESPACE valueFrom: fieldRef: apiVersion: v1 fieldPath: metadata.annotations['olm.targetNamespaces']Check if the PostgreSQL clusters are not being reconciled by the PostgreSQL operator by running the following command:
oc get clusters.postgresql.k8s.enterprisedb.io -n ${PROJECT_CPD_INST_OPERANDS}Example output:NAME AGE INSTANCES READY STATUS PRIMARY common-service-db 33h wa-postgres 33h wa-postgres-16 33h zen-metastore-edb 33h - Resolving the problem
- Do the following steps:
- Log in to the source
cluster:
${OC_LOGIN} - Change to the project where IBM Software Hub is
installed:
oc project ${PROJECT_CPD_INST_OPERANDS} - Run the following patch
command:
oc patch csv $(oc get csv | grep postgres | awk '{print $1}') --type json -p '[ { "op": "replace", "path": "/spec/install/spec/deployments/0/spec/template/spec/containers/0/env/0", "value": { "name": "WATCH_NAMESPACE", "valueFrom": { "configMapKeyRef": { "key": "namespaces", "name": "namespace-scope", "optional": true } } } } ]' - Make the following update to the PostgreSQL
role:
for role in $(oc get role -n ${PROJECT_CPD_INST_OPERANDS} -l app.kubernetes.io/instance=namespace-scope,app.kubernetes.io/managed-by=ibm-namespace-scope-operator | grep cloud-native-postgresql | awk '{print $1}'); do saIdx=$(oc get role ${role} -n ${PROJECT_CPD_INST_OPERANDS} -o json | jq '.rules | to_entries | map(select(.value.resources | index("serviceaccounts") != null) | .key)[0]') eval $(echo "oc patch role ${role} -n ${PROJECT_CPD_INST_OPERANDS} --type json -p '[{\"op\": \"add\", \"path\": \"/rules/${saIdx}/verbs/-\", \"value\": \"delete\"}]'") done - Retry the restore.Note: You do not need to clean up the cluster before retrying the restore.
IBM Fusion reports successful
restore but many service custom resources are not in Completed state
Applies to: 5.1.1
Applies to: Backup and restore with IBM Fusion
Fixed in: 5.1.2
- Diagnosing the problem
- The status of service custom resources is less than 100%.
- Cause of the problem
- The ZenService custom resource is stuck. Run the following
command:
The output of the command showsoc get ZenService lite-cr -n ${PROJECT_CPD_INST_OPERANDS} -o yamlzenStatus: InProgress. - Resolving the problem
- Rerun restore posthooks and reset the namespacescope by running the following
commands:
oc rsh -n ${PROJECT_CPD_INST_OPERATORS} $(oc get po -n ${PROJECT_CPD_INST_OPERATORS} | grep cpdbr-tenant | awk '{print $1}') /cpdbr-scripts/cpdbr/checkpoint_restore_posthooks.sh --scale-wait-timeout=30m --tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} --include-namespaces=${PROJECT_CPD_INST_OPERATORS},${PROJECT_CPD_INST_OPERANDS} /cpdbr-scripts/cpdbr/cpdbr-cpd-operators.sh restore-namespacescope --operators-namespace ${PROJECT_CPD_INST_OPERATORS} --foundation-namespace ${PROJECT_CPD_INST_OPERATORS}
Watson OpenScale etcd server fails to start after restoring from a backup
Applies to: 5.1.0 and later
Applies to: Backup and restore with NetApp Astra Control Center
- Diagnosing the problem
- After restoring a backup with NetApp Astra Control Center, the Watson
OpenScale
etcd cluster is in a
Failedstate. - Resolving the problem
- To resolve the problem, do the following steps:
-
Log in to Red Hat OpenShift Container Platform as a cluster administrator.
${OC_LOGIN}Remember:OC_LOGINis an alias for theoc logincommand. - Expand the size of the etcd PersistentVolumes by 1Gi.
In the following example, the current PVC size is 10Gi, and the commands set the new PVC size to 11Gi.
operatorPod=`oc get pod -n ${PROJECT_CPD_INST_OPERATORS} -l name=ibm-cpd-wos-operator | awk 'NR>1 {print $1}'` oc exec ${operatorPod} -n ${PROJECT_CPD_INST_OPERATORS} -- roles/service/files/etcdresizing_for_resizablepv.sh -n ${PROJECT_CPD_INST_OPERANDS} -s 11Gi - Wait for the reconciliation status of the Watson
OpenScale custom resource to be in a
Completedstate:oc get WOService aiopenscale -n ${PROJECT_CPD_INST_OPERANDS} -o jsonpath='{.status.wosStatus} {"\n"}'The status of the custom resource changes to
Completedwhen the reconciliation finishes successfully.
Error when activating applications after a migration
Applies to: 5.1.0
Applies to: Portworx asynchronous disaster recovery
Fixed in: 5.1.1
- Diagnosing the problem
- In a deployment that includes Data Virtualization,
the following errors appear when you try to activate
applications:
* error performing op postRestoreViaConfigHookRule for resource dv (configmap=cpd-dv-aux-ckpt-cm): 1 error occurred: * error executing command (container=db2u podIdx=0 podName=c-db2u-dv-db2u-0 namespace=${PROJECT_CPD_INST_OPERANDS} auxMetaName=dv-aux component=dv actionIdx=0): command timed out after 40m0s: timed out waiting for the condition - Cause of the problem
- Restore hooks failed.
- Resolving the problem
- Do the following steps:
- Manually recreate /tmp/.ready_to_connectToDb in the Data Virtualization head
pod:
oc exec -n ${PROJECT_CPD_INST_OPERANDS} -t pod/c-db2u-dv-db2u-0 -- touch /tmp/.ready_to_connectToDb - Run restore
posthooks:
oc exec -n ${PROJECT_CPD_INST_OPERATORS} -t $(oc get po -n ${PROJECT_CPD_INST_OPERATORS} -l component=cpdbr-tenant -o NAME) -- /cpdbr-scripts/cpdbr-oadp restore posthooks --include-namespaces ${PROJECT_CPD_INST_OPERATORS},${PROJECT_CPD_INST_OPERANDS} --hook-kind checkpoint --log-level debug --verbose - After running restore posthooks is successfully completed, reset the
namespacescope:
oc exec -n ${PROJECT_CPD_INST_OPERATORS} -t $(oc get po -n ${PROJECT_CPD_INST_OPERATORS} -l component=cpdbr-tenant -o NAME) -- /cpdbr-scripts/cpdops/files/cpd-operators.sh restore-namespacescope --foundation-namespace ${PROJECT_CPD_INST_OPERATORS} --operators-namespace ${PROJECT_CPD_INST_OPERATORS}
Restore is taking a long time to complete
Applies to: 5.1.2
Applies to: Online backup and restore with NetApp Trident protect
Fixed in: 5.1.3
- Diagnosing the problem
- Restores are taking longer than expected to complete.
- Cause of the problem
- Processing is slow for large KopiaVolumeRestores, such as activelogs-c-db2wh-<xxxxxxxxxxxxxxxxx>-db2u-0.
- Resolving the problem
- No workaround. As long as the restore is progressing, let the KopiaVolumeRestores
finish.Best practice: Ensure that the restore location is in the same region as the restore environment.
IBM Software Hub resources are not migrated
Applies to: 5.1.0 and later
Applies to: Portworx asynchronous disaster recovery
- Diagnosing the problem
- When you use Portworx
asynchronous disaster recovery, the migration finishes almost immediately, and neither the volumes nor the expected number of resources are migrated. Run the following
command:
storkctl get migrations -n ${PX_ADMIN_NS}Tip:${PX_ADMIN_NS}is usually kube-system.Example output:NAME CLUSTERPAIR STAGE STATUS VOLUMES RESOURCES CREATED ELAPSED TOTAL BYTES TRANSFERRED cpd-tenant-migrationschedule-interval-<timestamp> mig-clusterpair Final Successful 0/0 0/0 <timestamp> Volumes (0s) Resources (3s) 0 - Cause of the problem
- This problem occurs starting with stork 23.11.0. Backup exec rules are not run, and expected IBM Software Hub resources are not migrated.
- Resolving the problem
- To resolve the problem, downgrade stork to a version prior to 23.11.0. For
more information about stork releases, see the stork Releases page.
- Scale down the Portworx operator
so that it doesn't reset manual changes to the stork
deployment:
oc scale -n ${PX_ADMIN_NS} deploy portworx-operator --replicas=0 - Edit the stork deployment image version to a version prior to
23.11.0:
oc edit deploy -n ${PX_ADMIN_NS} stork - If you need to scale up the Portworx operator, run the following command.Note: The Portworx operator will undo changes to the stork deployment and return to the original stork version.
oc scale -n ${PX_ADMIN_NS} deploy portworx-operator --replicas=1
Offline backup prehooks fail for lite-maint resource
Applies to: 5.1.0
Applies to: Offline backup and restore with the OADP utility
Fixed in: 5.1.1
- Diagnosing the problem
- In the CPD-CLI*.log file, you see error messages like in the following
example:
Error: 1 error occurred: * failed to execute masterplan: unexpected error from 'v1-orchestration' DataProtectionPlan undo action, exiting execution: error executing workflow actions: workflow action execution resulted in 1 error(s): - encountered an error during local-exec workflowAction.Undo() - action=pre-backup-hooks, action-index=0, retry-attempt=0/0, err=offline post-backup hooks execution failed: error running post-backup hooks: Error running post-processing rules. Check the /root/bar/cpd-cli-workspace/logs/CPD-CLI-2024-11-14.log for errors. 1 error occurred: * error performing op postBackupViaConfigHookRule for resource lite-maint (configmap=cpd-lite-aux-br-maint-cm): 1 error occurred: * error executing command (container=ibm-nginx-container podIdx=1 podName=ibm-nginx-5845cf4bcc-qmbl9 namespace=wkc auxMetaName=lite-maint-aux component=lite-maint actionIdx=0): command terminated with exit code 1 - Cause of the problem
- This problem is intermittent.
- Resolving the problem
- Do the following steps:
- Restart
nginx.
oc rollout restart deploy/ibm-nginx -n ${PROJECT_CPD_INST_OPERANDS} - Retry the backup.
Offline backup prehooks fail when deployment includes IBM Knowledge Catalog
Applies to: 5.1.0
Applies to: Offline backup and restore with the OADP utility
Fixed in: 5.1.1
- Diagnosing the problem
- When you try to create a backup, you see error messages in the CPD-CLI*.log
file like in the following
examples:
time=<timestamp> level=error msg=error running processCreate(): failed to execute masterplan: 2 errors occurred: * error from dpp.Execute() [traceId=97cd14a9-9bc3-494f-ba12-7e1c185fef10, dpp=v1-orchestration, operationKind=backup]: error executing workflow actions: workflow action execution resulted in 1 error(s): - encountered an error during local-exec workflowAction.Do() - action=pre-backup-hooks, action-index=2, retry-attempt=0/0, err=offline pre-backup hooks execution failed: error running pre-backup hooks: pod/wdp-profiling-cloud-native-postgresql-1 is not supported for scaling, please define the proper prebackup hooks * DataProtectionPlan=v1-orchestration, Action=pre-backup-hooks (index=2) error: offline pre-backup hooks execution failed: error running pre-backup hooks: pod/wdp-profiling-cloud-native-postgresql-1 is not supported for scaling, please define the proper prebackup hooks func=cpdbr-oadp/cmd/cli/tenantbackup.(*CreateCommandContext).execute file=/a/workspace/oadp-upload/cmd/cli/tenantbackup/create.go:1412 time=<timestamp> level=error msg=failed to execute masterplan: 2 errors occurred: * error from dpp.Execute() [traceId=97cd14a9-9bc3-494f-ba12-7e1c185fef10, dpp=v1-orchestration, operationKind=backup]: error executing workflow actions: workflow action execution resulted in 1 error(s): - encountered an error during local-exec workflowAction.Do() - action=pre-backup-hooks, action-index=2, retry-attempt=0/0, err=offline pre-backup hooks execution failed: error running pre-backup hooks: pod/wdp-profiling-cloud-native-postgresql-1 is not supported for scaling, please define the proper prebackup hooks * DataProtectionPlan=v1-orchestration, Action=pre-backup-hooks (index=2) error: offline pre-backup hooks execution failed: error running pre-backup hooks: pod/wdp-profiling-cloud-native-postgresql-1 is not supported for scaling, please define the proper prebackup hooks - Cause of the problem
- The wkc-foundationdb-cluster-aux-br-cm ConfigMap is missing the
cpdfwk.spec-version=2.0.0label.To confirm that the label is missing, run the following command:cat cpd-cli-workspace/logs/CPD-CLI-$(date '+%Y-%m-%d').log | grep "is filtered out, because there is newer spec-version"Example output when the label is missing:time=<timestamp> level=warning msg=configmap name=cpd-ikc-enrichment-aux-br-cm in namespace=wkc does not have icpdsupport/addOnId label, skipping filter func=cpdbr-oadp/pkg/registry.FilterOutNewerSpecVersionCms file=/a/workspace/oadp-upload/pkg/registry/registry.go:1016 time=<timestamp> level=debug msg=configmap name=cpd-ikc-finley-aux-br-cm namespace=wkc is filtered out, because there is newer spec-version, cm spec-version=2.0.0 wanted spec-version=1.0.0 func=cpdbr-oadp/pkg/registry.FilterOutNewerSpecVersionCms file=/a/workspace/oadp-upload/pkg/registry/registry.go:1031 time=<timestamp> level=debug msg=configmap name=cpd-ikc-glossary-aux-br-cm namespace=wkc is filtered out, because there is newer spec-version, cm spec-version=2.0.0 wanted spec-version=1.0.0 func=cpdbr-oadp/pkg/registry.FilterOutNewerSpecVersionCms file=/a/workspace/oadp-upload/pkg/registry/registry.go:1031 time=<timestamp> level=debug msg=configmap name=cpd-ikc-ikc-aux-br-cm namespace=wkc is filtered out, because there is newer spec-version, cm spec-version=2.0.0 wanted spec-version=1.0.0 func=cpdbr-oadp/pkg/registry.FilterOutNewerSpecVersionCms file=/a/workspace/oadp-upload/pkg/registry/registry.go:1031 time=<timestamp> level=debug msg=configmap name=cpd-ikc-knowledgegraph-aux-br-cm namespace=wkc is filtered out, because there is newer spec-version, cm spec-version=2.0.0 wanted spec-version=1.0.0 func=cpdbr-oadp/pkg/registry.FilterOutNewerSpecVersionCms file=/a/workspace/oadp-upload/pkg/registry/registry.go:1031 time=<timestamp> level=debug msg=configmap name=cpd-ikc-policy-aux-br-cm namespace=wkc is filtered out, because there is newer spec-version, cm spec-version=2.0.0 wanted spec-version=1.0.0 func=cpdbr-oadp/pkg/registry.FilterOutNewerSpecVersionCms file=/a/workspace/oadp-upload/pkg/registry/registry.go:1031 time=<timestamp> level=debug msg=configmap name=cpd-ikc-profiling-aux-br-cm namespace=wkc is filtered out, because there is newer spec-version, cm spec-version=2.0.0 wanted spec-version=1.0.0 func=cpdbr-oadp/pkg/registry.FilterOutNewerSpecVersionCms file=/a/workspace/oadp-upload/pkg/registry/registry.go:1031 time=<timestamp> level=debug msg=configmap name=cpd-ikc-wkcgovui-aux-br-cm namespace=wkc is filtered out, because there is newer spec-version, cm spec-version=2.0.0 wanted spec-version=1.0.0 func=cpdbr-oadp/pkg/registry.FilterOutNewerSpecVersionCms file=/a/workspace/oadp-upload/pkg/registry/registry.go:1031 time=<timestamp> level=debug msg=configmap name=cpd-ikc-workflow-aux-br-cm namespace=wkc is filtered out, because there is newer spec-version, cm spec-version=2.0.0 wanted spec-version=1.0.0 func=cpdbr-oadp/pkg/registry.FilterOutNewerSpecVersionCms file=/a/workspace/oadp-upload/pkg/registry/registry.go:1031 - Resolving the problem
- Add the missing label to the ConfigMap by running the following command:
oc label cm wkc-foundationdb-cluster-aux-br-cm cpdfwk.spec-version=2.0.0
After the label is added, test that the backup prehooks run successfully by running the following command:
cpd-cli oadp backup prehooks \ --hook-kind=br \ --tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} \ --dry-run \ --verbose \ --log-level=debug
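To confirm that the label is now present, you can list the labels on the ConfigMap (run the command in the project where the ConfigMap resides; the project name is not specified in the original steps):
oc get cm wkc-foundationdb-cluster-aux-br-cm --show-labels -n <configmap-project>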
Offline restore fails at post-restore hooks step
Applies to: 5.1.0
Applies to: Offline backup and restore with the OADP utility
Fixed in: 5.1.1
- Diagnosing the problem
- In the CPD-CLI*.log file, you see error messages like in the following
example:
Hook execution breakdown by status=error/timedout: The following hooks either have errors or timed out post-restore (1): COMPONENT CONFIGMAP METHOD STATUS DURATION ADDONID lite-maint cpd-lite-aux-br-maint-cm rule error 30.607451346s zen-lite -------------------------------------------------------------------------------- Error: failed to execute masterplan: 1 error occurred: * DataProtectionPlan=v1-orchestration, Action=post-restore-hooks (index=8) error: offline post-restore hooks execution failed: error running post-restore hooks: Error running post-processing rules. Check the /root/CPD-QA-BAR/cpd-cli-workspace/logs/CPD-CLI-2024-11-19.log for errors. 1 error occurred: * error performing op postRestoreViaConfigHookRule for resource lite-maint (configmap=cpd-lite-aux-br-maint-cm): 1 error occurred: * error executing command (container=ibm-nginx-container podIdx=0 podName=ibm-nginx-774fd7445f-2g5hw namespace=latest-510 auxMetaName=lite-maint-aux component=lite-maint actionIdx=0): command terminated with exit code 1 - Cause of the problem
- This problem is intermittent. After a restore is completed, disabling nginx maintenance mode occasionally fails because the nginx configuration file is not found. As a result, the restore appears to have failed even though it completed.
- Resolving the problem
- Do the following steps:
- Restart
nginx.
oc rollout restart deploy/ibm-nginx -n ${PROJECT_CPD_INST_OPERANDS} - Rerun post-restore
hooks:
cpd-cli oadp restore posthooks \ --tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} \ --log-level=debug \ --verbose
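To confirm that the ibm-nginx deployment finished rolling out after the restart, you can run a standard rollout status check before you rerun the post-restore hooks (this check is not part of the documented steps):
oc rollout status deploy/ibm-nginx -n ${PROJECT_CPD_INST_OPERANDS}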
Prompt tuning fails after restoring watsonx.ai
Applies to: 5.1.1
Applies to: Backup and restore with the OADP utility
- Diagnosing the problem
- When you try to create a prompt tuning experiment, you see the following error
message:
An error occurred while processing prompt tune training. - Resolving the problem
- Do the following steps:
- Restart the caikit
operator:
oc rollout restart deployment caikit-runtime-stack-operator -n ${PROJECT_CPD_INST_OPERATORS}Wait at least 2 minutes for the cais fmaas custom resource to become healthy.
- Check the status of the cais fmaas custom resource by running the following
command:
oc get cais fmaas -n ${PROJECT_CPD_INST_OPERANDS} - Retry the prompt tuning experiment.
Post-restore hook error when restoring offline backup of Db2
Applies to: 5.1.1
Applies to: Offline backup and restore with the OADP utility
- Diagnosing the problem
- This problem occurs intermittently. In the CPD-CLI*log file, you see errors
like in the following example:
Error: failed to execute masterplan: 2 errors occurred: * error executing workflow actions: workflow action execution resulted in 1 error(s): - encountered an error during local-exec workflowAction.Do() - action=cpd-post-restore-hooks, action-index=26, retry-attempt=0/0, err=offline post-restore hooks execution failed: error running post-restore hooks: Error running post-processing rules. Check the /root/cpd_cli_linux/cpd-cli-workspace/logs/CPD-CLI-<date_timestamp>.log for errors. 1 error occurred: * error performing op postRestoreViaConfigHookRule for resource db2u (configmap=db2u-aux-br-cm): 2 errors occurred: * error executing command ksh -lc 'if [[ -f /usr/bin/manage_snapshots ]]; then manage_snapshots --action restore --retry 3; else write-restore; fi' (container=db2u podIdx=0 podName=c-db2oltp-1738299739760012-db2u-0 namespace=zen-ns auxMetaName=db2u-aux component=db2u actionIdx=0): command terminated with exit code 255 - Resolving the problem
- Run the following
command:
export CPDBR_ENABLE_FEATURES=experimental
cpd-cli oadp tenant-restore create ${TENANT_RESTORE_NAME}-cont \ --from-tenant-backup ${TENANT_BACKUP_NAME} \ --verbose \ --log-level debug \ --start-from cpd-post-restore-hooks
Restoring Data Virtualization fails with metastore not running or failed to connect to database error
Applies to: 5.1.2 and later
Applies to: Online backup and restore with the OADP utility
- Diagnosing the problem
- View the status of the restore by running the following
command:
cpd-cli oadp tenant-restore status ${TENANT_BACKUP_NAME}-restore --detailsThe output shows errors like in the following examples:time=<timestamp> level=INFO msg=Verifying if Metastore is listening SERVICE HOSTNAME NODE PID STATUS Standalone Metastore c-db2u-dv-hurricane-dv - - Not runningtime=<timestamp> level=ERROR msg=Failed to connect to BigSQL database * error performing op postRestoreViaConfigHookRule for resource dv (configmap=cpd-dv-aux-ckpt-cm): 1 error occurred: * error executing command su - db2inst1 -c '/db2u/scripts/bigsql-exec.sh /usr/ibmpacks/current/bigsql/bigsql/bigsql-cli/BIGSQL/package/scripts/bigsql-db2ubar-hook.sh -H POST -M RESTORE -L' (container=db2u podIdx=0 podName=c-db2u-dv-db2u-0 namespace=<namespace-name> auxMetaName=dv-aux component=dv actionIdx=0): command terminated with exit code 1 - Cause of the problem
- A timing issue causes the restore posthooks to fail at the step where they check the result of the db2 connect to bigsql command. The db2 connect to bigsql command fails because Big SQL is restarting at about the same time.
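To confirm whether Big SQL is accepting connections again, you can optionally run the same connectivity check manually from the Db2U pod (the pod name is taken from the example output above and the instance operands project is assumed; this check is not part of the documented procedure):
oc exec -it c-db2u-dv-db2u-0 -n ${PROJECT_CPD_INST_OPERANDS} -- su - db2inst1 -c "db2 connect to bigsql"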
- Run the following
command:
export CPDBR_ENABLE_FEATURES=experimental
cpd-cli oadp tenant-restore create ${TENANT_RESTORE_NAME}-cont \ --from-tenant-backup ${TENANT_BACKUP_NAME} \ --verbose \ --log-level debug \ --start-from cpd-post-restore-hooks
After online restore, watsonx Code Assistant for Z is not running properly
Applies to: 5.1.2
Applies to: Online backup and restore with the OADP utility
Fixed in: 5.1.3
- Diagnosing the problem
- Errors appear in some tabs in the watsonx Code Assistant™ for Z user interface.
- Cause of the problem
- PVC content transfer time exceeded the maximum time allowed during the restore process. As a
result, the wcaz-cr custom resource is in an
Failedstate, and the wca-codegen-c2j pod is not ready.To check the status of the wcaz-cr custom resource, run the following command:oc get wcaz -n ${PROJECT_CPD_INST_OPERANDS}Example output:NAME VERSION RECONCILED STATUS AGE wcaz-cr 5.1.2 Failed 3h5mTo check the status of the wca-codegen-c2j pod, run the following command:oc get po -n ${PROJECT_CPD_INST_OPERANDS} | grep wcaExample output:wca-codegen-76d98b5cbf-qg5kt 1/1 Running 0 125m wca-codegen-c2j-7954f9946f-h4p88 0/1 Running 8 (2m44s ago) 136m wca-codematch-85c8b67766-fpbq2 1/1 Running 0 125m wca-codematch-init-9fqbs 0/1 Completed 0 125m - Resolving the problem
- If wca-codegen-c2j pod is not in a ready state, delete the
internal-tls-pkcs12
secret:
oc delete secret internal-tls-pkcs12 -n ${PROJECT_CPD_INST_OPERANDS}After approximately 2-4 minutes, the Common core services operator reconciles and the secret is recreated. The wca-codegen-c2j pod restarts automatically and the watsonx Code Assistant for Z service will work properly.
If the wcaz-cr custom resource has not completed reconciliation, delete the ibm-wca-z-operator-<xxx> pod:
oc delete pod ibm-wca-z-operator-<xxx> -n ${PROJECT_CPD_INST_OPERATORS}
The pod is recreated and reconciliation begins. When reconciliation is complete, the service works properly.
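To verify that the secret was recreated and that the pod is ready before you continue, you can run checks like the following (the grep pattern reuses the pod name shown in the example output above):
oc get secret internal-tls-pkcs12 -n ${PROJECT_CPD_INST_OPERANDS}
oc get po -n ${PROJECT_CPD_INST_OPERANDS} | grep wca-codegen-c2j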
Online post-restore hooks fail to run with timed out waiting for condition
error when restoring Analytics Engine powered by Apache Spark
Applies to: 5.1.2
Applies to: Online backup and restore with the OADP utility
- Diagnosing the problem
- In the CPD-CLI*.log file, you see error messages like in the following
examples:
Failed with 1 error(s): error: DataProtectionPlan=cpd-tenant/restore-service-orchestrated-parent-workflow, Action=cpd-post-restore-hooks (index=28) online post-restore hooks execution failed: timed out waiting for the conditiontime=<timestamp> level=debug msg=waiting for replicas of spark-hb-create-trust-store deployment (0/1) in namespace <cpd-tenant-namespace> func=cpdbr-oadp/pkg/kube.waitForDeployment.func1 file=/a/workspace/oadp-upload/pkg/kube/deployment.go:116 time=<timestamp> level=debug msg=waiting for replicas of spark-hb-create-trust-store deployment (0/1) in namespace <cpd-tenant-namespace> func=cpdbr-oadp/pkg/kube.waitForDeployment.func1 file=/a/workspace/oadp-upload/pkg/kube/deployment.go:116 time=<timestamp> level=error msg=failed to wait for deployment spark-hb-create-trust-store in namespace <cpd-tenant-namespace>: timed out waiting for the condition func=cpdbr-oadp/pkg/kube.waitForDeploymentPods file=/a/workspace/oadp-upload/pkg/kube/deployment.go:198 - Cause of the problem
- Analytics Engine powered by Apache Spark resources in tethered projects were not restored.
- Resolving the problem
- Do the following steps:
- Download the online-tenant-restore-tethered-namespaces-workaround.sh script onto your workstation.
- Make the script
executable:
chmod 755 online-tenant-restore-tethered-namespaces-workaround.sh - Run the
script:
./online-tenant-restore-tethered-namespaces-workaround.sh ${TENANT_BACKUP_NAME} ${PROJECT_CPD_INST_OPERATORS} ${PROJECT_CPD_INST_OPERANDS} ${OADP_OPERATOR_PROJECT} - Check that no pods are in a
Pendingstate in the tethered projects:oc get po -n ${TETHERED_NAMESPACE} - Resume the restore
process:
export CPDBR_ENABLE_FEATURES=experimental
cpd-cli oadp tenant-restore create ${TENANT_RESTORE_NAME}-cont \ --from-tenant-backup ${TENANT_BACKUP_NAME} \ --verbose \ --log-level debug \ --start-from cpd-post-restore-hooks
Restore fails with condition not met error
Applies to: 5.1.2 and later
Applies to: Backup and restore with the OADP utility
- Diagnosing the problem
- The restore is actually successful. But in the CPD-CLI*.log file, you see
error messages like in the following
example:
Error: failed to execute masterplan: 2 errors occurred: * error executing workflow actions: workflow action execution resulted in 1 error(s): - encountered an error during hook workflowAction.Do() - action=ibmcpd-check/controlPlaneCompletedStatus, action-index=0, retry-attempt=0/0, err=1 error occurred: * condition not met (condition={$.status.controlPlaneStatus} == {"Completed"}, namespace=cpd-instance, gvr=cpd.ibm.com/v1, Resource=ibmcpds, name=ibmcpd-cr) * DataProtectionPlan=configmap=cpd-zen-aux-v2-ckpt-cm, Action=ibmcpd-check/controlPlaneCompletedStatus (index=0) error: 1 error occurred: * condition not met (condition={$.status.controlPlaneStatus} == {"Completed"}, namespace=cpd-instance, gvr=cpd.ibm.com/v1, Resource=ibmcpds, name=ibmcpd-cr) - Resolving the problem
- Do the following steps:
- Check that the zenservices custom resource is in a
Completedstate:oc get zenservices lite-cr -n ${PROJECT_CPD_INST_OPERANDS} -o jsonpath='{.status}' - Check that the ibmcpd custom resource is in a
Completedstate:oc get ibmcpd ibmcpd-cr -n ${PROJECT_CPD_INST_OPERANDS} -o jsonpath='{.status}' - Run one of the following commands.Note: The zenservices and ibmcpd custom resources must be in a
Completedstate before you run one of these commands.- Offline restore
-
export CPDBR_ENABLE_FEATURES=experimental cpd-cli oadp tenant-restore create ${TENANT_RESTORE_NAME}-cont \ --from-tenant-backup ${TENANT_BACKUP_NAME} \ --verbose \ --log-level debug \ --start-from restore-operands - Online restore
-
export CPDBR_ENABLE_FEATURES=experimental cpd-cli oadp tenant-restore create ${TENANT_RESTORE_NAME}-cont \ --from-tenant-backup ${TENANT_BACKUP_NAME} \ --verbose \ --log-level debug \ --start-from restore-post-namespacescope
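If you prefer to check only the specific status fields rather than the full status objects, the following variations might be convenient (the controlPlaneStatus field name comes from the error message and the zenStatus field is used elsewhere in this document; adjust the field names if your custom resources differ):
oc get ibmcpd ibmcpd-cr -n ${PROJECT_CPD_INST_OPERANDS} -o jsonpath='{.status.controlPlaneStatus}'
oc get zenservices lite-cr -n ${PROJECT_CPD_INST_OPERANDS} -o jsonpath='{.status.zenStatus}'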
Offline restore fails with getting persistent volume claim error message
Applies to: 5.1.1
Applies to: Offline backup and restore with the OADP utility
Fixed in: 5.1.2
- Diagnosing the problem
- In the CPD-CLI*.log file, you see an error message like the following
example:
Error: failed to execute restore sequence: failed to execute 'restore-cpd-volumes' step: failed to execute masterplan: 1 error occurred: * DataProtectionPlan=configmap=ibmcpd-tenant-parent-br-cm, Action=restore-cpd-volumes (index=0) error: error: expected restore phase to be Completed, received PartiallyFailedIn the velero.log file, you see error messages like in the following example:
time="<timestamp>" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="error getting persistent volume claim for volume: persistentvolumeclaims \"data-wo-docproc-etcd-0\" not found" logSource="/remote-source/velero/app/pkg/restore/restore.go:1881" restore=openshift-adp/cpd-tenant-vol-r-3a590e1c-f001-11ef-a2e1-66a81cc1d6c7 time="<timestamp>" level=error msg="Restoring pod is not scheduled until timeout or cancel, disengage" error="client rate limiter Wait returned an error: context canceled" logSource="/remote-source/velero/app/pkg/podvolume/restorer.go:212" time="<timestamp>" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="error getting persistent volume claim for volume: persistentvolumeclaims \"data-wo-docproc-etcd-1\" not found" logSource="/remote-source/velero/app/pkg/restore/restore.go:1881" restore=openshift-adp/cpd-tenant-vol-r-3a590e1c-f001-11ef-a2e1-66a81cc1d6c7 time="<timestamp>" level=error msg="Restoring pod is not scheduled until timeout or cancel, disengage" error="client rate limiter Wait returned an error: context canceled" logSource="/remote-source/velero/app/pkg/podvolume/restorer.go:212" time="<timestamp>" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="error getting persistent volume claim for volume: persistentvolumeclaims \"data-wo-docproc-etcd-2\" not found" logSource="/remote-source/velero/app/pkg/restore/restore.go:1881" restore=openshift-adp/cpd-tenant-vol-r-3a590e1c-f001-11ef-a2e1-66a81cc1d6c7 time="<timestamp>" level=error msg="Restoring pod is not scheduled until timeout or cancel, disengage" error="client rate limiter Wait returned an error: context canceled" logSource="/remote-source/velero/app/pkg/podvolume/restorer.go:212" time="<timestamp>" level=error msg="Velero restore error: error getting persistent volume claim for volume: persistentvolumeclaims \"data-wo-docproc-etcd-0\" not found" logSource="/remote-source/velero/app/pkg/controller/restore_controller.go:604" restore=openshift-adp/cpd-tenant-vol-r-3a590e1c-f001-11ef-a2e1-66a81cc1d6c7 time="<timestamp>" level=error msg="Velero restore error: error getting persistent volume claim for volume: persistentvolumeclaims \"data-wo-docproc-etcd-1\" not found" logSource="/remote-source/velero/app/pkg/controller/restore_controller.go:604" restore=openshift-adp/cpd-tenant-vol-r-3a590e1c-f001-11ef-a2e1-66a81cc1d6c7 time="<timestamp>" level=error msg="Velero restore error: error getting persistent volume claim for volume: persistentvolumeclaims \"data-wo-docproc-etcd-2\" not found" logSource="/remote-source/velero/app/pkg/controller/restore_controller.go:604" restore=openshift-adp/cpd-tenant-vol-r-3a590e1c-f001-11ef-a2e1-66a81cc1d6c7 - Cause of the problem
- One or more watsonx Orchestrate PVCs are labeled for exclusion from the backup, but the pods that mount them are backed up.
- Resolving the problem
- Do the following steps:
- Add a label to the pods so that they are excluded from
backups:
oc label po -n ${PROJECT_CPD_INST_OPERANDS} -l "wo.watsonx.ibm.com/component=docproc,component=etcd" velero.io/exclude-from-backup=true - Create a new backup.
After restoring Watson Speech services online backup, unable to use service instance ID to make service REST API calls
Applies to: 5.1.0, 5.1.1, 5.1.2
Applies to: Online backup and restore to the same cluster with the OADP utility
Fixed in: 5.1.3
- Diagnosing the problem
- After performing a restore, when you attempt to use the Text-to-Speech REST APIs with the existing IBM Software Hub service instance token, you see an error message similar to the following example:
<html> <head><title>401 Authorization Required</title></head> <body> <center><h1>401 Authorization Required</h1></center> <hr><center>openresty</center> </body> </html> - Resolving the problem
- Create a new service instance and use the authorization details (endpoint URL and bearer token) from the new instance.
After restoring Watson Discovery online backup, unable to use service instance ID to make service REST API calls
Applies to: 5.1.2
Applies to: Online backup and restore to the same cluster with the OADP utility
Fixed in: 5.1.3
- Diagnosing the problem
- This problem occurs when Match
360 and
Watson Discovery are installed on the same
cluster.
After the restore, when you attempt to use the Watson Discovery REST APIs with the existing IBM Software Hub service instance token, you see an error message similar to the following example:
<html> <head><title>401 Authorization Required</title></head> <body> <center><h1>401 Authorization Required</h1></center> <hr><center>openresty</center> </body> </html>The following Match 360 user access groups are also present in the Watson Discovery service instance, receiving the same401errors:- DataSteward
- EntityViewer
- PublisherUser
- DataEngineer
You can see these user access groups when you select the service instance followed by Manage Access.
- Resolving the problem
- Do one of the following steps:
- Create a new service instance and use the authorization details (endpoint URL and bearer token) from the new instance.
- Remove the DataSteward, EntityViewer, PublisherUser, and DataEngineer user access groups from the Watson Discovery service instance.
Restore posthooks fail to run when restoring Data Virtualization
Applies to: 5.1.0, 5.1.1
Applies to: Backup and restore with the OADP utility
Fixed in: 5.1.2
- Diagnosing the problem
- In the CPD-CLI*.log file, you see an error message like in the following
example:
velero post-backup hooks in namespace <namespace> have one or more errors check for errors in <cpd-cli location>, and try again Error: velero post-backup hooks failed [ERROR] <timestamp> RunPluginCommand:Execution error: exit status 1 - Cause of the problem
- Velero hooks annotations are blocking the restore posthooks from running.Get the Data Virtualization addon pod definition by running a command like in the following example:
oc get po dv-addon-6fdddc4bc7-8bdlq -o jsonpath="{.metadata.annotations}" | jq .Example output that shows the Velero annotations:... "post.hook.backup.velero.io/command": "[\"bash\", \"-c\", \"echo Executing post-backup no-op hook\"]", "post.hook.restore.velero.io/command": "[\"bash\", \"-c\", \"echo Executing post-resttore no-op hook\"]", "pre.hook.backup.velero.io/command": "[\"bash\", \"-c\", \"echo Executing pre-backup no-op hook\"]", ... - Resolving the problem
- Remove the Velero hooks annotations. Because these annotations are not used, you can remove them
from all pods. Run the following
commands:
oc annotate po --all post.hook.backup.velero.io/command- -n ${PROJECT_CPD_INST_OPERANDS} oc annotate po --all post.hook.restore.velero.io/command- -n ${PROJECT_CPD_INST_OPERANDS} oc annotate po --all pre.hook.backup.velero.io/command- -n ${PROJECT_CPD_INST_OPERANDS}After the annotations are removed, rerun the restore posthooks command:
cpd-cli oadp restore posthooks \ --tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} \ --log-level=debug \ --verbose
Offline restore fails with cs-postgres timeout error
Applies to: 5.1.1 and later
Applies to: Offline backup and restore with the OADP utility
- Diagnosing the problem
- In the CPD-CLI*.log file, you see an error message like in the following
example:
* timed out performing op postRestoreViaConfigHookRule for resource cs-postgres (configmap=ibm-cs-postgres-br-cm): timed out waiting for the condition - Cause of the problem
- An offline restore posthook reports a restore failure when the default timeout of the cs-postgres-restore-job job, which is 6 minutes, is exceeded.
- Resolving the problem
- Increase the default timeout to 10 minutes (600 seconds) by doing the following steps:
- Create a YAML file named
restore-hooks.yaml:
cat << EOF > restore-hooks.yaml data: restore-meta: | post-hooks: exec-rules: - actions: - job: job-key: cs-postgres-restore-job timeout: 600s - actions: - job: job-key: cpfs-br-restore-posthooks-job timeout: 600s EOF - Patch the ibm-cs-postgres-br-cm ConfigMap by using the
restore-hooks.yaml
file:
oc patch cm ibm-cs-postgres-br-cm --patch-file restore-hooks.yaml -n ${PROJECT_CPD_INST_OPERANDS} - Re-run the restore posthooks
command:
cpd-cli oadp restore posthooks \ --tenant-operator-namespace ${PROJECT_CPD_INST_OPERATORS} \ --verbose \ --log-level debug
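To confirm that the patch was applied before you rerun the posthooks, you can print the hook definition from the ConfigMap and check the new timeout values (a generic check, not part of the documented steps):
oc get cm ibm-cs-postgres-br-cm -n ${PROJECT_CPD_INST_OPERANDS} -o jsonpath='{.data.restore-meta}'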
Restoring an offline backup fails with zenservice-check error
Applies to: 5.1.1
Applies to: Offline backup and restore with the OADP utility
Fixed in: 5.1.2
- Diagnosing the problem
- In the CPD-CLI*.log file, you see an error message like in the following
example:
Error: failed to execute masterplan: 2 errors occurred: * error executing workflow actions: workflow action execution resulted in 1 error(s): - encountered an error during hook workflowAction.Do() - action=zenservice-check/zenStatus, action-index=1, retry-attempt=0/0, err=no matches for /, Resource=zenservices * DataProtectionPlan=configmap=cpd-lite-aux-v2-br-cm, Action=zenservice-check/zenStatus (index=1) error: no matches for /, Resource=zenservices - Resolving the problem
- This problem occurs intermittently. The problem is usually resolved when you retry the restore.
Error running post-restore hooks during offline restore
Applies to: 5.1.1
Applies to: Offline backup and restore with the OADP utility
Fixed in: 5.1.2
- Diagnosing the problem
- In the CPD-CLI*.log file, you see an error message like in the following
example:
<timestamp> Failed with 1 error(s): <timestamp> error: DataProtectionPlan=platform-orchestration, Action=cpd-post-restore-hooks (index=8) <timestamp> offline post-restore hooks execution failed: error running post-restore hooks: Error running post-processing rules. Check the /root/br/restore/cpd-cli-workspace/logs/CPD-CLI-2025-02-10.log for errors. <timestamp> 1 error occurred: <timestamp> * op error: id=25, name=unquiesceViaScaling, configmap=: error performing op unquiesceViaScaling for resource servicecollection-cronjob, msg: : cronjobs.batch "servicecollection-cronjob" not found <timestamp> <timestamp> <timestamp> <timestamp> Finished with status: Failed <timestamp> saving master plan results to tenant-meta... <timestamp> <timestamp> ** INFO [MASTER PLAN RESULTS/SUMMARY/END] ************************************* <timestamp> <timestamp> ** INFO [RESTORE CREATE/DONE EXECUTING MASTERPLAN '\''RESTORE'\'' WORKFLOW (ISONLINE=FALSE)...] <timestamp> <timestamp> -------------------------------------------------------------------------------- <timestamp> <timestamp> Scenario: RESTORE CREATE (bkpnfsgrpcv110-tenant-offline-v2-b1-restore) <timestamp> Start Time: 2025-02-10 01:44:27.908033034 -0800 PST m=+0.300241262 <timestamp> Completion Time: 2025-02-10 03:00:00.054939864 -0800 PST m=+4532.447148092 <timestamp> Time Elapsed: 1h15m32.14690683s <timestamp> <timestamp> -------------------------------------------------------------------------------- <timestamp> <timestamp> Hook execution breakdown by status=error/timedout: <timestamp> <timestamp> The following hooks either have errors or timed out <timestamp> <timestamp> unquiesce (1): <timestamp> <timestamp> COMPONENT CONFIGMAP METHOD STATUS DURATION ADDONID <timestamp> servicecollection-cronjob N/A N/A error 0s N/A <timestamp> <timestamp> -------------------------------------------------------------------------------- <timestamp> <timestamp> [ERROR] <timestamp> RunPluginCommand:Execution error: exit status 1'This problem occurs intermittently to IBM Software Hub deployments when service monitors are installed.
- Resolving the problem
- Manually run the following restore posthooks
command:
cpd-cli oadp restore posthooks \ --tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} --hook-kind=br \ --spec-version="2.0.0" \ --verbose \ --log-level=trace
Afterwards, retry the restore.
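If you want to check whether the cron job exists before you rerun the hooks, you can query it directly (the resource name is taken from the error message):
oc get cronjob servicecollection-cronjob -n ${PROJECT_CPD_INST_OPERANDS}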
Online backup of upgraded IBM Cloud Pak for Data instance fails validation
Applies to: 5.1.0
Applies to: Online backup and restore with the OADP utility
- Diagnosing the problem
- This problem occurs when you upgrade an IBM Cloud Pak for Data 5.0.3 deployment to IBM Software Hub
5.1.0 and then try to create an online backup. The backup fails at the
backup validation stage. In the CPD-CLI*.log file, you see the following
error:
** INFO [SUMMARIZE BACKUP VALIDATION RESULTS/START] *************************** error: backup validation unsuccessful failed rules report: CM NAME RESOURCE-KIND ADDONID PATH ERR ibm-cs-postgres-ckpt-cm configmap cpfs backup-validation-meta/backup-validations/2/validation-rules/0 object with name 'platform-auth-idp' does not exists in the backup - Resolving the problem
- Do the following steps:
- Remove the icpdsupport/ignore-on-nd-backup label from the platform-auth-idp ConfigMap by running the following command:
oc label cm platform-auth-idp icpdsupport/ignore-on-nd-backup- - Retry the backup.
Offline backup of Db2 Data Management Console fails with backup validation error
Applies to: 5.1.1
Applies to: Offline backup and restore with the OADP utility
Fixed in: 5.1.2
- Diagnosing the problem
- In the CPD-CLI*.log file, you see the following
error:
backup validation failed for configmap: cpd-dmc-aux-br-cm - Resolving the problem
- Update the cpd-dmc-aux-br-cm ConfigMap.
- Edit the cpd-dmc-aux-br-cm ConfigMap.
- Under the
backup-validation-metasection, make the following updates.- Change
dmcaddon-sample to dmc-addon.
- resource-kind: dmcs.dmc.databases.ibm.com validation-rules: - type: match_names names: - dmc-sample
- Change
After you make these updates, check that thebackup-validation-metasection is like the following codeblock:backup-validation-meta: |- backup-validations: - resource-kind: dmcaddons.dmc.databases.ibm.com validation-rules: - type: match_names names: - dmc-addon - resource-kind: configmap validation-rules: - type: match_names names: - ibm-dmc-addon-api-cm
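A minimal way to open the ConfigMap for editing, assuming it is in the instance operands project, is:
oc edit cm cpd-dmc-aux-br-cm -n ${PROJECT_CPD_INST_OPERANDS}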
Offline backup fails with PartiallyFailed error
Applies to: 5.1.1 and later
Applies to: Offline backup and restore with the OADP utility
- Diagnosing the problem
- In the Velero logs, you see errors like in the following
example:
time="<timestamp>" level=error msg="Current BackupStorageLocations available/unavailable/unknown: 0/0/1)" controller=backup-storage-location logSource="/remote-source/velero/app/pkg/controller/backup_storage_location_controller.go:180" time="<timestamp>" level=error msg="error encountered while scanning stdout" backupLocation=oadp-operator/dpa-sample-1 cmd=/plugins/velero-plugin-for-aws controller=backup-sync error="read |0: file already closed" logSource="/remote-source /velero/app/pkg/plugin/clientmgmt/process/logrus_adapter.go:90" time="<timestamp>" level=error msg="Restic command fail with ExitCode: 1. Process ID is 906, Exit error is: exit status 1" logSource="/remote-source/velero/app/pkg/util/exec/exec.go:66" time="<timestamp>" level=error msg="pod volume backup failed: data path backup failed: error running restic backup command restic backup --repo=s3:http://minio-velero.apps.jctesti23.examplehost.example.com/velero/cpdbackup/restic/cpd-instance --pa ssword-file=/tmp/credentials/oadp-operator/velero-repo-credentials-repository-password --cache-dir=/scratch/.cache/restic . --tag=pod-uid=1ed9d52f-2f6d-4978-930a-4d8e30acced1 --tag=pvc-uid=76b76bc4-27d4-4369-886c-1272dfdf9ce9 --tag=volume=cc-home-p vc-vol --tag=backup=cpd-tenant-vol-485eef74-efbe-11ef-b2bd-00000a0b44c3 --tag=backup-uid=b55d6323-9875-4afe-b605-646250cbd55c --tag=ns=cpd-instance --tag=pod=cpdbr-vol-mnt --host=velero --json with error: exit status 3 stderr: {\"message_type\":\"e rror\",\"error\":{},\"during\":\"scan\",\"item\":\"/host_pods/1ed9d52f-2f6d-4978-930a-4d8e30acced1/volumes/kubernetes.io~nfs/pvc-76b76bc4-27d4-4369-886c-1272dfdf9ce9/.scripts/system\"}\n{\"message_type\":\"error\",\"error\":{},\"during\":\"archival \",\"item\":\".scripts/system\"}\n{\"message_type\":\"error\",\"error\":{\"Op\":\"open\",\"Path\":\"_global_/security/artifacts/metakey\",\"Err\":13},\"during\":\"archival\",\"item\":\"/host_pods/1ed9d52f-2f6d-4978-930a-4d8e30acced1/volumes/kuberne tes.io~nfs/pvc-76b76bc4-27d4-4369-886c-1272dfdf9ce9/_global_/security/artifacts/metakey\"}\nWarning: at least one source file could not be read\n" backup=oadp-operator/cpd-tenant-vol-485eef74-efbe-11ef-b2bd-00000a0b44c3 logSource="/remote-source/ve lero/app/pkg/podvolume/backupper.go:328" time="<timestamp>" level=error msg="pod volume backup failed: data path backup failed: error running restic backup command restic backup --repo=s3:http://minio-velero.apps.jctesti23.cp.fyre.ibm.com/velero/cpdbackup/restic/cpd-instance --pa ssword-file=/tmp/credentials/oadp-operator/velero-repo-credentials-repository-password --cache-dir=/scratch/.cache/restic . --tag=pod=cpdbr-vol-mnt --tag=pod-uid=1ed9d52f-2f6d-4978-930a-4d8e30acced1 --tag=pvc-uid=93e9e23c-d80a-49cc-80bb-31a36524e0d c --tag=volume=data-rabbitmq-ha-0-vol --tag=backup=cpd-tenant-vol-485eef74-efbe-11ef-b2bd-00000a0b44c3 --tag=backup-uid=b55d6323-9875-4afe-b605-646250cbd55c --tag=ns=cpd-instance --host=velero --json with error: exit status 3 stderr: {\"message_typ e\":\"error\",\"error\":{\"Op\":\"open\",\"Path\":\".erlang.cookie\",\"Err\":13},\"during\":\"archival\",\"item\":\"/host_pods/1ed9d52f-2f6d-4978-930a-4d8e30acced1/volumes/kubernetes.io~nfs/pvc-93e9e23c-d80a-49cc-80bb-31a36524e0dc/.erlang.cookie\"} \nWarning: at least one source file could not be read\n" backup=oadp-operator/cpd-tenant-vol-485eef74-efbe-11ef-b2bd-00000a0b44c3 logSource="/remote-source/velero/app/pkg/podvolume/backupper.go:328" - Cause of the problem
- The restic folder was deleted after backups were cleaned up (deleted). This problem is a Velero known issue. For more information, see velero does not recreate restic|kopia repository from manifest if its directories are deleted on s3.
- Resolving the problem
- Do the following steps:
- Get the list of backup
repositories:
oc get backuprepositories -n ${OADP-OPERATOR-NAMESPACE} -o yaml - Check for old or invalid object storage URLs.
- Check that the object storage path is in the backuprepositories custom resource.
- Check that the
<objstorage>/<bucket>/<prefix>/restic/<namespace>/config
file exists.
If the file does not exist, make sure that you do not share the same <objstorage>/<bucket>/<prefix> with another cluster, and specify a different <prefix>.
- Delete backup repositories that are invalid for the following reasons:
- The path does not exist anymore in the object storage.
- The restic/<namespace>/config file does not exist.
oc delete backuprepositories -n ${OADP_OPERATOR_NAMESPACE} <backup-repository-name>
OpenPages storage content is missing from offline backup
Applies to: 5.1.1
Applies to: Offline backup and restore to a different cluster with the OADP utility
Fixed in: 5.1.2
- Diagnosing the problem
- Check whether the OpenPages volume was
backed up by doing the following steps:
- Download the IBM Software Hub tenant volume
backup.Tip: The name of the backup starts with cpd-tenant-vol-.For example:
cpd-cli oadp backup download cpd-tenant-vol-8cebebba-f4e8-11ef-8148-22220a116473 - Uncompress the downloaded file.
For example:
tar -xvf cpd-tenant-vol-8cebebba-f4e8-11ef-8148-22220a116473-data.tar.gz - Look at the volumes that were backed up.
For example:
cd resources/pods/namespaces/zen-ns/ cat cpd-tenant-vol-8cebebba-f4e8-11ef-8148-22220a116473/resources/pods/namespaces/zen-ns/cpdbr-vol-mnt.json | jq '.spec.volumes' | grep name
The list of backups does not include the OpenPages PVC, which is named openpages-${OPENPAGES_INSTANCE_NAME}-appdata-pvc.
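For example, to look specifically for the OpenPages volume in the backed-up pod definition, you can filter the same output that you inspected above (the backup name and path reuse the earlier example):
cat cpd-tenant-vol-8cebebba-f4e8-11ef-8148-22220a116473/resources/pods/namespaces/zen-ns/cpdbr-vol-mnt.json | jq '.spec.volumes' | grep openpages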
- Resolving the problem
- Manually back up and restore all volumes from all pods by doing the following steps:
- Back up the volumes by running the following
command:
cpd-cli oadp backup create ${BACKUP_NAME} \ --include-resources 'namespaces,persistentvolumeclaims,persistentvolumes,pods,configmaps,secrets' \ --include-cluster-resources \ --include-namespaces ${PROJECT_CPD_INST_OPERANDS} \ --backup-type singleton \ --snapshot-volumes=false \ --default-volumes-to-fs-backup \ --skip-hooks \ --verbose \ --log-level debug - Restore the volumes by running the following
command:
cpd-cli oadp restore create ${RESTORE_NAME} \ --from-backup ${BACKUP_NAME} \ --include-resources 'namespaces,persistentvolumeclaims,persistentvolumes,pods,configmaps,secrets' \ --include-namespaces ${PROJECT_CPD_INST_OPERANDS} \ --skip-hooks \ --verbose \ --log-level debug
Offline backup validation fails in IBM Software Hub deployment that includes Db2 and Informix in the same namespace
Applies to: 5.1.1
Applies to: Offline backup and restore with the OADP utility
Fixed in: 5.1.2
- Diagnosing the problem
- In the CPD-CLI*.log file, you see error messages like in the following
examples:
time=<timestamp> level=info msg=>> received masterplan status update func=cpdbr-oadp/pkg/utils.LogInfo file=/a/workspace/oadp-upload/pkg/utils/log.go:83 time=<timestamp> level=info msg=>>> status description: finished executing 1 DataProtectionPlan(s): func=cpdbr-oadp/pkg/utils.LogInfo file=/a/workspace/oadp-upload/pkg/utils/log.go:83 time=<timestamp> level=info msg=>>> error: error executing workflow actions: workflow action execution resulted in 1 error(s): - encountered an error during local-exec workflowAction.Do() - action=cpd-backup-validation, action-index=7, retry-attempt=0/0, err=backup validation failed: 2 errors occurred: * backup validation failed for configmap: db2wh-aux-br-cm * backup validation failed for configmap: db2oltp-aux-br-cm func=cpdbr-oadp/pkg/utils.LogInfo file=/a/workspace/oadp-upload/pkg/utils/log.go:83time=<timestamp> level=info msg=CM NAME RESOURCE-KIND ADDONID PATH ERR db2wh-aux-br-cm deployment databases backup-validation-meta/backup-validations/3/validation-rules/0 object with name 'zen-databases' does not exists in the backup db2wh-aux-br-cm deployment databases backup-validation-meta/backup-validations/4/validation-rules/0 object with name 'zen-database-core' does not exists in the backup db2oltp-aux-br-cm deployment databases backup-validation-meta/backup-validations/3/validation-rules/0 object with name 'zen-databases' does not exists in the backup db2oltp-aux-br-cm deployment databases backup-validation-meta/backup-validations/4/validation-rules/0 object with name 'zen-database-core' does not exists in the backup func=cpdbr-oadp/pkg/core/services.(*backupValidationService).printBackupValidationResults file=/a/workspace/oadp-upload/pkg/core/services/backup_validation.go:340 time=<timestamp> level=error msg=backup validation unsuccessful func=cpdbr-oadp/pkg/utils.LogError file=/a/workspace/oadp-upload/pkg/utils/log.go:97 - Cause of the problem
- During the backup process, the zen-databases and zen-database-core pods are deleted.
- Resolving the problem
- To prevent these pods from being deleted, remove some content from the
playbooks/dbtypeservice.yml file in the
ibm-informix-cp4d-controller-manager pod.
- Get the instance ID of the ibm-informix-cp4d-controller-manager
pod:
oc -n ${PROJECT_CPD_INST_OPERATORS} get pods --selector name=ibm-informix-cp4d-operator - Open a terminal on the
ibm-informix-cp4d-controller-manager-<instance-id>
pod.
oc -n ${PROJECT_CPD_INST_OPERATORS} exec -it ibm-informix-cp4d-controller-manager-<instance-id> -- bash - Edit the playbooks/dbtypeservice.yml
file:
vi playbooks/dbtypeservice.yml - Comment out the following content:
- block: - include_role: name: zenhelper tasks_from: get_product_cm_info.yaml - include_role: name: zenhelper tasks_from: preupgrade.yml - include_role: name: zenhelper tasks_from: preupgrade-cleanup.yml rescue: - include_role: name: zenhelper tasks_from: rescuestatus.yml - fail: msg: "preupgrade failed" when: shutdown is not defined or (shutdown is defined and shutdown | lower != "true" and shutdown | lower != "force") - Save the changes.
- Retry the backup.
Unable to create offline backup when IBM Software Hub deployment includes MongoDB service
Applies to: 5.1.1
Applies to: Offline backup and restore with the OADP utility
Fixed in: 5.1.2
- Diagnosing the problem
- In the CPD-CLI*.log file, you see an error message like in the following
example:
time=<timestamp> level=error msg=global registry check failed: 1 error occurred: * error from addOnId=opsmanager: addOnId=opsmanager (state=enabled) is in zen-metastore, but does not exist in the global registryNote: The MongoDB service does not support IBM Software Hub backup and restore. The service is excluded from IBM Software Hub backups. - Resolving the problem
- Re-run the backup command with the
--registry-check-exclude-add-on-ids opsmanageroption. For example:cpd-cli oadp tenant-backup create ${TENANT_OFFLINE_BACKUP_NAME} \ --namespace ${OADP_PROJECT} \ --registry-check-exclude-add-on-ids opsmanager \ --cleanup-completed-resources=true \ --vol-mnt-pod-mem-request=1Gi \ --vol-mnt-pod-mem-limit=4Gi \ --tenant-operator-namespace ${PROJECT_CPD_INST_OPERATORS} \ --mode offline \ --image-prefix=registry.redhat.io/ubi9 \ --log-level debug \ --verbose &> ${TENANT_OFFLINE_BACKUP_NAME}.log
Backup fails after a service is upgraded and then uninstalled
Applies to: 5.1.1 and later
Applies to: Backup and restore with the OADP utility
- Diagnosing the problem
- The problem occurs after a service was upgraded from IBM Cloud Pak for Data 4.8.x to IBM Software Hub 5.1.1 or later and then uninstalled. When you try to take a backup by running the
cpd-cli oadp tenant-backupcommand, the backup fails. In the CPD-CLI*.log file, you see an error message like in the following example:Error: global registry check failed: 1 error occurred: * error from addOnId=watsonx_ai: 2 errors occurred: * failed to find aux configmap 'cpd-watsonxai-maint-aux-ckpt-cm' in tenant service namespace='<namespace_name>': : configmaps "cpd-watsonxai-maint-aux-ckpt-cm" not found * failed to find aux configmap 'cpd-watsonxai-maint-aux-br-cm' in tenant service namespace='<namespace_name>': : configmaps "cpd-watsonxai-maint-aux-br-cm" not found [ERROR] <timestamp> RunPluginCommand:Execution error: exit status 1 - Resolving the problem
- Re-run the backup command with the
--skip-registry-checkoption. For example:cpd-cli oadp tenant-backup create ${TENANT_OFFLINE_BACKUP_NAME} \ --namespace ${OADP_PROJECT} \ --vol-mnt-pod-mem-request=1Gi \ --vol-mnt-pod-mem-limit=4Gi \ --tenant-operator-namespace ${PROJECT_CPD_INST_OPERATORS} \ --mode offline \ --skip-registry-check \ --image-prefix=registry.redhat.io/ubi9 \ --log-level=debug \ --verbose &> ${TENANT_OFFLINE_BACKUP_NAME}.log&
Offline backup fails after watsonx.ai is uninstalled
Applies to: 5.1.0 and later
Applies to: Offline backup and restore with the OADP utility
- Diagnosing the problem
- The problem occurs when you try to take an offline backup after watsonx.ai was uninstalled. The backup process
fails when post-backup hooks are run. In the CPD-CLI*.log file, you see error
messages like in the following
example:
time=<timestamp> level=info msg=cmd stderr: <timestamp> [emerg] 233346#233346: host not found in upstream "wx-inference-proxyservice:18888" in /nginx_data/extensions/upstreams/latest-510_watsonxaiifm-routes_ie_226.conf:10 nginx: [emerg] host not found in upstream "wx-inference-proxyservice:18888" in /nginx_data/extensions/upstreams/latest-510_watsonxaiifm-routes_ie_226.conf:10 nginx: configuration file /usr/local/openresty/nginx/conf/nginx.conf test failed func=cpdbr-oadp/pkg/quiesce.executeCommand file=/a/workspace/oadp-upload/pkg/quiesce/ruleexecutor.go:824 time=<timestamp> level=warning msg=failed to get exec hook JSON result for container=ibm-nginx-container podIdx=0 podName=ibm-nginx-fd79d5686-cdpnj namespace=latest-510 auxMetaName=lite-maint-aux component=lite-maint actionIdx=0 err=could not find JSON exec hook result func=cpdbr-oadp/pkg/quiesce.executeCommand file=/a/workspace/oadp-upload/pkg/quiesce/ruleexecutor.go:852 time=<timestamp> level=warning msg=no exec hook JSON result found for container=ibm-nginx-container podIdx=0 podName=ibm-nginx-fd79d5686-cdpnj namespace=latest-510 auxMetaName=lite-maint-aux component=lite-maint actionIdx=0 func=cpdbr-oadp/pkg/quiesce.executeCommand file=/a/workspace/oadp-upload/pkg/quiesce/ruleexecutor.go:855 time=<timestamp> level=info msg=exit executeCommand func=cpdbr-oadp/pkg/quiesce.executeCommand file=/a/workspace/oadp-upload/pkg/quiesce/ruleexecutor.go:860 time=<timestamp> level=error msg=command terminated with exit code 1 - Cause of the problem
- After watsonx.ai is uninstalled, the nginx configuration in the ibm-nginx pod is not cleaned up, and the pod fails.
- Restart all ibm-nginx
pods.
oc delete pod \ -n=${PROJECT_CPD_INST_OPERANDS} \ -l component=ibm-nginx
Db2U backup precheck fails during offline backup
Applies to: 5.1.0 and later
Applies to: Offline backup and restore with the OADP utility
- Diagnosing the problem
- In the CPD-CLI*.log file, the following error message
appears:
You also see failed quota error messages similar to the following examples:Hook execution breakdown by status=error/timedout: The following hooks either have errors or timed out pre-check (1): COMPONENT CONFIGMAP METHOD STATUS DURATION ADDONID db2u db2u-aux-br-cm rule timedout 6m0.186233543s databases -------------------------------------------------------------------------------- Error: precheck failed with error: pre-check backup hooks encountered one or more error(s), err=1 error occurred: * backup precheck hook finished with status=timedout, configmap=db2u-aux-br-cm [ERROR] <timestamp> RunPluginCommand:Execution error: exit status 1Error creating: pods "db2-bar-precheck-jobl6482-wj6x5" is forbidden: failed quota: cpd-quota: must specify limits.cpu for: db2-bar-precheck; limits.memory for: db2-bar-precheck; requests.cpu for: db2-bar-precheck; requests.memory for: db2-bar-precheck' - Cause of the problem
- This problem occurs only if Kubernetes resource quotas are enabled in the cluster. The resource quotas prevent the backup precheck pod from being created because the precheck job does not specify CPU and memory requests and limits.
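To see which quotas are enforced in the instance project, you can list and describe them (the cpd-quota name comes from the example error message; your quota name might differ):
oc get resourcequota -n ${PROJECT_CPD_INST_OPERANDS}
oc describe resourcequota cpd-quota -n ${PROJECT_CPD_INST_OPERANDS}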
- Resolving the problem
- Do the following steps:
- Scale down the Db2 ansible
operators:
oc -n ${PROJECT_CPD_INST_OPERATORS} scale deployment ibm-db2oltp-cp4d-operator-controller-manager --replicas=0oc -n ${PROJECT_CPD_INST_OPERATORS} scale deployment ibm-db2wh-cp4d-operator-controller-manager --replicas=0oc -n ${PROJECT_CPD_INST_OPERATORS} scale deployment ibm-db2aaservice-cp4d-operator-controller-manager --replicas=0 - For each project that has Db2
instances running, turn on maintenance mode for Db2 ansible custom
resources:
export DB2_PROJECT=<Db2_project_name>oc -n $DB2_PROJECT patch db2oltpservice db2oltp-cr --type=merge -p '{"spec":{"ignoreForMaintenance":true}}'oc -n $DB2_PROJECT patch db2whservice db2wh-cr --type=merge -p '{"spec":{"ignoreForMaintenance":true}}'oc -n $DB2_PROJECT patch db2aaserviceservice db2aaservice-cr --type=merge -p '{"spec":{"ignoreForMaintenance":true}}' - Edit the db2u-aux-br-cm and db2u-aux-ckpt-cm
ConfigMaps to change precheck-meta and backup pre-hook on-error settings to
Continue.oc -n $DB2_PROJECT edit cm db2u-aux-br-cmoc -n $DB2_PROJECT edit cm db2u-aux-ckpt-cm-
In each ConfigMap, locate the
precheck-metasection:precheck-meta: | backup-hooks: exec-rules: - actions: - job: job-key: db2-bar-precheck-job timeout: 360s - In the
actionssection, addon-error: Continuelike in the following example:precheck-meta: | backup-hooks: exec-rules: - actions: - job: job-key: db2-bar-precheck-job timeout: 360s on-error: Continue
-
- Edit the db2oltp-aux-br-cm ConfigMap to change precheck-meta and backup
pre-hook on-error settings to
Continue.oc -n $DB2_PROJECT edit cm db2oltp-aux-br-cm-
In the ConfigMap, locate the
precheck-metasection:precheck-meta: | backup-hooks: exec-rules: - resource-kind: db2oltpservices.databases.cpd.ibm.com labels: app.kubernetes.io/name=db2oltp on-error: Fail actions: - builtins: name: cpdbr.cpd.ibm.com/check-condition params: condition: "{$.status.db2oltpStatus} == {\"Completed\"}" timeout: 5s - Change
on-error: Failtoon-error: Continue. - In the ConfigMap, locate the
backup-metasection:backup-meta: | pre-hooks: exec-rules: - resource-kind: db2oltpservice.databases.cpd.ibm.com on-error: Propagate actions: - builtins: name: cpdbr.cpd.ibm.com/enable-maint params: statusFieldName: db2oltpStatus timeout: 360s - Change
on-error: Propagatetoon-error: Continue.
-
- Edit the db2oltp-aux-ckpt-cm ConfigMap to change precheck-meta and backup
pre-hook on-error settings to
Continue.oc -n $DB2_PROJECT edit cm db2oltp-aux-ckpt-cm-
In the ConfigMap, locate the
precheck-metasection:precheck-meta: | backup-hooks: exec-rules: - resource-kind: db2oltpservices.databases.cpd.ibm.com labels: app.kubernetes.io/name=db2oltp on-error: Fail actions: - builtins: name: cpdbr.cpd.ibm.com/check-condition params: condition: "{$.status.db2oltpStatus} == {\"Completed\"}" timeout: 5s - Change
on-error: Failtoon-error: Continue.
-
- Edit the db2wh-aux-br-cm ConfigMap to change precheck-meta and backup
pre-hook on-error settings to
Continue.oc -n $DB2_PROJECT edit cm db2wh-aux-br-cm-
In the ConfigMap, locate the
precheck-metasection:precheck-meta: | backup-hooks: exec-rules: - resource-kind: db2whservices.databases.cpd.ibm.com labels: app.kubernetes.io/name=db2wh on-error: Fail actions: - builtins: name: cpdbr.cpd.ibm.com/check-condition params: condition: "{$.status.db2whStatus} == {\"Completed\"}" timeout: 5s - Change
on-error: Failtoon-error: Continue. - In the ConfigMap, locate the
backup-metasection:backup-meta: | pre-hooks: exec-rules: - resource-kind: db2whservice.databases.cpd.ibm.com on-error: Propagate actions: - builtins: name: cpdbr.cpd.ibm.com/enable-maint params: statusFieldName: db2whStatus timeout: 360s - Change
on-error: Propagatetoon-error: Continue.
-
- Edit the db2wh-aux-ckpt-cm ConfigMap to change precheck-meta and backup
pre-hook on-error settings to
Continue.oc -n $DB2_PROJECT edit cm db2wh-aux-ckpt-cm-
In the ConfigMap, locate the
precheck-metasection:precheck-meta: | backup-hooks: exec-rules: - resource-kind: db2whservices.databases.cpd.ibm.com labels: app.kubernetes.io/name=db2wh on-error: Fail actions: - builtins: name: cpdbr.cpd.ibm.com/check-condition params: condition: "{$.status.db2whStatus} == {\"Completed\"}" timeout: 5s - Change
on-error: Failtoon-error: Continue.
-
- Edit the db2aaservice-aux-br-cm ConfigMap to change precheck-meta and
backup pre-hook on-error settings to
Continue.oc -n $DB2_PROJECT edit cm db2aaservice-aux-br-cm-
In the ConfigMap, locate the
precheck-metasection:precheck-meta: | backup-hooks: exec-rules: - resource-kind: db2aaserviceservices.databases.cpd.ibm.com labels: app.kubernetes.io/name=db2aaservice on-error: Fail actions: - builtins: name: cpdbr.cpd.ibm.com/check-condition params: condition: "{$.status.db2aaserviceStatus} == {\"Completed\"}" timeout: 5s - Change
on-error: Failtoon-error: Continue. - In the ConfigMap, locate the
backup-metasection:backup-meta: | pre-hooks: exec-rules: - resource-kind: db2aaserviceservice.databases.cpd.ibm.com on-error: Propagate actions: - builtins: name: cpdbr.cpd.ibm.com/enable-maint params: statusFieldName: db2aaserviceStatus timeout: 360s - Change
on-error: Propagatetoon-error: Continue.
-
- Edit the db2aaservice-aux-ckpt-cm ConfigMap to change precheck-meta and
backup pre-hook on-error settings to
Continue.oc -n $DB2_PROJECT edit cm db2aaservice-aux-ckpt-cm-
In the ConfigMap, locate the
precheck-metasection:precheck-meta: | backup-hooks: exec-rules: - resource-kind: db2aaserviceservices.databases.cpd.ibm.com labels: app.kubernetes.io/name=db2aaservice on-error: Fail actions: - builtins: name: cpdbr.cpd.ibm.com/check-condition params: condition: "{$.status.db2aaserviceStatus} == {\"Completed\"}" timeout: 5s - Change
on-error: Failtoon-error: Continue.
-
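The procedure does not describe how to undo these changes. As an assumption, after the backup completes you would turn maintenance mode back off and scale the operators back up with commands like the following, repeating the patch for the db2wh and db2aaservice custom resources and the scale command for each Db2 operator (a replica count of 1 is an assumption; use your original values):
oc -n $DB2_PROJECT patch db2oltpservice db2oltp-cr --type=merge -p '{"spec":{"ignoreForMaintenance":false}}'
oc -n ${PROJECT_CPD_INST_OPERATORS} scale deployment ibm-db2oltp-cp4d-operator-controller-manager --replicas=1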
Db2 Big SQL backup pre-hook and post-hook fail during offline backup
Applies to: 5.1.0 and later
Applies to: Offline backup and restore with the OADP utility
- Diagnosing the problem
- In the db2diag logs of the Db2
Big SQL head pod, you see error messages such as in the following example when backup pre-hooks
are running:
<timestamp> LEVEL: Event PID : 3415135 TID : 22544119580160 PROC : db2star2 INSTANCE: db2inst1 NODE : 000 HOSTNAME: c-bigsql-<xxxxxxxxxxxxxxx>-db2u-0 FUNCTION: DB2 UDB, base sys utilities, DB2StartMain, probe:5692 MESSAGE : ZRC=0xFFFFFBD0=-1072 SQL1072C The request failed because the database manager resources are in an inconsistent state. The database manager might have been incorrectly terminated, or another application might be using system resources in a way that conflicts with the use of system resources by the database manager. - Cause of the problem
- The Db2 database was unable
to start because of the error code
SQL1072C. As a result, thebigsql startcommand that runs as part of the post-backup hook hangs, which produces the timeout of the post-hook. The post-hook cannot succeed unless Db2 is brought back to a stable state and thebigsql startcommand runs successfully. The Db2 Big SQL instance is left in an unstable state. - Resolving the problem
- Do one or both of the following troubleshooting and cleanup procedures.Tip: For more information about the
SQL1072Cerror code and how to resolve it, see SQL1000-1999 in the Db2 documentation.- Remove all the database manager processes running under the Db2 instance ID
- Do the following steps:
- Log in to the Db2
Big SQL head
pod:
oc -n ${PROJECT_CPD_INST_OPERANDS} rsh $(oc -n ${PROJECT_CPD_INST_OPERANDS} get pods | grep -i c-bigsql | grep -i db2u-0 | cut -d' ' -f 1) bash
db2inst1user:su - db2inst1 - List all the database manager processes that are running under
db2inst1:db2_ps - Remove these
processes:
kill -9 <process-ID>
- Log in to the Db2
Big SQL head
pod:
- Ensure that no other application is running under the Db2 instance ID, and then remove all resources owned by the Db2 instance ID
- Do the following steps:
- Log in to the Db2
Big SQL head
pod:
oc -n ${PROJECT_CPD_INST_OPERANDS} rsh $(oc -n ${PROJECT_CPD_INST_OPERANDS} get pods | grep -i c-bigsql | grep -i db2u-0 | cut -d' ' -f 1) bash
db2inst1user:su - db2inst1 - List all IPC resources owned by
db2inst1:ipcs | grep db2inst1 - Remove these
resources:
ipcrm -[q|m|s] db2inst1
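After either cleanup procedure, you can rerun the Big SQL start command as the db2inst1 user inside the Db2 Big SQL head pod to confirm that the instance is stable again (the command name is taken from the problem description; this check is not part of the documented procedure):
su - db2inst1 -c "bigsql start"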
Error running post-restore hooks when restoring an offline backup
Applies to: 5.1.0
Applies to: Offline backup and restore with the OADP utility
Fixed in: 5.1.1
- Diagnosing the problem
- After you run the restore command, you see the following
error:
The CPD-CLI-<timestamp>.log file shows following error:error: DataProtectionPlan=v1-orchestration, Action=post-restore-hooks (index=8) offline post-restore hooks execution failed: error running post-restore hooks: Error running post-processing rules. Check the ../logs/CPD-CLI-<timestamp>.log for errors.* error performing op postRestoreViaConfigHookRule for resource lite-maint (configmap=cpd-lite-aux-br-maint-cm): 1 error occurred: * error executing command (container=ibm-nginx-container podIdx=1 podName=ibm-nginx-<xxxxxxxxxx>-<yyyyy> namespace=${PROJECT_CPD_INST_OPERANDS} auxMetaName=lite-maint-aux component=lite-maint actionIdx=0): command terminated with exit code 1Also, after the restore, you are not able log in to the IBM Software Hub web client. Check the ibm-nginx-xxx pod log for the following error:<timestamp> [error] 1412#1412: *29 connect() failed (111: Connection refused) while connecting to upstream, client: <ip_address>, server: internal-nginx-svc, request: "POST /v2/catalogs/default?check_bucket_existence=false HTTP/1.1", upstream: "https://<ip_address>:<port_number>/v2/catalogs/default?check_bucket_existence=false", host: "internal-nginx-svc:12443" - Resolving the problem
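To view the ibm-nginx pod log that is mentioned above, a standard log query works (the label selector is the same one that the resolution steps below use to restart the pods):
oc logs -l app.kubernetes.io/component=ibm-nginx -n ${PROJECT_CPD_INST_OPERANDS}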
- Do the following steps:
- Restart the ibm-nginx
pods:
oc delete pod -l app.kubernetes.io/component=ibm-nginx -n ${PROJECT_CPD_INST_OPERANDS} - Rerun the post-restore
hook:
cpd-cli oadp restore posthooks \ --hook-kind=br \ --tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS}
If you are still unable to log in to the Cloud Pak for Data web client, it might be because of the following Operator Lifecycle Manager (OLM) known issue: OLM known issue: ResolutionFailed message. To resolve the problem, follow the instructions in the troubleshooting topic. Then wait for the zenservice lite-cr custom resource to reach theCompletedstate by running the following command:oc get zenservice lite-cr -n ${PROJECT_CPD_INST_OPERATORS} -o json | jq .status.zenStatus - Restart the ibm-nginx
pods:
Creating Watson Studio offline volume backup fails
Applies to: 5.1.0 and later
Applies to: Offline backup and restore with the IBM Software Hub volume backup utility
- Diagnosing the problem
- In the CPD-CLI*.log file, you see the following
error:
Error: error on quiesce: 1 error occurred: * error resolving aux config cpd-ws-maint-aux-qu-cm in namespace <namespace-name>: : pods "init-ws-runtimes-libs" not found - Resolving the problem
- Do the following steps:
- Run the following
commands:
cmName=cpd-ws-maint-aux-qu-cm auxMeta=$(oc get cm -n ${PROJECT_CPD_INST_OPERANDS} ${cmName} -o jsonpath='{.data.aux-meta}' | yq 'del(."managed-resources")') echo -e "data:\n aux-meta: |\n$(echo "$auxMeta" | sed 's/^/ /')" > patch.yaml oc patch cm -n ${PROJECT_CPD_INST_OPERANDS} ${cmName} --patch-file patch.yaml - Retry the
backup:
cpd-cli backup-restore volume-backup create <backup-name> -n ${PROJECT_CPD_INST_OPERANDS}
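Before retrying the backup, you can optionally confirm that the managed-resources entry was removed by printing the patched aux-meta (the same ConfigMap and key as in the commands above):
oc get cm -n ${PROJECT_CPD_INST_OPERANDS} cpd-ws-maint-aux-qu-cm -o jsonpath='{.data.aux-meta}'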
Security issues
Security scans return an Inadequate Account Lockout Mechanism message
Applies to: 5.1.0 and later
- Diagnosing the problem
-
If you run a security scan against IBM Software Hub, the scan returns the following message:
Inadequate Account Lockout Mechanism
- Resolving the problem
-
This behavior is by design. It is strongly recommended that you use an enterprise-grade solution for password management, such as SAML SSO or an LDAP provider.
The Kubernetes version information is disclosed
Applies to: 5.1.0 and later
- Diagnosing the problem
- If you run an Aqua Security scan against your cluster, the scan reports that the Kubernetes version information is disclosed.
- Resolving the problem
- This is expected based on the following solution document from Red Hat OpenShift Customer Portal: Hide kubernetes /version API endpoint in OpenShift Container Platform 4.