Known issues and limitations for IBM Cloud Pak for Data

The following issues apply to the IBM Cloud Pak for Data platform. Each issue includes information about the releases that it applies to. If the issue was fixed in a refresh, that information is also included.

The following issues apply to IBM Cloud Pak for Data services.

Customer-reported issues

Issues that are found after the release are posted on the IBM Support site.

General issues

After rebooting a cluster that uses OpenShift Data Foundation storage, some Cloud Pak for Data services aren't functional
The Assist me icon is not displayed in the web client
The delete-platform-ca-certs command does not remove certificate mounts from pods
When you add a secret to a vault, you cannot filter the list of users and groups to show only groups
The wml service key does not work and commands must use the --service option
The cpd-cli cluster command fails on ROSA with hosted control planes

After rebooting a cluster that uses OpenShift Data Foundation storage, some Cloud Pak for Data services aren't functional

Applies to: 5.0.0 and later

Diagnosing the problem

After rebooting the cluster, some Cloud Pak for Data custom resources remain in the InProgress state.

For more information about this problem, see Missing NodeStageVolume RPC call blocks new pods from going into Running state in the Red Hat® OpenShift® Data Foundation 4.1.4 release notes.

Workaround

Do the following steps:

Find the nodes that have pods that are in an Error state:

oc get po -n ${PROJECT_CPD_INST_OPERANDS} -o wide s -A  | grep -v -P "Completed|(\d+)\/\1"

Mark each node as unschedulable.
```
oc adm cordon <node_name>
```

Delete the affected pods:

oc get pod   | grep -Ev "Comp|0/0|1/1|2/2|3/3|4/4|5/5|6/6|7/7"|awk '{print $1}' |xargs oc delete po --force=true --grace-period=0

Mark each node as scheduled:
```
oc adm uncordon <node_name>
```

The Assist me icon is not displayed in the web client

Applies to: Upgrades from Version 4.8.x

Fixed in: 5.0.3

If you upgrade IBM Cloud Pak for Data from Version 4.8.x to Version 5.0, the Assist me icon is not visible in the web client toolbar.

The issue occurs because the ASSIST_ME_ENABLED option is set to false.

Resolving the problem

To make Assist me available in the web client:

Log in to Red Hat OpenShift Container Platform as a user with sufficient permissions to complete the task.
```
${OC_LOGIN}
```
Remember: OC_LOGIN is an alias for the oc login command.

Run the following command to set

ASSIST_ME_ENABLED:
true

oc patch cm product-configmap \
--namespace=${PROJECT_CPD_INST_OPERANDS} \
--type=merge \
--patch '{"data": {"ASSIST_ME_ENABLED": "true"}}'

Confirm that the ASSIST_ME_ENABLED parameter is set to true:

oc get cm product-configmap \
--namespace=${PROJECT_CPD_INST_OPERANDS} \
-o jsonpath="{.data.ASSIST_ME_ENABLED}{'\n'}"

The `delete-platform-ca-certs` command does not remove certificate mounts from pods

Applies to: 5.0.0

Fixed in: 5.0.3

When you run the cpd-cli manage delete-platform-ca-certs command, the command does not remove the certificate mounts from pods.

Resolving the problem

To remove the certificate mounts from pods:

Delete the cpd-custom-ca-certs secret:

oc delete secret cpd-custom-ca-certs \
--namespace=${PROJECT_CPD_INST_OPERANDS}

Run the

cpd-cli
manage
delete-platform-ca-certs

command:

cpd-cli manage delete-platform-ca-certs \
--cpd_instance_ns=${PROJECT_CPD_INST_OPERANDS} \
--apply=true

When you add a secret to a vault, you cannot filter the list of users and groups to show only groups

Applies to: 5.0.0

Fixed in: 5.0.3

When you add a secret to a vault, you can optionally share the secret with other users. However, if you try to filter the list of users and groups to show only groups, the filter does not take effect.

The `wml` service key does not work and `health` commands must use the `--service` option

Applies to: 5.0.3 and later

The wml service key cannot be used for 5.0.3 and later versions. Using the wml service key can cause severe data loss. When Watson Machine Learning is installed on a cluster, there are two restrictions that you must follow if you want to use either the

cpd-cli
health
service-functionality

or the

cpd-cli
health
service-functionality cleanup

You cannot use the wml service key with the --services option.
You must use the --services option when you use either the cpd-cli health service-functionality command or the cpd-cli health service-functionality cleanup command.

The `cpd-cli cluster` command fails on ROSA with hosted control planes

Applies to: 5.0.0

Fixed in: 5.1.1

An error message for missing machineconfigpools (MCP) shows up when you run the cpd-cli cluster command to check the health of a Red Hat OpenShift Service on AWS (ROSA) cluster with hosted control planes (HCP).

Installation and upgrade issues

The Switch locations icon is not available if the `apply-cr` command times out

Applies to: 5.0.0, 5.0.1, and 5.0.2

Fixed in: 5.0.3

If you install solutions that are available in different Cloud Pak for Data experiences, the Switch locations icon Switcher icon is not available in the web client if the cpd-cli manage apply-cr command times out.

Resolving the problem: Re-run the cpd-cli manage apply-cr command.

Upgrades fail if the Data Foundation Rook Ceph cluster is unstable

Applies to: 5.0.0 and later

If the Red Hat OpenShift Data Foundation or IBM Storage Fusion Data Foundation Rook Ceph® cluster is unstable, upgrades fail.

One symptom is that pods will not start because of a FailedMount error. For example:

Warning  FailedMount  36s (x1456 over 2d1h)   kubelet  MountVolume.MountDevice failed for volume 
"pvc-73bf3705-43e9-40bd-87ed-c1e1656d6f12" : rpc error: code = Aborted desc = an operation with the given 
Volume ID 0001-0011-openshift-storage-0000000000000001-5e17508b-c295-4306-b684-eaa327aec2ab already exists

Diagnosing the problem

To confirm whether the Data Foundation Rook Ceph cluster is unstable:

Ensure that the rook-ceph-tools pod is running.
```
oc get pods -n openshift-storage | grep rook-ceph-tools
```
Note: On IBM Storage Fusion HCI System or on environments that use hosted control planes, the pods are running in the openshift-storage-client project.
Set the TOOLS_POD environment variable to the name of the rook-ceph-tools pod:
```
export TOOLS_POD=<pod-name>
```

Execute into the rook-ceph-tools pod:

oc rsh -n openshift-storage ${TOOLS_POD}

Run the following command to get the status of the Rook Ceph cluster:
```
ceph status
```
Confirm that the output includes the following line:
```
health: HEALTH_WARN
```
Exit the pod:
```
exit
```

Resolving the problem

To resolve the problem:

Get the name of the rook-ceph-mrg pods:

oc get pods -n openshift-storage | grep rook-ceph-mgr

Set the MGR_POD_A environment variable to the name of the rook-ceph-mgr-a pod:
```
export MGR_POD_A=<rook-ceph-mgr-a-pod-name>
```
Set the MGR_POD_B environment variable to the name of the rook-ceph-mgr-b pod:
```
export MGR_POD_B=<rook-ceph-mgr-b-pod-name>
```

Delete the rook-ceph-mgr-a pod:

oc delete pods ${MGR_POD_A} -n openshift-storage

Ensure that the rook-ceph-mgr-a pod is running before you move to the next step:
```
oc get pods -n openshift-storage | grep rook-ceph-mgr
```

Delete the rook-ceph-mgr-b pod:

oc delete pods ${MGR_POD_B} -n openshift-storage

Ensure that the rook-ceph-mgr-b pod is running:

oc get pods -n openshift-storage | grep rook-ceph-mgr

Running the `apply-olm` command twice during an upgrade can remove required OLM resources

Applies to:

Upgrades from Version 4.7 to 5.0.0
Upgrades from Version 4.8 to 5.0.0

Upgrades to later 5.0 refreshes are not affected.

If you run the

cpd-cli
manage
apply-olm

two times, you might notice several problems:

The operator subscription is missing
The operator cluster service version (CSV) is missing

If you continue the upgrade by running the

cpd-cli
manage
apply-cr

command, you might notice additional problems:

The version information is missing from the spec section of the service custom resource
When you run the cpd-cli manage get-cr-status command, the values for the Version and Reconciled-version parameters are different.

Resolving the problem: To resolve the problem, you must re-run the cpd-cli manage apply-olm command a third time to ensure that the required resources are available. Then, re-run the cpd-cli manage apply-cr command.

After an upgrade from Cloud Pak for Data 4.7.3, FoundationDB can indicate a `Failed` status

Applies to: Upgrades from Version 4.7.3 to 5.0.0 and later

After upgrading Cloud Pak for Data from Version 4.7.3 to 5.0, the status of the FoundationDB cluster can indicate that it has failed (fdbStatus: Failed). The Failed status can occur even if FoundationDB is available and working correctly. This issue occurs when the FoundationDB resources do not get properly cleaned up by the upgrade.

This issue affects deployments of the following services.

IBM Knowledge Catalog
IBM Match 360

Diagnosing the problem

To determine if this problem has occurred:

Required role: To complete this task, you must be a cluster administrator.

Check the FoundationDB cluster status.
```
oc get fdbcluster -o yaml | grep fdbStatus
```
If the returned status is Failed, proceed to the next step to determine if the pods are available.
Check to see if the FoundationDB pods are up and running.
```
oc get pods -n ${PROJECT_CPD_INST_OPERANDS} | grep foundation
```
The returned list of FoundationDB pods should all have a status of Running. If they are not running, then the problem is something other than this issue.

Resolving the problem

To resolve this issue, restart the FoundationDB controller (ibm-fdb-controller):

Required role: To complete this task, you must be a cluster administrator.

Identify your FoundationDB controllers.
```
oc get pods  -n ${PROJECT_CPD_INST_OPERATORS} | grep ibm-fdb-controller
```
This command returns the names of two FoundationDB controllers in the following formats:
- ibm-fdb-controller-manager-<INSTANCE-ID>
- apple-fdb-controller-manager-<INSTANCE-ID>

Delete the ibm-fdb-controller-manager to refresh it.

oc delete pod ibm-fdb-controller-<INSTANCE-ID> -n ${PROJECT_CPD_INST_OPERATORS}

Wait for the controller to restart. This can take approximately one minute.
Check the status of your FoundationDB cluster:
```
oc -n ${PROJECT_CPD_INST_OPERANDS} get FdbCluster -o yaml
```
Confirm that the fdbStatus is now Completed.

After you upgrade a Red Hat OpenShift Container Platform cluster, the FoundationDB resource can become unavailable

Applies to: 5.0.0 and later

After you upgrade your cluster to a new version of Red Hat OpenShift Container Platform, the IBM FoundationDB pods can become unavailable. When this issue occurs, services that rely on FoundationDB such as IBM Knowledge Catalog and IBM Match 360 cannot function correctly.

This issue affects deployments of the following services.

IBM Knowledge Catalog
IBM Match 360 with Watson

Diagnosing the problem

To identify the cause of this issue, check the FoundationDB status and details.

Check the FoundationDB status.
```
oc get fdbcluster -o yaml | grep fdbStatus
```
If this command is successful, the returned status is Complete. If the status is InProgress or Failed, proceed to the workaround steps.
If the status is Complete but FoundationDB is still unavailable, log in to one of the FDB pods and check the status details to ensure that the database is available and all coordinators are reachable.
```
oc rsh sample-cluster-log-1 /bin/fdbcli
```
To check the detailed status of the FDB pod, run fdbcli to enter the FoundationDB command-line interface, then run the following command at the fdb> prompt.
```
status details
```
- If you get a message that is similar to Could not communicate with a quorum of coordination servers, run the coordinators command with the IP addresses specified in the error message as input.
```
oc get pod -o wide | grep storage
> coordinators IP-ADDRESS-1:4500:tls IP-ADDRESS-2:4500:tls IP-ADDRESS-3:4500:tls 
```
  If this step does not resolve the problem, proceed to the workaround steps.
- If you get a different message, such as Recruiting new transaction servers, proceed to the workaround steps.

Resolving the problem

To resolve this issue, restart the FoundationDB pods.

Required role: To complete this task, you must be a cluster administrator.

Restart the FoundationDB cluster pods.
```
oc get fdbcluster 
oc get po |grep ${CLUSTER_NAME} |grep -v backup|awk '{print }' |xargs oc delete po
```
Replace ${CLUSTER_NAME} in the command with the name of your fdbcluster instance.

Restart the FoundationDB operator pods.

oc get po |grep fdb-controller |awk '{print }' |xargs oc delete po

After the pods finish restarting, check to ensure that FoundationDB is available.
1. Check the FoundationDB status.
```
oc get fdbcluster -o yaml | grep fdbStatus
```
  The returned status must be Complete.
2. Check to ensure that the database is available.
```
oc rsh sample-cluster-log-1 /bin/fdbcli
```
  If the database is still not available, complete the following steps.
  1. Log in to the ibm-fdb-controller pod.
  2. Run the fix-coordinator script.
```
kubectl fdb fix-coordinator-ips -c ${CLUSTER_NAME} -n ${PROJECT_CPD_INST_OPERATORS}
```
    Replace ${CLUSTER_NAME} in the command with the name of your fdbcluster instance.
    
    Note: For more information about the fix-coordinator script, see the workaround steps from the resolved IBM Match 360 known issue item The FoundationDB cluster can become unavailable.

After you upgrade from Cloud Pak for Data 4.7.4, generating a bearer token fails in an IAM-enabled cluster

Applies to: Upgrades from Version 4.7.4 to 5.0.0 and later

If you upgrade from Cloud Pak for Data version 4.7.4 to Cloud Pak for Data 5.0.0 and later, the IAM access token API (/idprovider/v1/auth/identitytoken) fails. You cannot login to the user interface when the identitytoken API fails.

Diagnosing the problem

The following error is displayed in the log when you generate an IAM access token:

Failed to get access token, Liberty error: {"error_description":"CWWKS1406E: The token request had an invalid client credential. The request URI was \/oidc\/endpoint\/OP\/token.","error":"invalid_client"}"

Resolving the problem

Log in to Red Hat OpenShift Container Platform as a cluster administrator.
```
${OC_LOGIN}
```
Remember: OC_LOGIN is an alias for the oc login command.
Run the following command to restart the oidc-client-registration job:
```
oc -n delete job oidc-client-registration
```

Secrets are not visible in connections after upgrade

Applies to:

Upgrades from Version 4.7 to Version 5.0.0, 5.0.1, or 5.0.2
Upgrades from Version 4.8 to Version 5.0.0, 5.0.1, or 5.0.2

Fixed in: 5.0.3

If you use secrets when you create connections, the secrets are not visible in the connection details after you upgrade Cloud Pak for Data. This issue occurs when your vault uses a private CA signed certificate.

Resolving the problem

To see the secrets in the user interface:

Change to the project where Cloud Pak for Data is installed:
```
oc project ${PROJECT_CPD_INST_OPERANDS}
```

Set the following environment variables:

oc set env deployment/zen-core-api VAULT_BRIDGE_TLS_RENEGOTIATE=true
oc set env deployment/zen-core-api VAULT_BRIDGE_TOLERATE_SELF_SIGNED=true

Node pinning is not applied to `postgresql` pods

Applies to: 5.0.0 and later

If you use node pinning to schedule pods on specific nodes, and your environment includes postgresql pods, the node affinity settings are not applied to the postgresql pods that are associated with your Cloud Pak for Data deployment.

The resource specification injection (RSI) webhook cannot patch postgresql pods because the EDB Postgres operator uses a PodDisruptionBudget resource to limit the number of concurrent disruptions to postgresql pods. The PodDisruptionBudget resource prevents postgresql pods from being evicted.

You must manually clean up remote physical location artifacts if the `create-physical-location` command fails

Applies to: 5.0.0

Fixed in: 5.0.1

If the cpd-cli manage create-physical-location command fails, the command leaves behind resources that you must clean up by running the cpd-cli manage delete-physical-location command:

cpd-cli manage delete-physical-location \
--physical_location_name=${REMOTE_PHYSICAL_LOCATION_ID} \
--management_ns=${REMOTE_PROJECT_MANAGEMENT} \
--cpd_hub_url=${CPD_HUB_URL} \
--cpd_hub_api_key=${CPD_HUB_API_KEY}

If you try to re-run the create-physical-location command against the same management project before you run the delete-physical-location command, the create-physical-location command returns the following error:

The physical-location-info-cm ConfigMap already exists in the <management-ns> project.
The physical location in the ConfigMap is called <remote-physical-location-id>
* If you need to re-run the create-physical-location command to finish creating the physical location, 
  you must specify <remote-physical-location-id>.
* If you want to create a new physical location on the cluster, you must specify a different project. 
  You cannot reuse an existing management project.

The `ibm-nginx` deployment does not scale fast enough when automatic scaling is configured

Applies to: 5.0.0 and later

If you configure automatic scaling for the IBM Cloud Pak for Data control plane, the ibm-nginx deployment might not scale fast enough. Some symptoms include:

Slow response times
High CPU requests are throttled
The deployment scales up and down even when the workload is steady

This problem typically occurs when you install watsonx Assistant or watsonx Orchestrate.

Resolving the problem

If you encounter the preceding symptoms, you must manually scale the ibm-nginx deployment:

oc patch zenservice lite-cr \
--namespace=${PROJECT_CPD_INST_OPERANDS} \
--type merge \
--patch '{"spec": {
    "Nginx": {
        "name": "ibm-nginx",
        "kind": "Deployment",
        "container": "ibm-nginx-container",
        "replicas": 5,
        "minReplicas": 2,
        "maxReplicas": 11,
        "guaranteedReplicas": 2,
        "metrics": [
            {
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    "target": {
                        "type": "Utilization",
                        "averageUtilization": 529
                    }
                }
            }
        ],
        "resources": {
            "limits": {
                "cpu": "1700m",
                "memory": "2048Mi",
                "ephemeral-storage": "500Mi"
            },
            "requests": {
                "cpu": "225m",
                "memory": "920Mi",
                "ephemeral-storage": "100Mi"
            }
        },
        "containerPolicies": [
            {
                "containerName": "*",
                "minAllowed": {
                    "cpu": "200m",
                    "memory": "256Mi"
                },
                "maxAllowed": {
                    "cpu": "2000m",
                    "memory": "2048Mi"
                },
                "controlledResources": [
                    "cpu",
                    "memory"
                ],
                "controlledValues": "RequestsAndLimits"
            }
        ]
    }
}}'

Backup and restore issues

Issues that apply to several or all backup and restore methods

Online backup and restore with the OADP backup and restore utility issues

Online backup and restore with IBM Storage Fusion issues

Online backup and restore with NetApp Astra Control Center issues

Watson OpenScale etcd server fails to start after restoring from a backup

Data replication with Portworx issues

Offline backup and restore with the OADP backup and restore utility issues

OADP backup is missing EDB Postgres PVCs

Applies to: 5.0.0 and later

Diagnosing the problem

After an online or offline backup is taken with the OADP utility, EDB Postgres PVCs are missing in the PVC backup list.

Cause of the problem

EDB Postgres replica PVCs might be excluded from a backup when an EDB Postgres cluster switches primary instances.

Resolving the problem

Before you create a backup, run the following command:

oc label pvc,pods -l k8s.enterprisedb.io/cluster,velero.io/exclude-from-backup=true velero.io/exclude-from-backup- -n ${PROJECT_CPD_INST_OPERANDS}

For more information, see the following topics:

`Disk usage size` error when running the `du-pv` command

Applies to: 5.0.0 and later

Note: Do not use this feature in a production environment.

Diagnosing the problem

When you run the du-pv command to estimate how much storage is needed to create a backup with the OADP utility, you see the following error message:

Total estimated volume usage size: 0

one or more error(s) occurred while trying to get disk usage size.  Please check reported errors in log file for details

The status of the cpdbr-agent pods is ImagePullBackoff:

oc get po -n ${OADP_NAMESPACE}

Example output:

NAME                                                READY   STATUS             RESTARTS   AGE
cpdbr-agent-9lprf                                   0/1     ImagePullBackOff   0          74s
cpdbr-agent-pf42f                                   0/1     ImagePullBackOff   0          74s
cpdbr-agent-trprx                                   0/1     ImagePullBackOff   0          74s

Cause of the problem

The --image-prefix option is not currently used by the cpdbr-agent install command. If you specify this option, it is ignored. Instead, the install command uses the default image at registry.access.redhat.com/ubi9/ubi-minimal:latest.

Resolving the problem

Do the following steps:

Patch the cpdbr-agent daemonset with the desired fully-qualified image name:

oc patch daemonset cpdbr-agent -n ${OADP_NAMESPACE} --type='json' -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/image", "value":"<fully-qualified-image-name>"}]'

Wait for the daemonset to reach a healthy state:
```
oc rollout status daemonset cpdbr-agent -n ${OADP_NAMESPACE}
```
Retry the dv-pv command.

Tip: For more information about this feature, see Optional: Estimating how much storage to allocate for backups.

After restore, watsonx Assistant custom resource is stuck in `InProgress` at `11/19` verified state

Applies to: 5.0.0-5.0.2

Fixed in: 5.0.3

Diagnosing the problem

This problem can occur after you restore an online backup to the same cluster or to a different cluster. Run the following command:

oc get <watsonx-Assistant-instance-name> -n ${PROJECT_CPD_INST_OPERANDS}

Example output:

NAME   VERSION   READY   READYREASON    UPDATING   UPDATINGREASON   DEPLOYED   VERIFIED   QUIESCE   AGE
wa     5.0.1     False   Initializing   True       VerifyWait       19/19      11/19                4h39m

Cause of the problem

Pods are unable to find the wa-global-etcd secret. Run the following command:

oc describe pod wa-store-<xxxxxxxxx>-<xxxxx> | tail -5

Example output:

Normal   QueuePosition  51m (x2 over 52m)     ibm-cpd-scheduler  Queue Position: 3
  Normal   QueuePosition  50m (x2 over 52m)     ibm-cpd-scheduler  Queue Position: 2
  Normal   QueuePosition  36m                   ibm-cpd-scheduler  Queue Position: 1
  Warning  FailedMount    6m49s (x22 over 50m)  kubelet            Unable to attach or mount volumes: unmounted volumes=[global-etcd], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition
  Warning  FailedMount    74s (x33 over 52m)    kubelet            MountVolume.SetUp failed for volume "global-etcd" : secret "wa-global-etcd" not found

Resolving the problem

Delete certain deployments and recreate them by doing the following steps:

Ensure that the watsonx Assistant operator is running.
Create the INSTANCE environment variable and set it to the watsonx Assistant instance name:
```
export INSTANCE=<watsonx-Assistant-instance-name>
```

Run the following script:

# Components to restart one by one
SEQUENTIAL_DEPLOYMENTS=("ed" "dragonfly-clu-mm" "tfmm" "clu-triton-serving" "clu-serving" "nlu" "dialog" "store")

# Components to restart together in parallel
PARALLEL_DEPLOYMENTS=("analytics" "clu-embedding" "incoming-webhooks" "integrations" "recommends" "system-entities" "ui" "webhooks-connector" "gw-instance" "store-admin")

for DEPLOYMENT in "${SEQUENTIAL_DEPLOYMENTS[@]}"; do
  echo "#Starting restart of $INSTANCE-$DEPLOYMENT."

  # Delete the deployment
  oc delete deployment $INSTANCE-$DEPLOYMENT
  # Wait until the deployment is completely deleted
  while oc get deployment $INSTANCE-$DEPLOYMENT &> /dev/null; do
    echo "Waiting for $INSTANCE-$DEPLOYMENT to be fully deleted..."
    sleep 5
  done
  
  # Ensure the deployment is recreated
  echo "Recreating $INSTANCE-$DEPLOYMENT."
  while ! oc get deployment $INSTANCE-$DEPLOYMENT &> /dev/null; do
    echo "Waiting for $INSTANCE-$DEPLOYMENT to be created..."
    sleep 5
  done

  echo "Waiting for $INSTANCE-$DEPLOYMENT to become ready..."
  oc rollout status deployment/$INSTANCE-$DEPLOYMENT --watch=true

  echo "#Rolling restart of $INSTANCE-$DEPLOYMENT completed successfully."
done

for DEPLOYMENT in "${PARALLEL_DEPLOYMENTS[@]}"; do
  echo "#Starting restart of $INSTANCE-$DEPLOYMENT."

  # Delete the deployment
  oc delete deployment $INSTANCE-$DEPLOYMENT &

done

# Wait for all parallel delete operations to complete
wait

# Ensure parallel deployments are recreated
for DEPLOYMENT in "${PARALLEL_DEPLOYMENTS[@]}"; do
  while ! oc get deployment $INSTANCE-$DEPLOYMENT &> /dev/null; do
    echo "Waiting for $INSTANCE-$DEPLOYMENT to be created..."
    sleep 5
  done

  echo "Waiting for $INSTANCE-$DEPLOYMENT to become ready..."
  oc rollout status deployment/$INSTANCE-$DEPLOYMENT --watch=true

  echo "#Rolling restart of $INSTANCE-$DEPLOYMENT completed successfully."
done

echo "All deployments have been restarted successfully."

After restore, watsonx Assistant is stuck on the `17/19` deployed state or custom resource is stuck in `InProgress` state

Applies to: 5.0.1, 5.0.2

Fixed in: 5.0.3

Diagnosing the problem

This problem can occur after you restore an online backup to the same cluster or to a different cluster. Run the following command:

oc get wa -n ${PROJECT_CPD_INST_OPERANDS}

Example output:

NAME   VERSION   READY   READYREASON    UPDATING   UPDATINGREASON   DEPLOYED   VERIFIED   QUIESCE   AGE
wa     5.0.1     False   Initializing   True       VerifyWait       17/19      15/19                4h39m

Resolving the problem

Delete the wa-integrations-operand-secret and wa-integrations-datastore-connection-strings secrets by running the following commands:

oc delete secret wa-integrations-operand-secret -n ${PROJECT_CPD_INST_OPERANDS}

oc delete secret wa-integrations-datastore-connection-strings -n ${PROJECT_CPD_INST_OPERANDS}

After the secrets are deleted, the watsonx Assistant operator recreates them with the correct values, and the watsonx Assistant custom resource and pods are now in a good state.

OADP backup precheck command fails

Applies to: 5.0.0, 5.0.1

Fixed in: 5.0.2

Diagnosing the problem

This problem occurs when you do offline or online backup and restore with the OADP backup and restore utility. Run the backup precheck command:

cpd-cli oadp backup precheck --tenant-operator-namespace ${PROJECT_CPD_INST_OPERATORS}

The following error message appears:

error getting csv list: : clusterserviceversions.operators.coreos.com is forbidden: User "system:serviceaccount:zen-cpdbrapi:cpdbr-api-sa" cannot list resource "clusterserviceversions" in API group "operators.coreos.com" at the cluster scope
Error: error getting csv list: : clusterserviceversions.operators.coreos.com is forbidden: User "system:serviceaccount:zen-cpdbrapi:cpdbr-api-sa" cannot list resource "clusterserviceversions" in API group "operators.coreos.com" at the cluster scope
[ERROR] <timestamp> RunPluginCommand:Execution error: exit status 1

Cause of the problem

The cpdbr-api pod does not have the necessary permission to list clusterserviceversions.operators.coreos.com in all projects (namespaces) for the backup precheck command.

Resolving the problem

Add --exclude-checks OadpOperatorCSVto the backup precheck command:

cpd-cli oadp backup precheck \
--tenant-operator-namespace ${PROJECT_CPD_INST_OPERATORS} \
--exclude-checks OadpOperatorCSV

During or after a restore, pod shows PVC is missing

Applies to: 5.0.0

Fixed in: 5.0.1

Diagnosing the problem

During or after a restore, a pod shows that one or more PVCs are missing. For example:

oc describe pod c-db2oltp-wkc-db2u-0

Example output:

Events:
  Type     Reason            Age                    From               Message
  ----     ------            ----                   ----               -------
  Warning  FailedScheduling  18m (x11076 over 16h)  ibm-cpd-scheduler  0/20 nodes are available: 20 persistentvolumeclaim "wkc-db2u-backups" not found. preemption: 0/20 nodes are available: 20 Preemption is not helpful for scheduling.

Cause of the problem

Velero does not back up PVCs that are in a Terminating state.

Resolving the problem

To work around the problem, before you restore a backup, ensure that no PVCs are in a Terminating state. To check for PVCs that are in a Terminating state after a backup is created, check the Velero pod logs for Skipping item because it's being deleted messages:

oc logs po -l deploy=velero -n <oadp-operator-ns>

Example output:

time="<timestamp>" level=info msg="Skipping item because it's being deleted." backup=oadp-operator/bkupocs661-tenant-online-b1 logSource="/remote-source/velero/app/pkg/backup/item_backupper.go:161" name=wkc-db2u-backups namespace=zen1 resource=persistentvolumeclaims
time="<timestamp>" level=info msg="Backed up 286 items out of an estimated total of 292 (estimate will change throughout the backup)" backup=oadp-operator/bkupocs661-tenant-online-b1 logSource="/remote-source/velero/app/pkg/backup/backup.go:404" name=wkc-db2u-backups namespace=zen1 progress= resource=persistentvolumeclaims

After restoring an online backup, status of Watson Discovery custom resource remains in `InProgress` state

Applies to: 5.0.0-5.0.2

Fixed in: 5.0.3

Diagnosing the problem

You see the following error, even though you did the Multicloud Object Gateway post-restore task. For example, if you used IBM Storage Fusion to do the backup and restore, you created the secrets that Watson Discovery uses to connect to Multicloud Object Gateway.

  - lastTransactionTime: <timestamp>
    message: Post task of online restore is in progress. Please ensure that MCG is
      correctly configured after restore.
    reason: PostRestoreInProgress
    status: "True"
    type: Message

Cause of the problem

The Watson Discovery post-restore task did not complete.

Resolving the problem

To work around the problem, do the following steps:

Check that the Watson Discovery post-restore component exists:
```
oc get wd wd -o jsonpath='{.status.componentStatus.deployedComponents[?(@=="post_restore")]}'
```
If the post-restore component exists, the output of the command is:
```
post_restore
```
Check that the post-restore task is not unverified:
```
oc get wd wd -o jsonpath='{.status.componentStatus.unverifiedComponents[?(@=="post_restore")]}'
```
If the post-restore task is not unverified, no output is produced by the command.
In this situation, some failure jobs do not rerun and must be deleted:
```
oc delete job wd-discovery-enrichment-model-copy wd-discovery-orchestrator-setup
```

Check that Watson Discovery is now ready:

oc get wd

Example output:

NAME   VERSION   READY   READYREASON   UPDATING   UPDATINGREASON   DEPLOYED   VERIFIED   QUIESCE        DATASTOREQUIESCE   AGE
wd     5.0.0     True    Stable        False      Stable           23/23      23/23      NOT_QUIESCED   NOT_QUIESCED       22h

After successful restore, the ibm-common-service-operator deployment fails to reach a `Running` state

Applies to: 5.0.0 and later

Diagnosing the problem

The following symptoms are seen:

Running the following command shows that the ibm-common-service-operator pod and deployment are not healthy:

oc get pods -n ${PROJECT_CPD_INST_OPERATORS}

Example output:

ibm-common-service-operator-<...>                      0/1     CrashLoopBackOff        72 (4m46s ago)   6h11m

Error logs show permission issues:

oc logs ibm-common-service-operator-<...>

Example output:

...
# I0529 20:52:39.182025       1 request.go:665] Waited for 1.033737216s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/dashboard.opendatahub.io/v1alpha?timeout=32s
# <date_timestamp>20:52:47.794Z	INFO	controller-runtime.metrics	metrics server is starting to listen	{"addr": ":8080"}
# I0529 20:52:47.794980       1 main.go:130] Identifying Common Service Operator Role in the namespace cpd-operator
# E0529 20:52:47.835106       1 util.go:465] Failed to fetch configmap kube-public/saas-config: configmaps "saas-config" is forbidden: User "system:serviceaccount:cpd-operator:ibm-common-service-operator" cannot get resource "configmaps" in API group "" in the namespace "kube-public"
# I0529 20:52:47.837942       1 init.go:152] Single Deployment Status: false, MultiInstance Deployment status: true, SaaS Depolyment Status: false
# I0529 20:52:49.188786       1 request.go:665] Waited for 1.340366538s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/cdi.kubevirt.io/v1beta1?timeout=32s
# E0529 20:52:57.412736       1 init.go:1683] Failed to cleanup validatingWebhookConfig: validatingwebhookconfigurations.admissionregistration.k8s.io "ibm-common-service-validating-webhook-cpd-operator" is forbidden: User "system:serviceaccount:cpd-operator:ibm-common-service-operator" cannot delete resource "validatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope
# E0529 20:52:57.412762       1 main.go:153] Cleanup Webhook Resources failed: validatingwebhookconfigurations.admissionregistration.k8s.io "ibm-common-service-validating-webhook-cpd-operator" is forbidden: User "system:serviceaccount:cpd-operator:ibm-common-service-operator" cannot delete resource "validatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope

Running the following command shows that the ibm-common-service-operator CSV is stuck in a Pending state:

oc get csv -n ${PROJECT_CPD_INST_OPERATORS}

Example output:

NAME                      DISPLAY           VERSION   REPLACES   PHASE
ibm-zen-operator.v6.0.0   IBM Zen Service   6.0.0                Pending

Running the following command shows that status of the CommonService custom resource is Succeeded:

oc get commonservice -n ${PROJECT_CPD_INST_OPERANDS} common-service -o json | jq .status.phase

OLM logs show the following error:

oc logs -n openshift-operator-lifecycle-manager -l app=catalog-operator

oc logs -n openshift-operator-lifecycle-manager -l app=olm-operator

Example output:

E0530 01:00:07.268889 1 queueinformer_operator.go:319] sync {"update" "cpd-operator/ibm-common-service-operator.v4.6.0"} failed: requirements were not met
time="<timestamp>" level=info msg="requirements were not met" csv=cpd-platform-operator.v4.6.0 id=<...> namespace=cpd-operator phase=Pending

Cause of the problem

The root cause is from a known OLM issue where ClusterRoleBindings are missing, even though the InstallPlan shows it was created. For details, see the OLM issue ClusterRoleBinding is missing although InstallPlan shows it was created.

Resolving the problem

To work around the problem, clean up the Cloud Pak for Data instance and operator projects (namespaces) and retry the restore. For cleanup instructions, see Preparing to restore Cloud Pak for Data with the OADP utility.

Restore fails with `Error from server (Forbidden): configmaps is forbidden` error

Applies to: 5.0.0

Fixed in: 5.0.1

Diagnosing the problem

When restoring Cloud Pak for Data to a different cluster with IBM Storage Fusion, NetApp Astra Control Center, or Portworx, you see the following error message:

Time: <timestamp> level=error - oc get configmap -n kube-public - FAILED with:  
Error from server (Forbidden): configmaps is forbidden: User "system:serviceaccount:cpd-operator:cpdbr-tenant-service-sa" 
cannot list resource "configmaps" in API group "" in the namespace "kube-public"
End Time: <timestamp>

Cause of the problem

The command to uninstall the cpdbr service was run with the incorrect --tenant-operator-namespace parameter. For example, multiple Cloud Pak for Data instances were installed in the cluster, and while cleaning up one of the instances, the incorrect project was specified when uninstalling the cpdbr service.

Resolving the problem

To work around the problem, reinstall the cpdbr service in the project where it was mistakenly uninstalled. For details, see one of the following topics:

After a restore, unable to access the Cloud Pak for Data console

Applies to: 5.0.0

Fixed in: 5.0.1

Diagnosing the problem

You see the following symptoms:

Running the following command shows that the ibm-iam-operator pod and deployment are not healthy:
```
oc get pods -n ${PROJECT_CPD_INST_OPERATORS}
```
Example output:
```
ibm-iam-operator-<...>                      0/1     CrashLoopBackOff        72 (4m46s ago)   6h11m
```
Error logs show permission issues:
```
oc logs ibm-iam-operator-<...>
```

Running the following command shows that the ibm-iam-operator CSV is stuck in a Pending state:

oc get csv -n ${PROJECT_CPD_INST_OPERATORS}

Example output:

NAME                      DISPLAY           VERSION   REPLACES   PHASE
ibm-iam-operator.v4.6.0   IBM IM Operator   4.6.0                Pending

Running the following command shows that status of the CommonService custom resource is Succeeded:

oc get commonservice -n ${PROJECT_CPD_INST_OPERANDS} common-service -o json | jq .status.phase

OLM logs show the following error:

oc logs -n openshift-operator-lifecycle-manager -l app=catalog-operator

oc logs -n openshift-operator-lifecycle-manager -l app=olm-operator

Example output:

E0530 01:00:07.268889 1 queueinformer_operator.go:319] sync {"update" "cpd-operator/ibm-iam-operator.v4.6.0"} failed: requirements were not met
time="<timestamp>" level=info msg="requirements were not met" csv=ibm-iam-operator.v4.6.0 id=<...> namespace=cpd-operator phase=Pending

Cause of the problem

Insufficient permissions from missing ClusterRole and ClusterRoleBindings. The root cause is from a known OLM issue where ClusterRoleBindings are missing, even though the InstallPlan shows it was created. For details, see the OLM issue ClusterRoleBinding is missing although InstallPlan shows it was created.

Resolving the problem

After a successful restore, the Cloud Pak for Data console points to the source cluster domain in its URL instead of the target cluster domain

Applies to: 5.0.0

Fixed in: 5.0.1

Diagnosing the problem

Get the Cloud Pak for Data console route by running the following command:

oc get route -n ${PROJECT_CPD_INST_OPERANDS}

The output of the command shows that the Cloud Pak for Data console route points to the source cluster domain rather than to the target cluster domain.

Cause of the problem

The ibmcloud-cluster-info ConfigMap from the source cluster is included in the restore when it is expected to be excluded and re-generated, causing the target restore cluster to use the source routes.

Resolving the problem

To work around the problem, do the following steps:

Edit the fields in the ibmcloud-cluster-info ConfigMap to use the target cluster hostname:
```
oc edit configmap ibmcloud-cluster-info -n ${PROJECT_CPD_INST_OPERANDS}
```

Restart the ibm-zen-operator pod:

oc delete po -l app.kubernetes.io/name=ibm-zen-operator -n ${PROJECT_CPD_INST_OPERANDS}

Check that the routes are updated:
```
oc get route -n ${PROJECT_CPD_INST_OPERANDS}
```

If restarting the ibm-zen-operator pod does not correctly update the routes, and the ibm-iam-operator deployment is not healthy, do the workaround that is described in the previous issue.

Unable to back up Watson Discovery when the service is scaled to the `xsmall` size

Applies to: 5.0.0

Fixed in: 5.0.1

Diagnosing the problem

The problem that you see depends on the backup and restore method that you are using. For example, if you are using IBM Storage Fusion, a Failed snapshot message appears during the backup process.

Cause of the problem

The xsmall size configuration uses 1 OpenSearch data node. The backup process requires 2 data nodes.

Resolving the problem

To work around the problem, increase the number of OpenSearch data nodes to 2. In the ${PROJECT_CPD_INST_OPERANDS} project (namespace), run the following command:

oc patch wd wd --type=merge --patch='{"spec":{"elasticsearch":{"dataNode":{"replicas":2}}}}'

In a Cloud Pak for Data deployment that has multiple OpenPages instances, only the first instance is successfully restored

Applies to: 5.0.0

Fixed in: 5.0.1

Diagnosing the problem

After the restore, the custom resource of the first is OpenPages instance is in a Completed state. The custom resources of the remaining OpenPages instances are in an InMaintenance state.

Cause of the problem

Hooks (prehooks, posthooks, etc.) are run only on the first OpenPages instance. Log files list only the results for one OpenPages instance when multiple were expected.

Resolving the problem

To work around the problem, do the following steps:

Get the OpenPages instance ConfigMaps:

oc get cm -n ${PROJECT_CPD_INST_OPERANDS} -l cpdfwk.module=openpages-aux

Edit each OpenPages instance ConfigMap so that their .data.aux-meta.name fields match their .metadata.labels.["cpdfwk.name"] label:
```
oc edit cm -n ${PROJECT_CPD_INST_OPERANDS} <configmap-name>
```

Unable to log in to Cloud Pak for Data with OpenShift cluster credentials after successfully restoring to a different cluster

Applies to: 5.0.0

Fixed in: 5.0.1

Diagnosing the problem

When Cloud Pak for Data is integrated with the Identity Management Service service, you cannot log in with OpenShift cluster credentials. You might be able to log in with LDAP or as cpdadmin.

Resolving the problem

To work around the problem, run the following commands:

oc delete cm platform-auth-idp -n ${PROJECT_CPD_INST_OPERANDS}

oc delete cm oauth-client-map -n ${PROJECT_CPD_INST_OPERANDS}

oc delete cm ibm-iam-bindinfo-oauth-client-map -n ${PROJECT_CPD_INST_OPERANDS}

oc delete cm ibm-iam-bindinfo-platform-auth-idp -n ${PROJECT_CPD_INST_OPERANDS}

oc delete pods -n ${PROJECT_CPD_INST_OPERATORS} -l app.kubernetes.io/instance=ibm-common-service-operator

oc delete pods -n ${PROJECT_CPD_INST_OPERANDS} -l app.kubernetes.io/instance=platform-auth-service

oc delete pods -n ${PROJECT_CPD_INST_OPERANDS} -l app.kubernetes.io/instance=platform-identity-management

oc delete pods -n ${PROJECT_CPD_INST_OPERANDS} -l app.kubernetes.io/instance=platform-identity-provider

After a restore, OperandRequest timeout error in the ZenService custom resource

Applies to: 5.0.0 and later

Diagnosing the problem

Get the status of the ZenService YAML:

oc get zenservice lite-cr -n ${PROJECT_CPD_INST_OPERATORS} -o yaml

In the output, you see the following error:

...
zenMessage: '5.1.3/roles/0010-infra has failed with error: "OperandRequest" "zen-ca-operand-request":
      Timed out waiting on resource'
...

Check for failing operandrequests:

oc get operandrequests -A

For failing operandrequests, check their conditions for

constraints not
                satisfiable

messages:

oc describe -n ${PROJECT_CPD_INST_OPERATORS} <opreq-name>

Cause of the problem

Subscription wait operations timed out. The problematic subscriptions show an error similar to the following example:

'constraints not satisfiable: clusterserviceversion ibm-db2aaservice-cp4d-operator.v5.2.0
      exists and is not referenced by a subscription, @existing/cpd-operators//ibm-db2aaservice-cp4d-operator.v5.2.0
      and ibm-db2aaservice-cp4d-operator-catalog/cpd-operators/v5.2/ibm-db2aaservice-cp4d-operator.v5.2.0
      originate from package ibm-db2aaservice-cp4d-operator, subscription ibm-db2aaservice-cp4d-operator
      requires ibm-db2aaservice-cp4d-operator-catalog/cpd-operators/v5.2/ibm-db2aaservice-cp4d-operator.v5.2.0,
      subscription ibm-db2aaservice-cp4d-operator exists'

This problem is a known issue with Red Hat Operator Lifecycle Manager. For details, see Failed to install or upgrade operator with warning 'clusterserviceversion is not referenced by a subscription'.

Workaround

Do the following steps:

Delete the problematic clusterserviceversions and subscriptions, and restart the Operand Deployment Lifecycle Manager (ODLM) pod.
For details, follow the steps in the troubleshooting document cloud-native-postgresql operator is installed with the certified-operators catalogsource.
Delete Cloud Pak for Data instance projects (namespaces).
For details, see Preparing to restore Cloud Pak for Data with the OADP utility.
Retry the restore.

Online restore of Data Virtualization fails with post-hook errors

Applies to: 5.0.2, 5.0.3

Diagnosing the problem

Restoring an online backup of Data Virtualization on Portworx storage with the OADP backup and restore utility fails. In the CPD-CLI*.log file, you see errors such as in the following examples:

<time>     zen/configmap/cpd-dv-aux-ckpt-cm: component=dv, op=<mode=post-restore,type=config-hook,method=rule>, status=error

time=<timestamp> level=error msg=error performing op postRestoreViaConfigHookRule for resource dv, msg: 1 error occurred:
   * : command timed out after 40m0s: timed out waiting for the condition
 func=cpdbr-oadp/pkg/quiesce.(*BasicPlanExecutor).applyPlanInternal file=/a/workspace/oadp-upload/pkg/quiesce/planexecutor.go:1535

Cause of the problem

Db2 startup is slow, causing the Data Virtualization post-restore hook to time out.

Resolving the problem

To work around the problem, take various Data Virtualization components out of write-suspend mode.

Take dvutils out of write-suspend mode:

oc rsh c-db2u-dv-dvutils-0 bash

/opt/dv/current/dv-utils.sh -o leavesafemode --is-bar

Take the Data Virtualization hurricane pod out of write-suspend mode:

oc rsh $(oc get pods | grep -i hurricane | cut -d' ' -f 1) bash

su - db2inst1

/usr/ibmpacks/current/bigsql/bigsql/bigsql-cli/BIGSQL/package/scripts/bigsql-db2ubar-hook.sh -H POST -M RESTORE -L

Take Db2 out of write-suspend mode:

oc rsh c-db2u-dv-db2u-0 bash

su - db2inst1

/db2u/scripts/bigsql-exec.sh /usr/ibmpacks/current/bigsql/bigsql/bigsql-cli/BIGSQL/package/scripts/bigsql-db2ubar-hook.sh -H POST -M RESTORE -L

After a few minutes, verify that Db2 is no longer in write-suspend mode:
```
db2 connect to bigsql
```
If the command finishes successfully, Db2 is no longer in write-suspend mode.
Restart the Data Virtualization caching pod by deleting the existing pod:
```
oc delete pod $(oc get pods | grep -i c-db2u-dv-dvcaching | cut -d' ' -f 1)
```

Online backup of Analytics Engine powered by Apache Spark fails

Applies to: 5.0.1

Fixed in: 5.0.2

Diagnosing the problem

When you try to create a backup of a Cloud Pak for Data deployment that includes the Analytics Engine powered by Apache Spark service with the OADP utility, the backup fails at the step to create a backup of Cloud Pak for Data PVCs and volume data. In the log file, you see the following error:

Hook execution breakdown by status=error/timedout:

The following hooks either have errors or timed out

pre-backup (1):

        COMPONENT                       CONFIGMAP                               METHOD  STATUS  DURATION       
        analyticsengine-cnpsql-ckpt     cpd-analyticsengine-aux-edb-ckpt-cm     rule    error   1m17.502299591s

--------------------------------------------------------------------------------

** INFO [BACKUP CREATE/SUMMARY/END] *******************************************
Error: error running pre-backup hooks: Error running pre-processing rules.  Check the /root/install_automation/cpd-cli-linux-EE-14.0.1-353/cpd-cli-workspace/logs/CPD-CLI-<date>.log for errors.
[ERROR] <timestamp> RunPluginCommand:Execution error:  exit status 1

Cause of the problem

The EDB Postgres cluster spark-hb-cloud-native-postgresql remains fenced.

Resolving the problem

Unfence the cluster by doing the following steps:

Edit the spark-hb-cloud-native-postgresql cluster:

oc edit clusters.postgresql.k8s.enterprisedb.io spark-hb-cloud-native-postgresql

Remove the following line:
```
k8s.enterprisedb.io/fencedInstances: ""
```
Retry the backup.

Tip: For more information about resolving problems with EDB Postgres clusters that remain fenced, see EDB Postgres cluster is in an unhealthy state after a failed online backup.

Watson Speech services status is stuck in `InProgress` after restore

Applies to: 5.0.0

Fixed in: 5.0.1

Diagnosing the problem

After an online restore with the OADP utility, the CPD-CLI*.log file shows speechStatus is in the InProgress state.

Cause of the problem

The speechStatus is in the InProgress state due to a race condition in the stt-async component. Pods that are associated with this component are stuck in 0/1 Running state. Run the following command to confirm this state:

oc get pods -l app.kubernetes.io/component=stt-async

Example output:

NAME                                   READY   STATUS    RESTARTS   AGE
speech-cr-stt-async-775d5b9d55-fpj8x   0/1     Running   0          60m

If one or more pods is in the 0/1 Running state for 20 minutes or more, this problem might occur.

Resolving the problem

For each pod in the 0/1 Running state, run the following command:

oc delete pod <stt-async-podname>

Common core services and dependent services in a failed state after an online restore

Applies to: 5.0.0

Diagnosing the problem

After you restore an online backup with the OADP backup and restore utility, the Common core services custom resource and the custom resource of dependent services remain in an InProgress state.

Cause of the problem

Intermittent Elasticsearch failure.

Workaround

To work around the problem, do the following steps:

Make sure that the current project (namespace) is set to the project that contains the Common core services and Watson Knowledge Catalog deployment.

Make sure that a valid backup is available by running the following command:

oc exec elasticsea-0ac3-ib-6fb9-es-server-esnodes-0 -c elasticsearch -- curl --request GET --url http://localhost:19200/_cat/snapshots/cloudpak  --header 'content-type: application/json'

When a valid backup is present, the command returns output like in the following example:

cloudpak_snapshot_<timestamp> SUCCESS <epoch_timestamp> <hh:mm:ss> <epoch_timestamp> <hh:mm:ss> 200ms 3 23 0 23

If a snapshot is not present, the restore has unexpectedly failed. Contact IBM Support for assistance.

If a valid snapshot is present, delete the indexes on the cluster:

oc exec -n ${PROJECT_CPD_INST_OPERANDS} elasticsea-0ac3-ib-6fb9-es-server-esnodes-0 -c elasticsearch -- curl --request DELETE --url 'http://localhost:19200/granite-3b,wkc,gs-system-index-wkc-v001,semantic' --header 'content-type: application/json'

Scale the OpenSearch cluster down by quiescing:

oc patch elasticsearchcluster elasticsearch-master --type merge --patch '{"spec": {"quiesce": true}}'

Wait for the pods to scale down, checking the status with the following command:
```
`watch "oc get pods | grep elasticsea"`
```

When all the pods are gone, restart the cluster by unquiescing it:

oc patch elasticsearchcluster elasticsearch-master --type merge --patch '{"spec": {"quiesce": false}}'

After you do these steps, Elasticsearch comes back up, and automatically kicks off recovery.

Backup validation fails because of missing resources in `wkc-foundationdb-cluster-aux-checkpoint-cm` ConfigMap

Applies to: 5.0.3 and later

Diagnosing the problem

Backup fails with an error message similar to the following example:

Failed with 1 error(s):
error: DataProtectionPlan=service-orchestration, Action=cpd-backup-validation (index=8)
                backup validation failed: 1 error occurred:
        * backup validation failed for configmap: wkc-foundationdb-cluster-aux-checkpoint-cm

Cause of the problem

The backup validation expected a secret to be present in the backup, but it wasn't captured.

Workaround

To resolve the issue, remove the following resources from the wkc-foundationdb-cluster-aux-checkpoint-cm ConfigMap:

- resource-kind: role.rbac.authorization.k8s.io
  validation-rules:
    - type: match_names
      names:
        - common-service-db-app
        - common-service-db-superuser
- resource-kind: rolebinding.rbac.authorization.k8s.io
  validation-rules:
    - type: match_names
      names:
        - common-service-db
- resource-kind: cluster.postgresql.k8s.enterprisedb.io
  validation-rules:
    - type: count
      op: ">="
      val: 1
      labels: "foundationservices.cloudpak.ibm.com=cs-db,icpdsupport/addOnId=cpfs,icpdsupport/app=postgres"

Now, backup should be validated when you run it again.

Restore posthooks fail to run when restoring Data Virtualization with the OADP utility

Applies to: 5.0.3

Diagnosing the problem

In the CPD-CLI*.log file, you see an error message like in the following example:

velero post-backup hooks in namespace <namespace> have one or more errors
check for errors in <cpd-cli location>, and try again
Error: velero post-backup hooks failed
[ERROR] <timestamp> RunPluginCommand:Execution error:  exit status 1

Cause of the problem

Velero hooks annotations are blocking the restore posthooks from running.

Get the Data Virtualization addon pod definition by running a command like in the following example:

oc get po dv-addon-6fdddc4bc7-8bdlq -o jsonpath="{.metadata.annotations}" | jq .

Example output that shows the Velero annotations:

...
  "post.hook.backup.velero.io/command": "[\"bash\", \"-c\", \"echo Executing post-backup no-op hook\"]",
  "post.hook.restore.velero.io/command": "[\"bash\", \"-c\", \"echo Executing post-resttore no-op hook\"]",
  "pre.hook.backup.velero.io/command": "[\"bash\", \"-c\", \"echo Executing pre-backup no-op hook\"]",
...

Resolving the problem

Remove the Velero hooks annotations. Because these annotations are not used, you can remove them from all pods. Run the following commands:

oc annotate po --all  post.hook.backup.velero.io/command- -n ${PROJECT_CPD_INST_OPERANDS}
oc annotate po --all  post.hook.restore.velero.io/command- -n ${PROJECT_CPD_INST_OPERANDS}
oc annotate po --all  pre.hook.backup.velero.io/command- -n ${PROJECT_CPD_INST_OPERANDS}

After the annotations are removed, rerun the restore posthooks command:

cpd-cli oadp restore posthooks \
--tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} \
--log-level=debug \
--verbose

Backup fails for the platform with error in EDB Postgres cluster

Applies to: 5.0.0 and later

Applies to: All backup and restore methods

Diagnosing the problem

For example, in IBM Storage Fusion, the backup fails at the Hook: br-service hooks/pre-backup stage in the backup sequence.

In the cpdbr-oadp.log file, you see the following error:

time=<timestamp> level=info msg=cmd stderr: Error: cannot take a cold backup of the primary instance 
or a target primary instance if the k8s.enterprisedb.io/snapshotAllowColdBackupOnPrimary annotation is not set to enabled

Cause of the problem

Labels and annotations in the EDB Postgres cluster resources were not updated after a switchover of the EDB Postgres cluster's primary instance and replica.

Resolving the problem

Use either the automatic or manual workaround.

Automatic workaround

After you apply the YAML files, the following workaround automatically runs as a prehook every time you take a backup. The issue is automatically handled, so you do not encounter it again, which is especially useful if you have set up automatic backups.

Check that the ${VERSION} environment variable is set in cpd_vars.sh to the correct Cloud Pak for Data version number.
Download the edb-patch-resources-legacy.yaml file.

Run the following command

oc apply -n ${PROJECT_CPD_INST_OPERANDS} -f edb-patch-resources-legacy.yaml

Complete the steps that apply to your backup and restore method.

Manual workaround

Complete the following steps to manually run the workaround:

Note: If another switchover of the EDB Postgres cluster's primary instance and replica happens after you apply the manual workaround, you must complete the workaround again before you take a backup.

Download the edb-patch.sh file.
Run the following command:
```
sh edb-patch.sh ${PROJECT_CPD_INST_OPERANDS}
```
Retry the backup.

Restoring an RSI-enabled backup fails

Applies to: 5.0.0-5.0.2

Fixed in: 5.0.3

Diagnosing the problem

Restoring an RSI-enabled backup with IBM Storage Fusion fails at the Hook: br-service-hooks-operators restore step. The cpdbr-tenant.log file shows the following error:

cannot create resource "mutatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope

Cause of the problem

Permissions are missing in the cpdbr-tenant-service-clusterrole clusterrole.

Resolving the problem

Do the following steps:

Install cpd-cli 5.0.3.

Upgrade the cpdbr service:

The cluster pulls images from the IBM Entitled Registry:

Environments with the scheduling service

cpd-cli oadp install \
--upgrade=true \
--component=cpdbr-tenant \
--tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} \
--cpdbr-hooks-image-prefix=icr.io/cpopen/cpd \
--cpd-scheduler-namespace=${PROJECT_SCHEDULING_SERVICE} \
--recipe-type=br \
--log-level=debug \
--verbose

Environments without the scheduling service

cpd-cli oadp install \
--upgrade=true \
--component=cpdbr-tenant \
--tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} \
--cpdbr-hooks-image-prefix=icr.io/cpopen/cpd \
--recipe-type=br \
--log-level=debug \
--verbose

The cluster pulls images from a private container registry:

Environments with the scheduling service

cpd-cli oadp install \
--upgrade=true \
--component=cpdbr-tenant \
--namespace=${OADP_OPERATOR_NS} \
--tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} \
--cpdbr-hooks-image-prefix=${PRIVATE_REGISTRY_LOCATION} \
--cpd-scheduler-namespace=${PROJECT_SCHEDULING_SERVICE} \
--recipe-type=br \
--log-level=debug \
--verbose

Environments without the scheduling service

cpd-cli oadp install \
--upgrade=true \
--component=cpdbr-tenant \
--namespace=${OADP_OPERATOR_NS} \
--tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} \
--cpdbr-hooks-image-prefix=${PRIVATE_REGISTRY_LOCATION} \
--recipe-type=br \
--log-level=debug \
--verbose

Retry the restore.

Restore fails at `Hook: br-service-hooks-operators restore` step

Applies to: 5.0.1

Fixed in: 5.0.2

Diagnosing the problem

This problem occurs when using IBM Storage Fusion 2.7.2.

The restore process fails at the Hook: br-service-hooks-operators restore step, and you see the following error message:
```
Recipe failed
BMYBR0003 There was an error when processing the job in the Transaction Manager service
```
The ${PROJECT_CPD_INST_OPERANDS} project was not created during the restore.
When you run the following commands, the IBM Storage Fusion application custom resource does not have the Cloud Pak for Data instance project listed under .spec.includeNamespaces.
```
export PROJECT_FUSION=<fusion-namespace>
```
Tip: By default, the IBM Storage Fusion project is ibm-spectrum-fusion-ns.
```
oc get fapp -n ${PROJECT_FUSION} ${PROJECT_CPD_INST_OPERATORS} -o json | jq .spec
```

Cause of the problem

The backup is incomplete, causing the restore to fail.

Resolving the problem

Do the following steps:

Install cpd-cli 5.0.2.

Upgrade the cpdbr service:

The cluster pulls images from the IBM Entitled Registry:

Environments with the scheduling service

cpd-cli oadp install \
--upgrade=true \
--component=cpdbr-tenant \
--tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} \
--cpdbr-hooks-image-prefix=icr.io/cpopen/cpd \
--cpd-scheduler-namespace=${PROJECT_SCHEDULING_SERVICE} \
--recipe-type=br \
--log-level=debug \
--verbose

Environments without the scheduling service

cpd-cli oadp install \
--upgrade=true \
--component=cpdbr-tenant \
--tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} \
--cpdbr-hooks-image-prefix=icr.io/cpopen/cpd \
--recipe-type=br \
--log-level=debug \
--verbose

The cluster pulls images from a private container registry:

Environments with the scheduling service

cpd-cli oadp install \
--upgrade=true \
--component=cpdbr-tenant \
--namespace=${OADP_OPERATOR_NS} \
--tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} \
--cpdbr-hooks-image-prefix=${PRIVATE_REGISTRY_LOCATION} \
--cpd-scheduler-namespace=${PROJECT_SCHEDULING_SERVICE} \
--recipe-type=br \
--log-level=debug \
--verbose

Environments without the scheduling service

cpd-cli oadp install \
--upgrade=true \
--component=cpdbr-tenant \
--namespace=${OADP_OPERATOR_NS} \
--tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} \
--cpdbr-hooks-image-prefix=${PRIVATE_REGISTRY_LOCATION} \
--recipe-type=br \
--log-level=debug \
--verbose

Patch policy assignments with the backup and restore recipe details.

Log in to Red Hat OpenShift Container Platform as an instance administrator.
```
${OC_LOGIN}
```
Remember: OC_LOGIN is an alias for the oc login command.

Get each policy assignment name:

export PROJECT_FUSION=<fusion-namespace>

oc get policyassignment -n ${PROJECT_FUSION}

If installed, patch the ${PROJECT_SCHEDULING_SERVICE} policy assignment:

oc -n ${PROJECT_FUSION} patch policyassignment <cpd-scheduler-policy-assignment> --type merge -p '{"spec":{"recipe":{"name":"ibmcpd-scheduler", "namespace":"${PROJECT_SCHEDULING_SERVICE}", "apiVersion":"spp-data-protection.isf.ibm.com/v1alpha1"}}}'

Patch the Cloud Pak for Data tenant policy assignment:

oc -n ${PROJECT_FUSION} patch policyassignment <cpd-tenant-policy-assignment> --type merge -p '{"spec":{"recipe":{"name":"ibmcpd-tenant", "namespace":"${PROJECT_CPD_INST_OPERATORS}", "apiVersion":"spp-data-protection.isf.ibm.com/v1alpha1"}}}'

Check that the IBM Storage Fusion application custom resource for the Cloud Pak for Data operator includes the following information:
- All projects (namespaces) that are members of the Cloud Pak for Data instance, including:
  - The Cloud Pak for Data operators project (${PROJECT_CPD_INST_OPERATORS}).
  - The Cloud Pak for Data operands project (${PROJECT_CPD_INST_OPERANDS}).
  - All tethered projects, if they exist.
- The PARENT_NAMESPACE variable, which is set to ${PROJECT_CPD_INST_OPERATORS}.
1. To get the list of all projects that are members of the Cloud Pak for Data instance, run the following command:
```
oc get -n ${PROJECT_FUSION} applications.application.isf.ibm.com ${PROJECT_CPD_INST_OPERATORS} -o jsonpath={'.spec.includedNamespaces'}
```
2. To get the PARENT_NAMESPACE variable, run the following command:
```
oc get -n ${PROJECT_FUSION} applications.application.isf.ibm.com ${PROJECT_CPD_INST_OPERATORS} -o jsonpath={'.spec.variables'}
```
Take a new backup.

Data Virtualization restore fails at post-workload step

Applies to: 5.0.0-5.0.2

Fixed in: 5.0.3

Diagnosing the problem

When restoring an online backup of a Cloud Pak for Data deployment that includes Data Virtualization with IBM Storage Fusion, the restore fails at the Hook: br-service-hooks/post-workload step in the restore sequence. In the log file, you see the following error message:

time=<timestamp> level=info msg=   zen/configmap/cpd-dv-aux-ckpt-cm: component=dv, 
op=<mode=post-restore,type=config-hook,method=rule>, status=error func=cpdbr-oadp/pkg/quiesce.logPlanResult 
file=/go/src/cpdbr-oadp/pkg/quiesce/planexecutor.go:1137

Workaround

To work around the problem, do the following steps:

Scale down the Data Virtualization hurricane pod:
```
oc scale deployment c-db2u-dv-hurricane-dv --replicas=0
```
Log in to the Data Virtualization head pod:
```
oc rsh c-db2u-dv-db2u-0 bash
```
```
su - db2inst1
```

Create a backup copy of the users.json file:

cp /mnt/blumeta0/db2_config/users.json /mnt/PV/versioned/logs/users.json.original

Edit the users.json file:
```
vi /mnt/blumeta0/db2_config/users.json
```
Locate "locked":true and change it to "locked":false.
Scale up the Data Virtualization hurricane pod:
```
oc scale deployment c-db2u-dv-hurricane-dv --replicas=1
```
Restart BigSQL from the Data Virtualization head pod:
```
oc exec -it c-db2u-dv-db2u-0 -- su - db2inst1 -c "bigsql start"
```
The Data Virtualization head and worker pods continue with the startup sequence.
Wait until the Data Virtualization head and worker pods are fully started by running the following 2 commands:
```
oc get pods | grep -i c-db2u-dv-dvcaching | grep 1/1 | grep -i Running
```
```
oc exec -t c-db2u-dv-db2u-0 -- su - db2inst1 -c "ls -ls /tmp" | grep dv_setup_complete
```
The Data Virtualization head and worker pods are fully started when these 2 commands return grep results instead of empty results.
Re-create marker file that is needed by Data Virtualization's post-restore hook logic:
```
oc exec -t c-db2u-dv-db2u-0 -- su - db2inst1 -c "touch /tmp/.ready_to_connectToDb"
```

Re-run the post-restore hook.

Get the cpdbr-tenant-service pod ID:
```
oc get po -A | grep "cpdbr-tenant-service"
```

oc rsh -n ${PROJECT_CPD_INST_OPERATORS} <cpdbr-tenant-service pod id>

Run the following commands:

/cpdbr-scripts/cpdbr-oadp restore posthooks --hook-kind=checkpoint --include-namespaces=${PROJECT_CPD_INST_OPERATORS},${PROJECT_CPD_INST_OPERANDS} --log-level=debug --verbose

/cpdbr-scripts/cpdbr/cpdbr-cpd-operators.sh restore-namespacescope --operators-namespace ${PROJECT_CPD_INST_OPERATORS} --foundation-namespace ${PROJECT_CPD_INST_OPERATORS}

Restoring an online backup of Cloud Pak for Data on IBM Storage Scale Container Native storage fails

Applies to: IBM Storage Fusion 2.7.2 and later

Diagnosing the problem: When you restore an online backup with IBM Storage Fusion, the restore process fails at the Volume group: cpd-volumes step in the restore sequence.
Workaround: This problem occurs when you have Persistent Volume Claims (PVCs) that are smaller than 5Gi. To work around the problem, expand any PVC that is smaller than 5Gi to at least 5Gi before you create the backup. For details, see Volume Expansion in the IBM Storage Scale Container Storage Interface Driver documentation.
Note: You cannot manually expand Watson OpenScale PVCs. To manage PVC sizes for Watson OpenScale, see Managing persistent volume sizes for Watson OpenScale.

Backup failed at Volume group: cpd-volumes stage

Applies to: IBM Storage Fusion 2.7.2

Fixed in: IBM Storage Fusion 2.7.2 hotfix

Diagnosing the problem

In the backup sequence in IBM Storage Fusion 2.7.2, the backup fails at the Volume group: cpd-volumes stage.

The transaction manager log shows several error messages, such as the following examples:

<timestamp>[TM_0] - Error: Processing of volume cc-home-pvc failed.\n", "<timestamp>[VOL_12] -

Snapshot exception (410)\\nReason: Expired: too old resource version: 2575013 (2575014)

Workaround

Install the IBM Storage Fusion 2.7.2 hotfix. For details, see IBM Storage Fusion and IBM Storage Fusion HCI hotfix.

Backup of Cloud Pak for Data operators project fails at data transfer stage

Applies to: IBM Storage Fusion 2.7.2

Fixed in: IBM Storage Fusion 2.7.2 hotfix

Diagnosing the problem

In IBM Storage Fusion 2.7.2, the backup fails at the Data transfer stage, with the following error:

Failed transferring data
There was an error when processing the job in the Transaction Manager service

Cause

The length of a Persistent Volume Claim (PVC) name is more than 59 characters.

Workaround

Install the IBM Storage Fusion 2.7.2 hotfix. For details, see IBM Storage Fusion and IBM Storage Fusion HCI hotfix.

With the hotfix, PVC names can be up to 249 characters long.

Watson OpenScale etcd server fails to start after restoring from a backup

Applies to: 5.0.0 and later

Diagnosing the problem

After restoring a backup with NetApp Astra Control Center, the Watson OpenScale etcd cluster is in a Failed state.

Workaround

To work around the problem, do the following steps:

Log in to Red Hat OpenShift Container Platform as a cluster administrator.
```
${OC_LOGIN}
```
Remember: OC_LOGIN is an alias for the oc login command.

Expand the size of the etcd PersistentVolumes by 1Gi.

In the following example, the current PVC size is 10Gi, and the commands set the new PVC size to 11Gi.

operatorPod=`oc get pod -n ${PROJECT_CPD_INST_OPERATORS} -l name=ibm-cpd-wos-operator | awk 'NR>1 {print $1}'`
oc exec ${operatorPod} -n ${PROJECT_CPD_INST_OPERATORS} -- roles/service/files/etcdresizing_for_resizablepv.sh  -n ${PROJECT_CPD_INST_OPERANDS} -s 11Gi

Wait for the reconciliation status of the Watson OpenScale custom resource to be in a Completed state:
```
oc get WOService aiopenscale -n ${PROJECT_CPD_INST_OPERANDS} -o jsonpath='{.status.wosStatus} {"\n"}'
```
The status of the custom resource changes to Completed when the reconciliation finishes successfully.

Restore fails at the running post-restore script step

Applies to: 5.0.3

Diagnosing the problem

When you use Portworx asynchronous disaster recovery, activating applications fails when you run the post-restore script. In the restore_post_hooks_<timestamp>.log file, you see an error message such as in the following example:

Time: <timestamp> level=error -  cpd-tenant-restore-<timestamp>-r2 failed
/cpdbr-scripts/cpdbr/cpdbr-tenant.sh post-restore exit code=1
*** cpdbr-tenant.sh post-restore failed ***
command terminated with exit code 1

Resolving the problem

To work around the problem, prior to running the post-restore script, restore custom resource definitions by running the following command:

cpd-cli oadp restore create <restore-name-r2> \
--from-backup=cpd-tenant-backup-<timestamp>-b2 \
--include-resources='customresourcedefinitions' \
--include-cluster-resources=true \
--skip-hooks \
--log-level=debug \
--verbose

Cloud Pak for Data resources are not migrated

Applies to: 5.0.2 and later

Diagnosing the problem

When you use Portworx asynchronous disaster recovery, the migration finishes almost immediately, and no volumes or the expected number of resources are migrated. Run the following command:

storkctl get migrations -n ${PX_ADMIN_NS}

Tip: ${PX_ADMIN_NS} is usually kube-system.

Example output:

NAME                                                CLUSTERPAIR       STAGE   STATUS       VOLUMES   RESOURCES   CREATED               ELAPSED                       TOTAL BYTES TRANSFERRED
cpd-tenant-migrationschedule-interval-<timestamp>   mig-clusterpair   Final   Successful   0/0       0/0         <timestamp>   Volumes (0s) Resources (3s)   0

Cause of the problem

This problem occurs starting with stork 23.11.0. Backup exec rules are not run, and expected Cloud Pak for Data resources are not migrated.

Resolving the problem

To resolve the problem, downgrade stork to a version prior to 23.11.0. For more information about stork releases, see the stork Releases page.

Scale down the Portworx operator so that it doesn't reset manual changes to the stork deployment:
```
oc scale -n ${PX_ADMIN_NS} deploy portworx-operator --replicas=0
```
Edit the stork deployment image version to a version prior to 23.11.0:
```
oc edit deploy -n ${PX_ADMIN_NS} stork
```
If you need to scale up the Portworx operator, run the following command.
Note: The Portworx operator will undo changes to the stork deployment and return to the original stork version.
```
oc scale -n ${PX_ADMIN_NS} deploy portworx-operator --replicas=1
```

Custom foundation models missing after backup and restore

Applies to: 5.0.0 and later

Applies to: Offline backup with the OADP utility

Diagnosing the problem

When performing offline backup and restore, BYOM (Bring Your Own Model) custom foundation model deployments are not included in the backup, and they do not exist after the restore. Additionally, inference on BYOM deployments fails after restore even when the deployment metadata is restored.

Cause of the problem

The issue occurs because the watsonx-ai-ifm operator's offline backup hook deletes all inferenceservices without excluding BYOM custom foundation models. The models are deleted during backup preparation, and they are not captured in the backup.

Resolving the problem

To resolve the issue, you must modify the deletion command before taking a backup. And you must restart the caikit-runtime-stack-operator pod after you complete the backup and restore process.

Before taking a backup

Before taking an offline backup, you must do the following steps:

Note: If the operator pod is deleted or restarted, all these changes are lost. These steps must be repeated every time a new operator pod is created.

Put the watsonxaiifm operator and custom resource (CR) into maintenance mode:

oc patch watsonxaiifm watsonxaiifm-cr --namespace $PROJECT_INSTANCE_NAMESPACE --type=merge --patch '{"spec": {"forceReconcile":"true"}}'

Switch to the operator namespace:
```
oc project $PROJECT_OPERATOR_NAMESPACE
```

Access the watsonx-ai-ifm operator pod:

oc rsh ibm-cpd-watsonx-ai-ifm-operator-xxxxxxx

Navigate to the templates directory:

cd /opt/ansible/11.1.0/roles/watsonxaiifm-post-install/templates

Backup the original file:

cp post-cpbr-watsonxaiifm-inference-services-br-config.yaml.j2 \
   post-cpbr-watsonxaiifm-inference-services-br-config.yaml.j2.bak

Modify the deletion command to exclude custom foundation models:

sed -i 's|command:  \["/bin/bash", "-c", "kubectl delete inferenceservices -l '\''icpdsupport/addOnId=watsonx_ai_ifm'\'' ANDAND kubectl delete inferenceservices -l '\''syom_model'\''"\]|command: ["/bin/bash", "-c", "kubectl delete inferenceservices -l '\''icpdsupport/addOnId=watsonx_ai_ifm,model_type!=custom_foundation_model,type!=cfm'\'' -n $NAMESPACE ANDAND kubectl delete inferenceservices -l '\''syom_model,model_type!=custom_foundation_model,type!=cfm'\'' -n $NAMESPACE"]|' \
post-cpbr-watsonxaiifm-inference-services-br-config.yaml.j2

Verify the change:

grep -n "kubectl delete inferenceservices" \
post-cpbr-watsonxaiifm-inference-services-br-config.yaml.j2

Exit the container: Ctrl + D

Remove the operator and CR from maintenance mode:

oc patch watsonxaiifm watsonxaiifm-cr --namespace $PROJECT_INSTANCE_NAMESPACE --type=merge --patch '{"spec": {"forceReconcile":"false"}}'

After completing a restore

After you complete the offline restore, restart the caikit-runtime-stack-operator pod.

Switch to operator namespace:
```
oc project $PROJECT_OPERATOR_NAMESPACE
```
List and identify the caikit operator pod:
```
oc get pod|grep caikit
```
Delete the caikit-runtime-stack-operator pod to trigger restart:
```
oc delete pod caikit-runtime-stack-operator-9dcc859d8-bmz8q
```
Verify fmaas pods are running in the instance namespace:
```
oc project cpd-instance
oc get pod|grep fmaas
```
Confirm the following pods are running:
- fmaas-caikit-inf-prompt-tunes
- fmaas-caikit-trainer
- fmaas-mt
- fmaas-router

Creating an offline backup in REST mode stalls

Applies to: 5.0.0 and later

Diagnosing the problem

This problem occurs when you try to create an offline backup in REST mode by using a custom --image-prefix value. The offline backup stalls with cpdbr-vol-mnt pods in the ImagePullBackOff state.

Cause of the problem

When you specify the --image-prefix option in the

cpd-cli
              oadp backup create

command, the default prefix registry.redhat.io/ubi9 is always used.

Resolving the problem

To work around the problem, create the backup in Kubernetes mode instead. To change to this mode, run the following command:

cpd-cli oadp client config set runtime-mode=

Common core services custom resource is in `InProgress` state after an offline restore to a different cluster

Applies to: 5.0.0, 5.0.1

Fixed in: 5.0.2

Diagnosing the problem

Get the status of installed components by running the following command.
```
cpd-cli manage get-cr-status \
--cpd_instance_ns=${PROJECT_CPD_INST_OPERANDS}
```
Check that the status of ccs-cr is InProgress.

Cause of the problem

The Common core services component failed to reconcile on the restored cluster, because the dsx-requisite-pre-install-job-<xxxx> pod job is failing.

Resolving the problem

To resolve the problem, follow the instructions that are described in the technote Failed dsx-requisite-pre-install-job during offline restore.

OpenPages offline backup fails with pre-hook error

Applies to: 5.0.1, 5.0.2

Fixed in: 5.0.3

Diagnosing the problem

The CPD-CLI*.log file shows pre-backup hook errors such as in the following example:

<time>  Hook execution breakdown by status=error/timedout:
<time>  
<time>  The following hooks either have errors or timed out
<time>  
<time>  pre-backup (1):
<time>  
<time>      	COMPONENT                     	CONFIGMAP                               	METHOD	STATUS	DURATION      
<time>      	openpages-openpagesinstance-cr	openpages-openpagesinstance-cr-aux-br-cm	rule  	error 	6m0.080179343s
<time>  
<time>  --------------------------------------------------------------------------------
<time>  
<time>  
<time>  ** INFO [BACKUP CREATE/SUMMARY/END] *******************************************
<time>  Error: error running pre-backup hooks: Error running pre-processing rules.  Check the /root/br/backup/cpd-cli-workspace/logs/CPD-CLI-<timestamp>.log for errors.
<time>  [ERROR] <timestamp> RunPluginCommand:Execution error:  exit status 1
<time>  nfs0717bak-tenant-offline-b1 k8s offline backup failed

Cause of the problem

Getting the OpenPages custom resource into the InMaintenance state timed out.

Workaround

Increase the pre-hooks timeout value in the openpages-openpagesinstance-cr-aux-br-cm ConfigMap.

Edit the openpages-openpagesinstance-cr-aux-br-cm ConfigMap:

oc edit cm openpages-openpagesinstance-cr-aux-br-cm -n ${PROJECT_CPD_INST_OPERANDS}

Under pre-hooks, change the timeout value to 600s.

pre-hooks:
      exec-rules:
      - resource-kind: OpenPagesInstance
        name: openpagesinstance-cr
        actions:
        - builtins:
            name: cpdbr.cpd.ibm.com/enable-maint
            params:
              statusFieldName: openpagesStatus
            timeout: 600s

Offline backup pre-hooks fail on Separation of Duties cluster

Applies to: 5.0.0 and later

Diagnosing the problem

The CPD-CLI*.log file shows pre-backup hook errors such as in the following example:

<timestamp> level=info msg=   test-watsonxgovernce-instance/configmap/cpd-analyticsengine-aux-br-cm: component=analyticsengine-br, op=<mode=pre-backup,type=config-hook,method=rule>, status=error func=cpdbr-oadp/pkg/quiesce.logPlanResult file=/a/workspace/oadp-upload/pkg/quiesce/planexecutor.go:1137
...
time=<timestamp> level=info msg=   test-watsonxgovernce-instance/configmap/cpd-analyticsengine-cnpsql-aux-br-cm: component=analyticsengine-cnpsql-br, op=<mode=pre-backup,type=config-hook,method=rule>, status=error func=cpdbr-oadp/pkg/quiesce.logPlanResult file=/a/workspace/oadp-upload/pkg/quiesce/planexecutor.go:1137

Cause of the problem

The EDB Postgres pod for the Analytics Engine powered by Apache Spark service is in a CrashLoopBackOff state.

Workaround

To work around the problem, follow the instructions in the IBM Support document Unable to upgrade Spark due to Enterprise database issues.

Offline restore to a different cluster fails due to management-ingress-ibmcloud-cluster-info ConfigMap not found in PodVolumeRestore

Applies to: 5.0.0

Fixed in: 5.0.1

Diagnosing the problem

After an offline backup is created, but before doing a restore, check if the management-ingress-ibmcloud-cluster-info ConfigMap was backed up by running the following commands:

cpd-cli oadp backup status --details <backup_name1> | grep management-ingress-ibmcloud-cluster-info

cpd-cli oadp backup status --details <backup_name2> | grep management-ingress-ibmcloud-cluster-info

During or after the restore, pods that mount the missing ConfigMap show errors. For example:

oc describe po c-db2oltp-wkc-db2u-0 -n ${PROJECT_CPD_INST_OPERANDS}

Example output:

Events:
  Type     Reason       Age                  From     Message
  ----     ------       ----                 ----     -------
  Warning  FailedMount  41m (x512 over 17h)  kubelet  MountVolume.SetUp failed for volume "management-ingress-ibmcloud-cluster-info" : configmap "management-ingress-ibmcloud-cluster-info" not found
  Warning  FailedMount  62s (x518 over 17h)  kubelet  Unable to attach or mount volumes: unmounted volumes=[management-ingress-ibmcloud-cluster-info], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition

Cause of the problem

When a related ibmcloud-cluster-info ConfigMap gets excluded as part of backup hooks, the management-ingress-ibmcloud-cluster-info ConfigMap copies the exclude labeling and unintentionally gets excluded from the backup.

Workaround

To work around the problem, do the following steps:

Log in to Red Hat OpenShift Container Platform as a cluster administrator.
```
${OC_LOGIN}
```
Remember: OC_LOGIN is an alias for the oc login command.

Apply the following patch to ensure that the management-ingress-ibmcloud-cluster-info ConfigMap is not excluded from the backup:

oc apply -n ${PROJECT_CPD_INST_OPERANDS} -f - << EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: cpdbr-management-ingress-exclude-fix-br
  labels:
    cpdfwk.aux-kind: br
    cpdfwk.component: cpdbr-patch
    cpdfwk.module: cpdbr-management-ingress-exclude-fix
    cpdfwk.name: cpdbr-management-ingress-exclude-fix-br-cm
    cpdfwk.managed-by: ibm-cpd-sre
    cpdfwk.vendor: ibm
    cpdfwk.version: 1.0.0
data:
  aux-meta: |
    name: cpdbr-management-ingress-exclude-fix-br
    description: |
      This configmap defines offline backup prehooks to prevent cases where Bedrock's management-ingress-ibmcloud-cluster-info
      configmap gets unexpectedly excluded when ibmcloud-cluster-info is excluded during cs-postgres configmap prehooks.
      This is a temporary workaround until a complete fix is implemented.
    version: 1.0.0
    component: cpdbr-patch
    aux-kind: br
    priority-order: 99999 # This should happen at the end of backup prehooks
  backup-meta: |
    pre-hooks:
      exec-rules:
      # Remove lingering velero exclude label from offline prehooks
      - resource-kind: configmap
        name: management-ingress-ibmcloud-cluster-info
        actions:
          - builtins:
              name: cpdbr.cpd.ibm.com/label-resources
              params:
                action: remove
                key: velero.io/exclude-from-backup
                value: "true"
              timeout: 360s
      # Remove lingering ignore-on-nd-backup exclude label from online checkpoint operation
      - resource-kind: configmap
        name: management-ingress-ibmcloud-cluster-info
        actions:
          - builtins:
              name: cpdbr.cpd.ibm.com/label-resources
              params:
                action: remove
                key: icpdsupport/ignore-on-nd-backup
                value: "true"
              timeout: 360s
    post-hooks:
      exec-rules: 
      - resource-kind: # do nothing for posthooks
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: cpdbr-management-ingress-exclude-fix-ckpt
  labels:
    cpdfwk.aux-kind: checkpoint
    cpdfwk.component: cpdbr-patch
    cpdfwk.module: cpdbr-management-ingress-exclude-fix
    cpdfwk.name: cpdbr-management-ingress-exclude-fix-ckpt-cm
    cpdfwk.managed-by: ibm-cpd-sre
    cpdfwk.vendor: ibm
    cpdfwk.version: 1.0.0
data:
  aux-meta: |
    name: cpdbr-management-ingress-exclude-fix-ckpt
    description: |
      This configmap defines online backup prehooks to prevent cases where Bedrock's management-ingress-ibmcloud-cluster-info
      configmap gets unexpectedly excluded when ibmcloud-cluster-info is excluded during cs-postgres configmap checkpoint operation.
      This is a temporary workaround until a complete fix is implemented.
    version: 1.0.0
    component: cpdbr-patch
    aux-kind: ckpt
    priority-order: 99999 # This should happen at the end of backup prehooks
  backup-meta: |
    pre-hooks:
      exec-rules:
      # Remove lingering velero exclude label from offline prehooks
      - resource-kind: configmap
        name: management-ingress-ibmcloud-cluster-info
        actions:
          - builtins:
              name: cpdbr.cpd.ibm.com/label-resources
              params:
                action: remove
                key: velero.io/exclude-from-backup
                value: "true"
              timeout: 360s
      # Remove lingering ignore-on-nd-backup exclude label from online checkpoint operation
      - resource-kind: configmap
        name: management-ingress-ibmcloud-cluster-info
        actions:
          - builtins:
              name: cpdbr.cpd.ibm.com/label-resources
              params:
                action: remove
                key: icpdsupport/ignore-on-nd-backup
                value: "true"
              timeout: 360s
    post-hooks:
      exec-rules: 
      - resource-kind: # do nothing for posthooks
  checkpoint-meta: |
    exec-hooks:
      exec-rules: 
      - resource-kind: # do nothing for checkpoint
EOF

Unable to restore offline backup of OpenPages to different cluster

Applies to: 5.0.0

Fixed in: 5.0.1

Diagnosing the problem

In the CPD-CLI*.log file, you see an error like in the following example:

CPD-CLI-<timestamp>.log:time=<timestamp> level=error msg=failed to wait for statefulset openpages--78c5-ib-12ce in namespace <cpd_instance_ns>: 
timed out waiting for the condition func=cpdbr-oadp/pkg/kube.waitForStatefulSetPods file=/a/workspace/oadp-upload/pkg/kube/statefulset.go:173

Cause of the problem

The second RabbitMQ pod (ending in -1) is in a CrashLoopBackOff state. Run the following command:

oc get pods -n ${PROJECT_CPD_INST_OPERANDS} | grep openpages

Example output:

openpages--78c5-ib-12ce-0                                1/1     Running                 0                 23h
openpages--78c5-ib-12ce-1                                0/1     CrashLoopBackOff        248 (3m57s ago)   23h
openpages-openpagesinstance-cr-sts-0                     1/2     Running                 91 (12m ago)      23h
openpages-openpagesinstance-cr-sts-1                     1/2     Running                 91 (12m ago)      23h

Workaround

To work around the problem, do the following steps:

Log in to Red Hat OpenShift Container Platform as a cluster administrator.
```
${OC_LOGIN}
```
Remember: OC_LOGIN is an alias for the oc login command.

Check the OpenPages logs for the following error in the second RabbitMQ pod:

===========
Exception during startup:

exit:{boot_failed,{exit_status,1}}

    peer:start_it/2, line 639
    rabbit_peer_discovery:query_node_props/1, line 408
    rabbit_peer_discovery:sync_desired_cluster/3, line 189
    rabbit_db:init/0, line 65
    rabbit_boot_steps:-run_step/2-lc$^0/1-0-/2, line 51
    rabbit_boot_steps:run_step/2, line 58
    rabbit_boot_steps:-run_boot_steps/1-lc$^0/1-0-/1, line 22
    rabbit_boot_steps:run_boot_steps/1, line 23

If you see this error, check the Erlang cookie value at the top of the OpenPages logs. For example, run the following command:

oc logs openpages--78c5-ib-12ce-1

Example output:

Defaulted container "openpages-openpagesinstance-cr-<instance_id>-ibm-rabbitmq" out of: openpages-openpagesinstance-cr-<instance_id>-ibm-rabbitmq, copy-rabbitmq-config (init)
----------------------
+FkpbwejzK2RXfmPLQAnITroiieu3uGa3vkRA2k6t+8=
----------------------
<timestamp> [warning] <0.156.0> Overriding Erlang cookie using the value set in the environment

The plus sign (+) at the beginning of the cookie value is the source of the problem.

Regenerate a new token:

openssl rand -base64 32 | tr -d '\\n' | base64 | tr -d '\\n'

Decode from Base64 format, and make sure that the cookie value does not begin with a plus sign (+).
Replace the cookie value in the auth secret.
1. Edit the auth secret:
```
oc edit secret openpages-openpagesinstance-cr-<instance_id>-rabbitmq-auth-secret
```
2. Replace the rabbitmq-erlang-cookie value with the new value.
Delete the StatefulSet, or scale down and then scale up to get all the pods to pick up the new cookie.

Flight service issues

Security issues

Security scans return an `Inadequate Account Lockout Mechanism` message

Applies to: 5.0.0 and later

Diagnosing the problem

If you run a security scan against Cloud Pak for Data, the scan returns the following message.

Inadequate Account Lockout Mechanism

Resolving the problem

This is by design. It is strongly recommended that you use an enterprise-grade password management solution, such as SAML SSO or an LDAP provider for password management, as described in the following resources.

The Kubernetes version information is disclosed

Applies to: 5.0.0 and later

Diagnosing the problem

If you run an Aqua Security scan against your cluster, the scan returns the following issue:

KHV002 - Kubernetes issue disclosure

Resolving the problem

This is expected based on the following solution document from Red Hat OpenShift Customer Portal: Hide kubernetes /version API endpoint in OpenShift Container Platform 4.

Known issues and limitations for IBM Cloud Pak for Data

Customer-reported issues

General issues

After rebooting a cluster that uses OpenShift Data Foundation storage, some Cloud Pak for Data services aren't functional

The Assist me icon is not displayed in the web client

The delete-platform-ca-certs command does not remove certificate mounts from pods

When you add a secret to a vault, you cannot filter the list of users and groups to show only groups

The wml service key does not work and health commands must use the --service option

The cpd-cli cluster command fails on ROSA with hosted control planes

Installation and upgrade issues

The Switch locations icon is not available if the apply-cr command times out

Upgrades fail if the Data Foundation Rook Ceph cluster is unstable

Running the apply-olm command twice during an upgrade can remove required OLM resources

After an upgrade from Cloud Pak for Data 4.7.3, FoundationDB can indicate a Failed status

After you upgrade a Red Hat OpenShift Container Platform cluster, the FoundationDB resource can become unavailable

After you upgrade from Cloud Pak for Data 4.7.4, generating a bearer token fails in an IAM-enabled cluster

Secrets are not visible in connections after upgrade

Node pinning is not applied to postgresql pods

You must manually clean up remote physical location artifacts if the create-physical-location command fails

The ibm-nginx deployment does not scale fast enough when automatic scaling is configured

Backup and restore issues

OADP backup is missing EDB Postgres PVCs

Disk usage size error when running the du-pv command

After restore, watsonx Assistant custom resource is stuck in InProgress at 11/19 verified state

After restore, watsonx Assistant is stuck on the 17/19 deployed state or custom resource is stuck in InProgress state

OADP backup precheck command fails

During or after a restore, pod shows PVC is missing

After restoring an online backup, status of Watson Discovery custom resource remains in InProgress state

After successful restore, the ibm-common-service-operator deployment fails to reach a Running state

Restore fails with Error from server (Forbidden): configmaps is forbidden error

After a restore, unable to access the Cloud Pak for Data console

After a successful restore, the Cloud Pak for Data console points to the source cluster domain in its URL instead of the target cluster domain

Unable to back up Watson Discovery when the service is scaled to the xsmall size

In a Cloud Pak for Data deployment that has multiple OpenPages instances, only the first instance is successfully restored

Unable to log in to Cloud Pak for Data with OpenShift cluster credentials after successfully restoring to a different cluster

After a restore, OperandRequest timeout error in the ZenService custom resource

Online restore of Data Virtualization fails with post-hook errors

Online backup of Analytics Engine powered by Apache Spark fails

Watson Speech services status is stuck in InProgress after restore

Common core services and dependent services in a failed state after an online restore

Backup validation fails because of missing resources in wkc-foundationdb-cluster-aux-checkpoint-cm ConfigMap

Restore posthooks fail to run when restoring Data Virtualization with the OADP utility

Backup fails for the platform with error in EDB Postgres cluster

Restoring an RSI-enabled backup fails

Restore fails at Hook: br-service-hooks-operators restore step

Data Virtualization restore fails at post-workload step

Restoring an online backup of Cloud Pak for Data on IBM Storage Scale Container Native storage fails

Backup failed at Volume group: cpd-volumes stage

Backup of Cloud Pak for Data operators project fails at data transfer stage

Watson OpenScale etcd server fails to start after restoring from a backup

Restore fails at the running post-restore script step

Cloud Pak for Data resources are not migrated

Custom foundation models missing after backup and restore

Creating an offline backup in REST mode stalls

Common core services custom resource is in InProgress state after an offline restore to a different cluster

OpenPages offline backup fails with pre-hook error

Offline backup pre-hooks fail on Separation of Duties cluster

Offline restore to a different cluster fails due to management-ingress-ibmcloud-cluster-info ConfigMap not found in PodVolumeRestore

Unable to restore offline backup of OpenPages to different cluster

Flight service issues

Security issues

Security scans return an Inadequate Account Lockout Mechanism message

The Kubernetes version information is disclosed

The `delete-platform-ca-certs` command does not remove certificate mounts from pods

The `wml` service key does not work and `health` commands must use the `--service` option

The `cpd-cli cluster` command fails on ROSA with hosted control planes

The Switch locations icon is not available if the `apply-cr` command times out

Running the `apply-olm` command twice during an upgrade can remove required OLM resources

After an upgrade from Cloud Pak for Data 4.7.3, FoundationDB can indicate a `Failed` status

Node pinning is not applied to `postgresql` pods

You must manually clean up remote physical location artifacts if the `create-physical-location` command fails

The `ibm-nginx` deployment does not scale fast enough when automatic scaling is configured

`Disk usage size` error when running the `du-pv` command

After restore, watsonx Assistant custom resource is stuck in `InProgress` at `11/19` verified state

After restore, watsonx Assistant is stuck on the `17/19` deployed state or custom resource is stuck in `InProgress` state

After restoring an online backup, status of Watson Discovery custom resource remains in `InProgress` state

After successful restore, the ibm-common-service-operator deployment fails to reach a `Running` state

Restore fails with `Error from server (Forbidden): configmaps is forbidden` error

Unable to back up Watson Discovery when the service is scaled to the `xsmall` size

Watson Speech services status is stuck in `InProgress` after restore

Backup validation fails because of missing resources in `wkc-foundationdb-cluster-aux-checkpoint-cm` ConfigMap

Restore fails at `Hook: br-service-hooks-operators restore` step

Common core services custom resource is in `InProgress` state after an offline restore to a different cluster

Security scans return an `Inadequate Account Lockout Mechanism` message