Known issues and limitations for IBM Cloud Pak for Data
The following issues apply to the IBM Cloud Pak for Data platform. Each issue includes information about the releases that it applies to. If the issue was fixed in a refresh, that information is also included.
- Customer-reported issues
- General issues
- Installation and upgrade issues
- Backup and restore issues
- Flight service issues
- Security issues
The following issues apply to IBM Cloud Pak for Data services.
Customer-reported issues
Issues that are found after the release are posted on the IBM Support site.
General issues
- After rebooting a cluster that uses OpenShift Data Foundation storage, some Cloud Pak for Data services aren't functional
- The Assist me icon is not displayed in the web client
- The delete-platform-ca-certs command does not remove certificate mounts from pods
- When you add a secret to a vault, you cannot filter the list of users and groups to show only groups
- The wml service key does not work and health commands must use the --services option
- The cpd-cli cluster command fails on ROSA with hosted control planes
After rebooting a cluster that uses OpenShift Data Foundation storage, some Cloud Pak for Data services aren't functional
Applies to: 5.0.0 and later
- Diagnosing the problem
- After rebooting the cluster, some Cloud Pak for Data custom resources remain in the InProgress state. For more information about this problem, see Missing NodeStageVolume RPC call blocks new pods from going into Running state in the Red Hat® OpenShift® Data Foundation 4.14 release notes.
- Workaround
- Do the following steps:
- Find the nodes that have pods that are in an Error state:
oc get po -n ${PROJECT_CPD_INST_OPERANDS} -o wide | grep -v -P "Completed|(\d+)\/\1"
- Mark each node as unschedulable:
oc adm cordon <node_name>
- Delete the affected pods:
oc get pod | grep -Ev "Comp|0/0|1/1|2/2|3/3|4/4|5/5|6/6|7/7"|awk '{print $1}' |xargs oc delete po --force=true --grace-period=0
- Mark each node as schedulable again:
oc adm uncordon <node_name>
The Assist me icon is not displayed in the web client
Applies to: Upgrades from Version 4.8.x
Fixed in: 5.0.3
If you upgrade IBM Cloud Pak for Data from Version
4.8.x to Version 5.0, the Assist me icon is not visible in the web client
toolbar.
The issue occurs because the ASSIST_ME_ENABLED option is set to
false.
- Resolving the problem
- To make Assist me available in the web client:
- Log in to Red Hat OpenShift Container Platform as a user with sufficient permissions to complete the task.
${OC_LOGIN}
Remember: OC_LOGIN is an alias for the oc login command.
- Run the following command to set ASSIST_ME_ENABLED to true:
oc patch cm product-configmap \
--namespace=${PROJECT_CPD_INST_OPERANDS} \
--type=merge \
--patch '{"data": {"ASSIST_ME_ENABLED": "true"}}'
- Confirm that the ASSIST_ME_ENABLED parameter is set to true:
oc get cm product-configmap \
--namespace=${PROJECT_CPD_INST_OPERANDS} \
-o jsonpath="{.data.ASSIST_ME_ENABLED}{'\n'}"
The delete-platform-ca-certs command does not
remove certificate mounts from pods
Applies to: 5.0.0
Fixed in: 5.0.3
When you run the cpd-cli
manage
delete-platform-ca-certs command, the command does not remove the
certificate mounts from pods.
- Resolving the problem
- To remove the certificate mounts from pods:
- Delete the cpd-custom-ca-certs secret:
oc delete secret cpd-custom-ca-certs \
--namespace=${PROJECT_CPD_INST_OPERANDS}
- Run the cpd-cli manage delete-platform-ca-certs command:
cpd-cli manage delete-platform-ca-certs \
--cpd_instance_ns=${PROJECT_CPD_INST_OPERANDS} \
--apply=true
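Optionally, you can confirm that the certificate secret was removed before you re-run the command. The following check is a suggestion and assumes the default secret name that is used in the preceding steps:
# Optional check: the command should return a NotFound error after the secret is deleted.
oc get secret cpd-custom-ca-certs \
--namespace=${PROJECT_CPD_INST_OPERANDS}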
When you add a secret to a vault, you cannot filter the list of users and groups to show only groups
Applies to: 5.0.0
Fixed in: 5.0.3
When you add a secret to a vault, you can optionally share the secret with other users. However, if you try to filter the list of users and groups to show only groups, the filter does not take effect.
The wml service key does not work and health commands must use the --services option
Applies to: 5.0.3 and later
The wml service key cannot be used in 5.0.3 and later versions. Using the wml service key can cause severe data loss. When Watson Machine Learning is installed on a cluster, there are two restrictions that you must follow if you want to use either the cpd-cli health service-functionality command or the cpd-cli health service-functionality cleanup command:
- You cannot use the wml service key with the --services option.
- You must use the --services option when you use either the cpd-cli health service-functionality command or the cpd-cli health service-functionality cleanup command.
The cpd-cli
cluster command fails on ROSA with hosted control planes
Applies to: 5.0.0
Fixed in: 5.1.1
An error message about missing machineconfigpools (MCP) is displayed when you run the cpd-cli cluster command to check the health of a Red Hat OpenShift Service on AWS (ROSA) cluster with hosted control planes (HCP).
Installation and upgrade issues
- The Switch locations icon is not available if the apply-cr command times out
- Upgrades fail if the Data Foundation Rook Ceph cluster is unstable
- Running the apply-olm command twice during an upgrade can remove required OLM resources
- After an upgrade from Cloud Pak for Data 4.7.3, FoundationDB can indicate a Failed status
- After you upgrade a Red Hat OpenShift Container Platform cluster, the FoundationDB resource can become unavailable
- After you upgrade from Cloud Pak for Data 4.7.4, generating a bearer token fails in an IAM-enabled cluster
- Secrets are not visible in connections after upgrade
- Node pinning is not applied to postgresql pods
- You must manually clean up remote physical location artifacts if the create-physical-location command fails
- The ibm-nginx deployment does not scale fast enough when automatic scaling is configured
The Switch locations icon is not available if the apply-cr command times out
Applies to: 5.0.0, 5.0.1, and 5.0.2
Fixed in: 5.0.3
If you install solutions that are available in different Cloud Pak for Data experiences, the Switch
locations icon is not available in the web client if the
cpd-cli
manage
apply-cr command times out.
- Resolving the problem
- Re-run the cpd-cli manage apply-cr command.
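A re-run typically uses the same options as your original installation command. The following sketch is a suggestion and assumes the standard installation environment variables from your cpd_vars.sh script (for example, ${VERSION}, ${COMPONENTS}, and the storage class variables); substitute the exact values that you used for the initial apply-cr run:
# Re-run apply-cr with the same options that you used for the initial installation (values shown are assumptions).
cpd-cli manage apply-cr \
--components=${COMPONENTS} \
--release=${VERSION} \
--cpd_instance_ns=${PROJECT_CPD_INST_OPERANDS} \
--block_storage_class=${STG_CLASS_BLOCK} \
--file_storage_class=${STG_CLASS_FILE} \
--license_acceptance=true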
Upgrades fail if the Data Foundation Rook Ceph cluster is unstable
Applies to: 5.0.0 and later
If the Red Hat OpenShift Data Foundation or IBM Storage Fusion Data Foundation Rook Ceph® cluster is unstable, upgrades fail.
One symptom is that pods will not start because of a FailedMount error. For
example:
Warning FailedMount 36s (x1456 over 2d1h) kubelet MountVolume.MountDevice failed for volume
"pvc-73bf3705-43e9-40bd-87ed-c1e1656d6f12" : rpc error: code = Aborted desc = an operation with the given
Volume ID 0001-0011-openshift-storage-0000000000000001-5e17508b-c295-4306-b684-eaa327aec2ab already exists
- Diagnosing the problem
- To confirm whether the Data Foundation
Rook Ceph cluster is unstable:
- Ensure that the rook-ceph-tools pod is running:
oc get pods -n openshift-storage | grep rook-ceph-tools
Note: On IBM Storage Fusion HCI System or on environments that use hosted control planes, the pods are running in the openshift-storage-client project.
- Set the TOOLS_POD environment variable to the name of the rook-ceph-tools pod:
export TOOLS_POD=<pod-name>
- Execute into the rook-ceph-tools pod:
oc rsh -n openshift-storage ${TOOLS_POD}
- Run the following command to get the status of the Rook Ceph cluster:
ceph status
Confirm that the output includes the following line:
health: HEALTH_WARN
- Exit the pod:
exit
- Resolving the problem
- To resolve the problem:
- Get the name of the rook-ceph-mgr pods:
oc get pods -n openshift-storage | grep rook-ceph-mgr
- Set the MGR_POD_A environment variable to the name of the rook-ceph-mgr-a pod:
export MGR_POD_A=<rook-ceph-mgr-a-pod-name>
- Set the MGR_POD_B environment variable to the name of the rook-ceph-mgr-b pod:
export MGR_POD_B=<rook-ceph-mgr-b-pod-name>
- Delete the rook-ceph-mgr-a pod:
oc delete pods ${MGR_POD_A} -n openshift-storage
- Ensure that the rook-ceph-mgr-a pod is running before you move to the next step:
oc get pods -n openshift-storage | grep rook-ceph-mgr
- Delete the rook-ceph-mgr-b pod:
oc delete pods ${MGR_POD_B} -n openshift-storage
- Ensure that the rook-ceph-mgr-b pod is running:
oc get pods -n openshift-storage | grep rook-ceph-mgr
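After the rook-ceph-mgr pods restart, you can optionally re-check the cluster health from the rook-ceph-tools pod by repeating the commands from the diagnosis steps. This check is a suggestion:
# Healthy output reports "health: HEALTH_OK" instead of HEALTH_WARN.
oc rsh -n openshift-storage ${TOOLS_POD}
ceph status
exit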
Running the apply-olm command twice during an upgrade can remove required OLM resources
Applies to:
- Upgrades from Version 4.7 to 5.0.0
- Upgrades from Version 4.8 to 5.0.0
Upgrades to later 5.0 refreshes are not affected.
If you run the cpd-cli manage apply-olm command two times, you might notice several problems:
- The operator subscription is missing
- The operator cluster service version (CSV) is missing
If you then run the cpd-cli manage apply-cr command, you might notice additional problems:
- The version information is missing from the spec section of the service custom resource
- When you run the cpd-cli manage get-cr-status command, the values for the Version and Reconciled-version parameters are different.
- Resolving the problem
- To resolve the problem, you must re-run the cpd-cli manage apply-olm command a third time to ensure that the required resources are available. Then, re-run the cpd-cli manage apply-cr command.
After an upgrade from Cloud Pak for Data 4.7.3, FoundationDB can indicate a Failed
status
Applies to: Upgrades from Version 4.7.3 to 5.0.0 and later
After upgrading Cloud Pak for Data from Version 4.7.3
to 5.0, the status of the FoundationDB cluster can indicate that it has failed
(fdbStatus: Failed). The Failed status can occur even if FoundationDB is available and working correctly. This issue
occurs when the FoundationDB resources do not get
properly cleaned up by the upgrade.
This issue can affect services that rely on FoundationDB, including:
- IBM Knowledge Catalog
- IBM Match 360
- Diagnosing the problem
-
To determine if this problem has occurred:
Required role: To complete this task, you must be a cluster administrator.
- Check the FoundationDB cluster status:
oc get fdbcluster -o yaml | grep fdbStatus
If the returned status is Failed, proceed to the next step to determine if the pods are available.
- Check to see if the FoundationDB pods are up and running:
oc get pods -n ${PROJECT_CPD_INST_OPERANDS} | grep foundation
The returned list of FoundationDB pods should all have a status of Running. If they are not running, then the problem is something other than this issue.
- Resolving the problem
-
To resolve this issue, restart the FoundationDB controller (ibm-fdb-controller):
Required role: To complete this task, you must be a cluster administrator.
- Identify your FoundationDB controllers:
oc get pods -n ${PROJECT_CPD_INST_OPERATORS} | grep ibm-fdb-controller
This command returns the names of two FoundationDB controllers in the following formats:
ibm-fdb-controller-manager-<INSTANCE-ID>
apple-fdb-controller-manager-<INSTANCE-ID>
- Delete the ibm-fdb-controller-manager pod to refresh it:
oc delete pod ibm-fdb-controller-<INSTANCE-ID> -n ${PROJECT_CPD_INST_OPERATORS}
- Wait for the controller to restart. This can take approximately one minute.
- Check the status of your FoundationDB cluster:
oc -n ${PROJECT_CPD_INST_OPERANDS} get FdbCluster -o yaml
Confirm that the fdbStatus is now Completed.
After you upgrade a Red Hat OpenShift Container Platform cluster, the FoundationDB resource can become unavailable
Applies to: 5.0.0 and later
After you upgrade your cluster to a new version of Red Hat OpenShift Container Platform, the IBM FoundationDB pods can become unavailable. When this issue occurs, services that rely on FoundationDB such as IBM Knowledge Catalog and IBM Match 360 cannot function correctly.
- IBM Knowledge Catalog
- IBM Match 360 with Watson
- Diagnosing the problem
- To identify the cause of this issue, check the FoundationDB status and details.
- Check the FoundationDB status:
oc get fdbcluster -o yaml | grep fdbStatus
If this command is successful, the returned status is Complete. If the status is InProgress or Failed, proceed to the workaround steps.
- If the status is Complete but FoundationDB is still unavailable, log in to one of the FDB pods and check the status details to ensure that the database is available and all coordinators are reachable.
oc rsh sample-cluster-log-1 /bin/fdbcli
To check the detailed status of the FDB pod, run fdbcli to enter the FoundationDB command-line interface, then run the following command at the fdb> prompt:
status details
- If you get a message that is similar to Could not communicate with a quorum of coordination servers, run the coordinators command with the IP addresses specified in the error message as input:
oc get pod -o wide | grep storage
coordinators IP-ADDRESS-1:4500:tls IP-ADDRESS-2:4500:tls IP-ADDRESS-3:4500:tls
If this step does not resolve the problem, proceed to the workaround steps.
- If you get a different message, such as Recruiting new transaction servers, proceed to the workaround steps.
- Resolving the problem
- To resolve this issue, restart the FoundationDB pods.
Required role: To complete this task, you must be a cluster administrator.
- Restart the FoundationDB cluster pods:
oc get fdbcluster
oc get po |grep ${CLUSTER_NAME} |grep -v backup|awk '{print $1}' |xargs oc delete po
Replace ${CLUSTER_NAME} in the command with the name of your fdbcluster instance.
- Restart the FoundationDB operator pods:
oc get po |grep fdb-controller |awk '{print $1}' |xargs oc delete po
- After the pods finish restarting, check to ensure that FoundationDB is available.
- Check the FoundationDB status:
oc get fdbcluster -o yaml | grep fdbStatus
The returned status must be Complete.
- Check to ensure that the database is available:
oc rsh sample-cluster-log-1 /bin/fdbcli
If the database is still not available, complete the following steps.
- Log in to the ibm-fdb-controller pod.
- Run the fix-coordinator script:
kubectl fdb fix-coordinator-ips -c ${CLUSTER_NAME} -n ${PROJECT_CPD_INST_OPERATORS}
Replace ${CLUSTER_NAME} in the command with the name of your fdbcluster instance.
Note: For more information about the fix-coordinator script, see the workaround steps from the resolved IBM Match 360 known issue item The FoundationDB cluster can become unavailable.
After you upgrade from Cloud Pak for Data 4.7.4, generating a bearer token fails in an IAM-enabled cluster
Applies to: Upgrades from Version 4.7.4 to 5.0.0 and later
If you upgrade from Cloud Pak for Data version 4.7.4
to Cloud Pak for Data
5.0.0 and later, the IAM access token API
(/idprovider/v1/auth/identitytoken) fails. You cannot log in to the user interface when the identitytoken API fails.
- Diagnosing the problem
-
The following error is displayed in the log when you generate an IAM access token:
Failed to get access token, Liberty error: {"error_description":"CWWKS1406E: The token request had an invalid client credential. The request URI was \/oidc\/endpoint\/OP\/token.","error":"invalid_client"}
- Resolving the problem
- Log in to Red Hat OpenShift Container Platform as a cluster administrator.
${OC_LOGIN}
Remember: OC_LOGIN is an alias for the oc login command.
- Run the following command to restart the oidc-client-registration job:
oc -n ${PROJECT_CPD_INST_OPERANDS} delete job oidc-client-registration
Secrets are not visible in connections after upgrade
Applies to:
- Upgrades from Version 4.7 to Version 5.0.0, 5.0.1, or 5.0.2
- Upgrades from Version 4.8 to Version 5.0.0, 5.0.1, or 5.0.2
Fixed in: 5.0.3
If you use secrets when you create connections, the secrets are not visible in the connection details after you upgrade Cloud Pak for Data. This issue occurs when your vault uses a private CA signed certificate.
- Resolving the problem
- To see the secrets in the user interface:
- Change to the project where Cloud Pak for Data is installed:
oc project ${PROJECT_CPD_INST_OPERANDS}
- Set the following environment variables:
oc set env deployment/zen-core-api VAULT_BRIDGE_TLS_RENEGOTIATE=true
oc set env deployment/zen-core-api VAULT_BRIDGE_TOLERATE_SELF_SIGNED=true
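You can optionally confirm that both variables are set on the deployment. This check is a suggestion that uses the oc set env --list option:
# Optional check: both VAULT_BRIDGE_* variables should be listed.
oc set env deployment/zen-core-api --list -n ${PROJECT_CPD_INST_OPERANDS} | grep VAULT_BRIDGE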
Node pinning is not applied to postgresql pods
Applies to: 5.0.0 and later
If you use node pinning to schedule pods on specific nodes, and your environment includes
postgresql pods, the node affinity settings are not applied to the
postgresql pods that are associated with your Cloud Pak for Data deployment.
The resource specification injection (RSI) webhook cannot patch postgresql pods
because the EDB Postgres operator uses a
PodDisruptionBudget resource to limit the number of concurrent disruptions to
postgresql pods. The PodDisruptionBudget resource prevents
postgresql pods from being evicted.
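For example, you can see the protection that blocks eviction by listing the EDB Postgres pods and the PodDisruptionBudget resources in the instance project. This diagnostic sketch is a suggestion and assumes the standard EDB Postgres labels that are used elsewhere in this documentation:
# List the EDB Postgres (postgresql) pods and the PodDisruptionBudgets that protect them.
oc get pods -n ${PROJECT_CPD_INST_OPERANDS} -l k8s.enterprisedb.io/cluster -o wide
oc get pdb -n ${PROJECT_CPD_INST_OPERANDS}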
You must manually clean up remote physical location artifacts if the create-physical-location command fails
Applies to: 5.0.0
Fixed in: 5.0.1
If the cpd-cli
manage
create-physical-location command fails, the command leaves behind
resources that you must clean up by running the cpd-cli
manage
delete-physical-location command:
cpd-cli manage delete-physical-location \
--physical_location_name=${REMOTE_PHYSICAL_LOCATION_ID} \
--management_ns=${REMOTE_PROJECT_MANAGEMENT} \
--cpd_hub_url=${CPD_HUB_URL} \
--cpd_hub_api_key=${CPD_HUB_API_KEY}
If you try to re-run the create-physical-location
command against the same management project before you run the delete-physical-location command, the create-physical-location command returns the following error:
The physical-location-info-cm ConfigMap already exists in the <management-ns> project.
The physical location in the ConfigMap is called <remote-physical-location-id>
* If you need to re-run the create-physical-location command to finish creating the physical location,
you must specify <remote-physical-location-id>.
* If you want to create a new physical location on the cluster, you must specify a different project.
You cannot reuse an existing management project.
The ibm-nginx deployment does not scale fast enough when automatic scaling
is configured
Applies to: 5.0.0 and later
If you configure automatic scaling for the IBM Cloud Pak for Data control plane, the ibm-nginx
deployment might not scale fast enough. Some symptoms include:
- Slow response times
- High CPU requests are throttled
- The deployment scales up and down even when the workload is steady
This problem typically occurs when you install watsonx Assistant or watsonx Orchestrate.
- Resolving the problem
- If you encounter the preceding symptoms, you must manually scale the ibm-nginx deployment:
oc patch zenservice lite-cr \
--namespace=${PROJECT_CPD_INST_OPERANDS} \
--type merge \
--patch '{"spec": { "Nginx": { "name": "ibm-nginx", "kind": "Deployment", "container": "ibm-nginx-container", "replicas": 5, "minReplicas": 2, "maxReplicas": 11, "guaranteedReplicas": 2, "metrics": [ { "type": "Resource", "resource": { "name": "cpu", "target": { "type": "Utilization", "averageUtilization": 529 } } } ], "resources": { "limits": { "cpu": "1700m", "memory": "2048Mi", "ephemeral-storage": "500Mi" }, "requests": { "cpu": "225m", "memory": "920Mi", "ephemeral-storage": "100Mi" } }, "containerPolicies": [ { "containerName": "*", "minAllowed": { "cpu": "200m", "memory": "256Mi" }, "maxAllowed": { "cpu": "2000m", "memory": "2048Mi" }, "controlledResources": [ "cpu", "memory" ], "controlledValues": "RequestsAndLimits" } ] } }}'
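After you apply the patch, you can optionally confirm that the deployment scales to the configured replica count. This check is a suggestion:
# Optional check: the READY and AVAILABLE counts should move toward the configured replicas value.
oc get deployment ibm-nginx -n ${PROJECT_CPD_INST_OPERANDS}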
Backup and restore issues
- Issues that apply to several or all backup and restore methods
-
- Restore posthooks fail to run when restoring Data Virtualization with the OADP utility
- Backup fails for the platform with error in EDB Postgres cluster
- OADP backup is missing EDB Postgres PVCs
- Disk usage size error when running the du-pv command
- After restore, watsonx Assistant custom resource is stuck in InProgress at 11/19 verified state
- After restore, watsonx Assistant is stuck on the 17/19 deployed state or custom resource is stuck in InProgress state
- OADP backup precheck command fails
- During or after a restore, pod shows PVC is missing
- After restoring an online backup, status of Watson Discovery custom resource remains in InProgress state
- After successful restore, the ibm-common-service-operator deployment fails to reach a Running state
- Restore fails with Error from server (Forbidden): configmaps is forbidden error
- After a restore, unable to access the Cloud Pak for Data console
- After a successful restore, the Cloud Pak for Data console points to the source cluster domain in its URL instead of the target cluster domain
- Unable to back up Watson Discovery when the service is scaled to the xsmall size
- In a Cloud Pak for Data deployment that has multiple OpenPages instances, only the first instance is successfully restored
- Unable to log in to Cloud Pak for Data with OpenShift cluster credentials after successfully restoring to a different cluster
- After a restore, OperandRequest timeout error in the ZenService custom resource
- Online backup and restore with the OADP backup and restore utility issues
-
- Online restore of Data Virtualization fails with post-hook errors
- Online backup of Analytics Engine powered by Apache Spark fails
- Watson Speech services status is stuck in InProgress after restore
- Common core services and dependent services in a failed state after an online restore
- Backup validation fails because of missing resources in wkc-foundationdb-cluster-aux-checkpoint-cm ConfigMap
- Online backup and restore with IBM Storage Fusion issues
-
- Restoring an RSI-enabled backup fails
- Restore fails at Hook: br-service-hooks-operators restore step
- Data Virtualization restore fails at post-workload step
- Restoring an online backup of Cloud Pak for Data on IBM Storage Scale Container Native storage fails
- Backup failed at Volume group: cpd-volumes stage
- Backup of Cloud Pak for Data operators project fails at data transfer stage
- Online backup and restore with NetApp Astra Control Center issues
- Data replication with Portworx issues
- Offline backup and restore with the OADP backup and restore utility issues
-
- Custom foundation models missing after backup and restore
- Creating an offline backup in REST mode stalls
- Common core services custom resource is in InProgress state after an offline restore to a different cluster
- OpenPages offline backup fails with pre-hook error
- Offline backup pre-hooks fail on Separation of Duties cluster
- Offline restore to a different cluster fails due to management-ingress-ibmcloud-cluster-info ConfigMap not found in PodVolumeRestore
- Unable to restore offline backup of OpenPages to different cluster
OADP backup is missing EDB Postgres PVCs
Applies to: 5.0.0 and later
- Diagnosing the problem
- After an online or offline backup is taken with the OADP utility, EDB Postgres PVCs are missing in the PVC backup list.
- Cause of the problem
- EDB Postgres replica PVCs might be excluded from a backup when an EDB Postgres cluster switches primary instances.
- Resolving the problem
- Before you create a backup, run the following command:
oc label pvc,pods -l k8s.enterprisedb.io/cluster,velero.io/exclude-from-backup=true velero.io/exclude-from-backup- -n ${PROJECT_CPD_INST_OPERANDS}
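Optionally, confirm that no EDB Postgres PVCs still carry the exclusion label before you start the backup. This check is a suggestion based on the label that the preceding command removes:
# Optional check: no PVCs should be returned if the exclusion label was removed.
oc get pvc -n ${PROJECT_CPD_INST_OPERANDS} -l velero.io/exclude-from-backup=true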
Disk usage size error when running the du-pv
command
Applies to: 5.0.0 and later
- Diagnosing the problem
- When you run the du-pv command to estimate how much storage is needed to create a backup with the OADP utility, you see the following error message:
Total estimated volume usage size: 0
one or more error(s) occurred while trying to get disk usage size. Please check reported errors in log file for details
The status of the cpdbr-agent pods is ImagePullBackOff:
oc get po -n ${OADP_NAMESPACE}
Example output:
NAME                READY   STATUS             RESTARTS   AGE
cpdbr-agent-9lprf   0/1     ImagePullBackOff   0          74s
cpdbr-agent-pf42f   0/1     ImagePullBackOff   0          74s
cpdbr-agent-trprx   0/1     ImagePullBackOff   0          74s
- Cause of the problem
- The --image-prefix option is not currently used by the cpdbr-agent install command. If you specify this option, it is ignored. Instead, the install command uses the default image at registry.access.redhat.com/ubi9/ubi-minimal:latest.
- Resolving the problem
- Do the following steps:
- Patch the cpdbr-agent daemonset with the desired fully-qualified image name:
oc patch daemonset cpdbr-agent -n ${OADP_NAMESPACE} --type='json' -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/image", "value":"<fully-qualified-image-name>"}]'
- Wait for the daemonset to reach a healthy state:
oc rollout status daemonset cpdbr-agent -n ${OADP_NAMESPACE}
- Retry the du-pv command.
Tip: For more information about this feature, see Optional: Estimating how much storage to allocate for backups.
After restore, watsonx Assistant
custom resource is stuck in InProgress at 11/19 verified
state
Applies to: 5.0.0-5.0.2
Fixed in: 5.0.3
- Diagnosing the problem
- This problem can occur after you restore an online backup to the same cluster or to a different cluster. Run the following command:
oc get <watsonx-Assistant-instance-name> -n ${PROJECT_CPD_INST_OPERANDS}
Example output:
NAME   VERSION   READY   READYREASON    UPDATING   UPDATINGREASON   DEPLOYED   VERIFIED   QUIESCE   AGE
wa     5.0.1     False   Initializing   True       VerifyWait       19/19      11/19                4h39m
- Cause of the problem
- Pods are unable to find the wa-global-etcd secret. Run the following command:
oc describe pod wa-store-<xxxxxxxxx>-<xxxxx> | tail -5
Example output:
Normal   QueuePosition  51m (x2 over 52m)     ibm-cpd-scheduler  Queue Position: 3
Normal   QueuePosition  50m (x2 over 52m)     ibm-cpd-scheduler  Queue Position: 2
Normal   QueuePosition  36m                   ibm-cpd-scheduler  Queue Position: 1
Warning  FailedMount    6m49s (x22 over 50m)  kubelet            Unable to attach or mount volumes: unmounted volumes=[global-etcd], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition
Warning  FailedMount    74s (x33 over 52m)    kubelet            MountVolume.SetUp failed for volume "global-etcd" : secret "wa-global-etcd" not found
- Resolving the problem
- Delete certain deployments and recreate them by doing the following steps:
- Ensure that the watsonx Assistant operator is running.
- Create the INSTANCE environment variable and set it to the watsonx Assistant instance name:
export INSTANCE=<watsonx-Assistant-instance-name>
- Run the following script:
# Components to restart one by one
SEQUENTIAL_DEPLOYMENTS=("ed" "dragonfly-clu-mm" "tfmm" "clu-triton-serving" "clu-serving" "nlu" "dialog" "store")

# Components to restart together in parallel
PARALLEL_DEPLOYMENTS=("analytics" "clu-embedding" "incoming-webhooks" "integrations" "recommends" "system-entities" "ui" "webhooks-connector" "gw-instance" "store-admin")

for DEPLOYMENT in "${SEQUENTIAL_DEPLOYMENTS[@]}"; do
  echo "#Starting restart of $INSTANCE-$DEPLOYMENT."
  # Delete the deployment
  oc delete deployment $INSTANCE-$DEPLOYMENT
  # Wait until the deployment is completely deleted
  while oc get deployment $INSTANCE-$DEPLOYMENT &> /dev/null; do
    echo "Waiting for $INSTANCE-$DEPLOYMENT to be fully deleted..."
    sleep 5
  done
  # Ensure the deployment is recreated
  echo "Recreating $INSTANCE-$DEPLOYMENT."
  while ! oc get deployment $INSTANCE-$DEPLOYMENT &> /dev/null; do
    echo "Waiting for $INSTANCE-$DEPLOYMENT to be created..."
    sleep 5
  done
  echo "Waiting for $INSTANCE-$DEPLOYMENT to become ready..."
  oc rollout status deployment/$INSTANCE-$DEPLOYMENT --watch=true
  echo "#Rolling restart of $INSTANCE-$DEPLOYMENT completed successfully."
done

for DEPLOYMENT in "${PARALLEL_DEPLOYMENTS[@]}"; do
  echo "#Starting restart of $INSTANCE-$DEPLOYMENT."
  # Delete the deployment
  oc delete deployment $INSTANCE-$DEPLOYMENT &
done

# Wait for all parallel delete operations to complete
wait

# Ensure parallel deployments are recreated
for DEPLOYMENT in "${PARALLEL_DEPLOYMENTS[@]}"; do
  while ! oc get deployment $INSTANCE-$DEPLOYMENT &> /dev/null; do
    echo "Waiting for $INSTANCE-$DEPLOYMENT to be created..."
    sleep 5
  done
  echo "Waiting for $INSTANCE-$DEPLOYMENT to become ready..."
  oc rollout status deployment/$INSTANCE-$DEPLOYMENT --watch=true
  echo "#Rolling restart of $INSTANCE-$DEPLOYMENT completed successfully."
done

echo "All deployments have been restarted successfully."
After restore, watsonx Assistant is
stuck on the 17/19 deployed state or custom resource is stuck in
InProgress state
Applies to: 5.0.1, 5.0.2
Fixed in: 5.0.3
- Diagnosing the problem
- This problem can occur after you restore an online backup to the same cluster or to a different cluster. Run the following command:
oc get wa -n ${PROJECT_CPD_INST_OPERANDS}
Example output:
NAME   VERSION   READY   READYREASON    UPDATING   UPDATINGREASON   DEPLOYED   VERIFIED   QUIESCE   AGE
wa     5.0.1     False   Initializing   True       VerifyWait       17/19      15/19                4h39m
- Resolving the problem
- Delete the wa-integrations-operand-secret and wa-integrations-datastore-connection-strings secrets by running the following commands:
oc delete secret wa-integrations-operand-secret -n ${PROJECT_CPD_INST_OPERANDS}
oc delete secret wa-integrations-datastore-connection-strings -n ${PROJECT_CPD_INST_OPERANDS}
After the secrets are deleted, the watsonx Assistant operator recreates them with the correct values, and the watsonx Assistant custom resource and pods are now in a good state.
OADP backup precheck command fails
Applies to: 5.0.0, 5.0.1
Fixed in: 5.0.2
- Diagnosing the problem
- This problem occurs when you do offline or online backup and restore with the OADP backup and restore utility. Run the backup precheck command:
cpd-cli oadp backup precheck --tenant-operator-namespace ${PROJECT_CPD_INST_OPERATORS}
The following error message appears:
error getting csv list: : clusterserviceversions.operators.coreos.com is forbidden: User "system:serviceaccount:zen-cpdbrapi:cpdbr-api-sa" cannot list resource "clusterserviceversions" in API group "operators.coreos.com" at the cluster scope
Error: error getting csv list: : clusterserviceversions.operators.coreos.com is forbidden: User "system:serviceaccount:zen-cpdbrapi:cpdbr-api-sa" cannot list resource "clusterserviceversions" in API group "operators.coreos.com" at the cluster scope
[ERROR] <timestamp> RunPluginCommand:Execution error: exit status 1
- Cause of the problem
- The cpdbr-api pod does not have the necessary permission to list clusterserviceversions.operators.coreos.com in all projects (namespaces) for the backup precheck command.
- Resolving the problem
- Add --exclude-checks OadpOperatorCSV to the backup precheck command:
cpd-cli oadp backup precheck \
--tenant-operator-namespace ${PROJECT_CPD_INST_OPERATORS} \
--exclude-checks OadpOperatorCSV
During or after a restore, pod shows PVC is missing
Applies to: 5.0.0
Fixed in: 5.0.1
- Diagnosing the problem
- During or after a restore, a pod shows that one or more PVCs are missing. For example:
oc describe pod c-db2oltp-wkc-db2u-0
Example output:
Events:
  Type     Reason            Age                    From               Message
  ----     ------            ----                   ----               -------
  Warning  FailedScheduling  18m (x11076 over 16h)  ibm-cpd-scheduler  0/20 nodes are available: 20 persistentvolumeclaim "wkc-db2u-backups" not found. preemption: 0/20 nodes are available: 20 Preemption is not helpful for scheduling.
- Cause of the problem
- Velero does not back up PVCs that are in a Terminating state.
- Resolving the problem
- To work around the problem, before you restore a backup, ensure that no PVCs are in a Terminating state. To check for PVCs that are in a Terminating state after a backup is created, check the Velero pod logs for Skipping item because it's being deleted messages:
oc logs po -l deploy=velero -n <oadp-operator-ns>
Example output:
time="<timestamp>" level=info msg="Skipping item because it's being deleted." backup=oadp-operator/bkupocs661-tenant-online-b1 logSource="/remote-source/velero/app/pkg/backup/item_backupper.go:161" name=wkc-db2u-backups namespace=zen1 resource=persistentvolumeclaims time="<timestamp>" level=info msg="Backed up 286 items out of an estimated total of 292 (estimate will change throughout the backup)" backup=oadp-operator/bkupocs661-tenant-online-b1 logSource="/remote-source/velero/app/pkg/backup/backup.go:404" name=wkc-db2u-backups namespace=zen1 progress= resource=persistentvolumeclaims
After restoring an online backup, status of Watson Discovery custom resource remains in
InProgress state
Applies to: 5.0.0-5.0.2
Fixed in: 5.0.3
- Diagnosing the problem
- You see the following error, even though you did the Multicloud Object Gateway post-restore task. For
example, if you used IBM Storage Fusion to do the backup and
restore, you created the secrets that Watson Discovery uses to connect to
Multicloud Object Gateway.
- lastTransactionTime: <timestamp> message: Post task of online restore is in progress. Please ensure that MCG is correctly configured after restore. reason: PostRestoreInProgress status: "True" type: Message - Cause of the problem
- The Watson Discovery post-restore task did not complete.
- Resolving the problem
- To work around the problem, do the following steps:
- Check that the Watson Discovery
post-restore component
exists:
oc get wd wd -o jsonpath='{.status.componentStatus.deployedComponents[?(@=="post_restore")]}'If the post-restore component exists, the output of the command is:post_restore - Check that the post-restore task is not
unverified:
oc get wd wd -o jsonpath='{.status.componentStatus.unverifiedComponents[?(@=="post_restore")]}'If the post-restore task is not unverified, no output is produced by the command.
- In this situation, some failure jobs do not rerun and must be
deleted:
oc delete job wd-discovery-enrichment-model-copy wd-discovery-orchestrator-setup - Check that Watson Discovery is
now ready:
oc get wdExample output:NAME VERSION READY READYREASON UPDATING UPDATINGREASON DEPLOYED VERIFIED QUIESCE DATASTOREQUIESCE AGE wd 5.0.0 True Stable False Stable 23/23 23/23 NOT_QUIESCED NOT_QUIESCED 22h
After successful restore, the ibm-common-service-operator
deployment fails to reach a Running state
Applies to: 5.0.0 and later
- Diagnosing the problem
- The following symptoms are seen:
- Running the following command shows that the
ibm-common-service-operator pod and deployment are not
healthy:
Example output:oc get pods -n ${PROJECT_CPD_INST_OPERATORS}ibm-common-service-operator-<...> 0/1 CrashLoopBackOff 72 (4m46s ago) 6h11mError logs show permission issues:
oc logs ibm-common-service-operator-<...>Example output:... # I0529 20:52:39.182025 1 request.go:665] Waited for 1.033737216s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/dashboard.opendatahub.io/v1alpha?timeout=32s # <date_timestamp>20:52:47.794Z INFO controller-runtime.metrics metrics server is starting to listen {"addr": ":8080"} # I0529 20:52:47.794980 1 main.go:130] Identifying Common Service Operator Role in the namespace cpd-operator # E0529 20:52:47.835106 1 util.go:465] Failed to fetch configmap kube-public/saas-config: configmaps "saas-config" is forbidden: User "system:serviceaccount:cpd-operator:ibm-common-service-operator" cannot get resource "configmaps" in API group "" in the namespace "kube-public" # I0529 20:52:47.837942 1 init.go:152] Single Deployment Status: false, MultiInstance Deployment status: true, SaaS Depolyment Status: false # I0529 20:52:49.188786 1 request.go:665] Waited for 1.340366538s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/cdi.kubevirt.io/v1beta1?timeout=32s # E0529 20:52:57.412736 1 init.go:1683] Failed to cleanup validatingWebhookConfig: validatingwebhookconfigurations.admissionregistration.k8s.io "ibm-common-service-validating-webhook-cpd-operator" is forbidden: User "system:serviceaccount:cpd-operator:ibm-common-service-operator" cannot delete resource "validatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope # E0529 20:52:57.412762 1 main.go:153] Cleanup Webhook Resources failed: validatingwebhookconfigurations.admissionregistration.k8s.io "ibm-common-service-validating-webhook-cpd-operator" is forbidden: User "system:serviceaccount:cpd-operator:ibm-common-service-operator" cannot delete resource "validatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope - Running the following command shows that the
ibm-common-service-operator CSV is stuck in a
Pendingstate:
Example output:oc get csv -n ${PROJECT_CPD_INST_OPERATORS}NAME DISPLAY VERSION REPLACES PHASE ibm-zen-operator.v6.0.0 IBM Zen Service 6.0.0 PendingRunning the following command shows that status of the CommonService custom resource is
Succeeded:oc get commonservice -n ${PROJECT_CPD_INST_OPERANDS} common-service -o json | jq .status.phase - OLM logs show the following
error:
oc logs -n openshift-operator-lifecycle-manager -l app=catalog-operatoroc logs -n openshift-operator-lifecycle-manager -l app=olm-operatorExample output:E0530 01:00:07.268889 1 queueinformer_operator.go:319] sync {"update" "cpd-operator/ibm-common-service-operator.v4.6.0"} failed: requirements were not met time="<timestamp>" level=info msg="requirements were not met" csv=cpd-platform-operator.v4.6.0 id=<...> namespace=cpd-operator phase=Pending
- Cause of the problem
- The root cause is from a known OLM issue where ClusterRoleBindings are missing, even though the InstallPlan shows it was created. For details, see the OLM issue ClusterRoleBinding is missing although InstallPlan shows it was created.
- Resolving the problem
- To work around the problem, clean up the Cloud Pak for Data instance and operator projects (namespaces) and retry the restore. For cleanup instructions, see Preparing to restore Cloud Pak for Data with the OADP utility.
Restore fails with Error from server (Forbidden): configmaps is
forbidden error
Applies to: 5.0.0
Fixed in: 5.0.1
- Diagnosing the problem
- When restoring Cloud Pak for Data to a
different cluster with IBM Storage Fusion, NetApp Astra Control Center, or Portworx, you see the following error
message:
Time: <timestamp> level=error - oc get configmap -n kube-public - FAILED with: Error from server (Forbidden): configmaps is forbidden: User "system:serviceaccount:cpd-operator:cpdbr-tenant-service-sa" cannot list resource "configmaps" in API group "" in the namespace "kube-public" End Time: <timestamp> - Cause of the problem
- The command to uninstall the
cpdbr service was run with the incorrect
--tenant-operator-namespaceparameter. For example, multiple Cloud Pak for Data instances were installed in the cluster, and while cleaning up one of the instances, the incorrect project was specified when uninstalling the cpdbr service. - Resolving the problem
- To work around the problem, reinstall the cpdbr service in the project where it was mistakenly uninstalled. For details, see one of the following topics:
After a restore, unable to access the Cloud Pak for Data console
Applies to: 5.0.0
Fixed in: 5.0.1
- Diagnosing the problem
- You see the following symptoms:
- Running the following command shows that the ibm-iam-operator
pod and deployment are not
healthy:
Example output:oc get pods -n ${PROJECT_CPD_INST_OPERATORS}ibm-iam-operator-<...> 0/1 CrashLoopBackOff 72 (4m46s ago) 6h11mError logs show permission issues:oc logs ibm-iam-operator-<...> - Running the following command shows that the ibm-iam-operator
CSV is stuck in a
Pendingstate:
Example output:oc get csv -n ${PROJECT_CPD_INST_OPERATORS}NAME DISPLAY VERSION REPLACES PHASE ibm-iam-operator.v4.6.0 IBM IM Operator 4.6.0 PendingRunning the following command shows that status of the CommonService custom resource is
Succeeded:oc get commonservice -n ${PROJECT_CPD_INST_OPERANDS} common-service -o json | jq .status.phase - OLM logs show the following
error:
oc logs -n openshift-operator-lifecycle-manager -l app=catalog-operatoroc logs -n openshift-operator-lifecycle-manager -l app=olm-operatorExample output:E0530 01:00:07.268889 1 queueinformer_operator.go:319] sync {"update" "cpd-operator/ibm-iam-operator.v4.6.0"} failed: requirements were not met time="<timestamp>" level=info msg="requirements were not met" csv=ibm-iam-operator.v4.6.0 id=<...> namespace=cpd-operator phase=Pending
- Cause of the problem
- Insufficient permissions from missing ClusterRole and ClusterRoleBindings. The root cause is from a known OLM issue where ClusterRoleBindings are missing, even though the InstallPlan shows it was created. For details, see the OLM issue ClusterRoleBinding is missing although InstallPlan shows it was created.
- Resolving the problem
- To work around the problem, clean up the Cloud Pak for Data instance and operator projects (namespaces) and retry the restore. For cleanup instructions, see Preparing to restore Cloud Pak for Data with the OADP utility.
After a successful restore, the Cloud Pak for Data console points to the source cluster domain in its URL instead of the target cluster domain
Applies to: 5.0.0
Fixed in: 5.0.1
- Diagnosing the problem
- Get the Cloud Pak for Data console route by
running the following
command:
oc get route -n ${PROJECT_CPD_INST_OPERANDS}The output of the command shows that the Cloud Pak for Data console route points to the source cluster domain rather than to the target cluster domain.
- Cause of the problem
- The ibmcloud-cluster-info ConfigMap from the source cluster is included in the restore when it is expected to be excluded and re-generated, causing the target restore cluster to use the source routes.
- Resolving the problem
- To work around the problem, do the following steps:
- Edit the fields in the ibmcloud-cluster-info ConfigMap to use
the target cluster
hostname:
oc edit configmap ibmcloud-cluster-info -n ${PROJECT_CPD_INST_OPERANDS} - Restart the ibm-zen-operator
pod:
oc delete po -l app.kubernetes.io/name=ibm-zen-operator -n ${PROJECT_CPD_INST_OPERANDS} - Check that the routes are
updated:
oc get route -n ${PROJECT_CPD_INST_OPERANDS}
If restarting the ibm-zen-operator pod does not correctly update the routes, and the ibm-iam-operator deployment is not healthy, do the workaround that is described in the previous issue.
Unable to back up Watson Discovery
when the service is scaled to the xsmall size
Applies to: 5.0.0
Fixed in: 5.0.1
- Diagnosing the problem
- The problem that you see depends on the backup and restore method that you are using. For example, if you are using IBM Storage Fusion, a Failed snapshot message appears during the backup process.
- Cause of the problem
- The
xsmallsize configuration uses 1 OpenSearch data node. The backup process requires 2 data nodes. - Resolving the problem
- To work around the problem, increase the number of OpenSearch data nodes to 2. In the
${PROJECT_CPD_INST_OPERANDS}project (namespace), run the following command:oc patch wd wd --type=merge --patch='{"spec":{"elasticsearch":{"dataNode":{"replicas":2}}}}'
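You can optionally confirm the new data node count by reading back the same field that the patch sets. This check is a suggestion:
# Optional check: the command should return 2.
oc get wd wd -n ${PROJECT_CPD_INST_OPERANDS} -o jsonpath='{.spec.elasticsearch.dataNode.replicas}{"\n"}'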
In a Cloud Pak for Data deployment that has multiple OpenPages instances, only the first instance is successfully restored
Applies to: 5.0.0
Fixed in: 5.0.1
- Diagnosing the problem
- After the restore, the custom resource of the first OpenPages instance is in a
Completedstate. The custom resources of the remaining OpenPages instances are in anInMaintenancestate. - Cause of the problem
- Hooks (prehooks, posthooks, etc.) are run only on the first OpenPages instance. Log files list only the results for one OpenPages instance when multiple were expected.
- Resolving the problem
- To work around the problem, do the following steps:
- Get the OpenPages
instance
ConfigMaps:
oc get cm -n ${PROJECT_CPD_INST_OPERANDS} -l cpdfwk.module=openpages-aux - Edit each OpenPages
instance ConfigMap so that their
.data.aux-meta.namefields match their.metadata.labels.["cpdfwk.name"]label:oc edit cm -n ${PROJECT_CPD_INST_OPERANDS} <configmap-name>
Unable to log in to Cloud Pak for Data with OpenShift cluster credentials after successfully restoring to a different cluster
Applies to: 5.0.0
Fixed in: 5.0.1
- Diagnosing the problem
- When Cloud Pak for Data is integrated with
the Identity Management Service service, you
cannot log in with OpenShift
cluster credentials. You might be able to log in with LDAP or as
cpdadmin. - Resolving the problem
- To work around the problem, run the following
commands:
oc delete cm platform-auth-idp -n ${PROJECT_CPD_INST_OPERANDS}
oc delete cm oauth-client-map -n ${PROJECT_CPD_INST_OPERANDS}
oc delete cm ibm-iam-bindinfo-oauth-client-map -n ${PROJECT_CPD_INST_OPERANDS}
oc delete cm ibm-iam-bindinfo-platform-auth-idp -n ${PROJECT_CPD_INST_OPERANDS}
oc delete pods -n ${PROJECT_CPD_INST_OPERATORS} -l app.kubernetes.io/instance=ibm-common-service-operator
oc delete pods -n ${PROJECT_CPD_INST_OPERANDS} -l app.kubernetes.io/instance=platform-auth-service
oc delete pods -n ${PROJECT_CPD_INST_OPERANDS} -l app.kubernetes.io/instance=platform-identity-management
oc delete pods -n ${PROJECT_CPD_INST_OPERANDS} -l app.kubernetes.io/instance=platform-identity-provider
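After you delete the ConfigMaps and pods, you can optionally watch the identity pods come back up before you try to log in again. This check is a suggestion that reuses the labels from the preceding commands:
# Optional check: wait until the pods report 1/1 Running.
oc get pods -n ${PROJECT_CPD_INST_OPERANDS} -l app.kubernetes.io/instance=platform-auth-service
oc get pods -n ${PROJECT_CPD_INST_OPERANDS} -l app.kubernetes.io/instance=platform-identity-provider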
After a restore, OperandRequest timeout error in the ZenService custom resource
Applies to: 5.0.0 and later
- Diagnosing the problem
- Get the status of the ZenService
YAML:
oc get zenservice lite-cr -n ${PROJECT_CPD_INST_OPERATORS} -o yamlIn the output, you see the following error:
... zenMessage: '5.1.3/roles/0010-infra has failed with error: "OperandRequest" "zen-ca-operand-request": Timed out waiting on resource' ...Check for failing operandrequests:oc get operandrequests -AFor failing operandrequests, check their conditions forconstraints not satisfiablemessages:oc describe -n ${PROJECT_CPD_INST_OPERATORS} <opreq-name> - Cause of the problem
- Subscription wait operations timed out. The problematic subscriptions show an error
similar to the following
example:
'constraints not satisfiable: clusterserviceversion ibm-db2aaservice-cp4d-operator.v5.2.0 exists and is not referenced by a subscription, @existing/cpd-operators//ibm-db2aaservice-cp4d-operator.v5.2.0 and ibm-db2aaservice-cp4d-operator-catalog/cpd-operators/v5.2/ibm-db2aaservice-cp4d-operator.v5.2.0 originate from package ibm-db2aaservice-cp4d-operator, subscription ibm-db2aaservice-cp4d-operator requires ibm-db2aaservice-cp4d-operator-catalog/cpd-operators/v5.2/ibm-db2aaservice-cp4d-operator.v5.2.0, subscription ibm-db2aaservice-cp4d-operator exists'This problem is a known issue with Red Hat Operator Lifecycle Manager. For details, see Failed to install or upgrade operator with warning 'clusterserviceversion is not referenced by a subscription'.
- Workaround
- Do the following steps:
- Delete the problematic clusterserviceversions and subscriptions, and
restart the Operand Deployment Lifecycle Manager (ODLM) pod.
For details, follow the steps in the troubleshooting document cloud-native-postgresql operator is installed with the certified-operators catalogsource.
- Delete Cloud Pak for Data instance
projects (namespaces).
For details, see Preparing to restore Cloud Pak for Data with the OADP utility.
- Retry the restore.
Online restore of Data Virtualization fails with post-hook errors
Applies to: 5.0.2, 5.0.3
- Diagnosing the problem
- Restoring an online backup of Data Virtualization on Portworx storage with the OADP backup and restore utility fails.
In the CPD-CLI*.log file, you see errors such as in the following
examples:
<time> zen/configmap/cpd-dv-aux-ckpt-cm: component=dv, op=<mode=post-restore,type=config-hook,method=rule>, status=errortime=<timestamp> level=error msg=error performing op postRestoreViaConfigHookRule for resource dv, msg: 1 error occurred: * : command timed out after 40m0s: timed out waiting for the condition func=cpdbr-oadp/pkg/quiesce.(*BasicPlanExecutor).applyPlanInternal file=/a/workspace/oadp-upload/pkg/quiesce/planexecutor.go:1535 - Cause of the problem
- Db2 startup is slow, causing the Data Virtualization post-restore hook to time out.
- Resolving the problem
- To work around the problem, take various Data Virtualization components out of
write-suspend mode.
- Take dvutils out of write-suspend
mode:
oc rsh c-db2u-dv-dvutils-0 bash/opt/dv/current/dv-utils.sh -o leavesafemode --is-bar - Take the Data Virtualization
hurricane pod out of write-suspend
mode:
oc rsh $(oc get pods | grep -i hurricane | cut -d' ' -f 1) bashsu - db2inst1/usr/ibmpacks/current/bigsql/bigsql/bigsql-cli/BIGSQL/package/scripts/bigsql-db2ubar-hook.sh -H POST -M RESTORE -L - Take Db2 out of
write-suspend
mode:
oc rsh c-db2u-dv-db2u-0 bashsu - db2inst1/db2u/scripts/bigsql-exec.sh /usr/ibmpacks/current/bigsql/bigsql/bigsql-cli/BIGSQL/package/scripts/bigsql-db2ubar-hook.sh -H POST -M RESTORE -L - After a few minutes, verify that Db2 is no longer in write-suspend
mode:
db2 connect to bigsqlIf the command finishes successfully, Db2 is no longer in write-suspend mode.
- Restart the Data Virtualization
caching pod by deleting the existing
pod:
oc delete pod $(oc get pods | grep -i c-db2u-dv-dvcaching | cut -d' ' -f 1)
Online backup of Analytics Engine powered by Apache Spark fails
Applies to: 5.0.1
Fixed in: 5.0.2
- Diagnosing the problem
- When you try to create a backup of a Cloud Pak for Data deployment that includes the
Analytics Engine powered by Apache Spark service with the
OADP utility, the backup
fails at the step to create a backup of Cloud Pak for Data PVCs and volume data. In the log
file, you see the following
error:
Hook execution breakdown by status=error/timedout: The following hooks either have errors or timed out pre-backup (1): COMPONENT CONFIGMAP METHOD STATUS DURATION analyticsengine-cnpsql-ckpt cpd-analyticsengine-aux-edb-ckpt-cm rule error 1m17.502299591s -------------------------------------------------------------------------------- ** INFO [BACKUP CREATE/SUMMARY/END] ******************************************* Error: error running pre-backup hooks: Error running pre-processing rules. Check the /root/install_automation/cpd-cli-linux-EE-14.0.1-353/cpd-cli-workspace/logs/CPD-CLI-<date>.log for errors. [ERROR] <timestamp> RunPluginCommand:Execution error: exit status 1 - Cause of the problem
- The EDB Postgres cluster spark-hb-cloud-native-postgresql remains fenced.
- Resolving the problem
- Unfence the cluster by doing the following steps:
- Edit the spark-hb-cloud-native-postgresql
cluster:
oc edit clusters.postgresql.k8s.enterprisedb.io spark-hb-cloud-native-postgresql - Remove the following
line:
k8s.enterprisedb.io/fencedInstances: "" - Retry the backup.
Tip: For more information about resolving problems with EDB Postgres clusters that remain fenced, see EDB Postgres cluster is in an unhealthy state after a failed online backup.
Watson Speech services status is stuck in
InProgress after restore
Applies to: 5.0.0
Fixed in: 5.0.1
- Diagnosing the problem
- After an online restore with the OADP utility, the
CPD-CLI*.log file shows
speechStatusis in theInProgressstate. - Cause of the problem
- The
speechStatusis in theInProgressstate due to a race condition in the stt-async component. Pods that are associated with this component are stuck in0/1 Runningstate. Run the following command to confirm this state:oc get pods -l app.kubernetes.io/component=stt-asyncExample output:NAME READY STATUS RESTARTS AGE speech-cr-stt-async-775d5b9d55-fpj8x 0/1 Running 0 60mIf one or more pods is in the
0/1 Runningstate for 20 minutes or more, this problem might occur. - Resolving the problem
- For each pod in the
0/1 Runningstate, run the following command:oc delete pod <stt-async-podname>
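Optionally, confirm that the restarted pods reach the 1/1 Running state. This check is a suggestion:
# Optional check: all stt-async pods should eventually show 1/1 Running.
oc get pods -l app.kubernetes.io/component=stt-async -n ${PROJECT_CPD_INST_OPERANDS}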
Common core services and dependent services in a failed state after an online restore
Applies to: 5.0.0
- Diagnosing the problem
- After you restore an online backup with the OADP backup and restore utility, the
Common core services custom resource and
the custom resource of dependent services remain in an
InProgressstate. - Cause of the problem
- Intermittent Elasticsearch failure.
- Workaround
- To work around the problem, do the following steps:
- Make sure that the current project (namespace) is set to the project that contains the Common core services and Watson Knowledge Catalog deployment.
- Make sure that a valid backup is available by running the following
command:
oc exec elasticsea-0ac3-ib-6fb9-es-server-esnodes-0 -c elasticsearch -- curl --request GET --url http://localhost:19200/_cat/snapshots/cloudpak --header 'content-type: application/json' - When a valid backup is present, the command returns output like in the following
example:
cloudpak_snapshot_<timestamp> SUCCESS <epoch_timestamp> <hh:mm:ss> <epoch_timestamp> <hh:mm:ss> 200ms 3 23 0 23 - If a snapshot is not present, the restore has unexpectedly failed. Contact IBM Support for assistance.
- If a valid snapshot is present, delete the indexes on the
cluster:
oc exec -n ${PROJECT_CPD_INST_OPERANDS} elasticsea-0ac3-ib-6fb9-es-server-esnodes-0 -c elasticsearch -- curl --request DELETE --url 'http://localhost:19200/granite-3b,wkc,gs-system-index-wkc-v001,semantic' --header 'content-type: application/json' - Scale the OpenSearch
cluster down by
quiescing:
oc patch elasticsearchcluster elasticsearch-master --type merge --patch '{"spec": {"quiesce": true}}' - Wait for the pods to scale down, checking the status with the following
command:
watch "oc get pods | grep elasticsea"
it:
oc patch elasticsearchcluster elasticsearch-master --type merge --patch '{"spec": {"quiesce": false}}'
After you do these steps, Elasticsearch comes back up and automatically starts recovery.
Backup validation fails because of missing resources in
wkc-foundationdb-cluster-aux-checkpoint-cm ConfigMap
Applies to: 5.0.3 and later
- Diagnosing the problem
- Backup fails with an error message similar to the following
example:
Failed with 1 error(s): error: DataProtectionPlan=service-orchestration, Action=cpd-backup-validation (index=8) backup validation failed: 1 error occurred: * backup validation failed for configmap: wkc-foundationdb-cluster-aux-checkpoint-cm - Cause of the problem
- The backup validation expected a secret to be present in the backup, but it wasn't captured.
- Workaround
- To resolve the issue, remove the following resources from the
wkc-foundationdb-cluster-aux-checkpoint-cmConfigMap:- resource-kind: role.rbac.authorization.k8s.io validation-rules: - type: match_names names: - common-service-db-app - common-service-db-superuser - resource-kind: rolebinding.rbac.authorization.k8s.io validation-rules: - type: match_names names: - common-service-db - resource-kind: cluster.postgresql.k8s.enterprisedb.io validation-rules: - type: count op: ">=" val: 1 labels: "foundationservices.cloudpak.ibm.com=cs-db,icpdsupport/addOnId=cpfs,icpdsupport/app=postgres"Now, backup should be validated when you run it again.
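For example, you can open the ConfigMap in an editor and delete the listed entries. This step is a suggestion and assumes that the ConfigMap is in the Cloud Pak for Data instance project:
# Open the ConfigMap and remove the entries listed above, then save and exit.
oc edit cm wkc-foundationdb-cluster-aux-checkpoint-cm -n ${PROJECT_CPD_INST_OPERANDS}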
Restore posthooks fail to run when restoring Data Virtualization with the OADP utility
Applies to: 5.0.3
- Diagnosing the problem
- In the CPD-CLI*.log file, you see an error message like in the
following
example:
velero post-backup hooks in namespace <namespace> have one or more errors check for errors in <cpd-cli location>, and try again Error: velero post-backup hooks failed [ERROR] <timestamp> RunPluginCommand:Execution error: exit status 1 - Cause of the problem
- Velero hooks annotations are blocking the restore posthooks from running.Get the Data Virtualization addon pod definition by running a command like in the following example:
oc get po dv-addon-6fdddc4bc7-8bdlq -o jsonpath="{.metadata.annotations}" | jq .Example output that shows the Velero annotations:... "post.hook.backup.velero.io/command": "[\"bash\", \"-c\", \"echo Executing post-backup no-op hook\"]", "post.hook.restore.velero.io/command": "[\"bash\", \"-c\", \"echo Executing post-resttore no-op hook\"]", "pre.hook.backup.velero.io/command": "[\"bash\", \"-c\", \"echo Executing pre-backup no-op hook\"]", ... - Resolving the problem
- Remove the Velero hooks annotations. Because these annotations are not used, you can
remove them from all pods. Run the following
commands:
oc annotate po --all post.hook.backup.velero.io/command- -n ${PROJECT_CPD_INST_OPERANDS} oc annotate po --all post.hook.restore.velero.io/command- -n ${PROJECT_CPD_INST_OPERANDS} oc annotate po --all pre.hook.backup.velero.io/command- -n ${PROJECT_CPD_INST_OPERANDS}After the annotations are removed, rerun the restore posthooks command:
cpd-cli oadp restore posthooks \ --tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} \ --log-level=debug \ --verbose
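To confirm that the annotations were removed from a particular pod, you can repeat the query that was used for diagnosis and list only the remaining annotation keys. This is a sketch that reuses the example pod name from the diagnosis step:
oc get po dv-addon-6fdddc4bc7-8bdlq -n ${PROJECT_CPD_INST_OPERANDS} -o jsonpath="{.metadata.annotations}" | jq 'keys'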
Backup fails for the platform with error in EDB Postgres cluster
Applies to: 5.0.0 and later
Applies to: All backup and restore methods
- Diagnosing the problem
- For example, in IBM Storage Fusion, the backup fails at the Hook: br-service hooks/pre-backup
stage in the backup sequence.
In the cpdbr-oadp.log file, you see the following error:
time=<timestamp> level=info msg=cmd stderr: Error: cannot take a cold backup of the primary instance or a target primary instance if the k8s.enterprisedb.io/snapshotAllowColdBackupOnPrimary annotation is not set to enabled - Cause of the problem
- Labels and annotations in the EDB Postgres cluster resources were not updated after a switchover of the EDB Postgres cluster's primary instance and replica.
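To confirm this cause before you apply a workaround, you can check whether the annotation that is named in the error message is set on the EDB Postgres cluster resources. This is a rough sketch that assumes the cluster resources are in the operands project:
oc get cluster.postgresql.k8s.enterprisedb.io -n ${PROJECT_CPD_INST_OPERANDS} -o yaml | grep -i snapshotAllowColdBackupOnPrimary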
- Resolving the problem
-
Use either the automatic or manual workaround.
- Automatic workaround
-
After you apply the YAML files, this workaround runs automatically as a prehook every time you take a backup, so you do not encounter the issue again. This is especially useful if you have scheduled automatic backups.
- Check that the ${VERSION} environment variable is set in cpd_vars.sh to the correct Cloud Pak for Data version number.
- Download the edb-patch-resources-legacy.yaml file.
- Run the following command:
oc apply -n ${PROJECT_CPD_INST_OPERANDS} -f edb-patch-resources-legacy.yaml
- Complete the steps that apply to your backup and restore method.
Online backup and restore:
- Download the edb-patch-aux-ckpt-cm-legacy.yaml file.
- Run the following command:
sed "s/VERSION_PLACEHOLDER/${VERSION}/g" edb-patch-aux-ckpt-cm-legacy.yaml | oc apply -n ${PROJECT_CPD_INST_OPERANDS} -f -
- Retry the backup.
Offline backup and restore:
- Download the edb-patch-aux-br-cm-legacy.yaml file.
- Run the following command:
sed "s/VERSION_PLACEHOLDER/${VERSION}/g" edb-patch-aux-br-cm-legacy.yaml | oc apply -n ${PROJECT_CPD_INST_OPERANDS} -f -
- Retry the backup.
- Manual workaround
-
Complete the following steps to manually run the workaround:
Note: If another switchover of the EDB Postgres cluster's primary instance and replica happens after you apply the manual workaround, you must complete the workaround again before you take a backup.- Download the edb-patch.sh file.
- Run the following
command:
sh edb-patch.sh ${PROJECT_CPD_INST_OPERANDS} - Retry the backup.
Restoring an RSI-enabled backup fails
Applies to: 5.0.0-5.0.2
Fixed in: 5.0.3
- Diagnosing the problem
- Restoring an RSI-enabled backup with IBM Storage Fusion fails at the
Hook: br-service-hooks-operators restorestep. The cpdbr-tenant.log file shows the following error:cannot create resource "mutatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope - Cause of the problem
- Permissions are missing in the cpdbr-tenant-service-clusterrole clusterrole.
- Resolving the problem
- Do the following steps:
- Install cpd-cli 5.0.3.
- Upgrade the cpdbr service:
- The cluster pulls images from the IBM
Entitled Registry:
- Environments with the scheduling service
-
cpd-cli oadp install \ --upgrade=true \ --component=cpdbr-tenant \ --tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} \ --cpdbr-hooks-image-prefix=icr.io/cpopen/cpd \ --cpd-scheduler-namespace=${PROJECT_SCHEDULING_SERVICE} \ --recipe-type=br \ --log-level=debug \ --verbose - Environments without the scheduling service
-
cpd-cli oadp install \ --upgrade=true \ --component=cpdbr-tenant \ --tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} \ --cpdbr-hooks-image-prefix=icr.io/cpopen/cpd \ --recipe-type=br \ --log-level=debug \ --verbose
- The cluster pulls images from a private container registry:
- Environments with the scheduling service
-
cpd-cli oadp install \ --upgrade=true \ --component=cpdbr-tenant \ --namespace=${OADP_OPERATOR_NS} \ --tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} \ --cpdbr-hooks-image-prefix=${PRIVATE_REGISTRY_LOCATION} \ --cpd-scheduler-namespace=${PROJECT_SCHEDULING_SERVICE} \ --recipe-type=br \ --log-level=debug \ --verbose - Environments without the scheduling service
-
cpd-cli oadp install \ --upgrade=true \ --component=cpdbr-tenant \ --namespace=${OADP_OPERATOR_NS} \ --tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} \ --cpdbr-hooks-image-prefix=${PRIVATE_REGISTRY_LOCATION} \ --recipe-type=br \ --log-level=debug \ --verbose
- Retry the restore.
Restore fails at Hook: br-service-hooks-operators restore step
Applies to: 5.0.1
Fixed in: 5.0.2
- Diagnosing the problem
- This problem occurs when using IBM Storage Fusion 2.7.2.
- The restore process fails at the
Hook: br-service-hooks-operators restorestep, and you see the following error message:Recipe failed BMYBR0003 There was an error when processing the job in the Transaction Manager service - The ${PROJECT_CPD_INST_OPERANDS} project was not created during the restore.
- When you run the following commands, the IBM Storage Fusion application custom resource does not have the Cloud Pak for Data instance project listed under .spec.includedNamespaces:
export PROJECT_FUSION=<fusion-namespace>
Tip: By default, the IBM Storage Fusion project is ibm-spectrum-fusion-ns.
oc get fapp -n ${PROJECT_FUSION} ${PROJECT_CPD_INST_OPERATORS} -o json | jq .spec
- Cause of the problem
- The backup is incomplete, causing the restore to fail.
- Resolving the problem
- Do the following steps:
- Install cpd-cli 5.0.2.
- Upgrade the cpdbr service:
- The cluster pulls images from the IBM
Entitled Registry:
- Environments with the scheduling service
-
cpd-cli oadp install \ --upgrade=true \ --component=cpdbr-tenant \ --tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} \ --cpdbr-hooks-image-prefix=icr.io/cpopen/cpd \ --cpd-scheduler-namespace=${PROJECT_SCHEDULING_SERVICE} \ --recipe-type=br \ --log-level=debug \ --verbose - Environments without the scheduling service
-
cpd-cli oadp install \ --upgrade=true \ --component=cpdbr-tenant \ --tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} \ --cpdbr-hooks-image-prefix=icr.io/cpopen/cpd \ --recipe-type=br \ --log-level=debug \ --verbose
- The cluster pulls images from a private container registry:
- Environments with the scheduling service
-
cpd-cli oadp install \ --upgrade=true \ --component=cpdbr-tenant \ --namespace=${OADP_OPERATOR_NS} \ --tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} \ --cpdbr-hooks-image-prefix=${PRIVATE_REGISTRY_LOCATION} \ --cpd-scheduler-namespace=${PROJECT_SCHEDULING_SERVICE} \ --recipe-type=br \ --log-level=debug \ --verbose - Environments without the scheduling service
-
cpd-cli oadp install \ --upgrade=true \ --component=cpdbr-tenant \ --namespace=${OADP_OPERATOR_NS} \ --tenant-operator-namespace=${PROJECT_CPD_INST_OPERATORS} \ --cpdbr-hooks-image-prefix=${PRIVATE_REGISTRY_LOCATION} \ --recipe-type=br \ --log-level=debug \ --verbose
- Patch policy assignments with the backup and restore recipe details.
- Log in to Red Hat
OpenShift Container Platform as an instance
administrator.
${OC_LOGIN}Remember:OC_LOGINis an alias for theoc logincommand. - Get each policy assignment
name:
export PROJECT_FUSION=<fusion-namespace>oc get policyassignment -n ${PROJECT_FUSION} - If installed, patch the
${PROJECT_SCHEDULING_SERVICE} policy assignment. The patch is wrapped in double quotes so that the shell expands the variables:
oc -n ${PROJECT_FUSION} patch policyassignment <cpd-scheduler-policy-assignment> --type merge -p "{\"spec\":{\"recipe\":{\"name\":\"ibmcpd-scheduler\", \"namespace\":\"${PROJECT_SCHEDULING_SERVICE}\", \"apiVersion\":\"spp-data-protection.isf.ibm.com/v1alpha1\"}}}"
- Patch the Cloud Pak for Data tenant policy assignment:
oc -n ${PROJECT_FUSION} patch policyassignment <cpd-tenant-policy-assignment> --type merge -p "{\"spec\":{\"recipe\":{\"name\":\"ibmcpd-tenant\", \"namespace\":\"${PROJECT_CPD_INST_OPERATORS}\", \"apiVersion\":\"spp-data-protection.isf.ibm.com/v1alpha1\"}}}"
- Check that the IBM Storage Fusion application custom
resource for the Cloud Pak for Data
operator includes the following information:
- All projects (namespaces) that are members of the Cloud Pak for Data instance, including:
- The Cloud Pak for Data operators project (${PROJECT_CPD_INST_OPERATORS}).
- The Cloud Pak for Data operands project (${PROJECT_CPD_INST_OPERANDS}).
- All tethered projects, if they exist.
- The
PARENT_NAMESPACEvariable, which is set to${PROJECT_CPD_INST_OPERATORS}.
- To get the list of all projects that are members of the Cloud Pak for Data instance, run the
following
command:
oc get -n ${PROJECT_FUSION} applications.application.isf.ibm.com ${PROJECT_CPD_INST_OPERATORS} -o jsonpath={'.spec.includedNamespaces'} - To get the
PARENT_NAMESPACEvariable, run the following command:oc get -n ${PROJECT_FUSION} applications.application.isf.ibm.com ${PROJECT_CPD_INST_OPERATORS} -o jsonpath={'.spec.variables'}
- Take a new backup.
Data Virtualization restore fails at post-workload step
Applies to: 5.0.0-5.0.2
Fixed in: 5.0.3
- Diagnosing the problem
- When restoring an online backup of a Cloud Pak for Data deployment that includes Data Virtualization with IBM Storage Fusion, the restore fails at
the Hook: br-service-hooks/post-workload step in the restore
sequence. In the log file, you see the following error
message:
time=<timestamp> level=info msg= zen/configmap/cpd-dv-aux-ckpt-cm: component=dv, op=<mode=post-restore,type=config-hook,method=rule>, status=error func=cpdbr-oadp/pkg/quiesce.logPlanResult file=/go/src/cpdbr-oadp/pkg/quiesce/planexecutor.go:1137 - Workaround
- To work around the problem, do the following steps:
- Scale down the Data Virtualization
hurricane
pod:
oc scale deployment c-db2u-dv-hurricane-dv --replicas=0
- Log in to the Data Virtualization head pod:
oc rsh c-db2u-dv-db2u-0 bash
su - db2inst1
- Create a backup copy of the users.json
file:
cp /mnt/blumeta0/db2_config/users.json /mnt/PV/versioned/logs/users.json.original - Edit the users.json
file:
vi /mnt/blumeta0/db2_config/users.json - Locate
"locked":trueand change it to"locked":false. - Scale up the Data Virtualization
hurricane
pod:
oc scale deployment c-db2u-dv-hurricane-dv --replicas=1 - Restart BigSQL from the Data Virtualization head
pod:
oc exec -it c-db2u-dv-db2u-0 -- su - db2inst1 -c "bigsql start"The Data Virtualization head and worker pods continue with the startup sequence.
- Wait until the Data Virtualization
head and worker pods are fully started by running the following 2
commands:
oc get pods | grep -i c-db2u-dv-dvcaching | grep 1/1 | grep -i Runningoc exec -t c-db2u-dv-db2u-0 -- su - db2inst1 -c "ls -ls /tmp" | grep dv_setup_completeThe Data Virtualization head and worker pods are fully started when these 2 commands return
grepresults instead of empty results. - Re-create marker file that is needed by Data Virtualization's post-restore hook
logic:
oc exec -t c-db2u-dv-db2u-0 -- su - db2inst1 -c "touch /tmp/.ready_to_connectToDb" - Re-run the post-restore hook.
- Get the cpdbr-tenant-service pod
ID:
oc get po -A | grep "cpdbr-tenant-service" - Log in to the cpdbr-tenant-service
pod:
oc rsh -n ${PROJECT_CPD_INST_OPERATORS} <cpdbr-tenant-service pod id> - Run the following
commands:
/cpdbr-scripts/cpdbr-oadp restore posthooks --hook-kind=checkpoint --include-namespaces=${PROJECT_CPD_INST_OPERATORS},${PROJECT_CPD_INST_OPERANDS} --log-level=debug --verbose/cpdbr-scripts/cpdbr/cpdbr-cpd-operators.sh restore-namespacescope --operators-namespace ${PROJECT_CPD_INST_OPERATORS} --foundation-namespace ${PROJECT_CPD_INST_OPERATORS}
Restoring an online backup of Cloud Pak for Data on IBM Storage Scale Container Native storage fails
Applies to: IBM Storage Fusion 2.7.2 and later
- Diagnosing the problem
- When you restore an online backup with IBM Storage Fusion, the restore process fails at the Volume group: cpd-volumes step in the restore sequence.
- Workaround
- This problem occurs when you have Persistent Volume Claims (PVCs) that are smaller
than 5Gi. To work around the problem, expand any PVC that is smaller than 5Gi to at
least 5Gi before you create the backup. For details, see Volume Expansion in the IBM Storage Scale Container Storage Interface Driver
documentation.Note: You cannot manually expand Watson OpenScale PVCs. To manage PVC sizes for Watson OpenScale, see Managing persistent volume sizes for Watson OpenScale.
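For example, a PVC can typically be expanded with a patch like the following, provided that the storage class supports volume expansion. This is a sketch; <pvc-name> is a placeholder, and the command does not apply to Watson OpenScale PVCs, as noted above:
oc patch pvc <pvc-name> -n ${PROJECT_CPD_INST_OPERANDS} --type merge -p '{"spec":{"resources":{"requests":{"storage":"5Gi"}}}}'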
Backup failed at Volume group: cpd-volumes stage
Applies to: IBM Storage Fusion 2.7.2
Fixed in: IBM Storage Fusion 2.7.2 hotfix
- Diagnosing the problem
- In the backup sequence in IBM Storage Fusion 2.7.2, the backup
fails at the Volume group: cpd-volumes stage.
The transaction manager log shows several error messages, such as the following examples:
<timestamp>[TM_0] - Error: Processing of volume cc-home-pvc failed.\n", "<timestamp>[VOL_12] -Snapshot exception (410)\\nReason: Expired: too old resource version: 2575013 (2575014) - Workaround
- Install the IBM Storage Fusion 2.7.2 hotfix. For details, see IBM Storage Fusion and IBM Storage Fusion HCI hotfix.
Backup of Cloud Pak for Data operators project fails at data transfer stage
Applies to: IBM Storage Fusion 2.7.2
Fixed in: IBM Storage Fusion 2.7.2 hotfix
- Diagnosing the problem
- In IBM Storage Fusion 2.7.2, the
backup fails at the Data transfer stage, with the following
error:
Failed transferring data There was an error when processing the job in the Transaction Manager service - Cause
- The length of a Persistent Volume Claim (PVC) name is more than 59 characters.
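To find PVCs that might be affected, you can list the PVC names that are longer than 59 characters. A minimal sketch:
oc get pvc -n ${PROJECT_CPD_INST_OPERANDS} -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' | awk 'length($0) > 59'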
- Workaround
- Install the IBM Storage Fusion
2.7.2 hotfix. For details, see IBM Storage Fusion and
IBM Storage Fusion HCI
hotfix.
With the hotfix, PVC names can be up to 249 characters long.
Watson OpenScale etcd server fails to start after restoring from a backup
Applies to: 5.0.0 and later
- Diagnosing the problem
- After restoring a backup with NetApp Astra Control Center, the Watson
OpenScale
etcd cluster is in a
Failedstate. - Workaround
- To work around the problem, do the following steps:
-
Log in to Red Hat OpenShift Container Platform as a cluster administrator.
${OC_LOGIN}Remember:OC_LOGINis an alias for theoc logincommand. - Expand the size of the etcd PersistentVolumes by 1Gi.
In the following example, the current PVC size is 10Gi, and the commands set the new PVC size to 11Gi.
operatorPod=`oc get pod -n ${PROJECT_CPD_INST_OPERATORS} -l name=ibm-cpd-wos-operator | awk 'NR>1 {print $1}'` oc exec ${operatorPod} -n ${PROJECT_CPD_INST_OPERATORS} -- roles/service/files/etcdresizing_for_resizablepv.sh -n ${PROJECT_CPD_INST_OPERANDS} -s 11Gi - Wait for the reconciliation status of the Watson
OpenScale custom resource to be in a
Completedstate:oc get WOService aiopenscale -n ${PROJECT_CPD_INST_OPERANDS} -o jsonpath='{.status.wosStatus} {"\n"}'The status of the custom resource changes to
Completedwhen the reconciliation finishes successfully.
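Before you expand the etcd PersistentVolumes, you can check the current PVC sizes. This sketch assumes that the etcd PVC names contain the string "etcd":
oc get pvc -n ${PROJECT_CPD_INST_OPERANDS} | grep -i etcd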
-
Restore fails at the running post-restore script step
Applies to: 5.0.3
- Diagnosing the problem
- When you use Portworx
asynchronous disaster recovery, activating applications fails when you run the post-restore script. In the
restore_post_hooks_<timestamp>.log file,
you see an error message such as in the following
example:
Time: <timestamp> level=error - cpd-tenant-restore-<timestamp>-r2 failed /cpdbr-scripts/cpdbr/cpdbr-tenant.sh post-restore exit code=1 *** cpdbr-tenant.sh post-restore failed *** command terminated with exit code 1 - Resolving the problem
- To work around the problem, prior to running the post-restore script, restore custom
resource definitions by running the following
command:
cpd-cli oadp restore create <restore-name-r2> \ --from-backup=cpd-tenant-backup-<timestamp>-b2 \ --include-resources='customresourcedefinitions' \ --include-cluster-resources=true \ --skip-hooks \ --log-level=debug \ --verbose
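After the restore of the custom resource definitions completes, a rough spot check before you run the post-restore script is to confirm that Cloud Pak for Data CRDs exist on the cluster. The grep pattern is only an illustrative example:
oc get customresourcedefinitions | grep -Ei 'cpd|zen' | head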
Cloud Pak for Data resources are not migrated
Applies to: 5.0.2 and later
- Diagnosing the problem
- When you use Portworx
asynchronous disaster recovery, the migration finishes almost immediately, but no volumes and fewer resources than expected are migrated. Run the following command:
storkctl get migrations -n ${PX_ADMIN_NS}Tip:${PX_ADMIN_NS}is usually kube-system.Example output:NAME CLUSTERPAIR STAGE STATUS VOLUMES RESOURCES CREATED ELAPSED TOTAL BYTES TRANSFERRED cpd-tenant-migrationschedule-interval-<timestamp> mig-clusterpair Final Successful 0/0 0/0 <timestamp> Volumes (0s) Resources (3s) 0 - Cause of the problem
- This problem occurs starting with stork 23.11.0. Backup exec rules are not run, and expected Cloud Pak for Data resources are not migrated.
- Resolving the problem
- To resolve the problem, downgrade stork to a version prior to
23.11.0. For more information about stork releases, see the stork Releases page.
- Scale down the Portworx operator so that it doesn't reset manual changes to the
stork
deployment:
oc scale -n ${PX_ADMIN_NS} deploy portworx-operator --replicas=0 - Edit the stork deployment image version to a version prior to
23.11.0:
oc edit deploy -n ${PX_ADMIN_NS} stork - If you need to scale up the Portworx operator, run the
following command.Note: The Portworx operator will undo changes to the stork deployment and return to the original stork version.
oc scale -n ${PX_ADMIN_NS} deploy portworx-operator --replicas=1
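Before and after you edit the stork deployment, you can check which stork image version is currently deployed. A minimal sketch that assumes stork is the first container in the deployment:
oc get deploy stork -n ${PX_ADMIN_NS} -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'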
Custom foundation models missing after backup and restore
Applies to: 5.0.0 and later
Applies to: Offline backup with the OADP utility
- Diagnosing the problem
-
When performing offline backup and restore, BYOM (Bring Your Own Model) custom foundation model deployments are not included in the backup, and they do not exist after the restore. Additionally, inference on BYOM deployments fails after restore even when the deployment metadata is restored.
- Cause of the problem
-
The issue occurs because the
watsonx-ai-ifmoperator's offline backup hook deletes allinferenceserviceswithout excluding BYOM custom foundation models. The models are deleted during backup preparation, and they are not captured in the backup. - Resolving the problem
-
To resolve the issue, you must modify the deletion command before taking a backup, and restart the caikit-runtime-stack-operator pod after you complete the backup and restore process.
- Before taking a backup
-
Before taking an offline backup, you must do the following steps:
Note: If the operator pod is deleted or restarted, all these changes are lost. These steps must be repeated every time a new operator pod is created.- Put the
watsonxaiifmoperator and custom resource (CR) into maintenance mode:oc patch watsonxaiifm watsonxaiifm-cr --namespace $PROJECT_INSTANCE_NAMESPACE --type=merge --patch '{"spec": {"forceReconcile":"true"}}' - Switch to the operator namespace:
oc project $PROJECT_OPERATOR_NAMESPACE - Access the
watsonx-ai-ifmoperator pod:oc rsh ibm-cpd-watsonx-ai-ifm-operator-xxxxxxx - Navigate to the templates directory:
cd /opt/ansible/11.1.0/roles/watsonxaiifm-post-install/templates - Backup the original file:
cp post-cpbr-watsonxaiifm-inference-services-br-config.yaml.j2 \ post-cpbr-watsonxaiifm-inference-services-br-config.yaml.j2.bak - Modify the deletion command to exclude custom foundation models:
sed -i 's|command: \["/bin/bash", "-c", "kubectl delete inferenceservices -l '\''icpdsupport/addOnId=watsonx_ai_ifm'\'' ANDAND kubectl delete inferenceservices -l '\''syom_model'\''"\]|command: ["/bin/bash", "-c", "kubectl delete inferenceservices -l '\''icpdsupport/addOnId=watsonx_ai_ifm,model_type!=custom_foundation_model,type!=cfm'\'' -n $NAMESPACE ANDAND kubectl delete inferenceservices -l '\''syom_model,model_type!=custom_foundation_model,type!=cfm'\'' -n $NAMESPACE"]|' \ post-cpbr-watsonxaiifm-inference-services-br-config.yaml.j2 - Verify the change:
grep -n "kubectl delete inferenceservices" \ post-cpbr-watsonxaiifm-inference-services-br-config.yaml.j2 - Exit the container:
Ctrl + D - Remove the operator and CR from maintenance mode:
oc patch watsonxaiifm watsonxaiifm-cr --namespace $PROJECT_INSTANCE_NAMESPACE --type=merge --patch '{"spec": {"forceReconcile":"false"}}'
- After completing a restore
-
After you complete the offline restore, restart the
caikit-runtime-stack-operatorpod.- Switch to operator namespace:
oc project $PROJECT_OPERATOR_NAMESPACE - List and identify the caikit operator pod:
oc get pod|grep caikit - Delete the
caikit-runtime-stack-operatorpod to trigger restart:oc delete pod caikit-runtime-stack-operator-9dcc859d8-bmz8q - Verify
fmaaspods are running in the instance namespace:oc project cpd-instance oc get pod|grep fmaas - Confirm the following pods are running:
fmaas-caikit-inf-prompt-tunesfmaas-caikit-trainerfmaas-mtfmaas-router
Creating an offline backup in REST mode stalls
Applies to: 5.0.0 and later
- Diagnosing the problem
- This problem occurs when you try to create an offline backup in REST mode by using a
custom
--image-prefixvalue. The offline backup stalls with cpdbr-vol-mnt pods in theImagePullBackOffstate. - Cause of the problem
- When you specify the --image-prefix option in the cpd-cli oadp backup create command, the specified value is ignored and the default prefix registry.redhat.io/ubi9 is always used.
- To work around the problem, create the backup in Kubernetes mode instead. To change
to this mode, run the following
command:
cpd-cli oadp client config set runtime-mode=
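If your environment normally runs backups in REST mode, you can presumably switch back after this backup completes. The value shown here is an assumption based on the mode name and is not confirmed by this document:
cpd-cli oadp client config set runtime-mode=rest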
Common core services custom
resource is in InProgress state after an offline restore to a different
cluster
Applies to: 5.0.0, 5.0.1
Fixed in: 5.0.2
- Diagnosing the problem
-
- Get the status of installed components by running the following
command.
cpd-cli manage get-cr-status \ --cpd_instance_ns=${PROJECT_CPD_INST_OPERANDS} - Check that the status of ccs-cr is
InProgress.
- Cause of the problem
- The Common core services component failed to reconcile on the restored cluster, because the dsx-requisite-pre-install-job-<xxxx> pod job is failing.
- Resolving the problem
- To resolve the problem, follow the instructions that are described in the technote Failed dsx-requisite-pre-install-job during offline restore.
OpenPages offline backup fails with pre-hook error
Applies to: 5.0.1, 5.0.2
Fixed in: 5.0.3
- Diagnosing the problem
- The CPD-CLI*.log file shows pre-backup hook errors such as in the
following
example:
<time> Hook execution breakdown by status=error/timedout: <time> <time> The following hooks either have errors or timed out <time> <time> pre-backup (1): <time> <time> COMPONENT CONFIGMAP METHOD STATUS DURATION <time> openpages-openpagesinstance-cr openpages-openpagesinstance-cr-aux-br-cm rule error 6m0.080179343s <time> <time> -------------------------------------------------------------------------------- <time> <time> <time> ** INFO [BACKUP CREATE/SUMMARY/END] ******************************************* <time> Error: error running pre-backup hooks: Error running pre-processing rules. Check the /root/br/backup/cpd-cli-workspace/logs/CPD-CLI-<timestamp>.log for errors. <time> [ERROR] <timestamp> RunPluginCommand:Execution error: exit status 1 <time> nfs0717bak-tenant-offline-b1 k8s offline backup failed - Cause of the problem
- Getting the OpenPages custom
resource into the
InMaintenancestate timed out. - Workaround
- Increase the pre-hooks timeout value in the
openpages-openpagesinstance-cr-aux-br-cm ConfigMap.
- Edit the openpages-openpagesinstance-cr-aux-br-cm
ConfigMap:
oc edit cm openpages-openpagesinstance-cr-aux-br-cm -n ${PROJECT_CPD_INST_OPERANDS} - Under
pre-hooks, change the timeout value to 600s.pre-hooks: exec-rules: - resource-kind: OpenPagesInstance name: openpagesinstance-cr actions: - builtins: name: cpdbr.cpd.ibm.com/enable-maint params: statusFieldName: openpagesStatus timeout: 600s
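To verify the change, you can print the timeout values from the ConfigMap. A quick sketch:
oc get cm openpages-openpagesinstance-cr-aux-br-cm -n ${PROJECT_CPD_INST_OPERANDS} -o yaml | grep timeout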
Offline backup pre-hooks fail on Separation of Duties cluster
Applies to: 5.0.0 and later
- Diagnosing the problem
- The CPD-CLI*.log file shows pre-backup hook errors such as in the
following
example:
<timestamp> level=info msg= test-watsonxgovernce-instance/configmap/cpd-analyticsengine-aux-br-cm: component=analyticsengine-br, op=<mode=pre-backup,type=config-hook,method=rule>, status=error func=cpdbr-oadp/pkg/quiesce.logPlanResult file=/a/workspace/oadp-upload/pkg/quiesce/planexecutor.go:1137 ... time=<timestamp> level=info msg= test-watsonxgovernce-instance/configmap/cpd-analyticsengine-cnpsql-aux-br-cm: component=analyticsengine-cnpsql-br, op=<mode=pre-backup,type=config-hook,method=rule>, status=error func=cpdbr-oadp/pkg/quiesce.logPlanResult file=/a/workspace/oadp-upload/pkg/quiesce/planexecutor.go:1137 - Cause of the problem
- The EDB Postgres
pod for the Analytics Engine powered by Apache Spark
service is in a
CrashLoopBackOffstate. - Workaround
- To work around the problem, follow the instructions in the IBM Support document Unable to upgrade Spark due to Enterprise database issues.
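To confirm the cause, you can look for pods in a CrashLoopBackOff state in the operands project. This is a rough sketch; pod names vary by deployment:
oc get pods -n ${PROJECT_CPD_INST_OPERANDS} | grep -i crashloop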
Offline restore to a different cluster fails due to management-ingress-ibmcloud-cluster-info ConfigMap not found in PodVolumeRestore
Applies to: 5.0.0
Fixed in: 5.0.1
- Diagnosing the problem
- After an offline backup is created, but before doing a restore, check if the
management-ingress-ibmcloud-cluster-info ConfigMap was backed up
by running the following
commands:
cpd-cli oadp backup status --details <backup_name1> | grep management-ingress-ibmcloud-cluster-infocpd-cli oadp backup status --details <backup_name2> | grep management-ingress-ibmcloud-cluster-infoDuring or after the restore, pods that mount the missing ConfigMap show errors. For example:
oc describe po c-db2oltp-wkc-db2u-0 -n ${PROJECT_CPD_INST_OPERANDS}Example output:Events: Type Reason Age From Message ---- ------ ---- ---- ------- Warning FailedMount 41m (x512 over 17h) kubelet MountVolume.SetUp failed for volume "management-ingress-ibmcloud-cluster-info" : configmap "management-ingress-ibmcloud-cluster-info" not found Warning FailedMount 62s (x518 over 17h) kubelet Unable to attach or mount volumes: unmounted volumes=[management-ingress-ibmcloud-cluster-info], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition - Cause of the problem
- When a related ibmcloud-cluster-info ConfigMap gets excluded as
part of backup hooks, the management-ingress-ibmcloud-cluster-info
ConfigMap copies the
excludelabeling and unintentionally gets excluded from the backup. - Workaround
- To work around the problem, do the following steps:
-
Log in to Red Hat OpenShift Container Platform as a cluster administrator.
${OC_LOGIN}Remember:OC_LOGINis an alias for theoc logincommand. - Apply the following patch to ensure that the
management-ingress-ibmcloud-cluster-info ConfigMap is not
excluded from the
backup:
oc apply -n ${PROJECT_CPD_INST_OPERANDS} -f - << EOF apiVersion: v1 kind: ConfigMap metadata: name: cpdbr-management-ingress-exclude-fix-br labels: cpdfwk.aux-kind: br cpdfwk.component: cpdbr-patch cpdfwk.module: cpdbr-management-ingress-exclude-fix cpdfwk.name: cpdbr-management-ingress-exclude-fix-br-cm cpdfwk.managed-by: ibm-cpd-sre cpdfwk.vendor: ibm cpdfwk.version: 1.0.0 data: aux-meta: | name: cpdbr-management-ingress-exclude-fix-br description: | This configmap defines offline backup prehooks to prevent cases where Bedrock's management-ingress-ibmcloud-cluster-info configmap gets unexpectedly excluded when ibmcloud-cluster-info is excluded during cs-postgres configmap prehooks. This is a temporary workaround until a complete fix is implemented. version: 1.0.0 component: cpdbr-patch aux-kind: br priority-order: 99999 # This should happen at the end of backup prehooks backup-meta: | pre-hooks: exec-rules: # Remove lingering velero exclude label from offline prehooks - resource-kind: configmap name: management-ingress-ibmcloud-cluster-info actions: - builtins: name: cpdbr.cpd.ibm.com/label-resources params: action: remove key: velero.io/exclude-from-backup value: "true" timeout: 360s # Remove lingering ignore-on-nd-backup exclude label from online checkpoint operation - resource-kind: configmap name: management-ingress-ibmcloud-cluster-info actions: - builtins: name: cpdbr.cpd.ibm.com/label-resources params: action: remove key: icpdsupport/ignore-on-nd-backup value: "true" timeout: 360s post-hooks: exec-rules: - resource-kind: # do nothing for posthooks --- apiVersion: v1 kind: ConfigMap metadata: name: cpdbr-management-ingress-exclude-fix-ckpt labels: cpdfwk.aux-kind: checkpoint cpdfwk.component: cpdbr-patch cpdfwk.module: cpdbr-management-ingress-exclude-fix cpdfwk.name: cpdbr-management-ingress-exclude-fix-ckpt-cm cpdfwk.managed-by: ibm-cpd-sre cpdfwk.vendor: ibm cpdfwk.version: 1.0.0 data: aux-meta: | name: cpdbr-management-ingress-exclude-fix-ckpt description: | This configmap defines online backup prehooks to prevent cases where Bedrock's management-ingress-ibmcloud-cluster-info configmap gets unexpectedly excluded when ibmcloud-cluster-info is excluded during cs-postgres configmap checkpoint operation. This is a temporary workaround until a complete fix is implemented. version: 1.0.0 component: cpdbr-patch aux-kind: ckpt priority-order: 99999 # This should happen at the end of backup prehooks backup-meta: | pre-hooks: exec-rules: # Remove lingering velero exclude label from offline prehooks - resource-kind: configmap name: management-ingress-ibmcloud-cluster-info actions: - builtins: name: cpdbr.cpd.ibm.com/label-resources params: action: remove key: velero.io/exclude-from-backup value: "true" timeout: 360s # Remove lingering ignore-on-nd-backup exclude label from online checkpoint operation - resource-kind: configmap name: management-ingress-ibmcloud-cluster-info actions: - builtins: name: cpdbr.cpd.ibm.com/label-resources params: action: remove key: icpdsupport/ignore-on-nd-backup value: "true" timeout: 360s post-hooks: exec-rules: - resource-kind: # do nothing for posthooks checkpoint-meta: | exec-hooks: exec-rules: - resource-kind: # do nothing for checkpoint EOF
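After you apply the patch, you can inspect the labels on the ConfigMap; if the exclude labels are still present, the new prehooks should remove them during the next backup. A quick check:
oc get cm management-ingress-ibmcloud-cluster-info -n ${PROJECT_CPD_INST_OPERANDS} --show-labels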
-
Unable to restore offline backup of OpenPages to different cluster
Applies to: 5.0.0
Fixed in: 5.0.1
- Diagnosing the problem
- In the CPD-CLI*.log file, you see an error like in the following
example:
CPD-CLI-<timestamp>.log:time=<timestamp> level=error msg=failed to wait for statefulset openpages--78c5-ib-12ce in namespace <cpd_instance_ns>: timed out waiting for the condition func=cpdbr-oadp/pkg/kube.waitForStatefulSetPods file=/a/workspace/oadp-upload/pkg/kube/statefulset.go:173 - Cause of the problem
- The second RabbitMQ pod (ending in
-1) is in aCrashLoopBackOffstate. Run the following command:
Example output:oc get pods -n ${PROJECT_CPD_INST_OPERANDS} | grep openpagesopenpages--78c5-ib-12ce-0 1/1 Running 0 23h openpages--78c5-ib-12ce-1 0/1 CrashLoopBackOff 248 (3m57s ago) 23h openpages-openpagesinstance-cr-sts-0 1/2 Running 91 (12m ago) 23h openpages-openpagesinstance-cr-sts-1 1/2 Running 91 (12m ago) 23h - Workaround
- To work around the problem, do the following steps:
-
Log in to Red Hat OpenShift Container Platform as a cluster administrator.
${OC_LOGIN}Remember:OC_LOGINis an alias for theoc logincommand. - Check the OpenPages logs
for the following error in the second RabbitMQ pod:
=========== Exception during startup: exit:{boot_failed,{exit_status,1}} peer:start_it/2, line 639 rabbit_peer_discovery:query_node_props/1, line 408 rabbit_peer_discovery:sync_desired_cluster/3, line 189 rabbit_db:init/0, line 65 rabbit_boot_steps:-run_step/2-lc$^0/1-0-/2, line 51 rabbit_boot_steps:run_step/2, line 58 rabbit_boot_steps:-run_boot_steps/1-lc$^0/1-0-/1, line 22 rabbit_boot_steps:run_boot_steps/1, line 23 - If you see this error, check the Erlang cookie value at the top of
the OpenPages logs. For
example, run the following
command:
Example output:oc logs openpages--78c5-ib-12ce-1Defaulted container "openpages-openpagesinstance-cr-<instance_id>-ibm-rabbitmq" out of: openpages-openpagesinstance-cr-<instance_id>-ibm-rabbitmq, copy-rabbitmq-config (init) ---------------------- +FkpbwejzK2RXfmPLQAnITroiieu3uGa3vkRA2k6t+8= ---------------------- <timestamp> [warning] <0.156.0> Overriding Erlang cookie using the value set in the environmentThe plus sign (+) at the beginning of the cookie value is the source of the problem.
- Generate a new token:
openssl rand -base64 32 | tr -d '\n' | base64 | tr -d '\n'
- Replace the cookie value in the auth secret.
- Edit the auth
secret:
oc edit secret openpages-openpagesinstance-cr-<instance_id>-rabbitmq-auth-secret - Replace the
rabbitmq-erlang-cookievalue with the new value.
- Delete the StatefulSet, or scale down and then scale up to get all the pods to pick up the new cookie.
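For the step where you decode the new token from Base64, a minimal check might look like the following; <new-token> is a placeholder for the value that the openssl command generated:
echo '<new-token>' | base64 -d; echo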
-
Flight service issues
Security issues
Security scans return an Inadequate Account Lockout Mechanism message
Applies to: 5.0.0 and later
- Diagnosing the problem
-
If you run a security scan against Cloud Pak for Data, the scan returns the following message.
Inadequate Account Lockout Mechanism
- Resolving the problem
-
This is by design. It is strongly recommended that you use an enterprise-grade password management solution, such as SAML SSO or an LDAP provider for password management, as described in the following resources.
The Kubernetes version information is disclosed
Applies to: 5.0.0 and later
- Diagnosing the problem
- If you run an Aqua Security scan against your cluster, the scan reports that the Kubernetes version information is disclosed.
- Resolving the problem
- This is expected based on the following solution document from Red Hat OpenShift Customer Portal: Hide kubernetes /version API endpoint in OpenShift Container Platform 4.