Backup issues
List of backup issues in the Backup & Restore service of IBM Fusion.
VolumeSnapshotClass creation issue
- Problem statement
- Backup & Restore fails for applications on the remote cluster that is connected through the Global Data Platform Remote Mount Service. The reason is that the VolumeSnapshotClass for Scale is not created through code as expected.
- Diagnosis
- Run the following command to confirm whether the VolumeSnapshotClass for Scale exists:
  oc get VolumeSnapshotClass ibm-spectrum-scale-snapshot-class
  Example output:
  NAME                                DRIVER                      DELETIONPOLICY   AGE
  ibm-spectrum-scale-snapshot-class   spectrumscale.csi.ibm.com   Delete           119d
  If it does not exist, then do the resolution steps.
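For scripted checks, the diagnosis command can be wrapped in a small helper. The following is a minimal sketch, assuming that `oc` is already logged in to the affected cluster. The function name `snapshot_class_exists` is hypothetical; only the class name comes from this section.

```shell
# Hypothetical helper: succeeds only if the Scale VolumeSnapshotClass exists.
# $1 = class name (defaults to the name used in this section).
snapshot_class_exists() {
  class_name="${1:-ibm-spectrum-scale-snapshot-class}"
  if oc get volumesnapshotclass "$class_name" >/dev/null 2>&1; then
    echo "found: $class_name"
  else
    echo "missing: $class_name (continue with the resolution steps)"
    return 1
  fi
}
```

Call `snapshot_class_exists` without arguments to check the default class; a non-zero exit status indicates that the resolution steps apply.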
- Resolution
- Do the following steps to resolve the issue:
- Run the following command to get the clusterrolebinding for the CNS operator:
  oc get clusterrolebindings -o json | jq -r '.items[] | select(.subjects[]?.name == "isf-cns-operator-controller-manager" and .roleRef.name != "system:auth-delegator") | .metadata.name'
  Example output:
  isf-operator.v2.10.0-375-awgs31UnUIdSsrwvNPiqlHYLqL6givWque6aPd
- Find the Role name corresponding to the CNS Operator.
- Run the following command to retrieve information about ClusterRoles in the cluster:
  Example command:
  Replace the provided example values in this command with your own.
  oc get ClusterRole isf-operator.v2.9.0-87828235c84 -o yaml > clusterrole.yaml
- Open the clusterrole.yaml file in edit mode.
  vim clusterrole.yaml
- Add the following rules to the clusterrole.yaml file that you created in the previous step.
  - apiGroups:
    - snapshot.storage.k8s.io
    resources:
    - volumesnapshotclasses
    verbs:
    - create
    - delete
    - get
    - list
    - patch
    - update
    - watch
- Apply the YAML file:
  oc apply -f clusterrole.yaml
- In the OpenShift® Container Platform console, verify whether a VolumeSnapshotClass with the name ibm-spectrum-scale-snapshot-class exists.
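After the ClusterRole is applied, the new permissions can also be checked from the command line. The following is a minimal sketch, assuming that the subject `isf-cns-operator-controller-manager` is a service account and that your user is allowed to impersonate it. The function name and the namespace argument are hypothetical; pass the namespace in which the CNS operator runs on your cluster.

```shell
# Hypothetical helper: checks whether the CNS operator service account may
# perform each verb on VolumeSnapshotClass objects.
# $1 = namespace of the service account (an assumption; pass your own value).
cns_can_manage_snapshot_classes() {
  sa_namespace="$1"
  sa_user="system:serviceaccount:${sa_namespace}:isf-cns-operator-controller-manager"
  rc=0
  for verb in create delete get list patch update watch; do
    # `oc auth can-i` prints "yes" or "no" for the impersonated user.
    answer="$(oc auth can-i "$verb" volumesnapshotclasses.snapshot.storage.k8s.io --as="$sa_user" 2>/dev/null)"
    echo "$verb: ${answer:-unknown}"
    [ "$answer" = "yes" ] || rc=1
  done
  return "$rc"
}
```

The function prints one line per verb and returns a non-zero status if any verb is not allowed, which suggests that the rules were not added to the ClusterRole that is bound to the operator.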
Backup size for block volumes is not correct
When the policy backup storage location gets changed, the backup size for application backups with Ceph RBD block volumes is incorrectly displayed during the first backup. As the correct information is displayed after the first backup following the change in the backup storage location, you can ignore this issue.
Multiple VM backups failing with snapshot deadline exceeded error
- Resolution
- Run the following command to enable full auditing:
  sudo auditctl -w /etc/shadow -p
- Rerun the backup and run the following command to identify the file that is causing ownership or permission issues.
  sudo ausearch -m avc -ts recent
- If the ausearch command reports issues, then run the following commands to generate a local policy to allow access.
  sudo ausearch -c 'qemu-ga' --raw | audit2allow -M my-qemuga
  sudo semodule -X 300 -i my-qemuga.pp
Failed to create snapshot content
- Problem statement
- Failed to create snapshot content with the following error:
Cannot find CSI PersistentVolumeSource for directory-based static volume
- Resolution
- To resolve the error, see Create a VolumeSnapshot.
Assign a backup policy operation fails
- Problem statement
- If you have a PolicyAssignment for an application on the hub and you create a PolicyAssignment for the same application on the spoke, then your attempt to assign a backup policy for the application fails. In both assignments, the application, backup policy, and short-form cluster name are the same. The current format of the PolicyAssignment CR name is appName-backupPolicyName-shortFormClusterName. The issue happens when the first segment of the cluster names is identical. In this scenario, the creation gets rejected because the PolicyAssignment name already exists in OpenShift Container Platform.
  For example:
  - Hub assignment creates app1-bp1-apps:
    - Application - app1
    - BackupPolicy - bp1
    - AppCluster - apps.cluster1
  - Spoke assignment creates app1-bp1-apps (OpenShift Container Platform rejects it):
    - Application - app1
    - BackupPolicy - bp1
    - AppCluster - apps.cluster2
- Resolution
- To create the PolicyAssignment for the spoke application, delete the PolicyAssignment CR for the hub application assignment and attempt spoke application assignment again.
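The name collision can be reproduced with a few lines of shell. The following is a minimal sketch, assuming that the short-form cluster name is the part of the cluster name before the first dot, as in the example values of this section. The function `policy_assignment_name` is hypothetical and is not part of the product.

```shell
# Hypothetical helper: derive a PolicyAssignment name in the
# appName-backupPolicyName-shortFormClusterName format.
# Assumption: the short form is the cluster name up to the first dot.
policy_assignment_name() {
  app="$1"; policy="$2"; cluster="$3"
  short_cluster="${cluster%%.*}"   # apps.cluster1 -> apps
  echo "${app}-${policy}-${short_cluster}"
}

hub_name="$(policy_assignment_name app1 bp1 apps.cluster1)"
spoke_name="$(policy_assignment_name app1 bp1 apps.cluster2)"
if [ "$hub_name" = "$spoke_name" ]; then
  echo "collision: both assignments resolve to $hub_name"
fi
```

Both calls produce app1-bp1-apps, so the second CR cannot be created while the first one exists.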
Backups do not work as defined in the backup policies
- Problem statement
- Sometimes, backups do not run as defined in the backup policies, especially when you set hourly policies. For example, if you set a policy to run every two hours and it does not run every two hours, then gaps exist in the backup history. A possible reason is that after a pod crash and restart, scheduled jobs did not account for the time zone, which caused gaps in the run intervals.
- Diagnosis
- The following are the observed symptoms:
- For policies with a custom schedule of every X hours at minute YY, the first scheduled run occurs at minute YY after X hours plus the time zone offset from UTC, instead of at minute YY after X hours.
- Monthly and yearly policies run more frequently.
- Resolution
- You can start backups manually until the next scheduled time.
Backup jobs are stuck in a running state for a long time and are not canceled
- Resolution
- Do the following steps to resolve the issue:
- Ensure that all jobs are finished and the queue is empty before you do any disruptive actions like node restarts.
- If jobs are running for a long period and do not progress, follow the steps to delete the backup or restore CR directly.
  - Log in to IBM Fusion.
  - Get the name of the job that is stuck from the IBM Fusion user interface.
  - Run the following command to delete the backup job.
    oc delete fbackup <job_name>
  - Run the following command to delete the restore job.
    oc delete frestore <job_name>
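The two delete commands can be combined into one guarded helper so that only a backup or restore CR is ever removed. The following is a minimal sketch; the function name `delete_stuck_job` and the optional namespace argument are hypothetical, and only the `fbackup` and `frestore` resource names come from the steps above.

```shell
# Hypothetical helper: delete a stuck backup or restore CR by name.
# $1 = job type (backup|restore), $2 = job name from the IBM Fusion UI,
# $3 = namespace of the CR (an assumption; omit to use the current project).
delete_stuck_job() {
  job_type="$1"; job_name="$2"; job_ns="${3:-}"
  case "$job_type" in
    backup)  resource="fbackup" ;;
    restore) resource="frestore" ;;
    *) echo "usage: delete_stuck_job backup|restore <job_name> [namespace]" >&2; return 2 ;;
  esac
  [ -n "$job_name" ] || { echo "job name is required" >&2; return 2; }
  if [ -n "$job_ns" ]; then
    oc delete "$resource" "$job_name" -n "$job_ns"
  else
    oc delete "$resource" "$job_name"
  fi
}
```

For example, `delete_stuck_job backup <job_name>` runs the same command as the backup step above.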
Policy creation
- Problem statement
- Sometimes, when you create a backup policy, the following errors can occur:
Error: Policy daily-snapshot could not created.
- Resolution
- Restart the isf-data-protection-operator-controller-manager-* pod in the IBM Fusion namespace. It triggers the recreation of the in-place-snapshot BackupStorageLocation CR.
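One way to restart the pod is to delete it and let its controller create a replacement. The following is a minimal sketch, assuming that the pod is managed by a controller that recreates it and that the IBM Fusion namespace is `ibm-spectrum-fusion-ns`. Both the default namespace and the function name `restart_dp_operator` are assumptions; pass your own namespace if it differs.

```shell
# Hypothetical helper: restart the data protection operator pod by deleting it.
# $1 = IBM Fusion namespace (assumed default: ibm-spectrum-fusion-ns).
restart_dp_operator() {
  fusion_ns="${1:-ibm-spectrum-fusion-ns}"
  # `-o name` prints entries such as pod/<pod_name>; keep only the operator pod.
  pods="$(oc get pods -n "$fusion_ns" -o name | grep 'isf-data-protection-operator-controller-manager' || true)"
  if [ -z "$pods" ]; then
    echo "no matching pod found in $fusion_ns" >&2
    return 1
  fi
  for pod in $pods; do
    oc delete "$pod" -n "$fusion_ns"
  done
}
```

After the new pod is running, retry the policy creation.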
Policy assignment from Backup & Restore service page of the OpenShift Container Platform console
- Problem statement
- In the Backup & Restore service page of the OpenShift Container Platform console, the backup policy assignment to an application fails with a gateway timeout error.
- Resolution
- Use the IBM Fusion user interface to assign the backup policy.
Backup attempt of multiple VMs fails
- Problem statement
- This issue occurs when some VMs are in a migrating state. OpenShift Container Platform does not support snapshots of VMs that are in a migrating state.
- Resolution
- Follow the steps to resolve this issue:
- Check whether the virtual machine is in a migrating state:
- Run the following command to check for migrating VMs.
  oc get virtualmachineinstancemigrations -A
  Example output:
  NAMESPACE           NAME                                         PHASE        VMI
  fb-bm1-fs-1-5g-10   rhel8-lesser-wildcat-migration-8fhbo         Failed       rhel8-lesser-wildcat
  vm-centipede-bm2    centos-stream9-chilly-hawk-migration-57jyk   Failed       centos-stream9-chilly-hawk
  vm-centos9-bm1-1    centos-stream9-instant-toad-migration-bfyz6  Failed       centos-stream9-instant-toad
  vm-centos9-bm1-1    centos-stream9-instant-toad-migration-d9547  Failed       centos-stream9-instant-toad
  vm-windows10-bm2-1  kubevirt-workload-update-4dm57               Failed       win10-zealous-unicorn
  vm-windows10-bm2-1  kubevirt-workload-update-f2s5w               Failed       win10-zealous-unicorn
  vm-windows10-bm2-1  kubevirt-workload-update-gt6nj               Failed       win10-zealous-unicorn
  vm-windows10-bm2-1  kubevirt-workload-update-rjwmn               Failed       win10-zealous-unicorn
  vm-windows10-bm2-1  kubevirt-workload-update-vfxfl               TargetReady  win10-zealous-unicorn
  vm-windows10-bm2-1  kubevirt-workload-update-z2thw               Failed       win10-zealous-unicorn
  vm-windows11-bm2-1  kubevirt-workload-update-9gr6v               Failed       win11-graceful-coyote
  vm-windows11-bm2-1  kubevirt-workload-update-clbck               Failed       win11-graceful-coyote
  vm-windows11-bm2-1  kubevirt-workload-update-j6pmx               Failed       win11-graceful-coyote
  vm-windows11-bm2-1  kubevirt-workload-update-sfbbx               Pending      win11-graceful-coyote
  vm-windows11-bm2-1  kubevirt-workload-update-th5dd               Failed       win11-graceful-coyote
  vm-windows11-bm2-1  kubevirt-workload-update-zl679               Failed       win11-graceful-coyote
  vm-windows11-bm2-2  kubevirt-workload-update-7dp6g               Failed       win11-conservative-moth
  vm-windows11-bm2-2  kubevirt-workload-update-9nb9m               TargetReady  win11-conservative-moth
  vm-windows11-bm2-2  kubevirt-workload-update-cdrf5               Failed       win11-conservative-moth
  vm-windows11-bm2-2  kubevirt-workload-update-dm8fz               Failed       win11-conservative-moth
  vm-windows11-bm2-2  kubevirt-workload-update-kwr6c               Failed       win11-conservative-moth
  vm-windows11-bm2-2  kubevirt-workload-update-zt8wx               Failed       win11-conservative-moth
- Exclude the migrating virtual machine from the backup. Reattempt it after the migration is complete.
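When the migration list is long, it can help to filter it down to the VMs whose migration is still in progress. The following is a minimal sketch, assuming that any phase other than Failed or Succeeded means the migration is still in progress and that the columns are NAMESPACE, NAME, PHASE, and VMI as in the example output. The function name `list_migrating_vmis` is hypothetical.

```shell
# Hypothetical helper: list namespace/VMI pairs with an in-progress migration.
# Assumption: any phase other than Failed or Succeeded counts as "migrating".
list_migrating_vmis() {
  oc get virtualmachineinstancemigrations -A --no-headers |
    awk '$3 != "Failed" && $3 != "Succeeded" { print $1 "/" $4 }' |
    sort -u
}
```

Each printed line is a candidate to exclude from the backup until its migration is complete.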
Backup applications table does not show the new backup times for the backed-up applications
- Problem statement
- The backup applications table does not show the new backup times for the backed-up applications.
- Resolution
- Go to the Applications and Jobs view to see the last successful backup job for a given application. For applications on the hub, the Applications table has the correct last backup time.
Backups are failing for the virtual machines
- Problem statement
- Backups and snapshots fail for virtual machines that are mounted with a second disk.
- Resolution
- Run the following command to get the disk details for the virtual machine.
  oc get virtualmachine -A -o json | jq '.items[] | [{name:.metadata.name, namespace:.metadata.namespace, volumes:.spec.template.spec.volumes}] | select(.[].volumes[].dataVolume | length > 1) | {name :.[].name, namespace:.[].namespace}'
  Example output:
  {
    "name": "rhel9-absent-basilisk",
    "namespace": "vmtesting"
  }
- If you find that the virtual machines are mounted with a second disk, then follow the steps mentioned in the Red Hat solution to resolve the issue.
Known issues and limitations
- When the Backup & Restore Hub auto-upgrade setting is disabled, the Backup & Restore service may still be upgraded to IBM Fusion 2.13.0.
- After a fresh installation, the first job might not process as expected. To resolve this issue, cancel the job and wait for the cancellation to complete. Then, retry the job to process it successfully. Note: This issue occurs only with the first job after installation.
- The OpenShift Container Platform cluster can have problems and become unusable. After you recover the cluster, rejoin the connections.
- The S3 bucket must not have an expiration policy or an archive rule. For more information about this known issue, see S3 buckets must not enable expiration policies.
- The Azure Endpoint URL must not contain the name of the bucket.
- When you back up an application that is in a namespace with high security context constraints privileges, the restored namespace does not have the same security context constraints privileges, which results in restored pods in CrashLoopBackOff status.
  Workaround:
  - Restart the application pod.
- When you configure node affinity for Backup & Restore, consider the following limitations:
  - There is no way to add nodeAffinity in the DataProtectionApplication CR. The Velero pod may get scheduled on the GPU nodes, even if other nodes are eligible for scheduling.
  - There is one Node-agent pod per node, which continues to run on GPU nodes as well.
    For more information about Configuring the OpenShift API for Data Protection with Multicloud Object Gateway, see Red Hat documentation.
  - There is no way to specify nodeAffinity for AMQ and OADP operator pods. These operator pods may get scheduled on the GPU nodes, even if other nodes are eligible for scheduling.
  - There is no way to specify anti-NodeAffinity for OLM job pods, so these pods may get scheduled on the GPU nodes, even if other nodes are eligible for scheduling. However, these pods are very short-lived.
  - The datamover pod for restore work may get scheduled on a GPU node. Currently, Kopia honors the setting for the backup job pod, but not for the restore job pod.
  - In summary, the following pods do not have the logic to avoid GPU nodes:
    - Velero
    - Datamover (for restore)
    - AMQ Operator
    - OADP operator
    - Node Agent
    - OLM jobs