Backup issues

List of backup issues in the Backup & Restore service of IBM Fusion.

VolumeSnapshotClass creation issue

Problem statement: Backup & Restore fails for applications on a remote cluster that is connected through the Global Data Platform Remote Mount Service. The reason is that the VolumeSnapshotClass for Scale is not created programmatically as expected.

Diagnosis

Run the following command to confirm whether the VolumeSnapshotClass for Scale exists:

oc get VolumeSnapshotClass ibm-spectrum-scale-snapshot-class

Example output:

NAME                                DRIVER                      DELETIONPOLICY   AGE
ibm-spectrum-scale-snapshot-class   spectrumscale.csi.ibm.com   Delete           119d

If it does not exist, follow the resolution steps.
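
For scripted checks, the same diagnosis can be wrapped in a small function. This is a minimal sketch, not part of the product tooling: the function name `snapshot_class_exists` is hypothetical, and it relies only on `oc get` returning a non-zero exit status when the object is not found.

```shell
# Sketch: report whether the Scale VolumeSnapshotClass exists.
# snapshot_class_exists is a hypothetical helper name.
snapshot_class_exists() {
  class="${1:-ibm-spectrum-scale-snapshot-class}"
  if oc get volumesnapshotclass "$class" >/dev/null 2>&1; then
    echo "present: $class"
  else
    echo "missing: $class"
    return 1
  fi
}
```

Calling `snapshot_class_exists` without arguments checks the default class name used in this section. A non-zero return status means that the resolution steps apply.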

Resolution

Do the following steps to resolve the issue:

Run the following command to get the clusterrolebinding for the CNS operator:

oc get clusterrolebindings -o json | jq -r '.items[] | select(.subjects[]?.name == "isf-cns-operator-controller-manager" and .roleRef.name != "system:auth-delegator") | .metadata.name'

Example output:

isf-operator.v2.10.0-375-awgs31UnUIdSsrwvNPiqlHYLqL6givWque6aPd
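
The command above prints the name of the ClusterRoleBinding. The ClusterRole that the binding grants is recorded in its `roleRef.name` field, which can be read with a jsonpath query. The following is a minimal sketch. The function name `cluster_role_for_binding` is hypothetical, and the binding and role names may or may not be identical in your cluster.

```shell
# Sketch: print the ClusterRole referenced by a ClusterRoleBinding.
# cluster_role_for_binding is a hypothetical helper name.
cluster_role_for_binding() {
  oc get clusterrolebinding "$1" -o jsonpath='{.roleRef.name}'
}
```

Pass the binding name that was returned in the previous step as the only argument.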

Find the ClusterRole name that corresponds to the CNS operator.
Run the following command to save the definition of that ClusterRole to a file.

Replace the example ClusterRole name in this command with the name from your cluster.
```
oc get ClusterRole isf-operator.v2.9.0-87828235c84 -o yaml > clusterrole.yaml
```
Open the clusterrole.yaml file in edit mode.
```
vim clusterrole.yaml
```

Add the following rule to the rules section of the clusterrole.yaml file that you created in the previous step.

- apiGroups:
  - snapshot.storage.k8s.io
  resources:
  - volumesnapshotclasses
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch

Apply the YAML file:
```
oc apply -f clusterrole.yaml
```
In the OpenShift® Container Platform console, go to Storage > VolumeSnapshotClasses, and verify that a class with the name ibm-spectrum-scale-snapshot-class exists.
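
As a command-line complement to the console check, `oc auth can-i` can show whether the added rule took effect. This is a sketch under two assumptions that are not stated in this section: that the subject `isf-cns-operator-controller-manager` is a service account, and that it lives in the namespace passed as the first argument (the default `ibm-spectrum-fusion-ns` is only a guess). The function name is hypothetical.

```shell
# Sketch: print yes/no for each verb that the new ClusterRole rule grants.
# Assumptions: the subject is a ServiceAccount, and $1 is its namespace.
check_snapshot_class_access() {
  ns="${1:-ibm-spectrum-fusion-ns}"
  sa="system:serviceaccount:${ns}:isf-cns-operator-controller-manager"
  for verb in create delete get list patch update watch; do
    printf '%s: ' "$verb"
    # can-i exits non-zero on "no"; keep going so that every verb is reported
    oc auth can-i "$verb" volumesnapshotclasses.snapshot.storage.k8s.io --as="$sa" || true
  done
}
```

Running this check requires permission to impersonate the service account. If every verb reports yes, the operator has the access described by the rule.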

Backup size for block volumes is not correct

After the backup storage location of a policy is changed, the first backup of an application with Ceph RBD block volumes displays an incorrect backup size. Subsequent backups display the correct size, so you can ignore this issue.

Multiple VM backups failing with snapshot deadline exceeded error

Resolution

Run the following command to enable full auditing:
```
sudo auditctl -w /etc/shadow -p w
```
Rerun the backup and run the following command to identify the file that is causing ownership or permission issues.
```
sudo ausearch -m avc -ts recent
```
If the ausearch command reports issues, then run the following commands to generate a local policy to allow access.
```
sudo ausearch -c 'qemu-ga' --raw | audit2allow -M my-qemuga
sudo semodule -X 300 -i my-qemuga.pp
```
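
The last two steps can be combined so that the local policy is generated only when denials exist for the guest agent. This is a minimal sketch: the function name is hypothetical, and it relies on `ausearch` returning a non-zero exit status when nothing matches.

```shell
# Sketch: build and install the local policy only if recent AVC denials
# for qemu-ga are present. allow_qemu_ga_if_denied is a hypothetical name.
allow_qemu_ga_if_denied() {
  if sudo ausearch -m avc -ts recent -c qemu-ga >/dev/null 2>&1; then
    sudo ausearch -c 'qemu-ga' --raw | audit2allow -M my-qemuga &&
      sudo semodule -X 300 -i my-qemuga.pp
  else
    echo "no recent AVC denials for qemu-ga; nothing to do"
  fi
}
```

Review the generated my-qemuga.te file before you install the module, because `audit2allow` allows every denial that it is given.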

Failed to create snapshot content

Problem statement: Failed to create snapshot content with the following error:
Cannot find CSI PersistentVolumeSource for directory-based static volume

Resolution: To resolve the error, see Create a VolumeSnapshot.

Assign a backup policy operation fails

Problem statement

If a PolicyAssignment exists for an application on the hub and you create a PolicyAssignment for the same application on the spoke, the attempt to assign the backup policy fails when the application, backup policy, and short-form cluster name are the same in both assignments. The PolicyAssignment CR name has the format appName-backupPolicyName-shortFormClusterName. The issue occurs when the first segment of both cluster names is identical. In this scenario, the creation is rejected because a PolicyAssignment with that name already exists in OpenShift Container Platform.

For example:

Hub assignment creates app1-bp1-apps:

Application - app1
BackupPolicy - bp1
AppCluster - apps.cluster1

Spoke assignment creates app1-bp1-apps (The OpenShift Container Platform rejects it)

Application - app1
BackupPolicy - bp1
AppCluster - apps.cluster2

Resolution: To create the PolicyAssignment for the spoke application, delete the PolicyAssignment CR for the hub application assignment, and then attempt the spoke application assignment again.
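
The collision can be reproduced with a few lines of shell. This sketch only illustrates the naming format described above. It assumes that the short-form cluster name is the text before the first dot, which matches the example but is not confirmed as the exact product logic, and the function name is hypothetical.

```shell
# Sketch: derive a PolicyAssignment name as appName-backupPolicyName-shortFormClusterName.
# Assumption: the short form is everything before the first "." in the cluster name.
policy_assignment_name() {
  app="$1"
  policy="$2"
  short="${3%%.*}"
  printf '%s-%s-%s\n' "$app" "$policy" "$short"
}

policy_assignment_name app1 bp1 apps.cluster1   # hub assignment
policy_assignment_name app1 bp1 apps.cluster2   # spoke assignment
```

Both calls produce the same name, which is why the second creation is rejected.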

Backups do not work as defined in the backup policies

Problem statement: Sometimes, backups do not run as defined in the backup policies, especially when you set hourly policies. For example, if you set a policy to run every two hours and it does not run every two hours, then gaps exist in the backup history. A possible reason is that, after a pod crash and restart, the scheduled jobs did not account for the time zone, which caused gaps in the run intervals.

Diagnosis

The following are the observed symptoms:

- Policies with a custom schedule of every X hours at minute YY: the first scheduled run occurs at minute YY after X hours plus the time zone offset from UTC, instead of at minute YY after X hours.
- Monthly and yearly policies run more frequently than scheduled.

Resolution: You can start backups manually until the next scheduled time.

Backup jobs are stuck in a running state for a long time and are not canceled

Resolution

Do the following steps to resolve the issue:

Ensure that all jobs are finished and the queue is empty before you do any disruptive actions like node restarts.
If jobs are running for a long period and do not progress, follow the steps to delete the backup or restore CR directly.
1. Log in to IBM Fusion.
2. Go to Backup & Restore > Jobs > Queue and get the name of the job that is stuck.
3. If the stuck job is a backup job, run the following command to delete it.
```
oc delete fbackup <job_name>
```
4. If the stuck job is a restore job, run the following command to delete it.
```
oc delete frestore <job_name>
```
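
The two delete commands can be wrapped in one helper that selects the resource by job type. This is a minimal sketch with a hypothetical function name. Like the commands above, it runs against whatever project `oc` is currently set to.

```shell
# Sketch: delete a stuck backup or restore CR by job type.
# delete_stuck_job is a hypothetical helper name.
delete_stuck_job() {
  case "$1" in
    backup)  oc delete fbackup "$2" ;;
    restore) oc delete frestore "$2" ;;
    *)
      echo "usage: delete_stuck_job backup|restore <job_name>" >&2
      return 2
      ;;
  esac
}
```

For example, `delete_stuck_job backup <job_name>` runs the same command as step 3.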

Policy creation

Problem statement

Sometimes, when you create a backup policy, the following error can occur:

Error: Policy daily-snapshot could not created.

Resolution: Restart the isf-data-protection-operator-controller-manager-* pod in the IBM Fusion namespace. It triggers the recreation of the in-place-snapshot BackupStorageLocation CR.
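
One way to restart the pod is to delete it so that its controller recreates it. The following is a sketch, not a documented procedure: the function name is hypothetical, and the default namespace `ibm-spectrum-fusion-ns` is an assumption that you should replace with your IBM Fusion namespace.

```shell
# Sketch: delete the data protection operator pod so that it is recreated.
# Assumption: $1 is the IBM Fusion namespace (the default is a guess).
restart_dp_operator() {
  ns="${1:-ibm-spectrum-fusion-ns}"
  pods="$(oc get pods -n "$ns" -o name |
    grep 'isf-data-protection-operator-controller-manager-')" || {
    echo "no matching pod found in $ns" >&2
    return 1
  }
  for pod in $pods; do
    oc delete -n "$ns" "$pod"
  done
}
```

After the pod is running again, retry the policy creation.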

Policy assignment from Backup & Restore service page of the OpenShift Container Platform console

Problem statement: In the Backup & Restore service page of the OpenShift Container Platform console, the backup policy assignment to an application fails with a gateway timeout error.

Resolution: Assign the backup policy from the IBM Fusion user interface instead.

Backup of multiple VMs fails

Problem statement: This issue occurs when some VMs are in a migrating state. OpenShift Container Platform does not support snapshots of VMs that are in a migrating state.

Resolution

Do the following steps to resolve this issue:

Run the following command to check whether any virtual machine is in a migrating state:

oc get virtualmachineinstancemigrations -A

Example output:

NAMESPACE            NAME                                          PHASE         VMI
fb-bm1-fs-1-5g-10    rhel8-lesser-wildcat-migration-8fhbo          Failed        rhel8-lesser-wildcat
vm-centipede-bm2     centos-stream9-chilly-hawk-migration-57jyk    Failed        centos-stream9-chilly-hawk
vm-centos9-bm1-1     centos-stream9-instant-toad-migration-bfyz6   Failed        centos-stream9-instant-toad
vm-centos9-bm1-1     centos-stream9-instant-toad-migration-d9547   Failed        centos-stream9-instant-toad
vm-windows10-bm2-1   kubevirt-workload-update-4dm57                Failed        win10-zealous-unicorn
vm-windows10-bm2-1   kubevirt-workload-update-f2s5w                Failed        win10-zealous-unicorn
vm-windows10-bm2-1   kubevirt-workload-update-gt6nj                Failed        win10-zealous-unicorn
vm-windows10-bm2-1   kubevirt-workload-update-rjwmn                Failed        win10-zealous-unicorn
vm-windows10-bm2-1   kubevirt-workload-update-vfxfl                TargetReady   win10-zealous-unicorn
vm-windows10-bm2-1   kubevirt-workload-update-z2thw                Failed        win10-zealous-unicorn
vm-windows11-bm2-1   kubevirt-workload-update-9gr6v                Failed        win11-graceful-coyote
vm-windows11-bm2-1   kubevirt-workload-update-clbck                Failed        win11-graceful-coyote
vm-windows11-bm2-1   kubevirt-workload-update-j6pmx                Failed        win11-graceful-coyote
vm-windows11-bm2-1   kubevirt-workload-update-sfbbx                Pending       win11-graceful-coyote
vm-windows11-bm2-1   kubevirt-workload-update-th5dd                Failed        win11-graceful-coyote
vm-windows11-bm2-1   kubevirt-workload-update-zl679                Failed        win11-graceful-coyote
vm-windows11-bm2-2   kubevirt-workload-update-7dp6g                Failed        win11-conservative-moth
vm-windows11-bm2-2   kubevirt-workload-update-9nb9m                TargetReady   win11-conservative-moth
vm-windows11-bm2-2   kubevirt-workload-update-cdrf5                Failed        win11-conservative-moth
vm-windows11-bm2-2   kubevirt-workload-update-dm8fz                Failed        win11-conservative-moth
vm-windows11-bm2-2   kubevirt-workload-update-kwr6c                Failed        win11-conservative-moth
vm-windows11-bm2-2   kubevirt-workload-update-zt8wx                Failed        win11-conservative-moth

Exclude the migrating virtual machines from the backup, and retry the backup after the migration is complete.
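
In long listings such as the example output, the in-flight migrations are easy to miss among the failed ones. The following sketch filters the listing down to migrations that have not reached a terminal phase. It assumes the four-column layout shown above (NAMESPACE, NAME, PHASE, VMI) and treats Failed and Succeeded as the terminal phases. The function name is hypothetical.

```shell
# Sketch: list namespace/VMI pairs that have a migration still in progress.
# Assumption: column 3 is PHASE and column 4 is VMI, as in the example output.
list_migrating_vmis() {
  oc get virtualmachineinstancemigrations -A --no-headers |
    awk '$3 != "Failed" && $3 != "Succeeded" { print $1 "/" $4 }' |
    sort -u
}
```

For the example output above, this prints the three virtual machines that have a migration in the TargetReady or Pending phase.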

Backup applications table does not show the new backup times for the backed-up applications

Problem statement: The backup applications table does not show the new backup times for the backed-up applications.

Resolution: Go to the Applications and Jobs view to see the last successful backup job for a given application. For applications on the hub, the Applications table has the correct last backup time.

Backups are failing for the virtual machines

Problem statement: Backups and snapshots fail for virtual machines that have a second disk mounted.

Resolution

Run the following command to get the disk details for the virtual machines.

oc get virtualmachine -A -o json | jq '.items[] | [{name:.metadata.name, namespace:.metadata.namespace, volumes:.spec.template.spec.volumes}] | select(.[].volumes[].dataVolume | length > 1) | {name:.[].name, namespace:.[].namespace}'

Example output:

{
  "name": "rhel9-absent-basilisk",
  "namespace": "vmtesting"
}

If you find virtual machines that have a second disk mounted, follow the steps in the Red Hat solution to resolve the issue.

Known issues and limitations

When the Backup & Restore Hub auto-upgrade setting is disabled, the Backup & Restore service may still be upgraded to IBM Fusion 2.13.0.
After a fresh installation, the first job might not process as expected. To resolve this issue, cancel the job and wait for the cancellation to complete. Then, retry the job to process it successfully.
Note: This issue only occurs with the first job after installation.
The OpenShift Container Platform cluster can have problems and become unusable. After you recover the cluster, rejoin the connections.
The S3 bucket must not have an expiration policy or an archive rule. For more information about this known issue, see S3 buckets must not enable expiration policies.
The Azure Endpoint URL must not contain the name of the bucket.
When you back up an application in a namespace with high security context constraints (SCC) privileges, the restored namespace does not have the same SCC privileges, which results in restored pods in CrashLoopBackOff status.
Workaround:
- Restart the application pod.
When you configure node affinity for Backup & Restore, consider the following limitations:
- There is no way to add nodeAffinity in the DataProtectionApplication CR. The Velero pod may get scheduled on the GPU nodes, even if other nodes are eligible for scheduling.
- There is one Node-agent pod per node, which continues to run on GPU nodes as well.
  For more information about Configuring the OpenShift API for Data Protection with Multicloud Object Gateway, see Red Hat documentation.
- There is no way to specify nodeAffinity for the AMQ and OADP operator pods. These operator pods may get scheduled on the GPU nodes, even if other nodes are eligible for scheduling.
- There is no way to specify node anti-affinity for OLM job pods, so these pods may get scheduled on the GPU nodes, even if other nodes are eligible for scheduling. However, these pods are very short-lived.
- The datamover pod for restore work may get scheduled on a GPU node. Currently, Kopia honors the setting for the backup job pod, but not for the restore job pod.
- In summary, the following pods do not have the logic to avoid GPU nodes:
  - Velero
  - Datamover (for restore)
  - AMQ Operator
  - OADP operator
  - Node Agent
  - OLM jobs
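
To see which of these pods actually landed on GPU nodes, you can list the pods per GPU node. This is a rough sketch under assumptions that this section does not state: the Backup & Restore namespace is passed as the first argument, and GPU nodes carry the label given as the second argument (the default `nvidia.com/gpu.present=true` is only a common convention and might differ in your cluster). The function name is hypothetical.

```shell
# Sketch: list the pods of one namespace that run on GPU nodes.
# Assumptions: $1 is the namespace to inspect, $2 is the GPU node label selector.
pods_on_gpu_nodes() {
  ns="$1"
  label="${2:-nvidia.com/gpu.present=true}"
  oc get nodes -l "$label" -o name | while read -r node; do
    # -o name prints node/<name>; strip the prefix for the field selector
    oc get pods -n "$ns" --field-selector "spec.nodeName=${node#node/}" \
      --no-headers -o wide </dev/null
  done
}
```

An empty result means that no pod of that namespace is currently scheduled on a node with the given label.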