Backup issues

List of backup issues in the Backup & Restore service of IBM Fusion.

Backup size for block volumes is not correct

When the backup storage location of a policy is changed, the backup size for application backups with Ceph RBD block volumes is displayed incorrectly for the first backup after the change. The correct size is displayed from the next backup onward, so you can ignore this issue.

Shallow copy support is not available

Shallow copy support is currently unavailable in the IBM Storage Scale CSI driver. Consequently, a new copy of the data is created when the PVC is generated from the snapshot.
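
For reference, a PVC restored from a snapshot uses the standard dataSource reference, as in the following minimal sketch (the names, storage class, and size are illustrative). With the IBM Storage Scale CSI driver, provisioning such a PVC results in a full copy of the snapshot data rather than a shallow copy.

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: restored-pvc                        # illustrative name
    spec:
      storageClassName: ibm-spectrum-scale-sc   # illustrative storage class
      dataSource:
        name: my-volumesnapshot                 # illustrative VolumeSnapshot name
        kind: VolumeSnapshot
        apiGroup: snapshot.storage.k8s.io
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi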

Multiple VM backups fail with a snapshot deadline exceeded error

Resolution
  1. Run the following command to enable full-path auditing:
    sudo auditctl -w /etc/shadow -p w
  2. Rerun the backup and run the following command to identify the file that is causing ownership or permission issues.
    sudo ausearch -m avc -ts recent
  3. If the ausearch command reports issues, then run the following commands to generate a local policy to allow access.
    
    sudo ausearch -c 'qemu-ga' --raw | audit2allow -M my-qemuga
    sudo semodule -X 300 -i my-qemuga.pp

IBM Cloud limitations with OADP DataMover

Problem statement
Backups of raw volumes fail for IBM Cloud with OADP 1.4.0 or lower.
Cause
OADP 1.3 or higher exposes volumes of the underlying host during backup or restore. The host folders are exposed in the Pods that are associated with the node-agent DaemonSet. By default, the /var/lib/kubelet/{plugins,pods} folders are exposed, but the folders required on IBM Cloud are /var/data/kubelet/{plugins,pods}. As a result, backup and restore of volumeMode: block volumes fail with the following example error:
Failed transferring data

[BMYBR0009](https://ibm.com/docs/SSFETU_2.9/errorcodes/BMYBR0009.html) There was an error when processing the job in the Transaction Manager service. The underlying error was: 'Data uploads watch caused an exception: DataUpload d4e7706d-7f0f-4448-b3a0-e9cdff8d33db-1 failed with message: data path backup failed: Failed to run kopia backup: unable to get local block device entry: resolveSymlink: lstat /var/data: no such file or directory'.

The ID of the individual DataUpload varies from job to job.

Resolution
  1. Set the DataMover type to legacy by either the global method or the Per PolicyAssignment method. For the procedure to update the type, see Configure DataMover type for Backup and Restore.

    Though this workaround allows continued use of the DataMover kopia type, it has the following drawbacks:

    It disables further changes to the DataMover and Velero configurations, such as the ability to change resource allocations (CPU, memory, and ephemeral storage) and nodeSelectors (DataMover node placement) for the DataMover type kopia.

    The legacy DataMover type is not affected by this workaround.

    To avoid job failures, do not make these changes while a backup or restore job is in progress.

  2. In the OpenShift Console, go to Workloads > Deployments.
  3. Select openshift-adp-controller-manager and scale the number of Pods to 0.
  4. Go to Workloads > DaemonSets and select node-agent.
  5. Select the YAML tab.
  6. Under the volumes section, add the additional volume host-data as shown in the following example.
    Note: This exposes an additional folder on the host, beyond the folders mentioned in the Cause of this issue.
    
        volumes:
           - name: host-pods
             hostPath:
               path: /var/lib/kubelet/pods
               type: ''
           - name: host-plugins
             hostPath:
               path: /var/lib/kubelet/plugins
               type: ''
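           # The host-data volume below is the addition for IBM Cloud; it exposes /var/data/kubelet on the host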
           - name: host-data
             hostPath:
               path: /var/data/kubelet
               type: ''
           - name: scratch
             emptyDir: {}
           - name: certs
             emptyDir: {}
  7. Under volumeMounts, add the host-data volume as shown in the following example.
    
              volumeMounts:
                - name: host-pods
                  mountPath: /host_pods
                  mountPropagation: HostToContainer
                - name: host-plugins
                  mountPath: /var/lib/kubelet/plugins
                  mountPropagation: HostToContainer
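                # The host-data mount below is the addition for IBM Cloud; it mounts /var/data/kubelet from the host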
                - name: host-data
                  mountPath: /var/data/kubelet
                  mountPropagation: HostToContainer
                - name: scratch
                  mountPath: /scratch
                - name: certs
                  mountPath: /etc/ssl/certs
  8. Save the changes and wait a couple of minutes for the Pods to restart.

    Backups and restores of PersistentVolumeClaims with volumeMode: block now succeed on Red Hat® OpenShift® on IBM Cloud.
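
    To verify the change, you can list the volumes of the node-agent DaemonSet, for example with the following command (a sketch that assumes the default openshift-adp namespace); the output is expected to include host-data:

      oc -n openshift-adp get daemonset node-agent -o jsonpath='{.spec.template.spec.volumes[*].name}'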

Failed to create snapshot content

Problem statement
Failed to create snapshot content with the following error:

Cannot find CSI PersistentVolumeSource for directory-based static volume

Resolution
To resolve the error, see https://www.ibm.com/docs/en/scalecsi/2.10?topic=snapshot-create-volumesnapshot.

Assign a backup policy operation fails

Problem statement
The PolicyAssignment CR name has the format appName-backupPolicyName-shortFormClusterName. If a PolicyAssignment exists for an application on the hub and you create a PolicyAssignment for the same application, with the same backup policy, on the spoke, the assignment fails when the short-form cluster names are identical, that is, when the first segment of the two cluster names is the same. The creation is rejected because a PolicyAssignment with that name already exists in OpenShift Container Platform.

For example:

Hub assignment creates app1-bp1-apps:
  • Application - app1
  • BackupPolicy - bp1
  • AppCluster - apps.cluster1
Spoke assignment attempts to create app1-bp1-apps (OpenShift Container Platform rejects it):
  • Application - app1
  • BackupPolicy - bp1
  • AppCluster - apps.cluster2
Resolution
To create the PolicyAssignment for the spoke application, delete the PolicyAssignment CR for the hub application assignment, and then attempt the spoke application assignment again.
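
For example, with the names from the scenario above, the hub assignment can be removed with a command similar to the following sketch. It assumes that the PolicyAssignment custom resource can be addressed by its kind name and that the CR resides in the IBM Fusion namespace; replace <fusion-namespace> accordingly.

    oc -n <fusion-namespace> delete policyassignment app1-bp1-apps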

Backups do not work as defined in the backup policies

Problem statement
Sometimes, backups do not run as defined in the backup policies, especially when you set hourly policies. For example, if you set a policy to run every two hours but it does not run every two hours, gaps appear in the backup history. A possible cause is that after a pod crash and restart, scheduled jobs do not account for the time zone, which causes gaps in run intervals.
Diagnosis
The following are the observed symptoms:
  • Policies with a custom "every X hours at minute YY" schedule: the first scheduled run occurs at minute YY after X hours plus the time zone offset from UTC, instead of at minute YY after X hours.
  • Monthly and yearly policies run more frequently than scheduled.
Resolution
You can start backups manually until the next scheduled time.

Backup & Restore service deployed in IBM Cloud Satellite

Problem statement
You can encounter an error when you attempt a backup operation with the IBM Fusion Backup & Restore service that is deployed in IBM Cloud® Satellite.
Diagnosis
Backup operations fail with the following log entries:

level=error msg="Error backing up item" backup=<item> error="error executing custom action (groupResource=pods, namespace=<namespace>, name=<name>): rpc error: code = Unknown desc = configmaps \"config\" not found" error.file="/remote-source/velero/app/pkg/backup/item_backupper.go:326" error.function="github.com/vmware-tanzu/velero/pkg/backup.(*itemBackupper).executeActions" logSource="/remote-source/velero/app/pkg/backup/backup.go:417" name=<name>
level=error msg="Error backing up item" backup=<item> error="error executing custom action (groupResource=replicasets.apps, namespace=<namespace>, name=<name>): rpc error: code = Unknown desc = configmaps \"config\" not found" error.file="/remote-source/velero/app/pkg/backup/item_backupper.go:326" error.function="github.com/vmware-tanzu/velero/pkg/backup.(*itemBackupper).executeActions" logSource="/remote-source/velero/app/pkg/backup/backup.go:417" name=<name>
level=error msg="Error backing up item" backup=<item> error="error executing custom action (groupResource=deployments.apps, namespace=<namespace>, name=<name>): rpc error: code = Unknown desc = configmaps \"config\" not found" error.file="/remote-source/velero/app/pkg/backup/item_backupper.go:326" error.function="github.com/vmware-tanzu/velero/pkg/backup.(*itemBackupper).executeActions" logSource="/remote-source/velero/app/pkg/backup/backup.go:417" name=<name>
Cause
An issue exists with the default openshift plug-in of OADP, and the plug-in must be disabled to continue.
Resolution

Do the following steps to disable the plug-in:

  1. In the OpenShift console, go to Administration > CustomResourceDefinitions.
  2. Search for the CustomResourceDefinition DataProtectionApplication.
  3. In the Instances tab, locate the instance that is named velero.
  4. Open the YAML file in edit mode for the instance.
  5. Under the entry spec:velero:defaultPlugins, remove the line for openshift.
  6. Save the YAML file.
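
As an illustration, after you remove the openshift entry, the defaultPlugins list might look like the following sketch. The remaining plug-in names are illustrative and depend on your configuration.

    defaultPlugins:
      - aws
      - csi
      # the openshift entry is removed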

Backup jobs are stuck in a running state for a long time and are not canceled

Resolution
Do the following steps to resolve the issue:
  1. Ensure that all jobs are finished and the queue is empty before you do any disruptive actions like node restarts.
  2. If jobs are running for a long period and do not progress, follow the steps to delete the backup or restore CR directly.
    1. Log in to IBM Fusion.
    2. Go to Backup & Restore > Jobs > Queue and get the name of the job that is stuck.
    3. Run the following command to delete the backup job.
      oc delete fbackup <job_name>
    4. Run the following command to delete the restore job.
      oc delete frestore <job_name>
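
If you prefer the command line, you can also list the backup and restore CRs to confirm the exact name of the stuck job. The following sketch assumes that the fbackup and frestore resource names used in the previous steps are available on the cluster.

    oc get fbackup,frestore -A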

Policy creation

Problem statement
Sometimes, when you create a backup policy, the following error can occur:
Error: Policy daily-snapshot could not created. 
Resolution
Restart the isf-data-protection-operator-controller-manager-* pod in the IBM Fusion namespace. The restart triggers the re-creation of the in-place-snapshot BackupStorageLocation CR.
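
For example, the pod can be restarted by deleting it so that its Deployment re-creates it. The following sketch assumes that the IBM Fusion namespace is supplied as <fusion-namespace>.

    oc -n <fusion-namespace> delete pod $(oc -n <fusion-namespace> get pods -o name | grep isf-data-protection-operator-controller-manager)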

Policy assignment from Backup & Restore service page of the OpenShift Container Platform console

Problem statement
In the Backup & Restore service page of the OpenShift Container Platform console, the backup policy assignment to an application fails with a gateway timeout error.
Resolution
Use the IBM Fusion user interface to assign the backup policy instead.

Backup attempt of multiple VMs fails

Problem statement
This issue occurs when some VMs are in a migrating state. OpenShift Container Platform does not support snapshots of VMs that are in a migrating state.
Resolution
Follow the steps to resolve this issue:
  1. Check whether any virtual machines are in a migrating state.
  2. Run the following command to list the VM migrations and their phases.
    oc get virtualmachineinstancemigrations -A
    Example output:
    NAMESPACE            NAME                                          PHASE         VMI
    fb-bm1-fs-1-5g-10    rhel8-lesser-wildcat-migration-8fhbo          Failed        rhel8-lesser-wildcat
    vm-centipede-bm2     centos-stream9-chilly-hawk-migration-57jyk    Failed        centos-stream9-chilly-hawk
    vm-centos9-bm1-1     centos-stream9-instant-toad-migration-bfyz6   Failed        centos-stream9-instant-toad
    vm-centos9-bm1-1     centos-stream9-instant-toad-migration-d9547   Failed        centos-stream9-instant-toad
    vm-windows10-bm2-1   kubevirt-workload-update-4dm57                Failed        win10-zealous-unicorn
    vm-windows10-bm2-1   kubevirt-workload-update-f2s5w                Failed        win10-zealous-unicorn
    vm-windows10-bm2-1   kubevirt-workload-update-gt6nj                Failed        win10-zealous-unicorn
    vm-windows10-bm2-1   kubevirt-workload-update-rjwmn                Failed        win10-zealous-unicorn
    vm-windows10-bm2-1   kubevirt-workload-update-vfxfl                TargetReady   win10-zealous-unicorn
    vm-windows10-bm2-1   kubevirt-workload-update-z2thw                Failed        win10-zealous-unicorn
    vm-windows11-bm2-1   kubevirt-workload-update-9gr6v                Failed        win11-graceful-coyote
    vm-windows11-bm2-1   kubevirt-workload-update-clbck                Failed        win11-graceful-coyote
    vm-windows11-bm2-1   kubevirt-workload-update-j6pmx                Failed        win11-graceful-coyote
    vm-windows11-bm2-1   kubevirt-workload-update-sfbbx                Pending       win11-graceful-coyote
    vm-windows11-bm2-1   kubevirt-workload-update-th5dd                Failed        win11-graceful-coyote
    vm-windows11-bm2-1   kubevirt-workload-update-zl679                Failed        win11-graceful-coyote
    vm-windows11-bm2-2   kubevirt-workload-update-7dp6g                Failed        win11-conservative-moth
    vm-windows11-bm2-2   kubevirt-workload-update-9nb9m                TargetReady   win11-conservative-moth
    vm-windows11-bm2-2   kubevirt-workload-update-cdrf5                Failed        win11-conservative-moth
    vm-windows11-bm2-2   kubevirt-workload-update-dm8fz                Failed        win11-conservative-moth
    vm-windows11-bm2-2   kubevirt-workload-update-kwr6c                Failed        win11-conservative-moth
    vm-windows11-bm2-2   kubevirt-workload-update-zt8wx                Failed        win11-conservative-moth
  3. Exclude the migrating virtual machines from the backup. Reattempt the backup after the migration is complete.

Backup applications table does not show the new backup times for the backed-up applications

Problem statement
The backup applications table does not show the new backup times for the backed-up applications.
Resolution
Go to the Applications and Jobs view to see the last successful backup job for a given application. For applications on the hub, the Applications table has the correct last backup time.

Backups are failing for the virtual machines

Problem statement
Backups and snapshots fail for virtual machines that are mounted with a second disk.
Resolution
  1. Run the following command to get the disk details for the virtual machines.
    oc get virtualmachine -A -o json | jq '.items[] | [{name:.metadata.name, namespace:.metadata.namespace, volumes:.spec.template.spec.volumes}] | select(.[].volumes[].dataVolume | length > 1) | {name:.[].name, namespace:.[].namespace}'
    Example output:
    {
      "name": "rhel9-absent-basilisk",
      "namespace": "vmtesting"
    }
  2. If you find that virtual machines are mounted with a second disk, then follow the steps in the Red Hat solution to resolve the issue.

Known issues and limitations

  • The OpenShift Container Platform cluster can have problems and become unusable. After you recover the cluster, rejoin the connections. For the steps to clean up the connection and set up the connection between the two clusters again, see Connection setup after OpenShift Container Platform cluster recovery.
  • The S3 bucket must not have an expiration policy or an archive rule. For more information about this known issue, see S3 buckets must not enable expiration policies.
  • The Azure Endpoint URL must not contain the name of the bucket.