Restore issues
List of restore issues in the Backup & Restore service of IBM Fusion.
Restore fails because of an ephemeral local storage limit exceeded error in the Velero pod
- Resolution
- When you back up and restore a large number of cluster resources, Velero pod resource limits
must be increased for the operations to succeed. The optimal values depend on your individual
environment, but the following values are representative for backup and restore of 1500 resources.
To change the local storage that the Velero Pod is allowed to use from the default 500Mi (mebibytes) to 1Gi (gibibytes):
oc patch dpa velero -n <backup-restore-namespace> --type merge -p '{"spec": {"configuration": {"velero": {"podConfig": {"resourceAllocations": {"limits": {"ephemeral-storage": "1Gi"}}}}}}}'
To change the memory limit of the Velero Pod from the default 2Gi (gibibytes) to 4Gi (gibibytes):
oc patch dpa velero -n <backup-restore-namespace> --type merge -p '{"spec": {"configuration": {"velero": {"podConfig": {"resourceAllocations": {"limits": {"memory": "4Gi"}}}}}}}'
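To verify that the new limits are applied, you can inspect the DataProtectionApplication and the running Velero Pod. This is an optional check; the label selector app.kubernetes.io/name=velero is an assumption and might differ in your environment:
oc get dpa velero -n <backup-restore-namespace> -o jsonpath='{.spec.configuration.velero.podConfig.resourceAllocations.limits}'
oc get pods -n <backup-restore-namespace> -l app.kubernetes.io/name=velero -o jsonpath='{.items[*].spec.containers[*].resources.limits}'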
Restore of applications from hub or spoke 2.9.0 to spoke at version 2.8.1 fails
- Problem statement
- Restore of applications from a hub or spoke at version 2.9.0 to a spoke at version 2.8.x fails with the following error in the Transaction Manager service:
Argument of type 'NoneType' is not iterable.
- Resolution
- This is a known limitation; do not attempt to restore 2.9.0 (kopia) based backups to a lower version 2.8.x (restic).
IBM Cloud Limitations with OADP DataMover
- Problem statement
- Backups of raw volumes fail for IBM Cloud with OADP 1.4.0 or lower.
- Cause
- OADP 1.3 or higher exposes the volumes of the underlying host during backup or restore. The folders on the host are exposed in the Pods that are associated with the DaemonSet node-agent. The /var/lib/kubelet/{plugins,pods} folders are exposed by default. The folders required to work on IBM Cloud are /var/data/kubelet/{plugins,pods}. As a result, the backup and restore of volumeMode: block volumes fail with the following example error:
Failed transferring data [BMYBR0009](https://ibm.com/docs/SSFETU_2.9/errorcodes/BMYBR0009.html) There was an error when processing the job in the Transaction Manager service. The underlying error was: 'Data uploads watch caused an exception: DataUpload d4e7706d-7f0f-4448-b3a0-e9cdff8d33db-1 failed with message: data path backup failed: Failed to run kopia backup: unable to get local block device entry: resolveSymlink: lstat /var/data: no such file or directory'.
The ID of the individual DataUpload varies between jobs.
- Resolution
-
- Set the DataMover type to legacy by either the global method or the Per PolicyAssignment method. For the procedure to update the type, see Configure DataMover type for Backup and Restore.
- Modify the node-agent DaemonSet as described in the following steps. Though this workaround allows continued use of the DataMover kopia type, it has the following drawback: it disables further changes to the DataMover and Velero configurations, such as the ability to change resource allocations (CPU, memory, and ephemeral-storage) and nodeSelectors (DataMover node placement) for DataMover type kopia. DataMover type legacy is not affected by this workaround.
To avoid job failures, do not make these changes while a backup or restore job is in progress.
- In the OpenShift Console, go to .
- Select openshift-adp-controller-manager and scale the number of Pods to 0.
- Go to and select node-agent.
- Select the YAML tab.
- Under the volumes section, add the additional volume host-data as shown in the following example.
Note: It exposes an additional folder on the host other than the folders mentioned in the Cause of this issue.
volumes:
  - name: host-pods
    hostPath:
      path: /var/lib/kubelet/pods
      type: ''
  - name: host-plugins
    hostPath:
      path: /var/lib/kubelet/plugins
      type: ''
  - name: host-data
    hostPath:
      path: /var/data/kubelet
      type: ''
  - name: scratch
    emptyDir: {}
  - name: certs
    emptyDir: {}
- Under volumeMounts, add the host-data volume as shown in the following example.
volumeMounts:
  - name: host-pods
    mountPath: /host_pods
    mountPropagation: HostToContainer
  - name: host-plugins
    mountPath: /var/lib/kubelet/plugins
    mountPropagation: HostToContainer
  - name: host-data
    mountPath: /var/data/kubelet
    mountPropagation: HostToContainer
  - name: scratch
    mountPath: /scratch
  - name: certs
    mountPath: /etc/ssl/certs
- Save the changes and wait for a couple of minutes for the Pods to restart.
The backups and restores of PersistentVolumeClaims with volumeMode: block now succeed on Red Hat® OpenShift® on IBM Cloud.
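To confirm that the node-agent Pods restarted with the new mount, you can list the volumes on the DaemonSet and check the Pod status. This is an optional check; the label selector name=node-agent is an assumption and might differ in your environment:
oc get daemonset node-agent -n <backup-restore-namespace> -o jsonpath='{.spec.template.spec.volumes[*].name}'
oc get pods -n <backup-restore-namespace> -l name=node-agent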
Restore fails on IBM Power Systems
Restore failures are observed in an IBM Power Systems environment. If you have clusters running on IBM Power Systems, do not upgrade the Backup & Restore service to 2.8.1; instead, contact IBM Support.
exec format error
- Problem statement
- Sometimes, you might observe the following error message:
"exec <executable name>": exec format error
For example, the pod log is empty except for this message: exec /filebrowser
The error can be due to the wrong architecture of the container, for example, an amd64 container on s390x nodes or an s390x container on amd64 nodes.
- Resolution
- As a resolution, check whether the architecture of the container image that you want to restore matches the architecture of the local node.
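A quick way to compare the two architectures, assuming you can identify the image reference from the Pod specification (the image reference shown here is only a placeholder):
oc get nodes -o custom-columns=NAME:.metadata.name,ARCH:.status.nodeInfo.architecture
oc image info <image-reference>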
Restore of namespaces that contain admission webhooks fails
- Problem statement
- Restore of namespaces that contain admission webhooks fails.
Example error in the IBM Fusion restore job:
"Failed restore <some resource>" "BMYBR0003 RestorePvcsFailed There was an error when processing the job in the Transaction Manager service"
Example error in the Velero pod:
level=error msg="Namespace domino-platform, resource restore error: error restoring certificaterequests.cert-manager.io/domino-platform/hephaestus-buildkit-client-85k2v: Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.domino-platform.svc:443/mutate?timeout=10s\\": service "cert-manager-webhook" not found"
- Resolution
-
- Identify the admission webhooks that are applicable to the namespace being restored:
oc get mutatingwebhookconfigurations
oc describe mutatingwebhookconfigurations
- Change the failurePolicy parameter from Fail to Ignore to temporarily disable webhook validation prior to restore (see the sketch after these steps):
failurePolicy: Ignore
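One way to make this change from the command line is a JSON patch of the webhook configuration. This is a sketch; the configuration name and the webhook index 0 are placeholders that depend on your environment, and you should set the value back to Fail after the restore completes:
oc patch mutatingwebhookconfiguration <webhook-configuration-name> --type=json -p '[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Ignore"}]'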
Restore before upgrade fails with a BMYBR0003 error
- Problem statement
- When you try to restore backups that were taken before the upgrade, the restore fails with a BMYBR0003 error.
- Diagnosis
-
After you upgrade, your jobs may fail:
- Backup jobs with the status:
"Failed transferring data" "BMYBR0003 There was an error when processing the job in the Transaction Manager service"
- Restore jobs with the status:
"Failed restore <some resource>" "BMYBR0003 There was an error when processing the job in the Transaction Manager service"
Confirm the issue in the logs of the manager container of the Data Mover pod.
A sample error message:
2023-07-26T03:39:47Z ERROR Failed with error. {"controller": "guardiancopyrestore", "controllerGroup": "guardian.isf.ibm.com", "controllerKind": "GuardianCopyRestore", "GuardianCopyRestore": {"name":"52a2abfb-ea9b-422f-a60d-fed59527d38e-r","namespace":"ibm-backup-restore"}, "namespace": "ibm-backup-restore", "name": "52a2abfb-ea9b-422f-a60d-fed59527d38e-r", "reconcileID": "c8642f40-c086-413f-a10e-6d6a85531337", "attempt": 2, "total attempts": 3, "error": "EOF"}
github.ibm.com/ProjectAbell/guardian-dm-operator/controllers/util.Retry
	/workspace/controllers/util/utils.go:39
github.ibm.com/ProjectAbell/guardian-dm-operator/controllers/kafka.(*kafkaWriterConnection).PublishMessage
	/workspace/controllers/kafka/kafka_native_connection.go:71
github.ibm.com/ProjectAbell/guardian-dm-operator/controllers.(*GuardianCopyRestoreReconciler).updateOverallCrStatus
	/workspace/controllers/status.go:191
github.ibm.com/ProjectAbell/guardian-dm-operator/controllers.(*GuardianCopyRestoreReconciler).doRestore
	/workspace/controllers/guardiancopyrestore_controller.go:187
github.ibm.com/ProjectAbell/guardian-dm-operator/controllers.(*GuardianCopyRestoreReconciler).Reconcile
	/workspace/controllers/guardiancopyrestore_controller.go:92
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:122
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:323
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:235
- Resolution:
- Search for the guardian-dm-controller-manager pod and delete it. A new pod starts within a minute. After the pod reaches a healthy state, retry backups and restores.
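A minimal way to do this from the command line, assuming the pod runs in your Backup & Restore namespace:
oc get pods -n <backup-restore-namespace> | grep guardian-dm-controller-manager
oc delete pod <guardian-dm-controller-manager-pod-name> -n <backup-restore-namespace>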
"Failed restore snapshot" error occurs with applications using IBM Storage Scale storage PVCs
- A "Failed restore snapshot" error occurs with applications using IBM Storage Scale storage PVCs.
- Cause
- The "disk quota exceeded" error occurs whenever you restore from an object storage location having applications that use IBM Storage Scale PVC with a size less than 10 GB.
- Resolution
- Increase the IBM Storage Scale PVC size to a minimum of 10 GB, and then run the backup and restore operation again.
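If the storage class allows volume expansion, one way to grow the PVC is a patch such as the following; the PVC name and namespace are placeholders:
oc patch pvc <pvc-name> -n <application-namespace> -p '{"spec": {"resources": {"requests": {"storage": "10Gi"}}}}'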
Cannot restore multiple namespaces to a single alternative namespace
The following Transaction Manager log messages indicate this condition:
2023-06-27 15:05:53,633[TM_5][c2a4b768-2acc-4169-9c5f-f88e60f3be2b][restoreguardian:restore_application_w_recipe Line 172][INFO] - altNS: ns2, number of ns: 2
2023-06-27 15:05:53,633[TM_5][c2a4b768-2acc-4169-9c5f-f88e60f3be2b][restoreguardian:restore_application_w_recipe Line 176][ERROR] - Alternate namespace specified, but more than one namespace was backed up
Restore to a cluster that does not have an identical storage class as the source cluster
You cannot restore to a cluster that does not have a storage class identical to that of the source cluster. However, the transaction manager still attempts to create PVCs with the nonexistent storage class on the spoke cluster and eventually fails with a Failed restore snapshot status.
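Before you restore to a different cluster, you can compare the storage classes on the source and target clusters; this is a simple check that you run against each cluster:
oc get storageclass
If the storage class that the backed up PVCs use is missing on the target cluster, create an equivalent storage class there before you start the restore.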
Applications page does not show the details of the application
- Problem statement
- The new backed up applications page does not show the details of the application when you upgrade IBM Fusion to the latest version while leaving the Backup & Restore service at the older version.
- Resolution
- As a resolution, upgrade the Backup & Restore service to the latest version after you upgrade IBM Fusion.
S3 buckets must not enable expiration policies
Backup & Restore currently uses restic to move the Velero backups to repositories on S3 buckets. When a restic repository gets initialized, it creates a configuration file and several subdirectories for its snapshots. As restic does not update it after initialization, the modification timestamp of this configuration file is never updated.
If you configure an expiration policy on the container, the restic configuration objects get deleted at the end of the expiration period. If you have an archive rule set, the configuration objects get archived after the time defined in the rule. In either case, the configuration is no longer accessible and subsequent backup and restore operations fail.
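One way to confirm that no lifecycle rules apply to the bucket that backs your backup storage location is to query the bucket configuration. The following example uses the AWS CLI; the bucket name is a placeholder, and other S3-compatible providers have equivalent commands:
aws s3api get-bucket-lifecycle-configuration --bucket <bucket-name>
If the command returns rules that expire or transition objects, remove them or scope them so that they do not affect the restic repository objects.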
Virtual machine restore failure
- Problem statement
- By default, the VirtualMachineClone objects are not restored due to the following unexpected
behavior:
- If you create a VirtualMachineClone object and delete the original virtual machine, then the restore fails because the object gets rejected.
- If you create a VirtualMachineClone object and then delete the clone virtual machine, then the restore fails because OpenShift Virtualization ignores the status.phase "Succeeded" and clones the virtual machine again. As a result, the clone gets re-created every time you delete it.
- If you create a VirtualMachineClone and then take a backup with the original and clone virtual machine, the restore fails because it ignores the status.phase "Succeeded" and tries to clone again to the virtual machine that already exists. OpenShift Virtualization creates a snapshot of the original VirtualMachine, which after restore adds an unwanted VirtualMachineSnapshot and a set of associated VolumeSnapshots whose names start with "tmp-". The clone operation does not complete and remains stuck in the "RestoreInProgress" state because the VirtualMachine requested in the VirtualMachineClone already exists.
- Resolution
- As a resolution, force the restore of the VirtualMachineClone objects by explicitly including them in the Recipe. Change the OADP DataProtectionApplication object "velero" and add the spec.configuration.velero.args.restore-resource-priorities field as follows:
velero:
  args:
    restore-resource-priorities: "securitycontextconstraints,customresourcedefinitions,namespaces,managedcluster.cluster.open-cluster-management.io,managedcluster.clusterview.open-cluster-management.io,klusterletaddonconfig.agent.open-cluster-management.io,managedclusteraddon.addon.open-cluster-management.io,storageclasses,volumesnapshotclass.snapshot.storage.k8s.io,volumesnapshotcontents.snapshot.storage.k8s.io,volumesnapshots.snapshot.storage.k8s.io,datauploads.velero.io,persistentvolumes,persistentvolumeclaims,serviceaccounts,secrets,configmaps,limitranges,pods,replicasets.apps,clusterclasses.cluster.x-k8s.io,endpoints,services,-,clusterbootstraps.run.tanzu.vmware.com,clusters.cluster.x-k8s.io,clusterresourcesets.addons.cluster.x-k8s.io,virtualmachines.kubevirt.io,virtualmachineclones.clone.kubevirt.io"
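A convenient way to make this change is to edit the DataProtectionApplication directly; this assumes that it is in your Backup & Restore namespace:
oc edit dpa velero -n <backup-restore-namespace>
Add the args block under spec.configuration.velero, save the change, and wait for the Velero Pod to restart before you retry the restore.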
Service recovery involving a large number of CRs
- Problem statement
- When service recovery processes a large number of CRs, service discovery may fail after you add the backup storage location.
- Resolution
-
- Check whether the backup-location pod is evicted because the ephemeral storage limit was exceeded (an optional command-line check follows this list).
- If no backup-location pod is evicted, check the logs of the backup-location pod for any of the following messages:
- Processing service protection discovery was successful
- Processing service protection discovery failed
- Error processing service protection discovery
If you do not see any of these messages, then the service discovery may still be running.
- If the backup-location pod restarts because of the ephemeral storage limit, temporarily update the resources of the backup-location deployment. Increase the ephemeral storage limit from 512 Mi to 1 Gi, and the memory limit from 512 Mi to 2 Gi. Change these limits back after the service discovery finishes populating the service backups that are available for restore.
oc set resources deployment backup-location-deployment --limits=ephemeral-storage=1Gi,memory=2Gi -n <backup-restore-namespace>
- After the service discovery completes, search the backup-location pod logs for the following message, which includes the number of resources found in the latest, recoverable service backup:
Restic backup validation processed
Example resources:
- 363 Application CRs
- 40 Fusion BackupStorageLocation CRs
- 385 Secrets for BackupStorageLocation CRs
- 25 BackupPolicy CRs
- 565 PolicyAssignment CRs
- 12126 Backup CRs
Note: There may be more than one such message if some service backups are not suitable for recovery, for example because they contain no Backup CRs.
For more information about Backup & Restore Hub performance and scaling, see Backup & restore hub performance and scaling.
- Update the resources of the transaction manager to temporarily increase the memory limit from 500 Mi to 2 Gi. For example:
oc set resources deployment transaction-manager --limits=memory=2Gi -n <backup-restore-namespace>
- Update the resources of the backup service to temporarily increase the memory limit from 1 Gi to 2 Gi, and increase the number of replicas, for example to 8 for the resource counts in the previous step. For example:
oc set resources deployment backup-service --limits=memory=2Gi -n <backup-restore-namespace>
The following command does not work; instead, scale the number of replicas in the OpenShift console.
oc scale deployment backup-service --replicas=8 -n <backup-restore-namespace>
- Update the resources of mongodb based on the number of CRs to be processed. In this example, temporarily increase the CPU limit to 4 and the memory limit to 4 Gi.
oc set resources sts mongodb --limits=cpu=4,memory=4Gi -n <backup-restore-namespace>
- Update the Velero resources. Modify the DataProtectionApplication CR named "velero". In this example, update the ephemeral-storage limit from 500Mi to 1Gi.
velero:
  customPlugins:
    - image: 'cp.icr.io/cp/fbr/guardian-kubevirt-velero-plugin@sha256:bce6932338d8e2d2ce0dcca2c95bdfa8ab6e337a758e039ee91773f3b26ceb06'
      name: isf-kubevirt
  defaultPlugins:
    - openshift
    - aws
  noDefaultBackupLocation: false
  podConfig:
    labels:
      app.kubernetes.io/part-of: ibm-backup-restore
    resourceAllocations:
      limits:
        cpu: '2'
        ephemeral-storage: 500Mi
        memory: 2Gi
      requests:
        cpu: 200m
        ephemeral-storage: 275Mi
        memory: 256Mi
- Wait for all pods that were modified to restart and be in running state.
Note: After the service recovers successfully, it may take time for the services to process the restored CRs.
- Wait for all of the backup storage locations to be connected.
- Check the backup service pods to determine if any CRs are still in progress.
- After processing completes, revert the resources to their original values or adjust them based on the expected number of concurrent jobs.
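An optional command-line check for the first step (whether the backup-location pod was evicted), assuming all components run in the same Backup & Restore namespace, is to list failed pods and eviction events; to revert the temporary limits later, rerun the same oc set resources commands with the original values:
oc get pods -n <backup-restore-namespace> --field-selector=status.phase=Failed
oc get events -n <backup-restore-namespace> --field-selector reason=Evicted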
More than one BackupRepository found for workload namespace
- Problem statement
- When you restore a namespace with multiple PVCs, the restore job might fail with the following
error message:
error to initialize data path: error to ensure backup repository <repo>: failed to wait BackupRepository, errored early: more than one BackupRepository found for workload namespace <namespace>
- Cause
- This is due to a race condition during the creation of the BackupRepository.
- Resolution
- To work around the problem, clean up the restored namespaces and try again.
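A minimal sketch of the cleanup, assuming that the partially restored namespace can be deleted and that your OADP version provides the velero.io BackupRepository CRD:
oc get backuprepositories.velero.io -A
oc delete namespace <restored-namespace>
After the namespace is removed, retry the restore job.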