Restore issues
List of restore issues in the Backup & Restore service of IBM Storage Fusion.
Restore fails on IBM Power Systems
Restore failures are observed in an IBM Power Systems environment. If you have clusters that run on IBM Power Systems, do not upgrade or install the Backup & Restore service to 2.8.1.
exec format error
- Problem statement
- Sometimes, you may observe the following error message:
"exec <executable name>": exec format error
This error can be due to the wrong architecture of the container, for example, an amd64 container on s390x nodes or an s390x container on amd64 nodes. In such cases, the pod log is empty except for the error message, for example:
"exec /filebrowser": exec format error
- Resolution
- As a resolution, check whether the architecture of the container that you want to restore matches the architecture of the local node.
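For example, a minimal check with the oc CLI; the image reference is a placeholder that you must replace with the image of the failing container:
# List the CPU architecture of each node
oc get nodes -o custom-columns=NAME:.metadata.name,ARCH:.status.nodeInfo.architecture
# Inspect the image manifest to see which architectures it provides
# (for multi-architecture images, add --filter-by-os to select a variant)
oc image info <image-reference>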
Restore of namespaces that contain admission webhooks fails
- Problem statement
- Restore of namespaces that contain admission webhooks fails. Example error in the IBM Storage Fusion restore job:
"Failed restore <some resource>" "BMYBR0003 RestorePvcsFailed There was an error when processing the job in the Transaction Manager service"
Example error in the Velero pod:
level=error msg="Namespace domino-platform, resource restore error: error restoring certificaterequests.cert-manager.io/domino-platform/hephaestus-buildkit-client-85k2v: Internal error occurred: failed calling webhook \"webhook.cert-manager.io\": failed to call webhook: Post \"https://cert-manager-webhook.domino-platform.svc:443/mutate?timeout=10s\": service \"cert-manager-webhook\" not found"
- Resolution
-
- Identify the admission webhooks that are applicable to the namespace being restored:
oc get mutatingwebhookconfigurations
oc describe mutatingwebhookconfigurations
- Change the failurePolicy parameter from Fail to Ignore to temporarily disable webhook validation prior to the restore (see the sketch after these steps):
failurePolicy: Ignore
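For example, a sketch of changing the failurePolicy of the first webhook entry with oc patch; the configuration name and the webhook index are placeholders that you must adapt to your environment:
# Set the failurePolicy of the first webhook entry to Ignore before the restore
oc patch mutatingwebhookconfiguration <configuration-name> --type=json -p '[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Ignore"}]'
# After the restore completes, revert the failurePolicy to Fail
oc patch mutatingwebhookconfiguration <configuration-name> --type=json -p '[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Fail"}]'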
Restore before upgrade fails with a BMYBR0003 error
- Problem statement
- When you try to restore backups that were taken before an upgrade, the restore fails with a BMYBR0003 error.
- Diagnosis
-
After you upgrade, your jobs may fail:
- Backup jobs with the status:
"Failed transferring data" "BMYBR0003 There was an error when processing the job in the Transaction Manager service"
- Restore jobs with the status:
"Failed restore <some resource>" "BMYBR0003 There was an error when processing the job in the Transaction Manager service"
Confirm the issue in the logs of the manager container of the Data Mover pod.
A sample error message:
2023-07-26T03:39:47Z ERROR Failed with error. {"controller": "guardiancopyrestore", "controllerGroup": "guardian.isf.ibm.com", "controllerKind": "GuardianCopyRestore", "GuardianCopyRestore": {"name":"52a2abfb-ea9b-422f-a60d-fed59527d38e-r","namespace":"ibm-backup-restore"}, "namespace": "ibm-backup-restore", "name": "52a2abfb-ea9b-422f-a60d-fed59527d38e-r", "reconcileID": "c8642f40-c086-413f-a10e-6d6a85531337", "attempt": 2, "total attempts": 3, "error": "EOF"}
github.ibm.com/ProjectAbell/guardian-dm-operator/controllers/util.Retry
    /workspace/controllers/util/utils.go:39
github.ibm.com/ProjectAbell/guardian-dm-operator/controllers/kafka.(*kafkaWriterConnection).PublishMessage
    /workspace/controllers/kafka/kafka_native_connection.go:71
github.ibm.com/ProjectAbell/guardian-dm-operator/controllers.(*GuardianCopyRestoreReconciler).updateOverallCrStatus
    /workspace/controllers/status.go:191
github.ibm.com/ProjectAbell/guardian-dm-operator/controllers.(*GuardianCopyRestoreReconciler).doRestore
    /workspace/controllers/guardiancopyrestore_controller.go:187
github.ibm.com/ProjectAbell/guardian-dm-operator/controllers.(*GuardianCopyRestoreReconciler).Reconcile
    /workspace/controllers/guardiancopyrestore_controller.go:92
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:122
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:323
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:235
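For example, a minimal way to view these logs with the oc CLI; the pod name prefix and namespace are taken from the sample error above and may differ in your installation:
# Find the Data Mover controller pod in the Backup & Restore namespace
oc get pods -n ibm-backup-restore | grep guardian-dm-controller-manager
# View the logs of its manager container
oc logs <guardian-dm-controller-manager-pod> -c manager -n ibm-backup-restore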
- Resolution:
- Search for the guardian-dm-controller-manager pod and delete it. A new pod starts within a minute. After the pod reaches a healthy state, retry the backup and restore operations.
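For example, a sketch of deleting the pod so that its deployment re-creates it; the namespace is assumed to be ibm-backup-restore, as in the sample error above:
# Delete the Data Mover controller pod; its deployment re-creates it automatically
oc delete pod <guardian-dm-controller-manager-pod> -n ibm-backup-restore
# Watch until the replacement pod is running and ready
oc get pods -n ibm-backup-restore -w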
"Failed restore snapshot" error occurs with applications using IBM Storage Scale storage PVCs
- A "Failed restore snapshot" error occurs with applications using IBM Storage Scale storage PVCs.
- Cause
- The "disk quota exceeded" error occurs whenever you restore from an object storage location having applications that use IBM Storage Scale PVC with a size less than 5 GB.
- Resolution
- Increase the IBM Storage Scale PVC size to a minimum of 5 GB, and then run the backup and restore operations again.
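For example, a sketch of expanding a PVC with oc patch; the PVC name and namespace are placeholders, and the storage class must have allowVolumeExpansion enabled:
# Increase the requested size of the PVC to 5 GiB
oc patch pvc <pvc-name> -n <namespace> -p '{"spec": {"resources": {"requests": {"storage": "5Gi"}}}}'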
Cannot restore multiple namespaces to a single alternative namespace
You cannot restore a backup that contains multiple namespaces to a single alternative namespace. Example error in the Transaction Manager log:
2023-06-27 15:05:53,633[TM_5][c2a4b768-2acc-4169-9c5f-f88e60f3be2b][restoreguardian:restore_application_w_recipe Line 172][INFO] - altNS: ns2, number of ns: 2
2023-06-27 15:05:53,633[TM_5][c2a4b768-2acc-4169-9c5f-f88e60f3be2b][restoreguardian:restore_application_w_recipe Line 176][ERROR] - Alternate namespace specified, but more than one namespace was backed up
Restore to a cluster that does not have an identical storage class as the source cluster
You cannot restore to a cluster that does not have a storage class identical to that of the source cluster. However, the transaction manager still attempts to create PVCs with the non-existent storage class on the spoke cluster, and the restore eventually fails with the "Failed restore snapshot" status.
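For example, a minimal pre-restore check, assuming you can run oc against both clusters through separate kubeconfig contexts (the context names are placeholders):
# Compare the storage classes available on the source and target clusters
oc --context <source-cluster> get storageclass
oc --context <target-cluster> get storageclass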
Applications page does not show the details of the application
- Problem statement
- The new backed up applications page does not show the details of the application when you upgrade IBM Storage Fusion to the latest version but leave the Backup & Restore service at the older version.
- Resolution
- As a resolution, upgrade the Backup & Restore service to the latest version after you upgrade IBM Storage Fusion.
S3 buckets must not enable expiration policies
Backup & Restore currently uses restic to move the Velero backups to repositories on S3 buckets. When a restic repository gets initialized, it creates a configuration file and several subdirectories for its snapshots. As restic does not update it after initialization, the modification timestamp of this configuration file is never updated.
If you configure an expiration policy on the container, the restic configuration objects get deleted at the end of the expiration period. If you have an archive rule set, the configuration objects get archived after the time defined in the rule. In either case, the configuration is no longer accessible and subsequent backup and restore operations fail.
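For example, a sketch that uses the AWS CLI to verify that no lifecycle (expiration or archive) rules are set on the bucket; the bucket name is a placeholder, and for non-AWS S3 endpoints you must also pass --endpoint-url:
# List any lifecycle rules configured on the bucket
# (the command reports NoSuchLifecycleConfiguration if none are set)
aws s3api get-bucket-lifecycle-configuration --bucket <bucket-name>
# Remove an existing lifecycle configuration so restic repository objects are never expired or archived
aws s3api delete-bucket-lifecycle --bucket <bucket-name>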
Virtual machine restore failure
- Problem statement
- By default, the VirtualMachineClone objects are not restored due to the following unexpected behavior:
- If you create a VirtualMachineClone object and delete the original virtual machine, then the restore fails because the object gets rejected.
- If you create a VirtualMachineClone object and then delete the clone virtual machine, then the restore fails because OpenShift Virtualization ignores the status.phase "Succeeded" and clones the virtual machine again. As a result, the clone gets re-created every time you delete it.
- If you create a VirtualMachineClone and then do a backup with the original and clone virtual machines, the restore fails because OpenShift Virtualization ignores the status.phase "Succeeded" and tries to clone again to the virtual machine that already exists. OpenShift Virtualization creates a snapshot of the original VirtualMachine, which adds an unwanted VirtualMachineSnapshot and a set of associated VolumeSnapshots after restore whose names start with "tmp-". The clone operation does not complete and remains stuck in the "RestoreInProgress" state because the requested VirtualMachine already exists in the VirtualMachineClone.
- Resolution
- As a resolution, force the restore of the VirtualMachineClone objects by explicitly including them in the Recipe. Change the OADP DataProtectionApplication object "velero" and add the spec.configuration.velero.args.restore-resource-priorities field as follows:
velero:
  args:
    restore-resource-priorities: "securitycontextconstraints,customresourcedefinitions,namespaces,managedcluster.cluster.open-cluster-management.io,managedcluster.clusterview.open-cluster-management.io,klusterletaddonconfig.agent.open-cluster-management.io,managedclusteraddon.addon.open-cluster-management.io,storageclasses,volumesnapshotclass.snapshot.storage.k8s.io,volumesnapshotcontents.snapshot.storage.k8s.io,volumesnapshots.snapshot.storage.k8s.io,datauploads.velero.io,persistentvolumes,persistentvolumeclaims,serviceaccounts,secrets,configmaps,limitranges,pods,replicasets.apps,clusterclasses.cluster.x-k8s.io,endpoints,services,-,clusterbootstraps.run.tanzu.vmware.com,clusters.cluster.x-k8s.io,clusterresourcesets.addons.cluster.x-k8s.io,virtualmachines.kubevirt.io,virtualmachineclones.clone.kubevirt.io"
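For example, a minimal way to open the object for editing with the oc CLI; the namespace shown is an assumption and may differ depending on where OADP is installed in your cluster:
# Edit the DataProtectionApplication object named "velero"
oc edit dataprotectionapplication velero -n ibm-backup-restore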
- Problem statement
- The Data Mover operator must restore CephFS and IBM Storage Scale 5.2.0+ snapshots for backup by using the ReadOnlyMany access mode.
- Resolution