Restore issues

List of restore issues in the Backup & Restore service of IBM Storage Fusion.

Restore fails on IBM Power Systems

Restore failures are observed in an IBM Power Systems environment. If you have clusters that run on IBM Power Systems, do not install or upgrade the Backup & Restore service to version 2.8.1.

exec format error

Problem statement
Sometimes, you may observe the following error message:
"exec <executable name>": exec format error
For example:
The pod log is empty except for this message: exec /filebrowser: exec format error
This error can occur when the container image architecture does not match the node architecture, for example, an amd64 container on s390x nodes or an s390x container on amd64 nodes.
Resolution
As a resolution, check that the architecture of the container image that you want to restore matches the architecture of the local node.
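For example, you can compare the node architecture with the image architecture by using commands similar to the following sketch. The image reference is a placeholder for the image of the failing container.
    # Check the CPU architecture of the cluster nodes.
    oc get nodes -o custom-columns=NAME:.metadata.name,ARCH:.status.nodeInfo.architecture
    # Inspect the container image; the reported OS/Arch must match the node architecture.
    # For multi-arch images, you might need to add --filter-by-os to select a specific platform.
    oc image info <image-reference>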

Restore of namespaces that contain admission webhooks fails

Problem statement
Restore of namespaces that contain admission webhooks fails.
Example error in IBM Storage Fusion restore job:
"Failed restore <some  resource>" "BMYBR0003
      RestorePvcsFailed There was an error when  processing the job in the Transaction Manager
      service"
Example error in Velero pod:
level=error msg="Namespace
      domino-platform, resource restore error: error restoring
      certificaterequests.cert-manager.io/domino-platform/hephaestus-buildkit-client-85k2v:
      Internal error occurred: failed calling webhook  "webhook.cert-manager.io": failed to call
      webhook: Post  "https://cert-manager-webhook.domino-platform.svc:443/mutate?timeout=10s\\":
      service "cert-manager-webhook" not found"
Resolution
  1. Identify the admission webhooks that are applicable to the namespace that you restore:
    oc get mutatingwebhookconfigurations
    oc describe mutatingwebhookconfigurations
  2. Change the failurePolicy parameter from Fail to Ignore to temporarily disable webhook validation before the restore, as shown in the example after these steps:
    failurePolicy: Ignore
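For example, you can make the change with an oc patch command similar to the following sketch. The webhook configuration name is a placeholder; use a name returned in step 1, adjust the /webhooks/0 index to the webhook entry that you want to change, and revert the parameter to Fail after the restore completes.
    # Set failurePolicy to Ignore on the first webhook entry of the configuration.
    oc patch mutatingwebhookconfiguration <webhook-config-name> --type=json \
      -p='[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Ignore"}]'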

Restore before upgrade fails with a BMYBR0003 error

Problem statement
When you try to restore backups that were taken before the upgrade, the restore fails with a BMYBR0003 error.
Diagnosis

After you upgrade, your jobs may fail:

  • Backup jobs with the status:
    "Failed transferring data" "BMYBR0003 There was an error when processing the job in the Transaction Manager service"
  • Restore jobs with the status:
    "Failed restore <some resource>" "BMYBR0003 There was an error when processing the job in the Transaction Manager service"

Confirm the issue in the logs of the manager container of the Data Mover pod.

A sample error message:

2023-07-26T03:39:47Z	ERROR	Failed with error.	{"controller": "guardiancopyrestore", "controllerGroup": "guardian.isf.ibm.com", "controllerKind": "GuardianCopyRestore", "GuardianCopyRestore": {"name":"52a2abfb-ea9b-422f-a60d-fed59527d38e-r","namespace":"ibm-backup-restore"}, "namespace": "ibm-backup-restore", "name": "52a2abfb-ea9b-422f-a60d-fed59527d38e-r", "reconcileID": "c8642f40-c086-413f-a10e-6d6a85531337", "attempt": 2, "total attempts": 3, "error": "EOF"}
github.ibm.com/ProjectAbell/guardian-dm-operator/controllers/util.Retry
	/workspace/controllers/util/utils.go:39
github.ibm.com/ProjectAbell/guardian-dm-operator/controllers/kafka.(*kafkaWriterConnection).PublishMessage
	/workspace/controllers/kafka/kafka_native_connection.go:71
github.ibm.com/ProjectAbell/guardian-dm-operator/controllers.(*GuardianCopyRestoreReconciler).updateOverallCrStatus
	/workspace/controllers/status.go:191
github.ibm.com/ProjectAbell/guardian-dm-operator/controllers.(*GuardianCopyRestoreReconciler).doRestore
	/workspace/controllers/guardiancopyrestore_controller.go:187
github.ibm.com/ProjectAbell/guardian-dm-operator/controllers.(*GuardianCopyRestoreReconciler).Reconcile
	/workspace/controllers/guardiancopyrestore_controller.go:92
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:122
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:323
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:235
Resolution
Find the guardian-dm-controller-manager pod and delete it. A new pod starts within a minute. After the pod reaches a healthy state, retry the backup and restore jobs.
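For example, assuming that the Backup & Restore namespace is ibm-backup-restore as in the preceding log, you can delete the pod with commands similar to the following sketch:
    # Find the Data Mover controller pod.
    oc get pods -n ibm-backup-restore | grep guardian-dm-controller-manager
    # Delete it; its deployment re-creates a fresh pod automatically.
    oc delete pod <guardian-dm-controller-manager-pod-name> -n ibm-backup-restore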

"Failed restore snapshot" error occurs with applications using IBM Storage Scale storage PVCs

  • A "Failed restore snapshot" error occurs with applications using IBM Storage Scale storage PVCs.
    Cause
    The "disk quota exceeded" error occurs whenever you restore from an object storage location having applications that use IBM Storage Scale PVC with a size less than 5 GB.
    Resolution
    Increase the IBM Storage Scale PVC size to a minimum of 5 GB and do a backup and restore operation.
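For example, assuming that the storage class allows volume expansion, you can resize the PVC with a command similar to the following sketch. The PVC name and namespace are placeholders:
    # Expand the PVC to 5 GB; the storage class must have allowVolumeExpansion: true.
    oc patch pvc <pvc-name> -n <namespace> -p '{"spec":{"resources":{"requests":{"storage":"5Gi"}}}}'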

Cannot restore multiple namespaces to a single alternative namespace

You cannot restore multiple namespaces to a single alternative namespace. If you attempt such a restore, then the job fails. Example transaction manager log:
2023-06-27 15:05:53,633[TM_5][c2a4b768-2acc-4169-9c5f-f88e60f3be2b][restoreguardian:restore_application_w_recipe Line 172][INFO] - altNS: ns2, number of ns: 2

2023-06-27 15:05:53,633[TM_5][c2a4b768-2acc-4169-9c5f-f88e60f3be2b][restoreguardian:restore_application_w_recipe Line 176][ERROR] - Alternate namespace specified, but more than one namespace was backed up

Restore to a cluster that does not have an identical storage class as the source cluster

You cannot restore to a cluster that does not have the same storage class as the source cluster. However, the transaction manager still attempts to create PVCs with the nonexistent storage class on the spoke cluster, and the restore eventually fails with the Failed restore snapshot status.
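Before you restore, you can confirm that the required storage class names exist on the target cluster, for example:
    # On the source cluster, note the storage classes used by the backed-up application.
    oc get storageclass
    # On the target cluster, verify that storage classes with the same names exist.
    oc get storageclass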

Applications page does not show the details of the application

Problem statement
The Applications page does not show the details of newly backed up applications when you upgrade IBM Storage Fusion to the latest version but leave the Backup & Restore service at an older version.
Resolution
As a resolution, upgrade the Backup & Restore service to the latest version after you upgrade IBM Storage Fusion.

S3 buckets must not enable expiration policies

Backup & Restore currently uses restic to move the Velero backups to repositories on S3 buckets. When a restic repository gets initialized, it creates a configuration file and several subdirectories for its snapshots. As restic does not update it after initialization, the modification timestamp of this configuration file is never updated.

If you configure an expiration policy on the bucket, the restic configuration objects are deleted at the end of the expiration period. If you set an archive rule, the configuration objects are archived after the time that is defined in the rule. In either case, the configuration is no longer accessible and subsequent backup and restore operations fail.

Note: Do not enable expiration policies on the S3 buckets, and do not set an archive rule on the bucket.
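For example, if your object storage is S3-compatible and you use the AWS CLI, you can check whether lifecycle rules are set on the bucket. The bucket name and endpoint are placeholders:
    # Lists the lifecycle (expiration and archive) rules; a NoSuchLifecycleConfiguration
    # error means that no rules are set on the bucket.
    aws s3api get-bucket-lifecycle-configuration --bucket <bucket-name> --endpoint-url <s3-endpoint>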

Virtual machine restore failure

Problem statement
By default, the VirtualMachineClone objects are not restored due to the following unexpected behavior:
  • If you create a VirtualMachineClone object and delete the original virtual machine, then the restore fails because the object gets rejected.
  • If you create a VirtualMachineClone object and then delete the clone virtual machine, then the restore fails because OpenShift Virtualization ignores the status.phase "Succeeded" and clones the virtual machine again.

    As a result, the clone gets re-created every time you delete it.

  • If you create a VirtualMachineClone object and then back up both the original and the clone virtual machine, the restore fails because OpenShift Virtualization ignores the status.phase "Succeeded" and tries to clone into a virtual machine that already exists.

    OpenShift Virtualization creates a snapshot of the original VirtualMachine, which after restore adds an unwanted VirtualMachineSnapshot and a set of associated VolumeSnapshots whose names start with "tmp-". The clone operation does not complete and remains stuck in the "RestoreInProgress" state because the VirtualMachine that the VirtualMachineClone requests already exists.

Resolution
As a resolution, force the restore of the VirtualMachineClone objects by explicitly including them in the Recipe.
Edit the OADP DataProtectionApplication object "velero" and add the spec.configuration.velero.args.restore-resource-priorities field as follows:

    velero:
      args:
        restore-resource-priorities: "securitycontextconstraints,customresourcedefinitions,namespaces,managedcluster.cluster.open-cluster-management.io,managedcluster.clusterview.open-cluster-management.io,klusterletaddonconfig.agent.open-cluster-management.io,managedclusteraddon.addon.open-cluster-management.io,storageclasses,volumesnapshotclass.snapshot.storage.k8s.io,volumesnapshotcontents.snapshot.storage.k8s.io,volumesnapshots.snapshot.storage.k8s.io,datauploads.velero.io,persistentvolumes,persistentvolumeclaims,serviceaccounts,secrets,configmaps,limitranges,pods,replicasets.apps,clusterclasses.cluster.x-k8s.io,endpoints,services,-,clusterbootstraps.run.tanzu.vmware.com,clusters.cluster.x-k8s.io,clusterresourcesets.addons.cluster.x-k8s.io,virtualmachines.kubevirt.io,virtualmachineclones.clone.kubevirt.io"
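For example, you can open the object for editing and then confirm that the Velero deployment picks up the argument. The namespace is a placeholder; use the namespace where OADP is installed in your environment:
    # Edit the DataProtectionApplication and add the restore-resource-priorities argument shown above.
    oc edit dataprotectionapplication velero -n <oadp-namespace>
    # After the operator reconciles, verify that the argument is present on the Velero deployment.
    oc get deployment velero -n <oadp-namespace> -o jsonpath='{.spec.template.spec.containers[0].args}'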
Problem statement
The Data Mover operator must restore CephFS and IBM Storage Scale 5.2.0+ snapshots for backup by using the ReadOnlyMany access mode.
Resolution