Restore issues

List of restore issues in the Backup & Restore service of IBM Fusion.

Restore fails because of ephemeral local storage exceeding limits error in Velero pod

Resolution
When you back up and restore a large number of cluster resources, Velero pod resource limits must be increased for the operations to succeed. The optimal values depend on your individual environment, but the following values are representative for backup and restore of 1500 resources. To change the local storage the Velero Pod is allowed to use from the default 500Mi (Mebibytes) to 1Gi (Gibibytes):
oc patch dpa velero -n <backup-restore-namespace> --type merge -p '{"spec": {"configuration": {"velero": {"podConfig": {"resourceAllocations": {"limits": {"ephemeral-storage": "1Gi"}}}}}}}'
To change the memory limit of the Velero Pod from default 2Gi (Gibibytes) to 4Gi (Gibibytes):
oc patch dpa velero -n <backup-restore-namespace> --type merge -p '{"spec": {"configuration": {"velero": {"podConfig": {"resourceAllocations": {"limits": {"memory": "4Gi"}}}}}}}'
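To verify that the new limits are applied, you can read back the patched field (a quick check that is not part of the original resolution; the jsonpath expression mirrors the patch paths above):
oc get dpa velero -n <backup-restore-namespace> -o jsonpath='{.spec.configuration.velero.podConfig.resourceAllocations.limits}'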

Restore of applications from hub or spoke 2.9.0 to spoke at version 2.8.x fails

Problem statement
Restore of applications from hub or spoke 2.9.0 to spoke at version 2.8.x fails with the following error in the Transaction Manager service:
Argument of type 'NoneType' is not iterable.
Resolution
This is a known limitation. Do not attempt to restore 2.9.0 (kopia-based) backups to a lower version 2.8.x (restic-based).

IBM Cloud Limitations with OADP DataMover

Problem statement
Backups of raw volumes fail for IBM Cloud with OADP 1.4.0 or lower.
Cause
OADP 1.3 or higher exposes volumes of the underlying host during backup or restore. The folders on the host are exposed in the Pods that are associated with the node-agent Daemonset. The /var/lib/kubelet/{plugins,pods} folders are exposed by default, but the folders required on IBM Cloud are /var/data/kubelet/{plugins,pods}. As a result, the backup and restore of volumeMode: block volumes fail with the following example error:
Failed transferring data

[BMYBR0009](https://ibm.com/docs/SSFETU_2.9/errorcodes/BMYBR0009.html) There was an error when processing the job in the Transaction Manager service. The underlying error was: 'Data uploads watch caused an exception: DataUpload d4e7706d-7f0f-4448-b3a0-e9cdff8d33db-1 failed with message: data path backup failed: Failed to run kopia backup: unable to get local block device entry: resolveSymlink: lstat /var/data: no such file or directory'.

The ID of the individual DataUpload varies between jobs.

Resolution
  1. Set the DataMover type to legacy by either the global method or the per-PolicyAssignment method. For the procedure to update the type, see Configure DataMover type for Backup and Restore.

    Though this workaround allows continued use of the DataMover kopia type, it has the following drawbacks:

    • It disables further changes to the DataMover and Velero configurations, such as the ability to change resource allocations (CPU, memory, and ephemeral-storage) and nodeSelectors (DataMover node placement) for the DataMover type kopia. The legacy type is not affected by this workaround.

    • To avoid job failures, do not make these changes while a backup or restore job is in progress.

  2. In the OpenShift Console, go to Workloads > Deployments.
  3. Select openshift-adp-controller-manager and scale the number of Pods to 0.
  4. Go to Workloads > Daemonsets and select node-agent.
  5. Select the YAML tab.
  6. Under the volumes section, add the additional volume host-data as shown in the following example.
    Note: This exposes an additional folder on the host, in addition to the folders mentioned in the Cause of this issue.
    
        volumes:
           - name: host-pods
             hostPath:
               path: /var/lib/kubelet/pods
               type: ''
           - name: host-plugins
             hostPath:
               path: /var/lib/kubelet/plugins
               type: ''
           - name: host-data
             hostPath:
               path: /var/data/kubelet
               type: ''
           - name: scratch
             emptyDir: {}
           - name: certs
             emptyDir: {}
  7. Under volumeMounts, add the host-data volume as shown in the following example.
    
              volumeMounts:
                - name: host-pods
                  mountPath: /host_pods
                  mountPropagation: HostToContainer
                - name: host-plugins
                  mountPath: /var/lib/kubelet/plugins
                  mountPropagation: HostToContainer
                - name: host-data
                  mountPath: /var/data/kubelet
                  mountPropagation: HostToContainer
                - name: scratch
                  mountPath: /scratch
                - name: certs
                  mountPath: /etc/ssl/certs
  8. Save the changes and wait a couple of minutes for the Pods to restart.

    The backups and restores of PersistentVolumeClaims with volumeMode: block succeed on Red Hat® OpenShift® on IBM Cloud.
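    To confirm that the node-agent Pods picked up the new volume, you can list the Daemonset volume names (a quick check; the namespace placeholder matches the one used elsewhere in this topic):
    oc get daemonset node-agent -n <backup-restore-namespace> -o jsonpath='{.spec.template.spec.volumes[*].name}'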

Restore fails on IBM Power Systems

Restore failures are observed in an IBM Power Systems environment. If you have clusters that run on IBM Power Systems, do not upgrade the Backup & Restore service to 2.8.1; instead, contact IBM Support.

exec format error

Problem statement
Sometimes, you may observe the following error message:
"exec <executable name>": exec format error
For example:
The pod log is empty except for this message: "exec /filebrowser": exec format error
The example error can be due to the wrong architecture of the container. For example, an amd64 container on s390x nodes or an s390x container on amd64 nodes.
Resolution
As a resolution, check whether the architecture of the container that you want to restore matches the architecture of the local node.
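For example, you can compare the node architectures with the architecture of the container image (the image reference is a placeholder; oc image info may require --filter-by-os for multi-architecture images):
oc get nodes -o custom-columns=NAME:.metadata.name,ARCH:.status.nodeInfo.architecture
oc image info <image-reference>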

Restore of namespaces that contain admission webhooks fails

Problem statement
Restore of namespaces that contain admission webhooks fails.
Example error in IBM Fusion restore job:
"Failed restore <some  resource>" "BMYBR0003
      RestorePvcsFailed There was an error when  processing the job in the Transaction Manager
      service"
Example error in Velero pod:
level=error msg="Namespace
      domino-platform, resource restore error: error restoring
      certificaterequests.cert-manager.io/domino-platform/hephaestus-buildkit-client-85k2v:
      Internal error occurred: failed calling webhook  "webhook.cert-manager.io": failed to call
      webhook: Post  "https://cert-manager-webhook.domino-platform.svc:443/mutate?timeout=10s\\":
      service "cert-manager-webhook" not found"
Resolution
  1. Identify the admission webhooks that are applicable to the namespace that is being restored:
    oc get mutatingwebhookconfigurations
    oc describe mutatingwebhookconfigurations
  2. Change the failurePolicy parameter from Fail to Ignore to temporarily disable webhook validation before the restore:
    failurePolicy: Ignore
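    For example, a JSON patch that sets the first webhook entry to Ignore (the configuration name and the webhook index 0 are assumptions; adjust them to match what you found in step 1):
    oc patch mutatingwebhookconfiguration <webhook-configuration-name> --type json -p '[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Ignore"}]'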

Restore before upgrade fails with a BMYBR0003 error

Problem statement
When you try to restore backups that were taken before an upgrade, the restore fails with a BMYBR0003 error.
Diagnosis

After you upgrade, your jobs may fail:

  • Backup jobs with the status:
    "Failed transferring data" "BMYBR0003 There was an error when processing the job in the Transaction Manager service"
  • Restore jobs with the status:
    "Failed restore <some resource>" "BMYBR0003 There was an error when processing the job in the Transaction Manager service"

Confirm the issue in the logs of the manager container of the Data Mover pod.

A sample error message:

2023-07-26T03:39:47Z	ERROR	Failed with error.	{"controller": "guardiancopyrestore", "controllerGroup": "guardian.isf.ibm.com", "controllerKind": "GuardianCopyRestore", "GuardianCopyRestore": {"name":"52a2abfb-ea9b-422f-a60d-fed59527d38e-r","namespace":"ibm-backup-restore"}, "namespace": "ibm-backup-restore", "name": "52a2abfb-ea9b-422f-a60d-fed59527d38e-r", "reconcileID": "c8642f40-c086-413f-a10e-6d6a85531337", "attempt": 2, "total attempts": 3, "error": "EOF"}
github.ibm.com/ProjectAbell/guardian-dm-operator/controllers/util.Retry
	/workspace/controllers/util/utils.go:39
github.ibm.com/ProjectAbell/guardian-dm-operator/controllers/kafka.(*kafkaWriterConnection).PublishMessage
	/workspace/controllers/kafka/kafka_native_connection.go:71
github.ibm.com/ProjectAbell/guardian-dm-operator/controllers.(*GuardianCopyRestoreReconciler).updateOverallCrStatus
	/workspace/controllers/status.go:191
github.ibm.com/ProjectAbell/guardian-dm-operator/controllers.(*GuardianCopyRestoreReconciler).doRestore
	/workspace/controllers/guardiancopyrestore_controller.go:187
github.ibm.com/ProjectAbell/guardian-dm-operator/controllers.(*GuardianCopyRestoreReconciler).Reconcile
	/workspace/controllers/guardiancopyrestore_controller.go:92
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:122
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:323
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:235
Resolution
Find the guardian-dm-controller-manager pod and delete it. A new pod starts within a minute. After the pod reaches a healthy state, retry the backup and restore jobs.
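For example (the exact pod name varies; look it up first):
oc get pods -n <backup-restore-namespace> | grep guardian-dm-controller-manager
oc delete pod <guardian-dm-controller-manager-pod> -n <backup-restore-namespace>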

"Failed restore snapshot" error occurs with applications using IBM Storage Scale storage PVCs

  • A "Failed restore snapshot" error occurs with applications using IBM Storage Scale storage PVCs.
    Cause
    The "disk quota exceeded" error occurs whenever you restore from an object storage location having applications that use IBM Storage Scale PVC with a size less than 10 GB.
    Resolution
    Increase the IBM Storage Scale PVC size to a minimum of 10 GB and do a backup and restore operation.

Cannot restore multiple namespaces to a single alternative namespace

You cannot restore multiple namespaces to a single alternative namespace. If you attempt such a restore, then the job fails. Example transaction manager log:
2023-06-27 15:05:53,633[TM_5][c2a4b768-2acc-4169-9c5f-f88e60f3be2b][restoreguardian:restore_application_w_recipe Line 172][INFO] - altNS: ns2, number of ns: 2

2023-06-27 15:05:53,633[TM_5][c2a4b768-2acc-4169-9c5f-f88e60f3be2b][restoreguardian:restore_application_w_recipe Line 176][ERROR] - Alternate namespace specified, but more than one namespace was backed up

Restore to a cluster that does not have the same storage class as the source cluster

You cannot restore to a cluster that does not have the same storage class as the source cluster. The transaction manager still attempts to create PVCs with the nonexistent storage class on the spoke cluster, and the job eventually fails with a Failed restore snapshot status.

Applications page does not show the details of the application

Problem statement
The backed up applications page does not show the details of newly backed up applications when you upgrade IBM Fusion to the latest version while leaving the Backup & Restore service at an older version.
Resolution
As a resolution, upgrade the Backup & Restore service to the latest version after an IBM Fusion upgrade.

S3 buckets must not enable expiration policies

Backup & Restore currently uses restic to move the Velero backups to repositories on S3 buckets. When a restic repository is initialized, it creates a configuration file and several subdirectories for its snapshots. Because restic does not update the configuration file after initialization, its modification timestamp never changes.

If you configure an expiration policy on the bucket, the restic configuration objects get deleted at the end of the expiration period. If you have an archive rule set, the configuration objects get archived after the time that is defined in the rule. In either case, the configuration is no longer accessible and subsequent backup and restore operations fail.

Note: The S3 buckets must not enable expiration policies. Also, the bucket must not have an archive rule set.
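For S3-compatible object storage that supports the AWS CLI, you can check whether a lifecycle configuration exists on the bucket (the bucket name and endpoint are placeholders; the command returns a NoSuchLifecycleConfiguration error if none is set):
aws s3api get-bucket-lifecycle-configuration --bucket <bucket-name> --endpoint-url <s3-endpoint>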

Virtual machine restore failure

Problem statement
By default, the VirtualMachineClone objects are not restored due to the following unexpected behavior:
  • If you create a VirtualMachineClone object and delete the original virtual machine, then the restore fails because the object gets rejected.
  • If you create a VirtualMachineClone object and then delete the cloned virtual machine, then the restore fails because OpenShift Virtualization ignores the status.phase "Succeeded" and clones the virtual machine again.

    As a result, the clone gets re-created every time you delete it.

  • If you create a VirtualMachineClone and then back up both the original and the cloned virtual machine, the restore fails because it ignores the status.phase "Succeeded" and tries to clone to a virtual machine that already exists.

    OpenShift Virtualization creates a snapshot of the original VirtualMachine, which adds after restore an unwanted VirtualMachineSnapshot and a set of associated VolumeSnapshots whose names start with "tmp-". The clone operation does not complete and remains stuck in the "RestoreInProgress" state because the VirtualMachine that is requested in the VirtualMachineClone already exists.

Resolution
As a resolution, force the restore of the VirtualMachineClone objects by explicitly including them in the Recipe.
Change the OADP DataProtectionApplication object "velero" and add the spec.configuration.velero.args.restore-resource-priorities field as follows:

    velero:
      args:
        restore-resource-priorities: "securitycontextconstraints,customresourcedefinitions,namespaces,managedcluster.cluster.open-cluster-management.io,managedcluster.clusterview.open-cluster-management.io,klusterletaddonconfig.agent.open-cluster-management.io,managedclusteraddon.addon.open-cluster-management.io,storageclasses,volumesnapshotclass.snapshot.storage.k8s.io,volumesnapshotcontents.snapshot.storage.k8s.io,volumesnapshots.snapshot.storage.k8s.io,datauploads.velero.io,persistentvolumes,persistentvolumeclaims,serviceaccounts,secrets,configmaps,limitranges,pods,replicasets.apps,clusterclasses.cluster.x-k8s.io,endpoints,services,-,clusterbootstraps.run.tanzu.vmware.com,clusters.cluster.x-k8s.io,clusterresourcesets.addons.cluster.x-k8s.io,virtualmachines.kubevirt.io,virtualmachineclones.clone.kubevirt.io"
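One way to apply this change is to open the object for editing (the object name velero and the namespace placeholder match the patch commands earlier in this topic):
oc edit dpa velero -n <backup-restore-namespace>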

Service recovery involving a large number of CRs

Problem statement
When service recovery processes a large number of CRs, service discovery may fail after you add the backup storage location.
Resolution
  1. Check whether the backup-location pod is evicted because it exceeded the ephemeral storage limit.
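    For example, an evicted pod appears in the pod list with the Evicted status (the grep pattern is an assumption based on the pod name):
    oc get pods -n <backup-restore-namespace> | grep backup-location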
  2. If no backup-location pod is evicted, check the logs of the backup-location pod for any of the following messages:
    • Processing service protection discovery was successful
    • Processing service protection discovery failed
    • Error processing service protection discovery

    If you do not see any of these messages, then the service discovery may still be running.
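    One way to search for these messages (the deployment name is taken from step 3 of this procedure):
    oc logs deployment/backup-location-deployment -n <backup-restore-namespace> | grep 'service protection discovery'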

  3. If the backup-location pod restarts because of the ephemeral storage limit, temporarily update the resources of the backup-location deployment. Increase the ephemeral storage limit from 512 Mi to 1 Gi, and the memory limit from 512 Mi to 2 Gi. Change these limits back after the service discovery finishes populating the service backups that are available for restore.
    oc set resources deployment backup-location-deployment --limits=ephemeral-storage=1Gi,memory=2Gi -n <backup-restore-namespace>
  4. After the service discovery completes, search the backup-location pod logs for the following message, which includes the number of resources found in the latest, recoverable service backup:
    Restic backup validation processed
    Example resources:
    • 363 Application CRs
    • 40 Fusion BackupStorageLocation CRs
    • 385 Secrets for BackupStorageLocation CRs
    • 25 BackupPolicy CRs
    • 565 PolicyAssignment CRs
    • 12126 Backup CRs
    Note: There may be more than one such message if some service backups are not suitable for recovery, for example, because they contain no Backup CRs.
    For more information about Backup & Restore Hub performance and scaling, see Backup & restore hub performance and scaling.
  5. Update the resources of the transaction manager to temporarily increase the memory limit from 500 Mi to 2 Gi.

    For example:

    oc set resources deployment transaction-manager --limits=memory=2Gi -n <backup-restore-namespace>
  6. Update the resources of the backup service to temporarily increase the memory limit from 1 Gi to 2 Gi, and increase the number of replicas (for example, 8 for the example in the previous step).
    oc set resources deployment backup-service --limits=memory=2Gi -n <backup-restore-namespace>
    Note: If the following scale command does not work, scale the number of replicas in the OpenShift UI.
    oc scale deployment backup-service --replicas=8 -n <backup-restore-namespace>
  7. Update the resources of mongodb based on the number of CRs to be processed. In this example, temporarily increase the CPU limit to 4 and the memory limit to 4 Gi.
    oc set resources sts mongodb --limits=cpu=4,memory=4Gi -n <backup-restore-namespace>
  8. Update the Velero resources. Modify the DataProtectionApplication CR named velero:
        velero:
          customPlugins:
            - image: 'cp.icr.io/cp/fbr/guardian-kubevirt-velero-plugin@sha256:bce6932338d8e2d2ce0dcca2c95bdfa8ab6e337a758e039ee91773f3b26ceb06'
              name: isf-kubevirt
          defaultPlugins:
            - openshift
            - aws
          noDefaultBackupLocation: false
          podConfig:
            labels:
              app.kubernetes.io/part-of: ibm-backup-restore
            resourceAllocations:
              limits:
                cpu: '2'
                ephemeral-storage: 500Mi
                memory: 2Gi
              requests:
                cpu: 200m
                ephemeral-storage: 275Mi
                memory: 256Mi
    In this example, update the ephemeral-storage limit from 500Mi to 1Gi.
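    You can make the same change with a merge patch (this reuses the command pattern shown at the beginning of this topic):
    oc patch dpa velero -n <backup-restore-namespace> --type merge -p '{"spec": {"configuration": {"velero": {"podConfig": {"resourceAllocations": {"limits": {"ephemeral-storage": "1Gi"}}}}}}}'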
  9. Wait for all pods that were modified to restart and be in running state.
    Note: After the service recovers successfully, it may take time for the services to process the restored CRs.
  10. Wait for all of the backup storage locations to be connected.
  11. Check the backup service pods to determine if any CRs are still in progress.
  12. After processing completes, revert the resources to their original values, as shown in the following example, or adjust them based on the expected number of concurrent jobs.
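    For example, to revert the limits to the original values that are named in the previous steps (adjust if your environment used different values):
    oc set resources deployment backup-location-deployment --limits=ephemeral-storage=512Mi,memory=512Mi -n <backup-restore-namespace>
    oc set resources deployment transaction-manager --limits=memory=500Mi -n <backup-restore-namespace>
    oc set resources deployment backup-service --limits=memory=1Gi -n <backup-restore-namespace>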

More than one BackupRepository found for workload namespace

Problem statement
When you restore a namespace with multiple PVCs, the restore job might fail with the following error message:
error to initialize data path: error to ensure backup repository <repo>: failed to wait BackupRepository, errored early: more than one BackupRepository found for workload namespace <namespace>
Cause
This issue is due to a race condition during the creation of the BackupRepository.
Resolution
To work around the problem, clean up the restored namespaces and try the restore again.
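For example, if the partially restored namespace can be deleted entirely (the namespace name is a placeholder):
oc delete namespace <restored-namespace>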