Backup & restore configuration parameters

You can use the configuration parameters in the guardian-configmap ConfigMap to change the defaults for the IBM Storage Fusion Backup & Restore agent:

deleteBackupWait
Timeout, in minutes, for the restic command to delete a backup from S3 storage. The default value is 20 minutes, and the allowed range is 10 to 120.
pvcSnapshotMaxParallel
Number of threads available to take concurrent snapshots. The default value is 20.
backupDatamoverTimeout
Maximum amount of time, in minutes, for the datamover to complete a backup. The default value is 20 minutes, and the allowed range is 10 to 14400. After you modify backupDatamoverTimeout, update cancelJobAfter.
restoreDatamoverTimeout
Maximum amount of time, in minutes, for the datamover to complete a restore. The default value is 20 minutes, and the allowed range is 10 to 14400. After you modify restoreDatamoverTimeout, update cancelJobAfter.
snapshotRestoreJobTimeLimit
This parameter is not used.
pvcSnapshotRestoreTimeout
Timeout, in minutes, for creating a PVC from a snapshot. The default value is 15 minutes.
kafka-thread-size
The number of processing threads in the transaction manager. The default value is 10.
snapshotTimeout
Timeout, in minutes, for a snapshot to reach the ready state. The default value is 20, and the allowed range is 10 to 120.
datamoverJobpodEphemeralStorageLimit
Datamover pod ephemeral storage limit. The default value is 2000Mi.
datamoverJobPodDataMinGB
Minimum PVC capacity for each datamover pod before a new datamover pod is started. It is set in GB, and the default value is 10 GB.
datamoverJobpodMemoryLimit
Datamover pod memory limit. The default value is 15000Mi.
datamoverJobpodCPULimit
Datamover pod CPU limit. The default value is 2.
cancelJobAfter
If you modify backupDatamoverTimeout or restoreDatamoverTimeout, update the job-manager deployment configuration parameter cancelJobAfter. It is the maximum amount of time, in milliseconds, that the job-manager waits before it cancels a long-running job. The default value is 3600000 (1 hour).
MaxNumJobPods
The maximum number of datamover pods that can be assigned to each backup or restore. This value is set in the DataProtectionAgent resource rather than in guardian-configmap; see the steps later in this section.

This field controls the number of PersistentVolumeClaims (PVCs) that are attached to datamover pods during backup and restore operations to the BackupStorageLocation. It is set for each installation individually. If your IBM Storage Fusion installation includes spoke clusters, set it on each spoke cluster; it is not a global field that applies to all clusters in the installation.

This field helps to distribute the storage load across multiple nodes of the cluster when they are available. Some StorageClasses impose a maximum number of PVCs that can be attached to an individual node of the cluster, and this field helps to manage that limitation. To find out whether your StorageClass has this limitation, check whether the CSINode objects of your storage provider set the spec.drivers[].allocatable.count field, as shown in the following example. VPC Block on IBM Cloud is one such storage provider, with a typical limit of 10 per node. If the application has more than 30 PVCs, increasing the number of datamover pods decreases the number of PVCs that are attached to each node. If the application has fewer PVCs, the default is more than sufficient.
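
One way to check is to list the per-node attach limit that each CSI driver reports. The following command is only illustrative; drivers that do not set spec.drivers[].allocatable.count show an empty value in the LIMIT column:
  oc get csinode -o custom-columns='NODE:.metadata.name,DRIVER:.spec.drivers[*].name,LIMIT:.spec.drivers[*].allocatable.count'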

This field can increase or decrease the maximum number of datamover pods that are assigned in each backup or restore. More pods use more resources such as CPU and memory, and can help improve performance for backups and restores with larger numbers of PVCs. Increasing this value may help if the number of PVCs to be backed up or restored at the same time is more than 30.

This field does not guarantee the creation of more datamover pods. A number of heuristics are used at runtime to help determine the assignment of PVCs to datamovers, including total PVC capacity, number of PVCs, amount of data transferred during previous backups, total number of PVCs handled, storage providers involved, among others. This field changes the maximum allowed, and it does not guarantee the specified number of datamovers.

Complete the following steps to change the value:
  1. In the OpenShift® Container Platform console, click Operators > Installed Operators.
  2. Change the project to the IBM Storage Fusion Backup & Restore namespace, for example, ibm-backup-restore.
  3. Click to open IBM Storage Fusion Backup & Restore Agent.
  4. Click the Data Protection Agent tab, and then click the dpagent instance. Alternatively, run the following oc command:
    oc edit -n ibm-backup-restore dataprotectionagent
  5. Go to the YAML tab.
  6. Edit spec.transactionManager.datamoverJobPodCountLimit. The value must be numeric and enclosed in quotation marks, for example, '3', '5', or '10'. A YAML sketch follows these steps.
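
For reference, a minimal sketch of how the edited field might appear in the DataProtectionAgent YAML; the value '5' is only an example:
  spec:
    transactionManager:
      # Allow up to 5 datamover pods to be assigned to each backup or restore
      datamoverJobPodCountLimit: '5'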

Backup and restore of a large number of files

When you back up and restore a large number of files that are located on CephFS, you must complete the following additional steps for the operations to succeed. For such a large number of files, you must be on OpenShift Container Platform 4.12 or later and OpenShift Data Foundation 4.12 or later. The optimal values depend on your individual environment, but the following values are representative for a backup and restore of a million files; a configuration sketch follows the list.
  • Prevent the transaction manager from failing long-running backup jobs. In the ibm-backup-restore project, edit the config map named guardian-configmap. Look for backupDatamoverTimeout. This value is in minutes, and the default is 20 minutes. For example, increase this value to 8 hours (480).
  • Prevent the job manager from canceling long-running jobs. In the ibm-backup-restore project, edit the job-manager deployment. Under env, look for cancelJobAfter. This value is in milliseconds, and the default is 1 hour. For example, increase this value to 20 hours (72000000).
  • Prevent the transaction manager from failing long-running restore jobs. In the ibm-backup-restore project, edit the config map named guardian-configmap. Look for restoreDatamoverTimeout. This value is in minutes, and the default is 20 minutes. For example, increase this value to 20 hours (1200).
  • In the same config map, increase the amount of ephemeral storage the data mover is allowed to use by increasing datamoverJobpodEphemeralStorageLimit to 4000Mi or more.
  • OpenShift Data Foundation parameters.
    • Increase the resources available to OpenShift Data Foundation. Increase the limits and requests for the two MDS pods, a and b, for example, to 2 CPU and 32 Gi of memory. For more information about the changes, see Changing resources for the OpenShift Data Foundation components.
    • Prevent SELinux relabeling.

      At restore time, OpenShift attempts to relabel each of the files. If the relabeling takes too long, the restored pod fails with CreateContainerError. The following article explains the situation and some possible workarounds to prevent the relabeling: https://access.redhat.com/solutions/6221251.
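
The following sketch summarizes where the example values from this list might be set; the YAML excerpts show only the relevant keys, and the values are examples rather than recommendations:
  # Edit the Backup & Restore ConfigMap
  oc edit configmap guardian-configmap -n ibm-backup-restore

  # Excerpt of the data section
  data:
    backupDatamoverTimeout: "480"                    # minutes (8 hours)
    restoreDatamoverTimeout: "1200"                  # minutes (20 hours)
    datamoverJobpodEphemeralStorageLimit: "4000Mi"

  # Edit the job-manager deployment
  oc edit deployment job-manager -n ibm-backup-restore

  # Excerpt of the container env section
  env:
  - name: cancelJobAfter
    value: "72000000"                                # milliseconds (20 hours)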

Use the following steps to determine whether your backups fail because of a large number of files; a combined command sketch follows the steps:
  1. Run the following command to check whether the MDS pods are restarting:
    oc get pod -n openshift-storage | grep mds
  2. If they are restarting, check the termination reason:
    1. Describe the pod.
    2. Check whether the termination reason is OOMKilled.
  3. Run the following command to check and monitor the memory usage of the MDS pods:
    oc adm top pod -n openshift-storage
  4. If the memory usage keeps spiking until the pod restarts, then see Changing resources for the OpenShift Data Foundation components.
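
For convenience, the checks in these steps can be combined as follows; the pod name is a placeholder:
  # 1. Check whether the MDS pods are restarting (non-zero RESTARTS column)
  oc get pod -n openshift-storage | grep mds

  # 2. If they are restarting, check the last termination reason (look for OOMKilled)
  oc describe pod <mds-pod-name> -n openshift-storage | grep -A 5 'Last State'

  # 3. Monitor the memory usage of the MDS pods
  oc adm top pod -n openshift-storage | grep mds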