Backup & restore configuration parameters

Configuration parameters allow customization of IBM Fusion Backup & restore settings.

Change defaults for IBM Fusion Backup & restore agent

deleteBackupWait
Timeout, in minutes, for the restic command to delete a backup in S3 storage. The default value is 20 minutes, and the allowed range is 10 to 120.
pvcSnapshotMaxParallel
Number of threads available to take concurrent snapshots. The default value is 20.
backupDatamoverTimeout
Maximum amount of time, in minutes, for the datamover to complete a backup. The default value is 1200, and the allowed range is 10 to 14400. After you modify backupDatamoverTimeout, update cancelJobAfter.
restoreDatamoverTimeout
Maximum amount of time, in minutes, for the datamover to complete a restore. The default value is 1200, and the allowed range is 10 to 14400. After you modify restoreDatamoverTimeout, update cancelJobAfter.
snapshotRestoreJobTimeLimit
This parameter is not used.
pvcSnapshotRestoreTimeout
Timeout for creating PVC from snapshot in minutes. The default value is 15 minutes.
kafka-thread-size
The number of processing threads in the transaction manager. The default value is 10.
snapshotTimeout
Timeout for snapshot to resolve to the ready state in minutes. The default value is 20, and the allowed range is 10 to 120.
datamoverJobpodEphemeralStorageLimit
Datamover pod ephemeral storage limit. The default value is 2000Mi.
datamoverJobPodDataMinGB
Minimum PVC capacity for each datamover pod before a new datamover pod is started. It is set in GB, and the default value is 10 GB.
datamoverJobpodMemoryLimit
Datamover pod memory limit. The default value is 15000Mi.
datamoverJobpodCPULimit
Datamover pod CPU limit. The default value is 2.
cancelJobAfter
If you modify backupDatamoverTimeout or restoreDatamoverTimeout, update the job-manager deployment configuration parameter cancelJobAfter. It is the maximum amount of time in milliseconds that the job-manager waits before it cancels the long-running job. The default value is 3600000 (1 hour).
MaxNumJobPods
Configurable maximum number of datamover pods.

This field controls the number of PersistentVolumeClaims that are attached to datamover pods during backup and restore to the BackupStorageLocation. It is set on a per-installation basis. If you have spoke clusters in your IBM Fusion installation, set it on each spoke cluster individually; it is not a global field that applies to all clusters in the installation.

It helps to distribute the storage load across multiple nodes of the cluster when available. Some StorageClasses impose a maximum number of PVCs that can be attached to an individual node of the cluster, and this field helps to manage that limitation. To find out whether your StorageClass has this limitation, check whether the CSINode of your storage provider has the spec.drivers[].allocatable.count field set, as shown in the example after this paragraph. VPC Block on IBM Cloud is one such storage provider with this limitation, typically 10 per node. Increasing the number of pods when the application has more than 30 PVCs decreases the number of PVCs attached to each node. Below that number of PVCs, the default is more than sufficient.
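For reference, the following is a minimal sketch of listing the per-node attach limits that each CSI driver reports, if any. The column names are illustrative; the drivers that appear in the output depend on your storage providers.

# List each node with the CSI drivers it runs and their reported attach limits, if set
oc get csinode -o custom-columns='NODE:.metadata.name,DRIVER:.spec.drivers[*].name,ALLOCATABLE:.spec.drivers[*].allocatable.count'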

This field can increase or decrease the maximum number of datamover pods that are assigned to each backup or restore. More pods use more resources, such as CPU and memory, and can help improve performance for backups and restores with larger numbers of PVCs. Increasing this value can help if more than 30 PVCs are backed up or restored at the same time.

This field does not guarantee the creation of more datamover pods. A number of heuristics are used at runtime to determine the assignment of PVCs to datamovers, including total PVC capacity, number of PVCs, amount of data transferred during previous backups, total number of PVCs handled, and the storage providers involved. This field changes the maximum allowed; it does not guarantee the specified number of datamovers.

For more information about how to change the value, see Change defaults for IBM Fusion Backup & restore agent.
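As a reference, the timeout and datamover parameters described above are held as string values in the guardian-configmap ConfigMap in the Backup & restore namespace. The following is a minimal sketch of editing a few of them; the exact set of keys present in your ConfigMap depends on your release.

oc edit configmap guardian-configmap -n ibm-backup-restore

# Excerpt of the data section; units follow the parameter descriptions above
data:
  backupDatamoverTimeout: "1200"
  restoreDatamoverTimeout: "1200"
  datamoverJobpodEphemeralStorageLimit: "2000Mi"
  pvcSnapshotMaxParallel: "20"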

DeleteBackupRequest CR cleanup
DeleteBackupRequest CRs in the Completed state are automatically deleted after a default retention of 14 days. You can set the dbrCRRetention configuration parameter in the guardian-configmap ConfigMap to change the default. By default, the cleanup thread runs every 2 hours and cleans up a maximum of 100 DeleteBackupRequest CRs in one run. You can override these two values by specifying the dbrCleanupCheckInterval and dbrCleanupBatchSize configuration parameters in the guardian-configmap ConfigMap.
dbrCRRetention
Number of days after which DeleteBackupRequest CRs in the Completed state are deleted. If not specified, the default is 14 days.
dbrCleanupCheckInterval
Interval, in hours, at which the DeleteBackupRequest cleanup thread runs. If not specified, 2 hours is the default.
dbrCleanupBatchSize
Number of DeleteBackupRequest CRs to clean up in one run of the cleanup thread. If not specified, 100 is the default.
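For example, a sketch of overriding all three values with oc patch; the values shown (7 days, a 4-hour interval, and a batch size of 200) are illustrative only.

oc patch configmap guardian-configmap -n ibm-backup-restore --type merge \
  -p '{"data":{"dbrCRRetention":"7","dbrCleanupCheckInterval":"4","dbrCleanupBatchSize":"200"}}'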

Change defaults for IBM Fusion Backup & restore agent

Do the following steps to change the value:
  1. In the OpenShift® Container Platform console, click Operators > Installed Operators.
  2. Change the project to the IBM Spectrum Fusion Backup and Restore namespace, for example, ibm-backup-restore.
  3. Click to open IBM Fusion Backup & Restore Agent.
  4. Click the Data Protection Agent tab and click the dpagent install. Alternatively, if you want to use the oc command:
    oc edit -n ibm-backup-restore dataprotectionagent
  5. Go to the YAML tab.
  6. Edit spec.transactionManager.datamoverJobPodCountLimit, as shown in the sketch after these steps. The value must be numeric and in quotes, for example, '3', '5', or '10'.
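The following is a minimal sketch of the resulting DataProtectionAgent spec with an assumed limit of 5; only the relevant field is shown.

spec:
  transactionManager:
    datamoverJobPodCountLimit: '5'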

Backup and restore large number of files

When you back up and restore a large number of files located on CephFS, you must perform the following additional steps for the operations to succeed. For such a large number of files, you must be on OpenShift Container Platform 4.14 or later and Data Foundation 4.12 or later. The optimal values depend on your individual environment, but the following values are representative for a backup and restore of a million files. For a command-line sketch of these changes, see the example after the list.
  • Prevent the transaction manager from failing long-running backup jobs. In the ibm-backup-restore project, edit the config map named guardian-configmap and look for backupDatamoverTimeout. This value is in minutes, and the default is 20 minutes. For example, increase this value to 8 hours (480).
  • Prevent the job manager from canceling long-running jobs. In the ibm-backup-restore project, edit the job-manager deployment. Under env, look for cancelJobAfter. This value is in milliseconds, and the default is 1 hour. For example, increase this value to 20 hours (72000000).
  • Prevent the transaction manager from failing long-running restore jobs. In the ibm-backup-restore project, edit the config map named guardian-configmap and look for restoreDatamoverTimeout. This value is in minutes, and the default is 20 minutes. For example, increase this value to 20 hours (1200).
  • In the same config map, increase the amount of ephemeral storage the data mover is allowed to use by increasing datamoverJobpodEphemeralStorageLimit to 4000Mi or more.
  • OpenShift Data Foundation parameters.
    • Increase the resources available to OpenShift Data Foundation. Increase the limits and requests for the two MDS pods, a and b, for example to 2 CPU and 32 Gi memory. For more information about the changes, see Changing resources for the OpenShift Data Foundation components.
    • Prevent SELinux relabelling

      At restore time, OpenShift attempts to relabel each of the files. If it takes too long, the restored pod fails with CreateContainerError. This article explains the situation and some of the possible workarounds to prevent the relabeling: https://access.redhat.com/solutions/6221251.
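The following is a minimal command-line sketch of the Backup & restore changes above; it assumes the ibm-backup-restore namespace and the example values from the list, and the exact keys in your guardian-configmap can differ by release.

# Raise the datamover timeouts (minutes) and the ephemeral storage limit
oc patch configmap guardian-configmap -n ibm-backup-restore --type merge \
  -p '{"data":{"backupDatamoverTimeout":"480","restoreDatamoverTimeout":"1200","datamoverJobpodEphemeralStorageLimit":"4000Mi"}}'

# Let the job manager wait up to 20 hours (milliseconds) before canceling jobs
oc set env deployment/job-manager -n ibm-backup-restore cancelJobAfter=72000000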


Use the following steps to understand whether your backups fail due to a large number of files:
  1. Run the following command to check whether the MDS pods are restarting:
    oc get pod -n openshift-storage |grep mds
  2. If they are restarting, check the termination reason, as shown in the example after these steps:
    1. Describe the pod.
    2. Check whether the termination reason is OOMKilled.
  3. Run the following command to check the memory usage by the MDS pods and monitor for memory usage:
    oc adm top pod -n openshift-storage
  4. If the memory usage keeps spiking until the pod restarts, then see Changing resources for the OpenShift Data Foundation components.
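For example, a sketch of checking the termination reason of an MDS pod; substitute the pod name from the oc get pod output above.

oc describe pod <mds-pod-name> -n openshift-storage | grep -A 5 "Last State"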

Configure DataMover type for Backup and Restore

The DataMover type dictates both the entity responsible for backing up data on PersistentVolumeClaims and the method of storage. The following values are available:
  • kopia, which is the default value.
  • legacy, which is the DataMover for versions 2.8 or earlier.
The value can be set globally and per PolicyAssignment. The two DataMover types are not cross-compatible: backups taken with one DataMover type use that same type during restore. Existing backups continue to expire normally per their Policy.
Change the DataMover type used across all new backups
From the OpenShift console:
  1. Log in to OpenShift Console.
  2. Go to Workloads > ConfigMaps.
  3. Select the IBM Fusion install namespace.
  4. Open isf-data-protection-config and update the data.DataMover field.
Example command:
oc patch configmap -n ibm-spectrum-fusion-ns isf-data-protection-config -p '{"data":{"DataMover":"kopia"}}'
Here, replace the namespace ibm-spectrum-fusion-ns with your Fusion namespace. This example sets the kopia DataMover; for the legacy DataMover, replace kopia with legacy.
Choose DataMover during policy assignment
To try out the DataMover options, select the type by annotating PolicyAssignment objects. Each PolicyAssignment object associates an existing Policy and BackupStorageLocation with an application to back up. When you set this value according to the following instructions, the global setting is ignored, and all new backups associated with the PolicyAssignment use the DataMover type specified in the annotation.
  1. Search by label to find the existing PolicyAssignment for your applications.
    • To search by application - dp.isf.ibm.com/application-name=<application name>
    • To search by policy - dp.isf.ibm.com/backuppolicy-name=<policy name>
    • To search by backup storage location - dp.isf.ibm.com/backupstoragelocation-name=<backupstoragelocation name>
      The general naming format is as follows:
      <application name>-<policy name>-<cluster url>
      Example:
      aws-20240716-220437-awsdaily-apps.bnr-hcp-munch.apps.blazehub01.mydomain.com
  2. Change the value of DataMover.
    From the OpenShift console:
    1. Go to Operators > Installed Operators.
    2. Select the IBM Fusion install namespace.
    3. Open IBM Fusion.
    4. Go to the Policy Assignment tab and change the DataMover value.
    Using commands:
    1. Get the policy assignment object:
      oc get policyassignment -n ibm-spectrum-fusion-ns
    2. Add the annotation to the PolicyAssignment object.
      dp.isf.ibm.com/dataMover: <datamover type>

      Example:

      dp.isf.ibm.com/dataMover: legacy
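As a sketch, the same annotation can also be added with oc annotate; the PolicyAssignment name below is the one from the earlier naming-format example and is illustrative.

oc annotate policyassignment -n ibm-spectrum-fusion-ns \
  aws-20240716-220437-awsdaily-apps.bnr-hcp-munch.apps.blazehub01.mydomain.com \
  dp.isf.ibm.com/dataMover=legacy --overwrite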

Choose nodes for DataMovers for OADP DataMover

From version 2.9 onwards, Backup & restore is transitioning from the in-house DataMover controller, which uses restic as the data transfer software, to OADP, which uses its own implementation of the DataMover and controller with kopia.
spec.datamoverConfiguration
The field spec.datamoverConfiguration is used to configure the new OADP DataMovers. For backwards compatibility, you can continue to use the fields under spec.transactionManager to configure the DataMovers from the previous releases.
spec.datamoverConfiguration.nodeAgentConfig.env
This array allows you to specify environment variables that are inserted into the node-agent pods. The example value of HOME sets the folder that kopia uses for local files. It controls the on-container folder where cache files related to the storage repository are stored. The default value is /home/velero, which uses local ephemeral storage.

spec:
  datamoverConfiguration:
    nodeAgentConfig:
      env:
      - name: HOME
        value: /home/velero
spec.datamoverConfiguration.nodeAgentConfig.labels
This field adds additional labels to the node-agent daemonset and pods. It is an optional field to organize pods and does not affect backup and restore behavior. This field is a YAML object.

spec:
  datamoverConfiguration:
    nodeAgentConfig:
      labels:
        cloudpakbackup: datamover
spec.datamoverConfiguration.nodeAgentConfig.tolerations
This array field specifies tolerations, which permit the execution of Pods on nodes in resource-constrained environments or multi-architecture clusters. For more information about this field, see Kubernetes documentation.

spec:
  datamoverConfiguration:
    nodeAgentConfig:
      tolerations:
      - key: "kubernetes.io/arch"
        operator: "Equal"
        value: "amd64"
spec.datamoverConfiguration.nodeAgentConfig.resourceAllocations
This field defines the resource allocation for the node-agent controller, including both scheduling requirements and limits.
If backups or restores are causing evictions in the node-agent pods, increase the value of this field. To check for resource violations, review the Events in the Backup and Restore install namespace and the following sources:
  • The status field of the Daemonset node-agent
  • The OpenShift Dashboard for resource monitoring in the Backup and Restore namespace

    See OpenShift documentation.

  • Grafana in the Backup and Restore namespace if installed

    See OpenShift documentation

  • Other applications that use Prometheus metrics.
Example:

spec:
  datamoverConfiguration:
    nodeAgentConfig:
      resourceAllocations:
        limits:
          cpu: "2"
          ephemeral-storage: "50Mi"
          memory: "2048Mi"
        requests:
          cpu: "200m"
          ephemeral-storage: "25Mi"
          memory: "256Mi"
spec.datamoverConfiguration.nodeAgentConfig.nodeSelector
This field determines the nodes where the DaemonSet node-agent runs. It is a crucial security feature because it restricts the HostPath mount of the node-agent, which exposes the following:
  • All PersistentVolumeClaims
  • Projected API volumes such as configmaps and secrets through volume mount
  • Kubernetes API tokens that are in use on the node
Note: As this field causes Restore failures, remove it before restore operations.

If needed, this field can isolate the Daemonset node-agent Pods to Nodes where these security concerns do not apply.

The pods are assigned to Nodes that match the labels attached to the Nodes. If more than one label is used, the labels are treated as a logical AND; only Nodes that match all of the labels run the Pod.

For example, if you set the nodeSelector to "kubernetes.io/hostname: bnr-hcp-munch-6df6702b-2bv7z", then the Daemonset node-agent Pods run only on the node with this label. If there is only one node with this label, the Daemonset node-agent is restricted to deploying Pods to only the labeled node.


spec:
  datamoverConfiguration:
    nodeAgentConfig:
      nodeSelector:
        kubernetes.io/hostname: bnr-hcp-munch-6df6702b-2bv7z
spec.datamoverConfiguration.datamoverPodConfig
This setting manages the PVCs, resource allocations, and the location within the cluster where the datamover pods operate. The loadConcurrency setting provides the number, and loadAffinity provides the location.
spec.datamoverConfiguration.datamoverPodConfig.loadConcurrency
This field controls the maximum number of datamovers assigned to nodes in the cluster.

If anything appears to be in conflict, see the Velero documentation.

This setting controls the number of datamovers deployed per node-agent controller pod, and it consists of two parts:
  • spec.datamoverConfiguration.datamoverPodConfig.loadConcurrency.globalConfig

    This sets the maximum number of datamovers deployable from a single node-agent controller pod for a generic node in the cluster. The subsequent perNodeConfig field overrides globalConfig and manages this setting independently. It is of type integer with a default value of 5 and a minimum value of 1. If loadConcurrency is added to the config, this value is required.

  • spec.datamoverConfiguration.datamoverPodConfig.loadConcurrency.perNodeConfig is optional.
    This allows setting a maximum number of datamovers assigned to a node. Example:
    
    loadConcurrency:
      globalConfig: 2
      perNodeConfig:
      - nodeSelector:
          matchLabels:
            kubernetes.io/hostname: node1
        number: 3
      - nodeSelector:
          matchLabels:
            beta.kubernetes.io/instance-type: Standard_B4ms
        number: 5
    The perNodeConfig uses the default nodeSelector elements from Kubernetes to select nodes by using labels. When multiple labels are specified in matchLabels, the selected nodes are found by using an AND operation, meaning all the labels must match. A more complex matchExpressions field can also be used as an alternative to matchLabels. For more information about assigning pods to nodes, see Kubernetes documentation.

    In this example, the node with the label kubernetes.io/hostname=node1 can run up to 3 DataMovers at once, and a node with the label beta.kubernetes.io/instance-type=Standard_B4ms can run up to 5 DataMovers. With globalConfig set to 2, all other nodes with storage matching the PVC can run up to 2 DataMovers at the same time.

spec.datamoverConfiguration.datamoverPodConfig.loadAffinity
This field controls the nodes in the cluster where the DataMovers must run. For more information, see Velero reference documentation.

loadAffinity:
- nodeSelector:
    matchLabels:
      beta.kubernetes.io/instance-type: Standard_B4ms
    matchExpressions:
    - key: kubernetes.io/hostname
      values:
      - node-1
      - node-2
      - node-3
      operator: In
    - key: xxx/critical-workload
      operator: DoesNotExist

The value is a list of nodeSelectors. The selected nodes are controlled through a nodeSelector field that matches the labels. Each member of the list is evaluated independently. The nodes that match the element are combined to create the final list where the DataMovers can run. If the loadConcurrency.globalConfig is set to 0, then only the selected nodes in loadAffinity and perNodeConfig can run the backup and restore jobs.

Ensure that you meet the following criteria.
  • The intersection of nodes matching loadConcurrency.perNodeConfig and loadAffinity is where the jobs run. Make sure that the size of the intersection is at least one node. See the combined sketch after this list.
  • If the storage used during the backup or restore process is only available on a subset of nodes and loadConcurrency.globalConfig is set to 0, you must select at least one node where the storage is accessible. Otherwise, the job fails.
  • If the loadConcurrency.globalConfig is set to 0 and the selected nodes do not have enough resources in CPU and memory to schedule a DataMover pod, then the backup or restore job fails.
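The following is a minimal combined sketch under the assumption of two worker nodes labeled kubernetes.io/hostname=node1 and node2. DataMovers are restricted to these two nodes by loadAffinity; node1 can run up to 3 DataMovers through perNodeConfig, and node2 falls back to the globalConfig value of 2.

spec:
  datamoverConfiguration:
    datamoverPodConfig:
      loadAffinity:
      - nodeSelector:
          matchLabels:
            kubernetes.io/hostname: node1
      - nodeSelector:
          matchLabels:
            kubernetes.io/hostname: node2
      loadConcurrency:
        globalConfig: 2
        perNodeConfig:
        - nodeSelector:
            matchLabels:
              kubernetes.io/hostname: node1
          number: 3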
spec.datamoverConfiguration.datamoverPodConfig.podResources
Controls the amount of CPU and memory that is made available to the DataMovers. It has the following sub-fields; default values are set for any missing fields.

Example:


podResources:
  cpuRequest: 2
  memoryRequest: 4Gi
  ephemeralStorageRequest: 5Gi
  cpuLimit: 4
  memoryLimit: 16Gi
  ephemeralStorageLimit: 5Gi

For more information about the default Velero resource profile for large clusters, see CPU and memory requirements in Red Hat documentation.

Resource limits must be sufficient to back up the most resource-intensive volumes in the cluster.

spec.datamoverConfiguration.datamoverPodConfig.backupPVC
This section defines the format for the PersistentVolumeClaim that is used in the backup process. See Velero documentation.

During backup, for each execution of a volume group in a Recipe (the default Recipe includes all PersistentVolumeClaims), a VolumeSnapshot is created from a PersistentVolumeClaim, representing the volume state at the time of the VolumeSnapshot timestamp. Some storage systems, such as CephFS from Data Foundation and IBM Storage Scale, implement shallow-copy backup volume exposure without a file copy to address performance concerns compared to regular volume restores.

By default, the agent configures CephFS, IBM Storage Scale, and NFS PersistentVolumeClaims to be created in ReadOnlyMany mode during backups.

Setting values in backupPVC allows you to configure PersistentVolumeClaims from any StorageClass to use their shallow-copy implementation. This way, IBM Fusion does not need the storage-specific details, and you can use alternative StorageClasses to access different storage features.

Example:

backupPVC:
  storage-class-1:
    storageClass: backupPVC-storage-class
    readOnly: true
  storage-class-2:
    storageClass: backupPVC-storage-class
  storage-class-3:
    readOnly: true

In this example, a data transfer of PersistentVolumeClaims of StorageClass storage-class-1 makes use of an alternative StorageClass backupPVC-storage-class to create the volume with accessModes: ReadOnlyMany. The storage-class-2 entry is the same as storage-class-1 except that the accessModes remain unchanged.

For storage-class-3, no alternative StorageClass is used, and the volume is created with accessModes: ReadOnlyMany. All DataMovers mount the backup volume as ReadOnly, and this affects the behavior of the storage on volume creation. The original application PersistentVolumeClaim is not modified; only the PersistentVolumeClaim used for the backup is modified.

If the StorageClass belongs to the CephFS or IBM Storage Scale storage systems, this setting overwrites the agent's internal behavior with the configured one.

Not all storage systems support PersistentVolumeClaims with accessModes: ReadOnlyMany; see the documentation for your storage. OpenShift does not provide a way to check which accessModes are supported before PersistentVolumeClaim creation.

To use PVC storage as a local cache for kopia datamovers:
  1. Update ConfigMap node-agent-config.
    kind: ConfigMap
    apiVersion: v1
    metadata:
    data:
      node-agent-config: '{"loadConcurrency": {"globalConfig": 6}, "podResources": {"cpuRequest": "500m", "cpuLimit": "4", "memoryRequest": "500Mi", "memoryLimit": "4Gi", "ephemeralStorageRequest": "5Gi", "ephemeralStorageLimit": "5Gi"}, "backupPVC": {"ibm-spectrum-fusion-mgmt-sc": {"storageClass": "ibm-spectrum-fusion-mgmt-sc", "readOnly": true, "spcNoRelabeling": true}, "ocs-storagecluster-cephfs": {"storageClass": "ocs-storagecluster-cephfs", "readOnly": true, "spcNoRelabeling": true}}}'
  2. Add the storage field, as shown in the following example:
    kind: ConfigMap
    apiVersion: v1
    metadata:
    data:
      node-agent-config: '{"storage":{"storageClassName":"<storage class to use for PVC>","size":"<size of PVC>"},"loadConcurrency":{"globalConfig":6},"podResources":{"cpuRequest":"500m","cpuLimit":"4","memoryRequest":"500Mi","memoryLimit":"4Gi","ephemeralStorageRequest":"5Gi","ephemeralStorageLimit":"5Gi"},"backupPVC":{"ibm-spectrum-fusion-mgmt-sc":{"storageClass":"ibm-spectrum-fusion-mgmt-sc","readOnly":true,"spcNoRelabeling":true},"ocs-storagecluster-cephfs":{"storageClass":"ocs-storagecluster-cephfs","readOnly":true,"spcNoRelabeling":true}}}'
    Where the storageClassName is the StorageClass to allocate the storage from. Example: ibm-spectrum-fusion-mgmt-sc. The size is the Kubernetes resource size field. Examples: "500Mi" "5Gi" "10Gi" "20Gi" "50Gi".
  3. Save the ConfigMap.
  4. Restart the pods of the daemonset node-agent:
    oc delete pod -n ibm-backup-restore -l name=node-agent
    Replace the namespace ibm-backup-restore with your Fusion Backup & Restore install namespace.
spec.datamoverConfiguration.datamoverPodConfig.maintenanceConfig
Kopia requires periodic maintenance jobs on the backup repository. See Velero documentation.

For each namespace, a Kopia repository is created and maintained. If a cloud object storage repository is used, do not enable automatic object expiration because it causes repository corruption. The maintenance tasks enhance efficiency and reduce resource usage in future backup and restore jobs, and they typically last less than a minute. Their resource usage is similar to backup expiration tasks. These maintenance jobs run by default every 4 hours. The maintenance frequency can be adjusted by editing the OADP DataProtectionAgent under the field spec.configuration.velero.args.default-repo-maintain-frequency, as shown in the sketch after this paragraph. These are the "Quick Maintenance Tasks" described by Kopia. See https://kopia.io/docs/advanced/maintenance/
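A minimal sketch of adjusting the frequency, assuming the field path given above and a Go-style duration value such as 8h:

spec:
  configuration:
    velero:
      args:
        default-repo-maintain-frequency: 8h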

The "Full Maintenance Tasks" are run during backup expiration. The only available configuration item through the DataProtectionAgent object is the pod resources used during maintenance jobs.

spec.datamoverConfiguration.datamoverPodConfig.maintenanceConfig.resourceAllocations
This field uses the Kubernetes container resource requests and limits format. See https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
The following example also shows the default values. Missing values, such as requests.cpu, are set to unlimited.

datamoverConfiguration:
  datamoverPodConfig:
    podResources:
      cpuLimit: '4'
      cpuRequest: '2'
      memoryLimit: 16Gi
      memoryRequest: 4Gi
      ephemeralStorageRequest: 5Gi
      ephemeralStorageLimit: 5Gi
The default ephemeral storage value is 5Gi.
spec.datamoverConfiguration.maintenanceConfig.jobSettings.ttl
This adds a Time To Live (TTL) to Velero maintenance jobs. Velero defaults to keeping a record of 3 previously completed maintenance jobs; IBM Fusion changes this default value to 1. When you add this field, completed Velero maintenance jobs are removed after the specified number of seconds.

  datamoverConfiguration:
    maintenanceConfig:
      jobSettings:
        ttl: 240
      podResources:
        cpuLimit: '4'
        cpuRequest: '2'
        memoryLimit: '4Gi'
        memoryRequest: '500Mi'
        ephemeralStorageLimit: '5Gi'
        ephemeralStorageRequest: '5Gi'
If you do not set this field and you have spec.datamoverConfiguration.maintenanceConfig.storage, then it is automatically set to 300. For more information, see https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/.
spec.datamoverConfiguration.maintenanceConfig.storage
This section enables you to use a PersistentVolumeClaim to act as a cache during Velero maintenance jobs. Most usage of ephemeral-storage during maintenance jobs originates from the kopia datamover cache. The addition of this field with a sufficiently sized PersistentVolumeClaim to handle the required cached data eliminates the aforementioned ephemeral-storage usage.

    maintenanceConfig:
      jobSettings:
        ttl: 240
      storage:
        storageClassName: "ocs-external-storagecluster-ceph-rbd"
        size: 25Gi
      podResources:
        cpuLimit: '4'
        cpuRequest: '2'
        memoryLimit: '4Gi'
        memoryRequest: '500Mi'
        ephemeralStorageLimit: '5Gi'
        ephemeralStorageRequest: '5Gi'
If storage is configured, then you need the following fields:
  • storageClassName

    The name of the StorageClass to be used with the PersistentVolumeClaim. The StorageClass must support accessMode: ReadWriteOnce.

  • size

    The size of the volume to be created. This field uses the existing Kubernetes resources request format such as "5Gi" and "250Mi".

    A PVC of this size is created for use by each data mover pod.

During maintenance jobs, a PersistentVolumeClaim gets created. It is deleted when the job is removed based on the spec.datamoverConfiguration.maintenanceConfig.jobSettings.ttl field. If spec.datamoverConfiguration.maintenanceConfig.jobSettings.ttl is not specified and the storage configuration exists, then the ttl gets set to 300 (five minutes).

Advanced: Per Namespace Configuration
The maintenance configuration values mentioned previously are applied globally. These fields can also be applied manually by editing the appropriate maintenance-config ConfigMap, which allows overrides by namespace in case of resource usage differences.

The format is the same as the DataProtectionAgent spec.datamoverConfiguration.maintenanceConfig field, expressed in JSON instead of YAML.

The configuration is applied using a JSON format where the key is the repository. If a value is set for a repository, the value overrides the global values. For instance, if the global values are as described in the following example:

kind: ConfigMap
apiVersion: v1
metadata:
  name: maintenance-config
  namespace: ibm-backup-restore
data:
  global: '{"jobSettings":{"ttl":300},"podResources":{"cpuRequest":"500m","cpuLimit":"4","memoryRequest":"500Mi","memoryLimit":"4Gi","ephemeralStorageLimit":"5Gi","ephemeralStorageRequest":"5Gi"}, "storage":{"storageClassName":"ocs-external-storagecluster-ceph-rbd","size":"20Gi"}}'
Additional keys are in the <namespacename>-<bsl name>-kopia format:

kind: ConfigMap
apiVersion: v1
metadata:
  name: maintenance-config
  namespace: ibm-backup-restore
data:
  global: '{"jobSettings":{"ttl":300},"podResources":{"cpuRequest":"500m","cpuLimit":"4","memoryRequest":"500Mi","memoryLimit":"4Gi","ephemeralStorageLimit":"5Gi","ephemeralStorageRequest":"5Gi"}}'
  zen-s3-kopia: '{"jobSettings":{"ttl":300},"podResources":{"cpuRequest":"500m","cpuLimit":"4","memoryRequest":"500Mi","memoryLimit":"4Gi"}, "storage":{"storageClassName":"ocs-external-storagecluster-ceph-rbd","size":"25Gi"}}'
This ensures that the maintenance jobs related to data from the zen namespace connected to the BSL named s3 use a 25Gi PersistentVolumeClaim cache and do not set resource requests or limits on ephemeral storage.
node-agent-config
To configure the data mover pods to use PVCs as cache in place of ephemeral storage, add the following field to the node-agent-config config map:
"storage":{"storageClassName":"ocs-external-storagecluster-ceph-rbd","size":"25Gi"}
Example value of the node-agent-config:
{"loadConcurrency": {"globalConfig": 5}, "podResources": {"cpuRequest": "2", "cpuLimit": "4", "memoryRequest": "4Gi", "memoryLimit": "16Gi", "ephemeralStorageRequest": "5Gi", "ephemeralStorageLimit": "5Gi"}, "storage":{"storageClassName":"ocs-external-storagecluster-ceph-rbd","size":"25Gi"}, "backupPVC": {"ocs-storagecluster-cephfs": {"storageClass": "ocs-storagecluster-cephfs", "readOnly": true, "spcNoRelabeling": true}}}

Latest permissible start time for the backup process

You can set the latest permissible start time for a scheduled backup. This setting is useful whenever multiple backups use the same policy or schedule and the agent gets overloaded frequently with jobs as they all start at the same time. Without the windowEndTime setting, jobs have, by default, up to one hour for the agent to start processing before they are considered hung and subsequently cancelled. You can manually spawn additional agent replicas to handle the spike in job activity and process the backlog of jobs sooner, but these replicas remain idle for the remainder of the day. With the windowEndTime option, the job can remain queued up for processing by the agent for a custom period of time before getting cancelled. This allows for a single agent replica to be fully utilized for a longer period of time throughout the day without the need to manually create multiple policies or schedules.

Set the windowEndTime field in the BackupPolicy CR in the 24 hour "HH:MM" format.

Example:

spec:
  backupStorageLocation:  kn-aws
  provider: isf-backup-restore
  retention:
    number: 5
    unit: days
  schedule:
    cron: '45 14 * * *'
    timezone: America/Los_Angeles
    windowEndTime: '17:45'
In this example, the schedule is set to run daily at 14:45 (2:45 PM). By specifying a windowEndTime of 17:45, you ensure that the backup has until 5:45 PM to start. If the backup does not start by 5:45 PM because of the load on the agent, it gets cancelled. If the windowEndTime setting does not exist and the agent does not pick up the backup, the backup gets cancelled after one hour.
Note: For all new BackupPolicy CRs that have a specific start time (not just a repeat interval), if the windowEndTime is not explicitly specified, a default end time gets set to 4 hours after the scheduled start time. Existing BackupPolicy CRs maintain their original behavior without windowEndTime, resulting in a one-hour limit before they are considered hung and subsequently cancelled. You can edit the existing Policy CRs to add the windowEndTime option.