Backup & restore resource self-monitoring

The load on the Backup & restore service increases in direct proportion to the number of concurrent Backup & restore jobs in progress. The service can recover from periodic spikes in activity, but a sustained elevated level of activity may cause a Backup & restore microservice to fall behind in job processing, which can result in job failures.

The Backup & restore hub microservices, such as `backup-service`, `job-manager`, `application-service`, `backup-policy`, and `backup-location`, now include built-in resource monitoring for Memory and CPU consumption. This monitoring helps identify potential resource shortages by raising a Warning alert through IBM Fusion events. The alert indicates that the microservice cannot keep up with the sustained level of activity, and it recommends scaling out the microservice with another replica so that Backup & restore jobs continue to succeed.

Resource usage monitoring and thresholds

When Memory or CPU usage exceeds a defined threshold (default is 85%) for a set duration (default is 5 minutes), an IBM Fusion event with Warning severity is triggered to report the increased resource usage. For instance, if the Memory or CPU usage for the `backup-service` pod crosses the threshold, the following event is generated:
Pod backup-service-xxx in namespace ibm-backup-restore exceeded 85% of memory/CPU limit for 5 minutes. Consider scaling the deployment with an additional replica. 

These alerts are marked as fixed after the Memory or CPU consumption falls back below the threshold.
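
The threshold and duration rule can be sketched as a small shell function. This is an illustration only, not the microservices' actual implementation: the function name `sustained_breach`, the fixed sampling interval, and the choice to reset the timer whenever usage drops to or below the threshold are all assumptions.

```shell
#!/bin/sh
# Illustrative sketch (hypothetical function, not part of IBM Fusion).
# Arguments: threshold percent, duration in seconds, sample interval in seconds,
# then one integer usage percentage per sample, oldest first.
sustained_breach() {
  threshold=$1; duration=$2; interval=$3
  shift 3
  elapsed=0
  for usage in "$@"; do
    if [ "$usage" -gt "$threshold" ]; then
      # Usage is above the threshold: extend the breach window.
      elapsed=$((elapsed + interval))
      if [ "$elapsed" -ge "$duration" ]; then
        echo "warning"
        return 0
      fi
    else
      # Assumption: any sample at or below the threshold resets the timer.
      elapsed=0
    fi
  done
  echo "ok"
}

# Five consecutive 60-second samples above 85% cover the 300-second default.
sustained_breach 85 300 60 90 91 88 95 92
# A dip to 70% resets the timer, so the remaining samples are not enough.
sustained_breach 85 300 60 90 91 70 95 92
```

The first call prints `warning` and the second prints `ok`, which mirrors the difference between a sustained load and a short spike.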

To scale out with another replica, first run the following command and read the existing replica count for the deployment from the AVAILABLE column of its output:
oc get deployment -n <backup-restore-namespace> <deployment-name>
Then run the following command, setting `<new-replica-count>` to the existing replica count plus 1:
oc scale deployment -n <backup-restore-namespace> <deployment-name> --replicas=<new-replica-count>
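
The two commands above can be combined into one helper, sketched below. The function names `next_replica_count` and `scale_out` are hypothetical. The sketch also reads the desired count from the deployment's `.spec.replicas` field instead of parsing the AVAILABLE column; the two values match when every replica is healthy.

```shell
#!/bin/sh
# Hypothetical helpers; only the oc get and oc scale commands come from the steps above.

# Prints the replica count to request: the current count plus one.
next_replica_count() {
  echo $(( $1 + 1 ))
}

# Scales a deployment out by one replica.
# Usage: scale_out <backup-restore-namespace> <deployment-name>
scale_out() {
  ns=$1; deploy=$2
  # Read the desired replica count from the deployment spec.
  current=$(oc get deployment -n "$ns" "$deploy" -o jsonpath='{.spec.replicas}') || return 1
  # Stop if the lookup did not return a whole number, to avoid scaling to a wrong count.
  case $current in
    ''|*[!0-9]*) echo "could not read replica count for $deploy" >&2; return 1 ;;
  esac
  oc scale deployment -n "$ns" "$deploy" --replicas="$(next_replica_count "$current")"
}
```

For example, `scale_out ibm-backup-restore backup-service` would raise a deployment that currently has 1 replica to 2, assuming that namespace and deployment name match your installation.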

Configurable parameters

You can customize the threshold levels and the monitoring duration by adding or modifying parameters in the `guardian-configmap` ConfigMap. The following parameters are available:
memoryThresholdPercentage
Sets the Memory usage threshold as a percentage of the pod's memory limit. Default is '85' (85%).
cpuThresholdPercentage
Sets the CPU usage threshold as a percentage of the pod's CPU limit. Default is '85' (85%).
monitorDurationInSeconds
Defines how long, in seconds, usage must stay above the threshold before an alert is triggered. Default is '300' (5 minutes).
Example configuration:

memoryThresholdPercentage: '85'
cpuThresholdPercentage: '85'
monitorDurationInSeconds: '300'
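
One way to change a single parameter without opening an editor is a JSON merge patch, sketched below. The function names `build_patch` and `set_monitor_param` are hypothetical, and the sketch assumes the parameters are stored as string keys directly under the `data` field of `guardian-configmap`; check the ConfigMap in your cluster before relying on it.

```shell
#!/bin/sh
# Hypothetical helpers for editing guardian-configmap.

# Builds a JSON merge patch that sets one key under the ConfigMap's data field.
# Assumption: the monitoring parameters are top-level string keys under data.
build_patch() {
  printf '{"data":{"%s":"%s"}}' "$1" "$2"
}

# Applies one parameter change to guardian-configmap in the given namespace.
# Usage: set_monitor_param <backup-restore-namespace> <parameter> <value>
set_monitor_param() {
  oc patch configmap guardian-configmap -n "$1" --type merge -p "$(build_patch "$2" "$3")"
}
```

For example, `set_monitor_param ibm-backup-restore monitorDurationInSeconds 600` would lengthen the monitoring window to 10 minutes, assuming the namespace name matches your installation.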