Backup & restore resource self-monitoring
The load on the Backup & restore service correlates directly with the number of Backup & restore jobs running concurrently. The service can recover from periodic spikes in activity, but a sustained elevated level of activity can cause any one Backup & restore microservice to fall behind in job processing, which can result in job failures.
The Backup & restore hub microservices, such as `backup-service`, `job-manager`, `application-service`, `backup-policy`, and `backup-location`, now include built-in monitoring of memory and CPU consumption. This monitoring helps identify potential resource shortages by raising a Warning alert through IBM Fusion events. The alert notifies the user that the microservice is under a sustained load that its current replicas cannot handle, and recommends scaling out the microservice with another replica so that Backup & restore jobs continue to succeed.
Resource usage monitoring and thresholds
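When a pod exceeds the memory or CPU threshold for the monitoring duration, a Warning event similar to the following example is generated:
Pod backup-service-xxx in namespace ibm-backup-restore exceeded 85% of memory/CPU limit for 5 minutes. Consider scaling the deployment with an additional replica.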
These alerts are marked as fixed after the memory or CPU consumption falls below the threshold.
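The alert lifecycle described above can be sketched as a small state check. This is a minimal illustration and not the product's implementation: the function names `usage_percent` and `alert_state` are hypothetical, and the fixed sampling interval is an assumption, since the documentation does not say how often usage is sampled.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; the real monitoring logic is internal to the microservices.

# usage_percent USED LIMIT
# Prints the integer percentage of the resource limit currently in use.
usage_percent() {
  local used=$1 limit=$2
  echo $(( used * 100 / limit ))
}

# alert_state THRESHOLD DURATION INTERVAL SAMPLE...
# THRESHOLD is a percentage, DURATION and INTERVAL are in seconds (INTERVAL is
# an assumed sampling period), and each SAMPLE is an integer usage percentage.
# Prints "alert" if the series ends in a sustained breach, otherwise "ok".
alert_state() {
  local threshold=$1 duration=$2 interval=$3
  shift 3
  local streak=0 state="ok" sample
  for sample in "$@"; do
    if (( sample > threshold )); then
      # Usage is above the threshold: extend the continuous breach window.
      streak=$(( streak + interval ))
      if (( streak >= duration )); then
        state="alert"
      fi
    else
      # Usage fell back to or below the threshold: the alert is marked as fixed.
      streak=0
      state="ok"
    fi
  done
  echo "$state"
}
```

With the default values (85% threshold, 300 seconds) and an assumed 60-second sampling interval, `alert_state 85 300 60 90 91 92 93 94` reports an alert, while a single sample at or below 85 resets the window.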
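To scale out a microservice, check the current replica count of its deployment, and then set a higher count:
oc get deployment -n <backup-restore-namespace> <deployment-name>
oc scale deployment -n <backup-restore-namespace> <deployment-name> --replicas=<new-replica-count>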
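The two commands can be combined into a helper that adds one replica to whatever is currently configured. This is a sketch under stated assumptions: `next_replica_count` and `scale_out` are hypothetical names, and the helper assumes you are logged in to the cluster with `oc` and have permission to scale the deployment.

```shell
#!/usr/bin/env bash
# Illustrative helper; the function names are not part of the product.

# next_replica_count CURRENT
# Prints CURRENT + 1, rejecting anything that is not a non-negative integer.
next_replica_count() {
  local current=$1
  if ! [[ $current =~ ^[0-9]+$ ]]; then
    echo "replica count must be a non-negative integer: '$current'" >&2
    return 1
  fi
  # 10# forces base 10 so that a value such as "08" is not read as octal.
  echo $(( 10#$current + 1 ))
}

# scale_out NAMESPACE DEPLOYMENT
# Reads the configured replica count of the deployment and adds one replica.
scale_out() {
  local namespace=$1 deployment=$2 current target
  current=$(oc get deployment -n "$namespace" "$deployment" \
    -o jsonpath='{.spec.replicas}') || return 1
  target=$(next_replica_count "$current") || return 1
  oc scale deployment -n "$namespace" "$deployment" --replicas="$target"
}
```

For example, `scale_out ibm-backup-restore backup-service` would add one replica, assuming the namespace from the example alert and a deployment that is named after the microservice. Confirm both names in your environment before you run it.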
Configurable parameters
You can adjust the thresholds and the monitoring duration in guardian-configmap. Here are the parameters available for configuration:
- memoryThresholdPercentage: Sets the memory usage threshold as a percentage. Default is "85" for 85%.
- cpuThresholdPercentage: Sets the CPU usage threshold as a percentage. Default is "85" for 85%.
- monitorDurationInSeconds: Defines the duration (in seconds) that usage must exceed the threshold to trigger an alert. Default is "300" for 300 seconds (5 minutes).
Example with the default values:
memoryThresholdPercentage: '85'
cpuThresholdPercentage: '85'
monitorDurationInSeconds: '300'
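One way to change a single parameter from the command line is a JSON merge patch with `oc patch`. The sketch below makes several assumptions: the parameters are top-level keys under `data` in guardian-configmap, percentages are limited to 1 through 100, and the duration must be at least 1 second. The names `build_patch` and `set_monitor_param` are hypothetical. This documentation does not state whether running pods pick up a change without a restart.

```shell
#!/usr/bin/env bash
# Illustrative helper; validation ranges are assumptions, not documented limits.

# build_patch KEY VALUE
# Prints a JSON merge patch that sets one key under the ConfigMap's data field.
build_patch() {
  local key=$1 value=$2
  case $key in
    memoryThresholdPercentage|cpuThresholdPercentage)
      if ! [[ $value =~ ^[0-9]+$ ]] || (( 10#$value < 1 || 10#$value > 100 )); then
        echo "percentage must be an integer from 1 to 100: '$value'" >&2
        return 1
      fi
      ;;
    monitorDurationInSeconds)
      if ! [[ $value =~ ^[0-9]+$ ]] || (( 10#$value < 1 )); then
        echo "duration must be a positive integer: '$value'" >&2
        return 1
      fi
      ;;
    *)
      echo "unknown parameter: '$key'" >&2
      return 1
      ;;
  esac
  # ConfigMap values are strings, so the number is quoted in the patch.
  printf '{"data":{"%s":"%s"}}\n' "$key" "$value"
}

# set_monitor_param NAMESPACE KEY VALUE
# Applies the patch to guardian-configmap in the given namespace.
set_monitor_param() {
  local namespace=$1 key=$2 value=$3 patch
  patch=$(build_patch "$key" "$value") || return 1
  oc patch configmap guardian-configmap -n "$namespace" --type merge -p "$patch"
}
```

For example, `set_monitor_param <backup-restore-namespace> memoryThresholdPercentage 90` would raise the memory threshold to 90%. Running `build_patch` on its own prints the patch so that you can review it before you apply it.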