Disaster recovery alerts
This section lists all supported alerts associated with IBM Storage Fusion Data Foundation in a disaster recovery environment.
Recording rules
-
Record:
ramen_sync_duration_seconds
- Expression
-
sum by (obj_name, obj_namespace, obj_type, job, policyname)(time() - (ramen_last_sync_timestamp_seconds > 0))
- Purpose
-
The time elapsed, in seconds, since the volume group's last sync. Series whose last sync timestamp is 0 are excluded by the > 0 filter in the expression.
-
Record:
ramen_rpo_difference
- Expression
-
ramen_sync_duration_seconds{job="ramen-hub-operator-metrics-service"} / on(policyname, job) group_left() (ramen_policy_schedule_interval_seconds{job="ramen-hub-operator-metrics-service"})
- Purpose
-
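The ratio of the actual sync delay (ramen_sync_duration_seconds) to the expected sync delay (ramen_policy_schedule_interval_seconds) for the volume replication group. Despite the name, the value is a quotient rather than a subtraction: a value of 1 means that one full schedule interval has elapsed since the last sync.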
-
Record:
count_persistentvolumeclaim_total
- Expression
-
count(kube_persistentvolumeclaim_info)
- Purpose
-
Count of all persistent volume claims (PVCs) from the managed cluster.
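As a rough sketch, the three recording rules above could be written together as one Prometheus rule group. The record names and expressions are copied from this section; the group name `ramen-recording-rules` and the standalone rule-file layout are assumptions made for illustration, and the product may package these rules differently.

```yaml
# Hypothetical rule file. The group name is an assumption, not a product value.
groups:
  - name: ramen-recording-rules
    rules:
      # Seconds elapsed since the last sync, per protected object.
      - record: ramen_sync_duration_seconds
        expr: sum by (obj_name, obj_namespace, obj_type, job, policyname)(time() - (ramen_last_sync_timestamp_seconds > 0))
      # Elapsed time divided by the policy's schedule interval.
      - record: ramen_rpo_difference
        expr: ramen_sync_duration_seconds{job="ramen-hub-operator-metrics-service"} / on(policyname, job) group_left() (ramen_policy_schedule_interval_seconds{job="ramen-hub-operator-metrics-service"})
      # Number of PVC info series.
      - record: count_persistentvolumeclaim_total
        expr: count(kube_persistentvolumeclaim_info)
```

Because ramen_rpo_difference is built on ramen_sync_duration_seconds, both rules need to be evaluated in the same group (rules in a group run in order) or the second rule lags by one evaluation.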
Alerts
-
Alert:
VolumeSynchronizationDelay
- Impact
-
Critical
- Purpose
-
Actual sync delay taken by the volume replication group is at least three times the expected sync delay.
- YAML
-
alert: VolumeSynchronizationDelay
expr: ramen_rpo_difference >= 3
for: 5s
labels:
  severity: critical
annotations:
  description: "The syncing of volumes is exceeding three times the scheduled snapshot interval, or the volumes have been recently protected. (DRPC: {{ $labels.obj_name }}, Namespace: {{ $labels.obj_namespace }})"
  alert_type: "DisasterRecovery"
-
Alert:
VolumeSynchronizationDelay
- Impact
-
Warning
- Purpose
-
Actual sync delay taken by the volume replication group is more than two times, but less than three times, the expected sync delay.
- YAML
-
alert: VolumeSynchronizationDelay
expr: ramen_rpo_difference > 2 and ramen_rpo_difference < 3
for: 5s
labels:
  severity: warning
annotations:
  description: "The syncing of volumes is exceeding two times the scheduled snapshot interval, or the volumes have been recently protected. (DRPC: {{ $labels.obj_name }}, Namespace: {{ $labels.obj_namespace }})"
  alert_type: "DisasterRecovery"
-
Alert:
WorkloadUnprotected
- Impact
-
Warning
- Purpose
-
Application protection status is degraded for more than 10 minutes.
- YAML
-
alert: WorkloadUnprotected
expr: ramen_workload_protection_status == 0
for: 10m
labels:
  severity: warning
annotations:
  description: "Workload is not protected for disaster recovery (DRPC: {{ $labels.obj_name }}, Namespace: {{ $labels.obj_namespace }})."
  alert_type: "DisasterRecovery"
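One way to see how the thresholds and `for` durations interact is a `promtool test rules` unit test. The sketch below assumes that the alert rules listed in this section have been wrapped in a rule group and saved as `dr-alerts.yaml`; that file name, the test file name, and the label values `drpc-a` and `ns-a` are hypothetical and chosen only for illustration.

```yaml
# Hypothetical unit-test file; run with: promtool test rules dr-alerts-test.yaml
rule_files:
  - dr-alerts.yaml   # assumed to hold the alert rules from this section in a rule group

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Ratio held at 3, so only the critical rule (>= 3) matches.
      - series: 'ramen_rpo_difference{obj_name="drpc-a", obj_namespace="ns-a"}'
        values: '3x15'
      # Protection status held at 0 for 15 minutes.
      - series: 'ramen_workload_protection_status{obj_name="drpc-a", obj_namespace="ns-a"}'
        values: '0x15'
    alert_rule_test:
      # The 5s "for" window has elapsed by the second evaluation.
      - eval_time: 2m
        alertname: VolumeSynchronizationDelay
        exp_alerts:
          - exp_labels:
              severity: critical
              obj_name: drpc-a
              obj_namespace: ns-a
            exp_annotations:
              description: "The syncing of volumes is exceeding three times the scheduled snapshot interval, or the volumes have been recently protected. (DRPC: drpc-a, Namespace: ns-a)"
              alert_type: "DisasterRecovery"
      # At 5m the alert is still pending, so no firing alert is expected.
      - eval_time: 5m
        alertname: WorkloadUnprotected
      # After the 10m "for" window, the warning fires.
      - eval_time: 11m
        alertname: WorkloadUnprotected
        exp_alerts:
          - exp_labels:
              severity: warning
              obj_name: drpc-a
              obj_namespace: ns-a
            exp_annotations:
              description: "Workload is not protected for disaster recovery (DRPC: drpc-a, Namespace: ns-a)."
              alert_type: "DisasterRecovery"
```

Feeding ramen_rpo_difference directly, rather than the raw timestamp and interval metrics, keeps the test independent of the recording rules and of the `time()` function.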