Disaster recovery alerts

This section lists all supported alerts associated with IBM Storage Fusion Data Foundation in a disaster recovery environment.

Recording rules

  • Record: ramen_sync_duration_seconds

    Expression
    sum by (obj_name, obj_namespace, obj_type, job, policyname)(time() - (ramen_last_sync_timestamp_seconds > 0))
    Purpose

    The time elapsed, in seconds, between the volume group’s last sync and the current time. Series whose last sync timestamp is 0 are excluded by the `> 0` filter.

  • Record: ramen_rpo_difference

    Expression
    ramen_sync_duration_seconds{job="ramen-hub-operator-metrics-service"} / on(policyname, job) group_left() (ramen_policy_schedule_interval_seconds{job="ramen-hub-operator-metrics-service"})
    Purpose

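    The ratio of the actual sync delay taken by the volume replication group to the expected sync delay (the schedule interval of its policy). A value of 1 means that the last sync is exactly one scheduled interval old.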

  • Record: count_persistentvolumeclaim_total

    Expression
    count(kube_persistentvolumeclaim_info)
    Purpose

    Count of all PVCs in the managed cluster.
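
The arithmetic behind the first two recording rules can be sketched in Python as follows. This is an illustration only, not how Prometheus evaluates the rules: it handles a single series and ignores the `sum by (...)` aggregation and label matching. The function names and the sample timestamps are hypothetical.

```python
from typing import Optional


def sync_duration_seconds(now: float, last_sync_timestamp: float) -> Optional[float]:
    """Mirror `time() - (ramen_last_sync_timestamp_seconds > 0)` for one series."""
    if last_sync_timestamp <= 0:
        # The `> 0` filter drops series that have no recorded sync yet.
        return None
    return now - last_sync_timestamp


def rpo_ratio(sync_duration: float, schedule_interval: float) -> float:
    """Mirror ramen_rpo_difference: actual delay divided by the scheduled interval."""
    return sync_duration / schedule_interval


# Hypothetical sample values: last sync 900 s ago, policy interval of 300 s.
now = 1_700_000_900.0
last_sync = 1_700_000_000.0
interval = 300.0

duration = sync_duration_seconds(now, last_sync)
ratio = rpo_ratio(duration, interval)
print(duration, ratio)  # 900.0 3.0
```

With these sample values the last sync is three scheduled intervals old, so the recorded ratio is 3.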

Alerts

  • Alert: VolumeSynchronizationDelay

    Impact

    Critical

    Purpose

    Actual sync delay taken by the volume replication group is at least three times the expected sync delay.

    YAML
    alert: VolumeSynchronizationDelay
    expr: ramen_rpo_difference >= 3
    for: 5s
    labels:
      severity: critical
    annotations:
      description: "The syncing of volumes is exceeding three times the scheduled snapshot interval, or the volumes have been recently protected. (DRPC: {{ $labels.obj_name }}, Namespace: {{ $labels.obj_namespace }})"
      alert_type: "DisasterRecovery"
  • Alert: VolumeSynchronizationDelay

    Impact

    Warning

    Purpose

    Actual sync delay taken by the volume replication group is more than twice, but less than three times, the expected sync delay.

    YAML
    alert: VolumeSynchronizationDelay
    expr: ramen_rpo_difference > 2 and ramen_rpo_difference < 3
    for: 5s
    labels:
      severity: warning
    annotations:
      description: "The syncing of volumes is exceeding two times the scheduled snapshot interval, or the volumes have been recently protected. (DRPC: {{ $labels.obj_name }}, Namespace: {{ $labels.obj_namespace }})"
      alert_type: "DisasterRecovery"
  • Alert: WorkloadUnprotected

    Impact

    Warning

    Purpose

    Application protection status has been degraded for at least 10 minutes.

    YAML
    alert: WorkloadUnprotected
    expr: ramen_workload_protection_status == 0
    for: 10m
    labels:
      severity: warning
    annotations:
      description: "Workload is not protected for disaster recovery (DRPC: {{ $labels.obj_name }}, Namespace: {{ $labels.obj_namespace }})."
      alert_type: "DisasterRecovery"
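
The threshold boundaries of the two VolumeSynchronizationDelay alerts can be illustrated with a small Python sketch. It only mirrors the `expr` comparisons shown above and does not model the `for: 5s` hold period. The function name is hypothetical.

```python
from typing import Optional


def volume_sync_delay_severity(rpo_ratio: float) -> Optional[str]:
    """Map a ramen_rpo_difference value to the severity whose expr it satisfies."""
    if rpo_ratio >= 3:
        return "critical"  # expr: ramen_rpo_difference >= 3
    if 2 < rpo_ratio < 3:
        return "warning"   # expr: ramen_rpo_difference > 2 and ramen_rpo_difference < 3
    return None            # neither expression matches


for value in (1.5, 2.0, 2.5, 3.0):
    print(value, volume_sync_delay_severity(value))
```

Note that a ratio of exactly 2 matches neither expression, while a ratio of exactly 3 matches only the critical alert.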