SLO Configuration Examples

Example 1: Application SLO with latency blueprint

Objective: Ensure 90% of the calls of the “Robot-shop” application have an average latency better than 100 ms over a rolling period of 1 week.

The configuration of the SLO would be:

  • Entity: Robot-shop application

    • Scope:
    • Boundary: All services
    • Include internal calls: false
    • Include synthetic calls: false
    • Service: All services
    • Endpoint: All endpoints
  • Indicator:

    • Blueprint: Latency
    • Type: Time
    • Aggregation: mean
    • Threshold: 100 ms
  • Objective:

    • SLO Target: 90%
    • Time window type: Rolling
    • Time window length: 1 week

Scenario: Assuming the SLO had 400 bad minutes over the week-long SLO time window (400 minutes had mean latency > 100 ms) starting from 2025-03-04:

The error budget for this SLO would be calculated as:

  • Minutes in time period x (1 - SLO target percentage)
    • Total minutes in time window: 24 × 60 × 7 minutes in 1 week = 10080 minutes
    • SLO target percentage: 90% (0.9)
    • Error budget: 10080 x (1 - 0.9) = 1008 minutes

The SLO status would be calculated as:

  • SLO status = 100% x (total minutes in time window - bad minutes in time window) / Total minutes in time window
    • 100% x (10080 total minutes - 400 bad minutes) / 10080 total minutes = 96.03%
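
The time-based arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not the product's implementation: the function name `time_based_slo` is hypothetical, and rounding the error budget to whole minutes is an assumption inferred from the rounded figures used in these examples.

```python
def time_based_slo(total_minutes: int, bad_minutes: int, target: float) -> tuple[int, float]:
    """Return (error budget in minutes, SLO status in percent)."""
    # Error budget = minutes in time window x (1 - SLO target).
    # Rounding to whole minutes is an assumption based on the examples.
    error_budget = round(total_minutes * (1 - target))
    # SLO status = share of minutes in the window that were not bad.
    status = 100 * (total_minutes - bad_minutes) / total_minutes
    return error_budget, status


week = 24 * 60 * 7  # 10080 minutes
budget, status = time_based_slo(week, bad_minutes=400, target=0.90)
print(budget, f"{status:.2f}%")
```

With the scenario values (400 bad minutes, 90% target), the sketch reproduces the 1008-minute error budget and the 96.03% status.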

Example 2: Website SLO with event-based availability blueprint

Objective: Ensure that HTTP requests to the shopping cart page (cart.html) of the Demo website achieve 95% availability over a fixed period of 4 days beginning on 2025-03-01.

The configuration of the SLO would be:

  • Entity: Demo website
    • Beacon: HTTP requests
    • Custom filter: Location > Page name = Cart
  • Indicator:
    • Blueprint: Availability
    • Type: Event count (Total count of good calls vs bad calls)
  • Objective:
    • SLO target: 95%
    • Time window type: Fixed
    • Time window length: 4 days
    • Start: 2025-03-01 0:00

Scenario: Assuming there are 234 successful HTTP requests and 11 failed HTTP requests during the SLO time window starting from 2025-03-01:

The error budget for this SLO would be calculated as:

  • Event:
    • Good events count: 234 beacons
    • Bad events count: 11 beacons
    • Total events count: 234 + 11 = 245 beacons
    • SLO Target Percentage: 95% (0.95)
    • Error Budget: 245 x (1 - 0.95) = 12 beacons
    • Error Budget remaining: 12 - 11 = 1 beacon
    • Error Budget remaining percentage: 100% * (12 - 11) / 12 = 8%

The SLO Status would be calculated as:

  • SLO status = 100% x Total good events count in time window / Total events count in time window
    • 100% x 234 beacons / 245 beacons = 95.5%
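
The event-based calculation above can be sketched as follows, using 234 good and 11 bad events. The function name `event_based_slo` and the returned dictionary keys are hypothetical, and rounding the error budget to a whole number of events is an assumption taken from the rounded value (12) shown in the example.

```python
def event_based_slo(good: int, bad: int, target: float) -> dict:
    """Compute error budget and SLO status from good/bad event counts."""
    total = good + bad
    # Error budget = total events x (1 - SLO target); rounding is an assumption.
    budget = round(total * (1 - target))
    remaining = budget - bad
    return {
        "total": total,
        "error_budget": budget,
        "remaining": remaining,
        # Guard against a zero budget (for example, a 100% target).
        "remaining_pct": 100 * remaining / budget if budget else 0.0,
        "status_pct": 100 * good / total,
    }


result = event_based_slo(good=234, bad=11, target=0.95)
print(result["error_budget"], result["remaining"], f"{result['status_pct']:.1f}%")
```

The same function applies to any event-count indicator, for instance application calls instead of website beacons.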

Example 3: Synthetic monitoring with traffic blueprint

Objective: Ensure that 3 synthetic monitoring tests (shopping cart, home page, and product list page) are run 15 times every minute, 99% of the time, over a fixed period of 1 week beginning on 2025-03-18.

The configuration of the SLO would be:

  • Entities:

    • shopping cart test
    • home page test
    • product list page test
  • Indicator:

    • Blueprint: Traffic
    • Threshold: > 15 results per minute
  • Objective:

    • SLO Target: 99%
    • Time window type: Fixed
    • Time window length: 1 week
    • Start: 2025-03-18 0:00

Scenario: Assuming the SLO had 21 minutes in which the synthetic tests ran fewer than 15 times during the SLO time window starting from 2025-03-18:

The error budget for this SLO would be calculated as:

  • Minutes in time period x (1 - SLO Target Percentage)
    • Total Minutes: 24 × 60 × 7 minutes in 1 week = 10080 minutes
    • SLO Target Percentage: 99% (0.99)
    • Error Budget: 10080 x (1 - 0.99) = 101 minutes
    • Error budget remaining: 101 - 21 = 80 minutes
    • Error budget remaining percentage: 100% * (101 - 21) / 101 = 79.2%

The SLO Status would be calculated as:

  • SLO Status = 100% x (Total minutes in time window - bad minutes in time window) / Total minutes in time window
    • 100% x (10080 total minutes - 21 bad minutes) / 10080 total minutes = 99.8%

Example 4: Application SLO with Custom blueprint

Objective: Ensure 98% of the calls of the “Robot-shop” application do not result in an HTTP status code of 400 over a rolling period of 1 day.

The configuration of the SLO would be:

  • Entity: Robot-shop application

    • Scope:
    • Boundary: All services
    • Include internal calls: false
    • Include synthetic calls: false
    • Service: All services
    • Endpoint: All endpoints
  • Indicator:

    • Blueprint: Custom
    • Type: Event count (Total count of good events vs bad events)
    • Good Filter: HTTP status code != 400
    • Bad Filter: HTTP status code = 400
  • Objective:

    • SLO Target: 98%
    • Time window type: Rolling
    • Time window length: 1 day

Scenario: Assuming the SLO had 25000 good calls and 200 bad calls over the one day long SLO time window starting from 2025-03-10:

The error budget for this SLO would be calculated as:

  • Event:
    • Good events count: 25000 calls
    • Bad events count: 200 calls
    • Total events count: 25000 + 200 = 25200 calls
    • SLO Target Percentage: 98% (0.98)
    • Error Budget: 25200 x (1 - 0.98) = 504 calls
    • Error Budget remaining: 504 - 200 = 304 calls
    • Error Budget remaining percentage: 100% * (504 - 200) / 504 = 60.317%

The SLO Status would be calculated as:

  • SLO status = 100% x Total good events count in time window / Total events count in time window
    • 100% x 25000 calls / 25200 calls = 99.2%

Example 5: Application SLO with event-based latency blueprint

Objective: Ensure 92% of the calls of the “Robot-shop” application have a latency better than 100 ms over a fixed period of 2 weeks.

The configuration of the SLO would be:

  • Entity: Robot-shop application

    • Scope:
    • Boundary: All services
    • Include internal calls: false
    • Include synthetic calls: false
    • Service: All services
    • Endpoint: All endpoints
  • Indicator:

    • Blueprint: Latency
    • Type: Event count (Total count of good calls vs bad calls)
    • Good call: Latency < 100 ms
    • Bad call: Latency > 100 ms
  • Objective:

    • SLO Target: 92%
    • Time window type: Fixed
    • Time window length: 2 weeks
    • Start: 2025-03-10 0:00

Scenario: Assuming the SLO had 50000 good calls and 1000 bad calls over the two week long SLO time window starting from 2025-03-10:

The error budget for this SLO would be calculated as:

  • Event:
    • Good events count: 50000 calls
    • Bad events count: 1000 calls
    • Total events count: 50000 + 1000 = 51000 calls
    • SLO Target Percentage: 92% (0.92)
    • Error Budget: 51000 x (1 - 0.92) = 4080 calls
    • Error Budget remaining: 4080 - 1000 = 3080 calls
    • Error Budget remaining percentage: 100% * (4080 - 1000) / 4080 = 75.49%

The SLO Status would be calculated as:

  • SLO status = 100% x Total good events count in time window / Total events count in time window
    • 100% x 50000 calls / 51000 calls = 98.039%

Example 6: Website SLO with time-based availability blueprint

Objective: Ensure that HTTP requests to the Demo website achieve 92% availability, with an error rate of less than 5%, over a rolling period of 3 days.

The configuration of the SLO would be:

  • Entity: Demo website
    • Beacon: HTTP requests
  • Indicator:
    • Blueprint: Availability
    • Type: Time
    • Error Rate: 5%
  • Objective:
    • SLO target: 92%
    • Time window type: Rolling
    • Time window length: 3 days

Scenario: Assuming the SLO had 200 bad minutes over the 3 day long SLO time window (200 minutes had mean error rate greater than 5%) starting from 2025-03-05:

The error budget for this SLO would be calculated as:

  • Minutes in time period x (1 - SLO target percentage)
    • Total minutes in time window: 24 × 60 × 3 minutes in 3 days = 4320 minutes
    • SLO target percentage: 92% (0.92)
    • Error budget: 4320 x (1 - 0.92) = 346 minutes

The SLO status would be calculated as:

  • SLO status = 100% x (total minutes in time window - bad minutes in time window) / Total minutes in time window
    • 100% x (4320 total minutes - 200 bad minutes) / 4320 total minutes = 95.37%
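
For a time-based availability indicator, each minute is first classified as good or bad before the calculation above is applied. The sketch below shows one way that classification could look. The function name `count_bad_minutes` and the input shape (one `(errors, total)` pair per minute) are hypothetical, and skipping minutes without traffic is an assumption based on the troubleshooting note that an entity without traffic does not affect the error budget.

```python
def count_bad_minutes(per_minute: list[tuple[int, int]], max_error_rate: float = 0.05) -> int:
    """Count minutes whose error rate is greater than max_error_rate.

    per_minute holds one (error_count, total_count) pair per minute.
    """
    bad = 0
    for errors, total in per_minute:
        # Assumption: a minute without traffic is neither good nor bad.
        if total > 0 and errors / total > max_error_rate:
            bad += 1
    return bad


sample = [(0, 100), (6, 100), (5, 100), (0, 0), (10, 50)]
print(count_bad_minutes(sample))
```

In the sample, only the minutes with 6% and 20% error rates count as bad; the minute at exactly 5% does not, because the scenario describes bad minutes as having an error rate greater than 5%.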

Service Levels Smart Alerts Configuration Examples

Example 1: Service Levels Smart Alert to monitor the status of an SLO

Objective: Alert and raise an issue if the status of the Vending Machine Reliability SLO configuration is less than 90%.

The configuration of the Service Levels Smart Alert would be:

Rule:
  Alert Type: Service Levels Objective
  Metric: Status
Threshold:
  Operator: <
  Value: 0.90
SLOs: Vending Machine Reliability
Time Threshold:
  Expiry: 5 Minutes
  Time window: 10 Minutes

Once the smart alert configuration is set up, the system will begin monitoring the status of the Vending Machine Reliability SLO (Service Level Objective) configuration.

Scenario: Monitoring and Event Triggering

  • SLO Drops Below Threshold

    • Assume the Vending Machine Reliability SLO drops to 89%, below the defined threshold of 90%.
    • With the time threshold set to 10 minutes, the system will wait for the entire 10-minute window before taking any action.
    • If the SLO remains below 90% after 10 minutes, the system triggers an event, raising an issue.
  • SLO Returns Above Threshold

    • If, after some time, the SLO status recovers and rises above 90%, the system continues monitoring.
    • While the status stays above 90%, the system waits for the expiry time threshold of 5 minutes.
    • If the SLO status remains above 90% for the full 5 minutes, the event is automatically closed.
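
The open and close behavior described in this scenario can be modeled as a small state machine. The sketch below is an illustration only: the function name `simulate_alert`, the one-sample-per-minute evaluation, and the use of consecutive-minute counters are all assumptions, because the actual evaluation granularity is not stated here.

```python
def simulate_alert(violations: list[bool], time_window: int = 10, expiry: int = 5) -> list[str]:
    """Return the event state ("open" or "closed") after each minute.

    violations[i] is True when the threshold is violated in minute i.
    """
    states = []
    is_open = False
    violating_run = 0  # consecutive minutes in violation
    healthy_run = 0    # consecutive minutes without violation
    for violated in violations:
        if violated:
            violating_run += 1
            healthy_run = 0
        else:
            healthy_run += 1
            violating_run = 0
        if not is_open and violating_run >= time_window:
            is_open = True   # violation held for the full time window
        elif is_open and healthy_run >= expiry:
            is_open = False  # recovery held for the full expiry period
        states.append("open" if is_open else "closed")
    return states


# 12 minutes below the threshold, followed by 6 minutes of recovery.
timeline = [True] * 12 + [False] * 6
print(simulate_alert(timeline))
```

In this model the event opens in the 10th violating minute and closes in the 5th consecutive recovered minute, which mirrors the 10-minute time window and the 5-minute expiry of the configuration.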

Example 2: Service Levels Smart Alert to monitor the Error Budget of an SLO

Objective: Alert and raise an issue if the Error Budget Consumption Percentage of the Vending Machine Reliability SLO configuration is more than 50%.

The configuration of the Service Levels Smart Alert would be:

Rule:
  Alert Type: Error Budget
  Metric: Burned Percentage
Threshold:
  Operator: >
  Value: 0.50
SLOs: Vending Machine Reliability
Time Threshold:
  Expiry: 5 Minutes
  Time window: 10 Minutes

Once the smart alert configuration is set up, the system will begin monitoring the Error Budget Consumption Percentage of the Vending Machine Reliability SLO (Service Level Objective) configuration.

Scenario: Monitoring and Event Triggering

  • Error Budget Consumption Exceeds 50%

    • Assume the error budget consumption exceeds 50%.
    • With the time threshold set to 10 minutes, the system will wait for the full 10-minute window before taking any action.
    • If the error budget consumption remains above 50% after 10 minutes, the system triggers an event, raising an issue.
  • Error Budget Consumption Drops Below 50%

    • If, after some time, the error budget consumption drops back below 50%, the system will continue monitoring.
    • If the error budget consumption stays below 50%, the system will wait for the expiry time threshold of 5 minutes.
    • If the error budget consumption remains below 50% for the full 5 minutes, the event will be automatically closed.

Service Levels Burn Rate Smart Alert Calculation

The burn rate is calculated using the formula:

Burn Rate = (Error Budget Consumed * SLO Time Window) / Alerting Window

  • For example:
    • Assume the error budget consumed over the last 12 hours is 70%, and the SLO time window for the Vending Machine Reliability SLO is 1 day (24 hours).
    • The burn rate for the last 12 hours would be: (0.70 * 24) / 12 = 1.4
    • Similarly, if the error budget consumed over the last 2 hours is 20%, the burn rate for the last 2 hours would be: (0.20 * 24) / 2 = 2.4
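
The burn rate formula above translates directly into code. The function name `burn_rate` and its parameter names are hypothetical; both windows are expressed in hours here, but any common unit works because only their ratio matters.

```python
def burn_rate(budget_consumed: float, slo_window_hours: float, alert_window_hours: float) -> float:
    """Burn Rate = (Error Budget Consumed * SLO Time Window) / Alerting Window."""
    return budget_consumed * slo_window_hours / alert_window_hours


# Values from the example: a 24-hour SLO time window.
print(round(burn_rate(0.70, 24, 12), 2))
print(round(burn_rate(0.20, 24, 2), 2))
```

A burn rate of 1 means the error budget would be used up exactly at the end of the SLO time window if consumption continued at the same pace; values above 1 mean it would run out earlier.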

Example 3: Smart Alert to monitor the burn rate of an SLO with a single alerting window and threshold

Objective: Alert and raise an issue if the burn rate of the "Vending Machine Reliability" SLO configuration is more than 1 for the last 12 hours.

The configuration of the Service Levels Smart Alert should be:

Rule:
  Alert Type: Error Budget
  Metric: Burn Rate V2
Burn Rate Config:
[
  Alert Window Type: SINGLE
  Duration: 12 Hours
  Duration Unit Type: Hour
  Threshold:
    Operator: >
    Value: 1
]
SLOs: Vending Machine Reliability
Time Threshold:
  Expiry: 5 Minutes
  Time window: 10 Minutes

After the Smart Alert configuration is set up, the system begins monitoring the burn rate of the Vending Machine Reliability SLO (Service Level Objective) configuration for the specified alerting window.

Scenario: Monitoring and Event Triggering

  • Burn rate exceeds 1 for the alerting window (Last 12 hours)

    • Assume the calculated burn rate for the last 12 hours starts exceeding 1.
    • With the time threshold set to 10 minutes, the system waits for the full 10-minute window before taking any action.
    • If the burn rate still remains above 1 after 10 minutes, the system triggers an alerting event, raising an issue.
  • Burn rate drops below 1 for the alerting window (last 12 hours)

    • If, after some time, the burn rate for the alerting window drops below 1, the system continues monitoring.
    • If the burn rate stays below 1, the system will wait for the expiry time threshold of 5 minutes.
    • If the burn rate remains below 1 for the full 5 minutes, the event will be automatically closed.

Example 4: Smart Alert to monitor the burn rate of an SLO with multiple alerting windows and respective thresholds

Objective: Alert and raise an issue if the burn rate of the "Vending Machine Reliability" SLO configuration is more than 1 for the last 24 hours and more than 4 for the last 2 hours.

The configuration of the Service Levels Smart Alert should be:

Rule:
  Alert Type: Error Budget
  Metric: Burn Rate V2
Burn Rate Config:
[
  Alert Window Type: LONG
  Duration: 24 Hours
  Duration Unit Type: Hour
  Threshold:
    Operator: >
    Value: 1
  ,
  Alert Window Type: SHORT
  Duration: 2 Hours
  Duration Unit Type: Hour
  Threshold:
    Operator: >
    Value: 4
]
SLOs: Vending Machine Reliability
Time Threshold:
  Expiry: 5 Minutes
  Time window: 10 Minutes

After the Smart Alert configuration is set up, the system begins monitoring the burn rate of the Vending Machine Reliability SLO (Service Level Objective) configuration for both long and short alerting windows.

Scenario: Monitoring and Event Triggering

  • Burn rate exceeds the thresholds for both long and short alerting windows (last 24 hours and last 2 hours)

    • Assume the calculated burn rate for the last 24 hours starts exceeding 1 and for the last 2 hours starts exceeding 4.
    • With the time threshold set to 10 minutes, the system waits for the full 10-minute window before taking any action.
    • If the burn rate of both alerting windows still violates the thresholds after 10 minutes, the system triggers an alerting event, raising an issue.
  • Burn rate drops below 1 for the long alerting window but stays above 4 for the short alerting window

    • If, after some time, the burn rate for the long alerting window drops below 1, but the burn rate for the short alerting window stays above 4, the system continues monitoring.
    • If the burn rate remains below 1 for the long alerting window, the system waits for the expiry time threshold of 5 minutes.
    • If the burn rate stays below 1 for the long alerting window for the full 5 minutes, the event is automatically closed regardless of the short alerting window's value, because both thresholds must be violated for an alert to be sent. The same applies in reverse: if the short alerting window recovers while the long alerting window still violates its threshold, the event is also closed.
  • Burn rate drops below 1 for the long alerting window and below 4 for the short alerting window

    • If, after some time, the burn rate for both alerting windows drops below their respective thresholds, the system continues monitoring.
    • If the burn rate remains below the thresholds, the system waits for the expiry time threshold of 5 minutes.
    • If the burn rate stays below the thresholds for the full 5 minutes, the event is automatically closed.

The burn rate alert with multiple windows requires both thresholds to be violated (AND condition) in order to send an alert. If even one threshold is not violated, no alert is sent.
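
The AND condition can be sketched as a single check over all configured windows. The names `WindowRule` and `should_alert` are hypothetical and do not correspond to product API fields; the burn rates are assumed to be precomputed per alerting window and keyed by window length in hours.

```python
from dataclasses import dataclass


@dataclass
class WindowRule:
    duration_hours: int  # length of the alerting window
    threshold: float     # burn rate must be greater than this value


def should_alert(burn_rates: dict[int, float], rules: list[WindowRule]) -> bool:
    """Return True only if every window violates its threshold (AND condition)."""
    return all(burn_rates[rule.duration_hours] > rule.threshold for rule in rules)


rules = [WindowRule(24, 1.0), WindowRule(2, 4.0)]  # LONG and SHORT windows
print(should_alert({24: 1.5, 2: 4.5}, rules))  # both thresholds violated
print(should_alert({24: 0.8, 2: 4.5}, rules))  # only the short window violated
```

Combining a long and a short window this way filters out short spikes that do not threaten the budget over the longer period, as well as old incidents that have already stopped burning budget.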

Troubleshooting

The following are suggestions to resolve commonly occurring problems with configuring SLOs.

  • Problem: No error budget is consumed and the SLO status is always 100%.

    • Solution: Use the indicator chart on the SLO dashboard to check whether the indicator never exceeds the threshold during the SLO time window, which results in no consumption of error budget. If so, consider modifying the threshold accordingly.
  • Problem: No error budget is consumed and the SLO status is always 100%.

    • Solution: Use the traffic chart on the SLO dashboard to verify the entity is receiving traffic during the SLO time window. If not, the error budget and SLO status will not be impacted.
  • Problem: Error budget is consistently consumed rapidly and the SLO status remains negative.

    • Solution: Use the indicator chart on the SLO dashboard to check whether the indicator consistently exceeds the threshold during the SLO time window, which results in rapid consumption of error budget. If so, consider modifying the threshold accordingly.
  • Problem: Burn rate alert is not triggered due to time window misalignment.

    • Solution:
    • Fixed time window SLO: If the SLO is configured with a fixed time window, the alert might not trigger if the burn rate calculation is based on an alerting window that is longer than the actual elapsed time in the SLO time period. For example, if the alerting window requires data from a 12-hour period, but the SLO time window just started, there might not be enough time for the burn rate to exceed the threshold. As a result, no alert is triggered even if the burn rate is high during the elapsed time.
    • Rolling time window SLO: If the SLO is set to a rolling time window, the burn rate calculation might not trigger an alert if the alerting window extends beyond the SLO's creation time. For instance, if the alerting window goes past the period when the SLO was created or active, the burn rate cannot be calculated properly because the data is not available for the full alerting window.
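
Both cases above come down to whether data exists for the full alerting window. A quick way to reason about this is sketched below. The function name `alert_window_covered` is hypothetical, and `data_start` stands for either the start of the fixed SLO time window or the creation time of a rolling SLO; this is a diagnostic aid under those assumptions, not a description of how the product evaluates alerts.

```python
from datetime import datetime, timedelta


def alert_window_covered(data_start: datetime, now: datetime, alert_window: timedelta) -> bool:
    """Return True if SLO data exists for the whole alerting window ending at `now`."""
    return now - data_start >= alert_window


start = datetime(2025, 3, 10, 0, 0)  # fixed window start or SLO creation time
print(alert_window_covered(start, start + timedelta(hours=3), timedelta(hours=12)))
print(alert_window_covered(start, start + timedelta(hours=12), timedelta(hours=12)))
```

If the check returns False, a missing burn rate alert is likely caused by the misalignment described above rather than by a healthy SLO.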