Service level objectives (SLO)

DevOps practitioners or business owners face the difficult task of managing a service correctly. To ensure that service levels are consistent with the business requirements, it is imperative to know the critical user journeys for a particular application. A critical user journey is a set of interactions that a user has with an application to achieve a particular result, as described in Guide to setting Service Level Objectives (SLO).

Instana users can model their critical user journeys by setting up an Application Perspective or by selecting filters for a Monitored website.

After the critical user journeys are identified and ordered by business impact, owners need to determine the metrics to use as key indicators. The configuration of such indicators is described in detail in Service Level Indicator (SLI) configuration.

After the indicators are set, the objectives, or targets, for the selected indicators must be defined. A service level objective is a target for a service level indicator during a specified time window. An SLO helps to measure whether the reliability of a service during that time window meets the expectations of most of its users. This procedure is described in the blog post Guide to setting SLOs.

SLOs are necessary because they define the quality of service (QoS) and reliability goals in concrete, measurable, objective terms. They are not intended to define the best possible performance level, but rather a range between the best possible and the least acceptable performance.

Terminology

Service level indicator (SLI): A defined quantitative measure of one characteristic of the level of service that is provided to a customer. Common examples include the error rate or the response latency of a service.

Service level objective (SLO): The target value for the service level that is measured by an SLI. As an example, the SLO might specify that a particular SLI is met for 99.9% of the defined time window.

Error budget: The target value of an SLO implicitly defines a small budget within which the service is allowed to be less than fully reliable. This error budget accommodates the planned or unplanned downtime of the service that is unavoidable in practice.

SLO widget

Instana enables users to create custom dashboard widgets for their SLOs to display and analyze the performance of their services over time. The widget can display either Time-based or Event-based SLI configurations.

The following image shows an example SLO widget that is called Robot Shop SLO. The widget is configured with a Time-based SLI configuration and an SLO target value of 95%. The SLI in the example is based on the 90th percentile of the latency metric and a threshold value of 2 seconds. The error budget is 1 minus the SLO target value, multiplied by the time window. For the 7-day time window that is shown on the widget, this results in an error budget of 504 minutes:

504 minutes = (1 - 95%) * 10080 minutes

The example SLO was violated in the selected time frame as the spent error budget of 565 minutes exceeds that limit.
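As a rough illustration of this arithmetic, the following Python sketch computes the error budget for a time-based SLO and checks whether it was exceeded. The function names are hypothetical and are not part of Instana; the numbers are the ones from the Robot Shop SLO example.

```python
def error_budget_minutes(slo_target_percent: float, window_minutes: int) -> float:
    """Allowed bad minutes: (1 - SLO target) * time window.

    The target is passed in percent (for example 95), which keeps the
    arithmetic exact for round targets.
    """
    return window_minutes * (100 - slo_target_percent) / 100


def is_violated(spent_minutes: float, slo_target_percent: float, window_minutes: int) -> bool:
    """The SLO counts as violated when the spent budget exceeds the allowed budget."""
    return spent_minutes > error_budget_minutes(slo_target_percent, window_minutes)


WINDOW_MINUTES = 7 * 24 * 60  # 7 days = 10080 minutes

print(error_budget_minutes(95, WINDOW_MINUTES))  # 504.0
print(is_violated(565, 95, WINDOW_MINUTES))      # True
```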

SLO widget

Configuration

Adding SLO widgets

You can set up an SLO widget for any of your Application Perspectives or websites. To add an SLO widget, go to one of your custom dashboards to open the dialog for adding a widget. Next, follow these steps:

  1. In the dialog sidebar, click SLO > Next. The dialog opens the configuration section to set up an SLO widget.
  2. Select whether you want to monitor an Application Perspective or website.
    • SLOs for Application Perspectives exclude factors beyond your control such as poor internet connectivity of users, and might not accurately reflect user experience.
    • SLOs for websites most accurately reflect user experience and do not exclude factors beyond your control such as poor internet connectivity of users.
  3. Select the Application Perspective or website for the SLO from the list.
  4. Select the Service Level Indicator for the particular SLO type from the list. If no SLI is available for the previously selected SLO type, create an SLI as described in the SLI configuration section.
  5. Enter the desired SLO Target value, for example 99.9%.
  6. Select the Time Window Type, which defines the context and displayed time frame of the widget:
    • Dynamic time window: The SLO is calculated for the time window that is selected in the global time picker.
    • Rolling time window: A time window with a fixed window size, where the end is defined by the global time picker’s end date and time selection. For example, with a rolling time window you can always see the last week without having to adjust the global time picker.
    • Fixed time interval: A time window with a defined start and duration. As an example, you can configure a fixed one-month window that starts on 2020-01-01. The time window will be automatically reset to the next month (2020-02-01) when the month is completed.
  7. Enter a title for the widget.
  8. Verify your widget in the preview. Note: If no preview is displayed, click Highlight missing configuration to immediately see what is missing.
  9. To create the SLO widget configuration, click Create.
  10. To save the SLO widget configuration on your custom dashboard, click Save changes.
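To make the Rolling time window and Fixed time interval options from step 6 more concrete, the following Python sketch resolves each of them to a start and an end time. It is only an illustration under stated assumptions: the function names are hypothetical, the fixed interval is assumed to be exactly one calendar month that starts at midnight on the first day of a month, and the way Instana resolves windows internally might differ.

```python
from datetime import datetime, timedelta


def rolling_window(picker_end: datetime, size: timedelta) -> tuple[datetime, datetime]:
    """Fixed-size window that ends at the end of the global time picker."""
    return picker_end - size, picker_end


def fixed_monthly_window(first_start: datetime, now: datetime) -> tuple[datetime, datetime]:
    """One-month window that resets to the next month when a month is completed.

    Assumes first_start is midnight on the first day of a month and now >= first_start.
    """
    def shift(months: int) -> datetime:
        # Move first_start forward by a whole number of calendar months.
        total = first_start.month - 1 + months
        return first_start.replace(year=first_start.year + total // 12, month=total % 12 + 1)

    elapsed = (now.year - first_start.year) * 12 + (now.month - first_start.month)
    return shift(elapsed), shift(elapsed + 1)


# Rolling: always the last 7 days before the picker's end.
print(rolling_window(datetime(2020, 1, 15, 12, 0), timedelta(days=7)))

# Fixed: configured to start on 2020-01-01, evaluated in mid-February.
print(fixed_monthly_window(datetime(2020, 1, 1), datetime(2020, 2, 15)))
```

In the second call, the window has already been reset once, so it runs from 2020-02-01 to 2020-03-01.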

SLI configuration

SLI types

Independent of the SLO's type, you can select one of two SLI types for your configuration:

  • Event-based SLI uses defined groups of good and bad events to measure the service's reliability. Because every call is given the same weight, it more accurately reflects the actual user experience. However, handling the error budget is more difficult because the budget depends on the number of events.
  • Time-based SLI measures the service's reliability aggregated by minute. Because the error budget is always a constant number of bad minutes, it is easier for people to manage. However, it is less accurate than an event-based SLI because bad events carry more weight in minutes with lower traffic.
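The difference between the two SLI types can be seen in a small Python sketch. The traffic data and the 1% error-rate threshold are made up for illustration. Three busy minutes are error free, and one low-traffic minute contains a handful of failed calls.

```python
# Each tuple is one minute of traffic: (total_calls, erroneous_calls). Hypothetical data.
minutes = [(1000, 0), (1000, 0), (1000, 0), (10, 5)]

ERROR_RATE_THRESHOLD = 0.01  # assumed threshold: a minute is bad above 1% errors

# Event-based view: every call has the same weight.
total_calls = sum(total for total, _ in minutes)
bad_calls = sum(bad for _, bad in minutes)
event_based = (total_calls - bad_calls) / total_calls * 100

# Time-based view: every minute has the same weight, regardless of traffic.
bad_minutes = sum(1 for total, bad in minutes if total and bad / total > ERROR_RATE_THRESHOLD)
time_based = (1 - bad_minutes / len(minutes)) * 100

print(round(event_based, 2))  # 99.83
print(time_based)             # 75.0
```

Only 5 of 3010 calls failed, but because they all fall into the low-traffic minute, the time-based SLI drops to 75% while the event-based SLI stays above 99.8%.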

Managing SLIs through the UI

To create an SLI configuration or clone an existing one, open the SLI Management dialog by clicking Manage SLIs on the SLO widget.

Manage SLIs

Next, follow these steps to either create a configuration from scratch or clone an existing configuration.

Creating SLI configuration for Application Perspectives

  1. On the SLI Management dialog, click Create SLI.
  2. Provide the SLI configuration Details:
    • Enter a name for the SLI configuration to identify the configuration uniquely.
  3. Select the type of SLI configuration, which can be time-based or event-based.
  4. Click Create to save the new SLI configuration.

Time-based SLI for applications

Complete the following configuration of Time-based SLI:

  1. Select the boundary scope, either Inbound Calls or All Calls.
    • Inbound calls: Include only calls that are initiated from outside the application and where the destination service is part of the selected application perspective.
    • All calls: Include both inbound calls from outside the application and calls that occur within the application perspective itself.
  2. You can choose a specific service in your application or leave the default of All Services selected to apply to the entire Application Perspective.
  3. If you want to narrow down further to an endpoint, you can select an endpoint from the list. Similar to service selection, you can choose to leave the default of All Endpoints to apply to the entire service.
  4. Choose a metric on which the SLI configuration must be evaluated from the list of supported metrics.
    • The following metrics are supported:
      • Latency
      • Call Count
      • Error rate
      • Erroneous Calls
  5. Select the aggregation for the selected metric.
  6. Enter the threshold value for the selected metric.
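The two boundary scope options from step 1 can be sketched as a filter over calls. The Call type, the scope label "INBOUND", and the service names are hypothetical and serve only as an illustration; they do not describe Instana's data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Call:
    source_service: Optional[str]  # None: the caller is not a monitored service
    destination_service: str


def in_scope(call: Call, app_services: set, boundary_scope: str) -> bool:
    """Decide whether a call counts toward the SLI for the given boundary scope."""
    destination_inside = call.destination_service in app_services
    source_inside = call.source_service in app_services
    if boundary_scope == "INBOUND":
        # Initiated from outside the application, destination inside it.
        return destination_inside and not source_inside
    if boundary_scope == "ALL":
        # Inbound calls plus calls that occur within the application itself.
        return destination_inside
    raise ValueError(f"unknown boundary scope: {boundary_scope}")


app = {"cart", "catalogue"}
from_outside = Call(None, "cart")
within_app = Call("catalogue", "cart")

print(in_scope(from_outside, app, "INBOUND"))  # True
print(in_scope(within_app, app, "INBOUND"))    # False
print(in_scope(within_app, app, "ALL"))        # True
```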

After the metric and threshold are selected, the SLI is computed as follows:

SLI = (1 - #minutes_where_threshold_is_violated / #minutes_in_time_window) * 100%
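A minimal Python sketch of this formula follows. It assumes that the metric is already aggregated to one value per minute and that a minute counts as violated when its value is strictly greater than the threshold; whether Instana treats a value equal to the threshold as a violation is not stated here. The function name and the sample values are hypothetical.

```python
def time_based_sli(per_minute_values: list, threshold: float) -> float:
    """SLI in percent: share of minutes in the window that stay within the threshold."""
    if not per_minute_values:
        raise ValueError("the time window contains no minutes")
    violated = sum(1 for value in per_minute_values if value > threshold)
    return (1 - violated / len(per_minute_values)) * 100


# Hypothetical 90th percentile latency in ms for a 10-minute window.
p90_latency_ms = [12, 18, 31, 22, 40, 19, 24, 25, 17, 20]

print(time_based_sli(p90_latency_ms, threshold=25))  # 80.0
```

Two of the ten minutes exceed the threshold of 25 ms, so the SLI is 80%.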

The following image shows how a time-based SLI is configured for an application that is called k8s-demo. The Boundary scope is restricted to Inbound Calls for the Endpoint DELETE /cart/:id on a Service that is called cart. The Metric to be evaluated is set to Latency with a recommended Aggregation of 90th percentile and a Threshold of 25 ms.

Example of time-based SLI for application

Event-based SLI for applications

Event-based SLI configuration gives the full flexibility of the Unbounded Analytics query builder to select a subset of good events and bad events.

  • Good events: The set of calls that indicate the success criteria of a particular service. For example, all HTTP requests of an HTTP service that have a 2XX status code.
  • Bad events: The set of calls that indicate the failure criteria of a particular service. For example, all HTTP requests of an HTTP service that have a 5XX status code.

Complete the following configuration of Event-based SLI:

  1. Select the boundary scope, either Inbound Calls or All Calls.
    • Inbound calls: Include only calls that are initiated from outside the application and where the destination service is part of the selected application perspective.
    • All calls: Include both inbound calls from outside the application and calls that occur within the application perspective itself.
  2. Optional: You can include internal calls or synthetic calls. By default, both types of calls are excluded.
    • Internal calls: A particular type of call that represents work that is done inside of a service. These calls can be created from intermediate spans that are sent through custom tracing. For more information about internal calls, see Concepts of Tracing.
    • Synthetic calls: Calls with a synthetic endpoint as the destination, such as calls to the health-check endpoints.

When bad events and good events are defined, the SLI is calculated as follows:

SLI = #good_events / (#good_events + #bad_events) * 100%
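As a sketch of this calculation, the following Python example classifies HTTP calls by status code, using the 2XX and 5XX criteria from the good and bad event examples, and then applies the formula. The function names and sample data are hypothetical. Calls that match neither definition, such as 404 responses here, are left out of the calculation.

```python
from typing import Optional


def classify(status_code: int) -> Optional[str]:
    """Label a call as a good event, a bad event, or neither."""
    if 200 <= status_code < 300:
        return "good"
    if 500 <= status_code < 600:
        return "bad"
    return None  # matches neither definition, so it is not counted


def event_based_sli(status_codes: list) -> float:
    """SLI in percent: good events divided by good plus bad events."""
    labels = [classify(code) for code in status_codes]
    good, bad = labels.count("good"), labels.count("bad")
    if good + bad == 0:
        raise ValueError("no good or bad events in the time window")
    return good / (good + bad) * 100


calls = [200] * 97 + [500] * 3 + [404] * 5

print(event_based_sli(calls))  # 97.0
```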

The image shows how an event-based SLI is configured for an application whose boundary scope is restricted to Inbound Calls. Good events are defined as calls with a status code of 200, and bad events are defined as calls with a status code of 500.

Example of event-based SLI

The resulting widget of this configuration shows the error budget of calls as opposed to minutes.
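To illustrate an error budget that is counted in calls, the following Python sketch assumes that the budget is derived in the same way as for time-based SLIs: 1 minus the SLO target value, multiplied by the number of events in the time window. This definition is an assumption made by analogy, and the function name and numbers are hypothetical.

```python
def event_error_budget(slo_target_percent: float, good_events: int, bad_events: int) -> tuple:
    """Return (allowed_bad_calls, remaining_budget) for an event-based SLO.

    Assumption: budget = (1 - SLO target) * (good events + bad events).
    """
    total_events = good_events + bad_events
    allowed_bad_calls = total_events * (100 - slo_target_percent) / 100
    return allowed_bad_calls, allowed_bad_calls - bad_events


allowed, remaining = event_error_budget(99, good_events=9950, bad_events=50)

print(allowed)    # 100.0
print(remaining)  # 50.0
```

Unlike the time-based budget, the allowed number of bad calls grows and shrinks with the traffic in the time window.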

Creating SLI configuration for websites

  1. On the SLI Management dialog, click Create SLI.
  2. Provide the SLI configuration Details:
    • Enter a name for the SLI configuration to identify the configuration uniquely.
  3. Select the type of SLI configuration, which can be time-based or event-based.
  4. Click Create to save the new SLI configuration.

Time-based SLI for websites

Complete the following configuration of Time-based SLI:

  1. Select the beacon scope, such as HTTP requests. By applying the beacon filter, you can further scope the configuration to a subset of website traffic, such as by geolocation, browser, or user.
  2. Choose a metric on which the SLI configuration must be evaluated from the list of supported metrics. The following metrics are supported:
    • Beacon error rate
    • Beacon duration
  3. Select the aggregation for the selected metric.
  4. Enter the threshold value for the selected metric.

After the metric and threshold are selected, the SLI is computed as follows:

SLI = (1 - #minutes_where_threshold_is_violated / #minutes_in_time_window) * 100%

The following example shows a time-based SLI configuration for the Robot Shop website. The beacon scope is limited to HTTP requests that use the GET method on the /products path. The SLI evaluates the mean of the beacon error rate metric with a threshold of 25 milliseconds.

Example of time-based SLI for websites

Event-based SLI for websites

Event-based SLI configuration gives the full flexibility of the Unbounded Analytics query builder to select a subset of good events and bad events.

  • Good events: The set of calls that indicate the success criteria of a particular service. For example, all HTTP requests of an HTTP service that have a 2XX status code.
  • Bad events: The set of calls that indicate the failure criteria of a particular service. For example, all HTTP requests of an HTTP service that have a 5XX status code.

Complete the following configuration of Event-based SLI:

  1. Select the beacon scope, for example HTTP requests. By applying a beacon filter, you can further scope the configuration to a subset of website traffic, for example, by geolocation, browser, or user.
  2. Select the boundary scope, either Inbound Calls or All Calls.
    • Inbound calls: Include only calls that are initiated from outside the application and where the destination service is part of the selected application perspective.
    • All calls: Include both inbound calls from outside the application and calls that occur within the application perspective itself.

When bad events and good events are defined, the SLI is calculated as follows:

SLI = #good_events / (#good_events + #bad_events) * 100%

The image shows how an event-based SLI is configured for a website whose beacon scope is set to HTTP Requests. Good events are defined as GET requests with a status code of 200, and bad events are defined as GET requests with a status code of 500.

Example of event-based SLI for websites

The resulting widget of this configuration shows the error budget of calls as opposed to minutes.

Cloning an existing SLI configuration

To prevent invalidation of the calculated spent budgets, the parameters of an existing SLI cannot be modified. To change any parameter, clone the SLI configuration and edit the clone instead.

  1. From the SLI Management dialog, click the View/Clone SLI configuration icon on the selected SLI configuration that you want to clone.
  2. Edit the SLI configuration Details:
    • Change the name for the SLI configuration to identify the configuration as a clone from the already-existing configuration.
  3. Edit the SLI configuration as required.
  4. Click Clone to create a clone of the SLI configuration.
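The clone-instead-of-edit approach can be pictured with an immutable record. The following Python sketch is only an analogy: the SliConfig class and its fields are hypothetical and are not Instana's implementation. The original configuration stays untouched, so budgets that were calculated from it remain valid, and the clone carries the changed parameter.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class SliConfig:
    name: str
    metric_name: str
    aggregation: str
    threshold: float


original = SliConfig(name="My first SLI", metric_name="latency", aggregation="P90", threshold=25)

# Editing in place is rejected.
try:
    original.threshold = 50
except FrozenInstanceError:
    print("existing SLI parameters cannot be modified")

# Cloning copies every parameter and overrides only the ones that change.
clone = replace(original, name="My first SLI (clone)", threshold=50)

print(original.threshold)  # 25
print(clone.threshold)     # 50
```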

Creating SLIs through API

The Instana SLI Configuration API provides endpoints to create, read, update, and delete SLI configurations.

As an example, the following curl command creates a time-based SLI that is named My first SLI for an application with the ID appId. The SLI is scoped to the service with the ID serviceId and is limited to calls where the endpoint ID equals endpointId. The SLI evaluates the 90th percentile of the latency metric against a threshold value of 25 ms.

curl --location --request POST "{{base}}/api/settings/v2/sli" \
  --header "Authorization: apiToken {{apiToken}}" \
  --header "Content-Type: application/json" \
  --data '{
    "sliName": "My first SLI",
    "metricConfiguration": {
        "metricName": "latency",
        "metricAggregation": "P90",
        "threshold": 25
    },
    "sliEntity": {
        "sliType": "application",
        "applicationId": "appId",
        "serviceId": "serviceId",
        "endpointId": "endpointId",
        "boundaryScope": "ALL"
    }
  }'

Note:

  • The API token that is used requires the permission "Configuration of service level indicators".
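The same request can also be prepared from a script. The following Python sketch builds an equivalent POST request with the standard library and reuses the endpoint path, headers, and payload from the curl command. The function name, the base URL, and the token value are placeholders. The sketch only constructs the request; sending it is left as a commented-out line.

```python
import json
import urllib.request


def build_create_sli_request(base_url: str, api_token: str, payload: dict) -> urllib.request.Request:
    """Prepare the POST request from the curl example without sending it."""
    return urllib.request.Request(
        url=f"{base_url}/api/settings/v2/sli",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"apiToken {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


payload = {
    "sliName": "My first SLI",
    "metricConfiguration": {
        "metricName": "latency",
        "metricAggregation": "P90",
        "threshold": 25,
    },
    "sliEntity": {
        "sliType": "application",
        "applicationId": "appId",
        "serviceId": "serviceId",
        "endpointId": "endpointId",
        "boundaryScope": "ALL",
    },
}

# Placeholder base URL and token; replace them with your own values.
request = build_create_sli_request("https://instana.example.invalid", "my-api-token", payload)
print(request.get_method(), request.full_url)

# To send the request:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```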

Grafana SLO plug-in

As an alternative, if you have other custom dashboarding needs, Instana offers a Grafana plug-in that can display specific SLO information based on data that is provided by Instana.