Service level objectives (public preview)
Service Level Objectives (public preview) and Service Level Objectives widgets (legacy) are separate capabilities. Corresponding widgets are available separately for Service Level Objectives (public preview).
Welcome to Service Level Objectives (public preview). The navigation sidebar contains a new item that gives SLOs a dedicated space to provide both a better experience and more meaningful data to help you monitor objectives for your applications, websites, and synthetic tests over fixed and rolling time periods.
You can use an overview dashboard to view all of your SLOs and their status, and a detailed page to view the indicator, error budget, and traffic across time windows.
- Do you want to monitor the availability of your applications?
- Do you want to monitor the latency of your websites?
- Do you want to monitor the success rate of your synthetic tests?
- Do you want to customize which calls are good or bad and track them?
Service Level Objectives are not supported on Self-hosted Classic Edition (Docker).
Permissions
During public preview, no permission is required to access the new SLO dashboard to view and create SLOs. However, you are limited by the permissions in place for any entities (applications, websites, synthetic tests) associated with your SLO. If you have permission to view an application, website, or synthetic test, you can create an SLO for it.
Dashboard Overview
In the Instana UI, hover over the navigation sidebar and select Service Levels. The Service Levels page shows an overview of all of your SLOs, including their status, in alphanumeric order. You can sort the table by name, filter it by Tag or Entity Type, or search by SLO name.
The table displays the following information:
- Name: The name of the SLO. The color of the border on the left indicates the current status of the Service Level Indicator.
- Entity: The name of the application, website, or synthetic tests that are associated with this SLO.
If the associated entity is deleted or is no longer visible to you, this column displays “Unknown Entity”. The SLO definition remains, but SLO data is no longer captured for deleted entities.
- Blueprint: The indicator type that is being measured.
- Latency: Measures how fast the service responds relative to a set threshold.
- Availability: Measures how often a service responds without errors.
- Traffic: Measures how much traffic a service is receiving.
- Custom: Measures the count of good or bad events by using user-defined filters.
- Error Budget Remaining: Indicates how much of your error budget is still available, expressed in minutes, calls, beacons, or results depending on the selected entity type and indicator type. Details of the SLO's time window configuration are also displayed.
- Status/Target: Displays the current status of the SLO as a percentage, along with the desired target. If the status meets the target, it is colored green until your error budget is depleted; otherwise, it is colored red.
- Tags: Displays the tags that are associated with the SLO, which you can use to filter the view for related SLOs.
- Actions: Contains edit, copy, and delete buttons.
SLO Details
Select an SLO in the dashboard table to see its details. This page shows a summary of the SLO status, error budget, and traffic, along with detailed charts of the indicator, error budget, and traffic over time. You can toggle the displayed time range between the configured SLO time window and the time range selected in the upper-right corner of the page.
This page displays the following information:
- SLO Name: The name of the Service Level Objective.
- Analyze calls and Analyze bad Calls: These buttons open the Analytics page so that you can further examine the calls made to the services selected in the SLO. These buttons appear only for the Application and Website entity types.
- Entity: The name of the application, website, or synthetic tests that are associated with this SLO. Any declared scope configurations are also displayed.
If the associated entity is deleted or is no longer visible to you, this field displays “Unknown Entity”. The SLO definition remains, but SLO data is no longer captured for deleted entities.
- Tags: Displays the tags that are associated with the SLO, which you can use to filter the view for related SLOs.
The Summary tab displays aggregated information about the SLO for the selected time period, with cards and charts showing key performance indicators (KPIs) for the SLO.
- Time content switcher: Toggles between the time window from the SLO configuration and the selected time range.
- Configured SLO Time Window: Displays the configured time window for the SLO, which determines the error budget.
- Matching SLO Time Windows: Displays the configured time windows that span the selected time range. If the selected time range spans multiple windows, each window is shown in a distinct color in the corresponding charts.
- Status: Displays the current status of the SLO as a percentage, along with the configured target.
- Error Budget Remaining: Displays the remaining and total error budget, expressed in minutes, calls, beacons, or results depending on the selected entity type and indicator type.
- Traffic: Displays the number of calls that the configured application or website is receiving, or the total number of results of the configured synthetic tests.
The following charts show the trend over the selected time frame:
- Indicator: Displays the indicator metric relative to the configured threshold. The data on this chart is aggregated for larger time windows. To see more granular data, select a shorter time window.
- Error Budget: Displays the remaining error budget.
- Traffic: Displays the number of calls that the configured application or website is receiving, or the total number of results of the configured synthetic tests.
The Configuration tab displays configuration information of the SLO. There are options to edit, copy, or delete the SLO.
The Smart Alerts tab displays information on the configured Smart Alerts for this SLO.
Creating an SLO
To create a Service Level Objective (SLO), click + Add at the bottom right of the page and select Add Service Level Objective. A window opens where you can create a new SLO by following these steps:
Select Entity
Select the entity to measure for your SLO. You can define and measure performance targets for the following entity types:
- Application: Measure the performance of calls in an application perspective.
- Website: Measure the performance of beacons and traces in a website.
- Synthetic Tests: Measure the performance of the results of one or more synthetic tests.
Set Scope
Set the scope within the entity to measure.
Applications
- Select the Boundary
- Inbound Calls: Include calls that are initiated from outside the application and whose destination service is part of the selected application perspective.
- All Calls: Include both inbound calls from outside the application and calls that occur within the application perspective.
- Select Hidden Calls
- Include Internal Calls: Calls that represent work that is done inside a service. These calls can be created from intermediate spans that are sent through custom tracing.
- Include Synthetic Calls: Calls with a synthetic endpoint as the destination, such as calls to health-check endpoints.
- Select Service and Endpoint
- Service & Endpoint
- Select a specific Service in your application or use the default of All Services to include the entire application perspective.
- Select an Endpoint from the specified service or use the default of All Endpoints to apply to the entire service.
- Custom
- Specify custom filters using the available tags to define the scope of the measured calls for the application. These filters can be combined to create composite queries as needed.
Websites
- Select the Beacon to monitor:
- Choose one of the following beacon types: HTTP requests, Website Page Loads, or Website Custom Events.
- Select Custom Filter (optional):
- Specify custom filters to further limit the scope of the selected beacon type. These filters can be combined to create composite queries as needed.
Set Indicator
Define the Service Level Indicator (SLI) to measure for the entity.
You can select a Blueprint to define the type of Indicator, choosing from: Latency, Availability, Traffic, and Custom. The indicator is used to specify the metric and measurement type that are used to calculate the SLO status and error budget.
- Latency: Measures how quickly calls, beacons, or test results complete relative to a specified threshold. There are two types of measurement for the Latency blueprint:
- Time-based: The response times of all calls, beacons, or test results are aggregated into one-minute buckets. Error budget is measured in minutes and is a static value derived from the SLO target and the duration of the SLO time window. If the aggregated result for a minute is greater than the specified threshold, that minute is deemed a bad minute and the error budget is reduced. You can also specify the type of aggregation from the following options:
- Mean
- Min
- Max
- Percentile (25, 50, 75, 90, 95, 98, 99)
- Event-based: The response time of each call, beacon, or test result is compared to the specified threshold to determine whether it is a good or bad event. Error budget is measured in calls, beacons, or results and is derived from the SLO target and the total number of events during the SLO time window. Because the total number of events changes over the duration of the SLO time window, the total error budget is a dynamic value.
- Availability: Measures the success rate of calls, beacons, or test results over the designated time period. There are two types of measurements for the Availability blueprint:
- Time-based: The success rate of all calls, beacons, or test results is aggregated into one-minute buckets. Error budget is measured in minutes and is a static value derived from the SLO target and the duration of the SLO time window. If the aggregated success rate for a minute is less than the specified threshold, that minute is deemed a bad minute and the error budget is reduced. Currently, only mean aggregation is supported for Availability blueprints.
- Event-based: The overall success rate of the calls, beacons, or test results is calculated over the SLO time window. Error budget is measured in calls, beacons, or results and is derived from the SLO target and the total number of events during the SLO time window. Because the total number of events changes over the duration of the SLO time window, the total error budget is a dynamic value.
- Traffic: Measures the load (calls, beacons, or test results) a system encounters over time. This is a time-based measurement where the number of calls or results per minute is compared to a specified threshold.
If the threshold is exceeded, it is deemed to be a bad minute and the error budget is reduced. You can also specify the type of calls or results to count:
- All calls or results: Counts all calls or test results. This can be used to measure the overall system traffic for the specified entities.
- Erroneous calls or results: Counts only erroneous calls or failed test results. This can be used to focus the measurement only on erroneous traffic.
- Custom: Use custom filters to specify the definition of good and bad calls. The Custom blueprint is only available for Application and Website entity types. You can define the following filters for a Custom blueprint:
- Good calls: Specify the filter for good calls. These filters can be combined to create composite queries as needed. If only good calls are defined, calls that do not match the good calls filter are assumed to be bad calls.
- Bad calls: Optionally specify filters for bad calls. If the bad calls filter is defined, calls that do not match the good or bad calls filter are ignored.
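The measurement rules above can be illustrated with a small Python sketch. It is not Instana's implementation: the `Event` record, the function names, and the nearest-rank percentile method are hypothetical, chosen only to show how time-based blueprints count bad minutes, how event-based Latency counts good and bad events, and how the Custom blueprint's good and bad filters classify calls. Instana does not document what happens when a call matches both custom filters; in this sketch the good filter takes precedence, which is an assumption.

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, Iterable, List, Optional, Tuple


@dataclass
class Event:
    """Hypothetical call, beacon, or test result."""
    minute: int             # index of the one-minute bucket the event falls into
    latency_ms: float
    erroneous: bool = False


def percentile(values: List[float], p: int) -> float:
    """Nearest-rank percentile (assumed method; Instana's may differ)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def by_minute(events: Iterable[Event]) -> Dict[int, List[Event]]:
    """Group events into one-minute buckets."""
    buckets: Dict[int, List[Event]] = defaultdict(list)
    for e in events:
        buckets[e.minute].append(e)
    return buckets


def bad_minutes_latency(events: Iterable[Event], threshold_ms: float,
                        aggregate: Callable[[List[float]], float] = mean) -> int:
    """Time-based Latency: a minute is bad if its aggregated latency exceeds the threshold."""
    return sum(
        1 for bucket in by_minute(events).values()
        if aggregate([e.latency_ms for e in bucket]) > threshold_ms
    )


def bad_minutes_availability(events: Iterable[Event], threshold_pct: float) -> int:
    """Time-based Availability: a minute is bad if its mean success rate is below the threshold."""
    bad = 0
    for bucket in by_minute(events).values():
        ok = sum(1 for e in bucket if not e.erroneous)
        if 100 * ok / len(bucket) < threshold_pct:
            bad += 1
    return bad


def event_based_latency(events: Iterable[Event], threshold_ms: float) -> Tuple[int, int]:
    """Event-based Latency: each event is judged on its own. Returns (good, bad)."""
    events = list(events)
    good = sum(1 for e in events if e.latency_ms <= threshold_ms)
    return good, len(events) - good


def classify_custom(event: Event, good: Callable[[Event], bool],
                    bad: Optional[Callable[[Event], bool]] = None) -> str:
    """Custom blueprint: check the good filter first (assumed precedence)."""
    if good(event):
        return "good"
    if bad is None:
        return "bad"          # only a good filter: every non-matching call is bad
    return "bad" if bad(event) else "ignored"  # matches neither filter
```

With a 100 ms threshold, one slow call in a minute can make that minute bad under `mean` aggregation but good under a median (`lambda v: percentile(v, 50)`), which is why the choice of aggregation matters for time-based Latency SLOs.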
Set Objective
Define the overall Service Level Objective target value and details for the SLO time window.
- SLO Target: Specify the target percentage that your entity should be meeting when it is performing correctly.
The maximum precision that is currently supported is four nines (99.99%).
- Time Window: Specify the type of time window for the SLO.
- Fixed: A time window with a distinct start time and duration. For example, you can configure a fixed one-week window that starts on 2020-01-01. When the week is completed, the time window is automatically reset to the next week (2020-01-08).
- Rolling: A dynamic time window with a fixed window size, where the end is defined by the end date and time selected in the global time picker. For example, a one-week rolling window always shows the most recent week.
- Length: Duration of the SLO time window, specified in days or weeks.
The maximum supported length of an SLO time window is 4 weeks.
- Start: For Fixed time windows, the day and time to initiate the SLO calculation.
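To make the difference between the two window types concrete, here is a minimal Python sketch, assuming that windows are plain `datetime` intervals and ignoring time zones. The function names are hypothetical. `fixed_window` reproduces the reset behavior in the example above (a one-week window that starts on 2020-01-01 rolls over to 2020-01-08), `rolling_window` anchors the window at the time picker's end, and `matching_windows` lists the fixed windows that a selected time range spans, similar to what the SLO details page shows.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Window = Tuple[datetime, datetime]
MAX_LENGTH = timedelta(weeks=4)   # documented maximum SLO time window length


def _check_length(length: timedelta) -> None:
    if not timedelta(0) < length <= MAX_LENGTH:
        raise ValueError("time window length must be positive and at most 4 weeks")


def fixed_window(start: datetime, length: timedelta, at: datetime) -> Window:
    """Return the fixed window that contains `at`; windows reset every `length`."""
    _check_length(length)
    if at < start:
        raise ValueError("`at` is before the configured start")
    elapsed = (at - start) // length          # number of completed windows (int)
    window_start = start + elapsed * length
    return window_start, window_start + length


def rolling_window(end: datetime, length: timedelta) -> Window:
    """Return a fixed-size window that ends at the time picker's end."""
    _check_length(length)
    return end - length, end


def matching_windows(start: datetime, length: timedelta,
                     range_start: datetime, range_end: datetime) -> List[Window]:
    """Return every fixed window that overlaps [range_start, range_end)."""
    windows: List[Window] = []
    current, _ = fixed_window(start, length, max(range_start, start))
    while current < range_end:
        windows.append((current, current + length))
        current += length
    return windows
```

For example, with the one-week window starting on 2020-01-01, a selected range from 2020-01-05 to 2020-01-10 spans two fixed windows: 2020-01-01 to 2020-01-08 and 2020-01-08 to 2020-01-15.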
After you enter values for SLO Target and Time Window, the error budget is displayed.
- For time-based blueprints, this is a true error budget in minutes as determined by the time window length and the SLO target.
- For event-based blueprints, this is an estimated error budget based on the calls or beacons encountered over the prior SLO time period, along with the specified time window length and SLO target.
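The two kinds of error budget can be sketched in Python as follows. This is an illustration under assumptions, not Instana's code: the function names are hypothetical, the target is passed as a percentage, and the event-based budget simply applies the target to an event count. In the preview, that count would be the prior SLO time period's traffic (an estimate); while the SLO is running, it would be the events seen so far, which is why the event-based budget changes over the window.

```python
MINUTES_PER_DAY = 24 * 60


def time_based_budget(window_days: float, target_pct: float) -> float:
    """Static budget in minutes: window length in minutes x (1 - target)."""
    return window_days * MINUTES_PER_DAY * (1 - target_pct / 100)


def event_based_budget(event_count: int, target_pct: float) -> float:
    """Budget in calls, beacons, or results: event count x (1 - target).

    Pass the prior window's event count for an estimate, or the running
    count of the current window for the dynamic value.
    """
    return event_count * (1 - target_pct / 100)
```

For a 99.9% target, `time_based_budget(7, 99.9)` gives roughly 10.08 minutes for a one-week window, while `event_based_budget(2_000_000, 99.9)` gives roughly 2,000 calls if the previous week saw two million calls.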
Set Details
- Specify the SLO name.
- Optionally, specify a set of tags that can be used to categorize or sort the SLO.
Preview
A preview chart appears that represents the summary of your SLO. If no preview is displayed, click Highlight missing configuration to locate the missing information. The preview data is based on the previous seven days. In some cases, the preview shows sample data.
To create the SLO configuration, click Save.
Example
Consider the following example, where the SLO requirement is that calls to the “Robot-shop” application have a latency below 100 ms, measured as the mean latency per minute, in 90% of the minutes over a fixed period of 1 week. The configuration of the SLO might look like this:
- Entity: Robot-shop application (all services, endpoints, calls)
- Indicator:
- Blueprint: Latency
- Type: Time-based (all calls in each minute are aggregated into one data point)
- Aggregation: Mean
- Threshold: 100 ms
- Objective:
- Target: 90%
- Time window type: Fixed
- Time window length: 1 week
The error budget for this SLO is calculated as follows:
- Error budget = Minutes in time window × (1 - SLO target)
- Minutes in time window: 24 × 60 × 7 = 10,080 minutes in 1 week
- SLO target: 90% (0.90)
- Error budget: 10,080 × (1 - 0.90) = 1,008 minutes
The SLO status is calculated as follows:
- For example, assume the SLO had 400 bad minutes over the one-week SLO time window (that is, 400 minutes had a mean latency greater than 100 ms).
- SLO status = 100 × (total minutes in time window - bad minutes in time window) / total minutes in time window
- SLO status = 100 × (10,080 - 400) / 10,080 = 96.03%
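The arithmetic in this example can be checked with a short Python script. It uses `fractions.Fraction` to avoid floating-point rounding in the budget values. The 400 bad minutes are the assumed figure from the example, and treating the remaining error budget as the total budget minus the bad minutes is an interpretation of how the budget "is reduced", not a documented Instana formula.

```python
from fractions import Fraction

window_minutes = 24 * 60 * 7          # fixed one-week window: 10080 minutes
target = Fraction(90, 100)            # 90% SLO target
bad_minutes = 400                     # assumed bad minutes from the example

error_budget = window_minutes * (1 - target)                        # total budget in minutes
status_pct = 100 * Fraction(window_minutes - bad_minutes, window_minutes)
remaining_budget = error_budget - bad_minutes                       # assumed: budget minus bad minutes

print(f"Error budget: {error_budget} minutes")            # Error budget: 1008 minutes
print(f"SLO status: {float(status_pct):.2f}%")            # SLO status: 96.03%
print(f"Remaining error budget: {remaining_budget} minutes")  # Remaining error budget: 608 minutes
```

Because the status (96.03%) is above the 90% target and part of the error budget remains, this SLO is met for the week.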