What is a service level objective (SLO)?
What is an SLO?

A service level objective (SLO) is an agreed-upon performance target for a particular service over a period of time. SLOs define the expected status of services and help stakeholders manage the health of specific services and make better decisions when balancing innovation against reliability.[1]

SLOs are measured using service level indicators (SLIs), quantitative metrics of some aspect of service. SLOs are part of a broader agreement between service providers and customers—service level agreements (SLAs)—that outline the level of service a customer can expect from providers and set penalties if targets are not met.

To ensure that service levels are consistent with business requirements as well as customer desires, site reliability engineering (SRE) teams, DevOps, IT and other relevant teams must know the critical user journeys for each application: the interactions that enable end users to reach their desired result.

Internal buy-in is crucial for successful SLOs (and therefore, SLAs), and multiple stakeholders should take part in determining the SLOs, including product managers, DevOps and problem management teams, and infrastructure engineers. External customers are incorporated in the discussion through focus groups, studies, customer complaints and social media.

The core logic behind SLOs is that service reliability leads to user happiness, which in turn creates greater business opportunity. Establishing measurable reliability targets helps an organization balance an enjoyable and efficient user experience with reasonable cost, so that the IT budget is not spent on service levels beyond what is needed or expected.

SLOs are necessary because they define the quality of service (QoS) and reliability goals in concrete, measurable, objective terms. They are intended to define not the single best performance level but a range between the best possible and the least acceptable performance standards.[2]

The aim of SLOs is nicely summed up in 97 Things Every Cloud Engineer Should Know, from O’Reilly Media: “How can you give management an easy way to instantly understand the tradeoffs between reliability, speed of innovation, and cost? SLOs are the answer. SLOs create clear reliability guidelines that balance the tradeoffs between cloud costs, speed of change, and external risks.”

SLO versus SLI versus SLA

SLOs are one of several interrelated terms involved in tracking and evaluating service performance:

Service level indicator (SLI)

An SLI is a quantitative measure of some aspect of a service. SLIs provide the real numbers—the measuring sticks for system performance—such as error rates, batch throughput or request latency. Usually, measurements are aggregated and presented as a rate, average or percentile.

Service level objective (SLO)

SLOs are the target values for those measurements (like ensuring response time remains under 200 milliseconds, for instance) that must be met in order to uphold service level agreements (SLAs). These values are usually expressed as a percentage over a period of time.

Service level agreement (SLA)

SLAs are the contracts between vendors and customers, composed of individual SLOs, that guarantee a certain level of service for activities, functions or processes. They also set the penalties that apply if the agreement is not met.

Error budget

An error budget is the part of an SLO that defines how much failure is acceptable before the objective, and ultimately the contract, is broken. It is the gap between the target and 100%: a 99.9% availability target leaves a 0.1% error budget. An error budget makes room for the planned and unplanned downtime that is unavoidable in practice. Building in that allowance enables your development teams to make educated decisions about new development, operations, and updating or fixing installed software.
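To make the relationship between these terms concrete, here is a minimal sketch in Python. It assumes a hypothetical `Request` record, a 200-millisecond latency threshold (taken from the example above), and an illustrative 99.5% target. None of these names or numbers come from a particular monitoring product.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool  # True if the request succeeded


def latency_sli(requests, threshold_ms=200.0):
    """SLI: fraction of requests that succeeded in under threshold_ms."""
    if not requests:
        raise ValueError("no requests in the window")
    good = sum(1 for r in requests if r.ok and r.latency_ms < threshold_ms)
    return good / len(requests)


def error_budget_remaining(sli, slo_target):
    """Fraction of the error budget still unspent (negative if overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    budget = 1.0 - slo_target  # allowed fraction of bad requests
    spent = 1.0 - sli          # observed fraction of bad requests
    return (budget - spent) / budget


# Made-up window: 1,000 requests, of which 3 were slow and 1 failed.
window = (
    [Request(120.0, True)] * 996
    + [Request(450.0, True)] * 3
    + [Request(80.0, False)]
)
sli = latency_sli(window)
slo_target = 0.995  # assumed SLO: 99.5% of requests are good

print(f"SLI: {sli:.3%}")
print(f"SLO met: {sli >= slo_target}")
print(f"Error budget remaining: {error_budget_remaining(sli, slo_target):.0%}")
```

The SLI is the measurement, the SLO is the threshold it is compared against, and the error budget is how much room is left before the SLO is missed. An SLA would attach contractual consequences to that last outcome.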

How SLOs are measured

Reliability and responsiveness are often measured in “nines on the way to 100%”: 90%, 99%, 99.9% and so on. For example, an objective for CPU availability could be shown like this:[1]

Reliability level    Allowed unreliability window
                     Per year        Per quarter      Per 30 days
90%                  36.5 days       9 days           3 days
95%                  18.25 days      4.5 days         1.5 days
99%                  3.65 days       21.6 hours       7.2 hours
99.5%                1.83 days       10.8 hours       3.6 hours
99.9%                8.76 hours      2.16 hours       43.2 minutes
99.95%               4.38 hours      1.08 hours       21.6 minutes
99.99%               52.6 minutes    12.96 minutes    4.32 minutes
99.999%              5.26 minutes    1.30 minutes     25.9 seconds

Each additional nine usually involves greater cost and complexity to achieve. Customers, internal and external, might require a certain level of responsiveness, beyond which they can no longer detect a difference. Setting SLOs is part science and part art, striking a balance between statistical perfection and cost-effective, realistic goals.[1]
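The unreliability windows in the table above follow from simple arithmetic: the allowed downtime is the period length multiplied by one minus the reliability level. The following sketch reproduces a few rows, assuming a 365-day year and a 90-day quarter, which is what the table's figures imply. The function and constant names are illustrative only.

```python
from datetime import timedelta

# Period lengths assumed from the table's figures (365-day year, 90-day quarter).
PERIODS = {
    "year": timedelta(days=365),
    "quarter": timedelta(days=90),
    "30 days": timedelta(days=30),
}


def allowed_downtime(reliability_pct, period):
    """Time a service may be unavailable during `period` at this reliability level."""
    if not 0 <= reliability_pct <= 100:
        raise ValueError("reliability_pct must be between 0 and 100")
    return period * (1 - reliability_pct / 100)


for level in (99.0, 99.9, 99.99):
    row = ", ".join(
        f"{name}: {allowed_downtime(level, period)}"
        for name, period in PERIODS.items()
    )
    print(f"{level}% -> {row}")
```

Each extra nine divides the allowance by ten, which is why the step from 99.9% to 99.99% shrinks the 30-day window from roughly 43 minutes to just over 4 minutes.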

The development team might want to deliver new features quickly, while the operations team is looking to deliver stability and quality by introducing change in a controlled way. Because the business provides products or services to internal and external customers, it’s important to measure any service level from those customers’ points of view.

SLOs help bring organizations together around reliability. Ultimately, stakeholders should agree on a measurable SLO for the customer that is an effective balance between velocity and quality of service.

Why are SLOs important?

On a basic level, service level objectives are important because they ensure service reliability and that service level agreements are met. If you are meeting SLAs, your customers are happy, and that’s good for business.

SLOs are valuable not just for external clients; they also offer insight for internal customers. SLOs help various teams gauge the performance of services and applications and determine how they might improve. Among other benefits, SLOs help organizations to:

Establish system reliability and efficiency

Reliability issues can cost your company money. When SLOs are set up properly, you’re able to see and uncover gaps in observability. Your SLO setup might be the only place where you can centralize insights from multiple monitoring tools used in your organization. Better observability helps you provide better products, reduce customer churn, and operate more efficiently.

Improve products and user experience

SLOs and SLIs provide insight into the performance of services and applications and provide teams with an accurate measure of downtime and other potential issues. This information is useful for DevOps, IT and other teams looking to strike a balance between innovation and reliability as they update existing products and release new features.

A well-considered SLO that measures the health of your microservices, as experienced by your customers, provides invaluable insight into product performance and user experience.

Better align internal teams and improve decision-making

Both the establishing and tracking of SLOs help unite teams from across the organization around an understanding of a service and associated expectations. Thoroughly considered SLOs help foster a culture of communication, where all stakeholders weigh in on what their units expect from a service, and understand their role in ensuring that SLAs are met.

In addition, creating reports and automations with SLOs can help each member of your team answer questions about incidents more quickly. SLOs are important for your DevOps, infrastructure and SRE teams, but they can also help transform almost every aspect of your company. The data harvested through observability can be converted to accessible, contextual and actionable information. These insights provide the visibility that your teams need to make timely, cost-effective decisions.[1]

Leverage automation

With clearly articulated targets, organizations can turn to automation to monitor and measure SLIs. This approach can help ensure that targets are being met, with the goal of moving beyond monitoring to fully automating end-to-end processes.

An automated monitoring system can help detect potential issues as they are developing, before service performance actually misses targets set out in SLOs or violates SLAs. Once processes that meet SLOs are established, automation can be implemented to ensure consistent performance, for instance, by using a platform that automates resource allocation based on workload demand.

Reduce downtime

SLOs give DevOps teams the foresight to identify potential issues before they occur. This foresight helps prevent unacceptable downtime or other events that could negatively impact the end user or cost the company money.

SLAs often use monthly downtime or availability percentages to calculate billing. Downtime duration is the period of time when a system fails to perform its primary function. Communications failures, for example, may cause network downtime. The availability standard in the industry remains high and so does the cost of downtime, which is constantly increasing. Aside from the financial impact, broken SLOs can also lead to customer dissatisfaction.[1]

Switch to predictive incident management

Many organizations operate based on a reactive incident management process. But when you wait for an incident to occur, it takes longer to mitigate and resolve issues within your system, increasing the mean time to repair (MTTR).[1] Properly established SLOs help improve observability and enable organizations to be more proactive about incident management.

Minimize employee burnout

Irrelevant alerts not only increase operational costs but can also lead to high burnout rates when engineers waste time and lose productivity responding to alerts for nonexistent problems. One of the biggest challenges in alerting is simply finding the right balance between too many and too few alerts.

A relevant alert is one that notifies an engineer when a degradation will likely cause a reliability goal to be missed, that is, a symptom-based alert. For example, it’s a real problem when a service’s response latency in the last hour could cause the latency SLO to be noncompliant for the week.[1]
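One common way to express this kind of symptom-based alert is a burn rate: how fast the error budget is being spent relative to the pace that would use it up exactly at the end of the window. The sketch below is one possible formulation, not a description of any specific product's alerting. It assumes a weekly (168-hour) SLO window and roughly uniform traffic, and the function names and example numbers are hypothetical.

```python
WINDOW_HOURS = 168  # assumed SLO window: one week


def burn_rate(bad_fraction, slo_target):
    """How many times faster than sustainable the error budget is being spent."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return bad_fraction / budget


def should_alert(bad_last_hour, budget_spent, slo_target, hours_left):
    """Alert if the last hour's rate, sustained, would exhaust the weekly budget.

    bad_last_hour: fraction of requests in the last hour that missed the latency threshold
    budget_spent:  fraction of the weekly error budget already used (0.0 to 1.0)
    hours_left:    hours remaining in the weekly window
    """
    rate = burn_rate(bad_last_hour, slo_target)
    # At burn rate 1.0 the whole budget lasts exactly one window, so each hour
    # at `rate` consumes rate / WINDOW_HOURS of the budget (uniform traffic assumed).
    projected = budget_spent + rate * hours_left / WINDOW_HOURS
    return projected > 1.0


# Hypothetical week: 99% latency SLO, 30% of the budget used, 100 hours to go.
print(should_alert(bad_last_hour=0.05, budget_spent=0.3, slo_target=0.99, hours_left=100))
print(should_alert(bad_last_hour=0.005, budget_spent=0.3, slo_target=0.99, hours_left=100))
```

In the first call, 5% of requests were slow against a 1% budget, so the budget would run out well before the week ends and an engineer should be paged. In the second, the degradation is real but does not threaten the weekly goal, so no alert fires.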

SLO best practices

If you ask people in business what their system uptime goal should be, many might say they’d like to try for 100%. That’s very aspirational, but also very pricey and could eat up most of your IT budget before anything else. SLOs are designed not for bragging rights but to find and deliver on customer expectations so you can keep your customers happy and coming back. Reliability is a means, not an end.

Just because a performance metric can be measured does not mean it’s important to your customer’s happiness or your bottom line. Prioritize. Focus on those metrics that most closely indicate a positive customer experience.

In Foundations of Service Level Management, Rick Sturm and Wayne Morris present this checklist for setting realistic SLOs:

SLOs should be:

· Attainable

· Repeatable

· Measurable

· Understandable

· Meaningful

· Controllable

· Affordable

· Mutually acceptable

Note that their list begins with “attainable.” Shooting for the moon is very expensive and might deliver more uptime than is expected by your customers. Here are some important best practices that can help you achieve your SLO goals:[1]

Don't get carried away

Define SLOs that support the SLA or business objective. Are 20 SLOs really four times better than five SLOs? Or would this simply create more work for your IT team and confuse the client—without any meaningful benefit? Don’t feel you have to grade everything that can be measured.

Don't try to be a hero

Set realistic SLO targets rather than overpromising and then underdelivering, which can be costly in penalties and perhaps even cause the business to lose a client. Being realistic with internal stakeholders as well as clients enables everyone to make informed decisions. Unrealistically high SLO targets will only cost more in the long run.

Use the SLOs to promote business alignment

By agreeing on realistic expectations up front, you avoid confusion and conflict down the road between internal teams and with the client.

Automate evaluations

Manual metric collection sheets can slow remediation and might not enable root cause analysis. Collect relevant SLIs to evaluate SLOs automatically, and build in automatic alerting before an SLO is violated. Include the context and dependency information your staff needs to address an issue before it becomes a significant problem.
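As a rough illustration of automated evaluation with an early warning, the sketch below classifies an SLO from raw SLI counts and flags it as at risk once most of the error budget is spent. The `Slo` class, the `Status` values, and the 80% warning threshold are assumptions made for this example, and a real setup would read the counts from a monitoring system.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    OK = "ok"
    AT_RISK = "at risk"
    VIOLATED = "violated"


@dataclass(frozen=True)
class Slo:
    name: str
    target: float         # for example, 0.999 for 99.9%
    warn_at: float = 0.8  # warn once this share of the error budget is spent


def evaluate(slo, good_events, total_events):
    """Classify an SLO from raw SLI counts for the current window."""
    if not 0.0 < slo.target < 1.0:
        raise ValueError("target must be between 0 and 1, exclusive")
    if total_events == 0:
        return Status.OK  # no traffic in the window, nothing to judge
    bad_fraction = 1 - good_events / total_events
    budget_spent = bad_fraction / (1 - slo.target)
    if budget_spent > 1:
        return Status.VIOLATED
    if budget_spent >= slo.warn_at:
        return Status.AT_RISK
    return Status.OK


# Hypothetical SLOs and made-up counts for one window.
checks = [
    (Slo("checkout availability", 0.999), 99_910, 100_000),
    (Slo("search latency under 200 ms", 0.99), 99_500, 100_000),
    (Slo("login availability", 0.999), 99_800, 100_000),
]
for slo, good, total in checks:
    print(f"{slo.name}: {evaluate(slo, good, total).value}")
```

The at-risk state is what allows alerting before an SLO is violated rather than after. The threshold for it is a judgment call that depends on how quickly a team can respond.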

Footnotes

[1] “Getting value from SLO and SLI framework,” IBM, June 2023

[2] “Service level objectives,” IBM, 6 September 2023