What is a Service Level Objective (SLO)?

What is an SLO?

A service level objective (SLO) is an agreed-upon performance target for a particular service over a period of time. SLOs define the expected status of services and help stakeholders manage the health of specific services, and optimize decisions balancing innovation and reliability.¹

SLOs are measured through service level indicators (SLIs), quantitative metrics of some aspect of service. SLOs are part of a broader agreement between service providers and customers—service level agreements (SLAs). These agreements outline the level of service a customer can expect from providers and set penalties when targets are not met.

To ensure that service levels align with business requirements and customer expectations, site reliability engineering (SRE) teams, DevOps, IT and other relevant teams must understand the critical user journeys for each application. These journeys consist of the interactions that allow end users to achieve their desired outcomes.

Internal buy-in is crucial for successful SLOs (and therefore, SLAs), and multiple stakeholders should take part in determining the SLOs, including product managers, DevOps and problem management teams, and infrastructure engineers. External customers are incorporated in the discussion through focus groups, studies, customer complaints and social media.

The core logic behind SLOs is that service reliability leads to user happiness, which creates greater business opportunity. Establishing measurable reliability targets helps an organization balance an enjoyable and efficient user experience with reasonable cost, so that the IT budget is not spent on service levels beyond what is needed or expected.

SLOs are necessary because they define the quality of service (QoS) and reliability goals in concrete, measurable, objective terms. They are not intended to define the best possible performance level, but a range between the best possible and the least acceptable performance.¹

The aim of SLOs is nicely summed up in 97 Things Every Cloud Engineer Should Know, from O'Reilly Media: “How can you give management an easy way to instantly understand the tradeoffs between reliability, speed of innovation, and cost? SLOs are the answer. SLOs create clear reliability guidelines that balance the tradeoffs between cloud costs, speed of change, and external risks.”

SLO versus SLI versus SLA

SLOs are one of several interrelated terms involved in tracking and evaluating service performance:

Service level indicator (SLI)

An SLI is a quantitative measure of some aspect of a service. SLIs provide the real numbers—the measuring sticks for system performance—such as error rates, batch throughput or request latency. Usually, measurements are aggregated and presented as a rate, average or percentile.

Service level objective (SLO)

SLOs are the target values for those measurements (for instance, keeping response time under 200 milliseconds) that must be met to uphold service level agreements (SLAs). These values are expressed as a percentage over a period of time.

Service level agreement (SLA)

SLAs are the contracts between vendors and customers, composed of individual SLOs, that guarantee a certain level for service activities, functions or processes. They also set the penalties if the agreement is not met.

Error budget

An error budget is an aspect of SLOs that defines the acceptable amount of failure before a contract is broken. An error budget enables the incorporation of planned or unplanned downtime of the service that is unavoidable in practice. Building in downtime enables your development teams to make educated decisions concerning new development, operations and updating or fixing installed software.
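The relationship between an SLI, an SLO and an error budget can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Request` record, the function names and the sample numbers are all hypothetical, and it assumes the SLI is computed from individual request outcomes.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one served request."""
    latency_ms: float
    ok: bool  # True if the request succeeded


def availability_sli(requests):
    """SLI: percentage of requests that succeeded."""
    good = sum(1 for r in requests if r.ok)
    return 100 * good / len(requests)


def latency_sli(requests, threshold_ms):
    """SLI: percentage of requests answered within threshold_ms."""
    fast = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return 100 * fast / len(requests)


def error_budget_remaining(requests, slo_target_percent):
    """Fraction of the error budget still unspent (negative if overspent)."""
    if not requests or slo_target_percent >= 100:
        raise ValueError("need at least one request and a target below 100%")
    # The error budget is the number of failures the SLO tolerates.
    allowed_failures = len(requests) * (100 - slo_target_percent) / 100
    actual_failures = sum(1 for r in requests if not r.ok)
    return 1 - actual_failures / allowed_failures


# Assumed sample: 10,000 requests, 4 of them slow and failed.
sample = [Request(120.0, True)] * 9996 + [Request(900.0, False)] * 4
print(f"availability SLI: {availability_sli(sample):.2f}%")   # 99.96%
print(f"latency SLI (200 ms): {latency_sli(sample, 200):.2f}%")  # 99.96%
print(f"budget left at a 99.9% SLO: {error_budget_remaining(sample, 99.9):.0%}")  # 60%
```

With a 99.9% target, 10,000 requests allow about 10 failures, so 4 failures leave roughly 60% of the budget for planned changes or unplanned incidents.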

How SLOs are measured

Reliability and responsiveness are often measured in “nines on the way to 100%”: 90%, 99%, 99.9% and so on. For example, an objective for CPU availability can be shown like this:

Allowed unreliability window by reliability level

Reliability level	Per year	Per quarter	Per 30 days
90%	36.5 days	9 days	3 days
95%	18.25 days	4.5 days	1.5 days
99%	3.65 days	21.6 hours	7.2 hours
99.5%	1.83 days	10.8 hours	3.6 hours
99.9%	8.76 hours	2.16 hours	43.2 minutes
99.95%	4.38 hours	1.08 hours	21.6 minutes
99.99%	52.6 minutes	12.96 minutes	4.32 minutes
99.999%	5.26 minutes	1.30 minutes	25.9 seconds
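Each cell in a table like this follows from one multiplication: the allowed unreliability window is the period length times the fraction of time the target leaves uncovered. The sketch below shows that arithmetic. The function name is hypothetical, and it assumes a 365-day year and treats the target as a simple time-based percentage.

```python
from datetime import timedelta


def allowed_unreliability(target_percent: float, period: timedelta) -> timedelta:
    """Time a service may be unreliable within `period` while still
    meeting `target_percent` (for example, 99.9)."""
    if not 0 <= target_percent <= 100:
        raise ValueError("target_percent must be between 0 and 100")
    unreliable_fraction = (100 - target_percent) / 100
    return period * unreliable_fraction


window = allowed_unreliability(99.9, timedelta(days=30))
print(f"{window.total_seconds() / 60:.1f} minutes")  # 43.2 minutes

yearly = allowed_unreliability(99.0, timedelta(days=365))
print(f"{yearly.total_seconds() / 86400:.2f} days")  # 3.65 days
```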

Each step closer to 100% usually involves greater cost and complexity to achieve. Beyond a certain level of responsiveness, customers, both internal and external, can no longer detect a difference. Setting SLOs is part science and part art, striking a balance between statistical perfection and cost-effective, realistic goals.

The development team might want to deliver new features, while the operations team is looking to deliver stability and quality, introducing change in a controlled way. Because the business provides products or services to internal and external customers, it’s important to measure any service level from those customers’ points of view.

SLOs help bring organizations together around reliability. Ultimately, stakeholders should agree on a measurable SLO for the customer that is an effective balance between velocity and quality of service.

Why are SLOs important?

On a basic level, service level objectives are important because they ensure service reliability and that service level agreements are met. If you are meeting SLAs, your customers are happy and that’s good for business.

SLOs are valuable not just for external clients; they also offer insight for internal customers. SLOs help various teams gauge the performance of services and applications and determine ways in which they might improve. Among other benefits, SLOs help organizations to:

Establish system reliability and efficiency

Reliability issues can cost your company money. When SLOs are set up properly, you’re able to see and uncover gaps in observability. Your SLO setup might be the only place where you can centralize insights from multiple monitoring tools used in your organization. Better observability helps you provide better products, reduce customer churn, and operate more efficiently.

Improve products and user experience

SLOs and SLIs provide insight into the performance of services and applications and provide teams with an accurate measure of downtime and other potential issues.

This information is useful for DevOps, IT and other teams looking to strike a balance between innovation and reliability as they update existing products and release new features.

A well-considered SLO that measures the health of your microservices, as experienced by your customers, provides invaluable insight into product performance and user experience.

Better align internal teams and improve decision-making

Both establishing and tracking SLOs help unite teams from across the organization around an understanding of a service and associated expectations. Thoroughly considered SLOs help foster a culture of communication, where all stakeholders weigh in on what their units expect from a service, and understand their role in ensuring that SLAs are met.

In addition, creating reports and automations with SLOs can help each member of your team answer questions about incidents more quickly. SLOs are important for your DevOps, infrastructure and SRE teams, but they can also help transform almost every aspect of your company. The data harvested through observability can be converted to accessible, contextual and actionable information. These insights provide the visibility that your teams need to make timely, cost-effective decisions.

Leverage automation

With clearly articulated targets, organizations can turn to automation to monitor and measure SLIs. This approach can help ensure that targets are being met, with the goal of moving beyond monitoring to fully automating end-to-end processes.

An automated monitoring system can help detect potential issues as they are developing before service performance misses targets set out in SLOs or violates SLAs. When processes that meet SLOs are established, automation can be implemented to ensure consistent performance, for instance, by using a platform that automates resource allocation based on workload demand.

Reduce downtime

SLOs give DevOps teams the foresight to identify potential issues before they occur. This foresight helps prevent unacceptable downtime or other events that might negatively impact the end user or cost the company money.

SLAs often use monthly downtime or availability percentages to calculate billing. Downtime duration is the period of time when a system fails to perform its primary function. Communications failures, for example, might cause network downtime. The availability standard in the industry remains high and so does the cost of downtime, which is constantly increasing. Aside from the financial impact, broken SLOs can also lead to customer dissatisfaction.

Switch to predictive incident management

Many organizations operate based on a reactive incident management process. But when you wait for an incident to occur, it takes longer to mitigate and resolve issues within your system, increasing the mean time to repair (MTTR)¹. Properly established SLOs help improve observability and enable organizations to be more proactive about incident management.

Minimize employee burnout

Irrelevant alerts do more than increase operational costs. They also lead to high burnout rates as engineers waste time and lose productivity responding to alerts that do not reflect real problems. One of the biggest challenges in alerting is simply finding the right balance between too many and too few alerts.

A relevant alert notifies an engineer when degradation is likely to cause a reliability goal to be missed—a symptom-based alert. For example, it’s a real problem when a service’s response latency in the last hour might cause the latency SLO to be noncompliant for the week.
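One common way to express this kind of symptom-based alert is a burn rate: how fast recent behavior is spending the error budget compared with the pace the SLO allows. The sketch below assumes that technique; the text above does not prescribe it. The function names, the 99% weekly latency target and the alert threshold of 2.0 are all hypothetical.

```python
def burn_rate(bad_events: int, total_events: int, slo_target_percent: float) -> float:
    """Ratio of the observed bad-event fraction to the fraction the SLO allows.

    1.0 means the error budget would be used up exactly at the end of the
    SLO period if the observed window were representative.
    """
    if slo_target_percent >= 100:
        raise ValueError("a 100% target leaves no error budget to burn")
    if total_events == 0:
        return 0.0  # no traffic in the window, nothing was spent
    observed_bad_fraction = bad_events / total_events
    allowed_bad_fraction = (100 - slo_target_percent) / 100
    return observed_bad_fraction / allowed_bad_fraction


def should_alert(bad_events, total_events, slo_target_percent, threshold=2.0):
    """Alert only when the budget is burning faster than `threshold` (assumed value)."""
    return burn_rate(bad_events, total_events, slo_target_percent) > threshold


# Assumed last-hour sample: 1,200 requests, 60 slower than the latency limit,
# measured against a weekly SLO of 99% of requests within the limit.
rate = burn_rate(bad_events=60, total_events=1200, slo_target_percent=99.0)
print(f"burn rate: {rate:.1f}")  # 5.0
print(should_alert(60, 1200, 99.0))  # True
```

A burn rate of about 5 means that, if the last hour is typical, the weekly budget would be gone in roughly a fifth of the week, which is worth an engineer's attention. A brief blip that leaves the rate below the threshold stays silent, which keeps alert volume down.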

SLO best practices

If you ask people in business what their system uptime goal should be, many might say they’d like to try for 100%. That is an ambitious goal, but also an expensive one that can consume most of your IT budget before anything else. SLOs are designed not for bragging rights but to find and deliver on customer expectations so you can keep your customers happy and coming back. Reliability is a means, not an end.

Just because a performance metric can be measured does not mean it’s important to your customer’s happiness or your bottom line. Prioritize. Focus on those metrics that most closely indicate a positive customer experience.

In Foundations of Service Level Management, Rick Sturm and Wayne Morris present this checklist for setting realistic SLOs:

SLOs should be:

· Attainable

· Repeatable

· Measurable

· Understandable

· Meaningful

· Controllable

· Affordable

· Mutually acceptable

Their list begins with “attainable.” Aiming too high drives up costs and can result in higher availability than your customers need. Here are some important best practices that can help you achieve your SLO goals:

Don't get carried away

Define SLOs that support the SLA or business objective. Are 20 SLOs really four times better than five SLOs? Or would this approach create more work for your IT team and confuse the client—without any meaningful benefit? Don’t feel you have to grade everything that can be measured.

Don't try to be a hero

Set realistic SLO targets rather than overpromising and then underdelivering, which can be costly in penalties and possibly even cause the business to lose a client. Being realistic with internal stakeholders and clients enables everyone to make informed decisions. Unrealistically high SLO targets cost more in the end.

Use the SLOs to promote business alignment

By agreeing on realistic expectations up front, you avoid confusion and conflict down the road between internal teams and with the client.

Automate evaluations

Manual metric collection sheets can slow remediation and might not enable root cause analysis. Collect relevant SLIs to evaluate SLOs automatically and build in automatic alerting before an SLO is violated. Include the context and dependency information your staff needs to address an issue before it becomes a significant problem.
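As a rough sketch of what automated evaluation could look like, the example below checks collected event counts against an SLO, warns before the target is actually missed, and attaches dependency context to the message. Everything here is an assumption made for illustration: the `Slo` record, the `check` function, the 75% warning level and the service names are hypothetical, and a real setup would read counts from a monitoring system.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Slo:
    """Hypothetical SLO definition with context for responders."""
    name: str
    target_percent: float
    dependencies: list[str] = field(default_factory=list)


def check(slo: Slo, good_events: int, total_events: int,
          warn_at: float = 0.75) -> Optional[str]:
    """Return an alert message if the SLO is violated or close to it, else None."""
    if total_events == 0:
        return None
    bad_events = total_events - good_events
    allowed_bad = total_events * (100 - slo.target_percent) / 100
    if allowed_bad == 0:
        budget_used = float("inf") if bad_events else 0.0
    else:
        budget_used = bad_events / allowed_bad
    if budget_used > 1:
        state = "VIOLATED"   # more failures than the SLO allows
    elif budget_used >= warn_at:
        state = "AT RISK"    # warn before the SLO is actually missed
    else:
        return None
    deps = ", ".join(slo.dependencies) or "none listed"
    return (f"[{state}] {slo.name}: {budget_used:.0%} of error budget used; "
            f"dependencies: {deps}")


slo = Slo("checkout availability", 99.9, ["payments-api", "orders-db"])
print(check(slo, good_events=99_992, total_events=100_000))  # None
print(check(slo, good_events=99_920, total_events=100_000))
```

The second call reports an at-risk state because 80 failures out of an allowed 100 or so crosses the assumed 75% warning level, giving staff time to act before the SLO is violated.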
