Hope is not a strategy: 7 principles of site reliability engineering (SRE)


Author: Dan Nosowitz, Staff Writer, Automation & ITOps, IBM Think

Site reliability engineering, or SRE, is an approach that treats operations issues as if they were software issues. It was originally named and described in 2003 by Ben Treynor Sloss, an engineer at Google. As a discipline, SRE aims to maintain a particular system’s availability, performance and efficiency.

SRE can be hard to nail down. It’s an approach or discipline rather than a prescriptive set of tasks, and it takes different forms based on the needs of a given organization. Luckily, there are seven principles of site reliability engineering that can help guide an SRE team to success.


Why are SRE principles important?

Much of software development is rightfully focused on creation, including DevOps, a related but distinct field, which is more concerned with a product’s entire lifecycle. But the job is hardly complete when the system launches. In the preface of Google's guide to SRE, the authors note that “40 to 90% of the total costs of a system are incurred after birth.” SRE is concerned with what happens after launch, aiming to help ensure that a product remains as usable as possible.

The most important element of SRE is system reliability and uptime. The greatest service in the world can’t do anyone much good if it’s not operational. SRE is therefore focused on minimizing downtime and creating reliable systems.

SRE teams also keep all elements of the product up to date through careful management of software and security updates. Standards and regulations are liable to change, and SRE teams help ensure continuous compliance.

SRE practices can also yield financial savings. Many of the core principles of SRE are concerned with efficiencies, such as automation and resource management, that can significantly reduce cost and effort.


The 7 principles of SRE

The seven principles of SRE are:

  • Embracing risk
  • Service level objectives
  • Eliminating toil
  • Monitoring
  • Automation
  • Release engineering
  • Simplicity

Embracing risk

While SRE is highly concerned with managing and limiting downtime, the goal is not for services to achieve perfect, 100% reliability. In fact, one of the key pillars of SRE is that 100% reliability is not only unrealistic, it’s not even necessarily a desirable outcome.

In SRE, risk is understood on a continuum, where reducing risk becomes exponentially more difficult and costly as reliability nears 100%. Moving from 99.99% reliability to 99.999% is much more difficult than moving from 80% to 99%. The resources needed to inch ever closer to 100% reduce a development team’s ability to perform other tasks, such as building new features and updates. Instead, teams set error budgets that define an acceptable amount of failure.

Another point against total reliability, however counterintuitive it seems, is that customers will typically not notice reliability improvements beyond a certain threshold. It’s not only costly; there’s also little reward. Ideally, a goal is set and met, but not exceeded excessively.

Instead, SRE uses availability metrics to measure the acceptability of downtime risk. In one metric, a 99.99% reliable year would include 52.6 minutes of downtime. More complex metrics consider the potential of downtime in one location or one element of a service while others remain up.

An SRE team must assess each service and determine an acceptable level of unreliability. How much downtime is allowable? Do different types of failures, arising from different root causes, have different effects on the user experience? How much, in terms of both labor and cost, would it take to exceed that level of reliability? Where is the balance?
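The arithmetic behind these availability targets is straightforward. Below is a minimal Python sketch (the targets shown are illustrative) that converts a yearly availability goal into a downtime allowance, consistent with the roughly 52.6 minutes per year that a 99.99% target permits:

```python
# Sketch: converting an availability target into a yearly downtime
# allowance (an "error budget" expressed in minutes).

MINUTES_PER_YEAR = 365.25 * 24 * 60  # ~525,960 minutes

def downtime_budget_minutes(availability: float) -> float:
    """Return the minutes of downtime per year an availability target allows."""
    return MINUTES_PER_YEAR * (1 - availability)

for target in (0.99, 0.999, 0.9999, 0.99999):
    print(f"{target:.3%} available -> {downtime_budget_minutes(target):.1f} min/year")
```

Note how each added "nine" shrinks the budget tenfold, which is why the last fractions of a percent are the most expensive to buy.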

Service level objectives (SLOs)

Choosing goals, and measuring how effectively those goals are met and why, is vital to an SRE team. A service level objective, or SLO, is a specific, measurable target that represents a level of quality that an SRE team has set as a goal. These SLOs can take the form of various metrics, but availability, query rate, error rate and response time are all common.

These objectives are measured by using a service level indicator, or SLI, which is a raw measurement of performance such as latency. So in that case, the SLI would be the latency metric, and the SLO would be for that metric to remain under a certain threshold. SLOs in turn can be part of a service level agreement, or SLA, which is a contract between provider and user that lays out the SLOs as well as the consequences for not meeting them.
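As a sketch of how an SLI feeds an SLO, the following Python snippet computes a latency SLI (the fraction of requests served below a threshold) and checks it against an SLO target. The 300 ms threshold, the 75% target and the sample latencies are invented for illustration:

```python
# Sketch: a latency SLI measured against an SLO threshold.
# Threshold, target and sample data are illustrative, not a standard.

def slo_compliance(latencies_ms, threshold_ms=300):
    """SLI: fraction of requests served under the latency threshold."""
    good = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return good / len(latencies_ms)

samples = [120, 250, 310, 95, 400, 180, 220, 290, 150, 275]
sli = slo_compliance(samples)
slo = 0.75  # SLO: at least 75% of requests under 300 ms
print(f"SLI = {sli:.0%}, SLO {'met' if sli >= slo else 'missed'}")
```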

Choosing SLOs can be tricky. Ideally, SLOs should be structured around what’s most important to users. For a cloud gaming service, for example, the SLO might revolve around low latency, but latency wouldn’t matter as much for an accounting service.

Ideally, a site reliability engineer would use relatively few SLOs to focus on achieving those goals; it’s most important to get the main task right. Setting realistic goals is also important; as we discussed earlier, perfection is neither a realistic nor a desired goal.

Eliminating toil

The creators of SRE make it a point to define “toil” as a category of labor that overlaps with, but is not the same as, work. Toil is work that scales linearly with service growth, tends to be manual and repetitive, and can often be accomplished through automation.

Work that must be done over and over is classified as toil; preferably, an individual task should only need one or two walkthroughs. Work that does not leave the product improved is also toil. “If your service remains in the same state after you have finished a task, the task was probably toil,” writes Vivek Rau of Google. Bug fixes, feature improvements and optimizations are not toil, but manually downloading metrics is toil. Incident response, which can include significant coordination among engineers and operations teams, is not toil, and incident management tactics should be planned ahead of release.
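To illustrate, the metrics download described above could be turned from recurring toil into a one-time engineering task with a small script. Everything here, the endpoint and the file layout, is hypothetical; the point is that the effort no longer scales with how often the task is performed:

```python
# Sketch: replacing a manual, repetitive metrics download with a
# once-written, repeatable script. Endpoint and paths are hypothetical.

import json
from datetime import date

METRICS_ENDPOINT = "https://metrics.example.com/api/daily"  # hypothetical URL

def archive_path(day: date) -> str:
    """Deterministic location for each day's snapshot, so nothing is decided by hand."""
    return f"metrics/{day.isoformat()}.json"

def save_snapshot(payload: dict, day: date, writer=open):
    """Persist one day's metrics; `writer` is injectable for testing."""
    path = archive_path(day)
    with writer(path, "w") as f:
        json.dump(payload, f)
    return path
```

Run from a scheduler (for example, cron), a script like this also removes the cognitive toil of remembering where the data lives and how it is named.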

There’s also cognitive toil. Have you ever had a basic recipe that you use every so often, but have to look up the ingredients and measurements each time? That’s cognitive toil: it’s a waste of time and effort to have to re-learn something over and over. Instead, SRE preaches the creation of more guides and standards, which eliminate the need to continually remember or re-learn methodologies and tasks.

Monitoring

One of the most important parts of site reliability engineering is monitoring: using tools to continually measure, analyze and improve core features and system performance. Those core features often include what’s referred to as the “four golden signals” of monitoring:

Latency: At its most basic, how long does it take to fulfill a request? Note that this can vary based on whether the request was successful; an error response can sometimes take significantly longer to serve than a successful one.

Traffic: How much load or demand is placed on the service? The specific units will vary; maybe it’s pageviews, maybe it’s transactions, maybe it's HTTP requests.

Errors: Typically measured by rate, errors can include fetching incorrect data, fetching data more slowly than laid out in an SLA, or failing to fetch at all.

Saturation: Essentially, saturation is a measure of how close to capacity a service is. Understanding saturation is important because some services will begin to fail, or to slow or produce more errors, as they approach 100% saturation.
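As a rough illustration, the four signals above can be derived from a window of request records. The record format, the capacity figure and the choice to average latency over successful requests only are invented for this sketch:

```python
# Sketch: deriving the four golden signals from a batch of request
# records. Fields and capacity are illustrative.

requests = [
    # (latency_seconds, status_code)
    (0.12, 200), (0.30, 200), (1.80, 500), (0.25, 200), (2.10, 500), (0.40, 200),
]
CAPACITY_RPS = 10  # hypothetical capacity of the service for this window

traffic = len(requests)                                   # signal 2: load on the service
errors = sum(1 for _, status in requests if status >= 500)
error_rate = errors / traffic                             # signal 3: failure rate
latency_ok = [lat for lat, status in requests if status < 500]
avg_latency = sum(latency_ok) / len(latency_ok)           # signal 1: successes only,
                                                          # since errors skew the number
saturation = traffic / CAPACITY_RPS                       # signal 4: fraction of capacity

print(f"traffic={traffic} error_rate={error_rate:.0%} "
      f"avg_latency={avg_latency:.2f}s saturation={saturation:.0%}")
```

Separating error latency from success latency reflects the caveat above: in this sample, the failed requests take several times longer than the successful ones.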

There are many monitoring tools that can collect data, set benchmarks, debug and analyze issues, provide useful observability dashboards and alert SREs to potential outages or other problems. It’s also important to provide robust postmortem reports after an incident is resolved, explaining any context around an incident, root causes and triggers, impact, resolution methodology and lessons for the future. A detailed, objective postmortem can be invaluable in avoiding the same mistake twice.

Automation

As with many other elements of modern technology, the goal of incorporating automation into a workflow is to free engineers from having to grapple with repetitive tasks that do not add value. With newly expanded free time, engineers can then work on tasks automation can’t complete: creation, ideation, large-scale guidance and more.

Automation can be especially valuable for the following goals:

Consistency: The downside of repetitive, manual tasks isn’t only that they can be boring and take time away from more valuable work. If those tasks, such as user account creation, are handled by automation tools, mistakes and inconsistencies can be nearly eliminated. A new employee might do things differently than an experienced one; a person might accidentally enter a value in the wrong field. An automated process (generally) will not.

Scalability: Scalability is a major long-term benefit of automation. Take the previous user account creation example: if account creation increases exponentially, the workload for the human responsible for account setup also increases exponentially, pulling that employee away from other, potentially more valuable, work. An automated system does not have this problem.

Speed: Certain tasks, such as finding and fixing bugs in code, can take a human a great deal of time. Automated software systems have the ability to monitor huge swathes of data, and can often detect errors more quickly than humans through advanced pattern recognition and other tools. Fixes can be applied just as quickly, often without any human involvement.
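As a sketch of the consistency point above, an automated account-creation step applies the same validation and normalization every single time. The field names and rules below are hypothetical:

```python
# Sketch: automated account creation that enforces consistency a manual
# process can't. Field names and validation rules are hypothetical.

import re

def create_account(raw: dict) -> dict:
    """Validate and normalize a new-account request the same way every time."""
    email = raw.get("email", "").strip().lower()
    if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", email):
        raise ValueError(f"invalid email: {email!r}")
    username = raw.get("username", "").strip().lower()
    if not re.fullmatch(r"[a-z0-9_]{3,32}", username):
        raise ValueError(f"invalid username: {username!r}")
    # Defaults are applied uniformly rather than left to memory.
    return {"email": email, "username": username, "role": raw.get("role", "viewer")}
```

Stray whitespace, inconsistent capitalization or a value in the wrong field is either normalized away or rejected, rather than silently stored.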

There are also, of course, dangers that lurk alongside any automation process. These include:

Upfront costs: Automations must be created before they can be deployed. This can take a significant amount of time, effort and even hardware costs. The value of automation must be considered as a balance between the effort to create it and the actual resources it will save once launched.

Maintenance: Automated tasks might seem as if they can run forever, but this is often not the case. Automation code must be kept up to date and in sync with other code and system updates. If new features are added, the automation code might need to also be updated through human intervention to include new actions or to prevent errors.

Artificial intelligence offers some new and exciting possibilities for SRE, most obviously in the realm of automation. Both upfront costs and maintenance can theoretically be modulated by new AI models. That said, AI also brings new potential pain points: hallucination, security and privacy, most notably.

Release engineering

Release engineering is a sub-discipline of software engineering specifically focused on the steps required to, well, release software. Those steps include versioning, release schedules, continuous or periodic builds, the selection and gathering of release metrics, and more. In SRE, release engineering is built in at the beginning, rather than as an afterthought; the goal is to avoid a haphazard assignation of release engineering tasks at the last minute.

Release engineering, as a discipline, includes several key principles. These include:

Automation and self-service: Ideally, many release processes can be automated, and require minimal or no interaction from engineers. This ensures fast and stable releases.

Velocity: In release engineering, a philosophy of rapid, frequent releases is preferred. By quickly rolling out releases, maybe even as often as hourly around launch, there are fewer changes between versions. This velocity enables easier testing and troubleshooting.

Hermetic builds: Build processes should be fully independent of the state of the build machine itself, relying instead on known, pinned versions of compilers, libraries and tools. “If two people attempt to build the same product at the same revision number in the source code repository on different machines, we expect identical results,” writes Dinah McNutt of Google.

Standards and policies: For security reasons, it’s vital that there are checks on certain tasks, including deployment, changes in source code, new releases and changes to build configuration.
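The hermetic-build expectation, that identical inputs produce bit-for-bit identical artifacts, can be checked mechanically by hashing build outputs. A minimal sketch, with the artifact bytes standing in for real build products:

```python
# Sketch: verifying that two independently produced build artifacts are
# bit-for-bit identical, which is the property hermetic builds aim for.

import hashlib

def artifact_digest(data: bytes) -> str:
    """Content hash of a build artifact."""
    return hashlib.sha256(data).hexdigest()

def builds_match(artifact_a: bytes, artifact_b: bytes) -> bool:
    """True if two artifacts are byte-identical, compared via their digests."""
    return artifact_digest(artifact_a) == artifact_digest(artifact_b)
```

In practice such digests are recorded per release, so a rebuild from the same revision can be compared against the original artifact without keeping the artifact itself.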

Simplicity

Much of site reliability engineering is in service of simplicity. Software is, writes Max Luebbe of Google, “inherently dynamic and unstable.” With that in mind, simplicity is key to minimizing potential issues and attempting to rein in that inherent instability.

To this end, site reliability engineering promotes various tasks that can simplify and clarify a project.

  1. Carefully selecting which features to include is helpful, but it can be just as helpful to simply delete all features which do not significantly add to the product’s utility. More features equates to more complexity.
  2. Smaller, more frequent releases enable much easier debugging and troubleshooting. A new release with dozens of new features could introduce errors that might be exceedingly difficult to track down. A release with one new feature? Any potential problems can only come from one place.
  3. Similarly, it can be tempting to add complexity to APIs through the use of multiple endpoints, microservices and more. This should be avoided. Simpler APIs are quicker to set up, require less documentation, and reduce integration time and costs.