Site reliability engineering, or SRE, is an approach that treats operations issues as if they were software issues. It was originally named and described in 2003 by Ben Treynor Sloss, an engineer at Google. As a discipline, SRE aims to maintain a particular system’s availability, performance and efficiency.
SRE can be hard to nail down. It’s an approach or discipline rather than a prescriptive set of tasks, and it takes different forms based on the needs of a given organization. Luckily, there are seven principles of site reliability engineering that can help guide an SRE team to success.
Much of software development is rightfully focused on creation; DevOps, a related but distinct field, is more concerned with a product’s entire lifecycle. But the job is hardly complete when the system launches. In the preface of Google's guide to SRE, the authors note that “40 to 90% of the total costs of a system are incurred after birth.” SRE is concerned with what happens after launch, aiming to help ensure that a product remains as usable as possible.
The most important element of SRE is system reliability and uptime. The greatest service in the world can’t do anyone much good if it’s not operational. SRE is therefore focused on minimizing downtime and creating reliable systems.
SRE teams also ensure that all elements of the product stay up to date through careful management of software and security updates. Standards and regulations are liable to change, and SRE teams help ensure continuous compliance.
SRE practices can also yield financial savings. Many of the core principles of SRE are concerned with efficiencies, including automation and resource management, that can lead to significant savings of cost and effort.
The seven principles of SRE include embracing risk, setting service level objectives, eliminating toil, monitoring, automation, release engineering and simplicity.
While SRE is highly concerned with managing and limiting downtime, that doesn’t mean the goal is for services to maintain perfect, 100% reliability. In fact, one of the key pillars of SRE is that 100% reliability is not only unrealistic, it’s not even necessarily a desirable outcome.
In SRE, risk is understood on a continuum, where reducing risk becomes exponentially more difficult and costly as reliability nears 100%. Moving from 99.99% reliable to 99.999% reliable is much harder than moving from 80% to 99%. The resources needed to inch ever closer to 100% reduce a development team’s ability to perform other tasks, like building new features and updates. Instead of chasing perfection, teams set error budgets that define an acceptable amount of failure.
Another point against total reliability, however counterintuitive it seems, is that customers typically won’t notice reliability improvements beyond a certain threshold. Pushing past that threshold is not only costly; there’s also little reward. Ideally, a goal is set and met, but not exceeded by a wide margin.
To quantify acceptable downtime, SRE uses availability metrics. By one common measure, a service that is 99.99% reliable over a year can accumulate no more than about 52.6 minutes of downtime. More nuanced metrics account for partial outages, in which one location or one element of a service goes down while others remain up.
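To make that arithmetic concrete, here is a minimal Python sketch that converts a reliability target into an annual downtime budget; the specific targets listed are only illustrative examples, not recommendations from the article:

```python
# Convert an availability target into an allowable downtime budget.
# Illustrative sketch; the targets below are hypothetical examples.

MINUTES_PER_YEAR = 365.25 * 24 * 60  # ~525,960 minutes

def downtime_budget_minutes(availability: float) -> float:
    """Return the maximum minutes of downtime per year for a given target."""
    return (1.0 - availability) * MINUTES_PER_YEAR

for target in (0.99, 0.999, 0.9999, 0.99999):
    print(f"{target:.3%} available -> {downtime_budget_minutes(target):.1f} min/year")

# Output (approximate):
# 99.000% available -> 5259.6 min/year
# 99.900% available -> 526.0 min/year
# 99.990% available -> 52.6 min/year
# 99.999% available -> 5.3 min/year
```

The output makes the diminishing-returns argument visible: each additional nine shrinks the downtime allowance by roughly a factor of ten, while the engineering effort to stay within it keeps growing.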
An SRE team must assess each service and determine an acceptable level of unreliability. How much downtime is allowable? Do different types of failures, arising from different root causes, have different effects on the user experience? How much, in both labor and money, would it cost to push reliability beyond that level? Where is the balance?
Choosing goals, and measuring how effectively those goals are met and why, is vital to an SRE team. A service level objective, or SLO, is a specific, measurable target that represents a level of quality that an SRE team has set as a goal. These SLOs can take the form of various metrics, but availability, query rate, error rate and response time are all common.
These objectives are measured using a service level indicator, or SLI, which is a raw measurement of performance such as latency. In that case, the SLI is the measured latency, and the SLO is the target for that metric to remain under a certain threshold. SLOs in turn can be part of a service level agreement, or SLA, a contract between provider and user that lays out the SLOs as well as the consequences for not meeting them.
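As a rough illustration of how the two relate, here is a minimal Python sketch; the latency threshold, target fraction and sample data are invented for the example, not drawn from any particular service:

```python
# Illustrative sketch: compute a latency SLI from raw measurements and
# compare it against an SLO. The threshold and sample data are hypothetical.

request_latencies_ms = [87, 102, 95, 310, 120, 98, 76, 450, 101, 89]

SLO_THRESHOLD_MS = 300      # target: requests should complete in under 300 ms
SLO_TARGET_FRACTION = 0.95  # ...for at least 95% of requests

# The SLI is the raw measurement, aggregated: here, the fraction of
# requests that completed under the latency threshold.
sli = sum(1 for ms in request_latencies_ms if ms < SLO_THRESHOLD_MS) / len(request_latencies_ms)

print(f"SLI: {sli:.1%} of requests under {SLO_THRESHOLD_MS} ms")
print("SLO met" if sli >= SLO_TARGET_FRACTION else "SLO missed")
```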
Choosing SLOs can be tricky. Ideally, SLOs should be structured around what’s most important to users. For a cloud gaming service, for example, the SLO might revolve around low latency, but latency wouldn’t matter as much for an accounting service.
Ideally, a site reliability engineer uses relatively few SLOs, concentrating on the goals that matter most; it’s most important to get the main task right. Setting realistic targets is also important; as discussed earlier, perfection is neither a realistic nor a desirable goal.
The creators of SRE make a point of defining “toil” as a category of labor that overlaps with, but is not the same as, work. Toil consists of operational tasks that are manual and repetitive, that scale linearly as a service grows, and that can often be handled by automation.
Work that must be done over and over is classified as toil; preferably, an individual task should only need one or two walkthroughs. Work that does not leave the product improved is also toil. “If your service remains in the same state after you have finished a task, the task was probably toil,” writes Vivek Rau of Google. Bug fixes, feature improvements and optimizations are not toil, but manually downloading metrics is toil. Incident response, which can include significant coordination among engineers and operations teams, is not toil, and incident management tactics should be planned ahead of release.
There’s also cognitive toil. Have you ever had a basic recipe that you use every so often, but have to look up the ingredients and measurements each time? That’s cognitive toil: it’s a waste of time and effort to have to re-learn something over and over. Instead, SRE preaches the creation of more guides and standards, which eliminate the need to continually remember or re-learn methodologies and tasks.
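As a concrete illustration of turning toil into automation, the manual metrics download mentioned above could be replaced by a small scheduled script. This is only a sketch; the endpoint and file naming are hypothetical placeholders:

```python
# Minimal sketch of replacing a manual, repetitive task (downloading metrics)
# with a script. The URL and output path are hypothetical placeholders.
import datetime
import urllib.request

METRICS_URL = "https://metrics.example.com/daily.csv"  # hypothetical endpoint

def fetch_daily_metrics() -> str:
    """Download today's metrics and save them with a dated filename."""
    filename = f"metrics-{datetime.date.today().isoformat()}.csv"
    with urllib.request.urlopen(METRICS_URL) as response:
        data = response.read()
    with open(filename, "wb") as f:
        f.write(data)
    return filename

if __name__ == "__main__":
    # Run from cron or another scheduler instead of by hand; the manual step is gone.
    print(f"Saved {fetch_daily_metrics()}")
```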
One of the most important parts of site reliability engineering is monitoring: using tools to continually measure, analyze and improve core features and system performance. Those core features often include what’s referred to as the “four golden signals” of monitoring:
Latency: At its most basic, how long does it take to fulfill a request? Note that this can vary based on whether the request was successful or not; sometimes an error response can take significantly longer to serve.
Traffic: How much load or demand is placed on the service? The specific units will vary; maybe it’s pageviews, maybe it’s transactions, maybe it's HTTP requests.
Errors: Typically measured by rate, errors can include fetching incorrect data, fetching data more slowly than laid out in an SLA, or failing to fetch at all.
Saturation: Essentially, saturation is a measure of how close to capacity a service is. Understanding saturation is important because some services will begin to slow down, produce more errors, or fail outright as they approach 100% saturation.
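To make the four signals concrete, here is a minimal Python sketch that derives each one from a batch of request records; the data model, time window and capacity figure are hypothetical:

```python
# Illustrative sketch: deriving the four golden signals from a batch of
# request records. The data model and capacity figure are hypothetical.
from dataclasses import dataclass

@dataclass
class Request:
    latency_ms: float
    status: int  # HTTP-style status code

WINDOW_SECONDS = 60
requests = [Request(95, 200), Request(120, 200), Request(310, 500),
            Request(88, 200), Request(101, 429)]

# Latency: how long requests take (a simple average here; percentiles are common).
latency = sum(r.latency_ms for r in requests) / len(requests)

# Traffic: demand placed on the service, e.g. requests per second.
traffic_rps = len(requests) / WINDOW_SECONDS

# Errors: rate of failed requests (5xx responses in this sketch).
error_rate = sum(1 for r in requests if r.status >= 500) / len(requests)

# Saturation: how "full" the service is, e.g. current load vs. capacity.
CAPACITY_RPS = 0.5  # hypothetical capacity
saturation = traffic_rps / CAPACITY_RPS

print(f"latency={latency:.0f} ms, traffic={traffic_rps:.2f} rps, "
      f"errors={error_rate:.0%}, saturation={saturation:.0%}")
```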
There are many monitoring tools that can collect data, set benchmarks, debug and analyze issues, provide useful observability dashboards and alert SREs to potential outages or other problems. It’s also important to provide robust postmortem reports after an incident is resolved, explaining any context around an incident, root causes and triggers, impact, resolution methodology and lessons for the future. A detailed, objective postmortem can be invaluable in avoiding the same mistake twice.
As with many other elements of modern technology, the goal of incorporating automation into a workflow is to free engineers from having to grapple with repetitive tasks that do not add value. With newly expanded free time, engineers can then work on tasks automation can’t complete: creation, ideation, large-scale guidance and more.
Automation can be especially valuable for the following goals:
Consistency: The downside of repetitive, manual tasks isn’t only that they can be boring and take time away from more valuable work. If those tasks, such as user account creation, are handled by automation tools, mistakes and inconsistencies can be nearly eliminated. A new employee might do things differently than an old one; a user might accidentally enter a value in the wrong field. An automated process (generally) will not.
Scalability: Scalability is a major long-term benefit of automation. Take the previous user account creation example: if account creation increases exponentially, the workload for the human responsible for account setup also increases exponentially, pulling that employee away from other, potentially more valuable, aspects of the job. An automated system doesn’t have this problem; a minimal sketch of this account-creation pattern appears after this list.
Speed: Certain tasks, such as finding and fixing bugs in code, can take a human a great deal of time. Automated software systems have the ability to monitor huge swathes of data, and can often detect errors more quickly than humans through advanced pattern recognition and other tools. Fixes can be applied just as quickly, often without any human involvement.
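Here is the account-creation sketch referenced above, showing how an automated path applies the same validation and defaults every time; the field names, rules and roles are hypothetical:

```python
# Sketch of the account-creation example: an automated path enforces the
# same validation and defaults every time. Field names and rules are hypothetical.
import re

def create_account(username: str, email: str, role: str = "viewer") -> dict:
    """Validate inputs and return a normalized account record."""
    if not re.fullmatch(r"[a-z0-9_]{3,32}", username):
        raise ValueError(f"invalid username: {username!r}")
    if "@" not in email:
        raise ValueError(f"invalid email: {email!r}")
    if role not in {"viewer", "editor", "admin"}:
        raise ValueError(f"unknown role: {role!r}")
    # Every account gets the same defaults; no field is forgotten or mistyped.
    return {"username": username, "email": email.lower(), "role": role, "mfa_enabled": True}

# The same function handles one signup or ten thousand; the human effort stays flat.
print(create_account("new_hire_42", "New.Hire@example.com"))
```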
There are also, of course, dangers that lurk alongside any automation process. These include:
Upfront costs: Automations must be created before they can be deployed, which can take significant time, effort and even hardware spending. The value of automation must be weighed as a balance between the effort to create it and the resources it will actually save once launched.
Maintenance: Automated tasks might seem as if they can run forever, but this is often not the case. Automation code must be kept up to date and in sync with other code and system updates. If new features are added, the automation code might also need to be updated, through human intervention, to cover new actions or to prevent errors.
Artificial intelligence offers some new and exciting possibilities for SRE, most obviously in the realm of automation. Both upfront costs and maintenance burdens can theoretically be reduced by new AI models. That said, AI also brings new potential pain points: hallucination, security and privacy, most notably.
Release engineering is a sub-discipline of software engineering specifically focused on the steps required to, well, release software. Those steps include versioning, release schedules, continuous or periodic builds, the selection and gathering of release metrics, and more. In SRE, release engineering is built in at the beginning rather than as an afterthought; the goal is to avoid a haphazard assignment of release engineering tasks at the last minute.
Release engineering, as a discipline, includes several key principles. These include:
Automation and self-service: Ideally, many release processes can be automated, and require minimal or no interaction from engineers. This ensures fast and stable releases.
Velocity: In release engineering, a philosophy of rapid, frequent releases is preferred. Rolling out releases quickly, perhaps even hourly around a launch, means there are fewer changes between versions, which makes testing and troubleshooting easier.
Hermetic builds: Build processes should be fully independent of the build machine itself, relying on known, pinned versions of compilers, libraries and tools rather than whatever happens to be installed. “If two people attempt to build the same product at the same revision number in the source code repository on different machines, we expect identical results,” Dinah McNutt of Google writes.
Standards and policies: For security reasons, it’s vital that there are checks on certain tasks, including deployment, changes in source code, new releases and changes to build configuration.
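As a loose illustration of how automation, self-service and policy checks can come together, here is a minimal sketch of a release gate; the checks, field names and policy values are hypothetical, not a description of any particular vendor's pipeline:

```python
# Illustrative sketch of a simple release gate: the checks, names and
# policy values here are hypothetical.
from dataclasses import dataclass

@dataclass
class ReleaseCandidate:
    version: str
    source_revision: str
    tests_passed: bool
    approved_by: set

REQUIRED_APPROVERS = 1  # hypothetical policy

def may_deploy(candidate: ReleaseCandidate) -> bool:
    """Enforce the same release policy for every build, with no manual exceptions."""
    if not candidate.tests_passed:
        return False
    if len(candidate.approved_by) < REQUIRED_APPROVERS:
        return False
    return True

rc = ReleaseCandidate("1.4.2", "rev-8f3c1a", tests_passed=True, approved_by={"reviewer_a"})
print(f"deploy {rc.version}?", may_deploy(rc))
```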
Much of site reliability engineering is in service of simplicity. Software is, writes Max Luebbe of Google, “inherently dynamic and unstable.” With that in mind, simplicity is key to minimizing potential issues and attempting to rein in that inherent instability.
To this end, site reliability engineering promotes various tasks that can simplify and clarify a project.