What Is Site Reliability Engineering (SRE)?

By Michael Goodwin

Site reliability engineering, defined

Site reliability engineering (SRE) is a software engineering practice that combines DevOps and traditional IT operations to solve customer problems, automate IT operations tasks, accelerate software delivery and minimize IT risk.

SRE supports resiliency, redundancy and reliability in the DevOps cycle and deals with the day-to-day implementation of software programs. Site reliability engineers generally follow the fifty-fifty rule: they dedicate half their time to solving customer problems, such as managing escalations and responding to incidents, and the other half to automating IT operations. These operations include production system management, change management, incident response and emergency response.

SRE teams bridge the gap between how software developers want programs to function and how they function in real-world situations. Site reliability engineers work directly with customers to troubleshoot their issues and collect data on user experience. SRE teams feed this data back to development teams, giving them deeper insight into how the software is performing and what updates need to be made.

SREs understand that failures are inevitable. Their job is both to identify the cause of immediate issues, through processes such as root cause analysis, and to use monitoring and logging data to predict potential future failures. Then, they set up automations to solve these issues, building resiliency and redundancy into the system.

This automated oversight of large-scale software systems reduces the need for system administrators to manually complete IT operations tasks. Eliminating manual functions helps IT teams save time, execute operations tasks more accurately and focus on maintaining application performance.

How does site reliability engineering work?

Site reliability engineer is a technical position that requires experience in both software development and IT operations. Understanding both disciplines enables SRE teams to fulfill their role in supporting the software development lifecycle. SRE is based on a strategy of resiliency through the consistent automation of processes.

Traditionally, site reliability engineering practices focused on performing IT operations and system administration tasks. These tasks included analyzing logs, tuning performance, applying patches, testing production environments, managing incidents and conducting postmortems. They were initially done manually, which was time-consuming and prone to human error. The modernization of site reliability engineering involves the automation of these manual tasks.

Monitoring and logging play a key role in SRE. SRE teams use monitoring tools to track what is happening in software systems in real time. Monitoring makes it possible to fix immediate technical issues and helps teams anticipate future problems and address them before they occur.
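One way monitoring data can help anticipate a problem is simple trend extrapolation. The sketch below is illustrative only: it assumes a hypothetical metric (for example, disk usage in percent) sampled hourly, fits a least-squares line to it and estimates how long until a limit is reached. The function name `hours_until_limit` and the sample values are assumptions, not part of any monitoring product.

```python
def hours_until_limit(samples, limit):
    """Estimate hours from the last sample until a metric reaches `limit`.

    `samples` is a list of (hour, value) pairs. Returns None when the
    fitted trend is flat or decreasing, so no crossing is predicted.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must cover more than one point in time")
    # Ordinary least-squares slope and intercept.
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_v - slope * mean_t
    crossing_time = (limit - intercept) / slope
    last_time = samples[-1][0]
    # A limit that is already exceeded is reported as zero hours away.
    return max(0.0, crossing_time - last_time)


# Hypothetical disk usage (%), sampled once an hour.
disk_usage = [(0, 50.0), (1, 52.0), (2, 54.0), (3, 56.0)]
print(hours_until_limit(disk_usage, limit=90.0))
```

A straight-line fit is a deliberately crude choice; real workloads are often seasonal or bursty, so a forecast like this is better treated as an early warning than as a prediction.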

Logs serve as archives that can be analyzed to gain insights on how systems are functioning and improve system observability. Logging creates a roadmap that helps SRE teams understand the series of events that caused an unanticipated error.

Once the cause of an error is understood, engineers can automate its remediation and prevent it from recurring. Both monitoring and logging help engineers identify points of failure and programmatically solve issues through automation so that they do not need to be fixed manually.
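As a rough illustration of turning log data into an automated response, the following sketch counts ERROR entries per service and flags services that cross a threshold. The log format, the service names and the threshold are all assumptions made for the example; a real system would hand the flagged services to whatever remediation step the team has defined.

```python
import re
from collections import Counter

# Assumed log format: "<timestamp> <LEVEL> <service>: <message>"
LOG_LINE = re.compile(
    r"^(?P<ts>\S+) (?P<level>[A-Z]+) (?P<service>[\w-]+): (?P<msg>.*)$"
)


def count_errors(lines):
    """Count ERROR-level entries per service, skipping unparseable lines."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match["level"] == "ERROR":
            counts[match["service"]] += 1
    return counts


def services_to_remediate(lines, threshold=3):
    """Return the services whose error count meets or exceeds the threshold."""
    return sorted(
        service
        for service, count in count_errors(lines).items()
        if count >= threshold
    )


# Hypothetical log excerpt.
sample_logs = [
    "2024-01-01T10:00:00Z INFO checkout: request served",
    "2024-01-01T10:00:01Z ERROR checkout: database connection timed out",
    "2024-01-01T10:00:02Z ERROR checkout: database connection timed out",
    "2024-01-01T10:00:03Z ERROR search: index unavailable",
    "2024-01-01T10:00:04Z ERROR checkout: database connection timed out",
]

for service in services_to_remediate(sample_logs):
    # Placeholder for a real action such as restarting a connection pool.
    print(f"remediation triggered for {service}")
```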

SRE teams also look for system deficiencies through a process called chaos engineering. Chaos engineering is a strategy that site reliability engineers implement to intentionally cause failures in production and pre-production environments. The purpose of chaos engineering is to understand the impact production failures have on software systems and to develop stronger plans to mitigate failures in the future.
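The idea can be sketched in miniature. The example below is a toy experiment, not a description of any particular chaos engineering tool: it wraps a hypothetical dependency call (`fetch_price`) so that it fails at a chosen rate, then measures how often a simple retry policy still succeeds. All names and rates are assumptions for illustration.

```python
import random


class InjectedFault(Exception):
    """Raised by the chaos wrapper to simulate a dependency failure."""


def with_faults(func, failure_rate, rng):
    """Wrap `func` so that each call fails with probability `failure_rate`."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("simulated failure")
        return func(*args, **kwargs)
    return wrapper


def call_with_retries(func, attempts=3):
    """Call `func` up to `attempts` times, re-raising the last injected fault."""
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except InjectedFault as exc:
            last_error = exc
    raise last_error


def fetch_price():
    """Hypothetical dependency call that always succeeds on its own."""
    return 42


def run_experiment(trials=1000, failure_rate=0.3, seed=7):
    """Return the fraction of trials that succeed despite injected faults."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    flaky = with_faults(fetch_price, failure_rate, rng)
    successes = 0
    for _ in range(trials):
        try:
            call_with_retries(flaky)
            successes += 1
        except InjectedFault:
            pass
    return successes / trials


print(f"success rate with retries: {run_experiment():.3f}")
```

With three attempts and independent 30% failures, the expected success rate is 1 - 0.3³, or about 97%. An experiment like this shows whether the mitigation (here, retries) is enough for the reliability target.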

SRE also focuses on capacity planning, a process that determines the resources that are needed to run essential business functions, scale those business functions and develop new applications and features. In addition, SRE teams establish metrics that are used to evaluate the delivery of updates and the implementation of new features.
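A back-of-the-envelope capacity estimate might look like the sketch below. The inputs (peak requests per second, per-instance throughput, a headroom fraction and a count of spare instances) are assumed planning parameters chosen for illustration; actual capacity planning typically draws on load tests and historical traffic.

```python
import math


def instances_needed(peak_rps, rps_per_instance, headroom=0.25, redundancy=1):
    """Estimate how many instances are needed to serve peak traffic.

    headroom:   extra fraction of capacity reserved for growth and spikes
    redundancy: spare instances kept so that a failure does not cut capacity
    """
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    if peak_rps < 0 or headroom < 0 or redundancy < 0:
        raise ValueError("inputs must not be negative")
    required = peak_rps * (1 + headroom) / rps_per_instance
    return math.ceil(required) + redundancy


# Hypothetical service: 1,200 requests/s at peak, 200 requests/s per instance.
print(instances_needed(peak_rps=1200, rps_per_instance=200))
```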

Site reliability engineering metrics

Site reliability engineers use various metrics to help track the consistency of service delivery and the reliability of software systems, including:

Service level agreements (SLAs)

SLAs set the terms and conditions between a service provider and a customer. These agreements dictate the level of performance, the agreed-upon indicators for measuring performance and the repercussions for failing to deliver services. A common service that is outlined in an SLA is uptime, or the amount of time a service is available.

Error budgets

The error budget is a tool that SRE teams use to automatically reconcile a company’s service reliability with its pace of software development and innovation. Error budgets establish a level of error risk that is in line with the service level agreements.

An uptime target of 99.999%, known as “five-nines” availability, is a common SLA threshold. That means the monthly error budget (the total amount of downtime allowable without contractual consequence for a specific month) is only about 26 seconds; a 99.99% target, by comparison, allows about 4 minutes and 23 seconds. If a development team wants to implement new features or improvements to a system, the system must still be within its error budget.
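The arithmetic behind a downtime budget is straightforward to check. The sketch below assumes an average month of 365.2425 / 12 days and treats the budget as pure downtime; both are simplifying assumptions, and the function name `error_budget` is made up for the example.

```python
from datetime import timedelta

# Average Gregorian month length; an assumption made for illustration.
AVG_MONTH = timedelta(days=365.2425 / 12)


def error_budget(availability_pct, period=AVG_MONTH):
    """Return the downtime allowed in `period` for an availability target (%)."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be between 0 and 100")
    allowed_fraction = 1 - availability_pct / 100
    return timedelta(seconds=period.total_seconds() * allowed_fraction)


for target in (99.9, 99.99, 99.999):
    budget = error_budget(target)
    print(f"{target}% -> {budget.total_seconds():.1f} seconds per month")
```

Under these assumptions, a 99.999% target comes to roughly 26 seconds of downtime per month and a 99.99% target to roughly 4 minutes and 23 seconds, which shows how sharply each additional nine shrinks the budget.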

Error budgets help development teams and operations teams improve the stability and performance of services. They also help teams make data-driven decisions about deploying new features or applications and maximize innovation by taking risks within acceptable limits.
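A data-driven release decision of this kind can be sketched as a simple gate. The policy below (allow releases while less than a set fraction of the budget has been consumed) is one hypothetical policy among many, and the class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    budget_seconds: float    # downtime allowed for the period
    downtime_seconds: float  # downtime already observed in the period

    @property
    def remaining_seconds(self):
        return self.budget_seconds - self.downtime_seconds

    @property
    def consumed_fraction(self):
        if self.budget_seconds <= 0:
            raise ValueError("budget_seconds must be positive")
        return self.downtime_seconds / self.budget_seconds


def may_release(status, freeze_threshold=1.0):
    """Allow risky changes only while consumed budget is below the threshold."""
    return status.consumed_fraction < freeze_threshold


# Hypothetical month: 26.3 s of budget, 10 s already spent.
status = BudgetStatus(budget_seconds=26.3, downtime_seconds=10.0)
print(may_release(status), round(status.remaining_seconds, 1))
```

Lowering `freeze_threshold` (for example, to 0.8) makes the gate more conservative by pausing feature releases before the budget is fully spent.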

Service level objectives (SLOs)

SRE teams also help set service level objectives (SLOs), which are agreed-upon performance targets for a particular service over a specified period. SLOs define the expected status of services and help stakeholders manage the health of specific services and meet SLAs.

Service level indicators (SLIs)

SLOs are measured by service level indicators (SLIs). SLIs are quantitative measurements that are presented as percentages, averages or rates. They include the actual measurement of services such as uptime, latency, throughput and error rates.
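To make the relationship concrete, the sketch below computes two SLIs (error rate and a latency percentile) from a handful of hypothetical request records and compares them with assumed SLO targets. The record fields, sample values and targets are illustrative, and the percentile uses the simple nearest-rank method.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool


def error_rate(requests):
    """Fraction of requests that failed."""
    if not requests:
        raise ValueError("no requests to measure")
    failed = sum(1 for r in requests if not r.ok)
    return failed / len(requests)


def latency_percentile(requests, pct):
    """Nearest-rank percentile of request latency in milliseconds."""
    if not requests:
        raise ValueError("no requests to measure")
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical measurements for one service.
latencies = [120, 95, 110, 480, 105, 130, 98, 102, 115, 900]
sample = [Request(ms, ok=(ms < 900)) for ms in latencies]

# Assumed SLO targets: at most 5% errors, 90th-percentile latency under 500 ms.
slis = {
    "error_rate": error_rate(sample),
    "p90_latency_ms": latency_percentile(sample, 90),
}
meets_slo = slis["error_rate"] <= 0.05 and slis["p90_latency_ms"] < 500
print(slis, meets_slo)
```

In this sample the latency objective is met but the error-rate objective is not, so the service as a whole misses its SLO.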

SRE and DevOps

DevOps is a software development methodology that accelerates the delivery of higher-quality applications and services by combining and automating the work of software development and IT operations teams. DevOps helps automate the software development lifecycle (SDLC), gives development and operations teams more shared responsibility and gives all relevant stakeholders input into the SDLC.

SRE and DevOps are complementary strategies in software engineering that break down silos and lead to more efficient and reliable software delivery.

While DevOps teams focus on answering the question, “What should this software do?” SRE teams work on answering, “How can this software be deployed and maintained so that it works as needed?” SRE teams provide DevOps teams with real-world data on software performance, bringing a balance of practical data to the theoretical world of software development.

Like SRE, DevOps makes enterprises more agile by balancing the need to deliver applications and changes faster with the need to avoid “breaking” the production environment. Both SRE and DevOps aim to achieve this balance by establishing an acceptable risk of errors. DevOps teams focus on making updates and deploying new features while SRE practices work to protect the reliability of systems as they scale.

DevOps and SRE teams streamline methods of communication and establish a constant feedback loop. Such a loop might work like this: when an SRE team uncovers the root cause of an error, it sends its findings to the DevOps team, which can develop an update for the next version of the software.

In the interim, SREs build automations to solve the issue and track monitoring and logging data to make sure that the issue has been resolved.

Benefits of SRE

In addition to supporting DevOps success, site reliability engineering can help organizations:

Gain greater visibility into service health by tracking metrics, logs and traces across all organizational services and strengthen root cause analysis capabilities.
Improve the reliability of software systems through day-to-day interactions with customers and the collaborative sharing of user data with DevOps teams.
Scale software systems by automating manual processes, which removes toil, reduces errors and solves problems more precisely.
Quantify the cost of downtime and outages by helping development and operations teams understand the cost of SLA violations. Also, it can help management quantify the impact of system reliability on production, sales, marketing, customer service and other business functions.
Optimize incident response by building efficient on-call processes and streamlining alerting workflows.
Build a modern network operations center by combining in-depth understanding of IT operations with machine learning and automation to send alerts directly to the person responsible for addressing the issue.

SRE, cloud and cloud-native development

When organizations migrate from traditional IT and on-premises data centers to hybrid cloud, they often generate greater volumes of operational data. SRE plays a critical role in using this data to automate systems administration, operations and incident response and to improve enterprise reliability as the IT environment becomes more complex.

A cloud-native development approach—specifically, building applications as microservices and deploying them in containers—can simplify application development, deployment and scalability. But cloud-native development also creates an increasingly distributed environment that complicates administration, IT operations and management.

An SRE team can support the rapid pace of innovation enabled by a cloud-native approach and improve system reliability, without putting more operations pressure on DevOps teams.

Authors

Camilo Quiroz-Vázquez

IBM Staff Writer

Michael Goodwin

Staff Editor, Automation & ITOps

IBM Think
