What is site reliability engineering (SRE)?
What is SRE?

Site reliability engineering (SRE) uses software engineering to automate IT operations tasks, such as production system management, change management, incident response and even emergency response, that would otherwise be performed manually by systems administrators (sysadmins).

The principle behind SRE is that using software code to automate oversight of large software systems is a more scalable and sustainable strategy than manual intervention - especially as those systems extend or migrate to the cloud.

SRE can also reduce or remove much of the natural friction between development teams, who want to continually release new or updated software into production, and operations teams, who don't want to release any update or new software without being sure it won't cause outages or other operations problems. As a result, while not strictly required for DevOps, SRE aligns closely with DevOps principles and can play an important role in DevOps success.

The concept of SRE is credited to Ben Treynor Sloss, VP of engineering at Google, who famously wrote that "SRE is what happens when you ask a software engineer to design an operations team."

What do site reliability engineers do?

A site reliability engineer is a software developer with IT operations experience—someone who knows how to code and who understands how to 'keep the lights on' in a large-scale IT environment. 

Site reliability engineers spend up to half their time performing manual IT operations and system administration tasks: analyzing logs, performance tuning, applying patches, testing production environments, responding to incidents and conducting postmortems. The rest of their time, they develop code that automates those tasks. Their goal is to spend less time on the former and more time on the latter.

At a higher level, the SRE team serves as a bridge between development teams and operations teams, enabling the development team to bring new software or new features to production as quickly as possible. They do this while also ensuring an agreed-upon acceptable level of IT operations performance and error risk in line with the service level agreements (SLAs) the organization has in place with its customers. Based on their experience and a wealth of operations data, the SRE team helps the development and operations teams establish

  • Service level indicators (SLIs): Measurements of the service level provided by systems—metrics such as availability (uptime) or latency.

  • Service level objectives (SLOs): Agreed-upon target values or ranges for service level indicators, for example 99.9% availability measured over a month.

  • Error budgets: The maximum amount of time a system can fail or underperform without violating the contractual terms of the SLA.
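
The relationship between an SLI and an SLO can be illustrated with a minimal Python sketch. It assumes availability is measured as the fraction of successful requests in a window; the `RequestWindow` type, the function names and the request counts are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class RequestWindow:
    """Request counts for one measurement window (hypothetical shape)."""
    total: int
    successful: int


def availability_sli(window: RequestWindow) -> float:
    """Return the fraction of requests served successfully (the SLI)."""
    if window.total == 0:
        # Assumption: with no traffic, count the window as fully available.
        return 1.0
    return window.successful / window.total


def meets_slo(sli: float, slo_target: float) -> bool:
    """An SLO is a target for an SLI; it is met when the SLI reaches it."""
    return sli >= slo_target


# Made-up numbers: 600 failed requests out of one million.
window = RequestWindow(total=1_000_000, successful=999_400)
sli = availability_sli(window)
print(f"availability SLI: {sli:.4%}")             # availability SLI: 99.9400%
print("meets 99.9% SLO:", meets_slo(sli, 0.999))  # meets 99.9% SLO: True
```

Here the SLI is what is measured and the SLO is the threshold it is compared against; a real system would typically derive the counts from monitoring data rather than hard-coded values.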

How do error budgets work?

The error budget is the tool an SRE team uses to automatically reconcile a company's service reliability with its pace of software development and innovation. 

Suppose a company's SLA promises 99.99% uptime (a common availability target) per year. That means the monthly error budget—the total amount of downtime allowable without contractual consequence for any given month—is about 4 minutes and 23 seconds.
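
The arithmetic behind that figure can be checked with a short Python sketch. It assumes an average month of 365.25 / 12 days, which is the month length that reproduces the number above; the function name `error_budget` is illustrative, not part of any standard tool.

```python
from datetime import timedelta

# Assumption: an "average" month, i.e. a 365.25-day year split into 12 months.
AVERAGE_MONTH = timedelta(days=365.25 / 12)


def error_budget(availability_target: float, period: timedelta) -> timedelta:
    """Allowed downtime in `period` for a target such as 0.9999 (99.99%)."""
    if not 0.0 <= availability_target <= 1.0:
        raise ValueError("availability_target must be between 0 and 1")
    return period * (1.0 - availability_target)


budget = error_budget(0.9999, AVERAGE_MONTH)
minutes, seconds = divmod(round(budget.total_seconds()), 60)
print(f"monthly error budget: {minutes} min {seconds} s")
# monthly error budget: 4 min 23 s
```

With a fixed 30-day month the same calculation gives roughly 4 minutes and 19 seconds, so the exact figure depends on how the period is defined.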

Now let's say the development team wants to roll out some new features or improvements to the system. If the system is running under the error budget, the team can deliver the new features. If not, the team can't deliver the new features until they work with the operations team to get these errors or outages down to an acceptable level.

In this way, error budgets help development teams and operations teams to

  • Improve the stability and performance of services.

  • Make data-driven decisions about deploying new features or applications.

  • Maximize innovation by taking risks within acceptable limits.
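
The release decision described above can be sketched in Python as a simple gate. This is only an illustration under stated assumptions: downtime is tracked as a single duration per month, and the names `remaining_budget` and `can_release` are hypothetical. Real teams usually apply a richer policy than a single comparison.

```python
from datetime import timedelta


def remaining_budget(budget: timedelta, downtime_so_far: timedelta) -> timedelta:
    """Error budget left in the current period (negative if overspent)."""
    return budget - downtime_so_far


def can_release(budget: timedelta, downtime_so_far: timedelta) -> bool:
    """Allow a release only while some error budget remains."""
    return remaining_budget(budget, downtime_so_far) > timedelta(0)


# Monthly budget for a 99.99% target, as computed in the example above.
monthly_budget = timedelta(minutes=4, seconds=23)

print(can_release(monthly_budget, timedelta(minutes=1, seconds=30)))  # True
print(can_release(monthly_budget, timedelta(minutes=5)))              # False
```

When the gate returns False, the development and operations teams would first work on reducing errors or outages before shipping new features.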

SRE and DevOps

DevOps is a modern way to deliver higher quality applications faster - by automating the software delivery lifecycle, and by giving development and operations teams more shared responsibility and more input into each other’s work. 

Like SRE, DevOps makes a business more agile by balancing the need to deliver more applications and changes faster with the need to avoid 'breaking' the production environment. And like SRE, DevOps aims to achieve this balance by establishing an acceptable risk of errors. In fact, SRE and DevOps seem so similar that some experts say they're the same thing—but most see SRE practices as excellent ways to implement DevOps principles. For example:

DevOps principles: Reduce organizational silos, leverage tooling and automation.

SRE practice: Use the same tooling to automate and improve operations as developers use to develop and improve software.

DevOps principles: Accept failure as normal, implement gradual changes.

SRE practice: Use error budgets to continually deploy new features and functionality within acceptable levels.

DevOps principle: Measure everything.

SRE practice: Base decisions to release new software on SLA metrics.

Other SRE benefits

In addition to supporting DevOps success, site reliability engineering can help a company

  • Gain greater visibility into service health by tracking metrics, logs and traces across all services in the organization and by providing context for identifying root causes in the event of an incident.

  • Quantify the cost of downtime by helping development and operations teams understand the cost of SLA violations, and helping management quantify the impact of system reliability on production, sales, marketing, customer service and other business functions.

  • Optimize incident response by building efficient on-call processes and streamlining alerting workflows.

  • Build a modern network operations center by combining in-depth understanding of IT operations with machine learning and automation, to send alerts directly to the person responsible for addressing the issue.

SRE, cloud and cloud-native development

Migration from traditional IT and on-premises data centers to hybrid cloud environments is one of the chief reasons that the average enterprise generates two to three times more operations data every year. Increasingly, SRE is seen as being critical for leveraging this data to automate systems administration, operations and incident response, and to improve enterprise reliability even as the IT environment becomes more complex.

A cloud-native development approach (specifically, building applications as microservices and deploying them in containers) can simplify application development, deployment and scalability. But cloud-native development also creates an increasingly distributed environment that complicates administration, operations and management. An SRE team can support the rapid pace of innovation enabled by a cloud-native approach and ensure or improve system reliability, without putting more operations pressure on DevOps teams.
