
What is site reliability engineering (SRE)?

8 October 2024

Authors

Camilo Quiroz-Vázquez

IBM Staff Writer

Michael Goodwin

Editorial lead, Automation & ITOps

What is site reliability engineering (SRE)?

Site reliability engineering (SRE) is a software engineering practice that combines DevOps and traditional IT operations to solve customer problems, automate IT operations tasks, accelerate software delivery and minimize IT risk.

SRE supports resiliency, redundancy and reliability in the DevOps cycle and deals with the day-to-day implementation of software programs. Site reliability engineers generally follow the fifty-fifty rule: they dedicate half their time to solving customer problems, such as managing escalations and responding to incidents, and the other half to automating IT operations. These operations include production system management, change management, incident response and emergency response.

SRE teams bridge the gap between how software developers want programs to function and how they function in real-world situations. Site reliability engineers work directly with customers to troubleshoot their issues and collect data on user experience. SRE teams feed this data back to development teams, giving them deeper insight into how the software is performing and what updates need to be made.

SREs understand that failures are inevitable. Their job is both to identify the cause of immediate issues, through processes such as root cause analysis, and to use monitoring and logging data to predict potential future failures. They then set up automations to resolve these issues, building resiliency and redundancy into the system.

This automated oversight of large-scale software systems reduces the need for system administrators to manually complete IT operations tasks. Eliminating manual functions helps IT teams save time, execute operations tasks more accurately and focus on maintaining application performance.

How does site reliability engineering work?

Site reliability engineer is a technical position that requires experience in both software development and IT operations. Understanding both disciplines enables SRE teams to fulfill their role in supporting the software development lifecycle. SRE is based on a strategy of resiliency through the consistent automation of processes.

Traditionally, site reliability engineering practices focused on performing IT operations and system administration tasks. These tasks include analyzing logs, tuning performance, applying patches, testing production environments, managing incidents and conducting postmortems. They were initially done manually, which was time-consuming and prone to human error. The modernization of site reliability engineering involves automating these manual tasks.

Monitoring and logging play a key role in SRE. SRE teams use monitoring tools to track what is happening in software systems in real time. Monitoring makes it possible to fix immediate technical issues and helps teams anticipate future problems and address them before they occur.

Logs serve as archives that can be analyzed to gain insight into how systems are functioning and to improve system observability. Logging creates a record that helps SRE teams reconstruct the series of events that caused an unanticipated error. Engineers can then automate the remediation of the error and prevent it from recurring. Both monitoring and logging help engineers identify points of failure and resolve issues programmatically through automation so that they do not need to be fixed manually.
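As a rough illustration of how logs can be used to reconstruct the events leading up to an error, here is a minimal Python sketch. The log format, the sample entries and the `events_before_error` helper are all hypothetical and are not taken from any specific logging tool.

```python
from datetime import datetime, timedelta

# Hypothetical log lines in "timestamp level service message" form.
SAMPLE_LOGS = [
    "2024-10-08T10:00:00 INFO checkout deploy v2 started",
    "2024-10-08T10:03:10 WARN checkout connection pool at 90%",
    "2024-10-08T10:04:30 WARN checkout connection pool at 100%",
    "2024-10-08T10:04:45 ERROR checkout request timed out",
]


def parse_line(line):
    """Split one log line into (timestamp, level, service, message)."""
    stamp, level, service, message = line.split(" ", 3)
    return datetime.fromisoformat(stamp), level, service, message


def events_before_error(lines, window_minutes=5):
    """Return the events logged in the window leading up to the first ERROR."""
    events = [parse_line(line) for line in lines]
    first_error = next((e for e in events if e[1] == "ERROR"), None)
    if first_error is None:
        return []
    start = first_error[0] - timedelta(minutes=window_minutes)
    return [e for e in events if start <= e[0] <= first_error[0]]


timeline = events_before_error(SAMPLE_LOGS)
for stamp, level, service, message in timeline:
    print(stamp.isoformat(), level, service, message)
```

In this made-up sample, the timeline shows a deployment followed by a connection pool filling up shortly before the timeout, which is the kind of sequence an engineer would then try to detect and remediate automatically.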

SRE teams also look for system deficiencies through a process called chaos engineering, in which site reliability engineers intentionally cause failures in production and pre-production environments. The purpose of chaos engineering is to understand the impact that production failures have on software systems and to develop stronger plans to mitigate failures in the future.
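The idea of deliberately injecting failures can be sketched in a few lines of Python. This is a toy experiment under stated assumptions, not a description of any particular chaos engineering tool: `fetch_price` stands in for a real downstream call, and the wrapper, the retry policy and the 30% failure rate are all hypothetical.

```python
import random


class InjectedFailure(Exception):
    """Raised by the fault injector to simulate a failing dependency."""


def with_fault_injection(func, failure_rate, rng):
    """Wrap func so that a fraction of calls fail on purpose."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure("simulated dependency failure")
        return func(*args, **kwargs)
    return wrapper


def call_with_retries(func, attempts=3):
    """Call func, retrying on injected failures; return None if all attempts fail."""
    for _ in range(attempts):
        try:
            return func()
        except InjectedFailure:
            continue
    return None


def fetch_price():
    """Stand-in for a real downstream call (hypothetical)."""
    return 42


rng = random.Random(7)  # fixed seed so the experiment is repeatable
flaky_fetch = with_fault_injection(fetch_price, failure_rate=0.3, rng=rng)

results = [call_with_retries(flaky_fetch) for _ in range(1000)]
success_rate = sum(r is not None for r in results) / len(results)
print(f"success rate with retries: {success_rate:.3f}")
```

Comparing the success rate with and without retries (or with different failure rates) gives a simple measure of how well a mitigation holds up under the injected failures.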

SRE also focuses on capacity planning, a process that determines the resources that are needed to run essential business functions, scale those business functions and develop new applications and features. In addition, SRE teams establish metrics that are used to evaluate the delivery of updates and the implementation of new features.


Site reliability engineering metrics

Site reliability engineers use various metrics to help track the consistency of service delivery and the reliability of software systems, including:

Service level agreements (SLAs)

SLAs set the terms and conditions between a service provider and a customer. These agreements dictate the level of performance, the agreed-upon indicators for measuring performance and the repercussions for failing to deliver services. A common measure outlined in an SLA is uptime, or the amount of time a service is available.

Error budgets

The error budget is a tool that SRE teams use to automatically reconcile a company's service reliability with its pace of software development and innovation. Error budgets establish a level of error risk that is in line with the service level agreements.

An uptime target of 99.999%, known as “five-nines availability,” is a common SLA threshold. At that target, the monthly error budget (the total downtime allowable in a given month without contractual consequence) is only about 26 seconds; a 99.99% target allows about 4 minutes and 23 seconds. A development team that wants to implement new features or improvements can do so only while the system remains within its error budget.
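The allowable downtime follows directly from the uptime target and the length of the period. A minimal Python sketch of that arithmetic, assuming an average month of 365.25 / 12 days; the function names are illustrative only:

```python
from datetime import timedelta

# Average month length, assuming a 365.25-day year.
AVERAGE_MONTH = timedelta(days=365.25 / 12)


def error_budget(uptime_target_percent, period=AVERAGE_MONTH):
    """Return the downtime allowed in a period for a given uptime target."""
    allowed_fraction = 1 - uptime_target_percent / 100
    return period * allowed_fraction


def budget_remaining(uptime_target_percent, downtime_so_far, period=AVERAGE_MONTH):
    """Return how much of the error budget is left (negative if overspent)."""
    return error_budget(uptime_target_percent, period) - downtime_so_far


for target in (99.9, 99.99, 99.999):
    seconds = error_budget(target).total_seconds()
    print(f"{target}% uptime allows about {seconds:.0f} seconds of downtime per month")
```

Under these assumptions, a 99.99% target works out to roughly 4 minutes and 23 seconds of downtime per month, and a 99.999% target to roughly 26 seconds. A negative result from `budget_remaining` would signal that the budget for the period has already been spent.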

Error budgets help development teams and operations teams improve the stability and performance of services. They also help make data-driven decisions about deploying new features or applications and maximize innovation by taking risks within acceptable limits.

Service level objectives (SLOs)

SRE teams also help set service level objectives (SLOs), each an agreed-upon performance target for a particular service over a specified period. SLOs define the expected status of services and help stakeholders manage the health of specific services and meet SLAs.

Service level indicators (SLIs)

SLOs are measured by service level indicators (SLIs). SLIs are quantitative measurements that are presented as percentages, averages or rates. They include the actual measurement of services such as uptime, latency, throughput and error rates.
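To make the relationship between SLIs and SLOs concrete, here is a small Python sketch that computes two SLIs from request records and checks each against an SLO target. The `Request` type, the sample data, the 300 ms latency threshold and the target values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool  # True if the request succeeded


def availability_sli(requests):
    """Percentage of requests that succeeded."""
    return 100 * sum(r.ok for r in requests) / len(requests)


def latency_sli(requests, threshold_ms):
    """Percentage of requests served faster than the threshold."""
    fast = sum(r.latency_ms < threshold_ms for r in requests)
    return 100 * fast / len(requests)


def meets_slo(sli_value, slo_target):
    """An SLO is met when the measured SLI is at or above the target."""
    return sli_value >= slo_target


# Hypothetical measurements: 1,000 requests, 2 failures, 30 slow responses.
requests = (
    [Request(latency_ms=120, ok=True)] * 968
    + [Request(latency_ms=900, ok=True)] * 30
    + [Request(latency_ms=150, ok=False)] * 2
)

availability = availability_sli(requests)
fast_enough = latency_sli(requests, threshold_ms=300)
print(f"availability SLI: {availability:.1f}% (SLO 99.5% met: {meets_slo(availability, 99.5)})")
print(f"latency SLI: {fast_enough:.1f}% (SLO 99.0% met: {meets_slo(fast_enough, 99.0)})")
```

In this sample the availability objective is met while the latency objective is not, which shows why a service is usually tracked with more than one indicator.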


SRE and DevOps

DevOps is a software development methodology that accelerates the delivery of higher-quality applications and services by combining and automating the work of software development and IT operations teams. DevOps helps automate the software development lifecycle (SDLC), gives development and operations teams more shared responsibility and gives all relevant stakeholders input into the SDLC.

SRE and DevOps are complementary strategies in software engineering that break down silos and lead to more efficient and reliable software delivery.

While DevOps teams focus on answering the question, “What should this software do?” SRE teams work on answering, “How can this software be deployed and maintained so that it works as needed?” SRE teams provide DevOps teams with real-world data on software performance, balancing the theoretical world of software development with practical data.

Like SRE, DevOps makes enterprises more agile by balancing the need to deliver applications and changes faster with the need to avoid “breaking” the production environment. Both SRE and DevOps aim to achieve this balance by establishing an acceptable risk of errors. DevOps teams focus on making updates and deploying new features while SRE practices work to protect the reliability of systems as they scale.

DevOps and SRE teams streamline methods of communication and establish a constant feedback loop. Such a loop might work like this: When an SRE team uncovers the root cause of an error, it sends its findings to the DevOps team, which can develop an update for the next version of the software. In the interim, SREs build automations to resolve the issue and track monitoring and logging data to make sure that the issue has been resolved.

Benefits of SRE

In addition to supporting DevOps success, site reliability engineering can help organizations:

Gain greater visibility into service health by tracking metrics, logs and traces across all organizational services and strengthen root cause analysis capabilities.
Improve the reliability of software systems through day-to-day interactions with customers and the collaborative sharing of user data with DevOps teams.
Scale software systems by automating manual processes, which removes toil, reduces errors and solves problems more precisely.
Quantify the cost of downtime and outages by helping development and operations teams understand the cost of SLA violations, and helping management quantify the impact of system reliability on production, sales, marketing, customer service and other business functions.
Optimize incident response by building efficient on-call processes and streamlining alerting workflows.
Build a modern network operations center by combining in-depth understanding of IT operations with machine learning and automation to send alerts directly to the person responsible for addressing the issue.

SRE, cloud and cloud-native development

When organizations migrate from traditional IT and on-premises data centers to hybrid cloud, they often generate greater volumes of operational data. SRE plays a critical role in using this data to automate systems administration, operations and incident response and to improve enterprise reliability as the IT environment becomes more complex.

A cloud-native development approach—specifically, building applications as microservices and deploying them in containers—can simplify application development, deployment and scalability. But cloud-native development also creates an increasingly distributed environment that complicates administration, IT operations and management.

An SRE team can support the rapid pace of innovation enabled by a cloud-native approach and improve system reliability, without putting more operations pressure on DevOps teams.


