Site reliability engineering (SRE) uses software engineering to automate IT operations tasks, such as production system management, change management, incident response and even emergency response, that would otherwise be performed manually by systems administrators (sysadmins).
The principle behind SRE is that using software code to automate oversight of large software systems is a more scalable and sustainable strategy than manual intervention - especially as those systems extend or migrate to the cloud.
SRE can also reduce or remove much of the natural friction between development teams, who want to continually release new or updated software into production, and operations teams, who don't want to release any update or new software without being sure it won't cause outages or other operations problems. As a result, while not strictly required for DevOps, SRE aligns closely with DevOps principles and can play an important role in DevOps success.
The concept of SRE is credited to Ben Treynor Sloss, VP of engineering at Google, who famously wrote that "SRE is what happens when you ask a software engineer to design an operations team."
A site reliability engineer is a software developer with IT operations experience—someone who knows how to code and who understands how to 'keep the lights on' in a large-scale IT environment.
Site reliability engineers spend half their time performing manual IT operations and system administration tasks: analyzing logs, performance tuning, applying patches, testing production environments, responding to incidents and conducting postmortems. The rest of their time, they develop code that automates those tasks. Their goal is to spend less time on the former and more time on the latter.
At a higher level, the SRE team serves as a bridge between development teams and operations teams, enabling the development team to bring new software or new features to production as quickly as possible. They do this while also ensuring an agreed-upon acceptable level of IT operations performance and error risk in line with the service level agreements (SLAs) the organization has in place with its customers. Based on their experience and a wealth of operations data, the SRE team helps the development and operations teams establish
The error budget is the tool an SRE team uses to automatically reconcile a company's service reliability with its pace of software development and innovation.
Suppose a company's SLA promises 99.99% uptime per year (a common availability target). Spread evenly across the year, that leaves a monthly error budget, the total downtime allowable in any given month without contractual consequence, of about 4 minutes and 23 seconds (0.01% of an average month).
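The arithmetic behind that figure can be sketched in a few lines of Python. This is only an illustration: the function name `error_budget` and the choice of an average month of 365.25 / 12 days are assumptions made for the example, not part of any SLA or standard tool.

```python
from datetime import timedelta

# Assumption for this sketch: an "average" month is 1/12 of a 365.25-day year.
AVERAGE_MONTH = timedelta(days=365.25 / 12)
AVERAGE_YEAR = timedelta(days=365.25)


def error_budget(availability_target: float, period: timedelta) -> timedelta:
    """Return the downtime allowed in `period` for a target such as 0.9999."""
    if not 0.0 <= availability_target <= 1.0:
        raise ValueError("availability_target must be between 0 and 1")
    allowed_seconds = period.total_seconds() * (1.0 - availability_target)
    return timedelta(seconds=allowed_seconds)


monthly = error_budget(0.9999, AVERAGE_MONTH)  # roughly 4 minutes 23 seconds
yearly = error_budget(0.9999, AVERAGE_YEAR)    # roughly 52.6 minutes
print("monthly budget:", monthly)
print("yearly budget:", yearly)
```

With a different month length (for example, exactly 30 days) the result shifts by a few seconds, which is why the figure is stated as approximate.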
Now let's say the development team wants to roll out some new features or improvements to the system. If the system has used less downtime than the error budget allows, the team can deliver the new features. If not, the team can't deliver the new features until they work with the operations team to get these errors or outages down to an acceptable level.
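The release decision described above can be expressed as a simple check. The following is a minimal sketch, assuming downtime is already tracked per month; the `ErrorBudget` class and its field names are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class ErrorBudget:
    """Hypothetical record of one month's budget (names are illustrative)."""
    allowed: timedelta    # downtime the SLA tolerates this month
    consumed: timedelta   # downtime observed so far this month

    @property
    def remaining(self) -> timedelta:
        # A negative value means the budget has been overspent.
        return self.allowed - self.consumed

    def can_release(self) -> bool:
        # New features may ship only while the system is under budget.
        return self.consumed < self.allowed


under_budget = ErrorBudget(allowed=timedelta(seconds=263),
                           consumed=timedelta(seconds=90))
over_budget = ErrorBudget(allowed=timedelta(seconds=263),
                          consumed=timedelta(seconds=300))

print(under_budget.can_release(), under_budget.remaining)
print(over_budget.can_release(), over_budget.remaining)
```

In practice the consumed figure would come from monitoring data, and a team might choose a stricter rule, such as holding releases once most of the budget is spent. That policy is a team decision the sketch does not try to model.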
In this way, error budgets help development teams and operations teams to
DevOps is a modern way to deliver higher quality applications faster - by automating the software delivery lifecycle, and by giving development and operations teams more shared responsibility and more input into each other’s work.
Like SRE, DevOps makes a business more agile by balancing the need to deliver more applications and changes faster with the need to avoid 'breaking' the production environment. And like SRE, DevOps aims to achieve this balance by establishing an acceptable risk of errors. In fact, SRE and DevOps seem so similar that some experts say they're the same thing—but most see SRE practices as excellent ways to implement DevOps principles. For example:
DevOps principles: Reduce organizational silos, leverage tooling and automation.
SRE practice: Use the same tooling to automate and improve operations as developers use to develop and improve software.
DevOps principles: Accept failure as normal, implement gradual changes.
SRE practice: Use error budgets to continually deploy new features and functionality within acceptable levels.
DevOps principle: Measure everything.
SRE practice: Base decisions to release new software on SLA metrics.
In addition to supporting DevOps success, site reliability engineering can help a company
Migration from traditional IT and on-premises data centers to hybrid cloud environments is one of the chief reasons that the average enterprise generates two to three times more operations data every year. Increasingly, SRE is seen as being critical for leveraging this data to automate systems administration, operations and incident response, and to improve enterprise reliability even as the IT environment becomes more complex.
A cloud-native development approach—specifically, building applications as microservices and deploying them in containers—can simplify application development, deployment and scalability. But cloud-native development also creates an increasingly distributed environment that complicates administration, operations and management. An SRE team can support the rapid pace of innovation enabled by a cloud-native approach and ensure or improve system reliability, without putting more operations pressure on DevOps teams.