By: IBM Cloud Education

SRE leverages operations data and software engineering to automate IT operations tasks, and to accelerate software delivery while minimizing IT risk.

What is site reliability engineering?

Site reliability engineering (SRE) uses software engineering to automate IT operations tasks - e.g. production system management, change management, incident response, even emergency response - that would otherwise be performed manually by systems administrators (sysadmins). 

The principle behind SRE is that using software code to automate oversight of large software systems is a more scalable and sustainable strategy than manual intervention - especially as those systems extend or migrate to the cloud.

SRE can also reduce or remove much of the natural friction between development teams who want to continually release new or updated software into production, and operations teams who don't want to release any type of update or new software without being absolutely sure it won't cause outages or other operations problems. As a result, while not strictly required for DevOps, SRE aligns closely with DevOps principles and can play an important role in DevOps success.

The concept of SRE is credited to Ben Treynor Sloss, VP of engineering at Google, who famously wrote that "SRE is what happens when you ask a software engineer to design an operations team."

Who are site reliability engineers and what do they do?

A site reliability engineer is a software developer with IT operations experience - someone who knows how to code, and who also understands how to 'keep the lights on' in a large-scale IT environment. 

Site reliability engineers spend no more than half their time performing manual IT operations and system administration tasks - analyzing logs, performance tuning, applying patches, testing production environments, responding to incidents, conducting postmortems - and spend the rest of their time developing code that automates those tasks. Their goal is to shift steadily more of their time from the former to the latter.

At a higher level, the SRE team serves as a bridge between development teams and operations teams, enabling the development team to bring new software or new features to production as quickly as possible, while also ensuring an agreed-upon acceptable level of IT operations performance and error risk in line with the service level agreements (SLAs) the organization has in place with its customers. Based on their experience and a wealth of operations data, the SRE team helps the development and operations teams establish

  • Service level indicators (SLIs): Measurements of the service level provided by systems - metrics such as availability (uptime) or latency
  • Service level objectives (SLOs): Agreed-upon target values or ranges for service level indicators, such as 99.99% availability
  • Error budgets: The maximum amount of time a system can fail or underperform without violating the contractual terms of the SLA. More than a metric, the error budget is the tool a site reliability engineering team uses to automatically reconcile a company's pace of innovation with its service reliability. 
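To make the relationship between an SLI and an SLO concrete, here is a minimal Python sketch. It assumes availability is measured as the share of successful requests in a window, which is only one possible way to define the indicator. The names `AvailabilityWindow`, `availability_sli` and `meets_slo`, and the request counts, are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class AvailabilityWindow:
    """Hypothetical request counts for one measurement window."""
    total_requests: int
    failed_requests: int


def availability_sli(window: AvailabilityWindow) -> float:
    """SLI: fraction of requests served successfully in the window."""
    if window.total_requests == 0:
        return 1.0  # assumption: a window with no traffic counts as available
    return 1 - window.failed_requests / window.total_requests


def meets_slo(sli: float, slo_target: float) -> bool:
    """SLO check: the measured SLI must be at or above the agreed target."""
    return sli >= slo_target


SLO_TARGET = 0.9999  # assumed target: 99.99% of requests succeed

good_week = AvailabilityWindow(total_requests=1_000_000, failed_requests=50)
bad_week = AvailabilityWindow(total_requests=1_000_000, failed_requests=250)

print(meets_slo(availability_sli(good_week), SLO_TARGET))  # True
print(meets_slo(availability_sli(bad_week), SLO_TARGET))   # False
```

The indicator is the measurement, and the objective is the threshold that measurement is compared against. The error budget is whatever gap remains between the two.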

How do error budgets work?

The error budget is the tool an SRE team uses to automatically reconcile a company's service reliability with its pace of software development and innovation. 

Suppose a company's SLA promises 99.99% uptime (a common availability target) per year. That means the monthly error budget - the total amount of downtime allowable without contractual consequence for any given month - is about 4 minutes and 23 seconds.
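The arithmetic behind that figure can be checked with a short Python sketch. It assumes an average year of 365.25 days and an even split of the yearly budget across 12 months. The function name `error_budget` is hypothetical.

```python
from datetime import timedelta

DAYS_PER_YEAR = 365.25  # assumption: average year length, including leap years


def error_budget(availability_target: float, period: timedelta) -> timedelta:
    """Downtime allowed in `period` for a target such as 0.9999 (99.99%)."""
    allowed_fraction = 1 - availability_target
    return timedelta(seconds=allowed_fraction * period.total_seconds())


yearly_budget = error_budget(0.9999, timedelta(days=DAYS_PER_YEAR))
monthly_budget = yearly_budget / 12  # assumption: budget split evenly by month

print(yearly_budget)   # a little under 53 minutes per year
print(monthly_budget)  # roughly 263 seconds, about 4 minutes 23 seconds
```

Each additional "nine" in the target divides the budget by ten, so 99.999% would leave only about 26 seconds per month under the same assumptions.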

Now let's say the development team wants to roll out some new features or improvements to the system. If the system is running under the error budget, the team can deliver the new features. If not, the team can't deliver the new features until they work with the operations team to get these errors or outages down to an acceptable level.
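The release decision described above can be sketched as a simple gate. This is an illustration under stated assumptions, not a description of any particular tool: the function name `can_release` and the downtime figures are hypothetical, and the monthly budget is the 4 minutes and 23 seconds from the example above.

```python
from datetime import timedelta


def can_release(budget: timedelta, downtime_so_far: timedelta) -> bool:
    """Allow a release only while consumed downtime is still under the budget."""
    return downtime_so_far < budget


monthly_budget = timedelta(minutes=4, seconds=23)

# Hypothetical month with 90 seconds of downtime: budget remains, so ship.
print(can_release(monthly_budget, timedelta(minutes=1, seconds=30)))  # True

# Hypothetical month with 5 minutes of downtime: budget spent, so hold.
print(can_release(monthly_budget, timedelta(minutes=5)))  # False
```

In practice a team might wire a check like this into its deployment pipeline, so that the decision to ship is driven by measured downtime rather than by negotiation between teams.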

In this way, error budgets help development teams and operations teams to

  • Improve the stability and performance of services
  • Make data-driven decisions about deploying new features or applications
  • Maximize innovation by taking risks within acceptable limits

SRE and DevOps

DevOps is a modern way to deliver higher quality applications faster - by automating the software delivery lifecycle, and by giving development and operations teams more shared responsibility and more input into each other’s work. 

Like SRE, DevOps makes a business more agile by balancing the need to deliver more applications and changes faster with the need to avoid 'breaking' the production environment. And like SRE, DevOps aims to achieve this balance by establishing an acceptable risk of errors. In fact, SRE and DevOps seem so similar that some experts say they're the same thing - but most see SRE practices as excellent ways to implement DevOps principles. For example:

DevOps principles: Reduce organizational silos, leverage tooling and automation

SRE practice: Use the same tooling to automate and improve operations as developers use to develop and improve software

DevOps principles: Accept failure as normal, implement gradual changes

SRE practice: Use error budgets to continually deploy new features and functionality within acceptable levels of risk

DevOps principle: Measure everything

SRE practice: Base decisions to release new software on SLA metrics

Other SRE benefits

In addition to supporting DevOps success, site reliability engineering can help a company

  • Gain greater visibility into service health by tracking metrics, logs and traces across all services in the organization, and providing context for identifying root causes in the event of an incident.
  • Quantify the cost of downtime by helping development and operations teams understand the cost of SLA violations, and helping management quantify the impact of system reliability on production, sales, marketing, customer service and other business functions.
  • Optimize incident response by building efficient on-call processes and streamlining alerting workflows.
  • Build a modern network operations center by combining in-depth understanding of IT operations with machine learning and automation, to send alerts directly to the person responsible for addressing the issue.

SRE, cloud and cloud-native development

Migration from traditional IT and on-premises data centers to hybrid cloud environments is one of the chief reasons that the average enterprise generates two to three times more operations data every year. Increasingly, SRE is seen as being critical for leveraging this data to automate systems administration, operations and incident response, and to improve enterprise reliability even as the IT environment becomes more complex.

A cloud-native development approach - specifically, building applications as microservices and deploying them in containers - can simplify application development, deployment and scalability. But cloud-native development also creates an increasingly distributed environment that complicates administration, operations and management. An SRE team can support the rapid pace of innovation enabled by a cloud-native approach and ensure or improve system reliability, without putting additional operations pressure on DevOps teams.

SRE and IBM Cloud

IBM Watson AIOps pulls together operations data across siloed IT stacks and tools, to give your SRE team a holistic view of your entire IT environment. It also provides powerful artificial intelligence (AI) for predicting and proactively resolving problems before they become incidents. With Watson AIOps you can gain a deeper understanding of metrics and events, anticipate and calculate risks, and automate your IT operations to reduce risks and lower costs. 
