Site Reliability Engineering, the cloud approach to operations

Successful delivery of cloud applications requires more than a focus on agile development. Operations is just as essential for keeping users satisfied, keeping the service accessible and scaling with growth. Cloud operations also differs from traditional approaches to operations.

Cloud operations

Site Reliability Engineering (SRE) is the emerging cloud approach to operations. It seeks to fix issues through software engineering and automation. Its principles are as easy to apply at a single-person startup using Bluemix as they are at Google, where SRE originated, or at other large users like IBM or Facebook.

Cloud is different

Running cloud applications brings new challenges, ones that the traditional SysAdmin style of ops is not well suited to. In the enterprise, it is assumed that the underlying infrastructure has 99.999% availability and that applications can be scaled by adding more hardware. The ops focus is largely at the infrastructure level. Cloud, with its fixed ‘T-shirt’ sized services and commodity infrastructure, requires a different approach to managing reliability and application scaling. Here the focus for resolving issues is at the application level. Traditional ops, in this context, is no longer as effective.
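To put the ‘five nines’ figure in perspective, the short sketch below converts an availability target into the downtime it allows per year. The function name and the 365-day year are my own assumptions for illustration, not part of any SRE tooling.

```python
# Minutes in a year, assuming a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes(availability_percent):
    """Return the yearly downtime, in minutes, that a target permits."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% availability allows about {minutes:.1f} minutes per year")
```

At 99.999% the allowance is a little over five minutes a year, which is hard to promise on commodity infrastructure and is one reason reliability has to be handled at the application level.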

So what is SRE?

The creator of the term ‘Site Reliability Engineering’ (SRE), Ben Treynor of Google, explains SRE this way: “SRE is what happens when you ask a software engineer to design an operations team.” When I first read this, it was not immediately clear to me what he meant.

I see SRE as an approach that applies modern cloud design patterns to code lasting solutions to service issues. The focus is at the application level, with automation used to manage the infrastructure layer. Typically, the SRE role involves some or all of the following tasks:

  • eliminating performance bottlenecks by refactoring services into more scalable units.
  • isolating failures through the use of cloud-native design patterns like the ‘circuit breaker’ and ‘bulkhead’, made popular by Netflix’s Hystrix.
  • creating runbooks to ensure fast service recovery.
  • automating day-to-day ops processes.
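
As an illustration of the ‘circuit breaker’ pattern mentioned in the list above, here is a minimal sketch in Python. It is not Hystrix and does not reproduce its API; the class name, the parameters (max_failures, reset_after) and the injectable clock are all hypothetical names chosen for this example. The idea is that after a number of consecutive failures the breaker ‘opens’ and rejects calls immediately, so a failing dependency is not hammered further, and after a cool-down it lets one trial call through.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Minimal, single-threaded circuit breaker (illustrative only)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable so tests can control time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Still cooling down: fail fast without calling the dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Open (or re-open after a failed trial call) and restart the timer.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each remote call, for example breaker.call(fetch_profile, user_id) where fetch_profile is whatever function talks to the dependency, and catch CircuitOpenError to serve a fallback. A production version would also need thread safety and metrics, which this sketch leaves out.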

The goal of SRE is to make systems more reliable.

For the best user satisfaction, Dev and SRE need to work together to deliver the application performance and reliability that businesses desire. They use the same CI/CD delivery pipelines and release processes, but each focuses on its own measure of success: Dev on the speed of releasing new functions, Ops on maintaining reliability. The conflict between these priorities is a topic I will cover in a future post.

Dev and SRE are complementary disciplines in application delivery

SRE is for all sizes

The SRE approach is appropriate for all cloud users. It is not just for large cloud-native users like IBM, Twitter, Google or LinkedIn. A review of the sessions from the USENIX SRE conferences reveals many examples of smaller adopters, including Stack Overflow and ContaAzul.

The background to SRE, along with Google’s approach and practices, can be found in the book of the same name. This is the de facto standard on how to adopt SRE. If you do not want to pay real money, you can find the book online at

Bluemix Toolchains will support an SRE function in any size of organization. Services relevant to SRE include:

I suggest having a look at how the Bluemix Garage website uses this toolset.

In future posts I will dig into SRE in more detail.

Adoption Leader - Watson Cloud Platform
