Site Reliability Engineering, the cloud approach to operations
Successful delivery of cloud applications requires more than a focus on agile development. Operations is also essential to maintaining user satisfaction and access, and to scaling with growth. Cloud operations differs from traditional approaches to operations.
Site Reliability Engineering (SRE) is the emergent cloud approach to operations. It seeks to fix issues through software engineering and automation. Its principles are as easy to apply at a single-person startup using Bluemix as they are at Google, where SRE originated, or at other large users like IBM or Facebook.
Cloud is different
Running cloud applications brings new challenges, ones that the traditional SysAdmin style of ops is not well suited to. In the enterprise, it is assumed that the underlying infrastructure has 99.999% availability and that applications can be scaled by adding more hardware, so the ops focus is largely at the infrastructure level. Cloud, which uses fixed ‘T-shirt’ sized services and commodity infrastructure, requires a different approach to managing reliability and application scaling. Here, issues are resolved at the application level. In this context, traditional ops is no longer as effective.
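To put the 99.999% figure in perspective, here is a small back-of-the-envelope sketch. It converts an availability percentage into allowed downtime per year, and shows how availability compounds when a request has to pass through several components in a row. The function names and the 99.9% figure for the individual components are my own illustrative assumptions, not numbers from any provider.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # ignoring leap years


def downtime_minutes_per_year(availability_percent):
    """Minutes per year a service may be down at the given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def serial_availability(*availability_percents):
    """Availability of a chain in which every component must be up."""
    combined = 1.0
    for percent in availability_percents:
        combined *= percent / 100
    return combined * 100


if __name__ == "__main__":
    # "Five nines" leaves roughly five minutes of downtime per year.
    print(round(downtime_minutes_per_year(99.999), 2))
    # Three hypothetical 99.9% components in series (assumed values).
    print(round(serial_availability(99.9, 99.9, 99.9), 4))
```

The second calculation is the reason the focus moves to the application: when each building block is less than perfect, the chain is less reliable than any single link unless the application itself is designed to tolerate failures.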
So what is SRE?
Ben Treynor of Google, who coined the term ‘Site Reliability Engineering’ (SRE), explains it this way: “SRE is what happens when you ask a software engineer to design an operations team.” When I first read this, it was not immediately clear to me what he meant.
I see SRE as an approach that applies modern cloud design patterns to code lasting solutions to service issues. The focus is at the application level, with automation used to manage the infrastructure layer. Typically, the SRE role involves some or all of the following tasks:
eliminating performance bottlenecks by refactoring services into more scalable units.
isolating failures through cloud-native design patterns such as the ‘circuit breaker’ and ‘bulkhead’ made popular by Netflix’s Hystrix.
creating runbooks to ensure fast service recovery.
automating day-to-day ops processes.
The goal of SRE is to make systems more reliable.
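To make the ‘circuit breaker’ pattern from the list above more concrete, here is a minimal sketch. It is not Hystrix's implementation or API; the class name, parameters and defaults are hypothetical choices for illustration. The idea is that after a number of consecutive failures the breaker "opens" and rejects calls immediately, so a failing dependency cannot tie up the caller. After a timeout it lets one trial call through to see whether the dependency has recovered.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    # Hypothetical defaults; tune them for the service being protected.
    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Still open: fail fast without touching the dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Open on too many failures, or re-open if the trial call failed.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each remote call, for example breaker.call(fetch_profile, user_id), where fetch_profile stands in for any function that talks to another service, and would treat CircuitOpenError as the signal to return a fallback response. A production breaker would also need thread safety and metrics, which this sketch leaves out.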
For the best user satisfaction, Dev and SRE need to work together to deliver the application performance and reliability that businesses desire. They use the same CI/CD delivery pipelines and release processes, but each focuses on its own metrics of success: Dev on the speed of releasing new functions, Ops on maintaining reliability. The conflict between these priorities is a topic I will cover in a future post.
SRE is for all sizes
The SRE approach is appropriate for all cloud users. It is not just for large cloud-native users like IBM, Twitter, Google or LinkedIn. A review of the sessions from the USENIX SRE conferences reveals many examples of smaller adopters, including Stack Overflow and ContaAzul.
The backdrop to SRE and Google’s approach and practices can be found in the book of the same name. This is the de facto standard on how to adopt SRE. If you do not want to pay real money, you can find the book online at https://landing.google.com/sre/book.html.
Bluemix Toolchains will support an SRE function in any size of organization. Services relevant to SRE include:
I suggest having a look at how the Bluemix Garage website uses this toolset.
In future posts I will dig into SRE in more detail.