99.95% availability. Balancing release velocity and reliability

5 min read

By: Steve Strutt


Availability and reliability are rarely at the front of developers' minds when delivering new applications on Bluemix. The ease and speed of creating and deploying new features are very seductive.

In the focus on agile delivery, I have found that reliability can sometimes be an afterthought. Either it is assumed to be an operations issue, or it is assumed that the platform will handle it. Cloud, with its ‘system of systems’, requires a more thoughtful approach to designing in reliability, along with consideration of operations and how service levels are maintained.

The challenge for any delivery team is how to manage service levels: weighing the users' need for reliable access against the business need for new function. Where should the effort go? The challenge is more a cultural one than a technical one.

This is the heart of the problem I address in this blog. How can teams balance these demands using Site Reliability Engineering (SRE) as the operational approach?

I introduced the topic of SRE in my earlier post on Site Reliability Engineering, the cloud approach to Ops. It is a widely adopted approach to running and operating cloud-native applications and services, used by IBM to operate Bluemix and IBM's cloud services, and by companies from Facebook to Twitter. One of the core tenets of SRE is to enable delivery teams to ‘Pursue Maximum Change Velocity Without Violating a Service’s SLO’ (Service Level Objective).

The point is to maintain the users’ satisfaction in their ability to use the service. This is a real dilemma for product and service owners when the competition could be a click away.

So how does SRE help to balance these intents?

Using Error Budgets

There has always been tension between Dev and Ops. Developers work on new function and business capabilities; operations is tasked with ensuring service availability and user access. The tension between the two groups is compounded by the use of different metrics for success: the former is measured on new features and rate of change, the latter on service levels and user satisfaction.


SRE explicitly calls out this conflict between dev and ops and puts in place an arbitration method, an Error Budget or Quota. This is the agreed ‘unreliability’ of a service over a fixed time period, say a month or quarter. Developers can now release new code as long as it is not too ‘unreliable’. In this context, I am using unreliability as shorthand for a set of metrics that define service reliability and usability.
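To make the idea concrete, here is a minimal sketch (my own illustration, not taken from the SRE book) of an error budget expressed as the allowed unreliability over a fixed window. For an availability target such as the 99.95% in the title, the budget over a 30-day month is about 21.6 minutes.

```python
# Minimal sketch: an error budget is the unreliability an SLO permits over a window.
# The 99.95% target and 30-day window are illustrative values.

def error_budget_minutes(availability_slo: float, window_days: int = 30) -> float:
    """Downtime (in minutes) permitted by an availability SLO over the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - availability_slo) * window_minutes

print(error_budget_minutes(0.9995))  # -> 21.6 minutes per month
```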

The lower the code quality and hence reliability, the more the focus is on resilience and increasing reliability. Higher reliability means development has more freedom to deliver new function.

The point is that the ability to use the service, in terms of availability, response time and throughput, is as important to users as functionality, correct execution and ease of use.

Aligned DevOps Objectives

An Error Budget aligns the goals and stresses shared ownership of the satisfaction and service level metrics between ops and product development. This is a major cultural change from traditional operations and its relationship to development. It is the joint ownership of the objective by both halves of the DevOps team that drives the desired behaviors.

Before looking at how service levels are managed using Error Budgets, it is worth defining some terms:

  • Service Level Agreements (SLAs): Externally facing, these define a contractual level of service to be delivered, or a penalty is paid. They are useful for management and service providers, but not for SRE.

  • Service Level Objectives (SLOs): Define target levels of service, measured by one or more Service Level Indicators (SLIs). SLOs are usually set higher than SLAs, to give engineering the flexibility to address exceptions before the SLA is violated.

  • Service Level Indicators (SLIs): Metrics, such as latency, throughput, or error rate, that indicate how well a service is performing.

Details on how to define SLOs and Error Budgets can be found in the book Site Reliability Engineering.
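As an illustrative sketch only (the names and values are hypothetical, not from the book), the relationship between these terms can be captured in a few lines: an SLO is a target on an SLI, set tighter than the contractual SLA so engineering can react before the SLA is breached.

```python
# Hypothetical sketch of the SLA / SLO / SLI relationship.
from dataclasses import dataclass

@dataclass
class ServiceLevelTarget:
    sli: str              # the indicator being measured, e.g. request latency
    good_fraction: float  # fraction of events that must meet the threshold
    threshold_ms: int     # per-request threshold applied to the SLI

# Internal engineering objective (SLO) is stricter than the external contract (SLA),
# leaving headroom to fix problems before penalties apply.
latency_slo = ServiceLevelTarget(sli="request_latency", good_fraction=0.95, threshold_ms=1000)
latency_sla = ServiceLevelTarget(sli="request_latency", good_fraction=0.90, threshold_ms=1000)
```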

Managing Reliability

For the sake of simplicity, I will use a single metric with a very relaxed SLO as an example of how to manage prioritization.


The SLO is for 95% of requests to return in under 1 second, so up to 5% can exceed 1s without the SLO being violated. This 5% is the Error Budget.
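A rough sketch of how consumption of that budget could be calculated from raw response times (the helper and the sample numbers are mine, not a specific monitoring API):

```python
# Sketch: fraction of the 5% error budget consumed, given observed response times.

def budget_consumed(latencies_ms, threshold_ms=1000.0, slo_target=0.95):
    """Returns 1.0 when the budget is exhausted, i.e. 5% of requests exceeded 1s."""
    if not latencies_ms:
        return 0.0
    slow = sum(1 for t in latencies_ms if t > threshold_ms) / len(latencies_ms)
    budget = 1.0 - slo_target   # 5% of requests may exceed the threshold
    return slow / budget

# 2 slow requests out of 100 uses 40% of the budget.
print(budget_consumed([800] * 98 + [1500] * 2))  # -> 0.4
```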

Application monitoring is a must to define SLIs, build SLOs and track how the system delivers against the Error Budget. The following example shows an application's average response time, 95th percentile and maximum response time. The SLO is easily met in this case, as the 95th percentile never exceeds 1s. In fact, no responses are above 1s.

[Chart: application response time, showing average, 95th percentile and maximum. Chart by Datadog]

If monitoring reports that the 95th percentile response time increases beyond 1s, then the Error Budget is consumed. The increase may be due to a number of factors: more load, failing hardware, or retries. The more requests that exceed the SLO, the faster the budget is consumed.

As the budget is spent, the effort moves towards improving performance and reducing the response time. Once the 95th percentile is back below 1s, the effort shifts back to functionality. This is only a single metric; in practice multiple SLI metrics and SLOs play on the error budget.
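In practice this shift of effort can be made mechanical. The sketch below (a hypothetical policy, not a real release tool) shows the kind of decision the remaining budget drives:

```python
# Hypothetical sketch: use the remaining error budget to prioritize effort.
# remaining is 1.0 when no budget has been spent and 0.0 when it is exhausted.

def release_priority(remaining: float, caution_threshold: float = 0.25) -> str:
    if remaining <= 0.0:
        return "Budget exhausted: freeze feature releases, focus on reliability."
    if remaining < caution_threshold:
        return "Budget running low: reliability fixes take priority over new function."
    return "Budget healthy: keep shipping new features."

print(release_priority(remaining=0.6))
```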

What I find interesting here is that SRE as a process, and the Error Budget as an indicator, implement a self-correcting feedback loop. This mechanism encourages developers to build services that are reliable, do not require intervention and are self-healing via automation. Initial releases may have a focus on function, but as services become production ready, the SRE feedback mechanism drives in quality and service reliability.

What Determines Reliability?

What attributes of a service should go into defining service levels? The following are several typical metrics, which as an Enterprise Architect I would call Non-Functional Requirements (NFRs):

  • Latency: How long the system takes to respond.

  • Error Rate: How often a request fails.

  • Throughput: How much work it does.

  • Availability: How often the service is able to do work.

  • Durability: How often data is lost.

  • Correctness: Whether it works properly.

SLOs should be based on SLIs that cover all aspects of a service's behavior. All feed into the Error Budget for the service and drive the prioritization of effort.
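To illustrate how multiple SLIs might play on a single budget (the indicators and numbers below are made up for the example), each SLI can be compared with its own target, and the one burning its budget fastest determines where effort goes next:

```python
# Illustrative sketch: several SLIs, each with an SLO target, feed one prioritization signal.
# 'target' and 'observed' are fractions of good events; the values are invented.

slos = {
    "latency_under_1s": {"target": 0.95,   "observed": 0.97},
    "availability":     {"target": 0.9995, "observed": 0.9992},
    "error_rate_ok":    {"target": 0.999,  "observed": 0.9995},
}

def budget_used(target: float, observed: float) -> float:
    """Fraction of the error budget consumed; above 1.0 means the SLO is violated."""
    return (1.0 - observed) / (1.0 - target)

worst = max(slos, key=lambda name: budget_used(**slos[name]))
print(worst, round(budget_used(**slos[worst]), 2))  # -> availability 1.6
```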

As noted, monitoring is essential to manage SLOs. Chris Rosen's blog post shows how to use Datadog for dashboarding of container-based applications on Bluemix. It is a great example of the monitoring required to support an Error Budget driven approach to service management.

In my next post I will look in more detail at the role of the Service Reliability Engineer (again SRE) in cloud operations.
