Speed and innovation are crucial to the success of any business.
Being able to innovate, test, validate and release quickly is key for organizations that want to stay ahead of their competition. At the same time, it is important to ensure that business-critical services have built-in resiliency, performance and scalability.
Speed/innovation and resiliency are two sides of the same coin — customer confidence in the business. To achieve this confidence, the mission-critical services should be built on cloud-native principles in combination with site reliability engineering (SRE) principles.
The goal of this post is to examine the following:
- What cloud-native is and how it ties to SRE
- What SRE is and how SRE practices can be part of the development lifecycle
- How to measure SRE
- SRE organization and how to measure its effectiveness
- What SRE has to do with artificial intelligence (AI) and machine learning (ML)
The following diagram shows how adopting cloud-native practices leads to SRE efficiency and earning customer confidence. Let's dive into the flow:
What is cloud-native? How does it relate to site reliability engineering (SRE)?
Ask yourself or your colleagues what cloud-native, as-a-Service or cloud-first means, and you will get different answers. Responses might range from "cloud-first" or "born in the cloud" to "cloud-native means microservices and containerization."
The Cloud Native Computing Foundation (CNCF) defines cloud-native as follows:
"Cloud-native technologies empower organizations to build and run scalable applications in modern, dynamic environments, such as public, private and hybrid clouds. Containers, service meshes, microservices, immutable infrastructure and declarative APIs exemplify this approach. These techniques enable loosely coupled systems that are resilient, manageable and observable. Combined with robust automation, they allow engineers to make high-impact changes frequently and predictably with minimal toil."
Essentially, cloud-native is all about the balance between resiliency and agility. It's an approach to build and run responsive, scalable and fault-tolerant applications that can run anywhere — in public clouds, private clouds or hybrid clouds. Another lens that can be applied to understand cloud native is the Twelve-Factor App, which consists of a set of best practices that guides the building of applications with built-in performance, automation, resiliency, elasticity and diagnosability.
Let's explore the meaning of these cloud-native terms.
- Designed for automation:
  - Automation of development tasks
  - Test automation
  - Automation of infrastructure, provisioning updates and upgrades
- Designed for resiliency:
  - High availability
  - Fault tolerance and graceful degradation
  - Backup and restore
- Designed for elasticity:
  - Automated scale up and down
- Designed for performance:
  - Responsiveness with SLO and SLI defined
  - Efficiency and capacity planning
- Designed for diagnosability:
  - Logs, traces, and metrics
- Designed for efficient delivery:
  - Modular, microservices-based
  - Automated deployments, upgrades and updates
  - Efficient build process
These concepts describe SRE practices in a nutshell. Applying these practices in the development lifecycle enforces architecture toward common standards:
An important thing to note is that merely containerizing applications as is does not help achieve cloud-native characteristics. In fact, these days, it is possible to containerize any application; however, it requires additional effort to create a containerized application that can be automated and orchestrated effectively so that it behaves as a cloud-native application running on a native platform like Kubernetes. Examples of this are applications that use Kubernetes health probes (e.g., liveness and readiness probes) to enable graceful degradation. For more details, see the blog post "Are your Kubernetes readiness probes checking for readiness?"
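To make the probe idea concrete, here is a minimal sketch of the application side of liveness and readiness checks, using only the Python standard library. The paths `/healthz` and `/ready`, the `HealthState` flags and the port are assumptions chosen for illustration; Kubernetes lets you configure any path, and a real service would set these flags from its own dependency checks.

```python
from dataclasses import dataclass
from http.server import BaseHTTPRequestHandler, HTTPServer


@dataclass
class HealthState:
    """Hypothetical in-process flags that the application updates itself."""
    process_ok: bool = True        # False only when the process is wedged and needs a restart
    dependencies_ok: bool = False  # True once the database, caches, etc. are reachable


STATE = HealthState()


def probe_status(path: str, state: HealthState) -> int:
    """Map a probe path to the HTTP status code the probe should receive."""
    if path == "/healthz":  # liveness: a failure tells the platform to restart the container
        return 200 if state.process_ok else 500
    if path == "/ready":    # readiness: a failure removes the pod from load balancing
        return 200 if state.process_ok and state.dependencies_ok else 503
    return 404


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        self.send_response(probe_status(self.path, STATE))
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Serve the probe endpoints; blocks until interrupted."""
    HTTPServer(("", port), ProbeHandler).serve_forever()
```

Keeping the two checks separate is what allows graceful degradation: a pod whose dependency is temporarily down reports "not ready" and stops receiving traffic, but it is not restarted as long as the process itself is healthy.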
Going through all the patterns is beyond the scope of this article. Kubernetes provides a portable, extensible platform for managing containerized workloads and services, and it facilitates both declarative configuration and automation. How each of the cloud-native practices can be achieved with Kubernetes will be the topic of a subsequent blog post. Some additional resources are included at the end of this post.
Many applications are complex and take years to build. Additionally, many applications are built in a layered architecture, with contributions from a number of teams and technology groups. With a layered architecture, any user action might go several levels deep — from user interaction to authorization to a backend business logic service to automation processing (and there can be additional layers based on use case). To reduce the complexity, improve efficiency and speed up development, it is critical to apply the lens of cloud native to each layer of the architecture when delivering such a service. Cloud native practices also apply to the software delivery model.
What is site reliability engineering (SRE)? How can SRE practices be part of the development lifecycle?
Have you ever heard the expression "SRE is what happens when you ask a software engineer to design an operations team"? To quote the IBM Learn Hub article on SRE: "Site reliability engineering (SRE) uses software engineering to automate IT operations tasks (e.g., production system management, change management, incident response, even emergency response) that would otherwise be performed manually by systems administrators (sysadmins)."
The role of the SRE is to keep the organization focused on what matters most to users — ensuring that the platform and services are reliable. If you are familiar with the traditional disciplines of development and operations, SRE bridges the two. The goal of SRE is to codify every aspect of operations in order to build resiliency within infrastructure and applications. This implies that reliability deliverables are to be delivered via the same continuous integration (CI)/continuous delivery (CD) pipeline as development, managed by using version control tools and checked for issues by using test frameworks.
In summary, SRE treats operations as a software delivery problem. SRE uses a software engineering approach to solve operational problems.
In an Embedded SRE model (described in the SRE model section), development and SRE collaborate throughout the lifecycle of minimum viable product (MVP) delivery. As MVP progresses through technical feature specification and development, the SRE collaborates with Development and OM to ensure cloud-native practices are enabled. For example, they identify critical user journeys, associated key SLIs and SLOs for each component.
The SRE should understand service design, including frontend, backend, business logic and database dependencies. This understanding is critical in order to document all failure points and deliver automation for service restoration. By using service design knowledge, the SRE should ensure delivery of the required automation that is described in the cloud native section.
As illustrated in the following diagram, Development and SRE collaborate to deliver functionality and reliability for MVP by using the same CI/CD delivery pipelines and release processes while focusing on their success metrics:
No organization starts from scratch. Shifting left might not be as easy for legacy services as it is for new ones. A good way to start is to incubate shift-left SRE with new services and then extend it iteratively to existing legacy services.
In some development models, there are concepts of "DONE, DONE, DONE" that imply code: DONE; test-automation: DONE; and documentation: DONE. Enabling SRE in a development organization implies DONE, DONE, DONE and DONE — the additional "DONE" is for SRE enablement.
As organizations decide to build the development process where SRE and Development work in collaboration to deliver instances of MVP, the question is how we measure the effectiveness of this process. For this measure, we need to look into the critical metrics committed both externally and internally:
- Service Level Agreement (SLA): SLA reflects customer expectation. It sets a promise to the consumer in terms of service availability and performance. There are business consequences if promises are not kept.
- Service Level Objective (SLO): SLOs are the reliability and performance goals set by the service for itself. These are visible internally. Every service should have an availability SLO. The SLO decides how much investment is needed in the reliability of a service. More critical services should have higher SLOs. From the SRE perspective, SLO is what defines the goal that SRE teams have to reach and measure themselves against. So, how is SLO defined? The metrics that define SLOs should be limited to those that truly define performance measures. Every service should consider client-side impacts when defining these metrics.
- Service Level Indicator (SLI): An SLI is the metric that enables measurement of compliance against an SLO. Think of SLIs as a set of Key Performance Indicators (KPIs) that matter to customers. It is important that SRE, Development and OM reach an agreement on the SLIs that define SLOs and, therefore, SLAs.
See the following diagram for examples:
Here is how these three metrics (SLI, SLO and SLA) are related — the service needs to collect KPIs that define the SLIs for the service. The service defines thresholds of metrics based on SLOs and monitors the thresholds of metrics so that it does not violate the SLA.
In other words, SLIs are the metrics in the monitoring system, SLOs are the alert rules set on those metrics, and SLAs are the externally committed values of those metrics that the SLOs protect.
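The relationship between the three can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a prescribed implementation: the availability SLI is taken to be the ratio of good requests to total requests, and the names `ServiceLevels`, `availability_sli` and `evaluate` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceLevels:
    slo: float  # internal objective, e.g. 0.999
    sla: float  # external commitment, e.g. 0.995

    def __post_init__(self) -> None:
        # The external SLA should be no stricter than the internal SLO.
        if not 0.0 < self.sla <= self.slo <= 1.0:
            raise ValueError("expected 0 < SLA <= SLO <= 1")


def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests served successfully in the measurement window."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return good_requests / total_requests


def evaluate(sli: float, levels: ServiceLevels) -> str:
    """Compare a measured SLI against the SLO (alert) and the SLA (commitment)."""
    if sli < levels.sla:
        return "sla_breached"  # business consequences apply
    if sli < levels.slo:
        return "slo_missed"    # internal alert; the SLA is still intact
    return "ok"
```

For example, with an SLO of 99.9% and an SLA of 99.5%, a window with 9,975 good requests out of 10,000 yields an SLI of 99.75%: the SLO alert fires, but the SLA is still met, which is exactly the early warning the gap between the two is meant to provide.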
The SLI and SLO definitions should be collaboratively agreed upon by the Development, SRE and the Service Offering team. Going with the definition of "you build it, you run it," each service in the layered architecture should identify the KPIs for their service and make them measurable. These KPIs are the SLIs that define the SLO for each service.
As mentioned earlier, the SLA is external and should not be stricter than the SLO. The SLA is normally a looser objective than the internal SLO and relies on a subset of the metrics that make up the SLO:
Resiliency isn’t something that just happens; it takes time and is iterative. It is a result of the organization’s support towards operationalizing the SRE model that’s sustainable and resilient itself. SRE is only as good as the organization supporting it.
Effective SRE depends on how well the SRE model is established.
SRE organization and how to measure the effectiveness
No matter how well we architect and design a service, failure is inevitable at some point. Aiming to design for 100% uptime is futile. It slows down the development of new features and functions, which adversely affects consumer satisfaction as well.
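A quick calculation shows how little room each additional "nine" leaves. The sketch below converts an availability target into the downtime it permits; the 30-day window and the function name `allowed_downtime_minutes` are assumptions made for illustration.

```python
def allowed_downtime_minutes(availability_target: float, window_days: int = 30) -> float:
    """Downtime (in minutes) that a target such as 0.999 permits per window."""
    if not 0.0 <= availability_target <= 1.0:
        raise ValueError("availability_target must be between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - availability_target) * window_minutes


for target in (0.99, 0.999, 0.9999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.2%} availability -> {minutes:.1f} minutes of downtime per 30 days")
```

A 99.9% target leaves roughly 43 minutes per 30 days, and a 100% target leaves none, so every deployment, upgrade or experiment would have to be risk-free.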
Minimizing "Mean Time to Recovery" (MTTR) is the other side of the SRE coin. When a service recovers quickly after things go wrong, customers still perceive it as reliable. How quickly a service returns to a running state depends heavily on the operational model of the SRE team.
Let's look at some of the SRE team types:
- Embedded SRE: SRE embedded in the functional development squads. Development and SRE work together to deliver application performance and reliability by using the same development CI/CD delivery pipelines and release processes, but they each focus on their own metrics of success:
  - Development focuses on the speed of release of new function.
  - SRE focuses on enabling resiliency and reliability for the feature functions being delivered.
- Dedicated SRE: A dedicated SRE team that is responsible for keeping the service up and running. This team is efficient once the service is mature and stable, with established automation and runbooks in place.
- Platform SRE: The SRE team that takes care of the underlying platform where the services are running, including Kubernetes cluster, network, storage, etc.
A lot of organizations start with a pure Dedicated SRE + Platform SRE model that, arguably, is a traditional Operations model. Organizations soon realize that SRE needs to start early in the cycle, needs to know the service components really well, and needs to be part of the Software Development Life Cycle (SDLC), boosting reliability. Once that realization comes in, they move to a hybrid approach of an Embedded + Dedicated + Platform Model.
Mean Time to Recovery (MTTR) is the average time to recover the service in the event of an outage. MTTR is dependent on the following key metrics:
- TTD: Time to detect an outage, or an alert indicating a potential outage. This depends on the quality of the ticketing system, on how quickly the correct SRE is notified, and on monitoring key SLIs with threshold alerts so that an outage is detected automatically.
- TTE: Time to engage, which depends on quick routing of the issue to the right SRE. This is where the SRE model is critical in reducing the MTTR.
- TTF: Time to fix (restore) the service, which depends on how well the SREs know the failure points and how well the automation for recovery is in place.
- TTBD: Time to build and deploy fixes for identified bugs or additional identified automation.
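As a rough sketch, the per-incident recovery time can be modeled as the sum of these four phases and MTTR as the average over incidents. Treating the phases as strictly additive is a simplifying assumption (in practice they can overlap), and the `Incident` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Incident:
    """Phase durations for one outage, in minutes."""
    ttd: float         # time to detect
    tte: float         # time to engage the right SRE
    ttf: float         # time to fix (restore) the service
    ttbd: float = 0.0  # time to build and deploy a fix, if one was needed

    @property
    def time_to_recovery(self) -> float:
        return self.ttd + self.tte + self.ttf + self.ttbd


def mttr(incidents: list) -> float:
    """Mean time to recovery across incidents, in minutes."""
    if not incidents:
        raise ValueError("at least one incident is required")
    return mean(i.time_to_recovery for i in incidents)


def phase_averages(incidents: list) -> dict:
    """Average duration of each phase, to show where recovery time is spent."""
    if not incidents:
        raise ValueError("at least one incident is required")
    return {
        phase: mean(getattr(i, phase) for i in incidents)
        for phase in ("ttd", "tte", "ttf", "ttbd")
    }
```

Breaking MTTR down by phase makes the improvement target visible: a large average TTE points at routing and the team model, while a large TTF points at missing runbooks or recovery automation.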
The SRE model should be designed with the goal of minimizing MTTR to gain customer confidence:
What does SRE have to do with artificial intelligence (AI) and machine learning (ML)?
This pattern of shifting SRE practices to the left, putting SRE measures in place and optimizing the SRE organization leads to a reduction in Mean Time to Recovery (MTTR). As I mentioned previously, failure at some point is inevitable; the best that can be done is to be prepared. This preparedness is even more critical now that businesses are going digital at lightning speed. SREs have to deal with a variety of IT data, including logs, tickets, metrics, events, alerts and more. As businesses move to hybrid and multicloud, SREs are observing an explosion of this IT data. This shift has added tremendous stress on SRE teams, because the increase in data adds complexity that limits their ability to respond quickly.
Gartner defined AIOps as Artificial Intelligence for IT Operations: "AIOps combines big data and machine learning to automate IT operations processes, including event correlation, anomaly detection and causality determination".
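To give a feel for what "anomaly detection" means on operational data, here is a deliberately simple sketch that flags metric samples deviating strongly from a rolling baseline. It is a basic z-score check chosen for illustration, not how any particular AIOps product works; the function name, window size and threshold are assumptions.

```python
from statistics import mean, stdev


def find_anomalies(values: list, window: int = 5, threshold: float = 3.0) -> list:
    """Return indexes of samples that deviate from the preceding window.

    A sample is flagged when it lies more than `threshold` standard
    deviations from the mean of the previous `window` samples.
    """
    if window < 2:
        raise ValueError("window must be at least 2")
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Flat baseline: the z-score is undefined, so flag any change at all.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical response-time samples (ms) with one obvious spike.
latencies_ms = [100, 102, 98, 101, 99, 100, 250, 101]
print(find_anomalies(latencies_ms))
```

Hand-tuned rules like this become hard to maintain across thousands of metrics, logs and events, which is the gap that learned models and event correlation aim to fill.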
IBM Cloud Pak® for Watson AIOps uses AI and ML to make sense of the IT data through the steps of Observe, Learn, Act and Optimize. This relieves the manual toil that is associated with the challenging SRE role and enables organizations to speed up the development and delivery of new features and functions. Going through the details of this will be the subject of a future blog. To learn more about the IBM Cloud Pak for Watson AIOps, see the product documentation.
Finding a balance of speed and resiliency requires a shift in the mental model. SRE is not just a set of practices and policies. It is a culture and a mindset about how to develop software. I hope you found this blog interesting and informative. If you or your organization have not embraced some of these practices, socialize them so that you can incorporate them and take advantage of speed and resiliency in your practice.