The SRE Guide to Hyper-Resilient Hyperscale for Cloud-Native Applications

Enterprises are putting more focus on high availability for their cloud applications and services. Enterprise Observability is a foundational element of hyper-resiliency in the cloud. Learn why.

This post was originally published on the Instana blog on August 26, 2021.

SREs are paid to ensure that their DevOps procedures produce quality software and meet operational Service Level Objectives (SLOs) for cloud applications. It’s not easy, and as the popularity of containerized, cloud-based microservices grows, the challenges increase.

One solution is hyperscale. In case this is new to you, checkpoint.com defines hyperscale this way: “Hyperscale is the ability of a technology architecture to improve and scale appropriately as more demand is added to the system. This includes the ability to provide and add more resources to the system that make up a bigger distributed computing network.”

Hyper-resilient hyperscale is a state of near-continuous operational availability that persists no matter how rapidly the application and services footprint expands and contracts. You might be thinking of cluster technology, which provides mechanisms to make critical resources automatically available on backup systems. Alas, cluster technology delivers high availability of up to 99.99% for cloud infrastructure and data, but not for applications and services. The challenges specific to application hyper-resiliency are still emerging, and they are being addressed now.

Hyper-resilience: The next big objective for SREs

Microservices, containers and other recent software technologies have drastically increased scalability and improved availability by rapidly initiating a new service if one fails. It works, but is it cost effective? The cloud resources consumed by failed applications and services aren’t reclaimed automatically, and they can lead to hidden costs.

Now that cloud-native hyperscale is common, hyper-resilience is the next big SRE objective for improving application and service availability AND cost efficiency. One goal is to keep cloud-native applications as available as possible in the midst of a continuous stream of updates and hyperscale activity. Another goal is to ensure immediate service recovery after a failure or disruptive event. A third goal is to remove the resources of failed services at the same time, which keeps costs under control.

Enento Group, one of the leading providers of digital information services in the Nordics, uses full visibility into their containers to enable customers to utilize their services error-free.

Measuring hyper-resilience

Hyper-resilience can be measured in many ways. The service level measurements you can use are components of a cloud Service Level Agreement (SLA). The key SLA components for cloud applications and services are the Service Level Indicator (SLI), the Service Level Objective (SLO) and the Error Budget.

  • SLAs are the service levels you agree to provide for your customers.
  • SLOs are the specific, measurable targets within an SLA for attributes such as availability, throughput, frequency, response time or quality.
  • SLIs are the Quality of Service (QoS) metrics that measure actual performance for each SLO category in an SLA.

In other words, the SLO specifies the goal and the SLI is the measurement for a goal.
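The relationship between the two can be sketched in a few lines of Python. This is a minimal illustration that assumes a request-based availability SLI; the function names and the request counts are hypothetical and are not taken from Instana or from any particular SLA.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Return the measured availability SLI as a fraction between 0 and 1."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def meets_slo(sli: float, slo_target: float) -> bool:
    """The SLO is met when the measured SLI is at or above the target."""
    return sli >= slo_target


# Hypothetical request counts, for illustration only.
sli = availability_sli(good_requests=999_700, total_requests=1_000_000)
print(f"SLI = {sli:.4%}, meets 99.95% SLO: {meets_slo(sli, 0.9995)}")
```

The SLI is whatever the system actually delivered over the window, and the SLO is the fixed target it is compared against.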

An Error Budget is the maximum amount of application disruption that users and the enterprise will tolerate, which is the gap between 100% and the SLO target. Staying within the Error Budget is what it takes to achieve the specified SLO.
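As a rough sketch, an Error Budget can be expressed as the number of requests that may fail in a measurement window before the SLO is breached. The helper names and the traffic figures below are assumptions made for illustration; real budgets are often tracked in time or in weighted events instead.

```python
def error_budget(slo_target: float, total_requests: int) -> int:
    """Requests allowed to fail in the window without breaching the SLO."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be a fraction in (0, 1]")
    # round() avoids losing a request to floating-point error in 1 - slo_target.
    return round((1.0 - slo_target) * total_requests)


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> int:
    """Positive: budget left. Negative: the SLO has been breached."""
    return error_budget(slo_target, total_requests) - failed_requests


# Hypothetical window: one million requests against a 99.95% SLO.
budget = error_budget(0.9995, 1_000_000)
left = budget_remaining(0.9995, 1_000_000, failed_requests=300)
print(f"budget = {budget} failed requests, remaining = {left}")
```

A negative remaining budget is the signal that the SLO for the window can no longer be met.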

The key application and software SLI measurements are metrics, events, traces and latency. They are the basis of Enterprise Observability. Together they provide the data needed to determine availability, throughput, frequency, response time and quality. For hyper-resilience, they show whether application availability remains consistent and whether the SLIs meet the defined SLOs.

The 99.95% availability metric is frequently cited as a critical benchmark and realistically represents the top-end resiliency goal. But is it even achievable within the cloud? It can be, but not without rich observability data, AIOps, machine learning and a high degree of automation across the board.

To achieve 99.95% availability, every application and service must be resilient enough to spin up without delay or failure, run at optimum speed on its allotted infrastructure and release its resources efficiently when no longer needed. This applies equally across all cloud topology types: public, private, hybrid or multicloud.
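To put that target in perspective, the short Python sketch below converts an availability percentage into the downtime it allows. It assumes a simple time-based definition of availability over a fixed period; the function name is hypothetical.

```python
from datetime import timedelta


def allowed_downtime(availability_pct: float, period: timedelta) -> timedelta:
    """Downtime permitted in the period for a given availability percentage."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    unavailable_fraction = 1.0 - availability_pct / 100.0
    return timedelta(seconds=period.total_seconds() * unavailable_fraction)


for target in (99.9, 99.95, 99.99):
    per_month = allowed_downtime(target, timedelta(days=30))
    per_year = allowed_downtime(target, timedelta(days=365))
    print(f"{target}%: {per_month.total_seconds() / 60:.1f} min per 30 days, "
          f"{per_year.total_seconds() / 3600:.2f} h per 365 days")
```

Under these assumptions, 99.95% leaves about 21.6 minutes of downtime in a 30-day period, or roughly 4.4 hours in a year.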

Easy, right? Well, not so fast.

The impact of cloud-native observability on hyper-resilience

The technology to implement hyperscale solutions made applications more scalable and cloud resource consumption more efficient. But hyperscalability came at a cost: the application monitoring tools that worked well on-premises worked poorly, or not at all, in the cloud or in containers.

And if you can’t find the problem, you can’t fix it. Hyperscalability exacerbated the problem by making application components ephemeral, increasing the number of possible problem points in unpredictable ways. It’s like playing Whack-a-Mole blindfolded at high speed.

Fortunately, Enterprise Observability has done a lot to close the cloud visibility gap. Enterprise Observability is the canary in the coal mine. It provides immediate, precise and focused alerts and analytics from virtually anywhere, including containerized microservices in the cloud. Roll in machine learning, and Enterprise Observability reduces the likelihood of unseen problems and makes high availability more achievable. Even if you have slightly lower availability goals, you still need rich observability data to keep your application availability agreements.

How to achieve hyper-resilient hyperscale with Enterprise Observability

First, the metrics, events, traces and latency data must be captured with one-second granularity. Miss a single second of that data and you might miss the glitch that caused an application or service failure. Sampled or spot-checked data doesn’t cut it: it’s like driving over a speed bump at high speed with a blindfold on. You know something happened, but you don’t know when or why.
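The effect of granularity can be shown with a toy example. The Python sketch below builds a synthetic latency series with one value per second and a single one-second spike, then compares what a one-second reader and a ten-second sampler observe. All values and function names are made up for illustration, and whether a real sampler misses a given spike depends on how its schedule lines up with the event.

```python
def build_latency_series(duration_s: int, baseline_ms: float,
                         spike_at_s: int, spike_ms: float) -> list[float]:
    """One latency value per second, with a single one-second spike."""
    series = [baseline_ms] * duration_s
    series[spike_at_s] = spike_ms
    return series


def max_observed(series: list[float], interval_s: int) -> float:
    """Highest latency seen when reading the series every interval_s seconds."""
    return max(series[::interval_s])


# Synthetic data: 60 seconds at 20 ms with a 900 ms spike at second 37.
latency = build_latency_series(duration_s=60, baseline_ms=20.0,
                               spike_at_s=37, spike_ms=900.0)
print("1-second granularity sees max:", max_observed(latency, 1))
print("10-second sampling sees max:", max_observed(latency, 10))
```

In this run the ten-second sampler reads seconds 0, 10, 20, 30, 40 and 50, so the spike at second 37 never appears in its data.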

Metrics, events, traces and latency then need to be automatically fed into machine learning and AIOps to provide predictive analytics that inform remedial actions. Those actions can be manual, semi-automated or fully automated depending upon the level of comfort an SRE has with the predictive recommendations. The more of these actions you can apply instantaneously, the closer you get to hyper-resilient hyperscale.
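One way to picture the three levels of automation is as a policy that maps the confidence of a predictive recommendation to an action mode. The Python sketch below is purely hypothetical: the thresholds, the `ActionMode` names and the idea of a single confidence score are assumptions for illustration and do not describe how Instana or any specific AIOps product behaves.

```python
from enum import Enum


class ActionMode(Enum):
    MANUAL = "manual"
    SEMI_AUTOMATED = "semi-automated"
    FULLY_AUTOMATED = "fully automated"


def choose_action_mode(confidence: float,
                       auto_threshold: float = 0.95,
                       assist_threshold: float = 0.75) -> ActionMode:
    """Pick how a remedial action runs, based on recommendation confidence.

    The thresholds are hypothetical defaults that an SRE team would tune
    to match its own comfort with the predictive recommendations.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= auto_threshold:
        return ActionMode.FULLY_AUTOMATED
    if confidence >= assist_threshold:
        return ActionMode.SEMI_AUTOMATED
    return ActionMode.MANUAL


for score in (0.60, 0.80, 0.97):
    print(f"confidence {score:.2f} -> {choose_action_mode(score).value}")
```

Raising or lowering the thresholds is how a team would shift more actions toward automation as trust in the recommendations grows.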

Instana automatically discovers all your components — with no manual action required — so it keeps track even during moments of hyperscale growth. It automatically puts the data in context and sends alerts that include the context so you can take action immediately.

Instana displays CPU total for all apps across all clusters.

For more information about SLI, SLO and Error Budget, read Instana’s blog post, “Monitoring SLIs and SLOs with Instana.”

You can also check out the on-demand webinar, SRE: How Observability Tools Help Implement Reliability Engineering.

Conclusion

As cloud-native applications create complexity, organizations must track more transactions at shorter intervals. The components are harder to see in containers within cloud environments. Hyperscale requires hyper-resilience, and that requires Enterprise Observability.

The granular metrics, events, traces and latency measurements provided by Instana Enterprise Observability are an SRE’s best friend. Instana enriches this information with context and machine learning to give you precise knowledge about your KPIs. Automatic, up-to-the-second information is the intelligence you need to keep your environment hyper-resilient, especially during peak activity periods.

To see Instana for yourself, get your hands dirty in the APM Observability sandbox for free today.
