July 5, 2022
By Swanand Barve and Rakesh Shinde
4 min read

How do you know if a solution is “resilient enough,” and how do you know if your testing covers the necessary scenarios?

The cloud-native architecture paradigm has been around for quite some time now. At the core of cloud-native architecture are cohesive, independent functional components that bring in business agility, scalability and resiliency — contributing to accelerated time to market, competitive advantages and optimized costs. This paradigm has been actively supported through a polyglot technology landscape.

The solutions realized using the above combination of architecture and tech landscape can turn out to be quite complex to maintain and manage, primarily due to the sheer number of components and multiple tech frameworks required for their realization. Sub-optimal implementation of design and engineering practices exponentially increases the complexity and maintenance risks of such solutions.

What is resiliency?

“Resiliency” is one such engineering practice that is critical to the success or failure of any digital transformation initiative. As you may know, resiliency directly contributes to the overall availability of the solution through metrics like Mean Time to Recover (MTTR) and Mean Time Between Failures (MTBF), and it is also directly responsible for making or breaking a transformative user experience.

Resiliency is essentially the ability of a system to withstand and recover from failures. While failures may ultimately manifest as errors or unavailability of a component or system, the list of factors that can cause failures in a distributed, cloud-native system is significant.

There is already a lot of material focusing on how to “implement” resiliency in cloud-native applications. IBM’s Build for Reliability Garage practice provides a great introduction and framework for resiliency implementation. There are also frameworks like Chaos Monkey and tools like Gremlin that help in “testing” the resiliency of applications.

The challenge though remains — how do we verify if a solution is “resilient enough”? Specifically, how do we know if our testing covers the necessary and sufficient scenarios? How do we know what failures to induce?

We would like to propose the following four-step approach to address the above challenge.

1. Identify scenarios and architectural components that need to be tested for resiliency

This can be done by identifying “unique traversal paths” — essentially, the sequence/combination in which components of your solution can be used to support functional scenarios. These scenarios and the supporting components provide the base set that needs to be tested.

For example, your application may support one or more of the following:

  • Search/browse product catalogue through a channel application that invokes backend microservices, which fetch data from a persistent data store.
  • Batch processes/schedulers executing at pre-set time/frequency.
  • Events published on pre-configured topics and processed by subscribing microservices.
  • APIs exposed and invoked by multiple consumer systems.
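One lightweight way to capture these traversal paths is as plain data that can later drive test generation. The scenario names and component lists below are illustrative assumptions, not part of any particular framework:

```python
# A sketch of "unique traversal paths": each scenario maps to the ordered
# sequence of components that realize it. Names are hypothetical.
TRAVERSAL_PATHS = {
    "catalogue_search": ["channel_app", "api_gateway", "catalogue_service", "product_db"],
    "nightly_batch":    ["scheduler", "batch_job", "order_db"],
    "order_events":     ["event_topic", "order_subscriber", "order_db"],
    "partner_api":      ["api_gateway", "pricing_service", "pricing_cache"],
}

def components_under_test(paths):
    """Union of all components appearing in any traversal path --
    the base set that needs to be tested for resiliency."""
    return sorted({c for path in paths.values() for c in path})
```

Deriving the component set from the scenarios (rather than listing components directly) keeps the test scope anchored to real functional flows.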

2. Determine points of failure

Once we’ve identified the scenarios and components, the next step is determining what could “fail” with these components. Let’s take an example of a single microservice with the following characteristics:

  • It exposes an API through a gateway.
  • It is deployed on a Kubernetes-enabled container framework.
  • It accesses a database.
  • It integrates with a downstream system.

This view can be put together by identifying the “failure surfaces,” as shown below:

3. Identify causes of failure across failure surfaces

Each failure surface identified in the previous step could fail for multiple reasons; that is what we need to identify next. Continuing with the same example, mapping failure surfaces to their possible causes gives you the following list:

  • Core: The core microservice itself, as a code unit, could fail due to out-of-memory errors, an application server crash, etc.
  • Microservices pod and node: The node/pod may fail a health check. The VM hosting the Kubernetes container platform may crash.
  • API Gateway: The API Gateway engine may become unresponsive due to insufficient threads/memory required for servicing requests.
  • Backend system: The backend system may take a long time to respond, or it may crash entirely.
  • Compute/storage/network: The network between the microservice and backend system (that could be hosted in a separate location) may go down.
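The surface-to-cause mapping above lends itself to a simple data structure. A minimal sketch, with the cause identifiers being illustrative assumptions:

```python
# Mapping each failure surface of the example microservice to its
# candidate failure causes. Surface and cause names are hypothetical.
FAILURE_CAUSES = {
    "core":            ["out_of_memory", "app_server_crash"],
    "pod_and_node":    ["health_check_failure", "host_vm_crash"],
    "api_gateway":     ["thread_pool_exhaustion", "gateway_memory_pressure"],
    "backend_system":  ["slow_response", "backend_crash"],
    "compute_network": ["network_partition_to_backend"],
}

def total_failure_modes(causes):
    """Count the distinct failure modes across all surfaces --
    a rough measure of the test coverage required."""
    return sum(len(c) for c in causes.values())
```

Keeping this mapping explicit makes coverage gaps visible: a surface with an empty cause list is one you have not yet analyzed.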

4. Prepare for “assault”

The causes and failure surfaces can be combined into a matrix as shown below. This matrix lets us understand and plan the combinations of “assaults” on the solution, which can then be implemented through the chaos testing frameworks mentioned earlier:
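Flattening the matrix into individual (surface, cause) pairs yields a concrete assault plan, where each pair can be handed to a chaos testing tool. A minimal sketch, with hypothetical names:

```python
# A small surface-to-cause matrix; names are illustrative assumptions.
MATRIX = {
    "api_gateway":    ["thread_pool_exhaustion"],
    "backend_system": ["slow_response", "backend_crash"],
}

def assault_plan(matrix):
    """Flatten the matrix into one (surface, cause) pair per planned
    assault, ready to drive a chaos experiment per entry."""
    return [(surface, cause)
            for surface, causes in matrix.items()
            for cause in causes]
```

Each entry in the resulting plan can then be prioritized by business impact before any experiments are run.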

Additional considerations

Last, but not least, failure testing alone will not be sufficient. Consider the following scenarios:

  • In addition to introducing a failure in one component instance, you need to make sure you don’t have auto-scaling or multiple instances running on the cloud platform, or ensure that all replicas fail as required.
  • In order to test a degraded result (e.g., through cache), you would need to have a “before” and “after” testing capability.

This requires additional capabilities, such as Infrastructure as Code (IaC) or dynamic reconfiguration of cloud resources, to complement your chaos testing frameworks.
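The “before” and “after” capability can be sketched as a probe that captures a baseline response, lets the chaos tool inject the failure, and then verifies the degraded response is still acceptable. The `fetch_product` and `inject_failure` callables and the `source` field are hypothetical stand-ins for your own client and chaos tooling:

```python
# Sketch of a before/after probe for degraded-mode testing, assuming a
# cache fallback: the payload should survive, but come from the cache.
def verify_degraded_mode(fetch_product, inject_failure):
    baseline = fetch_product("sku-123")            # "before" snapshot
    inject_failure("backend_system", "backend_crash")
    degraded = fetch_product("sku-123")            # "after" snapshot
    assert degraded["data"] == baseline["data"], "payload lost in degraded mode"
    assert degraded["source"] == "cache", "fallback did not serve from cache"
    return True
```

The point of the probe is that it fails loudly when degradation is not graceful, turning a vague resiliency goal into a checkable assertion.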

Additionally — since actual testing with components is expensive — you may also want to consider capabilities for “static” verification, such as the following:

  • Deployment descriptor validation for ReplicaSet
  • Validating auto-scaling config for VMs
  • Static code checks for retries, circuit breaker implementation, etc.
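The first of these static checks can be sketched as a validation pass over a deployment descriptor. The manifest below is shown as an already-parsed dict (in practice you would load the YAML with a parser such as PyYAML), and the thresholds and container names are illustrative assumptions:

```python
# Sketch: static resiliency checks on a Kubernetes Deployment descriptor,
# parsed into a dict. Checks replica count and liveness probes.
DEPLOYMENT = {
    "kind": "Deployment",
    "spec": {
        "replicas": 3,
        "template": {
            "spec": {
                "containers": [
                    {"name": "catalogue-service",
                     "livenessProbe": {"httpGet": {"path": "/health"}}},
                ]
            }
        },
    },
}

def static_checks(manifest, min_replicas=2):
    """Return a list of findings; an empty list means the checks passed."""
    findings = []
    if manifest["spec"].get("replicas", 1) < min_replicas:
        findings.append("too few replicas for failover")
    for c in manifest["spec"]["template"]["spec"]["containers"]:
        if "livenessProbe" not in c:
            findings.append(f"container {c['name']} has no liveness probe")
    return findings
```

Because such checks run against configuration rather than live components, they can gate every deployment cheaply, long before any chaos experiment is scheduled.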

Learn more

Overall, we think that resiliency requires focus not just post-development, but throughout the lifecycle: from identifying scenarios early on, to prioritizing them based on business impact, to using a combination of static and dynamic “assaults” to verify and validate component-level resiliency. The approach we have laid out in this blog post will help address the key challenges along this journey.

IBM’s cloud-native application development and modernization services help infuse these engineering practices with the required consistency and rigor. Check out the following links to learn more:
