How do you know if a solution is “resilient enough,” and how do you know if your testing covers the necessary scenarios?

The cloud-native architecture paradigm has been around for quite some time now. At the core of cloud-native architecture are cohesive, independent functional components that bring in business agility, scalability and resiliency — contributing to accelerated time to market, competitive advantages and optimized costs. This paradigm has been actively supported through a polyglot technology landscape.

The solutions realized using the above combination of architecture and tech landscape can turn out to be quite complex to maintain and manage, primarily due to the sheer number of components and multiple tech frameworks required for their realization. Sub-optimal implementation of design and engineering practices exponentially increases the complexity and maintenance risks of such solutions.

What is resiliency?

“Resiliency” is one such engineering practice that is critical to the success/failure of any digital transformation initiative. As you may know, resiliency directly contributes to the overall availability of the solution through metrics like Mean Time to Recover (MTTR) and Mean Time Between Failures (MTBF), and it is also directly responsible for making/breaking a transformative user experience.

Resiliency is essentially the ability of a system to sustain against failures. While failures in systems may ultimately manifest as errors or unavailability of a component/system, the list of factors that may cause failures in a distributed, cloud-native system is significant.

There is already a lot of material focusing on how to “implement” resiliency in cloud-native applications. IBM’s Build for Reliability Garage practice provides a great introduction and framework for resiliency implementation. There are also frameworks like chaos monkey or tools like Gremlin that help in “testing” the resiliency of applications.

The challenge though remains — how do we verify if a solution is “resilient enough”? Specifically, how do we know if our testing covers the necessary and sufficient scenarios? How do we know what failures to induce?

We would like to propose the following four-step approach to address the above challenge.

1. Identify scenarios and architectural components that need to be tested for resiliency

This can be done by identifying “unique traversal paths” — essentially, the sequence/combination in which components of your solution can be used to support functional scenarios. These scenarios and the supporting components provide the base set that needs to be tested.

For example, your application may support one or more of following:

  • Search/browse product catalogue through a channel application that invokes backend microservices, which fetch data from a persistent data store.
  • Batch processes/schedulers executing at pre-set time/frequency.
  • Events published on pre-configured topics and processed by subscribing microservices.
  • APIs exposed and invoked by multiple consumer systems.

2. Determine points of failure

Once we’ve identified the scenarios and components, the next step is determining what could “fail” with these components. Let’s take an example of a single microservice with the following characteristics:

  • It exposes an API through a gateway.
  • It is deployed on a Kubernetes-enabled container framework.
  • It accesses a database.
  • It integrates with a downstream system.

This view can be put together through identification of the “failure surfaces,” as below:

3. Identify causes of failure across failure surfaces

Each failure surface identified in previous step could fail for multiple reasons — that’s what we need to identify next. Continuing with the same example as earlier – mapping failure surfaces to the possible causes can give you the following list:

  • Core: The core microservice itself – as a code unit — could fail due to out-of-memory issues, the application server could crash, etc.
  • Microservices pod and node: The node/pod may fail a health check. The VM hosting the Kubernetes container platform may crash.
  • API Gateway: The API Gateway engine may become unresponsive due to insufficient threads/memory required for servicing requests.
  • Backend system: The backend system may take a high amount time to respond, and the system may crash.
  • Compute/storage/network: The network between the microservice and backend system (that could be hosted in a separate location) may go down.

4. Prepare for “assault”

The causes and failure surfaces can be used to create a matrix as shown below. This now allows us to understand and plan the combination with which we need to plan for “assaults” on the solution. These, in turn, can be now implemented through chaos testing frameworks, as mentioned earlier:

Additional considerations

Last, but not least, failure testing alone will not be sufficient. Consider the following scenarios:

  • In addition to introducing failure in one component instance, you need to make sure you don’t have auto-scaling/multiple instances running on cloud platform OR ensure all replicas fail as required.
  • In order to test a degraded result (e.g., through cache), you would need to have a “before” and “after” testing capability.

This requires additional capabilities to complement your chaos testing frameworks such as Infrastructure as Code (IaC) or dynamic reconfiguration of cloud resources.

Additionally — since actual testing with components is expensive — you may also want to consider capabilities for “static” verification, such as the following:

  • Deployment descriptor validation for ReplicaSet
  • Validating auto-scaling config for VMs
  • Static code checks for retries, circuit breaker implementation, etc.

Learn more

Overall, we think that resiliency requires focus not just post-development, but throughout — from identification of scenarios early on, prioritizing them based on business impact and then using a combination of static and dynamic “assaults” to verify and validate component-level resiliency. The approach we have laid out in this blog post will help address the key challenges cited in this entire journey.

IBM’s cloud-native application development and modernization services ensure infusion of engineering practices with required consistency and rigor. Check out the following links to learn more:


More from Cloud

Kubernetes version 1.28 now available in IBM Cloud Kubernetes Service

2 min read - We are excited to announce the availability of Kubernetes version 1.28 for your clusters that are running in IBM Cloud Kubernetes Service. This is our 23rd release of Kubernetes. With our Kubernetes service, you can easily upgrade your clusters without the need for deep Kubernetes knowledge. When you deploy new clusters, the default Kubernetes version remains 1.27 (soon to be 1.28); you can also choose to immediately deploy version 1.28. Learn more about deploying clusters here. Kubernetes version 1.28 In…

Temenos brings innovative payments capabilities to IBM Cloud to help banks transform

3 min read - The payments ecosystem is at an inflection point for transformation, and we believe now is the time for change. As banks look to modernize their payments journeys, Temenos Payments Hub has become the first dedicated payments solution to deliver innovative payments capabilities on the IBM Cloud for Financial Services®—an industry-specific platform designed to accelerate financial institutions' digital transformations with security at the forefront. This is the latest initiative in our long history together helping clients transform. With the Temenos Payments…

Foundational models at the edge

7 min read - Foundational models (FMs) are marking the beginning of a new era in machine learning (ML) and artificial intelligence (AI), which is leading to faster development of AI that can be adapted to a wide range of downstream tasks and fine-tuned for an array of applications.  With the increasing importance of processing data where work is being performed, serving AI models at the enterprise edge enables near-real-time predictions, while abiding by data sovereignty and privacy requirements. By combining the IBM watsonx data…

The next wave of payments modernization: Minimizing complexity to elevate customer experience

3 min read - The payments ecosystem is at an inflection point for transformation, especially as we see the rise of disruptive digital entrants who are introducing new payment methods, such as cryptocurrency and central bank digital currencies (CDBC). With more choices for customers, capturing share of wallet is becoming more competitive for traditional banks. This is just one of many examples that show how the payments space has evolved. At the same time, we are increasingly seeing regulators more closely monitor the industry’s…