The cloud-native architecture paradigm has been around for quite some time now. At the core of cloud-native architecture are cohesive, independent functional components that bring in business agility, scalability and resiliency — contributing to accelerated time to market, competitive advantages and optimized costs. This paradigm has been actively supported through a polyglot technology landscape.
Solutions built on this combination of architecture and technology landscape can turn out to be quite complex to maintain and manage, primarily because of the sheer number of components and the multiple technology frameworks required to realize them. Sub-optimal implementation of design and engineering practices exponentially increases the complexity and maintenance risks of such solutions.
“Resiliency” is one such engineering practice, and it is critical to the success of any digital transformation initiative. Resiliency directly contributes to the overall availability of the solution through metrics like Mean Time to Recover (MTTR) and Mean Time Between Failures (MTBF), and it can make or break a transformative user experience.
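As a quick, worked illustration of how these metrics combine, using the standard steady-state approximation availability = MTBF / (MTBF + MTTR) and assumed example numbers:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to recover."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Assumed example: a component that fails roughly once a month (MTBF of ~720 hours).
print(f"{availability(720, 4):.4%}")     # ~99.45% if recovery takes 4 hours
print(f"{availability(720, 0.25):.4%}")  # ~99.97% if recovery takes 15 minutes
```

The same faults, recovered faster, translate directly into higher availability, which is why resiliency engineering shows up so visibly in these metrics.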
Resiliency is essentially the ability of a system to withstand failures. While failures may ultimately manifest as errors or as the unavailability of a component or system, the list of factors that can cause failures in a distributed, cloud-native system is long.
There is already a lot of material focusing on how to “implement” resiliency in cloud-native applications. IBM’s Build for Reliability Garage practice provides a great introduction and framework for resiliency implementation. There are also frameworks like Chaos Monkey and tools like Gremlin that help in “testing” the resiliency of applications.
The challenge, though, remains: how do we verify whether a solution is “resilient enough”? Specifically, how do we know whether our testing covers the necessary and sufficient scenarios? How do we know what failures to induce?
We would like to propose the following four-step approach to address the above challenge.
The first step is to identify the scenarios and components to be tested. This can be done by identifying “unique traversal paths”, essentially the sequences and combinations in which the components of your solution are used to support functional scenarios. These scenarios and the supporting components provide the base set that needs to be tested.
For example, your application may support several functional scenarios, each of which exercises a different combination of components, as shown in the sketch below.
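Here is a minimal sketch of this step; the scenario names and component graph are hypothetical placeholders for whatever your solution actually contains:

```python
# Hypothetical scenarios and the components each one traverses (illustrative only).
scenarios = {
    "browse_catalog":  ["api_gateway", "catalog_service", "catalog_db"],
    "place_order":     ["api_gateway", "order_service", "payment_service", "order_db"],
    "order_status":    ["api_gateway", "order_service", "order_db"],
    "nightly_reports": ["scheduler", "reporting_service", "order_db"],
}

# Unique traversal paths: distinct sequences of components that must be covered by tests.
unique_paths = {tuple(path) for path in scenarios.values()}

# The base set of components that resiliency testing needs to cover.
components_to_test = {component for path in unique_paths for component in path}

for path in sorted(unique_paths):
    print(" -> ".join(path))
print("Components in scope:", sorted(components_to_test))
```

The distinct paths and the union of their components become the base set that the later steps assault and verify.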
Once we’ve identified the scenarios and components, the next step is determining what could “fail” within these components. Consider the example of a single microservice. This view can be put together by identifying the microservice’s “failure surfaces”, the places where such a service can fail.
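As a rough illustration, the sketch below models a hypothetical microservice (its characteristics and dependencies are assumed for the example, not taken from any specific system) and derives the failure surfaces such a service typically exposes:

```python
from dataclasses import dataclass, field

@dataclass
class Microservice:
    """A hypothetical microservice used to illustrate failure-surface identification."""
    name: str
    exposes_rest_api: bool = True
    database: str | None = None
    downstream_services: list[str] = field(default_factory=list)
    uses_external_config: bool = True

def failure_surfaces(svc: Microservice) -> list[str]:
    """Derive the surfaces on which this service can fail from its characteristics."""
    surfaces = [
        "compute (process crash, CPU exhaustion)",
        "memory (leaks, out-of-memory kills)",
    ]
    if svc.exposes_rest_api:
        surfaces.append("inbound network (latency, connection saturation)")
    if svc.database:
        surfaces.append(f"persistence ({svc.database} unavailability, slow queries)")
    for dep in svc.downstream_services:
        surfaces.append(f"downstream dependency ({dep} errors or timeouts)")
    if svc.uses_external_config:
        surfaces.append("configuration (bad or missing values, secrets rotation)")
    return surfaces

order_service = Microservice(
    "order_service", database="order_db", downstream_services=["payment_service"]
)
for surface in failure_surfaces(order_service):
    print(surface)
```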
Each failure surface identified in the previous step could fail for multiple reasons, and those reasons are what we need to identify next. Continuing with the same example, mapping each failure surface to its possible causes yields a list along the following lines.
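The mapping below is a sketch with illustrative, deliberately non-exhaustive causes; the surfaces and causes for a real service would come from its own architecture and operational history:

```python
# Illustrative mapping of failure surfaces to possible causes (not exhaustive).
failure_causes = {
    "compute":     ["pod/process crash", "CPU throttling", "node eviction"],
    "memory":      ["memory leak", "out-of-memory kill", "undersized limits"],
    "network":     ["latency spikes", "packet loss", "DNS resolution failure"],
    "persistence": ["database unavailable", "connection pool exhaustion", "slow queries"],
    "dependency":  ["downstream 5xx errors", "downstream timeouts", "rate limiting"],
    "config":      ["missing configuration", "expired secrets", "bad deployment values"],
}

for surface, causes in failure_causes.items():
    print(f"{surface}: {', '.join(causes)}")
```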
The causes and failure surfaces can then be combined into a matrix. This allows us to understand and plan the combinations of “assaults” we need to run against the solution, which can in turn be implemented through chaos testing frameworks such as those mentioned earlier.
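As a sketch, the snippet below expands such a surface-by-cause matrix into one planned experiment per combination; `run_experiment` is a hypothetical placeholder for whatever your chaos tooling actually invokes, not a real framework API:

```python
# Reusing the surface-to-cause mapping from the previous sketch (abbreviated here).
failure_causes = {
    "network":    ["latency spikes", "packet loss"],
    "dependency": ["downstream timeouts", "downstream 5xx errors"],
}

def build_assault_plan(causes_by_surface: dict[str, list[str]]) -> list[dict]:
    """Expand the surface-by-cause matrix into individual chaos experiments."""
    return [
        {"surface": surface, "cause": cause}
        for surface, causes in causes_by_surface.items()
        for cause in causes
    ]

def run_experiment(experiment: dict) -> None:
    # Placeholder: hand each experiment to your chaos framework of choice
    # (for example, a latency attack or a pod kill) and record the outcome.
    print(f"Assault on {experiment['surface']}: {experiment['cause']}")

plan = build_assault_plan(failure_causes)
for experiment in plan:
    run_experiment(experiment)
print(f"Total assault combinations: {len(plan)}")
```

In practice, you would prioritize these combinations based on business impact before running them all.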
Last but not least, failure testing alone will not be sufficient: some scenarios also require the surrounding infrastructure to be changed, rebuilt or scaled as part of the test. This calls for additional capabilities to complement your chaos testing frameworks, such as Infrastructure as Code (IaC) or dynamic reconfiguration of cloud resources.
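A minimal sketch of what that complement might look like, assuming purely hypothetical helper functions for provisioning and reconfiguration (none of them map to a specific IaC tool's API): the chaos assault is bracketed by infrastructure changes so that recovery behaviour, not just failure behaviour, gets exercised.

```python
# Hypothetical helpers: in practice these would call your IaC tool or cloud APIs.
def apply_infrastructure(template: str) -> None:
    print(f"Provisioning environment from template '{template}'")

def scale_replicas(service: str, count: int) -> None:
    print(f"Scaling {service} to {count} replicas")

def inject_failure(surface: str, cause: str) -> None:
    print(f"Injecting failure on {surface}: {cause}")

def verify_recovery(service: str, timeout_seconds: int = 300) -> bool:
    print(f"Waiting up to {timeout_seconds}s for {service} to recover")
    return True  # placeholder for real health-check polling

# One resiliency scenario: degrade capacity, inject a failure, then verify recovery
# once the infrastructure is reconfigured back to its intended state.
apply_infrastructure("order-service-environment")
scale_replicas("order_service", 1)                  # dynamic reconfiguration before the assault
inject_failure("dependency", "downstream timeouts")
scale_replicas("order_service", 3)                  # restore intended capacity
assert verify_recovery("order_service")
```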
Additionally, since actual testing against live components is expensive, you may also want to consider capabilities for “static” verification of resiliency characteristics.
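“Static” verification can be as simple as inspecting deployment descriptors for resiliency-related settings before anything runs. The sketch below assumes, purely for illustration, a Kubernetes-style Deployment manifest and flags a few common gaps (a single replica, missing probes, missing resource limits):

```python
# Illustrative static checks against a (simplified) Kubernetes-style Deployment manifest.
manifest = {
    "kind": "Deployment",
    "metadata": {"name": "order-service"},
    "spec": {
        "replicas": 1,
        "template": {
            "spec": {
                "containers": [
                    {"name": "order-service", "image": "order-service:1.2.3",
                     "resources": {}}  # no limits defined
                ]
            }
        },
    },
}

def static_resiliency_findings(manifest: dict) -> list[str]:
    """Return a list of resiliency gaps found without deploying anything."""
    findings = []
    spec = manifest.get("spec", {})
    if spec.get("replicas", 1) < 2:
        findings.append("fewer than 2 replicas: no tolerance for a single pod failure")
    for container in spec.get("template", {}).get("spec", {}).get("containers", []):
        if "livenessProbe" not in container:
            findings.append(f"{container['name']}: no liveness probe defined")
        if "readinessProbe" not in container:
            findings.append(f"{container['name']}: no readiness probe defined")
        if not container.get("resources", {}).get("limits"):
            findings.append(f"{container['name']}: no resource limits defined")
    return findings

for finding in static_resiliency_findings(manifest):
    print(finding)
```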
Overall, we think that resiliency requires focus not just after development but throughout the lifecycle: identifying scenarios early on, prioritizing them based on business impact, and then using a combination of static and dynamic “assaults” to verify and validate component-level resiliency. The approach laid out in this blog post will help address the key challenges along that journey.
IBM’s cloud-native application development and modernization services help infuse these engineering practices with the required consistency and rigor.