How do you know if a solution is “resilient enough,” and how do you know if your testing covers the necessary scenarios?
The cloud-native architecture paradigm has been around for quite some time now. At the core of cloud-native architecture are cohesive, independent functional components that bring in business agility, scalability and resiliency — contributing to accelerated time to market, competitive advantages and optimized costs. This paradigm has been actively supported through a polyglot technology landscape.
Solutions built on this combination of architecture and technology landscape can be quite complex to maintain and manage, primarily because of the sheer number of components and the multiple technology frameworks involved. Poorly implemented design and engineering practices compound that complexity and the associated maintenance risks.
What is resiliency?
“Resiliency” is one such engineering practice, and it is critical to the success of any digital transformation initiative. Resiliency directly contributes to the overall availability of the solution through metrics like Mean Time to Recover (MTTR) and Mean Time Between Failures (MTBF), and it can make or break a transformative user experience.
Resiliency is essentially the ability of a system to withstand failures. While failures may ultimately manifest as errors or as the unavailability of a component or system, the list of factors that can cause failures in a distributed, cloud-native system is long.
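To make the link between these two metrics and availability concrete, here is a minimal sketch using the standard steady-state formula, availability = MTBF / (MTBF + MTTR). The function name and the sample figures are illustrative assumptions, not measurements from any real system.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a system is up, given mean time between
    failures and mean time to recover (both in hours)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures: the same failure rate with two recovery times.
slow_recovery = availability(mtbf_hours=500, mttr_hours=2.0)
fast_recovery = availability(mtbf_hours=500, mttr_hours=0.5)

print(f"MTTR 2.0h: {slow_recovery:.4%}")
print(f"MTTR 0.5h: {fast_recovery:.4%}")
```

With the failure rate held constant, shortening recovery time is what moves availability, which is why resiliency testing tends to focus on how quickly and cleanly a component recovers.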
There is already a lot of material on how to “implement” resiliency in cloud-native applications. IBM’s Build for Reliability Garage practice provides a great introduction and framework for resiliency implementation. There are also frameworks like Chaos Monkey and tools like Gremlin that help in “testing” the resiliency of applications.
The challenge remains, though: how do we verify that a solution is “resilient enough”? Specifically, how do we know whether our testing covers the necessary and sufficient scenarios? How do we know which failures to induce?
We would like to propose the following four-step approach to address the above challenge.
1. Identify scenarios and architectural components that need to be tested for resiliency
This can be done by identifying “unique traversal paths” — essentially, the sequence/combination in which components of your solution can be used to support functional scenarios. These scenarios and the supporting components provide the base set that needs to be tested.
For example, your application may support one or more of the following:
- Search/browse product catalogue through a channel application that invokes backend microservices, which fetch data from a persistent data store.
- Batch processes/schedulers executing at a pre-set time or frequency.
- Events published on pre-configured topics and processed by subscribing microservices.
- APIs exposed and invoked by multiple consumer systems.
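One way to enumerate unique traversal paths is to model the solution as a directed graph of components and walk it from each entry point. The sketch below assumes a hypothetical component graph; every component name is invented for illustration, and a real inventory would come from your architecture documentation or service registry.

```python
from typing import Dict, List

# Hypothetical component graph: each key calls the components in its list.
COMPONENT_GRAPH: Dict[str, List[str]] = {
    "channel-app": ["api-gateway"],
    "scheduler": ["batch-service"],
    "api-gateway": ["catalogue-service"],
    "catalogue-service": ["catalogue-db"],
    "batch-service": ["catalogue-db", "event-topic"],
    "event-topic": ["notification-service"],
    "notification-service": [],
    "catalogue-db": [],
}

# Hypothetical entry points where a functional scenario can start.
ENTRY_POINTS = ["channel-app", "scheduler"]


def traversal_paths(graph, start, path=None):
    """Return every path from `start` to a component with no further calls."""
    path = (path or []) + [start]
    # Skip components already on the path so that cycles cannot loop forever.
    downstream = [n for n in graph.get(start, []) if n not in path]
    if not downstream:
        return [path]
    paths = []
    for component in downstream:
        paths.extend(traversal_paths(graph, component, path))
    return paths


def unique_traversal_paths(graph, entry_points):
    """Collect the traversal paths starting at each entry point."""
    return [p for entry in entry_points for p in traversal_paths(graph, entry)]


all_paths = unique_traversal_paths(COMPONENT_GRAPH, ENTRY_POINTS)
for p in all_paths:
    print(" -> ".join(p))
```

Each printed path is a candidate scenario, and the components on it form the base set to test.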
2. Determine points of failure
Once we’ve identified the scenarios and components, the next step is determining what could “fail” with these components. Let’s take an example of a single microservice with the following characteristics:
- It exposes an API through a gateway.
- It is deployed on a Kubernetes-enabled container framework.
- It accesses a database.
- It integrates with a downstream system.
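Each of these characteristics points to a “failure surface”: the core microservice itself, its pod and node, the API gateway, the backend system, and the underlying compute, storage and network.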
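As a rough sketch of how this identification could be made repeatable, the snippet below derives failure surfaces from a simple description of a microservice. The `ServiceProfile` class, its field names and the surface labels are all hypothetical, and treating the database as its own surface is our assumption for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServiceProfile:
    """Hypothetical description of one microservice's characteristics."""
    name: str
    exposed_via_gateway: bool
    runs_on_kubernetes: bool
    uses_database: bool
    calls_downstream: bool


def failure_surfaces(profile: ServiceProfile) -> List[str]:
    """List the failure surfaces implied by a service's characteristics."""
    surfaces = ["core"]  # the service code itself can always fail
    if profile.runs_on_kubernetes:
        surfaces.append("pod-and-node")
    if profile.exposed_via_gateway:
        surfaces.append("api-gateway")
    if profile.uses_database:
        surfaces.append("database")  # assumption: modelled as its own surface
    if profile.calls_downstream:
        surfaces.append("backend-system")
    # Every deployed service depends on the underlying infrastructure.
    surfaces.append("compute-storage-network")
    return surfaces


example = ServiceProfile(
    name="catalogue-service",
    exposed_via_gateway=True,
    runs_on_kubernetes=True,
    uses_database=True,
    calls_downstream=True,
)
print(failure_surfaces(example))
```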
3. Identify causes of failure across failure surfaces
Each failure surface identified in the previous step could fail for multiple reasons, and that is what we need to identify next. Continuing with the same example, mapping failure surfaces to their possible causes gives the following list:
- Core: The core microservice itself, as a code unit, could fail due to out-of-memory issues, an application server crash, etc.
- Microservices pod and node: The node/pod may fail a health check. The VM hosting the Kubernetes container platform may crash.
- API Gateway: The API Gateway engine may become unresponsive due to insufficient threads/memory required for servicing requests.
- Backend system: The backend system may take a long time to respond, or it may crash.
- Compute/storage/network: The network between the microservice and the backend system (which could be hosted in a separate location) may go down.
4. Prepare for “assault”
The causes and failure surfaces can be combined into a matrix of failure surfaces against causes. This allows us to understand and plan the combinations of “assaults” we need to run on the solution. These, in turn, can be implemented through chaos testing frameworks, as mentioned earlier.
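A minimal sketch of such a matrix, built from the failure surfaces and causes listed in step 3, might look like the following. The identifiers are our own shorthand for those causes, not names used by any chaos testing framework.

```python
from typing import Dict, List, Tuple

# Failure surfaces mapped to causes, taken from the list in step 3.
# The short identifiers are illustrative labels only.
FAILURE_MATRIX: Dict[str, List[str]] = {
    "core": ["out-of-memory", "application-server-crash"],
    "pod-and-node": ["failed-health-check", "host-vm-crash"],
    "api-gateway": ["thread-or-memory-exhaustion"],
    "backend-system": ["slow-response", "crash"],
    "compute-storage-network": ["network-outage"],
}


def assault_plan(matrix: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Flatten the matrix into (surface, cause) pairs, one per assault."""
    return [(surface, cause)
            for surface, causes in matrix.items()
            for cause in causes]


plan = assault_plan(FAILURE_MATRIX)
for number, (surface, cause) in enumerate(plan, start=1):
    print(f"{number}. inject '{cause}' on '{surface}'")
```

Each pair is one experiment to hand to the chaos testing framework of your choice. The matrix also makes gaps visible: a surface with no listed cause is a surface nobody has thought about yet.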
Last, but not least, failure testing alone will not be sufficient. Consider the following scenarios:
- Failing a single component instance is not enough. You need to either make sure that auto-scaling or multiple running instances on the cloud platform do not mask the failure, or ensure that all replicas fail as required.
- In order to test a degraded result (e.g., through cache), you would need to have a “before” and “after” testing capability.
This requires additional capabilities, such as Infrastructure as Code (IaC) or dynamic reconfiguration of cloud resources, to complement your chaos testing frameworks.
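The “before” and “after” capability can be sketched as a small harness that probes the system, injects a failure, probes again and then restores the system. Everything below is a hypothetical in-memory stand-in: `FakeCatalogue` simulates a service that falls back to its cache, and in a real setup the inject and restore callbacks would call your chaos tooling or IaC scripts.

```python
from typing import Callable, Dict


def run_degradation_check(probe: Callable[[], dict],
                          inject_failure: Callable[[], None],
                          restore: Callable[[], None]) -> Dict[str, dict]:
    """Probe before, during and after an induced failure."""
    before = probe()
    inject_failure()
    try:
        during = probe()
    finally:
        restore()  # always undo the failure, even if the probe raised
    after = probe()
    return {"before": before, "during": during, "after": after}


class FakeCatalogue:
    """Hypothetical service that serves cached data when its backend is down."""

    def __init__(self) -> None:
        self.backend_up = True
        self._cache = None

    def search(self) -> dict:
        if self.backend_up:
            self._cache = ["laptop", "phone"]
            return {"source": "backend", "items": list(self._cache)}
        if self._cache is not None:
            return {"source": "cache", "items": list(self._cache)}
        raise RuntimeError("backend down and cache empty")


catalogue = FakeCatalogue()
result = run_degradation_check(
    probe=catalogue.search,
    inject_failure=lambda: setattr(catalogue, "backend_up", False),
    restore=lambda: setattr(catalogue, "backend_up", True),
)
for phase, response in result.items():
    print(phase, response["source"])
```

Comparing the three responses tells you whether the degraded result was acceptable and whether the system returned to normal once the failure was removed.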
Additionally — since actual testing with components is expensive — you may also want to consider capabilities for “static” verification, such as the following:
- Deployment descriptor validation for ReplicaSet
- Validating auto-scaling config for VMs
- Static code checks for retries, circuit breaker implementation, etc.
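As an illustration of the first kind of check, the sketch below inspects an already-parsed Kubernetes manifest for a replica count and health probes. The function name, the finding messages and the minimum of two replicas are assumptions made for this example; the right thresholds depend on your availability targets.

```python
from typing import Any, Dict, List


def check_deployment(manifest: Dict[str, Any], min_replicas: int = 2) -> List[str]:
    """Return resiliency findings for a Deployment or ReplicaSet manifest."""
    kind = manifest.get("kind")
    if kind not in ("Deployment", "ReplicaSet"):
        return [f"unsupported kind: {kind!r}"]

    findings = []
    spec = manifest.get("spec", {})
    # Kubernetes runs a single replica when the field is omitted.
    replicas = spec.get("replicas", 1)
    if replicas < min_replicas:
        findings.append(f"replicas is {replicas}, expected at least {min_replicas}")

    containers = spec.get("template", {}).get("spec", {}).get("containers", [])
    for container in containers:
        for probe in ("livenessProbe", "readinessProbe"):
            if probe not in container:
                findings.append(f"container {container.get('name')!r} has no {probe}")
    return findings


# Hypothetical manifest, as it would look after loading the YAML or JSON file.
sample_manifest = {
    "kind": "Deployment",
    "metadata": {"name": "catalogue-service"},
    "spec": {
        "replicas": 1,
        "template": {"spec": {"containers": [{"name": "app", "image": "app:1.0"}]}},
    },
}

for finding in check_deployment(sample_manifest):
    print(finding)
```

Checks like this can run in a build pipeline, long before any failure is injected into a running environment.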
Overall, we think that resiliency requires focus not just post-development but throughout: identifying scenarios early on, prioritizing them based on business impact, and then using a combination of static and dynamic “assaults” to verify and validate component-level resiliency. The approach we have laid out in this blog post should help address the key challenges along that journey.
IBM’s cloud-native application development and modernization services ensure infusion of engineering practices with required consistency and rigor.