Date: 2022-07-05

“Resiliency” is one such engineering practice, and it is critical to the success or failure of any digital transformation initiative. Resiliency contributes directly to the overall availability of a solution, as measured by metrics such as Mean Time to Recover (MTTR) and Mean Time Between Failures (MTBF), and it can make or break a transformative user experience.
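To make the link between these metrics and availability concrete, here is a minimal sketch using the commonly cited steady-state formula, availability = MTBF / (MTBF + MTTR). It assumes MTBF is measured as mean uptime between failures, excluding repair time. The function name and the numbers are hypothetical and chosen only for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up.

    Assumes MTBF counts only uptime between failures (repair time excluded).
    """
    if mtbf_hours < 0 or mttr_hours < 0 or (mtbf_hours + mttr_hours) == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures: the same failure rate, but faster recovery.
baseline = availability(mtbf_hours=500, mttr_hours=2)
faster_recovery = availability(mtbf_hours=500, mttr_hours=0.5)

print(f"baseline:        {baseline:.4%}")
print(f"faster recovery: {faster_recovery:.4%}")
```

The comparison suggests why recovery speed matters: with the failure rate held constant, shortening MTTR alone raises availability.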

Resiliency is essentially the ability of a system to withstand failures. While failures may ultimately manifest as errors or as the unavailability of a component or system, the list of factors that can cause failures in a distributed, cloud-native system is long.

There is already a lot of material on how to “implement” resiliency in cloud-native applications. IBM’s Build for Reliability Garage practice provides a great introduction and framework for resiliency implementation. There are also frameworks like Chaos Monkey and tools like Gremlin that help in “testing” the resiliency of applications.

The challenge, though, remains: how do we verify whether a solution is “resilient enough”? Specifically, how do we know whether our testing covers the necessary and sufficient scenarios? How do we know which failures to induce?

We propose the following four-step approach to address this challenge.