Chaos engineering is the intentional, controlled injection of failures into a production or pre-production environment to understand their impact and to plan a stronger defense posture and incident management strategy.
Every day brings a new opportunity for an organization’s critical application or infrastructure to fail, potentially threatening its ability to deliver services to customers. Failures have many possible causes, including security breaches, misconfigurations and service disruptions. The likelihood of errors or disruptions can rise as more applications and data are hosted in the cloud, which can also increase security issues.
One way to address disruptions is chaos engineering. It is not a random process in which engineers terminate instances or services, or otherwise cause systems to fail, without any purpose. Instead, it identifies potential future issues so that engineering teams can devise solutions proactively and avoid those issues in the live environment later.
Chaos engineering is important because an error or disruption can stall an organization’s momentum, with precious time spent working out a solution on the fly as downtime grows. Netflix learned this firsthand around its move from on-premises infrastructure to the cloud¹: in 2008, it experienced an outage that interrupted service delivery for three days. This predates its transformation into a video streaming operation, when such an outage would have been far more costly. As a result, Netflix decided to do everything possible to minimize disruptions and began to introduce chaos engineering into its workflows. This allows the company to identify issues before they happen and to minimize the damage if and when an unavoidable failure occurs.
Netflix created Chaos Monkey², an open-source tool that creates random incidents in IT services and infrastructure in order to expose weaknesses that can be fixed or handled through automatic recovery procedures. The company built the tool when it moved from a private data center to Amazon Web Services (AWS), in response to concerns about reliability in the cloud. Many organizations now use Chaos Monkey to run their chaos engineering experiments.
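The idea of random termination paired with automatic recovery can be illustrated with a small simulation. This is only a sketch under stated assumptions: it does not use Chaos Monkey's real interface, and the names `Instance`, `ServiceGroup`, `terminate_random_instance`, `auto_recover` and the `min_healthy` target are all hypothetical, chosen for illustration.

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Instance:
    name: str
    healthy: bool = True


@dataclass
class ServiceGroup:
    """Hypothetical pool of redundant instances behind one service."""
    instances: List[Instance] = field(default_factory=list)
    min_healthy: int = 2  # assumed recovery target, chosen for this sketch

    def healthy_count(self) -> int:
        return sum(1 for inst in self.instances if inst.healthy)

    def is_serving(self) -> bool:
        # The service stays up as long as one instance can answer.
        return self.healthy_count() >= 1


def terminate_random_instance(group: ServiceGroup, rng: random.Random) -> Optional[Instance]:
    """Mark one randomly chosen healthy instance as failed."""
    candidates = [inst for inst in group.instances if inst.healthy]
    if not candidates:
        return None
    victim = rng.choice(candidates)
    victim.healthy = False
    return victim


def auto_recover(group: ServiceGroup) -> int:
    """Restart failed instances until min_healthy are running; return the restart count."""
    restarted = 0
    for inst in group.instances:
        if group.healthy_count() >= group.min_healthy:
            break
        if not inst.healthy:
            inst.healthy = True
            restarted += 1
    return restarted


if __name__ == "__main__":
    group = ServiceGroup([Instance("web-1"), Instance("web-2"), Instance("web-3")])
    rng = random.Random(7)  # seeded so a run can be repeated
    for _ in range(2):
        victim = terminate_random_instance(group, rng)
        print("terminated:", victim.name, "| still serving:", group.is_serving())
    print("restarted by recovery:", auto_recover(group))
```

In a real environment the termination and recovery steps would act on actual infrastructure; the simulation only shows the shape of the experiment: remove a component at random, check whether the service still responds, and confirm that recovery restores the intended capacity.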
Chaos engineering is an important defense against infrastructure failures, outages or missing components in an organization’s production environment. It helps site reliability engineers (SREs) and other members of the DevOps team to provide continuous delivery of services by avoiding significant disruptions to their service, understanding their vulnerabilities better and knowing how to minimize the impact if a disruption occurs.
Even a small issue in code can have a catastrophic effect on the overall production environment because of dependencies between programs. For instance, an error in the transaction software system of a financial services firm can result in the loss of millions of dollars³. Organizations may be unable to avoid all IT incidents, but they can minimize the damage by using chaos engineering to understand likely scenarios and the best possible responses.
Organizations with high resilience, digital maturity and strong observability through dashboards and other tools should embrace chaos engineering, because they can act immediately on issues that surface during experiments. Organizations that lack this observability⁴ can take too long to resolve the failures they create through chaos engineering.
Chaos engineering is also a must-have for organizations that use the cloud, particularly the public cloud, and cloud-native apps. The public cloud introduces potential outage issues that require coordination with the cloud provider, which calls for a different approach than dealing with on-premises issues.
Enterprises using the cloud still often approach IT incidents without considering how the cloud and software-as-a-service (SaaS) impact those incidents differently, according to Constellation Research⁵.
In addition, the rise of microservices, which increases the number of hosts or containers running in a system, creates unique challenges that can be unearthed and solved through chaos experiments. Microservices shift complexity from code design into system operations, which does not eliminate the complexity but allows for greater automation.
Chaos engineering can also help organizations to enhance the velocity of their continuous integration and continuous delivery (CI/CD) pipelines. Incorporating chaos engineering into CI/CD, as Netflix did⁶, enables organizations to automate continuous experiments while controlling their potential impact.
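One way to picture an automated experiment with controlled impact is as a pipeline gate that injects a fault, watches a health metric, aborts as soon as a threshold is crossed and always rolls the fault back. The sketch below is an assumption-laden illustration, not Netflix's ChAP or any vendor's API: `run_chaos_gate`, `ExperimentResult`, `FakeService`, the 5% error-rate threshold and the sample count are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    passed: bool
    aborted: bool
    observed_error_rate: float


def run_chaos_gate(
    measure_error_rate: Callable[[], float],
    inject_fault: Callable[[], None],
    rollback: Callable[[], None],
    max_error_rate: float = 0.05,  # assumed tolerance for this sketch
    samples: int = 5,
) -> ExperimentResult:
    """Inject a fault, sample the error rate, abort early if it exceeds the limit."""
    baseline = measure_error_rate()
    if baseline > max_error_rate:
        # The system is already unhealthy, so no extra failure is added.
        return ExperimentResult(False, True, baseline)

    inject_fault()
    worst = baseline
    try:
        for _ in range(samples):
            worst = max(worst, measure_error_rate())
            if worst > max_error_rate:
                return ExperimentResult(False, True, worst)
        return ExperimentResult(True, False, worst)
    finally:
        # The fault is removed whether the experiment passed, failed or raised.
        rollback()


class FakeService:
    """Stand-in for a real deployment; the error rate rises while a fault is active."""

    def __init__(self, error_rate_under_fault: float) -> None:
        self.error_rate_under_fault = error_rate_under_fault
        self.fault_active = False

    def measure_error_rate(self) -> float:
        return self.error_rate_under_fault if self.fault_active else 0.01

    def inject_fault(self) -> None:
        self.fault_active = True

    def rollback(self) -> None:
        self.fault_active = False


if __name__ == "__main__":
    service = FakeService(error_rate_under_fault=0.02)
    result = run_chaos_gate(service.measure_error_rate, service.inject_fault, service.rollback)
    print("gate passed:", result.passed)
```

A pipeline could fail the build when `passed` is false. The early abort and the unconditional rollback are what keep the potential impact bounded in this sketch.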
Finally, because organizations increasingly connect with partners through APIs, an issue in one organization’s systems can have a knock-on impact on other organizations.
Deploying chaos engineering helps organizations understand the weak points in their architecture and correct them, ultimately creating the ability to anticipate future failures. Successful chaos engineering helps organizations to minimize technical failures with any significant customer impact and it also supports the construction of stronger and more resilient complex system architectures. Once an organization decides to pursue chaos engineering, the next step is determining whether to execute it in the pre-production or production environment.
DevOps teams have several options for running chaos engineering experiments to test various system processes.
Creating the ideal chaos engineering process requires several principles to ensure an organization can have a distributed system at scale.
Organizations that use chaos engineering must decide whether to run chaos tests in their production or pre-production environments. There are several reasons why chaos engineering is most beneficial in production. Live environments give the most accurate view of how an incident affects the customer experience. In addition, the pre-production environment may not have exactly the same settings as the live environment, which introduces some variability into the experiments.
For instance, an incident in a pre-production environment may not create a realistic response because it lacks the same traffic levels as the live environment. It also may not have the same security configurations as that environment.
Some organizations fear intentionally causing issues on their live site, so they run their experiments on a pre-production or development site. This ensures that any issues that occur do not affect the live customer experience. Other organizations begin in pre-production environments to get a handle on the process before moving to the live production environment.
Organizations will choose which environment to use based on their risk tolerance. Ultimately, chaos engineering aims to test actual large-scale issues, which is why production environments will give the most accurate picture of what's happening and what requires fixing.
Chaos engineering provides organizations with several key benefits.
Customers have high expectations about the availability of the services they purchase from companies. Any downtime or inability to access what they've paid for can have a serious effect on customer satisfaction, leading to lost revenues and reputational damage. Testing systems and identifying solutions means there is less risk that a system will be down for a significant period of time.
Disruptions can come from bad code, server issues or external threats. The latter can strike even with excellent security practices. Chaos engineering helps identify issues that can be exploited, so organizations can introduce patches and bug fixes to keep their services secure.
Chaos engineering enables organizations to create a more informed blueprint for how they tackle issues that will occur in the future. Organizations that embrace chaos engineering will have specific game plans for many incidents, enabling quicker repair and less downtime. Chaos engineering can decrease downtime⁷ by as much as 20%.
Chaos engineering experiments identify how a system allocates resources. Introducing experiments will demonstrate how the system handles loads, showing where bottlenecks are or are likely to occur.
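As a rough illustration of how a load experiment can point to bottlenecks, the sketch below compares an applied load against each component's capacity and flags the components that are saturated or close to it. The function names, the 80% warning level and the capacity figures are hypothetical values chosen for the example, not measurements from any real system.

```python
from typing import Dict, List, Tuple


def utilization(capacities: Dict[str, float], load: float) -> Dict[str, float]:
    """Fraction of each stage's capacity consumed by the given load (requests/sec)."""
    return {stage: load / cap for stage, cap in capacities.items()}


def flag_bottlenecks(
    capacities: Dict[str, float],
    load: float,
    warn_at: float = 0.8,  # assumed warning level for this sketch
) -> List[Tuple[str, float, str]]:
    """Return stages that are saturated (>= 100%) or at risk (>= warn_at), busiest first."""
    flagged = []
    for stage, used in utilization(capacities, load).items():
        if used >= 1.0:
            flagged.append((stage, used, "saturated"))
        elif used >= warn_at:
            flagged.append((stage, used, "at risk"))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Illustrative capacities in requests per second.
    capacities = {"api": 1000.0, "db": 400.0, "cache": 5000.0}
    for load in (100.0, 350.0, 900.0):
        print(load, flag_bottlenecks(capacities, load))
```

In practice the capacities would not be known in advance; an experiment would raise the load step by step and observe latency or error rates to find where each stage begins to degrade.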
Chaos engineering helps teams build greater system resiliency and flexibility into their software. Therefore, organizations can approach coding new software and solutions more intelligently because they know how the current system handles issues.
1 Chaos Engineering: System Resiliency in Practice, Casey Rosenthal, Nora Jones, 2020
2 What is Chaos Monkey? Chaos engineering explained, InfoWorld, May 13, 2020
3 Knight Capital Says Trading Glitch Cost It $440 Million, New York Times, 2012
4 There Is No Resilience without Chaos, The New Stack, April 13, 2023
5 Incident Management in the Cloud Era, Constellation Research, 2023
6 ChAP: Chaos Automation Platform, Netflix Blog, July 26, 2017
7 The I&O Leader’s Guide to Chaos Engineering, Gartner, October 28, 2021