Chaos engineering is the intentional, controlled injection of failures into a production or pre-production environment to understand their impact and to plan a better defense posture and incident management strategy.
Every day creates a new opportunity for an organization’s critical application or infrastructure to fail, potentially threatening its ability to deliver services to customers. Failures have many causes, including security breaches, misconfigurations and service disruptions. The likelihood of errors or disruptions can rise as more applications and data are hosted in the cloud, which can also increase exposure to security issues.
One way to address disruptions is chaos engineering. It is not a random process in which engineers terminate instances or services, or otherwise cause systems to fail, without any purpose. Instead, it identifies potential future issues so that engineering teams can solve problems proactively and avoid them in the live environment later.
Chaos engineering is important because an error or disruption can slow an organization’s momentum, forcing teams to spend precious time finding a solution on the fly while downtime mounts. Netflix learned this firsthand during its switch from on-premises infrastructure to the cloud [1]: in 2008, it experienced an outage that interrupted service delivery for three days.
This outage predates the company's transformation into a video streaming operation, which would have made the outage far more costly. As a result, Netflix decided to do everything possible to minimize disruptions and began to introduce chaos engineering into its workflows. This process allows it to identify issues before they happen and to minimize the damage if and when an unavoidable failure occurs.
Netflix created Chaos Monkey [2], an open source tool that creates random incidents in IT services and infrastructure to expose weaknesses that can be fixed or addressed through automatic recovery procedures. Netflix implemented Chaos Monkey when it moved from a private data center to Amazon Web Services (AWS), in response to unreliability in the cloud. Many organizations now use Chaos Monkey to run their chaos engineering experiments.
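The idea of a tool that creates random incidents can be illustrated with a minimal sketch. This is not Chaos Monkey's actual code or API: it is an in-memory simulation, and `InstanceGroup`, `terminate_random_instance`, `run_experiment` and the `min_healthy` threshold are all hypothetical names chosen for illustration. It removes one randomly chosen instance from a pool and then checks whether the pool still meets an assumed health threshold.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class InstanceGroup:
    """Hypothetical stand-in for a pool of identical service instances."""
    name: str
    instances: Set[str] = field(default_factory=set)
    min_healthy: int = 1  # assumed: fewest instances needed to keep serving

    def is_healthy(self) -> bool:
        return len(self.instances) >= self.min_healthy


def terminate_random_instance(group: InstanceGroup, rng: random.Random) -> Optional[str]:
    """Remove one randomly chosen instance and return its ID (None if the pool is empty)."""
    if not group.instances:
        return None
    # Sort first so a seeded rng picks reproducibly regardless of set ordering.
    victim = rng.choice(sorted(group.instances))
    group.instances.remove(victim)
    return victim


def run_experiment(group: InstanceGroup, rng: random.Random) -> Dict[str, object]:
    """Terminate one instance, then record whether the group still meets its threshold."""
    victim = terminate_random_instance(group, rng)
    return {
        "group": group.name,
        "terminated": victim,
        "healthy_after": group.is_healthy(),
    }


if __name__ == "__main__":
    group = InstanceGroup("api", {"i-a", "i-b", "i-c"}, min_healthy=2)
    print(run_experiment(group, random.Random(42)))
```

In a real environment the termination step would call the infrastructure provider and the health check would query monitoring data; the simulation only shows the shape of the experiment: inject one failure, then verify that the service survives it.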
Chaos engineering is an important defense against infrastructure failures, outages or missing components in an organization’s production environment. It helps site reliability engineers (SREs) and other members of the DevOps team deliver services continuously by avoiding significant disruptions. It also helps them understand their vulnerabilities better and shows how to minimize the impact if a disruption occurs.
Even a small issue in code can have a catastrophic effect on the overall production environment because of dependencies between programs. For instance, an error in a financial services firm's transaction software can result in the loss of millions of dollars [3].
Organizations might be unable to avoid all IT incidents, but they can minimize the damage by using chaos engineering to understand likely scenarios and the best possible responses to them.
Organizations with high resilience, digital maturity and strong observability through dashboards and other tools should embrace chaos engineering, because they can act immediately on the issues that experiments surface. Organizations that lack this observability [4] can take too long to resolve the failures that their chaos experiments create.
Chaos engineering is also a must-have for organizations that use the cloud, particularly public cloud and cloud-native apps. The public cloud introduces potential outages that require coordination with the cloud provider, which calls for a different approach than handling on-premises issues.
According to Constellation Research [5], enterprises that use the cloud still often approach IT incidents without considering how the cloud and software-as-a-service (SaaS) affect those incidents differently.
In addition, the rise of microservices, which increases the number of hosts or containers running in a system, creates unique challenges that chaos experiments can unearth and help solve. Microservices shift complexity from code design into system operations, which does not eliminate the complexity but allows for greater automation.
Chaos engineering can also help organizations increase the velocity of their continuous integration and continuous delivery (CI/CD) pipelines. Incorporating chaos engineering into CI/CD, as Netflix did [6], enables organizations to automate continuous experiments while controlling their potential impact.
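As a rough illustration of an automated experiment with a controlled impact, the sketch below simulates a pipeline stage that injects a fault into only a fraction of requests and aborts as soon as the observed error rate exceeds a limit. It does not reflect Netflix's platform; the handlers, `run_chaos_stage`, `blast_radius`, `max_error_rate` and `min_samples` are hypothetical names and assumed parameters.

```python
import random
from typing import Callable, Dict


def fragile_handler(fault_injected: bool) -> bool:
    """Hypothetical service with no fallback: an injected fault fails the request."""
    return not fault_injected


def resilient_handler(fault_injected: bool) -> bool:
    """Hypothetical service with a fallback path: the request succeeds either way."""
    return True


def run_chaos_stage(
    handler: Callable[[bool], bool],
    requests: int,
    blast_radius: float,    # assumed: fraction of requests that receive the fault
    max_error_rate: float,  # assumed: abort threshold for the observed error rate
    rng: random.Random,
    min_samples: int = 50,  # assumed: requests to observe before aborting is allowed
) -> Dict[str, object]:
    """Send simulated requests, injecting faults into a limited share of them."""
    failures = 0
    for sent in range(1, requests + 1):
        fault_injected = rng.random() < blast_radius
        if not handler(fault_injected):
            failures += 1
        # Stop early once enough requests were seen and the limit is exceeded.
        if sent >= min_samples and failures / sent > max_error_rate:
            return {"passed": False, "aborted_at": sent, "error_rate": failures / sent}
    error_rate = failures / requests if requests else 0.0
    return {
        "passed": error_rate <= max_error_rate,
        "aborted_at": None,
        "error_rate": error_rate,
    }


if __name__ == "__main__":
    rng = random.Random(7)
    print(run_chaos_stage(resilient_handler, 1000, 0.1, 0.01, rng))
    print(run_chaos_stage(fragile_handler, 1000, 0.1, 0.01, rng))
```

A pipeline could fail the build when `passed` is false. Keeping `blast_radius` small and aborting early are the two levers that limit how much of the system an experiment can affect.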
Finally, the fact that organizations increasingly connect with partners through APIs means that an issue in their systems can have a knock-on impact on other organizations. Deploying chaos engineering helps organizations understand the weak points in their architecture and correct them, ultimately creating the ability to anticipate future failures.
Successful chaos engineering helps organizations minimize technical failures that significantly affect customers, and it supports building stronger, more resilient complex system architectures. Once an organization decides to pursue chaos engineering, the next step is determining whether to run it in the pre-production or production environment.
DevOps teams have several options for running chaos engineering experiments to test various system processes.
Creating the ideal chaos engineering process requires several principles to ensure that an organization can have a distributed system at scale.
Organizations that use chaos engineering must decide whether to use chaos testing in their production or pre-production environments. There are several reasons why chaos engineering is most beneficial in production environments.
Live environments provide the most accurate setting for understanding how an incident affects the customer experience. In addition, the pre-production environment might not have exactly the same settings as the live environment, which introduces some variability into the experiments.
For instance, an incident in a pre-production environment might not create a realistic response because it lacks the same traffic levels as the live environment. It also might not have the same security configurations as that environment.
Some organizations fear intentionally causing issues on their live site, so they run their experiments in a pre-production or development environment. This ensures that any issues that occur do not affect the live customer experience. To mitigate the risk, some organizations begin in pre-production environments to get a handle on the process before moving to the live production environment.
Organizations choose which environment to use based on their risk tolerance. Ultimately, chaos engineering aims to test actual large-scale issues, which is why production environments give the most accurate picture of what's happening and what requires fixing.
Chaos engineering provides organizations with several key benefits.
Customers have high expectations about the availability of the services they purchase from companies. Any downtime or inability to access what they've paid for can have a serious effect on customer satisfaction, leading to lost revenue and reputational damage. Testing systems and identifying solutions means that there is less risk that a system will be down for a significant period of time.
Disruptions can come from bad code, server issues or external threats. External threats can strike even when security practices are excellent. Chaos engineering helps identify issues that can be exploited, so organizations can introduce patches and bug fixes to keep their services secure.
Chaos engineering enables organizations to create a more informed blueprint for tackling issues that will occur in the future. Organizations that embrace chaos engineering will have specific game plans for many incidents, enabling quicker repair and less downtime. Chaos engineering can decrease downtime by as much as 20% [7].
Chaos engineering experiments identify how a system allocates resources. Introducing experiments will demonstrate how the system handles loads, showing where bottlenecks are or are likely to occur.
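One simple way to picture this is to compare the load on each component with its capacity, then repeat the comparison after an experiment degrades one component. The sketch below is a simplified model, not a measurement tool: the component names, capacities and the `degraded` multipliers are made-up assumptions, and `utilization_under_fault` and `saturated` are hypothetical helper names.

```python
from typing import Dict, List, Optional


def utilization_under_fault(
    capacity_rps: Dict[str, float],
    load_rps: float,
    degraded: Optional[Dict[str, float]] = None,
) -> Dict[str, float]:
    """Return load divided by effective capacity for each component.

    `degraded` maps a component name to the fraction of its capacity that
    remains during the experiment (for example, 0.5 means half is lost).
    """
    degraded = degraded or {}
    result = {}
    for name, capacity in capacity_rps.items():
        effective = capacity * degraded.get(name, 1.0)
        # A component with no remaining capacity is treated as fully saturated.
        result[name] = float("inf") if effective <= 0 else load_rps / effective
    return result


def saturated(utilization: Dict[str, float], threshold: float = 1.0) -> List[str]:
    """List the components whose utilization meets or exceeds the threshold."""
    return sorted(name for name, value in utilization.items() if value >= threshold)


if __name__ == "__main__":
    capacities = {"gateway": 1000.0, "auth": 400.0, "db": 500.0}  # assumed values
    print(saturated(utilization_under_fault(capacities, 300.0)))
    print(saturated(utilization_under_fault(capacities, 300.0, {"db": 0.5})))
```

In this toy model nothing is saturated at 300 requests per second until the database loses half its capacity, at which point it becomes the bottleneck. A real experiment would read these numbers from monitoring rather than assume them.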
Chaos engineering helps teams build greater system resiliency and flexibility into their software. Therefore, organizations can approach coding new software and solutions more intelligently because they know how the current system handles issues.
1 Chaos Engineering: System Resiliency in Practice, Casey Rosenthal, Nora Jones, 2020.
2 What is Chaos Monkey? Chaos engineering explained, InfoWorld, 13 May 2020.
3 Knight Capital Says Trading Glitch Cost It USD 440 Million, New York Times, 2012.
4 There Is No Resilience without Chaos, The New Stack, 13 Apr 2023.
5 Incident Management in the Cloud Era, Constellation Research, 2023.
6 ChAP: Chaos Automation Platform, Netflix Blog, 26 July 2017.
7 The I&O Leader’s Guide to Chaos Engineering, Gartner, 28 October 2021.