What is chaos engineering?

Chaos engineering is the practice of intentionally causing controlled failures in a production or pre-production environment to understand their impact and to plan a stronger defense posture and incident management strategy.

Every day brings a new opportunity for an organization’s critical applications or infrastructure to fail, potentially threatening its ability to deliver services to customers. Failures have many causes, including security breaches, misconfigurations and service disruptions. The likelihood of errors or disruptions can rise as more applications and data are hosted in the cloud, which can also increase security issues.

One way to address disruptions is chaos engineering. It is not a random process in which engineers terminate instances or services, or otherwise cause systems to fail, without any purpose. Instead, it is a disciplined process that identifies potential future issues, allowing engineering teams to solve problems proactively and avoid them in the live environment later.

Chaos engineering is important because an error or disruption can slow an organization’s momentum: teams spend precious time working out a solution on the fly while downtime accumulates. Netflix learned this lesson firsthand in 2008, around the time of its switch from on-premises to the cloud¹, when an outage interrupted service delivery for three days.

This outage predates Netflix’s transformation into a video streaming operation, when a similar outage would have been far more costly. As a result, Netflix decided to do everything possible to minimize disruptions and began to introduce chaos engineering into its workflows. This process allows the company to identify issues before they happen and to minimize the damage if and when an unavoidable failure occurs.

Netflix created Chaos Monkey², an open source tool that creates random incidents in IT services and infrastructure to expose weaknesses that can be fixed or handled through automatic recovery procedures. Netflix implemented Chaos Monkey when it moved from a private data center to Amazon Web Services (AWS), in response to unreliability in the cloud. Many organizations now use Chaos Monkey to run their chaos engineering experiments.

Chaos engineering is an important defense against infrastructure failures, outages or missing components in an organization’s production environment. It helps site reliability engineers (SREs) and other members of the DevOps team to provide continuous delivery of services by avoiding significant disruptions to their service. Chaos engineering helps them understand their vulnerabilities better and informs how to minimize the impact if a disruption occurs.

Even a small issue in code can have a catastrophic effect on the overall production environment given different program dependencies. For instance, an error in the transaction software system for a financial services firm can result in the loss of millions of dollars³.

Organizations might be unable to avoid all IT incidents, but they can minimize the damage by using chaos engineering to understand likely scenarios and their best possible solutions.

Organizations that benefit from chaos engineering

Organizations with high resilience, digital maturity and strong observability through dashboards and other tools are well placed to embrace chaos engineering, because they can act immediately on the issues that experiments reveal. Organizations that lack this observability⁴ can take too long to resolve the issues that their chaos experiments create.

Chaos engineering is also a must-have for organizations that use the cloud, particularly public cloud and cloud-native apps. The public cloud introduces potential outage issues that require coordination with the cloud provider, which calls for a different approach from dealing with on-premises issues.

According to Constellation Research⁵, enterprises that use the cloud still often approach IT incidents without considering how the cloud and software-as-a-service (SaaS) affect those incidents differently.

In addition, the rise of microservices, which increases the number of hosts or containers running in a system, creates unique challenges that can be unearthed and solved through chaos experiments. Microservices shift complexity from code design into system operations, which does not eliminate the complexity but allows for greater automation.

Chaos engineering can also help organizations to enhance the velocity of their continuous integration and continuous delivery (CI/CD) pipelines. Incorporating chaos engineering into CI/CD as Netflix did⁶ enables organizations to automate continuous experiments while controlling their potential impact.

Finally, the fact that organizations increasingly connect with partners through APIs means that an issue in their systems can have a knock-on impact on other organizations. Deploying chaos engineering helps organizations understand the weak points in their architecture and correct them, ultimately creating the ability to anticipate future failures.

Successful chaos engineering helps organizations to minimize technical failures with any significant customer impact and it also supports the construction of stronger and more resilient complex system architectures. Once an organization decides to pursue chaos engineering, the next step is determining whether to execute it in the pre-production or production environment.

Types of chaos engineering experiments

DevOps teams have several options for running chaos engineering experiments to test various system processes.

Latency injection: DevOps teams intentionally create scenarios that emulate a slow or failing network connection. This includes the introduction of network delays or slower response times.
Fault injection: This involves purposefully introducing errors into the system to determine how it affects other dependent systems and whether it interrupts services. Examples of fault injections include inducing disk failures, terminating processes, shutting down a host or introducing power or temperature increases. Fault injections can help organizations identify any single points of failure, which can cause the entire system to fail if something happens to them.
Load generation: This relates to intentionally stressing the system by sending significant traffic levels well beyond normal operations. This helps the site reliability engineers (SREs) to understand any bottlenecks in the system, which in turn allows them to build more scalable systems.
Canary testing: This involves releasing a new product or feature to a small group of users. That way, any glitches or bugs will only affect a percentage of visitors, leaving the rest of the audience to access the existing website experience.
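
As a rough illustration of the first two experiment types, the following minimal Python sketch wraps a function so that each call is delayed (latency injection) and sometimes fails (fault injection). The names `chaos`, `InjectedFault`, `fetch_price` and `fetch_price_with_fallback` are hypothetical and are not taken from any specific chaos engineering tool. Real tools typically inject faults at the network or infrastructure level rather than inside application code.

```python
import random
import time
from functools import wraps


class InjectedFault(Exception):
    """Raised in place of a real dependency failure (hypothetical name)."""


def chaos(latency_s=0.0, failure_rate=0.0, rng=None, sleep=time.sleep):
    """Wrap a function so that calls are delayed and sometimes fail.

    latency_s: extra delay added before every call (latency injection).
    failure_rate: probability in [0, 1] that a call raises InjectedFault
        instead of running (fault injection).
    """
    if rng is None:
        rng = random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                sleep(latency_s)  # emulate a slow network hop
            if rng.random() < failure_rate:
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


def fetch_price(item_id):
    """Stand-in for a call to a downstream service."""
    return {"item": item_id, "price": 10}


def fetch_price_with_fallback(fetch, item_id):
    """Behavior under test: degrade gracefully when the dependency fails."""
    try:
        return fetch(item_id)
    except InjectedFault:
        return {"item": item_id, "price": None, "degraded": True}


if __name__ == "__main__":
    # Half of the calls fail and every call is 50 ms slower than normal.
    flaky_fetch = chaos(latency_s=0.05, failure_rate=0.5)(fetch_price)
    for _ in range(5):
        print(fetch_price_with_fallback(flaky_fetch, "sku-1"))
```

The point of such an experiment is less the injection itself than the observation that follows: whether the caller degrades gracefully, times out or takes dependent services down with it.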

Best practices of chaos engineering 

An effective chaos engineering process rests on several principles that help an organization run a reliable distributed system at scale.

Understand the system: This involves a comprehensive knowledge of the holistic system, its emergent properties and functions and its topology, architecture, dependencies, steady-state behavior, output response and characteristics such as availability, latency and throughput.
Embrace failure: It seems paradoxical initially for software engineers to allow an incident to happen when they are wired to prevent such occurrences. However, disruptions will always occur in IT services and it's better to experience them in a controlled environment to identify the solution preemptively instead of after hours when an organization's team is off the clock or has not encountered that specific problem before.
Establish steady-state behavior: First, the engineering team must define how the system should behave when running correctly, so it can compare how experiments impact that steady state.
Identify real-world incidents: Chaos engineering experiments should hew as closely as possible to what might happen on a normal day instead of creating unlikely situations. Network and infrastructure failures, bad code, power issues and traffic overload are potential occurrences.
Create a game day: Teams can set aside a specific day, known as a game day, to run multiple experiments so that they can concentrate their resources on identifying and resolving as many issues as possible.
Use automation: Organizations of all sizes can use chaos engineering by automating experiments, which would be too labor intensive if companies manually conducted them. This reduces some of the burden on IT teams during the chaos engineering process. Experiment design, failure injection and infrastructure provisioning are all aspects of experimentation that organizations can automate.
Be mindful of the blast radius: A chaos engineer must take great pains to minimize the blast radius so that the actual harm to customers is as minor as possible. Some ways to minimize the blast radius are:
- Target a subset of services: Chaos engineering, especially in a production setting, should not fundamentally disrupt an organization's service. Targeting a specific subset of services can minimize the impact of an incident if it occurs, ensuring that it does not take down the entire system.
- Run the experiment for a finite time: The experiment should have a beginning and an end. The point of the experiment is to create an incident and resolve it, rather than letting it run unchecked for a long time.
- Run the experiment away from peak traffic: Organizations should try to avoid running experiments during peak times unless they are specifically trying to gauge how high capacity affects the system during an incident.
- Run the experiment in the development environment: The easiest way to ensure that no customers experience service interruption is to run in the pre-production environment. However, that means that the conditions will differ from the production environment, potentially giving a false picture of what is occurring. To minimize this, ensure that your pre-production and production environments mirror each other as much as possible.
Experiment on every component: Chaos experimentation never ends because an organization's system is continually changing. Another goal should be to test "everything": examine all components, layers and services, and their dependencies, throughout the process.
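
Several of the practices above (establishing a steady state, targeting a subset of services and running for a finite time) can be combined in one small experiment runner. The sketch below is illustrative only: `run_experiment`, `in_blast_radius` and the callback parameters are hypothetical names, and the `inject`, `rollback` and `steady_state_ok` callables are assumed to be supplied by the team's own tooling.

```python
import hashlib
import time


def in_blast_radius(target_id, percent):
    """Deterministically select roughly `percent`% of targets by hashing IDs."""
    digest = hashlib.sha256(target_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def run_experiment(targets, percent, inject, rollback, steady_state_ok,
                   max_seconds, poll_seconds=1.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Run one bounded experiment and report what happened."""
    # Never start from an unhealthy baseline.
    if not steady_state_ok():
        return {"status": "skipped", "affected": []}

    # Limit the blast radius to a subset of targets.
    affected = [t for t in targets if in_blast_radius(t, percent)]
    deadline = clock() + max_seconds
    status = "completed"
    try:
        for target in affected:
            inject(target)
        # Finite duration: stop at the deadline, or earlier if the
        # steady-state check fails.
        while clock() < deadline:
            if not steady_state_ok():
                status = "aborted"
                break
            sleep(poll_seconds)
    finally:
        # Always undo the injected failure, even on errors.
        for target in affected:
            rollback(target)
    return {"status": status, "affected": affected}


if __name__ == "__main__":
    services = [f"svc-{i}" for i in range(20)]
    report = run_experiment(
        services, percent=10,
        inject=lambda t: print("injecting into", t),
        rollback=lambda t: print("restoring", t),
        steady_state_ok=lambda: True,
        max_seconds=2, poll_seconds=0.5,
    )
    print(report["status"], report["affected"])
```

Hashing the target ID keeps the selected subset stable between runs, which makes results easier to compare. The `finally` block reflects the idea that an experiment should always end in a known state.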

Production environments versus pre-production environments

Organizations that use chaos engineering must decide whether to use chaos testing in their production or pre-production environments. There are several reasons why chaos engineering is most beneficial in production environments.

Live environments provide the most accurate setting for understanding how an incident impacts the customer experience. Another reason is that the pre-production environment might not have the same settings as the live environment, which introduces some variability into the experiments.

For instance, an incident in a pre-production environment might not create a realistic response because it lacks the same traffic levels as the live environment. It also might not have the same security configurations as that environment.

Some organizations fear intentionally causing issues with their live site, so they run their experiments on their pre-production or development site. This ensures that any issues that occur do not impact the live customer experience, at the cost of less realistic results. As a compromise, some organizations begin in pre-production environments to get a handle on the process before moving to the live production environment.

Organizations choose which environment to use based on their risk tolerance. Ultimately, chaos engineering aims to test actual large-scale issues, which is why production environments give the most accurate picture of what's happening and what requires fixing.

Benefits of chaos engineering

Chaos engineering provides organizations with several key benefits.

Better customer service

Customers have high expectations about the availability of the services they purchase from companies. Any downtime or inability to access what they've paid for can have a serious effect on customer satisfaction, leading to lost revenue and reputational damage. Testing systems and identifying solutions means that there is less risk that a system will be down for a significant period of time.

Improved data security

Disruptions can come from bad code, server issues or external threats. The latter can strike even with excellent security practices. Chaos engineering helps identify issues that can be exploited, so organizations can introduce patches and bug fixes to keep their services secure.

Minimized downtime

Chaos engineering enables organizations to create a more informed blueprint for how they tackle issues that will occur in the future. Organizations that embrace chaos engineering will have specific game plans for many incidents, enabling quicker repair and less downtime. Chaos engineering can decrease downtime by as much as 20%⁷.

Increased scalability

Chaos engineering experiments identify how a system allocates resources. Introducing experiments will demonstrate how the system handles loads, showing where bottlenecks are or are likely to occur.
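
A load-generation experiment of this kind can be sketched in a few lines of Python. The example below steps up the number of concurrent callers and records the 95th-percentile latency at each level. The names `p95_latency`, `ramp_load` and `constrained_handler` are hypothetical, and the handler is a simulated service with one shared resource, not a real endpoint.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import quantiles


def p95_latency(handler, concurrency, requests):
    """Send `requests` calls from `concurrency` workers; return p95 in seconds."""
    def timed_call(_):
        start = time.perf_counter()
        handler()
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, range(requests)))
    # quantiles(..., n=100) returns 99 cut points; index 94 is the 95th percentile.
    return quantiles(latencies, n=100)[94]


def ramp_load(handler, levels, requests_per_level=40):
    """Measure p95 latency at each concurrency level in `levels`."""
    return {level: p95_latency(handler, level, requests_per_level)
            for level in levels}


# Simulated bottleneck: a shared resource that serves one call at a time.
_single_connection = threading.Lock()


def constrained_handler():
    with _single_connection:
        time.sleep(0.005)


if __name__ == "__main__":
    for level, p95 in ramp_load(constrained_handler, [1, 2, 4, 8]).items():
        print(f"concurrency={level:2d}  p95={p95 * 1000:.1f} ms")
```

In this simulation, latency grows with concurrency because every caller queues on the same lock. A similar pattern in a real system would point to a shared resource that limits scalability.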

Better-informed software development

Chaos engineering helps teams build greater system resiliency and flexibility into their software. Therefore, organizations can approach coding new software and solutions more intelligently because they know how the current system handles issues.

Footnotes

¹ Chaos Engineering: System Resiliency in Practice, Casey Rosenthal, Nora Jones, 2020.
² What is Chaos Monkey? Chaos engineering explained, InfoWorld, 13 May 2020.
³ Knight Capital Says Trading Glitch Cost It USD 440 Million, New York Times, 2012.
⁴ There Is No Resilience without Chaos, The New Stack, 13 Apr 2023.
⁵ Incident Management in the Cloud Era, Constellation Research, 2023.
⁶ ChAP: Chaos Automation Platform, Netflix Blog, 26 July 2017.
⁷ The I&O Leader’s Guide to Chaos Engineering, Gartner, 28 October 2021.