As I mentioned in the introductory article in this series, the CTO at Amazon.com, Werner Vogels, follows a simple principle: "Everything fails, all the time." I've often said that when it comes to software, you should never assume anything. If you continually prepare for failure in all the resources you manage — be it your hardware or software — you're more likely to succeed. This paradox is at the core of various tools that randomly disable or terminate production resources to test and ensure that automatic recovery mechanisms are a part of your infrastructure.
The most notable of these tools is the Chaos Monkey (see Resources for this and other tools), which was developed by the Netflix technical team and open sourced earlier this year. In this article, I introduce the principles and steps for incorporating the Chaos Monkey into your infrastructure to ensure that it can handle the inevitable failures that will occur.
Tools like the Chaos Monkey are the result of an evolution toward ephemeral environments (discussed in the previous installment) brought on by infrastructure commoditization, virtualization, and cloud computing. It used to be that an infrastructure, with its physical machines, network switches, firewalls, load balancers, software servers, and other resources, was something a team of engineers would set up manually one time. Then they'd monitor its usage, continually making manual tweaks to modify configuration, improve performance, and perform other activities. This is no longer considered a good practice, and it is simply impossible for any infrastructure of nontrivial scale. Tools like the Chaos Monkey apply monitoring, diagnostics, randomization, and disruption to the infrastructure so that engineers are pushed to automate recovery and limit the impact on users when big problems do occur.
A high-level list of the steps that make up the process for creating an environment for continuously testing an infrastructure includes:
- Launching instances: Start up some compute instances.
- Creating an autonomic infrastructure: Configure an infrastructure that launches new instances (based on the same template) when the infrastructure identifies unhealthy instances (see IBM's portal on autonomic computing in Resources).
- Applying automatic testing to ensure automatic recovery: Run the tests during hours when engineers are ready to react and fix.
- Learning and preventing: When failures do happen, react and prevent the failure from happening again.
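The second step in the list above, an infrastructure that replaces unhealthy instances from the same template, can be sketched as a small in-memory model. This is only an illustration: the names `Instance`, `SelfHealingGroup`, and `reconcile` are hypothetical, no cloud API is called, and a real environment would rely on its provider's health checks and launch mechanisms.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class Instance:
    instance_id: str
    template: str          # name of the template the instance was launched from
    healthy: bool = True


@dataclass
class SelfHealingGroup:
    """Toy model (hypothetical) of an infrastructure that replaces unhealthy instances."""
    template: str
    desired_capacity: int
    instances: list = field(default_factory=list)
    _ids: itertools.count = field(
        default_factory=lambda: itertools.count(1), init=False, repr=False
    )

    def launch(self) -> Instance:
        """Launch one new instance from the group's template."""
        instance = Instance(f"i-{next(self._ids):04d}", self.template)
        self.instances.append(instance)
        return instance

    def reconcile(self) -> int:
        """Drop unhealthy instances, then launch replacements from the same template.

        Returns the number of instances launched.
        """
        self.instances = [i for i in self.instances if i.healthy]
        missing = max(self.desired_capacity - len(self.instances), 0)
        for _ in range(missing):
            self.launch()
        return missing


if __name__ == "__main__":
    group = SelfHealingGroup(template="web-server", desired_capacity=3)
    group.reconcile()                    # initial launch up to desired capacity
    group.instances[0].healthy = False   # simulate a failure
    replaced = group.reconcile()         # the group heals itself
    print(replaced, [i.instance_id for i in group.instances])
```

The point of the sketch is that recovery is a routine reconciliation against a desired state, not a manual repair: the same code path handles the first launch and every later failure.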
The basis for a continuously tested and self-healing infrastructure is the conviction that:
- An infrastructure will fail.
- You need to test for these failures in production when engineers are available.
- When the failure occurs again, your infrastructure must automatically recover from it without users ever noticing.
In many ways, it's the perfect technical implementation of an organization truly committed to continuous improvement or kaizen (see Resources).
In summary, this new breed of resiliency tool and accompanying infrastructure has the following features:
- Monitoring: Daemon processes are continuously run to diagnose errors.
- Diagnostics: Diagnostic tools are run as part of system monitoring.
- Disruption: The infrastructure is intentionally disrupted by shutting down instances and other disruptive activities.
- Randomization: To prevent expected outcomes and behavior, the disruption is randomly applied to the infrastructure.
- Self-healing infrastructure: Although not a part of the resiliency tool, the expected resultant behavior is that teams continue to apply and improve on an autonomic infrastructure capable of recovering from service disruptions without users noticing.
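To make the disruption and randomization features concrete, here is a minimal sketch of how a tool might decide whether to terminate an instance on a given run. It is not the Chaos Monkey's actual implementation: the function names, the weekday window of 09:00 to 15:00, and the `probability` parameter are assumptions chosen for illustration.

```python
import random
from datetime import datetime
from typing import Optional, Sequence


def in_test_window(now: datetime, start_hour: int = 9, end_hour: int = 15) -> bool:
    """True on weekdays from start_hour (inclusive) to end_hour (exclusive).

    The window is an assumption standing in for "hours when engineers
    are available to react".
    """
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def pick_victim(instances: Sequence[str], probability: float, now: datetime,
                rng: random.Random) -> Optional[str]:
    """Return one randomly chosen instance ID to terminate, or None.

    Nothing is chosen outside the test window, when the fleet is empty,
    or when the random draw falls outside the given probability.
    """
    if not instances or not in_test_window(now):
        return None
    if rng.random() >= probability:
        return None
    return rng.choice(list(instances))


if __name__ == "__main__":
    fleet = ["i-0001", "i-0002", "i-0003"]   # hypothetical instance IDs
    victim = pick_victim(fleet, probability=0.5, now=datetime.now(),
                         rng=random.Random())
    print("terminate:", victim)
```

Passing the random generator in as a parameter keeps the disruption unpredictable in production while still letting you seed it in tests.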
Netflix relies heavily on a cloud infrastructure for streaming movies to users, along with other functionality. In July 2012, it was reported that Netflix users had streamed more than 1 billion hours in June 2012 (see Resources). In other words, Netflix isn't a trivial user of the cloud; it uses it on a massive scale.
A Quick Start Guide on GitHub authored by the Netflix tech team (see "Quick Start Guide for Chaos Monkey" in Resources) describes the steps to go through to get the Chaos Monkey up and running. The following list gives you some more information on the tools that the Chaos Monkey uses. Be sure to run the commands described in the guide to remove any unused resources, or else you will be continually charged for usage.
- Auto Scaling: Auto Scaling is a feature of Amazon Web Services that enables you to scale compute capacity up and down based on demand, through rules that you define. Although it's an AWS-specific feature, you can create this type of scalable environment with your own private or public cloud infrastructure. Auto Scaling has two key components: a launch configuration and an Auto Scaling Group. A launch configuration defines how an instance within an Auto Scaling Group is launched. An Auto Scaling Group is a collection of instances to which a particular launch configuration is applied.
- SimpleDB: SimpleDB is a NoSQL database you can use to persist data. You need to define a SimpleDB domain. It's used by the Chaos Monkey to store state.
- Gradle: Gradle is a build tool. It's used to build the Chaos Monkey and to start the Jetty application container.
- Properties file: You need to modify a simianarmy.properties file with credentials and other configurable information.
- Jetty: The in-memory Jetty server runs the Chaos Monkey to disrupt your infrastructure randomly.
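The relationship between the two Auto Scaling components in the list above can be sketched as a pair of data classes. This is a simplified model under my own assumptions, not the AWS API: the field names loosely follow AWS terminology, and the image ID, instance type, and group name in the example are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LaunchConfiguration:
    """Describes how each instance in a group is launched."""
    name: str
    image_id: str        # machine image to boot from
    instance_type: str   # size of the compute instance


@dataclass
class AutoScalingGroup:
    """A collection of instances that all share one launch configuration."""
    name: str
    launch_configuration: LaunchConfiguration
    min_size: int
    max_size: int
    desired_capacity: int

    def set_desired_capacity(self, requested: int) -> int:
        """Clamp the requested capacity to [min_size, max_size] and store it."""
        self.desired_capacity = max(self.min_size, min(self.max_size, requested))
        return self.desired_capacity


if __name__ == "__main__":
    config = LaunchConfiguration(
        name="example-launch-config",    # hypothetical values throughout
        image_id="ami-00000000",
        instance_type="t1.micro",
    )
    group = AutoScalingGroup(
        name="example-group",
        launch_configuration=config,
        min_size=1,
        max_size=3,
        desired_capacity=1,
    )
    print(group.set_desired_capacity(5))   # request exceeds max_size, so it is clamped
```

Because the group only knows its launch configuration and its size limits, any instance a disruption tool terminates can be replaced by an identical one without human intervention.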
The Chaos Monkey is the first entry in the Netflix technical team's Simian Army. In Table 1, I list other tools that Netflix has proposed that will constitute the Simian Army (see Resources):
Table 1. A simian army
| Name | Description |
|---|---|
| Chaos Gorilla | Simulates outage of an entire availability zone |
| Conformity Monkey | Shuts down instances that don't adhere to best practices |
| Doctor Monkey | Performs health checks (such as CPU) |
| Janitor Monkey | Searches for unused resources and disposes of them |
| Latency Monkey | Creates artificial delays in client-server communication |
| Security Monkey | Finds security vulnerabilities such as improperly configured security groups |
These are just a few ideas. The possibilities for other ways to apply a combination of monitoring, diagnostics, testing, and intentional destruction in cloud-based production environments are endless.
In this article, you learned that you can truly begin creating autonomic infrastructure capable of healing itself with the help of tools such as the Chaos Monkey and a cloud environment.
In the next article, you'll learn about test-driven infrastructure: how to apply test-driven development techniques, commonly used by developers for application code, to your infrastructure, using tools such as Cucumber.
Learn
- "Automation for the people: Continuous testing" (Paul Duvall, developerWorks, March 2007): This article discusses running automated tests with every change to a code base.
- "Failure as a Service" (Haryadi S. Gunawi et al., University of California at Berkeley Technical Report, July 2011): This paper discusses routinely performing large-scale failure drills in real deployments.
- "Netflix streaming tops 1 billion hours in month for first time" (Rachel King, CNET, July 2012): Netflix's digital streaming service hit a major milestone recently: more than 1 billion hours viewed in a month.
- Quick Start Guide for Chaos Monkey: A guide for running the Chaos Monkey in your environment.
- Autonomic Computing: IBM's portal for autonomic computing.
- Kaizen: Wikipedia describes this approach, which originated in Japan, to continuous improvement of processes.
- "Chaos Monkey Released Into The Wild" (Cory Bennett and Ariel Tseitlin, Netflix, July 2012): Announcement of the official release of the open source Chaos Monkey on GitHub.
Get products and technologies
- Simian Army: Netflix's open source Simian Army. The Chaos Monkey is one entry in what will be a suite of open source tools.
- Bees with machine guns: A utility for arming (creating) many bees (micro EC2 instances) to attack (load test) targets (web applications).
- IBM Tivoli Provisioning Manager: Tivoli Provisioning Manager enables a dynamic infrastructure by automating the management of physical servers, virtual servers, software, storage, and networks.
- IBM Tivoli System Automation for Multiplatforms: Tivoli System Automation for Multiplatforms provides high availability and automation for enterprise-wide applications and IT services.

Paul Duvall is the CTO of Stelligent. A featured speaker at many leading software conferences, he has worked in virtually every role on software projects: developer, project manager, architect, and tester. He is the principal author of Continuous Integration: Improving Software Quality and Reducing Risk (Addison-Wesley, 2007) and a 2008 Jolt Award Winner. He is also the author of Startup@Cloud and DevOps in the Cloud LiveLessons (Pearson Education, June 2012). He's contributed to several other books as well. Paul authored the 20-article Automation for the people series on developerWorks. He is passionate about getting high-quality software to users quicker and more often through continuous delivery and the cloud. Read his blog at Stelligent.com.




