Agile DevOps: Unleash the Chaos Monkey

How to deliberately use failure to ensure system resiliency

When would it ever be a good idea to randomly and intentionally try to terminate parts of your software system — including the hardware it runs on? How about early and often? In this Agile DevOps installment, DevOps expert Paul Duvall describes approaches to creating a Chaos Monkey (as it's been dubbed by Netflix) to ensure that your production infrastructure can recover from inevitable system failures.


Paul Duvall (paul.duvall@stelligent.com), CTO, Stelligent

Paul Duvall is the CTO of Stelligent. A featured speaker at many leading software conferences, he has worked in virtually every role on software projects: developer, project manager, architect, and tester. He is the principal author of Continuous Integration: Improving Software Quality and Reducing Risk (Addison-Wesley, 2007) and a 2008 Jolt Award Winner. He is also the author of Startup@Cloud and DevOps in the Cloud LiveLessons (Pearson Education, June 2012). He's contributed to several other books as well. Paul authored the 20-article Automation for the people series on developerWorks. He is passionate about getting high-quality software to users quicker and more often through continuous delivery and the cloud. Read his blog at Stelligent.com.



23 October 2012


About this series

Developers can learn a lot from operations, and operations can learn a lot from developers. This series of articles is dedicated to exploring the practical uses of applying an operations mindset to development, and vice versa — and of considering software products as holistic entities that can be delivered with more agility and frequency than ever before.

As I mentioned in the introductory article in this series, the CTO at Amazon.com, Werner Vogels, follows a simple principle: "Everything fails, all the time." I've often said that when it comes to software, you should never assume anything. If you continually prepare for failure in all the resources you manage — be it your hardware or software — you're more likely to succeed. This paradox is at the core of various tools that randomly disable or terminate production resources to test and ensure that automatic recovery mechanisms are a part of your infrastructure.

The most notable of these tools is the Chaos Monkey (see Resources for this and other tools), which was developed by the Netflix technical team and open sourced earlier this year. In this article, I introduce the principles and steps for incorporating the Chaos Monkey into your infrastructure to ensure that it can handle the inevitable failures that will occur.

Tools like the Chaos Monkey are the result of an evolution toward ephemeral environments (discussed in the previous installment) brought on by infrastructure commoditization, virtualization, and cloud computing. It used to be that an infrastructure — with its physical machines, network switches, firewall, load balancers, software servers, and other resources — was something a team of engineers would set up manually one time. Then they'd monitor its usage, continually making manual tweaks to modify configuration, improve performance, and perform other activities. This is no longer considered a good practice and is quite simply impossible to do for any infrastructure of nontrivial scale. Tools like the Chaos Monkey apply monitoring, diagnostics, randomization, and disruption to the infrastructure to ensure that engineers apply automation to limit the impact users experience when big problems do occur.

Seek and destroy

At a high level, creating an environment for continuously testing an infrastructure involves these steps:

  • Launching instances: Start up some compute instances.
  • Creating an autonomic infrastructure: Configure an infrastructure that launches new instances (based on the same template) when it identifies unhealthy instances (see IBM's portal on autonomic computing in Resources).
  • Applying automatic testing to ensure automatic recovery: Run the tests during hours when engineers are ready to react and fix problems.
  • Learning and preventing: When failures do happen, react and prevent the failure from happening again.
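
The third step, running disruptive tests only while engineers are on hand, can be sketched as a simple schedule check. This is a minimal illustration in Python, not how the Chaos Monkey itself is implemented; the function name disruption_allowed and the weekday 09:00 to 15:00 window are assumptions chosen for the example.

```python
from datetime import datetime, time

# Hypothetical staffed window: weekdays, 09:00 up to (not including) 15:00.
WINDOW_START = time(9, 0)
WINDOW_END = time(15, 0)


def disruption_allowed(now: datetime) -> bool:
    """Return True only on weekdays inside the staffed window."""
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    in_window = WINDOW_START <= now.time() < WINDOW_END
    return is_weekday and in_window


if __name__ == "__main__":
    # A disruption tool would call this check before terminating anything.
    print(disruption_allowed(datetime.now()))
```

A check like this keeps the disruption inside the hours when someone can react to it, which is what makes running such tests against production tolerable.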

Why test the infrastructure in production?

Some argue that an infrastructure should be intentionally disrupted only in nonproduction environments, never in production. The Chaos Monkey takes the opposite approach: it deliberately destroys resources in production environments, but at a time of day when most engineers are available to fix any errors that occur. Nothing compares to finding and fixing problems in real environments.

The basis for a continuously tested and self-healing infrastructure is the conviction that:

  • An infrastructure will fail.
  • You need to test for these failures in production when engineers are available.
  • When the failure occurs again, your infrastructure must automatically recover from it without users ever noticing.

In many ways, it's the perfect technical expression of an organization's commitment to continuous improvement, or kaizen (see Resources).

In summary, this new breed of resiliency tool and accompanying infrastructure has the following features:

  • Monitoring: Daemon processes run continuously to diagnose errors.
  • Diagnostics: Diagnostic tools run as part of system monitoring.
  • Disruption: The infrastructure is intentionally disrupted by shutting down instances and performing other disruptive activities.
  • Randomization: The disruption is applied randomly so that teams cannot anticipate which resources will fail.
  • Self-healing infrastructure: Although it is not part of the resiliency tool itself, the expected result is that teams continue to build and improve an autonomic infrastructure capable of recovering from service disruptions without users noticing.
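
To make the interplay of disruption, randomization, and self-healing concrete, here is a toy in-memory simulation in Python. It touches no real cloud resources and is not the Chaos Monkey's code; the Fleet class, the unleash function, and the instance ID format are all hypothetical names invented for this sketch.

```python
import random


class Fleet:
    """Toy in-memory model of a group of instances kept at a desired size."""

    def __init__(self, desired: int):
        self.desired = desired
        self.next_id = 0
        self.running = set()
        self.heal()

    def heal(self) -> int:
        """Self-healing: launch replacements until the fleet is at its desired size."""
        launched = 0
        while len(self.running) < self.desired:
            self.running.add(f"i-{self.next_id}")
            self.next_id += 1
            launched += 1
        return launched

    def terminate(self, instance_id: str) -> None:
        """Disruption: remove one instance from the running set."""
        self.running.discard(instance_id)


def unleash(fleet: Fleet, probability: float, rng: random.Random) -> list:
    """Randomization: terminate each running instance with the given probability."""
    victims = [i for i in sorted(fleet.running) if rng.random() < probability]
    for victim in victims:
        fleet.terminate(victim)
    return victims


if __name__ == "__main__":
    rng = random.Random(42)  # seeded only so the demo is repeatable
    fleet = Fleet(desired=4)
    killed = unleash(fleet, probability=0.5, rng=rng)
    replaced = fleet.heal()
    print(f"terminated {len(killed)}, launched {replaced}, running {len(fleet.running)}")
```

In a real environment the heal step would be performed by the infrastructure itself rather than called by the test, which is exactly the property the disruption is meant to verify.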

The Chaos Monkey

Netflix relies heavily on a cloud infrastructure to stream movies to users and to provide other functionality. In July 2012, it was reported that Netflix users had streamed more than 1 billion hours in June 2012. In other words, Netflix isn't a trivial user of the cloud; it uses it on a massive scale.

A Quick Start Guide on GitHub authored by the Netflix tech team (see "Quick Start Guide for Chaos Monkey" in Resources) describes the steps to go through to get the Chaos Monkey up and running. The following list gives you some more information on the tools that the Chaos Monkey uses. Be sure to run the commands described in the guide to remove any unused resources, or else you will be continually charged for usage.

  • Auto Scaling: Auto Scaling is a feature of Amazon Web Services (AWS) that enables you to scale compute capacity up and down based on demand, through rules that you define. Although it's an AWS-specific feature, you can create this type of scalable environment with your own private or public cloud infrastructure. Auto Scaling has two key components: a launch configuration and an Auto Scaling Group. A launch configuration defines how an instance within an Auto Scaling Group is launched. An Auto Scaling Group is a collection of instances to which a particular launch configuration is applied.
  • SimpleDB: SimpleDB is a NoSQL database you can use to persist data. You need to define a SimpleDB domain, which the Chaos Monkey uses to store state.
  • Gradle: Gradle is a build tool. It's used to build the Chaos Monkey and to start the Jetty application container.
  • Properties file: You need to modify a simianarmy.properties file with credentials and other configurable information.
  • Jetty: The in-memory Jetty server runs the Chaos Monkey to disrupt your infrastructure randomly.
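
The relationship between a launch configuration and an Auto Scaling Group described in the list above can be modeled in a few lines of Python. This is a conceptual sketch only: it does not call AWS, and the class names, field names, and the "ami-example" value are illustrative assumptions rather than the AWS API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LaunchConfiguration:
    """How each instance in a group is launched (field names are illustrative)."""
    image_id: str
    instance_type: str


@dataclass
class AutoScalingGroup:
    """A collection of instances that all share one launch configuration."""
    name: str
    launch_configuration: LaunchConfiguration
    min_size: int
    max_size: int
    instances: list = field(default_factory=list)

    def scale_to(self, desired: int) -> int:
        """Clamp desired capacity to [min_size, max_size], then add or remove instances."""
        target = max(self.min_size, min(self.max_size, desired))
        while len(self.instances) < target:
            # Each new instance is described by the group's launch configuration.
            self.instances.append(self.launch_configuration)
        del self.instances[target:]
        return target


if __name__ == "__main__":
    config = LaunchConfiguration(image_id="ami-example", instance_type="small")
    group = AutoScalingGroup("web", config, min_size=1, max_size=3)
    print(group.scale_to(5))  # the request is capped at max_size
```

Because every instance comes from the same launch configuration, a terminated instance can be replaced by an identical one, which is what lets the group recover from the Chaos Monkey's disruptions.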

A Simian Army

The Chaos Monkey is the first entry in the Netflix technical team's Simian Army. In Table 1, I list other tools that Netflix has proposed that will constitute the Simian Army (see Resources):

Table 1. A simian army
Name               Description
Chaos Gorilla      Simulates outage of an entire availability zone
Conformity Monkey  Shuts down instances that don't adhere to best practices
Doctor Monkey      Performs health checks (such as CPU)
Janitor Monkey     Searches for unused resources and disposes of them
Latency Monkey     Creates artificial delays in client-server communication
Security Monkey    Finds security vulnerabilities such as improperly configured security groups

These are just a few ideas. The possibilities for other ways to apply a combination of monitoring, diagnostics, testing, and intentional destruction in cloud-based production environments are endless.
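
As an illustration of how small such an idea can start, the artificial-delay behavior attributed to the Latency Monkey in Table 1 might be sketched as a Python decorator. This is not Netflix's implementation; with_latency and fetch_title are hypothetical names, and the delay is injected in-process rather than on a real network.

```python
import functools
import random
import time


def with_latency(max_delay_seconds: float, rng: random.Random = None, sleep=time.sleep):
    """Decorator that waits a random delay before calling the wrapped function."""
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = rng.uniform(0.0, max_delay_seconds)
            sleep(delay)  # the artificial delay
            return func(*args, **kwargs)
        return wrapper
    return decorator


@with_latency(max_delay_seconds=0.05)
def fetch_title(movie_id: int) -> str:
    """Stand-in for a remote call; hypothetical, not a Netflix API."""
    return f"title-{movie_id}"


if __name__ == "__main__":
    start = time.perf_counter()
    print(fetch_title(7))
    print(f"took {time.perf_counter() - start:.3f}s")
```

The sleep function is passed in as a parameter so that a test can record the delays instead of actually waiting.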


Unleash the fury


In this article, you learned that you can truly begin creating autonomic infrastructure capable of healing itself with the help of tools such as the Chaos Monkey and a cloud environment.

The next article covers test-driven infrastructure: how to apply test-driven development techniques, commonly used by developers for application code, to your infrastructure, using tools such as Cucumber.

Resources

Learn

Get products and technologies

  • Simian Army: Netflix's open source Simian Army. The Chaos Monkey is one entry in what will be a suite of open source tools.
  • Bees with machine guns: A utility for arming (creating) many bees (micro EC2 instances) to attack (load test) targets (web applications).
  • IBM Tivoli Provisioning Manager: Tivoli Provisioning Manager enables a dynamic infrastructure by automating the management of physical servers, virtual servers, software, storage, and networks.
  • IBM Tivoli System Automation for Multiplatforms: Tivoli System Automation for Multiplatforms provides high availability and automation for enterprise-wide applications and IT services.
  • Evaluate IBM products in the way that suits you best: Download a product trial, try a product online, use a product in a cloud environment, or spend a few hours in the SOA Sandbox learning how to implement Service Oriented Architecture efficiently.

Discuss

  • Get involved in the developerWorks community. Connect with other developerWorks users while exploring the developer-driven blogs, forums, groups, and wikis.
  • The developerWorks Agile transformation community provides news, discussions, and training to help you and your organization build a foundation on agile development principles.
