Chaos Monkey and Resilience Testing – Insights from the professionals

By: Mark Armstrong

In today’s world, users have little patience for service outages. If I can’t access my bank account details anytime and anywhere from my smartphone, I’ll switch banks. Cloud services are designed for resilience and are hosted on resilient infrastructure, but problems still occur. To improve (and prove) resilience, Netflix has been championing chaos engineering since 2012. In the words of Cory Bennett and Ariel Tseitlin at Netflix:

“We have found that the best defense against major unexpected failures is to fail often. By frequently causing failures, we force our services to be built in a way that is more resilient.” – Cory Bennett and Ariel Tseitlin

The technology that Netflix shared with the world (Chaos Monkey and the Simian Army) was seen by many people as the genesis of chaos testing and chaos engineering. But how revolutionary are these concepts? Weren’t we doing lots of resilience testing in the good old on-premises software days?

Cloud testing experts weigh in

To explore this further, I asked IBM Cloud testing experts three simple questions:

  • How have your responsibilities as a test professional changed in DevOps and cloud environments?

  • Does your background in testing help in understanding and adopting Chaos Monkey principles? What’s familiar? What’s different?

  • What challenges do you experience championing chaos principles and do you have any advice on how to overcome those challenges?

Lin Ju, IBM Analytics Cloud Data Services

First, I spoke with Lin Ju, a quality leader for IBM Analytics Cloud Data Services. Lin has more than 15 years of experience in application development and testing. Her team applies chaos testing principles today: proactively injecting failures into the cloud environment, adding network latency, and detecting and recovering from system failures.

Q1: How have your responsibilities as a test professional changed in DevOps and cloud environments?

Lin: With the DevOps transformation in our organization, testing has permeated every phase of software development and delivery, from service definition through the test, staging, and production environments. We’ve included many different roles in our all-hands testing. For example, product managers and designers join user acceptance testing to confirm that their business requirements and designs have been met. Developers have written more test automation than ever to speed up our pace of service delivery.

Test professionals have become part of this transformation. Leveraging their rich test experience, they have been instrumental in defining test coverage and automation strategy, and in testing the whole service offering both from the users’ viewpoint and against overall quality metrics. They have helped identify risk areas, developed more test automation, and broadened their scope to include security, integration, and resilience testing. Test environments have expanded from development to staging and production. Dark launching new releases has become routine, which gives test engineers more exposure to real-world environments.

In cloud environments especially, one aspect has gained more and more attention: resilience. We are no longer just testing how the system handles a given failure; we are proactively causing system failures to see how our services react and recover. For example, a cloud service typically relies on clustered computing and load balancing for reliability and availability. It is composed of various application and service components working together, other supporting services (e.g. Spark), microservice architectures, and the hosting infrastructure (including compute runtime, storage, and network). This requires our test professionals not just to have test expertise and programming skills, but also a deep understanding of the service’s capabilities, architecture, deployment structure, dependencies, and the infrastructure of the hosting environment.

Q2: Does your background in testing help in understanding and adopting Chaos Monkey principles? What’s familiar? What’s different?

Lin: In the past, I have led teams in performance and high availability testing for on-premises applications and products. Similar to Chaos Monkey, we stress-tested systems and created disaster situations to verify that those systems still functioned as intended. But there are also some differences. Chaos Monkey is a more proactive approach: shut down services or VMs and see whether those services recover automatically. The key in the cloud environment is that our services must recover automatically or shift to alternative resources so that the service is available 24×7. We need to ensure that no human action is required when a failure happens. During resilience testing, various failures and latencies should be introduced to exercise the entire service and its components as well.
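
To make the idea concrete, here is a minimal sketch of that kind of experiment: terminate one randomly chosen instance, then verify that the service heals itself within a recovery budget with no human action. The instance names, health endpoint, and provider call are hypothetical placeholders for illustration, not the tooling Lin’s team uses.

```python
# Minimal sketch of a Chaos-Monkey-style experiment (hypothetical names):
# terminate one random instance, then verify the service heals itself
# within a recovery budget -- no human action allowed.
import random
import time

import requests  # assumed available for the health probe

SERVICE_HEALTH_URL = "https://example-service.internal/health"  # hypothetical
RECOVERY_BUDGET_SECONDS = 300


def terminate_instance(instance_id: str) -> None:
    """Placeholder for the cloud provider's terminate call (provider-specific API)."""
    print(f"Terminating instance {instance_id}")
    # provider.instances.terminate(instance_id)  # hypothetical provider call


def service_is_healthy() -> bool:
    """Probe the service health endpoint; any error counts as unhealthy."""
    try:
        return requests.get(SERVICE_HEALTH_URL, timeout=5).status_code == 200
    except requests.RequestException:
        return False


def run_experiment(instance_ids: list) -> bool:
    """Kill one random instance and poll until the service recovers or the budget expires."""
    victim = random.choice(instance_ids)
    terminate_instance(victim)

    deadline = time.time() + RECOVERY_BUDGET_SECONDS
    while time.time() < deadline:
        if service_is_healthy():
            print(f"Service recovered after losing {victim}")
            return True
        time.sleep(10)
    print(f"Service did NOT recover within {RECOVERY_BUDGET_SECONDS}s")
    return False


if __name__ == "__main__":
    run_experiment(["worker-1", "worker-2", "worker-3"])
```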

Q3: What challenges do you experience championing chaos principles and do you have any advice on how to overcome those challenges?

Lin: The challenge is to ensure that your teams design for resilience at the architecture level and have a good understanding of monitoring tools and automatic recovery techniques. Since many on the team came from on-premises backgrounds, the biggest challenge has been shifting to delivering and operating a highly reliable 24×7 cloud service. My suggestion is to have dedicated resilience evangelists (test professionals are a good starting point) within your cloud development organization who provide guidance on resilience test automation, recovery techniques, and monitoring systems.

Kevin Trinh, IBM Enterprise Content Management

Next, I talked to Kevin Trinh, System Verification Test (SVT) Architect for IBM Enterprise Content Management. When Kevin contacted me, I knew we had an evangelist in our ranks. His email signature includes a completely appropriate quote from Viktor Klang, Deputy CTO at Typesafe:

“Resilience has to be designed. Has to be tested. It’s not something that happens around a table as a slew of exceptional engineers architect the perfect system. Perfection comes through repeatedly trying to break the system.” – Viktor Klang

Kevin is building an endurance test environment and looking to use Chaos Monkey methods to test resiliency. Previously, Kevin brought chaos to WebSphere testing. Here are Kevin’s responses to my questions:

Q1: How have your responsibilities as a test professional changed in DevOps and cloud environments?

Kevin: My role and responsibilities have changed significantly since we began offering our software as a cloud service. As part of SVT, I’m working more closely with the cloud DevOps team responsible for managing a new cloud service offering on IBM SoftLayer, and now with the Cloud Data Services (CDS) team, which will manage another of our services on IBM Bluemix. Both teams are investing in chaos testing to improve and prove the resilience of the offering. I collaborate closely with the CDS team on my early testing so that the downstream team is confident in our test results as input to the deployment readiness review. Because we know our services and the components they rely on, our team will be the go-to team that DevOps, CDS, and first-line support turn to if serious issues arise in production.

Q2: Does your background in testing help in understanding and adopting Chaos Monkey principles? What’s familiar? What’s different?

Kevin: Resiliency has become a hot topic for on-premises testing, and it is significantly more important for the cloud. As an SVT architect, I’m educating and training our testers in adopting chaos engineering. Traditional high availability testing is not enough when it comes to cloud testing. The resiliency of our application, and of the other services it depends upon, is even more critical. Our application has to be able to survive various types of outages, many of which may be completely out of our control.

Our first experience with outages came from the instability of our own testing infrastructure. We had several outages within the first few months of the project. We were fortunate! This experience highlighted the need to adopt chaos engineering, and we developed simple-to-use tooling to terminate instances in the infrastructure. It also led to a change in our delivery pipeline: we are creating a new long-term test environment for “endurance testing”. This will be a production-like environment that continuously runs production-level load tests, and we’ll target it to randomly inject failures into the infrastructure and the services we depend upon. This environment will also be used for the rollout of maintenance updates and service upgrades as we move toward continuous integration testing. If we can mature our service and its dependencies in this testing environment, we’ll have a more resilient service in production.
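
As an illustration of the kind of fault an endurance environment might inject, here is a small sketch that adds artificial network latency on a Linux host using tc/netem and removes it after a fixed interval. The interface name and delay value are assumptions for illustration only; Kevin’s actual tooling is not shown here.

```python
# Sketch of a latency-injection step for an endurance environment.
# Assumes a Linux host with tc/netem available and root privileges;
# the interface name and delay are hypothetical choices.
import subprocess
import time

INTERFACE = "eth0"            # hypothetical network interface on the target node
LATENCY = "200ms"             # artificial egress delay to add
FAULT_DURATION_SECONDS = 120  # how long the fault stays in place


def add_latency() -> None:
    """Add fixed egress latency with the netem queueing discipline."""
    subprocess.run(
        ["tc", "qdisc", "add", "dev", INTERFACE, "root", "netem", "delay", LATENCY],
        check=True,
    )


def remove_latency() -> None:
    """Remove the netem discipline, restoring normal network behaviour."""
    subprocess.run(
        ["tc", "qdisc", "del", "dev", INTERFACE, "root", "netem"],
        check=True,
    )


if __name__ == "__main__":
    add_latency()
    try:
        # While the latency is in place, the production-like load keeps running
        # and monitoring should show the service staying within its error budget.
        time.sleep(FAULT_DURATION_SECONDS)
    finally:
        remove_latency()
```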

Q3: What challenges do you experience championing chaos principles and do you have any advice on how to overcome those challenges?

Kevin: The biggest challenge is getting architects to design our application for the cloud with resiliency in mind. I’m extensively involved in design reviews to highlight resiliency concerns and to provide recommendations to improve resiliency prior to implementation. The other challenge is to ensure that all the teams use consistent tooling for injecting failures and reviewing service health indicators, so that each team is confident in the other’s readouts and we’re all working from a single source of truth.

Conclusion

I hope that the insights from our experts will help you on your quest to deliver better cloud services. Chaos engineering may not be new to these experts, but it is new to many others in the organization who play a part in delivering and operating reliable cloud services. I believe that test professionals are best positioned to drive service resilience and to champion chaos engineering throughout the organization.

I’m compiling responses from more experts in the field and will be back with a consolidated view of best practices in chaos engineering. I’d love to hear your perspective and to answer any of your questions. Add feedback here or connect with us on Twitter: @markearms (that’s me) or @linda47841505 (that’s Lin).
