We broke the cloud on April 13, 2022. It wasn’t a major outage, and you won’t read about it in any industry publications, but we did break it, and there’s nothing we take more seriously.
Our customers bet their businesses on our cloud. The cloud should never go down, but it does. All clouds have outages. Some are large and some are small, but they all matter.
I’m going to walk you through what happens when we break the cloud, how we fix it and how we make sure we never make the same mistake twice.
This blog post will give you a peek behind the curtain of how we keep the cloud up and running. You’ll see a real incident, what caused it, how we fixed it and what we did to make sure it doesn’t happen again. You’re getting an open and honest look at what happens behind the scenes, the mistakes we made and the process we use to learn from them.
When something breaks in our cloud, we determine what happened and how to fix it with a process called root cause analysis (RCA). We want to fix the problem as quickly as possible, but we need to make sure we understand the when, where and what of the problem before we try to fix it. The last thing we want to do is break something else while we’re trying to fix the problem.
Finding the root cause of an issue is complex. The basic description of all issues is the same — the VPC cloud did X when it was supposed to do Y. It sounds simplistic, but everything starts with that simple description. Then, we dig deeper.
You’ve probably heard of processes like Five Whys and root cause analysis. They’re all about peeling back the layers of the problem to find out what really needs to change to prevent the problem in the future. In the IBM Cloud, we use a specific root cause analysis template. The first three questions the template asks are when, where and what happened.
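To make that concrete, here is a minimal sketch of the kind of structured record a template like this produces. It is purely illustrative; the field names are placeholders, not our internal template:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RootCauseAnalysis:
    """Illustrative RCA record; field names are placeholders, not the real template."""
    description: str                # what broke, in a sentence or two
    when: str                       # the time window of the incident
    where: str                      # the region and environment affected
    what_happened: str              # the customer-visible impact
    root_cause: str = ""            # filled in once the analysis is complete
    next_steps: List[str] = field(default_factory=list)  # the changes we commit to

# The first three questions are always the same: when, where and what happened.
rca = RootCauseAnalysis(
    description="Customers could not delete snapshots, and some snapshot data was lost.",
    when="April 12-13, 2022",
    where="SYD production",
    what_happened="Deleting a snapshot from the middle of a chain broke the chain.",
)
```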
I’m going to walk you through excerpts from the actual RCA from this issue.
The template starts with a Description of the Incident. Here’s the description for the breakage on April 13, 2022:
“This issue made it impossible for customers to delete snapshots. This also caused customers to lose data if they made snapshots at this time.”
The description looks simple enough. We broke the snapshot feature. This only happened if you used a specific feature of snapshots called Delete From Middle.
The next step is to understand how much of an impact this incident had. The first question we ask is: when did this happen?
“This incident happened in SYD production between April 12 and April 13, 2022.”
This only happened in the Sydney region of the cloud. The VPC cloud is separated into regions by design — that means most incidents happen in a single region instead of impacting the global cloud. This allows customers to spread their workloads across regions and stay up and running even if one region has an outage.
This outage didn’t have a wide impact, but even a catastrophic incident in the Sydney region wouldn’t break any other regions.
In this case, the issue was caused by a new release of a specific component in the cloud. That release went to the Sydney region first, we found the issue there, and we halted the rollout before it reached any other regions.
Now that we know when and where the incident happened, we need to know what happened:
“This incident prevented customers from using their snapshots after deleting a snapshot from the middle of the chain. This affects both snapshot restoration and the Delete From Middle functionality of the same volume. Customer data was lost, and the snapshot chain becomes unusable.
Any snapshots created by customers at this time were lost and must be recreated.”
There’s no way around it: what happened was bad. Some customers lost data. It was a very small number of customers, and everyone was able to recover what they lost, but that was luck. It could have been worse.
I’m going to be honest: that scares us. It keeps us up at night. We never, ever want to lose even one piece of customer data.
The production pipeline takes code from a developer’s machine out to production regions where our customers can use it. Between those two points, it goes to pre-integration, integration and staging environments.
We run tests in each of these environments. If the tests fail, the build doesn’t move forward. We found this issue in the staging environment and we should have stopped the build from moving forward, but we didn’t.
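That gate is how a build is supposed to stop. As a hypothetical sketch (not our actual tooling; only the environment names come from the pipeline above), the promotion logic looks something like this:

```python
# Hypothetical sketch of environment-by-environment promotion; only the stage
# names come from the pipeline described above, the rest is illustrative.
ENVIRONMENTS = ["pre-integration", "integration", "staging", "production"]

def run_tests(build_id: str, environment: str) -> bool:
    """Stand-in for the real test suites that run in each environment."""
    return True

def promote(build_id: str) -> None:
    """Walk a build through each environment, stopping if its tests fail."""
    for env in ENVIRONMENTS:
        print(f"Deploying {build_id} to {env}")
        if env != "production" and not run_tests(build_id, env):
            print(f"Tests failed in {env}; {build_id} does not move forward")
            return
    print(f"{build_id} is live in production")

promote("example-component-1.2.3")  # hypothetical build identifier
```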
Here’s what happened instead: we found the problem in staging and opened an issue, but the build kept moving out toward production.

What was the root cause of this issue? We didn’t follow our own policies. We’re not supposed to promote code when there’s an open issue, and we’re not supposed to close the issue until the fix is delivered to the environment where the problem was found.
We made a mistake. It’s that simple. This was human error based on miscommunication.
Clouds are supposed to be perfect, but they’re built by people, and people make mistakes. When that happens, we do everything we can to keep the mistakes small and to make sure we never make the same mistake twice.
The most important part of a root cause analysis (RCA) is the next steps. The analysis is useless if it doesn’t lead to change.
In this case, our change was automation, because the root of the problem was communication. The storage development team thought they had told the Release Management team that the version shouldn’t move forward, but the Release Management team never got the message. The Test team thought they had communicated how important this issue was, but that message didn’t reach the people who needed it either.
Automation removes communication problems.
We added a new process for development teams to label a build as DO NOT DEPLOY, and we added tooling to block the deploy for versions with that label. We also added an automated process to make sure versions can’t be promoted with known open issues.
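As a rough illustration of what those checks do (hypothetical code, not the actual tooling; the DO NOT DEPLOY label comes from the process above, everything else is a placeholder), the gate boils down to this:

```python
# Hypothetical sketch of the new promotion gate; the DO NOT DEPLOY label comes
# from the process described above, everything else is a placeholder.
from dataclasses import dataclass, field
from typing import List

DO_NOT_DEPLOY = "DO NOT DEPLOY"

@dataclass
class BuildVersion:
    version: str
    labels: List[str] = field(default_factory=list)
    open_issues: List[str] = field(default_factory=list)

def can_promote(build: BuildVersion) -> bool:
    """Block promotion if a build carries the DO NOT DEPLOY label or has open issues."""
    if DO_NOT_DEPLOY in build.labels:
        print(f"{build.version}: blocked by {DO_NOT_DEPLOY} label")
        return False
    if build.open_issues:
        print(f"{build.version}: blocked by open issues {build.open_issues}")
        return False
    return True

# The point of the automation: the label and the open issue travel with the
# build itself, so no message has to make it from one team to another.
blocked = BuildVersion("example-1.2.3", labels=[DO_NOT_DEPLOY], open_issues=["ISSUE-123"])
assert not can_promote(blocked)
```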
This has been a deep dive into the process of how the cloud works. The VPC cloud deploys multiple versions to production every day. Most of them don’t have any problems, but this one broke something.
We broke something because we made a mistake.
Mistakes are scary, but they’re unavoidable. Clouds are made by people and people make mistakes. Clouds aren’t resilient because there are no mistakes; clouds are resilient because the technology, architecture and processes keep the mistakes small. They’re resilient because we keep the mistakes isolated and don’t make the same ones twice.
The cloud release process is always evolving. We get better every day and stay focused on making the cloud perfect. This look behind the curtain gives you some insight into how we do that.
Now that you know more about how we keep the cloud running, go check out what IBM Cloud has to offer.