Using root cause analysis to determine the when, where and what of a real problem and how to fix it.

We broke the cloud on April 13, 2022. It wasn’t a major outage, and you won’t read about it in any industry publications, but we did break it, and there’s nothing we take more seriously.

Our customers bet their businesses on our cloud. The cloud should never go down, but it does.  All clouds have outages. Some are large and some are small, but they all matter. 

I’m going to walk you through what happens when we break the cloud, how we fix it and how we make sure we never make the same mistake twice.

This blog post will give you a peek behind the curtain of how we keep the cloud up and running.  You can see a real incident, what caused it, how we fixed it and what we did to make sure it doesn’t happen again. You’re getting an open and honest look at what happens behind the scenes, the mistakes we made and the process we use to learn from them.

Root cause analysis

When something breaks in our cloud, we determine what happened and how to fix it with a process called root cause analysis (RCA). We want to fix the problem as quickly as possible, but we need to make sure we understand the when, where and what of the problem before we try to fix it. The last thing we want to do is break something else while we’re trying to fix the problem.

Finding the root cause of an issue is complex. The basic description of all issues is the same — the VPC cloud did X when it was supposed to do Y. It sounds simplistic, but everything starts with that simple description. Then, we dig deeper.

You’ve probably heard of processes like Five Whys and root cause analysis. They’re all about peeling back the layers of the problem to find out what really needs to change to prevent the problem in the future. In the IBM Cloud, we use a specific root cause analysis template. The first three questions the template asks are when, where and what happened.

I’m going to walk you through excerpts from the actual RCA from this issue.


The template starts with a Description of the Incident. Here’s the description for the breakage on April 13, 2022:

“This issue made it impossible for customers to delete snapshots. This also caused customers to lose data if they made snapshots at this time.”

The description looks simple enough. We broke the snapshot feature. This only happened if you used a specific feature of snapshots called Delete From Middle.
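To make "Delete From Middle" concrete: snapshots are commonly stored as a chain of deltas, where each snapshot only records the blocks that changed since the previous one. Deleting a snapshot from the middle of the chain means folding its blocks into its child so later snapshots stay restorable. Here's a minimal, hypothetical sketch of that merge step (the data structures are illustrative, not IBM's actual implementation):

```python
def delete_from_middle(chain, index):
    """Remove chain[index], folding its blocks into the next snapshot.

    Each snapshot is modeled as a dict of block_id -> data, holding only
    the blocks that changed since its parent.
    """
    if index <= 0 or index >= len(chain) - 1:
        raise ValueError("not a middle snapshot")
    victim = chain[index]
    child = chain[index + 1]
    # Blocks the child doesn't override must be inherited from the deleted
    # snapshot. If this merge is skipped or done wrong, restoring the child
    # (or any later snapshot) silently loses data.
    for block_id, data in victim.items():
        child.setdefault(block_id, data)
    del chain[index]
    return chain
```

A bug in a step like that merge is exactly the kind of failure the RCA describes: the delete appears to succeed, but the rest of the chain becomes unusable.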


The next step is to understand how much of an impact this incident had. The first question we ask is: when did this happen?

“This incident happened in SYD production between April 12 and April 13, 2022.”

This only happened in the Sydney region of the cloud. The VPC cloud is separated into regions by design — that means most incidents happen in a single region instead of impacting the global cloud. This allows customers to spread their workloads across regions and stay up and running even if one region has an outage.

This outage didn’t have a wide impact, but even a catastrophic incident in the Sydney region wouldn’t break any other regions.

In this case, the issue was caused by a new release of a specific component in the cloud. That release only went to the Sydney region and stopped. We found the issue there and halted it before it moved to any other regions.


Now that we know when and where the incident happened, we need to know what happened:

“This incident prevented customers from using their snapshots after deleting a snapshot from the middle of the chain. This affects both snapshot restoration and the Delete From Middle functionality of the same volume. Customer data was lost, and the snapshot chain becomes unusable.

Any snapshots created by customers at this time were lost and must be recreated.”

There’s no way around it: what happened was bad. Some customers lost data. It was a very small number, and everyone was able to recover what they lost, but we got lucky there. It could have been worse.

I’m going to be honest: that scares us. It keeps us up at night. We never, ever want to lose even one piece of customer data.

The timeline

The production pipeline takes code from a developer’s machine out to production regions where our customers can use it. Between those two points, it goes to pre-integration, integration and staging environments. 

We run tests in each of these environments. If the tests fail, the build doesn’t move forward. We found this issue in the staging environment, and we should have stopped the build there, but we didn’t.
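The pipeline logic above can be sketched in a few lines: a build advances through each environment only while its test stages pass, so a failure anywhere before production should stop the promotion. This is a hypothetical sketch of that gate, not IBM's actual tooling:

```python
# Environments a build passes through on its way to customers.
ENVIRONMENTS = ["pre-integration", "integration", "staging", "production"]

def promote(build, run_tests):
    """Advance a build through each environment until a test stage fails.

    run_tests(build, env) returns True if the test suites for that
    environment pass. Returns the last environment the build reached.
    """
    for env in ENVIRONMENTS[:-1]:
        if not run_tests(build, env):
            return env  # the build stops here and never reaches production
    return "production"
```

In the incident described here, the staging tests did catch the problem; the failure was that the build was promoted anyway, outside this gate.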

Here’s what happened instead:

  • April 5, 2022: The issue was deployed to our staging environment. We then ran over 6,000 test suites.
  • April 7, 2022: This was a tricky issue to find, but we found and filed it.
  • April 9, 2022: The issue was added to our Top 15 list, which we use to track the critical issues across all of the VPC.
  • April 11, 2022: The issue was fixed and delivered to the pipeline. That same day, we promoted a release to Sydney; the version we promoted didn’t have the fix in it.
  • April 13, 2022: The issue impacted customers. We logged it as a Customer Impacting Event and fixed it.

So, what happened?

What was the root cause of this issue? Simply put, we didn’t follow our own policies. We’re not supposed to promote code when there’s an open issue, and we’re not supposed to close the issue until the fix is delivered to the environment where the problem was found.

We made a mistake. It’s that simple. This was human error based on miscommunication.

Clouds are supposed to be perfect, but they’re made by people, and people make mistakes. So we do everything we can to keep the mistakes small and make sure we never make the same mistake twice.

What did we do about it?

The most important part of a root cause analysis (RCA) is the next steps. The analysis is useless if it doesn’t lead to change. 

In this case, our change was automation. Our problem was a communication problem. The storage development team thought they had communicated to the Release Management team that the version shouldn’t move forward, but the Release Management team didn’t get the message. The Test team thought they had communicated how important this issue was, but that message didn’t get where it needed to go.

Automation removes communication problems.

We added a new process for development teams to label a build as DO NOT DEPLOY, and we added tooling to block the deploy for versions with that label. We also added an automated process to make sure versions can’t be promoted with known open issues.
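Those two gates are simple to automate. Here's a hypothetical sketch of what such a promotion check could look like (the label name DO NOT DEPLOY comes from the post; everything else, including the function and data shapes, is illustrative):

```python
DO_NOT_DEPLOY = "DO NOT DEPLOY"

def can_promote(version, labels, open_issues):
    """Return (allowed, reason) for promoting a version.

    labels: dict mapping version -> set of labels applied by dev teams.
    open_issues: list of issue dicts, each with a "version" field naming
    the version the issue was found against.
    """
    # Gate 1: development teams can explicitly block a build.
    if DO_NOT_DEPLOY in labels.get(version, set()):
        return False, "version is labeled DO NOT DEPLOY"
    # Gate 2: a version with known open issues can't move forward,
    # so the promotion no longer depends on a human relaying the message.
    blocking = [i for i in open_issues if i.get("version") == version]
    if blocking:
        return False, f"{len(blocking)} open issue(s) against this version"
    return True, "ok"
```

The point of encoding the policy this way is that the check runs on every promotion, whether or not anyone remembers to send a message.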

The takeaway

This has been a deep dive into the process of how the cloud works. The VPC cloud deploys multiple versions to production every day. Most of them don’t have any problems, but this one broke something.

We broke something because we made a mistake.

Mistakes are scary, but they’re unavoidable. Clouds are made by people and people make mistakes. Clouds aren’t resilient because there are no mistakes; clouds are resilient because the technology, architecture and processes keep the mistakes small. They’re resilient because we keep the mistakes isolated and don’t make the same ones twice.

The cloud release process is always evolving. We get better every day and stay focused on making the cloud perfect. This look behind the curtain gives you some insight into how we do that.

Now that you know more about how we keep the cloud running, go check out what IBM Cloud has to offer.
