Using root cause analysis to determine the when, where and what of a real problem and how to fix it.
We broke the cloud on April 13, 2022. It wasn’t a major outage, and you won’t read about it in any industry publications, but we did break it, and there’s nothing we take more seriously.
Our customers bet their businesses on our cloud. The cloud should never go down, but it does. All clouds have outages. Some are large and some are small, but they all matter.
I’m going to walk you through what happens when we break the cloud, how we fix it and how we make sure we never make the same mistake twice.
This blog post will give you a peek behind the curtain of how we keep the cloud up and running. You can see a real incident, what caused it, how we fixed it and what we did to make sure it doesn’t happen again. You’re getting an open and honest look at what happens behind the scenes, the mistakes we made and the process we use to learn from them.
Root cause analysis
When something breaks in our cloud, we determine what happened and how to fix it with a process called root cause analysis (RCA). We want to fix the problem as quickly as possible, but we need to make sure we understand the when, where and what of the problem before we try to fix it. The last thing we want to do is break something else while we’re trying to fix the problem.
Finding the root cause of an issue is complex. The basic description of all issues is the same — the VPC cloud did X when it was supposed to do Y. It sounds simplistic, but everything starts with that simple description. Then, we dig deeper.
You’ve probably heard of processes like Five Whys and root cause analysis. They’re all about peeling back the layers of the problem to find out what really needs to change to prevent the problem in the future. In the IBM Cloud, we use a specific root cause analysis template. The first three questions the template asks are when, where and what happened.
I’m going to walk you through excerpts from the actual RCA from this issue.
The template starts with a Description of the Incident. Here’s the description for the breakage on April 13, 2022:
“This issue made it impossible for customers to delete snapshots. This also caused customers to lose data if they made snapshots at this time.”
The description looks simple enough. We broke the snapshot feature. This only happened if you used a specific feature of snapshots called Delete From Middle.
The next step is to understand how much of an impact this incident had. The first questions we ask are: when and where did this happen?
“This incident happened in SYD production between April 12 and April 13, 2022.”
This only happened in the Sydney region of the cloud. The VPC cloud is separated into regions by design — that means most incidents happen in a single region instead of impacting the global cloud. This allows customers to spread their workloads across regions and stay up and running even if one region has an outage.
This outage didn’t have a wide impact, but even a catastrophic incident in the Sydney region wouldn’t break any other regions.
In this case, the issue was caused by a new release of a specific component in the cloud. That release only went to the Sydney region and stopped. We found the issue there and halted it before it moved to any other regions.
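To make the idea of a release that stops in one region concrete, here is a minimal sketch of a region-by-region rollout. It is an illustration only, not our actual deployment tooling: the function name, the health check callback and the region names other than Sydney are all assumptions.

```python
from typing import Callable, List


def roll_out(version: str,
             regions: List[str],
             is_healthy: Callable[[str, str], bool]) -> List[str]:
    """Deploy one region at a time and halt at the first unhealthy region.

    Returns the regions that actually received the release.
    `is_healthy(region, version)` is a hypothetical post-deploy check.
    """
    deployed: List[str] = []
    for region in regions:
        deployed.append(region)  # the release lands in this region
        if not is_healthy(region, version):
            break  # halt: later regions never see this version
    return deployed


# Hypothetical example: the release is unhealthy in the first region.
touched = roll_out("new-release",
                   ["sydney", "region-b", "region-c"],
                   lambda region, version: region != "sydney")
```

In this sketch only the first region in the list is ever exposed to a bad release, which is the property that kept this incident inside Sydney.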
Now that we know when and where the incident happened, we need to know what happened:
“This incident prevented customers from using their snapshots after deleting a snapshot from the middle of the chain. This affects both snapshot restoration and the Delete From Middle functionality of the same volume. Customer data was lost, and the snapshot chain becomes unusable.
Any snapshots created by customers at this time were lost and must be recreated.”
There’s no way around it: what happened was bad. Some customers lost data. It was a very small number, and everyone was able to recover what they lost, but we got lucky. It could have been worse.
I’m going to be honest: that scares us. It keeps us up at night. We never want to lose even one piece of customer data.
The production pipeline takes code from a developer’s machine out to production regions where our customers can use it. Between those two points, it goes to pre-integration, integration and staging environments.
We run tests in each of these environments. If the tests fail, the build doesn’t move forward. We found this issue in the staging environment and we should have stopped the build from moving forward, but we didn’t.
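The gating rule is simple enough to sketch in a few lines. This is an illustration under assumptions, not our real pipeline code: the environment names come from this post, but the function and the `run_tests` callback are hypothetical.

```python
from typing import Callable, List

# Environments a build passes through before production (names from the post).
TEST_STAGES: List[str] = ["pre-integration", "integration", "staging"]


def furthest_stage(build: str, run_tests: Callable[[str, str], bool]) -> str:
    """Return the stage where the build stops.

    A build only moves forward when the tests in its current environment pass.
    `run_tests(build, stage)` is a hypothetical callback returning True on success.
    """
    for stage in TEST_STAGES:
        if not run_tests(build, stage):
            return stage  # tests failed here, so the build goes no further
    return "production"


# Hypothetical example: a build whose tests fail in staging.
stuck_at = furthest_stage("build-123", lambda build, stage: stage != "staging")
```

With a gate like this, a build that fails in staging should never reach production. In our incident the gate was effectively bypassed by people, which is what the rest of this post is about.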
Here’s what happened instead:
April 5, 2022: The build containing this issue was deployed to our staging environment. We then ran over 6,000 test suites.
April 7, 2022: This was a tricky issue to find, but we found and filed it.
April 9, 2022: This issue was added to our Top 15 list, which tracks the critical issues across all of the VPC.
April 11, 2022: A fix was delivered to the pipeline. On the same day, we promoted a release to Sydney, but the version we promoted didn’t have the fix in it.
April 13, 2022: This issue impacted customers. We logged it as a Customer Impacting Event and fixed the issue.
So, what happened?
What was the root cause of this issue? We didn’t follow our own policies. We’re not supposed to promote code when there’s an open issue, and we’re not supposed to close the issue until the fix is delivered to the environment where the problem was found.
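The second policy is the subtle one, so here is a small sketch of it. The data model is an assumption made for illustration: the `Issue` class, the `fixes_in_env` mapping and the issue ID are hypothetical and do not reflect our real issue tracker.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass
class Issue:
    issue_id: str
    found_in: str  # environment where the problem was found


def can_close(issue: Issue, fixes_in_env: Dict[str, Set[str]]) -> bool:
    """An issue may be closed only once its fix is present in the
    environment where the problem was originally found.

    `fixes_in_env` maps an environment name to the IDs of issues
    whose fixes have been delivered there (hypothetical structure).
    """
    return issue.issue_id in fixes_in_env.get(issue.found_in, set())


# Hypothetical example mirroring the incident: the fix has been
# delivered to the pipeline but has not yet reached staging.
snapshot_issue = Issue(issue_id="ISSUE-1", found_in="staging")
closable = can_close(snapshot_issue, {"pipeline": {"ISSUE-1"}})
```

Here `closable` is False: a fix sitting in the pipeline is not the same as a fix running where the problem was seen.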
We made a mistake. It’s that simple. This was human error based on miscommunication.
Clouds are supposed to be perfect, but they’re made by people and people make mistakes. So we do everything we can to keep the mistakes small and make sure we never make the same mistake twice.
What did we do about it?
The most important part of a root cause analysis (RCA) is the next steps. The analysis is useless if it doesn’t lead to change.
In this case, our change was automation. Our problem was a communication problem. The storage development team thought that they had communicated to the Release Management team that the version shouldn’t move forward. The Release Management team didn’t get the message. The Test team thought that they had communicated how important this issue was, but that message didn’t get where it needed to go.
Automation removes communication problems.
We added a new process for development teams to label a build as DO NOT DEPLOY, and we added tooling to block the deploy for versions with that label. We also added an automated process to make sure versions can’t be promoted with known open issues.
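A deploy gate along these lines might look like the sketch below. It is a simplified illustration, not our production tooling: the `Build` class, its fields and the function name are assumptions, and only the DO NOT DEPLOY label itself comes from the process described above.

```python
from dataclasses import dataclass, field
from typing import List, Set

DO_NOT_DEPLOY = "DO NOT DEPLOY"  # label name from the post


@dataclass
class Build:
    version: str
    labels: Set[str] = field(default_factory=set)
    open_issues: List[str] = field(default_factory=list)  # hypothetical issue IDs


def blocking_reasons(build: Build) -> List[str]:
    """Return every reason this build must not be promoted.

    An empty list means the build is clear to deploy.
    """
    reasons: List[str] = []
    if DO_NOT_DEPLOY in build.labels:
        reasons.append(f"{build.version} is labeled {DO_NOT_DEPLOY}")
    for issue_id in build.open_issues:
        reasons.append(f"{build.version} has open issue {issue_id}")
    return reasons


# Hypothetical example: a labeled build that also has a known open issue.
bad_build = Build("1.2.3", labels={DO_NOT_DEPLOY}, open_issues=["ISSUE-1"])
reasons = blocking_reasons(bad_build)
```

The point of the design is that nobody has to remember to tell anybody anything. The label and the open issue travel with the build, and the tooling refuses to promote it while either one is present.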
This has been a deep dive into the process of how the cloud works. The VPC cloud deploys multiple versions to production every day. Most of them don’t have any problems, but this one broke something.
We broke something because we made a mistake.
Mistakes are scary, but they’re unavoidable. Clouds are made by people and people make mistakes. Clouds aren’t resilient because there are no mistakes; clouds are resilient because the technology, architecture and processes keep the mistakes small. They’re resilient because we keep the mistakes isolated and don’t make the same ones twice.
The cloud release process is always evolving. We get better every day and stay focused on making the cloud perfect. This look behind the curtain gives you some insight into how we do that.
Now that you know more about how we keep the cloud running, go check out what IBM Cloud has to offer.