Date: 2022-04-28

When something breaks in our cloud, we determine what happened and how to fix it with a process called root cause analysis (RCA). We want to fix the problem as quickly as possible, but we need to make sure we understand the when, where and what of the problem before we try to fix it. The last thing we want to do is break something else while we’re trying to fix the problem.

Finding the root cause of an issue is complex. The basic description of all issues is the same — the VPC cloud did X when it was supposed to do Y. It sounds simplistic, but everything starts with that simple description. Then, we dig deeper.

You’ve probably heard of processes like Five Whys and root cause analysis. They’re all about peeling back the layers of the problem to find out what really needs to change to prevent the problem in the future. In the IBM Cloud, we use a specific root cause analysis template. The first three questions the template asks are when, where and what happened.

I’m going to walk you through excerpts from the actual RCA for one recent incident.

When

The template starts with a Description of the Incident. Here’s the description for the breakage on April 13, 2022:

“This issue made it impossible for customers to delete snapshots. This also caused customers to lose data if they made snapshots at this time.”

The description looks simple enough. We broke the snapshot feature. This only happened if you used a specific feature of snapshots called Delete From Middle.
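If the term is new to you, here is one way to picture deleting from the middle of a snapshot chain. This is a minimal sketch, not our implementation: it assumes snapshots are incremental (each one stores only the blocks that changed since the previous snapshot), and the names `Snapshot`, `restore` and `delete_from_middle` are hypothetical.

```python
from typing import Dict, List

# Hypothetical model: each snapshot holds only the blocks that changed
# since the previous snapshot in the chain (block number -> contents).
Snapshot = Dict[int, bytes]


def restore(chain: List[Snapshot], index: int) -> Dict[int, bytes]:
    """Rebuild the volume as of chain[index] by replaying deltas in order."""
    volume: Dict[int, bytes] = {}
    for snap in chain[: index + 1]:
        volume.update(snap)
    return volume


def delete_from_middle(chain: List[Snapshot], index: int) -> List[Snapshot]:
    """Remove chain[index], folding its blocks into the next snapshot."""
    if not 0 <= index < len(chain) - 1:
        raise ValueError("index must refer to a snapshot that has a successor")
    removed, successor = chain[index], chain[index + 1]
    # Blocks in the successor are newer, so they win over the removed ones.
    merged = {**removed, **successor}
    return chain[:index] + [merged] + chain[index + 2:]


# Three snapshots; delete the middle one and check the latest still restores.
chain = [{0: b"a", 1: b"b"}, {1: b"B"}, {2: b"c"}]
shorter = delete_from_middle(chain, 1)
assert restore(shorter, 1) == restore(chain, 2)
```

The point of the sketch is the merge step. If the removed snapshot's blocks are dropped instead of folded into its successor, every later snapshot in the chain restores to the wrong contents.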

Where

The next step is to understand how much of an impact this incident had. The first question we ask is: where and when did this happen?

“This incident happened in SYD production between April 12 and April 13, 2022.”

This only happened in the Sydney region of the cloud. The VPC cloud is separated into regions by design — that means most incidents happen in a single region instead of impacting the global cloud. This allows customers to spread their workloads across regions and stay up and running even if one region has an outage.

This outage didn’t have a wide impact, but even a catastrophic incident in the Sydney region wouldn’t break any other regions.

In this case, the issue was caused by a new release of a specific component in the cloud. That release only went to the Sydney region and stopped. We found the issue there and halted the rollout before the release reached any other regions.
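The idea of releasing to one region at a time and stopping at the first sign of trouble can be sketched roughly like this. It is an illustration only, not our release tooling: `staged_rollout`, `deploy` and `healthy` are hypothetical names, and every region name other than Sydney is a placeholder.

```python
from typing import Callable, List


def staged_rollout(
    regions: List[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> List[str]:
    """Deploy to one region at a time; stop at the first unhealthy region.

    Returns the regions that received the release.
    """
    released: List[str] = []
    for region in regions:
        deploy(region)
        released.append(region)
        if not healthy(region):
            break  # halt: later regions never receive the release
    return released


# Hypothetical run: the release misbehaves in the first region.
released = staged_rollout(
    ["syd", "region-b", "region-c"],
    deploy=lambda region: None,            # stand-in for the real deployment
    healthy=lambda region: region != "syd",  # stand-in for real health checks
)
print(released)  # ['syd']
```

Because the loop breaks as soon as a region fails its check, a bad release is contained to the regions that already received it.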

What

Now that we know when and where the incident happened, we need to know what happened:

“This incident prevented customers from using their snapshots after deleting a snapshot from the middle of the chain. This affects both snapshot restoration and the Delete From Middle functionality of the same volume. Customer data was lost, and the snapshot chain becomes unusable.

Any snapshots created by customers at this time were lost and must be recreated.”

There’s no way around it — what happened was bad. Some customers lost data. It was a very small number, and everyone was able to recover what they lost, but we got lucky about that. It could have been worse.

I’m going to be honest: that scares us. It keeps us up at night. We never ever want to lose even one piece of customer data.