Whenever a high-severity incident, such as a service outage, occurs in a cloud environment, repairing the system and bringing services back online is the immediate priority. But we also need to identify and eliminate the root cause of the problem. The root cause is the reason the problem was injected into the system, and it is not to be confused with the immediate cause of failure. Because the first priority is always to return services to normal, we first chase and correct the immediate cause of failure. In other words, we take a 'repair' action. However, that will often not prevent recurrence. For that we need to take a 'corrective' action as well: we have to dig deeper and understand why the system entered the problem state.
A cause is a root cause when eliminating it eliminates injection of the problem. That's what separates root causes from all other causes. By understanding and eliminating root causes through corrective actions, we can eliminate entire classes of defects or problems, rather than simply fixing the one defect or problem we discovered.
But root causes are also more costly to eliminate, especially when they require a change in human behavior, such as a failure to follow written instructions. Monitoring and correcting the behavior of human system administrators takes time. That's why in the on-premises world we tend to do more causal analysis, i.e. identifying clusters of similar problems or defects and targeting actions to reduce their occurrence, rather than root cause analysis (RCA), i.e. determining the ultimate cause of each individual defect, which is time-consuming. That balance between the more affordable causal analysis and the more effective but costly RCA shifts toward RCA in the cloud services space. To meet and exceed SLA targets, we simply cannot allow the same root cause to hit the availability number twice.
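To make that last point concrete, here is a minimal sketch of the arithmetic. The 99.9% monthly availability target, the 30-day month, and the 30-minute outage are assumptions chosen purely for illustration, not figures from any real SLA, and the function names are hypothetical.

```python
from typing import Iterable

MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(sla_percent: float, days: int = 30) -> float:
    """Minutes of downtime the period can absorb while still meeting the target."""
    total_minutes = days * MINUTES_PER_DAY
    return total_minutes * (1 - sla_percent / 100)


def availability_percent(outage_minutes: Iterable[float], days: int = 30) -> float:
    """Availability achieved over the period, given the length of each outage."""
    total_minutes = days * MINUTES_PER_DAY
    return 100 * (1 - sum(outage_minutes) / total_minutes)


# Assumed target: 99.9% over a 30-day month.
SLA_TARGET = 99.9
budget = downtime_budget_minutes(SLA_TARGET)

# One 30-minute outage, then the same root cause striking a second time.
first_hit = availability_percent([30])
second_hit = availability_percent([30, 30])

print(f"downtime budget: {budget:.1f} minutes")
print(f"after one outage: {first_hit:.3f}% (meets target: {first_hit >= SLA_TARGET})")
print(f"after a repeat:   {second_hit:.3f}% (meets target: {second_hit >= SLA_TARGET})")
```

Under these assumptions the monthly budget is roughly 43 minutes, so a single 30-minute outage still leaves the target intact, while a recurrence from the same root cause pushes the month below it.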
Once an incident has occurred, it is relatively more likely to recur - because the triggering condition now exists - unless the root cause is eliminated. Thus, it is imperative to go after the root causes of any adverse incidents observed, whether or not they caused an outage. The first instinct of many teams, once they understand 'what' went wrong, is to add a test case to the pre-release test suites used to qualify new releases. But defect removal as a strategy is almost always inferior to defect prevention, and certainly more costly. By broadening our understanding from 'what' went wrong to also see 'why' it went wrong, we can take a corrective action that eliminates all the potential future problems sharing the same root cause, the same 'why'. A recent out-of-memory condition I worked on provides an example. Adding tests and throttling workloads to the troubled component might solve the immediate problem, but we need to go upstream in the development process and understand why this out-of-memory condition was not prevented by coding better memory management in the first place. By so doing, we can prevent similar out-of-memory issues in all components across our solution. Root cause analysis views the development process as a software manufacturing engine: when it turns out a defective product, there must be a flaw in the engine to be corrected. Maniacally identifying and correcting these flaws pays off by tuning our engine to become flawless, efficient, and effective. And in the cloud, that is paramount.
PS: To sort the blog and display just the Cloud Difference series, click on the “cloud_difference” tag below the title of any post in the series.