A few customer experiences this year have led me to avoid the phrase "root cause analysis" (RCA). I recommend dropping the phrase in favor of "causal chain," and here's why:
Root cause analysis is the search for the primary, sufficient condition that causes a problem. The first danger of "root" cause analysis is that you may think you're done when you're not. The word "root" suggests finality, but how do you know you're done? For example, in a situation earlier this year, the problem was high user response times. The proximate cause was that the processors were saturated. The processors were being driven so hard because System.gc was being called frequently, forcing garbage collections. This was thought to be the "root cause," so somebody suggested using the option -Xdisableexplicitgc to make calls to System.gc do nothing. Everyone sighed in relief; root cause was found! Not so. The System.gc calls were happening because of native OutOfMemoryErrors (NOOMs) when trying to load classes, and -Xdisableexplicitgc doesn't affect the GCs that the JVM itself forces when handling certain NOOMs. After much more investigation, we arrived at a very complex causal chain in which there wasn't even a single cause.
The second danger of root "cause" analysis is that it suggests a single cause, which, as the example above shows, isn't always the case.
Properly understood and with all the right caveats, RCA is fine, but it is rarely properly understood and rarely comes with caveats. Once someone declares that "root cause" has been found, most people are satisfied, especially if removing that cause seems to avoid the problem.
I find it interesting that the term "root" has gained such a strong hold when it is clearly too strong a term. My guess is that "root" was added to "cause analysis" because, without it, some people might stop at the first cause. Perversely, the phrase has caused the very sloppiness, laziness and false sense of accomplishment that it was probably designed to avoid. Given that both phrases suffer from the same problem, I think "root cause analysis" is worse than "cause analysis" because at least the latter is more open-ended. Instead, I prefer the term "causal chain" because it frames the investigation as a chain of causes and effects and is more suggestive of the open-endedness of that chain (actually, it's probably more like a causal graph, but I think chain is good enough).
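To make the chain-versus-graph point a little more concrete, here is a minimal Java sketch of how such an investigation could be recorded. The class CausalGraph and its methods are hypothetical names made up for illustration. The only links it records are the ones named earlier in this post, not the full chain we eventually uncovered. Note that a node with no recorded cause is treated as "unexplained so far," not as a root.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: a causal graph stored as effect -> direct causes.
final class CausalGraph {
    private final Map<String, List<String>> causesOf = new LinkedHashMap<>();

    // Record that `cause` directly contributes to `effect`.
    void addCause(String effect, String cause) {
        causesOf.computeIfAbsent(effect, k -> new ArrayList<>()).add(cause);
    }

    // Every chain from `effect` back to a node that has no recorded cause.
    // Such a node is only "unexplained so far", not a proven root.
    List<List<String>> chainsFrom(String effect) {
        List<List<String>> chains = new ArrayList<>();
        walk(effect, new ArrayList<>(), chains);
        return chains;
    }

    private void walk(String node, List<String> path, List<List<String>> out) {
        if (path.contains(node)) {
            return; // the recorded links loop back on themselves; stop here
        }
        path.add(node);
        List<String> causes = causesOf.get(node);
        if (causes == null || causes.isEmpty()) {
            out.add(new ArrayList<>(path)); // nothing further is known yet
        } else {
            for (String cause : causes) {
                walk(cause, path, out);
            }
        }
        path.remove(path.size() - 1);
    }

    // Builds only the links named in this post.
    static CausalGraph exampleFromPost() {
        CausalGraph g = new CausalGraph();
        g.addCause("high user response times", "saturated processors");
        g.addCause("saturated processors", "frequent garbage collections");
        g.addCause("frequent garbage collections", "System.gc calls");
        g.addCause("System.gc calls", "native OutOfMemoryErrors while loading classes");
        return g;
    }

    public static void main(String[] args) {
        CausalGraph g = exampleFromPost();
        for (List<String> chain : g.chainsFrom("high user response times")) {
            System.out.println(String.join(" <- ", chain));
        }
    }
}
```

With only these links, the sketch prints a single chain. Calling addCause a second time for the same effect makes chainsFrom return one chain per branch, which is where "chain" quietly turns into "graph."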