A couple of years ago, a well-known franchise experienced a significant computer outage that affected hundreds of its stores throughout the U.S. The impact of the outage lasted for nearly a day, and the problem made the headlines. Obviously, this is the kind of problem that no business wants to have. Loss of sales, unhappy customers, and bad press don’t make for a good day…
With that in mind, let’s assume that it had been years since this franchise had experienced any kind of outage. Do you think customers (as well as shareholders) would have been happy to hear the company’s executives say something like, “But it’s been years since we’ve had any kind of outage!” Probably not… Was anyone focused on how long it had been since the last outage, or were folks more interested in the “here and now”? Given that the outage made the headlines, it’s clear that the lengthy delay in getting back to normal was the primary concern.
Now, let’s look at a hypothetical situation where a company experiences several computer outages every day, but the length of each outage is only one second (or less). Would anyone even notice? Probably not… Because the recovery time was so fast, the only folks who would likely even be aware of any of the outages would be the operations folks, and only then because they were analyzing the logs from the monitoring tools in place.
Hopefully you can see where I’m going with this: a company that experiences a failure only once every couple of years (i.e., has a very good Mean Time to Failure, or MTTF) but takes a day or more to recover from an outage (i.e., has a very poor Mean Time to Recovery, or MTTR) is likely to see more negative results than a company that has a very poor MTTF but an extremely good MTTR.
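To put rough numbers on that comparison, here is a minimal sketch in Python. The figures are my own assumptions for illustration, not measurements from any real company: a two-year window, one outage lasting a full day for the first company, and five one-second outages per day for the second (the "several" in my hypothetical). The function name availability is likewise just a label I picked.

```python
SECONDS_PER_DAY = 24 * 60 * 60
WINDOW_DAYS = 2 * 365  # assumed two-year observation window


def availability(outage_count, seconds_per_outage, window_days=WINDOW_DAYS):
    """Return the fraction of the window during which the service was up."""
    window_seconds = window_days * SECONDS_PER_DAY
    downtime_seconds = outage_count * seconds_per_outage
    return 1 - downtime_seconds / window_seconds


# Assumed: a single outage in two years that takes a full day to recover from.
rare_but_slow = availability(outage_count=1, seconds_per_outage=SECONDS_PER_DAY)

# Assumed: five outages every day, each lasting one second.
frequent_but_fast = availability(outage_count=5 * WINDOW_DAYS, seconds_per_outage=1)

print(f"Rare but slow to recover:     {rare_but_slow:.4%}")
print(f"Frequent but fast to recover: {frequent_but_fast:.4%}")
```

Under these assumptions, the first company loses 24 hours in a single stretch, while the second loses 3,650 seconds (about an hour) spread across 3,650 blips that nobody notices. Total downtime is only part of the story, of course, but it shows how much recovery time drives the outcome.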
So, just to be clear, I’m not suggesting that MTTF should be ignored – referring to my second example, I’d really like to know why outages are occurring so frequently, and then work to reduce the number of outages (even if they do only last a second or so). What I am suggesting is that MTTR should be one of your “front and center” metrics. The faster you can recover from an outage, the less noticeable the outage will be and, therefore, the more negligible the impact.
If you haven’t started measuring MTTR for your offering, please allow me to suggest that you need to begin doing so ASAP. And, once you have MTTR measurements in place, then begin working to improve them (no matter how good they may be right now).
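If you’re wondering what a first measurement might look like, here is one possible sketch in Python. It assumes you can pull a start time and a recovery time for each outage from your monitoring tools; the function name mean_time_to_recovery and the incident timestamps are made up purely for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average the (recovered - started) duration across incidents.

    Each incident is a (started, recovered) pair of datetimes.
    Returns None when there are no incidents to average.
    """
    if not incidents:
        return None
    total = sum((recovered - started for started, recovered in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incident log: (outage started, service recovered).
incidents = [
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 30)),
    (datetime(2024, 2, 11, 14, 0), datetime(2024, 2, 11, 14, 10)),
    (datetime(2024, 3, 2, 22, 0), datetime(2024, 3, 2, 22, 20)),
]

mttr = mean_time_to_recovery(incidents)
print(f"MTTR: {mttr}")  # MTTR: 0:20:00
```

Once a number like this is being tracked over time, it becomes much easier to tell whether your recovery efforts are actually paying off.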