Latency almost never follows a normal (Gaussian) or Poisson distribution. Even if your latency did follow one of these distributions, the way we observe latency makes averages, medians and even standard deviations useless. If, for example, you’re measuring page loads, 99.9999999999% of these loads may be worse than your median request latency, because a single page load issues many requests and needs only one slow response to be slow. This is part of the reason that randomly sampling your latency produces inaccurate data, but more on this topic later.
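To see how a figure like that can arise, here is a minimal sketch. It assumes (this is an assumption, not something measured) that a page load issues several independent requests and is only as fast as its slowest one. The function name and the request counts are illustrative.

```python
def fraction_of_loads_worse_than(percentile: float, requests_per_page: int) -> float:
    """Chance that at least one request in a page load is slower than the
    given per-request percentile, assuming independent requests."""
    p_single_request_ok = percentile / 100.0
    # The page load stays at or below the percentile only if every request does.
    return 1.0 - p_single_request_ok ** requests_per_page


if __name__ == "__main__":
    # Hypothetical request counts per page load.
    for n in (1, 5, 40):
        share = fraction_of_loads_worse_than(50.0, n)
        print(f"{n:>2} requests per page: {share:.12%} of loads see worse than the median")
```

Under these assumptions, about 40 requests per page are enough for more than 99.9999999999% of page loads to include at least one response slower than the median.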

At this point, you’re probably asking yourself: if we aren’t using the standard deviation, how can we meaningfully describe latencies? The answer is that we must look at percentiles and maximums. Most people think to themselves, “Okay, so I look at P95 and I understand the ‘common case.’” The issue with this method is that P95 hides all the bad stuff. As Gil Tene, CTO of Azul Systems, says: “It’s a ‘marketing system.’ Someone is getting duped.”
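As a rough illustration of how P95 can hide the bad stuff, the sketch below summarizes a synthetic set of latencies. The data is made up for this example, and the `percentile` helper uses the simple nearest-rank definition, which is one of several common ways to compute a percentile.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples_ms):
    """Collect the statistics discussed in the text for one set of samples."""
    return {
        "mean": statistics.mean(samples_ms),
        "median": statistics.median(samples_ms),
        "p95": percentile(samples_ms, 95),
        "p99": percentile(samples_ms, 99),
        "max": max(samples_ms),
    }


# Synthetic data (an assumption for illustration): 980 requests at 10 ms
# and 20 requests that stalled for 2000 ms.
samples_ms = [10] * 980 + [2000] * 20

if __name__ == "__main__":
    for name, value in summarize(samples_ms).items():
        print(f"{name:>6}: {value} ms")
```

With this data the median and P95 both report 10 ms, so 2% of requests stalling for two seconds is visible only in P99 and the maximum.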

Take, for example, this graph:

When you see this graph, you can clearly see why its median and average have no real significance: they don’t show the problem area. When you see the 95th percentile shoot up to the left, you think you’re seeing the heart of the problem. That isn’t true, though. When you investigate why your program had a hiccup using only this line, you’re failing to see the worst 5% of what happened. For the 95th percentile to spike like this, the top 5% of the data must be significantly worse still.

Now look at the same graph that also shows the 99.99th percentile:

The red line is the 95th percentile, whereas the green line is the 99.99th percentile. As you can clearly see, the 95th percentile shows only 2 out of 22 of your issues, which is why you must look at the full spectrum of your data.

Many people may think that the last 5% of data doesn’t hold much significance. Sure, it could just be a virtual machine (VM) restarting, a hiccup in your system or something like that. But by ignoring it, you’re acting as if it doesn’t happen, when it could be one of the most important things for you to target.

Gil Tene likes to make the bold claim that “The number one indicator you should never get rid of is the maximum value. That is not noise, that is the signal. The rest of it is noise.” While the maximum is, indeed, a great signal in a system at large scale, it’s often not practical to pursue just the maximum case. No system is perfect, and hiccups do occur. In a large-scale practical system, pursuing the maximum case exclusively is often a good way to burn out your development team.

When looking at the 99.99th percentile, you’re seeing what happens to the vast majority of your customers. Any spikes you see there are actual issues, whereas a spike in your maximum alone may just be a hiccup in your system. When your DevOps teams focus their effort on these small hiccups, they do so at a large opportunity cost, because that time can’t be spent on larger issues.
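One way to turn that distinction into a rough triage rule is sketched below. The threshold, the labels and the function names are hypothetical choices for this example, and the percentile again uses the nearest-rank definition.

```python
import math


def nearest_rank(ordered_samples, p):
    """Nearest-rank percentile over an already sorted list."""
    rank = max(1, math.ceil(p * len(ordered_samples) / 100))
    return ordered_samples[rank - 1]


def classify_window(samples_ms, spike_threshold_ms):
    """Label one time window of latency samples.

    spike_threshold_ms is a hypothetical, team-chosen limit above which
    a latency counts as a spike.
    """
    ordered = sorted(samples_ms)
    p9999 = nearest_rank(ordered, 99.99)
    worst = ordered[-1]
    if p9999 >= spike_threshold_ms:
        # Enough requests were slow to move the 99.99th percentile.
        return "investigate"
    if worst >= spike_threshold_ms:
        # Only the maximum spiked: possibly a one-off hiccup.
        return "possible hiccup"
    return "ok"


if __name__ == "__main__":
    # Synthetic windows of 100,000 requests each (illustrative data).
    quiet = [10] * 100_000
    one_stall = [10] * 99_999 + [5000]
    incident = [10] * 99_950 + [5000] * 50
    for name, window in (("quiet", quiet), ("one_stall", one_stall), ("incident", incident)):
        print(name, "->", classify_window(window, spike_threshold_ms=1000))
```

A window in which only a single request stalls is labeled a possible hiccup, while a window in which enough requests stall to move the 99.99th percentile is flagged for investigation.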

It’s worth noting that if your 99.99th percentile and your maximum are very close to each other, and both are spiking, it’s a strong signal that this is an issue your team should work on. In this way, Gil is right that the maximum is a great signal, but wrong that the rest of your data is just noise. As you can see in this graph, the 99.99th percentile and the maximum from our previous example match up exactly. It’s a great signal that what you are looking at is a real bug and not just a hiccup: