Properly measuring latency requires that you have quality data. KPMG’s 2016 Global CEO Outlook found that 84% of CEOs are concerned about the quality of the data they’re basing decisions on, because data can often be misleading.
The difference between companies that care about their data and those that don’t is huge. MIT researchers found that companies that have adopted a data-driven design have an output that’s 5%–6% higher than what would be expected given their other investments and information technology use. This reason alone makes understanding latency critical to business success.
In just seven minutes, you’ll learn everything you need to know about measuring latency:
Dictionary.com defines latency as “the period of delay when one component of a hardware system is waiting for an action to be executed by another component.” In simpler terms, it means the amount of time between calling a function and its actual execution. Latency is inherent in all systems; even a perfect system, which doesn’t exist, would still be delayed by the amount of time it takes for the electrons in the computer to switch the transistors from on to off or vice versa.
Latency in small operations isn’t a big deal, but when handling millions of operations, there are millions of latencies that add up fast. Latency isn’t defined by work units and time but, instead, by how a system behaves. Monitoring tools report how long it takes from the start of a function until the end of the function.
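As a rough illustration of that start-to-end measurement, here is a minimal Python sketch. The helper `measure_latency` and the workload `slow_square` are hypothetical names made up for this example and aren’t part of any monitoring tool.

```python
import time

def measure_latency(func, *args, **kwargs):
    """Call func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()            # monotonic, high-resolution clock
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start  # time from start of call to its end
    return result, elapsed

def slow_square(x):
    # Hypothetical workload standing in for a real operation.
    time.sleep(0.01)
    return x * x

result, elapsed = measure_latency(slow_square, 7)
print(f"result={result}, latency={elapsed * 1000:.2f} ms")
```

A single measurement like this says very little on its own. What matters is how millions of them are distributed.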
Latency can have a major impact on your business. For example: “When it comes to mobile speed, every second matters—for each additional second it takes a mobile page to load, conversions can drop by up to 20%.”
Latency almost never follows a normal (Gaussian) or Poisson distribution. Even if your latency does follow one of these distributions, the way we observe latency makes averages, medians and even standard deviations useless. If, for example, you’re measuring page loads, 99.9999999999% of these loads may be worse than your median. It’s part of the reason that randomly sampling your latency causes inaccurate data, but more on this topic later.
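One way to see how a figure like that can arise is to assume that a page load is made up of many separate requests, and that each request independently has a 50% chance of being slower than the median. Both assumptions, along with the function name and the request counts below, are illustrative and not taken from a specific measurement.

```python
def share_worse_than_median(requests_per_page: int) -> float:
    """Probability that at least one of N independent requests is slower than the median."""
    p_all_at_or_below_median = 0.5 ** requests_per_page
    return 1.0 - p_all_at_or_below_median

# Hypothetical request counts per page load.
for n in (1, 5, 10, 40):
    print(f"{n:>3} requests per page: {share_worse_than_median(n):.12%}")
```

Under these assumptions, a page that issues around 40 requests almost always includes at least one response slower than the median, which is why the median says so little about what a user experiences.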
At this point, you’re probably asking yourself: if we aren’t using any standard deviation, how can we meaningfully describe latencies? The answer is that we must look at percentiles and maximums. Most people think to themselves, okay, so I look at P95 and I understand the “common case.” The issue with this method is that P95 is going to hide all the bad stuff. As Gil Tene, CTO of Azul Systems, says: “It’s a ‘marketing system.’ Someone is getting duped.”
Take, for example, this graph:
When you see this graph, you can clearly see why its median and average have no real significance: they don’t show the problem area. When you see the 95th percentile shoot up to the left, you think you’re seeing the heart of the problem. That isn’t true, though. When you investigate why your program had a hiccup, you’re failing to see the worst 5% of what happened. A spike like this can only appear if the top 5% of the data is significantly worse.
Now look at the same graph that also shows the 99.99th percentile:
That red line is the 95th percentile, whereas the green line is the 99.99th percentile. As you can clearly see, the 95th percentile shows only 2 of your 22 issues, which is why you must look at the full spectrum of your data.
Many people may think that the last 5% of data doesn’t hold that much significance. Sure, it could just be a virtual machine (VM) restarting, a hiccup in your system or something like that, but by ignoring it, you’re saying that it just doesn’t happen, when it could be one of the most important things for you to target.
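The same effect can be reproduced with a small synthetic data set. The sketch below is only an illustration: the latency values are made up, and the `percentile` helper uses the simple nearest-rank definition, which may differ from what your monitoring tool uses.

```python
import math
import statistics

def percentile(sorted_values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    rank = math.ceil(pct * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]

# Synthetic latencies in milliseconds (assumed for illustration):
# 99,900 ordinary requests at 10 ms and 100 stalled requests at 2,000 ms.
latencies = sorted([10.0] * 99_900 + [2000.0] * 100)

mean = statistics.fmean(latencies)
median = statistics.median(latencies)
p95 = percentile(latencies, 95)
p9999 = percentile(latencies, 99.99)
maximum = latencies[-1]

print(f"mean={mean:.2f} median={median} p95={p95} p99.99={p9999} max={maximum}")
```

In this made-up data set, the mean, median and 95th percentile all sit at or near 10 ms, while the 99.99th percentile and the maximum both land on the 2,000 ms stalls.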
Gil Tene likes to make the bold claim that “The number one indicator you should never get rid of is the maximum value. That is not noise, that is the signal. The rest of it is noise.” While the maximum is, indeed, a great signal, in a system at a large scale it’s often not practical to pursue just the maximum case. No system is perfect, and hiccups do occur. In a large-scale practical system, pursuing the maximum case exclusively is often a good way to burn out your development team.
When looking at the 99.99th percentile, you’re seeing what happens to the large majority of your customers, and any spikes you see there you know are actual issues, whereas any spikes in your maximum may just be a hiccup in your system. When your DevOps teams focus their effort on these small hiccups, they’re doing so at a large opportunity cost, as they can’t work on more major issues instead.
It’s of note that if your 99.99th percentile and your maximum are very close to each other, and both are spiked, then it’s a great signal that it’s an issue your team should work on. In this way, Gil is right that the maximum is a great signal, but wrong that the rest of your data is just noise. As you can see in this graph, our 99.99th percentile and maximum from our previous example match up exactly. It’s a great signal that what you’re looking at is a real bug and not just a hiccup:
An even worse pitfall than just looking at the 95th percentile is failing to recognize that your percentiles are averaged. Averaging percentiles is statistically absurd; it removes all significance from what you’re looking at. We have already shown how averages aren’t good when looking at latency and, if you’re looking at averaged percentiles, you’re simply right back to square one. Many software programs average your percentiles. Take, for example, this Grafana chart:
Whether or not you realized it before, all the percentiles on this chart are averaged. It says so right there in the x-axis legend. Nearly all monitoring services average your percentiles. It’s a reality due to precomputation. When your monitoring service takes in your data, it’s computing the percentile of the data for that minute.
Then when you look at your 95th percentile, it’s showing you an average of all your percentiles. This shortcut, taken for “your good” to make your service faster, is, in reality, removing all statistical significance from your data.
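To make the problem concrete, here is a small sketch comparing an average of per-minute 95th percentiles with the 95th percentile of the combined raw data. The two one-minute windows and their values are invented for illustration, and the nearest-rank `percentile` helper is one common definition among several.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile of an unsorted list of values."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Two hypothetical one-minute windows of latencies in milliseconds.
quiet_minute = [10.0] * 100                    # 100 requests, all fast
busy_minute = [10.0] * 9_000 + [1000.0] * 1_000  # 10,000 requests, 10% slow

per_minute_p95 = [percentile(quiet_minute, 95), percentile(busy_minute, 95)]
averaged_p95 = sum(per_minute_p95) / len(per_minute_p95)

# The percentile computed over every raw data point from both minutes.
true_p95 = percentile(quiet_minute + busy_minute, 95)

print(f"per-minute p95: {per_minute_p95}")
print(f"averaged p95:   {averaged_p95} ms")
print(f"true p95:       {true_p95} ms")
```

Here the averaged value comes out at 505 ms, a latency that no request in the data actually had, while the 95th percentile of the raw data is 1,000 ms. The average treats the quiet minute and the busy minute as equally important, even though the busy minute carried 100 times the traffic.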
Whether or not you know it, monitoring tools that sample their data are also producing averaged data. Almost every monitoring tool samples its data. Take, for example, Datadog: it has major data loss. If you send it 3 million points in a minute, it will not take them all. Instead, it will randomly sample the points and then aggregate them into 1 point per minute.
You must have unsampled data to understand your latency. It’s inherent that with sampled data you can’t access the full distribution. Your maximum isn’t your true maximum, nor is your global percentile an accurate representation of what is going on.
When you sample data, you’re omitting data. Say, for example, you have 10,000 operations happening in a minute, sending out 2 data points each to your monitoring system, for 20,000 data points in total. Say you have a bug in your system and one of these data points per 10,000 operations shows this bug. Your monitoring system only has a 1/20,000 chance of choosing this bug as the data point it shows you as the maximum.
If you run long enough, the data point will show up eventually but, as a result, it will look like a sporadic edge case, even though it’s happening to one of your customers every minute. When you don’t sample data, and you have one of these spikes, it will show up clearly in your 99.99th percentile, and your maximum will show up close to it, signaling that you have a bug in your program. When you sample your data, however, it won’t show up as often, meaning you won’t see it as a bug but rather as a hiccup. This result means your engineering team will fail to realize the significance of it.
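The arithmetic behind this example can be sketched in a few lines. The sketch assumes exactly one buggy data point among the 20,000 sent each minute and one uniformly random point kept per minute, independently from minute to minute. These are simplifying assumptions for illustration, not a description of how any particular tool samples.

```python
POINTS_PER_MINUTE = 10_000 * 2         # 10,000 operations x 2 data points each
P_CATCH = 1 / POINTS_PER_MINUTE        # chance the one kept point is the buggy one

def chance_seen_within(minutes: int) -> float:
    """Probability the buggy point is sampled at least once within the given minutes."""
    return 1 - (1 - P_CATCH) ** minutes

for label, minutes in (("1 hour", 60), ("1 day", 1_440), ("1 week", 10_080)):
    print(f"{label:>6}: {chance_seen_within(minutes):.2%}")
```

Under these assumptions, the chance of seeing the bug even once in a full day stays well under 10%, although it affects a customer every single minute.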
Don’t let your monitoring tool fool you into thinking you know what’s going on with your latency. One of the key features of IBM Instana™ software is its ability to measure latency efficiently. IBM Instana software uses advanced analytics and machine learning (ML) to automatically detect latency issues in real-time, allowing developers and IT teams to quickly identify the root cause of any performance problems and take corrective action before they impact users.
Choose a tool that doesn’t provide sampled data. Choose a tool that doesn’t average your global percentiles.