Troubleshooting
Problem
Summary
The 99th percentile (P99) latency metric is widely used to measure system performance by capturing the latency experienced by the slowest 1% of requests. While it can be a powerful metric for high-throughput systems, P99 latency can be extremely unreliable for workloads with low traffic, such as during downtimes when the system handles only a few requests per second.
This article explains why P99 latency metrics are unreliable in such scenarios and suggests alternative approaches for monitoring and optimizing performance.
Applies to
- Astra Serverless
- Astra Classic
- DSE (DataStax Enterprise)
- All database systems
Symptoms
Uncharacteristically high P99 and P999 latency metrics are observed while the database is handling only a few requests.
Cause
This is a problem with statistics, not with the database itself. Because of the small sample size, the P99 and P999 latency metrics become increasingly unreliable during low-traffic workloads.
Small Sample Size
- P99 metrics require a statistically significant sample to be meaningful. For a workload with only 5 requests per second:
  - Over a 1-minute interval, there are just 300 requests.
  - The 99th percentile applies to only 3 requests.
- This small number means that any single slow request disproportionately skews the P99 metric.
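The arithmetic behind this can be sketched in a few lines of Python (the function name is illustrative, not from any library):

```python
def slowest_one_percent(rate_per_sec: float, window_sec: int = 60) -> int:
    """How many requests fall in the slowest 1% for a given traffic rate."""
    total = int(rate_per_sec * window_sec)
    return max(1, round(total * 0.01))

# At 5 requests/second, P99 rests on just 3 samples per minute;
# at 1000 requests/second, it rests on 600.
print(slowest_one_percent(5))     # 3
print(slowest_one_percent(1000))  # 600
```

With only 3 samples defining the tail, a single anomalous request shifts the reported percentile; with 600, it barely registers.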
Outlier Sensitivity
- In low traffic scenarios, even one anomalously slow request (e.g., due to a transient issue) can dramatically inflate the P99 latency.
- This makes it difficult to distinguish between actual performance problems and random outliers.
Lack of Granularity
- At low traffic rates, P99 metrics may not accurately reflect typical system behavior. Instead, they highlight extreme cases that occur infrequently and may not impact overall user experience.
Irregular Request Patterns
- During downtimes, requests may arrive sporadically rather than in a steady stream. This irregularity can introduce variability in latency that is not representative of normal operations.
Example
Consider a system handling 5 requests per second during downtime. Over a 1-minute period:
- 300 total requests are processed.
- The 99th percentile corresponds to the 297th fastest request.
If a single request experiences a network glitch or disk contention, its latency might jump to several seconds. Because the top 1% covers only 3 requests, that single anomaly can dominate the reported P99 latency, even if the other 299 requests were handled in milliseconds. As a result, the reported P99 latency becomes misleading.
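A short Python sketch makes the skew concrete. Assume a 10-second reporting window at 5 requests per second (50 samples), with one request hitting a hypothetical 2-second stall; the latency figures are made up for illustration:

```python
import statistics

# 49 requests complete in ~2 ms; one stalls for 2 seconds.
latencies_ms = [2.0] * 49 + [2000.0]

cuts = statistics.quantiles(latencies_ms, n=100)  # 99 percentile cut points
p50, p99 = cuts[49], cuts[98]
print(f"P50 = {p50:.1f} ms")  # stays at 2.0 ms
print(f"P99 = {p99:.1f} ms")  # inflated to hundreds of ms by one outlier
```

The exact P99 value depends on the quantile estimator (`statistics.quantiles` interpolates between ranks), but the point holds regardless: one sample moves the reported P99 by two to three orders of magnitude while the P50 is untouched.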
Example illustrations
Here are some metrics taken from a database during low traffic workloads due to the New Year.
[Figure: read request load alongside P99 and P50 read latency during the low-traffic period]
The inverse relationship between request rate and reported latency is clearly visible and is amplified for the P99 metric. The effect is most pronounced when the request rate falls below 5 requests per second.
Alternative Metrics for Low Traffic Scenarios
To monitor and optimize low-traffic workloads, consider these alternatives:
P90 or Median (P50) Latency
- These metrics are less sensitive to outliers and provide a more realistic view of typical system performance.
Latency Distribution
- Analyze the full distribution of request latencies over a longer time window to identify patterns and trends.
Request Success Rate
- Monitor the success rate at the application level to ensure the system is meeting reliability goals, even if individual requests are slow.
Time-Weighted Aggregates
- Use time-weighted averages or moving percentiles over longer intervals to smooth out the impact of outliers.
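As a rough illustration of the first suggestion, here is how a similar low-traffic sample looks through P50 and P90 instead of P99 (the data is hypothetical):

```python
import statistics

# Hypothetical low-traffic window: 50 fast requests plus one 1.5 s outlier.
latencies_ms = [1.9, 2.0, 2.0, 2.1, 2.2] * 10 + [1500.0]

p50 = statistics.median(latencies_ms)
p90 = statistics.quantiles(latencies_ms, n=10)[8]  # 9th decile, i.e. P90
print(f"P50 = {p50:.1f} ms, P90 = {p90:.1f} ms")  # both stay in the low ms
```

Because P50 and P90 sit well inside the sample rather than at its edge, a single outlier cannot drag them; pair them with a success-rate check so genuinely failing requests are still caught.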
Conclusion
P99 and P999 latency metrics can be misleading and unreliable in low-traffic scenarios due to small sample sizes, outlier sensitivity, and irregular request patterns. For systems with low throughput, rely instead on alternative metrics such as median latency, the full latency distribution, and success rates to gain meaningful insight into performance.
When the request load drops below roughly 5 requests per second, P99 and P999 latency metrics are effectively meaningless as indicators of typical performance.
By tailoring your monitoring strategy to the characteristics of your workload, you can achieve more accurate performance evaluations and make better-informed optimization decisions.
Last Reviewed Date: 7 January 2025
Document Location
Worldwide
Historical Number
ka0Ui0000003jQPIAY
Document Information
Modified date:
30 January 2026
UID
ibm17258480