Measuring transaction response time against a percentile service level agreement
MarkWeatherill
We often hear from customers who want to measure a percentile, such as the 95th percentile, of transaction response times. That is, given a set of response times, what is the response time that 95% of transactions beat and 5% exceed?
For a short time period, say 5 minutes, the 95th percentile of transaction response times can be calculated efficiently. Provided the transaction rate isn't too high, it is even easy to calculate it for an hour or a day. But for many of our customers, the transaction rate will be hundreds or thousands of transactions per second and the period might be weeks or months. The challenge with percentiles is that they cannot be accurately calculated from aggregates. For example, it is incorrect to calculate the percentile for every 5-minute period in a day and then average those values across the 24 hours. The only way to accurately calculate the percentile is to use the original raw data. But calculating the monthly percentile at a rate of 100 transactions per second, for example, requires roughly 260 million samples to be stored (100 × 86,400 seconds per day × 30 days)!
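To make the point about aggregates concrete, here is a minimal Python sketch. The two windows of response times are invented for illustration, and the `percentile` helper is a simple nearest-rank calculation rather than anything taken from ITCAM. It compares the average of two per-window 95th percentiles with the 95th percentile of the combined raw data, and also reproduces the storage arithmetic, assuming a 30-day month.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct% of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Two hypothetical 5-minute windows of response times in milliseconds.
quiet_window = [100] * 10               # 10 transactions, all fast
busy_window = [100] * 80 + [900] * 10   # 90 transactions, 10 of them slow

# Averaging the per-window percentiles ignores how many transactions
# each window contained.
average_of_percentiles = (
    percentile(quiet_window, 95) + percentile(busy_window, 95)
) / 2

# The true value needs every raw sample from both windows.
true_percentile = percentile(quiet_window + busy_window, 95)

print(average_of_percentiles)  # 500.0
print(true_percentile)         # 900

# Raw samples for one month at 100 transactions per second (30-day month).
samples_per_month = 100 * 60 * 60 * 24 * 30
print(samples_per_month)       # 259200000
```

In this made-up data the averaged figure (500ms) sits well below the true 95th percentile (900ms), because the quiet window counts as much as the busy one even though it handled far fewer transactions.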
Often the requirement to measure the monthly percentile comes from a Service Level Agreement (SLA) for transaction response times, such as "across the month, the 95th percentile for response time must be less than 500ms". Fortunately, ITCAM for Transactions can efficiently measure conformance with this type of SLA without blowing the budget on storage. The trick is to focus on the response time threshold rather than the percentile.
It is more efficient to check whether each transaction is faster than 500ms and keep a count of "Good" and "Slow" transactions. Over any time period, simply find the ratio of "Good" to "Total" transactions. A result of 97% Good indicates that 500ms corresponds to the 97th percentile, so the SLA has been met. A result of 81% Good indicates that 500ms corresponds to the 81st percentile, so the SLA has been breached. Because these are plain counts, they can be summed over any time period, which provides plenty of flexibility for reporting.
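The counting approach can be sketched in a few lines of Python. This is an illustration of the idea only, not how ITCAM implements it: the `SlaCounter` class, its method names and the sample response times are all hypothetical, and treating exactly 95% Good as meeting a 95th percentile SLA is an assumption.

```python
from dataclasses import dataclass


@dataclass
class SlaCounter:
    """Hypothetical Good/Slow counter for one fixed response time threshold."""
    threshold_ms: float
    good: int = 0
    slow: int = 0

    def record(self, response_ms):
        # "Good" means strictly faster than the threshold.
        if response_ms < self.threshold_ms:
            self.good += 1
        else:
            self.slow += 1

    @property
    def total(self):
        return self.good + self.slow

    def merge(self, other):
        # Counts from different periods simply add, unlike percentiles.
        return SlaCounter(self.threshold_ms,
                          self.good + other.good,
                          self.slow + other.slow)

    def meets(self, target_pct):
        # Integer comparison avoids floating-point rounding at the boundary.
        return self.total > 0 and self.good * 100 >= target_pct * self.total


day1 = SlaCounter(threshold_ms=500)
for ms in [120, 300, 480, 650]:
    day1.record(ms)

day2 = SlaCounter(threshold_ms=500)
for ms in [200] * 16:
    day2.record(ms)

both_days = day1.merge(day2)
print(day1.good, day1.total, day1.meets(95))                 # 3 4 False
print(both_days.good, both_days.total, both_days.meets(95))  # 19 20 True
```

Only two integers per reporting period need to be stored, however many transactions were processed, and daily counts roll up into weekly or monthly figures by addition.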
The limitation of this approach is that the threshold has to be configured up front. Once the end of the month is reached, it is too late to go back and enter a new value. For an SLA, this limitation is generally acceptable since a value has been defined as part of the contract.
The Response Time agents have supported response time thresholds in all versions of ITCAM for Transactions. In version 18.104.22.168, the feature was extended to the Transaction Collector agent, which means it can now be applied to any transaction monitored by a transaction tracking data collector plugin, such as a Microsoft .NET web service, a WebSphere Message Broker flow or a CICS transaction. The response time thresholds are configured using properties in the Application Management Configuration Editor. Response times over the "Min" threshold are counted as "Slow" and response times over the "Max" threshold are counted as "Failed". These counts are available in the Aggregates table and displayed in the Transactions Summary workspace.
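As a rough illustration of how the two thresholds split transactions into three buckets, consider the Python sketch below. The `classify` function, the 500ms and 2000ms values and the sample response times are assumptions made for the example; they are not product configuration or an ITCAM API. The sketch also assumes that a response time exactly equal to a threshold is not "over" it.

```python
from collections import Counter


def classify(response_ms, min_ms, max_ms):
    """Hypothetical bucketing: over Max is Failed, over Min is Slow,
    anything else is Good."""
    if response_ms > max_ms:
        return "Failed"
    if response_ms > min_ms:
        return "Slow"
    return "Good"


# Invented response times in milliseconds, with Min=500 and Max=2000.
response_times = [120, 480, 500, 650, 1999, 2500]
counts = Counter(classify(ms, 500, 2000) for ms in response_times)

print(counts["Good"], counts["Slow"], counts["Failed"])  # 3 2 1
```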