After A Decent Interval
MartinPacker
I’m writing about intervals again.
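Two things have occasioned this: a change in how DB2 Version 10 cuts its Statistics Trace records, and some RMF interval settings I’ve seen at customers recently.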
As I’ve said before it’s important to understand the provenance of the data you’re using. This would be true whether it’s performance data, financial data, health data or anything else.
Most of the time you don’t see raw data: It’s as processed as the food we eat. I’m privileged to spend more time than most looking at raw SMF records, though they too aren’t “the horse’s mouth” really.
DB2 Statistics Trace
DB2 Version 10 introduced a significant change in the way its interval-based records are created. As previously noted, STATIME governs the frequency with which Statistics Trace records are cut. The default dropped from a useless 30 minutes to 5 in DB2 Version 9. The change in Version 10 was to decouple the frequency of cutting certain types of Statistics Trace records from STATIME: IFCIDs 1, 2, 202, 217, 225 and 230 are always cut every minute. I use most of these IFCIDs (and rarely the ones that aren’t in this list) so this is a good change for me.
But why do we care about the record-cutting interval? Let me take you on a journey from the raw SMF to the numbers I share with you (or which appear in any product’s reporting).
Consider the following spreadsheet and set of three graphs. It’s made-up data but is typical of some of the counters in DB2 Statistics Trace.
Top left is the raw spreadsheet data:
The remaining columns aren’t in the record but are derived:
The result of all this is the graph on the bottom right (labelled (3)).
In reality I tend to summarise DB2 statistics by hour and all the data in the spreadsheet represents a single net data point. Also I take the lowest value in the interval (the first) and subtract from the highest (the last). I suspect everyone else does too - after all a previous record just prior to the reporting period might not exist. But back to the data:
If you examine the time interval covered, 9:01 to 9:51 is only 50 minutes out of the hour, and the delta (19500 - 10000 = 9500) covers those 50 minutes only. So the best rate calculation is 9500 / 3000, or about 3.2 per second. In fact I do 9500 / 3600, which gives about 2.6 per second, and thus underestimate by about a sixth.
It’s probably a little picky to also point out that the rate within the hour might significantly vary from this 3.2 figure because of surges or drops in activity outside the 50 minutes of the hour captured.
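The rate arithmetic above can be sketched in a few lines of Python. This is only an illustration of the calculation as described, not the code I actually run: the function name `hourly_rates` is hypothetical, the date is arbitrary, and only the first and last counter values (10000 and 19500) come from the example; the intermediate values are invented.

```python
from datetime import datetime

# (timestamp, cumulative counter) pairs for the 9:00 hour.
# Only the first and last values come from the text; the date and the
# intermediate values are invented purely for illustration.
samples = [
    (datetime(2013, 1, 1, 9, 1), 10000),
    (datetime(2013, 1, 1, 9, 11), 11900),
    (datetime(2013, 1, 1, 9, 21), 13800),
    (datetime(2013, 1, 1, 9, 31), 15700),
    (datetime(2013, 1, 1, 9, 41), 17600),
    (datetime(2013, 1, 1, 9, 51), 19500),
]

def hourly_rates(samples):
    """Return (delta, rate over the covered time, rate over the full hour)."""
    first_time, first_value = samples[0]
    last_time, last_value = samples[-1]
    delta = last_value - first_value                      # last minus first
    covered_seconds = (last_time - first_time).total_seconds()
    return delta, delta / covered_seconds, delta / 3600

delta, covered_rate, hourly_rate = hourly_rates(samples)
print(f"delta={delta} covered={covered_rate:.2f}/s hourly={hourly_rate:.2f}/s")
```

Dividing by the covered 3000 seconds gives the better estimate; dividing by 3600 is simpler when summarising by hour but reads low.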
Actually a (simulated) STATIME of 10 (as here) isn’t so bad: Losing some rate doesn’t matter much. But consider the old default of 30 minutes for STATIME: It’s highly likely that 50% of the activity isn’t captured, and that the rate in the uncaptured 30 minutes of the hour differs significantly from the rate in the 30 minutes that are captured.
And that’s why I was so glad the default STATIME got dropped in Version 9. And why having these records cut every single minute in Version 10 is even better: Less than 2% of the activity and time is uncaptured - with a 1-minute interval.
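To put numbers on the uncaptured portion of the hour, here is a minimal sketch. It assumes the interval divides the hour exactly and that records are cut at the same offset in every interval, so the first-to-last delta always spans one interval less than the hour. The function name `uncaptured_fraction` is hypothetical.

```python
def uncaptured_fraction(interval_minutes, period_minutes=60):
    """Fraction of the period not spanned by the first-to-last record delta.

    Assumes interval_minutes divides period_minutes exactly and that
    records are cut at a fixed offset within each interval.
    """
    records = period_minutes // interval_minutes
    covered_minutes = (records - 1) * interval_minutes
    return 1 - covered_minutes / period_minutes

for interval in (30, 10, 5, 1):
    print(f"{interval:>2}-minute interval: "
          f"{uncaptured_fraction(interval):.1%} of the hour uncaptured")
```

Under these assumptions the uncaptured fraction is simply the interval divided by the hour: half the hour at 30 minutes, a sixth at 10, and under 2% at 1 minute.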
RMF Intervals
Recently I’ve seen customers with long RMF intervals (1 hour) and with inconsistent intervals (30 minutes for some systems and 20 for others).
Both of these lead me to have to summarise data at an hourly level. I consider it a form of “enforced squinting”.
Summarising at an hourly level makes the graphs more readable but, and this is a significant “but”, some peaks are shaved off. I’m most worried about underestimating things like peak CPU busy. (Conversely overestimating WLM Service Class Period Velocity.) Both of these can lead to installations having a flawed capacity plan, with the potential for performance crises. Over the years I’ve seen my fair share of these.
It would be a bit much, most of the time, to lower the RMF interval down to, say, 1 minute. Benchmark or test run situations might be an exception but 1440 data points a day per system is excessive. 15 minutes is generally fine.
Of course processors (or engines if you prefer) are rather binary: At any moment an engine is either 100% busy or 0% busy. Zooming right in would yield this rather useless “whiplash” information.
I could, in principle, zoom out from the lowest level of summarisation - the RMF interval - to 1 hour, 2 hours, 4, 8, then 24 - to show how the average CPU Utilisation peaks at lower and lower values. And one day I just might: It’d be a good pedagogical exercise.
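A rough sketch of that zooming-out exercise, using entirely made-up data: a day of 15-minute CPU busy values, flat at 40% apart from one busy hour. The helper `peak_of_averages` is a hypothetical name, and the blocks are aligned to the start of the day.

```python
def peak_of_averages(values, block):
    """Average consecutive blocks of `block` values and return the highest average."""
    averages = [sum(values[i:i + block]) / block
                for i in range(0, len(values), block)]
    return max(averages)

# Invented data: 96 fifteen-minute intervals, 40% busy except 10:00 to 11:00.
cpu_busy = [40.0] * 96
cpu_busy[40:44] = [70.0, 95.0, 80.0, 60.0]

peaks = {}
for hours in (0.25, 1, 2, 4, 8, 24):
    block = int(hours * 4)          # number of 15-minute intervals per block
    peaks[hours] = peak_of_averages(cpu_busy, block)
    print(f"{hours:>5} h summarisation: peak {peaks[hours]:.1f}% busy")
```

With this data the 95% peak at 15 minutes becomes about 76% at 1 hour and drifts down towards the daily average of roughly 41.5% as the summarisation interval grows.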
In the RMF case there’s no possibility of activity loss with long intervals, but there clearly is one of resolution loss.
It’s also helpful, by the way, to synchronise the RMF interval with the SMF interval: Drilling down from the RMF Service Class Period level to the Address Space level works better if you do.
Data Volume Sensitivity
So, shortening the interval at which you cut records is a good thing, in data quality terms. But how is it in terms of quantity?
In the case of DB2 Statistics Trace the amount of data is dwarfed by the amount of Accounting Trace, assuming you’re collecting the latter. So a shorter interval probably doesn’t affect your overall SMF data volume significantly.
In the case of RMF there are some high-volume records, most notably SMF 74-1 (Device Activity), 74-5 and 74-8 (Cache), and 74-4 (Coupling Facility Activity). Unless you disable these types you’ll collect much more data if you drop the RMF interval significantly.
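As a back-of-the-envelope sketch of how volume scales with the interval: this assumes, as a simplification, that the volume of interval-driven records is proportional to the number of intervals in the day. The function names are hypothetical.

```python
MINUTES_PER_DAY = 24 * 60

def intervals_per_day(interval_minutes):
    """Number of complete intervals (data points per system) in a day."""
    return MINUTES_PER_DAY // interval_minutes

def volume_multiplier(new_interval, old_interval=15):
    """Relative volume of interval records, assuming volume scales with interval count."""
    return intervals_per_day(new_interval) / intervals_per_day(old_interval)

for interval in (15, 5, 1):
    print(f"{interval:>2}-minute interval: {intervals_per_day(interval)} points/day, "
          f"{volume_multiplier(interval):.0f}x the 15-minute volume")
```

On that assumption, going from 15 minutes to 1 minute means 1440 data points a day instead of 96, and about 15 times the volume of the interval-driven records.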
So, I like the current product behaviours - 1 minute for most of DB2 Statistics Trace and RMF’s default interval of 15 minutes - as a good balance between quality and quantity.
But essentially I’m after a decent interval.