New CPU Information In SMF Type 30 Records
MartinPacker
Round about now you'd be expecting posts to be geared towards the recent zEC12 announcement, or perhaps CICS TS 5.1 or the DB2 11 Preview, or IDAA V3. So what this post is about will probably have slipped by unnoticed. After all you don't spend all your time looking for obscure New Function APARs, do you?
But I think some of you will find this one of value, or at least quite interesting.
(I presented a slide on this at the UKCMG 1-day meeting, October 10 2012, so you might consider this to be the script for that slide.)
I could've given this post a provocative title like "Do You Really Need CICS PA?"[1] and you'll see in a minute why that's an only slightly daft question to ask.
Single TCB (task) speed has always been an important topic, and continues to be so. Here are three examples of why, irrespective of processor technology:
And these are just the most obvious examples, which we all know and love.
There's an industry trend that's beginning to make this even more important: although the zEC12 had a very healthy single-processor speed increase over the z196, this is not a long-term trend. Processors of all architectures are getting faster more slowly, and this probably isn't going to change. All architectures are relying on more engines and more threads to support larger workloads, and zEC12 upping the limit to 101 engines from 80 is a good example of that.
So it behoves us to understand the single-TCB proclivities of our workloads, for all four reasons, and more.
A key point is, of course, what to do about it. But this post introduces some new instrumentation that at least helps with the analysis.
APAR OA39629, available for z/OS Releases 12 and 13, has the title "New Function To Report The Highest Percent Of CPU Time Used By A Single Task In An Address Space".
It provides two new fields in SMF 30 Interval (subtypes 2 and 3), Step-End (subtype 4) and Job-End (subtype 5) records: the highest percentage of CPU time used by a single task in the address space, and a program name to go with that task.
For Step- and Job-End records the CPU % is the highest percentage among the intervals during the running of the job or step.
The rest of this post is slightly speculative, as I'll confess I haven't yet seen any SMF 30 records containing the new fields.
When there is no CPU you get blanks for the program name. If the program can't be determined you get '????????'.
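Here is a minimal Python sketch of how reporting code might tell those three cases apart. It is illustrative only: the 8-character field width, the EBCDIC code page (037) and the function names are all assumptions, not taken from the SMF 30 mapping.

```python
# Hypothetical helpers; the field width and code page are assumptions.
FIELD_WIDTH = 8
UNDETERMINED = "?" * FIELD_WIDTH


def decode_program_name(raw: bytes) -> str:
    """Decode a raw EBCDIC program-name field (code page 037 assumed)."""
    return raw.decode("cp037")


def classify_program_name(name: str) -> str:
    """Return 'no-cpu', 'undetermined' or 'named' for a decoded field."""
    if name.strip() == "":
        # All blanks: the address space used no CPU.
        return "no-cpu"
    if name == UNDETERMINED:
        # '????????': the program could not be determined.
        return "undetermined"
    return "named"
```

Anything classified as "named" can then be carried forward, trailing blanks stripped, into whatever analysis you do next.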
Let's return to CICS: Consider the following diagram
This depicts a CICS region, though not a wholly typical one. As depicted, and as I would expect for most CICS regions, the QR TCB is the biggest. I don't know whether the program name will actually be "DFHSIP" but I would expect it to be mnemonic and it'll probably start with "DFH". If this is right we have a ready way in Type 30 to figure out how big the QR TCB is and therefore whether it is an impending constraint. And we can do this without creating CICS Statistic Trace records.[2] I mentioned Type 30 records and QR TCB in "He Picks On CICS" without a solution to the question of how to distinguish QR TCB from the rest.
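If that guess about the program name holds, the check could be as simple as the following Python sketch. It is speculative: the "DFH" prefix test reflects the expectation above rather than anything observed, the 70% threshold is an arbitrary placeholder, and the function and parameter names are hypothetical.

```python
def looks_like_qr_constraint(program_name: str,
                             highest_task_pct: float,
                             threshold_pct: float = 70.0,
                             cics_prefix: str = "DFH") -> bool:
    """Flag an interval where the biggest task looks like a busy QR TCB.

    Assumptions: a program name starting with 'DFH' means the biggest
    task is the QR TCB, and threshold_pct is a locally chosen warning
    level, not a documented limit.
    """
    is_cics_task = program_name.strip().upper().startswith(cics_prefix)
    return is_cics_task and highest_task_pct >= threshold_pct
```

Run against every interval record for a region, this would give a quick list of the intervals worth a closer look.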
The diagram also shows a File-Control TCB (think "VSAM"), three MQ TCBs and four DB2 TCBs. A typical region wouldn't have all these doing much, if indeed they were present. And showing this level of evenness would, I'd hazard, be unusual.
For CICS regions with a heavy DB2 component, for example, the QR TCB might not be the biggest TCB.[3] In this case we'll see a different program name and we can provide an upper bound on the QR TCB %. We'd do this by subtracting the biggest TCB (whatever that is) from the headline TCB time - also in the Type 30 (with some adjustments to make the maths right).[4]
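The subtraction can be sketched in Python as below. This is a rough illustration under stated assumptions: that the new percentage is relative to one engine over the interval, that the headline TCB time is in seconds for the same interval, and that none of the adjustments mentioned above are needed. All names are hypothetical. The sketch also caps the result at the biggest task's percentage, on the reasoning that a QR TCB which isn't the biggest can't exceed the biggest.

```python
def headline_tcb_pct(tcb_seconds: float, interval_seconds: float) -> float:
    """Headline TCB time as a percentage of one engine over the interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return 100.0 * tcb_seconds / interval_seconds


def qr_tcb_upper_bound_pct(tcb_seconds: float,
                           interval_seconds: float,
                           highest_task_pct: float) -> float:
    """Upper bound on the QR TCB % when some other task is the biggest.

    Assumes highest_task_pct is a percentage of one engine over the
    same interval; real records may need adjustments not shown here.
    """
    headline = headline_tcb_pct(tcb_seconds, interval_seconds)
    # Whatever the biggest task didn't use is all that's left for QR.
    remainder = headline - highest_task_pct
    # QR is not the biggest task, so it can't exceed the biggest either.
    return max(0.0, min(highest_task_pct, remainder))
```

For example, 540 seconds of TCB time in a 900-second interval is a headline of 60%. If the biggest task accounts for 35%, the QR TCB can be no more than 25%.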
Of course CICS PA and the standard DFHSTUP (CICS Statistics Utility Program), which prints CICS TCB percentages at the subsystem level, do far more than just reporting CPU at the transaction instance and region level. But sometimes all you need is to figure out if the QR TCB is a vulnerability. In fact if you do have a region that needs work you probably would drill down (using something like CICS PA).
The above would use the Interval records (subtypes 2 and 3) and the same approach could be used with any long-running address space.[5] But there's value for batch jobs - using subtypes 4 and 5. Admittedly most jobs are single-tasking. But not all are: for example, DB2 Utilities can be significant multitaskers.[6] So there is some value in finding the biggest TCB (and subtracting from the "headline" TCB number): you can better assess the benefit of faster engines (or understand which job steps are susceptible to engines not getting much faster).
So, I'm really looking forward to seeing real customer data - which I'm convinced will be very interesting. I can't believe it'll be long before I see some. And when I do I'll write some more about it.