Batch Capacity Planning, Part 1 - CPU
MartinPacker
It's been a week since the following was posted in IBM-MAIN: Batch Capacity Planning - BWATOOL? So far there's been no reply. Though a little disappointed, I'm not surprised. "Disappointed" as I was looking for a good debate (even though it wasn't me who asked the question). "Not surprised" as I think the subject of Batch Capacity Planning is a tough one. The original post prompted me to think about posting on the subject. I think I'll do it in two parts:
An obvious place to start is by comparing and contrasting Batch with Online, from the Capacity Planning point of view. This, of course, builds on the Performance / Window perspective:
Those contrasts aren't exhaustive but they are enough. We'll use them to inform the rest of this post.
But there is a similarity that's worth articulating: for both Batch and Online, "enough CPU" means "what gets the job done". If Online work fails to meet Service Level Agreements / Expectations / Pious Hopes or whatever, you conclude something has to be done. Similarly, if important Batch fails to meet its business goals there's pressure to do something. (This post isn't going to go into the business drivers or the shape of SLAs.)
When I look at the CPU Utilisation for a Batch Window I typically see a huge amount of variability, both within the night and from night to night. This, I surmise, is caused by the "big lumps" characteristic above. And if you try to figure out which job caused a spike it's hard to do automatically - because of the "interval straddling" characteristic above. But usually it's fairly obvious - if the number of jobs running is not too large - which job is likely to have caused a spike.
It also helps if you have a decent WLM Batch service class (and, hopefully, reporting class) scheme: You can identify which service class caused the spike. And thence the list of candidate jobs could be shorter.
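To make the "shorter list of candidate jobs" idea concrete, here is a minimal Python sketch. It assumes you already have per-job start time, end time and service class extracted from your job-end data; the record layout, job names and service class names below are all hypothetical. It simply lists the jobs that were running at some point during a spike interval, optionally restricted to one service class.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class JobRun:
    # Hypothetical record layout; real job-end data carries far more fields.
    name: str
    service_class: str
    start: datetime
    end: datetime


def candidate_jobs(
    jobs: List[JobRun],
    spike_start: datetime,
    spike_end: datetime,
    service_class: Optional[str] = None,
) -> List[str]:
    """Return names of jobs whose run time overlaps the spike interval."""
    hits = []
    for job in jobs:
        # Two time ranges overlap if each starts before the other ends.
        overlaps = job.start < spike_end and job.end > spike_start
        if overlaps and (service_class is None or job.service_class == service_class):
            hits.append(job.name)
    return hits


# Made-up jobs and a made-up 15-minute spike interval.
jobs = [
    JobRun("PAYROLL1", "BATCHHI", datetime(2024, 1, 1, 1, 5), datetime(2024, 1, 1, 1, 50)),
    JobRun("EXTRACT7", "BATCHLO", datetime(2024, 1, 1, 1, 20), datetime(2024, 1, 1, 1, 40)),
    JobRun("REPORT03", "BATCHHI", datetime(2024, 1, 1, 2, 10), datetime(2024, 1, 1, 2, 30)),
]
spike = (datetime(2024, 1, 1, 1, 30), datetime(2024, 1, 1, 1, 45))

print(candidate_jobs(jobs, *spike))
print(candidate_jobs(jobs, *spike, service_class="BATCHHI"))
```

This only narrows the field. Overlapping the spike doesn't prove a job caused it, so you'd still want to look at each candidate's CPU time.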
WLM setup helps in another way: Assuming you have a sensible hierarchy of Batch service classes you can establish whether the supp
I think you have to accept that some degree of delay is inevitable at times with spiky work like Batch, even for the "top dog" Batch service classes. The question is "how much?" If you calculate WLM velocity for these service classes over a long enough interval, on a night when the window's work only just completes, maybe that gives you a useful metric and threshold: when the velocity drops below that level, the window's work might fail to get done in time.
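As a rough illustration of calculating velocity over a longer interval, here is a small Python sketch. It uses the usual definition of WLM execution velocity, Using samples divided by Using plus Delay samples, as a percentage. Which sample categories count depends on your WLM setup. The sample counts and the threshold value are made up. The one design point worth noting is that it sums the samples across the whole window before dividing, rather than averaging per-interval velocities, so busy intervals carry their proper weight.

```python
from typing import List, Optional, Tuple


def velocity(using_samples: int, delay_samples: int) -> Optional[float]:
    """Execution velocity as a percentage: using / (using + delay) * 100."""
    total = using_samples + delay_samples
    if total == 0:
        return None  # no samples at all, so velocity is undefined
    return 100.0 * using_samples / total


def window_velocity(intervals: List[Tuple[int, int]]) -> Optional[float]:
    """Velocity over a whole window: sum the samples first, then divide."""
    using = sum(u for u, _ in intervals)
    delay = sum(d for _, d in intervals)
    return velocity(using, delay)


def below_threshold(intervals: List[Tuple[int, int]], threshold: float) -> bool:
    """True if the window's velocity fell below a threshold you have learned."""
    v = window_velocity(intervals)
    return v is not None and v < threshold


# Hypothetical (using, delay) sample counts per interval for one service class.
night = [(50, 10), (400, 400), (30, 0)]

print(window_velocity(night))
print(below_threshold(night, threshold=60.0))  # 60.0 is an invented threshold
```

The threshold itself has to come from experience: the velocity you saw on nights when the window's work only just got done.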
I appreciate the previous paragraph is a little vague: It's trying to impart an approach to learning how your Batch works - from the CPU perspective. The "inevitable at times" phrase might be a little controversial: Certainly if your Online day drives the CPU capacity requirement you stand less of a chance of seeing CPU delays of any note in the Batch. But for many installations that's not true: The Batch drives the CPU requirement (or, in some cases, drives the Rolling 4-Hour Average and hence the software bill).
I haven't used the "job network" characteristic in this post yet. So here are a couple of areas I think it plays in:
These two are related, I think. And they're both about topology.
A couple of other things:
I'll admit this whole area is a tough one. I'd be interested in what customers do for Batch Capacity Planning - or indeed whether it drives their overall plan.
And soon I'll write about Memory from the Batch Capacity Planning perspective.
* When I say "squint at it" I mean "use a technique that takes the spiky detail out, leaving an overall (if blurred) picture". I've used the term for many years. People don't look at me oddly when I say it so I assume they know what I mean.
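One possible way to "squint", offered only as a sketch: a centred moving average over the per-interval CPU utilisation figures. This is one smoothing technique among many and not necessarily the one used for the charts behind this post. The function name, window size and utilisation numbers are all invented for illustration.

```python
from typing import List


def squint(values: List[float], window: int = 5) -> List[float]:
    """Centred moving average; the window shrinks at the edges of the data."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


# Made-up CPU utilisation percentages for consecutive intervals in a window.
cpu = [20, 95, 15, 30, 90, 25, 20]
print(squint(cpu, window=3))
```

A wider window blurs more. The trade-off is that you lose the very spikes that tell you how lumpy the work is, so it's worth keeping the raw series alongside the smoothed one.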