I'm a well-known mainframe performance guy, with almost 30 years of experience helping customers manage systems. I also dabble in lots of other technology. I've sought to widen the Performance role, incorporating aspects of infrastructural architecture.

Suppose you have a set of numbers S over which you define a function f.
Further suppose you partition S into S_{1} and S_{2}.
I’d like to know which function g satisfies g({f(S_{1}),f(S_{2})}) = f(S).

As much to the point, I’d like to understand which functions f even have a corresponding function
g that meets the condition.

Whoa! Was that pretentious enough for you?

Let me start again…

When I’ve talked about cloning batch jobs, one of the problems to solve is what I call “fanning
back in again”.

If you split the data into, say, 2 equal subsets and run it through 2 cloned jobs in parallel,
you have to take any “report file” and recreate it from whatever you could coax these clones to create.

(Maybe you should reread that first paragraph now.)

Let’s work through a simple example…

The clones work against subsets of records in file S we’ll call S_{1} and S_{2}.

Originally the function f created a report from S - simply totalling the value in a field of
each record.

Then f(S_{1}) sums that field over some of the records and f(S_{2}) sums it over the others.
You can probably guess that g is just adding the two together.
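As a sketch - in Python rather than the COBOL or PL/I the real jobs would use, and with a made-up `amount` field - the sum case looks like this:

```python
# f totals a field over a set of records; g fans the clone outputs back in.
# For totals, g is just addition - of however many partial results there are.

def f(records):
    """The report function: total the field of interest."""
    return sum(r["amount"] for r in records)

def g(partial_results):
    """The fan-in function: recombine the clones' totals."""
    return sum(partial_results)

S = [{"amount": v} for v in (10, 20, 30, 40)]
S1, S2 = S[:2], S[2:]              # the two clones' subsets

assert g([f(S1), f(S2)]) == f(S)   # the condition from the opening paragraph
```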

As another example, suppose f calculated an average of that field.

In this case recreating the average is merely a matter of counting the elements in S_{1} and S_{2}
and using those counts to weight the two subset averages into the overall average.
(In fact just summing the values in S_{1} and S_{2} and dividing by the overall count would do
just as well, but that involves changing the function f - which you’d probably prefer not to do in general.)
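In sketch form (Python again, field name made up): each clone’s f reports its average together with its count, and g weights the averages by the counts:

```python
# The count is the extra information carried forward so the fan-in
# step can weight each clone's average correctly.

def f(records):
    vals = [r["amount"] for r in records]
    return (sum(vals) / len(vals), len(vals))   # (average, count)

def g(partials):
    count = sum(n for _, n in partials)
    total = sum(avg * n for avg, n in partials)
    return (total / count, count)

S = [{"amount": v} for v in (10, 20, 30, 40)]
S1, S2 = S[:1], S[1:]              # deliberately unequal subsets

assert g([f(S1), f(S2)]) == f(S)
```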

Calculating a maximum, standard deviation, or mode gives three more examples where it’s almost as simple
as for the mean - though each needs something different carried forward: the subset maxima for the
maximum, counts with sums and sums of squares for the standard deviation, and full frequency counts
for the mode.
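Standard deviation, for instance: each clone carries forward (count, sum, sum of squares), g adds those component-wise, and a final step turns the merged summary into the number on the report. A Python sketch (population standard deviation, illustrative names):

```python
import math

def f(records):
    """Per-clone summary: (count, sum, sum of squares) of the field."""
    vals = [r["amount"] for r in records]
    return (len(vals), sum(vals), sum(v * v for v in vals))

def g(partials):
    """Fan-in: add the summaries component-wise."""
    n, s, ss = map(sum, zip(*partials))
    return (n, s, ss)

def finish(summary):
    """Turn the merged summary into the standard deviation."""
    n, s, ss = summary
    return math.sqrt(ss / n - (s / n) ** 2)

S = [{"amount": v} for v in (2, 4, 4, 4, 5, 5, 7, 9)]
S1, S2 = S[:3], S[3:]

assert g([f(S1), f(S2)]) == f(S)
assert finish(f(S)) == 2.0
```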

One feature of all of these is the need to carry forward information into some “fan-in” job step.
In some of them it’s extra information - such as the counts for the mean.
In others it’s the original information - such as the subtotals in the summing case.

What I’d like to do is think about how one figures out whether such a fan-in is even possible.
I’m sure this isn’t a particularly new problem - any “divide and conquer” algorithm since time immemorial
has had to deal with it.
(I’m sure Hadoop has to deal with this, but we’re dealing with COBOL and PL/I here.)

And in Paragraph 1 I actually simplified it:
The “composition function” g should be designed to cope with an arbitrary number of subsets of S - as
we’re going to have to deal with 2-up, 4-up, 8-up cloning.
It would be a real pity if the function only worked on pairs, so that 4-up required 3 applications, for
example.
It should be a single, sufficiently general function that allows the application to be readily cloned to
whatever degree of parallelism is required.
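Put another way, g should reduce a list of partial results of any length rather than combine exactly two. A sketch of what that buys you (Python, made-up names):

```python
# If g accepts a whole list of partial results, the same fan-in step
# serves 2-up, 4-up, and 8-up cloning with no change.

def f(records):
    return sum(r["amount"] for r in records)

def g(partials):
    return sum(partials)        # works for any number of clones

S = [{"amount": v} for v in range(1, 17)]    # 16 records

# 2-up, 4-up and 8-up cloning all fan back in with the same g:
for n_clones in (2, 4, 8):
    size = len(S) // n_clones
    subsets = [S[i * size:(i + 1) * size] for i in range(n_clones)]
    assert g([f(sub) for sub in subsets]) == f(S)
```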

The whole thing is, of course, simplified:
No report ever just plonks a single number on a page.
(Unless that number is 42.)
Ultimately, though, you can break the problem down into a bunch of these simpler subproblems.
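For instance - a sketch with made-up field names - a report line carrying a total, a count, and a maximum is just three of these subproblems run side by side, each with its own fan-in:

```python
# Each field of the report line is its own subproblem, with its own
# carried-forward information and its own fan-in rule.

def f(records):
    vals = [r["amount"] for r in records]
    return {"total": sum(vals), "count": len(vals), "max": max(vals)}

def g(partials):
    return {
        "total": sum(p["total"] for p in partials),
        "count": sum(p["count"] for p in partials),
        "max": max(p["max"] for p in partials),
    }

S = [{"amount": v} for v in (3, 1, 4, 1, 5, 9)]
S1, S2 = S[:3], S[3:]

merged = g([f(S1), f(S2)])
assert merged == f(S)
# An average for the report is then merged["total"] / merged["count"].
```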

But if we are going to clone processing steps this is the kind of question that we’re going
to have to answer: “Can we clone a job and still get the right results?”

And to finish, here’s a nice pretty picture. I might even make it into a slide or two.