Partitioning

In the simplest scenario, you probably do not need to be concerned with how your data is partitioned.

It is enough that it is partitioned and that the job runs faster. In these circumstances you can safely delegate responsibility for partitioning to InfoSphere® DataStage®. Once you have identified where you want to partition data, InfoSphere DataStage will work out the best method for doing it and implement it.

The aim of most partitioning operations is to end up with a set of partitions that are as close to equal in size as possible, ensuring an even load across your processors.
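To make the idea of evenly sized partitions concrete, here is a minimal Python sketch. It is not InfoSphere DataStage code: the function name `round_robin_partition` and the sample data are hypothetical, and dealing rows out in turn is only one simple way to balance partition sizes.

```python
def round_robin_partition(rows, num_partitions):
    """Deal rows out to partitions in turn, like dealing cards."""
    partitions = [[] for _ in range(num_partitions)]
    for index, row in enumerate(rows):
        # Row 0 goes to partition 0, row 1 to partition 1, and so on,
        # wrapping back to partition 0 after the last partition.
        partitions[index % num_partitions].append(row)
    return partitions


# Hypothetical input: ten rows spread over four partitions.
rows = list(range(10))
partitions = round_robin_partition(rows, 4)
sizes = [len(partition) for partition in partitions]
print(sizes)  # [3, 3, 2, 2]
```

With this approach the partition sizes never differ by more than one row, regardless of the values in the data, which is what keeps the load even across processors.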

When performing some operations, however, you need to take control of partitioning to ensure that you get consistent results. A good example is using an aggregator stage to summarize your data. To get correct results, you must ensure that related data is grouped together in the same partition before the summary operation is performed on that partition. Otherwise, a group whose rows are spread across several partitions produces a separate partial summary in each one. InfoSphere DataStage lets you control partitioning in this way.
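The following Python sketch illustrates why related rows must share a partition before a summary runs. It is an illustration under stated assumptions, not InfoSphere DataStage code: the helper names `hash_partition` and `sum_by_key`, the column names `customer` and `amount`, and the use of a CRC32 checksum as the hash function are all hypothetical choices made for this example.

```python
import zlib
from collections import defaultdict


def hash_partition(rows, key, num_partitions):
    """Place each row in a partition chosen from a hash of its key value."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # crc32 gives the same result on every run, so equal key values
        # always land in the same partition.
        index = zlib.crc32(str(row[key]).encode("utf-8")) % num_partitions
        partitions[index].append(row)
    return partitions


def sum_by_key(partition, key, value):
    """Summarize one partition on its own: total of `value` per `key`."""
    totals = defaultdict(int)
    for row in partition:
        totals[row[key]] += row[value]
    return dict(totals)


# Hypothetical input rows.
rows = [
    {"customer": "A", "amount": 10},
    {"customer": "B", "amount": 5},
    {"customer": "A", "amount": 7},
    {"customer": "C", "amount": 3},
    {"customer": "B", "amount": 1},
]

partitions = hash_partition(rows, "customer", 2)
per_partition_totals = [sum_by_key(p, "customer", "amount") for p in partitions]

# Each customer appears in exactly one partition, so the per-partition
# results can be combined without adding partial totals together.
merged = {}
for totals in per_partition_totals:
    merged.update(totals)
print(merged)
```

Because every row for a given customer lands in the same partition, each per-partition total is already the final total for that customer. If the rows were instead distributed without regard to the key, customer A's two rows could end up in different partitions and each partition would report only part of A's total.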

There are a number of different partitioning methods available. Note that all of these descriptions assume you are starting with sequential data. If you are repartitioning data that is already partitioned, there are some specific considerations: