Specifying a partitioning method

The framework performs partitioning on the input data set or data sets of a parallel operator.

Partitioning is the process of dividing an input data set into multiple segments, or partitions. Each processing node in your system then performs an operation on an individual partition of the data set rather than on the entire data set. Typically, an operator uses the same partitioning method for each input data set; however, you can also use a different method for each input.

The following figure shows an example of a two-input, one-output operator. Each input in this figure uses its own partitioning method:

Figure 1. Two-input, one-output operator

InfoSphere® DataStage® provides a number of different partitioning methods, including:

any: The operator does not care how data is partitioned; therefore, InfoSphere DataStage can partition the data set in any way to optimize the performance of the operator. Any is the default partitioning method. Operators that use a partitioning method of any allow the user to explicitly override the partitioning method. To set a partitioning method, the user assigns a partitioner to a data set used as an input to the operator. The framework then partitions the data set accordingly.
round robin: This method divides the data set so that the first record goes to the first node, the second record goes to the second node, and so on. When the last node in the system is reached, this method starts back at the first node. This method is useful for resizing the partitions of an input data set that are not equal in size. The round robin method always creates approximately equal-sized partitions.
random: This method randomly distributes records across all nodes. Like round robin, random partitioning can rebalance the partitions of an input data set to guarantee that each processing node receives an approximately equal-sized partition.
same: This method performs no repartitioning on the data; the partitions of the previous operator are inherited. This partitioning method is often used within composite operators.
entire: Every instance of an operator on every processing node receives the complete data set as input. This form of partitioning is useful when you want the benefits of parallel execution, but you also want each instance of the operator to be able to access the entire input data set.
hash by field: Uses a field in a record as a hash key and partitions the records based on a function of this hash key. You use the class APT_HashPartitioner to implement this partitioning method.
modulus: Partitioning is based on a key field modulo the number of partitions. This method is like hash by field, but involves simpler computation.
range: Divides a data set into approximately equal size partitions based on one or more partitioning keys. You use the APT_RangePartitioner class to implement this partitioning method.
Db2: Partitions an input data set in the same way that Db2 partitions it. For example, if you use this method to partition an input data set containing update information for an existing Db2 table, records are assigned to the processing node containing the corresponding Db2 record. Then, during the execution of the parallel operator, both the input record and the Db2 table record are local to the processing node. Any reads and writes of the Db2 table entails no network activity.
other: You can define a custom partitioning operator by deriving a class from the C++ APT_Partitioner class. Other is the partitioning method for operators that use custom partitioners.

By default, operators use the partitioning method any. The any partitioning method allows operator users to prefix the operator with a partitioning operator to control partitioning. For example, a user could insert the hash partitioner operator in the stage before the derived operator.

To set an explicit partitioning method for the operator that cannot be overridden, you must include a call to APT_Operator::setPartitionMethod() within APT_Operator::describeOperator().

Another option is to define your own partitioner for each input to the operator. To do so, you must derive a partitioner class from APT_Partitioner.