The partitioning library

The partitioning library is a set of related operators that partition your data.

The partitioning operators are not separate InfoSphere® DataStage® stages, but rather appear as options on the Advanced tab of stage Input pages.

By default, InfoSphere DataStage inserts partition and sort operators in your data flow to meet the partitioning and sorting needs of your job.

Use the partitioners described in this topic when you want to explicitly control the partitioning and sorting behavior of an operator. You can also create a custom partitioner using the C++ API.

The partitioning library contains the following seven partitioners:

  • The entire partitioner. Every instance of an operator on every processing node receives the complete data set as input. It is useful when you want the benefits of parallel execution but every instance of the operator needs access to the entire input data set.
  • The hash partitioner. Records are partitioned based on a function of one or more fields (the hash partitioning keys) in each record.
  • The modulus partitioner. This partitioner assigns each record of an input data set to a partition of its output data set as determined by the value of a specified key field modulo the number of partitions.
  • The random partitioner. Records are randomly distributed across all processing nodes. Like roundrobin, random partitioning can rebalance the partitions of an input data set so that each processing node receives an approximately equal-sized partition.
  • The range partitioner. This partitioner divides a data set into approximately equal-sized partitions based on one or more partitioning keys. It is used with one of the following:
    • The writerangemap operator. This operator takes an input data set produced by sampling and partition sorting a data set and writes it to a file in a form usable by the range partitioner. The range partitioner uses the sampled and sorted data set to determine partition boundaries.
    • The makerangemap utility, which determines the approximate range of a data set by sampling the set.
  • The roundrobin partitioner. The first record goes to the first processing node, the second to the second processing node, and so on. When InfoSphere DataStage reaches the last processing node in the system, it starts over. This method is useful for resizing partitions of an input data set that are not equal in size.
  • The same partitioner. No repartitioning is done. With this partitioning method, records stay on the same processing node.
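
The partitioning methods listed above can be illustrated with a short, standalone Python sketch. This is not InfoSphere DataStage code and does not use the C++ API: the function names (entire, hash_partition, modulus, roundrobin, range_boundaries, range_partition) are hypothetical, records are modeled as plain dictionaries, and CRC32 stands in for whatever hash function the product actually uses. The range example assumes that boundaries are picked at evenly spaced positions in a sorted sample of key values, which is one simple way to approximate what a range map provides.

```python
import bisect
import zlib


def entire(records, n):
    """Every one of the n partitions receives the complete data set."""
    return [list(records) for _ in range(n)]


def hash_partition(records, n, key):
    """Partition index is a function of the key field.

    CRC32 is an assumption made for this sketch, chosen because it is
    deterministic across runs; it is not the product's hash function.
    """
    parts = [[] for _ in range(n)]
    for rec in records:
        digest = zlib.crc32(str(rec[key]).encode("utf-8"))
        parts[digest % n].append(rec)
    return parts


def modulus(records, n, key):
    """Partition index is the integer key value modulo n."""
    parts = [[] for _ in range(n)]
    for rec in records:
        parts[rec[key] % n].append(rec)
    return parts


def roundrobin(records, n):
    """First record to partition 0, second to partition 1, then wrap."""
    parts = [[] for _ in range(n)]
    for i, rec in enumerate(records):
        parts[i % n].append(rec)
    return parts


def range_boundaries(sample_keys, n):
    """Pick n - 1 boundaries from a non-empty sample of key values."""
    ordered = sorted(sample_keys)
    return [ordered[(i * len(ordered)) // n] for i in range(1, n)]


def range_partition(records, boundaries, key):
    """Place each record in the key range that contains its key value."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for rec in records:
        parts[bisect.bisect_right(boundaries, rec[key])].append(rec)
    return parts


if __name__ == "__main__":
    data = [{"id": i} for i in range(10)]
    print([len(p) for p in modulus(data, 3, "id")])
    print([len(p) for p in roundrobin(data, 3)])
    bounds = range_boundaries([r["id"] for r in data], 2)
    print([len(p) for p in range_partition(data, bounds, "id")])
```

In this sketch, hash and modulus both send records with the same key value to the same partition, whereas roundrobin ignores record contents and depends only on arrival order. The same partitioner needs no function at all, because records simply remain in the partition they already occupy.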