IBM InfoSphere DataStage and InfoSphere QualityStage, Version 8.5

Modulus partitioner

Partitioning is based on a key column modulo the number of partitions. This method is similar to hash by field, but involves simpler computation.

In data mining, data is often arranged in buckets, that is, each record has a tag containing its bucket number. You can use the modulus partitioner to partition the records according to this number. The modulus partitioner assigns each record of an input data set to a partition of its output data set as determined by a specified key field in the input data set. This field can be the tag field.

The partition number of each record is calculated as follows:

partition_number = fieldname mod number_of_partitions

where:

fieldname is a numeric field of the input data set.
number_of_partitions is the number of processing nodes on which the partitioner executes. For example, a partitioner that runs on three processing nodes has three partitions.
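
The calculation above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the product's implementation; the function name modulus_partition is hypothetical, and the handling of negative key values is an assumption, because this topic does not describe it.

```python
def modulus_partition(key: int, number_of_partitions: int) -> int:
    """Return the partition number for an integer key (illustrative only)."""
    if number_of_partitions <= 0:
        raise ValueError("number_of_partitions must be positive")
    # For a positive divisor, Python's % operator always yields a value in
    # the range 0 .. number_of_partitions - 1. Whether the partitioner
    # treats negative keys the same way is an assumption.
    return key % number_of_partitions


# With three partitions, a key of 10 lands in partition 1 (10 = 3 * 3 + 1).
print(modulus_partition(10, 3))
```

Because the result depends only on the key value and the partition count, records with the same key value always go to the same partition.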

In this example, the modulus partitioner partitions a data set containing ten records. Four processing nodes run the partitioner, so the modulus partitioner divides the data among four partitions. The input data set has the following schema:

Table 1. Input data
Column name	SQL type
bucket	Integer
date	Date

The bucket column is specified as the key field on which the modulus operation is calculated.

Here is the input data set. Each line represents a row:

Table 2. Input data set
bucket	date
64123	1960-03-30
61821	1960-06-27
44919	1961-06-18
22677	1960-09-24
90746	1961-09-15
21870	1960-01-01
87702	1960-12-22
4705	1961-12-13
47330	1961-03-21
88193	1962-03-12

The following table shows the output data set divided among four partitions by the modulus partitioner.

Table 3. Output data set
Partition 0	Partition 1	Partition 2	Partition 3
(no rows)	61821 1960-06-27	21870 1960-01-01	64123 1960-03-30
	22677 1960-09-24	87702 1960-12-22	44919 1961-06-18
	4705 1961-12-13	47330 1961-03-21
	88193 1962-03-12	90746 1961-09-15

Here are three sample modulus operations corresponding to the values of three of the key fields:

22677 mod 4 = 1; the data is written to Partition 1.
47330 mod 4 = 2; the data is written to Partition 2.
64123 mod 4 = 3; the data is written to Partition 3.

None of the key field values is evenly divisible by 4, so no data is written to Partition 0.
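
The whole example can be reproduced with a short Python sketch that applies the modulus operation to the ten input rows. It is a rough illustration under stated assumptions, not how the parallel engine distributes data: the function name partition_records is hypothetical, and the order of rows within each partition follows the input order, which can differ from the order shown in the output table.

```python
NUMBER_OF_PARTITIONS = 4

# The ten input rows from the example: (bucket, date).
records = [
    (64123, "1960-03-30"),
    (61821, "1960-06-27"),
    (44919, "1961-06-18"),
    (22677, "1960-09-24"),
    (90746, "1961-09-15"),
    (21870, "1960-01-01"),
    (87702, "1960-12-22"),
    (4705, "1961-12-13"),
    (47330, "1961-03-21"),
    (88193, "1962-03-12"),
]


def partition_records(rows, number_of_partitions):
    """Group rows by bucket mod number_of_partitions (illustrative only)."""
    partitions = {number: [] for number in range(number_of_partitions)}
    for bucket, date in rows:
        partitions[bucket % number_of_partitions].append((bucket, date))
    return partitions


partitions = partition_records(records, NUMBER_OF_PARTITIONS)
for number, rows in partitions.items():
    buckets = [bucket for bucket, _ in rows]
    print(f"Partition {number}: {len(rows)} rows {buckets}")
```

Partition 0 stays empty, Partitions 1 and 2 receive four rows each, and Partition 3 receives two. The uneven split shows that the balance of a modulus partitioning depends entirely on how the key values are distributed.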

This topic is also in the IBM InfoSphere DataStage and QualityStage Parallel Job Developer's Guide.

Update timestamp

Last updated: 2012-10-8