IBM InfoSphere DataStage and InfoSphere QualityStage, Version 8.5

Example: using a parallel partition sort operator

A parallel psort operator executes on multiple processing nodes in your system to sort the records within each partition of a data set. To run the psort operator in parallel, you must specify a partitioning method for the operator. Specifying the partitioning method is what configures the operator to run in parallel.

Choose a partitioning method that is correct for the sorting operation. For example, assume that you are sorting records in a data set based on the last-name field of each record. If you randomly allocate records into any partition, records with similar last names are not guaranteed to be in the same partition and are not, therefore, processed by the same node. Similar records can be sorted by an operator only if they are in the same partition of the data set.
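The difference between content-blind and key-based partitioning can be illustrated with a minimal Python sketch. It is not DataStage code: the partition count, the helper names, and the use of `zlib.crc32` as the hash function are all assumptions made for illustration, and round-robin dealing stands in for random allocation so that the result is repeatable.

```python
import zlib

NUM_PARTITIONS = 3  # hypothetical number of processing nodes


def round_robin(records, n):
    """Deal records to partitions in arrival order, ignoring their content."""
    parts = [[] for _ in range(n)]
    for i, rec in enumerate(records):
        parts[i % n].append(rec)
    return parts


def hash_by_key(records, n):
    """Send each record to the partition selected by a hash of its value."""
    parts = [[] for _ in range(n)]
    for rec in records:
        # crc32 is a stand-in hash, not the function DataStage uses.
        parts[zlib.crc32(rec.encode()) % n].append(rec)
    return parts


def partitions_holding(parts, name):
    """Return the indexes of the partitions that contain the given name."""
    return [i for i, part in enumerate(parts) if name in part]


last_names = ["Smith", "Jones", "Smith", "Lee", "Smith", "Jones"]

rr = round_robin(last_names, NUM_PARTITIONS)
hp = hash_by_key(last_names, NUM_PARTITIONS)

print("round robin:", partitions_holding(rr, "Smith"))
print("hash:       ", partitions_holding(hp, "Smith"))
```

With this input, round-robin dealing spreads the three "Smith" records over all three partitions, whereas the hash-based split places them in a single partition, where one node can compare them.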

A better method of partitioning data in this case would be to hash the records by the first five or six characters of the last name. All records containing similar names would be in the same partition and, therefore, would be processed by the same node. The psort operator could then compare the entire last names, first names, addresses, or other information in the records, to determine the sorting order of the records.

For example:

record ( fname:string[30]; lname:string[30]; )
... | modify -spec "lname_hash:string[6] = substring[0,6](lname)"
    | hash -key lname_hash
    | tsort -key lname | ...
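The three steps of this pipeline (derive a six-character key, hash on that key, sort on the full last name) can be mimicked outside DataStage in a short Python sketch. The sample records, the partition count, and the `zlib.crc32` hash are assumptions chosen for illustration. They do not reproduce the behavior of the actual modify, hash, or tsort operators.

```python
import zlib

NUM_PARTITIONS = 4  # hypothetical number of processing nodes


def prefix_key(lname, length=6):
    """Derive the hash key: the first `length` characters of the last name."""
    return lname[:length]


def partition_by_prefix(records, n):
    """Assign each record to a partition by hashing its last-name prefix."""
    parts = [[] for _ in range(n)]
    for rec in records:
        key = prefix_key(rec["lname"])
        # crc32 is a stand-in hash, not the function DataStage uses.
        parts[zlib.crc32(key.encode()) % n].append(rec)
    return parts


records = [
    {"fname": "Ann", "lname": "Anderson"},
    {"fname": "Bob", "lname": "Andersen"},
    {"fname": "Cy", "lname": "Lee"},
    {"fname": "Di", "lname": "Andersson"},
    {"fname": "Ed", "lname": "Anderson"},
]

partitions = partition_by_prefix(records, NUM_PARTITIONS)

# Sort each partition independently on the full last name.
sorted_partitions = [sorted(p, key=lambda r: r["lname"]) for p in partitions]

for i, part in enumerate(sorted_partitions):
    print(i, [r["lname"] for r in part])
```

Because "Anderson", "Andersen", and "Andersson" share the prefix "Anders", they land in one partition, and the sort can then order them by comparing the complete names.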

InfoSphere® DataStage® supplies a hash partitioner operator that allows you to hash records by one or more fields. See "The hash Partitioner" for more information on the hash operator. You can also use any one of the supplied InfoSphere DataStage partitioning methods.

The following example is a modification of the previous example, "Example: Using a Sequential Partition Sort Operator", to execute the psort operator in parallel using a hash partitioner operator. In this example, the hash operator partitions records using the integer field a, the primary sorting key. Therefore, all records containing the same value for field a are assigned to the same partition. The figure below shows this example:

[Figure: a psort operator performing a parallel sort on a data set]

To configure the psort operator in osh to execute in parallel, use the [par] annotation. The osh command line for this step is:

$ osh "hash -key a -key e < unSortedDS.ds |
       psort -key a -key e [par] > sortedDS.ds"
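The effect of this step can be approximated in a small Python sketch. Following the description above, it hashes on field a alone. The sample values, the string type of field e, the partition count, and the modulo "hash" are assumptions made for illustration and are not taken from the product.

```python
NUM_PARTITIONS = 3  # hypothetical number of processing nodes


def hash_partition(records, n):
    """Assign each (a, e) record to a partition based on field a only."""
    parts = [[] for _ in range(n)]
    for a, e in records:
        # a % n is a stand-in for the hash partitioner's real hash function.
        parts[a % n].append((a, e))
    return parts


# Each record is (a, e): a is the primary sort key, e the secondary key.
unsorted_ds = [(3, "x"), (1, "b"), (4, "k"), (1, "a"), (3, "c"), (2, "z"), (4, "a")]

partitions = hash_partition(unsorted_ds, NUM_PARTITIONS)

# Sort every partition independently, first by a and then by e.
sorted_ds = [sorted(part) for part in partitions]

for i, part in enumerate(sorted_ds):
    print(i, part)
```

Every value of a appears in exactly one partition, and each partition is ordered by a and then e. The sort applies within each partition only: reading the partitions of this sample one after another does not yield a totally ordered sequence.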

This topic is also in the IBM InfoSphere DataStage and QualityStage Parallel Job Advanced Developer's Guide.

Last updated: 2012-10-8