Join stage in DataStage

The Join stage performs join operations on two or more data sets input to the stage and then outputs the resulting data set.

The Join stage is a processing stage. It performs join operations on two or more data sets input to the stage and then outputs the resulting data set. The Join stage is one of three stages that join tables based on the values of key columns. The other two are:

The three stages differ mainly in the memory they use, the treatment of rows with unmatched keys, and their requirements for data being input (for example, whether it is sorted). See "Join Versus Lookup" for help in deciding which stage to use.

In the Join stage, the input data sets are notionally identified as the "right" set and the "left" set, and "intermediate" sets. You can specify which is which. It has any number of input links and a single output link.

The stage can perform one of four join operations:

  • Inner transfers records from input data sets whose key columns contain equal values to the output data set. Records whose key columns do not contain equal values are dropped. The stage will not compile if you try to propagate columns with the same name from different input links of an inner join. You can put a Modify or Transformer stage in front of the Join stage to rename the columns. You can also switch to a full outer join.
  • Left outer transfers all values from the left data set but transfers values from the right data set and intermediate data sets only where key columns match. The stage drops the key column from the right and intermediate data sets.
  • Right outer transfers all values from the right data set and transfers values from the left data set and intermediate data sets only where key columns match. The stage drops the key column from the left and intermediate data sets.
  • Full outer transfers records in which the contents of the key columns are equal from the left and right input data sets to the output data set. It also transfers records whose key columns contain unequal values from both input data sets to the output data set. (Full outer joins do not support more than two input links.)
Requirements for input data include the following:
  • The metadata of the columns on input links must match the values in the database table.
  • The data sets input to the Join stage must be key partitioned and sorted in ascending order. This ensures that rows with the same key column values are located in the same partition and will be processed by the same node. It also minimizes memory requirements because fewer rows need to be in memory at any one time. Choosing the auto partitioning method will ensure that partitioning and sorting is done. If sorting and partitioning are carried out on separate stages before the Join stage, IBM® DataStage® in auto mode will detect this and not repartition (alternatively you could explicitly specify the Same partitioning method).

The Join stage editor has three pages:

  • Stage. This is always present and is used to specify general information about the stage.
  • Input. This is where you specify details about the data sets being joined.
  • Output. This is where you specify details about the joined data being output from the stage.

Watch the following video for an example of how to work with the DataStage Join stage.

TThis video provides a visual method to learn the concepts and tasks in this documentation.

Input tab

The Columns section specifies the column definitions of incoming data. The Advanced section allows you to change the default buffering settings for the input link.

Output tab

The Join stage can have only one output link.

The Columns section specifies the column definitions of the data. The Maps from column input column section that appears when you click Edit in the columns section allows you to specify the relationship between the columns being input to the Join stage and the output columns. The Advanced section allows you to change the default buffering settings for the output links.