Merge stage in DataStage

The Merge stage joins tables based on the values of key columns. It can have any number of input links, a single output link, and the same number of reject links as there are update input links.

The Merge stage combines a master data set with one or more update data sets. It is a processing stage. It is one of three stages that join tables based on the values of key columns. The other two are:

The three stages differ mainly in the memory they use, the treatment of rows with unmatched keys, and their requirements for data being input (for example, whether it is sorted).

The Merge stage combines a master data set with one or more update data sets. The columns from the records in the master and update data sets are merged so that the output record contains all the columns from the master record plus any additional columns from each update record that are required. A master record and an update record are merged only if both of them have the same values for the merge key column(s) that you specify. Merge key columns are one or more columns that exist in both the master and update records.

The data sets input to the Merge stage must be key partitioned and sorted. This ensures that rows with the same key column values are located in the same partition and will be processed by the same node. It also minimizes memory requirements because fewer rows need to be in memory at any one time. Choosing the auto partitioning method will ensure that partitioning and sorting is done. If sorting and partitioning are carried out on separate stages before the Merge stage, IBM® DataStage® in auto partition mode will detect this and not repartition (alternatively you could explicitly specify the Same partitioning method).

As part of preprocessing your data for the Merge stage, you should also remove duplicate records from the master data set. If you have more than one update data set, you must remove duplicate records from the update data sets as well. See Remove Duplicates Stage (DataStage) for information about the Remove Duplicates stage.

Unlike Join stages and Lookup stages, the Merge stage allows you to specify several reject links. You can route update link rows that fail to match a master row down a reject link that is specific for that link. You must have the same number of reject links as you have update links. The Link Ordering section on the Stage tab lets you specify which update links send rejected rows to which reject links. You can also specify whether to drop unmatched master rows, or output them on the output data link.

The stage editor has three tabs:

  • Stage. This is always present and is used to specify general information about the stage.
  • Input. This is where you specify details about the data sets being merged.
  • Output. This is where you specify details about the merged data being output from the stage and about the reject links.

Input tab

The Columns section specifies the column definitions of incoming data. The Advanced section allows you to change the default buffering settings for the input link.

Output tab

The Merge stage can have only one master output link carrying the merged data and a number of reject links, each carrying rejected records from one of the update links. Choose an output link from the Output name drop down list to specify which link you want to work on.

The Columns section specifies the column definitions of the data. The Maps from column input column section that appears when you click Edit in the columns section allows you to specify the relationship between the columns being input to the Merge stage and the output columns. You can also specify how output columns are derived, that is, what input columns map onto them or how they are generated. The Advanced section allows you to change the default buffering settings for the output links.