Compare stage in DataStage

The Compare stage performs a column-by-column comparison of records in two presorted input data sets. You can restrict the comparison to specified key columns.

The Compare stage is a processing stage. It can have two input links and a single output link.

The Compare stage does not change the table definition, partitioning, or content of the records in either input data set. It transfers both data sets intact to a single output data set generated by the stage. The comparison results are also recorded in the output data set.

You can use runtime column propagation in this stage and allow IBM DataStage to define the output column schema for you at runtime. The stage outputs a data set with three columns:

  • result. Carries the code giving the result of the comparison.
  • first. A subrecord containing the columns of the first input link.
  • second. A subrecord containing the columns of the second input link.
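
The shape of the output data set can be illustrated with a minimal Python sketch. It is not DataStage code. It assumes that records are paired by position (both inputs presorted and of equal length), and the result code values `EQUAL`, `FIRST_GREATER`, and `FIRST_LESS` are hypothetical placeholders chosen for this example, as are the function names.

```python
from typing import Any, Dict, List, Optional, Sequence

# Hypothetical result codes for this sketch only; the codes the stage
# actually emits are set in its properties and are not assumed here.
EQUAL = 0
FIRST_GREATER = 1
FIRST_LESS = -1


def compare_records(
    first: Dict[str, Any],
    second: Dict[str, Any],
    keys: Optional[Sequence[str]] = None,
) -> Dict[str, Any]:
    """Compare two records column by column and wrap both in one output row."""
    # With no key columns given, compare every column of the first record.
    columns = list(keys) if keys is not None else list(first)
    result = EQUAL
    for col in columns:
        if first[col] > second[col]:
            result = FIRST_GREATER
            break
        if first[col] < second[col]:
            result = FIRST_LESS
            break
    # Both input records pass through unchanged as nested subrecords.
    return {"result": result, "first": dict(first), "second": dict(second)}


def compare_data_sets(
    first_rows: List[Dict[str, Any]],
    second_rows: List[Dict[str, Any]],
    keys: Optional[Sequence[str]] = None,
) -> List[Dict[str, Any]]:
    """Pair rows by position and compare each pair."""
    return [compare_records(a, b, keys) for a, b in zip(first_rows, second_rows)]
```

Each output row carries the comparison code alongside untouched copies of both input records, which mirrors the result, first, and second columns described above. Passing `keys=["id"]`, for example, restricts the comparison to that column while still carrying every column through.
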
If you specify the output link metadata yourself, you must define the columns carrying the data as subrecords of a parent column that you also define. IBM DataStage does not let you specify two groups of identical column names, so you make them subrecords to give them unique names such as first.col1 and second.col1. To specify the metadata, complete the following steps:
  1. Specify the parent column for the output data corresponding to the first input link, and set the SQL type to unknown.
  2. Specify the actual columns that carry your data and make them subrecords of the parent column. Name each column first.colname, for example first.col1, first.col2, and so on. Make each column a subrecord by selecting the column, selecting edit row from the shortcut menu, and specifying a level number (for example, 03) for that column. (You can speed up this process by making the first column a subrecord and using the propagate values feature to make the remaining columns subrecords of the parent column.)
  3. Specify the parent column for the output data corresponding to the second input link, and set the SQL type to unknown.
  4. Specify the actual columns that carry the data from the second input link, name them second.colname (for example, second.col1, second.col2), and make them subrecords of the parent column.
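
The column layout that these four steps produce can be sketched in Python. This is only an illustration of the naming scheme, not a DataStage API: the function names and dictionary keys are hypothetical, the SQL type of each data column is left unset because it depends on your data, and the result column is not covered because the steps above do not describe it.

```python
from typing import Dict, List, Optional


def subrecord_columns(
    parent: str, names: List[str], level: str = "03"
) -> List[Dict[str, Optional[str]]]:
    """Return a parent column followed by its subrecord columns."""
    # Steps 1 and 3: the parent column has SQL type unknown and no level number.
    rows: List[Dict[str, Optional[str]]] = [
        {"name": parent, "sql_type": "Unknown", "level": None}
    ]
    # Steps 2 and 4: each data column is prefixed with the parent name and
    # given a level number, which makes it a subrecord of the parent.
    for name in names:
        rows.append({"name": f"{parent}.{name}", "sql_type": None, "level": level})
    return rows


def output_metadata(
    first_cols: List[str], second_cols: List[str]
) -> List[Dict[str, Optional[str]]]:
    """Build the column list for both input links, in step order."""
    return subrecord_columns("first", first_cols) + subrecord_columns(
        "second", second_cols
    )
```

Calling `output_metadata(["col1", "col2"], ["col1", "col2"])` yields six uniquely named columns even though both links use the same column names, which is the reason for the parent prefixes.
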

When you double-click the Compare stage, the properties panel opens. The properties panel has three tabs:

  • Stage. This is always present and is used to specify general information about the stage.
  • Input. This is where you specify details about the two data sets being compared.
  • Output. This is where you specify details about the data being output from the stage.

Input tab

The Columns section specifies the column definitions of the incoming data. The Advanced section allows you to change the default buffering settings for the input links.

Output tab

The Columns section specifies the column definitions of the outgoing data. The Advanced section allows you to change the default buffering settings for the output link.