Difference stage

The Difference stage is a processing stage. It performs a record-by-record comparison of two input data sets, which are different versions of the same data set, designated the before and after data sets.

The Difference stage outputs a single data set whose records represent the difference between the before and after data sets. The stage assumes that the input data sets have been key-partitioned and sorted in ascending order on the key columns you specify for the Difference stage comparison. You can achieve this by using the Sort stage or by using the built-in sorting and partitioning abilities of the Difference stage.
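As a rough illustration of that precondition, the following Python sketch hash-partitions a list of records on the key columns and then sorts each partition in ascending key order. It is not how the stage is implemented. The names `partition_and_sort`, `key_of`, and `num_partitions` are hypothetical and exist only for this example.

```python
import zlib
from typing import Dict, List, Sequence, Tuple

Record = Dict[str, object]


def key_of(record: Record, keys: Sequence[str]) -> Tuple:
    """Return the tuple of key-column values for one record."""
    return tuple(record[k] for k in keys)


def partition_and_sort(
    records: List[Record], keys: Sequence[str], num_partitions: int
) -> List[List[Record]]:
    """Hash-partition records on the key columns, then sort each partition."""
    partitions: List[List[Record]] = [[] for _ in range(num_partitions)]
    for rec in records:
        # A deterministic hash of the key values, so that equal keys always
        # land in the same partition number (assumption: crc32 of the repr).
        digest = zlib.crc32(repr(key_of(rec, keys)).encode("utf-8"))
        partitions[digest % num_partitions].append(rec)
    for part in partitions:
        # Ascending order on the key columns within each partition.
        part.sort(key=lambda r: key_of(r, keys))
    return partitions
```

If the before and after data sets are both passed through the same function with the same keys and partition count, records with equal keys end up in the same partition number, which is what a per-partition comparison relies on.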

The comparison is based on a set of difference key columns. Two records are copies of one another if they have the same values for all difference keys. You can optionally specify change values (value columns) as well. If two records have identical key columns, the value columns are compared to determine whether one record is an edited copy of the other.

The Difference stage is similar, but not identical, to the Change Capture stage. The Change Capture stage is intended to be used in conjunction with the Change Apply stage; it produces a change data set that contains the changes that must be applied to the before data set to turn it into the after data set. The Difference stage, by contrast, outputs the before and after rows to the output data set, plus a code indicating whether there are differences. If the before and after data have the same column names, one data set effectively overwrites the other, so you see only one set of columns in the output. Which data set is output is controlled by the settings on the Link Order tab and the Mapping tab. If your before and after data sets have different column names, columns from both data sets are available for output, as set on the Mapping tab. Any columns that are designated as key or value columns must have the same names in both input data sets.

The stage generates an extra column, Diff, which indicates the result of each record comparison.
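The comparison described above can be sketched as a merge over two inputs that are already sorted on the key columns. This is a minimal illustration in Python, not the stage's implementation. The function name `difference`, the record layout, and the numeric values in `DIFF_CODES` are assumptions chosen for the example, and the codes your job emits may differ. For matching keys the sketch outputs the after record, which is also an arbitrary choice here.

```python
from typing import Dict, Iterator, List, Sequence, Tuple

Record = Dict[str, object]

# Illustrative result codes only (an assumption for this sketch).
DIFF_CODES = {"delete": 0, "insert": 1, "copy": 2, "edit": 3}


def difference(
    before: List[Record],
    after: List[Record],
    keys: Sequence[str],
    values: Sequence[str],
) -> Iterator[Record]:
    """Compare two key-sorted record lists and tag each output with Diff."""

    def key_of(rec: Record) -> Tuple:
        return tuple(rec[k] for k in keys)

    i = j = 0
    while i < len(before) and j < len(after):
        b, a = before[i], after[j]
        if key_of(b) < key_of(a):
            # Key appears only in the before data set.
            yield {**b, "Diff": DIFF_CODES["delete"]}
            i += 1
        elif key_of(b) > key_of(a):
            # Key appears only in the after data set.
            yield {**a, "Diff": DIFF_CODES["insert"]}
            j += 1
        else:
            # Same key: compare the value columns to tell copy from edit.
            same = all(b[v] == a[v] for v in values)
            yield {**a, "Diff": DIFF_CODES["copy" if same else "edit"]}
            i += 1
            j += 1
    # Whatever remains on one side has no partner on the other.
    for b in before[i:]:
        yield {**b, "Diff": DIFF_CODES["delete"]}
    for a in after[j:]:
        yield {**a, "Diff": DIFF_CODES["insert"]}


before_rows = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}, {"id": 3, "name": "Cy"}]
after_rows = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Rob"}, {"id": 4, "name": "Di"}]
result = list(difference(before_rows, after_rows, keys=["id"], values=["name"]))
print([(r["id"], r["Diff"]) for r in result])
# prints [(1, 2), (2, 3), (3, 0), (4, 1)]
```

Because both inputs are sorted on the keys, a single forward pass is enough: each record is read once, and an unsorted input would cause matching records to be missed.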

Figure: A Difference stage being used to compare two data sets.

The stage editor has three pages: