Remove Duplicates Stage (DataStage)
The Remove Duplicates stage takes a single sorted data set as input, removes all duplicate rows, and writes the results to an output data set.
The Remove Duplicates stage is a processing stage. It can have a single input link and a single output link.
Removing duplicate records is a common way of cleansing a data set before you perform further processing. Two rows are considered duplicates if they are adjacent in the input data set and have identical values for the key column(s). A key column is any column that you designate to be used in determining whether two rows are duplicates; the rows do not need to match in their non-key columns.
The data set input to the Remove Duplicates stage must be sorted so that all records with identical key values are adjacent. You can achieve this either by using the in-stage sort facilities available on the Partitioning tab of the Input page, or by having an explicit Sort stage feed the Remove Duplicates stage.
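The following Python sketch illustrates only the adjacency-based logic described above, not the DataStage implementation: given rows that are already sorted on the key columns, it keeps the first row of each run of rows with identical key values. The function name remove_adjacent_duplicates and the column name customer_id are assumptions made for this example.

```python
from typing import Iterable, Iterator, Sequence

def remove_adjacent_duplicates(
    rows: Iterable[dict],
    key_columns: Sequence[str],
) -> Iterator[dict]:
    """Keep the first row of each run of adjacent rows that have identical
    values in the key columns. The input must already be sorted on those
    columns; otherwise non-adjacent duplicates are not removed."""
    previous_key = object()  # sentinel that never equals a real key tuple
    for row in rows:
        current_key = tuple(row[col] for col in key_columns)
        if current_key != previous_key:
            yield row
            previous_key = current_key

# Example: rows already sorted on customer_id; the second C001 row is dropped.
rows = [
    {"customer_id": "C001", "order": 17},
    {"customer_id": "C001", "order": 42},
    {"customer_id": "C002", "order": 8},
]
print(list(remove_adjacent_duplicates(rows, ["customer_id"])))
```

Because only adjacent rows are compared, the sorting requirement is what guarantees that every duplicate is removed; this is why the stage expects sorted input.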
The stage editor has three tabs:
- Stage. This is always present and is used to specify general information about the stage.
- Input. This is where you specify details about the data set that is having its duplicates removed.
- Output. This is where you specify details about the processed data that is being output from the stage.
Input tab
The Columns section specifies the column definitions of incoming data. The Advanced section allows you to change the default buffering settings for the input link.
Output tab
The Columns section specifies the column definitions of the output data. The Maps from column input section, which appears when you click Edit in the Columns section, allows you to specify the relationship between the columns that are input to the Remove Duplicates stage and the output columns. Here, you can specify how the output columns are derived, that is, which input columns map onto them or how they are generated. The Advanced section allows you to change the default buffering settings for the output link.
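As a conceptual sketch only, the following Python fragment shows the kind of relationship such a mapping expresses: each output column is derived from an input column, here as a straight mapping with a rename. The column names and the apply_mapping helper are hypothetical and are not part of the DataStage editor.

```python
# Hypothetical mapping: each output column is paired with the input column
# it is derived from (a straight mapping, possibly with a rename).
column_mapping = {
    "cust_id": "customer_id",
    "order_no": "order",
}

def apply_mapping(row: dict, mapping: dict) -> dict:
    """Build an output row by copying each mapped input column value."""
    return {out_col: row[in_col] for out_col, in_col in mapping.items()}

print(apply_mapping({"customer_id": "C002", "order": 8}, column_mapping))
```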