Consolidating duplicate records to create the representative record

The Survive stage constructs column values from groups of related or duplicate records and stores the column values in the survive record (the best result) from each group.

This "best result" helps you to populate the columns of surviving records with the best available data. The stage implements the business and mapping rules, creates the necessary output structures for the target application, and identifies columns that do not conform to load standards.

The Survive job is the last job in the InfoSphere® QualityStage® workflow and is usually run after the One-source Match stage job. The output from the One-source Match stage, and in some cases the Two-source Match stage, becomes the source data that you use for the Survive stage.

The Survive stage accepts all basic data types (non-vector, non-aggregate) other than binary. The Survive stage accepts a single data source from any of the following groups:

database connector
flat file
data set
processing stage

The Survive stage requires one input source. If your input is the result of a match stage, you need to set up another stage (for example, a Funnel stage) to combine the master and duplicate records into one input source.

Although you do not need to process the data through the match stages before you use the Survive stage, the source data must include related or duplicate groups of rows. You must also be able to sort the data on one or more columns that identify each group. These columns are referred to as group keys.

To order the records, you sort on the group key or keys so that all records in a group are contiguous. The Survive stage automatically sorts records if the Pre-sort Input Option is selected in the Survive Stage window. However, the automatic sort provides no control over the order of the records within each group. To control the order within groups, you can pre-sort the input by using the Sort stage.
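
The effect of sorting on a group key, with a secondary sort to control order within each group, can be illustrated outside the product. The following is a minimal Python sketch, not QualityStage code: the rows, the Created column, and the variable names are assumptions made for illustration, and qsMatchSetID stands in for the group key.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample rows; qsMatchSetID plays the role of the group key.
records = [
    {"qsMatchSetID": 9, "GivenName": "JON", "Created": "2019-04-01"},
    {"qsMatchSetID": 4, "GivenName": "MARY", "Created": "2020-01-15"},
    {"qsMatchSetID": 9, "GivenName": "JOHN", "Created": "2021-07-30"},
    {"qsMatchSetID": 4, "GivenName": "M", "Created": "2018-11-02"},
]

# Sort on the group key first so that each group is contiguous, then on a
# secondary column (an assumed creation date) to fix the order within a group.
ordered = sorted(records, key=itemgetter("qsMatchSetID", "Created"))

# groupby only merges adjacent rows, which is why the sort must come first.
groups = {
    key: list(rows)
    for key, rows in groupby(ordered, key=itemgetter("qsMatchSetID"))
}

for key, rows in groups.items():
    print(key, [row["GivenName"] for row in rows])
```

Sorting only on the group key would still make the groups contiguous, but the order of the rows inside each group would then depend on the input order.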

The Survive stage can have only one output link. This link can send output to any other stage. You specify which columns and column values from each group create the output record for the group. The output record can include the following information:

An entire input record
Selected columns from the record
Selected columns from different records in the group

You select column values based on rules for testing the columns. A rule contains a set of conditions and a list of one or more target columns. If a column tests true against the conditions, the column value for that record becomes the best candidate for the target. After each record in the group is tested, the columns that are the best candidates are combined to become the output record for the group.
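
The rule mechanism described above can be sketched in a few lines of Python. This is an illustrative model only, not the rule syntax or the internal logic of the Survive stage: the Rule class, the survive function, the sample columns, and the choice to treat the first record in a group as the initial candidate are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Record = Dict[str, str]


@dataclass
class Rule:
    """Hypothetical rule: a condition plus the target columns it fills."""
    targets: List[str]
    # condition(current, best) is True when the current record should
    # replace the best candidate found so far for the target columns.
    condition: Callable[[Record, Record], bool]


def survive(group: List[Record], rules: List[Rule]) -> Record:
    """Build one output record for a group by applying each rule in turn."""
    output: Record = {}
    for rule in rules:
        best: Optional[Record] = None
        for record in group:
            # Assumption: the first record is the initial candidate.
            if best is None or rule.condition(record, best):
                best = record
        if best is not None:
            for column in rule.targets:
                output[column] = best[column]
    return output


# Hypothetical group of two related records.
group = [
    {"Name": "JON SMITH", "Phone": "555-0100", "Updated": "2019-04-01"},
    {"Name": "JOHN E SMITH", "Phone": "555-0199", "Updated": "2021-07-30"},
]

rules = [
    # Take Phone from the most recently updated record.
    Rule(["Phone"], lambda cur, best: cur["Updated"] > best["Updated"]),
    # Take Name from the record with the longest value.
    Rule(["Name"], lambda cur, best: len(cur["Name"]) > len(best["Name"])),
]

result = survive(group, rules)
print(result)
```

Because each rule selects its own best candidate, the output record can combine columns from different records in the group.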

To select the best candidate, you can base rules on multiple criteria, for example:

Record creation date
Data source from which the record originated
Length of data in the column
Frequency of data in a group

For example, the One-source Match stage identified the following portions of three records as representing the same person, each with a different variation of the name.

qsMatchSetID	Given Name	Middle Initial	Family Name	Suffix
9	JON		SMITH	JR
9	J		SMITHE
9	JOHN	E	SMITH

The Survive stage constructs the best record using length analysis on the columns Given Name, Middle Initial, and Suffix, and using frequency analysis on the column Family Name, with the following result.

qsMatchSetID	Given Name	Middle Initial	Family Name	Suffix
9	JOHN	E	SMITH	JR
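
The result above can be reproduced with a short Python sketch of length analysis and frequency analysis. The helper names (longest, most_frequent) and the dictionary layout are assumptions made for illustration; they are not part of the product, and ties are resolved here simply by taking the first matching value.

```python
from collections import Counter

# The three records from match set 9; empty strings stand for missing values.
group = [
    {"qsMatchSetID": 9, "GivenName": "JON", "MiddleInitial": "",
     "FamilyName": "SMITH", "Suffix": "JR"},
    {"qsMatchSetID": 9, "GivenName": "J", "MiddleInitial": "",
     "FamilyName": "SMITHE", "Suffix": ""},
    {"qsMatchSetID": 9, "GivenName": "JOHN", "MiddleInitial": "E",
     "FamilyName": "SMITH", "Suffix": ""},
]


def longest(rows, column):
    """Length analysis: return the longest value found in the column."""
    return max((row[column] for row in rows), key=len)


def most_frequent(rows, column):
    """Frequency analysis: return the value that occurs most often."""
    return Counter(row[column] for row in rows).most_common(1)[0][0]


survivor = {
    "qsMatchSetID": group[0]["qsMatchSetID"],
    "GivenName": longest(group, "GivenName"),
    "MiddleInitial": longest(group, "MiddleInitial"),
    "FamilyName": most_frequent(group, "FamilyName"),
    "Suffix": longest(group, "Suffix"),
}

print(survivor)
# {'qsMatchSetID': 9, 'GivenName': 'JOHN', 'MiddleInitial': 'E', 'FamilyName': 'SMITH', 'Suffix': 'JR'}
```

Length analysis favors the most complete value (JOHN over JON and J), while frequency analysis favors the spelling that most records agree on (SMITH appears twice, SMITHE once).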