Survive stage in DataStage
The Survive stage groups duplicate records and constructs output records with the best available data for each group.
You can use the Survive stage as the last stage in data preparation to process and improve the output of a match stage. The Survive stage uses the rules that you specify to select the best columns from a group of records and create an output record with the group's best available data.
Input
The Survive stage requires one input source. If your input is the result of a match stage, you need to set up another stage (such as a Funnel stage) to combine the master and duplicate records into one input source. The Survive stage accepts all basic data types other than binary. The input data does not need to be processed by a match stage and can be from a database, connector, flat file, data set, or other stage. The input data must contain related groups of records with one or more columns that can identify each group.
Properties
Select a Group identification column. The Survive stage will group records with duplicate values for this column. The Survive stage also sorts your data by group identification column value. If you do not want your data to be sorted on the group identification column, deselect Sort input data (selected by default). To control the order of records within groups, use a Sort stage to presort the input data.
Specifying rules
Edit the Survive rule columns to add rules for the other columns. A rule contains a set of conditions and a list of one or more target columns. Each record is tested against the rules and if a column tests true against the conditions, the column value for that record becomes the best candidate value for the target column. After all records in a group are tested, the output record is constructed out of the best candidate values for each column.
Technique | Pattern |
---|---|
Shortest Field | SIZEOF(TRIM(c."column"))<=
SIZEOF(TRIM(b."column")) |
Longest Field | SIZEOF(TRIM(c."column"))>=
SIZEOF(TRIM(b."column")) |
Most Frequent | FREQUENCY |
Most Frequent (Nonblank) | FREQUENCY (Skips missing values when it counts most frequent.) |
Equals | c."column" = "DATA" |
Not Equals | c."column" <> "DATA" |
Greater Than | c."column" >= "DATA" |
Less Than | c."column" <= "DATA" |
At Least One | 1 (At least one record survives, regardless of other rules.) |
b.column
is the value stores as the best column available, while
c.column
is the current column that is being analyzed. If all columns contain a
value, this rule outputs the first record as the best record.
SIZEOF(TRIM(b.column)) = 0 AND SIZEOF(TRIM(c.column)) >= 0
Output
The Survive stage can have only one output link, which produces the output records from each group. Each output record can consist of an entire input record, selected columns from the record, or selected columns from different records in the group.
For example, take the following input from a One-source match stage, which identified these records as likely to belong to the same person and assigned them a shared qsMatchSetID value.
qsMatchSetID | Given Name | Middle Initial | Family Name | Suffix |
---|---|---|---|---|
9 | JON | SMITH | JR | |
9 | J | SMITHE | ||
9 | JOHN | E | SMITH |
Using the Survive stage, you can specify rules that analyze the Given Name, Middle Initial, and Suffix columns according to length, with the longest field as the best value. You can specify a rule to analyze the Family Name column according to frequency, with the most common value as the best value. The Survive stage constructs the following result as an output record.
qsMatchSetID | Given Name | Middle Initial | Family Name | Suffix |
---|---|---|---|---|
9 | JOHN | E | SMITH | JR |