Survive stage in DataStage

The Survive stage groups duplicate records and constructs output records with the best available data for each group.

You can use the Survive stage as the last stage in data preparation to process and improve the output of a match stage. The Survive stage uses the rules that you specify to select the best columns from a group of records and create an output record with the group's best available data.

Input

The Survive stage requires one input source. If your input is the result of a match stage, you need to set up another stage (such as a Funnel stage) to combine the master and duplicate records into one input source. The Survive stage accepts all basic data types other than binary. The input data does not need to be processed by a match stage and can be from a database, connector, flat file, data set, or other stage. The input data must contain related groups of records with one or more columns that can identify each group.

Properties

Select a Group identification column. The Survive stage will group records with duplicate values for this column. The Survive stage also sorts your data by group identification column value. If you do not want your data to be sorted on the group identification column, deselect Sort input data (selected by default). To control the order of records within groups, use a Sort stage to presort the input data.

Specifying rules

Edit the Survive rule columns to add rules for the other columns. A rule contains a set of conditions and a list of one or more target columns. Each record is tested against the rules and if a column tests true against the conditions, the column value for that record becomes the best candidate value for the target column. After all records in a group are tested, the output record is constructed out of the best candidate values for each column.

When you add a rule, you can select a built-in condition from the list of techniques, or click the three vertical dots and create a complex rule in the expression editor. A simple rule compares each column of a record to the current best value for that column, evaluated according to the selected technique.
Technique Pattern
Shortest Field SIZEOF(TRIM(c."column"))<= SIZEOF(TRIM(b."column"))
Longest Field SIZEOF(TRIM(c."column"))>= SIZEOF(TRIM(b."column"))
Most Frequent FREQUENCY
Most Frequent (Nonblank) FREQUENCY (Skips missing values when it counts most frequent.)
Equals c."column" = "DATA"
Not Equals c."column" <> "DATA"
Greater Than c."column" >= "DATA"
Less Than c."column" <= "DATA"
At Least One 1 (At least one record survives, regardless of other rules.)
Specify a value for DATA only if you select Equals, Not Equals, Greater Than, or Less Than.
A complex rule is an expression that you write in the expression editor. To define a complex rule, you select the columns that the rule is used on, the functions that are applied to their values, and the operation that compares the results. Select a current record and a best record from the Columns list, select whether to use SIZEOF, TRIM, or both from the Functions list, and select an operator from the Operators list to construct your complex rule. The following rule is an example of a complex rule that uses both functions, the AND operator, the = operator, and the >= operator. b.column is the value stores as the best column available, while c.column is the current column that is being analyzed. If all columns contain a value, this rule outputs the first record as the best record.
SIZEOF(TRIM(b.column)) = 0 AND SIZEOF(TRIM(c.column)) >= 0

Output

The Survive stage can have only one output link, which produces the output records from each group. Each output record can consist of an entire input record, selected columns from the record, or selected columns from different records in the group.

For example, take the following input from a One-source match stage, which identified these records as likely to belong to the same person and assigned them a shared qsMatchSetID value.

qsMatchSetID Given Name Middle Initial Family Name Suffix
9 JON   SMITH JR
9 J   SMITHE  
9 JOHN E SMITH  

Using the Survive stage, you can specify rules that analyze the Given Name, Middle Initial, and Suffix columns according to length, with the longest field as the best value. You can specify a rule to analyze the Family Name column according to frequency, with the most common value as the best value. The Survive stage constructs the following result as an output record.

qsMatchSetID Given Name Middle Initial Family Name Suffix
9 JOHN E SMITH JR