DataStage stages

A DataStage flow consists of stages linked together, which describe the flow of data from a data source to a data target. A stage describes a data source, a processing step, or a target system. The stage also defines the processing logic that moves the data from the input links to the output links.

Stage types and functions

Use stages to manipulate data that you have read from a data source before writing it to a data target. The stages are of the following types:
Processing
Processing stages let you aggregate, copy, filter, funnel, join, look up, merge, modify, remove duplicates, sort, and transform your data; compute differences and checksums; compare, encode, and decode data sets; and perform many other operations.
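
For example, grouping and summarizing with an Aggregator-style stage can be pictured as a single pass that groups rows by a key and accumulates totals. The following Python sketch is purely illustrative; the rows and column names (region, amount) are invented, and DataStage does this work inside the parallel engine rather than in user code.

```python
from collections import defaultdict

# Illustrative input rows (hypothetical columns: region, amount).
rows = [
    {"region": "EMEA", "amount": 120.0},
    {"region": "AMER", "amount": 75.5},
    {"region": "EMEA", "amount": 30.0},
]

# Group by a key column and compute totals per group, roughly what
# an Aggregator stage does for a "sum" summary output column.
totals = defaultdict(float)
for row in rows:
    totals[row["region"]] += row["amount"]

for region, total in sorted(totals.items()):
    print(region, total)  # AMER 75.5, EMEA 150.0
```
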
Development and debug
Use the debug and development stages to peek into your data, pull samples from the head (first few rows) or tail (last few rows) of data partitions, sample data, or generate test data by using the Column Generator and Row Generator stages.
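
As a rough picture of what the Head, Tail, and Sample stages do, the sketch below takes the first or last N rows of each partition, or a random subset. The partition contents and the value of N are invented for illustration.

```python
import random

# Two hypothetical data partitions of a parallel data set.
partitions = [
    ["a1", "a2", "a3", "a4", "a5"],
    ["b1", "b2", "b3", "b4", "b5"],
]

N = 2  # illustrative row count

# Head: first N records from each partition.
head = [part[:N] for part in partitions]

# Tail: last N records from each partition.
tail = [part[-N:] for part in partitions]

# Sample: a random subset of each partition.
random.seed(0)  # deterministic for the example
sample = [random.sample(part, N) for part in partitions]

print(head, tail, sample)
```
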
Restructure
Use stages such as Column Import and Column Export to restructure the data in your sources and targets.
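
Conceptually, Column Import splits one column into several, and Column Export packs several columns back into one. A minimal Python sketch, assuming an invented semicolon-delimited layout:

```python
# Column Import: one delimited source column becomes several columns.
record = {"raw": "Jane;Smith;1971-04-02"}
first, last, dob = record["raw"].split(";")
imported = {"first": first, "last": last, "dob": dob}

# Column Export: several columns are packed back into a single
# string column (here using the same invented delimiter).
exported = {"raw": ";".join([imported["first"], imported["last"], imported["dob"]])}

print(imported)
print(exported)
```
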
Data quality
Use data quality stages to:
  • Resolve data conflicts and ambiguities
  • Uncover new or hidden attributes from free-form or loosely controlled source columns
  • Conform data by transforming data types into a standard format
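
To illustrate the last point, the sketch below conforms an invented free-form phone-number column to one standard format; real Standardize rule sets are far richer than this single rule.

```python
import re

# Loosely controlled source values for a hypothetical phone column.
phones = ["(555) 123-4567", "555.123.4567", "5551234567"]

def standardize_phone(value: str) -> str:
    """Conform a US-style phone number to NNN-NNN-NNNN."""
    digits = re.sub(r"\D", "", value)  # strip punctuation
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"

print([standardize_phone(p) for p in phones])
# ['555-123-4567', '555-123-4567', '555-123-4567']
```
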

A stage usually has at least one data input or one data output. However, some stages can accept more than one data input and can output to more than one stage. Different types of flows provide different stage types. The following table lists the available stage types and describes their functions:

Table 1. Stage editors
Stage | Type | Function
Aggregator | Processing | Classifies incoming data into groups, computes totals and other summary functions for each group, and passes them to another stage in the job.
Address Verification | Data quality | Provides comprehensive address standardization, validation, geocoding, and reverse geocoding.
Bloom Filter | Processing | Looks up incoming keys against previous values.
Change Apply | Processing | Applies encoded change operations to a before data set based on a change data set produced by the Change Capture stage.
Change Capture | Processing | Compares two data sets and makes a record of the differences.
Checksum | Processing | Generates a checksum value from the specified columns in a row and adds the checksum to the row.
Column Import | Restructure | Imports data from a single column and outputs it to one or more columns.
Column Export | Restructure | Exports data from a number of columns of different data types into a single column of type ustring, string, or binary.
Column Generator | Development/Debug | Adds columns to incoming data and generates mock data for these columns for each data row processed.
Compare | Processing | Performs a column-by-column comparison of records in two presorted input data sets.
Compress | Processing | Uses the UNIX compress or GZIP utility to compress a data set, converting it from a sequence of records into a stream of raw binary data.
Copy | Processing | Copies a single input data set to a number of output data sets.
Data Set | File | Reads data from or writes data to a data set.
Decode | Processing | Decodes a data set by using a UNIX decoding command that you supply.
Difference | Processing | Performs a record-by-record comparison of two input data sets, which are different versions of the same data set.
Encode | Processing | Encodes a data set by using a UNIX encoding command that you supply.
Expand | Processing | Uses the UNIX uncompress or GZIP utility to expand a data set, converting a previously compressed data set from a stream of raw binary data back into a sequence of records.
External Filter | Processing | Allows you to specify a UNIX command that acts as a filter on the data that you are processing.
File Set | File | Reads data from or writes data to a file set.
Filter | Processing | Transfers, unmodified, the records of the input data set that satisfy the requirements that you specify, and filters out all other records.
Funnel | Processing | Copies multiple input data sets to a single output data set.
Generic | Processing | Lets you incorporate an Orchestrate® Operator in your job.
Head | Development/Debug | Selects the first N records from each partition of an input data set and copies the selected records to an output data set.
Hierarchical (XML) | Processing | Parses JSON and XML data.
Investigate | Data quality | The character investigation type analyzes and classifies data, parsing it into a single-pattern report. The word investigation type uses a set of rules for classifying data such as personal names, business names, and addresses.
Java Integration | Processing | Invokes Java classes from parallel jobs.
Join | Processing | Performs join operations on two or more data sets input to the stage and then outputs the resulting data set.
Lookup | Processing | Performs lookup operations on a data set read into memory from any other parallel job stage that can output data, or on reference data provided by one of the database stages that support reference output links. It can also perform a lookup on a lookup table contained in a Lookup File Set stage.
Merge | Processing | Combines a sorted master data set with one or more sorted update data sets.
Modify | Processing | Alters the record schema of its input data set.
Peek | Development/Debug | Lets you print record column values either to the job log or to a separate output link as the stage copies records from its input data set to one or more output data sets.
Pivot Enterprise | Processing | Pivots data horizontally and vertically. Horizontal pivoting maps a set of columns in an input row to a single column in multiple output rows; vertical pivoting maps a set of rows in the input data to single or multiple output columns. (See the sketch after this table.)
Remove Duplicates | Processing | Takes a single sorted data set as input, removes all duplicate records, and writes the results to an output data set.
Row Generator | Development/Debug | Produces a set of mock data fitting the specified metadata.
Sample | Development/Debug | Samples an input data set.
Sort | Processing | Sorts input columns.
Standardize | Data quality | Makes source data internally consistent, so that each data type has the same kind of content and format.
Surrogate Key Generator | Processing | Generates surrogate key columns and maintains the key source.
Switch | Processing | Takes a single data set as input and assigns each input record to an output data set based on the value of a selector field.
Tail | Development/Debug | Selects the last N records from each partition of an input data set and copies the selected records to an output data set.
Transformer | Processing | Handles extracted data, performs any conversions required, and passes data to another active stage or to a stage that writes data to a target database or file.
Wave Generator | Processing | Monitors a stream of data and inserts end-of-wave markers where needed.
Write Range Map | Development/Debug | Allows you to write data to a range map. The stage can have a single input link.
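
To make the Pivot Enterprise entry concrete, here is an illustrative Python sketch of the two pivot directions. The column names and row layout are invented, and the real stage is configured in the flow designer rather than coded:

```python
# Horizontal pivot: a set of columns in one input row becomes
# a single column spread over multiple output rows.
row = {"name": "Jane", "q1": 10, "q2": 20, "q3": 30}
horizontal = [
    {"name": row["name"], "quarter": q, "sales": row[q]}
    for q in ("q1", "q2", "q3")
]
# -> three rows, each carrying one 'sales' value

# Vertical pivot: a set of input rows is folded back into
# columns on a single output row (the inverse operation).
vertical = {"name": "Jane"}
for r in horizontal:
    vertical[r["quarter"]] = r["sales"]
# -> {'name': 'Jane', 'q1': 10, 'q2': 20, 'q3': 30}

print(horizontal)
print(vertical)
```
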