Supported stages

When you create and edit transform jobs with the DataStage® service, you can choose from many stages that transform your data in useful ways. To add a stage to a job, drag it from the palette to the canvas. To configure a stage, click it to open the Details card on the Transform data page of DataStage.

For more information, go to the InfoSphere Information Server Knowledge Center and search for a specific stage name.

The following stages are supported:

Aggregator

Aggregator stages classify data rows from a single input link into groups and compute totals or other aggregate functions for each group. The summed totals for each group are output from the stage through an output link.
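
For illustration only, the following Python sketch shows the kind of grouping and aggregation that the stage performs. The column names (region, amount) and the use of plain dictionaries are assumptions for the example, not DataStage syntax.

  from collections import defaultdict

  rows = [
      {"region": "EMEA", "amount": 100},
      {"region": "EMEA", "amount": 250},
      {"region": "APAC", "amount": 75},
  ]

  # Classify the rows into groups by the key column and compute a total
  # and a count for each group, as an Aggregator stage would output.
  totals = defaultdict(float)
  counts = defaultdict(int)
  for row in rows:
      totals[row["region"]] += row["amount"]
      counts[row["region"]] += 1

  for region in totals:
      print(region, totals[region], counts[region])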

Annotation

You can use annotations for a wide variety of purposes throughout your job design. For example, you can use annotations to explain, summarize, or describe a job design or to help identify parts of a job design. You enter the annotation text yourself and can add as many annotations as required. Use them to annotate stages and links in your job design.

Annotations do not obstruct the display of the stages, links, or other components in your job design.

Bloom Filter

You can use the bloom filter operator to perform lookups on keys more efficiently. The bloom filter operator can generate false positives but never generates false negatives in your output data set.

You can use the bloom filter operator to look up incoming keys against previous values more efficiently. Use the bloom filter operator only when a small number of false positives are acceptable in your output data set. The operator takes a single input data set and can generate multiple output data sets, depending on the operating mode. The bloom filter operator manages bloom filter file sets when they are described, and adds or deletes files from the file set based on the options that are specified on the operator.
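
As a rough illustration of why a Bloom filter can return false positives but never false negatives, the following minimal Python sketch tests keys against a bit array. The bit-array size, the way the two hash positions are derived, and the sample keys are arbitrary assumptions, not the operator's actual implementation.

  import hashlib

  SIZE = 1024                # number of bits in the filter (assumed)
  bits = [False] * SIZE

  def positions(key):
      # Derive two bit positions from one digest (illustrative only).
      digest = hashlib.sha256(key.encode()).digest()
      return (int.from_bytes(digest[:8], "big") % SIZE,
              int.from_bytes(digest[8:16], "big") % SIZE)

  def add(key):
      for p in positions(key):
          bits[p] = True

  def probably_contains(key):
      # True may be a false positive; False is always correct.
      return all(bits[p] for p in positions(key))

  add("key-001")
  print(probably_contains("key-001"))   # True: the key was added
  print(probably_contains("key-999"))   # usually False; True would be a false positive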

Change apply

The Change Apply stage is a processing stage. It takes the change data set, which contains the changes in the before and after data sets, from the Change Capture stage and applies the encoded change operations to a before data set to compute an after data set.

The before input to Change Apply must have the same columns as the before input that was input to Change Capture, and an automatic conversion must exist between the types of corresponding columns. In addition, results are only guaranteed if the contents of the before input to Change Apply are identical (in value and record order in each partition) to the before input that was fed to Change Capture, and if the keys are unique.

The Change Apply stage reads a record from the change data set and from the before data set, compares their key column values, and acts as follows (a simplified sketch of the equal-keys case appears after the list):

  • If the before keys come before the change keys in the specified sort order, the before record is copied to the output. The change record is retained for the next comparison.
  • If the before keys are equal to the change keys, the behavior depends on the code in the change_code column of the change record:
    • Insert: The change record is copied to the output; the stage retains the same before record for the next comparison. If key columns are not unique, and there is more than one consecutive insert with the same key, then Change Apply applies all the consecutive inserts before existing records. This record order might be different from the after data set given to Change Capture.
    • Delete: The value columns of the before and change records are compared. If the value columns are the same or if the Check Value Columns on Delete option is specified as False, the change and before records are both discarded; no record is transferred to the output. If the value columns are not the same, the before record is copied to the output and the stage retains the same change record for the next comparison. If key columns are not unique, the value columns ensure that the correct record is deleted. If more than one record with the same keys has matching value columns, the first-encountered record is deleted. This might cause different record ordering than in the after data set given to the Change Capture stage. A warning is issued and both the change record and the before record are discarded, that is, no output record results.
    • Edit: The change record is copied to the output; the before record is discarded. If key columns are not unique, then the first before record encountered with matching keys will be edited. This might be a different record from the one that was edited in the after data set given to the Change Capture stage. A warning is issued and the change record is copied to the output; but the stage retains the same before record for the next comparison.
    • Copy: The change record is discarded. The before record is copied to the output.
  • If the before keys come after the change keys, behavior also depends on the change_code column:
    • Insert: The change record is copied to the output; the stage retains the same before record for the next comparison. (This is the same behavior as when the keys are equal.)
    • Delete: A warning is issued and the change record is discarded, while the before record is retained for the next comparison.
    • Edit or Copy: A warning is issued and the change record is copied to the output, while the before record is retained for the next comparison.
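
The following Python sketch condenses the rules for the case where the before keys equal the change keys. The record layout, the string change codes, and the helper names are illustrative assumptions, not the stage's interface.

  # Value (non-key) columns assumed for the example.
  VALUE_COLS = ["name", "balance"]

  def values_match(before_rec, change_rec):
      # Compare the value columns of the before and change records.
      return all(before_rec[c] == change_rec[c] for c in VALUE_COLS)

  def apply_when_keys_equal(before_rec, change_rec, check_values_on_delete=True):
      # Returns (records_to_output, advance_before, advance_change).
      code = change_rec["change_code"]
      if code == "insert":
          # Copy the insert to the output; retain the same before record.
          return [change_rec], False, True
      if code == "delete":
          if values_match(before_rec, change_rec) or not check_values_on_delete:
              return [], True, True          # both records discarded, nothing output
          return [before_rec], True, False   # before copied; same change record retained
      if code == "edit":
          return [change_rec], True, True    # edited record replaces the before record
      return [before_rec], True, True        # "copy": the before record passes through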

Change capture

The Change Capture Stage is a processing stage. The stage compares two data sets and makes a record of the differences.

The Change Capture stage takes two input data sets, denoted before and after, and outputs a single data set whose records represent the changes made to the before data set to obtain the after data set. The stage produces a change data set, whose table definition is transferred from the after data set's table definition with the addition of one column: a change code with values encoding the four actions: insert, delete, copy, and edit.

The comparison is based on a set of key columns; rows from the two data sets are assumed to be copies of one another if they have the same values in these key columns. You can also optionally specify change values. If two rows have identical key columns, you can compare the value columns in the rows to see if one is an edited copy of the other.

The stage assumes that the incoming data is key-partitioned and sorted in ascending order. The columns the data is hashed on should be the key columns used for the data compare. You can achieve the sorting and partitioning using the Sort stage or by using the built-in sorting and partitioning abilities of the Change Capture stage.

You can use the companion Change Apply stage to combine the changes from the Change Capture stage with the original before data set to reproduce the after data set.

Checksum

Use the Checksum stage to generate a checksum value from the specified columns in a row and add the checksum to the row.

You can use the checksum value to check the validity of each row when it is written to the data target. If the checksum value does not match the value that is regenerated from those columns, the data is corrupt and is no longer valid.

Typically you create the Checksum stage in one job to add the checksum column, and then use a Checksum stage in another job to check the validity of the data.
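
A minimal sketch of the idea in Python, assuming an MD5 digest over a chosen set of columns; the column list, the separator, and the hash algorithm are assumptions and might not match what the stage uses.

  import hashlib

  CHECKSUM_COLS = ["customer_id", "name", "balance"]   # assumed column choice

  def add_checksum(row):
      # Concatenate the chosen column values and store a digest with the row.
      data = "|".join(str(row[c]) for c in CHECKSUM_COLS)
      row["checksum"] = hashlib.md5(data.encode()).hexdigest()
      return row

  def is_valid(row):
      # Recompute the digest at the target and compare it with the stored value.
      data = "|".join(str(row[c]) for c in CHECKSUM_COLS)
      return row["checksum"] == hashlib.md5(data.encode()).hexdigest()

  row = add_checksum({"customer_id": 1, "name": "Ada", "balance": 10.0})
  row["balance"] = 99.0          # simulate corruption after the checksum was added
  print(is_valid(row))           # False: the row no longer matches its checksum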

Column export

The Column Export stage is a restructure stage, which is used to restructure complex data. This stage exports data from a number of columns of different data types into a single column of data type ustring, string, or binary. The input data column definitions determine the order in which the columns are exported to the single output column. In addition to exporting a column you can also pass other columns straight through the stage. For example, you could pass a key column straight through.
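
To illustrate the idea, this Python sketch packs several typed columns into one delimited string column while passing a key column straight through; the delimiter and the column names are assumptions.

  def export_columns(row, export_cols, delimiter=","):
      # Combine several input columns into a single string column,
      # keeping the key column unchanged.
      exported = delimiter.join(str(row[c]) for c in export_cols)
      return {"key": row["key"], "exported": exported}

  row = {"key": 42, "first_name": "Ada", "last_name": "Lovelace", "age": 36}
  print(export_columns(row, ["first_name", "last_name", "age"]))
  # {'key': 42, 'exported': 'Ada,Lovelace,36'}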

The stage editor has three pages:
  • Stage Page. This is always present and is used to specify general information about the stage.
  • Input Page. This is where you specify the details about the single input set from which you are selecting records.
  • Output Page. This is where you specify details about the processed data being output from the stage.

Column generator

The Column Generator stage is a development and debugging stage. This stage adds columns to incoming data and generates mock data for these columns for each data row processed. The new data set is then output. It can have a single input link and a single output link.

The stage editor has three pages:
  • Stage Page. This is always present and is used to specify general information about the stage.
  • Input Page. This is where you specify details about the input link.
  • Output Page. This is where you specify details about the generated data being output from the stage.

Column import

The Column Import stage is a restructure stage, which is used to restructure complex data. This stage imports data from a single column and outputs it to one or more columns. You would typically use this stage to divide data arriving in a single column into multiple columns. The data would be fixed-width or delimited in some way to tell the Column Import stage where to make the divisions. The input column must be string or binary data; the output columns can be any data type.

You supply an import table definition to specify the target columns and their types. This also determines the order in which data from the import column is written to output columns. In addition to importing a column you can also pass other columns straight through the stage. So, for example, you could pass a key column straight through.
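
The following Python sketch shows the inverse of the Column Export example, again with an assumed comma delimiter and assumed column names; in this sketch the output values stay as strings rather than taking their types from a table definition.

  def import_column(row, target_cols, delimiter=","):
      # Split one delimited string column into several output columns,
      # passing the key column straight through.
      values = row["imported"].split(delimiter)
      out = {"key": row["key"]}
      out.update(dict(zip(target_cols, values)))
      return out

  row = {"key": 42, "imported": "Ada,Lovelace,36"}
  print(import_column(row, ["first_name", "last_name", "age"]))
  # {'key': 42, 'first_name': 'Ada', 'last_name': 'Lovelace', 'age': '36'}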

The stage editor has three pages:
  • Stage Page. This is always present and is used to specify general information about the stage.
  • Input Page. This is where you specify the details about the single input set from which you are selecting records.
  • Output Page. This is where you specify details about the processed data being output from the stage.

Combine Records

The Combine Records stage is a restructure stage that combines records (that is, rows), in which particular key-column values are identical, into vectors of subrecords. As input, the stage takes a data set in which one or more columns are chosen as keys. All adjacent records whose key columns contain the same value are gathered into the same record in the form of subrecords.
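
As a rough Python sketch, adjacent rows that share a key value can be gathered into one record that carries a vector of subrecords; itertools.groupby stands in for the stage's key grouping, and the column names are assumptions.

  from itertools import groupby

  rows = [
      {"order_id": 1, "item": "pen",   "qty": 2},
      {"order_id": 1, "item": "paper", "qty": 1},
      {"order_id": 2, "item": "ink",   "qty": 5},
  ]

  # Adjacent rows with the same key become one record that contains a
  # vector of subrecords (modeled here as a list of dictionaries).
  combined = [
      {"order_id": key, "lines": [{"item": r["item"], "qty": r["qty"]} for r in group]}
      for key, group in groupby(rows, key=lambda r: r["order_id"])
  ]
  print(combined)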

Compare

The Compare stage is a processing stage. It can have two input links and a single output link.

The Compare stage performs a column-by-column comparison of records in two presorted input data sets. You can restrict the comparison to specified key columns.

The Compare stage does not change the table definition, partitioning, or content of the records in either input data set. It transfers both data sets intact to a single output data set generated by the stage. The comparison results are also recorded in the output data set.

The stage outputs a data set with three columns:

  • result. Carries the code giving the result of the comparison.
  • first. A subrecord containing the columns of the first input link.
  • second. A subrecord containing the columns of the second input link.

Compress

The Compress stage is a processing stage. It can have a single input link and a single output link.

The Compress stage uses the UNIX compress or GZIP utility to compress a data set. It converts a data set from a sequence of records into a stream of raw binary data. The complement to the Compress stage is the Expand stage.

A compressed data set is similar to an ordinary data set and can be stored in a persistent form by a Data Set stage. However, a compressed data set cannot be processed by many stages until it is expanded, that is, until its rows are returned to their normal format. Stages that do not perform column-based processing or reorder the rows can operate on compressed data sets. For example, you can use the Copy stage to create a copy of the compressed data set.

Because compressing a data set removes its normal record boundaries, the compressed data set must not be repartitioned before it is expanded.

Copy

The Copy stage is a processing stage. The Copy stage copies a single input data set to a number of output data sets. Some SMP systems allow scalability of disk I/O.

Each record of the input data set is copied to every output data set. Records can be copied without modification or you can drop or change the order of columns. Copy lets you make a backup copy of a data set on disk while performing an operation on another copy.

Decode

The Decode stage is a processing stage. It decodes a data set using a UNIX decoding command, such as gzip, that you supply. It converts a data stream of raw binary data into a data set. Its companion stage, Encode, converts a data set from a sequence of records to a stream of raw binary data.

Difference

The Difference stage is a processing stage. It performs a record-by-record comparison of two input data sets, which are different versions of the same data set designated the before and after data sets. The Difference stage outputs a single data set whose records represent the difference between them. The stage assumes that the input data sets have been key-partitioned and sorted in ascending order on the key columns you specify for the Difference stage comparison. You can achieve this by using the Sort stage or by using the built-in sorting and partitioning abilities of the Difference stage.

The comparison is performed based on a set of difference key columns. Two records are copies of one another if they have the same value for all difference keys. You can also optionally specify change values. If two records have identical key columns, you can compare the value columns to see if one is an edited copy of the other.

Encode

The Encode stage is a processing stage. It encodes a data set using a UNIX encoding command, such as gzip, that you supply. The stage converts a data set from a sequence of records into a stream of raw binary data. The companion Decode stage reconverts the data stream to a data set.

An encoded data set is similar to an ordinary one, and can be written to a Data Set stage. You cannot use an encoded data set as an input to stages that perform column-based processing or reorder rows, but you can input it to stages such as Copy. You can view information about the data set in the data set viewer, but not the data itself. You cannot repartition an encoded data set, and you will be warned at runtime if your job attempts to do that.

Expand

The Expand stage is a processing stage that converts a previously compressed data set back into a sequence of records from a stream of raw binary data. It can have a single input link and a single output link.

The Expand stage uses the UNIX uncompress or GZIP utility to expand a data set. The complement to the Expand stage is the Compress stage.

External Filter

The External Filter stage is a processing stage that allows you to specify a UNIX command that acts as a filter on the data that you are processing. An example would be to use the stage to grep a data set for a certain string, or pattern, and discard records that did not contain a match. This technique can be a quick and efficient way of filtering data.

Filter

The Filter stage is a processing stage. It can have a single input link, any number of output links, and, optionally, a single reject link.

The Filter stage transfers, unmodified, the records of the input data set that satisfy the specified requirements and filters out all other records. You can specify different requirements to route rows down different output links. Data that does not meet the filter conditions can be routed to the reject link.

Funnel

The Funnel stage is a processing stage. It copies multiple input data sets to a single output data set. This operation is useful for combining separate data sets into a single large data set. The stage can have any number of input links and a single output link.

The Funnel stage can operate in one of three modes:

  • Continuous Funnel combines the records of the input data in no guaranteed order. It takes one record from each input link in turn. If data is not available on an input link, the stage skips to the next link rather than waiting.
  • Sort Funnel combines the input records in the order defined by the value(s) of one or more key columns and the order of the output records is determined by these sorting keys.
  • Sequence copies all records from the first input data set to the output data set, then all the records from the second input data set, and so on.

For all methods, the metadata of all input data sets must be identical.

The sort funnel method has some particular requirements about its input data. All input data sets must be sorted by the same key columns that are used by the Funnel operation.

Typically all input data sets for a sort funnel operation are hash-partitioned before they are sorted (choosing the auto partitioning method will ensure that this is done). Hash partitioning guarantees that all records with the same key column values are located in the same partition and so are processed on the same node. If sorting and partitioning are carried out on separate stages before the Funnel stage, this partitioning must be preserved.

The sortfunnel operation allows you to set one primary key and multiple secondary keys. The Funnel stage first examines the primary key in each input record. For multiple records with the same primary key value, it then examines secondary keys to determine the order of records it will output.
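
The following Python sketch illustrates the sort funnel idea by merging two already-sorted inputs on a primary key and a secondary key with heapq.merge; the key columns (last name, then first name) are assumptions for the example.

  from heapq import merge

  # Two input data sets, each already sorted on (last_name, first_name).
  input_a = [("Adams", "Jo", 1), ("Baker", "Al", 2)]
  input_b = [("Adams", "Ann", 3), ("Cole", "Zoe", 4)]

  # Sort Funnel: interleave the inputs so that the output stays ordered by
  # the primary key (last_name) and then the secondary key (first_name).
  for record in merge(input_a, input_b, key=lambda r: (r[0], r[1])):
      print(record)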

Generic

The Generic stage is a processing stage that allows you to call an Orchestrate® operator from within a stage and pass it options as required.

Head

The Head Stage is a Development/Debug stage. It can have a single input link and a single output link. It is one of a number of stages that helps you sample data.

Join

The Join stage is a processing stage. It performs join operations on two or more data sets input to the stage and then outputs the resulting data set. The Join stage is one of three stages that join tables based on the values of key columns. The other two are the Lookup stage and the Merge stage.

The stage can perform one of four join operations:

  • Inner transfers records from input data sets whose key columns contain equal values to the output data set. Records whose key columns do not contain equal values are dropped.
  • Left outer transfers all values from the left data set but transfers values from the right data set and intermediate data sets only where key columns match. The stage drops the key column from the right and intermediate data sets.
  • Right outer transfers all values from the right data set and transfers values from the left data set and intermediate data sets only where key columns match. The stage drops the key column from the left and intermediate data sets.
  • Full outer transfers records in which the contents of the key columns are equal from the left and right input data sets to the output data set. It also transfers records whose key columns contain unequal values from both input data sets to the output data set. (Full outer joins do not support more than two input links.)

The data sets input to the Join stage must be key partitioned and sorted in ascending order. This ensures that rows with the same key column values are located in the same partition and will be processed by the same node. It also minimizes memory requirements because fewer rows need to be in memory at any one time. Choosing the auto partitioning method will ensure that partitioning and sorting is done. If sorting and partitioning are carried out on separate stages before the Join stage, the system in auto mode will detect this and not repartition (alternatively you could explicitly specify the Same partitioning method).
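
For a concrete picture of the four join operations on a key column, the following Python sketch joins two small in-memory data sets; the column names are assumptions, the right input is assumed to have unique keys for brevity, and the code is not how the stage is implemented.

  left_rows  = [{"id": 1, "name": "Ada"},  {"id": 2, "name": "Alan"}]
  right_rows = [{"id": 2, "city": "York"}, {"id": 3, "city": "Bonn"}]

  def join(left, right, how="inner"):
      right_by_key = {r["id"]: r for r in right}   # index the right input by key
      out, matched = [], set()
      for l in left:
          r = right_by_key.get(l["id"])
          if r is not None:
              out.append({**l, **r})               # key values match: merge the records
              matched.add(l["id"])
          elif how in ("left", "full"):
              out.append({**l, "city": None})      # unmatched left row kept
      if how in ("right", "full"):
          for r in right:
              if r["id"] not in matched:
                  out.append({"name": None, **r})  # unmatched right row kept
      return out

  for how in ("inner", "left", "right", "full"):
      print(how, join(left_rows, right_rows, how))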

Lookup

The Lookup stage is a processing stage. It is used to perform lookup operations on a data set read into memory from any other Parallel job stage that can output data. It can also perform lookups directly in a Db2® or Oracle database or in a lookup table contained in a Lookup File Set stage.

The most common use for a lookup is to map short codes in the input data set onto expanded information from a lookup table which is then joined to the incoming data and output. For example, you could have an input data set carrying names and addresses of your U.S. customers. The data as presented identifies state as a two-letter U.S. state postal code, but you want the data to carry the full name of the state. You could define a lookup table that carries a list of codes matched to states, defining the code as the key column. As the Lookup stage reads each line, it uses the key to look up the state in the lookup table. It adds the state to a new column defined for the output link, and so the full state name is added to each address. If any state codes have been incorrectly entered in the data set, the code will not be found in the lookup table, and so that record will be rejected.

Lookups can also be used to validate a row. If there is no entry in the lookup table that corresponds to the key values, the row is rejected.
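
The state-code example above can be pictured with a plain Python dictionary standing in for the lookup table; the table contents and the reject handling are simplified assumptions.

  state_lookup = {"NY": "New York", "CA": "California"}   # lookup table keyed on the code

  input_rows = [
      {"name": "Ada",  "state": "NY"},
      {"name": "Alan", "state": "ZZ"},   # incorrect code: no entry in the lookup table
  ]

  output, rejects = [], []
  for row in input_rows:
      full_name = state_lookup.get(row["state"])
      if full_name is None:
          rejects.append(row)                              # row rejected
      else:
          output.append({**row, "state_name": full_name})  # expanded column added

  print(output)
  print(rejects)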

Make Subrecord

The Make Subrecord stage is a restructure stage. It can have a single input link and a single output link. It combines specified vectors in an input data set into a vector of subrecords whose columns have the names and data types of the original vectors. The Split Subrecord stage performs the inverse operation.

Make Vector

The Make Vector stage is an active stage that combines specified columns of an input data record into a vector of columns.

Merge

The Merge stage is a processing stage. It can have any number of input links, a single output link, and the same number of reject links as there are update input links.

The Merge stage combines a master data set with one or more update data sets. The columns from the records in the master and update data sets are merged so that the output record contains all the columns from the master record plus any additional columns from each update record that are required. A master record and an update record are merged only if both of them have the same values for the merge key column(s) that you specify. Merge key columns are one or more columns that exist in both the master and update records.

The data sets input to the Merge stage must be key partitioned and sorted. This ensures that rows with the same key column values are located in the same partition and will be processed by the same node. It also minimizes memory requirements because fewer rows need to be in memory at any one time. Choosing the auto partitioning method will ensure that partitioning and sorting is done. If sorting and partitioning are carried out on separate stages before the Merge stage, the system in auto partition mode will detect this and not repartition (alternatively you could explicitly specify the Same partitioning method).

Unlike Join stages and Lookup stages, the Merge stage allows you to specify several reject links. You can route update link rows that fail to match a master row down a reject link that is specific for that link. You must have the same number of reject links as you have update links. The Link Ordering tab on the Stage page lets you specify which update links send rejected rows to which reject links. You can also specify whether to drop unmatched master rows, or output them on the output data link.

Modify

The Modify stage is a processing stage that alters the record schema of its input data set. The modified data set is then output. You can drop or keep columns from the schema, or change the type of a column.

Peek

The Peek stage is a Development/Debug stage. It can have a single input link and any number of output links.

The Peek stage lets you print record column values either to the job log or to a separate output link as the stage copies records from its input data set to one or more output data sets. Like the Head stage and the Tail stage, the Peek stage can be helpful for monitoring the progress of your application or to diagnose a bug in your application.

Pivot Enterprise

Use the Pivot Enterprise stage to pivot data horizontally.

The pivot operator maps a set of fields in an input row to a single column in multiple output records. This type of mapping operation is known as horizontal pivoting. The data output by the pivot operator usually has fewer fields, but more records than the input data. You can map several sets of input fields to several output columns. You can also output any of the fields in the input data with the output data.

You can generate a pivot index that will assign an index number to each record with a set of pivoted data.
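
A small Python sketch of horizontal pivoting with a pivot index; the field names and the single pivoted output column are assumptions.

  row = {"customer": "Ada", "q1": 100, "q2": 120, "q3": 90}

  # Map a set of input fields (q1..q3) to a single output column, producing
  # one output record per pivoted field, plus an optional pivot index.
  pivoted = [
      {"customer": row["customer"], "sales": row[field], "pivot_index": i}
      for i, field in enumerate(["q1", "q2", "q3"])
  ]
  print(pivoted)   # three records with fewer fields than the input row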

Promote Subrecord

The Promote Subrecord stage is a restructure stage. It promotes the columns of an input subrecord to top-level columns. The number of output columns equals the number of subrecord elements. The data types of the input subrecord columns determine those of the corresponding top-level columns.

Remove Duplicates

The Remove Duplicates stage takes a single sorted data set as input, removes all duplicate rows, and writes the results to an output data set.

Removing duplicate records is a common way of cleansing a data set before you perform further processing. Two rows are considered duplicates if they are adjacent in the input data set and have identical values for the key column(s). A key column is any column you designate to be used in determining whether two rows are identical.

The data set input to the Remove Duplicates stage must be sorted so that all records with identical key values are adjacent. You can either achieve this using the in-stage sort facilities available on the Input page Partitioning tab, or have an explicit Sort stage feeding the Remove Duplicates stage.
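
Because the input is sorted, duplicate rows are adjacent, so a single pass that compares each row's key with the previous row's key is enough. The following Python sketch assumes a single key column named id and retains the first row of each group.

  sorted_rows = [
      {"id": 1, "name": "Ada"},
      {"id": 1, "name": "Ada L."},      # duplicate key, adjacent because the input is sorted
      {"id": 2, "name": "Alan"},
  ]

  output, previous_key = [], object()   # sentinel that matches no real key value
  for row in sorted_rows:
      if row["id"] != previous_key:     # keep only the first row for each key value
          output.append(row)
      previous_key = row["id"]

  print(output)   # one row per distinct id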

Row Generator

The Row Generator stage is a Development/Debug stage.

The Row Generator stage has no input links and a single output link.

The Row Generator stage produces a set of mock data fitting the specified metadata. This is useful where you want to test your job but have no real data available to process.

The metadata you specify on the output link determines the columns you are generating.

For decimal values, the Row Generator stage uses dfloat. As a result, the generated values are subject to the approximate nature of floating-point numbers. Not all of the values in the valid range of a floating-point number are representable: the further a value is from zero and the greater the number of significant digits, the wider the gaps between representable values.

Sample

The Sample stage is a Development/Debug stage that samples an input data set.

It can have a single input link and any number of output links when operating in percent mode, or a single input and single output link when operating in period mode. It is one of a number of stages that help you sample data.

The Sample stage operates in two modes. In Percent mode, it extracts rows, selecting them by means of a random number generator, and writes a given percentage of these to each output data set. You specify the number of output data sets, the percentage written to each, and a seed value to start the random number generator. You can reproduce a given distribution by repeating the same number of outputs, the percentage, and the seed value.

In Period mode, it extracts every Nth row from each partition, where N is the period, which you supply. In this case all rows will be output to a single data set, so the stage used in this mode can have only a single output link.
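
An illustrative Python sketch of the two modes follows; the 10 percent rate, the period of 3, and the seed are arbitrary assumptions.

  import random

  rows = list(range(1, 101))   # stand-in for one partition of an input data set

  # Percent mode: a seeded random number generator decides which rows are
  # written to the output; reusing the same seed reproduces the same sample.
  rng = random.Random(42)
  percent_sample = [r for r in rows if rng.random() < 0.10]

  # Period mode: every Nth row of the partition is written to the output.
  period = 3
  period_sample = rows[period - 1::period]

  print(len(percent_sample), period_sample[:5])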

Slowly changing dimension

The Slowly Changing Dimension (SCD) stage is a processing stage that works within the context of a star schema database. The SCD stage has a single input link, a single output link, a dimension reference link, and a dimension update link.

The SCD stage reads source data on the input link, performs a dimension table lookup on the reference link, and writes data on the output link. The output link can pass data to another SCD stage, to a different type of processing stage, or to a fact table. The dimension update link is a separate output link that carries changes to the dimension. You can perform these steps in a single job or a series of jobs, depending on the number of dimensions in your database and your performance requirements.

Sort

The Sort stage is a processing stage that is used to perform more complex sort operations than can be provided for on the Input page Partitioning tab of parallel job stage editors.

You can also use the Sort stage to insert a more explicit simple sort operation where you want to make your job easier to understand. The Sort stage has a single input link which carries the data to be sorted, and a single output link carrying the sorted data.

You specify sorting keys as the criteria on which to perform the sort. A key is a column on which to sort the data, for example, if you had a name column you might specify that as the sort key to produce an alphabetical list of names. The first column you specify as a key to the stage is the primary key, but you can specify additional secondary keys. If multiple rows have the same value for the primary key column, then the system uses the secondary columns to sort these rows.

The stage uses temporary disk space when performing a sort. It looks in the following locations, in the following order, for this temporary space.

  1. Scratch disks in the disk pool sort (you can create these pools in the configuration file).
  2. Scratch disks in the default disk pool (scratch disks are included here by default).
  3. The directory specified by the TMPDIR environment variable.
  4. The directory /tmp.

You might perform a sort for several reasons. For example, you might want to sort a data set by a zip code column, then by last name within the zip code. Once you have sorted the data set, you can filter the data set by comparing adjacent records and removing any duplicates.

However, you must be careful when processing a sorted data set: many types of processing, such as repartitioning, can destroy the sort order of the data. For example, assume you sort a data set on a system with four processing nodes and store the results to a data set stage. The data set will therefore have four partitions. You then use that data set as input to a stage executing on a different number of nodes, possibly due to node constraints. The system automatically repartitions a data set to spread out the data set to all nodes in the system, unless you tell it not to, possibly destroying the sort order of the data. You could avoid this by specifying the Same partitioning method. The stage does not perform any repartitioning as it reads the input data set; the original partitions are preserved.

You must also be careful when using a stage operating sequentially to process a sorted data set. A sequential stage executes on a single processing node to perform its action. Sequential stages will collect the data where the data set has more than one partition, which might also destroy the sorting order of its input data set. You can overcome this if you specify the collection method as follows:

  • If the data was range partitioned before being sorted, you should use the ordered collection method to preserve the sort order of the data set. Using this collection method causes all the records from the first partition of a data set to be read first, then all records from the second partition, and so on.
  • If the data was hash partitioned before being sorted, you should use the sort merge collection method specifying the same collection keys as the data was partitioned on.

Split Subrecord

The Split Subrecord stage separates an input subrecord field into a set of top-level vector columns.

Split Vector

The Split Vector stage is a restructure stage. It promotes the elements of a fixed-length vector to a set of similarly named top-level columns.

Surrogate Key Generator

The Surrogate Key Generator stage is a processing stage that generates surrogate key columns and maintains the key source.

A surrogate key is a unique primary key that is not derived from the data that it represents, therefore changes to the data do not change the primary key. In a star schema database, surrogate keys are used to join a fact table to a dimension table.

Switch

The Switch stage is a processing stage that takes a single data set as input and assigns each input row to an output data set based on the value of a selector field. You can choose one of the following selector modes; a sketch of the hash mode follows the list:
  • Select Auto when there are as many distinct selector values as output links.
  • Select Hash to hash the rows on the selector column modulo the number of output links and assign each row to an output link accordingly. In this case, the selector column must be of a type that is convertible to unsigned integer and must not be nullable.
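
The following Python sketch illustrates the hash mode only: each row is routed to the output link given by the selector value modulo the number of output links. The number of links, the selector column name, and the use of the raw value as the hash are assumptions.

  NUM_OUTPUT_LINKS = 3

  rows = [{"id": n, "branch_code": 100 + n} for n in range(6)]

  # Hash mode: the selector value modulo the number of output links
  # decides which output link receives the row.
  links = [[] for _ in range(NUM_OUTPUT_LINKS)]
  for row in rows:
      link_number = row["branch_code"] % NUM_OUTPUT_LINKS
      links[link_number].append(row["id"])

  print(links)   # row ids grouped by the output link they were routed to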

Tail

The Tail Stage is a Development/Debug stage.

The Tail Stage can have a single input link and a single output link. It is one of a number of stages that help you sample data.

The Tail Stage selects the last N records from each partition of an input data set and copies the selected records to an output data set. You determine which records are copied by setting properties which allow you to specify:

  • The number of records to copy
  • The partition from which the records are copied

This stage is helpful in testing and debugging applications with large data sets. For example, the Partition property lets you see data from a single partition to determine if the data is being partitioned as you want it to be. The Skip property lets you access a certain portion of a data set.

Transformer

Transformer stages allow you to create transformations to apply to your data. These transformations can be simple or complex and can be applied to individual columns in your data. Transformations are specified using a set of functions.

Transformer stages can have a single input and any number of outputs. They can also have a reject link that takes any rows that have not been written to any of the output links because of a write failure or expression evaluation failure.

Wave Generator

Use the Wave Generator stage to monitor a stream of data and insert end-of-wave markers where needed.

Write Range Map

The writerangemap operator takes an input data set produced by sampling and partition sorting a data set and writes it to a file in a form usable by the range partitioner. The range partitioner uses the sampled and sorted data set to determine partition boundaries.