Processing stage types

Database native SQL does not provide some of the functionality provided by certain stage types that you can use in job designs.

Transformer stage

InfoSphere® DataStage® Balanced Optimization optimizes the functionality contained in a Transformer stage. It does not attempt to optimize functionality contained in a BASIC Transformer stage.

SQL does not support static variables. Therefore any Transformer stage that contains a stage variable whose derivation either directly or indirectly refers to itself cannot be optimized. Such logic is often used to implement functionality such as accumulators and control break logic. To be optimized, this logic must be expressed in other ways, for example, by using aggregation.

SQL does not support temporary variables. There is no way for an expression that produces one select list element to refer to any other select list element. Therefore, any Transformer stage that contains an output column derivation that refers to another output column cannot be optimized. To enable optimization, compute common subexpressions as stage variables, and then use them in multiple-output column derivations.

The use of static variables or temporary variables is detected when you attempt to optimize a job that contains them. The InfoSphere Balanced Optimization tool provides detailed log information about why the job could not be optimized. It might be possible to reorganize the computations to allow optimization of that job.

By default, InfoSphere DataStage supports intermediate results with a precision/scale of decimal(38,10) (up to 38 total decimal digits, of which up to 10 can be fractional digits) during complex arithmetic calculations. SQL databases individually support different maximums. For example, Teradata releases earlier than V6.2 support a maximum precision of 18 decimal digits. So complex calculations with decimal data types, when pushed into database SQL, can exhibit rounding errors due to lower precision/scale. To ensure comparable results, you can set the InfoSphere DataStage environment variables APT_DECIMAL_INTERM_PRECISION, APT_DECIMAL_INTERM_SCALE, and APT_DECIMAL_INTERM_ROUNDMODE to control the precision, scale, and rounding of intermediate results.

NULL and 0 values are often equivalent in InfoSphere DataStage. In a root job, the logical expression Link.Column = 0 might evaluate to true if Link.Column has a NULL value. SQL has no such equivalence. If the intent is to test for NULL, then use the function IsNull(Link.Column). The equivalence of 0 and NULL can cause a problem where the Transformer stage consumes the output of any stage that can produce NULL values (for example, a Join stage performing an outer join).

Filter stage

The InfoSphere DataStage Balanced Optimization tool optimizes the functionality contained in the Filter stage. When the Filter stage is optimized, all the predicates specified in the Filter stage get combined with SQL statements in the connector.

The InfoSphere Balanced Optimization tool does not support the following Filter stage functions:
  • Nulls value (when set to Greater than)
  • Output rows only once (when set to True)

Aggregator stage

The InfoSphere DataStage Balanced Optimization tool supports the calculation aggregation type. It does not support the count or recalculation aggregation types.

The InfoSphere Balanced Optimization tool does not support the following aggregation functions:
  • Corrected sum of squares
  • Missing value
  • Preserve Type
  • Summary

For certain statistical functions, such as standard deviation and variance, there might be small differences in calculated answers between the root job and an optimized job where aggregation is pushed into the database. This is because of differences in parallel calculation algorithms, and the precision of intermediate results in InfoSphere DataStage and the database. These differences are typically beyond the fourth significant digit and are insignificant for practical purposes. Values for most other calculations (count, sum, mean) are unaffected.

Remove duplicates stage

The InfoSphere DataStage Balanced Optimization tool optimizes the functionality contained in the Remove Duplicates stage. Key columns defined in the Remove Duplicates stage are used to group rows with the same values in non-key columns. For the Duplicate to retain option, both the First (keep first row in each group) and Last (keep last row in each group) values are supported. The Remove Duplicates stage does not support Case Sensitive for key column option = FALSE. Additionally, the Remove Duplicates stage can only be pushed to a source connector.

Join stage

The InfoSphere DataStage Balanced Optimization tool supports joins with two inputs.

By default, the Join stage sorts data on both input links. This sorting results in output data that is sorted on the join keys. If the InfoSphere DataStage project environment variable APT_NO_SORT_INSERTION is set, the sort is suppressed. Pushing a Join stage into a database source always generates an ORDER BY clause for the sort keys, as if APT_NO_SORT_INSERTION was not set.

InfoSphere DataStage performs implicit type conversions to make join keys of different types comparable. SQL, however, is much stricter and such implicit conversions can lead to runtime database errors. Therefore Join key columns within different type groups (for examples, numbers and strings) cannot be optimized.

Lookup stage

The InfoSphere DataStage Balanced Optimization tool supports single lookups with two input links.

The following lookup conditions are supported:
  • Lookup key equality (case sensitive for character-type columns)
  • Multi-key lookups
The following lookup conditions are not supported:
  • Range lookups
  • Case-insensitive key equality lookups on character-type columns.

Only continue and drop actions are supported for lookup failure and condition-not-met. Both lookup failure and condition-not-met must be set to the same actions. In any other case, a warning is given in the optimization log and optimization of the lookup stage is not performed.

InfoSphere DataStage performs implicit type conversions to make join keys of different types comparable. SQL, however, is much stricter and such implicit conversions can lead to runtime database errors. Therefore Lookup key columns within different type groups (for examples, numbers and strings) cannot be optimized.

By default, for a normal (non-sparse) lookup, InfoSphere DataStage uses only one row matching on the lookup key or keys from a lookup table, and ignores any other lookup table rows that match. This behavior is not supported by SQL joins. However, since the typical InfoSphere DataStage usage is for lookup tables to have unique rows, Balanced Optimization issues a warning to remind you to ensure that the lookup table has unique rows. To eliminate this warning, set the Multiple rows returned from link property on the Lookup stage to the name of the reference link, or set the Lookup type property of the database connector supplying the reference table to sparse. If the lookup table does not have unique rows on the lookup keys, and the intended behavior is to use only one matching row, do not optimize this lookup.

Copy stage

The InfoSphere DataStage Balanced Optimization tool optimizes the functionality contained in the Copy stage. The Copy stage is not optimized if the Force option is set to True. When the Copy stage is optimized, it is removed from the job.

Sort stage

The InfoSphere DataStage Balanced Optimization tool optimizes the functionality contained in the Sort stage. When the Sort stage is optimized, an equivalent SQL ORDER BY clause is generated.

The Sort stage is not optimized if any of these option values are set:
  • Output Statistics = True
  • Create Key Change Column = True
  • Create Cluster Key Change Column = True
  • Sort as EBCDIC = True
  • Nulls Position = Last
  • Stable Sort = True

Containers

The InfoSphere DataStage Balanced Optimization tool does not support containers. Optimization stops at any container boundaries.

For a job with a local container, use the local container Deconstruct option to make the container contents part of the job, and so eligible for optimization. Select the local container stage in the job, right-click, and select Deconstruct.

For a job with a shared container, use the Convert to local container option, and then deconstruct the local container. Select the shared container stage in the job, right-click, and select Convert to local container.

Stages with multiple output links or reject links

The Transformer, Filter, and Copy processing stages can have multiple output links including reject links. Reject link optimization is not widely supported, except in the Filter stage. A reject link in the Filter stage works like an output link, except that it is associated with an Otherwise predicate. For a Transformer stage or Copy stage without a reject link or a Filter stage, multiple output links can be optimized. In the optimization process, each of these stages with multiple output links is first split into duplicate copies, one for each output link. In this expanded intermediate job, each stage has only one output link, and is optimized like any other job that contains stages with one output link.