Aggregator stage: Stage tab (DataStage®)

The Aggregator stage tab enables you to control aspects of the Aggregator stage.

Double-click the stage to open the stage properties panel. The Properties section lets you specify what the stage does. The Advanced section allows you to specify how the stage executes. Specify an optional description of the stage.

Properties section

Use the Properties section to define what the stage actually does.

The following table lists the properties and their attributes.

Table 1. Properties
Category/Property Values Default Mandatory? Repeats? Dependent of
Grouping Keys/Group Input column N/A Y Y N/A
Grouping Keys/Case Sensitive True/ False True N N Group
Aggregations/Aggregation Type Calculation/ Recalculation/ Count rows Calculation Y N N/A
Aggregations/Column for Calculation Input column N/A Y (if Aggregation Type = Calculation) Y N/A
Aggregations/Count Output Column Output column N/A Y (if Aggregation Type = Count Rows) Y N/A
Aggregations/ Summary Column for Recalculation Input column N/A Y (if Aggregation Type = Recalculation) Y N/A
Aggregations/ Default To Decimal Output precision, scale 8,2 N N N/A
Aggregations/Corrected Sum of Squares Output column N/A N N Column for Calculation & Summary Column for Recalculation
Aggregations/Maximum Value Output column N/A N N Column for Calculation & Summary Column for Recalculation
Aggregations/Mean Value Output column N/A N N Column for Calculation & Summary Column for Recalculation
Aggregations/Minimum Value Output column N/A N N Column for Calculation & Summary Column for Recalculation
Aggregations/Missing Value Output column N/A N Y Column for Calculation
Aggregations/Missing Values Count Output column N/A N N Column for Calculation & Summary Column for Recalculation
Aggregations/Non-missing Values Count Output column N/A N N Column for Calculation & Summary Column for Recalculation
Aggregations/Percent Coefficient of Variation Output column N/A N N Column for Calculation & Summary Column for Recalculation
Aggregations/Range Output column N/A N N Column for Calculation & Summary Column for Recalculation
Aggregations/Standard Deviation Output column N/A N N Column for Calculation & Summary Column for Recalculation
Aggregations/Standard Error Output column N/A N N Column for Calculation & Summary Column for Recalculation
Aggregations/Sum of Weights Output column N/A N N Column for Calculation & Summary Column for Recalculation
Aggregations/Sum Output column N/A N N Column for Calculation & Summary Column for Recalculation
Aggregations/Summary Output column N/A N N Column for Calculation & Summary Column for Recalculation
Aggregations/Uncorrected Sum of Squares Output column N/A N N Column for Calculation & Summary Column for Recalculation
Aggregations/Variance Output column N/A N N Column for Calculation & Summary Column for Recalculation
Aggregations/Variance divisor Default/ Nrecs Default N N Variance
Aggregations/Calculation and Recalculation Dependent Properties Input column N/A N N Column for Calculation or Count Output Column
Aggregations/Decimal Output precision, scale 8,2 N N Calculation or Recalculation method
Options/Grouping keys hash/sort hash Y Y N/A
Options/Allow Null Outputs True/ False False Y N N/A

Grouping keys

Group

Repeat the property to select multiple columns as group keys. Click Edit to select several group keys at once if required. This property has a dependent property:

  • Case Sensitive

    Use this to specify whether each group key is case sensitive or not, this is set to True by default, that is, the values "CASE" and "case" in would end up in different groups.

Aggregation category

Aggregation type
Choose calculation (the default), recalculation, or count rows.
Column for calculation
Click Edit in the Aggregations section to select a column for calculation. The Calculate aggregate type allows you to summarize the contents of a particular column or columns in your input data set by applying one or more aggregate functions to it. Select the column to be aggregated, then select dependent properties to specify the operation to perform on it, and the output column to carry the result. You can use the Column Selection dialog box to select several columns for calculation at once if required).
Count output column
The Count Rows aggregate type performs a count of the number of records within each group. Specify the column on which the count is output.
Summary column for recalculation
This aggregate type allows you to apply aggregate functions to a column that has already been summarized. This is like calculate but performs the specified aggregate operation on a set of data that has already been summarized. In practice this means you should have performed a calculate (or recalculate) operation in a previous Aggregator stage with the Summary property set to produce a subrecord containing the summary data that is then included with the data set. Select the column to be aggregated, then select dependent properties to specify the operation to perform on it, and the output column to carry the result. You can use the Column Selection dialog box to select several columns for recalculation at once if required).
Weighting column
Configures the stage to increment the count for the group by the contents of the weight column for each record in the group, instead of by 1. Not available for Summary Column for Recalculation. Setting this option affects only the following options:
  • Percent Coefficient of Variation
  • Mean Value
  • Sum
  • Sum of Weights
  • Uncorrected Sum of Squares
Default to decimal output
The output type of a calculation or recalculation column is double. Setting this property causes it to default to decimal. You can specify that individual columns have decimal output while others retain the default type of double. You can also set a default precision and scale.

Options

Method

Your choice of mode depends primarily on the number of groupings in the input data set, taking into account the amount of memory available. You typically use hash mode for a relatively small number of groups; generally, fewer than about 1000 groups per megabyte of memory to be used.

When using hash mode, you should hash partition the input data set by one or more of the grouping key columns so that all the records in the same group are in the same partition. However, hash partitioning is not mandatory, you can use any partitioning method you choose if keeping groups together in a single partition is not important. For example, if you're summing records in each partition and later you'll add the sums across all partitions, you don't need all records in a group to be in the same partition to do this. Note, though, that there will be multiple output records for each group.

If the number of groups is large, which can happen if you specify many grouping keys, or if some grouping keys can take on many values, you would normally use sort mode. However, sort mode requires the input data set to have been partition sorted with all of the grouping keys specified as hashing and sorting keys (this happens automatically if auto is set in the Partitioning tab). Sorting requires a pregrouping operation: after sorting, all records in a given group in the same partition are consecutive.

The method property is set to hash by default.

You might want to try both modes with your particular data and application to determine which gives the better performance. You might find that when calculating statistics on large numbers of groups, sort mode performs better than hash mode, assuming the input data set can be efficiently sorted before it is passed to group.

Allow null outputs
Set this to True to indicate that null is a valid output value when calculating minimum value, maximum value, mean value, standard deviation, standard error, sum, sum of weights, and variance. If False, the null value will have 0 substituted when all input values for the calculation column are null. It is False by default.

Advanced

Specify the following advanced properties:
  • Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data set is processed by the available nodes as specified in the Configuration file, and by any node constraints specified in the Advanced section. In Sequential mode the entire data set is processed by the conductor node.
  • Combinability mode. This is Auto by default, which allows IBM DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage.
  • Preserve partitioning. This is Set by default. You can select Set or Clear. If you select Set the stage will request that the next stage in the job attempt to maintain the partitioning.