Aggregator stage: Stage tab (DataStage®)

The Aggregator stage tab enables you to control aspects of the Aggregator stage.

Double-click the stage to open the stage properties panel. The Properties section lets you specify what the stage does. The Advanced section allows you to specify how the stage executes. Specify an optional description of the stage.

Properties section
Grouping keys
Aggregation category
Options

Properties section

Use the Properties section to define what the stage actually does.

The following table lists the properties and their attributes.

Table 1. Properties
Category/Property	Values	Default	Mandatory?	Repeats?	Dependent of
Grouping Keys/Group	Input column	N/A	Y	Y	N/A
Grouping Keys/Case Sensitive	True/ False	True	N	N	Group
Aggregations/Aggregation Type	Calculation/ Recalculation/ Count rows	Calculation	Y	N	N/A
Aggregations/Column for Calculation	Input column	N/A	Y (if Aggregation Type = Calculation)	Y	N/A
Aggregations/Count Output Column	Output column	N/A	Y (if Aggregation Type = Count Rows)	Y	N/A
Aggregations/ Summary Column for Recalculation	Input column	N/A	Y (if Aggregation Type = Recalculation)	Y	N/A
Aggregations/ Default To Decimal Output	precision, scale	8,2	N	N	N/A
Aggregations/Corrected Sum of Squares	Output column	N/A	N	N	Column for Calculation & Summary Column for Recalculation
Aggregations/Maximum Value	Output column	N/A	N	N	Column for Calculation & Summary Column for Recalculation
Aggregations/Mean Value	Output column	N/A	N	N	Column for Calculation & Summary Column for Recalculation
Aggregations/Minimum Value	Output column	N/A	N	N	Column for Calculation & Summary Column for Recalculation
Aggregations/Missing Value	Output column	N/A	N	Y	Column for Calculation
Aggregations/Missing Values Count	Output column	N/A	N	N	Column for Calculation & Summary Column for Recalculation
Aggregations/Non-missing Values Count	Output column	N/A	N	N	Column for Calculation & Summary Column for Recalculation
Aggregations/Percent Coefficient of Variation	Output column	N/A	N	N	Column for Calculation & Summary Column for Recalculation
Aggregations/Range	Output column	N/A	N	N	Column for Calculation & Summary Column for Recalculation
Aggregations/Standard Deviation	Output column	N/A	N	N	Column for Calculation & Summary Column for Recalculation
Aggregations/Standard Error	Output column	N/A	N	N	Column for Calculation & Summary Column for Recalculation
Aggregations/Sum of Weights	Output column	N/A	N	N	Column for Calculation & Summary Column for Recalculation
Aggregations/Sum	Output column	N/A	N	N	Column for Calculation & Summary Column for Recalculation
Aggregations/Summary	Output column	N/A	N	N	Column for Calculation & Summary Column for Recalculation
Aggregations/Uncorrected Sum of Squares	Output column	N/A	N	N	Column for Calculation & Summary Column for Recalculation
Aggregations/Variance	Output column	N/A	N	N	Column for Calculation & Summary Column for Recalculation
Aggregations/Variance divisor	Default/ Nrecs	Default	N	N	Variance
Aggregations/Calculation and Recalculation Dependent Properties	Input column	N/A	N	N	Column for Calculation or Count Output Column
Aggregations/Decimal Output	precision, scale	8,2	N	N	Calculation or Recalculation method
Options/Grouping keys	hash/sort	hash	Y	Y	N/A
Options/Allow Null Outputs	True/ False	False	Y	N	N/A

Grouping keys

Group

Repeat the property to select multiple columns as group keys. Click Edit to select several group keys at once if required. This property has a dependent property:

Case Sensitive
Use this to specify whether each group key is case sensitive or not, this is set to True by default, that is, the values "CASE" and "case" in would end up in different groups.

Aggregation category

Aggregation type

Choose calculation (the default), recalculation, or count rows.

Column for calculation

Click Edit in the Aggregations section to select a column for calculation. The Calculate aggregate type allows you to summarize the contents of a particular column or columns in your input data set by applying one or more aggregate functions to it. Select the column to be aggregated, then select dependent properties to specify the operation to perform on it, and the output column to carry the result. You can use the Column Selection dialog box to select several columns for calculation at once if required).

Count output column

The Count Rows aggregate type performs a count of the number of records within each group. Specify the column on which the count is output.

Summary column for recalculation

This aggregate type allows you to apply aggregate functions to a column that has already been summarized. This is like calculate but performs the specified aggregate operation on a set of data that has already been summarized. In practice this means you should have performed a calculate (or recalculate) operation in a previous Aggregator stage with the Summary property set to produce a subrecord containing the summary data that is then included with the data set. Select the column to be aggregated, then select dependent properties to specify the operation to perform on it, and the output column to carry the result. You can use the Column Selection dialog box to select several columns for recalculation at once if required).

Weighting column

Configures the stage to increment the count for the group by the contents of the weight column for each record in the group, instead of by 1. Not available for Summary Column for Recalculation. Setting this option affects only the following options:

Percent Coefficient of Variation
Mean Value
Sum
Sum of Weights
Uncorrected Sum of Squares

Default to decimal output

The output type of a calculation or recalculation column is double. Setting this property causes it to default to decimal. You can specify that individual columns have decimal output while others retain the default type of double. You can also set a default precision and scale.

Options

Method

Your choice of mode depends primarily on the number of groupings in the input data set, taking into account the amount of memory available. You typically use hash mode for a relatively small number of groups; generally, fewer than about 1000 groups per megabyte of memory to be used.

When using hash mode, you should hash partition the input data set by one or more of the grouping key columns so that all the records in the same group are in the same partition. However, hash partitioning is not mandatory, you can use any partitioning method you choose if keeping groups together in a single partition is not important. For example, if you're summing records in each partition and later you'll add the sums across all partitions, you don't need all records in a group to be in the same partition to do this. Note, though, that there will be multiple output records for each group.

If the number of groups is large, which can happen if you specify many grouping keys, or if some grouping keys can take on many values, you would normally use sort mode. However, sort mode requires the input data set to have been partition sorted with all of the grouping keys specified as hashing and sorting keys (this happens automatically if auto is set in the Partitioning tab). Sorting requires a pregrouping operation: after sorting, all records in a given group in the same partition are consecutive.

The method property is set to hash by default.

You might want to try both modes with your particular data and application to determine which gives the better performance. You might find that when calculating statistics on large numbers of groups, sort mode performs better than hash mode, assuming the input data set can be efficiently sorted before it is passed to group.

Allow null outputs

Set this to True to indicate that null is a valid output value when calculating minimum value, maximum value, mean value, standard deviation, standard error, sum, sum of weights, and variance. If False, the null value will have 0 substituted when all input values for the calculation column are null. It is False by default.

Advanced

Specify the following advanced properties:

Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data set is processed by the available nodes as specified in the Configuration file, and by any node constraints specified in the Advanced section. In Sequential mode the entire data set is processed by the conductor node.
Combinability mode. This is Auto by default, which allows IBM DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage.
Preserve partitioning. This is Set by default. You can select Set or Clear. If you select Set the stage will request that the next stage in the job attempt to maintain the partitioning.