Aggregator stage: Stage tab (DataStage®)
The Aggregator stage tab enables you to control aspects of the Aggregator stage.
Double-click the stage to open the stage properties panel. The Properties section lets you specify what the stage does. The Advanced section allows you to specify how the stage executes. Specify an optional description of the stage.
Properties section
Use the Properties section to define what the stage actually does.
The following table lists the properties and their attributes.
Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of |
---|---|---|---|---|---|
Grouping Keys/Group | Input column | N/A | Y | Y | N/A |
Grouping Keys/Case Sensitive | True/ False | True | N | N | Group |
Aggregations/Aggregation Type | Calculation/ Recalculation/ Count rows | Calculation | Y | N | N/A |
Aggregations/Column for Calculation | Input column | N/A | Y (if Aggregation Type = Calculation) | Y | N/A |
Aggregations/Count Output Column | Output column | N/A | Y (if Aggregation Type = Count Rows) | Y | N/A |
Aggregations/ Summary Column for Recalculation | Input column | N/A | Y (if Aggregation Type = Recalculation) | Y | N/A |
Aggregations/ Default To Decimal Output | precision, scale | 8,2 | N | N | N/A |
Aggregations/Corrected Sum of Squares | Output column | N/A | N | N | Column for Calculation & Summary Column for Recalculation |
Aggregations/Maximum Value | Output column | N/A | N | N | Column for Calculation & Summary Column for Recalculation |
Aggregations/Mean Value | Output column | N/A | N | N | Column for Calculation & Summary Column for Recalculation |
Aggregations/Minimum Value | Output column | N/A | N | N | Column for Calculation & Summary Column for Recalculation |
Aggregations/Missing Value | Output column | N/A | N | Y | Column for Calculation |
Aggregations/Missing Values Count | Output column | N/A | N | N | Column for Calculation & Summary Column for Recalculation |
Aggregations/Non-missing Values Count | Output column | N/A | N | N | Column for Calculation & Summary Column for Recalculation |
Aggregations/Percent Coefficient of Variation | Output column | N/A | N | N | Column for Calculation & Summary Column for Recalculation |
Aggregations/Range | Output column | N/A | N | N | Column for Calculation & Summary Column for Recalculation |
Aggregations/Standard Deviation | Output column | N/A | N | N | Column for Calculation & Summary Column for Recalculation |
Aggregations/Standard Error | Output column | N/A | N | N | Column for Calculation & Summary Column for Recalculation |
Aggregations/Sum of Weights | Output column | N/A | N | N | Column for Calculation & Summary Column for Recalculation |
Aggregations/Sum | Output column | N/A | N | N | Column for Calculation & Summary Column for Recalculation |
Aggregations/Summary | Output column | N/A | N | N | Column for Calculation & Summary Column for Recalculation |
Aggregations/Uncorrected Sum of Squares | Output column | N/A | N | N | Column for Calculation & Summary Column for Recalculation |
Aggregations/Variance | Output column | N/A | N | N | Column for Calculation & Summary Column for Recalculation |
Aggregations/Variance divisor | Default/ Nrecs | Default | N | N | Variance |
Aggregations/Calculation and Recalculation Dependent Properties | Input column | N/A | N | N | Column for Calculation or Count Output Column |
Aggregations/Decimal Output | precision, scale | 8,2 | N | N | Calculation or Recalculation method |
Options/Grouping keys | hash/sort | hash | Y | Y | N/A |
Options/Allow Null Outputs | True/ False | False | Y | N | N/A |
Grouping keys
Group
Repeat the property to select multiple columns as group keys. Click Edit to select several group keys at once if required. This property has a dependent property:
- Case Sensitive
Use this to specify whether each group key is case sensitive or not, this is set to True by default, that is, the values "CASE" and "case" in would end up in different groups.
Aggregation category
- Aggregation type
- Choose calculation (the default), recalculation, or count rows.
- Column for calculation
- Click Edit in the Aggregations section to select a column for calculation. The Calculate aggregate type allows you to summarize the contents of a particular column or columns in your input data set by applying one or more aggregate functions to it. Select the column to be aggregated, then select dependent properties to specify the operation to perform on it, and the output column to carry the result. You can use the Column Selection dialog box to select several columns for calculation at once if required).
- Count output column
- The Count Rows aggregate type performs a count of the number of records within each group. Specify the column on which the count is output.
- Summary column for recalculation
- This aggregate type allows you to apply aggregate functions to a column that has already been summarized. This is like calculate but performs the specified aggregate operation on a set of data that has already been summarized. In practice this means you should have performed a calculate (or recalculate) operation in a previous Aggregator stage with the Summary property set to produce a subrecord containing the summary data that is then included with the data set. Select the column to be aggregated, then select dependent properties to specify the operation to perform on it, and the output column to carry the result. You can use the Column Selection dialog box to select several columns for recalculation at once if required).
- Weighting column
- Configures the stage to increment the count for the group by the contents of the weight column
for each record in the group, instead of by 1. Not available for Summary Column for Recalculation.
Setting this option affects only the following options:
- Percent Coefficient of Variation
- Mean Value
- Sum
- Sum of Weights
- Uncorrected Sum of Squares
- Default to decimal output
- The output type of a calculation or recalculation column is double. Setting this property causes it to default to decimal. You can specify that individual columns have decimal output while others retain the default type of double. You can also set a default precision and scale.
Options
- Method
-
Your choice of mode depends primarily on the number of groupings in the input data set, taking into account the amount of memory available. You typically use hash mode for a relatively small number of groups; generally, fewer than about 1000 groups per megabyte of memory to be used.
When using hash mode, you should hash partition the input data set by one or more of the grouping key columns so that all the records in the same group are in the same partition. However, hash partitioning is not mandatory, you can use any partitioning method you choose if keeping groups together in a single partition is not important. For example, if you're summing records in each partition and later you'll add the sums across all partitions, you don't need all records in a group to be in the same partition to do this. Note, though, that there will be multiple output records for each group.
If the number of groups is large, which can happen if you specify many grouping keys, or if some grouping keys can take on many values, you would normally use sort mode. However, sort mode requires the input data set to have been partition sorted with all of the grouping keys specified as hashing and sorting keys (this happens automatically if auto is set in the Partitioning tab). Sorting requires a pregrouping operation: after sorting, all records in a given group in the same partition are consecutive.
The method property is set to hash by default.
You might want to try both modes with your particular data and application to determine which gives the better performance. You might find that when calculating statistics on large numbers of groups, sort mode performs better than hash mode, assuming the input data set can be efficiently sorted before it is passed to group.
- Allow null outputs
- Set this to True to indicate that null is a valid output value when calculating minimum value, maximum value, mean value, standard deviation, standard error, sum, sum of weights, and variance. If False, the null value will have 0 substituted when all input values for the calculation column are null. It is False by default.
Advanced
- Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data set is processed by the available nodes as specified in the Configuration file, and by any node constraints specified in the Advanced section. In Sequential mode the entire data set is processed by the conductor node.
- Combinability mode. This is Auto by default, which allows IBM DataStage to combine the operators that underlie parallel stages so that they run in the same process if it is sensible for this type of stage.
- Preserve partitioning. This is Set by default. You can select Set or Clear. If you select Set the stage will request that the next stage in the job attempt to maintain the partitioning.