Setting options for the Aggregate node

On the Aggregate node you specify the following.

One or more key fields to use as categories for the aggregation
One or more aggregate fields for which to calculate the aggregate values
One or more aggregation modes (types of aggregation) to output for each aggregate field

You can also specify the default aggregation modes to use for newly added fields, and use expressions (similar to formulae) to categorize aggregation.

Note that for added performance, aggregations operations may benefit from enabling parallel processing.

Key fields. Lists fields that can be used as categories for aggregation. Both continuous (numeric) and categorical fields can be used as keys. If you choose more than one key field, the values will be combined to produce a key value for aggregating records. One aggregated record will be generated for each unique key field. For example, if Sex and Region are your key fields, each unique combination of M and F with regions N and S (four unique combinations) will have an aggregated record. To add a key field, use the Field Chooser button to the right of the window.

The rest of the dialog box is split into two main areas - Basic Aggregates and Aggregate Expressions.

Basic Aggregates

Aggregate fields. Lists the fields for which values will be aggregated as well as the selected modes of aggregation. To add fields to this list, use the Field Chooser button on the right. The following aggregation modes are available.

Note: Some modes are not applicable to non-numeric fields (for example, Sum for a date/time field). Modes that cannot be used with a selected aggregate field are disabled.

Sum. Select to return summed values for each key field combination. The sum is the total of the values, across all cases with nonmissing values.
Mean. Select to return the mean values for each key field combination. The mean is a measure of central tendency, and is the arithmetic average (the sum divided by the number of cases).
Min. Select to return minimum values for each key field combination.
Max. Select to return maximum values for each key field combination.
SDev. Select to return the standard deviation for each key field combination. The standard deviation is a measure of dispersion around the mean, and is equal to the square root of the variance measurement.
Median. Select to return the median values for each key field combination. The median is a measure of central tendency that is not sensitive to outlying values (unlike the mean, which can be affected by a few extremely high or low values). Also known as the 50th percentile or 2nd quartile.
Count. Select to return the count of non-null values for each key field combination.
Variance. Select to return the variance values for each key field combination. The variance is a measure of dispersion around the mean, equal to the sum of squared deviations from the mean divided by one less than the number of cases.
1st Quartile. Select to return the 1st quartile (25th percentile) values for each key field combination.
3rd Quartile. Select to return the 3rd quartile (75th percentile) values for each key field combination.

Note: When running a stream containing an Aggregate node, the values returned for 1st and 3rd Quartiles when pushing back SQL to an Oracle database may differ from those returned in native mode.

Default mode. Specify the default aggregation mode to be used for newly added fields. If you frequently use the same aggregation, select one or more modes here and use the Apply to All button on the right to apply the selected modes to all fields listed above.

New field name extension. Select to add a suffix or prefix, such as 1 or new, to duplicate aggregated fields. For example, the result of a minimum values aggregation on the field Age will produce a field name called Age_Min_1 if you have selected the suffix option and specified 1 as the extension. Note that aggregation extensions such as _Min or Max_ are automatically added to the new field, indicating the type of aggregation performed. Select Suffix or Prefix to indicate your preferred extension style.

Include record count in field. Select to include an extra field in each output record called Record_Count, by default. This field indicates how many input records were aggregated to form each aggregate record. Create a custom name for this field by typing in the edit field.

Note: System null values are excluded when aggregates are computed, but they are included in the record count. Blank values, on the other hand, are included in both aggregation and record count. To exclude blank values, you can use a Filler node to replace blanks with null values. You can also remove blanks using a Select node.

Aggregate Expressions

Expressions are similar to formulas that are created from values, field names, operators, and functions. Unlike functions that work on a single record at a time, aggregate expressions operate on a group, set, or collection of records.

Note: You can only create aggregate expressions if the stream includes a database connection (by means of a Database source node).

New expressions are created as derived fields; to create an expression you use the Database Aggregates functions which are available from the Expression Builder.

For more information, see The Expression Builder.

Note that there is a connection between the Key Fields and any aggregate expressions you create because the aggregate expressions are grouped by the key field.

Valid aggregate expressions are ones that evaluate to aggregate outcomes; a couple of examples of valid aggregate expressions, and the rules that govern them, are as follows:

You can use scalar functions to combine multiple aggregation functions together to produce a single aggregation result. For example:
```
max(C01) - min(C01)
```
An aggregation function can operate on the result of multiple scalar functions. For example:
```
sum (C01*C01)
```