The data sources you are extracting data from can contain many thousands of rows of data. For example, the data in a sales database can contain information about each transaction or sale. You could pass all this data into your data warehouse. However, this would mean you would have to search through large volumes of data in the data warehouse before you get the results you need.
If you only want summary information, for example, the total of product A sold since 01/01/96, you can aggregate your data and only pass the summed total to the data warehouse. This reduces the amount of data you store in the data warehouse, speeds up the time taken to find the data you want, and ensures the data warehouse stores data in a format you need.
The Aggregator stage allows you to group by or summarize any columns on any of the output links.
A group of input data is a set of input rows that share the same values for all the grouped by columns. For example, if your sales database contained information about three different products, A, B, and C, you could group by the Product column. All the information about product A would be grouped together, as would all the information for products B and C.
By summarizing data, you can perform basic calculations on the values in a particular column. The actions you can perform depend on the SQL data type of the selected column.
For numeric SQL data types you can perform the following actions:
In calculating Standard Deviation, IBM® InfoSphere® DataStage® uses the formula:
standarddeviation = sqrt [ (sum(Xi2) - N avg(Xi)2) / N ]
Some other packages, such as Microsoft Excel, use the formula:
standarddeviation = sqrt [ (sum(Xi2) - N avg(Xi)2) / (N-1)]
For any other SQL data types you can perform the following actions:
For example, if you want to know the total number of product A sold, you would sum the values in the QtySold column.
To group by or summarize a column, you must edit the Derivation column in the Output Column dialog box. Do this by double-clicking the cell to open the Derivation dialog box.
The Derivation dialog box contains the following fields and option:
If you want to group by the column, select the Group by this column check box. The aggregate function is automatically set to (grouped), and you cannot select an aggregate function from the list.
To use an aggregate function, clear the Group by this column check box and select the function you want to use from the Aggregate function list.
Click OK to save the settings for the column.