TwoStep clustering on IBM® Db2 Warehouse

TwoStep clustering is a data mining algorithm for large data sets. It is faster than traditional methods because it typically scans the data set only once, summarizing the data in a clustering feature (CF) tree during that scan.

Note: The TwoStep clustering stored procedures are implemented using Apache Spark and can be used only on a Db2 Warehouse system on which Spark capability is enabled. To find out whether Spark capability is enabled on your Db2 Warehouse system, ask your Db2 Warehouse administrator.

TwoStep clustering can make clustering decisions without repeated data scans. Other clustering methods scan all data points in each of several iterations. Because those methods do not group non-uniform points, every iteration must reinspect every data point, regardless of how significant that data point is. TwoStep clustering instead treats each dense area as a single unit and ignores pattern outliers, so it provides high-quality clustering results without exceeding memory constraints.
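The idea of treating a dense area as a single unit can be illustrated with a minimal sketch of a clustering feature for continuous columns. It is not the Db2 Warehouse implementation: the class name ClusteringFeature and its fields are assumptions made for illustration, and categorical columns are not covered. The sketch keeps only a count, a per-dimension sum, and a sum of squares, which is enough to recover the centroid and spread of the group after a single pass.

```python
from dataclasses import dataclass, field
from math import sqrt


@dataclass
class ClusteringFeature:
    """Hypothetical summary of a group of points (not the Db2 data structure)."""
    n: int = 0                                        # number of points absorbed
    linear_sum: list = field(default_factory=list)    # per-dimension sum
    square_sum: float = 0.0                           # sum of squared norms

    def add(self, point):
        # Fold one point into the summary; the point itself is not stored.
        if not self.linear_sum:
            self.linear_sum = [0.0] * len(point)
        self.n += 1
        self.linear_sum = [s + x for s, x in zip(self.linear_sum, point)]
        self.square_sum += sum(x * x for x in point)

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def radius(self):
        # Root-mean-square distance of the summarized points from the centroid.
        c2 = sum(c * c for c in self.centroid())
        return sqrt(max(self.square_sum / self.n - c2, 0.0))


cf = ClusteringFeature()
for p in [(1.0, 2.0), (3.0, 2.0), (2.0, 5.0)]:  # one pass over the data
    cf.add(p)

print(cf.n, cf.centroid(), cf.radius())
```

Because the summary has a fixed size per group, memory use depends on the number of groups rather than on the number of rows scanned.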

The TwoStep algorithm has the following advantages:

It automatically determines the optimal number of clusters. You do not have to manually create a different clustering model for each number of clusters.
It detects input columns that are not useful for the clustering process and automatically marks them as supplementary. Statistics are gathered for supplementary columns, but these columns do not influence the clustering algorithm.
The CF tree can be configured at a fine level of detail, so that you can balance memory usage against model quality according to your environment and needs.
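The trade-off between memory usage and model quality can be sketched with a simplified, flat stand-in for a CF tree. The function summarize and its threshold argument are hypothetical names chosen for this illustration and are not Db2 Warehouse parameters. Each point is either absorbed into the nearest existing summary, if it lies within the threshold of that summary's centroid, or it starts a new summary.

```python
from math import dist


def summarize(points, threshold):
    """Single pass: absorb each point into the nearest summary or open a new one."""
    entries = []  # each entry is [count, per-dimension sum]
    for p in points:
        best, best_d = None, None
        for e in entries:
            centroid = [s / e[0] for s in e[1]]
            d = dist(p, centroid)
            if best_d is None or d < best_d:
                best, best_d = e, d
        if best is not None and best_d <= threshold:
            # Close enough: update the existing summary in place.
            best[0] += 1
            best[1] = [s + x for s, x in zip(best[1], p)]
        else:
            # Too far from every summary: keep this point as a new entry.
            entries.append([1, list(p)])
    return entries


points = [(0.0, 0.0), (0.2, 0.0), (5.0, 5.0), (5.2, 5.0), (0.0, 0.2)]
fine = summarize(points, threshold=0.1)    # many small summaries, more memory
coarse = summarize(points, threshold=1.0)  # few large summaries, less detail

print(len(fine), len(coarse))
```

In this sketch a smaller threshold keeps more entries and preserves more detail, while a larger threshold uses less memory at the cost of coarser summaries. A real CF tree adds a hierarchy on top of this so that the nearest entry can be found without comparing against every summary.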