Background of TwoStep clustering

To cluster data, the TwoStep clustering algorithm scans the data, builds a cluster tree, and clusters the leaves of the tree hierarchically. Moreover, it refines the clustering result.

The following list describes the clustering tasks more detailed.

  1. The TwoStep clustering algorithm scans all data and builds a clustering feature (CF) tree. This tree is built by arranging the input records in a way that similar records become part of the same tree node. If there is a memory issue, the tree is rebuilt with an increased threshold and outliers are removed.
  2. The leaves of the CF tree are clustered hierarchically in memory. The clustering is done by calculating the n * (n-1) / 2 distances between each pair of leaves and merging the two clusters with the smallest distance. The process of calculating distances between clusters and merging the closest two is repeated until the root of the tree is reached. All data is contained in one cluster, thus forming a binary tree. Starting with the root node of the binary tree, the child node with the worst quality is added. The process of adding child nodes continues until the number of clusters that is determined automatically or that is specified by the user is reached. The number of clusters that is determined automatically is also the optimal number of clusters.
  3. The clustering result is refined by a final pass over the data where each record is assigned to the closest cluster. This behavior is similar to the behavior of the K-means algorithm.