Usage of TwoStep clustering

Unlike the K-Means and divisive clustering algorithms, the TwoStep algorithm can determine an optimal number of clusters automatically. It also supports log-likelihood distance, a distribution-based distance measure that is suitable for both nominal and numerical attributes.
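To illustrate why a distribution-based measure can handle both attribute types, the following is a minimal sketch of the log-likelihood distance as it is commonly formulated for TwoStep clustering: the distance between two clusters is the decrease in log-likelihood that results from merging them. Numerical attributes contribute a variance term and nominal attributes contribute an entropy term. The function names (`xi`, `log_likelihood_distance`), the row layout, and the sample data are assumptions made for this illustration. Whether the product computes the distance in exactly this form is also an assumption.

```python
import math
from collections import Counter
from statistics import pvariance


def xi(rows, global_var):
    """Log-likelihood term of one cluster.

    Each row is (numeric_values, nominal_values). global_var holds the
    variance of every numeric attribute over the whole data set.
    """
    n = len(rows)
    numeric_cols = list(zip(*(r[0] for r in rows)))
    nominal_cols = list(zip(*(r[1] for r in rows)))

    # Numerical attributes: half the log of global plus in-cluster variance.
    numeric_part = sum(
        0.5 * math.log(gv + pvariance(col))
        for gv, col in zip(global_var, numeric_cols)
    )

    # Nominal attributes: entropy of the category frequencies in the cluster.
    entropy_part = 0.0
    for col in nominal_cols:
        for count in Counter(col).values():
            p = count / n
            entropy_part -= p * math.log(p)

    return -n * (numeric_part + entropy_part)


def log_likelihood_distance(a, b, global_var):
    """Distance = loss in log-likelihood when clusters a and b are merged."""
    return xi(a, global_var) + xi(b, global_var) - xi(a + b, global_var)


# Hypothetical sample data: one numerical and one nominal attribute per row.
a = [((0.0,), ("red",)), ((1.0,), ("red",))]
near = [((1.0,), ("red",)), ((2.0,), ("red",))]
far = [((10.0,), ("blue",)), ((11.0,), ("blue",))]

all_rows = a + near + far
global_var = [pvariance(col) for col in zip(*(r[0] for r in all_rows))]

d_near = log_likelihood_distance(a, near, global_var)
d_far = log_likelihood_distance(a, far, global_var)
```

In this sample, `d_far` is larger than `d_near` because merging `a` with `far` increases both the numerical variance and the mix of categories, whereas merging `a` with `near` changes neither by much.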

The TwoStep algorithm takes the following steps:

  1. In the preprocessing step, a CF-tree is built with a limited number of intermediate clusters. The size of the tree is controlled by the maxleaves parameter. A high maxleaves value favors model quality, and a low maxleaves value favors performance, so you can choose the trade-off that suits your needs.
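  2. In the refinement step, the final model is built from the intermediate clusters. Because this step works on the limited number of intermediate clusters rather than on every input row, TwoStep is especially suitable for large data sets.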
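The two steps above can be sketched as follows. This is a simplified illustration, not the product's implementation: it keeps the intermediate clusters in a flat list instead of a CF-tree, uses Euclidean distance, and merges down to a fixed number of clusters `k` instead of choosing the number automatically. The names `preprocess`, `refine`, `max_leaves`, `threshold`, and `k` are hypothetical and exist only in this example.

```python
import math


def preprocess(points, max_leaves, threshold):
    """Step 1: one pass over the data that summarises it into at most
    max_leaves intermediate clusters, each stored as [count, sums]."""
    leaves = []
    for p in points:
        best, best_d = None, math.inf
        for leaf in leaves:
            centroid = [s / leaf[0] for s in leaf[1]]
            d = math.dist(p, centroid)
            if d < best_d:
                best, best_d = leaf, d
        # Absorb the point if it is close enough, or if no leaf may be added.
        if best is not None and (best_d <= threshold or len(leaves) >= max_leaves):
            best[0] += 1
            best[1] = [s + x for s, x in zip(best[1], p)]
        else:
            leaves.append([1, list(p)])
    return leaves


def refine(leaves, k):
    """Step 2: repeatedly merge the two closest intermediate clusters
    until k clusters remain, then return their centroids."""
    clusters = [[n, list(sums)] for n, sums in leaves]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci = [s / clusters[i][0] for s in clusters[i][1]]
                cj = [s / clusters[j][0] for s in clusters[j][1]]
                d = math.dist(ci, cj)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = [
            clusters[i][0] + clusters[j][0],
            [a + b for a, b in zip(clusters[i][1], clusters[j][1])],
        ]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return [[s / n for s in sums] for n, sums in clusters]


# Hypothetical sample data with two well-separated groups.
points = [
    (0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0),
    (5.2, 4.9), (4.9, 5.1), (0.05, 0.05), (5.1, 5.0),
]
leaves = preprocess(points, max_leaves=4, threshold=0.2)
centroids = sorted(refine(leaves, k=2))
```

The refinement step only ever compares intermediate clusters, so its cost depends on `max_leaves` and not on the number of input rows. A larger `max_leaves` preserves more detail from the data at the price of more work in both steps.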

The TwoStep clustering model has the following features:

  • Support for nominal and numerical attributes
  • Support for missing values
  • Predefined distance measures:
    • Log-likelihood (the default)
    • Euclidean
    • Norm_Euclidean