Usage of TwoStep clustering
Unlike the K-Means and divisive clustering algorithms, the TwoStep algorithm can automatically determine an optimal number of clusters. It also supports the log-likelihood distance, a distribution-based distance measure that is suitable for both nominal and numerical attributes.
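To make the idea of a distribution-based distance concrete, the following Python sketch computes a log-likelihood distance between two clusters in the form commonly given in the TwoStep literature: the distance is the drop in log-likelihood that results from merging the two clusters. It is an illustration only, and it is an assumption that the product's implementation matches it in detail. The names cluster_cost, log_likelihood_distance, and global_var are hypothetical; each record is assumed to be a pair of (numerical values, nominal values), and global_var is assumed to hold the variance of each numerical attribute over the whole data set.

```python
import math
from collections import Counter

def cluster_cost(records, global_var):
    """Log-likelihood term of one cluster (hypothetical helper).

    records: list of (numeric_tuple, nominal_tuple) pairs.
    global_var: variance of each numerical attribute over ALL data (assumed given).
    """
    n = len(records)
    numeric_cols = list(zip(*(num for num, _ in records)))
    nominal_cols = list(zip(*(nom for _, nom in records)))
    total = 0.0
    # Numerical attributes: log of (global variance + within-cluster variance).
    for col, gvar in zip(numeric_cols, global_var):
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / n
        total += 0.5 * math.log(gvar + var)
    # Nominal attributes: entropy of the category frequencies in the cluster.
    for col in nominal_cols:
        counts = Counter(col)
        total += -sum((c / n) * math.log(c / n) for c in counts.values())
    return -n * total

def log_likelihood_distance(cluster_i, cluster_j, global_var):
    """Decrease in log-likelihood caused by merging the two clusters."""
    merged = cluster_i + cluster_j
    return (cluster_cost(cluster_i, global_var)
            + cluster_cost(cluster_j, global_var)
            - cluster_cost(merged, global_var))

# Illustrative data: one numerical and one nominal attribute per record.
a = [((1.0,), ("x",)), ((2.0,), ("x",))]
b = [((1.5,), ("x",)), ((2.5,), ("x",))]
c = [((10.0,), ("y",)), ((11.0,), ("y",))]
global_var = (20.0,)

near = log_likelihood_distance(a, b, global_var)
far = log_likelihood_distance(a, c, global_var)
```

In this sketch, clusters a and b overlap in both attributes and so are close, whereas a and c differ in both the numerical range and the nominal category and so are far apart. Because the nominal attribute contributes through its entropy, no numeric encoding of the categories is needed.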
The TwoStep algorithm takes the following steps:
- In the preprocessing step, a CF-tree is built with a limited number of intermediate clusters. The build is controlled by the maxleaves parameter, so you can trade model quality against performance according to your needs: a high maxleaves value favors model quality, and a low maxleaves value favors performance.
- In the refinement step, the final model is built from these intermediate clusters rather than from the individual input records. For this reason, TwoStep is especially suitable for large data sets.
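The two steps can be illustrated with a simplified Python sketch. It is not the product's implementation: a flat list of sub-clusters stands in for the CF-tree, the Euclidean distance between centroids stands in for the configured distance measure, and the number of final clusters k is passed in rather than determined automatically. The function names and the threshold parameter are hypothetical; only max_leaves mirrors the role of the maxleaves parameter.

```python
import math

# A sub-cluster is summarized as (count, per-attribute sums), a reduced
# form of a cluster feature (CF); no raw records are kept.

def centroid(cf):
    n, sums = cf
    return tuple(s / n for s in sums)

def merge(cf_a, cf_b):
    return (cf_a[0] + cf_b[0], tuple(x + y for x, y in zip(cf_a[1], cf_b[1])))

def merge_closest(cfs):
    """Merge the two sub-clusters whose centroids are closest."""
    best = None
    for i in range(len(cfs)):
        for j in range(i + 1, len(cfs)):
            d = math.dist(centroid(cfs[i]), centroid(cfs[j]))
            if best is None or d < best[0]:
                best = (d, i, j)
    _, i, j = best
    rest = [cf for idx, cf in enumerate(cfs) if idx not in (i, j)]
    return rest + [merge(cfs[i], cfs[j])]

def preprocess(points, max_leaves, threshold):
    """Step 1: one pass over the data, keeping at most max_leaves sub-clusters."""
    cfs = []
    for p in points:
        if cfs:
            nearest = min(range(len(cfs)),
                          key=lambda i: math.dist(centroid(cfs[i]), p))
            if math.dist(centroid(cfs[nearest]), p) <= threshold:
                cfs[nearest] = merge(cfs[nearest], (1, tuple(p)))
                continue
        cfs.append((1, tuple(p)))
        if len(cfs) > max_leaves:
            cfs = merge_closest(cfs)
    return cfs

def refine(cfs, k):
    """Step 2: agglomerate the sub-clusters (not the raw records) down to k."""
    while len(cfs) > k:
        cfs = merge_closest(cfs)
    return cfs

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0),
          (5.1, 4.9), (4.9, 5.2), (0.2, 0.1), (5.0, 5.1)]
leaves = preprocess(points, max_leaves=4, threshold=0.15)
final = refine(leaves, k=2)
```

The refinement step only ever touches the sub-cluster summaries, so its cost depends on max_leaves and not on the number of input records. A larger max_leaves keeps more detail for the refinement step at the price of more memory and more pairwise comparisons.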
The TwoStep clustering model has the following features:
- Support for nominal and numerical attributes
- Support for missing values
- Predefined distance measures:
  - Log-likelihood (the default)
  - Euclidean
  - Norm_Euclidean
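The difference between the two Euclidean measures can be sketched in Python as follows. The exact normalization that Norm_Euclidean applies is not stated here; the sketch assumes that each attribute difference is divided by that attribute's standard deviation, which is one common definition. The function names and the sample data are hypothetical.

```python
import math
import statistics

def euclidean(a, b):
    """Plain Euclidean distance between two numerical records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def column_stdevs(rows):
    """Population standard deviation of each attribute."""
    return [statistics.pstdev(col) for col in zip(*rows)]

def norm_euclidean(a, b, stdevs):
    """Euclidean distance on attributes scaled by their standard deviation.

    ASSUMPTION: scaling by standard deviation; the product may normalize
    differently. Constant attributes (stdev 0) are skipped.
    """
    return math.sqrt(sum(((x - y) / s) ** 2
                         for x, y, s in zip(a, b, stdevs) if s > 0))

# Illustrative records: (age in years, income).
rows = [(30.0, 20000.0), (40.0, 60000.0), (50.0, 100000.0)]
stdevs = column_stdevs(rows)

plain = euclidean(rows[0], rows[1])
scaled = norm_euclidean(rows[0], rows[1], stdevs)
```

With the plain Euclidean distance, the income attribute dominates because of its larger scale. After normalization, both attributes contribute equally to the distance between the first two records.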