Data mining — Clusterer operators

Clusterer operators group the records of the input table according to how similar the records are to one another.

A data record might, for example, contain information about a customer. The clustering algorithm groups similar customers together and, at the same time, maximizes the differences between the resulting customer groups. These groups are known as clusters. Each cluster tells a specific story about customer identity or behavior, for example, about demographic background or about preferred products or product combinations. Similar customers are thus collected in homogeneous groups that are then available for marketing or for other business processes. Clustering customers in this way is also known as customer segmentation.

You can cluster not only customers but data records of all kinds. For example, in retail you might cluster stores by their characteristics to perform store profiling.

The output of the clusterer operator is a cluster model that contains the characteristics of each cluster. You can visualize the cluster model with the visualizer operator, or export it as a tabular description of the clusters with the cluster extractor operator.

To assign new data records to the clusters, use the scorer operator with the cluster model that the clusterer operator created.
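
The idea behind scoring can be illustrated with a minimal Python sketch. It is not the scorer operator itself. It assumes a center-based model in which each cluster is summarized by a numeric center, and it assigns a record to the cluster whose center is nearest by Euclidean distance. The `centers` dictionary, the field meanings, and the `score` function are hypothetical names chosen for this example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def score(record, centers):
    """Return (cluster_id, distance) for the center closest to the record."""
    best_id = min(centers, key=lambda cid: euclidean(record, centers[cid]))
    return best_id, euclidean(record, centers[best_id])

# Hypothetical cluster model: cluster id -> center (age, yearly spend in thousands).
centers = {
    1: (25.0, 1.0),
    2: (45.0, 6.0),
    3: (65.0, 2.5),
}

# New customer records that were not part of the training data.
new_customers = [(27.0, 1.5), (50.0, 5.0), (70.0, 3.0)]
assignments = [score(customer, centers)[0] for customer in new_customers]
print(assignments)
```

In practice the fields would usually be normalized first, because a field with a large numeric range (such as age here) otherwise dominates the distance.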

Three algorithms are available: distribution-based Demographic clustering, center-based Kohonen clustering, and Enhanced BIRCH clustering. A new clusterer operator uses the distribution-based Demographic clustering algorithm by default. To change this setting, select the center-based Kohonen clustering algorithm or the Enhanced BIRCH clustering algorithm from the Algorithm drop-down list of the clusterer operator.

The center-based Kohonen clustering algorithm is based on a Kohonen feature map and uses Euclidean distance. The Enhanced BIRCH clustering algorithm instead calculates distances with a log-likelihood measure, which allows categorical and numeric fields to be compared more readily.
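
To make the Kohonen feature map more concrete, the following Python sketch trains a very small one-dimensional map with Euclidean distance. It is a simplified illustration under stated assumptions, not the algorithm as implemented by the clusterer operator: the map is one-dimensional, the neighborhood has a fixed radius with a fixed influence of 0.5, and the learning rate decays linearly. The function names, parameters, and sample records are all hypothetical.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def best_matching_unit(record, units):
    """Index of the map unit whose weight vector is closest to the record."""
    return min(range(len(units)), key=lambda i: euclidean(record, units[i]))

def train_som(records, n_units, epochs=50, rate=0.5, radius=1, seed=0):
    """Train a 1-D Kohonen map and return the list of unit weight vectors."""
    rng = random.Random(seed)
    # Initialize each unit from a randomly chosen record.
    units = [list(rng.choice(records)) for _ in range(n_units)]
    for epoch in range(epochs):
        # Linearly decay the learning rate over the epochs.
        lr = rate * (1.0 - epoch / epochs)
        for record in records:
            bmu = best_matching_unit(record, units)
            # Move the winning unit and its map neighbors toward the record.
            lo = max(0, bmu - radius)
            hi = min(n_units, bmu + radius + 1)
            for i in range(lo, hi):
                influence = 1.0 if i == bmu else 0.5  # assumed fixed neighbor weight
                for d in range(len(record)):
                    units[i][d] += lr * influence * (record[d] - units[i][d])
    return units

# Two loose groups of two-field records (illustrative data only).
records = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
           (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
units = train_som(records, n_units=4)
clusters = [best_matching_unit(r, units) for r in records]
print(units)
print(clusters)
```

Each record is then assigned to its best-matching unit, so the units play the role of cluster centers. A fuller implementation would typically use a two-dimensional grid and a neighborhood that shrinks during training.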