Demographic clustering

Demographic clustering is distribution-based. It provides fast and natural clustering of very large databases. Clusters are characterized by the value distributions of their members. The algorithm automatically determines the number of clusters to be generated, up to a maximum that you can specify.

Typically, demographic data contains many categorical variables. The mining function works well with data sets that consist of this type of variable.

You can also use numerical variables. The Demographic Clustering algorithm treats numerical variables by assigning similarities according to the numeric difference of the values.
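The following Python sketch illustrates what a difference-based similarity could look like. It is only an illustration: the function names, the linear decay, and the scale parameter are assumptions made here, not the formula that the product uses.

```python
def categorical_similarity(a, b):
    """Assumed rule for categorical values: 1.0 if equal, otherwise 0.0."""
    return 1.0 if a == b else 0.0


def numeric_similarity(a, b, scale):
    """Hypothetical similarity in [0, 1] that falls linearly with |a - b|.

    `scale` is an assumed parameter: the difference at which two values
    are treated as completely dissimilar.
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    return max(0.0, 1.0 - abs(a - b) / scale)


if __name__ == "__main__":
    # Ages 30 and 35 with a scale of 10 years are "half similar".
    print(numeric_similarity(30, 35, 10))
    print(categorical_similarity("north", "south"))
```

In this sketch, equal values are fully similar and values that differ by the scale or more are fully dissimilar. Any other monotonically decreasing function of the difference would serve the same illustrative purpose.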

Demographic Clustering is an iterative process over the input data. Each input record is read in succession, and its similarity to each of the currently existing clusters is calculated. If the largest calculated similarity is above a given threshold, the record is added to the corresponding cluster, and the characteristics of that cluster change accordingly. If the largest similarity is not above the threshold, or if no cluster exists yet (which is initially the case), a new cluster is created that contains only this record. You can specify the maximum number of clusters, as well as the similarity threshold.
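A single pass of the process described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the product's implementation: the record-to-cluster similarity (the mean share of cluster members that hold the record's value in each field), the handling of the cluster limit (the record joins its most similar cluster once the limit is reached), and all names are hypothetical. It also assumes purely categorical records that all have the same fields.

```python
from collections import Counter


def similarity(record, cluster):
    """Assumed measure: mean share of members holding the record's value per field."""
    size = cluster["size"]
    shares = [cluster["counts"][f][v] / size for f, v in record.items()]
    return sum(shares) / len(shares)


def add_to_cluster(record, cluster):
    """Update the cluster's value distributions with one more member."""
    cluster["size"] += 1
    for field, value in record.items():
        cluster["counts"][field][value] += 1


def new_cluster(record):
    """Create a cluster that contains the record alone."""
    cluster = {"size": 0, "counts": {field: Counter() for field in record}}
    add_to_cluster(record, cluster)
    return cluster


def single_pass(records, threshold=0.5, max_clusters=10):
    """Read records in succession and return (clusters, assignments)."""
    clusters, assignments = [], []
    for record in records:
        scores = [similarity(record, c) for c in clusters]
        # Index of the most similar cluster, or None if no cluster exists yet.
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        # Assumption: once the limit is reached, join the best cluster anyway.
        limit_reached = len(clusters) >= max_clusters
        if best is not None and (scores[best] > threshold or limit_reached):
            add_to_cluster(record, clusters[best])
            assignments.append(best)
        else:
            clusters.append(new_cluster(record))
            assignments.append(len(clusters) - 1)
    return clusters, assignments


if __name__ == "__main__":
    data = [
        {"gender": "F", "region": "north"},
        {"gender": "F", "region": "north"},
        {"gender": "M", "region": "south"},
        {"gender": "M", "region": "south"},
    ]
    _, labels = single_pass(data, threshold=0.5)
    print(labels)
```

Because each cluster stores only per-field value counts, adding a record changes the cluster's characteristics without revisiting earlier records, which is one plausible reason why this style of clustering scales to large inputs.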

Demographic Clustering uses the statistical Condorcet criterion to manage the assignment of records to clusters and the creation of new clusters. The Condorcet criterion evaluates how homogeneous each discovered cluster is (that is, how similar the records it contains are) and how heterogeneous the discovered clusters are with respect to one another. The iterative process of discovering clusters stops after two or more passes over the input data if the improvement of the clustering result according to the Condorcet criterion does not justify a new pass.
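To make the idea concrete, the following Python sketch computes a simplified, pairwise form of a Condorcet-style score for categorical records, together with a stopping test. It is an assumption-laden illustration: the exact formula, the weighting, and the stopping rule that the product applies are not given here, and the function names and the min_improvement parameter are hypothetical.

```python
from itertools import combinations


def agreement(a, b):
    """Number of fields on which two records hold the same value."""
    return sum(1 for field in a if a[field] == b[field])


def condorcet_score(records, labels):
    """Simplified pairwise score; higher is better.

    Pairs in the same cluster are rewarded for agreeing fields
    (homogeneity); pairs in different clusters are rewarded for
    disagreeing fields (heterogeneity).
    """
    score = 0
    for i, j in combinations(range(len(records)), 2):
        agree = agreement(records[i], records[j])
        disagree = len(records[i]) - agree
        if labels[i] == labels[j]:
            score += agree - disagree
        else:
            score += disagree - agree
    return score


def worth_another_pass(previous_score, current_score, min_improvement=1):
    """Assumed stopping rule: continue only if the score improved enough."""
    return current_score - previous_score >= min_improvement


if __name__ == "__main__":
    data = [
        {"gender": "F", "region": "north"},
        {"gender": "F", "region": "north"},
        {"gender": "M", "region": "south"},
        {"gender": "M", "region": "south"},
    ]
    print(condorcet_score(data, [0, 0, 1, 1]))  # identical records grouped together
    print(condorcet_score(data, [0, 1, 0, 1]))  # identical records split apart
```

The quadratic loop over all record pairs is used only to keep the sketch readable. It would not be practical for very large databases.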

Demographic clustering specific parameters

Besides the common clustering parameters, you can define specific parameters for the Demographic Clustering algorithm.


https://www.ibm.com/docs/en/db2/10.5.0?topic=function-demographic-clustering