Distribution-based clustering, sometimes called probabilistic clustering, groups data points according to the probability distribution they appear to have been drawn from. The approach assumes that each cluster is generated by its own distribution, typically a normal distribution along each dimension of the data, and that the center of that distribution is the cluster center. It differs from centroid-based clustering in that it does not assign points using a distance metric such as Euclidean or Manhattan distance. Instead, distribution-based approaches look for a well-defined distribution that appears across each dimension, and the cluster means are the means of the fitted Gaussian distributions. Distribution-based clustering is a model-based approach: it finds clusters by repeatedly fitting distributions to the data, which can be computationally expensive on large data sets.
One commonly used approach to distribution-based clustering is to fit a Gaussian Mixture Model (GMM) through Expectation-Maximization. A GMM is so named because it assumes that each cluster is defined by a Gaussian distribution, also called a normal distribution.
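Written as a formula, the mixture assumption looks like the following. The symbols are notation introduced here for illustration rather than taken from the text: K is the number of clusters, \pi_k is the share of points expected to come from cluster k, and \mu_k and \Sigma_k are that cluster's mean and variance-covariance matrix.

```latex
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),
\qquad \pi_k \ge 0, \qquad \sum_{k=1}^{K} \pi_k = 1
```

Each term in the sum is one cluster's Gaussian density, scaled by how common that cluster is.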
Consider a dataset with two distinct clusters, A and B, each defined by two normal distributions: one along the x-axis and one along the y-axis. Expectation-Maximization begins with a random guess for what those distributions are and then improves the guess iteratively by alternating two steps:
Expectation: For each data point, compute the probability that it came from Cluster A and the probability that it came from Cluster B, given the current guess for each cluster's distribution.
Maximization: Update the parameters that define each cluster, a weighted mean location and a variance-covariance matrix, weighting each data point by the probability that it belongs to the cluster. Then repeat both steps until the parameter estimates stop changing, that is, until the model has converged on the distributions observed for each cluster.
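The two steps above can be sketched in plain Python for the two-cluster, two-dimensional case. This is a minimal illustration under stated assumptions, not a production implementation: the function names (gaussian_pdf, e_step, m_step, fit_gmm) and the synthetic data are hypothetical, the sketch also tracks a mixing weight for each cluster, and the starting guess uses one random data point plus the point farthest from it, a simplification chosen to keep the example stable.

```python
import math
import random

def gaussian_pdf(p, mean, cov):
    """Density of a 2-D normal at point p; cov is ((sxx, sxy), (sxy, syy))."""
    (sxx, sxy), (_, syy) = cov
    det = sxx * syy - sxy * sxy
    dx, dy = p[0] - mean[0], p[1] - mean[1]
    # Quadratic form (p - mean)^T cov^-1 (p - mean) via the closed-form 2x2 inverse.
    quad = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

def e_step(points, weights, means, covs):
    """Expectation: probability that each point came from each cluster."""
    resp = []
    for p in points:
        joint = [w * gaussian_pdf(p, m, c) for w, m, c in zip(weights, means, covs)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    return resp

def m_step(points, resp):
    """Maximization: weighted mean and covariance matrix for each cluster."""
    weights, means, covs = [], [], []
    for j in range(len(resp[0])):
        nj = sum(r[j] for r in resp)  # effective number of points in cluster j
        mx = sum(r[j] * p[0] for r, p in zip(resp, points)) / nj
        my = sum(r[j] * p[1] for r, p in zip(resp, points)) / nj
        # The small constant on the diagonal keeps the matrix invertible (an assumption).
        sxx = sum(r[j] * (p[0] - mx) ** 2 for r, p in zip(resp, points)) / nj + 1e-6
        syy = sum(r[j] * (p[1] - my) ** 2 for r, p in zip(resp, points)) / nj + 1e-6
        sxy = sum(r[j] * (p[0] - mx) * (p[1] - my) for r, p in zip(resp, points)) / nj
        weights.append(nj / len(points))
        means.append((mx, my))
        covs.append(((sxx, sxy), (sxy, syy)))
    return weights, means, covs

def log_likelihood(points, weights, means, covs):
    return sum(
        math.log(sum(w * gaussian_pdf(p, m, c) for w, m, c in zip(weights, means, covs)))
        for p in points
    )

def fit_gmm(points, max_iter=200, tol=1e-6, seed=0):
    """Fit a two-cluster GMM by alternating the two steps until convergence."""
    rng = random.Random(seed)
    first = rng.choice(points)
    second = max(points, key=lambda p: (p[0] - first[0]) ** 2 + (p[1] - first[1]) ** 2)
    weights = [0.5, 0.5]
    means = [first, second]
    covs = [((1.0, 0.0), (0.0, 1.0)), ((1.0, 0.0), (0.0, 1.0))]
    prev = -math.inf
    for _ in range(max_iter):
        resp = e_step(points, weights, means, covs)
        weights, means, covs = m_step(points, resp)
        ll = log_likelihood(points, weights, means, covs)
        if ll - prev < tol:  # stop once the log-likelihood no longer improves
            break
        prev = ll
    return weights, means, covs

# Synthetic data: cluster A centered near (0, 0), cluster B centered near (6, 6).
data_rng = random.Random(42)
cluster_a = [(data_rng.gauss(0.0, 1.0), data_rng.gauss(0.0, 1.0)) for _ in range(100)]
cluster_b = [(data_rng.gauss(6.0, 1.0), data_rng.gauss(6.0, 1.0)) for _ in range(100)]
points = cluster_a + cluster_b

weights, means, covs = fit_gmm(points)
print("mixing weights:", weights)
print("cluster means:", means)
```

The stopping rule here watches the log-likelihood of the data rather than the parameters directly; either is a common way to decide that the estimates have settled.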
Each data point is given a probability of being associated with each cluster. This means that clustering via Expectation-Maximization is a soft clustering approach: a given point may be probabilistically associated with more than one cluster. This makes sense in some scenarios. A song might be somewhat folk and somewhat rock, or a user may prefer television shows in Spanish but sometimes also watch shows in English.
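A small sketch makes the difference between soft and hard assignment concrete. Everything in it is assumed for illustration: the one-dimensional "style score", the two genre distributions, their parameters, and the helper names normal_pdf and membership are all hypothetical and do not come from a fitted model.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a one-dimensional normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def membership(x, components):
    """Probability that x belongs to each cluster, given (weight, mean, std) per cluster."""
    joint = {name: w * normal_pdf(x, mean, std) for name, (w, mean, std) in components.items()}
    total = sum(joint.values())
    return {name: j / total for name, j in joint.items()}

# Hypothetical genre clusters along a made-up style score: (weight, mean, std).
genres = {"folk": (0.5, 0.0, 1.0), "rock": (0.5, 4.0, 1.0)}

song = 1.5  # a song that sits between the two genres, closer to folk
probs = membership(song, genres)

# Soft clustering keeps both probabilities; hard clustering keeps only the largest.
hard_label = max(probs, key=probs.get)
print(probs)
print(hard_label)
```

With these assumed parameters the song comes out roughly 88% folk and 12% rock. A hard assignment would label it folk and discard the rest of that information.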