Date: 2026-02-11

Measuring how two data points differ is one of the most fundamental tasks in machine learning and artificial intelligence. In clustering or classification applications, the distance between points determines whether they belong to the same class or cluster.

In natural language processing applications, distance metrics measure how different two vectors or sets are, and can detect whether sentences or documents are similar. In statistical applications, the distance between distributions can determine whether two populations are similar or different.

Many distance metrics are derived from similarity measures, meaning that we first measure the similarity of data points and then subtract that from 1 or otherwise invert the similarity metric. A data scientist will choose from the different distance metrics based on the type of data they’re working with and the use case or problem space in which they’re working.

All the methods discussed in this article have implementations in languages such as R or Python, often through a library such as NumPy or SciPy.

Distance metrics can be grouped into several fundamental types based on the structure of the data they compare and the notion of similarity they encode. The first type is geometric or norm-based distances, which measure separation in a continuous vector space.

Some of the most commonly used are Euclidean distance (the straight-line distance), Manhattan distance (the sum of absolute coordinate differences) and Chebyshev distance (the largest difference along any single coordinate). These methods assume that the quantities being compared are numerical values or features. They rely on distance in a coordinate system that can have anywhere from a single dimension to thousands of dimensions.
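As a minimal sketch of the three norm-based distances named above, the following uses only the Python standard library. The function names and the sample vectors `p` and `q` are illustrative choices, not part of any library; SciPy offers comparable functions in `scipy.spatial.distance` (`euclidean`, `cityblock`, `chebyshev`).

```python
import math
from typing import Sequence


def _check_same_length(a: Sequence[float], b: Sequence[float]) -> None:
    # All three distances compare vectors coordinate by coordinate.
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line (L2) distance."""
    _check_same_length(a, b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute coordinate differences (L1)."""
    _check_same_length(a, b)
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest absolute difference along any single coordinate (L-infinity)."""
    _check_same_length(a, b)
    return max(abs(x - y) for x, y in zip(a, b))


# Hypothetical sample points; the coordinate differences are 3, 4 and 0.
p = [1.0, 2.0, 3.0]
q = [4.0, 6.0, 3.0]

print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0
print(chebyshev(p, q))  # 4.0
```

The same pair of points yields three different values, which is why the choice of metric can change which points count as neighbors.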

Another fundamental category is composed of metrics that measure angular distances. These metrics treat data points as vectors and compare their orientation rather than their absolute separation, so the magnitude of each vector has little or no influence on the result. Cosine distance is the most common example and is widely used for high-dimensional data such as text embeddings, where relative direction matters more than magnitude.
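A small sketch of cosine distance, computed as one minus the cosine similarity of two vectors. The function name and example vectors are illustrative assumptions; raising an error for a zero-length vector is one possible convention, since the angle is undefined in that case.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cos(angle between a and b)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumed convention: the angle to a zero vector is undefined.
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


# Same direction, different magnitude: distance is approximately 0.
same_direction = cosine_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])

# Perpendicular vectors: distance is 1.
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])

print(same_direction, orthogonal)
```

Doubling every component of a vector leaves its direction unchanged, so the first pair is treated as essentially identical even though the Euclidean distance between them is not zero.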

A third type is set-based distances, which compare overlap between sets or binary vectors interpreted as sets. Jaccard distance is a key example and is commonly used for sparse data where the presence of features is more informative than their absence. These metrics are conceptually similar to edit or sequence-based distances, which measure how many operations are required to transform one sequence into another.
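Jaccard distance can be sketched as one minus the ratio of shared elements to total distinct elements. The function name and the word sets below are hypothetical, and returning 0 for two empty sets is an assumed convention rather than a universal one.

```python
from typing import Hashable, Set


def jaccard_distance(a: Set[Hashable], b: Set[Hashable]) -> float:
    """Return 1 - |a & b| / |a | b|."""
    union = a | b
    if not union:
        # Assumed convention: two empty sets are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical bags of words from two short sentences.
doc_a = {"cat", "sat", "mat"}
doc_b = {"cat", "sat", "hat"}

# Two shared words out of four distinct words overall.
print(jaccard_distance(doc_a, doc_b))  # 0.5
```

Only elements present in at least one set enter the calculation, which is what makes the measure suitable for sparse data: features absent from both sets do not affect the result.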

Some commonly used sequence-based distance metrics are Hamming distance, which compares aligned sequences of equal length, and Levenshtein distance, which can compare variable-length strings. These metrics are widely used with text in natural language processing applications and in error detection.
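The two sequence-based distances can be sketched as follows. Hamming distance counts mismatched positions, while the Levenshtein function here uses a standard dynamic-programming approach that keeps only two rows of the edit table; the function names and test strings are illustrative.

```python
from typing import Sequence


def hamming(a: Sequence, b: Sequence) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(s: str, t: str) -> int:
    """Minimum number of insertions, deletions and substitutions to turn s into t."""
    if len(s) < len(t):
        s, t = t, s  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, char_s in enumerate(s, start=1):
        current = [i]  # deleting the first i characters of s
        for j, char_t in enumerate(t, start=1):
            substitution_cost = 0 if char_s == char_t else 1
            current.append(
                min(
                    previous[j] + 1,                      # deletion
                    current[j - 1] + 1,                   # insertion
                    previous[j - 1] + substitution_cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


print(hamming("karolin", "kathrin"))     # 3
print(levenshtein("kitten", "sitting"))  # 3
```

Note that this `hamming` returns a raw count; some library versions, such as `scipy.spatial.distance.hamming`, return the proportion of differing positions instead.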

Another important category is probabilistic or distribution-based distances, which compare entire probability distributions rather than individual observations. Examples include Kullback–Leibler divergence, Jensen–Shannon distance and Wasserstein distance. These methods are common in generative modeling and statistical learning.
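The following is a simplified sketch of the three distribution-based measures for discrete distributions given as lists of probabilities over the same bins. It makes several assumptions worth stating: natural logarithms are used, the inputs already sum to 1, and the Wasserstein function treats the bins as unit-spaced positions 0, 1, 2, and so on. The function names are illustrative; SciPy provides more general versions such as `scipy.spatial.distance.jensenshannon` and `scipy.stats.wasserstein_distance`.

```python
import math
from itertools import accumulate
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Kullback-Leibler divergence KL(p || q) in nats; not symmetric."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # terms with p_i = 0 contribute nothing
        if q_i == 0.0:
            return math.inf  # p puts mass where q has none
        total += p_i * math.log(p_i / q_i)
    return total


def jensen_shannon_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Square root of the Jensen-Shannon divergence (symmetric and finite)."""
    mixture = [(p_i + q_i) / 2 for p_i, q_i in zip(p, q)]
    divergence = 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)
    return math.sqrt(max(divergence, 0.0))  # guard against tiny negative rounding


def wasserstein_1d(p: Sequence[float], q: Sequence[float]) -> float:
    """1-Wasserstein distance for bins at positions 0, 1, 2, ... (assumed spacing)."""
    return sum(abs(cp - cq) for cp, cq in zip(accumulate(p), accumulate(q)))


# Hypothetical histograms: q is p shifted one bin to the right.
p = [0.5, 0.5, 0.0]
q = [0.0, 0.5, 0.5]

print(kl_divergence(p, q))            # inf, because q[0] is 0 where p[0] is not
print(jensen_shannon_distance(p, q))  # finite
print(wasserstein_1d(p, q))           # 1.0, every unit of mass moves one bin
```

The example shows how the measures differ in behavior: KL divergence becomes infinite when the supports do not match, Jensen–Shannon distance stays bounded, and Wasserstein distance reflects how far the probability mass has to move.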

Finally, there are kernel-induced and learned distances, where the distance is defined implicitly through a kernel function or learned from data. Examples include distances induced by radial basis function (RBF) kernels and Mahalanobis distance, which adapts to feature correlations and scaling.
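As a rough illustration, the sketch below computes Mahalanobis distance for two-dimensional points with a hand-inverted 2x2 covariance matrix, and the distance induced by an RBF kernel via the identity d(x, y)^2 = k(x, x) + k(y, y) - 2 k(x, y). The function names, the `gamma` parameter default and the covariance values are assumptions chosen for the example; in practice the covariance would be estimated from data, and a library routine such as `scipy.spatial.distance.mahalanobis` handles arbitrary dimensions.

```python
import math
from typing import Sequence


def mahalanobis_2d(
    x: Sequence[float], y: Sequence[float], cov: Sequence[Sequence[float]]
) -> float:
    """Mahalanobis distance between 2-D points given a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix must be invertible")
    dx, dy = x[0] - y[0], x[1] - y[1]
    # Quadratic form diff^T * inverse(cov) * diff, using the closed-form
    # 2x2 inverse (1 / det) * [[d, -b], [-c, a]].
    quadratic_form = (dx * (d * dx - b * dy) + dy * (-c * dx + a * dy)) / det
    return math.sqrt(quadratic_form)


def rbf_kernel_distance(
    x: Sequence[float], y: Sequence[float], gamma: float = 1.0
) -> float:
    """Distance in the feature space implied by k(x, y) = exp(-gamma * ||x - y||^2)."""
    squared_euclidean = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    kernel_value = math.exp(-gamma * squared_euclidean)
    # k(x, x) = k(y, y) = 1 for the RBF kernel, so d^2 = 2 - 2 * k(x, y).
    return math.sqrt(max(2.0 - 2.0 * kernel_value, 0.0))


identity = [[1.0, 0.0], [0.0, 1.0]]
stretched = [[4.0, 0.0], [0.0, 1.0]]  # hypothetical: first feature has variance 4

print(mahalanobis_2d([0.0, 0.0], [3.0, 4.0], identity))   # 5.0, same as Euclidean
print(mahalanobis_2d([0.0, 0.0], [2.0, 0.0], stretched))  # 1.0, two units = one std dev
print(rbf_kernel_distance([0.0, 0.0], [0.0, 0.0]))        # 0.0
```

With an identity covariance the Mahalanobis distance reduces to Euclidean distance, while a larger variance along one feature shrinks distances in that direction. The RBF-induced distance is bounded above by the square root of 2, no matter how far apart the inputs are.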