Measuring how different two data points are is one of the most fundamental tasks in machine learning and artificial intelligence. In clustering or classification applications, the distance between points determines whether they belong to the same class or cluster.
In natural language processing applications, distance metrics measure how different two vectors or sets are, and can detect whether sentences or documents are similar. In statistical applications, the distance between distributions can determine whether two populations are similar or different.
Many distance metrics are derived from similarity measures, meaning that we first measure the similarity of data points and then subtract that from 1 or otherwise invert the similarity metric. A data scientist will choose from the different distance metrics based on the type of data they’re working with and the use case or problem space in which they’re working.
All the methods discussed in this article will have implementations in a language like R or Python, often through an API provided by a library like NumPy or SciPy.
Distance metrics can be grouped into several fundamental types based on the structure of the data they compare and the notion of similarity they encode. The first type is geometric or norm-based distances, which measure distances in a continuous vector space.
Some of the most commonly used are Euclidean distance, Manhattan distance and Chebyshev distance. These methods assume that the quantities that are being compared are numerical values or features. They rely on distance in a coordinate system that can be as simple as a single dimension and as complex as thousands of dimensions.
Another fundamental category is composed of metrics that measure angular distances. These metrics treat data points as vectors and measure their orientation rather than their absolute separation, minimizing the importance of a vector's magnitude and emphasizing its direction. Cosine distance is the most common example and is widely used for high-dimensional data such as text embeddings, where relative direction matters more than magnitude.
A third type is set-based distances, which compare overlap between sets or binary vectors interpreted as sets. Jaccard distance is a key example and is commonly used for sparse data where the presence of features is more informative than their absence. These metrics are conceptually similar to edit or sequence-based distances, which measure how many operations are required to transform one sequence into another.
Some commonly used sequence-based distance metrics are Hamming distance, for aligned sequences of equal length, and Levenshtein distance, which can compare variable-length strings. These metrics are widely used with text in natural language processing applications and in error detection.
Another important category is probabilistic or distribution-based distances, which compare entire probability distributions rather than individual observations. Examples include Kullback–Leibler divergence, Jensen–Shannon distance and Wasserstein distance. These methods are common in generative modeling and statistical learning.
Finally, there are kernel-induced and learned distances, where the distance is defined implicitly through a kernel function or learned from data. Examples include distances calculated by using radial basis function kernels or Mahalanobis distance, which adapts to feature correlations and scaling.
Euclidean distance, sometimes called straight-line distance, is one of the most commonly used distance metrics and it’s easy to calculate. It’s simply the square root of the sum of squared differences between two data points.
Many machine learning algorithms like k-means clustering or nearest neighbors use Euclidean distance to calculate the similarity between two data points. As long as the two sets of data both have numerical data types like integers or floats, they can easily be measured in a Euclidean space.
To calculate the Euclidean distance between two vectors a and b, sum the squared differences of their components (squaring ensures that the values are non-negative), and take the square root of the result:
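The calculation above can be sketched in a few lines of NumPy (the function name here is illustrative, not from a particular library):

```python
import numpy as np

def euclidean_distance(a, b):
    """Square root of the sum of squared component differences."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.sqrt(np.sum((a - b) ** 2))

# Example: the classic 3-4-5 right triangle
print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```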
Another commonly used geometric metric is the Manhattan distance. This metric is also known as city block distance or taxicab distance because it measures the path between points along axis-aligned segments rather than a straight line. Mathematically, the Manhattan distance is calculated as the sum of absolute differences between two data points.
This method is most natural when vectors lie on a uniform grid like a chessboard. The name taxicab creates an intuition: it measures the shortest path that a taxicab would take between square city blocks that are coordinates on the grid, like the streets of Manhattan in New York City.
To create a Manhattan distance function for two vectors, take the absolute differences of their components and sum the result:
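A minimal sketch of that function (illustrative naming, using NumPy):

```python
import numpy as np

def manhattan_distance(a, b):
    """Sum of absolute differences along each coordinate."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.sum(np.abs(a - b))

# Three blocks east plus four blocks north is seven blocks of driving
print(manhattan_distance([1, 2], [4, 6]))  # 7.0
```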
An alternative to the Manhattan and Euclidean distance is the Minkowski distance, which generalizes both through an order parameter p: p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance. In data science, the Minkowski function is useful for numerical datasets where one wants to tune how strongly large per-dimension differences dominate the result.
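The generalization is easy to see in code; a sketch (the function name is illustrative, and SciPy offers an equivalent in scipy.spatial.distance.minkowski):

```python
import numpy as np

def minkowski_distance(a, b, p=2):
    """Generalized distance: p=1 gives Manhattan, p=2 gives Euclidean."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

print(minkowski_distance([0, 0], [3, 4], p=1))  # 7.0 (Manhattan)
print(minkowski_distance([0, 0], [3, 4], p=2))  # 5.0 (Euclidean)
```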
Another conceptually related distance metric is the Chebyshev distance. You can think of it as being the Manhattan distance but with diagonal moves allowed. This distance calculates the maximum absolute difference along any coordinate dimension. It’s used for path-finding in games or robotics or computer vision applications.
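The chessboard intuition for Chebyshev distance maps directly to a one-line implementation (illustrative sketch):

```python
import numpy as np

def chebyshev_distance(a, b):
    """Maximum absolute difference along any coordinate dimension."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.max(np.abs(a - b))

# A chess king needs max(|dx|, |dy|) moves between two squares,
# because diagonal moves cover both axes at once
print(chebyshev_distance([0, 0], [3, 5]))  # 5.0
```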
For all geometric distances, it’s important to note that if your data have features with different scales, those values need to be normalized or standardized before calculating the distance. Without normalization, the large values in features will dominate the distance, rendering the results possibly inaccurate.
Angular distance has become an important metric for measuring similarity of embeddings generated by an embedding model and other input and output from large language models. Among the most commonly used is cosine distance, which measures how different the orientations of two vectors are, regardless of their magnitude. Cosine distance relies on cosine similarity, which takes the cosine of the angle between two vectors in vector space. Vectors that are highly similar have 0 degrees between them; that is, they point in exactly the same direction. Taking the cosine of 0 gives us 1. Vectors pointing in opposite directions have a similarity of -1, from taking the cosine of 180 degrees. Cosine distance is simply 1 minus the cosine similarity.
The cosine of two non-zero vectors can be derived with the Euclidean dot product formula:
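That derivation can be sketched with NumPy (illustrative function name, undefined for zero vectors):

```python
import numpy as np

def cosine_distance(a, b):
    """1 minus cosine similarity: dot product over the product of norms."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - similarity

print(cosine_distance([1, 0], [0, 1]))  # 1.0 (orthogonal vectors)
print(cosine_distance([1, 2], [2, 4]))  # ~0.0 (same direction, different magnitude)
```

The second example shows the magnitude-invariance discussed below: [2, 4] is twice as long as [1, 2], yet their cosine distance is essentially zero.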
Although often called a distance metric, cosine distance is not a true distance metric because it doesn’t satisfy the triangle inequality. This property states that the direct path between two points can be no longer than a path that passes through a third point; in the case of cosine distance, this isn’t necessarily true.
A key difference between geometric and angular distance is how magnitude is treated. Geometric distances increase when vectors become longer or contain larger values, while angular distances remain unchanged. This means that a very large vector pointing in the same direction as a smaller vector will appear very different by a geometric distance metric but very similar by an angular distance metric. You could think of this as asking whether a single sentence about William Shakespeare is close to a three-paragraph text about the playwright. A geometric measure may treat the difference in length as making them very different, whereas an angular distance would indicate that their shared topic makes them similar. This makes angular metrics particularly useful when feature counts or intensities vary widely but relative proportions matter.
Another difference lies in their behavior in high-dimensional spaces. Geometric distances can become less informative in very high dimensions due to distance concentration, where points tend to appear similarly distant from one another. Angular distances often remain more discriminative in such settings, which is why they are commonly used for text, embeddings, and recommendation systems. Geometric-based metrics assume that the underlying coordinate system and units are meaningful and comparable across dimensions. Angular metrics assume that only the relative orientation of vectors encodes meaningful information, not their absolute size.
While geometric and angular distances help measure the difference between numerical data that exists in a normed vector space of some kind, they don’t work for categorical or text data. In these cases, set and sequence distances help machine learning models calculate how different data such as classifier labels, retrieved documents, text strings or DNA sequences are from one another.
Jaccard distance is derived from the Jaccard index (also called Jaccard similarity or Tanimoto similarity), which is often described as intersection over union because it takes the size of the intersection of two sets and divides it by the size of their union. Jaccard similarity is often used in image detection applications, where it measures the accuracy of object detection algorithms. The Jaccard distance is simply 1 minus the Jaccard similarity:
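Intersection over union needs only Python's built-in set operations; a sketch (the empty-set convention here is one common choice, not universal):

```python
def jaccard_distance(a, b):
    """1 minus intersection-over-union of two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention: two empty sets are treated as identical
    return 1.0 - len(a & b) / len(a | b)

# Two of four distinct items are shared, so similarity is 0.5
print(jaccard_distance({"cat", "dog", "bird"}, {"cat", "dog", "fish"}))  # 0.5
```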
Hamming distance is a measure of distance between two sets or vectors of equal length. The objects being compared can be binary data or binary strings, character strings or vectors of categorical values, as long as they have the same length. It measures the number of positions where the corresponding elements in the two objects are different.
For example, if two binary strings differ in two bit positions, their Hamming distance is two. You can think of this distance as the edit distance between strings, that is, the number of edits it would take to turn one string into the other. This characteristic makes Hamming distance a simple and intuitive way to quantify how much two fixed-length representations differ from each other.
Hamming distance is given as:
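In code, it is a single comparison-and-count pass (illustrative sketch):

```python
def hamming_distance(a, b):
    """Number of positions where two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))

# 'karolin' and 'kathrin' differ at three positions
print(hamming_distance("karolin", "kathrin"))  # 3
```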
Levenshtein distance, also called edit distance, measures how many edits would be required to make one string match another. It is conceptually similar to Hamming distance but it can work with data points of unequal lengths and incorporates additions, deletions and substitutions.
Mathematically, the Levenshtein distance is the minimum number of deletions, insertions or substitutions required to make two sequences match.
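The standard way to compute this minimum is dynamic programming; a sketch using a rolling one-row table (function name is illustrative):

```python
def levenshtein_distance(a, b):
    """Minimum number of insertions, deletions and substitutions."""
    # previous[j] holds the distance between a[:i-1] and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

# kitten -> sitten -> sittin -> sitting: three edits
print(levenshtein_distance("kitten", "sitting"))  # 3
```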
Rather than comparing individual data points or vectors, distribution-based distance metrics measure how different an entire probability distribution is from another distribution. They are commonly used when data is represented as histograms, probability density functions or empirical samples.
One important type of distance measure is information-theoretic distances. For instance, the Kullback–Leibler (KL) divergence (also called relative entropy) measures how much information is lost when one distribution is used to approximate another.
An intuition for this might be guessing the distribution of heights in a high school mathematics class. If you used the distribution of heights in a class from Bolivia to represent heights in a class in the Netherlands, you might find that they don’t match up especially well.
KL divergence relies on entropy, which you can think of as the amount of unexpected variance in a signal or distribution. Entropy tells us the minimum amount of surprise possible if we knew the true distribution, usually represented as H(P). Think of it as the cost of the ideal estimate.
Cross-entropy tells us the actual amount of surprise that we incur when using an approximation, usually represented as H(P, Q). Think of it as the cost of the actual estimate. KL divergence subtracts the entropy H(P) from the cross-entropy H(P, Q) to show how much extra surprise the approximating distribution Q introduces compared to the real one P. It can be thought of as the penalty for having the wrong assumptions.
One important feature of KL divergence is that it is not symmetric and does not satisfy the triangle inequality, so it is a divergence rather than a true distance metric. An alternative measure of relative entropy is the Jensen–Shannon distance. This metric is a symmetrized and smoothed version of KL divergence that does satisfy the metric properties.
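The asymmetry is easy to demonstrate; a sketch assuming discrete distributions with no zero entries, compared against SciPy's Jensen–Shannon distance:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def kl_divergence(p, q):
    """KL(P || Q) = sum p * log(p / q), for strictly positive p and q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])

# Swapping the arguments changes the result: KL is asymmetric
print(kl_divergence(p, q), kl_divergence(q, p))
# Jensen-Shannon distance is symmetric in its arguments
print(jensenshannon(p, q), jensenshannon(q, p))
```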
Another major class is optimal transport distances. The Wasserstein distance, also known as earth mover’s distance, measures the minimum “cost” required to transform one distribution into another by moving probability mass. This distance has a clear geometric interpretation and is sensitive to the underlying space on which the distributions are defined.
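For one-dimensional samples, SciPy computes this directly; a small sketch with made-up data:

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Two empirical samples: the second is the first shifted right by 5,
# so every unit of probability mass must move a distance of 5
a = np.array([0.0, 1.0, 2.0])
b = np.array([5.0, 6.0, 7.0])
print(wasserstein_distance(a, b))  # 5.0
```

Because the cost depends on how far mass moves, a shift of 5 costs five times as much as a shift of 1, which is exactly the geometric sensitivity described above.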
Distribution-based distances are widely used in machine learning tasks such as generative modeling, domain adaptation, anomaly detection and reinforcement learning. They are especially useful when the shape, spread or support of distributions is more important than point-wise differences between samples.
Geometric distance metrics rely on all dimensions of the vector space being equally important and directly comparable. However, this is not always the case. When similarity depends on specific interactions between features or curved decision boundaries, then a kernel can implicitly map the data into a space where those relationships become linear. This situation is common in text data, embeddings, biological sequences and any domain where distance is not well described by straight-line separation.
Imagine that we have two hand-written characters that we need to distinguish from hundreds of samples, as can be found in the MNIST data set.
Each hand-written digit image can be represented as a vector of pixel intensities and that can make for hundreds or thousands of dimensions to analyze to determine the core nature of a hand-written ‘3’. Two different images will probably differ significantly at the pixel level because of the pen used, a slant in the stroke, or even where it is located in the image.
Using a Euclidean distance, two images of a ‘3’ can easily appear very far apart even though a human would easily know that they’re the same number.
A radial basis function (RBF) kernel, on the other hand, measures a similarity that decays smoothly with distance, so two images that differ only by many small pixel-level deformations can still be treated as close. As a distance metric, the RBF kernel supports tasks like nearest-neighbor classification or clustering by ensuring that distance reflects perceptual similarity rather than raw pixel differences.
Two images are close if one can be obtained from the other by many small deformations and that maps well onto how humans read hand-written characters.
The RBF kernel is given as:

K(x, y) = exp(−‖x − y‖² / (2σ²))
The kernel function K(x, y) represents the similarity of x and y, where 1 means the points are identical and values near 0 mean they’re far apart. The term ‖x − y‖² takes the squared Euclidean norm of the difference between the two vectors. This measures how far apart x and y are.
The bandwidth parameter σ controls the “width” or “spread” of the kernel. A small σ means that the kernel decays quickly and only very close points are considered similar. A large σ means that the kernel decays slowly, meaning that comparatively distant points are still considered similar.
Taking the exponential ensures that the output decreases smoothly as distance increases. When x is the same as y, the exponent is zero and the output is 1. With large negative exponents, the output approaches 0. The 2σ² term normalizes the distance and controls how quickly the similarity falls off with distance.
Conceptually, a kernel represents the inner product in a feature space φ, so that K(x, y) = ⟨φ(x), φ(y)⟩. The squared distance between two points in that space is:

‖φ(x) − φ(y)‖² = K(x, x) + K(y, y) − 2K(x, y)
This allows distance calculations in complex feature spaces using only kernel evaluations.
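A minimal sketch of that idea, assuming the kernel form K(x, y) = exp(−‖x − y‖²/(2σ²)) and illustrative function names:

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 * sigma^2)); equals 1 when x == y."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def kernel_distance_sq(x, y, sigma=1.0):
    """Squared feature-space distance: K(x,x) + K(y,y) - 2K(x,y)."""
    return (rbf_kernel(x, x, sigma) + rbf_kernel(y, y, sigma)
            - 2.0 * rbf_kernel(x, y, sigma))

print(kernel_distance_sq([0, 0], [0, 0]))  # 0.0 for identical points
print(kernel_distance_sq([0, 0], [3, 4]))  # approaches 2.0 for distant points
```

Note that the squared feature-space distance is bounded by 2 for the RBF kernel, since K(x, x) = K(y, y) = 1 and K(x, y) approaches 0 as the points separate.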
Kernel-based distances are also used with string and graph kernels, where defining a direct geometric distance is difficult. In these cases, the kernel captures domain-specific similarity, and the induced distance reflects differences in structure rather than numeric coordinates.