What is Hamming Distance?

Hamming distance is a measure of dissimilarity between two data objects of equal length. It is defined as the number of positions at which the corresponding elements in the two objects differ. The objects being compared can be binary strings, character strings or vectors of categorical values, as long as they have the same length.

For example, if two binary strings differ in two bit positions, their Hamming distance is two. You can think of Hamming distance as a restricted form of edit distance between strings: the number of substitutions it would take to turn one string into the other.

This makes Hamming distance a simple and intuitive way to quantify how much two fixed-length representations differ from each other. The technique is named after its inventor, Richard Hamming, one of the most influential thinkers in computer science and information theory.

In data science, Hamming distance is commonly applied when working with binary or categorical data. One frequent use is in similarity measurement, where data points are encoded as binary vectors and compared to identify how similar or different they are. This process is useful in tasks such as recommendation systems and user profiling, where preferences or attributes are often represented as binary features.

Hamming distance is also used in clustering algorithms designed for categorical data, for instance k-modes. In these cases, traditional distance measures like Euclidean or Manhattan distance don’t work well, while Hamming distance is a helpful metric for grouping similar data points.
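As a sketch of the idea, the same position-by-position comparison that drives k-modes can be applied directly to two categorical records. The survey-style data below is hypothetical and purely illustrative:

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

# Hypothetical survey responses encoded as categorical vectors
user_a = ["yes", "red", "small", "daily"]
user_b = ["yes", "blue", "small", "weekly"]

# The records disagree at positions 1 and 3, so the distance is 2
print(hamming_distance(user_a, user_b))
```

Because the comparison only asks whether each pair of values matches, it works equally well for strings, category codes or booleans, which is why Euclidean-style measures can be replaced by it for categorical data.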

Another important application is in error detection and correction. In data management and storage systems, like vector databases, Hamming distance helps detect whether bits have been corrupted and determines how many errors can be corrected. The concept is also relevant in data science workflows that involve noisy or unreliable binary data.

In natural language processing and bioinformatics, Hamming distance is used to compare fixed-length strings, such as short words, DNA sequences or encoded text features. It allows practitioners to identify small differences between sequences efficiently. Hamming distance plays a role in large-scale similarity search through techniques such as locality-sensitive hashing. Binary hash representations of data items can be compared quickly using Hamming distance, enabling efficient approximate nearest-neighbor searches in high-dimensional spaces.

Equation for Hamming Distance

The Hamming distance between two equal-length strings or vectors is defined as the number of positions at which their corresponding elements differ.

Imagine two vectors \(x = (x_1, x_2, \ldots, x_n)\) and \(y = (y_1, y_2, \ldots, y_n)\) containing the same number \(n\) of items. The Hamming distance is given as:

\[ H(x, y) = \sum_{i=1}^{n} \mathbf{1}(x_i \neq y_i) \]

where \(\mathbf{1}(x_i \neq y_i)\) is an indicator function that equals 1 if \(x_i \neq y_i\) and 0 otherwise. Summed across all positions, this gives the number of indices at which the two vectors differ.

Computing Hamming distance

Computing the Hamming distance between two strings is as simple as counting the positions at which their characters differ. For instance, in Python:

def hamming_distance_str(s1, s2):
    """Calculate the Hamming distance between two strings of equal length."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance is not defined for strings of different lengths.")
    # Sum the number of times characters are not equal
    return sum(char1 != char2 for char1, char2 in zip(s1, s2))

string1 = "12345678"
string2 = "87654321"
distance_str = hamming_distance_str(string1, string2)
print(f"The Hamming distance between '{string1}' and '{string2}' is: {distance_str}")

In the preceding example, the Hamming distance is 8 because the characters differ at every position, even though both strings contain the same characters. Each position at which the strings disagree adds one to the distance, so the maximum distance for strings of length n is always n.

Computing the Hamming distance between two integers is typically done by comparing the bits in their binary representations: XOR the two numbers, then count the set bits in the result.

def hamming_distance_int(n1, n2):
    """Calculate the Hamming distance between two integers."""
    xor_result = n1 ^ n2
    # Python 3.10+ provides an efficient built-in method
    return xor_result.bit_count()

num1 = 9   # Binary: 01001
num2 = 14  # Binary: 01110
distance_int = hamming_distance_int(num1, num2)
print(f"The Hamming distance between {num1} and {num2} is: {distance_int}")

Applications of Hamming distance

Hamming distance is used widely in data science and machine learning as a measure of dissimilarity between two vectors of equal length by counting the number of positions at which their elements differ. It is most often used when working with binary, categorical and other kinds of discrete data as opposed to continuous numerical values or sparse data.

One common application where a data scientist might use Hamming distance is comparing binary or categorical feature vectors such as user profiles, survey responses or genetic markers. When these attributes are encoded as fixed-length vectors, Hamming distance provides a fast and interpretable way to measure how many attributes disagree between two observations.

Hamming distance is also widely used in classification and clustering tasks when features are binary. In methods like k-nearest neighbors or hierarchical clustering, the Hamming distance can serve as a more meaningful distance metric than Euclidean distance for presence-absence data because it directly reflects the number of mismatched features.
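A minimal sketch of nearest-neighbor classification over presence-absence data, using Hamming distance as the metric. The training vectors and labels below are hypothetical, and a real workflow would more likely use a library implementation (for example, scikit-learn's nearest-neighbor classes):

```python
def hamming(a, b):
    """Count mismatched positions between two equal-length binary vectors."""
    return sum(x != y for x, y in zip(a, b))

def nn_classify(query, examples):
    """Label a query by its single nearest neighbor under Hamming distance.
    `examples` is a list of (binary_vector, label) pairs."""
    _, label = min(examples, key=lambda ex: hamming(query, ex[0]))
    return label

# Hypothetical presence-absence feature vectors with class labels
train = [([1, 0, 1, 1], "A"), ([0, 1, 0, 0], "B"), ([1, 1, 1, 0], "A")]
query = [1, 0, 1, 0]

print(nn_classify(query, train))
```

Here the query disagrees with the first "A" example in only one feature, so it is assigned label "A"; a Euclidean distance over the same 0/1 vectors would rank neighbors identically in this case, but Hamming distance states the result directly in terms of mismatched features.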

In text and natural language processing, Hamming distance is used when strings or documents are represented in fixed-length encodings, such as binary hashes or aligned character sequences. It is useful for detecting near-duplicate items or small differences when the representations are comparable position by position.

In multi-label classification, Hamming distance underlies Hamming loss, which measures how many label predictions are incorrect across all labels. This method allows model performance to be evaluated at a granular level, rather than requiring all labels for an instance to be predicted correctly at once. Libraries like Scikit-Learn (often referred to as sklearn) in Python offer implementations of Hamming loss.
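The metric itself is simple to state in code. The sketch below computes Hamming loss by hand on made-up label matrices; scikit-learn's `sklearn.metrics.hamming_loss` provides an equivalent implementation:

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label predictions that are wrong,
    averaged over all instances and all labels."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p
                for row_t, row_p in zip(y_true, y_pred)
                for t, p in zip(row_t, row_p))
    return wrong / total

# Hypothetical multi-label ground truth and predictions (2 instances, 3 labels)
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 1, 1], [0, 1, 1]]

# 2 of the 6 label slots are wrong, so the loss is 2/6
print(hamming_loss(y_true, y_pred))
```

Note that a strict "exact match" metric would score both instances as failures, while Hamming loss credits the four label slots that were predicted correctly.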

Hamming distance is also used to compare model structures or configurations encoded as binary vectors, such as feature-selection masks or rule-based model decisions. In these cases, it quantifies how different two models are in terms of their discrete choices.

However, there are caveats to all these uses. As we’ve seen, Hamming distance does not capture whether two sets contain similar values overall; it measures only whether values match exactly, position by position. It is also a poor fit for highly sparse data, for instance, matrix data where most positions hold no value, or settings with multiple non-exclusive labels.

Hamming bit

One interesting and widely used extension of Hamming distance is in coding theory and cryptography. A Hamming bit, more commonly called a parity or check bit, is an extra bit added to a data word for error detection and correction. It does not represent original information but instead encodes redundancy that allows errors in the data to be identified.

The Hamming distance of a code determines how many errors can be detected and corrected. In error-correcting codes, we consider the minimum Hamming distance between any two valid codewords in the coding scheme; this minimum distance determines the error-handling capability of the code.

Hamming bits are placed at specific positions within a binary word, typically at positions that are powers of two like 1, 2, 4, or 8. Each Hamming bit is responsible for checking a particular subset of data bits.

A Hamming code is described by the total number of bits and the number of data bits in a codeword. For instance, a (7,4) code encodes 4 data bits into 7 total bits, with 3 Hamming bits included. The more Hamming bits a code uses, the more errors it can detect.

Detecting errors in encodings or signal transmission is vitally important when there might be corruption. For instance, radio transmissions or readings from physical sensors can suffer from the flipping of a single bit in a datastream, which can dramatically change the meaning of a signal.

To compute the Hamming bit, we define a parity bit \(p\) and use it to check a set of data bits \(b_1, b_2, \ldots, b_k\).

For even parity, the parity bit is defined as: \(p = b_1 \oplus b_2 \oplus \cdots \oplus b_k\)

For odd parity, the parity bit is: \(p = 1 \oplus b_1 \oplus b_2 \oplus \cdots \oplus b_k\)

The XOR operation ensures that the total number of ones in the checked bits, including the parity bit, is even or odd as required.
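These two definitions can be sketched directly as a fold over XOR. The function name and sample bits below are illustrative, not from any particular library:

```python
from functools import reduce
from operator import xor

def parity_bit(bits, odd=False):
    """Compute a parity bit over a list of data bits.
    Even parity: p = b1 ^ b2 ^ ... ^ bk
    Odd parity:  p = 1 ^ b1 ^ ... ^ bk"""
    p = reduce(xor, bits, 0)
    return p ^ 1 if odd else p

bits = [1, 0, 1, 1]
p = parity_bit(bits)
# With even parity, the total count of ones (data bits plus parity bit) is even
print(p, (sum(bits) + p) % 2)
```

Here the data word contains three ones, so the even-parity bit is 1, bringing the total count of ones to four.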

Hamming bits are constructed so that the set of all valid codewords has a guaranteed minimum Hamming distance. For a standard Hamming code, the minimum Hamming distance is \(d_{\min} = 3\).

This minimum distance allows the detection of up to \(d_{\min} - 1\) errors and the correction of up to \(\lfloor (d_{\min} - 1)/2 \rfloor\) errors. This is how the equations for Hamming bits directly determine the Hamming distance properties of the code used for the binary data.
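Putting the pieces together, a (7,4) Hamming code can be sketched in a few lines. The encoding below follows the textbook layout with parity bits at positions 1, 2 and 4; the data word is a made-up example:

```python
def encode_7_4(d):
    """Encode 4 data bits into a 7-bit Hamming (7,4) codeword.
    Positions (1-indexed): 1, 2, 4 hold parity bits; 3, 5, 6, 7 hold data."""
    code = [0] * 8  # index 0 unused so indices match 1-based positions
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # parity over positions 4, 5, 6, 7
    return code[1:]

def correct_7_4(code):
    """Recompute the parity checks; a nonzero syndrome names the flipped bit."""
    c = [0] + list(code)
    s1 = c[1] ^ c[3] ^ c[5] ^ c[7]
    s2 = c[2] ^ c[3] ^ c[6] ^ c[7]
    s4 = c[4] ^ c[5] ^ c[6] ^ c[7]
    syndrome = s1 + 2 * s2 + 4 * s4  # 0 means no error; else the error position
    if syndrome:
        c[syndrome] ^= 1
    return c[1:]

data_bits = [1, 0, 1, 1]
codeword = encode_7_4(data_bits)
corrupted = codeword.copy()
corrupted[4] ^= 1  # simulate one bit flipped in transmission
repaired = correct_7_4(corrupted)
print(codeword, repaired)
```

Because the valid codewords of this scheme sit at minimum Hamming distance 3 from one another, a single flipped bit leaves the corrupted word closer to its original codeword than to any other, which is exactly why the syndrome can locate and repair it.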

Author

Joshua Noble

Data Scientist
