What is Hamming Distance?

Hamming distance is a measure of dissimilarity between two data objects of equal length. It is defined as the number of positions at which the corresponding elements in the two objects differ. The objects being compared can be binary strings, character strings or vectors of categorical values, as long as they have the same length.

For example, if two binary strings differ in two bit positions, their Hamming distance is two. You can think of Hamming distance as a restricted form of edit distance between strings: the number of substitutions it would take to turn one string into the other.

This makes Hamming distance a simple and intuitive way to quantify how much two fixed-length representations differ from each other. The technique is named after its inventor, Richard Hamming, one of the most influential thinkers in computer science and information theory.

In data science, Hamming distance is commonly applied when working with binary or categorical data. One frequent use is in similarity measurement, where data points are encoded as binary vectors and compared to identify how similar or different they are. This process is useful in tasks such as recommendation systems and user profiling, where preferences or attributes are often represented as binary features.

Hamming distance is also used in clustering algorithms designed for categorical data, for instance k-modes. In these cases, traditional distance measures like Euclidean or Manhattan distance don’t work well, while Hamming distance is a helpful metric for grouping similar data points.
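As a sketch of the idea, the same position-by-position comparison that drives k-modes can be applied directly to two categorical records. The survey-style data below is hypothetical and purely illustrative:

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

# Hypothetical survey responses encoded as categorical vectors
user_a = ["yes", "red", "small", "daily"]
user_b = ["yes", "blue", "small", "weekly"]

# The records disagree at positions 1 and 3, so the distance is 2
print(hamming_distance(user_a, user_b))
```

Because the comparison only asks whether each pair of values matches, it works equally well for strings, category codes or booleans, which is why Euclidean-style measures can be replaced by it for categorical data.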

Another important application is in error detection and correction. In data management and storage systems, like vector databases, Hamming distance helps detect whether bits have been corrupted and determines how many errors can be corrected. The concept is also relevant in data science workflows that involve noisy or unreliable binary data.

In natural language processing and bioinformatics, Hamming distance is used to compare fixed-length strings, such as short words, DNA sequences or encoded text features. It allows practitioners to identify small differences between sequences efficiently. Hamming distance plays a role in large-scale similarity search through techniques such as locality-sensitive hashing. Binary hash representations of data items can be compared quickly using Hamming distance, enabling efficient approximate nearest-neighbor searches in high-dimensional spaces.

Equation for Hamming Distance

The Hamming distance between two equal-length strings or vectors is defined as the number of positions at which their corresponding elements differ.

Imagine two vectors \(x = (x_1, x_2, \ldots, x_n)\) and \(y = (y_1, y_2, \ldots, y_n)\) containing the same number \(n\) of items. The Hamming distance is given as:

\[ H(x, y) = \sum_{i=1}^{n} \mathbf{1}(x_i \neq y_i) \]

where \(\mathbf{1}(x_i \neq y_i)\) is an indicator function that equals 1 if \(x_i \neq y_i\) and 0 otherwise. Summed across all positions, this gives the number of indices at which the two vectors differ.

Computing Hamming distance

Computing the Hamming distance between two strings is as simple as counting the positions at which their characters differ. For instance, in Python:

def hamming_distance_str(s1, s2):
    """Calculate the Hamming distance between two strings of equal length."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance is not defined for strings of different lengths.")
    # Sum the number of times characters are not equal
    return sum(char1 != char2 for char1, char2 in zip(s1, s2))

string1 = "12345678"
string2 = "87654321"
distance_str = hamming_distance_str(string1, string2)
print(f"The Hamming distance between '{string1}' and '{string2}' is: {distance_str}")

In the preceding example, the Hamming distance is 8 because the characters differ at every position, even though both strings contain the same characters. Each position at which the strings disagree adds one to the distance, so the maximum distance for strings of length n is always n.

Computing the Hamming distance between two integers is typically done by comparing the bits in their binary representations: XOR the two numbers, then count the set bits in the result.

def hamming_distance_int(n1, n2):
    """Calculate the Hamming distance between two integers."""
    xor_result = n1 ^ n2
    # Python 3.10+ provides an efficient built-in method
    return xor_result.bit_count()

num1 = 9   # Binary: 01001
num2 = 14  # Binary: 01110
distance_int = hamming_distance_int(num1, num2)
print(f"The Hamming distance between {num1} and {num2} is: {distance_int}")

Applications of Hamming distance

Hamming distance is used widely in data science and machine learning as a measure of dissimilarity between two vectors of equal length by counting the number of positions at which their elements differ. It is most often used when working with binary, categorical and other kinds of discrete data as opposed to continuous numerical values or sparse data.

One common application where a data scientist might use Hamming distance is comparing binary or categorical feature vectors such as user profiles, survey responses or genetic markers. When these attributes are encoded as fixed-length vectors, Hamming distance provides a fast and interpretable way to measure how many attributes disagree between two observations.

Hamming distance is also widely used in classification and clustering tasks when features are binary. In methods like k-nearest neighbors or hierarchical clustering, the Hamming distance can serve as a more meaningful distance metric than Euclidean distance for presence-absence data because it directly reflects the number of mismatched features.
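A minimal sketch of nearest-neighbor classification over presence-absence data, using Hamming distance as the metric. The training vectors and labels below are hypothetical, and a real workflow would more likely use a library implementation (for example, scikit-learn's nearest-neighbor classes):

```python
def hamming(a, b):
    """Count mismatched positions between two equal-length binary vectors."""
    return sum(x != y for x, y in zip(a, b))

def nn_classify(query, examples):
    """Label a query by its single nearest neighbor under Hamming distance.
    `examples` is a list of (binary_vector, label) pairs."""
    _, label = min(examples, key=lambda ex: hamming(query, ex[0]))
    return label

# Hypothetical presence-absence feature vectors with class labels
train = [([1, 0, 1, 1], "A"), ([0, 1, 0, 0], "B"), ([1, 1, 1, 0], "A")]
query = [1, 0, 1, 0]

print(nn_classify(query, train))
```

Here the query disagrees with the first "A" example in only one feature, so it is assigned label "A"; a Euclidean distance over the same 0/1 vectors would rank neighbors identically in this case, but Hamming distance states the result directly in terms of mismatched features.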

In text and natural language processing, Hamming distance is used when strings or documents are represented in fixed-length encodings, such as binary hashes or aligned character sequences. It is useful for detecting near-duplicate items or small differences when the representations are comparable position by position.

In multi-label classification, Hamming distance underlies Hamming loss, which measures how many label predictions are incorrect across all labels. This method allows model performance to be evaluated at a granular level, rather than requiring all labels for an instance to be predicted correctly at once. Libraries like Scikit-Learn (often referred to as sklearn) in Python offer implementations of Hamming loss.
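The metric itself is simple to state in code. The sketch below computes Hamming loss by hand on made-up label matrices; scikit-learn's `sklearn.metrics.hamming_loss` provides an equivalent implementation:

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label predictions that are wrong,
    averaged over all instances and all labels."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p
                for row_t, row_p in zip(y_true, y_pred)
                for t, p in zip(row_t, row_p))
    return wrong / total

# Hypothetical multi-label ground truth and predictions (2 instances, 3 labels)
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 1, 1], [0, 1, 1]]

# 2 of the 6 label slots are wrong, so the loss is 2/6
print(hamming_loss(y_true, y_pred))
```

Note that a strict "exact match" metric would score both instances as failures, while Hamming loss credits the four label slots that were predicted correctly.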

Hamming distance is also used to compare model structures or configurations encoded as binary vectors, such as feature-selection masks or rule-based model decisions. In these cases, it quantifies how different two models are in terms of their discrete choices.

However, there are caveats to all these uses. As we’ve seen, Hamming distance does not capture whether two sets contain similar values overall; it measures only whether values match exactly, position by position. It is also a poor fit for highly sparse data, for instance, matrix data where most positions hold no value, or settings with multiple non-exclusive labels.

Hamming bit

One interesting and widely used extension of Hamming distance is in coding theory and cryptography. A Hamming bit, more commonly called a parity or check bit, is an extra bit added to a data word for error detection and correction. It does not represent original information but instead encodes redundancy that allows errors in the data to be identified.

The Hamming distance of a code determines how many errors can be detected and corrected. In error-correcting codes, we consider the minimum Hamming distance between any two valid codewords in the coding scheme; this minimum distance determines the error-handling capability of the code.

Hamming bits are placed at specific positions within a binary word, typically at positions that are powers of two like 1, 2, 4, or 8. Each Hamming bit is responsible for checking a particular subset of data bits.

A Hamming code is described by the total number of bits and the number of data bits in a codeword. For instance, a (7,4) code encodes 4 data bits into 7 total bits, with 3 Hamming bits included. The more Hamming bits a code uses, the more errors it can detect.

Detecting errors in encodings or signal transmission is vitally important when there might be corruption. For instance, radio transmissions or readings from physical sensors can suffer from the flipping of a single bit in a datastream, which can dramatically change the meaning of a signal.

To compute the Hamming bit, we define a parity bit \(p\) and use it to check a set of data bits \(b_1, b_2, \ldots, b_k\).

For even parity, the parity bit is defined as: \(p = b_1 \oplus b_2 \oplus \cdots \oplus b_k\)

For odd parity, the parity bit is: \(p = 1 \oplus b_1 \oplus b_2 \oplus \cdots \oplus b_k\)

The XOR operation ensures that the total number of ones in the checked bits, including the parity bit, is even or odd as required.
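These two definitions can be sketched directly as a fold over XOR. The function name and sample bits below are illustrative, not from any particular library:

```python
from functools import reduce
from operator import xor

def parity_bit(bits, odd=False):
    """Compute a parity bit over a list of data bits.
    Even parity: p = b1 ^ b2 ^ ... ^ bk
    Odd parity:  p = 1 ^ b1 ^ ... ^ bk"""
    p = reduce(xor, bits, 0)
    return p ^ 1 if odd else p

bits = [1, 0, 1, 1]
p = parity_bit(bits)
# With even parity, the total count of ones (data bits plus parity bit) is even
print(p, (sum(bits) + p) % 2)
```

Here the data word contains three ones, so the even-parity bit is 1, bringing the total count of ones to four.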

Hamming bits are constructed so that the set of all valid codewords has a guaranteed minimum Hamming distance. For a standard Hamming code, the minimum Hamming distance is \(d_{\min} = 3\).

This minimum distance allows the detection of up to \(d_{\min} - 1\) errors and the correction of up to \(\lfloor (d_{\min} - 1)/2 \rfloor\) errors. This is how the equations for Hamming bits directly determine the Hamming distance properties of the code used for the binary data.
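Putting the pieces together, a (7,4) Hamming code can be sketched in a few lines. The encoding below follows the textbook layout with parity bits at positions 1, 2 and 4; the data word is a made-up example:

```python
def encode_7_4(d):
    """Encode 4 data bits into a 7-bit Hamming (7,4) codeword.
    Positions (1-indexed): 1, 2, 4 hold parity bits; 3, 5, 6, 7 hold data."""
    code = [0] * 8  # index 0 unused so indices match 1-based positions
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # parity over positions 4, 5, 6, 7
    return code[1:]

def correct_7_4(code):
    """Recompute the parity checks; a nonzero syndrome names the flipped bit."""
    c = [0] + list(code)
    s1 = c[1] ^ c[3] ^ c[5] ^ c[7]
    s2 = c[2] ^ c[3] ^ c[6] ^ c[7]
    s4 = c[4] ^ c[5] ^ c[6] ^ c[7]
    syndrome = s1 + 2 * s2 + 4 * s4  # 0 means no error; else the error position
    if syndrome:
        c[syndrome] ^= 1
    return c[1:]

data_bits = [1, 0, 1, 1]
codeword = encode_7_4(data_bits)
corrupted = codeword.copy()
corrupted[4] ^= 1  # simulate one bit flipped in transmission
repaired = correct_7_4(corrupted)
print(codeword, repaired)
```

Because the valid codewords of this scheme sit at minimum Hamming distance 3 from one another, a single flipped bit leaves the corrupted word closer to its original codeword than to any other, which is exactly why the syndrome can locate and repair it.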

Author

Joshua Noble

Data Scientist
