This is a statistical measure used to quantify how similar two sets are; it is widely used when working with labels and categorical data.
Jaccard similarity is a statistical measure used to quantify how similar two sets are. It equals the ratio of the intersection size to the union size. The value of Jaccard similarity ranges from zero to one, where zero indicates that the sets have no elements in common and one indicates that the sets are identical.
For two sets A and B, the formula defines Jaccard similarity as:

J(A, B) = |A ∩ B| / |A ∪ B|
This definition is why Jaccard similarity is also described as Intersection over Union (IoU): the numerator is the intersection of the sets, while the denominator is their union.
The scientist Paul Jaccard developed Jaccard similarity at the beginning of the 20th century to help differentiate species of plants. There are several names for this measure of similarity and the associated measure of distance. Scientists Grove Karl Gilbert and Taffee Tadashi Tanimoto also independently developed the same measure, which they called the 'ratio of verification' and the Tanimoto index, respectively. All of these measures are calculated in the same way and should be considered the same similarity algorithm, regardless of name.
In data science, Jaccard similarity is widely used when data is represented as sets or as binary features indicating the presence or absence of attributes. It is an effective way to calculate document similarity, where documents are represented as sets of words.
Documents are often represented as sets of words, n-grams or shingles. Jaccard similarity is then used to detect document similarity, content-based filtering, near-duplicate documents and plagiarism. It is also used in information retrieval to compare queries with documents based on shared terms, as in vector search with vector databases.
In recommendation systems, user behavior is often represented as sets of items, such as products viewed, movies watched or articles clicked. By comparing the overlap between user item sets, systems can identify users with similar interests or suggest items that similar users have interacted with.
Jaccard similarity is also used in clustering and classification tasks involving categorical or binary data because it focuses on shared positive attributes and ignores features absent in both sets. This property makes it useful for sparse data, such as text data or transaction records.
In clustering and classification, Jaccard similarity is useful for categorical or binary feature spaces, especially when the data is sparse. It is commonly used in market basket analysis, clickstream analysis and transactional data, where the presence of an item is more informative than its absence.
To compute Jaccard similarity in Python, define two sets and divide the size of their intersection by the size of their union.
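A minimal sketch of that calculation; the specific values are illustrative, chosen so the score matches the result discussed next:

```python
def jaccard_similarity(a, b):
    """Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Illustrative sequences: two of the four values in the longer
# sequence (5 and 9) also appear in the shorter one.
longer = [3, 5, 8, 9]
shorter = [9, 5]
score = jaccard_similarity(longer, shorter)
print(score)  # 0.5
```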
This calculation produces a Jaccard similarity score of 0.5. Half of the numbers in the longer sequence are present in the shorter, although only the value ‘5’ appears in the same location.
In confusion matrices employed for binary classification, the Jaccard index is computed as the True Positives (TP) divided by the sum of the True Positives, False Positives (FP) and False Negatives (FN).
The Jaccard index is closely related to the F1 score but applies a stricter criterion. For binary classification, the relationship is:

J = F1 / (2 - F1)
To calculate this from a confusion matrix in Python:
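A hedged sketch with made-up confusion-matrix counts, which also verifies the F1 relationship above:

```python
def jaccard_from_confusion(tp, fp, fn):
    # Jaccard index for the positive class: TP / (TP + FP + FN).
    return tp / (tp + fp + fn)

# Hypothetical counts from a binary classifier.
tp, fp, fn = 40, 10, 10
jaccard = jaccard_from_confusion(tp, fp, fn)
f1 = 2 * tp / (2 * tp + fp + fn)

print(jaccard)        # 40 / 60 ≈ 0.667
print(f1 / (2 - f1))  # same value, recovered from the F1 score
```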
Jaccard distance is a measure of dissimilarity between two sets, derived directly from Jaccard similarity. While Jaccard similarity measures how much two sets overlap, Jaccard distance measures how different they are. It is defined as one minus the Jaccard similarity. The value ranges from zero to one, where zero indicates identical sets and one indicates disjoint sets.
For two sets A and B, the Jaccard distance is defined as:

d(A, B) = 1 - J(A, B) = 1 - |A ∩ B| / |A ∪ B|
In data science, Jaccard distance is commonly used when data is represented as sets or as binary vectors indicating the presence or absence of features. It is useful for measuring dissimilarity in sparse, high-dimensional data where most features are zero.
One major application of Jaccard distance is in clustering tasks. Algorithms that operate on distance matrices, such as hierarchical clustering or density-based methods, can use Jaccard distance. The method places data points with fewer common elements farther apart and data points with greater overlap closer together. This approach is common in text clustering, transaction data analysis and market basket analysis.
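As a sketch of how such a distance matrix might be built for market basket analysis, here is a pure-Python example; the customer names and baskets are invented for illustration:

```python
from itertools import combinations

def jaccard_distance(a, b):
    # Jaccard distance: 1 - |A ∩ B| / |A ∪ B|.
    return 1 - len(a & b) / len(a | b)

# Hypothetical market-basket data: each customer is a set of purchased items.
baskets = {
    "c1": {"milk", "bread", "eggs"},
    "c2": {"milk", "bread", "butter"},
    "c3": {"beer", "chips"},
}

# Pairwise distances; a clustering algorithm fed this matrix would place
# c1 and c2 together and leave c3 in its own cluster.
distances = {
    (u, v): jaccard_distance(baskets[u], baskets[v])
    for u, v in combinations(baskets, 2)
}
print(distances)
```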
Jaccard distance is also used in anomaly and outlier detection. Data points that have a high Jaccard distance from most other points can represent unusual or rare combinations of features, which can indicate anomalies in user behavior, network traffic or transaction patterns.
In information retrieval and text mining, Jaccard distance is used to filter out near-duplicate documents or to quantify how different two documents are based on their word or n-gram sets. It is often combined with approximate methods such as MinHash to scale computations to large datasets.
In systems involving large language models, Jaccard distance is typically applied to symbolic representations such as token sets, keyword lists or metadata rather than dense embeddings. It can be used to measure how different prompts, documents or model outputs are at a lexical level. The method is sometimes combined with vector-based distances to provide a more complete notion of dissimilarity.
Imagine a model predicting which items someone will pick from a set of recommended products, compared with what they actually purchased. To keep this example simple, imagine that there are only 5 recommended items and the model predicts whether the shopper will buy each of them. The predictions are stored in one binary array, and the shopper's actual purchases in another.
The Jaccard score represents the overlap between the predicted and purchased items, while the Jaccard distance measures their dissimilarity and is calculated as 1 minus the Jaccard similarity score.
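A sketch of this shopper example; the array names and values are hypothetical:

```python
def jaccard_binary(y_true, y_pred):
    # Positive-class Jaccard: TP / (TP + FP + FN) over binary arrays.
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp / (tp + fp + fn)

# Hypothetical arrays: 1 = bought / predicted to buy, 0 = not.
y_pred = [1, 0, 1, 1, 0]  # the model's predictions for the 5 items
y_true = [1, 1, 0, 1, 0]  # what the shopper actually purchased

similarity = jaccard_binary(y_true, y_pred)
print(similarity)      # 0.5
print(1 - similarity)  # Jaccard distance: 0.5
```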
Both Jaccard distance and similarity have broad applications in AI and machine learning. For example, when training large language models, preparing the corpus of data that the model will use for training requires careful data cleaning and filtering.
Jaccard similarity is used in approximate similarity search with vector databases like OpenSearch through techniques like MinHash and locality-sensitive hashing. These methods enable efficient estimation of Jaccard similarity at large scale and are used in natural language processing, document deduplication, large corpus filtering and preprocessing pipelines for training large language models.
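To illustrate the idea behind MinHash, here is a simplified sketch (production systems would use a library such as datasketch rather than this hand-rolled version): each set is compressed into a short signature, and the fraction of matching signature slots is an unbiased estimate of the true Jaccard similarity.

```python
import random

def minhash_signature(items, num_hashes=128, seed=0):
    # Each hash function is (a*x + b) mod p, a classic universal-hash family.
    rng = random.Random(seed)
    p = 2**61 - 1  # a large Mersenne prime
    params = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(num_hashes)]
    return [min((a * hash(x) + b) % p for x in items) for a, b in params]

def estimate_jaccard(sig1, sig2):
    # The fraction of matching slots approximates the true Jaccard similarity.
    matches = sum(s1 == s2 for s1, s2 in zip(sig1, sig2))
    return matches / len(sig1)

# Two illustrative word sets; their true Jaccard similarity is 3/7 ≈ 0.43.
a = {"data", "science", "jaccard", "similarity", "set"}
b = {"data", "science", "cosine", "similarity", "vector"}

est = estimate_jaccard(minhash_signature(a), minhash_signature(b))
print(est)  # close to 0.43; exact value varies with the hash functions
```

The payoff is scale: signatures are fixed-size regardless of how large the sets are, and locality-sensitive hashing can bucket similar signatures so that near-duplicates are found without comparing every pair.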
In systems that combine large language models with vector databases, Jaccard similarity is sometimes used alongside vector similarity measures. While cosine similarity or dot product are effective for dense embeddings, Jaccard similarity applies to sparse representations such as bag-of-words vectors, keyword indexes or metadata tags. This hybrid approach balances semantic similarity from embeddings with exact or symbolic overlap from set-based features.
In graph-based network analysis, Jaccard similarity is used to compare node neighborhoods in graphs, helping identify and compare clusters and networks of relationships. When analyzing data derived from a social media platform, for instance, the Jaccard distance can be used to compare the strength of relationships or patterns of behavior across different groups.
A group of 500 can act in ways similar to a group of 50. Because Jaccard similarity normalizes by the size of the union, it provides a meaningful measure of overlap without being dominated by the raw sizes of the sets being compared.
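Comparing node neighborhoods can be sketched as follows; the graph and user names here are invented for illustration:

```python
# Hypothetical adjacency sets for a small social graph.
graph = {
    "alice": {"bob", "carol", "dave"},
    "bob": {"alice", "carol", "erin"},
}

def neighborhood_jaccard(g, u, v):
    # Overlap of the two nodes' neighbor sets: |N(u) ∩ N(v)| / |N(u) ∪ N(v)|.
    a, b = g[u], g[v]
    return len(a & b) / len(a | b)

# alice and bob share only carol among five distinct neighbors.
sim = neighborhood_jaccard(graph, "alice", "bob")
print(sim)  # 0.2
```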