This is a statistical measure used to quantify how similar two sets are; it is widely used when working with labels and categorical data.
Jaccard similarity is a statistical measure used to quantify how similar two sets are. It equals the ratio of the intersection size to the union size. The value of Jaccard similarity ranges from zero to one, where zero indicates that the sets have no elements in common and one indicates that the sets are identical.
For two sets A and B, the formula defines Jaccard similarity as:

J(A, B) = |A ∩ B| / |A ∪ B|
This definition is why Jaccard similarity is also described as Intersection over Union (IoU): the numerator is the intersection of the sets, while the denominator is their union.
The scientist Paul Jaccard developed Jaccard similarity at the beginning of the 20th century to help differentiate species of plants. There are several names for this measure of similarity and the associated measure of distance. Scientists Grove Karl Gilbert and Taffee Tadashi Tanimoto also independently developed the same measure, which they called the 'ratio of verification' and the Tanimoto index, respectively. All of these measures are calculated in the same way and should be considered the same similarity algorithm, regardless of name.
In data science, Jaccard similarity is widely used when data is represented as sets or as binary features indicating the presence or absence of attributes. It is an effective way to calculate document similarity, where documents are represented as sets of words.
Documents are often represented as sets of words, n-grams or shingles. Jaccard similarity is then used to detect document similarity, content-based filtering, near-duplicate documents and plagiarism. It is also used in information retrieval to compare queries with documents based on shared terms, as in vector search with vector databases.
In recommendation systems, user behavior is often represented as sets of items, such as products viewed, movies watched or articles clicked. By comparing the overlap between user item sets, systems can identify users with similar interests or suggest items that similar users have interacted with.
Jaccard similarity is also used in clustering and classification tasks involving categorical or binary data because it focuses on shared positive attributes and ignores features absent in both sets. This property makes it useful for sparse data, such as text data or transaction records.
In clustering and classification, Jaccard similarity is useful for categorical or binary feature spaces, especially when the data is sparse. It is commonly used in market basket analysis, clickstream analysis and transactional data, where the presence of an item is more informative than its absence.
To compute Jaccard similarity in Python, define two sets and divide the size of their intersection by the size of their union.
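A minimal sketch of that calculation; the specific values are illustrative, chosen so the score matches the result discussed next:

```python
def jaccard_similarity(a, b):
    """Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Illustrative sequences: two of the four values in the longer
# sequence (5 and 9) also appear in the shorter one.
longer = [3, 5, 8, 9]
shorter = [9, 5]
score = jaccard_similarity(longer, shorter)
print(score)  # 0.5
```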
This calculation produces a Jaccard similarity score of 0.5. Half of the numbers in the longer sequence are present in the shorter, although only the value ‘5’ appears in the same location.
In confusion matrices employed for binary classification, the Jaccard index is computed as the True Positives (TP) divided by the sum of the True Positives, False Positives (FP) and False Negatives (FN).
The Jaccard index is closely related to the F1 score but applies a stricter criterion. For binary classification, the relationship is:

J = F1 / (2 - F1)
To calculate this from a confusion matrix in Python:
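A hedged sketch with made-up confusion-matrix counts, which also verifies the F1 relationship above:

```python
def jaccard_from_confusion(tp, fp, fn):
    # Jaccard index for the positive class: TP / (TP + FP + FN).
    return tp / (tp + fp + fn)

# Hypothetical counts from a binary classifier.
tp, fp, fn = 40, 10, 10
jaccard = jaccard_from_confusion(tp, fp, fn)
f1 = 2 * tp / (2 * tp + fp + fn)

print(jaccard)        # 40 / 60 ≈ 0.667
print(f1 / (2 - f1))  # same value, recovered from the F1 score
```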
Jaccard distance is a measure of dissimilarity between two sets, derived directly from Jaccard similarity. While Jaccard similarity measures how much two sets overlap, Jaccard distance measures how different they are. It is defined as one minus the Jaccard similarity. The value ranges from zero to one, where zero indicates identical sets and one indicates disjoint sets.
For two sets A and B, the Jaccard distance is defined as:

d(A, B) = 1 - J(A, B) = 1 - |A ∩ B| / |A ∪ B|
In data science, Jaccard distance is commonly used when data is represented as sets or as binary vectors indicating the presence or absence of features. It is useful for measuring dissimilarity in sparse, high-dimensional data where most features are zero.
One major application of Jaccard distance is in clustering tasks. Algorithms that operate on distance matrices, such as hierarchical clustering or density-based methods, can use Jaccard distance. The method places data points with fewer common elements farther apart and data points with greater overlap closer together. This approach is common in text clustering, transaction data analysis and market basket analysis.
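As a sketch of how such a distance matrix might be built for market basket analysis, here is a pure-Python example; the customer names and baskets are invented for illustration:

```python
from itertools import combinations

def jaccard_distance(a, b):
    # Jaccard distance: 1 - |A ∩ B| / |A ∪ B|.
    return 1 - len(a & b) / len(a | b)

# Hypothetical market-basket data: each customer is a set of purchased items.
baskets = {
    "c1": {"milk", "bread", "eggs"},
    "c2": {"milk", "bread", "butter"},
    "c3": {"beer", "chips"},
}

# Pairwise distances; a clustering algorithm fed this matrix would place
# c1 and c2 together and leave c3 in its own cluster.
distances = {
    (u, v): jaccard_distance(baskets[u], baskets[v])
    for u, v in combinations(baskets, 2)
}
print(distances)
```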
Jaccard distance is also used in anomaly and outlier detection. Data points that have a high Jaccard distance from most other points can represent unusual or rare combinations of features, which can indicate anomalies in user behavior, network traffic or transaction patterns.
In information retrieval and text mining, Jaccard distance is used to filter out near-duplicate documents or to quantify how different two documents are based on their word or n-gram sets. It is often combined with approximate methods such as MinHash to scale computations to large datasets.
In systems involving large language models, Jaccard distance is typically applied to symbolic representations such as token sets, keyword lists or metadata rather than dense embeddings. It can be used to measure how different prompts, documents or model outputs are at a lexical level. The method is sometimes combined with vector-based distances to provide a more complete notion of dissimilarity.
Imagine a model predicting which items someone will pick from a set of recommended products, compared with what they actually purchased. To keep this example simple, imagine that there are only 5 recommended items and the model predicts whether the shopper will buy each of them. The predictions are stored in one binary array, and the shopper's actual purchases in another.
The Jaccard score represents the overlap between the predicted and purchased items, while the Jaccard distance measures their dissimilarity and is calculated as 1 minus the Jaccard similarity score.
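A sketch of this shopper example; the array names and values are hypothetical:

```python
def jaccard_binary(y_true, y_pred):
    # Positive-class Jaccard: TP / (TP + FP + FN) over binary arrays.
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp / (tp + fp + fn)

# Hypothetical arrays: 1 = bought / predicted to buy, 0 = not.
y_pred = [1, 0, 1, 1, 0]  # the model's predictions for the 5 items
y_true = [1, 1, 0, 1, 0]  # what the shopper actually purchased

similarity = jaccard_binary(y_true, y_pred)
print(similarity)      # 0.5
print(1 - similarity)  # Jaccard distance: 0.5
```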
Both Jaccard distance and similarity have broad applications in AI and machine learning. For example, when training large language models, preparing the corpus of data that the model will use for training requires careful data cleaning and filtering.
Jaccard similarity is used in approximate similarity search with vector databases like OpenSearch through techniques like MinHash and locality-sensitive hashing. These methods enable efficient estimation of Jaccard similarity at large scale and are used in natural language processing, document deduplication, large corpus filtering and preprocessing pipelines for training large language models.
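To illustrate the idea behind MinHash, here is a simplified sketch (production systems would use a library such as datasketch rather than this hand-rolled version): each set is compressed into a short signature, and the fraction of matching signature slots is an unbiased estimate of the true Jaccard similarity.

```python
import random

def minhash_signature(items, num_hashes=128, seed=0):
    # Each hash function is (a*x + b) mod p, a classic universal-hash family.
    rng = random.Random(seed)
    p = 2**61 - 1  # a large Mersenne prime
    params = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(num_hashes)]
    return [min((a * hash(x) + b) % p for x in items) for a, b in params]

def estimate_jaccard(sig1, sig2):
    # The fraction of matching slots approximates the true Jaccard similarity.
    matches = sum(s1 == s2 for s1, s2 in zip(sig1, sig2))
    return matches / len(sig1)

# Two illustrative word sets; their true Jaccard similarity is 3/7 ≈ 0.43.
a = {"data", "science", "jaccard", "similarity", "set"}
b = {"data", "science", "cosine", "similarity", "vector"}

est = estimate_jaccard(minhash_signature(a), minhash_signature(b))
print(est)  # close to 0.43; exact value varies with the hash functions
```

The payoff is scale: signatures are fixed-size regardless of how large the sets are, and locality-sensitive hashing can bucket similar signatures so that near-duplicates are found without comparing every pair.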
In systems that combine large language models with vector databases, Jaccard similarity is sometimes used alongside vector similarity measures. While cosine similarity or dot product are effective for dense embeddings, Jaccard similarity applies to sparse representations such as bag-of-words vectors, keyword indexes or metadata tags. This hybrid approach balances semantic similarity from embeddings with exact or symbolic overlap from set-based features.
In graph-based network analysis, Jaccard similarity is used to compare node neighborhoods in graphs, helping identify and compare clusters and networks of relationships. When analyzing data derived from a social media platform, for instance, the Jaccard distance can be used to compare the strength of relationships or patterns of behavior across different groups.
A group of 500 can act in ways similar to a group of 50. Because Jaccard similarity normalizes by the size of the union, it provides a meaningful measure of overlap without being dominated by the raw sizes of the sets being compared.
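Comparing node neighborhoods can be sketched as follows; the graph and user names here are invented for illustration:

```python
# Hypothetical adjacency sets for a small social graph.
graph = {
    "alice": {"bob", "carol", "dave"},
    "bob": {"alice", "carol", "erin"},
}

def neighborhood_jaccard(g, u, v):
    # Overlap of the two nodes' neighbor sets: |N(u) ∩ N(v)| / |N(u) ∪ N(v)|.
    a, b = g[u], g[v]
    return len(a & b) / len(a | b)

# alice and bob share only carol among five distinct neighbors.
sim = neighborhood_jaccard(graph, "alice", "bob")
print(sim)  # 0.2
```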