Date: 2026-02-04

Jaccard similarity is a statistical measure used to quantify how similar two sets are. It equals the ratio of the intersection size to the union size. The value of Jaccard similarity ranges from zero to one, where zero indicates that the sets have no elements in common and one indicates that the sets are identical.

For two sets A and B, the formula defines Jaccard similarity as:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

This is why Jaccard similarity is also called Intersection over Union (IoU): the numerator is the size of the intersection of the sets, and the denominator is the size of their union.
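As a minimal sketch of the definition, the ratio can be computed directly with Python's built-in set operations. The function name `jaccard_similarity` and the sample sets are illustrative. Returning 1.0 when both sets are empty is a convention assumed here, since the ratio is otherwise undefined (0/0).

```python
def jaccard_similarity(a, b):
    """Return |A ∩ B| / |A ∪ B| for two iterables treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Both sets are empty, so the ratio is 0/0.
        # Treating two empty sets as identical is an assumption of this sketch.
        return 1.0
    return len(a & b) / len(union)


a = {"apple", "banana", "cherry"}
b = {"banana", "cherry", "date", "fig"}

# Intersection has 2 elements, union has 5.
print(jaccard_similarity(a, b))  # 0.4
```

Disjoint sets give 0.0 and identical non-empty sets give 1.0, matching the range described earlier.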

The botanist Paul Jaccard developed Jaccard similarity at the beginning of the 20th century to compare the plant species found in different regions. There are several names for this measure of similarity and the associated measure of distance. Grove Karl Gilbert and Taffee Tadashi Tanimoto independently formulated the same measure, which became known as the ‘ratio of verification’ and the Tanimoto index, respectively. On sets, all of these are calculated in the same way and should be considered the same similarity measure, regardless of name.

In data science, Jaccard similarity is widely used when data is represented as sets or as binary features that indicate the presence or absence of an attribute.

Documents are often represented as sets of words, n-grams or shingles. Jaccard similarity over these sets is then used to measure document similarity, support content-based filtering, and detect near-duplicate documents and plagiarism. It is also used in information retrieval to compare queries with documents based on shared terms, and some vector databases offer it as a distance metric for binary vectors.
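The shingle-based comparison might look like the following sketch. The helper `word_shingles`, the choice of two-word shingles, and the sample sentences are all assumptions made for illustration. Real pipelines typically add tokenization and normalization steps that are omitted here.

```python
def word_shingles(text, k=2):
    """Return the set of k-word shingles (as tuples) in a lowercased text."""
    words = text.lower().split()
    # A text shorter than k words yields an empty set.
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard_similarity(a, b):
    """Return |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox leaps over the lazy dog"

s1 = word_shingles(doc1)
s2 = word_shingles(doc2)

# Each document has 8 two-word shingles; 6 are shared, 10 are distinct overall.
print(jaccard_similarity(s1, s2))  # 0.6
```

Changing a single word alters two of the eight shingles in each document, which is why shingle sets are more sensitive to word order than plain word sets.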

In recommendation systems, user behavior is often represented as sets of items, such as products viewed, movies watched or articles clicked. By comparing the overlap between user item sets, systems can identify users with similar interests or suggest items that similar users have interacted with.
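A hedged sketch of this idea follows: find the user whose item set overlaps most with a target user's, then suggest the items only the neighbour has seen. The user names, item identifiers and the functions `most_similar_user` and `recommend` are hypothetical. Production recommenders usually aggregate over many neighbours rather than a single one.

```python
def jaccard_similarity(a, b):
    """Return |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def most_similar_user(target, histories):
    """Return the other user whose item set has the highest Jaccard similarity."""
    others = [user for user in histories if user != target]
    return max(others, key=lambda u: jaccard_similarity(histories[target], histories[u]))


def recommend(target, histories):
    """Suggest items the nearest neighbour has interacted with but the target has not."""
    neighbour = most_similar_user(target, histories)
    return sorted(histories[neighbour] - histories[target])


# Hypothetical viewing histories, one set of item IDs per user.
histories = {
    "alice": {"m1", "m2", "m3"},
    "bob": {"m2", "m3", "m4"},
    "carol": {"m5", "m6"},
}

print(most_similar_user("alice", histories))  # bob (similarity 2/4 = 0.5)
print(recommend("alice", histories))          # ['m4']
```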

Jaccard similarity is also used in clustering and classification tasks involving categorical or binary data because it focuses on shared positive attributes and ignores features absent in both sets. This property makes it useful for sparse data, such as text data or transaction records.
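To illustrate the effect of ignoring jointly absent features, the sketch below compares Jaccard similarity on binary vectors with the simple matching coefficient, which counts shared zeros as agreement. The simple matching coefficient is brought in here only for contrast and is not part of the discussion above. The function names and vectors are illustrative.

```python
def jaccard_binary(x, y):
    """Jaccard similarity for equal-length 0/1 vectors; positions where both are 0 are ignored."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    both = sum(1 for xi, yi in zip(x, y) if xi and yi)
    either = sum(1 for xi, yi in zip(x, y) if xi or yi)
    # If neither vector has a positive feature, treat them as identical (assumed convention).
    return both / either if either else 1.0


def simple_matching(x, y):
    """Fraction of positions where the vectors agree, including shared zeros."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    agree = sum(1 for xi, yi in zip(x, y) if xi == yi)
    return agree / len(x)


# Two sparse vectors: each has 2 of 10 features set, and they share only one.
x = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

print(jaccard_binary(x, y))   # 1 shared feature out of 3 present in either vector
print(simple_matching(x, y))  # 0.8, dominated by the 7 shared zeros
```

On sparse data the shared zeros inflate the simple matching score, while the Jaccard score reflects only the features that actually occur.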

Jaccard similarity is commonly used in market basket analysis, clickstream analysis and other transactional data, where the data is sparse and the presence of an item is more informative than its absence.
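One way to apply this to transactional data, sketched under simple assumptions, is to represent each item by the set of transactions that contain it and compare items pairwise. The function `item_pair_similarity` and the sample baskets are hypothetical. Comparing all pairs is quadratic in the number of items, so this form only suits small catalogues.

```python
from collections import defaultdict
from itertools import combinations


def item_pair_similarity(transactions):
    """Return {(item_a, item_b): Jaccard similarity of the transaction sets containing each item}."""
    baskets_by_item = defaultdict(set)
    for tid, basket in enumerate(transactions):
        for item in basket:
            baskets_by_item[item].add(tid)

    scores = {}
    # Items are sorted so that each pair key is in alphabetical order.
    for a, b in combinations(sorted(baskets_by_item), 2):
        ta, tb = baskets_by_item[a], baskets_by_item[b]
        # Every item appears in at least one transaction, so the union is never empty.
        scores[(a, b)] = len(ta & tb) / len(ta | tb)
    return scores


# Hypothetical baskets; only the presence of an item matters, not its quantity.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"beer"},
]

scores = item_pair_similarity(transactions)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

Here bread and butter co-occur in two of the three baskets that contain either item, while beer never co-occurs with anything and scores zero against every other item.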