Document retrieval and ranking
Document retrieval using Term Frequency–Inverse Document Frequency (TF-IDF)
- Term Frequency (TF) measures how often a term appears in a document. The more frequently a term appears within a document, the more relevant the term is to that document.
- Inverse Document Frequency (IDF) measures how rare a term is across the entire collection of documents. Terms that appear in many documents are considered less informative and receive a lower weight.
- TF-IDF is the product of TF and IDF. The resulting value gives higher scores to terms that are frequent in a specific document but rare in the overall corpus, making them more useful for distinguishing relevant documents.
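The calculation described above can be sketched in a few lines of Python. This is only an illustration: the TF variant (term count divided by document length) and the IDF variant (natural logarithm of the corpus size over the document frequency) are one common textbook choice and are assumptions here, as are the toy corpus and the function names. Search engines typically use smoothed or otherwise adjusted forms.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Term frequency: share of the document's tokens that equal `term`."""
    return Counter(doc_tokens)[term] / len(doc_tokens)

def idf(term, corpus):
    """Inverse document frequency: log(N / number of documents containing `term`)."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing) if containing else 0.0

def tf_idf(term, doc_tokens, corpus):
    """TF-IDF is the product of the two measures."""
    return tf(term, doc_tokens) * idf(term, corpus)

# Illustrative corpus (assumption): each document is a list of tokens.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]

for term in ("the", "cat", "mat"):
    print(term, round(tf_idf(term, corpus[0], corpus), 4))
```

In this toy corpus, "the" appears in every document, so its IDF and therefore its TF-IDF score is zero, while "mat", which appears in only one document, scores higher than "cat", which appears in two.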
Elasticsearch scoring models
Elasticsearch supports multiple scoring methods to rank search results. It combines traditional scoring models like BM25 with vector-based semantic search and offers powerful hybrid search capabilities.
- BM25 scoring algorithm
Elasticsearch uses the BM25 scoring model by default. BM25 is a ranking function that builds on the classic TF-IDF model and improves on it in the following ways:
- Term frequency saturation - BM25 applies diminishing returns for repeated terms within a document. The score does not increase linearly when a term is repeated multiple times within a document.
- Document length normalization - BM25 normalizes scores by document length so that longer documents are not favored merely because they contain more words.
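Both adjustments can be seen in a small sketch of the per-term BM25 score. The formula below is the commonly published Okapi BM25 form and is not necessarily the exact expression that Lucene evaluates internally. The function name and the sample numbers are illustrative assumptions; `k1=1.2` and `b=0.75` are the usual default parameter values.

```python
import math

def bm25_term_score(freq, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Score contribution of one query term for one document (textbook BM25)."""
    # IDF component: rarer terms (small doc_freq) receive a larger weight.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: documents longer than average raise the denominator.
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    # Saturation: the ratio approaches (k1 + 1) as freq grows, so gains flatten out.
    return idf * freq * (k1 + 1) / (freq + length_norm)

common = dict(avg_doc_len=100, n_docs=1000, doc_freq=50)

# Term frequency saturation: each extra occurrence adds less than the previous one.
for freq in (1, 2, 10, 11):
    print(freq, round(bm25_term_score(freq, doc_len=100, **common), 3))

# Document length normalization: same term count, longer document, lower score.
print(round(bm25_term_score(3, doc_len=100, **common), 3))
print(round(bm25_term_score(3, doc_len=400, **common), 3))
```

Raising `k1` delays saturation, and setting `b` to 0 switches length normalization off entirely.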
These enhancements allow the BM25 algorithm to maintain relevance across documents with different lengths and term distributions. For more information, see the topic Understanding Elasticsearch scoring and the Explain API.
- Vector search
Instead of keywords, vector search represents document information as numerical vectors to find relevant documents. Vector search works as follows:
- It transforms the documents into feature vectors, often using word embeddings or other machine learning models.
- The vectors capture the semantic meaning of the text, allowing searches that understand context rather than exact keywords.
- The ranking of vector search results depends on the quality and type of the embedding model used and on how the similarity scores between vectors are computed.
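The steps above can be illustrated with a minimal sketch that ranks documents by the similarity of their vectors to a query vector. The three-dimensional vectors are hand-written stand-ins for the output of an embedding model, and cosine similarity is assumed as the similarity measure; a real deployment would obtain vectors from a trained model and might use a different measure, such as the dot product or Euclidean distance.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical embeddings: a real embedding model would produce these vectors.
doc_vectors = {
    "doc_car": [0.9, 0.1, 0.0],
    "doc_truck": [0.8, 0.3, 0.1],
    "doc_banana": [0.0, 0.2, 0.9],
}
# Stands in for the embedding of a query such as "automobile".
query_vector = [1.0, 0.2, 0.0]

def rank(query, docs):
    """Return (doc_id, score) pairs, most similar first."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranking = rank(query_vector, doc_vectors)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

No keyword is compared anywhere in this sketch; the ordering comes entirely from the geometry of the vectors, which is why the choice of embedding model dominates result quality.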
Content Search Services scoring models
- Boolean scoring model
The Boolean model treats documents as bags of words and applies Boolean logic (such as AND, OR, NOT) to identify the documents that match the query. If a document satisfies the Boolean expression formed by the query terms, it is included in the result set. However, this model does not rank the results by relevance.
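A minimal sketch of Boolean matching, assuming a simple in-memory inverted index built with Python sets. The documents, the `build_index` and `postings` names, and the whitespace tokenization are illustrative assumptions and do not reflect how Lucene stores its index.

```python
def build_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index

# Illustrative documents (assumption).
docs = {
    1: "lucene ranks documents",
    2: "boolean queries filter documents",
    3: "lucene supports boolean queries",
}
index = build_index(docs)

def postings(term):
    """Document IDs containing `term`; empty set if the term is unknown."""
    return index.get(term, set())

and_result = postings("lucene") & postings("boolean")    # lucene AND boolean
or_result = postings("lucene") | postings("boolean")     # lucene OR boolean
not_result = postings("documents") - postings("lucene")  # documents AND NOT lucene

print(and_result, or_result, not_result)
```

Each result is an unordered set: a document either matches or it does not, which is why this model alone cannot rank results by relevance.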
- Vector space model
The vector space model represents both documents and queries as vectors in a high-dimensional space, where each dimension corresponds to a unique term. The components of these vectors are the TF-IDF weights of the terms. To determine relevance, Lucene compares the query vector with each document vector using cosine similarity, which is the dot product of the vectors divided by the product of their Euclidean norms. For more information, see the topic Lucene Similarity.
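As a rough sketch of this comparison, the following Python builds sparse TF-IDF vectors as dictionaries and computes their cosine similarity. The weighting used here (raw term count multiplied by log(1 + N/df)) is an illustrative assumption chosen to keep all weights positive; it is not Lucene's exact formula, and the corpus and function names are hypothetical.

```python
import math
from collections import Counter

def tfidf_vector(tokens, corpus):
    """Sparse vector {term: weight}; one dimension per term known to the corpus."""
    n_docs = len(corpus)
    vec = {}
    for term, count in Counter(tokens).items():
        df = sum(1 for doc in corpus if term in doc)
        if df:  # terms absent from the corpus have no dimension
            vec[term] = count * math.log(1 + n_docs / df)
    return vec

def cosine(u, v):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

corpus = [
    "search engines rank documents".split(),
    "cooking recipes for pasta".split(),
]
query = "rank documents".split()

query_vec = tfidf_vector(query, corpus)
scores = [cosine(query_vec, tfidf_vector(doc, corpus)) for doc in corpus]
print([round(s, 4) for s in scores])
```

The first document shares both query terms and receives a positive score, while the second shares none and scores zero. Because the norms appear in the denominator, a document is not rewarded simply for containing many terms.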
Lucene first uses the Boolean model to filter documents that contain the query terms. It then applies the vector space model with TF-IDF weighting and cosine similarity to rank those documents by relevance. For more information, see the topic Lucene scoring.
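The two-phase flow can be sketched as a single function that first filters and then ranks. This is a simplified illustration, not Lucene's implementation: the filter assumes OR semantics (a document qualifies if it contains any query term), the IDF form 1 + log(N/df) is an assumption, and the corpus and all function names are hypothetical.

```python
import math
from collections import Counter

def tfidf(tokens, corpus):
    """Sparse TF-IDF vector using the assumed weighting count * (1 + log(N / df))."""
    vec = {}
    for term, count in Counter(tokens).items():
        df = sum(1 for doc in corpus if term in doc)
        if df:
            vec[term] = count * (1 + math.log(len(corpus) / df))
    return vec

def cosine(u, v):
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def search(query, corpus):
    # Phase 1, Boolean model: keep only documents that contain a query term.
    candidates = [i for i, doc in enumerate(corpus) if any(t in doc for t in query)]
    # Phase 2, vector space model: rank the survivors by cosine similarity.
    query_vec = tfidf(query, corpus)
    scored = [(i, cosine(query_vec, tfidf(corpus[i], corpus))) for i in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

corpus = [
    "lucene scoring uses cosine similarity".split(),
    "lucene index".split(),
    "pasta recipes".split(),
]
results = search("lucene scoring".split(), corpus)
for doc_index, score in results:
    print(doc_index, round(score, 3))
```

The third document never reaches the scoring phase because it fails the Boolean filter, which keeps the more expensive similarity computation limited to plausible matches.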