Document retrieval and ranking

This section describes the algorithms that are used for document retrieval and ranking when you use Content Search Services or Elasticsearch as the CBR engine. It also explains why the same search can produce differently ranked results on the two engines: their scoring models differ.

Document retrieval using Term Frequency–Inverse Document Frequency (TF-IDF)

Many of the scoring models use the classic information retrieval technique that involves Term Frequency–Inverse Document Frequency (TF-IDF) to assign weights to terms. The following are the characteristics of the technique:
  • Term Frequency (TF) measures how often a term appears in a document. The more frequently a term appears within a document, the more relevant the term is to that document.
  • Inverse Document Frequency (IDF) measures how rare a term is across the entire collection of documents. Terms that appear in many documents are considered less informative and receive a lower weight.
  • TF-IDF is calculated by multiplying TF by IDF. The resulting value gives higher scores to terms that are frequent in a specific document but rare in the overall corpus, making them more useful for distinguishing relevant documents.
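
The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the formula that either engine implements: it assumes whitespace tokenization, TF as the term count divided by the document length, and IDF as the natural logarithm of the number of documents divided by the number of documents that contain the term. The function name tf_idf and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dictionary per document.

    Assumptions for this sketch: whitespace tokenization,
    TF = count / document length, IDF = ln(N / document frequency).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical sample corpus.
docs = [
    "invoice payment overdue invoice",
    "payment received",
    "meeting notes payment",
]
weights = tf_idf(docs)
```

In this sample, "payment" appears in every document, so its IDF, and therefore its TF-IDF weight, is zero. "invoice" is frequent in the first document and absent from the others, so it receives the highest weight in that document.
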
Beyond the TF-IDF document scoring mechanism, both Elasticsearch and IBM® Content Search Services use additional scoring algorithms for document retrieval and ranking. Therefore, for the same search criteria, the two engines can rank the search results differently.

Elasticsearch scoring models

Elasticsearch supports multiple scoring methods to rank search results. It combines traditional scoring models like BM25 with vector-based semantic search and offers powerful hybrid search capabilities.

  • BM25 scoring algorithm

    Elasticsearch uses the Okapi BM25 scoring model by default. BM25 is a ranking function that builds on the classic TF-IDF model. The algorithm has the following improvements over the classic TF-IDF model:

    • Term frequency saturation - BM25 applies diminishing returns for repeated terms within a document. The score does not increase linearly when a term is repeated multiple times; each additional occurrence adds less than the previous one.
    • Document length normalization - BM25 normalizes scores by document length so that longer documents are not favored merely because of their higher word count.

    These enhancements allow the BM25 algorithm to maintain relevance across documents with different lengths and term distributions. For more information, see the topic Understanding Elasticsearch scoring and the Explain API.
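
To make the two improvements concrete, the following sketch computes a textbook form of the BM25 score in Python. It is an illustration under stated assumptions, not the engine's implementation: documents are pre-tokenized lists, the IDF component uses the ln(1 + (N - n + 0.5) / (n + 0.5)) form, and k1 = 1.2 and b = 0.75 are used as parameter values. The function name bm25_score and the sample corpus are hypothetical.

```python
import math


def bm25_score(query_terms, doc_tokens, corpus, k1=1.2, b=0.75):
    """Textbook BM25 score of one document for a list of query terms.

    corpus is a list of token lists and is used for the document
    frequencies and the average document length.
    """
    n_docs = len(corpus)
    avg_len = sum(len(tokens) for tokens in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        freq = doc_tokens.count(term)
        if freq == 0:
            continue
        n_containing = sum(1 for tokens in corpus if term in tokens)
        idf = math.log(1 + (n_docs - n_containing + 0.5) / (n_containing + 0.5))
        # Document length normalization: values above 1 penalize
        # documents that are longer than average.
        length_norm = 1 - b + b * len(doc_tokens) / avg_len
        # Term frequency saturation: the fraction approaches (k1 + 1)
        # as freq grows, so repeated terms give diminishing returns.
        score += idf * freq * (k1 + 1) / (freq + k1 * length_norm)
    return score


# Hypothetical sample corpus: a short document, a long document, and
# a document that does not contain the query term.
corpus = [
    ["invoice", "paid"],
    ["invoice"] + ["other"] * 9,
    ["meeting", "notes"],
]
short_score = bm25_score(["invoice"], corpus[0], corpus)
long_score = bm25_score(["invoice"], corpus[1], corpus)
no_match_score = bm25_score(["invoice"], corpus[2], corpus)
```

Both matching documents contain "invoice" exactly once, but the shorter document scores higher because of length normalization. The document without the term scores zero.
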

  • Vector search
    Instead of matching keywords, vector search represents document content as numerical vectors and uses them to find relevant documents. Vector search works as follows:
    • It transforms the documents into feature vectors, often by using word embeddings or other machine learning models.
    • The vectors capture the semantic meaning of the text, allowing searches that understand context rather than exact keywords.
    • The ranking depends on the quality and type of the embedding model that is used and on how similarity between vectors is computed.
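
The ranking step can be illustrated with a short Python sketch. It assumes cosine similarity as the similarity measure, which is one common choice and not necessarily the one a given deployment uses. The three-dimensional vectors are hand-written, hypothetical stand-ins for the output of an embedding model; real embeddings typically have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return (doc_id, similarity) pairs, most similar first."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical vectors; in practice an embedding model produces them.
doc_vecs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.8, 0.3],
    "doc_c": [0.7, 0.3, 0.1],
}
query_vec = [1.0, 0.2, 0.0]
ranking = rank_by_similarity(query_vec, doc_vecs)
```

Here doc_a ranks first because its vector points in nearly the same direction as the query vector, even though no keywords are compared at any point.
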

Content Search Services scoring models

IBM Content Search Services uses the Lucene scoring model. Lucene uses a combination of algorithms for document retrieval and ranking. The first is a simple retrieval method known as the Boolean model, and the second is a more advanced ranking model called the Vector Space Model (VSM).
  • Boolean scoring model

    The Boolean model treats documents as bags of words and applies Boolean logic (such as AND, OR, NOT) to identify the documents that match the query. If a document satisfies the Boolean expression formed by the query terms, it is included in the result set. However, this model does not rank the results by relevance.
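
As a minimal sketch of this idea, the following Python example builds an inverted index (a map from each term to the set of documents that contain it) and evaluates AND, OR, and NOT with set operations. The index structure, the function name build_index, and the sample documents are assumptions made for illustration; they are not taken from the Lucene implementation.

```python
def build_index(docs):
    """Map each term to the set of IDs of documents that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index


# Hypothetical sample documents, keyed by document ID.
docs = {
    1: "contract renewal notice",
    2: "contract termination",
    3: "renewal reminder",
}
index = build_index(docs)

contract = index.get("contract", set())
renewal = index.get("renewal", set())

contract_and_renewal = contract & renewal   # AND: set intersection
contract_or_renewal = contract | renewal    # OR: set union
contract_not_renewal = contract - renewal   # NOT: set difference
```

Each result is an unordered set of document IDs: a document either matches the expression or it does not, which is why a separate model is needed to rank the matches.
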

  • Vector space model

    The vector space model represents both documents and queries as vectors in a high-dimensional space, where each dimension corresponds to a unique term. The components of these vectors are the TF-IDF weights of the terms. To determine relevance, Lucene compares the query vector with each document vector using cosine similarity, which is the dot product of the vectors divided by the product of their Euclidean norms. For more information, see the topic Lucene Similarity.

Lucene first uses the Boolean model to filter documents that contain the query terms. It then applies the vector space model with TF-IDF weighting and cosine similarity to rank those documents by relevance. For more information, see the topic Lucene scoring.
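
The two-stage process can be sketched as follows in Python. This is a simplified illustration, not Lucene's scoring formula. It assumes whitespace tokenization, a Boolean AND across all query terms for the filtering stage, raw term counts for TF, and a smoothed IDF of 1 + ln((N + 1) / (df + 1)). The function names and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector: {term: term count * IDF}."""
    counts = Counter(tokens)
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


def cosine(u, v):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def search(query, docs):
    """Filter with Boolean AND, then rank the matches by cosine similarity."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    # Smoothed IDF (an assumption of this sketch) keeps every weight positive.
    idf = {term: 1.0 + math.log((n_docs + 1) / (count + 1)) for term, count in df.items()}

    query_terms = query.lower().split()
    # Stage 1: Boolean model. Keep documents that contain every query term.
    matches = [
        doc_id for doc_id, tokens in tokenized.items()
        if all(term in tokens for term in query_terms)
    ]
    # Stage 2: vector space model. Score each match against the query vector.
    query_vec = tfidf_vector(query_terms, idf)
    scored = [(doc_id, cosine(query_vec, tfidf_vector(tokenized[doc_id], idf))) for doc_id in matches]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample documents, keyed by document ID.
docs = {
    1: "contract renewal notice for supplier contract",
    2: "contract termination letter",
    3: "renewal reminder",
    4: "contract renewal",
}
results = search("contract renewal", docs)
```

Documents 2 and 3 are removed in the first stage because each contains only one of the two query terms. Of the remaining documents, document 4 ranks above document 1 because its vector points in the same direction as the query vector, while the extra terms in document 1 pull its vector away from the query.
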