Result Evaluation


Overview

There are a lot of questions that one may have regarding a RAG solution:

  • How should we determine the design choices that maximize retrieval performance?
  • How do we know which embeddings model creates the best vector representation of our documents?
  • Is an agentic approach necessary?
  • Would a reranker boost our results, or would these parameter choices have only a marginal impact?

Having a clear evaluation strategy throughout the development of a RAG-based solution is crucial to ensure a successful path to production. We see all sorts of empirical evaluations performed during pilots that are sometimes not reproducible. To improve the performance of a RAG-based solution under development, or to properly diagnose a production issue, evaluation tasks must be reproducible and quick to execute. RAG pipelines should be evaluated systematically and consistently for both the retrieval and generation components.

Understanding the performance of a RAG-based solution plays a critical part at various stages of the solution lifecycle, during the:

  • Experimentation and tuning phase
  • Monitoring phase

However, the effort to build an evaluation engine should not be underestimated, especially when it comes to creating a golden dataset (ground truth) with reference answers and reference contexts.

In this document, we discuss different evaluation approaches and metrics, and highlight some of the reusable assets available to make evaluating these solutions easier.

Approaches

AI Evaluating AI

LLMaaJ (LLM as a Judge) has emerged over the last year as a leading technique for overcoming the challenge of building a reference-based evaluation engine. It has been shown to correlate reasonably well with human judgment. Here are several properties that cannot be quantified by existing metrics and benchmarks but can be evaluated by LLMaaJ:

  • Safety – Are models generating harmful or unsafe content?
  • Groundedness – In the case of summarization and Retrieval Augmented Generation, is the generated output grounded in facts present in the input context?
  • Sentiment – Are generated responses generally positive, negative, or any other prescribed sentiment?
  • Toxicity – Are models generating offensive, aggressive, or discriminatory content?
  • Language style – Are models speaking in a casual, formal, or common voice? This includes evaluating sarcasm, humor, and irony.

For example, when using a scoring model to evaluate the output of other models, the scoring prompt should contain a description of the attributes to score and the grading scale, and should be interpolated to include the response to be evaluated.

In the example below, the model is asked to classify a response's sentiment and return the corresponding class.

You are a fair and unbiased scoring judge. You are asked to classify a chatbot's response according to its sentiment. Evaluate the response below and extract the corresponding class. Possible classes are POSITIVE, NEUTRAL, NEGATIVE. Explain your reasoning and conclude by stating the classified sentiment.
{{response}}
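
As a minimal sketch of how such a judge could be wired up (assuming a hypothetical call_llm helper that sends a prompt to whichever model acts as the judge), the scoring prompt is interpolated with the response and the class is parsed from the judge's output:

import re

SENTIMENT_JUDGE_PROMPT = (
    "You are a fair and unbiased scoring judge. You are asked to classify a chatbot's "
    "response according to its sentiment. Evaluate the response below and extract the "
    "corresponding class. Possible classes are POSITIVE, NEUTRAL, NEGATIVE. Explain your "
    "reasoning and conclude by stating the classified sentiment.\n{response}"
)

def judge_sentiment(response_text: str, call_llm) -> str:
    # call_llm is a placeholder for whatever client sends a prompt to the judge model
    prompt = SENTIMENT_JUDGE_PROMPT.format(response=response_text)
    verdict = call_llm(prompt)
    # The judge explains its reasoning first, so take the last class it mentions
    matches = re.findall(r"\b(POSITIVE|NEUTRAL|NEGATIVE)\b", verdict.upper())
    return matches[-1] if matches else "UNKNOWN"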

 

Or, for instance, below is a few-shot prompting example of LLM-driven evaluation for NER (named-entity recognition) tasks.

-----------------------------------Prompt---------------------------------------------
You are a professional evaluator, and your task is to assess the accuracy of entity extraction as a Score in a given text. You will be given a text, an entity, and the entity value.
Please provide a numeric score on a scale from 0 to 1, where 1 is the best score and 0 is the worst score. Strictly use numeric values for scoring.

Here are the examples:

Text: Where is the IBM's office located in New York?
Entity: organization name
Value: IBM's
Score: 0

Text: Call the customer service at 1-800-555-1234 for assistance.
Entity: phone number
Value: +1 888 426 4409
Score: 1

Text: watsonx has three components: watsonx.ai, watsonx.data, and watsonx.governance.
Entity: product name
Value: Google
Score: 0.33

Text: The conference is scheduled for 15th August 2024.
Entity: date
Value: 15th August 2024
Score: 1

Text: My colleagues John and Alice will join the meeting.
Entity: person’s name
Value: Alice
Score: 1

-----------------------------------Output---------------------------------------------
Score: 0.67
--------------------------------------------------------------------------------------
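
The numeric score can then be extracted from the judge's output programmatically. A minimal sketch, assuming the judge answers with a line of the form "Score: <value>":

import re

def parse_score(judge_output: str) -> float | None:
    # Look for a line such as "Score: 0.67" in the judge's output
    match = re.search(r"Score:\s*([01](?:\.\d+)?)", judge_output)
    return float(match.group(1)) if match else None

print(parse_score("Score: 0.67"))  # 0.67
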
Metrics

The complexity of RAG systems stems from the opaque behavior of Large Language Models (LLMs) and from the many interconnected components in the RAG pipeline, which makes evaluating such systems difficult. To address this challenge, a number of benchmarks and evaluation tools have been developed specifically for RAG systems, providing a standardized and systematic approach to assessing their performance.

For instance, the table below (adapted from "Evaluation of Retrieval-Augmented Generation: A Survey") lists a diverse array of RAG evaluation methods and tools, each with its own strengths and applications. The table is not exhaustive, but it gives a succinct overview of the current RAG evaluation landscape.

For the retrieval component, challenges arise primarily from the size and dynamic nature of candidate knowledge repositories, the temporal aspects of the data, and the heterogeneity of information sources. Conventional metrics such as recall and precision alone cannot capture these complexities, so more nuanced, context-dependent metrics are needed to assess the retrieval process.

For the generation component, the quality of the generated output depends closely on the precision of the retrieval step, which calls for evaluation metrics that assess the system's performance holistically.

Evaluating the RAG system as a whole, in turn, requires examining how the retrieval component affects the generation process, as well as how effectively the system achieves its intended goals.

The RAG triad is an evaluation framework for assessing the reliability and contextual accuracy of Large Language Model (LLM) responses. It consists of three assessments: Context Relevance, Groundedness, and Answer Relevance. Together, they aim to catch hallucinations by verifying that the retrieved context is relevant to the query, that the response is grounded in that context, and that the answer addresses the user's question.

RAG evaluation can be achieved using both automatic reference-based and reference-less metrics. There’s a leaderboard on HuggingFace that looks at how well the open-source LLMs stack up against each other.

Retrieval Metrics

The retrieval metrics below are reference-based, which means that every chunk must be uniquely identified (contexts_id) and every question must be associated with the unique IDs of its ground-truth contexts.

Rank-aware evaluation metrics used for recommendation systems are appropriate for RAG.

MRR (Mean Reciprocal Rank)

MRR is used in Unitxt and measures the position of the first relevant document in the search results. A higher MRR, close to 1, indicates that relevant results appear near the top, reflecting high search quality. Conversely, a lower MRR means lower search performance, with relevant answers positioned further down in the results.

Pros: Emphasizes the importance of the first relevant result, which is often critical in search scenarios.
Cons: Does not penalize the retriever for assigning low ranks to the other ground-truth documents; not suitable for evaluating the entire list of retrieved results, since it focuses only on the first relevant item.
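
As an illustration, a minimal reference implementation of MRR over a batch of queries, assuming each query comes with its retrieved chunk IDs in rank order and the set of ground-truth context IDs:

def mean_reciprocal_rank(retrieved_ids_per_query, relevant_ids_per_query):
    # retrieved_ids_per_query: list of ranked lists of retrieved chunk IDs
    # relevant_ids_per_query:  list of sets of ground-truth context IDs
    reciprocal_ranks = []
    for retrieved, relevant in zip(retrieved_ids_per_query, relevant_ids_per_query):
        rr = 0.0
        for rank, chunk_id in enumerate(retrieved, start=1):
            if chunk_id in relevant:
                rr = 1.0 / rank  # only the first relevant hit counts
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# First query: relevant chunk at rank 2 (RR = 0.5); second query: rank 1 (RR = 1.0)
print(mean_reciprocal_rank([["c3", "c1"], ["c7", "c2"]], [{"c1"}, {"c7"}]))  # 0.75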

NDCG (Normalized Discounted Cumulative Gain)

NDCG is a ranking-quality metric that assesses how well a list of items is ordered compared to an ideal ranking in which all relevant items are positioned at the top.

NDCG@k is calculated as DCG@k divided by the ideal DCG@k (IDCG@k), which represents the score of a perfectly ranked list of items up to position k. DCG measures the total relevance of the items in a list, discounted by their position.

NDCG ranges from 0 to 1.

Pros: Takes into account the position of relevant items, providing a more holistic view of ranking quality; Can be adjusted for different levels of ranking (e.g., NDCG@k).
Cons: More complex to compute and interpret compared to simpler metrics like MRR; Requires an ideal ranking for comparison, which may not always be available or easy to define.
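
A minimal sketch of NDCG@k with binary relevance (a chunk is either relevant or not), following the DCG / IDCG definition above:

import math

def ndcg_at_k(retrieved_ids, relevant_ids, k):
    # DCG@k with binary gains: each relevant item at rank r contributes 1 / log2(r + 1)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, chunk_id in enumerate(retrieved_ids[:k], start=1)
        if chunk_id in relevant_ids
    )
    # IDCG@k: the DCG of an ideal ranking with all relevant items at the top
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k(["c4", "c1", "c9"], {"c1", "c9"}, k=3))  # ~0.69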

MAP (Mean Average Precision)

Mean Average Precision (MAP) is a metric that evaluates the ranking of each correctly retrieved document within a list of results.

It is beneficial when your system needs to consider the order of results and retrieve multiple relevant documents in a single run.

Pros: Considers both precision and recall, providing a balanced evaluation of retrieval performance; Suitable for tasks requiring multiple relevant documents and their correct ordering.
Cons: Can be more demanding in terms of computation compared to simpler metrics; May not be as straightforward to interpret as other metrics, requiring more context to understand the results fully.
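
A minimal sketch of MAP with binary relevance, using the same inputs as the MRR example above:

def average_precision(retrieved_ids, relevant_ids):
    # Average of precision@rank taken at every rank where a relevant chunk appears
    hits, precisions = 0, []
    for rank, chunk_id in enumerate(retrieved_ids, start=1):
        if chunk_id in relevant_ids:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(retrieved_per_query, relevant_per_query):
    return sum(
        average_precision(r, g) for r, g in zip(retrieved_per_query, relevant_per_query)
    ) / len(retrieved_per_query)

# Relevant chunks found at ranks 1 and 3 -> AP = (1/1 + 2/3) / 2 ≈ 0.83
print(average_precision(["c1", "c5", "c2"], {"c1", "c2"}))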

Generation Metrics

Faithfulness

Measures whether the output is based on the given context or if the model generates hallucinated responses.

Pros: Ensures that the generated responses are trustworthy and based on the provided context; Vital for applications where factual correctness is paramount.
Cons: Often requires human judgment to assess, making it labor-intensive and subjective; May not fully capture partial inaccuracies or subtle hallucinations.

Robustness (insensitivity)

Robustness is generally defined as the solution's ability to handle input variations such as perturbations in whitespace, casing, tabs, and so on.

Testing robustness is an important aspect of the evaluation process and can be done, for instance, with Unitxt.

Pros: Ensures the model performs reliably across varied input conditions; important for practical applications where input data may not be perfectly formatted.
Cons: Requires thorough testing across many variations, which can be time-consuming; it is challenging to define and measure all possible input perturbations.
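
As an illustration of the idea (not a specific Unitxt API), a minimal sketch that perturbs a question with whitespace and casing variations and measures how often the pipeline's answer stays consistent; rag_pipeline and answers_match are placeholders for your own pipeline and answer-similarity check:

def perturb(text: str) -> list[str]:
    # Simple surface-level perturbations: casing, extra whitespace, tabs
    return [
        text.lower(),
        text.upper(),
        "  " + text.replace(" ", "  ") + "  ",
        text.replace(" ", "\t"),
    ]

def robustness_check(question: str, rag_pipeline, answers_match) -> float:
    # rag_pipeline(question) -> answer string; answers_match(a, b) -> bool
    baseline = rag_pipeline(question)
    variants = perturb(question)
    consistent = sum(answers_match(baseline, rag_pipeline(v)) for v in variants)
    return consistent / len(variants)  # fraction of perturbations yielding a consistent answer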

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

ROUGE measures the quality of generated text by comparing the overlap of n-grams, word sequences, and word pairs between the machine-generated text and a set of reference texts. It is widely used for evaluating tasks such as text summarization and translation.

Pros: Established and recognized in the NLP community, providing a standard for comparison; suitable for tasks where capturing all relevant information is important.
Cons: Focuses on n-gram overlap, which may not capture semantic quality or fluency; can be influenced by the length of the generated text, potentially penalizing shorter or more concise outputs.
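
For example, ROUGE scores can be computed with the rouge-score Python package (one option among several):

from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "The invoice must be paid within 30 days of receipt."
generated = "Invoices are due within 30 days of being received."
scores = scorer.score(reference, generated)
print(scores["rougeL"].fmeasure)  # F-measure of the longest-common-subsequence overlap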

BLEU (Bilingual Evaluation Understudy)

It measures the quality of machine-translated text by comparing it to one or more reference translations. It evaluates the precision of n-grams in the generated text with respect to the reference texts. Primarily used for evaluating translation.

Pros: Effective for tasks where precision and exact matches are important; widely adopted in the machine translation community, providing a benchmark for comparison.
Cons: May penalize legitimate variations in phrasing that do not exactly match the reference texts; may not fully capture partial inaccuracies or subtle differences in meaning.
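
For example, sentence-level BLEU can be computed with NLTK (smoothing helps on short texts):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction  # pip install nltk

reference = [["the", "invoice", "is", "due", "within", "thirty", "days"]]
candidate = ["the", "invoice", "is", "payable", "within", "thirty", "days"]
score = sentence_bleu(reference, candidate, smoothing_function=SmoothingFunction().method1)
print(round(score, 3))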

Cost Metrics

GPU/CPU Utilization

The retrieval phase primarily consumes CPU resources, while the generation phase primarily consumes GPU resources.

LLM Calls Cost

Example: Cost from OpenAI API calls

Infrastructure Cost

Costs from storage, networking, computing resources, etc.

Operation Cost

Costs from maintenance, support, monitoring, logging, security measures, etc.

Understanding Evaluation Results

 

If the retrieval metrics indicate suboptimal performance, yet the generation metrics yield favourable results, it is advisable to: 

  1. Revisit and adjust the chunking strategy (e.g., chunk size and overlap) to better balance context and relevance.
  2. Clean and preprocess the data to remove noise and irrelevant information.
  3. Add metadata, such as dates, to chunks to help filter and prioritize data based on specific use cases.
  4. Implement reranking: this lets your retrieval system refine the top nodes used as context. Both LangChain and LlamaIndex offer easy-to-use abstractions for reranking (see the sketch after this list).
  5. Have the LLM rephrase the query and retry, as similar questions for humans may not appear similar in embedding space.
  6. Fine-tune embeddings with LlamaIndex for improved accuracy.
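
As an illustration of point 4, one way to rerank (independent of the LangChain or LlamaIndex abstractions) is to score the retrieved passages with a cross-encoder. A minimal sketch, assuming the sentence-transformers package and a public MS MARCO cross-encoder checkpoint:

from sentence_transformers import CrossEncoder  # pip install sentence-transformers

def rerank(query: str, passages: list[str], top_k: int = 3) -> list[str]:
    # Score each (query, passage) pair with a cross-encoder and keep the top_k passages
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = model.predict([(query, passage) for passage in passages])
    ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked[:top_k]]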

Conversely, if the retrieval metrics demonstrate strong performance but the generation results are suboptimal, consider the following strategies to improve model performance:

  1. Fine-tune the language model: Customize the model to your domain by fine-tuning it on relevant datasets to enhance its accuracy and contextual understanding.
  2. Refine prompt engineering: Experiment with prompt structure and wording to guide the model towards more precise outputs.
  3. Use different decoding strategies: Adjust decoding techniques like beam search, top-k sampling, or nucleus sampling (top-p) to improve the quality of generated responses (see the sketch after this list).
  4. Control generation length: Set constraints on response length to ensure outputs are concise and accurate.
  5. Incorporate feedback loops: Implement a system that identifies and corrects suboptimal responses, enabling continuous improvement.
  6. Leverage multi-turn dialogue: Break down complex tasks into multi-step interactions, allowing the model to refine its answers over several iterations.
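
As an illustration of point 3, decoding parameters can be adjusted at generation time. A minimal sketch with the Hugging Face transformers generate API; the model name and parameter values are placeholders to tune for your use case:

from transformers import AutoModelForCausalLM, AutoTokenizer  # pip install transformers

model_name = "ibm-granite/granite-3.0-8b-instruct"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Answer using only the provided context: ...", return_tensors="pt")

# Nucleus (top-p) sampling with a top-k cutoff; set do_sample=False and num_beams=4 for beam search
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    top_p=0.9,
    top_k=50,
    temperature=0.7,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))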

If both the retrieval and generation metrics exhibit subpar performance, it would be prudent to revisit the initial stages of the pipeline, such as enhancing metadata, refining the knowledge base, and optimizing the retrieval mechanism.

This approach emphasizes the importance of a comprehensive evaluation of the RAG system, taking into account the interplay between the retrieval and generation components and the system's overall effectiveness in achieving its intended goals.

Contributors

Vicky Kuo, Amna Jamal, Luke Major, Chris Kirby

Updated: November 15, 2024