There are a lot of questions that one may have regarding a RAG solution:
Having a clear evaluation strategy throughout the development of a RAG-based solution is crucial to ensure a successful path to production. We see all sorts of empirical evaluations performed during pilots that are sometimes not reproducible. To improve the performance of a RAG-based solution under development, or to properly diagnose a production issue, evaluation tasks must be reproducible and quick to execute. RAG pipelines should be evaluated systematically and consistently across both the retrieval and generation components.
Understanding the performance of a RAG-based solution plays a critical part at various stages of the solution lifecycle, including:
However, the effort to build an evaluation engine should not be underestimated, especially when it comes to creating a golden dataset (ground truth) with reference answers and reference contexts.
In this document, we discuss different evaluation approaches and metrics, and highlight some of the reusable assets available to make evaluating these solutions easier.
LLMaaJ (LLM as a Judge) has emerged over the last year as a leading evaluation technique for overcoming the challenge of building a reference-based evaluation engine. It has been shown to correlate reasonably well with human judgment. Here are several properties that cannot be quantified by existing metrics and benchmarks but can be evaluated by LLMaaJ:
For example, when using a scoring model to evaluate the output of other models, the scoring prompt should contain a description of the attributes to score and the grading scale, and should be interpolated to include the response to be evaluated.
In this example, the model is asked to evaluate the response's sentiment and return a classification.
You are a fair and unbiased scoring judge. You are asked to classify a chatbot's response according to its sentiment. Evaluate the response below and extract the corresponding class. Possible classes are POSITIVE, NEUTRAL, NEGATIVE. Explain your reasoning and conclude by stating the classified sentiment. {{response}}
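As a minimal sketch of how such a scoring prompt could be wired up in practice, the snippet below interpolates the response into the judge prompt and parses the returned class; the generate callable stands in for whatever LLM client you use and is an assumption, not a specific SDK.

```python
import re

# Judge prompt from above; {response} is the slot that gets interpolated.
JUDGE_PROMPT = (
    "You are a fair and unbiased scoring judge. You are asked to classify a "
    "chatbot's response according to its sentiment. Evaluate the response below "
    "and extract the corresponding class. Possible classes are POSITIVE, NEUTRAL, "
    "NEGATIVE. Explain your reasoning and conclude by stating the classified "
    "sentiment.\n\n{response}"
)

def judge_sentiment(response_text: str, generate) -> str:
    """Interpolate the chatbot response into the judge prompt and parse the label."""
    prompt = JUDGE_PROMPT.format(response=response_text)
    raw = generate(prompt)  # hypothetical helper wrapping your LLM of choice
    # The judge explains its reasoning first, so take the last class it mentions.
    labels = re.findall(r"POSITIVE|NEUTRAL|NEGATIVE", raw.upper())
    return labels[-1] if labels else "UNKNOWN"
```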
For instance, below is a few-shot prompting example of LLM-driven evaluation for NER (named-entity recognition) tasks.
-----------------------------------Prompt---------------------------------------------
You are a professional evaluator, and your task is to assess the accuracy of entity extraction as a Score in a given text. You will be given a text, an entity, and the entity value. Please provide a numeric score on a scale from 0 to 1, where 1 is the best score and 0 is the worst score. Strictly use numeric values for scoring.

Here are the examples:

Text: Where is the IBM's office located in New York?
Entity: organization name
Value: IBM's
Score: 0

Text: Call the customer service at 1-800-555-1234 for assistance.
Entity: phone number
Value: +1 888 426 4409
Score: 1

Text: watsonx has three components: watsonx.ai, watsonx.data, and watsonx.governance.
Entity: product name
Value: Google
Score: 0.33

Text: The conference is scheduled for 15th August 2024.
Entity: date
Value: 15th August 2024
Score: 1

Text: My colleagues John and Alice will join the meeting.
Entity: person's name
Value: Alice
Score: 1
-----------------------------------Output---------------------------------------------
Score: 0.67
--------------------------------------------------------------------------------------
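Because the judge replies with free-form text, the numeric score still needs to be parsed out before it can be aggregated across a test set. A small parsing sketch, assuming the judge follows the "Score: <number>" convention shown in the prompt above:

```python
import re
from typing import Optional

def parse_score(judge_output: str) -> Optional[float]:
    """Extract the last 'Score: <number>' the judge produced, if any."""
    matches = re.findall(r"Score:\s*([01](?:\.\d+)?)", judge_output)
    return float(matches[-1]) if matches else None

print(parse_score("Score: 0.67"))  # -> 0.67
```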
The complexity of RAG systems stems both from the opaque nature of Large Language Models (LLMs) and from the intricate, interconnected components within the RAG pipeline. As the technology continues to progress rapidly, evaluating such a complex system becomes an increasingly demanding task. To address this challenge, a range of benchmarks and evaluation tools have been developed specifically for RAG systems. These resources provide a standardized and systematic approach to assessing the performance and efficacy of such systems.
As illustrated in the table below (adapted from "Evaluation of Retrieval-Augmented Generation: A Survey"), there is a diverse array of RAG evaluation methods and tools, each with its own strengths and applications. The table is not exhaustive; it provides a succinct overview of the current landscape of RAG evaluation.
For the retrieval component, challenges arise primarily from the extensive and dynamic nature of prospective knowledge repositories, the temporal facets of the data, and the heterogeneity of information sources. Given these challenges, conventional evaluation metrics such as Recall and Precision are inadequate on their own to provide a comprehensive assessment. More nuanced and context-dependent metrics are needed to effectively capture the complexities and subtleties of the retrieval process.
Concerning the generation component, it is crucial to consider the close relationship between the precision of the retrieval process and the quality of the generated output. This calls for comprehensive evaluation metrics that can provide a holistic and nuanced assessment of the system's performance.
In turn, evaluating the RAG system as a whole requires examining the impact of the retrieval component on the generation process, as well as assessing the overall effectiveness of the system in achieving its intended goals.
The RAG triad is an evaluation framework for assessing the reliability and contextual accuracy of Large Language Model (LLM) responses. It consists of three assessments: Context Relevance, Groundedness, and Answer Relevance. These assessments aim to identify LLM response hallucinations by verifying that the retrieved context is relevant to the query, that the response is grounded in that context, and that the answer aligns with the user's inquiry.
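Purely as an illustration, each of the three assessments can be framed as an LLM-as-a-judge call; the ask_judge helper and the 0-to-1 scale below are assumptions, not the API of any particular framework.

```python
def rag_triad(question: str, context: str, answer: str, ask_judge) -> dict:
    """Score the three RAG-triad checks with an LLM judge that returns a 0-1 value."""
    return {
        # Context Relevance: is the retrieved context relevant to the question?
        "context_relevance": ask_judge(
            f"Rate from 0 to 1 how relevant this context is to the question.\n"
            f"Question: {question}\nContext: {context}"
        ),
        # Groundedness: is the answer supported by the retrieved context?
        "groundedness": ask_judge(
            f"Rate from 0 to 1 how well the answer is supported by the context.\n"
            f"Context: {context}\nAnswer: {answer}"
        ),
        # Answer Relevance: does the answer address the question that was asked?
        "answer_relevance": ask_judge(
            f"Rate from 0 to 1 how well the answer addresses the question.\n"
            f"Question: {question}\nAnswer: {answer}"
        ),
    }
```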
RAG evaluation can be achieved using both automatic reference-based and reference-less metrics. There’s a leaderboard on HuggingFace that looks at how well the open-source LLMs stack up against each other.
The retrieval metrics below are reference-based, which means that every chunk must be uniquely identified (contexts_id) and every question must be associated with the unique IDs of its ground-truth contexts.
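For illustration, a golden-dataset record for reference-based retrieval evaluation might look like the following; the field names (including ground_truth_context_ids) are an assumed schema, not a required one.

```python
# One illustrative golden-dataset record for reference-based retrieval evaluation.
golden_record = {
    "question_id": "q-001",
    "question": "What are the three components of watsonx?",
    # IDs of the chunks a perfect retriever should return for this question
    "ground_truth_context_ids": ["doc-12#chunk-3", "doc-12#chunk-4"],
    "reference_answer": "watsonx.ai, watsonx.data, and watsonx.governance",
}
```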
Rank-aware evaluation metrics used for recommendation systems are appropriate for RAG.
MRR (Mean Reciprocal Rank) is used in Unitxt and measures the position of the first relevant document in the search results. A higher MRR, close to 1, indicates that relevant results appear near the top, reflecting high search quality. Conversely, a lower MRR means lower search performance, with relevant answers positioned further down in the results.
Pros: Emphasizes the importance of the first relevant result, which is often critical in search scenarios.
Cons: Does not penalize the retriever for assigning low ranks to other ground-truth documents; not suitable for evaluating the entire list of retrieved results, since it focuses only on the first relevant item.
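A minimal sketch of how MRR can be computed from ranked chunk IDs and the ground-truth IDs described above (the data layout is illustrative):

```python
def mean_reciprocal_rank(retrieved_ids, relevant_ids) -> float:
    """MRR: 1 / rank of the first relevant chunk, averaged over all queries.

    retrieved_ids: one ranked list of chunk IDs per query
    relevant_ids:  one set of ground-truth chunk IDs per query
    """
    total = 0.0
    for ranked, relevant in zip(retrieved_ids, relevant_ids):
        for rank, chunk_id in enumerate(ranked, start=1):
            if chunk_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(retrieved_ids)

# First query hits at rank 2, second at rank 1 -> (0.5 + 1.0) / 2 = 0.75
print(mean_reciprocal_rank([["c3", "c7", "c1"], ["c2", "c9"]], [{"c7"}, {"c2"}]))
```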
NDCG (Normalized Discounted Cumulative Gain) is a ranking-quality metric that assesses how well a list of items is ordered compared to an ideal ranking, in which all relevant items are positioned at the top.
NDCG@k is calculated as DCG@k divided by the ideal DCG@k (IDCG@k), which represents the score of a perfectly ranked list of items up to position k. DCG measures the total relevance of items in a list.
NDCG@k ranges from 0 to 1, where 1 corresponds to a perfect ranking.
Pros: Takes into account the position of relevant items, providing a more holistic view of ranking quality; Can be adjusted for different levels of ranking (e.g., NDCG@k).
Cons: More complex to compute and interpret compared to simpler metrics like MRR; Requires an ideal ranking for comparison, which may not always be available or easy to define.
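A short sketch of the NDCG@k calculation described above, using binary relevance grades for the retrieved chunks (graded relevances work the same way):

```python
import math

def dcg_at_k(relevances, k) -> float:
    """Discounted cumulative gain of the top-k items: sum of rel_i / log2(i + 1), i starting at 1."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k) -> float:
    """NDCG@k: DCG of the actual ranking divided by the DCG of the ideal ranking."""
    idcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0

# Relevance of each retrieved chunk, in retrieved order (1 = relevant, 0 = not).
print(ndcg_at_k([0, 1, 1, 0, 1], k=5))  # < 1.0 because relevant chunks are not all on top
```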
Mean Average Precision (MAP) is a metric that evaluates the ranking of every correctly retrieved document within a list of results.
It is beneficial when your system needs to consider the order of results and retrieve multiple documents in a single run.
Pros: Considers both precision and recall, providing a balanced evaluation of retrieval performance; Suitable for tasks requiring multiple relevant documents and their correct ordering.
Cons: Can be more demanding in terms of computation compared to simpler metrics; May not be as straightforward to interpret as other metrics, requiring more context to understand the results fully.
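A sketch of MAP under the same ground-truth-ID setup (again, the data format is only illustrative):

```python
def average_precision(ranked_ids, relevant_ids) -> float:
    """Average of precision@k taken at each rank where a relevant chunk appears."""
    hits, precision_sum = 0, 0.0
    for k, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id in relevant_ids:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(all_ranked, all_relevant) -> float:
    """MAP: average precision, averaged over all queries."""
    scores = [average_precision(r, rel) for r, rel in zip(all_ranked, all_relevant)]
    return sum(scores) / len(scores)

# One query with relevant chunks at ranks 1 and 3 -> AP = (1/1 + 2/3) / 2 ~= 0.83
print(mean_average_precision([["c1", "c4", "c2"]], [{"c1", "c2"}]))
```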
It measures whether the output is based on the given context or whether the model generates hallucinated responses.
Pros: Ensures that the generated responses are trustworthy and based on the provided context; Vital for applications where factual correctness is paramount.
Cons: Often requires human judgment to assess, making it labor-intensive and subjective; May not fully capture partial inaccuracies or subtle hallucinations.
Robustness is generally defined as the solution's ability to handle input variations such as perturbations in whitespace, lower/upper case, tabs, etc.
Testing robustness is an important aspect of the evaluation process and can be achieved, for instance, using Unitxt.
Pros: Ensures the model performs reliably across varied input conditions; important for practical applications where input data may not be perfectly formatted.
Cons: Requires thorough testing across many variations, which can be time-consuming; challenging to define and measure all possible input perturbations.
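One simple way to probe this kind of robustness is sketched below with a few hand-rolled surface perturbations; rag_answer and are_equivalent are placeholder callables, and evaluation frameworks such as Unitxt offer their own augmentation operators, so treat this purely as an illustration of the idea.

```python
def perturb(text: str) -> list:
    """Generate simple surface-level variations of the same input question."""
    return [
        text.lower(),                    # lowercase
        text.upper(),                    # uppercase
        "  " + text.replace(" ", "  "),  # extra whitespace
        text.replace(" ", "\t"),         # tabs instead of spaces
    ]

def robustness_score(question: str, rag_answer, are_equivalent) -> float:
    """Fraction of perturbed inputs whose answers match the unperturbed answer."""
    baseline = rag_answer(question)
    variants = perturb(question)
    agree = sum(are_equivalent(rag_answer(v), baseline) for v in variants)
    return agree / len(variants)
```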
ROUGE measures the quality of text generation by comparing the overlap of n-grams, word sequences, and word pairs between the machine-generated text and a set of reference texts. It is widely used for evaluating tasks such as text summarization and translation.
Pros: Established and recognized in the NLP community, providing a standard for comparison; Suitable for tasks where capturing all relevant information is important.
Cons: Focuses on n-gram overlap, which may not capture semantic quality or fluency; can be influenced by the length of the generated text, potentially penalizing shorter or more concise outputs.
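For reference, ROUGE scores can be computed with the open-source rouge-score package, shown here as one common option (other libraries expose similar APIs):

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "watsonx has three components: watsonx.ai, watsonx.data, and watsonx.governance."
generated = "The three watsonx components are watsonx.ai, watsonx.data and watsonx.governance."

# score(target, prediction) returns precision, recall and F-measure per ROUGE type.
scores = scorer.score(reference, generated)
for name, score in scores.items():
    print(name, round(score.fmeasure, 3))
```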
BLEU measures the quality of machine-translated text by comparing it to one or more reference translations. It evaluates the precision of n-grams in the generated text with respect to the reference texts, and is primarily used for evaluating translation.
Pros: Effective for tasks where precision and exact matches are important; widely adopted in the machine translation community, providing a benchmark for comparison.
Cons: May penalize legitimate variations in phrasing that do not exactly match reference texts; may not fully capture partial inaccuracies or subtle differences in meaning.
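Similarly, a sentence-level BLEU score can be computed with NLTK; for serious benchmarking a corpus-level implementation such as sacreBLEU is usually preferred, so this is only a minimal sketch.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the cat is on the mat".split()
candidate = "the cat sits on the mat".split()

# Smoothing avoids zero scores when some higher-order n-grams have no overlap.
bleu = sentence_bleu([reference], candidate,
                     smoothing_function=SmoothingFunction().method1)
print(round(bleu, 3))
```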
CPU utilization is primarily associated with the retrieval phase, while GPU utilization is primarily associated with the generation phase.
Example: Cost from OpenAI API calls
Costs from storage, networking, computing resources, etc.
Costs from maintenance, support, monitoring, logging, security measures, etc.
If the retrieval metrics indicate suboptimal performance, yet the generation metrics yield favourable results, it is advisable to:
Conversely, if the retrieval metrics demonstrate strong performance but the generation results are suboptimal, consider the following strategies to improve model performance:
In the scenario where both the retrieval and generation metrics exhibit subpar performance, it is prudent to revisit the initial stages of the pipeline, such as enhancing metadata, refining the knowledge base, and optimizing the retrieval mechanism.
This approach emphasizes the importance of a comprehensive and nuanced evaluation of the RAG system, taking into account the interplay between retrieval and generation components and the overall effectiveness of the system in achieving its intended goals and objectives.
Vicky Kuo, Amna Jamal, Luke Major, Chris Kirby
Updated: November 15, 2024