RAG evaluation can be performed with both automatic reference-based and reference-free metrics. HuggingFace also hosts a leaderboard that compares how well open-source LLMs stack up against each other.
Retrieval Metrics
The retrieval metrics below are reference-based: every chunk must be uniquely identified (contexts_id), and every question must list the IDs of its ground-truth contexts.
Rank-aware evaluation metrics used for recommendation systems are appropriate for RAG.
MRR (Mean Reciprocal Rank)
MRR is used in Unitxt and measures the position of the first relevant document in the search results. A higher MRR, close to 1, indicates that relevant results appear near the top, reflecting high search quality. Conversely, a lower MRR means lower search performance, with relevant answers positioned further down in the results.
Pros: Emphasizes the importance of the first relevant result, which is often critical in search scenarios.
Cons: Does not penalize the retriever for ranking other ground-truth contexts low; not suitable for evaluating the entire list of retrieved results, since it focuses only on the first relevant item.
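As a minimal sketch (plain Python with illustrative variable names, not tied to any specific library's API), MRR can be computed from each query's ranked list of retrieved context IDs and its set of ground-truth IDs:

```python
def mean_reciprocal_rank(retrieved_ids_per_query, relevant_ids_per_query):
    """Average of 1/rank of the first relevant context per query (0 if none is found)."""
    reciprocal_ranks = []
    for retrieved, relevant in zip(retrieved_ids_per_query, relevant_ids_per_query):
        rr = 0.0
        for rank, ctx_id in enumerate(retrieved, start=1):
            if ctx_id in relevant:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Example: first query hits at rank 2, second at rank 1 -> MRR = (0.5 + 1.0) / 2 = 0.75
print(mean_reciprocal_rank(
    [["c7", "c3", "c9"], ["c1", "c4"]],
    [{"c3"}, {"c1"}],
))
```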
NDCG (Normalized Discounted Cumulative Gain)
A ranking-quality metric that assesses how well a list of items is ordered compared to an ideal ranking, in which all relevant items are positioned at the top.
NDCG@k is calculated as DCG@k divided by the ideal DCG@k (IDCG@k), which represents the score of a perfectly ranked list of items up to position k. DCG measures the total relevance of items in a list.
NDCG ranges from 0 to 1, where 1 indicates a perfect ranking.
Pros: Takes into account the position of relevant items, providing a more holistic view of ranking quality; Can be adjusted for different levels of ranking (e.g., NDCG@k).
Cons: More complex to compute and interpret compared to simpler metrics like MRR; Requires an ideal ranking for comparison, which may not always be available or easy to define.
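A minimal sketch of NDCG@k with the common logarithmic discount, assuming binary relevance labels listed in retrieval order:

```python
import math

def dcg_at_k(relevances, k):
    """DCG@k: sum of gains discounted by log2 of the position (1-indexed)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG@k normalized by the ideal DCG@k (all relevant items ranked first)."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Binary relevance example: relevant chunks at ranks 1 and 3 in the top 4 -> ~0.92
print(round(ndcg_at_k([1, 0, 1, 0], k=4), 3))
```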
MAP (Mean Average Precision)
Mean Average Precision (MAP) evaluates the ranking of each correctly retrieved document within the result list by averaging precision at every position where a relevant document appears.
It is beneficial when your system needs to consider the order of results and retrieve multiple documents in a single run.
Pros: Considers both precision and recall, providing a balanced evaluation of retrieval performance; Suitable for tasks requiring multiple relevant documents and their correct ordering.
Cons: Can be more demanding in terms of computation compared to simpler metrics; May not be as straightforward to interpret as other metrics, requiring more context to understand the results fully.
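A minimal sketch, assuming each query provides a ranked list of retrieved context IDs and a set of ground-truth IDs:

```python
def average_precision(retrieved, relevant):
    """Precision@k summed over ranks with a relevant hit, normalized by the number of relevant contexts."""
    hits, precisions = 0, []
    for rank, ctx_id in enumerate(retrieved, start=1):
        if ctx_id in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(retrieved_per_query, relevant_per_query):
    aps = [average_precision(r, g) for r, g in zip(retrieved_per_query, relevant_per_query)]
    return sum(aps) / len(aps)

# Example: relevant contexts at ranks 1 and 3 -> AP = (1/1 + 2/3) / 2 ≈ 0.83
print(round(mean_average_precision([["c1", "c8", "c3"]], [{"c1", "c3"}]), 2))
```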
Generation Metrics
Faithfulness
Measures whether the output is based on the given context or if the model generates hallucinated responses.
Pros: Ensures that the generated responses are trustworthy and based on the provided context; Vital for applications where factual correctness is paramount.
Cons: Often requires human judgment to assess, making it labor-intensive and subjective; May not fully capture partial inaccuracies or subtle hallucinations.
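Full faithfulness checks typically rely on an LLM judge or an NLI model; the sketch below is only a crude lexical proxy that flags answers whose tokens are largely absent from the retrieved context, not the metric itself:

```python
def lexical_faithfulness(answer: str, context: str) -> float:
    """Fraction of answer tokens that also appear in the retrieved context (crude proxy)."""
    answer_tokens = set(answer.lower().split())
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & context_tokens) / len(answer_tokens)

# All answer tokens appear in the context -> score of 1.0
print(lexical_faithfulness(
    "The warranty lasts two years",
    "Our product warranty lasts two years from the date of purchase",
))
```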
Robustness (insensitivity)
Robustness is generally defined as the solution's ability to handle input variations such as perturbations in whitespace, casing, tabs, and similar surface changes.
Testing robustness is an important aspect of the evaluation process and can be done, for instance, with Unitxt; a small harness sketch follows the pros and cons below.
Pros: Ensures the model performs reliably across varied input conditions; important for practical applications where input data may not be perfectly formatted.
Cons: Requires thorough testing across many variations, which can be time-consuming; it is also challenging to define and measure all possible input perturbations.
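A hypothetical harness (the rag_pipeline callable is assumed for illustration, not part of any library) that applies simple surface perturbations and measures how often the answer stays the same:

```python
def perturb(question: str):
    """Yield the original question plus simple surface perturbations."""
    yield question
    yield question.lower()
    yield question.upper()
    yield "  " + question.replace(" ", "\t") + "  "   # whitespace/tab noise

def robustness_rate(rag_pipeline, question: str) -> float:
    """Share of perturbed inputs whose answer matches the answer to the original input."""
    variants = list(perturb(question))
    baseline = rag_pipeline(variants[0])
    matches = sum(rag_pipeline(v) == baseline for v in variants)
    return matches / len(variants)
```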
ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
It measures the quality of text generation by comparing the overlap of n-grams, word sequences, and word pairs between the machine-generated text and a set of reference texts. Widely used for evaluating tasks such as text summarization and translation.
Pros: Established and recognized in the NLP community, providing a standard for comparison; Suitable for tasks where capturing all relevant information is important.
Cons: Focuses on n-gram overlap, which may not capture semantic quality or fluency; can be influenced by the length of the generated text, potentially penalizing shorter or more concise outputs.
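A short example with the open-source rouge-score package (assuming it is installed; the reference and generated strings are made up):

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "the retrieved policy covers water damage",          # reference text
    "water damage is covered by the retrieved policy",   # generated answer
)
print(scores["rouge1"].fmeasure, scores["rougeL"].fmeasure)
```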
BLEU (Bilingual Evaluation Understudy)
It measures the quality of machine-translated text by comparing it to one or more reference translations. It evaluates the precision of n-grams in the generated text with respect to the reference texts. Primarily used for evaluating translation.
Pros: Effective for tasks where precision and exact matches are important; widely adopted in the machine translation community, providing a standard benchmark for comparison.
Cons: May penalize legitimate variations in phrasing that do not exactly match the reference texts; may not fully capture partial inaccuracies or subtle differences in meaning.
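A short example with the sacrebleu package (assuming it is installed; the hypothesis and reference strings are made up):

```python
import sacrebleu

hypotheses = ["the cat sat on the mat"]
references = [["the cat is sitting on the mat"]]   # one reference stream, parallel to hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)   # reported on a 0-100 scale
```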
Cost Metrics
GPU/CPU Utilization
CPU resources are consumed primarily by the retrieval phase, while GPU resources are consumed primarily by the generation phase.
LLM Calls Cost
Example: Cost from OpenAI API calls
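A back-of-the-envelope estimate from token counts; the per-1K-token prices below are placeholders, not actual OpenAI rates, so substitute your provider's current pricing:

```python
PROMPT_PRICE_PER_1K = 0.0005       # assumed USD per 1K prompt tokens (placeholder)
COMPLETION_PRICE_PER_1K = 0.0015   # assumed USD per 1K completion tokens (placeholder)

def call_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Estimated cost of a single LLM call from its token usage."""
    return (prompt_tokens / 1000) * PROMPT_PRICE_PER_1K + \
           (completion_tokens / 1000) * COMPLETION_PRICE_PER_1K

# e.g. a RAG query with a 3,000-token stuffed prompt and a 300-token answer
print(f"${call_cost(3000, 300):.4f}")
```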
Infrastructure Cost
Costs from storage, networking, computing resources, etc.
Operation Cost
Costs from maintenance, support, monitoring, logging, security measures, etc.
Understanding Evaluation Results