Date: 2024-09-10

There are many RAG evaluation frameworks and evaluation metrics. Apart from Ragas, other frameworks include IBM's unitxt and OpenAI's Evals. Unlike the other frameworks, Ragas uses another LLM as a judge (the LLM-as-a-judge approach) to evaluate the performance of a RAG pipeline.

Several evaluation metrics are available for measuring the performance of our RAG pipeline. The metrics we use from the open source Ragas framework fall into two groups:

Generation evaluation
- Faithfulness measures whether the claims made in the generated answer can be inferred from the retrieved context.
- Answer relevancy measures how relevant the generated response is to the question.

Retrieval evaluation
- Context precision measures the ranking of ground-truth relevant items in the context. Higher context precision means ground-truth relevant items are ranked higher than “noise.”
- Context recall measures the extent to which the ground-truth answer to a user query can be attributed to the retrieved context.
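To make the scoring concrete, here is a minimal sketch of how a faithfulness score and a rank-aware context precision score can be computed once a judge has produced binary verdicts. The function names and the boolean inputs are illustrative assumptions and are not the Ragas API. In Ragas, the verdicts themselves come from the judge LLM.

```python
def faithfulness_score(claim_supported):
    """Fraction of answer claims the judge marked as supported by the context.

    claim_supported: list of booleans, one per claim (hypothetical judge output).
    """
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


def context_precision_score(chunk_relevant):
    """Mean of precision@k taken over the ranks k that hold a relevant chunk.

    chunk_relevant: list of booleans in retrieval order, True if the chunk
    at that rank was judged relevant (hypothetical judge output).
    """
    hits = 0
    total = 0.0
    for k, relevant in enumerate(chunk_relevant, start=1):
        if relevant:
            hits += 1
            total += hits / k  # precision@k at this relevant rank
    return total / hits if hits else 0.0


# Same two relevant chunks, different ranking.
ranked_well = context_precision_score([True, True, False])
ranked_poorly = context_precision_score([False, True, True])
```

With the same number of relevant chunks, the ranking that places them first scores higher, which is the sense in which relevant items should outrank "noise."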



These metrics are meant to be subjective proxies for how well a RAG pipeline retrieves relevant information from its knowledge base to form a response. Note that there is no single ideal choice of data, prompts or LLMs. Even context with a low context_relevance score is not necessarily bad context. The low score might be due to some amount of "noise" (less relevant information), or simply because the task itself is open to multiple interpretations. Noise is not necessarily bad either. We, as humans, produce a certain amount of noise in our responses while still answering questions intelligibly.

Biases also affect the evaluation of a RAG pipeline. One example is length bias, a preference for either shorter or longer responses. This type of bias can lead to one response being rated higher than another because of its length and not its substance.

For these reasons, it is best practice to perform multiple evaluations. You can do this by varying the LLM's prompt template, the metrics, the sequence of evaluation and more. If you are creating your own data set for your RAG pipeline, it is also recommended to use different models for the LLM generating the responses and the LLM critiquing the responses. If the same model is used for both, there is greater potential for self-evaluation bias. Because these evaluation metrics are subjective, the results produced by these frameworks should also be checked by human judges.
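One way to compare repeated evaluations is to summarize each metric across runs and look at the spread. The sketch below assumes each run has already been reduced to a dictionary of metric name to score. The `summarize_runs` helper and the sample scores are hypothetical and not part of Ragas.

```python
from statistics import mean, stdev


def summarize_runs(runs):
    """Aggregate per-metric scores across several evaluation runs.

    runs: list of dicts mapping metric name -> score, one dict per run.
    Returns a dict mapping metric name -> mean, sample stdev and run count.
    """
    metric_names = sorted({name for run in runs for name in run})
    summary = {}
    for name in metric_names:
        scores = [run[name] for run in runs if name in run]
        # Sample standard deviation needs at least two scores.
        spread = stdev(scores) if len(scores) > 1 else 0.0
        summary[name] = {"mean": mean(scores), "stdev": spread, "runs": len(scores)}
    return summary


# Illustrative scores from two runs with different prompt templates.
example_runs = [
    {"faithfulness": 0.8, "answer_relevancy": 0.9},
    {"faithfulness": 0.6, "answer_relevancy": 0.9},
]
example_summary = summarize_runs(example_runs)
```

A metric whose score swings widely between runs is a good candidate for review by a human judge before drawing conclusions from it.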

In this tutorial, we do not create a RAG system. Instead, we are using Ragas to evaluate the output of a previously created RAG system. For more information about how to build your RAG system using LangChain, see our detailed RAG tutorial.