In this tutorial, we will use the Ragas framework for Retrieval-Augmented Generation (RAG) evaluation in Python using LangChain.
RAG is a technique in natural language processing (NLP) that combines information retrieval and generative models to produce responses that are more accurate, relevant and contextually aware. In traditional language generation tasks, large language models (LLMs) such as OpenAI’s GPT-4 (Generative Pre-trained Transformer) or IBM® Granite™ models construct responses from an input prompt alone; chatbots are a common real-world use case of these models. On their own, however, LLMs can struggle to produce responses that are contextually relevant, factually accurate or up to date.
RAG applications address this limitation by incorporating a retrieval step before response generation. During retrieval, additional text fragments relevant to the prompt are pulled from a knowledge base, such as relevant documents from a large corpus of text, typically stored in a vector database. Finally, an LLM is used for generating responses based on the original prompt augmented with the retrieved context.
There are many different RAG evaluation frameworks and evaluation metrics. Apart from Ragas, other frameworks include IBM's unitxt and OpenAI's Evals. Unlike the other frameworks, Ragas uses an LLM as a judge to evaluate the performance of a RAG pipeline.
There are several evaluation metrics available for measuring the performance of a RAG pipeline. The metrics we will be using from the open source Ragas framework can be split into two parts: generation metrics, such as faithfulness and answer relevancy, which judge the quality of the generated response, and retrieval metrics, such as context precision and context recall, which judge the quality of the context retrieved from the knowledge base.
These metrics are meant to be subjective proxies for how well a RAG pipeline retrieves relevant information from its knowledge base to form a response. It is important to note that there is no single ideal for data, prompts or LLMs, so the scores should be read as relative signals rather than absolute judgments. Even context that receives a low score on one metric might still contribute to a useful response.
There are also biases that affect the evaluation of a RAG pipeline such as preference for either shorter or longer responses, otherwise known as length bias. This type of bias can lead to one response being evaluated higher than another because of its length and not its substance.
For these reasons, it is best practice to perform multiple evaluations. This exercise can be accomplished by changing the LLM's prompt template, metrics, sequence of evaluation and more. If you are creating your own data set for your RAG pipeline, it is also recommended to use different models for the LLM generating the responses and the LLM critiquing the responses. If the same model is used for both, there is greater potential for self-evaluation bias. Because these evaluation metrics are subjective, the results produced by these frameworks should also be checked by human judges.
In this tutorial, we do not create a RAG system. Instead, we are using Ragas to evaluate the output of a previously created RAG system. For more information about how to build your RAG system using LangChain, see our detailed RAG tutorial.
You need an IBM Cloud® account to create a watsonx.ai™ project. Sign up for a free account here.
While you can choose from several tools, this tutorial walks you through how to set up an IBM account to use a Jupyter Notebook.
Log in to watsonx.ai using your IBM Cloud account.
Create a watsonx.ai project.
You can get your project ID from within your project. Click the Manage tab. Then, copy the project ID from the Details section of the General page. You need this ID for this tutorial.
Create a Jupyter Notebook.
This step opens a notebook environment where you can copy the code from this tutorial to implement a RAG evaluation of your own. Alternatively, you can download this notebook to your local system and upload it to your watsonx.ai project as an asset. This Jupyter Notebook is also available on GitHub.
Create a Watson Machine Learning service instance (select your appropriate region and choose the Lite plan, which is a free instance).
Generate an API Key in WML.
Associate the WML service to the project you created in watsonx.ai.
We need a few libraries and modules for this tutorial. Make sure to import the modules listed below; if any are not installed, a quick pip installation resolves the problem.
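The following is a minimal sketch of the installation and imports. The package list is an assumption based on the libraries used later in this tutorial (ragas, datasets, pandas, langchain-ibm and ibm-watsonx-ai); adjust it to match your environment.

```python
# Install any missing packages (uncomment in a notebook cell);
# the package list is an assumption based on the libraries used in this tutorial
# !pip install ragas datasets pandas langchain-ibm ibm-watsonx-ai

import getpass                     # prompt for credentials without echoing them
import pandas as pd                # tabular inspection of the evaluation data
from datasets import load_dataset  # load the evaluation dataset from Hugging Face
```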
Set up your credentials. Input your API key and project ID as strings. Depending on the region in which your service instance is provisioned, use that region's endpoint as your watsonx URL:
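As an illustration, the snippet below collects the credentials at runtime and uses the Dallas endpoint (https://us-south.ml.cloud.ibm.com); substitute the endpoint for the region in which your service instance is provisioned.

```python
import getpass

credentials = {
    # Dallas endpoint shown as an example; use your own region's watsonx.ai URL
    "url": "https://us-south.ml.cloud.ibm.com",
    "apikey": getpass.getpass("Enter your watsonx.ai API key: "),
}

project_id = input("Enter your watsonx.ai project ID: ")
```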
Ragas evaluation requires a dataset containing RAG pipeline executions of several different prompts. In addition to the questions themselves, the dataset needs to contain the expected answers, known as "ground truths," the answers produced by the LLM and the list of context pieces retrieved by the RAG pipeline while answering each question. You can create your own end-to-end dataset but, for the purposes of this tutorial, we are using a dataset that is available on Hugging Face. Let's load the dataset.
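A minimal loading sketch is shown below. It assumes the dataset is the Amnesty QA evaluation set published by the Ragas maintainers (explodinggradients/amnesty_qa, english_v2 configuration); if you use a different dataset, substitute its identifier.

```python
from datasets import load_dataset

# Amnesty QA evaluation set; the identifier and configuration are assumptions
# based on the publicly available Ragas example dataset.
# Depending on your version of the datasets library, you might also need to
# pass trust_remote_code=True.
amnesty_qa = load_dataset("explodinggradients/amnesty_qa", "english_v2")
amnesty_qa
```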
Output:
The data is loaded as a DatasetDict and the features we are interested in are within the "eval" split.
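For example, selecting the split looks like this (amnesty_qa is the DatasetDict loaded earlier):

```python
# The "eval" split contains the question, ground_truth, answer and contexts features
amnesty_qa["eval"]
```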
Output:
Now, load the data into a Pandas dataframe. To see an example of an entry in this dataset, see the Hugging Face documentation.
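A short sketch of the conversion, assuming the amnesty_qa object from the earlier step:

```python
# Convert the evaluation split to a pandas DataFrame for easier inspection
df = amnesty_qa["eval"].to_pandas()
df.head()
```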
Datasets for RAG evaluation can be created in various ways. A key element for the creation of this dataset was the external knowledge base provided to an LLM. This knowledge can be obtained from a scraped webpage, basic text file, imported document and more. In this case, reports collected from Amnesty International are used. The content of the dataset might have been created end-to-end or by using a synthetic data generation approach such as Ragas' TestsetGenerator. Using TestsetGenerator requires the loaded documents, a generator LLM, a critic LLM and an embedding model.
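For reference, a synthetic-generation sketch with the Ragas 0.1.x API might look like the following; the documents, generator_llm, critic_llm and embeddings objects are assumed to be defined elsewhere, and the exact API can differ between Ragas versions.

```python
from ragas.testset.generator import TestsetGenerator

# `documents` is a list of LangChain Document objects built from the knowledge base;
# `generator_llm`, `critic_llm` and `embeddings` are LangChain-compatible models
generator = TestsetGenerator.from_langchain(generator_llm, critic_llm, embeddings)
testset = generator.generate_with_langchain_docs(documents, test_size=10)
```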
In turn, the end-to-end approach involves several steps. Let's assume this approach was taken for the creation of this dataset. This means that either an LLM or a human user generated the questions stored in the question column. To generate the ground truths for each question, the user might have manually created them or generated them using an LLM with the appropriate prompt template. These responses are deemed as the ideal answers and are stored in the ground_truth column. Lastly, a RAG pipeline was used to generate the answers seen in the answer column. When building the RAG pipeline, the external knowledge base was vectorized. Then, when querying the RAG system, the relevant chunks of text that the LLM used for generating each answer were obtained from the vector store by using a similarity algorithm such as the top-k retrieval algorithm. These datasets were stored in the contexts column.
In this tutorial, we are using an IBM Granite model as the judge.
Ragas uses OpenAI models by default, so we need to configure it to use the Granite model instead.
For this tutorial, we suggest using the IBM Granite 13B Chat model as the LLM to achieve similar results. You are free to use any AI model of your choice to compare to this benchmark and choose the best fit for your application. The foundation models available through watsonx.ai can be found here. The purpose of these models in LLM applications is to serve as the reasoning engine that decides which actions to take and responses to produce. To use the Granite model with Ragas, we instantiate it through the watsonx.ai LangChain wrapper, as sketched below.
The Granite model serves only as the evaluation model. We are not going to use a model to generate any responses because the responses are already stored in the dataset's answer column.
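A sketch of that wrapper setup is shown below; the model ID and decoding parameters are illustrative choices rather than values prescribed by this tutorial.

```python
from langchain_ibm import WatsonxLLM

# Granite 13B Chat served through watsonx.ai, used only as the Ragas judge
watsonx_llm = WatsonxLLM(
    model_id="ibm/granite-13b-chat-v2",
    url=credentials["url"],
    apikey=credentials["apikey"],
    project_id=project_id,
    params={"decoding_method": "greedy", "max_new_tokens": 512},
)
```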
The embedding model that we are using is an IBM Slate™ model through a watsonx.ai LangChain wrapper. If no embedding model is defined, Ragas uses OpenAI embeddings by default. The embeddings model is essential for evaluation as it is used to embed the data from the separate columns to measure the distance between them.
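A sketch of the embedding setup, assuming one of the Slate retrieval models available on watsonx.ai:

```python
from langchain_ibm import WatsonxEmbeddings

# IBM Slate retrieval embedding model served through watsonx.ai
watsonx_embeddings = WatsonxEmbeddings(
    model_id="ibm/slate-125m-english-rtrvr",
    url=credentials["url"],
    apikey=credentials["apikey"],
    project_id=project_id,
)
```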
Finally, we can now run the Ragas evaluation on the dataset. Here, we pass in the dataset, the metrics for evaluation, the LLM and the embedding model as parameters.
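A sketch of the call is shown below; the four metrics correspond to the generation and retrieval metrics described earlier, and the llm and embeddings arguments point to the watsonx.ai judge and embedding model defined above.

```python
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# Evaluate the "eval" split with the Granite judge and Slate embeddings
result = evaluate(
    amnesty_qa["eval"],
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    llm=watsonx_llm,
    embeddings=watsonx_embeddings,
)
print(result)
```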
If warning messages appear, disregard them, allow the evaluation to complete and print the result as shown.
Output:
And that's it. One evaluation of the RAG pipeline has been completed. As mentioned, you can run multiple evaluations, try different models and alter parameters. The more evaluations you perform, the more comprehensively you can assess the accuracy and effectiveness of an LLM system that uses RAG.
In this tutorial, you used Ragas to evaluate your RAG pipeline. Your output included an overall score for each of the evaluation metrics passed to Ragas.
The evaluation performed is important as it can be applied to future generative AI workflows to assess the performance of your RAG systems and improve upon them.
We encourage you to check out the Ragas documentation page for more information on their metrics and evaluation process.