In this tutorial, we will use the Ragas framework for Retrieval-Augmented Generation (RAG) evaluation in Python using LangChain.
RAG is a technique in natural language processing (NLP) that combines information retrieval and generative models to produce responses that are more accurate, relevant and contextually aware. In traditional language generation tasks, large language models (LLMs) such as OpenAI’s GPT-4 (Generative Pre-trained Transformer) or IBM® Granite™ models construct responses from an input prompt alone; chatbots are a common real-world use case of these models. On their own, however, LLMs can struggle to produce responses that are contextually relevant, factually accurate or up to date.
RAG applications address this limitation by incorporating a retrieval step before response generation. During retrieval, additional text fragments relevant to the prompt are pulled from a knowledge base, such as relevant documents from a large corpus of text, typically stored in a vector database. Finally, an LLM is used for generating responses based on the original prompt augmented with the retrieved context.
There are many different RAG evaluation frameworks and evaluation metrics. Apart from Ragas, other frameworks include IBM's unitxt and OpenAI's Evals. Unlike the other frameworks, Ragas uses an LLM as a judge to evaluate the performance of a RAG pipeline.
There are several evaluation metrics available for measuring the performance of a RAG pipeline. The metrics we will be using from the open source Ragas framework can be split into two parts: generation metrics, such as faithfulness and answer relevancy, which judge the quality of the generated response, and retrieval metrics, such as context precision and context recall, which judge the quality of the context retrieved from the knowledge base.
These metrics are meant to be subjective proxies for how well a RAG pipeline retrieves relevant information from its knowledge base to form a response. It is important to note that there is no single ideal for data, prompts or LLMs, so the scores should be read as relative signals rather than absolute judgments. Even context that receives a low score on one metric might still contribute to a useful response.
There are also biases that affect the evaluation of a RAG pipeline such as preference for either shorter or longer responses, otherwise known as length bias. This type of bias can lead to one response being evaluated higher than another because of its length and not its substance.
For these reasons, it is best practice to perform multiple evaluations. This exercise can be accomplished by changing the LLM's prompt template, metrics, sequence of evaluation and more. If you are creating your own data set for your RAG pipeline, it is also recommended to use different models for the LLM generating the responses and the LLM critiquing the responses. If the same model is used for both, there is greater potential for self-evaluation bias. Because these evaluation metrics are subjective, the results produced by these frameworks should also be checked by human judges.
In this tutorial, we do not create a RAG system. Instead, we are using Ragas to evaluate the output of a previously created RAG system. For more information about how to build your RAG system using LangChain, see our detailed RAG tutorial.
You need an IBM Cloud® account to create a watsonx.ai™ project. Sign up for a free account here.
While you can choose from several tools, this tutorial walks you through how to set up an IBM account to use a Jupyter Notebook.
Log in to watsonx.ai using your IBM Cloud account.
Create a watsonx.ai project.
You can get your project ID from within your project. Click the Manage tab. Then, copy the project ID from the Details section of the General page. You need this ID for this tutorial.
Create a Jupyter Notebook.
This step opens a notebook environment where you can copy the code from this tutorial to implement a RAG evaluation of your own. Alternatively, you can download this notebook to your local system and upload it to your watsonx.ai project as an asset. This Jupyter Notebook is also available on GitHub.
Create a Watson Machine Learning service instance (select your appropriate region and choose the Lite plan, which is a free instance).
Generate an API Key in WML.
Associate the WML service to the project you created in watsonx.ai.
We need a few libraries and modules for this tutorial. Make sure to import the modules listed below; if any are not installed, a quick pip installation resolves the problem.
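The following is a minimal sketch of the installation and imports. The package list is an assumption based on the libraries used later in this tutorial (ragas, datasets, pandas, langchain-ibm and ibm-watsonx-ai); adjust it to match your environment.

```python
# Install any missing packages (uncomment in a notebook cell);
# the package list is an assumption based on the libraries used in this tutorial
# !pip install ragas datasets pandas langchain-ibm ibm-watsonx-ai

import getpass                     # prompt for credentials without echoing them
import pandas as pd                # tabular inspection of the evaluation data
from datasets import load_dataset  # load the evaluation dataset from Hugging Face
```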
Set up your credentials. Input your API key and project ID as strings. Depending on the region in which your service instance is provisioned, use that region's endpoint as your watsonx URL:
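As an illustration, the snippet below collects the credentials at runtime and uses the Dallas endpoint (https://us-south.ml.cloud.ibm.com); substitute the endpoint for the region in which your service instance is provisioned.

```python
import getpass

credentials = {
    # Dallas endpoint shown as an example; use your own region's watsonx.ai URL
    "url": "https://us-south.ml.cloud.ibm.com",
    "apikey": getpass.getpass("Enter your watsonx.ai API key: "),
}

project_id = input("Enter your watsonx.ai project ID: ")
```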
Ragas evaluation requires a dataset containing RAG pipeline executions of several different prompts. In addition to the questions themselves, the dataset needs to contain the expected answers, known as "ground truths," the answers produced by the LLM and the list of context pieces retrieved by the RAG pipeline while answering each question. You can create your own end-to-end dataset but, for the purposes of this tutorial, we are using a dataset that is available on Hugging Face. Let's load the dataset.
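A minimal loading sketch is shown below. It assumes the dataset is the Amnesty QA evaluation set published by the Ragas maintainers (explodinggradients/amnesty_qa, english_v2 configuration); if you use a different dataset, substitute its identifier.

```python
from datasets import load_dataset

# Amnesty QA evaluation set; the identifier and configuration are assumptions
# based on the publicly available Ragas example dataset.
# Depending on your version of the datasets library, you might also need to
# pass trust_remote_code=True.
amnesty_qa = load_dataset("explodinggradients/amnesty_qa", "english_v2")
amnesty_qa
```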
Output:
The data is loaded as a DatasetDict and the features we are interested in are within the "eval" split.
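For example, selecting the split looks like this (amnesty_qa is the DatasetDict loaded earlier):

```python
# The "eval" split contains the question, ground_truth, answer and contexts features
amnesty_qa["eval"]
```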
Output:
Now, load the data into a Pandas dataframe. To see an example of an entry in this dataset, see the Hugging Face documentation.
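A short sketch of the conversion, assuming the amnesty_qa object from the earlier step:

```python
# Convert the evaluation split to a pandas DataFrame for easier inspection
df = amnesty_qa["eval"].to_pandas()
df.head()
```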
Datasets for RAG evaluation can be created in various ways. A key element for the creation of this dataset was the external knowledge base provided to an LLM. This knowledge can be obtained from a scraped webpage, basic text file, imported document and more. In this case, reports collected from Amnesty International are used. The content of the dataset might have been created end-to-end or by using a synthetic data generation approach such as Ragas' TestsetGenerator. Using TestsetGenerator requires the loaded documents, a generator LLM, a critic LLM and an embedding model.
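For reference, a synthetic-generation sketch with the Ragas 0.1.x API might look like the following; the documents, generator_llm, critic_llm and embeddings objects are assumed to be defined elsewhere, and the exact API can differ between Ragas versions.

```python
from ragas.testset.generator import TestsetGenerator

# `documents` is a list of LangChain Document objects built from the knowledge base;
# `generator_llm`, `critic_llm` and `embeddings` are LangChain-compatible models
generator = TestsetGenerator.from_langchain(generator_llm, critic_llm, embeddings)
testset = generator.generate_with_langchain_docs(documents, test_size=10)
```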
In turn, the end-to-end approach involves several steps. Let's assume this approach was taken for the creation of this dataset. This means that either an LLM or a human user generated the questions stored in the question column. To generate the ground truths for each question, the user might have manually created them or generated them using an LLM with the appropriate prompt template. These responses are deemed as the ideal answers and are stored in the ground_truth column. Lastly, a RAG pipeline was used to generate the answers seen in the answer column. When building the RAG pipeline, the external knowledge base was vectorized. Then, when querying the RAG system, the relevant chunks of text that the LLM used for generating each answer were obtained from the vector store by using a similarity algorithm such as the top-k retrieval algorithm. These datasets were stored in the contexts column.
In this tutorial, we are using an IBM Granite model as the judge.
Ragas uses OpenAI models by default, so we need to configure it to use the Granite model instead.
For this tutorial, we suggest using the IBM Granite 13B Chat model as the LLM to achieve similar results. You are free to use any AI model of your choice to compare to this benchmark and choose the best fit for your application. The foundation models available through watsonx.ai can be found here. The purpose of these models in LLM applications is to serve as the reasoning engine that decides which actions to take and responses to produce. To use the Granite model with Ragas, we instantiate it through the watsonx.ai LangChain wrapper, as sketched below.
The Granite model serves only as the evaluation model. We are not going to use a model to generate any responses because the responses are already stored in the dataset's answer column.
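A sketch of that wrapper setup is shown below; the model ID and decoding parameters are illustrative choices rather than values prescribed by this tutorial.

```python
from langchain_ibm import WatsonxLLM

# Granite 13B Chat served through watsonx.ai, used only as the Ragas judge
watsonx_llm = WatsonxLLM(
    model_id="ibm/granite-13b-chat-v2",
    url=credentials["url"],
    apikey=credentials["apikey"],
    project_id=project_id,
    params={"decoding_method": "greedy", "max_new_tokens": 512},
)
```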
The embedding model that we are using is an IBM Slate™ model through a watsonx.ai LangChain wrapper. If no embedding model is defined, Ragas uses OpenAI embeddings by default. The embeddings model is essential for evaluation as it is used to embed the data from the separate columns to measure the distance between them.
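A sketch of the embedding setup, assuming one of the Slate retrieval models available on watsonx.ai:

```python
from langchain_ibm import WatsonxEmbeddings

# IBM Slate retrieval embedding model served through watsonx.ai
watsonx_embeddings = WatsonxEmbeddings(
    model_id="ibm/slate-125m-english-rtrvr",
    url=credentials["url"],
    apikey=credentials["apikey"],
    project_id=project_id,
)
```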
Finally, we can now run the Ragas evaluation on the dataset. Here, we pass in the dataset, the metrics for evaluation, the LLM and the embedding model as parameters.
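A sketch of the call is shown below; the four metrics correspond to the generation and retrieval metrics described earlier, and the llm and embeddings arguments point to the watsonx.ai judge and embedding model defined above.

```python
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# Evaluate the "eval" split with the Granite judge and Slate embeddings
result = evaluate(
    amnesty_qa["eval"],
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    llm=watsonx_llm,
    embeddings=watsonx_embeddings,
)
print(result)
```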
If warning messages appear, disregard them, allow the evaluation to complete and print the result as shown.
Output:
And that's it. One evaluation of the RAG pipeline has been completed. As mentioned, you can run multiple evaluations, try different models and alter parameters. The more evaluations you perform, the more comprehensively you can assess the accuracy and effectiveness of an LLM system that uses RAG.
In this tutorial, you used Ragas to evaluate your RAG pipeline. Your output included an overall score for each of the evaluation metrics passed to Ragas.
The evaluation performed is important as it can be applied to future generative AI workflows to assess the performance of your RAG systems and improve upon them.
We encourage you to check out the Ragas documentation page for more information on their metrics and evaluation process.