Large language models (LLMs) have remarkable text generation and reasoning abilities but often produce factual inaccuracies or hallucinations due to their reliance on internal knowledge. Retrieval augmented generation (RAG) based solutions aim to resolve this issue by injecting external documents into the model’s context. However, traditional RAG approaches retrieve a fixed number of passages regardless of their necessity or quality, leading to redundancy, inefficiency and inconsistent factual grounding.
The self-RAG framework provides a practical solution to this problem. It retrieves information on demand by using special control tokens that dynamically decide when and how to perform retrieval during generation. Unlike agentic or multi-agent approaches that coordinate multiple models or components, self-RAG is a model-centric framework in which a single model manages the retrieval decision, generation and critique internally. Its self-critique process is a structured step in which the model evaluates both its own output and the quality of the retrieved information, allowing it to adapt its retrieval behavior through self-reflection tokens. Because one model, trained end to end, handles all three tasks, text generation becomes more efficient, factual and controllable. The method was introduced in the paper Self-RAG: Learning to Retrieve, Generate, and Critique Through Self-Reflection (2024), which explores how fine-tuning LLMs for self-evaluation can improve factual consistency in natural language processing (NLP) tasks.
The workflow of self-RAG is orchestrated by special reflection tokens that the model generates alongside its text output, making the entire inference process dynamic and controllable. A single LLM takes on both the generator and critic roles. When additional information is needed, a retriever component fetches relevant external passages, and the same LLM then uses reflection tokens to evaluate and refine its own generation during inference. This architecture represents a broader trend in artificial intelligence (AI) toward models capable of introspection and dynamic reasoning, bridging advances in prompt engineering and long-form generation.
The LLM first generates a retrieval token to determine whether external factual information is necessary for the query. If it concludes that retrieval is not necessary, the model skips the remaining retrieval-based steps and continues with standard generation. If the retrieval token is decoded as “yes,” a retriever is called to fetch a set of relevant passages from an external knowledge base. This step makes sure that retrieval occurs only when its expected utility is high.
If retrieval is required, the retriever fetches relevant passages from an external knowledge base. The LLM then processes the input together with each retrieved passage in parallel and generates a candidate text continuation for each passage.
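The two steps above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: `decide_retrieval`, `retrieve` and `generate` are hypothetical callables that stand in for the LLM's retrieval token, the retriever and the generator.

```python
from typing import Callable, List


def generate_segment_candidates(
    query: str,
    decide_retrieval: Callable[[str], bool],
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, str], str],
) -> List[dict]:
    """Return one candidate continuation per passage, or a single
    retrieval-free candidate when the model decides not to retrieve."""
    if not decide_retrieval(query):
        # Retrieval token decoded as "no": standard generation, no passage.
        return [{"passage": None, "text": generate(query, "")}]

    # Retrieval token decoded as "yes": one continuation per passage.
    passages = retrieve(query)
    return [{"passage": p, "text": generate(query, p)} for p in passages]


# Toy stand-ins (assumptions for illustration only).
def toy_decide(query: str) -> bool:
    return "who" in query.lower() or "when" in query.lower()


def toy_retrieve(query: str) -> List[str]:
    return ["passage A", "passage B"]


def toy_generate(query: str, passage: str) -> str:
    return f"answer using [{passage}]" if passage else "answer from memory"


candidates = generate_segment_candidates(
    "Who wrote the guideline?", toy_decide, toy_retrieve, toy_generate
)
for candidate in candidates:
    print(candidate)
```

Passing the three roles in as callables keeps the control flow visible without tying the sketch to any particular model or retriever.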
For each segment generated, the model concurrently generates special critique tokens that are embedded directly within the output sequence. These tokens are not separate evaluations. Rather, they appear as part of the generated sequence and help the model check its own work as it goes:
ISREL (Relevance): Assesses the usefulness of the retrieved passage.
ISSUP (Support or factuality): Evaluates whether the generated text segment has full, partial or no factual support from the source material.
ISUSE (Utility): Evaluates the created segment’s overall quality, usefulness and structure.
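Because the critique tokens sit inside the generated sequence, downstream code has to separate them from the visible text. The sketch below shows one way to do that. The inline format `[ISREL: relevant]` is a hypothetical convention chosen for this example. The actual token vocabulary depends on how the model was trained or prompted.

```python
import re
from typing import Dict, Tuple

# Hypothetical inline format: "[ISREL: relevant]", "[ISSUP: partially]", "[ISUSE: 4]".
TOKEN_PATTERN = re.compile(r"\[(ISREL|ISSUP|ISUSE):\s*([^\]]+)\]")


def split_reflection_tokens(raw: str) -> Tuple[str, Dict[str, str]]:
    """Separate the visible text from the embedded critique tokens."""
    tokens = {
        name: value.strip().lower()
        for name, value in TOKEN_PATTERN.findall(raw)
    }
    text = TOKEN_PATTERN.sub("", raw)          # drop the tokens from the text
    text = re.sub(r"\s+", " ", text).strip()   # tidy leftover whitespace
    return text, tokens


raw_output = (
    "The guideline covers good clinical practice. "
    "[ISREL: relevant] [ISSUP: partially] [ISUSE: 4]"
)
text, tokens = split_reflection_tokens(raw_output)
print(text)
print(tokens)
```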
During inference, reflection tokens are used to decide whether to retrieve information. This mechanism enables the model to adjust to different tasks, such as retrieving less for creative activities and more for factual ones. When generating text, reflection tokens also help the model adhere to particular guidelines. They either set clear boundaries or guide word choice, which makes the model’s responses more flexible and appropriate for various contexts.
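One way to picture this inference-time control is a retrieval threshold plus a weighted score over the critique tokens. The sketch below is a simplification under stated assumptions: the label-to-number mappings, the 1 to 5 utility scale and the weight values are illustrative choices, not values taken from this tutorial.

```python
from typing import Dict

# Assumed numeric values for each critique label (illustrative only).
ISREL_SCORE = {"relevant": 1.0, "irrelevant": 0.0}
ISSUP_SCORE = {"fully": 1.0, "partially": 0.5, "none": 0.0}


def should_retrieve(p_yes: float, p_no: float, threshold: float) -> bool:
    """Retrieve when the normalized probability of 'yes' exceeds the threshold."""
    total = p_yes + p_no
    return total > 0 and (p_yes / total) > threshold


def segment_score(tokens: Dict[str, str], weights: Dict[str, float]) -> float:
    """Weighted sum of critique signals. ISUSE is assumed to be an integer
    from 1 to 5 and is rescaled to the range 0 to 1."""
    rel = ISREL_SCORE.get(tokens.get("ISREL", ""), 0.0)
    sup = ISSUP_SCORE.get(tokens.get("ISSUP", ""), 0.0)
    use = (int(tokens.get("ISUSE", "1")) - 1) / 4
    return weights["ISREL"] * rel + weights["ISSUP"] * sup + weights["ISUSE"] * use


# A factual task leans on support; a creative task leans on utility.
factual_weights = {"ISREL": 1.0, "ISSUP": 2.0, "ISUSE": 0.5}
creative_weights = {"ISREL": 0.5, "ISSUP": 0.5, "ISUSE": 2.0}
tokens = {"ISREL": "relevant", "ISSUP": "partially", "ISUSE": "5"}

print(segment_score(tokens, factual_weights))
print(segment_score(tokens, creative_weights))
print(should_retrieve(0.25, 0.75, threshold=0.2))
```

Lowering the threshold makes retrieval more frequent, which suits factual tasks. Raising it suits creative tasks where external evidence matters less.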
During training, reflection tokens are inserted into the training data offline, based on evaluations made by a separate critic model. This approach keeps self-RAG training efficient because the model learns from the annotated data how to judge its own outputs and decide when it needs to look up information. As a result, the model becomes better at producing accurate, controlled and high-quality responses.
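The data preparation idea can be illustrated with a small sketch. Here `critic` is a hypothetical callable standing in for the critic model, and the bracketed token format is an assumption made for this example only.

```python
from typing import Callable, Dict


def annotate_example(
    question: str,
    passage: str,
    answer: str,
    critic: Callable[[str, str, str], Dict[str, str]],
) -> str:
    """Build a training target with the critic's labels inserted as inline tokens."""
    labels = critic(question, passage, answer)
    token_text = " ".join(
        f"[{name}: {labels[name]}]" for name in ("ISREL", "ISSUP", "ISUSE")
    )
    return f"{answer} {token_text}"


def toy_critic(question: str, passage: str, answer: str) -> Dict[str, str]:
    """Toy critic: marks the answer as fully supported if it appears in the passage."""
    supported = answer.lower() in passage.lower()
    return {
        "ISREL": "relevant",
        "ISSUP": "fully" if supported else "none",
        "ISUSE": "5" if supported else "2",
    }


target = annotate_example(
    question="What does the guideline cover?",
    passage="The guideline covers good clinical practice for trials.",
    answer="good clinical practice",
    critic=toy_critic,
)
print(target)
```

The generator is then trained on targets like this one, so it learns to emit the tokens itself and no critic is needed at inference time.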
In the experiments reported in the original paper, self-RAG outperforms many standard retrieval-augmented and instruction-tuned baselines across various tasks, including open-domain question answering, reasoning and fact verification. By using self-reflection tokens and on-demand retrieval, it also improves factuality and citation accuracy in long-form generation, matching or outperforming OpenAI’s ChatGPT on several of these tasks.
In this tutorial, you’ll learn how to build a robust self-reflective RAG agent by using an IBM® Granite® model on watsonx® and LangGraph. Other models, frameworks and tools, such as ChatGPT, Llama 2, LlamaIndex or LangChain, also enable complex RAG flows. However, this tutorial focuses on the multimodal models available through IBM. These models understand both text and images, and their enterprise-grade design supports secure deployment, governance and scalability. These features make Granite well suited for building reliable, production-ready RAG systems that can handle complex data and maintain high standards of trust and performance.
This tutorial demonstrates how to build a self-RAG agent designed to answer complex, multifaceted queries over internal knowledge bases that include both text and visual data. The agent analyzes PDF documents, including technical guidelines and survey data. The tutorial guides you through implementing a self-RAG algorithm that:
Creates a multimodal knowledge base: Uses a language model (granite-3-3-8b-instruct) and vision LLM (granite-vision-3.3-2B) to extract text and images from PDFs, generate descriptive captions and create embeddings for both text and image data to enable semantic retrieval.
Generates and reflects: It creates an answer segment, adds reflection tokens (such as ISREL, ISSUP and ISUSE) and evaluates its own output quality and factual accuracy.
Executes self-correction: The LangGraph workflow extends the standard self-RAG approach by using a critique score derived from reflection tokens to guide its next steps. When the score is low, the agent requests stronger context and retrieves more relevant information before generating the next segment, helping produce a higher-quality final output.
Provides segmented answers: Provides thorough and traceable responses by generating complex answers in a sequence of factually validated chunks.
You need an IBM Cloud® account to create a watsonx.ai® project. Ensure that you have access to both your watsonx API Key and Project ID.
While you can choose from several tools, this tutorial walks you through how to set up an IBM account to use a Jupyter Notebook.
Log in to watsonx.ai by using your IBM Cloud account.
Create a watsonx.ai project. You can get your project ID from within your project. Click the Manage tab. Then, copy the project ID from the Details section of the General page. You need this ID for this tutorial.
Create a Jupyter Notebook.
This step opens a notebook environment where you can copy the code from this tutorial. Alternatively, you can download this notebook to your local system and upload it to your watsonx.ai project as an asset. To view more Granite tutorials, check out the IBM Granite Community. This tutorial is also available on GitHub.
Note: You can run the multimodal self-RAG tutorial entirely on a local CPU system. This setup is achievable by adapting it to use local resources instead of remote cloud services. You can initialize the Granite instruct model (the 3.2 2B version) directly from Hugging Face by using the appropriate transformers steps. For data handling, save your PDF files directly on your local system and easily read them into your Jupyter Notebook environment by using their local file path, bypassing the need for IBM Cloud Object Storage. To handle the complex reasoning and self-critique, the larger remote Granite 3.3-8B model can be replaced by a powerful open source LLM hosted locally by using a dedicated server setup. This setup requires installing specific local Python dependencies, such as langgraph, faiss-cpu, sentence-transformers and pymupdf for the vector store, RAG logic, embeddings and PDF parsing. Models can be configured for efficient CPU operation by explicitly setting the device to “cpu” and adjusting the floating-point data type. This step manages memory usage and prevents crashes common with large models on typical desktop hardware.
Create a watsonx.ai Runtime service instance (choose the Lite plan, which is a free instance).
Generate an application programming interface (API) Key.
Associate the watsonx.ai Runtime service to the project that you created in watsonx.ai.
To build and orchestrate this multimodal self-reflective RAG agent, we require a comprehensive set of libraries. Install langgraph to define the core state machine that orchestrates the self-correction loop based on critique ratings. For integrating IBM Granite LLMs and embeddings from the watsonx platform, install langchain-ibm and ibm-watsonx-ai. For quick retrieval, install faiss-cpu, which provides indexing for the vector store. We use deep learning libraries such as torch and the Hugging Face transformers library to load and run the granite-vision-3.3-2B model. To extract and process the text and images from our PDF documents, pillow and pymupdf are essential. Lastly, to access raw data from Cloud Object Storage, ibm-cos-sdk is included.
Note: No GPU is required, but execution can be slower on CPU-based systems.
Next, import all the necessary modules to set up the fundamental tools for managing the multimodal components, processing documents, coordinating the RAG workflow and connecting to IBM watsonx.
Multimodal context: This tutorial uses a vision model and libraries such as fitz (the import name of PyMuPDF) to process both text and visual data into a unified context. This approach goes beyond simple text-based RAG by enabling the agent to retrieve richer information and ground its answers in complex documents.
Self-correction loop: The system uses LangGraph (StateGraph) to build a self-reflective RAG agent. This approach allows the LLM to critique its own output for relevance and accuracy, and then automatically initiate a correction cycle by querying the vector store or refining the prompt, minimizing hallucinations.
Production-ready integration: The tutorial demonstrates a high-performance stack by integrating enterprise LLMs (such as Granite) accessed through an external application programming interface (API) or Hugging Face (depending on the setup). This approach also includes efficient vector storage (FAISS) and streamlined RAG logic, showing its viability for real-world deployment.
This step prepares your environment to securely connect to the IBM watsonx platform, allowing you to use the hosted Granite LLMs and embeddings.
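A minimal sketch of this kind of credential handling appears below. It reads settings from environment variables rather than hardcoding them. The variable names `WATSONX_APIKEY`, `WATSONX_PROJECT_ID` and `WATSONX_URL`, the helper `load_watsonx_settings` and the default regional endpoint are assumptions for illustration. Use the values that match your own account and region.

```python
import os
from typing import Dict, Mapping, Optional

# Assumed default endpoint; replace it with the URL for your region.
DEFAULT_URL = "https://us-south.ml.cloud.ibm.com"


def load_watsonx_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Collect the API key, project ID and URL, failing early if any is missing."""
    env = os.environ if env is None else env
    required = ("WATSONX_APIKEY", "WATSONX_PROJECT_ID")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"Missing required settings: {', '.join(missing)}")
    return {
        "apikey": env["WATSONX_APIKEY"],
        "project_id": env["WATSONX_PROJECT_ID"],
        "url": env.get("WATSONX_URL", DEFAULT_URL),
    }


# Example with an explicit mapping so the sketch runs without real secrets.
settings = load_watsonx_settings(
    {"WATSONX_APIKEY": "example-key", "WATSONX_PROJECT_ID": "example-project"}
)
print(sorted(settings))
```

In a notebook, `getpass.getpass()` is a common alternative for entering the key interactively so that it never appears in a saved cell.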
This critical step configures the three distinct models required for our multimodal self-RAG agent.
This configuration will:
Initialize the granite-3-3-8b-instruct model to function as both the primary generator and the self-critic by producing the reflection tokens (ISREL, ISSUP and ISUSE). For the self-critique loop, the generation parameters are set for factual, deterministic and stable answers.
Initialize the granite-embedding-278m-multilingual model. This model generates the textual embeddings essential for efficient semantic search and retrieval in the FAISS vector store.
Load the granite-vision-3.3-2B model locally by using the transformers library. This model creates text captions for images extracted from PDF documents.
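The configuration described above might be organized as in the following sketch. The model identifiers and every parameter value are assumptions for illustration. The parameter names follow the watsonx text generation naming as understood here, so verify both against your own project before relying on them.

```python
from typing import Dict

# Model identifiers as assumed for this sketch; check the names in your own project.
GENERATOR_MODEL_ID = "ibm/granite-3-3-8b-instruct"
EMBEDDING_MODEL_ID = "ibm/granite-embedding-278m-multilingual"
VISION_MODEL_ID = "ibm-granite/granite-vision-3.3-2b"


def deterministic_params(max_new_tokens: int = 512) -> Dict[str, object]:
    """Generation parameters aimed at stable, repeatable output."""
    return {
        "decoding_method": "greedy",   # always pick the most likely token
        "max_new_tokens": max_new_tokens,
        "min_new_tokens": 1,
        "repetition_penalty": 1.05,    # mild penalty; the value is an assumption
    }


# Longer budget for answer segments, shorter budget for critique tokens.
generator_params = deterministic_params()
critic_params = deterministic_params(max_new_tokens=64)
print(generator_params)
print(critic_params)
```

Greedy decoding removes sampling randomness, which matters here because a critique score that changes between runs would make the correction loop hard to reason about.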
This step focuses on securely retrieving the source dataset from IBM Cloud Object Storage into the memory of your execution environment. This process is necessary before any text splitting or multimodal analysis can begin. For this tutorial, two PDF files have been uploaded to Cloud Object Storage.
This step is crucial for transforming our raw PDF documents into a multimodal, searchable knowledge base for the self-RAG agent.
This parsing will:
• Define a parsing function that uses fitz to pull both text and embedded image bytes from structured documents, a task that simple text readers often fail at.
• Pass the extracted images and a descriptive prompt to the locally loaded Granite vision model, the step that makes the knowledge base multimodal. By converting images into descriptive text captions, we make visual information searchable through the standard text embedding model. This mechanism ensures that the agent is not “blind” to nontextual context, which improves the completeness of the knowledge base.
• Implement caching logic to store the results, preventing the time-consuming and computationally demanding multimodal captioning process from having to be repeated. Storing the processed knowledge base speeds up development and repeated execution.
• Ensure that the final knowledge base gives the self-reflective agent full context that includes both textual and visual data. This is the main objective of the entire process: it gives the later self-reflective retrieval the foundation it needs to be precise and well grounded.
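The caching idea from the list above can be sketched with the standard library alone. The helper `caption_with_cache`, the JSON cache file and the use of a content hash as the key are assumptions made for this illustration. `caption_fn` stands in for the call to the vision model.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


def caption_with_cache(
    image_bytes: bytes,
    caption_fn: Callable[[bytes], str],
    cache_path: Path,
) -> str:
    """Return a cached caption when available; otherwise compute and store it."""
    cache: Dict[str, str] = {}
    if cache_path.exists():
        cache = json.loads(cache_path.read_text(encoding="utf-8"))

    key = hashlib.sha256(image_bytes).hexdigest()  # content hash identifies the image
    if key not in cache:
        cache[key] = caption_fn(image_bytes)       # the expensive vision-model call
        cache_path.write_text(json.dumps(cache), encoding="utf-8")
    return cache[key]


calls: List[bytes] = []


def fake_caption(data: bytes) -> str:
    """Toy captioner that records how often it is invoked."""
    calls.append(data)
    return "a bar chart of inspection counts"


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "captions.json"
    first = caption_with_cache(b"fake image bytes", fake_caption, path)
    second = caption_with_cache(b"fake image bytes", fake_caption, path)

print(first, len(calls))
```

Keying the cache on the image content rather than on a file name means that a renamed or re-extracted image still hits the cache.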
This step completes the preparation of the multimodal knowledge base by indexing all processed document chunks into an efficient, searchable vector store that forms the basis for the agent’s initial retrieval capability.
This configuration plays a key role in preparing the retrieval layer for the self-RAG workflow:
• It builds a high-efficiency vector store by using FAISS, which is well known for its speed and scalability when handling dense vector indexes. This step ensures that similarity searches run quickly, which is critical for maintaining a responsive RAG pipeline.
• It transforms the multimodal knowledge base into vector representations, allowing the retriever to match user queries by meaning rather than relying on exact keyword overlap.
• It tunes context delivery by retrieving the five most relevant documents (k=5), balancing coverage against the size of the model’s context window.
• It establishes a single, consistent knowledge source that the self-RAG agent can depend on for factual grounding, which is an essential element of any trustworthy retrieval-augmented system.
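To make the retrieval step concrete, the sketch below ranks documents by cosine similarity and keeps the top k. It is a pure-Python stand-in for what a vector store does, not FAISS itself. The two-dimensional vectors are toy values standing in for real embeddings.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    docs: Sequence[str],
    k: int = 5,
) -> List[Tuple[float, str]]:
    """Score every document against the query and return the k best matches."""
    scored = [(cosine(query_vec, vec), doc) for vec, doc in zip(doc_vecs, docs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


docs = [
    "text chunk: remote inspections",
    "image caption: bar chart of inspections",
    "text chunk: cafeteria menu",
]
doc_vecs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]   # toy embeddings
results = top_k([1.0, 0.1], doc_vecs, docs, k=2)
for score, doc in results:
    print(f"{score:.3f}  {doc}")
```

Because captions are embedded with the same text model as the document chunks, image-derived entries compete in the same ranking as ordinary text.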
This step sets up the main sections of the self-RAG workflow. The agent state tracks the entire process. The LangGraph node functions manage the flexible, self-correcting logic.
This code serves several purposes:
• The agent keeps a core memory that stores its evolving response, the evidence it has retrieved and internal feedback. This memory helps the agent’s logic to dynamically improve its reasoning by storing context across various steps.
• The agent first determines whether adequate factual grounding is present before producing any segments. If the existing context is deemed incomplete, the agent seeks stronger, more supportive information so that the generated response is accurate and pertinent.
• Alongside each generated segment, the model issues internal reflection tokens that immediately quantify the output’s relevance, factual support and overall quality. These critical signals are then combined into a single critique score, giving the agent an objective, measurable way to judge its own performance.
• Based on the critique score, the agent then decides whether to rework, expand upon or finalize its answer. This iterative process makes the system more resilient, prompting it to correct weak generations and maintain factual precision over multiple reasoning rounds.
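The state, score and decision logic described above can be sketched without any graph library. The field names, thresholds and the node names `retrieve_more` and `generate_segment` are hypothetical choices for this example. Only `finalize_answer` is a name the tutorial itself mentions.

```python
from typing import Dict, List, TypedDict


class AgentState(TypedDict):
    """Memory carried between steps of the workflow."""
    user_query: str
    segments: List[str]
    retrieved_docs: List[str]
    critique_score: float
    retries: int


# Assumed limits for this sketch, not values taken from the tutorial.
MAX_SEGMENTS = 3
MAX_RETRIES = 2
SCORE_THRESHOLD = 0.6


def critique_score(signals: Dict[str, float]) -> float:
    """Average the three reflection signals, each assumed to lie in [0, 1]."""
    return (signals["ISREL"] + signals["ISSUP"] + signals["ISUSE"]) / 3


def route(state: AgentState) -> str:
    """Pick the name of the next node from the current state."""
    if state["critique_score"] < SCORE_THRESHOLD and state["retries"] < MAX_RETRIES:
        return "retrieve_more"        # weak segment: look for stronger context
    if len(state["segments"]) >= MAX_SEGMENTS:
        return "finalize_answer"      # enough validated segments
    return "generate_segment"         # good segment: continue the answer


state: AgentState = {
    "user_query": "What does the guideline cover?",
    "segments": ["First validated segment."],
    "retrieved_docs": ["doc 1"],
    "critique_score": critique_score({"ISREL": 1.0, "ISSUP": 0.5, "ISUSE": 0.75}),
    "retries": 0,
}
print(state["critique_score"], route(state))
```

In LangGraph, a function like `route` is the kind of callable that can be passed to `add_conditional_edges` so that its return value selects the next node. The retry cap keeps a persistently low score from looping forever.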
The entire self-RAG workflow begins with this last step.
The ICH E6(R3) Guideline primarily focuses on good clinical practice for design and conduct of clinical trials on medicinal products. It aims to harmonize these practices across different regions to ensure the protection of human subjects involved in clinical trials and the quality and integrity of the data generated. Regarding remote inspections, the EFPIA 2024 inspection survey reveals that while there is a trend of fewer remote inspections in the EU/EEA post-pandemic, the US shows no clear trend, with a slight decrease. The survey also highlights the potential for minimizing increased efforts through strategies like utilizing local inspectorates as leads, leveraging different time zones for document reviews, and producing one inspection report with agreed observations. However, uncertainty about return on investment and business priorities were cited as reasons for not applying in the 2024 pilot.
The countries with the highest inspection counts per manufacturing site, according to the EFPIA 2024 data, are Germany and Denmark, each with four multiple inspections at their sites. This indicates significant regulatory scrutiny and importance in the pharmaceutical manufacturing sector. Germany stands out with additional sites also facing inspections from Belarus, Türkiye, Russia, and the US-FDA, while Denmark has inspections from Japan, Brazil, US-FDA, Türkiye, Kenya, Chinese Taipei, and the Rep. of Korea. The high number of inspections suggests that these countries play crucial roles in global pharmaceutical manufacturing oversight, likely due to their central positions in the industry and stringent regulatory environments.
Once the agent either reaches the maximum number of segments or completes its multisegment answer, it produces the final output to the user question. The .stream() method is then used to run the compiled graph, represented by the app object.
The initial state, which contains the detailed user_query, is passed in through the inputs dictionary.
As the graph streams, each iteration of the loop processes one node at a time based on the system’s internal logic. Every node’s output is printed as it runs, letting us watch the agent refine its reasoning in real time and build its multipart response, ultimately ending with a well-supported final answer. The final step reruns the full self-RAG workflow to create a refined answer. It executes the LangGraph graph and watches the streaming state updates until the finalize_answer or END node shows up. Once the final state is reached, it pulls the generated segments and joins them into a grounded final answer.
The self-reflective retrieval-augmented generation setup in this tutorial offers major advantages over standard RAG, mainly in terms of reliability and efficiency. Its biggest strength is improved factual accuracy and traceability, made possible by the Granite LLM running its own self-critiques with reflection tokens. These critiques produce a score that guides the workflow and enables adaptive retrieval: the model pulls new context only when a segment isn’t well supported. This approach also makes it easier to work with complex, multimodal documents because image captions can be added to the vector store. The result is a more trustworthy, flexible query agent that segments its answers and checks them against the knowledge base before giving the final result.
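The streaming loop described above can be sketched as follows. `fake_stream` is a stand-in for `app.stream(inputs)` and yields dictionaries keyed by node name, which is the shape assumed here for streamed updates. The `segments` key and the node names other than `finalize_answer` are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List


def collect_final_answer(events: Iterable[Dict[str, dict]]) -> str:
    """Watch streamed node updates and join the segments once the final node runs."""
    segments: List[str] = []
    for event in events:
        for node_name, update in event.items():
            print(f"--- {node_name} ---")
            if "segments" in update:
                segments = update["segments"]   # assumed to hold the full list so far
            if node_name == "finalize_answer":
                return " ".join(segments)
    return " ".join(segments)  # stream ended without an explicit final node


def fake_stream() -> Iterator[Dict[str, dict]]:
    """Stand-in for app.stream(inputs); yields {node_name: state_update} dicts."""
    yield {"retrieve": {"retrieved_docs": ["doc 1"]}}
    yield {"generate_segment": {"segments": ["First point."]}}
    yield {"generate_segment": {"segments": ["First point.", "Second point."]}}
    yield {"finalize_answer": {}}


answer = collect_final_answer(fake_stream())
print(answer)
```

Returning as soon as the final node appears avoids depending on whatever the graph emits afterward. The fallback return covers a stream that ends early.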
