Inference scaling to improve multimodal RAG

Inference scaling in artificial intelligence (AI) refers to techniques that enhance model performance by allocating computational resources during the inference phase (when models generate outputs) rather than relying on larger training datasets or model architectures. As large language models (LLMs) continue to expand in both model parameters and dataset scale, optimizing inference time and managing inference compute scaling—particularly on GPU hardware—have become central challenges for deploying high-performance multimodal retrieval-augmented generation (RAG) systems. 

Introduction to inference scaling

Recent advances in inference strategies, which allocate additional computational resources and employ more sophisticated algorithms at test time, are redefining how LLMs tackle complex reasoning tasks and deliver higher-quality outputs across diverse input modalities. Inference scaling also extends chain of thought (CoT) prompting by expanding reasoning depth, allowing models to produce longer, more detailed chains of thought through iterative prompting or multistep generation. This tutorial shows how inference scaling can be leveraged to improve multimodal RAG, focusing on the interplay between model size, compute budget and the practical optimization of inference time for real-world applications.

Furthermore, scaling laws and benchmark results highlight the tradeoffs among pretraining, fine-tuning, inference-time strategies and advanced algorithms for output selection. Both larger and smaller models benefit from inference scaling, and it enables resource-constrained systems to approach the performance of cutting-edge LLMs. This tutorial demonstrates the impact of these optimization techniques on model performance, offering actionable guidance for balancing accuracy, latency and cost in multimodal RAG deployments.

This tutorial is designed for artificial intelligence developers, researchers and enthusiasts looking to enhance their knowledge of document management and advanced natural language processing (NLP) techniques. You will learn how to harness the power of inference scaling to improve the multimodal RAG pipeline created in a previous recipe. While this tutorial applies these scalability strategies to multimodal RAG with IBM® Granite® large language models, similar principles apply to most popular models, including those from OpenAI (for example, GPT-4, GPT-4o, ChatGPT) and DeepMind.

This tutorial guides you through the following processes:

  • Document preprocessing: You will learn how to handle documents from various sources, parse and transform them into usable formats and store them in vector databases by using Docling. Docling is an IBM open-source toolkit that efficiently parses document formats such as PDF, DOCX, PPTX, XLSX, images, HTML, AsciiDoc and Markdown, and exports the document contents into machine-readable formats like Markdown or JSON. You will use a Granite machine learning (ML) model to generate descriptions of the images in the documents. In this tutorial, Docling downloads the PDF documents and processes them so we can obtain the text and images they contain.

  • Retrieval-augmented generation (RAG): Understand how to connect LLMs such as Granite with external knowledge bases to enhance query responses and generate valuable insights. RAG is a technique that connects an LLM with a knowledge base of information outside the data the LLM was trained on, without the need for fine-tuning. Traditional RAG is limited to text-based use cases such as text summarization and chatbots.

  • Multimodal RAG: Learn how multimodal RAG uses multimodal large language models (MLLMs) to process information from multiple types of data. This data can then be included as part of the external knowledge base used in RAG. Multimodal data can include text, images, audio, video or other forms. In this tutorial, we use IBM’s latest multimodal vision model, Granite 3.2 vision.

  • Implementing demonstration-based RAG (DRAG) and iterative demonstration-based RAG (IterDRAG): Apply the inference scaling techniques from the research paper to significantly improve RAG performance when working with long contexts. The DRAG method leverages in-context learning to improve RAG performance: by including multiple RAG examples as demonstrations, DRAG helps models learn to locate relevant information in long contexts. Unlike standard RAG, which might plateau with more documents, DRAG shows near-linear improvements with increased context length. IterDRAG is an extension of DRAG that addresses complex multihop queries by decomposing them into simpler subqueries. Multihop is a process where a complex query is broken down and answered through simpler subquestions, each of which might require information retrieved or synthesized from different sources. IterDRAG interleaves retrieval and generation steps, creating reasoning chains that bridge compositional gaps. This approach is particularly effective for handling complex queries across long contexts.

  • LangChain for workflow integration: Discover how to use LangChain to streamline and orchestrate document processing and retrieval workflows, enabling seamless interaction between different components of the system.

During this tutorial, you will also use three cutting-edge technologies:

  1. Docling: An open-source toolkit used to parse and convert documents.

  2. Granite: A state-of-the-art family of LLMs that provide robust natural language capabilities and a vision language model that provides image to text generation.

  3. LangChain: A powerful framework used to build applications powered by language models, designed to simplify complex workflows and integrate external tools seamlessly.

By the end of this tutorial, you will accomplish the following:

  • Gain proficiency in document preprocessing, chunking and image understanding.

  • Integrate vector databases to enhance retrieval capabilities.

  • Implement DRAG and IterDRAG to perform efficient and accurate data retrieval with inference scaling.

  • Experience firsthand how scaling inference compute can lead to almost linear improvements in RAG performance.

Understanding long-context challenges

Traditional language models struggle with long contexts for several reasons:

  • The self-attention mechanism in transformers scales quadratically with sequence length, which can incur immense computational cost (see the sketch after this list).

  • Difficulty in locating relevant information in very long sequences. 

  • Challenges in preserving coherence across distant parts of the input. 

  • Increased computational demands for processing long sequences.
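
To make the first challenge concrete, here is a rough back-of-the-envelope sketch (not a benchmark) of how quickly the attention-score matrices grow with sequence length. The 32-head count is an illustrative assumption rather than a property of any particular model.

def attention_score_entries(seq_len: int, num_heads: int = 32) -> int:
    """Entries in the attention-score matrices of a single transformer layer (illustrative head count)."""
    return num_heads * seq_len * seq_len

for seq_len in (2_000, 8_000, 32_000, 128_000):
    print(f"{seq_len:>7} tokens -> {attention_score_entries(seq_len):,} score entries per layer")

Quadrupling the context length multiplies this count by 16, which is why simply stuffing more retrieved documents into a prompt quickly becomes expensive.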

The techniques in this tutorial address these challenges through strategic allocation of inference computation.

Inference scaling methods: DRAG and IterDRAG

Figure: DRAG vs IterDRAG

More on these two advanced inference scaling techniques (DRAG and IterDRAG) can be found in the research paper “Inference Scaling for Long-Context Retrieval Augmented Generation.”

These methods show that scaling inference computation, when optimally allocated, can improve RAG performance almost linearly, allowing RAG systems to make better use of the long-context capabilities of modern LLMs. For this implementation, we'll use an IBM Granite model capable of processing different modalities. You'll create an AI system to answer real-time user queries from unstructured data, applying the principles from the paper.
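
Before building the full pipelines, the following minimal Python sketch contrasts the two control flows at a high level. The retrieve, generate and decompose functions are stand-in stubs for the LangChain chains constructed later in this tutorial, not a real API.

def retrieve(query: str) -> list[str]:
    # Stub standing in for the vector store retriever built later.
    return [f"<document relevant to: {query}>"]

def generate(context: list[str]) -> str:
    # Stub standing in for an LLM call with the given context in its prompt.
    return f"<answer grounded in {len(context)} context items>"

def decompose(query: str) -> list[str]:
    # Stub standing in for the query-decomposition step used by IterDRAG.
    return [f"<sub-question about: {query}>"]

def drag(query: str, demonstrations: list[str]) -> str:
    # DRAG: one retrieval pass and one generation pass, with in-context
    # demonstrations prepended to show the model how to use the documents.
    documents = retrieve(query)
    return generate(demonstrations + documents + [query])

def iter_drag(query: str, demonstrations: list[str]) -> str:
    # IterDRAG: decompose the query, interleave retrieval and generation per
    # sub-question, then synthesize a final answer from the intermediate ones.
    intermediate = [
        generate(demonstrations + retrieve(sub_question) + [sub_question])
        for sub_question in decompose(query)
    ]
    return generate(intermediate + [query])

print(drag("example question", demonstrations=["<worked example>"]))
print(iter_drag("example question", demonstrations=["<worked example>"]))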

Prerequisites

  • Familiarity with Python programming.

  • Basic understanding of LLMs, NLP concepts and computer vision.

Steps

Ensure that you are running Python 3.10, 3.11 or 3.12 in a freshly created virtual environment. Note, you can also access this tutorial on GitHub.

Step 1: Setting up the environment

import sys
assert sys.version_info >= (3, 10) and sys.version_info < (3, 13), "Use Python 3.10, 3.11, or 3.12 to run this notebook."

Step 2: Install dependencies

! pip install "git+https://github.com/ibm-granite-community/utils.git" \
    transformers \
    pillow \
    langchain_community \
    langchain_huggingface \
    langchain_milvus \
    docling \
    replicate

Logging

To see some logging information, we can configure the INFO log level.

NOTE: It is okay to skip running this cell.

import logging

logging.basicConfig(level=logging.INFO)

Step 3: Selecting the AI models

Load the Granite models

Specify the embeddings model to use for generating text embedding vectors. Here we will use one of the Granite Embeddings models.

To use a different embeddings model, replace this code cell with one from this Embeddings Model recipe.

from langchain_huggingface import HuggingFaceEmbeddings
from transformers import AutoTokenizer

embeddings_model_path = "ibm-granite/granite-embedding-30m-english"
embeddings_model = HuggingFaceEmbeddings(
    model_name=embeddings_model_path,
)
embeddings_tokenizer = AutoTokenizer.from_pretrained(embeddings_model_path)

Specify the MLLM to use for image understanding. We will use the Granite vision model.

from ibm_granite_community.notebook_utils import get_env_var
from langchain_community.llms import Replicate
from transformers import AutoProcessor

vision_model_path = "ibm-granite/granite-vision-3.2-2b"
vision_model = Replicate(
    model=vision_model_path,
    replicate_api_token=get_env_var("REPLICATE_API_TOKEN"),
    model_kwargs={
        "max_tokens": embeddings_tokenizer.max_len_single_sentence, # Set the maximum number of tokens to generate as output.
        "min_tokens": 100, # Set the minimum number of tokens to generate as output.
        "temperature": 0.01,
    },
)
vision_processor = AutoProcessor.from_pretrained(vision_model_path)

Specify the language model to use for the RAG generation operation. Here we use the Replicate LangChain client to connect to a Granite model from the ibm-granite org on Replicate.

To get set up with Replicate, see Getting Started with Replicate.

To connect to a model on a provider other than Replicate, substitute this code cell with one from the LLM component recipe.

model_path = "ibm-granite/granite-3.3-8b-instruct"
model = Replicate(
    model=model_path,
    replicate_api_token=get_env_var("REPLICATE_API_TOKEN"),
    model_kwargs={
        "max_tokens": 1000, # Set the maximum number of tokens to generate as output.
        "min_tokens": 100, # Set the minimum number of tokens to generate as output.
        "temperature": 0.01
    },
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

Step 4: Preparing the documents for the vector database with Docling

from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions

pdf_pipeline_options = PdfPipelineOptions(
    do_ocr=False,
    generate_picture_images=True,
)
format_options = {
    InputFormat.PDF: PdfFormatOption(pipeline_options=pdf_pipeline_options),
}
converter = DocumentConverter(format_options=format_options)

sources = [
    "https://midwestfoodbank.org/images/AR_2020_WEB2.pdf",
]
conversions = { source: converter.convert(source=source).document for source in sources }

With the documents processed, we then further process the text elements in the documents and chunk them into appropriate sizes for the embeddings model we are using. A list of LangChain documents is created from the text chunks.

from docling_core.transforms.chunker.hybrid_chunker import HybridChunker
from docling_core.types.doc import DocItem, TableItem
from langchain_core.documents import Document

doc_id = 0
texts: list[Document] = []
for source, docling_document in conversions.items():
    for chunk in HybridChunker(tokenizer=embeddings_tokenizer).chunk(docling_document):
        items: list[DocItem] = chunk.meta.doc_items # type: ignore
        if len(items) == 1 and isinstance(items[0], TableItem):
            continue # we will process tables later
        refs = " ".join(map(lambda item: item.get_ref().cref, items))
        print(refs)
        text = chunk.text
        document = Document(
            page_content=text,
            metadata={
                "doc_id": (doc_id:=doc_id+1),
                "source": source,
                "ref": refs,
            },
        )
        texts.append(document)

print(f"{len(texts)} text document chunks created")

Next, we process any tables in the documents. We convert the table data to Markdown format so the language model can process it. A list of LangChain documents is created from the tables' Markdown renderings.

from docling_core.types.doc import DocItemLabel

doc_id = len(texts)
tables: list[Document] = []
for source, docling_document in conversions.items():
    for table in docling_document.tables:
        if table.label in [DocItemLabel.TABLE]:
            ref = table.get_ref().cref
            print(ref)
            text = table.export_to_markdown(docling_document)
            document = Document(
                page_content=text,
                metadata={
                    "doc_id": (doc_id:=doc_id+1),
                    "source": source,
                    "ref": ref
                },
            )
            tables.append(document)


print(f"{len(tables)} table documents created")

Finally, we process any images in the documents. Here we use the vision language model to understand the images’ content. In this example, we ask for a general description of each image, which captures its visual elements and any textual information it contains.

Choosing an appropriate image prompt is critical as it directs what aspects of the image the model will focus on. For example:

  • A prompt like “Give a detailed description of what is depicted in the image” (used below) will provide general information about all visual elements.

  • A prompt like “What text appears in this image?” would focus specifically on extracting textual content.

  • A prompt like “Describe the graphical data visualization in this image” would be better for charts and graphs.

  • You should experiment with different prompts based on the types of images in your documents and the information that you need to extract from them.

NOTE: Image processing might require significant processing time based on the number of images and the service running the vision language model.

import base64
import io
import PIL.Image
import PIL.ImageOps

def encode_image(image: PIL.Image.Image, format: str = "png") -> str:
    image = PIL.ImageOps.exif_transpose(image) or image
    image = image.convert("RGB")

    buffer = io.BytesIO()
    image.save(buffer, format)
    encoding = base64.b64encode(buffer.getvalue()).decode("utf-8")
    uri = f"data:image/{format};base64,{encoding}"
    return uri

# Feel free to experiment with this prompt
image_prompt = "Give a detailed description of what is depicted in the image"
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": image_prompt},
        ],
    },
]
vision_prompt = vision_processor.apply_chat_template(
    conversation=conversation,
    add_generation_prompt=True,
)
pictures: list[Document] = []
doc_id = len(texts) + len(tables)
for source, docling_document in conversions.items():
    for picture in docling_document.pictures:
        ref = picture.get_ref().cref
        print(ref)
        image = picture.get_image(docling_document)
        if image:
            text = vision_model.invoke(vision_prompt, image=encode_image(image))
            document = Document(
                page_content=text,
                metadata={
                    "doc_id": (doc_id:=doc_id+1),
                    "source": source,
                    "ref": ref,
                },
            )
            pictures.append(document)

print(f"{len(pictures)} image descriptions created")

We can then display the LangChain documents created from the input documents.

import itertools
from docling_core.types.doc import RefItem
from IPython.display import display

# Print all created documents
for document in itertools.chain(texts, tables):
    print(f"Document ID: {document.metadata['doc_id']}")
    print(f"Source: {document.metadata['source']}")
    print(f"Content:\n{document.page_content}")
    print("=" * 80)  # Separator for clarity

for document in pictures:
    print(f"Document ID: {document.metadata['doc_id']}")
    source = document.metadata['source']
    print(f"Source: {source}")
    print(f"Content:\n{document.page_content}")
    docling_document = conversions[source]
    ref = document.metadata['ref']
    picture = RefItem(cref=ref).resolve(docling_document)
    image = picture.get_image(docling_document)
    print("Image:")
    display(image)
    print("=" * 80)  # Separator for clarity

Populate the vector database

Using the embeddings model, we load the documents created from the text chunks, table renderings and generated image descriptions into a vector database. Creating this vector database allows us to easily conduct a semantic similarity search across our documents.

NOTE: Population of the vector database might require significant processing time depending on your embedding model and service.

Choose your vector database

Specify the database to use for storing and retrieving embedding vectors. For the purposes of this tutorial, we will be using Milvus via LangChain. As a vector database, Milvus stores, indexes and manages numerical embeddings generated by neural networks and various ML algorithms.

To connect to a vector database other than Milvus, replace this code cell with one from this Vector Store recipe.

import tempfile
from langchain_core.vectorstores import VectorStore, VectorStoreRetriever
from langchain_milvus import Milvus

db_file = tempfile.NamedTemporaryFile(prefix="vectorstore_", suffix=".db", delete=False).name
print(f"The vector database will be saved to {db_file}")

vector_db: VectorStore = Milvus(
    embedding_function=embeddings_model,
    connection_args={"uri": db_file},
    auto_id=True,
    enable_dynamic_field=True,
    index_params={"index_type": "AUTOINDEX"},
)

Now, we add all the LangChain documents for the text, tables and image descriptions to the vector database.

import itertools

documents = list(itertools.chain(texts, tables, pictures))
ids = vector_db.add_documents(documents)
print(f"{len(ids)} documents added to the vector database")
retriever: VectorStoreRetriever = vector_db.as_retriever(search_kwargs={"k": 10})

Step 5: RAG with Granite

Now that we have successfully converted our documents and vectorized them, we can set up our RAG pipeline.

Validate retrieval quality

Here we test the vector database by searching for chunks with information relevant to our query in the vector space. We display the retrieved documents, which can include text chunks, table renderings and image descriptions.

This validation step is important to help ensure that our retrieval system is working correctly before we build our full RAG pipeline. We want to see if the returned documents are relevant to our query.

Feel free to try different queries.

query = "Analyze how Midwest Food Bank's financial efficiency changed during the pandemic by comparing their 2019 and 2020 performance metrics. What specific pandemic adaptations had the greatest impact on their operational capacity, and how did their volunteer management strategy evolve to maintain service levels despite COVID-19 restrictions? Provide specific statistics from the report to support your analysis."
for doc in vector_db.as_retriever().invoke(query):
    print(doc)
    print("=" * 80)  # Separator for clarity

The returned documents should be responsive to the query. Let's go ahead and construct our RAG pipeline.

Create the RAG pipeline for Granite

First, we create the prompts for Granite to perform the RAG query. We use the Granite chat template and supply the placeholder values that the LangChain RAG pipeline will replace.

{context} will hold the retrieved chunks, like those shown in the previous search, which the pipeline feeds to the model as document context for answering our question.

Then, we construct the RAG pipeline by using the Granite prompt templates we created.

from ibm_granite_community.notebook_utils import escape_f_string
from langchain.prompts import PromptTemplate
from langchain.chains.retrieval import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

# Create a Granite prompt for question-answering with the retrieved context
prompt = tokenizer.apply_chat_template(
    conversation=[{
        "role": "user",
        "content": "{input}",
    }],
    documents=[{
        "doc_id": "0",
        "text": "{context}",
    }],
    add_generation_prompt=True,
    tokenize=False,
)
prompt_template = PromptTemplate.from_template(template=escape_f_string(prompt, "input", "context"))

# Create a Granite document prompt template to wrap each retrieved document
document_prompt_template = PromptTemplate.from_template(template="""\
<|end_of_text|>
<|start_of_role|>document {{"document_id": "{doc_id}"}}<|end_of_role|>
{page_content}""")
document_separator=""

# Assemble the retrieval-augmented generation chain
combine_docs_chain = create_stuff_documents_chain(
    llm=model,
    prompt=prompt_template,
    document_prompt=document_prompt_template,
    document_separator=document_separator,
)
rag_chain = create_retrieval_chain(
    retriever=retriever,
    combine_docs_chain=combine_docs_chain,
)

Generate a retrieval-augmented response to a question

The pipeline uses the query to locate documents in the vector database and then uses them as context when answering the query.

outputs = rag_chain.invoke({"input": query})
print(outputs['answer'])

Standard RAG limitations and why we need inference scaling

While the standard RAG approach works reasonably well, it has several key limitations when dealing with long or complex content:

  1. Context management: When dealing with many documents, standard RAG struggles to effectively utilize all the available context.

  2. Retrieval quality: Without guidance on how to use retrieved information, models often focus on the wrong parts of documents.

  3. Compositional reasoning: Complex queries that require multistep reasoning are challenging for standard RAG.

  4. Performance plateaus: Adding more documents to standard RAG often results in diminishing returns after a certain threshold.

Inference scaling techniques address these limitations by strategically allocating more computation at inference time.
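
One practical way to reason about this allocation is to track how the context window is split between in-context demonstrations, retrieved documents and room left for generation. The helper below is a minimal sketch of that accounting; the 8,000-token budget is an arbitrary illustrative value, and the tokenizer argument is assumed to be the Granite tokenizer loaded in Step 3.

def estimate_budget_split(tokenizer,
                          demonstration_texts: list[str],
                          document_texts: list[str],
                          budget_tokens: int = 8_000) -> dict[str, int]:
    """Sketch of context-budget accounting; budget_tokens is an illustrative assumption."""
    demo_tokens = sum(len(tokenizer.encode(text)) for text in demonstration_texts)
    doc_tokens = sum(len(tokenizer.encode(text)) for text in document_texts)
    return {
        "demonstration_tokens": demo_tokens,
        "document_tokens": doc_tokens,
        "remaining_for_generation": budget_tokens - demo_tokens - doc_tokens,
    }

# Example usage with the tokenizer from Step 3 (actual values depend on your documents):
# estimate_budget_split(tokenizer, ["<demonstration text>"], ["<retrieved chunk>"])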

Enhanced RAG with DRAG (demonstration-based RAG)

Now we'll implement the DRAG technique from the research paper "Inference Scaling for Long-Context Retrieval Augmented Generation" to enhance our RAG system.

DRAG uses in-context examples to demonstrate to the model how to extract and use information from documents, improving performance for long-context scenarios.

Step 1: Create sample in-context demonstrations

These would typically come from a curated dataset of high-quality QA pairs. For this purpose, we'll create some synthetic examples that match the expected domain.

Here, we define a data class to represent an individual demonstration and then create some demonstrations.

from dataclasses import dataclass, field, InitVar
from langchain_core.documents import Document

@dataclass
class DRAG_Demonstration:
    query: str
    answer: str
    retriever: InitVar[VectorStoreRetriever] = field(kw_only=True)
    documents: list[Document] = field(default_factory=list, kw_only=True)

    def __post_init__(self, retriever: VectorStoreRetriever):
        if not self.documents:
            self.documents = retriever.invoke(self.query)

    def __format__(self, format_spec: str) -> str:
        formatted_documents = "\n".join(
            f"Document {i+1}:\n{document.page_content}"
            for i, document in enumerate(self.documents)
        )
        return f"""\
{formatted_documents}
Question: {self.query}
Answer: {self.answer}
"""

def create_enhanced_drag_demonstrations(vector_db: VectorStore) -> list[DRAG_Demonstration]:
    """Create high-quality demonstrations for DRAG technique that showcase effective document analysis"""
    demonstration_retriever: VectorStoreRetriever = vector_db.as_retriever(search_kwargs={"k": 5})
    demonstrations = [
        DRAG_Demonstration(
            query="How did the COVID-19 pandemic impact Midwest Food Bank's operations in 2020?",
            answer="The COVID-19 pandemic significantly impacted Midwest Food Bank's operations in 2020. Despite challenges, MFB remained open and responsive to increased needs. They implemented safety protocols, reduced volunteer numbers for social distancing, and altered their distribution model to allow partner agencies to receive food safely. The pandemic created unprecedented food insecurity, with many people seeking assistance for the first time. MFB distributed 37% more food than in 2019, with a record 179 semi-loads of Disaster Relief family food boxes sent nationwide. The organization also faced supply chain disruptions and food procurement challenges in the early months but continued to find and distribute food. Community, business, and donor support helped fund operations and food purchases. Additionally, MFB began participating in the USDA Farmers to Families Food Box program in May 2020, distributing over $52 million worth of nutritious produce, protein, and dairy products.",
            retriever=demonstration_retriever
        ),
        DRAG_Demonstration(
            query="What role did volunteers play at Midwest Food Bank during 2020, and how were they affected by the pandemic?",
            answer="Volunteers were described as 'the life-blood of the organization' in the 2020 annual report. Despite the pandemic creating safety challenges, volunteers demonstrated courage and dedication by increasing their hours to meet growing needs. MFB implemented safety protocols at each location and limited volunteer group sizes to allow for social distancing. This created a challenge as food needs increased while fewer volunteers were available to help. To address this gap, multiple MFB locations received assistance from the National Guard, who filled vital volunteer positions driving trucks, operating forklifts, and helping with food distributions. In 2020, 17,930 individuals volunteered 300,898 hours of service, equivalent to 150 full-time employees. The volunteer-to-staff ratio was remarkable with 450 volunteers for every 1 paid MFB staff member, highlighting the volunteer-driven nature of the organization during the crisis.",
            retriever=demonstration_retriever
        ),
        DRAG_Demonstration(
            query="How did Midwest Food Bank's international programs perform during 2020, particularly in Haiti and East Africa?",
            answer="In 2020, Midwest Food Bank's international operations in East Africa and Haiti faced unique challenges but continued to serve communities. In East Africa (operated as Kapu Africa), strict lockdowns led to mass hunger, especially in slum areas. Kapu Africa distributed 7.2 million Tender Mercies meals, working with partner ministries to share food in food-insecure slums. A notable outcome was a spiritual awakening among recipients, with many asking why they were receiving help. In Haiti, the pandemic added to existing challenges, closing airports, seaports, factories, and schools. MFB Haiti more than doubled its food shipments to Haiti, delivering over 160 tons of food relief, nearly three-quarters being Tender Mercies meals. As Haitian children primarily receive nourishment from school lunches, MFB Haiti distributed Tender Mercies through faith-based schools and also partnered with over 20 feeding centers serving approximately 1,100 children daily. Nearly 1 million Tender Mercies meals were distributed in Haiti during 2020.",
            retriever=demonstration_retriever
        ),
    ]

    return demonstrations

Step 2: Format the demonstrations for inclusion in the prompt

We then format all the demonstrations together for the prompt.

# Format all demonstrations together
demonstrations = create_enhanced_drag_demonstrations(vector_db)

formatted_demonstrations = "\n\n".join(
    f"Example {i+1}:\n{demo}"
    for i, demo in enumerate(demonstrations)
)

Step 3: Create the DRAG prompt template

Then we create the DRAG prompt for the model which includes the formatted demonstration examples.

drag_prompt = tokenizer.apply_chat_template(
    conversation=[{
        "role": "user",
        "content": f"""\
Here are examples of effectively extracting information from documents to answer questions.

{formatted_demonstrations}

Follow these examples when answering the user's question:

{{input}}""",
    }],
    documents=[{
        "doc_id": "0",
        "text": "Placeholder{context}",
    }],
    add_generation_prompt=True,
    tokenize=False,
)

# Convert to prompt template
drag_prompt_template = PromptTemplate.from_template(template=escape_f_string(drag_prompt, "input", "context"))

Step 4: Create a custom retriever that reorders documents

Normally the retriever returns the documents in similarity order, with the most similar document first. We define a reordering retriever that reverses the order of the results so the most similar document appears last, closer to the end of the prompt and therefore closer to the query.

import typing
from langchain_core.retrievers import BaseRetriever, RetrieverInput, RetrieverOutput
from langchain_core.callbacks.manager import CallbackManagerForRetrieverRun

class ReorderingRetriever(BaseRetriever):
    base_retriever: BaseRetriever

    def _get_relevant_documents(
        self, query: RetrieverInput, *, run_manager: CallbackManagerForRetrieverRun, **kwargs: typing.Any
    ) -> RetrieverOutput:
        docs = self.base_retriever._get_relevant_documents(query, run_manager=run_manager, **kwargs)
        return list(reversed(docs))  # Reverse the order so higher-ranked docs are closer to query in prompt

reordering_retriever = ReorderingRetriever(base_retriever=retriever)

Step 5: Create DRAG pipeline

We create the pipeline for the DRAG query by using the DRAG prompt template and the reordering retriever.

drag_combine_docs_chain = create_stuff_documents_chain(
    llm=model,
    prompt=drag_prompt_template,
    document_prompt=document_prompt_template,
    document_separator=document_separator,
)

drag_chain = create_retrieval_chain(
    retriever=reordering_retriever,
    combine_docs_chain=drag_combine_docs_chain,
)

Step 6: Generate a DRAG-enhanced response to a question

drag_outputs = drag_chain.invoke({"input": query})
print("\n=== DRAG-Enhanced Answer ===")
print(drag_outputs['answer'])

Great, it looks like providing a few examples improved the answer. Let's try an even more thorough RAG technique next!

Implementing IterDRAG (iterative demonstration-based RAG)

IterDRAG extends DRAG by decomposing complex queries into simpler subqueries and performing interleaved retrieval. This approach is particularly effective for complex multihop questions that require integrating information from multiple sources or reasoning across several steps.
 
Key benefits of the iterative approach:

  • Breaks down complex questions into manageable pieces.

  • Retrieves more relevant information for each subquestion.

  • Creates explicit reasoning chains.

  • Enables tackling questions that would be challenging in a single step.

Step 1: Create a query decomposition chain

The decomposition step is critical because it takes a complex query and breaks it into simpler, more focused subqueries that can be answered individually.

decompose_prompt = tokenizer.apply_chat_template(
    conversation=[{
        "role": "user",
        "content": """\
You are a helpful assistant that breaks down complex questions into simpler sub-questions.
For multi-part or complex questions, generate 1-3 sub-questions that would help answer the main question.

Here are examples of how to decompose complex questions:
{demonstrations}

Follow the above examples when breaking down the user's question.
If the following question is already simple enough, just respond with "No follow-up needed."

Otherwise, break down the following question into simpler sub-questions. Format your response as:
Follow up: [sub-question]

Question: {input}"""
    }],
    add_generation_prompt=True,
    tokenize=False,
)

decompose_prompt_template = PromptTemplate.from_template(template=escape_f_string(decompose_prompt, "input", "demonstrations"))
decompose_chain = decompose_prompt_template | model

Step 2: Create a subquery answering chain

The subquery answering component handles each individual subquestion by retrieving relevant documents and generating focused intermediate answers.

intermediate_prompt = tokenizer.apply_chat_template(
    conversation=[{
        "role": "user",
        "content": """\
You are a helpful assistant that answers specific questions based on the provided documents.

Focus only on the sub-question and provide a concise intermediate answer.
Please answer the following sub-question based on the provided documents.
Format your response as:
Intermediate answer: [your concise answer to the sub-question]

Sub-question: {input}
"""
    }],
    documents=[{
        "doc_id": "0",
        "text": "Placeholder{context}",
    }],
    add_generation_prompt=True,
    tokenize=False,
)

intermediate_prompt_template = PromptTemplate.from_template(template=escape_f_string(intermediate_prompt, "input", "context"))
intermediate_combine_docs_chain = create_stuff_documents_chain(
    llm=model,
    prompt=intermediate_prompt_template,
    document_prompt=document_prompt_template,
    document_separator=document_separator,
)
intermediate_chain = create_retrieval_chain(
    retriever=reordering_retriever,
    combine_docs_chain=intermediate_combine_docs_chain,
)

Step 3: Create a final answer generation chain

The final answer generation component combines all the intermediate answers to produce a comprehensive response to the original question.

final_prompt = tokenizer.apply_chat_template(
    conversation=[{
        "role": "user",
        "content": """\
You are a helpful assistant that provides comprehensive answers to questions.
Use the intermediate answers to sub-questions to formulate a complete final answer.
Please provide a final answer to the main question based on the intermediate answers to sub-questions.
Format your response as:
So the final answer is: [your comprehensive answer to the main question]

Main question: {input}

Sub-questions and intermediate answers:
{context}"""
    }],
    add_generation_prompt=True,
    tokenize=False,
)

final_prompt_template = PromptTemplate.from_template(template=escape_f_string(final_prompt, "input", "context"))
final_chain = final_prompt_template | model

Step 4: Create example demonstrations for IterDRAG

Creating effective demonstrations is crucial for IterDRAG’s performance. These examples show the model how to:

  1. Break down complex questions into simpler subquestions.

  2. Generate relevant intermediate answers.

  3. Combine these answers into a coherent final response.

@dataclass
class IterDRAG_Demonstration_Base:
    query: str
    answer: str

@dataclass
class IterDRAG_Demonstration(IterDRAG_Demonstration_Base):
    intermediate: list[IterDRAG_Demonstration_Base]

    def __format__(self, format_spec: str) -> str:
        sub_questions="\n".join(
            f"Follow up: {sub.query}"
            for sub in self.intermediate
        )

        return f"Question: {self.query}\n{sub_questions}"

def create_iterdrag_demonstrations() -> list[IterDRAG_Demonstration]:
    """Create examples showing how to decompose and answer complex questions"""

    demonstrations = [
        IterDRAG_Demonstration(
            query="What impact did the pandemic have on the food bank's operations and distribution?",
            answer="The pandemic had a profound impact on food bank operations and distribution. Distribution volume increased by 60% to over 100 million pounds of food in 2020. Operationally, the food bank faced supply chain disruptions, volunteer shortages, and safety protocol challenges. In response, they implemented contactless distribution, expanded mobile pantries, created emergency food boxes for vulnerable populations, and developed virtual nutrition education. Despite these challenges, they successfully scaled operations to meet the unprecedented community need during the crisis.",
            intermediate=[
                IterDRAG_Demonstration_Base(
                    query="How did food distribution volume change during the pandemic?",
                    answer="Food distribution volume increased by 60% during the pandemic, rising from approximately 62 million pounds in 2019 to over 100 million pounds in 2020.",
                ),
                IterDRAG_Demonstration_Base(
                    query="What operational challenges did the food bank face during the pandemic?",
                    answer="The food bank faced challenges including supply chain disruptions, volunteer shortages due to social distancing requirements, and the need to implement new safety protocols for food handling and distribution.",
                ),
                IterDRAG_Demonstration_Base(
                    query="What new programs were implemented in response to the pandemic?",
                    answer="New programs included contactless distribution methods, expanded mobile pantry operations, emergency food boxes for vulnerable populations, and virtual nutrition education classes.",
                ),
            ],
        ),
        IterDRAG_Demonstration(
            query="How does the food bank's financial management compare to industry standards for non-profits?",
            answer="The food bank demonstrates excellent financial management compared to industry standards. With 94% of its budget allocated to program services and only 6% to administrative and fundraising costs, it exceeds the industry benchmark of 85-90% for program spending. This financial efficiency places the food bank among the top-performing non-profits in terms of maximizing donor impact and minimizing overhead expenses.",
            intermediate=[
                IterDRAG_Demonstration_Base(
                    query="What percentage of the food bank's budget goes to program services versus administrative costs?",
                    answer="94% of the food bank's budget goes directly to program services, with only 6% allocated to administrative and fundraising costs.",
                ),
                IterDRAG_Demonstration_Base(
                    query="What are the industry standards for program spending versus overhead for food banks?",
                    answer="Industry standards suggest that well-run food banks typically allocate 85-90% of their budget to program services, with 10-15% for administrative and fundraising expenses.",
                ),
            ],
        ),
    ]
    return demonstrations

Step 5: Implement the IterDRAG function

This function orchestrates the entire iterative process:

  1. Decompose the main question into subquestions.

  2. For each subquestion, retrieve relevant documents and generate an intermediate answer.

  3. Combine all intermediate answers to produce the final response.

import re

def iterative_drag(main_question: str) -> dict[str, typing.Any]:
    """
    Implements IterDRAG: decomposing queries, retrieving documents for sub-queries,
    and generating a final answer based on intermediate answers.
    """
    print(f"\n=== Processing query with IterDRAG: '{main_question}' ===")

    # Step 1: Decompose the main question into sub-questions
    print("Step 1: Decomposing the query into sub-questions...")
    iterdrag_demonstrations = create_iterdrag_demonstrations()
    formatted_demonstrations = "\n\n".join(
        f"Example {i+1}:\n{demo}"
        for i, demo in enumerate(iterdrag_demonstrations)
    )
    decompose_result = decompose_chain.invoke({
        "input": main_question,
        "demonstrations": formatted_demonstrations,
    })
    decompose_answer = decompose_result

    # Extract sub-questions using regex
    sub_questions = re.findall(r"Follow up: (.*?)(?=Follow up:|\n|$)", decompose_answer, re.DOTALL)
    sub_questions = [sq.strip() for sq in sub_questions if sq.strip()]
    if not sub_questions:
        print("No decomposition needed or found. Using standard DRAG approach.")
        return drag_chain.invoke({"input": main_question})
    print(f"Decomposed into {len(sub_questions)} sub-questions")

    # Step 2: Answer each sub-question
    intermediate_pairs: list[dict[str, str]] = []
    for i, sub_question in enumerate(sub_questions):
        print(f"\nStep 2.{i+1}: Processing sub-question: '{sub_question}'")

        # Generate answer for this sub-question
        intermediate_result = intermediate_chain.invoke({"input": sub_question})
        intermediate_answer = intermediate_result["answer"]

        # Extract intermediate answer using regex
        intermediate_answer_match = re.search(r"Intermediate answer: (.*?)$", intermediate_answer, re.DOTALL)
        if intermediate_answer_match:
            intermediate_answer = intermediate_answer_match.group(1).strip()

        print(f"Generated intermediate answer: {intermediate_answer[:100]}...")

        # Store the sub-question and its answer
        intermediate_pairs.append({"input": sub_question, "answer": intermediate_answer})

    # Step 3: Generate the final answer based on sub-question answers
    print("\nStep 3: Generating final answer based on intermediate answers...")
    final_result = final_chain.invoke({
        "input": main_question,
        "context": "\n\n".join(
            f"Sub-question: {pair['input']}\nIntermediate answer: {pair['answer']}"
            for pair in intermediate_pairs
        ),
    })
    final_answer = final_result

    # Extract final answer
    final_answer_match = re.search(r"So the final answer is: (.*?)$", final_answer, re.DOTALL)
    if final_answer_match:
        final_answer = final_answer_match.group(1).strip()

    return {"input": main_question, "answer": final_answer, "intermediate": intermediate_pairs}

Comparing RAG approaches

Now that we have all three RAG approaches set up, let's compare their responses to the same query, this time a much more complex one, so we can see the differences.

The comparison will help us understand the benefits of each approach and when each might be most appropriate to use.

# Run all approaches on the same complex query
comparison_query = "What was the full impact chain of the National Guard's assistance during the pandemic? Specifically, how did their involvement affect volunteer operations, what specific tasks did they perform, and how did this ultimately translate to community impact in terms of food distribution capabilities and reach?"

print("\n=== Standard RAG ===")
standard_result = rag_chain.invoke({"input": comparison_query})
print(standard_result["answer"])

print("\n=== DRAG ===")
drag_result = drag_chain.invoke({"input": comparison_query})
print(drag_result["answer"])

print("\n=== IterDRAG ===")
iterdrag_result = iterative_drag(comparison_query)
print(iterdrag_result["answer"])

Results comparison and analysis

Here we summarize the performance differences between the three RAG approaches implemented:

| Approach | Strengths | Limitations | Best use cases |
| --- | --- | --- | --- |
| Standard RAG | Simple implementation; good for straightforward queries; lower computational requirements | Limited context utilization; performance plateaus with more documents; poor at complex reasoning | Simple factual queries; when computation is limited; when context is small |
| DRAG | Better context utilization; improved performance with more documents; good for moderately complex queries | Still limited by one-step generation; less effective for multihop questions | Moderately complex queries; when more documents are available; when in-context examples can be provided |
| IterDRAG | Best for complex queries; explicit reasoning chains; most effective use of context | Highest computational requirements; more complex implementation | Multihop questions; complex analyses requiring compositional reasoning; when maximum performance is needed |

As we've seen in our implementation, inference scaling techniques like DRAG and IterDRAG can significantly improve RAG performance, especially for complex queries requiring deep analysis of multiple documents.

Conclusion

In this tutorial, we've explored how inference scaling can dramatically improve RAG performance. By strategically allocating additional computation at inference time through techniques like DRAG and IterDRAG, we can achieve substantial gains in response quality for complex queries.

Challenges with traditional RAG and transformer-based models

Expensive inference: Transformer-based models, which use self-attention mechanisms, have inference costs that scale quadratically with input length. This quadratic scaling makes handling long contexts computationally expensive, limiting the practical application of RAG to shorter documents or requiring aggressive truncation.

Limited context utilization: Standard RAG systems often retrieve and process a fixed number of documents that can be insufficient for complex, multihop queries. Performance plateaus as context length increases, especially beyond 128,000 tokens, because the model struggles to synthesize information across many retrieved passages.

Inefficient computation allocation: Without careful allocation, adding more retrieved documents or context simply increases computational cost without proportional gains in accuracy, leading to diminishing returns or even degraded performance due to information overload.

How DRAG and IterDRAG address these challenges

Demonstration-based RAG (DRAG):

DRAG leverages multiple retrieved examples, questions and answers as demonstrations within the prompt, enabling the model to learn in-context how to locate and apply relevant information.

This approach is particularly effective for shorter effective context lengths as it allows the model to utilize rich context without overwhelming the attention mechanism, improving both retrieval and generation quality.

Iterative demonstration-based RAG (IterDRAG):

IterDRAG decomposes complex queries into simpler subqueries, iteratively retrieving and generating answers for each substep.

By interleaving retrieval and generation, IterDRAG builds reasoning chains that bridge the gap for multihop queries, making it especially effective for exceptionally long contexts.

This process allows the model to allocate computation more efficiently, focusing on the most relevant information at each step and avoiding the risk of long-context attention overload. By applying these inference scaling techniques to your RAG applications, you can achieve significantly better performance on knowledge-intensive tasks without changing your underlying models.

Next steps:

  • Experiment with different retrieval models and document preprocessing approaches.

  • Try different prompt formulations for image understanding.

  • Explore model parameter optimization to find the ideal settings for your specific use case.
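
As a starting point for that experimentation, the sketch below sweeps the retriever's k value while reusing vector_db, combine_docs_chain and query from the earlier steps; the value grid is arbitrary, and you could similarly vary temperature or max_tokens when creating the Replicate model.

# Example sweep over the number of retrieved chunks (arbitrary values).
for k in (5, 10, 20):
    sweep_chain = create_retrieval_chain(
        retriever=vector_db.as_retriever(search_kwargs={"k": k}),
        combine_docs_chain=combine_docs_chain,
    )
    answer = sweep_chain.invoke({"input": query})["answer"]
    print(f"k={k}: {answer[:200]}...")
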
Footnotes

1. Zixuan Ke, Fangkai Jiao, Yifei Ming, Xuan-Phi Nguyen, Austin Xu, Do Xuan Long, Minzhi Li, et al., “A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems,” arXiv.org, 2025.

2. Luis Lastras, “Reasoning in Granite 3.2 Using Inference Scaling,” IBM Research, February 26, 2025.

3. Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuanhui Wang, and Michael Bendersky, “Inference Scaling for Long-Context Retrieval Augmented Generation,” arXiv.org, 2024.