Inference scaling in artificial intelligence (AI) refers to improving model performance by allocating computational resources at inference time (when the model produces output) rather than relying on larger training datasets or model architectures. As large language models (LLMs) continue to scale in parameter count and dataset size, optimizing inference time and managing inference compute scaling (especially on GPU hardware) has become a central challenge in deploying high-performance multimodal retrieval augmented generation (RAG) systems.
Recent advances in inference strategies, which add compute and apply more sophisticated algorithms at inference time, are redefining how LLMs handle complex reasoning tasks and deliver higher-quality output across multiple input modalities. Inference scaling deepens chain-of-thought (CoT) reasoning: it lets a model produce longer, more detailed reasoning chains through iterative prompting or multi-step generation. Inference scaling can also be used to improve multimodal RAG, with a focus on the interplay between model size, compute budget, and inference-time optimization in real-world applications.
In addition, scaling laws and benchmark results highlight the trade-offs among pretraining, fine-tuning, inference-time strategies, and advanced algorithms for output selection. Both large and small models can benefit from inference scaling, because it also lets resource-constrained systems approach the performance of state-of-the-art LLMs. This tutorial demonstrates the impact of these optimization techniques on model performance and provides actionable guidance for balancing accuracy, latency, and cost in multimodal RAG deployments.
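Before turning to RAG, the following sketch illustrates the two patterns behind most inference scaling: sampling several candidate answers and keeping the best one, and iteratively extending a chain of thought. It is a minimal illustration only; call_llm and score are hypothetical placeholders for any text-generation client and any answer-scoring heuristic, not components of this tutorial's pipeline.
from typing import Callable
# Illustrative only: two simple ways to spend extra compute at inference time.
# `call_llm` stands in for any text-generation client; `score` for any heuristic
# that rates a candidate answer. Both are hypothetical placeholders.
def best_of_n(call_llm: Callable[[str], str], score: Callable[[str], float], prompt: str, n: int = 4) -> str:
    """Sample n candidate answers and keep the highest-scoring one."""
    candidates = [call_llm(prompt) for _ in range(n)]
    return max(candidates, key=score)
def deepen_chain_of_thought(call_llm: Callable[[str], str], question: str, steps: int = 3) -> str:
    """Iteratively extend the reasoning trace, one step per model call."""
    reasoning = ""
    for _ in range(steps):
        reasoning = call_llm(
            f"Question: {question}\n"
            f"Reasoning so far: {reasoning or '(none)'}\n"
            "Extend the reasoning by one step, then state the current best answer."
        )
    return reasoning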
This tutorial is designed for AI developers, researchers, and enthusiasts looking to strengthen their knowledge of document management and advanced natural language processing (NLP) techniques. You will learn how to use inference scaling to improve the multimodal RAG pipeline created in the earlier code templates. While this tutorial focuses on scalability strategies for multimodal RAG, with particular attention to the IBM® Granite large language models, similar principles apply to most popular models, including those from OpenAI (such as GPT-4, GPT-4o, and ChatGPT) and DeepMind.
This tutorial guides you through the complete workflow. By the time you finish, you will have built a standard multimodal RAG pipeline, enhanced it with two inference scaling techniques (DRAG and IterDRAG), and compared the results.
Traditional language models run into several problems when handling long contexts: high inference cost, limited context utilization, and inefficient allocation of compute. Each of these is discussed in more detail in the summary at the end of this tutorial.
The techniques in this tutorial address these challenges through the strategic allocation of inference compute.
For more information on the two advanced inference scaling techniques used here (DRAG and IterDRAG), see the research paper "Inference Scaling for Long-Context Retrieval Augmented Generation."
These methods show that, with optimal allocation, scaling inference compute can improve RAG performance almost linearly, allowing RAG systems to make better use of the long-context capabilities of modern LLMs. In this implementation, we use IBM® Granite models capable of handling different modalities. You will create an AI system that applies the principles from the paper to answer users' real-time queries over unstructured data.
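As a rough way to reason about the paper's budget framing, the sketch below estimates an inference compute budget as the total number of input tokens consumed across all generation calls, in the spirit of its effective context length. The helper function and all token counts are illustrative assumptions rather than the paper's exact accounting.
def effective_context_length(num_docs: int, tokens_per_doc: int, num_demos: int, tokens_per_demo: int, num_iterations: int = 1, prompt_overhead: int = 200) -> int:
    """Rough estimate of total input tokens consumed across all inference calls (assumed accounting)."""
    per_call = prompt_overhead + num_docs * tokens_per_doc + num_demos * tokens_per_demo
    return per_call * num_iterations
# Compare a single-call, DRAG-style plan with a multi-call, IterDRAG-style plan (made-up numbers).
drag_budget = effective_context_length(num_docs=10, tokens_per_doc=400, num_demos=3, tokens_per_demo=600)
iterdrag_budget = effective_context_length(num_docs=5, tokens_per_doc=400, num_demos=2, tokens_per_demo=600, num_iterations=4)
print(f"DRAG-style plan: ~{drag_budget} input tokens; IterDRAG-style plan: ~{iterdrag_budget} input tokens")
Under a fixed budget, a DRAG-style plan spends it on more documents and demonstrations in a single call, while an IterDRAG-style plan spreads it across several smaller, more focused calls.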
Make sure you are running Python 3.10, 3.11, or 3.12 in a freshly created virtual environment. Note that you can also access this tutorial on GitHub.
import sys
assert sys.version_info >= (3, 10) and sys.version_info < (3, 13), "Use Python 3.10, 3.11, or 3.12 to run this notebook."
! pip install "git+https://github.com/ibm-granite-community/utils.git" \
transformers \
pillow \
langchain_community \
langchain_huggingface \
langchain_milvus \
docling \
replicate
To see some logging information, we can configure the log level to INFO.
Note: running this cell is optional and can be skipped.
import logging
logging.basicConfig(level=logging.INFO)
Specify the embeddings model to use for generating text embedding vectors. Here we use one of the Granite Embeddings models.
To use a different embeddings model, replace this code cell with the corresponding cell from the Embeddings Model code template.
from langchain_huggingface import HuggingFaceEmbeddings
from transformers import AutoTokenizer
embeddings_model_path = "ibm-granite/granite-embedding-30m-english"
embeddings_model = HuggingFaceEmbeddings(
model_name=embeddings_model_path,
)
embeddings_tokenizer = AutoTokenizer.from_pretrained(embeddings_model_path)
Specify the multimodal large language model (MLLM) to use for image understanding. We will use the Granite Vision model.
from ibm_granite_community.notebook_utils import get_env_var
from langchain_community.llms import Replicate
from transformers import AutoProcessor
vision_model_path = "ibm-granite/granite-vision-3.2-2b"
vision_model = Replicate(
model=vision_model_path,
replicate_api_token=get_env_var("REPLICATE_API_TOKEN"),
model_kwargs={
"max_tokens": embeddings_tokenizer.max_len_single_sentence, # Set the maximum number of tokens to generate as output.
"min_tokens": 100, # Set the minimum number of tokens to generate as output.
"temperature": 0.01,
},
)
vision_processor = AutoProcessor.from_pretrained(vision_model_path)
Specify the language model to use for the RAG generation operations. Here we use the Replicate LangChain client to connect to a Granite model from the ibm-granite organization on Replicate.
To get set up with Replicate, see Getting started with Replicate.
To connect to a model from a provider other than Replicate, replace this code cell with the corresponding cell from the LLM component code template.
model_path = "ibm-granite/granite-3.3-8b-instruct"
model = Replicate(
model=model_path,
replicate_api_token=get_env_var("REPLICATE_API_TOKEN"),
model_kwargs={
"max_tokens": 1000, # Set the maximum number of tokens to generate as output.
"min_tokens": 100, # Set the minimum number of tokens to generate as output.
"temperature": 0.01
},
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
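Next, we use Docling to convert the source PDF. In the pipeline options below, OCR is disabled and picture images are kept in the converted document so that the vision model can describe them later.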
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
pdf_pipeline_options = PdfPipelineOptions(
do_ocr=False,
generate_picture_images=True,
)
format_options = {
InputFormat.PDF: PdfFormatOption(pipeline_options=pdf_pipeline_options),
}
converter = DocumentConverter(format_options=format_options)
sources = [
"https://midwestfoodbank.org/images/AR_2020_WEB2.pdf",
]
conversions = { source: converter.convert(source=source).document for source in sources }
After converting the documents, we further process their text elements, chunking them into sizes suited to the embeddings model in use. A list of LangChain documents is then created from the text chunks.
from docling_core.transforms.chunker.hybrid_chunker import HybridChunker
from docling_core.types.doc import DocItem, TableItem
from langchain_core.documents import Document
doc_id = 0
texts: list[Document] = []
for source, docling_document in conversions.items():
for chunk in HybridChunker(tokenizer=embeddings_tokenizer).chunk(docling_document):
items: list[DocItem] = chunk.meta.doc_items # type: ignore
if len(items) == 1 and isinstance(items[0], TableItem):
continue # we will process tables later
refs = " ".join(map(lambda item: item.get_ref().cref, items))
print(refs)
text = chunk.text
document = Document(
page_content=text,
metadata={
"doc_id": (doc_id:=doc_id+1),
"source": source,
"ref": refs,
},
)
texts.append(document)
print(f"{len(texts)} text document chunks created")
Next, we process all of the tables in the documents. We convert the table data to Markdown so that the language model can work with it, and a list of LangChain documents is created from the tables' Markdown renderings.
from docling_core.types.doc import DocItemLabel
doc_id = len(texts)
tables: list[Document] = []
for source, docling_document in conversions.items():
for table in docling_document.tables:
if table.label in [DocItemLabel.TABLE]:
ref = table.get_ref().cref
print(ref)
text = table.export_to_markdown(docling_document)
document = Document(
page_content=text,
metadata={
"doc_id": (doc_id:=doc_id+1),
"source": source,
"ref": ref
},
)
tables.append(document)
print(f"{len(tables)} table documents created")
Finally, we process all of the images in the documents. Here we use the vision language model to understand the content of each image. In this example, we are interested in any textual information in the images.
Choosing a suitable image prompt is critical because it determines which aspects of the image the model focuses on. For example, a prompt that asks for any visible text will surface different information than one that asks for a general description of the scene.
Note: image processing can take significant time, depending on the number of images and the service running the vision language model.
import base64
import io
import PIL.Image
import PIL.ImageOps
def encode_image(image: PIL.Image.Image, format: str = "png") -> str:
image = PIL.ImageOps.exif_transpose(image) or image
image = image.convert("RGB")
buffer = io.BytesIO()
image.save(buffer, format)
encoding = base64.b64encode(buffer.getvalue()).decode("utf-8")
uri = f"data:image/{format};base64,{encoding}"
return uri
# Feel free to experiment with this prompt
image_prompt = "Give a detailed description of what is depicted in the image"
conversation = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": image_prompt},
],
},
]
vision_prompt = vision_processor.apply_chat_template(
conversation=conversation,
add_generation_prompt=True,
)
pictures: list[Document] = []
doc_id = len(texts) + len(tables)
for source, docling_document in conversions.items():
for picture in docling_document.pictures:
ref = picture.get_ref().cref
print(ref)
image = picture.get_image(docling_document)
if image:
text = vision_model.invoke(vision_prompt, image=encode_image(image))
document = Document(
page_content=text,
metadata={
"doc_id": (doc_id:=doc_id+1),
"source": source,
"ref": ref,
},
)
pictures.append(document)
print(f"{len(pictures)} image descriptions created")
We can then display the LangChain documents created from the input documents.
import itertools
from docling_core.types.doc import RefItem
from IPython.display import display
# Print all created documents
for document in itertools.chain(texts, tables):
print(f"Document ID: {document.metadata['doc_id']}")
print(f"Source: {document.metadata['source']}")
print(f"Content:\n{document.page_content}")
print("=" * 80) # Separator for clarity
for document in pictures:
print(f"Document ID: {document.metadata['doc_id']}")
source = document.metadata['source']
print(f"Source: {source}")
print(f"Content:\n{document.page_content}")
docling_document = conversions[source]
ref = document.metadata['ref']
picture = RefItem(cref=ref).resolve(docling_document)
image = picture.get_image(docling_document)
print("Image:")
display(image)
print("=" * 80) # Separator for clarity
Using the embeddings model, we load the text-chunk documents and the generated image descriptions into a vector database. Creating this vector database lets us easily run semantic similarity searches across the document content.
Note: populating the vector database can take significant processing time, depending on your embeddings model and service.
Specify the database to use for storing and retrieving embedding vectors. In this tutorial we use Milvus via LangChain. As a vector database, Milvus stores, indexes, and manages the numerical embeddings generated by neural networks and other machine learning algorithms.
To connect to a vector database other than Milvus, replace this code cell with the corresponding cell from the Vector Store code template.
import tempfile
from langchain_core.vectorstores import VectorStore, VectorStoreRetriever
from langchain_milvus import Milvus
db_file = tempfile.NamedTemporaryFile(prefix="vectorstore_", suffix=".db", delete=False).name
print(f"The vector database will be saved to {db_file}")
vector_db: VectorStore = Milvus(
embedding_function=embeddings_model,
connection_args={"uri": db_file},
auto_id=True,
enable_dynamic_field=True,
index_params={"index_type": "AUTOINDEX"},
)
Now we add all of the LangChain documents for the texts, tables, and image descriptions to the vector database.
import itertools
documents = list(itertools.chain(texts, tables, pictures))
ids = vector_db.add_documents(documents)
print(f"{len(ids)} documents added to the vector database")
retriever: VectorStoreRetriever = vector_db.as_retriever(search_kwargs={"k": 10})
Now that we have successfully converted and vectorized the documents, we can build the RAG pipeline.
Here we test the vector database by searching the vector space for chunks of information relevant to a query. We display the retrieved documents, which may include image descriptions.
This validation step is important: it confirms that the retrieval system works before we build the full RAG pipeline. We want to see that the documents returned are relevant to our query.
Feel free to experiment with different queries.
query = "Analyze how Midwest Food Bank's financial efficiency changed during the pandemic by comparing their 2019 and 2020 performance metrics. What specific pandemic adaptations had the greatest impact on their operational capacity, and how did their volunteer management strategy evolve to maintain service levels despite COVID-19 restrictions? Provide specific statistics from the report to support your analysis."
for doc in vector_db.as_retriever().invoke(query):
print(doc)
print("=" * 80) # Separator for clarity
The returned documents should be responsive to the query. Next, let's build the RAG pipeline.
First, we create the prompt for Granite to perform the RAG query. We use the Granite chat template and provide placeholder values that the LangChain RAG pipeline will substitute.
{context} will hold the retrieved chunks, as shown by the previous search, supplying them to the model as document context for answering our question.
We then build the RAG pipeline using the Granite prompt template we created.
from ibm_granite_community.notebook_utils import escape_f_string
from langchain.prompts import PromptTemplate
from langchain.chains.retrieval import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
# Create a Granite prompt for question-answering with the retrieved context
prompt = tokenizer.apply_chat_template(
conversation=[{
"role": "user",
"content": "{input}",
}],
documents=[{
"doc_id": "0",
"text": "{context}",
}],
add_generation_prompt=True,
tokenize=False,
)
prompt_template = PromptTemplate.from_template(template=escape_f_string(prompt, "input", "context"))
# Create a Granite document prompt template to wrap each retrieved document
document_prompt_template = PromptTemplate.from_template(template="""\
<|end_of_text|>
<|start_of_role|>document {{"document_id": "{doc_id}"}}<|end_of_role|>
{page_content}""")
document_separator=""
# Assemble the retrieval-augmented generation chain
combine_docs_chain = create_stuff_documents_chain(
llm=model,
prompt=prompt_template,
document_prompt=document_prompt_template,
document_separator=document_separator,
)
rag_chain = create_retrieval_chain(
retriever=retriever,
combine_docs_chain=combine_docs_chain,
)
The pipeline uses the query to look up documents in the vector database and then supplies them as context for answering the query.
outputs = rag_chain.invoke({"input": query})
print(outputs['answer'])
While the standard RAG approach works reasonably well, it has several key limitations when dealing with very long or complex content: inference cost grows steeply with context length, context utilization plateaus as more passages are packed in, and naively adding documents yields diminishing returns.
Inference scaling techniques address these limitations by strategically allocating more compute at inference time.
Now we will implement the DRAG technique from the research paper "Inference Scaling for Long-Context Retrieval Augmented Generation" to enhance our RAG system.
DRAG uses in-context examples to demonstrate to the model how to extract and use information from documents, improving performance in long-context scenarios.
These demonstrations usually come from carefully curated datasets of high-quality question-answer pairs. Here, we instead create a few synthetic examples matched to the expected domain.
We define a dataclass to represent a single demonstration, and then create several demonstrations.
from dataclasses import dataclass, field, InitVar
from langchain_core.documents import Document
@dataclass
class DRAG_Demonstration:
query: str
answer: str
retriever: InitVar[VectorStoreRetriever] = field(kw_only=True)
documents: list[Document] = field(default_factory=list, kw_only=True)
def __post_init__(self, retriever: VectorStoreRetriever):
if not self.documents:
self.documents = retriever.invoke(self.query)
def __format__(self, format_spec: str) -> str:
formatted_documents = "\n".join(
f"Document {i+1}:\n{document.page_content}"
for i, document in enumerate(self.documents)
)
return f"""\
{formatted_documents}
Question: {self.query}
Answer: {self.answer}
"""
def create_enhanced_drag_demonstrations(vector_db: VectorStore) -> list[DRAG_Demonstration]:
"""Create high-quality demonstrations for DRAG technique that showcase effective document analysis"""
demonstration_retriever: VectorStoreRetriever = vector_db.as_retriever(search_kwargs={"k": 5})
demonstrations = [
DRAG_Demonstration(
query="How did the COVID-19 pandemic impact Midwest Food Bank's operations in 2020?",
answer="The COVID-19 pandemic significantly impacted Midwest Food Bank's operations in 2020. Despite challenges, MFB remained open and responsive to increased needs. They implemented safety protocols, reduced volunteer numbers for social distancing, and altered their distribution model to allow partner agencies to receive food safely. The pandemic created unprecedented food insecurity, with many people seeking assistance for the first time. MFB distributed 37% more food than in 2019, with a record 179 semi-loads of Disaster Relief family food boxes sent nationwide. The organization also faced supply chain disruptions and food procurement challenges in the early months but continued to find and distribute food. Community, business, and donor support helped fund operations and food purchases. Additionally, MFB began participating in the USDA Farmers to Families Food Box program in May 2020, distributing over $52 million worth of nutritious produce, protein, and dairy products.",
retriever=demonstration_retriever
),
DRAG_Demonstration(
query="What role did volunteers play at Midwest Food Bank during 2020, and how were they affected by the pandemic?",
answer="Volunteers were described as 'the life-blood of the organization' in the 2020 annual report. Despite the pandemic creating safety challenges, volunteers demonstrated courage and dedication by increasing their hours to meet growing needs. MFB implemented safety protocols at each location and limited volunteer group sizes to allow for social distancing. This created a challenge as food needs increased while fewer volunteers were available to help. To address this gap, multiple MFB locations received assistance from the National Guard, who filled vital volunteer positions driving trucks, operating forklifts, and helping with food distributions. In 2020, 17,930 individuals volunteered 300,898 hours of service, equivalent to 150 full-time employees. The volunteer-to-staff ratio was remarkable with 450 volunteers for every 1 paid MFB staff member, highlighting the volunteer-driven nature of the organization during the crisis.",
retriever=demonstration_retriever
),
DRAG_Demonstration(
query="How did Midwest Food Bank's international programs perform during 2020, particularly in Haiti and East Africa?",
answer="In 2020, Midwest Food Bank's international operations in East Africa and Haiti faced unique challenges but continued to serve communities. In East Africa (operated as Kapu Africa), strict lockdowns led to mass hunger, especially in slum areas. Kapu Africa distributed 7.2 million Tender Mercies meals, working with partner ministries to share food in food-insecure slums. A notable outcome was a spiritual awakening among recipients, with many asking why they were receiving help. In Haiti, the pandemic added to existing challenges, closing airports, seaports, factories, and schools. MFB Haiti more than doubled its food shipments to Haiti, delivering over 160 tons of food relief, nearly three-quarters being Tender Mercies meals. As Haitian children primarily receive nourishment from school lunches, MFB Haiti distributed Tender Mercies through faith-based schools and also partnered with over 20 feeding centers serving approximately 1,100 children daily. Nearly 1 million Tender Mercies meals were distributed in Haiti during 2020.",
retriever=demonstration_retriever
),
]
return demonstrations
Next, we format all of the demonstrations together so that they can be placed into the prompt.
# Format all demonstrations together
demonstrations = create_enhanced_drag_demonstrations(vector_db)
formatted_demonstrations = "\n\n".join(
f"Example {i+1}:\n{demo}"
for i, demo in enumerate(demonstrations)
)
Then we create the DRAG prompt for the model, which includes the formatted demonstrations.
drag_prompt = tokenizer.apply_chat_template(
conversation=[{
"role": "user",
"content": f"""\
Here are examples of effectively extracting information from documents to answer questions.
{formatted_demonstrations}
Follow these examples when answering the user's question:
{{input}}""",
}],
documents=[{
"doc_id": "0",
"text": "Placeholder{context}",
}],
add_generation_prompt=True,
tokenize=False,
)
# Convert to prompt template
drag_prompt_template = PromptTemplate.from_template(template=escape_f_string(drag_prompt, "input", "context"))
Normally, the retriever returns documents in similarity order, with the most similar document first. We define a reordering retriever that reverses this order, so that the most similar documents come last, closer to the end of the prompt.
import typing
from langchain_core.retrievers import BaseRetriever, RetrieverInput, RetrieverOutput
from langchain_core.callbacks.manager import CallbackManagerForRetrieverRun
class ReorderingRetriever(BaseRetriever):
base_retriever: BaseRetriever
def _get_relevant_documents(
self, query: RetrieverInput, *, run_manager: CallbackManagerForRetrieverRun, **kwargs: typing.Any
) -> RetrieverOutput:
docs = self.base_retriever._get_relevant_documents(query, run_manager=run_manager, **kwargs)
return list(reversed(docs)) # Reverse the order so higher-ranked docs are closer to query in prompt
reordering_retriever = ReorderingRetriever(base_retriever=retriever)
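A common motivation for this kind of reordering, stated here as an assumption rather than a claim from the paper, is that passages near the end of the prompt sit closest to the question, so the most relevant context is less likely to get lost in the middle of a long prompt.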
We create the pipeline for DRAG queries using the DRAG prompt template and the reordering retriever.
drag_combine_docs_chain = create_stuff_documents_chain(
llm=model,
prompt=drag_prompt_template,
document_prompt=document_prompt_template,
document_separator=document_separator,
)
drag_chain = create_retrieval_chain(
retriever=reordering_retriever,
combine_docs_chain=drag_combine_docs_chain,
)
drag_outputs = drag_chain.invoke({"input": query})
print("\n=== DRAG-Enhanced Answer ===")
print(drag_outputs['answer'])
Great, it looks like providing a few examples does improve the answer. Next, let's try a more comprehensive RAG technique!
IterDRAG extends DRAG by decomposing complex queries into simpler sub-queries and performing interleaved retrieval. This approach is especially effective for complex multi-hop questions that require integrating information from multiple sources or reasoning across several steps.
The main advantages of the iterative approach are more focused retrieval for each sub-question and a reasoning chain that bridges the gaps in multi-hop questions.
The decomposition step is key because it breaks a complex query into simpler, more focused sub-queries that can be answered individually.
decompose_prompt = tokenizer.apply_chat_template(
conversation=[{
"role": "user",
"content": """\
You are a helpful assistant that breaks down complex questions into simpler sub-questions.
For multi-part or complex questions, generate 1-3 sub-questions that would help answer the main question.
Here are examples of how to decompose complex questions:
{demonstrations}
Follow the above examples when breaking down the user's question.
If the following question is already simple enough, just respond with "No follow-up needed."
Otherwise, break down the following question into simpler sub-questions. Format your response as:
Follow up: [sub-question]
Question: {input}"""
}],
add_generation_prompt=True,
tokenize=False,
)
decompose_prompt_template = PromptTemplate.from_template(template=escape_f_string(decompose_prompt, "input", "demonstrations"))
decompose_chain = decompose_prompt_template | model
The sub-question answering component handles each individual sub-question by retrieving relevant documents and generating a targeted intermediate answer.
intermediate_prompt = tokenizer.apply_chat_template(
conversation=[{
"role": "user",
"content": """\
You are a helpful assistant that answers specific questions based on the provided documents.
Focus only on the sub-question and provide a concise intermediate answer.
Please answer the following sub-question based on the provided documents.
Format your response as:
Intermediate answer: [your concise answer to the sub-question]
Sub-question: {input}
"""
}],
documents=[{
"doc_id": "0",
"text": "Placeholder{context}",
}],
add_generation_prompt=True,
tokenize=False,
)
intermediate_prompt_template = PromptTemplate.from_template(template=escape_f_string(intermediate_prompt, "input", "context"))
intermediate_combine_docs_chain = create_stuff_documents_chain(
llm=model,
prompt=intermediate_prompt_template,
document_prompt=document_prompt_template,
document_separator=document_separator,
)
intermediate_chain = create_retrieval_chain(
retriever=reordering_retriever,
combine_docs_chain=intermediate_combine_docs_chain,
)
The final answer generation component combines all of the intermediate answers to produce a comprehensive response to the original question.
final_prompt = tokenizer.apply_chat_template(
conversation=[{
"role": "user",
"content": """\
You are a helpful assistant that provides comprehensive answers to questions.
Use the intermediate answers to sub-questions to formulate a complete final answer.
Please provide a final answer to the main question based on the intermediate answers to sub-questions.
Format your response as:
So the final answer is: [your comprehensive answer to the main question]
Main question: {input}
Sub-questions and intermediate answers:
{context}"""
}],
add_generation_prompt=True,
tokenize=False,
)
final_prompt_template = PromptTemplate.from_template(template=escape_f_string(final_prompt, "input", "context"))
final_chain = final_prompt_template | model
Creating effective demonstrations is crucial to IterDRAG's performance. These examples show the model how to decompose a complex question into follow-up sub-questions and how the intermediate answers support a final answer.
@dataclass
class IterDRAG_Demonstration_Base:
query: str
answer: str
@dataclass
class IterDRAG_Demonstration(IterDRAG_Demonstration_Base):
intermediate: list[IterDRAG_Demonstration_Base]
def __format__(self, format_spec: str) -> str:
sub_questions="\n".join(
f"Follow up: {sub.query}"
for sub in self.intermediate
)
return f"Question: {self.query}\n{sub_questions}"
def create_iterdrag_demonstrations() -> list[IterDRAG_Demonstration]:
"""Create examples showing how to decompose and answer complex questions"""
demonstrations = [
IterDRAG_Demonstration(
query="What impact did the pandemic have on the food bank's operations and distribution?",
answer="The pandemic had a profound impact on food bank operations and distribution. Distribution volume increased by 60% to over 100 million pounds of food in 2020. Operationally, the food bank faced supply chain disruptions, volunteer shortages, and safety protocol challenges. In response, they implemented contactless distribution, expanded mobile pantries, created emergency food boxes for vulnerable populations, and developed virtual nutrition education. Despite these challenges, they successfully scaled operations to meet the unprecedented community need during the crisis.",
intermediate=[
IterDRAG_Demonstration_Base(
query="How did food distribution volume change during the pandemic?",
answer="Food distribution volume increased by 60% during the pandemic, rising from approximately 62 million pounds in 2019 to over 100 million pounds in 2020.",
),
IterDRAG_Demonstration_Base(
query="What operational challenges did the food bank face during the pandemic?",
answer="The food bank faced challenges including supply chain disruptions, volunteer shortages due to social distancing requirements, and the need to implement new safety protocols for food handling and distribution.",
),
IterDRAG_Demonstration_Base(
query="What new programs were implemented in response to the pandemic?",
answer="New programs included contactless distribution methods, expanded mobile pantry operations, emergency food boxes for vulnerable populations, and virtual nutrition education classes.",
),
],
),
IterDRAG_Demonstration(
query="How does the food bank's financial management compare to industry standards for non-profits?",
answer="The food bank demonstrates excellent financial management compared to industry standards. With 94% of its budget allocated to program services and only 6% to administrative and fundraising costs, it exceeds the industry benchmark of 85-90% for program spending. This financial efficiency places the food bank among the top-performing non-profits in terms of maximizing donor impact and minimizing overhead expenses.",
intermediate=[
IterDRAG_Demonstration_Base(
query="What percentage of the food bank's budget goes to program services versus administrative costs?",
answer="94% of the food bank's budget goes directly to program services, with only 6% allocated to administrative and fundraising costs.",
),
IterDRAG_Demonstration_Base(
query="What are the industry standards for program spending versus overhead for food banks?",
answer="Industry standards suggest that well-run food banks typically allocate 85-90% of their budget to program services, with 10-15% for administrative and fundraising expenses.",
),
],
),
]
return demonstrations
The following function orchestrates the entire iterative process: decomposing the main question, answering each sub-question, and generating the final answer from the intermediate answers.
import re
def iterative_drag(main_question: str) -> dict[str, typing.Any]:
"""
Implements IterDRAG: decomposing queries, retrieving documents for sub-queries,
and generating a final answer based on intermediate answers.
"""
print(f"\n=== Processing query with IterDRAG: '{main_question}' ===")
# Step 1: Decompose the main question into sub-questions
print("Step 1: Decomposing the query into sub-questions...")
iterdrag_demonstrations = create_iterdrag_demonstrations()
formatted_demonstrations = "\n\n".join(
f"Example {i+1}:\n{demo}"
for i, demo in enumerate(iterdrag_demonstrations)
)
decompose_result = decompose_chain.invoke({
"input": main_question,
"demonstrations": formatted_demonstrations,
})
decompose_answer = decompose_result
# Extract sub-questions using regex
sub_questions = re.findall(r"Follow up: (.*?)(?=Follow up:|\n|$)", decompose_answer, re.DOTALL)
sub_questions = [sq.strip() for sq in sub_questions if sq.strip()]
if not sub_questions:
print("No decomposition needed or found. Using standard DRAG approach.")
return drag_chain.invoke({"input": main_question})
print(f"Decomposed into {len(sub_questions)} sub-questions")
# Step 2: Answer each sub-question
intermediate_pairs: list[dict[str, str]] = []
for i, sub_question in enumerate(sub_questions):
print(f"\nStep 2.{i+1}: Processing sub-question: '{sub_question}'")
# Generate answer for this sub-question
intermediate_result = intermediate_chain.invoke({"input": sub_question})
intermediate_answer = intermediate_result["answer"]
# Extract intermediate answer using regex
intermediate_answer_match = re.search(r"Intermediate answer: (.*?)$", intermediate_answer, re.DOTALL)
if intermediate_answer_match:
intermediate_answer = intermediate_answer_match.group(1).strip()
print(f"Generated intermediate answer: {intermediate_answer[:100]}...")
# Store the sub-question and its answer
intermediate_pairs.append({"input": sub_question, "answer": intermediate_answer})
# Step 3: Generate the final answer based on sub-question answers
print("\nStep 3: Generating final answer based on intermediate answers...")
final_result = final_chain.invoke({
"input": main_question,
"context": "\n\n".join(
f"Sub-question: {pair['input']}\nIntermediate answer: {pair['answer']}"
for pair in intermediate_pairs
),
})
final_answer = final_result
# Extract final answer
final_answer_match = re.search(r"So the final answer is: (.*?)$", final_answer, re.DOTALL)
if final_answer_match:
final_answer = final_answer_match.group(1).strip()
return {"input": main_question, "answer": final_answer, "intermediate": intermediate_pairs}
Now that all three RAG approaches are set up, let's compare their answers to the same query, this time using a more complex query so that the differences between them are easier to observe.
The comparison shows the strengths of each approach and the situations each is best suited to.
# Run all approaches on the same complex query
comparison_query = "What was the full impact chain of the National Guard's assistance during the pandemic? Specifically, how did their involvement affect volunteer operations, what specific tasks did they perform, and how did this ultimately translate to community impact in terms of food distribution capabilities and reach?"
print("\n=== Standard RAG ===")
standard_result = rag_chain.invoke({"input": comparison_query})
print(standard_result["answer"])
print("\n=== DRAG ===")
drag_result = drag_chain.invoke({"input": comparison_query})
print(drag_result["answer"])
print("\n=== IterDRAG ===")
iterdrag_result = iterative_drag(comparison_query)
print(iterdrag_result["answer"])
Here is a summary of the performance differences among the three RAG approaches we implemented:

| Method | Strengths | Limitations | Best use cases |
|---|---|---|---|
| Standard RAG | Simple pipeline with a single retrieval and generation pass; lowest latency and compute cost | Struggles with long or complex content; fixed retrieval depth limits context utilization | Straightforward questions answerable from a few retrieved passages |
| DRAG | In-context demonstrations show the model how to extract and apply information from documents, improving long-context retrieval and generation quality | Longer prompts increase inference cost; still a single generation pass, so multi-hop reasoning remains difficult | Knowledge-intensive questions where richer context and guided extraction help |
| IterDRAG | Decomposes complex queries into sub-queries with interleaved retrieval; builds a reasoning chain for multi-hop questions and focuses compute on each step | Multiple model calls add latency and cost; quality depends on the decomposition step | Complex, multi-hop questions that integrate information from multiple sources |
As we saw in the implementation, inference scaling techniques such as DRAG and IterDRAG can significantly improve RAG performance. They are especially useful for complex queries that require in-depth analysis across multiple documents.
In this tutorial, we took a close look at how inference scaling can substantially improve RAG performance. By strategically allocating additional compute at inference time through techniques such as DRAG and IterDRAG, we can achieve marked gains in response quality for complex queries.
High inference cost: transformer-based models use self-attention, whose inference cost grows quadratically with input length. As a rough illustration, moving from a 4,000-token to a 32,000-token context is only 8 times more input but roughly 64 times more attention computation. This makes processing long contexts computationally expensive, limiting practical RAG applications to shorter documents or forcing heavy truncation.
Limited context utilization: standard RAG systems typically retrieve and process a fixed number of documents, which may be insufficient for complex multi-hop queries. Performance tends to plateau as context length grows, especially beyond 128,000 tokens, because the model struggles to integrate information across a large number of retrieved passages.
Inefficient allocation of compute: without careful allocation, adding more retrieved documents or context only raises the computational cost without a corresponding gain in accuracy, potentially leading to diminishing returns or even degraded performance due to information overload.
Demonstration-based RAG (DRAG):
DRAG makes full use of multiple retrieved examples, questions, and answers as demonstrations in the prompt, allowing the model to learn in context how to locate and apply relevant information.
This approach is especially effective at shorter effective context lengths because it lets the model exploit rich context without overwhelming the attention mechanism, improving both retrieval and generation quality.
Iterative demonstration-based RAG (IterDRAG):
IterDRAG decomposes complex queries into simpler sub-queries, iteratively retrieving documents and generating an answer for each sub-step.
By interleaving retrieval and generation, IterDRAG builds a reasoning chain that bridges the gaps in multi-hop queries, making it especially effective for very long contexts.
This process allows the model to allocate compute more effectively, focus on the most relevant information at each step, and avoid the risk of attention overload in long contexts. By applying these inference scaling techniques to your RAG applications, you can significantly improve performance on knowledge-intensive tasks without changing the underlying model.
1. "A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems," Zixuan Ke, Fangkai Jiao, Yifei Ming, Xuan-Phi Nguyen, Austin Xu, Do Xuan Long, Minzhi Li, et al., arXiv.org, 2025.
2. "Reasoning in Granite 3.2 Using Inference Scaling," Luis Lastras, IBM Research, February 26, 2025.
3. "Inference Scaling for Long-Context Retrieval Augmented Generation," Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuanhui Wang, Michael Bendersky, arXiv.org, 2024.