Inference scaling in artificial intelligence (AI) refers to improving model performance by allocating computational resources at inference time (when the model produces output) rather than relying on larger training datasets or model architectures. As large language models (LLMs) continue to scale in parameter count and dataset size, optimizing inference time and managing inference compute scaling (especially on GPU hardware) has become a central challenge in deploying high-performance multimodal retrieval augmented generation (RAG) systems.
Recent advances in inference strategies, which add compute and apply more sophisticated algorithms at inference time, are redefining how LLMs handle complex reasoning tasks and deliver higher-quality output across multiple input modalities. Inference scaling deepens chain-of-thought (CoT) reasoning: it lets a model produce longer, more detailed reasoning chains through iterative prompting or multi-step generation. Inference scaling can also be used to improve multimodal RAG, with a focus on the interplay between model size, compute budget, and inference-time optimization in real-world applications.
In addition, scaling laws and benchmark results highlight the trade-offs among pretraining, fine-tuning, inference-time strategies, and advanced algorithms for output selection. Both large and small models can benefit from inference scaling, because it also lets resource-constrained systems approach the performance of state-of-the-art LLMs. This tutorial demonstrates the impact of these optimization techniques on model performance and provides actionable guidance for balancing accuracy, latency, and cost in multimodal RAG deployments.
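Before turning to RAG, the following sketch illustrates the two patterns behind most inference scaling: sampling several candidate answers and keeping the best one, and iteratively extending a chain of thought. It is a minimal illustration only; call_llm and score are hypothetical placeholders for any text-generation client and any answer-scoring heuristic, not components of this tutorial's pipeline.
from typing import Callable
# Illustrative only: two simple ways to spend extra compute at inference time.
# `call_llm` stands in for any text-generation client; `score` for any heuristic
# that rates a candidate answer. Both are hypothetical placeholders.
def best_of_n(call_llm: Callable[[str], str], score: Callable[[str], float], prompt: str, n: int = 4) -> str:
    """Sample n candidate answers and keep the highest-scoring one."""
    candidates = [call_llm(prompt) for _ in range(n)]
    return max(candidates, key=score)
def deepen_chain_of_thought(call_llm: Callable[[str], str], question: str, steps: int = 3) -> str:
    """Iteratively extend the reasoning trace, one step per model call."""
    reasoning = ""
    for _ in range(steps):
        reasoning = call_llm(
            f"Question: {question}\n"
            f"Reasoning so far: {reasoning or '(none)'}\n"
            "Extend the reasoning by one step, then state the current best answer."
        )
    return reasoning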
This tutorial is designed for AI developers, researchers, and enthusiasts looking to strengthen their knowledge of document management and advanced natural language processing (NLP) techniques. You will learn how to use inference scaling to improve the multimodal RAG pipeline created in the earlier code templates. While this tutorial focuses on scalability strategies for multimodal RAG, with particular attention to the IBM® Granite large language models, similar principles apply to most popular models, including those from OpenAI (such as GPT-4, GPT-4o, and ChatGPT) and DeepMind.
This tutorial guides you through the complete workflow. By the time you finish, you will have built a standard multimodal RAG pipeline, enhanced it with two inference scaling techniques (DRAG and IterDRAG), and compared the results.
Traditional language models run into several problems when handling long contexts: high inference cost, limited context utilization, and inefficient allocation of compute. Each of these is discussed in more detail in the summary at the end of this tutorial.
The techniques in this tutorial address these challenges through the strategic allocation of inference compute.
For more information on the two advanced inference scaling techniques used here (DRAG and IterDRAG), see the research paper "Inference Scaling for Long-Context Retrieval Augmented Generation."
These methods show that, with optimal allocation, scaling inference compute can improve RAG performance almost linearly, allowing RAG systems to make better use of the long-context capabilities of modern LLMs. In this implementation, we use IBM® Granite models capable of handling different modalities. You will create an AI system that applies the principles from the paper to answer users' real-time queries over unstructured data.
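As a rough way to reason about the paper's budget framing, the sketch below estimates an inference compute budget as the total number of input tokens consumed across all generation calls, in the spirit of its effective context length. The helper function and all token counts are illustrative assumptions rather than the paper's exact accounting.
def effective_context_length(num_docs: int, tokens_per_doc: int, num_demos: int, tokens_per_demo: int, num_iterations: int = 1, prompt_overhead: int = 200) -> int:
    """Rough estimate of total input tokens consumed across all inference calls (assumed accounting)."""
    per_call = prompt_overhead + num_docs * tokens_per_doc + num_demos * tokens_per_demo
    return per_call * num_iterations
# Compare a single-call, DRAG-style plan with a multi-call, IterDRAG-style plan (made-up numbers).
drag_budget = effective_context_length(num_docs=10, tokens_per_doc=400, num_demos=3, tokens_per_demo=600)
iterdrag_budget = effective_context_length(num_docs=5, tokens_per_doc=400, num_demos=2, tokens_per_demo=600, num_iterations=4)
print(f"DRAG-style plan: ~{drag_budget} input tokens; IterDRAG-style plan: ~{iterdrag_budget} input tokens")
Under a fixed budget, a DRAG-style plan spends it on more documents and demonstrations in a single call, while an IterDRAG-style plan spreads it across several smaller, more focused calls.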
Make sure you are running Python 3.10, 3.11, or 3.12 in a freshly created virtual environment. Note that you can also access this tutorial on GitHub.
import sys
assert sys.version_info >= (3, 10) and sys.version_info < (3, 13), "Use Python 3.10, 3.11, or 3.12 to run this notebook."
! pip install "git+https://github.com/ibm-granite-community/utils.git" \
transformers \
pillow \
langchain_community \
langchain_huggingface \
langchain_milvus \
docling \
replicate
To see some logging information, we can configure the log level to INFO.
Note: running this cell is optional and can be skipped.
import logging
logging.basicConfig(level=logging.INFO)
Specify the embeddings model to use for generating text embedding vectors. Here we use one of the Granite Embeddings models.
To use a different embeddings model, replace this code cell with the corresponding cell from the Embeddings Model code template.
from langchain_huggingface import HuggingFaceEmbeddings
from transformers import AutoTokenizer
embeddings_model_path = "ibm-granite/granite-embedding-30m-english"
embeddings_model = HuggingFaceEmbeddings(
model_name=embeddings_model_path,
)
embeddings_tokenizer = AutoTokenizer.from_pretrained(embeddings_model_path)
Specify the multimodal large language model (MLLM) to use for image understanding. We will use the Granite Vision model.
from ibm_granite_community.notebook_utils import get_env_var
from langchain_community.llms import Replicate
from transformers import AutoProcessor
vision_model_path = "ibm-granite/granite-vision-3.2-2b"
vision_model = Replicate(
model=vision_model_path,
replicate_api_token=get_env_var("REPLICATE_API_TOKEN"),
model_kwargs={
"max_tokens": embeddings_tokenizer.max_len_single_sentence, # Set the maximum number of tokens to generate as output.
"min_tokens": 100, # Set the minimum number of tokens to generate as output.
"temperature": 0.01,
},
)
vision_processor = AutoProcessor.from_pretrained(vision_model_path)
Specify the language model to use for the RAG generation operations. Here we use the Replicate LangChain client to connect to a Granite model from the ibm-granite organization on Replicate.
To get set up with Replicate, see Getting started with Replicate.
To connect to a model from a provider other than Replicate, replace this code cell with the corresponding cell from the LLM component code template.
model_path = "ibm-granite/granite-3.3-8b-instruct"
model = Replicate(
model=model_path,
replicate_api_token=get_env_var("REPLICATE_API_TOKEN"),
model_kwargs={
"max_tokens": 1000, # Set the maximum number of tokens to generate as output.
"min_tokens": 100, # Set the minimum number of tokens to generate as output.
"temperature": 0.01
},
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
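Next, we use Docling to convert the source PDF. In the pipeline options below, OCR is disabled and picture images are kept in the converted document so that the vision model can describe them later.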
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
pdf_pipeline_options = PdfPipelineOptions(
do_ocr=False,
generate_picture_images=True,
)
format_options = {
InputFormat.PDF: PdfFormatOption(pipeline_options=pdf_pipeline_options),
}
converter = DocumentConverter(format_options=format_options)
sources = [
"https://midwestfoodbank.org/images/AR_2020_WEB2.pdf",
]
conversions = { source: converter.convert(source=source).document for source in sources }
After converting the documents, we further process their text elements, chunking them into sizes suited to the embeddings model in use. A list of LangChain documents is then created from the text chunks.
from docling_core.transforms.chunker.hybrid_chunker import HybridChunker
from docling_core.types.doc import DocItem, TableItem
from langchain_core.documents import Document
doc_id = 0
texts: list[Document] = []
for source, docling_document in conversions.items():
for chunk in HybridChunker(tokenizer=embeddings_tokenizer).chunk(docling_document):
items: list[DocItem] = chunk.meta.doc_items # type: ignore
if len(items) == 1 and isinstance(items[0], TableItem):
continue # we will process tables later
refs = " ".join(map(lambda item: item.get_ref().cref, items))
print(refs)
text = chunk.text
document = Document(
page_content=text,
metadata={
"doc_id": (doc_id:=doc_id+1),
"source": source,
"ref": refs,
},
)
texts.append(document)
print(f"{len(texts)} text document chunks created")
Next, we process all of the tables in the documents. We convert the table data to Markdown so that the language model can work with it, and a list of LangChain documents is created from the tables' Markdown renderings.
from docling_core.types.doc import DocItemLabel
doc_id = len(texts)
tables: list[Document] = []
for source, docling_document in conversions.items():
for table in docling_document.tables:
if table.label in [DocItemLabel.TABLE]:
ref = table.get_ref().cref
print(ref)
text = table.export_to_markdown(docling_document)
document = Document(
page_content=text,
metadata={
"doc_id": (doc_id:=doc_id+1),
"source": source,
"ref": ref
},
)
tables.append(document)
print(f"{len(tables)} table documents created")
Finally, we process all of the images in the documents. Here we use the vision language model to understand the content of each image. In this example, we are interested in any textual information in the images.
Choosing a suitable image prompt is critical because it determines which aspects of the image the model focuses on. For example, a prompt that asks for any visible text will surface different information than one that asks for a general description of the scene.
Note: image processing can take significant time, depending on the number of images and the service running the vision language model.
import base64
import io
import PIL.Image
import PIL.ImageOps
def encode_image(image: PIL.Image.Image, format: str = "png") -> str:
image = PIL.ImageOps.exif_transpose(image) or image
image = image.convert("RGB")
buffer = io.BytesIO()
image.save(buffer, format)
encoding = base64.b64encode(buffer.getvalue()).decode("utf-8")
uri = f"data:image/{format};base64,{encoding}"
return uri
# Feel free to experiment with this prompt
image_prompt = "Give a detailed description of what is depicted in the image"
conversation = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": image_prompt},
],
},
]
vision_prompt = vision_processor.apply_chat_template(
conversation=conversation,
add_generation_prompt=True,
)
pictures: list[Document] = []
doc_id = len(texts) + len(tables)
for source, docling_document in conversions.items():
for picture in docling_document.pictures:
ref = picture.get_ref().cref
print(ref)
image = picture.get_image(docling_document)
if image:
text = vision_model.invoke(vision_prompt, image=encode_image(image))
document = Document(
page_content=text,
metadata={
"doc_id": (doc_id:=doc_id+1),
"source": source,
"ref": ref,
},
)
pictures.append(document)
print(f"{len(pictures)} image descriptions created")
We can then display the LangChain documents created from the input documents.
import itertools
from docling_core.types.doc import RefItem
from IPython.display import display
# Print all created documents
for document in itertools.chain(texts, tables):
print(f"Document ID: {document.metadata['doc_id']}")
print(f"Source: {document.metadata['source']}")
print(f"Content:\n{document.page_content}")
print("=" * 80) # Separator for clarity
for document in pictures:
print(f"Document ID: {document.metadata['doc_id']}")
source = document.metadata['source']
print(f"Source: {source}")
print(f"Content:\n{document.page_content}")
docling_document = conversions[source]
ref = document.metadata['ref']
picture = RefItem(cref=ref).resolve(docling_document)
image = picture.get_image(docling_document)
print("Image:")
display(image)
print("=" * 80) # Separator for clarity
Using the embeddings model, we load the text-chunk documents and the generated image descriptions into a vector database. Creating this vector database lets us easily run semantic similarity searches across the document content.
Note: populating the vector database can take significant processing time, depending on your embeddings model and service.
Specify the database to use for storing and retrieving embedding vectors. In this tutorial we use Milvus via LangChain. As a vector database, Milvus stores, indexes, and manages the numerical embeddings generated by neural networks and other machine learning algorithms.
To connect to a vector database other than Milvus, replace this code cell with the corresponding cell from the Vector Store code template.
import tempfile
from langchain_core.vectorstores import VectorStore, VectorStoreRetriever
from langchain_milvus import Milvus
db_file = tempfile.NamedTemporaryFile(prefix="vectorstore_", suffix=".db", delete=False).name
print(f"The vector database will be saved to {db_file}")
vector_db: VectorStore = Milvus(
embedding_function=embeddings_model,
connection_args={"uri": db_file},
auto_id=True,
enable_dynamic_field=True,
index_params={"index_type": "AUTOINDEX"},
)
Now we add all of the LangChain documents for the texts, tables, and image descriptions to the vector database.
import itertools
documents = list(itertools.chain(texts, tables, pictures))
ids = vector_db.add_documents(documents)
print(f"{len(ids)} documents added to the vector database")
retriever: VectorStoreRetriever = vector_db.as_retriever(search_kwargs={"k": 10})
Now that we have successfully converted and vectorized the documents, we can build the RAG pipeline.
Here we test the vector database by searching the vector space for chunks of information relevant to a query. We display the retrieved documents, which may include image descriptions.
This validation step is important: it confirms that the retrieval system works before we build the full RAG pipeline. We want to see that the documents returned are relevant to our query.
Feel free to experiment with different queries.
query = "Analyze how Midwest Food Bank's financial efficiency changed during the pandemic by comparing their 2019 and 2020 performance metrics. What specific pandemic adaptations had the greatest impact on their operational capacity, and how did their volunteer management strategy evolve to maintain service levels despite COVID-19 restrictions? Provide specific statistics from the report to support your analysis."
for doc in vector_db.as_retriever().invoke(query):
print(doc)
print("=" * 80) # Separator for clarity
The returned documents should be responsive to the query. Next, let's build the RAG pipeline.
First, we create the prompt for Granite to perform the RAG query. We use the Granite chat template and provide placeholder values that the LangChain RAG pipeline will substitute.
{context} will hold the retrieved chunks, as shown by the previous search, supplying them to the model as document context for answering our question.
We then build the RAG pipeline using the Granite prompt template we created.
from ibm_granite_community.notebook_utils import escape_f_string
from langchain.prompts import PromptTemplate
from langchain.chains.retrieval import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
# Create a Granite prompt for question-answering with the retrieved context
prompt = tokenizer.apply_chat_template(
conversation=[{
"role": "user",
"content": "{input}",
}],
documents=[{
"doc_id": "0",
"text": "{context}",
}],
add_generation_prompt=True,
tokenize=False,
)
prompt_template = PromptTemplate.from_template(template=escape_f_string(prompt, "input", "context"))
# Create a Granite document prompt template to wrap each retrieved document
document_prompt_template = PromptTemplate.from_template(template="""\
<|end_of_text|>
<|start_of_role|>document {{"document_id": "{doc_id}"}}<|end_of_role|>
{page_content}""")
document_separator=""
# Assemble the retrieval-augmented generation chain
combine_docs_chain = create_stuff_documents_chain(
llm=model,
prompt=prompt_template,
document_prompt=document_prompt_template,
document_separator=document_separator,
)
rag_chain = create_retrieval_chain(
retriever=retriever,
combine_docs_chain=combine_docs_chain,
)
The pipeline uses the query to look up documents in the vector database and then supplies them as context for answering the query.
outputs = rag_chain.invoke({"input": query})
print(outputs['answer'])
While the standard RAG approach works reasonably well, it has several key limitations when dealing with very long or complex content: inference cost grows steeply with context length, context utilization plateaus as more passages are packed in, and naively adding documents yields diminishing returns.
Inference scaling techniques address these limitations by strategically allocating more compute at inference time.
Now we will implement the DRAG technique from the research paper "Inference Scaling for Long-Context Retrieval Augmented Generation" to enhance our RAG system.
DRAG uses in-context examples to demonstrate to the model how to extract and use information from documents, improving performance in long-context scenarios.
These demonstrations usually come from carefully curated datasets of high-quality question-answer pairs. Here, we instead create a few synthetic examples matched to the expected domain.
We define a dataclass to represent a single demonstration, and then create several demonstrations.
from dataclasses import dataclass, field, InitVar
from langchain_core.documents import Document
@dataclass
class DRAG_Demonstration:
query: str
answer: str
retriever: InitVar[VectorStoreRetriever] = field(kw_only=True)
documents: list[Document] = field(default_factory=list, kw_only=True)
def __post_init__(self, retriever: VectorStoreRetriever):
if not self.documents:
self.documents = retriever.invoke(self.query)
def __format__(self, format_spec: str) -> str:
formatted_documents = "\n".join(
f"Document {i+1}:\n{document.page_content}"
for i, document in enumerate(self.documents)
)
return f"""\
{formatted_documents}
Question: {self.query}
Answer: {self.answer}
"""
def create_enhanced_drag_demonstrations(vector_db: VectorStore) -> list[DRAG_Demonstration]:
"""Create high-quality demonstrations for DRAG technique that showcase effective document analysis"""
demonstration_retriever: VectorStoreRetriever = vector_db.as_retriever(search_kwargs={"k": 5})
demonstrations = [
DRAG_Demonstration(
query="How did the COVID-19 pandemic impact Midwest Food Bank's operations in 2020?",
answer="The COVID-19 pandemic significantly impacted Midwest Food Bank's operations in 2020. Despite challenges, MFB remained open and responsive to increased needs. They implemented safety protocols, reduced volunteer numbers for social distancing, and altered their distribution model to allow partner agencies to receive food safely. The pandemic created unprecedented food insecurity, with many people seeking assistance for the first time. MFB distributed 37% more food than in 2019, with a record 179 semi-loads of Disaster Relief family food boxes sent nationwide. The organization also faced supply chain disruptions and food procurement challenges in the early months but continued to find and distribute food. Community, business, and donor support helped fund operations and food purchases. Additionally, MFB began participating in the USDA Farmers to Families Food Box program in May 2020, distributing over $52 million worth of nutritious produce, protein, and dairy products.",
retriever=demonstration_retriever
),
DRAG_Demonstration(
query="What role did volunteers play at Midwest Food Bank during 2020, and how were they affected by the pandemic?",
answer="Volunteers were described as 'the life-blood of the organization' in the 2020 annual report. Despite the pandemic creating safety challenges, volunteers demonstrated courage and dedication by increasing their hours to meet growing needs. MFB implemented safety protocols at each location and limited volunteer group sizes to allow for social distancing. This created a challenge as food needs increased while fewer volunteers were available to help. To address this gap, multiple MFB locations received assistance from the National Guard, who filled vital volunteer positions driving trucks, operating forklifts, and helping with food distributions. In 2020, 17,930 individuals volunteered 300,898 hours of service, equivalent to 150 full-time employees. The volunteer-to-staff ratio was remarkable with 450 volunteers for every 1 paid MFB staff member, highlighting the volunteer-driven nature of the organization during the crisis.",
retriever=demonstration_retriever
),
DRAG_Demonstration(
query="How did Midwest Food Bank's international programs perform during 2020, particularly in Haiti and East Africa?",
answer="In 2020, Midwest Food Bank's international operations in East Africa and Haiti faced unique challenges but continued to serve communities. In East Africa (operated as Kapu Africa), strict lockdowns led to mass hunger, especially in slum areas. Kapu Africa distributed 7.2 million Tender Mercies meals, working with partner ministries to share food in food-insecure slums. A notable outcome was a spiritual awakening among recipients, with many asking why they were receiving help. In Haiti, the pandemic added to existing challenges, closing airports, seaports, factories, and schools. MFB Haiti more than doubled its food shipments to Haiti, delivering over 160 tons of food relief, nearly three-quarters being Tender Mercies meals. As Haitian children primarily receive nourishment from school lunches, MFB Haiti distributed Tender Mercies through faith-based schools and also partnered with over 20 feeding centers serving approximately 1,100 children daily. Nearly 1 million Tender Mercies meals were distributed in Haiti during 2020.",
retriever=demonstration_retriever
),
]
return demonstrations
Next, we format all of the demonstrations together so that they can be placed into the prompt.
# Format all demonstrations together
demonstrations = create_enhanced_drag_demonstrations(vector_db)
formatted_demonstrations = "\n\n".join(
f"Example {i+1}:\n{demo}"
for i, demo in enumerate(demonstrations)
)
Then we create the DRAG prompt for the model, which includes the formatted demonstrations.
drag_prompt = tokenizer.apply_chat_template(
conversation=[{
"role": "user",
"content": f"""\
Here are examples of effectively extracting information from documents to answer questions.
{formatted_demonstrations}
Follow these examples when answering the user's question:
{{input}}""",
}],
documents=[{
"doc_id": "0",
"text": "Placeholder{context}",
}],
add_generation_prompt=True,
tokenize=False,
)
# Convert to prompt template
drag_prompt_template = PromptTemplate.from_template(template=escape_f_string(drag_prompt, "input", "context"))
Normally, the retriever returns documents in similarity order, with the most similar document first. We define a reordering retriever that reverses this order, so that the most similar documents come last, closer to the end of the prompt.
import typing
from langchain_core.retrievers import BaseRetriever, RetrieverInput, RetrieverOutput
from langchain_core.callbacks.manager import CallbackManagerForRetrieverRun
class ReorderingRetriever(BaseRetriever):
base_retriever: BaseRetriever
def _get_relevant_documents(
self, query: RetrieverInput, *, run_manager: CallbackManagerForRetrieverRun, **kwargs: typing.Any
) -> RetrieverOutput:
docs = self.base_retriever._get_relevant_documents(query, run_manager=run_manager, **kwargs)
return list(reversed(docs)) # Reverse the order so higher-ranked docs are closer to query in prompt
reordering_retriever = ReorderingRetriever(base_retriever=retriever)
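A common motivation for this kind of reordering, stated here as an assumption rather than a claim from the paper, is that passages near the end of the prompt sit closest to the question, so the most relevant context is less likely to get lost in the middle of a long prompt.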
We create the pipeline for DRAG queries using the DRAG prompt template and the reordering retriever.
drag_combine_docs_chain = create_stuff_documents_chain(
llm=model,
prompt=drag_prompt_template,
document_prompt=document_prompt_template,
document_separator=document_separator,
)
drag_chain = create_retrieval_chain(
retriever=reordering_retriever,
combine_docs_chain=drag_combine_docs_chain,
)
drag_outputs = drag_chain.invoke({"input": query})
print("\n=== DRAG-Enhanced Answer ===")
print(drag_outputs['answer'])
Great, it looks like providing a few examples does improve the answer. Next, let's try a more comprehensive RAG technique!
IterDRAG extends DRAG by decomposing complex queries into simpler sub-queries and performing interleaved retrieval. This approach is especially effective for complex multi-hop questions that require integrating information from multiple sources or reasoning across several steps.
The main advantages of the iterative approach are more focused retrieval for each sub-question and a reasoning chain that bridges the gaps in multi-hop questions.
The decomposition step is key because it breaks a complex query into simpler, more focused sub-queries that can be answered individually.
decompose_prompt = tokenizer.apply_chat_template(
conversation=[{
"role": "user",
"content": """\
You are a helpful assistant that breaks down complex questions into simpler sub-questions.
For multi-part or complex questions, generate 1-3 sub-questions that would help answer the main question.
Here are examples of how to decompose complex questions:
{demonstrations}
Follow the above examples when breaking down the user's question.
If the following question is already simple enough, just respond with "No follow-up needed."
Otherwise, break down the following question into simpler sub-questions. Format your response as:
Follow up: [sub-question]
Question: {input}"""
}],
add_generation_prompt=True,
tokenize=False,
)
decompose_prompt_template = PromptTemplate.from_template(template=escape_f_string(decompose_prompt, "input", "demonstrations"))
decompose_chain = decompose_prompt_template | model
The sub-question answering component handles each individual sub-question by retrieving relevant documents and generating a targeted intermediate answer.
intermediate_prompt = tokenizer.apply_chat_template(
conversation=[{
"role": "user",
"content": """\
You are a helpful assistant that answers specific questions based on the provided documents.
Focus only on the sub-question and provide a concise intermediate answer.
Please answer the following sub-question based on the provided documents.
Format your response as:
Intermediate answer: [your concise answer to the sub-question]
Sub-question: {input}
"""
}],
documents=[{
"doc_id": "0",
"text": "Placeholder{context}",
}],
add_generation_prompt=True,
tokenize=False,
)
intermediate_prompt_template = PromptTemplate.from_template(template=escape_f_string(intermediate_prompt, "input", "context"))
intermediate_combine_docs_chain = create_stuff_documents_chain(
llm=model,
prompt=intermediate_prompt_template,
document_prompt=document_prompt_template,
document_separator=document_separator,
)
intermediate_chain = create_retrieval_chain(
retriever=reordering_retriever,
combine_docs_chain=intermediate_combine_docs_chain,
)
The final answer generation component combines all of the intermediate answers to produce a comprehensive response to the original question.
final_prompt = tokenizer.apply_chat_template(
conversation=[{
"role": "user",
"content": """\
You are a helpful assistant that provides comprehensive answers to questions.
Use the intermediate answers to sub-questions to formulate a complete final answer.
Please provide a final answer to the main question based on the intermediate answers to sub-questions.
Format your response as:
So the final answer is: [your comprehensive answer to the main question]
Main question: {input}
Sub-questions and intermediate answers:
{context}"""
}],
add_generation_prompt=True,
tokenize=False,
)
final_prompt_template = PromptTemplate.from_template(template=escape_f_string(final_prompt, "input", "context"))
final_chain = final_prompt_template | model
Creating effective demonstrations is crucial to IterDRAG's performance. These examples show the model how to decompose a complex question into follow-up sub-questions and how the intermediate answers support a final answer.
@dataclass
class IterDRAG_Demonstration_Base:
query: str
answer: str
@dataclass
class IterDRAG_Demonstration(IterDRAG_Demonstration_Base):
intermediate: list[IterDRAG_Demonstration_Base]
def __format__(self, format_spec: str) -> str:
sub_questions="\n".join(
f"Follow up: {sub.query}"
for sub in self.intermediate
)
return f"Question: {self.query}\n{sub_questions}"
def create_iterdrag_demonstrations() -> list[IterDRAG_Demonstration]:
"""Create examples showing how to decompose and answer complex questions"""
demonstrations = [
IterDRAG_Demonstration(
query="What impact did the pandemic have on the food bank's operations and distribution?",
answer="The pandemic had a profound impact on food bank operations and distribution. Distribution volume increased by 60% to over 100 million pounds of food in 2020. Operationally, the food bank faced supply chain disruptions, volunteer shortages, and safety protocol challenges. In response, they implemented contactless distribution, expanded mobile pantries, created emergency food boxes for vulnerable populations, and developed virtual nutrition education. Despite these challenges, they successfully scaled operations to meet the unprecedented community need during the crisis.",
intermediate=[
IterDRAG_Demonstration_Base(
query="How did food distribution volume change during the pandemic?",
answer="Food distribution volume increased by 60% during the pandemic, rising from approximately 62 million pounds in 2019 to over 100 million pounds in 2020.",
),
IterDRAG_Demonstration_Base(
query="What operational challenges did the food bank face during the pandemic?",
answer="The food bank faced challenges including supply chain disruptions, volunteer shortages due to social distancing requirements, and the need to implement new safety protocols for food handling and distribution.",
),
IterDRAG_Demonstration_Base(
query="What new programs were implemented in response to the pandemic?",
answer="New programs included contactless distribution methods, expanded mobile pantry operations, emergency food boxes for vulnerable populations, and virtual nutrition education classes.",
),
],
),
IterDRAG_Demonstration(
query="How does the food bank's financial management compare to industry standards for non-profits?",
answer="The food bank demonstrates excellent financial management compared to industry standards. With 94% of its budget allocated to program services and only 6% to administrative and fundraising costs, it exceeds the industry benchmark of 85-90% for program spending. This financial efficiency places the food bank among the top-performing non-profits in terms of maximizing donor impact and minimizing overhead expenses.",
intermediate=[
IterDRAG_Demonstration_Base(
query="What percentage of the food bank's budget goes to program services versus administrative costs?",
answer="94% of the food bank's budget goes directly to program services, with only 6% allocated to administrative and fundraising costs.",
),
IterDRAG_Demonstration_Base(
query="What are the industry standards for program spending versus overhead for food banks?",
answer="Industry standards suggest that well-run food banks typically allocate 85-90% of their budget to program services, with 10-15% for administrative and fundraising expenses.",
),
],
),
]
return demonstrations
The following function orchestrates the entire iterative process: decomposing the main question, answering each sub-question, and generating the final answer from the intermediate answers.
import re
def iterative_drag(main_question: str) -> dict[str, typing.Any]:
"""
Implements IterDRAG: decomposing queries, retrieving documents for sub-queries,
and generating a final answer based on intermediate answers.
"""
print(f"\n=== Processing query with IterDRAG: '{main_question}' ===")
# Step 1: Decompose the main question into sub-questions
print("Step 1: Decomposing the query into sub-questions...")
iterdrag_demonstrations = create_iterdrag_demonstrations()
formatted_demonstrations = "\n\n".join(
f"Example {i+1}:\n{demo}"
for i, demo in enumerate(iterdrag_demonstrations)
)
decompose_result = decompose_chain.invoke({
"input": main_question,
"demonstrations": formatted_demonstrations,
})
decompose_answer = decompose_result
# Extract sub-questions using regex
sub_questions = re.findall(r"Follow up: (.*?)(?=Follow up:|\n|$)", decompose_answer, re.DOTALL)
sub_questions = [sq.strip() for sq in sub_questions if sq.strip()]
if not sub_questions:
print("No decomposition needed or found. Using standard DRAG approach.")
return drag_chain.invoke({"input": main_question})
print(f"Decomposed into {len(sub_questions)} sub-questions")
# Step 2: Answer each sub-question
intermediate_pairs: list[dict[str, str]] = []
for i, sub_question in enumerate(sub_questions):
print(f"\nStep 2.{i+1}: Processing sub-question: '{sub_question}'")
# Generate answer for this sub-question
intermediate_result = intermediate_chain.invoke({"input": sub_question})
intermediate_answer = intermediate_result["answer"]
# Extract intermediate answer using regex
intermediate_answer_match = re.search(r"Intermediate answer: (.*?)$", intermediate_answer, re.DOTALL)
if intermediate_answer_match:
intermediate_answer = intermediate_answer_match.group(1).strip()
print(f"Generated intermediate answer: {intermediate_answer[:100]}...")
# Store the sub-question and its answer
intermediate_pairs.append({"input": sub_question, "answer": intermediate_answer})
# Step 3: Generate the final answer based on sub-question answers
print("\nStep 3: Generating final answer based on intermediate answers...")
final_result = final_chain.invoke({
"input": main_question,
"context": "\n\n".join(
f"Sub-question: {pair['input']}\nIntermediate answer: {pair['answer']}"
for pair in intermediate_pairs
),
})
final_answer = final_result
# Extract final answer
final_answer_match = re.search(r"So the final answer is: (.*?)$", final_answer, re.DOTALL)
if final_answer_match:
final_answer = final_answer_match.group(1).strip()
return {"input": main_question, "answer": final_answer, "intermediate": intermediate_pairs}
Now that all three RAG approaches are set up, let's compare their answers to the same query, this time using a more complex query so that the differences between them are easier to observe.
The comparison shows the strengths of each approach and the situations each is best suited to.
# Run all approaches on the same complex query
comparison_query = "What was the full impact chain of the National Guard's assistance during the pandemic? Specifically, how did their involvement affect volunteer operations, what specific tasks did they perform, and how did this ultimately translate to community impact in terms of food distribution capabilities and reach?"
print("\n=== Standard RAG ===")
standard_result = rag_chain.invoke({"input": comparison_query})
print(standard_result["answer"])
print("\n=== DRAG ===")
drag_result = drag_chain.invoke({"input": comparison_query})
print(drag_result["answer"])
print("\n=== IterDRAG ===")
iterdrag_result = iterative_drag(comparison_query)
print(iterdrag_result["answer"])
Here is a summary of the performance differences among the three RAG approaches we implemented:

| Method | Strengths | Limitations | Best use cases |
|---|---|---|---|
| Standard RAG | Simple pipeline with a single retrieval and generation pass; lowest latency and compute cost | Struggles with long or complex content; fixed retrieval depth limits context utilization | Straightforward questions answerable from a few retrieved passages |
| DRAG | In-context demonstrations show the model how to extract and apply information from documents, improving long-context retrieval and generation quality | Longer prompts increase inference cost; still a single generation pass, so multi-hop reasoning remains difficult | Knowledge-intensive questions where richer context and guided extraction help |
| IterDRAG | Decomposes complex queries into sub-queries with interleaved retrieval; builds a reasoning chain for multi-hop questions and focuses compute on each step | Multiple model calls add latency and cost; quality depends on the decomposition step | Complex, multi-hop questions that integrate information from multiple sources |
As we saw in the implementation, inference scaling techniques such as DRAG and IterDRAG can significantly improve RAG performance. They are especially useful for complex queries that require in-depth analysis across multiple documents.
In this tutorial, we took a close look at how inference scaling can substantially improve RAG performance. By strategically allocating additional compute at inference time through techniques such as DRAG and IterDRAG, we can achieve marked gains in response quality for complex queries.
High inference cost: transformer-based models use self-attention, whose inference cost grows quadratically with input length. As a rough illustration, moving from a 4,000-token to a 32,000-token context is only 8 times more input but roughly 64 times more attention computation. This makes processing long contexts computationally expensive, limiting practical RAG applications to shorter documents or forcing heavy truncation.
Limited context utilization: standard RAG systems typically retrieve and process a fixed number of documents, which may be insufficient for complex multi-hop queries. Performance tends to plateau as context length grows, especially beyond 128,000 tokens, because the model struggles to integrate information across a large number of retrieved passages.
Inefficient allocation of compute: without careful allocation, adding more retrieved documents or context only raises the computational cost without a corresponding gain in accuracy, potentially leading to diminishing returns or even degraded performance due to information overload.
Demonstration-based RAG (DRAG):
DRAG makes full use of multiple retrieved examples, questions, and answers as demonstrations in the prompt, allowing the model to learn in context how to locate and apply relevant information.
This approach is especially effective at shorter effective context lengths because it lets the model exploit rich context without overwhelming the attention mechanism, improving both retrieval and generation quality.
Iterative demonstration-based RAG (IterDRAG):
IterDRAG decomposes complex queries into simpler sub-queries, iteratively retrieving documents and generating an answer for each sub-step.
By interleaving retrieval and generation, IterDRAG builds a reasoning chain that bridges the gaps in multi-hop queries, making it especially effective for very long contexts.
This process allows the model to allocate compute more effectively, focus on the most relevant information at each step, and avoid the risk of attention overload in long contexts. By applying these inference scaling techniques to your RAG applications, you can significantly improve performance on knowledge-intensive tasks without changing the underlying model.
1. "A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems," Zixuan Ke, Fangkai Jiao, Yifei Ming, Xuan-Phi Nguyen, Austin Xu, Do Xuan Long, Minzhi Li, et al., arXiv.org, 2025.
2. "Reasoning in Granite 3.2 Using Inference Scaling," Luis Lastras, IBM Research, February 26, 2025.
3. "Inference Scaling for Long-Context Retrieval Augmented Generation," Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuanhui Wang, Michael Bendersky, arXiv.org, 2024.