Inference scaling to improve multimodal RAG

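Inference scaling in artificial intelligence (AI) refers to techniques that improve model performance by allocating computational resources during the inference phase (when the model generates outputs), rather than relying on larger training datasets or model architectures. As large language models (LLMs) continue to grow in both model parameters and dataset size, optimizing inference time and managing the scaling of inference compute, particularly on GPU hardware, have become central challenges in deploying high-performance multimodal retrieval-augmented generation (RAG) systems.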

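Introduction to inference scaling

Recent advances in inference strategies that increase computational resources and apply more complex algorithms at test time are redefining how LLMs tackle complex reasoning tasks and deliver higher-quality outputs across diverse input modalities. Inference scaling optimizes chain of thought (CoT) by expanding reasoning depth. This expansion lets models produce longer, more detailed chains of thought through iterative prompting or multistep generation. Inference scaling can be used to improve multimodal RAG, with a focus on balancing model size, compute budget, and the practical optimization of inference time for real-world applications.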

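Furthermore, scaling laws and benchmark results highlight the tradeoffs among pretraining, fine-tuning, inference-time strategies, and advanced algorithms for output selection. Both larger and smaller models benefit from inference scaling, because it also allows resource-constrained systems to approach the performance of state-of-the-art LLMs. This tutorial demonstrates the impact of optimization techniques on model performance and offers actionable guidance for balancing accuracy, latency, and cost in multimodal RAG deployments.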

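This tutorial is designed for AI developers, researchers, and enthusiasts who want to deepen their knowledge of document management and advanced natural language processing (NLP) techniques. You will learn how to harness inference scaling to improve the multimodal RAG pipeline created in a previous recipe. While this tutorial focuses on scaling strategies for multimodal RAG with IBM Granite large language models, similar principles apply to most popular models, including those from OpenAI (for example, GPT-4, GPT-4o, and ChatGPT) and DeepMind.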

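This tutorial guides you through the following processes:

Document preprocessing: You will learn how to handle documents from various sources, parse and transform them into usable formats, and store them in a vector database by using Docling. Docling is an IBM open source toolkit used to efficiently parse document formats such as PDF, DOCX, PPTX, XLSX, images, HTML, AsciiDoc, and Markdown. It then exports the document contents into machine-readable formats such as Markdown or JSON. You will use a Granite machine learning (ML) model to generate descriptions of the images in the documents. In this tutorial, Docling downloads and processes PDF documents so that we can obtain the text and images they contain.
Retrieval-augmented generation (RAG): Learn how to connect LLMs such as Granite with external knowledge bases to enhance query responses and generate valuable insights. RAG is a technique that connects an LLM to a knowledge base of information outside the data the LLM was trained on. It can be applied to LLMs without fine-tuning. Traditional RAG is limited to text-based use cases such as text summarization and chatbots.
Multimodal RAG: Learn how multimodal RAG uses multimodal large language models (MLLMs) to process information from multiple types of data. This data can then be included as part of the external knowledge base used in RAG. Multimodal data can include text, images, audio, video, and other forms. This tutorial uses IBM's latest multimodal vision model, Granite 3.2 vision.
Implementing demonstration-based RAG (DRAG) and iterative demonstration-based RAG (IterDRAG): Apply the inference scaling techniques from the research paper to significantly improve RAG performance when working with long context. The DRAG method leverages in-context learning to improve RAG performance. By including multiple RAG examples as demonstrations, DRAG helps models learn to locate relevant information in long contexts. Unlike standard RAG, which might plateau as more documents are added, DRAG shows nearly linear improvements as context length increases. IterDRAG is an extension of DRAG that addresses complex multi-hop queries by decomposing them into simpler sub-queries. Multi-hop is a process in which a complex query is broken down and answered through a series of simpler sub-questions. Each sub-question might require information retrieved or synthesized from different sources. IterDRAG interleaves retrieval and generation steps, creating reasoning chains that bridge compositionality gaps. This approach is particularly effective for handling complex queries across long contexts.
LangChain for workflow integration: Discover how to use LangChain to streamline and orchestrate document processing and retrieval workflows, enabling seamless interaction between the different components of the system.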

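This tutorial uses three cutting-edge technologies:

Docling: An open source toolkit used to parse and convert documents.
Granite: A state-of-the-art LLM family that provides robust natural language capabilities, plus a vision language model that generates text from images.
LangChain: A powerful framework used to build applications powered by language models, designed to simplify complex workflows and integrate external tools seamlessly.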

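By the end of this tutorial, you will be able to:

Preprocess and chunk documents and apply image understanding.
Integrate a vector database to enhance retrieval capabilities.
Implement DRAG and IterDRAG to perform efficient and accurate data retrieval with inference scaling.
See firsthand how scaling inference compute can lead to nearly linear improvements in RAG performance.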

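Understanding long-context challenges

Traditional language models struggle with long contexts for several reasons:

Traditional attention mechanisms, such as those in transformers, scale quadratically with sequence length, which can demand enormous computational resources.
Difficulty locating relevant information in very long sequences.
Challenges in maintaining coherence across distant parts of the input.
Increased computational demands for processing long sequences.

The techniques in this tutorial address these challenges through strategic allocation of inference computation.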
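To make the quadratic-scaling point concrete, the following minimal sketch counts the pairwise attention scores computed for a sequence. The function name `attention_score_entries` is hypothetical, and the count deliberately ignores batch size, layer count, and any attention optimizations a real model might use.

```python
def attention_score_entries(seq_len: int, num_heads: int = 1) -> int:
    """Count the query-key score pairs for one attention layer.

    Every token attends to every token, so the count grows with seq_len squared.
    """
    return num_heads * seq_len * seq_len


# Doubling the sequence length quadruples the number of score entries.
for n in (1_000, 2_000, 4_000, 8_000):
    print(f"{n:>6} tokens -> {attention_score_entries(n):>12,} score entries")
```

Under these simplifying assumptions, each doubling of the input length multiplies the attention work by four, which is why long contexts become expensive quickly.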

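Inference scaling methods: DRAG and IterDRAG

DRAG compared with IterDRAG

More information about these two advanced inference scaling techniques (DRAG and IterDRAG) can be found in the research paper “Inference Scaling for Long-Context Retrieval Augmented Generation”.

These methods show that scaling inference computation can improve RAG performance almost linearly when optimally allocated, allowing RAG systems to make better use of the long-context capabilities of modern LLMs. For this implementation, we use an IBM Granite model capable of processing different modalities. You will apply the principles from the paper to create an AI system that answers real-time user queries from unstructured data.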
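One way to reason about "allocating inference computation" is to estimate how many input tokens a given configuration consumes. The sketch below is a rough back-of-the-envelope estimate only: the function `effective_context_length`, its parameters, and the default token counts are all hypothetical, and it assumes for simplicity that each demonstration carries the same number of retrieved documents as the main query.

```python
def effective_context_length(
    num_docs: int,
    num_demos: int,
    num_iterations: int = 1,
    tokens_per_doc: int = 300,       # assumed average, not measured
    tokens_per_question: int = 50,   # assumed average, not measured
) -> int:
    """Estimate total input tokens across all generation calls."""
    # One "example" = retrieved documents plus a question (and answer, for demos).
    per_example = num_docs * tokens_per_doc + tokens_per_question
    # Each call sees every demonstration plus the actual query.
    per_call = (num_demos + 1) * per_example
    # Iterative approaches repeat the call once per iteration.
    return num_iterations * per_call


print(effective_context_length(num_docs=10, num_demos=3))                    # single call
print(effective_context_length(num_docs=10, num_demos=3, num_iterations=2))  # two iterations
```

The three knobs (documents, demonstrations, iterations) are the levers that the rest of this tutorial turns; the estimate simply shows how quickly they multiply.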

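Prerequisites

Familiarity with Python programming.
Basic understanding of LLMs, NLP concepts, and computer vision.

Steps

Make sure you are running Python 3.10, 3.11, or 3.12 in a freshly created virtual environment. This tutorial is also available on GitHub.

Step 1: Set up your environment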

import sys
assert sys.version_info >= (3, 10) and sys.version_info < (3, 13), "Use Python 3.10, 3.11, or 3.12 to run this notebook."

Step 2: Install dependencies

! pip install "git+https://github.com/ibm-granite-community/utils.git" \
    transformers \
    pillow \
    langchain_community \
    langchain_huggingface \
    langchain_milvus \
    docling \
    replicate

Logging

To see logging information, set the log level to INFO.

Note: It is fine to skip running this cell.

import logging

logging.basicConfig(level=logging.INFO)

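Step 3: Select the AI models

Load the Granite models

Specify the embeddings model to use for generating text embedding vectors. Here we use one of the Granite Embeddings models.

To use a different embeddings model, replace this code cell with one from the Embeddings Model recipe.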

from langchain_huggingface import HuggingFaceEmbeddings
from transformers import AutoTokenizer

embeddings_model_path = "ibm-granite/granite-embedding-30m-english"
embeddings_model = HuggingFaceEmbeddings(
    model_name=embeddings_model_path,
)
embeddings_tokenizer = AutoTokenizer.from_pretrained(embeddings_model_path)

Specify the MLLM to use for image understanding. We use the Granite vision model.

from ibm_granite_community.notebook_utils import get_env_var
from langchain_community.llms import Replicate
from transformers import AutoProcessor

vision_model_path = "ibm-granite/granite-vision-3.2-2b"
vision_model = Replicate(
    model=vision_model_path,
    replicate_api_token=get_env_var("REPLICATE_API_TOKEN"),
    model_kwargs={
        "max_tokens": embeddings_tokenizer.max_len_single_sentence, # Set the maximum number of tokens to generate as output.
        "min_tokens": 100, # Set the minimum number of tokens to generate as output.
        "temperature": 0.01,
    },
)
vision_processor = AutoProcessor.from_pretrained(vision_model_path)

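Specify the language model to use for the RAG generation operation. Here we use the Replicate LangChain client to connect to a Granite model from the ibm-granite org on Replicate.

To get set up with Replicate, see “Getting Started with Replicate”.

To connect to a model on a provider other than Replicate, replace this code cell with one from the LLM component recipe.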

model_path = "ibm-granite/granite-3.3-8b-instruct"
model = Replicate(
    model=model_path,
    replicate_api_token=get_env_var("REPLICATE_API_TOKEN"),
    model_kwargs={
        "max_tokens": 1000, # Set the maximum number of tokens to generate as output.
        "min_tokens": 100, # Set the minimum number of tokens to generate as output.
        "temperature": 0.01
    },
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

Step 4: Prepare documents for the vector database with Docling

from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions

pdf_pipeline_options = PdfPipelineOptions(
    do_ocr=False,
    generate_picture_images=True,
)
format_options = {
    InputFormat.PDF: PdfFormatOption(pipeline_options=pdf_pipeline_options),
}
converter = DocumentConverter(format_options=format_options)

sources = [
    "https://midwestfoodbank.org/images/AR_2020_WEB2.pdf",
]
conversions = { source: converter.convert(source=source).document for source in sources }

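With the documents processed, we then further process the text elements in the documents and chunk them into sizes appropriate for the embeddings model we are using. A list of LangChain documents is created from the text chunks.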

from docling_core.transforms.chunker.hybrid_chunker import HybridChunker
from docling_core.types.doc import DocItem, TableItem
from langchain_core.documents import Document

doc_id = 0
texts: list[Document] = []
for source, docling_document in conversions.items():
    for chunk in HybridChunker(tokenizer=embeddings_tokenizer).chunk(docling_document):
        items: list[DocItem] = chunk.meta.doc_items # type: ignore
        if len(items) == 1 and isinstance(items[0], TableItem):
            continue # we will process tables later
        refs = " ".join(map(lambda item: item.get_ref().cref, items))
        print(refs)
        text = chunk.text
        document = Document(
            page_content=text,
            metadata={
                "doc_id": (doc_id:=doc_id+1),
                "source": source,
                "ref": refs,
            },
        )
        texts.append(document)

print(f"{len(texts)} text document chunks created")
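To illustrate what "chunking to a size appropriate for the embeddings model" means, here is a minimal sketch that greedily packs sentences into chunks under a token budget. It is not the algorithm that HybridChunker uses: the function `chunk_by_token_budget` is hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
def chunk_by_token_budget(sentences: list[str], max_tokens: int) -> list[str]:
    """Greedily pack sentences into chunks of at most max_tokens 'tokens'.

    Tokens are approximated by whitespace-separated words (an assumption;
    a real pipeline would count tokens with the embeddings tokenizer).
    """
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


print(chunk_by_token_budget(["a b c", "d e", "f g h i", "j"], max_tokens=5))
```

Note that in this simplified version a single sentence longer than the budget still becomes its own oversized chunk; a production chunker would split it further.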

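Next, we process any tables in the documents. We convert the table data to Markdown format so that the language model can process it. A list of LangChain documents is created from the tables' Markdown renderings.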

from docling_core.types.doc import DocItemLabel

doc_id = len(texts)
tables: list[Document] = []
for source, docling_document in conversions.items():
    for table in docling_document.tables:
        if table.label in [DocItemLabel.TABLE]:
            ref = table.get_ref().cref
            print(ref)
            text = table.export_to_markdown(docling_document)
            document = Document(
                page_content=text,
                metadata={
                    "doc_id": (doc_id:=doc_id+1),
                    "source": source,
                    "ref": ref
                },
            )
            tables.append(document)


print(f"{len(tables)} table documents created")
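For readers unfamiliar with what a Markdown table rendering looks like, the sketch below builds one from plain Python lists. It only illustrates the target format; the helper `rows_to_markdown` is hypothetical and is not how Docling's `export_to_markdown` is implemented.

```python
def rows_to_markdown(header: list[str], rows: list[list[str]]) -> str:
    """Render a header and rows as a pipe-delimited Markdown table."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",  # header separator row
    ]
    lines += ["| " + " | ".join(str(cell) for cell in row) + " |" for row in rows]
    return "\n".join(lines)


# Placeholder values only; these are not taken from the report.
print(rows_to_markdown(["A", "B"], [["1", "2"], ["3", "4"]]))
```

Because the result is plain text, it can be embedded and retrieved in the same way as any other text chunk.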

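Finally, we process any images in the documents. Here we use the vision language model to understand the content of each image. In this example, we are interested in any textual information in the image.

The choice of image prompt is critical because it directs which aspects of the image the model focuses on. For example:

A prompt such as “Give a detailed description of what is depicted in the image” (used below) provides general information about all visual elements.
A prompt such as “What text appears in this image?” focuses specifically on extracting textual content.
For charts and graphs, a prompt such as “Describe the graphical data visualization in this image” is a better fit.
Experiment with different prompts based on the types of images in your documents and the information you need to extract from them.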
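If your documents mix several kinds of images, one option is to keep the candidate prompts in a small lookup table and pick one per image. The sketch below is only a suggestion: the `IMAGE_PROMPTS` mapping, its keys, and `select_image_prompt` are hypothetical names, and how you decide the image kind is left open.

```python
# Candidate prompts keyed by a hypothetical image "kind" label.
IMAGE_PROMPTS = {
    "general": "Give a detailed description of what is depicted in the image",
    "text": "What text appears in this image?",
    "chart": "Describe the graphical data visualization in this image",
}


def select_image_prompt(kind: str) -> str:
    """Return the prompt for an image kind, falling back to the general prompt."""
    return IMAGE_PROMPTS.get(kind, IMAGE_PROMPTS["general"])


print(select_image_prompt("chart"))
print(select_image_prompt("unknown-kind"))  # falls back to the general prompt
```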

Note: Processing the images might require significant processing time, depending on the number of images and the service running the vision language model.

import base64
import io
import PIL.Image
import PIL.ImageOps

def encode_image(image: PIL.Image.Image, format: str = "png") -> str:
    image = PIL.ImageOps.exif_transpose(image) or image
    image = image.convert("RGB")

    buffer = io.BytesIO()
    image.save(buffer, format)
    encoding = base64.b64encode(buffer.getvalue()).decode("utf-8")
    uri = f"data:image/{format};base64,{encoding}"
    return uri

# Feel free to experiment with this prompt
image_prompt = "Give a detailed description of what is depicted in the image"
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": image_prompt},
        ],
    },
]
vision_prompt = vision_processor.apply_chat_template(
    conversation=conversation,
    add_generation_prompt=True,
)
pictures: list[Document] = []
doc_id = len(texts) + len(tables)
for source, docling_document in conversions.items():
    for picture in docling_document.pictures:
        ref = picture.get_ref().cref
        print(ref)
        image = picture.get_image(docling_document)
        if image:
            text = vision_model.invoke(vision_prompt, image=encode_image(image))
            document = Document(
                page_content=text,
                metadata={
                    "doc_id": (doc_id:=doc_id+1),
                    "source": source,
                    "ref": ref,
                },
            )
            pictures.append(document)

print(f"{len(pictures)} image descriptions created")

We can then display the LangChain documents created from the input documents.

import itertools
from docling_core.types.doc import RefItem
from IPython.display import display

# Print all created documents
for document in itertools.chain(texts, tables):
    print(f"Document ID: {document.metadata['doc_id']}")
    print(f"Source: {document.metadata['source']}")
    print(f"Content:\n{document.page_content}")
    print("=" * 80)  # Separator for clarity

for document in pictures:
    print(f"Document ID: {document.metadata['doc_id']}")
    source = document.metadata['source']
    print(f"Source: {source}")
    print(f"Content:\n{document.page_content}")
    docling_document = conversions[source]
    ref = document.metadata['ref']
    picture = RefItem(cref=ref).resolve(docling_document)
    image = picture.get_image(docling_document)
    print("Image:")
    display(image)
    print("=" * 80)  # Separator for clarity

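Populate the vector database

Using the embeddings model, we load the documents from the text chunks and the generated image captions into a vector database. Creating this vector database lets us easily run semantic similarity searches across the documents.

Note: Populating the vector database might require significant processing time, depending on your embeddings model and service.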
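As a mental model for semantic similarity search, the sketch below ranks a handful of toy vectors by cosine similarity to a query vector. It is a simplification under stated assumptions: the three-dimensional vectors and document IDs are made up, and a real vector database uses learned embeddings and an index whose distance metric depends on its configuration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], store: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the IDs of the k stored vectors most similar to the query."""
    ranked = sorted(
        store.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]


# Toy "embeddings" for a text chunk, a table, and an image description.
store = {
    "text-1": [1.0, 0.0, 0.0],
    "table-1": [0.7, 0.7, 0.0],
    "image-1": [0.0, 0.0, 1.0],
}
print(top_k([1.0, 0.1, 0.0], store))
```

Because text chunks, table renderings, and image descriptions all end up as vectors in the same space, a single query can surface any of them.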

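Choose your vector database

Specify the database to use for storing and retrieving embedding vectors. This tutorial uses Milvus through LangChain. As a vector database, Milvus stores, indexes, and manages the numerical embeddings generated by neural networks and various ML algorithms.

To connect to a vector database other than Milvus, replace this code cell with one from the Vector Store recipe.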

import tempfile
from langchain_core.vectorstores import VectorStore, VectorStoreRetriever
from langchain_milvus import Milvus

db_file = tempfile.NamedTemporaryFile(prefix="vectorstore_", suffix=".db", delete=False).name
print(f"The vector database will be saved to {db_file}")

vector_db: VectorStore = Milvus(
    embedding_function=embeddings_model,
    connection_args={"uri": db_file},
    auto_id=True,
    enable_dynamic_field=True,
    index_params={"index_type": "AUTOINDEX"},
)

We now add all the LangChain documents for the text, tables, and image descriptions to the vector database.

import itertools

documents = list(itertools.chain(texts, tables, pictures))
ids = vector_db.add_documents(documents)
print(f"{len(ids)} documents added to the vector database")
retriever: VectorStoreRetriever = vector_db.as_retriever(search_kwargs={"k": 10})

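Step 5: RAG with Granite

Now that we have successfully converted and vectorized our documents, we can set up the RAG pipeline.

Validate retrieval quality

Here we test the vector database by searching the vector space for chunks that contain information relevant to our query. We display the documents that the search returns, which can include generated image descriptions.

This validation step is important for confirming that the retrieval system works correctly before we build the full RAG pipeline. Check whether the returned documents are relevant to the query.

Feel free to try different queries.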

query = "Analyze how Midwest Food Bank's financial efficiency changed during the pandemic by comparing their 2019 and 2020 performance metrics. What specific pandemic adaptations had the greatest impact on their operational capacity, and how did their volunteer management strategy evolve to maintain service levels despite COVID-19 restrictions? Provide specific statistics from the report to support your analysis."
for doc in vector_db.as_retriever().invoke(query):
    print(doc)
    print("=" * 80)  # Separator for clarity

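The returned documents should be responsive to the query. Let's go ahead and build the RAG pipeline.

Create the RAG pipeline for Granite

First, we create the prompts for Granite to perform the RAG query. We use the Granite chat template and supply the placeholder values that the LangChain RAG pipeline will replace.

The {context} placeholder holds the retrieved chunks, as shown in the previous search, and feeds them to the model as document context for answering the question.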
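Conceptually, filling the {context} placeholder means formatting each retrieved document with a per-document template, joining the results, and substituting the joined text into the main prompt. The sketch below shows that idea with plain string formatting. It is a simplified stand-in, not LangChain's implementation: `stuff_documents` and both templates are hypothetical, and the templates do not reproduce the Granite chat format.

```python
def stuff_documents(docs: list[dict], document_template: str, separator: str = "") -> str:
    """Format each document with the template and join the results into one string."""
    return separator.join(document_template.format(**doc) for doc in docs)


# Simplified, made-up templates for illustration.
document_template = "\n[document {doc_id}]\n{page_content}"
prompt_template = "Documents:{context}\n\nQuestion: {input}\nAnswer:"

docs = [
    {"doc_id": 1, "page_content": "First retrieved chunk."},
    {"doc_id": 2, "page_content": "Second retrieved chunk."},
]

context = stuff_documents(docs, document_template)
prompt = prompt_template.format(context=context, input="What do the chunks say?")
print(prompt)
```

The order of the documents in the joined context follows the order in which the retriever returned them, which becomes relevant when document ordering is changed deliberately.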

Next, we construct the RAG pipeline by using the Granite prompt templates we created.

from ibm_granite_community.notebook_utils import escape_f_string
from langchain.prompts import PromptTemplate
from langchain.chains.retrieval import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

# Create a Granite prompt for question-answering with the retrieved context
prompt = tokenizer.apply_chat_template(
    conversation=[{
        "role": "user",
        "content": "{input}",
    }],
    documents=[{
        "doc_id": "0",
        "text": "{context}",
    }],
    add_generation_prompt=True,
    tokenize=False,
)
prompt_template = PromptTemplate.from_template(template=escape_f_string(prompt, "input", "context"))

# Create a Granite document prompt template to wrap each retrieved document
document_prompt_template = PromptTemplate.from_template(template="""\
<|end_of_text|>
<|start_of_role|>document {{"document_id": "{doc_id}"}}<|end_of_role|>
{page_content}""")
document_separator=""

# Assemble the retrieval-augmented generation chain
combine_docs_chain = create_stuff_documents_chain(
    llm=model,
    prompt=prompt_template,
    document_prompt=document_prompt_template,
    document_separator=document_separator,
)
rag_chain = create_retrieval_chain(
    retriever=retriever,
    combine_docs_chain=combine_docs_chain,
)

Generate a retrieval-augmented response to a question

The pipeline uses the query to retrieve documents from the vector database and uses them as context for the query.

outputs = rag_chain.invoke({"input": query})
print(outputs['answer'])

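Limitations of standard RAG and why we need inference scaling

While the standard RAG approach works reasonably well, it has several key limitations when dealing with long or complex content:

Context management: When dealing with many documents, standard RAG struggles to make effective use of all the available context.
Retrieval quality: Without guidance on how to use the retrieved information, models often focus on the wrong parts of documents.
Compositional reasoning: Standard RAG struggles with complex queries that require multistep reasoning.
Performance plateaus: Adding more documents to standard RAG often leads to stagnating or degraded responses beyond a certain threshold.

Inference scaling techniques address these limitations by strategically allocating more computation at inference time.

Enhanced RAG with DRAG (demonstration-based RAG)

We now implement the DRAG technique from the research paper “Inference Scaling for Long-Context Retrieval Augmented Generation” to enhance our RAG system.

DRAG uses in-context examples to show the model how to extract and use information from documents, which improves performance in long-context scenarios.

Step 1: Create sample in-context demonstrations

Demonstrations would typically come from a curated dataset of high-quality question-and-answer pairs. For the purposes of this tutorial, we create a few synthetic examples that match the expected domain.

Here we define a data class to represent an individual demonstration and then create several demonstrations.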

from dataclasses import dataclass, field, InitVar
from langchain_core.documents import Document

@dataclass
class DRAG_Demonstration:
    query: str
    answer: str
    retriever: InitVar[VectorStoreRetriever] = field(kw_only=True)
    documents: list[Document] = field(default_factory=list, kw_only=True)

    def __post_init__(self, retriever: VectorStoreRetriever):
        if not self.documents:
            self.documents = retriever.invoke(self.query)

    def __format__(self, format_spec: str) -> str:
        formatted_documents = "\n".join(
            f"Document {i+1}:\n{document.page_content}"
            for i, document in enumerate(self.documents)
        )
        return f"""\
{formatted_documents}
Question: {self.query}
Answer: {self.answer}
"""

def create_enhanced_drag_demonstrations(vector_db: VectorStore) -> list[DRAG_Demonstration]:
    """Create high-quality demonstrations for DRAG technique that showcase effective document analysis"""
    demonstration_retriever: VectorStoreRetriever = vector_db.as_retriever(search_kwargs={"k": 5})
    demonstrations = [
        DRAG_Demonstration(
            query="How did the COVID-19 pandemic impact Midwest Food Bank's operations in 2020?",
            answer="The COVID-19 pandemic significantly impacted Midwest Food Bank's operations in 2020. Despite challenges, MFB remained open and responsive to increased needs. They implemented safety protocols, reduced volunteer numbers for social distancing, and altered their distribution model to allow partner agencies to receive food safely. The pandemic created unprecedented food insecurity, with many people seeking assistance for the first time. MFB distributed 37% more food than in 2019, with a record 179 semi-loads of Disaster Relief family food boxes sent nationwide. The organization also faced supply chain disruptions and food procurement challenges in the early months but continued to find and distribute food. Community, business, and donor support helped fund operations and food purchases. Additionally, MFB began participating in the USDA Farmers to Families Food Box program in May 2020, distributing over $52 million worth of nutritious produce, protein, and dairy products.",
            retriever=demonstration_retriever
        ),
        DRAG_Demonstration(
            query="What role did volunteers play at Midwest Food Bank during 2020, and how were they affected by the pandemic?",
            answer="Volunteers were described as 'the life-blood of the organization' in the 2020 annual report. Despite the pandemic creating safety challenges, volunteers demonstrated courage and dedication by increasing their hours to meet growing needs. MFB implemented safety protocols at each location and limited volunteer group sizes to allow for social distancing. This created a challenge as food needs increased while fewer volunteers were available to help. To address this gap, multiple MFB locations received assistance from the National Guard, who filled vital volunteer positions driving trucks, operating forklifts, and helping with food distributions. In 2020, 17,930 individuals volunteered 300,898 hours of service, equivalent to 150 full-time employees. The volunteer-to-staff ratio was remarkable with 450 volunteers for every 1 paid MFB staff member, highlighting the volunteer-driven nature of the organization during the crisis.",
            retriever=demonstration_retriever
        ),
        DRAG_Demonstration(
            query="How did Midwest Food Bank's international programs perform during 2020, particularly in Haiti and East Africa?",
            answer="In 2020, Midwest Food Bank's international operations in East Africa and Haiti faced unique challenges but continued to serve communities. In East Africa (operated as Kapu Africa), strict lockdowns led to mass hunger, especially in slum areas. Kapu Africa distributed 7.2 million Tender Mercies meals, working with partner ministries to share food in food-insecure slums. A notable outcome was a spiritual awakening among recipients, with many asking why they were receiving help. In Haiti, the pandemic added to existing challenges, closing airports, seaports, factories, and schools. MFB Haiti more than doubled its food shipments to Haiti, delivering over 160 tons of food relief, nearly three-quarters being Tender Mercies meals. As Haitian children primarily receive nourishment from school lunches, MFB Haiti distributed Tender Mercies through faith-based schools and also partnered with over 20 feeding centers serving approximately 1,100 children daily. Nearly 1 million Tender Mercies meals were distributed in Haiti during 2020.",
            retriever=demonstration_retriever
        ),
    ]

    return demonstrations

Step 2: Format the demonstrations for inclusion in the prompt

We then format all the demonstrations together for the prompt.

# Format all demonstrations together
demonstrations = create_enhanced_drag_demonstrations(vector_db)

formatted_demonstrations = "\n\n".join(
    f"Example {i+1}:\n{demo}"
    for i, demo in enumerate(demonstrations)
)

Step 3: Create the DRAG prompt template

Next, we create the DRAG prompt for the model, which includes the formatted demonstration examples.

drag_prompt = tokenizer.apply_chat_template(
    conversation=[{
        "role": "user",
        "content": f"""\
Here are examples of effectively extracting information from documents to answer questions.

{formatted_demonstrations}

Follow these examples when answering the user's question:

{{input}}""",
    }],
    documents=[{
        "doc_id": "0",
        "text": "Placeholder{context}",
    }],
    add_generation_prompt=True,
    tokenize=False,
)

# Convert to prompt template
drag_prompt_template = PromptTemplate.from_template(template=escape_f_string(drag_prompt, "input", "context"))
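The DRAG prompt mixes two substitution stages: the demonstrations are filled in immediately by an f-string, while `{{input}}` is left behind as a literal `{input}` placeholder for later. The sketch below isolates that behavior with plain Python. The `escape_braces` helper is a hypothetical illustration of why literal braces in the inserted text need escaping before a later `.format` call; it is not the implementation of the notebook's `escape_f_string` utility, which is not shown here.

```python
def escape_braces(text: str) -> str:
    """Double any braces so a later str.format call treats them as literal text."""
    return text.replace("{", "{{").replace("}", "}}")


# Demonstration text that happens to contain braces.
examples = 'JSON like {"a": 1}'

# Stage 1: the f-string fills in the examples now; {{input}} collapses to {input}.
template = f"Examples:\n{escape_braces(examples)}\n\nQuestion: {{input}}"
assert "{input}" in template

# Stage 2: the remaining placeholder is filled at query time.
final_prompt = template.format(input="Q?")
print(final_prompt)
```

Without the escaping step, the second stage would try to interpret the braces inside the demonstration text as placeholders and fail.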

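Step 4: Create a custom retriever that reorders documents

Normally, the retriever returns documents in similarity order, with the most similar document first. We define a reordering retriever to reverse the order of the results. The most similar document then appears last, and therefore closest to the end of the prompt.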

import typing
from langchain_core.retrievers import BaseRetriever, RetrieverInput, RetrieverOutput
from langchain_core.callbacks.manager import CallbackManagerForRetrieverRun

class ReorderingRetriever(BaseRetriever):
    base_retriever: BaseRetriever

    def _get_relevant_documents(
        self, query: RetrieverInput, *, run_manager: CallbackManagerForRetrieverRun, **kwargs: typing.Any
    ) -> RetrieverOutput:
        docs = self.base_retriever._get_relevant_documents(query, run_manager=run_manager, **kwargs)
        return list(reversed(docs))  # Reverse the order so higher-ranked docs are closer to query in prompt

reordering_retriever = ReorderingRetriever(base_retriever=retriever)
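The effect of the reordering can be seen in isolation with a plain list. In this minimal sketch the document IDs and similarity scores are made up, and `reorder_for_prompt` is a hypothetical stand-in for what the retriever class does to its results.

```python
# Hypothetical retrieval results, most similar first.
ranked = [("doc-a", 0.91), ("doc-b", 0.78), ("doc-c", 0.52)]


def reorder_for_prompt(ranked_docs: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Reverse the ranking so the best match ends up last, nearest the question."""
    return list(reversed(ranked_docs))


prompt_order = [doc_id for doc_id, _ in reorder_for_prompt(ranked)]
print(prompt_order)
```

Since the question is appended after the documents, the last document in this order sits closest to the question in the final prompt.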

Step 5: Create the DRAG pipeline

We create the pipeline for the DRAG query by using the DRAG prompt template and the reordering retriever.

drag_combine_docs_chain = create_stuff_documents_chain(
    llm=model,
    prompt=drag_prompt_template,
    document_prompt=document_prompt_template,
    document_separator=document_separator,
)

drag_chain = create_retrieval_chain(
    retriever=reordering_retriever,
    combine_docs_chain=drag_combine_docs_chain,
)

Step 6: Generate a DRAG-enhanced answer to the question

drag_outputs = drag_chain.invoke({"input": query})
print("\n=== DRAG-Enhanced Answer ===")
print(drag_outputs['answer'])

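With the examples provided, the answer shows some improvement. Next, let's try an even more thorough RAG technique.

Implementing IterDRAG (iterative demonstration-based RAG)

IterDRAG extends DRAG by decomposing complex queries into simpler sub-queries and performing interleaved retrieval. This approach is particularly effective for complex multi-hop questions that require integrating information from multiple sources or reasoning across several steps.

Key benefits of the iterative approach:

Breaks complex questions down into manageable pieces.
Retrieves more relevant information for each sub-question.
Creates explicit reasoning chains.
Handles questions that would be difficult to answer in a single step.

Step 1: Create a query decomposition chain

The decomposition step is critical: it takes a complex query and breaks it into simpler, more focused sub-queries that can be answered individually.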

decompose_prompt = tokenizer.apply_chat_template(
    conversation=[{
        "role": "user",
        "content": """\
You are a helpful assistant that breaks down complex questions into simpler sub-questions.
For multi-part or complex questions, generate 1-3 sub-questions that would help answer the main question.

Here are examples of how to decompose complex questions:
{demonstrations}

Follow the above examples when breaking down the user's question.
If the following question is already simple enough, just respond with "No follow-up needed."

Otherwise, break down the following question into simpler sub-questions. Format your response as:
Follow up: [sub-question]

Question: {input}"""
    }],
    add_generation_prompt=True,
    tokenize=False,
)

decompose_prompt_template = PromptTemplate.from_template(template=escape_f_string(decompose_prompt, "input", "demonstrations"))
decompose_chain = decompose_prompt_template | model
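The decomposition prompt asks the model to emit lines of the form `Follow up: [sub-question]`, so the model's raw text has to be parsed back into a list. The sketch below shows one way to do that with the same regular expression that this tutorial's IterDRAG function uses. The sample model outputs are made up, and `parse_sub_questions` is a hypothetical helper name.

```python
import re

# Capture the text after each "Follow up: " marker, stopping at the next
# marker, a newline, or the end of the string.
FOLLOW_UP_PATTERN = re.compile(r"Follow up: (.*?)(?=Follow up:|\n|$)", re.DOTALL)


def parse_sub_questions(model_output: str) -> list[str]:
    """Extract non-empty sub-questions from a decomposition response."""
    return [m.strip() for m in FOLLOW_UP_PATTERN.findall(model_output) if m.strip()]


sample = (
    "Follow up: How did distribution volume change?\n"
    "Follow up: What challenges arose?\n"
)
print(parse_sub_questions(sample))
print(parse_sub_questions("No follow-up needed."))  # no markers, so an empty list
```

An empty list is a useful signal: it means either the model judged the question simple enough or it did not follow the requested format, and the caller can fall back to a single-step approach.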

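Step 2: Create a sub-query answering chain

The sub-query answering component handles each individual sub-question by retrieving relevant documents and generating a focused intermediate answer.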

intermediate_prompt = tokenizer.apply_chat_template(
    conversation=[{
        "role": "user",
        "content": """\
You are a helpful assistant that answers specific questions based on the provided documents.

Focus only on the sub-question and provide a concise intermediate answer.
Please answer the following sub-question based on the provided documents.
Format your response as:
Intermediate answer: [your concise answer to the sub-question]

Sub-question: {input}
"""
    }],
    documents=[{
        "doc_id": "0",
        "text": "Placeholder{context}",
    }],
    add_generation_prompt=True,
    tokenize=False,
)

intermediate_prompt_template = PromptTemplate.from_template(template=escape_f_string(intermediate_prompt, "input", "context"))
intermediate_combine_docs_chain = create_stuff_documents_chain(
    llm=model,
    prompt=intermediate_prompt_template,
    document_prompt=document_prompt_template,
    document_separator=document_separator,
)
intermediate_chain = create_retrieval_chain(
    retriever=reordering_retriever,
    combine_docs_chain=intermediate_combine_docs_chain,
)

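Step 3: Create a final answer generation chain

The final answer generation component combines all the intermediate answers to produce a comprehensive answer to the original question.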

final_prompt = tokenizer.apply_chat_template(
    conversation=[{
        "role": "user",
        "content": """\
You are a helpful assistant that provides comprehensive answers to questions.
Use the intermediate answers to sub-questions to formulate a complete final answer.
Please provide a final answer to the main question based on the intermediate answers to sub-questions.
Format your response as:
So the final answer is: [your comprehensive answer to the main question]

Main question: {input}

Sub-questions and intermediate answers:
{context}"""
    }],
    add_generation_prompt=True,
    tokenize=False,
)

final_prompt_template = PromptTemplate.from_template(template=escape_f_string(final_prompt, "input", "context"))
final_chain = final_prompt_template | model

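Step 4: Create example demonstrations for IterDRAG

Creating effective demonstrations is crucial to IterDRAG's performance. These examples show the model how to:

Break complex questions down into simpler sub-questions.
Generate relevant intermediate answers.
Combine those answers into a coherent final response.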

@dataclass
class IterDRAG_Demonstration_Base:
    query: str
    answer: str

@dataclass
class IterDRAG_Demonstration(IterDRAG_Demonstration_Base):
    intermediate: list[IterDRAG_Demonstration_Base]

    def __format__(self, format_spec: str) -> str:
        sub_questions="\n".join(
            f"Follow up: {sub.query}"
            for sub in self.intermediate
        )

        return f"Question: {self.query}\n{sub_questions}"

def create_iterdrag_demonstrations() -> list[IterDRAG_Demonstration]:
    """Create examples showing how to decompose and answer complex questions"""

    demonstrations = [
        IterDRAG_Demonstration(
            query="What impact did the pandemic have on the food bank's operations and distribution?",
            answer="The pandemic had a profound impact on food bank operations and distribution. Distribution volume increased by 60% to over 100 million pounds of food in 2020. Operationally, the food bank faced supply chain disruptions, volunteer shortages, and safety protocol challenges. In response, they implemented contactless distribution, expanded mobile pantries, created emergency food boxes for vulnerable populations, and developed virtual nutrition education. Despite these challenges, they successfully scaled operations to meet the unprecedented community need during the crisis.",
            intermediate=[
                IterDRAG_Demonstration_Base(
                    query="How did food distribution volume change during the pandemic?",
                    answer="Food distribution volume increased by 60% during the pandemic, rising from approximately 62 million pounds in 2019 to over 100 million pounds in 2020.",
                ),
                IterDRAG_Demonstration_Base(
                    query="What operational challenges did the food bank face during the pandemic?",
                    answer="The food bank faced challenges including supply chain disruptions, volunteer shortages due to social distancing requirements, and the need to implement new safety protocols for food handling and distribution.",
                ),
                IterDRAG_Demonstration_Base(
                    query="What new programs were implemented in response to the pandemic?",
                    answer="New programs included contactless distribution methods, expanded mobile pantry operations, emergency food boxes for vulnerable populations, and virtual nutrition education classes.",
                ),
            ],
        ),
        IterDRAG_Demonstration(
            query="How does the food bank's financial management compare to industry standards for non-profits?",
            answer="The food bank demonstrates excellent financial management compared to industry standards. With 94% of its budget allocated to program services and only 6% to administrative and fundraising costs, it exceeds the industry benchmark of 85-90% for program spending. This financial efficiency places the food bank among the top-performing non-profits in terms of maximizing donor impact and minimizing overhead expenses.",
            intermediate=[
                IterDRAG_Demonstration_Base(
                    query="What percentage of the food bank's budget goes to program services versus administrative costs?",
                    answer="94% of the food bank's budget goes directly to program services, with only 6% allocated to administrative and fundraising costs.",
                ),
                IterDRAG_Demonstration_Base(
                    query="What are the industry standards for program spending versus overhead for food banks?",
                    answer="Industry standards suggest that well-run food banks typically allocate 85-90% of their budget to program services, with 10-15% for administrative and fundraising expenses.",
                ),
            ],
        ),
    ]
    return demonstrations
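The demonstration classes rely on Python's `__format__` hook, which is what an f-string calls when an object appears inside `{...}`. The following minimal sketch shows that mechanism on a simplified, hypothetical `Demo` class. As in the classes above, only the question and its follow-up questions are rendered; the answers are not part of the formatted text.

```python
from dataclasses import dataclass, field


@dataclass
class Demo:
    """Simplified stand-in for a decomposition demonstration."""
    query: str
    sub_queries: list[str] = field(default_factory=list)

    def __format__(self, format_spec: str) -> str:
        # Called by format() and by f-strings such as f"{demo}".
        follow_ups = "\n".join(f"Follow up: {q}" for q in self.sub_queries)
        return f"Question: {self.query}\n{follow_ups}"


demo = Demo("Main question?", ["First sub-question?", "Second sub-question?"])
rendered = f"Example 1:\n{demo}"
print(rendered)
```

Keeping the rendering logic on the class means the code that assembles the prompt only needs a plain f-string.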

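Step 5: Implement the IterDRAG function

This function orchestrates the entire iterative process:

Decompose the main question into sub-questions.
For each sub-question, retrieve relevant documents and generate an intermediate answer.
Combine all the intermediate answers to produce the final answer.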

import re

def iterative_drag(main_question: str) -> dict[str, typing.Any]:
    """
    Implements IterDRAG: decomposing queries, retrieving documents for sub-queries,
    and generating a final answer based on intermediate answers.
    """
    print(f"\n=== Processing query with IterDRAG: '{main_question}' ===")

    # Step 1: Decompose the main question into sub-questions
    print("Step 1: Decomposing the query into sub-questions...")
    iterdrag_demonstrations = create_iterdrag_demonstrations()
    formatted_demonstrations = "\n\n".join(
        f"Example {i+1}:\n{demo}"
        for i, demo in enumerate(iterdrag_demonstrations)
    )
    decompose_result = decompose_chain.invoke({
        "input": main_question,
        "demonstrations": formatted_demonstrations,
    })
    decompose_answer = decompose_result

    # Extract sub-questions using regex
    sub_questions = re.findall(r"Follow up: (.*?)(?=Follow up:|\n|$)", decompose_answer, re.DOTALL)
    sub_questions = [sq.strip() for sq in sub_questions if sq.strip()]
    if not sub_questions:
        print("No decomposition needed or found. Using standard DRAG approach.")
        return drag_chain.invoke({"input": main_question})
    print(f"Decomposed into {len(sub_questions)} sub-questions")

    # Step 2: Answer each sub-question
    intermediate_pairs: list[dict[str, str]] = []
    for i, sub_question in enumerate(sub_questions):
        print(f"\nStep 2.{i+1}: Processing sub-question: '{sub_question}'")

        # Generate answer for this sub-question
        intermediate_result = intermediate_chain.invoke({"input": sub_question})
        intermediate_answer = intermediate_result["answer"]

        # Extract intermediate answer using regex
        intermediate_answer_match = re.search(r"Intermediate answer: (.*?)$", intermediate_answer, re.DOTALL)
        if intermediate_answer_match:
            intermediate_answer = intermediate_answer_match.group(1).strip()

        print(f"Generated intermediate answer: {intermediate_answer[:100]}...")

        # Store the sub-question and its answer
        intermediate_pairs.append({"input": sub_question, "answer": intermediate_answer})

    # Step 3: Generate the final answer based on sub-question answers
    print("\nStep 3: Generating final answer based on intermediate answers...")
    final_result = final_chain.invoke({
        "input": main_question,
        "context": "\n\n".join(
            f"Sub-question: {pair['input']}\nIntermediate answer: {pair['answer']}"
            for pair in intermediate_pairs
        ),
    })
    final_answer = final_result

    # Extract final answer
    final_answer_match = re.search(r"So the final answer is: (.*?)$", final_answer, re.DOTALL)
    if final_answer_match:
        final_answer = final_answer_match.group(1).strip()

    return {"input": main_question, "answer": final_answer, "intermediate": intermediate_pairs}
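The parsing steps in this function only work if the model follows the `Follow up:` and `Intermediate answer:` output format. As a minimal sketch, the same regular expressions can be exercised in isolation on a hand-written sample. The helper names `extract_sub_questions` and `extract_intermediate_answer` and the sample strings are illustrative assumptions, not part of the tutorial's code.

```python
import re

def extract_sub_questions(decompose_answer: str) -> list[str]:
    """Pull every 'Follow up:' entry out of a decomposition output."""
    found = re.findall(
        r"Follow up: (.*?)(?=Follow up:|\n|$)", decompose_answer, re.DOTALL
    )
    return [sq.strip() for sq in found if sq.strip()]

def extract_intermediate_answer(text: str) -> str:
    """Return the text after 'Intermediate answer:', or the input unchanged."""
    match = re.search(r"Intermediate answer: (.*?)$", text, re.DOTALL)
    return match.group(1).strip() if match else text

# Hypothetical model output, written by hand for illustration only
sample = (
    "Follow up: What percentage of the budget goes to program services?\n"
    "Follow up: What are the industry standards for program spending?\n"
)
sub_questions = extract_sub_questions(sample)
print(sub_questions)

# With no 'Follow up:' marker the list is empty, which is the case where
# iterative_drag falls back to the DRAG chain.
print(extract_sub_questions("No decomposition here."))
print(extract_intermediate_answer("Intermediate answer: 94% goes to program services."))
```

Testing the patterns this way can help catch format drift before it silently triggers the DRAG fallback on every query.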

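Comparing the RAG approaches

Now that all three RAG approaches are set up, let's compare their responses to the same, more complex query to see how they differ.

The comparison helps clarify the benefits of each approach and when each one is the best fit.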

# Run all approaches on the same complex query
comparison_query = "What was the full impact chain of the National Guard's assistance during the pandemic? Specifically, how did their involvement affect volunteer operations, what specific tasks did they perform, and how did this ultimately translate to community impact in terms of food distribution capabilities and reach?"

print("\n=== Standard RAG ===")
standard_result = rag_chain.invoke({"input": comparison_query})
print(standard_result["answer"])

print("\n=== DRAG ===")
drag_result = drag_chain.invoke({"input": comparison_query})
print(drag_result["answer"])

print("\n=== IterDRAG ===")
iterdrag_result = iterative_drag(comparison_query)
print(iterdrag_result["answer"])
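Because inference scaling trades extra compute for quality, it can be useful to record latency alongside each answer. The following is a minimal sketch, assuming a hypothetical `timed` helper and a stand-in `fake_chain` function so that the block runs without a model. Neither name is part of the tutorial's code.

```python
import time
from typing import Any, Callable

def timed(label: str, fn: Callable[[str], Any], query: str) -> dict[str, Any]:
    """Run fn(query) and record wall-clock time next to the result."""
    start = time.perf_counter()
    result = fn(query)
    elapsed = time.perf_counter() - start
    return {"approach": label, "seconds": elapsed, "result": result}

# Stand-in for a real chain (assumption, for illustration only)
def fake_chain(query: str) -> dict[str, str]:
    return {"input": query, "answer": f"stub answer for: {query}"}

rows = [
    timed(name, fake_chain, "example query")
    for name in ("Standard RAG", "DRAG", "IterDRAG")
]
for row in rows:
    print(f"{row['approach']:<14} {row['seconds']:.4f}s  {row['result']['answer'][:60]}")
```

In the tutorial's setting, `fake_chain` would be swapped for callables such as `lambda q: rag_chain.invoke({"input": q})`, `lambda q: drag_chain.invoke({"input": q})`, and `iterative_drag`.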

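Comparing and analyzing the results

Here we summarize the performance differences among the three RAG approaches implemented above.

Approach	Strengths	Limitations	Best use cases
Standard RAG	Simple to implement; works well for simple queries; low compute requirements	Limited context utilization; performance plateaus as more documents are added; poorly suited to complex reasoning	Simple fact-based queries; limited compute budgets; small contexts
DRAG	Better context utilization; performance improves as more documents are added; suited to moderately complex queries	Still limited by single-step generation; less effective for multi-hop questions	Moderately complex queries; many documents available; in-context examples can be provided
IterDRAG	Best for complex queries; explicit reasoning chain; most effective context utilization	Highest compute requirements; more complex implementation	Multi-hop questions; complex analysis that requires compound reasoning; when maximum performance is needed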

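As we have seen, inference scaling techniques such as DRAG and IterDRAG can substantially improve RAG performance. They are particularly well suited to complex queries that require detailed analysis of multiple documents.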
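To make the compute difference concrete, the number of generation calls per query can be counted for the implementation in this tutorial. This is a rough sketch that assumes each chain invocation corresponds to one LLM generation call and ignores retrieval cost and prompt length. The function name `generation_calls` is hypothetical.

```python
def generation_calls(approach: str, num_sub_questions: int = 0) -> int:
    """Count LLM generation calls per query (retrieval calls are not counted)."""
    if approach in ("standard_rag", "drag"):
        return 1
    if approach == "iterdrag":
        # One decomposition call + one call per sub-question + one final call.
        # With zero sub-questions, the second call is the DRAG fallback.
        return 1 + num_sub_questions + 1
    raise ValueError(f"unknown approach: {approach}")

for n in (2, 3, 5):
    print(
        f"IterDRAG with {n} sub-questions: {generation_calls('iterdrag', n)} calls "
        f"vs {generation_calls('drag')} for DRAG"
    )
```

Under these assumptions, IterDRAG's cost grows linearly with the number of sub-questions, which is one reason to reserve it for genuinely multi-hop queries.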

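Summary

This tutorial showed how inference scaling can markedly improve RAG performance. By strategically allocating additional compute at inference time with techniques such as DRAG and IterDRAG, you can substantially improve response quality for complex queries.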

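Challenges with traditional RAG and transformer-based models

Expensive inference: Transformer-based models rely on self-attention, whose inference cost scales quadratically with input length. This makes long contexts costly to process, which can limit practical RAG applications to shorter documents or force aggressive truncation.

Limited context utilization: Standard RAG systems often retrieve and process a fixed number of documents, which can be insufficient for complex, multi-hop queries. Because they are poorly suited to synthesizing information across many retrieved passages, performance plateaus as the context grows, especially beyond 128,000 tokens.

Inefficient compute allocation: Without careful allocation, adding more retrieved documents or context simply raises computational cost without a proportional gain in accuracy, and the resulting information overload can lead to diminishing returns or even degraded performance.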
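The quadratic cost mentioned under "Expensive inference" can be illustrated with simple arithmetic. The sketch below counts only token-to-token attention pairs. It is a deliberate simplification that ignores constant factors, feed-forward layers, key-value caching, and optimized attention kernels, and the function name `attention_pairs` is hypothetical.

```python
def attention_pairs(num_tokens: int) -> int:
    """Number of token-to-token interactions in full self-attention."""
    return num_tokens * num_tokens

baseline = 8_000
for n in (8_000, 32_000, 128_000):
    ratio = attention_pairs(n) / attention_pairs(baseline)
    print(f"{n:>7} tokens -> {ratio:.0f}x the pairwise attention work of {baseline} tokens")
```

Under this simplified model, a context 16 times longer (8,000 to 128,000 tokens) requires 256 times the pairwise attention work, which is why simply adding more documents gets expensive quickly.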

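How DRAG and IterDRAG address these challenges

Demonstration-based RAG (DRAG):

DRAG places multiple retrieved examples, each with its question and answer, in the prompt as demonstrations so that the model can learn in context how to locate and apply relevant information.

This approach is especially effective at shorter effective context lengths, because it lets the model use rich context without overwhelming the attention mechanism, improving both retrieval and generation quality.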
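The structure of a DRAG prompt can be sketched in plain Python. This is an illustrative assumption about the layout (demonstrations first, then the test query with its own retrieved documents), not the exact template used by the tutorial's chains. The `Demonstration` class, the helper names, and the sample documents are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Demonstration:
    documents: list[str]  # passages retrieved for the example question
    question: str
    answer: str

def format_block(documents: list[str], question: str, answer: str = "") -> str:
    """Render retrieved documents, then the question, then the answer slot."""
    docs = "\n".join(f"Document {i + 1}: {doc}" for i, doc in enumerate(documents))
    return f"{docs}\nQuestion: {question}\nAnswer: {answer}".rstrip()

def build_drag_prompt(
    demos: list[Demonstration], documents: list[str], question: str
) -> str:
    """Place worked demonstrations before the test query and its documents."""
    parts = [
        f"Example {i + 1}:\n{format_block(d.documents, d.question, d.answer)}"
        for i, d in enumerate(demos)
    ]
    parts.append(format_block(documents, question))  # answer left open for the model
    return "\n\n".join(parts)

demos = [
    Demonstration(
        documents=["94% of the budget goes to program services."],
        question="What share of the budget funds program services?",
        answer="94%",
    )
]
prompt = build_drag_prompt(
    demos,
    documents=["The National Guard assisted during the pandemic."],
    question="Who assisted during the pandemic?",
)
print(prompt)
```

Each demonstration shows the model a complete documents-question-answer pattern, so the final block with its open `Answer:` slot invites the same behavior on the new query.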

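Iterative demonstration-based RAG (IterDRAG):

IterDRAG decomposes a complex query into simpler sub-queries, then iteratively retrieves documents and generates an answer for each sub-step.

By interleaving retrieval and generation, IterDRAG builds a reasoning chain that bridges the gaps in multi-hop queries, which makes it particularly effective for very long contexts.

This process lets the model allocate computation more efficiently, focus on the most relevant information at each step, and avoid the risk of attention overload in long contexts. Applying these inference scaling techniques to RAG applications can substantially improve performance on knowledge-intensive tasks without changing the underlying model.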

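Next steps:

Experiment with different retrieval models and document preprocessing approaches
Try different prompt formulations for image understanding
Explore model parameter optimization to find the best settings for your specific use case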


Footnotes

1. “A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems,” Zixuan Ke, Fangkai Jiao, Yifei Ming, Xuan-Phi Nguyen, Austin Xu, Do Xuan Long, Minzhi Li, et al., ArXiv.org, 2025.

2. “Reasoning in Granite 3.2 Using Inference Scaling,” Luis Lastras, IBM Research, February 26, 2025.

3. “Inference Scaling for Long-Context Retrieval Augmented Generation,” Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuanhui Wang, Michael Bendersky, ArXiv.org, 2024.
