Docling과 Granite로 AI 기반 멀티모달 RAG 시스템 구축하기

작성자

Open Source Developer, STSM

Data Scientist

IBM

이 튜토리얼에서는 IBM의 Docling과 오픈 소스 Granite 비전, 텍스트 기반 임베딩, 생성형 AI 모델을 사용하여 RAG 시스템을 만듭니다. 이러한 모델은 다양한 오픈 소스 프레임워크를 통해 사용할 수 있습니다. 이 튜토리얼에서는 Replicate를 사용해 IBM Granite의 비전 및 생성형 AI 모델과 연결하고, HuggingFace를 사용해 임베딩 모델에 연결합니다.

멀티모달 검색 증강 생성

검색 증강 생성(RAG)은 대규모 언어 모델(LLM)과 함께 사용되는 기술로, 미세 조정 없이도 LLM 훈련에 사용한 데이터 외부의 정보에 대한 지식 기반과 모델을 연결합니다. 기존 RAG는 텍스트 요약과 챗봇과 같은 텍스트 기반 사용 사례로 제한됩니다.

멀티모달 RAG는 여러 유형의 데이터에서 정보를 처리하여 RAG에 사용되는 외부 지식기반의 일부로 포함하기 위해 멀티모달LLM(MLLM)을 사용할 수 있습니다. 멀티모달 데이터에는 텍스트, 이미지, 오디오, 비디오, 그리고 기타 형식이 포함될 수 있습니다. 인기 있는 멀티모달 LLM으로는 Google의 Gemini, Meta의 Llama 3.2, OpenAI의 GPT-4와 GPT-4o가 있습니다.

이 레시피에서는 다양한 모달리티를 처리할 수 있는 IBM Granite 모델을 사용합니다. PDF의 비정형 데이터에서 실시간 사용자 쿼리에 응답하는 AI 시스템을 생성합니다.

튜토리얼 개요

Granite 튜토리얼에 오신 것을 환영합니다. 이 튜토리얼에서는 최신 도구의 기능을 활용하여 AI 기반 멀티모달 RAG 파이프라인을 구축하는 방법을 알아봅니다. 이 튜토리얼에서는 다음 프로세스를 안내합니다.

문서 전처리: Docling을 사용하여 다양한 소스의 문서를 처리하고, 사용 가능한 형식으로 구문 분석 및 변환하고, 벡터 데이터베이스에 저장하는 방법을 알아봅니다. Granite MLLM을 사용하여 문서에 있는 이미지의 이미지 설명을 생성합니다.
RAG: Granite와 같은 LLM을 외부 지식 기반과 연결하여 쿼리 응답을 개선하고 가치 있는 인사이트를 생성하는 방법을 이해합니다.
워크플로 통합을 위한 LangChain: LangChain을 사용하여 문서 처리 및 검색 워크플로를 간소화하고 오케스트레이션하며, 시스템의 여러 구성 요소 간에 원활한 상호 작용을 가능하게 하는 방법을 알아봅니다.

이 튜토리얼에서는 세 가지 최첨단 기술을 사용합니다.

Docling: 문서를 구문 분석하고 변환하는 데 사용되는 오픈 소스 툴킷입니다.
Granite: 자연어 처리 기능이 뛰어나고, 이미지를 텍스트로 변환하는 비전 언어 모델을 포함한 최첨단 LLM입니다.
LangChain: 언어 모델로 구동되는 애플리케이션을 구축하는 데 사용되는 강력한 프레임워크로, 복잡한 워크플로를 단순화하고 외부 도구를 원활하게 통합하도록 설계되었습니다.

이 튜토리얼을 마치면 다음을 수행할 수 있습니다.

문서 전처리, 청크화, 이미지 이해 능력을 습득합니다.
벡터 데이터베이스를 통합하여 검색 능력을 개선합니다.
RAG를 사용하여 실제 애플리케이션에 효율적이고 정확한 데이터 검색을 수행합니다.

이 튜토리얼은 문서 관리 및 고급 자연어 처리(NLP) 기술에 대한 지식을 향상하고자 하는 AI 개발자, 연구원, 그리고 AI에 관심 있는 분들을 위해 마련되었습니다. 튜토리얼은 IBM Granite 커뮤니티의 Granite Snack Cookbook GitHub(Jupyter Notebook)에서도 확인할 수 있습니다.

전제조건

Python 프로그래밍에 익숙함.
LLM, NLP 개념, 컴퓨팅 비전에 대한 기본적인 이해.

단계

1단계: 종속성 설치

! echo "::group::Install Dependencies"
%pip install uv
! uv pip install git+https://github.com/ibm-granite-community/utils.git \
    transformers \
    pillow \
    langchain_classic \
    langchain_core \
    langchain_huggingface sentence_transformers \
    langchain_milvus 'pymilvus[milvus_lite]' \
    docling \
    'langchain_replicate @ git+https://github.com/ibm-granite-community/langchain-replicate.git'
! echo "::endgroup::"

2단계: AI 모델 선택

로깅

로깅 정보를 확인하려면 INFO 로그 레벨을 설정할 수 있습니다.

참고: 이 셀 실행은 건너뛰어도 괜찮습니다.

import logging

logging.basicConfig(level=logging.INFO)

Granite 모델 불러오기

텍스트 임베딩 벡터를 생성하는 데 사용할 임베딩 모델을 지정합니다. 여기서는 Granite 임베딩 모델 중 하나를 사용하겠습니다.

다른 임베딩 모델을 사용하려면 이 코드 셀을 이 임베딩 모델 레시피의 코드 셀로 교체합니다.

from langchain_huggingface import HuggingFaceEmbeddings
from transformers import AutoTokenizer

embeddings_model_path = “ibm-granite/granite-embedding-30m-english”
embeddings_model = HuggingFaceEmbeddings(
model_name=embeddings_model_path,
)
embeddings_tokenizer = AutoTokenizer.from_pretrained(embeddings_model_path)

이미지 이해에 사용할 MLLM을 지정합니다. 우리는 Granite 비전 모델을 사용합니다.

from ibm_granite_community.notebook_utils import get_env_var
from langchain_community.llms import Replicate
from transformers import AutoProcessor

vision_model_path = “ibm-granite/granite-vision-3.2-2b”
vision_model = Replicate(
    model=vision_model_path,
    replicate_api_token=get_env_var(“REPLICATE_API_TOKEN”),
    model_kwargs={
        “max_tokens”: embeddings_tokenizer.max_len_single_sentence, # Set the maximum number of tokens to generate as output.
        “min_tokens”: 100, # Set the minimum number of tokens to generate as output.
    },
)
vision_processor = AutoProcessor.from_pretrained(vision_model_path)

RAG 생성 작업에 사용할 언어 모델을 지정합니다. 여기서는 Replicate LangChain 클라이언트를 사용하여 Replicate에 있는 ibm-granite조직의 Granite 모델에 연결합니다.

Replicate를 설정하려면 Replicate 시작하기를 참조하세요. Replicate가 아닌 다른 제공업체의 모델에 연결하려면 이 코드 셀을 LLM 컴포넌트 레시피의 코드 셀로 대체합니다.

Replicate가 아닌 다른 제공업체의 모델에 연결하려면 이 코드 셀을 LLM 컴포넌트 레시피의 코드 셀로 대체합니다.

from langchain_replicate import ChatReplicate

model_path = "ibm-granite/granite-4.0-h-small"
model = ChatReplicate(
    model=model_path,
    replicate_api_token=get_env_var("REPLICATE_API_TOKEN"),
    model_kwargs={
        "max_tokens": 1000, # Set the maximum number of tokens to generate as output.
        "min_tokens": 100, # Set the minimum number of tokens to generate as output.
    },
)

3단계: 벡터 데이터베이스용 문서 준비

이 예시에서는 소스 문서 세트에서 Docling을 사용하여 문서를 텍스트와 이미지로 변환합니다. 그런 다음 텍스트가 청크로 분할됩니다. 이미지는 MLLM에서 처리되어 이미지 요약을 생성합니다.

Docling으로 문서를 다운로드하고 텍스트 및 이미지로 변환하기

Docling은 PDF 문서를 다운로드하고 처리하여, 문서에 포함된 텍스트와 이미지를 얻을 수 있도록 해 줍니다. PDF에는 텍스트, 테이블, 그래프, 이미지 등 다양한 데이터 유형이 있습니다.

from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions

pdf_pipeline_options = PdfPipelineOptions(
  do_ocr=False,
    generate_picture_images=True,
)
format_options = {
    InputFormat.PDF: PdfFormatOption(pipeline_options=pdf_pipeline_options),
}
converter = DocumentConverter(format_options=format_options)

sources = [
    “https://midwestfoodbank.org/images/AR_2020_WEB2.pdf”,
]
conversions = { source: converter.convert(source=source).document for source in sources }

문서가 처리되면 문서의 텍스트 요소를 추가로 처리합니다. 사용 중인 임베딩 모델에 적합한 크기로 분할합니다. LangChain 문서 목록은 텍스트 청크에서 생성됩니다.

from docling_core.transforms.chunker.hybrid_chunker import HybridChunker
from docling_core.types.doc.document import TableItem
from langchain_core.documents import Document

doc_id = 0
texts: list[Document] = []
for source, docling_document in conversions.items():
    for chunk in HybridChunker(tokenizer=embeddings_tokenizer).chunk(docling_document):
        items = chunk.meta.doc_items
        if len(items) == 1 and isinstance(items[0], TableItem):
            continue # we will process tables later
        refs = “ “.join(map(lambda item: item.get_ref().cref, items))
        print(refs)
        text = chunk.text
        document = Document(
            page_content=text,
            metadata={
                “doc_id”: (doc_id:=doc_id+1),
                “source”: source,
                “ref”: refs,
            },
        )
        texts.append(document)

print(f”{len(texts)} text document chunks created”)

다음으로 문서에 있는 모든 테이블을 처리합니다. 언어 모델에 전달하기 위해 테이블 데이터를 마크다운 형식으로 변환합니다. LangChain 문서 목록은 테이블의 마크다운 렌더링에서 생성됩니다.

from docling_core.types.doc.labels import DocItemLabel

doc_id = len(texts)
tables: list[Document] = []
for source, docling_document in conversions.items():
    for table in docling_document.tables:
        if table.label in [DocItemLabel.TABLE]:
            ref = table.get_ref().cref
            print(ref)
            text = table.export_to_markdown()
            document = Document(
                page_content=text,
                metadata={
                    “doc_id”: (doc_id:=doc_id+1),
                    “source”: source,
                    “ref”: ref
                },
            )
            tables.append(document)

print(f”{len(tables)} table documents created”)

마지막으로 문서에 있는 모든 이미지를 처리합니다. 여기서는 비전 언어 모델을 사용하여 이미지의 내용을 이해합니다. 이 예시에서 우리는 이미지의 텍스트 정보에 관심이 있습니다. 다양한 프롬프트 텍스트를 시험해 보면서 결과를 어떻게 개선할 수 있는지 확인해 볼 수도 있습니다.

참고: 이미지 처리는 이미지 수와 비전 언어 모델을 실행하는 서비스에 따라 매우 오랜 시간이 걸릴 수 있습니다.

import base64
import io
import PIL.Image
import PIL.ImageOps
from IPython.display import display

def encode_image(image: PIL.Image.Image, format: str = “png”) -> str:
    image = PIL.ImageOps.exif_transpose(image) or image
    image = image.convert(“RGB”)

    buffer = io.BytesIO()
    image.save(buffer, format)
    encoding = base64.b64encode(buffer.getvalue()).decode(“utf-8”)
    uri = f”data:image/{format};base64,{encoding}”
    return uri

# Feel free to experiment with this prompt
image_prompt = “If the image contains text, explain the text in the image.”
conversation = [
    {
        “role”: “user”,
        “content”: [
            {“type”: “image”},
            {“type”: “text”, “text”: image_prompt},
        ],
    },
]
vision_prompt = vision_processor.apply_chat_template(
    conversation=conversation,
    add_generation_prompt=True,
)
pictures: list[Document] = []
doc_id = len(texts) + len(tables)
for source, docling_document in conversions.items():
    for picture in docling_document.pictures:
        ref = picture.get_ref().cref
        print(ref)
        image = picture.get_image(docling_document)
        if image:
            text = vision_model.invoke(vision_prompt, image=encode_image(image))
            document = Document(
                page_content=text,
                metadata={
                    “doc_id”: (doc_id:=doc_id+1),
                    “source”: source,
                    “ref”: ref,
                },
            )
            pictures.append(document)

print(f”{len(pictures)} image descriptions created”)

그런 다음 입력 문서에서 생성된 LangChain 문서를 표시할 수 있습니다.

import itertools
from docling_core.types.doc.document import RefItem

# Print all created documents
for document in itertools.chain(texts, tables):
    print(f”Document ID: {document.metadata[‘doc_id’]}”)
    print(f”Source: {document.metadata[‘source’]}”)
    print(f”Content:\n{document.page_content}”)
    print(“=” * 80) # Separator for clarity

for document in pictures:
    print(f”Document ID: {document.metadata[‘doc_id’]}”)
    source = document.metadata[‘source’]
    print(f”Source: {source}”)
    print(f”Content:\n{document.page_content}”)
    docling_document = conversions[source]
    ref = document.metadata[‘ref’]
    picture = RefItem(cref=ref).resolve(docling_document)
    image = picture.get_image(docling_document)
    print(“Image:”)
    display(image)
    print(“=” * 80) # Separator for clarity

벡터 데이터베이스 채우기

임베딩 모델을 사용하여 텍스트 청크에서 문서를 로드하고 생성된 이미지 캡션을 벡터 데이터베이스에 로드합니다. 이 벡터 데이터베이스를 만들면 문서 전체에서 시맨틱 유사성 검색을 쉽게 수행할 수 있습니다.

참고: 벡터 데이터베이스를 채우는 데는 임베딩 모델과 서비스에 따라 다소 시간이 걸릴 수 있습니다.

벡터 데이터베이스 선택

임베딩 벡터를 저장하고 검색에 사용할 데이터베이스를 지정합니다.

Milvus가 아닌 다른 벡터 데이터베이스에 연결하려면 이 코드 셀을 이 벡터 스토어 레시피의 셀로 대체합니다.

import tempfile
from langchain_core.vectorstores import VectorStore
from langchain_milvus import Milvus

db_file = tempfile.NamedTemporaryFile(prefix=”vectorstore_”, suffix=”.db”, delete=False).name
print(f”The vector database will be saved to {db_file}”)

vector_db: VectorStore = Milvus(
    embedding_function=embeddings_model,
    connection_args={“uri”: db_file},
    auto_id=True,
    enable_dynamic_field=True,
    index_params={“index_type”: “AUTOINDEX”},
)

이제 텍스트, 테이블, 이미지 설명에 대한 모든 LangChain 문서를 벡터 데이터베이스에 추가합니다.

import itertools
documents = list(itertools.chain(texts, tables, pictures))
ids = vector_db.add_documents(documents)
print(f”{len(ids)} documents added to the vector database”)

4단계: Granite를 활용한 RAG

문서를 성공적으로 변환하고 벡터화했으니, 이제 RAG 파이프라인을 설정할 수 있습니다.

Granite용 RAG 파이프라인 생성

먼저 Granite가 RAG 쿼리를 수행하도록 프롬프트를 생성합니다. Granite 채팅 템플릿을 사용하고 LangChain RAG 파이프라인이 대체할 자리 표시자 값을 제공합니다.

다음으로 이전에 만든 Granite 프롬프트 템플릿을 사용하여 RAG 파이프라인을 구성합니다.

from ibm_granite_community.langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_classic.chains.retrieval import create_retrieval_chain
from langchain_core.prompts import ChatPromptTemplate

# Create a Granite prompt for question-answering with the retrieved context
prompt_template = ChatPromptTemplate.from_template("{input}")

# Assemble the retrieval-augmented generation chain
combine_docs_chain = create_stuff_documents_chain(
    llm=model,
    prompt=prompt_template,
)
rag_chain = create_retrieval_chain(
    retriever=vector_db.as_retriever(),
    combine_docs_chain=combine_docs_chain,
)

질문에 대한 검색 증강 응답 생성

파이프라인은 쿼리를 사용하여 벡터 데이터베이스에서 문서를 찾고 이를 쿼리의 컨텍스트로 활용합니다.

from ibm_granite_community.notebook_utils import wrap_text

output = rag_chain.invoke({"input": query})

print(wrap_text(output['answer']))

대단하군요! 소스 문서의 텍스트와 이미지에서 지식을 성공적으로 활용할 수 있는 AI 애플리케이션을 만들었습니다.

다음 단계

다른 산업 분야를 위한 고급 RAG 워크플로를 탐색합니다.
다른 문서 유형과 더 큰 데이터 세트로 실험해 봅니다.
더 나은 Granite 응답을 위해 프롬프트 엔지니어링을 최적화합니다.

생성형 AI + ML의 힘 활용하기

생성형 AI와 머신 러닝을 비즈니스에 자신 있게 통합하는 방법 알아보기

리소스

생성형 AI를 위한 CEO 가이드

생성형 AI가 창출할 수 있는 가치와 AI가 요구하는 투자 및 그로 인한 위험에서 CEO가 균형을 맞출 수 있는 방법을 알아보세요.

생성형 AI 기술 업그레이드

실습, 강좌, 가이드 프로젝트, 평가판 등을 통해 기본 개념을 배우고 기술을 쌓으세요.

생성형 AI + ML의 힘 활용하기

생성형 AI와 머신 러닝을 비즈니스에 자신 있게 통합하는 방법 알아보기

업무에 AI 활용: 생성형 AI로 ROI 향상

AI 투자에 대해 더 나은 수익을 얻고 싶으신가요? 주요 영역에서 차세대 AI를 확장하여 최고의 인재들이 혁신적인 새 솔루션을 구축하고 제공하도록 지원함으로써 변화를 주도하는 방법을 알아보세요.

2024년 AI 활용 현황

IBM은 2,000개 조직을 대상으로 AI 이니셔티브에 대한 설문조사를 실시하여 효과적인 전략과 효과적이지 못한 전략, 그리고 앞서나갈 수 있는 방법을 알아보았습니다.

IBM Granite 살펴보기

IBM Granite는 비즈니스에 맞게 맞춤화되고 AI 애플리케이션 확장에 최적화되었으며 개방적이고 성능이 뛰어나며 신뢰할 수 있는 AI 모델 제품군입니다. 언어, 코드, 시계열 및 가드레일 옵션을 살펴보세요.

적절한 파운데이션 모델을 선택하는 방법

사용 사례에 가장 적합한 AI 파운데이션 모델을 선택하는 방법을 알아보세요.

신뢰와 확신을 바탕으로 새로운 AI 시대에 성공하는 방법

강력한 AI 전략의 3가지 핵심 요소인 경쟁 우위 확보, 비즈니스 전반의 AI 확장, 신뢰할 수 있는 AI 발전에 대해 자세히 알아보세요.