LangChain과 watsonx.ai를 사용하여 RAG 청킹 전략 구현

이 튜토리얼에서는 LangChain과 watsonx.ai에서 제공되는 최신 IBM® Granite 모델을 사용하여 여러 청킹 전략을 실험하게 됩니다. 전체 목표는 검색 증강 생성(RAG)을 수행하는 것입니다.

청킹이란 무엇인가요?

청킹은 큰 텍스트 조각을 더 작은 텍스트 세그먼트 또는 청크로 분할하는 과정을 말합니다. 청킹의 중요성을 강조하기 위해 RAG를 이해하는 것이 도움이 됩니다. RAG는 정보 검색과 대규모 언어 모델(LLM)을 결합하여 보조 데이터 세트에서 관련 정보를 검색하여 LLM의 아웃풋 품질을 최적화하는 자연어 처리(NLP) 기술입니다. 대형 문서를 관리하기 위해 청킹을 사용하여 텍스트를 더 의미 있는 작은 단위의 청크로 분할할 수 있습니다. 이렇게 생성된 텍스트 청크들은 임베딩 모델을 활용해 벡터 데이터베이스에 임베딩되어 저장될 수 있습니다. 마지막으로 RAG 시스템은 시맨틱 검색을 통해 가장 관련성 높은 청크만을 검색할 수 있습니다. 작은 청크는 더 작은 컨텍스트 창 크기를 가진 모델에서 처리하기 쉬운 단위이므로 일반적으로 큰 청크보다 더 나은 성능을 보이는 경향이 있습니다.

청킹의 몇 가지 주요 구성 요소는 다음과 같습니다.

청킹 전략: RAG 애플리케이션에 적합한 청킹 전략을 선택하는 것은 청크 설정 시 경계를 결정하기 때문에 중요합니다. 이러한 전략 중 일부를 다음 섹션에서 살펴보겠습니다.
청크 크기: 각 청크에 포함될 최대 토큰 수입니다. 적절한 청크 크기를 결정하려면 일반적으로 몇 가지 실험이 필요합니다.
청크 중복: 컨텍스트를 보존하기 위해 청크 간에 중복되는 토큰의 수입니다. 이는 선택적 매개변수입니다.

청킹 전략

몇 가지 다양한 청킹 전략 중에서 선택할 수 있습니다. LLM 애플리케이션의 특정 사용 사례에 가장 효과적인 청킹 기술을 선택하는 것이 중요합니다. 일반적으로 사용되는 몇 가지 청킹 프로세스는 다음과 같습니다.

고정 크기 청킹: 특정 청크 크기 및 선택적 청크 겹침을 사용하여 텍스트를 분할합니다. 이 방식은 가장 일반적이며 직관적입니다.
재귀 청킹: 기본 구분자를 반복 적용하여 원하는 청크 크기를 얻을 때까지 분할합니다. 기본 구분자에는 ["\n\n", "\n", " ", ""]가 포함됩니다. 이 청킹 방법은 계층적 구분자를 사용하여 문단, 문장, 단어 단위까지 최대한 함께 유지되도록 합니다.
시멘틱 청킹: 임베딩의 의미적 유사성에 따라 문장을 그룹화하여 텍스트를 분할하는 방식입니다. 의미론적 유사성이 높은 임베딩은 의미론적 유사성이 낮은 임베딩보다 서로 더 가깝게 위치하게 됩니다. 그 결과 컨텍스트를 인식하는 청크가 생성됩니다.
문서 기반 청킹: 문서 구조를 기준으로 분할하는 방식입니다. 이 분할기는 구조를 결정할 때 Markdown 텍스트, 이미지, 표, 심지어 Python 코드의 클래스와 함수까지 활용할 수 있습니다. 이러한 방법으로 대형 문서를 LLM이 분할하고 처리할 수 있습니다.
에이전틱 청킹: 에이전틱 AI를 활용하여 LLM이 의미적 관계뿐만 아니라 문단 유형, 섹션 제목, 단계별 지침 등과 같은 콘텐츠 구조를 기반으로 문서를 적절히 분할하도록 합니다. 이 청킹 도구는 실험적 기능으로 긴 문서를 처리할 때 인간의 추론 방식을 모방하려고 시도합니다.

단계

1단계. 환경 설정

여러 툴 중에서 선택할 수 있지만, 이 튜토리얼에서는 Jupyter Notebook을 사용하기 위해 IBM 계정을 설정하는 방법을 안내합니다.

IBM® Cloud 계정을 사용하여 watsonx.ai 에 로그인합니다.
watsonx.ai 프로젝트를 생성합니다.

프로젝트 내에서 프로젝트 ID를 가져올 수 있습니다. 관리 탭을 클릭합니다. 그런 다음 일반 페이지의 세부 정보 섹션에서 프로젝트 ID를 복사합니다. 이 튜토리얼에는 이 ID가 필요합니다.
Jupyter Notebook을 만듭니다.

해당 단계에서는 이 튜토리얼의 코드를 복사할 수 있는 Notebook 환경이 열립니다. 또는 노트북을 로컬 시스템에 다운로드하여 watsonx.ai 프로젝트에 에셋으로 업로드할 수 있습니다. 더 많은 Granite 튜토리얼을 보려면 IBM Granite 커뮤니티를 확인하세요. Jupyter Notebook과 함께 사용된 데이터 세트는 GitHub에서 확인할 수 있습니다.

2단계. watsonx.ai Runtime 인스턴스 및 API 키 설정

watsonx.ai 런타임 서비스 인스턴스를 만듭니다(적절한 지역을 선택하고 무료 인스턴스인 Lite 요금제를 선택합니다).
API 키를 생성합니다.
watsonx.ai 런타임 서비스 인스턴스를 watsonx.ai에서 생성한 프로젝트에 연결합니다.

3단계. 관련 라이브러리 설치, 가져오기 및 자격 증명 설정

#installations
!pip install -q langchain langchain-ibm langchain_experimental langchain-text-splitters langchain_chroma transformers bs4 langchain_huggingface sentence-transformers

# imports
import getpass
from langchain_ibm import WatsonxLLM
from langchain_chroma import Chroma
from langchain_community.document_loaders import WebBaseLoader
from ibm_watsonx_ai.metanames import GenTextParamsMetaNames as GenParams
from transformers import AutoTokenizer

자격 증명을 설정하려면 1단계에서 생성한 WATSONX_APIKEY와 WATSONX_PROJECT_ID가 필요합니다. 또한 API 엔드포인트 역할을 하는 URL도 설정합니다.

WATSONX_APIKEY = getpass.getpass("Please enter your watsonx.ai Runtime API key (hit enter): ")
WATSONX_PROJECT_ID = getpass.getpass("Please enter your project ID (hit enter): ")
URL = "https://us-south.ml.cloud.ibm.com"

4단계. LLM 초기화

이 튜토리얼에서는 LLM으로 Granite 3.1 버전을 사용합니다. LLM을 초기화하려면 모델 매개변수를 설정해야 합니다. 최소 및 최대 토큰 제한과 같은 모델 매개변수에 대해 자세히 알아보려면 문서를 참조하세요.

llm = WatsonxLLM(
        model_id= "ibm/granite-3-8b-instruct",
        url=URL,
        apikey=WATSONX_APIKEY,
        project_id=WATSONX_PROJECT_ID,
        params={
            GenParams.DECODING_METHOD: "greedy",
            GenParams.TEMPERATURE: 0,
            GenParams.MIN_NEW_TOKENS: 5,
            GenParams.MAX_NEW_TOKENS: 2000,
            GenParams.REPETITION_PENALTY:1.2
        }
)

5단계. 문서 불러오기

RAG 파이프라인에 사용 중인 컨텍스트는 Granite 3.1 출시를 알리는 IBM 공식 발표문입니다. LangChain의 WebBaseLoader를 사용하여 사용하여 웹페이지에서 블로그를 문서로 직접 불러올 수 있습니다.

url = "https://www.ibm.com/kr-ko/new/announcements/ibm-granite-3-1-powerful-performance-long-context-and-more"
doc = WebBaseLoader(url).load()

6단계. 텍스트 분할 수행

이 튜토리얼에서 앞서 다룬 각 청킹 전략을 LangChain을 사용하여 구현하는 샘플 코드를 제공하겠습니다.

고정 크기 청킹

고정 크기 청킹을 구현하려면 LangChain의 CharacterTextSplitter를 사용하고 chunk_size와 chunk_overlap를 설정할 수 있습니다. chunk_size는 문자 수로 측정되므로 다양한 값을 시험해보셔도 됩니다. 문단을 구분하기 위해 separator를 줄바꿈 문자로 설정합니다. 토큰화 작업에서는 granite-3.1-8b-instruct 토크나이저를 사용할 수 있습니다. 토크나이저는 텍스트를 LLM이 처리할 수 있는 토큰으로 분할합니다.

from langchain_text_splitters import CharacterTextSplitter
tokenizer = AutoTokenizer.from_pretrained(“ibm-granite/granite-3.1-8b-instruct”)
text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
                    tokenizer,
                    separator=”\n”, #default: “\n\n”
                    chunk_size=1200, chunk_overlap=200)
fixed_size_chunks = text_splitter.create_documents([doc[0].page_content])

청크 중 하나를 출력하여 구조를 더 잘 이해할 수 있습니다.

Fixed_size_chunks[1]

아웃풋: (잘림)

Document(metadata={}, page_content=’As always, IBM’s historical commitment to open source is reflected in the permissive and standard open source licensing for every offering discussed in this article.\n\r\n Granite 3.1 8B Instruct: raising the bar for lightweight enterprise models\r\n \nIBM’s efforts in the ongoing optimization the Granite series are most evident in the growth of its flagship 8B dense model. IBM Granite 3.1 8B Instruct now bests most open models in its weight class in average scores on the academic benchmarks evaluations included in the Hugging Face OpenLLM Leaderboard...’)

토크나이저를 사용하여 각 청크에 포함된 토큰 수를 확인하고 처리 과정을 검증할 수도 있습니다. 이 단계는 선택 사항이며, 데모 용도입니다.

for idx, val in enumerate(fixed_size_chunks):
token_count = len(tokenizer.encode(val.page_content))
print(f”{idx} 인덱스의 청크는 {token_count}개의 토큰을 포함합니다.”)

아웃풋:

The chunk at index 0 contains 1106 tokens.
The chunk at index 1 contains 1102 tokens.
The chunk at index 2 contains 1183 tokens.
The chunk at index 3 contains 1010 tokens.

훌륭합니다! 청크 크기가 적절하게 구현된 것 같습니다.

재귀적 청킹

재귀적 청킹은 LangChain의 RecursiveCharacterTextSplitter를 사용할 수 있습니다. 고정 크기 청킹 예제와 마찬가지로 다양한 청크 및 겹침 크기를 실험할 수 있습니다.

from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=0)
recursive_chunks = text_splitter.create_documents([doc[0].page_content])
recursive_chunks[:5]

아웃풋:

[Document(metadata={}, page_content=’IBM Granite 3.1: powerful performance, longer context and more’),
Document(metadata={}, page_content=’IBM Granite 3.1: powerful performance, longer context, new embedding models and more’),
Document(metadata={}, page_content=’Artificial Intelligence’),
Document(metadata={}, page_content=’Compute and servers’),
Document(metadata={}, page_content=’IT automation’)]

분할기는 기본 구분자 [“\n\n”, “\n”, “ “, “”]를 사용하여 텍스트를 성공적으로 청킹했습니다.

시맨틱 청킹

시맨틱 청킹에는 임베딩 또는 인코더 모델이 필요합니다. granite-embedding-30m-english 모델을 임베딩 모델로 사용할 수 있으며 청크 구조를 더 잘 이해하기 위해 결과 중 하나를 출력할 수도 있습니다.

from langchain_huggingface import HuggingFaceEmbeddings
from langchain_experimental.text_splitter import SemanticChunker

embeddings_model = HuggingFaceEmbeddings(model_name=”ibm-granite/granite-embedding-30m-english”)
text_splitter = SemanticChunker(embeddings_model)
semantic_chunks = text_splitter.create_documents([doc[0].page_content])
semantic_chunks[1]

아웃풋: (잘림)

Document(metadata={}, page_content=’Our latest dense models (Granite 3.1 8B, Granite 3.1 2B), MoE models (Granite 3.1 3B-A800M, Granite 3.1 1B-A400M) and guardrail models (Granite Guardian 3.1 8B, Granite Guardian 3.1 2B) all feature a 128K token context length.We’re releasing a family of all-new embedding models. The new retrieval-optimized Granite Embedding models are offered in four sizes, ranging from 30M–278M parameters. Like their generative counterparts, they offer multilingual support across 12 different languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch and Chinese. Granite Guardian 3.1 8B and 2B feature a new function calling hallucination detection capability, allowing increased control over and observability for agents making tool calls...’)

문서 기반 청킹

다양한 파일 형식의 문서는 LangChain의 문서 기반 텍스트 분할기와 호환됩니다. 이 튜토리얼에서는 Markdown 파일을 사용합니다. 재귀 JSON 분할, 코드 분할 및 HTML 분할의 예시는 LangChain 문서를 참조하세요.

로드할 수 있는 Markdown 파일의 예시로는 IBM의 GitHub에 있는 Granite 3.1용 README 파일이 있습니다.

url = “https://raw.githubusercontent.com/ibm-granite/granite-3.1-language-models/refs/heads/main/README.md”
markdown_doc = WebBaseLoader(url).load()
markdown_doc

아웃풋:

[Document(metadata={‘source’: ‘https://raw.githubusercontent.com/ibm-granite/granite-3.1-language-models/refs/heads/main/README.md’}, page_content=’\n\n\n\n :books: Paper (comming soon)\xa0 | :hugs: HuggingFace Collection\xa0 | \n :speech_balloon: Discussions Page\xa0 | ðŸ“˜ IBM Granite Docs\n\n\n---\n## Introduction to Granite 3.1 Language Models\nGranite 3.1 language models are lightweight, state-of-the-art, open foundation models that natively support multilinguality, coding, reasoning, and tool usage, including the potential to be run on constrained compute resources. All the models are publicly released under an Apache 2.0 license for both research and commercial use. The models\’ data curation and training procedure were designed for enterprise usage and customization, with a process that evaluates datasets for governance, risk and compliance (GRC) criteria, in addition to IBM\’s standard data clearance process and document quality checks...’)]

이제 LangChain의 MarkdownHeaderTextSplitter를 사용하여 headers_to_split_on 목록에 설정한 헤더 유형별로 파일을 분할할 수 있습니다. 또한 예시로 청크 중 하나를 출력해 보겠습니다.

#문서 기반 청킹
from langchain_text_splitters import MarkdownHeaderTextSplitter
headers_to_split_on = [
    (“#”, “Header 1”),
    (“##”, “Header 2”),
    (“###”, “Header 3”),
]
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on)
document_based_chunks = markdown_splitter.split_text(markdown_doc[0].page_content)
document_based_chunks[3]

아웃풋:

Document(metadata={‘Header 2’: ‘How to Use our Models?’, ‘Header 3’: ‘Inference’}, page_content=’This is a simple example of how to use Granite-3.1-1B-A400M-Instruct model. \n```python\nimport torch\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\n\ndevice = “auto”\nmodel_path = “ibm-granite/granite-3.1-1b-a400m-instruct”\ntokenizer = AutoTokenizer.from_pretrained(model_path)\n# drop device_map if running on CPU\nmodel = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)\nmodel.eval()\n# change input text as desired\nchat = [\n{ “role”: “user”, “content”: “Please list one IBM Research laboratory located in the United States. You should only output its name and location.” },\n]\nchat = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)\n# tokenize the text\ninput_tokens = tokenizer(chat, return_tensors=”pt”).to(device)\n# generate output tokens\noutput = model.generate(**input_tokens,\nmax_new_tokens=100)\n# decode output tokens into text\noutput = tokenizer.batch_decode(output)\n# print output\nprint(output)\n```’)

아웃풋에서 볼 수 있듯이 청킹이 헤더 유형별로 텍스트를 성공적으로 분할했습니다.

7단계. 벡터 스토어 생성

이제 다양한 청킹 전략을 실험했으므로 RAG 구현을 진행해 보겠습니다. 이 튜토리얼에서는 시맨틱 분할에 의해 생성된 청크를 선택하고 이를 벡터 임베딩으로 변환합니다. 사용할 수 있는 오픈 소스 벡터 저장소는 Chroma DB입니다. langchain_chroma 패키지를 통해 쉽게 Chroma 기능을 사용할 수 있습니다.

Chroma 벡터 데이터베이스를 초기화하고, 임베딩 모델을 제공하며, 시맨틱 청킹으로 생성한 문서를 추가하겠습니다.

vector_db = Chroma(
    collection_name=”example_collection”,
    embedding_function=embeddings_model,
    persist_directory=”./chroma_langchain_db”, # 로컬에 테이터 저장 위치
)

vector_db.add_documents(semantic_chunks)

아웃풋:

[‘84fcc1f6-45bb-4031-b12e-031139450cf8’,
‘433da718-0fce-4ae8-a04a-e62f9aa0590d’,
‘4bd97cd3-526a-4f70-abe3-b95b8b47661e’,
‘342c7609-b1df-45f3-ae25-9d9833829105’,
‘46a452f6-2f02-4120-a408-9382c240a26e’]

7단계: 프롬프트 템플릿 구조화

다음으로 LLM에 대한 프롬프트 템플릿을 만드는 단계로 넘어갈 수 있습니다. 이 프롬프트 템플릿을 사용하면 초기 프롬프트 구조를 변경하지 않고도 여러 질문을 할 수 있습니다. 벡터 스토어를 검색기로 제공할 수도 있습니다. 이 단계에서는 RAG 구조를 완성합니다.

from langchain.chains import create_retrieval_chain
from langchain.prompts import PromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain

prompt_template = """<|start_of_role|>user<|end_of_role|>Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
{context}
Question: {input}<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>"""

qa_chain_prompt = PromptTemplate.from_template(prompt_template)
combine_docs_chain = create_stuff_documents_chain(llm, qa_chain_prompt)
rag_chain = create_retrieval_chain(vector_db.as_retriever(), combine_docs_chain)

9단계. RAG 체인 프롬프트

완성된 RAG 워크플로를 사용하여 사용자 쿼리를 실행해 보겠습니다. 먼저, 구축한 벡터 저장소의 추가 컨텍스트 없이 모델을 프롬프트하여 모델이 내장 지식을 사용하는지 실제로 RAG 컨텍스트를 활용하는지 테스트할 수 있습니다. Granite 3.1 발표 블로그에서는 다양한 문서 유형을 구문 분석하고 Markdown 또는 JSON 형식으로 변환하는 IBM의 도구인 Docling을 언급합니다. LLM에게 Docling에 대해 물어보겠습니다.

output = llm.invoke(“What is Docling?”)
output

아웃풋:

‘?\n\n”Docling” does not appear to be a standard term in English. It might be a typo or a slang term specific to certain contexts. If you meant “documenting,” it refers to the process of creating and maintaining records, reports, or other written materials that provide information about an activity, event, or situation. Please check your spelling or context for clarification.’

의심의 여지 없이 이 모델은 Docling에 관한 정보로 학습되지 않았으며, 외부 도구나 추가 정보 없이 이러한 내용을 제공할 수 없습니다. 이제 우리가 구축한 RAG 체인에 동일한 쿼리를 제공해 보겠습니다.

rag_output = rag_chain.invoke({“input”: “What is Docling?”})
rag_output[‘answer’]

아웃풋:

‘Docling is a powerful tool developed by IBM Deep Search for parsing documents in various formats such as PDF, DOCX, images, PPTX, XLSX, HTML, and AsciiDoc, and converting them into model-friendly formats like Markdown or JSON. This enables easier access to the information within these documents for models like Granite for tasks such as RAG and other workflows. Docling is designed to integrate seamlessly with agentic frameworks like LlamaIndex, LangChain, and Bee, providing developers with the flexibility to incorporate its assistance into their preferred ecosystem. It surpasses basic optical character recognition (OCR) and text extraction methods by employing advanced contextual and element-based preprocessing techniques. Currently, Docling is open-sourced under the permissive MIT License, and the team continues to develop additional features, including equation and code extraction, as well as metadata extraction.’

훌륭합니다! Granite 모델은 RAG 컨텍스트를 올바르게 활용해 의미의 일관성을 유지하면서 Docling에 대한 정확한 정보를 제공했습니다. RAG를 사용하지 않고는 동일한 결과를 얻을 수 없음을 입증했습니다.

요약

이 튜토리얼에서는 RAG 파이프라인을 구축하고, 시스템의 검색 정확도를 개선하기 위해 여러 청킹 전략을 실험했습니다. Granite 3.1 모델을 사용하여 컨텍스트로 제공된 문서와 관련된 사용자 쿼리에 대한 적절한 모델 응답을 성공적으로 생성했습니다. 이번 RAG 구현에 사용된 텍스트는 ibm.com의 Granite 3.1 출시를 알리는 블로그에서 불러왔습니다. 이 모델은 모델의 초기 지식 베이스에 포함되지 않은 정보를 제공된 컨텍스트를 통해서만 접근할 수 있도록 제공했습니다.

더 자세한 내용을 알고 싶다면 HTML 구조화 청킹을 사용한 LLM 성능과 watsonx 청킹을 비교한 프로젝트 결과를 확인해 보세요.

생성형 AI + ML의 힘 활용하기

생성형 AI와 머신 러닝을 비즈니스에 자신 있게 통합하는 방법 알아보기