20 January 2025
In this tutorial, you will experiment with several chunking strategies using LangChain and the latest IBM® Granite™ model now available on watsonx.ai™. The overall goal is to perform retrieval augmented generation (RAG).
Chunking refers to the process of breaking large pieces of text into smaller segments, or chunks. To appreciate why chunking matters, it helps to understand RAG. RAG is a technique in natural language processing (NLP) that combines information retrieval with large language models (LLMs), retrieving relevant information from supplemental datasets to improve the quality of the LLM’s output. To manage large documents, we can use chunking to split the text into smaller, meaningful snippets. These text chunks can then be embedded and stored in a vector database through the use of an embedding model. Finally, the RAG system can use semantic search to retrieve only the most relevant chunks. Smaller chunks often outperform larger ones because they are more manageable for models with smaller context windows.
Some key components of chunking include:
Chunk size, the length of each text segment, typically measured in characters or tokens.
Chunk overlap, the amount of text shared between consecutive chunks, which helps preserve context across chunk boundaries.
There are several different chunking strategies to choose from, and it is important to select the most effective technique for the specific use case of your LLM application. Some commonly used chunking processes include:
Fixed-size chunking, which splits text into chunks of a set number of characters or tokens, often with some overlap between consecutive chunks.
Recursive chunking, which splits text hierarchically by using an ordered list of separators such as paragraphs, lines and words.
Semantic chunking, which uses an embedding model to split text at points where the meaning shifts.
Document-based chunking, which splits text according to the document’s structure, such as Markdown headers, code sections or HTML tags.
While you can choose from several tools, this tutorial walks you through how to set up an IBM account to use a Jupyter Notebook.
Log in to watsonx.ai using your IBM Cloud® account.
Create a watsonx.ai project.
You can get your project ID from within your project. Click the Manage tab. Then, copy the project ID from the Details section of the General page. You need this ID for this tutorial.
Create a Jupyter Notebook.
This step will open a Notebook environment where you can copy the code from this tutorial. Alternatively, you can download this notebook to your local system and upload it to your watsonx.ai project as an asset. To view more Granite tutorials, check out the IBM Granite Community. This Jupyter Notebook along with the datasets used can be found on GitHub.
Create a watsonx.ai Runtime service instance (select your appropriate region and choose the Lite plan, which is a free instance).
Generate an API Key.
Associate the watsonx.ai Runtime service instance to the project that you created in watsonx.ai.
To set our credentials, we need the WATSONX_APIKEY and WATSONX_PROJECT_ID you generated in step 1. We will also set the URL serving as the API endpoint.
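A minimal sketch of setting these credentials, assuming you either export them as environment variables or enter them interactively (the getpass prompts and the Dallas endpoint URL are illustrative, not the notebook's exact values):

```python
import os
from getpass import getpass

# Prompt for credentials if they are not already set as environment variables
WATSONX_APIKEY = os.environ.get("WATSONX_APIKEY") or getpass("Enter your watsonx.ai API key: ")
WATSONX_PROJECT_ID = os.environ.get("WATSONX_PROJECT_ID") or getpass("Enter your watsonx.ai project ID: ")

# API endpoint for the region of your watsonx.ai instance (Dallas shown as an example)
WATSONX_URL = "https://us-south.ml.cloud.ibm.com"
```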
We will use Granite 3.1 as our LLM for this tutorial. To initialize the LLM, we need to set the model parameters. To learn more about these model parameters, such as the minimum and maximum token limits, refer to the documentation.
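A sketch of how this initialization might look with LangChain's WatsonxLLM wrapper. The model ID string and the decoding parameter values below are assumptions; check the watsonx.ai documentation for the identifiers and token limits available in your region.

```python
from langchain_ibm import WatsonxLLM

# Illustrative decoding parameters; tune these for your use case
parameters = {
    "decoding_method": "greedy",
    "max_new_tokens": 500,
    "min_new_tokens": 1,
    "repetition_penalty": 1.0,
}

llm = WatsonxLLM(
    model_id="ibm/granite-3-1-8b-instruct",  # assumed model ID for Granite 3.1 8B Instruct
    url=WATSONX_URL,
    apikey=WATSONX_APIKEY,
    project_id=WATSONX_PROJECT_ID,
    params=parameters,
)
```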
The context we are using for our RAG pipeline is the official IBM announcement for the release of Granite 3.1. We can load the blog to a document directly from the webpage by using LangChain's WebBaseLoader.
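For example, the page can be pulled into a LangChain Document as sketched below; the URL shown is a placeholder, so substitute the actual address of the announcement post.

```python
from langchain_community.document_loaders import WebBaseLoader

# Placeholder: replace with the URL of the Granite 3.1 announcement blog post
blog_url = "https://www.ibm.com/new/announcements/ibm-granite-3-1"

loader = WebBaseLoader(blog_url)
docs = loader.load()  # one Document per URL, containing the extracted page text
```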
Let's walk through sample code for each of the chunking strategies covered earlier in this tutorial, all of which are available through LangChain.
To implement fixed-size chunking, we can use LangChain's CharacterTextSplitter and set a chunk_size as well as chunk_overlap. The chunk_size is measured by the number of characters. Feel free to experiment with different values. We will also set the separator to be the newline character so that we can differentiate between paragraphs. For tokenization, we can use the granite-3.1-8b-instruct tokenizer. The tokenizer breaks down text into tokens that can be processed by the LLM.
We can print one of the chunks for a better understanding of their structure.
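A sketch of this step, with illustrative chunk_size and chunk_overlap values (measured in characters; the notebook's exact sizes may differ) and the Granite tokenizer loaded from Hugging Face:

```python
from langchain_text_splitters import CharacterTextSplitter
from transformers import AutoTokenizer

# Tokenizer matching Granite 3.1; used in the next step to count tokens per chunk
tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3.1-8b-instruct")

# Split on newlines into fixed-size chunks with a small overlap
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=4000,
    chunk_overlap=200,
)
fixed_chunks = text_splitter.split_documents(docs)

# Print one chunk to see its structure (index chosen for illustration)
print(fixed_chunks[1])
```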
Output: (truncated)
Document(metadata={}, page_content='As always, IBM’s historical commitment to open source is reflected in the permissive and standard open source licensing for every offering discussed in this article.\n\r\n Granite 3.1 8B Instruct: raising the bar for lightweight enterprise models\r\n \nIBM’s efforts in the ongoing optimization the Granite series are most evident in the growth of its flagship 8B dense model. IBM Granite 3.1 8B Instruct now bests most open models in its weight class in average scores on the academic benchmarks evaluations included in the Hugging Face OpenLLM Leaderboard...')
We can also use the tokenizer for verifying our process and to check the number of tokens present in each chunk. This step is optional and for demonstrative purposes.
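A quick way to check, assuming the fixed_chunks and tokenizer from the previous step:

```python
# Count the tokens the Granite tokenizer produces for each chunk
for i, chunk in enumerate(fixed_chunks):
    token_count = len(tokenizer.encode(chunk.page_content))
    print(f"The chunk at index {i} contains {token_count} tokens.")
```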
Output:
The chunk at index 0 contains 1106 tokens.
The chunk at index 1 contains 1102 tokens.
The chunk at index 2 contains 1183 tokens.
The chunk at index 3 contains 1010 tokens.
Great! It looks like our chunk sizes were appropriately implemented.
For recursive chunking, we can use LangChain's RecursiveCharacterTextSplitter. Like the fixed-size chunking example, we can experiment with different chunk and overlap sizes.
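A sketch with small, illustrative sizes so that the resulting chunks are easy to inspect:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Recursively split on paragraphs, then lines, then words, then characters
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=0,
)
recursive_chunks = recursive_splitter.split_documents(docs)

# Display the first few chunks
recursive_chunks[:5]
```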
Output:
[Document(metadata={}, page_content='IBM Granite 3.1: powerful performance, longer context and more'),
Document(metadata={}, page_content='IBM Granite 3.1: powerful performance, longer context, new embedding models and more'),
Document(metadata={}, page_content='Artificial Intelligence'),
Document(metadata={}, page_content='Compute and servers'),
Document(metadata={}, page_content='IT automation')]
The splitter successfully chunked the text by using the default separators: ["\n\n", "\n", " ", ""].
Semantic chunking requires an embedding or encoder model. We can use the granite-embedding-30m-english model as our embedding model. We can also print one of the chunks for a better understanding of their structure.
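One way to wire this up is shown below, loading the Granite embedding checkpoint through the langchain_huggingface integration and using LangChain's experimental SemanticChunker with its default breakpoint settings; the notebook may instead serve the embeddings through watsonx.ai.

```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_huggingface import HuggingFaceEmbeddings

# Granite embedding model used to detect semantic breakpoints between sentences
embeddings = HuggingFaceEmbeddings(model_name="ibm-granite/granite-embedding-30m-english")

# Split wherever the embedding distance between adjacent sentences spikes
semantic_splitter = SemanticChunker(embeddings)
semantic_chunks = semantic_splitter.split_documents(docs)

# Print one chunk to see its structure (index chosen for illustration)
print(semantic_chunks[1])
```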
Output: (truncated)
Document(metadata={}, page_content='Our latest dense models (Granite 3.1 8B, Granite 3.1 2B), MoE models (Granite 3.1 3B-A800M, Granite 3.1 1B-A400M) and guardrail models (Granite Guardian 3.1 8B, Granite Guardian 3.1 2B) all feature a 128K token context length.We’re releasing a family of all-new embedding models. The new retrieval-optimized Granite Embedding models are offered in four sizes, ranging from 30M–278M parameters. Like their generative counterparts, they offer multilingual support across 12 different languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch and Chinese. Granite Guardian 3.1 8B and 2B feature a new function calling hallucination detection capability, allowing increased control over and observability for agents making tool calls...')
Documents of various file types are compatible with LangChain's document-based text splitters. For this tutorial's purposes, we will use a Markdown file. For examples of recursive JSON splitting, code splitting and HTML splitting, refer to the LangChain documentation.
An example of a Markdown file we can load is the README file for Granite 3.1 on IBM's GitHub.
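One simple way to pull it in, using the raw file URL that appears in the output below and wrapping the text in a LangChain Document; the notebook may use a dedicated LangChain loader instead.

```python
import requests
from langchain_core.documents import Document

# Raw URL of the Granite 3.1 README on IBM's GitHub
readme_url = (
    "https://raw.githubusercontent.com/ibm-granite/"
    "granite-3.1-language-models/refs/heads/main/README.md"
)

# Download the Markdown source and wrap it in a Document
response = requests.get(readme_url)
response.raise_for_status()
md_docs = [Document(page_content=response.text, metadata={"source": readme_url})]
md_docs
```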
Output:
[Document(metadata={'source': 'https://raw.githubusercontent.com/ibm-granite/granite-3.1-language-models/refs/heads/main/README.md'}, page_content='\n\n\n\n :books: Paper (comming soon)\xa0 | :hugs: HuggingFace Collection\xa0 | \n :speech_balloon: Discussions Page\xa0 | 📘 IBM Granite Docs\n\n\n---\n## Introduction to Granite 3.1 Language Models\nGranite 3.1 language models are lightweight, state-of-the-art, open foundation models that natively support multilinguality, coding, reasoning, and tool usage, including the potential to be run on constrained compute resources. All the models are publicly released under an Apache 2.0 license for both research and commercial use. The models\' data curation and training procedure were designed for enterprise usage and customization, with a process that evaluates datasets for governance, risk and compliance (GRC) criteria, in addition to IBM\'s standard data clearance process and document quality checks...')]
Now, we can use LangChain's MarkdownHeaderTextSplitter to split the file by header type, which we set in the headers_to_split_on list. We will also print one of the chunks as an example.
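A sketch of the header-based split, assuming the md_docs list from the previous step; the header levels and label names are illustrative choices.

```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

# Split on first-, second- and third-level Markdown headers
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_chunks = markdown_splitter.split_text(md_docs[0].page_content)

# Print one chunk; the matched headers appear in its metadata (index chosen for illustration)
print(md_chunks[10])
```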
Output:
Document(metadata={'Header 2': 'How to Use our Models?', 'Header 3': 'Inference'}, page_content='This is a simple example of how to use Granite-3.1-1B-A400M-Instruct model. \n```python\nimport torch\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\n\ndevice = "auto"\nmodel_path = "ibm-granite/granite-3.1-1b-a400m-instruct"\ntokenizer = AutoTokenizer.from_pretrained(model_path)\n# drop device_map if running on CPU\nmodel = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)\nmodel.eval()\n# change input text as desired\nchat = [\n{ "role": "user", "content": "Please list one IBM Research laboratory located in the United States. You should only output its name and location." },\n]\nchat = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)\n# tokenize the text\ninput_tokens = tokenizer(chat, return_tensors="pt").to(device)\n# generate output tokens\noutput = model.generate(**input_tokens,\nmax_new_tokens=100)\n# decode output tokens into text\noutput = tokenizer.batch_decode(output)\n# print output\nprint(output)\n```')
As you can see in the output, the chunking successfully split the text by header type.
Now that we have experimented with various chunking strategies, let's move along with our RAG implementation. For this tutorial, we will choose the chunks produced by the semantic split and convert them to vector embeddings. An open source vector store we can use is Chroma DB. We can easily access Chroma functionality through the langchain_chroma package.
Let's initialize our Chroma vector database, provide it with our embeddings model and add our documents produced by semantic chunking.
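A sketch of that setup, reusing the embeddings model and the semantic_chunks from earlier; the collection name is arbitrary.

```python
from langchain_chroma import Chroma

# In-memory Chroma collection backed by the Granite embedding model
vector_store = Chroma(
    collection_name="granite_blog",
    embedding_function=embeddings,
)

# Embed and store the semantically split chunks; returns the generated document IDs
vector_store.add_documents(documents=semantic_chunks)
```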
Output:
['84fcc1f6-45bb-4031-b12e-031139450cf8',
'433da718-0fce-4ae8-a04a-e62f9aa0590d',
'4bd97cd3-526a-4f70-abe3-b95b8b47661e',
'342c7609-b1df-45f3-ae25-9d9833829105',
'46a452f6-2f02-4120-a408-9382c240a26e']
Next, we can move on to creating a prompt template for our LLM. This prompt template allows us to ask multiple questions without altering the initial prompt structure. We can also provide our vector store as the retriever. This step finalizes the RAG structure.
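One way to assemble this, sketched in LangChain's expression language; the prompt wording and retriever settings are illustrative, not the notebook's exact ones.

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough

# Prompt template that injects the retrieved chunks as context for each question
prompt = PromptTemplate.from_template(
    "Answer the question based only on the following context:\n"
    "{context}\n\n"
    "Question: {question}"
)

# Expose the vector store as a retriever for the RAG chain
retriever = vector_store.as_retriever()

def format_docs(retrieved_docs):
    # Concatenate the retrieved chunks into a single context string
    return "\n\n".join(doc.page_content for doc in retrieved_docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
```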
Using our completed RAG workflow, let's invoke a user query. First, we can prompt the model without any additional context from the vector store we built, to test whether the model can answer from its built-in knowledge or truly needs the RAG context. The Granite 3.1 announcement blog references Docling, IBM's tool for parsing various document types and converting them into Markdown or JSON. Let's ask the LLM about Docling.
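Assuming the llm initialized earlier; the exact wording of the query is an assumption.

```python
query = "What is Docling"

# Ask the model directly, with no retrieved context
llm.invoke(query)
```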
Output:
'?\n\n"Docling" does not appear to be a standard term in English. It might be a typo or a slang term specific to certain contexts. If you meant "documenting," it refers to the process of creating and maintaining records, reports, or other written materials that provide information about an activity, event, or situation. Please check your spelling or context for clarification.'
Clearly, the model was not trained on information about Docling and without outside tools or information, it cannot provide us with this information. Now, let's try providing the same query to the RAG chain we built.
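Assuming the rag_chain and query defined above:

```python
# Ask the same question through the RAG chain, which retrieves relevant chunks first
rag_chain.invoke(query)
```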
Output:
'Docling is a powerful tool developed by IBM Deep Search for parsing documents in various formats such as PDF, DOCX, images, PPTX, XLSX, HTML, and AsciiDoc, and converting them into model-friendly formats like Markdown or JSON. This enables easier access to the information within these documents for models like Granite for tasks such as RAG and other workflows. Docling is designed to integrate seamlessly with agentic frameworks like LlamaIndex, LangChain, and Bee, providing developers with the flexibility to incorporate its assistance into their preferred ecosystem. It surpasses basic optical character recognition (OCR) and text extraction methods by employing advanced contextual and element-based preprocessing techniques. Currently, Docling is open-sourced under the permissive MIT License, and the team continues to develop additional features, including equation and code extraction, as well as metadata extraction.'
Great! The Granite model used the retrieved RAG context to give us accurate information about Docling while preserving semantic coherence, a result that was not possible without RAG.
In this tutorial, you created a RAG pipeline and experimented with several chunking strategies to improve the system’s retrieval accuracy. Using the Granite 3.1 model, we successfully produced appropriate model responses to a user query related to the documents provided as context. The text we used for this RAG implementation was loaded from a blog on ibm.com announcing the release of Granite 3.1. The model provided us with information only accessible through the provided context since it was not part of the model's initial knowledge base.
For those in search of further reading, check out the results of a project comparing LLM performance using HTML structured chunking in comparison to watsonx chunking.