
Overview

In the realm of web crawling and document indexing, effective chunking methodologies are crucial for optimizing search performance and relevance. When dealing with HTML web pages or PDFs, the strategy you employ can significantly impact the efficiency and accuracy of your search application. Here's a comprehensive guide to effective chunking methodologies for different circumstances, particularly in the context of using Elasticsearch as a vector store. 

 

Document Segmentation

Web pages vary widely in size and structure. For efficient processing:

  • HTML Parsing: Break down HTML pages into manageable chunks based on HTML tags (e.g., <p> for paragraphs, <div> for sections).
  • Content Extraction: Focus on extracting meaningful content such as main text, headings, and structured data.
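As an illustrative sketch of tag-based segmentation, the following uses only Python's standard-library html.parser (in practice you would likely reach for Beautiful Soup or Scrapy, mentioned below) and treats each <p> element as one candidate chunk:

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collect the text inside each <p> element as one candidate chunk."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._depth = 0   # nesting level of currently open <p> tags
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Collapse whitespace and store the finished chunk
                text = " ".join("".join(self._buf).split())
                if text:
                    self.chunks.append(text)
                self._buf = []

    def handle_data(self, data):
        if self._depth:
            self._buf.append(data)

parser = ParagraphExtractor()
parser.feed("<div><p>First paragraph.</p><p>Second <b>bold</b> one.</p></div>")
print(parser.chunks)  # ['First paragraph.', 'Second bold one.']
```

The same pattern extends to <div> or heading tags when sections rather than paragraphs are the right chunk boundary.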

Chunking Strategy

  • By URL: Treat each URL as a separate document. This method is straightforward and useful for websites where each page is relatively small and self-contained.
  • By Page Sections: Segment documents by meaningful sections (e.g., articles, blog posts). This approach is beneficial for long-form content and sites with diverse content types.
  • Content Length: Avoid extremely large chunks; balance between granularity for search relevance and performance. Smaller chunks can improve search precision but might increase processing overhead.

Processing Pipeline

  • Use libraries like Beautiful Soup or Scrapy for web scraping.
  • Normalize text by removing HTML tags and handling encoding issues.
  • Apply language-specific analyzers (e.g., stemming, stop words) to improve search accuracy.
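A minimal, stdlib-only sketch of the pipeline above; the regex tag stripper, toy stop-word list, and tokenizer are stand-ins for a real parser and an Elasticsearch language analyzer:

```python
import html
import re

# Toy stop-word list; a real deployment would rely on an Elasticsearch analyzer.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}

def normalize(raw_html: str) -> str:
    """Strip tags, decode entities, and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw_html)  # crude tag removal
    text = html.unescape(text)                # &amp; -> &, &nbsp; -> NBSP
    return " ".join(text.split())             # collapse all whitespace runs

def analyze(text: str) -> list[str]:
    """Lowercase, tokenize, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOP_WORDS]

clean = normalize("<p>The quick &amp; lazy dog</p>")
print(clean)           # The quick & lazy dog
print(analyze(clean))  # ['quick', 'lazy', 'dog']
```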

Challenges with older HTML Code

When web crawling involves older HTML code, it can significantly impact the quality of the chunks extracted for indexing in Elasticsearch. Here are some key challenges and mitigation strategies:

Structural Inconsistencies

Non-Standard Tags and Attributes:

  • Older HTML code may use non-standard tags or attributes that modern parsers might not handle correctly or fully understand.
  • This can lead to incomplete or incorrect extraction of content chunks, especially if the crawler relies on standard HTML parsing libraries.

Nested and Deprecated Elements:

  • Old HTML often includes nested elements or deprecated tags like <font>, <center>, or <strike>, which may not be properly parsed by modern HTML parsers.
  • As a result, content segmentation based on these elements can produce chunks that are fragmented or incorrectly structured.

Content Quality and Relevance

Inline Styling and Formatting:

  • Older HTML may heavily rely on inline styling and formatting tags (<font>, <b>, <i>, etc.), which can clutter the extracted text and reduce the quality of the indexed chunks.
  • Search quality may suffer as Elasticsearch analyzes and indexes these chunks, potentially affecting relevance in search results.

Inconsistent Document Structure:

  • Lack of standardized document structure in older HTML pages can lead to inconsistent chunk sizes and content segments.
  • This inconsistency makes it challenging to establish clear boundaries between meaningful content sections, affecting the granularity of chunks and search relevancy.

Technical Challenges

Encoding and Character Issues:

  • Older HTML pages may have encoding issues, such as incorrect character sets or entities (&nbsp;, &lt;), which need proper handling during text extraction.
  • Failure to handle these correctly can result in garbled text or incorrect representation of content in indexed chunks.
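A minimal sketch of entity handling with Python's standard-library html module (the input string here is invented for illustration); note that &nbsp; decodes to a non-breaking space (U+00A0), which usually needs a second pass:

```python
import html

raw = "Price:&nbsp;&lt;100&nbsp;&amp;&nbsp;falling"
decoded = html.unescape(raw)            # entities -> characters (&nbsp; -> U+00A0)
clean = decoded.replace("\u00a0", " ")  # non-breaking spaces -> plain spaces
print(clean)  # Price: <100 & falling
```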

Script and Dynamic Content:

  • Dynamic elements and scripts embedded within older HTML pages (e.g., JavaScript, Flash) may not be effectively processed by basic web crawlers.
  • This can lead to incomplete extraction of content or the omission of dynamically generated text, impacting the comprehensiveness of indexed chunks.

Mitigation Strategies

To mitigate the impact of old HTML code on chunk quality when web crawling for Elasticsearch indexing, consider the following strategies:

  • Use Robust Parsing Libraries: Employ advanced HTML parsing libraries that can handle various HTML versions and non-standard elements.
  • Normalize Text: Pre-process extracted text to normalize formatting and remove unnecessary tags or attributes.
  • Cleanse Content: Filter out irrelevant or deprecated elements during the extraction process to improve the quality of indexed chunks.
  • Adapt Extraction Logic: Develop custom logic to handle specific quirks of older HTML pages, ensuring more accurate segmentation into meaningful content chunks.
  • Regular Updates: Maintain crawler logic and parsing libraries to adapt to changes in web standards and HTML practices over time.
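As a sketch of the "cleanse content" step, deprecated or purely presentational tags can be unwrapped so their text survives while the markup is dropped. A regex is used here only for brevity; per the first bullet, a robust parsing library is preferable on messy real-world HTML:

```python
import re

# Deprecated/presentational tags to unwrap (tag removed, wrapped text kept).
PRESENTATIONAL = ("font", "center", "strike", "b", "i")

def unwrap_presentational(markup: str) -> str:
    """Drop deprecated formatting tags while keeping the text they wrap."""
    pattern = r"</?(?:%s)\b[^>]*>" % "|".join(PRESENTATIONAL)
    return re.sub(pattern, "", markup, flags=re.IGNORECASE)

legacy = '<center><font color="red">Welcome to our <b>site</b></font></center>'
print(unwrap_presentational(legacy))  # Welcome to our site
```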

By addressing these challenges proactively, you can enhance the quality of content chunks extracted from older HTML pages, thereby improving the overall effectiveness and relevance of Elasticsearch-based search applications.

Indexing PDFs

PDF Content Extraction

  • Extract text from PDFs, considering elements like headings, paragraphs, and tables.
  • Handle metadata (title, author, date) separately for indexing.

Chunking Strategy

  • By Document: Treat each PDF file as a single document for indexing.
  • By Page: Alternatively, divide PDFs into chunks based on pages or logical sections if needed.
  • Text Blocks: Segment large chunks into smaller units based on natural breaks in content (e.g., paragraphs).
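The text-block strategy can be sketched as follows, assuming page text has already been extracted (e.g., with a PDF library): split at blank lines, then merge paragraphs into blocks up to a size cap.

```python
def text_blocks(page_text: str, max_chars: int = 200) -> list[str]:
    """Split page text at blank lines, then merge paragraphs up to max_chars."""
    paragraphs = [p.strip() for p in page_text.split("\n\n") if p.strip()]
    blocks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 1 > max_chars:
            blocks.append(current)   # current block is full; start a new one
            current = para
        else:
            current = current + "\n" + para if current else para
    if current:
        blocks.append(current)
    return blocks

page = "Intro paragraph.\n\nSecond paragraph.\n\nThird paragraph."
print(text_blocks(page, max_chars=25))
# ['Intro paragraph.', 'Second paragraph.', 'Third paragraph.']
```

With the default cap, all three paragraphs fit in a single block; a smaller cap yields one block per paragraph.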

Text Extraction Challenges, Facts, and Myths

  • Fact: PDFs may contain images or scanned documents, requiring OCR (Optical Character Recognition) for text extraction. watsonx Discovery includes OCR, so this should not be a challenge in practice.
  • Handle text encoding and formatting issues unique to PDFs (e.g., hyphenation, line breaks).
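Hyphenation and hard line breaks from PDF extraction can be repaired with a small cleanup pass; a minimal sketch:

```python
import re

def fix_pdf_text(raw: str) -> str:
    """Rejoin words hyphenated across line breaks, then flatten remaining newlines."""
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)  # "chunk-\ning" -> "chunking"
    text = re.sub(r"\s*\n\s*", " ", text)        # leftover line breaks -> spaces
    return text.strip()

print(fix_pdf_text("Effective chunk-\ning improves\nsearch relevance."))
# Effective chunking improves search relevance.
```

Note that this heuristic also joins genuinely hyphenated compounds that happen to break across lines; a dictionary check can be added if that matters for your corpus.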

Indexing Pipeline

  • Normalize extracted text by removing special characters and handling hyphenation.
  • Apply language-specific analyzers to processed text to enhance search capabilities.

Considerations

General Considerations

  • Performance: Balance chunk size for efficient indexing and query performance.
  • Relevance: Ensure chunks capture meaningful content segments for accurate search results.
  • Indexing Overhead: Consider the overhead of indexing large volumes of data; optimize indexing settings and shard allocation in Elasticsearch.
  • Search Quality: Test and refine search queries to ensure they deliver relevant results across chunked documents.

Choosing a Chunking Method

  • Content Characteristics: For short messages, sentence-level chunking might suffice. For long documents, consider content-aware chunking or even explore heterogeneous chunking with a mix of chunk sizes.
  • Application Requirements: If computational efficiency is paramount, fixed-size chunking might be a good starting point. If context-rich retrieval is crucial, content-aware chunking is a better option.

In summary, the best chunking strategy for using Elasticsearch with web crawling or PDF indexing involves breaking down content into manageable units (documents, pages, sections) that balance granularity with efficiency in indexing and querying. Adjust the strategy based on the nature of the content source (e.g., web pages vs. PDFs) and specific requirements of your search application. By implementing these methodologies, you can enhance the performance and accuracy of your search engine, providing users with more relevant and precise results.

Comparison of Chunking Methodologies

By URL

Benefits: Simplicity; easy to manage.

Use Cases:

  • Websites with small, self-contained pages: Easy to manage each URL as a separate document.
  • Situations where each URL represents a distinct document: Ensures clarity and simplicity in document handling.

By Page Sections

Benefits: Improved granularity; enhanced relevance for search queries.

Use Cases:

  • Long-form content websites: Enhances search relevance by segmenting large articles or posts.
  • Sites with diverse content types: Allows for better organization and retrieval of varied content.

By Content Length

Benefits: Balance between granularity and performance; enhanced search precision.

Use Cases:

  • Content-heavy websites: Ensures that chunks are not too large, improving search precision and performance.
  • Websites where extremely large chunks are not desirable: Balances the need for detailed indexing with performance.

Fixed-Size Chunking

Benefits: Computational efficiency; predictable performance.

Use Cases:

  • Applications with high computational efficiency requirements: Predictable performance and resource usage.
  • Standardized data processing tasks: Suitable for environments where uniform chunk sizes are beneficial.

Sentence-Level Chunking

Benefits: Fine-grained control over content; highly relevant search results.

Use Cases:

  • Short message indexing: Ideal for precise search and retrieval in brief texts.
  • Text analysis applications: Enables detailed analysis and understanding of short segments.

Context-Aware Chunking

Benefits: Retains meaningful context; enhances search relevance.

Use Cases:

  • Long documents with varied content: Ensures that chunks retain meaningful context.
  • Detailed content analysis: Improves the quality of search results by maintaining context.

By Page

Benefits: Logical segmentation; improved readability.

Use Cases:

  • PDF indexing: Treats each page as a logical unit, enhancing readability and searchability.
  • Documents where each page is a distinct unit: Maintains the integrity of page-based content.

By Text Blocks

Benefits: Natural breaks in content; better handling of structured content.

Use Cases:

  • Large documents with clear paragraph breaks: Facilitates easier reading and processing.
  • Structured document indexing: Ensures that meaningful sections are indexed separately.

IBM Solutions

Chunking in watsonx

The IBM watsonx platform enables and accelerates the implementation of RAG patterns, including document chunking and understanding. watsonx Orchestrate, together with watsonx Discovery, supports document chunking across the spectrum from low-code/no-code setups to fully custom implementations.

The methods listed above are common chunking techniques to consider when implementing a RAG use case. watsonx also offers a hybrid technique that improves both accuracy and speed. The process first splits a document into single sentences to ensure meaningful granularity. Sentences are then grouped into chunks by an LLM: the LLM assesses whether each sentence fits the context of the current chunk; if it does not, a new chunk is started. Finally, the LLM extracts meaningful metadata from each chunk, including a title, an ID, and a summary.

For more information on this method, please refer to this IBM blog: https://developer.ibm.com/articles/awb-enhancing-llm-performance-document-chunking-with-watsonx/
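The sentence-grouping step can be sketched as follows; the same_topic callback here is a toy keyword-overlap stand-in for the LLM judgment, not the watsonx implementation:

```python
def group_sentences(sentences, same_topic, max_sentences=5):
    """Group sentences into chunks; same_topic(chunk, sentence) stands in
    for the LLM's judgment of whether a sentence fits the current chunk."""
    chunks, current = [], []
    for sentence in sentences:
        if current and (len(current) >= max_sentences
                        or not same_topic(current, sentence)):
            chunks.append(" ".join(current))  # close the current chunk
            current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks

# Toy stand-in: sentences sharing any word belong to the same topic.
def shares_word(chunk, sentence):
    chunk_words = set(" ".join(chunk).lower().split())
    return bool(chunk_words & set(sentence.lower().split()))

sentences = ["Apples are red.", "Apples grow on trees.", "Python is a language."]
print(group_sentences(sentences, shares_word))
# ['Apples are red. Apples grow on trees.', 'Python is a language.']
```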

Open Source Tools

Syntactic chunking

Syntactic chunking uses a structure-guided approach to separate text based on document formatting. Most commonly and naturally, section headings are used as separators. For document formats that do not encode structural information (e.g., PDF), font sizes can serve as markers for separating chunks.

Strengths:

  • Simple implementation

Limitations:

  • The algorithm assumes an informational hierarchy, so syntactic markers may be missed when that hierarchy is absent

Syntactic chunking can be beneficial in circumstances where the topological structure of the document is predictable and cleanly formatted for partitioning.

Fixed size chunking

In fixed-size chunking, the input data is divided into equal-sized chunks, typically defined by a predetermined number of tokens. It is simple to implement and understand, making it suitable for tasks where the input data has a consistent and predictable size.

Strengths:

  • Easy to implement and understand
  • Computationally efficient

Limitations:

  • Rigid and cannot adjust to text context
  • May leave out important contextual information in text

Fixed size chunking is best used in cases where input data is predictable and does not need thorough context.

Example:

from langchain.text_splitter import CharacterTextSplitter

# Example text
text = "The quick brown fox jumps over the lazy dog."

# Create a splitter for fixed-size chunks; separator="" splits on characters
# rather than the default paragraph separator
splitter = CharacterTextSplitter(separator="", chunk_size=10, chunk_overlap=0)

# Split the text
chunks = splitter.split_text(text)

# Print the chunks
for chunk in chunks:
    print(chunk)

Semantic chunking

Semantic chunking separates text based on context and groups meaningful components using algorithmic techniques. Semantic chunking is important when the meaning of the content is vital to producing a response, as with legal documentation.

Strengths:

  • Retrieved data has higher accuracy and relevancy
  • Cohesiveness of knowledge base maintained in chunking

Limitations:

  • Complex implementation using sophisticated algorithms
  • Computationally more expensive

Example:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Example text
text = "Steve Jobs was born in California. He co-founded Apple in 1976."

# RecursiveCharacterTextSplitter approximates semantic boundaries by
# preferring paragraph and sentence breaks before falling back to characters
splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=10)

# Split the text
chunks = splitter.split_text(text)

# Print the chunks
for chunk in chunks:
    print(chunk)

Hybrid chunking

Hybrid chunking uses both contextual and structural information for combined strengths from fixed sized chunking and semantic chunking. This method is best used for situations where there’s a large corpus of data across use cases, with varied formatting and material, such as an enterprise knowledge base.

Strengths:

  • Optimal accuracy and speed with flexible chunking techniques based on knowledge base structure and context

Limitations:

  • More maintenance and effort to implement due to demanding resource requirements

Example:

from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter

# Example text
text = "Steve Jobs was born in California. He co-founded Apple in 1976."

# Fixed-size splitter (separator="" splits on characters)
fixed_size_splitter = CharacterTextSplitter(separator="", chunk_size=50, chunk_overlap=10)
fixed_size_chunks = fixed_size_splitter.split_text(text)

# Structure-aware splitter
semantic_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
semantic_chunks = semantic_splitter.split_text(text)

# Print the fixed-size chunks
print("Fixed-size chunks:")
for chunk in fixed_size_chunks:
    print(chunk)

# Print the semantic chunks
print("\nSemantic chunks:")
for chunk in semantic_chunks:
    print(chunk)

Agentic chunking

This chunking strategy involves using large language models (LLMs) to determine the appropriate length and content of text chunks based on the context in which they will be used. The strategy is inspired by the concept of propositions, which involves extracting standalone statements from a raw piece of text.

To implement this approach, the strategy uses a propositional-retrieval template provided by Langchain to extract propositions from the text. These propositions are then fed to an LLM-based agent, which determines whether each proposition should be included in an existing chunk or if a new chunk should be created.

The LLM-based agent uses a variety of factors to make this determination, including the relevance of the proposition to the current chunk, the overall coherence of the chunk, and the goals and intentions of the model. By using LLMs in this way, the strategy aims to generate text chunks that are not only coherent and contextually appropriate but also aligned with the model's goals and intentions.

Strengths:

  • Autonomy: LLMs can make decisions about which propositions to include in a chunk and when to create a new chunk without human intervention, allowing for more efficient and scalable text generation.
  • Coherence: LLMs are designed to generate text that is coherent and contextually appropriate, which can help to ensure that the generated chunks are meaningful and easy to understand.

Limitations:

  • Computationally exhaustive: Generating text using LLMs can be computationally expensive, particularly for large or complex models. This can limit the scalability and practicality of using LLMs for certain applications.
  • Hallucinations: This can occur when the model generates text based on its own internal biases or assumptions, rather than on the actual data or context.
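As an illustrative sketch of the routing loop only (not the Langchain propositional-retrieval template itself), the decide callback below is a toy name-matching stand-in for the LLM agent's judgment:

```python
def agentic_chunk(propositions, decide):
    """Route each proposition to an existing chunk or start a new one.

    decide(chunks, prop) stands in for the LLM agent: it returns the index
    of the chunk the proposition belongs to, or None to open a new chunk."""
    chunks = []
    for prop in propositions:
        idx = decide(chunks, prop)
        if idx is None:
            chunks.append([prop])
        else:
            chunks[idx].append(prop)
    return chunks

# Toy stand-in for the agent: route on a shared capitalised name.
def toy_decide(chunks, prop):
    names = {w for w in prop.split() if w[0].isupper()}
    for i, chunk in enumerate(chunks):
        if names & {w for c in chunk for w in c.split() if w[0].isupper()}:
            return i
    return None

props = ["Steve Jobs was born in California.",
         "Steve Jobs co-founded Apple in 1976.",
         "Elasticsearch is a search engine."]
print(agentic_chunk(props, toy_decide))
# [['Steve Jobs was born in California.', 'Steve Jobs co-founded Apple in 1976.'],
#  ['Elasticsearch is a search engine.']]
```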

Now, how do we implement some of these techniques to chunk documentation? Many open-source libraries implement these chunking algorithms. A good place to start is LangChain's text splitters; see the documentation on the specific text splitters available: https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/

Next Steps

Get the latest technology patterns, solution architectures, and architecture publications from IBM.

 

Go to the IBM Architecture Center
Contributors

Haneen Bakbak, Luke Major

Updated: November 15, 2024