Syntactic chunking
Syntactic chunking uses a structure-guided approach to split text along document formatting. Most commonly and naturally, section headings serve as separators. For document formats that do not encode structural information (e.g., PDF), font sizes can instead be used as markers for chunk boundaries.
Strengths:
- Chunks follow the document's natural structure, keeping related content together
Limitations:
- Assumes an informational hierarchy; documents without clear syntactic markers may be chunked poorly
Syntactic chunking is beneficial in circumstances where the structure of the document is predictable and cleanly formatted for partitioning.
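Example (a minimal sketch of a heading-based split using LangChain's MarkdownHeaderTextSplitter; the sample text is illustrative, and for formats like PDF you would first map font sizes to heading levels before applying a splitter like this):
from langchain.text_splitter import MarkdownHeaderTextSplitter
# Example text with section headings
text = """# Introduction
Chunking splits documents into retrievable pieces.

## Methods
Several chunking strategies exist."""
# Each heading level acts as a chunk separator
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "Header 1"), ("##", "Header 2")]
)
# split_text returns Documents that carry the heading as metadata
chunks = splitter.split_text(text)
for chunk in chunks:
    print(chunk.metadata, chunk.page_content)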
Fixed-size chunking
In fixed-size chunking, the input data is divided into equal-sized chunks. This method often separates text into predefined sizes based on the number of tokens. This method is simple to implement and understand, making it suitable for tasks where the input data has a consistent and predictable size.
Strengths:
- Easy to implement and understand
- Computationally efficient
Limitations:
- Rigid; cannot adapt to the structure or context of the text
- May split sentences or ideas mid-chunk, losing important context
Fixed-size chunking is best used when the input data is predictable and the task does not depend on deep context.
Example:
from langchain.text_splitter import CharacterTextSplitter
# Example text
text = "The quick brown fox jumps over the lazy dog."
# Create a splitter for fixed-size chunks; an empty separator forces
# splitting purely by character count instead of on the default "\n\n"
splitter = CharacterTextSplitter(separator="", chunk_size=10, chunk_overlap=0)
# Split the text
chunks = splitter.split_text(text)
# Print the chunks
for chunk in chunks:
    print(chunk)
Semantic chunking
Semantic chunking splits text based on context, grouping meaningful components together using algorithmic techniques. It is important when the meaning of the content is vital to generating a correct response, such as in legal documentation.
Strengths:
- Retrieved data has higher accuracy and relevance
- Maintains the cohesiveness of the knowledge base during chunking
Limitations:
- Complex to implement, requiring sophisticated algorithms
- Computationally more expensive
Example:
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Example text
text = "Steve Jobs was born in California. He co-founded Apple in 1976."
# RecursiveCharacterTextSplitter approximates semantic boundaries by
# trying paragraph, sentence, then word separators in order
splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=10)
# Split the text
chunks = splitter.split_text(text)
# Print the chunks
for chunk in chunks:
    print(chunk)
Hybrid chunking
Hybrid chunking combines structural and contextual information, drawing on the strengths of both fixed-size and semantic chunking. This method is best suited to situations with a large corpus of data spanning many use cases, with varied formatting and material, such as an enterprise knowledge base.
Strengths:
- Balances accuracy and speed by selecting the chunking technique to fit the knowledge base's structure and context
Limitations:
- Requires more effort to implement and maintain, with more demanding resource requirements
Example:
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
# Example text
text = "Steve Jobs was born in California. He co-founded Apple in 1976."
# Fixed-size splitter; an empty separator splits by character count
fixed_size_splitter = CharacterTextSplitter(separator="", chunk_size=50, chunk_overlap=10)
fixed_size_chunks = fixed_size_splitter.split_text(text)
# Semantic-style splitter (recursive splitting on natural separators)
semantic_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
semantic_chunks = semantic_splitter.split_text(text)
# Print the fixed-size chunks
print("Fixed-size chunks:")
for chunk in fixed_size_chunks:
    print(chunk)
# Print the semantic chunks
print("\nSemantic chunks:")
for chunk in semantic_chunks:
    print(chunk)
Agentic chunking
This chunking strategy involves using large language models (LLMs) to determine the appropriate length and content of text chunks based on the context in which they will be used. The strategy is inspired by the concept of propositions, which involves extracting standalone statements from a raw piece of text.
To implement this approach, the strategy uses a propositional-retrieval template provided by LangChain to extract propositions from the text. These propositions are then fed to an LLM-based agent, which determines whether each proposition should be added to an existing chunk or whether a new chunk should be created.
The LLM-based agent uses a variety of factors to make this determination, including the relevance of the proposition to the current chunk, the overall coherence of the chunk, and the goals and intentions of the model. By using LLMs in this way, the strategy aims to generate text chunks that are not only coherent and contextually appropriate but also aligned with the model's goals and intentions.
Strengths:
- Autonomy: LLMs can make decisions about which propositions to include in a chunk and when to create a new chunk without human intervention, allowing for more efficient and scalable text generation.
- Coherence: LLMs are designed to generate text that is coherent and contextually appropriate, which can help to ensure that the generated chunks are meaningful and easy to understand.
Limitations:
- Computationally expensive: Generating text with LLMs is costly, particularly for large or complex models. This can limit the scalability and practicality of using LLMs for certain applications.
- Hallucinations: The model may generate text based on its own internal biases or assumptions rather than on the actual data or context.
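Example (a minimal sketch of the agentic loop, assuming the langchain-openai package and an OpenAI API key; the prompts, the gpt-4o-mini model choice, and the single-pass assignment loop are illustrative assumptions, not LangChain's propositional-retrieval template itself):
from langchain_openai import ChatOpenAI
# Illustrative model choice; any capable chat model works
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
text = "Steve Jobs was born in California. He co-founded Apple in 1976."
# Step 1: extract standalone propositions from the raw text
response = llm.invoke(
    "Decompose the following text into standalone propositions, "
    "one per line, with pronouns resolved:\n\n" + text
)
propositions = [p.strip() for p in response.content.splitlines() if p.strip()]
# Step 2: let the LLM decide, per proposition, whether it belongs in an
# existing chunk or should start a new one
chunks = []
for prop in propositions:
    summaries = "\n".join(f"{i}: {' '.join(c)}" for i, c in enumerate(chunks))
    decision = llm.invoke(
        "Existing chunks:\n" + (summaries or "(none)") + "\n\n"
        "Proposition: " + prop + "\n\n"
        "Reply with only the number of the chunk this proposition belongs "
        "to, or NEW if it should start a new chunk."
    ).content.strip()
    if decision.isdigit() and int(decision) < len(chunks):
        chunks[int(decision)].append(prop)
    else:
        chunks.append([prop])
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {' '.join(chunk)}")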
Now, how do we implement these techniques to chunk documentation? Many open-source libraries implement these chunking algorithms. A good place to start is LangChain's text splitters; see the documentation on the specific text splitters available: https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/