Let's walk through some example code, available through LangChain, that implements each of the chunking strategies introduced earlier in this tutorial.
Fixed-size chunking
To implement fixed-size chunking, we can use LangChain's CharacterTextSplitter and set chunk_size and chunk_overlap. Because we construct the splitter from a Hugging Face tokenizer below, chunk_size is measured in tokens rather than characters. Feel free to experiment with different values. We also set the separator to a newline character so that we split along paragraph boundaries. For tokenization, we can use the granite-3.1-8b-instruct tokenizer, which breaks text into tokens that an LLM can process.
from transformers import AutoTokenizer
from langchain_text_splitters import CharacterTextSplitter

tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3.1-8b-instruct")

text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer,
    separator="\n",  # default: "\n\n"
    chunk_size=1200,
    chunk_overlap=200)
fixed_size_chunks = text_splitter.create_documents([doc[0].page_content])
We can print one of the chunks to better understand their structure.
fixed_size_chunks[1]
Output (truncated):
Document(metadata={}, page_content="As always, IBM's historical commitment to open source is reflected in the permissive and standard open source licensing for every offering discussed in this article.\n\r\n Granite 3.1 8B Instruct: raising the bar for lightweight enterprise models\r\n \nIBM's efforts in the ongoing optimization the Granite series are most evident in the growth of its flagship 8B dense model. IBM Granite 3.1 8B Instruct now bests most open models in its weight class in average scores on the academic benchmarks evaluations included in the Hugging Face OpenLLM Leaderboard...")
We can also use the tokenizer to verify our processing and check the number of tokens present in each chunk. This step is optional and for reference only.
for idx, val in enumerate(fixed_size_chunks):
    token_count = len(tokenizer.encode(val.page_content))
    print(f"The chunk at index {idx} contains {token_count} tokens.")
Output:
The chunk at index 0 contains 1106 tokens.
The chunk at index 1 contains 1102 tokens.
The chunk at index 2 contains 1183 tokens.
The chunk at index 3 contains 1010 tokens.
Great! It looks like our chunks are sized appropriately.
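As one more optional check, we can confirm that consecutive chunks actually share overlapping text. This is a minimal sketch, not part of the original pipeline: it assumes each overlap begins on a line boundary (which holds here because we split on newlines), and it should be treated as a heuristic rather than a strict test.

# optional sketch: confirm that consecutive chunks overlap
# we check whether the first line of each chunk also appears at the end of
# the previous chunk; a False can occur if a single split exceeded
# chunk_overlap, so treat the result as a heuristic
for prev, curr in zip(fixed_size_chunks, fixed_size_chunks[1:]):
    first_line = curr.page_content.split("\n")[0]
    print(first_line in prev.page_content)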
Recursive chunking
For recursive chunking, we can use LangChain's RecursiveCharacterTextSplitter. As in the fixed-size chunking example, feel free to experiment with different chunk and overlap sizes.
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=0)
recursive_chunks = text_splitter.create_documents([doc[0].page_content])
recursive_chunks[:5]
Output:
[Document(metadata={}, page_content='IBM Granite 3.1: powerful performance, longer context and more'),
 Document(metadata={}, page_content='IBM Granite 3.1: powerful performance, longer context, new embedding models and more'),
 Document(metadata={}, page_content='Artificial Intelligence'),
 Document(metadata={}, page_content='Compute and servers'),
 Document(metadata={}, page_content='IT automation')]
The splitter successfully chunked the text using its default separators: ["\n\n", "\n", " ", ""].
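The separator hierarchy can also be overridden. The following is a quick sketch, where the separator list and sizes are illustrative choices rather than values used elsewhere in this tutorial: we ask the splitter to prefer paragraph breaks, then lines, then sentence boundaries before falling back to single characters.

# sketch: a custom separator hierarchy (values are illustrative)
custom_splitter = RecursiveCharacterTextSplitter(
    # separators are tried in order, from most to least preferred
    separators=["\n\n", "\n", ". ", " ", ""],
    chunk_size=250,
    chunk_overlap=0,
)
custom_chunks = custom_splitter.create_documents([doc[0].page_content])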
Semantic chunking
Semantic chunking requires an embedding model or encoder model. We can use the granite-embedding-30m-english model as our embedding model. We can again print one of the chunks to better understand their structure.
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_experimental.text_splitter import SemanticChunker
embeddings_model = HuggingFaceEmbeddings(model_name="ibm-granite/granite-embedding-30m-english")
text_splitter = SemanticChunker(embeddings_model)
semantic_chunks = text_splitter.create_documents([doc[0].page_content])
semantic_chunks[1]
Output (truncated):
Document(metadata={}, page_content="Our latest dense models (Granite 3.1 8B, Granite 3.1 2B), MoE models (Granite 3.1 3B-A800M, Granite 3.1 1B-A400M) and guardrail models (Granite Guardian 3.1 8B, Granite Guardian 3.1 2B) all feature a 128K token context length.We're releasing a family of all-new embedding models. The new retrieval-optimized Granite Embedding models are offered in four sizes, ranging from 30M–278M parameters. Like their generative counterparts, they offer multilingual support across 12 different languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch and Chinese. Granite Guardian 3.1 8B and 2B feature a new function calling hallucination detection capability, allowing increased control over and observability for agents making tool calls...")
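SemanticChunker also lets us control how eagerly it inserts breakpoints. Here is a minimal sketch with illustrative threshold values, not values prescribed by this tutorial: lowering the percentile threshold places breakpoints at more of the embedding-distance spikes, producing more and smaller chunks.

# sketch: tune breakpoint sensitivity (values are illustrative)
# "percentile" is the default strategy; lower amounts split more often
sensitive_splitter = SemanticChunker(
    embeddings_model,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=80,
)
smaller_semantic_chunks = sensitive_splitter.create_documents([doc[0].page_content])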
Document-based chunking
Documents of many file types are compatible with LangChain's document-based text splitters. In this tutorial, we will use a Markdown file. For examples of recursive JSON splitting, code splitting and HTML splitting, see the LangChain documentation; a brief code-splitting sketch also follows below.
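As a quick taste of code splitting, RecursiveCharacterTextSplitter.from_language preconfigures separators that respect a language's syntax. This is a sketch only: sample_code is a placeholder string, and the sizes are illustrative.

# sketch: language-aware code splitting (sample_code is a placeholder)
from langchain_text_splitters import Language

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=200, chunk_overlap=0
)
sample_code = "def hello():\n    print('Hello, world!')\n\nclass Greeter:\n    pass\n"
code_chunks = python_splitter.create_documents([sample_code])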
An example of a Markdown file we can load is the README for Granite 3.1 on IBM's GitHub.
from langchain_community.document_loaders import WebBaseLoader

url = "https://raw.githubusercontent.com/ibm-granite/granite-3.1-language-models/refs/heads/main/README.md"
markdown_doc = WebBaseLoader(url).load()
markdown_doc
Output:
[Document(metadata={'source': 'https://raw.githubusercontent.com/ibm-granite/granite-3.1-language-models/refs/heads/main/README.md'}, page_content='\n\n\n\n :books: Paper (comming soon)\xa0 | :hugs: HuggingFace Collection\xa0 | \n :speech_balloon: Discussions Page\xa0 | 📘 IBM Granite Docs\n\n\n---\n## Introduction to Granite 3.1 Language Models\nGranite 3.1 language models are lightweight, state-of-the-art, open foundation models that natively support multilinguality, coding, reasoning, and tool usage, including the potential to be run on constrained compute resources. All the models are publicly released under an Apache 2.0 license for both research and commercial use. The models\' data curation and training procedure were designed for enterprise usage and customization, with a process that evaluates datasets for governance, risk and compliance (GRC) criteria, in addition to IBM\'s standard data clearance process and document quality checks...')]
Now we can use LangChain's MarkdownHeaderTextSplitter to split the file by header type, as configured in the headers_to_split_on list. We'll also print one of the chunks as an example.
# document-based chunking
from langchain_text_splitters import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on)
document_based_chunks = markdown_splitter.split_text(markdown_doc[0].page_content)
document_based_chunks[3]
Output:
Document(metadata={'Header 2': 'How to Use our Models?', 'Header 3': 'Inference'}, page_content='This is a simple example of how to use Granite-3.1-1B-A400M-Instruct model. \n```python\nimport torch\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\n\ndevice = "auto"\nmodel_path = "ibm-granite/granite-3.1-1b-a400m-instruct"\ntokenizer = AutoTokenizer.from_pretrained(model_path)\n# drop device_map if running on CPU\nmodel = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)\nmodel.eval()\n# change input text as desired\nchat = [\n{ "role": "user", "content": "Please list one IBM Research laboratory located in the United States. You should only output its name and location." },\n]\nchat = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)\n# tokenize the text\ninput_tokens = tokenizer(chat, return_tensors="pt").to(device)\n# generate output tokens\noutput = model.generate(**input_tokens,\nmax_new_tokens=100)\n# decode output tokens into text\noutput = tokenizer.batch_decode(output)\n# print output\nprint(output)\n```')
As we can see in the output, the splitter successfully split the text by header type.
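If some header sections are still too long for a model's context window, a common follow-up, sketched here with illustrative sizes, is to run a character-level splitter over the header chunks; each resulting piece keeps its header metadata.

# sketch: further split long header sections (sizes are illustrative)
# split_documents preserves each chunk's header metadata
char_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
final_chunks = char_splitter.split_documents(document_based_chunks)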