In this tutorial, we will set up a retrieval-augmented generation (RAG) workflow with IBM® Granite® 4.0 to explore and ask questions about an IBM Quantum® research paper published earlier this year. Before we begin, let's build a basic understanding of quantum computing.
Your traditional computer can stream movies and run numerous applications simultaneously, yet it could not factor a sufficiently large number before our sun burns out. A quantum computer, however, might do it while you're waiting for your pizza delivery. For certain problems, quantum computers can outperform the best known classical algorithms, tackling tasks that once seemed impossible. Many of these problems have remained out of reach simply because of the time and energy they demand from traditional computers. You might ask, "Why is this the case? Aren't traditional computers quite powerful in today's world?" They are, but some problems require far greater computational power. For example, imagine being stuck in a maze and needing to find the exit. A traditional computer maps each path step by step until it finds the exit; depending on the maze's size and the number of paths, that could take a long time. A quantum computer, loosely speaking, can explore many pathways at once. This ability eases the constraints that traditional computing faces when dealing with certain complex problems.
This tutorial contains step-by-step instructions for creating a RAG pipeline with LangChain over text data, specifically an IBM Quantum research paper.
RAG is an architectural pattern that can be used to augment the performance of large language models (LLMs). It works by recalling factual information from an external knowledge base, which is then used to inform and enhance the model's generated response. This method is typically more cost-effective and resource-efficient than fine-tuning a model for specific queries. It is also possible to integrate autonomous agents, automated components that can perform retrieval or decision-making tasks, into a RAG system for expanded retrieval capabilities. In this walkthrough, however, we focus on the most common approach to simple RAG: creating dense, numerical vector representations (embeddings) of the knowledge base. These representations enable the retrieval of text chunks that are semantically similar to a user query. Alternatively, LangGraph, LangChain's framework for building graph-structured, agentic workflows, can be used to orchestrate this process. For simplicity and focus, we will use LangChain, a popular framework for building language model applications.
Use cases for RAG applications include:
Customer service chatbot: Answering questions about a product or service by using facts from the product documentation.
Domain knowledge: Exploring a specialized domain (for example, finance) by using relevant information from papers or articles in the knowledge base.
LLM-powered news chat: Chatting about current events by calling up relevant recent news articles.
In its simplest form, RAG requires three steps, sketched in code after this list:
Index knowledge-base passages for efficient retrieval. In this recipe, we take embeddings of the passages and store them in a vector database.
Retrieve relevant passages from the database by using semantic search. In this recipe, we use an embedding of the query to retrieve semantically similar passages.
Generate a response by feeding the retrieved passage to an LLM along with the user query.
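To make these steps concrete before building the full pipeline, here is a minimal, self-contained sketch of all three. The toy passages, the sample query and the use of NumPy for cosine similarity are illustrative choices only; step 3 is left as a comment because the generation model is configured later in this tutorial.
from langchain_huggingface import HuggingFaceEmbeddings
import numpy as np

embedder = HuggingFaceEmbeddings(model_name="ibm-granite/granite-embedding-30m-english")

# 1. Index: embed the knowledge-base passages (a real pipeline stores these in a vector database).
passages = [
    "Bivariate bicycle codes are a family of quantum error-correcting codes.",
    "LangChain is a framework for building applications with language models.",
]
passage_vectors = np.array(embedder.embed_documents(passages))

# 2. Retrieve: embed the query and rank passages by cosine similarity.
query_vector = np.array(embedder.embed_query("What are bivariate bicycle codes?"))
scores = passage_vectors @ query_vector / (
    np.linalg.norm(passage_vectors, axis=1) * np.linalg.norm(query_vector)
)
best_passage = passages[int(np.argmax(scores))]
print(best_passage)

# 3. Generate: pass the retrieved passage and the query to an LLM (done later with watsonx.ai).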
You need an IBM Cloud® account to create a watsonx.ai® project.
Several Python versions can work for this tutorial. At the time of publishing, we recommend Python 3.10, 3.11 or 3.12.
To get started with IBM Granite on IBM watsonx.ai, follow this recipe.
This tutorial is available on GitHub.
Ensure that you are running Python 3.10, 3.11 or 3.12 by using the following code.
import sys
assert sys.version_info >= (3, 10) and sys.version_info < (3, 13), "Use Python 3.10, 3.11 or 3.12 to run this notebook."
python -m venv myenv
source myenv/bin/activate
Create a watsonx.ai Runtime service instance (select your appropriate region and choose the Lite plan, which is a free instance).
Generate an application programming interface (API) key.
Associate the watsonx.ai Runtime service instance with the project that you created in watsonx.ai.
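The notebook reads these credentials through the get_env_var helper, which looks up environment variables. If you prefer to set them directly in Python, a minimal sketch with placeholder values might look like the following; the endpoint shown assumes the Dallas (us-south) region, so substitute your own region's URL.
import os

# Placeholder values for illustration only; replace them with your own credentials.
os.environ["WATSONX_APIKEY"] = "<your watsonx.ai API key>"
os.environ["WATSONX_PROJECT_ID"] = "<your watsonx.ai project ID>"
os.environ["WATSONX_URL"] = "https://us-south.ml.cloud.ibm.com"  # use your region's endpoint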
! echo "::group::Install Dependencies"
%pip install uv
! uv pip install git+https://github.com/ibm-granite-community/utils.git \
transformers \
langchain \
langchain_core \
langchain_huggingface sentence_transformers \
langchain_milvus \
langchain_ibm \
docling \
accelerate \
"pymilvus[milvus_lite]"
! echo "::endgroup::"
import tempfile

from docling.document_converter import DocumentConverter
from docling_core.transforms.chunker.hybrid_chunker import HybridChunker
from docling_core.types.doc.labels import DocItemLabel
from ibm_granite_community.notebook_utils import get_env_var, wrap_text
from langchain_core.documents import Document
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_ibm import ChatWatsonx
from langchain_milvus import Milvus
from transformers import AutoTokenizer
WATSONX_APIKEY = get_env_var('WATSONX_APIKEY')
WATSONX_PROJECT_ID = get_env_var('WATSONX_PROJECT_ID')
URL = get_env_var("WATSONX_URL")
Specify the model to use for generating embedding vectors from text.
To use an open-source model from a provider other than Hugging Face, replace this code cell with one from this embeddings model recipe. Feel free to modify the following code to use a different provider, such as OpenAI and its OpenAIEmbeddings package.
embeddings_model_path = "ibm-granite/granite-embedding-30m-english"
embedding_model = HuggingFaceEmbeddings(
model_name=embeddings_model_path,
)
embeddings_tokenizer = AutoTokenizer.from_pretrained(embeddings_model_path)
Specify the database to use for storing and retrieving embedding vectors.
To connect to a vector database other than Milvus, substitute this code cell with one from this vector store recipe.
db_file = tempfile.NamedTemporaryFile(prefix="milvus_", suffix=".db", delete=False).name
print(f"The vector database will be saved to {db_file}")
vector_db = Milvus(
embedding_function=embedding_model,
connection_args={"uri": db_file},
auto_id=True,
enable_dynamic_field=True,
index_params={"index_type": "AUTOINDEX"},
)
The LLM will be used for answering our questions given the retrieved text.
Select an IBM Granite model from the list of available foundation models. Here we run inference with watsonx.ai, which also saves local computational resources. To connect to a chat model from another provider, such as OpenAI, substitute this code cell with one from the LLM component recipe. Also, consider modifying the model parameters to best fit your use case.
llm = ChatWatsonx(
model_id="ibm/granite-4-h-small",
apikey=WATSONX_APIKEY,
url=URL,
project_id=WATSONX_PROJECT_ID,
params={
"temperature": 0,
"max_new_tokens": 512,
}
)
# Load the tokenizer for the Granite generation model (it can be used, for example, to count prompt tokens).
tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-4.0-h-small")
Here we use an IBM paper that lays out an end-to-end framework for a fault-tolerant quantum computer that is modular and based on the bivariate bicycle codes.
Split the document into text segments that can fit into the model's context window.
# Here are our documents, feel free to add more documents in formats that Docling supports
sources = [
'https://arxiv.org/pdf/2506.03094'
]
converter = DocumentConverter()
# Convert and chunk our documents
doc_id = 0
texts: list[Document] = [
Document(page_content=chunk.text, metadata={"doc_id": (doc_id:=doc_id+1), "source": source})
for source in sources
for chunk in HybridChunker(tokenizer=embeddings_tokenizer).chunk(converter.convert(source=source).document)
if any(filter(lambda c: c.label in [DocItemLabel.TEXT, DocItemLabel.PARAGRAPH], iter(chunk.meta.doc_items)))
]
print(f"{len(texts)} document chunks created")
NOTE: Population of the vector database can take over a minute depending on your embedding model and service.
ids = vector_db.add_documents(texts)
print(f"{len(ids)} documents added to the vector database")
Conduct a similarity search of the database to find relevant documents by proximity of their embedded vectors in vector space. The query variable takes any user input as the search query. Take some time to craft your own questions. In the meantime, here are some sample questions to get you started:
How do we validate the fault tolerance of each bicycle instruction?
Explain to me what this paper is about.
query = "Explain to me what this paper is about."
retrieved_docs = vector_db.similarity_search(query) # returns a list of relevant documents
print(f"{len(retrieved_docs)} source documents returned")
for doc in retrieved_docs:
print(doc)
print("=" * 80) # Separator for clarity
In the previous step, we created a variable, query, to hold a user query. In this step, we'll construct our RAG workflow to answer that query.
First, we will use a chat prompt template and supply two placeholder values that the LangChain RAG pipeline will replace: {input} will contain the user query and {context} will hold the retrieved documents from the IBM Quantum paper as shown in the previous search. The context will be fed to the model as document context for answering our question.
Next, the document_prompt_template is established to make our document ID and page content consumable by the Granite model. We then create a document chain with our LLM, prompt template, document prompt and document separator.
Finally, we set up our RAG chain with a retriever for similarity search and the documents chain we just set up.
Note: For additional observability, it is recommended that you use LangSmith. For the purposes of this tutorial, we will not be applying this step. Feel free to learn more about it and even apply it yourself here. If you do apply LangSmith to your notebook, don’t forget to set your environment variables.
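If you do enable LangSmith, tracing is typically switched on through environment variables before you build the chain. The following sketch uses a placeholder API key and an illustrative project name; verify the variable names against the current LangSmith documentation.
import os

# Hypothetical LangSmith setup; confirm variable names in the LangSmith docs.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your LangSmith API key>"
os.environ["LANGCHAIN_PROJECT"] = "granite-rag-tutorial"  # optional project name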
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains.retrieval import create_retrieval_chain
from langchain_core.prompts import ChatPromptTemplate, PromptTemplate
# Create a Granite prompt for question-answering with the retrieved context
prompt_template = ChatPromptTemplate.from_messages(
[
("system", """You are a helpful assistant with access to the following documents. You may use one or more documents to assist with the user query.
You are given a list of documents within <documents></documents> XML tags:
<documents>{context}
</documents>
Write the response to the user's input by strictly aligning with the facts in the provided documents. If the information needed to answer the question is not available in the documents, inform the user that the question cannot be answered based on the available data."""),
("user", "{input}"),
],
)
document_prompt_template = PromptTemplate.from_template("""{{"doc_id": {doc_id}, "text": {page_content}}}""")
# Assemble the retrieval-augmented generation chain
combine_docs_chain = create_stuff_documents_chain(
llm=llm,
prompt=prompt_template,
document_prompt=document_prompt_template,
document_separator="\n",
)
rag_chain = create_retrieval_chain(
retriever=vector_db.as_retriever(),
combine_docs_chain=combine_docs_chain,
)
Use the RAG chain to process your question. The document chunks relevant to that question are retrieved and used as context.
output = rag_chain.invoke({"input": query})
print("=" * 40)
print("RAG Answer:")
print(wrap_text(output['answer']))
print("=" * 40)
You have now successfully built a powerful RAG system that combines PDF parsing with semantic search and language generation. This setup has given you a solid foundation for extracting and querying information from complex documents while preserving their structure and context. To take this further, try testing the notebook with different types of PDFs such as technical manuals, research papers or financial reports to see how it handles various document structures. Also, try experimenting with different models and tweaking their parameters to find the most efficient configuration. You might also consider integrating LangSmith to track retrieval quality and optimize your setup’s performance over time.
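As a small starting point for that experimentation, you can change how many chunks the retriever passes to the model by rebuilding the chain with a different k. The value below is an illustrative choice, and the sample question comes from the list earlier in this tutorial.
# Retrieve the top 8 chunks instead of the default; tune k for your documents.
rag_chain = create_retrieval_chain(
    retriever=vector_db.as_retriever(search_kwargs={"k": 8}),
    combine_docs_chain=combine_docs_chain,
)
output = rag_chain.invoke({"input": "How do we validate the fault tolerance of each bicycle instruction?"})
print(wrap_text(output["answer"]))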