In this recipe, you’ll learn how to combine several advanced tools to build an AI-powered multimodal RAG pipeline. This tutorial guides you through the following processes:
  1. Document preprocessing: Learn how to handle documents from various sources, parse and transform them into usable formats, and store them in vector databases using Docling. You will use a Granite vision-language model to generate text descriptions of the images in the documents.
  2. RAG: Understand how to connect LLMs such as Granite with external knowledge bases to enhance query responses and generate valuable insights.
  3. LangChain for workflow integration: Discover how to use LangChain to streamline and orchestrate document processing and retrieval workflows, enabling seamless interaction between different components of the system.
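The three steps above can be sketched end to end with a toy, pure-Python example. This is a stand-in, not the recipe itself: in the real pipeline Docling does the parsing, an embedding model and vector database handle storage, and the retrieved chunk is passed to a Granite LLM. Here, bag-of-words vectors and cosine similarity illustrate the same parse → index → retrieve flow:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Bag-of-words 'embedding' (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# 1. "Parse" documents into text chunks (Docling's job in the real pipeline).
chunks = [
    "Granite is a family of open-source language models from IBM.",
    "Docling converts PDFs and other documents into structured text.",
    "LangChain orchestrates retrieval and generation workflows.",
]

# 2. Store each chunk with its embedding (a vector database's job in the real pipeline).
index = [(chunk, embed(chunk)) for chunk in chunks]

# 3. Retrieve the chunk most similar to the query; in RAG, this chunk would
#    be inserted into the LLM prompt as grounding context.
query = embed("which tool converts documents into text?")
best_chunk, _ = max(index, key=lambda item: cosine(query, item[1]))
print(best_chunk)
```

The real recipe swaps each stand-in for its production counterpart while keeping this exact shape of data flow.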
This recipe uses three cutting-edge technologies:
  • Docling: An open-source toolkit used to parse and convert documents.
  • Granite: A family of state-of-the-art models that provides robust natural language capabilities, including a vision-language model for image-to-text generation.
  • LangChain: A powerful framework used to build applications powered by language models, designed to simplify complex workflows and integrate external tools seamlessly.
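One detail worth seeing concretely is how the Granite vision-language model makes a multimodal document searchable: each image is replaced by a generated text description so the whole document can be embedded as text. The sketch below uses a hypothetical `caption` function in place of the real vision-model call made through Replicate in the recipe:

```python
def caption(image_ref: str) -> str:
    # Hypothetical stand-in for a Granite vision-language model call that
    # returns a text description of the referenced image.
    return f"[image description for {image_ref}]"

# A parsed document interleaves text blocks and image references. Replacing
# each image with its generated caption yields a purely textual document
# that a text embedding model can index alongside everything else.
parsed_blocks = [
    ("text", "Quarterly revenue grew 12%."),
    ("image", "figure-1.png"),
    ("text", "Growth was driven by cloud services."),
]
textual_blocks = [
    content if kind == "text" else caption(content)
    for kind, content in parsed_blocks
]
print(textual_blocks[1])  # "[image description for figure-1.png]"
```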
You will need a Replicate API token to run this recipe in Colab. Instructions for obtaining this credential can be found here.
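The Replicate Python client and LangChain’s Replicate integration both read the token from the `REPLICATE_API_TOKEN` environment variable, so a minimal setup cell, sketched here with a placeholder value you would replace with your own token, looks like:

```python
import os

# Set the token once per session; the replicate client and LangChain's
# Replicate wrapper pick it up from the environment automatically.
# Replace the placeholder with your actual token (in Colab, you would
# typically paste it interactively rather than hard-coding it).
os.environ.setdefault("REPLICATE_API_TOKEN", "<your-replicate-api-token>")
```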

Get started

Explore sample code in a GitHub repo

Try it out

Execute sample code in Colab