Granite Vision

Fine-tune Granite Vision (granite-vision-3.2-2b): fine-tune Granite Vision with the Transformer Reinforcement Learning library (TRL).

Multimodal RAG (granite-vision-3.2-2b): a multimodal RAG recipe.

Table of contents

  1. Overview
  2. Model cards
  3. Getting Started
    1. Granite Vision with Hugging Face transformers
    2. Granite Vision with vLLM

Overview

The Granite Vision collection offers a streamlined model tailored for visual document understanding, specifically aimed at enterprise needs. It is trained to handle a variety of tasks, such as extracting information from tables, charts, diagrams, sketches, and infographics, along with general image-related tasks. The model accepts input images of up to roughly 1.5 million pixels, with dimensions up to 1152x1152.

Despite its lightweight design, Granite Vision delivers strong performance on standard visual document understanding benchmarks and on the LiveXiv benchmark, which uses a continually updated set of new arXiv papers to avoid data leakage.

We’re releasing Granite Vision under the Apache 2.0 license, making it freely available for both research and commercial use, with complete visibility into the training data.

Model cards

Run locally with Ollama

Learn more about Granite 3.2 on Ollama.
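
For a quick local test, the following is a sketch that assumes the vision model is published in the Ollama library under the granite3.2-vision tag; the image path and prompt are placeholders:

ollama pull granite3.2-vision
ollama run granite3.2-vision "Describe the chart in ./chart.png"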

Getting Started

Granite Vision with Hugging Face transformers

This is a simple example of how to use the granite-vision-3.2-2b model with the Transformers library and PyTorch.

First, install the required libraries:

pip install torch "transformers>=4.49"

Then load the processor and model:

from transformers import AutoProcessor, AutoModelForVision2Seq
from huggingface_hub import hf_hub_download
import torch

# Run on a GPU if one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"

model_path = "ibm-granite/granite-vision-3.2-2b"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForVision2Seq.from_pretrained(model_path).to(device)
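
Once the processor and model are loaded, you can ask questions about an image. The following is a minimal sketch rather than the official model-card example: the image path and question are placeholders, and it assumes the processor's chat template accepts an image entry in the user message.

from PIL import Image

# Placeholder image path -- substitute your own document image (e.g. a chart or table).
image = Image.open("chart.png").convert("RGB")

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What does this chart show?"},
        ],
    },
]

# Render the conversation with the model's chat template, then preprocess text and image together.
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)

# Generate and decode the answer.
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))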

Granite Vision with vLLM

The granite-vision-3.2-2b model can also be loaded with vLLM. First, make sure to install the following libraries:

pip install torch torchvision torchaudio
pip install vllm==0.6.6

Then, use the snippet below to load the model and run inference.

from vllm import LLM, SamplingParams
from vllm.assets.image import ImageAsset
from huggingface_hub import hf_hub_download
from PIL import Image

model_path = "ibm-granite/granite-vision-3.2-2b"

# Load the model, allowing one image per prompt.
model = LLM(
    model=model_path,
    limit_mm_per_prompt={"image": 1},
)
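
With the engine loaded, a single image request can be answered via model.generate. The snippet below is a sketch: the image path and question are placeholders, and the hand-written prompt assumes the Granite chat markers (<|user|>, <|assistant|>) with an <image> placeholder, so verify it against the chat template in the model card.

# Sampling settings for generation.
sampling_params = SamplingParams(temperature=0.2, max_tokens=128)

# Placeholder image and question -- substitute your own document image and query.
image = Image.open("chart.png").convert("RGB")
question = "What does this chart show?"

# Assumed prompt layout using the Granite chat markers and an <image> placeholder;
# check the model card for the exact template before relying on it.
prompt = f"<|user|>\n<image>\n{question}\n<|assistant|>\n"

# Pass the raw prompt together with the image as multi-modal data.
outputs = model.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    sampling_params=sampling_params,
)
print(outputs[0].outputs[0].text)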