The Granite Vision models are designed for enterprise applications, specializing in visual document understanding. They are capable of performing a wide range of tasks, including extracting information from tables, charts, diagrams, sketches, and infographics, as well as general image analysis.

The family of vision models also includes Granite Vision Embedding, a novel multimodal embedding model for document retrieval. It enables queries on documents containing tables, charts, infographics, and complex layouts. By eliminating the need for text extraction, Vision Embedding simplifies and accelerates retrieval-augmented generation (RAG) pipelines.

Despite its lightweight architecture, Granite Vision achieves strong performance on standard visual document understanding benchmarks and on the LiveXiv benchmark, which continuously evaluates the model on updated sets of arXiv papers to prevent data leakage. Granite Vision is currently ranked 2nd on the OCRBench Leaderboard (as of 10/2/2025). Similarly, Granite Vision Embedding achieves top ranks on visual document retrieval benchmarks, currently holding 5th place on the ViDoRe 2 leaderboard (as of 10/2/2025).

Granite Vision and Vision Embedding are released under the Apache 2.0 license, making them freely available for both research and commercial purposes, with full transparency into their training data.

Granite Vision Paper: Please note that the paper describes Granite Vision 3.2. Granite Vision 3.3 shares most of its technical underpinnings with Granite Vision 3.2, with several enhancements: a new and improved vision encoder, many new high-quality training datasets, and several new experimental capabilities.
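To make the retrieval idea concrete, the sketch below shows how page-image embeddings could be ranked against a query embedding inside a RAG pipeline. Only the similarity-ranking step is real code; the `embed_pages` and `embed_query` helpers in the commented usage are hypothetical placeholders for the Granite Vision Embedding inference call, not its actual API.

```python
import numpy as np

def retrieve_pages(query_embedding, page_embeddings, top_k=3):
    """Rank page-image embeddings by cosine similarity to a query embedding."""
    pages = page_embeddings / np.linalg.norm(page_embeddings, axis=1, keepdims=True)
    query = query_embedding / np.linalg.norm(query_embedding)
    scores = pages @ query
    top = np.argsort(scores)[::-1][:top_k]
    return [(int(i), float(scores[i])) for i in top]

# Hypothetical usage (embed_pages / embed_query stand in for the actual
# Granite Vision Embedding inference API):
# page_embeddings = embed_pages(page_images)              # shape: (num_pages, dim)
# query_embedding = embed_query("Q3 revenue by region")   # shape: (dim,)
# print(retrieve_pages(query_embedding, page_embeddings))
```

The retrieved page images can then be passed directly to Granite Vision for answer generation, with no intermediate text extraction step.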
Install the required dependencies (the example below uses vLLM, huggingface_hub, and Pillow), then copy the snippet from the section that is relevant for your use case.
```python
from vllm import LLM, SamplingParams
from huggingface_hub import hf_hub_download
from PIL import Image

model_path = "ibm-granite/granite-vision-3.3-2b"
model = LLM(
    model=model_path,
)

sampling_params = SamplingParams(
    temperature=0.2,
    max_tokens=64,
)

# Define the question we want to answer and format the prompt
image_token = "<image>"
system_prompt = "<|system|>\nA chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n"
question = "What is the highest scoring model on ChartQA and what is its score?"
prompt = f"{system_prompt}<|user|>\n{image_token}\n{question}\n<|assistant|>\n"

# Download the example image bundled with the model repository
img_path = hf_hub_download(repo_id=model_path, filename="example.png")
image = Image.open(img_path).convert("RGB")
print(image)

# Build the inputs to vLLM; the image is passed as `multi_modal_data`.
inputs = {
    "prompt": prompt,
    "multi_modal_data": {
        "image": image,
    },
}

outputs = model.generate(inputs, sampling_params=sampling_params)
print(f"Generated text: {outputs[0].outputs[0].text}")
```
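If you are not using vLLM, a similar flow can be run with Hugging Face transformers. The following is a minimal sketch, assuming a recent transformers release whose processors support `apply_chat_template` for image-text conversations; it mirrors the vLLM snippet above rather than reproducing an official recipe.

```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from huggingface_hub import hf_hub_download
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model_path = "ibm-granite/granite-vision-3.3-2b"

processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForVision2Seq.from_pretrained(model_path).to(device)

# Reuse the same example image and question as in the vLLM snippet above.
img_path = hf_hub_download(repo_id=model_path, filename="example.png")
image = Image.open(img_path).convert("RGB")

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "What is the highest scoring model on ChartQA and what is its score?"},
        ],
    },
]

# The processor builds the chat prompt (including the image token) and model tensors.
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(device)

output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```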