Granite Vision
Granite Vision is designed for efficient content extraction from tables, charts, and diagrams, making it a powerful tool for structured data analysis. Our focus was on creating a lightweight, computationally efficient model that keeps resource costs low, making it practical and accessible for real-world applications.
Fine-tune Granite Vision
granite-vision-3.2-2b
Fine-tune Granite Vision with the Transformer Reinforcement Learning library (TRL).
Overview
The Granite Vision collection offers a streamlined model tailored for visual document understanding, specifically aimed at enterprise needs. It’s trained to handle a variety of tasks, such as extracting information from tables, charts, diagrams, sketches, and infographics, along with general image-related tasks. The Granite Vision model can support up to ~1.5 million pixels, with images up to 1152x1152.
Despite its lightweight design, Granite Vision delivers strong performance on standard visual document understanding benchmarks and on the LiveXiv benchmark, which uses a continuously updated set of new arXiv papers to avoid data leakage.
We’re releasing Granite Vision under the Apache 2.0 license, making it freely available for both research and commercial use, with complete visibility into the training data.
Model cards
granite-vision-3.2-2b: https://huggingface.co/ibm-granite/granite-vision-3.2-2b
Run locally with Ollama
Learn more about Granite 3.2 on Ollama.
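As a quick sketch, assuming the model is published in the Ollama library under the granite3.2-vision tag, it can be pulled and queried from the command line (the image path is a placeholder):

ollama pull granite3.2-vision
ollama run granite3.2-vision "Describe the table in ./chart.png"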
Getting Started
Granite Vision with Hugging Face transformers
This is a simple example of how to use the granite-vision-3.2-2b model with the Transformers library and PyTorch.
First, install the required libraries:
pip install "transformers>=4.49"
from transformers import AutoProcessor, AutoModelForVision2Seq
from huggingface_hub import hf_hub_download
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

model_path = "ibm-granite/granite-vision-3.2-2b"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForVision2Seq.from_pretrained(model_path).to(device)
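To run a query, the loaded processor and model are combined with an image and a question. The continuation below is a minimal sketch of that step, assuming a local document image; the file name and the question are placeholders rather than part of the official example.

from PIL import Image

# Placeholder inputs: substitute any local document image and question.
image = Image.open("chart.png").convert("RGB")
question = "Describe the key values shown in this chart."

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": question},
        ],
    },
]

# Render the conversation with the model's chat template, then let the
# processor combine the text prompt with the image tensors.
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))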
Granite Vision with vLLM
The granite-vision-3.2-2b model can also be loaded with vLLM. First make sure to install the following libraries:
pip install torch torchvision torchaudio
pip install vllm==0.6.6
Then, copy the snippet from the section that is relevant for your use case.
from vllm import LLM, SamplingParams
from vllm.assets.image import ImageAsset
from huggingface_hub import hf_hub_download
from PIL import Image

model_path = "ibm-granite/granite-vision-3.2-2b"
model = LLM(model=model_path)
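The snippet below sketches how generation could proceed from here, assuming a single image is passed alongside the prompt; the image path, question, and sampling settings are illustrative placeholders, and the prompt is rendered with the Hugging Face processor's chat template so Granite Vision's special tokens don't have to be hard-coded.

from transformers import AutoProcessor

# Placeholder inputs: substitute any local document image and question.
image = Image.open("chart.png").convert("RGB")
question = "Describe the key values shown in this chart."

# Reuse the model's chat template so the prompt string contains the
# correct image placeholder and role tokens.
processor = AutoProcessor.from_pretrained(model_path)
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": question},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

sampling_params = SamplingParams(temperature=0.2, max_tokens=128)
outputs = model.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    sampling_params=sampling_params,
)
print(outputs[0].outputs[0].text)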