The Granite Vision models are designed for enterprise applications, specializing in visual document understanding. They are capable of performing a wide range of tasks, including extracting information from tables, charts, diagrams, sketches, and infographics, as well as general image analysis.

The family of vision models also includes Granite Vision Embedding, a novel multimodal embedding model for document retrieval. It enables queries on documents containing tables, charts, infographics, and complex layouts. By eliminating the need for text extraction, Vision Embedding simplifies and accelerates retrieval-augmented generation (RAG) pipelines.

Despite its lightweight architecture, Granite Vision achieves strong performance on standard visual document understanding benchmarks and on the LiveXiv benchmark, which continuously evaluates the model on updated sets of arXiv papers to prevent data leakage. Granite Vision is currently ranked 2nd on the OCRBench Leaderboard (as of 10/2/2025). Similarly, Granite Vision Embedding achieves top ranks on visual document retrieval benchmarks, currently holding 5th place on the ViDoRe 2 leaderboard (as of 10/2/2025).

Granite Vision and Vision Embedding are released under the Apache 2.0 license, making them freely available for both research and commercial purposes, with full transparency into their training data.

Granite Vision Paper: Please note that the paper describes Granite Vision 3.2. Granite Vision 3.3 shares most of its technical underpinnings with Granite Vision 3.2, with several enhancements: a new and improved vision encoder, many new high-quality training datasets, and several new experimental capabilities.
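To make the retrieval idea concrete, the sketch below shows how page-image embeddings could be ranked against a query embedding inside a RAG pipeline. Only the similarity-ranking step is real code; the `embed_pages` and `embed_query` helpers in the commented usage are hypothetical placeholders for the Granite Vision Embedding inference call, not its actual API.

```python
import numpy as np

def retrieve_pages(query_embedding, page_embeddings, top_k=3):
    """Rank page-image embeddings by cosine similarity to a query embedding."""
    pages = page_embeddings / np.linalg.norm(page_embeddings, axis=1, keepdims=True)
    query = query_embedding / np.linalg.norm(query_embedding)
    scores = pages @ query
    top = np.argsort(scores)[::-1][:top_k]
    return [(int(i), float(scores[i])) for i in top]

# Hypothetical usage (embed_pages / embed_query stand in for the actual
# Granite Vision Embedding inference API):
# page_embeddings = embed_pages(page_images)              # shape: (num_pages, dim)
# query_embedding = embed_query("Q3 revenue by region")   # shape: (dim,)
# print(retrieve_pages(query_embedding, page_embeddings))
```

The retrieved page images can then be passed directly to Granite Vision for answer generation, with no intermediate text extraction step.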
Install the required dependencies (the example below uses vLLM, huggingface_hub, and Pillow), then copy the snippet from the section that is relevant for your use case.
```python
from vllm import LLM, SamplingParams
from huggingface_hub import hf_hub_download
from PIL import Image

model_path = "ibm-granite/granite-vision-3.3-2b"
model = LLM(
    model=model_path,
)

sampling_params = SamplingParams(
    temperature=0.2,
    max_tokens=64,
)

# Define the question we want to answer and format the prompt
image_token = "<image>"
system_prompt = "<|system|>\nA chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n"
question = "What is the highest scoring model on ChartQA and what is its score?"
prompt = f"{system_prompt}<|user|>\n{image_token}\n{question}\n<|assistant|>\n"

# Download the example image bundled with the model repository
img_path = hf_hub_download(repo_id=model_path, filename="example.png")
image = Image.open(img_path).convert("RGB")
print(image)

# Build the inputs to vLLM; the image is passed as `multi_modal_data`.
inputs = {
    "prompt": prompt,
    "multi_modal_data": {
        "image": image,
    },
}

outputs = model.generate(inputs, sampling_params=sampling_params)
print(f"Generated text: {outputs[0].outputs[0].text}")
```
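If you are not using vLLM, a similar flow can be run with Hugging Face transformers. The following is a minimal sketch, assuming a recent transformers release whose processors support `apply_chat_template` for image-text conversations; it mirrors the vLLM snippet above rather than reproducing an official recipe.

```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from huggingface_hub import hf_hub_download
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model_path = "ibm-granite/granite-vision-3.3-2b"

processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForVision2Seq.from_pretrained(model_path).to(device)

# Reuse the same example image and question as in the vLLM snippet above.
img_path = hf_hub_download(repo_id=model_path, filename="example.png")
image = Image.open(img_path).convert("RGB")

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "What is the highest scoring model on ChartQA and what is its score?"},
        ],
    },
]

# The processor builds the chat prompt (including the image token) and model tensors.
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(device)

output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```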