
Granite with vLLM

Overview

In this guide, we’ll run a Granite model with vLLM inside a container.

Prerequisites

This guide assumes you have a Docker-compatible container runtime such as Podman or Docker Desktop installed, along with the NVIDIA Container Toolkit (or equivalent) so containers can access the GPU. It also requires that an NVIDIA GPU be available on your machine.
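To sanity-check GPU access before pulling the image, you can run nvidia-smi on the host and inside a test container. The CUDA image tag below is illustrative; pick one that matches your installed driver:

# Verify the host driver is working
nvidia-smi

# Verify the container runtime can see the GPU (image tag is an example)
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi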

Pull the container image

docker pull vllm/vllm-openai:latest

Run the model

To run the model inside the container, start the container with your Hugging Face cache mounted:

docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model ibm-granite/granite-3.1-8b-instruct
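
If you’re using Podman with CDI-based GPU support (set up via the NVIDIA Container Toolkit’s nvidia-ctk cdi generate), a roughly equivalent invocation is sketched below. The :Z volume label is for SELinux systems and can be dropped elsewhere:

podman run --device nvidia.com/gpu=all \
  -v ~/.cache/huggingface:/root/.cache/huggingface:Z \
  -p 8000:8000 \
  docker.io/vllm/vllm-openai:latest \
  --model ibm-granite/granite-3.1-8b-instruct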

NOTE: You can pre-download the model into ~/.cache/huggingface using huggingface-cli download ibm-granite/granite-3.1-8b-instruct. If it isn’t pre-downloaded, vLLM will download the model on startup and save it to the mounted cache.
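
For example, a minimal pre-download sketch, assuming Python and pip are available on the host:

# Install the Hugging Face CLI, then fetch the model weights
pip install -U "huggingface_hub[cli]"
huggingface-cli download ibm-granite/granite-3.1-8b-instruct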

Run a sample request

curl -H "Content-Type: application/json" http://localhost:8000/v1/chat/completions -d '{
  "model": "ibm-granite/granite-3.1-8b-instruct",
  "messages": [
    {"role": "user", "content": "How are you today?"}
  ]
}'
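
If the request fails with a model-not-found error, you can list the model IDs the server is actually serving via the OpenAI-compatible models endpoint:

curl http://localhost:8000/v1/models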