Granite with vLLM
Overview
In this guide, we’ll use vLLM running in a container to serve a Granite model.
Prerequisites
This guide assumes you have a Docker-compatible container runtime such as Podman or Docker Desktop installed, along with the NVIDIA Container Toolkit so that containers can access the GPU. It also requires an NVIDIA GPU to be available on your machine.
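Before pulling the image, it can be worth confirming that containers can actually see the GPU. A minimal check, assuming the NVIDIA Container Toolkit is installed (the CUDA image tag below is only an example; substitute any tag available to you):
docker run --rm --runtime nvidia --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
If this prints your GPU details, the runtime is configured correctly for the steps below.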
Pull the container image
docker pull vllm/vllm-openai:latest
Run the model
To run the model inside the container, start the container with your Hugging Face cache mounted:
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model ibm-granite/granite-3.1-8b-instruct
NOTE: You can pre-download the model into ~/.cache/huggingface using huggingface-cli download ibm-granite/granite-3.1-8b-instruct. If not pre-downloaded, the model will be downloaded when vLLM boots and saved in the mounted cache.
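Once the server reports it is ready, you can confirm the model is being served by querying the models endpoint (this assumes the port mapping from the run command above):
curl http://localhost:8000/v1/models
The response should list ibm-granite/granite-3.1-8b-instruct as an available model.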
Run a sample request
curl -H "Content-Type: application/json" http://localhost:8000/v1/chat/completions -d '{
  "model": "ibm-granite/granite-3.1-8b-instruct",
  "messages": [{"role": "user", "content": "How are you today?"}]
}'
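The endpoint also accepts standard OpenAI-style sampling parameters if you want to control generation; the values below are purely illustrative:
curl -H "Content-Type: application/json" http://localhost:8000/v1/chat/completions -d '{
  "model": "ibm-granite/granite-3.1-8b-instruct",
  "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
  "temperature": 0.7,
  "max_tokens": 128
}'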