Supported hardware, model architectures and performance settings
Review the hardware, foundation model architectures, and performance-boosting settings and features that are available for deploying a custom foundation model with watsonx.ai.
You can deploy two custom foundation model types: general-purpose models and time-series models. When you upload and register a custom foundation model, consider the following requirements:
- General-purpose models: make sure that your model meets the Hardware requirements.
- All model types: make sure that your model uses one of the Supported model architectures.
Hardware requirements
- NVIDIA A100 GPUs with 80 GB RAM
- NVIDIA H100 GPUs with 80 GB RAM
- NVIDIA H200 GPUs with 141 GB RAM
If your GPU configuration is different (for example, NVIDIA H100 GPUs with 40 GB RAM), you must create a custom hardware specification. For details, see Creating custom hardware specifications.
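The rule above can be expressed as a small check. This is only an illustrative sketch: the function name `needs_custom_hardware_spec` and the `STANDARD_GPU_CONFIGS` set are hypothetical helpers that encode the three configurations listed on this page. They are not part of any watsonx.ai API.

```python
# Standard GPU configurations listed on this page: (GPU model, memory in GB).
# Hypothetical helper data, transcribed from the Hardware requirements list.
STANDARD_GPU_CONFIGS = {("A100", 80), ("H100", 80), ("H200", 141)}


def needs_custom_hardware_spec(gpu_model: str, memory_gb: int) -> bool:
    """Return True when the GPU configuration is not one of the listed standard ones."""
    return (gpu_model.upper(), memory_gb) not in STANDARD_GPU_CONFIGS


if __name__ == "__main__":
    # The example from the text: H100 GPUs with 40 GB need a custom specification.
    print(needs_custom_hardware_spec("H100", 40))
    print(needs_custom_hardware_spec("A100", 80))
```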
Supported model architectures
To find the architecture type for your custom foundation model, see Planning to deploy a custom foundation model.
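As a quick local check, the architecture type can often be read from the model content itself. The sketch below assumes that the model follows the Hugging Face layout, where `config.json` carries a `model_type` field, and that this value corresponds to the model family names used on this page. The `LISTED_FAMILIES` set is an illustrative, non-exhaustive subset, and both function names are hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Illustrative subset of the model families listed on this page (not exhaustive).
LISTED_FAMILIES = {"bloom", "falcon", "gemma2", "granite", "llama", "mistral", "qwen2"}


def read_model_type(model_dir):
    """Return the model_type string from <model_dir>/config.json, or None if it is missing."""
    with (Path(model_dir) / "config.json").open(encoding="utf-8") as f:
        return json.load(f).get("model_type")


def is_listed_family(model_type):
    """Check a model_type against the illustrative subset above."""
    return model_type in LISTED_FAMILIES


if __name__ == "__main__":
    # Demo with a throwaway directory that stands in for downloaded model content.
    with tempfile.TemporaryDirectory() as tmp:
        config = {"model_type": "llama"}
        (Path(tmp) / "config.json").write_text(json.dumps(config), encoding="utf-8")
        model_type = read_model_type(tmp)
        print(model_type, is_listed_family(model_type))
```

A model whose `model_type` is not in the tables is not necessarily undeployable, but it falls outside the certified architectures.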
- The watsonx-cfm-caikit-1.1 software specification is based on the vLLM runtime engine.
- The watsonx-tsfm-runtime-1.0 software specification is designed for time-series models. It's based on the watsonx-tsfm-runtime-1.0 inference runtime.
The following tables list the supported model architectures, the available quantization methods, support for multiple GPUs, and the available deployment configurations. They also name some example foundation models that belong to each of the listed architecture types.
Some models are available from a specific release. Look for <release>.<refresh> markings next to models.
| Model family | Foundation model examples | Supported quantization method | Parallel tensors (Multiple GPUs supported) | Deployment configurations |
|---|---|---|---|---|
| bloom | bigscience/bloom-3b, bigscience/bloom-560m | N/A | Yes | Small, Medium, Large |
| exaone | lgai-exaone/exaone-3.0-7.8B-Instruct | N/A | No | Small |
| falcon | tiiuae/falcon-7b | N/A | Yes | Small, Medium, Large |
| gemma | google/gemma-2b | N/A | Yes | Small, Medium, Large |
| gemma2 | google/gemma-2-9b | N/A | Yes | Small, Medium, Large |
| gemma3 | google/gemma-3-27b-it | N/A | Yes | Small, Medium, Large |
| gpt_bigcode | bigcode/starcoder, bigcode/gpt_bigcode-santacoder | gptq | Yes | Small, Medium, Large |
| gpt_neox | rinna/japanese-gpt-neox-small, EleutherAI/pythia-12b, databricks/dolly-v2-12b | N/A | Yes | Small, Medium, Large |
| gptj | EleutherAI/gpt-j-6b | N/A | No | Small |
| gpt2 | gpt2, gpt2-xl | N/A | Yes | Small, Medium, Large |
| granite | ibm-granite/granite-3.0-8b-instruct, ibm-granite/granite-3b-code-instruct-2k, granite-8b-code-instruct, granite-7b-lab | N/A | No | Small |
| jais | core42/jais-13b | N/A | Yes | Small, Medium, Large |
| llama | meta-llama/Meta-Llama-3-8B, meta-llama/Meta-Llama-3.1-8B-Instruct, llama-2-13b-chat-hf, TheBloke/Llama-2-7B-Chat-AWQ, ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf | gptq | Yes | Small, Medium, Large |
| mistral | mistralai/Mistral-7B-v0.3, neuralmagic/OpenHermes-2.5-Mistral-7B-marlin | N/A | No | Small |
| mixtral | TheBloke/Mixtral-8x7B-v0.1-GPTQ, mistralai/Mixtral-8x7B-Instruct-v0.1 | gptq | No | Small |
| nemotron | nvidia/Minitron-8B-Base | N/A | Yes | Small, Medium, Large |
| olmo | allenai/OLMo-1B-hf, allenai/OLMo-7B-hf | N/A | Yes | Small, Medium, Large |
| persimmon | adept/persimmon-8b-base, adept/persimmon-8b-chat | N/A | Yes | Small, Medium, Large |
| phi | microsoft/phi-2, microsoft/phi-1_5 | N/A | Yes | Small, Medium, Large |
| phi3 | microsoft/Phi-3-mini-4k-instruct | N/A | Yes | Small, Medium, Large |
| qwen | DeepSeek-R1 (distilled variant) | N/A | Yes | Small, Medium, Large |
| qwen2 | Qwen/Qwen2-7B-Instruct-AWQ | AWQ | Yes | Small, Medium, Large |
| qwen3 | qwen/qwen3-32B | N/A | Yes | Small, Medium, Large |
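When you script deployments, it can help to encode the table as data and validate a requested configuration before you submit it. The following is a minimal sketch, assuming that you maintain the lookup yourself. `FamilySupport`, `SUPPORT`, and `can_deploy` are hypothetical names, and only a few rows of the table are transcribed here.

```python
from typing import NamedTuple, Optional, Tuple


class FamilySupport(NamedTuple):
    quantization: Optional[str]        # supported quantization method, None for N/A
    multi_gpu: bool                    # parallel tensors (multiple GPUs) supported
    configurations: Tuple[str, ...]    # allowed deployment configurations


ALL_SIZES = ("Small", "Medium", "Large")

# A few rows transcribed from the table above (illustrative, not exhaustive).
SUPPORT = {
    "bloom": FamilySupport(None, True, ALL_SIZES),
    "gptj": FamilySupport(None, False, ("Small",)),
    "llama": FamilySupport("gptq", True, ALL_SIZES),
    "mixtral": FamilySupport("gptq", False, ("Small",)),
    "qwen2": FamilySupport("AWQ", True, ALL_SIZES),
}


def can_deploy(family: str, configuration: str) -> bool:
    """Return True if the family is in the lookup and allows the given configuration."""
    entry = SUPPORT.get(family)
    return entry is not None and configuration in entry.configurations


if __name__ == "__main__":
    print(can_deploy("llama", "Large"))
    print(can_deploy("gptj", "Medium"))
```

In the transcribed rows, families without multiple-GPU support are limited to the Small configuration, which the lookup reflects.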
If you want to use models from the MosaicML family, you must deploy them by using a custom inference runtime, with vLLM image version 0.7.4.
Limitation: You cannot use text/generation or text/generation_stream. Only text/chat is available. For text/chat to work, the model must have a chat_template defined in the model content. If a chat_template is not present, create one and reference it in the deployment payload.
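Before you deploy, you can check locally whether the model content defines a chat template. The sketch below assumes the common Hugging Face layout, in which the template is stored under the `chat_template` key of `tokenizer_config.json`. Some models ship the template in a separate file instead, which this check does not cover. The function name `has_chat_template` is hypothetical.

```python
import json
import tempfile
from pathlib import Path


def has_chat_template(model_dir) -> bool:
    """Return True if tokenizer_config.json exists and has a non-empty chat_template."""
    path = Path(model_dir) / "tokenizer_config.json"
    if not path.is_file():
        return False
    with path.open(encoding="utf-8") as f:
        tokenizer_config = json.load(f)
    return bool(tokenizer_config.get("chat_template"))


if __name__ == "__main__":
    # Demo with a throwaway directory that stands in for the model content.
    with tempfile.TemporaryDirectory() as tmp:
        print(has_chat_template(tmp))  # no tokenizer_config.json yet
        template = "{% for m in messages %}{{ m['content'] }}{% endfor %}"
        (Path(tmp) / "tokenizer_config.json").write_text(
            json.dumps({"chat_template": template}), encoding="utf-8"
        )
        print(has_chat_template(tmp))
```

If the check returns False, create a chat template as described above and reference it in the deployment payload.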
| Model family | Foundation model examples | Supported quantization method | Parallel tensors (Multiple GPUs supported) | Deployment configurations |
|---|---|---|---|---|
| tinytimemixer | ibm-granite/granite-timeseries-ttm-r2 | N/A | N/A | Small, Medium, Large, Extra large |
| Model family | Foundation model examples | Supported semantic retrieval | Deployment configurations |
|---|---|---|---|
| BGE | BAAI/bge-reranker-v2-m3 | reranking | Small, Medium, Large |
| E5 | intfloat/multilingual-e5-large | embedding and reranking | Small, Medium, Large |
| granite | ibm/granite-embedding-107m-multilingual, ibm/granite-embedding-278m-multilingual | embedding | Small, Medium, Large |
| Jina Reranker | jinaai/jina-reranker-v2-base-multilingual | reranking | Small, Medium, Large |
| MiniLM | cross-encoder/ms-marco-minilm-l-12-v2 | reranking | Small, Medium, Large |
| MiniLM | sentence-transformers/all-minilm-l6-v2 | embedding and reranking | Small, Medium, Large |
| Qwen | Qwen/Qwen3-Embedding-0.6B | embedding and reranking | Small, Medium, Large |
| slate | ibm/slate-125m-english-rtrvr, ibm/slate-125m-english-rtrvr-v2, ibm/slate-30m-english-rtrvr, ibm/slate-30m-english-rtrvr-v2 | embedding and reranking | Small, Medium, Large |
- Automatic speech recognition (ASR) models can be added by admins for global deployment. For more information, see Registering custom foundation models for global deployment.
- IBM certifies only the model architectures that are listed in the Supported model architectures, quantization methods, parallel tensors, deployment configurations, and software specifications tables. You can use models with other architectures if they are supported by the vLLM runtime. However, IBM does not provide support for deployment failures that result from deploying foundation models with unsupported architectures or incompatible features. Alternatively, you can add or build a custom inference runtime image to match your model's architecture, but only the images that are listed in the OpenShift registry were tested and are confirmed to work.