Supported hardware, model architectures and performance settings

Review the hardware, foundation model architectures, and performance-boosting settings and features that are available for deploying a custom foundation model with watsonx.ai.

You can deploy two custom foundation model types: general-purpose models and time-series models. When you upload and register a custom foundation model, consider the following requirements:

General-purpose models: make sure that your model meets the Hardware requirements.
All model types: make sure that your model uses one of the Supported model architectures.

Hardware requirements

For general-purpose models, the standard supported hardware configurations to deploy custom foundation models are:

NVIDIA A100 GPUs with 80 GB RAM
NVIDIA H100 GPUs with 80 GB RAM
NVIDIA H200 GPUs with 141 GB RAM

If your GPU configuration is different (for example, NVIDIA H100 GPUs with 40 GB RAM), you must create a custom hardware specification. For details, see Creating custom hardware specifications.

Restriction: You cannot use GPUs that are based on the Intel Gaudi 3 AI Accelerator architecture for custom foundation model deployments.
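As a rough illustration of the rules above, the following sketch classifies a GPU configuration as standard, requiring a custom hardware specification, or unsupported. The function name `classify_gpu_config` and its return labels are hypothetical and are not part of any watsonx.ai API; the values come only from the list and restriction in this section.

```python
# Standard configurations listed in this section: (GPU model, memory in GB).
STANDARD_GPU_CONFIGS = {("A100", 80), ("H100", 80), ("H200", 141)}


def classify_gpu_config(gpu_model: str, memory_gb: int) -> str:
    """Classify a GPU configuration for a general-purpose custom foundation model.

    Hypothetical helper for illustration only; the labels are not product terms.
    """
    name = gpu_model.strip().upper()
    # Gaudi 3 based accelerators cannot be used for custom foundation models.
    if "GAUDI" in name:
        return "unsupported"
    # The three standard configurations need no extra setup.
    if (name, memory_gb) in STANDARD_GPU_CONFIGS:
        return "standard"
    # Any other configuration needs a custom hardware specification.
    return "custom-hardware-spec-required"
```

For example, `classify_gpu_config("H100", 40)` falls through to the last branch because only the 80 GB H100 configuration is listed as standard.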

Supported model architectures

To find the architecture type for your custom foundation model, see Planning to deploy a custom foundation model.

Note:

Various software specifications are available for your deployments:

The watsonx-cfm-caikit-1.1 software specification is based on the vLLM runtime engine.
The watsonx-tsfm-runtime-1.0 software specification is designed for time-series models. It's based on the watsonx-tsfm-runtime-1.0 inference runtime.

The following tables list the supported model architectures, the available quantization methods, support for multiple GPUs, and the available deployment configurations. Also provided are software specifications that you can use when you deploy the models and the names of some example foundation models that belong to each of the listed architecture types.

Note:

Some models are available only from a specific release onward. These models are marked with <release>.<refresh> next to the model name.

Table 1. Supported model architectures, quantization methods, parallel tensors, and deployment configurations for general-purpose models
Model family	Foundation model examples	Supported quantization method	Parallel tensors (Multiple GPUs supported)	Deployment configurations
`bloom`	`bigscience/bloom-3b`, `bigscience/bloom-560m`	N/A	Yes	Small, Medium, Large
`exaone`	`lgai-exaone/exaone-3.0-7.8B-Instruct`	N/A	No	Small
`falcon`	`tiiuae/falcon-7b`	N/A	Yes	Small, Medium, Large
`gemma`	`google/gemma-2b`	N/A	Yes	Small, Medium, Large
`gemma2`	`google/gemma-2-9b`	N/A	Yes	Small, Medium, Large
`gemma3`	`google/gemma-3-27b-it`	N/A	Yes	Small, Medium, Large
`gpt_bigcode`	`bigcode/starcoder`, `bigcode/gpt_bigcode-santacoder`	`gptq`	Yes	Small, Medium, Large
`gpt_neox`	`rinna/japanese-gpt-neox-small`, `EleutherAI/pythia-12b`, `databricks/dolly-v2-12b`	N/A	Yes	Small, Medium, Large
`gptj`	`EleutherAI/gpt-j-6b`	N/A	No	Small
`gpt2`	`gpt2`, `gpt2-xl`	N/A	Yes	Small, Medium, Large
`granite`	`ibm-granite/granite-3.0-8b-instruct`, `ibm-granite/granite-3b-code-instruct-2k`, `granite-8b-code-instruct`, `granite-7b-lab`	N/A	No	Small
`jais`	`core42/jais-13b`	N/A	Yes	Small, Medium, Large
`llama`	`meta-llama/Meta-Llama-3-8B`, `meta-llama/Meta-Llama-3.1-8B-Instruct`, `llama-2-13b-chat-hf`, `TheBloke/Llama-2-7B-Chat-AWQ`, `ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf`	`gptq`	Yes	Small, Medium, Large
`mistral`	`mistralai/Mistral-7B-v0.3`, `neuralmagic/OpenHermes-2.5-Mistral-7B-marlin`	N/A	No	Small
`mixtral`	`TheBloke/Mixtral-8x7B-v0.1-GPTQ`, `mistralai/Mixtral-8x7B-Instruct-v0.1`	`gptq`	No	Small
`nemotron`	`nvidia/Minitron-8B-Base`	N/A	Yes	Small, Medium, Large
`olmo`	`allenai/OLMo-1B-hf`, `allenai/OLMo-7B-hf`	N/A	Yes	Small, Medium, Large
`persimmon`	`adept/persimmon-8b-base`, `adept/persimmon-8b-chat`	N/A	Yes	Small, Medium, Large
`phi`	`microsoft/phi-2`, `microsoft/phi-1_5`	N/A	Yes	Small, Medium, Large
`phi3`	`microsoft/Phi-3-mini-4k-instruct`	N/A	Yes	Small, Medium, Large
`qwen`	`DeepSeek-R1 (distilled variant)`	N/A	Yes	Small, Medium, Large
`qwen2`	`Qwen/Qwen2-7B-Instruct-AWQ`	`AWQ`	Yes	Small, Medium, Large
`qwen3`	`qwen/qwen3-32B`	N/A	Yes	Small, Medium, Large
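The lookup that Table 1 supports can be sketched in a few lines. This example assumes that the model directory follows the Hugging Face layout, with a `config.json` file whose `model_type` field corresponds to the model family names in the table; that mapping is an assumption here, so confirm it against Planning to deploy a custom foundation model. The helper names `read_model_type` and `lookup_family` are hypothetical.

```python
import json
from pathlib import Path

# Model families from Table 1, split by the "Parallel tensors" column.
MULTI_GPU_FAMILIES = {
    "bloom", "falcon", "gemma", "gemma2", "gemma3", "gpt_bigcode",
    "gpt_neox", "gpt2", "jais", "llama", "nemotron", "olmo",
    "persimmon", "phi", "phi3", "qwen", "qwen2", "qwen3",
}
SINGLE_GPU_FAMILIES = {"exaone", "gptj", "granite", "mistral", "mixtral"}


def read_model_type(model_dir):
    """Read the model_type field from <model_dir>/config.json (assumed layout)."""
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return config["model_type"]


def lookup_family(model_type):
    """Return the Table 1 details for a model family, or None if it is not listed."""
    if model_type in MULTI_GPU_FAMILIES:
        return {"multi_gpu": True, "configurations": ["Small", "Medium", "Large"]}
    if model_type in SINGLE_GPU_FAMILIES:
        return {"multi_gpu": False, "configurations": ["Small"]}
    return None
```

A result of `None` means only that the family does not appear in Table 1. The lookup does not cover quantization methods or release-specific availability.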

Note:

To use models from the MosaicML family, you must deploy them with a custom inference runtime that uses vLLM image version 0.7.4.

Limitation: You cannot use text/generation or text/generation_stream. Only text/chat is available. For text/chat to work, the model content must define a chat_template. If the model does not define one, create a chat_template and reference it in the deployment payload.
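Before you deploy, you can check whether the model content already defines a chat_template. The following is a minimal sketch that assumes a Hugging Face style model directory in which the template is stored under the `chat_template` key of `tokenizer_config.json`. Some models ship the template in a separate file instead, which this sketch does not check. The function name `has_chat_template` is hypothetical.

```python
import json
from pathlib import Path


def has_chat_template(model_dir: str) -> bool:
    """Return True if tokenizer_config.json defines a non-empty chat_template.

    Assumes a Hugging Face style layout; templates stored elsewhere are not detected.
    """
    config_path = Path(model_dir) / "tokenizer_config.json"
    if not config_path.is_file():
        return False
    config = json.loads(config_path.read_text(encoding="utf-8"))
    # A missing key or an empty string both count as "no template".
    return bool(config.get("chat_template"))
```

If the function returns `False`, the model likely needs a chat_template that you create yourself and reference in the deployment payload, as described in the limitation.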


Table 2. Supported model architectures, quantization methods, parallel tensors, and deployment configurations for time-series models
Model family	Foundation model examples	Supported quantization method	Parallel tensors (Multiple GPUs supported)	Deployment configurations
`tinytimemixer`	`ibm-granite/granite-timeseries-ttm-r2`	N/A	N/A	Small, Medium, Large, Extra large

Table 3. Supported model architectures: embedding and reranking models
Model family	Foundation model examples	Supported semantic retrieval	Deployment configurations
`BGE`	`BAAI/bge-reranker-v2-m3`	reranking	Small, Medium, Large
`E5`	`intfloat/multilingual-e5-large`	embedding and reranking	Small, Medium, Large
`granite`	`ibm/granite-embedding-107m-multilingual`, `ibm/granite-embedding-278m-multilingual`	embedding	Small, Medium, Large
`Jina Reranker`	`jinaai/jina-reranker-v2-base-multilingual`	reranking	Small, Medium, Large
`MiniLM`	`cross-encoder/ms-marco-minilm-l-12-v2`	reranking	Small, Medium, Large
`MiniLM`	`sentence-transformers/all-minilm-l6-v2`	embedding and reranking	Small, Medium, Large
`Qwen`	`Qwen/Qwen3-Embedding-0.6B`	embedding and reranking	Small, Medium, Large
`slate`	`ibm/slate-125m-english-rtrvr`, `ibm/slate-125m-english-rtrvr-v2`, `ibm/slate-30m-english-rtrvr`, `ibm/slate-30m-english-rtrvr-v2`	embedding and reranking	Small, Medium, Large

Important:

Automatic speech recognition (ASR) models can be added by administrators for global deployment. For more information, see Registering custom foundation models for global deployment.
IBM certifies only the model architectures that are listed in the Supported model architectures tables in this topic. You can use models with other architectures if the vLLM runtime supports them. However, IBM does not provide support for deployment failures that result from deploying foundation models with unsupported architectures or incompatible features. Alternatively, you can add or build a custom inference runtime image to match your model's architecture. However, only the images that are listed in the OpenShift registry were tested and are confirmed to work.