Supported hardware, model architectures and performance settings

Review the supported hardware, foundation model architectures, and performance-boosting settings and features that are available for deploying a custom foundation model with watsonx.ai.

You can deploy two custom foundation model types: general-purpose models and time-series models. When you upload and register a custom foundation model, consider the following requirements:

Hardware requirements

For general-purpose models, the standard supported hardware configurations to deploy custom foundation models are:
  • NVIDIA A100 GPUs with 80 GB RAM
  • NVIDIA H100 GPUs with 80 GB RAM
  • NVIDIA H200 GPUs with 141 GB RAM

If your GPU configuration is different (for example, NVIDIA H100 GPUs with 40 GB RAM), you must create a custom hardware specification. For details, see Creating custom hardware specifications.

Restriction: You cannot use GPUs that are based on the Intel Gaudi 3 AI Accelerator architecture for custom foundation model deployments.
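As a rough illustration of the hardware rules above, the following Python sketch classifies a GPU configuration. The function name and the returned labels are hypothetical and are not part of any watsonx.ai API. The values come only from the list in this section.

```python
# Standard configurations from the list above: (GPU model, memory in GB).
STANDARD_GPU_CONFIGS = {("A100", 80), ("H100", 80), ("H200", 141)}


def classify_gpu(gpu_model, memory_gb):
    """Classify a GPU configuration according to the rules in this section.

    Hypothetical helper for illustration only.
    """
    name = gpu_model.strip().upper()
    if "GAUDI" in name:
        # Intel Gaudi 3 cannot be used for custom foundation model deployments.
        return "unsupported"
    if (name, memory_gb) in STANDARD_GPU_CONFIGS:
        return "standard"
    # Any other configuration needs a custom hardware specification.
    return "custom hardware specification required"


print(classify_gpu("H100", 80))     # standard
print(classify_gpu("H100", 40))     # custom hardware specification required
print(classify_gpu("Gaudi 3", 128)) # unsupported
```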

Supported model architectures

To find the architecture type for your custom foundation model, see Planning to deploy a custom foundation model.

Note:
Various software specifications are available for your deployments:
  • The watsonx-cfm-caikit-1.1 software specification is based on the vLLM runtime engine.
  • The watsonx-tsfm-runtime-1.0 software specification is designed for time-series models. It's based on the watsonx-tsfm-runtime-1.0 inference runtime.

The following tables list the supported model architectures, the available quantization methods, support for multiple GPUs, and the available deployment configurations. The tables also name example foundation models that belong to each of the listed architecture types.

Note:

Some models are available from a specific release. Look for <release>.<refresh> markings next to models.

Table 1. Supported model architectures, quantization methods, parallel tensors, and deployment configurations for general-purpose models
Model family | Foundation model examples | Supported quantization method | Parallel tensors (Multiple GPUs supported) | Deployment configurations
bloom | bigscience/bloom-3b, bigscience/bloom-560m | N/A | Yes | Small, Medium, Large
exaone | lgai-exaone/exaone-3.0-7.8B-Instruct | N/A | No | Small
falcon | tiiuae/falcon-7b | N/A | Yes | Small, Medium, Large
gemma | google/gemma-2b | N/A | Yes | Small, Medium, Large
gemma2 | google/gemma-2-9b | N/A | Yes | Small, Medium, Large
gemma3 | google/gemma-3-27b-it | N/A | Yes | Small, Medium, Large
gpt_bigcode | bigcode/starcoder, bigcode/gpt_bigcode-santacoder | gptq | Yes | Small, Medium, Large
gpt_neox | rinna/japanese-gpt-neox-small, EleutherAI/pythia-12b, databricks/dolly-v2-12b | N/A | Yes | Small, Medium, Large
gptj | EleutherAI/gpt-j-6b | N/A | No | Small
gpt2 | gpt2, gpt2-xl | N/A | Yes | Small, Medium, Large
granite | ibm-granite/granite-3.0-8b-instruct, ibm-granite/granite-3b-code-instruct-2k, granite-8b-code-instruct, granite-7b-lab | N/A | No | Small
jais | core42/jais-13b | N/A | Yes | Small, Medium, Large
llama | meta-llama/Meta-Llama-3-8B, meta-llama/Meta-Llama-3.1-8B-Instruct, llama-2-13b-chat-hf, TheBloke/Llama-2-7B-Chat-AWQ, ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf | gptq | Yes | Small, Medium, Large
mistral | mistralai/Mistral-7B-v0.3, neuralmagic/OpenHermes-2.5-Mistral-7B-marlin | N/A | No | Small
mixtral | TheBloke/Mixtral-8x7B-v0.1-GPTQ, mistralai/Mixtral-8x7B-Instruct-v0.1 | gptq | No | Small
nemotron | nvidia/Minitron-8B-Base | N/A | Yes | Small, Medium, Large
olmo | allenai/OLMo-1B-hf, allenai/OLMo-7B-hf | N/A | Yes | Small, Medium, Large
persimmon | adept/persimmon-8b-base, adept/persimmon-8b-chat | N/A | Yes | Small, Medium, Large
phi | microsoft/phi-2, microsoft/phi-1_5 | N/A | Yes | Small, Medium, Large
phi3 | microsoft/Phi-3-mini-4k-instruct | N/A | Yes | Small, Medium, Large
qwen | DeepSeek-R1 (distilled variant) | N/A | Yes | Small, Medium, Large
qwen2 | Qwen/Qwen2-7B-Instruct-AWQ | AWQ | Yes | Small, Medium, Large
qwen3 | qwen/qwen3-32B | N/A | Yes | Small, Medium, Large

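One way to check a model against Table 1 before you upload it is sketched below in Python. The sketch assumes that the model directory contains a Hugging Face style config.json file whose "model_type" value matches the Model family column; that mapping is an assumption, not something this topic confirms. The helper names are hypothetical.

```python
import json
import tempfile
from pathlib import Path

ALL_SIZES = ("Small", "Medium", "Large")
SMALL_ONLY = ("Small",)

# Transcribed from Table 1:
# family -> (quantization method or None, multiple GPUs supported, deployment configurations)
TABLE_1 = {
    "bloom": (None, True, ALL_SIZES),
    "exaone": (None, False, SMALL_ONLY),
    "falcon": (None, True, ALL_SIZES),
    "gemma": (None, True, ALL_SIZES),
    "gemma2": (None, True, ALL_SIZES),
    "gemma3": (None, True, ALL_SIZES),
    "gpt_bigcode": ("gptq", True, ALL_SIZES),
    "gpt_neox": (None, True, ALL_SIZES),
    "gptj": (None, False, SMALL_ONLY),
    "gpt2": (None, True, ALL_SIZES),
    "granite": (None, False, SMALL_ONLY),
    "jais": (None, True, ALL_SIZES),
    "llama": ("gptq", True, ALL_SIZES),
    "mistral": (None, False, SMALL_ONLY),
    "mixtral": ("gptq", False, SMALL_ONLY),
    "nemotron": (None, True, ALL_SIZES),
    "olmo": (None, True, ALL_SIZES),
    "persimmon": (None, True, ALL_SIZES),
    "phi": (None, True, ALL_SIZES),
    "phi3": (None, True, ALL_SIZES),
    "qwen": (None, True, ALL_SIZES),
    "qwen2": ("AWQ", True, ALL_SIZES),
    "qwen3": (None, True, ALL_SIZES),
}


def read_model_type(model_dir):
    """Read "model_type" from config.json (assumed Hugging Face layout)."""
    with (Path(model_dir) / "config.json").open(encoding="utf-8") as f:
        return json.load(f).get("model_type")


def lookup(model_type):
    """Return the Table 1 row for a model family, or None if it is not listed."""
    return TABLE_1.get(model_type)


# Demonstration with a throwaway directory that stands in for a model folder.
with tempfile.TemporaryDirectory() as tmp:
    config = {"model_type": "mixtral"}
    (Path(tmp) / "config.json").write_text(json.dumps(config), encoding="utf-8")
    family = read_model_type(tmp)
    print(family, lookup(family))
```

A result of None from lookup means only that the family is not listed in Table 1. It does not by itself mean that the model cannot be deployed.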
Note:

To use models from the MosaicML family, you must deploy them by using a custom inference runtime with vLLM image version 0.7.4.

Limitation: You cannot use text/generation or text/generation_stream. Only text/chat is available. For text/chat to work, the model must have a chat_template defined in the model content. If the model content does not include a chat_template, create one and reference it in the deployment payload.
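To check whether a model already ships a chat_template before you deploy it, a minimal Python sketch follows. It assumes that the template is stored under the "chat_template" key of tokenizer_config.json, which is a common Hugging Face convention but is not stated in this topic; some models keep the template elsewhere. The helper name and the sample template are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def has_chat_template(model_dir):
    """Return True if tokenizer_config.json defines a non-empty chat_template.

    Assumes the Hugging Face file layout; hypothetical helper.
    """
    config_path = Path(model_dir) / "tokenizer_config.json"
    if not config_path.is_file():
        return False
    with config_path.open(encoding="utf-8") as f:
        return bool(json.load(f).get("chat_template"))


# Demonstration with a throwaway directory that stands in for a model folder.
with tempfile.TemporaryDirectory() as tmp:
    print(has_chat_template(tmp))  # False: no tokenizer_config.json yet

    # Placeholder Jinja template, for illustration only.
    template = (
        "{% for message in messages %}"
        "{{ message['role'] }}: {{ message['content'] }}\n"
        "{% endfor %}"
    )
    (Path(tmp) / "tokenizer_config.json").write_text(
        json.dumps({"chat_template": template}), encoding="utf-8"
    )
    print(has_chat_template(tmp))  # True
```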

Table 2. Supported model architectures, quantization methods, parallel tensors, and deployment configurations for time-series models
Model family | Foundation model examples | Supported quantization method | Parallel tensors (Multiple GPUs supported) | Deployment configurations
tinytimemixer | ibm-granite/granite-timeseries-ttm-r2 | N/A | N/A | Small, Medium, Large, Extra large

Table 3. Supported model architectures: embedding and reranking models
Model family | Foundation model examples | Supported semantic retrieval | Deployment configurations
BGE | BAAI/bge-reranker-v2-m3 | reranking | Small, Medium, Large
E5 | intfloat/multilingual-e5-large | embedding and reranking | Small, Medium, Large
granite | ibm/granite-embedding-107m-multilingual, ibm/granite-embedding-278m-multilingual | embedding | Small, Medium, Large
Jina Reranker | jinaai/jina-reranker-v2-base-multilingual | reranking | Small, Medium, Large
MiniLM | cross-encoder/ms-marco-minilm-l-12-v2 | reranking | Small, Medium, Large
MiniLM | sentence-transformers/all-minilm-l6-v2 | embedding and reranking | Small, Medium, Large
Qwen | Qwen/Qwen3-Embedding-0.6B | embedding and reranking | Small, Medium, Large
slate | ibm/slate-125m-english-rtrvr, ibm/slate-125m-english-rtrvr-v2, ibm/slate-30m-english-rtrvr, ibm/slate-30m-english-rtrvr-v2 | embedding and reranking | Small, Medium, Large

Important:
  • Automatic speech recognition (ASR) models can be added by admins for global deployment. For more information, see Registering custom foundation models for global deployment.
  • IBM certifies only the model architectures that are listed in the tables in Supported model architectures. You can use models with other architectures if the vLLM runtime supports them. However, IBM does not provide support for deployment failures that result from deploying foundation models with unsupported architectures or incompatible features. Alternatively, you can add or build a custom inference runtime image to match your model's architecture. However, only the images that are listed in the OpenShift registry were tested and are confirmed to work.