Requirements for deploying custom foundation models on dedicated GPUs

Review the supported foundation model architectures and the performance-boosting settings and features that are available for deploying a custom foundation model with watsonx.ai.

When you deploy a custom foundation model, consider the following requirements:

Supported model architectures

Note:

Various software specifications are available for your deployments:

  • The watsonx-cfm-caikit-1.0 software specification is based on the Text Generation Inference (TGI) runtime engine.
  • The watsonx-cfm-caikit-1.1 software specification is based on the vLLM runtime engine. It offers better performance, but it is not available for every model architecture.
  • The watsonx-tsfm-runtime-1.0 software specification is designed for time-series models and is based on the watsonx-tsfm-runtime-1.0 inference runtime.
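As an illustration of how these runtimes relate to model architectures, the following sketch encodes a few rows of the tables in this topic and prefers the vLLM-based specification when an architecture supports it. The mapping is a small, hand-picked subset and the chooser function is an illustrative assumption, not part of any watsonx.ai API.

```python
# Software specification names as documented; the selection logic below
# is an illustrative assumption, not an official watsonx.ai API.
VLLM_SPEC = "watsonx-cfm-caikit-1.1"   # vLLM-based, better performance
TGI_SPEC = "watsonx-cfm-caikit-1.0"    # TGI-based
TS_SPEC = "watsonx-tsfm-runtime-1.0"   # time-series models

# Subset of the architecture tables in this topic.
SUPPORTED_SPECS = {
    "llama": {TGI_SPEC, VLLM_SPEC},
    "codegen": {TGI_SPEC},
    "gemma": {VLLM_SPEC},
    "tinytimemixer": {TS_SPEC},
}

def preferred_spec(architecture: str) -> str:
    """Prefer the vLLM-based runtime when the architecture supports it."""
    specs = SUPPORTED_SPECS[architecture]
    if VLLM_SPEC in specs:
        return VLLM_SPEC
    return next(iter(specs))
```

For example, `preferred_spec("llama")` selects the vLLM-based specification because the llama architecture supports both runtimes.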

The following tables list the supported model architectures, available quantization methods, support for multiple GPUs, available deployment configurations, and the software specifications that you can use when you deploy the models, together with the names of some example foundation models that belong to each architecture type.

General-purpose models:

Supported model architectures, quantization methods, parallel tensors, deployment configurations, and software specifications for general-purpose models
| Model family | Foundation model examples | Supported quantization method | Parallel tensors (multiple GPUs supported) | Deployment configurations | Software specifications |
| --- | --- | --- | --- | --- | --- |
| bloom | bigscience/bloom-3b, bigscience/bloom-560m | N/A | Yes | Small, Medium, Large | watsonx-cfm-caikit-1.0, watsonx-cfm-caikit-1.1 |
| codegen | Salesforce/codegen-350M-mono, Salesforce/codegen-16B-mono | N/A | No | Small | watsonx-cfm-caikit-1.0 |
| exaone | lgai-exaone/exaone-3.0-7.8B-Instruct | N/A | No | Small | watsonx-cfm-caikit-1.1 |
| falcon | tiiuae/falcon-7b | N/A | Yes | Small, Medium, Large | watsonx-cfm-caikit-1.0, watsonx-cfm-caikit-1.1 |
| gemma | google/gemma-2b | N/A | Yes | Small, Medium, Large | watsonx-cfm-caikit-1.1 |
| gemma2 | google/gemma-2-9b | N/A | Yes | Small, Medium, Large | watsonx-cfm-caikit-1.1 |
| gpt_bigcode | bigcode/starcoder, bigcode/gpt_bigcode-santacoder | gptq | Yes | Small, Medium, Large | watsonx-cfm-caikit-1.0, watsonx-cfm-caikit-1.1 |
| gpt_neox | rinna/japanese-gpt-neox-small, EleutherAI/pythia-12b, databricks/dolly-v2-12b | N/A | Yes | Small, Medium, Large | watsonx-cfm-caikit-1.0, watsonx-cfm-caikit-1.1 |
| gptj | EleutherAI/gpt-j-6b | N/A | No | Small | watsonx-cfm-caikit-1.0, watsonx-cfm-caikit-1.1 |
| gpt2 | gpt2, gpt2-xl | N/A | Yes | Small, Medium, Large | watsonx-cfm-caikit-1.1 |
| granite | ibm-granite/granite-3.0-8b-instruct, ibm-granite/granite-3b-code-instruct-2k, granite-8b-code-instruct, granite-7b-lab | N/A | No | Small | watsonx-cfm-caikit-1.1 |
| jais | core42/jais-13b | N/A | Yes | Small, Medium, Large | watsonx-cfm-caikit-1.1 |
| llama | DeepSeek-R1 (distilled variant), meta-llama/Meta-Llama-3-8B, meta-llama/Meta-Llama-3.1-8B-Instruct, llama-2-13b-chat-hf, TheBloke/Llama-2-7B-Chat-AWQ, ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf | gptq | Yes | Small, Medium, Large | watsonx-cfm-caikit-1.0, watsonx-cfm-caikit-1.1 |
| mistral | mistralai/Mistral-7B-v0.3, neuralmagic/OpenHermes-2.5-Mistral-7B-marlin | N/A | No | Small | watsonx-cfm-caikit-1.0, watsonx-cfm-caikit-1.1 |
| mixtral | TheBloke/Mixtral-8x7B-v0.1-GPTQ, mistralai/Mixtral-8x7B-Instruct-v0.1 | gptq | No | Small | watsonx-cfm-caikit-1.1 |
| mpt | mosaicml/mpt-7b, mosaicml/mpt-7b-storywriter, mosaicml/mpt-30b | N/A | No | Small | watsonx-cfm-caikit-1.0 |
| mt5 | google/mt5-small, google/mt5-xl | N/A | No | Small | watsonx-cfm-caikit-1.0 |
| nemotron | nvidia/Minitron-8B-Base | N/A | Yes | Small, Medium, Large | watsonx-cfm-caikit-1.1 |
| olmo | allenai/OLMo-1B-hf, allenai/OLMo-7B-hf | N/A | Yes | Small, Medium, Large | watsonx-cfm-caikit-1.1 |
| persimmon | adept/persimmon-8b-base, adept/persimmon-8b-chat | N/A | Yes | Small, Medium, Large | watsonx-cfm-caikit-1.1 |
| phi | microsoft/phi-2, microsoft/phi-1_5 | N/A | Yes | Small, Medium, Large | watsonx-cfm-caikit-1.1 |
| phi3 | microsoft/Phi-3-mini-4k-instruct | N/A | Yes | Small, Medium, Large | watsonx-cfm-caikit-1.1 |
| qwen | DeepSeek-R1 (distilled variant) | N/A | Yes | Small, Medium, Large | watsonx-cfm-caikit-1.1 |
| qwen2 | DeepSeek-R1 (distilled variant), Qwen/Qwen2-7B-Instruct-AWQ | AWQ | Yes | Small, Medium, Large | watsonx-cfm-caikit-1.1 |
| t5 | google/flan-t5-large, google/flan-t5-small | N/A | Yes | Small, Medium, Large | watsonx-cfm-caikit-1.0 |

Time-series models:

Supported model architectures, quantization methods, parallel tensors, deployment configurations, and software specifications for time-series models
| Model family | Foundation model examples | Supported quantization method | Parallel tensors (multiple GPUs supported) | Deployment configurations | Software specifications |
| --- | --- | --- | --- | --- | --- |
| tinytimemixer | ibm-granite/granite-timeseries-ttm-r2 | N/A | N/A | Small, Medium, Large, Extra large | watsonx-tsfm-runtime-1.0 |
Important:
  • From release 2.2.1: Automatic speech recognition (ASR) models can be added globally by admins. For more information, see Registering custom foundation models for global deployment.
  • Deployments of llama 3.1 models might fail if you use the watsonx-cfm-caikit-1.0 software specification for deployment. To address this issue, see steps that are listed in Troubleshooting.
  • IBM certifies only the model architectures that are listed in the Supported model architectures, quantization methods, parallel tensors, deployment configurations, and software specifications tables. You can use models with alternative architectures if they are supported by the vLLM runtime; however, IBM does not support deployment failures that result from deploying foundation models with unsupported architectures or incompatible features.
  • If your model architecture is not supported, an MLOps engineer can add a custom inference runtime image to match your model's architecture and prepare a model asset for deployment. Note that if you fine-tune such a model by using the LoRA technique, you cannot deploy it.

Predefined hardware specifications

Note:

For time-series models, use CPU-based hardware specifications.

CPU-based hardware specifications

The following CPU-based hardware specifications are available:

  • S: 2 vCPUs and 8 GB of memory
  • M: 4 vCPUs and 16 GB of memory
  • L: 8 vCPUs and 32 GB of memory
  • XL: 16 vCPUs and 64 GB of memory
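Each tier doubles the resources of the one below it. A minimal lookup helper, hypothetical and intended only for sizing scripts, might look like:

```python
# CPU-based hardware specifications as (vCPUs, memory in GB).
# Values are taken from the list above; the helper itself is illustrative.
CPU_SPECS = {
    "S": (2, 8),
    "M": (4, 16),
    "L": (8, 32),
    "XL": (16, 64),
}

def resources_for(spec: str) -> tuple[int, int]:
    """Return (vCPUs, memory_gb) for a CPU-based hardware specification."""
    return CPU_SPECS[spec]
```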

For time-series models, assign a hardware specification to your model based on the maximum number of concurrent users and payload characteristics:

Recommendations for assigning CPU-based hardware specifications to time-series models (cell values are the maximum number of concurrent users)

| Univariate time series (data points) | Multivariate time series (series x targets) | S | M | L | XL |
| --- | --- | --- | --- | --- | --- |
| 1000 | 23x100 | 6 | 12 | 25 | 50 |
| 500 | 15x80 | 10 | 21 | 42 | 85 |
| 250 | 15x40 | 13 | 26 | 53 | 106 |
| 125 | 15x20 | 13 | 27 | 54 | 109 |
| 60 | 15x10 | 14 | 28 | 56 | 112 |
| 30 | 15x5 | 14 | 28 | 56 | 113 |
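The capacity numbers for univariate workloads can be encoded as a lookup that picks the smallest specification covering an expected load. A hedged sketch, assuming the table values are maximum concurrent users per specification, keyed by series length:

```python
# Maximum concurrent users per CPU spec, keyed by univariate series length.
# Values are copied from the recommendations table; the chooser is illustrative.
MAX_USERS = {
    1000: {"S": 6, "M": 12, "L": 25, "XL": 50},
    500: {"S": 10, "M": 21, "L": 42, "XL": 85},
    250: {"S": 13, "M": 26, "L": 53, "XL": 106},
    125: {"S": 13, "M": 27, "L": 54, "XL": 109},
    60: {"S": 14, "M": 28, "L": 56, "XL": 112},
    30: {"S": 14, "M": 28, "L": 56, "XL": 113},
}

def smallest_spec(series_length: int, concurrent_users: int) -> str:
    """Pick the smallest spec whose capacity covers the expected users."""
    capacity = MAX_USERS[series_length]
    for spec in ("S", "M", "L", "XL"):
        if capacity[spec] >= concurrent_users:
            return spec
    raise ValueError("Load exceeds the largest predefined specification")
```

For example, a 1000-point univariate workload with 20 concurrent users falls above the M tier (12 users) and requires L (25 users).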

General-purpose models (GPU-based hardware specifications)

For general-purpose models, available inference runtimes support these GPU-based hardware specifications:

  • WX-S: 1 GPU, 2 CPUs, and 60 GB of memory
  • WX-M: 2 GPUs, 3 CPUs, and 120 GB of memory
  • WX-L: 4 GPUs, 5 CPUs, and 240 GB of memory
  • WX-XL: 8 GPUs, 9 CPUs, and 600 GB of memory

These predefined hardware specifications apply only to the following standard supported hardware configurations:

  • NVIDIA A100 with 80 GB of GPU memory
  • NVIDIA H100 with 80 GB of GPU memory

If your GPU configuration is different (for example, NVIDIA L40S with 48 GB of GPU memory), you must create a custom hardware specification. For details, see Creating custom hardware specifications.
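For a single-GPU L40S configuration, a custom hardware specification might be sketched as the following payload. This is an illustrative sketch only: the field names and values are assumptions, not an official schema; see Creating custom hardware specifications for the authoritative format.

```python
# Hypothetical custom hardware specification for one NVIDIA L40S (48 GB).
# Field names are illustrative assumptions, not an official schema.
custom_hw_spec = {
    "name": "l40s-single-gpu",
    "description": "1x NVIDIA L40S, 48 GB GPU memory",
    "nodes": {
        "cpu": {"units": "2"},
        "mem": {"size": "60Gi"},
        "gpu": {"num_gpu": 1},
    },
}
```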

Assign a hardware specification to your custom foundation model, based on the number of parameters that the model was trained with:

  • 1B to 20B parameters: WX-S
  • 21B to 40B parameters: WX-M
  • 41B to 80B parameters: WX-L
  • 81B to 200B parameters: WX-XL
Note:

Some model architectures support only one GPU, so regardless of the number of parameters, you must assign the WX-S hardware specification to them. You cannot use predefined hardware specifications with quantized models. For quantized models and in other non-standard cases, use a custom hardware specification. For more information, see Creating custom hardware specifications.
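The sizing rules above, including the single-GPU caveat from the note, can be sketched as a small helper. This is an illustrative function, not part of the watsonx.ai tooling; the thresholds are taken from the list above, and quantized models are out of scope because they require a custom hardware specification.

```python
def pick_gpu_spec(params_billions: float, multi_gpu_supported: bool = True) -> str:
    """Map a model's parameter count (in billions) to a predefined
    GPU-based hardware specification (WX-S through WX-XL)."""
    if not multi_gpu_supported:
        # Architectures limited to one GPU must use WX-S regardless of size.
        return "WX-S"
    if params_billions <= 20:
        return "WX-S"
    if params_billions <= 40:
        return "WX-M"
    if params_billions <= 80:
        return "WX-L"
    if params_billions <= 200:
        return "WX-XL"
    raise ValueError("Use a custom hardware specification for larger models")
```

For example, an 8B-parameter model maps to WX-S, while a 70B-parameter model with multi-GPU support maps to WX-L.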

Next steps