Requirements for deploying custom foundation models on dedicated GPUs
Review the supported foundation model architectures and the performance settings and features that are available for deploying a custom foundation model with watsonx.ai.
When you deploy a custom foundation model, consider the following requirements:
- Make sure that the model that you are deploying uses a supported architecture.
- Choose a hardware specification for deploying a custom foundation model.
Supported model architectures
Various software specifications are available for your deployments:
- The watsonx-cfm-caikit-1.0 software specification is based on the TGI runtime engine.
- The watsonx-cfm-caikit-1.1 software specification is based on the vLLM runtime engine. It provides better performance, but it is not available for every model architecture.
- The watsonx-tsfm-runtime-1.0 software specification is designed for time-series models. It is based on the watsonx-tsfm-runtime-1.0 inference runtime.
The following tables provide information about the supported model architectures, available quantization methods, support for multiple GPUs, and available deployment configurations. Also provided are the software specifications that you can use when you deploy the models and the names of some example foundation models that belong to each of the listed architecture types.
General-purpose models:
| Model family | Foundation model examples | Supported quantization method | Parallel tensors (multiple GPUs supported) | Deployment configurations | Software specifications |
|---|---|---|---|---|---|
| bloom | bigscience/bloom-3b, bigscience/bloom-560m | N/A | Yes | Small, Medium, Large | watsonx-cfm-caikit-1.0, watsonx-cfm-caikit-1.1 |
| codegen | Salesforce/codegen-350M-mono, Salesforce/codegen-16B-mono | N/A | No | Small | watsonx-cfm-caikit-1.0 |
| exaone | lgai-exaone/exaone-3.0-7.8B-Instruct | N/A | No | Small | watsonx-cfm-caikit-1.1 |
| falcon | tiiuae/falcon-7b | N/A | Yes | Small, Medium, Large | watsonx-cfm-caikit-1.0, watsonx-cfm-caikit-1.1 |
| gemma | google/gemma-2b | N/A | Yes | Small, Medium, Large | watsonx-cfm-caikit-1.1 |
| gemma2 | google/gemma-2-9b | N/A | Yes | Small, Medium, Large | watsonx-cfm-caikit-1.1 |
| gpt_bigcode | bigcode/starcoder, bigcode/gpt_bigcode-santacoder | gptq | Yes | Small, Medium, Large | watsonx-cfm-caikit-1.0, watsonx-cfm-caikit-1.1 |
| gpt_neox | rinna/japanese-gpt-neox-small, EleutherAI/pythia-12b, databricks/dolly-v2-12b | N/A | Yes | Small, Medium, Large | watsonx-cfm-caikit-1.0, watsonx-cfm-caikit-1.1 |
| gptj | EleutherAI/gpt-j-6b | N/A | No | Small | watsonx-cfm-caikit-1.0, watsonx-cfm-caikit-1.1 |
| gpt2 | gpt2, gpt2-xl | N/A | Yes | Small, Medium, Large | watsonx-cfm-caikit-1.1 |
| granite | ibm-granite/granite-3.0-8b-instruct, ibm-granite/granite-3b-code-instruct-2k, granite-8b-code-instruct, granite-7b-lab | N/A | No | Small | watsonx-cfm-caikit-1.1 |
| jais | core42/jais-13b | N/A | Yes | Small, Medium, Large | watsonx-cfm-caikit-1.1 |
| llama | Deepseek-R1 (distilled variant), meta-llama/Meta-Llama-3-8B, meta-llama/Meta-Llama-3.1-8B-Instruct, llama-2-13b-chat-hf, TheBloke/Llama-2-7B-Chat-AWQ, ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf | gptq | Yes | Small, Medium, Large | watsonx-cfm-caikit-1.0, watsonx-cfm-caikit-1.1 |
| mistral | mistralai/Mistral-7B-v0.3, neuralmagic/OpenHermes-2.5-Mistral-7B-marlin | N/A | No | Small | watsonx-cfm-caikit-1.0, watsonx-cfm-caikit-1.1 |
| mixtral | TheBloke/Mixtral-8x7B-v0.1-GPTQ, mistralai/Mixtral-8x7B-Instruct-v0.1 | gptq | No | Small | watsonx-cfm-caikit-1.1 |
| mpt | mosaicml/mpt-7b, mosaicml/mpt-7b-storywriter, mosaicml/mpt-30b | N/A | No | Small | watsonx-cfm-caikit-1.0 |
| mt5 | google/mt5-small, google/mt5-xl | N/A | No | Small | watsonx-cfm-caikit-1.0 |
| nemotron | nvidia/Minitron-8B-Base | N/A | Yes | Small, Medium, Large | watsonx-cfm-caikit-1.1 |
| olmo | allenai/OLMo-1B-hf, allenai/OLMo-7B-hf | N/A | Yes | Small, Medium, Large | watsonx-cfm-caikit-1.1 |
| persimmon | adept/persimmon-8b-base, adept/persimmon-8b-chat | N/A | Yes | Small, Medium, Large | watsonx-cfm-caikit-1.1 |
| phi | microsoft/phi-2, microsoft/phi-1_5 | N/A | Yes | Small, Medium, Large | watsonx-cfm-caikit-1.1 |
| phi3 | microsoft/Phi-3-mini-4k-instruct | N/A | Yes | Small, Medium, Large | watsonx-cfm-caikit-1.1 |
| qwen | DeepSeek-R1 (distilled variant) | N/A | Yes | Small, Medium, Large | watsonx-cfm-caikit-1.1 |
| qwen2 | Deepseek-R1 (distilled variant), Qwen/Qwen2-7B-Instruct-AWQ | AWQ | Yes | Small, Medium, Large | watsonx-cfm-caikit-1.1 |
| t5 | google/flan-t5-large, google/flan-t5-small | N/A | Yes | Small, Medium, Large | watsonx-cfm-caikit-1.0 |
Time-series models:
| Model family | Foundation model examples | Supported quantization method | Parallel tensors (multiple GPUs supported) | Deployment configurations | Software specifications |
|---|---|---|---|---|---|
| tinytimemixer | ibm-granite/granite-timeseries-ttm-r2 | N/A | N/A | Small, Medium, Large, Extra large | watsonx-tsfm-runtime-1.0 |
- From release 2.2.1: Automatic speech recognition (ASR) models can be added globally by admins. For more information, see Registering custom foundation models for global deployment.
- Deployments of llama 3.1 models might fail if you use the watsonx-cfm-caikit-1.0 software specification. To address this issue, see the steps that are listed in Troubleshooting.
- IBM certifies only the model architectures that are listed in the preceding tables. You can use models with alternative architectures if they are supported by the vLLM runtime; however, IBM does not support deployment failures that result from deploying foundation models with unsupported architectures or incompatible features.
- If your model architecture is not supported, an MLOps engineer can add a custom inference runtime image that matches your model's architecture and prepare a model asset for deployment. Note that if you fine-tune such a model by using the LoRA technique, you cannot deploy it.
Predefined hardware specifications
For time-series models, use CPU-based hardware specifications.
CPU-based hardware specifications
Here is the list of CPU-based hardware specifications:
- S: 2 vCPUs and 8 GB of memory
- M: 4 vCPUs and 16 GB of memory
- L: 8 vCPUs and 32 GB of memory
- XL: 16 vCPUs and 64 GB of memory
For time-series models, assign a hardware specification to your model based on the maximum number of concurrent users and payload characteristics. The following table shows the number of concurrent users that each hardware specification supports for the listed payload sizes:
| Univariate Time Series | Multivariate Time Series (Series x Targets) | S | M | L | XL |
|---|---|---|---|---|---|
| 1000 | 23x100 | 6 | 12 | 25 | 50 |
| 500 | 15x80 | 10 | 21 | 42 | 85 |
| 250 | 15x40 | 13 | 26 | 53 | 106 |
| 125 | 15x20 | 13 | 27 | 54 | 109 |
| 60 | 15x10 | 14 | 28 | 56 | 112 |
| 30 | 15x5 | 14 | 28 | 56 | 113 |
General-purpose models (GPU-based hardware specifications)
For general-purpose models, available inference runtimes support these GPU-based hardware specifications:
- WX-S: 1 GPU, 2 CPUs, and 60 GB of memory
- WX-M: 2 GPUs, 3 CPUs, and 120 GB of memory
- WX-L: 4 GPUs, 5 CPUs, and 240 GB of memory
- WX-XL: 8 GPUs, 9 CPUs, and 600 GB of memory
These predefined hardware specifications apply only to these standard supported hardware configurations:
- NVIDIA A100 with 80 GB of GPU memory
- NVIDIA H100 with 80 GB of GPU memory
If your GPU configuration is different (for example NVIDIA L40S with 48 GB of GPU memory), you must create a custom hardware specification. For details, see Creating custom hardware specifications.
Assign a hardware specification to your custom foundation model based on the number of parameters that the model was trained with:
- 1B to 20B parameters: WX-S
- 21B to 40B parameters: WX-M
- 41B to 80B parameters: WX-L
- 80B to 200B parameters: WX-XL
Some model architectures support only one GPU, so regardless of the number of parameters, you must assign the WX-S hardware specification to them. You cannot use predefined hardware specifications with quantized models. For quantized models and in other non-standard cases, use a custom hardware specification. For more information, see Creating custom hardware specifications.
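The parameter-based sizing rules above, including the single-GPU caveat, can be sketched as a small helper function. This is an illustration of the guidance, not a watsonx.ai API; the function name and signature are assumptions.

```python
def gpu_hardware_spec(params_b: float, single_gpu_only: bool = False) -> str:
    """Pick a predefined GPU hardware specification from the model's
    parameter count in billions, following the sizing guidance above.

    Architectures that support only one GPU must use WX-S regardless of
    size. Quantized models are not covered here: they require a custom
    hardware specification instead of a predefined one.
    """
    if single_gpu_only:
        return "WX-S"
    if params_b <= 20:
        return "WX-S"
    if params_b <= 40:
        return "WX-M"
    if params_b <= 80:
        return "WX-L"
    if params_b <= 200:
        return "WX-XL"
    raise ValueError("Models above 200B parameters need a custom hardware specification")
```

For example, an 8B-parameter model maps to WX-S and a 70B-parameter model to WX-L, unless the architecture is single-GPU-only, in which case WX-S applies regardless.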