Planning to deploy a custom foundation model
Review the considerations and requirements for deploying a custom foundation model for inferencing with watsonx.ai.
Uploading and using your own custom model is available only in the Standard plan for watsonx.ai.
As you prepare to deploy a custom foundation model, review these planning considerations:
- Review the Requirements and usage notes for custom foundation models
- Review the Supported architectures for custom foundation models to make sure that your model is compatible
- Collect the details that are required as prerequisites for deploying a custom foundation model
- Select a hardware specification for your custom foundation model
- Review resource utilization guidelines
- Review the deployment limitations
- Enable task credentials to be able to deploy custom foundation models
- Verify the list of modalities (text, audio, video, and image) that can be used when inferencing your model
Requirements and usage notes for custom foundation models
Deployable custom models must meet these requirements:
- The model must be built with a supported model architecture type.
- The file list for the model must contain a `config.json` file.
- General-purpose models: you must make sure that your custom foundation model is saved with the supported `transformers` library. If the `model.safetensors` file for your custom foundation model uses an unsupported data format in the metadata header, your deployment might fail. For more information, see Troubleshooting watsonx.ai Runtime.
- Time-series models: the model directory must contain the `tsfm_config.json` file. Time-series models that are hosted on Hugging Face (`model_type: tinytimemixer`) might not include this file. If the file is missing when the model is downloaded and deployed, forecasting fails. To avoid forecasting issues, you must perform an extra step when you download the model. You can check for both files programmatically, as shown in the sketch after this list.
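If you want to verify these file requirements before you upload the model, a minimal sketch along the following lines can help. It assumes the `huggingface_hub` Python package; the repository ID and local directory are examples, not requirements:

```python
# Minimal sketch: download a model snapshot and check for the files that
# watsonx.ai requires. Assumes the huggingface_hub package is installed.
from pathlib import Path

from huggingface_hub import snapshot_download

repo_id = "ibm-granite/granite-timeseries-ttm-r2"  # example time-series model
local_dir = Path("granite-timeseries-ttm-r2")

snapshot_download(repo_id=repo_id, local_dir=str(local_dir))

# config.json is mandatory for every custom foundation model.
if not (local_dir / "config.json").exists():
    raise FileNotFoundError("config.json is missing from the model folder.")

# Time-series models also need tsfm_config.json. Hugging Face models with
# model_type tinytimemixer might ship without it; if it is missing, perform
# the documented extra download step before you deploy, or forecasting fails.
if not (local_dir / "tsfm_config.json").exists():
    print("Warning: tsfm_config.json not found; forecasting would fail.")
```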
Additionally, you must make sure that the project or space where you want to deploy your custom foundation model has an associated watsonx.ai Runtime instance. To verify the association, open the Manage tab in your project or space.
Supported model architectures
The following tables list the model architectures that you can deploy as custom models for inferencing with watsonx.ai. The model architectures are listed together with information about their supported quantization methods, parallel tensors, deployment configuration sizes, and software specifications.
Various software specifications are available for your deployments:
- The `watsonx-cfm-caikit-1.0` software specification is based on the TGI runtime engine.
- The `watsonx-cfm-caikit-1.1` software specification is based on the vLLM runtime engine. It offers better performance, but it is not available for every model architecture.
- The `watsonx-tsfm-runtime-1.0` software specification is designed for time-series models and is based on the `watsonx-tsfm-runtime-1.0` inference runtime.

To look up a software specification programmatically, see the sketch after this list.
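When you create the model asset programmatically, you reference the software specification by ID. As a hedged illustration, the `ibm-watsonx-ai` Python SDK can resolve the ID by name roughly as follows; the URL, API key, and space ID are placeholders, and the exact calls can vary between SDK versions:

```python
# Illustrative sketch with the ibm-watsonx-ai SDK (calls may vary by version).
from ibm_watsonx_ai import APIClient, Credentials

credentials = Credentials(
    url="https://us-south.ml.cloud.ibm.com",  # placeholder region endpoint
    api_key="<your IBM Cloud API key>",
)
client = APIClient(credentials, space_id="<your deployment space ID>")

# Resolve the ID of the software specification chosen from the tables below.
spec_id = client.software_specifications.get_id_by_name("watsonx-cfm-caikit-1.1")
print(spec_id)
```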
General-purpose models:
| Model architecture type | Foundation model examples | Quantization method | Parallel tensors (multiGpu) | Software specifications |
|---|---|---|---|---|
| `bloom` | `bigscience/bloom-3b`, `bigscience/bloom-560m` | N/A | Yes | `watsonx-cfm-caikit-1.0`, `watsonx-cfm-caikit-1.1` |
| `codegen` | `Salesforce/codegen-350M-mono`, `Salesforce/codegen-16B-mono` | N/A | No | `watsonx-cfm-caikit-1.0` |
| `exaone` | `lgai-exaone/exaone-3.0-7.8B-Instruct` | N/A | No | `watsonx-cfm-caikit-1.1` |
| `falcon` | `tiiuae/falcon-7b` | N/A | Yes | `watsonx-cfm-caikit-1.0`, `watsonx-cfm-caikit-1.1` |
| `gemma` | `google/gemma-2b` | N/A | Yes | `watsonx-cfm-caikit-1.1` |
| `gemma2` | `google/gemma-2-9b` | N/A | Yes | `watsonx-cfm-caikit-1.1` |
| `gemma3` | `google/gemma-3-27b-it` | N/A | Yes | `watsonx-cfm-caikit-1.1` |
| `gpt_bigcode` | `bigcode/starcoder`, `bigcode/gpt_bigcode-santacoder` | gptq | Yes | `watsonx-cfm-caikit-1.0`, `watsonx-cfm-caikit-1.1` |
| `gpt-neox` | `rinna/japanese-gpt-neox-small`, `EleutherAI/pythia-12b`, `databricks/dolly-v2-12b` | N/A | Yes | `watsonx-cfm-caikit-1.0`, `watsonx-cfm-caikit-1.1` |
| `gptj` | `EleutherAI/gpt-j-6b` | N/A | No | `watsonx-cfm-caikit-1.0`, `watsonx-cfm-caikit-1.1` |
| `gpt2` | `openai-community/gpt2-large` | N/A | No | `watsonx-cfm-caikit-1.0`, `watsonx-cfm-caikit-1.1` |
| `granite` | `ibm-granite/granite-3.0-8b-instruct`, `ibm-granite/granite-3b-code-instruct-2k`, `granite-8b-code-instruct`, `granite-7b-lab` | N/A | No | `watsonx-cfm-caikit-1.1` |
| `jais` | `core42/jais-13b` | N/A | Yes | `watsonx-cfm-caikit-1.1` |
| `llama` | `DeepSeek-R1` (distilled variant), `meta-llama/Meta-Llama-3-8B`, `meta-llama/Meta-Llama-3.1-8B-Instruct`, `llama-2-13b-chat-hf`, `TheBloke/Llama-2-7B-Chat-AWQ`, `ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf` | gptq | Yes | `watsonx-cfm-caikit-1.0`, `watsonx-cfm-caikit-1.1` |
| `mistral` | `mistralai/Mistral-7B-v0.3`, `neuralmagic/OpenHermes-2.5-Mistral-7B-marlin` | N/A | No | `watsonx-cfm-caikit-1.0`, `watsonx-cfm-caikit-1.1` |
| `mixtral` | `TheBloke/Mixtral-8x7B-v0.1-GPTQ`, `mistralai/Mixtral-8x7B-Instruct-v0.1` | gptq | No | `watsonx-cfm-caikit-1.1` |
| `mpt` | `mosaicml/mpt-7b`, `mosaicml/mpt-7b-storywriter`, `mosaicml/mpt-30b` | N/A | No | `watsonx-cfm-caikit-1.0` |
| `mt5` | `google/mt5-small`, `google/mt5-xl` | N/A | No | `watsonx-cfm-caikit-1.0` |
| `nemotron` | `nvidia/Minitron-8B-Base` | N/A | Yes | `watsonx-cfm-caikit-1.1` |
| `olmo` | `allenai/OLMo-1B-hf`, `allenai/OLMo-7B-hf` | N/A | Yes | `watsonx-cfm-caikit-1.1` |
| `persimmon` | `adept/persimmon-8b-base`, `adept/persimmon-8b-chat` | N/A | Yes | `watsonx-cfm-caikit-1.1` |
| `phi` | `microsoft/phi-2`, `microsoft/phi-1_5` | N/A | Yes | `watsonx-cfm-caikit-1.1` |
| `phi3` | `microsoft/Phi-3-mini-4k-instruct` | N/A | Yes | `watsonx-cfm-caikit-1.1` |
| `qwen` | `DeepSeek-R1` (distilled variant) | N/A | Yes | `watsonx-cfm-caikit-1.1` |
| `qwen2` | `Qwen/Qwen2-7B-Instruct-AWQ` | AWQ | Yes | `watsonx-cfm-caikit-1.1` |
| `qwen3` | `Qwen/Qwen3-32B` | N/A | Yes | `watsonx-cfm-caikit-1.1` |
| `t5` | `google/flan-t5-large`, `google/flan-t5-small` | N/A | Yes | `watsonx-cfm-caikit-1.0` |
Time-series models:
| Model architecture type | Foundation model examples | Quantization method | Parallel tensors (multiGpu) | Deployment configurations | Software specifications |
|---|---|---|---|---|---|
| `tinytimemixer` | `ibm-granite/granite-timeseries-ttm-r2` | N/A | N/A | Small, Medium, Large, Extra large | `watsonx-tsfm-runtime-1.0` |
- IBM certifies only the model architectures that are listed in the preceding tables. You can use models with other architectures that are supported by the vLLM inference framework, but IBM does not provide support for deployment failures that result from deploying foundation models with unsupported architectures or incompatible features.
- You cannot deploy `codegen`, `mt5`, and `t5` type models with the `watsonx-cfm-caikit-1.1` software specification.
- If your model does not support parallel tensors, the only configurations that you can use are single-GPU configurations, for example `1 x L40S`, `1 x A100`, or `1 x H100`. However, if your model has more parameters than a single GPU can hold, deployment fails. For more information about limitations, see Resource utilization guidelines.
Collecting the prerequisite details for a custom foundation model
- Check that the `config.json` file exists in the foundation model content folder. The deployment service requires the `config.json` file to be present in the foundation model content folder after the model is uploaded to cloud storage.
- Open the `config.json` file to confirm that the foundation model uses a supported architecture.

Important: You must make sure that your custom foundation model is saved with the supported `transformers` library. If the `model.safetensors` file for your custom foundation model uses an unsupported data format in the metadata header, your deployment might fail. For more information, see Troubleshooting watsonx.ai Runtime.
See an example: for the falcon-40b model stored on Hugging Face, click Files and versions to view the file structure and check for `config.json`. The example model uses a version of the supported `falcon` architecture.
If the model does not meet these requirements, you cannot create a model asset and deploy your model.
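You can automate these prerequisite checks. The following sketch reads `config.json` and scans the metadata header of each `model.safetensors` file for tensor data types; the model folder path is hypothetical, and the sketch reports the data formats rather than validating them against your runtime:

```python
# Minimal sketch: confirm the prerequisites from the steps above.
# Checks that config.json exists, prints the declared architecture, and
# scans each .safetensors header for the tensor dtypes it records.
import json
import struct
from pathlib import Path

model_dir = Path("my-custom-model")  # hypothetical local model folder

config_path = model_dir / "config.json"
if not config_path.exists():
    raise FileNotFoundError("config.json is required in the model content folder.")

config = json.loads(config_path.read_text())
# model_type should match one of the supported architecture types, e.g. "falcon".
print("model_type:", config.get("model_type"))
print("architectures:", config.get("architectures"))

# A .safetensors file starts with an 8-byte little-endian length, followed by
# a JSON header that records each tensor's dtype. An unsupported data format
# in this header is a common cause of deployment failures.
for weights in model_dir.glob("*.safetensors"):
    with weights.open("rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    dtypes = {v["dtype"] for k, v in header.items() if k != "__metadata__"}
    print(weights.name, "->", sorted(dtypes))
```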
Resource utilization guidelines
Time-series models
The inference runtime for time-series models supports these hardware specifications: S (Small), M (Medium), L (Large), XL (Extra large).
Assign a hardware specification to your custom time-series model based on the maximum number of concurrent users and payload characteristics. In the following table, the first two columns describe the payload size, and the Small through Extra large columns show the maximum number of concurrent users for each hardware specification:
| Univariate time series | Multivariate time series (series x targets) | Small | Medium | Large | Extra large |
|---|---|---|---|---|---|
| 1000 | 23x100 | 6 | 12 | 25 | 50 |
| 500 | 15x80 | 10 | 21 | 42 | 85 |
| 250 | 15x40 | 13 | 26 | 53 | 106 |
| 125 | 15x20 | 13 | 27 | 54 | 109 |
| 60 | 15x10 | 14 | 28 | 56 | 112 |
| 30 | 15x5 | 14 | 28 | 56 | 113 |
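To illustrate how to read this table, the following hedged sketch picks the smallest specification that supports a given load. It encodes the rows above and matches on the univariate column; treat the result as a starting point and validate it with your own load testing:

```python
# Minimal sketch: pick the smallest time-series hardware specification that
# supports the expected number of concurrent users, using the sizing table
# above. Keys are univariate series lengths; values map spec -> max users.
SIZING = {
    1000: {"Small": 6, "Medium": 12, "Large": 25, "Extra large": 50},
    500: {"Small": 10, "Medium": 21, "Large": 42, "Extra large": 85},
    250: {"Small": 13, "Medium": 26, "Large": 53, "Extra large": 106},
    125: {"Small": 13, "Medium": 27, "Large": 54, "Extra large": 109},
    60: {"Small": 14, "Medium": 28, "Large": 56, "Extra large": 112},
    30: {"Small": 14, "Medium": 28, "Large": 56, "Extra large": 113},
}

def pick_spec(series_length: int, concurrent_users: int) -> str:
    # Round the payload up to the nearest row in the table.
    row = min((n for n in SIZING if n >= series_length), default=1000)
    for spec, max_users in SIZING[row].items():
        if concurrent_users <= max_users:
            return spec
    raise ValueError("No listed specification supports this load.")

print(pick_spec(series_length=250, concurrent_users=30))  # -> "Large"
```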
General-purpose models
- Assign the 1 GPU configuration (for example, 1 x A100 or 1 x H100) to any double-byte precision model under 26B parameters, subject to testing and validation.
- Assign the 2 GPU configuration (for example, 2 x A100 or 2 x H100) to any double-byte precision model between 27B and 53B parameters, subject to testing and validation.
- Assign the 4 GPU configuration (for example, 4 x A100 or 4 x H100) to any double-byte precision model between 54B and 106B parameters, subject to testing and validation.
- If the 1 GPU configuration fails, try the 2 GPU configuration.
- If the 2 GPU configuration fails, try the 4 GPU configuration.
| Configuration | Examples of suitable models |
|---|---|
| 1 GPU configuration (for example, 1 x H100) | `llama-3-8b`, `llama-2-13b`, `starcoder-15.5b`, `mt0-xxl-13b`, `jais-13b`, `gpt-neox-20b`, `flan-t5-xxl-11b`, `flan-ul2-20b`, `allam-1-13b` |
| 2 GPU configuration (for example, 2 x A100) | `codellama-34b` |
| 4 GPU configuration (for example, 4 x H100) | `llama-3-70b`, `llama-2-70b` |
The following table lists the total GPU memory for each configuration:
| GPU configuration | Total GPU memory |
|---|---|
| 1 × L40S | 48 GB |
| 2 × L40S | 96 GB |
| 1 × A100 | 80 GB |
| 2 × A100 | 160 GB |
| 4 × A100 | 320 GB |
| 8 × A100 | 640 GB |
| 1 × H100 | 80 GB |
| 2 × H100 | 160 GB |
| 4 × H100 | 320 GB |
| 8 × H100 | 640 GB |
| 1 × H200 | 141 GB |
| 2 × H200 | 282 GB |
| 4 × H200 | 564 GB |
| 8 × H200 | 1128 GB |
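The GPU-count guidance above follows from simple arithmetic: at double-byte (16-bit) precision, a model needs roughly 2 bytes per parameter, plus working memory for activations and the KV cache. The following sketch applies that rule of thumb against the memory table; the 20 percent overhead factor is an assumption for illustration, not an IBM-published figure, so always validate your chosen configuration by testing the deployment:

```python
# Rule-of-thumb sketch: estimate the GPU memory that a double-byte (16-bit)
# precision model needs and match it against the configurations above.
# Subset of the memory table; extend it with the 8-GPU rows if you need them.
GPU_MEMORY_GB = {"1 x L40S": 48, "1 x A100": 80, "2 x A100": 160,
                 "4 x A100": 320, "1 x H100": 80, "2 x H100": 160,
                 "4 x H100": 320, "1 x H200": 141, "2 x H200": 282}

def estimate_required_gb(params_billions: float, overhead: float = 1.2) -> float:
    # 2 bytes per parameter at 16-bit precision, plus assumed runtime overhead.
    return params_billions * 2 * overhead

def candidate_configs(params_billions: float) -> list[str]:
    needed = estimate_required_gb(params_billions)
    return [cfg for cfg, gb in GPU_MEMORY_GB.items() if gb >= needed]

print(candidate_configs(13))  # ~31 GB -> any listed configuration
print(candidate_configs(70))  # ~168 GB -> 4 x A100, 4 x H100, 2 x H200
```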
Limitations and restrictions for custom foundation models
- Time-series models do not take any parameters. Any parameters that you provide when you deploy a custom time-series model are ignored.
- You cannot tune a custom foundation model.
- You cannot use watsonx.governance to evaluate or track a prompt template for a custom foundation model.
Next steps
Downloading a custom foundation model and setting up storage