Foundation models in IBM watsonx.ai

You can deploy a collection of third-party and IBM models in IBM watsonx.ai.

One of the following types of GPUs is required to support the use of foundation models in IBM watsonx.ai:
  • NVIDIA A100 GPUs with 80 GB RAM
  • NVIDIA H100 GPUs with 80 GB RAM
  • NVIDIA L40S GPUs with 48 GB RAM (Not supported with all models. See the model lists for details.)
Attention: You cannot use virtual GPUs (vGPUs) with the IBM watsonx.ai service.

5.0.3 or later: You can optionally partition A100 or H100 GPU processors so that more than one foundation model can run on a single GPU. For more information, see Partitioning GPU processors in IBM watsonx.ai. Models that can be partitioned indicate Yes for NVIDIA Multi-Instance GPU support in the foundation models list.
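
As a quick way to confirm whether MIG partitioning is enabled on the GPUs of a worker node, the following minimal sketch queries the NVIDIA driver through the nvidia-ml-py (pynvml) package. It is an illustration only, not part of the watsonx.ai tooling, and it assumes that the package and an NVIDIA driver are installed on the node:

    # Illustration only: list each GPU on the node and report whether
    # NVIDIA Multi-Instance GPU (MIG) mode is currently enabled.
    import pynvml

    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            name = pynvml.nvmlDeviceGetName(handle)
            try:
                current, _pending = pynvml.nvmlDeviceGetMigMode(handle)
                state = "enabled" if current == pynvml.NVML_DEVICE_MIG_ENABLE else "disabled"
            except pynvml.NVMLError_NotSupported:
                # MIG is available only on supported GPUs, such as A100 and H100.
                state = "not supported"
            print(f"GPU {i} ({name}): MIG {state}")
    finally:
        pynvml.nvmlShutdown()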

The following list shows the recommended number of GPUs to configure on a single OpenShift® worker node for each of the foundation models that are available with IBM watsonx.ai. You might be able to run some models with fewer GPUs, for example at context lengths shorter than the maximum, subject to performance tradeoffs and other constraints. If you use a configuration with fewer than the recommended number of GPUs, be sure to test the deployment to verify that the performance is satisfactory before you use the configuration in production.

When you calculate the total number of GPUs that you need for your deployment, consider whether you plan to customize any foundation models by tuning them. If you plan to tune a foundation model, factor in one GPU that can be reserved for tuning tasks. Do not partition the GPU that will be used for tuning a foundation model.

Note: See Hardware requirements for the vCPU and memory resources that are required by the IBM watsonx.ai service. Additional persistent and ephemeral storage resources are required to support foundation model inferencing, and the resources that are required vary based on the foundation models that you choose to install. The following list includes the resource requirements for each foundation model. Add the resource requirements for the foundation models that you plan to use to the resource requirements for the service to get the total resources that are needed.
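
To make the sizing arithmetic concrete, the following minimal sketch totals the per-model requirements and reserves one whole, unpartitioned GPU when tuning is planned. The helper function and the model selection are hypothetical illustrations; the per-model numbers are copied from the list that follows, and the service baseline from Hardware requirements must still be added to the result:

    # Hypothetical sizing helper: per-model numbers are copied from the
    # foundation model list that follows (GPU counts assume A100 or H100).
    MODEL_REQUIREMENTS = {
        "granite-13b-chat-v2":       {"gpus": 1, "vcpu": 2, "ram_gb": 128, "storage_gb": 36},
        "mixtral-8x7b-instruct-v01": {"gpus": 2, "vcpu": 8, "ram_gb": 96,  "storage_gb": 195},
    }

    def total_requirements(selected, tuning_planned=False):
        """Sum the requirements for the selected models; optionally reserve a tuning GPU."""
        totals = {"gpus": 0, "vcpu": 0, "ram_gb": 0, "storage_gb": 0}
        for model in selected:
            for resource, amount in MODEL_REQUIREMENTS[model].items():
                totals[resource] += amount
        if tuning_planned:
            # Reserve one whole GPU for tuning tasks; do not partition this GPU.
            totals["gpus"] += 1
        return totals

    print(total_requirements(["granite-13b-chat-v2", "mixtral-8x7b-instruct-v01"],
                             tuning_planned=True))
    # {'gpus': 4, 'vcpu': 10, 'ram_gb': 224, 'storage_gb': 231}
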
The following list describes the provided foundation models that you can install after you install the service. For each foundation model, the list shows the model name and model ID, a description, the system requirements, and the group name (group names apply to 5.0.1 or later).
Model name: allam-1-13b-instruct
Model ID: allam-1-13b-instruct
Description: A bilingual large language model for Arabic and English that is initialized with Llama-2 weights and is fine-tuned to support conversational tasks.
Note: Starting with 5.0.3, this model can be fine-tuned when configured to use an NVIDIA A100 or H100 GPU.
GPU (Number of shards): 1
CPU and memory: 2 CPU, 128 GB RAM
Storage: 30 GB
NVIDIA Multi-Instance GPU support: Yes
Group name: ibmwxAllam113bInstruct

Model name: codellama-34b-instruct
Model ID: codellama-codellama-34b-instruct-hf
Description: Code Llama is an AI model built on top of Llama 2, fine-tuned for generating and discussing code.
GPU (Number of shards): 2
Note: The 2 shards must be hosted on a single OpenShift worker node.
CPU and memory: 3 CPU, 128 GB RAM
Storage: 77 GB
NVIDIA Multi-Instance GPU support: No
Group name: ibmwxCodellamaCodellama34bInstructHf

Model name: elyza-japanese-llama-2-7b-instruct
Model ID: elyza-japanese-llama-2-7b-instruct
Description: General use with zero- or few-shot prompts. Works well for classification and extraction in Japanese and for translation between English and Japanese. Performs best when prompted in Japanese.
GPU (Number of shards): 1
CPU and memory: 2 CPU, 128 GB RAM
Storage: 50 GB
NVIDIA Multi-Instance GPU support: Yes
Group name: ibmwxElyzaJapaneseLlama27bInstruct

Model name: flan-t5-xl-3b
Model ID: google-flan-t5-xl
Description: General use with zero- or few-shot prompts.
Note: This foundation model can be prompt tuned.
GPU (Number of shards): 1
CPU and memory: 2 CPU, 128 GB RAM
Storage: 21 GB
NVIDIA Multi-Instance GPU support: Yes
Group name: ibmwxGoogleFlanT5xl

Model name: flan-t5-xxl-11b
Model ID: google-flan-t5-xxl
Description: General use with zero- or few-shot prompts.
GPU (Number of shards): 1
CPU and memory: 2 CPU, 128 GB RAM
Storage: 52 GB
NVIDIA Multi-Instance GPU support: Yes
Group name: ibmwxGoogleFlanT5xxl

Model name: flan-ul2-20b
Model ID: google-flan-ul2
Description: General use with zero- or few-shot prompts.
Note: In 5.0.0 only, if you want to use this model with L40S GPUs, you must take some extra steps. See Adding foundation models for details.
GPU (Number of shards): 1 A100 GPU, 1 H100 GPU, or 2 L40S GPUs
Note: The shards must be hosted on a single OpenShift worker node.
CPU and memory: 2 CPU (A100), 2 CPU (H100), or 3 CPU (L40S); 128 GB RAM
Storage: 85 GB
NVIDIA Multi-Instance GPU support: No
Group name: ibmwxGoogleFlanul2

Model name: granite-7b-lab
Model ID: ibm-granite-7b-lab
Description: InstructLab foundation model from IBM that supports knowledge and skills contributed by the open source community.
GPU (Number of shards): 1
CPU and memory: 2 CPU, 96 GB RAM
Storage: 30 GB
NVIDIA Multi-Instance GPU support: Yes
Group name: ibmwxGranite7bLab

Model name: granite-8b-japanese
Model ID: ibm-granite-8b-japanese
Description: A pretrained instruct variant model from IBM designed to work with Japanese text.
GPU (Number of shards): 1
CPU and memory: 2 CPU, 128 GB RAM
Storage: 50 GB
NVIDIA Multi-Instance GPU support: Yes
Group name: ibmwxGranite8bJapanese

Model name: granite-13b-chat-v2
Model ID: ibm-granite-13b-chat-v2
Description: General use model from IBM that is optimized for dialogue use cases.
GPU (Number of shards): 1
CPU and memory: 2 CPU, 128 GB RAM
Storage: 36 GB
NVIDIA Multi-Instance GPU support: Yes
Group name: ibmwxGranite13bChatv2

Model name: granite-13b-instruct-v2
Model ID: ibm-granite-13b-instruct-v2
Description: General use model from IBM that is optimized for question and answer use cases.
Note: This model can be prompt tuned.
GPU (Number of shards): 1
CPU and memory: 2 CPU, 128 GB RAM
Storage: 62 GB
NVIDIA Multi-Instance GPU support: Yes
Group name: ibmwxGranite13bInstructv2

Model name: granite-20b-multilingual
Model ID: ibm-granite-20b-multilingual
Description: The Granite model series is a family of IBM-trained, dense decoder-only models, which are particularly well-suited for generative tasks.
GPU (Number of shards): 1
Note: Supports NVIDIA A100 or H100 GPUs only.
Note: This foundation model cannot be sharded.
CPU and memory: 2 CPU, 96 GB RAM
Storage: 100 GB
NVIDIA Multi-Instance GPU support: No
Group name: ibmwxGranite20bMultilingual

Model name: granite-3b-code-instruct
Model ID: granite-3b-code-instruct
Description: A 3-billion parameter instruction fine-tuned model from IBM that supports code discussion, generation, and conversion.
Note: New in 5.0.1.
Note: Starting with 5.0.3, this model can be fine-tuned when configured to use an NVIDIA A100 or H100 GPU.
GPU (Number of shards): 1
CPU and memory: 2 CPU, 96 GB RAM
Storage: 9 GB
NVIDIA Multi-Instance GPU support: Yes
Group name: ibmwxGranite3bCodeInstruct

Model name: granite-8b-code-instruct
Model ID: granite-8b-code-instruct
Description: An 8-billion parameter instruction fine-tuned model from IBM that supports code discussion, generation, and conversion.
Note: New in 5.0.1.
Note: Starting with 5.0.3, this model can be fine-tuned when configured to use an NVIDIA A100 or H100 GPU.
GPU (Number of shards): 1
CPU and memory: 2 CPU, 96 GB RAM
Storage: 19 GB
NVIDIA Multi-Instance GPU support: Yes
Group name: ibmwxGranite8bCodeInstruct

Model name: granite-20b-code-instruct
Model ID: granite-20b-code-instruct
Description: A 20-billion parameter instruction fine-tuned model from IBM that supports code discussion, generation, and conversion.
Note: New in 5.0.1.
Note: Starting with 5.0.3, this model can be fine-tuned when configured to use an NVIDIA A100 or H100 GPU.
GPU (Number of shards): 1
CPU and memory: 2 CPU, 96 GB RAM
Storage: 70 GB
NVIDIA Multi-Instance GPU support: No
Group name: ibmwxGranite20bCodeInstruct

Model name: granite-34b-code-instruct
Model ID: granite-34b-code-instruct
Description: A 34-billion parameter instruction fine-tuned model from IBM that supports code discussion, generation, and conversion.
Note: New in 5.0.1.
GPU (Number of shards): 1
CPU and memory: 2 CPU, 96 GB RAM
Storage: 78 GB
NVIDIA Multi-Instance GPU support: No
Group name: ibmwxGranite34bCodeInstruct

Model name: jais-13b-chat
Model ID: core42-jais-13b-chat
Description: General use foundation model for generative tasks in Arabic.
GPU (Number of shards): 1
Note: Supports NVIDIA A100 or H100 GPUs only.
Note: This foundation model cannot be sharded.
CPU and memory: 2 CPU, 96 GB RAM
Storage: 60 GB
NVIDIA Multi-Instance GPU support: No
Group name: ibmwxCore42Jais13bChat

Model name: llama-2-13b-chat
Model ID: meta-llama-llama-2-13b-chat
Description: General use with zero- or few-shot prompts. Optimized for dialogue use cases.
Note: This model can be prompt tuned.
GPU (Number of shards): 1
CPU and memory: 2 CPU, 128 GB RAM
Storage: 62 GB
NVIDIA Multi-Instance GPU support: Yes
Group name: ibmwxMetaLlamaLlama213bChat

Model name: llama-2-70b-chat
Model ID: meta-llama-llama-2-70b-chat
Description: General use with zero- or few-shot prompts. Optimized for dialogue use cases.
Note: Deprecated in 5.0.3.
GPU (Number of shards): 4
Note: The 4 shards must be hosted on a single OpenShift worker node.
CPU and memory: 5 CPU, 250 GB RAM
Storage: 150 GB
Group name: ibmwxMetaLlamaLlama270bChat

Model name: llama2-13b-dpo-v7
Model ID: mncai-llama-2-13b-dpo-v7
Description: General use foundation model for generative tasks in Korean.
GPU (Number of shards): 1
Note: Supports NVIDIA A100 or H100 GPUs only.
CPU and memory: 2 CPU, 96 GB RAM
Storage: 30 GB
NVIDIA Multi-Instance GPU support: Yes
Group name: ibmwxMncaiLlama213bDpov7

Model name: llama-3-1-8b-instruct
Model ID: llama-3-1-8b-instruct
Description: An auto-regressive language model that uses an optimized transformer architecture.
Note: New in 5.0.3.
Note: Starting with 5.0.3, this model can be fine-tuned when configured to use an NVIDIA A100 or H100 GPU.
GPU (Number of shards): 1
CPU and memory: 2 CPU, 96 GB RAM
Storage: 20 GB
NVIDIA Multi-Instance GPU support: Yes
Group name: ibmwxLlama318bInstruct

Model name: llama-3-1-70b-instruct
Model ID: llama-3-1-70b-instruct
Description: An auto-regressive language model that uses an optimized transformer architecture.
Note: New in 5.0.3.
GPU (Number of shards): 8
Note: The 8 shards must be hosted on a single OpenShift worker node.
CPU and memory: 16 CPU, 246 GB RAM
Storage: 163 GB
Group name: ibmwxLlama3170bInstruct

Model name: llama-3-405b-instruct
Model ID: llama-3-405b-instruct
Description: Meta's largest open-source foundation model to date, with 405 billion parameters, optimized for dialogue use cases.
Note: New in 5.0.3.
GPU (Number of shards): 8
Note: The 8 shards must be hosted on a single OpenShift worker node.
CPU and memory: 16 CPU, 246 GB RAM
Storage: 500 GB
Group name: ibmwxLlama3405bInstruct

Model name: llama-3-8b-instruct
Model ID: meta-llama-llama-3-8b-instruct
Description: Pre-trained and instruction-tuned generative text model optimized for dialogue use cases.
GPU (Number of shards): 1
CPU and memory: 2 CPU, 96 GB RAM
Storage: 40 GB
NVIDIA Multi-Instance GPU support: Yes
Group name: ibmwxMetaLlamaLlama38bInstruct

Model name: llama-3-70b-instruct
Model ID: meta-llama-llama-3-70b-instruct
Description: Pre-trained and instruction-tuned generative text model optimized for dialogue use cases.
GPU (Number of shards): 4
Note: The 4 shards must be hosted on a single OpenShift worker node.
CPU and memory: 10 CPU, 246 GB RAM
Storage: 180 GB
Group name: ibmwxMetaLlamaLlama370bInstruct

Model name: merlinite-7b
Model ID: ibm-mistralai-merlinite-7b
Description: General use foundation model tuned by IBM that supports knowledge and skills contributed by the open source community.
Note: Deprecated in 5.0.3.
GPU (Number of shards): 1
Note: This foundation model cannot be sharded.
CPU and memory: 2 CPU, 96 GB RAM
Storage: 20 GB
NVIDIA Multi-Instance GPU support: Yes
Group name: ibmwxMistralaiMerlinite7b

Model name: mistral-large
Model ID: mistral-large
Description: The most advanced Large Language Model (LLM) developed by Mistral AI, with state-of-the-art reasoning capabilities that can be applied to any language-based task, including the most sophisticated ones.
Note: New in 5.0.3.
Attention: You must purchase Mistral AI with IBM separately before you are entitled to download and use this model.
GPU (Number of shards): 8
Note: The 8 shards must be hosted on a single OpenShift worker node.
CPU and memory: 16 CPU, 246 GB RAM
Storage: 240 GB
Group name: ibmwxMistralLarge

Model name: mixtral-8x7b-instruct-v01
Model ID: mistralai-mixtral-8x7b-instruct-v01
Description: The Mixtral-8x7B Large Language Model (LLM) is a pretrained generative Sparse Mixture of Experts.
GPU (Number of shards): 2 A100 GPUs, 2 H100 GPUs, or 4 L40S GPUs
Note: The shards must be hosted on a single OpenShift worker node.
CPU and memory: 8 CPU, 96 GB RAM
Storage: 195 GB
NVIDIA Multi-Instance GPU support: No
Group name: ibmwxMistralaiMixtral8x7bInstructv01

Model name: mt0-xxl-13b
Model ID: bigscience-mt0-xxl
Description: General use with zero- or few-shot prompts. Supports prompts in languages other than English and multilingual prompts.
GPU (Number of shards): 1
Note: This foundation model cannot be sharded.
CPU and memory: 2 CPU, 128 GB RAM
Storage: 62 GB
NVIDIA Multi-Instance GPU support: Yes
Group name: ibmwxBigscienceMt0xxl

For more information, see Supported foundation models.

You cannot add deprecated or withdrawn models to your deployment. For more information about how deprecated and withdrawn models are handled, see Foundation model lifecycle.

In addition to foundation models that are curated by IBM, you can upload and deploy your own foundation models. For more information about how to upload, register, and deploy a custom foundation model, see the custom foundation models documentation.
The following text embedding models are supported. For each embedding model, the list shows the model name and model ID, the system requirements, and the group name (group names apply to 5.0.1 or later).
Model name: all-minilm-l6-v2
Model ID: all-minilm-l6-v2
Note: New in 5.0.3.
CPU and memory: 2 CPU, 4 GB RAM
Storage: 1 GB
Group name: ibmwxAllMinilmL6V2

Model name: multilingual-e5-large
Model ID: multilingual-e5-large
Note: New in 5.0.3.
CPU and memory: 4 CPU, 8 GB RAM
Storage: 10 GB
Group name: ibmwxMultilingualE5Large

Model name: slate-30m-english-rtrvr
Model ID: ibm-slate-30m-english-rtrvr
Note: This model was updated to version 2.0.1 in CPD 5.0.3.
CPU and memory: 2 CPU, 4 GB RAM
Storage: 10 GB
Group name: ibmwxSlate30mEnglishRtrvr

Model name: slate-125m-english-rtrvr
Model ID: ibm-slate-125m-english-rtrvr
Note: This model was updated to version 2.0.1 in CPD 5.0.3.
CPU and memory: 2 CPU, 4 GB RAM
Storage: 10 GB
Group name: ibmwxSlate125mEnglishRtrvr

For more information, see Supported embedding models.