Adding foundation models to IBM watsonx.ai

Important: IBM Cloud Pak® for Data Version 4.8 will reach end of support (EOS) on 31 July 2025. For more information, see the Discontinuance of service announcement for IBM Cloud Pak for Data Version 4.X.

Upgrade to IBM Software Hub Version 5.1 before IBM Cloud Pak for Data Version 4.8 reaches end of support. For more information, see Upgrading from IBM Cloud Pak for Data Version 4.8 to IBM Software Hub Version 5.1.

To submit input to a foundation model from IBM watsonx.ai, you must deploy the models that you want to use in your cluster. To do so, patch the custom resource for the service to add foundation models.

Before you begin

The IBM watsonx.ai service must be installed.

Be sure you have the necessary resources available to support the models that you want to use. For more information about the overall resources that are required for the service, see Hardware requirements.

One of the following types of GPUs is required to support the use of foundation models in IBM watsonx.ai:
  • NVIDIA A100 80 GB
  • NVIDIA H100 80 GB
  • NVIDIA L40S 48 GB
Note: The Multi-Instance GPU (MIG) feature of the NVIDIA GPU must be disabled.
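
To confirm that the cluster has nodes with schedulable GPUs, you can check the allocatable nvidia.com/gpu resource on each node. The following command is a minimal sketch that assumes the NVIDIA GPU Operator advertises GPUs under the default nvidia.com/gpu resource name:

  oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'

Nodes that report a nonzero count can schedule the model pods; with MIG disabled, each physical GPU typically appears as one allocatable GPU.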

Each model uses a subset of the total resources that are needed for the service. The following table describes the subset of resources that are required per model.

Note: You do not need to prepare these resources in addition to the overall service hardware requirements. If you meet the prerequisite hardware requirements for the service, you already have the resources you need.
The following foundation models are supported:
codellama-34b-instruct
  Model ID: codellama-codellama-34b-instruct-hf
  Description: Code Llama is an AI model built on top of Llama 2, fine-tuned for generating and discussing code.
  Note: This model was introduced with the 4.8.4 release.
  Required GPUs: 1 | Memory requirements: 2 CPU, 128 GB RAM | Storage requirements: 77 GB

elyza-japanese-llama-2-7b-instruct
  Model ID: elyza-japanese-llama-2-7b-instruct
  Description: General use with zero- or few-shot prompts. Works well for classification and extraction in Japanese and for translation between English and Japanese. Performs best when prompted in Japanese.
  Note: This foundation model was introduced with the 4.8.3 release.
  Required GPUs: 1 | Memory requirements: 2 CPU, 128 GB RAM | Storage requirements: 50 GB

flan-t5-xl-3b
  Model ID: google-flan-t5-xl
  Description: General use with zero- or few-shot prompts.
  Note: This foundation model was introduced with the 4.8.1 release and can be tuned by using the Tuning Studio.
  Required GPUs: 1 | Memory requirements: 2 CPU, 128 GB RAM | Storage requirements: 21 GB

flan-t5-xxl-11b
  Model ID: google-flan-t5-xxl
  Description: General use with zero- or few-shot prompts.
  Required GPUs: 1 | Memory requirements: 2 CPU, 128 GB RAM | Storage requirements: 52 GB

flan-ul2-20b
  Model ID: google-flan-ul2
  Description: General use with zero- or few-shot prompts.
  Note: If you want to use this model with L40S GPUs, you must take some extra steps. See Step 3 in the procedure.
  Required GPUs: 1 A100 or H100, or 2 L40S | Memory requirements: 2 CPU (A100 or H100) or 3 CPU (L40S), 128 GB RAM | Storage requirements: 85 GB

gpt-neox-20b (deprecated in 4.8.4)
  Model ID: eleutherai-gpt-neox-20b
  Description: Works best with few-shot prompts. Works well with special characters, which can be used to generate structured output.
  Required GPUs: 1 | Memory requirements: 2 CPU, 128 GB RAM | Storage requirements: 100 GB

granite-8b-japanese
  Model ID: ibm-granite-8b-japanese
  Description: The Granite model series is a family of IBM-trained, dense decoder-only models, which are particularly well-suited for generative tasks.
  Note: This model was introduced with the 4.8.4 release.
  Required GPUs: 1 | Memory requirements: 2 CPU, 128 GB RAM | Storage requirements: 50 GB

granite-13b-chat-v2
  Model ID: ibm-granite-13b-chat-v2
  Description: General use model from IBM that is optimized for dialogue use cases.
  Note: This model was introduced with the 4.8.1 release. A modification to this model was made with the 4.8.4 release.
  Required GPUs: 1 | Memory requirements: 2 CPU, 128 GB RAM | Storage requirements: 36 GB

granite-13b-chat-v1 (deprecated in 4.8.4)
  Model ID: ibm-granite-13b-chat-v1
  Description: General use model from IBM that is optimized for dialogue use cases.
  Required GPUs: 1 | Memory requirements: 2 CPU, 128 GB RAM | Storage requirements: 36 GB

granite-13b-instruct-v2
  Model ID: ibm-granite-13b-instruct-v2
  Description: General use model from IBM that is optimized for question and answer use cases.
  Note: This model was introduced with the 4.8.1 release.
  Required GPUs: 1 | Memory requirements: 2 CPU, 128 GB RAM | Storage requirements: 62 GB

granite-13b-instruct-v1 (deprecated in 4.8.4)
  Model ID: ibm-granite-13b-instruct-v1
  Description: General use model from IBM that is optimized for question and answer use cases.
  Required GPUs: 1 | Memory requirements: 2 CPU, 128 GB RAM | Storage requirements: 36 GB

granite-20b-multilingual
  Model ID: ibm-granite-20b-multilingual
  Description: The Granite model series is a family of IBM-trained, dense decoder-only models, which are particularly well-suited for generative tasks.
  Note: This model was introduced with the 4.8.4 release.
  Required GPUs: 1 | Memory requirements: 2 CPU, 128 GB RAM | Storage requirements: 100 GB

jais-13b-chat
  Model ID: core42-jais-13b-chat
  Description: General use foundation model for generative tasks in Arabic.
  Note: This model was introduced with the 4.8.5 release.
  Required GPUs: 1 | Memory requirements: 2 CPU, 96 GB RAM | Storage requirements: 60 GB

llama-2-13b-chat
  Model ID: meta-llama-llama-2-13b-chat
  Description: General use with zero- or few-shot prompts. Optimized for dialogue use cases.
  Note: This model was introduced with the 4.8.1 release and can be tuned by using the Tuning Studio as of the 4.8.4 release.
  Required GPUs: 1 | Memory requirements: 2 CPU, 128 GB RAM | Storage requirements: 62 GB

llama-2-70b-chat
  Model ID: meta-llama-llama-2-70b-chat
  Description: General use with zero- or few-shot prompts. Optimized for dialogue use cases.
  Required GPUs: 4 | Memory requirements: 5 CPU, 250 GB RAM | Storage requirements: 150 GB

llama2-13b-dpo-v7
  Model ID: mncai-llama-2-13b-dpo-v7
  Description: General use foundation model for generative tasks in Korean.
  Note: This model was introduced with the 4.8.5 release.
  Required GPUs: 1 | Memory requirements: 2 CPU, 96 GB RAM | Storage requirements: 30 GB

mixtral-8x7b-instruct-v01-q
  Model ID: ibm-mistralai-mixtral-8x7b-instruct-v01-q
  Description: A quantized version of the Mixtral 8x7B Instruct model that is built with AutoGPTQ, which compresses the model weights from FP16 to 4-bit INT and decompresses them on the fly to FP16 before computation.
  Note: This model was introduced with the 4.8.4 release.
  Required GPUs: 1 | Memory requirements: 2 CPU, 128 GB RAM | Storage requirements: 35 GB

mpt-7b-instruct2 (deprecated in 4.8.4)
  Model ID: ibm-mpt-7b-instruct2
  Description: General use with few-shot prompts.
  Required GPUs: 1 | Memory requirements: 2 CPU, 128 GB RAM | Storage requirements: 50 GB

mt0-xxl-13b
  Model ID: bigscience-mt0-xxl
  Description: General use with zero- or few-shot prompts. Supports prompts in languages other than English and multilingual prompts.
  Required GPUs: 1 | Memory requirements: 2 CPU, 128 GB RAM | Storage requirements: 62 GB

starcoder-15.5b (deprecated in 4.8.4)
  Model ID: bigcode-starcoder
  Description: Optimized for code generation and code conversion.
  Required GPUs: 1 | Memory requirements: 2 CPU, 128 GB RAM | Storage requirements: 138 GB
For more information, see Supported foundation models.
In addition to foundation models that are curated by IBM, you can upload and deploy your own foundation models. For more information about how to upload, register, and deploy a custom foundation model, see Deploying custom foundation models.
Restriction: The NVIDIA L40S GPU does not support the use of custom foundation models.

Procedure

To deploy foundation models in your cluster:

  1. Decide which models you want to use, and then make a note of their model IDs.
    Tip: Due to the large resource demands of foundation models, you might want to deploy only the models that you plan to use right away.
  2. To add models, run the following command to patch the deployment (a sketch for verifying the result follows the procedure):
    oc patch watsonxaiifm watsonxaiifm-cr \
    --namespace=${PROJECT_CPD_INST_OPERANDS} \
    --type=merge \
    --patch='{"spec":{"install_model_list": ["<model-id1>","<model-id2>"]}}'
    
    Replace the <model-id> variables with the IDs for the models that you want to use. For example:
    oc patch watsonxaiifm watsonxaiifm-cr \
    --namespace=${PROJECT_CPD_INST_OPERANDS} \
    --type=merge \
    --patch='{"spec":{"install_model_list": ["google-flan-t5-xxl","bigcode-starcoder"]}}'
    
  3. If you want to use the flan-ul2-20b foundation model with L40S GPUs, complete the following substeps:
    a. Run the following patch command to adjust the model configuration so that the model is optimized for use with L40S GPUs:
    oc patch -n ${PROJECT_CPD_INST_OPERANDS} watsonxaiifm watsonxaiifm-cr \
    --type merge -p '{"spec": {
      "google_flan_ul2": {
        "cuda_visible_devices": "0,1",
        "deployment_framework": "hf_custom_tp",
        "deployment_yaml_name": "flan-ul2-deployment.yaml.j2",
        "dir_name": "models--google--flan-ul2",
        "dtype_str": "float16",
        "force_apply": "false",
        "hf_modules_cache": "/tmp/huggingface/modules",
        "max_batch_size": "128",
        "max_batch_weight": "34543200",
        "max_concurrent_requests": "150",
        "max_new_tokens": "4096",
        "max_sequence_length": "4096",
        "model_name": "google/flan-ul2",
        "model_root_dir": "/watsonxaiifm-models",
        "num_gpus": "2",
        "num_shards": "2",
        "pt2_compile": "false",
        "pvc_name": "google-flan-ul2-pvc-rev2",
        "pvc_size": "85Gi",
        "svc_name": "flan-ul2"
      },
      "google_flan_ul2_resources": {
        "limits": {"cpu": "3", "ephemeral-storage": "1Gi", "memory": "128Gi", "nvidia.com/gpu": "2"},
        "requests": {"cpu": "2", "ephemeral-storage": "10Mi", "memory": "4Gi", "nvidia.com/gpu": "2"}
      }
    }}'
    b. Add the flan-ul2-20b model to the list of models to install by patching the deployment:
    oc patch watsonxaiifm watsonxaiifm-cr \
    --namespace=${PROJECT_CPD_INST_OPERANDS} \
    --type=merge \
    --patch='{"spec":{"install_model_list": ["google-flan-ul2"]}}'
    
    c. If you later want to remove the flan-ul2-20b model, run the following patch command to remove the flan-ul2 configuration and resource settings from the custom resource:
    oc patch -n ${PROJECT_CPD_INST_OPERANDS} watsonxaiifm watsonxaiifm-cr \
    --type json -p '[{"op": "remove", "path": "/spec/google_flan_ul2"},
    {"op": "remove", "path": "/spec/google_flan_ul2_resources"}]'

You can add more models at any time after the initial service setup.
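
Note that a merge patch replaces the entire install_model_list value rather than appending to it, so include every model ID that you want installed, not just the new ones. The following sketch assumes that google-flan-t5-xxl and bigcode-starcoder are already listed and that you want to add ibm-granite-13b-instruct-v2; it reads the current list and then reapplies the full list with the new ID included:

    oc get watsonxaiifm watsonxaiifm-cr \
    --namespace=${PROJECT_CPD_INST_OPERANDS} \
    -o jsonpath='{.spec.install_model_list}{"\n"}'

    oc patch watsonxaiifm watsonxaiifm-cr \
    --namespace=${PROJECT_CPD_INST_OPERANDS} \
    --type=merge \
    --patch='{"spec":{"install_model_list": ["google-flan-t5-xxl","bigcode-starcoder","ibm-granite-13b-instruct-v2"]}}'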

What to do next

IBM watsonx.ai is ready to use. To get started, see Developing generative AI solutions with foundation models.