Adding foundation models to IBM watsonx.ai

Important: IBM Cloud Pak® for Data Version 4.8 will reach end of support (EOS) on 31 July 2025. For more information, see the Discontinuance of service announcement for IBM Cloud Pak for Data Version 4.X.

Upgrade to IBM Software Hub Version 5.1 before IBM Cloud Pak for Data Version 4.8 reaches end of support. For more information, see Upgrading from IBM Cloud Pak for Data Version 4.8 to IBM Software Hub Version 5.1.

To submit input to a foundation model from IBM watsonx.ai, you must deploy the models that you want to use in your cluster. To do so, patch the custom resource for the service to add foundation models.

Before you begin

The IBM watsonx.ai service must be installed.

Be sure you have the necessary resources available to support the models that you want to use. For more information about the overall resources that are required for the service, see Hardware requirements.

One of the following types of GPUs is required to support the use of foundation models in IBM watsonx.ai:
  • NVIDIA A100 80 GB
  • NVIDIA H100 80 GB
  • NVIDIA L40S 48 GB
Note: The Multi-Instance GPU (MIG) feature of the NVIDIA GPU must be disabled.
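
To confirm that the cluster has nodes with schedulable GPUs, you can check the allocatable nvidia.com/gpu resource on each node. The following command is a minimal sketch that assumes the NVIDIA GPU Operator advertises GPUs under the default nvidia.com/gpu resource name:

  oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'

Nodes that report a nonzero count can schedule the model pods; with MIG disabled, each physical GPU typically appears as one allocatable GPU.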

Each model uses a subset of the total resources that are needed for the service. The following table describes the subset of resources that are required per model.

Note: You do not need to prepare these resources in addition to the overall service hardware requirements. If you meet the prerequisite hardware requirements for the service, you already have the resources you need.
The following foundation models are supported:
codellama-34b-instruct
  Model ID: codellama-codellama-34b-instruct-hf
  Description: Code Llama is an AI model built on top of Llama 2, fine-tuned for generating and discussing code.
  Note: This model was introduced with the 4.8.4 release.
  Required GPUs: 1 | Memory requirements: 2 CPU, 128 GB RAM | Storage requirements: 77 GB

elyza-japanese-llama-2-7b-instruct
  Model ID: elyza-japanese-llama-2-7b-instruct
  Description: General use with zero- or few-shot prompts. Works well for classification and extraction in Japanese and for translation between English and Japanese. Performs best when prompted in Japanese.
  Note: This foundation model was introduced with the 4.8.3 release.
  Required GPUs: 1 | Memory requirements: 2 CPU, 128 GB RAM | Storage requirements: 50 GB

flan-t5-xl-3b
  Model ID: google-flan-t5-xl
  Description: General use with zero- or few-shot prompts.
  Note: This foundation model was introduced with the 4.8.1 release and can be tuned by using the Tuning Studio.
  Required GPUs: 1 | Memory requirements: 2 CPU, 128 GB RAM | Storage requirements: 21 GB

flan-t5-xxl-11b
  Model ID: google-flan-t5-xxl
  Description: General use with zero- or few-shot prompts.
  Required GPUs: 1 | Memory requirements: 2 CPU, 128 GB RAM | Storage requirements: 52 GB

flan-ul2-20b
  Model ID: google-flan-ul2
  Description: General use with zero- or few-shot prompts.
  Note: If you want to use this model with L40S GPUs, you must take some extra steps. See Step 3 in the procedure.
  Required GPUs: 1 A100 or H100, or 2 L40S | Memory requirements: 2 CPU (A100 or H100) or 3 CPU (L40S), 128 GB RAM | Storage requirements: 85 GB

gpt-neox-20b (deprecated in 4.8.4)
  Model ID: eleutherai-gpt-neox-20b
  Description: Works best with few-shot prompts. Works well with special characters, which can be used to generate structured output.
  Required GPUs: 1 | Memory requirements: 2 CPU, 128 GB RAM | Storage requirements: 100 GB

granite-8b-japanese
  Model ID: ibm-granite-8b-japanese
  Description: The Granite model series is a family of IBM-trained, dense decoder-only models, which are particularly well-suited for generative tasks.
  Note: This model was introduced with the 4.8.4 release.
  Required GPUs: 1 | Memory requirements: 2 CPU, 128 GB RAM | Storage requirements: 50 GB

granite-13b-chat-v2
  Model ID: ibm-granite-13b-chat-v2
  Description: General use model from IBM that is optimized for dialogue use cases.
  Note: This model was introduced with the 4.8.1 release. A modification to this model was made with the 4.8.4 release.
  Required GPUs: 1 | Memory requirements: 2 CPU, 128 GB RAM | Storage requirements: 36 GB

granite-13b-chat-v1 (deprecated in 4.8.4)
  Model ID: ibm-granite-13b-chat-v1
  Description: General use model from IBM that is optimized for dialogue use cases.
  Required GPUs: 1 | Memory requirements: 2 CPU, 128 GB RAM | Storage requirements: 36 GB

granite-13b-instruct-v2
  Model ID: ibm-granite-13b-instruct-v2
  Description: General use model from IBM that is optimized for question and answer use cases.
  Note: This model was introduced with the 4.8.1 release.
  Required GPUs: 1 | Memory requirements: 2 CPU, 128 GB RAM | Storage requirements: 62 GB

granite-13b-instruct-v1 (deprecated in 4.8.4)
  Model ID: ibm-granite-13b-instruct-v1
  Description: General use model from IBM that is optimized for question and answer use cases.
  Required GPUs: 1 | Memory requirements: 2 CPU, 128 GB RAM | Storage requirements: 36 GB

granite-20b-multilingual
  Model ID: ibm-granite-20b-multilingual
  Description: The Granite model series is a family of IBM-trained, dense decoder-only models, which are particularly well-suited for generative tasks.
  Note: This model was introduced with the 4.8.4 release.
  Required GPUs: 1 | Memory requirements: 2 CPU, 128 GB RAM | Storage requirements: 100 GB

jais-13b-chat
  Model ID: core42-jais-13b-chat
  Description: General use foundation model for generative tasks in Arabic.
  Note: This model was introduced with the 4.8.5 release.
  Required GPUs: 1 | Memory requirements: 2 CPU, 96 GB RAM | Storage requirements: 60 GB

llama-2-13b-chat
  Model ID: meta-llama-llama-2-13b-chat
  Description: General use with zero- or few-shot prompts. Optimized for dialogue use cases.
  Note: This model was introduced with the 4.8.1 release and can be tuned by using the Tuning Studio as of the 4.8.4 release.
  Required GPUs: 1 | Memory requirements: 2 CPU, 128 GB RAM | Storage requirements: 62 GB

llama-2-70b-chat
  Model ID: meta-llama-llama-2-70b-chat
  Description: General use with zero- or few-shot prompts. Optimized for dialogue use cases.
  Required GPUs: 4 | Memory requirements: 5 CPU, 250 GB RAM | Storage requirements: 150 GB

llama2-13b-dpo-v7
  Model ID: mncai-llama-2-13b-dpo-v7
  Description: General use foundation model for generative tasks in Korean.
  Note: This model was introduced with the 4.8.5 release.
  Required GPUs: 1 | Memory requirements: 2 CPU, 96 GB RAM | Storage requirements: 30 GB

mixtral-8x7b-instruct-v01-q
  Model ID: ibm-mistralai-mixtral-8x7b-instruct-v01-q
  Description: A quantized version of the Mixtral 8x7B Instruct model that is built with AutoGPTQ, which compresses the model weights from FP16 to 4-bit INT and decompresses them on the fly to FP16 before computation.
  Note: This model was introduced with the 4.8.4 release.
  Required GPUs: 1 | Memory requirements: 2 CPU, 128 GB RAM | Storage requirements: 35 GB

mpt-7b-instruct2 (deprecated in 4.8.4)
  Model ID: ibm-mpt-7b-instruct2
  Description: General use with few-shot prompts.
  Required GPUs: 1 | Memory requirements: 2 CPU, 128 GB RAM | Storage requirements: 50 GB

mt0-xxl-13b
  Model ID: bigscience-mt0-xxl
  Description: General use with zero- or few-shot prompts. Supports prompts in languages other than English and multilingual prompts.
  Required GPUs: 1 | Memory requirements: 2 CPU, 128 GB RAM | Storage requirements: 62 GB

starcoder-15.5b (deprecated in 4.8.4)
  Model ID: bigcode-starcoder
  Description: Optimized for code generation and code conversion.
  Required GPUs: 1 | Memory requirements: 2 CPU, 128 GB RAM | Storage requirements: 138 GB
For more information, see Supported foundation models.
In addition to foundation models that are curated by IBM, you can upload and deploy your own foundation models. For more information about how to upload, register, and deploy a custom foundation model, see Deploying custom foundation models.
Restriction: The NVIDIA L40S GPU does not support the use of custom foundation models.

Procedure

To deploy foundation models in your cluster:

  1. Decide which models you want to use, and then make a note of their model IDs.
    Tip: Due to the large resource demands of foundation models, you might want to deploy only the models that you plan to use right away.
  2. To add models, run the following command to patch the deployment (a sketch for verifying the result follows the procedure):
    oc patch watsonxaiifm watsonxaiifm-cr \
    --namespace=${PROJECT_CPD_INST_OPERANDS} \
    --type=merge \
    --patch='{"spec":{"install_model_list": ["<model-id1>","<model-id2>"]}}'
    
    Replace the <model-id> variables with the IDs for the models that you want to use. For example:
    oc patch watsonxaiifm watsonxaiifm-cr \
    --namespace=${PROJECT_CPD_INST_OPERANDS} \
    --type=merge \
    --patch='{"spec":{"install_model_list": ["google-flan-t5-xxl","bigcode-starcoder"]}}'
    
  3. If you want to use the flan-ul2-20b foundation model with L40S GPUs, complete the following substeps:
    a. Run the following patch command to adjust the model configuration so that the model is optimized for use with L40S GPUs:
    oc patch -n ${PROJECT_CPD_INST_OPERANDS} watsonxaiifm watsonxaiifm-cr \
    --type merge -p '{"spec": {
      "google_flan_ul2": {
        "cuda_visible_devices": "0,1",
        "deployment_framework": "hf_custom_tp",
        "deployment_yaml_name": "flan-ul2-deployment.yaml.j2",
        "dir_name": "models--google--flan-ul2",
        "dtype_str": "float16",
        "force_apply": "false",
        "hf_modules_cache": "/tmp/huggingface/modules",
        "max_batch_size": "128",
        "max_batch_weight": "34543200",
        "max_concurrent_requests": "150",
        "max_new_tokens": "4096",
        "max_sequence_length": "4096",
        "model_name": "google/flan-ul2",
        "model_root_dir": "/watsonxaiifm-models",
        "num_gpus": "2",
        "num_shards": "2",
        "pt2_compile": "false",
        "pvc_name": "google-flan-ul2-pvc-rev2",
        "pvc_size": "85Gi",
        "svc_name": "flan-ul2"
      },
      "google_flan_ul2_resources": {
        "limits": {"cpu": "3", "ephemeral-storage": "1Gi", "memory": "128Gi", "nvidia.com/gpu": "2"},
        "requests": {"cpu": "2", "ephemeral-storage": "10Mi", "memory": "4Gi", "nvidia.com/gpu": "2"}
      }
    }}'
    b. Add the flan-ul2-20b model to the list of models to install by patching the deployment:
    oc patch watsonxaiifm watsonxaiifm-cr \
    --namespace=${PROJECT_CPD_INST_OPERANDS} \
    --type=merge \
    --patch='{"spec":{"install_model_list": ["google-flan-ul2"]}}'
    
    c. If you later want to remove the flan-ul2-20b model, run the following patch command to remove the flan-ul2 configuration and resource settings from the custom resource:
    oc patch -n ${PROJECT_CPD_INST_OPERANDS} watsonxaiifm watsonxaiifm-cr \
    --type json -p '[{"op": "remove", "path": "/spec/google_flan_ul2"},
    {"op": "remove", "path": "/spec/google_flan_ul2_resources"}]'

You can add more models at any time after the initial service setup.
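
Note that a merge patch replaces the entire install_model_list value rather than appending to it, so include every model ID that you want installed, not just the new ones. The following sketch assumes that google-flan-t5-xxl and bigcode-starcoder are already listed and that you want to add ibm-granite-13b-instruct-v2; it reads the current list and then reapplies the full list with the new ID included:

    oc get watsonxaiifm watsonxaiifm-cr \
    --namespace=${PROJECT_CPD_INST_OPERANDS} \
    -o jsonpath='{.spec.install_model_list}{"\n"}'

    oc patch watsonxaiifm watsonxaiifm-cr \
    --namespace=${PROJECT_CPD_INST_OPERANDS} \
    --type=merge \
    --patch='{"spec":{"install_model_list": ["google-flan-t5-xxl","bigcode-starcoder","ibm-granite-13b-instruct-v2"]}}'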

What to do next

IBM watsonx.ai is ready to use. To get started, see Developing generative AI solutions with foundation models.