Adding foundation models to IBM watsonx.ai

To submit input to a foundation model from IBM watsonx.ai, you must deploy the models that you want to use in your cluster.

Before you begin

The IBM watsonx.ai service must be installed. Make sure you have the necessary resources available to support the models that you want to use. For more information about the overall resources that are required for the service, see Hardware requirements. For details about the available foundation models and the extra resources needed to host them, see Foundation models in IBM watsonx.ai.
Important: Before you install a foundation model, make sure your cluster has at least one node with enough available GPUs to run that model.
If you are upgrading the software and want to continue to use foundation models that are already installed, you can do so even if those models are deprecated or withdrawn in the new release. For more information, see Foundation model lifecycle. When you list the new models that you want to add in the install_model_list field, be sure to include the model IDs for any existing models that you want to continue to use.
To complete this task the first time, you must be the instance administrator who installed the IBM watsonx.ai service.
You can add foundation models that are curated by IBM to a full installation or lightweight engine installation.

Procedure

To add foundation models, complete the following steps:

Decide which models you want to use, and then make a note of their model IDs.
Tip: Due to the large resource demands of foundation models, install only the models that you want to use right away at the time that the service is installed. You can add more models later.
Choose how you want to install the foundation models from the following options:

Install foundation models with default configuration

The simplest installation option, in which models are installed on available resources. Some foundation models are sharded as part of their standard configuration. The number of GPUs specified per model in Foundation models in IBM watsonx.ai indicates the number of shards used.
See Installing models with the default configuration.

Note: Use this method for adding embedding models, reranker models, and models that support document text processing. You cannot shard embedding or reranker models.

Shard the foundation models during installation

You can use sharding to partition large models into smaller units, known as shards, that can be processed across multiple GPU processors in parallel. Sharding foundation models can improve performance and also reduce the amount of memory needed for each GPU.
See Sharding the foundation models.

Install foundation models on preconfigured GPU partitions

Partitioning GPU processors lets you install more than one smaller model on a single GPU to use resources more efficiently.
Important: Not all foundation models can be sharded or installed on partitioned GPU processors. See System requirements for foundation models in IBM watsonx.ai for details.

See Installing models on GPU partitions.

Install foundation models with a custom vLLM image

You can update the section for a specific model in the watsonx.ai™ custom resource to include the vLLM image that supports loading the model onto the memory of a specific GPU. You can use a custom vLLM image to install models on Intel Gaudi 3 AI Accelerator and NVIDIA RTX PRO 6000 GPUs.
See Installing foundation models with a custom vLLM image.

Install embedding models on GPUs

For better performance while vectorizing text, you can configure embedding models to run on GPUs.
See Installing embedding models on GPUs.
Confirm that the spec section of the watsonxaiifm custom resource is updated by running the following command:
```
oc describe watsonxaiifm watsonxaiifm-cr -n ${PROJECT_CPD_INST_OPERANDS}
```
Important: Do not edit the custom resource directly because it is easy to introduce errors.
Wait for the operator to finish reconciling the changes and show the Completed status. You can use the following command to check the status of the service:
```
oc get watsonxaiifm -n ${PROJECT_CPD_INST_OPERANDS}
```
Use the following commands to check whether your changes were applied successfully.
- To check whether the model predictor pod is running:
```
oc get po -n ${PROJECT_CPD_INST_OPERANDS} | grep predictor
```
- To check whether the predictor pod shows the correct number of GPU processors (if you specified a node selector, you can also check the value of the Node-Selectors field):
```
oc describe po <model-predictor-pod> -n ${PROJECT_CPD_INST_OPERANDS}
```
- To check which node a foundation model is using:
```
oc get po -n ${PROJECT_CPD_INST_OPERANDS} -o wide | grep predictor
```
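
The wait for the Completed status can be scripted. The following is a minimal sketch, not part of the product: `wait_for_output` is a hypothetical helper that reruns any command until its output contains a given string. It assumes that the word Completed appears in the output of the status command, as described above.

```shell
#!/bin/sh
# wait_for_output PATTERN ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: reruns COMMAND until its output contains PATTERN.
# Returns 0 on a match, or 1 if no attempt produced a match.
wait_for_output() {
  pattern=$1
  attempts=$2
  delay=$3
  shift 3
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Pause between attempts, but not before the first one.
    if [ "$i" -gt 0 ]; then
      sleep "$delay"
    fi
    if "$@" 2>/dev/null | grep -q -- "$pattern"; then
      return 0
    fi
    i=$((i + 1))
  done
  return 1
}

# Example (requires cluster access): check every 30 seconds, up to 120 times.
# wait_for_output Completed 120 30 \
#   oc get watsonxaiifm -n "${PROJECT_CPD_INST_OPERANDS}"
```

The attempt count and delay in the example are arbitrary; large models can take a long time to load, so adjust them for your cluster.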

Optional: You can update the vLLM image to use a custom image to load the foundation model as follows:

oc patch watsonxaiifm watsonxaiifm-cr \
--namespace=${PROJECT_CPD_INST_OPERANDS} \
--type merge \
--patch '{"spec": {"model_install_parameters":{"<model_id_with_underscore>":{"image": "<image-value>"}}}}'

You can add more models as required at any point after the initial service setup. You can also remove or change the sharding settings later. For more information, see Changing foundation model sharding configuration.

Installing models with the default configuration

Run the following command to modify the custom resource:

oc patch watsonxaiifm watsonxaiifm-cr \
--namespace=${PROJECT_CPD_INST_OPERANDS} \
--type=merge \
--patch='{"spec":{"install_model_list": ["<model-id1>","<model-id2>"]}}'

In the install_model_list array, list the IDs for the models that you want to use. Separate the model IDs with commas as follows:

oc patch watsonxaiifm watsonxaiifm-cr \
--namespace=${PROJECT_CPD_INST_OPERANDS} \
--type=merge \
--patch='{"spec":{"install_model_list": ["llama-3-1-8b-instruct","granite-3-3-8b-instruct"]}}'
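
Because a JSON merge patch replaces an array as a whole, the install_model_list value in the patch must name every model that you want to keep, not only the new ones. The following sketch shows one way to build that patch from a list of model IDs while dropping duplicates. `build_install_list_patch` is a hypothetical helper name, and the sketch assumes that model IDs contain no characters that need JSON escaping.

```shell
#!/bin/sh
# build_install_list_patch MODEL_ID...
# Hypothetical helper: prints a merge patch whose install_model_list
# contains each given model ID once, in the order given.
build_install_list_patch() {
  list=""
  for id in "$@"; do
    case ",$list," in
      *",\"$id\","*) ;;                         # already listed, skip it
      *) list="${list:+$list,}\"$id\"" ;;       # append as a JSON string
    esac
  done
  printf '{"spec":{"install_model_list":[%s]}}\n' "$list"
}

# Example (requires cluster access): existing models first, then new ones.
# oc patch watsonxaiifm watsonxaiifm-cr \
#   --namespace="${PROJECT_CPD_INST_OPERANDS}" \
#   --type=merge \
#   --patch="$(build_install_list_patch llama-3-1-8b-instruct granite-3-3-8b-instruct)"
```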

Document text processing models only:

To use the text classification and extraction APIs, you must install a set of machine learning models that perform natural language processing on your documents. To extract key-value pair data from your documents, you must also install the mistral-small-3-1-24b-instruct-2503 model. Run the following command to set up text processing in your deployment:
```
oc patch watsonxaiifm watsonxaiifm-cr \
--namespace=${PROJECT_CPD_INST_OPERANDS} \
--type=merge \
--patch='{"spec":{"install_model_list": ["wdu", "mistral-small-3-1-24b-instruct-2503"]}}'
```

The text processing models are installed on a separate pod on the cluster. Choose from the following pod configuration options based on your usage:

| Text processing pod capacity | `wdu_ai_deploy_distributed_replicas` | `wdu_api_deploy_distributed_replicas` | `wdu_page_deploy_distributed_replicas` | `wdu_result_deploy_distributed_replicas` | `wdu_watch_deploy_distributed_replicas` |
| --- | --- | --- | --- | --- | --- |
| Small | 1 | 1 | 1 | 1 | 1 |
| Medium | 2 | 2 | 4 | 2 | 2 |
| Large | 4 | 4 | 8 | 2 | 2 |

Note: By default, text classification and extraction run on pods configured with Small capacity. For better performance, set the text processing pod capacity to Medium or Large.

Run the following command to set the text processing pod capacity to Medium:

oc patch watsonxaiifm watsonxaiifm-cr \
--namespace=${PROJECT_CPD_INST_OPERANDS} \
--type=merge \
-p '{"spec": {"wdu_api_deploy_distributed_replicas": 2, "wdu_page_deploy_distributed_replicas": 4, "wdu_result_deploy_distributed_replicas": 2, "wdu_watch_deploy_distributed_replicas": 2, "wdu_ai_deploy_distributed_replicas": 2 }}'
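
The same patch can be generated for any of the three capacity options. The following sketch maps a capacity name to the replica counts listed in the table above. `capacity_patch` is a hypothetical helper name, and the lowercase names small, medium, and large are an assumption of this sketch, not values that the custom resource itself accepts.

```shell
#!/bin/sh
# capacity_patch SIZE
# Hypothetical helper: prints the merge patch for a text processing pod
# capacity of small, medium, or large, using the replica counts from the
# capacity table. Returns 1 for any other value.
capacity_patch() {
  case "$1" in
    small)  ai=1 api=1 page=1 result=1 watch=1 ;;
    medium) ai=2 api=2 page=4 result=2 watch=2 ;;
    large)  ai=4 api=4 page=8 result=2 watch=2 ;;
    *)
      echo "unknown capacity: $1 (expected small, medium, or large)" >&2
      return 1
      ;;
  esac
  printf '{"spec":{"wdu_ai_deploy_distributed_replicas":%s,"wdu_api_deploy_distributed_replicas":%s,"wdu_page_deploy_distributed_replicas":%s,"wdu_result_deploy_distributed_replicas":%s,"wdu_watch_deploy_distributed_replicas":%s}}\n' \
    "$ai" "$api" "$page" "$result" "$watch"
}

# Example (requires cluster access):
# oc patch watsonxaiifm watsonxaiifm-cr \
#   --namespace="${PROJECT_CPD_INST_OPERANDS}" \
#   --type=merge \
#   -p "$(capacity_patch large)"
```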

Optional: You can optimize the memory used by the document understanding library by setting various environment variables in the custom resource. For details, see Managing resources for document text processing models.

Sharding the foundation models

You can choose to shard a model across multiple GPUs on any available node in the cluster or choose the specific cluster node you want to use for the model.

Verify that the foundation model you want to shard supports sharding by confirming the foundation model can be installed on more than one GPU. See System requirements for foundation models in IBM watsonx.ai.
Do one of the following things:
Shard the foundation models on any available cluster node
```
oc patch watsonxaiifm watsonxaiifm-cr \
--namespace=${PROJECT_CPD_INST_OPERANDS} \
--type merge \
--patch '{"spec": {"install_model_list":["<model-id>"], "model_install_parameters": {"<model_id_with_underscore>":{"shard": <shard-value>}}}}'
```
Make the following edits:
- Specify the models that you want to use in a comma-separated list in install_model_list. For example, ["llama-3-1-8b-instruct","granite-3-3-8b-instruct"].
- Specify each model that you want to shard by specifying the model ID, but use underscores instead of hyphens in the model ID in model_install_parameters. For example, for the llama-3-1-8b-instruct model, specify llama_3_1_8b_instruct.
- Assign a shard value to each model in model_install_parameters. The shard value specifies the number of units in which to split the model. Accepted shard values are 2, 4, or 8 only. If you specify a value other than one of these accepted values, the default shard value (number of GPUs) for the model is used. No message is shown to inform you that your configuration change is not applied.
For example, the following patch adds the llama-3-1-8b-instruct and granite-3-3-8b-instruct foundation models and shards the llama-3-1-8b-instruct model:
```
oc patch watsonxaiifm watsonxaiifm-cr \
--namespace=${PROJECT_CPD_INST_OPERANDS} \
--type merge \
--patch '{"spec": {"install_model_list":["llama-3-1-8b-instruct","granite-3-3-8b-instruct"], "model_install_parameters": {"llama_3_1_8b_instruct":{"shard": 2}}}}'
```
Important: Do not edit the custom resource directly because it is easy to introduce errors.
Shard the foundation models on a specific cluster node
You can specify the node where you want the shards to run by specifying the hostname of the node in the nodeSelector object.
1. Get a list of the nodes:
```
oc get nodes
```
2. Check the label for the node that you want to use:
```
oc describe no <node-name> | grep kubernetes.io/hostname
```
3. Verify that the node has a GPU by using the following command. Foundation models can be sharded on GPU nodes only.
  oc describe node <node-hostname> | grep nvidia.com/gpu.product
  A GPU node returns the nvidia.com/gpu.product label with the GPU model as its value, for example nvidia.com/gpu.product=NVIDIA-A100-SXM4-80GB.
4. Run the following command to patch the custom resource:
```
oc patch watsonxaiifm watsonxaiifm-cr \
--namespace=${PROJECT_CPD_INST_OPERANDS} \
--type merge \
--patch '{"spec": {"install_model_list":["<model-id>"], "model_install_parameters": {"<model_id_with_underscore>":{"shard": <shard-value>, "nodeSelector":{"kubernetes.io/hostname": "<hostname-value>"}}}}}'
```
  Make the following edits:
  - Specify the models that you want to use in a comma-separated list in install_model_list. For example, ["llama-3-1-8b-instruct","granite-3-3-8b-instruct"].
  - Specify each model that you want to shard by specifying the model ID, but use underscores instead of hyphens in the model ID in model_install_parameters. For example, for the llama-3-1-8b-instruct model, specify llama_3_1_8b_instruct.
  - Assign a shard value to each model in model_install_parameters. The shard value specifies the number of units in which to split the model. Accepted shard values are 2, 4, or 8 only. If you specify a value other than one of these accepted values, the default shard value (number of GPUs) for the model is used. No message is shown to inform you that your configuration change is not applied.
  - For each shard that you want to host on a specific node, specify the node hostname value in the nodeSelector object. For example, "nodeSelector":{"kubernetes.io/hostname":"worker0.example.com"}
For example:
```
oc patch watsonxaiifm watsonxaiifm-cr \
--namespace=${PROJECT_CPD_INST_OPERANDS} \
--type merge \
--patch '{"spec": {"install_model_list":["llama-3-1-8b-instruct","granite-3-3-8b-instruct"], "model_install_parameters": {"llama_3_1_8b_instruct":{"shard": 2, "nodeSelector":{"kubernetes.io/hostname": "worker0.example.com"}}}}}'
```

You can modify the sharding configuration after a model is installed as needed. For details, see Changing foundation model sharding configuration.
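
Because an unsupported shard value is ignored without any message, it can help to validate the value before you patch the custom resource. The following sketch does that and also derives the underscore form of the model ID. The helper names `to_param_key` and `build_shard_patch` are hypothetical. The generated install_model_list names only the one model, which is an assumption of this sketch; extend the list if other models must stay installed.

```shell
#!/bin/sh
# to_param_key MODEL_ID
# Hypothetical helper: prints the model ID with hyphens replaced by
# underscores, the form that model_install_parameters expects.
to_param_key() {
  printf '%s\n' "$1" | tr '-' '_'
}

# build_shard_patch MODEL_ID SHARD_VALUE
# Hypothetical helper: prints a merge patch that installs MODEL_ID with the
# given shard value. Returns 1 unless SHARD_VALUE is 2, 4, or 8.
build_shard_patch() {
  case "$2" in
    2|4|8) ;;
    *)
      echo "shard value must be 2, 4, or 8 (got: $2)" >&2
      return 1
      ;;
  esac
  printf '{"spec":{"install_model_list":["%s"],"model_install_parameters":{"%s":{"shard":%s}}}}\n' \
    "$1" "$(to_param_key "$1")" "$2"
}

# Example (requires cluster access):
# patch=$(build_shard_patch llama-3-1-8b-instruct 2) &&
#   oc patch watsonxaiifm watsonxaiifm-cr \
#     --namespace="${PROJECT_CPD_INST_OPERANDS}" \
#     --type merge \
#     --patch "$patch"
```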

Installing models on GPU partitions

You can specify the NVIDIA Multi-Instance GPU (MIG) node where you want the foundation models to run by specifying the hostname of the node where MIG is configured in the nodeSelector object.

Complete the following steps:

Verify that the foundation model you want to install supports being installed on an NVIDIA Multi-Instance GPU. See System requirements for foundation models in IBM watsonx.ai.
You can set the nodeSelector object in the following ways to select a specific cluster node to host your models:
Use the MIG profile label
1. Get a list of the nodes:
```
oc get nodes
```
2. Check the MIG profile label for the NVIDIA Multi-Instance GPU node that you want to use with the following command:
  oc describe node <node-name> | grep nvidia.com/mig.config
  For example, the following command returns the label nvidia.com/mig.config=all-3g.40gb:
  oc describe node worker10.wxai.example.com | grep nvidia.com/mig.config=
3. Run the following command to patch the custom resource:
```
oc patch watsonxaiifm watsonxaiifm-cr \
--namespace=${PROJECT_CPD_INST_OPERANDS} \
--type merge \
--patch '{"spec": {"install_model_list":["<model-id>"], "model_install_parameters": {"<model_id_with_underscore>":{"nodeSelector":{"nvidia.com/mig.config": "<mig-label>"}}}}}'
```
  Make the following edits:
  - Specify the models that you want to use in a comma-separated list in install_model_list. For example, ["llama-3-1-8b-instruct","granite-3-3-8b-instruct"].
  - Specify the node label in the nodeSelector object. For example, "nodeSelector":{"nvidia.com/mig.config":"all-3g.40gb"}
Use the GPU type label
1. Get a list of product labels for the different types of NVIDIA Multi-Instance GPU nodes in your cluster:
```
oc get nodes -L nvidia.com/gpu.product
```
2. Run the following command to patch the custom resource:
```
oc patch watsonxaiifm watsonxaiifm-cr \
--namespace=${PROJECT_CPD_INST_OPERANDS} \
--type merge \
--patch '{"spec": {"install_model_list":["<model-id>"], "model_install_parameters": {"<model_id_with_underscore>":{"nodeSelector":{"nvidia.com/gpu.product": "<product-label>"}}}}}'
```
  Make the following edits:
  - Specify the models that you want to use in a comma-separated list in install_model_list. For example, ["llama-3-1-8b-instruct","granite-3-3-8b-instruct"].
  - Specify the node label in the nodeSelector object. For example, "nodeSelector":{"nvidia.com/gpu.product":"NVIDIA-A100-SXM4-80GB"}
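
Both options follow the same pattern: read a label from the node, then place it in the nodeSelector object. The following sketch shows that pattern under stated assumptions. `label_value` and `build_node_selector_patch` are hypothetical helper names, the label is assumed to appear as a key=value pair in the `oc describe node` output as in the example above, and the generated install_model_list names only one model.

```shell
#!/bin/sh
# label_value KEY
# Hypothetical helper: reads text on standard input and prints the value of
# the first KEY=value pair that it finds. KEY is used as a regular
# expression, so a dot in the key matches any character.
label_value() {
  grep -o -- "$1=[^[:space:]]*" | head -n 1 | cut -d= -f2-
}

# build_node_selector_patch MODEL_ID LABEL_KEY LABEL_VALUE
# Hypothetical helper: prints a merge patch that installs MODEL_ID on nodes
# that carry the given label.
build_node_selector_patch() {
  key=$(printf '%s' "$1" | tr '-' '_')
  printf '{"spec":{"install_model_list":["%s"],"model_install_parameters":{"%s":{"nodeSelector":{"%s":"%s"}}}}}\n' \
    "$1" "$key" "$2" "$3"
}

# Example (requires cluster access):
# mig=$(oc describe node worker10.wxai.example.com | label_value nvidia.com/mig.config)
# oc patch watsonxaiifm watsonxaiifm-cr \
#   --namespace="${PROJECT_CPD_INST_OPERANDS}" \
#   --type merge \
#   --patch "$(build_node_selector_patch llama-3-1-8b-instruct nvidia.com/mig.config "$mig")"
```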

codestral-2501 model only: To configure the codestral-2501 model to run on a MIG node, you must set the context window length to 28,672 tokens.

Set the SERVICE_VERSION environment variable to the watsonx.ai service operand version.
```
export SERVICE_VERSION=<watsonxai-service-operand-version>
```

Use the following command to patch the custom resource:

cat <<EOF | oc apply -f -
apiVersion: watsonxaiifm.cpd.ibm.com/v1beta1
kind: Watsonxaiifm
metadata:
  name: watsonxaiifm-cr
  namespace: ${PROJECT_CPD_INST_OPERANDS}
spec:
  blockStorageClass: ocs-storagecluster-ceph-rbd
  install_model_list:
    - codestral-2501
  fileStorageClass: ocs-storagecluster-cephfs
  license:
    accept: true
  version: ${SERVICE_VERSION}
  model_install_parameters:
    codestral_2501:
      env:
        - name: MODEL_NAME
          value: /mnt/models/codestral-2501
        - name: SERVED_MODEL_NAME
          value: mistralai/codestral-2501
        - name: MAX_NUM_SEQS
          value: "8"
        - name: MAX_NEW_TOKENS
          value: "8192"
        - name: MAX_SEQUENCE_LENGTH
          value: "28672"
        - name: DISABLE_PROMPT_LOGPROBS
          value: "true"
        - name: TOKENIZER_MODE
          value: mistral
        - name: LOAD_FORMAT
          value: mistral
        - name: CONFIG_FORMAT
          value: mistral
        - name: ENABLE_PREFIX_CACHING
          value: "1"
        - name: NUM_GPUS
          value: "1"
        - name: CUDA_VISIBLE_DEVICES
          value: "0"
        - name: HUGGINGFACE_HUB_CACHE
          value: /mnt/models/
        - name: HF_MODULES_CACHE
          value: /tmp/huggingface/modules
        - name: PORT
          value: "3000"
        - name: MAX_LOG_LEN
          value: "100"
        - name: GPRC_PORT
          value: "8033"
        - name: ENABLE_VLLM_LOG_REQUESTS
          value: "TRUE"
EOF

Wait for the patch to be applied. You can check the status by using the following command:
```
oc get watsonxaiifm watsonxaiifm-cr -n ${PROJECT_CPD_INST_OPERANDS}
```

llama-2-13b-chat model only: Set the following environment variables for the llama-2-13b-chat model:

ESTIMATE_MEMORY_BATCH_SIZE=4
ESTIMATE_MEMORY=off

Use the following script to make changes to the model configuration. The script adds each variable to the InferenceService for the model only if the variable is not already set to the required value:

#!/bin/bash

# Name of the InferenceService for the llama-2-13b-chat model.
model_id=meta-llama-llama-2-13b-chat

# Read the environment variables that are currently set on the predictor (a JSON array).
current_envs=$(oc get isvc "$model_id" -n "$PROJECT_CPD_INST_OPERANDS" -o jsonpath='{.spec.predictor.model.env}')

# Each check prints "true" if the variable already has the required value.
memory_batch_check=$(echo "$current_envs" | jq 'map(select(.name == "ESTIMATE_MEMORY_BATCH_SIZE" and .value == "4")) | length > 0')
estimate_memory_check=$(echo "$current_envs" | jq 'map(select(.name == "ESTIMATE_MEMORY" and .value == "off")) | length > 0')

# Append ESTIMATE_MEMORY_BATCH_SIZE=4 if it is not set.
if [ "$memory_batch_check" = "false" ]; then
  oc patch InferenceService "$model_id" -n "$PROJECT_CPD_INST_OPERANDS" \
    --type=json \
    -p='[{"op": "add", "path": "/spec/predictor/model/env/-", "value": {"name":"ESTIMATE_MEMORY_BATCH_SIZE","value":"4"}}]'
fi

# Append ESTIMATE_MEMORY=off if it is not set.
if [ "$estimate_memory_check" = "false" ]; then
  oc patch InferenceService "$model_id" -n "$PROJECT_CPD_INST_OPERANDS" \
    --type=json \
    -p='[{"op": "add", "path": "/spec/predictor/model/env/-", "value": {"name":"ESTIMATE_MEMORY","value":"off"}}]'
fi

Installing foundation models with a custom vLLM image

If you want to install a foundation model on specific GPUs, or if the foundation model is not compatible with the default watsonx.ai vLLM image, you can update the section for the model in the watsonx.ai custom resource to use a custom vLLM image as follows:

oc patch watsonxaiifm watsonxaiifm-cr \
--namespace=${PROJECT_CPD_INST_OPERANDS} \
--type merge \
--patch '{"spec": {"model_install_parameters":{"<model_id_with_underscore>":{"image": "<vllm-image-location>"}}}}'

You can patch the custom resource with a custom image to load models onto the following GPU types:

Intel Gaudi 3 AI Accelerator

On an Intel Gaudi 3 AI Accelerator cluster, run the command for the model that you want to install to include the vLLM image that supports loading the model onto the GPU memory:

Attention: If you want to install the Intel Gaudi 3 AI Accelerator operator on Red Hat® OpenShift®, see Intel Gaudi Base Operator - OpenShift Installation.

codestral-22b

oc patch watsonxaiifm watsonxaiifm-cr --type merge -p '{"spec":{"model_install_parameters": {"codestral_22b": {"image": "quay.io/modh/vllm:rhoai-2.21-gaudi", "accelerator_vendor": "gaudi"}}}}'

granite-3-2-8b-instruct

oc patch watsonxaiifm watsonxaiifm-cr --type merge -p '{"spec":{"model_install_parameters": {"granite_3_2_8b_instruct": {"image": "quay.io/modh/vllm:rhoai-2.21-gaudi", "accelerator_vendor": "gaudi"}}}}'

granite-34b-code-instruct

oc patch watsonxaiifm watsonxaiifm-cr --type merge -p '{"spec":{"model_install_parameters": {"granite_34b_code_instruct": {"image": "quay.io/modh/vllm:rhoai-2.21-gaudi", "accelerator_vendor": "gaudi"}}}}'

granite-20b-code-instruct

oc patch watsonxaiifm watsonxaiifm-cr --type merge -p '{"spec":{"model_install_parameters": {"granite_20b_code_instruct": {"image": "quay.io/modh/vllm:rhoai-2.21-gaudi", "accelerator_vendor": "gaudi"}}}}'

granite-8b-code-instruct

oc patch watsonxaiifm watsonxaiifm-cr --type merge -p '{"spec":{"model_install_parameters": {"granite_8b_code_instruct": {"image": "quay.io/modh/vllm:rhoai-2.21-gaudi", "accelerator_vendor": "gaudi"}}}}'

granite-3b-code-instruct

oc patch watsonxaiifm watsonxaiifm-cr --type merge -p '{"spec":{"model_install_parameters": {"granite_3b_code_instruct": {"image": "quay.io/modh/vllm:rhoai-2.21-gaudi", "accelerator_vendor": "gaudi"}}}}'

granite-3-8b-instruct

oc patch watsonxaiifm watsonxaiifm-cr --type merge -p '{"spec":{"model_install_parameters": {"granite_3_8b_instruct": {"image": "quay.io/modh/vllm:rhoai-2.21-gaudi", "accelerator_vendor": "gaudi"}}}}'

granite-guardian-3-2b

oc patch watsonxaiifm watsonxaiifm-cr --type merge -p '{"spec":{"model_install_parameters": {"granite_guardian_3_2b": {"image": "quay.io/modh/vllm:rhoai-2.21-gaudi", "accelerator_vendor": "gaudi"}}}}'

llama-3-2-3b-instruct

oc patch watsonxaiifm watsonxaiifm-cr --type merge -p '{"spec":{"model_install_parameters": {"llama_3_2_3b_instruct": {"image": "quay.io/modh/vllm:rhoai-2.21-gaudi", "accelerator_vendor": "gaudi"}}}}'

llama-3-2-1b-instruct

oc patch watsonxaiifm watsonxaiifm-cr --type merge -p '{"spec":{"model_install_parameters": {"llama_3_2_1b_instruct": {"image": "quay.io/modh/vllm:rhoai-2.21-gaudi", "accelerator_vendor": "gaudi"}}}}'

llama-3-1-8b-instruct

oc patch watsonxaiifm watsonxaiifm-cr --type merge -p '{"spec":{"model_install_parameters": {"llama_3_1_8b_instruct": {"image": "quay.io/modh/vllm:rhoai-2.21-gaudi", "accelerator_vendor": "gaudi"}}}}'
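
The Intel Gaudi commands above differ only in the model key, so the patch can be generated from the model ID. The following is a sketch that reuses the image and accelerator_vendor values shown above. `build_gaudi_patch` and the `GAUDI_IMAGE` variable are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# vLLM image for Intel Gaudi, as used in the commands in this section.
GAUDI_IMAGE="quay.io/modh/vllm:rhoai-2.21-gaudi"

# build_gaudi_patch MODEL_ID
# Hypothetical helper: prints the merge patch that sets the Gaudi vLLM image
# and accelerator vendor for one model. The model key is the model ID with
# hyphens replaced by underscores.
build_gaudi_patch() {
  key=$(printf '%s' "$1" | tr '-' '_')
  printf '{"spec":{"model_install_parameters":{"%s":{"image":"%s","accelerator_vendor":"gaudi"}}}}\n' \
    "$key" "$GAUDI_IMAGE"
}

# Example (requires cluster access): patch several models in one loop.
# for model in granite-3-8b-instruct llama-3-1-8b-instruct; do
#   oc patch watsonxaiifm watsonxaiifm-cr --type merge \
#     -p "$(build_gaudi_patch "$model")"
# done
```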

NVIDIA RTX PRO 6000

Run the following command on an NVIDIA RTX PRO 6000 cluster to include the vLLM image that supports loading the model onto the GPU memory:

oc patch watsonxaiifm watsonxaiifm-cr \
--namespace=${PROJECT_CPD_INST_OPERANDS} \
--type merge \
--patch '{"spec": {"model_install_parameters":{"<model_id_with_underscore>":{"image": "registry.redhat.io/rhoai/odh-vllm-cuda-rhel9@sha256:1e8b4f9fdc32213a45824c441171218fd4814ff5ea718b31fc0f74d9322f1a3f"}}}}'

A subset of models can be installed on NVIDIA RTX PRO 6000 GPUs. For details, see System requirements for foundation models in IBM watsonx.ai.

Installing embedding models on GPUs

You can choose to run an embedding model on GPU nodes in the following ways:

Override the model resources and GPU requests

Run the following command to modify the custom resource:

oc patch watsonxaiifm watsonxaiifm-cr \
--namespace=${PROJECT_CPD_INST_OPERANDS} \
--type=merge \
--patch='{"spec":{"<model_id_with_underscore>_replicas": 1, "<model_id_with_underscore>_resources": {"limits": {"cpu": "2", "memory": "4Gi", "nvidia.com/gpu": 1}, "requests": {"cpu": "1", "memory": "4Gi", "nvidia.com/gpu": 1}}}}'

Make the following edits:

Specify each embedding model for which you want to override resources by specifying the model ID, but use underscores instead of hyphens in the model ID. For example, for the ibm-slate-30m-english-rtrvr model ID, specify ibm_slate_30m_english_rtrvr.

Specify MIG nodes with the nodeSelector attribute

Run the following command to modify the custom resource:

oc patch watsonxaiifm watsonxaiifm-cr \
--namespace=${PROJECT_CPD_INST_OPERANDS} \
--type=merge \
--patch='{"spec": {"model_install_parameters": {"<model_id_with_underscore>":{"nodeSelector": <node-label>}}}}'

Make the following edits:

Specify each embedding model that you want to configure to run on GPUs by specifying the model ID, but use underscores instead of hyphens in the model ID in model_install_parameters. For example, for the ibm-slate-30m-english-rtrvr model ID, specify ibm_slate_30m_english_rtrvr.
Specify the node label in the nodeSelector object. For example, "nodeSelector":{"kubernetes.io/hostname":"gpu-node-1"}
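
For the resource override option, the model ID appears twice in the patch, once in the replicas key and once in the resources key. The following sketch builds that patch from a model ID, with the CPU, memory, and GPU values copied from the command above. `build_embedding_gpu_patch` is a hypothetical helper name.

```shell
#!/bin/sh
# build_embedding_gpu_patch MODEL_ID
# Hypothetical helper: prints a merge patch that sets one replica and one
# GPU for an embedding model. The resource values are the ones used in the
# override command in this section.
build_embedding_gpu_patch() {
  key=$(printf '%s' "$1" | tr '-' '_')
  printf '{"spec":{"%s_replicas":1,"%s_resources":{"limits":{"cpu":"2","memory":"4Gi","nvidia.com/gpu":1},"requests":{"cpu":"1","memory":"4Gi","nvidia.com/gpu":1}}}}\n' \
    "$key" "$key"
}

# Example (requires cluster access):
# oc patch watsonxaiifm watsonxaiifm-cr \
#   --namespace="${PROJECT_CPD_INST_OPERANDS}" \
#   --type=merge \
#   --patch="$(build_embedding_gpu_patch ibm-slate-30m-english-rtrvr)"
```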

What to do next

IBM watsonx.ai is ready to use. To start working with the various foundation models and tools, see Getting started with IBM watsonx.ai.