Adding foundation models to IBM watsonx.ai
To submit input to a foundation model from IBM watsonx.ai, you must deploy the models that you want to use in your cluster.
Before you begin
- The IBM watsonx.ai service must be installed. Make sure you have the necessary resources available to support the models that you want to use. For more information about the overall resources that are required for the service, see Hardware requirements. For details about the available foundation models and the extra resources needed to host them, see Foundation models in IBM watsonx.ai.
  Important: Before you install a foundation model, make sure your cluster has at least one node with enough available GPUs to run that model.
- If you are upgrading the software and want to continue to use foundation models that are already installed, you can do so even when the installed models are deprecated or withdrawn in the new release. For more information, see Foundation model lifecycle. When you list the new models that you want to add in the install_model_list field, be sure to include the model IDs for any existing models that you want to continue to use.
- To complete this task the first time, you must be the instance administrator who installed the IBM watsonx.ai service.
- You can add foundation models that are curated by IBM to a full installation or lightweight engine installation.
Procedure
- Decide which models you want to use, and then make a note of their model IDs.
  Tip: Because foundation models have large resource demands, install only the models that you want to use right away when you install the service. You can add more models later.
- Choose how you want to install the foundation models from the following options:
  - Install foundation models with default configuration
    The simplest installation option, where models are installed on available resources. Some foundation models are sharded as part of their standard configuration. The number of GPUs specified per model in Foundation models in IBM watsonx.ai indicates the number of shards used. See Installing models with the default configuration.
    Note: Use this method for adding embedding models, reranker models, and models that support document text processing. You cannot shard embedding or reranker models.
  - Shard the foundation models during installation
    You can use sharding to partition large models into smaller units, known as shards, that can be processed across multiple GPU processors in parallel. Sharding foundation models can improve performance and also reduce the amount of memory needed for each GPU.
  - Install foundation models on preconfigured GPU partitions
    Partitioning GPU processors lets you install more than one smaller model on a single GPU to use resources more efficiently.
    Important: Not all foundation models can be sharded or installed on partitioned GPU processors. See System requirements for foundation models in IBM watsonx.ai for details.
  - Install foundation models with a custom vLLM image
    You can update the section for a specific model in the watsonx.ai™ custom resource to include the vLLM image that supports loading the model onto the memory of a specific GPU. You can use a custom vLLM image to install models on Intel Gaudi 3 AI Accelerator and NVIDIA RTX PRO 6000 GPUs.
  - Install embedding models on GPUs
    For better performance while vectorizing text, you can configure embedding models to run on GPUs.
- Confirm that the spec section of the watsonxaiifm custom resource is updated by running the following command:
  oc describe watsonxaiifm watsonxaiifm-cr -n ${PROJECT_CPD_INST_OPERANDS}
  Important: Do not edit the custom resource directly because it is easy to introduce errors.
- Wait for the operator to finish reconciling the changes and show the Completed status. You can use the following command to check the status of the service:
  oc get watsonxaiifm -n ${PROJECT_CPD_INST_OPERANDS}
- Use the following commands to check whether your changes were applied successfully.
  - To check whether the model predictor pod is running:
    oc get po -n ${PROJECT_CPD_INST_OPERANDS} | grep predictor
  - To check whether the predictor pod shows the correct number of GPU processors (if you specified a node selector, you can also check the value of the Node-Selectors field):
    oc describe po <model-predictor-pod> -n ${PROJECT_CPD_INST_OPERANDS}
  - To check which node a foundation model is using:
    oc get po -n ${PROJECT_CPD_INST_OPERANDS} -o wide | grep predictor
- Optional: Update the vLLM image to use a custom image to load the foundation model as follows:
  oc patch watsonxaiifm watsonxaiifm-cr \
    --namespace=${PROJECT_CPD_INST_OPERANDS} \
    --type merge \
    --patch '{"spec": {"model_install_parameters":{"<model_id_with_underscore>":{"image": "<image-value>"}}}}'
You can add additional models as required at any point after the initial service setup. You can also remove or change the sharding settings later. For more information, see Changing foundation model sharding configuration.
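When you add models later, keep in mind that a patch of type merge replaces an array such as install_model_list as a whole instead of appending to it, so the patch should also list the models that are already installed and that you want to keep. The following is a minimal sketch rather than an IBM-provided tool: the function name build_install_list_patch is hypothetical, and the function only prints the patch body so that you can review it before you pass it to oc patch.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the product): print a merge-patch body
# whose install_model_list contains every model ID passed as an argument,
# in order, without duplicates.
build_install_list_patch() {
  local seen=" " list="" id
  for id in "$@"; do
    case "$seen" in
      *" $id "*) continue ;;          # skip IDs that were already added
    esac
    seen+="$id "
    list+="${list:+,}\"$id\""         # comma-separate every entry after the first
  done
  printf '{"spec":{"install_model_list":[%s]}}\n' "$list"
}

# Example: one model is already installed, one model is being added.
build_install_list_patch llama-3-1-8b-instruct granite-3-3-8b-instruct llama-3-1-8b-instruct

# The output could then be passed to the documented command, for example:
#   oc patch watsonxaiifm watsonxaiifm-cr \
#     --namespace=${PROJECT_CPD_INST_OPERANDS} \
#     --type=merge \
#     --patch="$(build_install_list_patch <existing-model-id> <new-model-id>)"
```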
Installing models with the default configuration
- Run the following command to modify the custom resource:
  oc patch watsonxaiifm watsonxaiifm-cr \
    --namespace=${PROJECT_CPD_INST_OPERANDS} \
    --type=merge \
    --patch='{"spec":{"install_model_list": ["<model-id1>","<model-id2>"]}}'
  In the install_model_list array, list the IDs for the models that you want to use. Separate the model IDs with commas as follows:
  oc patch watsonxaiifm watsonxaiifm-cr \
    --namespace=${PROJECT_CPD_INST_OPERANDS} \
    --type=merge \
    --patch='{"spec":{"install_model_list": ["llama-3-1-8b-instruct","granite-3-3-8b-instruct"]}}'
- Document text processing models only:
  - To use the text classification and extraction APIs, you must install a set of machine learning models that perform natural language processing on your documents. To extract key-value pair data from your documents, you must also install the mistral-small-3-1-24b-instruct-2503 model. Run the following command to set up text processing in your deployment:
    oc patch watsonxaiifm watsonxaiifm-cr \
      --namespace=${PROJECT_CPD_INST_OPERANDS} \
      --type=merge \
      --patch='{"spec":{"install_model_list": ["wdu", "mistral-small-3-1-24b-instruct-2503"]}}'
  - The text processing models are installed on a separate pod on the cluster. Choose from the following pod configuration options based on your usage:
    Text processing pod capacity and custom resource settings
    - Small
      wdu_ai_deploy_distributed_replicas: 1
      wdu_api_deploy_distributed_replicas: 1
      wdu_page_deploy_distributed_replicas: 1
      wdu_result_deploy_distributed_replicas: 1
      wdu_watch_deploy_distributed_replicas: 1
    - Medium
      wdu_ai_deploy_distributed_replicas: 2
      wdu_api_deploy_distributed_replicas: 2
      wdu_page_deploy_distributed_replicas: 4
      wdu_result_deploy_distributed_replicas: 2
      wdu_watch_deploy_distributed_replicas: 2
    - Large
      wdu_ai_deploy_distributed_replicas: 4
      wdu_api_deploy_distributed_replicas: 4
      wdu_page_deploy_distributed_replicas: 8
      wdu_result_deploy_distributed_replicas: 2
      wdu_watch_deploy_distributed_replicas: 2
    Note: By default, text classification and extraction runs on pods configured with Small capacity. For better performance, set the text processing pod capacity to Medium or Large.
    Run the following command to set the text processing pod capacity to Medium:
    oc patch watsonxaiifm watsonxaiifm-cr \
      --namespace=${PROJECT_CPD_INST_OPERANDS} \
      --type=merge \
      -p '{"spec": {"wdu_api_deploy_distributed_replicas": 2, "wdu_page_deploy_distributed_replicas": 4, "wdu_result_deploy_distributed_replicas": 2, "wdu_watch_deploy_distributed_replicas": 2, "wdu_ai_deploy_distributed_replicas": 2 }}'
  - Optional: You can optimize the memory used by the document understanding library by setting various environment variables in the custom resource. For details, see Managing resources for document text processing models.
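The mapping from a pod capacity name to the five replica settings can be sketched as a small shell function. This is an illustration only: the function name wdu_capacity_patch is hypothetical, the replica counts are copied from the capacity options listed above, and the function prints the patch body without contacting the cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the product): print the merge-patch body
# for a text processing pod capacity of Small, Medium, or Large.
wdu_capacity_patch() {
  local ai api page result watch
  case "$1" in
    Small)  ai=1 api=1 page=1 result=1 watch=1 ;;
    Medium) ai=2 api=2 page=4 result=2 watch=2 ;;
    Large)  ai=4 api=4 page=8 result=2 watch=2 ;;
    *) echo "unknown capacity: $1 (expected Small, Medium, or Large)" >&2
       return 1 ;;
  esac
  printf '{"spec":{"wdu_ai_deploy_distributed_replicas":%d,"wdu_api_deploy_distributed_replicas":%d,"wdu_page_deploy_distributed_replicas":%d,"wdu_result_deploy_distributed_replicas":%d,"wdu_watch_deploy_distributed_replicas":%d}}\n' \
    "$ai" "$api" "$page" "$result" "$watch"
}

# Print the patch body for the Large capacity.
wdu_capacity_patch Large

# The output could then be passed to the documented command, for example:
#   oc patch watsonxaiifm watsonxaiifm-cr \
#     --namespace=${PROJECT_CPD_INST_OPERANDS} \
#     --type=merge \
#     -p "$(wdu_capacity_patch Large)"
```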
Sharding the foundation models
- Verify that the foundation model you want to shard supports sharding by confirming the foundation model can be installed on more than one GPU. See System requirements for foundation models in IBM watsonx.ai.
- Do one of the following things:
  - Shard the foundation models on any available cluster node
    oc patch watsonxaiifm watsonxaiifm-cr \
      --namespace=${PROJECT_CPD_INST_OPERANDS} \
      --type merge \
      --patch '{"spec": {"install_model_list":["<model-id>"], "model_install_parameters": {"<model_id_with_underscore>":{"shard": <shard-value>}}}}'
    Make the following edits:
    - Specify the models that you want to use in a comma-separated list in install_model_list. For example, ["llama-3-1-8b-instruct","granite-3-3-8b-instruct"].
    - Specify each model that you want to shard by specifying the model ID, but use underscores instead of hyphens in the model ID in model_install_parameters. For example, for the llama-3-1-8b-instruct model, specify llama_3_1_8b_instruct.
    - Assign a shard value to each model in model_install_parameters. The shard value specifies the number of units in which to split the model. Accepted shard values are 2, 4, or 8 only. If you specify a value other than one of these accepted values, the default shard value (number of GPUs) for the model is used. No message is shown to inform you that your configuration change is not applied.
    For example, the following patch adds the llama-3-1-8b-instruct and granite-3-3-8b-instruct foundation models and shards the llama-3-1-8b-instruct model:
    oc patch watsonxaiifm watsonxaiifm-cr \
      --namespace=${PROJECT_CPD_INST_OPERANDS} \
      --type merge \
      --patch '{"spec": {"install_model_list":["llama-3-1-8b-instruct","granite-3-3-8b-instruct"], "model_install_parameters": {"llama_3_1_8b_instruct":{"shard": 2}}}}'
    Important: Do not edit the custom resource directly because it is easy to introduce errors.
  - Shard the foundation models on a specific cluster node
    You can specify the node where you want the shards to run by specifying the hostname of the node in the nodeSelector object.
    - Get a list of the nodes:
      oc get nodes
    - Check the label for the node that you want to use:
      oc describe no <node-name> | grep kubernetes.io/hostname
    - Verify that the node has a GPU by using the following command. Foundation models can be sharded on GPU nodes only.
      oc describe node <node-hostname> | grep nvidia.com/gpu.product
      A GPU node returns nvidia.com/gpu.product=true.
    - Run the following command to patch the custom resource:
      oc patch watsonxaiifm watsonxaiifm-cr \
        --namespace=${PROJECT_CPD_INST_OPERANDS} \
        --type merge \
        --patch '{"spec": {"install_model_list":["<model-id>"], "model_install_parameters": {"<model_id_with_underscore>":{"shard": <shard-value>, "nodeSelector":{"kubernetes.io/hostname": "<hostname-value>"}}}}}'
      Make the following edits:
      - Specify the models that you want to use in a comma-separated list in install_model_list. For example, ["llama-3-1-8b-instruct","granite-3-3-8b-instruct"].
      - Specify each model that you want to shard by specifying the model ID, but use underscores instead of hyphens in the model ID in model_install_parameters. For example, for the llama-3-1-8b-instruct model, specify llama_3_1_8b_instruct.
      - Assign a shard value to each model in model_install_parameters. The shard value specifies the number of units in which to split the model. Accepted shard values are 2, 4, or 8 only. If you specify a value other than one of these accepted values, the default shard value (number of GPUs) for the model is used. No message is shown to inform you that your configuration change is not applied.
      - For each shard that you want to host on a specific node, specify the node hostname value in the nodeSelector object. For example, "nodeSelector":{"kubernetes.io/hostname":"worker0.example.com"}
      For example:
      oc patch watsonxaiifm watsonxaiifm-cr \
        --namespace=${PROJECT_CPD_INST_OPERANDS} \
        --type merge \
        --patch '{"spec": {"install_model_list":["llama-3-1-8b-instruct","granite-3-3-8b-instruct"], "model_install_parameters": {"llama_3_1_8b_instruct":{"shard": 2, "nodeSelector":{"kubernetes.io/hostname": "worker0.example.com"}}}}}'
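Because an unsupported shard value is ignored without a message, it can help to validate the value before you build the patch. The following sketch assumes a hypothetical function name, build_shard_patch. It converts the model ID to the underscore form that model_install_parameters expects, rejects shard values other than 2, 4, or 8, and prints the patch body for one model. Add any other models that you want to keep to install_model_list before you apply the patch.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the product): print a merge-patch body
# that installs one model and sets its shard value.
build_shard_patch() {
  local model_id="$1" shard="$2"
  case "$shard" in
    2|4|8) ;;                               # the only accepted shard values
    *) echo "shard must be 2, 4, or 8 (got: $shard)" >&2
       return 1 ;;
  esac
  # llama-3-1-8b-instruct -> llama_3_1_8b_instruct
  local key="${model_id//-/_}"
  printf '{"spec":{"install_model_list":["%s"],"model_install_parameters":{"%s":{"shard":%s}}}}\n' \
    "$model_id" "$key" "$shard"
}

# Print the patch body that shards llama-3-1-8b-instruct into 2 units.
build_shard_patch llama-3-1-8b-instruct 2
```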
Installing models on GPU partitions
You can specify the NVIDIA Multi-Instance GPU (MIG) node where you want the foundation models to
run by specifying the hostname of the node where MIG is configured in the
nodeSelector object.
- Verify that the foundation model you want to install supports being installed on an NVIDIA Multi-Instance GPU. See System requirements for foundation models in IBM watsonx.ai.
- You can set the nodeSelector object in the following ways to select a specific cluster node to host your models:
  - Use the MIG profile label
    - Get a list of the nodes:
      oc get nodes
    - Check the MIG profile label for the NVIDIA Multi-Instance GPU node that you want to use with the following command:
      oc describe node <node-name> | grep nvidia.com/mig.config
      For example:
      oc describe node worker10.wxai.example.com | grep nvidia.com/mig.config=
      nvidia.com/mig.config=all-3g.40gb
    - Run the following command to patch the custom resource:
      oc patch watsonxaiifm watsonxaiifm-cr \
        --namespace=${PROJECT_CPD_INST_OPERANDS} \
        --type merge \
        --patch '{"spec": {"install_model_list":["<model-id>"], "model_install_parameters": {"<model_id_with_underscore>":{"nodeSelector":{"nvidia.com/mig.config": "<mig-label>"}}}}}'
      Make the following edits:
      - Specify the models that you want to use in a comma-separated list in install_model_list. For example, ["llama-3-1-8b-instruct","granite-3-3-8b-instruct"].
      - Specify the node label in the nodeSelector object. For example, "nodeSelector":{"nvidia.com/mig.config":"all-3g.40gb"}
  - Use the GPU type label
    - Get a list of product labels for the different types of NVIDIA Multi-Instance GPU nodes in your cluster:
      oc get nodes -L nvidia.com/gpu.product
    - Run the following command to patch the custom resource:
      oc patch watsonxaiifm watsonxaiifm-cr \
        --namespace=${PROJECT_CPD_INST_OPERANDS} \
        --type merge \
        --patch '{"spec": {"install_model_list":["<model-id>"], "model_install_parameters": {"<model_id_with_underscore>":{"nodeSelector":{"nvidia.com/gpu.product": "<product-label>"}}}}}'
      Make the following edits:
      - Specify the models that you want to use in a comma-separated list in install_model_list. For example, ["llama-3-1-8b-instruct","granite-3-3-8b-instruct"].
      - Specify the node label in the nodeSelector object. For example, "nodeSelector":{"nvidia.com/gpu.product":"NVIDIA-A100-SXM4-80GB"}
- codestral-2501 model only: To configure the codestral-2501 model to run on a MIG node, you must set the context window length to 28,672 tokens.
  - Set the SERVICE_VERSION environment variable to the watsonx.ai service operand version.
    export SERVICE_VERSION=<watsonxai-service-operand-version>
  - Use the following command to patch the custom resource:
    cat <<EOF | oc apply -f -
    apiVersion: watsonxaiifm.cpd.ibm.com/v1beta1
    kind: Watsonxaiifm
    metadata:
      name: watsonxaiifm-cr
      namespace: ${PROJECT_CPD_INST_OPERANDS}
    spec:
      blockStorageClass: ocs-storagecluster-ceph-rbd
      install_model_list:
        - codestral-2501
      fileStorageClass: ocs-storagecluster-cephfs
      license:
        accept: true
      version: ${SERVICE_VERSION}
      model_install_parameters:
        codestral_2501:
          env:
            - name: MODEL_NAME
              value: /mnt/models/codestral-2501
            - name: SERVED_MODEL_NAME
              value: mistralai/codestral-2501
            - name: MAX_NUM_SEQS
              value: "8"
            - name: MAX_NEW_TOKENS
              value: "8192"
            - name: MAX_SEQUENCE_LENGTH
              value: "28672"
            - name: DISABLE_PROMPT_LOGPROBS
              value: "true"
            - name: TOKENIZER_MODE
              value: mistral
            - name: LOAD_FORMAT
              value: mistral
            - name: CONFIG_FORMAT
              value: mistral
            - name: ENABLE_PREFIX_CACHING
              value: "1"
            - name: NUM_GPUS
              value: "1"
            - name: CUDA_VISIBLE_DEVICES
              value: "0"
            - name: HUGGINGFACE_HUB_CACHE
              value: /mnt/models/
            - name: HF_MODULES_CACHE
              value: /tmp/huggingface/modules
            - name: PORT
              value: "3000"
            - name: MAX_LOG_LEN
              value: "100"
            - name: GPRC_PORT
              value: "8033"
            - name: ENABLE_VLLM_LOG_REQUESTS
              value: "TRUE"
    EOF
- Wait for the patch to be applied. You can check the status by using the following command:
  oc get watsonxaiifm watsonxaiifm-cr -n ${PROJECT_CPD_INST_OPERANDS}
- llama-2-13b-chat model only: Set the following environment variables for the llama-2-13b-chat model:
  - ESTIMATE_MEMORY_BATCH_SIZE=4
  - ESTIMATE_MEMORY=off
  #!/bin/bash
  # Name of the InferenceService for the llama-2-13b-chat model.
  model_id=meta-llama-llama-2-13b-chat

  # Read the environment variables that are currently set on the predictor.
  export current_envs=$(oc get isvc "$model_id" -n "$PROJECT_CPD_INST_OPERANDS" -ojsonpath='{.spec.predictor.model.env}')

  # Check whether each variable is already present with the required value.
  memory_batch_check=$(echo "$current_envs" | jq 'map(select(.name == "ESTIMATE_MEMORY_BATCH_SIZE" and .value == "4")) | length > 0')
  estimate_memory_check=$(echo "$current_envs" | jq 'map(select(.name == "ESTIMATE_MEMORY" and .value == "off")) | length > 0')

  # Append each variable when no entry with the required value exists.
  if [ "$memory_batch_check" = "false" ]; then
    oc patch InferenceService "$model_id" -n "$PROJECT_CPD_INST_OPERANDS" \
      --type=json \
      -p='[{"op": "add", "path": "/spec/predictor/model/env/-", "value": {"name":"ESTIMATE_MEMORY_BATCH_SIZE","value":"4"}}]'
  fi
  if [ "$estimate_memory_check" = "false" ]; then
    oc patch InferenceService "$model_id" -n "$PROJECT_CPD_INST_OPERANDS" \
      --type=json \
      -p='[{"op": "add", "path": "/spec/predictor/model/env/-", "value": {"name":"ESTIMATE_MEMORY","value":"off"}}]'
  fi
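The MIG label lookup and the nodeSelector patch described in this section can be sketched as two small shell functions. Both function names, mig_config_label and build_mig_patch, are hypothetical. The first function reads oc describe node output from standard input, and the second prints a patch body for a single model without contacting the cluster. Add any other models that you want to keep to install_model_list before you apply the patch.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the product): read `oc describe node`
# output on stdin and print the value of the nvidia.com/mig.config label.
# Prints nothing if the label is not present.
mig_config_label() {
  sed -n 's/^.*nvidia\.com\/mig\.config=\([^[:space:]]*\).*$/\1/p' | head -n 1
}

# Hypothetical helper: print a merge-patch body that pins one model to nodes
# that carry the given MIG profile label.
build_mig_patch() {
  local model_id="$1" mig_label="$2"
  printf '{"spec":{"install_model_list":["%s"],"model_install_parameters":{"%s":{"nodeSelector":{"nvidia.com/mig.config":"%s"}}}}}\n' \
    "$model_id" "${model_id//-/_}" "$mig_label"
}

# Example with a sample label line instead of live cluster output.
# On a cluster, the input would come from: oc describe node <node-name>
label="$(printf '  nvidia.com/mig.config=all-3g.40gb\n' | mig_config_label)"
build_mig_patch llama-3-1-8b-instruct "$label"
```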
Installing foundation models with a custom vLLM image
Use the following command format to add a custom vLLM image to the section for a specific model in the custom resource:
oc patch watsonxaiifm watsonxaiifm-cr \
  --namespace=${PROJECT_CPD_INST_OPERANDS} \
  --type merge \
  --patch '{"spec": {"model_install_parameters":{"<model_id_with_underscore>":{"image": "<vllm-image-location>"}}}}'
- Intel Gaudi 3 AI Accelerator
  - Based on the model you want to install, run the following commands on an Intel Gaudi 3 AI Accelerator cluster to include the vLLM image that supports loading the model onto the GPU memory:
    Attention: If you want to install the Intel Gaudi 3 AI Accelerator operator on Red Hat® OpenShift®, see Intel Gaudi Base Operator - OpenShift Installation.
    - codestral-22b
      oc patch watsonxaiifm watsonxaiifm-cr --type merge -p '{"spec":{"model_install_parameters": {"codestral_22b": {"image": "quay.io/modh/vllm:rhoai-2.21-gaudi", "accelerator_vendor": "gaudi"}}}}'
    - granite-3-2-8b-instruct
      oc patch watsonxaiifm watsonxaiifm-cr --type merge -p '{"spec":{"model_install_parameters": {"granite_3_2_8b_instruct": {"image": "quay.io/modh/vllm:rhoai-2.21-gaudi", "accelerator_vendor": "gaudi"}}}}'
    - granite-34b-code-instruct
      oc patch watsonxaiifm watsonxaiifm-cr --type merge -p '{"spec":{"model_install_parameters": {"granite_34b_code_instruct": {"image": "quay.io/modh/vllm:rhoai-2.21-gaudi", "accelerator_vendor": "gaudi"}}}}'
    - granite-20b-code-instruct
      oc patch watsonxaiifm watsonxaiifm-cr --type merge -p '{"spec":{"model_install_parameters": {"granite_20b_code_instruct": {"image": "quay.io/modh/vllm:rhoai-2.21-gaudi", "accelerator_vendor": "gaudi"}}}}'
    - granite-8b-code-instruct
      oc patch watsonxaiifm watsonxaiifm-cr --type merge -p '{"spec":{"model_install_parameters": {"granite_8b_code_instruct": {"image": "quay.io/modh/vllm:rhoai-2.21-gaudi", "accelerator_vendor": "gaudi"}}}}'
    - granite-3b-code-instruct
      oc patch watsonxaiifm watsonxaiifm-cr --type merge -p '{"spec":{"model_install_parameters": {"granite_3b_code_instruct": {"image": "quay.io/modh/vllm:rhoai-2.21-gaudi", "accelerator_vendor": "gaudi"}}}}'
    - granite-3-8b-instruct
      oc patch watsonxaiifm watsonxaiifm-cr --type merge -p '{"spec":{"model_install_parameters": {"granite_3_8b_instruct": {"image": "quay.io/modh/vllm:rhoai-2.21-gaudi", "accelerator_vendor": "gaudi"}}}}'
    - granite-guardian-3-2b
      oc patch watsonxaiifm watsonxaiifm-cr --type merge -p '{"spec":{"model_install_parameters": {"granite_guardian_3_2b": {"image": "quay.io/modh/vllm:rhoai-2.21-gaudi", "accelerator_vendor": "gaudi"}}}}'
    - llama-3-2-3b-instruct
      oc patch watsonxaiifm watsonxaiifm-cr --type merge -p '{"spec":{"model_install_parameters": {"llama_3_2_3b_instruct": {"image": "quay.io/modh/vllm:rhoai-2.21-gaudi", "accelerator_vendor": "gaudi"}}}}'
    - llama-3-2-1b-instruct
      oc patch watsonxaiifm watsonxaiifm-cr --type merge -p '{"spec":{"model_install_parameters": {"llama_3_2_1b_instruct": {"image": "quay.io/modh/vllm:rhoai-2.21-gaudi", "accelerator_vendor": "gaudi"}}}}'
    - llama-3-1-8b-instruct
      oc patch watsonxaiifm watsonxaiifm-cr --type merge -p '{"spec":{"model_install_parameters": {"llama_3_1_8b_instruct": {"image": "quay.io/modh/vllm:rhoai-2.21-gaudi", "accelerator_vendor": "gaudi"}}}}'
- NVIDIA RTX PRO 6000
  - Run the following command on an NVIDIA RTX PRO 6000 cluster to include the vLLM image that supports loading the model onto the GPU memory:
    oc patch watsonxaiifm watsonxaiifm-cr \
      --namespace=${PROJECT_CPD_INST_OPERANDS} \
      --type merge \
      --patch '{"spec": {"model_install_parameters":{"<model_id_with_underscore>":{"image": "registry.redhat.io/rhoai/odh-vllm-cuda-rhel9@sha256:1e8b4f9fdc32213a45824c441171218fd4814ff5ea718b31fc0f74d9322f1a3f"}}}}'
    A subset of models can be installed on NVIDIA RTX PRO 6000 GPUs. For details, see System requirements for foundation models in IBM watsonx.ai.
Installing embedding models on GPUs
- Override the model resources and GPU requests
  - Run the following command to modify the custom resource:
    oc patch watsonxaiifm watsonxaiifm-cr \
      --namespace=${PROJECT_CPD_INST_OPERANDS} \
      --type=merge \
      --patch='{"spec":{"<model_id_with_underscore>_replicas": 1, "<model_id_with_underscore>_resources": {"limits": {"cpu": "2", "memory": "4Gi", "nvidia.com/gpu": 1}, "requests": {"cpu": "1", "memory": "4Gi", "nvidia.com/gpu": 1}}}}'
    Make the following edits:
    - Specify each embedding model for which you want to override resources by specifying the model ID, but use underscores instead of hyphens in the model ID. For example, for the ibm-slate-30m-english-rtrvr model ID, specify ibm_slate_30m_english_rtrvr.
- Specify MIG nodes with the nodeSelector attribute
  - Run the following command to modify the custom resource:
    oc patch watsonxaiifm watsonxaiifm-cr \
      --namespace=${PROJECT_CPD_INST_OPERANDS} \
      --type=merge \
      --patch='{"spec": {"model_install_parameters": {"<model_id_with_underscore>":{"nodeSelector": <node-label>}}}}'
    Make the following edits:
    - Specify each embedding model that you want to configure to run on GPUs by specifying the model ID, but use underscores instead of hyphens in the model ID in model_install_parameters. For example, for the ibm-slate-30m-english-rtrvr model ID, specify ibm_slate_30m_english_rtrvr.
    - Specify the node label in the nodeSelector object. For example, "nodeSelector":{"kubernetes.io/hostname":"gpu-node-1"}
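The resource override uses two keys that are derived from the model ID: the underscore form of the ID followed by _replicas and by _resources. The following sketch shows one way to generate that patch body. The function name build_embedding_gpu_patch is hypothetical, and the CPU and memory values are copied from the sample command in this section rather than being sizing recommendations.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the product): print a merge-patch body
# that overrides replicas and resources for one embedding model.
#   $1 = model ID, for example ibm-slate-30m-english-rtrvr
#   $2 = number of GPUs to request (optional, defaults to 1)
build_embedding_gpu_patch() {
  local key="${1//-/_}" gpus="${2:-1}"
  printf '{"spec":{"%s_replicas":1,"%s_resources":{"limits":{"cpu":"2","memory":"4Gi","nvidia.com/gpu":%s},"requests":{"cpu":"1","memory":"4Gi","nvidia.com/gpu":%s}}}}\n' \
    "$key" "$key" "$gpus" "$gpus"
}

# Print the patch body for the ibm-slate-30m-english-rtrvr embedding model.
build_embedding_gpu_patch ibm-slate-30m-english-rtrvr
```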
What to do next
IBM watsonx.ai is ready to use. To start working with the various foundation models and tools, see Getting started with IBM watsonx.ai.