Adding foundation models to IBM watsonx.ai
To submit input to a foundation model or text embedding model from IBM watsonx.ai, you must deploy the models that you want to use in your cluster.
Before you begin
The IBM watsonx.ai service must be installed. Make sure you have the necessary resources available to support the models that you want to use. For more information about the overall resources that are required for the service, see Hardware requirements. For details about the available foundation models and the extra resources needed to host them, see Foundation models in IBM watsonx.ai.
To complete this task the first time, you must be the instance administrator who installed the IBM watsonx.ai service.
You can add foundation models that are curated by IBM to a full installation or lightweight engine installation.
Procedure
- Decide which models you want to use, and then make a note of their model IDs.
  Tip: Because foundation models have large resource demands, install only the models that you know you want to use right away when you install the service. You can add more models later.
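  If you want to confirm which models are already requested before adding more, you can read the current list back from the custom resource. The following is a minimal sketch; it assumes the watsonxaiifm-cr custom resource and the install_model_list field that are used later in this procedure:
  # List the model IDs that are currently requested in the custom resource
  oc get watsonxaiifm watsonxaiifm-cr -n ${PROJECT_CPD_INST_OPERANDS} \
    -o jsonpath='{.spec.install_model_list}{"\n"}'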
- Choose how you want to install the foundation models from the following options:
  - Install with default configuration
    The simplest installation option. Models are installed on available resources. Some foundation models are sharded as part of their standard configuration. The number of GPUs specified per model in the table in Foundation models in IBM watsonx.ai indicates the number of shards used. See Installing models with the default configuration.
    Note: Use this method for adding embedding models. You cannot shard embedding models.
  - Shard the foundation models during installation
    Sharding partitions large models into smaller units, known as shards, that can be processed across multiple GPU processors in parallel. Sharding foundation models can improve performance and also reduce the amount of memory needed for each GPU. See Sharding the foundation models.
  - Install foundation models on preconfigured GPU partitions
    Partitioning GPU processors enables you to install more than one smaller model on a single GPU to use resources more efficiently. See Installing models on GPU partitions.
    Note: This option was added with the 5.0.3 release.
  Important: Not all foundation models can be sharded or installed on partitioned GPU processors. See Foundation models in IBM watsonx.ai for details.
- Confirm that the spec section of the watsonxaiifm custom resource is updated by running the following command:
  oc describe watsonxaiifm watsonxaiifm-cr -n ${PROJECT_CPD_INST_OPERANDS}
  Important: Do not edit the custom resource directly because it is easy to introduce errors.
- Wait for the operator to finish reconciling the changes and show the Completed status. You can use the following command to check the status of the service:
  oc get watsonxaiifm -n ${PROJECT_CPD_INST_OPERANDS}
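  If you prefer to poll rather than rerun the command manually, a small loop can wait for the Completed status. This is a sketch that assumes the status value appears in the default output of the oc get command shown above:
  # Poll the service status until it reports Completed, checking once per minute
  until oc get watsonxaiifm -n ${PROJECT_CPD_INST_OPERANDS} | grep -q Completed; do
    echo "Waiting for the watsonxaiifm operator to finish reconciling..."
    sleep 60
  done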
- Use the following commands to check whether your changes were applied successfully.
  - To check whether the model predictor pod is running:
    oc get po -n ${PROJECT_CPD_INST_OPERANDS} | grep predictor
  - To check whether the predictor pod shows the correct number of GPU processors (if you specified a node selector, you can also check the value of the Node-Selectors field):
    oc describe po <model-predictor-pod> -n ${PROJECT_CPD_INST_OPERANDS}
  - To check which node a foundation model is using:
    oc get po -n ${PROJECT_CPD_INST_OPERANDS} -o wide | grep predictor
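  To read the GPU count programmatically instead of scanning the describe output, you can query the pod spec directly. This is a minimal sketch; it assumes that the first pod whose name contains predictor is the one you want, so adjust the grep pattern for your model:
  # Read the GPU limit of a predictor pod from its container resources
  POD=$(oc get po -n ${PROJECT_CPD_INST_OPERANDS} -o name | grep predictor | head -n 1)
  oc get ${POD} -n ${PROJECT_CPD_INST_OPERANDS} \
    -o jsonpath='{.spec.containers[*].resources.limits.nvidia\.com/gpu}{"\n"}'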
- 5.0.0 only: Optionally, if you want to use the flan-ul2-20b foundation model with L40S GPUs only:
  - Run the following patch command to make adjustments to the model configuration that optimize the model for use with L40S GPUs:
    oc patch -n ${PROJECT_CPD_INST_OPERANDS} watsonxaiifm watsonxaiifm-cr \
      --type merge \
      --patch '{"spec": { "google_flan_ul2_resources": {"limits": {"cpu": "2","ephemeral-storage": "1Gi","memory": "128Gi", "nvidia.com/gpu": "2"},"requests": {"cpu": "1","ephemeral-storage": "10Mi", "memory": "4Gi","nvidia.com/gpu": "2"}},"model_install_parameters": {"google_flan_ul2": {"args": ["--port=3000", "--grpc-port=8033", "--num-shard=2"],"env": [{ "name": "MODEL_NAME", "value": "google/flan-ul2" },{ "name": "DEPLOYMENT_FRAMEWORK", "value": "hf_custom_tp" },{ "name": "TRANSFORMERS_CACHE", "value": "/mnt/models/" },{ "name": "HUGGINGFACE_HUB_CACHE", "value": "/mnt/models/" },{ "name": "DTYPE_STR", "value": "float16" },{ "name": "MAX_SEQUENCE_LENGTH", "value": "4096" },{ "name": "MAX_BATCH_SIZE", "value": "128" },{ "name": "MAX_CONCURRENT_REQUESTS", "value": "150" },{ "name": "MAX_BATCH_WEIGHT", "value": "34543200" },{ "name": "MAX_NEW_TOKENS", "value": "4096" },{ "name": "CUDA_VISIBLE_DEVICES", "value": "0,1" },{ "name": "HF_MODULES_CACHE", "value": "/tmp/huggingface/modules" },{ "name": "NUM_GPUS", "value": "2" }]}}}}'
  - Add the flan-ul2-20b model to the list of models to install by patching the custom resource as follows:
    oc patch watsonxaiifm watsonxaiifm-cr \
      --namespace=${PROJECT_CPD_INST_OPERANDS} \
      --type=merge \
      --patch='{"spec":{"install_model_list": ["google-flan-ul2"]}}'
- 5.0.0 only: Optionally, if you want to use the gpt-neox-20b foundation model with L40S GPUs only:
  - Run the following patch command to make adjustments to the model configuration that optimize the model for use with L40S GPUs:
    oc patch -n ${PROJECT_CPD_INST_OPERANDS} watsonxaiifm watsonxaiifm-cr \
      --type merge \
      --patch '{"spec": {"model_install_parameters": {"eleutherai_gpt_neox_20b": {"args": ["--port=3000","--grpc-port=8033","--num-shard=2"], "env": [{ "name": "MODEL_NAME", "value": "EleutherAI/gpt-neox-20b" }, { "name": "NUM_GPUS", "value": "2" }, { "name": "MAX_BATCH_WEIGHT", "value": "10000" }, { "name": "MAX_PREFILL_WEIGHT", "value": "8192" }, { "name": "FLASH_ATTENTION", "value": "true" }, { "name": "DEPLOYMENT_FRAMEWORK", "value": "hf_custom_tp" }, { "name": "TRANSFORMERS_CACHE", "value": "/mnt/models/" }, { "name": "HUGGINGFACE_HUB_CACHE", "value": "/mnt/models/" }, { "name": "DTYPE_STR", "value": "float16" }, { "name": "MAX_SEQUENCE_LENGTH", "value": "8192" }, { "name": "MAX_BATCH_SIZE", "value": "256" }, { "name": "MAX_CONCURRENT_REQUESTS", "value": "64" }, { "name": "MAX_NEW_TOKENS", "value": "8192" }, { "name": "CUDA_VISIBLE_DEVICES", "value": "0,1" }, { "name": "HF_MODULES_CACHE", "value": "/tmp/huggingface/modules" }]}}, "eleutherai_gpt_neox_20b_resources": {"limits": {"cpu": "3", "ephemeral-storage": "1Gi", "memory": "128Gi", "nvidia.com/gpu": "2" }, "requests": {"cpu":"2","ephemeral-storage": "10Mi", "memory": "4Gi", "nvidia.com/gpu": "2" }}}}'
  - Add the gpt-neox-20b model to the list of models to install by patching the custom resource as follows:
    oc patch watsonxaiifm watsonxaiifm-cr \
      --namespace=${PROJECT_CPD_INST_OPERANDS} \
      --type=merge \
      --patch='{"spec":{"install_model_list": ["eleutherai_gpt_neox_20b"]}}'
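  To confirm that either of the preceding L40S patches was recorded, you can read the per-model resource overrides back from the custom resource. This is a sketch that reuses the google_flan_ul2_resources and eleutherai_gpt_neox_20b_resources field names from the patch commands above:
  # Read back the resource overrides set by the L40S patch commands
  oc get watsonxaiifm watsonxaiifm-cr -n ${PROJECT_CPD_INST_OPERANDS} -o json \
    | jq '.spec | {google_flan_ul2_resources, eleutherai_gpt_neox_20b_resources}'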
You can add more models at any point after the initial service setup. You can also remove models or change the sharding settings later. For more information, see Changing how models are sharded.
Installing models with the default configuration
- Run the following command:
  oc patch watsonxaiifm watsonxaiifm-cr \
    --namespace=${PROJECT_CPD_INST_OPERANDS} \
    --type=merge \
    --patch='{"spec":{"install_model_list": ["<model-id1>","<model-id2>"]}}'
  In the install_model_list array, list the IDs for the models that you want to use. Separate the model IDs with commas. For example:
  oc patch watsonxaiifm watsonxaiifm-cr \
    --namespace=${PROJECT_CPD_INST_OPERANDS} \
    --type=merge \
    --patch='{"spec":{"install_model_list": ["meta-llama-llama-3-1-8b-instruct","ibm-granite-13b-chat-v2"]}}'
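  Keep in mind that a merge patch replaces the entire install_model_list array, so include every model that you still want listed. If you want to append a model without retyping the existing entries, the following sketch shows one way to do it; <model-id> is a placeholder for the model that you want to add:
  # Read the current list, append the new model ID, and patch the full list back
  current=$(oc get watsonxaiifm watsonxaiifm-cr -n ${PROJECT_CPD_INST_OPERANDS} -o json \
    | jq -c '.spec.install_model_list // []')
  updated=$(echo "${current}" | jq -c '. + ["<model-id>"] | unique')
  oc patch watsonxaiifm watsonxaiifm-cr \
    --namespace=${PROJECT_CPD_INST_OPERANDS} \
    --type=merge \
    --patch "{\"spec\":{\"install_model_list\": ${updated}}}"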
Sharding the foundation models
- Verify that the foundation model you want to shard supports sharding. See Foundation models in IBM watsonx.ai.
- Do one of the following things:
- Shard the foundation models
  Make the following edits:
  oc patch watsonxaiifm watsonxaiifm-cr \
    --namespace=${PROJECT_CPD_INST_OPERANDS} \
    --type merge \
    --patch '{"spec": {"install_model_list":["<model-id>"], "model_install_parameters": {"<model_id_with_underscore>":{"shard": <shard-value>}}}}'
  - Specify the models that you want to use in a comma-separated list in install_model_list. For example, ["meta-llama-llama-3-1-8b-instruct","ibm-granite-13b-chat-v2"].
  - Specify each model that you want to shard by its model ID, but use underscores instead of hyphens in the model ID in model_install_parameters. For example, for the meta-llama-llama-3-1-8b-instruct model, specify "meta_llama_llama_3_1_8b_instruct".
  - Assign a shard value to each model in model_install_parameters. The shard value specifies the number of units in which to split the model. Accepted shard values are 2, 4, or 8 only. If you specify a value other than one of these accepted values, the default shard value (the number of GPUs) for the model is used. No message is shown to inform you that your configuration change is not applied.
  For example:
  oc patch watsonxaiifm watsonxaiifm-cr \
    --namespace=${PROJECT_CPD_INST_OPERANDS} \
    --type merge \
    --patch '{"spec": {"install_model_list":["meta-llama-llama-3-1-8b-instruct","ibm-granite-13b-chat-v2"], "model_install_parameters": {"ibm_granite_13b_chat_v2":{"shard": 2}}}}'
  Important: Do not edit the custom resource directly because it is easy to introduce errors.
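  After you apply the patch, you can read the sharding parameters back from the custom resource to confirm that the shard value was recorded. The following is a minimal sketch based on the example above:
  # Confirm that the shard value for the granite model was recorded in the spec
  oc get watsonxaiifm watsonxaiifm-cr -n ${PROJECT_CPD_INST_OPERANDS} -o json \
    | jq '.spec.model_install_parameters.ibm_granite_13b_chat_v2'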
- Shard the foundation models on specific nodes
  You can specify the node where you want the shards to run by specifying the hostname of the node in the nodeSelector object.
  - Get a list of the nodes:
    oc get nodes
  - Check the label for the node that you want to use:
    oc describe no <node-name> | grep kubernetes.io/hostname
  - Verify that the node has a GPU by using the following command. Foundation models can be sharded on GPU nodes only.
    oc describe node <node-hostname> | grep nvidia.com/gpu.product
    A GPU node returns nvidia.com/gpu.product=true.
  - Run the following command to patch the custom resource. Make the following edits:
    oc patch watsonxaiifm watsonxaiifm-cr \
      --namespace=${PROJECT_CPD_INST_OPERANDS} \
      --type merge \
      --patch '{"spec": {"install_model_list":["<model-id>"], "model_install_parameters": {"<model_id_with_underscore>":{"shard": <shard-value>, "nodeSelector":{"kubernetes.io/hostname": "<hostname-value>"}}}}}'
    - Specify the models that you want to use in a comma-separated list in install_model_list. For example, ["meta-llama-llama-3-1-8b-instruct","ibm-granite-13b-chat-v2"].
    - Specify each model that you want to shard by its model ID, but use underscores instead of hyphens in the model ID in model_install_parameters. For example, for the meta-llama-llama-3-1-8b-instruct model, specify "meta_llama_llama_3_1_8b_instruct".
    - Assign a shard value to each model in model_install_parameters. The shard value specifies the number of units in which to split the model. Accepted shard values are 2, 4, or 8 only. If you specify a value other than one of these accepted values, the default shard value (the number of GPUs) for the model is used. No message is shown to inform you that your configuration change is not applied.
    - For each shard that you want to host on a specific node, specify the node hostname value in the nodeSelector object. For example, "nodeSelector":{"kubernetes.io/hostname":"worker0.example.com"}.
    For example:
    oc patch watsonxaiifm watsonxaiifm-cr \
      --namespace=${PROJECT_CPD_INST_OPERANDS} \
      --type merge \
      --patch '{"spec": {"install_model_list":["meta-llama-llama-3-8b-instruct","ibm-granite-13b-chat-v2"], "model_install_parameters": {"ibm_granite_13b_chat_v2":{"shard": 2, "nodeSelector":{"kubernetes.io/hostname": "worker0.example.com"}}}}}'
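  To confirm that the shards were scheduled where you expect, you can combine the predictor checks from earlier in this procedure with a read of the pod's node selector. This is a sketch; replace <model-predictor-pod> with a pod name returned by the first command:
  # See which node each predictor pod landed on
  oc get po -n ${PROJECT_CPD_INST_OPERANDS} -o wide | grep predictor
  # Read the node selector that was applied to a specific predictor pod
  oc get po <model-predictor-pod> -n ${PROJECT_CPD_INST_OPERANDS} \
    -o jsonpath='{.spec.nodeSelector}{"\n"}'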
Installing models on GPU partitions
- Verify that the foundation model you want to install supports being installed on an NVIDIA Multi-Instance GPU. See Foundation models in IBM watsonx.ai.
- Get a list of the nodes:
oc get nodes
- Check the label for the NVIDIA Multi-Instance GPU node that you want to use with the following command:
  oc describe node <node-name> | grep nvidia.com/mig.config
  For example:
  oc describe node worker10.wxai.example.com | grep nvidia.com/mig.config=
  nvidia.com/mig.config=all-3g.40gb
- Run the following command to patch the custom resource. Make the following edits:
  oc patch watsonxaiifm watsonxaiifm-cr \
    --namespace=${PROJECT_CPD_INST_OPERANDS} \
    --type merge \
    --patch '{"spec": {"install_model_list":["<model-id>"], "model_install_parameters": {"<model_id_with_underscore>":{"nodeSelector":{"nvidia.com/mig.config": "<mig-label>"}}}}}'
  - Specify the models that you want to use in a comma-separated list in install_model_list. For example, ["meta-llama-llama-3-1-8b-instruct","ibm-granite-13b-chat-v2"].
  - Specify the node label in the nodeSelector object. For example, "nodeSelector":{"nvidia.com/mig.config":"all-3g.40gb"}.
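  For example, the following command places a model on the partitioned node labeled in the previous step. The model ID and MIG label are illustrative values reused from earlier examples in this procedure; check Foundation models in IBM watsonx.ai to confirm that the model you choose supports Multi-Instance GPU:
  oc patch watsonxaiifm watsonxaiifm-cr \
    --namespace=${PROJECT_CPD_INST_OPERANDS} \
    --type merge \
    --patch '{"spec": {"install_model_list":["ibm-granite-13b-chat-v2"], "model_install_parameters": {"ibm_granite_13b_chat_v2":{"nodeSelector":{"nvidia.com/mig.config": "all-3g.40gb"}}}}}'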
- Wait for the patch to be applied. You can check the status by using the following command:
  oc get watsonxaiifm watsonxaiifm-cr -n ${PROJECT_CPD_INST_OPERANDS}
- If you are installing any of the following foundation models, which are listed by model ID, you must take an additional step:
  - google-flan-t5-xxl
  - meta-llama-llama-2-13b-chat
  - mncai-llama2-13b-dpo-v7
  - bigscience-mt0-xxl
  These models must run with the following environment variable settings:
  - ESTIMATE_MEMORY_BATCH_SIZE=4
  - ESTIMATE_MEMORY=off
  Run the following script for each of these models to add the settings if they are not already present. Replace the placeholder values, including the value of the $model_id variable.
  #!/bin/bash
  export model_id=<replace with model ID>
  export PROJECT_CPD_INST_OPERANDS=<replace with instance-namespace>
  export current_envs=$(oc get isvc $model_id -n $PROJECT_CPD_INST_OPERANDS -ojsonpath='{.spec.predictor.model.env}')
  memory_batch_check=$(echo $current_envs | jq 'map(select(.name == "ESTIMATE_MEMORY_BATCH_SIZE" and .value == "4")) | length > 0')
  estimate_memory_check=$(echo $current_envs | jq 'map(select(.name == "ESTIMATE_MEMORY" and .value == "off")) | length > 0')
  if [ "$memory_batch_check" = "false" ]; then
    oc patch InferenceService $model_id -n $PROJECT_CPD_INST_OPERANDS \
      --type=json \
      -p='[{"op": "add", "path": "/spec/predictor/model/env/-", "value": {"name":"ESTIMATE_MEMORY_BATCH_SIZE","value":"4"}}]'
  fi
  if [ "$estimate_memory_check" = "false" ]; then
    oc patch InferenceService $model_id -n $PROJECT_CPD_INST_OPERANDS \
      --type=json \
      -p='[{"op": "add", "path": "/spec/predictor/model/env/-", "value": {"name":"ESTIMATE_MEMORY","value":"off"}}]'
  fi
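  To verify that both environment variables are now present on the InferenceService, you can reuse the same query that the script relies on:
  # Show the ESTIMATE_MEMORY settings currently on the inference service
  oc get isvc $model_id -n $PROJECT_CPD_INST_OPERANDS \
    -o jsonpath='{.spec.predictor.model.env}' \
    | jq '[.[] | select(.name == "ESTIMATE_MEMORY_BATCH_SIZE" or .name == "ESTIMATE_MEMORY")]'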
What to do next
- Full-service installation
- To get started with the available tools, see Developing generative AI solutions with foundation models.
- watsonx.ai lightweight engine installation
- See Getting started with watsonx.ai Lightweight Engine.