Adding foundation models to IBM watsonx.ai

To submit input to a foundation model or text embedding model from IBM watsonx.ai, you must deploy the models that you want to use in your cluster.

Before you begin

The IBM watsonx.ai service must be installed. Make sure you have the necessary resources available to support the models that you want to use. For more information about the overall resources that are required for the service, see Hardware requirements. For details about the available foundation models and the extra resources needed to host them, see Foundation models in IBM watsonx.ai.
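
If you are not sure how much GPU capacity is currently available, one way to review it is to list the GPU worker nodes and their GPU resources. The following commands are a minimal sketch that assumes the standard NVIDIA GPU Operator label (nvidia.com/gpu.present) and resource name (nvidia.com/gpu) are in use on your cluster:

  # List the worker nodes that advertise NVIDIA GPUs.
  oc get nodes -l nvidia.com/gpu.present=true

  # Show the GPU-related capacity, allocation, and labels on one of those nodes.
  oc describe node <node-name> | grep nvidia.com/gpu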

To complete this task the first time, you must be the instance administrator who installed the IBM watsonx.ai service.

You can add foundation models that are curated by IBM to a full installation or lightweight engine installation.

Procedure

To add foundation models, complete the following steps:
  1. Decide which models you want to use, and then make a note of their model IDs.
    Tip: Because of the large resource demands of foundation models, install only the models that you know you want to use right away when you install the service. You can add more models later.
  2. Choose how you want to install the foundation models from the following options:
    Install with default configuration
    Simplest installation option where models are installed on available resources. Some foundation models are sharded as part of their standard configuration. The number of GPUs specified per model in the table in Foundation models in IBM watsonx.ai indicates the number of shards used.

    See Installing models with the default configuration.

    Note: Use this method for adding embedding models. You cannot shard embedding models.
    Shard the foundation models during installation

    Sharding partitions large models into smaller units, known as shards, that can be processed across multiple GPU processors in parallel. Sharding foundation models can improve performance and also reduce the amount of memory that is needed on each GPU.

    See Sharding the foundation models.

    Install foundation models on preconfigured GPU partitions
    Partitioning GPU processors enables you to install more than one smaller model on a single GPU to use resources more efficiently.
    Note: This option was added with the 5.0.3 release.

    See Installing models on GPU partitions.

    Important: Not all foundation models can be sharded or installed on partitioned GPU processors. See Foundation models in IBM watsonx.ai for details.
  3. Confirm that the spec section of the watsonxaiifm custom resource is updated by running the following command:
    oc describe watsonxaiifm watsonxaiifm-cr -n ${PROJECT_CPD_INST_OPERANDS}
    Important: Do not edit the custom resource directly because it is easy to introduce errors.
  4. Wait for the operator to finish reconciling the changes and show the Completed status. You can use the following command to check the status of the service:
    oc get watsonxaiifm -n ${PROJECT_CPD_INST_OPERANDS}
  5. Use the following commands to check whether your changes were applied successfully. A scripted version of these checks is provided after this procedure.
    • To check whether the model predictor pod is running:
      oc get po -n ${PROJECT_CPD_INST_OPERANDS} | grep predictor
    • To check whether the predictor pod shows the correct number of GPU processors and, if you specified a node selector, the value of the Node-Selectors field:
      oc describe po <model-predictor-pod> -n ${PROJECT_CPD_INST_OPERANDS}
    • To check which node a foundation model is using:
      oc get po -n ${PROJECT_CPD_INST_OPERANDS} -o wide | grep predictor
  6. 5.0.0 only: Optional: If you want to use the flan-ul2-20b foundation model with L40S GPUs, complete the following steps:
    1. Run the following patch command to make adjustments to the model configuration that optimize the model for use with L40S GPUs:
      oc patch -n ${PROJECT_CPD_INST_OPERANDS} watsonxaiifm watsonxaiifm-cr \
      --type merge \
      --patch '{"spec": { "google_flan_ul2_resources": {"limits": {"cpu": "2","ephemeral-storage": "1Gi","memory": "128Gi", "nvidia.com/gpu": "2"},"requests": {"cpu": "1","ephemeral-storage": "10Mi", "memory": "4Gi","nvidia.com/gpu": "2"}},"model_install_parameters": {"google_flan_ul2": {"args": ["--port=3000", "--grpc-port=8033", "--num-shard=2"],"env": [{ "name": "MODEL_NAME", "value": "google/flan-ul2" },{ "name": "DEPLOYMENT_FRAMEWORK", "value": "hf_custom_tp" },{ "name": "TRANSFORMERS_CACHE", "value": "/mnt/models/" },{ "name": "HUGGINGFACE_HUB_CACHE", "value": "/mnt/models/" },{ "name": "DTYPE_STR", "value": "float16" },{ "name": "MAX_SEQUENCE_LENGTH", "value": "4096" },{ "name": "MAX_BATCH_SIZE", "value": "128" },{ "name": "MAX_CONCURRENT_REQUESTS", "value": "150" },{ "name": "MAX_BATCH_WEIGHT", "value": "34543200" },{ "name": "MAX_NEW_TOKENS", "value": "4096" },{ "name": "CUDA_VISIBLE_DEVICES", "value": "0,1" },{ "name": "HF_MODULES_CACHE", "value": "/tmp/huggingface/modules" },{ "name": "NUM_GPUS", "value": "2" }]}}}}'
    2. Add the flan-ul2-20b model to the list of models to install by patching the deployment as follows:
      oc patch watsonxaiifm watsonxaiifm-cr \
      --namespace=${PROJECT_CPD_INST_OPERANDS} \
      --type=merge \
      --patch='{"spec":{"install_model_list": ["google-flan-ul2"]}}'
      
  7. 5.0.0 only: Optional: If you want to use the gpt-neox-20b foundation model with L40S GPUs, complete the following steps:
    1. Run the following patch command to make adjustments to the model configuration that optimize the model for use with L40S GPUs:
      oc patch -n ${PROJECT_CPD_INST_OPERANDS} watsonxaiifm watsonxaiifm-cr \
      --type merge \
      --patch '{"spec": {"model_install_parameters": {"eleutherai_gpt_neox_20b": {"args": ["--port=3000","--grpc-port=8033","--num-shard=2"], "env": [{ "name": "MODEL_NAME", "value": "EleutherAI/gpt-neox-20b" }, { "name": "NUM_GPUS", "value": "2" }, { "name": "MAX_BATCH_WEIGHT", "value": "10000" }, { "name": "MAX_PREFILL_WEIGHT", "value": "8192" }, { "name": "FLASH_ATTENTION", "value": "true" }, { "name": "DEPLOYMENT_FRAMEWORK", "value": "hf_custom_tp" }, { "name": "TRANSFORMERS_CACHE", "value": "/mnt/models/" }, { "name": "HUGGINGFACE_HUB_CACHE", "value": "/mnt/models/" }, { "name": "DTYPE_STR", "value": "float16" }, { "name": "MAX_SEQUENCE_LENGTH", "value": "8192" }, { "name": "MAX_BATCH_SIZE", "value": "256" }, { "name": "MAX_CONCURRENT_REQUESTS", "value": "64" }, { "name": "MAX_NEW_TOKENS", "value": "8192" }, { "name": "CUDA_VISIBLE_DEVICES", "value": "0,1" }, { "name": "HF_MODULES_CACHE", "value": "/tmp/huggingface/modules" }]}}, "eleutherai_gpt_neox_20b_resources": {"limits": {"cpu": "3", "ephemeral-storage": "1Gi", "memory": "128Gi", "nvidia.com/gpu": "2" }, "requests": {"cpu":"2","ephemeral-storage": "10Mi", "memory": "4Gi", "nvidia.com/gpu": "2" }}}}'
    2. Add the gpt-neox-20b model to the list of models to install by patching the deployment as follows:
      oc patch watsonxaiifm watsonxaiifm-cr \
      --namespace=${PROJECT_CPD_INST_OPERANDS} \
      --type=merge \
      --patch='{"spec":{"install_model_list": ["eleutherai_gpt_neox_20b"]}}'
      

You can add more models at any time after the initial service setup. You can also remove or change the sharding settings later. For more information, see Changing how models are sharded.
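
The checks in steps 3 through 5 can also be combined into a short script. The following sketch assumes the watsonxaiifm-cr custom resource name that is used throughout this procedure, that the PROJECT_CPD_INST_OPERANDS environment variable is already set, and that predictor pod names contain the string predictor:

  #!/bin/bash
  # Step 3: show the models that are currently listed in the spec section of the custom resource.
  oc get watsonxaiifm watsonxaiifm-cr -n ${PROJECT_CPD_INST_OPERANDS} -o jsonpath='{.spec.install_model_list}'; echo

  # Step 4: check the reconciliation status of the service.
  oc get watsonxaiifm -n ${PROJECT_CPD_INST_OPERANDS}

  # Step 5: list the predictor pods and the nodes that they run on, then show
  # the GPU requests and node selector for each predictor pod.
  oc get po -n ${PROJECT_CPD_INST_OPERANDS} -o wide | grep predictor
  for pod in $(oc get po -n ${PROJECT_CPD_INST_OPERANDS} -o name | grep predictor); do
    echo "== ${pod}"
    oc describe -n ${PROJECT_CPD_INST_OPERANDS} "${pod}" | grep -E 'nvidia.com/gpu|Node-Selectors'
  done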

Installing models with the default configuration

  1. Run the following command:
    oc patch watsonxaiifm watsonxaiifm-cr \
    --namespace=${PROJECT_CPD_INST_OPERANDS} \
    --type=merge \
    --patch='{"spec":{"install_model_list": ["<model-id1>","<model-id2>"]}}'
    

    In the install_model_list array, list the IDs of the models that you want to use, separated by commas. For example:

    oc patch watsonxaiifm watsonxaiifm-cr \
    --namespace=${PROJECT_CPD_INST_OPERANDS} \
    --type=merge \
    --patch='{"spec":{"install_model_list": ["meta-llama-llama-3-1-8b-instruct","ibm-granite-13b-chat-v2"]}}'
    

Sharding the foundation models

You can specify only the number of shards and let the models be installed on any available nodes, or you can specify the nodes where you want each shard to be installed.
  1. Verify that the foundation model you want to shard supports sharding. See Foundation models in IBM watsonx.ai.
  2. Do one of the following things:
    Shard the foundation models
    oc patch watsonxaiifm watsonxaiifm-cr \
    --namespace=${PROJECT_CPD_INST_OPERANDS} \
    --type merge \
    --patch '{"spec": {"install_model_list":["<model-id>"], "model_install_parameters": {"<model_id_with_underscore>":{"shard": <shard-value>}}}}'
    Make the following edits:
    • Specify the models that you want to use in a comma-separated list in install_model_list. For example, ["meta-llama-llama-3-1-8b-instruct","ibm-granite-13b-chat-v2"].
    • Specify each model that you want to shard by specifying the model ID, but use underscores instead of hyphens in the model ID in model_install_parameters. For example, for the meta-llama-llama-3-1-8b-instruct model, specify "meta_llama_llama_3_1_8b_instruct".
    • Assign a shard value to each model in model_install_parameters. The shard value specifies the number of units in which to split the model. Accepted shard values are 2, 4, or 8 only. If you specify a value other than one of these accepted values, the default shard value (number of GPUs) for the model is used. No message is shown to inform you that your configuration change is not applied.
    For example:
    oc patch watsonxaiifm watsonxaiifm-cr \
    --namespace=${PROJECT_CPD_INST_OPERANDS} \
    --type merge \
    --patch '{"spec": {"install_model_list":["meta-llama-llama-3-1-8b-instruct","ibm-granite-13b-chat-v2"], "model_install_parameters": {"ibm_granite_13b_chat_v2":{"shard": 2}}}}'
    Important: Do not edit the custom resource directly because it is easy to introduce errors.
    Shard the foundation models on specific nodes

    You can control which node the shards run on by specifying the hostname of the node in the nodeSelector object.

    1. Get a list of the nodes:
      oc get nodes
    2. Check the label for the node that you want to use:
      oc describe no <node-name> | grep kubernetes.io/hostname
    3. Verify that the node has a GPU by using the following command. Foundation models can be sharded on GPU nodes only.
      oc describe node <node-hostname> | grep nvidia.com/gpu.present
      A GPU node returns nvidia.com/gpu.present=true.
    4. Run the following command to patch the custom resource:
      oc patch watsonxaiifm watsonxaiifm-cr \
      --namespace=${PROJECT_CPD_INST_OPERANDS} \
      --type merge \
      --patch '{"spec": {"install_model_list":["<model-id>"], "model_install_parameters": {"<model_id_with_underscore>":{"shard": <shard-value>, "nodeSelector":{"kubernetes.io/hostname": "<hostname-value>"}}}}}'
      Make the following edits:
      • Specify the models that you want to use in a comma-separated list in install_model_list. For example, ["meta-llama-llama-3-1-8b-instruct","ibm-granite-13b-chat-v2"].
      • Specify each model that you want to shard by specifying the model ID, but use underscores instead of hyphens in the model ID in model_install_parameters. For example, for the meta-llama-llama-3-1-8b-instruct model, specify "meta_llama_llama_3_1_8b_instruct".
      • Assign a shard value to each model in model_install_parameters. The shard value specifies the number of units in which to split the model. Accepted shard values are 2, 4, or 8 only. If you specify a value other than one of these accepted values, the default shard value (number of GPUs) for the model is used. No message is shown to inform you that your configuration change is not applied.
      • For each shard that you want to host on a specific node, specify the node hostname value in the nodeSelector object. For example, "nodeSelector":{"kubernetes.io/hostname":"worker0.example.com"}
    For example:
    oc patch watsonxaiifm watsonxaiifm-cr \
    --namespace=${PROJECT_CPD_INST_OPERANDS} \
    --type merge \
    --patch '{"spec": {"install_model_list":["meta-llama-llama-3-8b-instruct","ibm-granite-13b-chat-v2"], "model_install_parameters": {"ibm_granite_13b_chat_v2":{"shard": 2, "nodeSelector":{"kubernetes.io/hostname": "worker0.example.com"}}}}}'

Installing models on GPU partitions

You can run foundation models on NVIDIA Multi-Instance GPU (MIG) nodes by specifying the MIG configuration label of the node where MIG is configured in the nodeSelector object.
Note: This configuration option was added with the 5.0.3 release.
Complete the following steps:
  1. Verify that the foundation model you want to install supports being installed on an NVIDIA Multi-Instance GPU. See Foundation models in IBM watsonx.ai.
  2. Get a list of the nodes:
    oc get nodes
  3. Check the label for the NVIDIA Multi-Instance GPU node that you want to use with the following command:
    oc describe node <node-name> | grep nvidia.com/mig.config
    For example:
    oc describe node worker10.wxai.example.com | grep nvidia.com/mig.config=
    nvidia.com/mig.config=all-3g.40gb
  4. Run the following command to patch the custom resource:
    oc patch watsonxaiifm watsonxaiifm-cr \
    --namespace=${PROJECT_CPD_INST_OPERANDS} \
    --type merge \
    --patch '{"spec": {"install_model_list":["<model-id>"], "model_install_parameters": {"<model_id_with_underscore>":{"nodeSelector":{"nvidia.com/mig.config": "<mig-label>"}}}}}'
    Make the following edits:
    • Specify the models that you want to use in a comma-separated list in install_model_list. For example, ["meta-llama-llama-3-1-8b-instruct","ibm-granite-13b-chat-v2"].
    • Specify each model that you want to configure in model_install_parameters by its model ID, but use underscores instead of hyphens in the model ID. For example, for the meta-llama-llama-3-1-8b-instruct model, specify "meta_llama_llama_3_1_8b_instruct".
    • Specify the node label in the nodeSelector object. For example, "nodeSelector":{"nvidia.com/mig.config":"all-3g.40gb"}
    A filled-in example is shown after these steps.
  5. Wait for the patch to be applied. You can check the status by using the following command:
    oc get watsonxaiifm watsonxaiifm-cr -n ${PROJECT_CPD_INST_OPERANDS}
  6. If you are installing any of the following foundation models, which are listed by model ID, you must take an additional step.
    • google-flan-t5-xxl
    • meta-llama-llama-2-13b-chat
    • mncai-llama2-13b-dpo-v7
    • bigscience-mt0-xxl
    Edit the following environment variables for each of these models that you install:
    • ESTIMATE_MEMORY_BATCH_SIZE=4
    • ESTIMATE_MEMORY=off
    You can use the following script to make the required environment variable changes for each model. Before you run the script, set the model ID in the $model_id variable and the instance namespace in the $PROJECT_CPD_INST_OPERANDS variable.
    #!/bin/bash

    # Set the model ID and the instance namespace before you run the script.
    export model_id=<replace with model ID>
    export PROJECT_CPD_INST_OPERANDS=<replace with instance-namespace>

    # Read the environment variables that are currently set on the model's InferenceService.
    current_envs=$(oc get isvc "$model_id" -n "$PROJECT_CPD_INST_OPERANDS" -o jsonpath='{.spec.predictor.model.env}')

    # Check whether the required values are already present.
    memory_batch_check=$(echo "$current_envs" | jq 'map(select(.name == "ESTIMATE_MEMORY_BATCH_SIZE" and .value == "4")) | length > 0')
    estimate_memory_check=$(echo "$current_envs" | jq 'map(select(.name == "ESTIMATE_MEMORY" and .value == "off")) | length > 0')

    # Add ESTIMATE_MEMORY_BATCH_SIZE=4 if it is not already set.
    if [ "$memory_batch_check" = "false" ]; then oc patch InferenceService "$model_id" -n "$PROJECT_CPD_INST_OPERANDS" \
      --type=json \
      -p='[{"op": "add", "path": "/spec/predictor/model/env/-", "value": {"name":"ESTIMATE_MEMORY_BATCH_SIZE","value":"4"}}]' ; fi

    # Add ESTIMATE_MEMORY=off if it is not already set.
    if [ "$estimate_memory_check" = "false" ]; then oc patch InferenceService "$model_id" -n "$PROJECT_CPD_INST_OPERANDS" \
      --type=json \
      -p='[{"op": "add", "path": "/spec/predictor/model/env/-", "value": {"name":"ESTIMATE_MEMORY","value":"off"}}]' ; fi

What to do next

IBM watsonx.ai is ready to use.
Full-service installation
To get started with the available tools, see Developing generative AI solutions with foundation models.
watsonx.ai lightweight engine installation
See Getting started with watsonx.ai Lightweight Engine.