Adding foundation models to IBM watsonx.ai
To submit input to a foundation model from IBM watsonx.ai, you must deploy the models that you want to use in your cluster. To do so, patch the custom resource for the service to add foundation models.
Before you begin
The IBM watsonx.ai service must be installed.
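If you are not sure whether the service is installed, a quick check is to confirm that the watsonxaiifm custom resource exists in the operands project. The following command is a minimal sketch; it assumes the custom resource is named watsonxaiifm-cr, which matches the patch commands later in this topic.

```
# Minimal check (a sketch): confirm that the watsonxaiifm custom resource exists.
# Review the resource status (for example, with -o yaml) to confirm that the
# installation completed before you add models.
oc get watsonxaiifm watsonxaiifm-cr \
  --namespace=${PROJECT_CPD_INST_OPERANDS}
```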
Make sure that you have the necessary resources available to support the models that you want to use. For more information about the overall resources that are required for the service, see Hardware requirements. The service supports the following GPU types:
- NVIDIA A100 80 GB
- NVIDIA H100 80 GB
- NVIDIA L40S 48 GB
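To get a rough view of the GPU capacity that is currently available on your cluster nodes, you can run a command like the following sketch. It simply filters the node descriptions for the NVIDIA GPU resource; the exact output depends on how the GPU operator is configured in your cluster.

```
# List each node and the NVIDIA GPU resources that it reports (capacity, allocatable,
# and current requests). Adjust the filter if your nodes expose a different resource name.
oc describe nodes | grep -E '^Name:|nvidia.com/gpu'
```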
Each model uses a subset of the total resources that are required for the service. The following table lists the resources that each model requires.
| Model name | Model ID | Description | Required GPUs | CPU and memory requirements | Storage requirements |
|---|---|---|---|---|---|
| codellama-34b-instruct | codellama-codellama-34b-instruct-hf | Code Llama is an AI model built on top of Llama 2, fine-tuned for generating and discussing code. Note: This model was introduced with the 4.8.4 release. | 1 | 2 CPU, 128 GB RAM | 77 GB |
| elyza-japanese-llama-2-7b-instruct | elyza-japanese-llama-2-7b-instruct | General use with zero- or few-shot prompts. Works well for classification and extraction in Japanese and for translation between English and Japanese. Performs best when prompted in Japanese. Note: This foundation model was introduced with the 4.8.3 release. | 1 | 2 CPU, 128 GB RAM | 50 GB |
| flan-t5-xl-3b | google-flan-t5-xl | General use with zero- or few-shot prompts. Note: This foundation model was introduced with the 4.8.1 release and can be tuned by using the Tuning Studio. | 1 | 2 CPU, 128 GB RAM | 21 GB |
| flan-t5-xxl-11b | google-flan-t5-xxl | General use with zero- or few-shot prompts. | 1 | 2 CPU, 128 GB RAM | 52 GB |
| flan-ul2-20b | google-flan-ul2 | General use with zero- or few-shot prompts. Note: If you want to use this model with L40S GPUs, you must take some extra steps. See Step 3 in the procedure. | 1 A100 or H100 GPU, or 2 L40S GPUs | 2 CPU (A100 or H100) or 3 CPU (L40S), 128 GB RAM | 85 GB |
| gpt-neox-20b (Deprecated in 4.8.4) | eleutherai-gpt-neox-20b | Works best with few-shot prompts. Works well with special characters, which can be used to generate structured output. | 1 | 2 CPU, 128 GB RAM | 100 GB |
| granite-8b-japanese | ibm-granite-8b-japanese | The Granite model series is a family of IBM-trained, dense decoder-only models, which are particularly well-suited for generative tasks. Note: This model was introduced with the 4.8.4 release. | 1 | 2 CPU, 128 GB RAM | 50 GB |
| granite-13b-chat-v2 | ibm-granite-13b-chat-v2 | General use model from IBM that is optimized for dialogue use cases. | 1 | 2 CPU, 128 GB RAM | 36 GB |
| granite-13b-chat-v1 (Deprecated in 4.8.4) | ibm-granite-13b-chat-v1 | General use model from IBM that is optimized for dialogue use cases. | 1 | 2 CPU, 128 GB RAM | 36 GB |
| granite-13b-instruct-v2 | ibm-granite-13b-instruct-v2 | General use model from IBM that is optimized for question and answer use cases. Note: This model was introduced with the 4.8.1 release. | 1 | 2 CPU, 128 GB RAM | 62 GB |
| granite-13b-instruct-v1 (Deprecated in 4.8.4) | ibm-granite-13b-instruct-v1 | General use model from IBM that is optimized for question and answer use cases. | 1 | 2 CPU, 128 GB RAM | 36 GB |
| granite-20b-multilingual | ibm-granite-20b-multilingual | The Granite model series is a family of IBM-trained, dense decoder-only models, which are particularly well-suited for generative tasks. Note: This model was introduced with the 4.8.4 release. | 1 | 2 CPU, 128 GB RAM | 100 GB |
| jais-13b-chat | core42-jais-13b-chat | General use foundation model for generative tasks in Arabic. Note: This model was introduced with the 4.8.5 release. | 1 | 2 CPU, 96 GB RAM | 60 GB |
| llama-2-13b-chat | meta-llama-llama-2-13b-chat | General use with zero- or few-shot prompts. Optimized for dialogue use cases. | 1 | 2 CPU, 128 GB RAM | 62 GB |
| llama-2-70b-chat | meta-llama-llama-2-70b-chat | General use with zero- or few-shot prompts. Optimized for dialogue use cases. | 4 | 5 CPU, 250 GB RAM | 150 GB |
| llama2-13b-dpo-v7 | mncai-llama-2-13b-dpo-v7 | General use foundation model for generative tasks in Korean. Note: This model was introduced with the 4.8.5 release. | 1 | 2 CPU, 96 GB RAM | 30 GB |
| mixtral-8x7b-instruct-v01-q | ibm-mistralai-mixtral-8x7b-instruct-v01-q | The mixtral-8x7b-instruct-v01-gptq model is made with AutoGPTQ, which mainly leverages the quantization technique to 'compress' the model weights from FP16 to 4-bit INT and performs 'decompression' on the fly before computation (in FP16). Note: This model was introduced with the 4.8.4 release. | 1 | 2 CPU, 128 GB RAM | 35 GB |
| mpt-7b-instruct2 (Deprecated in 4.8.4) | ibm-mpt-7b-instruct2 | General use with few-shot prompts. | 1 | 2 CPU, 128 GB RAM | 50 GB |
| mt0-xxl-13b | bigscience-mt0-xxl | General use with zero- or few-shot prompts. Supports prompts in languages other than English and multilingual prompts. | 1 | 2 CPU, 128 GB RAM | 62 GB |
| starcoder-15.5b (Deprecated in 4.8.4) | bigcode-starcoder | Optimized for code generation and code conversion. | 1 | 2 CPU, 128 GB RAM | 138 GB |
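Before you decide which models to add, it can help to check which model IDs are already in the install_model_list of the custom resource. The following command is a sketch that reads the field with a JSONPath query; the field name matches the patch commands in the procedure.

```
# Print the model IDs that are currently listed in the custom resource
# (empty output means that no models have been added yet).
oc get watsonxaiifm watsonxaiifm-cr \
  --namespace=${PROJECT_CPD_INST_OPERANDS} \
  -o jsonpath='{.spec.install_model_list}{"\n"}'
```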
Procedure
To deploy foundation models in your cluster:
1. Decide which models you want to use, and then make a note of their model IDs.

   Tip: Because of the large resource demands of foundation models, you might want to install only the models that you know you want to use right away at the time that the service is installed.

2. To add models, patch the deployment by running the following command:

   ```
   oc patch watsonxaiifm watsonxaiifm-cr \
   --namespace=${PROJECT_CPD_INST_OPERANDS} \
   --type=merge \
   --patch='{"spec":{"install_model_list": ["<model-id1>","<model-id2>"]}}'
   ```

   Replace the <model-id> variables with the IDs of the models that you want to use. For example:

   ```
   oc patch watsonxaiifm watsonxaiifm-cr \
   --namespace=${PROJECT_CPD_INST_OPERANDS} \
   --type=merge \
   --patch='{"spec":{"install_model_list": ["google-flan-t5-xxl","bigcode-starcoder"]}}'
   ```

   To confirm that the patch was applied, see the verification sketch after this procedure.

3. To use the flan-ul2-20b foundation model with L40S GPUs only:

   1. Run the following patch command to adjust the model configuration so that the model is optimized for use with L40S GPUs:

      ```
      oc patch -n ${PROJECT_CPD_INST_OPERANDS} watsonxaiifm watsonxaiifm-cr \
      --type merge -p '{"spec": {
        "google_flan_ul2": {
          "cuda_visible_devices": "0,1",
          "deployment_framework": "hf_custom_tp",
          "deployment_yaml_name": "flan-ul2-deployment.yaml.j2",
          "dir_name": "models--google--flan-ul2",
          "dtype_str": "float16",
          "force_apply": "false",
          "hf_modules_cache": "/tmp/huggingface/modules",
          "max_batch_size": "128",
          "max_batch_weight": "34543200",
          "max_concurrent_requests": "150",
          "max_new_tokens": "4096",
          "max_sequence_length": "4096",
          "model_name": "google/flan-ul2",
          "model_root_dir": "/watsonxaiifm-models",
          "num_gpus": "2",
          "num_shards": "2",
          "pt2_compile": "false",
          "pvc_name": "google-flan-ul2-pvc-rev2",
          "pvc_size": "85Gi",
          "svc_name": "flan-ul2"
        },
        "google_flan_ul2_resources": {
          "limits": {"cpu": "3", "ephemeral-storage": "1Gi", "memory": "128Gi", "nvidia.com/gpu": "2"},
          "requests": {"cpu": "2", "ephemeral-storage": "10Mi", "memory": "4Gi", "nvidia.com/gpu": "2"}
        }
      }}'
      ```

   2. Add the flan-ul2-20b model to the list of models to install by patching the deployment:

      ```
      oc patch watsonxaiifm watsonxaiifm-cr \
      --namespace=${PROJECT_CPD_INST_OPERANDS} \
      --type=merge \
      --patch='{"spec":{"install_model_list": ["google-flan-ul2"]}}'
      ```

   3. If you later want to remove the flan-ul2-20b model, run the following patch command to remove the model configuration values from the custom resource:

      ```
      oc patch -n ${PROJECT_CPD_INST_OPERANDS} watsonxaiifm watsonxaiifm-cr \
      --type json -p '[{ "op": "remove", "path": "/spec/google_flan_ul2"},
      { "op": "remove", "path": "/spec/google_flan_ul2_resources"}]'
      ```
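After you patch the custom resource, you can confirm that the model IDs were added and watch the model-serving pods start. The following commands are a sketch: the JSONPath query reads the same install_model_list field that the patch commands set, and the pod filter (flan in this example) is only an illustration; adjust it to match the models that you installed.

```
# Confirm that the new model IDs appear in the custom resource.
oc get watsonxaiifm watsonxaiifm-cr \
  --namespace=${PROJECT_CPD_INST_OPERANDS} \
  -o jsonpath='{.spec.install_model_list}{"\n"}'

# Watch for the corresponding model-serving pods to reach the Running state.
# The grep filter is an example only; adjust it for the models that you installed.
oc get pods --namespace=${PROJECT_CPD_INST_OPERANDS} | grep -i flan
```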
You can add more models at any time after the initial service setup.
What to do next
IBM watsonx.ai is ready to use. To get started, see Developing generative AI solutions with foundation models.