Adding custom foundation models to watsonx.ai Lightweight Engine
Review the curated foundation models that are available with IBM watsonx.ai to check whether an existing model meets your needs; you can add a curated model to the service with simple CLI commands. See Supported foundation models.
Prerequisites
- The IBM watsonx.ai service must be installed in lightweight engine mode.
- The model must be compatible with the Text Generation Inference (TGI) standard and be built with a supported model architecture type. See Supported foundation model architectures.
- The file list for the model must contain a config.json file.
- The model must be in safetensors format and include a tokenizer.
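You can verify the last two prerequisites by inspecting the downloaded model files. A minimal sketch, assuming the model was downloaded to ./falcon-7b (an illustrative path):

# The listing should include config.json, tokenizer files (for example,
# tokenizer.json or tokenizer_config.json), and one or more *.safetensors shards.
ls ./falcon-7b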
Supported foundation model architectures
To check the architecture of a foundation model, find the config.json file
for the foundation model, and then check the model_type value.
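For example, a quick way to read the value from the command line, assuming the model files are in ./falcon-7b (an illustrative path) and that the jq utility is installed:

# Print the model_type value from the model's config.json.
jq -r '.model_type' ./falcon-7b/config.json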
| Model type | Supported quantization methods | Deployment framework | Flash attention |
|---|---|---|---|
| bloom | Not applicable | tgis_native | false |
| codegen | Not applicable | hf_transformers | Not applicable |
| falcon | Not applicable | tgis_native | true |
| gpt_bigcode | GPTQ | tgis_native | true |
| gpt_neox | Not applicable | tgis_native | true |
| gptj | Not applicable | hf_transformers | Not applicable |
| llama | GPTQ | tgis_native | true |
| llama2 | GPTQ | tgis_native | true |
| t5 | Not applicable | tgis_native | false |
| mistral | Not applicable | hf_transformers | Not applicable |
| mpt | Not applicable | hf_transformers | Not applicable |
| mt5 | Not applicable | hf_transformers | Not applicable |
| sphinx | Not applicable | Not applicable | Not applicable |
- Deployment framework: Specifies the library to use for loading the foundation model. Although some foundation models support more than one deployment method, the table lists the deployment framework in which foundation models with the specified architecture perform best.
- Flash attention: A mechanism that is used to scale transformer-based models more efficiently and enable faster inferencing. To host the model correctly, the service needs to know whether the model uses the flash attention mechanism.
- Quantization methods: The post-training quantization for generative pre-trained transformers (GPTQ) method of quantization is supported for foundation models with the architectures that are listed.
Procedure
- Upload the model.
  Follow the steps in the Setting up storage and uploading the model procedure. Make a note of the pvc_name for the persistent volume claim where you store the downloaded model source files.
  Important: Complete only the storage setup and model download tasks, and then return to this procedure. Other steps in the full-service installation instructions describe how to create a deployment to host the custom foundation model. You do not need to set up a deployment to use custom foundation models from a watsonx.ai lightweight engine installation.
- Create a ConfigMap file for the custom foundation model.
  ConfigMap files are used by the Red Hat® OpenShift® AI layer of the service to serve configuration information to independent containers that run in pods or to other system components, such as controllers. See Creating a ConfigMap file.
- To register the custom foundation model, apply the ConfigMap file by using the following command. The service operator picks up the configuration information and applies it to your cluster. (An optional verification check follows this procedure.)

  oc apply -f configmap.yml

- Check the status of the service by using the following command. When Completed is returned, the custom foundation models are ready for use.

  oc get watsonxaiifm -n ${PROJECT_CPD_INST_OPERANDS}
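Optionally, you can confirm that the ConfigMap was created before the operator finishes reconciling. A sketch that filters on the syom label from the template in the next section, assuming the ConfigMap was applied in the operands project:

# List the registered custom-model ConfigMaps by label.
oc get configmap -n ${PROJECT_CPD_INST_OPERANDS} -l syom=watsonxaiifm_extra_models_config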
Creating a ConfigMap file
apiVersion: v1
kind: ConfigMap
metadata:
name: <model name with dash as delimiter>
labels:
syom: watsonxaiifm_extra_models_config
data:
model: |
<model name with underscore as delimiter>:
pvc_name: <pvc where model is downloaded>
pvc_size: <size of pvc where model is downloaded>
isvc_yaml_name: isvc.yaml.j2
dir_name: <directory inside of pvc where model content is downloaded>
force_apply: no
command: ["text-generation-launcher"]
serving_runtime: tgis-serving-runtime
storage_uri: pvc://<pvc where model is downloaded>/
env:
- name: MODEL_NAME
value: /mnt/models
- name: CUDA_VISIBLE_DEVICES
value: "0" # if sharding, change to the value of shard. If sharding in 2, specify "0,1". If sharding in 4, specify "0,1,2,3".
- name: TRANSFORMERS_CACHE
value: /mnt/models/
- name: HUGGINGFACE_HUB_CACHE
value: /mnt/models/
- name: DTYPE_STR
value: "<value>"
- name: MAX_SEQUENCE_LENGTH
value: "<value>"
- name: MAX_BATCH_SIZE
value: "<value>"
- name: MAX_CONCURRENT_REQUESTS
value: "<value>"
- name: MAX_NEW_TOKENS
value: "<value>"
- name: FLASH_ATTENTION
value: "<value>"
- name: DEPLOYMENT_FRAMEWORK
value: "<value>"
- name: HF_MODULES_CACHE
value: /tmp/huggingface/modules
annotations:
cloudpakId: 5e4c7dd451f14946bc298e18851f3746
cloudpakName: IBM watsonx.ai
productChargedContainers: All
productCloudpakRatio: "1:1"
productID: 3a6d4448ec8342279494bc22e36bc318
productMetric: VIRTUAL_PROCESSOR_CORE
productName: IBM Watsonx.ai
productVersion: <watsonx.ai ifm version>
cloudpakInstanceId: <cloudpak instance id>
labels_syom:
app.kubernetes.io/managed-by: ibm-cpd-watsonx-ai-ifm-operator
app.kubernetes.io/instance: watsonxaiifm
app.kubernetes.io/name: watsonxaiifm
icpdsupport/addOnId: watsonx_ai_ifm
icpdsupport/app: api
release: watsonxaiifm
icpdsupport/module: <model name with dash as delimiter>
app: text-<model name with dash as delimiter>
component: fmaas-inference-server # won't override value predictor
bam-placement: colocate
syom_model: <model name with single hyphens as delimiters, except for the first delimiter, which uses two hyphens>
args:
- "--port=3000"
- "--grpc-port=8033"
- "--num-shard=1" # shard setting for the GPU (2,4, or 8)
wx_inference_proxy:
<model id>:
enabled:
- "true"
label: "<label of model>"
provider: "< provider of model>"
source: "Hugging Face"
tags:
- consumer_public
# Description needs to be updated
short_description: "<short discription of model>"
long_description: "<long discription of model>"
task_ids:
- question_answering
- generation
- summarization
- classification
- extraction
tasks_info:
question_answering:
task_ratings:
quality: 0
cost: 0
generation:
task_ratings:
quality: 0
cost: 0
summarization:
task_ratings:
quality: 0
cost: 0
classification:
task_ratings:
quality: 0
cost: 0
extraction:
task_ratings:
quality: 0
cost: 0
min_shot_size: 1
tier: "class_2"
number_params: "13b"
lifecycle:
available:
since_version: "<first watsonx.ai IFM version where this model is supported>"
<model name with underscore as delimiter>_resources:
limits:
cpu: "2"
memory: 128Gi
nvidia.com/gpu: "1" # shard setting for the GPU (2,4,or 8)
ephemeral-storage: 1Gi
requests:
cpu: "1"
memory: 4Gi
nvidia.com/gpu: "1" # shard setting for the GPU (2,4,or 8)
ephemeral-storage: 10Mi
<model name with underscore as delimiter>_replicas: 1

The following table lists the variables for you to replace in the template.

| ConfigMap field | Description |
|---|---|
| metadata.name | Model name with hyphens as delimiters. For example, if the model name is tiiuae/falcon-7b, specify tiiuae-falcon-7b. |
| data.model.<full_model_name> | Model name with underscores as delimiters (<full_model_name>). For example, if the model name is tiiuae/falcon-7b, specify tiiuae_falcon_7b. |
| data.model.<full_model_name>.pvc_name | Persistent volume claim where the model source files are stored. Use the pvc_name that you noted in an earlier step. For example, tiiuae-falcon-7b-pvc. |
| data.model.<full_model_name>.pvc_size | Size of the persistent volume claim where the model source files are stored. For example, 60Gi. |
| data.model.<full_model_name>.dir_name | Directory where the model content is stored. This value matches the MODEL_PATH from the model download job. For example, models--tiiuae-falcon-7b. |
| data.model.<full_model_name>.storage_uri | Uniform resource identifier for the directory where the model source files are stored, with the syntax pvc://<pvc where model is downloaded>/. For example, pvc://tiiuae-falcon-7b-pvc/. |
| data.model.<full_model_name>.env.DTYPE_STR | Data type of text strings that the model can process. For example, float16. For more information about supported values, see Global parameters for custom foundation models. |
| data.model.<full_model_name>.env.MAX_BATCH_SIZE | Maximum batch size. For more information about supported values, see Global parameters for custom foundation models. |
| data.model.<full_model_name>.env.MAX_CONCURRENT_REQUESTS | Maximum number of concurrent requests that the model can handle. For example, 1024. For more information about supported values, see Global parameters for custom foundation models. |
| data.model.<full_model_name>.env.MAX_NEW_TOKENS | Maximum number of tokens that the model can generate for a text inference request. For example, 2047. For more information about supported values, see Global parameters for custom foundation models. |
| data.model.<full_model_name>.env.FLASH_ATTENTION | Specify the value from the Flash attention column of the Supported foundation model architectures table. If the value is Not applicable, remove this entry from the ConfigMap file. |
| data.model.<full_model_name>.env.DEPLOYMENT_FRAMEWORK | Specify the value from the Deployment framework column of the Supported foundation model architectures table. If the value is Not applicable, remove this entry from the ConfigMap file. |
| data.model.<full_model_name>.annotations.productVersion | The IBM watsonx.ai service operator version. For example, 9.1.0. |
| data.model.<full_model_name>.annotations.cloudpakInstanceId | The Cloud Pak for Data instance ID. For example, b0871d64-ceae-47e9-b186-6e336deaf1f1. |
| data.model.<full_model_name>.labels_syom.icpdsupport/module | Model name with hyphens as delimiters. For example, if the model name is tiiuae/falcon-7b, specify tiiuae-falcon-7b. |
| data.model.<full_model_name>.labels_syom.app | Model name with hyphens as delimiters and prefixed with text-. For example, if the model name is tiiuae/falcon-7b, specify text-tiiuae-falcon-7b. |
| data.model.<full_model_name>.labels_syom.syom_model | Model name with single hyphens as delimiters, except for the first delimiter, which uses two hyphens. For example, tiiuae--falcon-7b. |
| data.model.<full_model_name>.wx_inference_proxy.<full/model_name> | Model ID (<full/model_name>). For example, tiiuae/falcon-7b. |
| data.model.<full_model_name>.wx_inference_proxy.<full/model_name>.label | Model name without the provider prefix. For example, falcon-7b. |
| data.model.<full_model_name>.wx_inference_proxy.<full/model_name>.provider | Model provider. For example, tiiuae. |
| data.model.<full_model_name>.wx_inference_proxy.<full/model_name>.short_description | Short description of the model in fewer than 100 characters. |
| data.model.<full_model_name>.wx_inference_proxy.<full/model_name>.long_description | Long description of the model. |
| data.model.<full_model_name>.wx_inference_proxy.<full/model_name>.min_shot_size | Minimum shot size. For example, 1. |
| data.model.<full_model_name>.wx_inference_proxy.<full/model_name>.tier | Model tier. For example, class_2. |
| data.model.<full_model_name>.wx_inference_proxy.<full/model_name>.number_params | Number of model parameters. For example, 7b. |
| data.model.<full_model_name>.wx_inference_proxy.<full/model_name>.lifecycle.available.since_version | The first IBM watsonx.ai service operator version in which the model was added. For example, 9.1.0. |
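Several of these fields are the same model name with different delimiters. The following bash sketch (names are illustrative) derives each variant from a Hugging Face model ID:

MODEL_ID="tiiuae/falcon-7b"                         # Hugging Face model ID: <provider>/<name>
NAME_DASHES="${MODEL_ID//\//-}"                     # tiiuae-falcon-7b: metadata.name, icpdsupport/module
NAME_UNDERSCORES="$(tr '/-' '__' <<<"$MODEL_ID")"   # tiiuae_falcon_7b: data.model key
SYOM_MODEL="${MODEL_ID/\//--}"                      # tiiuae--falcon-7b: syom_model
APP_LABEL="text-${NAME_DASHES}"                     # text-tiiuae-falcon-7b: app label
echo "$NAME_DASHES $NAME_UNDERSCORES $SYOM_MODEL $APP_LABEL"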
Sample ConfigMap file
apiVersion: v1
kind: ConfigMap
metadata:
name: tiiuae-falcon-7b
labels:
syom: watsonxaiifm_extra_models_config
data:
model: |
tiiuae_falcon_7b:
pvc_name: tiiuae-falcon-7b-pvc
pvc_size: 60Gi
isvc_yaml_name: isvc.yaml.j2
dir_name: model
force_apply: no
command: ["text-generation-launcher"]
serving_runtime: tgis-serving-runtime
storage_uri: pvc://tiiuae-falcon-7b-pvc/
env:
- name: MODEL_NAME
value: /mnt/models
- name: CUDA_VISIBLE_DEVICES
value: "0"
- name: TRANSFORMERS_CACHE
value: /mnt/models/
- name: HUGGINGFACE_HUB_CACHE
value: /mnt/models/
- name: DTYPE_STR
value: "float16"
- name: MAX_SEQUENCE_LENGTH
value: "2048"
- name: MAX_BATCH_SIZE
value: "256"
- name: MAX_CONCURRENT_REQUESTS
value: "1024"
- name: MAX_NEW_TOKENS
value: "2047"
- name: FLASH_ATTENTION
value: "true"
- name: DEPLOYMENT_FRAMEWORK
value: "tgis_native"
- name: HF_MODULES_CACHE
value: /tmp/huggingface/modules
annotations:
cloudpakId: 5e4c7dd451f14946bc298e18851f3746
cloudpakName: IBM watsonx.ai
productChargedContainers: All
productCloudpakRatio: "1:1"
productID: 3a6d4448ec8342279494bc22e36bc318
productMetric: VIRTUAL_PROCESSOR_CORE
productName: IBM Watsonx.ai
productVersion: 9.1.0
cloudpakInstanceId: b0871d64-ceae-47e9-b186-6e336deaf1f1
labels_syom:
app.kubernetes.io/managed-by: ibm-cpd-watsonx-ai-ifm-operator
app.kubernetes.io/instance: watsonxaiifm
app.kubernetes.io/name: watsonxaiifm
icpdsupport/addOnId: watsonx_ai_ifm
icpdsupport/app: api
release: watsonxaiifm
icpdsupport/module: tiiuae-falcon-7b
app: text-tiiuae-falcon-7b
component: fmaas-inference-server # won't override value predictor
bam-placement: colocate
syom_model: "tiiuae--falcon-7b"
args:
- "--port=3000"
- "--grpc-port=8033"
- "--num-shard=1"
volumeMounts:
- mountPath: /opt/caikit/prompt_cache
name: prompt-cache-dir
subPath: prompt_cache
volumes:
- name: prompt-cache-dir
persistentVolumeClaim:
claimName: fmaas-caikit-inf-prompt-tunes-prompt-cache
wx_inference_proxy:
tiiuae/falcon-7b:
enabled:
- "true"
label: "falcon-7b"
provider: "tiiuae"
source: "Hugging Face"
functions:
- text_generation
tags:
- consumer_public
short_description: "A 7B parameters causal decoder-only model built by TII based on Falcon-7B and finetuned on a mixture of chat and instruct datasets."
long_description: "Out-of-Scope Use: Production use without adequate assessment of risks and mitigation; any use cases which may be considered irresponsible or harmful./n/nBias, Risks, and Limitations: Falcon-7B-Instruct is mostly trained on English data, and will not generalize appropriately to other languages. Furthermore, as it is trained on a large-scale corpora representative of the web, it will carry the stereotypes and biases commonly encountered online./n/nRecommendations:Users of Falcon-7B-Instruct should develop guardrails and take appropriate precautions for any production use."
task_ids:
- question_answering
- generation
- summarization
- extraction
tasks_info:
question_answering:
task_ratings:
quality: 0
cost: 0
generation:
task_ratings:
quality: 0
cost: 0
summarization:
task_ratings:
quality: 0
cost: 0
extraction:
task_ratings:
quality: 0
cost: 0
min_shot_size: 1
tier: "class_2"
number_params: "7b"
lifecycle:
available:
since_version: "9.1.0"
tiiuae_falcon_7b_resources:
limits:
cpu: "2"
memory: 60Gi
nvidia.com/gpu: "1"
ephemeral-storage: 1Gi
requests:
cpu: "1"
memory: 60Gi
nvidia.com/gpu: "1"
ephemeral-storage: 10Mi
tiiuae_falcon_7b_replicas: 1

What to do next
To test the custom foundation model that you added to a watsonx.ai lightweight engine installation, submit an inference request to the model programmatically. For more details, see Getting started with watsonx.ai Lightweight Engine.
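For example, a minimal curl sketch, assuming the standard watsonx.ai text generation REST endpoint and that the WXAI_URL and TOKEN environment variables hold your instance URL and bearer token (verify the exact endpoint and authentication details in the Getting started documentation):

# Send a text inference request to the custom model.
curl -sk -X POST "$WXAI_URL/ml/v1/text/generation?version=2023-05-29" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "tiiuae/falcon-7b",
    "input": "What is a foundation model?",
    "parameters": { "max_new_tokens": 100 }
  }'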