Adding custom foundation models to watsonx.ai Lightweight Engine

If the curated set of models in IBM watsonx.ai does not include the foundation model that you want to use for inferencing from your watsonx.ai lightweight engine installation, you can install your own custom model.

To review the curated foundation models that are available with IBM watsonx.ai to check whether an existing model might meet your needs, see Foundation models in IBM watsonx.ai.

Important: If you installed the full IBM watsonx.ai service, you follow different steps to add custom foundation models. For more information, see Deploying custom foundation models in IBM watsonx.ai.

Prerequisites

The IBM watsonx.ai service must be installed in lightweight engine mode.

Supported foundation model architectures

To check the architecture of a foundation model, find the config.json file for the foundation model, and then check the model_type value.
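
For example, if you already downloaded the model source files, you can check the model type from a command line. The file path in this sketch is illustrative; substitute the location of your model files:

grep '"model_type"' /path/to/model/config.json

The command prints the line that contains the model_type value, which you can compare against the tables that follow.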

The following table lists the general-purpose model architectures that are supported by the watsonx.ai lightweight engine:
| Model type | Supported quantization methods | Deployment framework | Flash attention |
| --- | --- | --- | --- |
| bloom | Not applicable | tgis_native | false |
| codegen | Not applicable | hf_transformers | Not applicable |
| falcon | Not applicable | tgis_native | true |
| gemma2 | Not applicable | Not applicable | Not applicable |
| gpt_bigcode | GPTQ | tgis_native | true |
| gpt_neox | Not applicable | tgis_native | true |
| gptj | Not applicable | hf_transformers | Not applicable |
| llama | GPTQ | tgis_native | true |
| llama2 | GPTQ | tgis_native | true |
| mistral | Not applicable | hf_transformers | Not applicable |
| mixtral | GPTQ | Not applicable | Not applicable |
| mpt | Not applicable | hf_transformers | Not applicable |
| mt5 | Not applicable | hf_transformers | Not applicable |
| nemotron | Not applicable | Not applicable | Not applicable |
| olmo | Not applicable | Not applicable | Not applicable |
| persimmon | Not applicable | Not applicable | Not applicable |
| phi | Not applicable | Not applicable | Not applicable |
| phi3 | Not applicable | Not applicable | Not applicable |
| qwen2 | AWQ | Not applicable | Not applicable |
| sphinx | Not applicable | Not applicable | Not applicable |
| t5 | Not applicable | tgis_native | false |
The following table lists the time-series model architectures that are supported by the watsonx.ai lightweight engine:
| Model type | Supported quantization methods | Deployment framework | Flash attention |
| --- | --- | --- | --- |
| tinytimemixer | Not applicable | Not applicable | Not applicable |
Use the tables to look up the deployment framework and flash attention values for your model type; you add these values to the ConfigMap file that you define for the custom foundation model. For example, a model with the llama architecture is deployed with the tgis_native framework and uses flash attention.
Deployment framework
Specifies the library to use for loading the foundation model. Although some foundation models support more than one deployment method, the table lists the deployment framework in which foundation models with the specified architecture perform best.
Flash attention
Flash attention is a mechanism that is used to scale transformer-based models more efficiently and enable faster inferencing. To host the model correctly, the service needs to know whether the model uses the flash attention mechanism.
Quantization methods
Quantization is a process that reduces the amount of compute resources and memory that is used when you inference a foundation model. You can set the following quantization methods for foundation models with the architectures that are listed:
  • Post-training quantization for generative pre-trained transformers (GPTQ)
  • Activation-aware weight quantization (AWQ)
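
A quick way to check whether a model checkpoint is already quantized is to look for a quantization_config section in its config.json file; GPTQ- and AWQ-quantized checkpoints that are published on Hugging Face typically include one. The file path in this sketch is illustrative:

grep -A 3 '"quantization_config"' /path/to/model/config.json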

Procedure

A system administrator must complete these steps to add a custom foundation model to the IBM watsonx.ai lightweight engine.
  1. Upload the model.

    Follow the steps in the Setting up storage and uploading the model procedure.

    Make a note of the pvc_name for the persistent volume claim where you store the downloaded model source files.

    Important: Complete only the storage setup and model download tasks, and then return to this procedure. Other steps in the full-service installation instructions describe how to create a deployment to host the custom foundation model. You do not need to set up a deployment to use custom foundation models from a watsonx.ai lightweight engine installation.
  2. Create a ConfigMap file for the custom foundation model.

    ConfigMap files are used by the Red Hat® OpenShift® AI layer of the service to serve configuration information to independent containers that run in pods or to other system components, such as controllers. See Creating a ConfigMap file.

  3. To register the custom foundation model, apply the ConfigMap file by using the following command:
    oc apply -f configmap.yml
    The service operator picks up the configuration information and applies it to your cluster.
  4. Check the status of the service by using the following command. When the status Completed is returned, the custom foundation model is ready for use.
    oc get watsonxaiifm -n ${PROJECT_CPD_INST_OPERANDS}
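
If you prefer to wait on the status from a script, you can poll the service with a loop like the following minimal sketch. It assumes that the status appears in the second column of the default oc get output; adjust the awk column index if your output differs.

# Poll the watsonxaiifm service until it reports Completed.
while true; do
  # Column 2 of the default output is assumed to hold the status.
  status=$(oc get watsonxaiifm -n ${PROJECT_CPD_INST_OPERANDS} --no-headers | awk '{print $2}')
  echo "watsonxaiifm status: ${status}"
  if [ "${status}" = "Completed" ]; then
    break
  fi
  sleep 30
done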

Creating a ConfigMap file

Create a ConfigMap file for the custom foundation model by copying the following template, and then replacing the variables in the template with the appropriate values for your foundation model.
apiVersion: v1
kind: ConfigMap
metadata:
  name: <model name with dash as delimiter>
  labels:
    syom: watsonxaiifm_extra_models_config
data:
  model: |
    <model name with underscore as delimiter>:
      pvc_name: <pvc where model is downloaded>
      pvc_size: <size of pvc where model is downloaded>
      isvc_yaml_name: isvc.yaml.j2
      dir_name: <directory inside of pvc where model content is downloaded>
      force_apply: no
      command: ["text-generation-launcher"]
      serving_runtime: tgis-serving-runtime
      storage_uri: pvc://<pvc where model is downloaded>/
      env:
        - name: MODEL_NAME
          value: /mnt/models
        - name: CUDA_VISIBLE_DEVICES
          value: "0" # If sharding, list one GPU index per shard: "0,1" for 2 shards, "0,1,2,3" for 4 shards.
        - name: TRANSFORMERS_CACHE
          value: /mnt/models/
        - name: HUGGINGFACE_HUB_CACHE
          value: /mnt/models/
        - name: DTYPE_STR
          value: "<value>"
        - name: MAX_SEQUENCE_LENGTH
          value: "<value>"
        - name: MAX_BATCH_SIZE
          value: "<value>"
        - name: MAX_CONCURRENT_REQUESTS
          value: "<value>"
        - name: MAX_NEW_TOKENS
          value: "<value>"
        - name: FLASH_ATTENTION
          value: "<value>"
        - name: DEPLOYMENT_FRAMEWORK
          value: "<value>"
        - name: HF_MODULES_CACHE
          value: /tmp/huggingface/modules
      annotations:
        cloudpakId: 5e4c7dd451f14946bc298e18851f3746
        cloudpakName: IBM watsonx.ai
        productChargedContainers: All
        productCloudpakRatio: "1:1"
        productID: 3a6d4448ec8342279494bc22e36bc318
        productMetric: VIRTUAL_PROCESSOR_CORE
        productName: IBM Watsonx.ai
        productVersion: <watsonx.ai ifm version>
        cloudpakInstanceId: <cloudpak instance id>
      labels_syom:
        app.kubernetes.io/managed-by: ibm-cpd-watsonx-ai-ifm-operator
        app.kubernetes.io/instance: watsonxaiifm
        app.kubernetes.io/name: watsonxaiifm
        icpdsupport/addOnId: watsonx_ai_ifm
        icpdsupport/app: api
        release: watsonxaiifm
        icpdsupport/module: <model name with dash as delimiter>
        app: text-<model name with dash as delimiter> 
        component: fmaas-inference-server # won't override the predictor value
        bam-placement: colocate
        syom_model: <model name with single hyphens as delimiters, except for the first delimiter, which uses two hyphens>
      args:
        - "--port=3000"
        - "--grpc-port=8033"
        - "--num-shard=1" # shard setting for the GPU (2,4, or 8)
      wx_inference_proxy:
        <model id>:
          enabled:
          - "true"
          label: "<label of model>"
          provider: "< provider of model>"
          source: "Hugging Face"
          tags:
          - consumer_public
          # Update the descriptions for your model.
          short_description: "<short description of model>"
          long_description: "<long description of model>"
          task_ids:
          - question_answering
          - generation
          - summarization
          - classification
          - extraction
          tasks_info:
            question_answering:
              task_ratings:
                quality: 0
                cost: 0
            generation:
              task_ratings:
                quality: 0
                cost: 0
            summarization:
              task_ratings:
                quality: 0
                cost: 0
            classification:
              task_ratings:
                quality: 0
                cost: 0
            extraction:
              task_ratings:
                quality: 0
                cost: 0
          min_shot_size: 1
          tier: "class_2"
          number_params: "13b"
          lifecycle:
            available:
              since_version: "<first watsonx.ai IFM version where this model is supported>"
    <model name with underscore as delimiter>_resources:
      limits:
        cpu: "2"
        memory: 128Gi
        nvidia.com/gpu: "1" # shard setting for the GPU (2,4,or 8)
        ephemeral-storage: 1Gi
      requests:
        cpu: "1"
        memory: 4Gi
        nvidia.com/gpu: "1" # shard setting for the GPU (2,4,or 8)
        ephemeral-storage: 10Mi
    <model name with underscore as delimiter>_replicas: 1
The following table lists the variables for you to replace in the template.
| ConfigMap field | Description |
| --- | --- |
| metadata.name | Model name with hyphens as delimiters. For example, if the model name is tiiuae/falcon-7b, specify tiiuae-falcon-7b. |
| data.model.<full_model_name> | Model name with underscores as delimiters. For example, if the model name is tiiuae/falcon-7b, specify tiiuae_falcon_7b. |
| data.model.<full_model_name>.pvc_name | Persistent volume claim where the model source files are stored. Use the pvc_name that you noted in an earlier step. For example, tiiuae-falcon-7b-pvc. |
| data.model.<full_model_name>.pvc_size | Size of the persistent volume claim where the model source files are stored. For example, 60Gi. |
| data.model.<full_model_name>.dir_name | Directory where the model content is stored. This value matches the MODEL_PATH from the model download job. For example, models--tiiuae-falcon-7b. |
| data.model.<full_model_name>.storage_uri | Uniform resource identifier for the directory where the model source files are stored, with the syntax pvc://<pvc where model is downloaded>/. For example, pvc://tiiuae-falcon-7b-pvc/. |
| data.model.<full_model_name>.env.DTYPE_STR | Data type in which the model weights are loaded. For example, float16. For more information about supported values, see Global parameters for custom foundation models. |
| data.model.<full_model_name>.env.MAX_SEQUENCE_LENGTH | Maximum sequence length that the model supports. For example, 2048. For more information about supported values, see Global parameters for custom foundation models. |
| data.model.<full_model_name>.env.MAX_BATCH_SIZE | Maximum batch size. For example, 256. For more information about supported values, see Global parameters for custom foundation models. |
| data.model.<full_model_name>.env.MAX_CONCURRENT_REQUESTS | Maximum number of concurrent requests that the model can handle. For example, 1024. For more information about supported values, see Global parameters for custom foundation models. |
| data.model.<full_model_name>.env.MAX_NEW_TOKENS | Maximum number of tokens that the model can generate for a text inference request. For example, 2047. For more information about supported values, see Global parameters for custom foundation models. |
| data.model.<full_model_name>.env.FLASH_ATTENTION | Specify the value from the Flash attention column of the Supported foundation model architectures table. If the value is Not applicable, remove this entry from the ConfigMap file. |
| data.model.<full_model_name>.env.DEPLOYMENT_FRAMEWORK | Specify the value from the Deployment framework column of the Supported foundation model architectures table. If the value is Not applicable, remove this entry from the ConfigMap file. |
| data.model.<full_model_name>.annotations.productVersion | The IBM watsonx.ai service operator version. For example, 9.1.0. To get this value, use the following command: oc get watsonxaiifm watsonxaiifm-cr -o jsonpath="{.spec.version}" |
| data.model.<full_model_name>.annotations.cloudpakInstanceId | The IBM® Software Hub instance ID. For example, b0871d64-ceae-47e9-b186-6e336deaf1f1. To get this value, use the following command: oc get cm product-configmap -o jsonpath="{.data.CLOUD_PAK_INSTANCE_ID}" |
| data.model.<full_model_name>.labels_syom.icpdsupport/module | Model name with hyphens as delimiters. For example, if the model name is tiiuae/falcon-7b, specify tiiuae-falcon-7b. |
| data.model.<full_model_name>.labels_syom.app | Model name with hyphens as delimiters, prefixed with text-. For example, if the model name is tiiuae/falcon-7b, specify text-tiiuae-falcon-7b. |
| data.model.<full_model_name>.labels_syom.syom_model | Model name with single hyphens as delimiters, except for the first delimiter, which uses two hyphens. For example, tiiuae--falcon-7b. |
| data.model.<full_model_name>.wx_inference_proxy.<full/model_name> | Model ID with a slash as the delimiter between provider and model name. For example, tiiuae/falcon-7b. |
| data.model.<full_model_name>.wx_inference_proxy.<full/model_name>.label | Model name without the provider prefix. For example, falcon-7b. |
| data.model.<full_model_name>.wx_inference_proxy.<full/model_name>.provider | Model provider. For example, tiiuae. |
| data.model.<full_model_name>.wx_inference_proxy.<full/model_name>.short_description | Short description of the model, in fewer than 100 characters. |
| data.model.<full_model_name>.wx_inference_proxy.<full/model_name>.long_description | Long description of the model. |
| data.model.<full_model_name>.wx_inference_proxy.<full/model_name>.min_shot_size | Minimum shot size, that is, the minimum number of examples to include when prompting the model. For example, 1. |
| data.model.<full_model_name>.wx_inference_proxy.<full/model_name>.tier | Model tier. For example, class_2. |
| data.model.<full_model_name>.wx_inference_proxy.<full/model_name>.number_params | Number of model parameters. For example, 7b. |
| data.model.<full_model_name>.wx_inference_proxy.<full/model_name>.lifecycle.available.since_version | The first IBM watsonx.ai service operator version in which the model was added. For example, 9.1.0. |
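
Before you apply the completed ConfigMap file, you can optionally confirm the values that you substituted and validate the file without changing the cluster. This sketch assumes that the ${PROJECT_CPD_INST_OPERANDS} environment variable is set to your instance project:

# Confirm that the persistent volume claim that holds the model source files exists.
oc get pvc -n ${PROJECT_CPD_INST_OPERANDS}

# Validate the ConfigMap on the server without creating it.
oc apply -f configmap.yml --dry-run=server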

Sample ConfigMap file

The following sample ConfigMap file defines the tiiuae-falcon-7b custom foundation model.
apiVersion: v1
kind: ConfigMap
metadata:
  name: tiiuae-falcon-7b
  labels:
    syom: watsonxaiifm_extra_models_config
data:
  model: |
    tiiuae_falcon_7b:
      pvc_name: tiiuae-falcon-7b-pvc
      pvc_size: 60Gi
      isvc_yaml_name: isvc.yaml.j2
      dir_name: model
      force_apply: no
      command: ["text-generation-launcher"]
      serving_runtime: tgis-serving-runtime
      storage_uri: pvc://tiiuae-falcon-7b-pvc/
      env:
        - name: MODEL_NAME
          value: /mnt/models
        - name: CUDA_VISIBLE_DEVICES
          value: "0"
        - name: TRANSFORMERS_CACHE
          value: /mnt/models/
        - name: HUGGINGFACE_HUB_CACHE
          value: /mnt/models/
        - name: DTYPE_STR
          value: "float16"
        - name: MAX_SEQUENCE_LENGTH
          value: "2048"
        - name: MAX_BATCH_SIZE
          value: "256"
        - name: MAX_CONCURRENT_REQUESTS
          value: "1024"
        - name: MAX_NEW_TOKENS
          value: "2047"
        - name: FLASH_ATTENTION
          value: "true"
        - name: DEPLOYMENT_FRAMEWORK
          value: "tgis_native"
        - name: HF_MODULES_CACHE
          value: /tmp/huggingface/modules
      annotations:
        cloudpakId: 5e4c7dd451f14946bc298e18851f3746
        cloudpakName: IBM watsonx.ai
        productChargedContainers: All
        productCloudpakRatio: "1:1"
        productID: 3a6d4448ec8342279494bc22e36bc318
        productMetric: VIRTUAL_PROCESSOR_CORE
        productName: IBM Watsonx.ai
        productVersion: 9.1.0
        cloudpakInstanceId: b0871d64-ceae-47e9-b186-6e336deaf1f1
      labels_syom:
        app.kubernetes.io/managed-by: ibm-cpd-watsonx-ai-ifm-operator
        app.kubernetes.io/instance: watsonxaiifm
        app.kubernetes.io/name: watsonxaiifm
        icpdsupport/addOnId: watsonx_ai_ifm
        icpdsupport/app: api
        release: watsonxaiifm
        icpdsupport/module: tiiuae-falcon-7b
        app: text-tiiuae-falcon-7b
        component: fmaas-inference-server # won't override the predictor value
        bam-placement: colocate
        syom_model: "tiiuae--falcon-7b"
      args:
        - "--port=3000"
        - "--grpc-port=8033"
        - "--num-shard=1"
      volumeMounts:
        - mountPath: /opt/caikit/prompt_cache
          name: prompt-cache-dir
          subPath: prompt_cache
      volumes:
        - name: prompt-cache-dir
          persistentVolumeClaim:
            claimName: fmaas-caikit-inf-prompt-tunes-prompt-cache
      wx_inference_proxy:
        tiiuae/falcon-7b:
          enabled:
          - "true"
          label: "falcon-7b"
          provider: "tiiuae"
          source: "Hugging Face"
          functions:
          - text_generation
          tags:
          - consumer_public
          short_description: "A 7B parameters causal decoder-only model built by TII based on Falcon-7B and finetuned on a mixture of chat and instruct datasets."
          long_description: "Out-of-Scope Use: Production use without adequate assessment of risks and mitigation; any use cases which may be considered irresponsible or harmful.\n\nBias, Risks, and Limitations: Falcon-7B-Instruct is mostly trained on English data, and will not generalize appropriately to other languages. Furthermore, as it is trained on a large-scale corpora representative of the web, it will carry the stereotypes and biases commonly encountered online.\n\nRecommendations: Users of Falcon-7B-Instruct should develop guardrails and take appropriate precautions for any production use."
          task_ids:
          - question_answering
          - generation
          - summarization
          - extraction
          tasks_info:
            question_answering:
              task_ratings:
                quality: 0
                cost: 0
            generation:
              task_ratings:
                quality: 0
                cost: 0
            summarization:
              task_ratings:
                quality: 0
                cost: 0
            extraction:
              task_ratings:
                quality: 0
                cost: 0
          min_shot_size: 1
          tier: "class_2"
          number_params: "7b"
          lifecycle:
            available:
              since_version: "9.1.0"
    tiiuae_falcon_7b_resources:
      limits:
        cpu: "2"
        memory: 60Gi
        nvidia.com/gpu: "1"
        ephemeral-storage: 1Gi
      requests:
        cpu: "1"
        memory: 60Gi
        nvidia.com/gpu: "1"
        ephemeral-storage: 10Mi
    tiiuae_falcon_7b_replicas: 1
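
After you apply this sample file, you can confirm that the ConfigMap was created. This check assumes that the ConfigMap is created in the instance project:

oc get configmap tiiuae-falcon-7b -n ${PROJECT_CPD_INST_OPERANDS}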

What to do next

To test the custom foundation model that you added to a watsonx.ai lightweight engine installation, submit an inference request to the model programmatically. For more details, see Working with the watsonx.ai lightweight engine.
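
For example, the following minimal curl sketch submits a text generation request to the model that was registered in the sample ConfigMap file. The host name, token, endpoint path, and version value are assumptions based on the standard watsonx.ai REST API; confirm the exact request format in Working with the watsonx.ai lightweight engine:

# Submit a test inference request to the custom foundation model.
curl -k -X POST \
  "https://<cpd-host>/ml/v1/text/generation?version=2024-03-14" \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "tiiuae/falcon-7b",
    "input": "What is flash attention?",
    "parameters": {"max_new_tokens": 100}
  }'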