Adding custom foundation models to watsonx.ai Lightweight Engine

If the curated set of models in IBM watsonx.ai does not include the foundation model that you want to use for inferencing from your watsonx.ai lightweight engine installation, you can install your own custom model.

To review the curated foundation models that are available with IBM watsonx.ai to check whether an existing model might meet your needs, see Foundation models in IBM watsonx.ai.

Important: If you installed the full IBM watsonx.ai service, you follow different steps to add custom foundation models. For more information, see Deploying custom foundation models in IBM watsonx.ai.

Prerequisites

The IBM watsonx.ai service must be installed in lightweight engine mode.

Supported foundation model architectures

To check the architecture of a foundation model, find the config.json file for the foundation model, and then check the model_type value.
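
For example, if you already downloaded the model source files, you can check the model type from a command line. The file path in this sketch is illustrative; substitute the location of your model files:

grep '"model_type"' /path/to/model/config.json

The command prints the line that contains the model_type value, which you can compare against the tables that follow.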

The following table lists the general-purpose model architectures that are supported by the watsonx.ai lightweight engine:
| Model type | Supported quantization methods | Deployment framework | Flash attention |
| --- | --- | --- | --- |
| bloom | Not applicable | tgis_native | false |
| codegen | Not applicable | hf_transformers | Not applicable |
| falcon | Not applicable | tgis_native | true |
| gemma2 | Not applicable | Not applicable | Not applicable |
| gpt_bigcode | GPTQ | tgis_native | true |
| gpt_neox | Not applicable | tgis_native | true |
| gptj | Not applicable | hf_transformers | Not applicable |
| llama | GPTQ | tgis_native | true |
| llama2 | GPTQ | tgis_native | true |
| mistral | Not applicable | hf_transformers | Not applicable |
| mixtral | GPTQ | Not applicable | Not applicable |
| mpt | Not applicable | hf_transformers | Not applicable |
| mt5 | Not applicable | hf_transformers | Not applicable |
| nemotron | Not applicable | Not applicable | Not applicable |
| olmo | Not applicable | Not applicable | Not applicable |
| persimmon | Not applicable | Not applicable | Not applicable |
| phi | Not applicable | Not applicable | Not applicable |
| phi3 | Not applicable | Not applicable | Not applicable |
| qwen2 | AWQ | Not applicable | Not applicable |
| sphinx | Not applicable | Not applicable | Not applicable |
| t5 | Not applicable | tgis_native | false |
The following table lists the time-series model architectures that are supported by the watsonx.ai lightweight engine:
| Model type | Supported quantization methods | Deployment framework | Flash attention |
| --- | --- | --- | --- |
| tinytimemixer | Not applicable | Not applicable | Not applicable |
Use the tables to look up the deployment framework and flash attention values for your model type; you add these values to the ConfigMap file that you define for the custom foundation model. For example, a model with the llama architecture is deployed with the tgis_native framework and uses flash attention.
Deployment framework
Specifies the library to use for loading the foundation model. Although some foundation models support more than one deployment method, the table lists the deployment framework in which foundation models with the specified architecture perform best.
Flash attention
Flash attention is a mechanism that is used to scale transformer-based models more efficiently and enable faster inferencing. To host the model correctly, the service needs to know whether the model uses the flash attention mechanism.
Quantization methods
Quantization is a process that reduces the amount of compute resources and memory that is used when you inference a foundation model. You can set the following quantization methods for foundation models with the architectures that are listed:
  • Post-training quantization for generative pre-trained transformers (GPTQ)
  • Activation-aware weight quantization (AWQ)
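
A quick way to check whether a model checkpoint is already quantized is to look for a quantization_config section in its config.json file; GPTQ- and AWQ-quantized checkpoints that are published on Hugging Face typically include one. The file path in this sketch is illustrative:

grep -A 3 '"quantization_config"' /path/to/model/config.json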

Procedure

A system administrator must complete these steps to add a custom foundation model to the IBM watsonx.ai lightweight engine.
  1. Upload the model.

    Follow the steps in the Setting up storage and uploading the model procedure.

    Make a note of the pvc_name for the persistent volume claim where you store the downloaded model source files.

    Important: Complete only the storage setup and model download tasks, and then return to this procedure. Other steps in the full-service installation instructions describe how to create a deployment to host the custom foundation model. You do not need to set up a deployment to use custom foundation models from a watsonx.ai lightweight engine installation.
  2. Create a ConfigMap file for the custom foundation model.

    ConfigMap files are used by the Red Hat® OpenShift® AI layer of the service to serve configuration information to independent containers that run in pods or to other system components, such as controllers. See Creating a ConfigMap file.

  3. To register the custom foundation model, apply the ConfigMap file by using the following command:
    oc apply -f configmap.yml
    The service operator picks up the configuration information and applies it to your cluster.
  4. Check the status of the service by using the following command. When the status Completed is returned, the custom foundation model is ready for use.
    oc get watsonxaiifm -n ${PROJECT_CPD_INST_OPERANDS}
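
If you prefer to wait on the status from a script, you can poll the service with a loop like the following minimal sketch. It assumes that the status appears in the second column of the default oc get output; adjust the awk column index if your output differs.

# Poll the watsonxaiifm service until it reports Completed.
while true; do
  # Column 2 of the default output is assumed to hold the status.
  status=$(oc get watsonxaiifm -n ${PROJECT_CPD_INST_OPERANDS} --no-headers | awk '{print $2}')
  echo "watsonxaiifm status: ${status}"
  if [ "${status}" = "Completed" ]; then
    break
  fi
  sleep 30
done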

Creating a ConfigMap file

Create a ConfigMap file for the custom foundation model by copying the following template, and then replacing the variables in the template with the appropriate values for your foundation model.
apiVersion: v1
kind: ConfigMap
metadata:
  name: <model name with dash as delimiter>
  labels:
    syom: watsonxaiifm_extra_models_config
data:
  model: |
    <model name with underscore as delimiter>:
      pvc_name: <pvc where model is downloaded>
      pvc_size: <size of pvc where model is downloaded>
      isvc_yaml_name: isvc.yaml.j2
      dir_name: <directory inside of pvc where model content is downloaded>
      force_apply: no
      command: ["text-generation-launcher"]
      serving_runtime: tgis-serving-runtime
      storage_uri: pvc://<pvc where model is downloaded>/
      env:
        - name: MODEL_NAME
          value: /mnt/models
        - name: CUDA_VISIBLE_DEVICES
          value: "0" # If sharding, list one GPU index per shard: "0,1" for 2 shards, "0,1,2,3" for 4 shards.
        - name: TRANSFORMERS_CACHE
          value: /mnt/models/
        - name: HUGGINGFACE_HUB_CACHE
          value: /mnt/models/
        - name: DTYPE_STR
          value: "<value>"
        - name: MAX_SEQUENCE_LENGTH
          value: "<value>"
        - name: MAX_BATCH_SIZE
          value: "<value>"
        - name: MAX_CONCURRENT_REQUESTS
          value: "<value>"
        - name: MAX_NEW_TOKENS
          value: "<value>"
        - name: FLASH_ATTENTION
          value: "<value>"
        - name: DEPLOYMENT_FRAMEWORK
          value: "<value>"
        - name: HF_MODULES_CACHE
          value: /tmp/huggingface/modules
      annotations:
        cloudpakId: 5e4c7dd451f14946bc298e18851f3746
        cloudpakName: IBM watsonx.ai
        productChargedContainers: All
        productCloudpakRatio: "1:1"
        productID: 3a6d4448ec8342279494bc22e36bc318
        productMetric: VIRTUAL_PROCESSOR_CORE
        productName: IBM Watsonx.ai
        productVersion: <watsonx.ai ifm version>
        cloudpakInstanceId: <cloudpak instance id>
      labels_syom:
        app.kubernetes.io/managed-by: ibm-cpd-watsonx-ai-ifm-operator
        app.kubernetes.io/instance: watsonxaiifm
        app.kubernetes.io/name: watsonxaiifm
        icpdsupport/addOnId: watsonx_ai_ifm
        icpdsupport/app: api
        release: watsonxaiifm
        icpdsupport/module: <model name with dash as delimiter>
        app: text-<model name with dash as delimiter> 
        component: fmaas-inference-server # won't override the predictor value
        bam-placement: colocate
        syom_model: <model name with single hyphens as delimiters, except for the first delimiter, which uses two hyphens>
      args:
        - "--port=3000"
        - "--grpc-port=8033"
        - "--num-shard=1" # shard setting for the GPU (2,4, or 8)
      wx_inference_proxy:
        <model id>:
          enabled:
          - "true"
          label: "<label of model>"
          provider: "< provider of model>"
          source: "Hugging Face"
          tags:
          - consumer_public
          # Update the descriptions for your model.
          short_description: "<short description of model>"
          long_description: "<long description of model>"
          task_ids:
          - question_answering
          - generation
          - summarization
          - classification
          - extraction
          tasks_info:
            question_answering:
              task_ratings:
                quality: 0
                cost: 0
            generation:
              task_ratings:
                quality: 0
                cost: 0
            summarization:
              task_ratings:
                quality: 0
                cost: 0
            classification:
              task_ratings:
                quality: 0
                cost: 0
            extraction:
              task_ratings:
                quality: 0
                cost: 0
          min_shot_size: 1
          tier: "class_2"
          number_params: "13b"
          lifecycle:
            available:
              since_version: "<first watsonx.ai IFM version where this model is supported>"
    <model name with underscore as delimiter>_resources:
      limits:
        cpu: "2"
        memory: 128Gi
        nvidia.com/gpu: "1" # shard setting for the GPU (2,4,or 8)
        ephemeral-storage: 1Gi
      requests:
        cpu: "1"
        memory: 4Gi
        nvidia.com/gpu: "1" # shard setting for the GPU (2,4,or 8)
        ephemeral-storage: 10Mi
    <model name with underscore as delimiter>_replicas: 1
The following table lists the variables for you to replace in the template.
| ConfigMap field | Description |
| --- | --- |
| metadata.name | Model name with hyphens as delimiters. For example, if the model name is tiiuae/falcon-7b, specify tiiuae-falcon-7b. |
| data.model.<full_model_name> | Model name with underscores as delimiters. For example, if the model name is tiiuae/falcon-7b, specify tiiuae_falcon_7b. |
| data.model.<full_model_name>.pvc_name | Persistent volume claim where the model source files are stored. Use the pvc_name that you noted in an earlier step. For example, tiiuae-falcon-7b-pvc. |
| data.model.<full_model_name>.pvc_size | Size of the persistent volume claim where the model source files are stored. For example, 60Gi. |
| data.model.<full_model_name>.dir_name | Directory where the model content is stored. This value matches the MODEL_PATH from the model download job. For example, models--tiiuae-falcon-7b. |
| data.model.<full_model_name>.storage_uri | Uniform resource identifier for the directory where the model source files are stored, with the syntax pvc://<pvc where model is downloaded>/. For example, pvc://tiiuae-falcon-7b-pvc/. |
| data.model.<full_model_name>.env.DTYPE_STR | Data type in which the model weights are loaded. For example, float16. For more information about supported values, see Global parameters for custom foundation models. |
| data.model.<full_model_name>.env.MAX_SEQUENCE_LENGTH | Maximum sequence length that the model supports. For example, 2048. For more information about supported values, see Global parameters for custom foundation models. |
| data.model.<full_model_name>.env.MAX_BATCH_SIZE | Maximum batch size. For example, 256. For more information about supported values, see Global parameters for custom foundation models. |
| data.model.<full_model_name>.env.MAX_CONCURRENT_REQUESTS | Maximum number of concurrent requests that the model can handle. For example, 1024. For more information about supported values, see Global parameters for custom foundation models. |
| data.model.<full_model_name>.env.MAX_NEW_TOKENS | Maximum number of tokens that the model can generate for a text inference request. For example, 2047. For more information about supported values, see Global parameters for custom foundation models. |
| data.model.<full_model_name>.env.FLASH_ATTENTION | Specify the value from the Flash attention column of the Supported foundation model architectures table. If the value is Not applicable, remove this entry from the ConfigMap file. |
| data.model.<full_model_name>.env.DEPLOYMENT_FRAMEWORK | Specify the value from the Deployment framework column of the Supported foundation model architectures table. If the value is Not applicable, remove this entry from the ConfigMap file. |
| data.model.<full_model_name>.annotations.productVersion | The IBM watsonx.ai service operator version. For example, 9.1.0. To get this value, use the following command: oc get watsonxaiifm watsonxaiifm-cr -o jsonpath="{.spec.version}" |
| data.model.<full_model_name>.annotations.cloudpakInstanceId | The IBM® Software Hub instance ID. For example, b0871d64-ceae-47e9-b186-6e336deaf1f1. To get this value, use the following command: oc get cm product-configmap -o jsonpath="{.data.CLOUD_PAK_INSTANCE_ID}" |
| data.model.<full_model_name>.labels_syom.icpdsupport/module | Model name with hyphens as delimiters. For example, if the model name is tiiuae/falcon-7b, specify tiiuae-falcon-7b. |
| data.model.<full_model_name>.labels_syom.app | Model name with hyphens as delimiters, prefixed with text-. For example, if the model name is tiiuae/falcon-7b, specify text-tiiuae-falcon-7b. |
| data.model.<full_model_name>.labels_syom.syom_model | Model name with single hyphens as delimiters, except for the first delimiter, which uses two hyphens. For example, tiiuae--falcon-7b. |
| data.model.<full_model_name>.wx_inference_proxy.<full/model_name> | Model ID with a slash as the delimiter between provider and model name. For example, tiiuae/falcon-7b. |
| data.model.<full_model_name>.wx_inference_proxy.<full/model_name>.label | Model name without the provider prefix. For example, falcon-7b. |
| data.model.<full_model_name>.wx_inference_proxy.<full/model_name>.provider | Model provider. For example, tiiuae. |
| data.model.<full_model_name>.wx_inference_proxy.<full/model_name>.short_description | Short description of the model, in fewer than 100 characters. |
| data.model.<full_model_name>.wx_inference_proxy.<full/model_name>.long_description | Long description of the model. |
| data.model.<full_model_name>.wx_inference_proxy.<full/model_name>.min_shot_size | Minimum shot size, that is, the minimum number of examples to include when prompting the model. For example, 1. |
| data.model.<full_model_name>.wx_inference_proxy.<full/model_name>.tier | Model tier. For example, class_2. |
| data.model.<full_model_name>.wx_inference_proxy.<full/model_name>.number_params | Number of model parameters. For example, 7b. |
| data.model.<full_model_name>.wx_inference_proxy.<full/model_name>.lifecycle.available.since_version | The first IBM watsonx.ai service operator version in which the model was added. For example, 9.1.0. |
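
Before you apply the completed ConfigMap file, you can optionally confirm the values that you substituted and validate the file without changing the cluster. This sketch assumes that the ${PROJECT_CPD_INST_OPERANDS} environment variable is set to your instance project:

# Confirm that the persistent volume claim that holds the model source files exists.
oc get pvc -n ${PROJECT_CPD_INST_OPERANDS}

# Validate the ConfigMap on the server without creating it.
oc apply -f configmap.yml --dry-run=server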

Sample ConfigMap file

The following sample ConfigMap file defines the tiiuae-falcon-7b custom foundation model.
apiVersion: v1
kind: ConfigMap
metadata:
  name: tiiuae-falcon-7b
  labels:
    syom: watsonxaiifm_extra_models_config
data:
  model: |
    tiiuae_falcon_7b:
      pvc_name: tiiuae-falcon-7b-pvc
      pvc_size: 60Gi
      isvc_yaml_name: isvc.yaml.j2
      dir_name: model
      force_apply: no
      command: ["text-generation-launcher"]
      serving_runtime: tgis-serving-runtime
      storage_uri: pvc://tiiuae-falcon-7b-pvc/
      env:
        - name: MODEL_NAME
          value: /mnt/models
        - name: CUDA_VISIBLE_DEVICES
          value: "0"
        - name: TRANSFORMERS_CACHE
          value: /mnt/models/
        - name: HUGGINGFACE_HUB_CACHE
          value: /mnt/models/
        - name: DTYPE_STR
          value: "float16"
        - name: MAX_SEQUENCE_LENGTH
          value: "2048"
        - name: MAX_BATCH_SIZE
          value: "256"
        - name: MAX_CONCURRENT_REQUESTS
          value: "1024"
        - name: MAX_NEW_TOKENS
          value: "2047"
        - name: FLASH_ATTENTION
          value: "true"
        - name: DEPLOYMENT_FRAMEWORK
          value: "tgis_native"
        - name: HF_MODULES_CACHE
          value: /tmp/huggingface/modules
      annotations:
        cloudpakId: 5e4c7dd451f14946bc298e18851f3746
        cloudpakName: IBM watsonx.ai
        productChargedContainers: All
        productCloudpakRatio: "1:1"
        productID: 3a6d4448ec8342279494bc22e36bc318
        productMetric: VIRTUAL_PROCESSOR_CORE
        productName: IBM Watsonx.ai
        productVersion: 9.1.0
        cloudpakInstanceId: b0871d64-ceae-47e9-b186-6e336deaf1f1
      labels_syom:
        app.kubernetes.io/managed-by: ibm-cpd-watsonx-ai-ifm-operator
        app.kubernetes.io/instance: watsonxaiifm
        app.kubernetes.io/name: watsonxaiifm
        icpdsupport/addOnId: watsonx_ai_ifm
        icpdsupport/app: api
        release: watsonxaiifm
        icpdsupport/module: tiiuae-falcon-7b
        app: text-tiiuae-falcon-7b
        component: fmaas-inference-server # won't override the predictor value
        bam-placement: colocate
        syom_model: "tiiuae--falcon-7b"
      args:
        - "--port=3000"
        - "--grpc-port=8033"
        - "--num-shard=1"
      volumeMounts:
        - mountPath: /opt/caikit/prompt_cache
          name: prompt-cache-dir
          subPath: prompt_cache
      volumes:
        - name: prompt-cache-dir
          persistentVolumeClaim:
            claimName: fmaas-caikit-inf-prompt-tunes-prompt-cache
      wx_inference_proxy:
        tiiuae/falcon-7b:
          enabled:
          - "true"
          label: "falcon-7b"
          provider: "tiiuae"
          source: "Hugging Face"
          functions:
          - text_generation
          tags:
          - consumer_public
          short_description: "A 7B parameters causal decoder-only model built by TII based on Falcon-7B and finetuned on a mixture of chat and instruct datasets."
          long_description: "Out-of-Scope Use: Production use without adequate assessment of risks and mitigation; any use cases which may be considered irresponsible or harmful.\n\nBias, Risks, and Limitations: Falcon-7B-Instruct is mostly trained on English data, and will not generalize appropriately to other languages. Furthermore, as it is trained on a large-scale corpora representative of the web, it will carry the stereotypes and biases commonly encountered online.\n\nRecommendations: Users of Falcon-7B-Instruct should develop guardrails and take appropriate precautions for any production use."
          task_ids:
          - question_answering
          - generation
          - summarization
          - extraction
          tasks_info:
            question_answering:
              task_ratings:
                quality: 0
                cost: 0
            generation:
              task_ratings:
                quality: 0
                cost: 0
            summarization:
              task_ratings:
                quality: 0
                cost: 0
            extraction:
              task_ratings:
                quality: 0
                cost: 0
          min_shot_size: 1
          tier: "class_2"
          number_params: "7b"
          lifecycle:
            available:
              since_version: "9.1.0"
    tiiuae_falcon_7b_resources:
      limits:
        cpu: "2"
        memory: 60Gi
        nvidia.com/gpu: "1"
        ephemeral-storage: 1Gi
      requests:
        cpu: "1"
        memory: 60Gi
        nvidia.com/gpu: "1"
        ephemeral-storage: 10Mi
    tiiuae_falcon_7b_replicas: 1
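
After you apply this sample file, you can confirm that the ConfigMap was created. This check assumes that the ConfigMap is created in the instance project:

oc get configmap tiiuae-falcon-7b -n ${PROJECT_CPD_INST_OPERANDS}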

What to do next

To test the custom foundation model that you added to a watsonx.ai lightweight engine installation, submit an inference request to the model programmatically. For more details, see Working with the watsonx.ai lightweight engine.
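
For example, the following minimal curl sketch submits a text generation request to the model that was registered in the sample ConfigMap file. The host name, token, endpoint path, and version value are assumptions based on the standard watsonx.ai REST API; confirm the exact request format in Working with the watsonx.ai lightweight engine:

# Submit a test inference request to the custom foundation model.
curl -k -X POST \
  "https://<cpd-host>/ml/v1/text/generation?version=2024-03-14" \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "tiiuae/falcon-7b",
    "input": "What is flash attention?",
    "parameters": {"max_new_tokens": 100}
  }'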