Registering custom foundation models for global deployment

You can deploy custom foundation models globally by using the watsonx.ai™ Inference Frameworks Manager (IFM) operator. A globally deployed model is available from any project or space instead of being tied to a single project or space. Enterprises with limited GPUs can therefore share one deployed instance of a custom foundation model across multiple projects and spaces, without deploying a separate instance of the model in each one.

Prerequisites

To enable the text chat functionality, the custom foundation model that you want to deploy must include a chat template in its tokenizer_config.json model configuration file. For example, the model configuration file for the Llama-3.1-8B-Instruct model includes a chat template.
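
One way to confirm this prerequisite before you upload a model is to inspect the file directly. The following Python sketch assumes that the template is stored under a top-level `chat_template` key, which is the usual Hugging Face convention. The function name and the directory path are illustrative and are not part of the product.

```python
import json
from pathlib import Path


def has_chat_template(model_dir: str) -> bool:
    """Return True if tokenizer_config.json in model_dir defines a non-empty chat template."""
    config_path = Path(model_dir) / "tokenizer_config.json"
    if not config_path.is_file():
        return False
    with config_path.open(encoding="utf-8") as handle:
        config = json.load(handle)
    # Assumption: the template is stored under the top-level "chat_template" key.
    return bool(config.get("chat_template"))
```

For example, `has_chat_template("/path/to/downloaded/model")` returns False when the file is missing or has no template, which suggests that the text_chat function should not be enabled for that model.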

Supported model architectures

You can deploy custom foundation model architectures that are based on the vLLM runtime at a global level.

The following model architectures are supported for deployment at a global level.

Table 1. Supported model architectures: general-purpose models
Model family	Foundation model examples	Supported Quantization method	Parallel Tensors (Multiple GPUs supported)	Deployment configurations
`bloom`	`bigscience/bloom-3b`, `bigscience/bloom-560m`	N/A	Yes	Small, Medium and Large
`exaone`	`lgai-exaone/exaone-3.0-7.8B-Instruct`	N/A	No	Small
`falcon`	`tiiuae/falcon-7b`	N/A	Yes	Small, Medium and Large
`gemma`	`google/gemma-2b`	N/A	Yes	Small, Medium and Large
`gemma2`	`google/gemma-2-9b`	N/A	Yes	Small, Medium and Large
`gpt_bigcode`	`bigcode/starcoder`, `bigcode/gpt_bigcode-santacoder`	`gptq`	Yes	Small, Medium and Large
`gpt_neox`	`rinna/japanese-gpt-neox-small`, `EleutherAI/pythia-12b`, `databricks/dolly-v2-12b`	N/A	Yes	Small, Medium and Large
`gptj`	`EleutherAI/gpt-j-6b`	N/A	No	Small
`gpt2`	`gpt2`, `gpt2-xl`	N/A	Yes	Small, Medium and Large
`granite`	`ibm-granite/granite-3.0-8b-instruct`, `ibm-granite/granite-3b-code-instruct-2k`, `granite-8b-code-instruct`, `granite-7b-lab`	N/A	No	Small
`jais`	`core42/jais-13b`	N/A	Yes	Small, Medium and Large
`llama`	`meta-llama/Meta-Llama-3-8B`, `meta-llama/Meta-Llama-3.1-8B-Instruct`, `llama-2-13b-chat-hf`, `TheBloke/Llama-2-7B-Chat-AWQ`, `ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf`	`gptq`	Yes	Small, Medium and Large
`mistral`	`mistralai/Mistral-7B-v0.3`, `neuralmagic/OpenHermes-2.5-Mistral-7B-marlin`	N/A	No	Small
`mixtral`	`TheBloke/Mixtral-8x7B-v0.1-GPTQ`, `mistralai/Mixtral-8x7B-Instruct-v0.1`	`gptq`	No	Small
`mpt`	`mosaicml/mpt-7b`, `mosaicml/mpt-7b-storywriter`, `mosaicml/mpt-30b`	N/A	No	Small
`nemotron`	`nvidia/Minitron-8B-Base`	N/A	Yes	Small, Medium and Large
`olmo`	`allenai/OLMo-1B-hf`, `allenai/OLMo-7B-hf`	N/A	Yes	Small, Medium and Large
`persimmon`	`adept/persimmon-8b-base`, `adept/persimmon-8b-chat`	N/A	Yes	Small, Medium and Large
`phi`	`microsoft/phi-2`, `microsoft/phi-1_5`	N/A	Yes	Small, Medium and Large
`phi3`	`microsoft/Phi-3-mini-4k-instruct`	N/A	Yes	Small, Medium and Large
`qwen`	`DeepSeek-R1 (distilled variant)`	N/A	Yes	Small, Medium and Large
`qwen2`	`Qwen/Qwen2-7B-Instruct-AWQ`	`AWQ`	Yes	Small, Medium and Large

Table 2. Supported model architectures: automatic speech recognition (ASR) models
Model family	Foundation model examples	Supported Quantization method	Parallel Tensors (Multiple GPUs supported)	Deployment configurations
`whisper`	`openai/whisper-small`	N/A	No	Small, Medium and Large
`whisper`	`openai/whisper-large-v3-turbo`	N/A	No	Small, Medium and Large

Table 3. Supported model architectures: embedding and reranking models
Model family	Foundation model examples	Supported semantic retrieval	Deployment configurations
`BGE`	`BAAI/bge-reranker-v2-m3`	reranking	Small, Medium and Large
`E5`	`intfloat/multilingual-e5-large`	embedding and reranking	Small, Medium and Large
`granite`	`ibm/granite-embedding-107m-multilingual`, `ibm/granite-embedding-278m-multilingual`	embedding	Small, Medium and Large
`Jina Reranker`	`jinaai/jina-reranker-v2-base-multilingual`	reranking	Small, Medium and Large
`MiniLM`	`cross-encoder/ms-marco-minilm-l-12-v2`	reranking	Small, Medium and Large
`MiniLM`	`sentence-transformers/all-minilm-l6-v2`	embedding and reranking	Small, Medium and Large
`Qwen`	`Qwen/Qwen3-Embedding-0.6B`	embedding and reranking	Small, Medium and Large
`slate`	`ibm/slate-125m-english-rtrvr`, `ibm/slate-125m-english-rtrvr-v2`, `ibm/slate-30m-english-rtrvr`, `ibm/slate-30m-english-rtrvr-v2`	embedding and reranking	Small, Medium and Large

Procedure

To deploy a custom foundation model globally, follow these steps:

Set up the PVC storage and upload the custom foundation model. For more information, see Setting up storage and uploading the model.
Make a note of the pvc_name for the persistent volume claim where you store the downloaded model source files.
Create a ConfigMap file for the custom foundation model by using the vLLM runtime.
```
# Create the ConfigMap from your file:
oc create -f <config_map_file>
# Then poll until the predictor pod is in the Running state with 1/1 containers ready, for example:
# hermes-2-pro-mistral-7b-predictor-654986d764-mrpt5                1/1     Running     0               25m
```
Important:
1. You must specify the model ID in the ConfigMap in lowercase. The model ID cannot be specified in upper case or camel case. The model ID can contain only letters, numbers and underscores.
2. You must set serving_runtime to vllm-serving-runtime in the ConfigMap file to deploy the global custom foundation model.
3. To avoid naming conflicts with foundation models that are shipped by IBM, use a name that is unique. Do not use the same model_id as a foundation model that is shipped by IBM.
4. You must set the global_custom_foundation_model parameter to true in the wx-inference-proxy ConfigMap.
5. To enable MLOps engineers or prompt engineers to chat with the globally deployed custom foundation model by using the chat functionality, add the text_chat function in the same ConfigMap file that you created for the global custom foundation model.
ConfigMap files are used by the Red Hat® OpenShift® AI layer of the service to serve configuration information to independent containers that run in pods or to other system components, such as controllers.
For more information, see Creating a ConfigMap file.
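
The naming rule in the first Important item can be checked before you apply the file. The following Python sketch encodes the rule exactly as it is stated (lowercase letters, numbers, and underscores only). It assumes that the rule applies to identifiers such as `meta_llama_3_1_8b_instruct`, which the samples in this topic use as the key under `data.model`. The function name is hypothetical.

```python
import re

# Assumption: this pattern encodes the rule as stated in this topic
# (lowercase letters, digits, and underscores only).
_MODEL_ID_PATTERN = re.compile(r"[a-z0-9_]+")


def is_valid_model_id(model_id: str) -> bool:
    """Return True if model_id contains only lowercase letters, digits, and underscores."""
    return _MODEL_ID_PATTERN.fullmatch(model_id) is not None
```

With this check, `tiiuae_falcon_7b` passes, whereas `Tiiuae_Falcon_7b` (upper case) and `tiiuae-falcon-7b` (hyphens) are rejected.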
To register the custom foundation model, apply the ConfigMap file by using the following command:
```
oc apply -f configmap.yml
```
The service operator picks up the configuration information and applies it to your cluster.
You can check the status of the service by using the following command. When Completed is returned, the custom foundation models are ready for use.
```
oc get watsonxaiifm -n ${PROJECT_CPD_INST_OPERANDS}
```
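
If you prefer to wait for the status in a script, the check can be wrapped in a small polling loop. The following Python sketch runs the same `oc get watsonxaiifm` command and looks for the word Completed anywhere in its output. The exact output layout is not described in this topic, so the substring match is an assumption, and the function names, attempt count, and delay are illustrative.

```python
import subprocess
import time
from typing import Callable


def ifm_status_output(namespace: str) -> str:
    """Run the status command from this topic and return its raw output."""
    result = subprocess.run(
        ["oc", "get", "watsonxaiifm", "-n", namespace],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def wait_until_completed(
    get_output: Callable[[], str],
    attempts: int = 30,
    delay_seconds: float = 20.0,
) -> bool:
    """Poll get_output until it contains 'Completed' or the attempts run out."""
    for _ in range(attempts):
        # Assumption: the status column shows the literal text "Completed".
        if "Completed" in get_output():
            return True
        time.sleep(delay_seconds)
    return False
```

A typical call might be `wait_until_completed(lambda: ifm_status_output(os.environ["PROJECT_CPD_INST_OPERANDS"]))`, after importing `os`. Passing the command runner as a parameter keeps the loop easy to test without a cluster.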

Creating a `ConfigMap` file

The following table lists the variables for you to replace in the sample ConfigMaps.

ConfigMap field	Description
`metadata.name`	Model name with hyphens as delimiters. For example, if the model name is `tiiuae/falcon-7b`, specify `tiiuae-falcon-7b`.
`data.model.<full_model_name>`	Model name with underscores as delimiters. For example, if the model name is `tiiuae/falcon-7b`, specify `tiiuae_falcon_7b`.
`data.model.<full_model_name>.pvc_name`	Persistent volume claim where the model source files are stored. Use the `pvc_name` that you noted in an earlier step. For example, `tiiuae-falcon-7b-pvc`
`data.model.<full_model_name>.pvc_size`	Size of persistent volume claim where the model source files are stored. For example, `60Gi`.
`data.model.<full_model_name>.image`	If the model that you want to use is not yet supported by the standard inference servers, you can override the standard settings and use your own custom inference runtime image. Images that are listed in the OpenShift registry were tested and are confirmed to work. You can also try other images, but they are not officially supported by IBM.
`data.model.<full_model_name>.env.DTYPE_STR`	Data type of text strings that the model can process. For example, `float16`. For more information about supported values, see Global parameters for custom foundation models.
`data.model.<full_model_name>.env.MAX_NEW_TOKENS`	Maximum number of tokens that the model can generate for a text inference request. For example, `2047`. For more information about supported values, see Global parameters for custom foundation models.
`data.model.<full_model_name>.env.ENABLE_AUTO_TOOL_CHOICE`	Tells vLLM that you want to enable the model to generate its own tool calls, when appropriate.
`data.model.<full_model_name>.env.TOOL_CALL_PARSER`	Tool parser to use. For a list of tool parsers that are suitable for various foundation models, see https://docs.vllm.ai/en/stable/features/tool_calling.html#automatic-function-calling.
`data.model.<full_model_name>.annotations.productVersion`	The service operator version. For example, `9.1.0`. To get this value, use the following command: `oc get watsonxaiifm watsonxaiifm-cr -o jsonpath="{.spec.version}"`
`data.model.<full_model_name>.annotations.cloudpakInstanceId`	The IBM® Software Hub instance ID. For example, `b0871d64-ceae-47e9-b186-6e336deaf1f1`. To get this value, use the following command: `oc get cm product-configmap -o jsonpath="{.data.CLOUD_PAK_INSTANCE_ID}"`
`data.model.<full_model_name>.annotations.model-id`	Model's ID. For example, `model-id: "meta-llama/llama-3-1-8b"`
`data.model.<full_model_name>.labels_syom.icpdsupport/module`	Model name with hyphens as delimiters. For example, if the model name is `tiiuae/falcon-7b`, specify `tiiuae-falcon-7b`
`data.model.<full_model_name>.labels_syom.app`	Model name with hyphens as delimiters and prefixed with `text-`. For example, if the model name is `tiiuae/falcon-7b`, specify `text-tiiuae-falcon-7b`.
`data.model.<full_model_name>.labels_syom.syom_model`	Model name with single hyphens as delimiters, except for the first delimiter, which uses two hyphens. For example, `tiiuae--falcon-7b`.
`data.model.<full_model_name>.wx_inference_proxy.<model_id>`	Model ID in the form `<provider>/<model_name>`. For example, `tiiuae/falcon-7b`.
`data.model.<full_model_name>.wx_inference_proxy.<model_id>.label`	Model name without provider prefix. For example, `falcon-7b`.
`data.model.<full_model_name>.wx_inference_proxy.<model_id>.provider`	Model provider. For example, `tiiuae`
`data.model.<full_model_name>.wx_inference_proxy.<model_id>.short_description`	Short description of the model in less than 100 characters.
`data.model.<full_model_name>.wx_inference_proxy.<model_id>.long_description`	Long description of the model.
`data.model.<full_model_name>.wx_inference_proxy.<model_id>.number_params`	Number of model parameters. For example, `7b`
`data.model.<full_model_name>.wx_inference_proxy.<model_id>.lifecycle.available.since_version`	The first service operator version in which the model was added. For example, `9.1.0`.
`data.model.<full_model_name>.wx_inference_proxy.<model_id>.functions`	Model functions. Available options: `text_generation`, `text_chat`, `embedding`, `rerank`, and `audio_transcriptions`. You must verify the supported functions before editing the ConfigMap. See the official model card for details. Note: `text_chat` applies only to models that use the vLLM runtime and meet all the text chat prerequisites. For more information about supported models, see Table 1.
`data.model.<full_model_name>.wx_inference_proxy.<model_id>.tags`	Any applicable tags. Must contain this tag: `vllm_runtime`
`data.model.<full_model_name>_resources.limits`	The maximum amount of each resource that a custom foundation model can use: `cpu` (assigned CPU cores), `memory` (assigned RAM), `nvidia.com/gpu` (GPU shards), and `ephemeral-storage` (storage space on local, ephemeral disk). For more information, see Resource utilization guidelines.
`data.model.<full_model_name>_resources.requests`	The minimum (guaranteed) amount of each resource that a custom foundation model gets: `cpu` (assigned CPU cores), `memory` (assigned RAM), `nvidia.com/gpu` (GPU shards), and `ephemeral-storage` (storage space on local, ephemeral disk). For more information, see Resource utilization guidelines.
`data.model.<full_model_name>.env.CUDA_VISIBLE_DEVICES`	A comma-separated list of GPU indices that the LLM runtime must use. For example, `0,1` exposes GPUs 0 and 1, and `0` exposes only the first GPU. Only the GPUs that are listed here are available for model loading and inference. Make sure that the number of indices corresponds to the `nvidia.com/gpu` count that is configured under both `limits` and `requests` in your Kubernetes spec.
`data.model.<full_model_name>.env.NUM_GPUS`	The total number of GPUs that the LLM runtime initializes. This number must equal the count of indices in `CUDA_VISIBLE_DEVICES`. For example, if `CUDA_VISIBLE_DEVICES` is set to `0,1`, set `NUM_GPUS=2`; if only GPU 0 is exposed, set `NUM_GPUS=1`. Make sure that this value matches the `nvidia.com/gpu` count that is specified in both `limits` and `requests` of your pod's resource configuration.
`data.model.<full_model_name>.env.LANGUAGE`	Supported language for this model.
`data.model.<full_model_name>.env.MAX_LOG_LEN`	Maximum number of prompt characters or prompt ID numbers to print in the log.
`data.model.<full_model_name>.env.VLLM_CACHE_ROOT`	Root directory for cache files.
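
Several fields in this table are mechanical variations of the same model name, so they can be derived in one place. The following Python sketch reproduces the `tiiuae/falcon-7b` examples from the table. The function name and the dictionary keys are hypothetical, and lowercasing and replacing every other non-alphanumeric character with the delimiter are assumptions. The sample ConfigMaps in this topic shorten some of these values, so treat the output as a starting point to review.

```python
import re


def derive_configmap_names(model_name: str) -> dict:
    """Derive ConfigMap name variants from a '<provider>/<model>' name."""
    provider, _, short_name = model_name.partition("/")
    if not provider or not short_name:
        raise ValueError("expected a name in the form '<provider>/<model>'")

    def slug(text: str, delimiter: str) -> str:
        # Assumption: lowercase, and collapse other characters into the delimiter.
        return re.sub(r"[^a-z0-9]+", delimiter, text.lower()).strip(delimiter)

    hyphenated = slug(model_name, "-")
    return {
        "metadata_name": hyphenated,                  # metadata.name
        "model_key": slug(model_name, "_"),           # data.model.<full_model_name>
        "module_label": hyphenated,                   # labels_syom.icpdsupport/module
        "app_label": f"text-{hyphenated}",            # labels_syom.app
        "syom_model": f"{slug(provider, '-')}--{slug(short_name, '-')}",
        "proxy_label": short_name,                    # wx_inference_proxy.<model_id>.label
        "provider": provider,                         # wx_inference_proxy.<model_id>.provider
    }
```

Calling `derive_configmap_names("tiiuae/falcon-7b")` yields `tiiuae-falcon-7b`, `tiiuae_falcon_7b`, `text-tiiuae-falcon-7b`, and `tiiuae--falcon-7b` for the corresponding fields, which matches the examples in the table.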

Sample ConfigMap files

Create a ConfigMap file for the custom foundation model by copying one of the following sample ConfigMaps and then replacing the variables in the template with the appropriate values for your foundation model.

The following sample ConfigMap file defines the meta-llama/Meta-Llama-3.1-8B-Instruct custom foundation model.

apiVersion: v1
kind: ConfigMap
metadata:
  name: meta-llama-3-1-8b-instruct
  namespace: cpd-instance
  labels:
    syom: watsonxaiifm_extra_models_config
  finalizers:
    - watsonxaiifm.cpd.ibm.com/finalizer
data:
  model: |
    meta_llama_3_1_8b_instruct:
      pvc_name: meta-llama-3-1-8b-instruct-pvc
      pvc_size: 62Gi
      command: []
      isvc_yaml_name: isvc.yaml.j2
      dir_name: model
      force_apply: no
      serving_runtime: vllm-serving-runtime
      storage_uri: pvc://meta-llama-3-1-8b-instruct-pvc/
      env:
        - name: MODEL_NAME
          value: /mnt/models
        - name: CUDA_VISIBLE_DEVICES
          value: "0"
        - name: TRANSFORMERS_CACHE
          value: /mnt/models/
        - name: HUGGINGFACE_HUB_CACHE
          value: /mnt/models/
        - name: DTYPE_STR
          value: "float16"
        - name: MAX_SEQUENCE_LENGTH
          value: "2048"
        - name: MAX_BATCH_SIZE
          value: "256"
        - name: MAX_CONCURRENT_REQUESTS
          value: "1024"
        - name: MAX_NEW_TOKENS
          value: "2048"
        - name: FLASH_ATTENTION
          value: "true"
        - name: DEPLOYMENT_FRAMEWORK
          value: "tgis_native"
        - name: HF_MODULES_CACHE
          value: /tmp/huggingface/modules
        - name: SERVED_MODEL_NAME  # must match wx_inference_proxy.<model_name>
          value: meta-llama/meta-llama-3-1-8b-instruct
        - name: NUM_GPUS
          value: "1"
        - name: PORT
          value: "3000"
        - name: VLLM_CACHE_ROOT # Additional environment parameters for 5.3.0
          value: /tmp/vllm_cache
      annotations:
        cloudpakId: 5e4c7dd451f14946bc298e18851f3746
        cloudpakName: IBM watsonx.ai
        productChargedContainers: All
        productCloudpakRatio: "1:1"
        productID: 3a6d4448ec8342279494bc22e36bc318
        productMetric: VIRTUAL_PROCESSOR_CORE
        productName: IBM Watsonx.ai
        productVersion: 12.0.0
        cloudpakInstanceId: cd686e30-7b77-4256-a9be-c25e97f5f838
        model-id: "meta-llama/meta-llama-3-1-8b-instruct" # Additional annotation for 5.3.0 - must match wx_inference_proxy.<model_name>
      labels_syom:
        app.kubernetes.io/managed-by: ibm-cpd-watsonx-ai-ifm-operator
        app.kubernetes.io/instance: watsonxaiifm
        app.kubernetes.io/name: watsonxaiifm
        icpdsupport/addOnId: watsonx_ai_ifm
        icpdsupport/app: api
        release: watsonxaiifm
        icpdsupport/module: meta-llama-3-1-8b-instruct
        app: text-meta-llama-3-1-8b-instruct
        component: fmaas-inference-server
        bam-placement: colocate
        syom_model: meta--llama-3-1-8b-instruct
      args: []
      wx_inference_proxy:
        meta-llama/meta-llama-3-1-8b-instruct:
          global_custom_foundation_model: true
          enabled:
            - "true"
          label: "meta-llama-3-1-8b-instruct"
          provider: "meta-llama"
          source: "Hugging Face"
          functions:
            - text_generation
            - text_chat
          tags:
            - vllm_runtime  # Additional tags for 5.3.0
          short_description: "A large language model from Meta's LLaMA 3 series, fine-tuned to follow instructions and perform a wide range of natural language understanding and generation tasks."
          long_description: "A powerful 8-billion parameter model from Meta's LLaMA 3 series, specifically fine-tuned to enhance instruction-following capabilities, making it effective for a wide range of NLP tasks."
          task_ids:
            - question_answering
            - generation
            - summarization
            - classification
            - extraction
          tasks_info:
            question_answering:
              task_ratings: { quality: 0, cost: 0 }
            generation:
              task_ratings: { quality: 0, cost: 0 }
            summarization:
              task_ratings: { quality: 0, cost: 0 }
            classification:
              task_ratings: { quality: 0, cost: 0 }
            extraction:
              task_ratings: { quality: 0, cost: 0 }
          min_shot_size: 1
          tier: "class_2"
          number_params: "8b"
          lifecycle:
            available:
              since_version: "9.1.0"
    meta_llama_3_1_8b_instruct_resources:
      limits:
        cpu: "2"
        memory: 128Gi
        nvidia.com/gpu: "1"
        ephemeral-storage: 1Gi
      requests:
        cpu: "1"
        memory: 4Gi
        nvidia.com/gpu: "1"
        ephemeral-storage: 10Mi
    meta_llama_3_1_8b_instruct_replicas: 1
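
The GPU settings in a ConfigMap must agree with each other: `NUM_GPUS` must equal the number of indices in `CUDA_VISIBLE_DEVICES` and the `nvidia.com/gpu` count under both `limits` and `requests`. The following Python sketch shows one way to check that rule before you apply a file. It assumes that you have already loaded the `env` entries into a name-to-value dictionary and the resource sections into plain dictionaries. The function name is hypothetical.

```python
def check_gpu_settings(env: dict, limits: dict, requests: dict) -> list:
    """Return a list of problems; an empty list means that the GPU settings agree."""
    problems = []
    num_gpus = str(env.get("NUM_GPUS", ""))
    if not num_gpus.isdigit():
        return ["NUM_GPUS is missing or is not a whole number"]
    expected = int(num_gpus)

    # Count the GPU indices, ignoring empty entries such as a trailing comma.
    indices = [i for i in env.get("CUDA_VISIBLE_DEVICES", "").split(",") if i.strip()]
    if len(indices) != expected:
        problems.append(
            f"CUDA_VISIBLE_DEVICES lists {len(indices)} GPU(s) but NUM_GPUS is {expected}"
        )

    for section_name, section in (("limits", limits), ("requests", requests)):
        gpu_count = str(section.get("nvidia.com/gpu", ""))
        if gpu_count != str(expected):
            problems.append(
                f"{section_name} nvidia.com/gpu is '{gpu_count}' but NUM_GPUS is {expected}"
            )
    return problems
```

For the sample above, `CUDA_VISIBLE_DEVICES` is `"0"`, `NUM_GPUS` is `"1"`, and both resource sections request one GPU, so the check returns an empty list.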

Sample ConfigMap for automatic speech recognition (ASR) models:

apiVersion: v1
data:
  model: |
    whisper_tiny:
      pvc_name: <pvc name>
      svc_name: whisper-tiny
      pvc_size: 100Gi
      dir_name: .
      force_apply: "no"
      isvc_yaml_name: isvc.yaml.j2
      image: registry.redhat.io/rhoai/odh-vllm-cuda-rhel9@sha256:751e2359439161babb9ad8e93e16251888a8c07aed895ffa55e4dfaf2a45f89d
      serving_runtime: vllm-serving-runtime
      storage_uri: pvc://<pvc name>/      # keep the trailing slash after the pvc name
      annotations:
        model-id: "openai/whisper-tiny" # Additional annotation for 5.3.0 - must match wx_inference_proxy.<model_name>
      labels_syom:
        app.kubernetes.io/managed-by: ibm-cpd-watsonx-ai-ifm-operator
        app.kubernetes.io/instance: watsonxaiifm
        app.kubernetes.io/name: watsonxaiifm
        icpdsupport/addOnId: watsonx_ai_ifm
        icpdsupport/app: api
        release: watsonxaiifm
        icpdsupport/module: whisper-tiny
        #app: text-llava-next-video-7b-hf
        component: fmaas-inference-server
        bam-placement: colocate
        syom_model: openai--whisper-tiny  # double dash after model provider name
      env:
        - name: VLLM_CACHE_ROOT
          value: /tmp/vllm_cache
        - name: MODEL_NAME
          value: /mnt/models
        - name: SERVED_MODEL_NAME   # must match wx_inference_proxy.<model_name>
          value: openai/whisper-tiny
        - name: LANGUAGE
          value: "en"
        - name: NUM_GPUS
          value: "1"
        - name: CUDA_VISIBLE_DEVICES
          value: "0"
        - name: HUGGINGFACE_HUB_CACHE
          value: /mnt/models/
        - name: HF_MODULES_CACHE
          value: /tmp/huggingface/modules
        - name: PORT
          value: "3000"
        - name: MAX_LOG_LEN
          value: "100"
      volumeMounts:
        - name: home
          mountPath: /home/vllm
        - name: tmp
          mountPath: /tmp
        - name: shm
          mountPath: /dev/shm
      volumes:
        - name: home
          emptyDir: {}
        - name: tmp
          emptyDir: {}
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 4Gi
      wx_inference_proxy:
        openai/whisper-tiny:
          enabled:
          - "true"
          label: openai/whisper-tiny
          provider: IBM
          source: IBM
          functions:
          - audio_transcriptions
          short_description: Whisper is a Transformer based encoder-decoder model, also referred to as a sequence-to-sequence model. It was trained on 680k hours of labelled speech data annotated using large-scale weak supervision.
          long_description:  Whisper is a Transformer based encoder-decoder model, also referred to as a sequence-to-sequence model. It was trained on 680k hours of labelled speech data annotated using large-scale weak supervision.
          tags:
          - vllm_runtime
          - consumer_public
          min_shot_size: 1
          input_tier: class_14
          output_tier: class_15
          number_params: 805k
          lifecycle:
            available:
              since_version: 11.0.0
          versions:
          - version: 1.0.0
            since_version: 11.0.0
    whisper_tiny_resources:
      limits:
        cpu: "3"
        memory: 96Gi
        nvidia.com/gpu: "1"
        ephemeral-storage: 1Gi
      requests:
        cpu: "2"
        memory: 85Gi
        nvidia.com/gpu: "1"
        ephemeral-storage: 10Mi
    whisper_tiny_replicas: 1
kind: ConfigMap
metadata:
  finalizers:
  - watsonxaiifm.cpd.ibm.com/finalizer
  labels:
    syom: watsonxaiifm_extra_models_config
  name: whisper-tiny

Sample ConfigMap for embedding and reranking models:

Important:

For the functions field, you must verify the supported functions before editing the ConfigMap. See the official model card for details.

You must set the VLLM_USE_V1 environment variable to 0 in the ConfigMap file. Otherwise, inference requests fail.

apiVersion: v1
data:
  model: |
    jina_reranker_v2_base_multilingual:
      pvc_name: jina-reranker-v2-base-multilingual-pvc
      pvc_size: 62Gi
      isvc_yaml_name: isvc.yaml.j2
      embedding_model: "true"
      dir_name: .
      force_apply: no
      serving_runtime: vllm-serving-runtime
      storage_uri: pvc://jina-reranker-v2-base-multilingual-pvc/
      env:
        - name: MODEL_NAME
          value: /mnt/models
        - name: CUDA_VISIBLE_DEVICES
          value: "0"
        - name: TRANSFORMERS_CACHE
          value: /mnt/models/
        - name: HUGGINGFACE_HUB_CACHE
          value:  /mnt/models/
        - name: DTYPE_STR
          value: "float16"
        - name: MAX_SEQUENCE_LENGTH
          value: "2048"
        - name: MAX_NEW_TOKENS
          value: "2048"
        - name: HF_MODULES_CACHE
          value: /tmp/huggingface/modules
        - name: SERVED_MODEL_NAME  # must match wx_inference_proxy.<model_name>
          value: jinaai/jina-reranker-v2-base-multilingual
        - name: NUM_GPUS
          value: "1"
        - name: PORT
          value: "3000"
        - name: VLLM_CACHE_ROOT
          value: /home/vllm
        - name: VLLM_USE_V1
          value: "0"
      volumeMounts:
        - name: home
          mountPath: /home/vllm
        - name: tmp
          mountPath: /tmp
        - name: shm
          mountPath: /dev/shm
      volumes:
        - name: home
          emptyDir: {}
        - name: tmp
          emptyDir: {}
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 4Gi
      annotations:
        cloudpakId: 5e4c7dd451f14946bc298e18851f3746
        cloudpakName: IBM watsonx.ai
        productChargedContainers: All
        productCloudpakRatio: "1:1"
        productID: 3a6d4448ec8342279494bc22e36bc318
        productMetric: VIRTUAL_PROCESSOR_CORE
        productName: IBM Watsonx.ai
        productVersion: 12.0.0
        cloudpakInstanceId: cd346dc7-29fd-46db-a874-7e33cb8cf1ac
        model-id: jinaai/jina-reranker-v2-base-multilingual  #Additional annotation for 5.3.0 - must match wx_inference_proxy.<model_name>
      labels_syom:
        app.kubernetes.io/managed-by: ibm-cpd-watsonx-ai-ifm-operator
        app.kubernetes.io/instance: watsonxaiifm
        app.kubernetes.io/name: watsonxaiifm
        icpdsupport/addOnId: watsonx_ai_ifm
        icpdsupport/app: api
        release: watsonxaiifm
        icpdsupport/module: jina-reranker-v2-base-multilingual
        app: text-jina-reranker-v2-base-multilingual
        component: fmaas-inference-server
        bam-placement: colocate
        syom_model: jina--reranker-v2-base-multilingual
      command:
        - "/bin/sh"
      args:
        - "-c"
        - "vllm serve /mnt/models --served-model-name jinaai/jina-reranker-v2-base-multilingual --port 3000 --trust-remote-code"
      wx_inference_proxy:
        jinaai/jina-reranker-v2-base-multilingual:
          global_custom_foundation_model: true
          enabled:
          - "true"
          label: "jina-reranker-v2-base-multilingual"
          provider: "jinaai"
          source: "Hugging Face"
          tags:
          - vllm_runtime
          functions:
          - rerank
          short_description: "A multilingual transformer-based reranker that scores query-document relevance with high accuracy"
          long_description: "The Jina Reranker v2 is a cross-encoder model fine-tuned for multilingual text reranking, capable of handling up to 1024 tokens with a sliding window for longer texts. It delivers state-of-the-art performance across retrieval, code, and text-to-SQL reranking tasks."
          tier: "class_c1"
          number_params: 278m
          lifecycle:
            available:
              since_version: "10.1.0"
    jina_reranker_v2_base_multilingual_resources:
      limits:
        cpu: "2"
        memory: 128Gi
        nvidia.com/gpu: "1"
        ephemeral-storage: 1Gi
      requests:
        cpu: "1"
        memory: 4Gi
        nvidia.com/gpu: "1"
        ephemeral-storage: 10Mi
    jina_reranker_v2_base_multilingual_replicas: 1
kind: ConfigMap
metadata:
  finalizers:
  - watsonxaiifm.cpd.ibm.com/finalizer
  labels:
    syom: watsonxaiifm_extra_models_config
  name: jina-reranker-v2-base-multilingual
  namespace: cpd-instance
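
The requirements for embedding and reranking models can be checked together with the general rules for the `functions` and `tags` fields. The following Python sketch assumes that the relevant values have already been extracted from the ConfigMap into Python lists and a name-to-value dictionary. The function name is hypothetical, and the allowed function names are taken from the field table in this topic.

```python
ALLOWED_FUNCTIONS = {
    "text_generation",
    "text_chat",
    "embedding",
    "rerank",
    "audio_transcriptions",
}


def check_proxy_entry(functions: list, tags: list, env: dict) -> list:
    """Return a list of problems found in a wx_inference_proxy entry and its env."""
    problems = []
    unknown = sorted(set(functions) - ALLOWED_FUNCTIONS)
    if unknown:
        problems.append("unsupported functions: " + ", ".join(unknown))
    if "vllm_runtime" not in tags:
        problems.append("tags must include vllm_runtime")
    # Embedding and reranking models need VLLM_USE_V1 set to "0".
    if {"embedding", "rerank"} & set(functions) and env.get("VLLM_USE_V1") != "0":
        problems.append('VLLM_USE_V1 must be "0" for embedding and reranking models')
    return problems
```

For the sample above, `check_proxy_entry(["rerank"], ["vllm_runtime"], {"VLLM_USE_V1": "0"})` returns an empty list. The check does not replace reading the official model card to confirm which functions a model supports.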

Scaling global deployments

To scale your global deployment, you must edit the ConfigMap file for your deployment:

Open the ConfigMap file in your default text editor:

oc edit configmap <configmap name> -n <your namespace>

For example:

oc edit configmap meta-llama-3-1-8b-instruct -n cpd-instance

Edit the <model_name>_replicas value to the number of replicas that you want. For example, change meta_llama_3_1_8b_instruct_replicas: 1 to meta_llama_3_1_8b_instruct_replicas: 2.

Trigger reconciliation:

oc patch watsonxaiifm watsonxaiifm-cr -n <your namespace> --type merge -p '{"spec":{"syom_update_at":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"}}'
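
The patch works by writing a fresh UTC timestamp into `spec.syom_update_at`. If you trigger reconciliation from a script rather than a shell, the same payload can be built as follows. This Python sketch only constructs the JSON payload and the argument list for the `oc patch` command shown above. The function names are hypothetical.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def build_reconcile_patch(now: Optional[datetime] = None) -> str:
    """Build the merge-patch JSON that sets spec.syom_update_at to a UTC timestamp."""
    moment = now or datetime.now(timezone.utc)
    # Same format as the shell command: date -u +%Y-%m-%dT%H:%M:%SZ
    stamp = moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return json.dumps({"spec": {"syom_update_at": stamp}})


def build_patch_command(namespace: str, patch: str) -> list:
    """Return the oc patch argument list, suitable for subprocess.run."""
    return [
        "oc", "patch", "watsonxaiifm", "watsonxaiifm-cr",
        "-n", namespace,
        "--type", "merge",
        "-p", patch,
    ]
```

Passing the result of `build_patch_command("cpd-instance", build_reconcile_patch())` to `subprocess.run` is equivalent to the shell command, and it avoids the nested quoting around `$(date ...)`.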

What to do next

To test a globally deployed custom foundation model from a project or deployment space, submit an inference request to the model. For more information, see Deploying custom foundation models.