Registering custom foundation models for global deployment

You can deploy custom foundation models globally by using the watsonx.ai™ Inference Frameworks Manager (IFM) operator. A globally deployed model is available from any project or space instead of being tied to a single one. Enterprises with limited GPUs can therefore use custom foundation models across multiple projects and spaces without deploying a separate instance of the model in each one.

Prerequisites

  • If you want to enable the text chat functionality, the custom foundation model that you want to deploy must include a chat template in its tokenizer_config.json model configuration file. For example, the tokenizer_config.json file for the Llama-3.1-8B-Instruct model includes a chat template.
    [Figure: an example chat template that is used as part of the model config file]
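
One way to check this prerequisite before you upload the model is to look for a chat_template key in the tokenizer_config.json file. The following is a minimal sketch that assumes the model files are in a local directory. The has_chat_template function name is hypothetical, and the check is a plain text match rather than a full JSON parse.

```shell
# Return 0 if <model_dir>/tokenizer_config.json mentions a "chat_template" key.
# has_chat_template is a hypothetical helper name; this is a text match only,
# so it does not validate the template itself.
has_chat_template() {
  config_file="$1/tokenizer_config.json"
  [ -f "$config_file" ] && grep -q '"chat_template"' "$config_file"
}
```

For example, has_chat_template ./Llama-3.1-8B-Instruct succeeds only when the file exists in that directory and contains the key.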

Supported model architectures

You can deploy a custom foundation model at a global level if its architecture is supported by the vLLM runtime. The following tables list the model architectures that are supported for global deployment.

Table 1. Supported model architectures: general-purpose models
Model family | Foundation model examples | Supported quantization method | Parallel tensors (multiple GPUs supported) | Deployment configurations
bloom | bigscience/bloom-3b, bigscience/bloom-560m | N/A | Yes | Small, Medium and Large
exaone | lgai-exaone/exaone-3.0-7.8B-Instruct | N/A | No | Small
falcon | tiiuae/falcon-7b | N/A | Yes | Small, Medium and Large
gemma | google/gemma-2b | N/A | Yes | Small, Medium and Large
gemma2 | google/gemma-2-9b | N/A | Yes | Small, Medium and Large
gpt_bigcode | bigcode/starcoder, bigcode/gpt_bigcode-santacoder | gptq | Yes | Small, Medium and Large
gpt_neox | rinna/japanese-gpt-neox-small, EleutherAI/pythia-12b, databricks/dolly-v2-12b | N/A | Yes | Small, Medium and Large
gptj | EleutherAI/gpt-j-6b | N/A | No | Small
gpt2 | gpt2, gpt2-xl | N/A | Yes | Small, Medium and Large
granite | ibm-granite/granite-3.0-8b-instruct, ibm-granite/granite-3b-code-instruct-2k, granite-8b-code-instruct, granite-7b-lab | N/A | No | Small
jais | core42/jais-13b | N/A | Yes | Small, Medium and Large
llama | meta-llama/Meta-Llama-3-8B, meta-llama/Meta-Llama-3.1-8B-Instruct, llama-2-13b-chat-hf, TheBloke/Llama-2-7B-Chat-AWQ, ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf | gptq | Yes | Small, Medium and Large
mistral | mistralai/Mistral-7B-v0.3, neuralmagic/OpenHermes-2.5-Mistral-7B-marlin | N/A | No | Small
mixtral | TheBloke/Mixtral-8x7B-v0.1-GPTQ, mistralai/Mixtral-8x7B-Instruct-v0.1 | gptq | No | Small
mpt | mosaicml/mpt-7b, mosaicml/mpt-7b-storywriter, mosaicml/mpt-30b | N/A | No | Small
nemotron | nvidia/Minitron-8B-Base | N/A | Yes | Small, Medium and Large
olmo | allenai/OLMo-1B-hf, allenai/OLMo-7B-hf | N/A | Yes | Small, Medium and Large
persimmon | adept/persimmon-8b-base, adept/persimmon-8b-chat | N/A | Yes | Small, Medium and Large
phi | microsoft/phi-2, microsoft/phi-1_5 | N/A | Yes | Small, Medium and Large
phi3 | microsoft/Phi-3-mini-4k-instruct | N/A | Yes | Small, Medium and Large
qwen | DeepSeek-R1 (distilled variant) | N/A | Yes | Small, Medium and Large
qwen2 | Qwen/Qwen2-7B-Instruct-AWQ | AWQ | Yes | Small, Medium and Large

Table 2. Supported model architectures: automatic speech recognition (ASR) models
Model family | Foundation model examples | Supported quantization method | Parallel tensors (multiple GPUs supported) | Deployment configurations
whisper | openai/whisper-small | N/A | No | Small, Medium and Large
whisper | openai/whisper-large-v3-turbo | N/A | No | Small, Medium and Large

Table 3. Supported model architectures: embedding and reranking models
Model family | Foundation model examples | Supported semantic retrieval | Deployment configurations
BGE | BAAI/bge-reranker-v2-m3 | reranking | Small, Medium and Large
E5 | intfloat/multilingual-e5-large | embedding and reranking | Small, Medium and Large
granite | ibm/granite-embedding-107m-multilingual, ibm/granite-embedding-278m-multilingual | embedding | Small, Medium and Large
Jina Reranker | jinaai/jina-reranker-v2-base-multilingual | reranking | Small, Medium and Large
MiniLM | cross-encoder/ms-marco-minilm-l-12-v2 | reranking | Small, Medium and Large
MiniLM | sentence-transformers/all-minilm-l6-v2 | embedding and reranking | Small, Medium and Large
Qwen | Qwen/Qwen3-Embedding-0.6B | embedding and reranking | Small, Medium and Large
slate | ibm/slate-125m-english-rtrvr, ibm/slate-125m-english-rtrvr-v2, ibm/slate-30m-english-rtrvr, ibm/slate-30m-english-rtrvr-v2 | embedding and reranking | Small, Medium and Large

Procedure

To deploy a custom foundation model globally, follow these steps:
  1. Set up the PVC storage and upload the custom foundation model. For more information, see Setting up storage and uploading the model.

    Make a note of the pvc_name for the persistent volume claim where you store the downloaded model source files.

  2. Create a ConfigMap file for the custom foundation model by using the vLLM runtime.
     oc create -f <config_map_file>
     Poll until the predictor pod is in the Running state with 1/1 containers ready. For example:
     hermes-2-pro-mistral-7b-predictor-654986d764-mrpt5                1/1     Running     0               25m
    Important:
    1. You must specify the model ID in the ConfigMap in lowercase. The model ID cannot be specified in uppercase or camel case. The model ID can contain only letters, numbers, and underscores.
    2. You must set serving_runtime to vllm-serving-runtime in the ConfigMap file to deploy the global custom foundation model.
    3. To avoid naming conflicts with foundation models that are shipped by IBM, use a name that is unique. Do not use the same model_id as that of a foundation model that is shipped by IBM.
    4. You must set the global_custom_foundation_model parameter to true in the wx-inference-proxy ConfigMap.
    5. To enable MLOps engineers or prompt engineers to chat with the globally deployed custom foundation model, add the text_chat function to the same ConfigMap file that you created for the global custom foundation model.

    ConfigMap files are used by the Red Hat® OpenShift® AI layer of the service to serve configuration information to independent containers that run in pods or to other system components, such as controllers.

    For more information, see Creating a ConfigMap file.
  3. To register the custom foundation model, apply the ConfigMap file by using the following command:
    oc apply -f configmap.yml
    The service operator picks up the configuration information and applies it to your cluster.
  4. You can check the status of the service by using the following command. When Completed is returned, the custom foundation models are ready for use.
    oc get watsonxaiifm -n ${PROJECT_CPD_INST_OPERANDS}
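
The status check in step 4 can be wrapped in a polling loop. The following is a minimal sketch, assuming that the command output contains the word Completed once the service is ready, as described in step 4. The wait_for_ifm_completed function name, the number of attempts, and the delay are illustrative assumptions, not product defaults.

```shell
# Poll the IFM custom resource until the command output contains "Completed".
# wait_for_ifm_completed is a hypothetical helper; the attempt count and delay
# are illustrative defaults, not values from the product documentation.
wait_for_ifm_completed() {
  namespace="$1"
  max_attempts="${2:-40}"
  delay_seconds="${3:-30}"
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if oc get watsonxaiifm -n "$namespace" | grep -q "Completed"; then
      echo "watsonxaiifm is Completed"
      return 0
    fi
    sleep "$delay_seconds"
    attempt=$((attempt + 1))
  done
  echo "Timed out waiting for Completed" >&2
  return 1
}
```

For example, run wait_for_ifm_completed "${PROJECT_CPD_INST_OPERANDS}" after you apply the ConfigMap file.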

Creating a ConfigMap file

The following table lists the variables for you to replace in the sample ConfigMaps.
ConfigMap field Description
metadata.name Model name with hyphens as delimiters. For example, if the model name is tiiuae/falcon-7b, specify tiiuae-falcon-7b.
data.model.<full_model_name> Model name with underscores as delimiters. For example, if the model name is tiiuae/falcon-7b, specify tiiuae_falcon_7b.
data.model.<full_model_name>.pvc_name Persistent volume claim where the model source files are stored. Use the pvc_name that you noted in an earlier step. For example, tiiuae-falcon-7b-pvc.
data.model.<full_model_name>.pvc_size Size of persistent volume claim where the model source files are stored. For example, 60Gi.
data.model.<full_model_name>.image If the model that you want to use is not yet supported by the standard inference servers, you can override the standard settings and use your own custom inference runtime image. Images that are listed in the OpenShift registry were tested and are confirmed to work. You can also try out other images, but they are not officially supported by IBM.
data.model.<full_model_name>.env.DTYPE_STR Data type (precision) that the runtime uses for the model weights, specified as a string. For example, float16.

For more information about supported values, see Global parameters for custom foundation models.

data.model.<full_model_name>.env.MAX_NEW_TOKENS Maximum number of tokens that the model can generate for a text inference request. For example, 2047.

For more information about supported values, see Global parameters for custom foundation models.

data.model.<full_model_name>.env.ENABLE_AUTO_TOOL_CHOICE Tells vLLM that you want to enable the model to generate its own tool calls, when appropriate.
data.model.<full_model_name>.env.TOOL_CALL_PARSER Tool parser to use. For a list of tool parsers that are suitable for various foundation models, see https://docs.vllm.ai/en/stable/features/tool_calling.html#automatic-function-calling.
data.model.<full_model_name>.annotations.productVersion The service operator version. For example, 9.1.0.

To get this value, use the following command: oc get watsonxaiifm watsonxaiifm-cr -o jsonpath="{.spec.version}"

data.model.<full_model_name>.annotations.cloudpakInstanceId The IBM® Software Hub instance ID. For example, b0871d64-ceae-47e9-b186-6e336deaf1f1.

To get this value, use the following command: oc get cm product-configmap -o jsonpath="{.data.CLOUD_PAK_INSTANCE_ID}"

data.model.<full_model_name>.annotations.model-id Model's ID. For example, model-id: "meta-llama/llama-3-1-8b"
data.model.<full_model_name>.labels_syom.icpdsupport/module Model name with hyphens as delimiters. For example, if the model name is tiiuae/falcon-7b, specify tiiuae-falcon-7b
data.model.<full_model_name>.labels_syom.app Model name with hyphens as delimiters and prefixed with text-. For example, if the model name is tiiuae/falcon-7b, specify text-tiiuae-falcon-7b.
data.model.<full_model_name>.labels_syom.syom_model Model name with single hyphens as delimiters, except for the first delimiter, which uses two hyphens. For example, tiiuae--falcon-7b.
data.model.<full_model_name>.wx_inference_proxy.<model_id> Model ID (<full/model_name>). For example, tiiuae/falcon-7b.
data.model.<full_model_name>.wx_inference_proxy.<model_id>.label Model name without provider prefix. For example, falcon-7b.
data.model.<full_model_name>.wx_inference_proxy.<model_id>.provider Model provider. For example, tiiuae.
data.model.<full_model_name>.wx_inference_proxy.<model_id>.short_description Short description of the model in fewer than 100 characters.
data.model.<full_model_name>.wx_inference_proxy.<model_id>.long_description Long description of the model.
data.model.<full_model_name>.wx_inference_proxy.<model_id>.number_params Number of model parameters. For example, 7b.
data.model.<full_model_name>.wx_inference_proxy.<model_id>.lifecycle.available.since_version The first service operator version in which the model was added. For example, 9.1.0.
data.model.<full_model_name>.wx_inference_proxy.<model_id>.functions Model functions. Available options: text_generation, text_chat, embedding, rerank, and audio_transcriptions. Verify the functions that the model supports before you edit the ConfigMap. See the official model card for details.
Note:

text_chat applies only to models that use the vLLM runtime and meet all the text-chat related prerequisites. For more information about supported models, see Table 1.

data.model.<full_model_name>.wx_inference_proxy.<model_id>.tags Any applicable tags. Must contain this tag: vllm_runtime
data.model.<full_model_name>_resources.limits The maximum amount of resources that a custom foundation model can use.
  • cpu: Assigned CPU cores
  • memory: Assigned RAM
  • nvidia.com/gpu: GPU shards
  • ephemeral-storage: Storage space on local (ephemeral) disk
For more information, see Resource utilization guidelines.
data.model.<full_model_name>_resources.requests The minimum (guaranteed) amount of each resource that a custom foundation model will get.
  • cpu: Assigned CPU cores
  • memory: Assigned RAM
  • nvidia.com/gpu: GPU shards
  • ephemeral-storage: Storage space on local (ephemeral) disk
For more information, see Resource utilization guidelines.
data.model.<full_model_name>.env.CUDA_VISIBLE_DEVICES A comma-separated list of GPU indices that the LLM runtime must use. For example:
  • 0,1 exposes GPUs 0 and 1
  • 0 exposes only the first GPU
Only the GPUs listed here will be available for model loading and inference. Make sure that the number of indices corresponds to the nvidia.com/gpu count that is configured under both limits and requests in your Kubernetes spec.
data.model.<full_model_name>.env.NUM_GPUS The total number of GPUs that the LLM runtime will initialize. This number must be equal to the count of indices in CUDA_VISIBLE_DEVICES. For example:
  • If CUDA_VISIBLE_DEVICES is set to 0,1, then set NUM_GPUS=2
  • If only GPU 0 is exposed, set NUM_GPUS=1
Make sure that this value matches the nvidia.com/gpu count that is specified in both limits and requests of your pod’s resource configuration.
data.model.<full_model_name>.env.LANGUAGE Supported language for this model. For example, en.
data.model.<full_model_name>.env.MAX_LOG_LEN Maximum number of prompt characters or prompt ID numbers to be printed in the log.
data.model.<full_model_name>.env.VLLM_CACHE_ROOT Root directory for cache files.
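
The naming rules in the table can be sketched as a few shell helpers that derive each variant from a model ID in provider/model form. All function names here are hypothetical, and the sketch assumes that every character other than a lowercase letter or digit is replaced by the delimiter.

```shell
# Derive the ConfigMap name variants from a model ID such as tiiuae/falcon-7b.
# All function names are hypothetical. Assumption: every character other than
# a lowercase letter or digit is replaced by the delimiter.
to_lower() { printf '%s' "$1" | tr '[:upper:]' '[:lower:]'; }

# metadata.name and labels_syom.icpdsupport/module: hyphens as delimiters
hyphen_name() { to_lower "$1" | sed 's/[^a-z0-9]/-/g'; }

# <full_model_name>: underscores as delimiters
underscore_name() { to_lower "$1" | sed 's/[^a-z0-9]/_/g'; }

# labels_syom.app: hyphenated name prefixed with text-
app_label() { printf 'text-%s' "$(hyphen_name "$1")"; }

# labels_syom.syom_model: two hyphens after the provider, single hyphens elsewhere.
# Expects input in provider/model form.
syom_model_label() {
  provider="${1%%/*}"
  model="${1#*/}"
  printf '%s--%s' "$(hyphen_name "$provider")" "$(hyphen_name "$model")"
}

# Check that a <full_model_name> key uses only lowercase letters, digits, and underscores.
is_valid_model_key() {
  case "$1" in
    ""|*[!a-z0-9_]*) return 1 ;;
    *) return 0 ;;
  esac
}
```

For tiiuae/falcon-7b, these helpers produce tiiuae-falcon-7b, tiiuae_falcon_7b, text-tiiuae-falcon-7b, and tiiuae--falcon-7b, which matches the examples in the table. Some fields in the sample ConfigMaps in this topic use shorter names without the provider prefix, so treat the output as a starting point rather than a required value.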

Sample ConfigMap files

Create a ConfigMap file for the custom foundation model by copying one of the following sample ConfigMaps and then replacing the variables in the template with the appropriate values for your foundation model.

The following sample ConfigMap file defines the meta-llama/Meta-Llama-3.1-8B-Instruct custom foundation model.
apiVersion: v1
kind: ConfigMap
metadata:
  name: meta-llama-3-1-8b-instruct
  namespace: cpd-instance
  labels:
    syom: watsonxaiifm_extra_models_config
  finalizers:
    - watsonxaiifm.cpd.ibm.com/finalizer
data:
  model: |
    meta_llama_3_1_8b_instruct:
      pvc_name: meta-llama-3-1-8b-instruct-pvc
      pvc_size: 62Gi
      command: []
      isvc_yaml_name: isvc.yaml.j2
      dir_name: model
      force_apply: no
      serving_runtime: vllm-serving-runtime
      storage_uri: pvc://meta-llama-3-1-8b-instruct-pvc/
      env:
        - name: MODEL_NAME
          value: /mnt/models
        - name: CUDA_VISIBLE_DEVICES
          value: "0"
        - name: TRANSFORMERS_CACHE
          value: /mnt/models/
        - name: HUGGINGFACE_HUB_CACHE
          value: /mnt/models/
        - name: DTYPE_STR
          value: "float16"
        - name: MAX_SEQUENCE_LENGTH
          value: "2048"
        - name: MAX_BATCH_SIZE
          value: "256"
        - name: MAX_CONCURRENT_REQUESTS
          value: "1024"
        - name: MAX_NEW_TOKENS
          value: "2048"
        - name: FLASH_ATTENTION
          value: "true"
        - name: DEPLOYMENT_FRAMEWORK
          value: "tgis_native"
        - name: HF_MODULES_CACHE
          value: /tmp/huggingface/modules
        - name: SERVED_MODEL_NAME  # must match wx_inference_proxy.<model_name>
          value: meta-llama/meta-llama-3-1-8b-instruct
        - name: NUM_GPUS
          value: "1"
        - name: PORT
          value: "3000"
        - name: VLLM_CACHE_ROOT # Additional environment parameters for 5.3.0
          value: /tmp/vllm_cache
      annotations:
        cloudpakId: 5e4c7dd451f14946bc298e18851f3746
        cloudpakName: IBM watsonx.ai
        productChargedContainers: All
        productCloudpakRatio: "1:1"
        productID: 3a6d4448ec8342279494bc22e36bc318
        productMetric: VIRTUAL_PROCESSOR_CORE
        productName: IBM Watsonx.ai
        productVersion: 12.0.0
        cloudpakInstanceId: cd686e30-7b77-4256-a9be-c25e97f5f838
        model-id: "meta-llama/meta-llama-3-1-8b-instruct" # Additional annotation for 5.3.0 - must match wx_inference_proxy.<model_name>
      labels_syom:
        app.kubernetes.io/managed-by: ibm-cpd-watsonx-ai-ifm-operator
        app.kubernetes.io/instance: watsonxaiifm
        app.kubernetes.io/name: watsonxaiifm
        icpdsupport/addOnId: watsonx_ai_ifm
        icpdsupport/app: api
        release: watsonxaiifm
        icpdsupport/module: meta-llama-3-1-8b-instruct
        app: text-meta-llama-3-1-8b-instruct
        component: fmaas-inference-server
        bam-placement: colocate
        syom_model: meta--llama-3-1-8b-instruct
      args: []
      wx_inference_proxy:
        meta-llama/meta-llama-3-1-8b-instruct:
          global_custom_foundation_model: true
          enabled:
            - "true"
          label: "meta-llama-3-1-8b-instruct"
          provider: "meta-llama"
          source: "Hugging Face"
          functions:
            - text_generation
            - text_chat
          tags:
            - vllm_runtime  # Additional tags for 5.3.0
          short_description: "A large language model from Meta's LLaMA 3 series, fine-tuned to follow instructions and perform a wide range of natural language understanding and generation tasks."
          long_description: "A powerful 8-billion parameter model from Meta's LLaMA 3 series, specifically fine-tuned to enhance instruction-following capabilities, making it effective for a wide range of NLP tasks."
          task_ids:
            - question_answering
            - generation
            - summarization
            - classification
            - extraction
          tasks_info:
            question_answering:
              task_ratings: { quality: 0, cost: 0 }
            generation:
              task_ratings: { quality: 0, cost: 0 }
            summarization:
              task_ratings: { quality: 0, cost: 0 }
            classification:
              task_ratings: { quality: 0, cost: 0 }
            extraction:
              task_ratings: { quality: 0, cost: 0 }
          min_shot_size: 1
          tier: "class_2"
          number_params: "8b"
          lifecycle:
            available:
              since_version: "9.1.0"
    meta_llama_3_1_8b_instruct_resources:
      limits:
        cpu: "2"
        memory: 128Gi
        nvidia.com/gpu: "1"
        ephemeral-storage: 1Gi
      requests:
        cpu: "1"
        memory: 4Gi
        nvidia.com/gpu: "1"
        ephemeral-storage: 10Mi
    meta_llama_3_1_8b_instruct_replicas: 1
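
The sample above sets CUDA_VISIBLE_DEVICES to "0", NUM_GPUS to "1", and nvidia.com/gpu to "1" under both limits and requests. Before you change any of these values, you can check that they still agree. The following is a minimal sketch; check_gpu_settings is a hypothetical helper, and you pass the three values in by hand rather than having them read from the ConfigMap.

```shell
# Return 0 when NUM_GPUS, the number of indices in CUDA_VISIBLE_DEVICES, and the
# nvidia.com/gpu count all agree. check_gpu_settings is a hypothetical helper.
check_gpu_settings() {
  cuda_visible_devices="$1"   # for example "0,1"
  num_gpus="$2"               # for example "2"
  gpu_count="$3"              # nvidia.com/gpu under limits and requests
  # Split the list on commas and count the entries that contain a digit.
  index_count=$(printf '%s\n' "$cuda_visible_devices" | tr ',' '\n' | grep -c '[0-9]' || true)
  [ "$index_count" -eq "$num_gpus" ] && [ "$num_gpus" -eq "$gpu_count" ]
}
```

For example, check_gpu_settings "0,1" "2" "2" succeeds, and check_gpu_settings "0,1" "1" "1" fails because two indices are exposed but NUM_GPUS is 1.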

Sample ConfigMap for automatic speech recognition (ASR) models:

apiVersion: v1
data:
  model: |
    whisper_tiny:
      pvc_name: <pvc name>
      svc_name: whisper-tiny
      pvc_size: 100Gi
      dir_name: .
      force_apply: "no"
      isvc_yaml_name: isvc.yaml.j2
      image: registry.redhat.io/rhoai/odh-vllm-cuda-rhel9@sha256:751e2359439161babb9ad8e93e16251888a8c07aed895ffa55e4dfaf2a45f89d
      serving_runtime: vllm-serving-runtime
      storage_uri: pvc://<pvc name>/      # keep the trailing slash after the pvc name
      annotations:
        model-id: "openai/whisper-tiny" # Additional annotation for 5.3.0 - must match wx_inference_proxy.<model_name>
      labels_syom:
        app.kubernetes.io/managed-by: ibm-cpd-watsonx-ai-ifm-operator
        app.kubernetes.io/instance: watsonxaiifm
        app.kubernetes.io/name: watsonxaiifm
        icpdsupport/addOnId: watsonx_ai_ifm
        icpdsupport/app: api
        release: watsonxaiifm
        icpdsupport/module: whisper-tiny
        #app: text-llava-next-video-7b-hf
        component: fmaas-inference-server
        bam-placement: colocate
        syom_model: openai--whisper-tiny  # double dash after model provider name
      env:
        - name: VLLM_CACHE_ROOT
          value: /tmp/vllm_cache
        - name: MODEL_NAME
          value: /mnt/models
        - name: SERVED_MODEL_NAME   # must match wx_inference_proxy.<model_name>
          value: openai/whisper-tiny
        - name: LANGUAGE
          value: "en"
        - name: NUM_GPUS
          value: "1"
        - name: CUDA_VISIBLE_DEVICES
          value: "0"
        - name: HUGGINGFACE_HUB_CACHE
          value: /mnt/models/
        - name: HF_MODULES_CACHE
          value: /tmp/huggingface/modules
        - name: PORT
          value: "3000"
        - name: MAX_LOG_LEN
          value: "100"
      volumeMounts:
        - name: home
          mountPath: /home/vllm
        - name: tmp
          mountPath: /tmp
        - name: shm
          mountPath: /dev/shm
      volumes:
        - name: home
          emptyDir: {}
        - name: tmp
          emptyDir: {}
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 4Gi
      wx_inference_proxy:
        openai/whisper-tiny:
          enabled:
          - "true"
          label: openai/whisper-tiny
          provider: IBM
          source: IBM
          functions:
          - audio_transcriptions
          short_description: Whisper is a Transformer based encoder-decoder model, also referred to as a sequence-to-sequence model. It was trained on 680k hours of labelled speech data annotated using large-scale weak supervision.
          long_description:  Whisper is a Transformer based encoder-decoder model, also referred to as a sequence-to-sequence model. It was trained on 680k hours of labelled speech data annotated using large-scale weak supervision.
          tags:
          - vllm_runtime
          - consumer_public
          min_shot_size: 1
          input_tier: class_14
          output_tier: class_15
          number_params: 805k
          lifecycle:
            available:
              since_version: 11.0.0
          versions:
          - version: 1.0.0
            since_version: 11.0.0
    whisper_tiny_resources:
      limits:
        cpu: "3"
        memory: 96Gi
        nvidia.com/gpu: "1"
        ephemeral-storage: 1Gi
      requests:
        cpu: "2"
        memory: 85Gi
        nvidia.com/gpu: "1"
        ephemeral-storage: 10Mi
    whisper_tiny_replicas: 1
kind: ConfigMap
metadata:
  finalizers:
  - watsonxaiifm.cpd.ibm.com/finalizer
  labels:
    syom: watsonxaiifm_extra_models_config
  name: whisper-tiny

Sample ConfigMap for embedding and reranking models:

Important:

For the functions field, verify the functions that the model supports before you edit the ConfigMap. See the official model card for details.

You must set the VLLM_USE_V1 environment variable to 0 in the ConfigMap file. Otherwise, inference requests fail.

apiVersion: v1
data:
  model: |
    jina_reranker_v2_base_multilingual:
      pvc_name: jina-reranker-v2-base-multilingual-pvc
      pvc_size: 62Gi
      isvc_yaml_name: isvc.yaml.j2
      embedding_model: "true"
      dir_name: .
      force_apply: no
      serving_runtime: vllm-serving-runtime
      storage_uri: pvc://jina-reranker-v2-base-multilingual-pvc/
      env:
        - name: MODEL_NAME
          value: /mnt/models
        - name: CUDA_VISIBLE_DEVICES
          value: "0"
        - name: TRANSFORMERS_CACHE
          value: /mnt/models/
        - name: HUGGINGFACE_HUB_CACHE
          value:  /mnt/models/
        - name: DTYPE_STR
          value: "float16"
        - name: MAX_SEQUENCE_LENGTH
          value: "2048"
        - name: MAX_NEW_TOKENS
          value: "2048"
        - name: HF_MODULES_CACHE
          value: /tmp/huggingface/modules
        - name: SERVED_MODEL_NAME  # must match wx_inference_proxy.<model_name>
          value: jinaai/jina-reranker-v2-base-multilingual
        - name: NUM_GPUS
          value: "1"
        - name: PORT
          value: "3000"
        - name: VLLM_CACHE_ROOT
          value: /home/vllm
        - name: VLLM_USE_V1
          value: "0"
      volumeMounts:
        - name: home
          mountPath: /home/vllm
        - name: tmp
          mountPath: /tmp
        - name: shm
          mountPath: /dev/shm
      volumes:
        - name: home
          emptyDir: {}
        - name: tmp
          emptyDir: {}
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 4Gi
      annotations:
        cloudpakId: 5e4c7dd451f14946bc298e18851f3746
        cloudpakName: IBM watsonx.ai
        productChargedContainers: All
        productCloudpakRatio: "1:1"
        productID: 3a6d4448ec8342279494bc22e36bc318
        productMetric: VIRTUAL_PROCESSOR_CORE
        productName: IBM Watsonx.ai
        productVersion: 12.0.0
        cloudpakInstanceId: cd346dc7-29fd-46db-a874-7e33cb8cf1ac
        model-id: jinaai/jina-reranker-v2-base-multilingual  #Additional annotation for 5.3.0 - must match wx_inference_proxy.<model_name>
      labels_syom:
        app.kubernetes.io/managed-by: ibm-cpd-watsonx-ai-ifm-operator
        app.kubernetes.io/instance: watsonxaiifm
        app.kubernetes.io/name: watsonxaiifm
        icpdsupport/addOnId: watsonx_ai_ifm
        icpdsupport/app: api
        release: watsonxaiifm
        icpdsupport/module: jina-reranker-v2-base-multilingual
        app: text-jina-reranker-v2-base-multilingual
        component: fmaas-inference-server
        bam-placement: colocate
        syom_model: jina--reranker-v2-base-multilingual
      command:
        - "/bin/sh"
      args:
        - "-c"
        - "vllm serve /mnt/models --served-model-name jinaai/jina-reranker-v2-base-multilingual --port 3000 --trust-remote-code"
      wx_inference_proxy:
        jinaai/jina-reranker-v2-base-multilingual:
          global_custom_foundation_model: true
          enabled:
          - "true"
          label: "jina-reranker-v2-base-multilingual"
          provider: "jinaai"
          source: "Hugging Face"
          tags:
          - vllm_runtime
          functions:
          - rerank
          short_description: "A multilingual transformer-based reranker that scores query-document relevance with high accuracy"
          long_description: "The Jina Reranker v2 is a cross-encoder model fine-tuned for multilingual text reranking, capable of handling up to 1024 tokens with a sliding window for longer texts. It delivers state-of-the-art performance across retrieval, code, and text-to-SQL reranking tasks."
          tier: "class_c1"
          number_params: 278m
          lifecycle:
            available:
              since_version: "10.1.0"
    jina_reranker_v2_base_multilingual_resources:
      limits:
        cpu: "2"
        memory: 128Gi
        nvidia.com/gpu: "1"
        ephemeral-storage: 1Gi
      requests:
        cpu: "1"
        memory: 4Gi
        nvidia.com/gpu: "1"
        ephemeral-storage: 10Mi
    jina_reranker_v2_base_multilingual_replicas: 1
kind: ConfigMap
metadata:
  finalizers:
  - watsonxaiifm.cpd.ibm.com/finalizer
  labels:
    syom: watsonxaiifm_extra_models_config
  name: jina-reranker-v2-base-multilingual
  namespace: cpd-instance
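
Because inference requests fail when VLLM_USE_V1 is not set to 0, it can help to check a ConfigMap file for embedding or reranking models before you apply it. The following is a minimal sketch; has_vllm_use_v1_disabled is a hypothetical helper, and it assumes that the value line directly follows the name line, as in the sample above.

```shell
# Return 0 if the ConfigMap file sets the VLLM_USE_V1 env entry to 0.
# has_vllm_use_v1_disabled is a hypothetical helper. Assumption: the
# "value:" line directly follows the "- name: VLLM_USE_V1" line.
has_vllm_use_v1_disabled() {
  awk '
    /name:[ \t]*VLLM_USE_V1[ \t]*$/ {
      # Read the next line and accept 0 with or without double quotes.
      if ((getline next_line) > 0 && next_line ~ /value:[ \t]*"?0"?[ \t]*$/) found = 1
    }
    END { exit(found ? 0 : 1) }
  ' "$1"
}
```

For example, has_vllm_use_v1_disabled configmap.yml succeeds for the sample above and fails for a file that omits the variable or sets it to another value.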

Scaling global deployments

To scale your global deployment, you must edit the ConfigMap file for your deployment:

  1. Open the ConfigMap file in your default text editor:
    oc edit configmap <configmap name> -n <your namespace>
    For example:
    oc edit configmap meta-llama-3-1-8b-instruct -n cpd-instance
  2. Edit the <model_name>_replicas value. For example, change meta_llama_3_1_8b_instruct_replicas: 1 to meta_llama_3_1_8b_instruct_replicas: 2.
  3. Trigger reconciliation:
    oc patch watsonxaiifm watsonxaiifm-cr -n <your namespace> --type merge -p '{"spec":{"syom_update_at":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"}}'
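
The nested quoting in the patch command is easy to get wrong when you retype it. The following is a minimal sketch that builds the same merge-patch payload in a separate function so that you can inspect it before you send it. The syom_patch_payload and trigger_reconciliation names are hypothetical.

```shell
# Build the merge-patch payload with the current UTC timestamp.
# syom_patch_payload and trigger_reconciliation are hypothetical helper names.
syom_patch_payload() {
  printf '{"spec":{"syom_update_at":"%s"}}' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
}

# Send the payload to the watsonxaiifm-cr custom resource in the given namespace.
trigger_reconciliation() {
  oc patch watsonxaiifm watsonxaiifm-cr -n "$1" --type merge -p "$(syom_patch_payload)"
}
```

For example, run syom_patch_payload to print the payload, and then run trigger_reconciliation cpd-instance to apply it.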

What to do next

To test a globally deployed custom foundation model, submit an inference request to the model from a project or deployment space. For more information, see Deploying custom foundation models.