Known issues and limitations for IBM watsonx.ai

The following issues and limitations apply to IBM watsonx.ai.

Known issues

Setting up LoRA inference after upgrading IBM watsonx.ai and the BFM
Models that require multiple GPUs may fail to start during deployment
Models might fail to start when scheduled on nodes with unsupported GPU types
Assets with large document chunk sizes are deleted after vector index creation
Deploying with vector indexes in Prompt Lab using API returns incorrect answers
Evaluating AutoAI RAG patterns in an inference notebook generates an error
Cannot import files downloaded from a database connection into an AutoAI RAG experiment
Running AutoAI RAG experiment with SQL knowledge base fails with error
Running notebook generated from AutoAI RAG experiment that uses SQL database fails with AttributeError
Running AutoAI RAG experiment with embedding model fails with model pre-selection error
Configuring settings for previously run AutoAI RAG experiment after upgrade doesn’t show previously used models
Rerunning an AutoAI RAG experiment after upgrade fails
Some foundation models fail to start on a cluster with L40S GPUs
The index type that the unstructured data integration flow creates for a vector collection is not supported by the Prompt Lab
Text extraction jobs get stuck in running state
Text generation API request for a foundation model fails with downstream caikit request error
Deprecated and withdrawn models crash when the service is upgraded
Deploying the gpt-oss-20b or gpt-oss-120b custom foundation model fails with the ValueError message
Deploying OpenAI models in air-gapped environments fails due to missing tiktoken encoding files
Text extraction or classification request fails with path not found errors
Deployments of custom foundation models might fail because of vLLM incompatibility
The voxtral-small-24b-2507 model returns an error with 2-D audio input
Text chat requests with moderations enabled fail for certain models due to a missing tokenize endpoint
Text Semantic Schema API fails for documents over 20 pages with ExhaustedRetryError

Limitations

Jobs for certain asset types are not filtered out in the watsonx experience
Deploying a custom foundation model on MIG-enabled clusters with multiple GPUs fails
Working with vector indexes from connected folder assets that use a Cloud Object Storage connection without specifying a bucket is not supported
PDF file fails to load in AutoAI for RAG experiment when using older cipher in watsonx.ai SDK
Cannot use the llama-3-1-8b-instruct model to generate evaluation data in AutoAI RAG experiment
A custom foundation model doesn’t have enough resources to be deployed successfully
IBM watsonx.ai REST API requests fail intermittently
Git-integrated projects are not supported for watsonx.ai document processing

Known issues

Setting up LoRA inference after upgrading IBM watsonx.ai and the BFM

Applies to: 5.4.0

Problem

After you upgrade IBM watsonx.ai to Version 5.4 and upgrade the base foundation model (BFM), complete the following steps to ensure that low-rank adaptation (LoRA) fine tuning and inference work with the upgraded service and BFM.

Solution

Prerequisite

Run these steps to find the name of the predictor pod:

Identify the predictor pod for the BFM deployment:

oc get pods -l WML_DEPLOYMENT_ID=<your-BFM-deploymentID>

Save the predictor pod name:
```
PREDICTOR_POD=<predictor-pod-name>
```

Log in to the predictor pod, navigate to the lora-adapters directory, and list the existing adapters:
```
oc rsh $PREDICTOR_POD
cd /lora-adapters/
ls
```
If you see a directory with the LoRA asset ID, set the following variables:
```
LORA_ASSET_ID=<your-lora-asset-id>
LORA_DEPLOYMENT_ID=<your-lora-deployment-id>
```
Create a new directory named with the LoRA deployment ID by copying the existing content from the directory with the LoRA asset ID:
```
cp -r $LORA_ASSET_ID $LORA_DEPLOYMENT_ID
```
1. Verify the contents:
```
ls $LORA_DEPLOYMENT_ID
```
Update the adapter_config.json file by navigating to the new directory:
```
cd /lora-adapters/$LORA_DEPLOYMENT_ID
```
1. Open the adapter_config.json file:
```
vi adapter_config.json
```
2. Locate the following field in the adapter_config.json file:
```
"base_model_name_or_path": "/data/models/<model-name>"
```
3. Update the field to:
```
"base_model_name_or_path": "/mnt/models/<model-name>"
```
4. Save and exit the editor.
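
The manual edit above can also be scripted. The following is a minimal sketch, assuming that Python is available wherever the adapter_config.json file is accessible; the function names are hypothetical and are not part of watsonx.ai.

```python
import json
from pathlib import Path

OLD_PREFIX = "/data/models/"
NEW_PREFIX = "/mnt/models/"


def repoint_base_model(config: dict) -> dict:
    """Return a copy of the adapter config with the base model path moved to /mnt/models/."""
    updated = dict(config)
    path = updated.get("base_model_name_or_path", "")
    if path.startswith(OLD_PREFIX):
        updated["base_model_name_or_path"] = NEW_PREFIX + path[len(OLD_PREFIX):]
    return updated


def update_adapter_config(config_file: Path) -> None:
    """Rewrite adapter_config.json in place, leaving every other field unchanged."""
    config = json.loads(config_file.read_text())
    config_file.write_text(json.dumps(repoint_base_model(config), indent=2) + "\n")
```

For example, calling update_adapter_config(Path("/lora-adapters") / lora_deployment_id / "adapter_config.json") applies the change to the copied adapter, where lora_deployment_id holds your LoRA deployment ID. A path that already starts with /mnt/models/ is left untouched, so the helper can be run more than once.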

Models that require multiple GPUs may fail to start during deployment

Applies to: 5.4.0

Problem

When deploying a model that requires multiple GPUs, the model may fail to start and display the following error message:

  (VllmWorkerProcess pid=609) INFO 09-01 16:08:09 [pynccl.py:69] vLLM is using nccl==2.21.5

Cause

This issue occurs with models that require more than one GPU.

Solution

Edit the configuration, using the following command:

 oc edit watsonxaiifm watsonxaiifm-cr -n ${PROJECT_CPD_INST_OPERANDS}

Copy all the environment variables of the failed model predictor deployment from .spec.template.spec.containers, and add them to the watsonxaiifm-cr together with the NCCL_P2P_DISABLE variable, as follows:

 spec:
  model_install_parameters:
    <model-name-with-underscore>:
      env:
        <all the prepared environment variables>
         ...
         - name: NCCL_P2P_DISABLE
           value: "1"

See the following example using the llama-3-3-70b-instruct model:

 spec:
  model_install_parameters:
    llama_3_3_70b_instruct:
      env:
      - name: MODEL_NAME
        value: /mnt/models/llama-3-3-70b-instruct
      - name: SERVED_MODEL_NAME
        value: meta-llama/llama-3-3-70b-instruct
      - name: MAX_SEQUENCE_LENGTH
        value: "70000"
      - name: MAX_NUM_SEQS
        value: "8"
      - name: MAX_NEW_TOKENS
        value: "8192"
      - name: DISABLE_PROMPT_LOGPROBS
        value: "true"
      - name: ENABLE_AUTO_TOOL_CHOICE
        value: "true"
      - name: TOOL_CALL_PARSER
        value: llama3_json
      - name: CHAT_TEMPLATE
        value: /mnt/models/llama-3-3-70b-instruct/tool_chat_template_llama3.1_json.jinja
      - name: VLLM_ATTENTION_BACKEND
        value: XFORMERS
      - name: NUM_GPUS
        value: "4"
      - name: CUDA_VISIBLE_DEVICES
        value: 0,1,2,3
      - name: HUGGINGFACE_HUB_CACHE
        value: /mnt/models/
      - name: HF_MODULES_CACHE
        value: /tmp/huggingface/modules
      - name: PORT
        value: "3000"
      - name: MAX_LOG_LEN
        value: "100"
      - name: GPRC_PORT
        value: "8033"
      - name: NCCL_NVLS_ENABLE
        value: "0"
      - name: NCCL_P2P_DISABLE
        value: "1"
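
If you assemble the environment list programmatically, the merge step can be sketched as follows. This is an illustrative helper only: the function name is hypothetical, and the shortened list stands in for the full set of variables that you copy from the predictor deployment.

```python
def with_env_var(env: list[dict], name: str, value: str) -> list[dict]:
    """Return a new env list where `name` is set to `value`, replacing any existing entry."""
    merged = [entry for entry in env if entry.get("name") != name]
    merged.append({"name": name, "value": value})
    return merged


# Values copied from the predictor deployment (shortened here for illustration).
existing_env = [
    {"name": "NUM_GPUS", "value": "4"},
    {"name": "NCCL_NVLS_ENABLE", "value": "0"},
]
new_env = with_env_var(existing_env, "NCCL_P2P_DISABLE", "1")
```

Replacing rather than blindly appending keeps the list free of duplicate names if the variable is already set.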

Models might fail to start when scheduled on nodes with unsupported GPU types

Applies to: 5.4.0

Problem: When deploying a foundation model, the model predictor pod may not come up and could keep restarting if it is scheduled on a node with an unsupported GPU type.
Cause: The cluster has GPU types that are not supported by the deployed foundation model.

Solution

Run the following command to get the available GPU types. For the GPU type that is supported for the foundation model, copy the GPU node label value.
```
oc get node -L nvidia.com/gpu.product
```
For details on which models are compatible with specific GPU types, see GPU requirements for models in watsonx.ai.

Run the following command with the updated model ID and GPU node label.

oc patch watsonxaiifm watsonxaiifm-cr \
--namespace=${PROJECT_CPD_INST_OPERANDS} \
--type=merge \
--patch='{"spec": {"model_install_parameters": {"<model_name_with_underscore>": {"nodeSelector": {"nvidia.com/gpu.product": "<node-label>"}, "tolerations":[{ "effect": "NoSchedule", "key": "gpu", "operator": "Exists" }]}}}}'

For models provisioned by other services, contact IBM Support.
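
Hand-written JSON in a shell argument is easy to misquote. As a hedged sketch, the patch document for the command above can be generated and shell-quoted with Python; the function name and the example model key and node label are hypothetical placeholders.

```python
import json
import shlex


def node_selector_patch(model_key: str, gpu_label: str) -> str:
    """Build the merge-patch JSON that pins a model to nodes with the given GPU label."""
    patch = {
        "spec": {
            "model_install_parameters": {
                model_key: {
                    "nodeSelector": {"nvidia.com/gpu.product": gpu_label},
                    "tolerations": [
                        {"effect": "NoSchedule", "key": "gpu", "operator": "Exists"}
                    ],
                }
            }
        }
    }
    return json.dumps(patch)


# Hypothetical example values; substitute your own model key and node label.
patch = node_selector_patch("llama_3_3_70b_instruct", "NVIDIA-L40S")
print("--patch=" + shlex.quote(patch))
```

The printed argument can be pasted into the oc patch command in place of the hand-written --patch value.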

Assets with large document chunk sizes are deleted after vector index creation

Applies to: 5.4.0

Problem: Using large document chunk sizes when creating in-memory vector indexes results in assets being deleted shortly after creation.
Cause: The embedding request takes too long, triggering a rollback and deleting the asset.

Solution: Reduce the chunk size in Advanced settings when creating a vector index.

Deploying with vector indexes in Prompt Lab using API returns incorrect answers

Applies to: 5.4.0

Problem: When deploying an AI service in Prompt Lab using Milvus or Elasticsearch vector indexes, some deployments fail to retrieve context correctly from the vector index.
Cause: Missing or improperly configured credentials in the vector index connection.

Solution

The issue can be resolved by applying any of the following workarounds:

Use shared credentials for the vector index connection.
If you use personal credentials, ensure that they are explicitly added to the connection when it is promoted to the space.

You can also use Chat mode in Prompt Lab to deploy services and ask questions.

Evaluating AutoAI RAG patterns in an inference notebook generates an error

Applies to: 5.4.0

Problem: When running an inference notebook to test and evaluate a pattern generated in an AutoAI RAG experiment, the following error is generated: No module named 'unitxt'.
Cause: The unitxt==1.14.0 library is needed to evaluate the patterns but it is not installed on the runtime by default.
Solution: Install the unitxt==1.14.0 library.
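
To confirm whether the runtime already has the required version before you run the evaluation cells, a check along these lines can be used. It is a minimal sketch: the helper name is hypothetical, and the install hint assumes a Jupyter-style notebook where the %pip magic is available.

```python
from importlib import metadata


def has_version(package: str, required: str) -> bool:
    """Return True only if `package` is installed at exactly the `required` version."""
    try:
        return metadata.version(package) == required
    except metadata.PackageNotFoundError:
        return False


if not has_version("unitxt", "1.14.0"):
    print("Install it in a notebook cell with: %pip install unitxt==1.14.0")
```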

Cannot import files downloaded from a database connection into an AutoAI RAG experiment

Applies to: 5.4.0

Problem: If you download a file from a connected data source into a project, you cannot import it into an AutoAI RAG experiment.
Solution: Upload the file directly to the experiment.

Running AutoAI RAG experiment with SQL knowledge base fails with error

Applies to: 5.4.0

Problem

If you are using an SQL relational database as a knowledge base in an AutoAI for RAG experiment, the experiment might fail with the following error:

Number of retries for executing the SQL query has exceeded the limit of 3.

Cause

Some models might struggle to generate accurate SQL queries.

Solution

Run the experiment with a different model. Recommended models are:

openai/gpt-oss-120b
mistralai/mistral-medium-2505
meta-llama/llama-3-3-70b-instruct

Running notebook generated from AutoAI RAG experiment that uses SQL database fails with AttributeError

Applies to: 5.4.0

Problem

If you created an AutoAI RAG experiment using an SQL database as the knowledge base and then saved the experiment as a notebook, running that notebook results in the following error:

AttributeError: 'DataFrame' object has no attribute 'correct_answer_document_ids'

Cause

The notebook attempts to calculate faithfulness and context_correctness metrics, which are not supported for experiments that use an SQL database.

Solution

Remove the faithfulness and context_correctness metrics from the notebook.
Remove any direct references to the correct_answer_document_ids key in the notebook.
Rerun the notebook.
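
How the metrics are declared depends on the generated notebook, so the following is only a sketch of the first step. It assumes that the notebook keeps its metric names in a plain list of strings; the helper name and the example list are hypothetical.

```python
# Metrics that the notebook cannot compute for SQL-based experiments.
UNSUPPORTED_FOR_SQL = {"faithfulness", "context_correctness"}


def supported_metrics(metrics: list[str]) -> list[str]:
    """Return the metric names with the SQL-incompatible ones removed, order preserved."""
    return [name for name in metrics if name not in UNSUPPORTED_FOR_SQL]


# Illustrative metric list; use the list that your notebook actually defines.
metrics = supported_metrics(["answer_correctness", "faithfulness", "context_correctness"])
```

References to the correct_answer_document_ids key still need to be removed by hand, because they appear in the notebook code rather than in the metric list.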

Running AutoAI RAG experiment with embedding model fails with model pre-selection error

Applies to: 5.4.0

Problem

If you’re using an embedding model to run an AutoAI RAG experiment, the experiment might fail with the following error:

Foundation models pre-selection has failed. None of the given models has been successfully evaluated.

Cause

The router container in the embedding model pods does not have enough memory to process the load.

Solution

Increase the memory requests and limits for the router container by running the following command:

 oc patch watsonxaiifm watsonxaiifm-cr \
--namespace=${PROJECT_CPD_INST_OPERANDS} \
--type merge \
--patch '{
  "spec": {
    "watsonx_router_container_resources": {
      "limits": {
        "cpu": "500m",
        "memory": "10Gi",
        "ephemeral-storage": "1Gi"
      },
      "requests": {
        "cpu": "300m",
        "memory": "1Gi",
        "ephemeral-storage": "10Mi"
      }
    }
  }
}'

Wait for the pod(s) to restart.
Rerun the experiment.

Configuring settings for previously run AutoAI RAG experiment after upgrade doesn’t show previously used models

Applies to: 5.4.0

Problem: If you’re rerunning an AutoAI RAG experiment created before upgrading to watsonx.ai version 2.3.0, some previously used models may not appear in the model selection list on the experiment settings page.
Cause: Models that are not tagged with text_chat are excluded by the new model filters.
Solution: Select a model with text_chat support.

Rerunning an AutoAI RAG experiment after upgrade fails

Applies to: 5.4.0

Problem

If you’re rerunning an AutoAI RAG experiment created before upgrading to watsonx.ai version 2.3.0 by clicking the Play button, the experiment fails.

Cause

The experiment contains settings that have been deprecated in 2.3.0.

Solution

Open the experiment and click Reconfigure.
Submit the training again.
Run the experiment.

Some foundation models fail to start on a cluster with L40S GPUs

The ibm-defense-4-0-small model may get stuck in a crash loop when scheduled on a node equipped with NVIDIA L40S GPUs

Applies to: 5.4.0

Problem

When deploying the ibm-defense-4-0-small model on IBM Cloud Pak® for Data with watsonx.ai IFM using NVIDIA L40S GPUs, the model predictor pod may enter a crash loop due to CUDA out-of-memory errors.

The model fails to load with the following error:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 864.00 MiB. 
GPU 0 has a total capacity of 44.39 GiB of which 15.38 MiB is free. 
Including non-PyTorch memory, this process has 44.37 GiB memory in use.

Cause

The ibm-defense-4-0-small model (GraniteMoE Hybrid architecture) requires approximately 44 GiB of GPU memory to load its weights in float16 precision. This consumes nearly the entire 46 GiB capacity of a single L40S GPU, leaving insufficient memory for runtime allocations.

Solution

Run the following commands to fix the issue:

Patch the Watsonxaiifm CR to configure the model with 2 GPUs:

oc patch watsonxaiifm watsonxaiifm-cr \
--namespace=${PROJECT_CPD_INST_OPERANDS} \
--type merge \
--patch '{"spec": {"install_model_list":["ibm-defense-4-0-small"], "model_install_parameters": {"ibm_defense_4_0_small":{"shard": 2}}}}'

After applying the patch, verify that the model is up:
```
oc get isvc ibm-defense-4-0-small -n ${PROJECT_CPD_INST_OPERANDS}
```
After a few minutes, the predictor pod should reach the Ready: True state.

Resolution

The default configuration for ibm-defense-4-0-small in the Watsonxaiifm CR should include shard: 2 and NUM_GPUS=2.

The llama-3-1-70b-instruct model does not start on a cluster with four L40S GPUs

Applies to: 5.4.0

Problem

When you try to deploy the llama-3-1-70b-instruct foundation model on four L40S GPUs, the watsonxaiifm-cr shows the following error message:

Reconcile History: The failed task is : unknown task and the error message is: \
No message available

The llama-3-1-70b-instruct-predictor pod shows the following message:

Back-off restarting failed container kserve-container in pod
      llama-3-1-70b-instruct-predictor-etc_ibm-cpd-operands

The logs of the kserve-container in the predictor pod show the error message:

ValueError: The model's max seq len (131072) is larger than the maximum number \
of tokens that can be stored in KV cache (81840). Try increasing `gpu_memory_utilization` \
or decreasing `max_model_len` when initializing the engine.

Cause

The maximum sequence length of 131,072 that is specified for the foundation model is too long for the provided resources to handle.
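
The condition behind the error can be written as a one-line comparison. The sketch below uses the numbers from the error message and the value of 80,000 that the solution for this issue sets; the function name is hypothetical and the check is a simplification of what the inference engine does at startup.

```python
KV_CACHE_TOKENS = 81_840       # capacity reported in the error message
DEFAULT_MAX_SEQ_LEN = 131_072  # maximum sequence length specified for the model
PATCHED_MAX_SEQ_LEN = 80_000   # value that the solution sets for MAX_SEQUENCE_LENGTH


def fits_in_kv_cache(max_seq_len: int, kv_cache_tokens: int = KV_CACHE_TOKENS) -> bool:
    """Return True if a sequence of max_seq_len tokens can be stored in the KV cache."""
    return max_seq_len <= kv_cache_tokens
```

With these numbers, the default length does not fit and the patched length does, which is why lowering the value lets the model start.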

Solution

Set a lower maximum sequence length value for the llama-3-1-70b-instruct foundation model. Run a patch command to set the value of MAX_SEQUENCE_LENGTH to 80,000. Because the maximum sequence length parameter setting is an environment parameter, other values also need to be specified in the command. Complete the following steps:

Before you change any values, you can check the current values for the configuration by using the following command:
```
oc get deploy llama-3-1-70b-instruct-predictor -o yaml
```

Set the value of ${PROJECT_CPD_INST_OPERANDS} to the namespace where the model is installed, and then run this command:

oc patch watsonxaiifm watsonxaiifm-cr \
--namespace ${PROJECT_CPD_INST_OPERANDS} \
--type merge \
--patch '{"spec": { "model_install_parameters": {"llama_3_1_70b_instruct": {"env":
[{ "name": "MODEL_NAME", "value": "/mnt/models/models--meta-llama--llama-3-1-70b-instruct" },{ "name": "SERVED_MODEL_NAME", "value": "meta-llama/llama-3-1-70b-instruct" },{ "name": "MAX_SEQUENCE_LENGTH", "value": "80000" },{ "name": "MAX_NUM_SEQS", "value": "8" },
{ "name": "MAX_NEW_TOKENS", "value": "8192" },{ "name": "DISABLE_PROMPT_LOGPROBS", "value": "true" },{ "name": "ENABLE_AUTO_TOOL_CHOICE", "value": "true" },{ "name": "TOOL_CALL_PARSER", "value": "llama3_json" },{ "name": "CHAT_TEMPLATE", "value": "/mnt/models/models--meta-llama--llama-3-1-70b-instruct/tool_chat_template_llama3.1_json.jinja" },
{ "name": "VLLM_ATTENTION_BACKEND", "value": "XFORMERS" },{ "name": "NUM_GPUS", "value": "4" },{ "name": "CUDA_VISIBLE_DEVICES", "value": "0,1,2,3" },{ "name": "HUGGINGFACE_HUB_CACHE", "value": "/mnt/models/" },{ "name": "HF_MODULES_CACHE", "value": "/tmp/huggingface/modules" },{ "name": "PORT", "value": "3000" },
{ "name": "MAX_LOG_LEN", "value": "100" },{ "name": "GPRC_PORT", "value": "8033" },{ "name": "NCCL_NVLS_ENABLE", "value": "0" } ]}}}}'

After you run the patch command, the watsonxaiifm-cr switches to the InProgress state, and then reaches the Completed state. You can use the following command to verify that the predictor pod is running:
```
oc get po | grep predictor
```

The llama-3-3-70b-instruct model does not start on a cluster with four L40S GPUs

Applies to: 5.4.0

Problem

When you try to deploy the llama-3-3-70b-instruct foundation model on four L40S GPUs, the watsonxaiifm-cr shows the following error message:

Reconcile History: The failed task is : unknown task and the error message is: \
No message available

The llama-3-3-70b-instruct-predictor pod shows the following message:

Back-off restarting failed container kserve-container in pod
      llama-3-3-70b-instruct-predictor-etc_ibm-cpd-operands

The logs of the kserve-container in the predictor pod show the error message:

ValueError: The model's max seq len (131072) is larger than the maximum number \
of tokens that can be stored in KV cache (81840). Try increasing `gpu_memory_utilization` \
or decreasing `max_model_len` when initializing the engine.

Cause

The maximum sequence length of 131,072 that is specified for the foundation model is too long for the provided resources to handle.

Solution

Set a lower maximum sequence length value for the llama-3-3-70b-instruct foundation model. Run a patch command to set the value of MAX_SEQUENCE_LENGTH to 80,000. Because the maximum sequence length parameter setting is an environment parameter, other values also need to be specified in the command. Complete the following steps:

Before you change any values, you can check the current values for the configuration by using the following command:
```
oc get deploy llama-3-3-70b-instruct-predictor -o yaml
```

Set the value of ${PROJECT_CPD_INST_OPERANDS} to the namespace where the model is installed, and then run this command:

oc patch watsonxaiifm watsonxaiifm-cr \
--namespace ${PROJECT_CPD_INST_OPERANDS} \
--type merge \
--patch '{"spec": { "llama_3_3_70b_instruct_replicas": "1","llama_3_3_70b_instruct_resources": {"limits": {"cpu": "3", "ephemeral-storage": "1Gi", "memory": "96Gi", "nvidia.com/gpu": "4"},
"requests": {"cpu": "2", "ephemeral-storage": "10Mi", "memory": "85Gi", "nvidia.com/gpu": "4"}},"model_install_parameters": {"llama_3_3_70b_instruct": {"env": [{ "name": "MODEL_NAME", "value": "/mnt/models/llama-3-3-70b-instruct" },{ "name": "SERVED_MODEL_NAME", "value": "meta-llama/llama-3-3-70b-instruct" },
{ "name": "MAX_SEQUENCE_LENGTH", "value": "80000" },{ "name": "MAX_NUM_SEQS", "value": "8" },{ "name": "MAX_NEW_TOKENS", "value": "8192" },{ "name": "DISABLE_PROMPT_LOGPROBS", "value": "true" },{ "name": "ENABLE_AUTO_TOOL_CHOICE", "value": "true" },{ "name": "TOOL_CALL_PARSER", "value": "llama3_json" },
{ "name": "CHAT_TEMPLATE", "value": "/mnt/models/llama-3-3-70b-instruct/tool_chat_template_llama3.1_json.jinja" },{ "name": "VLLM_ATTENTION_BACKEND", "value": "XFORMERS" },{ "name": "NUM_GPUS", "value": "4" },{ "name": "CUDA_VISIBLE_DEVICES", "value": "0,1,2,3" },{ "name": "HUGGINGFACE_HUB_CACHE", "value": "/mnt/models/" },
{ "name": "HF_MODULES_CACHE", "value": "/tmp/huggingface/modules" },{ "name": "PORT", "value": "3000" },{ "name": "MAX_LOG_LEN", "value": "100" },{ "name": "GPRC_PORT", "value": "8033" },{ "name": "NCCL_NVLS_ENABLE", "value": "0" } ], "shard": "4"}}}}'

After you run the patch command, the watsonxaiifm-cr switches to the InProgress state, and then reaches the Completed state. You can use the following command to verify that the predictor pod is running:
```
oc get po | grep predictor
```

The gpt-oss-120b model does not start on a cluster with one L40S GPU

Applies to: 5.4.0

Problem

When you try to deploy the gpt-oss-120b foundation model on one L40S GPU, the watsonxaiifm-cr fails with OutOfMemoryError.

Cause

A single L40S GPU does not have enough memory to load the gpt-oss-120b model.

Solution

Use two L40S GPUs to run the gpt-oss-120b model in your cluster by patching the custom resource as follows:

oc patch watsonxaiifm watsonxaiifm-cr \
--namespace ${PROJECT_CPD_INST_OPERANDS} \
--type merge \
--patch '{"spec":{"model_install_parameters":{"gpt_oss_120b":{"shard":2,"command":["/bin/sh","-c","vllm serve /mnt/models/gpt-oss-120b --served-model-name openai/gpt-oss-120b --port 3000 --max_num_seqs 16 --tensor_parallel_size=2 --max-model-len 131072 --tool-server demo --tool-call-parser openai --enable-auto-tool-choice"]}}}}'

The index type that the unstructured data integration flow creates for a vector collection is not supported by the Prompt Lab

Applies to: 5.4.0

Problem: If you use an unstructured data integration flow to chunk and embed data into a vector store, and upload vectorized documents from the vector store in Prompt Lab, the responses generated in chat mode are incorrect.
Cause: The unstructured data flow supports the COSINE metric type and the Prompt Lab supports the L2 metric type.

Text extraction jobs get stuck in running state

Applies to: 5.4.0

Problem

Text extraction jobs might get stuck in running state. If this error occurs, check the wdu pod logs for this message:

Attempted to retry provider connection to many times. Shutting down worker.

Solution

Restart all the wdu pods and run the job again.

Text generation API request for a foundation model fails with downstream caikit request error

Applies to: 5.4.0

Problem

When a /ml/v1/text/generation_stream request contains moderations parameters that cause the request to return only 1 token, the request fails with the following message:

"message":"Downstream Caikit request failed: unexpected error occurred while processing request"

Solution

The error occurs when the text generation API request returns only 1 token. Make sure the request returns more than 1 token by specifying the min_new_tokens parameter in the API request body as follows:

min_new_tokens: 2
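
As a hedged illustration, the parameter can be enforced in client code before the request is sent. The sketch assumes that min_new_tokens is nested under a parameters object in the request body; the helper name, model ID, and input are hypothetical placeholders.

```python
import json


def ensure_min_new_tokens(body: dict, minimum: int = 2) -> dict:
    """Return a copy of the request body whose parameters.min_new_tokens is at least `minimum`."""
    updated = dict(body)
    parameters = dict(updated.get("parameters", {}))
    parameters["min_new_tokens"] = max(parameters.get("min_new_tokens", 0), minimum)
    updated["parameters"] = parameters
    return updated


# Hypothetical request body; the model ID and input are placeholders.
body = ensure_min_new_tokens({"model_id": "<model-id>", "input": "<prompt>"})
print(json.dumps(body, indent=2))
```

A value that is already larger than the minimum is kept, so the helper only raises the setting when needed.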

Deprecated and withdrawn models crash when the service is upgraded

Applies to: 5.4.0

Problem

After upgrading the watsonx.ai service to version 13.0.0, deprecated or withdrawn models crash with the CrashLoopBackOff error.

Cause

The deprecated and withdrawn models don't have access to the advanced router, which is required for IBM® Software Hub 5.4.0.

Solution

Use the following steps to edit the isvc of each crashing model to fix the error:

Add an annotation under the spec.predictor.annotations section for the foundation model.
The model ID is in the format <model vendor>/<model name> that you can get from the API. For example, the allam-1-13b-instruct model has the API model ID sdaia/allam-1-13b-instruct. For details, see Getting foundation model information.
```
annotations:
  cloudpakId: 5....
  cloudpakInstanceId: ${PROJECT_CPD_INST_OPERANDS}
  cloudpakName: IBM watsonx.ai
  model-id: <API model ID>
```

In the env section, add the following environment variable:

env:
  - name: SERVED_MODEL_NAME
    value: <API model ID>

Edit the wx-inference-proxy-configmap. In the list of installed models, add the tags attribute under the model ID of the model that crashes, if not already present. Then add the following parameter under tags:
```
install_model_list:
  <Model ID>:
    tags:
    - vllm_runtime
```
Restart the wx-inference-proxy pod.

Deploying the gpt-oss-20b or gpt-oss-120b custom foundation model fails with the ValueError message

Applies to: 5.4.0

Problem

Deployment of the gpt-oss-20b or gpt-oss-120b custom foundation model might fail with this message:

Custom foundation model deployment failed because it uses an unsupported quantization method. Only List(aqlm, compressed-tensors, gptq, awq, bitsandbytes) quantized models are currently supported for watsonx-cfm-caikit-1.1

Solution

Create a custom runtime definition and re-deploy the model. When creating the custom runtime definition:

Add this command in the command section:

"command": [
    "bash",
    "-c",
    "vllm serve /mnt/models --port 3000 --tokenizer /mnt/models --dtype bfloat16 --max-model-len 40960 --trust-remote-code --tensor-parallel-size 1 --served-model-name $SERVED_MODEL_NAME"
]

Add these environment variables in the custom runtime definition:

{
    "name": "HF_HOME",
    "value": "/home/vllm/.cache/huggingface"
},
{
    "name": "HF_HUB_CACHE",
    "value": "/home/vllm/.cache/huggingface/hub"
},
{
    "name": "HF_HUB_OFFLINE",
    "value": "1"
}

For more information about how to create a custom runtime definition, see Building a custom inference runtime image for your custom foundation model.

Deploying OpenAI models in air-gapped environments fails due to missing tiktoken encoding files

Applies to: 5.4.0

Problem

When deploying OpenAI-based custom foundation models (such as openai/gpt-oss-20b or openai/gpt-oss-120b) in air-gapped or offline environments, the deployment fails because required tiktoken encoding files cannot be downloaded from external sources.

This issue occurs when:

You are deploying an OpenAI model in an air-gapped environment (no Internet connectivity)
The deployment is on-premises with restricted network access
The tiktoken library cannot access https://openaipublic.blob.core.windows.net/encodings/

You might receive error messages that indicate tokenization failures or missing encoding resources.

Cause

OpenAI models use the tiktoken library for tokenization, which requires encoding files that are not included with the model weights. These encoding files are normally downloaded automatically at runtime from OpenAI's public blob storage. In air-gapped environments, this automatic download fails, causing the deployment to fail.

Solution

To deploy OpenAI models in air-gapped environments, you must manually download the required tiktoken encoding files and transfer them to your environment before deployment.

On a machine with internet access, download the required tiktoken encoding files:
```
mkdir -p tiktoken_encodings
wget -O tiktoken_encodings/o200k_base.tiktoken "https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken"
wget -O tiktoken_encodings/cl100k_base.tiktoken "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"
```
Note: Different OpenAI models might require different encoding files. The o200k_base.tiktoken file is used by newer models like GPT-4o, while cl100k_base.tiktoken is used by GPT-3.5 and GPT-4 models. Download both files to ensure compatibility.
Transfer the tiktoken_encodings directory to your air-gapped environment by using your organization's approved file transfer method (such as secure file transfer, USB drive, or other approved mechanisms).
Upload the encoding files to the same persistent volume claim (PVC) where your model files are stored. The files must be placed in a tiktoken_encodings directory at the root level of your model storage location.
Verify that o200k_base.tiktoken and cl100k_base.tiktoken encoding files are accessible in the correct location:
```
oc exec -it <model-pod-name> -n <namespace> -- ls -la /mnt/models/tiktoken_encodings/
```
Update your model's ConfigMap to make sure that the tiktoken library can locate these files. Add the following environment variable to your deployment configuration:
```
env:
  - name: TIKTOKEN_CACHE_DIR
    value: "/mnt/models/tiktoken_encodings"
```

Apply the updated ConfigMap and redeploy your model:

oc apply -f <your-configmap-file>.yaml
oc rollout restart deployment <deployment-name> -n <namespace>
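
Before you transfer the tiktoken_encodings directory, it can help to confirm that both downloads completed. The following is a minimal sketch with a hypothetical helper name; it checks only that the two files named in this procedure exist and are not empty, and does not validate their contents.

```python
from pathlib import Path

REQUIRED_ENCODINGS = ("o200k_base.tiktoken", "cl100k_base.tiktoken")


def missing_encodings(directory: str) -> list[str]:
    """Return the required encoding files that are absent or empty in `directory`."""
    root = Path(directory)
    return [
        name
        for name in REQUIRED_ENCODINGS
        if not (root / name).is_file() or (root / name).stat().st_size == 0
    ]
```

An empty list from missing_encodings("tiktoken_encodings") means that both files are present and the directory is ready to be transferred.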

Text extraction or classification request fails with path not found errors

Applies to: 5.4.0

Problem

The text extraction and classification API calls fail with the following "404 path not found" error:

{"errors":[{"code":"path_not_found_error","message":"URI path '/ml/v1/text/extractions...' does not exist", ...}]}

Cause

The error occurs when the Watson Machine Learning operator is installed after the watsonx.ai IFM operator. The Watson Machine Learning ConfigMap does not exist and the IFM operator fails to add the routes for the /ml/v1/text/extractions and /ml/v1/text/classifications API endpoints.

Solution

To force the watsonx.ai IFM operator to recreate the routes to the API endpoints, use the following procedure:

Make sure that Watson Machine Learning is installed and is in the Completed state.
```
oc get wmlbase -n ${PROJECT_CPD_INST_OPERANDS}
```

Verify that the Watson Machine Learning ConfigMap exists in your namespace.

oc get configmap watson-machine-learning-configmap -n ${PROJECT_CPD_INST_OPERANDS}

Delete the watsonx.ai IFM routes Zen extension.

oc delete zenextension watsonxaiifm-routes -n ${PROJECT_CPD_INST_OPERANDS}

You can either wait for the watsonx.ai IFM operator to reconcile the watsonxaiifm custom resource and recreate the Zen extension, or restart the operator to force reconciliation as follows:
1. Identify the operator namespace and the operator deployment name:
```
oc get deployment -n ${PROJECT_CPD_INST_OPERATORS} | grep watsonx-ai-ifm
```
2. Restart the deployment and wait for the rollout to finish:
```
oc rollout restart deployment/<deployment-name> -n ${PROJECT_CPD_INST_OPERATORS}
```
Wait for the Zen extension to be recreated and change to the Completed state.
```
oc get zenextension watsonxaiifm-routes -n ${PROJECT_CPD_INST_OPERANDS}
```

Verify that the /ml/v1/text/extractions and /ml/v1/text/classifications routes for the text extraction and classification APIs are included in the nginx config in the Zen extension:

oc get zenextension watsonxaiifm-routes -n ${PROJECT_CPD_INST_OPERANDS} -o jsonpath='{.spec.nginx\.conf}' | grep "text/extractions"

oc get zenextension watsonxaiifm-routes -n ${PROJECT_CPD_INST_OPERANDS} -o jsonpath='{.spec.nginx\.conf}' | grep "text/classifications"

Deployments of custom foundation models might fail because of vLLM incompatibility

Applies to: 5.4.0

Problem: Deployment of a custom foundation model might fail to start.

Cause: The Red Hat vLLM image that is currently used with custom foundation models is not compatible with some models. Examples of such models include:

ibm-granite/granite-vision-3.3-2b
meta-llama/Llama-Guard-3-11B-Vision
meta-llama/Llama-3.2-11B-Vision-Instruct

Solution: Deploy your model by using a custom runtime. See Building a custom inference runtime image for your custom foundation model. For the custom runtime, you can try using an older Red Hat image, such as registry.redhat.io/rhoai/odh-vllm-cuda-rhel9:v2.25.3.

The voxtral-small-24b-2507 model returns an error with 2-D audio input

Applies to: 5.4.0

Problem When you use the text/chat endpoint to inference the voxtral-small-24b-2507 model with 2-D audio input in base64-encoded format, the model returns the following error:

Invalid input argument for Model 'mistralai/voxtral-small-24b-2507': audio_array.ndim=2 audio_array.ndim=2

Cause The voxtral-small-24b-2507 model does not support 2-D audio channel input.

Solution Convert the multichannel (2-D) audio into single-channel (1-D) audio before you pass it as input. Use the following command to decode the base64 string, convert the audio to mono, and encode the result as base64 again:

echo "<your_base64_string>" | base64 -d | ffmpeg -i pipe:0 -ac 1 -ar 16000 -f wav pipe:1 2>/dev/null | base64 -w 0
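If ffmpeg is not available, the channel downmix can be approximated with the Python standard library. This is a rough sketch under stated assumptions: the function name `downmix_wav_base64_to_mono` is hypothetical, the input must be an uncompressed 16-bit PCM WAV file, and, unlike the ffmpeg command, the sketch does not resample the audio to 16 kHz.

```python
import array
import base64
import io
import sys
import wave


def downmix_wav_base64_to_mono(b64_audio: str) -> str:
    """Average all channels of a base64-encoded 16-bit PCM WAV into one channel."""
    raw = base64.b64decode(b64_audio)
    with wave.open(io.BytesIO(raw), "rb") as src:
        channels = src.getnchannels()
        width = src.getsampwidth()
        rate = src.getframerate()
        frames = src.readframes(src.getnframes())
    if width != 2:
        raise ValueError("this sketch handles only 16-bit PCM WAV input")

    samples = array.array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian

    # Samples are interleaved per frame: average each group of `channels` values.
    mono = array.array(
        "h",
        (sum(samples[i:i + channels]) // channels
         for i in range(0, len(samples), channels)),
    )
    if sys.byteorder == "big":
        mono.byteswap()

    out = io.BytesIO()
    with wave.open(out, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(rate)  # sample rate is kept as-is (no resampling)
        dst.writeframes(mono.tobytes())
    return base64.b64encode(out.getvalue()).decode("ascii")
```

The returned string is a base64-encoded mono WAV file. If the model also requires a specific sample rate, the ffmpeg command above remains the safer choice.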

Text chat requests with moderations enabled fail for certain models due to a missing tokenize endpoint

Applies to: 5.4.0

Problem When you make text/chat and text/chat_stream calls to the following models with moderations enabled, the requests fail:

magistral-small-2509
magistral-medium-2509
mistral-medium-2508
ministral-3b-instruct-2512
ministral-8b-instruct-2512
voxtral-mini-2507

The following table shows the failure scenarios:

Scenario	`text/chat`	`text/chat_stream`
HAP and PII input/output set to `true`	Fails with error: `Unrecognized Model id`	Fails with error: `Model is currently unavailable`
HAP and PII input set to `false`, output set to `true`	Fails with error: `Downstream vllm request failed: unexpected error occurred while processing request`	Fails with error: `Downstream vllm request failed: unexpected error occurred while processing request`

Cause These models do not have a tokenize endpoint, which is required for moderation functionality.

Solution No workaround is available. Avoid using moderations with these models.

Text Semantic Schema API fails for documents over 20 pages with ExhaustedRetryError

Applies to: 5.4.0

Problem

When you use the Text Semantic Schema API to create a semantic schema for a document that exceeds 20 pages, the request fails. The failure occurs after the retries for processing a page are exhausted, and the following error is returned:

ExhaustedRetryError('Document semantic-kvp-create-schema-xxx failed due to page request XXX exhausting retry count') was directly caused by the following exception RuntimeError('Unknown job type: semantic-kvp-create-schema in page worker')

Limitations

Jobs for certain asset types are not filtered out in the watsonx™ experience: The jobs for the metadata enrichment, metadata import, and DataStage asset types are not filtered out in the Watson Studio service in the watsonx experience. For more information, see Switching between experiences.

Deploying a custom foundation model on MIG-enabled clusters with multiple GPUs fails

Problem

If you deploy a custom foundation model on a cluster with MIG enablement and use multiple GPUs, your deployment might fail. You might receive the following error message:

Custom foundation model deployment failed with architecture 'mpt' because this architecture does not support parallel tensors. Provide the predefined hardware specification 'WX-S' or specify only one GPU in a custom hardware specification.

Cause: Deploying custom foundation models on MIG-enabled clusters with multiple GPUs is not supported for TGIS and vLLM runtimes.

Solution: For clusters that are enabled to use MIG, you must deploy your custom foundation models by using a single MIG partition only and use the vLLM runtime for deployment. You can configure the size of the MIG partition based on the GPU requirement of your custom foundation model.

Working with vector indexes from connected folder assets that use a Cloud Object Storage connection without specifying a bucket is not supported

To use a connected folder asset with a vector index in your project, the connected folder asset must use a Cloud Object Storage connection that specifies a bucket.

PDF file fails to load in AutoAI for RAG experiment when using older cipher in watsonx.ai SDK

Problem: When an AutoAI for RAG experiment that uses the watsonx.ai SDK reads data from a PDF file that is encrypted with an older cipher, such as RC4, the file fails to load.

Cause: The watsonx.ai SDK uses the pypdf library, which deprecates some legacy encryption algorithms.

Solution

Decrypt the PDF file before processing.

Clear the CRYPTOGRAPHY_OPENSSL_NO_LEGACY environment variable before importing pypdf.
```
import os
os.environ.pop('CRYPTOGRAPHY_OPENSSL_NO_LEGACY', None)  # no error if the variable is not set
```

Decrypt the PDF file.

from pypdf import PdfReader, PdfWriter

reader = PdfReader("example.pdf")
if reader.is_encrypted:
   reader.decrypt("")  # if there was no password

writer = PdfWriter(clone_from=reader)

with open("decrypted-pdf.pdf", "wb") as f:
   writer.write(f)

For more information, see Manually decrypting and encrypting PDF files on watsonx.ai SDK.

Cannot use the llama-3-1-8b-instruct model to generate evaluation data in AutoAI RAG experiment

When you use the meta-llama/llama-3-1-8b-instruct model in an AutoAI RAG experiment, you cannot generate evaluation data, and the experiment fails with the following error:

unhandled errors in a TaskGroup

A custom foundation model doesn’t have enough resources to be deployed successfully

Applies to: 5.4.0

Problem: To add a custom foundation model, you created a custom deployment by using a predefined hardware specification. However, after you deploy the custom foundation model, the following error is displayed: Failed to deploy the custom foundation model. The runtime failed to start.
Cause: The predefined hardware specification does not allocate enough resources to the custom foundation model.
Solution: Define a custom hardware specification that has enough resources to support your custom foundation model. For more information, see Creating custom hardware specifications.

IBM watsonx.ai REST API requests fail intermittently

Applies to: 5.4.0

Problem: A watsonx.ai REST API request might fail intermittently with a NoHttpResponseException error.
Solution: Re-run the REST API request.
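On the client side, the rerun can be automated with a small retry wrapper. The following is a minimal sketch, not part of the watsonx.ai SDK: `call_with_retries` is a hypothetical helper, and the exception types to retry on are an assumption, because the error that your HTTP client raises when the service returns no response depends on the client library.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    send_request: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call send_request, retrying with exponential backoff on transient errors."""
    for attempt in range(1, attempts + 1):
        try:
            return send_request()
        except retry_on:
            if attempt == attempts:
                raise  # give up after the last attempt
            # Wait 0.5 s, 1 s, 2 s, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")
```

Pass a zero-argument callable that sends the request, and set `retry_on` to the connection-level exception classes of the HTTP client that you use.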

Git-integrated projects are not supported for watsonx.ai document processing

Applies to: 5.4.0

You cannot use watsonx.ai document processing operations, such as text extraction or text classification, with projects that are integrated with GitHub.