Known issues and limitations for IBM watsonx.ai
The following issues and limitations apply to IBM watsonx.ai.
- Known issues
-
- Setting up LoRA inference after upgrading IBM watsonx.ai and the BFM
- Models that require multiple GPUs may fail to start during deployment
- Models might fail to start when scheduled on nodes with unsupported GPU types
- Assets with large document chunk sizes are deleted after vector index creation
- Deploying with vector indexes in Prompt Lab using API returns incorrect answers
- Evaluating AutoAI RAG patterns in an inference notebook generates an error
- Cannot import files downloaded from a database connection into an AutoAI RAG experiment
- Running AutoAI RAG experiment with SQL knowledge base fails with error
- Running notebook generated from AutoAI RAG experiment that uses SQL database fails with AttributeError
- Running AutoAI RAG experiment with embedding model fails with model pre-selection error
- Configuring settings for previously run AutoAI RAG experiment after upgrade doesn’t show previously used models
- Rerunning an AutoAI RAG experiment after upgrade fails
- Some foundation models fail to start on a cluster with L40S GPUs
- The index type that the unstructured data integration flow creates for a vector collection is not supported by the Prompt Lab
- Text extraction jobs get stuck in running state
- Text generation API request for a foundation model fails with downstream caikit request error
- Deprecated and withdrawn models crash when the service is upgraded
- Deploying the gpt-oss-20b or gpt-oss-120b custom foundation model fails with the ValueError message
- Deploying OpenAI models in air-gapped environments fails due to missing tiktoken encoding files
- Text extraction or classification request fails with path not found errors
- Deployments of custom foundation models might fail on CPD 5.4.0 due to vLLM incompatibility
- The voxtral-small-24b-2507 model returns an error with 2-D audio input
- Text chat requests with moderations enabled fail for certain models due to missing tokenize endpoint
- Text Semantic Schema API fails for documents over 20 pages with ExhaustedRetryError
- Limitations
-
- Jobs for certain asset types are not filtered out in the watsonx experience
- Deploying a custom foundation model on MIG-enabled clusters with multiple GPUs fails
- Working with vector indexes from connected folder assets that use a Cloud Object Storage connection without specifying a bucket is not supported
- PDF file fails to load in AutoAI for RAG experiment when using older cipher in watsonx.ai SDK
- Cannot use the llama-3-1-8b-instruct model to generate evaluation data in AutoAI RAG experiment
- A custom foundation model doesn’t have enough resources to be deployed successfully
- IBM watsonx.ai REST API requests fail intermittently
- Git-integrated projects are not supported for watsonx.ai document processing
Known issues
- Setting up LoRA inference after upgrading IBM watsonx.ai and the BFM
- Models that require multiple GPUs may fail to start during deployment
- Models might fail to start when scheduled on nodes with unsupported GPU types
- Assets with large document chunk sizes are deleted after vector index creation
- Deploying with vector indexes in Prompt Lab using API returns incorrect answers
- Evaluating AutoAI RAG patterns in an inference notebook generates an error
- Cannot import files downloaded from a database connection into an AutoAI RAG experiment
- Running AutoAI RAG experiment with SQL knowledge base fails with error
- Running notebook generated from AutoAI RAG experiment that uses SQL database fails with AttributeError
- Running AutoAI RAG experiment with embedding model fails with model pre-selection error
- Configuring settings for previously run AutoAI RAG experiment after upgrade doesn’t show previously used models
- Rerunning an AutoAI RAG experiment after upgrade fails
- Some foundation models fail to start on a cluster with L40S GPUs
-
- The ibm-defense-4-0-small model may get stuck in a crash loop when scheduled on a node equipped with NVIDIA L40S GPUs
-
Applies to: 5.4.0
- Problem
- When you deploy the ibm-defense-4-0-small model on IBM Cloud Pak® for Data with watsonx.ai IFM using NVIDIA L40S GPUs, the model predictor pod may enter a crash loop due to CUDA out-of-memory errors.
- Cause
- The ibm-defense-4-0-small model (GraniteMoE Hybrid architecture) requires approximately 44 GiB of GPU memory to load its weights in float16 precision. This consumes nearly the entire 46 GiB capacity of a single L40S GPU, leaving insufficient memory for runtime allocations.
- Solution
- Run the following commands to fix the issue:
- Patch the Watsonxaiifm CR to configure the model with 2 GPUs:
    oc patch watsonxaiifm watsonxaiifm-cr \
      --namespace=${PROJECT_CPD_INST_OPERANDS} \
      --type merge \
      --patch '{"spec": {"install_model_list":["ibm-defense-4-0-small"], "model_install_parameters": {"ibm_defense_4_0_small":{"shard": 2}}}}'
- After applying the patch, verify that the model is up:
    oc get isvc ibm-defense-4-0-small -n $instance_ns
  After a few minutes, the predictor pod should reach the Ready: True state.
- Resolution
- The default configuration for ibm-defense-4-0-small in the Watsonxaiifm CR should include shard: 2 and NUM_GPUS=2.
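The memory figures quoted in the Cause can be checked with a little arithmetic. The following is a minimal sketch, not part of the product: it assumes that sharding splits the model weights evenly across GPUs, which is a simplification, and the function name per_gpu_headroom is hypothetical.

```python
# Figures taken from the Cause description above.
GPU_CAPACITY_GIB = 46.0   # capacity of a single L40S GPU
MODEL_WEIGHTS_GIB = 44.0  # approximate float16 weight size of the model


def per_gpu_headroom(weights_gib: float, capacity_gib: float, shards: int) -> float:
    """Return the GiB left on each GPU, assuming an even split of the weights.

    The even split is an assumption made for illustration only.
    """
    if shards < 1:
        raise ValueError("shards must be >= 1")
    return capacity_gib - weights_gib / shards


for shards in (1, 2):
    headroom = per_gpu_headroom(MODEL_WEIGHTS_GIB, GPU_CAPACITY_GIB, shards)
    print(f"shards={shards}: about {headroom:.0f} GiB free per GPU")
```

Under that assumption, a single GPU keeps only about 2 GiB for runtime allocations, while two shards leave roughly half of each GPU free.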
- The llama-3-1-70b-instruct model does not start on a cluster with four L40S GPUs
-
Applies to: 5.4.0
- Problem
- When you try to deploy the llama-3-1-70b-instruct foundation model on four L40S GPUs, the watsonxaiifm-cr shows the following error message:
    Reconcile History: The failed task is : unknown task and the error message is: No message available
  The llama-3-1-70b-instruct-predictor pod shows the following message:
    Back-off restarting failed container kserve-container in pod llama-3-1-70b-instruct-predictor-etc_ibm-cpd-operands
  The logs of the kserve-container in the predictor pod show the error message:
    ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (81840). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
- Cause
- The maximum sequence length of 131,072 that is specified for the foundation model is too long for the provided resources: the KV cache can store only 81,840 tokens.
- Solution
- Set a lower maximum sequence length value for the llama-3-1-70b-instruct foundation model. Run a patch command to set the value of MAX_SEQUENCE_LENGTH to 80000. Because the maximum sequence length is set as an environment parameter, the other environment values must also be specified in the command. Complete the following steps:
- Before you change any values, you can check the current values for the configuration by using the following command:
    oc get deploy llama-3-1-70b-instruct-predictor -o yaml
- Set the value of ${PROJECT_CPD_INST_OPERANDS} to the namespace where the model is installed, and then run this command:
    oc patch watsonxaiifm watsonxaiifm-cr \
      --namespace ${PROJECT_CPD_INST_OPERANDS} \
      --type merge \
      --patch '{"spec": {"model_install_parameters": {"llama_3_1_70b_instruct": {"env": [
        { "name": "MODEL_NAME", "value": "/mnt/models/models--meta-llama--llama-3-1-70b-instruct" },
        { "name": "SERVED_MODEL_NAME", "value": "meta-llama/llama-3-1-70b-instruct" },
        { "name": "MAX_SEQUENCE_LENGTH", "value": "80000" },
        { "name": "MAX_NUM_SEQS", "value": "8" },
        { "name": "MAX_NEW_TOKENS", "value": "8192" },
        { "name": "DISABLE_PROMPT_LOGPROBS", "value": "true" },
        { "name": "ENABLE_AUTO_TOOL_CHOICE", "value": "true" },
        { "name": "TOOL_CALL_PARSER", "value": "llama3_json" },
        { "name": "CHAT_TEMPLATE", "value": "/mnt/models/models--meta-llama--llama-3-1-70b-instruct/tool_chat_template_llama3.1_json.jinja" },
        { "name": "VLLM_ATTENTION_BACKEND", "value": "XFORMERS" },
        { "name": "NUM_GPUS", "value": "4" },
        { "name": "CUDA_VISIBLE_DEVICES", "value": "0,1,2,3" },
        { "name": "HUGGINGFACE_HUB_CACHE", "value": "/mnt/models/" },
        { "name": "HF_MODULES_CACHE", "value": "/tmp/huggingface/modules" },
        { "name": "PORT", "value": "3000" },
        { "name": "MAX_LOG_LEN", "value": "100" },
        { "name": "GPRC_PORT", "value": "8033" },
        { "name": "NCCL_NVLS_ENABLE", "value": "0" }
      ]}}}}'
- After you run the patch command, the watsonxaiifm-cr switches to the InProgress state, and then reaches the Completed state. You can use the following command to verify that the predictor pod is running:
    oc get po | grep predictor
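The reason 80000 works where 131072 does not follows directly from the numbers in the ValueError message. The following is a minimal sketch of that comparison; the helper name fits_in_kv_cache is hypothetical, and the KV cache capacity of 81840 tokens is the value reported in the log above, which may differ on other clusters.

```python
def fits_in_kv_cache(max_seq_len: int, kv_cache_tokens: int) -> bool:
    """Return True when one full-length sequence fits in the KV cache."""
    return max_seq_len <= kv_cache_tokens


# Capacity reported in the ValueError message above; cluster-specific.
KV_CACHE_TOKENS = 81840

for candidate in (131072, 80000):
    verdict = "fits" if fits_in_kv_cache(candidate, KV_CACHE_TOKENS) else "too long"
    print(f"MAX_SEQUENCE_LENGTH={candidate}: {verdict}")
```

Any value at or below the capacity that your own logs report should pass the same check; 80000 is simply the value used in this workaround.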
- The llama-3-3-70b-instruct model does not start on a cluster with four L40S GPUs
-
Applies to: 5.4.0
- Problem
- When you try to deploy the llama-3-3-70b-instruct foundation model on four L40S GPUs, the watsonxaiifm-cr shows the following error message:
    Reconcile History: The failed task is : unknown task and the error message is: No message available
  The llama-3-3-70b-instruct-predictor pod shows the following message:
    Back-off restarting failed container kserve-container in pod llama-3-3-70b-instruct-predictor-etc_ibm-cpd-operands
  The logs of the kserve-container in the predictor pod show the error message:
    ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (81840). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
- Cause
- The maximum sequence length of 131,072 that is specified for the foundation model is too long for the provided resources: the KV cache can store only 81,840 tokens.
- Solution
- Set a lower maximum sequence length value for the llama-3-3-70b-instruct foundation model. Run a patch command to set the value of MAX_SEQUENCE_LENGTH to 80000. Because the maximum sequence length is set as an environment parameter, the other environment values must also be specified in the command. Complete the following steps:
- Before you change any values, you can check the current values for the configuration by using the following command:
    oc get deploy llama-3-3-70b-instruct-predictor -o yaml
- Set the value of ${PROJECT_CPD_INST_OPERANDS} to the namespace where the model is installed, and then run this command:
    oc patch watsonxaiifm watsonxaiifm-cr \
      --namespace ${PROJECT_CPD_INST_OPERANDS} \
      --type merge \
      --patch '{"spec": {
        "llama_3_3_70b_instruct_replicas": "1",
        "llama_3_3_70b_instruct_resources": {
          "limits": {"cpu": "3", "ephemeral-storage": "1Gi", "memory": "96Gi", "nvidia.com/gpu": "4"},
          "requests": {"cpu": "2", "ephemeral-storage": "10Mi", "memory": "85Gi", "nvidia.com/gpu": "4"}},
        "model_install_parameters": {"llama_3_3_70b_instruct": {"env": [
          { "name": "MODEL_NAME", "value": "/mnt/models/llama-3-3-70b-instruct" },
          { "name": "SERVED_MODEL_NAME", "value": "meta-llama/llama-3-3-70b-instruct" },
          { "name": "MAX_SEQUENCE_LENGTH", "value": "80000" },
          { "name": "MAX_NUM_SEQS", "value": "8" },
          { "name": "MAX_NEW_TOKENS", "value": "8192" },
          { "name": "DISABLE_PROMPT_LOGPROBS", "value": "true" },
          { "name": "ENABLE_AUTO_TOOL_CHOICE", "value": "true" },
          { "name": "TOOL_CALL_PARSER", "value": "llama3_json" },
          { "name": "CHAT_TEMPLATE", "value": "/mnt/models/llama-3-3-70b-instruct/tool_chat_template_llama3.1_json.jinja" },
          { "name": "VLLM_ATTENTION_BACKEND", "value": "XFORMERS" },
          { "name": "NUM_GPUS", "value": "4" },
          { "name": "CUDA_VISIBLE_DEVICES", "value": "0,1,2,3" },
          { "name": "HUGGINGFACE_HUB_CACHE", "value": "/mnt/models/" },
          { "name": "HF_MODULES_CACHE", "value": "/tmp/huggingface/modules" },
          { "name": "PORT", "value": "3000" },
          { "name": "MAX_LOG_LEN", "value": "100" },
          { "name": "GPRC_PORT", "value": "8033" },
          { "name": "NCCL_NVLS_ENABLE", "value": "0" }
        ], "shard": "4"}}}}'
- After you run the patch command, the watsonxaiifm-cr switches to the InProgress state, and then reaches the Completed state. You can use the following command to verify that the predictor pod is running:
    oc get po | grep predictor
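Because the whole env list has to be restated in the patch, hand-editing the JSON is easy to get wrong. One option is to generate the patch body from a plain mapping and let a JSON library handle quoting. The following is a minimal sketch under that assumption; the helper build_env_patch is hypothetical and not part of any IBM tooling, and the three variables shown are only an excerpt, so the full list from the command above would still need to be supplied.

```python
import json


def build_env_patch(model_key, env, extra=None):
    """Build a merge-patch body with the env list under model_install_parameters.

    model_key: the CR key for the model, for example "llama_3_3_70b_instruct".
    env: mapping of environment variable names to values (stored as strings).
    extra: optional additional parameters for the model, such as {"shard": "4"}.
    """
    params = {"env": [{"name": name, "value": str(value)} for name, value in env.items()]}
    if extra:
        params.update(extra)
    return json.dumps({"spec": {"model_install_parameters": {model_key: params}}})


# Excerpt only: the complete env list from the command above must be passed.
patch = build_env_patch(
    "llama_3_3_70b_instruct",
    {"MAX_SEQUENCE_LENGTH": 80000, "MAX_NUM_SEQS": 8, "NUM_GPUS": 4},
    extra={"shard": "4"},
)
print(patch)
```

The printed string can be passed as the value of the --patch option. Generating it this way guarantees valid JSON, but it does not validate the parameter names against the custom resource.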
- The gpt-oss-120b model does not start on a cluster with one L40S GPU
-
Applies to: 5.4.0
- Problem
- When you try to deploy the gpt-oss-120b foundation model on one L40S GPU, the watsonxaiifm-cr fails with OutOfMemoryError.
- Cause
- A single L40S GPU does not have enough memory to load the gpt-oss-120b model.
- Solution
- Use two L40S GPUs to run the gpt-oss-120b model in your cluster by patching the custom resource as follows:
    oc patch watsonxaiifm watsonxaiifm-cr \
      --namespace ${PROJECT_CPD_INST_OPERANDS} \
      --type merge \
      --patch '{"spec":{"model_install_parameters":{"gpt_oss_120b":{"shard":2,"command":["/bin/sh","-c","vllm serve /mnt/models/gpt-oss-120b --served-model-name openai/gpt-oss-120b --port 3000 --max_num_seqs 16 --tensor_parallel_size=2 --max-model-len 131072 --tool-server demo --tool-call-parser openai --enable-auto-tool-choice"]}}}}'
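The patch above nests a shell command line inside a JSON array inside a shell-quoted string, which makes it awkward to edit by hand. The following is a minimal sketch that assembles the same patch body with the Python standard library; the arguments are copied from the command above, and the approach itself is an illustration, not something the product documentation prescribes.

```python
import json
import shlex

# Arguments copied from the vllm command in the patch above.
vllm_args = [
    "vllm", "serve", "/mnt/models/gpt-oss-120b",
    "--served-model-name", "openai/gpt-oss-120b",
    "--port", "3000",
    "--max_num_seqs", "16",
    "--tensor_parallel_size=2",
    "--max-model-len", "131072",
    "--tool-server", "demo",
    "--tool-call-parser", "openai",
    "--enable-auto-tool-choice",
]

# shlex.join quotes any argument that needs it before handing the line to /bin/sh -c.
command = ["/bin/sh", "-c", shlex.join(vllm_args)]

patch = {"spec": {"model_install_parameters": {"gpt_oss_120b": {"shard": 2, "command": command}}}}
print(json.dumps(patch))
```

Keeping the arguments in a list makes it harder to change the shard count without also noticing the matching --tensor_parallel_size value, because both sit side by side in the script.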
- The index type that the unstructured data integration flow creates for a vector collection is not supported by the Prompt Lab
- Text extraction jobs get stuck in running state
- Text generation API request for a foundation model fails with downstream caikit request error
- Deprecated and withdrawn models crash when the service is upgraded
- Deploying the gpt-oss-20b or gpt-oss-120b custom foundation model fails with the ValueError message
- Deploying OpenAI models in air-gapped environments fails due to missing tiktoken encoding files
- Text extraction or classification request fails with path not found errors
- Deployments of custom foundation models might fail because of vLLM incompatibility
- The voxtral-small-24b-2507 model returns an error with 2-D audio input
- Text chat requests with moderations enabled fail for certain models due to a missing tokenize endpoint
- Text Semantic Schema API fails for documents over 20 pages with ExhaustedRetryError
Limitations
- Jobs for certain asset types are not filtered out in the watsonx™ experience
- The jobs for metadata enrichment, metadata import, and datastage asset types are not filtered out in the Watson Studio service in the watsonx experience. For more information, see Switching between experiences.
- Deploying a custom foundation model on MIG-enabled clusters with multiple GPUs fails
-
- Problem
- If you deploy a custom foundation model on a cluster with MIG enablement and use multiple GPUs, your deployment might fail. You might receive the following error message:
    Custom foundation model deployment failed with architecture 'mpt' because this architecture does not support parallel tensors. Provide the predefined hardware specification 'WX-S' or specify only one GPU in a custom hardware specification.
- Cause
- Deploying custom foundation models on MIG-enabled clusters with multiple GPUs is not supported for TGIS and vLLM runtimes.
- Solution
- For clusters that are enabled to use MIG, you must deploy your custom foundation models by using a single MIG partition only and use the vLLM runtime for deployment. You can configure the size of the MIG partition based on the GPU requirement of your custom foundation model.
- Working with vector indexes from connected folder assets that use a Cloud Object Storage connection without specifying a bucket is not supported
- To use a connected folder asset with a vector index in your project, the connected folder asset must use a Cloud Object Storage connection that specifies a bucket.
- PDF file fails to load in AutoAI for RAG experiment when using older cipher in watsonx.ai SDK
-
- Problem
- When you use an older cipher, such as RC4, to read data from a PDF file in an AutoAI for RAG experiment with the watsonx.ai SDK, the file fails to load.
- Cause
- The watsonx.ai SDK uses the pypdf library, which deprecates some legacy encryption algorithms.
- Solution
- Decrypt the PDF file before processing.
- Clear the CRYPTOGRAPHY_OPENSSL_NO_LEGACY environment variable before importing pypdf.
    import os

    # pop() with a default avoids a KeyError when the variable is not set
    os.environ.pop('CRYPTOGRAPHY_OPENSSL_NO_LEGACY', None)
- Decrypt the PDF file.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader("example.pdf")
    if reader.is_encrypted:
        reader.decrypt("")  # empty string if there was no password

    # Copy the decrypted content into a new, unencrypted file
    writer = PdfWriter(clone_from=reader)
    with open("decrypted-pdf.pdf", "wb") as f:
        writer.write(f)
- Cannot use the llama-3-1-8b-instruct model to generate evaluation data in AutoAI RAG experiment
- When you use the meta-llama/llama-3-1-8b-instruct model in an AutoAI RAG experiment, you cannot generate evaluation data, and the experiment fails with the following error:
    unhandled errors in a TaskGroup
- A custom foundation model doesn’t have enough resources to be deployed successfully
-
Applies to: 5.4.0
- Problem
- To add a custom foundation model, you created a custom deployment by using a predefined hardware specification. However, after you deploy the custom foundation model, the following error is displayed:
    Failed to deploy the custom foundation model. The runtime failed to start.
- Cause
- The predefined hardware specification does not allocate enough resources to the custom foundation model.
- Solution
- Define a custom hardware specification that has enough resources to support your custom foundation model. For more information, see Creating custom hardware specifications.
- IBM watsonx.ai REST API requests fail intermittently
- Git-integrated projects are not supported for watsonx.ai document processing
Applies to: 5.4.0
You cannot use watsonx.ai document processing operations, such as text extraction or text classification, with projects that are integrated with GitHub.