Known issues and limitations for Watson Machine Learning
The following known issues and limitations apply to Watson Machine Learning.
Known issues
- Known issues for AutoAI
  - Importing an AutoAI notebook from a catalog can result in runtime error
  - Creating an online deployment of a stored AutoAI model results in an error
  - Accessing training data with the watsonx.data Presto connector results in an error
  - Running notebook saved from an AutoAI experiment results in dependencies error
  - Running AutoAI experiment results in pipeline error
- Known issues for Watson Machine Learning
  - Upgrading Watson Machine Learning may fail because of runtime errors
  - Upgrading Watson Machine Learning may fail due to orphaned objects
  - Unusable deployments after an upgrade or restore from backup
  - Decision Optimization deployment job fails with error: "Add deployment failed with deployment not finished within time"
  - Configuring runtime definition for a specific GPU node fails
  - StatefulSet update failure or missing attribute in conditional check during playbook execution
  - Space export with connections and connected data assets fails
  - After the deployment of a custom foundation model is initiated, it remains at initialization stage indefinitely
  - After upgrading Cloud Pak for Data, patching the hardware specification of a foundation model deployment might fail
  - EventData.FailedTaskPath error messages appear in log for wml-operator after fresh installation or upgrade
  - A certification path error appears when users inference online deployments by using Java
Limitations
- Limitations for Watson Machine Learning
  - Deep Learning experiments with storage volumes in a Git enterprise project are not supported
  - Deep Learning jobs are not supported on IBM Power (ppc64le) or Z (s390x) platforms
  - Deploying a model on an s390x cluster might require retraining
  - Limits on size of model deployments
  - Automatic mounting of storage volumes is not supported by online and batch deployments
  - Batch deployment jobs that use large inline payload might get stuck in starting or running state
  - Setting environment variables in a conda yaml file does not work for deployments
  - Jobs for batch deployments that use package extensions may fail
- Limitations for AutoAI experiments
Known issues for AutoAI
Importing an AutoAI notebook from a catalog can result in runtime error
Applies to: 5.2.0 (watsonx.ai 2.0 and later)
If you save an AutoAI notebook to an IBM Knowledge Catalog, and then you import it into a project and run it, you might get this error: Library not compatible or missing.
This error results from a mismatch between the runtime environment saved in the catalog and the runtime environment required to run the notebook in the project. To resolve, update the runtime environment to the latest supported version. For
example, if the imported notebook uses Runtime 23.1 in the catalog version, update to Runtime 24.1 and run the notebook job again.
Creating an online deployment of a stored AutoAI model results in an error
Applies to: 5.2.0
Fixed in: 5.2.1
After training an AutoAI model and promoting it to the deployment space, creating an online deployment of the model results in the following error: Failed to fetch pipeline-model.json. To work around this issue, restore the model by using the Python SDK or rerun the AutoAI experiment. For details about restoring the model with the Python SDK, see Web Service Deployment Modules for AutoAI models.
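The following is a minimal sketch of restoring the web service with the SDK's deployment module, following the pattern that AutoAI-generated notebooks use; the credentials, project ID, space ID, pipeline name, and experiment_metadata dictionary are placeholders that come from your own experiment, and parameter names might differ slightly by SDK version.
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.deployment import WebService

# Placeholder credentials: replace the hostname and API key with your own.
credentials = Credentials(
    url="https://<cpd-hostname>",
    api_key="<your-api-key>",
    instance_id="openshift",
    version="5.2",
)

# Connect the source project that holds the AutoAI experiment to the
# target deployment space.
service = WebService(
    source_wml_credentials=credentials,
    target_wml_credentials=credentials,
    source_project_id="<project-id>",
    target_space_id="<space-id>",
)

# Placeholder: copy the full experiment_metadata dict from your AutoAI notebook.
experiment_metadata = {}

# Re-create the online deployment from the stored AutoAI pipeline.
service.create(
    model="Pipeline_1",
    metadata=experiment_metadata,
    deployment_name="autoai_webservice",
)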
Accessing training data with the watsonx.data Presto connector results in an error
Applies to: 5.2.0
Fixed in: 5.2.2
When you use the watsonx.data Presto connector in an AutoAI experiment, accessing training data from the catalog by using the Presto C++ engine might result in this unexpected error:
Record could not be fetched from the data source: SQL error: Query failed
Workaround:
Restart the Presto C++ worker pods.
Running notebook saved from an AutoAI experiment results in dependencies error
Applies to: 5.2.1
Fixed in: 5.2.2
After running an AutoAI experiment with fairness enabled and saving it as a notebook, running the notebook results in the following dependencies error:
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. autoai-ts-libs 5.0.3 requires dill<0.4,>=0.3.1.1, but you have dill 0.4.0 which is incompatible.
Workaround:
- Install the latest version of autoai-ts-libs (version 5.0.5 or later) from test.pypi in the notebook with fairness enabled.
- If you encounter additional dependency issues between dill and the dataset, install compatible versions of the required libraries by running this command:
  !pip install dill==0.3.8 mystic==0.3.9 klepto==0.2.2
Running AutoAI experiment results in pipeline error
Applies to: 5.2.1 and later
When running a new AutoAI experiment, the experiment fails and generates the following error:
Pipeline progress is unavailable due to an error.
Workaround:
Rerun the experiment.
Known issues for Watson Machine Learning
Upgrading Watson Machine Learning may fail because of runtime errors
Applies to: 5.2.0
When upgrading Watson Machine Learning from version 5.0.1 or 5.1.x to version 5.2.0 on the s390x architecture, the upgrade may fail because an Ansible variable for a runtime definition is undefined.
The following is an example of the error that can be seen in the Watson Machine Learning custom resource (CR):
Message: AnsibleUndefinedVariable: 'onnxruntime_opset_19_server_json' is undefined
The playbook has failed. See earlier output for exact error
Confirm the failure in the Watson Machine Learning CR
Run the following command to see the status of the wml-cr:
oc describe wmlbase wml-cr -n zen
The following is an example output:
Message: AnsibleUndefinedVariable: 'onnxruntime_opset_19_multi_server_json' is undefined
The playbook has failed. See earlier output for exact error
Reason: Failed
Status: True
Type: Failure
Last Transition Time: 2025-06-04T06:12:13Z
Message: Running reconciliation
Reason: Running
Status: True
Type: Running
Progress: 5%
Progress Message: Finished Pre-Configuration
Reconcile History:
The last reconciliation was completed successfully.
Versions:
Reconciled: 5.1.2
Wml Status: InProgress
Events: <none>
Workaround: To fix this issue, follow these steps:
1. Enable maintenance mode on the Watson Machine Learning CR:
   oc patch wmlbase wml-cr --type merge --patch '{"spec": {"ignoreForMaintenance": true}}' -n <cpd_instance_ns>
2. Identify the Watson Machine Learning operator pod:
   oc get pods -n <cpd-operator-ns> | grep ibm-cpd-wml-operator
3. Take note of the upgrade playbook file:
   - If you are upgrading from version 5.0.x, the upgrade playbook file is called upgrade50-s390x.yml.
   - If you are upgrading from version 5.1.x, the upgrade playbook file is called upgrade51-s390x.yml.
4. Enter the pod to copy and edit the upgrade playbook file:
   oc rsh -n <cpd-operator-ns> <wml-operator-pod-name>
5. While inside the pod, change to the directory that contains the upgrade playbook files:
   cd /opt/ansible/5.2.0/roles/wml-base/tasks
6. Copy the upgrade playbook file to a location outside the pod:
   oc cp -n <cpd-operator-ns> <wml-operator-pod-name>:/opt/ansible/5.2.0/roles/wml-base/tasks/upgrade50-s390x.yml /tmp/upgrade50-s390x.yml
7. Keep a backup of the upgrade playbook file:
   mv upgrade50-s390x.yml upgrade50-s390x.yml.org
8. Edit the upgrade playbook file and update the list of json files under the section that begins with:
   - name: Load runtime definitions for
     include_role:
       name: "common"
       tasks_from: load_runtime_definition.yml
     loop:
   Add the following json files to the list:
   - { file_name: "files/onnxruntime_opset_19-server.json", var_name: "onnxruntime_opset_19_server_json" }
   - { file_name: "files/onnxruntime_opset_19-multi-server.json", var_name: "onnxruntime_opset_19_multi_server_json" }
   Here is an example of the complete code block with the updated json list:
   - name: Load runtime definitions for
     include_role:
       name: "common"
       tasks_from: load_runtime_definition.yml
     loop:
       - { file_name: "files/auto_ai.kb-server.json", var_name: "auto_ai_kb_server_json" }
       - { file_name: "files/auto_ai.ts-server.json", var_name: "auto_ai_ts_server_json" }
       - { file_name: "files/auto_rag-server.json", var_name: "auto_rag_server_json" }
       - { file_name: "files/training-job-server.json", var_name: "training_job_server_json" }
       - { file_name: "files/spss-modeler_batch-server.json", var_name: "spss_modeler_batch_server_json" }
       - { file_name: "files/spss-modeler_online-server.json", var_name: "spss_modeler_online_server_json" }
       - { file_name: "files/do-server.json", var_name: "do_server_json" }
       - { file_name: "files/wml-rshiny-server.json", var_name: "wml_rshiny_server_json" }
       - { file_name: "files/wml-rshiny-rstudio-24.1-r4.3-server.json", var_name: "wml_rshiny_rstudio_241_r43_server_json" }
       - { file_name: "files/spark-mllib_3.4-multi-server.json", var_name: "spark_mllib_34_multi_server_json" }
       - { file_name: "files/spark-mllib_3.5-multi-server.json", var_name: "spark_mllib_35_multi_server_json" }
       - { file_name: "files/pmml-3.0_4.3-multi-server.json", var_name: "pmml_30_43_multi_server_json" }
       - { file_name: "files/wx-cfm-job-0.1-server.json", var_name: "wx_cfm_job_01_server_json" }
       - { file_name: "files/wx-fm-deployment-job-1.0-server.json", var_name: "wx_fm_deployment_job_10_server_json" }
       - { file_name: "files/runtime-24.1-py3.11-server.json", var_name: "runtime_241_py311_server_json" }
       - { file_name: "files/runtime-24.1-py3.11-multi-server.json", var_name: "runtime_241_py311_multi_server_json" }
       - { file_name: "files/runtime-24.1-py3.11-cuda-server.json", var_name: "runtime_241_py311_cuda_server_json" }
       - { file_name: "files/tensorflow_rt24.1-py3.11-server.json", var_name: "tensorflow_rt241_py311_server_json" }
       - { file_name: "files/tensorflow_rt24.1-py3.11-dist-server.json", var_name: "tensorflow_rt241_py311_dist_server_json" }
       - { file_name: "files/tensorflow_rt24.1-py3.11-edt-server.json", var_name: "tensorflow_rt241_py311_edt_server_json" }
       - { file_name: "files/pytorch-onnx_rt24.1-py3.11-edt-multi-server.json", var_name: "pytorch_onnx_rt241_py311_edt_multi_server_json" }
       - { file_name: "files/pytorch-onnx_rt24.1-py3.11-edt-server.json", var_name: "pytorch_onnx_rt241_py311_edt_server_json" }
       - { file_name: "files/pytorch-onnx_rt24.1-py3.11-multi-server.json", var_name: "pytorch_onnx_rt241_py311_multi_server_json" }
       - { file_name: "files/pytorch-onnx_rt24.1-py3.11-server.json", var_name: "pytorch_onnx_rt241_py311_server_json" }
       - { file_name: "files/pytorch-onnx_rt24.1-py3.11-dist-multi-server.json", var_name: "pytorch_onnx_rt241_py311_dist_multi_server_json" }
       - { file_name: "files/pytorch-onnx_rt24.1-py3.11-dist-server.json", var_name: "pytorch_onnx_rt241_py311_dist_server_json" }
       - { file_name: "files/onnxruntime_opset_19-server.json", var_name: "onnxruntime_opset_19_server_json" }
       - { file_name: "files/onnxruntime_opset_19-multi-server.json", var_name: "onnxruntime_opset_19_multi_server_json" }
       - { file_name: "files/autoai-kb_rt24.1-py3.11-server.json", var_name: "autoai_kb_rt241_py311_server_json" }
       - { file_name: "files/autoai-ts_rt24.1-py3.11-server.json", var_name: "autoai_ts_rt241_py311_server_json" }
       - { file_name: "files/runtime-24.1-r4.3-server.json", var_name: "runtime_241_r43_server_json" }
       - { file_name: "files/dl-training-workload_rt24.1-py3.11-cuda-server.json", var_name: "dl_training_workload_rt241_py311_cuda_server_json" }
       - { file_name: "files/dl-training-workload_rt24.1-py3.11-server.json", var_name: "dl_training_workload_rt241_py311_server_json" }
       - { file_name: "files/training-job-go-server.json", var_name: "training_job_go_server_json" }
       - { file_name: "files/wml-hpo-job-server.json", var_name: "wml_hpo_job_server_json" }
       - { file_name: "files/pytorch-onnx_rt25.1-py3.12-edt-multi-server.json", var_name: "pytorch_onnx_rt251_py312_edt_multi_server_json" }
       - { file_name: "files/pytorch-onnx_rt25.1-py3.12-edt-server.json", var_name: "pytorch_onnx_rt251_py312_edt_server_json" }
       - { file_name: "files/pytorch-onnx_rt25.1-py3.12-multi-server.json", var_name: "pytorch_onnx_rt251_py312_multi_server_json" }
       - { file_name: "files/pytorch-onnx_rt25.1-py3.12-server.json", var_name: "pytorch_onnx_rt251_py312_server_json" }
       - { file_name: "files/pytorch-onnx_rt25.1-py3.12-dist-multi-server.json", var_name: "pytorch_onnx_rt251_py312_dist_multi_server_json" }
       - { file_name: "files/pytorch-onnx_rt25.1-py3.12-dist-server.json", var_name: "pytorch_onnx_rt251_py312_dist_server_json" }
       - { file_name: "files/tensorflow_rt25.1-py3.12-server.json", var_name: "tensorflow_rt251_py312_server_json" }
       - { file_name: "files/tensorflow_rt25.1-py3.12-dist-server.json", var_name: "tensorflow_rt251_py312_dist_server_json" }
       - { file_name: "files/tensorflow_rt25.1-py3.12-edt-server.json", var_name: "tensorflow_rt251_py312_edt_server_json" }
       - { file_name: "files/runtime-25.1-py3.12-server.json", var_name: "runtime_251_py312_server_json" }
       - { file_name: "files/runtime-25.1-py3.12-multi-server.json", var_name: "runtime_251_py312_multi_server_json" }
       - { file_name: "files/autoai-kb_rt25.1-py3.12-server.json", var_name: "autoai_kb_rt251_py312_server_json" }
       - { file_name: "files/autoai-ts_rt25.1-py3.12-server.json", var_name: "autoai_ts_rt251_py312_server_json" }
       - { file_name: "files/runtime-25.1-r4.4-server.json", var_name: "runtime_251_r44_server_json" }
9. Push the updated file back into the pod:
   oc cp /tmp/upgrade50-s390x.yml <wml-operator-pod-name>:/opt/ansible/5.2.0/roles/wml-base/tasks/upgrade50-s390x_new.yml -n <cpd-operator-ns>
   oc rsh -n <cpd-operator-ns> <wml-operator-pod-name>
   cd /opt/ansible/5.2.0/roles/wml-base/tasks
10. Rename the modified file to the original name:
    mv upgrade50-s390x_new.yml upgrade50-s390x.yml
    chmod 777 upgrade50-s390x.yml
    exit
11. Turn off maintenance mode:
    oc patch wmlbase wml-cr --type merge --patch '{"spec": {"ignoreForMaintenance": false}}' -n <cpd_instance_ns>
Upgrading Watson Machine Learning may fail due to orphaned objects
Applies to: 5.2.0
An upgrade of Watson Machine Learning may fail because of orphaned Watson Machine Learning deployment objects.
If the issue is detected early in the upgrade process, the pre-upgrade job fails.
Confirm upgrade failure is due to orphaned objects
To confirm that the upgrade failure is due to orphaned objects, complete the following steps:
1. Fetch the pod to confirm that the job run failed:
   oc get pods -n <namespace> | grep wml-pre-upgrade-check
2. Check the pod output for indications of orphaned deployments:
   oc logs <pod-name> | grep "Orphaned deployments found! Exiting with failure."
3. If orphaned objects are detected, you receive a detailed report that outlines the orphaned objects found in the Watson Machine Learning environment. Review the list of Watson Machine Learning deployment objects in the log report. For example:
   2025/05/03 11:36:27,837|INFO|check_dep_orphans.py:250: Generating report.....
   2025/05/03 11:36:27,841|INFO|check_dep_orphans.py:252: Below are the WML deployments orphaned objects found.
   2025/05/03 11:36:27,843|INFO|check_dep_orphans.py:255: ------------------------------------------------
   2025/05/03 11:36:27,844|INFO|check_dep_orphans.py:256: space_id: 350d356e-fdf7-42da-9f6d-71a72ee77221
   2025/05/03 11:36:27,845|INFO|check_dep_orphans.py:257: ------------------------------------------------
   2025/05/03 11:36:27,846|INFO|check_dep_orphans.py:260: Deployments:
   2025/05/03 11:36:27,849|INFO|check_dep_orphans.py:264: e854d101-1c55-46c1-bb89-3589c27395ac {'name': 'base', 'sw_spec_id': '121c6a74-4ed4-5828-8b81-56ef47f3bc2f', 'type': 'base', 'model_id': 'fcb4c4d0-3587-4069-b58c-6fcec002384d'}
   2025/05/03 11:36:27,850|INFO|check_dep_orphans.py:266: Models:
   2025/05/03 11:36:27,851|INFO|check_dep_orphans.py:271: f5742786-4546-43c5-853f-13758d04ee0a
   2025/05/03 11:36:27,854|INFO|check_dep_orphans.py:273: Derived software specification:
   2025/05/03 11:36:27,855|INFO|check_dep_orphans.py:278: 7274f418-19c9-4d9c-b13f-ba9b1411fc79 {'name': 'derived1_cw_spec-upgrade', 'base_sw_spec_id': '121c6a74-4ed4-5828-8b81-56ef47f3bc2f'}
   2025/05/03 11:36:27,856|INFO|check_dep_orphans.py:280: Missing software specification:
   2025/05/03 11:36:27,858|INFO|check_dep_orphans.py:285: 121c6a74-4ed4-5828-8b81-56ef47f3bc2f
   27,896|ERROR|check_dep_orphans.py:448: Orphaned deployments found! Exiting with failure.
   Error: /opt/ibm/scripts/check_dep_orphans.pyc rc=1
Workaround to clear orphaned objects
To clear the orphaned objects, complete the following steps:
1. Delete the orphaned objects directly by patching the Watson Machine Learning custom resource (CR) with the delete_orphan field:
   oc patch wmlbase wml-cr -n <namespace> --type=merge -p '{"spec":{"delete_orphan":true}}'
   After you run the patch command, the Watson Machine Learning operator reconciles and the job reruns.
2. When the job run is complete, you see a success message:
   oc logs <pod-name> | grep "Orphaned deployments deleted successfully. Exiting with success."
   For example:
   [root@api.xxxxxx.cp.fyre.ibm.com ~]# oc logs wml-pre-upgrade-check-kqpqs | grep "Orphaned deployments deleted successfully. Exiting with success."
   2025/05/03 11:49:33,713|INFO|check_dep_orphans.py:445: Orphaned deployments deleted successfully. Exiting with success.
Unusable deployments after an upgrade or restore from backup
Applies to: 5.2.0
For deployments created on Cloud Pak for Data, generating predictions with a deployment might fail after an upgrade. The error message for this problem is:
Deployment: <deployment-ID> has been suspended due to the deployment owner either not being a member of the deployment space: <space-ID> any more or removed from the system.
These errors can also occur following a restore from backup.
To resolve the problem, update the deployments by using the following steps. For R Shiny deployments, use the alternative steps that follow.
To update deployments, except for R Shiny deployments:
1. For HOST="CPD_HOSTNAME", replace "CPD_HOSTNAME" with the Cloud Pak for Data hostname.
2. For SPACE_ID="WML_SPACE_ID", replace "WML_SPACE_ID" with the space ID of the failing deployment.
3. For DEPLOYMENT_ID="WML_DEPLOYMENT_ID", replace "WML_DEPLOYMENT_ID" with the deployment ID of the broken deployment.
4. Use "Authorization: ZenApiKey <token>" and supply a valid token. If you export the environment variable, use ${TOKEN} instead of <token>.
5. Use this cURL command and replace "OWNER_ID" in the PATCH payload with the actual owner ID on this cluster:
   curl -k -X PATCH "$HOST/ml/v4/deployments/$DEPLOYMENT_ID?version=2020-04-20&space_id=$SPACE_ID" -H "content-type: application/json" -H "Authorization: ZenApiKey <token>" --data '[{ "op": "replace", "path": "/metadata/owner", "value": "OWNER_ID" }]'
To run this script, you must generate and export the token as the ${TOKEN} environment variable. For details, see Generating an API authorization token.
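If you prefer to run the update from Python rather than cURL, the following sketch sends the same PATCH request; the hostname, IDs, and owner ID are placeholders, and the token is read from the ${TOKEN} environment variable.
import os
import requests

HOST = "https://<cpd-hostname>"          # placeholder Cloud Pak for Data hostname
SPACE_ID = "<wml-space-id>"              # placeholder space ID
DEPLOYMENT_ID = "<wml-deployment-id>"    # placeholder deployment ID
OWNER_ID = "<owner-id>"                  # actual owner ID on this cluster
TOKEN = os.environ["TOKEN"]              # export the token before running

# Same call as the cURL command: replace the deployment owner.
response = requests.patch(
    f"{HOST}/ml/v4/deployments/{DEPLOYMENT_ID}",
    params={"version": "2020-04-20", "space_id": SPACE_ID},
    headers={
        "content-type": "application/json",
        "Authorization": f"ZenApiKey {TOKEN}",
    },
    json=[{"op": "replace", "path": "/metadata/owner", "value": OWNER_ID}],
    verify=False,  # equivalent of the -k flag; supply a CA bundle instead where possible
)
response.raise_for_status()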
To update R Shiny deployments:
1. Use oc get pods -n NAMESPACE | grep "wml-deployment-manager" and replace NAMESPACE with the Watson Machine Learning namespace.
2. For oc exec -it WML_DEPLOYMENT_MANAGER_POD_NAME bash -n NAMESPACE, replace WML_DEPLOYMENT_MANAGER_POD_NAME with the deployment manager pod name that is displayed in the previous step and replace NAMESPACE with the Watson Machine Learning namespace.
3. For deployment_id="DEPLOYMENT_ID", replace DEPLOYMENT_ID with the deployment ID.
4. For space_id="SPACE_ID", replace SPACE_ID with the space ID for the deployment.
5. For HOST="https://wml-deployment-manager-svc.NAMESPACE.svc:16500", replace NAMESPACE with the Watson Machine Learning namespace.
6. Use "Authorization: ZenApiKey <token>" and supply a valid token. If you export the environment variable, use ${TOKEN} instead of <token>.
7. Re-create the R Shiny deployment by using the following cURL command:
   curl -k -X PUT "$HOST/ml/v4_private/recreate_deployment/$deployment_id?version=2020-06-12&space_id=$space_id" -H "Authorization: ZenApiKey <token>"
8. Verify the status of the R Shiny deployment and wait for the deployment to become "Ready" before you proceed to the next step:
   curl -k -X GET "$HOST/ml/v4/deployments/$deployment_id?version=2020-06-12&space_id=$space_id" -H "Authorization: ZenApiKey ${TOKEN}"
9. If you are upgrading to Cloud Pak for Data 4.8.0 or restoring from backup, scale up the number of copies by 1 from the deployment space UI. The deployment state changes from "Unusable" to "Deployed".
10. Optionally, scale the number of copies back to 1 or the original setting when the deployment is working as expected.
To run this script, you must generate and export the token as the ${TOKEN} environment variable. For details, see Generating an API authorization token.
Decision Optimization deployment job fails with error: "Add deployment failed with deployment not finished within time"
Applies to: 5.2.0
If your decision optimization deployment job fails with the following error, complete the steps to extend the timeout window.
"status": {
"completed_at": "2022-09-02T02:35:31.711Z",
"failure": {
"trace": "0c4c4308935a3c4f2d9987b22139c61c",
"errors": [{
"code": "add_deployment_failed_in_runtime",
"message": "Add deployment failed with deployment not finished within time"
}]
},
"state": "failed"
}
To update the deployment timeout in the deployment manager:
1. Edit the wmlbase wml-cr and add this line: ignoreForMaintenance: true. This puts the WML operator into maintenance mode, which stops automatic reconciliation. Otherwise, automatic reconciliation undoes any configmap changes that you apply.
   oc patch wmlbase wml-cr --type merge --patch '{"spec": {"ignoreForMaintenance": true}}' -n <namespace>
   For example:
   oc patch wmlbase wml-cr --type merge --patch '{"spec": {"ignoreForMaintenance": true}}' -n zen
2. Capture the contents of the wmlruntimemanager configmap in a YAML file:
   oc get cm wmlruntimemanager -n <namespace> -o yaml > wmlruntimemanager.yaml
   For example:
   oc get cm wmlruntimemanager -n zen -o yaml > wmlruntimemanager.yaml
3. Create a backup of the wmlruntimemanager YAML file:
   cp wmlruntimemanager.yaml wmlruntimemanager.yaml.bkp
4. Open the wmlruntimemanager.yaml file:
   vi wmlruntimemanager.yaml
5. Navigate to the file runtimeManager.conf and search for the property service.
6. Increase the number of retries in the retry_count field to extend the timeout window:
   service {
     jobs {
       do {
         check_deployment_status {
           retry_count = 420 // Increase the number of retries to extend the timeout window
         }
         retry_delay = 1000
       }
     }
   }
   Where:
   - retry_count = Number of retries
   - retry_delay = Delay between each retry in milliseconds
   In the example, the timeout is configured as 7 minutes (retry_count * retry_delay = 420 * 1000 ms = 420 seconds = 7 minutes). To increase the timeout further, increase the number of retries in the retry_count field.
7. Apply the deployment manager configmap changes:
   oc delete -f wmlruntimemanager.yaml
   oc create -f wmlruntimemanager.yaml
8. Restart the deployment manager pods:
   oc get pods -n <namespace> | grep wml-deployment-manager
   oc delete pod <podname> -n <namespace>
9. Wait for the deployment manager pod to come up:
   oc get pods -n <namespace> | grep wml-deployment-manager
If you plan to upgrade the Cloud Pak for Data cluster, you must bring the WML operator out of maintenance mode by setting the field ignoreForMaintenance to false in wml-cr.
Configuring runtime definition for a specific GPU node fails
Applies to: 5.2.0
When you configure the runtime definition to use a specific GPU node with the nodeaffinity property, the runtime definition fails.
As a workaround, you must enable the MIG configuration for all GPU nodes if MIG is enabled for even a single GPU node. You must also use the Single profile type for all the GPU nodes. Mixed profiling is not supported.
To learn more about single and mixed profiling strategies, see NVIDIA documentation.
StatefulSet update failure or missing attribute in conditional check during playbook execution
Applies to: 5.2.1 and later
wml-cr might fail and return any of these error messages:
Failed to replace object: b'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"StatefulSet.apps \\"wml-cpd-etcd\\" is invalid: spec: Forbidden: updates to statefulset spec for fields other than \'replicas\', \'ordinals\', \'template\', \'updateStrategy\', \'persistentVolumeClaimRetentionPolicy\' and \'minReadySeconds\' are forbidden","reason":"Invalid","details":{"name":"wml-cpd-etcd","group":"apps","kind":"StatefulSet","causes":[{"reason":"FieldValueForbidden","message":"Forbidden: updates to statefulset spec for fields other than \'replicas\', \'ordinals\', \'template\', \'updateStrategy\', \'persistentVolumeClaimRetentionPolicy\' and \'minReadySeconds\' are forbidden","field":"spec"}]},"code":422}\n'
The conditional check '( etcd_statefulset.result.status.replicas == 3 )' failed. The error was: error while evaluating conditional (( etcd_statefulset.result.status.replicas == 3 )): 'dict object' has no attribute 'result'
The playbook has failed. See earlier output for exact error
The problem occurs because the data-wml-cpd-etcd PersistentVolumeClaim (PVC) was originally created with the ocs-storagecluster-cephfs storage class, which was defined in the wml-cr configuration. From release 5.2.x, the system prioritizes the blockStorageClass setting. As a result, it tries to switch to ocs-storagecluster-ceph-rbd, which conflicts with the existing PVC setup.
To resolve the issue:
1. Get the etcd PVC storage class and verify whether it is set to ocs-storagecluster-cephfs:
   oc get pvc data-wml-cpd-etcd-0 -n <cpd-instance-namespace> -o jsonpath='{.spec.storageClassName}'
2. Get the block storage class that was set in wml-cr and verify whether it is set to ocs-storagecluster-ceph-rbd:
   oc get wmlbase wml-cr -n <cpd-instance-namespace> -o jsonpath='{.spec.blockStorageClass}'
3. If the etcd PVC storage class is ocs-storagecluster-cephfs and the block storage class is set to ocs-storagecluster-ceph-rbd, patch the wml-cr:
   oc patch wmlbase wml-cr \
     --type=merge \
     --patch '{"spec": {"blockStorageClass":"ocs-storagecluster-cephfs"}}'
Space export with connections and connected data assets fails
Applies to: 5.2.1
Fixed in: 5.2.2
After you import a deployment space as a zip file, it is not possible to export the same space as a zip file. The error message that appears informs the user that some files are missing.
Workaround:
To minimize the risk, when exporting the space, select fewer than 20 assets and avoid selecting the all_assets or asset_types parameters.
After the deployment of a custom foundation model is initiated, it remains at initialization stage indefinitely
Applies to: 5.2.0
Fixed in: 5.2.1
After the deployment of a custom foundation model is initiated, it might remain at initialization stage indefinitely.
Workaround:
1. Put the watsonxaiifm operator in maintenance mode:
   oc patch --namespace ${PROJECT_CPD_INST_OPERANDS} watsonxaiifm watsonxaiifm-cr --type merge --patch '{"spec": {"ignoreForMaintenance": true}}'
   Restart the pod. Wait until the operator transitions into maintenance mode.
2. Open a shell session inside the operator pod:
   oc exec -it ibm-cpd-watsonx-ai-ifm-operator-pod -n ${PROJECT_CPD_INST_OPERANDS} -- bash
3. Run the debug tools script:
   . /debug/debug_tools.sh
4. Navigate to the template directory:
   cd /opt/ansible/11.0.0/roles/watsonxaiifm/templates
5. Use the vi editor to open the template file:
   vi wx-ocs-pvc-cr.yaml.j2
6. Find the line that contains "stopSeconds": 30 and change the value to "stopSeconds": 120. This increases the PVC timeout.
7. Remove temporary debug files and exit the pod:
   rm -rf /tmp/debug
   exit
8. Exit maintenance mode:
   oc patch --namespace ${PROJECT_CPD_INST_OPERANDS} watsonxaiifm watsonxaiifm-cr --type merge --patch '{"spec": {"ignoreForMaintenance": false}}'
After upgrading Cloud Pak for Data, patching the hardware specification of a foundation model deployment might fail
Applies to: 5.1.0 and later
After upgrading Cloud Pak for Data, patching the hardware specification of a base foundation model or a custom foundation model deployment might fail.
Workaround:
If this happens, run this code and try patching the hardware specification again:
for r in $(oc get rta --no-headers | awk '{print $1}'); do
  if oc get rta $r -o yaml | grep "action_queue" -A 10 | grep "^ \-" | grep "retry/[0-9]" > /dev/null; then
    ra=$(oc get rta $r -o jsonpath='{@.spec.resourceAction}')
    kubectl patch rta $r --type=json --subresource status --patch="[{\"op\": \"replace\", \"path\": \"/status/action_queue\", \"value\": []},{\"op\": \"replace\", \"path\": \"/status/last_action/resource_action\", \"value\": \"$ra\"}]"
  fi
done
EventData.FailedTaskPath error messages appear in log for wml-operator after fresh installation or upgrade
Applies to: 5.1.1, 5.1.2, and 5.1.3
After a fresh installation of Watson Machine Learning, you might see this error message appear repeatedly in wml-operator logs:
"EventData.FailedTaskPath":"/opt/ansible/bar.yaml:19","error":"[playbook task failed]","stacktrace":"github.com/operator-framework/ansible-operator-plugins/internal/ansible/events.loggingEventHandler.Handle\n\t/host/internal/ansible/events/log_events.go:111"}
Workaround:
To resolve the issue, follow these steps:
1. Put the wml operator in maintenance mode:
   oc patch wmlbase wml-cr -n <instance-namespace> --type=merge -p '{"spec":{"ignoreForMaintenance": true}}'
2. Get the operator pod name (for example, ibm-cpd-wml-operator-7d6bc6d47d-qj7qg):
   oc get po -n cpd-operator-512 | grep ibm-cpd-wml-operator
3. Get inside the pod:
   oc rsh -n <operator-namespace> <operator-pod-name>
4. Run the command for your release inside the pod:
   - 5.1.1 release: sed -i 's/5.1.0/5.1.1/g' bar.yaml
   - 5.1.2 release: sed -i 's/5.1.0/5.1.2/g' bar.yaml
   - 5.1.3 release: sed -i 's/5.1.0/5.1.2/g' bar.yaml
5. Use the Ctrl+D key combination to exit the operator pod.
6. Exit maintenance mode:
   oc patch wmlbase wml-cr -n <instance-namespace> --type=merge -p '{"spec":{"ignoreForMaintenance": false}}'
A certification path error appears when users inference online deployments by using Java
Applies to: 5.2 and later
A certification path error might appear when you inference online deployments by using Java. See an example error:
The URL is not valid.
PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
Workaround:
1. Open a web browser, paste the URL of the prediction endpoint into the address bar, and then press Enter.
2. Access security information from the address bar. In most browsers, you click the padlock icon. Then open the certificate details.
3. Export the certificate (for example, as cpd-cert.crt).
4. Open your system's terminal or command-line tool and then set these environment variables:
   - $FILE_WITH_CERTS: use the name of the file that contains the exported certificates
   - $CERT_ALIAS: create an alias for the new certificate
5. Import the certificates by using keytool. See example code for Linux:
   keytool -import -trustcacerts -alias $CERT_ALIAS -file $FILE_WITH_CERTS -keystore $JAVA_HOME/lib/security/cacerts -storepass changeit
6. Verify that the certificate is properly added. See example Linux code:
   keytool -list -v -keystore $JAVA_HOME/lib/security/cacerts | grep $CERT_ALIAS
7. Compile and execute your code.
Limitations for AutoAI experiments
AutoAI file gets pushed to the Git repository in default Git projects
After you create an AutoAI experiment in a default Git project, when you create a commit, a file that includes your experiment name appears in the list of files that can be committed. There are no consequences to including this file in your commit. The AutoAI experiment does not appear in the asset list for any other user who pulls the file into their local clone by using Git. Additionally, other users are not prevented from creating an AutoAI experiment with the same name.
Maximum number of feature columns in AutoAI experiments
The maximum number of feature columns for a classification or regression experiment is 5000.
Limitations for Watson Machine Learning
Deep Learning experiments with storage volumes in a Git enterprise project are not supported
If you create a Git project with assets in storage volumes, then create a Deep Learning experiment, running the experiment fails. This use case is not currently supported.
Deep Learning jobs are not supported on IBM Power (ppc64le) or Z (s390x) platforms
If you submit a Deep Learning training job on the IBM Power (ppc64le) or Z (s390x) platform, the job fails with an InvalidImageName error. This is the expected behavior because Deep Learning jobs are not supported on IBM Power (ppc64le) or Z (s390x) platforms.
Deploying a model on an s390x cluster might require retraining
If you trained an AI model on a platform other than s390x, such as x86/ppc, and then you try to deploy the model on the s390x platform, the deployment might fail and report an endianness issue: Argument shape does not agree with the input data.
This happens if an older version of PyTorch (older than 2.1.2) was used to train the model (runtimes older than 24.1). To resolve the problem, do one of the following:
- Retrain the model by using a runtime that contains a newer version of PyTorch on the x86/ppc platform, and then deploy the model on the s390x platform
- Retrain the AI model on the s390x platform, and then deploy the model on the s390x platform
Limits on size of model deployments
Limits on the size of models you deploy with Watson Machine Learning depend on factors such as the model framework and type. In some instances, when you exceed a threshold, you will be notified with an error when you try to store a model in
the Watson Machine Learning repository, for example: OverflowError: string longer than 2147483647 bytes. In other cases, the failure might be indicated by a more general error message, such as The service is experiencing some downstream errors, please re-try the request or There's no available attachment for the targeted asset. Any of these results indicate that you have exceeded the allowable size limits for that type of deployment.
Automatic mounting of storage volumes is not supported by online and batch deployments
You cannot use automatic mounts for storage volumes with Watson Machine Learning online and batch deployments. Watson Machine Learning does not support this feature for Python-based runtimes, including R-script, SPSS Modeler, Spark, and Decision Optimization. You can use only automatic mounts for storage volumes with Watson Machine Learning shiny app deployments and notebook runtimes.
As a workaround, you can use the download method from the Data assets library, which is a part of
the ibm-watson-machine-learning python client.
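For example, the following minimal sketch downloads a data asset to the local file system instead of reading it from a mounted volume; the credentials, space ID, asset ID, and file name are placeholders.
from ibm_watson_machine_learning import APIClient

# Placeholder credentials: replace with your cluster URL, user name, and API key.
wml_credentials = {
    "url": "https://<cpd-hostname>",
    "username": "<username>",
    "apikey": "<your-api-key>",
    "instance_id": "openshift",
    "version": "5.2",
}

client = APIClient(wml_credentials)
client.set.default_space("<space-id>")  # placeholder deployment space ID

# Download the connected data asset to a local file.
client.data_assets.download("<data-asset-id>", filename="training_data.csv")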
Batch deployment jobs that use large inline payload might get stuck in starting or running state
If you provide a large asynchronous payload for your inline batch deployment, the runtime manager process can run out of heap memory.
In the following example, 92 MB of payload was passed inline to the batch deployment, which caused the heap to run out of memory:
Uncaught error from thread [scoring-runtime-manager-akka.scoring-jobs-dispatcher-35] shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[scoring-runtime-manager]
java.lang.OutOfMemoryError: Java heap space
at java.base/java.util.Arrays.copyOf(Arrays.java:3745)
at java.base/java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:172)
at java.base/java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:538)
at java.base/java.lang.StringBuilder.append(StringBuilder.java:174)
...
This could result in concurrent jobs getting stuck in starting or running state. The starting state can be cleared only by deleting the deployment and creating a new deployment. The running state can be cleared without deleting the deployment.
As a workaround, use data references instead of inline payloads when you provide large inputs to batch deployments, as shown in the sketch that follows.
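For example, a batch deployment job can read its input from a data asset in the space rather than from an inline payload; in this minimal sketch, the credentials and the asset, space, and deployment IDs are placeholders.
from ibm_watsonx_ai import APIClient, Credentials

# Placeholder credentials and IDs: replace with your own values.
credentials = Credentials(
    url="https://<cpd-hostname>",
    api_key="<your-api-key>",
    instance_id="openshift",
    version="5.2",
)
client = APIClient(credentials, space_id="<space-id>")

job_meta = {
    client.deployments.ScoringMetaNames.INPUT_DATA_REFERENCES: [{
        "type": "data_asset",
        "location": {"href": "/v2/assets/<input-asset-id>?space_id=<space-id>"},
    }],
    client.deployments.ScoringMetaNames.OUTPUT_DATA_REFERENCE: {
        "type": "data_asset",
        "location": {"name": "batch_output.csv"},
    },
}

# The payload stays in storage, so the runtime manager does not buffer
# the input data in its heap.
job = client.deployments.create_job("<deployment-id>", meta_props=job_meta)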
Setting environment variables in a conda yaml file does not work for deployments
Setting environment variables in a conda yaml file does not work for deployments. This means that you cannot override existing environment variables, for example LD_LIBRARY_PATH, when deploying assets in Watson Machine
Learning.
Example:
variables:
  my_var: my_value
In this code, my_value will not be effective in a deployment's environment.
Workaround for online deployments:
If you're using a Python function, consider setting default parameters, as shown in the sketch that follows. For details, see Deploying Python functions.
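For example, a deployable Python function can capture its configuration as default parameters of the outer function instead of reading environment variables; the parameter name and value here are illustrative.
# Illustrative defaults: replace my_var/my_value with what your function needs.
def deployable_function(params={"my_var": "my_value"}):
    # The outer function captures the defaults that would otherwise come
    # from environment variables.
    def score(payload):
        # Read the configuration from params instead of os.environ.
        value = params["my_var"]
        return {"predictions": [{"values": [[value]]}]}
    return score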
Workaround for batch jobs:
For Python functions and Python scripts, if you're running batch jobs, use scoring.environment_variables in the job's payload.
Example code that creates a batch deployment job for a Python function by using the ibm-watsonx-ai SDK:
scoring_payload = {
'input_data': [{
'values': [[0]]
}],
"environment_variables" : {
"my_var": "my_value",
}
}
client.deployments.create_job(deployment_id, scoring_payload)
Jobs for batch deployments that use package extensions may fail
Jobs for batch deployments that use package extensions may fail with this error: WMLClientError: The product version <version> is not supported yet. This happens when a package extension downgrades ibm-watsonx-ai.
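To check whether a package extension downgraded the SDK in your runtime, you can print the installed version from a notebook or script; this is a generic diagnostic check, not a fix.
# Print the version of ibm-watsonx-ai that the runtime resolved after the
# package extension was applied.
from importlib.metadata import version

print(version("ibm-watsonx-ai"))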