Known issues and limitations for Watson Machine Learning

The following known issues and limitations apply to Watson Machine Learning.

Known issues for Federated Learning

Authentication failures for Federated Learning training jobs when allowed IPs are specified in the Remote Training System

Applies to: 5.1.0 and later

Currently, the Red Hat OpenShift Ingress controller does not set the X-Forwarded-For header to the client's IP address, regardless of the forwardedHeaderPolicy setting. As a result, Federated Learning training jobs fail authentication when allowed_ips are specified in the Remote Training System, even though the client IP address is correct.

To use the Federated Learning Remote Training System IP restriction feature, configure an external proxy to inject the X-Forwarded-For header. For more information, see the article on configuring ingress.
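
For context, allowed_ips is part of the Remote Training System definition. The following is a hypothetical sketch of registering a Remote Training System with an IP allowlist through the REST API; the endpoint version and the exact payload schema (in particular allowed_identities and allowed_ips) are assumptions, so verify them against the Federated Learning documentation:

    # Hypothetical sketch: register a Remote Training System with an IP allowlist.
    # The payload schema is an assumption; check the Federated Learning REST API documentation.
    curl -k -X POST "$HOST/ml/v4/remote_training_systems?version=2020-04-20&space_id=$SPACE_ID" \
      -H "content-type: application/json" \
      -H "Authorization: ZenApiKey ${MY_TOKEN}" \
      --data '{
        "name": "remote-party-1",
        "allowed_identities": [{ "id": "remote-user-1", "type": "user" }],
        "allowed_ips": ["203.0.113.10"]
      }'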

Known issues for AutoAI

Importing an AutoAI notebook from a catalog can result in runtime error

Applies to: 5.1.0 and later (watsonx.ai 2.0 and later)

If you save an AutoAI notebook to an IBM Knowledge Catalog, and then you import it into a project and run it, you might get this error: Library not compatible or missing.

This error results from a mismatch between the runtime environment saved in the catalog and the runtime environment required to run the notebook in the project. To resolve, update the runtime environment to the latest supported version. For example, if the imported notebook uses Runtime 23.1 in the catalog version, update to Runtime 24.1 and run the notebook job again.

Tip: When you update your runtime environment, check that you have adequate computing resources. The recommended configuration is at least 2 vCPU and 8 GB RAM for an experiment notebook, and at least 4 vCPU and 16 GB RAM for a pipeline notebook.

Known issues for Deep Learning

GPU profiles not displayed in hardware specification list for TensorFlow models with GPU inferencing enabled

Applies to: 5.1.2 and later

When you deploy a TensorFlow model for GPU inferencing, the GPU profiles are not included in the dropdown list of hardware specifications.

The root cause of the issue is that the software specification is not automatically updated when the model is promoted to the deployment space.

To work around this issue, you can modify your code to remove the software specification ID when overriding the software specification for GPU, as shown in this example:

// When GPU inferencing is enabled, override the software specification by name only.
// Deleting the id forces the deployment to resolve the CUDA-enabled specification by its name.
if (options.modelAttributes.useOnGpu && payloadFromFile.software_spec?.name) {
	delete payloadFromFile.software_spec.id;
	payloadFromFile.software_spec.name = constants.SOFTWARE_SPEC_NAME_RT24_1_PY3_11_CUDA;
}

Known issues for Watson Machine Learning

Unusable deployments after an upgrade or restoring from backup

Applies to: 5.1.0 and later

For deployments created on Cloud Pak for Data, generating predictions might fail after an upgrade. The error message for this problem is:

Deployment: <deployment-ID> has been suspended due to the deployment owner either not being a member of the deployment space: <space-ID> any more or removed from the system.

These errors can also occur following a restore from backup.

To resolve the problem, update the deployments by using the following steps. For R Shiny deployments, use the alternative steps that follow the general procedure.

To update deployments, except for R Shiny deployments:

  1. For HOST="CPD_HOSTNAME", replace "CPD_HOSTNAME" with the Cloud Pak for Data hostname.

  2. For SPACE_ID="WML_SPACE_ID", replace "WML_SPACE_ID" with the space ID of the deployment that is failing.

  3. For DEPLOYMENT_ID="WML_DEPLOYMENT_ID", replace "WML_DEPLOYMENT_ID" with the deployment ID of the broken deployment.

  4. Use "Authorization: ZenApiKey <token>" and supply a valid token. If you export the environment variable use ${TOKEN} instead of <token>.

  5. In the following curl command, replace "OWNER_ID" in the PATCH payload with the actual owner ID on this cluster. A consolidated sketch of these steps is shown after the note below.

    curl  -k -X PATCH "$HOST/ml/v4/deployments/$DEPLOYMENT_ID?version=2020-04-20&space_id=$SPACE_ID" -H "content-type: application/json" -H "Authorization: ZenApiKey <token>" --data '[{ "op": "replace", "path": "/metadata/owner", "value": "OWNER_ID" }]'
    
Note:

To run this script, you must generate and export the token as the ${MY_TOKEN} environment variable. For details, see Generating an API authorization token.
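
The following is a minimal consolidated sketch of the preceding steps, assuming the token is already exported as ${MY_TOKEN} (see the note above); all other values are placeholders:

    # Placeholders: replace with the values described in steps 1 through 3.
    HOST="CPD_HOSTNAME"
    SPACE_ID="WML_SPACE_ID"
    DEPLOYMENT_ID="WML_DEPLOYMENT_ID"

    # Replace OWNER_ID with the actual owner ID on this cluster.
    curl -k -X PATCH "$HOST/ml/v4/deployments/$DEPLOYMENT_ID?version=2020-04-20&space_id=$SPACE_ID" \
      -H "content-type: application/json" \
      -H "Authorization: ZenApiKey ${MY_TOKEN}" \
      --data '[{ "op": "replace", "path": "/metadata/owner", "value": "OWNER_ID" }]'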

To update R-Shiny deployments:

  1. Use oc get pods -n NAMESPACE | grep "wml-deployment-manager", replacing NAMESPACE with the Watson Machine Learning namespace.

  2. For oc exec -it WML_DEPLOYMENT_MANAGER_POD_NAME -n NAMESPACE -- bash, replace WML_DEPLOYMENT_MANAGER_POD_NAME with the deployment manager pod name displayed in the previous step and replace NAMESPACE with the Watson Machine Learning namespace.

  3. For deployment_id="DEPLOYMENT_ID", replace the DEPLOYMENT_ID with the deployment ID.

  4. For space_id="SPACE_ID", replace the SPACE_ID with the space ID for the deployment.

  5. For HOST="https://wml-deployment-manager-svc.NAMESPACE.svc:16500", replace the NAMESPACE with the Watson Machine Learning namespace.

  6. Use "Authorization: ZenApiKey <token>" and supply a valid token. If you export the environment variable use ${TOKEN} instead of <token>.

  7. Re-create the R Shiny deployment by using the following curl command. A consolidated sketch of steps 3 through 8 is shown after the note at the end of this procedure.

    curl -k -X PUT "$HOST/ml/v4_private/recreate_deployment/$deployment_id?version=2020-06-12&space_id=$space_id" -H "Authorization: ZenApiKey <token>"
    
  8. Verify the status of the R Shiny deployment and wait for it to reach the "Ready" state before proceeding to the next step.

    curl -k -X GET "$HOST/ml/v4/deployments/$deployment_id?version=2020-06-12&space_id=$space_id" -H "Authorization: ZenApiKey ${MY_TOKEN}"
    
  9. If you upgraded Cloud Pak for Data or restored from a backup, scale up the number of copies by 1 from the deployment space UI.

    Scaling deployment copies

The deployment state changes from "Unusable" to "Deployed".

Restoring R Shiny deployment

Note:
  1. You can optionally scale the number of copies back to 1 or the original setting when the deployment is working as expected.
  2. To run this script, you must generate and export the token as the ${MY_TOKEN} environment variable. For details, see Generating an API authorization token.
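
The following is a minimal consolidated sketch of steps 3 through 8, run from inside the wml-deployment-manager pod (steps 1 and 2) and assuming the token is already exported as ${MY_TOKEN}; all other values are placeholders:

    # Placeholders: replace with the values described in steps 3 through 5.
    deployment_id="DEPLOYMENT_ID"
    space_id="SPACE_ID"
    HOST="https://wml-deployment-manager-svc.NAMESPACE.svc:16500"

    # Re-create the R Shiny deployment.
    curl -k -X PUT "$HOST/ml/v4_private/recreate_deployment/$deployment_id?version=2020-06-12&space_id=$space_id" \
      -H "Authorization: ZenApiKey ${MY_TOKEN}"

    # Check the deployment status; repeat until the deployment reaches the "Ready" state.
    curl -k -X GET "$HOST/ml/v4/deployments/$deployment_id?version=2020-06-12&space_id=$space_id" \
      -H "Authorization: ZenApiKey ${MY_TOKEN}"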

Decision Optimization deployment job fails with error: "Add deployment failed with deployment not finished within time"

Applies to: 5.1.0 and later

If your Decision Optimization deployment job fails with the following error, complete the following steps to extend the timeout window.

"status": {
     "completed_at": "2022-09-02T02:35:31.711Z",
     "failure": {
         "trace": "0c4c4308935a3c4f2d9987b22139c61c",
         "errors": [{
              "code": "add_deployment_failed_in_runtime",
              "message": "Add deployment failed with deployment not finished within time"
         }]
     },
     "state": "failed"
   }

To update the deployment timeout in the deployment manager:

  1. Edit the wmlbase wml-cr and add this line: ignoreForMaintenance: true. This setting puts the WML operator into maintenance mode, which stops automatic reconciliation. Otherwise, automatic reconciliation undoes any configmap changes that you apply.

    oc patch wmlbase wml-cr --type merge --patch '{"spec": {"ignoreForMaintenance": true}}' -n <namespace>
    

    For example:

    oc patch wmlbase wml-cr --type merge --patch '{"spec": {"ignoreForMaintenance": true}}' -n zen
    
  2. Capture the contents of the wmlruntimemanager configmap in a YAML file.

    oc get cm wmlruntimemanager -n <namespace> -o yaml > wmlruntimemanager.yaml
    

    For example:

    oc get cm wmlruntimemanager -n zen -o yaml > wmlruntimemanager.yaml
    
  3. Create a backup of the wmlruntimemanager YAML file.

    cp wmlruntimemanager.yaml wmlruntimemanager.yaml.bkp
    
  4. Open the wmlruntimemanager.yaml.

    vi wmlruntimemanager.yaml
    
  5. Navigate to the runtimeManager.conf entry in the file and search for the service property.

  6. Increase the number of retries in the retry_count field to extend the timeout window:

    service {
        jobs {
            do {
                check_deployment_status {
                    retry_count = 420   // Increase the number of retries to extend the timeout window
                    retry_delay = 1000
                }
            }
        }
    }

    Where:

    • retry_count = the number of retries
    • retry_delay = the delay between retries, in milliseconds

    In this example, the timeout is configured as 7 minutes (retry_count * retry_delay = 420 * 1000 ms = 420,000 ms = 7 minutes). To increase the timeout further, increase the value in the retry_count field.

  7. Apply the deployment manager configmap changes:

    oc delete -f wmlruntimemanager.yaml
    oc create -f wmlruntimemanager.yaml
    
  8. Restart the deployment manager pods:

    oc get pods -n <namespace> | grep wml-deployment-manager
    
    oc delete pod <podname> -n <namespace>
    
  9. Wait for the deployment manager pod to come up:

    oc get pods -n <namespace> | grep wml-deployment-manager
    
Note:

If you plan to upgrade the Cloud Pak for Data cluster, you must bring the WML operator out of maintenance mode by setting the field ignoreForMaintenance to false in wml-cr.
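
For example, you can take the WML operator out of maintenance mode with a patch analogous to the one in step 1:

    oc patch wmlbase wml-cr --type merge --patch '{"spec": {"ignoreForMaintenance": false}}' -n <namespace>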

Configuring runtime definition for a specific GPU node fails

Applies to: 5.1.0 and later

When you configure the runtime definition to use a specific GPU node with the nodeaffinity property, the runtime definition fails.

As a workaround, you must enable the MIG configuration for all GPU nodes if MIG is enabled for even a single GPU node. You must also use the Single profile type for all the GPU nodes. Mixed profiling is not supported. To learn more about single and mixed profiling strategies, see NVIDIA documentation.

Online backup with NetApp causes Watson Machine Learning to enter InMaintenance mode

Applies to: 5.1.1

Problem description: After a NetApp backup is performed, Watson Machine Learning enters InMaintenance mode. You might receive the following message:

wml WmlBase wml-cr 2025-02-08T02:15:55Z zen 5.1.1 5.1.1 5.1.1-1625 100% Completed wml install/upgrade/restart The last reconciliation was completed successfully. InMaintenance

Root cause: The issue is caused by the pre-hooks and post-hooks configuration in the backup-meta, which puts the Watson Machine Learning CR into maintenance mode during the backup process. The Watson Machine Learning CR eventually reconciles and reaches the Completed state, but this can take longer than the default timeout value of 1800 seconds.

Workaround: No changes to the configmap are required. If you encounter this issue, wait longer than 1800 seconds for the Watson Machine Learning CR to reconcile; it automatically transitions to the Completed state when the reconciliation is complete.
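
To monitor the reconciliation, you can periodically check the CR status until it reports Completed, for example:

    # Check the Watson Machine Learning CR status; repeat until the status shows Completed.
    oc get wmlbase wml-cr -n <namespace>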

Hyperparameter tuning fails when using 2 parallel jobs

Applies to: 5.1.1

When you run a hyperparameter tuning workload with 2 parallel jobs, the workload might fail.

As a workaround, run your hyperparameter tuning workload with a single job.

Limitations for AutoAI experiments

AutoAI file gets pushed to the Git repository in default Git projects

After you create an AutoAI experiment in a default Git project, when you create a commit you see a file that includes your experiment name in the list of files that can be committed. There are no consequences to including this file in your commit. The AutoAI experiment does not appear in the asset list for any other user who pulls the file into their local clone by using Git, and other users are not prevented from creating an AutoAI experiment with the same name.

Maximum number of feature columns in AutoAI experiments

The maximum number of feature columns for a classification or regression experiment is 5000.

No support for Cloud Pak for Data authentication with storage volume connection

You cannot use a storage volume connection that has the 'Cloud Pak for Data authentication' option enabled as a data source in an AutoAI experiment because AutoAI does not currently support the user authentication token. Instead, disable the 'Cloud Pak for Data authentication' option in the storage volume connection to use the connection as a data source in your AutoAI experiment.

Limitations for Watson Machine Learning

Deep Learning experiments with storage volumes in a Git enterprise project are not supported

If you create a Git project with assets in storage volumes, then create a Deep Learning experiment, running the experiment fails. This use case is not currently supported.

Deep Learning jobs are not supported on IBM Power (ppc64le) or Z (s390x) platforms

If you submit a Deep Learning training job on IBM Power (ppc64le) or Z (s390x) platform, the job fails with an InvalidImageName error. This is the expected behavior as Deep Learning jobs are not supported on IBM Power (ppc64le) or Z (s390x) platforms.

Deploying a model on an s390x cluster might require retraining

Training an AI model on a different platform, such as x86 or ppc, and then deploying it on s390x by using Watson Machine Learning might fail because of an endianness issue. In such cases, retrain and deploy the model on the s390x platform to resolve the problem.

Limits on size of model deployments

Limits on the size of models you deploy with Watson Machine Learning depend on factors such as the model framework and type. In some instances, when you exceed a threshold, you will be notified with an error when you try to store a model in the Watson Machine Learning repository, for example: OverflowError: string longer than 2147483647 bytes. In other cases, the failure might be indicated by a more general error message, such as The service is experiencing some downstream errors, please re-try the request or There's no available attachment for the targeted asset. Any of these results indicate that you have exceeded the allowable size limits for that type of deployment.

Automatic mounting of storage volumes is not supported by online and batch deployments

You cannot use automatic mounts for storage volumes with Watson Machine Learning online and batch deployments. Watson Machine Learning does not support this feature for Python-based runtimes, R-script, SPSS Modeler, Spark, or Decision Optimization. You can use automatic mounts for storage volumes only with Watson Machine Learning Shiny app deployments and notebook runtimes.

As a workaround, you can use the download method from the Data assets library, which is part of the ibm-watson-machine-learning Python client.

Batch deployment jobs that use large inline payload might get stuck in starting or running state

If you provide a large inline payload for an asynchronous batch deployment job, the runtime manager process can run out of heap memory.

In the following example, 92 MB of payload was passed inline to the batch deployment, which caused the heap to run out of memory.

Uncaught error from thread [scoring-runtime-manager-akka.scoring-jobs-dispatcher-35] shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[scoring-runtime-manager]
java.lang.OutOfMemoryError: Java heap space
	at java.base/java.util.Arrays.copyOf(Arrays.java:3745)
	at java.base/java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:172)
	at java.base/java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:538)
	at java.base/java.lang.StringBuilder.append(StringBuilder.java:174)
   ...

This can result in concurrent jobs getting stuck in the starting or running state. The starting state can be cleared only by deleting the deployment and creating a new deployment. The running state can be cleared without deleting the deployment.

As a workaround, use data references instead of inline payloads when you provide large inputs to batch deployments.
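
For example, a batch deployment job can point at a data asset instead of passing the payload inline. The following is a hypothetical sketch; the exact reference types and location fields depend on your data source, so verify the payload schema against the batch deployment documentation:

    # Hypothetical sketch: create a batch deployment job that references a data asset
    # instead of passing the payload inline. The fields under input_data_references
    # depend on the data source; check the batch deployment documentation for the schema.
    curl -k -X POST "$HOST/ml/v4/deployment_jobs?version=2020-04-20&space_id=$SPACE_ID" \
      -H "content-type: application/json" \
      -H "Authorization: ZenApiKey ${MY_TOKEN}" \
      --data '{
        "deployment": { "id": "<deployment-id>" },
        "scoring": {
          "input_data_references": [{
            "type": "data_asset",
            "location": { "href": "/v2/assets/<asset-id>?space_id=<space-id>" }
          }]
        }
      }'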

Setting environment variables in a conda yaml file does not work for deployments

Setting environment variables in a conda yaml file does not work for deployments. This means that you cannot override existing environment variables, for example LD_LIBRARY_PATH, when deploying assets in Watson Machine Learning.

As a workaround, if you're using a Python function, consider setting default parameters. For details, see Deploying Python functions.

Deploying assets on IBM Z and LinuxONE fails

Deploying assets on IBM Z and LinuxONE fails because Watson Machine Learning for Cloud Pak for Data version 5.1.1 does not support deployments on the s390x architecture.

Hyperparameter tuning runs with 2 parallel jobs

A hyperparameter optimization (HPO) tuning job runs with a maximum of 2 parallel jobs, even if you set the max_parallel_job_num parameter of hyper_parameters_optimization in training_reference to a value larger than 2.

Parent topic: Service issues