Known issues for Watson Studio and supplemental services

These known issues apply to Watson Studio and the services that require Watson Studio.

General issues

Common Core Services operator status is hung on infinite reconcile loop during upgrade with lingering cronjobs

The Common Core Services operator hangs in an infinite reconcile loop during an upgrade when lingering cronjobs are present.

Workaround

  1. Check whether any cronjobs are in a suspended state by using the following label selector:

    oc get cronjobs -n <cpd_instance_namespace> -l 'created-by=spawner,ccs.cpd.ibm.com/upgradedTo4x!=4.0.6'

    Example response:

     NAME                                              SCHEDULE       SUSPEND   ACTIVE   LAST SCHEDULE   AGE
     038207f9-6e91-4df4-9860-e8ef6c30aca0-1000330999   13 21 4 2 *    True     0        <none>          4d23h
     0954d225-2600-44de-8d60-f29ae31aa96c-1000330999   51 18 4 2 *    True     0        <none>          5d2h
     14ce1614-7059-40c9-9a7e-5f2fa33d65a8-1000330999   12 23 4 2 *    False     0        <none>          4d21h
     184b04fe-d893-481c-89c9-7ce96c7fb4d4-1000330999   0 * * * *      False     0        100s            6d14h
     253b988a-31a8-46d0-ba9c-239d83746dc3-1000330999   28 * * * *     False     0        33m             6d14h
    
  2. Delete each of the returned suspended cronjobs and its associated secret by running the following commands. (A scripted variant is shown after these steps.)

    Delete only the jobs that are in the suspended state (the third column, SUSPEND, is True); leave the rest untouched.

     oc delete cronjob -n <cpd_instance_namespace> <cronjob_name>
     oc delete secret -n <cpd_instance_namespace> <cronjob_name>-sct
    

    Note: If an error is received stating the secret could not be found, the error can be ignored.

  3. After completing the previous step, the Common Core Services operator should stop running reconcile phases and mark the installation as complete.
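
For reference, a scripted variant of step 2 might look like the following. This is a sketch: it assumes the same label selector as step 1 and deletes only cronjobs whose suspend field is true, together with their <cronjob_name>-sct secrets.

     # Delete every suspended cronjob returned by the label selector, plus its associated secret
     for cj in $(oc get cronjobs -n <cpd_instance_namespace> \
         -l 'created-by=spawner,ccs.cpd.ibm.com/upgradedTo4x!=4.0.6' \
         -o jsonpath='{range .items[?(@.spec.suspend==true)]}{.metadata.name}{"\n"}{end}'); do
       oc delete cronjob -n <cpd_instance_namespace> "$cj"
       oc delete secret -n <cpd_instance_namespace> "${cj}-sct" --ignore-not-found
     done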

Applies to: 4.0.6.

Fixed in: 4.0.7.

Internal service error occurs when you upload files that already exist on a remote NFS volume server

If you upload files that already exist on a remote NFS volume server, you must either update the permissions of the existing files on the remote server or create a new directory and upload all files to that directory. Otherwise, an internal service error occurs.

Only users who have access to the NFS server can change the permissions of the files and create new directories.
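
For example, a user with access to the NFS server might run commands similar to the following. This is a sketch: the paths, ownership, and modes are illustrative and must be adapted to your environment.

    # Make the existing files writable so the upload can overwrite them
    chmod -R u+rw,g+rw /exports/<share>/<existing_files>

    # Or create a fresh target directory and upload all files there instead
    mkdir -p /exports/<share>/<new_upload_dir>
    chmod 775 /exports/<share>/<new_upload_dir>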

Backup and restore limitations

Offline quiesce is supported only at the OpenShift project level, and data can be restored only to the same machine and the same namespace.

Deployments view (Operations view/dashboard) has some limitations

The Deployments view has the following limitation:

Spark environments can be selected although Spark is not installed

When you create a job for a notebook or a Data Refinery flow that you promoted to a deployment space, and select a Spark environment, you might see the following error message:

Error submitting job. Unable to fetch environment access info from envId spark30****. Error: [{"statusCode":500,"message":"{\"trace\":\"39e52bbb-2816-4bf6-9dad-5aede584ac7a\",\"errors\":[{\"code\":\"default_spark_create_failed\",\"message\":\"Could not create Hummingbird instance, because a wrong status code was returned: 404.\"}]}"}]

The reason for this error is that Spark was not installed on the cluster on which you created the deployment space. Contact your administrator to install the Spark service on that cluster.

Applies to: 4.0.0 and 4.0.1.

Fixed in: 4.0.2

UI display might not load properly

If the UI does not load properly, the Watson Studio administrator must restart redis.

Workaround

If the client behaves unexpectedly, for example, it enters a redirection loop or parts of the user interface fail to load, complete the following steps to restart redis:

  1. Log in to the OpenShift cluster.
  2. Restart the pods for redis using the following command:

     oc delete po -n <project> $(oc get po -n <project> -l component=redis -o jsonpath="{.items[*].metadata.name}")
    

Can’t stop an active runtime for notebooks, JupyterLab, Data Refinery and SPSS Modeler

If you try stopping an active runtime for notebooks, JupyterLab, Data Refinery or SPSS Modeler from the environments page in a project, the runtime is removed from the list. However, when you reload the page, the runtime appears again.

The reason is that the runtime couldn't be deleted properly. To delete the runtime entirely, you must have Cloud Pak for Data administrator rights to view the logs and OpenShift administrator rights to delete the runtime.

Workaround

To stop the runtime and remove it from the list of active runtimes:

  1. As Cloud Pak for Data administrator, check the log file for information you need before you can delete the runtime.

    1. Navigate to Administration > Monitoring.
    2. Select Pods.
    3. In Find Pods, enter spawner. This should return one pod named spawner-api-<id>.
    4. Click the action menu on the right of the entry and select View Log. You might have to download the log file to see its full contents.

      The log file shows entries such as:

       WARN : Cannot find service for type=<type>,dsxProjectId=<project>,dsxUserId=<user>,runtimeEnvId=<id>
      

      For example:

       WARN : Cannot find service for type=jupyter-py37,dsxProjectId=1d35b7f8-cef1-4432-92ae-aa08afe4c8c6,dsxUserId=1000330999,runtimeEnvId=jupconda37oce-1d35b7f8-cef1-4432-92ae-aa08afe4c8c6
      
  2. Log in to the OpenShift cluster that Cloud Pak for Data is installed on as an OpenShift administrator.
  3. Run the following command using the values from the log entry:
    oc get deployment -l type=<type>,dsxProjectId=<project>,dsxUserId=<user>,runtimeEnvId=<id>
    
    For example:
    oc get deployment -l type=jupyter-py37,dsxProjectId=1d35b7f8-cef1-4432-92ae-aa08afe4c8c6,dsxUserId=1000330999,runtimeEnvId=jupconda37oce-1d35b7f8-cef1-4432-92ae-aa08afe4c8c6
    
    This command returns a deployment with a particular name.
  4. Run the following command:
    oc delete deployment <name>
    
  5. Then run:
    oc get secret -l type=<type>,dsxProjectId=<project>,dsxUserId=<user>,runtimeEnvId=<id>
    
    For example:
    oc get secret -l type=jupyter-py37,dsxProjectId=1d35b7f8-cef1-4432-92ae-aa08afe4c8c6,dsxUserId=1000330999,runtimeEnvId=jupconda37oce-1d35b7f8-cef1-4432-92ae-aa08afe4c8c6
    
    This command returns a secret with a particular name.
  6. Finally run the following command with the secret name:
    oc delete secret <name>
    

Fixed in: 4.0.3

Projects

Option to log all project activities is enabled but project logs don't contain activities and return an empty list

Even though the Log all project activities option is enabled, the project logs don't contain any activities and return an empty list.

Workaround: If the project logs are empty after 30 minutes or more, restart the rabbitmq pod by completing the following steps:

  1. List the rabbitmq pods (the members of the rabbitmq-ha stateful set) by running oc get pods | grep rabbitmq-ha.

    This command returns 3 pods:

    oc get pods | grep rabbitmq-ha
    rabbitmq-ha-0                                                     1/1     Running     0             4d6h
    rabbitmq-ha-1                                                     1/1     Running     0             4d6h
    rabbitmq-ha-2                                                     1/1     Running     0             4d7h
    
  2. Restart each pod by running oc delete pod rabbitmq-ha-0 rabbitmq-ha-1 rabbitmq-ha-2.

Applies to: 4.0.7 and later.

Job scheduling does not work consistently with default git projects

Creating a new default Git project, or changing branches in your local clone of an existing Git-based project, corrupts your existing job schedules for that project.

Import of a project larger than 1 GB in Watson Studio fails

If you create an empty project in Watson Studio and then try to import a project that is larger than 1 GB in size, the operation might fail depending on the size and compute power of the Cloud Pak for Data cluster.

Export of a large project in Watson Studio fails with a time-out

If you try to export a project with a large number of assets (for example, more than 7000), the export process can time out and fail. Although you could export the assets in subsets, the recommended solution is to export the project by using the CPDCTL command-line interface tool.
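
A minimal sketch of a command-line export follows. The exact subcommands and flags are assumptions here and can vary by cpdctl version, so confirm them with cpdctl asset export --help before use.

    # Illustrative only -- verify subcommands and flags with: cpdctl asset export --help
    cpdctl asset export start --project-id <project_id> \
        --assets '{"all_assets": true}' --name full-project-export
    cpdctl asset export download --project-id <project_id> \
        --export-id <export_id> --output-file project-export.zip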

Git operations fail with invalid token error

If you perform a Git action and the Git token that is associated with the project has become invalid, the operation fails, and you cannot complete the action by supplying a valid token. To resolve the issue, run the following command from the project terminal to add a valid token.

git remote set-url origin https://[USERNAME]:[NEW TOKEN]@github.com/[USERNAME]/[REPO].git

Applies to: 4.0.2 and 4.0.3

Cannot switch checkout branch in project with default Git integration after changing project assets or files

If the local Git repository has untracked changes, the checkout might fail with unexpected response code: 500. This happens when files in the new branch would overwrite your local changes.

Workaround:

Before checking out a different branch, first commit all your changes. Alternatively, use the project terminal to stash or revert any untracked changes, as shown in the example that follows.
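
For example, from the project terminal you might run commands like the following. This is a sketch: the branch name is a placeholder, and git clean permanently discards untracked files.

    git status              # review local changes first
    git stash -u            # stash tracked and untracked changes
    git checkout <branch>   # switch to the other branch
    git stash pop           # optionally reapply the stashed changes later
    # ...or discard untracked files instead of stashing them (destructive):
    # git clean -fd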

Fixed in: 4.0.6

Cannot stop job runs in a project associated with a Git repository

Files containing information about job runs are not, by default, included in the files that are pushed to the Git repository associated with a project. They are excluded in the .gitignore file. However, if the .gitignore is updated to include such files and a user commits and pushes those files while a job run is active, then users working with that Git repository (in the same project or in a separate project) will see those job runs as active after pulling the changes. They will get an error if they try to stop any of these job runs.

Workaround:

To remove job runs that are marked as active but cannot be stopped, ask the user who pushed the files for those active job runs to push the files again after the job runs have completed.

Cannot work with all assets pulled from a Git repository

If you work in a project with default Git integration, the Git repository might contain assets that were added from another project that uses the same Git repository. If that project is on a different Cloud Pak for Data cluster where other services are installed, you cannot work with those assets if the necessary service is not available on your cluster.

For example, if the Git repository contains SPSS Modeler flows from another project on a different cluster and SPSS Modeler is not installed on your cluster, when you try to open a SPSS Modeler flow, the Web page will be blank.

Workaround:

To work with all assets pulled from a Git repository used in projects on another cluster, you need to ask a system administrator to install the missing services on your cluster.

Can't include a Cognos dashboard when exporting a project to desktop

Currently, you cannot select a Cognos dashboard when you export a project to desktop.

Workaround:

Although you cannot add a dashboard to your project export, you can move a dashboard from one project to another.

To move a dashboard to another project:

  1. Download the dashboard JSON file from the original project to your desktop.
  2. Export the original project to desktop by clicking the Export to desktop icon on the project toolbar.
  3. Create a new project by importing the project ZIP with the required data sources.
  4. Create a new dashboard by clicking the From file tab and adding the JSON file that you downloaded from the original project.
  5. When a dialog box asks whether you want to re-link each of your data sources, click the re-link button and select the asset in the new project that corresponds to the data source.

Can't create a project although the project name is unique

If you try to create a project and get an error message stating that the project name already exists, even though the name is unique, the reason might be that your role has the "manage projects" or "monitor project workloads" permission. A defect currently prevents users with these permissions from creating projects.

Workaround

Ask a Cloud Pak for Data administrator to remove the permissions "manage projects" or "monitor project workloads" that were assigned to your role to enable you to create a project.

If these permissions are needed, they can be assigned to a dedicated monitoring role that does not need to create projects.

Fixed in: 4.0.5

Don't use the Git repository from projects with deprecated Git integration in projects with default Git integration

You shouldn't use the Git repository from a project with deprecated Git integration in a project with default Git integration as this can result in an error. For example, in Bitbucket, you will see an error stating that the repository contains content from a deprecated Git project although the selected branch contains default Git project content.

In a project with default Git integration, you can either use a new clean Git repository or link to one that was used in a project with default Git integration.

Can't use connections in a Git repository that require a JDBC driver and were created in a project on another cluster

If your project is associated with a Git repository that was used in a project in another cluster and contains connections that require a JDBC driver, the connections will not work in your project. If you upload the required JDBC JAR file, you will see an error stating that the JDBC driver could not be initialized.

This error is caused by the JDBC JAR file that is added to the connection as a presigned URI. This URI is not valid in a project in another cluster. The JAR file can no longer be located even if it exists in the cluster, and the connection will not work.

Workaround

To use any of these connections, you need to create new connections in the project. The following connections require a JDBC driver and are affected by this error situation:

Importing a project that contains a Data Refinery job that uses Spark 2.4 fails

If you try to import a project that contains a Data Refinery job that runs in a Spark 2.4 environment, the import will fail. The reason for this is that Spark 2.4 was removed in Cloud Pak for Data 4.0.7.

Workaround

Create a new Data Refinery job and select a Spark 3.0 environment.

Applies to: 4.0.7

Fixed in: 4.0.8

Connections

Cannot create a connection to Oracle when credentials must be stored in a secret in a vault

If you try to create a connection to Oracle and the administrator has configured Cloud Pak for Data to enforce using secrets from a vault, the connection will fail.

Workaround: Temporarily disable the vault enforcement. Users will then be able to create a connection to Oracle that uses a vault for credentials. However, while enforcement is disabled, users can also create connections (including connections other than Oracle) without using credentials that are stored in a secret in a vault. After the connection to Oracle is created, you can enforce using secrets from a vault again.

Applies to: 4.0.8
Fixed in: 4.0.9

Authentication fields unavailable in the Cloud Object Storage (COS) connection

When you create or edit a Cloud Object Storage connection, the authentication fields are not available in the user interface if you change the authentication method.

Workaround: If you change the authentication method, clear any fields where you have already entered values.

Applies to: 4.0.7 and later

FTP connection: Use the SSH "authentication mode" only with the SSH "connection method"

If you create an FTP connection with the Anonymous, Basic, or SSL connection mode and you specify the SSH authentication mode, the Test connection will fail.

Workaround: Specify the SSH authentication mode only when you specify the SSH connection mode.

Applies to: 4.0.7
Fixed in: 4.0.8

Cannot retrieve data from a Greenplum connection

After you create a connection to Greenplum, you might not be able to select its tables or assets.
Workaround: In the asset browser, click the Refresh button to access the table or asset. You might need to refresh several times.

Applies to: 4.0.6
Fixed in: 4.0.9

Cannot access or create connected data assets or open connections in Data Refinery

The following scenario can prevent you from accessing or creating connected data assets in a project and from opening connections in Data Refinery:

As a workaround for this scenario, delete any orphaned referenced connections that are still in the project.

Applies to: 3.5.10
Fixed in: 4.0.5

Cannot access an Excel file from a connection in cloud storage

This problem can occur when you create a connected data asset for an Excel file in a space, catalog, or project. The data source can be any cloud storage connection, for example, IBM Cloud Object Storage, Amazon S3, or Google Cloud Storage.

Workaround: When you create a connected data asset, select which spreadsheet to add.

Applies to: 4.0.4
Fixed in: 4.0.5

Cannot create an SQL Query connected data asset that has personal credentials

If you want to create a connected data asset for an SQL Query connection that has personal credentials, the Select connection source page might stop responding when you click the SQL Query connection.

Workaround: Edit the connection from the Edit connection page.

  1. Go to the project's Assets page and click the link for the SQL Query connection to open the Edit connection page.
  2. Enter the credentials and click Save.
  3. Return to Add to project > Connected data > Select source, and select data from the SQL Query connection.

Applies to: 4.0.3
Fixed in: 4.0.4

Personal credentials are not supported for connected data assets in Data Refinery

If you create a connected data asset with personal credentials, other users must use the following workaround to use the connected data asset in Data Refinery.

Workaround:

  1. Go to the project page, and click the link for the connected data asset to open the preview.
  2. Enter credentials.
  3. Open Data Refinery and use the authenticated connected data asset for a source or target.

Applies to: 3.5.0 and later

Assets

Limitations for previews of assets

You can't see previews of these types of assets:

Can't load files to projects that have #, %, or ? characters in the name

You can't create a data asset in a project by loading a file that contains a hash character (#), percent sign (%), or a question mark (?) in the file name.

Applies to: 3.5.0
Fixed in: 4.0.6

Can't load CSV files that are larger than 20 GB to projects

You can't load a CSV file that is larger than 20 GB to an analytics project in Cloud Pak for Data.

Hadoop integration

Unable to install pip packages using install_packages() on a Power machine

If you are using a Power cluster, you might see the following error when attempting to install pip packages with hi_core_utils.install_packages():

ModuleNotFoundError: No module named '_sysconfigdata_ppc64le_conda_cos6_linux_gnu'

To work around this known limitation of hi_core_utils.install_packages() on Power, export the following environment variable before calling install_packages():

# For Power machines, export this env var to work around a known issue in
# hi_core_utils.install_packages()
import os
os.environ['_CONDA_PYTHON_SYSCONFIGDATA_NAME'] = "_sysconfigdata_powerpc64le_conda_cos7_linux_gnu"

On certain HDP Clusters, the Execution Engine for Apache Hadoop service installation fails

The installation fails during the Knox Gateway Configuration step because the Knox gateway fails to start on some nodes.

The following errors occur:

The workaround is to remove the org.eclipse.persistence.core-2.7.2.jar file from the installation directory by using the following command:

mv /opt/ibm/dsxhi/gateway/dep/org.eclipse.persistence.core-2.7.2.jar /tmp/

Cannot stop jobs for a registered Hadoop target host

When a registered Hadoop cluster is selected as the Target Host for a job run, the job cannot be stopped. As a workaround, view the Watson Studio Local job logs to find the Yarn applicationId; then, use the ID to manually stop the Hadoop job on the remote system. When the remote job is stopped, the Watson Studio Local job will stop on its own with a "Failed" status. Similarly, jobs that are started for registered Hadoop image push operations cannot be stopped either.
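
For example, after you find the applicationId in the job logs, you might stop the remote job with a command like the following, assuming that you can run the YARN CLI on the Hadoop cluster:

    # Run on the remote Hadoop cluster; <application_id> comes from the Watson Studio Local job logs
    yarn application -kill <application_id>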

Apache Livy session on RStudio and Data Refinery has known issues with curl packages

The workaround for the curl package issue is to downgrade the curl package by using the following command:

install.packages("https://cran.r-project.org/src/contrib/Archive/curl/curl_3.3.tar.gz", repos=NULL)

Related to: 4.0.7 and later

Code that is added to a Python 3.7 or Python 3.8 notebook for an HDFS-connected asset fails

The HDFS-connected asset fails because the Set Home as Root setting is selected for the HDFS connection. To work around this issue, create the connected asset using the HDFS connection without selecting Set Home as Root.

Support for Spark versions

Apache Spark 3.1 for Power is not supported.

Applies to: 4.0.6

Software version list disappears when defining a Cloud Pak for Data environment

When you're defining a Cloud Pak for Data environment, the Software version list disappears in the following situations as you're choosing the system configuration for a Hadoop cluster edge node:

Applies to: 4.0.6

Support for specific Python versions with Execution Engine for Apache Hadoop

Applies to: 4.0.6

Jupyter notebook with Python 3.8 is not supported by Execution Engine for Apache Hadoop

The following issues are the result of Python 3.8 not being supported by Execution Engine for Apache Hadoop:

For more information about Spark and Python 3.8, see the issue that is described in Apache's JIRA tracker.

Applies to: 4.0.1

Fixed in: 4.0.6

The Livy service does not restart when a cluster is rebooted

The Livy service does not automatically restart after a system reboot if the HDFS Namenode is not in an active state.

Applies to: 3.5.0 and later

Pushing the Python 3.7 image to a registered Execution Engine for Apache Hadoop system hangs on Power

When the Python 3.7 Jupyter image is pushed to an Execution Engine for Apache Hadoop registered system by using platform configurations, installing the dist-keras package into the image (using pip) hangs. The job runs for hours but never completes, and the console output for the job ends with:

  Attempting to install HI addon libs to active environment ...
    ==> Target env: /opt/conda/envs/Python-3.7-main ...
    ====> Installing conda packages ...
    ====> Installing pip packages ...

This hang is caused by a pip regression in dependency resolution, as described in the New resolver downloads hundreds of different package versions, without giving reason issue.

To work around this problem, do the following steps:

  1. Stop the image push job that is hung. To find the job that is hung, use the following command:

     oc get job -l headless-type=img-saver
    
  2. Delete the job:

     oc delete job <jobid>
    
  3. Edit the Execution Engine for Apache Hadoop image push script located at /cc-home/.scripts/proxy-pods/dsx-hi/dsx-hi-save-image.sh.

  4. Add the flag --use-deprecated=legacy-resolver to the pip install command, as shown in the following diff:

     @@ -391,7 +391,7 @@
           conda-env=2.6.0 opt_einsum=3.1.0 > /tmp/hiaddons.out 2>&1 ; then
         echo " OK"
         echo -n "  ====> Installing pip packages ..."
     -   if pip install dist-keras==0.2.1 >> /tmp/hiaddons.out 2>&1 ; then
     +   if pip install --use-deprecated=legacy-resolver dist-keras==0.2.1 >> /tmp/hiaddons.out 2>&1 ; then
         echo " OK"
         hiaddonsok=true

  5. Restart your image push operation by clicking Replace image on the Execution Engine for Apache Hadoop registration page.

To edit the Execution Engine for Apache Hadoop image push script:

  1. Access the pod running utils-api. This has the /cc-home/.scripts directory mounted.

     oc get pod | grep utils-api
    
  2. Extract the existing /cc-home/.scripts/proxy-pods/dsx-hi/dsx-hi-save-image.sh file:

     oc cp <utils_pod_id>:/cc-home/.scripts/proxy-pods/dsx-hi/dsx-hi-save-image.sh dsx-hi-save-image.sh
    
  3. Modify the file per the workaround steps:

     vi dsx-hi-save-image.sh
    
  4. Copy the new file into the pod, in the /tmp dir:

     oc cp dsx-hi-save-image.sh <utils_pod_id>:/tmp
    
  5. Open a remote shell into the pod:

     oc rsh <utils_pod_id>
    
  6. Make a backup:

     cp -up /cc-home/.scripts/proxy-pods/dsx-hi/dsx-hi-save-image.sh /cc-home/.scripts/proxy-pods/dsx-hi/dsx-hi-save-image.sh.bak
    
  7. Dump the content in /tmp/dsx-hi-save-image.sh to /cc-home/.scripts/proxy-pods/dsx-hi/dsx-hi-save-image.sh:

     cat /tmp/dsx-hi-save-image.sh > /cc-home/.scripts/proxy-pods/dsx-hi/dsx-hi-save-image.sh
    
  8. Do a diff to make sure you have the changes:

     diff /cc-home/.scripts/proxy-pods/dsx-hi/dsx-hi-save-image.sh /cc-home/.scripts/proxy-pods/dsx-hi/dsx-hi-save-image.sh.bak
    
  9. Exit the utils-api pod:

     exit
    

Pushing an image to a Spectrum Conductor environment fails

If you push an image to a Spectrum Conductor environment, the image push fails.

Applies to: 4.0.0 and later

Fixed in: 4.0.2

Notebooks using a Hadoop environment that don't have the Jupyter Enterprise Gateway service enabled can't be used

When a notebook is opened using a Hadoop environment that does not have the Jupyter Enterprise Gateway (JEG) service enabled, the kernel remains disconnected. This notebook does not display any error message, and it cannot be used.

To address this issue, confirm that the Hadoop environment that the notebook uses has the JEG service enabled.

Cannot use HadoopLibUtils to connect through Livy in RStudio

RStudio includes a new version of sparklyr. sparklyr doesn't work with the HadoopLibUtils library that's provided to connect to remote Hadoop clusters using Livy. The following error occurs: Error: Livy connections now require the Spark version to be specified. As a result, you cannot create tables using Hadoop.

If you do not want to roll back the sparklyr 1.6.3 changes, you can install sparklyr 1.6.2. Restart the R session and use sparklyr 1.6.2 for Hadoop.

Install sparklyr 1.6.2 by running the following commands:

  library(remotes)
  install_version("sparklyr", version = "1.6.2", repos = "http://cran.us.r-project.org")
  packageVersion("sparklyr")

After the package is installed, load sparklyr 1.6.2 when you are using HadoopLibUtilsR.

Hadoop Refinery: Cannot output to a connected parquet asset that is a directory

This problem occurs when you run a Data Refinery shaping job that uses the Execution Engine for Apache Hadoop connector for HDFS. When you select an output target directory that contains a set of parquet files, the Data Refinery layer cannot determine the file format for this directory.

Instead, it leaves the File Format field blank. This causes a failure when writing the output because File Format ‘ ‘ is not supported.

To work around this issue, select an output target that is empty or not a directory.

Hadoop notebook jobs don't support using environment variables containing spaces

When you set up a notebook job that uses a remote Hadoop environment, you can specify environment variables that the notebook can access. However, if an environment variable value contains spaces, the JEG job fails. The remote Hadoop jeg.log displays:

  [E 2020-05-13 10:28:54.542 EnterpriseGatewayApp] Error occurred during launch of KernelID

To work around this issue, do not define environment variables whose values contain spaces. If necessary, you can encode the value and decode it when the notebook uses the content of the environment variable, as in the sketch that follows.
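
For example, one possible approach is base64: encode the value before you paste it into the job's environment variable field, and decode it again inside the notebook. This is only a sketch; base64 is one encoding option among others.

    # Encode a value that contains spaces before defining the environment variable
    echo -n 'value with spaces' | base64
    # -> dmFsdWUgd2l0aCBzcGFjZXM=
    # In the notebook, decode the variable's content again (for example, with the
    # base64 support of the notebook language) before you use it.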

Notebooks

An error 500 is received after opening an existing notebook

An error 500 is received after you open an existing notebook that has a large number of data assets. The notebook takes a long time to load, and the notebook-UI pod treats it as a failure.

Workaround: Use the notebook URL directly instead of navigating to the notebook through the project.

Notebook fails to load in default Spark 3.0 and R 3.6 environments

Your notebook fails to load in default Spark 3.0 and R 3.6 environments and you receive the Failed to load notebook error message.

To resolve this issue:

  1. Go to Active runtimes.
  2. Delete the runtime that is in Starting status.
  3. Restart the runtime.

Applies to: 4.0.8.

No indication that notebooks reference deprecated or unsupported Python versions in deployment spaces

You are not notified when your deployment space contains notebooks that reference deprecated or unsupported Python versions.

Applies to: 4.0.7

Insert to code function can cause kernel to fail on IBM Z

For JupyterLab notebooks running on IBM Z and LinuxONE platforms, using the Insert to code function in a notebook to load data can result in a kernel failure.

Important: These changes apply only to newly created runtimes. If runtimes are still active when you apply the change, make sure to stop and restart them.

To resolve this issue:

  1. Log in to Cloud Pak for Data as administrator and paste the following URL into the browser. Replace <CloudPakforData_URL> with the URL of your Cloud Pak for Data system.

    <CloudPakforData_URL>/zen-data/v1/volumes/files/%2F_global_%2Fconfig%2F.runtime-definitions%2Fibm%2Fjupyter-lab-py38-server.json
    
  2. Open the file in your favorite editor. The last 2 lines of the file will look like this:

      }
     }
    

    Append this code between the second-to-last and the last curly brace (}):

     ,
       "env": [
           {
               "name": "APP_ENV_ENABLE_MEM_LIMIT_KERNEL_MANAGER",
               "value": "false"
           }
       ]
    

    The end of the file should look like this:

     },
       "env": [
           {
               "name": "APP_ENV_ENABLE_MEM_LIMIT_KERNEL_MANAGER",
               "value": "false"
           }
       ]
     }
    
  3. Save the file.

  4. Get the required platform access token. This command returns the bearer token in the accessToken field:

     curl <CloudPakforData_URL>/v1/preauth/validateAuth -u <username>:<password>
    
  5. Upload the JSON file you edited in previous steps:

     curl -k -X PUT \
       '<CloudPakforData_URL>/zen-data/v1/volumes/files/%2F_global_%2Fconfig%2F.runtime-definitions%2Fibm' \
       -H 'Authorization: Bearer <platform-access-token>' \
       -H 'content-type: multipart/form-data' \
       -F upFile=@/path_to_runtime_def/jupyter-lab-py38-server.json
    

    If the changed JSON file uploads successfully, you will see the following response:

    {
       "_messageCode_": "Success",
       "message": "Successfully uploaded file and created the necessary directory structure"
    }
    

    Note: If your cluster is using a self-signed certificate which you did not add to your client, use the -k option to avoid certificate issues.

Applies to: 4.0.2

Fixed in: 4.0.3

R 3.6 notebook kernel won't start because of slow kernel connection

For R Jupyter notebooks running on specific IBM Power platforms, the R kernel will not start and a slow kernel connection message is displayed.

To resolve this issue:

  1. Log in to Cloud Pak for Data as administrator and paste the following URL into the browser. Replace <CloudPakforData_URL> with the URL of your Cloud Pak for Data system.

     <CloudPakforData_URL>/zen-data/v1/volumes/files/%2F_global_%2Fconfig%2F.runtime-definitions%2Fibm%2Fjupyter-r36-server.json
    
  2. Open the file in your favorite editor. The last 2 lines of the file will look like this:

      }
     }
    

    Append this code between the second-to-last and the last curly brace (}):

     ,
       "env": [
           {
               "name": "APP_ENV_ENABLE_MEM_LIMIT_KERNEL_MANAGER",
               "value": "false"
           }
       ]
    

    The end of the file should look like this:

     },
       "env": [
           {
               "name": "APP_ENV_ENABLE_MEM_LIMIT_KERNEL_MANAGER",
               "value": "false"
           }
       ]
     }
    
  3. Save the file.

  4. Get the required platform access token. This command returns the bearer token in the accessToken field:

     curl <CloudPakforData_URL>/v1/preauth/validateAuth -u <username>:<password>
    
  5. Upload the JSON file you edited in previous steps:

     curl -k -X PUT \
         '<CloudPakforData_URL>/zen-data/v1/volumes/files/%2F_global_%2Fconfig%2F.runtime-definitions%2Fibm' \
         -H 'Authorization: Bearer <platform-access-token>' \
         -H 'content-type: multipart/form-data' \
         -F upFile=@/path_to_runtime_def/jupyter-r36-server.json
    

    If the changed JSON file uploads successfully, you will see the following response:

    {
       "_messageCode_": "Success",
       "message": "Successfully uploaded file and created the necessary directory structure"
    }
    

    Note: If your cluster is using a self-signed certificate which you did not add to your client, use the -k option to avoid certificate issues.

Insert to code function does not work on SSL-enabled Db2 on Cloud connections from 3.5.x imported projects or 3.5.x git repos

If you import a project that was created in Cloud Pak for Data 3.5.x and the project contains a Db2 on Cloud connection with SSL enabled, the Notebooks "Insert to code" feature will not work. The problem also occurs if you synchronize with a Git project from Cloud Pak for Data version 3.5.x.

To fix this problem, edit the connection: Click the connection on the project Data assets page. Clear and re-select the Port is SSL-enabled checkbox in the Edit connection page.

Applies to: 4.0.0 and later

Issues with Insert to code function for an Informix connection

If you use the Insert to code function for a connection to an Informix database, and if Informix is configured for case-sensitive identifiers, the inserted code throws a runtime error if you try to query a table with an upper-case name. In the cell output, there's an error message similar to the following:

    DatabaseError: Execution failed on sql: SELECT * FROM informix.FVT_EMPLOYEE
    java.sql.SQLException: The specified table (informix.fvt_employee) is not in the database.
    unable to rollback

Workaround

In your notebook, edit the inserted code so that the query uses the case-sensitive table identifier, for example, by enclosing the table name in double quotation marks (a delimited identifier).

Error in notebooks when rendering data from Cloudera Distribution for Hadoop

When running Jupyter notebooks against Cloudera Distribution for Hadoop 5.16 or 6.0.1 with Spark 2.2, the first dataframe operation for rendering cell output from the Spark driver results in a JSON encoding error.

Workaround:

To work around this error, do one of the following procedures:

    %%spark
    import sys, warnings
    def python_major_version ():
        return(sys.version_info[0])
    with warnings.catch_warnings(record=True):
         print(sc.parallelize([1]).map(lambda x: python_major_version()).collect())

Notebook loading considerations

The time that it takes to create a new notebook or to open an existing one for editing might vary. If no runtime container is available, a container must be created, and the Jupyter notebook user interface can be loaded only after the container is available. The time it takes to create a container depends on the cluster load and size. Once a runtime container exists, subsequent calls to open notebooks are significantly faster.

Kernel not found when opening a notebook imported from a Git repository

If you import a project from a Git repository that contains notebooks that were created in JupyterLab, and try opening the notebooks from the project Assets page, you will see a message stating that the required notebook kernel can't be found.

The reason is that you are trying to open the notebook in an environment that doesn't support the kernel required by the notebook, for example in an environment without Spark for a notebook that uses Spark APIs. The information about the environment dependency of a notebook that was created in JupyterLab and exported in a project is currently not available when this project is imported again from Git.

Workaround:

You need to associate the notebook with the correct environment definition. You can do this:

Environment runtime can't be started because the software customization failed

If your Jupyter notebook runtime can't be started and a 47 killed error is logged, the software customization process could not be completed because of lack of memory.

You can customize the software configuration of a Jupyter notebook environment by adding conda and pip packages. However, be aware that conda does dependency checking when installing packages which can be memory intensive if you add many packages to a customization.

To complete a customization successfully, you must make sure that you select an environment with sufficient RAM to enable dependency checking at the time the runtime is started.

If you only want packages from one conda channel, you can prevent unnecessary dependency checking by excluding the default channels. To do this, remove defaults from the channels list in the customization template and add nodefaults.

Notebook returns UTC time and not local time

Python datetime functions return the date and time in the UTC time zone, not the local time zone where the user is located. The reason is that the default environment runtimes use the time zone in which they were created, which is UTC.

Workaround:

If you want to use your local time, you need to download and modify the runtime configuration file to use your time zone. You don't need to make any changes to the runtime image. After you upload the configuration file again, the runtime will use the time you set in the configuration file.

Required role: You must be a Cloud Pak for Data cluster administrator to change the configuration file of a runtime.

To change the time zone:

  1. Download the configuration file of the runtime you are using. Follow the steps in Downloading the runtime configuration.
  2. Update the runtime definition JSON file and extend the environment variable section to include your time zone. For example, for Europe/Vienna use:
    {
    "name": "TZ",
    "value": "Europe/Vienna"
    }
    
  3. Upload the changed JSON file to the Cloud Pak for Data cluster. You can use the Cloud Pak for Data API.

    1. Get the required platform access token. The command returns the bearer token in the accessToken field:
      curl <CloudPakforData_URL>/v1/preauth/validateAuth -u <username>:<password>
      
    2. Upload the JSON file:

      curl -X PUT \
      'https://<CloudPakforData_URL>/zen-data/v1/volumes/files/%2F_global_%2Fconfig%2F.runtime-definitions%2Fibm' \
      -H 'Authorization: Bearer <platform-access-token>' \
      -H 'content-type: multipart/form-data' \
      -F upFile=@/path/to/runtime/def/<custom-def-name>-server.json
      

      Important: Change the name of the modified JSON file. The file name must end with server.json and the same file name must be used across all clusters to enable exporting and importing analytics projects across cluster boundaries.

      If the changed JSON file was uploaded successfully, you will see the following response:

       {
           "_messageCode_": "Success",
           "message": "Successfully uploaded file and created the necessary directory
                      structure"
       }
      
  4. Restart the notebook runtime.

Can't promote existing notebook because no version exists

If you are working with a notebook that you created prior to IBM Cloud Pak for Data 4.0.0, and you want to promote this notebook to a deployment space, you will get an error message stating that no version exists for this notebook.

If this notebook also has a job definition, in addition to saving a new version, you need to edit the job settings.

To enable promoting existing notebooks and edit job settings, see Promoting notebooks.

Applies to: 4.0.0 and later when upgrading from 3.5

JupyterLab hangs if a notebook code cell has more than 1000 lines of code

When you insert more than 1000 lines of code into a notebook code cell in JupyterLab, JupyterLab hangs. You are asked to either close the browser tab or continue waiting.

Workaround

Until an open source fix is available for this error, make sure that your notebook code cells in JupyterLab have fewer than 1000 lines of code. If you need to paste many lines of data into a code cell, store the data in a file instead and load the data into the notebook or script.

Errors when mixing Insert to code function options in R notebooks with Spark

The Insert to code function leverages Apache Arrow and Apache Flight in some function options for faster data access. This means that you can select Insert to code function options that use the old method to access data as well as options that use Arrow and Flight. To avoid errors when running R notebooks with Spark, you must be careful to not mix options that use different data access methods in the same notebook.

The R Insert-to-code options include:

Wherever possible, do not mix the options R DataFrame and R DataFrame (deprecated), SparkSession DataFrame, or Credentials within the same notebook, even for different connection types.

For example, if you run code that was inserted by options other than R DataFrame first, followed by code from the R DataFrame option, you will see the following error:

error in py_module_import(module, convert = convert): ImportError: /opt/ibm/conda/miniconda/lib/python/site-packages/pyarrow/../../.././libbrotlienc.so.1: undefined symbol: BrotliDefaultAllocFunc

Workaround

If you can't avoid mixing Insert to code options that use different access methods in the same notebook, add the following code at the top of your notebook, and run this cell first:

library("reticulate")
pa <- import("pyarrow")
library(ibmWatsonStudioLib)
wslib <- access_project_or_space()

SparkSession fails to start in Spark 3.0 & R 3.6 notebooks after upgrading to version 4.0.5

When you upgrade from Cloud Pak for Data 4.0.3 to 4.0.5 or from Cloud Pak for Data 3.5 Refresh 9 (October 2021) to 4.0.5 and you open a notebook in a Spark 3.0 & R 3.6 environment, an error is displayed stating that a SparkSession cannot be started.

This occurs because the JDBC drivers are not compatible with the latest log4j versions.

Workaround

While upgrading, also update the custom JDBC drivers that you have deployed and make sure that you use the latest versions of these drivers that are compatible with log4j 2.17.0.

Insert to code function in notebooks on Spark 2.4 with Scala or R doesn't support Flight Service

The Insert to code function in notebooks that run in environments that use Spark 2.4 with Scala or R doesn't support the Flight Service, which is based on Apache Arrow Flight, for communicating with database connections or connected data assets when loading data into a data structure.

Workaround

You can continue using the old insert to code function. Although the old insert to code function doesn't support as many data sources, it does include the capability of adding the data source credentials. With the credentials, you can write your own code to access the asset and load data into data structures of your choice in your notebooks.

Note that code generated by the Insert to code function for Scala or R in a notebook that runs on Spark 3.0 can't be used in a notebook that runs on Spark 2.4.

Code inserted by the Insert to code function for Mongo DB connections in Scala 2.12 notebooks with Spark 3.0 sometimes returns errors

Sometimes the insert to code function for Mongo DB connections in notebooks that run in Spark 3.0 and Scala 2.12 returns an error when the inserted code is run.

Workaround

If the inserted code returns an error when accessing data from a Mongo DB connection, you need to write your own code to access and load data into the data structures of your choice in your notebook.

Fixed in: 4.0.7

Python 3.8 notebook kernel dies while running generated code from Insert to code function

If you run the code generated by the Insert to code function to load data from a database connection to your notebook and you encounter an error stating that the kernel appears to have died, the reason might be that the data set that you are trying to load is too large and the kernel has run out of memory.

Workaround

If you know that your data set is large (possibly larger than the memory allocated to the kernel), you should edit the generated code before you run it in your notebook.

If the generated code contains a select_statement interaction property like in the following Db2 example:

Db2_data_request = {
    'connection_name': """Db2""",
    'interaction_properties': {
        'select_statement': 'SELECT * FROM "USER"."TABLE"'
    }
}

Modify the Db2 select_statement to only retrieve the first 5000 rows as follows:

Db2_data_request = {
    'connection_name': """Db2""",
    'interaction_properties': {
        'select_statement': 'SELECT * FROM "USER"."TABLE" FETCH FIRST 5000 ROWS ONLY'
    }
}

For the select_statements in all other database connections, use the corresponding SQL expressions.

If the generated code contains schema_name and table_name like in the following example:

PostgreSQL_data_request = {
    'connection_name': """PostgreSQL""",
    'interaction_properties': {
        #'row_limit': 500,
        'schema_name': 'schema',
        'table_name': 'table'
    }
}

Remove the comment on row_limit to retrieve only the first 500 rows as follows:

PostgreSQL_data_request = {
    'connection_name': """PostgreSQL""",
    'interaction_properties': {
        'row_limit': 500,
        'schema_name': 'schema',
        'table_name': 'table'
    }
}

Fixed in: 4.0.8

Save data and upload file size limitation in project-lib and ibm-watson-studio-lib for Python

If you use the save_data or upload_file functions in ibm-watson-studio-lib for Python or the save_data function in project-lib for Python, the data or file size must not exceed 2 GB.

2 GB is a hard limit in project-lib for Python. In ibm-watson-studio-lib for Python, you can follow the steps described in the workaround to save data or upload a file larger than 2 GB.

Workaround:

To work with data or a file that is larger than 2 GB in size in ibm-watson-studio-lib for Python, you need to move the file or the data to the storage associated with the project.

  1. If you upload a file in a notebook using the upload_file function, the file is already available in the file system of your environment runtime and you can skip step 2.
  2. If you upload data in a notebook using the save_data function, you need to save the data to a file in the local file system of your environment runtime, for example:
     with open("my_asset.csv", "wb") as file:
         file.write(data)
    
  3. Retrieve the path to the mounted project storage:

     wslib.mount.get_base_dir()
    

    Make a note of the path, for example, /project_data/data_asset/.

  4. Copy the files to the mounted project storage, for example:
     !cp my_asset.csv /project_data/data_asset/
    
  5. Register the files in the mounted project storage as data assets in your project:
     wslib.mount.register_asset("/project_data/data_asset/my_asset.csv", asset_name="my_asset.csv")
    

Notebooks fail to start even after custom environment definition is fixed

If your custom environment definition has a problem, for example, it references a custom image that is no longer available or tries to use too much CPU or memory, the associated notebook or JupyterLab runtime will not start. This is the expected behavior.

However, even if you update the environment definition to fix the issue, the notebook or JupyterLab runtime will still not start with this environment.

Workaround

Create a new environment definition and associate this environment definition with your notebook or select it when you launch JupyterLab.

Applies to: 4.0.7

Notebook or JupyterLab runtimes might not be accessible after running for more than 12 hours

If you run a notebook or a JupyterLab session for more than 12 hours, you might get an error stating that the runtime can no longer be accessed.

Workaround

To access the runtime again:

  1. Stop the runtime from the Environments tab of the project, or under Projects > Active Runtimes.
  2. Start the notebook or JupyterLab session again.

Applies to: 4.0.7

Fixed in: 4.0.8

Insert to code fails when the Flight service load is very high

When you work with the Insert to code function in a notebook after upgrading from Cloud Pak for Data 4.0.7 to 4.0.8, you might see an error stating that the Flight service is unavailable because the server concurrency limit was reached. This error occurs because of the overall high load on the Flight service, which cannot process any further requests. It does not mean that there is a problem with the code in your notebook.

Workaround

If you see the error when running a notebook interactively, try running the cell again. If you see the error in the log of a job, try running the job again, if possible at a time when the system is less busy.

Applies to: 4.0.7

Anaconda Repository for IBM Cloud Pak for Data

Channel names for Anaconda Repository for IBM Cloud Pak for Data don't support double-byte characters

When you create a channel in Anaconda Team Edition, you can't use double-byte characters or most special characters. You can use only these characters: a-z 0-9 - _

RStudio

RStudio Sparklyr package 1.4.0 can't connect with Spark 3.0 kernel

When users try to connect to a remote Spark 3.0 kernel with the Sparklyr R package in RStudio, the connection fails because of connection issues that were introduced in Sparklyr R package version 1.4.0. This will be addressed in future releases. The workaround is to use the Spark 2.4 kernel.

Applies to: 4.0.0 only.

Fixed in: 4.0.1

Sparklyr R package version 1.7.0 is now used in Spark 3.0 kernels.

Running job for R script and selected RStudio environment results in an error

When you're running a job for an R Script and a custom RStudio environment was selected, the following error occurs if the custom RStudio environment was created with a previous release of Cloud Pak for Data: The job uses an environment that is not supported. Edit your job to select an alternative environment.

To work around this issue, delete and re-create the custom RStudio environment with the same settings.

Applies to: 4.0.0 only.

Fixed in: 4.0.1

Git integration broken when RStudio crashes

If RStudio crashes while you are working on a script and you restart RStudio, integration with the associated Git repository is broken. The reason is that the RStudio session workspace is in an incorrect state.

Workaround

If Git integration is broken after RStudio crashed, complete the following steps to reset the RStudio session workspace:

  1. Click on the Terminal tab next to the Console tab to create a terminal session.
  2. Navigate to the working folder /home/wsuser and rename the .rstudio folder to .rstudio.1.
  3. From the File menu, click Quit Session... to end the R session.
  4. Click Start New Session when prompted. A new R project with Git integration is created.

No Git tab although RStudio is launched with Git integration

When you launch RStudio in a project with Git integration, the Git tab might not be visible on the main RStudio window. The reason for this is that if the RStudio runtime needs longer than usual to start, the .Rprofile file that enables integration to the associated Git repository cannot run.

Workaround

To add the Git tab to RStudio:

  1. Run the following command from the RStudio terminal:
    cp $R_HOME/etc/.Rprofile  $HOME/.Rprofile
    echo "JAVA_HOME='/opt/conda/envs/R-3.6'" >> $HOME/.Renviron
    
  2. From the Session menu, select Quit Session... to quit the session.
  3. If you are asked whether you want to save the workspace image to ~/.RData, select Don't save.
  4. Then click Start New Session.

Applies to: 4.0.0 only.

Fixed in: 4.0.1

RStudio doesn't open although you were added as project collaborator

If RStudio does not open and all you see is an endless spinner, the reason is that, although you were added as a collaborator to the project, you have not created your own personal access token for the Git repository that is associated with the project. To open RStudio with Git integration, you must select your own access token.

To create your own personal access token, see Collaboration in RStudio.

Data in persistent storage volume not mounted when RStudio is launched

If you use a PersistentVolumeClaim (PVC) on the Cloud Pak for Data cluster to store large data sets, the storage volume is not automatically mounted when RStudio is launched in a project with default Git integration.

Workaround:

If you want to work with data in a persistent storage volume in RStudio, you must either:

In both of these types of projects, the persistent storage volume is automatically mounted when RStudio is launched and can be viewed and accessed in the /mnts/ folder.

Applies to: 4.0.1 through 4.0.3

Fixed in: 4.0.4

Can't connect to Hadoop Livy in RStudio

If you are working in RStudio and you try to connect to Hadoop Livy, you will see an error stating that a Livy session can't be started. The reason for this error is a version mismatch between the installed cURL 4.3 R package and the Livy connection that uses Sparklyr.

Workaround

To successfully connect to Livy using Sparklyr:

  1. Downgrade the cURL R package to the 3.3 version.

    As CRAN installs the latest cURL version by default, use the following command to downgrade to version 3.3:

     install.packages("https://cran.r-project.org/src/contrib/Archive/curl/curl_3.3.tar.gz", repos=NULL)
    

Runtime pod fails when runtime is started

Often, when you start an RStudio runtime, the associated runtime pod fails, changing from the Running state to the Terminating state. This behavior can also be observed when starting SPSS, Jupyter notebook, or Data Refinery runtimes.

These pods fail because the runtime manager waits for the runtime operator to update its status in a POST operation that times out, which results in the deletion of the runtime operator.

Applies to: 4.0.7

Fixed in: 4.0.8

Data Refinery

Cannot run a Data Refinery flow job with certain unsigned data types

If the source table contains one of the following data types or equivalents, the Data Refinery flow job will fail with a ClassCastException error:

Applies to: 4.0.8 and later

Cannot view visualization charts in Data Refinery after upgrade

After you upgrade to Cloud Pak for Data 4.0.8, the visualization charts do not open in Data Refinery.

Workaround

Restart the Data Refinery pods:

  1. Find the names of the Data Refinery pods:

     oc get pod -l release=ibm-data-refinery-prod

    For example:

     oc get po -l release=ibm-data-refinery-prod
     NAME                            READY   STATUS    RESTARTS   AGE
     wdp-dataprep-5444f79b5d-xhlww   1/1     Running   0          39h
     wdp-shaper-5fc5d87674-c8lq9     1/1     Running   0          39h

  2. Delete the pods by name:

     oc delete pod <pod names>

    For example:

     oc delete pod wdp-dataprep-5444f79b5d-xhlww wdp-shaper-5fc5d87674-c8lq9
     pod "wdp-dataprep-5444f79b5d-xhlww" deleted
     pod "wdp-shaper-5fc5d87674-c8lq9" deleted

The two pods restart after the old pods are terminated.

Applies to: 4.0.8

Cannot run a Data Refinery flow job with data from a Hadoop cluster

If you run a Data Refinery flow job with data from one of the following connections, the job will fail:

Applies to: 4.0.6 - 4.0.7
Fixed in: 4.0.8

Option to open saved visualization assets is disabled in Data Refinery

After creating a visualization in Data Refinery and clicking Save to project, the option to open the saved visualization asset is disabled.

Applies to: 4.0.7.
Fixed in: 4.0.8.

Cannot refine data that uses commas in the source data and a target that uses a delimited file format

If the source file uses commas in the data (the commas are part of the data, not the delimiters), and you specify the Delimited file format for the target, the job will fail.

Workaround: Choose the CSV file format for the target.

Applies to: 4.0.6 and later

Data Refinery flow job fails when writing double-byte characters to an Avro file

If you run a job for a Data Refinery flow that uses a double-byte character set (for example, the Japanese or Chinese languages), and the output file is in the Avro file format, the job will fail.

Applies to: 3.5.0
Fixed in: 4.0.6

Data Refinery flow job fails with a large data asset

If your Data Refinery flow job fails with a large data asset, try these troubleshooting tips to fix the problem:

Applies to: 3.5.0 and later

Certain Data Refinery flow GUI operations might not work on large data assets

Data Refinery flow jobs that include these GUI operations might fail for large data assets.

Applies to: 3.5.0 and later. These operations are not fixed yet:

These operations are fixed in 4.0.3:

See Data Refinery flows with large data sets need updating when using certain GUI operations.

These operations are fixed in 4.0.6:

See Data Refinery flows with large data sets need updating when using certain GUI operations.

Data Refinery flows with large data sets need updating when using certain GUI operations

For running Data Refinery jobs with large data assets, the following GUI operations have performance enhancements that require you to update any Data Refinery flows that use them:

Applies to: 4.0.3 and later:

Applies to: 4.0.6 and later:

To improve the job performance of a Data Refinery flow that uses these operations, update the Data Refinery flow by opening it and saving it, and then running a job for it. New Data Refinery flows automatically have the performance enhancements. For instructions, see Managing Data Refinery flows.

Data Refinery flow job fails for large Excel files

This problem is due to insufficient memory. To solve this problem, the cluster admin can add a new server to the cluster or add more physical memory. Alternatively, the cluster admin can use the OpenShift console to increase the memory allocated to the wdp-connect-connector service, but be aware that doing so might decrease the memory available to other services.
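
As an alternative to the OpenShift console, the memory limit can also be raised from the command line. This is a sketch only: it assumes the service runs as a deployment named wdp-connect-connector in the instance namespace, and the 4Gi value is an example, not a recommendation; the service operator may later reconcile this change back to its defaults.

     # Sketch: raise the memory limit for the wdp-connect-connector deployment (example value).
     # <cpd_instance_namespace> is a placeholder; the operator may revert this change on reconcile.
     oc set resources deployment wdp-connect-connector -n <cpd_instance_namespace> --limits=memory=4Gi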

Applies to: 4.0.2
Fixed in: 4.0.3

Cannot run a Data Refinery flow job with data from an Amazon RDS for MySQL connection

If you create a Data Refinery flow with data from an Amazon RDS for MySQL connection, the job will fail.

Applies to: 4.0.0
Fixed in: 4.0.1

Duplicate connections in a space resulting from promoting a Data Refinery flow to a space

When you promote a Data Refinery flow to a space, all dependent data is promoted as well. If the Data Refinery flow that is being promoted has a dependent connection asset and a dependent connected data asset that references the same connection asset, the connection asset will be duplicated in the space.

The Data Refinery flow will still work. Do not delete the duplicate connections.

Applies to: 3.5.0 and later

Data Refinery flow fails with "The selected data set wasn't loaded" message

The Data Refinery flow might fail if there are insufficient resources. The administrator can monitor the resources and then add resources by scaling the Data Refinery service or by adding nodes to the Cloud Pak for Data cluster.

Applies to: 3.5.0 and later

Jobs

Spark jobs are supported only by API

If you want to run analytical and machine learning applications on your Cloud Pak for Data cluster without installing Watson Studio, you must use the Spark jobs REST APIs of Analytics Engine powered by Apache Spark. See Getting started with Spark applications.

UI displays job run started by Scheduler and not by a specific user

If you manually trigger a run for a job that has a schedule defined, the UI will show that the job run was started by Scheduler, not that it was started by a particular user.

Applies to: 4.0.5

Excluding days when scheduling a job causes unexpected results

If you select to schedule a job to run every day of the week excluding given days, you might notice that the scheduled job does not run as you would expect. The reason might be due to a discrepancy between the timezone of the user who creates the schedule, and the timezone of the master node where the job runs.

This issue only exists if you exclude days of a week when you schedule to run a job.

Error occurs when jobs are edited

You cannot edit jobs that were created prior to upgrading to Cloud Pak for Data version 3.0 or later. An error occurs when you edit those jobs. Create new jobs after upgrading to Cloud Pak for Data version 3.0 or later.

Errors can also occur if the user who is trying to edit the job or schedule is different from the user who started or created the job. For example, if a Project Editor attempts to edit a schedule that was created by another user in the project, an error occurs.

Can't delete notebook job stuck in starting or running state

If a notebook job is stuck in starting or running state and won't stop, although you tried to cancel the job and stopped the active environment runtime, you can try deleting the job by removing the job-run asset manually using the API.

  1. Retrieve a bearer token from the user management service using an API call:
    curl -k -X POST https://PLATFORM_CLUSTER_URL/icp4d-api/v1/authorize -H 'cache-control: no-cache' -H 'content-type: application/json' -d '{"username":"your_username","password":"your_password"}'
    
  2. (Optional) Get the job-run asset and test the API call. Replace ${token}, ${asset_id}, and ${project_id} accordingly.
    curl -H 'accept: application/json' -H 'Content-Type: application/json' -H "Authorization: Bearer ${token}" -X GET "<PLATFORM_CLUSTER_URL>/v2/assets/${asset_id}?project_id=${project_id}"
    
  3. Delete the job-run asset. Again replace ${token}, ${asset_id}, and ${project_id} accordingly.
    curl -H 'accept: application/json' -H 'Content-Type: application/json' -H "Authorization: Bearer ${token}" -X DELETE "<PLATFORM_CLUSTER_URL>/v2/assets/${asset_id}?project_id=${project_id}"
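
If you prefer to script these calls, the following bash sketch chains the token retrieval and the delete. It is illustrative only: it assumes the authorize response contains a token field and that jq is installed, and the cluster URL, asset ID, project ID, and credentials are placeholders.

     # Illustrative sketch: obtain a bearer token and delete the stuck job-run asset.
     # Assumes the authorize response contains a "token" field and that jq is available.
     PLATFORM_CLUSTER_URL=https://<cluster_url>      # placeholder
     asset_id=<job_run_asset_id>                     # placeholder
     project_id=<project_id>                         # placeholder
     token=$(curl -k -s -X POST "${PLATFORM_CLUSTER_URL}/icp4d-api/v1/authorize" \
       -H 'content-type: application/json' \
       -d '{"username":"your_username","password":"your_password"}' | jq -r .token)
     curl -k -H "Authorization: Bearer ${token}" \
       -X DELETE "${PLATFORM_CLUSTER_URL}/v2/assets/${asset_id}?project_id=${project_id}"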
    

Notebook runs successfully in notebook editor but fails when run as job

Some libraries require a kernel restart after a version change. If you need to work with a library version that isn't pre-installed in the environment in which you start the notebook, and you install this library version through the notebook, the notebook only runs successfully after you restart the kernel. However, when you run the notebook non-interactively, for example as a notebook job, it fails because the kernel can't be restarted. To avoid this, create an environment definition and add the library version that you require as a software customization. See Creating environment definitions.

Can't change the schedule in existing jobs after upgrading to Cloud Pak for Data 4.0.7

If you created scheduled jobs in earlier versions of Cloud Pak for Data and are upgrading to Cloud Pak for Data version 4.0.7, you can't change or remove the schedule from these existing jobs.

Workaround

If you need to change the schedule in an existing job after upgrading to Cloud Pak for Data version 4.0.7:

  1. Delete the existing job.
  2. Create a new scheduled job.

For details, see Creating and managing jobs in an analytics project.

Can't run a Scala 2.12 with Spark 3.0 notebook job in a deployment space

If you use code generated by the Insert to code function in a Scala 2.12 with Spark 3.0 notebook and you want to run this code in a job in a deployment space, you must use the code generated by the deprecated Insert to code function.

If you run code that was generated by an Insert to code function that uses the Flight service, your job will fail.

Federated Learning

Authentication failures for Federated Learning training jobs when allowed IPs are specified in the Remote Training System

Currently, the OpenShift Ingress Controller does not set the X-Forwarded-For header with the client's IP address, regardless of the forwardedHeaderPolicy setting. This causes authentication failures for Federated Learning training jobs when allowed_ips are specified in the Remote Training System, even though the client IP address is correct.

To use the Federated Learning Remote Training System IP restriction feature in Cloud Pak for Data 4.0.3, configure an external proxy to inject the X-Forwarded-For header. For more information, see this article on configuring ingress.

Applies to: 4.0.3 or later

Data module not found in IBM Federated Learning

The data handler for IBM Federated Learning is trying to extract a data module from the FL library but is unable to find it. You might see the following error message:

ModuleNotFoundError: No module named 'ibmfl.util.datasets'

The issue might result from using an outdated DataHandler. Review and update your DataHandler to conform to the latest spec. See the most recent MNIST data handler, or ensure that your gallery sample versions are up to date.

Applies to: 4.0.3 or later

Unable to save Federated Learning experiment following upgrade

If you train your Federated Learning model in a previous version of Cloud Pak for Data and then upgrade, you might get this error when you try to save the model following the upgrade: "Unexpected error occurred creating model." The issue results from training with a framework that is not supported in the upgraded version of Cloud Pak for Data. To resolve the issue, retrain the model with a supported framework, then save it.

Applies to: 4.0.2 or later

Watson Machine Learning

Deployments fail for Keras models published to catalog then promoted from project to space

If you publish a Keras model with custom layers to a catalog, and then copy it back to a project, deployments for the model will fail after promotion to a space. The flow is as follows:

  1. Create a model with custom layers of type tensorflow_2.7 or tensorflow_rt22.1 with software specification runtime-22.1-py3.9 or tensorflow_rt22.1-py3.9.
  2. Publish the model to a Watson Knowledge Catalog.
  3. From the catalog, add the model to a project.
  4. Promote the model to a space.

At this point, the custom layer information is lost, so deployments of the model will fail. To resolve the issue, save the model to a project without publishing to a catalog.

Predictions API in Watson Machine Learning service can timeout too soon

If the predictions API (POST /ml/v4/deployments/{deployment_id}/predictions) in the Watson Machine Learning deployment service is timing out too soon, follow these steps to manually update the timeout interval.

Applies to: 4.0 and higher

Updating the Prediction service in Envoy

  1. Capture the wmlenvoyconfig configmap content in a yaml file:

     oc get cm wmlenvoyconfig -o yaml > wmlenvoyconfig.yaml

  2. In the wmlenvoyconfig.yaml file, search for the property timeout_ms and update the value to the required timeout (in milliseconds):

     "timeout_ms": <REQUIRED_TIMEOUT_IN_MS>

    For example, to update the timeout to 600000 milliseconds:

     "timeout_ms": 600000

  3. To apply the timeout changes, first delete the configmap:

     oc delete -f wmlenvoyconfig.yaml

  4. Recreate the configmap:

     oc create -f wmlenvoyconfig.yaml

  5. Patch the wml-cr custom resource to set ignoreForMaintenance: true. This puts the operator into maintenance mode, which stops automatic reconciliation; otherwise, automatic reconciliation undoes the configmap changes:

     oc patch wmlbase wml-cr --type merge --patch '{"spec": {"ignoreForMaintenance": true}}' -n <namespace>

  6. Restart the Envoy pod:

     oc rollout restart deployment wml-deployment-envoy

  7. Wait for the Envoy pod to come up:

     oc get pods | grep wml-deployment-envoy
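
The steps above can also be scripted. The following bash sketch is illustrative only: it uses sed to set the timeout value and assumes the property appears in the yaml exactly as "timeout_ms": <number>; if the configmap stores it differently, edit the file manually as in step 2, and review wmlenvoyconfig.yaml before applying it.

     # Illustrative sketch of steps 1-7; review wmlenvoyconfig.yaml before applying.
     NAMESPACE=<namespace>          # placeholder
     NEW_TIMEOUT_MS=600000          # example value from the text above
     oc get cm wmlenvoyconfig -o yaml > wmlenvoyconfig.yaml
     # Assumes the property is present in the file as "timeout_ms": <number>
     sed -i "s/\"timeout_ms\": [0-9]*/\"timeout_ms\": ${NEW_TIMEOUT_MS}/" wmlenvoyconfig.yaml
     oc delete -f wmlenvoyconfig.yaml
     oc create -f wmlenvoyconfig.yaml
     # Put the operator into maintenance mode so reconciliation does not undo the change.
     oc patch wmlbase wml-cr --type merge --patch '{"spec": {"ignoreForMaintenance": true}}' -n ${NAMESPACE}
     oc rollout restart deployment wml-deployment-envoy
     oc get pods | grep wml-deployment-envoy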
    

Updating the Prediction service in NGINX

  1. Make a backup of wml-base-routes configmap:

     oc get cm wml-base-routes -o yaml > wml-base-routes.yaml.bkp
    
  2. Edit the configmap

     oc edit cm wml-base-routes
    
  3. Search for the keyword predictions in the configmap wml-base-routes and update the following properties to the required timeout value (in seconds) for that location:

     proxy_send_timeout
     proxy_read_timeout
     send_timeout
    

    For example, to update the properties proxy_send_timeout, proxy_read_timeout, and send_timeout to a timeout of 600 seconds, update these locations:

     location ~ ^/ml/v4/deployments/([a-z0-9_]+)/predictions {
       rewrite ^\/ml/v4/deployments/([a-z0-9_]+)/(.*) /ml/v4/deployments/$1/$2 break;
       proxy_pass https://wml-envoy-upstream;
       proxy_http_version 1.1;
       proxy_set_header Connection "";
       proxy_set_header Host $http_host;
       proxy_set_header x-global-transaction-id $x_global_transaction_id;
       proxy_set_header v4-deployment-id $1;
       proxy_pass_request_headers      on;
       proxy_connect_timeout       30;
       proxy_send_timeout          600;
       proxy_read_timeout          600;
       send_timeout                600;
       proxy_next_upstream error timeout;
       }
    
     location ~ ^/ml/v4/deployments/([a-z0-9-]+)/predictions {
       rewrite ^\/ml/v4/deployments/([a-z0-9-]+)/(.*) /ml/v4/deployments/$1/$2 break;
       proxy_pass https://wml-envoy-upstream;
       proxy_http_version 1.1;
       proxy_set_header Connection "";
       proxy_set_header Host $http_host;
       proxy_set_header x-global-transaction-id $x_global_transaction_id;
       proxy_set_header v4-deployment-id $1;
       proxy_pass_request_headers      on;
       proxy_connect_timeout       30;
       proxy_send_timeout          600;
       proxy_read_timeout          600;
       send_timeout                600;
       proxy_next_upstream error timeout;
     }
    
  4. Search for the keyword icpdata_addon_version in the labels section, and update its value by appending a number to it, in the form <old value>-<any_random_number>. For example, if the value is 4.0.3, change it to 4.0.3-100.

     labels:
       app: wml-base-routes
       app.kubernetes.io/instance: ibm-wml-cpd
       app.kubernetes.io/managed-by: ansible
       app.kubernetes.io/name: ibm-wml-cpd
       app.kubernetes.io/version: 4.0.3
       icpdata_addon: "true"
       icpdata_addon_version: 4.0.3-100
       release: ibm-wml
    
  5. Update the WML configuration file in the NGINX pod:

    1. List the NGINX pods:

      oc get pods | grep "ibm-nginx"
      
    2. Get into any one of the ibm-nginx pods:

      oc exec -it <pod> bash
      
    3. Search for the keyword predictions in the file /user-home/_global_/nginx-conf.d/wml-base-routes.conf and update the properties proxy_send_timeout, proxy_read_timeout, and send_timeout to the required timeout value (in seconds).

      For example, to update the properties proxy_send_timeout, proxy_read_timeout, and send_timeout to a timeout of 600 seconds, update these locations:

      location ~ ^/ml/v4/deployments/([a-z0-9_]+)/predictions {
      rewrite ^\/ml/v4/deployments/([a-z0-9_]+)/(.*) /ml/v4/deployments/$1/$2 break;
      proxy_pass https://wml-envoy-upstream;
      proxy_http_version 1.1;
      proxy_set_header Connection "";
      proxy_set_header Host $http_host;
      proxy_set_header x-global-transaction-id $x_global_transaction_id;
      proxy_set_header v4-deployment-id $1;
      proxy_pass_request_headers      on;
      proxy_connect_timeout       30;
      proxy_send_timeout          600;
      proxy_read_timeout          600;
      send_timeout                600;
      proxy_next_upstream error timeout;
      }
      
      location ~ ^/ml/v4/deployments/([a-z0-9-]+)/predictions {
      rewrite ^\/ml/v4/deployments/([a-z0-9-]+)/(.*) /ml/v4/deployments/$1/$2 break;
      proxy_pass https://wml-envoy-upstream;
      proxy_http_version 1.1;
      proxy_set_header Connection "";
      proxy_set_header Host $http_host;
      proxy_set_header x-global-transaction-id $x_global_transaction_id;
      proxy_set_header v4-deployment-id $1;
      proxy_pass_request_headers      on;
      proxy_connect_timeout       30;
      proxy_send_timeout          600;
      proxy_read_timeout          600;
      send_timeout                600;
      proxy_next_upstream error timeout;
      }
      
  6. Restart the NGINX pods:

     oc rollout restart deployment ibm-nginx
    
  7. Check that the NGINX pods have come up:

     oc get pods | grep "ibm-nginx"
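
To confirm that the new values are active, you can check the updated file inside one of the restarted pods. This check is a sketch only; it uses the file path referenced in step 5 and picks the first ibm-nginx pod it finds.

     # Sketch: verify the timeout values in a running ibm-nginx pod.
     POD=$(oc get pods -o name | grep ibm-nginx | head -n 1)
     oc exec "${POD}" -- grep -E "proxy_send_timeout|proxy_read_timeout|send_timeout" \
       /user-home/_global_/nginx-conf.d/wml-base-routes.conf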
    

Updating the HAProxy timeout

To update the HAProxy timeout, see HAProxy timeout settings for the load balancer.

Deployment of AutoAI model can fail when training input and deployment input don't match

If you train an AutoAI experiment using a database as the training data source, and then deploy the model using a CSV file as the deployment input data, the deployment might fail with an error stating cannot resolve... and eventually time out.

To resolve the error, use the same type of data source for training and deploying the model.

Applies to: 4.0.6

Deploying SPSS Modeler flows with Data Asset Import node inside supernode fails

If your SPSS Modeler flow includes a supernode containing a Data Asset Import node, the deployment of the flow will fail. To resolve the issue, move the Data Asset Import node outside of the supernode.

Fixed in: 4.0.8

Deploying some SPSS model types saved as PMML fails

If you save an SPSS model of type Carma, Sequence, or Apriori as PMML, and then promote or import the PMML file to a deployment space, online or batch deployments created from the PMML file will fail with this error:

Deploy operation failed with message: Invalid PMML or modelSequenceModel not supported by Scoring Engine.

To resolve this issue, use the Save Branch as a Model option to save the SPSS flow as a model, promote it to a space, and create the deployments.

Applies to: 4.0.6 and earlier

Fixed in: 4.0.7 for Apriori and Carma models.

Deployments can fail with framework mismatch between training and WMLA

When you are creating a deployment that relies on WMLA, make sure your framework is supported for both training and deployment, or the deployment might fail. For example, if you train a model on a PyTorch framework based on Python 3.7 and deploy on a version of WMLA that supports Python 3.9, you will get this error: your model ir_version is higher than the checker's, indicating a mismatch. In that case, retrain your model using a framework that is supported for both training and deployment.

SPSS deployment jobs with no schema ID fail

When creating a batch deployment job for SPSS models without input schemas defined, you can manually define the schema ID and associated data asset. If you select a data asset but don't provide an associated schema ID, you are not prompted to correct the error, and the job is created without any input data references.

Applies to: 4.0.3
Fixed in: 4.0.7

Deployment unusable because deployment owner left the space

If the deployment owner leaves the space, the deployment becomes unusable. Here is how you can verify this:

If this happens, a space administrator can assign a new deployment owner. The new owner must be either a space administrator or an editor.

If you are updating the current deployment owner, only a replace operation is allowed with value path: /metadata/owner.

For information on how to update the deployment owner, refer to the "Updating deployment details using the Patch API command" section in Updating a deployment.

Applies to: 4.0.2 and 4.0.3

Duplicate deployment serving names need updating

Starting in 4.0.3, serving names that users assign to deployments must be unique per cluster. Users can check whether an existing serving name is unique by using the GET /ml/v4/deployments?serving_name&conflict=true API call. If the call returns a status code of 204, the name is unique and no further change is required. If the call returns a status code of 409, the user can update the serving name by using the PATCH API. Deployments with invalid serving names will fail with an error requiring the user to update the name. For details on serving names, see Creating an online deployment. For details on using the PATCH command, see Update the deployment metadata.
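
For example, the following curl sketch performs the uniqueness check. It is illustrative only: the cluster URL, serving name, and token are placeholders, it assumes serving_name takes the candidate name as its value, and the version query parameter date is an example that your WML API release might require a different value for.

     # Sketch: check whether a serving name is already in use (204 = unique, 409 = conflict).
     # <cluster_url>, <serving_name>, and ${token} are placeholders; the version date is an example.
     curl -k -s -o /dev/null -w "%{http_code}\n" \
       -H "Authorization: Bearer ${token}" \
       "https://<cluster_url>/ml/v4/deployments?version=2021-06-24&serving_name=<serving_name>&conflict=true"

A 204 response means the name is free; a 409 means it conflicts and must be changed with the PATCH API, as described above.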

Applies to: 4.0.2

Upgrade from Cloud Pak for Data 3.5 appears to fail before resolving

While Watson Machine Learning installation is in progress, the Resource Creation status may temporarily show Failed. The Watson Machine Learning resource will attempt reconciliation and the issue should automatically resolve. If the Watson Machine Learning CR status does not change to Complete after an extended period of time, contact IBM Support.

Applies to: 4.0.3

Restrictions for IBM Z and IBM LinuxONE users

When Cloud Pak for Data is installed on the IBM Z and LinuxONE platforms, Watson Studio and Watson Machine Learning users will not be able to use, run, or deploy the following types of assets:

Additionally, note the following:

Applies to: 4.0.2 and greater

Spark and PMML models are not supported on FIPS-enabled clusters

Spark and PMML models deployed on FIPS-enabled clusters can fail with the error "Model deployment failed because a pod instance is missing."

Applies to: 4.0.3
Fixed in: 4.0.6

Deployments might fail after restore from backup

After restoring from a backup, users might be unable to deploy new models and score existing models. To resolve this issue, after the restore operation, wait until operator reconciliation completes. You can check the status of the operator with this command:

  kubectl describe WmlBase wml-cr -n <namespace_of_wml> | grep "Wml Status" | awk '{print $3}'
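
For example, you might poll the command above until reconciliation finishes. This loop is a sketch only: it assumes the Wml Status value eventually reads Completed, which might differ in your release, and <namespace_of_wml> is a placeholder.

     # Sketch: poll the WML custom resource until reconciliation reports a completed state.
     # Assumes the "Wml Status" value becomes "Completed"; adjust the string if your release differs.
     while true; do
       status=$(kubectl describe WmlBase wml-cr -n <namespace_of_wml> | grep "Wml Status" | awk '{print $3}')
       echo "Wml Status: ${status}"
       [ "${status}" = "Completed" ] && break
       sleep 60
     done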

Applies to: 4.0.2

Job run retention not working as expected

If you override the default retention settings for preserving job runs and specify an amount, you might find that the number retained does not match what you specified.

Applies to: 4.0.2 and 4.0.3

Deployment unusable when owner ID is removed

If the ID belonging to the owner of the deployment is removed from the organization or the space, then deployments associated with that ID become unusable.

AutoAI requirement for AVX2

The AVX2 instruction set is not required to run AutoAI experiments; however, it does improve performance. AutoAI experiments run more slowly without AVX2.

AutoAI AVX2 limitation

AutoAI experiments that use SnapML algorithms will not work if the CPU used to train the AutoAI experiment does not support AVX2. The training will fail with an error.

Applies to: 4.0.2 and 4.0.3
Fixed in: 4.0.4

Watson Machine Learning might require manual rescaling

By default, the small installation of Watson Machine Learning comes up with one pod. When the load on the service increases, you may experience these symptoms, indicating the need to manually scale the wmlrepository service:

  1. wmlrepository service pod restarts with an Out Of Memory error
  2. wmlrepository service request fails with this error:

    Generic exception of type HttpError with message: akka.stream.BufferOverflowException: Exceeded configured max-open-requests value of [256]. This means that the request queue of this pool  has completely filled up because the pool currently does not process requests fast enough to handle the incoming request load. Please retry the request later. See http://doc.akka.io/docs/akka-http/current/scala/http/client-side/pool-overflow.html for more information.
    

    Use this command to scale the repository:

     ./cpd-linux scale -a wml --config medium -s server.yaml -n

    The medium.yaml commands include:

     scale --replicas=2 deployment wmlrepository
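
After scaling, you can confirm the replica count. This check is a sketch and assumes the deployment is named wmlrepository in your Watson Machine Learning namespace (placeholder shown).

     # Sketch: confirm that the wmlrepository deployment reports the expected number of replicas.
     oc get deployment wmlrepository -n <namespace>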

Do not import/export models between clusters running on different architectures

When you export a project or space, the contents, including model assets, are included in the export package. You can then import the project or space to another server cluster. Note that the underlying architecture must be the same or you might encounter failures with the deployment of your machine learning models. For example, if you export a space from a cluster running the Power platform, then import to a cluster running x86-64, you may be unable to deploy your machine learning models.

Deleting model definitions used in Deep Learning experiments

Currently, users can create model definition assets from the Deep Learning Experiment Builder but cannot delete a model definition. They must use REST APIs to delete model definition assets.

RShiny app might load an empty page if user application sources many libraries from an external network

The initial load of an RShiny application might result in an empty page if the user application sources many dependent libraries from an external network. If this happens, try refreshing the app after a while. As a best practice, source the dependent libraries locally by bundling them into the RShiny app.

Python function or Python script deployments may fail if the itc_utils library and Flight service are used to access data

Python function or Python script deployments may fail if the Python function or script uses the itc_utils library to access data through the Flight service. As a workaround, make these changes in your code:

  1. Remove the RUNTIME_FLIGHT_SERVICE_URL environment variable. This must be done before the itcfs.get_flight_client API is invoked:

     os.environ.pop("RUNTIME_FLIGHT_SERVICE_URL")

  2. Initialize the itc client by using this line of code:

     read_client = itcfs.get_flight_client(host="wdp-connect-flight.cpd-instance.svc.cluster.local", port=443)
    

Applies to: 4.5.x releases

Automatic mounting of storage volumes not supported by online and batch deployments

You cannot use automatic mounts for storage volumes with Watson Machine Learning online and batch deployments. Watson Machine Learning does not support this feature for Python-based runtimes or for R-script, SPSS Modeler, Spark, and Decision Optimization runtimes. You can use automatic mounts for storage volumes only with Watson Machine Learning shiny app deployments and notebook runtimes.

As a workaround, you can use the download method from the Data assets library, which is part of the ibm-watson-machine-learning Python client.

Applies to: 4.0 and later

Parent topic: IBM Watson Studio