Known issues and limitations

Known issues and limitations in IBM Watson® Machine Learning Accelerator.

  • In 2.3.5 and later, if you are upgrading or downgrading the IBM Watson Machine Learning Accelerator service, the Grafana admin password is unchanged.

    Before you upgrade or downgrade, make sure to remember the current admin password. After the upgrade or downgrade is completed, you can continue using the same password.

  • If you directly installed 2.3.5, or upgraded or downgraded to 2.3.5, the elastic distributed inference API page does not render. The page, which is available from the service console (Help > API for Inference), fails to open with the following error:
    Errors
    Parser error on line 37
    
    Unable to render this definition The provided definition does not specify a valid version field.
    Please indicate a valid Swagger or OpenAPI version field. Supported version fields are swagger: "2.0"
    and those that match openapi: 3.0.n (for example, openapi: 3.0.0)
    To resolve this error so that the API page renders successfully, fix the openapi.yaml file in the persistent volume (PV) by running a temporary job:
    1. Change your project area to IBM Watson Machine Learning Accelerator:
      oc project wmla-ns
    2. Create job.yaml:
      apiVersion: batch/v1
      kind: Job
      metadata:
        name: fix-edi-job
      spec:
        template:
          metadata:
            name: fix-edi-job
          spec:
            containers:
            - command:
              - sh
              - -c
              - |
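                # Add the missing closing quote to the version string in openapi.yaml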
                sed -i s#\"2.3.5#\"2.3.5\"# /var/www/openapi.yaml
              image: busybox
              name: fix
              volumeMounts:
              - name: data
                mountPath: /var/www
                subPath: wml-edi/docs
            restartPolicy: Never
            volumes:
              - name: data
                persistentVolumeClaim:
                  claimName: wmla-edi
    3. Apply the job:
      oc apply -f job.yaml
    4. Ensure that the job completed successfully (an alternative check is sketched after these steps):
      oc get pods -w
    5. Delete the job:
      oc delete -f job.yaml
    6. Clear your web browser cache, and open the API again.
      Note: If imd is restarted, you may need to complete these steps again.
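    As an alternative to watching the pods in step 4, you can wait for the job to complete directly. This is a minimal sketch using the standard oc wait command; adjust the timeout to suit your cluster:
      oc wait --for=condition=complete job/fix-edi-job --timeout=120s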
  • In 2.3.3, 2.3.4, and 2.3.5, after upgrading to the latest version of IBM Watson Machine Learning Accelerator, the following error can occur when running elastic distributed training:
    "ModuleNotFoundError: No module named 'fabric_model'"
    To resolve this issue, update the dlpd configuration file and restart dlpd.
    1. Get the wmla-dlpd pod name in the IBM Watson Machine Learning Accelerator namespace (here the namespace name is wmla):
      oc get pods -n wmla | grep dlpd
      wmla-dlpd-76777d9674-nxp6c              2/2     Running   0          47m
    2. Get the current value of FABRIC_HOME in the pod:
      oc exec -it wmla-dlpd-76777d9674-nxp6c -c dlpd -- cat /var/shareDir/dli/conf/dlpd/dlpd.conf|grep FABRIC_HOME
          "FABRIC_HOME": "/opt/ibm/spectrumcomputing/dli/2.3.5/fabric",
      If the version (2.3.5 in this example) does not match the currently installed version of IBM Watson Machine Learning Accelerator, edit the dlpd.conf file, update the version to the currently installed version, and save your change (a non-interactive alternative is sketched after these steps):
      oc exec -it wmla-dlpd-76777d9674-nxp6c -c dlpd -- vi /var/shareDir/dli/conf/dlpd/dlpd.conf
    3. Verify that the file was updated and delete the wmla-dlpd pods to trigger a restart:
      oc delete pod wmla-dlpd-76777d9674-nxp6c 
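    If you prefer not to edit dlpd.conf interactively, the same change can be made with a sed one-liner. This is a sketch that assumes sed is available in the dlpd container, that the installed version is 2.3.6, and that the pod name from step 1 is used; substitute your own values:
      oc exec -it wmla-dlpd-76777d9674-nxp6c -c dlpd -- sed -i 's#dli/2.3.5/fabric#dli/2.3.6/fabric#' /var/shareDir/dli/conf/dlpd/dlpd.conf
      oc exec -it wmla-dlpd-76777d9674-nxp6c -c dlpd -- grep FABRIC_HOME /var/shareDir/dli/conf/dlpd/dlpd.conf
    You still need to delete the wmla-dlpd pod afterward, as shown in step 3, so that dlpd restarts with the updated configuration.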
  • When running elastic distributed training on NVIDIA T4 GPUs, you must disable direct GPU-to-GPU communication for training jobs. To run elastic distributed training jobs on NVIDIA T4 using dlicmd, include --msd-env NCCL_P2P_DISABLE=1 and set workerDeviceNum to greater than 1, as shown in the example below. See: dlicmd.py reference
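    For example, a training submission on NVIDIA T4 GPUs might look like the following sketch. The framework value, console host, and remaining model options are placeholders, and the exact set of dlicmd.py options depends on your model and framework; see the dlicmd.py reference for the full syntax:
      python dlicmd.py --exec-start <framework> --rest-host <console-url> --rest-port -1 \
        --workerDeviceNum 2 --msd-env NCCL_P2P_DISABLE=1 <other model and data options>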
  • In 2.3.4, on Power, when enable_onnx is set to True in FabricModel for an elastic distributed TensorFlow model, the following error can occur during conversion of the trained model to an ONNX model file:
    Check failed: ret == 0 (11 vs. 0)Thread creation via pthread_create() failed.
    To resolve this issue, increase the PIDs limit to greater than 1024:
    1. Check the pidsLimit on the worker node:
      oc debug node/worker
      Creating debug namespace/openshift-debug-node-tmjz4 ...
      Starting pod/worker-debug ...
      To use host binaries, run `chroot /host`
      Pod IP: xxx.xxx.xxx.xxx
      If you don't see a command prompt, try pressing enter.
      sh-4.4# chroot /host
      sh-4.4# crio-status config | grep pid
          pids_limit = 1024
    2. Create ContainerRuntimeConfig to enlarge pidsLimit:
      cat << EOF | oc apply -f -
      apiVersion: machineconfiguration.openshift.io/v1
      kind: ContainerRuntimeConfig
      metadata:
        name: pids-limit
      spec:
        machineConfigPoolSelector:
          matchLabels:
            pools.operator.machineconfiguration.openshift.io/worker: ''
        containerRuntimeConfig:
          pidsLimit: 2048
      EOF
    3. Verify the ContainerRuntimeConfig for pidsLimit after the worker node restarts (a rollout check is sketched after these steps):
      oc debug node/worker.pbm.host.com
      Creating debug namespace/openshift-debug-node-rw9gz ...
      Starting pod/workerpbmhostcom-debug ...
      To use host binaries, run `chroot /host`
      Pod IP: xxx.xxx.xxx.xxx
      If you don't see a command prompt, try pressing enter.
      sh-4.4# chroot /host
      sh-4.4# crio-status config | grep pid
          pids_limit = 2048
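    The ContainerRuntimeConfig change is rolled out by the worker machine config pool, which reboots the worker nodes. A minimal sketch for watching the rollout and confirming that the configuration was created, assuming standard OpenShift commands:
      oc get mcp worker -w
      oc get containerruntimeconfig pids-limit -o yaml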
  • In 2.3.2 and 2.3.3, a limitation exists when enable_onnx is set to True in FabricModel for an elastic distributed TensorFlow model: conversion of the trained model to an ONNX model file sometimes fails with an out-of-memory error. This issue is fixed in Refresh 4.
  • GPU packing fails for some users. GPU packing does not accept submissions from users whose usernames are in an email format (for example, jsmith@ibm.com). If you are logged in with a username that is an email address, GPU packing fails with the following error message:
    Error 500: Invalid file: File path (wmla-gpu-packing-jsmith%ibm.com-15) contains encoded characters: %40
  • In IBM Watson Machine Learning Accelerator 2.3.0, a limitation exists where preemption in elastic distributed training does not occur if a namespace GPU quota is set on the namespace where IBM Watson Machine Learning Accelerator is installed.
  • When running deep learning training with the elastic distributed training engine, make sure that the pod CPU quota defined is greater than the total number of available GPUs in the cluster.
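    For example, you can compare the CPU quota in the service namespace with the total GPU capacity of the cluster. This is a sketch that assumes the NVIDIA GPU resource name nvidia.com/gpu and that the service is installed in the wmla namespace:
      oc get resourcequota -n wmla
      oc get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"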
  • In the Watson Machine Learning Accelerator console, a timestamp error occurs on the Resource Usage page:
    "{\"status\":\"error\",\"errorType\":\"bad_data\",\"error\":\"end timestamp must not be before start time\"}\n"
    The timestamp error occurs when inconsistent timezones are used. Set the timezone to UTC to resolve this issue.
  • Training jobs cannot run because the conda environment is not ready after installing Watson Machine Learning Accelerator.
    Verify that your environment is ready to run deep learning training. Issue the following command to verify that your conda environments are synced and running:
    python dlicmd.py --status --rest-host wmla-console.ibm.com --rest-port -1 --debug-level info
    where wmla-console.ibm.com is the console URL.
    Verify that condaSynced is set to true. If it is true, you can start using Watson Machine Learning Accelerator and submit training workloads; if it is not yet true, you can poll for it as sketched after the sample output.
    {
      "wmlaCRId": "b3cbb02b-7d1e-4cda-bb6b-c49a5396ce67",
      "condaSynced": "true",
      "cloudpakInstanceId": "702748fb-5385-4bf2-83b6-1993e1745735",
      "condaSyncedTime": "",
      "addOnId": "wml-accelerator",
      "serviceInstanceId": "1602343544697283",
      "buildDate": "2020-10-10T12:58:55Z"
    }
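    If condaSynced is not yet true, you can poll until it is. This is a hypothetical helper loop that assumes the output format shown above:
    until python dlicmd.py --status --rest-host wmla-console.ibm.com --rest-port -1 | grep -q '"condaSynced": "true"'; do sleep 30; done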

Notebook server issues

  • Jupyter notebook interface stops responding. If you try to open the notebook server from the console, the page cannot be opened, remains blank, or the following message appears:
    Server unavailable or unreachable
    Your server at /user/admin/ is not running. Would you like to restart it?

    To resolve this issue, stop the server and ensure that the jupyter-admin pod was terminated. Restart the server and reopen the console. Any previously started notebook kernels will be recovered.
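    To confirm that the jupyter-admin pod is gone before restarting the server, you can list the Jupyter pods in the service namespace. This is a sketch that assumes the service is installed in the wmla namespace and that the pod name contains jupyter:
      oc get pods -n wmla | grep jupyter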

  • If multiple users start a Notebook server from the IBM Watson Machine Learning Accelerator console, the Notebook server starts successfully for the first user, but subsequent users cannot start the Notebook server and do not have a valid server token. The following error appears in the JupyterHub logs:
    [AutoLogin] ------------ Start AutoLogin ------------
    [AutoLogin] Error: the argument token is null!
    To resolve this issue, close the browser and start the Notebook server again.
  • When a notebook server has pending kernels, new kernels cannot be started, and existing running kernels can no longer be executed (issue 784).

    To resolve this issue, kill the pending kernel application from the Watson Machine Learning Accelerator console. Navigate to Monitoring > Applications, find the pending notebook, and from the menu click Stop.

  • When running JupyterLab, the code blocks in the notebook cannot be executed.
    If the code blocks cannot be executed, this can be caused by a restart of the enterprise gateway app, which is indicated by log messages like the following:
    [EnterpriseGatewayApp] Getting the job. Job id: admin-1602072579264, State: KILLED
    [msd:144] Polling - job state:KILLED, pod state:None
    [EnterpriseGatewayApp] MSDProcessProxy.kill, msd job ID: admin-1602072579264

    To resolve this issue, try stopping and starting the notebook server.
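    To check whether the enterprise gateway was restarted, you can search the notebook server pod logs for the messages shown above. This is a sketch that assumes the service is installed in the wmla namespace and that the pod name contains jupyter; your pod and container names may differ:
      oc get pods -n wmla | grep jupyter
      oc logs <jupyter-pod-name> -n wmla | grep -i EnterpriseGatewayApp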