Known issues and limitations
Known issues and limitations in IBM Watson® Machine Learning Accelerator.
- In 2.3.5 and later, if you are
upgrading or downgrading the IBM Watson Machine Learning Accelerator service,
the Grafana admin password is unchanged.
Before you upgrade or downgrade, make sure to remember the current admin password. After the upgrade or downgrade is completed, you can continue using the same password.
- If you directly
installed 2.3.5 or you upgraded or downgraded to 2.3.5, the elastic distributed inference API page
is unable to render. The elastic distributed inference API page available from the service console
fails to open with the following error:
```
Errors
Parser error on line 37
Unable to render this definition
The provided definition does not specify a valid version field. Please indicate a valid Swagger or OpenAPI version field. Supported version fields are swagger: "2.0" and those that match openapi: 3.0.n (for example, openapi: 3.0.0)
```
To resolve this error so that the API page renders successfully, update the openapi.yaml file in the PV by completing the following steps:
- Change your project area to IBM Watson Machine Learning Accelerator:
```
oc project wmla-ns
```
- Create job.yaml:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: fix-edi-job
spec:
  template:
    metadata:
      name: fix-edi-job
    spec:
      containers:
      - command:
        - sh
        - -c
        - |
          sed -i s#\"2.3.5#\"2.3.5\"# /var/www/openapi.yaml
        image: busybox
        name: fix
        volumeMounts:
        - name: data
          mountPath: /var/www
          subPath: wml-edi/docs
      restartPolicy: Never
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: wmla-edi
```
- Apply the job:
```
oc apply -f job.yaml
```
- Ensure that the job completed successfully (see also the sketch after these steps):
```
oc get pods -w
```
- Delete the job:
```
oc delete -f job.yaml
```
- Clear your web browser cache, and open the API page again. Note: If imd is restarted, you may need to complete these steps again.
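If you want to confirm that the patch ran cleanly before deleting the job, you can inspect the job object directly. A minimal sketch using standard oc commands; the job name fix-edi-job comes from the job.yaml above:
```sh
# Completion count should report 1 once the job has finished successfully.
oc get job fix-edi-job -o jsonpath='{.status.succeeded}'

# Show output from the job's pod (the sed command prints nothing on success).
oc logs job/fix-edi-job
```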
- In 2.3.3, 2.3.4, and 2.3.5, after
upgrading to the latest version of IBM Watson Machine Learning Accelerator, the
following error can occur when running elastic distributed
training:
```
ModuleNotFoundError: No module named 'fabric_model'
```
To resolve this issue, update the dlpd configuration file and restart dlpd:
- Get the wmla-dlpd pod name in the IBM Watson Machine Learning Accelerator namespace (here the namespace name is wmla):
```
oc get pods -n wmla | grep dlpd
wmla-dlpd-76777d9674-nxp6c   2/2   Running   0   47m
```
- Get the current value of FABRIC_HOME in the pod:
```
oc exec -it wmla-dlpd-76777d9674-nxp6c -c dlpd -- cat /var/shareDir/dli/conf/dlpd/dlpd.conf | grep FABRIC_HOME
"FABRIC_HOME": "/opt/ibm/spectrumcomputing/dli/2.3.5/fabric",
```
If the version (2.3.5 in this example) is not the same as the currently installed version of IBM Watson Machine Learning Accelerator, open the dlpd.conf file, update the version to the currently installed version, and save your change:
```
oc exec -it wmla-dlpd-76777d9674-nxp6c -c dlpd -- vi /var/shareDir/dli/conf/dlpd/dlpd.conf
```
- Verify that the file was updated, then delete the wmla-dlpd pod to trigger a restart (a non-interactive alternative to editing with vi is sketched after these steps):
```
oc delete pod wmla-dlpd-76777d9674-nxp6c
```
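If you prefer a non-interactive edit over vi, a minimal sketch using the same pod name as above; 2.3.5 is the stale version from this example and <installed-version> is a placeholder for your currently installed version:
```sh
# Rewrite the version embedded in FABRIC_HOME, then confirm the change.
oc exec wmla-dlpd-76777d9674-nxp6c -c dlpd -- \
  sed -i 's#/dli/2.3.5/#/dli/<installed-version>/#' /var/shareDir/dli/conf/dlpd/dlpd.conf
oc exec wmla-dlpd-76777d9674-nxp6c -c dlpd -- grep FABRIC_HOME /var/shareDir/dli/conf/dlpd/dlpd.conf
```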
- If running elastic distributed training using NVIDIA T4 GPUs, you must make sure to disable direct GPU-to-GPU communication when running training jobs. To run elastic training jobs on NVIDIA T4 using dlicmd, include --msd-env NCCL_P2P_DISABLE=1 and set workerDeviceNum to a value greater than 1 (a sketch follows this item). See: dlicmd.py reference
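For illustration, a hedged sketch of such a submission; the framework name and model arguments are placeholders, and only the flags shown elsewhere in this document are assumed to exist:
```sh
# Hypothetical elastic distributed training submission on T4 GPUs:
# NCCL peer-to-peer transfers are disabled, and more than one device
# per worker is requested.
python dlicmd.py --exec-start <framework> --rest-host wmla-console.ibm.com --rest-port -1 \
  --workerDeviceNum 2 \
  --msd-env NCCL_P2P_DISABLE=1 \
  <model arguments>
```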
- In 2.3.4, on Power, when setting enable_onnx to True in FabricModel for an elastic distributed TensorFlow model, the following error can occur during conversion from a trained model to an ONNX model file:
```
Check failed: ret == 0 (11 vs. 0)
Thread creation via pthread_create() failed.
```
To resolve this issue, increase the PIDs limit to a value greater than 1024:
- Check the pidsLimit on the worker node:
```
oc debug node/worker
Creating debug namespace/openshift-debug-node-tmjz4 ...
Starting pod/worker-debug ...
To use host binaries, run `chroot /host`
Pod IP: xxx.xxx.xxx.xxx
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# crio-status config | grep pid
pids_limit = 1024
```
- Create a ContainerRuntimeConfig to enlarge pidsLimit:
```
cat << EOF | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: pids-limit
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ''
  containerRuntimeConfig:
    pidsLimit: 2048
EOF
```
- Verify the ContainerRuntimeConfig for pidsLimit after the worker node is restarted (to watch the restart, see the sketch after these steps):
```
oc debug node/worker.pbm.host.com
Creating debug namespace/openshift-debug-node-rw9gz ...
Starting pod/workerpbmhostcom-debug ...
To use host binaries, run `chroot /host`
Pod IP: xxx.xxx.xxx.xxx
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# crio-status config | grep pid
pids_limit = 2048
```
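Applying a ContainerRuntimeConfig triggers a rolling restart of the worker nodes, so the new limit appears only after the rollout completes. One way to watch for it, sketched with standard OpenShift commands; the worker pool name may differ in your cluster:
```sh
# Confirm the ContainerRuntimeConfig exists, then wait for the worker
# machine config pool to finish updating (UPDATED becomes True).
oc get containerruntimeconfig pids-limit
oc get mcp worker -w
```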
- In 2.3.2 and 2.3.3, a limitation exists when setting enable_onnx to True in FabricModel for an elastic distributed TensorFlow model: converting the trained model into an ONNX model file sometimes fails with an out-of-memory error. This issue is fixed in Refresh 4.
- GPU packing fails for some users. GPU packing does not accept submissions from users that have usernames in an email format (for example, jsmith@ibm.com). If you are logged in with a username that is an email address, GPU packing fails with the following error message:
```
Error 500: Invalid file: File path (wmla-gpu-packing-jsmith%40ibm.com-15) contains encoded characters: %40
```
- In IBM Watson Machine Learning Accelerator 2.3.0, a limitation exists where preemption in elastic distributed training does not occur if the namespace where IBM Watson Machine Learning Accelerator is installed has a namespace GPU quota set.
- When running deep learning training with the elastic distributed training engine, make sure that the pod CPU quota defined is greater than the total number of available GPUs in the cluster (a sketch follows this item).
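For example, a minimal sketch of one way to express such a quota as a namespace ResourceQuota, assuming a cluster with 8 GPUs and the wmla namespace; the resource name and both values are illustrative, not part of the documented procedure:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: wmla-cpu-quota   # hypothetical name
  namespace: wmla
spec:
  hard:
    limits.cpu: "16"     # keep greater than the cluster's total GPU count (8 here)
```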
- In the Watson Machine Learning Accelerator console, a timestamp error occurs on the Resource Usage page:
```
{"status":"error","errorType":"bad_data","error":"end timestamp must not be before start time"}
```
The timestamp error occurs when inconsistent timezones are used. Set the timezone to UTC to resolve this issue.
- Training jobs cannot run because the conda environment is not ready after installing Watson Machine Learning Accelerator. Verify that your environment is ready to run deep learning training. Issue the following command to verify that your conda environments are synced and running:
```
python dlicmd.py --status --rest-host wmla-console.ibm.com --rest-port -1 --debug-level info
```
where wmla-console.ibm.com is the console URL. Verify that condaSynced is set to true. If true, you can start using Watson Machine Learning Accelerator and submitting training workloads (a scripted version of this check is sketched after this item):
```
{
  "wmlaCRId": "b3cbb02b-7d1e-4cda-bb6b-c49a5396ce67",
  "condaSynced": "true",
  "cloudpakInstanceId": "702748fb-5385-4bf2-83b6-1993e1745735",
  "condaSyncedTime": "",
  "addOnId": "wml-accelerator",
  "serviceInstanceId": "1602343544697283",
  "buildDate": "2020-10-10T12:58:55Z"
}
```
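If you want to script this readiness check, a minimal sketch, assuming dlicmd.py is available locally and the console URL is as above; the 30-second interval is arbitrary:
```sh
# Poll the service status until condaSynced is reported as true.
until python dlicmd.py --status --rest-host wmla-console.ibm.com --rest-port -1 \
    --debug-level info | grep -q '"condaSynced": "true"'; do
  sleep 30
done
echo "conda environments are synced"
```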
Notebook server issues
- The Jupyter notebook interface stops responding. If you try to open the notebook server from the console, the page cannot be opened, remains blank, or the following message appears:
```
Server unavailable or unreachable
Your server at /user/admin/ is not running. Would you like to restart it?
```
To resolve this issue, stop the server and ensure that the jupyter-admin pod was terminated. Restart the server and reopen the console. Any previously started notebook kernels will be recovered.
- If multiple users start a Notebook server from the IBM Watson Machine Learning Accelerator console, the Notebook server is started successfully for the first user, but subsequent users cannot start the Notebook server and do not have a valid server token. The following error appears in the Jupyter Hub logs:
```
[AutoLogin] ------------ Start AutoLogin ------------
[AutoLogin] Error: the argument token is null!
```
To resolve this issue, close the browser and start the Notebook server again.
To resolve this issue, close the browser and start the Notebook server again.[AutoLogin] ------------ Start AutoLogin ------------ [AutoLogin] Error: the argument token is null! - When a notebook server has pending kernels, new kernels cannot be started, and
existing running kernels can no longer be executed (issue
784).
To resolve this issue, kill the pending kernel application from the Watson Machine Learning Accelerator console. Navigate to , find the pending notebook, and from the menu click Stop.
- When running JupyterLab, the code blocks in the notebook cannot be executed. This can be caused by a restart of the enterprise gateway app, which logs messages like the following:
```
[EnterpriseGatewayApp] Getting the job. Job id: admin-1602072579264, State: KILLED
[msd:144] Polling - job state:KILLED, pod state:None
[EnterpriseGatewayApp] MSDProcessProxy.kill, msd job ID: admin-1602072579264
```
To resolve this issue, try stopping and starting the notebook server.