Limitations and known issues

These limitations and known problems exist in IBM Spectrum Conductor Deep Learning Impact 1.2.

  • If using elastic distributed training for natural language processing, training can fail to start when the SparkContext driver does not initialize within the timeout. For example:
    INFO EGOClusterDriverWrapper: Waiting for spark context initialization ... 9
    ERROR EGOClusterDriverWrapper: SparkContext did not initialize after waiting for 100000 ms. Please check earlier log output for errors. Failing the application.
    EGOClusterDriverWrapper: Final app status: 1, exitCode: 63, (reason: Timed out waiting for SparkContext.)
    To resolve this issue, update the spark-env.sh file found in the DLI_SHARED_FS/conf directory. Add the following line to the end of this file:
    export SPARK_EGO_CLIENT_CONTEXT_WAITTRIES=1000
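    The update can also be scripted. This sketch appends the line and verifies it; DLI_SHARED_FS defaults to a demo path here, so point it at your actual shared filesystem:

    ```shell
    # Append the wait-tries setting to the end of spark-env.sh.
    # DLI_SHARED_FS defaults to a demo path here; set it to your shared filesystem.
    DLI_SHARED_FS=${DLI_SHARED_FS:-/tmp/dli_demo}
    mkdir -p "$DLI_SHARED_FS/conf"
    echo 'export SPARK_EGO_CLIENT_CONTEXT_WAITTRIES=1000' >> "$DLI_SHARED_FS/conf/spark-env.sh"
    # Confirm the setting is now the last line of the file:
    tail -n 1 "$DLI_SHARED_FS/conf/spark-env.sh"
    ```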
  • When installing deep learning frameworks to use as plugins, TensorFlow and PyTorch cannot be installed together in the same Anaconda environment. Make sure to install PyTorch in its own Anaconda environment.
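    For example, each framework can go into its own environment. The environment names below are illustrative, and the channel and version details may differ for your installation:

    ```shell
    # Create separate Anaconda environments for TensorFlow and PyTorch
    # (environment names and the pytorch channel are examples, not mandated names).
    conda create -y -n dli-tensorflow python=3.6 tensorflow
    conda create -y -n dli-pytorch -c pytorch python=3.6 pytorch
    ```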
  • Framework plugins do not support a Spark instance group where SPARK_EGO_APP_SCHEDULE_POLICY is configured for fairshare. Plugins support only Spark instance groups that are configured with fifo. To learn more about configuring Spark instance groups, see Configuring after installation.
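    One way to confirm the policy from the command line, assuming it is set in the instance group's spark-env.sh (the conf path and the demo value below are placeholders):

    ```shell
    # Check which scheduling policy a Spark instance group uses.
    # SIG_CONF stands in for the instance group's conf directory (assumption).
    SIG_CONF=${SIG_CONF:-/tmp/sig_demo/conf}
    mkdir -p "$SIG_CONF"
    # Demo value; in a real cluster the file already exists with the real setting.
    echo 'export SPARK_EGO_APP_SCHEDULE_POLICY=fifo' >> "$SIG_CONF/spark-env.sh"
    grep SPARK_EGO_APP_SCHEDULE_POLICY "$SIG_CONF/spark-env.sh"
    ```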
  • In the cluster management console, the Drivers and Executors link on the Deep Learning pages does not load unless the Spark instance group is running. If it is not running, start the Spark instance group and try again.
  • A deep learning training job fails or runs with errors after a task is killed by an executor. For example:
    INFO EGOExecutorBackend: Got kill task 1 with concurrent number(0)
    INFO Executor: Executor is trying to kill task 1.0 in stage 0.0 (TID 1)
    INFO Executor: [task] [killing] taskName:task 1.0 in stage 0.0 taskId:1 finishTime:1512136348229 state:KILLED
    To resolve this issue, disable resource reclaiming in the Spark instance group consumer.
    1. Navigate to Resources > Consumers.
    2. Select the Spark executor consumer for the IBM Spectrum Conductor Deep Learning Impact Spark instance group.
    3. Click the Consumer Properties tab, and complete the following:
      1. Deselect the Rebalance when resource plan changes or time interval changes option.
      2. Set Reclaim grace period to the same value as the SPARK_EGO_RECLAIM_GRACE_PERIOD environment variable in the Spark instance group's Spark configuration.
        Note: To see the current value set for SPARK_EGO_RECLAIM_GRACE_PERIOD:
        1. Go to Workload > Spark > Spark Instance Groups.
        2. Click on the IBM Spectrum Conductor Deep Learning Impact Spark instance group.
        3. Select Manage > Configure.
        4. Click on Spark configuration and search for SPARK_EGO_RECLAIM_GRACE_PERIOD.
      Note: Make these changes on all related Spark executor consumers, including parent and child consumers.
    4. Click Apply to save the changes.
    5. Restart the Spark instance group.
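    The grace-period lookup in the note above can also be done from the command line. This sketch assumes the value lives in the instance group's spark-env.sh; the conf path and the demo value of 600 are placeholders:

    ```shell
    # Extract the current SPARK_EGO_RECLAIM_GRACE_PERIOD so the consumer's
    # Reclaim grace period can be set to the same value.
    # SIG_CONF and the value written below are assumptions for illustration.
    SIG_CONF=${SIG_CONF:-/tmp/sig_demo2/conf}
    mkdir -p "$SIG_CONF"
    echo 'export SPARK_EGO_RECLAIM_GRACE_PERIOD=600' >> "$SIG_CONF/spark-env.sh"
    GRACE=$(sed -n 's/^export SPARK_EGO_RECLAIM_GRACE_PERIOD=//p' "$SIG_CONF/spark-env.sh")
    echo "Reclaim grace period: $GRACE"
    ```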