Configuring after installation

After installing IBM Spectrum Conductor with Spark and IBM Spectrum Conductor Deep Learning Impact, start the cluster management console for the first time and configure IBM Spectrum Conductor Deep Learning Impact.

Before you begin

Ensure that you have successfully installed IBM Spectrum Conductor Deep Learning Impact. See Installing on the master host.

Procedure

  1. Locate and start the cluster management console. See Locating the cluster management console.
  2. Verify that IBM Spectrum Conductor Deep Learning Impact was installed successfully by selecting the Workload menu and navigating to the Spark > Deep Learning option.
    If this option is available, IBM Spectrum Conductor Deep Learning Impact was installed successfully. If the Deep Learning option is unavailable, troubleshoot the installation by using the cws_dl_install.log file in the $EGO_TOP/dli/logs directory. Additionally, see the log files in the $EGO_TOP/dli/dlpd/logs directory.
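    For example, you can scan the installation log from a shell on the master host. This is a minimal sketch; it assumes that EGO_TOP is set in your environment (for example, by sourcing the cluster profile):
      # Scan the deep learning installation log for errors or failures
      grep -iE 'error|fail' "$EGO_TOP/dli/logs/cws_dl_install.log"
      # Review the most recent entries in the dlpd logs
      tail -n 50 "$EGO_TOP"/dli/dlpd/logs/*.log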
  3. Ensure that GPUs are enabled for deep learning workloads. If GPUs are not enabled, enable them now. See Enabling GPUs.
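    To confirm that GPUs are visible to the driver before you enable them for the cluster, you can run a quick check on each GPU host (a sketch; it assumes the NVIDIA driver and the nvidia-smi utility are installed):
      # List the GPUs that the driver detects on this host
      nvidia-smi -L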
  4. Create a resource group for GPU executors where the advanced formula is set to ngpus. See Using resource groups with GPU hosts.
  5. If needed, create a resource group for CPU executors. The resource group for CPU executors must contain all of the hosts that are in the GPU executors resource group.
  6. Create a Spark instance group for IBM Spectrum Conductor Deep Learning Impact using the dli-sig-template template.
    1. Select the Workload tab and click Spark > Spark Instance Groups.
    2. In the Instance Group List tab, click New.
    3. Click the Templates button to load the dli-sig-template template.
      Attention: The dli-sig-template template creates a Spark instance group that is used for running distributed deep learning workloads. To use distributed deep learning with auto-scaling, use this template to create a second Spark instance group dedicated to running auto-scaling deep learning workloads.
    4. Click Use to select and use the dli-sig-template template.
    5. Provide a name for the Spark instance group.
    6. Provide a directory for the Spark deployment. The egoadmin user must have read, write, and execute permissions on the specified directory and its parent directory. If you use a different user, the cluster administrator must have the privileges of the user group that the user belongs to, and the user's umask must be set to 002.
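      For example, you might prepare the deployment directory ahead of time as root (a minimal sketch; /opt/sig-deploy is a hypothetical path, substitute your own):
        # Create the deployment directory and make egoadmin its owner
        mkdir -p /opt/sig-deploy
        chown egoadmin:egoadmin /opt/sig-deploy
        chmod 775 /opt/sig-deploy
        # Confirm that egoadmin has read, write, and execute on the directory and its parent
        ls -ld /opt /opt/sig-deploy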
    7. Leave the execution user set to egoadmin.
    8. Provide a Spark version.
      Important: If you change the version from Spark 1.6.1, the default configurations in the dli-sig-template template are lost, and you must ensure that all of the Spark configurations are set as follows. By default, the template is configured for distributed training without auto-scaling. To use auto-scaling, make sure to set the auto-scaling environment variables as specified.
      Ensure that the following Spark configurations are selected:
      • SPARK_EGO_EXECUTOR_SLOTS_MAX must be set to 1.
      • SPARK_EGO_EXECUTOR_SLOTS_RESERVE must be set to 1.
      • SPARK_EGO_GPU_EXECUTOR_SLOTS_MAX must be set to 1.
      • SPARK_EGO_GPU_EXECUTOR_SLOTS_RESERVE must be set to 1.
      • SPARK_EGO_EXECUTOR_IDLE_TIMEOUT must be set to 6000.
      • SPARK_EGO_CONF_DIR_EXTRA must be set to ${DLI_SHARED_FS}/conf. For example, if DLI_SHARED_FS is set to /gpfs/dlfs1 for the deep learning module installation, then set SPARK_EGO_CONF_DIR_EXTRA to /gpfs/dlfs1/conf.
      To support distributed training without auto-scaling, set the following:
      • SPARK_EGO_AUTOSCALE_GPU_SLOTS_PER_TASK must be set to 1,2,4.
      • SPARK_EGO_APP_SCHEDULE_POLICY must be set to fifo.
      • SPARK_EGO_ENABLE_PREEMPTION must be set to false.
      • SPARK_EGO_SLOTS_REQUIRED_TIMEOUT must be adjusted to a smaller value to ensure that jobs time out in a reasonable amount of time. If this value is too large, jobs competing for resources can remain stuck waiting for too long before they are abruptly stopped by the executor.
      Otherwise, to support distributed training with auto-scaling, set the following:
      • SPARK_EGO_AUTOSCALE_GPU_SLOTS_PER_TASK must be set to 1.
      • SPARK_EGO_APP_SCHEDULE_POLICY must be set to fairshare.
        Important: Distributed training with auto-scaling works with both fifo and fairshare; however, fairshare is preferred. When using fairshare, make sure that you:
        • Do not disable reclaim for the executor consumers and do not set SPARK_EGO_RECLAIM_GRACE_PERIOD. Use the default IBM Spectrum Conductor with Spark reclaim settings for the consumer.
        • Do not change the SPARK_EGO_SLOTS_REQUIRED_TIMEOUT value for a Spark instance group with fairshare.
      • SPARK_EGO_ENABLE_PREEMPTION must be set to true.
      Important:

      If you are using Caffe, you must set the JAVA_HOME variable to your OpenJDK path. This path must be the same on all hosts. This environment variable is not included in the default dli-sig-template template. For example, on RHEL 7 this path might be /usr/lib/jvm/java-1.8.0-openjdk/.
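      To find the OpenJDK path on a host, you can resolve the java binary (a sketch; it assumes OpenJDK is installed and on the PATH):
        # Resolve the actual OpenJDK installation path behind the java command
        readlink -f "$(which java)"
        # JAVA_HOME is the path up to, but not including, /bin/java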

      For information on additional Spark parameters, see Creating a Spark instance group to use GPUs.
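      As a quick reference, the settings above amount to the following values in the Spark instance group's Spark configuration (an illustrative summary only; set these through the cluster management console, not in a file):
        # Common settings (both modes)
        SPARK_EGO_EXECUTOR_SLOTS_MAX=1
        SPARK_EGO_EXECUTOR_SLOTS_RESERVE=1
        SPARK_EGO_GPU_EXECUTOR_SLOTS_MAX=1
        SPARK_EGO_GPU_EXECUTOR_SLOTS_RESERVE=1
        SPARK_EGO_EXECUTOR_IDLE_TIMEOUT=6000
        SPARK_EGO_CONF_DIR_EXTRA=${DLI_SHARED_FS}/conf
        # Distributed training without auto-scaling
        SPARK_EGO_AUTOSCALE_GPU_SLOTS_PER_TASK=1,2,4
        SPARK_EGO_APP_SCHEDULE_POLICY=fifo
        SPARK_EGO_ENABLE_PREEMPTION=false
        # Distributed training with auto-scaling
        SPARK_EGO_AUTOSCALE_GPU_SLOTS_PER_TASK=1
        SPARK_EGO_APP_SCHEDULE_POLICY=fairshare
        SPARK_EGO_ENABLE_PREEMPTION=true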

    9. Under Resource Groups and Plans, enable GPU slot allocation and specify the resource group from which resources are allocated to executors in the Spark instance group.

      Make sure that the CPU executors resource group contains all of the CPU and GPU executor hosts. Otherwise, GPU slots are used for the shuffle service.

      • Select a CPU resource group for use by Spark executors (CPU slots).
      • Select the previously created GPU resource group from the Spark executors (GPU slots) drop-down list. Ensure that you do not select this resource group for use by Spark drivers.
    10. Create the Spark instance group by clicking Create and Deploy Instance Group.
  7. Edit the consumer properties of the Spark instance group.
    1. Navigate to Resources > Consumers.
    2. Select the <Spark-instance-group-name>-spark consumer.
      1. Under the Consumer Properties tab, deselect the Rebalance when resource plan changes or time interval changes option.
      2. Set Reclaim grace period to the same value as the SPARK_EGO_RECLAIM_GRACE_PERIOD environment variable in the Spark instance group configuration.
        Note: To see the current value set for SPARK_EGO_RECLAIM_GRACE_PERIOD:
        1. Go to Workload > Spark > Spark Instance Groups.
        2. Click the IBM Spectrum Conductor Deep Learning Impact Spark instance group.
        3. Select Manage > Configure.
        4. Click Spark configuration and search for SPARK_EGO_RECLAIM_GRACE_PERIOD.
    3. Select the <Spark-instance-group-name>-sparkexecutor consumer.
      1. Under the Consumer Properties tab, deselect the Rebalance when resource plan changes or time interval changes option.
      2. Set Reclaim grace period to the same value as the SPARK_EGO_RECLAIM_GRACE_PERIOD environment variable in the Spark instance group configuration.
    4. Click Apply.
  8. Start the Spark instance group for IBM Spectrum Conductor Deep Learning Impact.
    1. Navigate to the Workload tab and select Spark > Spark Instance Groups.
    2. Select the Spark instance group and click Start.

Results

IBM Spectrum Conductor Deep Learning Impact is configured successfully and is ready to use. To verify that IBM Spectrum Conductor Deep Learning Impact is configured correctly, do the following:
  1. Verify that the cluster management console shows no issues. If you see any issues, check the log files on the management host in the $EGO_TOP/gui/logs directory.
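    For example, a quick scan for recent errors (a sketch; it assumes EGO_TOP is set in your shell):
      # Look for recent errors in the cluster management console logs
      grep -ri 'error' "$EGO_TOP/gui/logs" | tail -n 20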
  2. Check that the deep learning services have started. Run the egosh service list -l command and verify that all services are in the STARTED state.
    If any services are not started, refer to the corresponding log files on the management host for more information (see the sketch after this list):
    • $EGO_TOP/dli/dlpd/logs/dlpd.log
    • $EGO_TOP/dli/dlpd/dlrest/logs/messages.log
    • $EGO_TOP/dli/dlmao/logs/start_dlmao_service.sh.log.monitor
    • $EGO_TOP/dli/dlmao/logs/start_dlmao_service.sh.log.optimizer
    • $EGO_TOP/dli/dlmao/logs/monitor.log.hostname
    • $EGO_TOP/dli/dlmao/logs/optimizer.log.hostname
    • $EGO_TOP/dli/mongodb/logs/mongod.log
    • $EGO_TOP/integration/elk/log/shipper-err.log.hostname
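    For example, a minimal sketch of this check (it assumes EGO_TOP is set in your shell):
      # List all services; any line without STARTED needs attention
      egosh service list -l | grep -vi started
      # Review the most recent entries in the main deep learning daemon log
      tail -n 50 "$EGO_TOP/dli/dlpd/logs/dlpd.log"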
  3. For additional issues, refer to IBM Spectrum Conductor with Spark troubleshooting information.