After installing IBM Spectrum Conductor with Spark
and IBM Spectrum Conductor Deep Learning Impact, start the cluster management console for the first time and configure IBM Spectrum Conductor Deep Learning Impact.
Procedure
-
Locate and start the cluster management console.
See Locating the cluster management console.
-
Verify that IBM Spectrum Conductor Deep Learning Impact was
installed successfully by selecting the Workload menu and checking for the Deep Learning option.
If the Deep Learning option is available, IBM Spectrum Conductor Deep Learning Impact was installed successfully.
If the option is unavailable, troubleshoot the IBM Spectrum Conductor Deep Learning Impact installation using the
cws_dl_install.log file in the $EGO_TOP/dli/logs
directory. Additionally, see the log files in the $EGO_TOP/dli/dlpd/logs
directory.
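If troubleshooting is needed, a short shell loop can surface the tail of these logs in one pass. This is only a sketch: the EGO_TOP default below is an assumed example path, not taken from the product, so substitute your own installation top.

```shell
# Show the tail of the Deep Learning Impact install and dlpd logs.
# The EGO_TOP default is an example only; set it to your installation top.
EGO_TOP="${EGO_TOP:-/opt/ibm/spectrumcomputing}"

for log in "$EGO_TOP/dli/logs/cws_dl_install.log" "$EGO_TOP"/dli/dlpd/logs/*.log; do
  if [ -f "$log" ]; then
    echo "==> $log <=="
    tail -n 20 "$log"    # last 20 lines of each log that exists
  fi
done
```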
-
Ensure that GPUs are enabled for deep learning workloads. If GPUs are not enabled, enable
them now. See Enabling GPUs.
-
Create a resource group for GPU executors where the advanced formula is set to
ngpus. See Using resource groups with GPU hosts.
-
If needed, create a resource group for CPU executors. The resource group for CPU
executors must contain all of the hosts that are in the GPU executors resource group.
-
Create a Spark instance group for IBM Spectrum Conductor Deep Learning Impact using the
dli-sig-template template.
-
Select the Workload tab and click .
-
In the Instance Group List tab, click New.
-
Click the Templates button to load the
dli-sig-template template.
Attention: The dli-sig-template template creates a
Spark instance group that is used for running distributed deep learning workloads. To use
distributed deep learning with auto-scaling, use this template to create a second Spark instance
group dedicated to running auto-scaling deep learning workloads.
-
Click Use to select and use the dli-sig-template
template.
-
Provide a name for the Spark instance group.
-
Provide a directory for the Spark deployment. The egoadmin user must have read, write,
and execute permissions to the specified directory and its parent directory. If you use a different
user, the cluster administrator must have the privileges of that user's user group,
and the user's umask must be set to 002.
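As a quick sanity check before deploying, you can confirm the umask and that the intended user can create files in the deployment directory. The directory below is a placeholder for this sketch, not a path defined by the product.

```shell
# Placeholder deployment directory for the check; use your real path.
DEPLOY_DIR="${DEPLOY_DIR:-$(mktemp -d)}"

umask 002    # required umask for a non-egoadmin execution user

# Confirm the user can create and remove a file in the directory.
touch "$DEPLOY_DIR/.probe" && rm "$DEPLOY_DIR/.probe" \
  && echo "write access OK: $DEPLOY_DIR"
```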
-
Leave the execution user set to egoadmin.
-
Provide a Spark version.
Important: If you change the version from Spark 1.6.1, the
default configurations in the dli-sig-template template are lost, and you must
set all of the Spark configurations as follows. By default, the template is configured
for distributed training without auto-scaling. To use auto-scaling, make sure to set the
auto-scaling environment variables as specified.
Ensure that the following Spark configurations are selected:
- SPARK_EGO_EXECUTOR_SLOTS_MAX must be set to 1.
- SPARK_EGO_EXECUTOR_SLOTS_RESERVE must be set to 1.
- SPARK_EGO_GPU_EXECUTOR_SLOTS_MAX must be set to 1.
- SPARK_EGO_GPU_EXECUTOR_SLOTS_RESERVE must be set to 1.
- SPARK_EGO_EXECUTOR_IDLE_TIMEOUT must be set to 6000.
- SPARK_EGO_CONF_DIR_EXTRA must be set to ${DLI_SHARED_FS}/conf. For example, if DLI_SHARED_FS is set to /gpfs/dlfs1 for the deep learning module installation, then set SPARK_EGO_CONF_DIR_EXTRA to /gpfs/dlfs1/conf.
To support distributed training, set the following:
- SPARK_EGO_AUTOSCALE_GPU_SLOTS_PER_TASK must be set to 1,2,4.
- SPARK_EGO_APP_SCHEDULE_POLICY must be set to fifo.
- SPARK_EGO_ENABLE_PREEMPTION must be set to false.
- SPARK_EGO_SLOTS_REQUIRED_TIMEOUT must be adjusted to a smaller value to ensure that jobs time out in a reasonable amount of time. If this value is too large, jobs competing for resources can be stuck waiting too long before they are abruptly stopped by the executor.
Otherwise, to support distributed training with auto-scaling, set the following:
- SPARK_EGO_AUTOSCALE_GPU_SLOTS_PER_TASK must be set to 1.
- SPARK_EGO_APP_SCHEDULE_POLICY must be set to fairshare.
Important: Distributed training with auto-scaling can
work with both fifo and fairshare; however,
fairshare is preferred. When using fairshare, make
sure that you:
- Do not disable reclaim for the executor consumers and do not set
SPARK_EGO_RECLAIM_GRACE_PERIOD. Use the default IBM Spectrum Conductor with Spark
reclaim settings for the consumer.
- Do not change the SPARK_EGO_SLOTS_REQUIRED_TIMEOUT value for a Spark instance
group with fairshare.
- SPARK_EGO_ENABLE_PREEMPTION must be set to true.
Important:
If you are using Caffe, you must set the JAVA_HOME variable to your OpenJDK
path. This path must be the same on all hosts. This environment variable is not included in the
default dli-sig-template template. For example, on RHEL 7 this path might be
/usr/lib/jvm/java-1.8.0-openjdk/.
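A minimal per-host check that JAVA_HOME points at an existing directory can look like the following sketch. The default path below is only the RHEL 7 example from the note; adjust it for your distribution.

```shell
# Example OpenJDK path (from the RHEL 7 note); adjust for your hosts.
JAVA_HOME="${JAVA_HOME:-/usr/lib/jvm/java-1.8.0-openjdk}"

if [ -d "$JAVA_HOME" ]; then
  echo "JAVA_HOME exists: $JAVA_HOME"
else
  echo "JAVA_HOME missing on this host: $JAVA_HOME" >&2
fi
```

Remember that the same path must resolve on every host in the cluster, so run the check on each host.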
For information on additional Spark parameters, see Creating a Spark
instance group to use GPUs.
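For reference, the settings above can be summarized as shell-style assignments. This is a sketch only; in practice you set these values on the Spark configuration page of the cluster management console, and the /gpfs/dlfs1 path is just the example used earlier.

```shell
# Common settings (the dli-sig-template defaults):
SPARK_EGO_EXECUTOR_SLOTS_MAX=1
SPARK_EGO_EXECUTOR_SLOTS_RESERVE=1
SPARK_EGO_GPU_EXECUTOR_SLOTS_MAX=1
SPARK_EGO_GPU_EXECUTOR_SLOTS_RESERVE=1
SPARK_EGO_EXECUTOR_IDLE_TIMEOUT=6000
SPARK_EGO_CONF_DIR_EXTRA=/gpfs/dlfs1/conf    # ${DLI_SHARED_FS}/conf

# Distributed training without auto-scaling:
SPARK_EGO_AUTOSCALE_GPU_SLOTS_PER_TASK=1,2,4
SPARK_EGO_APP_SCHEDULE_POLICY=fifo
SPARK_EGO_ENABLE_PREEMPTION=false

# Distributed training with auto-scaling (instead of the three above):
# SPARK_EGO_AUTOSCALE_GPU_SLOTS_PER_TASK=1
# SPARK_EGO_APP_SCHEDULE_POLICY=fairshare
# SPARK_EGO_ENABLE_PREEMPTION=true
```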
-
Under Resource Groups and Plans, enable GPU slot allocation and specify
the resource group from which resources are allocated to executors in the Spark instance group.
Make sure that the CPU executors resource group contains all of the CPU and GPU
executor hosts. If you do not do this, GPU slots are used for the shuffle service.
- Select a CPU resource group for use by Spark executors (CPU slots).
- Select the previously created GPU resource group for use by Spark executors (GPU
slots). Ensure that you do not select the resource group to be used by Spark
drivers.
-
Create the Spark instance group by clicking Create and Deploy Instance
Group.
-
Edit the consumer properties of the Spark instance group.
-
Navigate to
.
-
Select the <Spark-instance-group-name>-spark consumer.
- Under the Consumer Properties tab, deselect the Rebalance when
resource plan changes or time interval changes option.
- Set Reclaim grace period to the same value as the
SPARK_EGO_RECLAIM_GRACE_PERIOD environment variable in the Spark instance
group's Spark configuration.
Note: To see the current value set for
SPARK_EGO_RECLAIM_GRACE_PERIOD:
- Go to .
- Click on the IBM Spectrum Conductor Deep Learning Impact Spark instance
group.
- Select
.
- Click on Spark configuration and search for
SPARK_EGO_RECLAIM_GRACE_PERIOD.
-
Select the <Spark-instance-group-name>-sparkexecutor consumer.
- Under the Consumer Properties tab, deselect the Rebalance when
resource plan changes or time interval changes option.
- Set Reclaim grace period to the same value as the
SPARK_EGO_RECLAIM_GRACE_PERIOD environment variable in the Spark instance
group's Spark configuration.
-
Click Apply.
-
Start the Spark instance group for IBM Spectrum Conductor Deep Learning Impact.
-
Navigate to the Workload tab and select .
-
Select the Spark instance group and click Start.
Results
IBM Spectrum Conductor Deep Learning Impact is configured successfully
and is ready to use. To verify that IBM Spectrum Conductor Deep Learning Impact is configured correctly, do the following:
- Verify that the cluster management console has no issues.
If you see any issues, check the log files on the management host in the
$EGO_TOP/gui/logs directory.
- Check that the deep learning services have started: verify that all services are in
STARTED state after running the egosh service list
-l command.
If any services are not started, refer to the corresponding log files on the
management host for more information:
- $EGO_TOP/dli/dlpd/logs/dlpd.log
- $EGO_TOP/dli/dlpd/dlrest/logs/messages.log
- $EGO_TOP/dli/dlmao/logs/start_dlmao_service.sh.log.monitor
- $EGO_TOP/dli/dlmao/logs/start_dlmao_service.sh.log.optimizer
- $EGO_TOP/dli/dlmao/logs/monitor.log.hostname
- $EGO_TOP/dli/dlmao/logs/optimizer.log.hostname
- $EGO_TOP/dli/mongodb/logs/mongod.log
- $EGO_TOP/integration/elk/log/shipper-err.log.hostname
- For additional issues, refer to
IBM Spectrum Conductor with Spark troubleshooting
information.
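The service-state check above can be sketched as a small filter over the egosh output. The helper name below is made up for this example; only the `egosh service list -l` command itself comes from the procedure.

```shell
# filter_not_started: print lines (after the header) that lack STARTED.
# Hypothetical helper; on a management host with the EGO environment
# sourced, pipe `egosh service list -l` output into it.
filter_not_started() {
  awk 'NR > 1 && $0 !~ /STARTED/'
}

# Usage on the cluster (not run here):
#   egosh service list -l | filter_not_started
```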