Dynamic and Resilient, Elastic Deep Learning with IBM Spectrum Conductor Deep Learning Impact
Authors: Junfeng Liu, Yonggang Hu
Published on December 19, 2017 / Updated on January 10, 2018
The Elastic Deep Learning capabilities of IBM Spectrum Conductor Deep Learning Impact are designed for large-scale distributed deep learning workloads. They transform static, monolithic training into a dynamic process that is resilient to failures and automatically scales GPU allocation during training.
Data scientists, deep learning developers, and administrators can use Elastic Deep Learning capabilities to simplify production deployment, improve runtime efficiency, and deliver on service level agreements (SLAs).
Elastic Deep Learning
Training your deep learning models based on frameworks such as TensorFlow and Caffe takes a long time. Accelerator technologies such as GPUs help reduce training time, but as models and data sets get bigger and bigger they still require days or weeks to train. Training is also an iterative and resource intensive process – you run many rounds of training, tweak parameters, run more training, tweak parameters, etc., all towards the goal of high model accuracy.
Scaling out in a cluster and running distributed training is a typical approach to improve training time, but there are still many challenges. The current approaches to distributed training tend to be complex, inflexible, and monolithic. Requiring developers to figure out cluster and network topology, data ingest, and a parallel execution plan, and then hand-code them into the model, increases complexity and development time. Intermingling the model logic with runtime control creates portability challenges when moving from development to production and out to the cloud.
Introducing Elastic Deep Learning from IBM Spectrum Conductor and Deep Learning Impact, a technology designed to resolve many of these distributed training challenges, including:
Dynamic distributed runtime management – Replaces framework-specific data ingest and runtime configuration with a scheduler that determines the data partitions and number of batches, dynamically allocates resources, and ingests data and schedules tasks for every iteration. A high-efficiency scheduling engine ensures the throughput needed to fully utilize all allocated resources.
Automatic and dynamic model scaling (up and down) – Transparent to the model code and without disrupting deep learning jobs in progress, dynamic resource allocation adds resources when they become available, speeding up training, and removes them when they are needed by other jobs. Hyperparameter-aware scheduling ensures that resource adjustments do not affect training convergence or accuracy.
Resiliency to failures and environment changes – Graceful changes are handled easily by the system's dynamic capabilities; when a host fails or cloud resources disappear, the running tasks are automatically re-run on another host with the latest parameters and weights.
Automatic Quality of Service (QoS) – Dynamic rebalancing of resource allocations after every iteration based on policy and job priority. Existing workloads continue to execute without interruption.
Decouple model logic from runtime control – Hides the complexity of resource allocation, data ingest, configuration and runtime control from model developers, reducing the need to have knowledge of the underlying cluster or cloud resources, availability and topology.
Support native models (graph, neural net definition) in Caffe and TensorFlow – The data scientist or model developer can focus on the model itself and not deployment and production runtime configuration.
Optimized training with accuracy – By using available resources and not just a fixed number, the training jobs finish faster. By considering model constraints and infrastructure limitations (GPU memory, network bandwidth, etc.), training is optimized by scaling for both speed and convergence.
Better Return on Investment (ROI) – By fully utilizing your resources (i.e., CPUs, GPUs, network bandwidth, etc.) and making sure none of them sit idle, you get the most out of your investment. Enabling multitenancy allows multiple developers and lines of business (LOBs) to share common resources, eliminating the need for dedicated clusters.
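To illustrate the decoupling of model logic from runtime control described above, here is a minimal sketch of per-iteration data partitioning. The function name `shard_batch` is hypothetical (it is not part of the product's API); it simply shows how a scheduler could split one global batch across however many workers happen to be allocated in a given iteration, so the model code never needs to know the cluster topology:

```python
def shard_batch(batch_indices, num_workers):
    """Hypothetical sketch: split one global batch of sample indices
    across the workers the scheduler allocated for this iteration.

    Round-robin striding keeps the shards balanced regardless of how
    many workers are available, so the same model code works whether
    the job currently holds 1 GPU or 16.
    """
    return [batch_indices[i::num_workers] for i in range(num_workers)]
```

Because the shard count is a runtime argument rather than something baked into the model, the scheduler is free to change it between iterations.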
Compared to an Elastic Deep Learning job, a typical training job requires a static allocation of resources without interruption (e.g., 16 GPUs on 4 machines running for 2 weeks). This means typical jobs cannot take advantage of free resources and cannot give up allocated resources (without the entire job being killed) to accommodate higher priority workloads. Because typical deep learning frameworks use MPI-like communications, a single hardware failure or loss of rank causes the entire job to fail, which is especially painful if the job has been running for days or weeks. It is also unrealistic to assume zero changes or failures in cloud and commodity-based clusters.
Instead of using a static resource allocation strategy, Elastic Deep Learning uses resilient, dynamic fine-grained controls for the training process. The IBM Spectrum Conductor Deep Learning Impact scheduler is responsible for distributing the training tasks (iterations), coordinating synchronization among tasks and dynamically allocating resources from resource managers.
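The scheduling loop described above can be sketched in a few lines. This is a toy illustration, not product code: `get_available_gpus` and `run_iteration` are hypothetical callbacks standing in for the resource manager and the training framework, respectively.

```python
def elastic_train(num_iterations, max_gpus, get_available_gpus, run_iteration):
    """Toy sketch of scheduler-driven elastic training.

    The scheduler re-checks the GPU allocation before every iteration,
    so the job can grow or shrink between iterations without restarting.
    """
    weights = {"step": 0}            # stand-in for the model parameters
    allocation_history = []
    for _ in range(num_iterations):
        # Use whatever is free right now, capped at the job's maximum.
        n = min(max_gpus, get_available_gpus())
        # run_iteration shards the work across n workers and returns the
        # synchronized weights; a failed worker would simply be re-run
        # elsewhere before synchronization, using the latest weights.
        weights = run_iteration(weights, n)
        allocation_history.append(n)
    return weights, allocation_history
```

Because synchronization happens at iteration boundaries, adding or removing a GPU never invalidates work already completed, which is what makes failure recovery and preemption cheap.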
When a data scientist submits a training job they can specify the maximum number of GPUs that the job can use. While executing, IBM Spectrum Conductor Deep Learning Impact allocates as many GPUs as available to the job, up to the maximum number specified, or removes GPUs as other policies dictate. Regardless, the job continues to execute until complete. With Elastic Deep Learning, users can utilize QoS policies to define how the system behaves during GPU allocation and job execution. IBM Spectrum Conductor Deep Learning Impact includes the following policies that are utilized by Elastic Deep Learning:
Preemption – Enables the running task to complete its iteration and gradient synchronization before the resource is assigned to another job.
Fairshare – Sets the policy scheduler to share all available resources among all submitted jobs as equally as possible.
Priority – Allows higher priority jobs to use more resources than other jobs.
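The difference between the Fairshare and Priority policies can be illustrated with a toy allocation function. This is an assumed simplification for illustration, not the product's actual scheduling algorithm or API:

```python
def allocate_gpus(total_gpus, jobs, policy="fairshare"):
    """Toy illustration of fairshare vs. priority GPU allocation.

    jobs: list of (name, priority) tuples. Under "fairshare" every job
    gets an equal weight; under "priority" the weight is the job's
    priority, so higher-priority jobs receive proportionally more GPUs.
    """
    if policy == "fairshare":
        weights = [1] * len(jobs)
    elif policy == "priority":
        weights = [priority for _, priority in jobs]
    else:
        raise ValueError(f"unknown policy: {policy}")
    total_weight = sum(weights)
    # Weighted floor share first, then hand out any remaining GPUs
    # one at a time to the highest-weight jobs.
    shares = [total_gpus * w // total_weight for w in weights]
    leftover = total_gpus - sum(shares)
    by_weight = sorted(range(len(jobs)), key=lambda i: -weights[i])
    for i in by_weight[:leftover]:
        shares[i] += 1
    return {name: share for (name, _), share in zip(jobs, shares)}
```

With preemption layered on top, moving from one allocation to the next only requires waiting for in-flight iterations to finish, never killing a job.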
In this example, illustrated using the following screens, Job 1 is submitted at 8:09 and runs on 8 GPUs. At 8:10 Job 2 is submitted; the Fairshare policy ensures the GPUs are shared between all jobs, so 4 Job 1 tasks are pre-empted and allowed to complete, then those GPUs are allocated to Job 2. Just past 8:13 the priority of Job 2 is increased; a Job 1 task is pre-empted and completes, then its GPU is allocated to Job 2, leaving 3 GPUs allocated to Job 1 and 5 to Job 2. Just past 8:17 Job 1 runs to completion, freeing up 3 GPUs, which are then allocated to Job 2; Job 2 enjoys a faster training time to completion since it now runs on all 8 GPUs.
In the following screens, we see the completed task lists and the slope differences that indicate the change in iteration completion rate. The completed-task counts continue to rise, demonstrating dynamic scaling and speedup. Here each task represents one training iteration.
From the Insight page, users can monitor the convergence trend as shown below. We can see there was no interruption to the job: the loss curve kept going down while scaling up and down. The scheduler manages the training process with hyperparameters in consideration, which enables the convergence trend to continue during dynamic scaling.
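One common heuristic for keeping convergence stable when the number of synchronous workers changes is linear learning-rate scaling: with a fixed per-worker batch size, the global batch grows with the worker count, so the learning rate is scaled by the same factor. This is shown purely for illustration of what hyperparameter-aware scheduling must account for; the source does not state that the product uses this particular rule:

```python
def rescale_lr(base_lr, base_workers, new_workers):
    """Linear learning-rate scaling heuristic (illustrative, not
    necessarily the product's rule).

    With a fixed per-worker batch size, going from base_workers to
    new_workers multiplies the effective global batch by
    new_workers / base_workers, so the learning rate is scaled by
    the same ratio to preserve convergence behaviour.
    """
    return base_lr * new_workers / base_workers
```

A scheduler that applies an adjustment like this at each rescaling point is what keeps the loss curve smooth even as GPUs are added or removed mid-job.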
To try Elastic Deep Learning today, download IBM Spectrum Conductor Deep Learning Impact.
For more information about IBM Spectrum Conductor Deep Learning Impact and Elastic Deep Learning, go to: www.