Our IBM Research AI team has developed a novel compression algorithm that could significantly improve training times for deep learning models in large-scale AI systems. Using this technique, we show for the first time that it is possible to dramatically reduce communication overheads during training by 40-200X over existing methods. These results, which mark a significant step forward in the training of deep networks, are being presented at the 2018 AAAI Conference.
Training a deep learning model on a large dataset is a challenging and expensive task that can take anywhere from hours to weeks to complete. To tackle this problem, typically a cluster of four to 128 GPU accelerators is used to divide the overall task, reducing training time by exploiting the combined computational strengths of multiple accelerators all working on the same problem. Larger the number of GPUs, lesser is the computational time spent in each accelerator for the given training task.
However, in addition to computation, these accelerators also need to communicate critical parameters periodically to each other during training. As the number of GPUs in the system grows and the computation time (in each GPU) drops, this communication overhead starts to become an increasing fraction of the total training time.
Furthermore, GPU computational capabilities (measured in TFlops) have grown by over 10X just in the past couple of years — a trend that is expected to continue unabated for the next few years with advances in GPU architectures and the use of special-purpose training chips. This trend further reduces computational times while leaving communication times mostly untouched, leading to a huge bottleneck in communication — resulting in systems that are unable to exploit the full capabilities of these advanced accelerators.
One way to tackle this problem is to compress the communicated data using one of many popular compression techniques (e.g. Lempel-Ziv compression). However, since the time required to perform compression and decompression is significant, these compression techniques do not end up providing any real performance benefit during training even though they have the potential to reduce the volume of the communicated data. What is therefore needed is a compression technique that is computationally friendly — i.e fast — and exploits the resiliency of deep networks to lossy compression.
Our team set out to precisely to create such a compression technique, which we call AdaComp (short for Adaptive Compression), with the goal that it could be ubiquitously applied to model training across the deep learning space. As is the case with any compression technique, selecting the right subset of parameters to communicate while ignoring the remaining values in the overall set (that could be as large as 10’s or 100’s of megabytes) is a key challenge since it may involve techniques such as global sorting that are computationally expensive.
The key insight behind AdaComp is that if this selection is done by dividing the large set of parameters for each layer of the deep network into smaller “chunks” and then applying a local selection algorithm in each chunk, the algorithm can be both efficiently parallelized (and hence computationally friendly) and will not impact model convergence. In addition, the AdaComp technique exploits the simultaneous benefits of sparsity (by picking very few elements in the set) and quantization (representing the selected values with a binary (1-bit) or ternary (2-bit) representation to obtain very high compression rates (as high 200X) for a wide spectrum of deep models and datasets.
Our team also notes that while previously published, state-of-art deep learning compression algorithms worked well on fully connected networks (used in speech recognition), they resulted in significant model accuracy degradation when applied to modern deep networks used in image classification. On the other hand, AdaComp successfully demonstrated tremendous compression ratios in a broad range of DL application domains, ranging from state-of-the-art convolution neural networks (CNNs) used in image classification to recurrent neural network (RNNs) used in language modeling and fully-connected networks (DNNs) used in some speech-to-text classification. Very high compression rates were easily obtained across the board without any noticeable degradation in model accuracy.
Our team is excited about AdaComp and its potential to revolutionize deep learning training. We believe that this revolutionary compression algorithm and its derivatives will become foundational as we move towards an era of hyper efficient large-scale deep learning training compute subsystems. For more information, read our paper, “AdaComp: Adaptive Residual Gradient Compression for Data-Parallel Distributed Training” or check out Chia-Yu’s presentation at AAAI.