April 11, 2019 | Written by: Wei Zhang, Xiaodong Cui, and Brian Kingsbury
Deep learning has revolutionized how we model complex data. The key to obtaining high-performance models is having access to lots of data: millions of images, thousands of hours of speech, and billions of words of text. Despite having access to high-performance GPUs to train such models, the data-hungry nature of training algorithms has outstripped the ability of single GPU devices to train large-scale models in an acceptable period of time. As a result, various schemes for distributing the training to multiple compute nodes, or “distributed learning”, have been proposed.
In 2017, IBM Research demonstrated a distributed processing architecture that set a new world record for processing the ImageNet database. Now, we have adapted these concepts to speech modeling for automatic speech recognition (ASR) and demonstrated a 15-fold speedup in training, with efficient use of multiple GPUs and no loss in accuracy on SWITCHBOARD, a standard speech benchmark database. This work is described in our recent paper (details below) at the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
ASR systems transcribe speech into corresponding text. If you have used Apple Siri, Google Assistant, Amazon Alexa, or any PDA translator, then you have used an ASR system. At IBM, we aim to provide the best such systems to developers via cloud services like Speech to Text. Training such systems requires processing many thousands of hours of speech to achieve high accuracy, which can take weeks of elapsed time. This new breakthrough reduces weeks to days, which is critical for rapid turnaround in this age of fast-moving development and deployment.
Applying distributed learning to ASR is harder than applying it to image classification for three reasons. First, complete ASR systems require complex system engineering. Apart from the deep learning training component, ASR systems require sophisticated encoding systems to convert human voices to features understood by deep learning systems, and sophisticated decoding systems that combine acoustic and language information to convert deep learning outputs to human-readable text. Basic image classification requires none of these components. Second, the data used to build ASR systems is more complex. Take the two largest public datasets in image classification (ImageNet) and speech recognition (SWB2000) as a comparison: ImageNet has 1.3 million training samples and 1,000 evenly distributed classes, whereas SWB2000 has over 30 million training samples and 32,000 classes. These 32,000 classes are naturally very unevenly distributed. The implication is that the objective function landscape of ASR is much more complex. Third, ASR is more challenging for distributed training. For state-of-the-art deep learning models, the ASR per-sample training time is 3x shorter than that of image classification, whereas the model size is 1.5x larger. Less computation per sample combined with more data to communicate yields a very low computation/communication ratio, which makes ASR almost 5x more challenging to distribute than the image classification workload.
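The "almost 5x" figure can be reproduced with a back-of-the-envelope calculation. The sketch below is only an illustration: it assumes that communication cost scales linearly with model size and that computation cost scales with per-sample training time, and all variable names are hypothetical.

```python
# Relative figures quoted in the text, with image classification normalized to 1.0.
image_compute_time = 1.0   # per-sample training time (arbitrary units)
image_model_size = 1.0     # amount of model data to communicate (arbitrary units)

asr_compute_time = image_compute_time / 3.0   # ASR per-sample time is 3x shorter
asr_model_size = image_model_size * 1.5       # ASR model is 1.5x larger

# Assumption: communication cost is proportional to model size.
image_ratio = image_compute_time / image_model_size
asr_ratio = asr_compute_time / asr_model_size

# How many times lower the ASR computation/communication ratio is.
difficulty_factor = image_ratio / asr_ratio
print(f"ASR computation/communication ratio is {difficulty_factor:.1f}x lower")
```

Under these assumptions the two factors simply multiply (3 x 1.5 = 4.5), which is consistent with the "almost 5x" estimate.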
A centralized distributed learning architecture (left) and a decentralized distributed learning architecture (right).
In our ICASSP paper, we tackle ASR distributed training problems in two steps. First, we increase the workload parallelism by increasing batch size. Batch size, or the number of samples that are processed in parallel, is the key factor in distributed deep learning: a larger batch size allows more computing devices to work concurrently, but it can harm model accuracy. Due to the complex nature of ASR, its model accuracy is very sensitive to batch size. The common belief is that, for a state-of-the-art deep learning model, the largest batch size without accuracy degradation is 256. In our paper, we found a principled way to enlarge batch size up to 2,560 while maintaining the same level of model accuracy. This means it is possible to scale ASR training from a few GPUs to dozens of GPUs.
Second, we applied a state-of-the-art distributed deep learning technique called asynchronous decentralized parallel SGD (ADPSGD), which we previously published at the 2018 International Conference on Machine Learning (ICML’18), to ASR. A traditional distributed deep learning algorithm is either a synchronous approach or a parameter-server (PS)-based asynchronous approach. Both approaches have drawbacks. The synchronous approach suffers from the straggler problem in distributed systems, where a slower device will slow down the entire system. The PS approach tends to generate less accurate models, and its centralized architecture has a single point of failure: if the PS is slow, every computing device will be slowed down. Our ICML’18 paper is the first piece of work that theoretically and empirically proved that an asynchronous and decentralized system can guarantee model accuracy and linear speedup for any non-convex optimization problem. Thus, ADPSGD circumvents the problems presented by the synchronous approach and the PS approach.
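To make the idea of asynchronous, decentralized training more concrete, here is a minimal single-process sketch. It is not the implementation from our papers: the scalar "model", the toy quadratic loss, the ring topology, and every name (`simulate_adpsgd`, `grad`, and so on) are assumptions chosen for illustration. Asynchrony is emulated by letting one randomly chosen worker act at a time, so no worker waits at a global barrier and there is no parameter server.

```python
import random


def grad(w, sample):
    """Gradient of the toy loss 0.5 * (w - sample) ** 2 with respect to w."""
    return w - sample


def simulate_adpsgd(num_workers=8, steps=4000, lr=0.05, seed=0):
    """Toy simulation of asynchronous decentralized SGD on a ring of workers."""
    rng = random.Random(seed)
    target = 3.0  # every worker sees noisy samples centered on this value
    # Each worker keeps its own copy of the model (a single scalar weight here).
    weights = [rng.uniform(-1.0, 1.0) for _ in range(num_workers)]

    for _ in range(steps):
        # One randomly chosen worker acts; the others are not blocked.
        i = rng.randrange(num_workers)
        # 1. Compute a stochastic gradient on the worker's current local weights.
        g = grad(weights[i], target + rng.gauss(0.0, 0.5))
        # 2. Average weights with one ring neighbor (no central server involved).
        j = (i + rng.choice((-1, 1))) % num_workers
        avg = 0.5 * (weights[i] + weights[j])
        weights[i] = avg
        weights[j] = avg
        # 3. Apply the (slightly stale) gradient to the averaged local weights.
        weights[i] -= lr * g

    return weights, target


weights, target = simulate_adpsgd()
mean_w = sum(weights) / len(weights)
spread = max(weights) - min(weights)
print(f"mean weight = {mean_w:.3f}, target = {target}, spread = {spread:.3f}")
```

In this toy setting the workers' local weights should end up close to the target and close to each other, even though no worker ever synchronizes with more than one neighbor at a time. A real system would run the workers on separate GPUs and overlap gradient computation with weight averaging.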
To our delight, ADPSGD handled the ASR workload well: it shortened the job running time from about 1 week on 1 V100 GPU to 11.5 hours on a 32 V100 GPU system, while maintaining the same level of model accuracy. Turning around a training job in half a day is desirable, as it enables researchers to iterate rapidly when developing new algorithms. It also gives developers fast turnaround time when adapting existing models to their applications, especially for custom use cases where massive amounts of speech are needed to achieve the high levels of accuracy required for robustness and usability.
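As a quick sanity check, the speedup can be recomputed from the running times quoted above. This is a rough sketch: "about 1 week" is assumed to mean exactly 7 x 24 hours, and the variable names are hypothetical.

```python
# Running times from the text; the single-GPU baseline is an approximation.
single_gpu_hours = 7 * 24   # "about 1 week" on 1 V100 GPU (assumed to be 168 hours)
multi_gpu_hours = 11.5      # reported running time on 32 V100 GPUs

speedup = single_gpu_hours / multi_gpu_hours
print(f"speedup is roughly {speedup:.1f}x")
```

With that assumption the result is a little under 15x, in line with the 15-fold speedup mentioned earlier.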
System architecture of ADPSGD.
ASR is a fascinating workload due to its technical challenges. While this piece of work is our first serious attempt at accelerating ASR via distributed training, we are actively improving our systems by (1) designing algorithms that can admit larger batch sizes and (2) optimizing system performance by leveraging better hardware features. We have seen very promising preliminary results and we are confident that our system will be further improved. So please stay tuned!
Distributed Deep Learning Strategies For Automatic Speech Recognition
Wei Zhang, Xiaodong Cui, Ulrich Finkler, Brian Kingsbury, George Saon, David Kung, Michael Picheny