High-Efficiency Distributed Learning for Speech Modeling

Deep learning has revolutionized how we model complex data. The key to obtaining high-performance models is having access to lots of data: millions of images, thousands of hours of speech, and billions of words of text. Despite having access to high-performance GPUs to train such models, the data-hungry nature of training algorithms has outstripped the ability of single GPU devices to train large-scale models in an acceptable period of time. As a result, various schemes for distributing the training to multiple compute nodes, or “distributed learning”, have been proposed.

In 2017, IBM Research demonstrated a distributed processing architecture that set a new world record for processing the ImageNet database. Now, we have adapted these concepts to speech modeling for automatic speech recognition (ASR) and demonstrated a 15-fold speedup in training, with efficient use of multiple GPUs and no loss in accuracy on SWITCHBOARD, a standard speech benchmark database. This work is described in our recent paper (details below) at the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

ASR systems transcribe speech into corresponding text. If you have used Apple Siri, Google Assistant, Amazon Alexa, or any PDA translator, then you have used an ASR system. At IBM, we aim to provide the best such systems to developers via cloud services like Speech to Text. Such systems need to process many thousands of hours of speech to achieve high accuracy, and training them can take weeks of elapsed time. This new breakthrough reduces weeks to days, which is critical for rapid turnaround in this age of fast-moving development and deployment.

Applying distributed learning to ASR is harder than applying it to image classification for three reasons. First, complete ASR systems require complex system engineering. Apart from the deep learning training component, ASR systems require sophisticated encoding systems that convert human voices to features understood by deep learning systems, as well as sophisticated decoding systems that combine acoustic and language information to convert deep learning outputs to human-readable text. Basic image classification requires none of these components. Second, the data used to build ASR systems is more complex. Take the two largest public datasets in image classification (ImageNet) and speech recognition (SWB2000) as a comparison: ImageNet has 1.3 million training samples and 1,000 evenly distributed classes, whereas SWB2000 has over 30 million training samples and 32,000 classes. These 32,000 classes are naturally very unevenly distributed. The implication is that the objective function landscape of ASR is much more complex. Third, ASR is more challenging for distributed training. For state-of-the-art deep learning models, the per-sample training time of ASR is 3x shorter than that of image classification, whereas the model size is 1.5x larger. This very low computation/communication ratio makes ASR almost 5x more challenging to distribute than the image classification workload.
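
One plausible reading of the "almost 5x" figure is that it is the product of the two ratios quoted above. The following back-of-the-envelope sketch assumes that communication cost per update scales with model size and that computation cost scales with per-sample training time. This cost model and the function name are illustrative assumptions, not taken from the paper.

```python
def relative_distribution_difficulty(compute_time_ratio: float,
                                     model_size_ratio: float) -> float:
    """How much lower the computation/communication ratio of workload A is
    compared with workload B.

    compute_time_ratio: per-sample training time of B divided by that of A
    model_size_ratio:   model size of A divided by that of B

    Assumption (hypothetical cost model): communication cost is proportional
    to model size, and computation cost is proportional to per-sample
    training time.
    """
    return compute_time_ratio * model_size_ratio


# Figures quoted in the text: ASR processes a sample 3x faster than
# image classification, and its model is 1.5x larger.
asr_vs_image = relative_distribution_difficulty(3.0, 1.5)
print(f"ASR is roughly {asr_vs_image:.1f}x harder to distribute")
```

Under these assumptions the result is 4.5, which matches the "almost 5x" stated in the text.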

A centralized distributed learning architecture (left) and a decentralized distributed learning architecture (right).

In our ICASSP paper, we tackle ASR distributed training problems in two steps. First, we increase the workload parallelism by increasing the batch size. Batch size, or the number of samples that are processed in parallel, is the key factor in distributed deep learning: a larger batch size allows more computing devices to work concurrently, but it harms model accuracy. Due to the complex nature of ASR, its model accuracy is very sensitive to batch size. The common belief is that, for a state-of-the-art deep learning model, the largest batch size without accuracy degradation is 256. In our paper, we found a principled way to enlarge the batch size up to 2,560 while maintaining the same level of model accuracy. This means it is possible to scale ASR training from a few GPUs to dozens of GPUs.

Second, we applied to ASR a state-of-the-art distributed deep learning technique called asynchronous decentralized parallel SGD (ADPSGD), which we previously published at the 2018 International Conference on Machine Learning (ICML’18). A traditional distributed deep learning algorithm takes either a synchronous approach or a parameter-server (PS)-based asynchronous approach. Both approaches have drawbacks. The synchronous approach suffers from the straggler problem in distributed systems, where a slower device slows down the entire system. The PS approach tends to generate less accurate models, and its centralized architecture has a single point of failure: if the PS is slow, every computing device will be slowed down. Our ICML’18 paper is the first piece of work that theoretically and empirically proved that an asynchronous and decentralized system can guarantee model accuracy and linear speedup for any non-convex optimization problem. Thus, ADPSGD circumvents the problems presented by the synchronous approach and the PS approach.
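
To make the decentralized idea concrete, here is a deliberately simplified, single-process simulation of gossip-style decentralized SGD. It is not the ADPSGD implementation from the paper: the toy quadratic objective, the ring topology, the function name, and the hyperparameters are all illustrative assumptions, and asynchrony is only imitated by letting one randomly chosen worker act at a time.

```python
import random


def simulate_decentralized_sgd(num_workers=8, steps=4000, lr=0.05,
                               target=3.0, seed=0):
    """Toy simulation of decentralized SGD without a parameter server.

    Each worker keeps its own copy of a scalar model x and minimizes
    f(x) = 0.5 * (x - target)^2 from noisy gradients. Workers exchange
    models only with their neighbors on a ring.
    """
    rng = random.Random(seed)
    models = [rng.uniform(-5.0, 5.0) for _ in range(num_workers)]

    for _ in range(steps):
        # Imitate asynchrony: one randomly chosen worker acts per step,
        # so no worker ever waits for a slower one.
        i = rng.randrange(num_workers)

        # 1. Noisy gradient computed on the worker's current local model.
        grad = (models[i] - target) + rng.gauss(0.0, 0.1)

        # 2. Average the local model with one random ring neighbor.
        j = (i + rng.choice((-1, 1))) % num_workers
        avg = 0.5 * (models[i] + models[j])
        models[i] = avg
        models[j] = avg

        # 3. Apply the (slightly stale) gradient to the averaged model.
        models[i] -= lr * grad

    return models


models = simulate_decentralized_sgd()
mean = sum(models) / len(models)
spread = max(models) - min(models)
print(f"mean model = {mean:.3f}, spread across workers = {spread:.3f}")
```

In this toy setting the workers end up close to the shared optimum and to each other, even though no central server ever holds the model and no step requires all workers to synchronize.
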

To our delight, ADPSGD solved the ASR workload on the spot: it shortened the job running time from about 1 week on 1 V100 GPU to 11.5 hours on a system with 32 V100 GPUs, while maintaining the same level of model accuracy. Turning around a training job in half a day is desirable, as it enables researchers to iterate rapidly when developing new algorithms. It also gives developers a fast turnaround time when adapting existing models to their applications, especially for custom use cases where massive amounts of speech are needed to achieve the high levels of accuracy required for robustness and usability.
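
The running times above can be turned into a speedup and a parallel efficiency with a few lines of arithmetic. This is a rough sketch: "about 1 week" is taken as exactly 168 hours, which is an assumption, and the function name is hypothetical.

```python
def speedup_and_efficiency(baseline_hours: float, distributed_hours: float,
                           num_gpus: int) -> tuple[float, float]:
    """Return (speedup, parallel efficiency) relative to a 1-GPU baseline."""
    speedup = baseline_hours / distributed_hours
    efficiency = speedup / num_gpus
    return speedup, efficiency


# Assumption: "about 1 week" on 1 GPU is treated as 7 * 24 = 168 hours.
speedup, efficiency = speedup_and_efficiency(7 * 24, 11.5, 32)
print(f"speedup ~ {speedup:.1f}x, parallel efficiency ~ {efficiency:.0%}")
```

With these inputs the speedup comes out at roughly 14.6x, in line with the 15-fold figure quoted earlier, and the efficiency relative to a single GPU is a little under one half. Both numbers inherit the imprecision of the "about 1 week" baseline.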

System architecture of ADPSGD.

ASR is a fascinating workload due to its technical challenges. While this piece of work is our first serious attempt at accelerating ASR via distributed training, we are actively improving our systems by (1) designing algorithms that can admit larger batch sizes and (2) optimizing system performance by leveraging better hardware features. We have seen very promising preliminary results, and we are confident that our system will be further improved. So please stay tuned!

Distributed Deep Learning Strategies For Automatic Speech Recognition
Wei Zhang, Xiaodong Cui, Ulrich Finkler, Brian Kingsbury, George Saon, David Kung, Michael Picheny

Research Staff Member, IBM Research

Xiaodong Cui

Research Staff Member, IBM Research

Brian Kingsbury

Distinguished Research Staff Member, IBM Research
