A Highly Efficient Distributed Deep Learning System For Automatic Speech Recognition

Automatic speech recognition (ASR) based on deep learning has made great progress recently, thanks to the use of large amounts of training data, expressive models, and high computational power. Consequently, efficient distributed learning strategies are crucial for training acoustic models with deep architectures.

In our previous work, published at this year’s ICASSP [1], we used a distributed training approach, Asynchronous Decentralized Parallel Stochastic Gradient Descent (ADPSGD), to shorten the training time of a deep LSTM acoustic model from one week to 11.5 hours on 32 NVIDIA V100 GPUs. This was achieved without degrading recognition accuracy on the 2,000-hour Switchboard corpus, a well-established dataset in the speech community for benchmarking ASR performance. In a paper recently published at this year’s INTERSPEECH [2], we further improved the efficiency of ADPSGD, reducing the training time from 11.5 hours to 5.2 hours using 64 NVIDIA V100 GPUs.

Figure 1. ADPSGD converges with significantly larger batch size than synchronous centralized SGD.

First, large batch sizes are critical for scaling distributed training to a large number of learners. We observed that ADPSGD tolerates significantly larger batch sizes, while still converging to a good loss, than synchronous centralized parallel SGD (SCPSGD). Fig. 1 shows that ADPSGD converges with batch sizes of up to 12,288 samples, with a loss close to that of the single-GPU baseline. In contrast, SCPSGD converges only up to a batch size of 4,096 samples; larger batch sizes may significantly degrade the loss.

While a rigorous theory is still being developed to explain this phenomenon, we speculate that since SCPSGD is a special case of ADPSGD, the local model averaging among only neighboring learners in ADPSGD is equivalent to a noise perturbation of global model averaging in SCPSGD. This noise perturbation may provide opportunities to use a larger batch size in ADPSGD that is not possible for SCPSGD. This property gives ADPSGD great advantages when scaling out distributed training to a large number of learners.
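To make the contrast concrete, the sketch below compares one round of global averaging with one round of neighbor-only averaging on a ring. It is a toy illustration rather than the system's implementation: each "model" is a single floating-point number, the round is synchronous, and the equal 1/3 mixing weights and the function names (`global_average`, `ring_average`, `spread`) are assumptions made for this example.

```python
def global_average(models):
    """SCPSGD-style step: every learner ends up with the mean of all models."""
    mean = sum(models) / len(models)
    return [mean] * len(models)


def ring_average(models):
    """Decentralized-style step: each learner averages itself with its two
    ring neighbors (equal weights are an assumption of this sketch)."""
    n = len(models)
    return [
        (models[(i - 1) % n] + models[i] + models[(i + 1) % n]) / 3.0
        for i in range(n)
    ]


def spread(models):
    """Disagreement between learners: largest minus smallest model value."""
    return max(models) - min(models)


# Eight learners whose scalar "models" have drifted apart.
models = [float(i) for i in range(8)]

after_global = global_average(models)
after_ring = ring_average(models)

print("spread before:       ", spread(models))
print("spread after global: ", spread(after_global))
print("spread after ring:   ", spread(after_ring))
```

After global averaging every learner holds the same value, whereas ring averaging preserves the overall mean but leaves a residual disagreement between learners. That residual is one way to picture the perturbation discussed above.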

Second, to improve communication efficiency within a node while also reducing main memory traffic and CPU pressure across nodes, we designed a hierarchical ADPSGD architecture (H-ADPSGD), illustrated in Fig. 2. The learners on the same computing node form a super-learner via NVIDIA NCCL, using a synchronous ring-based all-reduce implementation (Sync-Ring). The super-learners then form another ring under ADPSGD (ADPSGD-Ring). In addition, because gradient computation on the GPUs overlaps with the ADPSGD communication, this design also significantly improves the computation/communication ratio of the distributed training.
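The two-level structure can be sketched as follows. This is a simplified, single-process illustration under several assumptions: each GPU holds one scalar value standing in for the quantity being averaged, the inter-node stage is run as a synchronous round with equal 1/3 weights instead of true asynchronous ADPSGD, and the function names (`node_allreduce`, `super_learner_ring`, `h_adpsgd_round`) are hypothetical. No NCCL calls are made.

```python
def node_allreduce(gpu_values):
    """Stage 1 (stand-in for Sync-Ring): all GPUs on one node agree on
    the mean of their values, acting together as a super-learner."""
    return sum(gpu_values) / len(gpu_values)


def super_learner_ring(super_values):
    """Stage 2 (stand-in for ADPSGD-Ring): each super-learner averages
    with its left and right neighbors on the ring."""
    n = len(super_values)
    return [
        (super_values[(i - 1) % n] + super_values[i] + super_values[(i + 1) % n]) / 3.0
        for i in range(n)
    ]


def h_adpsgd_round(cluster):
    """Run one hierarchical round; cluster[node][gpu] is a scalar value."""
    supers = [node_allreduce(gpus) for gpus in cluster]
    mixed = super_learner_ring(supers)
    # Every GPU on a node receives its super-learner's mixed value.
    return [[value] * len(gpus) for value, gpus in zip(mixed, cluster)]


# Eight nodes with eight GPUs each, holding distinct starting values.
cluster = [[float(node * 8 + gpu) for gpu in range(8)] for node in range(8)]
result = h_adpsgd_round(cluster)

for node_id, gpus in enumerate(result):
    print(f"node {node_id}: value on every GPU = {gpus[0]:.3f}")
```

In this picture only eight super-learners take part in the inter-node ring rather than 64 individual learners, which is the intuition behind the reduced traffic among nodes.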

Figure 2. Hierarchical-ADPSGD system architecture.

The LSTM acoustic model was trained with the proposed H-ADPSGD on a cluster of eight nodes connected via 100 Gbit/s Ethernet, each with eight NVIDIA V100 GPUs. The batch size on each GPU was 128, which gives a global batch size of 8,192. The model was trained for 16 epochs and achieved a word error rate (WER) of 7.6% on the Switchboard task and 13.2% on the Callhome task.
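The batch-size figures follow directly from the cluster layout. The short calculation below spells this out; the variable names are illustrative, and the per-super-learner batch is a derived quantity that assumes each node's eight GPUs act as one super-learner.

```python
nodes = 8            # computing nodes in the cluster
gpus_per_node = 8    # NVIDIA V100 GPUs per node
per_gpu_batch = 128  # samples per GPU per step

learners = nodes * gpus_per_node                      # individual GPU learners
super_learner_batch = gpus_per_node * per_gpu_batch   # samples per node per step
global_batch = learners * per_gpu_batch               # samples per global step

print(f"learners:            {learners}")
print(f"super-learner batch: {super_learner_batch}")
print(f"global batch:        {global_batch}")
```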

Training the model took about one week on a single V100 GPU and 11.5 hours with ADPSGD on 32 NVIDIA V100 GPUs in our ICASSP paper [1]; with H-ADPSGD on 64 NVIDIA V100 GPUs it took only 5.2 hours. Overall, H-ADPSGD gives a 40x speedup without accuracy loss. This also cuts the training time reported in our ICASSP paper [1] by more than 50%.
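As a rough check on these numbers, the calculation below works out the relative reduction from 11.5 to 5.2 hours and the single-GPU training time that a 40x speedup would imply. It uses only the figures quoted above; treating "about one week" as exactly 7 × 24 hours is an assumption made here for comparison.

```python
adpsgd_hours = 11.5      # ADPSGD on 32 GPUs, from [1]
h_adpsgd_hours = 5.2     # H-ADPSGD on 64 GPUs
reported_speedup = 40    # speedup over a single GPU, as stated above
literal_week_hours = 7 * 24  # assumption: "about one week" taken literally

# Fraction of training time saved relative to the ICASSP result.
reduction = 1 - h_adpsgd_hours / adpsgd_hours

# Single-GPU time that would make the 40x figure exact.
implied_baseline_hours = reported_speedup * h_adpsgd_hours

# Speedup if the baseline were exactly one week.
speedup_from_literal_week = literal_week_hours / h_adpsgd_hours

print(f"reduction vs. ADPSGD:      {reduction:.1%}")
print(f"implied single-GPU hours:  {implied_baseline_hours:.0f}")
print(f"speedup from a 168 h week: {speedup_from_literal_week:.1f}x")
```

On these figures the reduction relative to the ICASSP result is roughly 55%, and a 40x speedup corresponds to a single-GPU baseline of about 208 hours, so "about one week" is best read as an approximation.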

To the best of our knowledge, this is the first time that an asynchronous distributed algorithm has been shown to scale better with a large batch size than the synchronous approach for large-scale deep learning models. To date, 5.2 hours is also the fastest training time to reach this level of recognition accuracy on the 2,000-hour Switchboard dataset.


[1] W. Zhang, X. Cui, U. Finkler, B. Kingsbury, G. Saon, D. Kung and M. Picheny, “Distributed Deep Learning Strategies For Automatic Speech Recognition,” ICASSP, Brighton, United Kingdom, May 2019, pp. 5706-5710.

[2] W. Zhang, X. Cui, U. Finkler, G. Saon, A. Kayi, A. Buyuktosunoglu, B. Kingsbury, D. Kung and M. Picheny, “A Highly Efficient Distributed Deep Learning System For Automatic Speech Recognition,” INTERSPEECH, Graz, Austria, September 2019, pp. 2628-2632.

Research Staff Member, IBM Research

Xiaodong Cui

Research Staff Member, IBM Research

Brian Kingsbury

Distinguished Research Staff Member, IBM Research
