
Reducing Speech-to-Text Model Training Time on Switchboard-2000 from a Week to Under Two Hours


Automatic Speech Recognition (ASR), Computer Vision (CV) and Natural Language Processing (NLP) are three major Deep Learning (DL) applications that impact billions of people’s daily lives. These large-scale DL tasks usually take a long time to train, so reducing training time while maintaining model accuracy has a crucial business impact in AI applications and has been an active research topic. For example, in 2017, researchers at Facebook announced training an ImageNet model (CV) in under one hour [1]. In 2020, researchers at UC Berkeley and Google announced training a BERT model (NLP) in under two hours [2].

In this blog, we introduce the work published in our recent ICASSP 2020 paper [3] in which we successfully shorten the training time on the 2000-hour Switchboard dataset, which is one of the largest public ASR benchmarks, from over a week to less than two hours on a 128-GPU IBM high-performance computing (HPC) cluster. To the best of our knowledge, this is the fastest training time recorded on this dataset. The techniques developed in this work can also speed up large-scale training of models with deep architectures in a wide spectrum of DL applications.

Prior Work

In our ICASSP 2019 paper [5], we shortened training time from over a week on a single NVIDIA V100 GPU to about 11.5 hours on 32 V100 devices by finding ways to increase the batch size in distributed training and by testing different distributed training strategies. We found Asynchronous Decentralized Parallel SGD (AD-PSGD) [4] to be an excellent algorithm for distributed ASR training due to its sound convergence guarantee and its relaxed constraints on the underlying computing systems. In a subsequent study [6], we showed that AD-PSGD enables a much larger batch size than the commonly used synchronous training algorithm, and we designed a hierarchical version of AD-PSGD to train a model in 5.2 hours on 64 V100 devices.

Technical Contributions

In this work [3], we study two improvements to AD-PSGD that enable us to use even more learners by:

  1. randomizing the communication pattern between learners to increase the convergence speed (RAND-PSGD), and
  2. introducing a delay of one update between the weights used by a learner to compute gradients and the global estimate of the weights (D1D-PSGD).

One of our proposed algorithms, D1D-PSGD, now makes it possible for us to train a model on SWB-2000 in just under two hours using 128 V100 devices on a cutting-edge IBM HPC cluster.

Randomized communication between learners improves convergence

In the canonical AD-PSGD algorithm, each learner communicates only with its left and right neighbors on a communication ring.

Fig. 1: An example of previously studied AD-PSGD with eight learners. The eight learners form a ring. Each learner only communicates with its left and right neighbor to save communication bandwidth.

While AD-PSGD with this fixed communication pattern has been successfully applied to distributed training, we find that when the number of learners increases, the convergence of AD-PSGD slows down. This is because the fixed communication pattern slows down the diffusion of information when there are many learners on the ring. More rounds of communication are needed to exchange model parameters between the two most distant learners on the ring.
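
To make the mixing step concrete, here is a minimal NumPy sketch of one round of neighbor averaging on a fixed ring. It is an illustration only, not the code used in our experiments; the learner count, toy weight dimension and random initial weights are made up for the example.

    import numpy as np

    def ring_mixing_matrix(num_learners):
        # Doubly stochastic mixing matrix for a fixed ring: each learner
        # gives weight 1/3 to itself and 1/3 to each of its two neighbors.
        W = np.zeros((num_learners, num_learners))
        for i in range(num_learners):
            W[i, i] = 1.0 / 3.0
            W[i, (i - 1) % num_learners] = 1.0 / 3.0
            W[i, (i + 1) % num_learners] = 1.0 / 3.0
        return W

    # Each row of `params` holds one learner's flattened model weights.
    rng = np.random.default_rng(0)
    params = rng.normal(size=(8, 4))   # 8 learners, toy 4-dimensional model

    W = ring_mixing_matrix(8)
    mixed = W @ params                 # one round of ring averaging

    # The spread across learners shrinks only slowly per round on a large
    # fixed ring, which is the diffusion bottleneck described above.
    print(np.std(params, axis=0).mean(), np.std(mixed, axis=0).mean())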


To ensure speedy communication between any pair of learners in the communication topology, and therefore improve the speed of convergence to consensus, we proposed a randomized mixing strategy in which each learner randomly picks two other learners to communicate with in a given iteration. Compared to AD-PSGD with a fixed communication pattern, this strategy converges to consensus significantly faster at the same communication cost. We name this new algorithm Randomization-Accelerated Decentralized Parallel SGD (RAND-PSGD). Fig. 2 illustrates the communication pattern in RAND-PSGD: in each iteration t, a learner randomly picks two neighbors to communicate with.

Fig. 2: An illustration of RAND-PSGD. In each iteration t, a learner communicates with two randomly picked learners.
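
The NumPy sketch below illustrates the randomized mixing idea as a simplified, synchronous simulation (our actual implementation is asynchronous and distributed): in each round, every learner averages its weights with two peers drawn at random, and the learners approach consensus in far fewer rounds than on a fixed ring of the same size.

    import numpy as np

    rng = np.random.default_rng(0)

    def randomized_mixing_step(params):
        # One RAND-PSGD-style round: each learner averages its weights with
        # two peers drawn uniformly at random (excluding itself).
        n = params.shape[0]
        mixed = np.empty_like(params)
        for i in range(n):
            partners = rng.choice(np.delete(np.arange(n), i), size=2, replace=False)
            mixed[i] = (params[i] + params[partners].sum(axis=0)) / 3.0
        return mixed

    # Toy example: 16 learners move toward consensus within a few rounds.
    params = rng.normal(size=(16, 4))
    for _ in range(5):
        params = randomized_mixing_step(params)
    print("spread after 5 randomized rounds:", np.std(params, axis=0).mean())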

 

Accepting limited staleness leads to a very fast algorithm

To study how close RAND-PSGD comes to the convergence speed of standard synchronous SGD (implemented via an AllReduce call) in a decentralized setting, we also proposed Delay-by-one Decentralized Parallel SGD, referred to as D1D-PSGD. In D1D-PSGD, we overlap the gradient calculation with an AllReduce on the weights. The weights each learner uses to calculate gradients lag the AllReduced weights by exactly one iteration, hence the name Delay-by-one. D1D-PSGD characterizes the upper bound on convergence speed in a decentralized setting, because the background weight averaging is carried out by a global AllReduce. One key difference between D1D-PSGD and synchronous training is that each learner still works on slightly different weights due to asynchrony. This difference is important, as it provides enough system noise to keep D1D-PSGD from being trapped early in a poor local minimum when the batch size is large.
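
The hedged PyTorch-style sketch below shows one way to schedule the delay-by-one overlap; it is not our production implementation. It assumes torch.distributed has already been initialized (for example via torchrun), and model, data_loader, loss_fn and lr are placeholder names supplied by the caller.

    import torch
    import torch.distributed as dist

    def d1d_training_loop(model, data_loader, loss_fn, lr):
        # Sketch of delay-by-one decentralized SGD (D1D-PSGD) with a plain
        # SGD update; assumes the default process group is initialized.
        world = dist.get_world_size()

        # Launch a background all-reduce on a snapshot of the weights.
        buffers = [p.data.clone() for p in model.parameters()]
        handles = [dist.all_reduce(b, async_op=True) for b in buffers]

        for inputs, targets in data_loader:
            # Gradients are computed with the current local weights, which
            # lag the in-flight global average by exactly one iteration.
            model.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()

            # Finish last iteration's weight averaging, then apply a local
            # SGD step on top of the averaged weights.
            with torch.no_grad():
                for p, b, h in zip(model.parameters(), buffers, handles):
                    h.wait()
                    p.data.copy_(b / world - lr * p.grad)

            # Start the next round of background weight averaging; its
            # result is consumed only at the start of the next iteration.
            buffers = [p.data.clone() for p in model.parameters()]
            handles = [dist.all_reduce(b, async_op=True) for b in buffers]

Because the all-reduce operates on weight snapshots rather than on the live parameters, the forward and backward passes never race with the in-flight communication, which is what makes the overlap safe.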

Figure 3 demonstrates that both RAND-PSGD and D1D-PSGD significantly improve the training quality when the number of learners grows from 16 to 64 for both ImageNet (upper panels) and SWB2000 (lower panels).

Fig. 3: Performance comparison of AD-PSGD, RAND-PSGD and D1D-PSGD on ImageNet (upper) and SWB2000 (lower). RAND-PSGD and D1D-PSGD converge at a similar speed and both are faster than AD-PSGD.

Impact on Business

High Performance Computing

We conducted the distributed acoustic model training on SWB2000 using an IBM Power AC922 supercomputer. It has a node architecture similar to that of the current fastest supercomputer in the US, IBM Summit [7]. Each node is equipped with IBM POWER9 CPUs and NVIDIA Volta V100 GPUs, all connected with NVIDIA’s high-speed NVLink dual links totaling 50GB/s of bandwidth in each direction.

Each node contains 22 cores, 512GB of DDR4 memory, 96GB of High Bandwidth Memory (HBM2) for use by the accelerators, and six GPUs. Nodes are connected with Mellanox EDR 100Gb/s InfiniBand interconnect technology, and each node has a combined network bandwidth of 25GB/s. Each node is also equipped with 500GB of NVMe storage. Using D1D-PSGD on this IBM supercomputer, we were able to train LSTM acoustic models on SWB2000 in less than two hours on 128 GPUs and still achieve competitive recognition performance, as recorded in Table 1. This demonstrates that our algorithm can enable a world-class computing infrastructure to deliver world-class training results.

Table 1: WER/Time after 16 epochs for D1D-PSGD.

Cloud Computing

While D1D-PSGD is a good choice for HPC environments such as IBM’s supercomputer, AD-PSGD or RAND-PSGD is the preferred choice for deployments on heterogeneous computing platforms, such as a cloud with various types of machines: neither algorithm requires global synchronization, and both avoid the straggler problem faced by synchronous methods. We summarize the key comparisons between AD-PSGD, RAND-PSGD and D1D-PSGD in Table 2.

Table 2: Convergence and cloud-friendliness comparison between AD-PSGD, RAND-PSGD and D1D-PSGD.

Application to other tasks

Our underlying technique is not limited to ASR and is widely applicable to other tasks (e.g., CV and NLP). Furthermore, our system is inherently decentralized, which makes it naturally suited to tasks where training data cannot be shared across learners (e.g., due to privacy concerns). Lastly, our system is asynchronous, which works particularly well when there is workload imbalance across learners.

[1] Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour (arXiv 2017)

[2] Large Batch Optimization for Deep Learning: Training BERT in 76 minutes (ICLR 2020)

[3] Improving Efficiency in Large-Scale Decentralized Distributed Training (ICASSP 2020)

[4] Asynchronous Decentralized Parallel Stochastic Gradient Descent (ICML 2018)

[5] Distributed Deep Learning Strategies For Automatic Speech Recognition (ICASSP 2019)

[6] A Highly Efficient Distributed Deep Learning System For Automatic Speech Recognition (INTERSPEECH 2019)

[7] Top500 June 2020 listing website


Xiaodong Cui

Research Staff Member, IBM Research

Brian Kingsbury

Distinguished Research Staff Member, IBM Research
