October 19, 2020 | Written by: Zoltan Tuske
Powerful neural networks have enabled the use of “end-to-end” speech recognition models that directly map a sequence of acoustic features to a sequence of words.
It is generally believed that direct sequence-to-sequence speech recognition models are competitive with traditional hybrid models only when a large amount of training data is used. However, in our recent paper we show that state-of-the-art recognition performance can be achieved with an attention-based encoder-decoder sequence-to-sequence model on the Switchboard-300 benchmark, which has only 300 hours of training speech.
Our training recipe has also been validated in large-scale experiments on the Switchboard-2000 benchmark, which has 2000 hours of training speech. Overall, the combination of various regularization methods and a single, simple but fairly large model outperforms the previous state-of-the-art models by a large margin. Our final model has word error rates of 4.7% and 7.8% on the Switchboard and CallHome portions of the Hub5’00 test set, respectively, using only standard 2000-hour data resources (see Table 2).
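For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal sketch of that definition only; the function name is ours, and the text normalization applied by official scoring tools is omitted.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.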
The attention-based encoder-decoder model is a typical example of a universal sequence-to-sequence model without conditional independence assumptions. Such models are quite flexible and have proven to be extremely useful for problems with non-monotonic alignments between input and output sequences, such as translation. Speech mostly requires monotonic alignments, so it was unclear whether such a generic model could outperform better-structured models designed for monotonic alignment, such as hybrid hidden Markov model/deep neural network, connectionist temporal classification (CTC), or RNN transducer (RNN-T) models. In our study we demonstrated how to reach a new record with a universal attention-based model.
Because a sequence-to-sequence model takes entire utterances as observations, data sparsity is a general challenge for sequence-to-sequence approaches. In contrast to traditional hybrid models, where even recurrent networks are trained on randomized, aligned chunks of labels and features, whole-sequence models are more prone to memorizing the training samples. In order to improve generalization, many of the methods we investigated introduce additional noise, either directly or indirectly, to stochastic gradient descent (SGD) training to avoid narrow, local optima. The other techniques we studied address the highly non-convex nature of training neural networks, ease the optimization process, and speed up convergence.
Our paper describes in detail the methods we found useful for tackling data sparsity and overfitting: weight decay, dropout, DropConnect, zoneout, label smoothing, batch normalization, scheduled sampling, residual networks, curriculum learning, speed and tempo perturbation, weight noise, and SpecAugment.
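To make one of these regularizers concrete, here is a minimal sketch of label smoothing in plain Python. It is an illustration under our own assumptions, not the exact formulation from the paper: the smoothing mass is spread uniformly over the non-target classes, the value 0.1 is a placeholder, and both function names are hypothetical.

```python
import math


def smoothed_targets(target_index: int, num_classes: int, smoothing: float = 0.1):
    """Return a target distribution that keeps 1 - smoothing on the true class.

    Assumption: the remaining mass is shared equally by the other classes.
    """
    if num_classes < 2:
        raise ValueError("need at least two classes")
    if not 0 <= target_index < num_classes:
        raise ValueError("target_index out of range")
    off_value = smoothing / (num_classes - 1)
    dist = [off_value] * num_classes
    dist[target_index] = 1.0 - smoothing
    return dist


def cross_entropy(log_probs, target_dist):
    """Cross-entropy between a target distribution and model log-probabilities."""
    return -sum(t * lp for t, lp in zip(target_dist, log_probs))


# Example: a confident, correct prediction still pays a small penalty
# under smoothed targets, which discourages over-confident outputs.
log_probs = [math.log(p) for p in (0.97, 0.01, 0.01, 0.01)]
hard_loss = cross_entropy(log_probs, smoothed_targets(0, 4, smoothing=0.0))
soft_loss = cross_entropy(log_probs, smoothed_targets(0, 4, smoothing=0.1))
```

With `smoothing=0.0` the targets reduce to the usual one-hot vector, so the same loss function covers both the smoothed and the unsmoothed phase of training.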
We used a rather simple model built from standard LSTM layers and a decoder with a single-headed attention mechanism. The structures of the encoder building block and the decoder are shown in Figure 1.
Figure 1: Encoder building block and the attention-based decoder network used in the experiments.
The models were implemented in PyTorch and trained on 32 P100 GPUs using distributed synchronous stochastic gradient descent, with up to 32 sequences per GPU per batch. Training ran for roughly 180k update steps and followed this recipe:
- Over the first 1.5% of updates, the learning rate was warmed up and the batch size was gradually increased from 8 to 32.
- In the first 15% of updates, the neural network was trained on sequences sorted in ascending order of length of the input. Afterwards, batches were randomized within length buckets, ensuring that a batch always contained sequences with similar length.
- Weight noise sampled from a normal distribution was switched on after 30% of the training.
- After 65% of the training, updates of sufficient statistics in the batch-normalization layers were turned off, converting them into fixed affine transformations.
- The learning rate was annealed after 75% of training, and simultaneously label smoothing was also switched off.
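The recipe above can be read as a schedule keyed on training progress. The sketch below encodes only the percentages stated in the list. The names `Phase` and `phase_at` are hypothetical, the linear batch-size ramp and the exact handling of phase boundaries are our assumptions, and the learning-rate values themselves are not modeled.

```python
from dataclasses import dataclass

TOTAL_UPDATES = 180_000  # "roughly 180k update steps" in the text


@dataclass(frozen=True)
class Phase:
    batch_size: int          # sequences per GPU
    warmup: bool             # learning-rate warm-up active
    sorted_by_length: bool   # curriculum: inputs in ascending length order
    weight_noise: bool
    batchnorm_frozen: bool   # batch-norm statistics no longer updated
    lr_annealing: bool
    label_smoothing: bool


def phase_at(step: int, total: int = TOTAL_UPDATES) -> Phase:
    """Return which parts of the recipe are active at a given update step."""
    if not 0 <= step < total:
        raise ValueError("step must be in [0, total)")
    progress = step / total
    warmup_steps = round(0.015 * total)
    warmup = step < warmup_steps
    if warmup:
        # Assumption: the batch size grows linearly from 8 to 32.
        batch_size = 8 + (32 - 8) * step // warmup_steps
    else:
        batch_size = 32
    return Phase(
        batch_size=batch_size,
        warmup=warmup,
        sorted_by_length=progress < 0.15,
        weight_noise=progress >= 0.30,
        batchnorm_frozen=progress >= 0.65,
        lr_annealing=progress >= 0.75,
        label_smoothing=progress < 0.75,
    )
```

A training loop could call `phase_at(step)` once per update and toggle the corresponding components, which keeps the whole recipe in one place.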
We observed that each of the methods listed above contributed to improved sequence-to-sequence training and a reduced word error rate. An ablation study demonstrated, however, that not all methods in the final recipe are equally important, as seen in Table 1.
Table 1: Ablation study on the final training recipe. Models are trained on SWB-300. WER is measured without using an external LM.
In addition to careful regularization and data augmentation, scaling up the model size to over 600M parameters also turned out to be crucial in achieving a new record on the Switchboard conversational telephony speech recognition tasks. The biggest model comprises 14 encoder and 4 decoder layers. The results achieved by our best-performing models can be seen in Table 2.
We emphasize that these remarkable results were achieved using a single, pretty simple, speaker-independent system trained from scratch, without any multi-task learning or explicit modeling of hidden variables (unlike hybrid, CTC, or RNN-T models).
Table 2: Detailed results with the best-performing systems on both SWB-300 and SWB-2000.
To assess the recognition speed and applicability of the largest model, we measured a real-time factor of 0.73-0.77 and a total WER of 6.5-6.4% on Hub5’00 as the beam size was varied from 4 to 16, using a single core of an Intel Xeon Platinum 8280 processor and 8-bit integer weight quantization. Setting the beam size to 1, we also demonstrated that excellent results are possible with a minimalistic, practically search-free, greedy decoding algorithm.
Effect of beam size and model size on word error rate (WER) measured on the CallHome (chm) part of Hub5’00. “300h” indicates models trained on SWB-300, whereas “2k” corresponds to the 2000-hour Switchboard+Fisher training setup.
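To illustrate what the beam size controls, the sketch below runs a generic beam search over a toy next-token model; with a beam size of 1 it reduces to greedy decoding. This is not our decoder: `toy_model`, its probabilities, and the absence of length normalization or an external language model are assumptions made purely for illustration.

```python
import math

EOS = "</s>"


def toy_model(prefix):
    """Hypothetical next-token log-probabilities, given the tokens so far."""
    if len(prefix) == 0:
        probs = {"a": 0.6, "b": 0.4}
    elif prefix == ("a",):
        probs = {"x": 0.6, "y": 0.4}
    elif prefix == ("b",):
        probs = {"z": 1.0}
    else:
        probs = {EOS: 1.0}
    return {token: math.log(p) for token, p in probs.items()}


def beam_search(next_log_probs, beam_size, max_len=10):
    """Return the best (sequence, log-probability) pair found."""
    beams = [((), 0.0)]  # unfinished hypotheses
    finished = []        # hypotheses that ended with EOS
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, log_p in next_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + log_p))
        # Keep only the beam_size highest-scoring extensions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            if seq[-1] == EOS:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # fall back to unfinished hypotheses at max_len
    return max(finished, key=lambda c: c[1])


greedy_seq, greedy_score = beam_search(toy_model, beam_size=1)
beam_seq, beam_score = beam_search(toy_model, beam_size=2)
```

In this toy setup the greedy search commits to the locally best first token and ends with a lower-probability sequence than the search with beam size 2. On a well-trained model the gap can be small, which is consistent with the greedy results reported above.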