October 18, 2019 | Written by: Gakuto Kurata, Kartik Audhkhasi, and Brian Kingsbury
End-to-end (E2E) automatic speech recognition (ASR) is an emerging paradigm in the field of neural network-based speech recognition that offers multiple benefits. Traditional “hybrid” ASR systems, which comprise an acoustic model, a language model, and a pronunciation model, require separate training of these components, each of which can be complex. For example, training an acoustic model is a multi-stage process that alternates between model training and time alignment of the speech acoustic feature sequence with the output label sequence. In contrast, E2E ASR is a single integrated approach with a much simpler training pipeline and models that operate at low audio frame rates. This reduces training time and decoding time, and allows joint optimization with downstream processing such as natural language understanding.
However, current E2E ASR systems also suffer from important limitations:
- First, E2E ASR systems need orders of magnitude more training data than hybrid ASR systems to achieve a similar word error rate (WER). This is because E2E ASR systems tend to overfit the training data when it is limited.
- Second, connectionist temporal classification (CTC), a popular variant of E2E ASR, is not amenable to ensembling and student-teacher transfer learning, both of which are useful for deploying highly accurate ASR systems with latency constraints.
We presented three papers at INTERSPEECH 2019 that address these shortcomings of E2E approaches for speech recognition.
“Forget a Bit to Learn Better: Soft Forgetting for CTC-based Automatic Speech Recognition,” K. Audhkhasi, G. Saon, Z. Tüske, B. Kingsbury, and M. Picheny
E2E ASR systems have matched the accuracy of hybrid ASR systems across several speech recognition tasks. This is due to powerful neural network architectures, WER-related loss functions, and exhaustive experimentation relying on graphics/tensor processing units.
Prior work has shown that CTC-based E2E ASR systems perform well when using bidirectional long short-term memory (BLSTM) networks unrolled over the whole speech utterance in order to capture more acoustic context at each time step, as shown in Figure 1 below:
Figure 1: Bidirectional LSTM CTC that unrolls over the whole speech utterance.
We observed that this also leads to overfitting, especially on small training data sets, since it is easy for the over-parameterized BLSTM to memorize training sequences. We proposed soft forgetting as a solution to combat this overfitting:
- First, we unrolled the BLSTM network only over small non-overlapping chunks of the input utterance.
- Second, we randomly picked a chunk size for each batch instead of a fixed global chunk size.
- Third, in order to retain some utterance-level information, we encouraged the hidden states of the BLSTM network to approximate those of a pre-trained whole-utterance BLSTM. We achieved this by incorporating an additional mean-squared error term (“twin loss”) in the training loss function. Figure 2 summarizes soft forgetting:
Figure 2: Soft forgetting, which unrolls the BLSTM network over non-overlapping chunks and uses a pre-trained whole-utterance BLSTM for regularization.
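The three ingredients of soft forgetting can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the implementation from the paper: the function names, the chunk-size range, and the `twin_weight` used to combine the two loss terms are all hypothetical, and the CTC loss itself is taken as a given number.

```python
import random


def split_into_chunks(frames, chunk_size):
    """Split a frame sequence into non-overlapping chunks (the last may be shorter)."""
    return [frames[i:i + chunk_size] for i in range(0, len(frames), chunk_size)]


def sample_chunk_size(rng, min_size, max_size):
    """Pick a chunk size for one batch; the range is a hypothetical choice."""
    return rng.randint(min_size, max_size)


def twin_loss(chunked_hidden, whole_hidden):
    """Mean-squared error between the hidden states of the chunked model and
    those of the pre-trained whole-utterance model, frame by frame."""
    total = 0.0
    count = 0
    for h_chunked, h_whole in zip(chunked_hidden, whole_hidden):
        for a, b in zip(h_chunked, h_whole):
            total += (a - b) ** 2
            count += 1
    return total / count


def soft_forgetting_loss(ctc_loss, chunked_hidden, whole_hidden, twin_weight):
    """CTC loss plus the twin-loss regularizer; the weighting is an assumption."""
    return ctc_loss + twin_weight * twin_loss(chunked_hidden, whole_hidden)


# One "batch": draw a chunk size, then unroll the model over each chunk separately.
rng = random.Random(0)
utterance = list(range(10))  # stand-in for 10 acoustic feature frames
chunks = split_into_chunks(utterance, sample_chunk_size(rng, 2, 5))
```

Because the chunk size is drawn anew for each batch, the network never sees one fixed segmentation of the training utterances, while the twin loss pulls its hidden states back toward the whole-utterance context.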
We conducted experiments on the 300-hour English Switchboard dataset. Our results showed that soft forgetting improves the WER over a competitive whole-utterance phone CTC BLSTM by an average of 7-9% relative. We obtained WERs of 9.1%/17.4% using speaker-independent models and 8.7%/16.8% using speaker-adapted models on the Hub5-2000 Switchboard/CallHome test sets, respectively. Furthermore, adding the latest offline and on-the-fly data augmentation techniques resulted in a WER of 7.9%/15.7% for the speaker-independent phone CTC system. This matched the WER of a competitive speaker-adapted hybrid BLSTM system. We also showed that soft forgetting improves the WER when the model is used with limited temporal context for streaming recognition.
“Advancing sequence-to-sequence based speech recognition,” Z. Tüske, K. Audhkhasi, and G. Saon
Attention-based encoder-decoder or “all-neural” sequence-to-sequence models are an alternative approach to E2E ASR. These models comprise a recurrent neural network that encodes the input acoustic feature sequence; an attention neural network that assigns attention weights to the sequence of encoder outputs; and a decoder recurrent neural network that predicts the sequence of symbols, e.g., characters. In this paper, we set out to improve state-of-the-art speech recognition results with the attention-based approach, using LibriSpeech, a well-known, large, publicly available speech corpus.
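To make the role of the attention network concrete, here is a minimal sketch of one attention step in plain Python. It assumes dot-product scoring followed by a softmax, which is one common choice and not necessarily the mechanism used in the paper; the function names are hypothetical.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_outputs):
    """Weight each encoder output by its similarity to the decoder state and
    return the weights together with the resulting context vector."""
    scores = [sum(d * e for d, e in zip(decoder_state, enc))
              for enc in encoder_outputs]
    weights = softmax(scores)
    dim = len(encoder_outputs[0])
    context = [sum(w * enc[i] for w, enc in zip(weights, encoder_outputs))
               for i in range(dim)]
    return weights, context


# Two encoder time steps; the decoder state is more similar to the first one.
weights, context = attend([2.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

The decoder would consume `context` when predicting the next symbol, so each output character can draw on a different part of the encoded utterance.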
Our results challenged the traditional hybrid model coupled with powerful recurrent language model rescoring. Specifically, we investigated the effect of several techniques on sequence-to-sequence ASR models: sophisticated data augmentation, various dropout schemes, scheduled sampling, warm-restart, various input features, modeling units, model sizes, sub-sampling rates, language models, discriminative training, and decoder search configurations.
By choosing training configurations and search parameters optimally for sequence-to-sequence ASR models, we demonstrated significant improvement in the WER of the attention-based encoder-decoder models. Among various applied techniques, we found that smaller modeling units and moderate application of frame-rate reduction are critical. In addition, we showed that a system combination with a straightforward variation in model size, representation unit, and input features led to further improvement, which indicates the further potential of encoder-decoder models. Table 1 gives a summary of our results on the various LibriSpeech test sets.
Table 1: Comparison with the published results. Please see the paper for the references.
“Guiding CTC Posterior Spike Timings for Improved Posterior Fusion and Knowledge Distillation,” G. Kurata and K. Audhkhasi
Traditional ASR systems trained from frame-level alignments can easily leverage posterior fusion to improve ASR accuracy and build a better single model with knowledge distillation. E2E ASR systems trained using the CTC loss do not require frame-level alignment and hence simplify model training. However, sparse and arbitrary posterior spike timings from CTC models pose a new set of challenges for posterior fusion from multiple models and knowledge distillation between CTC models. For example, Figure 3(a) shows posterior spikes from unidirectional LSTM (UniLSTM) phone CTC models trained from different initializations and with different training data order. The posterior spikes are not aligned, and thus posterior fusion does not work well. Figure 3(c) shows posterior spikes from UniLSTM and bidirectional LSTM (BiLSTM) phone CTC models. Due to completely different spike timings, knowledge distillation from BiLSTM to UniLSTM has not been straightforward.
Figure 3: Posteriors for “this (DH IH S) is (IH S) true (T R UW)” in the Switchboard test set.
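A toy example shows why misaligned spikes hurt posterior fusion. The sketch below assumes that fusion is a frame-wise average of the models' posteriors and that decoding simply takes the most likely symbol per frame; the numbers are made up for illustration and the function names are hypothetical.

```python
def fuse_posteriors(models):
    """Average posteriors frame by frame across models.
    Each model is a list of frames; each frame is a list of symbol posteriors."""
    n_models = len(models)
    return [
        [sum(symbol_probs) / n_models for symbol_probs in zip(*frames)]
        for frames in zip(*models)
    ]


def best_symbols(posteriors):
    """Index of the most likely symbol at each frame."""
    return [max(range(len(frame)), key=frame.__getitem__) for frame in posteriors]


# Symbol 0 is the CTC blank, symbol 1 is a phone. Made-up posteriors, 3 frames.
model_a = [[1.0, 0.0], [0.1, 0.9], [1.0, 0.0]]     # spike at frame 1
aligned_b = [[1.0, 0.0], [0.2, 0.8], [1.0, 0.0]]   # spike at frame 1 as well
shifted_b = [[1.0, 0.0], [1.0, 0.0], [0.2, 0.8]]   # same phone, spike at frame 2

aligned_path = best_symbols(fuse_posteriors([model_a, aligned_b]))
shifted_path = best_symbols(fuse_posteriors([model_a, shifted_b]))
```

With aligned spikes the phone survives the averaging. With shifted spikes each model's spike is averaged against the other model's blank, so the blank wins at every frame and the phone disappears from the fused output, even though both models recognized it on their own.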
We propose a method to train a CTC model so that its spike timings are guided to align with those of a pre-trained guiding CTC model. We call this guided CTC training. Figure 4 shows a typical form of guided CTC training for UniLSTM CTC models. We first train the guiding UniLSTM model and then train multiple UniLSTM models whose spike positions are guided to match those of the pre-trained guiding model. As a result, all models sharing the same guiding model have aligned spike timings, and their posteriors can be fused. For example, in Figure 3(b), spikes from the bottom two guided models are aligned with those of the guiding model at the top.
Figure 4: Typical form of the guided CTC training.
Figure 5 shows a schematic diagram of the proposed guided CTC training. Vertical and horizontal axes of the matrices indicate output symbols and time indexes, respectively. The guiding CTC model at the left makes a mask that sets a “1” only at the output symbol of the highest posterior at each time index. This mask is applied to the posterior from the model being trained via an element-wise product. The summation of the masked posteriors at the right is multiplied by -1 and is minimized jointly with the normal CTC loss.
Figure 5: Schematic diagram of proposed guided CTC training.
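The masking step described above can be sketched in plain Python as follows. The posteriors are stored one row per time index (the transpose of the layout in Figure 5), the function names are hypothetical, and how the guided term is weighted against the CTC loss is left open because the text only says the two are minimized jointly.

```python
def make_guiding_mask(guide_posteriors):
    """Set a 1 at the highest-posterior symbol of the guiding model at each
    time index and 0 elsewhere (ties go to the lowest symbol index)."""
    mask = []
    for frame in guide_posteriors:
        best = max(range(len(frame)), key=frame.__getitem__)
        mask.append([1.0 if k == best else 0.0 for k in range(len(frame))])
    return mask


def guided_term(mask, posteriors):
    """Element-wise product of mask and posteriors, summed and multiplied by -1.
    Minimizing it raises the trained model's posterior where the guide spikes."""
    return -sum(m * p
                for mask_row, post_row in zip(mask, posteriors)
                for m, p in zip(mask_row, post_row))


# Two time indexes, two output symbols; made-up posteriors for illustration.
guide = [[0.1, 0.9], [0.8, 0.2]]     # pre-trained guiding model
trained = [[0.3, 0.7], [0.6, 0.4]]   # model currently being trained
mask = make_guiding_mask(guide)
loss_term = guided_term(mask, trained)  # added to the normal CTC loss in training
```

The term is most negative when the model being trained puts all of its probability mass on the symbols the guiding model prefers at each time index, which is what pulls the spike timings into alignment.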
The paper investigated the advantages of the proposed guided CTC training in various scenarios. Here we describe the most practically important experiment: improving the accuracy of a UniLSTM phone CTC model using BiLSTM phone CTC models as teachers. An online streaming application can only use a unidirectional model, but the accuracy of a bidirectional model is typically much higher. We realized knowledge distillation from a BiLSTM CTC model to a UniLSTM CTC model by using the proposed guided CTC training. Specifically, we first trained the unidirectional model and then trained the bidirectional models using the trained unidirectional model as a guiding model.
Through this approach, the trained bidirectional models have spike timings matched to the unidirectional model and thus can serve as teachers to train a unidirectional model. Figure 3(d) shows that the spike timings of the bottom two bidirectional models are aligned with those of the guiding unidirectional model at the top. Note that the spike timings of the bidirectional models trained with standard methods and with the proposed guided CTC training are completely different (compare Figures 3(c) and 3(d)). As shown in line 2E of Table 2, the unidirectional model trained with this approach has much better accuracy than the normal unidirectional model in line 2A.
Table 2: Word Error Rates (WERs) for knowledge distillation from bidirectional LSTM (BiLSTM) to unidirectional LSTM (UniLSTM) phone CTC models. [%]