September 30, 2019 | Written by: Zvi Kons, Slava Shechtman, and Alex Sorin
Categorized: AI | IBM Research-Haifa
Recent advances in deep learning are dramatically improving Text-to-Speech (TTS) systems: models learn speakers' voices and speaking styles more effectively and efficiently, and they generate more natural, higher-quality speech.
Yet, to produce this high-quality speech, most TTS systems depend on large and complex neural network models that are difficult to train and do not allow real-time speech synthesis, even when leveraging GPUs.
To address these challenges, our IBM Research AI team has developed a new method for neural speech synthesis based on a modular architecture, which combines three deep neural networks (DNNs) with intermediate signal processing of the networks’ output. We presented this work in our paper “High quality, lightweight and adaptable TTS using LPCNet” at Interspeech 2019. The TTS architecture is lightweight and can synthesize high-quality speech in real time. Each network learns a different aspect of a speaker’s voice, making it possible to efficiently train each component independently.
Figure 1: TTS System Architecture
Another advantage of our approach is that once the base networks are trained, they can be easily adapted to a new speaking style or voice, such as for branding and personalization purposes, even with small amounts of training data.
The synthesis process applies a language-specific front-end module that converts input text into a sequence of linguistic features. The following three DNNs are then applied in sequence:
1. Prosody Prediction
Prosody features are represented as a four-dimensional prosody vector per TTS unit (roughly one-third of a phone’s HMM states), comprising the unit’s log-duration, initial log-pitch, final log-pitch and log-energy. These features are learned at training time so they can be predicted from textual features extracted by the front-end at synthesis time. Prosody is extremely important, not only for helping the speech sound natural and lively, but also for best representing the specific speaker’s style in the training or adaptation data. Prosody adaptation to an unseen speaker is based on a Variational Autoencoder (VAE). More details on the network architecture can be found in our paper as well as 
Figure 2: Prosody generator training and retraining
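To make the four-dimensional prosody vector concrete, here is a minimal Python sketch of how one unit's values might be packed. The class name, field names, units (seconds, Hz) and the use of the natural logarithm are assumptions for illustration only; the paper defines the actual representation.

```python
import math
from dataclasses import dataclass


@dataclass
class ProsodyVector:
    """Hypothetical container for one TTS unit's prosody (names are illustrative)."""
    log_duration: float
    initial_log_pitch: float
    final_log_pitch: float
    log_energy: float

    @classmethod
    def from_linear(cls, duration_s, initial_pitch_hz, final_pitch_hz, energy):
        # Assumption: natural log of strictly positive linear-domain values.
        return cls(
            math.log(duration_s),
            math.log(initial_pitch_hz),
            math.log(final_pitch_hz),
            math.log(energy),
        )

    def as_list(self):
        # Fixed ordering: duration, initial pitch, final pitch, energy.
        return [
            self.log_duration,
            self.initial_log_pitch,
            self.final_log_pitch,
            self.log_energy,
        ]


# Example unit with made-up values: 30 ms long, pitch rising from 120 Hz to 130 Hz.
unit = ProsodyVector.from_linear(0.03, 120.0, 130.0, 0.5)
vector = unit.as_list()
```

A rising pitch within the unit shows up as a final log-pitch larger than the initial log-pitch, which is the kind of pattern the prosody network has to predict from text.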
2. Acoustic Feature Prediction
Acoustic feature vectors provide the spectral representation of the speech at short 10 millisecond frames, from which the actual audio can be generated. The acoustic features are learned at training time so they can be predicted from the phonetic labels and prosody during synthesis.
Figure 3: Synthesizer Network
The resulting DNN model represents the voice of the speaker in the training or adaptation data. The architecture is based on convolutional and recurrent layers that extract local context and time-dependent patterns in the phonetic sequence and pitch pattern. The DNN predicts the acoustic features along with their first and second derivatives. This is followed by the maximum likelihood procedure and formant enhancement filters, which help to generate better-sounding speech.
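As a rough illustration of what "features along with their first and second derivatives" means for a sequence of 10 millisecond frames, the Python sketch below stacks each frame with simple time derivatives. The central-difference formula and the edge handling are assumptions chosen for brevity; the paper may use different derivative windows.

```python
FRAME_SHIFT_MS = 10  # frame length stated in the text


def num_frames(duration_ms):
    """Number of whole 10 ms frames in a span of the given duration."""
    return duration_ms // FRAME_SHIFT_MS


def delta(frames):
    """Central-difference time derivative; edge frames reuse their nearest neighbor."""
    n = len(frames)
    out = []
    for t in range(n):
        prev_frame = frames[max(t - 1, 0)]
        next_frame = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev_frame, next_frame)])
    return out


def with_derivatives(frames):
    """Concatenate each frame with its first and second derivatives."""
    d1 = delta(frames)
    d2 = delta(d1)
    return [f + a + b for f, a, b in zip(frames, d1, d2)]


# Toy one-dimensional "acoustic feature" that rises linearly over four frames.
frames = [[0.0], [1.0], [2.0], [3.0]]
stacked = with_derivatives(frames)
```

Each stacked frame is three times as wide as the original one. Predicting the derivatives alongside the static values gives the later smoothing step information about how the features should move between frames.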
3. Neural Vocoder
The neural vocoder is responsible for generating the actual speech samples from the acoustic features. It is trained on the speaker’s natural speech samples together with their corresponding features. Specifically, we were the first to use a novel, lightweight, high-quality neural vocoder called LPCNet in a fully commercialized TTS system.
The novelty of this vocoder is that it doesn’t try to predict the complex speech signal directly with a DNN. Instead, the DNN only predicts the less-complex glottal tract residual signal, and LPC filters then convert that residual into the final speech signal.
Figure 4: LPCNet Neural Vocoder
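The filtering half of this idea can be sketched in a few lines of Python. The example below is not LPCNet itself: it assumes a residual signal is already available (in LPCNet the network produces it sample by sample) and that one fixed set of LPC coefficients applies to the whole signal, whereas a real vocoder updates the coefficients every frame. The function name and sign convention are also assumptions.

```python
def lpc_synthesize(residual, lpc_coeffs):
    """All-pole LPC synthesis: s[n] = e[n] + sum_k a[k] * s[n - k].

    residual   -- excitation samples e[n] (here given, not predicted by a DNN)
    lpc_coeffs -- predictor coefficients a[1..p], assumed constant over time
    """
    speech = []
    for n, excitation in enumerate(residual):
        # Linear prediction from previously generated output samples.
        prediction = sum(
            a * speech[n - k]
            for k, a in enumerate(lpc_coeffs, start=1)
            if n - k >= 0
        )
        speech.append(excitation + prediction)
    return speech


# A single impulse through a first-order filter decays geometrically.
impulse_response = lpc_synthesize([1.0, 0.0, 0.0, 0.0], [0.5])
# impulse_response == [1.0, 0.5, 0.25, 0.125]
```

Because the linear predictor already accounts for most of the sample-to-sample structure, the part left for the network to model is much simpler than the full waveform, which is what keeps the vocoder lightweight.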
Voice adaptation to a target speaker can be easily achieved by retraining the three networks on a small amount of data from the target speaker. In our paper, we present results of adaptation experiments in terms of speech quality and similarity to the target speaker. There are also samples of adaptation to eight different VCTK speakers (four male, four female) in this sample page.
Listening Test Results
The figure below shows the results of the crowd-listening tests. For quality evaluations, the MOS (Mean Opinion Score) values are obtained by averaging the quality scores (1-5) given by listeners for many synthesized and natural samples from the VCTK speakers. For similarity evaluations, the listeners were presented with pairs of samples and asked to rate the similarity between them (on a scale of 1-4).
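For readers unfamiliar with MOS, the Python sketch below shows the averaging step on a handful of invented ratings. The scores are made up for illustration and are not data from our tests; the 95% interval uses a simple normal approximation, which is an assumption and not necessarily the analysis used in the paper.

```python
import math
import statistics


def mean_opinion_score(scores):
    """Return (MOS, 95% half-width) for a list of 1-5 listener ratings."""
    if not scores:
        raise ValueError("at least one rating is required")
    if any(not 1 <= s <= 5 for s in scores):
        raise ValueError("ratings must lie on the 1-5 scale")
    mos = statistics.mean(scores)
    if len(scores) > 1:
        # Normal-approximation confidence interval (an assumption).
        half_width = 1.96 * statistics.stdev(scores) / math.sqrt(len(scores))
    else:
        half_width = 0.0
    return mos, half_width


# Invented ratings for a single synthesized sample.
example_ratings = [4, 5, 3, 4, 4]
mos, half_width = mean_opinion_score(example_ratings)
```

The same averaging applies to the 1-4 similarity scale once the range check is adjusted.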
We evaluated the quality of the synthesized speech and its similarity to the target speaker for female- and male-adapted voices trained on 5, 10 and 20 minutes of target speech, and compared them with natural speech from the target speakers.
The test results show that we can maintain both high quality and high similarity to the original speaker, even for voices that were trained on as little as five minutes of speech.
Figure 5: Quality and Similarity Listening Test Results
This work was productized by IBM Watson and was the basis for a new IBM Watson TTS service release with upgraded quality voices (select “V3” voices in the IBM Watson TTS demo).
All authors of “High quality, lightweight and adaptable TTS using LPCNet” contributed to this article: Zvi Kons, Slava Shechtman, Alex Sorin, Carmel Rabinovitz, Ron Hoory.
Z. Kons, S. Shechtman, A. Sorin, R. Hoory, C. Rabinovitz and E. Da Silva Morais, “Neural TTS Voice Conversion,” 2018 IEEE Spoken Language Technology Workshop (SLT), Athens, Greece, 2018, pp. 290-296.
J. Valin and J. Skoglund, “LPCNet: Improving Neural Speech Synthesis through Linear Prediction,” ICASSP 2019, Brighton, United Kingdom, 2019, pp. 5891-5895.
C. Veaux, J. Yamagishi and K. MacDonald, “CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit,” [sound], University of Edinburgh, The Centre for Speech Technology Research (CSTR), 2017. https://doi.org/10.7488/ds/1994