
High quality, lightweight and adaptable Text-to-Speech (TTS) using LPCNet


Recent advances in deep learning are dramatically improving the development of Text-to-Speech (TTS) systems, through more effective and efficient learning of speakers’ voices and speaking styles, and more natural generation of high-quality output speech.

Yet, to produce this high-quality speech, most TTS systems depend on large and complex neural network models that are difficult to train and do not allow real-time speech synthesis, even when leveraging GPUs.

In order to address these challenges, our IBM Research AI team has developed a new method for neural speech synthesis based on a modular architecture that combines three deep neural networks (DNNs) with intermediate signal processing of the networks’ outputs. We presented this work in our paper “High quality, lightweight and adaptable TTS using LPCNet” at Interspeech 2019. The TTS architecture is lightweight and can synthesize high-quality speech in real time. Each network learns a different aspect of a speaker’s voice, making it possible to train each component independently and efficiently.

Figure 1: TTS System Architecture

Another advantage of our approach is that once the base networks are trained, they can be easily adapted to a new speaking style or voice, such as for branding and personalization purposes, even with small amounts of training data.

The synthesis process applies a language-specific front-end module that converts the input text into a sequence of linguistic features. The following three DNNs are then applied in sequence (a short sketch chaining all three stages together follows the vocoder section below):

1. Prosody Prediction

Prosody features are represented as a four-dimensional prosody vector per TTS unit (roughly one-third of a phone, corresponding to its HMM states), comprising the unit’s log-duration, initial log-pitch, final log-pitch and log-energy. These features are learned at training time so that they can be predicted at synthesis time from the textual features extracted by the front-end. Prosody is extremely important, not only for making the speech sound natural and lively, but also for best representing the specific speaker’s style found in the training or adaptation data. Prosody adaptation to an unseen speaker is based on a variational autoencoder (VAE). More details on the network architecture can be found in our paper, as well as in [1].
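As a small illustration, the snippet below builds such a four-dimensional prosody vector for a single unit. It is only a sketch: it assumes the unit segmentation, pitch values and energy have already been extracted by some analysis tooling, and the helper name is ours, not from the paper.

```python
import numpy as np

def prosody_vector(duration_sec, pitch_hz_start, pitch_hz_end, energy):
    """Build the four-dimensional per-unit prosody target:
    log-duration, initial log-pitch, final log-pitch and log-energy."""
    eps = 1e-8  # guard against log(0) on unvoiced or silent units
    return np.array([
        np.log(duration_sec + eps),
        np.log(pitch_hz_start + eps),
        np.log(pitch_hz_end + eps),
        np.log(energy + eps),
    ], dtype=np.float32)

# Example: a 60 ms unit with a falling pitch contour
print(prosody_vector(0.060, 180.0, 150.0, 0.02))
```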

Figure 2: Prosody generator training and retraining

2. Acoustic Feature Prediction

Acoustic feature vectors provide a spectral representation of the speech over short, 10-millisecond frames, from which the actual audio can be generated. The acoustic features are learned at training time so that they can be predicted from the phonetic labels and prosody during synthesis.

Figure 3: Synthesizer Network

The resulting DNN model represents the voice of the speaker in the training or adaptation data. The architecture is based on convolutional and recurrent layers that extract local context and time-dependent patterns from the phonetic sequence and the pitch contour. The DNN predicts the acoustic features along with their first and second derivatives. This is followed by a maximum-likelihood procedure and formant-enhancement filters, which help generate better-sounding speech.
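The following PyTorch sketch shows the general shape of such a convolutional-plus-recurrent acoustic-feature predictor. The layer sizes, input feature dimension and choice of a GRU are illustrative assumptions rather than the paper’s exact configuration, and the maximum-likelihood and formant-enhancement post-processing steps are omitted.

```python
import torch
import torch.nn as nn

class SynthesizerNet(nn.Module):
    """Illustrative conv + recurrent regressor from frame-level linguistic/prosody
    inputs to acoustic features with their first and second derivatives."""
    def __init__(self, in_dim=300, acoustic_dim=40, hidden=256):
        super().__init__()
        # Convolutions over time capture local phonetic context.
        self.conv = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # A bidirectional GRU models longer time-dependent patterns.
        self.rnn = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        # Predict statics plus delta and delta-delta features (3 x acoustic_dim).
        self.out = nn.Linear(2 * hidden, 3 * acoustic_dim)

    def forward(self, x):          # x: (batch, frames, in_dim)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)
        h, _ = self.rnn(h)
        return self.out(h)         # (batch, frames, 3 * acoustic_dim)

# Example: 200 frames (about 2 s at 10 ms frames) of frame-level inputs
y = SynthesizerNet()(torch.randn(1, 200, 300))
print(y.shape)  # torch.Size([1, 200, 120])
```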

3. Neural Vocoder

The neural vocoder is responsible for generating the actual speech samples from the acoustic features. It is trained on the speaker’s natural speech samples together with their corresponding features. Specifically, we were the first to use a novel, lightweight, high-quality neural vocoder called LPCNet [2] in a fully commercialized TTS system.

The novelty of this vocoder is that it does not try to predict the complex speech signal directly with a DNN. Instead, the DNN predicts only the less complex glottal-excitation (LPC residual) signal and then uses LPC filters to convert it into the final speech signal.
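The signal-processing side of this idea can be illustrated in a few lines of Python: given a predicted residual/excitation signal and LPC coefficients derived from the acoustic features, an all-pole filter reconstructs the speech waveform. This shows only the filtering step; the actual LPCNet predicts the excitation sample by sample with frame-rate and sample-rate networks and updates the LPC coefficients every frame.

```python
import numpy as np
from scipy.signal import lfilter

def lpc_synthesize(excitation, lpc):
    """Reconstruct speech from a predicted residual/excitation signal using an
    all-pole LPC filter: s[n] = e[n] + sum_k a_k * s[n-k].
    `lpc` holds the coefficients a_1..a_p (kept fixed here for simplicity)."""
    a = np.concatenate(([1.0], -np.asarray(lpc)))  # denominator of 1 / A(z)
    return lfilter([1.0], a, excitation)

# Toy example: white-noise excitation through a stable 2nd-order filter
e = np.random.randn(16000) * 0.01
speech = lpc_synthesize(e, lpc=[1.3, -0.5])
print(speech.shape)
```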

Figure 4: LPCNet Neural Vocoder
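Putting the pieces together, the overall synthesis flow can be summarized with the pseudocode-style sketch below; all component and argument names are placeholders for the trained models and front-end, not actual APIs.

```python
def synthesize(text, front_end, prosody_net, synthesizer_net, lpcnet_vocoder):
    """Illustrative end-to-end flow of the modular TTS pipeline described above."""
    # 1. Language-specific front-end: text -> linguistic feature sequence
    linguistic_feats = front_end(text)
    # 2. Prosody prediction: one 4-dimensional vector per TTS unit
    prosody = prosody_net(linguistic_feats)
    # 3. Acoustic feature prediction at 10 ms frames
    acoustic_feats = synthesizer_net(linguistic_feats, prosody)
    # 4. Neural vocoder: acoustic features -> speech samples
    return lpcnet_vocoder(acoustic_feats)
```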

Voice Adaptation

Voice adaptation to a target speaker can be easily achieved by retraining the three networks on a small amount of data from the target speaker. In our paper, we present the results of adaptation experiments in terms of speech quality and similarity to the target speaker. Samples of adaptation to eight different VCTK [3] speakers (four male, four female) are available on this samples page.
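In practice, adaptation amounts to continuing the training of a pre-trained base network on the small target-speaker dataset. The sketch below shows the general idea for one network; the optimizer, learning rate and number of epochs are illustrative assumptions, not the settings used in our experiments.

```python
import torch

def adapt(model, adaptation_loader, epochs=10, lr=1e-4):
    """Fine-tune a pre-trained base network (prosody, synthesizer or vocoder)
    on a small adaptation set from the target speaker."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    model.train()
    for _ in range(epochs):
        for inputs, targets in adaptation_loader:  # a few minutes of target speech
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()
            optimizer.step()
    return model
```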

Listening Tests Results

The figure below shows the results of our crowd-sourced listening tests. For the quality evaluations, the MOS (Mean Opinion Score) values were obtained by averaging the quality scores (on a scale of 1-5) given by listeners to many synthesized and natural samples from the VCTK speakers. For the similarity evaluations, listeners were presented with pairs of samples and asked to rate the similarity between them (on a scale of 1-4).
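For illustration, this is how such ratings are aggregated into a MOS and a mean similarity score; the numbers below are placeholders, not our actual listening-test data.

```python
import numpy as np

# Placeholder listener ratings: quality on a 1-5 scale, similarity on a 1-4 scale
quality_scores = np.array([4, 5, 3, 4, 4, 5, 4, 3])
similarity_scores = np.array([3, 4, 3, 3, 4, 4])

mos = quality_scores.mean()       # Mean Opinion Score
sim = similarity_scores.mean()    # mean similarity rating
# 95% confidence interval for the MOS under a normal approximation
ci = 1.96 * quality_scores.std(ddof=1) / np.sqrt(len(quality_scores))
print(f"MOS = {mos:.2f} +/- {ci:.2f}, similarity = {sim:.2f}")
```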

We evaluated the quality of the synthesized speech, and its similarity to the target speaker, for female- and male-adapted voices trained on five, 10 and 20 minutes of target speech, as well as for natural speech from the target speakers.

The test results show that we can maintain both high quality and high similarity to the original speaker, even for voices that were trained on as little as five minutes of speech.

Figure 5: Quality and Similarity Listening Tests Results

This work was productized by IBM Watson and was the basis for a new IBM Watson TTS service release with upgraded quality voices (select “V3” voices in the IBM Watson TTS demo).

All authors of “High quality, lightweight and adaptable TTS using LPCNet” contributed to this article: Zvi Kons, Slava Shechtman, Alex Sorin, Carmel Rabinovitz, Ron Hoory.

References

[1] Z. Kons, S. Shechtman, A. Sorin, R. Hoory, C. Rabinovitz and E. Da Silva Morais, “Neural TTS Voice Conversion,” 2018 IEEE Spoken Language Technology Workshop (SLT), Athens, Greece, 2018, pp. 290-296.

[2] J. Valin and J. Skoglund, “LPCNet: Improving Neural Speech Synthesis through Linear Prediction,” ICASSP 2019, Brighton, United Kingdom, 2019, pp. 5891-5895.

[3] C. Veaux, J. Yamagishi and K. MacDonald, “CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit,” University of Edinburgh, The Centre for Speech Technology Research (CSTR), 2017. https://doi.org/10.7488/ds/1994

