The Text to Speech service is a concatenative system that relies on an inventory of acoustic units from a large synthesis corpus to produce output speech for arbitrary input text. It uses the following pipeline of processes, which performs an efficient, real-time search over this inventory of units and then post-processes the selected units:
Acoustic model: This model consists of a Decision Tree (DT) that is responsible for generating candidate units for the search. For each of the phones in a sequence of phones to be synthesized, the model takes the phone (plus a context of the preceding and following two phones) and produces a set of acoustic units that the search evaluates for fitness. This step effectively reduces the complexity of the search by restricting it to only those units that meet some contextual criteria and discarding all others.
Prosody target models: These models consist of Deep Recurrent Neural Networks (RNNs). The models are responsible for generating target values for prosodic aspects of the speech (such as duration and intonation) given a sequence of linguistic features extracted from the input text. These features include attributes such as part of speech, lexical stress, word-level prominence, and positional features (for example, the position of the syllable or word in the sentence). The prosody target models help guide the search toward those units that meet the prosodic criteria that the models predict.
Search: Given the list of candidates returned by the acoustic model and the target prosody, this module carries out a Viterbi search to extract a sequence of acoustic units that minimizes a cost function that takes into account both concatenation and target costs. As a result, audible artifacts from joining two units are minimized while trying to approximate the target prosody suggested by the prosody target models. This search also favors contiguous chunks in the synthesis corpus to further reduce such artifacts.
Waveform generation: Once the search has returned the optimal sequence of units, the system uses time-domain Pitch Synchronous Overlap and Add (PSOLA) to generate the output waveform. PSOLA is a digital signal-processing technique that is used for speech processing and, more specifically, for speech synthesis. It can modify the pitch and duration of a speech signal and blend the units that are returned by the search in a seamless way.
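The search step in the pipeline above can be illustrated with a small dynamic-programming sketch. This is a minimal illustration under stated assumptions, not the service's implementation: the `Unit` fields, the cost weights, and the functions `target_cost`, `concat_cost`, and `viterbi_select` are all hypothetical. It assumes that the acoustic model has already produced a non-empty candidate list for each phone and that the prosody target models have produced a (pitch, duration) target for each phone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    phone: str
    corpus_index: int  # position of the unit in a hypothetical synthesis corpus
    pitch: float       # Hz
    duration: float    # seconds


def target_cost(unit, target):
    """Distance between a unit and its prosody target (weights are arbitrary)."""
    pitch_t, dur_t = target
    return abs(unit.pitch - pitch_t) / 100.0 + abs(unit.duration - dur_t) * 10.0


def concat_cost(prev, cur):
    """Cost of joining two units; contiguous corpus units join for free."""
    if cur.corpus_index == prev.corpus_index + 1:
        return 0.0
    return 1.0 + abs(cur.pitch - prev.pitch) / 100.0


def viterbi_select(candidates, targets):
    """Return the unit sequence with the lowest total target + concatenation cost."""
    # best[i][j] = (cost of the cheapest path ending at candidates[i][j], backpointer)
    best = [[(target_cost(u, targets[0]), None) for u in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for u in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + concat_cost(p, u), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(u, targets[i]), back))
        best.append(row)

    # Trace back from the cheapest final unit.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total


# Toy data: units 10, 11, 12 are contiguous in the corpus; the others match
# the prosody targets exactly but come from unrelated corpus positions.
candidates = [
    [Unit("h", 10, 120.0, 0.08), Unit("h", 50, 118.0, 0.07)],
    [Unit("ay", 11, 125.0, 0.15), Unit("ay", 80, 120.0, 0.12)],
    [Unit("m", 12, 122.0, 0.09), Unit("m", 200, 115.0, 0.08)],
]
targets = [(118.0, 0.07), (120.0, 0.12), (115.0, 0.08)]

path, total = viterbi_select(candidates, targets)
print([u.corpus_index for u in path])  # [10, 11, 12]
```

In this toy example the contiguous chunk wins even though each of its units misses the prosody target slightly, because the other paths pay a join penalty at every boundary. That is the trade-off between target cost and concatenation cost that the search description refers to.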
For all of the linguistic features needed in the previous back-end processes, the service uses a text-processing front-end to parse the text before synthesizing it into audio form. This front-end sanitizes the text of any formatting artifacts such as HTML tags. It then uses a proprietary language that is driven by language-dependent linguistic rules to prepare the text and generate pronunciations. This module normalizes language-dependent features of the text such as dates, times, numbers, currency, and so on. For example, it performs abbreviation expansion from a dictionary and numeric expansion from rules for ordinals and cardinals.
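The kind of normalization described above can be sketched in a few lines of Python. The service itself uses a proprietary rule language, so this is only a stand-in: the `ABBREVIATIONS` dictionary, the function names, and the restriction to English numbers from 0 to 999 are assumptions made for illustration.

```python
import re

# Hypothetical abbreviation dictionary; a real one is much larger and
# context-sensitive ("St." can also mean "Saint").
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
IRREGULAR_ORDINALS = {"one": "first", "two": "second", "three": "third",
                      "five": "fifth", "eight": "eighth", "nine": "ninth",
                      "twelve": "twelfth"}


def cardinal(n):
    """Spell out a cardinal number in the range 0..999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    return ONES[hundreds] + " hundred" + (" " + cardinal(rest) if rest else "")


def ordinal(n):
    """Spell out an ordinal by rewriting the last word of the cardinal."""
    words = cardinal(n)
    match = re.search(r"[a-z]+$", words)
    prefix, last = words[:match.start()], match.group()
    if last in IRREGULAR_ORDINALS:
        last = IRREGULAR_ORDINALS[last]
    elif last.endswith("y"):
        last = last[:-1] + "ieth"
    else:
        last += "th"
    return prefix + last


def normalize(text):
    """Strip HTML tags, expand abbreviations, and spell out small numbers."""
    text = re.sub(r"<[^>]+>", " ", text)
    words = []
    for token in text.split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif re.fullmatch(r"\d{1,3}(st|nd|rd|th)", token):
            words.append(ordinal(int(token[:-2])))
        elif re.fullmatch(r"\d{1,3}", token):
            words.append(cardinal(int(token)))
        else:
            words.append(token)
    return " ".join(words)


print(normalize("<b>Dr.</b> Smith lives at 21 Elm St. on the 3rd floor"))
# Doctor Smith lives at twenty-one Elm Street on the third floor
```

A production front-end also has to handle dates, times, currency, and ambiguous abbreviations, which is why the rules are language-dependent rather than a fixed lookup table like this one.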
Some words have multiple permissible pronunciations. At runtime, the text-processing front-end first produces a single, canonical pronunciation for each word. Because this pronunciation may not match the one the speaker used when the audio corpus was recorded, the service augments the candidate set of pronunciations with alternative forms inventoried in an alternate-baseform dictionary. The search then chooses the forms that result in lower cost with respect to pitch, duration, and contiguity constraints. This algorithm facilitates selection of longer contiguous chunks from the data set, resulting in an optimal flow of speech in the synthesized result.
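As a rough illustration of how alternate baseforms can lower the search cost, the sketch below enumerates pronunciation combinations and keeps the one that fits a toy corpus best. Everything here is hypothetical: the `CANONICAL` and `ALTERNATES` dictionaries, the phone symbols, and a cost that only counts phone pairs that never occur contiguously in the corpus. In the service, the choice is made inside the unit search and also weighs pitch and duration; the exhaustive enumeration is only a stand-in for that.

```python
from itertools import product

# Hypothetical canonical pronunciations and alternate-baseform dictionary.
CANONICAL = {"either": ("IY", "DH", "ER"), "way": ("W", "EY")}
ALTERNATES = {"either": [("AY", "DH", "ER")]}


def candidate_pronunciations(word):
    """Canonical form first, then any alternate baseforms, without duplicates."""
    forms = [CANONICAL[word]]
    for alt in ALTERNATES.get(word, []):
        if alt not in forms:
            forms.append(alt)
    return forms


def contiguity_cost(phones, corpus):
    """Toy cost: adjacent phone pairs that never occur contiguously in the corpus."""
    seen = {(a, b) for utt in corpus for a, b in zip(utt, utt[1:])}
    return sum((a, b) not in seen for a, b in zip(phones, phones[1:]))


def best_pronunciation(words, corpus):
    """Return (cost, forms) for the cheapest combination of pronunciations."""
    best = None
    for combo in product(*(candidate_pronunciations(w) for w in words)):
        phones = [p for form in combo for p in form]
        cost = contiguity_cost(phones, corpus)
        if best is None or cost < best[0]:
            best = (cost, combo)
    return best


# The (hypothetical) speaker said "either" with an initial AY in the recording.
corpus = [["AY", "DH", "ER", "W", "EY"]]
cost, forms = best_pronunciation(["either", "way"], corpus)
print(cost, forms[0])  # 0 ('AY', 'DH', 'ER')
```

With only the canonical form, the pair IY, DH would have no contiguous match in this corpus. Adding the alternate form lets the whole phrase be taken as one contiguous chunk.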
Synthesizing text to speech is inherently complex, and a full explanation of the service requires more depth than this brief summary can accommodate. See the documents listed in the following section for more information about the scientific research behind the service.
For more detailed information about the research and technical background behind the Text to Speech service, see the following documents:
Eide, Ellen M., and Raul Fernandez. Database Mining for Flexible Concatenative Text-to-Speech. Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. 4 (2007): pp. 697-700.
Eide, Ellen, Raul Fernandez, Ron Hoory, Wael Hamza, Zvi Kons, Michael Picheny, Ariel Sagi, Slava Shechtman, and Zhi Wei Shuang. The IBM Submission to the 2006 Blizzard Text-to-Speech Challenge. In Blizzard Challenge Workshop 2006.
Fernandez, Raul, Asaf Rendel, Bhuvana Ramabhadran, and Ron Hoory. Using Deep Bidirectional Recurrent Neural Networks for Prosodic-Target Prediction in a Unit-Selection Text-to-Speech System. Proceedings Interspeech (2015), pp. 1606-1610.
Fernandez, Raul, Asaf Rendel, Bhuvana Ramabhadran, and Ron Hoory. Prosody Contour Prediction with Long Short-Term Memory, Bi-directional, Deep Recurrent Neural Networks. Proceedings Interspeech (2014), pp. 2268-2272.
Fernandez, Raul, Zvi Kons, Slava Shechtman, Zhi Wei Shuang, Ron Hoory, Bhuvana Ramabhadran, and Yong Qin. The IBM Submission to the 2008 Text-to-Speech Blizzard Challenge. In Blizzard Challenge Workshop 2008.
Fernandez, Raul, and Bhuvana Ramabhadran. Automatic Exploration of Corpus-Specific Properties for Expressive Text-to-Speech: A Case Study in Emphasis. Proceedings of the Sixth ISCA Workshop on Speech Synthesis (August 2007): pp. 34-39.
Fernandez, Raul, Raimo Bakis, Ellen Eide, Wael Hamza, John Pitrelli, and Michael A. Picheny. The 2006 TC-STAR Evaluation of the IBM Expressive Text-to-Speech Synthesis System. Speech-to-Speech Translation Workshop, Barcelona, Spain (2006), pp. 175-180.
Pitrelli, John F., Raimo Bakis, Ellen M. Eide, Raul Fernandez, Wael Hamza, and Michael A. Picheny. The IBM Expressive Text-to-Speech Synthesis System for American English. IEEE Transactions on Audio, Speech, and Language Processing, Vol. 14(4) (July 2006): pp. 1099-1108.
Rendel, Asaf, Raul Fernandez, Ron Hoory, and Bhuvana Ramabhadran. Using Continuous Lexical Embeddings to Improve Symbolic-Prosody Prediction in a Text-to-Speech Front End. Proceedings ICASSP (2016), pp. 5655-5659.
Shechtman, Slava. Maximum-Likelihood Dynamic Intonation Model for Concatenative Text to Speech System. Proceedings of the Sixth ISCA Workshop on Speech Synthesis (August 2007): pp. 234-239.
Shuang, Zhi-Wei, Raimo Bakis, Slava Shechtman, Dan Chazan, and Yong Qin. Frequency Warping Based on Mapping Formant Parameters. Proceedings of the Ninth International Conference on Spoken Language Processing (ICSLP), Interspeech (2006): pp. 2290-2293.