About Text to Speech
The Text to Speech service was updated on July 14, 2017. The service now supports the MP3 (MPEG) audio format. For information about all recent changes to the service, see the Release notes.
The IBM® Text to Speech service provides an Application Programming Interface (API) that uses IBM's speech-synthesis capabilities to convert written text to natural-sounding speech. The service streams the results back to the client with minimal delay. The service offers the following features:
- HTTP and WebSocket interfaces: Supports speech synthesis via both HTTP REST and WebSocket interfaces. Both interfaces enable the use of SSML for all supported languages. The WebSocket interface also supports the SSML <mark> element as well as optional word timing information for all words of the input text to synchronize the audio and input, for example, for use with robots. See Using the HTTP interface and Using the WebSocket interface.
- Audio formats: Produces Ogg format with the Opus or Vorbis codec, Waveform Audio File Format (WAV), Free Lossless Audio Codec (FLAC), MP3 (Moving Picture Experts Group, or MPEG) format, Web Media (WebM) format with the Opus or Vorbis codec, Linear 16-bit Pulse-Code Modulation (PCM), mu-law (u-law), or basic audio. See Specifying an audio format.
- Voices: Synthesizes text to audio in a variety of languages, including English, French, German, Italian, Japanese, Spanish, and Brazilian Portuguese. For each language, the service offers at least one male or female voice, and sometimes both. It also covers different dialects, such as US and UK English and Castilian, Latin American, and North American Spanish. The audio uses appropriate cadence and intonation. See Specifying a voice.
- SSML: Accepts plain text or text that is tagged with the Speech Synthesis Markup Language (SSML), an XML-based markup language that provides annotations of text for speech synthesis applications. See Specifying SSML input.
- Expressiveness: Augments SSML with an expressive element that lets you indicate a speaking style of GoodNews, Apology, or Uncertainty. Currently available only for the US English Allison voice. See Using expressive SSML.
- Voice transformation: Extends SSML by adding a voice transformation element that lets you expand the range of possible voices by controlling aspects such as pitch, rate, and timbre. The service also offers two built-in virtual voices, Young and Soft. Currently available only for US English voices. See Using voice transformation SSML.
- Customization: Provides a customization interface that lets you specify how the service pronounces unusual words that occur in your input. You can define pronunciations with the International Phonetic Alphabet (IPA) or IBM Symbolic Phonetic Representation (SPR). See Understanding customization.
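As a rough illustration of how several of these features might combine in a single HTTP request, the following Python sketch builds (but does not send) a synthesis request that carries expressive SSML. The endpoint URL, the /v1/synthesize path, the voice query parameter, the en-US_AllisonVoice identifier, the audio/mp3 MIME type, and the <express-as> element name are assumptions that do not appear on this page; confirm them against the API reference before relying on them.

```python
import json
from urllib.parse import urlencode
from xml.sax.saxutils import escape, quoteattr

# Assumed endpoint, not stated on this page.
BASE_URL = "https://stream.watsonplatform.net/text-to-speech/api"
# Speaking styles named in the feature list above.
SPEAKING_STYLES = {"GoodNews", "Apology", "Uncertainty"}


def expressive_ssml(text, style):
    """Wrap text in an expressive SSML element.

    The element name <express-as> is an assumption for illustration.
    """
    if style not in SPEAKING_STYLES:
        raise ValueError(f"unsupported speaking style: {style}")
    # Escape the text so characters such as & and < stay valid XML.
    return (
        f"<speak><express-as type={quoteattr(style)}>"
        f"{escape(text)}</express-as></speak>"
    )


def build_synthesize_request(text, voice="en-US_AllisonVoice", accept="audio/mp3"):
    """Return (url, headers, body) for a hypothetical HTTP synthesize call.

    Nothing is sent over the network, and authentication is omitted.
    """
    url = f"{BASE_URL}/v1/synthesize?{urlencode({'voice': voice})}"
    headers = {"Content-Type": "application/json", "Accept": accept}
    body = json.dumps({"text": text})
    return url, headers, body


if __name__ == "__main__":
    ssml = expressive_ssml("Your order shipped today", "GoodNews")
    url, headers, body = build_synthesize_request(ssml)
    print(url)
    print(headers)
    print(body)
```

The returned URL, headers, and body could then be passed to any HTTP client together with your service credentials. Keeping request construction separate from sending makes it easy to inspect the SSML and the requested audio format before any call is made.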
For information about the pricing plans available for the service, see the Text to Speech service in Bluemix®.
The Text to Speech service can be used in voice-driven and screenless interfaces, as well as in interfaces for people with disabilities. It suits situations where audio is the preferred method of output, including home automation solutions, assistance tools for the vision-impaired, reading text and email messages aloud to drivers, video script narration and voice-over, and reading-based educational tools.
You can see a quick demo of the Text to Speech service in action. The demo lets you enter text from which you can generate speech with different voices, including expressiveness and transformation where supported. Applications in Watson Developer Cloud Starter Kits also demonstrate the Text to Speech service.
Questions and feedback
We are always looking to improve and learn from your experience with our services:
- Ask programming-related questions in the Watson forums on Stack Overflow.
- Submit comments or ask product-related questions about this service in the Watson forum on dW Answers.
- Read general posts about Watson services that are written by IBM researchers, developers, and other experts on the Watson blog.