What is text to speech?

2 December 2024

Authors

Charlotte Hu

IBM Content Contributor

Amanda Downie

Inbound Content Lead, AI Productivity & IBM Consulting

Text to speech (TTS) is a type of technology that converts text on a digital interface into natural-sounding audio. It can also be referred to as “read aloud” technology, computer-generated speech or speech synthesis. Most companies offer text to speech technology as an application programming interface (API).

Originally, TTS systems were developed as an assistive technology to make certain services more accessible to users with visual impairments and learning disabilities like dyslexia. Now, artificial intelligence-powered voice generators are enabling text to speech software to mimic human speech better, opening up a wave of new use cases such as customer service call answering, AI-generated podcasts, voiceovers and audiobook narration.

Evolution of text to speech

The first electric speech synthesizers popped up around the 1930s.¹ The early machines were limited and complicated to operate.

As computers came along, programmers starting in the late 1950s worked on algorithms that could draw on a large database of audio files as their source sounds. These algorithms would find sound matches for units of text and piece the speech elements together. Early on, the generated voice sounded robotic. As modeling work characterized language better, the algorithms for turning text to speech improved.

When deep learning techniques and neural networks emerged in the 2000s, programmers started modeling waveforms directly with recordings of speech, which led to high-quality voices that sounded more realistic. In parallel, computer scientists were refining speech recognition software and natural language processing. The development of conversational AI hinged on combining speech to text with text to speech technology.

Although AI and machine learning made it easier to generate natural-sounding speech, they opened new areas of controversy, such as deepfakes. Technology companies are working on developing real-time voice analysis systems in order to detect audio deepfakes.

How does text to speech work?

Deep learning techniques allow speech synthesis models to parse through more data and better understand the relationship between words and their acoustic features. All of this makes the AI voice sound more natural. Converting text to speech is a multi-step process that involves both linguistic analysis and speech synthesis.

The main components of text to speech are:

  • Linguistic analysis

  • Speech synthesis

Linguistic analysis

During training, deep neural networks in the model are given audio datasets and corresponding transcriptions in English and sometimes other languages. This helps the system learn how words match up with speech, as well as with accents, pitch, volume, tone, rhythm and more. After it receives a text input, the text to speech model analyzes the words, punctuation and sentence structure. It can expand abbreviations and expressions, calculate the duration of words, find the matching pronunciations and plot out the prosody of phrases and sentences.
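
To make the "expand abbreviations and expressions" step more concrete, here is a minimal Python sketch of text normalization. It is an illustration only, not how any particular TTS product works: the abbreviation table, the function names and the choice to read "St." as "Street" are all assumptions made for the demo, and production systems rely on much larger, context-aware rules or learned models.

```python
import re

# Hypothetical abbreviation table for the demo. "St." is assumed to mean
# "Street" here; a real system would need context to rule out "Saint".
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "approx.": "approximately"}

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def normalize(text: str) -> str:
    """Expand known abbreviations, then spell out one- and two-digit numbers."""
    for abbreviation, expansion in ABBREVIATIONS.items():
        text = text.replace(abbreviation, expansion)
    # Longer numbers are left untouched to keep the sketch small.
    return re.sub(r"\b\d{1,2}\b",
                  lambda match: number_to_words(int(match.group())),
                  text)


print(normalize("Dr. Lee lives at 42 Elm St."))
```

The normalized string is what later stages would look up pronunciations for; duration and prosody prediction are left out of this sketch.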

Speech synthesis

After the text gets analyzed, the model then uses a two-step process to turn it into a voice output.

  • Step 1: The model transforms the text into time-aligned features such as a spectrogram, which maps how frequencies vary over time. This captures the detailed characteristics of speech and factors in context-dependent pronunciations, stresses and timings of words.

  • Step 2: A voice encoder (vocoder) network turns the time-aligned features into audio waveforms, which computers can play back as natural-sounding speech. Certain text to speech models allow users to alter volume, pitch and speed, and to choose between different languages, accents and speaking styles.
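
The two data representations in these steps can be illustrated with a small, standard-library-only Python sketch. It does not synthesize speech from text: real systems use neural networks to predict the spectrogram and a neural vocoder to render audio. Here, as a stand-in, a spectrogram is computed from a synthetic two-tone signal with a naive discrete Fourier transform, and a toy "vocoder" renders one sine tone per frame. The sample rate, frame size and all function names are assumptions chosen for the demo.

```python
import cmath
import math

SAMPLE_RATE = 8000  # samples per second (assumed for the demo)
FRAME_SIZE = 64     # samples per analysis frame


def sine_wave(freq_hz: float, n_samples: int) -> list[float]:
    """Generate a pure tone, standing in for recorded audio."""
    return [math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE)
            for t in range(n_samples)]


def magnitude_spectrum(frame: list[float]) -> list[float]:
    """Naive DFT magnitudes for frequency bins 0 .. N/2."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectrogram(signal: list[float]) -> list[list[float]]:
    """Split into non-overlapping frames: rows are time, columns frequency."""
    return [magnitude_spectrum(signal[i:i + FRAME_SIZE])
            for i in range(0, len(signal) - FRAME_SIZE + 1, FRAME_SIZE)]


def dominant_frequencies(spec: list[list[float]]) -> list[float]:
    """Strongest frequency (in Hz) in each time frame."""
    bin_hz = SAMPLE_RATE / FRAME_SIZE
    return [max(range(len(frame)), key=frame.__getitem__) * bin_hz
            for frame in spec]


def toy_vocoder(freqs_hz: list[float]) -> list[float]:
    """Stand-in for step 2: render one sine frame per time-aligned feature."""
    samples: list[float] = []
    for freq in freqs_hz:
        samples.extend(sine_wave(freq, FRAME_SIZE))
    return samples


signal = sine_wave(500, 128) + sine_wave(1000, 128)
spec = spectrogram(signal)
features = dominant_frequencies(spec)
audio = toy_vocoder(features)
print(features)  # one dominant frequency per frame
```

Keeping only the strongest frequency per frame throws away almost everything that makes a voice sound human, which is why practical systems keep the full spectrogram and use a learned vocoder.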

Many devices like smartphones have text to speech systems built in. Text to speech is also available as a software program, a browser extension, a web-based tool or downloadable apps.

Uses of text to speech

Text to speech technology was originally developed as a way to increase accessibility for a wide range of users and enable people with visual impairments or reading disabilities to interact with texts through computers and other devices. Stephen Hawking, for example, used a version of text to speech technology.

Text to speech has since expanded to a wider range of use cases, chiefly ones where reading isn’t practical or where it can save a human operator’s time. Here are some of the main applications for the technology.

  • Audio content

  • Education

  • Chatbots and virtual assistants

  • Navigation

  • Multilingual communication and language learning

  • Media and entertainment

  • Healthcare

Audio content

Text to speech software can read aloud digital texts, books, lessons, guides, instructions and more to assist with e-learning and online training. News organizations can also use this technology to convert their articles into an audio format.

Education

Text to speech features can help students pay attention and read along with written text, allowing them to associate words with pronunciations. They can improve reading comprehension and engagement as students are exposed to new grammar structures or vocabulary, and they can assist those with visual difficulties or learning disabilities such as dyslexia. Text to speech can also read aloud written works produced by students to help them proofread essay assignments.

Chatbots and virtual assistants

Virtual assistants like Apple’s Siri or Microsoft’s Cortana pair text to speech with speech to text in order to understand user requests and respond in a natural, conversational way. They can also broadcast notifications and read out texts, for example when users are driving.

In enterprise settings, TTS systems can enhance the quality of user experiences by making customer service feel more interactive and natural. TTS systems can answer calls, present options and respond to users. They are a key part of automated phone systems.

Navigation

Text to speech capabilities are what allow GPS and other mapping apps to relay directions to drivers in real time. Before text to speech, navigation devices relied on pre-recorded voices and set prompts such as “turn left” or “turn right.” With text to speech, the driving instructions became more personalized. For example, GPS can say the exact street onto which you should turn left.
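
The difference between set prompts and generated instructions can be sketched in a few lines of Python. This is a hypothetical illustration: the prompt table, function names and sentence template are assumptions, and the resulting string would still need to be passed to a text to speech engine to be spoken.

```python
# Pre-recorded approach: only a fixed set of prompts can ever be played.
PRERECORDED_PROMPTS = {"left": "Turn left", "right": "Turn right"}


def fixed_prompt(direction: str) -> str:
    """Return the one canned prompt that exists for this direction."""
    return PRERECORDED_PROMPTS[direction]


def dynamic_instruction(direction: str, street: str, distance_m: int) -> str:
    """Compose instruction text on the fly; a TTS engine would then voice it."""
    return f"In {distance_m} meters, turn {direction} onto {street}"


print(fixed_prompt("left"))
print(dynamic_instruction("left", "Maple Avenue", 200))
```

Because the text is composed at run time, any street name in the map data can be spoken without recording it in advance.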

Multilingual communication and language learning

Text to speech can help users communicate in different languages, for example, through an app like Google Translate. This type of app feature can translate audio from one language to another, which might be used to dub video content. It can help expose language learners to natural speech, which can help them understand how different words are pronounced.

Media and entertainment

As TTS technology advances, it can be used to save costs in media production. For example, the technology might generate commentary and narration in video games as well as voiceovers for the characters. Some studios work with human voice actors to help improve the performance of their AI voices.

Healthcare

Healthcare organizations use text to speech technology to communicate with patients in an accessible way. This includes adding audio versions of content and literature posted on their web pages or social media. Some institutions will also add audio-guided instructions on how to use certain medical devices. Generative AI-powered voice interfaces can also help remind patients of upcoming appointments through calls, or alert them to news or updates to their charts. This can be especially important for patients with visual impairments, speech problems, mobility limitations and learning disabilities.

Footnotes

¹ Text-to-Speech Technology (Speech Synthesis), American National Standards Institute, 7 December 2015
