Speech to text is the process of converting spoken words into a text transcript. Sometimes referred to as voice to text, it is available mostly as software as a service (SaaS).
It typically combines artificial intelligence-powered speech recognition technology, also known as automatic speech recognition, with transcription. A computer program picks up audio in the form of sound wave vibrations and uses linguistic algorithms to convert the audio input into digital characters, words and phrases.
Machine learning, deep learning and large language models such as OpenAI’s Generative Pre-Trained Transformer (GPT) have made speech to text software more advanced and efficient because they can glean patterns in spoken language from a large volume of audio and text samples.
Generative AI can be integrated with speech to text software to create assistants that can help customers over a phone call, or interact with voice-enabled apps. Generative AI can also convert text back to speech, otherwise known as text to speech, in a realistic, natural-sounding voice.
Speech to text software contains several components. These include:
Speech input: where a microphone captures spoken words
Feature extraction: where the computer identifies distinctive pitches and patterns in the speech
Decoder: where the algorithm matches the speech features to characters and words through a language model
Word output: where the final text is formatted with the correct punctuation and capitalizations so that it’s human-readable
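The four components above can be sketched as a toy pipeline. Everything here is invented for illustration (the sample values, the phoneme labels and the one-entry lexicon); a real system replaces each stage with trained acoustic and language models.

```python
# Toy illustration of the speech to text pipeline stages. The phoneme
# labels and lexicon are invented; this is not a real recognizer.

def capture_speech():
    # Speech input: stand-in for microphone samples (amplitude values).
    return [0.0, 0.4, 0.9, 0.4, 0.0, -0.4, -0.9, -0.4]

def extract_features(samples):
    # Feature extraction: here, just the energy of each 4-sample frame.
    frames = [samples[i:i + 4] for i in range(0, len(samples), 4)]
    return [sum(s * s for s in frame) for frame in frames]

def decode(features):
    # Decoder: map feature patterns to phonemes, then phonemes to words
    # through a (here, one-entry) lexicon.
    phonemes = ["HH", "AY"] if features else []
    lexicon = {("HH", "AY"): "hi"}  # invented lexicon entry
    return lexicon.get(tuple(phonemes), "")

def format_output(word):
    # Word output: capitalize and punctuate for readability.
    return word.capitalize() + "." if word else ""

text = format_output(decode(extract_features(capture_speech())))
print(text)  # Hi.
```

The value of the sketch is the shape of the data flow: raw samples in, numeric features, symbolic phonemes, then formatted text out.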
Generally, the speech to text process is composed of the following steps:
Audio preprocessing: After audio recordings are captured, they are preprocessed to improve the quality and accuracy of recognition. This includes removing background noises and irrelevant frequencies, stabilizing the volume level, segmenting the clip for easier processing and converting the audio file into a standard format.
Sound analysis and feature extraction: Voice signals are often depicted as spectrograms (link resides outside of ibm.com), which are visual representations of frequencies across time.1 The relevant portions of the audio recordings are broken down into a sequence of phonemes, which are the smallest units of speech that distinguish one word from another. The major classes of phonemes are vowels and consonants (link resides outside of ibm.com).2 Language models and decoders can match phonemes to words and then sentences. Deep learning-based acoustic models can predict what characters and words are likely to occur next based on context.
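The feature-extraction step above can be illustrated with plain Python: split the audio into overlapping frames and compute each frame's magnitude spectrum with a direct DFT. Stacking those spectra over time is exactly what a spectrogram visualizes. This is a minimal sketch (a direct DFT, a synthetic test tone); production systems use FFTs and further transforms such as mel filtering.

```python
import math

# Split audio into overlapping frames, then compute each frame's
# magnitude spectrum. A spectrogram is these spectra stacked over time.

def frame_signal(samples, frame_size, hop):
    return [samples[i:i + frame_size]
            for i in range(0, len(samples) - frame_size + 1, hop)]

def magnitude_spectrum(frame):
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):  # non-negative frequency bins only
        re = sum(frame[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = -sum(frame[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        spectrum.append(math.hypot(re, im))
    return spectrum

# Synthetic 1 kHz tone sampled at 8 kHz: with 16-sample frames the bin
# spacing is 8000 / 16 = 500 Hz, so the energy lands in bin 2.
sr, freq = 8000, 1000.0
samples = [math.sin(2 * math.pi * freq * t / sr) for t in range(64)]
spectrogram = [magnitude_spectrum(f) for f in frame_signal(samples, 16, 8)]
peak_bin = max(range(len(spectrogram[0])), key=lambda k: spectrogram[0][k])
print(peak_bin)  # 2
```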
There are three main methods for performing speech recognition: synchronous, asynchronous and streaming.
Synchronous recognition is when speech is converted to text immediately and the full transcript is returned in a single response. It can only process audio files shorter than one minute, which makes it suitable for short clips such as voice commands.
Streaming recognition is when streamed audio is processed in real time, so partial transcripts might appear while the user is still speaking. This suits live use cases such as captioning.
Asynchronous recognition is when large prerecorded audio files are submitted for transcription. It might be queued for processing and delivered later.
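The three modes differ mainly in their call pattern, which the sketch below mimics. `SpeechClient` and its methods are invented stand-ins, not any real vendor's API; the cloud services named later each expose their own equivalents.

```python
# Hypothetical client illustrating the three recognition call patterns.
# All names and return values are invented for demonstration.

class SpeechClient:
    def recognize(self, audio):
        # Synchronous: blocks and returns the full transcript at once.
        return "hello world"

    def start_batch_job(self, audio_uri):
        # Asynchronous: submits a long recording and returns a job
        # handle that is polled later for the finished transcript.
        return {"job_id": "job-1", "status": "QUEUED"}

    def streaming_recognize(self, chunks):
        # Streaming: yields growing partial hypotheses while audio
        # is still arriving.
        partial = ""
        for chunk in chunks:
            partial += chunk
            yield partial

client = SpeechClient()
sync_text = client.recognize(b"...short clip...")
job = client.start_batch_job("s3://bucket/meeting.wav")
partials = list(client.streaming_recognize(["hel", "lo"]))
print(sync_text, job["status"], partials)
```

The design difference matters for integration: synchronous calls fit request-response apps, batch jobs fit archives of recordings, and streaming fits live audio.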
Companies such as Google3 (link resides outside ibm.com), Microsoft4 (link resides outside ibm.com), Amazon5 (link resides outside ibm.com) and IBM® offer speech to text software as APIs through the cloud, which allows it to be used in concert with other applications, tools and devices.
Apple iPhones have a dictation feature (link resides outside ibm.com) that uses speech to text technology built into iOS.6 Android users can download apps such as Gboard (link resides outside ibm.com) for speech to text functions. Some Pixel devices allow users to type with their voice through the Assistant.7 There are various options for both open source and proprietary speech to text software.
Early on in its evolution, speech recognition software relied on a limited vocabulary bank. Its recent adoption by industries from automotive to healthcare has been aided by advancements in data science, deep learning and artificial intelligence.
In the 1950s, Bell Laboratories built the first speech recognition system (link resides outside ibm.com), called AUDREY, which could recognize spoken digits.8 IBM followed with Shoebox in 1962, which could recognize numbers and 16 different words.
During the following decades (link resides outside ibm.com), computer scientists developed phoneme-recognition approaches and statistical models such as hidden Markov models, which remain popular algorithms for speech recognition.9 In the 1970s, a Carnegie Mellon program called HARPY enabled computers to recognize about 1,000 words.
In the 1980s, IBM’s transcription system Tangora used statistical methods to recognize up to 20,000 words. It was used in the first voice-activated dictation for office workers and set the foundation for modern speech to text software, which continued to be developed and improved before being widely commercialized in the 2000s.
As machine learning and deep learning algorithms matured, they replaced purely statistical models, improving recognition accuracy and allowing applications to scale. Deep learning can better capture nuances and informal expressions. Large language models (LLMs) can be used to add context, which helps when word choices are ambiguous or pronunciation varies with accent. As virtual assistants and smart speakers emerged, they integrated speech to text with large language models, natural language processing (NLP) and other cloud-based services.
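The "adding context" idea can be shown with a deliberately tiny stand-in for a language model: given acoustically identical candidates (homophones), pick the one most probable after the preceding word. The bigram counts here are invented; real systems learn such statistics, at vastly larger scale, from text corpora.

```python
# Toy disambiguation of homophones using invented bigram counts.
# A real LLM plays the same role with far richer context.

bigram_counts = {
    ("want", "to"): 90, ("want", "too"): 2, ("want", "two"): 1,
    ("buy", "two"): 40, ("buy", "too"): 1, ("buy", "to"): 3,
}

def pick_word(previous_word, candidates):
    # Choose the candidate most often seen after previous_word.
    return max(candidates,
               key=lambda w: bigram_counts.get((previous_word, w), 0))

homophones = ["two", "too", "to"]
print(pick_word("want", homophones))  # to
print(pick_word("buy", homophones))   # two
```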
End-to-end deep learning models such as transformers are fundamental to large language models. They are trained on large datasets of audio-text pairs to learn how to map audio signals to transcriptions.
During this training, the model implicitly learns how words sound and what words are likely to show up in a sequence together. The model can also infer grammar and language structure rules to apply on its own. Deep learning consolidates some of the more tedious steps of traditional speech to text techniques.
There are various use cases for speech to text software:
Speech to text software can automatically transcribe customer interactions, route calls as needed, derive insights from customer conversations and perform sentiment analysis.
Example: For customer service call centers, AI voice assistants can use speech to text to handle the easier, more repetitive questions from customers and direct more complex requests to human agents.
It can transcribe minutes from online meetings or webinars and create subtitles, captions or dubs for videos. Combined with translation software, it can produce transcripts in multiple languages. Special-purpose applications support transcription for healthcare, legal and education use cases.
Example: Amazon (link resides outside ibm.com) offers a medical transcription service that uses speech to text to transcribe doctor and patient conversations for clinical notes, and subtitle telehealth consultations.10
Through natural language processing, voice recognition can derive meaning from the transcribed text and pull out actionable commands and carry them out. This can help users issue voice commands like making phone calls, searching the web or controlling the lights, thermostats and other connected devices in a smart home through chatbots or digital assistants like Alexa, Cortana, Google Assistant and Siri.
Example: Amazon’s Alexa (link resides outside ibm.com) is now using speech to text and text to speech to turn on lights, adjust the temperature in a certain room or suggest recipes based on your recent grocery purchases.11
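The step of pulling an actionable command out of a transcript can be sketched with simple pattern matching. The intent names, patterns and room list here are invented; real assistants use trained NLP models rather than keyword rules.

```python
# Minimal sketch of turning a transcript into a smart-home command.
# Intents, patterns and room names are invented for illustration.

def parse_command(transcript):
    text = transcript.lower()
    if "turn on" in text and "light" in text:
        return {"intent": "lights_on", "room": _find_room(text)}
    if "temperature" in text or "thermostat" in text:
        return {"intent": "set_temperature", "room": _find_room(text)}
    return {"intent": "unknown"}

def _find_room(text):
    for room in ("kitchen", "bedroom", "living room"):
        if room in text:
            return room
    return "default"

cmd = parse_command("Turn on the lights in the kitchen")
print(cmd)  # {'intent': 'lights_on', 'room': 'kitchen'}
```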
People with disabilities can use these apps to interact with computers and smartphones without having to physically type. They can instead dictate text messages, notes, emails and more.
Example: Students who have dyslexia or recently injured their arms can still type notes by using their voice on a Microsoft computer (link resides outside ibm.com).12 This capability is powered by Azure Speech services.
AI can comb through transcripts of videos and audio clips to scan for inappropriate content and act as a moderator to flag questionable materials for human review.
Example: Vatis Tech (link resides outside ibm.com) offers a tool that uses speech to text for social media monitoring in marketing so it can help brands identify when they’re trending, and the intent behind customer interactions.13
1. From Sound to Images, Part 1: A deep dive on spectrogram creation (link resides outside ibm.com), Cornell Lab Macaulay Library, 19 July 2021
2. Lecture 12: An Overview of Speech Recognition (link resides outside ibm.com), University of Rochester Computer Science
3. Turn speech into text using Google AI (link resides outside ibm.com), Google Cloud
4. Speech to text REST API (link resides outside ibm.com), Microsoft
5. Amazon Transcribe API reference (link resides outside ibm.com), AWS
6. iPhone User Guide (link resides outside ibm.com), Apple
7. Type with your voice (link resides outside ibm.com), Google Support
8. Audrey, Alexa, Hal, and more (link resides outside ibm.com), Computer History Museum, 9 June 2021
9. Speech Recognition: Past, Present, Future (link resides outside ibm.com), Carnegie Mellon University Computer Science
10. Amazon Transcribe Medical (link resides outside ibm.com), AWS
11. Alexa unveils new speech recognition, text-to-speech technologies (link resides outside ibm.com), Amazon, 20 September 2023
12. Use voice typing to talk instead of type on your PC (link resides outside ibm.com), Microsoft
13. Media Monitoring Intelligence - Turn any Audio to Insights (link resides outside ibm.com), Vatis Tech