What is speech recognition?

Speech recognition is the ability of a program or system to identify and process human speech, most commonly by converting it into text. It is also referred to as voice recognition or speech-to-text.

According to Techopedia, speech recognition is “…the use of computer hardware and software-based techniques to identify and process the human voice. It is primarily used to convert spoken words into computer text. Additionally, automatic speech recognition is used for authenticating users via their voice and performing an action based on instructions defined by the human.” (1)

Common applications today include hands-free devices, dictation software and virtual assistants such as Siri and Alexa. Many businesses offer voice-activated call center services for more efficient call handling. Speech recognition also helps make driving safer by enabling voice-activated navigation systems and search capabilities for car radios.

Communicating with devices and applications by voice is gaining in popularity. The speech recognition market is expected to be worth $18 billion by 2023. (2)

Evolution of speech recognition

Speech recognition first came on the scene in the 1950s with Audrey, a voice-driven machine created by Bell Labs that could understand the spoken digits 0 through 9 with 90 percent accuracy. In 1962, IBM released Shoebox, the most advanced speech recognition machine of its time, which could understand 16 spoken words. It was followed in 1971 by Harpy, a system developed at Carnegie Mellon University that could recognize over 1,000 words.

Development accelerated through the 1980s and 90s. As computing power grew, so too did the number of terms the systems could understand. IBM released its VoiceType Simply Speaking software in 1996. The application had a 42,000-word vocabulary, supported English and Spanish, and included a spelling dictionary of 100,000 words.

Today, aided by cognitive and computational innovations, speech recognition programs can recognize a virtually limitless number of spoken words.

Complexities of human speech

The vagaries of human speech have made development challenging. Speech recognition is considered one of the most complex areas of computer science, involving linguistics, mathematics and statistics.

“Unlike text, which has a much greater level of standardization, the spoken word varies greatly based on regional dialects, speed, emphasis, even social class and gender,” says Clark Boyd at Medium.com. “Therefore, scaling any speech recognition system has always been a significant obstacle…In essence, we have spent hundreds of years teaching machines to complete a journey that takes the average person just a few years.” (3)

Science writer Chris Woodford outlines four approaches computers use to understand human speech (a brief sketch of the statistical approach follows the list): (4)

  • Simple pattern matching: Each spoken word is recognized in its entirety without the need for analysis.
  • Pattern and feature analysis: A word is broken into bits and recognized by features such as its vowels.
  • Language modeling and statistical analysis: Grammar and the probability of certain words following one another are used to improve recognition and accuracy.
  • Artificial neural networks: Brain-like computer models are used to recognize patterns, after extensive training.
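
To make the third approach concrete, here is a minimal sketch of a bigram language model. It uses a toy corpus rather than the massive text collections real systems train on, but it shows how the probability of one word following another lets a recognizer prefer the more plausible of two acoustically similar transcriptions:

```python
import math
from collections import Counter

# Toy training text; real systems learn from enormous corpora.
corpus = ("please recognize speech now "
          "we want to recognize speech quickly "
          "speech recognition converts speech to text").split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab_size = len(unigrams)

def bigram_prob(prev, word, alpha=0.1):
    """P(word | prev) with add-alpha smoothing, so unseen pairs score low, not zero."""
    return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab_size)

def avg_logprob(sentence):
    """Length-normalized log-probability of a word sequence."""
    words = sentence.split()
    pairs = list(zip(words, words[1:]))
    return sum(math.log(bigram_prob(p, w)) for p, w in pairs) / len(pairs)

# Two acoustically similar candidates: the model assigns a higher
# score to the word sequence it has actually seen evidence for.
print(avg_logprob("recognize speech"))    # higher score
print(avg_logprob("wreck a nice beach"))  # lower score
```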

In the area of neural networks, natural language processing and cognitive computing are being used to replicate ever more closely how humans think and talk. Reaching human parity – an error rate on par with that of a human transcribing the same conversation – has long been the goal. And speech recognition is getting close.

Why speech recognition is important

Speech recognition provides nearly endless opportunities – from improving ease of access to driving a vehicle with fewer distractions. Voice-based authentication also adds a viable layer of security to many systems.

Consider other examples:

  • A call center needs to transcribe thousands of recorded conversations between customers and agents to identify common call patterns and issues.
  • A medical service wants to create a dictation application for doctors to capture and log patient diagnoses and treatment notes.
  • A retail company wants to extend its sales engagement with customers through an online conversational application.

A good speech recognition system can support these kinds of tasks, bringing benefits such as speed, efficiency and cost savings. It also helps to free up humans for more complex work.

For consumers, voice recognition offers convenience, accessibility, even safety. The market for virtual assistants alone is growing at an exponential rate. According to Annet Aris, INSEAD Senior Affiliate Professor of Strategy: “The voice ecosystem is developing so fast that it’s estimated a staggering 75 percent of households in the United States will own a voice-activated smart speaker within the next two years.” (5)

In his IBM blog, Audioburst CTO and Co-Founder Gal Klein explains: “Because audio can be consumed easily while working out, traveling, driving, eating and in any number of other scenarios where it’s difficult or dangerous to be looking at a screen, eyes-free content is one of the fastest growing content-types on the internet.”

The company is using IBM Watson and in-house natural language processing and segmentation algorithms to create an audio library that can be searched by voice.

“We transcribe millions of minutes a month of audio content. Such content may feature speech, speaker change, music, laughter, clapping, silence and everything else you can expect from an audio program or podcast,” says Klein.

“Audioburst can detect all those audio cues to help our segmentation algorithms understand exactly when a topic segment begins and when it ends. The audio data is then organized according to topic and stored in our repository for search – which means that content can often be found within seconds of it airing.”

Voice-driven audio search is one of the many ways researchers and developers are expanding the reach and potential of speech recognition.


Key features of effective speech recognition

Many speech recognition applications and devices are available, and the more advanced solutions use AI and machine learning. They integrate knowledge of grammar, syntax and structure with the composition of audio and voice signals to understand and process human speech. Ideally, they learn as they go, evolving their responses with each interaction.

The best systems also allow organizations to customize and adapt the technology to their specific requirements, from language and nuances of speech to brand recognition. For example (a brief code sketch follows the list):

  • Language weighting: Improve precision by weighting specific words that are spoken frequently (such as product names or industry jargon), beyond terms already in the base vocabulary.
  • Speaker labeling: Output a transcription that cites or tags each speaker’s contributions to a multi-participant conversation.
  • Acoustics training: Train the system to adapt to an acoustic environment (like the ambient noise in a call center) and to speaker styles (like voice pitch, volume and pace).
  • Profanity filtering: Use filters to identify certain words or phrases and sanitize speech output.
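
As a concrete illustration, here is a minimal sketch using the ibm-watson Python SDK. The API key, service URL and file name are placeholders, and the commented-out customization_id stands in for a custom language model (with weighted domain terms) that you would train separately:

```python
from ibm_watson import SpeechToTextV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

# Placeholder credentials; substitute your own service instance values.
authenticator = IAMAuthenticator("YOUR_API_KEY")
stt = SpeechToTextV1(authenticator=authenticator)
stt.set_service_url("YOUR_SERVICE_URL")

with open("call.wav", "rb") as audio:  # hypothetical recording
    response = stt.recognize(
        audio=audio,
        content_type="audio/wav",
        speaker_labels=True,       # tag each speaker's contributions
        profanity_filter=True,     # mask flagged words in the output
        # customization_id="...",  # custom language model holding weighted
        #                          # product names or industry jargon
    ).get_result()

for result in response["results"]:
    print(result["alternatives"][0]["transcript"])
```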

Meanwhile, speech recognition continues to advance. Companies like IBM are making inroads in several areas to further improve interaction between humans and machines.

Speaker diarization

IBM uses Watson and diarization (algorithms that identify and segment speech by speaker identity) to help programs better distinguish individuals in a conversation.

“In a live conversation, the exchanges quickly shift back-and-forth between speakers, which makes it hard for a system to develop a speech model on a particular speaker before another person begins talking,” says Michael Picheny in his Watson developers blog.

“We developed a system that would recognize the variations in spectral frequency content of each speaker during a conversation. Think of the nuances you might hear if you blow across the top of a soda bottle — our voice frequencies work in much the same way…What Watson is able to do is instantly build each of these profiles and assign the output text to specific speakers.”
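
IBM's production system is far more sophisticated, but a deliberately simplified sketch of the underlying idea, clustering short-term spectral features by speaker, might look like this (assuming the librosa and scikit-learn libraries and a hypothetical two-speaker recording):

```python
import librosa
from sklearn.cluster import KMeans

# Hypothetical two-speaker recording.
y, sr = librosa.load("conversation.wav", sr=16000)

# MFCCs capture the short-term spectral "fingerprint" of each voice,
# the kind of frequency content described above.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T  # shape: (frames, 13)

# Cluster the frames into two groups, one per speaker.
frame_labels = KMeans(n_clusters=2, n_init=10).fit_predict(mfcc)

# Merge consecutive frames with the same label into rough speaker turns.
hop_seconds = 512 / sr  # librosa's default hop length
start = 0
for i in range(1, len(frame_labels)):
    if frame_labels[i] != frame_labels[i - 1]:
        print(f"{start * hop_seconds:6.2f}s-{i * hop_seconds:6.2f}s  speaker {frame_labels[i - 1]}")
        start = i
print(f"{start * hop_seconds:6.2f}s-end  speaker {frame_labels[-1]}")
```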

Human parity

Researchers have worked for decades to reach human speech accuracy, which is estimated to be around a 4 percent word error rate. Recently, IBM achieved a new industry record of 5.5 percent.

“It was measured on a very difficult speech recognition task: recorded conversations between humans discussing day-to-day topics like buying a car,” explains George Saon in his IBM Watson blog.

“IBM researchers focused on extending our application of deep learning technologies. We combined Long Short Term Memory and WaveNet language models with three strong acoustic models…The unique thing about the last model is that it not only learns from positive examples but also takes advantage of negative examples – so it gets smarter as it goes and performs better where similar speech patterns are repeated.”
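
The word error rate behind these parity figures is the number of word-level substitutions, deletions and insertions between a reference transcript and the system's output, divided by the length of the reference. A minimal implementation, with made-up sentences, is shown below; one wrong word out of twenty yields the 5 percent scale discussed above:

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / reference length,
    computed with the standard edit-distance dynamic program."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# Made-up example: one substituted word out of twenty.
ref = "i would like to buy a used car with low mileage and a clean history if possible this week please"
hyp = "i would like to buy a used car with low mileage and a clean history if possible this weak please"
print(f"{word_error_rate(ref, hyp):.1%}")  # 5.0%
```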

Blogs

Watson Speech to Text is paying attention to what people are saying

Read about the value of adding speech transcription capabilities to your applications.

Freedom and flexibility with speech to text

Read how customizable speech solutions help people leverage their industry expertise and deliver value.

How human should a chatbot be?

Learn about the challenges and opportunities of implementing chatbots.

Audio analytics: The sounds of systems

Read about some of the developments in automatic speaker recognition.

Featured solutions

Watson Speech to Text

Easily convert audio and voice into written text for quick understanding of content.

Watson Assistant

Build conversational interfaces into any application, device, or channel.

Watson Tone Analyzer

Use linguistic analysis to detect emotional and language tones in written text.