Speech recognition first came on the scene in the 1950s with a voice-driven machine named Audrey. Created by Bell Labs, it could understand the spoken digits 0 through 9 with roughly 90 percent accuracy. In 1962, IBM demonstrated Shoebox, the most advanced speech recognition machine of its time, which could understand 16 spoken words. It was followed by Harpy, completed at Carnegie Mellon University in 1976, which could recognize more than 1,000 words.
Development accelerated through the 1980s and 90s. As computing power grew, so too did the number of terms the systems could understand. IBM released its VoiceType Simply Speaking software in 1996. The application had a 42,000-word vocabulary, supported English and Spanish, and included a spelling dictionary of 100,000 words.
Today, aided by cognitive and computational innovations, speech recognition programs can recognize a virtually limitless number of spoken words.
Complexities of human speech
The vagaries of human speech have made development challenging. Speech recognition is considered one of the most complex areas of computer science, drawing on linguistics, mathematics, and statistics.
“Unlike text, which has a much greater level of standardization, the spoken word varies greatly based on regional dialects, speed, emphasis, even social class and gender,” says Clark Boyd at Medium.com. “Therefore, scaling any speech recognition system has always been a significant obstacle…In essence, we have spent hundreds of years teaching machines to complete a journey that takes the average person just a few years.” (3)
Science writer Chris Woodford outlines four approaches used by computers to understand human speech: (4)
- Simple pattern matching: Each spoken word is recognized in its entirety without the need for analysis.
- Pattern and feature analysis: A word is broken into bits and recognized by features such as its vowels.
- Language modeling and statistical analysis: Grammar and the probability of certain words following one another are used to improve recognition and accuracy.
- Artificial neural networks: Brain-like computer models are used to recognize patterns, after extensive training.
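The third approach above, language modeling, can be illustrated with a toy sketch. The example below is not from any production system; the tiny corpus, the candidate phrases, and the add-one smoothing are all illustrative assumptions. It shows how a bigram model can prefer a candidate transcription whose word sequence is more probable, even when the candidates sound nearly identical.

```python
from collections import Counter

# Tiny illustrative training corpus; real systems train on billions of words.
corpus = "recognize speech with a model that can recognize speech well".split()

bigrams = Counter(zip(corpus, corpus[1:]))   # counts of adjacent word pairs
unigrams = Counter(corpus)                   # counts of single words

def bigram_prob(prev, word):
    # Add-one (Laplace) smoothing so unseen word pairs still get a
    # small nonzero probability instead of zeroing out the whole score.
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def score(sentence):
    # Probability of the sentence as the product of its bigram probabilities.
    words = sentence.split()
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= bigram_prob(prev, word)
    return p

# Two acoustically similar candidates; the model picks the likelier wording.
candidates = ["recognize speech", "wreck a nice beach"]
best = max(candidates, key=score)
print(best)  # → recognize speech
```

In a real recognizer, these language-model scores are combined with acoustic scores, so grammar and word-sequence probability correct what the audio alone leaves ambiguous.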
In the area of neural networks, natural language processing and cognitive computing are being used to replicate ever more closely how humans think and talk. Reaching human parity, meaning an error rate on par with that of two people speaking, has long been the goal. And speech recognition is getting close.