Icons of Progress

Pioneering Speech Recognition

When it was unveiled at the Seattle World’s Fair in 1962, the IBM Shoebox was the most advanced speech recognition machine of its time, understanding 16 words spoken in English. Throughout the 1970s and ’80s, the development of speech recognition accelerated. As computing power grew, so too did the number of words these systems could recognize. Today, speech recognition software is available for a broad range of languages and can recognize a virtually limitless number of words.

From the beginning, IBM took a statistical approach to speech recognition technology, grouping sound into thousands of different units based on their characteristic combinations of frequencies.

Hidden Markov Models, statistical language modeling, and Viterbi and stack decoders, all now ubiquitous, were pioneered at IBM Research in the 1970s by Fred Jelinek and his team.

The 1980s saw the development of real-time speech recognition systems embodying statistical methodologies. The IBM speech team demonstrated the first real-time large-vocabulary dictation system in 1984. At the time, it required an IBM mini-computer and three array processors that filled a whole room. Within only a couple of years, the team had ported the technology to special-purpose hardware that ran on an IBM-PC AT.

In 1992, IBM released its first dictation system, the IBM Speech Server Series. The next year brought the IBM Personal Dictation System, the first dictation system for the personal computer. It was later renamed IBM VoiceType Dictation, and was capable of recognizing 32,000 words at a rate of approximately 70 to 100 words per minute, with 97 percent accuracy. Both systems were used mostly in the medical and legal fields, and in business and government.

In 1996, VoiceType Simply Speaking was released. This voice recognition software worked with Microsoft® Windows® applications, making it useful in offices, schools and even homes. The dictation function supported US English or Spanish and included a vocabulary of 22,000 to 42,000 words (depending on the language) and a spelling dictionary of 100,000 words. IBM MedSpeak/Radiology was also released that year. It was the world’s first continuous-speech recognition dictation and workflow product. With this system, a radiologist would dictate the examination of a patient’s X-ray, and MedSpeak/Radiology would convert the comments into a written report.

In 1997, IBM introduced IBM ViaVoice, the first continuous dictation product offered in multiple languages. It was no surprise that the technology could work for languages such as German, Spanish, French, and Italian, but the team continued to demonstrate the power of the statistical methodologies by also creating highly successful dictation systems for Mandarin and Japanese in conjunction with colleagues from the China Research Lab and Tokyo Research Lab. The Mandarin system was so impressive that it was demonstrated to the President of China, Jiang Zemin, when it was initially launched.

Today, speech recognition technology appears in a wide variety of applications that go beyond the desktop. Speech analytics systems, automated speech self-service, mobile devices, automobile navigation systems, car infotainment with climate control systems and media players, hands-free phones, personal navigation devices and other smart devices are all examples of how speech recognition has penetrated our lives, all originating from IBM’s early vision in this area.