Instead, he developed techniques that involved recognizing words based on patterns that the human ear does not discern, and classifying sounds into machine code. The device recognized “machine vowels,” or voiced sounds that originate in the throat (like “a,” “o” and “m”), and “machine consonants,” frictional sounds created by air passing through the tongue and teeth (like “f,” “v” and “th”). Each word was categorized according to its specific pattern of machine consonants and vowels.
For words with similar patterns (like “one” and “nine”), Shoebox made further refinements. The frictional, or unvoiced, sounds were classified as weak or strong, to aid in identifying the correct number. Because Shoebox identified patterns that only occur in the human voice — and ignored unfamiliar patterns — it was relatively unaffected by ambient noise.
Shoebox was hardly the end of IBM’s interest in speech recognition. After several breakthroughs, the company established a Continuous Speech Recognition Group at the Thomas J. Watson Research Center in the early 1970s. From the outset, the group took a statistical approach to speech recognition, grouping sounds into thousands of units based on their characteristic combinations of frequencies.
In 1987, its leader, Frederick Jelinek, described an approach that echoed Dersch’s perspective from two decades earlier. “We thought it was wrong to ask a machine to emulate people,” Jelinek told THINK Magazine. “If a machine has to move, it does it with wheels — not by walking. If a machine has to fly, it does so as an airplane does — not by flapping its wings. Rather than exhaustively studying how people listen to and understand speech, we wanted to find the natural way for the machine to do it.”