Instead, he developed techniques that involved recognizing words based on patterns that the human ear does not discern, and classifying sounds into machine code. The device recognized “machine vowels,” or voiced sounds that originate in the throat (like “a,” “o” and “m”), and “machine consonants,” frictional sounds created by air passing through the tongue and teeth (like “f,” “v” and “th”). Each word was categorized according to its specific pattern of machine consonants and vowels.
For words with similar patterns (like “one” and “nine”), Shoebox made further refinements. The frictional, or unvoiced, sounds were classified as weak or strong, to aid in identifying the correct number. Because Shoebox identified patterns that only occur in the human voice — and ignored unfamiliar patterns — it was relatively unaffected by ambient noise.
Shoebox was hardly the end of IBM’s interest in speech recognition. After several breakthroughs, the company established a Continuous Speech Recognition Group at the Thomas J. Watson Research Center in the early 1970s. From the outset, the group took a statistical approach to speech recognition, grouping sounds into thousands of units based on their characteristic combinations of frequencies.
In 1987, its leader, Frederick Jelinek, described an approach that echoed Dersch’s perspective from two decades earlier. “We thought it was wrong to ask a machine to emulate people,” Jelinek told THINK Magazine. “If a machine has to move, it does it with wheels — not by walking. If a machine has to fly, it does so as an airplane does — not by flapping its wings. Rather than exhaustively studying how people listen to and understand speech, we wanted to find the natural way for the machine to do it.”