
Reaching new records in speech recognition


Depending on whom you ask, humans miss one to two words out of every 20 they hear. In a five-minute conversation, that could be as many as 80 words. But for most of us, it isn’t a problem. Imagine, though, how much harder it is for a computer.

Last year, IBM announced a major milestone in conversational speech recognition: a system that achieved a 6.9 percent word error rate. Since then, we have continued to push the boundaries of speech recognition, and today we’ve reached a new industry record of 5.5 percent.

This was measured on a very difficult speech recognition task: recorded conversations between humans discussing day-to-day topics like “buying a car.” This recorded corpus, known as the “SWITCHBOARD” corpus, has been used for over two decades to benchmark speech recognition systems.
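For readers unfamiliar with the metric, word error rate is simply the number of substituted, deleted, and inserted words in a transcript, divided by the number of words actually spoken. Below is a minimal sketch of that computation in Python; the whitespace tokenization and the toy sentences are illustrative only, and production scoring tools such as NIST’s sclite add text normalization and utterance-level alignment on top of this.

```python
# A minimal word error rate (WER) sketch: edit distance over words,
# divided by the length of the reference transcript.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming (Levenshtein) edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("buying a new car is hard",
                      "buying the new car it hard"))  # 2 errors / 6 words ≈ 0.33
```

At a 5.5 percent word error rate, roughly one word in 18 comes out wrong.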

To reach this 5.5 percent breakthrough, IBM researchers focused on extending our application of deep learning technologies. We combined LSTM (Long Short-Term Memory) and WaveNet language models with three strong acoustic models. Of the acoustic models used, the first two were six-layer bidirectional LSTMs: one has multiple feature inputs, while the other is trained with speaker-adversarial multi-task learning. The unique thing about the last model is that it not only learns from positive examples but also takes advantage of negative examples, so it gets smarter as it goes and performs better where similar speech patterns are repeated.
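As a rough illustration of what one of those acoustic models looks like structurally, here is a minimal PyTorch sketch of a six-layer bidirectional LSTM that maps acoustic feature frames to scores over context-dependent HMM states. The layer width, the 40-dimensional filterbank input, and the 9,000-state output are assumptions made for the example, not IBM’s published configuration, and the multi-feature and speaker-adversarial variants described above would add further inputs and training objectives on top of this backbone.

```python
# Illustrative sketch only: a six-layer bidirectional LSTM acoustic model.
# Dimensions are assumed for the example, not taken from IBM's system.
import torch
import torch.nn as nn

class BiLSTMAcousticModel(nn.Module):
    def __init__(self, feat_dim=40, hidden=512, layers=6, num_states=9000):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=feat_dim,
            hidden_size=hidden,
            num_layers=layers,        # six stacked layers
            bidirectional=True,       # reads the utterance forward and backward
            batch_first=True,
        )
        # Map concatenated forward/backward states to per-frame state scores.
        self.output = nn.Linear(2 * hidden, num_states)

    def forward(self, features):      # features: (batch, frames, feat_dim)
        encoded, _ = self.lstm(features)
        return self.output(encoded)   # (batch, frames, num_states)

model = BiLSTMAcousticModel()
frames = torch.randn(1, 300, 40)      # ~3 seconds of acoustic feature frames
scores = model(frames)                # shape: (1, 300, 9000)
```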

Reaching human parity – meaning an error rate on par with that of two humans speaking – has long been the ultimate industry goal. Others in the industry are chasing this milestone alongside us, and some have recently claimed reaching 5.9 percent as equivalent to human parity…but we’re not popping the champagne yet. As part of our process in reaching today’s milestone, we determined human parity is actually lower than what anyone has yet achieved — at 5.1 percent.

To determine this number, we worked to reproduce human-level results with the help of our partner Appen, which provides speech and search technology services. And while our breakthrough of 5.5 percent is a big one, this discovery of human parity at 5.1 percent proved to us we have a way to go before we can claim technology is on par with humans.

As part of our research efforts, we connected with different industry experts to get their input on this matter too. Yoshua Bengio, leader of the University of Montreal’s MILA (Montreal Institute for Learning Algorithms) Lab, agrees we still have more work to do to reach human parity:

“In spite of impressive advances in recent years, reaching human-level performance in AI tasks such as speech recognition or object recognition remains a scientific challenge. Indeed, standard benchmarks do not always reveal the variations and complexities of real data. For example, different data sets can be more or less sensitive to different aspects of the task, and the results depend crucially on how human performance is evaluated, for example using skilled professional transcribers in the case of speech recognition,” says Bengio. “IBM continues to make significant strides in advancing speech recognition by applying neural networks and deep learning into acoustic and language models.”

We also realized that finding a standard measurement for human parity across the industry is more complex than it seems. Beyond SWITCHBOARD, another industry corpus, known as “CallHome,” offers a different set of linguistic data to test against, created from more colloquial conversations between family members on topics that are not pre-fixed. Conversations from the CallHome data are more challenging for machines to transcribe than those from SWITCHBOARD, making breakthroughs harder to achieve. (On this corpus we achieved a 10.3 percent word error rate, another industry record, but again, with Appen’s help, we measured human performance in the same situation to be 6.8 percent.)

In addition, with SWITCHBOARD, some of the same human voices in the test data also appear in the training data used to build the acoustic and language models. CallHome has no such overlap, so the speech recognition models have never been exposed to the test speakers. Because of this lack of repetition, the gap between human and machine performance is larger on CallHome. As we continue to pursue human parity, advancements in our deep learning technologies that do not rely on such repetition are ever more important to finally overcoming these challenges.

Julia Hirschberg, professor and Chair of the Department of Computer Science at Columbia University, also commented on the ongoing, complex challenge of speech recognition:

“The ability to recognize speech as well as humans do is a continuing challenge, since human speech, especially during spontaneous conversation, is extremely complex. It’s also difficult to define human performance, since humans also vary in their ability to understand the speech of others. When we compare automatic recognition to human performance it’s extremely important to take both these things into account: the performance of the recognizer and the way human performance on the same speech is estimated,” she shared. “IBM’s recent achievements on the SWITCHBOARD and on the CallHome data are thus quite impressive. But I’m also impressed with the way IBM has been working to better understand human ability to understand these two, much-cited corpora. This scientific achievement is in its way as impressive as the performance of their current ASR technology, and shows that we still have a way to go for machines to match human speech understanding.”

Today’s achievement adds to recent advancements we’ve made in speech technology – for example, in December we added diarization to our Watson Speech to Text service, marking a step forward in distinguishing individual speakers in a conversation. These speech developments build on decades of research, and achieving speech recognition comparable to that of humans is a complex task. We will continue to work towards creating the technology that will one day match the complexity of how the human ear, voice and brain interact. While we are energized by our progress, our work is dependent on further research – and most importantly, staying accountable to the highest standards of accuracy possible.

Read the white paper on this automatic speech recognition milestone: https://arxiv.org/abs/1703.02136
