IBM Sets New Transcription Performance Milestone on Automatic Broadcast News Captioning

Two years ago, IBM set new performance records on conversational telephone speech (CTS) transcription by benchmarking its deep neural network-based speech recognition systems on the Switchboard and CallHome corpora, two popular publicly available data sets for automatic speech recognition [1]. Here we show that this impressive performance carries over to other audio genres. As with the CTS benchmarks, the industry has for many years evaluated system performance on multimedia audio using broadcast news (BN) captioning. We have now achieved a new industry record of 6.5% and 5.9% word error rate on two BN benchmarks, RT04 and DEV04F [2]. Both test sets were released in the past by the Linguistic Data Consortium (LDC) [3]. RT04 contains 4 hours of broadcast news data from 12 shows, with about 230 speakers overlapping across the shows; DEV04F contains about 2 hours of data from 6 shows, with close to 100 speakers overlapping across the shows.
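Word error rate, the metric behind these numbers, is the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the system hypothesis, normalized by the reference length. A minimal sketch of the computation (the function name and example sentences are illustrative, not from the actual evaluation):

```python
def wer(ref, hyp):
    """Word error rate: (substitutions + deletions + insertions) / len(ref),
    computed as Levenshtein distance over word tokens."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[len(r)][len(h)] / len(r)

print(wer("the news at six tonight", "the news at six"))  # 0.2: one deletion over five reference words
```

Production scoring pipelines additionally normalize text (case, punctuation, spelling variants) before alignment, so reported figures depend on the scoring setup as well as the recognizer.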

Figure: Progress in word error rate reduction on CTS and BN test sets.

In the CTS domain, our speech recognition systems deal with spontaneous speech recorded over a telephone channel with various channel distortions, in addition to numerous speaking styles. Conversational speech is also interspersed with portions of overlapping speech, interruptions, restarts, and back-channel confirmations between participants. In contrast, the new speech recognition systems built for BN captioning need to deal with wide-band signals collected from a wide variety of speakers with different speaking styles, in multiple background noise conditions, and speaking on a wide variety of news topics. Most of the speech is well articulated and similar in form to written English, but there are also materials such as on-site interviews and clips from TV shows in the mix.

The performance on the BN test sets is achieved by building on our deep neural network-based speech recognition techniques proposed earlier for CTS. We use a combination of deep long short-term memory (LSTM) and residual network (ResNet)-based acoustic models along with n-gram and neural network language models. The ResNet-based acoustic models are deep convolutional networks with up to 25 convolutional layers trained on speech spectrograms; they complement the six-layer deep LSTM models trained on a rich set of acoustic features. LSTM models are also used for language modeling to improve on top of traditional n-gram and feed-forward neural network language models. All these systems are trained on various amounts of broadcast news data, also released by LDC [4].
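To illustrate what an n-gram language model contributes, here is a toy bigram model with add-one smoothing. This is a deliberately simplified sketch, not the models used in the actual systems, and the corpus and function names are invented for illustration:

```python
from collections import Counter

def train_bigram(corpus):
    """Count unigrams and bigrams from tokenized sentences,
    with <s>/</s> sentence-boundary markers."""
    uni, bi = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent.split() + ["</s>"]
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
    return uni, bi

def bigram_prob(uni, bi, w1, w2, vocab_size):
    # Add-one (Laplace) smoothing: unseen bigrams still get
    # a small nonzero probability.
    return (bi[(w1, w2)] + 1) / (uni[w1] + vocab_size)

corpus = ["good evening from the newsroom", "good evening and welcome"]
uni, bi = train_bigram(corpus)
V = len(uni)
p_seen = bigram_prob(uni, bi, "good", "evening", V)      # bigram seen twice
p_unseen = bigram_prob(uni, bi, "good", "welcome", V)    # bigram never seen
print(p_seen > p_unseen)  # True: the attested continuation scores higher
```

During decoding, scores like these are combined with the acoustic model's scores so that word sequences that are both acoustically likely and linguistically plausible win out; the LSTM language models mentioned above play the same role but condition on unbounded history rather than a fixed n-gram window.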

To quantify our results and measure how close we are to the ultimate goal of achieving human parity, we have worked with our partner Appen, which provides speech and search technology services, to measure human recognition error rates on these tasks. While our new results of 6.5% and 5.9% are the lowest we are aware of for this task, human performance is estimated to be significantly better, at 3.6% and 2.8%, respectively, indicating that there is still room for new techniques and improvements in this space.

This work will be presented at IEEE ICASSP, the world’s largest and most comprehensive technical conference focused on signal processing and its applications, on May 16. The paper is titled “English Broadcast News Speech Recognition by Humans and Machines,” written by Samuel Thomas, Masayuki Suzuki, Yinghui Huang, Gakuto Kurata, Zoltan Tuske, George Saon, Brian Kingsbury, Michael Picheny, Tom Dibert, Alice Kaiser-Schatzlein, and Bern Samko.

