IBM Sets New Transcription Performance Milestone on Automatic Broadcast News Captioning

Two years ago, IBM set new performance records on conversational telephone speech (CTS) transcription by benchmarking its deep neural network-based speech recognition systems on Switchboard and CallHome, two popular publicly available corpora for automatic speech recognition [1]. Here we show that this impressive performance holds on other audio genres. As with the CTS benchmarks, the industry has for many years evaluated system performance on multimedia audio using broadcast news (BN) captioning. We have now achieved new industry-record word error rates of 6.5% and 5.9% on two BN benchmarks, RT04 and DEV04F [2]. Both test sets were released by the Linguistic Data Consortium (LDC) [3]. DEV04F contains about 2 hours of data from 6 shows, with close to 100 speakers overlapping across the shows; RT04 contains 4 hours of broadcast news data from 12 shows, with about 230 speakers overlapping across the shows.
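The numbers above are word error rates (WER), the standard metric for this task: the minimum number of word substitutions, insertions, and deletions needed to turn the system's hypothesis into the reference transcript, divided by the number of reference words. A minimal sketch of the standard computation (the function name and example sentences are ours, for illustration):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein edit distance, normalized by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words (rolling dynamic-programming row)
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,           # deletion
                           cur[j - 1] + 1,        # insertion
                           prev[j - 1] + cost))   # substitution
        prev = cur
    return prev[-1] / len(ref)

# One deleted word out of five reference words -> WER of 0.2 (20%)
print(word_error_rate("the news at six tonight", "the news at six"))  # 0.2
```

A 6.5% WER on RT04 therefore means roughly 6.5 word-level errors per 100 reference words, averaged over the whole test set.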

Progress in word error rate reduction on CTS and BN test sets.

In the CTS domain, our speech recognition systems deal with spontaneous speech recorded over a telephone channel with various channel distortions, in addition to numerous speaking styles. Conversational speech is also interspersed with overlapping speech, interruptions, restarts, and back-channel confirmations between participants. In contrast, the new speech recognition systems built for BN captioning must handle wide-band signals from a wide variety of speakers with different speaking styles, in multiple background noise conditions, covering a broad range of news topics. Most of the speech is well articulated and structured similarly to written English, but the mix also includes material such as on-site interviews and clips from TV shows.

The performance on the BN test sets is achieved by building on the deep neural network-based speech recognition techniques we proposed earlier for CTS. We use a combination of deep long short-term memory (LSTM) and residual network (ResNet) acoustic models along with n-gram and neural network language models. The ResNet acoustic models are deep convolutional networks with up to 25 convolutional layers trained on speech spectrograms; they complement six-layer deep LSTM models trained on a rich set of acoustic features. LSTM models are also used for language modeling, improving on traditional n-gram and feed-forward neural network language models. All these systems are trained on various amounts of broadcast news data, also released by the LDC [4].
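One common way to combine complementary acoustic models such as the LSTM and ResNet is to fuse their per-frame posterior scores before decoding. The exact combination strategy we use is described in the paper; purely as an illustration, a simple log-linear interpolation of two models' frame posteriors might look like this (the equal default weight and the renormalization scheme here are our own illustrative assumptions, not the paper's settings):

```python
import math

def combine_posteriors(lstm_post, resnet_post, weight=0.5):
    """Log-linear interpolation of per-frame posterior distributions from
    two acoustic models, renormalized so each frame sums to one.

    lstm_post, resnet_post: lists of frames; each frame is a list of
    posterior probabilities over the same set of acoustic states.
    weight: interpolation weight on the first (LSTM) model.
    """
    fused = []
    for p_lstm, p_resnet in zip(lstm_post, resnet_post):
        # Weighted sum in the log domain = product of powered probabilities
        scores = [weight * math.log(a) + (1.0 - weight) * math.log(b)
                  for a, b in zip(p_lstm, p_resnet)]
        # Exponentiate with max-subtraction for numerical stability, renormalize
        z = max(scores)
        exps = [math.exp(s - z) for s in scores]
        total = sum(exps)
        fused.append([e / total for e in exps])
    return fused

# One frame, two acoustic states: fuse two slightly different posteriors
print(combine_posteriors([[0.7, 0.3]], [[0.6, 0.4]]))
```

With `weight=1.0` the fusion reduces to the first model's posteriors; intermediate weights let the decoder benefit from both models where they disagree.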

To quantify our results and measure how close we are to the ultimate goal of achieving human parity, we have worked with our partner Appen, which provides speech and search technology services, to measure human recognition error rates on these tasks. While our new results of 6.5% and 5.9% are the lowest we are aware of for this task, human performance is estimated to be significantly better, at 3.6% and 2.8%, respectively, indicating that there is still room for new techniques and improvements in this space.

This work will be presented at IEEE ICASSP, the world's largest and most comprehensive technical conference focused on signal processing and its applications, on May 16. The paper is titled "English Broadcast News Speech Recognition by Humans and Machines", written by Samuel Thomas, Masayuki Suzuki, Yinghui Huang, Gakuto Kurata, Zoltan Tuske, George Saon, Brian Kingsbury, Michael Picheny, Tom Dibert, Alice Kaiser-Schatzlein, and Bern Samko.


IBM Research
