IBM Sets New Transcription Performance Milestone on Automatic Broadcast News Captioning

Two years ago IBM set new performance records on conversational telephone speech (CTS) transcription by benchmarking its deep neural network-based speech recognition systems on the Switchboard and CallHome corpora, two popular publicly available data sets for automatic speech recognition [1]. Here we show that this impressive performance holds on other audio genres. As with the CTS benchmarks, the industry has for many years evaluated system performance on multimedia audio signals with broadcast news (BN) captioning. We have now achieved new industry-record word error rates of 6.5% and 5.9% on two BN benchmarks, DEV04F and RT04 [2]. Both test sets were previously released by the Linguistic Data Consortium (LDC) [3]. The first test set (DEV04F) has about 2 hours of data from 6 shows, with close to 100 overlapping speakers across the shows. The second test set (RT04) has 4 hours of broadcast news data from 12 shows, with about 230 overlapping speakers across the shows.

Figure: Progress in word error rate reduction on CTS and BN test sets.
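
For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance (substitutions, insertions, and deletions) between the system's hypothesis and a reference transcript, divided by the number of reference words. Here is a minimal sketch in Python; the function name and example strings are illustrative, not taken from the actual evaluation:

    def word_error_rate(reference: str, hypothesis: str) -> float:
        """WER = (substitutions + insertions + deletions) / reference length."""
        ref, hyp = reference.split(), hypothesis.split()
        # Word-level Levenshtein distance via dynamic programming.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i  # i deletions
        for j in range(len(hyp) + 1):
            d[0][j] = j  # j insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return d[len(ref)][len(hyp)] / len(ref)

    # One substitution against a four-word reference gives 25% WER.
    print(word_error_rate("the news at nine", "the news at five"))  # 0.25

By this measure, a 6.5% WER means roughly one word in fifteen is transcribed incorrectly.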

In the CTS domain, our speech recognition systems deal with spontaneous speech recorded over a telephone channel with various channel distortions, in addition to numerous speaking styles. Conversational speech is also interspersed with portions of overlapping speech, interruptions, restarts, and back-channel confirmations between participants. In contrast, the new speech recognition systems built for BN captioning need to deal with wide-band signals collected from a wide variety of speakers with different speaking styles, in multiple background noise conditions, speaking on a wide variety of news topics. Most of the speech is well articulated and formed similarly to written English, but the mix also includes material such as on-site interviews and clips from TV shows.

The performance on the BN test sets is achieved by building on the deep neural network-based speech recognition techniques we proposed earlier for CTS. We use a combination of deep long short-term memory (LSTM) and residual network (ResNet) acoustic models along with n-gram and neural network language models. The ResNet acoustic models are deep convolutional networks with up to 25 convolutional layers trained on speech spectrograms; they complement the six-layer deep LSTM models trained on a rich set of acoustic features. LSTM models are also used for language modeling to improve on top of traditional n-gram and feed-forward neural network language models. All these systems are trained on various amounts of broadcast news data also released by LDC [4].
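
To make the model shapes concrete, here is a minimal PyTorch sketch of a six-layer bidirectional LSTM acoustic model of the kind described above. The feature dimension, hidden size, and number of context-dependent output targets below are placeholder values, not the configuration from the paper:

    import torch
    import torch.nn as nn

    class LSTMAcousticModel(nn.Module):
        """Per-frame acoustic features in, scores over context-dependent
        phone states out. All sizes here are illustrative."""
        def __init__(self, feat_dim=140, hidden=512, layers=6, targets=9000):
            super().__init__()
            self.lstm = nn.LSTM(feat_dim, hidden, num_layers=layers,
                                bidirectional=True, batch_first=True)
            self.output = nn.Linear(2 * hidden, targets)

        def forward(self, frames):  # frames: (batch, time, feat_dim)
            hidden_states, _ = self.lstm(frames)
            return self.output(hidden_states)  # (batch, time, targets)

    model = LSTMAcousticModel()
    scores = model(torch.randn(4, 200, 140))  # 4 utterances, 200 frames each
    print(scores.shape)  # torch.Size([4, 200, 9000])

The ResNet counterpart replaces the recurrent layers with a deep stack of convolutions over the time-frequency plane of the spectrogram, and the two acoustic models are combined with the n-gram and neural network language models to produce the final transcripts.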

To quantify our results and measure how close we are to the ultimate goal of achieving human parity, we have worked with our partner Appen, which provides speech and search technology services, to measure human recognition error rates on these tasks. While our new results of 6.5% and 5.9% are the lowest we are aware of for this task, human performance is estimated to be significantly better, at 3.6% and 2.8%, respectively, indicating that there is still room for new techniques and improvements in this space.

This work will be presented at IEEE ICASSP, the world’s largest and most comprehensive technical conference focused on signal processing and its applications, on May 16. The paper is titled “English Broadcast News Speech Recognition by Humans and Machines”, written by Samuel Thomas, Masayuki Suzuki, Yinghui Huang, Gakuto Kurata, Zoltan Tuske, George Saon, Brian Kingsbury, Michael Picheny, Tom Dibert, Alice Kaiser-Schatzlein, and Bern Samko.


[1] https://www.ibm.com/blogs/watson/2017/03/reaching-new-records-in-speech-recognition/
[2] https://www.nist.gov/itl/iad/mig/rich-transcription-evaluation
[3] https://www.ldc.upenn.edu/
[4] https://www.ldc.upenn.edu/collaborations/past-projects/gale/data/gale-pubs
