May 14, 2019 | Written by: Samuel Thomas
Two years ago, IBM set new performance records on conversational telephone speech (CTS) transcription by benchmarking its deep neural network-based speech recognition systems on the Switchboard and Callhome corpora, two popular publicly available data sets for automatic speech recognition. Here we show that this impressive performance holds on other audio genres. As with the CTS benchmarks, the industry has for many years evaluated system performance on multimedia audio signals with broadcast news (BN) captioning. We have now achieved new industry-record word error rates of 6.5% and 5.9% on two BN benchmarks: RT04 and DEV04F. Both test sets were released by the Linguistic Data Consortium (LDC). The first test set (DEV04F) has about 2 hours of data from 6 shows with close to 100 overlapping speakers across the shows. The second test set (RT04) has 4 hours of broadcast news data from 12 shows with about 230 overlapping speakers across the shows.
Progress in word error rate reduction on CTS and BN test sets.
In the CTS domain, our speech recognition systems deal with spontaneous speech recorded over a telephone channel with various channel distortions, in addition to numerous speaking styles. Conversational speech is also interspersed with portions of overlapping speech, interruptions, restarts, and back-channel confirmations between participants. In contrast, the new speech recognition systems built for BN captioning need to deal with wide-band signals collected over a wide variety of speakers with different speaking styles, in multiple background noise conditions, and speaking on a wide variety of news topics. Most of the speech is well-articulated and is formed similarly to written English, but there are also materials such as on-site interviews, clips from TV shows, etc. in the mix.
The performance on the BN test sets is achieved by building on the deep neural network-based speech recognition techniques we proposed earlier for CTS. We use a combination of deep long short-term memory (LSTM) and residual network (ResNet)-based acoustic models along with n-gram and neural network language models. The ResNet-based acoustic models are deep convolutional networks with up to 25 convolutional layers trained on speech spectrograms, and they complement the six-layer deep LSTM models trained on a rich set of acoustic features. LSTM models are also used for language modeling, improving on traditional n-gram and feed-forward neural network language models. All these systems are trained on various amounts of broadcast news data also released by LDC.
To quantify our results and measure how close we are to the ultimate goal of achieving human parity, we have worked with our partner Appen, which provides speech and search technology services, to measure human recognition error rates on these tasks. While our new results of 6.5% and 5.9% are the lowest we are aware of for this task, human performance is estimated to be significantly better, at 3.6% and 2.8%, respectively, indicating that there is still room for new techniques and improvements in this space.
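For readers unfamiliar with the metric behind these figures, word error rate is conventionally the number of word substitutions, deletions, and insertions needed to turn a transcript into the reference, divided by the number of reference words. The sketch below is a minimal illustration of that definition, not the scoring pipeline used for the results above: the function name `word_error_rate` and the example sentences are made up for illustration, and it skips the text normalization (casing, punctuation, alternate spellings) that benchmark scoring normally applies.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Illustrative (invented) sentences: one substitution and one deletion
# against a seven-word reference.
reference = "the senate passed the bill on tuesday"
hypothesis = "the senate past the bill tuesday"
print(f"WER: {word_error_rate(reference, hypothesis):.1%}")  # WER: 28.6%
```

The same function can score a human transcript and a machine transcript against one reference, which is the kind of comparison the human and machine error rates above summarize over whole test sets.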
This work will be presented at IEEE ICASSP, the world’s largest and most comprehensive technical conference focused on signal processing and its applications, on May 16. The paper is titled “English Broadcast News Speech Recognition by Humans and Machines”, written by Samuel Thomas, Masayuki Suzuki, Yinghui Huang, Gakuto Kurata, Zoltan Tuske, George Saon, Brian Kingsbury, Michael Picheny, Tom Dibert, Alice Kaiser-Schatzlein, and Bern Samko.