October 16, 2020 | Written by: Samuel Thomas, Brian Kingsbury, and Ron Hoory
The 21st INTERSPEECH Conference will take place as a fully virtual conference from October 25 to October 29. INTERSPEECH is the world's largest conference devoted to speech processing and applications, and is the premier conference of the International Speech Communication Association.
The current focus of speech technology research at IBM Research AI is Spoken Customer Care, where our goals are to improve customer experience in human-machine interaction (voice bots) and to enhance analytics capabilities in human-human call center interaction. IBM Research AI has contributed 10 papers to the INTERSPEECH technical program, covering a wide range of topics including text-to-speech synthesis, automatic speech recognition, speaker diarization, and spoken language understanding. These papers address many interesting research questions in our line of work:
- How do we make speech-to-text systems more accurate?
- How can automatic systems reliably listen to a multi-party conversation and know who spoke when?
- How can we synthesize speech that has natural and controllable expression?
- How can we best extract meaning from spoken utterances?
These exciting pieces of work are the result of research done at five of IBM’s global research laboratories in Bangalore, Haifa, São Paulo, Tokyo, and Yorktown Heights, and also in collaboration with our academic partners.
Setting the stage for IBM's leadership in this field are the papers "Single headed attention based sequence-to-sequence model for state-of-the-art results on Switchboard" and "New Advances in Speaker Diarization". The first paper establishes new record performances in automatic speech recognition on the 300-hour and 2000-hour Switchboard conversational telephone speech benchmarks by developing a single-headed-attention, LSTM-based encoder-decoder model. The second paper pushes the envelope on state-of-the-art performance for speaker diarization (the task of determining "who spoke when?") with novel improvements to speaker clustering using multiple speaker embeddings and refined neural network-based modeling techniques.
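To make the diarization idea concrete, here is a minimal, purely illustrative sketch (not the method from the paper): each speech segment is represented by a speaker embedding, and segments whose embeddings are similar are greedily grouped under the same speaker label. The threshold and the toy 2-D embeddings below are assumptions for illustration only.

```python
# Illustrative diarization-by-clustering sketch: greedily group segment
# embeddings by cosine similarity to running cluster centroids.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def greedy_diarize(embeddings, threshold=0.9):
    """Assign a speaker label to each segment embedding.

    A segment joins the existing cluster whose centroid it matches best,
    provided the similarity is at least `threshold`; otherwise it starts a
    new speaker. `threshold` is an illustrative hyperparameter, not a
    published value.
    """
    centroids, counts, labels = [], [], []
    for emb in embeddings:
        best, best_sim = None, threshold
        for i, c in enumerate(centroids):
            sim = cosine(emb, c)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            # No cluster is similar enough: start a new speaker.
            centroids.append(list(emb))
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            # Update the matched cluster's centroid with the new segment.
            n = counts[best]
            centroids[best] = [(c * n + e) / (n + 1)
                               for c, e in zip(centroids[best], emb)]
            counts[best] += 1
            labels.append(best)
    return labels

# Toy 2-D "embeddings": segments 0 and 2 point the same way, segment 1 differs.
segments = [[1.0, 0.05], [0.0, 1.0], [0.98, 0.1]]
print(greedy_diarize(segments))  # → [0, 1, 0]
```

Real systems use high-dimensional neural speaker embeddings (such as x-vectors) and more refined clustering, but the "embed, then cluster, then label" structure is the same.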
Given the growing importance of spoken conversational systems for IBM's customer care business, significant work is reported in the other papers: two on spoken language understanding, one on speech synthesis, and five on speech recognition and speaker recognition.
Full list of accepted papers:
Speaker recognition and diarization
- Hagai Aronowitz, Weizhong Zhu, Masayuki Suzuki, Gakuto Kurata, Ron Hoory, “New Advances in Speaker Diarization“
- Shai Rozenberg, Hagai Aronowitz, Ron Hoory, “Siamese X-Vector Reconstruction for Domain Adapted Speaker Recognition“
Speech synthesis
- Alexander Sorin, Slava Shechtman, Ron Hoory, “Principal Style Components: Expressive Style Control and Cross-Speaker Transfer in Neural TTS“
Spoken language understanding
- Hong-Kwang J. Kuo, Zoltán Tüske, Samuel Thomas, Yinghui Huang, Kartik Audhkhasi, Brian Kingsbury, Gakuto Kurata, Zvi Kons, Ron Hoory, and Luis Lastras, “End-to-End Spoken Language Understanding without Full Transcripts“
- Ashish Mittal, Samarth Bharadwaj, Shreya Khare, Saneem Chemmengath, Karthik Sankaranarayanan, Brian Kingsbury, “Representation based meta-learning for few-shot spoken intent recognition“
Speech recognition
- Gakuto Kurata, George Saon, “Knowledge Distillation from Offline to Streaming RNN Transducer for End-to-End Speech Recognition“
- Takashi Fukuda, Samuel Thomas, “Implicit Transfer of Privileged Acoustic Information in Generalized Knowledge Distillation Framework“
- Zoltán Tüske, George Saon, Kartik Audhkhasi, Brian Kingsbury, “Single headed attention based sequence-to-sequence model for state-of-the-art results on Switchboard“
- Alexandros Koumparoulis, Gerasimos Potamianos, Samuel Thomas, Edmilson Morais, “Resource-adaptive Deep Learning for Visual Speech Recognition“
- Samuel Thomas, Kartik Audhkhasi, Brian Kingsbury, “Transliteration Based Data Augmentation for Training Multilingual ASR Acoustic Models in Low Resource Settings“