Have you ever called a family member or friend and, within just a couple of words, they know exactly who you are and what you are planning to talk about? It is one of those simple yet rewarding pleasures. And yet, when we call and speak with a computer at an automated customer service center, we don’t have the same experience. Similarly, voice-activated virtual assistants on smartphones can be useful, yet the complete experience still seems to be missing. When we speak to a computer, it determines the words we spoke and, based on those words, decides how to respond. The problem is that words are not enough. A key piece of evidence is that we use emoticons in our text messages: the words themselves do not convey our full intent, and emoticons attempt to bridge that gap.
So, when we talk with a computer, we want it to understand more than just our words. Which words did we emphasize to indicate what is most important to us? Do we sound happy or frustrated, pressed for time or with time to spare? Is a child, teenager, adult or senior citizen talking with the computer? Did we speak with the system a moment earlier, and does it remember where our conversation left off? Was there a lot of noise in the background and, if so, should the virtual assistant speak louder? Audio analytics can help bridge this gap by learning the characteristics of our speech and of the environment we are communicating in.
Speech technology has progressed at great speed. For example, at IBM the speech recognition word error rate on telephony data is reported [1] to be down to 6.6 percent and is still edging lower. To put this in context, the human word error rate is estimated at around 4 percent. With the goal of giving computers an understanding beyond just the words, our team is pleased to announce the world’s best published results for two technologies: automatic speaker recognition [2, 3] and age estimation from speech [4]. Here we share these exciting results and how they take steps toward our broader goal.
So, what is automatic speaker recognition? It is the capability of automatically recognizing a speaker based on analyzing their voice pattern. It allows a computer to recognize a person who was talking to it earlier so that it may continue the conversation where it left off. Humans do this all the time. One specific task is speaker verification where the system is tasked with answering the question: “Did a particular speaker say the speech in the recording? Yes or No.”
Since 1996, the National Institute of Standards and Technology (NIST) [5] has run public Speaker Recognition Evaluations (SREs) for conversational telephony speech. The good news is that systems have made significant progress since then. In 2000, the world’s best published speaker verification performance was around the 10 percent Equal Error Rate (EER) mark. Is this a good result? Well, at this operating point, given an equal probability of an impostor or a target speaker (a statement to keep the scientists happy), the system could be said to make a mistake 10 percent of the time. Nowadays, after progress in algorithms and with more data, the EER is less than 1 percent. To our knowledge, our team has achieved the best published error rates across the core conditions of the NIST 2010 SRE data (one of the most contested data sets), including 0.59 percent EER for the main telephony speech task [2, 3]. These are exciting times for the technology, and there is mounting evidence to suggest that, under certain constraints, the performance of recent state-of-the-art systems may have surpassed that of regular human listeners [6, 7, 8].
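To make the EER figure concrete, here is a minimal Python sketch of how an equal error rate can be approximated from verification scores. It is an illustration only, not the scoring tool used in the NIST evaluations or in our system: the function name `equal_error_rate`, the threshold sweep, and the toy scores are all assumptions made for this example.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Hypothetical helper for illustration; real evaluations use their own tools.
    """
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # Miss: a genuine (target) trial that scores below the threshold.
        miss_rate = sum(s < t for s in target_scores) / len(target_scores)
        # False alarm: an impostor trial that scores at or above the threshold.
        fa_rate = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(miss_rate - fa_rate)
        # Keep the operating point where the two error rates are closest.
        if gap < best_gap:
            best_gap, eer = gap, (miss_rate + fa_rate) / 2
    return eer


# Toy scores (made up): higher means "more likely the claimed speaker".
targets = [0.9, 0.8, 0.7, 0.6, 0.3]
impostors = [0.1, 0.2, 0.4, 0.5, 0.65]
print(f"EER: {equal_error_rate(targets, impostors):.0%}")
```

With these toy scores, a threshold of 0.6 rejects one of the five genuine trials and accepts one of the five impostor trials, so both error rates are 20 percent and that is the EER.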
In addition to speaker recognition, the second accomplishment involves estimating the age of a person from their speech. Systems should recognize the approximate age of the person they are talking with so that the conversation can be tailored to that age group. Again, humans have mastered this ability. Accordingly, we have built a system [4] which achieves the world’s best published age estimation results on the same NIST 2010 telephony data used earlier. On this data, we achieve age estimation with a correlation of 0.91. A study [9] on another data set determined the human listener correlation to be 0.77. In addition to achieving these top results, we also learned something interesting when we invited our colleagues to play with the system. There was a human consideration that we had not taken into account. We found that when the system estimated a person’s age as only a couple of years older than they were, users saw that as worse than an estimate several years younger. This observation makes me chuckle and, in hindsight, is somehow not surprising.
Our Audio Analytics team has investigated automatic ways of analyzing sound for more than a decade. In addition to speaker recognition and age estimation, our work has focused on speaker diarization [10], language (language, nativeness and dialect) recognition [11], gender identification, and speech/music discrimination. Another team at IBM is exploring the use of emotion.
Collectively, what do these technologies mean for the end user? They make for a far more personalized, effective and natural interaction with computers. Naturally, we consider the human’s mastery of these tasks our inspiration.
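For readers curious what a correlation score like the ones above measures, the following Python sketch computes a correlation between true and estimated ages. It assumes the Pearson correlation coefficient, which this post does not specify; the function name `pearson_correlation` and the toy ages are likewise invented for illustration.

```python
import math


def pearson_correlation(true_ages, estimated_ages):
    """Pearson correlation between two equal-length lists of numbers.

    Assumption: the post does not state which correlation measure was used.
    """
    n = len(true_ages)
    mean_true = sum(true_ages) / n
    mean_est = sum(estimated_ages) / n
    # Covariance and variances, left unnormalised because the 1/n terms cancel.
    cov = sum((t - mean_true) * (e - mean_est)
              for t, e in zip(true_ages, estimated_ages))
    var_true = sum((t - mean_true) ** 2 for t in true_ages)
    var_est = sum((e - mean_est) ** 2 for e in estimated_ages)
    return cov / math.sqrt(var_true * var_est)


# Toy data (made up): estimates that roughly track the true ages.
true_ages = [22, 35, 41, 58, 67]
estimated_ages = [27, 31, 45, 52, 70]
print(f"correlation: {pearson_correlation(true_ages, estimated_ages):.2f}")
```

A value of 1 means the estimates rise in perfect step with the true ages. Note that a constant offset does not lower the correlation, so a system that always guessed five years too old would still score 1.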
[1] G. Saon, T. Sercu, S. Rennie, and H. Kuo, “The IBM 2016 English Conversational Telephone Speech Recognition System,” in Proc. Interspeech, 2016.
[2] S. Sadjadi, S. Ganapathy, and J. Pelecanos, “The IBM 2016 Speaker Recognition System,” in Proc. The Speaker and Language Recognition Workshop (Odyssey 2016), 2016.
[3] S. Sadjadi, J. Pelecanos, and S. Ganapathy, “The IBM Speaker Recognition System: Recent Advances and Error Analysis,” in Proc. Interspeech, 2016.
[4] S. Sadjadi, S. Ganapathy, and J. Pelecanos, “Speaker Age Estimation on Conversational Telephone Speech using Senone Posterior based I-vectors,” in Proc. IEEE ICASSP, pp. 5040-5044, 2016.
[5] NIST, “NIST Speaker Recognition Evaluation,” http://www.nist.gov/itl/iad/mig/sre.cfm, Accessed October 2016.
[6] C. Greenberg, A. Martin, L. Brandschain, J. Campbell, C. Cieri, G. Doddington, and J. Godfrey, “Human Assisted Speaker Recognition (HASR) in NIST SRE10,” http://www.nist.gov/itl/iad/mig/upload/hasr_od10_webpage.pdf, Presentation accessed May 2016.
[7] A. Schmidt-Nielsen and T. Crystal, “Speaker Verification by Human Listeners: Experiments Comparing Human and Machine Performance using the NIST 1998 Speaker Evaluation Data,” in Digital Signal Processing, Vol. 10, pp. 249-266, 2000.
[8] V. Hautamäki, T. Kinnunen, M. Nosratighods, K. Lee, B. Ma, and H. Li, “Approaching Human Listener Accuracy with Modern Speaker Verification,” in Proc. Interspeech, 2010.
[9] L. Cerrato, M. Falcone, and A. Paoloni, “Subjective Age Estimation of Telephonic Voices,” in Speech Communication, Vol. 31, Issues 2-3, pp. 107-112, 2000.
[10] W. Zhu and J. Pelecanos, “Online Speaker Diarization using Adapted I-Vector Transforms,” in Proc. IEEE ICASSP, pp. 5045-5049, 2016.
[11] M. Omar and J. Pelecanos, “A Novel Approach to Detecting Non-Native Speakers and their Native Language,” in Proc. IEEE ICASSP, pp. 4398-4401, 2010.