Look who’s talking: IBM debuts Watson Speech To Text “Speaker Diarization” beta

“Hey, it’s me.”

How often has this been your first sentence after someone answers your phone call?

Our human ability to distinguish a person’s voice in just a few words is remarkable — and it’s an incredibly difficult task for an artificial agent to replicate. While we can instantly identify a voice or dialogue between two people, a machine will see this as one stream of input from a single source.

Today, IBM Research and Watson commercial teams have together taken a significant step toward giving machines this ability to distinguish between speakers in a conversation. Watson’s Speech To Text API has been enhanced with beta functionality that supports real-time speaker ‘diarization.’ The term derives from ‘diary,’ the recording of past events. Here, it refers to the algorithms used to identify and segment speech by speaker identity.

Speaker diarization has previously only worked effectively for pre-recorded conversations. What is exciting for us is that Watson can process the results as a conversation happens between two people. We recently outlined a number of advancements we have been making in audio analytics, and speaker diarization has been an important focus for us. Today’s beta release marks an additional step in our cognitive speech capabilities.

Real-time speaker diarization is a need we’ve heard about from many businesses around the world that transcribe large volumes of voice conversations every day. Imagine you operate a call center and need to act while customer and agent conversations are still happening: providing product-related help, alerting a supervisor about negative feedback, or flagging calls based on customer promotional activities. Prior to today, calls were typically transcribed and analyzed after they ended. Now, Watson’s speaker diarization capability enables access to that data immediately.

Let’s take a look at two examples. The first is the normal output file for a conversation transcribed by Watson Speech To Text:


Now let’s take a look at that same conversation with Watson’s Speech to Text supported by speaker diarization, and you’ll see the difference:


In developing this capability, our team had to overcome technical challenges that have existed in this space for decades. For example, a lot of research has been done on transcribing long-form speech, such as news broadcasts. In these scenarios, however, broadcasters typically speak uninterrupted for long stretches, which makes it easier for a system to recognize who is speaking because there is more content to analyze from each speaker’s voice. In a live conversation, on the other hand, the exchanges shift quickly back and forth between speakers, which makes it hard for a system to build a speech model of a particular speaker before another person begins talking.

To overcome this, we developed a system that would recognize the variations in spectral frequency content of each speaker during a conversation. Think of the nuances you might hear if you blow across the top of a soda bottle — our voice frequencies work in much the same way. While you might be able to tell the difference between a woman and a man, you may not as easily differentiate between two women unless you break down the frequencies. What Watson is able to do is instantly build each of these profiles and assign the output text to specific speakers.

To achieve this as the conversation takes place, we also developed an advanced “speaker clustering” algorithm that can be updated in real time. (Most algorithms are not able to dynamically build models while simultaneously absorbing data.) With this technology, Watson is optimized for real-time speaker diarization of telephony conversations between two participants, and we are actively working on a solution that can robustly handle four, five or even six speakers at the same time.
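To give a feel for what "updating clusters while absorbing data" can mean, here is a deliberately simple sketch. It is not Watson's algorithm, which this post does not describe in detail. It assumes each short audio segment has already been reduced to a small feature vector, and the class name, the distance threshold and the two-speaker cap are all hypothetical choices made for illustration.

```python
import math


class OnlineSpeakerClusterer:
    """Toy online clustering: one running-mean centroid per speaker.

    Illustrative only. The threshold and feature vectors are assumptions,
    not values or features used by Watson.
    """

    def __init__(self, max_speakers=2, new_speaker_distance=1.0):
        self.max_speakers = max_speakers
        self.new_speaker_distance = new_speaker_distance
        self.centroids = []  # one feature vector per speaker seen so far
        self.counts = []     # number of segments absorbed per speaker

    def _add_speaker(self, features):
        self.centroids.append(list(features))
        self.counts.append(1)
        return len(self.centroids) - 1

    def assign(self, features):
        """Label one segment and update the chosen speaker's centroid."""
        if not self.centroids:
            return self._add_speaker(features)

        distances = [math.dist(features, c) for c in self.centroids]
        nearest = min(range(len(distances)), key=distances.__getitem__)

        # Open a new cluster only if the segment is far from every known
        # speaker and the speaker cap has not been reached.
        too_far = distances[nearest] > self.new_speaker_distance
        if too_far and len(self.centroids) < self.max_speakers:
            return self._add_speaker(features)

        # Incremental mean: the model changes with every segment it absorbs.
        self.counts[nearest] += 1
        n = self.counts[nearest]
        self.centroids[nearest] = [
            c + (x - c) / n for c, x in zip(self.centroids[nearest], features)
        ]
        return nearest


clusterer = OnlineSpeakerClusterer(max_speakers=2, new_speaker_distance=1.0)
segments = [[0.0, 0.0], [5.0, 5.0], [0.2, 0.0], [5.0, 4.8], [9.0, 9.0]]
labels = [clusterer.assign(segment) for segment in segments]
```

Each segment is labeled the moment it arrives, and the speaker models keep shifting as more audio comes in. That is the property the paragraph above describes, shown here in its simplest possible form.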

Speaker Diarization is in beta now and can be applied across three languages: US English, Japanese and Spanish. The technology can be trialed on our website. Simply select the US Narrowband model and play the sample file for a quick overview of how the feature works. This represents our first version of the software. As we learn more about how our developers use the system, we will continually improve the technology.

To enable Speaker Diarization, take the following steps:

  1. Provision the Speech to Text service in Bluemix if this is your first time using the service. Follow steps here.
  2. Once your service is provisioned, make sure to use one of these language models: en-US_NarrowbandModel, es-ES_NarrowbandModel, ja-JP_NarrowbandModel.
  3. Set the optional parameter speaker_labels to true. (Note: this parameter is false by default.)
  4. That’s it! Your output now has speaker labels identified. This works in real-time streaming as well as when you pass in an audio file. For more details follow the documentation here.
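As a rough illustration of steps 2 to 4, the sketch below builds a recognize request URL with speaker labels switched on, then groups the words of a response into speaker turns. The base URL is a placeholder, the helper names are hypothetical, and the response shape (word timestamps under results, plus a top-level speaker_labels list with from, to and speaker fields) is assumed from the service documentation. Please check the current documentation before relying on it.

```python
from urllib.parse import urlencode

# The three narrowband models listed in step 2.
DIARIZATION_MODELS = {
    "en-US_NarrowbandModel",
    "es-ES_NarrowbandModel",
    "ja-JP_NarrowbandModel",
}


def build_recognize_url(base_url, model):
    """Return a recognize URL with speaker labels enabled (step 3)."""
    if model not in DIARIZATION_MODELS:
        raise ValueError(f"{model} is not one of the listed narrowband models")
    query = urlencode({"model": model, "speaker_labels": "true"})
    return f"{base_url}/v1/recognize?{query}"


def words_by_speaker(response):
    """Group consecutive words into (speaker, text) turns.

    Assumes each speaker label's "from" time matches the start time of
    exactly one word timestamp in the same response.
    """
    words = {}
    for result in response.get("results", []):
        timestamps = result["alternatives"][0].get("timestamps", [])
        for word, start, _end in timestamps:
            words[start] = word

    turns = []
    for label in response.get("speaker_labels", []):
        word = words.get(label["from"])
        if word is None:
            continue  # no matching word for this label; skip it
        if turns and turns[-1][0] == label["speaker"]:
            turns[-1][1].append(word)  # same speaker keeps talking
        else:
            turns.append((label["speaker"], [word]))
    return [(speaker, " ".join(spoken)) for speaker, spoken in turns]


# Placeholder base URL: substitute the one from your own service credentials.
url = build_recognize_url("https://example.invalid/api", "en-US_NarrowbandModel")

# Hand-written sample in the assumed response shape, not real service output.
sample = {
    "results": [{"alternatives": [{"timestamps": [
        ["hello", 0.0, 0.4], ["hi", 0.5, 0.7], ["there", 0.7, 1.0],
    ]}]}],
    "speaker_labels": [
        {"from": 0.0, "to": 0.4, "speaker": 0},
        {"from": 0.5, "to": 0.7, "speaker": 1},
        {"from": 0.7, "to": 1.0, "speaker": 1},
    ],
}
turns = words_by_speaker(sample)
```

The grouping step is kept separate from the request on purpose, so the same helper can be applied whether the labels come from a streaming session or from a single audio file.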

We look forward to “hearing” what you think.
