Look who’s talking: IBM debuts Watson Speech To Text “Speaker Diarization” beta


“Hey, it’s me.”

How often has this been your first sentence after someone answers your phone call?

Our human ability to distinguish a person’s voice in just a few words is remarkable — and it’s an incredibly difficult task for an artificial agent to replicate. While we can instantly identify a voice or dialogue between two people, a machine will see this as one stream of input from a single source.

Today, IBM Research and Watson commercial teams, working together, have taken a significant step toward giving machines this ability to distinguish between speakers in a conversation. Watson’s Speech to Text API has been enhanced with beta functionality that supports real-time speaker ‘diarization.’ Diarization derives from ‘diary,’ or the recording of past events. Here, it refers to the algorithms used to identify and segment speech by speaker identity.

Speaker diarization has previously only worked effectively for pre-recorded conversations. What is exciting for us is that Watson can process the results as a conversation happens between two people. We recently outlined a number of advancements we have been making in audio analytics, and speaker diarization has been an important focus for us. Today’s beta release marks an additional step in our cognitive speech capabilities.

Real-time speaker diarization is a need we’ve heard about from many businesses across the world that rely on transcribing the volumes of voice conversations they collect every day. Imagine you operate a call center and regularly need to act while customer and agent conversations are happening: providing product-related help, alerting a supervisor about negative feedback, or flagging calls based on customer promotional activities. Prior to today, calls were typically transcribed and analyzed after they ended. Now, Watson’s speaker diarization capability enables access to that data immediately.

Let’s take a look at two examples. The first is the normal output file for a conversation transcribed by Watson Speech To Text:


Now let’s take a look at that same conversation with Watson’s Speech to Text supported by speaker diarization, and you’ll see the difference:


In developing this capability, our team had to overcome technical challenges that have existed in this space for decades. For example, a lot of research has been done to transcribe longer form speech, such as speech from news broadcasts. However, in these scenarios, broadcasters typically speak uninterrupted for long stretches, which makes it easier for a system to recognize who is speaking because there is more content to analyze from each speaker’s voice. In a live conversation, on the other hand, the exchanges shift quickly back and forth between speakers, which makes it hard for a system to build a speech model of a particular speaker before another person begins talking.

To overcome this, we developed a system that would recognize the variations in spectral frequency content of each speaker during a conversation. Think of the nuances you might hear if you blow across the top of a soda bottle — our voice frequencies work in much the same way. While you might be able to tell the difference between a woman and a man, you may not as easily differentiate between two women unless you break down the frequencies. What Watson is able to do is instantly build each of these profiles and assign the output text to specific speakers.

To achieve this as the conversation takes place, we also developed an advanced “speaker clustering” algorithm that can be updated in real time. (Most algorithms are not able to dynamically build models while simultaneously absorbing data.) With this technology, Watson is optimized for real-time speaker diarization of telephony conversations between two participants, and we are actively working on a solution that can robustly handle four, five or even six speakers at the same time.
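To make the idea of a clustering algorithm that updates as data arrives more concrete, here is a toy sketch. It is not Watson’s algorithm, which this post does not describe in detail. It is a simple online clusterer (sequential k-means) over made-up two-dimensional feature vectors, and the class name `OnlineSpeakerClusters`, the distance threshold, and the sample features are all assumptions chosen for illustration.

```python
import math


class OnlineSpeakerClusters:
    """Toy online clustering: one running centroid per hypothetical speaker."""

    def __init__(self, max_speakers=2, new_speaker_distance=1.0):
        self.max_speakers = max_speakers
        # Assumed threshold: a segment farther than this from every
        # known centroid is treated as a new speaker.
        self.new_speaker_distance = new_speaker_distance
        self.centroids = []
        self.counts = []

    def assign(self, features):
        """Label one segment and update the matching centroid in place."""
        if not self.centroids:
            return self._add(features)
        distances = [math.dist(features, c) for c in self.centroids]
        nearest = min(range(len(distances)), key=distances.__getitem__)
        room_left = len(self.centroids) < self.max_speakers
        if distances[nearest] > self.new_speaker_distance and room_left:
            return self._add(features)
        # Running-mean update: the model changes with every new segment,
        # so no second pass over earlier audio is needed.
        self.counts[nearest] += 1
        rate = 1.0 / self.counts[nearest]
        self.centroids[nearest] = [
            c + rate * (f - c) for c, f in zip(self.centroids[nearest], features)
        ]
        return nearest

    def _add(self, features):
        self.centroids.append(list(features))
        self.counts.append(1)
        return len(self.centroids) - 1


# Made-up feature vectors for five consecutive speech segments.
segments = [(0.1, 0.2), (0.2, 0.1), (2.0, 2.1), (0.15, 0.15), (2.1, 1.9)]
clusters = OnlineSpeakerClusters()
labels = [clusters.assign(s) for s in segments]
print(labels)  # [0, 0, 1, 0, 1]
```

Each segment is labeled as soon as it arrives, which is the property a live conversation needs. A real system would work on acoustic features rather than hand-picked points.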

Speaker Diarization is in beta now and can be applied across three languages: US English, Japanese and Spanish. The technology can be trialed on our website. Simply select the US Narrowband model and play the sample file for a quick overview of how the feature works. This represents our first version of the software. As we learn more about how our developers use the system we will continually improve the technology.

To enable Speaker Diarization, take the following steps:

  1. Provision the Speech to Text service in Bluemix if this is your first time using the service. Follow the steps here.
  2. Once your service is provisioned, make sure to use one of these language models: en-US_NarrowbandModel, es-ES_NarrowbandModel, ja-JP_NarrowbandModel.
  3. Set the optional parameter speaker_labels = true. (Note: this parameter is false by default.)
  4. That’s it! Your output now has speaker labels identified. This works in real-time streaming as well as when you pass in an audio file. For more details follow the documentation here.
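As a rough sketch of steps 2 and 3, the snippet below builds the query parameters for a recognition request and then groups the words of a response by speaker. The helper names (`build_params`, `group_by_speaker`) and the sample response are made up for illustration. The response shape, with word `timestamps` plus a `speaker_labels` list carrying `from`, `to` and `speaker` fields, is assumed from the service documentation and may change while the feature is in beta.

```python
def build_params(model="en-US_NarrowbandModel"):
    """Query parameters for a recognition request with speaker labels on."""
    return {"model": model, "speaker_labels": "true"}


def group_by_speaker(response):
    """Merge word timestamps with speaker labels into per-speaker turns."""
    # Map each word's (start, end) time span to the speaker assigned to it.
    speaker_at = {
        (label["from"], label["to"]): label["speaker"]
        for label in response.get("speaker_labels", [])
    }
    turns = []
    for result in response.get("results", []):
        # Each timestamp entry is assumed to be [word, start, end].
        for word, start, end in result["alternatives"][0]["timestamps"]:
            speaker = speaker_at.get((start, end))
            if turns and turns[-1][0] == speaker:
                turns[-1][1].append(word)  # same speaker keeps talking
            else:
                turns.append((speaker, [word]))  # a new turn begins
    return [(speaker, " ".join(words)) for speaker, words in turns]


# Made-up response fragment, not real service output.
sample = {
    "results": [
        {"alternatives": [{"timestamps": [
            ["hello", 0.0, 0.4], ["hi", 0.6, 0.8], ["there", 0.8, 1.1],
        ]}]}
    ],
    "speaker_labels": [
        {"from": 0.0, "to": 0.4, "speaker": 0},
        {"from": 0.6, "to": 0.8, "speaker": 1},
        {"from": 0.8, "to": 1.1, "speaker": 1},
    ],
}

for speaker, text in group_by_speaker(sample):
    print(f"Speaker {speaker}: {text}")
```

Matching labels to words by their start and end times keeps the sketch short. Code that consumes a live stream would also need to handle labels that are revised as more audio arrives.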

We look forward to “hearing” what you think.


Velda May

I see this as being very helpful for those of us hard of hearing people who use a captioning phone, especially for conference calls.


JL Eaton

I wonder if this advanced speech-recognition research couldn’t also be shared with Watson Health. Getting to the most elementary level of how humans recognize each other’s voice may help with future Watson Health solutions with regard to later-stage Dementia.

All the same, the advancements by IBM in diarization of real time speech is nothing short of amazing!


    Lisa Kay Davis

    Thanks so much, JL!
