Look who’s talking: IBM debuts Watson Speech To Text “Speaker Diarization” beta

“Hey, it’s me.”

How often has this been your first sentence after someone answers your phone call?

Our human ability to distinguish a person’s voice in just a few words is remarkable — and it’s an incredibly difficult task for an artificial agent to replicate. While we can instantly identify a voice or dialogue between two people, a machine will see this as one stream of input from a single source.

Today, IBM Research and Watson commercial teams, working together, have made a significant step forward in the ability to distinguish between speakers in a conversation. Watson’s Speech To Text API has been enhanced with beta functionality that supports real-time speaker ‘diarization.’ Diarization derives from ‘diary,’ the recording of past events; here, it refers to the algorithms used to identify and segment speech by speaker identity.

Speaker diarization has previously only worked effectively for pre-recorded conversations. What is exciting for us is that Watson can process the results as a conversation happens between two people. We recently outlined a number of advancements we have been making in audio analytics, and speaker diarization has been an important focus for us. Today’s beta release marks an additional step in our cognitive speech capabilities.

Real-time speaker diarization is a need we’ve heard about from businesses around the world that transcribe volumes of voice conversations every day. Imagine you operate a call center and need to act while customer-agent conversations are still happening: providing product-related help, alerting a supervisor to negative feedback, or flagging calls tied to promotional activity. Until today, calls were typically transcribed and analyzed only after they ended. Now, Watson’s speaker diarization capability makes that data available immediately.

Let’s take a look at two examples. The first is the normal output file for a conversation transcribed by Watson Speech To Text:


Now let’s take a look at that same conversation with Watson’s Speech to Text supported by speaker diarization, and you’ll see the difference:
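The original post showed these outputs as screenshots, which are not reproduced here. As an illustrative sketch (the field names follow the public Speech To Text JSON response format, but the transcript and values are invented), the diarized response adds a `speaker_labels` array that maps time ranges to numbered speakers:

```python
# Without speaker_labels, the response is one undifferentiated transcript
# stream; with speaker_labels=true it also carries per-segment speaker IDs.
plain_response = {
    "results": [{
        "alternatives": [{
            "transcript": "hello thank you for calling how can I help you "
                          "hi I have a question about my bill ",
            "confidence": 0.94,
        }],
        "final": True,
    }],
}

diarized_response = dict(plain_response)
diarized_response["speaker_labels"] = [
    {"from": 0.03, "to": 2.45, "speaker": 0, "confidence": 0.61, "final": False},
    {"from": 2.90, "to": 5.10, "speaker": 1, "confidence": 0.58, "final": False},
]

# Two distinct speakers identified within the same audio stream.
speakers = {label["speaker"] for label in diarized_response["speaker_labels"]}
```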


In developing this capability, our team had to overcome technical challenges that have existed in this space for decades. For example, a lot of research has gone into transcribing longer-form speech, such as news broadcasts. In those scenarios, broadcasters typically speak uninterrupted for long stretches, making it easier for a system to recognize who is speaking because there is more content to analyze from each speaker’s voice. In a live conversation, on the other hand, the exchanges shift quickly back and forth between speakers, which makes it hard for a system to build a speech model for one speaker before another person begins talking.

To overcome this, we developed a system that would recognize the variations in spectral frequency content of each speaker during a conversation. Think of the nuances you might hear if you blow across the top of a soda bottle — our voice frequencies work in much the same way. While you might be able to tell the difference between a woman and a man, you may not as easily differentiate between two women unless you break down the frequencies. What Watson is able to do is instantly build each of these profiles and assign the output text to specific speakers.
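As a rough illustration of that intuition (a toy sketch, not Watson’s actual method), two signals with different fundamental frequencies produce clearly separable average spectra, while two stretches of the same “voice” do not:

```python
import numpy as np

def spectral_profile(signal, n_fft=512):
    # Average magnitude spectrum over short windowed frames:
    # a crude stand-in for a speaker's "voice profile".
    frames = signal[: len(signal) // n_fft * n_fft].reshape(-1, n_fft)
    return np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)).mean(axis=0)

def distance(a, b):
    # Cosine distance between two spectral profiles (0 = identical).
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

sr = 8000  # narrowband telephony sample rate
t = np.arange(0, 1.0, 1 / sr)
low_voice = np.sin(2 * np.pi * 120 * t)   # lower fundamental: "speaker" A
high_voice = np.sin(2 * np.pi * 220 * t)  # higher fundamental: "speaker" B

# Same voice at two different times vs. two different voices.
same = distance(spectral_profile(low_voice[:4000]),
                spectral_profile(low_voice[4000:]))
cross = distance(spectral_profile(low_voice), spectral_profile(high_voice))
```

Here `same` comes out near zero and `cross` much larger, which is the separation a diarization system exploits.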

To enable us to achieve this capability as the conversation takes place, we also developed an advanced “speaker clustering” algorithm that can be updated in real-time. (Most algorithms are not able to dynamically build models while simultaneously absorbing data). With this technology, Watson is optimized to accommodate real-time speaker diarization for telephony conversations between two participants, and we are working actively to create a solution that can robustly handle four, five or even six speakers at the same time.
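Watson’s actual clustering algorithm is not public; as a toy sketch of the idea, an incremental clusterer can assign each incoming voice-feature vector to the nearest existing speaker model (updating that model on the fly) or open a new one when no model is close enough:

```python
import numpy as np

class OnlineSpeakerClusterer:
    """Toy incremental clustering: label each feature vector as it arrives,
    building speaker models while simultaneously absorbing data."""

    def __init__(self, threshold=1.0):
        self.threshold = threshold
        self.centroids = []  # one running-mean "model" per speaker
        self.counts = []

    def assign(self, features):
        features = np.asarray(features, dtype=float)
        if self.centroids:
            dists = [np.linalg.norm(features - c) for c in self.centroids]
            best = int(np.argmin(dists))
            if dists[best] < self.threshold:
                # Close enough: update this speaker's model in place.
                self.counts[best] += 1
                self.centroids[best] += (
                    features - self.centroids[best]) / self.counts[best]
                return best
        # No close match: a new speaker has started talking.
        self.centroids.append(features.copy())
        self.counts.append(1)
        return len(self.centroids) - 1

clusterer = OnlineSpeakerClusterer(threshold=1.0)
labels = [clusterer.assign(v) for v in
          [[0.0, 0.1], [0.1, 0.0], [3.0, 3.1], [0.05, 0.05], [3.1, 3.0]]]
# labels alternate between two "speakers": [0, 0, 1, 0, 1]
```

The key property, as described above, is that no second pass over the audio is needed: every vector gets a label the moment it arrives.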

Speaker Diarization is in beta now and can be applied across three languages: US English, Japanese and Spanish. The technology can be trialed on our website. Simply select the US Narrowband model and play the sample file for a quick overview of how the feature works. This represents our first version of the software. As we learn more about how our developers use the system we will continually improve the technology.

To enable Speaker Diarization, take the following steps:

  1. Provision the Speech to Text service in Bluemix if this is your first time using it. Follow the steps here.
  2. Once your service is provisioned, use one of these language models: en-US_NarrowbandModel, es-ES_NarrowbandModel, or ja-JP_NarrowbandModel.
  3. Set the optional parameter speaker_labels = true. (Note: this parameter is false by default.)
  4. That’s it! Your output now includes speaker labels. This works for real-time streaming as well as for audio files you pass in. For more details, follow the documentation here.
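The steps above can be sketched as a minimal REST call. This is a sketch only: the service URL and API key are placeholders for your own instance’s credentials, and your instance’s authentication scheme may differ (older instances used a username/password pair instead of an API key):

```python
import base64
import os
import urllib.parse
import urllib.request

# Placeholder credentials -- substitute the values from your own
# Speech to Text service instance.
API_KEY = os.environ.get("STT_APIKEY", "your-api-key")
SERVICE_URL = os.environ.get(
    "STT_URL", "https://stream.watsonplatform.net/speech-to-text/api")

# Step 2: choose one of the supported narrowband models.
# Step 3: set the optional speaker_labels parameter to true.
params = urllib.parse.urlencode(
    {"model": "en-US_NarrowbandModel", "speaker_labels": "true"})
url = SERVICE_URL + "/v1/recognize?" + params

def recognize(audio_path):
    """POST a WAV file to /v1/recognize; the JSON response then includes
    a speaker_labels array alongside the usual transcript results."""
    with open(audio_path, "rb") as f:
        audio = f.read()
    req = urllib.request.Request(url, data=audio, method="POST")
    req.add_header("Content-Type", "audio/wav")
    token = base64.b64encode(("apikey:" + API_KEY).encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()
```

For live streaming, the same speaker_labels parameter applies to the service’s WebSocket interface.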

We look forward to “hearing” what you think.
