New advances in speaker diarization

Automatic speaker diarization is the process of recognizing “who spoke when.” It enriches the output of automatic speech recognition, which is valuable for downstream applications such as analytics for call-center and meeting transcription, and it is an important component of the Watson Speech-to-Text service.

In a recent publication, “New Advances in Speaker Diarization,” presented virtually at Interspeech 2020, we describe our new state-of-the-art speaker diarization system, which introduces several novel techniques. We use multiple embedding methods to represent the acoustic information in short audio segments, and we use neural networks to leverage both the acoustic embeddings and the embedding uncertainty that results from short segment duration, similar to what was done recently for speaker change detection. We also propose a novel technique for estimating the number of clusters in spectral clustering, a popular clustering framework and the one we use for speaker diarization. This results in improved diarization accuracy on our in-house, real-world test cases, as well as results that surpass the state of the art on public benchmark data.

Diarization is usually done by chopping the audio input into short single-speaker segments and embedding each speech segment into a space that represents the speaker’s characteristics. The segment embeddings are then clustered. This flow is illustrated in Fig. 1.
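As a rough illustration of the first step, the sketch below chops one detected speech region into short, evenly spaced, overlapping windows. The function name and the 1.5-second window with a 0.75-second shift are assumptions chosen for the example, not values taken from our system.

```python
def sliding_segments(start, end, window=1.5, shift=0.75):
    """Split one speech region [start, end), in seconds, into overlapping windows.

    The window and shift defaults are illustrative assumptions only.
    """
    segments = []
    t = start
    while t < end:
        # Clip the last window so it never runs past the end of the region.
        segments.append((t, min(t + window, end)))
        if t + window >= end:
            break
        t += shift
    return segments


# A 3-second speech region yields three half-overlapping segments.
for seg in sliding_segments(0.0, 3.0):
    print(seg)
```

Each resulting segment would then be passed to the embedding extractor, and the embeddings handed to the clustering stage.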

Fig. 1: Schematic diagram of speaker diarization


Speaker embeddings

One key problem is how to embed speech segments. Ideally, different speakers should be embedded at different positions in the embedding space, regardless of what they are talking about. In recent years, x-vectors based on time-delay neural networks (TDNNs) and d-vectors based on long short-term memory (LSTM) networks have been used successfully for embedding. We use both embedding methods to obtain improved results.

Speaker similarity

The ability to score speaker similarity between speech segments is fundamental for clustering schemes such as spectral clustering. Other clustering schemes, such as agglomerative hierarchical clustering, require scoring speaker similarity between clusters.

Note that for a given speaker, different segments (or clusters) will have different embeddings due to within-speaker variability. This variability is partly due to phonetic differences between segments, differences which are more significant when the segments (or clusters) are short.

To account for this duration-dependent within-speaker variability, we train a neural network to compute speaker similarity between pairs of segments or pairs of clusters. We feed a pair of acoustic embeddings, together with the corresponding durations, into the neural network (Fig. 2).
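The sketch below shows one way such an input could be assembled: two embeddings and their durations are concatenated and passed through a tiny feed-forward network. Everything here is an assumption made for illustration (the layer sizes, the log-duration features, averaging over both input orders to keep the score symmetric, and the random untrained weights); it is not the architecture described in the paper.

```python
import numpy as np


def init_params(emb_dim, hidden=16, seed=0):
    """Random, untrained weights for a one-hidden-layer network (illustrative)."""
    rng = np.random.default_rng(seed)
    in_dim = 2 * emb_dim + 2  # two embeddings plus two duration features
    return {
        "W1": rng.normal(scale=0.1, size=(hidden, in_dim)),
        "b1": np.zeros(hidden),
        "w2": rng.normal(scale=0.1, size=hidden),
        "b2": 0.0,
    }


def similarity_score(emb_a, dur_a, emb_b, dur_b, params):
    """Return a score in (0, 1) for a pair of segments and their durations."""

    def forward(x):
        h = np.tanh(params["W1"] @ x + params["b1"])
        logit = params["w2"] @ h + params["b2"]
        return 1.0 / (1.0 + np.exp(-logit))

    # Durations (seconds) enter as log features; this choice is an assumption.
    x_ab = np.concatenate([emb_a, emb_b, [np.log(dur_a), np.log(dur_b)]])
    x_ba = np.concatenate([emb_b, emb_a, [np.log(dur_b), np.log(dur_a)]])
    # Average both orderings so that score(a, b) equals score(b, a).
    return float(0.5 * (forward(x_ab) + forward(x_ba)))
```

In practice the weights would be trained on labeled same-speaker and different-speaker pairs covering a range of durations, so that the network learns how much to trust short segments.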

Fig. 2: Scoring speaker similarity by jointly comparing embeddings and accounting for duration

Estimating the number of speakers: Temporal response analysis

Spectral clustering is currently the most widely used clustering method for speaker diarization. A major challenge in spectral clustering is estimating the number of clusters.

We want to distinguish between speaker-indicative eigenvectors and noisy eigenvectors. This is usually done by sorting the eigenvalues and looking for a sharp drop between consecutive values (the eigen-gap), as the large eigenvalues of the similarity matrix typically correspond to speakers, and the small eigenvalues typically correspond to within-speaker variability (noise).
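For reference, the conventional eigen-gap heuristic can be sketched in a few lines. This is a minimal version that works directly on a symmetric similarity matrix; the function name and the `max_speakers` cap are assumptions for the example, and practical systems often normalize or prune the matrix first.

```python
import numpy as np


def estimate_num_speakers_eigengap(similarity, max_speakers=8):
    """Estimate the speaker count from the largest drop in sorted eigenvalues."""
    # eigvalsh returns ascending eigenvalues of a symmetric matrix; reverse them.
    eigvals = np.linalg.eigvalsh(similarity)[::-1]
    top = eigvals[: max_speakers + 1]
    gaps = top[:-1] - top[1:]
    # The largest gap after the k-th eigenvalue suggests k speakers.
    return int(np.argmax(gaps)) + 1


# Toy matrix: two speakers with three segments each, high within-speaker
# similarity (0.9) and low cross-speaker similarity (0.1).
block = np.kron(np.eye(2), np.ones((3, 3)))
toy = 0.1 * np.eye(6) + 0.8 * block + 0.1 * np.ones((6, 6))
print(estimate_num_speakers_eigengap(toy))
```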

However, it is often hard to find the right cutoff point, as there may be one or two borderline eigenvalues for which it is difficult to distinguish between an eigenvalue that corresponds to an actual speaker and one that corresponds to noise. In our work we go beyond eigenvalue analysis.

Ideally, each top eigenvector of the similarity matrix corresponds to one or two speakers. Multiplying the similarity matrix by such an eigenvector results in a vector we call the temporal response. Looking at the absolute values of the components of the temporal response, we expect large values in the coordinates of segments that belong to the speaker associated with the eigenvector. When two speakers are associated with the eigenvector, one of the speakers will induce large positive values and the other will induce large negative values.

For an eigenvector that is not associated with a speaker, we expect the temporal response to be noisy.

For every segment, we find the eigenvector that has the largest absolute response (a “win”) and increment the “win” counter for that signed eigenvector (positive or negative). We then compare these counters to a threshold and remove signed eigenvectors that do not have enough “wins”. The method is demonstrated in Fig. 3.
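The win-counting procedure described above can be sketched as follows. This is a simplified illustration rather than our implementation: the function names, the number of candidate eigenvectors, the `min_wins` threshold, and the choice to use the number of surviving signed eigenvectors as the speaker estimate are all assumptions made for the example.

```python
import numpy as np


def count_signed_wins(similarity, num_candidates):
    """Count, per signed eigenvector, the segments where its response is largest."""
    eigvals, eigvecs = np.linalg.eigh(similarity)
    order = np.argsort(eigvals)[::-1][:num_candidates]
    top_vecs = eigvecs[:, order]          # shape: (n_segments, num_candidates)
    response = similarity @ top_vecs      # one temporal response per column
    winner = np.argmax(np.abs(response), axis=1)
    sign = np.sign(response[np.arange(len(winner)), winner])
    wins = {}
    for k, s in zip(winner, sign):
        key = (int(k), int(s))            # (eigenvector rank, +1 or -1)
        wins[key] = wins.get(key, 0) + 1
    return wins


def estimate_speakers_by_wins(similarity, num_candidates=4, min_wins=2):
    """Keep signed eigenvectors with enough wins and return how many survive."""
    wins = count_signed_wins(similarity, num_candidates)
    return sum(1 for count in wins.values() if count >= min_wins)


# Toy matrix: two speakers, +1 within-speaker and -1 cross-speaker similarity.
labels = np.array([1, 1, 1, -1, -1, -1])
toy = np.outer(labels, labels).astype(float)
print(count_signed_wins(toy, num_candidates=3))
print(estimate_speakers_by_wins(toy, num_candidates=3))
```

In the toy matrix a single eigenvector carries both speakers, one through its positive components and one through its negative components, which is why wins are counted per signed eigenvector rather than per eigenvector.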

Fig. 3: Temporal response analysis: a speaker-indicative response on the left, a noisy response on the right. Although the eigenvalue (energy) on the right is larger, the temporal response on the right does not indicate a speaker because it does not “win” in any segment. Each response is plotted in red, and the maximum over all responses is plotted in blue.

Experiments and results

We evaluated our proposed method on the publicly available CALLHOME-500 corpus under the commonly used setup with evenly spaced, overlapping short segments, oracle voice activity detection, and 5-fold cross-validation. Results are reported in terms of Diarization Error Rate (DER), which is the fraction of time that is not attributed correctly.
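To make the metric concrete, here is a simplified frame-level sketch. It assumes equal-length frames, no overlapping speech, and perfect voice activity detection, so only speaker confusion is counted; standard scoring tools also account for missed speech, false alarms, and usually a forgiveness collar around speaker boundaries. The function name and the brute-force search over label mappings are assumptions suited only to small examples.

```python
from itertools import permutations


def frame_der(reference, hypothesis):
    """Confusion-only DER under the best one-to-one mapping of cluster labels."""
    ref_labels = sorted(set(reference))
    hyp_labels = sorted(set(hypothesis))
    # Pad with None so extra hypothesis clusters can map to "no speaker".
    size = max(len(ref_labels), len(hyp_labels))
    ref_padded = ref_labels + [None] * (size - len(ref_labels))
    best_correct = 0
    for perm in permutations(ref_padded, len(hyp_labels)):
        mapping = dict(zip(hyp_labels, perm))
        correct = sum(
            1 for r, h in zip(reference, hypothesis) if mapping[h] == r
        )
        best_correct = max(best_correct, correct)
    return 1.0 - best_correct / len(reference)


reference = ["A", "A", "A", "B", "B", "B"]
hypothesis = [1, 1, 0, 0, 0, 0]  # cluster ids are arbitrary
print(frame_der(reference, hypothesis))  # one of six frames is misattributed
```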

Table 1 shows the DERs for selected experiments under the spectral clustering framework. Starting from a baseline DER of 8%, the combination of multiple embeddings, neural network-based speaker similarity, and temporal response analysis yields a state-of-the-art DER of 5.1%, which compares well to other published works (Table 2).

Table 1: DER results for selected spectral clustering-based experiments on NIST-2000 CALLHOME

Table 2: DER results for recent works on NIST-2000 CALLHOME

Other contributors to this work include Ron Hoory, Masayuki Suzuki, and Gakutu Kurata.

Team Leader, Biometrics Research, IBM Research

Weizhong Zhu

Speech Scientist, IBM Research
