October 28, 2020 | Written by: Hagai Aronowitz and Weizhong Zhu
Categorized: AI | IBM Research-Haifa
Automatic speaker diarization is the process of recognizing “who spoke when.” It enriches understanding from automatic speech recognition, which is valuable for downstream applications such as analytics for call-center transcription and meeting transcription, and is an important component in the Watson Speech-to-Text service.
In a recent publication, “New Advances in Speaker Diarization,” presented virtually at Interspeech 2020, we describe our new state-of-the-art speaker diarization system, which introduces several novel techniques. We use multiple embedding methods to represent the acoustic information in short audio segments, and we use neural networks to exploit both the acoustic embeddings and the embedding uncertainty that results from short segment duration, similar to what was recently done for speaker change detection. We also propose a novel technique for estimating the number of clusters in spectral clustering, a popular clustering framework and the one we use for speaker diarization. This improves diarization accuracy on our in-house, real test cases and yields results beyond the state of the art on public benchmark data.
Diarization is usually done by chopping the audio input into short single-speaker segments and embedding each speech segment into a space that represents the speaker’s characteristics. The segment embeddings are then clustered. This flow is illustrated in Fig. 1.
Fig. 1: Schematic diagram of speaker diarization
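As a rough illustration of the first step in this flow, the sketch below cuts a recording into evenly spaced, overlapping windows. The function name `sliding_segments` and the default window and hop lengths are assumptions made for this example; they are not values taken from our system.

```python
def sliding_segments(total_duration, window=1.5, hop=0.75):
    """Return (start, end) times, in seconds, of evenly spaced overlapping windows.

    The window and hop defaults are illustrative assumptions only.
    """
    if total_duration <= 0 or window <= 0 or hop <= 0:
        raise ValueError("total_duration, window and hop must be positive")
    segments = []
    index = 0
    while True:
        start = index * hop
        if start >= total_duration:
            break
        # The last window is truncated at the end of the recording.
        end = min(start + window, total_duration)
        segments.append((start, end))
        if end >= total_duration:
            break
        index += 1
    return segments


if __name__ == "__main__":
    for start, end in sliding_segments(3.0):
        print(f"{start:.2f}s to {end:.2f}s")
```

Each window would then be passed to an embedding extractor, and the resulting embeddings to the clustering stage.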
One key problem is how to embed speech segments. Ideally, different speakers should be embedded to different positions in the embedding space, regardless of what they are talking about. In recent years, time-delay neural networks (TDNN)-based x-vectors and long short-term memory (LSTM)-based d-vectors have been successfully used for embedding. We use both embedding methods to obtain improved results.
The ability to score speaker similarity between speech segments is fundamental for clustering schemes such as spectral clustering. Other clustering schemes, such as agglomerative hierarchical clustering, require scoring speaker similarity between clusters.
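To make the idea of a segment-by-segment similarity matrix concrete, here is a minimal sketch that uses plain cosine similarity. Cosine scoring is a common generic baseline and is used here only as a stand-in; it is not the neural scoring described in this post, and the function name is hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between segment embeddings.

    embeddings: array-like of shape (num_segments, embedding_dim).
    Returns a symmetric (num_segments, num_segments) matrix.
    """
    x = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    # Guard against division by zero for all-zero embeddings.
    x = x / np.maximum(norms, 1e-12)
    return x @ x.T


if __name__ == "__main__":
    toy_embeddings = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
    print(cosine_similarity_matrix(toy_embeddings))
```

A matrix of this shape is what spectral clustering takes as input, whatever scoring function produces it.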
Note that for a given speaker, different segments (or clusters) will have different embeddings due to within-speaker variability. This variability is partly due to phonetic differences between segments, differences which are more significant when the segments (or clusters) are short.
To account for this duration-dependent within-speaker variability, we train a neural network to compute speaker similarity between pairs of segments or pairs of clusters. We feed a pair of acoustic embeddings, together with the corresponding durations, into the neural network (Fig. 2).
Fig. 2: Scoring speaker similarity by jointly comparing embeddings and accounting for duration
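The sketch below shows only the shape of such a scorer: two embeddings and two durations go in, and one similarity score comes out. Everything else is an assumption made for illustration, including the single hidden layer, the layer sizes, the use of log-durations, the averaging over both input orders, and the random untrained weights. It does not reproduce the network from the paper.

```python
import numpy as np


def init_params(embedding_dim, hidden_dim=16, seed=0):
    """Random, untrained weights for a toy one-hidden-layer scorer (hypothetical)."""
    rng = np.random.default_rng(seed)
    input_dim = 2 * embedding_dim + 2  # two embeddings plus two durations
    return {
        "w1": rng.normal(scale=0.1, size=(input_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "w2": rng.normal(scale=0.1, size=hidden_dim),
        "b2": 0.0,
    }


def _forward(params, emb_a, dur_a, emb_b, dur_b):
    # Log-duration is an assumed feature choice, not taken from the paper.
    features = np.concatenate([emb_a, emb_b, [np.log(dur_a), np.log(dur_b)]])
    hidden = np.maximum(0.0, features @ params["w1"] + params["b1"])  # ReLU
    logit = hidden @ params["w2"] + params["b2"]
    return float(1.0 / (1.0 + np.exp(-logit)))  # sigmoid, score in (0, 1)


def similarity_score(params, emb_a, dur_a, emb_b, dur_b):
    """Order-independent score: average the network output over both input orders."""
    return 0.5 * (_forward(params, emb_a, dur_a, emb_b, dur_b)
                  + _forward(params, emb_b, dur_b, emb_a, dur_a))


if __name__ == "__main__":
    params = init_params(embedding_dim=4)
    a, b = np.ones(4), np.arange(4.0)
    print(similarity_score(params, a, 2.0, b, 0.5))
```

In a trained system the weights would be learned from labeled same-speaker and different-speaker pairs, so that the score can discount disagreement between embeddings when one or both segments are short.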
Estimating the number of speakers: Temporal response analysis
Spectral clustering is currently the most widely used clustering method for speaker diarization. A major challenge in spectral clustering is estimating the number of clusters.
We want to distinguish between speaker-indicative eigenvectors and noisy eigenvectors. This is usually done by analyzing the sorted eigenvalues and looking for some sort of drop in the eigenvalues (eigen-gap), as the large eigenvalues of the similarity matrix typically correspond to speakers, and the small eigenvalues typically correspond to within-speaker variability (noise).
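A minimal sketch of this conventional eigen-gap heuristic follows, assuming a symmetric similarity matrix. The function name and the `max_speakers` cap are assumptions for the example, and practical systems often normalize the matrix (for example into a graph Laplacian) before this step.

```python
import numpy as np


def estimate_num_speakers_eigengap(similarity, max_speakers=8):
    """Estimate the number of clusters from the largest drop in sorted eigenvalues."""
    s = np.asarray(similarity, dtype=float)
    if s.shape[0] < 2:
        raise ValueError("need at least two segments")
    s = 0.5 * (s + s.T)  # enforce symmetry before the eigendecomposition
    eigenvalues = np.linalg.eigvalsh(s)[::-1]  # sorted in descending order
    k_max = min(max_speakers, len(eigenvalues) - 1)
    # gaps[k - 1] is the drop between the k-th and (k + 1)-th largest eigenvalues.
    gaps = eigenvalues[:k_max] - eigenvalues[1:k_max + 1]
    return int(np.argmax(gaps)) + 1


if __name__ == "__main__":
    # Toy similarity matrix: two perfectly separated speakers (3 and 2 segments).
    toy = np.zeros((5, 5))
    toy[:3, :3] = 1.0
    toy[3:, 3:] = 1.0
    print(estimate_num_speakers_eigengap(toy))
```

On clean block-structured input like the toy matrix the largest gap is unambiguous; on real similarity matrices the gaps are much less clear-cut.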
However, it is often hard to find the right cutoff point, as there may be one or two borderline eigenvalues for which it is difficult to distinguish between an eigenvalue that corresponds to an actual speaker and one that corresponds to noise. In our work we go beyond eigenvalue analysis.
Ideally, each top eigenvector of the similarity matrix corresponds to one or two speakers. Multiplying the similarity matrix by this eigenvector yields a vector we call the temporal response. Looking at the absolute values of the components of the temporal response, we expect large values in the coordinates corresponding to segments that belong to the speaker associated with the eigenvector. When two speakers are associated with the eigenvector, one speaker induces large positive values and the other induces large negative values.
For an eigenvector that is not associated with a speaker, we expect the temporal response to be noisy.
For every segment, we find the eigenvector that has the largest absolute response (a “win”) and increment the “win” counter for that signed eigenvector (positive or negative). We then compare these counters to a threshold and remove signed eigenvectors that do not have enough “wins”. The method is demonstrated in Fig. 3.
Fig. 3: Temporal response analysis: a speaker-indicative response on the left and a noisy response on the right. Although the eigenvalue (energy) on the right is larger, the temporal response on the right does not indicate a speaker, as it does not “win” in any segment. The response is plotted in red, and the maximum over all responses is plotted in blue.
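The win-counting procedure can be sketched as follows. This is a simplified reading of the description above, not our production code: the function name, the `num_candidates` and `min_wins` parameters, and the use of the unnormalized similarity matrix are assumptions for the example.

```python
import numpy as np


def temporal_response_wins(similarity, num_candidates, min_wins):
    """Count per-segment "wins" for signed top eigenvectors and keep the frequent winners.

    Returns a dict mapping (eigenvector_rank, sign) to its win count, restricted
    to signed eigenvectors with at least min_wins wins.
    """
    s = np.asarray(similarity, dtype=float)
    s = 0.5 * (s + s.T)
    num_segments = s.shape[0]
    eigenvalues, eigenvectors = np.linalg.eigh(s)
    order = np.argsort(eigenvalues)[::-1][:num_candidates]  # largest eigenvalues first
    top_vectors = eigenvectors[:, order]            # (num_segments, K)
    responses = s @ top_vectors                     # column k is the temporal response
    winners = np.argmax(np.abs(responses), axis=1)  # winning eigenvector per segment
    signs = np.sign(responses[np.arange(num_segments), winners])
    wins = {}
    for rank, sign in zip(winners, signs):
        if sign == 0:
            continue  # a zero response cannot be attributed to either sign
        key = (int(rank), int(sign))
        wins[key] = wins.get(key, 0) + 1
    return {key: count for key, count in wins.items() if count >= min_wins}


if __name__ == "__main__":
    # Toy similarity matrix: two perfectly separated speakers (3 and 2 segments).
    toy = np.zeros((5, 5))
    toy[:3, :3] = 1.0
    toy[3:, 3:] = 1.0
    kept = temporal_response_wins(toy, num_candidates=3, min_wins=2)
    print(len(kept), "signed eigenvectors kept")
```

Under this reading, the number of surviving signed eigenvectors serves as the estimate of the number of speakers. The sign of a computed eigenvector is arbitrary, so only the counts, not the signs themselves, are meaningful.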
Experiments and results
We evaluated our proposed method on the publicly available CALLHOME-500 corpus under the commonly used setup with evenly spaced, overlapping short segments, oracle voice activity detection, and 5-fold cross-validation. Results are reported in terms of Diarization Error Rate (DER), which is the fraction of time that is not attributed correctly.
Table 1 shows the DERs for selected experiments under the spectral clustering framework. Starting from a baseline DER of 8%, we obtained a state-of-the-art DER of 5.1% by combining multiple embeddings, neural network-based speaker similarity, and temporal response analysis. This compares well to other published works (Table 2).
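For intuition, the sketch below computes a simplified frame-level DER. It assumes one speaker label per frame, no overlapping speech, and identical speech frames in reference and hypothesis (as with oracle voice activity detection), so only speaker confusion contributes. The full metric also counts missed speech and false alarms, and standard scoring tools apply further conventions that are not modeled here. The function name is hypothetical, and the brute-force search over label mappings is only practical for a handful of speakers.

```python
from itertools import permutations

import numpy as np


def frame_der(reference, hypothesis):
    """Fraction of frames attributed to the wrong speaker under the best label mapping."""
    reference, hypothesis = list(reference), list(hypothesis)
    if not reference or len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must be non-empty and equally long")
    # Index labels in order of first appearance.
    ref_index = {label: i for i, label in enumerate(dict.fromkeys(reference))}
    hyp_index = {label: i for i, label in enumerate(dict.fromkeys(hypothesis))}
    size = max(len(ref_index), len(hyp_index))
    # Square confusion matrix, zero-padded when the speaker counts differ.
    confusion = np.zeros((size, size), dtype=int)
    for ref_label, hyp_label in zip(reference, hypothesis):
        confusion[ref_index[ref_label], hyp_index[hyp_label]] += 1
    # Try every one-to-one mapping between reference and hypothesis labels.
    best_correct = max(
        sum(confusion[i, mapping[i]] for i in range(size))
        for mapping in permutations(range(size))
    )
    return 1.0 - best_correct / len(reference)


if __name__ == "__main__":
    reference = ["A", "A", "A", "B", "B", "B"]
    hypothesis = ["x", "x", "y", "y", "y", "y"]
    print(f"DER = {frame_der(reference, hypothesis):.3f}")
```

In the toy example one of six frames is assigned to the wrong speaker, so the reported value is one sixth.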
Table 1: DER results for selected spectral clustering-based experiments on NIST-2000 CALLHOME
Table 2: DER results for recent works on NIST-2000 CALLHOME
Other contributors to this work include Ron Hoory, Masayuki Suzuki, and Gakutu Kurata.