July 12, 2020 | Written by: Masayuki Suzuki and Gakuto Kurata
Categorized: AI | IBM Research-Tokyo
Automatic speaker diarization is the process of recognizing “who spoke when.” It enriches understanding from automatic speech recognition, which is valuable for downstream applications such as analytics for call-center transcription and meeting transcription.
In a recent publication, “Speaker Embeddings Incorporating Acoustic Conditions for Diarization,” presented virtually at ICASSP 2020, we describe a novel speaker diarization algorithm that can consider not only speaker information, but also identifying clues about individual recording environments that help differentiate between the speakers, resulting in improved diarization accuracy for our in-house, real test cases as well as public benchmark data.
This enables Watson Speech-to-Text to provide more accurate speaker diarization. Our proposed method is language-independent and is now integrated into the speaker labels (speaker diarization) APIs in the IBM Watson Speech-to-Text service.
Fig. 1: Schematic diagram of speaker diarization
Diarization is realized by mapping a segment of speech, such as a word, into a space that represents the speaker’s characteristics and then clustering the segment representations. Fig. 1 shows a flow chart for diarization. A key problem is how to map speech segments to the representation space. Ideally, different speakers should be mapped to different positions in the space, regardless of what they are talking about. In recent years, neural networks have been used for this mapping, leading to substantial improvements in diarization accuracy.
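To make the clustering step concrete, here is a minimal sketch of grouping segment embeddings by cosine similarity. The greedy single-pass scheme, the threshold value, and the toy 2-D "embeddings" are all illustrative assumptions — they are not the clustering algorithm used in Watson or in the paper, which the post does not specify.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def cluster_segments(embeddings, threshold=0.8):
    """Toy greedy clustering: assign each segment embedding to the most
    similar existing cluster centroid, or open a new cluster if no
    centroid is similar enough. Returns one speaker label per segment."""
    centroids, counts, labels = [], [], []
    for emb in embeddings:
        best, best_sim = None, threshold
        for idx, c in enumerate(centroids):
            sim = cosine(emb, c)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            centroids.append(list(emb))
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            # running-mean update of the matched centroid
            counts[best] += 1
            centroids[best] = [c + (e - c) / counts[best]
                               for c, e in zip(centroids[best], emb)]
            labels.append(best)
    return labels

# Toy embeddings for five segments from two well-separated speakers.
segs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9), (0.95, 0.05)]
print(cluster_segments(segs))  # [0, 0, 1, 1, 0]
```

Segments that land close together in the embedding space receive the same label, which is exactly the "who spoke when" output of diarization.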
In analyzing real speaker diarization use cases, we found that additional information coming from the recording environment can be used for more accurate speaker diarization. We therefore proposed a new neural network training method for speaker diarization that exploits this information and showed that it greatly improves performance.
Fig. 2: Neural network based speaker embedding extraction
The process for computing neural network-based speaker embeddings is illustrated in Fig. 2. The input speech is transformed into a fixed-dimension vector through a neural network, and this vector is then used as input for a speaker classifier. The speaker classifier and the neural network are jointly trained using data comprising pairs of speech segments and speaker labels. When doing diarization, the vector — typically called a “speaker embedding” — is used for clustering.
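The key property of this extraction step is that variable-length speech segments all map to vectors of one fixed dimension, so they become directly comparable. The sketch below uses simple mean pooling over feature frames as a stand-in for the trained neural network — the pooling function and the toy frame values are illustrative assumptions, not the network described in the paper.

```python
def extract_embedding(frames, dim=2):
    """Stand-in for the embedding network: collapse a variable-length
    sequence of feature frames into one fixed-dimension vector by mean
    pooling. A real system learns this mapping jointly with a speaker
    classifier, then discards the classifier at diarization time."""
    n = len(frames)
    return tuple(sum(f[d] for f in frames) / n for d in range(dim))

# Segments of different lengths yield vectors of the same dimension,
# so they can be clustered regardless of how long each segment is.
short = [(0.9, 0.1), (1.1, -0.1)]
long_seg = [(0.0, 1.0), (0.2, 0.8), (-0.2, 1.2), (0.0, 1.0)]
print(extract_embedding(short))     # (1.0, 0.0)
print(extract_embedding(long_seg))  # (0.0, 1.0)
```

In the real pipeline the classifier head only exists to shape the embedding space during training; at inference, the fixed-dimension vectors themselves are the output.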
Speaker embeddings carry the information needed to discriminate between speakers. Considering the applications of speaker diarization, however, we can further improve the embeddings by incorporating information about the broad acoustic conditions in which the speech was recorded. Specifically, the acoustic conditions (e.g., the microphone) for each speaker are generally fixed within a single recording, so differences in these conditions can provide critical information for speaker diarization.
Fig. 3: Proposed speaker embedding extraction incorporating acoustic conditions
A simple but very effective way we found to embed acoustic condition information in the vector is to use a “speaker AND acoustic condition” classifier. Fig. 3 shows a concrete example. Unlike the usual speaker classification, we assume that the same speaker belongs to different classes if the microphone is different. The vector obtained in this way includes information about not only the speaker, but also the microphone.
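The relabeling behind this classifier can be sketched in a few lines: each (speaker, recording) pair becomes its own training class. The helper name and the toy speaker/recording identifiers below are hypothetical, and the paper's exact labeling scheme may differ in detail.

```python
def make_composite_classes(utterances):
    """Map each (speaker, recording) pair to its own class index, so the
    same speaker captured over two different microphones or channels
    becomes two distinct training targets."""
    class_ids = {}
    targets = []
    for speaker, recording in utterances:
        key = (speaker, recording)
        if key not in class_ids:
            class_ids[key] = len(class_ids)
        targets.append(class_ids[key])
    return targets, class_ids

# "alice" in rec1 and "alice" in rec2 get different class indices.
data = [("alice", "rec1"), ("alice", "rec2"),
        ("bob", "rec1"), ("alice", "rec1")]
targets, classes = make_composite_classes(data)
print(targets)  # [0, 1, 2, 0]
```

Training the embedding network against these composite targets pushes it to encode microphone and channel cues alongside speaker identity, which is exactly the extra signal that helps separate speakers within one recording.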
Fig. 4: Speaker Error Rate (SER) improvement by the proposed method
We evaluated our proposed method using four test cases, including the well-known public CallHome corpus (LDC catalog number LDC97S42). Fig. 4 shows the average speaker error rates from diarization. By using a neural network with the standard speaker classification task, speaker error rate (SER) improved from 6.02 percent to 5.59 percent. Using the proposed speaker AND microphone classification, SER improved from 5.59 percent to 4.36 percent. By combining with another method detailed in the paper, we achieved an SER of 4.21 percent on this test set.