Brighton Pavilion

10th Annual Conference of the International Speech Communication Association

ISCA Interspeech 2009 Brighton

Technical Programme

This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.

Tue-Ses2-O4:
Speaker Diarisation

Time: Tuesday 13:30 Place: East Wing 3 Type: Oral
Chair: Douglas Reynolds

13:30 A Study of New Approaches to Speaker Diarization

Douglas Reynolds (MIT Lincoln Laboratory)
Patrick Kenny (CRIM)
Fabio Castaldo (Politecnico di Torino)

This paper reports on work carried out at the 2008 JHU Summer Workshop examining new approaches to speaker diarization. Four different systems were developed and experiments were conducted using summed-channel telephone data from the 2008 NIST SRE. The systems are a baseline agglomerative clustering system, a new Variational Bayes system using eigenvoice speaker models, a streaming system using a mix of low dimensional speaker factors and classic segmentation and clustering, and a new hybrid system combining the baseline system with a new cosine-distance speaker factor clustering. Results are presented using the Diarization Error Rate (DER) as well as the equal error rate (EER) obtained when using diarization outputs for a speaker detection task. The best configurations of the diarization system produced DERs of 3.5-4.6% and we demonstrate a weak correlation of EER and DER.
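For readers unfamiliar with the metric, the Diarization Error Rate is commonly defined as the share of reference speech time that is missed, falsely detected, or attributed to the wrong speaker. The following is a minimal sketch of that arithmetic, assuming the three error durations have already been measured; the function name, argument names, and durations are illustrative and do not come from the paper.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """Return DER as a fraction of the total reference speech time.

    All arguments are durations in seconds. The names are illustrative
    assumptions, not taken from the paper or from a scoring tool.
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


# Hypothetical durations, chosen only to illustrate the formula.
der = diarization_error_rate(
    missed=2.0, false_alarm=1.0, confusion=1.0, total_speech=100.0
)
print(f"DER = {der:.1%}")
```

A full scorer would also need to find the best mapping between reference and hypothesised speaker labels before measuring the confusion time; that step is omitted here.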

13:50 Redefining the Bayesian Information Criterion for Speaker Diarisation

Themos Stafylakis (Institute for Language and Speech Processing, National Technical University of Athens)
Vassilis Katsouros (Institute for Language and Speech Processing)
George Carayannis (Institute for Language and Speech Processing, National Technical University of Athens)

A novel approach to the Bayesian Information Criterion (BIC) is introduced. The new criterion redefines the penalty terms of the BIC, such that each parameter is penalized with the effective sample size it is trained with. Contrary to Local-BIC, the proposed criterion scores overall clustering hypotheses and therefore is not restricted to hierarchical clustering algorithms. Contrary to Global-BIC, it provides a local dissimilarity measure that depends only on the statistics of the examined clusters and not on the overall sample size. We tested our criterion with two benchmark tests and found significant improvement in performance in the speaker diarisation task.
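As background, the conventional local BIC test for deciding whether two clusters belong to the same speaker can be sketched as below. This is the standard textbook form, not the redefined criterion the abstract proposes. It assumes one-dimensional features and a single Gaussian per cluster to stay short; the function names and sample values are illustrative.

```python
import math


def variance(xs):
    """Maximum-likelihood (biased) variance of a list of floats."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


def delta_bic(a, b, penalty_weight=1.0):
    """Conventional local delta-BIC for two clusters of 1-D features.

    Positive values favour keeping the clusters separate; negative
    values favour merging them into one model.
    """
    n1, n2 = len(a), len(b)
    n = n1 + n2
    # Log-likelihood gain of two separate Gaussians over one pooled Gaussian.
    gain = 0.5 * (
        n * math.log(variance(a + b))
        - n1 * math.log(variance(a))
        - n2 * math.log(variance(b))
    )
    # A 1-D Gaussian has two free parameters: mean and variance.
    n_params = 2
    penalty = 0.5 * penalty_weight * n_params * math.log(n)
    return gain - penalty


# Hypothetical feature values, chosen only for illustration.
speaker_x = [0.0, 0.1, -0.1, 0.2, -0.2]
speaker_x_again = [0.05, -0.05, 0.15, -0.15, 0.0]
speaker_y = [5.0, 5.1, 4.9, 5.2, 4.8]

print(delta_bic(speaker_x, speaker_y) > 0)        # well separated: keep apart
print(delta_bic(speaker_x, speaker_x_again) < 0)  # similar: merge
```

In this form the penalty depends on the combined size of the two clusters under test, which is the term the abstract reports redefining.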

14:10 Speaker Diarization Using Divide-and-Conquer

Shih-Sian Cheng (Institute of Information Science, Academia Sinica, Taipei, Taiwan)
Chun-Han Tseng (Department of Computer Science and Engineering, National Sun Yat-sen University, Kaohsiung, Taiwan)
Chia-Ping Chen (Department of Computer Science and Engineering, National Sun Yat-sen University, Kaohsiung, Taiwan)
Hsin-Min Wang (Institute of Information Science, Academia Sinica, Taipei, Taiwan)

Speaker diarization systems consist of two core components: speaker segmentation and speaker clustering. The current state-of-the-art speaker diarization systems usually apply hierarchical agglomerative clustering (HAC) for speaker clustering after segmentation. However, HAC's quadratic computational complexity with respect to the number of data samples inevitably limits its application in large-scale data sets. In this paper, we propose a divide-and-conquer (DAC) framework for speaker diarization. It recursively partitions the input speech stream into two sub-streams, performs diarization on them separately, and then combines the diarization results obtained from them using HAC. The experimental results show that the proposed framework is faster than the conventional segmentation and clustering-based approach while achieving comparable diarization accuracy. Moreover, the proposed framework obtains a higher speedup over the conventional approach on a larger test data set.
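The recursive structure described in this abstract can be illustrated with a toy sketch. It assumes each segment is summarised by a single float and that clusters are merged by centroid distance against a fixed threshold; both are simplifications made here for brevity, not the authors' features or stopping rule, and all names are hypothetical.

```python
def centroid(cluster):
    """Mean of a non-empty list of floats."""
    return sum(cluster) / len(cluster)


def hac(clusters, threshold):
    """Merge the closest pair of clusters until none is nearer than threshold."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        best = None
        # Exhaustive pairwise search: this is the quadratic step.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = abs(centroid(clusters[i]) - centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d >= threshold:
            break
        clusters[i].extend(clusters.pop(j))
    return clusters


def diarize_dac(segments, threshold, leaf_size=4):
    """Split the stream in two, diarize each half, then combine with HAC."""
    if len(segments) <= leaf_size:
        return hac([[s] for s in segments], threshold)
    mid = len(segments) // 2
    left = diarize_dac(segments[:mid], threshold, leaf_size)
    right = diarize_dac(segments[mid:], threshold, leaf_size)
    # The combining step only sees the clusters from each half,
    # not every original segment.
    return hac(left + right, threshold)


# Hypothetical segment features for two alternating speakers.
segments = [0.1, 5.0, 0.2, 5.1, 0.0, 4.9, 0.3, 5.2]
result = sorted(sorted(c) for c in diarize_dac(segments, threshold=1.0))
print(result)
```

Because the combining step starts from a handful of clusters rather than from every segment, each pairwise search works on a much smaller set, which is the kind of saving the abstract attributes to the framework.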

14:30 KL Realignment for Speaker Diarization with Multiple Feature Streams

Deepu Vijayasenan (Idiap Research Institute, 1920 Martigny, CH)
Fabio Valente (Idiap Research Institute, 1920 Martigny, CH)
Herve Bourlard (Idiap Research Institute, 1920 Martigny, CH)

This paper investigates the use of Kullback-Leibler (KL) divergence based realignment with application to speaker diarization. KL divergence based realignment operates directly on the speaker posterior distribution estimates and is compared with traditional realignment performed using an HMM/GMM system. We hypothesize that using posterior estimates to re-align speaker boundaries is more robust than Gaussian mixture models in the case of multiple feature streams with different statistical properties. Experiments are run on the NIST RT06 data. They reveal that in the case of conventional MFCC features the two approaches have the same performance, while the KL based system outperforms the HMM/GMM re-alignment when multiple feature streams (MFCC and TDOA) are combined. Furthermore, we discuss the possible extension to other feature sets.
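The core idea of scoring frames by KL divergence can be shown with a small sketch. It assumes every frame and every speaker is represented by a discrete probability distribution of the same length, and it assigns frames one at a time without the minimum-duration or sequence constraints a real realignment step would likely impose. The names and numbers are hypothetical and are not taken from the paper.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length.

    eps guards against division by zero when q has empty bins.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def realign(frames, speakers):
    """Label each frame with the speaker whose distribution is closest in KL."""
    return [
        min(speakers, key=lambda name: kl_divergence(frame, speakers[name]))
        for frame in frames
    ]


# Hypothetical speaker models and frame posteriors.
speakers = {
    "spk_a": [0.8, 0.1, 0.1],
    "spk_b": [0.1, 0.1, 0.8],
}
frames = [
    [0.7, 0.2, 0.1],
    [0.2, 0.1, 0.7],
    [0.6, 0.3, 0.1],
]
labels = realign(frames, speakers)
print(labels)
```

Since the comparison is made between probability distributions rather than raw feature vectors, streams with different statistical properties can share one scoring rule, which is the motivation the abstract gives for the approach.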

14:50 Speech Overlap Detection in a Two-Pass Speaker Diarization System

Marijn Huijbregts (University of Twente)
David van Leeuwen (TNO Human Factors)
Franciska de Jong (University of Twente)

In this paper we present the two-pass speaker diarization system that we developed for the NIST RT09s evaluation. In the first pass of our system a model for speech overlap detection is generated automatically. This model is used in two ways to reduce the diarization errors due to overlapping speech. First, it is used in a second diarization pass to remove overlapping speech from the data while training the speaker models. Second, it is used to find speech overlap for the final segmentation so that overlapping speech segments can be generated. The experiments show that our overlap detection method improves the performance of all three of our system configurations.

15:10 Improved Speaker Diarization of Meeting Speech with Recurrent Selection of Representative Speech Segments and Participant Interaction Pattern Modeling

Kyu Han (University of Southern California)
Shrikanth Narayanan (University of Southern California)

In this work we describe two distinct novel improvements to our speaker diarization system, previously proposed for analysis of meeting speech. The first approach focuses on recurrent selection of representative speech segments for speaker clustering while the other is based on participant interaction pattern modeling. The former selects speech segments with high relevance to speaker clustering, especially from a robust cluster modeling perspective, and keeps updating them throughout the clustering procedure. The latter statistically models conversation patterns between meeting participants and applies them as a priori information when refining diarization results. Experimental results reveal that the two proposed approaches improve performance by 29.82% (relative) in terms of diarization error rate in tests on 13 meeting excerpts from various meeting speech corpora.