Brighton Pavilion

10th Annual Conference of the International Speech Communication Association

ISCA Interspeech 2009 Brighton

Technical Programme

This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.

Tue-Ses1-P4:
Speaker Recognition and Diarisation

Time: Tuesday 10:00 Place: Hewison Hall Type: Poster
Chair: Sadaoki Furui

#1 Importance of Nasality Measures for Speaker Recognition Data Selection and Performance Prediction

Howard Lei (International Computer Science Institute)
Eduardo Lopez-Gonzalo (Dep. of Signals, Systems and Radiocomm., Universidad Politecnica Madrid, Spain)

We improve upon measures relating feature vector distributions to speaker recognition (SR) performance, for use in SR performance prediction and arbitrary data selection. In particular, we examine the means and variances of 11 features pertaining to nasality (resulting in 22 measures), computing them on feature vectors of phones to determine which measures give good SR performance prediction of phones. We found that the combination of nasality measures gives a 0.917 correlation with the Equal Error Rates (EERs) of phones on SRE08, exceeding the correlation of our previous best measure (mutual information) by 12.7%. When implemented in our data-selection scheme (which does not require an SR system to be run), the nasality measures allow us to select data with a combined EER better than that of data selected by running an SR system in certain cases, at a fortieth of the computational cost. The nasality measures require a tenth of the computational cost of our previous best measure.
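As a rough illustration of the evaluation step this abstract describes (correlating a per-phone measure with per-phone EERs), the sketch below computes a Pearson correlation coefficient. The abstract does not say which correlation coefficient was used, so Pearson is an assumption. The function name and the toy numbers are hypothetical and are not taken from the paper.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up values for illustration only: one measure value and one EER per phone.
measure_per_phone = [0.2, 0.5, 0.9, 1.4]
eer_per_phone = [30.0, 26.0, 21.0, 15.0]
r = pearson_correlation(measure_per_phone, eer_per_phone)
```

A measure is useful for performance prediction when the magnitude of `r` is close to 1, whatever its sign.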

#2 Exploration of Vocal Excitation Modulation Features for Speaker Recognition

Ning Wang (Department of Electronic Engineering, The Chinese University of Hong Kong)
P. C. Ching (Department of Electronic Engineering, The Chinese University of Hong Kong)
Tan Lee (Department of Electronic Engineering, The Chinese University of Hong Kong)

To derive spectro-temporal vocal source features that complement the conventional spectral-based vocal tract features and improve the performance and reliability of a speaker recognition system, the excitation-related modulation properties are studied. Through a multi-band demodulation method, source-related amplitude and phase quantities are parameterized into feature vectors. Evaluation of the proposed features is carried out first through a set of designed experiments on artificially generated inputs, and then by simulations on a speech corpus. It is observed via the designed experiments that the proposed features are capable of capturing the vocal differences in terms of F0 variation, pitch epoch shape, and relevant excitation details between epochs. In the simulations, when combined with the standard spectral features, both the amplitude-related and the phase-related features are shown to clearly reduce the identification error rate and equal error rate of the speaker recognition system.

#3 Speaker Identification for Whispered Speech Using Modified Temporal Patterns and MFCCs

Xing Fan (Center for Robust Speech Systems (CRSS), Erik Jonsson School of Engineering & Computer Science, University of Texas at Dallas, Richardson, Texas 75083, USA)
John H.L. Hansen (Center for Robust Speech Systems (CRSS), Erik Jonsson School of Engineering & Computer Science, University of Texas at Dallas, Richardson, Texas 75083, USA)

Whisper is used by talkers intentionally in certain circumstances to protect personal privacy. Due to the absence of periodic excitation in the production of whisper, there are considerable differences between neutral and whispered speech in spectral structure. Therefore, the performance of speaker ID systems trained with high-energy voiced phonemes degrades significantly when tested with whisper. This study considers a combination of modified temporal patterns (m-TRAPs) and MFCCs to improve the performance of a neutral-trained system for whispered speech. The m-TRAPs are introduced based on an explanation for the whisper/neutral mismatch degradation of an MFCC-based system. A phoneme-by-phoneme score weighting method is used to fuse the scores from each subband. Text-independent closed-set speaker ID was conducted, and experiments show that m-TRAPs are especially effective for whisper with low SNR. When combining the scores from both MFCC and TRAP GMMs, an absolute 26.3% improvement in accuracy is obtained compared with a traditional MFCC baseline system. This result confirms a viable approach to improving speaker ID performance under neutral/whisper mismatch conditions.

#4 Speaker Diarization for Meeting Room Audio

Hanwu Sun (Institute for Infocomm Research)
Tin Lay Nwe (Institute for Infocomm Research)
Bin Ma (Institute for Infocomm Research)
Haizhou Li (Institute for Infocomm Research)

This paper describes a speaker diarization system for the 2007 NIST Rich Transcription (RT07) Meeting Recognition Evaluation, for the Multiple Distant Microphone (MDM) task in meeting room scenarios. The system includes three major modules: data preparation, initial speaker clustering, and cluster purification/merging. Data preparation consists of Wiener filtering and beamforming of the raw data, Time Difference of Arrival estimation, and speech activity detection. Based on the initially processed data, two-stage histogram quantization is used to perform the initial speaker clustering. A modified purification strategy via a high-order GMM clustering method is proposed. The BIC criterion is applied for cluster merging. The system achieves a competitive overall DER of 8.31% on the RT07 MDM speaker diarization task.
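The BIC-based merging step mentioned in this abstract can be sketched in its simplest textbook form. The version below models each cluster with a single one-dimensional Gaussian, which is an assumption made for brevity; the abstract does not give the system's actual models, features, or penalty weight. All names (`delta_bic`, `penalty_weight`, `var_floor`) are hypothetical.

```python
import math


def variance(xs, var_floor=1e-8):
    """Maximum-likelihood variance, floored to keep the logarithm finite."""
    mean = sum(xs) / len(xs)
    return max(sum((x - mean) ** 2 for x in xs) / len(xs), var_floor)


def delta_bic(cluster_a, cluster_b, penalty_weight=1.0):
    """Delta-BIC between modelling two clusters separately and jointly.

    Sign convention assumed here: a positive value favours keeping two
    separate clusters, a non-positive value favours merging them.
    """
    a, b = list(cluster_a), list(cluster_b)
    n_a, n_b = len(a), len(b)
    n = n_a + n_b
    gain = 0.5 * (
        n * math.log(variance(a + b))
        - n_a * math.log(variance(a))
        - n_b * math.log(variance(b))
    )
    # One extra Gaussian costs 2 parameters (mean and variance) in 1-D.
    penalty = 0.5 * 2 * math.log(n)
    return gain - penalty_weight * penalty


def should_merge(cluster_a, cluster_b, penalty_weight=1.0):
    return delta_bic(cluster_a, cluster_b, penalty_weight) <= 0.0
```

In this sketch, two clusters drawn from the same distribution give a negative value and are merged, while well-separated clusters give a large positive value and are kept apart.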

#5 Improving Speaker Segmentation via Speaker Identification and Text Segmentation

Runxin Li (InterACT, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, USA)
Tanja Schultz (InterACT, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, USA; Fakultät für Informatik, Universität Karlsruhe (TH), Germany)
Qin Jin (InterACT, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, USA)

Speaker segmentation is an essential part of a speaker diarization system. Common segmentation systems usually miss speaker change points when speakers switch quickly. These errors seriously confuse the subsequent speaker clustering step and result in high overall speaker diarization error rates. In this paper, two methods are proposed to deal with this problem. The first approach uses speaker identification techniques to boost speaker segmentation. The second approach applies text segmentation methods to improve the performance of speaker segmentation. Experiments on Quaero speaker diarization evaluation data show that our methods achieve up to 45% relative reduction in the speaker diarization error and 64% relative increase in the speaker change detection recall rate over the baseline system. Moreover, both approaches can be considered post-processing steps over the baseline segmentation; therefore, they can be applied in any speaker diarization system.

#6 Overall performance metrics for multi-condition Speaker Recognition Evaluations

David van Leeuwen (TNO Human Factors)

In this paper we propose a framework for measuring the overall performance of an automatic speaker recognition system using a set of trials of a heterogeneous evaluation such as NIST SRE-2008, which combines several acoustic conditions in one evaluation. We do this by weighting trials of different conditions according to their relative proportion, and we derive expressions for the basic speaker recognition performance measures Cdet and Cllr, as well as the DET curve, from which EER and minCdet can be computed. Examples of pooling of conditions are shown on SRE-2008 data, including speaker sex, microphone type, and speaking style.
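To make the idea of condition weighting concrete, the sketch below computes miss and false-alarm rates per condition, pools them with per-condition weights, and evaluates a detection cost. It is not the paper's derivation: equal weights per condition are an assumed default, the function names are hypothetical, and the cost parameters (C_miss = 10, C_fa = 1, P_target = 0.01) are the values commonly quoted for NIST SRE-2008, used here as assumed defaults.

```python
def per_condition_rates(trials, threshold):
    """Return {condition: (p_miss, p_fa)}.

    `trials` is an iterable of (score, is_target, condition) tuples; a trial
    is accepted when its score is at or above `threshold`.
    """
    stats = {}
    for score, is_target, cond in trials:
        s = stats.setdefault(cond, {"tar": 0, "miss": 0, "non": 0, "fa": 0})
        if is_target:
            s["tar"] += 1
            s["miss"] += int(score < threshold)
        else:
            s["non"] += 1
            s["fa"] += int(score >= threshold)
    rates = {}
    for cond, s in stats.items():
        if s["tar"] == 0 or s["non"] == 0:
            raise ValueError(f"condition {cond!r} lacks target or non-target trials")
        rates[cond] = (s["miss"] / s["tar"], s["fa"] / s["non"])
    return rates


def pooled_rates(rates, weights=None):
    """Weighted average of per-condition rates (equal weights by default)."""
    conds = list(rates)
    if weights is None:
        weights = {c: 1.0 for c in conds}
    total = sum(weights[c] for c in conds)
    p_miss = sum(weights[c] * rates[c][0] for c in conds) / total
    p_fa = sum(weights[c] * rates[c][1] for c in conds) / total
    return p_miss, p_fa


def detection_cost(p_miss, p_fa, c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Detection cost: C_miss * P_miss * P_tar + C_fa * P_fa * (1 - P_tar)."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
```

Pooling rates per condition, rather than pooling raw trials, keeps a condition with many trials from dominating the overall figure.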

#7 Speaker Identification using Warped MVDR Cepstral Features

Matthias Wölfel (ZKM|Center for Art and Media, Germany)
Qian Yang (Universität Karlsruhe (TH), Germany)
Jin Qin (Carnegie Mellon University, USA)
Tanja Schultz (Universität Karlsruhe (TH), Germany)

It is common practice to use similar or even the same feature extraction methods for automatic speech recognition and speaker recognition. While the front-end for the former has to preserve phoneme discrimination and to compensate for speaker differences to some extent, the front-end for the latter has to preserve the unique characteristics of individual speakers. It seems, therefore, contradictory to use the same feature extraction methods for both tasks. Starting out from the common practice, we propose to use warped minimum variance distortionless response (MVDR) cepstral coefficients, which have already been demonstrated to perform better for automatic speech recognition, in particular under adverse conditions. Replacing the widely used mel-frequency cepstral coefficients by warped MVDR cepstral coefficients improves the speaker identification accuracy by up to 24% relative. We found that the optimal choice of the model order within the warped MVDR framework differs between speech recognition and speaker recognition, confirming our intuition that the two different tasks indeed require different feature extraction strategies.

#8 Entropy Based Overlapped Speech Detection as a Pre-Processing Stage for Speaker Diarization

Oshry Ben-Harush (Ben-Gurion University of the Negev)
Itshak Lapidot (Sami Shamoon College of Engineering)
Hugo Guterman (Ben-Gurion University of the Negev)

One inherent deficiency of most diarization systems is their inability to handle co-channel or overlapped speech. Most of the suggested algorithms perform under singular conditions and require high computational complexity in both the time and frequency domains. In this study, frame-based entropy analysis of the audio data in the time domain serves as a single feature for an overlapped speech detection algorithm. Identification of overlapped speech segments is performed using Gaussian Mixture Modeling (GMM) along with well-known classification algorithms applied to two-speaker conversations. By employing this methodology, the proposed method eliminates the need for setting a hard threshold for each conversation or database. The LDC CALLHOME American English corpus is used for evaluation of the suggested algorithm. The proposed method successfully detects 63.2% of the frames labeled as overlapped speech by the manual segmentation, while keeping a 5.4% false-alarm rate.
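A frame-based time-domain entropy feature of the kind this abstract mentions could be computed roughly as follows. This is only a sketch under stated assumptions: the entropy is estimated from an amplitude histogram within each frame, and the frame length, hop size, and bin count are arbitrary illustrative defaults. The abstract does not specify the estimator, and the GMM classification stage is not shown.

```python
import math


def frame_entropy(frame, num_bins=16):
    """Shannon entropy (in bits) of the amplitude histogram of one frame."""
    lo, hi = min(frame), max(frame)
    if hi == lo:
        return 0.0  # a constant frame carries no amplitude uncertainty
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for sample in frame:
        # Clamp so that the maximum sample falls into the last bin.
        idx = min(int((sample - lo) / width), num_bins - 1)
        counts[idx] += 1
    n = len(frame)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)


def entropy_per_frame(samples, frame_len=256, hop=128, num_bins=16):
    """Slide a window over `samples` and return one entropy value per frame."""
    return [
        frame_entropy(samples[start:start + frame_len], num_bins)
        for start in range(0, len(samples) - frame_len + 1, hop)
    ]
```

The resulting sequence of per-frame values is the kind of single feature that a classifier could then be trained on.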

#9 Speech Style and Speaker Recognition: a Case Study

Marco Grimaldi (School of Computer Science and Informatics, UCD, Dublin; FBK, via Sommarive 18, I-38100 Povo (Trento))
Fred Cummins (School of Computer Science and Informatics, UCD, Dublin)

This work presents an experimental evaluation of the effect of different speech styles on the task of speaker recognition. We make use of willfully altered voice extracted from the CHAINS corpus and methodically assess the effect of its use in a reference speaker identification and verification system. We contrast normal readings of text with two varieties of imitative styles and with the familiar, non-imitative, variant of fast speech. Furthermore, we test the applicability of a novel speech parameterization that has been suggested as a promising technique in the task of speaker identification: the pyknogram frequency estimate coefficients - pykfec.

#10 The Majority Wins: a Method for Combining Speaker Diarization Systems

Marijn Huijbregts (University of Twente)
David van Leeuwen (TNO Human Factors)
Franciska de Jong (University of Twente)

In this paper we present a method for combining multiple diarization systems into a single system by applying a majority voting scheme. The voting scheme selects the best segmentation purely on the basis of the output of each system. On our development set of NIST Rich Transcription evaluation meetings, the voting method improves our system on all evaluation conditions. For the single distant microphone condition, the DER performance is improved by 7.8% (relative) compared to the best input system. For the multiple distant microphone condition, the improvement is 3.6%.
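One simple way to select a segmentation using only the system outputs is to pick the system that agrees most with all the others. The sketch below does this at frame level. It is an illustrative stand-in and not necessarily the voting scheme of the paper: it assumes every output is a per-frame label sequence of equal length, and that speaker labels have already been mapped to a common inventory across systems. All function names are hypothetical.

```python
def frame_agreement(labels_a, labels_b):
    """Fraction of frames on which two label sequences agree."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def select_by_agreement(outputs):
    """Pick the system whose output agrees most, on average, with the others.

    `outputs` maps a system name to its per-frame speaker labels.
    Returns (best_system_name, mean_agreement_with_others).
    """
    if len(outputs) < 2:
        raise ValueError("need at least two systems to vote")
    best_name, best_score = None, -1.0
    for name, labels in outputs.items():
        others = [
            frame_agreement(labels, other_labels)
            for other_name, other_labels in outputs.items()
            if other_name != name
        ]
        score = sum(others) / len(others)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score
```

No reference transcript is needed at any point, which is the property the abstract emphasises.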

#11 Two-Wire Nuisance Attribute Projection

Yosef Solewicz (Department of Computer Science, Bar-Ilan University, Ramat-Gan, Israel)
Hagai Aronowitz (IBM Haifa Research Labs, Haifa 31905, Israel)

This paper addresses the task of nuisance reduction in two-wire speaker recognition applications. Besides channel mismatch, two-wire conversations are contaminated by extraneous speakers, which represent an additional source of noise in the supervector domain. It is shown that two-wire nuisance manifests itself as undesirable directions in the interspeaker subspace. For this purpose, we derive two alternative Nuisance Attribute Projection (NAP) formulations tailored for two-wire sessions. The first formulation generalizes the NAP framework based on a model of two-wire conversations. The second formulation explicitly models the four- vs. two-wire supervector variability. Preliminary experiments show that two-wire NAP significantly outperforms regular NAP in varied two-wire tasks.
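For orientation, the regular NAP operation that this abstract takes as its baseline removes the component of a supervector lying in a nuisance subspace, i.e. it applies the projection P = I - U Uᵀ for an orthonormal basis U. The sketch below shows only that generic projection in plain Python; it does not implement either of the two-wire formulations, and how the nuisance directions are estimated is left out. The function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormalise(directions, tol=1e-12):
    """Gram-Schmidt: turn arbitrary direction vectors into an orthonormal basis."""
    basis = []
    for direction in directions:
        v = list(direction)
        for b in basis:
            coeff = dot(v, b)
            v = [vi - coeff * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(dot(v, v))
        if norm > tol:  # skip directions already spanned by the basis
            basis.append([vi / norm for vi in v])
    return basis


def nap_project(supervector, nuisance_directions):
    """Remove the nuisance-subspace component: x - U (U^T x)."""
    out = list(supervector)
    for b in orthonormalise(nuisance_directions):
        coeff = dot(out, b)
        out = [oi - coeff * bi for oi, bi in zip(out, b)]
    return out
```

After projection, the result is orthogonal to every supplied nuisance direction, so variation along those directions no longer affects comparisons between supervectors.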