|
10th Annual Conference of the International Speech Communication Association
Interspeech 2009 Brighton
|
Technical Programme
This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.
Tue-Ses1-P4: Speaker Recognition and Diarisation
| Time: | Tuesday 10:00 |
| Place: | Hewison Hall |
| Type: | Poster |
| Chair: | Sadaoki Furui |
| #1 | Importance of Nasality Measures for Speaker Recognition Data Selection and Performance Prediction
Howard Lei (International Computer Science Institute) Eduardo Lopez-Gonzalo (Dep. of Signals, Systems and Radiocomm., Universidad Politecnica Madrid, Spain)
We improve upon measures relating feature vector distributions to speaker recognition (SR) performance, for use in SR performance prediction and arbitrary data selection. In particular, we examine the means and variances of 11 features pertaining to nasality (resulting in 22 measures), computing them on feature vectors of phones to determine which measures give good SR performance prediction of phones. We find that the combination of nasality measures gives a 0.917 correlation with the Equal Error Rates (EERs) of phones on SRE08, exceeding the correlation of our previous best measure (mutual information) by 12.7%. When implemented in our data-selection scheme (which does not require an SR system to be run), the nasality measures allow us to select data with a combined EER better than that of data selected by running an SR system in certain cases, at a fortieth of the computational cost. The nasality measures require a tenth of the computational cost of our previous best measure.
|
| #2 | Exploration of Vocal Excitation Modulation Features for Speaker Recognition
Ning Wang (Department of Electronic Engineering, The Chinese University of Hong Kong) P. C. Ching (Department of Electronic Engineering, The Chinese University of Hong Kong) Tan Lee (Department of Electronic Engineering, The Chinese University of Hong Kong)
To derive spectro-temporal vocal source features that complement the conventional spectral-based vocal tract features and improve the performance and reliability of a speaker recognition system, the excitation-related modulation properties are studied. Through a multi-band demodulation method, source-related amplitude and phase quantities are parameterized into feature vectors. Evaluation of the proposed features is carried out first through a set of designed experiments on artificially generated inputs, and then by simulations on a speech corpus. The designed experiments show that the proposed features are capable of capturing vocal differences in terms of F0 variation, pitch epoch shape, and relevant excitation details between epochs. In the simulations, when combined with the standard spectral features, both the amplitude-related and the phase-related features are shown to clearly reduce the identification error rate and equal error rate of the speaker recognition system.
|
| #3 | Speaker Identification for Whispered Speech Using Modified Temporal Patterns and MFCCs
Xing Fan (Center for Robust Speech Systems (CRSS), Erik Jonsson School of Engineering & Computer Science, University of Texas at Dallas, Richardson, Texas 75083, USA) John H.L. Hansen (Center for Robust Speech Systems (CRSS), Erik Jonsson School of Engineering & Computer Science, University of Texas at Dallas, Richardson, Texas 75083, USA)
Whisper is used intentionally by talkers in certain circumstances to protect personal privacy. Due to the absence of periodic excitation in the production of whisper, there are considerable differences in spectral structure between neutral and whispered speech. Therefore, the performance of speaker ID systems trained with high-energy voiced phonemes degrades significantly when tested with whisper. This study considers a combination of modified temporal patterns (m-TRAPs) and MFCCs to improve the performance of a neutral-trained system for whispered speech. The m-TRAPs are introduced based on an explanation for the whisper/neutral mismatch degradation of an MFCC-based system. A phoneme-by-phoneme score weighting method is used to fuse the scores from each subband. Text-independent closed-set speaker ID experiments show that m-TRAPs are especially effective for whisper with low SNR. When combining the scores from both the MFCC and TRAP GMMs, an absolute 26.3% improvement in accuracy is obtained compared with a traditional MFCC baseline system. This result confirms a viable approach to improving speaker ID performance under neutral/whisper mismatch conditions.
|
| #4 | Speaker Diarization for Meeting Room Audio
Hanwu Sun (Institute for Infocomm Research) Tin Lay Nwe (Institute for Infocomm Research) Bin Ma (Institute for Infocomm Research) Haizhou Li (Institute for Infocomm Research)
This paper describes a speaker diarization system built for the Multiple Distant Microphone (MDM) task in meeting room scenarios of the 2007 NIST Rich Transcription (RT07) Meeting Recognition Evaluation. The system includes three major modules: data preparation, initial speaker clustering and cluster purification/merging. Data preparation consists of Wiener filtering and beamforming of the raw data, Time Difference of Arrival estimation and speech activity detection. Based on the initially processed data, two-stage histogram quantization is used to perform the initial speaker clustering. A modified purification strategy via a high-order GMM clustering method is proposed. The BIC criterion is applied for cluster merging. The system achieves a competitive overall DER of 8.31% on the RT07 MDM speaker diarization task.
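The abstract above only states that a BIC criterion decides which clusters to merge. As a rough illustration, and not the authors' implementation, the sketch below computes a common form of the ΔBIC score for two clusters, assuming each cluster is modelled by a single diagonal-covariance Gaussian. The function names, the `penalty_weight` parameter and the variance floor are assumptions made for this example.

```python
import math


def _diag_log_det(frames):
    """Log-determinant of the diagonal ML covariance of a list of feature vectors."""
    n = len(frames)
    dim = len(frames[0])
    total = 0.0
    for k in range(dim):
        column = [frame[k] for frame in frames]
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / n
        total += math.log(max(var, 1e-8))  # floor avoids log(0); assumed value
    return total


def delta_bic(cluster_a, cluster_b, penalty_weight=1.0):
    """Delta-BIC between modelling two clusters separately and jointly.

    Positive values favour keeping the clusters apart; values at or below
    zero favour merging. Diagonal Gaussians are an assumption of this sketch.
    """
    n_a, n_b = len(cluster_a), len(cluster_b)
    n = n_a + n_b
    dim = len(cluster_a[0])
    gain = 0.5 * (
        n * _diag_log_det(cluster_a + cluster_b)
        - n_a * _diag_log_det(cluster_a)
        - n_b * _diag_log_det(cluster_b)
    )
    # One extra Gaussian costs dim means plus dim variances.
    penalty = 0.5 * penalty_weight * (2 * dim) * math.log(n)
    return gain - penalty


same_a = [[0.0], [1.0], [0.0], [1.0]]
same_b = [[0.0], [1.0], [0.0], [1.0]]
far_b = [[10.0], [11.0], [10.0], [11.0]]
merge_score = delta_bic(same_a, same_b)  # negative: merge
split_score = delta_bic(same_a, far_b)   # positive: keep apart
```

In a merging loop, the pair with the lowest score would typically be merged first, and merging would stop once every remaining pair scores above zero.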
|
| #5 | Improving Speaker Segmentation via Speaker Identification and Text Segmentation
Runxin Li (InterACT, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, USA) Tanja Schultz (InterACT, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, USA; Fakultat fur Informatik, Universitat Karlsruhe (TH), Germany) Qin Jin (InterACT, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, USA)
Speaker segmentation is an essential part of a speaker diarization system. Common segmentation systems usually miss speaker change points when speakers switch quickly. These errors seriously confuse the subsequent speaker clustering step and result in high overall speaker diarization error rates. In this paper two methods are proposed to deal with this problem. The first approach uses speaker identification techniques to boost speaker segmentation. The second approach applies text segmentation methods to improve the performance of speaker segmentation. Experiments on Quaero speaker diarization evaluation data show that our methods achieve up to a 45% relative reduction in the speaker diarization error and a 64% relative increase in the speaker change detection recall rate over the baseline system. Moreover, both approaches can be considered post-processing steps over the baseline segmentation, so they can be applied in any speaker diarization system.
|
| #6 | Overall performance metrics for multi-condition Speaker Recognition Evaluations
David van Leeuwen (TNO Human Factors)
In this paper we propose a framework for measuring the overall performance of an automatic speaker recognition system using a set of trials of a heterogeneous evaluation such as NIST SRE-2008, which combines several acoustic conditions in one evaluation. We do this by weighting trials of different conditions according to their relative proportion, and we derive expressions for the basic speaker recognition performance measures Cdet, Cllr, as well as the DET curve, from which EER and minCdet can be computed. Examples of pooling of conditions are shown on SRE-2008 data, including speaker sex and microphone type and speaking style.
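The idea of condition-weighted pooling can be illustrated with a small sketch. The code below is one possible reading of the abstract, not the paper's derivation: each condition's miss and false-alarm rates at a fixed threshold are computed separately and then averaged with per-condition weights, so a condition with many trials does not dominate. The condition names, scores, weights and function names are all made up for this example.

```python
def error_rates(target_scores, nontarget_scores, threshold):
    """Miss and false-alarm rates at a fixed decision threshold."""
    p_miss = sum(s < threshold for s in target_scores) / len(target_scores)
    p_fa = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
    return p_miss, p_fa


def pooled_error_rates(conditions, weights, threshold):
    """Weighted average of per-condition error rates.

    `conditions` maps a condition name to (target_scores, nontarget_scores);
    `weights` maps the same names to non-negative weights (normalised here).
    """
    total_weight = sum(weights[name] for name in conditions)
    p_miss = 0.0
    p_fa = 0.0
    for name, (targets, nontargets) in conditions.items():
        miss, fa = error_rates(targets, nontargets, threshold)
        w = weights[name] / total_weight
        p_miss += w * miss
        p_fa += w * fa
    return p_miss, p_fa


# Hypothetical scores: the telephone condition has twice as many trials.
conditions = {
    "telephone": ([2.0, 1.5, -0.5, 3.0], [-2.0, -1.0, 0.5, -3.0]),
    "microphone": ([1.0, -1.0], [-1.5, 0.2]),
}
weights = {"telephone": 0.5, "microphone": 0.5}
p_miss, p_fa = pooled_error_rates(conditions, weights, threshold=0.0)
```

Sweeping the threshold over all observed scores and recording the pooled pair at each step would trace a pooled DET curve under the same assumptions.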
|
| #7 | Speaker Identification using Warped MVDR Cepstral Features
Matthias Wölfel (ZKM|Center for Art and Media, Germany) Qian Yang (Universität Karlsruhe (TH), Germany) Jin Qin (Carnegie Mellon University, USA) Tanja Schultz (Universität Karlsruhe (TH), Germany)
It is common practice to use similar or even the same feature extraction methods for automatic speech recognition and speaker recognition. While the front-end for the former must preserve phoneme discrimination and compensate for speaker differences to some extent, the front-end for the latter has to preserve the unique characteristics of individual speakers. It seems, therefore, contradictory to use the same feature extraction methods for both tasks. Starting out from the common practice, we propose to use warped minimum variance distortionless response (MVDR) cepstral coefficients, which have already been demonstrated to perform better for automatic speech recognition, in particular under adverse conditions. Replacing the widely used mel-frequency cepstral coefficients by warped MVDR cepstral coefficients improves the speaker identification accuracy by up to 24% relative. We found that the optimal choice of the model order within the warped MVDR framework differs between speech recognition and speaker recognition, confirming our intuition that the two tasks indeed require different feature extraction strategies.
|
| #8 | Entropy Based Overlapped Speech Detection as a Pre-Processing Stage for Speaker Diarization
Oshry Ben-Harush (Ben-Gurion University of the Negev) Itshak Lapidot (Sami Shamoon College of Engineering) Hugo Guterman (Ben-Gurion University of the Negev)
One inherent deficiency of most diarization systems is their inability to handle co-channel or overlapped speech. Most of the suggested algorithms perform only under singular conditions or require high computational complexity in both the time and frequency domains. In this study, frame-based entropy analysis of the audio data in the time domain serves as a single feature for an overlapped speech detection algorithm. Identification of overlapped speech segments is performed using Gaussian Mixture Modeling (GMM) along with well-known classification algorithms applied to two-speaker conversations. By employing this methodology, the proposed method eliminates the need for setting a hard threshold for each conversation or database. The LDC CALLHOME American English corpus is used for evaluation of the suggested algorithm. The proposed method successfully detects 63.2% of the frames labeled as overlapped speech by the manual segmentation, while keeping a 5.4% false-alarm rate.
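A frame-based entropy feature of the kind mentioned above could be computed along the following lines. This is a minimal sketch under stated assumptions: the abstract does not give the exact entropy definition, so the code uses the Shannon entropy of a histogram of sample amplitudes per frame, and the bin count, amplitude range, frame length and hop size are illustrative choices rather than values from the paper.

```python
import math


def frame_entropy(frame, n_bins=16, lo=-1.0, hi=1.0):
    """Shannon entropy (bits) of the amplitude histogram of one frame."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for x in frame:
        idx = int((x - lo) / width)
        idx = min(max(idx, 0), n_bins - 1)  # clamp out-of-range samples
        counts[idx] += 1
    n = len(frame)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)


def entropy_contour(signal, frame_len=200, hop=100):
    """Entropy of each full frame of a time-domain signal."""
    last_start = len(signal) - frame_len
    return [
        frame_entropy(signal[start:start + frame_len])
        for start in range(0, last_start + 1, hop)
    ]


# Synthetic input: a constant stretch followed by an alternating stretch.
signal = [0.0] * 200 + [-1.0, 1.0] * 100
contour = entropy_contour(signal)
```

The resulting contour would then serve as the one-dimensional input to a classifier such as a GMM, which is the step the abstract describes for avoiding a hand-set threshold.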
|
| #9 | Speech Style and Speaker Recognition: a Case Study
Marco Grimaldi (School of Computer Science and Informatics, UCD, Dublin; FBK, via Sommarive 18, I-38100 Povo (Trento)) Fred Cummins (School of Computer Science and Informatics, UCD, Dublin)
This work presents an experimental evaluation of the effect of different speech styles on the task of speaker recognition.
We make use of willfully altered voice extracted from the CHAINS corpus and methodically assess the effect of its use in a reference speaker identification and verification system. We contrast normal readings of text with two varieties of imitative styles and with the familiar, non-imitative, variant of fast speech. Furthermore, we test the applicability of a novel speech parameterization that has been suggested as a promising technique in the task of speaker identification: the pyknogram frequency estimate coefficients - pykfec.
|
| #10 | The Majority Wins: a Method for Combining Speaker Diarization Systems
Marijn Huijbregts (University of Twente) David van Leeuwen (TNO Human Factors) Franciska de Jong (University of Twente)
In this paper we present a method for combining multiple diarization systems into one single system by applying a majority voting scheme. The voting scheme selects the best segmentation purely on the basis of the output of each system. On our development set of NIST Rich Transcription evaluation meetings, the voting method improves our system on all evaluation conditions. For the single distant microphone condition, the DER performance is improved by 7.8% (relative) compared to the best input system. For the multiple distant microphone condition the improvement is 3.6%.
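To give a feel for majority voting over diarization outputs, the sketch below applies a simple frame-wise vote to label sequences. It is a simplified stand-in, not the scheme described in the paper: it assumes every system's output has already been converted to one speaker label per frame and that labels have been mapped to a common naming across systems, both of which are hypothetical preprocessing steps. The function name and tie-breaking rule are also assumptions.

```python
from collections import Counter


def majority_vote(label_sequences):
    """Frame-wise majority vote over equally long label sequences.

    Ties are resolved in favour of the label from the earliest system
    in the list that reaches the top count.
    """
    if not label_sequences:
        raise ValueError("need at least one label sequence")
    n_frames = len(label_sequences[0])
    if any(len(seq) != n_frames for seq in label_sequences):
        raise ValueError("all sequences must have the same length")
    voted = []
    for frame_labels in zip(*label_sequences):
        counts = Counter(frame_labels)
        best = max(counts.values())
        voted.append(next(lab for lab in frame_labels if counts[lab] == best))
    return voted


# Three hypothetical systems labelling five frames.
systems = [
    ["A", "A", "B", "B", "B"],
    ["A", "B", "B", "B", "A"],
    ["A", "A", "A", "B", "B"],
]
combined = majority_vote(systems)
```

With an odd number of systems and two speakers, ties cannot occur, which is one reason such combinations are often run with three or five inputs.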
|
| #11 | Two-Wire Nuisance Attribute Projection
Yosef Solewicz (Department of Computer Science, Bar-Ilan University, Ramat-Gan, Israel) Hagai Aronowitz (IBM Haifa Research Labs, Haifa 31905, Israel)
This paper addresses the task of nuisance reduction in two-wire speaker recognition applications. Besides channel mismatch, two-wire conversations are contaminated by extraneous speakers, which represent an additional source of noise in the supervector domain. It is shown that two-wire nuisance manifests itself as undesirable directions in the interspeaker subspace. To address this, we derive two alternative Nuisance Attribute Projection (NAP) formulations tailored for two-wire sessions. The first formulation generalizes the NAP framework based on a model of two-wire conversations. The second formulation explicitly models the four- vs. two-wire supervector variability. Preliminary experiments show that two-wire NAP significantly outperforms regular NAP in varied two-wire tasks.
|