Brighton Pavilion

10th Annual Conference of the International Speech Communication Association

ISCA Interspeech 2009 Brighton

Technical Programme

This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.

Wed-Ses1-O1:
Speaker verification & identification II

Time: Wednesday 10:00 Place: Main Hall Type: Oral
Chair: Jean-Francois Bonastre

10:00 Does Session Variability Compensation in Speaker Recognition Model Intrinsic Variation Under Mismatched Conditions?

Elizabeth Shriberg (SRI International)
Sachin Kajarekar (SRI International)
Nicolas Scheffer (SRI International)

Intersession variability (ISV) compensation in speaker recognition is well studied with respect to extrinsic variation, but little is known about its ability to model intrinsic variation. We find that ISV compensation is remarkably successful on a corpus of intrinsic variation that is highly controlled for channel (a dominant component of ISV). The results are particularly surprising because the ISV training data come from a different corpus than do speaker train and test data. We further find that relative improvements are (1) inversely related to uncompensated performance, (2) reduced more by vocal effort train/test mismatch than by speaking style mismatch, and (3) reduced additionally for mismatches in both style and level. Results demonstrate that intersession variability compensation does model intrinsic variation, and suggest that mismatched data may be more useful than previously expected for modeling certain types of within-speaker variability in speech.

10:20 Variability Compensated Support Vector Machines Applied to Speaker Verification

Zahi Karam (DSPG, Research Laboratory of Electronics at MIT & MIT Lincoln Laboratory)
William Campbell (MIT Lincoln Laboratory)

Speaker verification using SVMs has proven successful, specifically using the GSV Kernel with NAP. Also, the recent popularity and success of JFA has led to promising attempts to use speaker factors directly as SVM features. NAP projection and the use of speaker factors are both methods of handling variability: NAP by removing nuisance variability, and speaker factors by forcing the discrimination to be performed based on inter-speaker variability. These successes have led us to propose a new method we call VCSVM to handle both inter- and intra-speaker variability directly in the SVM optimization. VCSVM adds a regularized penalty to the optimization that biases the normal to the hyperplane to be orthogonal to the nuisance subspace or, alternatively, to the complement of the inter-speaker variability subspace. The bias attempts to emphasize inter-speaker variability while ignoring intra-speaker variability. This paper presents the VCSVM theory and promising results on nuisance compensation.
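
As an illustration of the nuisance-removal idea mentioned in this abstract, the sketch below applies a NAP-style projection, x' = x - U(U^T x), to a feature vector. It is only a minimal sketch, not the authors' VCSVM method: the function name `nap_project` is hypothetical, and the nuisance directions are assumed to be supplied already orthonormal (in practice they would be estimated from data).

```python
def nap_project(x, nuisance_dirs):
    """Remove the components of x that lie in the nuisance subspace.

    x             -- feature vector (list of floats)
    nuisance_dirs -- list of orthonormal direction vectors (assumed given)
    """
    projected = list(x)
    for u in nuisance_dirs:
        # Coefficient of the original vector along this nuisance direction.
        coeff = sum(xi * ui for xi, ui in zip(x, u))
        # Subtract that component from the running result.
        projected = [pi - coeff * ui for pi, ui in zip(projected, u)]
    return projected


if __name__ == "__main__":
    # Toy example: the first axis is treated as the nuisance direction.
    print(nap_project([3.0, 4.0], [[1.0, 0.0]]))
```

After projection the vector has no component along any of the supplied directions, so a linear classifier trained on projected vectors cannot use them to discriminate.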

10:40 Support Vector Machines versus Fast Scoring in the Low-Dimensional Total Variability Space for Speaker Verification

Najim Dehak (CRIM-ETS)
Réda Dehak (LRDE-EPITA)
Patrick Kenny (CRIM)
Niko Brummer (Agnitio)
Pierre Ouellet (CRIM)
Pierre Dumouchel (CRIM-ETS)

This paper presents a new speaker verification system architecture that uses Joint Factor Analysis (JFA) as a feature extractor. In this model, JFA defines a single low-dimensional space, named the total variability factor space, instead of the separate channel and speaker variability spaces of classical JFA. The main contribution of this approach is the use of the cosine kernel in the new total factor space to design two different systems: the first is based on Support Vector Machines, and the second uses the kernel directly as a decision score. This second scoring method makes the process faster and less computationally complex than other classical methods. We tested several intersession compensation methods on the total factors and found that the combination of Linear Discriminant Analysis and Within-Class Covariance Normalization achieved the best performance.
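
The idea of using the cosine kernel directly as a decision score can be sketched as follows. This is an illustrative sketch only: the function names, the example vectors, and the threshold value are assumptions, and the total factor vectors are taken as already extracted (and, if desired, already compensated).

```python
import math


def cosine_score(w_target, w_test):
    """Cosine of the angle between two total factor vectors."""
    dot = sum(a * b for a, b in zip(w_target, w_test))
    norm_target = math.sqrt(sum(a * a for a in w_target))
    norm_test = math.sqrt(sum(b * b for b in w_test))
    if norm_target == 0.0 or norm_test == 0.0:
        raise ValueError("cosine score is undefined for a zero vector")
    return dot / (norm_target * norm_test)


def accept(w_target, w_test, threshold=0.5):
    """Accept the trial if the score reaches the threshold.

    The threshold here is an arbitrary placeholder; a real system
    would tune it on development data.
    """
    return cosine_score(w_target, w_test) >= threshold


if __name__ == "__main__":
    target = [0.2, -1.0, 0.5]   # hypothetical enrolment vector
    test = [0.1, -0.8, 0.7]     # hypothetical test vector
    print(cosine_score(target, test), accept(target, test))
```

Because the score is a single normalised dot product, no per-speaker model training is needed at test time, which is consistent with the speed advantage the abstract describes.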

11:00 Within-Session Variability Modelling for Factor Analysis Speaker Verification

Robbie Vogt (Speech Research Lab, QUT)
Jason Pelecanos (IBM T.J. Watson Research Center)
Nicolas Scheffer (SRI International)
Sachin Kajarekar (SRI International)
Sridha Sridharan (Speech Research Lab, QUT)

This work presents an extended Joint Factor Analysis model including explicit modelling of unwanted within-session variability. The goals of the proposed extended JFA model are to improve verification performance with short utterances by compensating for the effects of limited or imbalanced phonetic coverage, and to produce a flexible JFA model that is effective over a wide range of utterance lengths without adjusting model parameters such as retraining session subspaces. Experimental results on the 2006 NIST SRE corpus demonstrate the flexibility of the proposed model by providing competitive results over a wide range of utterance lengths without retraining and also yielding modest improvements in a number of conditions over current state-of-the-art.

11:20 Speaker Recognition by Gaussian Information Bottleneck

Ron M Hecht (Department of Computer Science, Tel-Aviv University, Tel-Aviv, Israel)
Elad Noor (The Weizmann Institute of Science, Rehovot, Israel)
Naftali Tishby (School of Engineering and Computer Science, Hebrew University, Jerusalem, Israel)

This paper explores a novel approach for the extraction of relevant information in speaker recognition tasks. This approach uses a principled information-theoretic framework - the Information Bottleneck method (IB). In our application, the method compresses the acoustic data while preserving mostly the information relevant for speaker identification. This paper focuses on a continuous version of the IB method known as the Gaussian Information Bottleneck (GIB). This version assumes that both the source and target variables are high-dimensional multivariate Gaussian variables. The GIB was applied in our work to the Super Vector (SV) dimension reduction conundrum. Experiments were conducted on the male part of the NIST SRE 2005 corpus. The GIB representation was compared to other dimension reduction techniques and to a baseline system. In our experiments, the GIB outperformed the baseline system, achieving a 6.1% Equal Error Rate (EER) compared to the 15.1% EER of the baseline system.

11:40 Variational Dynamic Kernels for Speaker Verification

Chris Longworth (Cambridge University Engineering Department)
Rogier van Dalen (Cambridge University Engineering Department)
Mark Gales (Cambridge University Engineering Department)

An important aspect of SVM-based speaker verification is the choice of dynamic kernel. Recently there has been interest in the use of kernels based on the Kullback-Leibler divergence between GMMs. Since this has no closed-form solution, typically a matched-pair upper bound is used instead. This places significant restrictions on the forms of model structure that may be used. All GMMs must contain the same number of components and must be adapted from a single background model. For many tasks this will not be optimal. In this paper, dynamic kernels are proposed based on alternative, variational approximations to the KL divergence. Unlike the matched-pair bound, these do not restrict the forms of GMM that may be used. Additionally, using a more accurate approximation of the divergence may lead to performance gains. Preliminary results using these kernels are presented on the NIST 2002 SRE dataset.
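
To make the contrast in this abstract concrete, the sketch below computes two approximations to the KL divergence between diagonal-covariance GMMs: the matched-pair bound, which requires the same number of components in both models, and one well-known variational approximation (due to Hershey and Olsen), which does not. The abstract does not state which variational approximations the authors use, so this choice is an assumption, as are the function names and the representation of a GMM as a list of `(weight, means, variances)` tuples. No underflow protection is included.

```python
import math


def kl_diag_gauss(m1, v1, m2, v2):
    """Closed-form KL divergence between two diagonal Gaussians."""
    return 0.5 * sum(
        math.log(b / a) + (a + (x - y) ** 2) / b - 1.0
        for x, a, y, b in zip(m1, v1, m2, v2)
    )


def kl_matched_pair(f, g):
    """Matched-pair bound: pairs component i of f with component i of g."""
    if len(f) != len(g):
        raise ValueError("matched-pair bound needs equal component counts")
    return sum(
        pf * (math.log(pf / pg) + kl_diag_gauss(mf, vf, mg, vg))
        for (pf, mf, vf), (pg, mg, vg) in zip(f, g)
    )


def kl_variational(f, g):
    """Variational approximation; f and g may differ in size."""
    total = 0.0
    for pa, ma, va in f:
        # Similarity of component a to the components of its own mixture.
        num = sum(pb * math.exp(-kl_diag_gauss(ma, va, mb, vb))
                  for pb, mb, vb in f)
        # Similarity of component a to the components of the other mixture.
        den = sum(pb * math.exp(-kl_diag_gauss(ma, va, mb, vb))
                  for pb, mb, vb in g)
        total += pa * math.log(num / den)
    return total


if __name__ == "__main__":
    # Hypothetical one-dimensional models with different component counts.
    f = [(0.5, [0.0], [1.0]), (0.5, [3.0], [1.0])]
    g = [(1.0, [1.0], [2.0])]
    print(kl_variational(f, g))
```

With single-component models both functions reduce to the exact Gaussian KL divergence; they differ once the mixtures have several components or unequal sizes.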