Brighton Pavilion

10th Annual Conference of the International Speech Communication Association

ISCA Interspeech 2009 Brighton

Technical Programme

This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.

Thu-Ses1-S1:
Special Session: New Approaches to Modeling Variability for Automatic Speech Recognition

Time: Thursday 10:00 Place: East Wing 4 Type: Special
Chair: Carol Espy-Wilson & Jennifer Cole

10:00 A Noise-type and level-dependent MPO-based speech enhancement architecture

Vikramjit Mitra (University of Maryland, College Park)
Bengt Borgstrom (University of California, Los Angeles)
Carol Espy-Wilson (University of Maryland, College Park)
Abeer Alwan (University of California, Los Angeles)

In previous work, a speech enhancement algorithm based on phase opponency and a periodicity measure (MPO-APP) was developed for speech recognition. Axiomatic thresholds were used in the MPO-APP regardless of the signal-to-noise ratio (SNR) of the corrupted speech or any characterization of the noise. The current work developed an algorithm for adjusting the threshold in the MPO-APP based on the SNR and whether the speech signal is clean, corrupted by aperiodic noise or corrupted with noise with periodic components. In addition, variable frame rate (VFR) analysis has been incorporated so that dynamic regions in the speech signal are more heavily sampled than steady-state regions. The result is a 2-stage algorithm that gives superior performance to the previous MPO-APP, and to several other state-of-the-art speech enhancement algorithms.
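The abstract does not say how the variable frame rate analysis selects frames. As a rough illustration only, the sketch below uses one common VFR scheme: a frame is kept whenever the accumulated distance between consecutive feature vectors reaches a threshold, so fast-changing regions contribute more frames than steady-state ones. The function name `select_frames`, the Euclidean distance, and the threshold value are assumptions, not details from the paper.

```python
import math


def select_frames(features, threshold):
    """Return indices of frames kept by a simple variable-frame-rate rule.

    Assumption (not from the paper): a frame is kept once the Euclidean
    distance accumulated since the last kept frame reaches `threshold`.
    """
    if not features:
        return []
    kept = [0]          # always keep the first frame
    accumulated = 0.0   # spectral change since the last kept frame
    for i in range(1, len(features)):
        accumulated += math.dist(features[i], features[i - 1])
        if accumulated >= threshold:
            kept.append(i)
            accumulated = 0.0
    return kept


# Toy one-dimensional "features": flat, then a rapid transition, then flat.
toy = [[0.0], [0.0], [0.0], [0.0], [1.0], [2.0], [3.0], [3.0], [3.0]]
print(select_frames(toy, threshold=1.0))
```

In this toy input only the first frame and the frames inside the transition are retained, which mirrors the idea of sampling dynamic regions more densely.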

10:20 Complementarity of MFCC, PLP and Gabor features in the presence of speech-intrinsic variabilities

Bernd T. Meyer (University of Oldenburg)
Birger Kollmeier (University of Oldenburg)

In this study, the effect of speech-intrinsic variabilities such as speaking rate, effort and speaking style on automatic speech recognition (ASR) is investigated. We analyze the influence of such variabilities as well as extrinsic factors (i.e., additive noise) on the most common features in ASR (mel-frequency cepstral coefficients and perceptual linear prediction features) and spectro-temporal Gabor features. MFCCs performed best for clean speech, whereas Gabor features were found to be the most robust under extrinsic variabilities. Intrinsic variations were found to have a strong impact on error rates. While performance with MFCCs and PLPs was degraded in much the same way, Gabor features exhibit a different sensitivity towards these variabilities and are, e.g., well-suited to recognize speech with varying pitch. The results suggest that spectro-temporal and classic features carry complementary information, which could be exploited in feature-stream experiments.

10:40 Noise robustness of Tract Variables and their application to Speech Recognition

Vikramjit Mitra (Department of Electrical and Computer Engineering, University of Maryland, USA)
Hosung Nam (Haskins Laboratories, New Haven, USA)
Carol Espy-Wilson (Department of Electrical and Computer Engineering, University of Maryland, USA)
Elliot Saltzman (Haskins Laboratories, New Haven, USA)
Louis Goldstein (Haskins Laboratories, New Haven, USA)

This paper analyzes the noise robustness of vocal tract constriction variable estimation and investigates the role of these variables in noise-robust speech recognition. We implement a simple direct inverse model using a feed-forward artificial neural network (ANN) to estimate vocal tract time functions (VTTF) from the acoustic speech signal parameterized as mel-frequency cepstral coefficients (MFCC). The training corpus was obtained from the TAsk Dynamics Application model (TADA [1]), which generated the synthetic speech as well as the corresponding VTTFs. Eight vocal tract (VT) constriction variables were considered in this study: five constriction degree variables (lip aperture [LA], tongue body [TBCD], tongue tip [TTCD], velum [VEL], and glottis [GLO]) and three constriction location variables (lip protrusion [LP], tongue tip [TTCL], and tongue body [TBCL]).

11:00 Articulatory Phonological Code for Word Classification

Xiaodan Zhuang (University of Illinois at Urbana-Champaign)
Hosung Nam (Haskins Laboratories, New Haven, U.S.A.)
Mark Hasegawa-Johnson (University of Illinois at Urbana-Champaign)
Louis Goldstein (Haskins Laboratories, New Haven, U.S.A.)
Elliot Saltzman (Haskins Laboratories, New Haven, U.S.A.)

We propose a framework that leverages articulatory phonology for speech recognition. "Gestural pattern vectors" (GPV) encode the instantaneous gestural activations that exist across all tract variables at each time. Given a speech observation, recognizing the sequence of GPV recovers the ensemble of gestural activations, i.e., the gestural score. For each word in the vocabulary, we use a task dynamic model of inter-articulator speech coordination to generate the "canonical" gestural score. Speech recognition is achieved by matching the ensemble of gestural activations. In particular, we estimate the likelihood of the recognized GPV sequence on word-dependent GPV sequence models trained using the canonical gestural scores. These likelihoods, weighted by confidence score of the recognized GPVs, are used in a Bayesian speech recognizer. Pilot gestural score recovery and word classification experiments are carried out using synthesized data from one speaker. The observation distribution of each GPV is modeled by an artificial neural network and Gaussian mixture tandem model. Bigram GPV sequence models are used to distinguish gestural scores of different words. Given the tract variable time functions, about 80% of the instantaneous gestural activation is correctly recovered. Word recognition accuracy is over 85% for a vocabulary of 139 words with no training observations. These results suggest that the proposed framework might be a viable alternative to the classic sequence-of-phones model.
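The word classification step described above (word-dependent bigram sequence models scored against a recognized GPV sequence) can be sketched in a simplified form. The example below is a minimal illustration under stated assumptions: GPVs are reduced to opaque string labels, each word model is trained on a single hypothetical "canonical" sequence, add-one smoothing is used, and the confidence weighting mentioned in the abstract is omitted. None of the names or data come from the paper.

```python
import math

START, END = "<s>", "</s>"


def train_bigram(sequences, symbols, alpha=1.0):
    """Count bigrams over symbol sequences; returns a smoothed bigram model.

    `alpha` is an add-alpha smoothing constant (an assumption of this sketch).
    """
    counts = {}
    for seq in sequences:
        padded = [START, *seq, END]
        for prev, cur in zip(padded, padded[1:]):
            row = counts.setdefault(prev, {})
            row[cur] = row.get(cur, 0.0) + 1.0
    n_outcomes = len(set(symbols)) + 1  # every symbol plus the END marker
    return counts, n_outcomes, alpha


def log_likelihood(model, seq):
    """Log-probability of a symbol sequence under a bigram model."""
    counts, n_outcomes, alpha = model
    padded = [START, *seq, END]
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        row = counts.get(prev, {})
        numerator = row.get(cur, 0.0) + alpha
        denominator = sum(row.values()) + alpha * n_outcomes
        total += math.log(numerator / denominator)
    return total


def classify(models, seq):
    """Return the word whose bigram model scores the sequence highest."""
    return max(models, key=lambda word: log_likelihood(models[word], seq))


# Hypothetical GPV labels and canonical sequences for two toy "words".
symbols = ["A", "B", "C"]
models = {
    "ab": train_bigram([["A", "A", "B", "B"]], symbols),
    "ba": train_bigram([["B", "B", "A", "A"]], symbols),
}
print(classify(models, ["A", "B", "B"]))
```

Because each word model is built from a generated canonical sequence rather than recorded examples, a scheme of this kind needs no per-word training observations, which is the property the abstract highlights.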

11:20 Robust Keyword Spotting with Rapidly Adapting Point Process Models

Aren Jansen (Dept of Computer Science, University of Chicago)
Partha Niyogi (Depts. of Computer Science and Statistics, University of Chicago)

In this paper, we investigate the noise robustness properties of frame-based and sparse point process-based models for spotting keywords in continuous speech. We introduce a new strategy to improve point process model (PPM) robustness by adapting low-level feature detector thresholds to preserve background firing rates in the presence of noise. We find that this unsupervised approach can significantly outperform fully supervised maximum likelihood linear regression (MLLR) adaptation of an equivalent keyword-filler HMM system in the presence of additive white and pink noise. Moreover, we find that the sparsity of PPMs introduces an inherent resilience to non-stationary babble noise not exhibited by the frame-based HMM system. Finally, we demonstrate that our approach requires less adaptation data than MLLR, permitting rapid online adaptation.
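The abstract does not spell out how detector thresholds are adapted. One plausible reading, shown here purely as a sketch, is quantile matching: measure how often a detector fires on clean data, then choose the threshold on noisy data that reproduces the same firing rate. No labels are needed, which is consistent with the unsupervised setting described. The function names and the toy scores are assumptions.

```python
def firing_rate(scores, threshold):
    """Fraction of frames whose detector score exceeds the threshold."""
    return sum(s > threshold for s in scores) / len(scores)


def adapt_threshold(noisy_scores, target_rate):
    """Choose a threshold on noisy scores that approximates `target_rate`.

    Assumption (not from the paper): the threshold is set so that roughly
    the top `target_rate` fraction of noisy frames still fire.
    """
    ordered = sorted(noisy_scores, reverse=True)
    n_fire = round(target_rate * len(ordered))
    if n_fire <= 0:
        return ordered[0]        # strict '>' means no frame fires
    if n_fire >= len(ordered):
        return float("-inf")     # every frame fires
    # With strict '>', the n_fire largest scores exceed ordered[n_fire]
    # (ties at the boundary can make the achieved rate slightly lower).
    return ordered[n_fire]


# Toy detector scores: noise shifts every score upward.
clean = [0.1, 0.2, 0.9, 0.15, 0.8, 0.05, 0.3, 0.1, 0.95, 0.2]
noisy = [0.5, 0.6, 1.3, 0.55, 1.2, 0.45, 0.7, 0.5, 1.35, 0.6]

clean_threshold = 0.5
target = firing_rate(clean, clean_threshold)
adapted = adapt_threshold(noisy, target)
print(firing_rate(noisy, clean_threshold), firing_rate(noisy, adapted))
```

With the unadapted threshold the toy detector fires on most noisy frames; after adaptation it fires at the same rate as on the clean scores.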

11:40 Automatically Rating Pronunciation Through Articulatory Phonology

Joseph Tepperman (University of Southern California)
Louis Goldstein (University of Southern California)
Sungbok Lee (University of Southern California)
Shrikanth Narayanan (University of Southern California)

Articulatory Phonology's link between cognitive speech planning and the physical realizations of vocal tract constrictions has implications for speech acoustic and duration modeling that should be useful in assigning subjective ratings of pronunciation quality to nonnative speech. In this work, we compare traditional phoneme models used in automatic speech recognition to similar models for articulatory gestural pattern vectors, each with associated duration models. What we find is that, on the CDT corpus, gestural models outperform the phoneme-level baseline in terms of correlation with listener ratings, and in combination phoneme and gestural models outperform either one alone. This also validates previous findings with a similar (but not gesture-based) pseudo-articulatory representation.