Brighton Pavilion

10th Annual Conference of the International Speech Communication Association

ISCA Interspeech 2009 Brighton

Technical Programme

This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.

Mon-Ses2-O1:
ASR: Features for Noise Robustness

Time: Monday 13:30 Place: Main Hall Type: Oral
Chair: Hynek Hermansky

13:30 Feature Extraction for Robust Speech Recognition Using a Power-Law Nonlinearity and Power-Bias Subtraction

Chanwoo Kim (Carnegie Mellon University)
Richard Stern (Carnegie Mellon University)

This paper presents a new feature extraction algorithm called Power-Normalized Cepstral Coefficients (PNCC) that is based on auditory processing. Major new features of PNCC processing include the use of a power-law nonlinearity that replaces the traditional log nonlinearity used for MFCC coefficients, and a novel algorithm that suppresses background excitation by estimating SNR based on the ratio of the arithmetic to geometric mean power, and subtracts the inferred background power. Experimental results demonstrate that the PNCC processing provides substantial improvements in recognition accuracy compared to MFCC and PLP processing for various types of additive noise. The computational cost of PNCC is only slightly greater than that of conventional MFCC processing.
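
The two ingredients named in the abstract above can be illustrated with a minimal sketch. It is not the authors' PNCC implementation: the function names, the example power values, and the exponent of 0.1 are assumptions chosen only for illustration. It shows that the arithmetic-to-geometric mean ratio is 0 dB for a flat power sequence and grows as the power becomes more strongly modulated, and that a power-law nonlinearity stays finite at zero power where a log does not.

```python
import math

def am_gm_ratio_db(powers):
    """Ratio of arithmetic mean to geometric mean of positive powers, in dB."""
    n = len(powers)
    arithmetic_mean = sum(powers) / n
    geometric_mean = math.exp(sum(math.log(p) for p in powers) / n)
    return 10.0 * math.log10(arithmetic_mean / geometric_mean)

def power_law(power, exponent=0.1):
    """Power-law compression; the exponent is an illustrative assumption."""
    return power ** exponent

# Hypothetical per-frame powers in one frequency channel.
flat = [1.0, 1.0, 1.0, 1.0]        # steady, noise-like power
peaky = [0.01, 0.01, 0.01, 3.97]   # strongly modulated, same arithmetic mean

print("AM/GM flat  (dB):", am_gm_ratio_db(flat))
print("AM/GM peaky (dB):", am_gm_ratio_db(peaky))
print("power law at 1e-10:", power_law(1e-10))
print("log at 1e-10:", math.log(1e-10))
```

By the AM-GM inequality the ratio is never below 0 dB, which is what makes it usable as a rough indicator of how much a channel's power fluctuates.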

13:50 Towards Fusion of Feature Extraction and Acoustic Model Training: A Top Down Process for Robust Speech Recognition

Yu-Hsiang Bosco Chiu (Carnegie Mellon University)
Bhiksha Raj (Carnegie Mellon University)
Richard M. Stern (Carnegie Mellon University)

This paper presents a strategy to learn physiologically motivated components in a feature computation module discriminatively, directly from data, in a manner that is inspired by the presence of efferent processes in the human auditory system. In our model, a set of logistic functions representing the rate-level nonlinearities found in most mammalian auditory systems is included in the feature extraction process. The parameters of these rate-level functions are estimated to maximize the a posteriori probability of the correct class in the training data. The estimated feature computation is observed to be robust against environmental noise. Experiments conducted with CMU Sphinx-III on the DARPA Resource Management task show that the discriminatively estimated rate-level nonlinearity yields better performance in the presence of background noise than traditional procedures, which separate feature extraction and model training into two distinct parts.
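
As a rough illustration of what a logistic rate-level function looks like, the sketch below maps an input level in dB to a saturating output rate. The parameter names and default values are hypothetical; in the paper these parameters are estimated discriminatively from training data, which this sketch does not attempt.

```python
import math

def rate_level(level_db, threshold_db=30.0, slope=0.2, max_rate=1.0):
    """Logistic rate-level curve: compressive at low and high input levels.

    threshold_db, slope and max_rate are illustrative assumptions, standing in
    for the parameters that would be learned from data.
    """
    return max_rate / (1.0 + math.exp(-slope * (level_db - threshold_db)))

for level in (0.0, 30.0, 60.0, 90.0):
    print(level, "dB ->", round(rate_level(level), 4))
```

The curve passes through half of `max_rate` at the threshold and flattens on either side, so changing the threshold and slope shifts which range of input levels is represented with the most resolution.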

14:10 Temporal Modulation Processing of Speech Signals for Noise Robust ASR

Hong You (UCLA Electrical Engineering Dept.)
Abeer Alwan (UCLA Electrical Engineering Dept.)

We analyze the temporal modulation characteristics of speech and noise from a speech/non-speech discrimination point of view, propose a frequency adaptive modulation processing algorithm, and apply it to a noise robust ASR task. Although previous psychoacoustic studies have shown that low temporal modulation components are important for speech intelligibility, there is no reported analysis of modulation components from the point of view of speech/noise discrimination. Our data-driven analysis of the modulation components of speech and noise reveals that speech and noise are more accurately classified by low-passed modulation frequencies than by band-passed ones. Speech recognition experiments are performed to compare the proposed algorithm with other noise robust frontends, including RASTA and ETSI AFE. Results show that the frequency adaptive modulation processing is promising.

14:30 Progressive Memory-Based Parametric Non-Linear Feature Equalization

Luz García (Department of TSTC, University of Granada, Spain)
Roberto Gemello (LOQUENDO, Torino, ITALY)
Franco Mana (LOQUENDO, Torino, ITALY)
Jose Carlos Segura (Department of TSTC, University of Granada, Spain)

This paper analyzes the benefits and drawbacks of PEQ (Parametric Non-linear Equalization), a feature normalization technique based on the parametric equalization of the MFCC parameters to match a reference probability distribution. Two limitations are identified: the distortion intrinsic to the normalization process and the lack of accuracy in estimating normalization statistics on short sentences. Two evolutions of PEQ are presented as solutions to these limitations. The effects of the proposed evolutions are evaluated on three speech corpora, namely the WSJ0, AURORA-3 and HIWIRE cockpit databases, with different mismatch conditions given by convolutional and/or additive noise and non-native speakers. The results show that the limitations can be overcome by the newly introduced techniques.
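
The idea of mapping features onto a reference distribution can be sketched with per-utterance mean and variance normalization, a much simpler relative of PEQ that matches only the first two moments of a reference Gaussian. This is not the PEQ algorithm or either of its proposed evolutions; the function name and the sample values are assumptions for illustration. It also hints at the short-sentence problem: with few frames, the estimated mean and standard deviation become unreliable.

```python
import statistics

def mean_variance_normalize(values, ref_mean=0.0, ref_std=1.0):
    """Shift and scale one cepstral coefficient track to a reference mean/std."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0.0:
        # A constant track carries no variance to rescale.
        return [ref_mean for _ in values]
    return [ref_mean + ref_std * (v - mu) / sigma for v in values]

# Hypothetical values of one cepstral coefficient over a short utterance.
track = [12.1, 11.8, 13.4, 15.0, 14.2, 12.9]
normalized = mean_variance_normalize(track)
print([round(v, 3) for v in normalized])
```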

14:50 Dynamic Features in the Linear Domain for Robust Automatic Speech Recognition in a Reverberant Environment

Osamu Ichikawa (Tokyo Research Laboratory, IBM Research)
Takashi Fukuda (Tokyo Research Laboratory, IBM Research)
Ryuki Tachibana (Tokyo Research Laboratory, IBM Research)
Masafumi Nishimura (Tokyo Research Laboratory, IBM Research)

Since MFCCs are calculated from logarithmic spectra, the delta and delta-delta features are difference operations in the logarithmic domain. In a reverberant environment, speech signals have trailing reverberations whose power follows a long-term exponential decay. As a result, the logarithmic delta value tends to remain large for a long time. This paper proposes a delta feature calculated in the linear domain, where the reverberant tail decays rapidly. In an experiment using the CENSREC-4 evaluation framework, significant improvements were found in reverberant conditions by simply replacing the MFCC dynamic features with the proposed dynamic features.
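
The argument in the abstract above can be checked numerically with a minimal sketch. It uses a plain first difference on a synthetic exponentially decaying power envelope; the decay factor and frame count are illustrative assumptions, and the paper's actual dynamic feature is not reproduced here.

```python
import math

def first_difference(xs):
    """Difference between consecutive frames, a simple stand-in for a delta."""
    return [b - a for a, b in zip(xs, xs[1:])]

decay = 0.8                                   # assumed per-frame decay factor
tail = [decay ** t for t in range(20)]        # reverberant power envelope

log_delta = first_difference([math.log(p) for p in tail])
lin_delta = first_difference(tail)

print("log-domain delta, first and last:", log_delta[0], log_delta[-1])
print("linear-domain delta, first and last:", lin_delta[0], lin_delta[-1])
```

In the log domain every difference equals the log of the decay factor, so the delta stays at the same magnitude for the whole tail. In the linear domain the difference shrinks in proportion to the remaining power and quickly approaches zero.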

15:10 Local Projections and Support Vector Based Feature Selection in Speech Recognition

Antonio Miguel (University of Zaragoza)
Alfonso Ortega (University of Zaragoza)
Luis Buera (University of Zaragoza)
Eduardo Lleida (University of Zaragoza)

In this paper we study a method to provide noise robustness in mismatch conditions for speech recognition using local frequency projections and feature selection. Local time-frequency filtering patterns have been used previously to provide noise robust features and a simpler feature set to which reliability weighting techniques can be applied. The proposed method combines two techniques to select the feature set: first, a reliability metric based on information theory and, second, a support vector set to reduce the errors. The support vector set provides the most representative examples that influence the error rate in mismatch conditions, so that only the features which incorporate implicit robustness to mismatch are selected. Experimental results obtained with this method on the Aurora 2 database are compared with baseline systems.