10th Annual Conference of the International Speech Communication Association
Interspeech 2009 Brighton
|
Technical Programme
This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.
Mon-Ses2-O1: ASR: Features for Noise Robustness
Time: Monday 13:30
Place: Main Hall
Type: Oral
Chair: Hynek Hermansky
| 13:30 | Feature Extraction for Robust Speech Recognition Using a Power-Law Nonlinearity and Power-Bias Subtraction
Chanwoo Kim (Carnegie Mellon University) Richard Stern (Carnegie Mellon University)
This paper presents a new feature extraction algorithm called Power-Normalized Cepstral Coefficients (PNCC) that is based on auditory processing. Major new features of PNCC processing include the use of a power-law nonlinearity that replaces the traditional log nonlinearity used for MFCC coefficients, and a novel algorithm that suppresses background excitation by estimating SNR based on the ratio of the arithmetic to geometric mean power, and subtracts the inferred background power. Experimental results demonstrate that the PNCC processing provides substantial improvements in recognition accuracy compared to MFCC and PLP processing for various types of additive noise. The computational cost of PNCC is only slightly greater than that of conventional MFCC processing.
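As a rough illustration of the two ideas named in this abstract, the sketch below compares a log nonlinearity with a power-law nonlinearity and computes the ratio of arithmetic to geometric mean power. It is not the authors' PNCC implementation: the function names, the exponent value, and the flooring constant are assumptions chosen for illustration only.

```python
import math


def log_nonlinearity(power, floor=1e-10):
    """Log compression; the floor (an assumed value) avoids log(0)."""
    return math.log(max(power, floor))


def power_law_nonlinearity(power, exponent=1.0 / 15.0):
    """Power-law compression; the exponent is an assumed illustrative value."""
    return max(power, 0.0) ** exponent


def am_gm_ratio(powers, floor=1e-10):
    """Ratio of the arithmetic mean to the geometric mean of frame powers.

    The ratio is 1 for constant power and grows as power fluctuates more.
    """
    clipped = [max(p, floor) for p in powers]
    arithmetic = sum(clipped) / len(clipped)
    geometric = math.exp(sum(math.log(p) for p in clipped) / len(clipped))
    return arithmetic / geometric


steady = [1.0, 1.0, 1.0, 1.0]        # stationary, noise-like power
bursty = [0.01, 4.0, 0.01, 4.0]      # strongly fluctuating power
print(am_gm_ratio(steady), am_gm_ratio(bursty))
print(log_nonlinearity(1e-8), power_law_nonlinearity(1e-8))
```

Under these assumptions, the power-law output stays bounded as power approaches zero, whereas the log output falls without limit; the AM/GM ratio separates steady from fluctuating power sequences.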
|
| 13:50 | Towards Fusion of Feature Extraction and Acoustic Model Training: A Top Down Process for Robust Speech Recognition
Yu-Hsiang Bosco Chiu (Carnegie Mellon University) Bhiksha Raj (Carnegie Mellon University) Richard M. Stern (Carnegie Mellon University)
This paper presents a strategy to learn physiologically motivated components in a feature computation module discriminatively, directly from data, in a manner inspired by the presence of efferent processes in the human auditory system. In our model, a set of logistic functions that represent the rate-level nonlinearities found in most mammalian auditory systems is included in the feature extraction process. The parameters of these rate-level functions are estimated to maximize the a posteriori probability of the correct class in the training data. The estimated feature computation is observed to be robust against environmental noise. Experiments conducted with CMU Sphinx-III on the DARPA Resource Management task show that the discriminatively estimated rate-nonlinearity results in better performance in the presence of background noise than traditional procedures that separate feature extraction and model training into two distinct parts.
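A logistic rate-level function of the kind this abstract mentions can be sketched as follows. The parameter names (`slope`, `threshold_db`, `max_rate`) and their default values are hypothetical; the abstract does not give the parameterization, and the discriminative estimation of the parameters is not shown here.

```python
import math


def rate_level(level_db, slope=0.1, threshold_db=40.0, max_rate=1.0):
    """Logistic mapping from input level (dB) to a saturating output rate.

    All parameters are hypothetical placeholders; in the abstract's setting
    they would be learned from data rather than fixed by hand.
    """
    return max_rate / (1.0 + math.exp(-slope * (level_db - threshold_db)))


for level in (0.0, 20.0, 40.0, 60.0, 80.0):
    print(level, round(rate_level(level), 4))
```

The curve is monotonic, equals half of `max_rate` at the threshold, and saturates at both ends, which is the qualitative shape of a rate-level nonlinearity.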
|
| 14:10 | Temporal Modulation Processing of Speech Signals for Noise Robust ASR
Hong You (UCLA Electrical Engineering Dept.) Abeer Alwan (UCLA Electrical Engineering Dept.)
We analyze the temporal modulation characteristics of speech and noise from a speech/non-speech discrimination point of view, propose a frequency adaptive modulation processing algorithm, and apply it to a noise robust ASR task. Although previous psychoacoustic studies have shown that low temporal modulation components are important for speech intelligibility, there is no reported analysis of modulation components from the point of view of speech/noise discrimination. Our data-driven analysis of the modulation components of speech and noise reveals that speech and noise are more accurately classified by low-passed modulation frequencies than by band-passed ones. Speech recognition experiments are performed to compare the proposed algorithm with other noise robust frontends, including RASTA and ETSI AFE. Results show that frequency adaptive modulation processing is promising.
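To make "low-passed modulation frequencies" concrete, the sketch below smooths a per-frame sub-band energy trajectory with a centered moving average, which is one simple low-pass filter in the modulation domain. This is a generic stand-in and not the frequency adaptive algorithm proposed in the paper; the function name and window length are assumptions.

```python
def moving_average(trajectory, window=5):
    """Low-pass a per-frame trajectory with a centered moving average.

    Near the edges the window is truncated to the available frames.
    """
    half = window // 2
    smoothed = []
    for t in range(len(trajectory)):
        start = max(0, t - half)
        stop = min(len(trajectory), t + half + 1)
        segment = trajectory[start:stop]
        smoothed.append(sum(segment) / len(segment))
    return smoothed


# A rapidly alternating trajectory (high modulation frequency) is flattened,
# while a constant trajectory passes through unchanged.
fast = [0.0, 1.0] * 5
print(moving_average(fast, window=5))
print(moving_average([1.0] * 6, window=3))
```

A per-band variant could pass a different `window` for each sub-band, although how the paper adapts the processing across frequency is not stated in the abstract.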
|
| 14:30 | Progressive Memory-Based Parametric Non-Linear Feature Equalization
Luz García (Department of TSTC, University of Granada, Spain) Roberto Gemello (LOQUENDO, Torino, Italy) Franco Mana (LOQUENDO, Torino, Italy) Jose Carlos Segura (Department of TSTC, University of Granada, Spain)
This paper analyzes the benefits and drawbacks of PEQ (Parametric Non-linear Equalization), a feature normalization technique based on the parametric equalization of the MFCC parameters to match a reference probability distribution. Two limitations are identified: the distortion intrinsic to the normalization process and the lack of accuracy in estimating normalization statistics on short sentences. Two extensions of PEQ are presented as solutions to these limitations. Their effects are evaluated on three speech corpora, namely the WSJ0, AURORA-3, and HIWIRE cockpit databases, with different mismatch conditions given by convolutional and/or additive noise and non-native speakers. The results show that the limitations can be overcome by the newly introduced techniques.
|
| 14:50 | Dynamic Features in the Linear Domain for Robust Automatic Speech Recognition in a Reverberant Environment
Osamu Ichikawa (Tokyo Research Laboratory, IBM Research) Takashi Fukuda (Tokyo Research Laboratory, IBM Research) Ryuki Tachibana (Tokyo Research Laboratory, IBM Research) Masafumi Nishimura (Tokyo Research Laboratory, IBM Research)
Since MFCCs are calculated from logarithmic spectra, the delta and delta-delta features are difference operations in the logarithmic domain. In a reverberant environment, speech signals have trailing reverberations whose power follows a long-term exponential decay. This means the logarithmic delta value tends to remain large for a long time. This paper proposes a delta feature calculated in the linear domain, where the reverberant tail decays rapidly. In an experiment using an evaluation framework (CENSREC-4), significant improvements were found in reverberant situations by simply replacing the MFCC dynamic features with the proposed dynamic features.
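The argument in this abstract can be checked numerically on an idealized reverberant tail. The sketch below assumes a purely exponential power decay and uses a simple first difference as the delta; practical front ends usually compute deltas by regression over several frames, and the paper's exact feature is not reproduced here. The decay factor and function names are illustrative assumptions.

```python
import math


def first_difference(values):
    """Frame-to-frame difference, used here as a simplified delta feature."""
    return [b - a for a, b in zip(values, values[1:])]


def decaying_power(initial_power, decay_per_frame, num_frames):
    """Idealized reverberant tail: power shrinks by a fixed factor per frame."""
    return [initial_power * decay_per_frame ** t for t in range(num_frames)]


tail = decaying_power(1.0, 0.8, 10)          # assumed decay factor of 0.8
log_delta = first_difference([math.log(p) for p in tail])
linear_delta = first_difference(tail)

print([round(d, 4) for d in log_delta])      # constant magnitude along the tail
print([round(d, 4) for d in linear_delta])   # magnitude shrinks along the tail
```

For power of the form p0 * r^t, the log-domain difference is log(r) at every frame, so it never dies away, while the linear-domain difference is p0 * r^t * (r - 1), which shrinks together with the tail.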
|
| 15:10 | Local Projections and Support Vector Based Feature Selection in Speech Recognition
Antonio Miguel (University of Zaragoza) Alfonso Ortega (University of Zaragoza) Luis Buera (University of Zaragoza) Eduardo Lleida (University of Zaragoza)
In this paper we study a method to provide noise robustness in mismatch conditions for speech recognition using local frequency projections and feature selection. Local time-frequency filtering patterns have been used previously to provide noise robust features and a simpler feature set to which reliability weighting techniques can be applied. The proposed method combines two techniques to select the feature set: first, a reliability metric based on information theory and, second, a support vector set to reduce the errors. The support vector set provides the most representative examples that influence the error rate in mismatch conditions, so that only the features that incorporate implicit robustness to mismatch are selected. Experimental results obtained with this method are compared to baseline systems using the Aurora 2 database.
|