|
10th Annual Conference of the International Speech Communication Association
Interspeech 2009 Brighton
|
Technical Programme
This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.
Tue-Ses2-P4: Robust Automatic Speech Recognition I
| Time: | Tuesday 13:30 |
Place: | Hewison Hall |
Type: | Poster |
| #1 | Optimization of Dereverberation Parameters based on Likelihood of Speech Recognizer
Randy Gomez (Kyoto University) Tatsuya Kawahara (Kyoto University)
Speech recognition under reverberant conditions is a difficult task. Most dereverberation techniques used to address this problem enhance the reverberant waveform independently of the speech recognizer. In this paper, we improve the conventional Spectral Subtraction-based (SS) dereverberation technique. In our proposed approach, the dereverberation parameters are optimized to improve the likelihood of the acoustic model. The system is capable of adaptively fine-tuning these parameters jointly with acoustic model training. Additional optimization is also implemented during decoding of the test utterances. We evaluated the method using real reverberant data, and experimental results show that it significantly improves recognition performance over the conventional approach.
|
| #2 | Application of noise robust MDT speech recognition on the SPEECON and SpeechDat-Car databases
Jort Florent Gemmeke (Dept. of Linguistics, Radboud University, Nijmegen, The Netherlands) Yujun Wang (ESAT Department, Katholieke Universiteit Leuven, Belgium) Maarten Van Segbroeck (ESAT Department, Katholieke Universiteit Leuven, Belgium) Bert Cranen (Dept. of Linguistics, Radboud University, Nijmegen, The Netherlands) Hugo Van hamme (ESAT Department, Katholieke Universiteit Leuven, Belgium)
We show that the recognition accuracy of an MDT recognizer that performs well on artificially noisified data deteriorates rapidly under realistic noisy conditions (using multiple microphone recordings from the SPEECON/SpeechDat-Car databases) and is outperformed by a commercially available recognizer which was trained using a multi-condition paradigm. Analysis of the recognition results indicates that the recording channels with the lowest SNRs, where the MDT recognizer fails most, are also the channels which suffer most from room reverberation. Despite the channel compensation measures we took, it appears difficult to maintain the restorative power of MDT in such non-additive noise conditions.
|
| #3 | Model based feature enhancement for automatic speech recognition in reverberant environments
Alexander Krueger (University of Paderborn) Reinhold Haeb-Umbach (University of Paderborn)
In this paper we present a new feature space dereverberation technique for automatic speech recognition. We derive an expression for the dependence of the reverberant speech features in the log-mel spectral domain on the non-reverberant speech features and the room impulse response. The obtained observation model is used for a model based speech enhancement based on Kalman filtering. The performance of the proposed enhancement technique is studied on the AURORA5 database. In our currently best configuration, which includes uncertainty decoding, the number of recognition errors is approximately halved compared to the recognition of unprocessed speech.
|
| #4 | A study of mutual front-end processing method based on statistical model for noise robust speech recognition
Masakiyo Fujimoto (NTT Communication Science Laboratories, NTT Corporation) Kentaro Ishizuka (NTT Communication Science Laboratories, NTT Corporation) Tomohiro Nakatani (NTT Communication Science Laboratories, NTT Corporation)
This paper addresses robust front-end processing for automatic speech recognition (ASR) in noise. Accurate recognition of corrupted speech requires noise robust front-end processing, e.g., voice activity detection (VAD) and noise suppression (NS). Typically, VAD and NS are combined as one-way processing, and are developed independently. However, VAD and NS should not be assumed to be independent techniques, because sharing each other's information is important for the improvement of front-end processing. Thus, we investigate mutual front-end processing that integrates VAD and NS so that they can beneficially share each other's information. In an evaluation on a concatenated speech corpus, the CENSREC-1-C database, the proposed method improves the performance of both VAD and ASR compared with the conventional method.
|
| #5 | Integrating Codebook and Utterance Information in Cepstral Statistics Normalization Techniques for Robust Speech Recognition
Guan-min He (National Chi Nan University) Jeih-weih Hung (National Chi Nan University)
Cepstral statistics normalization techniques have been shown to be very successful at improving the noise robustness of speech features. This paper proposes a hybrid-based scheme to achieve a more accurate estimate of the statistical information of features in these techniques. By properly integrating codebook and utterance knowledge, the resulting hybrid-based approach significantly outperforms conventional utterance-based, segment-based and codebook-based approaches in noisy environments. Furthermore, the high-performance CS-HEQ can be implemented with a short delay and can thus be applied in real-time online systems.
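For readers unfamiliar with the conventional utterance-based baseline mentioned in this abstract, the following is a minimal sketch of utterance-level cepstral mean and variance normalization. It is not the hybrid scheme proposed in the paper; the function name `cmvn`, the `eps` guard, and the toy data are assumptions made for illustration only.

```python
import math

def cmvn(frames, eps=1e-8):
    """Normalize each cepstral dimension to zero mean and unit variance
    using statistics estimated from this single utterance (illustrative only)."""
    n = len(frames)
    dims = len(frames[0])
    # Per-dimension mean over all frames of the utterance.
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    # Per-dimension (population) standard deviation.
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dims)
    ]
    # eps (hypothetical guard) avoids division by zero for constant dimensions.
    return [
        [(f[d] - means[d]) / (stds[d] + eps) for d in range(dims)]
        for f in frames
    ]

# Toy utterance: 3 frames, 2 cepstral coefficients each (made-up values).
utterance = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = cmvn(utterance)
```

Because the statistics come from the utterance itself, short or very noisy utterances give unreliable estimates, which is the kind of limitation that codebook-based and hybrid schemes aim to address.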
|
| #6 | Reduced Complexity Equalization of Lombard Effect for Speech Recognition in Noisy Adverse Environments
Hynek Boril (Center for Robust Speech Systems, Erik Jonsson School of Engineering & Computer Science, University of Texas at Dallas, U.S.A) John H.L. Hansen (Center for Robust Speech Systems, Erik Jonsson School of Engineering & Computer Science, University of Texas at Dallas, U.S.A)
Speech signal corruption by background noise, microphone channel variations, and speech production adjustments introduced by speakers in an effort to communicate efficiently over noise (Lombard effect) severely impact automatic speech recognition (ASR) performance. Recently, a set of unsupervised techniques reducing ASR sensitivity to these sources of distortion has been presented. In this study, a scheme utilizing a set of speech-in-noise Gaussian mixture models and a neutral/LE classifier is shown to substantially decrease the computational load of the compensations (from 14 to 2–4 ASR decoding passes) while preserving the performance. In addition, an extended codebook capturing multiple environmental noises is introduced and shown to improve ASR in changing environments. The evaluation is conducted on samples from the Czech Lombard Speech Database (CLSD'05) presented in different levels of background car noise and Aurora 2 noises.
|
| #7 | Unsupervised Training Scheme with Non-Stereo Data for Empirical Feature Vector Compensation
Luis Buera (I3A, University of Zaragoza) Antonio Miguel (I3A, University of Zaragoza) Alfonso Ortega (I3A, University of Zaragoza) Eduardo Lleida (I3A, University of Zaragoza) Richard Stern (Carnegie Mellon University)
In this paper, a novel training scheme based on unsupervised and non-stereo data is presented for Multi-Environment Model-based LInear Normalization (MEMLIN) and MEMLIN with cross-probability model based on GMMs (MEMLIN-CPM). Both are data-driven feature vector normalization techniques which have proved very effective in dynamic noisy acoustic environments. However, techniques of this kind usually require stereo data in a previous training phase, which could be an important limitation in real situations. To overcome this drawback, we present an approach based on the ML criterion and Vector Taylor Series (VTS). Experiments have been carried out with Spanish SpeechDat Car, reaching consistent improvements: 48.7% and 61.9% when the novel training process is applied over MEMLIN and MEMLIN-CPM, respectively.
|
| #8 | Incremental Adaptation with VTS and Joint Adaptively Trained Systems
Federico Flego (Cambridge University) Mark Gales (Cambridge University)
Recently, adaptive training schemes using model based compensation approaches such as VTS and JUD have been proposed. Adaptive training allows the use of multi-environment training data whilst allowing a neutral, "clean" acoustic model to be trained. This paper describes and assesses the advantages of using incremental, rather than batch, mode adaptation with these adaptively trained systems. Incremental adaptation reduces the latency during recognition, and has the possibility of reducing the error rate for slowly varying noise. The work is evaluated on a large scale multi-environment training configuration targeted at in-car speech recognition. Results on in-car collected test data indicate that incremental adaptation is an attractive option when using these adaptively trained systems.
|
| #9 | Target Speech GMM-based Spectral Compensation for Noise Robust Speech Recognition
Takahiro Shinozaki (Tokyo Institute of Technology) Sadaoki Furui (Tokyo Institute of Technology)
To improve speech recognition performance in adverse conditions, a noise compensation method is proposed that applies a transformation in the spectral domain whose parameters are optimized based on the likelihood of a speech GMM modeled in the feature domain. The idea is that additive and convolutional noises have mathematically simple expressions in the spectral domain, while speech characteristics are better modeled in a feature domain such as MFCC. The proposed method works as a feature extraction front-end that is independent of the decoding engine, and has the ability to compensate for non-stationary additive and convolutional noises with a short time delay. It includes spectral subtraction as a special case when no parameter optimization is performed. Experiments were performed using the AURORA-2J database. The results show that the proposed method obtains significantly higher recognition performance than spectral subtraction.
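As background for the special case named in this abstract, here is a minimal sketch of plain power-domain spectral subtraction for a single frame. It does not implement the GMM-based parameter optimization proposed in the paper; the function name and the `alpha` (over-subtraction factor) and `floor` (spectral floor) parameters are illustrative assumptions, as are the toy values.

```python
def spectral_subtraction(noisy_power, noise_power, alpha=1.0, floor=0.01):
    """Subtract a scaled noise power estimate from each frequency bin.

    alpha and floor are hypothetical tuning parameters: alpha scales the
    noise estimate, and floor keeps every bin at or above a small fraction
    of the noisy power so the result never becomes negative.
    """
    return [
        max(y - alpha * n, floor * y)
        for y, n in zip(noisy_power, noise_power)
    ]

# Toy power spectrum of one frame and a flat noise estimate (made-up values).
frame = [4.0, 1.0, 0.5]
noise = [1.0, 1.0, 1.0]
enhanced = spectral_subtraction(frame, noise)
```

In this fixed form the parameters are set by hand and the noise estimate is held constant, which is the situation the abstract describes as performing no parameter optimization.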
|
| #10 | Noise-Robust Feature Extraction Based on Forward Masking
Sheng-Chiuan Chiou (Department of Computer Science and Engineering, National Sun Yat-sen University) Chia-Ping Chen (Department of Computer Science and Engineering, National Sun Yat-sen University)
Forward masking is a phenomenon of human auditory perception in which a weaker sound is masked by a preceding stronger masker. In this paper, we postulate the mechanism of forward masking to be synaptic adaptation and temporal integration, and incorporate them in the feature extraction process of an automatic speech recognition system to improve noise-robustness. The synaptic adaptation is implemented by a highpass filter, and the temporal integration is implemented by a bandpass filter. We apply both filters in the domain of log mel-spectrum. On the Aurora 3 tasks, we evaluate three modified mel-frequency cepstral coefficients: synaptic adaptation only, temporal integration only, and both synaptic adaptation and temporal integration. Experiments show that the overall improvement over the baseline is 16.1%, 21.8%, and 26.2%, respectively, in the three cases.
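To illustrate what filtering in the log mel-spectrum domain along time can look like, the sketch below applies a first-order highpass filter to the trajectory of one log mel band. The abstract does not give the filter design, so the recursion `y[t] = x[t] - x[t-1] + pole * y[t-1]`, the `pole` value, and the function name are all assumptions and should not be read as the authors' filter.

```python
def highpass(trajectory, pole=0.9):
    """Apply a first-order highpass filter along time to one log mel band.

    Recursion (assumed form): y[t] = x[t] - x[t-1] + pole * y[t-1].
    Slowly varying components are suppressed while onsets are emphasized.
    """
    out = []
    prev_x = trajectory[0]  # start from the first sample to avoid an initial jump
    prev_y = 0.0
    for x in trajectory:
        y = x - prev_x + pole * prev_y
        out.append(y)
        prev_x, prev_y = x, y
    return out

# A constant band energy is removed entirely; a step produces a decaying response.
constant = highpass([5.0] * 4)
step = highpass([0.0, 0.0, 1.0, 1.0])
```

In a full front-end, a filter of this kind would be run independently on every mel band before the cepstral transform.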
|
|
|