Brighton Pavilion

10th Annual Conference of the International Speech Communication Association

ISCA Interspeech 2009 Brighton

Technical Programme

This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.

Mon-Ses3-P3:
Automatic Speech Recognition: Adaptation I

Time: Monday 16:00 Place: Hewison Hall Type: Poster
Chair: Stephen Cox

On the Development of Matched and Mismatched Italian Children’s Speech Recognition Systems

Piero Cosi (ISTC-CNR (Istituto di Scienze e Tecnologie della Cognizione - Consiglio Nazionale delle Ricerche))

While at least read speech corpora are available for Italian children’s speech research, there are many languages for which this is not the case. Learning statistical mappings between the adult and child acoustic space using existing adult/children corpora may provide a future direction for generating children’s models for such data-deficient languages. This work describes recent advances in the development of the SONIC Italian children’s speech recognition system. Specifically, the complete training and test set of the FBK (ex ITC-irst) Italian Children’s Speech Corpus (ChildIt) was considered. Using the University of Colorado SONIC LVSR system, we demonstrate a phonetic recognition error rate of 12.0% for a system which incorporates Vocal Tract Length Normalization (VTLN), Speaker-Adaptive Trained phonetic models, as well as unsupervised Structural MAP Linear Regression (SMAPLR).

Speaker Adaptation Based on Two-Step Active Learning

Koichi Shinoda (Tokyo Institute of Technology)
Hiroko Murakami (Tokyo Institute of Technology)
Sadaoki Furui (Tokyo Institute of Technology)

We propose a two-step active learning method for supervised speaker adaptation. In the first step, the initial adaptation data is collected to obtain a phone error distribution. In the second step, those sentences whose phone distributions are close to the error distribution are selected, and their utterances are collected as the additional adaptation data. We evaluated the method using a Japanese speech database and maximum likelihood linear regression (MLLR) as the speaker adaptation algorithm. We confirmed that our method achieved a significant improvement over a method using randomly chosen sentences for adaptation.
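
The second step of this selection scheme could be sketched as follows. This is only an illustration, not the authors' implementation: the abstract does not say how closeness between distributions is measured, so KL divergence is assumed here, and the function names (`normalise`, `kl_divergence`, `select_sentences`) and the toy phone counts are hypothetical.

```python
import math
from collections import Counter


def normalise(counts, phones, floor=1e-6):
    """Turn raw phone counts into a smoothed probability distribution."""
    total = sum(counts.get(p, 0) + floor for p in phones)
    return {p: (counts.get(p, 0) + floor) / total for p in phones}


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same phone set."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p)


def select_sentences(error_counts, candidates, n):
    """Return the n candidate sentences whose phone distributions are
    closest to the phone error distribution (assumed distance: KL)."""
    phones = set(error_counts)
    for phone_seq in candidates.values():
        phones.update(phone_seq)
    phones = sorted(phones)
    error_dist = normalise(error_counts, phones)

    def distance(sentence_id):
        sent_dist = normalise(Counter(candidates[sentence_id]), phones)
        return kl_divergence(error_dist, sent_dist)

    return sorted(candidates, key=distance)[:n]


# Step 1 (assumed already done): phone errors counted on the initial data.
error_counts = {"a": 1, "k": 6, "s": 3}
# Step 2: rank hypothetical candidate sentences by closeness to the errors.
candidates = {
    "s1": ["a", "a", "a", "k"],
    "s2": ["k", "k", "s", "a"],
    "s3": ["s", "s", "s", "s"],
}
print(select_sentences(error_counts, candidates, n=1))
```

In this toy example the sentence rich in the most error-prone phone ("k") is ranked first; the utterances of the selected sentences would then be recorded as additional adaptation data.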

Using VTLN matrices for Rapid and Computationally-Efficient Speaker Adaptation with Robustness to First-Pass Transcription Errors

Shakti Prasad Rath (Indian Institute of Technology Kanpur)
Srinivasan Umesh (Indian Institute of Technology Kanpur)
Achintya Kumar Sarkar (Indian Institute of Technology Kanpur)

In this paper we combine the rapid adaptation capability of conventional VTLN with the computational efficiency of transform-based adaptation such as CMLLR. Unlike transform-based adaptation methods, conventional VTLN requires very little adaptation data. However, conventional VTLN is computationally expensive since it requires generation of warped features. We have recently shown that VTLN can be efficiently implemented as a linear transformation with computational complexity similar to CMLLR. In this framework, VTLN provides a significant improvement in performance over transform-based adaptation when little adaptation data is available. We also show that the use of MLLT along with VTLN gives performance that is better than MLLR and comparable to SAT with MLLT even for large adaptation data. Further, we show that in mismatched conditions, VTLN provides a significant improvement over transform-based adaptation. We compare the performance of different methods on WSJ, RM and TIDIGITS tasks.

Acoustic Class Specific VTLN-Warping using Regression Class Trees

Shakti Prasad Rath (Indian Institute of Technology Kanpur)
Srinivasan Umesh (Indian Institute of Technology Kanpur)

In this paper we study the use of different frequency warp-factors for different acoustic classes. This is motivated by the fact that not all acoustic classes exhibit similar spectral variation as a result of physiological differences in the vocal tract, and therefore the use of a single frequency-warp for the entire utterance may not be appropriate. We have recently proposed a VTLN method that implements VTLN-warping through a linear transformation of the conventional MFCC features and efficiently estimates the warp-factor using the same sufficient statistics that are used in CMLLR adaptation. In this paper, we show that, within this efficient VTLN framework and using the idea of a regression class tree, it is possible to obtain separate frequency-warping for different acoustic classes. On the WSJ database we report the recognition performance of the proposed method for data-driven and phonetic-knowledge-based regression class trees.

Bilinear Transformation Space-based Maximum Likelihood Linear Regression

Hwa Jeon Song (School of Electrical Engineering, Pusan National University)
Yongwon Jeong (School of Electrical Engineering, Pusan National University)
Hyung Soon Kim (School of Electrical Engineering, Pusan National University)

This paper proposes two types of bilinear transformation space-based speaker adaptation frameworks. In the training session, transformation matrices for speakers are decomposed by a singular value decomposition-based algorithm into the style factor for speakers’ characteristics and an orthonormal basis of eigenvectors that controls the dimensionality of the canonical model. In the adaptation session, the style factor of a new speaker is estimated, depending on which of the proposed frameworks is used. At the same time, the dimensionality of the canonical model can be reduced by the orthonormal basis from training. Moreover, both maximum likelihood linear regression (MLLR) and eigenspace-based MLLR are identified as special cases of our proposed methods. Experimental results show that the proposed methods are much more effective and versatile than other methods.

Speaking Style Adaptation for Spontaneous Speech Recognition Using Multiple-Regression HMM

Yusuke Ijima (Tokyo Institute of Technology)
Takeshi Matsubara (Tokyo Institute of Technology)
Takashi Nose (Tokyo Institute of Technology)
Takao Kobayashi (Tokyo Institute of Technology)

This paper describes a rapid model adaptation technique for spontaneous speech recognition. The proposed technique utilizes a multiple-regression hidden Markov model (MRHMM) and is based on a style estimation technique of speech. In the MRHMM, the mean vector of each probability density function (pdf) is given by a function of a low-dimensional vector, called the style vector, which corresponds to the intensity of expressivity of speaking style variation. The value of the style vector is estimated for every utterance of the input speech and the model adaptation is conducted by calculating new mean vectors of the pdf using the estimated style vector. The performance evaluation results using the “Corpus of Spontaneous Japanese (CSJ)” are shown under a condition in which the amount of model training and adaptation data is very small.
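
As a rough illustration of how a mean vector can depend on a style vector, the sketch below assumes an affine (multiple-regression) form, mean = H · [1, v], which the abstract itself does not spell out. The function name `mrhmm_mean`, the matrix `H` and all numbers are hypothetical, and the estimation of the style vector from an utterance is not shown.

```python
def mrhmm_mean(regression_matrix, style_vector):
    """Mean vector as an affine function of the style vector (assumed form).

    regression_matrix has one row per feature dimension; column 0 is the
    bias term and the remaining columns weight the style components.
    """
    xi = [1.0] + list(style_vector)
    return [sum(h * s for h, s in zip(row, xi)) for row in regression_matrix]


# Hypothetical 2-dimensional features with a 2-dimensional style vector.
H = [
    [1.0, 0.5, -0.2],
    [0.0, 1.0, 0.3],
]
print(mrhmm_mean(H, [0.0, 0.0]))  # zero style vector: only the bias column remains
print(mrhmm_mean(H, [1.0, 2.0]))  # a non-zero style vector shifts the mean
```

Under this assumption, adapting to a new utterance only requires re-evaluating the means with the newly estimated style vector, which is why so few parameters need to be estimated.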

Improving the robustness by multiple sets of HMMs

Hans-Guenter Hirsch (Niederrhein University of Applied Sciences)
Andreas Kitzig (Niederrhein University of Applied Sciences)

The highest recognition performance is still achieved when training a recognition system with speech data that have been recorded in the acoustic scenario where the system will be applied. We investigated the approach of using several sets of HMMs. These sets have been trained on data that were recorded in different typical noise situations. One HMM set is individually selected at each speech input by comparing the pause segment at the beginning of the utterance with the pause models of all sets. We observed a considerable reduction of the error rates when applying this approach in comparison to two well known techniques for improving the robustness. Furthermore, we developed a technique to additionally adapt certain parameters of the selected HMMs to the specific noise condition. This leads to a further improvement of the recognition rates.
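
The selection of one HMM set per utterance might look roughly like the sketch below. It makes strong simplifying assumptions that are not taken from the abstract: each set's pause model is reduced to a single diagonal Gaussian, the comparison is a summed log-likelihood over the leading pause frames, and the names (`gaussian_loglik`, `select_hmm_set`), noise conditions and numbers are hypothetical.

```python
import math


def gaussian_loglik(frame, mean, var):
    """Log-likelihood of one feature frame under a diagonal Gaussian."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(frame, mean, var)
    )


def select_hmm_set(pause_frames, pause_models):
    """Pick the HMM set whose pause model best explains the leading pause.

    pause_models maps a set name to (mean, variance) of its pause model.
    """
    def score(name):
        mean, var = pause_models[name]
        return sum(gaussian_loglik(f, mean, var) for f in pause_frames)

    return max(pause_models, key=score)


# Hypothetical pause models, one per noise condition used in training.
pause_models = {
    "clean": ([0.0, 0.0], [1.0, 1.0]),
    "car": ([3.0, 1.0], [1.5, 1.0]),
    "babble": ([1.0, 4.0], [2.0, 2.0]),
}
# Feature frames from the pause at the beginning of an utterance.
pause_frames = [[2.8, 1.1], [3.3, 0.7], [2.9, 1.2]]
print(select_hmm_set(pause_frames, pause_models))
```

The utterance would then be decoded with the selected set only; the additional adaptation of the selected HMMs to the specific noise condition mentioned in the abstract is not covered by this sketch.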

On the Use of Pitch Normalization for Improving Children's Speech Recognition

Rohit Sinha (Department of Electronics and Communication Engineering, Indian Institute of Technology Guwahati, Guwahati-781039, India.)
Shweta Ghai (Department of Electronics and Communication Engineering, Indian Institute of Technology Guwahati, Guwahati-781039, India.)

In this work, we have studied the effect of pitch variations across speech signals in the context of automatic speech recognition. Our initial study on vowel data indicates that, on account of insufficient smoothing of pitch harmonics by the filterbank, particularly for high-pitch signals, the variances of mel frequency cepstral coefficient (MFCC) features increase significantly with increasing pitch of the speech signals. Further, to reduce the variance of MFCC features due to varying pitch among speakers, a maximum likelihood based explicit pitch normalization method has been explored. On a connected digit recognition task, pitch normalization gives a relative improvement of 15% over the baseline for children's speech (higher pitch) on models trained on adults' speech (lower pitch).

Speaker normalization for template based speech recognition

Sébastien Demange (Katholieke Universiteit Leuven ESAT/PSI)
Dirk Van Compernolle (Katholieke Universiteit Leuven ESAT/PSI)

Vocal Tract Length Normalization (VTLN) has been shown to be an efficient speaker normalization tool for HMM based systems. In this paper we show that it is equally efficient for a template based recognition system. Template based systems, while promising, have the potential drawback that templates maintain all non-phonetic details apart from the essential phonemic properties; i.e. they retain information on speaker and acoustic recording circumstances. This may lead to a very inefficient usage of the database. We show that after VTLN significantly more speakers - also from the opposite gender - contribute templates to the matching sequence compared to the non-normalized case. In experiments on the Wall Street Journal database this leads to a relative word error rate reduction of 10%.

Combination of Acoustic and Lexical Speaker Adaptation for Disordered Speech Recognition

Oscar Saz (University of Zaragoza)
Eduardo Lleida (University of Zaragoza)
Antonio Miguel (University of Zaragoza)

This paper presents an approach to providing lexical adaptation in Automatic Speech Recognition (ASR) of the disordered speech from a group of young impaired speakers. The outcome of an Acoustic Phonetic Decoder (APD) is used to learn new lexical variants of the 57-word vocabulary and add them to a lexicon personalized to each user. The possibilities of combining this lexical adaptation with acoustic adaptation achieved through traditional Maximum A Posteriori (MAP) approaches are further explored, and the results show the importance of matching the lexicon in the ASR decoding phase to the lexicon used for the acoustic adaptation.

Tree-based Estimation of Speaker Characteristics for Speech Recognition

Mats Blomberg (Dept. of Speech, Music and Hearing, KTH/CSC, Stockholm, Sweden)
Daniel Elenius (Dept. of Speech, Music and Hearing, KTH/CSC, Stockholm, Sweden)

A hierarchical tree is designed to reduce the computationally heavy demands of joint multi-dimensional estimation of speaker characteristic properties in speech recognition. The leaf model sets are created by transforming a conventionally trained set. Non-leaf sets are formed by merging the models of their child nodes. One- (VTLN) and four-dimensional speaker profile vectors (VTLN, two spectral slope parameters and model variance scaling) reduce the computational load to a fraction compared to that of an exhaustive search. In recognition experiments on children's connected digits using adult and male models, the one-dimensional tree search performed as well as the exhaustive search. Further reduction was achieved with four dimensions. The best recognition results are 0.93% and 10.2% WER in TIDIGITS and PF-Star-Sw, respectively, using adult models.

A Study on the Influence of Covariance Adaptation on Jacobian Compensation in Vocal Tract Length Normalization

Rama Sanand Doddipatla (Indian Institute of Technology Kanpur)
Shakti Prasad Rath (Indian Institute of Technology Kanpur)
Srinivasan Umesh (Indian Institute of Technology Kanpur)

In this paper, we first show that accounting for the Jacobian in VTLN degrades the performance in mismatched train and test speaker conditions. VTLN is implemented using our recently proposed approach of linear transformation (LT) of conventional MFCC, i.e., a feature transformation. In this case, the Jacobian is simply the determinant of the LT. Feature transformation is equivalent to the means and covariances of the model being transformed by the inverse transformation while leaving the data unchanged. Using a set of adaptation experiments, we analyze the reasons for the degradation during Jacobian compensation and conclude that applying the same VTLN transformation on both means and variances does not fully match the data when there is a mismatch in the speaker conditions. We propose to use covariance adaptation on top of VTLN to account for the covariance mismatch between the train and the test speakers and show that accounting for the Jacobian after covariance adaptation improves the performance.
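
The equivalence between a feature transformation with Jacobian compensation and an inverse transformation of the model can be checked numerically. The sketch below is a deliberately simplified illustration: it assumes a diagonal transformation and diagonal covariances so that plain Python suffices, whereas the VTLN linear transformation described above is in general a full matrix. All function names and values are hypothetical.

```python
import math


def diag_gauss_loglik(x, mean, var):
    """Log-likelihood of x under a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def feature_space_loglik(x, scale, mean, var):
    """Transform the features (y = A x, A diagonal) and add log|det A|."""
    y = [a * xi for a, xi in zip(scale, x)]
    log_det = sum(math.log(abs(a)) for a in scale)
    return diag_gauss_loglik(y, mean, var) + log_det


def model_space_loglik(x, scale, mean, var):
    """Leave the features unchanged; apply the inverse transform to the
    model means and variances instead."""
    new_mean = [m / a for m, a in zip(mean, scale)]
    new_var = [v / a ** 2 for v, a in zip(var, scale)]
    return diag_gauss_loglik(x, new_mean, new_var)


x = [1.2, -0.4]          # one feature frame
scale = [0.9, 1.1]       # diagonal of the assumed transformation A
mean, var = [1.0, 0.0], [0.5, 2.0]

print(feature_space_loglik(x, scale, mean, var))
print(model_space_loglik(x, scale, mean, var))
```

The two printed values agree up to rounding. Dropping the `log_det` term corresponds to ignoring the Jacobian, which amounts to transforming the model without the matching normalisation of its variances.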