|
10th Annual Conference of the International Speech Communication Association
Interspeech 2009 Brighton
|
Technical Programme
This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.
Wed-Ses1-O3: Automatic Speech Recognition: Adaptation II
| Time: | Wednesday 10:00 |
Place: | East Wing 2 |
Type: | Oral |
| Chair: | Satoshi Nakamura |
| 10:00 | On the Estimation and the Use of Confusion-Matrices for Improving ASR Accuracy
Santiago Omar Caballero Morales (University of East Anglia, School of Computing Sciences) Stephen Cox (University of East Anglia, School of Computing Sciences)
In previous work, we described how learning the pattern of recognition errors made by an individual using a certain ASR system leads to increased recognition accuracy compared with a standard MLLR adaptation approach. This was the case for low-intelligibility speakers with dysarthric speech, but no improvement was observed for normal speakers. In this paper, we describe an alternative method for obtaining the training data for confusion-matrix estimation for normal speakers which is more effective than our previous technique. We also address the issue of data sparsity in the estimation of confusion matrices by using non-negative matrix factorization (NMF) to discover structure within them. The confusion-matrix estimates made using these techniques are integrated into the ASR process using a technique termed "metamodels", and the results presented here show statistically significant gains in word recognition accuracy when applied to normal speech.
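As a rough illustration of the NMF step mentioned in this abstract, the sketch below factorizes a small confusion-count matrix with the standard multiplicative update rules and renormalizes the low-rank approximation into row-wise confusion probabilities. The function names, the rank, and the toy counts are assumptions made for illustration only; this is not the authors' implementation, and the abstract does not say which NMF variant they used.

```python
import numpy as np

def nmf(V, rank, n_iter=500, eps=1e-9, seed=0):
    """Factorize non-negative V (n x m) as W @ H using multiplicative
    updates that reduce the squared (Frobenius) reconstruction error."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so W and H stay non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def smoothed_confusion_matrix(counts, rank):
    """Hypothetical helper: low-rank NMF approximation of sparse confusion
    counts, renormalized so each row is a probability distribution."""
    W, H = nmf(np.asarray(counts, dtype=float), rank)
    approx = W @ H + 1e-12  # small offset avoids all-zero rows
    return approx / approx.sum(axis=1, keepdims=True)

# Toy counts: rows are the intended units, columns are the recognized units.
counts = np.array([
    [8, 2, 0, 0],
    [3, 6, 0, 1],
    [0, 0, 7, 3],
    [0, 1, 2, 9],
])
P = smoothed_confusion_matrix(counts, rank=2)
```

Cells that were zero in the sparse counts can receive small non-zero probability from the shared low-rank structure, which is one way such a factorization can help with data sparsity.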
|
| 10:20 | A Study on Soft Margin Estimation of Linear Regression Parameters for Speaker Adaptation
Shigeki Matsuda (Spoken Language Communication Group, National Institute of Information and Communication Technology) Yu Tsao (Spoken Language Communication Group, National Institute of Information and Communication Technology) Jinyu Li (Speech Component Group, Microsoft Corporation) Satoshi Nakamura (Spoken Language Communication Group, National Institute of Information and Communication Technology) Chin-Hui Lee (School of Electrical and Computer Engineering, Georgia Institute of Technology)
We formulate a framework for soft margin estimation-based linear regression (SMELR) and apply it to supervised speaker adaptation. Enhanced separation capability and increased discriminative ability are two key properties in margin-based discriminative training. For the adaptation process to be able to flexibly utilize any amount of data, we also propose a novel interpolation scheme to linearly combine the speaker independent (SI) and speaker adaptive SMELR (SMELR/SA) models. The two proposed SMELR algorithms were evaluated on a Japanese large vocabulary continuous speech recognition task. Both the SMELR and interpolated SI+SMELR/SA techniques showed improved speech adaptation performance in comparison with the well-known maximum likelihood linear regression (MLLR) method. We also found that the interpolation framework works even more effectively than SMELR when the amount of adaptation data is relatively small.
|
| 10:40 | Exploring the Role of Spectral Smoothing in context of Children's Speech Recognition
Shweta Ghai (Department of Electronics and Communication Engineering, Indian Institute of Technology Guwahati, Guwahati-781039, India.) Rohit Sinha (Department of Electronics and Communication Engineering, Indian Institute of Technology Guwahati, Guwahati-781039, India.)
This work is motivated by our earlier study which shows that on explicit pitch normalization the children's speech recognition performance on the adults' speech trained models improves as a result of reduction in the pitch-dependent distortions in the spectral envelope. In this paper, we study the role of spectral smoothing in context of children's speech recognition. The spectral smoothing has been effected in the feature domain by two approaches viz., modification of bandwidth of the filters in the filterbank and cepstral truncation. In conjunction, both approaches give significant improvement in the children's speech recognition performance with 57% relative improvement over the baseline. Also, when combined with the widely used vocal tract length normalization (VTLN), these spectral smoothing approaches result in an additional 25% relative improvement over the VTLN performance for children's speech recognition on the adults' speech trained models.
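To make the idea of cepstral truncation concrete, here is a minimal sketch that smooths a log filterbank spectrum by keeping only the first few DCT (cepstral) coefficients and transforming back. The helper names, the 24-channel toy spectrum, and the choice of four retained coefficients are assumptions for illustration; the abstract does not give the authors' actual filterbank or truncation settings.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    D = np.sqrt(2.0 / n) * np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    D[0, :] /= np.sqrt(2.0)
    return D

def truncate_cepstrum(log_energies, n_keep):
    """Hypothetical helper: zero all cepstral coefficients from index
    n_keep upward and return the smoothed log spectrum."""
    x = np.asarray(log_energies, dtype=float)
    D = dct_matrix(len(x))
    cepstrum = D @ x
    cepstrum[n_keep:] = 0.0
    return D.T @ cepstrum  # inverse of an orthonormal DCT is its transpose

# Toy log spectrum: a slow envelope plus a fast ripple standing in for
# pitch-dependent distortion.
channels = np.arange(24)
envelope = np.cos(np.pi * channels / 24)
ripple = 0.3 * np.cos(np.pi * channels)
smoothed = truncate_cepstrum(envelope + ripple, n_keep=4)
```

Keeping fewer coefficients removes faster variation across frequency channels, so the retained count controls how aggressively the spectral envelope is smoothed.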
|
| 11:00 | Unsupervised Lattice-based Acoustic Model Adaptation for Speaker-Dependent Conversational Telephone Speech Transcription
Kit Thambiratnam (Microsoft Research) Frank Seide (Microsoft Research)
This paper examines the application of lattice adaptation techniques to speaker-dependent models for the purpose of conversational telephone speech transcription. Given sufficient training data per speaker, it is feasible to build adapted speaker-dependent models using lattice MLLR and lattice MAP. Experiments on iterative and cascaded adaptation are presented. Additionally, various strategies for thresholding frame posteriors are investigated, and it is shown that accumulating statistics from the local best-confidence path is sufficient to achieve optimal adaptation. Overall, an iterative cascaded lattice system was able to reduce WER by 7.0% abs., which was a 0.8% abs. gain over transcript-based adaptation. Lattice adaptation reduced the unsupervised/supervised adaptation gap from 2.5% to 1.7%.
|
| 11:20 | Rapid Unsupervised Adaptation Using Frame Independent Output Probabilities of Gender and Context Independent Phoneme Models
Satoshi KOBASHIKAWA (NTT Cyber Space Laboratories) Atsunori OGAWA (NTT Communication Science Laboratories) Yoshikazu YAMAGUCHI (NTT Cyber Space Laboratories) Satoshi TAKAHASHI (NTT Cyber Space Laboratories)
Business is demanding higher recognition accuracy with no increase in computation time compared to previously adopted baseline speech recognition systems. Accuracy can be improved by adding a gender dependent acoustic model and unsupervised adaptation based on CMLLR. CMLLR-based batch-type unsupervised adaptation estimates a single global transformation matrix by utilizing prior unsupervised labeling, which unfortunately increases the computation time. Our proposed technique reduces prior gender selection and labeling time by using frame independent output probabilities of only gender dependent speech GMM and monophone HMM in a dual-gender acoustic model. The proposed technique further raises accuracy by employing a power term after adaptation. Simulations using spontaneous speech show that the proposed technique reduces computation time by 17.9 % and the relative error in correct rate by 13.7 % compared to the baseline without prior gender selection and unsupervised adaptation.
|
| 11:40 | Bark-shift based nonlinear speaker normalization using the second subglottal resonance
Shizhen Wang (University of California, Los Angeles) Yi-Hui Lee (University of California, Los Angeles) Abeer Alwan (University of California, Los Angeles)
In this paper, we propose a Bark-scale shift based piecewise nonlinear warping function for speaker normalization, and a joint frequency discontinuity and energy attenuation detection algorithm to estimate the second subglottal resonance (Sg2). We then apply Sg2 for rapid speaker normalization. Experimental results on children's speech recognition show that the proposed nonlinear warping function is more effective for speaker normalization than linear frequency warping. Compared to maximum likelihood based grid search methods, Sg2 normalization is more efficient and achieves comparable or better performance, especially for limited normalization data.
|