10th Annual Conference of the International Speech Communication Association
Interspeech 2009 Brighton
Technical Programme
This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.
Tue-Ses2-P3: ASR: Decoding and Confidence Measures
| Time: | Tuesday 13:30 |
| Place: | Hewison Hall |
| Type: | Poster |
| Chair: | Kai Yu |
| #1 | Incremental composition of static decoding graphs
Miroslav Novak (IBM T.J. Watson Research Center)
A fast, scalable and memory-efficient method for static decoding graph construction is presented. As an alternative to the traditional transducer-based approach, it is based on incremental composition. Memory efficiency is achieved by combining composition, determinization and minimization into a single step, thus eliminating large intermediate graphs. We have previously reported the use of incremental composition limited to grammars and left cross-word context. Here, this approach is extended to n-gram models with explicit epsilon arcs and right cross-word context.
|
| #2 | Evaluation of Phone Lattice Based Speech Decoding
Jacques Duchateau (Katholieke Universiteit Leuven) Kris Demuynck (Katholieke Universiteit Leuven) Hugo Van hamme (Katholieke Universiteit Leuven)
Previously, we proposed a flexible two-layered speech recogniser architecture, called FLaVoR. In the first layer, an unconstrained, task-independent phone recogniser generates a phone lattice. Only in the second layer are the task-specific lexicon and language model applied to decode the phone lattice and produce a word-level recognition result. In this paper, we present a further evaluation of the FLaVoR architecture. The performance of a classical single-layered architecture and that of the FLaVoR architecture are compared on two recognition tasks, using the same acoustic, lexical and language models. On the large vocabulary Wall Street Journal 5k and 20k benchmark tasks, the two-layered architecture resulted in slightly but not significantly better word error rates. On a reading error detection task for a reading tutor for children, the FLaVoR architecture clearly outperformed the single-layered architecture.
|
| #3 | A Fully Data Parallel WFST-based Large Vocabulary Continuous Speech Recognition on a Graphics Processing Unit
Jike Chong (University of California, Berkeley) Ekaterina Gonina (University of California, Berkeley) Youngmin Yi (University of California, Berkeley) Kurt Keutzer (University of California, Berkeley)
Tremendous compute throughput is becoming available in personal desktop and laptop systems through the use of graphics processing units (GPUs). However, exploiting this resource requires re-architecting an application to fit a data-parallel programming model. The complex graph traversal routines in the inference process for large vocabulary continuous speech recognition (LVCSR) have been considered by many as unsuitable for extensive parallelization. We explore and demonstrate a fully data parallel implementation of a speech inference engine on NVIDIA's GTX280 GPU. Our implementation has a compute-intensive phase for observation probability computation that allows dynamic elimination of redundant computation while maintaining close-to-peak execution efficiency. We demonstrate the importance of exploring application-level trade-offs in the communication-intensive graph traversal phase to adapt the algorithm to data parallel execution on GPUs.
|
| #4 | Combined low level and high level features for Out-Of-Vocabulary Word detection
Benjamin LECOUTEUX (Laboratoire Informatique d'Avignon (LIA) University of Avignon, France) Georges LINARES (Laboratoire Informatique d'Avignon (LIA) University of Avignon, France) Benoit FAVRE (ICSI, 1947 Center St, Suite 600, Berkeley, CA 94704, USA)
This paper addresses the issue of Out-Of-Vocabulary (OOV) word detection in Large Vocabulary Continuous Speech Recognition (LVCSR) systems. We propose a method inspired by confidence measures that consists of analyzing the recognition system outputs in order to automatically detect errors due to OOV words. This method combines various features based on acoustic, linguistic, decoding graph and semantic information. We evaluate each feature separately and estimate their complementarity. Experiments are conducted on a large French broadcast news corpus from the ESTER evaluation campaign. Results show good performance in real conditions: the method obtains an OOV word detection rate of 43%-90% with 2.5%-17.5% of false detection.
|
| #5 | Bayes Risk Approximations Using Time Overlap with an Application to System Combination
Björn Hoffmeister (Chair of Computer Science 6, Computer Science Department, RWTH Aachen University) Ralf Schlüter (Chair of Computer Science 6, Computer Science Department, RWTH Aachen University) Hermann Ney (Chair of Computer Science 6, Computer Science Department, RWTH Aachen University)
The computation of the Minimum Bayes Risk (MBR) decoding rule for word lattices needs approximations. We investigate a class of approximations where the Levenshtein alignment is approximated under the condition that competing lattice arcs overlap in time. The approximations have their origins in MBR decoding and in discriminative training. We develop modified versions and propose a new, conceptually extremely simple confusion network algorithm. The MBR decoding rule is extended to cope with several lattices, which enables us to apply all the investigated approximations to system combination. All approximations are tested on a Mandarin and on an English LVCSR task for a single system and for system combination. The new methods are competitive in error rate and show some advantages over the standard approaches to MBR decoding.
|
| #6 | Unsupervised Estimation of the Language Model Scaling Factor
Christopher M. White (Human Language Technology Center of Excellence, and Center for Language and Speech Processing, Johns Hopkins University) Ariya Rastrow (Human Language Technology Center of Excellence, and Center for Language and Speech Processing, Johns Hopkins University) Sanjeev Khudanpur (Human Language Technology Center of Excellence, and Center for Language and Speech Processing, Johns Hopkins University) Frederick Jelinek (Human Language Technology Center of Excellence, and Center for Language and Speech Processing, Johns Hopkins University)
This paper addresses the adjustment of the language model (LM) scaling factor of an automatic speech recognition (ASR) system for a new domain using only untranscribed speech. The main idea is to replace the (unavailable) reference transcript with an automatic transcript generated by an independent ASR system, and adjust parameters using this sloppy reference. It is shown that, despite the fairly high error rate (ca. 35%) of the automatic transcript, choosing the scaling factor to minimize disagreement with the erroneous transcript is still an effective recipe for model selection. This effectiveness is demonstrated by adjusting an ASR system trained on Broadcast News to transcribe the MIT Lectures corpus. An ASR system for telephone speech produces the sloppy reference, and optimizing towards it yields a nearly optimal LM scaling factor for the MIT Lectures corpus.
|
| #7 | Simultaneous Estimation of Confidence and Error Cause in Speech Recognition Using Discriminative Model
Atsunori Ogawa (NTT Corporation) Atsushi Nakamura (NTT Corporation)
Since recognition errors are unavoidable in speech recognition, confidence scoring, which accurately estimates the reliability of recognition results, is a critical function for speech recognition engines. In addition to achieving accurate confidence estimation, if we are to develop speech recognition systems that will be widely used by the public, speech recognition engines must be able to report the causes of errors properly, namely they must offer a reason for any failure to recognize input utterances. This paper proposes a method that simultaneously estimates both confidences and causes of errors in speech recognition results by using discriminative models. We evaluated the proposed method in an initial speech recognition experiment, and confirmed its promising performance with respect to confidence and error cause estimation.
|
| #8 | A Generalized Composition Algorithm for Weighted Finite-State Transducers
Cyril Allauzen (Google) Michael Riley (Google) Johan Schalkwyk (Google)
This paper describes a weighted finite-state transducer composition algorithm that generalizes the notion of the composition filter, and presents filters that remove useless epsilon paths and push forward labels and weights along epsilon paths. This filtering allows us to compose together large speech recognition context-dependent lexicons and language models much more efficiently in time and space than previously possible. We present experiments on Broadcast News and Google Search by Voice that demonstrate a 5% to 10% overhead for dynamic, runtime composition compared to a static, offline composition of the recognition transducer. To our knowledge, this is the first such system with such small overhead.
|
| #9 | Word Confidence using Duration Models
Stefano Scanzio (Politecnico di Torino) Pietro Laface (Politecnico di Torino) Daniele Colibro (Loquendo S.p.A.) Roberto Gemello (Loquendo S.p.A.)
In this paper, we propose a word confidence measure based on phone durations depending on large contexts. The measure is based on the expected duration of each recognized phone in a word. In the proposed approach, the duration of each phone is in principle context-dependent, and the measure is a function of the distance between the observed and expected phone duration distributions within a word. Our experiments show that, since the “duration confidence” does not make use of any acoustic information, its Equal Error Rate (EER) in terms of False Accept and False Rejection rates is not as good as the one obtained by using the more informed acoustic confidence measure. However, when the two measures are combined by a simple linear interpolation, the system EER improves by 6% to 10% relative on an isolated word recognition task in several languages.
|
| #10 | A Comparison of Audio-free Speech Recognition Error Prediction Methods
Preethi Jyothi (Ohio State University) Eric Fosler-Lussier (Ohio State University)
Predicting possible speech recognition errors can be invaluable for a number of Automatic Speech Recognition (ASR) applications. In this study, we extend a Weighted Finite State Transducer (WFST) framework for error prediction to facilitate a comparison between two approaches to predicting confusable words: examining recognition errors on the training set to learn phone confusions, and utilizing distances between the phonetic acoustic models for the prediction task. We also expand the framework to deal with continuous word recognition. We can accurately predict 60% of the misrecognized sentences (with an average words-per-sentence count of 15) and a little over 70% of the total number of errors from the unseen test data, where no acoustic information related to the test data is utilized.
|
| #11 | Automatic Out-of-Language Detection based on Confidence Measures derived from LVCSR Word and Phone Lattices
Petr Motlicek (Idiap Research Institute, Martigny, Switzerland)
Confidence Measures (CMs) estimated from Large Vocabulary Continuous Speech Recognition (LVCSR) outputs are commonly used metrics to detect incorrectly recognized words. In this paper, we propose to exploit CMs derived from frame-based word and phone posteriors to detect speech segments containing pronunciations from non-target (alien) languages. The LVCSR system used is built for English, which is the target language, with a medium-size recognition vocabulary (5k words). The efficiency of detection is tested on a set comprising speech from three different languages (English, German, Czech). The results indicate that employing specific temporal context (integrated at the word or phone level) significantly increases the detection accuracy. Furthermore, we show that a combination of several CMs can also improve the efficiency of detection.
|
| #12 | Automatic Estimation of Decoding Parameters Using Large-Margin Iterative Linear Programming
Brian Mak (The Hong Kong University of Science and Technology) Tom Ko (The Hong Kong University of Science and Technology)
The decoding parameters in automatic speech recognition (grammar factor and word insertion penalty) are usually determined by performing a grid search on a development set. Recently, we cast their estimation as a convex optimization problem, and proposed a solution using an iterative linear programming algorithm. However, the solution depends on how well the development data set matches the test set. In this paper, we further investigate an improvement on the generalization property of the solution by using large margin training within the iterative linear programming framework. Empirical evaluation on the WSJ0 5K speech recognition tasks shows that the recognition performance of the decoding parameters found by the improved algorithm using only a subset of the acoustic model training data is even better than that of the decoding parameters found by grid search on the development data, and is close to the performance of those found by grid search on the test set.
|