10th Annual Conference of the International Speech Communication Association
Interspeech 2009 Brighton
|
Technical Programme
This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.
Wed-Ses2-P4: LVCSR Systems and Spoken Term Detection
Time: Wednesday 13:30
Place: Hewison Hall
Type: Poster
Chair: Simon King
| #1 | Real-Time Live Broadcast News Subtitling System for Spanish
Alfonso Ortega (University of Zaragoza) Jose Enrique Garcia (University of Zaragoza) Antonio Miguel (University of Zaragoza) Eduardo Lleida (University of Zaragoza)
Subtitling of live broadcast news is a very important
application to meet the needs of deaf and hard of hearing people.
However, live subtitling is a costly operation in
terms of qualified human resources, and therefore money, if high
precision is desired.
Automatic Speech Recognition researchers can help with this task, saving
both time and money, by developing systems that deliver
subtitles fully synchronized with speech without human
assistance. In this paper we
present a real-time system for automatic subtitling of live broadcast
news in Spanish, based on the News Redaction Computer texts and an Automatic
Speech Recognition engine, to provide precise temporal alignment of speech to text scripts with negligible
latency. The presented system has been working satisfactorily on Aragonese
Public Television since June 2008 without human assistance.
|
| #2 | Development of the 2008 SRI Mandarin Speech-to-text System for Broadcast News and Conversations
Xin Lei (SRI International) Wei Wu (Univ. of Washington) Wen Wang (SRI International) Arindam Mandal (SRI International) Andreas Stolcke (SRI International)
We describe the recent progress in SRI’s Mandarin speech-to-text
system developed for the 2008 evaluation in the DARPA GALE
program. A data-driven lexicon expansion technique and language
model adaptation methods contribute to the improvement
in recognition performance. Our system yields 8.3% character
error rate on the GALE dev08 test set, and 7.5% after combining
with RWTH systems. Compared to our 2007 evaluation system,
a significant improvement of 13% relative has been achieved.
|
| #3 | Multifactor Adaptation for Mandarin Broadcast News and Conversation Speech Recognition
Wen Wang (SRI International) Arindam Mandal (SRI International) Xin Lei (SRI International) Andreas Stolcke (SRI International) Jing Zheng (SRI International)
We explore the integration of multiple factors such as genre and
speaker gender for acoustic model adaptation tasks to improve Mandarin
ASR system performance on broadcast news and broadcast conversation
audio. We investigate the use of multi-factor clustering of acoustic
model training data and the application of MPE-MAP and fMPE-MAP
acoustic model adaptations. We found that by effectively combining
these adaptation approaches, we can achieve a 5% relative improvement in
the final recognition error rate of SRI's state-of-the-art Mandarin
ASR system.
|
| #4 | Development of the GALE 2008 Mandarin LVCSR System
Christian Plahl (RWTH Aachen University) Björn Hoffmeister (RWTH Aachen University) Georg Heigold (RWTH Aachen University) Jonas Lööf (RWTH Aachen University) Ralf Schlüter (RWTH Aachen University) Hermann Ney (RWTH Aachen University)
This paper describes the current improvements of the RWTH Mandarin LVCSR system. We introduce vocal tract length normalization for the Gammatone features and present comparable results for Gammatone-based feature extraction and classical feature extraction. To benefit from the large amount of data (1600h) available in the GALE project, we have trained acoustic models with up to 8M Gaussians. We present detailed character error rates for the different numbers of Gaussians. Different kinds of systems are developed and a two-stage decoding framework is applied, which uses cross-adaptation and a subsequent lattice-based system combination. In addition to various acoustic front-ends, these systems use different kinds of neural network toneme posterior features. We present detailed recognition results of the development cycle and the different acoustic front-ends of the systems. Finally, we compare the final evaluation system to last year's system and report a 10% relative improvement.
|
| #5 | The RWTH Aachen University Open Source Speech Recognition System
David Rybach (RWTH Aachen University, Germany) Christian Gollan (RWTH Aachen University, Germany) Georg Heigold (RWTH Aachen University, Germany) Björn Hoffmeister (RWTH Aachen University, Germany) Jonas Lööf (RWTH Aachen University, Germany) Ralf Schlüter (RWTH Aachen University, Germany) Hermann Ney (RWTH Aachen University, Germany)
We announce the public availability of the RWTH Aachen University speech recognition toolkit. The toolkit includes state of the art speech recognition technology for acoustic model training and decoding. Speaker adaptation, speaker adaptive training, unsupervised training, a finite state automata library, and an efficient tree search decoder are notable components. Comprehensive documentation, example setups for training and recognition, and a tutorial are provided to support newcomers.
|
| #6 | Online Detecting End Times of Spoken Utterances for Synchronization of Live Speech and its Transcripts
Jie Gao (ThinkIT Speech Lab, Institute of Acoustics, Chinese Academy of Sciences) Qingwei Zhao (ThinkIT Speech Lab, Institute of Acoustics, Chinese Academy of Sciences) Yonghong Yan (ThinkIT Speech Lab, Institute of Acoustics, Chinese Academy of Sciences)
In this paper, we present our initial efforts on the task of Automatically Synchronizing live spoken Utterances with their Transcripts (textual contents) (ASUT). We address the problem of detecting online the end time of a spoken utterance given its textual content, which is one of the key problems of the ASUT task. A frame-synchronous likelihood ratio test (FS-LRT) procedure is proposed and explored under the hidden Markov model (HMM) framework. The properties of FS-LRT are studied empirically. Experiments indicate that the proposed approach performs satisfactorily. In addition, the proposed procedure has been successfully applied in a subtitling system for live broadcast news.
|
| #7 | Real-Time ASR from Meetings
Philip N. Garner (Idiap Research Institute, Martigny, Switzerland) John Dines (Idiap Research Institute, Martigny, Switzerland) Thomas Hain (Speech and Hearing Group, The University of Sheffield, UK) Asmaa El Hannani (Speech and Hearing Group, The University of Sheffield, UK) Martin Karafiat (Speech Processing Group, Brno University of Technology, Czech Republic) Danil Korchagin (Idiap Research Institute, Martigny, Switzerland) Mike Lincoln (Centre for Speech Technology Research, The University of Edinburgh, UK) Vincent Wan (Speech and Hearing Group, The University of Sheffield, UK) Le Zhang (Centre for Speech Technology Research, The University of Edinburgh, UK)
The AMI(DA) system is a meeting room speech recognition system that
has been developed and evaluated in the context of the NIST Rich
Transcription (RT) evaluations. Recently, the "Distant Access"
requirements of the AMIDA project have necessitated that the system
operate in real-time. Another more difficult requirement is that
the system fit into a live meeting transcription scenario. We
describe an infrastructure that has allowed the AMI(DA) system to
evolve into one that fulfils these extra requirements. We emphasise
the components that address the live and real-time aspects.
|
| #8 | Improvements to the LIUM French ASR system based on CMU Sphinx: what helps to significantly reduce the word error rate?
Paul Deleglise (LIUM - University of Le Mans) Yannick Esteve (LIUM - University of Le Mans) Sylvain Meignier (LIUM - University of Le Mans) Teva Merlin (LIUM - University of Le Mans)
This paper describes the new ASR system developed by the LIUM and analyzes the sources of the significant drop in word error rate observed in comparison to the previous LIUM ASR system.
This study was carried out on the test data of the latest evaluation campaign of ASR systems on French broadcast news, called ESTER2 and organized in December 2008.
For the same computation time, the new system yields a word error rate about 38% lower than that of the previous system (which ranked second in the ESTER1 evaluation campaign).
This paper evaluates the gain provided by various changes to the system: implementation of new search and training algorithms, new training data, vocabulary size, etc.
The LIUM ASR system was the best open-source ASR system of the ESTER2 campaign.
|
| #9 | Merging Search Spaces for Subword Spoken Term Detection
Timo Mertens (Norwegian University of Science and Technology) Daniel Schneider (Fraunhofer IAIS) Joachim Köhler (Fraunhofer IAIS)
We describe how complementary search spaces, addressed by two different methods used in Spoken Term Detection (STD), can be merged for German subword STD. We propose fuzzy-search techniques on lattices to narrow the gap between subword and word retrieval. The first technique is based on an edit-distance, where no a priori knowledge about confusions is employed. Additionally, we propose a weighting method which explicitly models pronunciation variation on a subword level and thus improves robustness against false positives. Recall is improved by 6% absolute when retrieving on the merged search space rather than using an exact lattice search. By modeling subword pronunciation variation, we increase recall in a high-precision setting by 2% absolute compared to the edit-distance method.
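The edit-distance idea mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' method: it assumes a single best subword sequence instead of a lattice, unit costs for all edits (that is, no a priori confusion knowledge), and fixed-length windows. The names `edit_distance`, `fuzzy_find`, and `max_dist` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences, unit cost per edit."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def fuzzy_find(query, transcript, max_dist=1):
    """Return (start, distance) for every window of len(query) subwords
    in transcript whose edit distance to query is at most max_dist."""
    n = len(query)
    hits = []
    for start in range(len(transcript) - n + 1):
        d = edit_distance(query, transcript[start:start + n])
        if d <= max_dist:
            hits.append((start, d))
    return hits


# Characters stand in for subword units in this toy example.
print(fuzzy_find(list("abc"), list("xxabdxx"), max_dist=1))  # [(2, 1)]
```

Raising `max_dist` trades precision for recall, which is the kind of trade-off the abstract addresses by weighting subword confusions instead of treating all edits equally.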
|
| #10 | A Posterior Probability-Based System Hybridisation and Combination for Spoken Term Detection
Javier Tejedor (HCTLab-UAM) Dong Wang (The Centre For Speech Technology Research) Simon King (The Centre For Speech Technology Research) Joe Frankel (The Centre For Speech Technology Research) Jose Colas (HCTLab-UAM)
Spoken term detection (STD) is a fundamental task for multimedia information retrieval. To improve detection performance, we previously presented a direct posterior-based confidence measure generated from a neural network. In this paper, we propose a detection-independent confidence estimation based on the direct posterior confidence measure, in which decision making is fully separated from term detection. Based on this idea, we first present a hybrid system which conducts term detection and confidence estimation based on different sub-word units, and then propose a combination method which merges detections from heterogeneous term detectors based on the direct posterior-based confidence. Experimental results demonstrate that the proposed methods improve system performance considerably for both English and Spanish.
|
| #11 | Stochastic Pronunciation Modelling for Spoken Term Detection
Dong Wang (The Centre for Speech Technology Research, University of Edinburgh, UK) Simon King (The Centre for Speech Technology Research, University of Edinburgh, UK) Joe Frankel (The Centre for Speech Technology Research, University of Edinburgh, UK)
A major challenge faced by a spoken term detection (STD) system is the detection of out-of-vocabulary (OOV) terms. Although a subword-based STD system is able to detect OOV terms, performance reduction is always observed compared to in-vocabulary terms. Current approaches to STD do not acknowledge the particular properties of OOV terms, such as pronunciation uncertainty. In this paper, we use a stochastic pronunciation model to deal with the uncertain pronunciations of OOV terms. By considering all possible term pronunciations, predicted by a joint-multigram model, we observe a significant performance improvement.
|
| #12 | Term-Dependent Confidence for Out-of-Vocabulary Term Detection
Dong Wang (The Centre for Speech Technology Research, University of Edinburgh, UK) Simon King (The Centre for Speech Technology Research, University of Edinburgh, UK) Joe Frankel (The Centre for Speech Technology Research, University of Edinburgh, UK) Peter Bell (The Centre for Speech Technology Research, University of Edinburgh, UK)
Within a spoken term detection (STD) system, the decision maker plays an important role in retrieving reliable detections. Most state-of-the-art STD systems make decisions based on a confidence measure that is term-independent, which poses a serious problem for out-of-vocabulary (OOV) term detection. In this paper, we study a term-dependent confidence measure based on confidence normalisation and discriminative modelling, focusing in particular on its effectiveness for detecting OOV terms. Experimental results indicate that the term-dependent confidence provides a much larger improvement for OOV terms than for in-vocabulary terms.
|
| #13 | A Comparison of Query-by-Example Methods for Spoken Term Detection
Wade Shen (MIT/Lincoln Laboratory) Christopher White (MIT/Lincoln Laboratory) Timothy Hazen (MIT/Lincoln Laboratory)
In this paper we examine an alternative interface for phonetic search,
namely query-by-example, that avoids OOV issues associated with both
standard word-based and phonetic search methods. We develop three
methods that compare query lattices derived from example audio against
a standard n-gram-based phonetic index, and we analyze factors affecting
the performance of these systems. We show that the best systems under
this paradigm are able to achieve 77% precision when retrieving
utterances from conversational telephone speech and returning 10
results from a single query (performance that is better than a similar
dictionary-based approach) suggesting significant utility for
search applications. We also show that these systems
can be further improved using relevance feedback: by incorporating
four additional queries, the precision of the best system can be
improved by 13.7% relative.
|
| #14 | Fast Keyword Detection Using Suffix Array
Kouichi Katsurada (Toyohashi University of Technology) Shigeki Teshima (Toyohashi University of Technology) Tsuneo Nitta (Toyohashi University of Technology)
In this paper, we propose a technique for detecting keywords quickly from a very large speech database without using a large memory space. To accelerate searches and save memory, we used a suffix array as the data structure and applied phoneme-based DP-matching. To avoid an exponential increase in the processing time with the length of the keyword, a long keyword is divided into short sub-keywords. Moreover, an iterative lengthening search algorithm is used to rapidly output accurate search results. The experimental results show that it takes less than 100ms to detect the first set of search results from a 10,000-h virtual speech database.
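As a rough illustration of the data structure named in this abstract, the sketch below builds a suffix array over a symbol sequence and finds exact occurrences of a keyword by binary search. It is a simplification under stated assumptions, not the authors' system: matching is exact rather than DP-based, the construction is naive, and the sub-keyword splitting and iterative lengthening search are omitted. The function names are hypothetical.

```python
def build_suffix_array(seq):
    """Start indices of all suffixes of seq (a tuple), in lexicographic
    order of the suffixes. Naive construction, adequate for a sketch."""
    return sorted(range(len(seq)), key=lambda i: seq[i:])


def find_occurrences(seq, sa, pattern):
    """Start positions of exact occurrences of pattern in seq (a tuple),
    located with two binary searches over the suffix array sa."""
    pattern = tuple(pattern)
    m = len(pattern)
    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if seq[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if seq[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


# Characters stand in for phoneme symbols in this toy example.
phonemes = tuple("banana")
sa = build_suffix_array(phonemes)
print(find_occurrences(phonemes, sa, "ana"))  # [1, 3]
```

Each lookup needs only a logarithmic number of prefix comparisons in the length of the indexed sequence, which is why a suffix array suits fast search over a large database; tolerating recognition errors would require replacing the exact comparison with approximate matching.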
|