|
10th Annual Conference of the International Speech Communication Association
Interspeech 2009 Brighton
|
Technical Programme
This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.
Mon-Ses2-O3: Systems for LVCSR and Rich Transcription
| Time: | Monday 13:30 |
| Place: | East Wing 2 |
| Type: | Oral |
| Chair: | Thomas Schaaf |
| 13:30 | Minimum Hypothesis Phone Error as a Decoding Method for Speech Recognition
Haihua Xu (Shanghai Jiaotong University, China); Daniel Povey (Microsoft Research, Redmond, WA, USA); Jie Zhu (Shanghai Jiaotong University, China); Guanyong Wu (Shanghai Jiaotong University, China)
In this paper we show how methods for approximating phone error,
as normally used for Minimum Phone Error (MPE) discriminative training,
can be used instead as a decoding criterion for lattice rescoring. This is
an alternative to Confusion Networks (CN), which are commonly used in speech recognition.
The standard (Maximum A Posteriori) decoding approach
is a Minimum Bayes Risk estimate with respect to the Sentence
Error Rate (SER); however, we are typically more interested in the Word Error Rate (WER).
Methods such as CN and our proposed Minimum Hypothesis Phone Error
(MHPE) aim to get closer to minimizing the expected WER.
Based on preliminary experiments we find that our approach gives
more improvement than CN, and is conceptually simpler.
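The contrast between MAP decoding and a Minimum Bayes Risk criterion can be illustrated with a minimal sketch. This is not the authors' MHPE method, which approximates phone error over lattices; as a simplifying assumption it uses word-level edit distance over a small N-best list, and the hypotheses, posteriors, and function names are all hypothetical.

```python
from typing import List, Sequence, Tuple

Hypothesis = Tuple[List[str], float]  # (word sequence, posterior probability)


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def map_decode(nbest: List[Hypothesis]) -> List[str]:
    """MAP decoding: pick the single most probable sentence."""
    return max(nbest, key=lambda item: item[1])[0]


def mbr_decode(nbest: List[Hypothesis]) -> List[str]:
    """MBR decoding: pick the hypothesis with the lowest expected word errors,
    treating every N-best entry in turn as the possible true transcript."""
    total = sum(p for _, p in nbest)
    best, best_risk = None, float("inf")
    for cand, _ in nbest:
        risk = sum(p / total * edit_distance(ref, cand) for ref, p in nbest)
        if risk < best_risk:
            best, best_risk = cand, risk
    return best


# Hypothetical N-best list: the top hypothesis is an outlier, while the
# other two agree with each other on most words.
nbest = [
    ("i scream for tea".split(), 0.40),
    ("ice cream for me".split(), 0.35),
    ("ice cream for three".split(), 0.25),
]

print(" ".join(map_decode(nbest)))  # i scream for tea
print(" ".join(mbr_decode(nbest)))  # ice cream for me
```

In this toy case the MAP choice has an expected error of 1.8 words, while the MBR choice has 1.45, because it shares most of its words with the remaining probability mass.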
|
| 13:50 | Posterior-based Out-of-Vocabulary Word Detection in Telephone Speech
Stefan Kombrink (Brno University of Technology, Czech Republic); Lukas Burget (Brno University of Technology, Czech Republic); Pavel Matejka (Brno University of Technology, Czech Republic); Martin Karafiat (Brno University of Technology, Czech Republic); Hynek Hermansky (Johns Hopkins University, Baltimore, USA)
In this paper we present an out-of-vocabulary word detector suitable for English conversational and read speech.
We use an approach based on phone posteriors created by a Large Vocabulary Continuous Speech Recognition system and an additional phone recognizer, which allows detection of OOV and misrecognized words. In addition, the recognized word output can be transcribed in more detail using several classes.
Reported results are on CallHome English and Wall Street Journal data.
|
| 14:10 | Automatic Transcription System for Meetings of the Japanese National Congress
Yuya Akita (Kyoto University); Masato Mimura (Kyoto University); Tatsuya Kawahara (Kyoto University)
This paper presents an automatic speech recognition (ASR) system for assisting meeting record creation of the National Congress of Japan. The system is designed to cope with spontaneous characteristics of meeting speech, as well as a variety of topics and speakers. For the acoustic model, minimum phone error (MPE) training is applied with several normalization techniques. For the language model, we have proposed statistical style transformation to generate spoken-style N-grams and their statistics. We also introduce statistical modeling of pronunciation variation in spontaneous speech. The ASR system was evaluated on real congressional meetings, and achieved word accuracy of 84%. It is also suggested that ASR-based transcripts with this accuracy level are usable for editing meeting records.
|
| 14:30 | Cross-language Bootstrapping for Unsupervised Acoustic Model Training: Rapid Development of a Polish Speech Recognition System
Jonas Lööf (RWTH Aachen University); Christian Gollan (RWTH Aachen University); Hermann Ney (RWTH Aachen University)
This paper describes the rapid development of a Polish language speech recognition system. The system development was performed without access to any transcribed acoustic training data. This was achieved through the combined use of cross-language bootstrapping and confidence-based unsupervised acoustic model training. A Spanish acoustic model was ported to Polish through the use of a manually constructed phoneme mapping. This initial model was refined through iterative recognition and retraining of the untranscribed audio data.
The system was trained and evaluated on recordings from the European Parliament, and included several state-of-the-art speech recognition techniques. Confidence-based speaker adaptive training using feature space transform adaptation, as well as vocal tract length normalization and maximum likelihood linear regression, was used to refine the acoustic model. Through the combination of the different techniques, good recognition performance was achieved.
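The data selection step in confidence-based unsupervised training can be sketched as follows. The abstract does not say how confidences are computed or thresholded, so the `Segment` type, the mean-confidence rule, and the 0.8 threshold are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One automatically recognized audio segment (hypothetical structure)."""
    audio_id: str
    words: List[str]
    confidences: List[float]  # one confidence score in [0, 1] per word


def select_for_training(segments: List[Segment],
                        threshold: float = 0.8) -> List[Segment]:
    """Keep segments whose mean word confidence reaches the threshold.

    The kept segments, with their automatic transcripts, would serve as
    training data for the next recognition and retraining iteration.
    """
    selected = []
    for seg in segments:
        if not seg.confidences:
            continue  # nothing recognized, nothing to train on
        mean_conf = sum(seg.confidences) / len(seg.confidences)
        if mean_conf >= threshold:
            selected.append(seg)
    return selected


# Toy recognizer output with made-up identifiers and scores.
recognized = [
    Segment("seg-001", ["dzień", "dobry"], [0.95, 0.90]),
    Segment("seg-002", ["panie", "przewodniczący"], [0.50, 0.90]),
    Segment("seg-003", [], []),
]

kept = select_for_training(recognized)
print([seg.audio_id for seg in kept])  # ['seg-001']
```

A stricter threshold trades training data volume for transcript quality, which is why such a loop is usually repeated as the model improves.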
|
| 14:50 | Porting an European Portuguese Broadcast News Recognition System to Brazilian Portuguese
Alberto Abad (INESC-ID Lisboa); Isabel Trancoso (IST / INESC-ID Lisboa, Portugal); Nelson Neto (Federal University of Pará, Belém, Brazil); M. Céu Viana (Center of Linguistics of the University of Lisbon, Portugal)
This paper reports on recent work in the context of the PoSTPort project, which aims at porting a Broadcast News recognition system originally developed for European Portuguese to other varieties. In this paper we focus on porting to Brazilian Portuguese. The impact of some of the main sources of variability has been assessed, and solutions have been proposed at the lexical, acoustic and syntactic levels. The ported Brazilian Portuguese Broadcast News system allowed a drastic performance improvement from 56.6% WER (obtained with the European Portuguese system) to 25.5%.
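For readers who want the relative figure, the two WERs reported above can be turned into a relative reduction with a short calculation. The function name is hypothetical; only the two percentages come from the abstract.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline word error rate removed by the new system."""
    return (baseline_wer - new_wer) / baseline_wer


# WERs in percent, as reported in the abstract.
reduction = relative_wer_reduction(56.6, 25.5)
print(f"{reduction:.1%}")  # 54.9%
```

The absolute improvement is 31.1 percentage points, which corresponds to removing a little over half of the errors made by the European Portuguese system.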
|
| 15:10 | Modeling Northern and Southern Varieties of Dutch for STT
Julien Despres (Vecsys Research); Petr Fousek (CNRS-LIMSI); Jean-Luc Gauvain (CNRS-LIMSI); Sandrine Gay (Vecsys Research); Yvan Josse (Vecsys Research); Lori Lamel (CNRS-LIMSI); Abdel Messaoudi (CNRS-LIMSI and Vecsys Research)
This paper describes how the Northern (NL) and Southern (VL) varieties of Dutch are modeled in the joint Limsi-Vecsys Research speech-to-text transcription systems for broadcast news (BN) and conversational telephone speech (CTS). Using the Spoken Dutch Corpus resources (CGN), systems were developed and evaluated in the 2008 N-Best benchmark. Modeling techniques that are used in our systems for other languages were found to be effective for the Dutch language; however, it was also found to be important to have acoustic and language models, and statistical pronunciation generation rules, adapted to each variety. This was in particular true for the MLP features, which were only effective when trained separately for Dutch and Flemish. The joint submissions obtained the lowest WERs in the benchmark by a significant margin.
|