10th Annual Conference of the International Speech Communication Association
Interspeech 2009 Brighton
|
Technical Programme
This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.
Tue-Ses1-O3: ASR: Lexical and Prosodic Models
| Time: | Tuesday 10:00 |
| Place: | East Wing 2 |
| Type: | Oral |
| Chair: | Eric Fosler-Lussier |
| 10:00 | Grapheme to phoneme conversion using an SMT system
Antoine Laurent (Laboratoire Informatique Université du Maine (LIUM)); Paul Deléglise (Laboratoire Informatique Université du Maine (LIUM)); Sylvain Meignier (Laboratoire Informatique Université du Maine (LIUM))
This paper presents an automatic grapheme to phoneme conversion system that uses statistical machine translation techniques provided by the Moses Toolkit. The generated word pronunciations are employed in the dictionary of an automatic speech recognition system and evaluated using the ESTER 2 French broadcast news corpus. Grapheme to phoneme conversion based on Moses is compared to two other methods: G2P, and a dictionary look-up method supplemented by a rule-based tool for phonetic transcriptions of words unavailable in the dictionary. Moses gives better results than G2P, and has performance comparable to the dictionary look-up strategy.
|
| 10:20 | Lexical and Phonetic Modeling for Arabic Automatic Speech Recognition
Long Nguyen (BBN Technologies); Tim Ng (BBN Technologies); Kham Nguyen (Northeastern University); Rabih Zbib (Massachusetts Institute of Technology); John Makhoul (BBN Technologies)
In this paper, we describe the use of either words or morphemes as lexical modeling units and the use of either graphemes or phonemes as phonetic modeling units for Arabic automatic speech recognition (ASR). We designed four Arabic ASR systems: two word-based systems and two morpheme-based systems. Experimental results using these four systems show that they have comparable state-of-the-art performance individually, but the more sophisticated morpheme-based system tends to be the best. However, they seem to complement each other quite well within the ROVER system combination framework to produce substantially improved combined results.
|
| 10:40 | Assessing Context and Learning for isiZulu Tone Recognition
Gina-Anne Levow (University of Chicago)
Prosody plays an integral role in spoken language understanding. In isiZulu, a Nguni family language with lexical tone, prosodic information determines word meaning. We assess the impact of models of tone and coarticulation for tone recognition. We demonstrate the importance of modeling prosodic context to improve tone recognition. We employ this less commonly studied language to assess models of tone developed for English and Mandarin, finding common threads in coarticulatory modeling. We also demonstrate the effectiveness of semi-supervised and unsupervised tone recognition techniques for this less-resourced language, with weakly supervised approaches rivaling supervised techniques.
|
| 11:00 | A Sequential Minimization Algorithm for Finite-State Pronunciation Lexicon Models
Dobrisek Simon (Faculty of Electrical Engineering, Ljubljana University, Slovenia); Vesnicer Bostjan (Faculty of Electrical Engineering, Ljubljana University, Slovenia); Mihelic France (Faculty of Electrical Engineering, Ljubljana University, Slovenia)
The paper first presents a large-vocabulary automatic speech-recognition system that is being developed for the Slovenian language. The concept of a single-pass token-passing algorithm for fast speech decoding that can be used with the designed multi-level system structure is discussed. From the algorithmic point of view, the main component of the system is a finite-state pronunciation lexicon model. This component has a crucial impact on the overall performance of the system, and we developed a sequential minimization algorithm that very efficiently reduces the size and algorithmic complexity of the lexicon model. The presented experiments show that the sequential minimization algorithm considerably outperforms (up to 60%) the conventional algorithms that were developed for the static global optimization of finite-state transducers.
|
| 11:20 | A General-Purpose 32 ms Prosodic Vector for Hidden Markov Modeling
Kornel Laskowski (Carnegie Mellon University); Mattias Heldner (KTH); Jens Edlund (KTH)
Prosody plays a central role in conversation, making it important for speech technologies to model. Unfortunately, the application of standard modeling techniques to the acoustics of prosody has been hindered by difficulties in modeling intonation. In this work, we explore the suitability of the recently introduced fundamental frequency variation (FFV) spectrum as a candidate general representation of tone. Experiments on 4 tasks demonstrate that FFV features are complementary to other acoustic measures of prosody and that hidden Markov models offer a suitable modeling paradigm. Proposed improvements yield a 35% relative decrease in error on unseen data and simultaneously reduce time complexity by a factor of five. The resulting representation is sufficiently mature for general deployment in a broad range of automatic speech processing applications.
|
| 11:40 | Vocabulary Expansion through Automatic Abbreviation Generation for Chinese Voice Search
Dong Yang (Department of Computer Science, Tokyo Institute of Technology); Yi-cheng Pan (Department of Computer Science, Tokyo Institute of Technology); Sadaoki Furui (Department of Computer Science, Tokyo Institute of Technology)
Long named entities are often abbreviated in spoken Chinese, and this usually leads to out-of-vocabulary (OOV) problems in speech recognition applications. In this paper, we propose a new method for automatically generating abbreviations for Chinese named entities, and we perform vocabulary expansion for voice search using the output of the abbreviation model. In our abbreviation modeling, we convert the abbreviation generation problem into a tagging problem and use the conditional random field (CRF) as the tagging tool. In the vocabulary expansion, considering the multiple abbreviation problem and the limited coverage of the top-1 abbreviation candidate, we add the top-10 candidates to the vocabulary. In our experiments, for the abbreviation modeling, we achieved a top-10 coverage of 88.3% with the proposed method; for the voice search, we improved the voice search accuracy from 16.9% to 79.2% by adding the top-10 abbreviation candidates to the vocabulary.
|