10th Annual Conference of the International Speech Communication Association
Interspeech 2009 Brighton
|
Technical Programme
This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.
Tue-Ses3-O4: Text Processing for Spoken Language Generation
| Time: | Tuesday 16:00 |
| Place: | East Wing 3 |
| Type: | Oral |
| Chair: | Bernd Möbius |
| 16:00 | Automatic Syllabification for Danish Text-to-Speech Systems
Jeppe Beck (Microsoft Language Development Center) Daniela Braga (Microsoft Language Development Center) João Nogueira (Faculty of Sciences of University of Lisbon) Miguel Dias (Microsoft Language Development Center) Luis Coelho (Instituto Politécnico do Porto)
This paper describes a rule-based automatic syllabifier for Danish built on the Maximal Onset Principle. The work was motivated by the prior success of rule-based methods applied to Portuguese and Catalan syllabification modules. The system was implemented and tested using a very small set of rules. It achieved word accuracy rates of 96.9% and 98.7%, contrary to our initial expectations, since Danish has a complex syllabic structure and was therefore expected to be difficult to handle with rules. Comparison with a data-driven syllabification system using artificial neural networks showed a higher accuracy rate for the rule-based system.
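As an illustration of the Maximal Onset Principle named in the abstract, the following is a minimal sketch and not the authors' rule set. It assumes a purely orthographic input, treats each run of vowel letters as a single nucleus, and uses a hypothetical, deliberately tiny onset inventory (`LEGAL_ONSETS`). Between two nuclei, the longest consonant suffix that is a legal onset goes to the following syllable, and whatever remains becomes the coda of the preceding one.

```python
VOWELS = set("aeiouyæøå")

# Hypothetical toy inventory of legal onsets (assumption, not from the paper).
# The empty string lets a syllable start directly with its vowel.
LEGAL_ONSETS = {
    "", "b", "d", "f", "g", "k", "l", "m", "n", "p", "r", "s", "t", "v",
    "st", "sk", "sp", "tr", "kr", "pr", "br", "dr", "fr", "gr",
    "kl", "pl", "bl", "fl", "gl", "str", "skr", "spr",
}


def syllabify(word):
    """Split a word into syllables by giving each onset as many consonants as allowed."""
    word = word.lower()

    # Locate nuclei: each maximal run of vowel letters is one nucleus (simplification).
    nuclei = []
    i = 0
    while i < len(word):
        if word[i] in VOWELS:
            j = i
            while j < len(word) and word[j] in VOWELS:
                j += 1
            nuclei.append((i, j))
            i = j
        else:
            i += 1
    if not nuclei:
        return [word]

    # For each consonant cluster between two nuclei, find the longest legal onset.
    boundaries = []
    for (_, prev_end), (next_start, _) in zip(nuclei, nuclei[1:]):
        cluster = word[prev_end:next_start]
        for k in range(len(cluster) + 1):
            # k = 0 tries the whole cluster first; k = len(cluster) is the empty onset.
            if cluster[k:] in LEGAL_ONSETS:
                boundaries.append(prev_end + k)
                break

    # Cut the word at the computed boundaries.
    pieces = []
    start = 0
    for b in boundaries:
        pieces.append(word[start:b])
        start = b
    pieces.append(word[start:])
    return pieces
```

With this toy inventory, `syllabify("kanten")` splits the cluster `nt` because `nt` is not a listed onset, while `syllabify("astra")` keeps `str` together as the onset of the second syllable. A real system would work from a language-specific inventory and, most likely, additional rules.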
|
| 16:20 | Hybrid Approach to Grapheme to Phoneme Conversion for Korean
Jinsik Lee (Pohang University of Science and Technology) Byeongchang Kim (Catholic University of Daegu) Gary Geunbae Lee (Pohang University of Science and Technology)
In the grapheme-to-phoneme conversion problem for Korean, two main approaches have been discussed: knowledge-based and data-driven methods. However, both camps have limitations: knowledge-based hand-written rules cannot handle some pronunciation changes, owing to the limited capability of linguistic analyzers and to the many exceptions, while data-driven methods always suffer from data sparseness. To overcome the shortcomings of both camps, this paper presents a novel combination method which effectively integrates two components: (1) a rule-based conversion system based on linguistically motivated hand-written rules and (2) a statistical conversion system using a Maximum Entropy model. The experimental results clearly show the effectiveness of our proposed method.
|
| 16:40 | Robust LTS rules with the Combilex speech technology lexicon
Korin Richmond (CSTR, Informatics, Edinburgh University) Robert Clark (CSTR, Informatics, Edinburgh University) Sue Fitt (CSTR, Informatics, Edinburgh University)
Combilex is a high-quality pronunciation lexicon aimed at speech technology applications that has recently been released by CSTR. Combilex benefits from several advanced features. This paper evaluates one of these: the explicit alignment of phones to graphemes in a word. This alignment can help to rapidly develop robust and accurate letter-to-sound (LTS) rules, without needing to rely on automatic alignment methods. To evaluate this, we used Festival's LTS module, comparing its standard automatic alignment with Combilex's explicit alignment. Our results show that using Combilex's alignment improves LTS accuracy: 86.50% words correct as opposed to 84.49%, with our most general form of lexicon. In addition, building LTS models is greatly accelerated, as the need to list allowed alignments is removed. Finally, loose comparison with other studies indicates Combilex is a superior quality lexicon in terms of consistency and size.
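The "words correct" figures above are whole-word scores. The following minimal sketch shows one common way such a metric is computed; it assumes that a word counts as correct only if its predicted phone sequence matches the reference exactly, which the abstract does not spell out. The function name `word_accuracy` and the toy data are hypothetical.

```python
def word_accuracy(predicted, reference):
    """Return the percentage of words whose predicted phones match the reference exactly."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must contain the same number of words")
    if not reference:
        raise ValueError("cannot compute accuracy on an empty test set")
    # A single wrong, missing, or extra phone makes the whole word count as wrong.
    correct = sum(1 for p, r in zip(predicted, reference) if p == r)
    return 100.0 * correct / len(reference)


# Toy example (invented data): one of the two words is fully correct.
toy_predicted = [["k", "ae", "t"], ["d", "o", "g"]]
toy_reference = [["k", "ae", "t"], ["d", "aa", "g"]]
toy_score = word_accuracy(toy_predicted, toy_reference)
```

Because one phone error invalidates the whole word, word-level scores are typically lower than phone-level scores on the same test set.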
|
| 17:00 | Letter-to-phoneme conversion by inference of rewriting rules
Vincent Claveau (IRISA - CNRS)
Phonetization is a crucial step for oral document processing. In this paper, a new letter-to-phoneme conversion approach is proposed; it is automatic, simple, portable and efficient. It relies on a machine learning technique initially developed for transliteration and translation; the system infers rewriting rules from examples of words with their phonetic representations. This approach is evaluated in the framework of the Pronalsyl Pascal challenge, which includes several datasets in different languages. The obtained results equal or outperform those of the best known systems. Moreover, thanks to the simplicity of our technique, the inference time of our approach is much lower than that of the best performing state-of-the-art systems.
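To make the idea of rewriting rules concrete, here is a minimal sketch of how a set of letter-to-phoneme rules could be applied once it is available. It does not reproduce the paper's inference step. The rule format (context-free substring to phoneme list), the longest-match-first strategy, and the toy table `TOY_RULES` are all assumptions made for illustration.

```python
# Hypothetical toy rules: letter substring -> list of phoneme symbols (invented).
TOY_RULES = {
    "sh": ["S"],
    "ee": ["i"],
    "s": ["s"],
    "h": ["h"],
    "e": ["E"],
    "p": ["p"],
}


def phonetize(word, rules):
    """Rewrite a word left to right, always applying the longest matching rule."""
    max_len = max(len(key) for key in rules)
    phones = []
    i = 0
    while i < len(word):
        # Try the longest candidate substring first, then shorter ones.
        for n in range(min(max_len, len(word) - i), 0, -1):
            chunk = word[i:i + n]
            if chunk in rules:
                phones.extend(rules[chunk])
                i += n
                break
        else:
            # No rule covers the current letter: fail loudly rather than guess.
            raise KeyError(f"no rule covers {word[i]!r} at position {i}")
    return phones
```

With the toy table, `phonetize("sheep", TOY_RULES)` applies the two-letter rules for `sh` and `ee` before falling back to the single-letter rule for `p`. Rules that also condition on neighbouring letters would need a richer rule format than this sketch uses.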
|
| 17:20 | Online Discriminative Training for Grapheme-to-Phoneme Conversion
Sittichai Jiampojamarn (Department of Computing Science, University of Alberta) Grzegorz Kondrak (Department of Computing Science, University of Alberta)
We present an online discriminative training approach to grapheme-to-phoneme (g2p) conversion. We employ a many-to-many alignment between graphemes and phonemes, which overcomes the limitations of widely used one-to-one alignments. The discriminative structure-prediction model incorporates input segmentation, phoneme prediction, and sequence modeling in a unified dynamic programming framework. The learning model is able to capture both local context features in the input and non-local dependency features in the output sequence. Experimental results show that our system surpasses the state of the art on several data sets.
|
| 17:40 | Using Same-Language Machine Translation to Create Alternative Target Sequences for Text-To-Speech Synthesis
Peter Cahill (University College Dublin) Jinhua Du (Dublin City University) Andy Way (Dublin City University) Julie Carson-Berndsen (University College Dublin)
Modern speech synthesis systems attempt to produce speech utterances from an open domain of words. In some situations, the synthesiser will not have the appropriate units to pronounce some words or phrases accurately, but it must still attempt to pronounce them. This paper presents a hybrid machine translation and unit selection speech synthesis system. The machine translation system was trained with English as both the source and the target language. Rather than only saying the input text, as would happen in conventional synthesis systems, the synthesiser may say an alternative utterance with the same meaning. This method allows the synthesiser to overcome the problem of insufficient units at runtime.