10th Annual Conference of the International Speech Communication Association
Interspeech 2009 Brighton
Technical Programme
This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.
Mon-Ses3-P2: Prosody, Text Analysis, and Multilingual Models
| Time: | Monday 16:00 |
| Place: | Hewison Hall |
| Type: | Poster |
| Chair: | Andrew Breen |
| #1 | Polyglot Speech Prosody Control
Harald Romsdorfer (Speech Processing Group, ETH Zurich, Switzerland)
Within a polyglot text-to-speech synthesis system, the generation of an
adequate prosody for mixed-lingual texts, sentences, or even words,
requires a polyglot prosody model that is able to seamlessly switch
between languages and that applies the same voice for all
languages. This paper presents the first polyglot prosody model that
fulfills these requirements and that is constructed from independent
monolingual prosody models. A perceptual evaluation showed that the
synthetic polyglot prosody of about 82% of German and French
mixed-lingual test sentences cannot be distinguished from natural
polyglot prosody.
|
| #2 | Weighted Neural Network Ensemble Models for Speech Prosody Control
Harald Romsdorfer (Speech Processing Group, ETH Zurich, Switzerland)
In text-to-speech synthesis systems, the quality of the predicted
prosody contours influences quality and naturalness of synthetic
speech. This paper presents a new statistical model for prosody
control that combines an ensemble learning technique using neural
networks as base learners with feature relevance determination. This
weighted neural network ensemble model was applied to both phone
duration modeling and fundamental frequency modeling. A comparison
with state-of-the-art prosody models based on classification and
regression trees (CART), multivariate adaptive regression splines
(MARS), or artificial neural networks (ANN) shows a 12% improvement
compared to the best duration model and a 24% improvement
compared to the best F0 model. The neural network ensemble model
also outperforms another, recently presented ensemble model based
on gradient tree boosting.
|
| #3 | Cross-language F0 Modeling for Under-resourced Tonal Languages: A Case Study on Thai-Mandarin
Vataya Boonpiam (National Electronics and Computer Technology Center) Anocha Rugchatjaroen (National Electronics and Computer Technology Center) Chai Wutiwiwatchai (National Electronics and Computer Technology Center)
This paper proposes a novel method for F0 modeling in under-resourced tonal languages. Conventional statistical models require large amounts of training data, which are lacking for many languages. In tonal languages, different syllabic tones are represented by different F0 shapes, some of which are similar across languages. With cross-language F0 contour mapping, we can augment the F0 model of an under-resourced language with corpora from a resource-rich language. A case study on Thai HMM-based F0 modeling with a Mandarin corpus is explored. Compared to baseline systems without cross-language resources, a relative RMSE reduction of over 7% and a significant improvement in MOS are obtained.
|
| #4 | Prosodic issues in synthesising Thadou, a Tibeto-Burman tone language
Dafydd Gibbon (Universität Bielefeld, Bielefeld, Germany) Pramod K. S. Pandey (Jawaharlal Nehru University, New Delhi, India) D. Mary Kim Haokip (Assam University, Silchar, India) Jolanta Bachan (Adam Mickiewicz University, Poznań, Poland)
The objective of the present analysis is to present linguistic constraints on the phonetic realisation of lexical tone which are relevant for the choice of speech synthesis development strategy for a specific type of tone language, in this case Thadou (Tibeto-Burman), which has lexical and morphosyntactic tone as well as phonetic tone displacement. The last two constraint types differ from those in more well-known tone languages such as Mandarin, and present problems for mainstream corpus-based speech synthesis techniques. Linguistic and phonetic models and a ‘microvoice’ for rule-based tone generation are developed.
|
| #5 | Advanced Unsupervised Joint Prosody Labeling and Modeling for Mandarin Speech and Its Application to Prosody Generation for TTS
Chen-Yu Chiang (Dept. Communication Engineering, National Chiao Tung University, Taiwan) Sin-Horng Chen (Dept. Communication Engineering, National Chiao Tung University, Taiwan) Yih-Ru Wang (Dept. Communication Engineering, National Chiao Tung University, Taiwan)
Motivated by the success of the unsupervised joint prosody labeling and modeling (UJPLM) method in modeling syllable pitch contours for Mandarin speech in our previous study, this paper proposes an advanced UJPLM (A-UJPLM) method that builds on UJPLM to jointly label prosodic tags and model syllable pitch contour, duration and energy level. Experimental results on the Sinica Treebank corpus showed that most of the labeled prosodic tags were linguistically meaningful and that the estimated model parameters were interpretable and generally agreed with previous studies. By virtue of the functions given by the model parameters, an application of A-UJPLM to prosody generation for Mandarin TTS is proposed. Experimental results showed that the proposed method performed well: most predicted prosodic features matched their original counterparts well. This also reconfirmed the effectiveness of the A-UJPLM method.
|
| #6 | Optimization of T-Tilt F0 Modeling
Ausdang Thangthai (National Electronics and Computer Technology Center (NECTEC)) Anocha Rugchatjaroen (National Electronics and Computer Technology Center (NECTEC)) Nattanun Thatphithakkul (National Electronics and Computer Technology Center (NECTEC)) Ananlada Chotimongkol (National Electronics and Computer Technology Center (NECTEC)) Chai Wutiwiwatchai (National Electronics and Computer Technology Center (NECTEC))
This paper investigates the improvement of T-Tilt modeling, a modified Tilt model specifically designed for F0 modeling in tonal languages. The model has proved to work well for F0 analysis but performs poorly in text-to-F0 prediction. To optimize the model, the T-Tilt event is restricted to span the whole syllable unit, which significantly reduces the number of parameters. The F0 interpolation and smoothing processes often performed in preprocessing are avoided to prevent modeling errors. F0 shape pre-classification and parameter clustering are introduced for better modeling. Evaluation results using the optimized model show significant improvement in both F0 analysis and prediction.
|
| #7 | A Multi-Level Context-Dependent Prosodic Model Applied to Duration Modeling
Nicolas OBIN (IRCAM) Xavier RODET (IRCAM) Anne LACHERET-DUJOUR (Modyco labs)
In this article we present a multi-level prosodic model based on the estimation of prosodic parameters on a set of well-defined linguistic units. Different linguistic units are used to represent different scales of prosodic variation (local and global forms) and thus to estimate, independently on each level, the linguistic factors that can explain the variations of prosodic parameters. This model is applied to the modeling of syllable-based durational parameters on two read speech corpora: laboratory and acted speech. Compared to a syllable-based baseline model, the proposed approach improves performance in terms of the temporal organization of the predicted durations (correlation score) and reduces the model's complexity, while showing comparable performance in terms of relative prediction error.
|
| #8 | Sentiment classification in English from sentence-level annotations of emotions regarding models of affect
Alexandre Trilla (GTM - Grup de Recerca en Tecnologies Mèdia LA SALLE - UNIVERSITAT RAMON LLULL) Francesc Alías (GTM - Grup de Recerca en Tecnologies Mèdia LA SALLE - UNIVERSITAT RAMON LLULL)
This paper presents a text classifier for automatically tagging the sentiment of input text according to the emotion that is being conveyed. The system has a pipelined framework composed of Natural Language Processing modules for feature extraction and a hard binary classifier for deciding between positive and negative categories. To this end, the SemEval 2007 dataset, composed of emotionally annotated sentences, is used for training after being mapped into a model of affect. The resulting scheme stands as a first step towards a complete emotion classifier for a future automatic expressive text-to-speech synthesizer.
|
| #9 | Identification of Contrast and Its Emphatic Realization in HMM Based Speech Synthesis
Leonardo Badino (University of Edinburgh, Edinburgh, U.K.) Sebastian Andersson (University of Edinburgh, Edinburgh, U.K.) Junichi Yamagishi (University of Edinburgh, Edinburgh, U.K.) Robert Clark (University of Edinburgh, Edinburgh, U.K.)
The work presented in this paper proposes to identify contrast in the form of contrastive word pairs and to signal it prosodically with emphatic accents in a Text-to-Speech (TTS) application using a Hidden-Markov-Model (HMM) based speech synthesis system. We first describe a novel method to automatically detect contrastive word pairs using textual features only and report its performance on a corpus of spontaneous conversations in English. Subsequently we describe the set of features selected to train an HMM based speech synthesis system in an attempt to properly control prosodic prominence (including emphasis). Results from a large scale perceptual test show that in the majority of cases listeners judge emphatic contrastive word pairs to be as acceptable as their non-emphatic counterparts, while emphasis on non-contrastive pairs is almost never acceptable.
|
| #10 | How to Improve TTS Systems for Emotional Expressivity
Antonio Rui Ferreira Rebordao (The University of Tokyo) Mostafa Al Masum Shaikh (The University of Tokyo) Keikichi Hirose (The University of Tokyo) Nobuaki Minematsu (The University of Tokyo)
Several experiments have been carried out that revealed weaknesses of current Text-To-Speech (TTS) systems in their emotional expressivity. Although some TTS systems allow XML-based representations of prosodic and/or phonetic variables, few publications have considered, as a pre-processing stage, the use of intelligent text processing to detect affective information that can be used to tailor the parameters needed for emotional expressivity. This paper describes a technique for automatic prosodic parameterization based on affective clues. The technique recognizes the affective information conveyed in a text and, according to its emotional connotation, assigns appropriate pitch accents and other prosodic parameters by XML-tagging. This pre-processing helps the TTS system to generate synthesized speech that contains emotional clues. The experimental results are encouraging and suggest the possibility of suitable emotional expressivity in speech synthesis.
|
| #11 | State mapping based method for cross-lingual speaker adaptation in HMM-based speech synthesis
Yi-Jian Wu (Microsoft) Yoshihiko Nankaku (Nagoya Institute of Technology) Keiichi Tokuda (Nagoya Institute of Technology)
A phone mapping based method has previously been introduced for cross-lingual speaker adaptation in HMM-based speech synthesis. In this paper, we propose a state mapping based method for cross-lingual speaker adaptation. In this method, we first establish the state mapping between two voice models in the source and target languages using Kullback-Leibler divergence (KLD). Based on the established mapping information, we introduce two approaches to cross-lingual speaker adaptation: data mapping and transform mapping. In the experiments, the state mapping based method outperformed the phone mapping based method. In addition, the data mapping approach achieved better speaker similarity, and the transform mapping approach achieved better speech quality after adaptation.
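As a rough illustration of KLD-based state mapping, the sketch below pairs each state of one model with the closest state of another. It is not the authors' implementation: it assumes each state is a single diagonal-covariance Gaussian, uses a symmetric KLD, and picks the nearest state by minimum divergence. The names `kl_divergence`, `symmetric_kld`, `map_states` and the toy state values are all hypothetical.

```python
import math
from typing import Dict, List, Tuple

# Assumption: one diagonal-covariance Gaussian per state, stored as (means, variances).
Gaussian = Tuple[List[float], List[float]]


def kl_divergence(p: Gaussian, q: Gaussian) -> float:
    """KL(p || q) for two diagonal-covariance Gaussians."""
    mean_p, var_p = p
    mean_q, var_q = q
    total = 0.0
    for mp, vp, mq, vq in zip(mean_p, var_p, mean_q, var_q):
        total += math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
    return 0.5 * total


def symmetric_kld(p: Gaussian, q: Gaussian) -> float:
    """Symmetric divergence: KL(p || q) + KL(q || p). An assumed choice."""
    return kl_divergence(p, q) + kl_divergence(q, p)


def map_states(states_a: Dict[str, Gaussian],
               states_b: Dict[str, Gaussian]) -> Dict[str, str]:
    """Map every state in states_a to the state in states_b with minimum KLD."""
    mapping = {}
    for name_a, gauss_a in states_a.items():
        mapping[name_a] = min(
            states_b,
            key=lambda name_b: symmetric_kld(gauss_a, states_b[name_b]),
        )
    return mapping


# Toy one-dimensional states for two voice models (illustrative values only).
states_a = {"a1": ([0.0], [1.0]), "a2": ([5.0], [1.0])}
states_b = {"b1": ([4.5], [1.2]), "b2": ([0.2], [0.9])}
state_mapping = map_states(states_a, states_b)
```

With these toy values, `a1` is paired with `b2` and `a2` with `b1`, since those pairs have the closest means and variances. How such a mapping would then be used to carry adaptation data or transforms between languages is described in the paper itself and is not sketched here.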
|
| #12 | Real Voice and TTS Accent Effects on Intelligibility and Comprehension for Indian Speakers of English as a Second Language
Frederick V. Weber (Earth Institute, Columbia University) Kalika Bali (Microsoft Research, India)
We investigate the effect of accent on comprehension of English for speakers of English as a second language in southern India. Subjects were exposed to real and TTS voices with US and several Indian accents, and were tested for intelligibility and comprehension. Performance trends indicate a measurable advantage for familiar accents, and are broken down by various demographic factors.
|
| #13 | Improving Consistence of Phonetic Transcription for Text-to-Speech
Pablo Daniel Agüero (FI-UNMDP) Antonio Bonafonte (Universitat Politècnica de Catalunya, Barcelona, Spain) Juan Carlos Tulli (FI-UNMDP)
Grapheme-to-phoneme conversion is an important step in speech segmentation and synthesis. Many approaches have been proposed in the literature to produce appropriate transcriptions: CART, FST, HMM, etc. In this paper we propose an automatic algorithm that uses transformation-based error-driven learning to match the phonetic transcription to the speaker's dialect and style. Different transcriptions based on word, part-of-speech tags, weak forms and phonotactic rules are validated. The experimental results show an improvement in the transcription according to an objective measure. The articulation MOS score is also improved, as most of the changes in phonetic transcription affect coarticulation effects.
|