10th Annual Conference of the International Speech Communication Association

ISCA Interspeech 2009 Brighton

Technical Programme

This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.

Mon-Ses3-P2: Prosody, Text Analysis, and Multilingual Models

Time: Monday 16:00
Place: Hewison Hall
Type: Poster
Chair: Andrew Breen

#1 Polyglot Speech Prosody Control

Harald Romsdorfer (Speech Processing Group, ETH Zurich, Switzerland)

Within a polyglot text-to-speech synthesis system, generating adequate prosody for mixed-lingual texts, sentences, or even words requires a polyglot prosody model that can switch seamlessly between languages and that applies the same voice to all languages. This paper presents the first polyglot prosody model that fulfills these requirements and that is constructed from independent monolingual prosody models. A perceptual evaluation showed that the synthetic polyglot prosody of about 82% of German and French mixed-lingual test sentences cannot be distinguished from natural polyglot prosody.

#2 Weighted Neural Network Ensemble Models for Speech Prosody Control

Harald Romsdorfer (Speech Processing Group, ETH Zurich, Switzerland)

In text-to-speech synthesis systems, the quality of the predicted prosody contours influences the quality and naturalness of synthetic speech. This paper presents a new statistical model for prosody control that combines an ensemble learning technique using neural networks as base learners with feature relevance determination. This weighted neural network ensemble model was applied to both phone duration modeling and fundamental frequency (F0) modeling. A comparison with state-of-the-art prosody models based on classification and regression trees (CART), multivariate adaptive regression splines (MARS), or artificial neural networks (ANN) shows a 12% improvement compared to the best duration model and a 24% improvement compared to the best F0 model. The neural network ensemble model also outperforms another recently presented ensemble model based on gradient tree boosting.

#3 Cross-language F0 Modeling for Under-resourced Tonal Languages: A Case Study on Thai-Mandarin

Vataya Boonpiam (National Electronics and Computer Technology Center)
Anocha Rugchatjaroen (National Electronics and Computer Technology Center)
Chai Wutiwiwatchai (National Electronics and Computer Technology Center)

This paper proposes a novel method for F0 modeling in under-resourced tonal languages. Conventional statistical models require large amounts of training data, which are lacking for many languages. In tonal languages, different syllabic tones are represented by different F0 shapes, some of which are similar across languages. With cross-language F0 contour mapping, we can augment the F0 model of an under-resourced language with corpora from a resource-rich language. A case study on Thai HMM-based F0 modeling with a Mandarin corpus is explored. Compared to baseline systems without cross-language resources, a relative RMSE reduction of over 7% and a significant improvement in MOS are obtained.

#4 Prosodic issues in synthesising Thadou, a Tibeto-Burman tone language

Dafydd Gibbon (Universität Bielefeld, Bielefeld, Germany)
Pramod K. S. Pandey (Jawaharlal Nehru University, New Delhi, India)
D. Mary Kim Haokip (Assam University, Silchar, India)
Jolanta Bachan (Adam Mickiewicz University, Poznań, Poland)

The objective of the present analysis is to present linguistic constraints on the phonetic realisation of lexical tone which are relevant for the choice of speech synthesis development strategy for a specific type of tone language, in this case Thadou (Tibeto-Burman), which has lexical and morphosyntactic tone as well as phonetic tone displacement. The last two constraint types differ from those in better-known tone languages such as Mandarin, and present problems for mainstream corpus-based speech synthesis techniques. Linguistic and phonetic models and a ‘microvoice’ for rule-based tone generation are developed.

#5 Advanced Unsupervised Joint Prosody Labeling and Modeling for Mandarin Speech and Its Application to Prosody Generation for TTS

Chen-Yu Chiang (Dept. Communication Engineering, National Chiao Tung University, Taiwan)
Sin-Horng Chen (Dept. Communication Engineering, National Chiao Tung University, Taiwan)
Yih-Ru Wang (Dept. Communication Engineering, National Chiao Tung University, Taiwan)

Our previous study showed that the unsupervised joint prosody labeling and modeling (UJPLM) method for Mandarin speech was successful in modeling syllable pitch contour. Motivated by this, this paper proposes the advanced UJPLM (A-UJPLM) method, which builds on UJPLM to jointly label prosodic tags and model syllable pitch contour, duration and energy level. Experimental results on the Sinica Treebank corpus showed that most of the labeled prosodic tags were linguistically meaningful and that the estimated model parameters were interpretable and generally agreed with previous studies. By virtue of the functions given by the model parameters, an application of A-UJPLM to prosody generation for Mandarin TTS is proposed. Experimental results showed that the proposed method performed well: most predicted prosodic features matched their original counterparts well. This also reconfirmed the effectiveness of the A-UJPLM method.

#6 Optimization of T-Tilt F0 Modeling

Ausdang Thangthai (National Electronics and Computer Technology Center (NECTEC))
Anocha Rugchatjaroen (National Electronics and Computer Technology Center (NECTEC))
Nattanun Thatphithakkul (National Electronics and Computer Technology Center (NECTEC))
Ananlada Chotimongkol (National Electronics and Computer Technology Center (NECTEC))
Chai Wutiwiwatchai (National Electronics and Computer Technology Center (NECTEC))

This paper investigates improvements to T-Tilt modeling, a modified Tilt model specifically designed for F0 modeling in tonal languages. The model has proved to work well for F0 analysis but suffers in text-to-F0 prediction. To optimize the model, the T-Tilt event is restricted to span the whole syllable unit, which significantly reduces the number of parameters. The F0 interpolation and smoothing processes often performed in preprocessing are avoided to prevent modeling errors. F0 shape pre-classification and parameter clustering are introduced for better modeling. Evaluation results using the optimized model show significant improvement in both F0 analysis and prediction.

#7 A Multi-Level Context-Dependent Prosodic Model Applied to Duration Modeling

Nicolas OBIN (IRCAM)
Xavier RODET (IRCAM)
Anne LACHERET-DUJOUR (Modyco labs)

In this article we present a multi-level prosodic model based on the estimation of prosodic parameters on a set of well-defined linguistic units. Different linguistic units are used to represent different scales of prosodic variation (local and global forms) and thus to estimate, independently on each level, the linguistic factors that can explain the variations of prosodic parameters. This model is applied to the modeling of syllable-based durational parameters on two read-speech corpora: laboratory and acted speech. Compared to a syllable-based baseline model, the proposed approach improves performance in terms of the temporal organization of the predicted durations (correlation score) and reduces the model's complexity, while showing comparable performance in terms of relative prediction error.

#8 Sentiment classification in English from sentence-level annotations of emotions regarding models of affect

Alexandre Trilla (GTM - Grup de Recerca en Tecnologies Mèdia LA SALLE - UNIVERSITAT RAMON LLULL)
Francesc Alías (GTM - Grup de Recerca en Tecnologies Mèdia LA SALLE - UNIVERSITAT RAMON LLULL)

This paper presents a text classifier for automatically tagging the sentiment of input text according to the emotion that is being conveyed. The system has a pipelined framework composed of Natural Language Processing modules for feature extraction and a hard binary classifier that decides between positive and negative categories. For training, the SemEval 2007 dataset of emotionally annotated sentences is used after being mapped into a model of affect. The resulting scheme stands as a first step towards a complete emotion classifier for a future automatic expressive text-to-speech synthesizer.

#9 Identification of Contrast and Its Emphatic Realization in HMM Based Speech Synthesis

Leonardo Badino (University of Edinburgh, Edinburgh, U.K.)
Sebastian Andersson (University of Edinburgh, Edinburgh, U.K.)
Junichi Yamagishi (University of Edinburgh, Edinburgh, U.K.)
Robert Clark (University of Edinburgh, Edinburgh, U.K.)

The work presented in this paper proposes to identify contrast in the form of contrastive word pairs and to signal it prosodically with emphatic accents in a Text-to-Speech (TTS) application using a Hidden Markov Model (HMM) based speech synthesis system. We first describe a novel method to automatically detect contrastive word pairs using textual features only and report its performance on a corpus of spontaneous conversations in English. Subsequently we describe the set of features selected to train an HMM-based speech synthesis system in an attempt to properly control prosodic prominence (including emphasis). Results from a large-scale perceptual test show that in the majority of cases listeners judge emphatic contrastive word pairs to be as acceptable as their non-emphatic counterparts, while emphasis on non-contrastive pairs is almost never acceptable.

#10 How to Improve TTS Systems for Emotional Expressivity

Antonio Rui Ferreira Rebordao (The University of Tokyo)
Mostafa Al Masum Shaikh (The University of Tokyo)
Keikichi Hirose (The University of Tokyo)
Nobuaki Minematsu (The University of Tokyo)

Several experiments have been carried out that revealed weaknesses in the emotional expressivity of current Text-To-Speech (TTS) systems. Although some TTS systems allow XML-based representations of prosodic and/or phonetic variables, few publications have considered using intelligent text processing as a pre-processing stage to detect affective information that can be used to tailor the parameters needed for emotional expressivity. This paper describes a technique for automatic prosodic parameterization based on affective clues. The technique recognizes the affective information conveyed in a text and, according to its emotional connotation, assigns appropriate pitch accents and other prosodic parameters by XML-tagging. This pre-processing helps the TTS system generate synthesized speech that contains emotional clues. The experimental results are encouraging and suggest the possibility of suitable emotional expressivity in speech synthesis.

#11 State mapping based method for cross-lingual speaker adaptation in HMM-based speech synthesis

Yi-Jian Wu (Microsoft)
Yoshihiko Nankaku (Nagoya Institute of Technology)
Keiichi Tokuda (Nagoya Institute of Technology)

A phone mapping based method has previously been introduced for cross-lingual speaker adaptation in HMM-based speech synthesis. In this paper, we go on to propose a state mapping based method for cross-lingual speaker adaptation. In this method, we first establish the state mapping between two voice models in the source and target languages using Kullback-Leibler divergence (KLD). Based on the established mapping information, we introduce two approaches to cross-lingual speaker adaptation: the data mapping and transform mapping approaches. In the experiments, the state mapping based method outperformed the phone mapping based method. In addition, the data mapping approach achieved better speaker similarity, and the transform mapping approach achieved better speech quality after adaptation.
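
As a rough illustration of KLD-based state mapping, the sketch below pairs each state of one voice model with the closest state of another. It is a minimal sketch under several assumptions that are not stated in the abstract: each state is modelled by a single diagonal-covariance Gaussian, the divergence is symmetrised, and each state is simply mapped to its nearest neighbour. The function names, state names and toy values are hypothetical.

```python
import math

def kld_diag_gauss(mean_p, var_p, mean_q, var_q):
    """KL divergence D(p || q) between two diagonal-covariance Gaussians."""
    total = 0.0
    for mp, vp, mq, vq in zip(mean_p, var_p, mean_q, var_q):
        total += 0.5 * (math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0)
    return total

def symmetric_kld(p, q):
    """Symmetrised divergence; p and q are (mean, variance) pairs (assumption)."""
    return kld_diag_gauss(*p, *q) + kld_diag_gauss(*q, *p)

def map_states(states_a, states_b):
    """Map every state in states_a to the state in states_b with minimum KLD."""
    return {
        name_a: min(states_b, key=lambda name_b: symmetric_kld(gauss_a, states_b[name_b]))
        for name_a, gauss_a in states_a.items()
    }

# Toy two-dimensional states: name -> (mean vector, variance vector).
states_a = {"a1": ([0.0, 0.0], [1.0, 1.0]), "a2": ([5.0, 5.0], [1.0, 1.0])}
states_b = {"b1": ([4.8, 5.1], [1.2, 0.9]), "b2": ([0.1, -0.2], [1.0, 1.1])}

mapping = map_states(states_a, states_b)
print(mapping)
```

In this toy setup, `a1` is paired with `b2` and `a2` with `b1`, since those are the states whose distributions lie closest together. How such a mapping would then be used to move adaptation data or transforms between languages is specific to the paper and is not sketched here.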

#12 Real Voice and TTS Accent Effects on Intelligibility and Comprehension for Indian Speakers of English as a Second Language

Frederick V. Weber (Earth Institute, Columbia University)
Kalika Bali (Microsoft Research, India)

We investigate the effect of accent on comprehension of English for speakers of English as a second language in southern India. Subjects were exposed to real and TTS voices with US and several Indian accents, and were tested for intelligibility and comprehension. Performance trends indicate a measurable advantage for familiar accents, and are broken down by various demographic factors.

#13 Improving Consistence of Phonetic Transcription for Text-to-Speech

Pablo Daniel Agüero (FI-UNMDP)
Antonio Bonafonte (Universitat Politècnica de Catalunya, Barcelona, Spain)
Juan Carlos Tulli (FI-UNMDP)

Grapheme-to-phoneme conversion is an important step in speech segmentation and synthesis. Many approaches have been proposed in the literature to produce appropriate transcriptions: CART, FST, HMM, etc. In this paper we propose an automatic algorithm that uses transformation-based error-driven learning to match the phonetic transcription to the speaker's dialect and style. Different transcriptions based on words, part-of-speech tags, weak forms and phonotactic rules are validated. The experimental results show an improvement in the transcription according to an objective measure. The articulation MOS is also improved, as most of the changes in phonetic transcription affect coarticulation.