Brighton Pavilion

10th Annual Conference of the International Speech Communication Association

ISCA Interspeech 2009 Brighton

Technical Programme

This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.

Tue-Ses1-O4:
Unit-Selection Synthesis

Time: Tuesday 10:00
Place: East Wing 3
Type: Oral
Chair: Alan Black

10:00 Perceptual Cost Function for Cross-fading Based Concatenation

Qi Miao (Center for Spoken Language Understanding (CSLU), Division of Biomedical Computer Science (BMCS), Oregon Health & Science University (OHSU), Oregon, USA 97006)
Alexander Kain (Center for Spoken Language Understanding (CSLU), Division of Biomedical Computer Science (BMCS), Oregon Health & Science University (OHSU), Oregon, USA 97006)
Jan P. H. van Santen (Center for Spoken Language Understanding (CSLU), Division of Biomedical Computer Science (BMCS), Oregon Health & Science University (OHSU), Oregon, USA 97006)

In earlier research, we applied a linear weighted cross-fading function to ensure smooth concatenation. However, this can cause unnaturally shaped spectral trajectories. We propose context-sensitive cross-fading. To train this system, a perceptually validated cost function is needed, which is the focus of this paper. A corpus was designed to generate a variety of formant trajectory shapes. A perceptual experiment was performed and a multiple linear regression model was applied to predict perceptual quality ratings from various distances between cross-faded and natural trajectories. Results show that perceptual quality could be predicted well from the proposed distance measures.
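The regression step in this abstract can be illustrated with a minimal sketch. It assumes ordinary least squares with an intercept, fitted through the normal equations; the abstract does not say how the model was fitted or which distance measures were used, so the function names `fit_linear_regression` and `predict` and the input layout (one row of distance values per stimulus) are hypothetical.

```python
def fit_linear_regression(X, y):
    """Ordinary least squares via the normal equations (X'X) w = X'y.

    X: list of rows, each holding the distance measures for one stimulus
       (assumed layout, not taken from the paper).
    y: list of perceptual quality ratings, one per stimulus.
    Returns [intercept, w1, ..., wk].
    """
    rows = [[1.0] + list(r) for r in X]  # prepend an intercept column
    n = len(rows[0])
    # Augmented normal-equation matrix [X'X | X'y].
    M = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [v - f * c for v, c in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]


def predict(w, distances):
    """Predicted rating for one stimulus from its distance measures."""
    return w[0] + sum(wi * d for wi, d in zip(w[1:], distances))
```

Given one row of distances per cross-faded stimulus and the matching listener ratings, `fit_linear_regression` returns the weights and `predict` applies them to a new stimulus. The sketch needs at least as many stimuli as weights and distance measures that are not linearly dependent.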

10:20 Exploring Automatic Similarity Measures for Unit Selection Tuning

Daniel Tihelka (University of West Bohemia)
Jan Romportl (SpeechTech s.r.o)

The paper focuses on the current handling of target features in the unit selection approach, which essentially requires huge corpora. It outlines possible solutions based on measuring (dis)similarity among prosodic patterns. As a starting point for this research, several intuitively chosen measures of acoustic signal (dis)similarity are presented and correlated with perceived similarity obtained from a large-scale listening test.

10:40 Towards Intonation Control in Unit Selection Speech Synthesis

Cedric Boidin (Orange Labs)
Olivier Boeffard (IRISA / University of Rennes 1)
Thierry Moudenc (Orange Labs)
Geraldine Damnati (Orange Labs)

We propose to control intonation in unit selection speech synthesis with a mixed CART-HMM intonation model. The Finite State Machine (FSM) formulation is suited to incorporate the intonation model in the unit selection framework because it allows for combination of models with different unit types and handling competing intonative variants. Subjective experiments have been carried out to compare segmental and joint-prosodic-and-segmental unit selection.

11:00 A Novel Approach to Cost Weighting in Unit Selection TTS

Jerome Bellegarda (Apple Inc.)

Unit selection text-to-speech synthesis relies on multiple cost criteria, each encapsulating a different aspect of acoustic and prosodic context at any given concatenation point. For a particular set of criteria, the relative weighting of the resulting costs crucially affects final candidate ranking. Their influence is typically determined in an empirical manner (e.g., based on a limited amount of synthesized data), yielding global weights that are thus applied to all concatenations indiscriminately. This paper proposes an alternative approach, based on a data-driven framework separately optimized for each concatenation. The cost distribution in every information stream is dynamically leveraged to locally shift weight towards those characteristics that prove most discriminative at this point. An illustrative case study underscores the potential benefits of this solution.

11:20 Maximum Likelihood Unit Selection for Corpus-based Speech Synthesis

Abubeker Gamboa Rosales (University of Guanajuato)
Hamurabi Gamboa Rosales (Dresden University of Technology)
Ruediger Hoffmann (Dresden University of Technology)

Unit selection attempts to find the best combination of speech unit sequences in an inventory so that the perceptual differences between expected (natural) and synthesized signals are as low as possible. However, mismatches and distortions are still possible in concatenative speech synthesis and they are normally perceptible in the synthesized waveform. Therefore, unit selection strategies and parameter tuning are still important issues in the improvement of speech synthesis. We present a novel concept to increase the efficiency of the exhaustive speech unit search within the inventory via a unit selection model. This model bases its operation on a mapping analysis of the concatenation sub-costs, a Bayes optimal classification (BOC), and a maximum likelihood selection (MLS). The principal advantage of the proposed unit selection method is that it does not require exhaustive training to set up weighted coefficients for target and concatenation sub-costs.

11:40 A Close Look into the Probabilistic Concatenation Model for Corpus-based Speech Synthesis

Shinsuke Sakai (NICT)
Ranniery Maia (NICT)
Hisashi Kawai (NICT)
Satoshi Nakamura (NICT)

We have proposed a novel probabilistic approach to concatenation modeling for corpus-based speech synthesis, where the goodness of concatenation for a unit is modeled using a conditional Gaussian probability density whose mean is defined as a linear transform of the feature vector from the previous unit, and have shown its effectiveness through a subjective listening test. In this paper, we further investigate the characteristics of the proposed method by an objective evaluation and by observing the sequence of concatenation scores across an utterance. We also present the mathematical relationships of the proposed method with other approaches and show that it has flexible modeling power, with other approaches to concatenation scoring as special cases.
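The conditional Gaussian score described in this abstract can be illustrated with a minimal sketch. The abstract specifies only that the mean is a linear transform of the previous unit's feature vector; the diagonal covariance, the bias term `b`, and the function name `concat_log_score` are assumptions made here for illustration.

```python
import math


def concat_log_score(prev_feat, cur_feat, A, b, var):
    """Log-density of cur_feat under N(A @ prev_feat + b, diag(var)).

    prev_feat: feature vector at the end of the previous unit.
    cur_feat:  feature vector at the start of the candidate unit.
    A, b:      linear transform and bias giving the conditional mean
               (the bias is an assumption of this sketch).
    var:       per-dimension variances (diagonal covariance is assumed).
    """
    score = 0.0
    for i, x in enumerate(cur_feat):
        # Conditional mean for dimension i: row i of A times prev_feat, plus bias.
        mean_i = sum(a * p for a, p in zip(A[i], prev_feat)) + b[i]
        score += -0.5 * (
            math.log(2.0 * math.pi * var[i]) + (x - mean_i) ** 2 / var[i]
        )
    return score
```

A higher score means a more plausible join. In this sketch, setting `A` to the identity and `b` to zero reduces the score, up to an additive constant, to a negative variance-weighted squared distance between the two adjacent feature vectors.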