Brighton Pavilion

10th Annual Conference of the International Speech Communication Association

ISCA Interspeech 2009 Brighton

Technical Programme

This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.

Mon-Ses3-O3:
Statistical Parametric Synthesis I

Time: Monday 16:00 Place: East Wing 2 Type: Oral
Chair: Keiichi Tokuda

16:00 Autoregressive HMMs for speech synthesis

Matt Shannon (Cambridge University Engineering Department, U.K.)
William Byrne (Cambridge University Engineering Department, U.K.)

We propose the autoregressive HMM for speech synthesis. We show that the autoregressive HMM supports efficient EM parameter estimation and that we can use established effective synthesis techniques such as synthesis considering global variance with minimal modification. The autoregressive HMM uses the same model for parameter estimation and synthesis in a consistent way, in contrast to the standard HMM synthesis framework, and supports easy and efficient parameter estimation, in contrast to the trajectory HMM. We find that the autoregressive HMM gives performance comparable to the standard HMM synthesis framework on a Blizzard Challenge-style naturalness evaluation.

16:20 Asynchronous F0 and Spectrum Modeling for HMM-based Speech Synthesis

Cheng-Cheng Wang (USTC iFlytek Speech Lab, University of Science and Technology of China, Hefei, China)
Zhen-Hua Ling (USTC iFlytek Speech Lab, University of Science and Technology of China, Hefei, China)
Li-Rong Dai (USTC iFlytek Speech Lab, University of Science and Technology of China, Hefei, China)

This paper proposes an asynchronous model structure for fundamental frequency (F0) and spectrum modeling in HMM-based parametric speech synthesis to improve the performance of F0 prediction. F0 and spectrum features are considered to be synchronous in the conventional system. Considering that the production of these two features is determined by the movement of different speech organs, an explicitly asynchronous model structure is introduced. At the training stage, F0 models are trained asynchronously with spectrum models. At the synthesis stage, the two features are generated separately. The objective and subjective evaluation results show that the proposed method can effectively improve the accuracy of F0 prediction.

16:40 A Minimum V/U Error Approach to F0 Generation in HMM-based TTS

Yao Qian (Microsoft Research Asia, Beijing, China)
Frank Soong (Microsoft Research Asia, Beijing, China)
Miaomiao Wang (Microsoft Research Asia, Beijing, China)
Zhizheng Wu (Microsoft Research Asia, Beijing, China)

HMM-based TTS can produce a highly intelligible voice of decent quality. However, the HMM model degrades when the feature vectors used in training are noisy. Among all noisy features, pitch tracking errors and the corresponding flawed voiced/unvoiced (v/u) decisions are identified as two key factors in voice quality problems. In this paper, we propose a minimum v/u error approach to F0 generation. Prior knowledge of v/u is imposed on each Mandarin phone, and accumulated v/u posterior probabilities are used to search for the optimal v/u switching point in each VU or UV segment during generation. Objectively, the new approach is shown to improve v/u prediction performance, specifically on voiced-to-unvoiced swapping errors, which are reduced from 3.7% (baseline) to 2.0% (new approach). The improvement is also confirmed subjectively by an AB preference test score of 72% (new approach) versus 22% (baseline).
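
The switching-point search mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the cost function (expected number of mislabelled frames), the function name best_switch_point, and the per-frame posterior input are all assumptions made for illustration. Given the posterior probability that each frame in a segment is voiced, and the knowledge that the segment is voiced-then-unvoiced (VU) or unvoiced-then-voiced (UV), the sketch picks the single switch index with the lowest accumulated expected error.

```python
from itertools import accumulate

def best_switch_point(voiced_post, order="VU"):
    """Return k so that frames [0, k) take the first label and [k, T) the second.

    voiced_post: hypothetical per-frame posterior probabilities of "voiced".
    order: "VU" (voiced then unvoiced) or "UV" (unvoiced then voiced).
    """
    # Probability that each frame carries the segment's first label.
    first = [p if order == "VU" else 1.0 - p for p in voiced_post]
    # prefix_err[k]: expected errors if frames [0, k) get the first label.
    prefix_err = [0.0] + list(accumulate(1.0 - q for q in first))
    # suffix_err[k]: expected errors if frames [k, T) get the second label.
    prefix_first = [0.0] + list(accumulate(first))
    total_first = prefix_first[-1]
    suffix_err = [total_first - s for s in prefix_first]
    # Total expected error for every candidate switch index 0..T.
    costs = [a + b for a, b in zip(prefix_err, suffix_err)]
    return min(range(len(costs)), key=costs.__getitem__)

# Example: a VU segment whose posteriors drop after the third frame.
k = best_switch_point([0.9, 0.8, 0.7, 0.2, 0.1], order="VU")
labels = ["V"] * k + ["U"] * (5 - k)
```

Constraining each segment to exactly one switch rules out isolated unvoiced frames inside a voiced stretch, which frame-by-frame thresholding of the same posteriors could produce.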

17:00 Voiced/Unvoiced Decision Algorithm for HMM-based Speech Synthesis

Shiyin Kang (Department of Computer Science and Technology, Tsinghua University, Beijing, China)
Zhiwei Shuang (IBM China Research Lab, Beijing, China)
Quansheng Duan (Department of Computer Science and Technology, Tsinghua University, Beijing, China)
Yong Qin (IBM China Research Lab, Beijing, China)
Lianhong Cai (Department of Computer Science and Technology, Tsinghua University, Beijing, China)

This paper introduces a novel method to improve the U/V decision in HMM-based speech synthesis. In the conventional method, the U/V decision of each state is made independently, so a state in the middle of a vowel may be decided as unvoiced. In this paper, we propose to utilize the constraints of natural speech to improve the U/V decision inside a unit, such as a syllable or phone. We use a GMM-based U/V change time model to select the best U/V change time in one unit, and refine the U/V decision of all states in that unit based on the selected change time. The result of a perceptual evaluation demonstrates that the proposed method can significantly improve the naturalness of the synthetic speech.
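
As a rough illustration of the idea in this abstract, the sketch below scores candidate change times with a one-dimensional Gaussian mixture and relabels every state in the unit accordingly. The details are assumptions, not taken from the paper: the change time is normalised to the unit's duration, candidates are restricted to state boundaries, and the names gmm_pdf and refine_uv and the mixture parameters are hypothetical.

```python
import math

def gmm_pdf(x, weights, means, stds):
    """Density of a 1-D Gaussian mixture at x."""
    return sum(
        w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
        for w, m, s in zip(weights, means, stds)
    )

def refine_uv(state_durations, weights, means, stds, order="VU"):
    """Relabel all states of one unit around the most likely change time."""
    total = sum(state_durations)
    # Candidate change times: state boundaries, normalised to [0, 1].
    boundaries = [0.0]
    for d in state_durations:
        boundaries.append(boundaries[-1] + d / total)
    # Pick the boundary where the (assumed) change-time model is most dense.
    best = max(
        range(len(boundaries)),
        key=lambda i: gmm_pdf(boundaries[i], weights, means, stds),
    )
    first, second = ("V", "U") if order == "VU" else ("U", "V")
    return [first] * best + [second] * (len(state_durations) - best)

# Example: five equal-length states, model centred at 60% of the unit.
decisions = refine_uv([2, 2, 2, 2, 2], weights=[1.0], means=[0.6], stds=[0.1])
```

Because the labels are derived from a single change time, a state in the middle of a voiced region cannot be marked unvoiced on its own.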

17:20 Local minimum generation error criterion for hybrid HMM speech synthesis

Xavi Gonzalvo (Phonetic Arts Ltd.)
Alexander Gutkin (Yahoo! Europe)
Joan Claudi Socoro (Universitat Ramon Llull)
Ignasi Iriondo (Universitat Ramon Llull)
Paul Taylor (Phonetic Arts Ltd.)

This paper presents an HMM-driven hybrid speech synthesis approach in which unit selection concatenative synthesis is used to improve the quality of the statistical system using a Local Minimum Generation Error (LMGE) criterion during the synthesis stage. The idea behind this approach is to combine the robustness of HMMs with the naturalness of concatenated units. Unlike conventional hybrid approaches to speech synthesis, which use concatenative synthesis as a backbone, the proposed system employs stable regions of natural units to improve the statistically generated parameters. We show that this approach improves the generation of vocal tract parameters, smooths bad joins, and increases the overall quality.

17:40 Thousands of Voices for HMM-based Speech Synthesis

Junichi Yamagishi (University of Edinburgh)
Bela Usabaev (Universität Tübingen)
Simon King (University of Edinburgh)
Oliver Watts (University of Edinburgh)
John Dines (Idiap Research Institute)
Jilei Tian (Nokia)
Rile Hu (Nokia)
Keiichiro Oura (Nagoya Institute of Technology)
Keiichi Tokuda (Nagoya Institute of Technology)
Reima Karhila (Helsinki University of Technology)
Mikko Kurimo (Helsinki University of Technology)

Our recent experiments with HMM-based speech synthesis systems have demonstrated that speaker-adaptive HMM-based speech synthesis (which uses an ‘average voice model’ plus model adaptation) is robust to non-ideal speech data that are recorded under various conditions and with varying microphones, that are not perfectly clean, and/or that lack phonetic balance. This enables us to consider building high-quality voices on ‘non-TTS’ corpora such as ASR corpora. Since ASR corpora generally include a large number of speakers, this leads to the possibility of producing an enormous number of voices automatically. In this paper we show thousands of voices for HMM-based speech synthesis that we have made from several popular ASR corpora such as the Wall Street Journal databases (WSJ0/WSJ1/WSJCAM0), Resource Management, Globalphone and Speecon. We report some perceptual evaluation results and outline the outstanding issues.