|
10th Annual Conference of the International Speech Communication Association
Interspeech 2009 Brighton
|
Technical Programme
This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.
Wed-Ses1-P2: Prosody perception and language acquisition
| Time: | Wednesday 10:00 |
| Place: | Hewison Hall |
| Type: | Poster |
| Chair: | David House |
| #1 | Perception of English Compound vs. Phrasal Stress: Natural vs. Synthetic Speech
Irene Vogel (University of Delaware) Arild Hestvik (University of Delaware) H. Timothy Bunnell (Nemours Biomedical Research) Laura Spinu (University of Delaware)
The ability of listeners to distinguish between compound and
phrasal stress in English was examined on the basis of a picture
selection task. The responses to naturally and synthetically
produced stimuli were compared. While greater overall accuracy
was observed with the natural stimuli, the same pattern of
greater accuracy with compound stress than with phrasal stress
was observed with both types of stimuli.
|
| #2 | New Method for Delexicalization and its Application to Prosodic Tagging for Text-to-Speech Synthesis
Martti Vainio (Department of Speech Sciences, University of Helsinki) Antti Suni (Department of Speech Sciences, University of Helsinki) Tuomo Raitio (Department of Signal Processing and Acoustics, Helsinki University of Technology) Jani Nurminen (Nokia Devices R&D) Juhani Järvikivi (Max Planck Institute for Psycholinguistics) Paavo Alku (Department of Signal Processing and Acoustics, Helsinki University of Technology)
This paper describes a new flexible delexicalization method based on
a glottal-excited parametric speech synthesis scheme. The system
utilizes inverse-filtered glottal flow and all-pole modelling of the
vocal tract. The method makes it possible to retain and
manipulate all relevant prosodic features of any kind of speech.
Most importantly, the features include voice quality, which has not
been properly modeled in earlier delexicalization methods. The
functionality of the new method was tested in a prosodic tagging
experiment aimed at providing word prominence data for a
text-to-speech synthesis system. The experiment confirmed the
usefulness of the method and further corroborated earlier evidence that
linguistic factors influence the perception of prosodic prominence.
|
| #3 | Speech rate and pauses in non-native Finnish
Minnaleena Toivola (Department of General Linguistics, University of Helsinki, Finland) Mietta Lennes (Department of General Linguistics and Department of Speech Sciences, University of Helsinki, Finland) Eija Aho (Department of General Linguistics, University of Helsinki, Finland)
In this study, the temporal aspects of speech are compared in read-aloud Finnish produced by six native and 16 non-native speakers. It is shown that the speech and articulation rates as well as pause durations are different for native and non-native speakers. Moreover, differences exist between the groups of speakers representing four different non-native languages. Surprisingly, the native Finnish speakers tend to make longer pauses than the non-natives. The results are relevant when developing methods for assessing fluency or the strength of foreign accent.
|
| #4 | Modelling similarity perception of intonation
Uwe Reichel (University of Munich) Felicitas Kleber (University of Munich) Raphael Winkelmann (University of Munich)
In this study, a perception experiment was carried out to examine the perceived similarity of intonation contours. Among other results, we found that the subjects are capable of producing consistent similarity judgements.
On the basis of these data, we studied the influence of several physical distance measures on the human similarity judgements by grouping these measures into principal components and by comparing the weights of these components in a linear regression model predicting human perception. Non-correlation-based distance measures for f0 contours received the highest relative weight.
Finally, we developed linear regression and feed-forward neural network models predicting similarity perception of intonation on the basis of physical contour distances. The performance of the neural networks, measured in terms of mean absolute error, did not differ significantly from the human performance derived from judgement consistency.
|
| #5 | Studying L2 Suprasegmental Features in Asian Englishes: A Position Paper
Helen Meng (The Chinese University of Hong Kong) Chiu-yu Tseng (Academia Sinica) Mariko Kondo (Waseda University) Alissa Harrison (The Chinese University of Hong Kong) Tanya Visceglia (Academia Sinica)
This position paper highlights the importance of suprasegmental training in second language (L2) acquisition. Suprasegmental features are manifested in terms of acoustic cues and convey important information about linguistic and information structures. Hence, L2 learners must harness appropriate suprasegmental productions for effective communication. However, this learning process is influenced by language transfer. We propose to design and collect a corpus to support systematic analysis of L2 suprasegmental features. We lay out a set of carefully selected textual environments that illustrate how suprasegmental features convey information including part-of-speech, syntax, focus, speech acts and semantics. We intend to use these textual environments for collecting speech data in a variety of Asian Englishes. Analyses of such corpora should lead to research findings that have important implications for language education and CALL applications.
|
| #6 | Classification of disfluent phenomena as fluent communicative devices in specific prosodic contexts
Helena Moniz (FLUL/CLUL INESC-ID) Isabel Trancoso (IST/INESC-ID) Ana Mata (FLUL/CLUL)
This work explores prosodic cues of disfluent phenomena. In our previous work, we conducted a perceptual experiment regarding (dis)fluency ratings. Results suggested that some disfluencies may be considered felicitous by listeners, namely filled pauses and prolongations.
In an attempt to discriminate which linguistic features are more salient in the classification of disfluencies as either fluent or disfluent phenomena, we used CART techniques on a corpus of 3.5 hours of spontaneous and prepared non-scripted speech.
The CART results yielded two splits: break indices and contour shape. The first split indicates that events uttered at breaks 3 and 4 are considered felicitous. The second shows that these events must have flat or ascending contours to be considered as such; otherwise they are strongly penalized.
Our preliminary results suggest that there are regular trends in the production of these events, namely, prosodic phrasing and contour shape.
|
| #7 | Cross-Cultural Perception of Discourse Phenomena
Rolf Carlson (CTT, KTH) Julia Hirschberg (Columbia University)
We discuss perception studies of two low-level indicators of discourse phenomena by Swedish, Japanese, and Chinese native speakers. Subjects were asked to identify upcoming prosodic boundaries and disfluencies in Swedish spontaneous speech. We hypothesized that speakers of prosodically unrelated languages should be less able to predict upcoming phrase boundaries but potentially better able to identify disfluencies, since indicators of disfluency are more likely to depend upon lexical as well as acoustic information. Surprisingly, however, we found that both phenomena were fairly well recognized by native and non-native speakers, although with some possible interference from word tones for the Chinese subjects.
|
| #8 | Modelling Vocabulary Growth from Birth to Young Adulthood
Roger Moore (University of Sheffield) Louis ten Bosch (Radboud University Nijmegen)
There has been considerable debate over the existence of the ‘vocabulary spurt’ phenomenon - an apparent acceleration in word learning that is commonly said to occur in children around the age of 18 months. This paper presents an investigation into modelling the phenomenon using data from almost 1800 children. The results indicate that the acquisition of a receptive/productive lexicon can be quite adequately modelled as a single growth function with an ecologically well founded and cognitively plausible interpretation. Hence it is concluded that there is little evidence for the vocabulary spurt phenomenon as a separable aspect of language acquisition.
|
| #9 | Adaptive Non-negative Matrix Factorization in a Computational Model of Language Acquisition
Joris Driesen (Dept. ESAT, KULeuven, Leuven) Louis ten Bosch (CLST, Radboud University, Nijmegen) Hugo Van hamme (Dept. ESAT, KULeuven, Leuven)
During the early stages of language acquisition, young infants
face the task of learning a basic vocabulary without the aid of
prior linguistic knowledge. It is believed that long-term
episodic memory plays an important role in this process.
Experiments have shown that infants retain large amounts of very
detailed episodic information about the speech they perceive
(e.g. [1]). This weakly justifies the fact that some algorithms
attempting to model the process of vocabulary acquisition
computationally process large amounts of speech data in batch.
Non-negative Matrix Factorization (NMF), a technique that is
particularly successful in data mining but can also be applied to
vocabulary acquisition (e.g. [2]), is such an algorithm. In this
paper, we integrate an adaptive variant of NMF into a
computational framework for vocabulary acquisition, forgoing the
need for long-term storage of speech inputs, and show that its accuracy
matches that of the batch algorithm.
|
| #10 | Classifying clear and conversational speech based on acoustic features
Akiko Amano-Kusumoto (Oregon Health & Science University) John-Paul Hosom (Oregon Health & Science University) Izhak Shafran (Oregon Health & Science University)
This paper reports an investigation of features relevant for classifying two speaking styles, namely, conversational speaking style and clear (e.g. hyper-articulated) speaking style. Spectral and prosodic features were automatically extracted from speech and classified using decision tree classifiers and multilayer perceptrons, achieving accuracies of about 71% and 77% respectively. More interestingly, we found that, out of the 56 features, only about 9 are needed to capture most of the predictive power. While perceptual studies have shown that spectral cues are more useful than prosodic features for intelligibility [Kain2008], here we find that prosodic features are more important for classification.
|
| #11 | The Acoustic Characteristics of Russian Vowels in Children of 6 and 7 Years of Age
Elena Lyakso (Saint-Petersburg State University) Olga Frolova (Saint-Petersburg State University) Alex Grigoriev (Saint-Petersburg State University)
The purpose of this investigation is to examine how the acoustic features of vowels in child speech approach the corresponding values in normal Russian adult speech. Vowel formant structure, pitch and vowel duration were examined. The influence of word stress and palatal context on the formant structure of the vowels was taken into account. It was shown that word stress is formed by 6-7 years of age on the basis of the feature typical for the Russian language. The formant structure of the Russian vowels /u/ and /i/ is not yet formed by the age of 7 years. Native speakers recognize the meaning of 57-93% of words in the speech of 6- and 7-year-old children.
|
| #12 | Japanese children’s acquisition of prosodic politeness expressions
Takaaki Shochi (Division of Cognitive Psychology, Kumamoto University, Japan) Donna Erickson (Showa Music University, Kawasaki City, Japan) Kaoru Sekiyama (Division of Cognitive Psychology, Kumamoto University, Japan) Albert Rilliard (LIMSI-CNRS) Véronique Aubergé (GIPSA Lab, Grenoble, France)
This paper presents a perception experiment to measure the ability of Japanese children in the fourth and fifth grades of elementary school to recognize culturally encoded expressions of politeness and impoliteness in their native language. Audiovisual stimuli were presented to listeners, who rated the degree of politeness and a possible situation in which such an expression could be used. Analysis of the results focuses on the differences and similarities between adult listeners and children, for each attitude and modality. Facial information seems to be retrieved earlier than audio information, and expressions of different degrees of Japanese politeness, including expressions of kyoshuku, are still not understood around 10 years of age.
|
| #13 | Perceptual training of singleton and geminate stops in Japanese language by Korean learners
Mee Sonu (Waseda University) Keiichi Tajima (Hosei University) Hiroaki Kato (NICT/ATR) Yoshinori Sagisaka (Waseda University)
We aim to build an effective perceptual training paradigm toward a computer-assisted language learning (CALL) system for a second language. This study investigated the effectiveness of perceptual training for Korean-speaking learners of Japanese in distinguishing between geminate and singleton stops in Japanese. The training consisted of identification of geminate and singleton stops with feedback. We investigated whether training improves the learners’ identification of the geminate and singleton stops in Japanese. Moreover, we examined how perceptual training is affected by factors that influence speaking rate. Results were as follows. Participants who underwent perceptual training improved their overall performance to a greater extent than untrained control participants. However, there was no significant difference between the group that was trained with three speaking rates and the group that was trained with the normal rate only.
|