|
10th Annual Conference of the International Speech Communication Association
Interspeech 2009 Brighton
|
Technical Programme
This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.
Wed-Ses1-O2: Emotion and Expression I
| Time: | Wednesday 10:00 |
| Place: | East Wing 1 |
| Type: | Oral |
| Chair: | Ailbhe Ni Chasaide |
| 10:00 | Emotion dimensions and formant position
Martijn Bastiaan Goudbeek (University of Tilburg, the Netherlands / Swiss Center for Affective Sciences, Geneva, Switzerland) Jean Philippe Goldman (Language Technology Laboratory, University of Geneva, Switzerland) Klaus Scherer (Swiss Center for Affective Sciences, Switzerland)
The influence of emotion on articulatory precision was investigated in a newly established corpus of acted emotional speech. The frequencies of the first and second formants of the vowels /i/, /u/, and /a/ were measured and shown to be significantly affected by emotion dimension. High arousal resulted in a higher mean F1 in all vowels, whereas positive valence resulted in higher mean values for F2. The dimension potency/control showed a pattern of effects that was consistent with a larger vocalic triangle for emotions high in potency/control. The results are interpreted in the context of Scherer's component process model.
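As a rough illustration of the "vocalic triangle" measure mentioned in this abstract, the sketch below computes the area spanned by the three corner vowels in the F1/F2 plane using the shoelace formula. The function name and the formant values are hypothetical and chosen only for illustration; they are not taken from the paper, and the authors may have quantified the triangle differently.

```python
def vowel_triangle_area(i, u, a):
    """Area (in Hz^2) of the triangle spanned by three (F1, F2) points.

    Uses the shoelace formula; the order of the three vowels does not matter.
    """
    (x1, y1), (x2, y2), (x3, y3) = i, u, a
    return abs(x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2)) / 2.0


# Hypothetical mean (F1, F2) values in Hz, for illustration only.
neutral = {"i": (300.0, 2300.0), "u": (320.0, 800.0), "a": (750.0, 1300.0)}
expanded = {"i": (280.0, 2400.0), "u": (300.0, 750.0), "a": (800.0, 1300.0)}

area_neutral = vowel_triangle_area(neutral["i"], neutral["u"], neutral["a"])
area_expanded = vowel_triangle_area(expanded["i"], expanded["u"], expanded["a"])

print(area_neutral, area_expanded)
```

A larger area for one condition than another would correspond to the more peripheral, more precisely articulated vowels that the abstract associates with high potency/control.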
|
| 10:20 | Identifying Uncertain Words within an Utterance via Prosodic Features
Heather Pon-Barry (Harvard University) Stuart Shieber (Harvard University)
We describe an experiment that investigates whether sub-utterance prosodic features can be used to detect uncertainty at the word level. That is, given an utterance that is classified as uncertain, we want to determine which word or phrase the speaker is uncertain about. We have a corpus of utterances spoken under varying degrees of certainty. Using combinations of sub-utterance prosodic features, we train models to predict the level of certainty of an utterance. On a set of utterances that were perceived to be uncertain, we compare the predictions of our models for two candidate 'target word' segmentations: (a) one with the actual word causing uncertainty as the proposed target word, and (b) one with a control word as the proposed target word. Our best model correctly identifies the word causing the uncertainty rather than the control word 91% of the time.
|
| 10:40 | Evaluating Evaluators: A Case Study in Understanding the Benefits and Pitfalls of Multi-Evaluator Modeling
Emily Mower (University of Southern California) Maja J Mataric (University of Southern California) Shrikanth Narayanan (University of Southern California)
Emotion perception is a complex process, often measured using stimuli presentation experiments that query evaluators for their perceptual ratings of emotional cues. These evaluations contain variability both related and unrelated to the evaluated utterances. One approach to handling this variability is to model emotion perception at the individual level. However, the reported perception of users may not adequately capture the emotional acoustic properties of an utterance. This problem can be mitigated by creating averaged evaluator models. We demonstrate that this averaging improves classification performance compared to models created using individual-specific evaluations. We also demonstrate that the performance increases are related to the consistency with which evaluators label data. These results suggest that the acoustic properties of emotional speech are better captured using models formed from averaged evaluations rather than from individual-specific evaluations.
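The averaging idea in this abstract can be sketched as follows: per-utterance ratings from several evaluators are collapsed into a mean label, and each evaluator's mean absolute deviation from that average serves as a simple consistency indicator. The function names, the deviation measure, and the ratings are all hypothetical assumptions made for illustration; the abstract does not specify how consistency was measured.

```python
from statistics import mean


def averaged_labels(ratings):
    """Return the per-utterance mean rating across all evaluators.

    ratings maps an evaluator name to a list of ratings, one per utterance,
    with all lists in the same utterance order.
    """
    return [mean(per_utterance) for per_utterance in zip(*ratings.values())]


def evaluator_deviation(ratings):
    """Mean absolute deviation of each evaluator from the averaged labels."""
    avg = averaged_labels(ratings)
    return {
        name: mean(abs(r - a) for r, a in zip(rs, avg))
        for name, rs in ratings.items()
    }


# Hypothetical ratings of three utterances by three evaluators.
ratings = {
    "A": [4.0, 2.0, 5.0],
    "B": [5.0, 2.0, 3.0],
    "C": [3.0, 2.0, 4.0],
}

labels = averaged_labels(ratings)
deviation = evaluator_deviation(ratings)
print(labels, deviation)
```

In this toy data, evaluator B deviates most from the averaged labels, which is the kind of inconsistency the abstract relates to differences in classification performance.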
|
| 11:00 | Responding to User Emotional State by Adding Emotional Coloring to Utterances
Jaime Acosta (University of Texas at El Paso) Nigel Ward (University of Texas at El Paso)
When people speak to each other, they share a rich set of nonverbal behaviors such as varying prosody in voice. These behaviors, sometimes interpreted as demonstrations of emotions, call for appropriate responses, but today’s spoken dialog systems lack the ability to do so. We collected a corpus of persuasive dialogs, specifically conversations about graduate school between a staff member and students, and had judges label all utterances with triples indicating the perceived emotions, using the three dimensions: activation, evaluation, and power. We found immediate response patterns, in which the staff member colored her utterances in response to the emotion shown by the student in the immediately previous utterance, and built a predictive model suitable for use in a dialog system to persuasively discuss graduate school with students.
|
| 11:20 | Analysis of Laugh Signals for Detecting in Continuous Speech
Sudheer Kumar K (International Institute of Information Technology, Hyderabad, India) Sri Harish Reddy M (International Institute of Information Technology, Hyderabad, India) Sri Rama Murty K (Indian Institute of Technology Madras, Chennai, India) Yegnanarayana B (International Institute of Information Technology, Hyderabad, India)
Laughter is a nonverbal vocalization that occurs often in speech communication. Since laughter is produced by the speech production mechanism, spectral analysis methods are used mostly for the study of laughter acoustics. In this paper the significance of excitation features for discriminating laughter and speech is discussed. New features describing the excitation characteristics are used to analyze the laugh signals. The features are based on instantaneous pitch and strength of excitation at epochs. An algorithm is developed based on these features to detect laughter regions in continuous speech. The results are illustrated by detecting laughter regions in a TV broadcast program.
|
| 11:40 | Data-driven Clustering in Emotional Space for Affect Recognition Using Discriminatively Trained LSTM Networks
Martin Woellmer (Technische Universitaet Muenchen) Florian Eyben (Technische Universitaet Muenchen) Bjoern Schuller (Technische Universitaet Muenchen) Ellen Douglas-Cowie (Queen's University Belfast) Roddy Cowie (Queen's University Belfast)
In today's affective databases, speech turns are often labelled on a continuous scale for emotional dimensions such as valence or arousal to better express the diversity of human affect. However, applications like virtual agents usually map the detected emotional user state to rough classes in order to reduce the multiplicity of emotion-dependent system responses. Since these classes often do not optimally reflect emotions that typically occur in a given application, this paper investigates data-driven clustering of emotional space to find class divisions that better match the training data and the area of application. To this end, we consider the Belfast Sensitive Artificial Listener database and TV talkshow data from the VAM corpus. We show that a discriminatively trained Long Short-Term Memory (LSTM) recurrent neural net that explicitly learns clusters in emotional space and additionally models context information outperforms both Support Vector Machines and a Regression-LSTM net.
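To make the notion of data-driven class divisions in emotional space concrete, here is a minimal sketch that groups (valence, arousal) annotations with plain k-means. The choice of k-means, the function name, the initial centroids, and the data points are assumptions for illustration only; the abstract does not state which clustering algorithm was used, and this sketch does not involve the LSTM classifier at all.

```python
def kmeans_2d(points, centroids, iterations=10):
    """Cluster 2-D points with k-means from the given starting centroids.

    Returns the final centroids and the cluster index of each point.
    """
    def nearest(p, cents):
        # Index of the centroid with the smallest squared Euclidean distance.
        return min(
            range(len(cents)),
            key=lambda k: (p[0] - cents[k][0]) ** 2 + (p[1] - cents[k][1]) ** 2,
        )

    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Recompute each centroid; keep the old one if its cluster is empty.
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[k]
            for k, c in enumerate(clusters)
        ]
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


# Hypothetical (valence, arousal) annotations in the range [-1, 1].
turns = [(-0.8, -0.6), (-0.7, -0.5), (0.6, 0.7), (0.8, 0.5)]
centroids, labels = kmeans_2d(turns, [(-1.0, -1.0), (1.0, 1.0)])
print(centroids, labels)
```

The resulting cluster indices could then play the role of the application-specific emotion classes that a classifier is trained to predict.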
|