Brighton Pavilion

10th Annual Conference of the International Speech Communication Association

ISCA Interspeech 2009 Brighton

Technical Programme

This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.

Mon-Ses2-S1:
Special Session: INTERSPEECH 2009 Emotion Challenge

Time: Monday 13:30 Place: East Wing 4 Type: Special
Chair: Bjoern Schuller & Anton Batliner

#0 Emotion Classification in Children’s Speech Using Fusion of Acoustic and Linguistic Features

Tim Polzehl (TU-Berlin, Deutsche Telekom Laboratories)
Shiva Sundaram (TU-Berlin, Deutsche Telekom Laboratories)
Hamed Ketabdar (TU-Berlin, Deutsche Telekom Laboratories)
Michael Wagner (National Centre for Biometric Studies)
Florian Metze (interACT)

This paper describes a system to detect angry vs. non-angry utterances of children who are engaged in dialog with an Aibo robot dog. The system was submitted to the Interspeech 2009 Emotion Challenge evaluation. The speech data consist of short utterances of the children’s speech, and the proposed system is designed to detect anger in each given chunk. Frame-based cepstral features, prosodic and acoustic features, as well as glottal excitation features are extracted automatically, reduced in dimensionality and classified by means of an artificial neural network and a support vector machine. An automatic speech recognizer transcribes the words in an utterance and yields a separate classification based on the degree of emotional salience of the words. Late fusion is applied to make a final decision on anger vs. non-anger of the utterance. Preliminary results show 75.9% unweighted average recall on the training data and 67.6% on the test set.
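
Unweighted average recall is the figure quoted by most papers in this session. As a rough illustration (not the challenge's official scoring tool), it can be computed as the plain mean of the per-class recalls, so that a rare class counts as much as a frequent one. The function name and the toy labels below are assumptions made up for this sketch.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; each class counts equally, whatever its size."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one labelled instance is required")
    total = defaultdict(int)    # instances per true class
    correct = defaultdict(int)  # correctly recognised instances per true class
    for truth, guess in zip(y_true, y_pred):
        total[truth] += 1
        if truth == guess:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Toy, invented data: 2 angry and 8 non-angry utterances.
y_true = ["angry"] * 2 + ["non-angry"] * 8
y_pred = ["angry", "non-angry"] + ["non-angry"] * 8

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"UAR = {uar:.2f}, accuracy = {accuracy:.2f}")
```

On this toy data the plain accuracy is higher than the unweighted average recall, because the single missed utterance belongs to the small class.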

#0 Acoustic Emotion Recognition using Dynamic Bayesian Networks and Multi-Space Distributions

Roberto Barra-Chicote (Speech Technology Group. Universidad Politecnica de Madrid. Spain)
Fernando Fernandez (Speech Technology Group. Universidad Politecnica de Madrid. Spain)
Syaheerah Lutfi (Speech Technology Group. Universidad Politecnica de Madrid. Spain)
Juan Manuel Lucas-Cuesta (Speech Technology Group. Universidad Politecnica de Madrid. Spain)
Javier Macias-Guarasa (Department of Electronics. University of Alcala. Spain)
Juan Manuel Montero (Speech Technology Group. Universidad Politecnica de Madrid. Spain)
Ruben San-Segundo (Speech Technology Group. Universidad Politecnica de Madrid. Spain)
Jose Manuel Pardo (Speech Technology Group. Universidad Politecnica de Madrid. Spain)

In this paper we describe the acoustic emotion recognition system built at the Speech Technology Group of the Universidad Politecnica de Madrid (Spain) to participate in the INTERSPEECH 2009 Emotion Challenge. Our proposal is based on the use of a Dynamic Bayesian Network (DBN) to deal with the temporal modelling of the emotional speech information. The selected features (MFCC, F0, energy and their variants) are modelled as different streams, and the F0-related ones are integrated under a Multi-Space Distribution (MSD) framework to properly model their dual nature (voiced/unvoiced). Experimental evaluation on the challenge test set shows unweighted recalls of 67.06% and 38.24% for the 2-class and 5-class tasks, respectively. In the 2-class case, we achieve results similar to the baseline with 8.5 times fewer features. In the 5-class case, we achieve a statistically significant 6.5% relative improvement.

#0 Brno University of Technology System for Interspeech 2009 Emotion Challenge

Marcel Kockmann (Brno University of Technology, Czech Republic)
Lukas Burget (Brno University of Technology, Czech Republic)
Jan Cernocky (Brno University of Technology, Czech Republic)

This paper describes the Brno University of Technology (BUT) system for the Interspeech 2009 Emotion Challenge. Our submitted system for the Open Performance Sub-Challenge uses acoustic frame-based features as a front-end and Gaussian Mixture Models as a back-end. Different feature types and modeling approaches successfully applied in speaker and language recognition are investigated, and we achieve a 16% and 9% relative improvement over the best dynamic and static baseline systems on the 5-class task, respectively.

#0 Cepstral and Long-Term Features for Emotion Recognition

Pierre Dumouchel (Ecole de technologie superieure)
Najim Dehak (Ecole de technologie superieure)
Yazid Attabi (Ecole de technologie superieure)
Reda Dehak (Laboratoire de recherche et de developpement de l'EPITA)
Narjes Boufaden (Centre de recherche informatique de Montreal)

In this paper, we describe systems that were developed for the Open Performance Sub-Challenge of the INTERSPEECH 2009 Emotion Challenge. We participate in both the two-class and five-class emotion detection tasks. For the two-class problem, the best performance is obtained by logistic regression fusion of three systems. These systems use short- and long-term speech features. This fusion achieved an absolute improvement of 2.6% on the unweighted recall value compared with [6]. For the five-class problem, we submitted two individual systems: cepstral GMM vs. long-term GMM-UBM. The best result comes from the cepstral GMM, which produced an absolute improvement of 3.5% compared to [6].
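
A minimal sketch of what logistic regression fusion of several subsystem scores can look like at decision time: the per-system scores are combined linearly and passed through a sigmoid. The weights, bias and scores below are invented for illustration; in practice they would be trained on held-out data, and nothing here reflects the authors' actual parameters or code.

```python
import math


def fuse_scores(scores, weights, bias):
    """Fuse subsystem scores into one posterior with a logistic model."""
    if len(scores) != len(weights):
        raise ValueError("need exactly one weight per subsystem score")
    z = bias + sum(w * s for w, s in zip(weights, scores))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical fusion parameters for three subsystems (assumed, not from the paper).
weights = [1.0, 0.5, 0.5]
bias = 0.0

# Hypothetical log-likelihood-ratio style scores for one utterance.
posterior = fuse_scores([2.0, -1.0, -1.0], weights, bias)
decision = "positive class" if posterior >= 0.5 else "negative class"
print(f"fused posterior = {posterior:.3f} -> {decision}")
```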

#0 Exploring the benefits of discretization of acoustic features for speech emotion recognition

Thurid Vogt (Multimedia Concepts and Applications, University of Augsburg, Germany)
Elisabeth André (Multimedia Concepts and Applications, University of Augsburg, Germany)

We present a contribution to the Open Performance Sub-Challenge of the INTERSPEECH 2009 Emotion Challenge. We evaluate the feature extraction and classifier of EmoVoice, our framework for real-time emotion recognition from voice, on the challenge database and achieve competitive results. Furthermore, we explore the benefits of discretizing numeric acoustic features and find it beneficial in a multi-class task.
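
The abstract does not say which discretization scheme is used. As one possible illustration, the sketch below applies simple equal-width binning: bin edges are estimated on training values of a numeric feature and then used to map any value to a bin index. The function names and the pitch values are assumptions made up for this example.

```python
import bisect


def fit_equal_width_edges(train_values, n_bins):
    """Return the interior bin edges for equal-width binning of a feature."""
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    lo, hi = min(train_values), max(train_values)
    width = (hi - lo) / n_bins
    return [lo + width * i for i in range(1, n_bins)]


def discretize(value, edges):
    """Map a numeric value to a bin index in 0..len(edges)."""
    # Values outside the training range fall into the first or last bin.
    return bisect.bisect_right(edges, value)


# Invented mean-pitch values in Hz for a handful of training utterances.
train_pitch = [100.0, 150.0, 200.0, 250.0, 300.0]
edges = fit_equal_width_edges(train_pitch, n_bins=4)
bins = [discretize(v, edges) for v in [50.0, 120.0, 150.0, 260.0, 400.0]]
print(edges, bins)
```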

#0 Combining spectral and prosodic information for emotion recognition in the Interspeech 2009 Emotion Challenge

Iker Luengo (Department of Electronics and Telecommunication, University of the Basque Country, Spain)
Eva Navas (Department of Electronics and Telecommunication, University of the Basque Country, Spain)
Inmaculada Hernáez (Department of Electronics and Telecommunication, University of the Basque Country, Spain)

This paper describes the system presented at the Interspeech 2009 Emotion Challenge. It relies on both spectral and prosodic features in order to automatically detect the emotional state of the speaker. As both kinds of features have very different characteristics, they are treated separately, creating two sub-classifiers, one using the spectral features and the other one using the prosodic ones. The results of these two classifiers are then combined with a fusion system based on Support Vector Machines.

#0 GTM-URL Contribution to the INTERSPEECH 2009 Emotion Challenge

Santiago Planet (GTM – Grup de Recerca en Tecnologies Mèdia, La Salle – Universitat Ramon Llull, Spain)
Ignasi Iriondo (GTM – Grup de Recerca en Tecnologies Mèdia, La Salle – Universitat Ramon Llull, Spain)
Joan-Claudi Socoró (GTM – Grup de Recerca en Tecnologies Mèdia, La Salle – Universitat Ramon Llull, Spain)
Carlos Monzo (GTM – Grup de Recerca en Tecnologies Mèdia, La Salle – Universitat Ramon Llull, Spain)
Jordi Adell (GTM – Grup de Recerca en Tecnologies Mèdia, La Salle – Universitat Ramon Llull, Spain)

This paper describes our participation in the INTERSPEECH 2009 Emotion Challenge [1]. Starting from our previous experience in the use of automatic classification for the validation of an expressive corpus, we have tackled the difficult task of emotion recognition from speech with real-life data. Our main contribution to this work is related to the Classifier Sub-Challenge, for which we tested several classification strategies. On the whole, the results were slightly worse than or similar to the baseline, but we found some configurations that could be considered in future implementations.

#0 Improving Automatic Emotion Recognition from Speech Signals

Elif Bozkurt (Koc University, Istanbul, Turkey)
Engin Erzin (Koc University, Istanbul, Turkey)
Cigdem Eroglu Erdem (Bahcesehir University, Istanbul, Turkey)
Tanju Erdem (Ozyegin University, Istanbul, Turkey)

We present a speech-signal-driven emotion recognition system. Our system is trained and tested with the INTERSPEECH 2009 Emotion Challenge corpus, which includes spontaneous and emotionally rich recordings. We investigate prosody-related, spectral and HMM-based features for the evaluation of emotion recognition with Gaussian mixture model (GMM) based classifiers. Spectral features consist of mel-scale cepstral coefficients (MFCC), line spectral frequency (LSF) features and their derivatives, whereas prosody-related features consist of mean-normalized values of pitch, first derivative of pitch and intensity. Unsupervised training of HMM structures is employed to define prosody-related temporal features for the emotion recognition problem. We also investigate data fusion of different features and decision fusion of different classifiers, which are not well studied in the emotion recognition framework.

#0 Emotion Recognition Using a Hierarchical Binary Decision Tree Approach

Chi-Chun Lee (Signal Analysis and Interpretation Laboratory (SAIL), Electrical Engineering Department, University of Southern California, Los Angeles, CA 90089, USA)
Emily Mower (Signal Analysis and Interpretation Laboratory (SAIL), Electrical Engineering Department, University of Southern California, Los Angeles, CA 90089, USA)
Carlos Busso (Signal Analysis and Interpretation Laboratory (SAIL), Electrical Engineering Department, University of Southern California, Los Angeles, CA 90089, USA)
Sungbok Lee (Signal Analysis and Interpretation Laboratory (SAIL), Electrical Engineering Department, University of Southern California, Los Angeles, CA 90089, USA)
Shrikanth Narayanan (Signal Analysis and Interpretation Laboratory (SAIL), Electrical Engineering Department, University of Southern California, Los Angeles, CA 90089, USA)

Emotion state tracking is an important aspect of human-computer and human-robot interaction. It is important to design task-specific emotion recognition systems for real-world applications. In this work, we propose a hierarchical structure loosely motivated by Appraisal Theory for emotion recognition. The levels in the hierarchical structure are carefully designed to place the easier classification task at the top level and delay the decision between highly ambiguous classes to the end. The proposed structure maps an input utterance into one of the five emotion classes through subsequent layers of binary classifications. We obtain a balanced recall on each of the individual emotion classes using this hierarchical structure. The performance measure of the average unweighted recall percentage on the evaluation data set improves by 3.3% absolute (8.8% relative) over the baseline model.
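
The idea of reaching one of five classes through successive binary decisions can be sketched as a simple cascade: each stage either assigns its class or passes the utterance on, and the last stage separates the two most ambiguous classes. The class names, feature names and threshold rules below are placeholders invented for this sketch; the paper's actual tree layout and classifiers are not reproduced here.

```python
from typing import Callable, Dict, List, Tuple

Features = Dict[str, float]
Stage = Tuple[str, Callable[[Features], bool]]


def classify_hierarchically(features: Features, stages: List[Stage], fallback: str) -> str:
    """Walk the binary stages in order and stop at the first one that fires."""
    for label, is_member in stages:
        if is_member(features):
            return label
    # No stage fired: the utterance belongs to the last remaining class.
    return fallback


# Four hypothetical binary decisions yield five classes; the easiest split comes first.
stages: List[Stage] = [
    ("class_A", lambda f: f["energy"] > 0.8),
    ("class_B", lambda f: f["energy"] < 0.2),
    ("class_C", lambda f: f["pitch_range"] > 0.5),
    ("class_D", lambda f: f["pitch_slope"] > 0.0),
]

utterance = {"energy": 0.5, "pitch_range": 0.6, "pitch_slope": -1.0}
label = classify_hierarchically(utterance, stages, fallback="class_E")
print(label)
```

In a real system each lambda would be replaced by a trained binary classifier, which is the only part this cascade leaves open.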

13:30 The INTERSPEECH 2009 Emotion Challenge

Bjoern Schuller (Technische Universitaet Muenchen)
Stefan Steidl (Friedrich-Alexander University Erlangen-Nuremberg)
Anton Batliner (Friedrich-Alexander University Erlangen-Nuremberg)

The last decade has seen a substantial body of literature on the recognition of emotion from speech. However, in comparison to related speech processing tasks such as Automatic Speech and Speaker Recognition, practically no standardised corpora and test conditions exist to compare performances under exactly the same conditions. Instead, a multiplicity of evaluation strategies (such as cross-validation or percentage splits without proper instance definition) prevents exact reproducibility. The INTERSPEECH 2009 Emotion Challenge aims at bridging such gaps between excellent research on human emotion recognition from speech and low compatibility of results. The FAU Aibo Emotion Corpus serves as the basis, with clearly defined test and training partitions incorporating speaker independence as needed in most real-life settings. This paper introduces the challenge, the corpus, the features, and benchmark results of two popular approaches towards emotion recognition from speech.