|
10th Annual Conference of the International Speech Communication Association
Interspeech 2009 Brighton
|
Technical Programme
This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.
Wed-Ses2-P2: Expression, Emotion and Personality Recognition
| Time: | Wednesday 13:30 |
Place: | Hewison Hall |
Type: | Poster |
| Chair: | John H.L. Hansen |
| #1 | Classifying Turn-Level Uncertainty Using Word-Level Prosody
Diane Litman (University of Pittsburgh) Mihai Rotaru (Textkernel B.V.) Greg Nicholas (Brown University)
Spoken dialogue researchers often use supervised machine learning to classify turn-level user affect from a set of turn-level features. The utility of sub-turn features has been less explored, due to the complications introduced by associating a variable number of sub-turn units with a single turn-level classification. We present and evaluate several voting methods for using word-level pitch and energy features to classify turn-level user uncertainty in spoken dialogue data. Our results show that when linguistic knowledge regarding prosody and word position is introduced into a word-level voting model, classification accuracy is significantly improved compared to the use of both turn-level and uninformed word-level models.
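As an illustration of the general idea of word-level voting, the following minimal Python sketch combines per-word labels into a single turn-level label. It is not the authors' method: the function names, the label strings, and the choice to weight the final word more heavily are all hypothetical assumptions, chosen only to show how word-position knowledge could enter a vote.

```python
from collections import Counter


def majority_vote(word_labels):
    """Uninformed vote: the most frequent word-level label wins."""
    return Counter(word_labels).most_common(1)[0][0]


def weighted_vote(word_labels, weights):
    """Informed vote: each word's label counts with its own weight."""
    totals = {}
    for label, weight in zip(word_labels, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)


def final_word_weights(n_words, final_weight=3.0):
    """Hypothetical position weighting: the last word of the turn counts more."""
    return [1.0] * (n_words - 1) + [final_weight]


# Word-level predictions for one turn (illustrative labels only).
labels = ["certain", "certain", "uncertain"]
uninformed = majority_vote(labels)
informed = weighted_vote(labels, final_word_weights(len(labels)))
```

With these illustrative inputs the two schemes disagree, which is the situation in which position knowledge changes the turn-level decision.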
|
| #2 | Detecting Subjectivity in Multiparty Speech
Gabriel Murray (Department of Computer Science, University of British Columbia) Giuseppe Carenini (Department of Computer Science, University of British Columbia)
In this research we aim to detect subjective sentences in spontaneous speech and label them for polarity. We introduce a novel technique wherein subjective patterns are learned from both labeled and unlabeled data, using n-grams with varying levels of lexical instantiation. Applying this technique to meeting speech, we gain significant improvement over state-of-the-art approaches and demonstrate the method's robustness to ASR errors. We also show that coupling the pattern-based approach with structural and lexical features of meetings yields additional improvement.
|
| #3 | Pitch Contour Parameterisation based on Linear Stylisation for Emotion Recognition
Vidhyasaharan Sethu (School of Electrical Engineering and Telecommunications, The University of New South Wales, Sydney, NSW 2052, Australia) Eliathamby Ambikairajah (School of Electrical Engineering and Telecommunications, The University of New South Wales, Sydney, NSW 2052, Australia) Julien Epps (School of Electrical Engineering and Telecommunications, The University of New South Wales, Sydney, NSW 2052, Australia)
The pitch contour contains information that characterises the emotion being expressed by speech, and consequently features extracted from pitch form an integral part of many automatic emotion recognition systems. While pitch contours may have many small variations and hence are difficult to represent compactly, it may be possible to parameterise them by approximating the contour for each voiced segment by a straight line. This paper looks at such a parameterisation method in the context of emotion recognition. Listening tests were performed to subjectively determine if the linearly stylised contours were able to sufficiently capture information pertaining to emotions expressed in speech. Furthermore these parameters were used as features for an automatic 5-class emotion classification system. The use of the proposed parameters rather than pitch statistics resulted in a relative increase in accuracy of about 20%.
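The idea of approximating each voiced segment by a straight line can be sketched in a few lines of Python. This is a generic least-squares fit under stated assumptions, not the paper's exact parameterisation: it assumes the contour is a list of frame-level F0 values in which 0 marks an unvoiced frame, and the function names and the choice of (slope, intercept) as output parameters are hypothetical.

```python
def voiced_segments(f0):
    """Return (start, end) index pairs for runs of frames with f0 > 0."""
    segments, start = [], None
    for i, value in enumerate(f0):
        if value > 0 and start is None:
            start = i
        elif value <= 0 and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(f0)))
    return segments


def fit_line(values):
    """Least-squares line through (0, v0), (1, v1), ...: (slope, intercept)."""
    n = len(values)
    if n == 1:
        return 0.0, float(values[0])
    mean_x = (n - 1) / 2.0
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def stylise(f0):
    """One (start, end, slope, intercept) tuple per voiced segment."""
    return [(s, e, *fit_line(f0[s:e])) for s, e in voiced_segments(f0)]


# Toy contour in Hz: a rising segment, a pause, then a falling segment.
contour = [0, 100, 110, 120, 0, 0, 200, 190]
lines = stylise(contour)
```

Each voiced segment is thereby reduced to a handful of numbers regardless of its length, which is what makes the representation compact.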
|
| #4 | Feature-Based and Channel-Based Analyses of Intrinsic Variability in Speaker Verification
Martin Graciarena (SRI International) Tobias Bocklet (University of Erlangen) Elizabeth Shriberg (SRI International) Andreas Stolcke (SRI International) Sachin Kajarekar (SRI International)
We explore how intrinsic variations (those associated with the speaker rather than the recording environment) affect text-independent speaker verification performance. In a previous paper we introduced the SRI-FRTIV corpus and provided speaker verification results using a Gaussian mixture model (GMM) system on telephone-channel speech. In this paper we explore the use of other speaker verification systems on the telephone channel data and compare against the GMM baseline. We found the GMM system to be one of the more robust across all conditions. Systems relying on recognition hypotheses had a significant degradation in low vocal effort conditions. We also explore the use of the GMM system on several other channels. We found improved performance on table-top microphones compared to the telephone channel in furtive conditions and gradual degradations as a function of the distance from the microphone to the speaker.
|
| #5 | Robust Angry Speech Detection Employing TEO-Based Discriminative Classifier Combination
Wooil Kim (Center for Robust Speech Systems (CRSS), Erik Jonsson School of Engineering & Computer Science, University of Texas at Dallas, Richardson, Texas, USA) John Hansen (Center for Robust Speech Systems (CRSS), Erik Jonsson School of Engineering & Computer Science, University of Texas at Dallas, Richardson, Texas, USA)
This paper proposes an effective angry speech detection scheme employing TEO-based feature extraction. A decorrelation process is applied to the TEO-based feature, and minimum classification error training is employed. Combination with the conventional MFCC is also employed, exploiting its effectiveness in characterizing the spectral envelope of speech signals. Experimental results on the SUSAS corpus demonstrate that the proposed angry speech detection scheme is effective at increasing detection accuracy on the open-speaker and open-vocabulary task. Up to 7.78% of classification accuracy is obtained by combining the proposed methods, including decorrelation of the TEO-based feature, discriminative training, and classifier combination.
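For readers unfamiliar with the Teager energy operator (TEO) that underlies these features, the basic discrete operator is ψ[n] = x[n]² − x[n−1]·x[n+1]. The Python sketch below shows only this basic operator applied to a synthetic tone; the feature extraction, decorrelation, and training steps described in the abstract are not reproduced, and the function name and test signal are hypothetical.

```python
import math


def teager_energy(x):
    """Discrete Teager energy: psi[n] = x[n]^2 - x[n-1] * x[n+1].

    Only interior samples are defined, so the output is two samples
    shorter than the input.
    """
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]


# Synthetic pure tone (hypothetical values): amplitude 2, 0.3 rad/sample.
amplitude, omega = 2.0, 0.3
tone = [amplitude * math.cos(omega * n) for n in range(50)]
psi = teager_energy(tone)

# For a pure tone the operator is constant: amplitude^2 * sin(omega)^2.
expected = amplitude ** 2 * math.sin(omega) ** 2
```

Because the output depends on both amplitude and frequency, the operator responds to changes in either.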
|
| #6 | Improving Emotion Recognition using Class-Level Spectral Features
Dmitri Bitouk (University of Pennsylvania) Ani Nenkova (University of Pennsylvania) Ragini Verma (University of Pennsylvania)
Traditional approaches to automatic emotion recognition from speech typically make use of utterance level prosodic features. Still, a great deal of useful information about expressivity and emotion can be gained from spectral features or from measurements from specific regions of the utterance, such as the stressed vowels. Here we introduce a novel set of spectral features for emotion recognition: statistics of Mel-Frequency Spectral Coefficients computed over three phoneme classes. We investigate performance of our features in the task of speaker-independent emotion recognition using two datasets. Our results clearly indicate that indeed both the richer set of spectral features and the differentiation between phoneme type classes are beneficial for the task. Classification accuracies are consistently higher for our features compared to prosodic or utterance-level spectral features. Combination of our phoneme class features with prosodic features leads to even further improvement.
|
| #7 | Arousal and Valence prediction in spontaneous emotional speech: felt versus perceived emotion
Khiet Truong (University of Twente) David van Leeuwen (TNO Defence, Security, and Safety) Mark Neerincx (TNO Defence, Security, and Safety) Franciska de Jong (University of Twente)
In this paper, we describe emotion recognition experiments carried out for spontaneous affective speech with the aim to compare the added value of annotation of felt emotion versus annotation of perceived emotion. Using speech material available in the TNO-GAMING corpus (a corpus containing audiovisual recordings of people playing videogames), speech-based affect recognizers were developed that can predict Arousal and Valence scalar values. Two types of recognizers were developed in parallel: one trained with felt emotion annotations (generated by the gamers themselves) and one trained with perceived/observed emotion annotations (generated by a group of observers). The experiments showed that, in speech, with the methods and features currently used, observed emotions are easier to predict than felt emotions. The results suggest that recognition performance strongly depends on how and by whom the emotion annotations are carried out.
|
| #8 | Dimension Reduction Approaches for SVM based Speaker Age Estimation
Gil Dobry (The Open University of Israel) Ron Hecht (PuddingMedia) Mireille Avigal (The Open University of Israel) Yaniv Zigel (Ben-Gurion University)
This paper presents two novel dimension reduction approaches applied to Gaussian mixture model (GMM) supervectors to improve age estimation speed and accuracy. The GMM supervector embodies many speech characteristics irrelevant to age estimation which, like noise, harm the system’s generalization ability. In addition, the computational cost of support vector machine (SVM) evaluation grows with the vector’s dimension, especially when using complex kernels. The first approach presented is weighted-pairwise principal components analysis (WPPCA), which reduces the vector dimension by minimizing the redundant variability. The second approach is based on anchor models, using a novel anchor selection method. Experiments showed that dimension reduction makes the evaluation process 5 times faster and that, using the WPPCA approach, it is also 5% more accurate.
|
| #9 | ANN based Decision Fusion for Speech Emotion Recognition
Lu Xu (State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory of Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China) Mingxing Xu (State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory of Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China) Dali Yang (Department of Computer Science and Technology, Beijing Information Science and Technology University, Beijing 100101, China)
As a hot research field, speech emotion recognition has attracted increasing attention from both academia and business. In this paper, we propose a method that recognizes speech emotions using ANNs and fuses, at the decision level, two kinds of recognition based on different features. Each emotional utterance is first recognized by several individual recognizers. The outputs of these recognizers are then fused using a voting strategy. Furthermore, the dimensionality of supervectors constructed from spectral features is reduced through PCA. Experimental results demonstrate that the proposed decision fusion is effective and that the dimensionality reduction is feasible.
|
| #10 | Processing affected speech within human machine interaction
Bogdan Vlasenko (Cognitive Systems, IESK, Otto-von-Guericke Universitaet) Andreas Wendemuth (Cognitive Systems, IESK, Otto-von-Guericke Universitaet)
Spoken dialog systems (SDS) integrated into human-machine interaction interfaces are becoming a standard technology. Current state-of-the-art SDS are usually not able to provide the user with a natural way of communication. Existing automated dialog systems do not dedicate enough attention to problems in the interaction related to affected user behavior. As a result, Automatic Speech Recognition (ASR) engines are not able to recognize affected speech, and the dialog strategy does not make use of the user’s emotional state. This paper addresses some aspects of processing affected speech within natural human-machine interaction. First, we propose an affected-speech-adapted ASR engine. Second, we describe our methods of emotion recognition from speech and present our emotion classification results for the Interspeech 2009 Emotion Challenge. Third, we test affected-speech-adapted speech recognition models and introduce an approach to achieve emotion-adaptive dialog management in human-machine interaction.
|
| #11 | Emotion Recognition from Speech using Extended Feature Selection and a Simple Classifier
Ali Hassan (University of Southampton) Robert Damper (University of Southampton)
We describe extensive experiments on the recognition of emotion from speech using acoustic features only. Two databases of acted emotional speech (Berlin and DES) have been used in this work. The principal focus is on methods for selection of good features from a relatively large set of hand-crafted features, perhaps formed by fusing different feature sets used by different researchers. We show that the monotonic assumption underlying popular sequential selection algorithms does not hold, and use this finding to improve recognition accuracy. We show further that a very simple classifier (k-nearest neighbour) produces better results than any so far reported by other researchers on these databases, suggesting that previous work has failed to match the complexity of the classifier used to the complexity of the data. Finally, several potentially fruitful avenues for future work are outlined.
|
|
|