Brighton Pavilion

10th Annual Conference of the International Speech Communication Association

ISCA Interspeech 2009 Brighton

Technical Programme

This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.

Mon-Ses2-P4:
Spoken dialogue systems

Time: Monday 13:30 Place: Hewison Hall Type: Poster
Chair: Dilek Hakkani-Tur

#1 Enabling A User To Specify An Item At Any Time During System Enumeration

Kyoko Matsuyama (Kyoto University)
Kazunori Komatani (Kyoto University)
Tetsuya Ogata (Kyoto University)
Hiroshi G. Okuno (Kyoto University)

In conversational dialogue systems, users prefer to speak at any time and to use natural expressions. We have developed an Independent Component Analysis (ICA) based semi-blind source separation method, which allows users to barge in over system utterances at any time. We created a novel method that uses timing information derived from barge-in utterances to identify the one item that a user indicates during system enumeration. First, we determine the timing distribution of user utterances containing referential expressions and then approximate it using a gamma distribution. Second, we represent both the utterance timing and automatic speech recognition (ASR) results as probabilities of the desired selection from the system's enumeration. We then integrate these two probabilities to identify the item having the maximum likelihood of selection. Experimental results using 400 utterances indicated that our method outperformed two baseline methods (one using ASR results only and one using utterance timing only) in identification accuracy.
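The integration step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the gamma parameters (`shape`, `scale`), the item onset times, and the use of a plain product to combine the two probabilities are all assumptions made for illustration.

```python
import math

def gamma_pdf(x, shape, scale):
    """Density of a gamma distribution; zero for non-positive x."""
    if x <= 0:
        return 0.0
    return (x ** (shape - 1) * math.exp(-x / scale)) / (math.gamma(shape) * scale ** shape)

def select_item(item_start_times, utterance_time, asr_probs, shape=2.0, scale=0.5):
    """Return (index, posterior) of the most likely enumerated item.

    shape and scale are illustrative values, not parameters fitted in the paper.
    """
    # Timing likelihood: delay between each item's onset and the barge-in.
    timing = [gamma_pdf(utterance_time - t0, shape, scale) for t0 in item_start_times]
    # Combine with the ASR probabilities by a simple product (assumed rule).
    joint = [t * a for t, a in zip(timing, asr_probs)]
    total = sum(joint)
    if total == 0:
        raise ValueError("no item is consistent with the utterance timing")
    posterior = [j / total for j in joint]
    best = max(range(len(posterior)), key=posterior.__getitem__)
    return best, posterior

# Hypothetical enumeration: three items starting at 0 s, 2 s and 4 s.
starts = [0.0, 2.0, 4.0]
asr = [1 / 3, 1 / 3, 1 / 3]  # ASR alone cannot decide, e.g. "that one"
best, post = select_item(starts, utterance_time=2.8, asr_probs=asr)
print(best, [round(p, 3) for p in post])
```

With an uninformative ASR result, the timing term alone decides: an utterance shortly after the second item's onset favours that item, and an item that has not yet been read out gets zero probability.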

#2 System Request Detection in Human Conversation Based on Multi-Resolution Gabor Wavelet Features

Tomoyuki Yamagata (Kobe University)
Tetsuya Takiguchi (Kobe University)
Yasuo Ariki (Kobe University)

For a hands-free speech interface, it is important to detect commands in spontaneous utterances. Conventional voice activity detection systems can only distinguish speech frames from non-speech frames; they cannot discriminate whether the detected speech section is a command for a system or not. In this paper, in order to analyze the difference between system requests and spontaneous utterances, we focus on fluctuations over a long period, such as prosodic articulation, and fluctuations over a short period, such as phoneme articulation. Multi-resolution analysis using a Gabor wavelet on a log-scale Mel-frequency filter bank clarifies the different characteristics of system commands and spontaneous utterances. Experiments using our robot dialog corpus show that the proposed method achieves an F-measure of 92.6%, while the conventional power- and prosody-based method achieves just 66.7%.

#3 Using Graphical Models for Mixed-Initiative Dialog Management Systems with Realtime Policies

Stefan Schwärzler (Technische Universität München, Germany)
Stefan Maier (Technische Universität München, Germany)
Joachim Schenk (Technische Universität München, Germany)
Frank Wallhoff (Technische Universität München, Germany)
Gerhard Rigoll (Technische Universität München, Germany)

In this paper, we present a novel approach to dialog modeling that extends the idea underlying partially observable Markov decision processes (POMDPs), i.e. it allows the dialog policy to be calculated in real time and thereby increases the system's flexibility. The use of statistical dialog models is particularly advantageous for reacting adequately to common errors of speech recognition systems. Compared to the reference system (POMDP), we achieve a relative reduction of 31.6% in the average dialog length. Furthermore, the proposed system shows a relative improvement of 64.4% in the sensitivity rate of its error recognition capabilities at the same specificity rate in both systems. The results are based on the Air Travelling Information System with 21650 user utterances in 1585 natural spoken dialogs.

#4 Conversation Robot Participating in and Activating a Group Communication

Shinya Fujie (Waseda University)
Yoichi Matsuyama (Waseda University)
Hikaru Taniyama (Waseda University)
Tetsunori Kobayashi (Waseda University)

As a new type of application of the conversation system, a robot that activates communication among other parties has been developed. The robot participates in a quiz game with other participants and tries to activate the game. The functions installed in the robot are as follows: (1) The robot can participate in a group communication using its basic group conversation function. (2) The robot can play the game according to the rules of the game. (3) The robot can activate communication using appropriate actions depending on the game situation and the participants' situations. We conducted a real field experiment: the prototype system played a quiz game with elderly people in an adult day-care center. The robot successfully entertained the people with its one-hour demonstration.

#5 Recent Advances in WFST-based Dialog System

Chiori Hori (National Institute of Information and Communications Technology (NICT))
Kiyonori Ohtake (National Institute of Information and Communications Technology (NICT))
Teruhisa Misu (National Institute of Information and Communications Technology (NICT))
Hideki Kashioka (National Institute of Information and Communications Technology (NICT))
Satoshi Nakamura (National Institute of Information and Communications Technology (NICT))

We proposed a dialog system using a weighted finite-state transducer (WFST) in which user concept tags and system action tags are the input and output of the transducer, respectively. To test the potential of the WFST-based dialog management (DM) platform using statistical DM models, we constructed a dialog system using a human-to-human spoken dialog corpus for hotel reservation, which is annotated with Interchange Format (IF). A scenario WFST and a spoken language understanding (SLU) WFST were obtained from the corpus and then composed together and optimized. We evaluated the detection accuracy of the system's next actions. In this paper, we focus on how WFST optimization operations contribute to the performance of the system. In addition, we have constructed a full WFST-based dialog system by composing SLU, scenario and sentence generation (SG) WFSTs. We show an example of a hotel reservation dialog with the fully composed system and discuss future work.
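As a rough illustration of what composing an SLU transducer with a scenario transducer means, the following sketch composes two tiny epsilon-free WFSTs in the tropical semiring (weights add, lower is better). The dictionary representation, the words, the concept tags and the action tags are all hypothetical, and the optimization operations mentioned in the abstract (such as determinization and minimization) are not shown.

```python
from collections import deque

def compose(a, b):
    """Epsilon-free composition of two WFSTs in the tropical semiring."""
    start = (a["start"], b["start"])
    arcs, finals, seen, queue = [], set(), {start}, deque([start])
    while queue:
        qa, qb = queue.popleft()
        if qa in a["finals"] and qb in b["finals"]:
            finals.add((qa, qb))
        for (sa, ia, oa, wa, da) in a["arcs"]:
            if sa != qa:
                continue
            for (sb, ib, ob, wb, db) in b["arcs"]:
                # Match the output label of a with the input label of b.
                if sb == qb and ib == oa:
                    dst = (da, db)
                    arcs.append(((qa, qb), ia, ob, wa + wb, dst))
                    if dst not in seen:
                        seen.add(dst)
                        queue.append(dst)
    return {"start": start, "finals": finals, "arcs": arcs}

def best_path(t, inputs):
    """Return (cost, outputs) of the cheapest accepting path, or None."""
    frontier = {t["start"]: (0.0, [])}
    for sym in inputs:
        nxt = {}
        for state, (cost, outs) in frontier.items():
            for (s, i, o, w, d) in t["arcs"]:
                if s == state and i == sym:
                    cand = (cost + w, outs + [o])
                    if d not in nxt or cand[0] < nxt[d][0]:
                        nxt[d] = cand
        frontier = nxt
    accepted = [v for s, v in frontier.items() if s in t["finals"]]
    return min(accepted, key=lambda v: v[0]) if accepted else None

# Hypothetical SLU transducer: words -> concept tags. Arcs are (src, in, out, weight, dst).
slu = {"start": 0, "finals": {0},
       "arcs": [(0, "reserve", "REQ_RESERVE", 0.1, 0),
                (0, "twin", "ROOM_TYPE", 0.2, 0)]}
# Hypothetical scenario transducer: concept tags -> system action tags.
scenario = {"start": 0, "finals": {2},
            "arcs": [(0, "REQ_RESERVE", "ask_room_type", 0.0, 1),
                     (1, "ROOM_TYPE", "ask_dates", 0.0, 2)]}

dialog = compose(slu, scenario)
print(best_path(dialog, ["reserve", "twin"]))
```

The composed machine maps words directly to system actions, so a single search over one transducer replaces running the two models in sequence.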

#6 A Statistical Dialog Manager for the LUNA Project

David Griol (Universidad Carlos III de Madrid)
Giuseppe Riccardi (University of Trento)
Emilio Sanchis (Universitat Politecnica de Valencia)

In this paper, we present an approach for the development of a statistical dialog manager, in which the next system response is selected by means of a classification process that considers the entire previous history of the dialog. In particular, we use decision trees for its implementation. The statistical model is automatically learned from training data which are labeled in terms of different SLU features. This methodology has been applied to develop a dialog manager within the framework of the European LUNA project, whose main goal is the creation of a robust natural spoken language understanding system. We present an evaluation of this approach for both human-machine and human-human conversations acquired in this project. We demonstrate that a statistical dialog manager developed with the proposed technique and learned from a corpus of human-machine dialogs can successfully infer the task-related topics present in spontaneous human-human dialogs.

#7 A Policy-Switching Learning Approach for Adaptive Spoken Dialogue Agents

Heriberto Cuayáhuitl (Autonomous University of Tlaxcala)
Juventino Montiel-Hernández (Autonomous University of Tlaxcala)

The reinforcement learning paradigm has been adopted for inferring optimized and adaptive spoken dialogue agents. Such agents are typically learnt and tested without combining competing agents that may yield better performance at some points in the conversation. This paper presents an approach that learns dialogue behaviour from competing agents (switching from one policy to another competing one) on a previously proposed hierarchical learning framework. This policy-switching approach was investigated using a simulated flight booking dialogue system based on different types of information request. Experimental results showed that the induced agent using the proposed policy-switching approach yielded 8.2% fewer system actions than three baselines with a fixed type of information request. This result suggests that the proposed approach is useful for learning adaptive and scalable spoken dialogue agents.

#8 Strategies for Accelerating the Design of Dialogue Applications using Heuristic Information from the Backend Database

Luis Fernando D'Haro (Speech Technology Group. Universidad Politecnica de Madrid. Spain.)
Ricardo Cordoba (Speech Technology Group. Universidad Politecnica de Madrid. Spain.)
Ruben San-Segundo (Speech Technology Group. Universidad Politecnica de Madrid. Spain.)
Javier Macias-Guarasa (Speech Technology Group. Universidad Politecnica de Madrid. Spain.)
Jose Manuel Pardo (Speech Technology Group. Universidad Politecnica de Madrid. Spain.)

Current commercial and academic platforms for developing spoken dialogue applications lack acceleration strategies that use heuristic information from the contents or structure of the backend database to speed up the definition of the dialogue flow. In this paper we describe our attempts to take advantage of these information sources using the following strategies: the quick creation of classes and attributes to define the data model structure, the semi-automatic generation and debugging of database access functions, the automatic proposal of the slots that should preferably be requested using mixed-initiative forms or the slots that are better requested one by one using directed forms, and the generation of automatic state proposals to specify the transition network that defines the dialogue flow. Subjective and objective evaluations confirm the advantages of using the proposed strategies to simplify the design, and the high acceptance of the platform and its acceleration strategies.

#9 Feature-based Summary Space for Stochastic Dialogue Modeling with Hierarchical Semantic Frames

Florian Pinault (LIA - UAPV)
Fabrice Lefèvre (LIA - UAPV)
Renato De Mori (LIA - UAPV)

In a spoken dialogue system, the dialogue manager needs to make decisions in a highly noisy environment. This work addresses this issue by proposing a framework that interfaces efficient probabilistic modeling for both the spoken language understanding module and the dialogue management module. Hierarchical semantic frames are inferred and composed to build a thorough representation of the semantics of the user's utterance. This representation is then mapped into a feature-based summary space, in which the set of dialogue states used by the dialogue manager is defined, based on the POMDP paradigm. This allows the course of the dialogue to be planned while taking into account the uncertainty in the dialogue state, and tractability is ensured by the use of an intermediate summary space. A preliminary implementation of such a system is presented on the MEDIA domain. The task is tourist information and hotel reservation, and the availability of WoZ data makes it possible to consider a model-based approach to the POMDP dialogue manager.
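For readers unfamiliar with the POMDP paradigm referred to here, uncertainty in the dialogue state is usually tracked as a belief (a probability distribution over states) that is updated after each system action and observation. The sketch below shows the standard textbook belief update only; it is not the feature-based summary-space method of the paper, and the state names, action name and probabilities are invented for illustration.

```python
def belief_update(belief, action, observation, transition, observation_model):
    """Standard POMDP update: b'(s') is proportional to O(o | s', a) * sum_s T(s' | s, a) * b(s)."""
    states = list(belief)
    new = {}
    for s2 in states:
        # Prediction step: probability of reaching s2 from the current belief.
        predicted = sum(transition[(s, action)].get(s2, 0.0) * belief[s] for s in states)
        # Correction step: weight by how well s2 explains the observation.
        new[s2] = observation_model[(s2, action)].get(observation, 0.0) * predicted
    total = sum(new.values())
    if total == 0:
        raise ValueError("observation has zero probability under the model")
    return {s: p / total for s, p in new.items()}

# Hypothetical two-state example: the user's goal does not change between turns.
transition = {
    ("wants_hotel", "ask_goal"): {"wants_hotel": 1.0},
    ("wants_info", "ask_goal"): {"wants_info": 1.0},
}
# Noisy understanding: P(observed concept | true goal, action), assumed values.
observation_model = {
    ("wants_hotel", "ask_goal"): {"hotel": 0.8, "info": 0.2},
    ("wants_info", "ask_goal"): {"hotel": 0.3, "info": 0.7},
}

prior = {"wants_hotel": 0.5, "wants_info": 0.5}
posterior = belief_update(prior, "ask_goal", "hotel", transition, observation_model)
print(posterior)
```

The cost of this update grows with the number of states, which is one reason an intermediate summary space can help keep planning tractable.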

#10 Language Modeling and Dialog Management for Address Recognition

Rajesh Balchandran (IBM - T J Watson Research Center)
Rachevsky Leonid (IBM - T J Watson Research Center)
Larry Sansone (IBM - T J Watson Research Center)

This paper describes a language modeling and dialog management system for efficient and robust recognition of several arbitrarily ordered and inter-related components from very large datasets, such as complete addresses specified in a single sentence with address components in their natural sequence. A new two-pass speech recognition technique based on using multiple language models with embedded grammars is presented. Tests with this technique on a complete address recognition task yielded good results, and memory and CPU requirements are sufficiently low to make this technique viable for embedded environments. Additionally, a goal-oriented algorithm for dialog-based error recovery and disambiguation that does not require manual identification of all possible dialog situations is presented. The combined system yields very high task completion accuracy for only a few additional turns of interaction.

#11 A framework for rapid development of conversational natural language call routing systems for call centers

Ea-Ee Jan (IBM)
Hong-Kwang Kuo (IBM)
Osamuyimen Stewart (IBM)
David Lubensky (IBM)

A framework for rapid development of conversational natural language call routing systems is proposed. The framework cuts costs by using only scantily prepared business requirements to automatically create an initial prototype. Aside from clear targets (terminal routing classes), vague targets, which are variations of users’ incomplete (semantically overlapping) sentences, are enumerated. The vague targets can be derived from the confusion set of the semantic tokens of the clear targets. A disambiguation dialogue module is also generated automatically for each vague target; it consists of a prompt and grammar to guide the user from the vague target to one of its associated clear targets. Overall, our framework is able to reduce the human labor associated with developing an initial natural language call routing system from a few weeks to just a few hours. The experimental results from a deployed pilot system support the feasibility of our proposed approach.

#12 The MonAMI Reminder: a spoken dialogue system for face-to-face interaction

Jonas Beskow (KTH Speech Music & Hearing)
Jens Edlund (KTH Speech Music & Hearing)
Björn Granström (KTH Speech Music & Hearing)
Joakim Gustafson (KTH Speech Music & Hearing)
Gabriel Skantze (KTH Speech Music & Hearing)
Helena Tobiasson (KTH Human-Computer Interaction Group)

We describe the MonAMI Reminder, a multimodal spoken dialogue system which can assist elderly and disabled people in organising and initiating their daily activities. Based on deep interviews with potential users, we have designed a calendar and reminder application which uses an innovative mix of an embodied conversational agent, digital pen and paper, and the web to meet the needs of those users as well as the current constraints of speech technology. We also explore the use of head pose tracking for interaction and attention control in human-computer face-to-face interaction.

#13 Influence of Training on Direct and Indirect Measures for the Evaluation of Multimodal Systems

Julia Seebode (Research training group prometei, Berlin Institute of Technology, Germany)
Stefan Schaffer (Research training group prometei, Berlin Institute of Technology, Germany)
Ina Wechsung (Deutsche Telekom Laboratories, Berlin Institute of Technology, Germany)
Florian Metze (School of Computer Science, Carnegie Mellon University, Pittsburgh, USA)

Finding suitable evaluation methods is an indispensable task during the development of new user interfaces, as no standardized approach has so far been established, especially for multimodal interfaces. In the current study, we used several data sources (direct and indirect measurements) to evaluate a multimodal version of an information system, tested on trained and untrained users. We investigated the extent to which the different types of data showed concordance concerning the perceived quality of the system, in order to derive clues as to the suitability of the respective evaluation methods. The aim was to examine whether widely used methods not originally developed for multimodal interfaces are appropriate under these conditions, and to derive new evaluation paradigms.

#14 Talking Heads for Interacting with Spoken Dialog Smart-Home Systems

Christine Kühnel (Quality and Usability Lab, Deutsche Telekom Laboratories, TU Berlin)
Benjamin Weiss (Quality and Usability Lab, Deutsche Telekom Laboratories, TU Berlin)
Sebastian Möller (Quality and Usability Lab, Deutsche Telekom Laboratories, TU Berlin)

In this paper, the relation between the quality of a talking head as an output component of a spoken dialog system and the quality of the system itself is investigated. Results show that the quality of the talking head does indeed have an important impact on system quality. The quality of the talking head itself is found to be influenced by visual and speech quality and by the synchronization of voice and lip movement.

#15 Speech Generation from Hand Gestures Based on Space Mapping

Aki Kunikoshi (The University of Tokyo)
Yu Qiao (The University of Tokyo)
Nobuaki Minematsu (The University of Tokyo)
Keikichi Hirose (The University of Tokyo)

Individuals with speaking disabilities often use a TTS synthesizer for speech communication. Since users always have to type sound symbols and the synthesizer reads them out in a monotonous style, current synthesizers usually make real-time operation and lively communication difficult. In this paper, we develop a special glove that lets the wearer generate speech sounds from hand gesture transitions. For development, GMM-based voice conversion techniques are applied to estimate a mapping function between a space of hand gestures and another space of speech sounds. As an initial trial, a mapping between hand gestures and Japanese vowel sounds is estimated so that topological features of the selected gestures in a feature space and those of the five Japanese vowels in a cepstrum space are equalized. Experiments show that the special glove can generate good Japanese vowel transitions with voluntary control of duration and articulation.