|
10th Annual Conference of the International Speech Communication Association
Interspeech 2009 Brighton
|
Technical Programme
This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.
Mon-Ses3-P4: Applications in learning and other areas
| Time: | Monday 16:00 |
| Place: | Hewison Hall |
| Type: | Poster |
| Chair: | Nestor Becerra Yoma |
| #1 | Designing spoken tutorial dialogue with children to elicit predictable but educationally valuable responses
Gregory Aist (Carnegie Mellon University) Jack Mostow (Carnegie Mellon University)
How can spoken dialogue interactions with children be constructed so that they are both educationally effective and technically feasible? To address this challenge, we propose a design principle that constructs short dialogues in which (a) the user’s utterances are the external evidence of task performance or learning in the domain, and (b) the target utterances can be expressed as a well-defined set, in some cases even as a finite language (up to a small set of variables which may change from exercise to exercise). The key approach is to teach the human learner a parameterized process that maps input to response. We describe how the discovery of this design principle came out of analyzing the processes of automated tutoring for reading and pronunciation and designing dialogues to address vocabulary and comprehension, show how it also accurately describes the design of several other language tutoring interactions, and discuss how it could extend to non-language tutoring tasks.
|
| #2 | Optimizing non-native speech recognition for CALL applications
Joost van Doremalen (Centre for Language and Speech Technology, Radboud University Nijmegen) Helmer Strik (Centre for Language and Speech Technology, Radboud University Nijmegen) Catia Cucchiarini (Centre for Language and Speech Technology, Radboud University Nijmegen)
We are developing a Computer Assisted Language Learning (CALL) system that uses Automatic Speech Recognition (ASR) to give feedback on grammar and pronunciation. However, good-quality unconstrained non-native ASR is not yet feasible. Therefore, we use an approach in which we try to elicit constrained responses. The task in the current experiments is to select utterances from a list of responses. The results of our experiments show that significant improvements can be obtained by optimizing the language model and acoustic models. In this way we could reduce the utterance error rate from 29-26% to 10-8%.
|
| #3 | Evaluation of English Intonation based on Combination of Multiple Evaluation Scores
Akinori Ito (Graduate School of Engineering, Tohoku University) Tomoaki Konno (Graduate School of Engineering, Tohoku University) Masashi Ito (Graduate School of Engineering, Tohoku University) Shozo Makino (Graduate School of Engineering, Tohoku University)
In this paper, we propose a novel method for evaluating the intonation of an English utterance spoken by a learner, for use in intonation learning with a CALL system. The proposed method is based on an intonation evaluation method proposed by Suzuki et al., which uses “word importance factors” calculated from word clusters given by a decision tree. We extend Suzuki’s method so that multiple decision trees are used and the resulting intonation scores are combined using multiple regression. In an experiment, we obtained a correlation coefficient comparable to the correlation between human raters.
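The score-combination step in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names (`fit_combiner`, `combine`, `pearson`), the use of ordinary least squares with an intercept, and the toy scores are all assumptions made for illustration. Each utterance has one intonation score per decision tree, and the regression weights are fitted so that the combined score tracks human ratings.

```python
from statistics import mean

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_combiner(tree_scores, human_scores):
    """Least-squares weights (intercept first) from the normal equations."""
    rows = [[1.0] + list(r) for r in tree_scores]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, human_scores)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def combine(scores, weights):
    """Combined score for one utterance: intercept plus weighted tree scores."""
    return weights[0] + sum(w * s for w, s in zip(weights[1:], scores))

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Toy data (hypothetical): scores from two decision trees for five utterances.
tree_scores = [[3.0, 2.5], [1.0, 1.5], [4.0, 4.5], [2.0, 2.0], [5.0, 4.0]]
human_scores = [3.0, 1.0, 4.5, 2.0, 4.5]

weights = fit_combiner(tree_scores, human_scores)
combined = [combine(s, weights) for s in tree_scores]
correlation = pearson(combined, human_scores)
```

In practice the correlation would be measured on held-out utterances rather than on the data used to fit the weights, as is done here only to keep the example short.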
|
| #4 | A Language-Independent Feature Set for the Automatic Evaluation of Prosody
Andreas Maier (Universität Erlangen-Nürnberg, Lehrstuhl für Mustererkennung) Florian Hönig (Universität Erlangen-Nürnberg, Lehrstuhl für Mustererkennung) Viktor Zeissler (Universität Erlangen-Nürnberg, Lehrstuhl für Mustererkennung) Anton Batliner (Universität Erlangen-Nürnberg, Lehrstuhl für Mustererkennung) Erik Körner (Universität Erlangen-Nürnberg, Japanologie) Nobuyuki Yamanaka (Universität Erlangen-Nürnberg, Japanologie) Peter Ackermann (Universität Erlangen-Nürnberg, Japanologie) Elmar Nöth (Universität Erlangen-Nürnberg, Lehrstuhl für Mustererkennung)
In second language learning, the correct use of prosody plays a vital role. Therefore, an automatic method to evaluate the naturalness of the prosody of a speaker is desirable. We present a novel method to model prosody independently of the text and thus independently of the language as well. For this purpose, the voiced and unvoiced speech segments are extracted and a 187-dimensional feature vector is computed for each voiced segment. This approach is compared to word-based prosodic features on a German text passage. Both are compared with the perceptual evaluation of two native speakers of German. The word-based feature set yielded correlations of up to 0.92 while the text-independent feature set yielded 0.88. This is in the same range as the inter-rater correlation of 0.88.
|
| #5 | Adapting the Acoustic Model of a Speech Recognizer for Varied Proficiency Non-Native Spontaneous Speech Using Read Speech with Language-Specific Pronunciation Difficulty
Klaus Zechner (Educational Testing Service) Derrick Higgins (Educational Testing Service) Rene Lawless (Educational Testing Service) Yoko Futagi (Educational Testing Service) Sarah Ohls (Educational Testing Service) George Ivanov (Educational Testing Service)
This paper presents a novel approach to acoustic model adaptation of a recognizer for non-native spontaneous speech for candidates’ responses in a test of spoken English. Instead of transcribing spontaneous speech data, a read speech corpus is created where non-native speakers of English read English sentences of different degrees of pronunciation difficulty with respect to their native language. As a selection criterion we develop a novel score, the “phonetic challenge score”, consisting of a measure for native language-specific difficulties described in the second-language acquisition literature and also of a statistical measure based on the cross-entropy between phoneme sequences of the native language and English.
The results of using the read speech for AM adaptation of a recognizer for spontaneous non-native speech show a significant reduction of word error rate for two of four language groups of the spontaneous speech test set as well as for the entire test set.
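The statistical part of the “phonetic challenge score” is described only as being based on cross-entropy between phoneme sequences of the native language and English. The sketch below shows one possible reading of that idea, not the authors' formula: it scores an English sentence's phoneme sequence under a smoothed unigram phoneme model of the native language, so that sentences rich in unfamiliar phonemes score higher. The unigram model, the add-one smoothing, the function names, and the toy phoneme data are all assumptions.

```python
import math
from collections import Counter

def unigram_model(sequences, inventory, smoothing=1.0):
    """Smoothed unigram phoneme probabilities estimated from native-language data."""
    counts = Counter(p for seq in sequences for p in seq)
    total = sum(counts[p] for p in inventory) + smoothing * len(inventory)
    return {p: (counts[p] + smoothing) / total for p in inventory}

def cross_entropy(sequence, model):
    """Average negative log2 probability (bits per phoneme) of a sequence under the model."""
    return -sum(math.log2(model[p]) for p in sequence) / len(sequence)

# Toy data (hypothetical): the native language never uses the phoneme "th".
inventory = ["a", "k", "s", "th"]
native_sequences = [["k", "a", "s", "a"], ["s", "a", "k", "a"]]
native_model = unigram_model(native_sequences, inventory)

easy_sentence = ["k", "a", "s"]    # only phonemes common in the native language
hard_sentence = ["th", "a", "th"]  # dominated by a phoneme unseen in the native data

easy_score = cross_entropy(easy_sentence, native_model)
hard_score = cross_entropy(hard_sentence, native_model)
```

Under this reading, `hard_score` exceeds `easy_score`, so such a measure could rank candidate sentences by expected pronunciation difficulty for a given native language.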
|
| #6 | Analysis and Utilization of MLLR Speaker Adaptation Technique for Learners' Pronunciation Evaluation
Dean Luo (The University of Tokyo) Yu Qiao (The University of Tokyo) Nobuaki Minematsu (The University of Tokyo) Yutaka Yamauchi (Tokyo International University) Hirose Keikichi (The University of Tokyo)
In this paper, we investigate the effects and problems of MLLR speaker adaptation when applied to pronunciation evaluation. Automatic scoring and error detection experiments are conducted on two publicly available databases of Japanese learners’ English pronunciation. As we expected, over-adaptation causes misjudgment of pronunciation accuracy. Based on the analyses, two novel methods, Forced-aligned GOP score and Regularized-MLLR adaptation, are proposed to counter the adverse effects of MLLR adaptation. Experimental results show that the proposed methods can better utilize MLLR adaptation and avoid over-adaptation.
|
| #7 | Control of human generating force by use of acoustic information – Study on Onomatopoeic utterances for controlling small lifting-force
Miki Iimura (School of Engineering, Tokyo Denki University) Taichi Sato (School of Engineering, Tokyo Denki University) Kihachiro Tanaka (Faculty of Engineering, Saitama University)
We have conducted basic experiments on applying acoustic information to engineering problems. We asked the subjects to execute lifting actions while listening to sounds and measured the resultant lifting-force.
As the sounds presented to the subjects, we used human onomatopoeic utterances intended to make their lifting-force small.
In particular, we focused on the “emotion” or “nuance” contained in human utterances, which is a unique characteristic evoked by the utterance’s acoustical features. We found that the emotion or nuance can control the lifting-force effectively. We also clarified the acoustical features that are responsible for effective control of the lifting-force exerted by humans.
|
| #8 | Mi-DJ: a multi-source intelligent DJ service
Ching-Hsien Lee (researcher) Hsu-Chih Wu (researcher)
In this paper, a Multi-source intelligent DJ (Mi-DJ) service is introduced. It is an audio program platform that integrates different media types, including audio and text format content. It acts like a DJ who plays a personalized audio program to users whenever and wherever they need it. The audio program is automatically generated and comprises several audio clips, all of which come from either existing audio files or text information, such as e-mail, calendar, news or user-preferred articles. Our unique program generation technology makes users feel that they are listening to a well-organized program instead of several separate audio files. The program can be organized dynamically, which realizes a context-aware service based on location, the user's schedule, or other user preferences. With appropriate data management, text processing and speech synthesis technologies, Mi-DJ can be applied to many application scenarios. For example, it can be applied in language learning and tour guidance.
|
| #9 | Human Voice or Prompt Generation? Can they Co-exist in an Application?
Géza Németh (Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics) Csaba Zainkó (Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics) Mátyás Bartalis (Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics) Gábor Olaszy (Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics) Géza Kiss (Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics)
This paper describes an R&D project regarding procedures for the automatic maintenance of the interactive voice response (IVR) system of a mobile telecom operator. The original plan was to create a generic voice prompt generation system for the customer service department. The challenge was to create a solution that is hard to distinguish from the human speaker (i.e. passing a sort of Turing test) so that its output can be freely mixed with original human recordings. As a first step, the domain of the solution had to be narrowed down to the price list of available mobile phones and services. This is updated weekly, so the final operational system generates about 3 hours of speech each weekend. It operates under human supervision but without intervention in the speech generation process. It was tested both by academic procedures and by company customers and was accepted as fulfilling the original requirements.
|
| #10 | Automatic vs. human question answering over multimedia meeting recordings
Quoc Anh Le (University of Namur) Andrei Popescu-Belis (Idiap Research Institute)
Information access in meeting recordings can be assisted by meeting browsers, or can be fully automated following a question-answering (QA) approach. An information access task is defined, aiming at discriminating true vs. false parallel statements about facts in meetings. An automatic QA algorithm is applied to this task, using passage retrieval over a meeting transcript. The algorithm scores 59% accuracy for passage retrieval, while random guessing is below 1%, but only scores 60% on combined retrieval and question discrimination, for which humans reach 70%-80% and the baseline is 50%. The algorithm clearly outperforms humans for speed, at less than 1 second per question, vs. 1.5-2 minutes per question for humans. With ASR output instead of manual transcripts, the scores are lower but still acceptable, especially for passage identification. Automatic QA thus appears to be a promising enhancement to meeting browsers used by humans, as an assistant for relevant passage identification.
|