Technical Programme
This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.
Tue-Ses2-O1:
Automotive and Mobile applications
| Time: | Tuesday 13:30 |
| Place: | Main Hall |
| Type: | Oral |
| Chair: | Kate Knill |
| 13:30 | Fast Speech Recognition for Voice Destination Entry in a Car Navigation System
Hoon Chung (ETRI) Jeon Gue Park (ETRI) Hyeon Bae Jeon (ETRI) Yun Keun Lee (ETRI)
In this paper, we introduce a multi-stage decoding algorithm optimized to recognize a very large number of entry names on a resource-limited embedded device. The algorithm is composed of a two-stage HMM-based coarse search and a detailed search. The coarse search generates a small set of candidates that are assumed to contain the correct hypothesis with high probability, and the detailed search re-ranks the candidates by rescoring them with sophisticated acoustic models. We conduct experiments with one million point-of-interest (POI) names on an in-car navigation device with a fixed-point processor running at 620 MHz. The experimental results show that the multi-stage decoding algorithm runs at about 2.23 times real-time on the device without serious degradation of recognition performance.
|
| 13:50 | Improving Perceived Accuracy for In-Car Media Search
Yun-Cheng Ju (Microsoft Research) Michael Seltzer (Microsoft Research) Ivan Tashev (Microsoft Research)
Speech recognition technology is prone to mistakes, but this is not the only source of errors that cause speech recognition systems to fail; sometimes the user simply does not utter the command correctly. Usually, user mistakes are not considered when a system is designed and evaluated. This creates a gap between the claimed accuracy of the system and the actual accuracy perceived by the users. We address this issue quantitatively in our in-car infotainment media search task and propose expanding the capability of voice command to accommodate user mistakes while retaining a high percentage of the performance for queries with correct syntax. As a result, failures caused by user mistakes were reduced by an absolute 70% at the cost of a drop in accuracy of only 0.28%.
|
| 14:10 | Laying the Foundation for In-car Alcohol Detection by Speech
Florian Schiel (Bavarian Archive for Speech Signals, Ludwig-Maximilians-Universität München) Christian Heinrich (Bavarian Archive for Speech Signals, Ludwig-Maximilians-Universität München)
The fact that an increasing number of functions in the automobile are and will be controlled by the driver's speech raises the question of whether this speech input may be used to detect possible alcoholic intoxication of the driver. To this end, a large part of the new Alcohol Language Corpus (ALC) edited by the Bavarian Archive for Speech Signals (BAS) will be used for a broad statistical investigation of possible feature candidates for classification. In this contribution we present the motivation and the design of the ALC corpus as well as first results from fundamental frequency and rhythm analysis. Our analysis, comparing sober and intoxicated speech of the same individuals, suggests that there are in fact promising features that can be derived automatically from the speech signal during the speech recognition process and that indicate intoxication for most speakers.
|
| 14:30 | A Voice Search Approach to Replying to SMS Messages in Automobiles
Yun-Cheng Ju (Microsoft Research) Tim Paek (Microsoft Research)
Automotive infotainment systems now provide drivers the ability to hear incoming Short Message Service (SMS) text messages using text-to-speech. However, the question of how best to allow users to respond to these messages using speech recognition remains unsettled. In this paper, we propose a robust voice search approach to replying to SMS messages based on template matching. The templates are empirically derived from a large SMS corpus and matches are accurately retrieved using a vector space model. In evaluating SMS replies within the acoustically challenging environment of automobiles, the voice search approach consistently outperformed using just the recognition results of a statistical language model or a probabilistic context-free grammar. For SMS replies covered by our templates, the approach achieved as high as 89.7% task completion when evaluating the top five reply candidates.
|
| 14:50 | Language Modeling for What-with-Where on GOOG-411
Charl van Heerden (Meraka Institute) Johan Schalkwyk (Google Inc.) Brian Strope (Google Inc.)
This paper describes the language modeling architectures and recognition experiments that enabled support of 'what-with-where' queries on GOOG-411.
First, we compare accuracy trade-offs between using a single national business LM for business queries and using many small models adapted for particular cities. Experimental evaluations show that both approaches lead to comparable overall accuracy. Differences in the distributions of errors also lead to improvements from a simple combination. We then optimize variants of the national business LM in the context of combined business and location queries from the web, and finally evaluate these models on a recognition test from the recently fielded 'what-with-where' system.
|
| 15:10 | Very Large Vocabulary Voice Dictation for Mobile Devices
Jan Nouza (SpeechLab, Institute of Information Technology and Electronics, Technical University of Liberec, 461 17 Liberec, Czech Republic) Petr Cerva (SpeechLab, Institute of Information Technology and Electronics, Technical University of Liberec, 461 17 Liberec, Czech Republic) Jindrich Zdansky (SpeechLab, Institute of Information Technology and Electronics, Technical University of Liberec, 461 17 Liberec, Czech Republic)
This paper deals with optimization techniques that can make very large vocabulary voice dictation applications deployable on recent mobile devices. We focus in particular on optimization of signal parameterization (frame rate, FFT calculation, fixed-point representation) and on efficient pruning techniques employed at the state and Gaussian mixture levels. We demonstrate the applicability of the proposed techniques on the practical design of an embedded 255K-word discrete dictation program developed for Czech. Its real performance is comparable to a client-server version of the fluent dictation program implemented on the same mobile device.
|