Brighton Pavilion

10th Annual Conference of the International Speech Communication Association

ISCA Interspeech 2009 Brighton

Technical Programme

This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.

Mon-Ses3-S1:
Special Session: Silent Speech Interfaces

Time: Monday 16:00 Place: East Wing 4 Type: Special
Chair: Bruce Denby & Tanja Schultz

Visuo-Phonetic Decoding using Multi-Stream and Context-Dependent Models for an Ultrasound-based Silent Speech Interface

Thomas Hueber (ESPCI/Telecom ParisTech)
Elie-Laurent Benaroya (ESPCI ParisTech)
Gérard Chollet (LTCI/CNRS Telecom ParisTech)
Bruce Denby (UPMC Paris VI - ESPCI ParisTech)
Gérard Dreyfus (Laboratoire d'Electronique - ESPCI ParisTech)
Maureen Stone (University of Maryland Dental School)

Recent improvements are presented for phonetic decoding of continuous speech from ultrasound and optical observations of the tongue and lips in a silent speech interface application. In a new approach to this critical step, the visual streams are modeled by context-dependent multi-stream Hidden Markov Models (CD-MSHMM). Results are compared to a baseline system using context-independent modeling and a visual feature fusion strategy, with both systems evaluated on a one-hour, phonetically balanced English speech database. Tongue and lip images are coded using PCA-based feature extraction techniques. The uttered speech signal, also recorded, is used to initialize the training of the visual HMMs. Visual phonetic decoding performance is evaluated successively with and without the help of linguistic constraints introduced via a 2.5k-word decoding dictionary.

Disordered Speech Recognition Using Acoustic and sEMG Signals

Yunbin Deng (BAE Systems, Inc, Advanced Information Technologies)
Rupal Patel (Communication Analysis & Design Lab, Northeastern University)
James T. Heaton (Center for Laryngeal Surgery & Voice Rehabilitation, Mass. General Hospital)
Glen Colby (BAE Systems, Inc, Advanced Information Technologies)
L. Donald Gilmore (Delsys, Inc.)
Joao Cabrera (BAE Systems, Inc, Advanced Information Technologies)
Serge H. Roy (Delsys, Inc.)
Carlo J. De Luca (Delsys, Inc.)
Geoffrey S. Meltzner (BAE Systems, Inc, Advanced Information Technologies)

Parallel isolated word corpora were collected from healthy speakers and individuals with speech impairment due to stroke or cerebral palsy. Surface electromyographic (sEMG) signals were collected for both vocalized and mouthed speech production modes. Pioneering work on disordered speech recognition using the acoustic signal, the sEMG signals, and their fusion is reported. Results indicate that speaker-dependent isolated-word recognition from the sEMG signals of articulator muscle groups during vocalized disordered-speech production was highly effective. However, word recognition accuracy for mouthed speech was much lower, likely related to the fact that some disordered speakers had considerable difficulty producing consistent mouthed speech. Further development of the sEMG-based speech recognition systems is needed to increase usability and robustness.

Multimodal HMM-based NAM-to-speech conversion

Viet-Anh TRAN (GIPSA-Lab, Département Parole & Cognition, UMR n°5216 CNRS/INPG/UJF/U. Stendhal, France)
Gérard BAILLY (GIPSA-Lab, Département Parole & Cognition, UMR n°5216 CNRS/INPG/UJF/U. Stendhal, France)
Hélène LOEVENBRUCK (GIPSA-Lab, Département Parole & Cognition, UMR n°5216 CNRS/INPG/UJF/U. Stendhal, France)
Tomoki TODA (NAIST (NAra Institute of Science and Technology), Japan)

Although the segmental intelligibility of converted speech from silent speech using direct signal-to-signal mapping proposed by Toda et al. is quite acceptable, listeners sometimes have difficulty chunking the speech continuum into meaningful words because the output signals provide incomplete phonetic cues. This paper studies an alternative approach that combines HMM-based statistical speech recognition and synthesis techniques, as well as training on aligned corpora, to convert silent speech to audible voice.

Technologies for Processing Body-Conducted Speech Detected with a Non-Audible Murmur Microphone

Tomoki Toda (Nara Institute of Science and Technology)
Keigo Nakamura (Nara Institute of Science and Technology)
Takayuki Nagai (Nara Institute of Science and Technology)
Tomomi Kaino (Nara Institute of Science and Technology)
Yoshitaka Nakajima (Nara Institute of Science and Technology)
Kiyohiro Shikano (Nara Institute of Science and Technology)

In this paper, we review our recent research on technologies for processing body-conducted speech detected with a Non-Audible Murmur (NAM) microphone. The NAM microphone enables us to detect various types of body-conducted speech, such as extremely soft whisper and normal speech. Moreover, it is robust against external noise due to its noise-proof structure. To make speech communication more universal by effectively using these properties of the NAM microphone, we have so far developed two main technologies: one is body-conducted speech conversion for human-to-human speech communication, and the other is body-conducted speech recognition for man-machine speech communication. This paper gives an overview of these technologies and presents our new attempts to investigate the effectiveness of body-conducted speech recognition.

Impact of Different Speaking Modes on EMG-based Speech Recognition

Michael Wand (Cognitive Systems Lab, University of Karlsruhe, Germany)
Szu-Chen Stan Jou (ATC, ICL, Industrial Technology Research Institute, Taiwan)
Arthur R. Toth (Cognitive Systems Lab, University of Karlsruhe, Germany)
Tanja Schultz (Cognitive Systems Lab, University of Karlsruhe, Germany)

We present our recent results on speech recognition by surface electromyography (EMG), which captures the electric potentials generated by the human articulatory muscles. This technique can be used to enable Silent Speech Interfaces, since EMG signals are generated even when people only articulate speech without producing any sound. Preliminary experiments have shown that the EMG signals created by audible and silent speech are quite distinct. In this paper we first compare various methods of initializing a silent speech EMG recognizer, showing that the performance of the recognizer varies substantially across different speakers. Based on this, we analyze EMG signals from audible and silent speech, present first results on how discrepancies between these speaking modes affect EMG recognizers, and suggest areas for future work.

Artificial speech synthesizer control by brain-computer interface

Jonathan S. Brumberg (Boston University; Neural Signals, Inc.)
Philip R. Kennedy (Neural Signals, Inc.)
Frank H. Guenther (Boston University; Harvard University; MIT)

We developed and tested a brain-computer interface for control of an artificial speech synthesizer by an individual with near complete paralysis. This neural prosthesis for speech restoration is currently capable of predicting vowel formant frequencies based on neural activity recorded from an intracortical microelectrode implanted in the left hemisphere speech motor cortex. Using instantaneous auditory feedback (< 50 ms) of predicted formant frequencies, the study participant has been able to correctly perform a vowel production task at a maximum rate of 80-90% correct.

Synthesizing Speech from Electromyography using Voice Transformation Techniques

Arthur R. Toth (University of Karlsruhe)
Michael Wand (University of Karlsruhe)
Tanja Schultz (University of Karlsruhe)

Surface electromyography (EMG) can be used to record the activation potentials of articulatory muscles while a person speaks. It could enable silent speech interfaces, as EMG signals are generated even when people pantomime speech noiselessly. Having effective silent speech interfaces would enable a number of compelling applications, allowing people to communicate in areas where they would not want to be overheard or could not be heard. In order to use EMG signals in speech interfaces, however, there must be a relatively accurate method to map the signals to speech. Most previous attempts to use EMG signals for speech interfaces appear to focus on Automatic Speech Recognition (ASR) based on features derived from EMG signals. We explore the alternative idea of using Voice Transformation (VT) techniques to synthesize speech from EMG signals. We report the results of our preliminary studies, noting the difficulties we encountered and suggesting future work.

16:00 Characterizing Silent and Pseudo-Silent Speech using Radar-like Sensors

John Holzrichter (Hertz Foundation)

Radar-like sensors enable the measurement of speech articulator conditions, especially their shape changes and contact events, during both silent and normal speech. Such information can be used to associate articulator conditions with digital “codes” for use in communications, machine control, speech masking or canceling, and other applications.