|
10th Annual Conference of the International Speech Communication Association
Interspeech 2009 Brighton
|
Technical Programme
This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.
Mon-Ses2-P2: Accent and Language Recognition
| Time: | Monday 13:30 |
Place: | Hewison Hall |
Type: | Poster |
| Chair: | William Campbell |
| #1 | Factor Analysis and SVM for Language Recognition
Florian Verdet (Université d'Avignon et des Pays du Vaucluse, Laboratoire Informatique d'Avignon, Avignon, France and Département d'Informatique, Université de Fribourg, Fribourg, Switzerland) Driss Matrouf (Université d'Avignon et des Pays du Vaucluse, Laboratoire Informatique d'Avignon, Avignon, France) Jean-François Bonastre (Université d'Avignon et des Pays du Vaucluse, Laboratoire Informatique d'Avignon, Avignon, France) Jean Hennebert (Département d'Informatique, Université de Fribourg, Fribourg, Switzerland)
Statistical classifiers operate on features that generally include both useful and useless information. These two types of information are difficult to separate in the feature domain. Recently, a new paradigm based on Factor Analysis (FA) proposed a model decomposition into useful and useless components. This method has successfully been applied to speaker recognition tasks.
In this paper, we study the use of FA for language recognition. We propose a classification method based on SDC features and Gaussian Mixture Models (GMM). We present well-performing systems using Factor Analysis and FA-based Support Vector Machine (SVM) classifiers.
Experiments are conducted using NIST LRE 2005's primary condition. The relative equal error rate reduction obtained by the best factor analysis configuration with respect to the baseline GMM-UBM system is over 60%, corresponding to an EER of 6.59%.
|
| #2 | Exploring Universal Attribute Characterization of Spoken Languages for Spoken Language Recognition
Sabato Marco Siniscalchi (NTNU) Jeremy Reed (Georgia Institute of Technology) Torbjørn Svendsen (NTNU) Chin-Hui Lee (Georgia Institute of Technology)
We propose a novel universal acoustic characterization approach to spoken language identification (LID), in which any spoken language is described with a common set of fundamental units defined "universally." Specifically, manner and place of articulation form this unit inventory and are used to build a set of universal attribute models with data-driven techniques. Using vector space modeling approaches to LID, a spoken utterance is first decoded into a sequence of attributes. Then, a feature vector consisting of co-occurrence statistics of attribute units is created, and the final LID decision is made with a set of vector space language classifiers. Although the present study is still at a preliminary stage, promising results comparable to acoustically rich phone-based LID systems have already been obtained on the NIST 2003 LID task. The results provide clear insight for further performance improvements and encourage continued exploration of the proposed framework.
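As a rough illustration of the feature-vector step described in this abstract, the sketch below turns a decoded attribute sequence into normalised unigram and bigram co-occurrence frequencies. The attribute inventory, the function name `cooccurrence_vector`, and the choice of unigram plus bigram statistics are assumptions made for illustration; the abstract does not specify the paper's actual inventory or statistics.

```python
from collections import Counter
from itertools import product

# Hypothetical attribute inventory (illustrative only; not the paper's actual set).
ATTRIBUTES = ["vowel", "stop", "fricative", "nasal"]


def cooccurrence_vector(sequence, inventory=ATTRIBUTES):
    """Return normalised unigram and bigram frequencies in a fixed order."""
    unigrams = Counter(sequence)
    bigrams = Counter(zip(sequence, sequence[1:]))
    n_uni = max(len(sequence), 1)      # avoid division by zero on empty input
    n_bi = max(len(sequence) - 1, 1)
    # One entry per attribute, then one entry per ordered attribute pair.
    vector = [unigrams[a] / n_uni for a in inventory]
    vector += [bigrams[pair] / n_bi for pair in product(inventory, repeat=2)]
    return vector


# A made-up decoded utterance, standing in for the output of an attribute decoder.
decoded = ["stop", "vowel", "nasal", "vowel", "fricative"]
features = cooccurrence_vector(decoded)
```

A vector of this kind has a fixed length regardless of utterance duration, which is what allows it to be passed to a standard vector space classifier.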
|
| #3 | On the use of Phonological Features for Automatic Accent Analysis
Abhijeet Sangwan (Center for Robust Speech Systems) John Hansen (Center for Robust Speech Systems)
In this paper, we present an automatic accent analysis system that is based on phonological features (PFs). The proposed system exploits the knowledge of articulation embedded in phonology by rapidly building Markov models (MMs) of PFs extracted from accented speech. The Markov models capture information in the PF space along two dimensions of articulation: PF state-transitions and state-durations. Furthermore, by utilizing MMs of native and non-native accents, a new statistical measure of “accentedness” is developed which rates the articulation of a word on a scale of native-like (−1) to non-native-like (+1). The proposed methodology is then used to perform an automatic cross-sectional study of accented English spoken by native speakers of Mandarin Chinese (N-MC). The work developed in this paper is easily assimilated into language learning systems, and has impact in the areas of speaker recognition and ASR (automatic speech recognition).
|
| #4 | Language Recognition Using Language Factors
Fabio Castaldo (Politecnico di Torino) Sandro Cumani (Politecnico di Torino) Pietro Laface (Politecnico di Torino) Daniele Colibro (Loquendo)
Language recognition systems based on acoustic models reach state-of-the-art performance using discriminative training techniques.
In speaker recognition, eigenvoice modeling of the speaker and the use of speaker factors as input features to SVMs have recently been demonstrated to give good results compared to the standard GMM-SVM approach, which combines GMM supervectors and SVMs. In this paper we propose, in analogy to the eigenvoice modeling approach, to estimate an eigen-language space, and to use the language factors as input features to SVM classifiers. Since language factors are low-dimensional vectors, training and evaluating SVMs with different kernels and with large numbers of training examples becomes an easy task.
This approach is demonstrated on the 14 languages of the NIST 2007 language recognition task, and shows performance improvements with respect to the standard GMM-SVM technique.
|
| #5 | Automatic Accent Detection: Effect of Base Units and Boundary Information
Je Hun Jeon (The University of Texas at Dallas) Yang Liu (The University of Texas at Dallas)
Automatic prominence or pitch accent detection is important as it can perform automatic prosodic annotation of speech corpora, as well as provide additional features in other tasks such as keyword detection. In this paper, we evaluate how accent detection performance changes according to different base units and what kind of boundary information is available. We compare word, syllable, and vowel-based units when their boundaries are provided. We also automatically estimate syllable boundaries using energy contours when phone-level alignment is available. In addition, we utilize a sliding window with fixed length under the condition of unknown boundaries. Our experiments show that when boundary information is available, using a longer base unit achieves better performance. In the case of no boundary information, using a moving window with a fixed size achieves similar performance to using syllable information on word-level evaluation, suggesting that accent detection can be performed without relying on a speech recognizer to generate boundaries.
|
| #6 | Age Verification Using a Hybrid Speech Processing Approach
Ron M Hecht (PuddingMedia) Omer Hezroni (PuddingMedia) Amit Manna (PuddingMedia) Ruth Aloni-Lavi (PuddingMedia) Gil Dobry (PuddingMedia) Amir Alfandary (Nice systems) Yaniv Zigel (Bio-medical Engineering Dept., Ben-Gurion University)
The human speech production system is a multi-level system. On the upper level, it starts with information that one wants to transmit. It ends on the lower level with the materialization of the information into a speech signal. Most of the recent work conducted in age estimation is focused on the lower, acoustic level. In this research the upper, lexical level information is utilized for age-group verification, and it is shown that one's vocabulary reflects one's age. Several age-group verification systems that are based on automatic transcripts are proposed. In addition, a hybrid approach is introduced that combines the word-based system and an acoustic-based system. Experiments were conducted on a verification task with four age groups using the Fisher corpora, where an average equal error rate (EER) of 28.7% was achieved using the lexical-based approach and 28.0% using an acoustic approach. By merging these two approaches the verification error was reduced to 24.1%.
|
| #7 | Information Bottleneck Based Age Verification
Ron M Hecht (PuddingMedia, Kfar-Saba, Israel) Omer Hezroni (PuddingMedia, Kfar-Saba, Israel) Amit Manna (PuddingMedia, Kfar-Saba, Israel) Gil Dobry (Bio-medical Engineering Department, Ben-Gurion University, Beer-Sheva, Israel) Yaniv Zigel (Bio-medical Engineering Department, Ben-Gurion University, Beer-Sheva, Israel) Naftali Tishby (School of Engineering and Computer Science, Hebrew University, Jerusalem, Israel)
Word N-gram models can be used for word-based age-group verification. In this paper the agglomerative information bottleneck (AIB) approach is used to tackle one of the most fundamental drawbacks of word N-gram models: their abundance of irrelevant information. It is demonstrated that irrelevant information can be omitted by joining words to form word-clusters; this provides a mechanism to transform any sequence of words to a sequence of word-cluster labels. Consequently, word N-gram models are converted to word-cluster N-gram models, which are more compact. Age verification experiments were conducted on the Fisher corpora. Their goal was to verify the age-group of the speaker of an unknown speech segment. In these experiments an N-gram model was compressed to a fifth of its original size without reducing the verification performance. In addition, a verification accuracy improvement is demonstrated by discarding irrelevant information.
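The word-to-cluster transformation described in this abstract can be sketched as follows. This is only an illustration of relabelling and counting: the mapping `WORD_TO_CLUSTER` is invented by hand, whereas in the paper the clusters are produced by the AIB procedure, which is not reproduced here. All names in the block are hypothetical.

```python
from collections import Counter

# Hypothetical hand-made clusters; the paper derives its clusters with AIB.
WORD_TO_CLUSTER = {
    "mom": "C1",
    "dad": "C1",
    "homework": "C2",
    "school": "C2",
    "mortgage": "C3",
}


def to_cluster_labels(words, mapping=WORD_TO_CLUSTER, unknown="C_UNK"):
    """Replace each word with its cluster label (unknown words share one label)."""
    return [mapping.get(word, unknown) for word in words]


def ngram_counts(tokens, n=2):
    """Count the N-grams of order n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


words = ["mom", "school", "dad", "homework"]
labels = to_cluster_labels(words)
word_bigrams = ngram_counts(words)
cluster_bigrams = ngram_counts(labels)
```

In this toy input the three distinct word bigrams collapse into two distinct cluster bigrams, which shows in miniature why a word-cluster N-gram model can be more compact than the word N-gram model it replaces.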
|
| #8 | Discriminative N-gram Selection for Dialect Recognition
Fred Richardson (MIT Lincoln Laboratory) William Campbell (MIT Lincoln Laboratory) Pedro Torres-Carrasquillo (MIT Lincoln Laboratory)
Dialect recognition is a challenging and multifaceted problem. Distinguishing between dialects can rely upon many tiers of interpretation of speech data, e.g., prosodic, phonetic, spectral, and word. High-accuracy automatic methods for dialect recognition typically use either phonetic or spectral characteristics of the input. A challenge with spectral systems, such as those based on shifted-delta cepstral coefficients, is that they achieve good performance but do not provide insight into distinctive dialect features. In this work, a novel method based upon discriminative training and phone N-grams is proposed. This approach achieves excellent classification performance, fuses well with other systems, and has interpretable dialect characteristics in the phonetic tier. The method is demonstrated on data from the LDC and prior NIST language recognition evaluations. The method is also combined with spectral methods to demonstrate state-of-the-art performance in dialect recognition.
|
| #9 | Data-driven Phonetic Comparison and Conversion between South African, British and American English Pronunciations
Linsen Loots (Department of Electrical and Electronic Engineering, Stellenbosch University, South Africa) Thomas Niesler (Department of Electrical and Electronic Engineering, Stellenbosch University, South Africa)
We analyse pronunciations in American, British and South African English pronunciation dictionaries. Three analyses are performed. First, the accuracy is determined with which decision tree based grapheme-to-phoneme (G2P) conversion can be applied to each accent. It is found that there is little difference between the accents in this regard. Secondly, pronunciations are compared by performing pairwise alignments between the accents. Here we find that South African English pronunciation most closely matches British English. Finally, we apply decision trees to the conversion of pronunciations from one accent to another. We find that pronunciations of unknown words can be more accurately determined from a known pronunciation in a different accent than by means of G2P methods. This has important implications for the development of pronunciation dictionaries in less-resourced varieties of English, and hence also for the development of ASR systems.
|
| #10 | Target-Aware Language Models for Spoken Language Recognition
Rong Tong (Institute for Infocomm Research, Singapore) Bin Ma (Institute for Infocomm Research, Singapore) Haizhou Li (Institute for Infocomm Research, Singapore) Eng Siong Chng (Nanyang Technological University, Singapore)
This paper studies a way of constructing multiple phone tokenizers for language recognition. In this approach, each phone tokenizer for a target language will share a common set of acoustic models, while each will have a unique phone-based language model (LM) trained for a specific target language. The target-aware language models (TALM) are constructed to capture the discriminative ability of individual phones for the desired target languages. The parallel phone tokenizers thus formed are shown to achieve better performance than the original phone recognizer. The proposed TALM is very different from the LM in the traditional PPRLM technique, as the TALM applies the LM information in the front-end while the PPRLM approach uses an LM in the system back-end. Furthermore, the TALM exploits the discriminative phone occurrence statistics, which are different from the traditional n-gram statistics in the PPRLM approach. A novel way of training TALM is also studied in this paper.
|
| #11 | Language Identification for Speech-to-Speech Translation
Daniel Chung Yong Lim (Language Technologies Institute, Carnegie Mellon University) Ian Lane (Language Technologies Institute, Carnegie Mellon University)
This paper investigates the use of language identification (LID) in real-time speech-to-speech translation systems. We propose a framework that incorporates LID capability into a speech-to-speech translation system while minimizing the impact on the system’s real-time performance. We compared two phone-based LID approaches, namely PRLM and PPRLM, to a proposed extended approach based on Conditional Random Field classifiers. The performances of these three approaches were evaluated to identify the input language in the CMU English-Iraqi TransTAC system, and the proposed approach obtained significantly higher classification accuracies on two of the three test sets evaluated.
|
| #12 | Using Prosody and Phonotactics in Arabic Dialect Identification
Fadi Biadsy (Columbia University) Julia Hirschberg (Columbia University)
While Modern Standard Arabic is the formal spoken and written language of the Arab world, dialects are the major communication mode for everyday life; identifying a speaker’s dialect is thus critical to speech processing tasks such as automatic speech recognition, as well as speaker identification. We examine the role of prosodic features (intonation and rhythm) across four Arabic dialects (Gulf, Iraqi, Levantine, and Egyptian) for the purpose of automatic dialect identification. We show that prosodic features can significantly improve identification over a purely phonotactic-based approach, with an identification accuracy of 86.33% for 2-minute utterances.
|