10th Annual Conference of the International Speech Communication Association

ISCA Interspeech 2009 Brighton

Technical Programme

This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.

Mon-Ses2-P2:
Accent and Language Recognition

Time: Monday 13:30 Place: Hewison Hall Type: Poster
Chair: William Campbell

#1 Factor Analysis and SVM for Language Recognition

Florian Verdet (Université d'Avignon et des Pays du Vaucluse, Laboratoire Informatique d'Avignon, Avignon, France and Département d'Informatique, Université de Fribourg, Fribourg, Switzerland)
Driss Matrouf (Université d'Avignon et des Pays du Vaucluse, Laboratoire Informatique d'Avignon, Avignon, France)
Jean-François Bonastre (Université d'Avignon et des Pays du Vaucluse, Laboratoire Informatique d'Avignon, Avignon, France)
Jean Hennebert (Département d'Informatique, Université de Fribourg, Fribourg, Switzerland)

Statistical classifiers operate on features that generally include both useful and useless information. These two types of information are difficult to separate in the feature domain. Recently, a new paradigm based on Factor Analysis (FA) proposed a model decomposition into useful and useless components. This method has successfully been applied to speaker recognition tasks. In this paper, we study the use of FA for language recognition. We propose a classification method based on SDC features and Gaussian Mixture Models (GMM). We present well-performing systems using Factor Analysis and FA-based Support Vector Machine (SVM) classifiers. Experiments are conducted using NIST LRE 2005's primary condition. The relative equal error rate (EER) reduction obtained by the best factor analysis configuration with respect to the baseline GMM-UBM system is over 60%, corresponding to an EER of 6.59%.
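
As a reading aid for the figures quoted in this abstract, the following is a minimal sketch of how a relative EER reduction is conventionally computed, and of the baseline bound that follows from the two reported numbers. The function names are illustrative, and the baseline value is inferred here, not reported in the abstract.

```python
def relative_reduction(baseline_eer: float, new_eer: float) -> float:
    """Relative error-rate reduction, expressed as a fraction of the baseline."""
    if baseline_eer <= 0:
        raise ValueError("baseline_eer must be positive")
    return (baseline_eer - new_eer) / baseline_eer


def implied_baseline(new_eer: float, reduction: float) -> float:
    """Baseline EER that would give exactly `reduction` for a given `new_eer`."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return new_eer / (1.0 - reduction)


# Figures from the abstract: best EER of 6.59 % and a relative reduction over 60 %.
best_eer = 6.59
baseline_bound = implied_baseline(best_eer, 0.60)  # lower bound on the baseline EER
```

Under these assumptions, a reduction of more than 60% implies a baseline EER above roughly 16.5%.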

#2 Exploring Universal Attribute Characterization of Spoken Languages for Spoken Language Recognition

Sabato Marco Siniscalchi (NTNU)
Jeremy Reed (Georgia Institute of Technology)
Torbjørn Svendsen (NTNU)
Chin-Hui Lee (Georgia Institute of Technology)

We propose a novel universal acoustic characterization approach to spoken language identification (LID), in which any spoken language is described with a common set of fundamental units defined "universally." Specifically, manner and place of articulation form this unit inventory and are used to build a set of universal attribute models with data-driven techniques. Using vector space modeling approaches to LID, a spoken utterance is first decoded into a sequence of attributes. Then, a feature vector consisting of co-occurrence statistics of attribute units is created, and the final LID decision is made with a set of vector space language classifiers. Although the present study is still at a preliminary stage, promising results comparable to acoustically rich phone-based LID systems have already been obtained on the NIST 2003 LID task. The results provide clear insight for further performance improvements and encourage continued exploration of the proposed framework.
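
The step from a decoded attribute sequence to a fixed-length feature vector can be sketched as below. This is only an illustration under stated assumptions: the toy attribute inventory, the choice of bigrams, and the relative-frequency normalisation are not taken from the paper.

```python
from collections import Counter
from itertools import product


def cooccurrence_vector(attributes, inventory, order=2):
    """Relative frequencies of attribute n-grams, in a fixed key order.

    Labels outside `inventory` still count towards the total, so the
    vector sums to 1 only when every decoded label is in the inventory.
    """
    ngrams = Counter(
        tuple(attributes[i:i + order])
        for i in range(len(attributes) - order + 1)
    )
    total = sum(ngrams.values())
    keys = list(product(inventory, repeat=order))
    return [ngrams[k] / total if total else 0.0 for k in keys]


# Hypothetical manner-of-articulation labels and a toy decoded utterance.
inventory = ["vowel", "stop", "nasal"]
decoded = ["stop", "vowel", "nasal", "vowel", "stop", "vowel"]
vec = cooccurrence_vector(decoded, inventory)
```

The resulting vector has one dimension per attribute pair, so its length depends only on the inventory and not on the utterance duration, which is what allows a vector space classifier to consume it.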

#3 On the use of Phonological Features for Automatic Accent Analysis

Abhijeet Sangwan (Center for Robust Speech Systems)
John Hansen (Center for Robust Speech Systems)

In this paper, we present an automatic accent analysis system that is based on phonological features (PFs). The proposed system exploits the knowledge of articulation embedded in phonology by rapidly building Markov models (MMs) of PFs extracted from accented speech. The Markov models capture information in the PF space along two dimensions of articulation: PF state-transitions and state-durations. Furthermore, by utilizing MMs of native and non-native accents, a new statistical measure of “accentedness” is developed which rates the articulation of a word on a scale from native-like (−1) to non-native-like (+1). The proposed methodology is then used to perform an automatic cross-sectional study of accented English spoken by native speakers of Mandarin Chinese (N-MC). The work developed in this paper is easily assimilated into language learning systems, and has impact in the areas of speaker recognition and ASR (automatic speech recognition).

#4 Language Recognition Using Language Factors

Fabio Castaldo (Politecnico di Torino)
Sandro Cumani (Politecnico di Torino)
Pietro Laface (Politecnico di Torino)
Daniele Colibro (Loquendo)

Language recognition systems based on acoustic models reach state-of-the-art performance using discriminative training techniques. In speaker recognition, eigenvoice modeling of the speaker and the use of speaker factors as input features to SVMs have recently been demonstrated to give good results compared to the standard GMM-SVM approach, which combines GMM supervectors and SVMs. In this paper we propose, in analogy to the eigenvoice modeling approach, to estimate an eigen-language space, and to use the language factors as input features to SVM classifiers. Since language factors are low-dimensional vectors, training and evaluating SVMs with different kernels and with large numbers of training examples becomes an easy task. This approach is demonstrated on the 14 languages of the NIST 2007 language recognition task, and shows performance improvements with respect to the standard GMM-SVM technique.

#5 Automatic Accent Detection: Effect of Base Units and Boundary Information

Je Hun Jeon (The University of Texas at Dallas)
Yang Liu (The University of Texas at Dallas)

Automatic prominence or pitch accent detection is important as it can perform automatic prosodic annotation of speech corpora, as well as provide additional features in other tasks such as keyword detection. In this paper, we evaluate how accent detection performance changes according to different base units and what kind of boundary information is available. We compare word, syllable, and vowel-based units when their boundaries are provided. We also automatically estimate syllable boundaries using energy contours when phone-level alignment is available. In addition, we utilize a sliding window with fixed length under the condition of unknown boundaries. Our experiments show that when boundary information is available, using a longer base unit achieves better performance. In the case of no boundary information, using a moving window with a fixed size achieves similar performance to using syllable information on word-level evaluation, suggesting that accent detection can be performed without relying on a speech recognizer to generate boundaries.
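
The fixed-length sliding window used when no boundaries are known can be sketched as follows. The window length and hop are hypothetical values in frames; the abstract does not state the settings used.

```python
def sliding_windows(n_frames, win_len, hop):
    """Return (start, end) frame indices of fixed-length windows, end exclusive.

    Only full windows are kept, so trailing frames that do not fill a
    complete window are dropped.
    """
    if win_len <= 0 or hop <= 0:
        raise ValueError("win_len and hop must be positive")
    return [(s, s + win_len) for s in range(0, n_frames - win_len + 1, hop)]


# Hypothetical settings: 100 frames, 25-frame window, 10-frame hop.
windows = sliding_windows(n_frames=100, win_len=25, hop=10)
```

Each window then plays the role that a word, syllable, or vowel unit plays when boundaries are available: features are computed over its frames and classified as accented or not.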

#6 Age Verification Using a Hybrid Speech Processing Approach

Ron M Hecht (PuddingMedia)
Omer Hezroni (PuddingMedia)
Amit Manna (PuddingMedia)
Ruth Aloni-Lavi (PuddingMedia)
Gil Dobry (PuddingMedia)
Amir Alfandary (Nice systems)
Yaniv Zigel (Bio-medical Engineering Dept., Ben-Gurion University)

The human speech production system is a multi-level system. On the upper level, it starts with information that one wants to transmit. It ends on the lower level with the materialization of the information into a speech signal. Most of the recent work conducted in age estimation is focused on the lower, acoustic level. In this research, the upper, lexical-level information is utilized for age-group verification, and it is shown that one's vocabulary reflects one's age. Several age-group verification systems that are based on automatic transcripts are proposed. In addition, a hybrid approach is introduced that combines the word-based system and an acoustic-based system. Experiments were conducted on a four-age-group verification task using the Fisher corpora, where an average equal error rate (EER) of 28.7% was achieved using the lexical-based approach and 28.0% using an acoustic approach. By merging these two approaches, the verification error was reduced to 24.1%.

#7 Information Bottleneck Based Age Verification

Ron M Hecht (PuddingMedia, Kfar-Saba, Israel)
Omer Hezroni (PuddingMedia, Kfar-Saba, Israel)
Amit Manna (PuddingMedia, Kfar-Saba, Israel)
Gil Dobry (Bio-medical Engineering Department, Ben-Gurion University, Beer-Sheva, Israel)
Yaniv Zigel (Bio-medical Engineering Department, Ben-Gurion University, Beer-Sheva, Israel)
Naftali Tishby (School of Engineering and Computer Science, Hebrew University, Jerusalem, Israel)

Word N-gram models can be used for word-based age-group verification. In this paper the agglomerative information bottleneck (AIB) approach is used to tackle one of the most fundamental drawbacks of word N-gram models: their abundance of irrelevant information. It is demonstrated that irrelevant information can be omitted by joining words to form word-clusters; this provides a mechanism to transform any sequence of words to a sequence of word-cluster labels. Consequently, word N-gram models are converted to word-cluster N-gram models, which are more compact. Age verification experiments were conducted on the Fisher corpora. Their goal was to verify the age-group of the speaker of an unknown speech segment. In these experiments an N-gram model was compressed to a fifth of its original size without reducing the verification performance. In addition, an improvement in verification accuracy is demonstrated by discarding irrelevant information.
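
The transformation from word sequences to word-cluster sequences, and its effect on N-gram model size, can be sketched as below. The cluster mapping here is a hand-written toy assumption; in the paper it would come from AIB clustering, which this sketch does not implement.

```python
from collections import Counter


def to_cluster_sequence(words, word_to_cluster, unknown="<unk>"):
    """Replace each word with its cluster label (hypothetical mapping)."""
    return [word_to_cluster.get(w, unknown) for w in words]


def ngram_counts(tokens, n=2):
    """Count the n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Toy mapping standing in for the output of a clustering step.
word_to_cluster = {"hi": "GREET", "hello": "GREET", "yes": "AFFIRM", "yeah": "AFFIRM"}
utterances = [["hi", "yes"], ["hello", "yeah"], ["hi", "yeah"]]

word_bigrams = Counter()
cluster_bigrams = Counter()
for utt in utterances:
    word_bigrams.update(ngram_counts(utt))
    cluster_bigrams.update(ngram_counts(to_cluster_sequence(utt, word_to_cluster)))
```

In this toy case three distinct word bigrams collapse into a single cluster bigram, which illustrates why a word-cluster N-gram model needs fewer parameters than the word N-gram model it replaces.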

#8 Discriminative N-gram Selection for Dialect Recognition

Fred Richardson (MIT Lincoln Laboratory)
William Campbell (MIT Lincoln Laboratory)
Pedro Torres-Carrasquillo (MIT Lincoln Laboratory)

Dialect recognition is a challenging and multifaceted problem. Distinguishing between dialects can rely upon many tiers of interpretation of speech data, e.g., prosodic, phonetic, spectral, and word. High-accuracy automatic methods for dialect recognition typically use either phonetic or spectral characteristics of the input. A challenge with spectral systems, such as those based on shifted-delta cepstral coefficients, is that they achieve good performance but do not provide insight into distinctive dialect features. In this work, a novel method based upon discriminative training and phone N-grams is proposed. This approach achieves excellent classification performance, fuses well with other systems, and has interpretable dialect characteristics in the phonetic tier. The method is demonstrated on data from the LDC and prior NIST language recognition evaluations. The method is also combined with spectral methods to demonstrate state-of-the-art performance in dialect recognition.

#9 Data-driven Phonetic Comparison and Conversion between South African, British and American English Pronunciations

Linsen Loots (Department of Electrical and Electronic Engineering, Stellenbosch University, South Africa)
Thomas Niesler (Department of Electrical and Electronic Engineering, Stellenbosch University, South Africa)

We analyse pronunciations in American, British and South African English pronunciation dictionaries. Three analyses are performed. First, we determine the accuracy with which decision-tree-based grapheme-to-phoneme (G2P) conversion can be applied to each accent. It is found that there is little difference between the accents in this regard. Secondly, pronunciations are compared by performing pairwise alignments between the accents. Here we find that South African English pronunciation most closely matches British English. Finally, we apply decision trees to the conversion of pronunciations from one accent to another. We find that pronunciations of unknown words can be more accurately determined from a known pronunciation in a different accent than by means of G2P methods. This has important implications for the development of pronunciation dictionaries in less-resourced varieties of English, and hence also for the development of ASR systems.

#10 Target-Aware Language Models for Spoken Language Recognition

Rong Tong (Institute for Infocomm Research, Singapore)
Bin Ma (Institute for Infocomm Research, Singapore)
Haizhou Li (Institute for Infocomm Research, Singapore)
Eng Siong Chng (Nanyang Technological University, Singapore)

This paper studies a way of constructing multiple phone tokenizers for language recognition. In this approach, the phone tokenizers for the target languages share a common set of acoustic models, while each has a unique phone-based language model (LM) trained for a specific target language. The target-aware language models (TALM) are constructed to capture the discriminative ability of individual phones for the desired target languages. The parallel phone tokenizers thus formed are shown to achieve better performance than the original phone recognizer. The proposed TALM is very different from the LM in the traditional PPRLM technique, as the TALM applies the LM information in the front-end while the PPRLM approach uses an LM in the system back-end; furthermore, the TALM exploits discriminative phone occurrence statistics, which are different from the traditional n-gram statistics in the PPRLM approach. A novel way of training TALM is also studied in this paper.

#11 Language Identification for Speech-to-Speech Translation

Daniel Chung Yong Lim (Language Technologies Institute, Carnegie Mellon University)
Ian Lane (Language Technologies Institute, Carnegie Mellon University)

This paper investigates the use of language identification (LID) in real-time speech-to-speech translation systems. We propose a framework that incorporates LID capability into a speech-to-speech translation system while minimizing the impact on the system’s real-time performance. We compared two phone-based LID approaches, namely PRLM and PPRLM, to a proposed extended approach based on Conditional Random Field classifiers. The three approaches were evaluated on identifying the input language in the CMU English-Iraqi TransTAC system, and the proposed approach obtained significantly higher classification accuracies on two of the three test sets evaluated.

#12 Using Prosody and Phonotactics in Arabic Dialect Identification

Fadi Biadsy (Columbia University)
Julia Hirschberg (Columbia University)

While Modern Standard Arabic is the formal spoken and written language of the Arab world, dialects are the major communication mode for everyday life; identifying a speaker’s dialect is thus critical to speech processing tasks such as automatic speech recognition, as well as speaker identification. We examine the role of prosodic features (intonation and rhythm) across four Arabic dialects (Gulf, Iraqi, Levantine, and Egyptian) for the purpose of automatic dialect identification. We show that prosodic features can significantly improve identification over a purely phonotactic-based approach, with an identification accuracy of 86.33% for 2-minute utterances.