Brighton Pavilion

10th Annual Conference of the International Speech Communication Association

ISCA Interspeech 2009 Brighton

Technical Programme

This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.

Tue-Ses3-P1:
Single- and Multichannel Speech Enhancement

Time: Tuesday 16:00 Place: Hewison Hall Type: Poster

#1 Watermark Recovery From Speech Using Inverse Filtering And Sign Correlation

Robert Morris (SPAWAR Systems Center Pacific)
Ralph Johnson (SPAWAR Systems Center Pacific)
Vladimir Goncharoff (University of Illinois at Chicago)
Joseph DiVita (SPAWAR Systems Center Pacific)

This paper presents an improved method for asynchronous embedding and recovery of sub-audible watermarks in speech signals. The watermark, a sequence of DTMF tones, was added to speech without knowledge of its time-varying characteristics. Watermark recovery began by implementing a synchronized zero-phase inverse filtering operation to decorrelate the speech during its voiced segments. The final step was to apply the sign correlation technique, which resulted in performance advantages over linear correlation detection. Our simulations include the effects of finite word length in the correlator.
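As a rough illustration of why a sign correlator can outperform a linear one, the sketch below compares the two on a toy sequence. It is a generic example, not the authors' detector: the function names are hypothetical, and the inverse filtering and DTMF stages described in the abstract are omitted.

```python
def sign(x):
    """Return +1 for non-negative samples and -1 otherwise (1-bit quantiser)."""
    return 1 if x >= 0 else -1

def linear_correlation(signal, template):
    """Conventional normalised inner product of signal and template."""
    return sum(s * t for s, t in zip(signal, template)) / len(template)

def sign_correlation(signal, template):
    """Correlate only the signs of the samples; the result lies in [-1, 1]."""
    return sum(sign(s) * sign(t) for s, t in zip(signal, template)) / len(template)

# Toy data (assumed for illustration): one sample is hit by a large outlier.
template = [1.0, -1.0, 1.0, -1.0]
corrupted = [1.0, -1.0, -50.0, -1.0]

linear_score = linear_correlation(corrupted, template)  # dominated by the outlier
sign_score = sign_correlation(corrupted, template)      # outlier counts as one wrong sign
```

Because each sample contributes only its sign, a single large-amplitude sample cannot dominate the score, which is one plausible reason for the robustness reported above.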

#2 Weighted Linear Prediction for Speech Analysis in Noisy Conditions

Jouni Pohjalainen (Dept. Signal Processing and Acoustics, Helsinki University of Technology, FI-02015 TKK, Finland)
Heikki Kallasjoki (Adaptive Informatics Research Centre, Helsinki University of Technology, FI-02015 TKK, Finland)
Kalle Palomäki (Adaptive Informatics Research Centre, Helsinki University of Technology, FI-02015 TKK, Finland)
Mikko Kurimo (Adaptive Informatics Research Centre, Helsinki University of Technology, FI-02015 TKK, Finland)
Paavo Alku (Dept. Signal Processing and Acoustics, Helsinki University of Technology, FI-02015 TKK, Finland)

Following earlier work, we modify linear predictive (LP) speech analysis by including temporal weighting of the squared prediction error in the model optimization. To focus this so-called weighted LP model on the least noisy signal regions in the presence of stationary additive noise, we use short-time signal energy as the weighting function. We compare the noisy spectrum analysis performance of weighted LP and its recently proposed variant, the latter guaranteed to produce stable synthesis models. As a practical test case, we use automatic speech recognition to verify that the weighted LP methods improve upon the conventional FFT and LP methods by making spectrum estimates less prone to corruption by additive noise.
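The weighted LP idea can be sketched as a weighted least-squares problem: minimise the sum over n of w[n] * (x[n] - sum_k a[k] * x[n-k])^2. The code below is a minimal sketch under stated assumptions, not the authors' implementation: the weight w[n] is taken to be the energy of the `energy_len` samples preceding n, and the function names and the small linear solver are hypothetical helpers. The stabilised variant mentioned in the abstract is not covered.

```python
def short_time_energy(x, n, energy_len):
    """Energy of the energy_len samples preceding index n (samples before 0 count as 0)."""
    return sum(x[n - i] ** 2 for i in range(1, energy_len + 1) if n - i >= 0)

def solve(A, b):
    """Solve the square system A a = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    a = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * a[j] for j in range(i + 1, n))
        a[i] = (aug[i][n] - tail) / aug[i][i]
    return a

def weighted_lp(x, order, energy_len):
    """Return predictor coefficients a[0..order-1] with x_hat[n] = sum_k a[k-1] * x[n-k]."""
    A = [[0.0] * order for _ in range(order)]
    b = [0.0] * order
    for n in range(order, len(x)):
        w = short_time_energy(x, n, energy_len)  # assumed weighting function
        for i in range(1, order + 1):
            b[i - 1] += w * x[n] * x[n - i]
            for k in range(1, order + 1):
                A[i - 1][k - 1] += w * x[n - k] * x[n - i]
    return solve(A, b)

# Usage: a noise-free second-order recursion should be recovered closely.
x = [1.0, 1.2]
for _ in range(2, 40):
    x.append(1.2 * x[-1] - 0.5 * x[-2])
coeffs = weighted_lp(x, order=2, energy_len=8)
```

Setting every weight to 1 reduces this to the conventional covariance-method LP, so the weighting is the only difference between the two analyses in this sketch.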

#3 Log-Spectral Magnitude MMSE Estimators under Super-Gaussian Densities

Richard Christian Hendriks (Delft University of Technology)
Richard Heusdens (Delft University of Technology)
Jesper Jensen (Oticon A/S)

Although histograms of speech DFT coefficients are super-Gaussian, little attention has been paid to developing estimators under these super-Gaussian distributions in combination with perceptually meaningful distortion measures. In this paper we present log-spectral magnitude MMSE estimators under super-Gaussian densities, resulting in an estimator that is perceptually more meaningful and in line with measured histograms of speech DFT coefficients. Compared to state-of-the-art reference methods, the presented estimator improves the segmental SNR by about 0.5 dB to 1 dB. Moreover, listening tests show a significant improvement for the presented estimator over state-of-the-art methods.

#4 Speech enhancement in a 2-dimensional area based on power spectrum estimation of multiple areas with investigation of existence of active sources

Yusuke Hioka (NTT Cyber Space Laboratories, NTT Corporation)
Kenichi Furuya (NTT Cyber Space Laboratories, NTT Corporation)
Yoichi Haneda (NTT Cyber Space Laboratories, NTT Corporation)
Akitoshi Kataoka (Faculty of Science and Technology, Ryukoku University)

A microphone array that emphasizes sound sources located in a particular 2-dimensional area is described. We previously developed a method that estimates the power spectra of target and noise sounds using multiple fixed beamformers. However, that method requires the areas where the noise sources are located to be restricted. We describe the principle behind this limitation, then propose a procedure that first investigates whether a sound source is likely to exist in the target area and in other areas, in order to reduce the number of unknown power spectra to be estimated.

#5 Modulation Domain Spectral Subtraction for Speech Enhancement

Kuldip Paliwal (Signal Processing Laboratory, Griffith University, Queensland, Australia)
Belinda Schwerin (Signal Processing Laboratory, Griffith University, Queensland, Australia)
Kamil Wojcicki (Signal Processing Laboratory, Griffith University, Queensland, Australia)

In this paper we investigate the modulation domain as an alternative to the acoustic domain for speech enhancement. More specifically, we wish to determine how competitive the modulation domain is for spectral subtraction as compared to the acoustic domain. For this purpose, we extend the traditional analysis-modification-synthesis framework to include modulation domain processing. We then compensate the noisy modulation spectrum for additive noise distortion by applying the spectral subtraction algorithm in the modulation domain. Using subjective listening tests and objective speech quality evaluation we show that the proposed method results in improved speech quality. Furthermore, applying spectral subtraction in the modulation domain does not introduce the musical noise artifacts that are typically present after acoustic domain spectral subtraction. The proposed method also achieves better background noise reduction than the MMSE method.
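For readers unfamiliar with the subtraction rule itself, the following is a minimal sketch of power spectral subtraction with a spectral floor, applied to one frame of magnitudes. It is a generic textbook form, not the authors' system: the function name and the `floor` parameter are assumptions, and the second (modulation) analysis stage that the paper adds is not shown. In the modulation-domain setting, the same rule would be applied to the magnitudes of the modulation spectrum instead of the acoustic spectrum.

```python
import math

def spectral_subtract(noisy_mag, noise_mag, floor=0.002):
    """Subtract an estimated noise power from a noisy power spectrum.

    noisy_mag, noise_mag: magnitude spectra of equal length.
    floor: fraction of the noisy power kept as a lower bound (assumed value),
           which prevents negative power estimates.
    """
    enhanced = []
    for y, d in zip(noisy_mag, noise_mag):
        power = y * y - d * d          # subtraction in the power domain
        lower_bound = floor * y * y    # spectral floor
        enhanced.append(math.sqrt(max(power, lower_bound)))
    return enhanced
```

The enhanced magnitudes are normally recombined with the noisy phase before synthesis; that step is omitted here.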

#6 Variational Loopy Belief Propagation for Multi-talker Speech Recognition

Steven Rennie (IBM)
John Hershey (IBM)
Peder Olsen (IBM)

We address single-channel speech separation and recognition by combining loopy belief propagation and variational inference methods. Inference is done in a graphical model consisting of an HMM for each speaker combined with the max interaction model of source combination. We present a new variational inference algorithm that exploits the structure of the max model to compute an arbitrarily tight bound on the probability of the mixed data. The variational parameters are chosen so that the algorithm scales linearly in the size of the language and acoustic models, and quadratically in the number of sources. The algorithm scores 30.7% on the SSC task [Cooke:09], which is the best published result by a method that scales linearly with speaker model complexity to date. The algorithm achieves average recognition error rates of 27%, 35%, and 51% on small datasets of SSC-derived speech mixtures containing two, three, and four sources, respectively, using a single audio channel.
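The max interaction model mentioned above is commonly understood as approximating the log-spectrum of a mixture by the element-wise maximum of the sources' log-spectra. The sketch below illustrates only that approximation, under the assumption that source powers add (phase is ignored); the function names are hypothetical and none of the paper's inference machinery is shown.

```python
import math

def exact_log_mix(log_a, log_b):
    """Log-power of the mixture when the two source powers add in each bin."""
    return [math.log(math.exp(a) + math.exp(b)) for a, b in zip(log_a, log_b)]

def max_model(log_a, log_b):
    """Max approximation: in each bin the louder source is taken as the mixture."""
    return [max(a, b) for a, b in zip(log_a, log_b)]

# Toy log-power spectra (assumed values for illustration).
speaker_1 = [1.0, 5.0, -2.0]
speaker_2 = [3.0, 2.0, -2.0]
approx = max_model(speaker_1, speaker_2)
exact = exact_log_mix(speaker_1, speaker_2)
```

Under the additive-power assumption the approximation error in each bin is at most log 2, reached when both sources have equal power in that bin.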

#7 Enhancement of Binaural Speech Using Codebook Constrained Iterative Binaural Wiener Filter

Nadir Cazi (Indian Institute of Science, Bangalore)
Thippur Sreenivas (Indian Institute of Science, Bangalore)

A clean speech VQ codebook has been shown to be effective in providing intraframe constraints and hence better convergence of the iterative Wiener filtering scheme for single channel speech enhancement. Here we present an extension of the single channel CCIWF scheme to binaural speech input by incorporating a speech distortion weighted multi-channel Wiener filter. The new algorithm shows considerable improvement over single channel CCIWF in each channel, in a diffuse noise field environment, in terms of a posteriori SNR and a speech intelligibility measure. Next, considering a moving speech source, good tracking performance is seen, up to a certain resolution.

#8 A Semi-blind Source Separation Method with A Less Amount of Computation Suitable for Tiny DSP Modules

Kazunobu Kondo (Yamaha Corporation)
Makoto Yamada (Yamaha Corporation)
Hideki Kenmochi (Yamaha Corporation)

In this paper, we propose a method of implementing FDICA on tiny DSP modules. Firstly, we show a semi-blind separation matrix initialization step that consists of an estimation method using covariance fitting for a known source and an unknown source. It contributes to faster convergence and a smaller amount of computation. Secondly, we show a learning band selection step that uses the determinant of the covariance matrix as the selection criterion; this achieves a significant reduction in the amount of computation while retaining practical separation performance. Finally, the effectiveness of the proposed method is evaluated via source separation simulations in anechoic and reverberant rooms, and a procedure and a resource estimate for the integrated method, which we call tinyICA, are shown.

#9 Model-based Speech Separation: Identifying Transcription using Orthogonality

Siu Wa Lee (The Chinese University of Hong Kong)
Frank K. Soong (Microsoft Research Asia)
Tan Lee (The Chinese University of Hong Kong)

Spectral envelopes and harmonics are the building elements of a speech signal. By estimating these elements, individual speech sources in a mixture observation can be reconstructed and hence separated. Transcription gives the spoken content. More importantly, it describes the expected sequence of spectral envelopes, provided that models of the different speech sounds are available. Our recently proposed single-microphone speech separation algorithm exploits this to derive the spectral envelope trajectories of individual sources and remove interference accordingly. This paper investigates the relationship between the correctness of transcription hypotheses and the orthogonality of associated source estimates. An orthogonality measure is introduced to quantify the correlation between spectrograms. Experiments verify that underlying true transcriptions lead to a salient orthogonality distribution, which is distinguishable from that of counterfeit transcriptions.

#10 Enhanced Minimum Statistics Technique Incorporating Soft Decision For Noise Suppression

Yun-Sik Park (Inha University)
Ji-Hyun Song (Inha University)
Jae-Hun Choi (Inha University)
Joon-Hyuk Chang (Inha University)

In this paper, we propose a novel approach to noise power estimation for robust noise suppression in noisy environments. An investigation of state-of-the-art techniques for noise power estimation shows that the previously known methods are mostly accurate either during speech absence or during speech presence, but none of them works well in both situations. Our approach combines minimum statistics (MS) and soft decision (SD) techniques based on the probability of speech absence. The performance of the proposed approach is evaluated by a quantitative comparison method and a subjective test under various noise environments, and is found to yield better results than conventional MS and SD-based schemes.
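The minimum statistics component can be illustrated with a heavily simplified sketch: the noise power in one frequency bin is taken as the minimum of a recursively smoothed power over a sliding window of frames. This is not the authors' combined MS and SD estimator; the function name and the `alpha` and `window` values are assumptions, and the bias compensation and adaptive smoothing of full minimum statistics, as well as the soft decision stage, are omitted.

```python
from collections import deque

def minimum_statistics(power_frames, alpha=0.85, window=8):
    """Track the noise power of one frequency bin across frames.

    power_frames: per-frame noisy power values for the bin.
    alpha: recursive smoothing constant (assumed value).
    window: number of recent frames searched for the minimum (assumed value).
    """
    smoothed = None
    recent = deque(maxlen=window)  # holds the last `window` smoothed powers
    estimates = []
    for power in power_frames:
        if smoothed is None:
            smoothed = power
        else:
            smoothed = alpha * smoothed + (1.0 - alpha) * power
        recent.append(smoothed)
        estimates.append(min(recent))
    return estimates

# Usage: a short high-power burst (speech) on a constant noise floor.
frames = [1.0] * 5 + [100.0] * 3
noise_track = minimum_statistics(frames)
```

As long as the window is longer than the burst, the estimate stays at the noise floor, which is why such trackers behave well during speech presence but react slowly when the noise level rises.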

#11 Effect of Noise Reduction on Reaction Time to Speech in Noise

Mark Huckvale (UCL)
Jayne Leak (UCL)

In moderate levels of noise, listeners report that noise reduction (NR) processing can improve the perceived quality of a speech signal as measured on a typical MOS rating scale. Most quantitative experiments of intelligibility, however, show that NR reduces the intelligibility of noisy speech signals, and so should be expected to increase the cognitive effort required to process utterances. To study cognitive effort we look at how NR affects reaction times to speech in noise, using material that is still highly intelligible. We show that adding noise increases reaction times and that NR does not restore reaction times back to the quiet condition. The implication is that NR does not make speech "easier" to process, at least as far as this task is concerned.

#12 Joint Noise Reduction and Dereverberation of Speech Using Hybrid TF-GSC and Adaptive MMSE Estimator

Behdad Dashtbozorg (Yazd University)
Hamid Reza Abutalebi (Yazd University)

This paper proposes a new multichannel hybrid method for dereverberation of speech signals in noisy environments. This method extends to dereverberation a hybrid noise reduction method that is based on the combination of a Generalized Sidelobe Canceller (GSC) and a single-channel noise reduction stage. In this research, we employ the Transfer Function GSC (TF-GSC), which is more suitable for dereverberation. The single-channel stage is an Adaptive Minimum Mean-Square Error (AMMSE) spectral amplitude estimator. We also modify the AMMSE estimator for the dereverberation application. Experimental results demonstrate the superiority of the proposed method in dereverberation of speech signals in noisy environments.

#13 A Study on Multiple Sound Source Localization with a Distributed Microphone System

Kook Cho (Ritsumeikan University)
Takanobu Nishiura (Ritsumeikan University)
Yoichi Yamashita (Ritsumeikan University)

This paper describes a novel method for multiple sound source localization and its performance evaluation in actual room environments. The proposed method localizes a sound source by finding the position that maximizes the accumulated correlation coefficient between multiple channel pairs. After the estimation of the first sound source, a typical pattern of the accumulated correlation for a single sound source is subtracted from the observed distribution of the accumulated correlation. Subsequently, the search is repeated to find the second sound source. To evaluate the effectiveness of the proposed method, experiments on multiple sound source localization were carried out in an actual office room. The result shows that multiple sound source localization accuracy is about 99.7%.

#14 Robust Minimal Variance Distortionless Speech Power Spectra Enhancement

Tao Yu (CRSS: Center for Robust Speech System, University of Texas at Dallas, Texas, USA)
John H. L. Hansen (CRSS: Center for Robust Speech System, University of Texas at Dallas, Texas, USA)

In this study, we propose a novel minimal variance distortionless speech power spectral enhancement algorithm, which is robust to real-world implementation issues. Our proposed method is implemented in the power spectral domain, where stochastic noise can be modeled by the exponential distribution, whose non-Gaussianity is explored by an order statistics filter. Both theoretical and experimental results show the effectiveness of our proposed method over traditional ones.

#15 Speech Enhancement Minimizing Generalized Euclidean Distortion Using Supergaussian Priors

Amit Das (University of Colorado, Boulder and University of Texas, Dallas)
John H. L. Hansen (University of Texas, Dallas)

We introduce short time spectral estimators which minimize the weighted Euclidean distortion (WED) between the clean and estimated speech spectral components when clean speech is degraded by additive noise. The traditional minimum mean square error (MMSE) estimator does not adequately account for perceptual measures during enhancement of noisy speech. In contrast, the new estimators discussed in this paper provide greater flexibility to improve speech quality. We explore the cases when the clean speech spectral magnitude and discrete Fourier transform (DFT) coefficients are modeled by super-Gaussian priors such as the Chi and bilateral Gamma distributions, respectively. We also present the joint maximum a posteriori (MAP) estimators of the Chi distributed spectral magnitude and uniform phase. Performance evaluations over two noise types and three SNR levels demonstrate improved results of the proposed estimators.
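To make the role of the weighting concrete, the short derivation below assumes the commonly used form of the weighted Euclidean distortion, in which the squared error is weighted by a power p of the clean magnitude X. The symbols X, \hat{X}, Y, and p are assumptions for illustration; the abstract does not state the exact distortion used in the paper.

```latex
d_p(X,\hat{X}) = X^{p}\,(X-\hat{X})^{2},
\qquad
\frac{\partial}{\partial \hat{X}}\,
E\!\left[\, X^{p}\,(X-\hat{X})^{2} \,\middle|\, Y \right] = 0
\;\Longrightarrow\;
\hat{X} = \frac{E\!\left[\, X^{p+1} \mid Y \right]}{E\!\left[\, X^{p} \mid Y \right]}
```

With p = 0 this reduces to the conditional mean E[X | Y], that is, the traditional MMSE estimator; other values of p shift the emphasis between low-energy and high-energy spectral components.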

#16 STFT-Based Speech Enhancement by Reconstructing the Harmonics

Iman Haji Abolhassani (INRS-Energie-Matériaux-Télécommunications, Montréal, Canada)
Sid-Ahmed Selouani (Université de Moncton, Campus de Shippagan, Canada)
Douglas O'Shaughnessy (INRS-Energie-Matériaux-Télécommunications, Montréal, Canada)

A novel Short Time Fourier Transform (STFT) based speech enhancement method is introduced. This method enhances the magnitude spectrum of a noisy speech segment. The new idea is to reconstruct the harmonics at the multiples of the fundamental frequency (F0) rather than trying to improve them. The harmonics are produced, in the magnitude spectrum, using knowledge of the window function used for the STFT. These harmonics are then scaled and placed at multiples of F0. Experimental results show the effectiveness of this enhancement method in various noisy conditions and at various SNRs.

#17 Joint Speech Enhancement and Speaker Identification Using Monte Carlo Methods

Ciira wa Maina (Drexel University)
John MacLaren Walsh (Drexel University)

We present an approach to speaker identification using noisy speech observations where the speech enhancement and speaker identification tasks are performed jointly. This is motivated by the belief that human beings perform these tasks jointly and that optimality may be sacrificed if sequential processing is used. We employ a Bayesian approach where the speech features are modeled using a mixture of Gaussians prior. A Gibbs sampler is used to estimate the speech source and the identity of the speaker. Preliminary experimental results are presented comparing our approach to a maximum likelihood approach and demonstrating the ability of our method to both enhance speech and identify speakers.