10th Annual Conference of the International Speech Communication Association
Interspeech 2009 Brighton
Technical Programme
This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.
Tue-Ses3-P1: Single- and Multichannel Speech Enhancement
Time: Tuesday 16:00
Place: Hewison Hall
Type: Poster
| #1 | Watermark Recovery From Speech Using Inverse Filtering And Sign Correlation
Robert Morris (SPAWAR Systems Center Pacific) Ralph Johnson (SPAWAR Systems Center Pacific) Vladimir Goncharoff (University of Illinois at Chicago) Joseph DiVita (SPAWAR Systems Center Pacific)
This paper presents an improved method for asynchronous embedding and recovery of sub-audible watermarks in speech signals. The watermark, a sequence of DTMF tones, was added to speech without knowledge of its time-varying characteristics. Watermark recovery began by implementing a synchronized zero-phase inverse filtering operation to decorrelate the speech during its voiced segments. The final step was to apply the sign correlation technique, which resulted in performance advantages over linear correlation detection. Our simulations include the effects of finite word length in the correlator.
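The difference between linear and sign correlation detection can be illustrated with a minimal sketch. It assumes that sign correlation means correlating only the polarities of the received signal and the reference; the 697 Hz tone, the 8 kHz sampling rate, and the function names are illustrative choices, not details taken from the paper.

```python
import math

def linear_correlation(x, ref):
    """Plain inner product between the received signal and the reference."""
    return sum(a * b for a, b in zip(x, ref))

def sign_correlation(x, ref):
    """Correlate only the polarities (+1/-1) of both signals, scaled to [-1, 1]."""
    def sign(v):
        return 1 if v >= 0 else -1
    return sum(sign(a) * sign(b) for a, b in zip(x, ref)) / len(ref)

# Hypothetical reference: a 697 Hz tone (one DTMF row frequency) sampled at 8 kHz.
fs = 8000
ref = [math.sin(2 * math.pi * 697 * n / fs) for n in range(800)]
matched = [0.01 * r for r in ref]    # weak copy of the reference
inverted = [-0.01 * r for r in ref]  # weak copy with flipped polarity

print(linear_correlation(matched, ref), sign_correlation(matched, ref))
print(linear_correlation(inverted, ref), sign_correlation(inverted, ref))
```

Because the sign correlator discards amplitude, it needs only one bit per sample, which is one reason finite word length in the correlator is a natural thing to simulate.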
|
| #2 | Weighted Linear Prediction for Speech Analysis in Noisy Conditions
Jouni Pohjalainen (Dept. Signal Processing and Acoustics, Helsinki University of Technology, FI-02015 TKK, Finland) Heikki Kallasjoki (Adaptive Informatics Research Centre, Helsinki University of Technology, FI-02015 TKK, Finland) Kalle Palomäki (Adaptive Informatics Research Centre, Helsinki University of Technology, FI-02015 TKK, Finland) Mikko Kurimo (Adaptive Informatics Research Centre, Helsinki University of Technology, FI-02015 TKK, Finland) Paavo Alku (Dept. Signal Processing and Acoustics, Helsinki University of Technology, FI-02015 TKK, Finland)
Following earlier work, we modify linear predictive (LP) speech analysis by including temporal weighting of the squared prediction error in the model optimization. To focus this so-called weighted LP model on the least noisy signal regions in the presence of stationary additive noise, we use short-time signal energy as the weighting function. We compare the noisy spectrum analysis performance of weighted LP and its recently proposed variant, the latter guaranteed to produce stable synthesis models. As a practical test case, we use automatic speech recognition to verify that the weighted LP methods improve upon the conventional FFT and LP methods by making spectrum estimates less prone to corruption by additive noise.
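The weighted LP idea can be sketched as a weighted least-squares problem: each squared prediction error is multiplied by a weight before summing. The sketch below assumes the weight at sample n is the energy of the M preceding samples; that choice, the function names, and the test signal are illustrative assumptions, and the stabilised variant mentioned in the abstract is not reproduced.

```python
def short_time_energy(x, n, M):
    """Energy of the (up to) M samples preceding index n."""
    return sum(x[i] ** 2 for i in range(max(0, n - M), n))

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - s) / aug[r][r]
    return z

def weighted_lp(x, order, M):
    """Coefficients a_1..a_p minimising sum_n w[n] * (x[n] - sum_k a_k x[n-k])**2."""
    R = [[0.0] * order for _ in range(order)]
    r = [0.0] * order
    for n in range(order, len(x)):
        w = short_time_energy(x, n, M)
        past = [x[n - k] for k in range(1, order + 1)]
        for i in range(order):
            r[i] += w * past[i] * x[n]
            for j in range(order):
                R[i][j] += w * past[i] * past[j]
    return solve(R, r)

# Hypothetical noise-free test signal obeying x[n] = 1.5 x[n-1] - 0.7 x[n-2].
demo = [1.0, 1.5]
for _ in range(38):
    demo.append(1.5 * demo[-1] - 0.7 * demo[-2])
coeffs = weighted_lp(demo, order=2, M=8)
print(coeffs)
```

With a noise-free signal the weights do not change the solution; their effect appears only when noise makes some samples less reliable than others.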
|
| #3 | Log-Spectral Magnitude MMSE Estimators under Super-Gaussian Densities
Richard Christian Hendriks (Delft University of Technology) Richard Heusdens (Delft University of Technology) Jesper Jensen (Oticon A/S)
Although histograms of speech DFT coefficients are super-Gaussian, little attention has been paid to developing estimators under these super-Gaussian distributions in combination with perceptually meaningful distortion measures.
In this paper we present log-spectral magnitude MMSE estimators under super-Gaussian densities, resulting in an estimator that is perceptually more meaningful and in line with measured histograms of speech DFT coefficients.
Compared to state-of-the-art reference methods, the presented estimator improves the segmental SNR by about 0.5 dB to 1 dB.
Moreover, listening tests show a significant improvement for the presented estimator over state-of-the-art methods.
|
| #4 | Speech enhancement in a 2-dimensional area based on power spectrum estimation of multiple areas with investigation of existence of active sources
Yusuke Hioka (NTT Cyber Space Laboratories, NTT Corporation) Kenichi Furuya (NTT Cyber Space Laboratories, NTT Corporation) Yoichi Haneda (NTT Cyber Space Laboratories, NTT Corporation) Akitoshi Kataoka (Faculty of Science and Technology, Ryukoku University)
A microphone array that emphasizes sound sources located in a particular 2-dimensional area is described. We previously developed a method that estimates the power spectra of target and noise sounds using multiple fixed beamformers. However, that method requires the areas where the noise sources are located to be restricted. We describe the principle behind this limitation and then propose a procedure that first investigates whether a sound source may exist in the target area and in other areas, in order to reduce the number of unknown power spectra to be estimated.
|
| #5 | Modulation Domain Spectral Subtraction for Speech Enhancement
Kuldip Paliwal (Signal Processing Laboratory, Griffith University, Queensland, Australia) Belinda Schwerin (Signal Processing Laboratory, Griffith University, Queensland, Australia) Kamil Wojcicki (Signal Processing Laboratory, Griffith University, Queensland, Australia)
In this paper we investigate the modulation domain as an alternative to the acoustic domain for speech enhancement. More specifically, we wish to determine how competitive the modulation domain is for spectral subtraction as compared to the acoustic domain. For this purpose, we extend the traditional analysis-modification-synthesis framework to include modulation domain processing. We then compensate the noisy modulation spectrum for additive noise distortion by applying the spectral subtraction algorithm in the modulation domain. Using subjective listening tests and objective speech quality evaluation we show that the proposed method results in improved speech quality. Furthermore, applying spectral subtraction in the modulation domain does not introduce the musical noise artifacts that are typically present after acoustic domain spectral subtraction. The proposed method also achieves better background noise reduction than the MMSE method.
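For readers unfamiliar with spectral subtraction, the basic rule can be sketched as follows. This is a generic power-subtraction rule with a spectral floor, not the paper's modulation-domain pipeline; the floor value and the function name are assumptions. In a modulation-domain scheme, a rule of this kind would be applied to the modulation spectrum of each acoustic frequency bin rather than to the acoustic spectrum itself.

```python
import math

def spectral_subtract(noisy_mag, noise_mag, floor=0.002):
    """Subtract the noise power from the noisy power in each bin.

    Bins where the difference falls below floor * noisy power are clamped
    to that floor, so the result never becomes negative.
    """
    out = []
    for x, d in zip(noisy_mag, noise_mag):
        p = x * x - d * d
        p = max(p, floor * x * x)
        out.append(math.sqrt(p))
    return out

# Hypothetical magnitudes for three bins and a flat noise estimate.
print(spectral_subtract([5.0, 1.0, 0.5], [3.0, 3.0, 3.0]))
```

The enhanced magnitudes are normally recombined with the noisy phase before synthesis.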
|
| #6 | Variational Loopy Belief Propagation for Multi-talker Speech Recognition
Steven Rennie (IBM) John Hershey (IBM) Peder Olsen (IBM)
We address single-channel speech separation and recognition by combining loopy belief propagation and variational inference methods. Inference is done in a graphical model consisting of an HMM for each speaker combined with the max interaction model of source combination. We present a new variational inference algorithm that exploits the structure of the max model to compute an arbitrarily tight bound on the probability of the mixed data. The variational parameters are chosen so that the algorithm scales linearly in the size of the language and acoustic models, and quadratically in the number of sources. The algorithm scores 30.7% on the SSC task [Cooke:09], which is the best published result by a method that scales linearly with speaker model complexity to date. The algorithm achieves average recognition error rates of 27%, 35%, and 51% on small datasets of SSC-derived speech mixtures containing two, three, and four sources, respectively, using a single audio channel.
|
| #7 | Enhancement of Binaural Speech Using Codebook Constrained Iterative Binaural Wiener Filter
Nadir Cazi (Indian Institute of Science, Bangalore) Thippur Sreenivas (Indian Institute of Science, Bangalore)
A clean speech VQ codebook has been shown to be effective in providing intraframe constraints and hence better convergence of the iterative Wiener filtering scheme for single channel speech enhancement. Here we present an extension of the single channel CCIWF scheme to binaural speech input by incorporating a speech distortion weighted multi-channel Wiener filter. The new algorithm shows considerable improvement over single channel CCIWF in each channel, in a diffuse noise field environment, in terms of a posteriori SNR and a speech intelligibility measure. Next, considering a moving speech source, good tracking performance is seen, up to a certain resolution.
|
| #8 | A Semi-blind Source Separation Method with A Less Amount of Computation Suitable for Tiny DSP Modules
Kazunobu Kondo (Yamaha Corporation) Makoto Yamada (Yamaha Corporation) Hideki Kenmochi (Yamaha Corporation)
In this paper, we propose a method of implementing FDICA on tiny DSP modules. First, we present a semi-blind separation matrix initialization step based on an estimation method that uses covariance fitting for a known source and an unknown source. It contributes to faster convergence and less computation. Second, we present a learning band selection step that uses the determinant of the covariance matrix as the selection criterion; this achieves a significant reduction in computation with practical separation performance. Finally, the effectiveness of the proposed method is evaluated via source separation simulations in anechoic and reverberant rooms, and a procedure and a resource estimate for the integrated method, which we call tinyICA, are shown.
|
| #9 | Model-based Speech Separation: Identifying Transcription using Orthogonality
Siu Wa Lee (The Chinese University of Hong Kong) Frank K. Soong (Microsoft Research Asia) Tan Lee (The Chinese University of Hong Kong)
Spectral envelopes and harmonics are the building elements of a speech signal. By estimating these elements, individual speech sources in a mixture observation can be reconstructed and hence separated. Transcription gives the spoken content. More importantly, it describes the expected sequence of spectral envelopes, provided that models of the different speech sounds are available. Our recently proposed single-microphone speech separation algorithm exploits this to derive the spectral envelope trajectories of individual sources and remove interference accordingly. This paper investigates the relationship between the correctness of transcription hypotheses and the orthogonality of associated source estimates. An orthogonality measure is introduced to quantify the correlation between spectrograms. Experiments verify that underlying true transcriptions lead to a salient orthogonality distribution, which is distinguishable from that of counterfeit transcriptions.
|
| #10 | Enhanced Minimum Statistics Technique Incorporating Soft Decision For Noise Suppression
Yun-Sik Park (Inha University) Ji-Hyun Song (Inha University) Jae-Hun Choi (Inha University) Joon-Hyuk Chang (Inha University)
In this paper, we propose a novel approach to noise power estimation for robust noise suppression in noisy environments. From investigation of the state-of-the-art techniques for noise power estimation, it is found that the previously known methods are mostly accurate either during speech absence or during speech presence, but none of them works well in both situations. Our approach combines minimum statistics (MS) and soft decision (SD) techniques based on the probability of speech absence. The performance of the proposed approach is evaluated by a quantitative comparison method and a subjective test under various noise environments and is found to yield better results than conventional MS and SD-based schemes.
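The minimum statistics component can be sketched in simplified form: smooth the noisy power in one frequency bin over time, then take the minimum of the smoothed values over a sliding window as the noise estimate. The smoothing constant, window length, and function name below are illustrative assumptions; bias compensation and the soft-decision combination described in the abstract are omitted.

```python
def min_statistics_noise(power, alpha=0.85, window=5):
    """Track the noise power in one bin as the windowed minimum of smoothed power.

    power  : noisy power per frame for a single frequency bin
    alpha  : recursive smoothing constant (assumed value)
    window : number of recent frames searched for the minimum (assumed value)
    """
    smoothed = []
    estimate = []
    p = power[0]
    for y in power:
        p = alpha * p + (1 - alpha) * y  # first-order recursive smoothing
        smoothed.append(p)
        estimate.append(min(smoothed[-window:]))
    return estimate

# Hypothetical bin: noise power 1.0 with a three-frame burst of power 10.0.
frames = [1.0] * 10 + [10.0] * 3 + [1.0] * 10
noise = min_statistics_noise(frames)
print(noise[12])
```

The estimate ignores a burst only while the burst is shorter than the search window, which illustrates why a pure minimum tracker adapts slowly and why combining it with a speech-absence probability is attractive.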
|
| #11 | Effect of Noise Reduction on Reaction Time to Speech in Noise
Mark Huckvale (UCL) Jayne Leak (UCL)
In moderate levels of noise, listeners report that noise reduction (NR) processing can improve the perceived quality of a speech signal as measured on a typical MOS rating scale. Most quantitative experiments of intelligibility, however, show that NR reduces the intelligibility of noisy speech signals, and so should be expected to increase the cognitive effort required to process utterances. To study cognitive effort we look at how NR affects reaction times to speech in noise, using material that is still highly intelligible. We show that adding noise increases reaction times and that NR does not restore reaction times back to the quiet condition. The implication is that NR does not make speech "easier" to process, at least as far as this task is concerned.
|
| #12 | Joint Noise Reduction and Dereverberation of Speech Using Hybrid TF-GSC and Adaptive MMSE Estimator
Behdad Dashtbozorg (Yazd University) Hamid Reza Abutalebi (Yazd University)
This paper proposes a new multichannel hybrid method for dereverberation of speech signals in noisy environments. The method extends to dereverberation a hybrid noise reduction method that combines a Generalized Sidelobe Canceller (GSC) with a single-channel noise reduction stage. In this research, we employ the Transfer Function GSC (TF-GSC), which is more suitable for dereverberation. The single-channel stage is an Adaptive Minimum Mean-Square Error (AMMSE) spectral amplitude estimator, which we also modify for the dereverberation application. Experimental results demonstrate the superiority of the proposed method in dereverberation of speech signals in noisy environments.
|
| #13 | A Study on Multiple Sound Source Localization with a Distributed Microphone System
Kook Cho (Ritsumeikan University) Takanobu Nishiura (Ritsumeikan University) Yoichi Yamashita (Ritsumeikan University)
This paper describes a novel method for multiple sound source localization and its performance evaluation in actual room environments. The proposed method localizes a sound source by finding the position that maximizes the accumulated correlation coefficient between multiple channel pairs. After the estimation of the first sound source, a typical pattern of the accumulated correlation for a single sound source is subtracted from the observed distribution of the accumulated correlation. The second sound source is then searched for in the same way. To evaluate the effectiveness of the proposed method, multiple sound source localization experiments were carried out in an actual office room. The results show a multiple sound source localization accuracy of about 99.7%.
|
| #14 | Robust Minimal Variance Distortionless Speech Power Spectra Enhancement
Tao Yu (CRSS: Center for Robust Speech System, University of Texas at Dallas, Texas, USA) John H. L. Hansen (CRSS: Center for Robust Speech System, University of Texas at Dallas, Texas, USA)
In this study, we propose a novel minimal variance distortionless speech power spectral enhancement algorithm, which is robust to real-world implementation issues. Our proposed method is implemented in the power spectral domain, where stochastic noise can be modeled by an exponential distribution, whose non-Gaussianity is explored by an order statistics filter. Both theoretical and experimental results show the effectiveness of our proposed method over traditional ones.
|
| #15 | Speech Enhancement Minimizing Generalized Euclidean Distortion Using Supergaussian Priors
Amit Das (University of Colorado, Boulder and University of Texas, Dallas) John H. L. Hansen (University of Texas, Dallas)
We introduce short time spectral estimators which minimize the weighted Euclidean distortion (WED) between the clean and estimated speech spectral components when clean speech is degraded by additive noise. The traditional minimum mean square error (MMSE) estimator does not sufficiently account for perceptual measures during enhancement of noisy speech. In contrast, the new estimators discussed in this paper provide greater flexibility to improve speech quality. We explore the cases when the clean speech spectral magnitude and discrete Fourier transform (DFT) coefficients are modeled by super-Gaussian priors such as Chi and bilateral Gamma distributions, respectively. We also present the joint maximum a posteriori (MAP) estimators of the Chi distributed spectral magnitude and uniform phase. Performance evaluations over two noise types and three SNR levels demonstrate improved results of the proposed estimators.
|
| #16 | STFT-Based Speech Enhancement by Reconstructing the Harmonics
Iman Haji Abolhassani (INRS-Energie-Matériaux-Télécommunications, Montréal, Canada) Sid-Ahmed Selouani (Université de Moncton, Campus de Shippagan, Canada) Douglas O'Shaughnessy (INRS-Energie-Matériaux-Télécommunications, Montréal, Canada)
A novel Short Time Fourier Transform (STFT) based speech enhancement method is introduced. This method enhances the magnitude spectrum of a noisy speech segment. The key idea is to reconstruct the harmonics at the multiples of the fundamental frequency (F0) rather than trying to improve them. The harmonics are produced, in the magnitude spectrum, using knowledge of the window function used for the STFT. These harmonics are then scaled and laid on multiples of F0. Experimental results demonstrate the effectiveness of this enhancement method in various noisy conditions and at various SNRs.
|
| #17 | Joint Speech Enhancement and Speaker Identification Using Monte Carlo Methods
Ciira wa Maina (Drexel University) John MacLaren Walsh (Drexel University)
We present an approach to speaker identification using noisy speech observations where the speech enhancement and speaker identification tasks are performed jointly. This is motivated by the belief that human beings perform these tasks jointly and that optimality may be sacrificed if sequential processing is used. We employ a Bayesian approach where the speech features are modeled using a mixture of Gaussians prior. A Gibbs sampler is used to estimate the speech source and the identity of the speaker. Preliminary experimental results are presented comparing our approach to a maximum likelihood approach and demonstrating the ability of our method to both enhance speech and identify speakers.
|