Taikichiro Mori Memorial Research Fund
Graduate Student Researcher Development Grant Report
Noise reduction mechanisms for speech enhancement
Nguyen Anh Duc, email: ducna80@sfc.keio.ac.jp
I. Introduction
Speech enhancement is concerned with improving the quality and intelligibility of speech contaminated by noise, and it is sometimes referred to as noise reduction. It has a variety of applications in telecommunications, automatic speech recognition, and digital hearing aids.

Figure 1. An example of communication in a noisy environment.
There have been numerous proposed algorithms for speech enhancement and related applications. We can classify them by the number of microphones (single or multiple), by the domain of processing, e.g., time domain or frequency domain, and by the technique used to process the information, e.g., spectral subtraction, Wiener filtering, subspace methods, or statistical-model-based mechanisms.
This research considers single-channel speech enhancement mechanisms that use statistics and estimation techniques to process data in the frequency domain. Because only one microphone is used (single channel), the scope of this research is limited to monaural hearing. The noise is assumed to be statistically independent of the speech signal.
The main contributions of this research are two new, effective speech spectral amplitude estimators for speech enhancement. The originality of these estimators is a novel perceptually motivated cost function, developed from characteristics of the human auditory system. The experimental results show that the proposals outperform well-known methods in terms of both stronger noise reduction and less speech distortion.
II. Statistical-model-based mechanisms for single-channel speech enhancement
II.1 Statistical-model-based mechanisms for single-channel speech enhancement
In this approach, the speech enhancement problem is cast in a statistical estimation framework. Given a set of observations, here the Discrete Fourier Transform (DFT) coefficients of the noisy speech, i.e., the noisy signal spectrum, we wish to estimate the unknown DFT coefficients of the clean speech, i.e., the clean signal spectrum. To find this estimate, some prior knowledge, i.e., statistical properties of the noise and of the clean speech, must be known in advance. These properties are often the shape of the distribution, e.g., Gaussian or non-Gaussian, and whether the speech and noise components are independent (uncorrelated). Then, from this knowledge, the related statistics, e.g., expected value and variance, are calculated. Finally, estimation techniques come into play, including both classical estimators, where the parameters of interest are treated as unknown but deterministic variables, e.g., the Maximum Likelihood Estimator (MLE), and Bayesian ones, where the parameters of interest are treated as random variables with some prior distribution, e.g., Maximum A Posteriori (MAP) and Minimum Mean Square Error (MMSE) estimators. All research that has been done in this direction relates to the three steps above. The main contribution of this research lies in the last step: proposing novel, more efficient estimators.
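The three steps can be illustrated with a minimal numerical sketch on a single frequency bin. Everything below is an assumption for illustration only: the clean and noise DFT coefficients are modeled as independent zero-mean complex Gaussians, the noise variance is estimated by MLE from noise-only data, and the MMSE estimator under this model reduces to a Wiener gain.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1 -- model assumptions: on one frequency bin, clean speech S and
# noise D are independent zero-mean complex Gaussian random variables.
sigma_s2, sigma_d2 = 4.0, 1.0

def cgauss(var, n):
    """Draw n samples of a zero-mean complex Gaussian with the given variance."""
    return rng.normal(0.0, np.sqrt(var / 2), n) + 1j * rng.normal(0.0, np.sqrt(var / 2), n)

n = 100_000
S, D = cgauss(sigma_s2, n), cgauss(sigma_d2, n)
Y = S + D                                  # observed noisy spectrum

# Step 2 -- statistics: the MLE of the noise variance from noise-only
# observations is simply the sample mean power.
sigma_d2_hat = np.mean(np.abs(D) ** 2)

# Step 3 -- estimation: under this Gaussian model the MMSE estimate of S
# given Y is a Wiener gain applied to Y.
xi = np.mean(np.abs(Y) ** 2) / sigma_d2_hat - 1.0   # a priori SNR estimate
S_hat = (xi / (1.0 + xi)) * Y

print(np.mean(np.abs(Y - S) ** 2), np.mean(np.abs(S_hat - S) ** 2))
```

With these variances, the error of the estimate approaches the Wiener bound sigma_s2 * sigma_d2 / (sigma_s2 + sigma_d2) = 0.8, below the raw error of about 1.0 incurred by using the noisy observation directly.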
The general diagram for a typical single-channel statistical-model-based speech enhancement algorithm is presented in Figure 2.

Figure 2. General diagram of a single-channel statistical-model-based speech enhancement algorithm in the frequency domain.
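Such a frequency-domain processing chain (frame, window, DFT, per-bin gain, inverse DFT, overlap-add) can be sketched in code. This is a hedged illustration, not the algorithm proposed in this report: it uses a simple Wiener-type gain and assumes the first frames of the input are noise-only; all names and parameter values are illustrative.

```python
import numpy as np

def enhance(noisy, frame=256, hop=128, noise_frames=10):
    """Pipeline sketch: STFT analysis, per-bin spectral gain, inverse
    STFT with weighted overlap-add synthesis."""
    win = np.hanning(frame)
    n_frames = 1 + (len(noisy) - frame) // hop

    # Noise statistics: average power spectrum of the first (speech-free) frames.
    noise_psd = np.zeros(frame // 2 + 1)
    for i in range(noise_frames):
        noise_psd += np.abs(np.fft.rfft(noisy[i * hop: i * hop + frame] * win)) ** 2
    noise_psd /= noise_frames

    out = np.zeros(len(noisy))
    norm = np.zeros(len(noisy))
    for i in range(n_frames):
        sl = slice(i * hop, i * hop + frame)
        spec = np.fft.rfft(noisy[sl] * win)
        # Per-bin SNR estimate and Wiener-type gain, floored to limit artifacts.
        snr = np.maximum(np.abs(spec) ** 2 / noise_psd - 1.0, 1e-3)
        out[sl] += np.fft.irfft((snr / (1.0 + snr)) * spec, n=frame) * win
        norm[sl] += win ** 2
    return out / np.maximum(norm, 1e-8)    # compensate window overlap
```

For example, a 440 Hz tone in white noise (with a noise-only lead-in for the noise estimate) comes out with a noticeably smaller error against the clean tone than the unprocessed noisy signal.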
II.2 Contributions of this research
The core contribution of this research is a new cost function for Bayesian estimation: the weighted squared error between the true and the estimated values. While the squared error of the speech log-spectral amplitude is motivated by the fact that loudness is more perceptually relevant than intensity itself, the weighting factor comes from observations of auditory masking effects. This cost function therefore takes advantage of both useful properties of the human hearing system: masking effects and perceived loudness.
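One natural way to write such a cost function (the notation here is illustrative; see [1] for the exact formulation) is as a weighted squared log-spectral error,

```latex
d\!\left(A_k, \hat{A}_k\right) = W_k \left( \log A_k - \log \hat{A}_k \right)^2 ,
```

where \(A_k\) is the clean speech spectral amplitude at frequency bin \(k\), \(\hat{A}_k\) is its estimate, and \(W_k\) is a weight derived from the masking behaviour of the ear, so that estimation errors in strongly masked regions are penalized less. The Bayesian estimator then minimizes the expected cost given the noisy spectral observation \(Y_k\):

```latex
\hat{A}_k = \arg\min_{\hat{A}} \; E\!\left[\, d\!\left(A_k, \hat{A}\right) \mid Y_k \,\right].
```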
Based on this cost function, two speech (log) spectral amplitude estimators are constructed, under the Rayleigh and Chi speech prior assumptions respectively. While the Rayleigh prior is theoretically derived, the Chi prior is more general and capable of reflecting the super-Gaussian nature of the speech spectral amplitude distribution. Discussions on how to make the proposed algorithms practical for real applications are also presented. When the proposed estimators are evaluated on speech signals contaminated by various noise sources at different input signal-to-noise ratios, the experimental results show that they outperform the well-known Minimum Mean Square Error log-spectral amplitude estimator in terms of both noise reduction and speech quality.
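The relation between the two priors can be checked numerically. The sketch below writes out the unit-scale densities directly (an illustration of the family relationship only, not the exact parameterization used in this research): the Chi density with two degrees of freedom coincides with the Rayleigh density, while smaller shape values put more probability mass near zero, i.e., a peakier, more super-Gaussian-like amplitude prior.

```python
import numpy as np
from math import gamma

def chi_pdf(x, k):
    """Chi-distribution density with k degrees of freedom (unit scale)."""
    return x ** (k - 1) * np.exp(-x ** 2 / 2) / (2 ** (k / 2 - 1) * gamma(k / 2))

def rayleigh_pdf(x):
    """Rayleigh density with unit scale parameter."""
    return x * np.exp(-x ** 2 / 2)

x = np.linspace(0.01, 4.0, 200)

# k = 2 recovers the Rayleigh prior exactly, so Chi generalizes Rayleigh.
print(np.allclose(chi_pdf(x, 2.0), rayleigh_pdf(x)))

# k < 2 concentrates more mass near zero than the Rayleigh prior does.
print(chi_pdf(0.1, 0.5) > chi_pdf(0.1, 2.0))
```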
III. Future work
Some problems remain for future work, as follows:
1. Improve the implementation of the proposed estimators to make it more computationally efficient. Since these estimators are rather complicated to compute, we currently use a lookup table (LUT) to mitigate the problem and make the proposals implementable in real applications. The disadvantage of the LUT is that it consumes a significant amount of memory to store the precomputed data: the higher the precision (accuracy) required, the more memory is needed.
2. Conduct the implementation and evaluation of the second estimator, which is constructed from the Chi speech prior, and determine the best values of its parameters.
3. Consider the frequency dependence of human perception, or a multi-band speech enhancement strategy.
4. Incorporate speech presence uncertainty into the proposed estimators. This method substantially reduces residual noise and therefore improves the performance of the proposals.
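The LUT trade-off mentioned in point 1 above can be sketched as follows. The gain function here is a hypothetical stand-in (the real estimators involve more expensive special-function evaluations); the point is that a precomputed table plus linear interpolation replaces the costly evaluation at a controllable memory/accuracy cost.

```python
import numpy as np

def expensive_gain(snr):
    # Hypothetical stand-in for a costly estimator gain (illustration only).
    return snr / (1.0 + snr) * np.exp(-0.1 / np.maximum(snr, 1e-6))

# Precompute the table once, on a log-spaced a priori SNR grid.
grid = np.logspace(-3, 3, 512)       # 512 float64 entries = 4 KB of memory
table = expensive_gain(grid)

def lut_gain(snr):
    # Runtime lookup: linear interpolation on the precomputed grid.
    return np.interp(snr, grid, table)

snr = np.array([0.01, 0.5, 2.0, 100.0])
err = np.abs(lut_gain(snr) - expensive_gain(snr))
print(err.max())                     # interpolation error shrinks as the grid grows
```

Growing the grid tightens the approximation but linearly increases the memory footprint, which is exactly the precision/memory trade-off described above.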
References
[1] A. D. Nguyen, K. Naoe, and Y. Takefuji, "A new log-spectral amplitude estimator using the weighted Euclidean distortion measure for speech enhancement," Proc. 26th IEEE Convention of Electrical and Electronics Engineers in Israel (IEEEI), pp. 000675-000679, 2010.
[2] A. D. Nguyen, "Statistical model based mechanisms for single-channel speech enhancement," Master's thesis, Keio SFC, 2011.

