Blind Dereverberation Using Maximum Kurtosis


Reverberation in speech, caused by room reflections, is problematic, especially for hands-free telephony in a confined space. The problem is even more severe for hearing-impaired people. Blind speech dereverberation is therefore an important research area. The task is to remove reverberation from the output of a room when both the room impulse response and the clean speech signal are unknown. The method discussed herein maximizes the fourth-order cumulant, referred to as kurtosis, of the Linear Prediction (LP) residual of the speech to remove reverberation from the degraded speech.


This report was written during the first semester of the Master of Engineering with Thesis program, with specialization in Signal and Speech Processing, at the Dept. of Electrical and Computer Engineering, McGill University, Canada.

The report covers the analysis, implementation, and testing of the maximum kurtosis based algorithm used to deconvolve the room impulse response. The primary metrics of the analysis include the ability to deconvolve, the convergence speed, and the number of required multiplications. The algorithm was implemented in MATLAB R2009a in a 64-bit environment.


In the report, the natural logarithm is denoted by log; otherwise the base is stated. Throughout the report, the whole sequence of a signal is denoted by a vector written in bold letters, e.g. x, whereas x(n) corresponds to an element of that sequence. If both time and vector indices are used, h_j(k) is the j'th coefficient of the vector h at time index k. Literature references are given as numbers in IEEE format, e.g. [number], and the full list of references is found in the References section.

The report also includes the MATLAB implementation of the algorithm.


Sound waves are reflected by the various surfaces and objects in a room; the reflected multipath signals reach the sink (microphone) as delayed and attenuated copies of the source signal. This effect is referred to as reverberation, and it degrades speech intelligibility.

1.1 Motivation and Problem Statement

Speech reverberation, mixed with environmental noise, is regarded as one of the primary issues in capturing speech in a confined space such as a business office. Normally, degraded (noisy or reverberant) speech is processed assuming that the degradation has long-term stationary characteristics relative to speech; many methods of speech enhancement have been proposed based on this concept. Unfortunately, due to sharp changes in the spectrum within a speech frame and across frames, the processed speech contains significant audible distortions, so such noise reduction is accomplished at the cost of quality. It is therefore worthwhile to look at methods that focus on the characteristics of speech, rather than of the degradation, to enhance degraded speech.

Over the years, several methods of speech dereverberation based on a simplified discrete model of speech production have been proposed. The basic model consists of an excitation source and a time-varying vocal tract filter. Such a model can easily be represented by an auto-regressive (AR) linear prediction (LP) technique. The inverse LP filter gives the LP residual, which is a close approximation of the excitation signal. The motivation for this project is the observation that, in reverberant environments, the LP residual contains the original impulses followed by several other peaks due to multipath reflections. Thus, dereverberation can be achieved by modifying the spectral envelope and/or the excitation signal.

1.2 Approach

It has been empirically established that, for clean voiced speech, LP residuals have strong peaks corresponding to glottal pulses, whereas for reverberated speech such peaks are spread in time [1]. In other words, the LP residual of reverberated speech is a time-spread version of the peakier LP residual of clean speech. Thus, the amplitude spread in a degraded signal can be seen as a reverberation metric. Recent research has suggested kurtosis, which measures the peakedness of a distribution, as a reasonable measure of reverberation [2, 3]. The goal of this project is to study an on-line gradient-ascent algorithm that maximizes the kurtosis of the LP residual, as proposed in [4]. In this attempt, the emphasis during enhancement is on the speech, and on its intelligibility to humans, rather than on the degradation. The project focuses on a single-microphone setup, as it is more practical as well as more challenging. Furthermore, the scenario considered here is that of blind dereverberation, where neither the clean signal nor the room impulse response (RIR) is known. This is the case in most practical situations.


Based on the targeted application, the problem of dereverberation can be divided into two major classes: dereverberation targeted at Automatic Speech Recognition (ASR) systems, and dereverberation targeted at making the signal more intelligible to humans. This work is mainly targeted at the latter class. When reverberation is removed for better human intelligibility, the aim is to make the signal sound better while preserving the spectral distribution of the formants; it is better to leave some reverberation than to cause spectral distortion. When the target is ASR, however, the goal is to maximize the signal-to-noise ratio by removing as much reverberation as possible; speech that sounds pleasant to a human listener is not the aim.

A lot of work has been done in this area. Some of the interesting methods are presented in the next section.

1.3 Related Literature

The RIR, in general, is non-minimum phase. A non-minimum phase filter is a mixture of a minimum phase filter, where all singularities lie inside the unit circle, and a maximum phase filter, where all the singularities lie outside the unit circle. Therefore, the inversion of a non-minimum phase filter results in a filter that has poles outside the unit circle; such a filter cannot be causal and stable at the same time. Thus the available approach is to estimate a near-inverse RIR filter to cancel the effect of the room in the degraded speech. For the same reason, second order statistics, which carry no phase information, cannot be used to reconstruct the direct signal, and therefore higher order statistics, such as kurtosis, are required.
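The stability argument above can be illustrated with a minimal sketch. It assumes a hypothetical two-tap filter G(z) = 1 + a z^-1 rather than a real RIR; the function names are illustrative and are not part of the report's MATLAB implementation. The causal inverse has impulse response (-a)^n, which decays when the zero lies inside the unit circle and grows without bound when it lies outside.

```python
def causal_inverse(a, n_taps):
    """First n_taps of the causal inverse of G(z) = 1 + a z^-1.

    The inverse 1 / (1 + a z^-1) has impulse response (-a)^n for n >= 0.
    """
    return [(-a) ** n for n in range(n_taps)]


def convolve(x, h):
    """Direct-form linear convolution of two finite sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


# Zero at z = -0.5 (inside the unit circle): the causal inverse decays.
min_phase_inv = causal_inverse(0.5, 20)
# Zero at z = -2.0 (outside the unit circle): the causal inverse diverges.
max_phase_inv = causal_inverse(2.0, 20)

# Filter followed by its truncated inverse is close to a unit impulse
# only in the minimum phase case.
equalized = convolve([1.0, 0.5], min_phase_inv)

print(abs(min_phase_inv[-1]), abs(max_phase_inv[-1]))
```

In the maximum phase case a stable inverse still exists, but it is anti-causal, which is why only a delayed, approximate inverse can be realized in practice.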

The idea of using kurtosis to remove reverberation from speech was first proposed by Tanrikulu et al. in [3], as least-mean kurtosis. However, the authors did not exploit any speech-specific properties of the input signal. Later, Gillespie et al. in [4] proposed an LMS-like gradient algorithm that maximizes the kurtosis of the LP residual of the speech signal to recover the clean speech. The LP residual has been used as an efficient metric for reverberation in speech, and many different algorithms have been proposed that use the LP residual for reverberant speech enhancement. The authors in [1], for example, use an LP residual weighting scheme that enhances the regions of a speech signal with a high signal-to-reverberation ratio. Other methods of dereverberation include spectral subtraction, as used in [5]. In [6], the authors propose using the CELP postfilter for dereverberation.

Chapter 2

Project Definitions and Background

This chapter presents a brief introduction to concepts widely used in this report.

2.1 Room

In this section the room and its properties are described. The room is described as the filter, g, between the source and the sink, i.e. the speaker and the microphone. The clean output of the speaker is the speech, s, and the signal at the sink is the reverberated speech, x. The effect of the ambient environment on speech can be modeled as a convolution in time of the speech signal and the RIR [7]. Furthermore, the system can also be modeled as containing additive noise [1]. This is illustrated in Fig. 2.1 and Eq. 2.1.

s −→ [ g ] −→ x

Fig. 2.1 Block diagram of speech convolution with room impulse response.

$$x(n) = g(n) * s(n) + w(n) \tag{2.1}$$

where n is the time index, * denotes convolution, and w(n) is the additive noise.
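Eq. 2.1 can be sketched in a few lines. This is a minimal illustration, not the report's implementation: the impulse response g below is a made-up direct path with two echoes, the input is a unit impulse standing in for speech, and the helper names and the noise level are assumptions.

```python
import random


def convolve(x, h):
    """Direct-form linear convolution of two finite sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def reverberate(s, g, noise_std=0.0, seed=0):
    """x(n) = g(n) * s(n) + w(n), with w(n) white Gaussian noise."""
    x = convolve(s, g)
    if noise_std > 0.0:
        rng = random.Random(seed)
        x = [v + rng.gauss(0.0, noise_std) for v in x]
    return x


s = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]   # unit impulse standing in for speech
g = [1.0, 0.0, 0.6, 0.0, 0.3]        # hypothetical direct path plus two echoes
x = reverberate(s, g)                # noise-free: x reproduces g
x_noisy = reverberate(s, g, noise_std=0.01, seed=1)
```

For a real RIR with thousands of taps, an FFT-based convolution would be the usual choice; the direct form is used here only to keep the sketch dependency-free.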

In this project, different room models were used. For many experiments, simplified room filters were defined using the well-known image-source method. The implementation used was the one proposed by Lehmann and Johansson [8], which addresses the problem of anomalous tail decay in the original image-source method proposed by Allen and Berkley [9]. Furthermore, real-life RIRs were downloaded from the Aachen Impulse Response database [10] and convolved with a clean speech signal to produce reverberated speech.

2.2 Reverberation

When a person is speaking in a regular room, the listeners will not only perceive the direct speech signal, but also various multipath copies of it created by reflections on the room walls and other objects. The multipath signals are delayed and possibly attenuated as compared to the direct signal. This phenomenon is known as reverberation. A very simple scenario with only one reflective surface is illustrated in Fig. 2.2. To tackle this degradation, it is of interest to minimize the effect of the room.


In the time domain, these effects can be divided into early and late reflections. Early reflections can be defined as the first 50-100 ms of the reflections. Early reflections are good for speech intelligibility, as they provide information about the acoustic environment, such as the size of the room and the position of the speaker in it. Fig. 2.3 illustrates the direct, early, and late signals.

2.3 Speech Production Model

The vocal tract can be modeled as an autoregressive (AR) process. The voiced sounds are generated when the input is given as quasi-periodic pulses, also referred to as the glottal pulses. Unvoiced sounds are made when the input is white noise. This simple source-filter model is illustrated in Fig. 2.4.

2.4 Performance Metrics

The selected metrics for algorithm evaluation are computational complexity, measured in the number of multiplications; the spectrogram; and subjective quality measures, such as feedback from several listeners. The spectrogram gives visual cues of how formants spread in time because of reverberation. Both computational complexity and subjective quality are important metrics for a hearing aid application, which has limited computational power at its disposal. It is therefore important that the algorithm is as efficient as possible and gives as clear an output as possible. Other important metrics are execution time (measured in seconds or clock cycles) and the amount of memory required for program and data storage. Execution time and memory requirements are, however, not analysed in this project, but an estimate of how far the algorithm is from a real-time application is considered. Also, the mean-square error between the original and reconstructed signals is not used here, because the system is not driven to minimize the mean-square error, and a minimum mean-square error does not necessarily correspond to better sounding speech.

2.5 Higher Order Statistics

The main advantage of using higher order statistics follows from the hypothesis that the signal received at the microphone, x(n), can be considered as composed of a Gaussian distributed component and a non-Gaussian distributed component. One important assumption here is that the clean input speech is non-Gaussian. A reverberated speech signal is a multipath signal, represented as weighted, delayed, and summed copies of the same signal. Hence, from the central limit theorem (CLT), which states that the distribution of the sum of independent and identically distributed (iid) signals is approximately Gaussian, the ambient noise and the reverberation added to the signal by convolution with the RIR can be considered as being Gaussian distributed.

All Gaussian distributed signals have cumulants of order higher than two equal to zero. The idea is therefore to establish a cost function that maximizes a higher order statistic of the processed signal, which entails that the processed signal should obtain a pdf that is non-Gaussian, that is, that the effect of the room impulse response has been removed.

In higher order statistics, the focus is on the 3rd and 4th order cumulants. The nth order cumulants κ_n of a random variable X are defined by the cumulant generating function, which is the logarithm of the moment-generating function:

$$g(t) = \log\left(E[e^{tX}]\right) = \sum_{n=1}^{\infty} \kappa_n \frac{t^n}{n!} = \mu t + \sigma^2 \frac{t^2}{2} + \ldots \tag{2.2}$$
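As a worked instance of Eq. 2.2, consider a Gaussian random variable. Its moment-generating function is a standard result, and the series then terminates after the second term, so differentiating g(t) at t = 0 gives zero for every cumulant of order three and above. This is the property the kurtosis-based cost function relies on.

```latex
\begin{align*}
E[e^{tX}] &= \exp\!\left(\mu t + \tfrac{1}{2}\sigma^2 t^2\right),
  \qquad X \sim \mathcal{N}(\mu, \sigma^2) \\
g(t) &= \log E[e^{tX}] = \mu t + \tfrac{1}{2}\sigma^2 t^2 \\
\kappa_1 &= g'(0) = \mu, \qquad
\kappa_2 = g''(0) = \sigma^2, \qquad
\kappa_n = g^{(n)}(0) = 0 \quad \text{for } n \ge 3
\end{align*}
```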



Cumulants are then given by the derivatives of g(t) at zero. A more general definition of the nth order cumulant for a non-Gaussian stationary random process x(k) is given in [11, Eq. 19]:

$$c_n^{x}(\tau_1, \tau_2, \ldots, \tau_{n-1}) = m_n^{x}(\tau_1, \tau_2, \ldots, \tau_{n-1}) - m_n^{xG}(\tau_1, \tau_2, \ldots, \tau_{n-1}) \tag{2.3}$$

where $m_n^{x}$ is the nth order moment of x(k), and $m_n^{xG}$ is the nth order moment of a Gaussian signal with mean and autocorrelation equal to those of x(k). The τ's are different delays, and $c_n^{x}$ is the nth order cumulant of x(k). It is important to note here that the above equation is not valid for n = 2. The 2nd order cumulant is the auto-covariance of x(k). It is clear from Eq. 2.3 that if x(k) is Gaussian distributed, the cumulants will be zero. This shows that the higher order cumulants are not affected by Gaussian noise. Finally, it is worth mentioning three commonly used parameters, which are defined for zero delay and zero mean (or central mean), i.e., have moments equal to cumulants.

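$$c_2^{x}(0) = E[x^2(n)] = \sigma_x^2$$

$$c_3^{x}(0, 0) = E[x^3(n)] = \gamma_3^{x}$$

$$c_4^{x}(0, 0, 0) = E[x^4(n)] - 3(\sigma_x^2)^2 = \gamma_4^{x} \tag{2.4}$$

where $\sigma_x^2$ is the variance, $\gamma_3^{x}$ is the skewness, and $\gamma_4^{x}$ is the kurtosis. The $3(\sigma_x^2)^2$ factor is important to note, as the fourth moment of a normally distributed signal equals $3(\sigma_x^2)^2$. A more widely used flavour of kurtosis is the normalized kurtosis, defined as:

$$\frac{E[x^4(n)]}{(\sigma_x^2)^2} - 3 \tag{2.5}$$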

Thus the normalized kurtosis of a normally distributed signal equals zero. Kurtosis is a measure of "peakedness", i.e., a signal whose distribution has a sharp peak around the mean and heavy tails, such as a sparse train of strong pulses, has a positive kurtosis.
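The normalized kurtosis of Eq. 2.5 can be estimated from samples as in the sketch below. It is an illustration only: the function name is an assumption, the sample mean is subtracted so that the zero-mean condition holds, and the two test signals (Gaussian noise and a sparse pulse train with a period of 80 samples) are synthetic stand-ins rather than data from the report.

```python
import random


def normalized_kurtosis(x):
    """Sample estimate of E[x^4] / (E[x^2])^2 - 3 after removing the mean."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    return m4 / (m2 * m2) - 3.0


rng = random.Random(0)
gaussian = [rng.gauss(0.0, 1.0) for _ in range(20000)]
pulse_train = [1.0 if n % 80 == 0 else 0.0 for n in range(20000)]

k_gaussian = normalized_kurtosis(gaussian)    # close to zero
k_pulses = normalized_kurtosis(pulse_train)   # strongly positive
print(k_gaussian, k_pulses)
```

The pulse train mimics the glottal-pulse structure of a clean LP residual, which is why a peaky residual scores a high kurtosis while a Gaussian-like one scores near zero.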

This concludes the brief presentation of higher order statistics and this chapter. The developed concepts and definitions will be utilized in the forthcoming chapters.


Chapter 3


Having covered the necessary background in the previous chapters, we now look at the kurtosis based method of dereverberation.

3.1 Linear Prediction of Speech

A speech signal can be expressed as a linear combination of its past samples. Based on the source-filter model discussed in Section 2.3, the clean speech can be modeled as the output of an all-pole process,

$$s(n) = -\sum_{k=1}^{p} a_k s(n-k) + u(n) \tag{3.1}$$

where p is the model order, the $a_k$'s are the corresponding filter coefficients, and u(n) is the glottal pulse excitation signal. Let the predicted signal for the above speech be $\tilde{s}(n)$, formed from the same past samples,

$$\tilde{s}(n) = -\sum_{k=1}^{p} b_k s(n-k) \tag{3.2}$$

where the $b_k$'s are the linear prediction (LP) coefficients. Now, if the speech signal were truly generated by an all-pole filter, Eq. 3.2 would be an exact prediction of the speech signal at all times except the glottal excitation instants, i.e.,

$$\text{for } a_k = b_k: \quad e(n) = s(n) - \tilde{s}(n) = u(n) \tag{3.3}$$

This error in prediction is referred to as the LP residual. It is evident from Eq. 3.3 that the LP residual whitens the speech signal and, in ideal conditions, represents the excitation signal.
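The relation between the all-pole model and the LP residual can be checked numerically. The sketch below is a minimal illustration under stated assumptions: the AR(2) coefficients and the 80-sample pulse period are made up, the LP coefficients are estimated with the autocorrelation method and the Levinson-Durbin recursion (a standard technique, not necessarily the one used in the report's MATLAB code), and no analysis window is applied.

```python
def synthesize(a, excitation):
    """All-pole synthesis: s(n) = -sum_k a_k s(n-k) + u(n) (Eq. 3.1)."""
    s = []
    for n, u in enumerate(excitation):
        acc = u
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * s[n - k]
        s.append(acc)
    return s


def autocorrelation(x, max_lag):
    """Autocorrelation r(0..max_lag) of a finite sequence."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the normal equations for b_k in s~(n) = -sum_k b_k s(n-k)."""
    b = [0.0] * (order + 1)  # b[0] is unused padding so indices match k
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(b[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        prev = b[:]
        for j in range(1, i):
            b[j] = prev[j] + k * prev[i - j]
        b[i] = k
        err *= 1.0 - k * k
    return b[1:]


def lp_residual(x, b):
    """e(n) = x(n) - x~(n) = x(n) + sum_k b_k x(n-k) (Eq. 3.3)."""
    e = []
    for n in range(len(x)):
        acc = x[n]
        for k, bk in enumerate(b, start=1):
            if n - k >= 0:
                acc += bk * x[n - k]
        e.append(acc)
    return e


true_a = [-1.2, 0.72]  # hypothetical stable AR(2) vocal tract model
excitation = [1.0 if n % 80 == 0 else 0.0 for n in range(1600)]
speech = synthesize(true_a, excitation)
b = levinson_durbin(autocorrelation(speech, 2), 2)
residual = lp_residual(speech, b)  # approximately recovers the pulse train
```

For this idealized signal the estimated coefficients match the model closely and the residual is almost exactly the excitation; with real speech the match is only approximate, since the vocal tract is time-varying and not strictly all-pole.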

In a similar fashion, LP of the reverberant speech can be written as,


$$x(n) = -\sum_{k=1}^{p} h_k x(n-k) + e_x(n) \tag{3.4}$$

where the $h_k$'s are the LP coefficients of the reverberant speech, p is the prediction order, and $e_x(n)$ is the LP residual of the reverberant speech. As reverberation mainly affects the excitation signal, it can be removed by modifying the LP residual so as to achieve $e_x(n) = u(n)$, after which the clean speech signal can be synthesized from the cleansed residual.

3.2 Maximum Kurtosis based Dereverberation

In this section the maximum kurtosis based blind dereverberation is discussed. The basic idea is to maximize the kurtosis of the LP residual of the received reverberant signal to achieve dereverberation. The concept stems from the fact that the LP residual of a speech signal closely approximates the glottal excitation signal and hence has quasi-periodic peaks. These peaks spread in time when reverberation is present or increased, so reverberation makes the LP residual of speech less peaky. Recall from Section 2.5 that kurtosis is a measure of the peakedness of a signal. Hence, the kurtosis of the LP residual of a speech signal decreases as the reverberation in the speech increases. An experimental verification of this is presented in Section 4.1.

In [4], Gillespie et al. present an adaptive algorithm to maximize the kurtosis of LP residuals. In this steepest-ascent algorithm, the cost function is the normalized kurtosis, as in Eq. 2.5. The block diagram of the algorithm is given in Fig. 3.1.

The adaptive filter h(n) is controlled by the feedback function f(n) given by the chosen cost function (described below), and the filtered LP residual $\tilde{y}(n)$ so obtained is used to synthesize the dereverberated signal y(n). An important assumption made here is that the predictor coefficients obtained from the LP analysis are unaffected by the reverberation and can be used to synthesize the clean speech from the filtered residual. This may not always be true. Hence, a secondary approach is to duplicate the adaptive filter coefficients and filter the reverberant signal directly to get the dereverberated speech, as illustrated in Fig. 3.2. To derive the adaptation equations, we want to maximize the kurtosis of $\tilde{y}(n)$, given by

$$J(n) = \frac{E[\tilde{y}^4(n)]}{E^2[\tilde{y}^2(n)]} - 3 \tag{3.5}$$

which constitutes our cost function. The gradient of J(n) with respect to the current filter is

$$\frac{\partial J}{\partial \mathbf{h}} = \frac{4\left(E[\tilde{y}^2]\,\tilde{y}^2 - E[\tilde{y}^4]\right)\tilde{y}}{E^3[\tilde{y}^2]}\,\tilde{\mathbf{x}} = f(n)\,\tilde{\mathbf{x}}(n) \tag{3.6}$$

and hence,

$$f(n) = \frac{4\left(E[\tilde{y}^2]\,\tilde{y}^2 - E[\tilde{y}^4]\right)\tilde{y}}{E^3[\tilde{y}^2]} \tag{3.7}$$

where f(n) is the desired feedback function used to control the filter update, and $\tilde{\mathbf{x}}(n)$ denotes the input vector of the adaptive filter. The update equation can be written as:

$$\mathbf{h}(n+1) = \mathbf{h}(n) + \mu f(n)\,\tilde{\mathbf{x}}(n) \tag{3.8}$$

where µ is the step size. The expected values can be calculated recursively as follows:

$$E[\tilde{y}^2(n)] = \beta E[\tilde{y}^2(n-1)] + (1-\beta)\,\tilde{y}^2(n)$$

$$E[\tilde{y}^4(n)] = \beta E[\tilde{y}^4(n-1)] + (1-\beta)\,\tilde{y}^4(n) \tag{3.9}$$

The parameter β is the weighting factor in the recursive update, and controls the smoothness of the moment estimates.
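The update equations above can be sketched as a sample-by-sample loop. This is an illustrative sketch, not the report's MATLAB implementation: the function names, the default step size and β, the pass-through initialization, the small constant eps that guards the division, and the unit-norm rescaling of the filter are all assumptions made for this example. The rescaling is harmless for the cost function because the normalized kurtosis does not change when the filter is scaled. The input is a synthetic stand-in for a reverberant LP residual (sparse pulses plus one echo).

```python
import math
import random


def feedback(y, e2, e4):
    """Feedback function f(n) of Eq. 3.7 for one filtered-residual sample."""
    return 4.0 * (e2 * y * y - e4) * y / (e2 ** 3)


def max_kurtosis_filter(residual, num_taps=20, mu=1e-4, beta=0.99, eps=1e-8):
    """Run the online update of Eq. 3.8 over an LP residual sequence.

    Returns the final filter taps and the filtered residual.
    """
    h = [1.0] + [0.0] * (num_taps - 1)  # assumed start: pass-through filter
    buf = [0.0] * num_taps              # newest input sample first
    e2 = e4 = eps                       # running estimates of E[y^2], E[y^4]
    filtered = []
    for sample in residual:
        buf = [sample] + buf[:-1]
        y = sum(hk * xk for hk, xk in zip(h, buf))
        # Recursive moment estimates (Eq. 3.9).
        e2 = beta * e2 + (1.0 - beta) * y ** 2
        e4 = beta * e4 + (1.0 - beta) * y ** 4
        # Gradient-ascent step (Eq. 3.8); eps guards the division.
        f = feedback(y, max(e2, eps), e4)
        h = [hk + mu * f * xk for hk, xk in zip(h, buf)]
        # Assumption, not from the report: rescale h to unit norm to keep
        # the taps bounded. Normalized kurtosis is scale invariant.
        norm = math.sqrt(sum(hk * hk for hk in h)) or 1.0
        h = [hk / norm for hk in h]
        filtered.append(y)
    return h, filtered


# Synthetic stand-in for a reverberant LP residual: sparse pulses plus an echo.
rng = random.Random(0)
pulses = [rng.gauss(0.0, 1.0) if rng.random() < 0.1 else 0.0
          for _ in range(2000)]
residual = [p + (0.5 * pulses[n - 3] if n >= 3 else 0.0)
            for n, p in enumerate(pulses)]
taps, filtered = max_kurtosis_filter(residual)
```

The step size and β strongly affect convergence, so the defaults here should be treated as starting points to tune rather than recommended values.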

3.3 Complexity of the Algorithm

In this section, the computational complexity of the algorithm is discussed. Before the analysis, it is noted that the algorithm could be optimized, e.g., by pre-calculating frequently used variables, by using look-up tables, and by exploiting parallelism, because the algorithm uses summations that can be calculated in parallel. Execution speed is critical because the application is a hearing aid, where real-time execution is required. Parallel computation would increase the execution speed significantly. It would, however, require a processor capable of performing the calculations in parallel.

The kurtosis maximization algorithm can be divided into four separate calculations, defined in Eq. 3.7, Eq. 3.8, and the two recursions of Eq. 3.9. Furthermore, it is also necessary to calculate the output signal y(n) of the filter once per algorithm update. In the following, the number of multiplications required to compute each of the equations is determined.

The filtering resulting in y(n) is defined as

$$y(n) = \mathbf{h}^{T}\mathbf{x} \tag{3.10}$$

The number of taps in the filter h(n) is equal to L, and therefore Eq. 3.10 requires L multiplications. The number of required additions is not counted in this simple cost analysis.

The filter update equation, given in Eq. 3.8, requires one multiplication to scale the feedback f(n) by µ, and multiplying this result with the input vector x(n) requires L multiplications because the length of x(n) is L; thus L + 1 multiplications in total.

The feedback function is given in Eq. 3.7. Squaring $\tilde{y}(n)$ requires one multiplication, and because the multiplication by the constant 4 can be absorbed into the step size µ in Eq. 3.8, the numerator can be calculated using three multiplications. The denominator requires two multiplications, and because it is assumed that a division requires the same number of cycles as a multiplication, even though this is a rough approximation, the total number of multiplications for this equation is six.

The first expectation, $E[\tilde{y}^2(n)]$, defined in Eq. 3.9, requires two multiplications, because the squaring of $\tilde{y}$ has already been done for Eq. 3.7, and the second estimate, $E[\tilde{y}^4(n)]$, requires three multiplications, because $\tilde{y}^4(n) = \tilde{y}^2(n) \cdot \tilde{y}^2(n)$, where $\tilde{y}^2(n)$ is known.

Hence, the total number of multiplications required for one filter update is L + (L + 1) + 6 + 2 + 3 = 2L + 12. In O-notation, the complexity is O(L).
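The count above can be tabulated with a small helper. The function name is hypothetical; the per-equation figures are the ones derived in this section, and the example values (L = 20 taps, 8000 Hz sampling rate) are taken from the experiments in Chapter 4, assuming one filter update per input sample.

```python
def multiplications_per_update(num_taps):
    """Multiplications for one filter update, following Section 3.3."""
    filtering = num_taps           # Eq. 3.10: y(n) = h^T x
    filter_update = num_taps + 1   # Eq. 3.8: scale f(n) by mu, then times x
    feedback = 6                   # Eq. 3.7: numerator 3, denominator 2, division 1
    second_moment = 2              # Eq. 3.9: E[y~^2] recursion
    fourth_moment = 3              # Eq. 3.9: E[y~^4] recursion
    return filtering + filter_update + feedback + second_moment + fourth_moment


per_update = multiplications_per_update(20)   # 2 * 20 + 12
per_second = per_update * 8000                # one update per sample at 8 kHz
print(per_update, per_second)
```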

This concludes the chapter, and next we will look at some experiments and results.

Chapter 4 Experiments and Results

In this chapter, experimental setup and various results are discussed.

4.1 Kurtosis and LP residual of reverberant speech

To verify that the kurtosis of the LP residual of a speech signal decreases with reverberation, a room environment was simulated using the algorithm proposed in [8], with the following details.

Dimensions: 4 × 13 × 4 m

Mic Position: At [2,2,2]

Source Position: Moving from [2,3,2] to [2,12,2]

Reverberation time, T60 = 0.4 seconds.

A clean speech signal sampled at 8000 Hz was convolved with the RIR. The kurtosis of the LP residual of the output was calculated and plotted against the distance between the source and the mic, as depicted in Fig. 4.1. As the distance between the source and the mic increases, the reverberation in the received speech increases, and the kurtosis of its residual decreases. This is evident from the figure. Similar results should be expected if the source were fixed at [2,2,2] and the mic were moving. Furthermore, it can be seen from the figure that the kurtosis of the LP residual of clean speech is very high compared to that of the reverberated residual, and that the kurtosis of the speech signal itself is not very high.


This small experiment establishes that the kurtosis of LP residual decreases with increase in reverberation.
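The same trend can be reproduced on a toy signal without a room simulator. The sketch below is not the experiment of Fig. 4.1: it replaces the LP residual of clean speech with an ideal pulse train, and replaces the simulated RIR with a hypothetical direct path followed by an exponentially decaying tail whose gain stands in for the amount of reverberation. All names and parameter values are assumptions.

```python
def normalized_kurtosis(x):
    """Sample estimate of E[x^4] / (E[x^2])^2 - 3 after removing the mean."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    return m4 / (m2 * m2) - 3.0


def convolve_truncated(x, h):
    """Linear convolution, truncated to the length of x."""
    y = [0.0] * len(x)
    for i, xi in enumerate(x):
        if xi == 0.0:
            continue  # skip silent samples of the sparse input
        for j, hj in enumerate(h):
            if i + j < len(y):
                y[i + j] += xi * hj
    return y


def toy_rir(tail_gain, decay=0.9, tail_len=40):
    """Direct path followed by an exponentially decaying reverberant tail."""
    return [1.0] + [tail_gain * decay ** k for k in range(1, tail_len + 1)]


pulses = [1.0 if n % 80 == 0 else 0.0 for n in range(8000)]
kurtosis_by_gain = {
    gain: normalized_kurtosis(convolve_truncated(pulses, toy_rir(gain)))
    for gain in (0.0, 0.3, 0.6)
}
for gain, value in kurtosis_by_gain.items():
    print(gain, round(value, 1))
```

A stronger tail spreads each pulse over more samples, so the kurtosis drops as the tail gain grows, mirroring the behaviour reported for increasing source-to-mic distance.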

4.2 Dereverberation Experiments

In this section, three dereverberation experiments are discussed, where a separate reverberant room impulse response is used for each experiment. For all the experiments below, β = 0.9 was used. For the LP residual, a Hamming window of size 256 and a filter-tap length of 20 were used.

4.2.1 Experiment 1

In the first experiment, a room of dimensions 4 × 4 × 4 m was considered, with reverberation time T60 = 0.7 seconds. A clean speech signal sampled at 8000 Hz was taken to originate from a source located at [2,2,2], and the microphone was situated at [2,3,2]. To understand the performance of the algorithm, the waveforms of the signals as well as the LP residuals were plotted. The spectrogram was analysed using the "wave-surfer" software. Furthermore, several people were asked to rate the reduction in reverberation as a subjective measure. The results are discussed below.

The LP residuals of the clean, reverberant, and dereverberated signals were calculated. The waveforms are depicted in Fig. 4.2.

To visualize the effect of reverberation in the speech signal, the waveforms of the clean, reverberant, and processed speech were plotted, as illustrated in Fig. 4.3.

In all the figures above, the kurtosis of the respective signal has also been marked above its waveform. It is apparent from the plots that reverberation spreads the signal energy in time making the LP residual less peaky; the kurtosis values of different LP residuals also align with the results. It is also evident that the LP residual of the filtered output is much closer to that of the clean speech, and the kurtosis value has also increased.

Spectrograms of all three waveforms were plotted using the "wave-surfer" software and are shown in Fig. 4.4. The figure shows how the formants are spread in time in the reverberant speech, and that the filtering process produces a much cleaner picture for the dereverberated speech.

Subjective Feedback: On listening to the reverberant and processed speech, the output was perceptibly better than the input, with a significant reduction in reverberation.

4.2.2 Experiment 2

In this experiment, truly blind dereverberation is attempted. A reverberant speech signal from the ITU-T Wideband database was used, for which no clean signal or room information was available; the signal was sampled at 16000 Hz. To understand the performance of the algorithm, the waveforms of the signals as well as the LP residuals were plotted, and several people were asked to rate the reduction in reverberation as a subjective measure. The results are discussed below.

The LP residuals of the reverberant, and then the dereverberated are depicted in Fig.

Along the same lines as Experiment 1, the waveforms of the reverberant and processed speech were plotted, as illustrated in Fig. 4.6.

Again, the kurtosis of each signal is marked above its waveform. It is apparent from the plots that the LP residual of the filtered output is more peaky; the kurtosis values also increase as the reverberation decreases.

Subjective Feedback: On listening to the reverberant and processed speech, the output was perceptibly better than the input, with a significant reduction in reverberation.


4.2.3 Experiment 3

In the two experiments above, the algorithm was used as an offline algorithm; that is, the final set of filter coefficients was first calculated by letting the simulation run over the whole length of the residual signal, and the reverberant speech was then filtered through the adaptive filter using the derived coefficients. However, the primary use of dereverberation algorithms is in real-time scenarios, and hence, for this experiment, the filter was applied on-the-fly to the reverberant speech. For this experiment, the same clean speech signal as in Experiment 1 was used, but for reverberation a real-life RIR downloaded from the Aachen Impulse Response database [10] was used. To understand the performance of the algorithm, the waveforms of the signals as well as the LP residuals were plotted. The results are discussed below.

The LP residuals of the reverberant, and then the dereverberated output are depicted in Fig. 4.7.

Along the same lines as Experiment 1, the waveforms of the reverberant and processed speech were plotted, as illustrated in Fig. 4.8.

Again, the kurtosis of each signal is marked above its waveform. It is apparent from the plots that the LP residual of the filtered output is more peaky; the kurtosis values also increase as the reverberation decreases.

Subjective Feedback: On listening to the reverberant and processed speech, the output was perceptibly better than the input, with a significant reduction in reverberation. One could hear a gradual improvement in the signal quality as the algorithm learned and adapted the filter coefficients over time.

4.3 Summary

It was shown that the kurtosis of the LP residual is a good measure of reverberation in speech. Various experiments indicated that the algorithm works well for synthetic as well as real-life RIRs. Finally, Experiment 3 showed the algorithm to be suitable for real-time scenarios.

Chapter 5

Discussion and Conclusion

In this project, a maximum kurtosis based approach to the blind speech dereverberation problem was analysed. The applications targeted in this report are hearing aids and hands-free telephony systems with a single-microphone setup. The method was tested on simulated and real-life RIRs with real voiced speech.

The implemented algorithm is based on the work by Gillespie et al., [4], who presented a steepest-ascent solution for the problem. The source signal (the direct speech), s, is assumed to be non-Gaussian distributed. The room then causes the observed signal (the reverberated speech), x, to be corrupted by delayed and weighted versions of s, and introduces Gaussian distributed components. A Gaussian distributed signal has higher order cumulants equal to zero, and the approach is therefore to maximize the kurtosis of the observed signal, such that the room impact on the source signal can be removed.

In conclusion, an interesting method for blind speech dereverberation has been implemented and analysed in detail. Adjustments were made on the basis of carefully performed tests, and the conclusion is that the algorithm is able to significantly reduce the undesired reverberation in speech.

Future Work

The authors in [12] claim that an average over several spatially distributed microphones can potentially provide better results; hence, it may be worthwhile to investigate multi-microphone setups.

A time-domain implementation is prone to slow convergence, or no convergence at all, because of the large variation in the eigenvalues of the autocorrelation matrices of the input signal. Alternative noise-robust implementations, such as the subband adaptive method promoted in [4], may be investigated to avoid this issue.