Noise And Speech Levels In Various Environments Computer Science Essay


Before designing a voice enhancement system, we have to understand how various types of noise behave, how noise sources differ, and what range of noise levels may be encountered in real life. In general, noise refers to anything that interferes with the signal we want. Extracting a desired information signal from a background of unwanted noise poses many technical challenges, for example in building cell phone systems, robust voice recognition systems and ultrasound machines.

Noise Sources

In real life, noise surrounds us wherever we go, and it appears in many shapes and forms: for example, cars passing by in the street, or people talking at a nearby table in a restaurant.

In general, noise can be divided into two types: stationary and non-stationary. Stationary noise has a power spectral density that does not change over time, for example the noise made by a computer fan or an air conditioner. Non-stationary noise has statistical properties that change over time, for example the noise caused by a door slam, a radio, voices or a TV. Because its statistics keep changing and are harder to estimate, non-stationary noise is more difficult to suppress than stationary noise.

Another distinguishing feature of the various types of noise is the shape of the spectrum, that is, the distribution of noise energy in the frequency domain. Comparing the long-term average spectra of restaurant, car and train noise taken from the NOIZEUS corpus shows that car noise is relatively stationary, whereas restaurant and train noise are not. The differences between the noise sources are more distinct in the frequency domain than in the time domain.

Normally, two main sources of distortion degrade the voice signal: additive noise and channel distortion. Additive noise can be stationary or non-stationary, as explained above; examples are a fan running in the background, a door slamming and a nearby conversation. If the signal is captured with the speaker close to the microphone, it contains little noise and reverberation. However, if the microphone is far from the speaker's mouth, it can pick up a lot of noise and reverberation.

Channel distortion can be caused by reverberation, the frequency response of the microphone, the response of the local loop of a telephone line, an electrical filter in the ADC circuit, or a speech codec. Reverberation is caused by the reflection of acoustic waves from the walls and other objects in the room, which alters the speech signal. For the direct path, the signal level at the microphone is inversely proportional to the distance from the speaker. For reflected sound waves, the signal level is inversely proportional to the total distance the sound travels. In addition, we have to take into account the energy absorbed each time the sound wave hits a surface, which differs from one surface material to another.
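The distance and absorption effects described above can be put into numbers with a short Python sketch. It is only an illustration under stated assumptions: the function name, the 1 m reference distance and the per-surface energy absorption coefficients are hypothetical choices, not values from the original study.

```python
import math

def path_level_db(path_length_m, ref_distance_m=1.0, absorption_coeffs=()):
    """Level of one propagation path, in dB relative to the direct sound
    measured at ref_distance_m (an assumed reference distance).

    absorption_coeffs holds one assumed energy absorption coefficient
    (between 0 and 1) for each surface the wave hits on the way.
    """
    # Signal level inversely proportional to distance -> -20*log10(r / r_ref).
    spreading_db = -20.0 * math.log10(path_length_m / ref_distance_m)
    # Each reflection keeps a fraction (1 - alpha) of the incident energy.
    absorption_db = sum(10.0 * math.log10(1.0 - a) for a in absorption_coeffs)
    return spreading_db + absorption_db

# Direct path of 2 m versus a 10 m path that bounces off one wall
# absorbing half of the incident energy (illustrative numbers only).
direct = path_level_db(2.0)
reflected = path_level_db(10.0, absorption_coeffs=(0.5,))
print(f"direct: {direct:.1f} dB, reflected: {reflected:.1f} dB")
```

Doubling the path length lowers the level by about 6 dB, and every reflection adds a further loss that depends on the surface material, which is why a distant microphone receives a weak direct signal mixed with many delayed, attenuated copies.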

2.1.2 Noise and Speech Levels in Various Environments

To design speech enhancement algorithms, we need to know the range of speech and noise intensity levels in real-world scenarios so that we can estimate the range of signal-to-noise ratio (SNR) levels encountered in realistic environments. The algorithms have to suppress noise effectively, and hence improve speech quality, within that range of SNR levels.

In 1977, Pearsons and colleagues (Pearsons, Bennett & Fidell, 1977) carried out a comprehensive measurement of speech and noise levels in real-world environments. They considered a variety of environments encountered in daily life, such as outside and inside the home, a commuter train, a nursing station and a department store. The analysis provided an important body of data on typical speech levels, noise levels and SNRs across a wide variety of everyday listening situations. The levels were measured using sound level meters and reported in dB sound pressure level (SPL), that is, the sound pressure relative to a reference of 0.0002 dynes/cm² (20 µPa), which corresponds to a barely audible sound.
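As a small illustration of the dB SPL scale, the following Python sketch converts between RMS sound pressure and dB SPL. The function names are hypothetical; the only physical constant is the reference pressure of 0.0002 dynes/cm², which equals 20 micropascals.

```python
import math

# Reference pressure for dB SPL: 0.0002 dynes/cm^2 = 20 micropascals.
P_REF_PA = 20e-6

def pressure_to_db_spl(pressure_pa):
    """Convert an RMS sound pressure in pascals to dB SPL."""
    if pressure_pa <= 0:
        raise ValueError("pressure must be positive")
    return 20.0 * math.log10(pressure_pa / P_REF_PA)

def db_spl_to_pressure(level_db):
    """Convert a level in dB SPL back to RMS pressure in pascals."""
    return P_REF_PA * 10.0 ** (level_db / 20.0)

# A pressure 1000 times the reference sits about 60 dB above it.
print(round(pressure_to_db_spl(0.02), 2))
```

Because the scale is logarithmic, every factor of ten in pressure adds 20 dB, so the wide range of everyday sound pressures maps onto a compact range of levels.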

Figure 2.1 Average speech and noise levels in a variety of environments, from Pearsons et al. (1977).

Figure 2.1 summarises the average speech and noise levels measured in various environments by Pearsons et al. It shows that speech levels increase at high background noise levels. People tend to raise their voices when the noise level goes beyond 45 dB SPL, a phenomenon known as the Lombard effect. The speech level tends to increase by about 0.5 dB for every 1 dB increase in background noise. When the ambient noise level goes beyond 70 dB SPL, people stop raising their voices. In practical environments, speech enhancement algorithms therefore need to operate at SNRs in the range of 0-15 dB.
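The trend described above can be sketched as a piecewise-linear rule in Python. This is a rough model, not the measured data: the 45 dB onset, the 70 dB ceiling and the 0.5 dB/dB slope come from the text, while the quiet-room speech level of 55 dB SPL is a hypothetical placeholder chosen only to make the example runnable.

```python
def lombard_speech_level(noise_db, quiet_speech_db=55.0,
                         onset_db=45.0, ceiling_db=70.0, slope=0.5):
    """Approximate speech level (dB SPL) for a given noise level (dB SPL).

    quiet_speech_db is an ASSUMED baseline, not a figure from Pearsons et al.
    """
    # Talkers only react to noise between the onset and the ceiling.
    effective_noise = min(max(noise_db, onset_db), ceiling_db)
    return quiet_speech_db + slope * (effective_noise - onset_db)

def snr_db(noise_db, **kwargs):
    """Resulting signal-to-noise ratio in dB."""
    return lombard_speech_level(noise_db, **kwargs) - noise_db

for noise in (40.0, 55.0, 70.0, 80.0):
    print(noise, lombard_speech_level(noise), snr_db(noise))
```

Since the speech level rises by only half of the noise increase, the SNR still falls as the environment gets louder, and it falls faster once talkers stop raising their voices.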

Speech Perception and Modelling

2.2.1 Speech Perception

Speech perception is the process by which humans interpret and understand the sounds used in language. It is closely linked to the fields of phonology and phonetics. Speech perception research seeks to understand how human listeners recognize speech sounds and use this information to understand spoken language. This research has been applied in building computer systems that can recognize speech and convey its meaning, and in improving speech recognition for listeners with hearing difficulties. Figure 2.2 shows the human organs of speech production. Many biological and psychological factors can affect speech, including disorders of the lungs, the larynx and the vocal cords.

Figure 2.2 Human organs of speech production

2.2.2 Engineering Model of Speech Production

Nowadays, many electronic devices can speak to us in a voice that is as similar as possible to a real human voice. Figure 2.3 shows one model of voice production.

Figure 2.3 Source-tract model of speech production

We use this model because it has been used extensively in low-bit-rate speech coding applications. With this model, we first decide whether the sound we want to produce is voiced or unvoiced. Voiced sounds are produced when the vocal folds are in the voicing state, whereas unvoiced sounds are produced when the vocal folds are in the unvoicing state. For voiced sounds, we generate a glottal pulse train that resembles the one produced by our vocal cords. For unvoiced sounds, we generate a noise-like signal, as found in fricative sounds. The generated signal is then passed through the vocal tract. In this model, the vocal tract is represented by a quasi-linear system excited by either a periodic or an aperiodic source, depending on the state of the vocal folds. The two states of the vocal folds are modelled by a switch, and the vocal tract is modelled by a time-invariant linear filter. This filter tries to mimic the effect of the shape formed by the pharyngeal cavity (throat), the oral cavity and the nasal cavity. Lastly, the output of the vocal tract filter is fed to the radiation model, which reproduces the effect of the radiation impedance that the air presents to the speech as it leaves the mouth.
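The chain of switch, vocal tract filter and radiation model can be sketched in a few lines of Python. This is a deliberately minimal illustration under several assumptions that are not part of the original model description: the glottal pulse train is simplified to an impulse train, the vocal tract is reduced to a single two-pole resonance (one formant at a hypothetical 500 Hz), and the radiation model is approximated by a first difference.

```python
import math
import random

def excitation(num_samples, voiced, pitch_period=80, seed=0):
    """The 'switch': periodic impulse train (voiced) or white noise (unvoiced)."""
    if voiced:
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(num_samples)]
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(num_samples)]

def vocal_tract(signal, freq_hz, bandwidth_hz, fs):
    """Time-invariant linear filter: a single two-pole resonator (assumed)."""
    r = math.exp(-math.pi * bandwidth_hz / fs)           # pole radius (< 1, stable)
    a1 = -2.0 * r * math.cos(2.0 * math.pi * freq_hz / fs)
    a2 = r * r
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = x - a1 * y1 - a2 * y2                        # all-pole difference equation
        out.append(y)
        y2, y1 = y1, y
    return out

def radiation(signal):
    """Lip radiation approximated by a first difference (high-pass)."""
    out, prev = [], 0.0
    for x in signal:
        out.append(x - prev)
        prev = x
    return out

def synthesize(num_samples, voiced, fs=8000):
    source = excitation(num_samples, voiced)
    shaped = vocal_tract(source, freq_hz=500.0, bandwidth_hz=80.0, fs=fs)
    return radiation(shaped)

voiced_frame = synthesize(400, voiced=True)     # buzz-like, periodic
unvoiced_frame = synthesize(400, voiced=False)  # hiss-like, aperiodic
```

A practical speech coder would use a higher-order all-pole filter whose coefficients are re-estimated every frame, but the structure of source, filter and radiation stays the same.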

2.3 Algorithms for Voice Enhancement

Voice enhancement algorithms can roughly be divided into three families: spectral subtraction, sub-space analysis and filtering algorithms. Spectral subtraction algorithms operate in the spectral domain by removing from each spectral band the amount of energy that corresponds to the noise contribution. Spectral subtraction is one of the most popular speech enhancement algorithms because it is effective at estimating the spectral magnitude of the voice signal. Sub-space analysis operates in the autocorrelation domain. The voice and noise components are assumed to be orthogonal, so their contributions can be readily separated, but finding the orthogonal components is computationally expensive. In addition, the orthogonality assumption is difficult to justify. We therefore do not use this algorithm in our project. Filtering algorithms operate in the time domain and include Wiener filtering and Kalman filtering. Wiener filtering attempts to remove the noise component, whereas Kalman filtering estimates both the noise and the voice components.

2.3.1 Spectral-Subtractive Algorithms

The spectral subtraction algorithm is historically one of the first algorithms proposed for noise reduction (Boll, 1979; Weiss et al., 1974). It assumes that the noise is additive and that the noise spectrum can be estimated and updated during periods when the speech signal is absent. An estimate of the clean signal spectrum is then obtained by subtracting the estimated noise spectrum from the noisy speech spectrum. The enhanced signal is obtained by computing the inverse discrete Fourier transform (IDFT) of the estimated signal spectrum using the phase of the noisy signal. The algorithm therefore involves only a single forward and a single inverse Fourier transform. The subtraction must be done carefully to avoid speech distortion: if too little is subtracted, much of the interfering noise remains; if too much is subtracted, some speech information may be removed.

Principle of Spectral Subtraction

The principle of spectral subtraction shown below was introduced by Boll in 1979.

Let y(n) be the sampled noisy speech signal, x(n) the clean signal and d(n) the noise signal. We assume that the noisy speech signal is the sum of the clean signal and the noise signal, so that

y(n) = x(n) + d(n) (1)

Taking the short-time Fourier transform of y(n), we get

Y (ωk) = X(ωk) + D(ωk) (2)

for ωk = 2πk/N and k = 0,1,2, . . . ,N - 1, where N is the frame length in samples.

We can express Y (ωk) in polar form as

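Y(ωk) = |Y(ωk)| e^(jφy(ωk)) (3)

where |Y(ωk)| is the magnitude spectrum and φy(ωk) is the phase spectrum of the noisy signal.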

Multiplying Y(ωk) by its conjugate Y*(ωk) gives the short-term power spectrum of the noisy speech:

|Y(ωk)|² = |X(ωk)|² + |D(ωk)|² + X(ωk)·D*(ωk) + X*(ωk)·D(ωk)

= |X(ωk)|² + |D(ωk)|² + 2Re{X(ωk)·D*(ωk)} (4)

The terms |D(ωk)|², X(ωk)·D*(ωk) and X*(ωk)·D(ωk) cannot be obtained directly and are approximated by E{|D(ωk)|²}, E{X(ωk)·D*(ωk)} and E{X*(ωk)·D(ωk)}, where E{·} denotes the expectation operator. Typically, E{|D(ωk)|²} is estimated during non-speech activity and is denoted by |Dˆ(ωk)|². If we assume that d(n) is zero mean and uncorrelated with the clean signal x(n), then the terms E{X(ωk)·D*(ωk)} and E{X*(ωk)·D(ωk)} reduce to zero. Under these assumptions, the estimate of the clean speech power spectrum, denoted by |Xˆ(ωk)|², is obtained as follows:

|Xˆ (ωk)|² = |Y(ωk)| ² - |Dˆ(ωk)| ² (5)

The above equation describes the power spectrum subtraction algorithm. The estimated power spectrum |Xˆ(ωk)|² is not guaranteed to be positive, so negative values are half-wave rectified (set to zero). The enhanced signal is finally obtained by computing the inverse Fourier transform of |Xˆ(ωk)| using the phase of the noisy speech signal. Equation (5) can also be written in the following form:

|Xˆ (ωk)|² = H²(ωk) |Y(ωk)| ² (6)

where

H(ωk) = √(1 - |Dˆ(ωk)|² / |Y(ωk)|²) = √((γ(ωk) - 1) / γ(ωk))

is the gain (or suppression) function and γ(ωk) ≜ |Y(ωk)|² / |Dˆ(ωk)|².

Assuming that the cross terms in equation (4) are zero, H(ωk) is always positive, taking values in the range 0 ≤ H(ωk) ≤ 1. H(ωk) is called the suppression function because it gives the amount of suppression, or attenuation, applied to the noisy power spectrum |Y(ωk)|² at a given frequency to obtain the enhanced power spectrum |Xˆ(ωk)|².

A more general version of the spectral subtraction algorithm is given by

|Xˆ(ωk)|^p = |Y(ωk)|^p - |Dˆ(ωk)|^p

where p is the power exponent, with p = 1 yielding the original magnitude spectral subtraction and p = 2 yielding the power spectral subtraction algorithm.
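The whole procedure (transform, subtract the noise estimate raised to the power p, half-wave rectify, and resynthesise with the noisy phase) can be sketched for a single frame with NumPy. This is a minimal sketch rather than a complete enhancer: the function names are hypothetical, windowing and overlap-add are left out, and the noise estimate is assumed to come from frames known to contain no speech.

```python
import numpy as np

def estimate_noise_spectrum(noise_frames, p=2.0):
    """Average |D(w_k)|^p over speech-free frames (one frame per row)."""
    return np.mean(np.abs(np.fft.rfft(noise_frames, axis=1)) ** p, axis=0)

def spectral_subtract_frame(noisy_frame, noise_mag_p, p=2.0):
    """Enhance one frame; p=1 is magnitude subtraction, p=2 power subtraction."""
    spectrum = np.fft.rfft(noisy_frame)
    noisy_mag_p = np.abs(spectrum) ** p
    # Subtract the noise estimate and half-wave rectify negative results.
    clean_mag_p = np.maximum(noisy_mag_p - noise_mag_p, 0.0)
    clean_mag = clean_mag_p ** (1.0 / p)
    # Reuse the phase of the noisy signal for resynthesis.
    enhanced = clean_mag * np.exp(1j * np.angle(spectrum))
    return np.fft.irfft(enhanced, n=len(noisy_frame))

# Illustrative use: a sinusoid in white noise (synthetic data, not speech).
rng = np.random.default_rng(0)
n = np.arange(256)
clean = np.sin(2 * np.pi * 8 * n / 256)
noisy = clean + 0.3 * rng.standard_normal(256)
noise_est = estimate_noise_spectrum(0.3 * rng.standard_normal((20, 256)))
enhanced = spectral_subtract_frame(noisy, noise_est)
```

Setting bins to zero wherever the subtraction goes negative is the simplest form of rectification. It leaves isolated spectral peaks in noise-only regions, which is one reason the amount subtracted has to be chosen with care.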

From equation (2), the noisy spectrum Y(ωk) at frequency ωk is obtained by summing two complex-valued spectra at that frequency. Y(ωk) can therefore be represented geometrically in the complex plane as the sum of two complex numbers, X(ωk) and D(ωk). Figure 2.4 shows Y(ωk) as the vector sum of X(ωk) and D(ωk) in the complex plane.

Figure 2.4 Representation of the noisy spectrum Y(ωk) in the complex plane as the sum of the clean signal spectrum X(ωk) and noise spectrum D(ωk).

2.3.2 Kalman Filtering

The Kalman filter operates through a prediction and correction mechanism. It statistically minimizes the error by predicting a new state from its previous estimate and adding a correction term proportional to the prediction error. The Kalman filter is the main algorithm for estimating dynamic systems specified in state-space form. It consists of a set of mathematical equations that give an optimal recursive solution by the least-squares method. The goal of this solution is to compute a linear, unbiased, minimum-variance estimator of the state at time t based on the information available at time t-1, and to update this estimate with the additional information available at time t (Clar et al. 1998). The study of the Kalman filter builds on the Wiener filter.

Wiener Filter

The objective of the Wiener filter is to remove noise from a corrupted signal. This optimal filter was proposed by Norbert Wiener during the 1940s. It takes a statistical approach to reducing the amount of noise in the corrupted signal. In practice, every measuring device introduces some error into its output when a signal is measured. Let xk be the original signal, hk the response of the device and yk the output. We can write

yk = xk * hk

Applying the Fourier transform turns the convolution (denoted by * above) into a product:

Yj = Xj · Hj

The second source of signal corruption is unknown background noise nk, which is added during the measurement process. The measured signal ykˆ is then:

ykˆ = yk + nk

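If there is no noise and we know the transfer response Hj, the solution of this equation is

Xj = Yj / Hj

If there is noise, we first have to filter the measured signal with a Wiener filter Φj:

Xj ≈ Yˆj · Φj / Hj

where Yˆj is the Fourier transform of the measured signal ykˆ, Nj is the Fourier transform of the noise nk, and the optimal filter in the least-squares sense is Φj = |Yj|² / (|Yj|² + |Nj|²).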
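A hedged NumPy sketch of this idea follows. It uses the standard least-squares form of the Wiener gain, Φj = |Yj|² / (|Yj|² + |Nj|²), and assumes that the power spectra of the noise-free output and of the noise are known or have been estimated separately. The function and argument names are hypothetical.

```python
import numpy as np

def wiener_gain(signal_psd, noise_psd):
    """Per-bin gain: close to 1 where the signal dominates, close to 0 where noise does."""
    signal_psd = np.asarray(signal_psd, dtype=float)
    noise_psd = np.asarray(noise_psd, dtype=float)
    return signal_psd / (signal_psd + noise_psd)

def wiener_deconvolve(measured, response_fft, signal_psd, noise_psd):
    """Estimate x_k from the measured signal: X_j ~ Yhat_j * Phi_j / H_j.

    response_fft is H_j sampled on the same FFT grid as the measurement
    and is assumed to be non-zero in every bin.
    """
    measured_fft = np.fft.fft(measured)
    phi = wiener_gain(signal_psd, noise_psd)
    return np.real(np.fft.ifft(measured_fft * phi / response_fft))

# Gain for two bins with SNRs of 0 dB and about 4.8 dB (illustrative values).
print(wiener_gain([1.0, 3.0], [1.0, 1.0]))
```

The gain never exceeds one, so the filter can only attenuate: bins where the noise dominates are pushed towards zero instead of being amplified by the division by Hj.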

Conventional filters are usually designed for a specific frequency response, but a Wiener filter requires knowledge of the spectral properties of the original signal and of the noise. The aim is to find the linear time-invariant (LTI) filter whose output is as close as possible to the original signal. The Wiener filter assumes that the signal and the additive noise are stationary linear stochastic processes with known spectral characteristics, or known autocorrelation and cross-correlation. The filter must be physically realizable, and the performance criterion is minimum mean-square error.

Discrete Kalman Filter: The Process to Be Estimated

In 1960, R.E. Kalman published his famous paper describing a recursive solution to the discrete data linear filtering problem [Kalman60].

The Kalman filter addresses the general problem of estimating the state X ∈ ℝ^m of a discrete-time controlled process that is governed by the linear stochastic difference equation

Xn = A · Xn-1 + wn-1

with a measurement Y ∈ ℝ^n given by

Yn = C · Xn + vn

The random variables wn and vn represent the process noise and the measurement noise, respectively. They are assumed to be independent of each other, white, and normally distributed:

p(w) ~ N(0, Rw)

p(v) ~ N(0, Rv)

In practice, the process noise covariance matrix Rw and the measurement noise covariance matrix Rv could change over time, but we assume they are constant. The matrix A has dimension m x m and relates the state at time n-1 to the state at time n. The matrix C has dimension n x m (here n denotes the dimension of the measurement, not the time index) and relates the state to the measurement Yn. These matrices may also change over time, but we generally assume they are constant.

The Algorithm of Discrete Kalman Filter

The Kalman filter estimates the process described above using a form of feedback control: it estimates the process state at some moment in time and then obtains feedback from the observed (noisy) data.

The equations of the Kalman filter fall into two groups: time update equations and measurement update equations. The time update equations project the state estimate and its error covariance forward from moment n-1 to moment n. The measurement update equations handle the feedback: they incorporate the new measurement into the previous estimate to obtain an improved estimate of the state.

The time update equations can be seen as prediction equations, while the measurement update equations can be seen as correction equations. The final estimation algorithm is therefore a prediction-correction algorithm: the Kalman filter predicts the new state and its uncertainty, and then corrects the prediction with the new measurement. Figure 2.5 shows the cycle of the discrete Kalman algorithm.

Figure 2.5 The discrete Kalman filter cycle. The time update projects the current state estimate ahead in time. The measurement update adjusts the projected estimate by an actual measurement at that time.

The specified equations for the state prediction are detailed as follows:
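In the standard formulation of the discrete Kalman filter, the time update consists of two equations. They are written here with the symbols A and Rw defined above. The hat notation, the superscript minus for a predicted (a priori) quantity, and the symbol P for the estimation error covariance are notation assumed for this sketch rather than taken from the original figure.

```latex
\begin{align}
\hat{x}_n^{-} &= A\,\hat{x}_{n-1} \\
P_n^{-} &= A\,P_{n-1}\,A^{T} + R_w
\end{align}
```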

These equations project the state and covariance estimates forward from moment n-1 to moment n, giving an estimate of xn and of its covariance. The first equation predicts the next state from the previous state. The second equation gives the covariance matrix of the prediction error. The matrix A relates the state at the previous moment n-1 to the state at the current moment n; it could change from one moment to the next. Rw is the covariance of the random perturbation of the process whose state is being estimated.

The specified equations for the state correction are detailed as follows. They are called measurement updating equations.
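In the standard formulation, the measurement update consists of three equations. In this sketch, the predicted state and its error covariance from the time update are written with a superscript minus, I is the identity matrix, and C and Rv are the matrices defined earlier. The symbols K for the Kalman gain and P for the error covariance are assumed notation; the text refers to the gain as Re,n.

```latex
\begin{align}
K_n &= P_n^{-} C^{T} \left( C\,P_n^{-} C^{T} + R_v \right)^{-1} \\
\hat{x}_n &= \hat{x}_n^{-} + K_n \left( y_n - C\,\hat{x}_n^{-} \right) \\
P_n &= \left( I - K_n C \right) P_n^{-}
\end{align}
```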

First, the Kalman gain, Re,n, is calculated. This gain is chosen so that it minimizes the error covariance of the new state estimate. The next step is to measure the process to obtain yn and to generate a new state estimate that incorporates this observation. The last step is to compute a new estimate of the error covariance using the last equation. After each pair of time and measurement updates, the process is repeated, taking the new state estimate and error covariance as the starting point.
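One full predict-correct cycle can be sketched with NumPy as follows. The code uses the standard textbook form of the five equations. The variable names (x for the state estimate, P for its error covariance, K for the gain) are assumptions of this sketch, and the example values at the end are purely illustrative.

```python
import numpy as np

def kalman_step(x_prev, P_prev, y, A, C, Rw, Rv):
    """One time update followed by one measurement update."""
    # Time update (prediction): project state and covariance from n-1 to n.
    x_pred = A @ x_prev
    P_pred = A @ P_prev @ A.T + Rw
    # Measurement update (correction).
    S = C @ P_pred @ C.T + Rv                  # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)        # Kalman gain
    x_new = x_pred + K @ (y - C @ x_pred)      # blend prediction and measurement
    P_new = (np.eye(len(x_prev)) - K @ C) @ P_pred
    return x_new, P_new

# Illustrative use: track a constant value of 1.0 observed in white noise.
rng = np.random.default_rng(1)
A = np.eye(1)
C = np.eye(1)
Rw = 1e-5 * np.eye(1)      # assumed process noise covariance
Rv = 0.01 * np.eye(1)      # assumed measurement noise covariance
x, P = np.zeros(1), np.eye(1)
for _ in range(200):
    y = np.array([1.0]) + 0.1 * rng.standard_normal(1)
    x, P = kalman_step(x, P, y, A, C, Rw, Rv)
print(x, P)
```

The explicit matrix inverse keeps the sketch close to the equations. For larger measurement vectors, solving the linear system with np.linalg.solve is usually preferred for numerical reasons.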

Figure 2.6 shows the complete operation of the filter, combining the prediction and correction steps and the five Kalman equations.

Figure 2.6 Main equations of the Kalman filter: the interaction of the prediction and correction steps