The primary goal



The primary goal of this project is to design a MATLAB-based simulator that processes speech using the warped discrete cosine transform (WDCT) and produces a reconstructed speech signal that closely resembles the input speech signal.

To achieve these results, sample speeches were obtained. These were modeled as an autoregressive (AR) process and represented in the frequency domain by the WDCT filter.

The original speech signal was compared with the reconstructed speech signal obtained at the output of the filter. The aim of this comparison is to obtain an output speech signal that is similar to the original one. It was concluded that the WDCT is a good reconstruction method for speech.


1.1 Introduction

Speech is a form of communication in everyday life. It has existed since human civilization began, and today it is carried by high-technology telecommunication systems. As applications such as cellular and satellite communication become more popular, users demand more advanced technology and improved applications. For this reason, researchers are looking closely at the four generic attributes of speech coding: complexity, quality, bit rate and delay. Other issues, such as robustness to transmission errors, multistage encoding/decoding, and accommodation of non-voice signals such as in-band signaling and voice-band modem data, also play an important role in speech coding.

To understand these processes, the structures and functions of spoken language have to be studied carefully for both human and machine speech: how we produce and perceive speech, and how speech technology may assist us in communication. This project therefore looks at speech processing with the aid of a technique known as the warped discrete cosine transform (WDCT). At present this technique is not widely used in signal processing, but it is a promising candidate. Speech processing and the WDCT are explained in more detail in the later chapters of this thesis report.

1.2 Overview:

Speech processing has been a growing and dynamic field for more than two decades, and there is every indication that this growth will continue and even accelerate. During this growth there has been a close relationship between the development of new algorithms and theoretical results, and new filtering techniques have also contributed to the success of speech processing. One of the common adaptive filtering techniques applied to speech is the Wiener filter. This filter is capable of estimating errors, but only with very slow computation. The Kalman filter overcomes this disadvantage. The DCT, however, offers a further advantage over both: a higher energy compaction capability.

Kalman filtering techniques are widely known for their use in GPS (Global Positioning System) and INS (Inertial Navigation System) applications. Nonetheless, they are not widely used in speech signal coding. The Kalman filter is popular in radar tracking and navigation systems because it is an optimal estimator, which provides very accurate estimates of the position of airborne objects or shipping vessels.

The new feature is termed the warped discrete cosine transform (WDCT). It is obtained by replacing the discrete cosine transform (DCT) with the WDCT, which is implemented as a cascade of the DCT and IIR all-pass filters. Because of this structure, a WDCT can be tuned in many ways to suit engineering applications such as image and speech processing. Since preserving the information contained in speech is extremely important, the availability of signal filters such as the WDCT is of great value.

1.3 Background (Speech Processing)

Speech Production:

Speech is produced when air is forced from the lungs through the vocal cords and along the vocal tract. The vocal tract extends from the opening in the vocal cords (called the glottis) to the mouth, and in an average man is about 17 cm long. It introduces short-term correlations (of the order of 1 ms) into the speech signal, and can be thought of as a filter with broad resonances called formants. The frequencies of these formants are controlled by varying the shape of the tract, for example by moving the position of the tongue. An important part of many speech codecs is the modeling of the vocal tract as a short term filter. As the shape of the vocal tract varies relatively slowly, the transfer function of its modeling filter needs to be updated only relatively infrequently (typically every 20 ms or so).

The vocal tract filter is excited by air forced into it through the vocal cords. Speech sounds can be broken into three classes depending on their mode of excitation.

  • Voiced sounds are produced when the vocal cords vibrate open and closed, thus interrupting the flow of air from the lungs to the vocal tract and producing quasi-periodic pulses of air as the excitation. The rate of the opening and closing gives the pitch of the sound. This can be adjusted by varying the shape of, and the tension in, the vocal cords, and the pressure of the air behind them. Voiced sounds show a high degree of periodicity at the pitch period, which is typically between 2 and 20 ms. This long-term periodicity can be seen in Figure 1 which shows a segment of voiced speech sampled at 8 kHz. Here the pitch period is about 8 ms or 64 samples.
  • Unvoiced sounds result when the excitation is a noise-like turbulence produced by forcing air at high velocities through a constriction in the vocal tract while the glottis is held open. Such sounds show little long-term periodicity as can be seen from Figures 3 and 4, although short-term correlations due to the vocal tract are still present.
  • Plosive sounds result when a complete closure is made in the vocal tract, and air pressure is built up behind this closure and released suddenly.
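
The long-term periodicity of voiced sounds described above can be made concrete with a small autocorrelation search. The following is only a minimal sketch: the function name `estimate_pitch_period` is hypothetical, and a synthetic pulse train stands in for real voiced speech (one pulse every 64 samples, matching the 8 ms pitch period at 8 kHz mentioned in the text).

```python
def estimate_pitch_period(frame, min_lag, max_lag):
    """Return the lag (in samples) at which the autocorrelation is largest."""
    best_lag, best_value = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Autocorrelation of the frame with itself shifted by `lag` samples.
        value = sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
        if value > best_value:
            best_lag, best_value = lag, value
    return best_lag


fs = 8000  # sampling rate in Hz, as in the text
# Assumed test signal: idealized glottal pulses every 64 samples (8 ms).
frame = [1.0 if n % 64 == 0 else 0.0 for n in range(320)]

# Search lags of 2 ms to 20 ms, i.e. 16 to 160 samples at 8 kHz.
period = estimate_pitch_period(frame, min_lag=16, max_lag=160)
period_ms = 1000.0 * period / fs
```

For an unvoiced (noise-like) frame the same search would return a lag with only a weak autocorrelation peak, which is one simple way to tell the two excitation classes apart.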

Some sounds cannot be considered to fall into any one of the three classes above, but are a mixture. For example voiced fricatives result when both vocal cord vibration and a constriction in the vocal tract are present.

Although there are many possible speech sounds which can be produced, the shape of the vocal tract and its mode of excitation change relatively slowly, and so speech can be considered to be quasi-stationary over short periods of time (of the order of 20 ms). Speech signals show a high degree of predictability, due sometimes to the quasi-periodic vibrations of the vocal cords and also due to the resonances of the vocal tract. Speech coders attempt to exploit this predictability in order to reduce the data rate necessary for good quality voice transmission.

From a technical, signal-oriented point of view, the production of speech is widely described as a two-stage process. In the first stage the sound is initiated, and in the second stage it is filtered. This distinction between stages has its origin in the source-filter model of speech production.

The basic assumption of the model is that the source signal produced at the glottal level is linearly filtered through the vocal tract. The resulting sound is emitted to the surrounding air through radiation loading (lips). The model assumes that source and filter are independent of each other. Although recent findings show some interaction between the vocal tract and a glottal source (Rothenberg 1981; Fant 1986), Fant's theory of speech production is still used as a framework for the description of the human voice, especially as far as the articulation of vowels is concerned.

What is Speech Processing?

The term speech processing refers to the scientific discipline concerned with the analysis and processing of speech signals in order to achieve the best benefit in various practical scenarios. The field is at present undergoing rapid growth in terms of both performance and applications, stimulated by advances in microelectronics, computation and algorithm design. Speech processing covers an extremely broad area, which relates to the following three engineering applications:

  • Speech coding and transmission, which is mainly concerned with man-to-man voice communication;
  • Speech synthesis, which deals with machine-to-man communication;
  • Speech recognition, which relates to man-to-machine communication.

Speech Coding:

Speech coding or compression is the field concerned with compact digital representations of speech signals for the purpose of efficient transmission or storage. The central objective is to represent a signal with a minimum number of bits while maintaining perceptual quality. Current applications for speech and audio coding algorithms include cellular and personal communications networks (PCNs), teleconferencing, desktop multi-media systems, and secure communications.

Speech Synthesis:

Speech synthesis is the conversion of a command sequence or input text (words or sentences) into a speech waveform using algorithms and previously coded speech data. The text can be entered by keyboard, by optical character recognition, or from a previously stored database. A speech synthesizer can be characterized by the size of the speech units it concatenates to yield the output speech, as well as by the method used to code, store and synthesize the speech. If large speech units such as phrases and sentences are involved, high-quality output speech can be achieved, at the cost of large memory requirements. Conversely, efficient coding methods can reduce memory needs, but these usually degrade speech quality.

Speech Recognition:

Speech or voice recognition is the ability of a machine or program to recognize and carry out voice commands or take dictation. On the whole, speech recognition involves the ability to match a voice pattern against a provided or acquired vocabulary. A limited vocabulary is usually provided with a product, and the user can record additional words. More sophisticated software is able to accept natural speech (meaning speech as we usually speak it rather than carefully spoken speech). Speech information can be observed and processed only in the form of sound waveforms. It is therefore essential for the speech signal to be reconstructed properly.


The purpose of sampling is to transform an analog signal that is continuous in time into a sequence of samples that is discrete in time. The signals we use in the real world, such as our voices, are called "analog" signals. To process these signals in computers, they must first be converted to "digital" form. While an analog signal is continuous in both time and amplitude, a digital signal is discrete in both time and amplitude. Since in this thesis speech will be processed through a discrete Kalman filter, the speech signal must be converted from continuous time to discrete time; this process is called sampling.

The value of the signal is measured at certain intervals in time. Each measurement is referred to as a sample. Once the continuous analog speech signal is sampled at a frequency f, the resulting discrete signal will have more frequency components than the analog signal. To be precise, the frequency components of the analog signal are repeated at the sample rate. Explicitly, in the discrete frequency response they are seen at their original position, and also centered around +/- f, and +/- 2 f, etc.

To ensure that the sampled signal still preserves the information, it is necessary to sample at a rate greater than twice the maximum frequency of the signal. This minimum rate, 2fm, is known as the Nyquist rate. The Sampling Theorem states that a signal can be exactly reconstructed if it is sampled at a frequency f, where f > 2fm and fm is the maximum frequency in the signal. If, on the other hand, the signal is sampled at a frequency lower than the Nyquist rate, then the signal, when converted back into a continuous-time signal, will exhibit a phenomenon called aliasing. Aliasing is the presence of distortion components in the reconstructed signal. These components were not present when the original signal was sampled. In addition, some of the frequencies in the original signal may be lost in the reconstructed signal. Aliasing occurs because copies of the signal spectrum overlap when the sampling frequency is too low. Frequencies "fold" around half the sampling frequency, which is why this frequency is often referred to as the folding frequency.
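
The folding behaviour can be checked numerically. The sketch below is illustrative only: `alias_frequency` is a hypothetical helper, and the 8 kHz sampling rate and 5 kHz tone are assumed example values. It computes where a tone appears after sampling and confirms that the aliased tone produces exactly the same samples as its low-frequency image.

```python
import math


def alias_frequency(f, fs):
    """Apparent frequency (Hz) of a tone at f Hz after sampling at fs Hz."""
    folded = f % fs                   # the spectrum repeats every fs
    return min(folded, fs - folded)   # fold around fs / 2


fs = 8000.0       # assumed sampling rate
tone = 5000.0     # above fs / 2, so it violates the sampling theorem
image = alias_frequency(tone, fs)    # 3000.0 Hz

# The sampled 5 kHz tone is indistinguishable from a sampled 3 kHz tone.
samples_tone = [math.cos(2 * math.pi * tone * n / fs) for n in range(16)]
samples_image = [math.cos(2 * math.pi * image * n / fs) for n in range(16)]
max_difference = max(abs(a - b) for a, b in zip(samples_tone, samples_image))
```

A 3 kHz tone sampled at the same rate is left where it is, because it lies below the folding frequency of 4 kHz.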

Technical Characteristics of the Speech Signal :

An engineer looking at (or listening to) a speech signal might characterize it as follows:

  • The bandwidth of the signal is 4 kHz
  • The signal is periodic with a fundamental frequency between 80 Hz and 350 Hz
  • There are peaks in the spectral distribution of energy at (2n - 1) × 500 Hz, n = 1, 2, 3, ... (1.1)
  • The envelope of the power spectrum of the signal shows a decrease with increasing frequency (-6dB per octave).
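
The 500 Hz spacing of the spectral peaks is consistent with a common textbook idealization, which is an assumption added here and not a claim of this report: the vocal tract treated as a uniform tube, closed at the glottis and open at the lips, resonating at odd multiples of c/(4L). With the 17 cm tract length quoted earlier and an assumed speed of sound of 340 m/s, the sketch below gives resonances near 500, 1500 and 2500 Hz. The function name `tube_resonances` is hypothetical.

```python
def tube_resonances(length_m, count, speed_of_sound=340.0):
    """Resonances (Hz) of a uniform tube closed at one end and open at the other."""
    quarter_wave = speed_of_sound / (4.0 * length_m)
    return [(2 * n - 1) * quarter_wave for n in range(1, count + 1)]


# 17 cm vocal tract; the speed of sound is an assumed round value.
formants = tube_resonances(0.17, 3)
```

Real formants move away from these values as the tongue and lips change the shape of the tract, which is how different vowels are produced.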

Bandwidth :

The bandwidth of the speech signal is much higher than the 4 kHz stated above. In fact, for the fricatives, there is still a significant amount of energy in the spectrum for high and even ultrasonic frequencies. However, as we all know from using the (analog) phone, it seems that within a bandwidth of 4 kHz the speech signal contains all the information necessary to understand a human voice.

The Envelope of the Power Spectrum Decreases with Increasing Frequency :

The pulse sequence from the glottis has a power spectrum that decreases towards higher frequencies by -12 dB per octave. The emission characteristic of the lips is high-pass, with +6 dB per octave. Together these result in an overall decrease of -6 dB per octave.

Filter Bank Analysis:

One could also measure the spectral shape by means of an analog filter bank using several bandpass filters as is depicted below. After rectifying and smoothing the filter outputs, the output voltage would represent the energy contained in the frequency band defined by the corresponding bandpass filter.

Previous Topologies

Dynamic Time Warping:

The Dynamic Time Warping (DTW) distance measure is a technique that has long been known in speech recognition community. It allows a non-linear mapping of one signal to another by minimizing the distance between the two.

Dynamic Time Warping is a pattern matching algorithm with a non-linear time normalization effect. It is based on Bellman's principle of optimality, which implies that, given an optimal path w from A to B and a point C lying somewhere on this path, the path segments AC and CB are optimal paths from A to C and from C to B respectively. The dynamic time warping algorithm creates an alignment between two sequences of feature vectors, (T1, T2,.....TN) and (S1, S2,....,SM).

A local distance d(i,j) can be evaluated between any two feature vectors Ti and Sj. The global distance D(i,j) of any two feature vectors Ti and Sj is computed recursively by adding the local distance d(i,j) to the global distance already evaluated for the best predecessor. The best predecessor is the one that gives the minimum global distance D(i,j) at row i and column j.
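
The recursion can be sketched in a few lines. The version below is a minimal illustration and makes assumptions the text does not fix: the predecessors of cell (i, j) are taken to be (i-1, j), (i, j-1) and (i-1, j-1), no global constraint is applied, and scalar features with an absolute-difference local distance stand in for real feature vectors. The name `dtw_distance` is hypothetical.

```python
def dtw_distance(T, S, local_distance):
    """Global DTW distance between sequences T and S."""
    N, M = len(T), len(S)
    INF = float("inf")
    # D[i][j] is the best global distance aligning T[:i] with S[:j].
    D = [[INF] * (M + 1) for _ in range(N + 1)]
    D[0][0] = 0.0
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            d = local_distance(T[i - 1], S[j - 1])
            # Add the local distance to the best of the three predecessors.
            D[i][j] = d + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[N][M]


def absolute_difference(a, b):
    return abs(a - b)


# The second sequence repeats one element; warping absorbs the repeat.
stretched = dtw_distance([1, 2, 3], [1, 2, 2, 3], absolute_difference)
shifted = dtw_distance([0, 0], [1, 1], absolute_difference)
```

Restricting the set of predecessors in the `min` call corresponds to a local constraint, and skipping cells far from the diagonal corresponds to a global constraint.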

The computational complexity can be reduced by imposing constraints that prevent the selection of sequences that cannot be optimal . Global constraints affect the maximal overall stretching or compression. Local constraints affect the set of predecessors from which the best predecessor is chosen.

Dynamic Time Warping (DTW) is used to establish a time-scale alignment between two patterns. It results in a time warping vector w, which describes the time alignment of segments of the two signals: it assigns a certain segment of the source signal to each of a set of regularly spaced synthesis instants in the target signal.

Kalman Filter

Theoretically, the Kalman filter is an estimator for what is called the "linear quadratic problem", which is the problem of estimating the instantaneous "state" of a linear dynamic system perturbed by white noise. Statistically, this estimator is optimal with respect to any quadratic function of the estimation error. In practice, the Kalman filter is regarded as one of the greatest discoveries in the history of statistical estimation theory, and possibly the greatest of the twentieth century. It has enabled mankind to do many things that could not have been done without it, and it has become as indispensable as silicon in the makeup of many electronic systems.

The most immediate applications of the Kalman filter are in the control of complex dynamic systems such as continuous manufacturing processes, aircraft, ships or spacecraft. To control a dynamic system, one first needs to know what it is doing. In these applications it is not always possible or desirable to measure every variable that one wants to control, and the Kalman filter provides a means of inferring the missing information from indirect (and noisy) measurements. The Kalman filter can also be used to predict the likely future course of dynamic systems that people are not likely to control, such as the flow of rivers during floods, the trajectories of celestial bodies or the prices of traded commodities.
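
How the filter infers a quantity from noisy measurements can be seen in the simplest possible case. The sketch below assumes a scalar state that is constant over time and observed directly with additive noise. The function name `kalman_constant`, the measurement values and the variances are all illustrative assumptions, not part of this project's simulator.

```python
def kalman_constant(measurements, meas_var, x0=0.0, p0=1000.0, process_var=0.0):
    """Scalar Kalman filter for a (nearly) constant state observed in noise."""
    x, p = x0, p0   # state estimate and its error variance
    for z in measurements:
        # Predict: the state is modeled as constant, so only the
        # uncertainty grows (by the process noise variance).
        p = p + process_var
        # Update: blend prediction and measurement using the Kalman gain.
        gain = p / (p + meas_var)
        x = x + gain * (z - x)
        p = (1.0 - gain) * p
    return x, p


# Assumed noisy readings of a true value of 1.0.
estimate, variance = kalman_constant([1.1, 0.9, 1.05, 0.95, 1.0], meas_var=0.1)
```

With a large initial variance the first measurement is trusted almost completely, and each later measurement is weighted less as the error variance shrinks. The filter therefore returns both an estimate and a measure of how uncertain that estimate still is.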

From a practical standpoint, these are the perspectives that this section will present:

  • It is only a tool
  • It aids mankind in solving problems, but it does not solve any problem by itself. It is not a physical tool but a mathematical one, built from mathematical models. Such models are essentially tools for the mind: they make mental work more efficient, just as mechanical tools make physical work less tedious. It is also important to understand the tool's use and function before one can apply it effectively.

  • It is a computer program
  • It uses a finite representation of the estimation problem, that is, a finite number of variables, which is why it is said to be "ideally suited to digital computer implementation". However, it assumes that these variables are real numbers with infinite precision, and some problems arise from this. They stem from the distinction between finite dimension and finite information, and from the distinction between "finite" and "manageable" problem sizes. On the practical side of Kalman filtering, these issues must be considered along with the theory.

  • It is a complete statistical characterization of an estimation problem
  • It is a complete characterization of the current state of knowledge of the dynamic system, including the influence of all past measurements. It is much more than an estimator because it propagates the entire probability distribution of the variables it is tasked to estimate. These probability distributions are also useful for statistical analyses and the predictive design of sensor systems.

  • In a limited context, it is a learning method.

The estimation problem is modeled in a way that distinguishes between phenomena (what one is able to observe) and noumena (what is really going on). The state of knowledge about the noumena is what one can deduce from the phenomena. That state of knowledge is represented by probability distributions, which represent knowledge of the real world. This cumulative processing of knowledge can therefore be considered a learning process. It is a fairly simple process, yet quite effective in many applications.

Wiener Filtering

Wiener filters are rather simple and workable, but after the estimation of the background noise, one neglects the fact that the signal is actually speech. Furthermore, the phase component of the signal is left untouched. However, this is perhaps not such a serious problem, since the human ear is not very sensitive to phase changes. A third restriction, as in spectral subtraction methods, is that the speech signal is processed in frames, so the transition from one frame to the next must be handled with care to avoid discontinuities.

Noise reduction is a key-point of speech enhancement systems in hands-free communications. A number of techniques have been already developed in the frequency domain such as an optimal short-time spectral amplitude estimator proposed by Ephraim and Malah including the estimation of the a priori signal-to-noise ratio. This approach reduces significantly the disturbing noise and provides enhanced speech with colorless residual noise. In this paper, we propose a technique based on a Wiener filtering under uncertainty of signal presence in the noisy observation. Two different estimators of the a priori signal-to-noise ratio are tested and compared. The main interest of this approach comes from its low complexity.
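
The role of the a priori signal-to-noise ratio is easiest to see in the basic Wiener gain, which scales each frequency component by SNR / (1 + SNR). The sketch below shows only this basic rule, not the estimator under signal-presence uncertainty discussed above. The function names and the power values are assumptions made for illustration.

```python
def wiener_gain(a_priori_snr):
    """Wiener gain for one frequency bin, given its a priori SNR (linear scale)."""
    return a_priori_snr / (1.0 + a_priori_snr)


def enhance(noisy_coeffs, speech_power, noise_power):
    """Apply the per-bin Wiener gain to a frame of transform coefficients."""
    return [
        wiener_gain(s / n) * y
        for y, s, n in zip(noisy_coeffs, speech_power, noise_power)
    ]


# Assumed example: bin 0 is dominated by speech, bin 1 contains noise only.
enhanced = enhance([2.0, 2.0], speech_power=[9.0, 0.0], noise_power=[1.0, 1.0])
```

Bins with a high SNR pass almost unchanged, while bins with a low SNR are strongly attenuated. The quality of the result therefore depends mainly on how well the a priori SNR is estimated.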

Single and multi-channel enhancement:

Single channel methods operate on the input obtained from only one microphone. They have been attractive due to cost and size factors, especially in mobile communications. In contrast, multi-channel methods employ an array of two or more microphones to record the noisy signal and exploit the resulting spatial diversity. The two approaches are not necessarily independent, and can be combined to improve performance. For example, in practical diffuse noise environments, the multi-channel enhancement schemes rely on a single-channel post-filter to provide additional noise reduction.

This section is intended as a survey of single-channel enhancement algorithms. It discusses single-channel methods and introduces the contributions of this project to this area.

Maximum-likelihood estimation

Consider the estimation of a parameter µ = [µ1, ..., µp]^T based on a sequence of K observations y = [y(0), ..., y(K-1)]^T. In ML estimation, µ is treated as a deterministic variable. The ML estimate of µ is the value µ_ML that maximizes the likelihood function p(y; µ) defined on the data. ML estimation has several favorable properties. In particular, it is asymptotically unbiased and efficient, i.e., as the number of observations K tends to infinity, the ML estimate is unbiased and achieves the Cramer-Rao lower bound (CRLB). It can be shown that

The maximization of the likelihood function is performed over the domain of µ. In many cases, µ_ML cannot be computed in closed form and a numerical solution is obtained instead. Such numerical solutions are typically obtained through iterative maximization procedures such as the Newton-Raphson method or the expectation-maximization (EM) approach. The initial value of the parameter used to start the iterative procedure usually has a strong impact on whether the final estimate results in a local or a global maximum of the likelihood function.

In applications where the parameter µ is known to assume one of a finite set of values, the problems due to the iterative procedures can be avoided by performing the maximization over this finite set. An exhaustive search over the finite parameter space guarantees a global maximum. For speech enhancement, we assume that both speech and noise can be described by independent auto-regressive (AR) processes. The problem is then one of estimating the speech and noise LP coefficients based on the observed noisy speech in an ML framework. The clean speech AR model can be mathematically expressed as

where a1, ..., ap are the LP coefficients of order p and e(n) is the prediction error, also referred to as the excitation signal. It is common to model e(n) as a Gaussian random process. The LP analysis is typically performed for each frame of 20-30 ms, within which speech can be assumed to be stationary.
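
One standard way to obtain the LP coefficients of a frame is the autocorrelation method solved with the Levinson-Durbin recursion. The sketch below is a minimal pure-Python version, not necessarily the procedure used in this project. It assumes the sign convention x(n) = a1 x(n-1) + ... + ap x(n-p) + e(n), and the function names are hypothetical.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation r(0..max_lag) of one analysis frame."""
    return [
        sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
        for lag in range(max_lag + 1)
    ]


def levinson_durbin(r, order):
    """Solve the normal equations for the predictor coefficients a1..ap.

    Returns (a, err): a[k] multiplies x(n - k - 1) in the assumed model,
    and err is the variance of the prediction error e(n).
    """
    a = [0.0] * order
    err = r[0]
    for m in range(order):
        # Reflection coefficient for model order m + 1.
        acc = r[m + 1] - sum(a[k] * r[m - k] for k in range(m))
        k_m = acc / err
        updated = a[:]
        updated[m] = k_m
        for k in range(m):
            updated[k] = a[k] - k_m * a[m - 1 - k]
        a = updated
        err *= 1.0 - k_m * k_m
    return a, err


# Autocorrelation of an ideal first-order AR process with coefficient 0.5.
coeffs, error_variance = levinson_durbin([1.0, 0.5, 0.25], order=2)
```

For this input the recursion recovers a1 = 0.5 and a2 = 0, as expected for a first-order process. In practice `autocorrelation` would be applied to each windowed 20-30 ms frame before calling `levinson_durbin`.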

For each frame, the model parameters are the vector of LP coefficients µ = [a1, ..., ap] and the variance of the excitation signal. A similar model can be obtained for the noise signal. The physiology of speech production imposes a constraint on the possible shapes of the speech spectral envelope. Since the spectral envelope is specified by the LP coefficients, this knowledge can be modeled using a sufficiently large codebook of speech LP coefficients obtained from long sequences of training data. Such a priori information about the LP coefficients of speech has been exploited successfully in speech coding using trained codebooks. Similarly, noise LP coefficients can also be modeled based on training sequences for different noise types. Thus, it is sufficient to perform the maximization over the speech and noise codebooks.

We characterize the speech and noise power spectra, which can be used to construct a Wiener filter to obtain the enhanced speech signal. Given the noisy data, the excitation variances maximizing the likelihood are determined for each pair of speech and noise LP coefficients from the codebooks. This is done for all combinations of codebook pairs, and the most likely codebook combination, together with the optimal excitation variances, is obtained. Since this optimization is performed on a frame-by-frame basis, good performance is achieved in non-stationary noise environments.

Apart from restricting the search space, using a codebook in the ML estimation has an additional benefit in applications where a codebook index needs to be transmitted over a network, e.g., in speech coding. In this case, the likelihood function can be interpreted as a modified distortion criterion.

Bayesian MMSE estimation

In ML estimation, the parameter µ is treated as a deterministic but unknown constant. In the Bayesian approach, µ is treated as a random variable. The Bayesian methodology allows us to incorporate prior (before observing the data) knowledge about the parameter by assigning a prior pdf to µ.

A cost function is formulated and its expected value, referred to as the Bayesian risk, is minimized. A commonly used cost function is the mean squared error (MSE). In this case, the Bayesian minimum mean squared error (MMSE) estimate µ_BY of µ given the observations y is obtained by minimizing E[(µ - µ_BY)^2], where E is the statistical expectation operator. The expectation is with respect to the joint distribution p(y, µ). Thus, the cost function to be minimized can be written as

E[(µ - µ_BY)^2] = ∫ [ ∫ (µ - µ_BY)^2 p(µ|y) dµ ] p(y) dy

where the posterior pdf p(µ|y) is the pdf of µ after the observation of data. Since p(y) ≥ 0, it is sufficient to minimize the inner integral for each y. An estimate of µ can be found by determining a stationary point of the cost function (setting the derivative of the inner integral with respect to µ_BY to zero). We can write

µ_BY = E[µ | Y = y]

where y is a realization of the corresponding random variable Y. Using Bayes' rule, the posterior pdf can be written as

p(µ|y) = p(y|µ) p(µ) / p(y)

where the denominator p(y) is a normalizing factor, independent of the parameter µ.

We describe a method to obtain Bayesian MMSE estimates of the speech and noise AR parameters. The respective prior pdfs are modeled by codebooks. The integral in the MMSE estimate is replaced by a summation over the codebook entries. We also consider MMSE estimation of functions of the AR parameters, and one such function is shown to result in the MMSE estimate of the clean speech signal, given the noisy speech. As in the ML case, MMSE estimates of the speech and noise AR parameters are obtained on a frame-by-frame basis, ensuring good performance in non-stationary noise.

In the ML estimation framework, one pair of speech and noise codebook vectors was selected as the ML estimate, whereas the Bayesian approach results in a weighted sum of the speech (noise) codebook vectors. The Bayesian method provides a framework to account for both the knowledge provided by the observed data and the prior knowledge.
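
The difference between the two estimators can be illustrated on a toy codebook. The sketch below is a hedged example only: the function name `ml_and_mmse` is hypothetical, and the codebook entries and likelihood values are made up. It picks the single most likely entry for the ML estimate and forms the posterior-weighted sum of all entries for the MMSE estimate, with the codebook acting as a discrete prior.

```python
def ml_and_mmse(codebook, likelihoods, priors=None):
    """Return (ML estimate, MMSE estimate) over a discrete codebook.

    codebook    : list of parameter vectors (lists of floats)
    likelihoods : p(y | entry) for each codebook entry
    priors      : prior probability of each entry (uniform if omitted)
    """
    count = len(codebook)
    if priors is None:
        priors = [1.0 / count] * count
    # Posterior weights from Bayes' rule; p(y) is the normalizing sum.
    weights = [l * p for l, p in zip(likelihoods, priors)]
    total = sum(weights)
    posterior = [w / total for w in weights]

    ml_index = max(range(count), key=lambda i: likelihoods[i])
    dim = len(codebook[0])
    mmse = [
        sum(posterior[i] * codebook[i][d] for i in range(count))
        for d in range(dim)
    ]
    return codebook[ml_index], mmse


# Assumed two-entry codebook of one-dimensional parameters.
ml_estimate, mmse_estimate = ml_and_mmse([[0.0], [1.0]], [0.25, 0.75])
```

Here the ML estimate is the second entry, while the MMSE estimate lies between the two entries, pulled toward the more likely one.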


2.1 DCT

DCT-based encoding algorithms are lossy in practice, because the transform coefficients are quantized. They are capable of achieving a high degree of compression with only minimal loss of data. This scheme is effective only for compressing continuous-tone images, in which the differences between adjacent pixels are usually small. In practice, JPEG works well only on such images.

The discrete cosine transform (DCT) is used in most image and video coding standards, such as JPEG and MPEG, because its coding performance approaches that of the Karhunen-Loève transform (KLT) when the input can be modeled by a Gauss-Markov source with high correlation. However, not every block of an image, or of a difference image in video coding, is well modeled as a Gauss-Markov source. The DCT fails to compact much information into the low-frequency components if the block contains high-frequency components. DCT is a generic name for a class of transforms first published in the 1970s, and DCT-based algorithms have since made their way into various compression methods.
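
Energy compaction, and its failure for high-frequency inputs, can be demonstrated with a direct implementation of the orthonormal DCT-II. The sketch below is an assumption-labeled illustration: the normalization, the helper names `dct` and `energy_fraction`, and the two 8-point test blocks are choices made here, not taken from any coding standard.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a list of samples (direct O(N^2) evaluation)."""
    N = len(x)
    out = []
    for k in range(N):
        scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
        out.append(scale * sum(
            x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
            for n in range(N)
        ))
    return out


def energy_fraction(coeffs, count):
    """Fraction of the total energy held by the first `count` coefficients."""
    total = sum(c * c for c in coeffs)
    return sum(c * c for c in coeffs[:count]) / total


smooth = [float(n) for n in range(8)]          # highly correlated ramp
alternating = [(-1.0) ** n for n in range(8)]  # dominated by high frequencies

smooth_low = energy_fraction(dct(smooth), 2)
alternating_low = energy_fraction(dct(alternating), 2)
```

For the ramp, more than 99% of the energy falls in the first two coefficients. For the alternating block, the first two coefficients hold only a few percent, which is the situation in which the plain DCT compresses poorly.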

2.2 Warped Discrete Cosine Transform (WDCT)

Recently, there has been increasing interest in noisy speech enhancement for speech coding and recognition, since the presence of noise seriously degrades the performance of these systems. Many approaches have been investigated in order to achieve speech enhancement. These include spectral subtraction, Wiener filtering, soft decision estimation, and minimum mean square error (MMSE) estimation. Most of this research on speech enhancement is based on the discrete Fourier transform (DFT), which makes it easier to separate the speech and noise in the transform domain. However, the discrete cosine transform (DCT) has been found to be better than the DFT at enhancing noisy speech, mainly because the DCT provides significantly higher energy compaction capability. To provide a higher resolution for the energy-compacted region without increasing the DCT length, we devise a method to warp the input frequency, which adjusts the frequency distribution of the input speech to be more suitable for the DCT. It will be shown in this paper that the warped DCT (WDCT) outperforms the conventional DCT in terms of enhancing noisy speech. Furthermore, our approach can be implemented in real time with little additional computation.

Here, we review an N-point WDCT of the input vector [x(0), x(1), ..., x(N-1)]^T. The N-point DCT {X(0), X(1), ..., X(N-1)} is defined by

That is, the ith coefficient of F_k(z^-1) is the (k, i)th element of the DCT matrix. It can be shown that F_k(z^-1) is a bandpass filter with a center frequency at (2k+1)/2N, with the sampling frequency normalized to 1. Hence, the magnitude response of F_k(z^-1) for small k is larger for low-frequency inputs such as voiced sounds, which enables data compression by giving more emphasis to the lower-band outputs than to the higher-band ones.

Further, inputs with mostly high-frequency components, such as unvoiced sounds, have a higher magnitude response of F_k(z^-1) for large k, which enables high-frequency coefficients to have compacted energy. This is a desirable feature for noise removal purposes. Note that the frequency resolution of the DCT is uniform. Therefore, incorporating a nonlinear frequency resolution that closely follows the psychoacoustic Bark scale will result in an enhanced representation of speech signals. We introduce such nonlinearity into the DCT using warping.

To warp the frequency axis, we apply an all-pass transformation by replacing z^-1 with an all-pass filter A(z) defined by

A(z) = (-β + z^-1) / (1 - β z^-1)

where β is the control parameter for warping the frequency response. A(z) is known as the Laguerre filter and is widely used in various signal processing algorithms. The resulting F_k(A(z)) now becomes an infinite impulse response (IIR) filter, obtained by substituting A(z) for every z^-1 in F_k(z^-1).
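
The effect of the all-pass substitution on the frequency axis can be checked numerically. The sketch below assumes the first-order all-pass A(z) = (z^-1 - β) / (1 - β z^-1) with a real β between -1 and 1. It evaluates A on the unit circle, confirms that its magnitude is 1, and compares its phase with the closed-form warped frequency ω + 2 arctan(β sin ω / (1 - β cos ω)). The function names and the values ω = 1.0 and β = 0.3 are illustrative assumptions.

```python
import cmath
import math


def allpass(w, beta):
    """First-order all-pass A(z) evaluated at z = exp(jw)."""
    z_inv = cmath.exp(-1j * w)
    return (z_inv - beta) / (1.0 - beta * z_inv)


def warped_frequency(w, beta):
    """Frequency (rad/sample) onto which the all-pass maps the frequency w."""
    return w + 2.0 * math.atan2(beta * math.sin(w), 1.0 - beta * math.cos(w))


w, beta = 1.0, 0.3              # assumed example values
response = allpass(w, beta)
magnitude = abs(response)       # all-pass: stays at 1 for every w
phase_based = -cmath.phase(response)
formula_based = warped_frequency(w, beta)
```

With β = 0 the mapping reduces to the identity, so the ordinary DCT is recovered. With this sign convention a positive β maps ω = 1.0 to a larger warped frequency, so the low-frequency region is stretched and receives finer resolution.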




Implementation of the WDCT

The WDCT can be implemented in several ways. The most straightforward approach is to implement the filters in a Laguerre network of first-order all-pass filters A(z) that are reset every N samples. In the second approach, the filtering is implemented as a matrix-vector multiplication, and the WDCT matrix is obtained in two steps: first, we divide the all-pass IIR transfer functions into N terms; then we sample the frequency responses of the warped filter bank and obtain the WDCT matrix through an inverse discrete Fourier transform (IDFT).

We use the second approach. For a filter bank of N-tap finite impulse response (FIR) filters, the result of filtering and decimation by N corresponds to the inner product of the filter coefficient vector and the input vector. From Parseval's relation, this is in turn equal to the inner product of the conjugate DFT of the input and the DFT of the filter coefficients, where the latter consists of the samples of F_k(e^{jω}) at ω = 2πm/N for m = 0, 1, ..., N-1. Similarly, we can approximate the result of filtering with F_k(A(e^{jω})) as the inner product of the input vector and the IDFT of the sampled sequence of F_k(A(e^{jω})). A more detailed description of the WDCT and its implementations is given by Cho and Mitra in their work on the warped discrete cosine transform and its application in image compression.
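The second approach can be sketched as follows. This is an illustrative reading of the procedure described above, not the project's MATLAB code: each warped filter response is sampled at N points on the unit circle and an inverse DFT turns the samples back into N real taps. The names `dct_matrix` and `wdct_matrix`, the orthonormal DCT scaling, and the example parameter value are assumptions.

```python
import cmath
import math

def dct_matrix(n_points):
    """Orthonormal N x N DCT matrix (row k = k-th analysis filter)."""
    rows = []
    for k in range(n_points):
        scale = math.sqrt(1.0 / n_points) if k == 0 else math.sqrt(2.0 / n_points)
        rows.append([scale * math.cos((2 * n + 1) * k * math.pi / (2 * n_points))
                     for n in range(n_points)])
    return rows

def wdct_matrix(n_points, beta):
    """Approximate WDCT matrix via frequency sampling and an inverse DFT.

    Requires |beta| < 1 so that the all-pass filter is stable.
    """
    dct = dct_matrix(n_points)
    # All-pass response A(exp(j*w)) sampled at w = 2*pi*m/N.
    allpass = []
    for m in range(n_points):
        z_inv = cmath.exp(-2j * math.pi * m / n_points)
        allpass.append((-beta + z_inv) / (1.0 - beta * z_inv))
    warped = []
    for k in range(n_points):
        # Sampled response of the k-th warped filter: sum_n c(k, n) * A^n.
        sampled = [sum(dct[k][n] * a ** n for n in range(n_points))
                   for a in allpass]
        # Inverse DFT; the result is real because the samples are
        # conjugate-symmetric for real beta.
        row = []
        for n in range(n_points):
            acc = sum(sampled[m] * cmath.exp(2j * math.pi * m * n / n_points)
                      for m in range(n_points))
            row.append((acc / n_points).real)
        warped.append(row)
    return warped

W = wdct_matrix(8, 0.01)
x = [1.0, 0.9, 0.7, 0.4, 0.1, -0.2, -0.4, -0.5]
wdct_coeffs = [sum(w * v for w, v in zip(row, x)) for row in W]
print([round(c, 4) for c in wdct_coeffs])
```

Setting the control parameter to zero reproduces the DCT matrix exactly, which is a convenient sanity check. For nonzero values the matrix is only an N-tap approximation of the IIR filters, as noted in the text.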

Application to speech enhancement

We assume that a noise signal n is added to a speech signal x, with their sum denoted by y. Since the DCT is linear, taking the DCT gives

$$Y_k(t) = X_k(t) + N_k(t), \qquad k = 0, 1, \ldots, M-1,$$

where k denotes the kth frequency bin, M is the total number of frequency components, and t is the frame index in the time domain. Given a frame of noisy speech, the basic assumption adopted in a speech enhancement approach can be described by the following hypotheses:

H0 (speech absent): Y_k(t) = N_k(t)
H1 (speech present): Y_k(t) = X_k(t) + N_k(t)

Split-band global soft decision

To determine a statistically reliable frequency warping control parameter α, we apply a separate statistical model to each split frequency band. For this, we split the whole frequency range into high-band and low-band regions. First, for the high band, the probability density functions (pdfs) of the noisy speech conditioned on H0 and H1 are assumed to be

Frequency warping control parameter determination

In the image-compression application of the WDCT, the optimal frequency warping control parameter is chosen so as to minimize the reconstruction error of the given compression algorithm. Since this architecture cannot be applied directly to speech enhancement, we have to determine the frequency warping control parameter α according to the spectral distribution in each frame. To achieve statistical robustness as well as on-line implementation, we determine α as follows

where P_min = 0.2 and α ∈ [α_min (= -0.01), α_max (= 0.01)]. From the equation above, it is not difficult to see that α(t) approaches α_min as HB-GSPP approaches one, but only if LB-GSPP is sufficiently small. On the other hand, α(t) approaches α_max as LB-GSPP increases while HB-GSPP is kept low. To avoid rapid variation, we apply temporal smoothing to α(t), controlled by a smoothing parameter, to obtain a smoothed control parameter. For on-line implementation, we prepare 16 different WDCT matrices for the values of α at the off-line stage. In other words, the frequency warping control parameter α is uniformly quantized into 16 steps over its range. Accordingly, we substitute α(t+1) with the closest value from this finite set.
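The smoothing and quantization steps can be sketched as below. The smoothing equation itself is not given in the text, so the first-order recursive form and the value `zeta = 0.9` used here are assumptions made only for illustration. Whether the 16 levels include both end points of the range is also an assumption. All function names are hypothetical.

```python
ALPHA_MIN = -0.01
ALPHA_MAX = 0.01
NUM_LEVELS = 16

def alpha_codebook(alpha_min=ALPHA_MIN, alpha_max=ALPHA_MAX, levels=NUM_LEVELS):
    """Uniformly spaced control-parameter values, end points included (assumed)."""
    step = (alpha_max - alpha_min) / (levels - 1)
    return [alpha_min + i * step for i in range(levels)]

def smooth_alpha(previous_smoothed, current_alpha, zeta=0.9):
    """Assumed first-order recursive smoothing; not the text's exact rule."""
    return zeta * previous_smoothed + (1.0 - zeta) * current_alpha

def nearest_codebook_index(alpha, codebook):
    """Index of the precomputed value closest to alpha (after clipping)."""
    clipped = min(max(alpha, codebook[0]), codebook[-1])
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - clipped))

codebook = alpha_codebook()

# Made-up per-frame raw values, only to exercise the functions.
raw_alphas = [0.010, 0.010, -0.010, -0.010, 0.002]
smoothed = 0.0
for raw in raw_alphas:
    smoothed = smooth_alpha(smoothed, raw)
    index = nearest_codebook_index(smoothed, codebook)
    print(round(smoothed, 5), "-> matrix", index)
```

In an on-line system the returned index would select one of the 16 WDCT matrices prepared off-line, so no matrix has to be recomputed per frame.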



MATLAB Introduction

MATLAB is a high-performance language for technical computing. It integrates computation, visualization, and programming in an easy-to-use environment. MATLAB stands for matrix laboratory. It was originally written to provide easy access to matrix software developed by the LINPACK (linear system package) and EISPACK (eigensystem package) projects.

MATLAB is therefore built on a foundation of sophisticated matrix software in which the basic element is a matrix that does not require pre-dimensioning.

Typical uses of MATLAB

  1. Math and computation
  2. Algorithm development
  3. Data acquisition
  4. Data analysis, exploration, and visualization
  5. Scientific and engineering graphics

The main features of MATLAB

  1. Advanced algorithms for high-performance numerical computation, especially in the field of matrix algebra
  2. A large collection of predefined mathematical functions and the ability to define one's own functions
  3. Two- and three-dimensional graphics for plotting and displaying data
  4. A complete online help system
  5. A powerful, matrix- or vector-oriented high-level programming language for individual applications
  6. Toolboxes available for solving advanced problems in several application areas

Chapter 4


