In this chapter, the methodology is explained in detail. Speech recognition can be divided into two parts: training and recognition. In training, the speech is read into the system, analyzed and converted into coefficients (LPC or MFCC coefficients); this step is called feature extraction. The spoken words, now in the form of coefficient vectors (also called templates), are saved in a template bank for use during recognition. In the recognition part, the word spoken by the user goes through the same process, and the resulting template is compared with the templates already saved in the template bank. Then, based on a decision rule, the closest word in the bank is recognized as the output of the process. This whole process is discussed in depth in the following sections of the chapter.
The whole process of speech recognition based on pattern-recognition paradigm has four main steps:
Every system, like every person, needs to be trained in order to perform a task successfully. Training is nothing but feeding the system with the knowledge it will use to accomplish the tasks assigned to it. In this case, the system is an Automatic Speech Recognition (ASR) system and the task assigned to it is speech recognition.
The words used to train the system depend on the application in which the system is used. In this project, a sine wave and a noise signal will be generated. These two signals will then be added to produce a distorted sine wave. A low-pass filter will then be designed, through which the distorted sine wave will be passed to recover the original sine wave. The spectrum of each of these signals will also be observed.
The training words in this project are "sinewave", "noise", "addnoise", "designfilter", "passfilter" and "spectrum", one for each of the tasks listed above. Initially, all these words are acquired by the system. However, the acquired words are contaminated with noise and padded with silent boundaries. Therefore, before the words are processed for feature extraction, noise reduction and word trimming need to be carried out using the corresponding algorithms. These methods are explained in the following sections.
A. Speech Acquisition
Speech acquisition is the process of recording the training words into the system. This can be done using any voice input device, such as a microphone. The quality of the input device determines the amount of noise in the acquired words: if a high-quality microphone is used, the words will contain less noise. The environmental conditions at the time of speech acquisition also have a major effect on the noise level. If the environment is noisy, the recorded words will contain more noise. Therefore, the environment should be as quiet as possible during speech acquisition.
B. Noise Reduction
Spectral subtraction, introduced by Boll (1979), Berouti, Schwartz, and Makhoul (1979), and Lim and Oppenheim (1979), is an effective speech enhancement tool for removing stationary additive noise from degraded speech. Its goal is to suppress the additive noise in a corrupted signal. Speech degraded by additive noise can be represented by:
$$d(t) = s(t) + n(t) \qquad (1)$$
where d(t), s(t) and n(t) are the degraded (corrupt) speech, the original clean speech and the noise signal, respectively. From the discrete Fourier transform (DFT) of sliding frames, typically on the order of 20-40 ms long, an estimate of the original clean speech is obtained in the frequency domain by subtracting the noise estimate from the corrupt power spectrum:
$$|\hat{S}(\omega)|^2 = |D(\omega)|^2 - |\hat{N}(\omega)|^2 \qquad (2)$$
where $D(\omega)$, $\hat{S}(\omega)$ and $\hat{N}(\omega)$ are the spectra of the corrupt frame, the clean-speech estimate and the noise estimate, respectively.
The main assumption made here is that noise reduction is achieved by suppressing the effect of noise on the magnitude spectrum only; the phase of the corrupt signal is left unchanged. The subtraction can be carried out in power terms, as in Equation 2, or in true magnitude terms, i.e. using the square roots of the terms in Equation 2. Power subtraction is adopted here because it is more common in the literature and because experimental evidence suggests that it is the simpler approach and performs about as well as magnitude subtraction.
Figure 4.1: Noise reduction process
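The power-domain subtraction described in this section can be sketched in Python with NumPy. This is only an illustration under stated assumptions, not the implementation used in the project: the function name `spectral_subtract`, the 256-sample frame with 50% overlap, the Hann analysis window, and the assumption that the first few frames of the recording contain noise only are all choices made for this example.

```python
import numpy as np

def spectral_subtract(d, frame_len=256, noise_frames=5):
    """Power spectral subtraction with 50% overlap-add (illustrative sketch)."""
    d = np.asarray(d, dtype=float)
    hop = frame_len // 2
    # Periodic Hann window: copies shifted by half a frame sum to one,
    # so overlap-adding unmodified frames would return the input.
    window = np.hanning(frame_len + 1)[:-1]
    n_frames = 1 + (len(d) - frame_len) // hop
    frames = np.stack([d[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectra = np.fft.rfft(frames, axis=1)
    power = np.abs(spectra) ** 2
    # Assumption: the leading frames hold background noise only.
    noise_power = power[:noise_frames].mean(axis=0)
    # Subtract the noise power estimate and clip negative results to zero.
    clean_power = np.maximum(power - noise_power, 0.0)
    # Only the magnitude is modified; the noisy phase is reused.
    clean_spectra = np.sqrt(clean_power) * np.exp(1j * np.angle(spectra))
    clean_frames = np.fft.irfft(clean_spectra, n=frame_len, axis=1)
    out = np.zeros(len(d))
    for i in range(n_frames):
        out[i * hop:i * hop + frame_len] += clean_frames[i]
    return out
```

Clipping negative differences to zero is the simplest way to keep the power estimate valid; it is also what tends to produce the residual "musical" noise often associated with spectral subtraction.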
C. Start & End point detection
Detecting speech against a background of noise is a central problem in speech processing, referred to as the endpoint location problem. Accurate detection of a word's start and end points means that subsequent processing of the data can be kept to a minimum. In speech recognition based on template matching, the timing of an utterance will generally not be the same as that of the template, because the two have different durations, and in many cases the accuracy of the alignment depends on the accuracy of the endpoint detection. Some situations make endpoint detection difficult: words that begin or end with low-energy phonemes (weak fricatives), words that end with an unvoiced plosive, words that end with a nasal, and speakers who end words with a trailing-off in intensity or a short breath (noise). An algorithm that takes all of these conditions into account will perform more reliably.
The method proposed by Rabiner and Sambur, which is used in this demonstration, relies on two measures of the signal: the zero-crossing rate and the energy. Three thresholds are computed:
ITU - Upper energy threshold.
ITL - Lower energy threshold.
IZCT - Zero crossings rate threshold.
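The two measures on which these thresholds operate can be computed frame by frame. The sketch below assumes 10 ms frames and defines energy as the sum of squared samples; the function name `short_time_measures` and both of those choices are assumptions made for illustration.

```python
import numpy as np

def short_time_measures(x, fs, frame_ms=10):
    """Return per-frame energy and zero-crossing count for signal x sampled at fs Hz."""
    x = np.asarray(x, dtype=float)
    n = int(fs * frame_ms / 1000)          # samples per frame
    n_frames = len(x) // n                 # trailing partial frame is dropped
    frames = x[:n_frames * n].reshape(n_frames, n)
    energy = np.sum(frames ** 2, axis=1)
    # A zero crossing occurs wherever consecutive samples differ in sign.
    positive = (frames >= 0).astype(int)
    zcr = np.sum(np.diff(positive, axis=1) != 0, axis=1)
    return energy, zcr
```

Counting crossings per 10 ms frame keeps the result in the same units as the fixed limit of 25 crossings per 10 ms used for the zero-crossing threshold.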
The algorithm: start-up
At initialization, record sound for 100 ms and treat it as a 'silence' signal. For this measured background noise, compute the average (IZC') and the standard deviation (σ) of the zero-crossing rate.
Choose the zero-crossing threshold (IZCT). This threshold characterizes unvoiced speech, as it is computed from the recorded silence signal. According to the algorithm, it is given as
IZCT = min (25 / 10 ms, IZC' + 2σ)
In this step, the energy E(n) is computed for each interval; its peak value is IMX and its value for silence is IMN. I1 and I2 are then computed from the formulae given below.
I1 = 0.03 * (IMX - IMN) + IMN (3% of peak energy)
I2 = 4 * IMN (4x silent energy)
Now, from these two values, get energy thresholds (ITU and ITL) based on the formulae given below.
ITL = MIN (I1, I2)
ITU = 5 * ITL
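The threshold formulae above can be collected into a small Python function. This is a sketch only: the name `compute_thresholds` and its arguments are hypothetical, the population standard deviation is assumed for σ, and the zero-crossing threshold adds two standard deviations to the mean, as in Rabiner and Sambur's formulation.

```python
import statistics

def compute_thresholds(silence_zcr, imx, imn, fixed_zc=25):
    """Return (IZCT, ITL, ITU).

    silence_zcr: zero crossings per 10 ms frame measured during the initial silence
    imx, imn:    peak energy and silence energy
    fixed_zc:    fixed limit of 25 crossings per 10 ms
    """
    izc_mean = statistics.mean(silence_zcr)
    izc_std = statistics.pstdev(silence_zcr)   # assumption: population std dev
    izct = min(fixed_zc, izc_mean + 2 * izc_std)
    i1 = 0.03 * (imx - imn) + imn              # 3% of peak energy above silence
    i2 = 4 * imn                               # 4x silence energy
    itl = min(i1, i2)
    itu = 5 * itl
    return izct, itl, itu
```

For example, with a peak energy of 1000 and a silence energy of 10 (made-up values), I1 = 39.7 and I2 = 40, so ITL = 39.7 and ITU = 198.5.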
Once all the thresholds for energy and zero-crossing rate have been computed, the search for the end points can begin. It proceeds as follows.
Search the samples for energy greater than ITL and save this point as the tentative start of speech, S.
Continue searching for energy greater than ITU, at which point S is confirmed as the start of speech. If the energy falls below ITL first, restart the search.
Search for energy less than ITL and save this point as the end of speech.
As a result of the energy computation, the start and end points of the speech are marked. These points now need to be refined using the zero crossings. To do so, search back 250 ms from S and count the number of intervals in which the zero-crossing rate exceeds IZCT. If this count exceeds 3, move the starting point S back to the first interval in which the threshold was exceeded; otherwise, S remains the same. Repeat the same search forward from the end point.
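The search and refinement steps can be sketched as follows, working on per-frame energy and zero-crossing values with 10 ms frames (so 250 ms corresponds to 25 frames). The function names are hypothetical, and two points are assumptions rather than something the text specifies: the end point is located by running the same energy search backwards from the end of the recording, and a refined end point is moved to the last frame that exceeded IZCT.

```python
def energy_search(energy, itl, itu):
    """Return the first frame above ITL from which energy reaches ITU
    without dropping back to ITL or below; None if there is no such frame."""
    candidate = None
    for i, e in enumerate(energy):
        if e <= itl:
            candidate = None          # not above ITL: restart the search
            continue
        if candidate is None:
            candidate = i             # tentative start S
        if e > itu:
            return candidate          # S confirmed
    return None

def find_endpoints(energy, zcr, itl, itu, izct, lookback=25, min_count=3):
    """Return (start, end) frame indices, or None if no speech is found."""
    start = energy_search(energy, itl, itu)
    if start is None:
        return None
    # Assumption: mirror the search from the end of the recording.
    end = len(energy) - 1 - energy_search(energy[::-1], itl, itu)
    # Refine the start using zero crossings in the preceding 250 ms.
    before = [i for i in range(max(0, start - lookback), start) if zcr[i] > izct]
    if len(before) > min_count:
        start = before[0]
    # Refine the end using zero crossings in the following 250 ms.
    stop = min(len(energy), end + 1 + lookback)
    after = [i for i in range(end + 1, stop) if zcr[i] > izct]
    if len(after) > min_count:
        end = after[-1]
    return start, end
```

The zero-crossing refinement is what allows weak fricatives at word boundaries, which carry little energy but many zero crossings, to be kept inside the detected word.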
Definition: Feature extraction is a method in which the input continuous speech signal will be converted into a set of symbols.
The conversion is done in such a way that, when the symbols are synthesized, they again represent the continuous speech signal. Speech recognition systems comprise several algorithms drawn from a wide variety of sources, such as statistical pattern recognition, communication theory and signal processing. Although different recognizers rely on each of these areas to varying degrees, the most important common denominator of every recognition system is the signal-processing front end, where the speech waveform is converted into a parametric representation that is in turn used for analysis and further processing.
The speech signal of a speaker varies with, for example, speaking style, context and the emotional state of the speaker. The aim of signal processing is to extract the important information from the signal by means of a transformation and to store the resulting coefficients in a vector. Many feature extraction algorithms are available today. The short-term spectrum of the speech signal, defined as a function of time, frequency and spectral magnitude, is the most basic representation of the speech signal. Various approximations to the short-term spectrum, such as filter-bank magnitudes, linear predictive coding (LPC) coefficients and Mel-cepstrum coefficients, are also popular. Many other kinds of features can be extracted from a speech signal; some of them are listed below.
Power spectral analysis (FFT)
Linear predictive analysis (LPC)
Perceptual linear prediction (PLP)
Mel scale Cepstral analysis (MEL)
Relative spectra filtering of log domain coefficients (RASTA)
First order derivative (DELTA)
Of all these methods, the LPC method has been selected in this project for its simplicity and efficiency. The next section describes the Linear Predictive Coding method in detail.
Linear prediction is one of the most powerful tools for speech analysis, from which the basic parameters of speech can be estimated. Its main strengths are accurate estimates of the speech parameters and a computationally efficient model of speech.
The idea behind linear predictive analysis is that a speech sample at the current instant of time can be approximated as a linear combination of past speech samples. By minimizing the sum of squared differences (over a finite interval) between the actual speech samples and the linearly predicted values, a unique set of parameters, the predictor coefficients, can be determined. These coefficients form the basis for linear predictive analysis of speech. However, the method assumes that the speech signal is stationary. To satisfy this requirement, the speech signal is assumed to be stationary over a period of 20 to 25 milliseconds, which is a fair assumption because the muscles of the vocal tract cannot expand or contract appreciably within such a short time span. Because of this simplicity and computational efficiency, the LPC method has been chosen for this project.
Linear Predictive Coding is a combination of several tasks. These include preemphasis, frame blocking, autocorrelation analysis and LPC analysis.
Figure 4.2: LPC method flow diagram (the input speech signal enters at the preemphasis block)
Linear predictive coding starts with preemphasis, in which the digital speech signal s(n) is fed through a low-order digital filter to flatten the signal spectrally and make it less susceptible to finite-precision effects. After preemphasis, frame blocking is performed, in which the speech sequence is divided into frames before being processed. If the first frame consists of N samples, the second frame begins M samples after the first, so that adjacent frames overlap by (N - M) samples. If M << N, the LPC spectrum estimates will be very smooth from frame to frame. If, on the other hand, M > N, there will be no overlap between adjacent frames and some of the speech signal will be lost entirely. By trial and error, the values N = 200 and M = 100 (50% overlap) were chosen, assuming a speech sampling rate of 12 kHz. The discontinuities at the beginning and end of each frame are then minimized by windowing. If the window is defined as w[n], 0 ≤ n ≤ N − 1, then the windowed output can be expressed as,
$$X[n] = x[n]\,w[n], \qquad 0 \le n \le N-1.$$
Because of its good out-of-band rejection characteristic, the Hamming window is chosen for the autocorrelation method of LPC. The Hamming window function is given by,
$$w[n] = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n \le N-1.$$
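The front-end steps described above (preemphasis, frame blocking and Hamming windowing) can be sketched in Python with NumPy. The function names are hypothetical, and the first-order filter with coefficient 0.95 is an assumption, since the text only specifies a low-order filter; N = 200 and M = 100 follow the values chosen above.

```python
import numpy as np

def preemphasize(x, alpha=0.95):
    """First-order preemphasis: y[n] = x[n] - alpha * x[n-1] (alpha is assumed)."""
    x = np.asarray(x, dtype=float)
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frame_and_window(x, frame_len=200, frame_shift=100):
    """Split x into frames of N samples starting every M samples,
    then multiply each frame by a Hamming window."""
    x = np.asarray(x, dtype=float)
    if len(x) < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(x) - frame_len) // frame_shift
    frames = np.stack([x[i * frame_shift:i * frame_shift + frame_len]
                       for i in range(n_frames)])
    # np.hamming(N) evaluates 0.54 - 0.46*cos(2*pi*n/(N-1)) for n = 0..N-1
    return frames * np.hamming(frame_len)
```

With N = 200 and M = 100, a 500-sample signal yields four frames, each sharing 100 samples with its neighbour.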
Mathematical modeling of LPC coefficients
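In order to find the Linear Predictive Coding (LPC) coefficients, the present sample is, by the definition of the LPC method, estimated from the past samples of the speech signal. Suppose the present sample is estimated from the past p samples, where x(n) is the present sample value of the speech signal and $\hat{x}(n)$ is the estimated value of x(n) from its past p samples:
$$\hat{x}(n) = \sum_{k=1}^{p} a_k\, x(n-k)$$
Then the error of estimation becomes:
$$e(n) = x(n) - \hat{x}(n) = x(n) - \sum_{k=1}^{p} a_k\, x(n-k)$$
Ideally, this error should be zero. However, the model only produces an approximation of the actual signal. The quality of the estimation is therefore measured by the energy of the error signal, which should be as small as possible: the smaller the energy of the error, the better the estimate of the model coefficients. The total energy of the error signal is the total area under the square of the signal. Mathematically, this can be represented as:
$$E = \sum_{n} e^2(n) = \sum_{n} \left[ x(n) - \sum_{k=1}^{p} a_k\, x(n-k) \right]^2$$
In order to minimize the energy of the error signal, the partial derivatives of E with respect to the coefficients $a_1, a_2, \ldots, a_p$ must be equated to zero. Taking the partial derivative of E with respect to $a_1$ gives,
$$\frac{\partial E}{\partial a_1} = -2 \sum_{n} \left[ x(n) - \sum_{k=1}^{p} a_k\, x(n-k) \right] x(n-1) = 0$$
This can also be written as,
$$\sum_{n} x(n)\,x(n-1) = a_1 \sum_{n} x(n-1)\,x(n-1) + a_2 \sum_{n} x(n-2)\,x(n-1) + \cdots + a_p \sum_{n} x(n-p)\,x(n-1)$$
In the above equation, every summation term is an autocorrelation value. Therefore, replacing the autocorrelation expressions with $R(j) = \sum_{n} x(n)\,x(n-j)$, the above equation can be written as:
$$R(1) = a_1 R(0) + a_2 R(1) + \cdots + a_p R(p-1)$$
The above equation contains p unknowns, so p such equations are required to obtain a solution. Taking the partial derivative of E with respect to $a_2$ gives,
$$R(2) = a_1 R(1) + a_2 R(0) + \cdots + a_p R(p-2)$$
In the same way, taking the partial derivative with respect to $a_3$ gives,
$$R(3) = a_1 R(2) + a_2 R(1) + a_3 R(0) + \cdots + a_p R(p-3)$$
For the final derivative, with respect to $a_p$,
$$R(p) = a_1 R(p-1) + a_2 R(p-2) + \cdots + a_p R(0)$$
Placing all these equations into matrix form gives:
$$\begin{bmatrix} R(0) & R(1) & \cdots & R(p-1) \\ R(1) & R(0) & \cdots & R(p-2) \\ \vdots & \vdots & \ddots & \vdots \\ R(p-1) & R(p-2) & \cdots & R(0) \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{bmatrix} = \begin{bmatrix} R(1) \\ R(2) \\ \vdots \\ R(p) \end{bmatrix}$$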
Using the set of linear equations in the above matrix form, the LPC coefficients can be found easily. The LPC parameters corresponding to each word are then saved in a file for use during recognition.
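As an illustration of solving the matrix form of these equations, the sketch below computes the autocorrelation values of one windowed frame and solves the resulting linear system directly with NumPy. The function name `lpc_coefficients` is hypothetical, and the convention assumed is that the prediction is a weighted sum of past samples with positive coefficients. A general-purpose solver is used for clarity; the Levinson-Durbin recursion is the usual, more efficient alternative for this kind of symmetric Toeplitz system.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Return predictor coefficients a_1..a_p for one (windowed) frame."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # Autocorrelation R(j) = sum_n x(n) x(n-j), for j = 0..p
    r = np.array([np.dot(frame[j:], frame[:n - j]) for j in range(order + 1)])
    # p x p matrix whose (i, k) entry is R(|i - k|)
    lags = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    R = r[lags]
    # Solve R a = [R(1), ..., R(p)]
    return np.linalg.solve(R, r[1:])
```

As a sanity check, a signal generated by a known second-order recursion should give back coefficients close to the ones used to generate it.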
Once the system has been trained with the words for the application, it is time to test its recognition capabilities. The recognition process starts with acquiring the spoken word to be recognized, reducing the noise in it and trimming it by finding the start and end points of the voiced signal.
The speech acquisition, noise reduction and start and end point detection procedures are carried out exactly as in training.
The decision rule plays a key role in the performance of speech recognition systems. This section covers the comparison of the LPC parameters of the spoken word with the LPC vectors already saved in the vector bank.
Once the acquired speech has been made noise-free and trimmed by endpoint detection, it is processed to find the LPC parameters of the spoken word. The LPC vector of the spoken word is then compared with the LPC vectors in the vector bank produced during training. This comparison is based on a distance measure, which gives the distance between the LPC vector of the spoken word and each LPC vector in the vector bank. The word with the minimum distance is then recognized as the spoken word. In this project, a simple distance measure is used for decision making: the differences between each LPC coefficient of the spoken word and the corresponding LPC coefficient of a word in the vector bank are squared and summed, and the square root of the sum gives the distance.
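The decision rule described above amounts to a nearest-neighbour search under the Euclidean distance. A minimal Python sketch follows; the function name `recognize`, the dictionary layout of the vector bank and the coefficient values in the usage example are all hypothetical, and the sketch assumes that every stored vector has the same length as the vector of the spoken word.

```python
import math

def recognize(spoken, vector_bank):
    """Return (closest word, all distances) for an LPC vector.

    spoken:      sequence of LPC coefficients of the spoken word
    vector_bank: dict mapping each trained word to its stored LPC vector
    """
    # math.dist squares the coefficient differences, sums them
    # and takes the square root (Euclidean distance).
    distances = {word: math.dist(spoken, template)
                 for word, template in vector_bank.items()}
    best = min(distances, key=distances.get)
    return best, distances

# Usage with made-up two-coefficient vectors:
bank = {"sinewave": [1.0, -0.5], "noise": [0.2, 0.1]}
word, distances = recognize([0.9, -0.4], bank)
```

Returning all distances alongside the winner makes it easy to add a rejection threshold later, so that a word far from every template is not forced onto the nearest one.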
In this chapter, the methodology of speech recognition using Linear Predictive Coding has been explained in detail. The training and recognition procedures have been described, and the mathematical modeling behind several of them has been derived. Supporting concepts such as noise reduction and endpoint detection have been explained alongside the core algorithms. With this knowledge, one can understand the MATLAB implementation of the speech recognition system using LPC, which is explained in the next chapter.