Speech recognition can be carried out in two domains: the time domain and the frequency domain. Time-domain analysis is performed directly on the signal as it evolves in time; none of its processes require conversion to the frequency domain. Frequency-domain processing, on the other hand, is usually applied to analyse signal properties: signals are converted from the time domain to the frequency domain, and the resulting spectrum shows which frequencies are present in the input signal and which are missing. The two analyses arrive at the same output through different approaches.
2.1 Analysis in Time Domain
2.1.1 Normalization
Normalization is the process of standardizing the amplitude of an audio signal by increasing or decreasing its level so that the resulting peak amplitude matches a required target. Normalization is often mistaken for increasing the loudness, when in fact it only brings the loudest peaks up to 0 dB. Because it is a linear operation, normalization introduces no distortion, although the noise level is raised along with the content. In practice, a constant gain is applied to the selected section of the recording to bring its highest peak to the target point.
Before speech signals undergo any further processing, they need to be normalized, because the recordings do not all have the same volume or energy. Normalization brings the speech signals to the same level so that the subsequent processes treat them fairly.
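As a minimal sketch of the peak normalization described above (the function name, target level, and sample values are illustrative, not taken from the project), a single linear gain can bring each recording's highest peak to a common target:

```python
import numpy as np

def peak_normalize(signal, target_peak=1.0):
    """Scale a signal so its largest absolute sample equals target_peak.

    One constant gain is applied to the whole signal, so the waveform
    shape is unchanged; any background noise is scaled up along with
    the speech content, as noted above.
    """
    peak = np.max(np.abs(signal))
    if peak == 0:
        return signal  # silent signal: nothing to normalize
    return signal * (target_peak / peak)

# Two recordings of the "same word" at different volumes
quiet = np.array([0.01, -0.02, 0.015, -0.005])
loud = np.array([0.4, -0.8, 0.6, -0.2])

# After normalization both recordings share the same peak amplitude
print(np.max(np.abs(peak_normalize(quiet))))  # 1.0
print(np.max(np.abs(peak_normalize(loud))))   # 1.0
```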
2.1.2 Silence Removal
In voice recording, it is natural to have moments of silence at the beginning and end of a speech. This silence cannot be avoided, because a person takes a moment to start speaking after the 'Record' button is clicked and stops speaking before the 'Stop' button is clicked. The recorded silence is in fact the background noise captured while the speaker says nothing. This undesirable background noise has to be removed in order to make the subsequent processing more computationally efficient.
Speech is commonly represented with three states: silence (S), unvoiced (U), and voiced (V). In silence, no speech is produced; in unvoiced speech, the vocal cords are not vibrating; in voiced speech, the vocal cords are tense and vibrate periodically as air flows from the lungs, producing a quasi-periodic waveform. Segmenting the waveform into well-defined regions of silence, unvoiced, and voiced speech is not an exact process, and it is not easy to distinguish a weak voiced sound, or an unvoiced sound, from silence. [Silence removal 2]
Nevertheless, it is generally not hard to remove the silence from a speech signal. Two widely accepted methods are Short-Time Energy (STE) and Zero Crossing Rate (ZCR). Both have long been used for silence removal, even though each has its own limitations regarding the setting of thresholds. We have therefore created an algorithm to remove the silence at the beginning and end of the speech signal. This algorithm does not, however, remove the background noise within the voiced part of the signal.
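A hedged sketch of STE-based edge trimming (the frame length and threshold ratio are illustrative choices, not the project's actual values): frames whose energy falls below a fraction of the maximum frame energy are treated as silence, and only the leading and trailing silent frames are cut, leaving pauses inside the utterance intact:

```python
import numpy as np

def trim_silence(signal, frame_len=160, threshold_ratio=0.05):
    """Trim leading/trailing silence using short-time energy (STE).

    Frames with energy below threshold_ratio * (max frame energy) are
    treated as silence. Only the signal edges are trimmed.
    """
    n_frames = len(signal) // frame_len
    energy = np.array([
        np.sum(signal[i * frame_len:(i + 1) * frame_len] ** 2)
        for i in range(n_frames)
    ])
    threshold = threshold_ratio * energy.max()
    active = np.where(energy > threshold)[0]
    if active.size == 0:
        return signal  # nothing above threshold: leave signal as-is
    start = active[0] * frame_len
    end = (active[-1] + 1) * frame_len
    return signal[start:end]

# Silence, then a burst of "speech", then silence again
sig = np.concatenate([np.zeros(480),
                      0.5 * np.sin(np.linspace(0, 60, 480)),
                      np.zeros(480)])
trimmed = trim_silence(sig)
print(len(sig), len(trimmed))  # 1440 480
```

A threshold tied to the maximum frame energy is one simple answer to the threshold-setting limitation mentioned above, though it still assumes the loudest frame contains speech rather than noise.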
2.1.3 Background Noise Filter
Background noise is noise included in the recorded wave file, whether microphone noise or environmental noise such as traffic, alarms, or people talking. This kind of noise cannot be avoided, as it naturally exists during recording, but it is important to remove it because it affects the output of the subsequent processes. One of the many ways to remove such noise is to use a filter.
Many types of filter can be used to remove background noise, such as high-pass, low-pass, band-pass, and band-stop filters. In this project, we apply a band-stop filter to the speech signals. A band-stop filter passes most frequencies unaltered but blocks or attenuates a certain range of frequencies. The 'stopband' is the range of frequencies the filter blocks, bounded by a lower cut-off frequency and a higher cut-off frequency. Notch filters, a special type of band-stop filter, have a very narrow stopband that creates a 'notch' in the frequencies allowed to pass. Combining a number of notch filters forms a 'comb filter', which has multiple stopbands.
A band-stop filter that rejects the stopband completely while allowing all other frequencies to pass without attenuation is known as an ideal band-stop filter; its response transitions instantaneously from outside the stopband to within it. Such an ideal filter is impossible to realize in practice: no real filter achieves complete attenuation within the stopband, and frequencies outside the stopband always undergo some level of attenuation. [bandstop filter 1]
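As an illustrative sketch, assuming a sampling rate, stopband edges, and filter order not taken from the project, a narrow digital band-stop (notch-like) filter can be built with SciPy's Butterworth design and applied with zero-phase filtering:

```python
import numpy as np
from scipy import signal

fs = 8000               # assumed sampling rate (Hz)
low, high = 950, 1050   # assumed stopband edges: a narrow band around 1 kHz

# 4th-order Butterworth band-stop filter (cut-offs normalized to Nyquist)
b, a = signal.butter(4, [low / (fs / 2), high / (fs / 2)], btype='bandstop')

t = np.arange(0, 1.0, 1 / fs)
speech_tone = np.sin(2 * np.pi * 300 * t)   # component we want to keep
hum = np.sin(2 * np.pi * 1000 * t)          # narrowband noise to reject
filtered = signal.filtfilt(b, a, speech_tone + hum)

# Compare spectral amplitudes before/after: 300 Hz survives, 1 kHz is suppressed
spec = np.abs(np.fft.rfft(filtered))
freqs = np.fft.rfftfreq(len(filtered), 1 / fs)
amp_300 = spec[np.argmin(np.abs(freqs - 300))]
amp_1000 = spec[np.argmin(np.abs(freqs - 1000))]
print(amp_1000 / amp_300)  # far below 1: the stopband component is attenuated
```

Note the real filter's behaviour matches the caveat above: attenuation inside the stopband is large but finite, and the passband is not perfectly flat.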
2.1.4 Waveform Envelope
The waveform envelope is defined by the boundaries of the waveform and is often used to study the shape of a signal and its properties. In this project, the waveform envelope is computed to obtain the shape traced by the maximum amplitudes of the waveform. By keeping only the maximum amplitudes, the negative values of the waveform are neglected, as is the finer detail in between; the maximum values alone carry the information we need. The concept is similar to formant envelope extraction: formants are the peaks in the spectral envelope, and determining the formant pattern can distinguish between voice sounds, in particular vowels. Like the formant envelope, the waveform envelope does not eliminate the important information.
In this project, the waveform envelope is extracted from the speech signals before they are warped by DTW. With too many features in the full speech signal, it is hard to see the differences between 'before DTW' and 'after DTW', because overlapping complete waveforms may obscure how the process is supposed to work. Applying the waveform envelope therefore helps us verify that the speech signals are warped correctly. [waveform envelope 3]
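One simple way to realize the envelope described above (the frame length and test waveform are illustrative assumptions) is to keep only the maximum sample in each short, non-overlapping frame, which discards negative values and fine detail:

```python
import numpy as np

def waveform_envelope(signal, frame_len=100):
    """Upper envelope: the maximum sample in each non-overlapping frame.

    Negative samples and the detail inside each frame are discarded,
    leaving only the outline traced by the waveform's peaks.
    """
    n_frames = len(signal) // frame_len
    return np.array([
        signal[i * frame_len:(i + 1) * frame_len].max()
        for i in range(n_frames)
    ])

t = np.linspace(0, 1, 1000)
wave = np.sin(2 * np.pi * 50 * t) * np.exp(-3 * t)  # decaying tone
env = waveform_envelope(wave)
print(len(env))  # 10 envelope points tracing the decay of the peaks
```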
2.1.5 Dynamic Time Warping
Speech recognition can be achieved via a pairwise comparison of the feature vectors of a test signal and a reference signal. The tricky part in matching two speech signals is the timing. Even though different recordings of the same word contain more or less the same sounds in the same order, the duration of each subword within the word does not match. Without temporal alignment of the speech signals, recognition may produce inaccurate output. Furthermore, the total distance between the sequences differs from one speech signal to another. The matching process must overcome these length differences and account for their non-linear nature within words. [DTW4]
DTW is a dynamic-programming technique that finds an optimal match between two sequences of feature vectors, accommodating differences in timing between the test word and the reference. By stretching and compressing the sequences, DTW searches for the path through the alignment space that maximizes the local match between the aligned time frames. The total similarity score output by the algorithm gives a good indication of whether the test word and the reference match. [DTW2]
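The dynamic-programming idea can be sketched for 1-D sequences as follows (a textbook DTW recurrence, not the project's exact implementation; the toy "word" values are invented). Each cell of the cost matrix holds the best cumulative alignment cost, and staying on one index while the other advances is what stretches or compresses time:

```python
import numpy as np

def dtw_distance(x, y):
    """Classic DTW between two 1-D sequences via dynamic programming."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # insertion, deletion, or match: keep the cheapest path
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# The same "word" at two speaking rates: the test version is stretched
ref = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
test = np.array([0.0, 0.0, 1.0, 2.0, 2.0, 1.0, 0.0])

print(dtw_distance(ref, test))          # 0.0: warping absorbs the timing difference
print(np.sum(np.abs(ref - test[:5])))   # 5.0: a rigid sample-by-sample comparison fails
```

The contrast between the two printed values is the point of DTW: identical content at different rates scores as a perfect match only after alignment.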
2.1.6 Mean Square Error
Mean Square Error (MSE) quantifies the difference between an estimator and the true value of the quantity being estimated. In our case, the estimator is the test speech signal and the true value is the reference speech signal. MSE thus serves as a signal fidelity measure for comparing two signals, providing a quantitative score that describes their degree of similarity, or the level of error between them. Even though MSE has known shortcomings in some applications, it is one of the most widely used loss-quantifying functions. [mse1]
Comparing two speech signals with MSE simply means taking the differences between the reference and test signals, squaring those errors, and taking the mean of the squared errors as the final result. An MSE of zero means there is no difference between the two speech signals, implying perfect accuracy. More generally, the smallest MSE is interpreted as best explaining the variability in the process, so it is best to achieve an MSE as small, and as close to zero, as possible. [mse2]
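The computation above reduces to a few lines (the sample values are invented for illustration):

```python
import numpy as np

def mse(reference, test):
    """Mean square error: mean of the squared sample-wise differences."""
    reference = np.asarray(reference, dtype=float)
    test = np.asarray(test, dtype=float)
    return np.mean((reference - test) ** 2)

ref = np.array([0.0, 0.5, 1.0, 0.5])
same = ref.copy()
other = np.array([0.0, 0.4, 0.9, 0.6])

print(mse(ref, same))   # 0.0 -> identical signals, perfect match
print(mse(ref, other))  # 0.0075 -> small positive value: close but not identical
```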
2.2 Analysis in Frequency Domain
2.2.1 Spectral Analysis
Analysis of signals in the frequency domain usually concerns signal properties. Changes in the frequency (spectral) composition over time are a typical aspect of signals. From the spectrum, we can determine which frequencies are present in the input signal and which are missing. To create a spectrum, we must examine an interval of time, because it is impossible to measure a signal's instantaneous spectrum. A single magnitude spectrum provides no information about temporal changes in frequency composition during the interval over which the spectrum is computed, nor about how the intensities of different frequencies vary over time within the signal. This brings us to the spectrogram, with which we can see how the frequency composition of a signal changes over time. [Spectrogram 8]
A spectrogram is an image that shows how the spectral density of a signal varies with time. The horizontal axis represents time and the vertical axis frequency. A spectrogram can be viewed as three-dimensional, with the third dimension, the amplitude of a particular frequency at a particular time, represented by the colour intensity of each point in the image; the axes may also be switched to produce variations of the spectrogram. A spectrogram can be created in two ways, with a series of band-pass filters or with the short-time Fourier Transform (STFT), but since the advent of modern digital signal processing the STFT has been the more widely used. [Spectrogram T_2]
To produce a spectrogram, the STFT divides the entire signal into a series of consecutive short time chunks, which usually overlap; these chunks are called records or frames. The Discrete Fourier Transform (DFT) takes each record as input and calculates the magnitude of the frequency spectrum for that record, yielding a series of spectra. The spectra of successive records are plotted side by side, with frequency running vertically and the amplitude at each frequency represented by a grayscale value. The spectrum of a single record can also be displayed as a line graph, with frequency on the horizontal axis and amplitude on the vertical axis. A spectrogram is characterized by its DFT size, the number of digitized amplitude samples processed to produce each single spectrum. [Spectrogram T_1]
Regarding the connection between the STFT and the DFT, the STFT can be considered functionally equivalent to a bank of N/2 + 1 band-pass filters, where N is the DFT size. Each filter is centered at a slightly different analysis frequency, and its output amplitude is proportional to the amplitude of the signal within a discrete frequency band around that frequency. The spectrogram can then be viewed as the time-varying output amplitudes of the filters at consecutive analysis frequencies, plotted above one another. A few variables characterize such a spectrogram, such as its bandwidth, the range of input frequencies around the central analysis frequency that each filter passes. [Spectrogram 8]
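The STFT-based spectrogram construction described above can be sketched with SciPy (the sampling rate, record length, and chirp test signal are assumptions for illustration). A chirp is a useful test input because its frequency changes over time, which a single magnitude spectrum cannot reveal:

```python
import numpy as np
from scipy import signal

fs = 8000                       # assumed sampling rate (Hz)
t = np.arange(0, 1.0, 1 / fs)
# Linear chirp sweeping 200 Hz -> 2000 Hz over one second
x = signal.chirp(t, f0=200, f1=2000, t1=1.0)

# STFT spectrogram: 256-sample records (the DFT size) with 50% overlap
f, times, Sxx = signal.spectrogram(x, fs=fs, nperseg=256, noverlap=128)

print(Sxx.shape)  # (frequency bins, time frames) = (N/2 + 1 rows, ...)
# The dominant frequency rises from the first frame to the last
print(f[np.argmax(Sxx[:, 0])], f[np.argmax(Sxx[:, -1])])
```

The row count of `Sxx` is N/2 + 1 = 129, matching the filter-bank view above: one output per analysis frequency.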
2.2.2 Image Processing
Image processing means signal processing of any form in which the input is an image; the output can be another image or anything related to the image. Most image processing techniques treat the image as a two-dimensional signal and apply standard signal-processing techniques to it. [image processing 1] Image processing has many branches, such as image segmentation, image recognition, and image differencing, but in this project we focus only on image comparison. Image comparison follows almost the same concept as image differencing, yet the two produce different outputs: image differencing determines changes between images and generates an output image from the difference between corresponding pixels, whereas image comparison also calculates the difference between the two images but outputs only the difference value, with no output image. [image processing ]
In this project, image comparison takes two images as input: the spectrograms of the speech signals, each treated as an image. The comparison calculates the difference between the two images by taking the difference between corresponding pixels. For this technique to work, the images must first be free of background noise; if noise exists in a spectrogram, it is hard to distinguish the true differences, because the decision will be disguised by the noise and the answer will be inaccurate. Compared with the MSE process used in the time-domain analysis, this image comparison does not take the mean of the squared error; it simply takes the squared error of the differences as the result. The two images must also be identical in size, as the calculation depends on it.
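A minimal sketch of this pixel-wise comparison, assuming tiny invented arrays in place of real spectrogram images (the function name and values are illustrative):

```python
import numpy as np

def image_difference(img_a, img_b):
    """Sum of squared pixel differences between two equal-sized images.

    Unlike MSE, no mean is taken: the raw squared-error total is the
    comparison score, and no output image is produced.
    """
    if img_a.shape != img_b.shape:
        raise ValueError("images must be the same size")
    diff = img_a.astype(float) - img_b.astype(float)
    return np.sum(diff ** 2)

# Two tiny stand-in "spectrograms" (rows = frequency, columns = time)
a = np.array([[0, 10], [20, 30]])
b = np.array([[0, 10], [20, 30]])
c = np.array([[0, 12], [20, 29]])

print(image_difference(a, b))  # 0.0 -> identical images
print(image_difference(a, c))  # 5.0 -> (10-12)^2 + (30-29)^2
```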
2.2.3 Power Spectral Density
The Power Spectral Density (PSD) is a function that shows the strength of the variations of energy as a function of frequency, that is, at which frequencies the variations are strong and at which they are weak. Its vertical axis is in units of power per frequency (dB/Hz) or radians per sample (rad/sample), depending on the desired result, and its horizontal axis is frequency (Hz). The PSD is computed directly using Welch's averaged modified periodogram method of spectral estimation. From the PSD, we can decide which part represents the voiced speech. [PSD1]
In this project, we take the PSD from 100 Hz to 5000 Hz only, because vowels are believed to be detectable within that frequency range; frequencies outside it can be neglected, as they do not contain the necessary information. For the PSD, the speech signal vector is segmented into eight sections of equal length, each with 50% overlap; any remaining entries that do not fit into the eight segments are discarded. Each segment is windowed with a Hamming window of the same length as the segment. A full one-sided PSD results if the speech signal vector is real-valued, while a two-sided PSD results if it is complex-valued. [PSD2]
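The segmentation just described can be sketched with SciPy's Welch estimator (the sampling rate and the synthetic two-tone "vowel" are assumptions; eight segments at 50% overlap span about 4.5 segment lengths, hence the division below):

```python
import numpy as np
from scipy import signal

fs = 16000                      # assumed sampling rate (Hz)
t = np.arange(0, 1.0, 1 / fs)
# Synthetic "vowel": strong component at 300 Hz, weaker one at 800 Hz
x = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 800 * t)

# Welch's method: eight equal-length Hamming-windowed segments, 50% overlap
nperseg = int(len(x) // 4.5)    # leftover samples are discarded, as above
f, psd = signal.welch(x, fs=fs, window='hamming',
                      nperseg=nperseg, noverlap=nperseg // 2)

# Keep only 100 Hz - 5000 Hz, the vowel range of interest
band = (f >= 100) & (f <= 5000)
f, psd = f[band], psd[band]

# The strongest peak in this band sits at the 300 Hz component
print(f[np.argmax(psd)])
```

Because the input is real-valued, `welch` returns the one-sided PSD, matching the real/complex distinction noted above.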
Since the PSD spans frequencies up to half the sampling rate of the speech signal, it is better to downsample the signal to a smaller range. Downsampling here simply means reducing the sampling rate of a signal, which is usually done to reduce the data rate or the size of the data at the output of the process. Downsampling also reduces the cost of processing, because the computation and memory required to implement the system are generally proportional to the sampling rate, resulting in a cheaper implementation. [downsampling 1]
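As a minimal sketch of downsampling (the factor of 4 and the test tone are illustrative assumptions), SciPy's `decimate` applies an anti-aliasing low-pass filter before discarding samples, which is why it is preferred over plain slicing:

```python
import numpy as np
from scipy import signal

fs = 16000                      # assumed original sampling rate (Hz)
t = np.arange(0, 1.0, 1 / fs)
x = np.sin(2 * np.pi * 300 * t)

# Downsample by a factor of 4: low-pass filter first to avoid aliasing,
# then keep every 4th sample, giving an effective rate of 4000 Hz
y = signal.decimate(x, 4)

print(len(x), len(y))  # 16000 4000
```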