Text Dependent And Text Independent Computer Science Essay



This research work aims at designing both text-dependent and text-independent speaker recognition systems based on mel frequency cepstral coefficients (MFCCs) and a voice activity detector (VAD). The VAD is employed to suppress background noise and to distinguish voice activity from silence. MFCCs are then extracted from the detected voice segments and compared with the database to recognize the speaker. A new detection criterion is proposed which is expected to perform well in noisy environments, and a new approach for designing the VAD is presented. The system will be implemented on the MATLAB platform. To prove the effectiveness of the proposed design, a comparative analysis will be carried out against the artificial neural network (ANN) technique; in recent years a significant amount of work, both theoretical and experimental, has established the viability of ANNs as a useful technology for speaker recognition. The performance of both systems will be evaluated under different noise conditions, languages and emotions. The overall efficiency of the proposed speaker recognition system depends mainly on the detection criterion used for recognizing a particular speaker. Global optimization techniques such as the Genetic Algorithm (GA) and Particle Swarm Optimization (PSO) can prove very useful in this context, and hence a Genetic Algorithm will be employed for setting up the detection criterion.


Development of speaker recognition systems began in the early 1960s with the exploration of voiceprint analysis, where the characteristics of an individual voice were thought to characterize the uniqueness of an individual much like a fingerprint. The early systems had many flaws, and their detection efficiency was severely degraded in the presence of noise. This motivated the search for more reliable methods of measuring the correlation between two sets of speech utterances. Speaker recognition is the process of recognizing a speaker from a database based on characteristics of the speech wave. Most speaker recognition systems contain two phases. In the first phase, feature extraction, the unique features of the voice data are extracted and later used to identify the speaker. The second phase is feature matching, in which the extracted features are compared with the database of known speakers. The overall efficiency of the system depends on how well the voice features are extracted and on the procedures used to compare the real-time voice sample features with the database.

From security applications to crime investigations, speaker recognition is one of the best biometric recognition technologies. A speech signal can serve as the password to the lock system of a home, a locker, a computer, and so on. Speaker recognition can also help to verify the voice of a criminal from audio tapes of telephone conversations. The main advantage of a biometric password is that, unlike a knowledge-based password, it cannot be forgotten or misplaced.

Compared with other biometrics, voice biometrics are user friendly, cost-effective, convenient and secure. Robust speech recognition systems can be applied to high-accuracy connected-digit recognition, with applications in the recognition of personal identification numbers, credit card numbers and telephone numbers.

The main requirements of a modern speaker recognition system are high accuracy, low complexity and easy computation. The Hidden Markov Model (HMM) has been successfully applied to both isolated-word and continuous speech recognition; however, it fails to address discrimination and robustness issues in classification problems. Acoustic analysis based on MFCCs, which model the human ear [1], has given good results in speaker recognition. The background noise and the microphone used also affect the overall performance of the system [2].

Speaker recognition systems contain three main modules:

(1) Acoustic processing

(2) Features extraction or spectral analysis

(3) Recognition.

All three modules are shown in Fig. 1 and are explained in detail in the subsequent sections.

Fig.1. Basic structure of speaker recognition system

Research and development on speaker recognition methods and techniques has been undertaken for more than four decades and it is still an active area. Many approaches have been used, including human aural and spectrogram comparisons, simple template matching, dynamic time-warping, and modern statistical pattern recognition approaches such as neural networks and Hidden Markov Models (HMMs) [Siohan, 1998], Gaussian Mixture Modeling (GMM) [Reynolds, 1995], multi-layer perceptrons [Altosaar and Meister, 1995], Radial Basis Functions [Finan et al., 1996] and genetic algorithms [Hannah et al., 1993].

Over the last decade, neural networks have attracted a great deal of attention. They offer an alternative approach to computing and to understanding the human brain. Neural networks have the ability to derive meaning from complicated or imprecise data. They can be used to extract patterns and detect trends that are difficult to analyse by either humans or other computer techniques. The advantages offered by neural networks are:

(1) Adaptive learning

(2) Self-organization

(3) Real-time operation

(4) Fault tolerance via redundant information coding.


Research has focussed on feature-based recognition systems. Using features from speech-based sources, attempts have been made to create a reliable, robust and efficient recognition system. However, variations caused by differences in individual speaker characteristics, emotions and noise disturbances increase the complexity of such a system.

Template-matching techniques are used for text-dependent methods. The input speech is represented by a sequence of feature vectors, generally short-term spectral feature vectors. Using a dynamic time warping (DTW) algorithm, the time axes of the input speech and of each reference template or reference model of the registered speakers are aligned, and the degree of similarity between them, accumulated from the beginning to the end of the utterance, is calculated. Statistical variation in spectral features can be modelled by the Hidden Markov Model (HMM).

HMM-based methods were introduced as extensions of the DTW-based methods. A technique for computing verification scores using multiple verification features from the list of scores for a target speaker's background speaker set was introduced by Park, A. (2001). This technique was compared with the baseline logarithmic likelihood ratio verification score using global GMM speaker models, but gave no improvement in verification performance.

Zhou, L. (2000) applied neural networks and fuzzy techniques to a speaker-independent speech recognition system. Tests on a large number of speech templates of the Chinese digits 0-9, collected from persons from different areas and in noisy environments, gave a recognition rate of 92.2%.

Moonasar, V. and Venayagamoorthy, G. (2002) proposed a speaker verification system that can be improved and made robust by using a committee of neural networks for pattern recognition rather than the conventional single-network decision system. A supervised Learning Vector Quantization (LVQ) neural network was used as the pattern classifier. Linear Predictive Coding (LPC) and cepstral signal processing techniques were combined into hybrid feature parameter vectors to combat the decrease in recognition rate as the number of speakers to be recognized grows.

The most commonly used acoustic vectors are Mel Frequency Cepstral Coefficients (MFCC), Linear Prediction Cepstral Coefficients (LPCC), Perceptual Linear Prediction Cepstral (PLPC) coefficients and zero-crossing coefficients (Yegnanarayana et al., 2005; Vogt et al., 2005). All these features are based on the spectral information derived from a short-time windowed segment of speech.

They differ mainly in the detail of the power spectrum representation. A modification of the Mel-Frequency Cepstral Coefficient (MFCC) feature has been proposed for the extraction of speech features for speaker verification (SV) applications (Saha and Yadhunandan, 2000). It was compared with the original MFCC-based feature extraction method and with one recent modification; the study uses the multi-dimensional F-ratio as a performance measure in speaker recognition (SR) applications to compare the discriminative ability of different multi-parameter methods. An MFCC-like feature based on the Bark scale has been shown to yield performance in speech recognition experiments similar to that of MFCC (Aronowitz et al., 2005); these BFCC features perform well for text-dependent speaker verification systems. Revised Perceptual Linear Prediction Coefficients (RPLP), obtained from a combination of MFCC and PLP, were proposed by Kumar et al. (2010) and Ming et al. (2007) for the purpose of identifying the spoken language.

The objective of a modeling technique is to generate speaker models using speaker-specific feature vectors. Such models carry enhanced speaker-specific information at a reduced data rate, which is achieved by exploiting the working principles of the modeling techniques. Earlier studies on speaker recognition used direct template matching between training and testing data: the training and testing feature vectors are compared directly using a similarity measure, such as the spectral, Euclidean or Mahalanobis distance (Liu et al., 2006). The disadvantage of template matching is that it becomes time consuming as the number of feature vectors increases. For this reason, it is common to reduce the number of training feature vectors by some modeling technique such as clustering. The cluster centres are known as code vectors, and the set of code vectors is known as a codebook. The best-known codebook generation algorithm is the K-means algorithm (Mporas et al., 2007; Ming et al., 2007). In 1985, Soong et al. used the LBG algorithm to generate speaker-based vector quantization (VQ) codebooks for speaker recognition.

In order to model statistical variations, the hidden Markov model (HMM) was studied for text-dependent speaker recognition, and system performance with neural-network-based approaches was also studied (Clarkson et al., 2006). In an HMM, the time-dependent parameters are observation symbols, created from VQ codebook labels; continuous probability measures are created using Gaussian mixture models (GMMs) (Krause and Gazit, 2006). The main assumption of the HMM is that the current state depends on the previous state. In 1995, Reynolds proposed the Gaussian mixture model (GMM) classifier for the speaker recognition task (Krause and Gazit, 2006; Clarkson et al., 2006); this is the most widely used probabilistic technique in speaker recognition.
In the GMM modeling technique, the distribution of feature vectors is modelled by mean, covariance and weight parameters. GMM has outperformed the other modeling techniques; its disadvantage is that it requires sufficient data to model the speaker well (Aronowitz et al., 2005).
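The codebook construction described above can be sketched in code. The following is a minimal K-means illustration in Python (the proposal itself targets MATLAB); the function and parameter names are illustrative assumptions, and the LBG splitting refinements are omitted:

```python
import numpy as np

def kmeans_codebook(features, k, iters=20, seed=0):
    """Cluster training feature vectors into k code vectors (a VQ codebook)."""
    rng = np.random.default_rng(seed)
    # Initialise code vectors with k randomly chosen training vectors.
    codebook = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        # Assign each vector to its nearest code vector (Euclidean distance).
        d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each code vector to the mean of its assigned vectors.
        for j in range(k):
            members = features[labels == j]
            if len(members):
                codebook[j] = members.mean(axis=0)
    return codebook
```

The returned codebook then stands in for the full set of training vectors during matching, which is exactly the data-rate reduction motivated above.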

Various researchers are still trying to improve the performance of speaker recognition systems. Existing optimization techniques such as genetic algorithms, particle swarm optimization and neural networks can come in handy in improving that performance.

Description of Broad Area/Topic

Speaker recognition is the process of automatically recognizing who is speaking on the basis of individual information included in speech waves. At the highest level, all speaker recognition systems contain two main modules: feature extraction and feature matching. Feature extraction is the process that extracts a small amount of data from the voice signal that can later be used to represent each speaker. Feature matching involves the actual procedure of identifying the unknown speaker by comparing the features extracted from his/her voice input with those from a set of known speakers. Each module is discussed in detail in later sections.


Acoustic processing is the sequence of processes that receives the analog signal from a speaker and converts it into a digital signal for digital processing. Human speech frequency usually lies between 300 Hz and 8000 Hz [2]. A sampling rate of 16 kHz can therefore be chosen for recording, which is twice the highest frequency of the original signal and satisfies the Nyquist sampling criterion [3]. The start- and end-point detection of an isolated signal is a straightforward process that detects abrupt changes in the signal via a given energy threshold. The result of acoustic processing is a discrete-time voice signal containing meaningful information, which is then fed into the spectral analyser for feature extraction.


Feature Extraction module provides the acoustic feature vectors used to characterize the spectral properties of the time varying speech signal such that its output eases the work of recognition stage. Main steps involved in feature extraction are explained below:

It is a process of extracting a small amount of speaker specific information in the form of feature vectors at reduced data rate from the input voice signal that can be used as a reference model representing each speaker's identity. A general block diagram of speaker recognition system is shown in Fig 2.

Fig.2 Speaker recognition system

It is clear from the above diagram that speaker recognition is a 1:N match, where one unknown speaker's extracted features are matched against all the templates in the reference model to find the closest match. The speaker whose features show the maximum similarity is selected.

MFCC Extraction

Mel frequency cepstral coefficients (MFCCs) are probably the best-known and most widely used features for both speech and speaker recognition. A mel is a unit of measure based on the human ear's perceived frequency. The mel scale has approximately linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. The mapping from frequency to mel can be expressed as

mel(f) = 2595 · log10(1 + f/700)        (1)

where f denotes the real frequency and mel(f) denotes the perceived frequency. The block diagram showing the computation of MFCC is shown in Fig. 3.
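Equation (1) translates directly into code. The following one-line Python sketch is given for illustration (the function name is an assumption, not part of the proposal):

```python
import math

def hz_to_mel(f):
    # Equation (1): perceived (mel) frequency from real frequency f in Hz.
    return 2595.0 * math.log10(1.0 + f / 700.0)
```

As a quick sanity check, hz_to_mel(1000) evaluates to approximately 1000 mel, the anchor point of the scale.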

Fig.3 MFCC Extraction

In the first stage the speech signal is divided into frames of 20 to 40 ms with an overlap of 50% to 75%. In the second stage each frame is windowed with some window function to minimize discontinuities in the signal by tapering the beginning and end of each frame towards zero; in the time domain, windowing is a point-wise multiplication of the framed signal and the window function. A good window function has a narrow main lobe and low side-lobe levels in its transfer function; in this work the Hamming window is used. In the third stage a DFT block converts each frame from the time domain to the frequency domain. In the next stage mel frequency warping transfers the real frequency scale to the human-perceived frequency scale, called the mel-frequency scale, which is linear below 1000 Hz and logarithmic above 1000 Hz. The mel frequency warping is normally realized by triangular filter banks, with the centre frequencies of the filters evenly spaced on the warped axis. The warped axis is implemented according to equation (1) so as to mimic the human ear's perception. The output of the ith filter is given by


Y(i) = Σj S(j) · Ωi(j),   i = 1, 2, …, M        (2)

where S(j) is the N-point magnitude spectrum (j = 1:N) and Ωi(j) is the sampled magnitude response of an M-channel filter bank (i = 1:M). In the fifth stage the log of the filter bank output is computed, and finally the DCT (Discrete Cosine Transform) is applied. The MFCCs may be calculated using the equation


c(n) = Σi log[Y(i)] · cos(πn(i − 0.5)/M),   n = 1, 2, …, C        (3)

where M is the number of filter bank channels and C is the number of cepstral coefficients retained.

Fig.4 Triangular filter bank
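The stages above can be combined into a single sketch. The proposal targets MATLAB; this NumPy version is given only as a hedged illustration of Fig. 3 (framing, Hamming windowing, DFT, triangular mel filter bank, log, DCT), and all frame sizes and parameter names are illustrative assumptions:

```python
import numpy as np

def mfcc(signal, fs=16000, frame_ms=25, hop_ms=10, n_filters=26, n_ceps=13):
    """Sketch of the MFCC pipeline of Fig. 3. Sizes are illustrative."""
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)   # equation (1)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    frame_len = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    n_fft = 512

    # Stages 1-2: framing with overlap, then Hamming windowing.
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i*hop : i*hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)

    # Stage 3: magnitude spectrum via the DFT.
    spec = np.abs(np.fft.rfft(frames, n_fft))            # (n_frames, n_fft//2 + 1)

    # Stage 4: triangular filters, centres evenly spaced on the mel axis.
    mel_pts = np.linspace(hz2mel(0), hz2mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i-1], bins[i], bins[i+1]
        for j in range(l, c):
            fbank[i-1, j] = (j - l) / max(c - l, 1)      # rising edge
        for j in range(c, r):
            fbank[i-1, j] = (r - j) / max(r - c, 1)      # falling edge

    # Stages 5-6: log filter bank energies (eq. 2), then DCT (eq. 3).
    energies = np.log(spec @ fbank.T + 1e-10)
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (n + 0.5)) / n_filters)
    return energies @ dct.T                              # (n_frames, n_ceps)
```

Each row of the returned matrix is one feature vector for one 25 ms frame; these vectors are what the matching stage consumes.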

Voice Activity Detector

A Voice Activity Detector (VAD) is used primarily to distinguish speech from silence. The VAD compares features extracted from the input speech signal with a predefined threshold: voice activity exists if the measured feature values exceed the threshold; otherwise silence is assumed. A block diagram of the basic voice activity detector used in this work is shown in Fig. 5.

Fig. 5 VAD block diagram

The performance of the VAD depends heavily on the preset threshold for detecting voice activity. The VAD proposed here works well when the energy of the speech signal is higher than that of the background noise and the background noise is relatively stationary. The amplitudes of the speech signal samples are compared with the threshold value, which is decided by analyzing the performance of the system under different noisy environments.
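A minimal energy-based VAD along these lines might look as follows. This is a Python sketch; the threshold value is an illustrative placeholder, to be tuned per environment as described above:

```python
import numpy as np

def vad(signal, fs=16000, frame_ms=20, threshold=0.01):
    """Flag each frame as voice (True) or silence (False) by comparing
    its short-term energy against a preset threshold."""
    frame_len = int(fs * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    flags = []
    for i in range(n_frames):
        frame = signal[i*frame_len : (i+1)*frame_len]
        energy = np.mean(frame ** 2)        # average power of the frame
        flags.append(energy > threshold)
    return np.array(flags)
```

Frames flagged True are passed on to MFCC extraction; frames flagged False are discarded as silence or background noise.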

3. Feature matching

a) Using Euclidean Distance

In the recognition phase a sequence of feature vectors {x1, x2, …, xT} is extracted for the unknown speaker and compared with the codebooks in the database. For each codebook a distortion measure is computed, and the speaker with the lowest distortion is chosen.

Thus, each feature vector of the input is compared with all the codebooks, and the codebook with the minimum average distance is chosen as the best match. The Euclidean distance is defined as follows:

The Euclidean distance between two points P = (p1, p2, …, pn) and Q = (q1, q2, …, qn) is

d(P, Q) = √( Σi (pi − qi)² )        (4)

The speaker whose codebook yields the lowest distortion distance is chosen as the identity of the unknown person.

b) Neural Networks (NN)

Several popular classification (pattern matching) techniques are used for speaker recognition: HMM, GMM, DTW, VQ and NN. Neural networks give much lower error rates on small samples, and hence an NN was a good choice for this work.

Neural networks consist of layers. Layers are made up of a number of interconnected 'nodes', each containing an 'activation function'. Patterns are presented to the network via the 'input layer', which communicates with one or more 'hidden layers' where the actual processing is done via a system of weighted 'connections'. The hidden layers then link to an 'output layer'.

Most ANNs contain some form of 'learning rule' which modifies the weights of the connections according to the input patterns.
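As a concrete example of such a learning rule, the classic perceptron update adjusts each weight in proportion to the error between target and prediction. This minimal Python sketch is for illustration only and is not the network proposed in this work:

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=50):
    """Learning rule demo: weights move in proportion to (target - prediction)."""
    w = np.zeros(X.shape[1] + 1)              # weights plus a bias term w[0]
    for _ in range(epochs):
        for x, target in zip(X, y):
            pred = 1 if w[0] + w[1:] @ x > 0 else 0
            w[0] += lr * (target - pred)      # bias update
            w[1:] += lr * (target - pred) * x # weight update per input pattern
    return w
```

For linearly separable patterns (for example the logical OR function) this rule converges to weights that classify every training pattern correctly.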


Automatic speaker recognition works on the principle that a person's speech exhibits characteristics unique to the speaker. Speech signals in training and testing sessions cannot be identical, because people's voices change with time, health conditions, speaking rate, and so on. Acoustic noise and variations in recording environments present a further challenge. The aim is to make the system "robust": a system is called robust if its recognition accuracy does not degrade significantly under such conditions.


The goals of this research work are:

Develop a new text-dependent and text-independent speaker recognition framework with the help of MFCC and VAD.

Dynamically train the speaker recognition system with clean and noisy (additive and convolutive) speech signals. Each time a new speech signal is input to the system, additive white Gaussian noise at different values of SNR and echo with varying values of delay are added to the clean speech signals.

Investigate the performance of the proposed text-independent and text-dependent speaker recognition systems under noisy environments.

Compute the accuracy rates of identifying the test speaker in clean and noisy environments using the designed speaker recognition model and compare it with the artificial neural network based speaker recognition technique.

Analyze the best method of removing background noise from the voice signal.
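The noise-corruption step mentioned in the goals above (additive white Gaussian noise at different SNR values) can be sketched as follows. This Python illustration is an assumption about how that step might be coded; the proposal itself will implement it in MATLAB:

```python
import numpy as np

def add_awgn(clean, snr_db, seed=0):
    """Add white Gaussian noise to a clean signal at a target SNR in dB."""
    rng = np.random.default_rng(seed)
    signal_power = np.mean(clean ** 2)
    # Scale the noise power so that 10*log10(signal/noise) equals snr_db.
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), len(clean))
    return clean + noise
```

Running the recognizer on add_awgn(speech, snr) for a range of SNR values gives the accuracy-versus-noise curves the evaluation calls for; echo (convolutive noise) would be added by a separate filtering step.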


Speaker recognition is the process of automatically recognizing who is speaking based on unique characteristics contained in the speech wave. Most speaker recognition systems contain two phases. In the first phase, feature extraction, the unique features of the voice data are extracted for later use in identifying the speaker. In the second phase, feature matching, the extracted voice features are compared with the database of known speakers to identify the speaker. The overall efficiency of the system depends on how efficiently the voice features are extracted and on the procedures used for comparing the real-time voice sample features with the database.

The following steps will be performed:

a) Voice will be recorded using a microphone

b) Voice activity detection will be performed on the recorded voice

c) Feature extraction using MFCC

d) Speaker recognition using Euclidean distance

e) Comparison of the result obtained in (d) with the Neural Network result

f) Calculation of the % error for (d) and (e)

g) Display of the result on the serial port

Data: This work focuses on developing a system that uses the speech signal for recognition. The speech signal will be recorded using a microphone. The signal is text dependent: speakers will utter words which will form a database, and different speakers will generate different speech waves.

Tools: The main tool that will be used in this research is the MATLAB software. The MATLAB DSP (Digital Signal Processing) and Neural Network toolboxes will be used to develop the programs, and a GUI will be designed in MATLAB for speaker recognition.

Hardware:The hardware that will be used in this research is:

1. Laptop

2. Intel Pentium Core 2 Duo 1.6 GHz processor

3. USB PC Microphone

Fig. 6 shows the flow chart of the Automatic Speaker Recognition System. Incoming speech samples pass through the VAD block, which checks whether voice activity is detected; if no speech is present, the system keeps checking the voice input. When voice activity is detected, the MFCCs of the detected voice are extracted and fed to the comparison block, which compares the present voice with the database of voice sample MFCCs. If a match is found, the speaker is valid and the speaker ID is output; otherwise the current voice is reported as not present in the database.

Fig. 6: Flow chart of the Speaker Recognition System


The complete system will consist of software coded in MATLAB with a graphical user interface, a microphone for capturing voice data, and a hardware circuit connected to the computer via the serial port for operating a lock and displaying the result on an LCD.

As soon as the system is activated, the microphone connected to the computer will start capturing voice signals and converting them into electrical signals that can be saved and analyzed.

Coded in MATLAB, the system will analyze the data captured by the microphone for white noise and background sound, which will be distinguished from voice when it lies below a specified threshold.

This information will be used to filter the needed speech command out of the complete voice signal containing noise and background sound. The task will be accomplished by generating signals similar to the noise and background sound but 180 degrees out of phase with them, so that they cancel, leaving only the needed speech command.

Once the voice command is successfully extracted from the complete signal, it will be analyzed to extract the various parameters needed for comparison with the database speech.

The extracted features will be:

Base frequencies present in the signal

The amplitude variation of the peaks

The energy envelope present in the signal

The above-mentioned parameters will be compared with the parameters of the speech stored in the database in the form of a wave file. A threshold will be defined for each feature; if the comparison for every feature falls within its specified threshold, the result will be declared true, otherwise false. In either case a data packet associated with the result will be sent over the serial port (UART protocol) to the microcontroller.

The hardware part will consist of a microcontroller, a relay and a 16x2 LCD. On receiving the message from the computer via the serial port (UART protocol), the microcontroller will operate the relay and flash a message on the LCD reporting the result as either matched or unmatched. The relay output can further be used to drive an actuator to open or close a door.