Mel Frequency Cepstral Coefficients Computer Science Essay



This research work aims at designing both text-dependent and text-independent speaker recognition systems based on mel frequency cepstral coefficients (MFCCs) and a voice activity detector (VAD). The VAD is employed to suppress background noise and to distinguish silence from voice activity. MFCCs are extracted from the detected voice sample and compared with the database to recognize the speaker. A new detection criterion is proposed which is expected to perform very well in noisy environments. The system will be implemented on the MATLAB platform, and a new approach for designing the VAD is proposed. To prove the effectiveness of the proposed system, a comparative analysis of the proposed design approach will be carried out against the artificial neural network (ANN) technique. In recent years a significant amount of work, both theoretical and experimental, has established the viability of ANNs as a useful technology for speaker recognition. The performance of both systems will be evaluated under different noisy environments, in different languages and with different emotions. The overall efficiency of the proposed speaker recognition system depends mainly on the detection criterion used for recognizing a particular speaker. Global optimization techniques such as the Genetic Algorithm (GA) and Particle Swarm Optimization (PSO) can prove very useful in this context; hence the Genetic Algorithm will be employed for setting up the detection criterion.


Development of speaker recognition systems began in the early 1960s with the exploration of voiceprint analysis, in which an individual is characterised by the unique properties of his or her voice. The detection efficiency of speaker recognition systems is severely affected in the presence of noise, which motivates the search for a more reliable method. Speaker recognition is the process of recognizing the speaker from a database based on characteristics of the speech wave. Most speaker recognition systems contain two phases. In the first phase, feature extraction is performed: unique features are extracted from the voice data and used later for identifying the speaker. The second phase is feature matching, in which the extracted features are compared with the database of known speakers. The overall efficiency of the system depends on how effectively the voice features are extracted and on the procedures used to compare the real-time voice sample features with the database.

From security applications to crime investigations, speaker recognition is one of the best biometric recognition technologies. A speech signal can serve as the password to the lock system of a home, a locker or a computer. Speaker recognition can also help verify the voice of a criminal from audio tapes of telephone conversations. The main advantage of a biometric password is that it cannot be forgotten or misplaced.

Compared with other biometrics, voice biometrics are user-friendly, cost-effective, convenient and secure. Robust speech recognition systems can be applied to high-accuracy connected-digit recognition, with applications in the recognition of personal identification numbers, credit card numbers and telephone numbers.

Modern speaker recognition systems are designed for high accuracy, low complexity and easy computation. The Hidden Markov Model (HMM) has been successfully applied to both isolated-word and continuous speech recognition; however, it does not address discrimination and robustness issues for classification problems. Acoustic analysis based on MFCCs, which model the human ear [1], has given good results in speaker recognition. The background noise and the microphone used also affect the overall performance of the system [2].

Speaker recognition systems contain three main modules:

(1) Acoustic processing

(2) Features extraction or spectral analysis

(3) Recognition.

All three modules are shown in Fig. 1 and are explained in detail in the subsequent sections.

Fig.1. Basic structure of speaker recognition system

Research and development on speaker recognition methods and techniques has been undertaken for more than four decades, and it is still an active area. Approaches include human aural and spectrogram comparisons, simple template matching, dynamic time-warping, and modern statistical pattern recognition techniques such as Hidden Markov Models (HMM) [Siohan, 1998], Gaussian Mixture Modeling (GMM) [Reynolds, 1995], multi-layer perceptrons [Altosaar and Meister, 1995], Radial Basis Functions [Finan et al., 1996] and genetic algorithms [Hannah et al., 1993].

In the last decade, neural networks have been a topic of interest. Neural networks have the ability to derive meaning from complicated or imprecise data. They can be used to extract patterns and detect trends which are difficult to analyse by either humans or other computer techniques. The advantages offered by neural networks are:

Adaptive learning, Self-Organization, Real Time Operation, Fault Tolerance via Redundant Information Coding.


Research has focused on feature-based recognition systems, which attempt to build reliable, robust and efficient recognizers from speech-derived features. However, variations caused by differences in individual speaker characteristics, emotional state and noise disturbances increase the complexity of such systems.

Template-matching techniques are used for text-dependent methods. The input speech is represented by a sequence of feature vectors, generally short-term spectral feature vectors. The dynamic time warping (DTW) algorithm is used to align the time axes of the input speech and each reference template or model of the registered speakers. The degree of similarity between them, accumulated from the beginning to the end of the utterance, is then calculated. Statistical variation in spectral features can be modeled by the Hidden Markov Model (HMM).

HMM-based methods are extensions of the DTW-based methods. Park, A (2001) introduced a new technique for computing verification scores using multiple verification features from the list of scores for a target speaker's speech. Compared with the baseline logarithmic likelihood ratio verification score using global GMM speaker models, it gave no improvement in verification performance.

Zhou, L (2000) applied neural networks and fuzzy techniques to a speaker-independent speech recognition system. Tests on a large number of speech templates of the Chinese digits 0-9, collected from speakers from different areas and in noisy environments, gave a recognition rate of 92.2%.

Moonasar, V and Venayagamoorthy, G (2002) proposed a speaker verification system that can be improved and made robust by using a committee of neural networks for pattern recognition rather than the conventional single-network decision system. A supervised Learning Vector Quantization (LVQ) neural network was used as the pattern classifier. Linear Predictive Coding (LPC) and cepstral signal-processing techniques were combined into hybrid feature parameter vectors to combat the drop in recognition rate as the number of speakers to be recognized increases.

The most commonly used acoustic vectors are Mel Frequency Cepstral Coefficients (MFCC), Linear Prediction Cepstral Coefficients (LPCC) and Perceptual Linear Prediction Cepstral (PLPC) coefficients and zero crossing coefficients (Yegnanarayana et al., 2005; Vogt et al., 2005). All these features are based on the spectral information derived from a short time windowed segment of speech.

They differ mainly in the detail of the power spectrum representation. A modification of the MFCC feature was proposed for speaker verification (SV) applications (Saha and Yadhunandan, 2000) and compared with the original MFCC-based feature extraction method; that study used the multi-dimensional F-ratio as a performance measure to compare the discriminative ability of different multi-parameter methods in speaker recognition (SR) applications. An MFCC-like feature based on the Bark scale gives the same performance in speech recognition experiments as MFCC (Aronowitz et al., 2005), and these BFCC features are well suited to text-dependent speaker verification systems. Revised Perceptual Linear Prediction Coefficients (RPLP), obtained from a combination of MFCC and PLP, were proposed by Kumar et al. (2010) and Ming et al. (2007) for the purpose of identifying the spoken language.

The aim of a modeling technique is to generate speaker models from speaker-specific feature vectors, so that the models carry more speaker-specific information. Earlier studies on speaker recognition used direct template matching between training and testing data: the training and testing feature vectors are compared directly using a similarity measure such as spectral distance, Euclidean distance or Mahalanobis distance (Liu et al., 2006). The disadvantage of template matching is that it becomes time-consuming as the number of feature vectors grows. Hence the number of training feature vectors is reduced by a modeling technique such as clustering. The cluster centres are called code vectors, and the set of code vectors is known as a codebook. The k-means algorithm is the most widely used codebook generation algorithm (Mporas et al., 2007; Ming et al., 2007). In 1985, Soong et al. used the LBG algorithm for generating speaker-based vector quantization (VQ) codebooks for speaker recognition. System performance with neural network based approaches has also been studied (Clarkson et al., 2006). In HMMs, observation symbols are created from VQ codebook labels, while continuous probability measures are created using Gaussian mixture models (GMMs) (Krause and Gazit, 2006). In 1995, Reynolds proposed the GMM classifier for the speaker recognition task (Krause and Gazit, 2006; Clarkson et al., 2006); it is the most widely used probabilistic technique in speaker recognition. In the GMM modeling technique, the distribution of feature vectors is modeled by means, covariances and weights, and GMM has outperformed the other modeling techniques. Its disadvantage is that it requires sufficient data to model the speaker well (Aronowitz et al., 2005).
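The codebook idea described above can be illustrated with a minimal k-means sketch: cluster the training feature vectors and keep only the cluster centres as code vectors. This is an illustration only (plain k-means, not the LBG splitting variant the literature cites), and the function name and parameters are assumptions.

```python
import numpy as np

def train_codebook(features, n_code=4, n_iter=20, seed=0):
    """Minimal k-means codebook training: the returned rows are the
    code vectors (cluster centres) summarizing the training vectors."""
    rng = np.random.default_rng(seed)
    # Initialize code vectors with randomly chosen training vectors.
    codebook = features[rng.choice(len(features), n_code, replace=False)]
    for _ in range(n_iter):
        # Assign each feature vector to its nearest code vector.
        d = np.linalg.norm(features[:, None] - codebook[None, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each code vector to the mean of its assigned cluster.
        for k in range(n_code):
            if np.any(labels == k):
                codebook[k] = features[labels == k].mean(axis=0)
    return codebook
```

A codebook of a few dozen code vectors replaces thousands of raw training vectors, which is exactly the speed-up over direct template matching that motivates clustering here.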

Various researchers are still trying to improve the performance of speaker recognition systems. Existing optimization techniques such as genetic algorithms, particle swarm optimization and neural networks can come in handy in improving this performance.

Description of Broad Area/Topic

Speaker recognition is the process of recognizing the speaker from a database based on characteristics of the speech wave. Most speaker recognition systems contain two phases. In the first phase, feature extraction is performed: unique features are extracted from the voice data and used later for identifying the speaker. The second phase is feature matching, in which the extracted features are compared with the database of known speakers. Each module is discussed in detail in later sections.


Acoustic processing is the sequence of processes that receives the analog signal from a speaker and converts it into a digital signal for digital processing. Human speech frequencies usually lie between 300 Hz and 8 kHz [2]; therefore a 16 kHz sampling rate can be chosen for recording, which is twice the highest frequency of the signal and satisfies the Nyquist sampling criterion [3]. The start- and end-point detection of an isolated signal is a straightforward process that detects abrupt changes in the signal against a given threshold energy. The result of acoustic processing is a discrete-time voice signal containing meaningful information, which is then fed into the spectral analyser for feature extraction.


The feature extraction module provides the acoustic feature vectors used to characterize the spectral properties of the time-varying speech signal, so that its output eases the work of the recognition stage. A small amount of speaker-specific information, in the form of feature vectors, is extracted from the input voice signal and used as a reference model representing each speaker's identity. A general block diagram of a speaker recognition system is shown in Fig. 2.

Fig.2 Speaker recognition system

It is clear from the above diagram that speaker recognition is a 1:N match, where one unknown speaker's extracted features are matched against all the templates in the reference model to find the closest match. The speaker with maximum similarity is selected.

MFCC Extraction

Mel frequency cepstral coefficients (MFCCs) are probably the best-known and most widely used features for both speech and speaker recognition. A mel is a unit of measure based on the human ear's perceived frequency. The mel scale has approximately linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. The mapping from frequency to mel can be approximated as

mel(f) = 2595 * log10(1 + f/700) --------(1)

where f denotes the real frequency and mel(f) denotes the perceived frequency. The block diagram showing the computation of MFCC is shown in Fig. 3.

Fig.3 MFCC Extraction
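Equation (1) and its inverse (used later to place the filter-bank centre frequencies) can be written directly; this small sketch is for illustration, and the function names are our own.

```python
import math

def hz_to_mel(f):
    """Perceived frequency in mels for a real frequency f in Hz (Eq. 1)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping: real frequency in Hz for a mel value m."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

Note that the constants 2595 and 700 are chosen so that 1000 Hz maps to approximately 1000 mels, which is where the scale transitions from roughly linear to logarithmic.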

In the first stage the speech signal is divided into frames of 20 to 40 ms with an overlap of 50% to 75%. In the second stage each frame is windowed with a window function to minimize the discontinuities of the signal by tapering the beginning and end of each frame towards zero. In the time domain, windowing is a point-wise multiplication of the framed signal and the window function. A good window function has a narrow main lobe and low side-lobe levels in its transfer function; in this work the Hamming window is used. In the third stage the DFT block converts each frame from the time domain to the frequency domain. In the next stage mel frequency warping transfers the real frequency scale to the human perceived frequency scale, called the mel-frequency scale, which is linear below 1000 Hz and logarithmic above 1000 Hz. The warping is normally realized by a triangular filter bank with the centre frequencies of the filters evenly spaced on the warped axis, implemented according to equation (1) so as to mimic the ear's perception. The output of the ith filter is given by

Y(i) = Σj S(j) * Ωi(j), j = 1, 2, …, N --------------- (2)

where S(j) is the N-point magnitude spectrum (j = 1:N) and Ωi(j) is the sampled magnitude response of an M-channel filter bank (i = 1:M). In the fifth stage the log of the filter bank output is computed, and finally the DCT (Discrete Cosine Transform) is applied. The MFCCs may be calculated using the equation

c(n) = Σi log(Y(i)) * cos(πn(i − 0.5)/M), i = 1, 2, …, M --------- (3)

where M is the number of filters in the bank and n indexes the cepstral coefficients that are retained.

Fig.4 Triangular filter bank
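The pipeline of Fig. 3 (framing, Hamming windowing, DFT, mel filter bank as in Eq. 2, log, DCT as in Eq. 3) can be sketched end to end as below. This is a minimal illustration, not the MATLAB implementation used in this work; the frame/hop lengths, FFT size and filter count are assumed typical values.

```python
import numpy as np

def mfcc(signal, fs=16000, frame_ms=25, hop_ms=10, n_filters=20, n_ceps=12):
    """Sketch of MFCC extraction; returns one row of n_ceps coefficients
    per frame."""
    frame_len = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    n_fft = 512
    # 1) Frame the signal with overlap.
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i*hop : i*hop + frame_len] for i in range(n_frames)])
    # 2) Windowing: taper each frame with a Hamming window.
    frames = frames * np.hamming(frame_len)
    # 3) Magnitude spectrum via the DFT.
    spec = np.abs(np.fft.rfft(frames, n_fft))
    # 4) Mel-spaced triangular filter bank (Eq. 2): centres evenly
    #    spaced on the mel axis, mapped back to FFT bins.
    mel_pts = np.linspace(0, 2595*np.log10(1 + (fs/2)/700), n_filters + 2)
    hz_pts = 700 * (10**(mel_pts/2595) - 1)
    bins = np.floor((n_fft + 1) * hz_pts / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft//2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i-1], bins[i], bins[i+1]
        fbank[i-1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[i-1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    energies = np.maximum(spec @ fbank.T, 1e-10)
    # 5) Log of the filter-bank outputs, then 6) DCT (Eq. 3).
    log_e = np.log(energies)
    n = np.arange(n_ceps)[:, None]
    i = np.arange(n_filters)[None, :]
    dct = np.cos(np.pi * n * (i + 0.5) / n_filters)
    return log_e @ dct.T
```

The DCT at the end decorrelates the log filter-bank energies, which is why a short vector of low-order coefficients suffices to characterize each frame.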

Voice Activity Detector

The Voice Activity Detector (VAD) is used primarily to distinguish speech from silence. The VAD compares features extracted from the input speech signal with a predefined threshold. Voice activity exists if the measured feature values exceed the threshold; otherwise silence is assumed. A block diagram of the basic voice activity detector used in this work is shown in Fig. 5.

Fig. 5 VAD block diagram

The performance of the VAD depends heavily on the preset threshold for detecting voice activity. The VAD proposed here works well when the energy of the speech signal is higher than that of the background noise and the background noise is relatively stationary. The amplitude of the speech signal samples is compared with the threshold value, which is decided by analyzing the performance of the system under different noisy environments.
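A minimal energy-based VAD of this kind can be sketched as follows. The threshold rule here (twice a noise-floor estimate from the quietest frames) is our illustrative assumption, not the tuned threshold described above.

```python
import numpy as np

def vad(frames, threshold=None):
    """Energy-based VAD sketch: returns a boolean flag per frame,
    True where voice activity is detected."""
    # Short-term energy of each frame.
    energy = np.sum(frames.astype(float)**2, axis=1)
    if threshold is None:
        # Assumption: the quietest 10% of frames are background noise;
        # set the threshold a factor above that noise floor.
        threshold = 2.0 * np.percentile(energy, 10)
    return energy > threshold
```

As the text notes, such a detector presumes the speech energy exceeds a relatively stationary noise floor; a fixed multiplier like the 2.0 used here would be replaced by a value tuned per environment.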

3. Feature matching

a) Using Euclidean Distance

In the recognition phase a sequence of feature vectors {x1, x2, …, xT} is extracted for the unknown speaker. These are compared with the codebooks in the database: for each codebook a distortion measure is computed, and the speaker with the lowest distortion is chosen.

Thus, each feature vector of the input is compared with all the codebooks, and the codebook with the least average distance is chosen as the best. The Euclidean distance used in this comparison is defined as follows:

Let us take two points P = (p1, p2, …, pn) and Q = (q1, q2, …, qn). The Euclidean distance between them is given by

d(P, Q) = sqrt( (p1 − q1)² + (p2 − q2)² + … + (pn − qn)² ) ----------- (4)

The speaker whose codebook gives the lowest distortion is chosen as the identity of the unknown person.
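The matching rule above (per-vector nearest code vector via Eq. 4, then average distortion per codebook) can be sketched as below; the dictionary layout and names are illustrative assumptions.

```python
import numpy as np

def identify(test_vectors, codebooks):
    """Average-distortion VQ matching: for each speaker's codebook, map
    every test vector to its nearest code vector and average the
    Euclidean distances; the speaker with the lowest average wins."""
    best_id, best_dist = None, float('inf')
    for speaker, codebook in codebooks.items():
        # d[t, c] = Euclidean distance from test vector t to code vector c.
        d = np.linalg.norm(test_vectors[:, None, :] - codebook[None, :, :], axis=2)
        distortion = d.min(axis=1).mean()
        if distortion < best_dist:
            best_id, best_dist = speaker, distortion
    return best_id, best_dist
```

Averaging over all test vectors makes the decision robust to a few outlier frames, which a single-vector comparison would not be.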

b) Neural Networks (NN)

Several popular classification (pattern matching) techniques are used for speaker recognition: HMM, GMM, DTW, VQ and NN. Here a neural network is used as the recognizer.

Neural networks consist of layers made up of a number of interconnected 'nodes', each containing an 'activation function'. Patterns to be recognised by the network are presented at the 'input layer', which communicates with one or more 'hidden layers' where the actual processing is done through a system of weighted 'connections'. The hidden layers are then connected to an 'output layer'.
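The layered structure just described can be sketched as a forward pass: each layer applies its weighted connections and then an activation function. The sigmoid activation and the layer sizes in the test are illustrative assumptions, not the network used in this work.

```python
import numpy as np

def mlp_forward(x, weights, biases):
    """Forward pass through a multi-layer network: at each layer,
    compute the weighted sum of the inputs plus a bias, then apply
    a sigmoid activation function."""
    a = x
    for W, b in zip(weights, biases):
        a = 1.0 / (1.0 + np.exp(-(a @ W + b)))  # weighted connections + activation
    return a
```

Training (e.g. by a learning rule such as backpropagation, mentioned below) adjusts the entries of `weights` and `biases`; the forward pass itself is all that is needed at recognition time.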

Most ANNs modify the weights of the connections according to the input patterns, based on some form of 'learning rule'.


Automatic speaker recognition works on the principle that a person's speech exhibits characteristics unique to the speaker. Speech signals in training and testing sessions can never be identical, owing to factors such as changes in a person's voice over time, health conditions and speaking rate. Acoustical noise and variations in the recording environment present a further challenge to speech recognition. The challenge is to make the system "robust": a system is called robust if its recognition accuracy does not degrade significantly under such variations.


The goals of this research work are:

Develop a new text-dependent and text-independent speaker recognition framework with the help of MFCC and VAD.

Dynamically train the speaker recognition system with clean and noisy (additive and convolutive) speech signals. Each time a new speech signal is input to the system, additive white Gaussian noise at different values of SNR and echo with varying values of delay are added to the clean speech signals.

Investigate the performance of the proposed text-independent and text-dependent speaker recognition systems under noisy environments.

Compute the accuracy rates of identifying the test speaker in clean and noisy environments using the designed speaker recognition model and compare it with the artificial neural network based speaker recognition technique.

Analyze the best method of removing background noise from the voice signal.
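The additive and convolutive distortions described in the goals above (white Gaussian noise at a chosen SNR, and echo with a chosen delay) can be sketched as follows; the attenuation factor and the fixed noise seed are illustrative assumptions.

```python
import numpy as np

def add_awgn(clean, snr_db):
    """Add white Gaussian noise so that the result has the requested
    signal-to-noise ratio in dB."""
    sig_power = np.mean(clean**2)
    noise_power = sig_power / (10 ** (snr_db / 10))
    noise = np.sqrt(noise_power) * np.random.default_rng(0).standard_normal(len(clean))
    return clean + noise

def add_echo(clean, delay, attenuation=0.5):
    """Convolutive distortion: mix in a delayed, attenuated copy of the
    signal (delay in samples)."""
    echoed = clean.copy()
    echoed[delay:] += attenuation * clean[:-delay]
    return echoed
```

Sweeping `snr_db` and `delay` over a range of values produces the family of noisy training and test signals the evaluation calls for.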


Most speaker recognition systems contain two phases. The first phase is feature extraction, in which the unique features of the voice data are extracted and used later for identifying the speaker. The second phase is feature matching, which comprises the actual procedures for identifying the speaker by comparing the extracted features with the database of known speakers. The overall efficiency of the system depends on how effectively the voice features are extracted and on the procedures used to compare the real-time voice sample features with the database.

The following steps will be performed:

a) Voice will be recorded using a microphone

b) Voice activity detection will be performed on the recorded voice

c) Feature extraction using MFCC

d) Speaker recognition using Euclidean distance

e) Compare the result obtained in (d) with that of a neural network

f) Calculate the % error for (d) and (e)

g) Display the result on the serial port

Data: This work focuses on developing a system that uses the speech signal for recognition. The speech signal will be recorded using a microphone. The signal is text-dependent: speakers will utter words which will form a database, and different speakers will generate different speech waves.

Tools: The main tool used in this research is MATLAB. The MATLAB DSP (Digital Signal Processing) and neural network toolboxes will be used to develop the programs, and a GUI will be designed in MATLAB for speaker recognition.

Hardware: The hardware used in this research is:

1. Laptop

2. Intel Pentium Core 2 Duo 1.6 GHz

3. USB PC Microphone

Fig. 6 shows the flow chart of the automatic speaker recognition system.

The flow chart operates as follows: incoming speech samples are passed to the VAD block. If no voice activity is detected, the system keeps checking the voice input. When voice activity is detected, the MFCCs of the detected voice are extracted and the comparison block compares them with the database voice sample MFCCs. If a match is found, the speaker is declared valid and the speaker ID is output; otherwise the current voice is reported as not present in the database.

FIG 6: Flow Chart of Speaker Recognition System


The complete system will consist of software coded in MATLAB with a graphical user interface, a microphone for capturing voice data, and a hardware circuit connected to the computer via the serial port, used for operating a lock and delivering the result on an LCD.

As soon as the system is activated, the microphone connected to the computer will start capturing voice signals and converting them to electrical signals that can be saved and analyzed.

The MATLAB code will analyze the data captured by the microphone for white noise and background sound, which are distinguished from voice by a specified threshold limit.

This information will be used to filter the needed speech command out of the complete voice signal containing noise and background sound. The task will be accomplished by generating signals similar to the noise and background sound but 180 degrees out of phase with them, so that they cancel, leaving only the needed speech command.
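The anti-phase cancellation idea above can be shown in its idealized form: adding a 180-degree-inverted copy of the interference removes it exactly. This sketch assumes a perfect noise reference, which a real system can only approximate (e.g. by estimating the noise from silent segments).

```python
import numpy as np

def cancel_noise(noisy, noise_ref):
    """Idealized anti-phase cancellation: add the interference inverted
    by 180 degrees (i.e. subtract the noise reference)."""
    return noisy + (-1.0) * noise_ref
```

In practice the quality of the result is limited entirely by how closely the generated reference matches the actual noise in amplitude and phase.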

Once the voice command is successfully extracted from the complete signal, it will be analyzed to extract the various parameters needed for comparison with the database speech.

The extracted features will be:

Base frequencies present in the signal

The amplitude variation of the peaks

The energy envelope present in the signal

The above-mentioned parameters will be compared with the parameters of the speech stored in the database in the form of wave files. A threshold will be defined for each feature; if the comparison for every feature is within its specified threshold, the result will be declared true, otherwise false. In either case a data packet associated with the result will be sent over the serial port (UART protocol) to the microcontroller.

The hardware part will consist of a microcontroller, a relay and a 16x2 LCD. On receiving the message from the computer via the serial port (UART protocol), the microcontroller will operate the relay and flash a message on the LCD reporting the result as either matched or unmatched. The relay output can further be used to drive an actuator to open or close a door.