Person Recognition

Chapter 1

Introduction



There are many ways in which humans can identify each other, and the same is true for machines. Many different identification technologies are available, and several have been in commercial use for years. The most common person verification and identification method today is the password or PIN (Personal Identification Number). The problem with such techniques is that they are not unique to a person: a password or PIN can be forgotten, lost, or stolen by somebody else. To overcome these problems, considerable interest has developed in "biometric" identification systems, which use pattern recognition techniques to identify people by their own characteristics, for example in bank transactions and entry into secure areas. Such technologies can have the disadvantage of being intrusive both physically and socially: the user must position the body relative to the sensor and then pause for a second to declare himself or herself.

Identification is a process based on a person's personal attributes, which can be used to establish who he or she is. Identification techniques require a person to use one of the following methods to verify his or her identity:

Remembering something: such as passwords and PINs

Providing written proof: such as signatures

Carrying other physical evidence: such as door keys

Carrying photographic ID cards: such as national identity cards or a driver's license

We intend to use speech as the identification technique in our project because it is a very effective means of identification. With speech, a person is not required to carry anything such as cards, keys, or PINs; he or she only needs to speak a word or code as proof of identity.


The word "biometrics" is built from two Greek words: "bios", which means life, and "metron", which means measure. From this, we may define biometrics in general as the science of recognizing a human being by examining the physical features of an individual.

"Biometrics is a computer-based technology used to plot and record the physical and behavioral characteristics of an individual for identification or authentication. The patterns of these individuals are matched in real time against a database of enrollees [1]."

Difference between Speech Recognition and Speech Detection

Many people think that the terms speech recognition and speech detection mean the same thing. Although the two use many similar techniques and are based on the same ideas and algorithms, they are different systems. A prior step to speech recognition is the accurate detection of human voices in arbitrary scenes, and this is the most important process involved. The main difference is that speech recognition detects speech and then searches through a dataset for an exact match, whereas speech detection looks for any match and stops searching as soon as one is found.

Speech recognition

Bridging the gap between the world of the computer and that of its user is one of the chief goals of computer engineering research. Graphical interfaces, input devices, speech generators, and handwriting recognition systems are just a few examples of how we are making computers more accessible. Machine vision is a more recent thrust in this direction, and represents a major step in bringing the computer into the world.

This method requires a microphone to record the voice of a user, which is then checked for various unique features in it that might match a particular sample in the database [3].

Recognition Methods

Handwriting and Signature Recognition

As the name suggests, these techniques employ the methodology of asking a user to provide a signature or some other text in one's handwriting at runtime which is then checked for authorization in the database. Since this method does not involve any part of the human anatomy, it is called behavioral biometrics [3].


Retina Scanning

It involves the scanning of the retina for the measurement and analysis of the unique pattern of blood vessels present in the retina.

Face Recognition

This method involves storing many key features of one's face in the form of a multidimensional face space and then comparing the input face with those already existing in a database.

Iris Scanning

The iris is the area of the eye where the pigmented or colored circle rings the dark pupil. In iris scanning, a picture of the human eye is taken, the iris portion is extracted, and its various patterns are used as a tool for comparison and identification.

Fingerprint Recognition

This process involves scanning a fingerprint and then comparing various unique features of it, which are made up of the ridges and valleys found in the epidermis layer of the human skin.

Why speech recognition?

The field of speech recognition is so wide that its benefits span an equally wide range. For example, a disabled person who has lost the use of arms or legs cannot control a machine or perform jobs that need arm or foot control, that is, jobs requiring muscular force. With the help of speech recognition, such a person can do the job: there is no need to reach the light control unit to turn the lights on or off, since a spoken command can operate the unit. Speech recognition is also very effective for saving time and energy; instead of giving commands with hands or feet (as when controlling a mobile device), it is often much easier to speak. Another example comes from telephone systems that use speech recognition: when a company receives a call, the client can be routed to the person he or she wants to reach without the need for a secretary, which decreases the company's costs, since over the long term hiring a secretary costs much more than speech recognition software. It can also be used in banking applications, where it is both usable and safer, because nobody wants to share banking information, such as passwords, with another person. Finally, speech recognition is very effective in security systems and is used to control entrance to buildings or rooms that are confidential. The real benefit of speech recognition in all these cases is its accuracy.

Project brief Introduction

The visual and vocal characteristics of a person are two sources of distinctiveness that can provide information about the individuality of a person. We describe a person recognition system that uses speech as the primary source of personal identity information. A software program is developed for recognizing commands: it takes input from the user in the form of speech, recognizes it, and acts according to the conditions specified in the code, carrying out the corresponding application.

The main intention of our project is to recognize a person on the basis of his or her voice characteristics, extracted by the MBLPCC technique. The working principle of speech recognition technology is as follows. Speaker recognition has two categories: speaker identification and speaker verification. Speaker identification determines which one of a group of people is speaking, i.e. a "one out of many" selection. According to the voice material used, speaker recognition can be divided into text-dependent and text-independent technology. A text-dependent system requires the speaker to pronounce words in accordance with the contents of a fixed text, so that each person's individual sound-profile model is established accurately; the person must also be identified using the contents of that text during recognition to achieve a better effect. A text-independent system does not require fixed word contents; it is relatively difficult to model, but it is convenient for the user and can be applied to a wide range of situations. Speaker recognition is an application based on the physiological characteristics of the speaker's voice and linguistic patterns. Unlike speech recognition, voiceprint recognition is independent of the contents of the speech; rather, the unique features of the voice are analyzed to identify the speaker. From voice samples, the unique features are extracted and converted to digital symbols, and these symbols are stored as that person's character template. The template is stored in a computer database, a smart card, or a bar-coded card, and user authentication is processed inside the recognition system.

Chapter 2

2.1 The Basic Properties of Speech

The production of the speech signal can be explained as the exhaling of air through the vocal cords into the vocal tract, which extends from the glottis to the mouth. The tongue is used to vary the formant frequencies. Because the shape of the vocal tract varies relatively slowly, the transfer function of the modeling filter needs to be updated only every 20 ms or so.

Speech sounds can be broken down into two classes depending on their type of excitation.

Voiced sounds are produced when the vibrating vocal cords interrupt the flow of air from the lungs into the vocal tract by opening and closing, so that quasi-periodic pulses of air provide the excitation. The pitch of the sound is determined by the opening and closing rate, which is adjusted by varying the shape of and tension in the vocal cords. The pitch period is typically between 2 and 20 ms.

Plosive sounds are generated when a complete closure is made in the vocal tract, and air pressure is built up behind it and released suddenly.

Figures 2.1(a) and 2.1(b) elaborate the concept.

2.2 The Human Voice

The human voice is just an air-pressure variation. It is produced by airflow pressed out of the lungs and passing out through the mouth and nasal cavities.

The vocal folds are thin muscles, looking like lips, located at the larynx. At their front end they are permanently connected together; at the other end they can be open or closed. When closed, the vocal folds are stretched next to each other, forming an air block. The air pressure from the lungs forces its way through that block, pushing the vocal folds aside. The air passes through the crack so formed, the pressure drops, and the vocal folds close again. This process repeats on and on, vibrating the vocal folds to produce the voiced sound.

Males usually have longer vocal cords than females, which is the reason for their lower pitch and deeper voice. For unvoiced sounds, a triangle is formed by the opening of the vocal folds, allowing the air to reach the mouth cavity easily; random noise is generated by any turbulence in the vocal tract.

2.3 Factors associated with speech

2.3.1 Formants

Research has shown that the vocal tract and nasal tract are tubes with non-uniform cross-sectional area. As the generated sound propagates through these tubes, its frequency spectrum is shaped by the frequency selectivity of the tube. This effect is very similar to the resonance effects observed in organ pipes and wind instruments. In the context of speech production, the resonance frequencies of the vocal tract are called formant frequencies, or simply formants.

In our engineered model, the poles of the transfer function correspond to the formants. The human auditory system is much more sensitive to poles than to zeros.

2.3.2 Phonemes

Phonemes can be defined as the "symbols from which every sound can be classified or produced". Every language has its particular set of phonemes, typically ranging from 30 to 50; English has 42 phonemes. A crude estimate of the information rate of speech, considering the physical limitations on articulatory motion, is about 10 phonemes per second.

Types of Phonemes

Speech sounds can be classified into two distinct classes according to the mode of excitation.

Plosive Sounds

Voiced Sounds

Plosive Sounds

These sounds are produced as a result of the sudden release of pressure built up behind a complete closure made at the front end of the vocal tract.

Voiced Sounds

The vocal tract is excited by quasi-periodic pulses generated by the vibration of the vocal cords in a relaxation oscillation.

Voiced sounds are characterized by

• High Energy Levels

• Very Distinct resonant and formant frequencies.

The rate at which the vocal cords vibrate determines the pitch.

2.4 Special Type of Voiced and Unvoiced Sounds

There are, however, some special types of voiced and unvoiced sounds, which are briefly discussed here. The purpose of this discussion is only to give the reader an idea of the further types of voiced and unvoiced speech.

2.4.1 Vowels

Vowels are produced by quasi-periodic pulses exciting a fixed vocal tract. The resonant frequencies are generated by varying the cross-sectional area of the vocal tract; the dependence of cross-sectional area upon distance along the tract is called the area function.

The area function of a particular vowel is determined primarily by the position of the tongue, but the positions of the jaw and lips also affect the resulting sound to a small extent. Examples: a, e, i, o, u.

2.4.2 Semivowels

The semivowels /w/, /l/, /r/, and /y/ are very difficult to characterize. A gliding transition in the vocal tract area function between adjacent phonemes characterizes them.

Thus the acoustic characteristics of these sounds are strongly influenced by the context in which they occur.

2.4.3 Nasals

The nasal consonants /m/, /n/, and /ŋ/ are produced with glottal excitation and the vocal tract totally constricted at some point along the oral passageway. By lowering the velum, the airflow is directed through the nasal tract and made to radiate at the nostrils. The nasalized vowels are spectrally broader, or highly damped, and are characterized by resonance.

Chapter 3

Techniques for Speech Recognition

3.1 Commonly used Techniques

Speaker recognition comprises two tasks: speaker verification and speaker identification. Speaker verification is the process of determining whether or not a speech sample belongs to one specific speaker, whereas in speaker identification the goal is to determine which one of a group of known voices best matches the input voice sample.

It is evident that the method used to extract and model the speaker-dependent characteristics of the speech signal seriously affects the performance of a speaker recognition system.

Different techniques have been used to extract features; some of them are listed below:

Hidden Markov Models (HMM)

Mel-Scale Frequency Cepstral coefficients (MFCC)

Linear Predictive Cepstral coefficient (LPCC)

Hidden Markov Model

An HMM is a temporal probabilistic model in which the state of the process is described by a single discrete random variable. Loosely speaking, it is a Markov chain observed in noise. The theory of hidden Markov models was developed in the late 1960s and early 1970s by Baum, Eagon, Petrie, Soules, and Weiss [10].

This model is utilized in characterizing the statistical attributes/properties of a signal.

We can describe an HMM λ = (A, b, π) by the following parameters:

A set of K discrete hidden states. The state at time t is q_t ∈ {1, 2, …, K}.

A state transition probability distribution matrix A = {a_jk}, where a_jk = P(q_t+1 = k | q_t = j), 1 ≤ j, k ≤ K.

A probability density function for each state, b = {b_k(x)}, where b_k(x_t) = P(x_t | q_t = k), 1 ≤ k ≤ K.

An initial state distribution π = {π_k}, where π_k = P(q_1 = k), 1 ≤ k ≤ K.

At any discrete time t, the HMM λ is in exactly one hidden state q_t = k, from which, according to the probability distribution b_k, it produces an output. The parameters (A, b, π) of λ are weighting factors describing the strength of the dependencies among the observations (outputs) and states. They represent local conditional beliefs, and their combined effect yields a most likely combination of hypotheses. An HMM can carry out a number of tasks based on sequences of observations:

LEARNING: given an observation sequence X = {x1, x2, …, xT} and a model λ, the model parameters can be adjusted so that P(X | λ) is maximized.

PREDICTION: an HMM λ predicts observation sequences and their associated state sequences, reflecting the probabilistic characteristics inherent in the model.

SEQUENCE CLASSIFICATION: given an observation sequence X = {x1, x2, …, xT}, by computing P(X | λi) for a set of known models λi, we can classify the sequence as belonging to the class i for which P(X | λi) is maximized.

SEQUENCE INTERPRETATION: given X = {x1, x2, …, xT} and an HMM λ, applying the Viterbi algorithm yields the single most likely state sequence

Q = {q1, q2, …, qT}
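The sequence-interpretation task above can be sketched in code. The following is a minimal Viterbi decoder for a discrete-observation HMM; the two-state model (A, B, pi) at the bottom is an illustrative toy, not one taken from the text.

```python
import numpy as np

def viterbi(obs, A, B, pi):
    """Return the most likely hidden-state sequence for an observation
    sequence, using dynamic programming in the log domain."""
    K = A.shape[0]                       # number of hidden states
    T = len(obs)
    logd = np.full((T, K), -np.inf)      # log of best path probability
    back = np.zeros((T, K), dtype=int)   # backpointers for traceback
    logd[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        for k in range(K):
            scores = logd[t - 1] + np.log(A[:, k])
            back[t, k] = np.argmax(scores)
            logd[t, k] = scores[back[t, k]] + np.log(B[k, obs[t]])
    # Trace back the best state sequence from the final time step
    q = [int(np.argmax(logd[-1]))]
    for t in range(T - 1, 0, -1):
        q.append(int(back[t, q[-1]]))
    return q[::-1]

# Toy 2-state model: state 0 tends to emit symbol 0, state 1 emits symbol 1.
A = np.array([[0.9, 0.1], [0.1, 0.9]])
B = np.array([[0.8, 0.2], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
print(viterbi([0, 0, 1, 1, 1], A, B, pi))   # [0, 0, 1, 1, 1]
```

Because the sticky transition matrix penalizes state changes, the decoder follows the emitted symbols but resists switching states on isolated outliers.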

There is a predictable order of features when speech is recorded from different positions, e.g. near to or far from the microphone. Suppose the training set consists of a collection of speech samples for each subject; the observation sequence O is then generated by sliding a sampling window over each sample and extracting a feature vector at each position.

Mel-Scale Frequency Cepstral coefficients

This technique (MFCC) is used for feature extraction and vector quantization, reducing the data so that it is easier to handle. Through this technique a speaker can be identified by voice, and access to services can be controlled, such as database access, voice mail, and remote access to computers.

The speech signal is a slowly time-varying signal: when it is examined over a short period of time, such as 5 to 100 ms, its characteristics are quite stationary, whereas over a longer period, say 0.2 s, the signal characteristics change. That is why short time periods are used for spectral analysis. The MFCC technique makes use of two types of filters: linearly spaced filters and logarithmically spaced filters. To capture the signal characteristics, MFCC is expressed on the mel frequency scale, which has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. The characteristics of speech also differ from one time to another depending on the speaker's physical condition [11], [12].

Fig 3.1.1: Block diagram of MFCC processor

The speech signal consists of tones with different frequencies. Each tone has an actual frequency f, measured in Hz, and a subjective pitch measured on the mel scale. The mel frequency scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. As a reference point, the pitch of a 1 kHz tone, 40 dB above the perceptual hearing threshold, is defined as 1000 mels. Therefore we can use the standard formula to compute the mels for a given frequency f in Hz [13]:

mel(f) = 2595 log10(1 + f / 700)

The log mel spectrum has to be converted back to the time domain; the result is called the mel frequency cepstral coefficients (MFCCs). The MFCCs can be calculated with a discrete cosine transform of the log mel spectrum,

c_n = Σ_{k=1..K} (log S_k) cos[n (k - 1/2) π / K], n = 1, 2, …, K

where S_k are the mel filterbank outputs and K is the number of cepstral coefficients.
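The MFCC processing chain described above (power spectrum, mel filterbank, log, cosine transform back to "time") can be sketched as follows. The filter count, frame length, and test tone are illustrative assumptions; a real front end would add pre-emphasis, framing, and liftering.

```python
import numpy as np

def hz_to_mel(f):
    # Standard mel mapping: linear below ~1 kHz, logarithmic above.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, fs, n_filters=20, n_ceps=12):
    """MFCCs for a single windowed frame; a simplified sketch."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2            # power spectrum
    n_fft = len(frame)
    # Triangular filters spaced evenly on the mel scale
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, len(spectrum)))
    for i in range(n_filters):
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        for j in range(lo, c):
            fbank[i, j] = (j - lo) / max(c - lo, 1)
        for j in range(c, hi):
            fbank[i, j] = (hi - j) / max(hi - c, 1)
    log_energy = np.log(fbank @ spectrum + 1e-10)         # log mel spectrum
    # DCT-II back to "time" gives the cepstral coefficients
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (n + 0.5)) / n_filters)
    return dct @ log_energy

fs = 8000
t = np.arange(256) / fs
frame = np.hamming(256) * np.sin(2 * np.pi * 440 * t)   # illustrative tone
print(mfcc_frame(frame, fs).shape)                      # (12,)
```

Keeping only the first dozen coefficients retains the smooth spectral envelope while discarding fine pitch structure, which is the usual motivation for the cepstral truncation.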

Linear Predictive Cepstral Coefficients

LPCC is a combination of linear predictive coefficients (LPC) and cepstral coefficients (CC).


(Block diagram: input signal → LPC analysis → LPC coefficients → cepstral conversion → LPCC features)
This technique is only mentioned briefly here; because we are using it in our project, it is discussed in detail in the following sections.

3.2 Linear Predictive Cepstral Coefficients

3.2.1 What is LPC?

The coefficients obtained from a linear predictive coding filter are known as the linear predictive coefficients. These are the estimated formants.

Introduction

Human speech production is governed by two major factors: the source excitation and the vocal tract shaping. When we model the speech recognition process, we have to model both of these factors.

Excitation modeling is an estimation of the pitch, which is a special property of a speech signal.

Vocal area shaping is a process with a fundamental algorithm which helps to estimate the formants.

In this model, the speech signal s(n) is considered to be the output of the system, and the excitation signal u(n) is its input. The speech sample s(n) is modeled as a linear combination of past and present inputs and past outputs:

s(n) = Σ_{k=1..p} a_k s(n - k) + G Σ_{l=0..q} b_l u(n - l), with b_0 = 1

where G is a gain factor and {a_k}, {b_l} (the filter coefficients) are the system parameters; p is the number of past output samples used.

The transfer function H(z) of the system is

H(z) = S(z) / U(z) = G (1 + Σ_{l=1..q} b_l z^-l) / (1 - Σ_{k=1..p} a_k z^-k)

where H(z) describes a pole-zero model.

There are two special cases of this model:

When b_l = 0 for 1 ≤ l ≤ q, H(z) reduces to an all-pole model, known as the autoregressive model.

When a_k = 0 for 1 ≤ k ≤ p, H(z) becomes an all-zero, or moving average, model.

The transfer function of the all-pole model is

H(z) = G / (1 - Σ_{k=1..p} a_k z^-k)

Any causal rational system can be decomposed as

H(z) = G' H_min(z) H_ap(z)

where G' is a gain factor, H_min(z) is the transfer function of a minimum-phase filter, and H_ap(z) is the transfer function of an all-pass filter.

The minimum-phase part can be expressed as an all-pole model

H_min(z) = 1 / (1 - Σ_{k=1..I} a_k z^-k)

where I is a practically finite integer. A pole-zero model can therefore be approximated by an all-pole model, since the effect of the zeros can be captured by additional poles.

The z-transform of the all-pole transfer function is

H(z) = G / (1 - Σ_{k=1..p} a_k z^-k)

where G is the gain. If G = 1, the transfer function becomes

H(z) = 1 / A(z)

where the polynomial A(z) = 1 - Σ_{k=1..p} a_k z^-k.

"The filter coefficients {a_k} are the Linear Prediction Coefficients."

Error function

The error signal is the difference between the input speech and the estimated speech.

Estimation of LPC

There are two commonly used methods for estimating the LP coefficients:

The autocorrelation method

The covariance method

Both methods are used for the minimization of the error signal.

Estimating the formants

LPC analysis generally finds the formants of the speech signal. Each sample is represented as a linear combination of the previous samples by a difference equation called the linear predictor, also known as LPC. These LPC components give the formants of the speech signal. The estimation is done by minimizing the mean square error between the predicted signal and the actual signal.

LPC Parameters

The autocorrelation method of order p for LPC analysis leads to the normal equations

R a = r

where r is the autocorrelation vector with entries r(1), …, r(p), a is the filter-coefficient vector, and R is the p × p Toeplitz matrix of autocorrelation values with entries R_ij = r(|i - j|). This matrix is nonsingular, giving the solution

a = R^-1 r

The autocorrelation method is very effective in speech processing.
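The autocorrelation method above can be sketched with the Levinson-Durbin recursion, which exploits the Toeplitz structure to solve the normal equations efficiently. The AR(2) test signal, its coefficients, and the model order below are illustrative assumptions.

```python
import numpy as np

def lpc_autocorr(x, p):
    """Estimate order-p LPC coefficients via the autocorrelation method
    (Levinson-Durbin recursion). Returns the polynomial A(z) coefficients
    [1, a1, ..., ap] and the final prediction error. A sketch, not
    production code."""
    n = len(x)
    r = np.array([x[:n - k] @ x[k:] for k in range(p + 1)])  # autocorrelation
    a = np.zeros(p + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, p + 1):
        acc = r[i] + a[1:i] @ r[i - 1:0:-1]
        k = -acc / err                     # reflection coefficient
        a_new = a.copy()
        a_new[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a_new[i] = k
        a = a_new
        err *= (1.0 - k * k)               # error shrinks at every order
    return a, err

# Synthesize an AR(2) signal x(n) = 0.6 x(n-1) - 0.2 x(n-2) + e(n)
rng = np.random.default_rng(1)
x = np.zeros(4000)
e = rng.normal(size=4000)
for n in range(2, 4000):
    x[n] = 0.6 * x[n - 1] - 0.2 * x[n - 2] + e[n]
a, err = lpc_autocorr(x, 2)
print(np.round(a, 2))   # close to [1, -0.6, 0.2]
```

Recovering A(z) ≈ 1 - 0.6 z^-1 + 0.2 z^-2 from the synthetic signal is the expected behavior, since the estimator inverts exactly the model used to generate the data.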

3.2.2 Cepstrum

The cepstrum is a transform technique used to extract information from a person's speech signal. It is used to separate the excitation signal, which carries the pitch and the words, from the transfer function, which carries the quality of the speech. Its applications are the same as those of LPC, but as a spectral analysis method it is a completely different technique.

In the speech recognition process we have two methods to form our codebook, categorized according to the voiced and unvoiced parts of the signal. From the voiced part we extract the pitch and the transfer function, which helps to extract the vowel-sound parameters in a speech signal; the unvoiced part contains the non-vowel sounds.

This approach is a different way of looking at speech parameters and is known as the source-filter model. [14]

Mathematically, the source-filter model is described in the time domain as a convolution:

s(n) = e(n) * h(n)

where e(n) is the excitation and h(n) the vocal tract impulse response. We know that convolution in the time domain is multiplication in the frequency domain, so the above expression becomes

S(ω) = E(ω) H(ω)

Taking the logarithm of both sides of this expression turns the product into a sum:

log |S(ω)| = log |E(ω)| + log |H(ω)|

Computing the inverse Fourier transform of this equation gives the cepstrum; its independent variable is called the "quefrency". Quefrency is the x-axis of the cepstrum, and its units are in time. Typically the axis of interest is from

0ms - 10ms

The coefficients obtained from the above process are known as the "cepstral coefficients". [14], [15]
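The log-then-inverse-transform procedure can be sketched as a real cepstrum. The decaying impulse train below is an illustrative stand-in for a voiced excitation; its period of 64 samples shows up as the dominant cepstral peak at that quefrency, which is how cepstral pitch detection works.

```python
import numpy as np

def real_cepstrum(x):
    """Real cepstrum: inverse FFT of the log magnitude spectrum.
    The quefrency axis has units of time (samples here)."""
    spectrum = np.fft.fft(x)
    log_mag = np.log(np.abs(spectrum) + 1e-12)   # guard against log(0)
    return np.real(np.fft.ifft(log_mag))

# A decaying impulse train with period 64 samples, standing in for a
# quasi-periodic voiced excitation.
x = np.zeros(1024)
x[::64] = 0.95 ** np.arange(16)
c = real_cepstrum(x)
# The peak away from the low-quefrency region sits at the period.
print(int(np.argmax(c[32:512])) + 32)   # 64
```

Separating the low-quefrency region (vocal tract envelope) from the peak at the excitation period is exactly the excitation/transfer-function split described in the text.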

3.2.3 LPC Parameters to CC

CC parameters are very important in a speech recognition model. The direct conversion from LPC to CC is done with the recursion

c_m = a_m + Σ_{k=1..m-1} (k/m) c_k a_{m-k}, for 1 ≤ m ≤ p

c_m = Σ_{k=m-p..m-1} (k/m) c_k a_{m-k}, for m > p

CC are related to the log magnitude of the Fourier components of a speech signal and are more robust for speech recognition. Generally, a cepstral representation with Q > p coefficients is used, where Q ≈ (3/2) p. [15]
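The LPC-to-cepstrum recursion can be sketched directly. The second-order predictor below is an illustrative example, and the gain term c_0 = ln G is omitted.

```python
import numpy as np

def lpc_to_cc(a, Q):
    """Convert LPC predictor coefficients a[0..p-1] (i.e. a_1..a_p) to Q
    cepstral coefficients using the standard recursion. A sketch; the
    gain term c0 = ln G is not included."""
    p = len(a)
    c = np.zeros(Q + 1)                  # c[1..Q]; c[0] unused here
    for m in range(1, Q + 1):
        acc = a[m - 1] if m <= p else 0.0
        for k in range(max(1, m - p), m):
            acc += (k / m) * c[k] * a[m - k - 1]
        c[m] = acc
    return c[1:]

a = np.array([0.6, -0.2])                # illustrative 2nd-order predictor
print(np.round(lpc_to_cc(a, 3), 4))      # [ 0.6    -0.02   -0.048]
```

For a first-order predictor the recursion reduces to the known series c_m = a_1^m / m, which is a quick way to sanity-check an implementation.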

3.2.4 Speech Feature Extraction Process

The feature extraction process takes a speech signal as input and returns the multi-band LPCC (MBLPCC) features as output.

The main steps of this algorithm are:

1. Input the speech signal.

2. Extract the full-band LPCC from the full-band speech signal.

3. Apply the discrete wavelet transform to decompose the speech signal into sub bands.

4. Repeat step 3 until the desired number of sub bands is achieved.

5. Save the low-frequency sub bands and discard the high-frequency sub bands.

6. Extract LPCC from a low-frequency sub band.

7. Add the sub-band LPCC to the full-band LPCC.

8. Repeat steps 6 and 7 until the LPCC of all sub bands have been calculated and added.

9. The multi-band LPCC is generated.
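The steps above can be sketched as a pipeline. To keep the sketch self-contained, the DWT is a one-level Haar low-pass stage and the "LPCC" extractor is a normalized-autocorrelation placeholder (an assumption, standing in for the full LPC/cepstrum chain described earlier).

```python
import numpy as np

def haar_dwt_low(x):
    """One level of a Haar DWT, keeping only the low-frequency sub band
    (the high-frequency detail band is discarded, as in the text)."""
    if len(x) % 2:
        x = x[:-1]
    return (x[0::2] + x[1::2]) / np.sqrt(2.0)

def lpcc(x, q=12):
    """Placeholder feature extractor: the first q normalized
    autocorrelation values, so the pipeline structure is runnable.
    A real system would use the LPC/cepstrum steps described earlier."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    return r[:q] / (r[0] + 1e-12)

def mblpcc(x, levels=2):
    """Multi-band features: full-band features plus features from each
    successively halved low-frequency sub band."""
    features = [lpcc(x)]                  # step 2: full band
    band = x
    for _ in range(levels):
        band = haar_dwt_low(band)         # steps 3-5: decompose, keep low band
        features.append(lpcc(band))       # steps 6-7: sub-band features
    return np.concatenate(features)       # step 9: multi-band vector

x = np.sin(2 * np.pi * 200 * np.arange(512) / 8000)   # illustrative signal
print(mblpcc(x, levels=2).shape)                      # (36,): 3 bands x 12
```

Each wavelet stage halves the bandwidth, so the concatenated vector describes the signal at several resolutions, which is the point of the multi-band scheme.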

3.2.5 Block Diagram of MBLPCC Feature Extraction

This analysis is based on time-frequency multiresolution analysis. MBLPCC features are used as the front end of the speech recognition process; they are a good representation of the spectral envelope of vowels and are also selected for their simplicity.

Chapter 4

Wavelet Transforms

The wavelet transform and the Fourier transform are common tools for signal analysis. The Fourier transform generalizes the complex Fourier series, while the wavelet transform decomposes a signal into a wavelet basis. [16]

Both the Fourier transform and the wavelet transform represent a signal as a linear combination of basis functions. [17]

4.1 What are Basis Functions?

The concept of basis functions can be explained with an example: a two-dimensional vector (x, y) has the two basis functions (1, 0) and (0, 1), because it can be represented as a linear combination of them. Multiplying x by (1, 0) gives (x, 0), and multiplying y by (0, 1) gives (0, y); the sum of (x, 0) and (0, y) is (x, y). [17]

We can also scale basis functions. If we have a function defined over the domain from 0 to 1, we can divide this domain into 0 to 1/2 and 1/2 to 1, and then again into four step functions from 0 to 1/4, 1/4 to 1/2, 1/2 to 3/4, and 3/4 to 1. [18]

4.2 Fourier analysis

The Fourier transform translates a signal into the frequency domain so that its frequency components, described by the Fourier coefficients, can be analyzed. These coefficients represent the sine and cosine variations. There are also several types of Fourier transforms, such as the discrete Fourier transform and the windowed Fourier transform.

4.3 Wavelet transforms versus Fourier transforms

The fast Fourier transform and the discrete wavelet transform are quite similar: both are linear operations that generate a data structure containing segments of various lengths, usually transforming a data vector of length 2^n.

Both transforms have similar mathematical properties as well: for example, the inverse matrix for each transform is the transpose of the original. Both allow the signal to be analyzed in a different domain (frequency or time-scale).

In the case of the FFT, the basis functions are sines and cosines, while for the DWT the basis functions are wavelets.

These wavelets are localized.

Fourier analysis retains the frequency information, but the temporal information (when each frequency component occurred) is lost during the transformation. The wavelet transform retains the temporal information as well.

In 1987, wavelets were first shown to be the foundation of a new approach to signal processing and analysis.

A basic feature shared by the wavelet and Fourier transforms is the orthogonality of their basis functions.

The wavelet transform has a significant advantage over the Fourier transform, especially in cases where the signal contains discontinuities and sharp spikes.

The wavelet transform is a strong tool for decomposition, analysis, and synthesis, with an emphasis on time-frequency localization. [18]

In the windowing procedure, the short-time Fourier transform (STFT) is capable of obtaining the time information of the signal. Here the window is a square wave that truncates the sine and cosine functions to obtain a particular width. The same window is used for all frequencies, so the resolution is the same at all positions in the time-frequency plane, as shown in the figure.

In the DWT, the window size varies with the frequency scale to handle both discontinuities and smooth components: short high-frequency basis functions are used for discontinuities, and long low-frequency basis functions for smooth components.

A function f can be represented by either a Fourier or a wavelet expansion.

The following figure shows the STFT of an impulse and of a sine signal, with the loss in resolution visible.

The following figure clearly shows that the wavelet transform has the upper hand over the STFT in better localization of a time-domain impulse, with only slightly inferior frequency resolution.

The frequency localization of a higher-frequency sine function is not as good with the wavelet transform, and there are some trade-offs between the wavelet and Fourier transforms. Overall, however, the wavelet transform is more efficient than the Fourier transform for this kind of analysis. [19], [20]

4.4 Types of "Mother Wavelet" Basis Functions

There is an infinite number of possible mother wavelets. The most commonly used are:

The Haar Wavelets

Daubechies order 4 Wavelets (D4)

The Coiflet order 3 Wavelet (C3)

The symmlet order 8 Wavelet (S8)

Fig.4.4.1: Graphical comparison among different mother wavelets.

4.5 Discrete wavelet transforms mathematical model

The discrete wavelet transform is built by translating and scaling a "mother function", or "analyzing wavelet", defining an orthogonal basis set

W_(s,l)(x) = 2^(-s/2) W(2^-s x - l)

The variables s and l are integers that scale and translate the mother function to generate the wavelets: s is the width index, and l is the location index, which gives the position. The mother function is rescaled by powers of 2, so the basis shows self-similarity: if we know the mother function, we can obtain all the basis functions. The analyzing wavelets are related to a scaling function through the scaling equation

W(x) = Σ_{k=-1..N-2} (-1)^k c_(k+1) Φ(2x + k)

where W(x) is the scaling function for the mother function Φ(x) and the c_k are the wavelet coefficients. The coefficients must satisfy the linear and quadratic constraints

Σ_{k=0..N-1} c_k = 2,   Σ_{k=0..N-1} c_k c_(k+2l) = 2 δ_(l,0)

where δ is the delta function and l is the location index.

The coefficients act like filter coefficients. They help break complicated signals into simpler components and can be used in the analysis or segmentation of complex signals, in the recognition or detection of particular features, and in compression as well as de-noising of a signal. In fact, wavelets decompose the signal into different resolution scales, indexed by scale and position. [17]
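The linear and quadratic constraints on the wavelet coefficients can be checked numerically. The Daubechies-4 coefficients below use the normalization in which the coefficients sum to 2, matching the constraints as stated in the text.

```python
import numpy as np

# Daubechies-4 wavelet coefficients, normalized so that sum(c_k) = 2.
s3 = np.sqrt(3.0)
c = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / 4.0

print(np.isclose(c.sum(), 2.0))                    # linear: sum c_k = 2
print(np.isclose(c @ c, 2.0))                      # quadratic, l = 0
print(np.isclose(c[0]*c[2] + c[1]*c[3], 0.0))      # quadratic, l = 1
```

All three checks print True: the coefficients sum to 2, their squared sum is 2, and the lag-2 overlap vanishes, which is what makes the resulting basis orthogonal.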

4.6 Wavelets Bases

Wavelet bases are bases of nested function spaces. They can be used to analyze a signal at multiple scales.

Wavelet coefficients carry both time domain and frequency domain information.

Basis functions vary in position and scale.

The fast wavelet transform is more efficient than others

Wavelet packets are linear combinations of wavelets.

Two band analysis of DWT

Fig.4.6.1: Wavelet transform to full band and with three sub bands.

If the speech signal is band-limited from 0 to 4000 Hz, then the three bands will be 0-4000 Hz (full band), 0-2000 Hz, and 0-1000 Hz.

4.7 Wavelet Applications

The wavelet transform has important uses in many fields, such as:

Speech Modeling

Computer and human vision

Quantum physics

Image compression

Denoising noisy data


Chapter 5

Vector Quantization


Vector quantization is a data compression technique. It is a fixed-to-fixed-length algorithm, acting like an approximator that rounds values to the nearest representative. A simple one-dimensional vector quantization example is shown below.

Here every number lying between 0 and 2 is represented as 1, and every number lying between -2 and 0 is represented as -1.

Although this is a simple example, it illustrates the idea of VQ very well. Moving from 1-D to 2-D, the same intuition carries over to vector quantization proper. [21]

Fig. 5.1.2: 2-D VQ

There are 16 regions and 16 red stars

Each red star is associated by a 4-bit number to represent the value of particular region. If a value lies in these regions, it will be represented by a 4-bit number.

These stars are called the code-vectors of the given regions, and the lines represent the boundaries of the encoding regions. [21]
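The encoding step described above can be sketched as follows (a toy Python example, not the project's MATLAB code; the codebook values are hypothetical and only 4 of the 16 code-vectors are shown):

```python
def nearest_codevector(x, codebook):
    """Return (index, codevector) minimizing the squared Euclidean
    distance between the input vector x and the codebook entries."""
    best = min(range(len(codebook)),
               key=lambda n: sum((xi - ci) ** 2 for xi, ci in zip(x, codebook[n])))
    return best, codebook[best]

# Four illustrative "red stars" (code-vectors) of a 2-D codebook
codebook = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]

idx, cv = nearest_codevector((0.9, 1.2), codebook)
print(idx, cv)  # the input falls in the encoding region of (1.0, 1.0)
```

The index `idx` plays the role of the 4-bit number: transmitting or storing the index instead of the full vector is what gives the compression.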

Vector Quantization in Speech Recognition

Speech applications involve large amounts of data, which require a huge amount of bandwidth for storage and for further processing. Vector quantization therefore plays an important role because it compresses the data efficiently. Vector quantization has long been used for quantization-based encoding and decoding, and its response time is very good for real-time processes, which is a very important factor in using vector quantization in speech processing.

In speech recognition, vector quantization is used to quantize the training sequence into codebook vectors. The training sequence is obtained from the feature-extraction stage of the recognition process. In speech recognition, the common method to generate the initial codebook before vector quantization is the LBG algorithm.

First we examine what the LBG algorithm is and why it is important in the speech recognition process.

5.2 LBG Algorithm

The LBG, or Linde-Buzo-Gray, algorithm is a very important tool in the speech recognition process. The LBG algorithm calculates the centroid for the first codebook of a training sequence. [22]

This figure shows two vectors v1 and v2 generated by adding a constant error to the initial codebook. The main idea is to compute the Euclidean distances between all the training vectors and then form clusters; each cluster is formed from the two nearest vectors. This procedure is repeated with every cluster formation. [22]

The LBG algorithm is an iterative process. The algorithm first requires an initial training set or initial codebook. This initial codebook is obtained from the earlier stages of the speech recognition process (LPC, LPCC).

The steps of the LBG algorithm are:

Create the initial codevector by averaging the entire training sequence generated by the speech recognition process

Split the codevector into two

Run the iterative refinement algorithm with the two codevectors

Generate the final codebook with the iterative algorithm

Split the final codevectors into four

Repeat the process until the desired number of codevectors is obtained
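The steps above can be sketched as follows (an illustrative Python sketch using scalar training data for brevity; the data values, splitting parameter, and iteration count are hypothetical, and the project itself uses MATLAB):

```python
def lbg(training, n_codevectors, eps=0.01, iters=20):
    """Split-and-refine LBG sketch for scalar training data."""
    # Step 1: initial codevector = mean of the whole training sequence
    codebook = [sum(training) / len(training)]
    while len(codebook) < n_codevectors:
        # Step 2: split every codevector into two perturbed copies
        codebook = [c * (1 + eps) for c in codebook] + \
                   [c * (1 - eps) for c in codebook]
        # Step 3: refine by nearest-neighbour assignment + centroid update
        for _ in range(iters):
            clusters = [[] for _ in codebook]
            for x in training:
                n = min(range(len(codebook)), key=lambda j: (x - codebook[j]) ** 2)
                clusters[n].append(x)
            codebook = [sum(cl) / len(cl) if cl else codebook[j]
                        for j, cl in enumerate(clusters)]
    return sorted(codebook)

data = [0.9, 1.1, 1.0, 4.9, 5.1, 5.0]   # toy training sequence, two clusters
final_codebook = lbg(data, 2)
print(final_codebook)
```

With this toy data the refinement converges to the two cluster centroids, near 1.0 and 5.0.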

The LBG algorithm is summarized below.

LBG Algorithm Design

Assume an initial training sequence consisting of M source vectors:

T = {x1, x2, ..., xM}

These training vectors contain all the statistical properties of the speech signal. Each source vector is assumed to be k-dimensional, e.g.

xm = (xm,1, xm,2, ..., xm,k), m = 1, 2, ..., M

Let N be the number of codevectors and let

C = {c1, c2, ..., cN}

represent the codebook. Each codevector is likewise k-dimensional:

cn = (cn,1, cn,2, ..., cn,k), n = 1, 2, ..., N

Let Sn be the encoding space (region) associated with codevector cn; the partition of the space is then defined as

P = {S1, S2, ..., SN}

If a source vector x lies in the encoding region Sn, its approximation is defined as

Q(x) = cn, if x is in Sn

and the average distortion is defined as

Dave = (1/(Mk)) * sum over m = 1, ..., M of ||xm - Q(xm)||^2

Now we define the nearest-neighbour criterion as

Sn = {x : ||x - cn||^2 <= ||x - cn'||^2 for all n' = 1, 2, ..., N}

This criterion says that the encoding region Sn should contain all the vectors that are closer to cn than to any other codevector. If a vector lies on the boundary between two regions, a decision-making or tie-breaking rule is applied. [23][24]

Implementation of the LBG algorithm: [21],[23],[24]

1. Initialization. Start with N = 1 and set the single codevector to the centroid of the entire training sequence:

c1* = (1/M) * sum over m = 1, ..., M of xm

Compute the initial distortion:

D* = (1/(Mk)) * sum over m = 1, ..., M of ||xm - c1*||^2

2. Splitting. For i = 1, 2, ..., N set

ci = (1 + e) * ci*

c(N+i) = (1 - e) * ci*

where e is a small splitting parameter. The number of codevectors is now doubled: N => 2N.

3. Iteration. Set the initial distortion D(0) = D* and the iteration index i = 0.

(a) For m = 1, 2, ..., M, find the value of n = 1, 2, ..., N that minimizes ||xm - cn||^2 and assign xm to the cluster of that codevector: Q(xm) = cn.

(b) Update the codevectors: set each cn to the centroid (average) of the training vectors assigned to its cluster.

(c) Set i = i + 1 and compute the distortion

D(i) = (1/(Mk)) * sum over m = 1, ..., M of ||xm - Q(xm)||^2

(d) If (D(i-1) - D(i)) / D(i-1) > e, repeat from step (a).

(e) Otherwise set D* = D(i) and cn* = cn for n = 1, 2, ..., N. This is the final codebook for the current value of N.

4. Repeat steps 2 and 3 until the desired number of codevectors is obtained.

The performance of vector quantization based on the LBG algorithm, in terms of the signal-to-distortion ratio (SDR), can be found from the following equation:

SDR = 10 * log10( ((1/(Mk)) * sum over m of ||xm||^2) / Dave ) dB
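A small numeric illustration of this measure, assuming SDR is the ratio of average signal power to average quantization distortion expressed in dB (the training data and codebook values here are hypothetical):

```python
import math

training = [0.9, 1.1, 1.0, 4.9, 5.1, 5.0]  # toy training sequence
codebook = [1.0, 5.0]                      # toy final-stage codebook

def quantize(x):
    """Map x to the nearest codevector (scalar case)."""
    return min(codebook, key=lambda c: (x - c) ** 2)

signal_power = sum(x * x for x in training) / len(training)
distortion = sum((x - quantize(x)) ** 2 for x in training) / len(training)
sdr_db = 10.0 * math.log10(signal_power / distortion)
print(round(sdr_db, 1))
```

A higher SDR means the codebook represents the training data with less distortion relative to the signal energy.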

Graphical Result

This graph shows the clusters and the centroids obtained by the LBG algorithm for the final-stage codebook.

2-Stage Vector Quantization (2-SVQ)

Multi-stage vector quantization (MSVQ) is used in speech processing instead of simple vector quantization. MSVQ gives better approximation and compression.

MSVQ is performed in these steps:

Apply VQ at each stage, quantizing the residual left by the previous stage

Add up the stage vectors (1, 2, ..., P) to form the quantized output

The MSVQ implemented in our technique has 2 stages; this was chosen for simplicity. The 2-stage VQ is shown in the figure below.

Fig. 5.2.3: 2-SVQ model.

The quantizer Q1 in the first stage uses the codebook Y = {y1, y2, ..., yn}, and the second-stage quantizer Q2 uses the codebook Z = {z1, z2, ..., zm}. The input vector x is quantized in the first stage as

Y = Q1(x)

where the Euclidean distance between x and the codevectors of Y is minimized:

d(x, yi) = ||x - yi||^2

The residual vector e1 is formed by

e1 = x - Q1(x)

e1 is then quantized with the codevectors of Z:

Z = Q2(e1)

Again, the Euclidean distance between e1 and the codevectors of Z is minimized:

d(e1, zj) = ||e1 - zj||^2

The input vector x is now quantized to

x^ = Q1(x) + Q2(e1)

The total error between the input vector x and the quantized vector x^ is

e2 = x - x^ = e1 - Q2(e1)

The 2-stage vector quantization finalizes our codebook. [25]
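The two-stage scheme can be sketched as follows (a toy scalar Python example; the codebooks Y and Z and the input value are hypothetical):

```python
def nearest(x, codebook):
    """Nearest codevector by absolute distance (scalar case)."""
    return min(codebook, key=lambda c: abs(x - c))

Y = [0.0, 2.0, 4.0]    # stage-1 codebook (coarse)
Z = [-0.5, 0.0, 0.5]   # stage-2 codebook for the residual (fine)

x = 2.7
y = nearest(x, Y)      # stage 1: y = Q1(x)
e1 = x - y             # residual e1 = x - Q1(x)
z = nearest(e1, Z)     # stage 2: z = Q2(e1)
x_hat = y + z          # reconstruction x^ = Q1(x) + Q2(e1)
e2 = x - x_hat         # total error e2 = e1 - Q2(e1)
print(x_hat, round(e2, 2))
```

The second stage refines the first: the total error e2 is smaller than the stage-1 residual e1, which is why the cascade improves approximation at low bit rates.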

5.2.5 Multi band 2-stage VQ (MB2-SVQ)

As described in the previous chapters, we use multiple bands of the speech signal in the recognition process. The multi-band representation contains:

Full band speech signal

Sub bands speech signal

So we apply 2-stage VQ to the multiple bands to enhance the speech recognition process, because it helps achieve low bit rates and low storage complexity. [25] The main speech recognition process using multi-band 2-stage VQ is divided into these steps:

Divide the input speech signal into L sub bands

Extract LPCC features from each sub band

Apply 2-stage VQ to each sub band

Combine the errors of all the 2-stage VQ outputs

Determine the total error

Make the recognition decision
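The error-combination and decision steps at the end of this pipeline can be sketched as follows (the speaker names and per-band error values are hypothetical):

```python
# Per-band quantization errors for each enrolled speaker:
# full band + 3 sub bands, one error value per band.
band_errors = {
    "speaker_A": [0.12, 0.08, 0.05, 0.03],
    "speaker_B": [0.40, 0.31, 0.22, 0.15],
}

# Combine the errors of all 2-stage VQ outputs per speaker
totals = {name: sum(errs) for name, errs in band_errors.items()}

# Decision: pick the speaker whose codebooks give the smallest total error
decision = min(totals, key=totals.get)
print(decision)
```

The speaker whose multi-band codebooks reconstruct the input with the least total error is declared the match.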

Chapter 6

Code and Results


Our main aim is to recognize a person in real time, so the code must compute the necessary quantities in a real-time simulation. This cannot be done in one single program, as that slows down the process. To accomplish the task we needed code that generates the database and then passes it to the processing unit for further processing, i.e. codebook generation and quantization. The audio file format used here is ".wav", and the maximum length of an audio file is not more than 3 seconds.

This has been done via function calls in MATLAB. The whole task was divided into portions according to their role in the project. A ".m" file for each function is created in the same directory, named "FYP". The functions that were created are:

Function used to generate database of users, called "database-generator".

Function used to call the processing, called "processor".

Function used to calculate LBG, called "lbgcalculator".

Function used to quantize the vectors, called "vecq".

Function which runs in real time and makes the recognition decision is called "recognizer".


As the name depicts, this function generates the database using the other three passive functions: processor, lbgcalculator and vecq. This function's schematic diagram is as under:

(Schematic diagram: from the input audio signal, through DWT sub-band decomposition and processing, to the generated database.)

It is quite clear from the diagram how the function works from the beginning until the data is generated. Inside the database-generator code, an operation called the DWT is performed to create the small sub bands of the voice, which are treated in the same manner as the full-band signal.

The following is the MATLAB code for database-generator. It is a ".m" file and works only when run in MATLAB.

function [GVQ]=databasegenerator

%% reading wav file

[sound1, fs1] = wavread('filename', nSamples); % nSamples: number of samples required

%% for further detail about command "wavread" see MATLAB Help.

%% Adding AWGN noise to remove the unvoiced sections of the voice sample.
%% (reconstructed line; the SNR value of 10 dB is assumed)

awsound = awgn (sound1, 10);


%% Applying the DWT; each level halves the signal band.
%% (The wavelet name 'db4' is assumed here; MATLAB's dwt expects a
%% wavelet name or filter pair, not band edges.)

[lowc, highc] = dwt (awsound, 'db4'); % approx. 0-2000 Hz

[lowc2, highc2] = dwt (lowc, 'db4'); % approx. 0-1000 Hz

[lowc3, highc3] = dwt (lowc2, 'db4'); % approx. 0-500 Hz

%% Calling processor

VQ1=processor (awsound);

VQ2=processor (lowc);

VQ3=processor (lowc2);

VQ4=processor (lowc3);

%% Combining the quantized vectors of all band by matrix concatenation

VQ= [VQ1 VQ2 VQ3 VQ4];

%% Finding and replacing any NaN or Infinity (+ or -) by 0 in VQ to avoid
%% complexity.

i = find (isnan (VQ));

VQ (i) = 0;

GVQ = VQ (isfinite (VQ)); % "finite" is obsolete; use isfinite

If the number of voice samples is greater than 1, then multiple copies of this code are run within a single script to generate a larger database.


As the name depicts, this function runs both passively and in real time. The function processor works in almost the same way as databasegenerator; the slight difference is that it has one step fewer. Its schematic diagram is as under. The processor algorithm first calculates the LPC, which is the basic calculation required for implementing the speech recognition process (when the LPCC technique is used for feature extraction). It then transforms the coefficients generated by the LPC calculation into cepstral coefficients (CC).

It then calls lbgcalculator to generate the final codebook, which is passed to the next process, i.e. vector quantization (VQ), by calling the function named vecq. The function vecq is called twice in this process, as we are using 2-stage vector quantization (the reason for this was described earlier).

This function works in the same way as databasegenerator does. The only difference is that it runs in real time and there is an extra unit called comparison and decision making, which compares the results and displays them.