Voice Identification And Authentication Program Computer Science Essay



The human voice consists of sound made by humans using the vocal folds for talking, singing, laughing, crying, screaming, etc. The vocal folds (vocal cords) are a vibrating valve that chops up the airflow from the lungs into audible pulses that form the laryngeal sound source. The muscles of the larynx adjust the length and tension of the vocal folds to 'fine tune' pitch and tone. The articulators (the parts of the vocal tract above the larynx, consisting of the tongue, palate, cheeks, lips, etc.) articulate and filter the sound emanating from the larynx and to some degree can interact with the laryngeal airflow to strengthen or weaken it as a sound source. The vocal folds, in combination with the articulators, are capable of producing highly intricate arrays of sound. The tone of voice may be modulated to suggest emotions such as anger, surprise, or happiness. Singers, for example, use the human voice as an instrument for creating music.

The term 'voice recognition' is sometimes used to refer to speech recognition where the recognition system is trained to a particular speaker, as is the case for most desktop recognition software on the market right now. Such systems therefore contain an element of speaker recognition, which attempts to identify the person speaking, in order to better recognize what is being said. 'Speech recognition', on the other hand, is a broader term for systems that can recognize almost anybody's speech, such as a call-centre system designed to recognize many voices. Voice recognition, then, is a system trained to a particular user, which recognizes their speech based on their unique vocal sound. A voice recognition 'voiceprint' is a 'spectrogram': a graph that shows a sound's frequency on the vertical axis and time on the horizontal axis. Different speech creates different shapes on the graph. Spectrograms also use colour or shades of grey to represent the acoustical qualities of sound.
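
To make the idea of a spectrogram concrete, here is a minimal sketch in Python, using only the standard library. It is not part of the project itself: the function names, the frame size of 64 samples and the 8000 Hz test tone are all assumptions chosen for illustration. It cuts a signal into short overlapping frames and takes a Fourier transform of each, giving a grid with time along one axis and frequency along the other.

```python
import cmath
import math

SAMPLE_RATE = 8000   # samples per second (assumed for this example)
FRAME_SIZE = 64      # samples per analysis frame (assumed)


def dft_magnitudes(frame):
    """Magnitudes of the lower half of the discrete Fourier transform."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2)
    ]


def spectrogram(samples, frame_size=FRAME_SIZE, hop=FRAME_SIZE // 2):
    """Return one row per time step and one column per frequency bin."""
    rows = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        # A Hann window softens the frame edges before the transform.
        windowed = [
            s * (0.5 - 0.5 * math.cos(2 * math.pi * i / (frame_size - 1)))
            for i, s in enumerate(frame)
        ]
        rows.append(dft_magnitudes(windowed))
    return rows


# A pure 1000 Hz tone stands in for a recorded voice.
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(512)]
spec = spectrogram(tone)

# The strongest frequency bin of the first frame, converted back to Hz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_SIZE
```

For the test tone, the strongest bin of every frame sits at 1000 Hz. Real speech would instead show several bands of energy that shift over time, which is what gives each utterance its own shape on the graph.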

All of our voices are unique (including those of twins) and cannot be exactly duplicated. Speech is made up of two components: a physiological component (the vocal tract) and a behavioural component (the accent). The given project, as agreed with the Project Board, will focus on comparing pre-recorded English words/sentences spoken by a native English speaker with those spoken by non-native English speakers. There will be three pre-recorded sentences, the same sentence in each case, spoken by three different speakers; one speaker will be someone who was born and raised locally and speaks with the local accent, whereas the other two will be non-native English speakers. My program will, using algorithms and/or other recent techniques available, compare all three recordings and then output how accurately each of the other two sentences matches the native speaker's.

1.1 Initial Plan/Project Proposal

Initially the plan was to create an automated voice recognition customer service system. The system would have consisted of a simple computer with a telephone input system and a playback function to answer queries via the telephone. When engaged, the system would convert the customer's spoken query into digital data (bytes) and then into words, in order to find the answer to the question at hand. After finding the answer, the computer would convert it back into English, this time spoken English. This would have been a spectacular project to develop, but unfortunately it would also have been massive and tedious; too much for one person to develop in the given time period. It was therefore decided that the project would be downsized to a voice comparison system only.

1.2 Objectives Redefined

Since the acceptance of the project proposal, a number of questions have been on my mind. The new objective, as agreed with the Project Board, was to analyze three pre-recorded, everyday English sentences. The good thing about this project was that the Project Board had decided that the developer would be free to research any current software on the market, and any plausible method, in order to find the most suitable way to accurately analyze sound files. This meant that a number of doors were open to research.

The main question was: how to approach the given Project? For this purpose, the only solution was to do some thorough research that would cover everything related to the given project.

2.0 My Research:

My research focused on four key areas:

How are different human voices generated and recognised by humans? How will a computer distinguish different voice patterns?

Has anyone created something like this before? If yes, were they successful? If yes, to what extent?

Would the given project be finished on time, based on my current skills?

What has already been done, in terms of voice recognition/comparison/authentication? And what is currently being done/used?

During my research I found that speech recognition applications cover a vast range in the current market. They include voice dialling, call routing, appliance control and content-based spoken audio search (e.g. calling an automated customer query system where particular words are spoken), simple data entry (e.g. entering a credit card number), preparation of structured documents (e.g. a radiology report), speech-to-text processing (e.g. word processors, emails or current operating systems) and use in aircraft cockpits (usually termed Direct Voice Input). They are also just starting to appear in mobile phones (e.g. the new iPhone can be trained to perform voice-operated functions).

The performance of a speech recognition system is usually specified in terms of accuracy and speed. Accuracy is usually rated with the word error rate (WER), whereas speed is measured with the real time factor. Other measures of accuracy include Single Word Error Rate (SWER) and Command Success Rate (CSR). Most people who have used speech recognition systems would tend to agree that dictation machines can achieve very high performance in controlled conditions. Commercially available speaker-dependent dictation systems usually require only a short period of training (sometimes also called 'an enrolment period') and may successfully capture continuous speech with a large vocabulary at normal pace with very high accuracy. Most commercial companies claim that recognition software can achieve between 98% and 99% accuracy if operated under optimal conditions, namely:

The users are trained properly

The System is given proper time to achieve proper speaker adaptation

The System is used in a noise-free environment

This explains why some users, especially those whose speech is heavily accented, might achieve recognition rates much lower than expected.
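
As a rough illustration of the word error rate mentioned above, the sketch below counts the fewest word substitutions, deletions and insertions needed to turn the recognised text into the reference text, and divides by the number of reference words. The function name and the example sentences are hypothetical; only the definition of WER itself is standard.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / words in reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # dist[i][j] = fewest edits turning the first i reference words
    # into the first j hypothesis words (word-level edit distance).
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # word deleted
                dist[i][j - 1] + 1,                 # word inserted
                dist[i - 1][j - 1] + substitution,  # word kept or replaced
            )
    return dist[len(ref)][len(hyp)] / len(ref)


# One word misheard ("sat" as "sit") and one word dropped ("the").
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Here two errors against six reference words give a WER of about 0.33, i.e. roughly 67% word accuracy, far below the 98% to 99% quoted for optimal conditions.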

2.1 Acoustic Modelling

An acoustic model is created by taking audio recordings of speech, together with their text transcriptions, and using software to create statistical representations of the sounds that make up each word. Acoustic models are used by speech recognition engines to recognize speech. Both acoustic modelling and language modelling are important parts of modern statistically-based speech recognition algorithms.

I also found that audio can be encoded at different sampling rates (i.e. samples per second) and with different numbers of bits per sample (the most common being 8, 16 or 32 bits). Speech recognition engines work best if the acoustic model they use was trained with speech audio recorded at the same sampling rate and bits per sample as the speech being recognized.
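
A simple way to check that two recordings share the same sampling rate and bits per sample is to read the header of each WAV file. The sketch below uses Python's standard `wave` module. To stay self-contained it builds two short WAV files in memory instead of reading real recordings; the helper names `make_wav_bytes` and `describe` are made up for this example.

```python
import io
import math
import struct
import wave


def make_wav_bytes(sample_rate, seconds=0.1, freq=440.0):
    """Build a mono, 16-bit WAV file in memory (a stand-in for a recording)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)            # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        frames = b"".join(
            struct.pack("<h", int(16000 * math.sin(2 * math.pi * freq * t / sample_rate)))
            for t in range(int(sample_rate * seconds))
        )
        wav.writeframes(frames)
    return buffer.getvalue()


def describe(wav_bytes):
    """Return (sampling rate, bits per sample, channels) from the WAV header."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getframerate(), wav.getsampwidth() * 8, wav.getnchannels()


native = make_wav_bytes(16000)
other = make_wav_bytes(8000)

# Files should only be compared directly when their formats agree.
formats_match = describe(native) == describe(other)
```

With real files, `wave.open` can be given a file path instead of an in-memory buffer. A mismatch such as the one above would mean resampling one recording before any comparison.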

3.0 Current Audio Models Being Used

3.1 Hidden Markov model (HMM)1

Modern general-purpose speech recognition systems are generally based on Hidden Markov Models. These are statistical models which output a sequence of symbols or quantities. One reason why HMMs are used in speech recognition is that a speech signal can be viewed as a piecewise stationary or short-time stationary signal. That is, over a short time span in the range of 10 milliseconds, speech can be approximated as a stationary process. For many stochastic purposes, speech can thus be thought of as a Markov model.

Another reason why HMMs are popular is that they can be trained automatically and are simple and computationally feasible to use. In speech recognition, the hidden Markov model would output a sequence of n-dimensional real-valued vectors (with n being a small integer, such as 10), outputting one of these every 10 milliseconds. The vectors would consist of cepstral coefficients, which are obtained by taking a Fourier transform of a short time window of speech and de-correlating the spectrum using a cosine transform, then taking the first (most significant) coefficients. The hidden Markov model will tend to have in each state a statistical distribution that is a mixture of diagonal covariance Gaussians, which will give a likelihood for each observed vector. Each word (or, for more general speech recognition systems, each phoneme) will have a different output distribution; a hidden Markov model for a sequence of words or phonemes is made by concatenating the individual trained hidden Markov models for the separate words and phonemes.
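
The cepstral coefficients described above can be sketched in a few lines of Python. This is a simplified illustration, not a production feature extractor. It takes the logarithm of the magnitude spectrum before the cosine transform (the usual step in cepstral analysis) and omits the mel-scale filter bank that practical systems normally add. The function name, the Hamming window and the 64-sample frame are assumptions made for the example.

```python
import cmath
import math


def cepstral_coefficients(frame, n_coeffs=10):
    """Fourier transform -> log magnitude -> cosine transform -> first n values."""
    n = len(frame)
    # A Hamming window reduces edge effects in the short frame.
    windowed = [
        s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
        for i, s in enumerate(frame)
    ]
    half = n // 2
    log_spectrum = []
    for k in range(half):
        x = sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        log_spectrum.append(math.log(abs(x) + 1e-10))  # small floor avoids log(0)
    # DCT-II de-correlates the log spectrum; keep only the first coefficients.
    return [
        sum(log_spectrum[m] * math.cos(math.pi * q * (m + 0.5) / half) for m in range(half))
        for q in range(n_coeffs)
    ]


SAMPLE_RATE = 8000  # assumed for this example
low_tone = [math.sin(2 * math.pi * 500 * t / SAMPLE_RATE) for t in range(64)]
high_tone = [math.sin(2 * math.pi * 1500 * t / SAMPLE_RATE) for t in range(64)]

low_features = cepstral_coefficients(low_tone)
high_features = cepstral_coefficients(high_tone)
```

Each frame is reduced to a vector of ten numbers, matching the "n-dimensional real-valued vectors" in the text. Frames with different spectral content produce different vectors, and it is these vectors that the HMM models.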

Modern speech recognition systems use various combinations of a number of standard techniques in order to improve results over the basic approach described above. A typical large-vocabulary system would need context dependency for the phonemes (so phonemes with different left and right context have different realizations as HMM states); it would use cepstral normalization to normalize for different speaker and recording conditions; for further speaker normalization it might use vocal tract length normalization (VTLN) for male-female normalization and maximum likelihood linear regression (MLLR) for more general speaker adaptation. The features would have so-called delta and delta-delta coefficients to capture speech dynamics and in addition might use heteroscedastic linear discriminant analysis (HLDA); or might skip the delta and delta-delta coefficients and use splicing and an LDA-based projection followed perhaps by heteroscedastic linear discriminant analysis or a global semitied covariance transform (also known as maximum likelihood linear transform, or MLLT). Many systems use so-called discriminative training techniques which dispense with a purely statistical approach to HMM parameter estimation and instead optimize some classification-related measure of the training data. Examples are maximum mutual information (MMI), minimum classification error (MCE) and minimum phone error (MPE).

Decoding of the speech (the term for what happens when the system is presented with a new utterance and must compute the most likely source sentence) would probably use the Viterbi algorithm to find the best path, and here there is a choice between dynamically creating a combination hidden Markov model, which includes both the acoustic and language model information, and combining it statically beforehand (the finite state transducer, or FST, approach).
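
The Viterbi search for the best path can be illustrated with a toy two-state model. Everything in the example is invented for illustration: the state names ("silence", "voiced"), the observation labels and all probabilities. A real recogniser would work with many more states, continuous feature vectors and log-probabilities to avoid numerical underflow.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence and its probability."""
    # best[t][s] = probability of the best path that ends in state s at time t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Pick the predecessor that gives the highest score for state s.
            prev = max(states, key=lambda p: best[-1][p] * trans_p[p][s])
            scores[s] = best[-1][prev] * trans_p[prev][s] * emit_p[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)

    # Trace the best final state back to the start.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]


states = ("silence", "voiced")
start_p = {"silence": 0.6, "voiced": 0.4}
trans_p = {
    "silence": {"silence": 0.7, "voiced": 0.3},
    "voiced": {"silence": 0.4, "voiced": 0.6},
}
emit_p = {
    "silence": {"low": 0.5, "mid": 0.4, "high": 0.1},
    "voiced": {"low": 0.1, "mid": 0.3, "high": 0.6},
}

# Observed energy level of three successive frames.
path, probability = viterbi(["low", "mid", "high"], states, start_p, trans_p, emit_p)
```

For this input the best path is silence, silence, voiced, with probability 0.01512. The algorithm reaches it without enumerating every possible state sequence, which is what makes decoding feasible for long utterances.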

3.1.0 Java Implementation of HMM Model

Jahmm (pronounced "jam"), is a Java implementation of Hidden Markov Model (HMM) related algorithms. It's been designed to be easy to use and general purpose. This library is reasonably efficient, meaning that the complexity of the implementation of the algorithms involved is that given by the theory. However, when a choice must be made between code readability and efficiency, readability has been chosen. It is thus ideal in research (because algorithms can easily be modified) and as an academic tool. It gives an implementation of the Viterbi, Forward-Backward, Baum-Welch and K-Means algorithms, among others.

Image 1 - The Results of a HMM Model 1

Hidden Markov models can also be applied in temporal pattern recognition such as speech, handwriting, gesture recognition, part-of-speech tagging, musical score following, partial discharges and bioinformatics.

3.2 Dynamic time warping (DTW)-based speech recognition2

Dynamic time warping is an approach that was historically used for speech recognition but has now largely been displaced by the more successful HMM-based approach. Dynamic time warping is an algorithm for measuring similarity between two sequences which may vary in time or speed. For instance, similarities in walking patterns would be detected, even if in one video the person was walking slowly and if in another they were walking more quickly, or even if there were accelerations and decelerations during the course of one observation. DTW has been applied to video, audio, and graphics - indeed, any data which can be turned into a linear representation can be analyzed with DTW. In general, it is a method that allows a computer to find an optimal match between two given sequences (e.g. time series) with certain restrictions, i.e. the sequences are "warped" non-linearly to match each other. This sequence alignment method is often used in the context of hidden Markov models.
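
A minimal sketch of dynamic time warping is given below, assuming each recording has already been reduced to a one-dimensional sequence of numbers (for example one loudness or pitch value per frame). The function name and the sample sequences are hypothetical. This is the unconstrained textbook form, without the window or slope restrictions that practical implementations often add.

```python
def dtw_distance(a, b):
    """Cumulative cost of the best non-linear alignment of sequences a and b."""
    inf = float("inf")
    # cost[i][j] = best cost of aligning the first i items of a
    # with the first j items of b.
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances while b stays (b is stretched)
                cost[i][j - 1],      # b advances while a stays (a is stretched)
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[len(a)][len(b)]


slow = [0, 0, 1, 1, 2, 2, 1, 1, 0, 0]   # a contour spoken slowly
fast = [0, 1, 2, 1, 0]                  # the same contour spoken twice as fast
different = [0, 2, 0, 2, 0]             # a different contour

same_shape = dtw_distance(slow, fast)
other_shape = dtw_distance(slow, different)
```

The slow and fast versions of the same contour align with a distance of zero despite their different lengths, while the different contour gives a positive distance. This tolerance to speaking rate is what made DTW attractive for comparing spoken words.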

Image 2 - The Results of a DTW Model 1

3.3 Other Engines

Apart from the HMM and DTW approaches, several companies have come up with their own range of speech recognition systems/software. These include:

Microsoft Corporation (Microsoft Voice Command)

Nuance Communications (Nuance Voice Control)

Vito Technology (VITO Voice2Go)

Speereo Software (Speereo Voice Translator)

and SVOX.

With Microsoft implementing its TAPI and SAPI programming interfaces in its latest operating systems, such as Windows Vista and Windows 7, speech recognition is quickly becoming an increasingly prominent feature of modern operating systems, leading more and more users to rely on speech recognition systems.

Project Approach

The project approach covers the methodologies considered for developing the given project. As with every project, there are many options available at the beginning, but once the decision to use a specific methodology is finalized, there is no turning back and one has to stick with that decision wholeheartedly. The same was true at the beginning of this project. Many options were at my disposal as the developer, out of which the most suitable and appropriate approach was selected after numerous meetings with the supervisor.

Option 1 Currently there are two options that I consider important. One is to focus on Hidden Markov Models, which work on features obtained via the Fourier transform/series, and the other is to research the Microsoft (MSDN) library online for Component Object Model (COM) components.

If the first option is finalized with my supervisor, the system would be based entirely on algorithms that measure the bytes of each of the three given sound files. By calculating the frequency content (highs and lows) of each entire sound file, I'll be able to get the sound file into a mathematical form, which can then be compared with the rest of the sound files (which will also be converted into numerical, binary form for comparison).

Option 2 The other option, if finalized, would mean looking into the existing APIs in the Microsoft Developer Network (MSDN) Library and incorporating them to read the given three sound files. Once the files are loaded, the API would use its algorithms to match patterns and compare the two non-native English sentences with the native one. The output can then be displayed on a bar/oscilloscope chart, with the percentage accuracy of the non-native sentences, when compared to the native one, displayed in numbers as well.

Another Option? Another option, a simpler and less accurate one, would be to read in every letter of every word in the given sentences and convert them to ASCII codes. Once in ASCII, they can easily be converted to binary, and once in binary, they can easily be compared against each other. But this would be the least accurate option, as no signal-processing algorithms would be involved and the whole system would be based on the pre-fed binary for each word/letter.
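
To show what this last option amounts to in practice, here is a small sketch that converts text to 8-bit ASCII codes and compares the bits. The function names and the bit-matching score are assumptions made for the example, and the input is assumed to be plain ASCII text.

```python
def to_binary(text):
    """Convert each character to its 8-bit ASCII code."""
    return [format(ord(ch), "08b") for ch in text]


def bit_similarity(a, b):
    """Fraction of bit positions that match between two pieces of text."""
    bits_a = "".join(to_binary(a))
    bits_b = "".join(to_binary(b))
    length = max(len(bits_a), len(bits_b))
    if length == 0:
        return 1.0
    # Bits beyond the end of the shorter text count as mismatches.
    matches = sum(x == y for x, y in zip(bits_a, bits_b))
    return matches / length


codes = to_binary("Hi")                 # ['01001000', '01101001']
score = bit_similarity("cat", "cut")    # 22 of 24 bits match
```

The sketch makes the weakness of this option visible: it compares the written text only, so two speakers reading the same sentence would always score 100% regardless of how they actually sound.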

Time Scale:

The timescale of the Project, from beginning to end, is 6 months. I submitted my Final/Revised Project Proposal on 19th October 2010, and the Project is due by 30th April. This gives me another one and a half months to develop and test the system.

Target Audience & Benefits

As this is not a complete software package, there is no direct target audience involved. But if the Project is successful, my research and work can be re-used and built upon to make a completely automated voice recognition system that could record queries, understand each word said to it, find the solution to the query and read it back to the user.

The following is a list of uses of Voice/Speech Recognition Technology:

Cost cutting - Cost savings for organizations that have telephone-based customer services/call centres have to top the list of advantages of a Speech Recognition System.

Health care - There is a wide range of uses for Speech Recognition Technology in the health care sector. Speech recognition can be used to enable deaf people to understand spoken English via speech-to-text technology. It can also be used in medical documentation. Apart from that, people with speech disabilities, especially children born with this kind of disorder, can be taught how to speak properly using this technology.

Air Traffic Control - Can be used in Air Traffic Control towers, where operators may become fatigued during long shifts; speech recognition, if implemented, can provide a fallback in case the operator makes an error. The pilot could communicate directly with the computer managing Air Traffic Control.

In Banks and in Call centres - As mentioned previously, it can be, and already is being, used in automated telephony services. Local examples are MBNA and Barclays. When a customer rings customer services, he/she is asked to say their card number. After personally using this system, I believe this is a remarkable achievement for mankind.

On the Battlefield - In the distant future, speech recognition could be used on the battlefield to let the Army command droids, drones etc. remotely.

Means of Authentication - Can also be used as a form of authentication. For example, nowadays we see many retina and fingerprint scanners; in the same way, voice can be used to authorise access to certain instruments, computers, safes/vaults, buildings etc.


When one thinks of a Speech Recognition System, one can easily forget the negative aspects, though there aren't many. The only one that I can think of, looking at the present gloomy financial crisis surrounding the world, is that a lot of Customer Service Operators would lose their jobs if this system were implemented in all workplaces. Organizations would rather spend on acquiring a Speech Recognition Call Answering System, which can answer many phone/customer queries at once, than on hiring staff.

Sound Processing Languages/Software

The core emphasis in this project would be on using mathematical calculations to measure the intensity of each sound file at every second, giving a steady series of readings which can then be compared with those of the other sound files. The Fourier transform and the Hidden Markov Model are not programming languages, but they can be used in conjunction with major programming languages like C++ and Visual Basic. C++ and other major programming languages have libraries that support the algorithms used in Fourier transform/HMM processing.
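
One possible way to take an intensity reading for every second of a sound file is the root-mean-square (RMS) level of each one-second block of samples. The sketch below is only an illustration of that idea, not the final design. The function name, the toy sampling rate of 100 samples per second and the synthetic signal are all assumptions.

```python
import math


def rms_per_second(samples, sample_rate):
    """Root-mean-square level of each one-second block of samples."""
    levels = []
    for start in range(0, len(samples), sample_rate):
        block = samples[start:start + sample_rate]
        levels.append(math.sqrt(sum(s * s for s in block) / len(block)))
    return levels


SAMPLE_RATE = 100  # toy rate, kept small so the example runs instantly

# One quiet second (amplitude 0.1) followed by one loud second (amplitude 0.8).
quiet = [0.1 * math.sin(2 * math.pi * 5 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE)]
loud = [0.8 * math.sin(2 * math.pi * 5 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE)]

levels = rms_per_second(quiet + loud, SAMPLE_RATE)
```

The two readings come out at roughly 0.071 and 0.566, i.e. each amplitude divided by the square root of two. Running the same function over each of the three recordings would give per-second series that can be compared directly.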

The Project So Far

So far the project has mostly revolved around research, and actual development has yet to start. Research is the core aspect of any project, especially this one, because when one has to develop a complex system (such as the system at hand, a Speech Recognition System) one can't re-invent the wheel, as that would be too tedious and time consuming. Instead, it is more convenient to research the systems already created or currently in use, and then modify existing APIs/algorithms to fit the given project.

So far, I've learnt what the core components of a Speech Recognition System are and how they can be incorporated to work as a software package that would 'read' the desired sound files and then display the accuracy for each one of them.

The following is a rough outline of what the developed program would look like; please note that this is just a rough sketch:

Minor Details to Consider Before the Development Stage

To Record the Sound samples in a neutral environment, free from distortion

To make sure while recording that the input for all the three files is the same

Initially, to consider focusing on voices of the same gender (as male and female voice pitches differ considerably and might confuse the Program)

Functional & Non Functional Requirements

Every project has a list of requirements that define what the project ultimately does. These requirements, which describe the functionality of a system, are broken down into functional and non-functional requirements. Functional requirements normally include hardware requirements, activities and services such as inputs, outputs and the various processes that the system requires to function. Non-functional requirements include features and characteristics in terms of the performance of a system; in the case of the given project this would mean the accuracy with which the system identifies and compares different pre-recorded voice samples.

Functional Requirements:

Since the Project would focus on just analyzing and comparing three given sound files, there aren't many functional requirements. Functional Requirements for the System include:

A simple Pentium 3 computer running any version of Windows Operating System after 1998.

A sound input device, to record the Sound Samples. A simple microphone would do.

Windows Sound Recorder (which comes built-in with the Operating System)

A Sound output device, such as Speakers

A monitor obviously

Other Requirements, if any, would be added as the development part progresses.

Non Functional Requirements

Project Schedule The total time for researching, developing and implementing the aforementioned Speech Recognition system is 7 months. It would be handy to have the first working prototype ready by the start of April, a month before the hand-in date for the given project.

System Performance This system is entirely based on its correct identification of sound patterns, their comparison and then correctly displaying a measure of their accuracy.

Test Plan

Current Literature Review

Currently I am focusing on Hidden Markov Models and their implementation, and on the Fourier transform (which is used to extract the features that HMMs model). The Web is full of information about HMM and Fourier transform algorithms; apart from the Web, the following books are included in the current literature review:

Data Analysis Using Regression and Multilevel/Hierarchical Models by Andrew Gelman and Jennifer Hill

Markov Models for Pattern Recognition: From Theory to Applications by Gernot A. Fink


