A Survey Of Voice Recognition System Computer Science Essay


A voice-recognition system allows a computer to translate voice requests and dictation into text. The most obvious use for voice-recognition systems is to allow the physically disabled to communicate with the computer, but other types of users, such as self-employed people and senior managers without keyboard skills, are seen as an important market. The three main voice recognition systems currently available are DragonDictate for Windows, the IBM VoiceType system and the Philips dictation system. Because voice recognition relies on phonology, linguistics, signal processing, statistics, computer science, acoustics, connectionist networks, psychology and other fields, many technologies are involved in it. In this paper we survey the history, working, components, types, uses, weaknesses and flaws of voice recognition systems, as well as advancements in the technology. As research continues, voice recognition may become speech understanding in the near future.

Keywords: Noise, Recognition, software, speech.


Voice recognition is "the technology by which sounds, words or phrases spoken by humans are converted into electrical signals, and these signals are transformed into coding patterns to which meaning has been assigned". While the concept could more generally be called "sound or speech recognition", we focus here on the human voice because we most often and most naturally use our voices to communicate our ideas to others in our immediate surroundings. In the context of a virtual environment, the user would presumably gain the greatest feeling of immersion, or being part of the simulation, if they could use their most common form of communication, the voice. The difficulty in using voice as an input to a computer simulation lies in the fundamental differences between human speech and the more traditional forms of computer input. While computer programs are commonly designed to produce a precise and well-defined response upon receiving the proper (and equally precise) input, the human voice and spoken words are anything but precise. Each human voice is different, and identical words can have different meanings if spoken with different inflections or in different contexts. Several approaches have been tried, with varying degrees of success, to overcome these difficulties.


Voice Recognition Systems was founded in 1994 to provide speech recognition systems to industries that utilize transcription departments for record keeping. Voice automated medical transcription is one of the hottest technologies today for physicians and support staff to properly document patient encounters and notes. 90% of Voice Recognition Systems' business comes from medical related fields. Most physicians and hospitals have either heard that this technology can produce large amounts of text, or are so far behind in their transcription that they are under pressure to catch up or stay caught up.

Voice Recognition Systems caters to the professional who wants to produce large amounts of text and documentation in a minimal amount of time.

What is Voice Recognition?

Voice recognition is the field of computer science that deals with designing computer systems that can recognize spoken words. In other words, it is the process of converting an acoustic signal, captured by a microphone or a telephone, to a set of words. Note that voice recognition implies only that the computer can take dictation, not that it understands what is being said. Comprehending human languages falls under a different field of computer science called natural language processing. The recognized words can be the final results, as for applications such as command and control, data entry, and document preparation. They can also serve as the input to further linguistic processing in order to achieve speech understanding.

A number of voice recognition systems are available on the market. The most powerful can recognize thousands of words. However, they generally require an extended training session during which the computer system becomes accustomed to a particular voice and accent. Such systems are said to be speaker dependent.

Many systems also require that the speaker speak slowly and distinctly and separate each word with a short pause. These systems are called discrete speech systems. Recently, great strides have been made in continuous speech systems -- voice recognition systems that allow you to speak naturally. There are now several continuous-speech systems available for personal computers.

Because of their limitations and high cost, voice recognition systems have traditionally been used only in a few specialized situations. For example, such systems are useful in instances when the user is unable to use a keyboard to enter data because his or her hands are occupied or disabled. Instead of typing commands, the user can simply speak into a headset. Increasingly, however, as the cost decreases and performance improves, voice recognition systems are entering the mainstream and are being used as an alternative to keyboards.

Voice recognition systems can be characterized by many parameters, some of the more important of which are shown in Table 1:

Table 1: Typical parameters used to characterize the capability of voice recognition systems

An isolated-word voice recognition system requires that the speaker pause briefly between words, whereas a continuous voice recognition system does not. Spontaneous, or extemporaneously generated, speech contains disfluencies, and is much more difficult to recognize than speech read from a script. Some systems require speaker enrollment, in which a user must provide samples of his or her speech before using them, whereas other systems are said to be speaker-independent, in that no enrollment is necessary. Some of the other parameters depend on the specific task. Recognition is generally more difficult when vocabularies are large or have many similar-sounding words. When speech is produced in a sequence of words, language models or artificial grammars are used to restrict the combination of words.

Recommended Requirements

A user requires a computer with no less than a Core2Duo processor, 2 gigabytes of RAM and a certified sound card (SoundBlaster is the industry default). As long as these three things are present, we can be assured that NaturallySpeaking will work reasonably well, provided that the user takes the time to build a customized voice file and learn the basic functions of the system. By that we mean learning how to dictate properly to the system, how to navigate around a document and how to correct mistakes as they occur.

The two basic rules of voice recognition are these:

The faster the processor, the faster you can dictate!

The more memory (RAM) that you have, the more accurate that dictation will be!

This applies to any and all brands of voice recognition solutions!

How Voice Recognition System Works

The figure shows a block diagram of the voice recognition aspects of this system used together with noise and echo reduction technology.

Fig 1:- Working of voice recognition system

The voice of the driver is picked up by the communications microphone and is first processed by the block labelled RNF (Referenced Noise Filter), a type of echo canceller. This block has a direct feed from the music system, so that this background noise can be reduced. If the vehicle is an emergency services vehicle, this feed might well come from the siren.

The RNF block may also have a feed from the actual Automatic Speech Recognition (ASR) module itself. This is so that if the ASR system is talking to the driver, the driver can speak over the automated voice and still be understood. The RNF technology ensures that the ASR hears the driver's voice, but not the sound of its own voice coming out of the loudspeakers. This enables an important feature of interactive speech recognition systems: the ability to choose an item from a menu without waiting until the system has listed all possibilities. This is called "barge-in".
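The internals of the RNF block are not described here, so the following is only a minimal sketch of one common way to remove a known reference signal from a microphone feed: a least-mean-squares (LMS) adaptive filter. The function names, the filter length, the step size and the synthetic signals in `demo` are all illustrative assumptions, not details of the RNF technology.

```python
import math
import random


def lms_cancel(primary, reference, taps=8, mu=0.01):
    """Remove an adaptively filtered copy of `reference` from `primary`.

    primary   -- microphone samples (speech plus leaked music or siren)
    reference -- direct feed of the interfering source
    Returns the error signal, which approximates the speech once the
    filter weights have converged.
    """
    weights = [0.0] * taps
    history = [0.0] * taps  # recent reference samples, newest first
    cleaned = []
    for d, x in zip(primary, reference):
        history = [x] + history[:-1]
        estimate = sum(w * h for w, h in zip(weights, history))
        error = d - estimate  # what is left after removing the estimated leak
        # LMS update: nudge each weight along the instantaneous gradient.
        weights = [w + 2 * mu * error * h for w, h in zip(weights, history)]
        cleaned.append(error)
    return cleaned


def mean_power(samples):
    """Average squared amplitude of a list of samples."""
    return sum(s * s for s in samples) / len(samples)


def demo(n=6000, seed=0):
    """Synthetic check: returns (noise power before, noise power after)."""
    rng = random.Random(seed)
    speech = [0.5 * math.sin(2 * math.pi * 0.01 * i) for i in range(n)]
    music = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    # Hypothetical acoustic path: the music reaches the microphone
    # attenuated, together with a weaker one-sample echo.
    leak = [0.8 * music[i] + (0.3 * music[i - 1] if i else 0.0) for i in range(n)]
    mic = [s + l for s, l in zip(speech, leak)]
    cleaned = lms_cancel(mic, music)
    tail = slice(n - 1000, n)  # measure only after the filter has settled
    before = mean_power([m - s for m, s in zip(mic[tail], speech[tail])])
    after = mean_power([c - s for c, s in zip(cleaned[tail], speech[tail])])
    return before, after
```

Calling `demo()` compares the residual noise power with and without the filter on synthetic data. In a barge-in arrangement the same idea would apply with the ASR system's own prompt audio as the reference.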

Once the 'echo' has been removed, the signal is processed by the VRE (Voice Recognition Enhancer) block. This technology can provide in the region of 6-18 dB of noise reduction with minimal damage to the speech element of the signal. From here the speech is fed into the ASR module.

Because the VRE technology will allow voice type signals to pass through, it is possible that the voices of passengers, or maybe even music from their portable sound systems, might still corrupt the quality of the speech entering the ASR module. In some cases, a second microphone can be used to pick up the unwanted sound, so that the ENR (Enhanced Noise Reduction) block can eliminate this noise from the system, ensuring as clear speech as possible enters the ASR module. Such dual microphone noise reduction technology has particular applicability in cellular phone and headset applications where the communications microphone picks up significantly more speech than the second microphone, while the noise at both is fairly correlated.

Components of Typical Voice Recognition System

The figure shows the major components of a typical speech recognition system. The digitized speech signal is first transformed into a set of useful measurements or features at a fixed rate, typically once every 10-20 ms. These measurements are then used to search for the most likely word candidate, making use of constraints imposed by the acoustic, lexical, and language models. Throughout this process, training data are used to determine the values of the model parameters.

Fig 2:- Components of a typical speech recognition system.
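As a rough illustration of the first stage described above, the sketch below cuts a digitized signal into overlapping frames at a fixed step and computes one simple measurement (log energy) per frame. The 25 ms window length, the 10 ms step and the function names are assumptions chosen for the example; real front ends compute richer features than a single energy value.

```python
import math


def frame_signal(samples, sample_rate, frame_ms=25, step_ms=10):
    """Split `samples` into overlapping frames.

    A new frame starts every `step_ms` milliseconds, and each frame
    covers `frame_ms` milliseconds of audio.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    step = int(sample_rate * step_ms / 1000)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, step):
        frames.append(samples[start:start + frame_len])
    return frames


def log_energy(frame):
    """Log of the summed squared amplitude (small floor avoids log(0))."""
    return math.log(sum(s * s for s in frame) + 1e-10)


def energy_features(samples, sample_rate):
    """One log-energy value per frame: a very small feature sequence."""
    return [log_energy(f) for f in frame_signal(samples, sample_rate)]
```

For one second of audio sampled at 8000 Hz, these settings give frames of 200 samples that start 80 samples apart.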

Speech recognition systems attempt to model the sources of variability described above in several ways. At the level of signal representation, researchers have developed representations that emphasize perceptually important speaker-independent features of the signal, and de-emphasize speaker-dependent characteristics. At the acoustic phonetic level, speaker variability is typically modelled using statistical techniques applied to large amounts of data. Speaker adaptation algorithms have also been developed that adapt speaker-independent acoustic models to those of the current speaker during system use. Effects of linguistic context at the acoustic phonetic level are typically handled by training separate models for phonemes in different contexts; this is called context dependent acoustic modelling.

Word level variability can be handled by allowing alternate pronunciations of words in representations known as pronunciation networks. Common alternate pronunciations of words, as well as effects of dialect and accent are handled by allowing search algorithms to find alternate paths of phonemes through these networks. Statistical language models, based on estimates of the frequency of occurrence of word sequences, are often used to guide the search through the most probable sequence of words.
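To make the idea of a statistical language model concrete, here is a minimal sketch of a bigram model that estimates word-sequence probabilities from counts. The add-one smoothing, the `<s>` start marker and the tiny training corpus in the usage note are assumptions for illustration; production systems use far larger corpora and more careful smoothing.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count single words and adjacent word pairs in the training text."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def log_prob(sentence, unigrams, bigrams):
    """Log probability of a word sequence under the bigram model.

    Add-one smoothing keeps unseen word pairs from scoring zero.
    """
    vocab_size = len(unigrams)
    words = ["<s>"] + sentence.lower().split()
    total = 0.0
    for prev, word in zip(words, words[1:]):
        count_pair = bigrams[(prev, word)]
        count_prev = unigrams[prev]
        total += math.log((count_pair + 1) / (count_prev + vocab_size))
    return total
```

Trained on a few sentences such as "recognize speech" and "wreck a nice beach", the model scores "recognize speech" above "recognize beach", which is how a recognizer can prefer one of two acoustically similar word sequences.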

7. Types of Speech Recognition

Speech recognition systems can be separated in several different classes by describing what types of utterances they have the ability to recognize. These classes are based on the fact that one of the difficulties of ASR is the ability to determine when a speaker starts and finishes an utterance. Most packages can fit into more than one class, depending on which mode they're using.

Isolated Words

Isolated word recognizers usually require each utterance to have quiet (lack of an audio signal) on both sides of the sample window. This doesn't mean that such a system accepts only single words, but it does require a single utterance at a time. Often, these systems have "Listen/Not-Listen" states, where they require the speaker to wait between utterances (usually doing processing during the pauses). Isolated Utterance might be a better name for this class.
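One simple way to find an utterance surrounded by quiet is to compare short-term energy against a threshold. The sketch below assumes this energy-based approach; the frame length, the threshold value and the function name are illustrative, and real systems typically add smoothing and noise-adaptive thresholds.

```python
def find_utterance(samples, frame_len=80, threshold=0.01):
    """Locate the span of audio whose frame energy exceeds `threshold`.

    Returns (start, end) sample indices covering the first through the
    last active frame, or None if every frame is quiet.
    """
    active_starts = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        if energy > threshold:
            active_starts.append(start)
    if not active_starts:
        return None
    return active_starts[0], active_starts[-1] + frame_len
```

Given 800 samples of silence, 1600 samples of constant amplitude 0.5 and another 800 samples of silence, the function returns the boundaries of the middle region.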

Connected Words

Connected word systems (or more correctly 'connected utterances') are similar to isolated word systems, but allow separate utterances to be 'run together' with a minimal pause between them.

Continuous Speech

Continuous recognition is the next step. Recognizers with continuous speech capabilities are some of the most difficult to create because they must utilize special methods to determine utterance boundaries. Continuous speech recognizers allow users to speak almost naturally, while the computer determines the content. Basically, it's computer dictation.

Spontaneous Speech

There appears to be a variety of definitions for what spontaneous speech actually is. At a basic level, it can be thought of as speech that is natural sounding and not rehearsed. An ASR system with spontaneous speech ability should be able to handle a variety of natural speech features such as words being run together, "ums" and "ahs", and even slight stutters.

8. Uses and Applications

Although any task that involves interfacing with a computer can potentially use ASR, the following applications are the most common right now.


Dictation

Dictation is the most common use for ASR systems today. This includes medical transcriptions, legal and business dictation, as well as general word processing. In some cases special vocabularies are used to increase the accuracy of the system.

Command and Control

ASR systems that are designed to perform functions and actions on the system are defined as Command and Control systems. Utterances like "Open Netscape" and "Start a new xterm" will do just that.
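Once the recognizer has produced text, a command and control system still has to map that text to an action. The sketch below assumes the simplest possible design, a lookup table from normalized utterances to program names; the table contents and the `interpret` function are hypothetical and do not describe any particular product.

```python
# Hypothetical mapping from recognized utterances to commands to launch.
COMMANDS = {
    "open netscape": ["netscape"],
    "start a new xterm": ["xterm"],
}


def interpret(utterance, commands=COMMANDS):
    """Return the command for a recognized utterance, or None if unknown.

    The text is lower-cased and extra whitespace is collapsed so that
    small differences in the recognizer's output still match.
    """
    key = " ".join(utterance.lower().split())
    return commands.get(key)
```

The returned list could then be handed to something like `subprocess.run` to start the program. Rejecting unknown utterances rather than guessing keeps a misrecognition from triggering an unintended action.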


Telephony

Some PBX/Voice Mail systems allow callers to speak commands instead of pressing buttons to send specific tones.


Wearables

Because inputs are limited for wearable devices, speaking is a natural possibility.


Accessibility

Many people have difficulty typing due to physical limitations such as Repetitive Strain Injuries (RSI), muscular dystrophy, and many others. Speech recognition can help with other limitations as well: for example, people with difficulty hearing could use a system connected to their telephone to convert the caller's speech to text.

Embedded Applications

Some newer cellular phones include command and control (C&C) speech recognition that allows utterances such as "Call Home". This could be a major factor in the future of ASR and Linux. Why can't I talk to my television yet?

9. Speech Recognition Software

9.1 XVoice

XVoice is a dictation/continuous speech recognizer that can be used with a variety of X Window applications. It allows user-defined macros. This is a fine program with a definite future. Once set up, it performs with adequate accuracy.

XVoice requires that you download and install IBM's ViaVoice for Linux. It also requires the configuration of ViaVoice to work correctly. Additionally, Lesstif/Motif (libXm) is required. It is also important to note that because this program interacts with X windows, you must leave X resources open on your machine, so caution should be used if you use this on a networked or multi-user machine.

9.2 CVoiceControl/kVoiceControl

CVoiceControl (which stands for Console Voice Control) started its life as KVoiceControl (KDE Voice Control). It is a basic speech recognition system that allows a user to execute Linux commands by using spoken commands. CVoiceControl replaces KVoiceControl.

The software includes a microphone level configuration utility, a vocabulary "model editor" for adding new commands and utterances, and the speech recognition system. This software is primarily for users.

9.3 Open Mind Speech

Started in late 1999, Open Mind Speech has changed names several times (it was VoiceControl, then SpeechInput, and then FreeSpeech), and is now part of the "Open Mind Initiative". This is an open source project. Currently it isn't completely operational. This software is primarily for developers.

9.4 GVoice

GVoice is an ASR library that uses IBM's ViaVoice SDK to control Gtk/GNOME applications. It includes libraries for initialization, recognition engine, vocabulary manipulation, and panel control. Development on this has been idle for over a year. This software is primarily for developers.

9.5 ISIP

The Institute for Signal and Information Processing at Mississippi State University has made its speech recognition engine available. The toolkit includes a front-end, a decoder, and a training module. It's a functional toolkit. This software is primarily for developers.

9.6 CMU Sphinx

Sphinx originally started at CMU and has recently been released as open source. This is a fairly large program that includes a lot of tools and information. It is still "in development", but includes trainers, recognizers, acoustic models, language models, and some limited documentation. This software is primarily for developers.

9.7 Ears

Although Ears isn't fully developed, it is a good starting point for programmers wishing to start in ASR. This software is primarily for developers.

9.8 NICO ANN Toolkit

The NICO Artificial Neural Network toolkit is a flexible back propagation neural network toolkit optimized for speech recognition applications. This software is primarily for developers.

9.9 Myers' Hidden Markov Model Software

This software by Richard Myers is a set of HMM algorithms written in C++. It provides an example and learning tool for the HMMs described in L. Rabiner's book "Fundamentals of Speech Recognition". This software is primarily for developers.
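For readers unfamiliar with HMMs, the sketch below shows the Viterbi algorithm, the standard way to find the most likely hidden state sequence for a series of observations. It is written in Python for brevity and is not taken from Myers' C++ code; the two-state "silence"/"speech" model in the usage note is a made-up toy, not a real acoustic model.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (best state path, its probability) for the observations.

    start_p[s]    -- probability of starting in state s
    trans_p[a][b] -- probability of moving from state a to state b
    emit_p[s][o]  -- probability that state s emits observation o
    """
    # best[t][s]: probability of the best path that ends in s at time t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p] * trans_p[p][s]) for p in states),
                key=lambda item: item[1],
            )
            scores[s] = score * emit_p[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the most probable final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]
```

With a toy model in which "silence" mostly emits "low" energy and "speech" mostly emits "high" energy, the observation sequence low, high, high, low decodes to silence, speech, speech, silence. Practical implementations work with log probabilities to avoid numerical underflow on long inputs.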

9.10 Jialong He's Speech Recognition Research Tool

Although not originally written for Linux, this research tool can be compiled on Linux. It contains three different types of recognizers: DTW, Dynamic Hidden Markov Model, and a Continuous Density Hidden Markov Model. This is for research and development uses, as it is not a fully functional ASR system. The toolkit contains some very useful tools. This software is primarily for developers.
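DTW (dynamic time warping) is the simplest of the three recognizer types to illustrate. The sketch below is a generic textbook formulation, not code from this toolkit: it aligns two sequences of scalar features that may differ in speaking rate and returns the accumulated distance. The one-dimensional features and the `closest_template` helper are simplifying assumptions.

```python
def dtw_distance(a, b):
    """Accumulated distance of the best time alignment between a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j]: best alignment cost of a[:i] against b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances while b repeats a frame
                cost[i][j - 1],      # b advances while a repeats a frame
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]


def closest_template(features, templates):
    """Name of the stored template with the smallest DTW distance."""
    return min(templates, key=lambda word: dtw_distance(features, templates[word]))
```

Because repeated frames cost nothing extra when they match, a slowly spoken word such as `[1, 1, 3, 3, 1]` still matches the template `[1, 3, 1]` with zero distance.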

10. Commercial Software

10.1 IBM ViaVoice

IBM has made good on their promise to support Linux with their series of ViaVoice products for Linux, though the future of their SDKs isn't set in stone (their licensing agreement for developers isn't officially released as of this date).

Their commercial product, IBM ViaVoice Dictation for Linux, performs very well, but has some sizeable system requirements compared to the more basic ASR systems (64 MB RAM and a 233 MHz Pentium). The package includes documentation (PDF), a trainer, the dictation system, and installation scripts. Support for additional Linux distributions based on 2.2 kernels is also available in the latest release.

10.2 Babel Technologies

Babel Technologies has a Linux SDK available called Babear. It is a speaker-independent system based on Hybrid Markov Models and Artificial Neural Networks technology. They also have a variety of products for Text-to-speech, speaker verification, and phoneme analysis.

10.3 Nuance

Nuance offers a speech recognition/natural language product (currently Nuance 8.0) for a variety of platforms. It can handle very large vocabularies and uses a unique distributed architecture for scalability and fault tolerance.

10.4 Abbot/AbbotDemo

Abbot is a very large vocabulary, speaker independent ASR system. It was originally developed by the Connectionist Speech Group at Cambridge University. It was transferred to SoftSound.

AbbotDemo is a demonstration package of Abbot. This demo system has a vocabulary of about 5000 words and uses the connectionist/HMM continuous speech algorithm. This is a demonstration program with no source code.

10.5 Entropic

The fine people over at Entropic have been bought out by Microsoft. Their products and support services have all but disappeared. Their support for HTK and ESPS/waves+ is gone, and their future is in the hands of Microsoft.

11. Weaknesses and Flaws of Speech Recognition

No speech recognition system is 100 percent perfect; several factors can reduce accuracy. Some of these factors are issues that continue to improve as the technology improves. Others can be lessened -- if not completely corrected -- by the user.

11.1 Low signal-to-noise ratio

The program needs to "hear" the words spoken distinctly, and any extra noise introduced into the sound will interfere with this. The noise can come from a number of sources, including loud background noise in an office environment. Users should work in a quiet room with a quality microphone positioned as close to their mouths as possible. Low-quality sound cards, which provide the input for the microphone to send the signal to the computer, often do not have enough shielding from the electrical signals produced by other computer components. They can introduce hum or hiss into the signal.
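Signal-to-noise ratio is usually quoted in decibels. As a small worked example, the function below computes it from a stretch of speech samples and a stretch of noise-only samples; the function name and the assumption that the two can be measured separately are ours.

```python
import math


def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels from two lists of samples."""
    p_signal = sum(s * s for s in signal) / len(signal)
    p_noise = sum(n * n for n in noise) / len(noise)
    return 10 * math.log10(p_signal / p_noise)
```

A signal with ten times the amplitude of the noise has one hundred times its power, which corresponds to 20 dB. Each additional 10 dB means another factor of ten in power.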

11.2 Overlapping speech

Current systems have difficulty separating simultaneous speech from multiple users. If you try to employ recognition technology in conversations or meetings where people frequently interrupt each other or talk over one another, you're likely to get extremely poor results.

11.3 Intensive use of computer power

Running the statistical models needed for speech recognition requires the computer's processor to do a lot of heavy work. One reason for this is the need to remember each stage of the word-recognition search in case the system needs to backtrack to come up with the right word. The fastest personal computers in use today can still have difficulties with complicated commands or phrases, slowing down the response time significantly. The vocabularies needed by the programs also take up a large amount of hard drive space. Fortunately, disk storage and processor speed are areas of rapid advancement -- the computers in use 10 years from now will benefit from an exponential increase in both factors.

11.4 Homonyms

Homonyms (more precisely, homophones) are words that are spelled differently and have different meanings but sound the same. "There" and "their," "air" and "heir," "be" and "bee" are all examples. There is no way for a speech recognition program to tell the difference between these words based on sound alone. However, extensive training of systems and statistical models that take into account word context have greatly improved their performance.

12. Advancements in Technology

Over the last 16 years, speech recognition has advanced by leaps and bounds. Third-party vendors now offer "templated" systems in which fields are inserted into document forms; the user moves from field to field and dictates into each one, thereby producing a report. When you dictate into forms, you are prompted within each field to dictate whatever is relevant to that field within that report. In other words, the form acts as a customized database.

We have also observed companies that are building intelligent vocabularies for specialized professions. There are medical, legal and other specialized vocabularies with industry standard words, like proper Latin terminology, procedures, and for medical professionals, drugs, instruments, diseases, medical procedure terminology, etc.

Being able to dictate into a digital recorder, and then play that dictation back to the system and have it transcribed automatically without having to sit in front of a computer, has been immensely popular, although we recommend that an end user take the time to learn how to use NaturallySpeaking before graduating to a digital recorder.

13. The Future of Speech Recognition

For several decades, scientists developed experimental methods of computerized speech recognition, but the computing power available at the time limited them. Only in the 1990s did computers powerful enough to handle speech recognition become available to the average consumer. Current research could lead to technologies that today are more familiar from an episode of "Star Trek." The Defense Advanced Research Projects Agency (DARPA) has three teams of researchers working on Global Autonomous Language Exploitation (GALE), a program that will take in streams of information from foreign news broadcasts and newspapers and translate them. It hopes to create software that can instantly translate between two languages with at least 90 percent accuracy. DARPA is also funding an R&D effort called TRANSTAC to enable soldiers to communicate more effectively with civilian populations in non-English-speaking countries, and the technology is expected to spin off into civilian applications, including a universal translator.

A universal translator is still far into the future, however -- it's very difficult to build a system that combines automatic translation with voice activation technology. According to a recent CNN article, the GALE project is "'DARPA hard' difficult even by the extreme standards" of DARPA. Why? One problem is making a system that can flawlessly handle roadblocks like slang, dialects, accents and background noise. The different grammatical structures used by languages can also pose a problem. For example, Arabic sometimes uses single words to convey ideas that are entire sentences in English.

At some point in the future, speech recognition may become speech understanding. The statistical models that allow computers to decide what a person just said may someday allow them to grasp the meaning behind the words. Although it is a huge leap in terms of computational power and software sophistication, some researchers argue that speech recognition development offers the most direct line from the computers of today to true artificial intelligence. We can talk to our computers today. In 25 years, they may very well talk back.