Speech recognition (also known as automatic speech recognition or computer speech recognition) converts spoken words to text. The term "voice recognition" is sometimes used to refer to speech recognition where the recognition system is trained to a particular speaker, as is the case for most desktop recognition software; hence there is an aspect of speaker recognition, which attempts to identify the person speaking, to better recognize what is being said. Speech recognition is the broader term: such a system can recognize almost anybody's speech, as in a call centre system designed to handle many voices. Voice recognition is a system trained to a particular user, which recognises that person's speech on the basis of their unique vocal sound.
Speech recognition applications include voice dialling (e.g. "Call home"), call routing (e.g. "I would like to make a collect call"), domotic appliance control, content-based spoken audio search (e.g. finding a podcast in which particular words were spoken), simple data entry (e.g. entering a credit card number), preparation of structured documents (e.g. a radiology report), speech-to-text processing (e.g. word processors or email), and aircraft cockpits (usually termed Direct Voice Input). The first speech recognizer appeared in 1952 and consisted of a device for the recognition of single spoken digits. Another early device was the IBM Shoebox, exhibited at the 1964 New York World's Fair.
One of the most notable domains for the commercial application of speech recognition in the United States has been health care, and in particular the work of the medical transcriptionist (MT). According to industry experts, at its inception speech recognition (SR) was sold as a way to eliminate transcription entirely rather than to make the transcription process more efficient, and hence it was not accepted. It was also the case that SR at that time was often technically deficient. Additionally, to be used effectively it required changes to the ways physicians worked and documented clinical encounters, which many, if not all, were reluctant to make. The biggest limitation to speech recognition automating transcription, however, is seen as the software. The nature of narrative dictation is highly interpretive and often requires judgment that may be provided by a real human but not yet by an automated system. Another limitation has been the extensive amount of time required by the user and/or system provider to train the software. A distinction in ASR is often made between "artificial syntax systems", which are usually domain-specific, and "natural language processing", which is usually language-specific. Each of these types of application presents its own particular goals and challenges.
How conventional speech recognition systems work
Speech recognition is an alternative to traditional methods of interacting with a computer, such as textual input through a keyboard. An effective system can replace, or reduce the reliance on, standard keyboard and mouse input. This can especially assist the following:
- people who have few keyboard skills or little experience, who are slow typists, or who do not have the time or resources to develop keyboard skills.
- dyslexic people, or others who have problems with character or word use and manipulation in a textual form.
- people with physical disabilities that affect either their data entry, or their ability to read (and therefore check) what they have entered.
A speech recognition system consists of the following:
- a microphone, for the person to speak into.
- speech recognition software.
- a computer to take and interpret the speech.
- a good quality soundcard for input and/or output.
Speech recognition systems used by the general public, e.g. phone-based automated timetable information or ticket purchasing, can be used immediately: the user makes contact with the system and speaks in response to commands and questions. However, systems on computers meant for more individual use, such as for personal word processing, usually require a degree of "training" before use. Here, an individual user "trains" the system to understand words or word fragments (see section 2.3); this training is often referred to as "enrolment". At the heart of the software is the translation part. Most speech recognition software breaks down the spoken words into phonemes, the basic sounds from which syllables and words are built up. These are analysed to see which string of these units best "fits" an acceptable phoneme string or structure that the software can derive from its dictionary 2. It is a common misconception that such a system can simply be used "out of the box" for work purposes. The system has to be trained to recognise factors associated with the user's voice, e.g. speed and pitch. Even after this training, the user often has to speak in a clear and partially modified manner in order for his or her spoken words to be both recognised and correctly translated. Most speech recognition software is configured or designed to be used on a standalone computer, but it is possible to configure some software to be used over a network 3.
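The dictionary-matching step described above can be sketched as a nearest-match lookup of a phoneme string against a small pronunciation dictionary. The phoneme symbols and words below are illustrative only, not a real lexicon, and the similarity measure is a simple stand-in for the statistical scoring a real recognizer would use.

```python
# Toy sketch of matching a recognised phoneme string against a
# pronunciation dictionary to find the word that best "fits".
from difflib import SequenceMatcher

# Hypothetical pronunciation dictionary (illustrative phoneme strings).
PRONUNCIATIONS = {
    "hello": ["HH", "AH", "L", "OW"],
    "yellow": ["Y", "EH", "L", "OW"],
    "call": ["K", "AO", "L"],
    "home": ["HH", "OW", "M"],
}

def best_match(phonemes):
    """Return the dictionary word whose phoneme string best fits the input."""
    def similarity(word):
        return SequenceMatcher(None, phonemes, PRONUNCIATIONS[word]).ratio()
    return max(PRONUNCIATIONS, key=similarity)

print(best_match(["HH", "AH", "L", "OW"]))  # exact fit: "hello"
print(best_match(["HH", "OW", "N"]))        # closest fit despite a misheard phoneme: "home"
```

Even in this toy form, the lookup illustrates why a misheard phoneme need not derail recognition: the closest dictionary entry still wins.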
Discrete and continuous systems
Until the late 1990s, commercial speech recognition systems were "discrete" in nature. This required the user to speak in an unnatural, command-like manner, with a gap between each word (to allow time for the computer to process the input, and to make it clear to the computer when individual words ended). Not surprisingly, these speech recognition systems were more oriented towards command-based applications. Entering and editing a large document, such as a book chapter, was very time consuming. The rapid increase in desktop computer processing power and better algorithms led to the development of "continuous" systems. These allowed the user to speak at near-normal speed and still (after sufficient training) obtain a high accuracy rate. Contemporary continuous systems can give an accuracy of 90%-95%, depending on training (though this still means that one word in every ten or twenty can be expected to be incorrect).
Training the system
The amount of enrolment required depends on the software used, the processing power of the computer, and the desired final accuracy of the system. Older versions of the software/computer can require a considerable amount of training; however, enrolment times have fallen drastically in recent years (partially due to the rapid increase in standard processing power). For example, an observer noted that one package from 1999 "requires a demanding enrolment procedure when a slower computer is being used. It requires the user to read up to 100 paragraphs of an adult level text such as 'Alice in Wonderland' or '3001: The Final Odyssey'." However, the next year, an upgraded version of the package "may take less than 10 minutes, after installation, to set up and enrol... Nevertheless, achieving good results in a short time demands a computer with a suitably high specification, e.g. Pentium II or Pentium III with at least 64 Megabytes of RAM (128 Mb is better), a good quality soundcard and microphone." 4 Some speech recognition software manufacturers claim an enrolment time as short as 5 minutes 5, though in practice this may adversely affect the accuracy of the system in use. Most speech recognition systems allow the user to undertake further enrolment procedures, or retraining of specific misidentified words, if needed. Contemporary systems also allow you to train against your own text or documents, thereby creating your own specialised "vocabulary" for the system to recognise. Multiple language support is now commonplace, with systems such as SpeechPearl supporting over 40 languages 6.
Reducing extraneous factors
There are a number of methods for increasing the accuracy and ease of use of speech recognition systems:
- Using a high-performance computer. If using contemporary speech recognition software, the computer will usually need to contain a fast processor and a large amount of RAM in order to work efficiently. Though software packaging often states that it will run on "64MB of RAM", this has often been found to be inadequate, resulting in a much longer training time 7. A minimum of 256MB of RAM is preferable, which can cause problems in schools, colleges and universities that use older computers.
- Using a good quality microphone. Microphones with "Active Noise Reduction" or "Active Noise Cancellation" can reduce the amount of background noise that can "confuse" the software.
It is notoriously difficult to measure the accuracy of speech recognition systems, as there are so many technical and human factors involved. Several experiments have attempted to compare speech recognition with other kinds of data entry, such as mouse, keyboard and handwriting recognition. Many of these experiments do not reach an overall conclusion concerning "which system is better" 9. One of the problems with the wider take-up of speech recognition is that the level of accuracy attained by a user does not match that stated on the software packaging 10. This can be for all manner of reasons, such as the machine specification being insufficient, or (more often) the level of training undertaken by the user. Independent reviews of speech recognition systems indicate that a score of around 95% accuracy is possible with an increasing number of systems. For example, in tests 11 involving dictating a newspaper story, an email message and a business letter, Dragon NaturallySpeaking 6.0 scored 95% accuracy, ViaVoice scored 92% accuracy and NaturallySpeaking 5.0 scored only 85% accuracy. Speech recognition systems increasingly offer specialist vocabulary building systems. This step is particularly helpful when subject- and user-specific words and acronyms are likely to be used, such as specialist vocabulary from university subjects 12. Anecdotal evidence from various web sites points to a reduction in the error rate, through using specialist vocabulary, of usually around a third.
In the health care domain, even in the wake of improving speech recognition technologies, medical transcriptionists (MTs) have not yet become obsolete. Many experts in the field anticipate that with increased use of speech recognition technology, the services provided may be redistributed rather than replaced. Speech recognition can be implemented at the front end or the back end of the medical documentation process. Front-end SR is where the provider dictates into a speech-recognition engine, the recognised words are displayed as they are spoken, and the dictator is responsible for editing and signing off on the document; it never goes through an MT/editor. Back-end SR, or deferred SR, is where the provider dictates into a digital dictation system, the voice is routed through a speech-recognition machine, and the recognised draft document is routed along with the original voice file to the MT/editor, who edits the draft and finalises the report. Deferred SR is currently widely used in the industry. Many Electronic Medical Records (EMR) applications can be more effective and may be operated more easily when deployed in conjunction with a speech-recognition engine. Searches, queries and form filling may all be faster to perform by voice than by using a keyboard.
High-performance fighter aircraft
Substantial efforts have been devoted in the last decade to the test and evaluation of speech recognition in fighter aircraft. Of particular note are the U.S. program in speech recognition for the Advanced Fighter Technology Integration (AFTI)/F-16 aircraft (F-16 VISTA), the program in France on installing speech recognition systems on Mirage aircraft, and programs in the UK dealing with a variety of aircraft platforms. In these programs, speech recognizers have been operated successfully in fighter aircraft with applications including: setting radio frequencies, commanding an autopilot system, setting steer-point coordinates and weapons release parameters, and controlling flight displays. Generally, only very limited, constrained vocabularies have been used successfully, and a major effort has been devoted to integration of the speech recognizer with the avionics system. Some important conclusions from the work were as follows:
- Speech recognition has definite potential for reducing pilot workload, but this potential was not realized consistently.
- Achievement of very high recognition accuracy (95% or more) was the most critical factor for making the speech recognition system useful; with lower recognition rates, pilots would not use the system.
- More natural vocabulary and grammar, and shorter training times would be useful, but only if very high recognition rates could be maintained.
Laboratory research in robust speech recognition for military environments has produced promising results which, if extendable to the cockpit, should improve the utility of speech recognition in high-performance aircraft. Working with Swedish pilots flying in the JAS-39 Gripen cockpit, England (2004) found recognition deteriorated with increasing G-loads. It was also concluded that adaptation greatly improved the results in all cases, and that introducing models for breathing improved recognition scores significantly. Contrary to what might be expected, no effects of the broken English of the speakers were found. It was evident that spontaneous speech caused problems for the recognizer, as could be expected. A restricted vocabulary, and above all a proper syntax, could thus be expected to improve recognition accuracy substantially. The Eurofighter Typhoon currently in service with the UK RAF employs a speaker-dependent system, i.e. it requires each pilot to create a template. The system is not used for any safety-critical or weapon-critical tasks, such as weapon release or lowering of the undercarriage, but is used for a wide range of other cockpit functions. Voice commands are confirmed by visual and/or aural feedback. The system is seen as a major design feature in the reduction of pilot workload, and even allows the pilot to assign targets to himself with two simple voice commands, or to any of his wingmen with only five commands.
Training air traffic controllers
Training for military (or civilian) air traffic controllers (ATC) represents an excellent application for speech recognition systems. Many ATC training systems currently require a person to act as a "pseudo-pilot", engaging in a voice dialog with the trainee controller, which simulates the dialog the controller would have to conduct with pilots in a real ATC situation. Speech recognition and synthesis techniques offer the potential to eliminate the need for a person to act as pseudo-pilot, thus reducing training and support personnel. Air controller tasks are also characterized by highly structured speech as the primary output of the controller, which reduces the difficulty of the speech recognition task. The U.S. Naval Training Equipment Center has sponsored a number of developments of prototype ATC trainers using speech recognition. Generally, the recognition accuracy falls short of providing graceful interaction between the trainee and the system. However, the prototype training systems have demonstrated a significant potential for voice interaction in these systems, and in other training applications. The U.S. Navy has sponsored a large-scale effort in ATC training systems, where a commercial speech recognition unit was integrated with a complex training system including displays and scenario creation. Although the recognizer was constrained in vocabulary, one of the goals of the training programs was to teach the controllers to speak in a constrained language, using specific vocabulary specifically designed for the ATC task. Research in France has focused on the application of speech recognition in ATC training systems, directed at issues both in speech recognition and in application of task-domain grammar constraints. The USAF, USMC, US Army, and FAA are currently using ATC simulators with speech recognition from a number of different vendors, including UFA, Inc. and Adacel Systems Inc. (ASI).
This software uses speech recognition and synthetic speech to enable the trainee to control aircraft and ground vehicles in the simulation without the need for pseudo-pilots. Another approach to ATC simulation with speech recognition has been created by Supremis. The Supremis system is not constrained by rigid grammars imposed by the underlying limitations of other recognition strategies.
Telephony and other domains
ASR in the field of telephony is now commonplace, and in the field of computer gaming and simulation it is becoming more widespread. Despite the high level of integration with word processing in general personal computing, however, ASR in the field of document production has not seen the expected increases in use. The improvement of mobile processor speeds made feasible the speech-enabled Symbian and Windows Mobile smartphones. Current speech-to-text programs are too large and require too much CPU power to be practical for the Pocket PC. Speech is used mostly as part of the user interface, for creating pre-defined or custom speech commands. Leading software vendors in this field are Microsoft Corporation (Microsoft Voice Command), Nuance Communications (Nuance Voice Control), Vito Technology (VITO Voice2Go) and Speereo Software (Speereo Voice Translator). Further application areas include:
- Automatic translation
- Automotive speech recognition (e.g. Ford Sync)
- Telematics (e.g. vehicle navigation systems)
- Court reporting (realtime voice writing)
- Hands-free computing: voice command recognition computer user interface
- Home automation
- Interactive voice response
- Mobile telephony, including mobile email
- Multimodal interaction
- Pronunciation evaluation in computer-aided language learning applications
- Video games, with possible expansion into the RTS genre following Tom Clancy's EndWar
- Transcription (digital speech-to-text)
- Speech-to-text (transcription of speech into mobile text messages)
- Air traffic control speech recognition
Performance of speech recognition systems
The performance of speech recognition systems is usually specified in terms of accuracy and speed. Accuracy is usually rated with word error rate (WER), whereas speed is measured with the real time factor. Other measures of accuracy include Single Word Error Rate (SWER) and Command Success Rate (CSR). Most speech recognition users would tend to agree that dictation machines can achieve very high performance in controlled conditions. There is some confusion, however, over the interchangeability of the terms "speech recognition" and "dictation". Commercially available speaker-dependent dictation systems usually require only a short period of training (sometimes also called "enrollment") and may successfully capture continuous speech with a large vocabulary at normal pace with very high accuracy. Most commercial companies claim that recognition software can achieve between 98% and 99% accuracy if operated under optimal conditions. "Optimal conditions" usually assume that users:
- have speech characteristics which match the training data,
- can achieve proper speaker adaptation,
- work in a clean noise environment (e.g. a quiet office or laboratory space).
This explains why some users, especially those whose speech is heavily accented, might achieve recognition rates much lower than expected. Speech recognition in video has become a popular search technology used by several video search companies. Limited vocabulary systems, requiring no training, can recognize a small number of words (for instance, the ten digits) as spoken by most speakers. Such systems are popular for routing incoming phone calls to their destinations in large organizations. Both acoustic modeling and language modeling are important parts of modern statistically-based speech recognition algorithms. Hidden Markov models (HMMs) are widely used in many systems. Language modeling has many other applications, such as smart keyboards and document classification.
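The word error rate mentioned above has a precise definition: the minimum number of word substitutions, deletions and insertions needed to turn the recognizer's output into the reference transcript, divided by the number of reference words. A minimal sketch, not tied to any particular scoring toolkit, computes it with the standard edit-distance dynamic programme:

```python
# Word error rate (WER): (substitutions + deletions + insertions) / reference length,
# computed by word-level edit distance.

def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution in a five-word reference: WER 0.2, i.e. 80% word accuracy.
print(word_error_rate("please call the office now", "please call a office now"))  # 0.2
```

This makes concrete the packaging claims quoted earlier: a "95% accuracy" figure corresponds to a WER of 0.05 on the test material used.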
Hidden Markov model (HMM)-based speech recognition
Modern general-purpose speech recognition systems are generally based on hidden Markov models. These are statistical models which output a sequence of symbols or quantities. One possible reason why HMMs are used in speech recognition is that a speech signal can be viewed as a piecewise stationary signal or a short-time stationary signal: over a short time-scale (on the order of 10 milliseconds), speech can be approximated as a stationary process. Speech can thus be thought of as a Markov model for many stochastic processes. Another reason why HMMs are popular is that they can be trained automatically and are simple and computationally feasible to use. In speech recognition, the hidden Markov model would output a sequence of n-dimensional real-valued vectors (with n being a small integer, such as 10), outputting one of these every 10 milliseconds. The vectors would consist of cepstral coefficients, which are obtained by taking a Fourier transform of a short time window of speech, decorrelating the spectrum using a cosine transform, and then taking the first (most significant) coefficients. The hidden Markov model will tend to have in each state a statistical distribution that is a mixture of diagonal covariance Gaussians, which gives a likelihood for each observed vector. Each word, or (for more general speech recognition systems) each phoneme, will have a different output distribution; a hidden Markov model for a sequence of words or phonemes is made by concatenating the individually trained hidden Markov models for the separate words and phonemes. Described above are the core elements of the most common, HMM-based approach to speech recognition. Modern speech recognition systems use various combinations of a number of standard techniques in order to improve results over the basic approach described above.
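A toy version of this setup makes the idea concrete. In the sketch below, each HMM state has a single one-dimensional Gaussian output distribution (a real system would use mixtures over ten- to forty-dimensional cepstral vectors), and the forward algorithm sums over all state paths to give the likelihood of an observation sequence under the model. All model parameters here are invented for illustration.

```python
# Minimal HMM with Gaussian output distributions, scored by the forward algorithm.
import math

def gaussian(x, mean, var):
    """Likelihood of observation x under a 1-D Gaussian."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def forward_likelihood(observations, initial, transitions, means, variances):
    """P(observations | model) via the forward algorithm."""
    n_states = len(initial)
    alpha = [initial[s] * gaussian(observations[0], means[s], variances[s])
             for s in range(n_states)]
    for obs in observations[1:]:
        alpha = [gaussian(obs, means[s], variances[s]) *
                 sum(alpha[r] * transitions[r][s] for r in range(n_states))
                 for s in range(n_states)]
    return sum(alpha)

# Two-state left-to-right model, as might represent a short phoneme.
initial = [1.0, 0.0]
transitions = [[0.6, 0.4], [0.0, 1.0]]
means, variances = [0.0, 3.0], [1.0, 1.0]

rising = [0.1, 0.2, 2.9, 3.1]  # matches state 1 then state 2
flat = [3.0, 3.0, 3.0, 3.0]    # never matches state 1's distribution
print(forward_likelihood(rising, initial, transitions, means, variances) >
      forward_likelihood(flat, initial, transitions, means, variances))  # True
```

A recognizer holds one such model per word or phoneme and picks the model under which the observed feature vectors are most likely; concatenating phoneme models yields word models, as described above.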
A typical large-vocabulary system would need context dependency for the phonemes (so phonemes with different left and right context have different realizations as HMM states); it would use cepstral normalization to normalize for different speaker and recording conditions; for further speaker normalization it might use vocal tract length normalization (VTLN) for male-female normalization and maximum likelihood linear regression (MLLR) for more general speaker adaptation. The features would have so-called delta and delta-delta coefficients to capture speech dynamics and in addition might use heteroscedastic linear discriminant analysis (HLDA); or might skip the delta and delta-delta coefficients and use splicing and an LDA-based projection followed perhaps by heteroscedastic linear discriminant analysis or a global semitied covariance transform (also known as maximum likelihood linear transform, or MLLT). Many systems use so-called discriminative training techniques which dispense with a purely statistical approach to HMM parameter estimation and instead optimize some classification-related measure of the training data. Examples are maximum mutual information (MMI), minimum classification error (MCE) and minimum phone error (MPE). Decoding of the speech (the term for what happens when the system is presented with a new utterance and must compute the most likely source sentence) would probably use the Viterbi algorithm to find the best path, and here there is a choice between dynamically creating a combination hidden Markov model which includes both the acoustic and language model information, or combining it statically beforehand (the finite state transducer, or FST, approach).
Speech recognition in the education sector
A literature review revealed that there has been more published on the use of speech recognition in UK schools than in colleges or universities over the last decade. Much of this use was as part of short-term or experimental investigative projects. Various comparative reports noted that speech recognition is but one of many methods of data entry for people with physical or learning difficulties. For example, Braille input/output devices, touch screen systems and trackballs have all been used successfully in the classroom 13. One such study compared speech recognition with three other methods of data entry as input methods for children; while the research revealed interesting observations about how the systems were used and reacted to input, the authors (as with several similar comparative papers in this field) shied away from an overall "this is best" conclusion. However, over the last few years, there has been a surge in interest from within HE and FE regarding the use of speech recognition technologies. This is due to:
- the increase in performance and accuracy of such systems
- a greater incorporation of students and learners with some disability into the educational sector
- increasing legal and moral obligations to provide accessible ICT systems 14
For example, the University of Exeter has a disability policy, statement and dedicated resource centre. The University library has an "IT special needs zone" 15, where students can use one of a number of speech recognition systems to write essays or carry out other computer-based work. Meanwhile, the University of Glasgow provides a range of disability-oriented ICT software to students, including a speech recognition software package 16. An informal survey of various university web sites indicated that an increasing number offer access to speech recognition software.
However, in most cases this was offered not in a department or research group, or even in a laboratory, but in the university library or closely associated central IT centre. Details on what staff support the student would receive in either using the package, or the enrolment procedure, were infrequently found.
Speech recognition outwith the education sector
Possibly the most widespread application of speech technology in contemporary life has been its incorporation into telephone-based information retrieval systems. This is almost a natural development, as telephones take speech input anyway (though in a passive, "passing it on" manner). At the most basic level, some mobile phones offer the facility to select a phone number from the in-phone directory by saying the name associated with it, e.g. "David", "Husband", "Mother"; the phone then dials the stored number automatically. At a more useful level, speech recognition is increasingly used in automated telephone-based interactive services. For example, it is possible to check the weather forecast, the price of a stock market share, or book a flight using an increasing number of these services 17. There are advantages to this for both the customer (no waiting for a human operator) and the service supplier (fewer staff required, and the service can operate 24/7). However, such systems still tend to guide the customer heavily through a range of options, as interpreting free and natural instructions is beyond the capabilities and knowledge banks of contemporary systems. In addition, security considerations are an impediment to development in some areas; for example, there are concerns over who can buy airline tickets, or transfer money between bank accounts, without human intervention or examination, using such a system. Speech recognition technology is also being tentatively used, and researched, in the car industry 18. This is not surprising, as contemporary cars are heavily marketed according to technical innovations and features. Development is based in four areas:
- hands-free use of mobile phone handsets in the car e.g. “Dial office”
- speech instructions to navigation systems, e.g. GPS-connected digital maps: "How far is it to the motorway junction?"
- in-car system interaction, e.g. "Turn on the radio to the travel reports channel."
- in-car steering systems
The last of these, not surprisingly, is the least developed, due to obvious safety considerations. Though there are advantages for speech recognition in cars (such as being able to keep both hands on the steering wheel while telling the "car" to do something), there are considerable obstacles in terms of in-car noise, and vocal interference from passengers who are only inches or feet away.
Contemporary speech recognition systems
There are a surprisingly large number of speech recognition systems (of different kinds) that are commercially available. The TechDis Accessibility Database (see appendix A2) contains details of 12 such products. The first two in the list below appear to be the most widely used in UK universities and colleges.
Acknowledgements
The help and support I received from teachers and friends overwhelmed me during the preparation of this term paper. I express my sincere gratitude to my subject teacher Mr. Randhir Singh for his guidance, continuous support and cooperation throughout my term paper, without which the present work would not have been possible. I would also like to thank my friends for helping me search for material related to the topic, without which this term paper would not have been completed.
- http://www.connectweb.co.uk/public/products/report/philips.html - SpeechPearl press release.