Overview Of Speech Recognition Computer Science Essay


Speech recognition is the process of capturing an acoustic signal, such as a human voice, through a device like a microphone or a telephone and converting it into a desired action (for example, a set of words or a movement). The result of this action can serve as input for further linguistic processing aimed at understanding speech.

The invention of speech recognition dates back to the time of Alexander Graham Bell. His wife's hearing impairment inspired his experiments to convert speech into spectrographic images of sound: visible pictures that a hearing-impaired person could interpret. Unfortunately, his wife could not interpret these pictorial sounds. More importantly, this very same research led to his invention of the telephone.

Source: howstuffworks

Then, in the early 1960s, IBM developed and demonstrated "Shoebox", a device that was a forerunner of today's voice recognition systems. This device could recognize and respond to 16 spoken words, including basic arithmetic commands such as plus, minus, and total, and the digits zero through nine. IBM's Shoebox could then calculate and print the answers.

3. How does it work? 

Speech recognition technology relies on three main sub-processes: capture/conversion, fragmentation, and contextualization.

The first process is the capture and conversion of speech. The application captures spoken sound waves, which are analog signals. In order for the computer to recognize these sounds, the waves are converted with an analog-to-digital converter (ADC). The output of the ADC is a digital signal the computer can process, which it then filters in an attempt to isolate the pure speech and remove ambient noise.
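As a rough illustration of this conversion step, the sketch below samples a stand-in "analog" waveform at a fixed rate and quantizes each sample to an 8-bit integer. The tone, sample rate, and bit depth are arbitrary choices for the example, not details of any particular product.

```python
import math

# Minimal sketch of analog-to-digital conversion: sample a continuous
# waveform at fixed intervals and quantize each sample to an integer level.
SAMPLE_RATE = 8000      # samples per second (telephone-quality audio)
BIT_DEPTH = 8           # quantization resolution: 256 discrete levels
DURATION = 0.01         # seconds of audio to capture

def analog_signal(t):
    """Stand-in for a microphone's analog output: a 440 Hz tone."""
    return math.sin(2 * math.pi * 440 * t)

def adc(signal, sample_rate, bit_depth, duration):
    """Sample and quantize an analog signal into digital values."""
    levels = 2 ** (bit_depth - 1)   # e.g. -128..127 for 8-bit audio
    n_samples = int(sample_rate * duration)
    return [round(signal(n / sample_rate) * (levels - 1))
            for n in range(n_samples)]

samples = adc(analog_signal, SAMPLE_RATE, BIT_DEPTH, DURATION)
print(len(samples))   # 80 samples for 10 ms of audio
```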

The second sub-process is fragmentation. The application fragments these waves into very small pieces, "as short as a few hundredths of a second, or even thousandths in the case of plosive consonant sounds -- consonant stops produced by obstructing airflow in the vocal tract -- like 'p' or 't'." The application then examines these fragments to identify the specific phonemes.

The third sub-process is contextualization. The recorded phonemes are evaluated in the context of the surrounding phonemes in order to form words, phrases, and so on. This is accomplished by cross-checking the recorded combinations of phonemes against a library in order to narrow down the potential words until the application can identify the most probable interpretation of what was said. The output is then either displayed as text or executed as a command.
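The cross-checking idea can be sketched with a toy pronunciation lexicon. The word list, phoneme spellings, and scoring rule below are invented for illustration; real systems use dictionaries with tens of thousands of entries combined with statistical language models.

```python
# Toy sketch of the contextualization step: match a recognized phoneme
# sequence against a small pronunciation lexicon and pick the candidate
# word that fits best.
LEXICON = {
    "speech": ["s", "p", "iy", "ch"],
    "beach":  ["b", "iy", "ch"],
    "peach":  ["p", "iy", "ch"],
}

def match_score(heard, candidate):
    """Fraction of phonemes that line up position-by-position."""
    hits = sum(1 for h, c in zip(heard, candidate) if h == c)
    return hits / max(len(heard), len(candidate))

def best_word(heard):
    """Pick the lexicon entry with the highest match score."""
    return max(LEXICON, key=lambda w: match_score(heard, LEXICON[w]))

print(best_word(["s", "p", "iy", "ch"]))  # "speech"
print(best_word(["b", "iy", "ch"]))       # "beach"
```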

ViaVoice is IBM's speech recognition software for computers and mobile devices. ViaVoice provides automatic speech recognition and text-to-speech (TTS) capabilities with minimal processor requirements. In 2003, IBM sold its ViaVoice desktop products for Windows and Mac OS X to ScanSoft (its competitor, which owns Dragon NaturallySpeaking).

GAUDI is a new technology from Google Inc. that indexes audio and video files. It allows users to search for content (words) in a movie file the same way you would look for a word in a text document. For example, you could analyze President Obama's inauguration speech by searching for the word "Economy", and GAUDI would highlight the matching sections of the movie in your search results.

GAUDI uses its own speech technology to transform spoken words into text and then leverages Google's search technology to return results. One reason GAUDI became a popular project within Google was its use in analyzing political speeches during the recent American presidential election.

Eventually, most people will want to interact with their PCs and other computing devices through voice. Microsoft provides speech recognition software ("Microsoft's speech technologies") built into Windows Vista. Microsoft claims that "Windows Vista Speech Recognition provides excellent recognition accuracy that improves with each use as it adapts to your speaking style and vocabulary." However, during its first live demo at launch, the product did not work properly, and the event became one of Microsoft's most famous public embarrassments.

The recent release of speech recognition in Windows 7 has improved considerably over the version in Windows Vista. As it adapts to the user's speaking style and vocabulary, its accuracy improves with each use.

Supremis ATCC is a software company that focuses on developing speech technology for Air Traffic Control (ATC) training and operational support. The goal of the software is to reduce accidents through a system controlled entirely by intelligent response to the human voice.

Among Nuance's other Dragon products, such as speech recognition for Microsoft Word, one of its most widely used products is Dragon Medical. Instead of making doctors write notes while consulting with patients, Dragon Medical assists them by producing real-time notes, reports, and graphs during the consultation.

Shazam is song recognition software for mobile devices. Shazam enables users to identify songs (and singers) from their cellphones simply by recording the song they hear.

Real-time speech recognition is used by physicians for medical transcription, converting voice into the Electronic Health Record (EHR). This allows them to review, sign, and make their notes available in databases right away. The technology cuts transcription costs and saves data-entry time.

One of the most promising uses of speech recognition is to help people with disabilities. Voice-based computer-human interaction will enable blind people to take full advantage of computer technology. Recent research aims to develop voice control of robotic arms and environmental control units, including a Voice Activated Domestic Appliance System (VADAS).

Although the technology for disabled people is already available today, it is still at a very early stage, primarily due to human factors. Several human-factors issues are identified in the Challenges and Benefits section.


Source: NCBI U.S. National Library of Medicine

Speech recognition is especially useful for those who have difficulty using their hands, for example due to Repetitive Stress Injuries (RSI), a disability that prevents people from using conventional computer input devices. Frequent keyboard users who have developed RSI are the early target market for speech recognition technology.

As robots become more and more popular for assisting the elderly and performing common household chores (especially in Japan), speech recognition technology allows them to interact even more closely with humans without a conventional input device.

With the F-35 Lightning II, the Air Force has developed communication technology between pilot and aircraft. The F-35 is America's first fighter aircraft with a speech recognition system that understands a pilot's verbal commands to manage various aircraft functions, such as communications and navigation.

thepudding.com is a free phone service that determines which specific advertisements to show telecommunication users by applying speech recognition to phone conversations. Its platform manages the entire campaign life cycle across all mobile channels: SMS, MMS, mobile web, voice calls, video, and mobile apps.

Skype was also considering using a similar technology for its calls.

IVR (Interactive Voice Response) is a telephony technology that allows interaction between users and a phone system to acquire information from, or enter information into, a company's database.

IVR lets callers interact with a company, mainly for customer service. IVR systems read and analyze information from the company's database and then relay that information back to the customer in spoken form.

With the help of GAUDI's technology, videos from YouTube channels are automatically transcribed from speech to text and indexed. Users can now search not only the titles and descriptions of videos but also their spoken content. Speech recognition then allows users to fast-forward to the most relevant parts of a video.

In logistics, speech recognition serves as a verbal communication tool between employees and back-end management systems. A picker receives commands via a headset and confirms the completion of an operation by speaking into a microphone. This boosts picking speed and productivity, as pickers are freed from paper pick lists and mobile barcode readers. Though typically used during picking, it can also be integrated into other warehousing operations: placement, replenishment, quality control, etc.

6.1 Low signal-to-noise ratio

Speech recognition does not work as effectively when any of these conditions is present: high environmental noise, the use of a new microphone (different from the one used to train the system), or interfering signals. This can be mitigated with a high-quality, noise-cancelling microphone.
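The effect of background noise is often quantified as a signal-to-noise ratio (SNR). A minimal sketch of the standard decibel formula, with invented power values:

```python
import math

# Illustrative signal-to-noise ratio (SNR) calculation: the higher the
# ratio of speech power to background-noise power, the easier it is for
# a recognizer to isolate the spoken signal. Power values are invented.
def snr_db(signal_power, noise_power):
    """SNR in decibels: 10 * log10(P_signal / P_noise)."""
    return 10 * math.log10(signal_power / noise_power)

print(round(snr_db(1.0, 0.01), 1))   # 20.0 dB: quiet room
print(round(snr_db(1.0, 0.5), 1))    # 3.0 dB: noisy environment
```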

Source: fink  

6.2 Homonyms

Speech recognition software can lose accuracy when working with homonyms: words that sound exactly the same but have different meanings and spellings. Examples include week and weak, weather and whether, for and four, etc.

Solution: 'I scream' sounds a lot like 'ice cream', so the software examines such words and weighs the statistical probability of each candidate based on writing style and context. Statistical models such as the Hidden Markov Model (HMM) are used extensively to minimize this effect and improve the software's performance.
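A minimal sketch of this idea, assuming a toy bigram model with made-up counts (real systems estimate such statistics from large text corpora, often inside an HMM decoder):

```python
# Hedged sketch of homophone disambiguation with a toy bigram model:
# given the previous word, pick whichever candidate spelling occurs
# more often in that context. Counts are invented for illustration.
BIGRAM_COUNTS = {
    ("eat", "ice cream"): 90, ("eat", "I scream"): 1,
    ("loud", "I scream"): 40, ("loud", "ice cream"): 2,
}

def disambiguate(previous_word, candidates):
    """Choose the candidate most often seen after the previous word."""
    return max(candidates,
               key=lambda c: BIGRAM_COUNTS.get((previous_word, c), 0))

print(disambiguate("eat", ["ice cream", "I scream"]))   # "ice cream"
print(disambiguate("loud", ["ice cream", "I scream"]))  # "I scream"
```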

Source: howstuffworks

6.3 Voice Overlapping

One of the toughest challenges for a speech recognition system is to identify who is speaking when multiple users talk at the same time. This happens frequently in meetings and conversations where people constantly interrupt each other.

Solution: Using enhancements to existing technologies, a speech recognition system can now recognize multiple simultaneous speakers. This has allowed robots to detect the direction of each voice and hence attribute speech input to each individual.

Source: howstuffworks 

6.4 Continuous speech without gaps 

Speech without gaps is a major constraint, because gaps otherwise serve as key word-separation indicators for speech recognition software. An example is the phrase "recognize speech," which, when said rapidly and without gaps, sounds very similar to "wreck a nice beach". In this case, the program analyzes the context of the sentence, going over the previous phrase to choose the right meaning and basing this analysis on the phonemes. The two phrases would be broken down as follows: "r eh k ao g n ay z s p iy ch" for "recognize speech", and "r eh k ay n ay s b iy ch" for "wreck a nice beach".
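How close those two phoneme sequences are can be made concrete with an edit-distance calculation. This is an illustrative measure, not the actual matching algorithm of any particular recognizer:

```python
# Why "recognize speech" and "wreck a nice beach" are hard to tell
# apart: their phoneme sequences (from the example above) differ in
# only a few positions, as measured by Levenshtein edit distance.
def edit_distance(a, b):
    """Classic Levenshtein distance over phoneme lists."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        cur = [i]
        for j, pb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[-1] + 1,            # insertion
                           prev[j - 1] + (pa != pb)))  # substitution
        prev = cur
    return prev[-1]

recognize_speech = "r eh k ao g n ay z s p iy ch".split()
wreck_a_nice_beach = "r eh k ay n ay s b iy ch".split()
print(edit_distance(recognize_speech, wreck_a_nice_beach))  # 4
```

Only four edits separate twelve phonemes from ten, which is why context, rather than the raw sounds, has to settle the choice.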

Source: howstuffworks

6.5 Speech deterioration

Speech deterioration can occur when the human voice is impaired temporarily (by infections, viruses, sore throats, etc.) or permanently (by accidents, age, etc.).

Solution: By saving every dictation session, people with progressive speech deterioration can continue using these programs. Even if their voices change drastically from one year to the next, saving their speech data each session keeps the program constantly updated.

7.1 Professional life

With each business trip, mobile users struggle to carry their notebook computers, datebooks, files, and luggage. Lightweight, speech-enabled mobile devices will let business professionals use a single application to perform various tasks through voice recognition.

7.2 Personal

In the future, during the time you spend driving to the office, you will be able to check your stock portfolio, order gifts from a catalog, purchase books online, conduct bank transactions, or send e-mails. With tighter, voice-enabled integration of our daily devices (computer networks, telephones, the Internet, cars, home appliances, and home security), the new opportunities are endless and limited only by our imagination.

Source: myadvisor

7.3 Regulations

Microsoft owns a patent for the Automatic Censorship of Audio Data for Broadcast, an invention aimed at 'producing censored speech that has been altered so that undesired words or phrases are either unintelligible or inaudible.' The patent describes options for muting offensive words, replacing them with less offensive versions, or overwriting the undesired word with a masking sound, i.e., "bleeping" it with a tone.

Source: Slashdot.org

7.4 Education

In the future, learning CDs will become far more interactive than they are today. Interactive CDs not only impart knowledge and enhance vocabulary but also promote improved behavioural patterns. Some people believe that current learning CDs fail because they lack interactivity.

7.5 Global Autonomous Language Exploitation (GALE)

The GALE project is one of the largest ongoing speech recognition-related projects as of 2007, involving both speech recognition and translation components.

Source: wikipedia

"GALE program develops and applies computer software technologies to absorb, analyze and interpret huge volumes of speech and text in multiple languages. Automatic processing "engines" will convert and distill the data, delivering pertinent, consolidated information in easy-to-understand forms to military personnel and monolingual English-speaking analysts in response to direct or implicit requests. 

GALE consists of three major engines: Transcription, Translation and Distillation. The output of each engine is English text. The input to the transcription engine is speech and to the translation engine, text. Engines will pass along pointers to relevant source language data that will be available to humans and downstream processes. The distillation engine integrates information of interest to its user from multiple sources and documents. Military personnel will interact with the distillation engine via interfaces that could include various forms of human-machine dialogue (not necessarily in natural language)."

Source: LDC, University of Pennsylvania 

Speech recognition technology is becoming more and more prevalent. What was a novel premium feature a few years ago, affordable only to consumers of high-end technology, is today available to the mass market in any standard mobile phone or car.

Speech recognition lets us peek into the future and hints at what we could call the beginning of a new world. Although still primitive, integration between humans and machines is no longer futuristic fiction out of some Hollywood super-production, but a new reality that we must be aware of.

Every day machines become more and more interactive with humans through speech and touch. As technology becomes gradually integrated into our lives, we are changing the way we act, communicate, and establish our social relations; in other words, we are changing the way we live. 

The sources for this wiki are from various websites that are mentioned below each corresponding section.

This wiki is compiled in its simplest form to avoid clutter from detailed data. We intend it to serve only as a source of information, not of in-depth analysis. However, if anyone reading this wiki is keen on obtaining further details (e.g., the statistical modeling and the relevance of the Hidden Markov Model (HMM)), please contact any of the following contributors.