Speech Recognition Is Viewed Computer Science Essay



Speech recognition is viewed as an integral part of future human-computer interfaces, which are envisioned to use speech, among other means, to achieve natural, pervasive, and ubiquitous computing. Since speech is the most natural way for human beings to communicate, people recognized that speech recognition technology would help greatly when working with machines. In simple terms, speech recognition can be defined as converting spoken human language into machine-readable input, but applications of the technology show that its power goes beyond this definition. [3] Newer definitions hold that speech recognition is not only converting spoken words into a machine-readable format but also understanding the meaning of the speech and making decisions or responding accordingly. Speech recognition technology has developed dramatically over the last two decades owing to major advances in Natural Language Processing and Artificial Intelligence technologies. [4]


People use media such as text, visual elements, speech and signs to communicate with each other. Of these, speech is the most common and natural medium of human communication. In the early stages of industrial development, a variety of machines were built to perform many functions, and machine operators controlled them through control panels consisting of buttons, switches and similar devices. Likewise, most of today's computers rely on a mouse and a keyboard for interaction. [4] People later realized how much easier it would be to interact with machines if a natural interaction mechanism were available. Hence a great deal of research was carried out on interacting with machines through a natural mechanism such as speech. That is the origin of speech recognition technology. [3]

As an undergraduate with great curiosity about new innovations in the IT industry, I think speech recognition will be the area where most new research and innovation happens. Not only research and innovation but also new technologies may emerge in this area. I therefore think I can use my technological knowledge effectively by selecting a topic like "speech recognition" for my independent study.

The paper is organized as follows. Section 2 gives an overview of speech recognition techniques. Section 3 covers the major research being carried out on speech recognition techniques. Applications of speech recognition techniques are discussed in Section 4. Section 5 describes development environments. Future directions of speech recognition are presented in Section 6. Section 7 then discusses the essence of the research paper, and the paper ends with Section 8, which explains my findings from the research.


In simple terms, speech recognition can be defined as converting natural human speech into a machine-understandable format. A typical speech recognition system consists of modules such as signal analysis, acoustic analysis, acoustic model, sequential model, segmentation and time alignment. Many speech recognition techniques have been identified. The most popular models and algorithms are the Hidden Markov Model, Dynamic Time Warping, and neural network approaches (static and dynamic). [7]

Each of them has both pros and cons. Measuring performance is an important part of evaluating a speech recognition system. Speech recognition is considered an emerging research area, because thousands of studies have been carried out on it in terms of both technology improvement and application. [3] Research on topics such as parallel processing for speech recognition and multilingual speech recognition systems has been conducted to improve the technology. On the other hand, research on topics such as speech recognition for meetings, Smart web and auditory scene analysis has addressed applications of speech recognition techniques. Today, applications of speech recognition can be seen in almost every field. It has been successfully utilized in health care, the military, computing, automobiles, telephony and other fields.

Speech recognition has been a popular research area for the last two decades, and even today much research is conducted on improving the technology and its applications. Since some problems in speech recognition remain unsolved, researchers are trying to overcome them. They have identified several key challenges: the robustness of speech recognition systems; portability, which refers to rapidly designing, developing and deploying systems for new applications; adaptation to changing conditions; language barriers; and the design of language models [17]. In the meantime, research is being conducted on how speech recognition can be utilized in real-world settings. Popular ongoing research covers speech processing for meetings, the semantic web [16], multilingual speech recognition systems and speech recognition for mobile telephony.

A large number of applications of speech recognition can be identified in many fields. Health care (medical transcription) is one of the earliest applications of speech recognition [1]. In addition, it is used for medical querying and patient care in the health sector. The technology is also very widely utilized in the military, and it has been successfully integrated into some fighter jets and helicopters. Speech recognition allows commanders to interact with large military databases through a speech interface on the battlefield, enabling rapid access to information. Speech recognition is integrated into mobile telephony, making possible functions such as automatic dialing, speech to SMS, speech to e-mail and command selection. The use of speech recognition in computers has greatly benefited people with disabilities. The technology is used in the entertainment field as well. Speech recognition based computer video games are the next generation of video games. Microsoft has already developed an Xbox game console in which speech recognition is adopted.

The use of speech recognition in home and office automation has eased people's day-to-day work. Another interesting application of speech recognition is court reporting. Robotics is one of the emerging areas of Artificial Intelligence technology in which speech recognition is used to develop human-like robots.

It is also worth considering some forecasts made by the giants of the information technology field. IBM intends to have better-than-human automatic speech recognition (ASR) by 2010. Bill Gates predicted that by 2011 the quality of ASR will catch up to humans. Justin Rattner of Intel said in 2005 that by 2015 computers will have "strong capabilities" in speech-to-text [15].


Research on speech recognition techniques was initiated in the early 1940s, and since then much research has been carried out in terms of both technology and applications.


Speech data contain various types of information that can be used to identify the speaker. This includes speaker-specific information due to the vocal tract, the excitation source and behavioral features. [8] The information about behavioral features is embedded in the signal and can be used for speaker recognition. The speech analysis stage deals with choosing a suitable frame size for segmenting the speech signal for further analysis and feature extraction. The following three techniques can be used for speech analysis.

3.1.1 Segmentation analysis

In this case, speech is analyzed using a frame size and shift in the range of 10-30 ms to extract speaker information. Segmentation analysis is used to extract the vocal tract information needed for speaker recognition.

3.1.2 Sub segmental analysis

Sub segmental analysis uses a frame size and shift in the range of 3-5 ms. This technique is mainly used to analyze and extract the features of the excitation state.

3.1.3 Supra segmental analysis

In this case, a frame size is likewise used to analyze the speech. This technique is mainly used to analyze the characteristics that arise from the speaker's behavioral traits.
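The framing step shared by these analyses can be sketched as follows. This is a minimal illustration rather than a reference implementation: the function name `frame_signal`, the 8 kHz sample rate and the silent placeholder signal are all assumptions, and the frame sizes and shifts are simply picked from the 10-30 ms and 3-5 ms ranges quoted above. No values are given in the text for supra segmental analysis, so it is left out.

```python
def frame_signal(samples, sample_rate, frame_ms, shift_ms):
    """Split a list of samples into (possibly overlapping) frames.

    frame_ms is the frame size and shift_ms the frame shift, both in
    milliseconds. Trailing samples that do not fill a frame are dropped.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    if frame_len <= 0 or shift <= 0:
        raise ValueError("frame size and shift must be at least one sample")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames


# One second of placeholder audio (silence) at an assumed 8 kHz sample rate.
sample_rate = 8000
samples = [0.0] * sample_rate

# Segmental analysis: frame size and shift chosen from the 10-30 ms range.
segmental = frame_signal(samples, sample_rate, frame_ms=20, shift_ms=10)

# Sub segmental analysis: frame size and shift chosen from the 3-5 ms range.
sub_segmental = frame_signal(samples, sample_rate, frame_ms=5, shift_ms=3)
```

The shorter frames of sub segmental analysis give many more frames per second, which is what allows faster-changing excitation features to be followed.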



Speech feature extraction reduces the dimensionality of the input vector while maintaining the discriminating power of the signal. From the basic formulation of speaker identification and verification systems, we know that the number of training and test vectors needed for the problem grows with the dimension of the input; feature extraction of the speech signal is therefore needed. [9]

Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Independent Component Analysis (ICA) and Linear Predictive Coding (LPC) are some feature extraction techniques.

Feature extraction techniques can be described by the features they target: spectral features such as band energies, formants, spectrum and cepstral coefficients, which mainly carry speaker-specific information due to the vocal tract; excitation source features such as pitch and variation in pitch; and long-term features such as duration and energy information, which are due to behavioral traits. [10]

Figure 1: Feature extraction diagram [7]. Labels in the diagram: continuous speech, windowed frame, discrete Fourier transform, magnitude spectrum, mel frequency warping, mel spectrum, inverse DFT, mel cepstrum.
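To make the mel frequency warping step named in Figure 1 more concrete, the sketch below converts between hertz and mels and places filter centre frequencies evenly on the mel scale. The essay does not state a warping formula; the one used here, mel = 2595 log10(1 + f/700), is a commonly used choice and is an assumption, as are the function names, the 0-4000 Hz band and the choice of ten filters.

```python
import math


def hz_to_mel(freq_hz):
    """Map a frequency in Hz to the mel scale (assumed common formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(low_hz, high_hz, num_filters):
    """Return filter centre frequencies (Hz) spaced evenly in mels."""
    low_mel = hz_to_mel(low_hz)
    high_mel = hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (num_filters + 1)
    return [mel_to_hz(low_mel + step * (i + 1)) for i in range(num_filters)]


# Ten illustrative filter centres between 0 Hz and 4 kHz.
centers = mel_center_frequencies(0.0, 4000.0, 10)
```

Because the spacing is uniform in mels, the centres come out closely packed at low frequencies and progressively wider apart at high frequencies.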



Principal Component Analysis (PCA)
Property: linear feature extraction method (linear map); fast; eigenvector-based.
Procedure for implementation: traditional eigenvector-based method; good for Gaussian data.

Linear Discriminant Analysis (LDA)
Property: linear feature extraction method (supervised linear map); fast.
Procedure for implementation: better than PCA for classification.

Independent Component Analysis (ICA)
Property: linear feature extraction method (linear map); iterative; non-Gaussian.
Procedure for implementation: blind source separation; used for de-mixing non-Gaussian sources.

Linear Predictive Coding (LPC)
Property: static feature extraction method; 10 to 16 lower-order coefficients.
Procedure for implementation: used for feature extraction at lower order.

Spectral subtraction
Property: robust feature extraction method.
Procedure for implementation: used on the basis of the spectrogram.

Table 1: List of techniques with their properties for feature extraction. [7]
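As an illustration of the linear predictive coding entry in Table 1, the sketch below estimates predictor coefficients for one frame using the autocorrelation method with the Levinson-Durbin recursion. The essay does not say how the coefficients are computed, so this choice of method is an assumption, as are the function names and the synthetic test frame. An order of 2 keeps the example small; the table suggests 10 to 16 in practice.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation of a frame for lags 0..max_lag."""
    n = len(frame)
    return [
        sum(frame[i] * frame[i + lag] for i in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def lpc(frame, order):
    """Return (coefficients, prediction_error) for the given order.

    The coefficients a[0..order-1] predict a sample from the previous
    ones: x[n] is approximated by sum(a[k-1] * x[n-k] for k = 1..order).
    """
    r = autocorrelation(frame, order)
    a = [0.0] * (order + 1)  # a[0] is unused; a[k] multiplies x[n-k]
    err = r[0]
    if err == 0.0:  # silent frame: nothing to predict
        return a[1:], 0.0
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient for this step
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= (1.0 - k * k)
        if err <= 0.0:  # frame is already perfectly predicted
            break
    return a[1:], err


# Synthetic frame in which each sample is 0.9 times the previous one.
frame = [0.9 ** n for n in range(200)]
coefficients, error = lpc(frame, order=2)
```

For this synthetic frame the first coefficient comes out very close to 0.9 and the second very close to zero, which matches how the frame was generated.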


The main aim of a modeling technique is to generate speaker models using speaker-specific feature vectors. Modeling techniques can be categorized into two classes, speaker recognition and speaker identification. Speaker identification automatically identifies who is speaking. Speech recognition can be categorized into two parts.

They are speaker independent and speaker dependent. In speaker independent mode, the computer should ignore the speaker-specific characteristics of the speech signal and extract the intended message. In the case of speaker recognition, on the other hand, the machine should extract the speaker characteristics in the acoustic signal.

The basic objective of speaker identification is to compare a speech signal from an unknown speaker with a database of known speakers. The system, which is trained with a number of speakers, can then recognize the speaker. Speaker recognition can also be divided into two types, text-independent and text-dependent methods. In the text-dependent method, the speaker says sentences or key words that have the same text in both the training and the recognition trials. The text-independent method does not rely on a specific text being spoken. [11]

The following modeling approaches can be used in the speech recognition process:

3.3.1 Pattern Recognition approach

The pattern recognition approach involves two main steps, namely pattern training and pattern comparison. The essential feature of this approach is that it uses a well formulated mathematical framework and establishes consistent speech pattern representations, for reliable pattern comparison, from a set of labeled training samples via a formal training algorithm. [12] Pattern recognition has been developed over two decades, has received much attention, and has been applied widely to many practical problems. A speech pattern representation can take the form of a speech template or a statistical model and can be applied to a sound (smaller than a word), a word, or a phrase. In the pattern comparison stage of the approach, the unknown speech (the speech to be recognized) is compared directly with each possible pattern learned in the training stage in order to determine the identity of the unknown according to the goodness of match of the patterns. [7]

3.3.2 Template based approaches

In template based approaches, unknown speech is compared against a set of pre-recorded templates in order to find the best match. This has the main advantage of using perfectly reliable word models.

A collection of prototypical speech patterns is stored as reference patterns representing the dictionary of candidate words. Recognition is then carried out by matching an unknown spoken utterance with each of these reference templates and selecting the category of the best matching pattern. Usually templates for entire words are constructed. This has the advantage that errors due to segmentation or classification of smaller, acoustically more variable units such as phonemes can be avoided. In turn, each word must have its own full reference template; template preparation and matching become prohibitively expensive or impractical as vocabulary size increases beyond a few hundred words. [7]

One key idea in the template method is to derive typical sequences of speech frames for a pattern (a word) via some averaging procedure, and to rely on local spectral distance measures to compare patterns. Another key idea is to use some form of dynamic programming to temporally align patterns, to account for differences in speaking rates across talkers as well as across repetitions of the word by the same talker. The method also has the disadvantage that pre-recorded templates are fixed.
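The dynamic programming alignment described above is commonly realized as Dynamic Time Warping (DTW), which the essay names among the popular algorithms. The following is a minimal sketch, assuming Euclidean distance as the local distance measure and made-up one-dimensional feature vectors; the function names and the toy templates are illustrative only.

```python
import math


def local_distance(u, v):
    """Euclidean distance between two feature vectors (assumed measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def dtw_distance(seq_a, seq_b):
    """Accumulated cost of the best time alignment of two frame sequences."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    # cost[i][j]: best cost of aligning the first i frames of seq_a
    # with the first j frames of seq_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = local_distance(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # seq_a advances, seq_b frame repeats
                cost[i][j - 1],      # seq_b advances, seq_a frame repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def recognize(utterance, templates):
    """Return the word whose template aligns best with the utterance."""
    return min(templates, key=lambda word: dtw_distance(utterance, templates[word]))


# Toy whole-word templates with one feature per frame (made-up values).
templates = {
    "yes": [[1.0], [2.0], [3.0]],
    "no": [[3.0], [2.0], [1.0]],
}

# A slower rendition of "yes": the same pattern, stretched in time.
utterance = [[1.0], [1.0], [2.0], [2.0], [3.0]]
best_word = recognize(utterance, templates)
```

The stretched utterance still matches the "yes" template with zero accumulated cost, because the alignment is allowed to repeat template frames to absorb the slower speaking rate.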

3.3.3 Knowledge based approaches

Expert knowledge about variations in speech is hand-coded into a system. This has the advantage of explicitly modeling variations in speech; unfortunately, such expert knowledge is difficult to obtain and use successfully. Thus this approach was judged to be impractical, and automatic learning procedures were sought instead. [11] Vector Quantization (VQ) is often applied to automatic speech recognition (ASR). It is useful for speech coders, i.e., for efficient data reduction. Since transmission rate is not a major issue for ASR, the utility of VQ here lies in the efficiency of using compact codebooks for reference models and codebook searches in place of more costly evaluation methods. For isolated word recognition (IWR), each vocabulary word gets its own VQ codebook, based on a training sequence of several repetitions of the word.
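The per-word codebook idea can be sketched as follows: each frame of the unknown utterance is quantized to its nearest codeword in a word's codebook, and the word whose codebook gives the lowest average distortion is chosen. This is a hedged illustration only. The codebooks below are hand-written, hypothetical values; in practice they would be trained from several repetitions of each word, and that training step is omitted here. The squared Euclidean distortion measure is also an assumption.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def average_distortion(frames, codebook):
    """Mean distance from each frame to its nearest codeword."""
    total = sum(
        min(squared_distance(frame, codeword) for codeword in codebook)
        for frame in frames
    )
    return total / len(frames)


def classify(frames, codebooks):
    """Pick the word whose codebook quantizes the frames most accurately."""
    return min(codebooks, key=lambda word: average_distortion(frames, codebooks[word]))


# Hypothetical two-codeword codebooks for a two-word vocabulary.
codebooks = {
    "one": [[0.0, 0.0], [1.0, 1.0]],
    "two": [[5.0, 5.0], [6.0, 4.0]],
}

# Made-up feature vectors for an unknown utterance.
unknown = [[0.1, -0.1], [0.9, 1.2], [1.1, 0.8]]
recognized = classify(unknown, codebooks)
```

Only the codewords need to be stored and searched for each word, which is the compactness the paragraph above refers to.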


Whole-word matching and sub-word matching are the two matching techniques.

In whole-word matching, the incoming digital audio signal is compared against a prerecorded template of the word.

This technique takes less processing than sub-word matching, but it requires the user to prerecord every word that is to be recognized, sometimes hundreds of thousands of words, and these templates require huge amounts of storage. [13]

In sub-word matching, the engine looks for sub-words, usually phonemes, and then performs further pattern recognition on them. This requires more processing than whole-word matching but much less storage. [14]


In the early stages of its development, speech recognition was used in a limited number of fields. As it developed further, researchers tended to use the technology in almost every area. As a result, many applications of speech recognition can be seen nowadays. [17]

Health Care Field

One of the earliest and most successful uses of speech recognition occurred in the health care field. It was used especially in medical transcription; however, its inception was not very successful, because it was introduced as a way to eliminate transcription completely rather than to support it. Later, the transcription service was redistributed rather than replaced.

In medical document processing, speech recognition is implemented either at the front end or at the back end. Front-end SR is where the provider dictates into a speech recognition engine; the recognized words are displayed right after they are spoken, and the dictator is responsible for editing and signing off on the document. The document never goes through a medical transcriptionist (MT) or editor. [18]


Speech recognition is more widely used in the military than in any other field. It has been adopted in settings ranging from high-performance fighter jets to battle management. Substantial efforts have been devoted in the last decades to the testing and evaluation of speech recognition in fighter aircraft.

Battle Management

History shows that speech recognition has been very useful in battle management. Battle management command centers generally need rapid access to, and control of, a number of rapidly changing information databases. System operators and commanders need to query those databases in an eyes-busy environment, where much of the data is shown in a display format. Human-machine interaction by voice can be very helpful in those environments.

Modern Pocket PCs

Today, modern Pocket PCs and smartphones, such as those running Symbian or Windows Mobile, are integrated with speech recognition capabilities. Speech is used as part of the user interface (UI), for creating custom or pre-defined speech commands.

Telephone Industry

The telephone industry is one of the areas that has utilized speech recognition most effectively. Automatic dialing, speech to SMS and speech to mobile e-mail are the most common applications of speech recognition in the telephone industry.

Life of disabled people

Some speech recognition applications have made a great impact on the lives of disabled people. Speech recognition is very useful for people who are unable to use their hands or have difficulty using them [18], from mild stress injuries to involved disabilities that require alternative input to support access to the computer.


In gaming, speech is used to give commands instead of clicking a mouse, moving a joystick or pressing keys. It has thus introduced a new dimension to video gaming.


Robotics is another field in which speech recognition technology is successfully utilized. Due to increasing demands for symbiosis of humans and robots, humanoid robots are increasingly expected to possess perceptual capabilities similar to those of humans. In particular, hearing capabilities are essential for social interaction, because spoken communication is very important for normal-hearing people. Speech recognition has done much to overcome this weakness of robots.

Automatic Navigation Systems

Automatic navigation systems with integrated speech recognition have been successfully tested. They considerably reduce the workload of the driver. A method of providing navigational information to a vehicle operator includes processing destination information spoken by the vehicle operator with an on-board processing system on the vehicle. The processed speech information is transmitted via a wireless link to a remote data center and analyzed with a voice recognition system at the remote data center to recognize components of the destination information spoken by the vehicle operator. The accuracy of the recognition of the components of the destination information is confirmed via interactive speech exchanges between the vehicle operator and the remote data center. A destination is determined from the confirmed components of the destination information, route information to the destination is generated at the remote data center, and the route information is transmitted to the on-board processing system from the remote data center via the wireless link.


The future of speech recognition techniques is very bright because they are applicable in most fields.

5.1 Language Modeling

Current systems use statistical language models to help reduce the search space and resolve acoustic ambiguity. As vocabulary size increases and other constraints are relaxed to create more habitable systems, it will be increasingly important to get as much constraint as possible from language models, perhaps by incorporating semantic and syntactic constraints that cannot be captured by purely statistical models.
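As a rough illustration of how a statistical language model can resolve acoustic ambiguity, the sketch below trains a bigram model on a tiny made-up corpus and uses it to rank two hypotheses that might sound alike. The bigram order, the add-one smoothing, the corpus and all function names are assumptions chosen for brevity; they are not taken from any particular system.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count contexts and word pairs, with sentence boundary markers."""
    context_counts = Counter()
    bigram_counts = Counter()
    vocab = {"</s>"}
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(words[1:])
        context_counts.update(words[:-1])
        bigram_counts.update(zip(words[:-1], words[1:]))
    return context_counts, bigram_counts, vocab


def bigram_prob(prev, word, model):
    """P(word | prev) with add-one smoothing over the vocabulary."""
    context_counts, bigram_counts, vocab = model
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))


def log_score(sentence, model):
    """Log probability of a whole sentence under the bigram model."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log(bigram_prob(prev, word, model))
        for prev, word in zip(words[:-1], words[1:])
    )


# Tiny illustrative training corpus (made up).
corpus = ["recognize speech", "recognize speech today", "a nice beach"]
model = train_bigram(corpus)

# Two acoustically similar hypotheses; the model prefers the likelier one.
hypotheses = ["recognize speech", "recognize beach"]
best_hypothesis = max(hypotheses, key=lambda h: log_score(h, model))
```

A recognizer would combine such a score with the acoustic score, so that word sequences the language model considers unlikely are pruned from the search.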

5.2 Robustness

In a robust system, performance degrades gracefully as conditions become more different from those under which it was trained. Differences in acoustic environment and channel characteristics must receive particular attention.

5.3 Adaptation

How can a system continuously adapt to changing conditions, such as new speakers, microphones and tasks, and improve through use? Such adaptation can occur at many levels in a system: sub-word models, word pronunciations, language models, etc.

5.4 Portability

This refers to the aim of rapidly designing, developing and deploying systems for new applications. At present, systems tend to suffer significant degradation when moved to a new task. They must be trained on examples specific to the new task in order to return to peak performance, which is expensive and time consuming.


Speech is the most natural way for human beings to communicate. Yet today most machines have interfaces such as keyboards, control panels or similar devices for interacting with users. In simple terms, speech recognition can be defined as converting natural human speech into a machine-understandable format. A typical speech recognition system consists of modules such as signal analysis, acoustic analysis, acoustic model, sequential model, segmentation and time alignment. Many speech recognition models and algorithms have been developed. The most popular are the Hidden Markov Model, Dynamic Time Warping, and neural network approaches (static and dynamic). [1]

Speech recognition technology has developed dramatically over the last two decades owing to major advances in Natural Language Processing and Artificial Intelligence technologies. For example, multilingual speech recognition systems have been developed as a result of adopting other emerging technologies. In the beginning, speech recognition technology was used in limited areas such as health care and the military. Today, however, people are experimenting with the technology in many areas, such as home automation and hands-free computing. It therefore seems likely that speech recognition technology will be used in almost every area. [3]

In this research paper I have discussed what speech recognition is, the technologies used for speech recognition, the requirements behind it, the applications of speech recognition techniques and their advantages, the kind of environment needed to develop speech recognition systems, and future directions.

In future, the communication interface of most machines will surely be a speech recognition interface.


I am heartily thankful to my supervisor, Dr H.T. Chaminda, who introduced me to this beautiful subject and whose encouragement, supervision and support from the preliminary to the current level enabled me to develop this research work. Lastly, I would like to offer my sincere thanks to my mother, father, lecturers, friends and all those who supported me in any respect during the completion of this research work.