Implementation Of Speech Recognition System Computer Science Essay



Speech recognition is one of the fastest-growing technologies, with applications in many different areas serving people. Nearly 30% of the world's population suffers from various disabilities; many of these people are blind, have dyslexia, or are unable to use their hands effectively. For them, speech recognition systems provide significant help: they can share information with others by operating a computer through voice input instead of a keyboard or mouse.

This project was developed with that factor in mind, and a small effort has been made towards this aim. The project is capable of recognizing speech and converting the input audio into text. It also helps the user open different system software, such as MS Paint, Notepad and Calculator.

At this initial level the effort is limited to the basic operations discussed above, but the software can be updated and enhanced further to cover more operations.


Speech is one of the most powerful and natural ways for humans to interact. Computers, on the other hand, understand only binary, i.e. 0 and 1. If an application is created so that it can communicate with the computer using the human voice and be controlled by voice commands, it opens a vast opportunity in the field of IT and contributes to the concept of artificial intelligence. The idea of using speech as a medium to communicate with the computer is not new: many applications have been developed that serve mankind by communicating with the computer. A big milestone was achieved when this concept was applied to security mechanisms; voice recognition systems are commonly used in organizations where secure entry is a high priority. Many other fields also benefit from this concept, opening opportunities that are yet to be explored.

The report gives an overview of speech recognition technology and its applications. The first section describes the speech recognition process, its applications, its limitations and the future of the technology. The later part of the report covers the speech recognition process, the code of the software and its working. The last part discusses the different potential uses of the application and further improvements.


Understand speech recognition and its fundamentals.

Understand its working and applications in different areas.

Development of speech recognition software that can mainly be used for:

Speech Recognition

Speech Generation

Text Editing


Develop a speech recognition system that takes speech as input and, after recognition, writes out the corresponding text.

Literature Review

3.1 Speech Recognition:

Speech recognition is a technology that enables a computer to capture the words spoken by a human with the help of a microphone [1] [2]. These words are later recognized by a speech recognizer, and finally the system outputs the recognized words. The process of speech recognition consists of different steps that will be discussed in the upcoming sections.

The ideal situation in speech recognition is one in which the engine recognizes all words uttered by a human; in practice, however, the performance of a speech recognition engine depends on a number of factors. Vocabulary size, pronunciation, multiple users and a noisy environment are the major factors on which a speech recognition engine's performance depends [3].

3.2 Speech recognition Types

Speech recognition systems can be divided into a number of classes based on the kinds of utterances they are able to recognize. A few classes of speech recognition are described below:

3.2.1 Isolated Speech

Isolated words usually contain a pause between two utterances; this doesn't mean that it only accepts a single word but instead it requires one utterance at a time [4].

3.2.2 Connected Speech

Connected words or connected speech is similar to isolated speech but it allows separate utterances with minimal pause between them.

3.2.3 Continuous speech

Continuous speech allows the user to speak almost naturally; it is also called computer dictation.

3.2.4 Spontaneous Speech

At a basic level, it can be thought of as speech that is natural sounding and not rehearsed. An ASR system with spontaneous speech ability should be able to handle a variety of natural speech features such as words being run together, "ums" and "ahs", and even slight stutters.

3.3 Speech recognition System Components

3.3.1 Voice Input

Using a microphone, audio is input to the system. The PC's sound card produces the equivalent digital representation of the received audio [4] [5] [6].

3.3.2 Digitization

The process of converting the analog signal into a digital form is known as digitization [4]; it involves both sampling and quantization. Sampling converts the continuous signal into a discrete-time signal, while quantization approximates each sample's continuous range of amplitude values by a finite set of discrete levels.
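The two steps can be sketched in Python as a toy illustration (the function name and parameters below are invented for this example; in practice the sound card performs digitization in hardware):

```python
import math

def digitize(signal, duration_s, sample_rate_hz, bits):
    """Sample a continuous signal, then quantize each sample's amplitude."""
    levels = 2 ** bits                      # number of discrete amplitude levels
    n_samples = int(duration_s * sample_rate_hz)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz              # sampling: discrete time instants
        x = signal(t)                       # continuous amplitude in [-1.0, 1.0]
        # quantization: snap the amplitude to the nearest of `levels` steps
        q = round((x + 1.0) / 2.0 * (levels - 1))
        samples.append(q)
    return samples

# A 440 Hz tone digitized at 8 kHz with 8-bit resolution (toy parameters).
tone = lambda t: math.sin(2 * math.pi * 440 * t)
pcm = digitize(tone, duration_s=0.01, sample_rate_hz=8000, bits=8)
print(len(pcm), min(pcm), max(pcm))
```

With 8 bits there are 256 levels, so every sample ends up as an integer between 0 and 255, which is the kind of stream a speech engine then processes.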

3.3.3 Acoustic Model

An acoustic model is created by taking audio recordings of speech together with their text transcriptions, and using software to create statistical representations of the sounds that make up each word. It is used by a speech recognition engine to recognize speech [4]. The acoustic model breaks words down into phonemes [6].
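The phoneme idea can be illustrated with a toy sketch (this is not a real statistical acoustic model; the pronunciation table and scoring rule are invented for the example): each word is stored as a phoneme sequence, and a decoded sequence is matched against them.

```python
# Toy pronunciation dictionary: word -> phoneme sequence (invented inventory).
PRONUNCIATIONS = {
    "cat": ["K", "AE", "T"],
    "bat": ["B", "AE", "T"],
    "cap": ["K", "AE", "P"],
}

def best_word(observed_phonemes):
    """Pick the word whose phoneme sequence agrees at the most positions."""
    def score(word_phonemes):
        return sum(a == b for a, b in zip(observed_phonemes, word_phonemes))
    return max(PRONUNCIATIONS, key=lambda w: score(PRONUNCIATIONS[w]))

print(best_word(["K", "AE", "T"]))   # → cat
```

Real engines score phoneme hypotheses probabilistically (e.g. with hidden Markov models) rather than by exact position matches, but the word-as-phoneme-sequence representation is the same.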

3.3.4 Language Model

Language modeling is used in many natural language processing applications; in speech recognition it tries to capture the properties of a language and to predict the next word in the speech sequence [4]. The language model compares the phonemes to words in its built-in dictionary [6].
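Next-word prediction can be sketched with a minimal bigram model (a toy example over an invented corpus; real language models are trained on vastly larger text collections):

```python
from collections import Counter, defaultdict

# Toy training corpus (invented for illustration).
corpus = "i want to go home i want to eat i want to go out".split()

# Count how often each word follows each other word (bigram counts).
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent next word given the previous word."""
    return bigrams[word].most_common(1)[0][0]

print(predict_next("want"))  # → to
print(predict_next("to"))    # → go  ("go" follows "to" twice, "eat" once)
```

This is the sense in which a language model "predicts the next word": it biases the recognizer towards word sequences that are likely in the language.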

3.3.5 Speech engine

The job of the speech recognition engine is to convert the input audio into text [4]; to accomplish this it uses all sorts of data, software algorithms and statistics. Its first operation is digitization, as discussed earlier: the audio is converted into a format suitable for further processing. Once the audio signal is in the proper format, the engine searches for the best match by considering the words it knows; once the signal is recognized, it returns the corresponding text string.

3.4 Microsoft Speech API

Broadly, the Speech API can be viewed as middleware acting as a channel between applications and speech engines (recognition and synthesis). Applications use simplified higher-level objects instead of directly calling methods on the engines.

In the Speech API, applications and engines do not communicate with each other directly. Instead, each interacts with a runtime component (sapi.dll). This component implements one API used by applications and another set of interfaces used by engines [7].

Typically, SAPI applications issue calls through the API. The sapi.dll runtime component interprets these commands and processes them, where necessary calling on the engine through the engine interfaces. The recognition and synthesis engines also generate events during processing [7].

The Speech Application Programming Interface, or SAPI, is an API developed by Microsoft to allow the use of speech recognition and speech synthesis within Windows applications. To date, a number of versions of the API have been released, either as part of a Speech SDK or as part of the Windows OS. Applications that use SAPI include Microsoft Office, Microsoft Agent and Microsoft Speech Server [8].

In general all versions of the API have been designed in such a way that a software developer can develop an application to perform speech recognition and synthesis by using a standard set of interfaces, accessible from a variety of programming languages.

The Speech SDK version 5.0, incorporating the SAPI 5.0 runtime, was released in 2000. This was a complete redesign from previous versions, and neither engines nor applications which used older versions of SAPI could use the new version without considerable modification [8].

The design of the new API included the concept of strictly separating applications from engines, so that all calls are routed through the runtime sapi.dll. This change made the API more 'engine-independent': applications do not depend on features of a specific engine.

Features of the API include:

Shared recognizer. In desktop speech recognition applications, a recognizer object runs in a separate process (sapisvr.exe) [7]. Applications that use the shared recognizer communicate with this single instance, which allows resources to be shared and provides one global user interface for controlling all speech applications [7].

In-proc recognizer. Applications that require separate control of the recognition process use an in-proc recognizer object instead of the shared recognizer.

Grammar objects. Speech grammars are normally used to specify the words that the recognizer is listening for. Methods also exist for instructing the recognizer to load a built-in dictation language model [7].

Voice object. This performs speech synthesis, producing an audio stream from text [7].

Audio interfaces. The runtime includes objects for performing speech input from the microphone and speech output to speakers (or any sound device), as well as to and from .wav files [7].

User lexicon object. This object allows user-defined words and pronunciations to be added by a user or application. They are added to the recognition or synthesis engine's built-in lexicons [7].

Object tokens. These allow recognition and text-to-speech engines, audio objects, lexicons and other categories of object to be registered [7].


I have developed a speech recognition system using Microsoft Visual Studio as the platform tool, and the language that I have used in developing the application is (a programming language).

My application is a desktop application that uses the Microsoft Speech SDK, one of the new features Microsoft has introduced for creating AI applications such as speech recognition systems. It is one of the many tools that enable a developer to add speech capability to applications. The Speech SDK can be used from C#, C++, VB or any COM-compliant language supported by the Microsoft SDK.

In terms of programming, speech can be divided into two subcategories: text-to-speech conversion and speech recognition. In this report I shall be focusing on speech recognition.

4.1 Speech Recognition Engine:

The first step in creating the application is to create a speech engine. The speech engine can be created in either of two ways: in-proc or shared. Here I have created a shared engine that is accessible anywhere. With a shared engine, the Speech Application Programming Interface creates the speech recognition engine in a separate process, and all applications share this speech recognizer.

When we create a speech recognition engine, SAPI uses SetRecognizer to look up the object token that the application has specified. The object token contains the class ID (CLSID) of the main speech recognition engine, and an instance of this class is created [8].

The user-training UI might produce some adapted model files. These can be saved and their location stored in the RecoProfile object token; the engine can read the location of these files from the object token later [8].

The speech recognition engine then interacts with applications using events to which the application subscribes. The important events are the recognition event and the hypothesis event, raised when the engine makes a confident recognition or a hypothesis respectively.
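The event flow can be sketched conceptually in Python (the real engine is a COM component accessed through SAPI; the class and method names below are invented purely to illustrate the subscribe/raise pattern):

```python
# Conceptual analogue of SAPI's event model; all names here are invented.
class RecognizerSketch:
    def __init__(self):
        self.on_hypothesis = []     # callbacks for partial, uncertain results
        self.on_recognition = []    # callbacks for confirmed results

    def subscribe(self, event, callback):
        getattr(self, event).append(callback)

    def feed(self, text, final=False):
        # The engine raises hypothesis events while decoding, and a
        # recognition event once it commits to a result.
        callbacks = self.on_recognition if final else self.on_hypothesis
        for cb in callbacks:
            cb(text)

rec = RecognizerSketch()
results = []
rec.subscribe("on_hypothesis", lambda t: results.append(("hyp", t)))
rec.subscribe("on_recognition", lambda t: results.append(("rec", t)))
rec.feed("open note")                 # partial hypothesis while decoding
rec.feed("open notepad", final=True)  # confirmed recognition
print(results)  # → [('hyp', 'open note'), ('rec', 'open notepad')]
```

The application reacts to the recognition event to update its text or trigger commands, while hypothesis events can drive live feedback in the UI.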


In the speech recognition process, each application has at least one RecoContext object implementing ISpRecoContext. It is through this interface that the application creates and loads grammars and activates recognition. The Speech API informs the speech recognition engine of each RecoContext associated with it using OnCreateRecoContext and OnDeleteRecoContext. The speech recognition engine returns a pointer to the Speech API from the OnCreateRecoContext function, which is afterwards passed back to the engine in any future calls that need to refer to that RecoContext.

Grammar Handling

In speech recognition, the grammar is the central concept, and on the basis of the grammar speech recognition can be further classified into two types: command & control, and dictation.

A grammar, in other words, is the list of all possible recognition outputs that can be generated. An application can limit the possible combinations of words spoken by the user by choosing a proper grammar.

In a command-and-control scenario, a developer provides a limited set of possible word combinations to the system, and the speech recognition engine tries to match the words spoken by the user against this limited list. In command and control, the accuracy of recognizing words or combinations of words is very high. It is always better for applications to implement command and control where possible, to achieve higher recognition accuracy and make the application respond better.
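The essence of command and control can be sketched as a membership test against a fixed phrase list (a toy analogue of grammar matching; the command list is invented for the example):

```python
# Toy command-and-control grammar: the recognizer only accepts utterances
# that appear in this fixed phrase list (invented commands).
GRAMMAR = {"open notepad", "open calculator", "open paint", "close window"}

def recognize_command(utterance):
    """Accept the utterance only if it is one of the grammar's phrases."""
    phrase = utterance.lower().strip()
    return phrase if phrase in GRAMMAR else None

print(recognize_command("Open Notepad"))   # → open notepad
print(recognize_command("open firefox"))   # → None
```

Because the candidate set is tiny, the chance of confusing two commands is low, which is why command-and-control systems achieve such high accuracy.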

In dictation mode, the recognition engine compares the input speech against the complete list of dictionary words. In this mode, high recognition accuracy can be achieved if the user has previously trained the recognition engine by speaking into it. The engine can be trained by creating a profile using the Speech properties in the Control Panel.
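Dictation-style lookup against a full dictionary can be sketched with edit distance (a deliberate simplification; real engines compare acoustic and language-model scores, and the dictionary here is invented):

```python
# Toy dictionary; a real dictation vocabulary has tens of thousands of words.
DICTIONARY = ["speech", "recognition", "engine", "notepad", "calculator"]

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance (one rolling row)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[len(b)]

def nearest_word(decoded):
    """Return the dictionary entry closest to the decoded word."""
    return min(DICTIONARY, key=lambda w: edit_distance(decoded, w))

print(nearest_word("recogniton"))  # → recognition
```

The large candidate set is exactly why dictation is harder than command and control: the more words the engine must distinguish, the more confusable neighbours each word has.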


At the initial level I have been successful in recognizing the words that are spoken. If the words uttered by the user are present in the dictionary (a storage space), the speech recognition engine displays them. There are some problems, which I discuss in the next chapter.


There are several factors that can cause a speech recognition system to fail, and those factors sometimes affect my application too. These factors are:

6.1. Homonyms:

Homonyms are words that are spelled differently and have different meanings but share the same pronunciation, for example "there"/"their" and "be"/"bee". It is a challenge for a computer to distinguish between such phrases that sound alike.
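One common mitigation is to let the language model pick the spelling from context. A toy sketch (the context counts below are invented; real systems use counts from large corpora):

```python
# Invented counts of (previous word, candidate spelling) pairs from a
# hypothetical corpus; homophones sound identical, so context must decide.
CONTEXT_COUNTS = {
    ("over", "there"): 9, ("over", "their"): 1,
    ("of",   "their"): 8, ("of",   "there"): 2,
}

def disambiguate(prev_word, candidates):
    """Choose the spelling most often seen after the previous word."""
    return max(candidates, key=lambda c: CONTEXT_COUNTS.get((prev_word, c), 0))

print(disambiguate("over", ["there", "their"]))  # → there
print(disambiguate("of",   ["there", "their"]))  # → their
```

The acoustic model alone cannot separate "there" from "their"; only the surrounding words make one spelling more likely than the other.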

6.2. Overlapping speeches:

A second challenge in the process is understanding speech uttered by different users; current systems have difficulty separating simultaneous speech from multiple users.

6.3. Noise factor:

The program requires the words uttered by a human to be heard distinctly and clearly. Any extra sound can create interference: the system should first be placed away from noisy environments, and the user should speak clearly; otherwise the machine will get confused and mix up the words.

6.4. Pronunciation:

Words commonly have multiple pronunciations associated with them, so recognizing the correct word largely depends on the pronunciation of the spoken words.

6.5. The future of speech recognition.

• Accuracy will become better and better.

• Dictation speech recognition will gradually become accepted.

• Greater use will be made of "intelligent systems" which will attempt to guess what the speaker intended to say, rather than what was actually said, as people often misspeak and make unintentional mistakes.

• Microphone and sound systems will be designed to adapt more quickly to changing background noise levels, different environments, with better recognition of extraneous material to be discarded.