This article focuses on Automatic Speech Recognition systems. Speech recognition is a field that draws on many different disciplines. As more research is done in this field, the application possibilities of speech recognition increase, including military as well as health care uses.
This paper presents a description of the basic speech recognition categories. It then goes on to discuss Finite State Machines and their use in ASR. The paper looks briefly at some ways in which speech recognition technology may be applied. It also focuses on the need to develop a framework for Speech Application Development.
Keywords- Automatic Speech Recognition (ASR); Finite State Machines (FSM).
1.1 Definition of speech recognition:
Speech Recognition (also known as Automatic Speech Recognition (ASR) or computer speech recognition) is the process of converting a speech signal to a sequence of words by means of an algorithm implemented as a computer program.
1.2 Applications for Speech Recognition
There are numerous applications for ASR, and a few military and health applications will be discussed here briefly.
1.2.1 Military Uses 
Command and Control on the Move (C2OTM) is an American army project that aims to keep command and control entities mobile along with mobile troops in a war zone. Figure 1 shows some of these mobile force elements that require C2OTM.
Figure 1: C2OTM force elements and human-machine communication by voice
One example of how speech recognition will be used in this project is that a foot soldier's spoken description of what is being observed can be used to assess the battlefield situation and to aid in weapons system selection. Another is field repair and maintenance, which can be aided by voice access to information and a helmet-mounted display that shows the information.
Figure 2: Combat team tactical training system concept and applications of speech based technology
The American Navy Personnel Research and Development Center has proposed creating a combat team tactical training application, illustrated in Figure 2. The aim of this project is to have personnel respond to ongoing combat simulations using voice, typing, trackballs, and other modes, so as to communicate both with the machine and with each other.
An air force application being investigated by the United Kingdom's Defense Research Agency will recognize pilots' voices and allow them to enter reconnaissance reports. A simpler air force application being researched is voice control of radio frequencies, displays and gauges, in order to increase mission effectiveness and the safety of the pilots.
Figure 3 shows a matrix of the classes of different voice applications against the interests of various military and government end users.
[Figure 3 labels: Data Entry and Communication; Command and Control; Naval CIC Officer; Air Traffic Control; Joint Force Commander]
Figure 3: Classes of different voice applications
1.2.2 Health Care Uses
There are many more applications of speech recognition in society outside of the military. One such application in the health care industry is automatic medical transcription. This avenue is being seriously considered because it may prove to be more cost effective, with projected savings of $230 a week in 1998.
II. LITERATURE SURVEY
2.1 Basic Model of Speech Recognition:
Research in speech processing and communication was, for the most part, motivated by people's desire to build mechanical models that emulate human verbal communication capabilities. Speech is the primary and most natural form of human communication, and speech processing has been one of the most exciting areas of signal processing. Speech recognition technology has made it possible for computers to follow human voice commands and understand human languages. The main goal of the speech recognition area is to develop techniques and systems for speech input to machines. For reasons ranging from technological curiosity about the mechanisms for mechanical realization of human speech capabilities to the desire to automate simple tasks that require human-machine interaction, research in automatic speech recognition by machines has attracted a great deal of attention for sixty years.
Based on major advances in statistical modeling of speech, automatic speech recognition systems today find widespread application in tasks that require a human-machine interface, such as automatic call processing in telephone networks and query-based information systems that provide updated travel information, stock price quotations and weather reports. Other applications include data entry, voice dictation, access to travel and banking information, command entry, avionics, automobile portals, speech transcription, aids for handicapped (for example, blind) people, supermarkets and railway reservations. Speech recognition technology has been increasingly used within telephone networks to automate as well as to enhance operator services. This report reviews major highlights of the last six decades in the research and development of automatic speech recognition, so as to provide a technological perspective. Although much technological progress has been made, many research issues remain to be tackled.
The recognition process is shown below (Fig. 4).
Figure 4: Basic model of speech recognition
2.2 ASR Approaches
There are three approaches to ASR:
Acoustic Phonetic Approach
Pattern Recognition Approach
Artificial Intelligence Approach
2.2.1 Acoustic phonetic approach:
The earliest approaches to speech recognition were based on finding speech sounds and providing appropriate labels to these sounds. This is the basis of the acoustic phonetic approach (Hemdal and Hughes 1967), which postulates that there exist finite, distinctive phonetic units (phonemes) in spoken language and that these units are broadly characterized by a set of acoustic properties that are manifested in the speech signal over time. Even though the acoustic properties of phonetic units are highly variable, both with speakers and with neighboring sounds (the so-called coarticulation effect), it is assumed in the acoustic phonetic approach that the rules governing the variability are straightforward and can be readily learned by a machine.
The first step in the acoustic phonetic approach is a spectral analysis of the speech combined with a feature detection that converts the spectral measurements to a set of features that describe the broad acoustic properties of the different phonetic units. The next step is a segmentation and labeling phase in which the speech signal is segmented into stable acoustic regions, followed by attaching one or more phonetic labels to each segmented region, resulting in a phoneme lattice characterization of the speech. The last step in this approach attempts to determine a valid word (or string of words) from the phonetic label sequences produced by the segmentation and labeling. In the validation process, linguistic constraints on the task (i.e., the vocabulary, the syntax, and other semantic rules) are invoked in order to access the lexicon for word decoding based on the phoneme lattice. The acoustic phonetic approach has not been widely used in most commercial applications (refer to fig. 2.32, p. 81). Table 3 broadly gives the different speech recognition techniques.
2.2.2 Pattern Recognition approach:
The pattern-matching approach (Itakura 1975; Rabiner 1989; Rabiner and Juang 1993) involves two essential steps, namely pattern training and pattern comparison. The essential feature of this approach is that it uses a well-formulated mathematical framework and establishes consistent speech pattern representations, for reliable pattern comparison, from a set of labeled training samples via a formal training algorithm.
A speech pattern representation can be in the form of a speech template or a statistical model (e.g., a hidden Markov model or HMM) and can be applied to a sound (smaller than a word), a word, or a phrase. In the pattern-comparison stage of the approach, a direct comparison is made between the unknown speech (the speech to be recognized) and each possible pattern learned in the training stage, in order to determine the identity of the unknown according to the goodness of match of the patterns.
The pattern-matching approach has become the predominant method for speech recognition in the last six decades (refer to fig. 2.37, p. 87).
A block schematic diagram of pattern recognition is presented in Fig. 5 below. Within this approach there exist two methods, namely the template approach and the stochastic approach.
Figure 5: Block diagram of a pattern recognition speech recognizer
2.2.2.1 Template Based Approach:
Template based approaches to speech recognition have provided a family of techniques that have advanced the field considerably during the last six decades. The underlying idea is simple. A collection of prototypical speech patterns is stored as reference patterns representing the dictionary of candidate words. Recognition is then carried out by matching an unknown spoken utterance with each of these reference templates and selecting the category of the best matching pattern. Usually templates for entire words are constructed. This has the advantage that errors due to segmentation or classification of smaller, acoustically more variable units such as phonemes can be avoided. In turn, each word must have its own full reference template; template preparation and matching become prohibitively expensive or impractical as vocabulary size increases beyond a few hundred words. One key idea in the template method is to derive typical sequences of speech frames for a pattern (a word) via some averaging procedure, and to rely on the use of local spectral distance measures to compare patterns. Another key idea is to use some form of dynamic programming to temporally align patterns, to account for differences in speaking rates across talkers as well as across repetitions of the word by the same talker.
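The template idea can be illustrated with a minimal sketch of dynamic time warping (DTW), one common form of the dynamic programming alignment mentioned above. The function names, the Euclidean local distance and the one-dimensional toy "feature vectors" are assumptions made for illustration only; a real system would compare spectral feature vectors.

```python
import math

def local_distance(x, y):
    """Euclidean distance between two feature vectors (assumed local measure)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))

def dtw_distance(a, b):
    """Accumulated cost of the best temporal alignment of sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = local_distance(a[i - 1], b[j - 1])
            # a frame may be stretched, compressed, or matched one-to-one
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def recognize(utterance, templates):
    """Return the word whose reference template aligns best with the utterance."""
    return min(templates, key=lambda word: dtw_distance(utterance, templates[word]))

# Hypothetical one-dimensional templates for two words.
templates = {
    "dva": [[1.0], [2.0], [3.0]],
    "osm": [[3.0], [1.0], [0.0]],
}
# A slower rendition of the first template: each early frame is repeated.
slow_utterance = [[1.0], [1.0], [2.0], [2.0], [3.0]]
best_word = recognize(slow_utterance, templates)
```

Because the alignment may repeat frames, the slower utterance still matches its template at zero cost, which is the speaking-rate robustness the paragraph describes. The cost of matching grows with the number of templates, which reflects the vocabulary-size limitation noted above.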
2.2.2.2 Stochastic Approach:
Stochastic modeling entails the use of probabilistic models to deal with uncertain or incomplete information. In speech recognition, uncertainty and incompleteness arise from many sources, for example confusable sounds, speaker variability, contextual effects, and homophones. Thus, stochastic models are a particularly suitable approach to speech recognition. The most popular stochastic approach today is hidden Markov modeling. A hidden Markov model is characterized by a finite-state Markov chain and a set of output distributions. The transition parameters in the Markov chain model temporal variabilities, while the parameters in the output distributions model spectral variabilities. These two types of variability are the essence of speech recognition.
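As a rough illustration of how transition parameters and output distributions combine, the following sketch computes the likelihood of an observation sequence with the forward algorithm for a discrete-output HMM. The two-state left-to-right model, its probabilities and the symbols "a" and "b" are invented for the example; practical recognizers use continuous output densities over feature vectors.

```python
def forward_likelihood(observations, initial, transition, emission):
    """P(observations | model) for a discrete HMM via the forward algorithm.

    observations must be non-empty; initial[s], transition[p][s] and
    emission[s][symbol] are probabilities indexed by state number.
    """
    states = range(len(initial))
    # alpha[s] = probability of the observations so far, ending in state s
    alpha = [initial[s] * emission[s][observations[0]] for s in states]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[p] * transition[p][s] for p in states) * emission[s][obs]
            for s in states
        ]
    return sum(alpha)

# Hypothetical two-state left-to-right model.
initial = [1.0, 0.0]                      # always start in state 0
transition = [[0.5, 0.5], [0.0, 1.0]]     # temporal variability
emission = [{"a": 0.9, "b": 0.1},         # spectral variability, state 0
            {"a": 0.2, "b": 0.8}]         # spectral variability, state 1

likelihood = forward_likelihood(["a", "b"], initial, transition, emission)
```

For this toy model the likelihood of the sequence a, b is 0.9 × 0.5 × 0.1 + 0.9 × 0.5 × 0.8 = 0.405. In a recognizer, one such model per word or sub-word unit would be scored and the most likely one selected.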
Compared to the template based approach, hidden Markov modeling is more general and has a firmer mathematical foundation. A template based model is simply a continuous density HMM with identity covariance matrices and a slope-constrained topology. Although templates can be trained on fewer instances, they lack the probabilistic formulation of full HMMs and typically underperform HMMs. Compared to knowledge based approaches, HMMs enable easy integration of knowledge sources into a compiled architecture. A negative side effect of this is that HMMs do not provide much insight into the recognition process. As a result, it is often difficult to analyze the errors of an HMM system in an attempt to improve its performance. Nevertheless, prudent incorporation of knowledge has significantly improved HMM based systems.
2.2.3 Artificial Intelligence approach
The Artificial Intelligence approach is a hybrid of the acoustic phonetic approach and the pattern recognition approach.
It exploits the ideas and concepts of both acoustic phonetic and pattern recognition methods. The knowledge based approach uses linguistic, phonetic and spectrogram information. Some speech researchers developed recognition systems that used acoustic phonetic knowledge to develop classification rules for speech sounds. While template based approaches have been very effective in the design of a variety of speech recognition systems, they provided little insight into human speech processing, thereby making error analysis and knowledge-based system enhancement difficult. Nature provided insights into and understanding of human speech processing. In its pure form, knowledge engineering design involves the direct and explicit incorporation of experts' speech knowledge into a recognition system. This knowledge is usually derived from careful study of spectrograms and is incorporated using rules or procedures. Pure knowledge engineering was also motivated by the interest and research in expert systems. However, this approach had only limited success, largely due to the difficulty of quantifying expert knowledge. Another difficult problem is the integration of many levels of human knowledge: phonetics, phonotactics, lexical access, syntax, semantics and pragmatics.
Alternatively, combining independent and asynchronous knowledge sources optimally remains an unsolved problem. In more indirect forms, knowledge has also been used to guide the design of the models and algorithms of other techniques such as template matching and stochastic modeling. This form of knowledge application makes an important distinction between knowledge and algorithms: algorithms enable us to solve problems, while knowledge enables the algorithms to work better. This form of knowledge based system enhancement has contributed considerably to the design of all successful strategies reported. It plays an important role in the selection of a suitable input representation, the definition of units of speech, and the design of the recognition algorithm itself.
2.3 Taxonomy of Speech Recognition:
Existing techniques for speech recognition have been represented diagrammatically in the following figure (Fig 6).
Figure 6: Taxonomy of speech Recognition
2.4 Feature Extraction:
In speech recognition, the main goal of the feature extraction step is to compute a parsimonious sequence of feature vectors providing a compact representation of the given input signal. The feature extraction is usually performed in three stages.
The first stage is called the speech analysis or the acoustic front end. It performs some kind of spectro-temporal analysis of the signal and generates raw features describing the envelope of the power spectrum of short speech intervals. The second stage compiles an extended feature vector composed of static and dynamic features. Finally, the last stage (which is not always present) transforms these extended feature vectors into more compact and robust vectors that are then supplied to the recognizer. Although there is no real consensus as to what the optimal feature sets should look like, one usually would like them to have the following properties: they should allow an automatic system to discriminate between different though similar-sounding speech sounds, they should allow for the automatic creation of acoustic models for these sounds without the need for an excessive amount of training data, and they should exhibit statistics that are largely invariant across speakers and speaking environments.
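The three stages can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not a production front end: the frame length, hop size and number of bands are arbitrary, a naive DFT pooled into equal-width bands stands in for the spectral envelope analysis, first-order differences serve as the dynamic features, and per-dimension mean subtraction stands in for the final robustness transform.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split the signal into overlapping short intervals (frames)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def band_log_energies(frame, n_bands):
    """Stage 1: coarse log power spectrum envelope of one frame."""
    n = len(frame)
    half = n // 2
    power = []
    for k in range(half):  # naive DFT, adequate for short frames
        re = sum(x * math.cos(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        power.append(re * re + im * im)
    width = half // n_bands  # leftover bins are ignored if half % n_bands != 0
    return [math.log(sum(power[b * width:(b + 1) * width]) + 1e-10)
            for b in range(n_bands)]

def add_deltas(static):
    """Stage 2: append dynamic (first-difference) features to the static ones."""
    last = len(static) - 1
    extended = []
    for t, vec in enumerate(static):
        before, after = static[max(t - 1, 0)], static[min(t + 1, last)]
        delta = [(hi - lo) / 2.0 for hi, lo in zip(after, before)]
        extended.append(vec + delta)
    return extended

def mean_normalize(features):
    """Stage 3 (optional): remove the per-dimension mean for robustness."""
    dims = len(features[0])
    means = [sum(f[d] for f in features) / len(features) for d in range(dims)]
    return [[f[d] - means[d] for d in range(dims)] for f in features]

def extract_features(samples, frame_len=32, hop=16, n_bands=4):
    frames = frame_signal(samples, frame_len, hop)
    static = [band_log_energies(frame, n_bands) for frame in frames]
    return mean_normalize(add_deltas(static))
```

With the default settings, a 128-sample signal yields 7 frames, each described by 4 static and 4 dynamic values.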
2.5 Finite State Machines
One of the many problems in speech recognition today is that recognition networks become large and redundant. The Finite State Machine (FSM) toolkit can offer a good solution to this problem.
2.5.1 FSM overview
To make this more specific, the fundamental principles of the FSM toolkit are described here. FSMs are divided into acceptors and transducers.
2.5.2 The Finite-State Acceptor
The Finite-State Acceptor (FSA) can accept or reject any input sequence of elements, for example characters. Like other finite-state automata, it has one initial state (shown as a bold circle in Figure 7) and one or more final states (shown as double circles).
Figure 7: FSA example - this accepts two Czech numerals
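A minimal sketch of such an acceptor is given below. It is not the FSM toolkit's own representation; the class name, the state numbering and the choice of the two numerals "dva" and "osm" as accepted character strings are assumptions made for illustration.

```python
class Acceptor:
    """Deterministic finite-state acceptor over single characters."""

    def __init__(self, initial, finals, transitions):
        self.initial = initial
        self.finals = set(finals)
        self.transitions = transitions  # (state, symbol) -> next state

    def accepts(self, symbols):
        state = self.initial
        for symbol in symbols:
            state = self.transitions.get((state, symbol))
            if state is None:       # no arc for this symbol: reject
                return False
        return state in self.finals  # must end in a final state

# Hypothetical acceptor for the character strings "dva" and "osm".
numerals = Acceptor(
    initial=0,
    finals={3},
    transitions={
        (0, "d"): 1, (1, "v"): 2, (2, "a"): 3,
        (0, "o"): 4, (4, "s"): 5, (5, "m"): 3,
    },
)
```

Here `numerals.accepts("dva")` and `numerals.accepts("osm")` hold, while a prefix such as "dv" is rejected because state 2 is not final.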
2.5.3 The Finite-State Transducer
The Finite-State Transducer (FST) maps one set of input elements into another set. A null output, as shown in Figure 8, means that the output is void.
Figure 8: FST example - this maps phonemes of two Czech numerals into words
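The mapping from phonemes to words can be sketched as follows. The representation is an assumption for illustration (it is not the FSM toolkit's format): each arc carries an input symbol and either an output word or a null output, represented here by `None`, and the phoneme symbols are simply the letters of the numerals "dva" and "osm".

```python
EPSILON = None  # null output: the arc emits nothing

def transduce(transitions, initial, finals, symbols):
    """Map an input sequence to an output sequence, or None if rejected."""
    state, output = initial, []
    for symbol in symbols:
        arc = transitions.get((state, symbol))
        if arc is None:
            return None
        state, emitted = arc
        if emitted is not EPSILON:
            output.append(emitted)
    return output if state in finals else None

# Hypothetical transducer: the first arc of each path emits the word,
# the remaining arcs have null output.
PHONEMES_TO_WORDS = {
    (0, "d"): (1, "dva"), (1, "v"): (2, EPSILON), (2, "a"): (3, EPSILON),
    (0, "o"): (4, "osm"), (4, "s"): (5, EPSILON), (5, "m"): (3, EPSILON),
}

words = transduce(PHONEMES_TO_WORDS, 0, {3}, ["o", "s", "m"])
```

The phoneme sequence o, s, m is mapped to the single word "osm"; an incomplete sequence such as d, v yields no output because it does not end in a final state.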
2.5.4 Weighted FSM
Adding weights to an FSM makes it possible to generate an output number from the elementary weights along the appropriate path through the automaton, as shown in Figure 9.
This is used for generating the log probability in the bigram grammar acceptor and, furthermore, for generating the output probability of a general speech recognizer (all based on the FSM toolkit).
Figure 9: FSA with weights - the numeral "osm" has a greater probability than "dva"
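A weighted acceptor can be sketched by attaching a weight to every arc and summing the weights along the accepted path. In this illustration the weights are negative log probabilities, so a smaller total means a more probable word. The probabilities 0.7 and 0.3 are invented for the example and do not come from Figure 9, which only indicates that "osm" is the more probable numeral.

```python
import math

def path_weight(transitions, initial, finals, symbols):
    """Sum of arc weights along the path for symbols, or None if rejected."""
    state, total = initial, 0.0
    for symbol in symbols:
        arc = transitions.get((state, symbol))
        if arc is None:
            return None
        state, weight = arc
        total += weight
    return total if state in finals else None

# Hypothetical weights: -log P(word) on the first arc, 0 on the others.
WEIGHTED_NUMERALS = {
    (0, "d"): (1, -math.log(0.3)), (1, "v"): (2, 0.0), (2, "a"): (3, 0.0),
    (0, "o"): (4, -math.log(0.7)), (4, "s"): (5, 0.0), (5, "m"): (3, 0.0),
}

w_osm = path_weight(WEIGHTED_NUMERALS, 0, {3}, "osm")
w_dva = path_weight(WEIGHTED_NUMERALS, 0, {3}, "dva")
```

Converting back with `math.exp(-w_osm)` recovers the assumed probability of 0.7. Summing negative log weights corresponds to multiplying probabilities along the path.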
2.6 Performance of speech recognition systems:
The performance of speech recognition systems is usually specified in terms of accuracy and speed. Accuracy is usually rated with the word error rate (WER), whereas speed is measured with the real time factor. Other measures of accuracy include Single Word Error Rate (SWER) and Command Success Rate (CSR).
Word Error Rate (WER): Word error rate is a common metric of the performance of a speech recognition or machine translation system. The general difficulty of measuring performance lies in the fact that the recognized word sequence can have a different length from the reference word sequence (supposedly the correct one). This problem is solved by first aligning the recognized word sequence with the reference (spoken) word sequence using dynamic string alignment. The WER is thus derived from the Levenshtein distance, working at the word level instead of the phoneme level. Word error rate can then be computed as:
$$\mathrm{WER} = \frac{S + D + I}{N}$$
where
S is the number of substitutions,
D is the number of deletions,
I is the number of insertions,
N is the number of words in the reference.
When reporting the performance of a speech recognition system, sometimes word recognition rate (WRR) is used instead:
$$\mathrm{WRR} = 1 - \mathrm{WER} = \frac{N - S - D - I}{N} = \frac{H - I}{N}$$
where H is N-(S+D), the number of correctly recognized words.
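The computation can be sketched as a word-level Levenshtein alignment that also counts substitutions, deletions and insertions, using the usual definitions WER = (S + D + I) / N and WRR = 1 - WER. The function names and the example sentences are illustrative assumptions. When several alignments have the same minimum number of errors, the split into S, D and I depends on tie-breaking, although the WER itself does not.

```python
def align_counts(reference, hypothesis):
    """Return (S, D, I) for a minimum-error alignment of two word lists."""
    rows, cols = len(reference) + 1, len(hypothesis) + 1
    # table[i][j] = (errors, S, D, I) for reference[:i] vs hypothesis[:j]
    table = [[None] * cols for _ in range(rows)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, rows):
        table[i][0] = (i, 0, i, 0)   # only deletions
    for j in range(1, cols):
        table[0][j] = (j, 0, 0, j)   # only insertions
    for i in range(1, rows):
        for j in range(1, cols):
            e, s, d, ins = table[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                diagonal = (e, s, d, ins)          # correct word
            else:
                diagonal = (e + 1, s + 1, d, ins)  # substitution
            e, s, d, ins = table[i - 1][j]
            deletion = (e + 1, s, d + 1, ins)
            e, s, d, ins = table[i][j - 1]
            insertion = (e + 1, s, d, ins + 1)
            table[i][j] = min(diagonal, deletion, insertion)
    _, subs, dels, inss = table[-1][-1]
    return subs, dels, inss

def word_error_rate(reference, hypothesis):
    s, d, i = align_counts(reference, hypothesis)
    return (s + d + i) / len(reference)

def word_recognition_rate(reference, hypothesis):
    return 1.0 - word_error_rate(reference, hypothesis)

reference = "a b c d".split()
hypothesis = "a x c".split()
wer = word_error_rate(reference, hypothesis)
```

In this example "b" is substituted by "x" and "d" is deleted, so S = 1, D = 1, I = 0 and N = 4, giving a WER of 0.5 and a WRR of 0.5. Because insertions are counted against the reference length, the WER can exceed 1 for very poor hypotheses.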
III. PROBLEMS AND PROPOSED APPROACH
3.1 Current problems of Speech recognition:
When designers approach the problem of core speech recognition, they are often faced with the problem of needing to develop an entire system from scratch, even if they only want to explore one facet of the field.
There already exist a few open source speech-based systems, such as HMM and earlier versions of the Sphinx systems. The available systems are typically optimized for a single approach to speech system design. As a result, these systems intrinsically create barriers to future enhancements that depart from the original purpose of the system.
In addition, some of these systems are encumbered by licensing agreements that make entry into the research arena difficult for non-academic institutions.
3.2 Proposed Solution
To facilitate innovation in speech recognition research and development, the proposed framework for Speech Application Development (Assistant-I) is an open source platform that incorporates state-of-the-art methodologies and also addresses the needs of emerging research areas within the context of our diverse technical goals.
Assistant-I is to be designed in the .NET Framework, making it available to a large variety of development platforms. First and foremost, Assistant-I is a modular and pluggable framework that incorporates design patterns from existing systems, with sufficient flexibility to support emerging areas of research interest. The framework is composed of separable components dedicated to specific tasks, which can be easily replaced at runtime. To exercise the framework, and to provide researchers with a working system, Assistant-I also includes a variety of modules that implement state-of-the-art speech recognition techniques.
The block diagram of the framework and the way it functions are shown in the figure below (Fig. 10).
Figure 10: The Architecture of the Assistant-I model
The module functions as a simple black box that works on the basis of an FSM model coded by the developer. The output from the FSM is the action associated with the respective state in the model.
The framework takes modules as inputs, which are basically the sets of choices to be recognized, and gives its output based on the FSM of the machine.
It consists of a simple front end that allows the user to configure the voice and also gives the flexibility to enhance features and the grammars based on them.
The framework can accept as well as reject modules at the click of the user. It also incorporates speaker recognition features that provide security.
The framework is built on a dynamic approach that consumes very little memory, and the installed modules are accessed through reflection, making it easier for developers to code speech-based applications.
It also supports integration with hardware, which makes the framework adaptable to a wide range of conditions.
The framework also gives users and developers the flexibility to develop and control applications rapidly, with full control in their hands.
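The module concept described above can be sketched as a small state machine whose transitions are triggered by recognized phrases and whose states carry actions. This is an illustrative sketch in Python, not the Assistant-I API (the framework itself is described as targeting the .NET Framework); the class name `SpeechModule`, its methods and the media player example are all hypothetical.

```python
class SpeechModule:
    """Hypothetical module: an FSM driven by recognized phrases."""

    def __init__(self, initial, transitions, actions):
        self.state = initial
        self.transitions = transitions  # (state, phrase) -> next state
        self.actions = actions          # state -> callable run on entry

    def choices(self):
        """Phrases the recognizer should listen for in the current state."""
        return sorted(p for (s, p) in self.transitions if s == self.state)

    def on_recognized(self, phrase):
        """Advance the FSM; return the new state's action result, or None."""
        next_state = self.transitions.get((self.state, phrase))
        if next_state is None:
            return None  # phrase is not valid in the current state
        self.state = next_state
        return self.actions[next_state]()

# Hypothetical media player module with two states.
player = SpeechModule(
    initial="idle",
    transitions={("idle", "play"): "playing", ("playing", "stop"): "idle"},
    actions={"playing": lambda: "playback started",
             "idle": lambda: "playback stopped"},
)

first = player.on_recognized("play")    # enters "playing"
second = player.on_recognized("play")   # rejected: not a choice while playing
```

Exposing only the choices valid in the current state keeps the active grammar small, which is one plausible reading of how a module supplies the set of choices to be recognized.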
The results are based on the following parameters:
Response Time of System
Grammar Load Time
Grammar Switch Time
The testing was done on three different modules on 10 emulated systems, and the results are based on an ordinal scale.
Response Time of System: the time taken by the system to load the module and its resources.
Grammar Load Time: the time taken by the system to load the designated grammar.
Grammar Unload Time: the time taken by the system to unload and load the designated grammar.
Recognition Accuracy: the acceptance of words recognized by the system, by type of grammar.
IV. CONCLUSION & FUTURE WORK
Speech is the primary and most convenient means of communication between people. Whether due to technological curiosity to build machines that mimic humans or the desire to automate work with machines, research in speech and speaker recognition, as a first step toward natural human-machine communication, has attracted much enthusiasm over the past five decades. We have also encountered a number of practical limitations that hinder widespread deployment of applications and services.
There is a need for a platform-independent framework for speech application development, which can be provided through Assistant-I.
The framework can be extended to various other electronic gadgets such as mobile phones and automatic burglar alarms, and to interactive, intelligent, human-like speech communication with the computer.
Highly automated microcontroller-based devices can also be controlled by speech.