Automatic Speech Recognition Definition English Language Essay



1940s and 1950s - Crawling: This period, which began in the 1940s and ran until the late 1950s, saw essential groundwork carried out on two key paradigms: the automaton model and the probabilistic model. The automaton arose in the 1950s out of Turing's 1936 model of algorithmic computation, now referred to as a "Turing machine". Turing's model, considered by many to be the foundation of modern computer science, led to the McCulloch-Pitts neuron model (1943), and later to the mathematician Stephen Cole Kleene's work on finite automata and regular expressions (1951 and 1956). Another major contribution to automata theory came from Claude Elwood Shannon, who in 1948 applied probabilistic models of discrete Markov processes to automata for language. Drawing on Shannon's idea of a finite-state Markov process, Noam Chomsky in 1956 first considered finite-state machines as a way to characterize a grammar, and defined a finite-state language as a language generated by a finite-state grammar. Together, these early models led to the field of formal language theory, which uses algebra and set theory to define formal languages as sequences of symbols; this includes the context-free grammar, first defined by Noam Chomsky in 1956, and independently by John Backus in 1959 and Peter Naur in 1960. The same period saw the development of probabilistic algorithms for speech and language processing, which brings us to Shannon's other important contribution: the metaphor of the noisy channel and decoding for the transmission of language through media such as communication channels and speech acoustics.
Shannon also borrowed the concept of entropy from thermodynamics as a way of measuring the information capacity of a channel or the information content of a language. Around the same time, the foundations were laid for future work in speech recognition: first with the development of the sound spectrograph in 1946, and then with foundational research in instrumental phonetics. This groundwork brought about the first machine speech recognizers in the early 1950s, the highlight being a statistical system built by a team of researchers at Bell Labs, which could recognise any of the 10 digits from a single speaker with 97-99% accuracy.
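To make the idea of "information content" concrete, the following is a minimal sketch of Shannon entropy computed over the symbols of a text. It assumes the simplest possible model, in which each symbol is drawn independently according to its observed frequency; the function name `entropy_bits` is only illustrative and does not come from any of the works discussed.

```python
import math
from collections import Counter


def entropy_bits(symbols):
    """Entropy, in bits per symbol, of the empirical symbol distribution.

    Assumption: symbols are treated as independent draws, so this ignores
    any dependence between neighbouring symbols.
    """
    counts = Counter(symbols)
    total = sum(counts.values())
    # H = sum over symbols of p * log2(1 / p)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# A text that always repeats one symbol carries no information per symbol,
# while two equally likely symbols carry one bit per symbol.
print(entropy_bits("aaaa"))
print(entropy_bits("abab"))
```

A more faithful estimate for a real language would condition each symbol on its predecessors, which is where the Markov models mentioned above come in.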

A simple model of a Turing Machine. Essentially, a Turing machine consists of an infinite tape, a head, a state register, and a finite state machine.
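The components named in the caption can be sketched in a few lines of Python. This is an illustrative simulator only, under stated assumptions: the tape is modelled as a dictionary that grows on demand, the state register is a plain string, and the finite control is a rule table. The names `run_turing_machine` and `INVERT_RULES`, the `"halt"` state, and the step limit are all hypothetical choices made for this example.

```python
from collections import defaultdict


def run_turing_machine(rules, tape, state="start", blank="_", max_steps=1000):
    """Simulate a one-tape Turing machine.

    rules maps (state, symbol) -> (new_state, symbol_to_write, move),
    where move is -1 (head left) or +1 (head right).
    """
    cells = defaultdict(lambda: blank, enumerate(tape))  # tape, unbounded both ways
    head = 0                                             # head position
    for _ in range(max_steps):                           # guard against endless runs
        if state == "halt":
            break
        key = (state, cells[head])
        if key not in rules:
            break                                        # no applicable rule: stop
        state, cells[head], move = rules[key]
        head += move
    if not cells:
        return state, ""
    low, high = min(cells), max(cells)
    return state, "".join(cells[i] for i in range(low, high + 1)).strip(blank)


# Example rule table (hypothetical): flip every bit, halt at the first blank.
INVERT_RULES = {
    ("start", "0"): ("start", "1", +1),
    ("start", "1"): ("start", "0", +1),
    ("start", "_"): ("halt", "_", -1),
}

print(run_turing_machine(INVERT_RULES, "1011"))  # -> ('halt', '0100')
```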

1957 to 1970 - Baby Steps:

From the late 1950s to the early 1960s, speech and language processing split into two distinct paradigms: the symbolic and the stochastic. The symbolic paradigm took off from two lines of research. The first was the work of Noam Chomsky and others on formal language theory and generative syntax, together with the work of many linguists and computer scientists on parsing algorithms; a clear example is one of the earliest complete working parsing systems, Zelig Harris's "Transformations and Discourse Analysis" project of 1958 to 1959. The second line of research was the new field of Artificial Intelligence: in the summer of 1956 a group of researchers, including the aforementioned Claude Elwood Shannon, came together for a two-month workshop on what they called "artificial intelligence". The main emphasis of this new field was the work on reasoning and logic exemplified by Allen Newell, Herbert A. Simon, and J. C. Shaw's "Logic Theorist" and "General Problem Solver". A clear consequence of this work was the use of simple heuristics in the early natural language systems built around this time; by the late 1960s, more formal logical systems were in use. The stochastic paradigm, in contrast to the symbolic paradigm, took off within departments of statistics and electrical engineering. The late 1950s saw Bayesian methods applied to the until then problematic task of optical character recognition: in 1959 Woodrow Wilson Bledsoe and his fellow Sandia employee Iben Browning built a Bayesian system for optical character recognition.
Later, in 1964, Charles Frederick Mosteller and David Wallace applied Bayesian methods to the historical problem of authorship attribution, namely the question of who wrote each of the disputed Federalist papers. The 1960s also brought about the first testable psychological models of human language processing, which were based upon transformational grammar. The decade also saw the appearance of the first online text collections (newspapers, novels, articles, dictionaries, etc.): the "Brown University Standard Corpus of Present-Day American English" (compiled by Henry Kučera and W. Nelson Francis in 1963 to 1964), and William Wang's "Chinese Dictionary on Computer" in 1967.

Chomsky's grammar: A tree diagram, or phrase marker, considered a structural description of the sentence "The man hit the ball." It is a description of the constituent structure, or phrase structure, of the sentence, and it is assigned by the rules that generate the sentence.

1970 to 1983 - Giant Leaps:

This third and key period saw research in speech and language processing reach new heights; it also saw the development of a number of research paradigms that still dominate the field today. The stochastic paradigm mentioned previously played a key role in the advancement of speech recognition algorithms in this period, particularly through the use of the Hidden Markov Model and the metaphors of the noisy channel and decoding; these were developed independently by Frederick Jelinek, Lalit R. Bahl, Robert L. Mercer, and associates at the "IBM Thomas J. Watson Research Center", and by James K. Baker at "Carnegie Mellon University". This period also brought about the beginnings of the logic-based paradigm, through the work of Alain Colmerauer and his associates on Q-systems and metamorphosis grammars in 1970 and 1975, the work of Fernando C. N. Pereira and David H. D. Warren on Prolog and Definite Clause Grammars in 1980, Martin Kay's work on functional grammar in 1979, and Joan Bresnan and Ronald M. Kaplan's work on Lexical Functional Grammar (LFG) in 1982. Another field that emerged during this period was natural language understanding. It began with the "SHRDLU" system, created by the American professor of computer science Terry Winograd, which simulated a robot embedded in a world of toy blocks. The program accepted natural language text commands in English, and in terms of complexity, sophistication, and extensiveness it was a first since work on speech and language processing began in the 1940s. At Yale, Roger Schank, the chairman of computer science, and his colleagues and students (Roger C. Schank and Robert P. Abelson - 1977, Roger C. Schank and Christopher K. Riesbeck - 1981, Cullingford - 1981, Robert Wilensky - 1983, Wendy Lehnert - 1977) built a series of language understanding programs directed specifically towards concepts within human knowledge (such as scripts, plans and goals, and memory organisation); network-based semantics was often used, and Charles J. Fillmore's notion of case roles was also incorporated into their representations (M. Ross Quillian - 1968, D. Rumelhart and D. Norman - 1975, Roger C. Schank - 1972, Y. Wilks - 1975, Walter Kintsch - 1974). This period also brought about the discourse modeling paradigm, which focused on four key areas in discourse: Barbara J. Grosz and her fellow researchers introduced the study of discourse structure and discourse focus, a number of researchers began work on automatic reference resolution (Jerry R. Hobbs - 1978), and the Belief-Desire-Intention framework for logic-based work on speech acts was developed (J.F. Allen and C.R. Perrault - 1980, and P. R. Cohen and C.R. Perrault - 1979).

Prolog is a language commonly used to implement semantic grammars. It provides a handy construct known as a Definite Clause Grammar, or DCG for short. DCGs greatly simplify the process of writing phrase structure rules. Once the rules are defined as a DCG, Prolog's automatic backtracking mechanism searches for a way to apply them to the input, and the parse tree can be built as a by-product of that search.
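The idea behind a DCG can be illustrated outside Prolog as well. The sketch below is written in Python rather than Prolog, and it only imitates the behaviour described above: generators stand in for Prolog's backtracking, and each successful match returns the parse tree it built along the way. The toy grammar, covering the sentence "The man hit the ball" from the earlier tree diagram, and all function names are assumptions made for this example.

```python
# Toy phrase structure rules (hypothetical): nonterminal -> alternative bodies.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"], ["V"]],
    "Det": [["the"]],
    "N": [["man"], ["ball"]],
    "V": [["hit"]],
}


def parse(symbol, words, pos=0):
    """Yield (tree, next_pos) for every way symbol can match words[pos:]."""
    if symbol not in GRAMMAR:  # terminal: must equal the next word
        if pos < len(words) and words[pos] == symbol:
            yield symbol, pos + 1
        return
    for body in GRAMMAR[symbol]:  # try each alternative in turn
        for children, end in match_sequence(body, words, pos):
            yield (symbol, children), end


def match_sequence(symbols, words, pos):
    """Yield (trees, next_pos) for every way the symbol sequence can match."""
    if not symbols:
        yield [], pos
        return
    first, rest = symbols[0], symbols[1:]
    for tree, mid in parse(first, words, pos):
        # If the rest fails, the outer loop resumes: this is the backtracking.
        for trees, end in match_sequence(rest, words, mid):
            yield [tree] + trees, end


def parse_sentence(sentence):
    """Return the first parse tree that covers the whole sentence, or None."""
    words = sentence.lower().rstrip(".").split()
    for tree, end in parse("S", words):
        if end == len(words):
            return tree
    return None


print(parse_sentence("The man hit the ball."))
```

Note that this naive top-down strategy, like Prolog's, would loop forever on a left-recursive rule such as NP -> NP PP; the toy grammar avoids such rules.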

1983 to 1993 - Further Progress:

This decade saw the comeback of two classes of models that had lost popularity in the late 1950s and early 1960s. The first was finite-state models, whose revival began with work on finite-state phonology and morphology by Ronald M. Kaplan and Martin Kay in 1981, and on finite-state models of syntax by K. Church in 1980. The second was probabilistic models, which rose throughout speech and language processing, strongly influenced by the work on "probabilistic models of speech recognition" at the IBM Thomas J. Watson Research Center; these methods spread into part-of-speech tagging, parsing and attachment ambiguities, and connectionist approaches from speech recognition to semantics. A considerable amount of work on natural language generation also took place during this period.

1994 to 1999 - Smooth Riding:

As the millennium approached, clear changes in the field of Speech Recognition became even more apparent. First, probabilistic and data-driven models had become commonplace throughout natural language processing: algorithms for parsing, part-of-speech tagging, reference resolution, and discourse processing began to incorporate probabilities and to employ evaluation methodologies borrowed from speech recognition and information retrieval. Second, increases in the speed and memory of computers made commercial use of a number of sub-areas of speech and language processing possible, in particular speech recognition and spelling and grammar checking. Finally, the rise of the web brought about a serious need for language-based information retrieval and extraction.

The Concept:

Speech Communication:

One of the simplest yet most essential capabilities possessed by humans is speech communication; through speech, human beings can express information at will without the need for any external tool. Although we take in more information through the eyes than through the ears, communicating with one another visually is far less effective than communicating through speech. The speech wave conveys information in a known language, together with the particular speaker's vocal characteristics and emotion. If we also consider that the acoustic and linguistic structures of speech are closely related to our intellectual ability, and are therefore deeply related to our cultural and social development, we can clearly see how significant a role speech plays in our everyday lives.

The speech wave produced by the vocal organs is transmitted through the air to the ears of the listeners, as shown above. At the ear, it activates the hearing organs to produce nerve impulses which are transmitted to the listener's brain through the auditory nerve system. This permits the linguistic information which the speaker intends to convey to be readily understood by the listener.

The Speech Wave - Information Retrieval by Machine:

Taking all the above into consideration, we can see why Automatic Speech Recognition, as an application of artificial intelligence, has been a sought-after goal of research for more than sixty years (the original foundations were laid in the 1940s, and the goal later inspired such science fiction wonders as the computer 'HAL' in S. Kubrick's famous movie "2001 - A Space Odyssey", and the much beloved robot "R2D2" in G. Lucas's classic movie series "Star Wars"). Despite enormous efforts to design intelligent machines that can recognise the spoken word and then comprehend its meaning, we are still nowhere near the ultimate goal set from the outset: a machine that can recognise and comprehend all speech, no matter the subject, no matter the discourse, no matter the environment. So where do we currently stand when it comes to automatic speech recognition? How far have these systems come? These questions have no simple answer. What can be said is that any answer must be measured against that ultimate goal, and the word that should come to mind is "accuracy". A system's accuracy depends heavily on the conditions under which it is evaluated: under sufficiently narrow conditions a system can readily attain human-like accuracy, but as the conditions are broadened this becomes much harder to achieve.
These conditions of evaluation and thus the overall accuracy of automatic speech recognition vary depending upon the following factors:

Vocabulary size and confusability.

Speaker dependence vs. independence.

Isolated, discontinuous, or continuous speech.

Task and language constraints.

Read speech vs. spontaneous speech.

Adverse conditions.
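The essay does not say how accuracy is measured. One widely used measure, assumed here purely for illustration, is the word error rate: the minimum number of word substitutions, deletions, and insertions needed to turn the recognizer's output into the reference transcript, divided by the number of reference words. The sketch below computes it with a standard edit-distance table; the function name and the example sentences are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference words consumed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # reference word deleted
                curr[j - 1] + 1,                       # extra word inserted
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("too") and one deletion ("four") over four reference words.
print(word_error_rate("one two three four", "one too three"))  # -> 0.5
```

Each factor in the list above changes how hard it is to keep this number low: a ten-digit vocabulary read by a single speaker is a far narrower condition than spontaneous, continuous speech from arbitrary speakers.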

Even if we take all the above factors into account and use them as a general blueprint for research into speech recognition by machine, with the ultimate goal being a machine that can recognise and comprehend all speech, no matter the subject, no matter the discourse, no matter the environment, we must still recognise the interdisciplinary nature of automatic speech recognition. Most researchers tend to apply a monolithic approach to individual problems, when those problems in fact require the application of one or more disciplines such as the following:

Signal processing.

Physics (acoustics).

Pattern recognition.

Communication and information theory.



Computer science.


SPREX Expert System:


The SPREX system, designed by Mizoguchi, Tanaka, Fukuda, Tsujino, and Kakusho in 1987, is a continuous speech recognition system that uses techniques derived from knowledge engineering. The parameters within this system are described symbolically, and forward-chaining rules are then applied in the recognition phase. The parameters used are formant frequencies, energy, zero crossing rate, and the ratio of low-frequency energy to the whole energy (L/A). Variations of the feature parameters are defined through descriptors, and the states of the features are labeled using a set of labels. These state descriptions make a good graphical interface available to the users of the system, the human experts whose job it is to create rules in accordance with their own knowledge. The rules are stored in a rule database and used during the recognition phase, in which the state descriptions and the rules together drive segmentation, consonant recognition, vowel recognition, and re-recognition.
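Three of the parameters listed above are simple enough to sketch directly. The following is not SPREX's implementation, whose details are not given here; it is a minimal illustration of what short-time energy, zero crossing rate, and a low-frequency to whole energy ratio could look like for a single frame of samples. The function names, the 1000 Hz cutoff, and the use of a naive discrete Fourier transform are all assumptions made for this example.

```python
import cmath
import math


def frame_energy(frame):
    """Sum of squared sample values in the frame."""
    return sum(x * x for x in frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def low_to_all_energy_ratio(frame, sample_rate, cutoff_hz):
    """Energy at or below cutoff_hz divided by total energy (an L/A-style ratio).

    Uses a naive O(n^2) DFT so the sketch needs no external library.
    """
    n = len(frame)
    power = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        # Interior bins of a real signal appear twice in the full spectrum.
        weight = 1 if k == 0 or 2 * k == n else 2
        power.append(weight * abs(coeff) ** 2)
    total = sum(power)
    if total == 0:
        return 0.0
    low = sum(p for k, p in enumerate(power) if k * sample_rate / n <= cutoff_hz)
    return low / total


# Hypothetical 10 ms frame at 8 kHz containing a 200 Hz tone.
frame = [math.sin(2 * math.pi * 200 * t / 8000) for t in range(80)]
print(frame_energy(frame), zero_crossing_rate(frame),
      low_to_all_energy_ratio(frame, 8000, 1000))
```

Broadly, voiced sounds such as vowels tend to show high energy, a low zero crossing rate, and a high low-frequency ratio, while unvoiced fricatives tend to show the opposite, which is why parameters of this kind are useful inputs to segmentation rules.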

The Problems:

Despite the numerous approaches to automatic speech recognition that have been developed, such as template matching methods, statistical methods, and neural network based approaches, none has achieved the desired end goal of "speaker independent continuous automatic speech recognition". The majority of these approaches depend on heuristic knowledge in addition to their own methodology. Within automatic speech recognition systems as a whole, the segmentation problem remains unsolved. Its importance lies in the fact that segmentation is usually the first step of the recognition process: any errors made in the segmentation phase propagate to later stages, so the overall performance of the system can never exceed that of the segmenter. In addition, although the majority of speech recognition expert systems today use powerful tools to transform the knowledge of experts into rules and to maintain them effectively, an obvious drawback remains: the labeling uses thresholds. Rules therefore cannot be flexible for values near a threshold, and it is difficult to determine the exact threshold values.

The Proposed Solution:

As mentioned above, the segmentation problem is common to automatic speech recognition systems. To avoid the problems faced in the SPREX expert system, the proposed idea is to use an irregular unit based on a spectral transition measure. Frames should be used as the structure of the speech recognition rules; this structure gives the user an easier means of creating rules, and makes an automatic rule generator possible. The use of fuzzy linguistic variables for representing the rules is also proposed, together with a rule generating cycle made up of a rule generator, a rule tester with an error reporter, and a state describer with additional modules; such a cycle could save precious time when making rules, and thus result in a high performance speech recognition system. In short, fuzziness would be applied to the system to provide the flexibility of rules whose absence was pointed out previously, and which the human experts who create the rules require: linguistic variables would be used instead of thresholds to describe the state trajectory of parameters. Figure 3 shows the definition of the rule structure; Figure 5 shows the overall structure of the automatic speech recognition expert system and the flow of rule generation; and Figure 6 shows the structure of the speech data base and its relations with other modules.
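The difference between threshold labeling and fuzzy linguistic variables can be shown with a small sketch. Everything in it is an assumption made for illustration: the proposal does not specify membership functions, so simple linear ramps are used, and the breakpoints, the 0.30 threshold, and the choice of zero crossing rate as the example parameter are hypothetical.

```python
def ramp_up(x, lo, hi):
    """Membership that is 0 below lo, 1 above hi, and linear in between."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)


def ramp_down(x, lo, hi):
    """Mirror image of ramp_up: 1 below lo, 0 above hi."""
    return 1.0 - ramp_up(x, lo, hi)


# Hypothetical linguistic variable for a zero crossing rate between 0 and 1.
ZCR_TERMS = {
    "low": lambda x: ramp_down(x, 0.10, 0.25),
    "medium": lambda x: min(ramp_up(x, 0.10, 0.25), ramp_down(x, 0.35, 0.50)),
    "high": lambda x: ramp_up(x, 0.35, 0.50),
}


def fuzzify(x, terms):
    """Degree, from 0 to 1, to which x belongs to each linguistic term."""
    return {name: membership(x) for name, membership in terms.items()}


def crisp_label(x, threshold=0.30):
    """Threshold labeling: a tiny change near the threshold flips the label."""
    return "high" if x >= threshold else "low"


def fuzzy_and(*degrees):
    """Conjunction of rule conditions, taken here as the minimum degree."""
    return min(degrees)


for zcr in (0.29, 0.31, 0.40):
    print(zcr, crisp_label(zcr), fuzzify(zcr, ZCR_TERMS))
```

With the crisp threshold, the values 0.29 and 0.31 receive opposite labels even though they are nearly identical; with the fuzzy terms they receive the same description, and a value such as 0.40 is partly "medium" and partly "high", so a rule that uses it fires to a degree rather than all or nothing.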

Figure 3 shows the definition of the rule structure; the parameters used in the rule are the same as those used in the original SPREX system; each frame has slots for the state description such as the sequence of states, the start and end values, and so forth.

Figure 5 shows the overall structure of the speech recognition expert system, and the flow of rule generation.

Figure 6 shows the structure of the speech data base and the relations with other modules.