History Of Speech Recognition Softwares Computer Science Essay


1.0 Introduction

Technology is advancing at a phenomenal rate, and as a result mankind is becoming ever more dependent upon new technologies and ideas. Innovation covers not only new technology but also the development of old ideas to their true potential. Inventors, designers, and manufacturers are always looking for ways to advance their products, their ideas, or both. Whether in mobile phones, cars, or desktop computers, large corporations battle to gain the most market share with their technology. To stay at the top of the market, however, designers and inventors need to come up with new technologies or ideas every year, because the rate at which technology advances makes ideas redundant in a very short amount of time. A great example of this is the iPhone created by Apple Inc.: every year Apple has released a newer version of the iPhone, with each release practically dwarfing the previous one. A year may seem a long time, yet the first-generation iPhone was released a mere three years ago, on January 9, 2007. Given that the device is currently in its fourth generation, Apple has moved through four generations in approximately three years, which shows the vast pace of technological advance [1][2].

The main idea behind new technologies is to make life easier for mankind. The first freely programmable computer was invented in 1936 by Konrad Zuse as an aid to calculation; at the time it was the most powerful calculating machine available [3].

In more recent times, hands-free devices have grown substantially in the market, as people lead fast-paced lives and are required to carry out two or more tasks at once. The most widely known is the hands-free device for mobile phones, which adopts the Bluetooth technology created by the telecommunications company Ericsson in 1994 [4]. Although this mainstream technology has been around for more than a decade, only in the last five years or so has its true potential been unveiled. Bluetooth, however, is a means of wireless data transfer and involves no interaction at the receiving end. This is where voice-controlled technology (also known as voice recognition) comes in.

Voice or speech recognition systems allow users to interact with machines and give them the ability to communicate hands-free with devices. There are many applications for voice-controlled systems: they can be used to communicate with computers and multimedia systems, and also for security. In addition, all three can be combined to form a secure multimedia system between computers.

1.1 History of Speech Recognition:

In 1936 AT&T's Bell Labs constructed the very first speech synthesiser, called the Voder, and demonstrated it to the public at the 1939 World's Fair. Due to the technology available at the time, however, the device required a keyboard and foot pedal to operate.

The next major milestone came in 1982, when Drs. Jim and Janet Baker introduced the Dragon system. At the time it was the best voice recognition system available, but it did not truly excel until 1995, after being adjusted to perform "discrete word dictation-level speech recognition" [5]. Two years later, in 1997, Dragon Systems introduced the inspirational "naturally speaking" continuous-speech recognition system, later called Dragon NaturallySpeaking [6]. It was subsequently backed by the technology giant IBM.

In the past 20 years speech recognition technology has advanced to the point that machines are now able to comprehend real-time speech commands. In addition, secure voice recognition systems have been introduced. These systems work by using a "person's voice print to uniquely identify individuals", and the added biometric speaker verification technology secures the speech [7].

1.2 Difference in Speech

Because every person's speech differs from everyone else's, it is very difficult to map what each person is saying using the same system. There are numerous languages throughout the world, and their speakers have different levels of understanding and vocabularies of different breadth, and thus variations in syntax. Different regions also have different accents, which makes it complicated for a speech recognition designer to cater for all cases. Furthermore, every person speaks with a different pitch, tone, and speed, further increasing the difficulty of speech recognition.

"Consequently, most of the history of speech recognition systems has been in making a trade-off between what the user can say or speak, and what the technology interpret that is of an acceptably high level of accuracy to the end-user." [8].

Traditionally, interaction between man and a machine such as a computer has been through the aid of a keyboard or mouse. In a speech recognition system, a subject speaks into a microphone, speech recognition software stored on the computer decodes the audio signal created by the voice, and the computer then processes the speech and carries out the task.

1.3 Synthetic Speech

Synthetic speech can be defined as artificial human speech, mostly produced by a computer. It has also been adopted in mobile phones and satellite navigation systems; however, the synthetic speech used in mobile phones and similar devices is pre-stored, so it does not change. The next step up is real-time synthetic speech, where speech is generated on demand and the computer is not limited to just the pre-stored commands [9].

The process used to convert text into synthetic speech is called "text-to-speech synthesis" or "synthesis-by-rule." Other methods employ the technique developed by S. Saito and F. Itakura in 1966 known as Linear Predictive Coding (LPC). This method of processing audio signals uses signal and speech processing to represent the spectral envelope of a digital signal [10]. LPC is sometimes called an "analysis-synthesis" algorithm: the coder analyses the underlying vocal tract model and evaluates the speech in terms of its parameters, and the outcome is then re-synthesised so that it can be played back through a digital system [10].
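As a rough illustration of the analysis step, the sketch below estimates LPC coefficients from a frame of samples using the autocorrelation method and the Levinson-Durbin recursion. The function name and the choice of a plain-Python implementation are my own for illustration; they are not part of the source.

```python
def lpc_coefficients(signal, order):
    """Estimate LPC coefficients via the autocorrelation method
    and the Levinson-Durbin recursion."""
    n = len(signal)
    # Autocorrelation r[0..order]
    r = [sum(signal[i] * signal[i + k] for i in range(n - k))
         for k in range(order + 1)]
    a = [1.0] + [0.0] * order   # prediction polynomial coefficients
    err = r[0]                  # prediction error energy
    for i in range(1, order + 1):
        # Reflection coefficient for this order
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, err
```

For a signal generated by a one-pole filter x[n] = 0.9·x[n-1], a first-order analysis recovers a predictor coefficient close to -0.9, i.e. the model captures the signal's spectral envelope with very few parameters.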

The biggest problem in speech synthesis is how to digitise the vast number of words, and furthermore how to form their combinations. For devices like mobile phones and satellite navigation systems, storing each word in its digitised form would be far too impractical.

Another limitation is that a word can carry a different meaning if pronounced differently or misused in context. In terms of synthetic speech, the computer can only produce certain tones and pitches and thus will not be practical in conversation [8][10].

1.4 Speech Recognition

1.4.1 Simple Recogniser

Figure 1 - simple speech recogniser: the user's speech is digitised and then converted to the recogniser's internal representation. The captured word is then cross-referenced with the recogniser's template memory to see which word has been said; the pattern matching algorithm determines the closest match [11].

The simple speech recogniser above uses three main components: a speech representation, a set of templates or models, and a pattern matching algorithm.

The speech representation component converts the user's speech into a pattern that can be read by the pattern matching algorithm. Coding methods such as linear predictive coding (using LPC coefficients) and zero crossings of the speech waveform convert the speech signal. This is a very fast way to digitise speech, but it has a limitation: it cannot distinguish between certain pitches and tones, so words need to be pronounced very clearly. This type of system is used in most mobile phones, such as the iPhone, Nokia, and Samsung handsets, and in other voice communication devices on computers. If the words are not said clearly, the pattern matching algorithm will not be able to locate the correct word.
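As a minimal sketch of one of the representations mentioned above, the hypothetical helper below (my own, not from the source) computes the zero-crossing rate of a frame of samples, a cheap feature often used in simple speech digitisation:

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for i in range(1, len(frame))
        if (frame[i - 1] < 0) != (frame[i] < 0)
    )
    return crossings / (len(frame) - 1)
```

A rapidly alternating frame such as `[1, -1, 1, -1]` gives a rate of 1.0, while a monotone frame gives 0.0, which is why the feature is so sensitive to how clearly (and at what pitch) a word is spoken.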

Figure 2 - two different pitch tracks of "She went to Paris" [11]

Above, the phrase "she went to Paris" is said in two different pitches. It can be seen that, once digitised, the two utterances give two different forms, and thus the pattern matching algorithm will output two different sentences. In addition to the pattern matching algorithm, there are two other methods that can be used for automatic speech recognition: hidden Markov models and maximum entropy Markov models.

1.4.2 Template Matching in Detail

Template matching, also known as the pattern matching algorithm, is a widely used model for recognising speech. "In template matching methods the decision making process matches the unknown input to each of a set of templates, which are prototype examples of pattern data. The matching criterion is generally a correlation which directly reflects the similarities between input and templates. The use of whole-word templates has achieved quite a measure of success, largely due to the procedure of dynamic time alignment (Bridle and Brown, 1979) of input and template, which provide a degree of normalization for the intra-class temporal variations." [12].
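The "dynamic time alignment" mentioned in the quotation can be sketched as a dynamic time warping (DTW) distance. The sketch below assumes one-dimensional feature sequences and a simple absolute-difference frame cost; both are my simplifications, not details from the source.

```python
def dtw_distance(seq_a, seq_b):
    """Dynamic time warping distance between two 1-D sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # d[i][j]: best cost aligning seq_a[:i] with seq_b[:j]
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(seq_a[i - 1] - seq_b[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # stretch seq_b
                                 d[i][j - 1],      # stretch seq_a
                                 d[i - 1][j - 1])  # match both frames
    return d[n][m]
```

A whole-word recogniser would compute this distance between the input pattern and each stored template and choose the template with the smallest distance; the warping absorbs the intra-class temporal variations the quotation refers to.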

1.4.3 Hidden Markov Models

With hidden Markov models, speech patterns are analysed as sequences of short time frames. This produces a sequence of speech parameter vectors, such as linear predictive coding coefficients. Each word, or more generally each pattern, is represented as a sequence of T observations O = (o1, o2, ..., oT) in time [12].


Fundamentally, hidden Markov models use probability to determine which word has been spoken. "Recognition is a decision as to which model best matches the given input pattern, and this is the model which has the highest probability." [12] Given a vocabulary of V words, each modelled by an HMM λv, the recognised word v* is the one whose model assigns the input the highest probability:

v* = argmax{1 ≤ v ≤ V} P(O | λv)

where O = (o1, o2, ..., oT) is the observation sequence and λ1, ..., λV are the V HMMs, one for each of the V words [12].
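The quantity P(O | λv) for a discrete-observation HMM is conventionally computed with the forward algorithm. The sketch below assumes a model described by an initial state distribution `pi`, a transition matrix `a`, and an emission matrix `b`; the names and the discrete-symbol setting are my own illustrative choices.

```python
def forward_probability(obs, pi, a, b):
    """P(O | model) for a discrete-symbol HMM via the forward algorithm.

    obs: list of symbol indices; pi[s]: initial state probabilities;
    a[q][s]: transition probability q -> s; b[s][o]: emission of symbol o.
    """
    n_states = len(pi)
    # alpha[s] = P(o_1..o_t, state_t = s)
    alpha = [pi[s] * b[s][obs[0]] for s in range(n_states)]
    for t in range(1, len(obs)):
        alpha = [
            b[s][obs[t]] * sum(alpha[q] * a[q][s] for q in range(n_states))
            for s in range(n_states)
        ]
    return sum(alpha)
```

A recogniser would evaluate this for each of the V word models and pick the word whose model scores highest, which is exactly the argmax decision rule above.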


Markov chain [12]

The Markov chain is the foundation on which the Markov model is based. A Markov chain is a special case of a weighted automaton. "An automaton defines a "formal language" as the set of strings the automaton accepts over any vocabulary." [13] The input sequence of the Markov chain determines which states the automaton will go through.
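As a toy illustration (my own, not from the source), the distribution over a Markov chain's states after each step is just the current distribution multiplied through the transition matrix:

```python
def step_distribution(dist, transitions):
    """One step of a Markov chain: new probability distribution over states.

    dist[i]: probability of being in state i now;
    transitions[i][j]: probability of moving from state i to state j.
    """
    n = len(dist)
    return [sum(dist[i] * transitions[i][j] for i in range(n))
            for j in range(n)]
```

For example, starting surely in state 0 of a two-state chain with transitions [[0.5, 0.5], [0.2, 0.8]], one step gives [0.5, 0.5] and a second step gives [0.35, 0.65]; the distribution always sums to 1.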


2.0 Aim

The aim of this proposed project is to design and implement a speech recognition system. The main target for the system is home use, effectively converting a home into a "smart home". Because it uses automated speech recognition to create a smart home, the final project will be called "ASH" (Automated Smart Home). Once a working prototype is developed, it will undergo testing to determine whether the system functions to the stated requirements. The fundamental concept of the project is to fully automate the home, so that commands can be spoken from any room via hidden microphones and a response can be heard from hidden speakers.

3.0 Objectives

The main project objectives are defined below:

Development of an analogue-to-digital converter

Development of a digital-to-analogue converter

Implement noise reduction

Building a database that stores words for communication

Design and build of the main speech recognition system

Conduct checks on final prototype

Experiment on how to make the system secure

3.1 ASH Requirements

Checklist of what the proposed system must contain:

Be reliable

User friendly - all age groups

Advanced synthetic speech

Able to be installed in existing homes

Smooth and fast

Be able to do error handling and self debug

4.0 Proposed System

The proposed automated smart home system, which adopts a speech recognition system, will be designed from basic components. A small-scale version of the system will be built for testing purposes, and if the design is successful, a larger full-scale version can be incorporated into production.

This project leans more towards the electronics sector; as a result, pre-programmed software will be used for the speech recognition system. For reliability, commercial software such as IBM's ViaVoice will be used; the software is easily downloadable from the internet. On the hardware side, the main components needed are sound cards, microphones, and a processor powerful enough to process continuous speech rather than single words.

All internal processing components will be housed together for easy management, while external components such as microphones and speakers will be placed where needed. The software will be installed on a small-capacity hard drive which the processor can access in order to process commands.

In addition the whole system will have input/output capabilities in case the user desires to add other external devices such as multimedia systems.

Figure 3 - very simple block diagram of the proposed system [14].

In the diagram above, the majority of the components are housed in the ASR (automatic speech recognition) section. Since technology has advanced from wired to wireless, converting the communication between the external components to wireless will also be investigated.

Work packages

Development of an analogue-to-digital converter

Development of a digital-to-analogue converter

Implement noise reduction

Building of database

Design and build of the main speech recognition system

Conduct checks on final prototype, (Experiment on how to make the system secure)

Compiling Final report & oral presentation

5.0 Project Timetable Management