Speech To Text System For Deaf People Computer Science Essay


In the fast-paced world that we live in today, fast and accurate communication is vital. Despite the rapid advancement of technology, there is still a communication gap between deaf people and hearing people, because many hearing people are not familiar with sign language and deaf people are unable to hear speech properly.

Various technologies and techniques have been implemented to help deaf people communicate with the outside world more effectively. These range from physical devices such as hearing aids and the invention of sign language to more complicated electronic systems that recognize speech, convert the information into images, and deliver it to the deaf person in some way.

In this paper, a Voice to Image converter will be implemented as one solution to this problem. The project will be implemented as a software system that reads in audio from a connected microphone, filters out noise in the audio, and performs speech recognition to recognize what is being said. The output will be displayed in the form of an image to ease the deaf person's understanding of what is being said. The system is particularly useful as a transition aid that lets those who are not familiar with sign language understand the gist of what other people are saying.

To meet these requirements, the converter needs to perform recognition quickly and with a good accuracy rate to be useful in real time. The lag between the speaker speaking and the system recognizing the word should be minimized for a more responsive system. The accuracy of the recognition should also remain high in different environments and with different users.

As such, a filtering system is proposed in order to attempt to reduce the environmental noise as much as possible before the recognition phase. For the recognition phase, effort will initially be targeted at interfacing with existing speech recognition engines such as the Microsoft Speech API, which is able to give a high accuracy rate at a reasonably fast response time.

The project should be able to let deaf people understand better what other people are saying. Having output in the form of images also makes it easier to understand, such that it can be used by relatively young people, or people who do not understand the language of the speaker well. Having a pure software implementation also reduces cost and wear and tear associated with hardware. The key focus in the project is to meet the objectives at a low cost, and with maximum user-friendliness.

Chapter 2: Aims & Objectives

The main objective is to develop a system that converts voice to image to enable deaf people to understand what another person is saying. To achieve the main objective, the following sub-objectives need to be accomplished:

To develop a speech acquisition system to read in audio from a microphone or from an audio file.

To develop a filtering system to reduce noise present in the acquired audio.

To interface with an existing speech recognition library to perform robust speech recognition and obtain the results of the recognition.

To develop an image retrieval system to display the image corresponding to the recognized word.

Chapter 3: Literature Review

3:1. Analysis of Similar Products and Literature

3:1:1. Speech to Text System for Deaf People

A working party was created by the Council for the Advancement of Communication with Deaf People (CACDP) in response to the report by the Commission of Enquiry into Human Aids to Communication to find ways of encouraging workers at the Crown Court System to work with deaf people. Hearing aids and lip-reading are very effective during face-to-face meetings among small groups of people. Unfortunately, many events are held in places that are less than ideal, where the environment might not be properly lit, and the speaker might be too far away to be seen or heard clearly. High levels of background noise also interfere with the usage of hearing aids. Under these circumstances, a simultaneous visual transcript of speech may be helpful. The author discusses the use of speech-to-text systems for deaf and hard of hearing people.


A Speech to Text (STT) Reporter transcribes the spoken words into machine shorthand, which is automatically converted back into English by a computer.

Large text in a simple font is preferred, especially for projector systems.

The display needs to show a reasonable amount of speech, at least 15 seconds' worth, equivalent to 50 words. The deaf person must have enough time to look away from the screen to see what is going on.

Accuracy is important, and a proficient STT Reporter will aim for a minimum of 95% correct spelling, with 98% correct spelling frequently achieved by an experienced STT Reporter.


Remote transcription services

3:1:2. Automatic Image to Text to Voice Conversion

A good approach is to develop a common platform for converting different modalities, such as image and text, into the same medium and associating them for efficient processing and understanding. This paper presents the development of a novel methodology based on Local-Global (LG) graphs, capable of automatically converting image context into natural language text sentences and then into speech, to serve as an interactive model for locating missing objects in a home environment.


Recognize objects in an image and describe their locations.


Conversion of image into Natural Language (NL) text paragraph (English natural language text sentences) by using common representation model, the Local-Global (LG) graph.

3:1:3. Dynamically Binding Image to Text for Information Communication

This demonstrates that a tight dynamical connection may be made between text and interactive visualization imagery. It shows that a bi-directional linkage may be created between the image space of a visualization program and hypertext space so that dynamical image and text representation of a data object are synchronized, thus maintaining the consistency of the visual information and information context.


A natural environment for integrating text with imagery, because graphical content and text not only coexist but may be dynamically linked.


To create a client-server application in which text and image clients access the data object through a server.

The disadvantage of this approach is that the data and its context are hidden behind the server.

3:1:4. Media Conversion from Speech to Facial Image

An automatic facial motion image synthesis scheme, driven by speech, and a real-time image synthesis design are presented. The purpose of this research is to realize an "Intelligent" human-machine interface or "Intelligent" communication system with talking head images.


Text to Image conversion

Voice to Image conversion


LPC Cepstrum

3:2. Filtering

3:2:1. FIR and IIR Filtering

Figure 1: FIR Filter Block Diagram [5]

Traditionally, audio filtering is performed by using linear Finite Impulse Response (FIR) and Infinite Impulse Response (IIR) filters. These methods work very well if the noise that we are trying to remove is band-limited to a certain frequency range.

The idea of an audio filter is to remove frequencies in certain bands. They can be categorized into the following categories:

Low-pass filters: Used to remove noise in the high-frequency region (e.g. hissing and high-frequency tones)

High-pass filters: Used to remove noise in the low-frequency region (e.g. humming and other background noise)

Bandpass filters: Used if the audio signal is band limited within a certain frequency range.

Bandstop filters: Used if the noise is band limited to a certain frequency range

For ideal filtering, the spectrum of the input signal is multiplied by a rectangular function in the frequency domain. For a low-pass filter, this corresponds to an LTI (Linear, Time-Invariant) system with an impulse response having the shape of a sinc function. Filtering can then be done either by convolution in the time domain, or by performing an FFT on the input signal and the impulse response, multiplying them together in the frequency domain, and applying the inverse FFT to get back the filtered signal.

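The FIR filter can be implemented by performing a convolution in the time domain between the audio to be filtered and the FIR filter coefficients. The convolution equation is as follows: [6]

$$ y[n] = \sum_{k=0}^{N-1} h[k]\, x[n-k] $$

Here, x[n] is the input audio, h[k] are the N FIR filter coefficients, and y[n] is the filtered output.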
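The time-domain convolution described above can be illustrated with a minimal Python sketch. The essay does not specify a design method, so the windowed-sinc design with a Hamming window, the 51-tap length, the 8 kHz sample rate, the 1 kHz cutoff, and all function names here are assumptions chosen only for illustration.

```python
import math

def lowpass_fir_coefficients(cutoff_hz, sample_rate_hz, num_taps=51):
    """Windowed-sinc low-pass design with a Hamming window (num_taps should be odd)."""
    fc = cutoff_hz / sample_rate_hz          # cutoff in cycles per sample
    mid = (num_taps - 1) / 2                 # centre tap, gives a symmetric response
    taps = []
    for n in range(num_taps):
        k = n - mid
        # Ideal low-pass impulse response (a sampled sinc function)
        ideal = 2 * fc if k == 0 else math.sin(2 * math.pi * fc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [t / gain for t in taps]          # normalize to unity gain at 0 Hz

def fir_filter(x, h):
    """Direct convolution y[n] = sum_k h[k] * x[n-k], taking x[n] = 0 for n < 0."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k in range(len(h)):
            if n - k >= 0:
                acc += h[k] * x[n - k]
        y.append(acc)
    return y

def rms(values):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(v * v for v in values) / len(values))

# Synthetic test signal: a 300 Hz tone (stand-in for speech) plus 3.5 kHz hiss.
sample_rate = 8000
t = [n / sample_rate for n in range(800)]
speech_band = [math.sin(2 * math.pi * 300 * s) for s in t]
hiss = [0.5 * math.sin(2 * math.pi * 3500 * s) for s in t]
noisy = [a + b for a, b in zip(speech_band, hiss)]

h = lowpass_fir_coefficients(1000, sample_rate)
filtered = fir_filter(noisy, h)
```

Because the coefficients are symmetric about the centre tap, this filter has a linear phase response. A longer filter gives a sharper cutoff at the cost of more multiplications per sample and more delay.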

3:2:2. Comparison

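| FIR | IIR |
| --- | --- |
| Always stable | Can be unstable |
| Higher order | Lower order |
| Linear phase possible | Phase not linear |
| Consists of zeros | Consists of zeros and poles |
| Non-recursive | Recursive |

Table 1: Comparison between FIR and IIR Filters [7]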

IIR filters have an impulse response with an infinite number of terms, and are used mostly in applications where a linear phase response is not required, as their phase is generally non-linear and difficult to control. FIR filters, on the other hand, have an impulse response with a finite number of terms and use no feedback, which means that they are inherently always stable. It is also always possible to design FIR filters with a linear phase response, which is crucial in audio filtering. Because IIR filters feed back previous outputs, they respond much faster than FIR filters and require a much smaller order to achieve the same result. However, this same feedback can cause an IIR filter to become unstable if it is not designed properly.

Generally, FIR filters are preferable over IIR filters for performing audio filtering, as FIR filters are guaranteed to be stable, and they can be designed to have a linear phase response.
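The stability difference can be seen in a small Python sketch. The one-pole recursive filter and the five-point moving average below are hypothetical examples chosen for illustration; they are not filters used in this project.

```python
def one_pole_iir(x, a):
    """Recursive (IIR) filter y[n] = a * y[n-1] + x[n]; its single pole is at z = a."""
    y = []
    prev = 0.0
    for sample in x:
        prev = a * prev + sample   # feedback from the previous output
        y.append(prev)
    return y

def moving_average_fir(x, length):
    """Non-recursive (FIR) filter: average of the current and previous inputs."""
    y = []
    for n in range(len(x)):
        window = [x[n - k] for k in range(length) if n - k >= 0]
        y.append(sum(window) / length)
    return y

# Impulse responses: feed in a single 1 followed by zeros.
impulse = [1.0] + [0.0] * 49
stable = one_pole_iir(impulse, 0.9)     # pole inside the unit circle: response decays
unstable = one_pole_iir(impulse, 1.1)   # pole outside the unit circle: response grows
fir = moving_average_fir(impulse, 5)    # response is exactly zero after 5 samples
```

The FIR response ends after a finite number of samples whatever the coefficients are, whereas the IIR response decays only when the feedback coefficient has a magnitude below one.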

Besides normal FIR and IIR filters, there are also adaptive filters that have been developed to overcome some of the limitations of normal FIR and IIR filters.

3:2:3. Wavelet Transform

Wavelets are mathematical functions that are used to decompose data into constituent frequency components for further analysis and processing. Each component is studied with a resolution matched to its scale. Wavelets were originally developed in the field of mathematics, and have traditionally been used in fields such as quantum physics and electrical engineering. Recently, wavelets have found wide application in the field of digital image processing, mainly for image analysis and denoising.

The underlying theory behind wavelet denoising comes from Fourier analysis, whereby periodic signals are decomposed into scaled sums of sines and cosines. The wavelet transform is a natural generalization of the Fourier transform, using more general functions called mother wavelets rather than the sine and cosine functions used in the Fourier transform.

The steps for applying the wavelet transform to audio denoising are as follows:

Apply the wavelet transform to the noisy signal to produce the noisy wavelet coefficients down to the chosen decomposition level.

Select an appropriate threshold limit at each level and a threshold method (hard or soft thresholding) to best remove the noise.

Apply the inverse wavelet transform to the thresholded wavelet coefficients to obtain a denoised signal.

Figure 2: Wavelet Signal Denoising [8]
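The three denoising steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the Haar wavelet is used because it is the simplest to write out, the threshold is the commonly used universal threshold sigma * sqrt(2 ln N) with the noise level sigma assumed to be known, and all function names are hypothetical. The signal length must be divisible by 2 raised to the number of levels.

```python
import math
import random

def haar_forward(signal):
    """One level of the Haar DWT: returns (approximation, detail) coefficients."""
    s = math.sqrt(2)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_inverse(approx, detail):
    """Inverse of haar_forward: interleaves the reconstructed sample pairs."""
    s = math.sqrt(2)
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / s)
        out.append((a - d) / s)
    return out

def soft_threshold(coeffs, limit):
    """Shrink every coefficient toward zero by `limit`; small ones become zero."""
    return [math.copysign(max(abs(c) - limit, 0.0), c) for c in coeffs]

def haar_denoise(signal, levels, limit):
    """Step 1: decompose. Step 2: threshold the details. Step 3: reconstruct."""
    approx = list(signal)
    details = []
    for _ in range(levels):
        approx, detail = haar_forward(approx)
        details.append(soft_threshold(detail, limit))
    for detail in reversed(details):
        approx = haar_inverse(approx, detail)
    return approx

def mse(a, b):
    """Mean squared error between two equal-length signals."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

# Synthetic example: one cycle of a sine wave plus Gaussian noise.
random.seed(0)
n_samples = 256
sigma = 0.2                                   # noise level, assumed known here
clean = [math.sin(2 * math.pi * n / n_samples) for n in range(n_samples)]
noisy = [c + random.gauss(0.0, sigma) for c in clean]

limit = sigma * math.sqrt(2 * math.log(n_samples))   # universal threshold
denoised = haar_denoise(noisy, levels=3, limit=limit)
```

Comparing `mse(denoised, clean)` with `mse(noisy, clean)` gives a quick check of how much noise was removed. In practice the noise level would have to be estimated from the signal, and a smoother wavelet than Haar may suit audio better.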

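The Discrete Wavelet Transform (DWT) is obtained by discretizing the Continuous Wavelet Transform (CWT). The CWT equation is as follows: [9]

$$ X(a, b) = \frac{1}{\sqrt{|a|}} \int_{-\infty}^{\infty} x(t)\, \psi^{*}\!\left(\frac{t - b}{a}\right) dt $$

Here, x(t) is the signal to be analyzed, while ψ(t) is the mother wavelet or basis function (ψ* denotes its complex conjugate). The parameter a is the scale and b is the translation.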

3:3. Comparison between Filters

| Filtering Technique | Advantages | Disadvantages |
| --- | --- | --- |
| Wavelet | Can be used to filter out more general signals than FIR/IIR filters, as noise does not need to be band limited; always stable; linear phase response possible; different resolutions possible, depending on the number of decomposition levels; different types of wavelets available to better fit different types of noise | Algorithm is more complex than IIR/FIR filters; slower than FIR and IIR |
| FIR | Algorithm is simpler compared to wavelet; always stable; linear phase response possible; faster than wavelet | Only works with band-limited noise; fixed resolution; noise discrimination based purely on frequency band |
| IIR | Algorithm is simpler compared to wavelet; faster than wavelet | Only works with band-limited noise; might be unstable; difficult to achieve linear phase response, as response is generally non-linear; fixed resolution; noise discrimination based purely on frequency band |

Table 2: Comparison between Filters

Based on the advantages and disadvantages above, audio processing calls for either wavelet denoising or FIR filtering, as both can produce outputs with a linear phase response. Wavelets have an advantage over FIR filters in terms of flexibility: FIR filters can only remove noise that lies within a frequency range, whereas wavelets can be chosen to optimize the elimination of different types of noise. However, wavelets are slightly slower and require more complex calculations/circuitry than FIR filters, and FIR filters remove noise better when the frequency range of the noise is known exactly (that is, when the noise is band limited).

As we want a filter with a quick response, and most of the noise to be filtered out is band limited, the FIR filter is more suitable for this project. The system needs to operate in real time, so it needs a fast filtering algorithm with good performance. Also, the cutoff frequencies of the FIR filter can be set by the user should they need to filter out noise in a different frequency range.

3:4. Tools Development Research

3:4:1. MATLAB

MATLAB stands for MATrix LABoratory. It is a high-performance language for technical computing and a tool for numerical computation and visualization. MATLAB is widely used in all areas of applied mathematics, in education and research at universities, and in industry. It is well suited to algebra, differential equations, and numerical integration. MATLAB is an interactive system whose basic data element is a matrix. Because it manipulates array-based data, programs are generally fast to write and run in MATLAB. [10]

MATLAB is widely used in mathematics and scientific modeling due to its large number of built-in libraries, which allow rapid application prototyping and simulation. This reduces the need to code those libraries ourselves, allowing users to concentrate on the logic of their application without worrying about re-implementing commonly used algorithms.

3:4:2. Visual Basic

Visual Basic is a high-level programming language which evolved from an earlier DOS-era language called BASIC. It is a very easy programming language to learn, and its code is similar to the English language. Visual Basic provides many tools to aid in building applications. It is widely used for Windows application development, where programs (console based, GUI based, libraries, etc.) need to be written to run on the Microsoft Windows OS. It is developed by Microsoft, and comes with general application libraries in the form of the .NET framework. Many third-party libraries are also available that can be called from Visual Basic. [11]

3:4:3. Comparison between Programming Languages

| Programming Language | Advantages | Disadvantages |
| --- | --- | --- |
| MATLAB | Easily written and modified with the built-in integrated development environment, and debugged with the MATLAB debugger; includes tools that allow a programmer to interactively construct a graphical user interface (GUI); includes a lot of built-in libraries for performing signal processing and other mathematical operations; lots of third-party libraries can be found to perform a variety of different tasks | Requires a large amount of memory and is very hard to use on a slow computer; programs run slower compared to a compiled language; neither free nor open-source, and the license can be rather expensive |
| Visual Basic | Primarily an integrated, interactive development environment; provides a comprehensive interactive and context-sensitive online help system; highly optimized to support rapid application development (RAD); a free version of Visual Basic (VB Express) is available; lots of third-party libraries available for VB | Visual Basic is a proprietary programming language written by Microsoft, so programs written in Visual Basic cannot easily be transferred to other operating systems; not as many signal processing libraries are available as for MATLAB, so the user may need to implement some libraries themselves |

Table 3: Comparison between Programming Languages

From the comparison, it was decided to use MATLAB as the primary development tool due to the availability of libraries, which eases the process of acquiring and manipulating audio signals without needing to worry about operating system and driver details. MATLAB also supports many different types of audio acquisition devices.

The availability of libraries for performing Fast Fourier Transform, convolution, filter design, wavelet analysis, etc. also enables the user to concentrate on the system and algorithm, without needing to worry about re-implementing these libraries again. If necessary, third-party libraries are also easily available on the Internet.

MATLAB also contains functions for acquiring, processing and displaying various types of images, without needing to worry about the image format.

Chapter 4: Project Methodology

4:1. Block Diagram

Speech Acquisition -> Filtering -> Speech Recognition -> Image Retrieval -> Display Image

Figure 3: Block Diagram of Voice to Image Converter

The block diagram in Figure 3 shows the basic idea of the system. Human voice is used as input. The purpose of the speech acquisition system is to read in audio from the microphone. The filtering system is then used to reduce the noise present in the acquired audio, improving the clarity and accuracy of the input signal before it is passed to the speech recognition module.

Speech recognition fundamentally functions as a pipeline that converts PCM (Pulse Code Modulation) digital audio from a sound card into recognized speech. The PCM digital audio is first transformed into a better acoustic representation, and a grammar is applied so that the speech recognizer knows what phonemes to expect. The speech recognizer then determines which phonemes were spoken and converts the phonemes into recognized words. For our system, the speech recognition portion will be performed using an existing speech library that attempts to match what is spoken with words stored in the database.

The image retrieval system searches for and retrieves images from the database. Based on the word recognized by the Speech Recognition module, an image will be fetched from the image database and displayed to the user.

4:2. Flow Chart


Read in audio from microphone

Filter audio signal

Send filtered signal to speech library

Speech recognition (matching to predefined words in database)

Match found? If no, return to reading in audio; if yes, continue

Fetch image from database

Display image

Figure 4: Flow Chart

The flow chart in Figure 4 describes how the Voice to Image Converter works. Human voice is used as an audio input to the computer's microphone. The system will read in the voice signal and store it in an array. The system will then filter the acquired audio using a low-pass filter to reduce noise and improve the accuracy of the speech recognition. Once the audio has been filtered, it will be sent to a speech recognition library in order to recognize what word was spoken. The speech recognition mode will be set to C&C (Command and Control), which will attempt to match the spoken word to a list of words stored in the word database. If a match is found, the system will retrieve the image corresponding to the recognized word from the image database and display it to the user. Otherwise, the system will repeat the process until a word is recognized.
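The control flow described above can be outlined in a short Python sketch. Every stage here is a hypothetical stand-in: audio is represented by a plain string, the filter passes it through unchanged, and the recognizer simply looks the string up in the word list. The function names, the dictionary, and the image paths are assumptions for illustration and are not part of any speech library.

```python
# Hypothetical word-to-image database; the file paths are placeholders.
WORD_TO_IMAGE = {
    "apple": "images/apple.png",
    "house": "images/house.png",
}

def acquire_audio(source):
    """Stand-in for microphone capture: takes the next utterance from a list."""
    return source.pop(0) if source else None

def filter_audio(audio):
    """Stand-in for the low-pass filtering stage (passes audio through unchanged)."""
    return audio

def recognize(audio, vocabulary):
    """Stand-in for command-and-control recognition: match against a fixed word list."""
    word = audio.strip().lower()
    return word if word in vocabulary else None

def run_converter(source, database):
    """Repeat acquire -> filter -> recognize until a word in the database is matched."""
    while True:
        audio = acquire_audio(source)
        if audio is None:
            return None                      # input exhausted in this sketch
        word = recognize(filter_audio(audio), database.keys())
        if word is not None:
            return database[word]            # the image to display

# The first utterance is not in the word list, so the loop tries again.
image_path = run_converter(["mumble", "Apple"], WORD_TO_IMAGE)
```

In the real system the loop would keep listening to the microphone rather than stopping when a list runs out, and the returned path would be passed to an image display routine.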

Chapter 5: Project Planning and Costing

5:1. Gantt Chart

Figure 5: Gantt Chart

5:2. Costing






[Costing table: only the entry RM 10 is recoverable]

Table 4: Costing

As the software work is planned to be done in the KDU computer lab, the computer and MATLAB software resources are not included in the costing, as I will be using the computers and MATLAB license already installed there.