Many deaf people can be helped to hear with hearing aids, and many blind people can be helped to see through corrective mechanisms or surgery. But what about the dumb? Are they beyond help? Obviously not. It is quite possible to give them a voice with our implementation model, which relies on the "Augmented Reality" and "Sixth Sense" technologies. In this paper we propose our own research model for making dumb people speak using only their hand gestures, in a manner similar to the Sixth Sense device. A camera acts as a digital eye that captures these gestures and recognizes what the user is trying to convey to the real world. The technique behind this is a complex digital image processing algorithm, which differentiates the various hand gestures and thereby selects the appropriate text template for the intended message; the message is then conveyed through a loudspeaker using text-to-speech conversion algorithms, allowing the dumb to communicate artificially.
Index Terms: Augmented Reality, Sixth Sense, Image Processing, Gesture Recognition, Artificial Speaking Aid.
What is AR?
Augmented reality (AR) is a field of computer science that combines the physical real world with an interactive, 3D virtual world. AR blurs the line between what is real and what is computer-generated by enhancing what we see, hear, feel and smell. The goal of augmented reality is to add information and meaning to a real object or place. AR thus enhances the user's perception by overlaying additional information on the real world, helping people accomplish crucial real-world tasks. Unlike virtual reality, augmented reality does not create a simulation of reality. Instead, it takes a real object or space as its foundation and incorporates technologies that add contextual data to deepen a person's understanding of the subject.
Sixth Sense Using AR
Unlike other complex systems in the technology world, a Sixth Sense based system uses only a few minimal components with very affordable costs, as follows.
Camera
Small projector & Mirror
Smartphone
Some Colored Markers
These components are tied together in a defined way into a pendant-like wearable device that augments the real world with the virtual one, enabling us to interact directly with the digital world using just hand gestures. The device is also notable because the projector essentially turns any surface into an interactive screen. In operation, the camera and mirror examine the outside world and feed the image to the phone, which processes it, gathers GPS coordinate information and retrieves data from the Internet; the projector then displays the resulting information on the surface in front of the user, whether that is a wrist, a wall, or even a person. The device costs only around US $350, based on an estimate of the component costs. The logic and algorithms behind it, however, are tedious and complex; even so, their output is very effective in solving real-world problems around people.
Our Proposed Model
3.1. Components and Architecture
In our proposed model, the components are more or less similar to those of the Sixth Sense device, with some alterations to suit our desired application. The major components of our system include:
3.1.1. Hardware Components
Camera
Portable Computing Device
Mirror & Projector
Loudspeaker
3.1.2. Software Components
Motion Detection Algorithm
Collection of Dialogue Templates
Text to Speech Conversion Algorithm.
This artificial speaking aid must be given some inputs before the system is ready for use; the input to the system is, in effect, the input to its components. The camera detects the user's natural hand gestures, predicts their positions and basic forms, and gives the resulting image to the computing device for generating events.
The input to the text-to-speech conversion algorithm is the text template selected by the computing device. The sound signals generated from the speech templates are fed to the loudspeaker, which speaks them to the real world.
The main input to the projector comes from the computing device (the smartphone itself), which projects the system's user interface (the speaking aid controls) onto any surface, allowing that surface to be used as a touch screen.
Hand Gesture Recognition
The algorithm behind the Sixth Sense device is tedious; here, to implement our proposed model, we use a simple algorithm that detects a few basic hand gestures for demonstration. The camera detects the hand movements and takes sequential snapshots, for example when the hand shows five fingers with the thumb to the left. The predefined set of action pictures is then compared with the snapshots one by one; if any gesture matches, the corresponding speech template assigned to it is selected by the computing device and passed to the next component.
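The matching step above can be sketched as a simple nearest-template search. This is only an illustrative sketch: the gesture images are simplified to tiny binary grids (1 = hand pixel), and all names and thresholds are hypothetical, not from the paper.

```python
# Illustrative sketch of the snapshot-vs-template matching step: compare a
# binary snapshot against each stored gesture image and pick the closest.

def pixel_difference(img_a, img_b):
    """Count positions where two equal-sized binary images differ."""
    return sum(
        1
        for row_a, row_b in zip(img_a, img_b)
        for pa, pb in zip(row_a, row_b)
        if pa != pb
    )

def match_gesture(snapshot, gesture_db, max_diff=2):
    """Return the ID of the stored gesture closest to the snapshot,
    or None if nothing is similar enough."""
    best_id, best_diff = None, max_diff + 1
    for gesture_id, template in gesture_db.items():
        diff = pixel_difference(snapshot, template)
        if diff < best_diff:
            best_id, best_diff = gesture_id, diff
    return best_id

# Tiny demonstration database: two 3x3 "gestures".
GESTURES = {
    "open_hand": [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
    "fist":      [[0, 0, 0], [0, 1, 0], [0, 0, 0]],
}

snap = [[1, 1, 1], [1, 1, 1], [1, 1, 0]]   # noisy open hand
print(match_gesture(snap, GESTURES))        # open_hand
```

A real implementation would of course compare full camera frames after the image processing steps described below, but the control flow is the same: first match below the difference threshold wins.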
The main thing to note here is the database, which contains the gesture images and the speech templates associated with them. When a gesture is recognized, it should also invoke the computing device to feed the projector, which projects the system's user interface screen for controlling the speaking aid options. If a hand gesture does not match any gesture stored in the database, it may be a problem; for this reason, the user should be trained to use the system by providing some instruction.
For example, consider the recognition of one simple hand gesture (right hand).
Here the hand is recognized by analyzing the negative view of the camera image: the lighter parts of the picture are marked, and a statistical analysis of the bit positions and colour combinations of the lighter bits, after applying some threshold values, is used for comparison. The recognized image is then checked against the images stored in the database of gesture images.
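The "mark the lighter parts" step can be sketched as a simple brightness threshold over a grayscale frame. The threshold value and the frame below are illustrative assumptions, not values from the paper.

```python
# Minimal sketch: a grayscale frame (values 0-255) is turned into a binary
# mask by marking pixels brighter than a threshold.

def brightness_mask(frame, threshold=128):
    """Mark pixels brighter than the threshold as 1, the rest as 0."""
    return [[1 if px > threshold else 0 for px in row] for row in frame]

frame = [
    [ 30, 200, 210],
    [ 25, 190,  40],
    [ 20,  35,  50],
]
print(brightness_mask(frame))
# [[0, 1, 1], [0, 1, 0], [0, 0, 0]]
```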
Here the gestures themselves act as separate events when mapping speech templates to gestures, much as in event-driven programming where events are given IDs; the gestures are likewise recognized by their distinct IDs.
Thus it helps in selecting the speech template to be conveyed to the real world.
Rather than storing predefined images of hand gestures, the system lets users select their own hand gestures; matching a speech template to a gesture is left to the user. This helps users easily remember the actions and gestures, since they defined them themselves, and it also helps the system avoid conflicts between gesture images.
Image Processing Algorithm used: Blob Colouring Algorithm
Easy and fast detection within acceptable frame rates (~2 fps+).
Easing the Detection
To make detection easier and faster, we decided to focus solely on the colour/brightness of a certain object (instead of detecting the user's real finger).
Find regions of a pre-defined color/brightness within an image.
A "backwards L" shaped template is passed over the whole image from left to right and top to bottom.
"Backwards L" shaped template
For each pixel calculate the distance ...
d1 between itself and its left neighbour.
d2 between itself and its upper neighbour.
Definition: "Distance of two pixels"
Grayscale: difference between the gray levels.
RGB: Euclidean distance in the RGB color space.
HSI: difference between hue or intensity.
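The first two distance definitions above can be written out directly; a minimal sketch (HSI omitted, function names are illustrative):

```python
# Pixel "distance" as defined above, for grayscale and RGB pixels.
import math

def gray_distance(a, b):
    """Grayscale: absolute difference between the gray levels."""
    return abs(a - b)

def rgb_distance(a, b):
    """RGB: Euclidean distance in the RGB color space."""
    return math.sqrt(sum((ca - cb) ** 2 for ca, cb in zip(a, b)))

print(gray_distance(200, 180))               # 20
print(rgb_distance((255, 0, 0), (0, 0, 0)))  # 255.0
```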
A pixel is considered to belong to a different region if the distance di to the adjacent pixel is greater than a certain threshold T.
(d1 > T) and (d2 > T)
Pixel is different from both neighbours => assign to a new region.
(d1 < T) and (d2 > T)
Pixel is different from above neighbour, but similar to left neighbour => assign to same region as left neighbour.
(d1 > T) and (d2 < T)
Pixel is different from left neighbour, but similar to above neighbour => assign to same region as above neighbour.
(d1 < T) and (d2 < T)
Pixel is similar to both neighbours => assign it to the same region as the neighbours.
Case 4 is problematic:
Current pixel is similar to both neighbours, but the regions for the neighbours differ.
Both neighbour regions differ but are equivalent due to the current pixel
Currently examined template => "red line"
Current pixel => "red"
Pixels in region "1" => "green"
Pixels in region "2" => "blue".
Equivalent Region Problem
A 2D integer array is used to store the region number of each pixel. If the 4th case occurs, renumbering the whole integer array (because two regions are equivalent) is very time consuming, especially since this can happen more than once.
Pre-processing: all pixels not belonging to the defined colour/brightness range are removed from the image (their colour is set to "black"), so only the colours being searched for remain. The actual distance therefore no longer needs to be calculated: it is either "0" (colour -> colour) or "1" (colour -> black).
An equivalence map is used instead of the integer-array.
Region equivalence Map
Problem: if it is discovered that region "2" is equivalent not only to "1" but also to "3", this information would be overwritten. If renumbering takes place immediately, the processing time rises again.
Region equivalence Trees
Region equivalence Table before and after flattening
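The whole pass described above can be sketched compactly. This assumes the pre-processing step has already reduced the frame to a binary image (1 = target colour, 0 = black), so the distance is implicit; case 4 equivalences are recorded in an equivalence map and resolved in a final flattening pass instead of renumbering the array each time. All names are illustrative.

```python
# Hedged sketch of blob colouring over a pre-processed binary image, with
# an equivalence map resolving case-4 region merges.

def find(eq, r):
    """Follow the equivalence map to the representative region number."""
    while eq[r] != r:
        r = eq[r]
    return r

def blob_colour(binary):
    h, w = len(binary), len(binary[0])
    labels = [[0] * w for _ in range(h)]   # 0 = background ("black")
    eq = {}                                # region -> equivalent region
    next_region = 1
    for y in range(h):
        for x in range(w):
            if not binary[y][x]:
                continue
            left = labels[y][x - 1] if x > 0 else 0
            up = labels[y - 1][x] if y > 0 else 0
            if not left and not up:          # case 1: new region
                eq[next_region] = next_region
                labels[y][x] = next_region
                next_region += 1
            elif left and not up:            # case 2: same as left
                labels[y][x] = left
            elif up and not left:            # case 3: same as above
                labels[y][x] = up
            else:                            # case 4: similar to both
                labels[y][x] = find(eq, left)
                eq[find(eq, up)] = find(eq, left)   # record equivalence
    # Flattening pass: replace every label by its representative.
    for y in range(h):
        for x in range(w):
            if labels[y][x]:
                labels[y][x] = find(eq, labels[y][x])
    return labels

img = [
    [1, 0, 1],
    [1, 0, 1],
    [1, 1, 1],   # the bottom row merges the two columns (case 4)
]
print(blob_colour(img))
```

Deferring the merge to the equivalence map is exactly what avoids the repeated-renumbering cost the text warns about: each merge is a single map update, and the array is rewritten only once at the end.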
Determining which Blob to use
Problem: the biggest blob is not always the one that should be detected
(e.g. a large reflection from the surface).
Filtering criteria: the absolute ratio between the width and height of the blob's bounding box, and a minimum region size in pixels.
Bounding Box width-height Ratio
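A minimal sketch of this selection step: candidate blobs are filtered by a minimum pixel count and by the bounding-box width/height ratio, and the largest survivor is chosen. The thresholds and blob summaries below are illustrative assumptions, not values from the paper.

```python
# Filter candidate blobs by size and bounding-box shape, then pick the
# biggest acceptable one (so a huge elongated reflection is rejected).

def pick_blob(blobs, min_size=4, max_ratio=3.0):
    """blobs: list of (pixel_count, width, height) summaries."""
    candidates = []
    for count, w, h in blobs:
        ratio = max(w, h) / min(w, h)        # absolute width/height ratio
        if count >= min_size and ratio <= max_ratio:
            candidates.append((count, w, h))
    return max(candidates, default=None)      # biggest acceptable blob

blobs = [
    (50, 25, 2),   # large but very elongated: likely a reflection
    (12, 4, 3),    # compact, marker-sized blob
    (2, 1, 2),     # too small: noise
]
print(pick_blob(blobs))   # (12, 4, 3)
```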
Selection of Speech templates
From among the large number of speech templates, one particular template must be selected for each distinct gesture, based on the IDs of the gesture images and the IDs of the templates.
The computing device maps the two together, and the selected speech template is then given as input to the text-to-speech conversion algorithm. The gesture-to-text mapping can be done in many ways; for example, every hand gesture is given a unique ID, and these IDs are mapped to predefined text templates. If the snapshot matches a gesture in the database, the corresponding speech template is selected.
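The ID-based mapping just described amounts to a simple table lookup. The gesture IDs and phrases below are hypothetical examples, not templates from the paper.

```python
# Sketch of the gesture-ID -> speech-template mapping done by the
# computing device.

SPEECH_TEMPLATES = {
    "open_hand": "Hello, how are you?",
    "fist": "I need help, please.",
    "thumb_left": "Thank you very much.",
}

def select_template(gesture_id):
    """Return the speech template for a gesture, or None if unmapped."""
    return SPEECH_TEMPLATES.get(gesture_id)

print(select_template("fist"))      # I need help, please.
print(select_template("unknown"))   # None
```

Since the paper lets users define their own gestures, this table would be filled in during the training step rather than shipped as a fixed set.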
Text to Speech Conversion
The selected speech template is then given to a verbose text-to-speech converter, many of which are freely available on the Internet. In this step, the speech template is converted into the corresponding sound signals, which are fed to the loudspeakers to convey the message aloud to the real world.
UI for Controlling options using projector
The system provides user interface options for controlling its behaviour, such as the speakers' volume, the speaking rate, and accent and language settings. These are altered through the interface projected by the projector, using augmented reality concepts with the help of markers and their position coordinates. The same approach can even enhance the usability of smartphones by giving them an augmented reality touch interface.
The system model proposed here is a new one that uses the Sixth Sense device and serves both as an enhancement to that device and as an artificial speaking aid for the dumb. Users can therefore employ it as a Sixth Sense device, a mobile phone, and the proposed speaking aid. As it builds on the broader fields of Augmented Reality and Sixth Sense, its options for further enhancement remain wide open.