Gesture interpretation can be seen as a way for computers to begin to understand human body language, building a richer bridge between machines and humans than primitive text user interfaces or even GUIs, which still limit most input to the keyboard and mouse. It has also become increasingly evident that the difficulties encountered in analyzing and interpreting individual sensing modalities may be overcome by integrating them into a multimodal human-computer interface. Such a system is needed to interpret the speech and gesture sensing modalities in the context of a human-computer interface. This research can benefit from the many disparate fields of study that increase our understanding of the different human communication modalities and their potential role in human-computer interfaces, with applications such as computer-assisted surgery for experts, wheelchair control for persons with disabilities, and mining. This paper presents a real-time system for speech recognition and vision-based hand gesture recognition, built for gesture interpretation and using the dynamic time warping algorithm. Speech and hand gestures are important modalities for human-computer interaction. A vision-based recognition system can give the computer the capability of understanding and responding to gestures.
Index Terms- Human-computer interface, multimodality
Recent advances in various technologies, coupled with an explosion in available computing power, have given rise to a number of novel human-computer interaction modalities, including speech, vision-based gesture recognition, eye tracking, and electroencephalography. Successfully embodying these modalities in an interface has the potential to ease the human-computer interface bottleneck that has become noticeable with the advances in computing and communication.
Gesture interpretation is a topic in science and technology whose goal is to interpret human gestures via mathematical algorithms. Gestures can originate from any bodily motion or state, but commonly originate from the face or hand. Current focuses in the field include emotion recognition from the face and hand gesture recognition. Many approaches use cameras and computer vision algorithms to interpret sign language. However, the identification and recognition of posture, gait, proxemics, and human behaviors are also the subject of gesture recognition techniques.
In human-human interaction, multiple communication modalities such as speech, gestures, and body movements are frequently used. The standard input methods, such as text input via the keyboard and pointer/location information from a mouse, do not provide a natural, intuitive interaction between humans and robots. It is therefore essential to create models for natural and intuitive communication between humans and computers. Furthermore, for intuitive gesture-based interaction between human and robot, the computer should understand the meaning of a gesture with respect to society and culture. The ability to understand speech and hand gestures will improve the naturalness and efficiency of human interaction with computers, and will allow the user to communicate in complex tasks without resorting to tedious sets of detailed instructions.
2. Scope of the Problem
In this section, we outline both the scientific and the engineering challenges in designing speech-gesture driven multimodal interfaces for a context-based gesture interpretation system. Our main goal is to design a dialogue-enabled HCI system for collaborative decision making, command, and control. While traditional interfaces support sequential and unambiguous input from devices such as the keyboard and conventional pointing devices (e.g., mouse, track pad), speech-gesture driven dialogue-based multimodal interfaces relax these constraints and typically incorporate a broader range of input channels and devices (e.g., spoken language, eye and head tracking, gesture, pen, touch screen, displays, keypads, pointing devices, and tactile sensors). The development of a dialogue-based speech-gesture driven interface is motivated by knowledge of the natural integration patterns that typify people's combined use of different modalities in natural communication. Recent trends in multimodal interfaces are inspired by the goal of supporting more transparent, flexible, efficient, and powerfully expressive means of HCI than before. Multimodal interfaces are expected to support a wider range of applications, to be usable by a broader spectrum of the population, and to function more reliably under realistic and challenging usage conditions. The main challenges related to the design of a speech-gesture driven interface for a gesture interpretation system are:
A. domain and task analysis;
B. acquisition of valid multimodal data;
C. speech recognition;
D. recognizing users' gestures;
E. interoperability of devices.
A. Domain and Task Analysis
Understanding the task domain is essential to make the challenge of building a natural interface for a gesture interpretation system (or other application domains) a tractable problem. This is because multimodal signification (through speech, gesture, and other modalities) is context dependent. Within this context, cognitive systems engineering (CSE) has proven to be an effective methodology for understanding the task domain and developing interface technologies to support the performance of tasks. The theoretical frameworks of distributed cognition, activity theory, and cognitive ergonomics also have the potential to help isolate and augment specific elements of the crisis management domain for multimodal system design. One should consider scale and needs before settling on a single framework, which makes it important to consider a variety of approaches when designing a collaborative multimodal gesture interpretation system.
B. Acquisition of valid multimodal data
An important feature of a natural interface would be the absence of predefined speech and gesture commands. The resulting multimodal "language" would thus have to be interpreted by a computer. While some progress has been made in the natural language processing of speech, there has been very little progress in the understanding of multimodal HCI. Although most gestures are closely linked to speech, they still present meaning in a form fundamentally different from speech. Studies in human-to-human communication, psycholinguistics, and related fields have already generated a significant body of research on multimodal communication. However, they usually consider the problem at a different granularity. The patterns of face-to-face communication do not automatically transfer to HCI because of the "artificial" paradigms of information displays. Hence, the lack of multimodal data, which is required to learn the multimodal patterns before the system is built, creates a so-called chicken-and-egg problem.
C. Speech Recognition
The performance of a speech recognition system can be assessed and improved against the following criteria:
size of the recognizable vocabulary;
degree of spontaneity of the speech to be recognized;
dependence on or independence from the speaker;
system start-up time;
time the system needs to adapt to new speakers;
decision and recognition time;
recognition rate (expressed per word or per sentence).
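The last criterion, recognition rate per word or per sentence, can be made concrete with a small sketch. The paper does not specify how the rate is computed, so the following assumes the common convention of one minus the word-level edit distance divided by the number of reference words, and an exact match for the sentence-level rate. The function names and the sample commands are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two word lists (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_recognition_rate(pairs):
    """1 - (total word errors / total reference words); assumed definition."""
    total_words = sum(len(ref.split()) for ref, _ in pairs)
    errors = sum(edit_distance(ref.split(), hyp.split()) for ref, hyp in pairs)
    return 1.0 - errors / total_words


def sentence_recognition_rate(pairs):
    """Fraction of utterances whose recognized word sequence matches exactly."""
    correct = sum(1 for ref, hyp in pairs if ref.split() == hyp.split())
    return correct / len(pairs)


# Hypothetical (reference, recognized) utterance pairs.
pairs = [("move forward", "move forward"),
         ("turn left now", "turn right now")]
print(word_recognition_rate(pairs), sentence_recognition_rate(pairs))
```

Under this definition the word-level rate can become negative when the recognizer inserts many spurious words, which is one reason the two rates are usually reported together.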
D. Recognizing Users' Gestures
Gesture acquisition is concerned with capturing hand or body motion information so that gesture recognition can subsequently be performed. Gestures are generally defined as movements of the body or limbs that express or emphasize ideas and concepts. In the context of multimodal systems, pen- and touch-based interfaces are also commonly considered to fall within the gesture recognition domain. However, while gesture acquisition is merely a marginal problem for pen- and touch-based systems, it requires considerable effort for most other approaches. Aside from pen- and touch-based systems, the most common gesture acquisition methods are based on magnetic trackers, cyber gloves, and vision-based approaches. The suitability of the different approaches depends on the application domain and the platform. Pen-based approaches are the method of choice for small mobile devices and are cost effective and reliable. Acquisition using magnetic trackers and/or cyber gloves is efficient and accurate, but suffers from the constraint of having to wear restrictive devices. In contrast, vision-based approaches offer entirely contact-free interaction and are flexible enough to operate on all platforms except the smallest mobile devices.
E. Interoperability of Devices
Both the interpretation of multimodal input and the generation of natural and consistent responses require access to higher-level knowledge. In general, the semantics required by multimodal systems can be categorized along two dimensions: general versus task/domain specific, and dynamic versus static occurrence relations between speech and gesture. Interoperability across the wide range of devices is critical for a seamless flow of information and communication. Hence, it is important to design unified multimedia applications. There are also many challenges associated with the accuracy and usefulness of gesture recognition software. For image-based gesture recognition, there are limitations due to the equipment used and to image noise. Images or video may not be captured under consistent lighting or in the same location. Items in the background or distinct features of the users may make recognition more difficult. Moreover, all postures are stored in a database so that the current posture can be compared against them and a decision made accordingly. Managing a large image database places extra overhead on the system.
3. SYSTEM ARCHITECTURE
The system is based on the dynamic time warping algorithm, which is used for both speech and hand gesture recognition. A context-based gesture interpretation system takes valid input and, based on this input, performs a response action. This section describes the overall methodology used for both speech and hand gestures.
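As an illustration of dynamic time warping, the following is a minimal sketch of the textbook formulation, not the implementation used in this paper. It assumes each utterance or gesture has already been reduced to a sequence of scalar features and that recognition picks the stored template with the smallest warped distance. The names `dtw_distance` and `classify` and the template values are hypothetical.

```python
def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b repeats
                                 cost[i][j - 1],      # b advances, a repeats
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def classify(sample, templates):
    """Return the label of the template closest to the sample under DTW."""
    return min(templates, key=lambda label: dtw_distance(sample, templates[label]))


# Hypothetical one-dimensional feature tracks for two commands.
templates = {"left": [0, 1, 2, 3], "right": [3, 2, 1, 0]}
print(classify([0, 0, 1, 2, 2, 3], templates))
```

Because the alignment may repeat elements of either sequence, the same command performed faster or slower still yields a small distance, which is why one matcher can serve both the speech and the gesture channel. For multi-dimensional features, only the `dist` argument needs to change.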
Figure 3.1. Proposed System Architecture for Speech Recognition
3.2 Design of Hand Gesture Recognition
This paper presents a new vision-based framework that allows users to interact with computers through hand postures and that adapts to different lighting conditions and backgrounds. Its efficiency makes it suitable for real-time applications. The paper focuses on the stages involved in hand posture recognition, from the originally captured image to its final classification. Frames from video sequences are processed and analyzed in order to remove noise, find skin tones, and label every object pixel. Once the hand has been segmented, it is identified as a certain posture.
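The step of labeling every object pixel can be sketched as connected-component labeling on a binary skin mask. This is an assumption about how the labeling might be done, since the paper does not name a method. The sketch uses 4-connectivity and treats the largest component as the hand; `label_components` and `largest_component` are hypothetical names.

```python
from collections import deque


def label_components(mask):
    """Label 4-connected regions of truthy cells; returns (labels, count)."""
    rows, cols = len(mask), len(mask[0])
    labels = [[0] * cols for _ in range(rows)]  # 0 means background
    count = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and labels[r][c] == 0:
                count += 1
                labels[r][c] = count
                queue = deque([(r, c)])
                while queue:  # breadth-first flood fill of one region
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and labels[ny][nx] == 0):
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count


def largest_component(labels, count):
    """Label of the biggest region, or 0 if the mask had no object pixels."""
    if count == 0:
        return 0
    sizes = [0] * (count + 1)
    for row in labels:
        for value in row:
            sizes[value] += 1
    sizes[0] = 0  # ignore background
    return max(range(count + 1), key=lambda k: sizes[k])


# Tiny hypothetical skin mask: one 3-pixel blob and one isolated noise pixel.
mask = [[1, 1, 0, 0],
        [1, 0, 0, 1],
        [0, 0, 0, 0]]
labels, count = label_components(mask)
print(count, largest_component(labels, count))
```

Keeping only the largest region also acts as a crude noise filter, since isolated skin-colored pixels form small components that are discarded.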
Figure 3.2. Proposed System Architecture for Hand Recognition
Figure 3.2 shows the system architecture used in this paper. First, a live image is captured by the camera. The image is converted into a frame, which is divided into scan lines. The individual pixels under these scan lines are extracted and their RGB values are computed. Based on the color of the pixels on the scan lines, the hand posture is detected. The direction indicated by the detected posture is then sent as a command to a toy car, which moves in that direction.
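The scan-line stage described above can be sketched as follows. The paper does not say how a posture is turned into a direction, so this sketch makes several assumptions: the frame is a list of rows of (R, G, B) tuples, scan lines are evenly spaced horizontal rows, and the direction is taken from the mean column of the skin-colored pixels found on them. The function names, the direction labels, and the toy `is_skin` rule are all hypothetical.

```python
def scan_line_rows(height, num_lines):
    """Row indices of evenly spaced horizontal scan lines inside the frame."""
    step = height / (num_lines + 1)
    return [int(step * (k + 1)) for k in range(num_lines)]


def detect_direction(frame, is_skin, num_lines=3):
    """Map the horizontal position of skin pixels on the scan lines to a command."""
    height, width = len(frame), len(frame[0])
    skin_columns = []
    for row in scan_line_rows(height, num_lines):
        for col, pixel in enumerate(frame[row]):
            if is_skin(pixel):
                skin_columns.append(col)
    if not skin_columns:
        return "stop"  # assumed behavior when no hand is visible
    mean_col = sum(skin_columns) / len(skin_columns)
    if mean_col < width / 3:
        return "left"
    if mean_col >= 2 * width / 3:
        return "right"
    return "forward"


# Hypothetical 4x9 frame with a skin-colored patch in the two left columns.
SKIN, BACKGROUND = (200, 150, 120), (0, 0, 0)
frame = [[SKIN if col < 2 else BACKGROUND for col in range(9)] for _ in range(4)]
toy_is_skin = lambda p: p[0] > p[1] > p[2]  # placeholder rule, not from the paper
print(detect_direction(frame, toy_is_skin))
```

Sampling only a few scan lines instead of every pixel keeps the per-frame cost low, which fits the real-time goal stated for the framework.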
Modeling skin color requires selecting an appropriate color space and identifying the cluster associated with skin color in that space. In this paper, a hue-saturation-intensity color space is chosen, since the hue and saturation pair of skin-tone colors is independent of the intensity component [Jon98].
Thus, colors can be specified using just two parameters instead of the three required by the RGB color space (red, green, blue). In order to find common skin-tone features, several images of people under different backgrounds and lighting conditions were processed by hand to separate the skin areas. Yellow dots represent samples of skin-tone color from the segmented images, while blue dots are the remaining color samples of the image. It can be observed that skin-tone pixels are concentrated in a region that can be described by a parametric elliptical model. For practical purposes, however, skin-tone pixel classification was simplified using a rectangular model. The analysis of these images confirmed that the appearance of skin tone depends on the lighting conditions. The skin-tone values lay between 0 and 35 for the hue component and between 20 and 220 for the saturation component.
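The rectangular model amounts to two range checks, as in the sketch below. The hue bounds 0 to 35 and the saturation bounds 20 to 220 are taken from the text. The paper does not state the scale of these values, so the sketch assumes both components are expressed on a 0 to 255 scale, and it uses the HSV conversion from Python's standard `colorsys` module as a stand-in for the paper's color transform, whose saturation definition may differ.

```python
import colorsys

HUE_RANGE = (0, 35)    # from the text; 0-255 scale is an assumption
SAT_RANGE = (20, 220)  # from the text; 0-255 scale is an assumption


def is_skin_tone(r, g, b):
    """Rectangular skin-tone test on the (hue, saturation) pair of an 8-bit RGB pixel."""
    # colorsys works on floats in [0, 1]; the value component is ignored.
    h, s, _ = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    hue, sat = h * 255.0, s * 255.0
    return (HUE_RANGE[0] <= hue <= HUE_RANGE[1]
            and SAT_RANGE[0] <= sat <= SAT_RANGE[1])


# Hypothetical sample pixels.
print(is_skin_tone(200, 150, 120))  # warm, moderately saturated tone
print(is_skin_tone(0, 0, 255))      # pure blue
print(is_skin_tone(128, 128, 128))  # gray has zero saturation
```

A rectangle accepts some colors that an elliptical model would reject near its corners, which is the price of the simpler and faster test.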
In a gesture interpretation system, human-computer interaction is an important part of system design. The quality of a system depends on how it is presented to and used by its users. Therefore, an enormous amount of attention has been paid to better HCI designs. The new direction of research is to replace conventional methods of interaction with intelligent, adaptive, multimodal, and natural ones. Motivated by the need to explore better HCI paradigms, there has been growing interest in developing novel sensing modalities for HCI. To achieve the desired robustness, multimodality is likely to be an essential element of such interaction. Human studies in the context of HCI should play a larger role in addressing issues of multimodal integration. Even though many of the multimodal interfaces developed so far are domain specific, there should be more systematic means of evaluating them. Modeling and computational techniques from more established areas such as sensor fusion may shed some light on how to integrate multiple modalities systematically. However, the integration of modalities in the context of HCI is quite specific and needs to be tied more closely to the subjective elements of "context." There have been many successful demonstrations of HCI systems exhibiting multimodality. Despite this progress, many problems remain open and multimodal HCI is still in its infancy. A massive effort is likely needed before one can build practical multimodal HCI systems that approach the naturalness of human-human communication.