Gesture Recognition At Present Computer Science Essay

Gesture recognition is currently one of the fastest-advancing technologies and is finding its way into thousands of applications, making them easier and more efficient to use. It is concerned with capturing gestures of the human body and interfacing them with computers: the computer detects the captured gestures and uses them as input wherever an application requires. In this project, we create a media player coded in C# whose basic functions, such as Play, Pause, Next Track, Previous Track and Volume Control, are controlled by simple palm gestures. A simple webcam serves as the input device to track and record the gestures, and the media player controls are implemented with these gestures using Emgu CV.


Gesture recognition is advancing rapidly and now appears in thousands of applications, where it makes everyday tasks easier and more efficient. It captures gestures of the human body and passes them to a computer, which detects them and uses them as input wherever an application requires, for example capturing hand movements to move the pointer on a computer or laptop without using the mouse.

The basic technique is as follows: a camera is used to detect and read the movements of the human body. Using various algorithms, the computer detects these movements and takes them as input to control devices or applications.

Gesture recognition has found potential in various fields. It has helped the physically impaired to interact with computers, for example by interpreting sign language.

Gesture recognition has also found massive potential and use in 3D animation, wherein actual human lip-syncing movements are detected, tracked and applied to 3D models to make them look as realistic as possible [1].

In this research paper, we propose a design for a media player that will be controlled by hand gestures to implement the basic functions of the player, such as Play, Pause, Volume Up/Down, Next/Previous track. The proposed media player will be designed in Microsoft Visual Studio and coded in C#.

Gesture recognition is a technique by which human gestures or movements are captured, interfaced with a computer and taken as input for use in various applications. The computer uses various algorithms to detect the body movements and then uses them to drive an application. The basic goal is a system that can understand specific human gestures and use them to convey information, for example to run a device or an application.

Gestures of the human body are detected by the camera, and specific gestures, once detected, drive the application to perform a particular task. In simple terms, the gestures recorded by the camera are taken as input, and once a specific gesture is recorded, the code tells the device what it has to do for that gesture.

Gesture recognition can be broken down into a few simple steps.

Figure 1. Gesture Recognition Procedure

Colour Spaces

The main challenge in skin detection with a webcam is finding a suitable environment in which the gestures can be detected. Skin colour segmentation requires selecting a specific colour space in which the webcam image can be analysed. Human skin has its own characteristic colour and can easily be distinguished from other objects. Therefore, in this application, skin colour segmentation is used as the detection instrument. The first thing to consider is the type of colour space to use and how to model it [4].

Skin colour segmentation can be defined as the process of differentiating between skin and non-skin pixels. However, there are some difficulties in robustly detecting skin colour. Ambient light and shadows can affect the appearance of the skin tone. Moreover, different cameras produce different colour values even for the same person, and moving objects can cause blurring of colours. Thus a suitable colour space needs to be selected in order to proceed with gesture recognition for our application.

The colour spaces studied for further implementation were:

RGB Colour Space

RGB (Red, Green, Blue) is the most common colour space used to represent images. It is a convenient colour model for computer graphics because the human visual system works in a way that is similar, though not quite identical, to an RGB colour space. RGB was developed for CRT displays as an additive colour space. Its channels are highly correlated, it is perceptually non-uniform, and it mixes chrominance and luminance data. RGB is therefore poorly suited to colour analysis and colour-based recognition.

HSV Colour Space (Hue Saturation Value)

"Hue" is the attribute of a colour by which we give it a name such as "red" or "blue". "Value" is another word for "luminance", the attribute of a colour that makes it seem equivalent to some shade of gray between black and white. "Saturation" is a measure of how different a colour appears from a gray of the same luminance. Zero saturation indicates no hue, just gray scale. The HSV colour space is normalized.

Figure 2. HSV Colour Space Diagrammatic Explanation

The preceding figure shows a line drawing of HSV space in the form of a hexcone. Each of its cross sections is a hexagon. At the vertices of each cross section are the colours red, yellow, green, cyan, blue and magenta. A colour in HSV space is specified by stating a hue angle, the chroma level and the lightness level. A hue angle of zero is red. The hue angle increases in a counterclockwise direction. Complementary colours are 180° apart.
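The hue-angle convention described above can be checked with a few lines of code. The following is a minimal sketch in Python, used here purely for illustration (the project itself is coded in C#). It relies on the standard-library colorsys module; the helper names hue_degrees and complementary_hue are our own and do not come from the project.

```python
import colorsys


def hue_degrees(r, g, b):
    """Return the HSV hue of an RGB colour (components in 0..1) in degrees."""
    # colorsys reports hue as a fraction of a full turn (0.0 to 1.0).
    h, _s, _v = colorsys.rgb_to_hsv(r, g, b)
    return h * 360.0


def complementary_hue(angle):
    """Complementary colours sit 180 degrees apart on the hue circle."""
    return (angle + 180.0) % 360.0


red = hue_degrees(1.0, 0.0, 0.0)    # red sits at a hue angle of zero
cyan = hue_degrees(0.0, 1.0, 1.0)   # cyan is the complement of red
print(red, cyan, complementary_hue(red))
```

Red and cyan are opposite vertices of the hexagonal cross section, which is why their hue angles differ by half a turn.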

Unlike RGB, HSV separates luma, or image intensity, from chroma, or colour information. This is very useful in many applications, including ours. For example, to perform histogram equalization on a colour image, we would normally equalize only the intensity component and leave the colour components alone; otherwise we would get very strange colours. In computer vision we often want to separate colour components from intensity for various reasons, such as robustness to lighting changes or removing shadows.

RGB reflects the "implementation details" of how displays produce colour, whereas HSV reflects the "actual colour" components. Put another way, RGB is the way computers treat colour, and HSV tries to capture the components of the way we humans perceive colour.

YCbCr Colour Space

YCbCr is a family of colour spaces used as part of the colour image pipeline in video and digital photography systems. Y is the luma component, and Cb and Cr are the blue-difference and red-difference chroma components. YCbCr is not an absolute colour space; it is rather a way of encoding RGB information. Human vision is less sensitive to fine detail in colour than in brightness, and YCbCr makes use of this fact to achieve a more efficient representation of scenes and images. It does so by separating the luminance and chrominance components of a scene and using fewer bits for chrominance than for luminance.

HSV is similar to YCbCr in that both colour spaces reduce the effect of uneven illumination in an image. Therefore, both colour spaces are typically used in video tracking and surveillance.

Converting to HSV is computationally expensive, and pixels with very large or very small intensities are discarded because hue and saturation become unstable there. YCbCr, by comparison, is an encoded nonlinear RGB signal; its simple transformation and explicit separation of luminance and chrominance components make it attractive for skin colour modelling. Testing both colour spaces, we found that YCbCr gave better recognition of the hand to be used for gesture recognition, and we therefore proceeded with our project using this colour space.


Background Subtraction

This method is generally implemented to extract foreground objects from an image. It basically helps in cropping out the object of preference in the image and thus helps in reducing the overall amount of data that needs to be processed. We start the image recognition of our hand with this method, by emphasizing and extracting our hand which is the foreground, against the background objects present in the captured image.

i. Capture RGB image from webcam

A standard webcam with a resolution of 640x480 was used. This is a fairly low resolution, and we kept it both to improve processing speed and to make sure that our approach can handle low-resolution images. The images were captured in a moderate amount of light, that is, neither too bright nor too dark.

ii. Convert image to YCC Colour Space

The captured image is converted to YCC Colour space using the following formula:

Y = 0.299R + 0.587G + 0.114B

Cr = R - Y

Cb = B - Y

iii. Make a gray image such that the pixels in the skin range are white and those outside the range are black.

iv. Perform Erosion and Dilation on the image.

v. Perform Gaussian Smooth for smooth edges.

Figure 3. YCbCr algorithmic flowchart
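Steps ii and iii can be sketched in a few lines. The example below is written in Python for illustration only (the project itself is coded in C#) and applies the conversion formula exactly as given above, without the scaling and offsets used by standard YCbCr encodings. The skin ranges CR_RANGE and CB_RANGE are hypothetical values chosen for the demonstration; they are not the thresholds used in our player and would need tuning for a real camera.

```python
def rgb_to_ycc(r, g, b):
    """Convert one RGB pixel (0..255) with the formula given in the text."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cr = r - y  # red-difference chroma (unscaled)
    cb = b - y  # blue-difference chroma (unscaled)
    return y, cr, cb


# HYPOTHETICAL skin ranges for the unscaled Cr/Cb above; not from the essay.
CR_RANGE = (10.0, 60.0)
CB_RANGE = (-60.0, 5.0)


def skin_mask(image, cr_range=CR_RANGE, cb_range=CB_RANGE):
    """Return a gray image: 255 where the chroma is in range, otherwise 0.

    `image` is a list of rows, each row a list of (r, g, b) tuples.
    """
    mask = []
    for row in image:
        mask_row = []
        for r, g, b in row:
            _y, cr, cb = rgb_to_ycc(r, g, b)
            in_range = (cr_range[0] <= cr <= cr_range[1]
                        and cb_range[0] <= cb <= cb_range[1])
            mask_row.append(255 if in_range else 0)
        mask.append(mask_row)
    return mask


# One skin-like pixel followed by one pure blue pixel.
print(skin_mask([[(200, 150, 120), (0, 0, 255)]]))
```

Only the chroma components are thresholded, so the luma value Y is computed but ignored; this is what makes the mask less sensitive to overall brightness.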

1.1 Erosion

This image morphology technique is applied to binary images to erode away foreground pixels (white pixels) at the boundaries of regions. As a result, areas of foreground pixels shrink in size, and holes within those areas become larger.

The input to the erosion operator includes:

i. The captured image of the hand that is to be eroded.

ii. Structural element: a set of coordinate points.

To compute the erosion of the binary image, each foreground input pixel is considered in turn, and the structuring element is superimposed so that its origin coincides with the input pixel coordinates. If, for every pixel in the structuring element, the corresponding pixel in the image underneath is a foreground pixel, then the input pixel is left as it is. If any of the corresponding pixels in the image are background, however, the input pixel is also set to the background value.
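The erosion rule just described can be written out directly. The following is a minimal sketch in Python (for illustration only; in the project the operation is performed by the image library), assuming that the image is a list of rows of 0/1 values, that the structuring element is a list of (dy, dx) offsets from its origin, and that pixels outside the image count as background. The names erode and SQUARE_3X3 are our own.

```python
def erode(image, element):
    """Erode a binary image (rows of 0/1) with a structuring element.

    `element` is a list of (dy, dx) offsets relative to its origin.
    Pixels outside the image are treated as background (an assumption).
    """
    height, width = len(image), len(image[0])
    output = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1:
                continue  # only foreground pixels are considered
            keep = True
            for dy, dx in element:
                ny, nx = y + dy, x + dx
                inside = 0 <= ny < height and 0 <= nx < width
                if not inside or image[ny][nx] != 1:
                    keep = False  # an element pixel lies over background
                    break
            output[y][x] = 1 if keep else 0
    return output


# A 3x3 square structuring element with its origin at the centre.
SQUARE_3X3 = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

blob = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
eroded = erode(blob, SQUARE_3X3)
for row in eroded:
    print(row)
```

With this element, only the centre pixel of the 3x3 blob has all nine neighbours in the foreground, so it is the only pixel that survives.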

1.2 Dilation

This image morphology technique is applied to binary images to enlarge the boundaries of regions of foreground pixels (white pixels). As a result, areas of foreground pixels grow in size, and holes within those regions shrink.

The dilation of the binary image is computed in a similar way to erosion and takes the same inputs: the captured image of the hand and the structural element. The difference is that if at least one pixel in the structuring element coincides with a foreground pixel in the image underneath, then the input pixel is set to the foreground value. If all the corresponding pixels in the image are background, however, the input pixel is left at the background value.

Dilation is the dual of erosion i.e. dilating foreground pixels is equivalent to eroding the background pixels.
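The dilation rule and its duality with erosion can be illustrated in the same style. This is a minimal sketch in Python (for illustration only), assuming a symmetric structuring element and an image stored as rows of 0/1 values. The helper names and the `outside` parameter, which sets the value assumed for pixels beyond the image border, are our own.

```python
def dilate(image, element):
    """Set a pixel to foreground if any element offset lands on foreground."""
    height, width = len(image), len(image[0])
    output = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            for dy, dx in element:
                ny, nx = y + dy, x + dx
                if 0 <= ny < height and 0 <= nx < width and image[ny][nx] == 1:
                    output[y][x] = 1
                    break
    return output


def erode(image, element, outside):
    """Keep a pixel only if every element offset lands on foreground.

    `outside` is the value assumed beyond the image border (an assumption).
    """
    height, width = len(image), len(image[0])

    def pixel(ny, nx):
        if 0 <= ny < height and 0 <= nx < width:
            return image[ny][nx]
        return outside

    return [[1 if all(pixel(y + dy, x + dx) == 1 for dy, dx in element) else 0
             for x in range(width)] for y in range(height)]


def invert(image):
    """Swap foreground and background."""
    return [[1 - p for p in row] for row in image]


SQUARE_3X3 = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

dot = [[0] * 5 for _ in range(5)]
dot[2][2] = 1  # a single foreground pixel in the centre

dilated = dilate(dot, SQUARE_3X3)
# Duality: eroding the background gives the same result as dilating the foreground.
dual = invert(erode(invert(dot), SQUARE_3X3, outside=1))
print(dilated == dual)
```

The single pixel grows into a 3x3 block, and the same block is obtained by inverting the image, eroding it and inverting the result.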

Contour Calculation

We begin the gesture recognition phase of our hand by first making our application recognize our hand. Since the webcam will capture everything possible within its vision in the environment, we need to make sure that only our palm is detected and used so that our application can function [10].

For this, we use the method of Contour Calculation, wherein we restrict the gesture recognition process to certain highlighted objects only. We make use of the EmguCV function findContours in order to evaluate the various contours that the webcam detects and to specify the contours that we wish to detect.

The findContours function takes three arguments, namely the approximation method, the retrieval type and the memory storage. For our project, we use a specific chain approximation technique that approximates each object of a specific size that it finds. It compresses horizontal, vertical and diagonal segments, i.e. the function leaves only their end points. The end points are returned as a list of contours.

Once the contours are detected, we initialise a variable that holds the largest visible contour detected. Thus, when we place our hand for gesture recognition, we make sure that our palm is the largest detectable object.
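Keeping only the largest contour can be sketched as follows. The example is in Python for illustration only and does not call Emgu CV: it assumes that each contour is a list of (x, y) end points and that "largest" means largest enclosed area, computed here with the shoelace formula. Both function names are our own.

```python
def contour_area(points):
    """Area of a closed polygon given as [(x, y), ...] (shoelace formula)."""
    twice_area = 0
    for i, (x1, y1) in enumerate(points):
        x2, y2 = points[(i + 1) % len(points)]  # wrap around to close the polygon
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def largest_contour(contours):
    """Return the contour enclosing the largest area, or None if there is none."""
    biggest = None
    biggest_area = 0.0
    for contour in contours:
        area = contour_area(contour)
        if area > biggest_area:
            biggest, biggest_area = contour, area
    return biggest


small_square = [(0, 0), (2, 0), (2, 2), (0, 2)]   # area 4
palm_like = [(0, 0), (10, 0), (10, 5), (0, 5)]    # area 50
print(largest_contour([small_square, palm_like]))
```

Because only the winner is kept, small skin-coloured patches in the background are ignored as long as the palm remains the largest object in view.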

Figure 4. Contour detection algorithm

Finger Computing

ConvexityDefects is a feature that finds the defects between a convex hull and a contour. These defects are useful for finding features of a hand, for example the number of fingers. Each convexity defect has a start point, an end point and a depth point. The start and end points are the fingertips, while the depth points lie in the spaces between the fingers [8].

The steps involved are:

Approximate the polygon and find the convex hull of the biggest contour.

Compute convexity defects using the built-in Emgu CV function, which returns a list of contour convexity defects, each one represented by a tuple (start, end, depth points).

For every defect, check alignment of the start point and depth point and accordingly decide if it's a finger or not a finger.

This is done in the following way:

Compute startCircle as a circle with the center as start point and radius 5 pixels.

Compute depthCircle as a circle with the center as depth point and radius 5 pixels.

Compute box as the minimum area rectangle of the biggest contour.

If (startCircle.Center.Y < … || depthCircle.Center.Y < …) && (startCircle.Center.Y < depthCircle.Center.Y) && (Math.Sqrt(Math.Pow(startCircle.Center.X - depthCircle.Center.X, 2) + Math.Pow(startCircle.Center.Y - depthCircle.Center.Y, 2)) > box.size.Height / 6.5), then the defect is identified as a finger.

Store the X and Y positions of the finger identified.

Figure 5. Finger computing
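The finger test can be sketched as a small function. The example is in Python for illustration only and does not use Emgu CV. Two points are assumptions on our part: the right-hand operand of the first two comparisons is not given in the text, so it appears here as a hypothetical parameter reference_y, and the first two comparisons are grouped together with "or" before being combined with the remaining conditions.

```python
import math


def is_finger(start, depth, box_height, reference_y):
    """Decide whether one convexity defect marks a finger.

    start, depth: (x, y) centres of startCircle and depthCircle.
    box_height:   height of the minimum-area rectangle of the contour.
    reference_y:  HYPOTHETICAL stand-in for the operand that is missing
                  from the first two comparisons in the text.
    """
    above_reference = start[1] < reference_y or depth[1] < reference_y
    tip_above_valley = start[1] < depth[1]  # image y grows downwards
    length = math.hypot(start[0] - depth[0], start[1] - depth[1])
    return above_reference and tip_above_valley and length > box_height / 6.5


def finger_positions(defects, box_height, reference_y):
    """Collect the (x, y) start points of all defects that pass the test."""
    return [start for start, depth in defects
            if is_finger(start, depth, box_height, reference_y)]


defects = [
    ((100, 50), (110, 150)),   # long and pointing up: passes
    ((200, 160), (205, 170)),  # too short to be a finger
    ((300, 200), (310, 150)),  # start point lies below the depth point
]
print(finger_positions(defects, box_height=260, reference_y=180))
```

Dividing the box height by 6.5 scales the minimum finger length with the apparent size of the hand, so the test does not depend on how far the hand is from the camera.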

4. Recognizing Gestures

Once our palm has been detected by the webcam and the finger count is detected and displayed correctly, the application must interpret the detected gesture and display the desired result on the screen. For this, we have decided on certain gestures that correspond to certain functions of the media player, as shown in the images.

To make sure that the gestures are recognized correctly, and the correct result is displayed, we follow a procedure as given below:

1. We position our hand at a particular gesture; say with our palm outspread, for the 'STOP' function of our player. Thus, the finger count of 5 will be detected and displayed by the application, and we want this finger count to result in the 'STOP' function of the player.

2. Since we have not implemented the mapping procedure yet, we will simply check if the detection works properly by displaying the function on the screen.

3. To avoid flickering of the detection due to lighting conditions, we set the 'STOP' condition to a finger count of 4 or more. Only if this condition is satisfied is 'STOP' displayed on the screen.

4. Similarly, for the play function, we position three fingers such that the difference in their y values (i.e. the heights of the fingers) is less than 100 pixels. Only if this condition is satisfied do we display 'PLAY' on the screen.

5. Similarly, for the pause function, we position two fingers in such a way that the difference in their x values (i.e. the distance between their tips) is between 80 and 140 pixels. Only if this condition is satisfied do we display 'PAUSE' on the screen.

This is demonstrated in the following figures.

Figure 6. Detected finger counts
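The three rules above can be summarised in a short function. This is a minimal sketch in Python (for illustration only; the player itself is coded in C#), assuming that the fingers arrive as a list of (x, y) tip positions in pixels, that the "difference in y value" for three fingers means the spread between the highest and lowest tip, and that the 80 to 140 pixel bounds are inclusive. The function name classify_gesture is our own.

```python
def classify_gesture(fingers):
    """Map detected fingertip positions [(x, y), ...] to a player command.

    Returns "STOP", "PLAY", "PAUSE", or None when no rule matches.
    """
    count = len(fingers)
    if count >= 4:
        # 4 or more fingers, rather than exactly 5, to tolerate flicker.
        return "STOP"
    if count == 3:
        ys = [y for _x, y in fingers]
        if max(ys) - min(ys) < 100:
            return "PLAY"
    if count == 2:
        (x1, _y1), (x2, _y2) = fingers
        if 80 <= abs(x1 - x2) <= 140:
            return "PAUSE"
    return None


print(classify_gesture([(100, 80), (160, 60), (220, 90)]))
```

Returning None for unmatched input lets the caller leave the player state unchanged when the hand is between gestures.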

5. Results and Analysis

We coded our media player in C# and built the gesture recognition features using EmguCV, which is an OpenCV wrapper for C#. We tested the media player in different lighting conditions and in three colour spaces: RGB, HSV and YCbCr. We chose three test conditions in order to determine which colour space was the most appropriate, and from the results obtained we concluded that YCbCr gave the best results in recognizing gestures, as shown below.

Table 1. Test results

Table 2. Accuracy results

6. Conclusion

As technology advances, we move from simple hardware devices to simple gestures made with our own hands to run a basic application, such as the media player that we have implemented here. It is a simple player coded in C# that implements functions such as 'Play', 'Pause' and 'Stop' by recognizing hand gestures. The gesture recognition feature not only makes the media player efficient for an ordinary user to operate but also makes it more useful to physically challenged people. The future scope of this application includes recognizing gestures for more functions and making the interface more interactive for the user. Using a simple webcam to detect gestures that control a simple media player could spark ideas for other commonly used applications, or for implementing gesture recognition in existing media players themselves.