Efficient Human Action Recognition System Computer Science Essay


In this paper, we propose an efficient human action recognition system that uses feature points and a single camera, based on neural network representation and recognition. The novel representation of action videos is based on learning spatially related human body posture prototypes using Self Organizing Maps (SOM). Fuzzy distances from the human body posture prototypes are used to produce a time-invariant action representation. Multilayer perceptrons are used for action classification. The algorithm is trained using data from a multi-camera setup, and an arbitrary number of cameras can be used to recognize actions within a Bayesian framework. The growing interest in visual surveillance has increased the demand for human action recognition. We therefore propose a new and efficient method for human action recognition using a single camera and feature points. The proposed method overcomes problems in existing systems and recognizes the action of the target person. The system is first trained using feature extraction and the feature-tree method, after which it can identify the action from postures. We show that the proposed method is efficient and recognizes actions quickly.

Index Terms- Human action recognition, multilayer perceptrons, feature tree, visual surveillance.


Human action recognition is an active research field because of its importance in a wide range of applications, such as intelligent surveillance [2], visual surveillance, crowd behavior analysis, and tracking of an individual in crowded scenes. Artificial Intelligence (AI) is the study of modeling human mental functions with computer programs. In this context, the term action is related to the terms activity and movement. An action refers to a single period of a human motion pattern (such as a walking step), whereas an activity consists of a number of actions or movements (such as dancing).

The objective of the estimation process is to find the most probable action given the parameters. We first estimate which posture the current image represents, and then recognize which action the posture sequence represents. A critical problem in a recognition system is how to improve accuracy and speed. There are two classes of estimation approaches: learning-based and example-based. Learning-based approaches use trained classifiers, while example-based approaches search among exemplars.

Action recognition aims to recognize the actions and goals of one or more agents from a series of observations of the agents' actions and the environmental conditions. Its strength lies in providing personalized support for many different applications and in its connection to many different fields of study, such as medicine, human-computer interaction, and sociology. Visual (or video) surveillance devices have long been used to gather information and to monitor people, events, and activities. There is an increasing need in video surveillance applications for solutions that can analyze human behavior and identify subjects for standoff threat analysis and determination. The main purpose of this survey is to look at current developments and capabilities of visual surveillance systems and to assess the feasibility and challenges of using such a system to automatically detect abnormal behavior, detect hostile intent, and identify human subjects [1].

Problem Statement - A critical problem in a recognition system is how to improve accuracy and speed. There are two classes of estimation approaches: learning-based and example-based. Learning-based approaches use trained classifiers, while example-based approaches search among exemplars.

Action recognition methods suffer from many drawbacks in practice, which include:

The inability to cope with incremental recognition problems.

The requirement of an intensive training stage to obtain good performance.

The inability to recognize multiple simultaneous actions and the difficulty of performing recognition frame by frame.

The lack of support for continuous action recognition over time.

The need for camera calibration for each specific camera setting when multi-camera setups are used.


In this paper, we propose a framework based on the feature-tree technique that recognizes an unknown action from camera input. We show that the proposed framework is more accurate and efficient than existing frameworks in recognizing actions. Because it relies on the feature-tree technique, the proposed approach does not require the same number of cameras in the training and recognition phases.

Fig. Framework for the feature tree. Training path: input training image, feature detection and extraction, labeling of each feature with its action class, training of the system with the action name, trained data. Query path: query image, feature detection and extraction, retrieval of the action from the trained data, notification of the action.

Fig. Example of a feature tree.

Method Description

After image segmentation, the image is decomposed into a number of homogeneous regions. As shown in Fig. 1, the image is represented by a two-level tree, where the root node represents the whole image and the child nodes represent the region-based objects. The root node is assigned the global feature, which is the color histogram in this case. Local region-based features, such as color moment, texture, size, and shape, are assigned to the child nodes. This enables global and local image features to be integrated through a tree structure.

In the feature-tree construction stage, local spatiotemporal features are detected and extracted from each labeled video, and each feature is represented by a pair [d, l], where d is the feature descriptor and l is the class label of the feature. Finally, we index all the labeled features using an SR-tree.

In the recognition stage, given an unknown action video, we first detect and extract local spatiotemporal features, and then launch a query into the feature tree for each feature. A set of nearest-neighbor features and their corresponding labels is returned for each feature, and each returned nearest neighbor votes for its label. This process is repeated for each feature, and the votes can be weighted based on the importance of each nearest neighbor. Finally, the video is assigned the label that receives the most votes. The entire procedure does not require intensive training, so incremental action recognition can easily be applied using the feature tree.
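The construction and recognition stages described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system's implementation: a flat list with brute-force search stands in for the SR-tree, the inverse-distance vote weight is just one possible notion of "importance", and the function names (build_index, add_features, classify) and the toy 2-D descriptors are hypothetical.

```python
import math
from collections import defaultdict


def build_index(labeled_features):
    """Store [d, l] pairs. A flat list stands in for the SR-tree (assumption)."""
    return [(tuple(d), l) for d, l in labeled_features]


def add_features(index, labeled_features):
    """Incremental update: new labeled features are simply appended."""
    index.extend((tuple(d), l) for d, l in labeled_features)


def nearest_neighbors(index, query, k):
    """Brute-force k-nearest-neighbor query by Euclidean distance."""
    return sorted(index, key=lambda item: math.dist(item[0], query))[:k]


def classify(index, query_features, k=3):
    """Each returned neighbor votes for its label; closer neighbors weigh more."""
    votes = defaultdict(float)
    for q in query_features:
        for d, label in nearest_neighbors(index, q, k):
            votes[label] += 1.0 / (1.0 + math.dist(d, q))  # hypothetical weighting
    return max(votes, key=votes.get)


# Toy 2-D descriptors standing in for local spatiotemporal features.
index = build_index([
    ((0.0, 0.0), "walk"), ((0.2, 0.1), "walk"), ((0.1, 0.3), "walk"),
    ((5.0, 5.0), "run"), ((5.2, 4.9), "run"), ((4.8, 5.1), "run"),
])
walk_label = classify(index, [(0.1, 0.1), (0.3, 0.2)])

# A new action class is added without retraining anything.
add_features(index, [((10.0, 10.0), "jump"), ((10.1, 9.9), "jump"), ((9.9, 10.2), "jump")])
jump_label = classify(index, [(10.0, 10.1)])
```

Because adding a class only appends labeled features to the index, the sketch also shows why the approach lends itself to incremental recognition.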

A. Preprocessing Phase

For skin color segmentation, we first enhance the contrast of the image and then perform the segmentation.

Finding Probability of Human Module

Next, we find the largest connected region and check the probability that this region is a face. If the region is likely to be a face, the program opens a new form showing the largest connected region. In this module, we also apply digital signal processing filters to remove noise from the image.
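Finding the largest connected region can be sketched with a breadth-first flood fill. The essay does not state the connectivity or the mask format, so this example assumes 4-connectivity and a non-empty mask in which skin pixels are marked 1; the function name is hypothetical, and the face-probability check itself is not shown because the text does not specify it.

```python
from collections import deque


def largest_connected_region(mask):
    """Return the (row, col) pixels of the largest 4-connected region of 1s."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    best = []
    for r in range(height):
        for c in range(width):
            if mask[r][c] != 1 or seen[r][c]:
                continue
            # Flood-fill one region starting from this unvisited skin pixel.
            region, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and mask[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(region) > len(best):
                best = region
    return best


skin_mask = [
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 0, 1, 1],
]
region = largest_connected_region(skin_mask)
```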

Binary Image Conversion Module

For face detection, we first convert the RGB image to a binary image. To do so, we calculate the average of the R, G, and B values for each pixel; if the average is below 110, we replace the pixel with a black pixel, and otherwise we replace it with a white pixel.
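The thresholding rule above is simple enough to write out directly. In this sketch, the image is assumed to be a nested list of (R, G, B) tuples with 8-bit values, black is encoded as 0 and white as 255, and the function name is hypothetical.

```python
def rgb_to_binary(image, threshold=110):
    """Map each (R, G, B) pixel to 0 (black) if its channel average is below
    the threshold, otherwise to 255 (white)."""
    return [
        # Comparing the channel sum with 3 * threshold avoids float division.
        [0 if (r + g + b) < 3 * threshold else 255 for (r, g, b) in row]
        for row in image
    ]


sample = [
    [(0, 0, 0), (255, 255, 255)],
    [(100, 110, 119), (110, 110, 110)],
]
binary = rgb_to_binary(sample)
```

A pixel whose average is exactly 110 is not "below" the threshold, so it becomes white.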

Our database contains two tables. The "Person" table stores each image together with the indices of its four kinds of action, which are stored in the "Position" table. For each index, the "Position" table holds 6 control points for the lip Bezier curve, 6 control points for the left eye Bezier curve, 6 control points for the right eye Bezier curve, and the height and width of the lip, the left eye, and the right eye. In this way, the program learns the actions of each person.

Movement Feature Extraction Module

From the binary image, we extract features such as the movement of the legs and hands. We first apply the edge detection algorithm so that the features of the human posture are extracted accurately and the background and unwanted noise are removed from the picture.

B. Training Phase

In this module, we train the system with the actions depicted in the images. Each action represented by an image is stored in the system using the feature-tree model.

C. Testing Phase

To detect the action in an image, we find the Bezier curves of the lip, left eye, and right eye. We then scale each Bezier curve to a width of 100 and scale its height proportionally. If the person's action information is available in the database, the program finds the action whose stored height is nearest to the current height and outputs that action. If the person's action information is not available in the database, the program calculates the average height of each action over all people in the database and makes its decision based on these average heights.
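The matching rule of the testing phase can be sketched as follows. For brevity, the example assumes a single normalized height per action (the text uses separate lip, left eye, and right eye curves), and the dictionary layout, person names, and function names are hypothetical stand-ins for the database tables.

```python
def normalize_height(width, height, target_width=100.0):
    """Scale the curve to the target width and scale its height proportionally."""
    return height * target_width / width


def nearest_action(height, action_heights):
    """Return the action whose stored height is closest to the given height."""
    return min(action_heights, key=lambda action: abs(action_heights[action] - height))


def recognize(person, width, height, database):
    """Match against the person's own records, or against averages over all people."""
    h = normalize_height(width, height)
    if person in database:
        return nearest_action(h, database[person])
    # Unknown person: average each action's height over everyone in the database.
    collected = {}
    for heights in database.values():
        for action, value in heights.items():
            collected.setdefault(action, []).append(value)
    averages = {action: sum(v) / len(v) for action, v in collected.items()}
    return nearest_action(h, averages)


# Hypothetical stored heights, already normalized to a width of 100.
database = {
    "alice": {"walk": 20.0, "run": 40.0},
    "bob": {"walk": 30.0, "run": 60.0},
}
known = recognize("alice", width=50, height=11, database=database)
unknown = recognize("carol", width=200, height=96, database=database)
```

In the first call the normalized height is 22, which is closest to the stored "walk" height for that person; in the second, the normalized height of 48 is compared with the averages (25 for "walk", 50 for "run").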


The Canny edge detector is an edge detection operator that uses a multi-stage algorithm to detect a wide range of edges in images. At each stage, computing the output pixel at a particular row requires input pixels from the rows above and below, so the output at the first and last rows is undefined. The same holds for columns. As a result, the region of valid output pixels shrinks after each step. To account for this, the output width and height and the position of the output buffer change after each step, which is shown at each stage as API argument adjustments. The same adjustment technique applies to other applications in which a sequence of VLIB functions is called.
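The bookkeeping for the shrinking valid region can be illustrated with a small helper. This is a sketch, not the VLIB API: the function name is hypothetical, and the border widths in the example (2 pixels for a 5x5 smoothing window, then 1 pixel each for two 3x3 stages) are assumptions chosen for illustration.

```python
def valid_regions(width, height, border_widths):
    """Track the valid output region after each stage.

    Each stage with border width r invalidates r pixels on every side, so the
    buffer offset grows by r and the width and height shrink by 2 * r.
    Returns one (x_offset, y_offset, width, height) tuple per stage.
    """
    x_offset = y_offset = 0
    regions = []
    for r in border_widths:
        x_offset += r
        y_offset += r
        width -= 2 * r
        height -= 2 * r
        if width <= 0 or height <= 0:
            raise ValueError("image too small for the requested stages")
        regions.append((x_offset, y_offset, width, height))
    return regions


# A 640x480 image passed through three stages with assumed border widths.
stages = valid_regions(640, 480, [2, 1, 1])
```

Each tuple gives the arguments a caller would pass to the next stage: where the valid data starts and how large it is.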

A. Noise Reduction

Noise produces high gradient magnitudes, which in turn produce unintended edges. To reduce this effect, the image is convolved with a 2D Gaussian filter, which brings each pixel into closer agreement with its neighbors; this is essentially a smoothing process. In practice, a discrete Gaussian function requires a window size of approximately 5*sigma, where sigma is the standard deviation of the Gaussian function. Sigma controls the amount of smoothing and therefore determines the window size. Larger window sizes are generally not recommended because they are computationally expensive and cause over-smoothing, which removes weak edges. A window size of 5x5 or 7x7 is recommended. The Gaussian filters can be generated using the MATLAB function fspecial.
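For readers without MATLAB, a normalized 2D Gaussian window can be generated with a few lines of Python. This is a minimal sketch with a hypothetical function name; the 5x5 window with sigma = 1.0 is chosen to match the 5*sigma rule of thumb mentioned above.

```python
import math


def gaussian_kernel(size, sigma):
    """Return a size x size Gaussian window whose weights sum to 1."""
    if size % 2 == 0:
        raise ValueError("size must be odd so the window has a center pixel")
    half = size // 2
    weights = [
        [math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
         for x in range(-half, half + 1)]
        for y in range(-half, half + 1)
    ]
    total = sum(sum(row) for row in weights)
    # Normalizing keeps the overall image brightness unchanged after smoothing.
    return [[w / total for w in row] for row in weights]


kernel = gaussian_kernel(5, 1.0)
```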

B. Gradient Filtering

This kernel computes the first-order horizontal and vertical gradients and computes the magnitude of the gradient using the L1 norm. This step creates an additional 1-pixel border of invalid data around the input image. As before, care has to be taken to use the modified height and width for the next step.
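The gradient step can be sketched as follows. The text does not name the difference filter, so central differences are assumed here; the function name is hypothetical. The output omits the 1-pixel border where the gradient is undefined, which is why it is two pixels smaller in each dimension.

```python
def gradient_l1(image):
    """Return |gx| + |gy| for every interior pixel, using central differences."""
    height, width = len(image), len(image[0])
    magnitude = []
    for y in range(1, height - 1):
        row = []
        for x in range(1, width - 1):
            gx = image[y][x + 1] - image[y][x - 1]  # horizontal gradient
            gy = image[y + 1][x] - image[y - 1][x]  # vertical gradient
            row.append(abs(gx) + abs(gy))           # L1 norm of (gx, gy)
        magnitude.append(row)
    return magnitude


tiny = [
    [0, 0, 0],
    [0, 0, 10],
    [0, 20, 0],
]
result = gradient_l1(tiny)
```

The L1 norm replaces the square root of the Euclidean magnitude with two absolute values and an addition, which is cheaper to compute.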

C. Non-Maximum Suppression

The edge magnitude image may contain wide ridges around the local maxima, which appear as thick edges. Non-maximum suppression produces thin edges by removing the non-maxima pixels along the normal direction while preserving the connectivity of contours. At each pixel location, two virtual points lying along the normal direction on either side of the current location are interpolated using the gradient magnitudes of the surrounding neighbors. If the gradient magnitude at the current position is greater than those at the virtual points, the pixel is declared a possible edge and given a value of 127; otherwise it is set to 0. The direction estimation and interpolation steps were combined to eliminate division, which removes the need for the reciprocal tables that division would require. The output is therefore binary valued, containing either 0 or 127. This step creates another 1-pixel border of invalid data around the image.
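A straightforward version of non-maximum suppression with interpolated virtual points might look like the sketch below. It is not the division-free formulation mentioned in the text: it uses an explicit division for clarity, assumes the gradient components gx and gy are available alongside the magnitude image, and uses a hypothetical function name.

```python
def non_max_suppression(mag, gx, gy):
    """Return a map over the interior pixels: 127 for possible edges, else 0."""
    height, width = len(mag), len(mag[0])
    out = []
    for y in range(1, height - 1):
        row = []
        for x in range(1, width - 1):
            dx, dy = gx[y][x], gy[y][x]
            if dx == 0 and dy == 0:
                row.append(0)  # no gradient, so no edge
                continue
            sx = 1 if dx >= 0 else -1
            sy = 1 if dy >= 0 else -1
            if abs(dx) >= abs(dy):
                # Direction is closer to horizontal: interpolate along columns x +/- 1.
                t = abs(dy) / abs(dx)
                ahead = (1 - t) * mag[y][x + sx] + t * mag[y + sy][x + sx]
                behind = (1 - t) * mag[y][x - sx] + t * mag[y - sy][x - sx]
            else:
                # Direction is closer to vertical: interpolate along rows y +/- 1.
                t = abs(dx) / abs(dy)
                ahead = (1 - t) * mag[y + sy][x] + t * mag[y + sy][x + sx]
                behind = (1 - t) * mag[y - sy][x] + t * mag[y - sy][x - sx]
            is_peak = mag[y][x] > ahead and mag[y][x] > behind
            row.append(127 if is_peak else 0)
        out.append(row)
    return out


# A vertical ridge three pixels wide, with a purely horizontal gradient direction.
ridge = [[0, 1, 5, 1, 0] for _ in range(5)]
gx = [[1] * 5 for _ in range(5)]
gy = [[0] * 5 for _ in range(5)]
thin = non_max_suppression(ridge, gx, gy)
```

Only the center column of the ridge survives, which is the thinning effect the stage is meant to produce.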

D. Hysteresis Thresholding

This stage of the algorithm is split into a block-based kernel (VLIB_doublethresholding) and a non-block-based kernel (VLIB_edgeRelaxation). This split gives the flexibility of running part of the stage in a block-based manner. Double thresholding uses two parameters, a high threshold and a low threshold. If a pixel is a possible edge and its gradient magnitude is above the high threshold, it is declared a strong edge and a value of 255 is stored. Its location is stored in a dynamic list whose size grows with the edge structure.
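The two parts of this stage can be sketched together in one function. The strong-edge step follows the description above; the relaxation step is written as standard Canny hysteresis, in which possible edges above the low threshold are promoted when they touch a strong edge. That second step is an assumption about what the relaxation kernel does, and the function name is hypothetical rather than part of VLIB.

```python
def hysteresis(edge_map, magnitude, low, high):
    """Promote possible edges (127) to strong edges (255) by double thresholding
    followed by relaxation over 8-connected neighbors."""
    height, width = len(edge_map), len(edge_map[0])
    result = [row[:] for row in edge_map]  # leave the input untouched
    strong = []                            # dynamic list of strong-edge locations

    # Double thresholding: possible edges above the high threshold become strong.
    for y in range(height):
        for x in range(width):
            if result[y][x] == 127 and magnitude[y][x] > high:
                result[y][x] = 255
                strong.append((y, x))

    # Relaxation (assumed): grow strong edges into neighboring possible edges
    # whose magnitude is above the low threshold.
    while strong:
        y, x = strong.pop()
        for ny in range(max(0, y - 1), min(height, y + 2)):
            for nx in range(max(0, x - 1), min(width, x + 2)):
                if result[ny][nx] == 127 and magnitude[ny][nx] > low:
                    result[ny][nx] = 255
                    strong.append((ny, nx))
    return result


edges = [[127, 127, 0, 127]]
mags = [[90, 40, 0, 40]]
final = hysteresis(edges, mags, low=30, high=80)
```

In the example, the second pixel is promoted because it touches a strong edge, while the last pixel stays at 127 because it is isolated; a final pass would normally discard such leftover possible edges.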


Using the proposed framework, we illustrate that our method efficiently recognizes actions from video datasets, addresses the inability to cope with incremental recognition, and provides a better framework for recognizing actions. The proposed system, which uses a single camera and a feature tree, recognizes actions efficiently and is useful in human surveillance applications.


In the proposed system, we recognize normal human actions such as walking and running from images. As a future enhancement, actions could be recognized from video sequences. In the future, we could also try to recognize more specific human actions, for example through lip reading.