Moving object detection is an active research topic in computer vision. It is widely used in surveillance, guidance of autonomous vehicles, video compression, tracking of moving objects, automatic target recognition and so on. The aim of moving object detection is to separate the moving objects from the background. According to the movement of the camera, moving object detection methods can be divided into two types: detection in static scenes and detection in dynamic scenes. Moving object detection in static scenes is relatively simple; it is a mature technology that has been applied successfully in various systems. Moving object detection in dynamic scenes, however, still has many unsolved key problems, and they become more difficult when the background is relatively complicated. Therefore, methods for dynamic scenes have become a research hot spot in computer vision.
At present, there are two dominant methods for moving object detection: the optical flow method and the global motion compensation method. The optical flow method has high computational complexity and poor noise immunity, and it is practical only with special hardware, so the global motion compensation method has been widely used in the field. The main idea of this method is to estimate the global motion parameters of the camera between frames through image matching, and then to compensate the motion of the camera. In this way, detection in dynamic scenes is transformed into detection in static scenes. The difficulty of this method is to estimate the global motion parameters robustly, which includes extracting and matching feature points, removing the invalid matching points, and solving for the optimal global motion parameters. In the paper, we adopt the latter method.
Several popular feature detectors, including the Harris corner detector, SIFT [5-6] and SURF, have been widely used in global motion compensation. The Harris corner detector is not scale invariant, and its response varies considerably with changes in gray level and illumination. The SIFT (Scale-Invariant Feature Transform) algorithm is widely used in many applications because its feature descriptor is relatively invariant to changes in orientation, scale, illumination and contrast. However, SIFT cannot meet real-time requirements because of its large amount of calculation and high time complexity. The SURF (Speeded Up Robust Features) algorithm, an accelerated alternative to SIFT, is considerably better suited to real-time use. To meet the real-time requirement, we choose the SURF algorithm. To further improve the speed and the precision, we make some improvements to it.
The proposed algorithm proceeds as follows. First, feature points are extracted and matched between adjacent frames using the improved SURF algorithm and the improved matching method. Then the affine transformation model is chosen to describe the global motion; RANSAC is used to remove the invalid matching points, and the least squares method is used to obtain the optimal global motion parameters (the affine transformation matrix). Finally, the previous frame is compensated using these parameters, and the objects are obtained from the difference between the current frame and the compensated frame. After morphological image processing, we get the accurate moving objects. The overall process of the proposed algorithm is summarized in Fig. 1.
Fig.1. Flowchart of the proposed algorithm.
2 Improved SURF algorithm and matching method
2.1 SURF algorithm
This section reviews the original SURF algorithm, which was proposed by Bay, Tuytelaars and Van Gool in 2006. The algorithm is similar to SIFT, but it is faster to compute. It relies on integral images to reduce the computation time, and its detector is also called the "Fast-Hessian" detector.
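The SURF detector is based on the determinant of the Hessian matrix, which can be calculated efficiently from the integral image. Given a point $\mathbf{x} = (x, y)$ in an image $I$, the Hessian matrix $H(\mathbf{x}, \sigma)$ at $\mathbf{x}$ and scale $\sigma$ is defined as formula (1):

$$H(\mathbf{x}, \sigma) = \begin{bmatrix} L_{xx}(\mathbf{x}, \sigma) & L_{xy}(\mathbf{x}, \sigma) \\ L_{xy}(\mathbf{x}, \sigma) & L_{yy}(\mathbf{x}, \sigma) \end{bmatrix} \tag{1}$$

Here $L_{xx}(\mathbf{x}, \sigma)$ refers to the convolution of the second-order Gaussian derivative $\frac{\partial^2}{\partial x^2} g(\sigma)$ with the image $I$ at point $\mathbf{x}$, and similarly for $L_{xy}(\mathbf{x}, \sigma)$ and $L_{yy}(\mathbf{x}, \sigma)$. As Gaussian filters are non-ideal in any case, and given Lowe's success with LoG approximations, Bay et al. push the approximation even further with box filters. These approximate the second-order Gaussian derivatives and can be evaluated very fast using integral images, independently of their size. Formula (2) gives an accurate approximation of the Hessian determinant using the box-filter approximations of the Gaussians:

$$\det(H_{\text{approx}}) = D_{xx} D_{yy} - (0.9\, D_{xy})^2 \tag{2}$$

$D_{xx}$ refers to the convolution of the corresponding box filter with the image at point $\mathbf{x}$, and similarly for $D_{xy}$ and $D_{yy}$.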
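The reason box filters cost the same at every size can be illustrated with a minimal sketch. It assumes a NumPy array as the image; the function names are hypothetical, and the weight 0.9 is the value reported by Bay et al. rather than something tuned here.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero row and column prepended."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.float64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0, dtype=np.float64), axis=1)
    return ii

def box_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] from four lookups, whatever the box size."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

def hessian_det_approx(dxx, dyy, dxy, w=0.9):
    """Approximate Hessian determinant from box-filter responses.

    w = 0.9 is the relative weight used by Bay et al. (assumed here).
    """
    return dxx * dyy - (w * dxy) ** 2
```

A box-filter response such as the one for the xx direction is then a weighted combination of a few `box_sum` calls, so enlarging the filter changes only the corner coordinates, not the number of operations.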
Scale spaces are usually implemented as image pyramids. The image pyramid in SURF is constructed by changing the size of the box filters rather than by iteratively reducing the image size. The output of the 9×9 filter is considered as the initial scale layer, to which we will refer as scale σ = 1.2. The following layers are obtained by filtering the image with gradually bigger masks, such as 9×9, 15×15, 21×21, 27×27, etc. If the size of the box filter is N×N, the corresponding scale is σ = 1.2 × N/9. In order to localize interest points in the image and over scales, a non-maximum suppression in a 3×3×3 neighborhood is applied. To do this, each pixel in the scale-space is compared to its 26 neighbors, comprised of the 8 points in the native scale and the 9 in each of the scales above and below. The maxima of the determinant of the Hessian matrix are then interpolated in scale and image space.
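As a small worked example of the relation between box filter size and scale, the following sketch evaluates it for the filter sizes listed above. The function name and default arguments are illustrative only.

```python
def box_filter_scale(n, base_size=9, base_scale=1.2):
    """Scale associated with an n x n box filter, taking 9 x 9 as scale 1.2."""
    return base_scale * n / base_size

# Scales of the first four layers, rounded for display.
scales = {n: round(box_filter_scale(n), 2) for n in (9, 15, 21, 27)}
print(scales)
```

For these four filter sizes the scales come out as 1.2, 2.0, 2.8 and 3.6.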
Rotation invariance is achieved by detecting the dominant orientation of each feature point using Haar wavelet responses in the x and y directions within a circular neighborhood of radius 6s around the feature point, where s is the scale at which the feature point was detected. The size of the Haar filter kernel is scaled to 4s×4s. The responses are weighted with a Gaussian centered at the feature point. The Gaussian depends on the scale of the point and is chosen to have standard deviation 2.5σ (σ being the scale of the point). The dominant orientation is estimated by calculating the sum of the horizontal and vertical Haar wavelet responses within a sliding orientation window covering an angle of π/3. The two summed responses constitute a vector, and the longest vector lends its orientation to the feature point.
The first step in extracting the descriptor is to construct a square window of size 20σ, where σ is the scale of the feature point, oriented along the dominant orientation. The window is then divided into 4×4 regular sub-regions. For each sub-region, Haar wavelet responses of size 2σ are computed at 5×5 regularly spaced sample points. $\sum d_x$ refers to the sum of the responses in the horizontal direction and $\sum d_y$ refers to the sum of the responses in the vertical direction. $\sum |d_x|$ and $\sum |d_y|$ refer to the sums of the absolute values of the responses in the horizontal and vertical directions, respectively. Hence, each sub-region has a four-dimensional descriptor vector $v = (\sum d_x, \sum d_y, \sum |d_x|, \sum |d_y|)$ for its underlying intensity structure. Since the window has 4×4 sub-regions, each feature point has a 64-dimensional descriptor vector. Finally, we normalize the descriptor to a unit vector to achieve invariance to contrast.
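The assembly of the 64-dimensional vector can be sketched as follows. The sketch assumes the Haar responses have already been sampled on a 20×20 grid in the oriented window (5×5 samples for each of the 4×4 sub-regions) and omits any weighting of the responses; the function name and input layout are assumptions made for illustration.

```python
import numpy as np

def surf_descriptor(dx, dy):
    """Build a unit-length 64-D descriptor from 20x20 Haar responses.

    dx, dy: horizontal and vertical responses sampled in the oriented
    window (hypothetical layout: 4x4 sub-regions of 5x5 samples each).
    """
    dx = np.asarray(dx, dtype=np.float64)
    dy = np.asarray(dy, dtype=np.float64)
    if dx.shape != (20, 20) or dy.shape != (20, 20):
        raise ValueError("expected 20x20 response arrays")
    parts = []
    for r in range(0, 20, 5):
        for c in range(0, 20, 5):
            bx = dx[r:r + 5, c:c + 5]
            by = dy[r:r + 5, c:c + 5]
            # Four values per sub-region: signed sums and absolute sums.
            parts.extend([bx.sum(), by.sum(), np.abs(bx).sum(), np.abs(by).sum()])
    v = np.array(parts)
    norm = np.linalg.norm(v)
    # Normalizing to unit length removes the effect of a contrast scale factor.
    return v / norm if norm > 0 else v
```

Multiplying all responses by a constant, as a contrast change would, leaves the returned vector unchanged, which is the invariance the normalization step is meant to provide.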
2.2 Improved SURF algorithm
(1) Limit the number of detected feature points
The SURF algorithm focuses on the detection quality without considering the number or the position of the feature points. However, detecting many more feature points in a frame increases not only the time needed to calculate the feature point descriptors, but also the matching time and the complexity of calculating the optimal global motion parameters. An affine transformation is determined by as few as three pairs of non-collinear matching points. Hence, removing some matching points does not affect the final result, and it improves the efficiency of the whole algorithm.
When SURF detects feature points, it applies a non-maximum suppression in a 3×3×3 neighborhood: each pixel in the scale-space is compared to its 26 neighbors, comprised of the 8 points in the native scale and the 9 in each of the scales above and below. When the image frame is complex, this detects a large number of feature points, which increases the computation in the subsequent processing. Therefore, when detecting feature points, we apply the non-maximum suppression in a 7×7×3 neighborhood. The determinant at the center of the 7×7 region is compared to its 146 neighbors, comprised of the 48 points in the native scale and the 49 in each of the scales above and below. With this neighborhood, we detect an appropriate number of more robust feature points, and the efficiency of the whole algorithm improves.
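The effect of widening the suppression window can be illustrated with a minimal sketch. It assumes the Hessian determinant responses are stored as a NumPy array indexed by (scale, row, column); the function name, the threshold parameter and the brute-force loops are illustrative choices, not the implementation used in the paper.

```python
import numpy as np

def local_maxima(stack, spatial=7, threshold=0.0):
    """Return (scale, row, col) of strict maxima in a spatial x spatial x 3 block.

    stack: responses indexed as [scale, row, col] (assumed layout).
    spatial=3 gives the original 26-neighbor test, spatial=7 the
    146-neighbor test (7*7*3 - 1).
    """
    k = spatial // 2
    n_scales, h, w = stack.shape
    found = []
    for s in range(1, n_scales - 1):
        for r in range(k, h - k):
            for c in range(k, w - k):
                v = stack[s, r, c]
                if v <= threshold:
                    continue
                block = stack[s - 1:s + 2, r - k:r + k + 1, c - k:c + k + 1]
                # Strict maximum: the center is the only entry reaching v.
                if v >= block.max() and np.count_nonzero(block == v) == 1:
                    found.append((s, r, c))
    return found
```

With two nearby peaks three pixels apart, the 3×3×3 test keeps both while the 7×7×3 test keeps only the stronger one, which is how the wider window reduces the number of feature points.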
(2) A fast method for calculating the feature point's dominant orientation
In SURF, the dominant orientation of a feature point is calculated by shifting a sliding window covering an angle of 60° around the circular region and calculating the sum of the horizontal and vertical Haar wavelet responses inside it. The two summed responses constitute a vector, and the longest vector lends its orientation to the feature point. The shifting step of the sliding window is 5°. As the window shifts, successive windows overlap heavily, so the sums of the responses are calculated repeatedly, which reduces the algorithm's efficiency. For example, the 0-60° region and the 5-65° region share the 5-60° region, and the sum of the responses in it is calculated twice, which makes the algorithm more complex.
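We adopt a fast method for calculating the feature point's dominant orientation to increase the efficiency of the algorithm. The procedure is as follows:
(1) Calculate the sum of the horizontal and vertical Haar wavelet responses at each whole degree (0-360°), and store them in $h_x(k)$ and $h_y(k)$.
(2) Calculate the integrals of $h_x$ and $h_y$, defined as $S_x$ and $S_y$:

$$S_x(i) = \sum_{k=0}^{i} h_x(k) \tag{3}$$

The calculation of $S_y$ is similar to that of $S_x$.
(3) Calculate the sum of the Haar wavelet responses in the 60° sector region ending at any angle $i$:

$$W_x(i) = S_x(i) - S_x(i - 60) \tag{4}$$

Formula (4) applies directly for $i \ge 60$; windows that cross 0° are handled in the same way after extending $h_x$ periodically. The calculation of $W_y$ is similar to that of $W_x$. The local orientation vector can be calculated as formula (5):

$$m(i) = \sqrt{W_x(i)^2 + W_y(i)^2}, \qquad \phi(i) = \arctan\frac{W_y(i)}{W_x(i)} \tag{5}$$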
At the end we choose the longest local orientation vector over all windows as the dominant orientation of the feature point.
With this algorithm, the repeated calculations are eliminated and the shifting step of the sliding window is reduced from 5° to 1°. Compared with the original algorithm, the improved algorithm decreases the complexity and increases the accuracy.
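The procedure above can be sketched with prefix sums as follows. This is a minimal illustration under stated assumptions: the per-degree responses are given as two length-360 arrays, windows are indexed by their start angle rather than their end angle, and wrap-around at 0° is handled by extending the arrays periodically. All names are hypothetical.

```python
import numpy as np

def dominant_orientation(hx, hy, window=60):
    """Return (angle in radians, start degree) of the longest window vector.

    hx, hy: summed Haar responses per whole degree, length 360 (assumed input).
    """
    hx = np.asarray(hx, dtype=np.float64)
    hy = np.asarray(hy, dtype=np.float64)
    # Extend periodically so windows that cross 0 degrees need no special case.
    ext_x = np.concatenate([hx, hx[:window]])
    ext_y = np.concatenate([hy, hy[:window]])
    # Prefix sums with a leading zero: sx[i] is the sum of ext_x[:i].
    sx = np.concatenate([[0.0], np.cumsum(ext_x)])
    sy = np.concatenate([[0.0], np.cumsum(ext_y)])
    # The window starting at degree a covers a, a+1, ..., a+window-1.
    starts = np.arange(360)
    wx = sx[starts + window] - sx[starts]
    wy = sy[starts + window] - sy[starts]
    best = int(np.argmax(wx ** 2 + wy ** 2))
    return float(np.arctan2(wy[best], wx[best])), best
```

Each of the 360 window sums costs one subtraction per axis after the prefix sums are built, so moving from a 5° step to a 1° step does not bring back the repeated summation.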
2.3 Improved feature point matching method
Two feature points are matched by comparing their descriptors. At present, the global search method and the KD-Tree algorithm are widely used to search for matching points. The global search method is easy to implement, but it needs to calculate the distances between all points in the two point sets, so it requires a large amount of computation and detects many invalid matching points. The KD-Tree algorithm takes full advantage of the data structure of the feature points: by constructing a KD-Tree, it calculates the distances for only a part of the points in the two point sets. Although the KD-Tree algorithm reduces the computational complexity and improves the accuracy, constructing the tree costs additional time. Studies show that when the number of feature points is small, the KD-Tree does not noticeably increase the speed.
Hence, we propose an improved matching method based on the global search method. When searching for a corresponding point in the adjacent frame, we search only within a certain region around the feature point of the current frame instead of over the entire image. The size of this region is determined by the speed of the background motion. This method reduces the number of distance calculations between candidate matching points, decreases the number of invalid matching points, and reduces the complexity of the subsequent step. In short, it improves the speed and the accuracy at the same time. The similarity between candidate matching points is measured in two steps. The first step is an initial match using the sign of the Laplacian (i.e. the trace of the Hessian matrix). If two points have the same sign of the Laplacian, we proceed to the subsequent similarity measure; otherwise, we judge that the two points do not match. This minimal information allows for faster matching and gives a slight increase in performance. The second step is to calculate the Euclidean distance between the 64-dimensional descriptors of the two feature points. A matching pair is accepted if the distance to its nearest neighbor (the closest Euclidean distance in descriptor space) is less than 0.65 times the distance to the second nearest neighbor. Here, 0.65 is an adjustable threshold: the smaller it is, the fewer matching pairs we get. The process of feature point matching is summarized in Fig. 2.
Fig.2. Flowchart of feature points matching.
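The matching steps can be sketched as below. The sketch assumes that feature positions, descriptors and Laplacian signs are available as NumPy arrays; the function and parameter names are hypothetical. Skipping a feature that has fewer than two candidates in the search region is a choice made for this sketch, since the text does not say how that case is handled.

```python
import numpy as np

def match_features(pts_a, desc_a, lap_a, pts_b, desc_b, lap_b,
                   half_size=30, ratio=0.65):
    """Return index pairs (i, j) matching features of frame A to frame B.

    pts_*: (N, 2) positions, desc_*: (N, 64) descriptors,
    lap_*: (N,) Laplacian signs (+1 or -1). half_size is half the side
    of the square search region (an assumed parameterization).
    """
    matches = []
    for i in range(len(pts_a)):
        # Candidates: inside the square region and with the same Laplacian sign.
        near = np.all(np.abs(pts_b - pts_a[i]) <= half_size, axis=1)
        cand = np.flatnonzero(near & (lap_b == lap_a[i]))
        if cand.size < 2:
            continue  # the ratio test needs a second nearest neighbor
        dist = np.linalg.norm(desc_b[cand] - desc_a[i], axis=1)
        order = np.argsort(dist)
        # Accept only if the best candidate is clearly better than the runner-up.
        if dist[order[0]] < ratio * dist[order[1]]:
            matches.append((i, int(cand[order[0]])))
    return matches
```

Lowering `ratio` makes the test stricter and returns fewer pairs, while `half_size` would be chosen from the expected background displacement between adjacent frames.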
3 Global motion compensation and object detection
In this section, the first step is to compensate the global motion of camera by using the matching points. This step converts the detection in dynamic scenes into static scenes.
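We choose the affine transformation model to describe the global motion. The affine transformation, with six parameters, can describe translation, rotation, scaling and stretching. In two-dimensional space, the affine transformation can be expressed as formula (6):

$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} a_1 & a_2 \\ a_3 & a_4 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} t_x \\ t_y \end{bmatrix} \tag{6}$$

Here, $(x, y)$ refers to the feature points in the previous frame, and $(x', y')$ refers to the feature points in the current frame. $(t_x, t_y)$ represents translation, and $a_1, \dots, a_4$ represent rotation, scaling, stretching and so on.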
The matching pairs obtained in section 2 inevitably contain invalid matching points. We adopt the RANSAC algorithm to remove them and obtain the best set of interior points (inliers). We then apply the least squares method to this set to calculate the optimal global motion parameters (the affine transformation matrix), and compensate the previous frame using these parameters. After this step, the backgrounds of the previous frame and the current frame are aligned, and detection in dynamic scenes has been transformed into detection in static scenes. We then apply the frame difference method to the current frame and the affine-transformed frame to detect the moving objects. Finally, the binary image is processed by morphological operations to remove small holes and residual noise points and to smooth the contours of the objects. At this point, we get the accurate moving objects.
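The combination of RANSAC and least squares can be sketched in a few lines of NumPy. This is an illustrative sketch, not the implementation used in the experiments: the iteration count, the 2-pixel inlier tolerance, the fixed random seed and all function names are assumptions.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares affine fit; returns a 2x3 matrix [A | t] with dst ~ A src + t."""
    X = np.hstack([src, np.ones((len(src), 1))])      # N x 3 design matrix
    params, *_ = np.linalg.lstsq(X, dst, rcond=None)  # 3 x 2 solution
    return params.T

def apply_affine(M, pts):
    """Map (N, 2) points through the 2x3 affine matrix M."""
    return pts @ M[:, :2].T + M[:, 2]

def ransac_affine(src, dst, n_iter=200, tol=2.0, seed=0):
    """Estimate an affine model robustly; returns (matrix, inlier mask)."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(n_iter):
        # Three point pairs are the minimum needed for six parameters.
        sample = rng.choice(len(src), size=3, replace=False)
        M = fit_affine(src[sample], dst[sample])
        err = np.linalg.norm(apply_affine(M, src) - dst, axis=1)
        inliers = err < tol
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    if best_inliers.sum() < 3:
        raise ValueError("not enough inliers for an affine model")
    # Final least-squares refit on the best set of interior points.
    return fit_affine(src[best_inliers], dst[best_inliers]), best_inliers
```

Here `src` would hold the matched feature positions in the previous frame and `dst` those in the current frame; the returned matrix can then be used to warp the previous frame before frame differencing.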
4 Experimental results and analysis
All simulation experiments were carried out in the following environment. Hardware: Intel(R) Core(TM) i5 M520 CPU @ 2.40 GHz, 4 GB RAM, NVIDIA NVS 3100M. Software development tools: Microsoft VS 2008, OpenCV 2.3. The video frame size in the paper is 720 × 480, and the frame rate is 25 fps.
4.1 The results of improved SURF
In the paper, we improve SURF in two ways. The first is to limit the number of detected feature points by changing the range of the non-maximum suppression; the results are listed in Fig.3 and Table 1. The second is to adopt a fast method for calculating the feature point's dominant orientation; the results are listed in Fig.4 and Table 2.
Fig.3 Results of detected feature points
Table 1. Results of limiting the number of feature points

Range of non-maximum suppression    Number of feature points    Time/ms
3×3×3                               551                         315
5×5×3                               387                         209
7×7×3                               234                         137
Fig.3(a) shows the detection result when the non-maximum suppression is applied in the 3×3×3 neighborhood, as in the original SURF algorithm. Fig.3(b) and Fig.3(c) show the detection results in the 5×5×3 neighborhood and in the 7×7×3 neighborhood, respectively. As shown in Fig.3, most of the detected feature points are distributed on the background, which favors modeling the background. Furthermore, the improved SURF effectively limits the number of feature points, and the remaining feature points are robust and evenly distributed over the background. Table 1 also shows the results of limiting the number of feature points: Fig.3(a) contains 551 feature points detected in 315 ms, Fig.3(b) contains 387 feature points detected in 209 ms, and Fig.3(c) contains 234 feature points detected in 137 ms. These data show that the improved SURF effectively limits the number of feature points and decreases the detection time. However, too few feature points would reduce the accuracy of object detection, so we choose the 7×7×3 neighborhood to get about 200 feature points.
(a) Before improving
(b) After improving
Fig.4 Results of the fast method for dominant orientation
Table 2. Results of the fast method for dominant orientation

                 Time/ms    Matching pairs
SURF             137        177
Improved SURF    121        144
The improved SURF eliminates the repeated calculations and reduces the shifting step of the sliding window from 5° to 1°. Compared with the original algorithm, the improved algorithm decreases the complexity and increases the accuracy. As shown in Table 2, adopting the fast method to calculate the feature points' dominant orientation saves 16 ms of detection time, and it removes some of the invalid matching pairs, whose total drops from 177 to 144. Fig.4 compares the feature point matching obtained with the original SURF and with the improved SURF. Comparing Fig.4(a) and Fig.4(b), it is clear that some invalid matching pairs are removed in Fig.4(b), which demonstrates the effectiveness of the improved algorithm.
4.2 The results of improved matching method
(a) Before improving
(b) After improving
Fig.5 Results of improved matching method
Table 3. Results of improved matching method

                                         Global search method    Improved matching method
Matching pairs                           156                     160
Matching time/ms                         32                      16
Best set of interior points by RANSAC    51                      130
For feature point matching, we make several improvements to the global search method: limiting the search scope, checking the sign of the Laplacian, and comparing the nearest neighbor with the second nearest neighbor. In the paper, we search for a feature point's corresponding point in the adjacent frame within a 60 × 60 square centered on the feature point. Table 3 indicates that the improved matching method is much faster: the matching time drops from 32 ms to 16 ms. The two methods detect nearly the same number of matching pairs. However, with the original global search method there are only 51 matching pairs in the best set of interior points obtained by RANSAC, which means that 105 invalid matching pairs are removed by RANSAC. These numbers show that the original matching step produces too many invalid matching pairs, which reduces the accuracy of the global motion model. With the improved matching method, there are 130 matching pairs in the best set of interior points, and only 30 invalid matching pairs are removed by RANSAC. The improved matching method therefore effectively reduces the number of invalid matching pairs. Furthermore, with more matching pairs in the best set of interior points, the global motion model established by the least squares method is more precise and the result of moving object detection is more accurate. As shown in Fig.5, for the same frame, the moving object cannot be detected with the original matching method, but it is detected successfully with the improved matching method.
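The inlier counts in Table 3 can also be read as inlier ratios, which makes the comparison independent of the slightly different numbers of matching pairs. As a worked calculation using only the table values (the symbols $r_{\text{global}}$ and $r_{\text{improved}}$ are introduced here for illustration):

$$r_{\text{global}} = \frac{51}{156} \approx 0.33, \qquad r_{\text{improved}} = \frac{130}{160} \approx 0.81$$

That is, roughly one third of the pairs from the global search survive RANSAC, against roughly four fifths of the pairs from the improved method.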
4.3 The results of global motion compensation and object detection
We apply the proposed method based on improved SURF to detect the moving object in dynamic scenes. Fig.6 shows the results for the fourth, sixth and eighth frames. Fig.6(a) shows the original frames and Fig.6(b) shows the results of global motion compensation for the previous frame. Comparing (a) and (b), the backgrounds of the current frame and the previous frame are aligned, which indicates that the global motion compensation of the previous frame works well and that it achieves the transformation from dynamic scenes to static scenes. Fig.6(c) shows the detected object. The experimental results show that the proposed method is able to complete moving object detection in dynamic scenes.
(a) (b) (c)
Fig.6 Results of global motion compensation and object detection
Table 4. Time cost of the different algorithms

Algorithm        Quantity          Frame 4    Frame 6    Frame 8
SIFT             Feature points    308        276        323
                 Time/ms           703        621        723
SURF             Feature points    551        513        578
                 Time/ms           367        331        382
Improved SURF    Feature points    234        212        245
                 Time/ms           152        144        155
Table 4 shows the time cost of the methods based on SIFT, SURF and improved SURF. The method based on SIFT takes about 700 ms on average to process a video frame, and the method based on SURF takes about 350 ms, while the proposed method based on improved SURF takes only about 150 ms to complete the object detection.
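The rounded averages quoted above can be checked against the per-frame times for frames 4, 6 and 8 with a short calculation. The sketch uses only the values from Table 4; the variable names are illustrative.

```python
# Per-frame processing times in ms for frames 4, 6 and 8 (from Table 4).
times_ms = {
    "SIFT": [703, 621, 723],
    "SURF": [367, 331, 382],
    "Improved SURF": [152, 144, 155],
}

# Mean time per method over the three frames.
means = {name: sum(t) / len(t) for name, t in times_ms.items()}

# Speed-up of the improved method relative to the other two.
speedup = {name: means[name] / means["Improved SURF"] for name in ("SIFT", "SURF")}

for name, value in means.items():
    print(f"{name}: {value:.1f} ms")
```

Over these three frames the means come out near 682 ms, 360 ms and 150 ms, so the improved method is roughly 4.5 times faster than the SIFT-based method and about 2.4 times faster than the SURF-based one.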
These results indicate that the proposed moving object detection method based on improved SURF not only has high accuracy and robustness, but is also considerably faster.
To address moving object detection in dynamic scenes under a real-time requirement, an effective method based on an improved SURF algorithm was proposed. First, feature points are extracted by the improved SURF algorithm and matched by the improved matching method based on the global search method. We improved SURF in two ways: one is to limit the number of detected feature points by changing the range of the non-maximum suppression, and the other is to adopt a fast method for calculating the feature point's dominant orientation. Then the optimal global motion parameters (the affine transformation matrix) are calculated using RANSAC and the least squares method. Finally, the previous frame is compensated with these parameters, and the object is obtained by the frame difference method. After morphological image processing, we get the accurate moving object. The experimental results show that the proposed method can successfully detect the moving object in dynamic scenes. It not only has high accuracy and robustness, but is also considerably faster than the methods based on SIFT and SURF.