Effective Object Detection And Tracking Computer Vision Engineering Essay


Abstract- The aim of this paper is to describe the implementation and evaluation of a system that is able to detect and track multiple moving foreground objects through an image sequence. For this purpose, relevant portions of the paper "Learning patterns of activity using real-time tracking" by Chris Stauffer and W. Eric L. Grimson [23] were implemented. In accordance with their work, the report splits the tracking task into two sub-tasks. The first is to detect foreground objects in each frame of a sequence as blobs, and the second is to track the detected blobs across frames. To understand the dynamics of the object detection algorithm proposed by Stauffer and Grimson, modifications of it are evaluated. Experiments confirm that the proposed model detects foreground objects well under variable conditions. To track individual blobs across frames, the Kalman filter is used. Especially when tracking multiple objects simultaneously, a data association problem arises from the ambiguities between tracked blobs and newly detected blobs. An association technique which solves this problem in simple instances is developed and tested on several video sequences. These tests also indicate that the overall system can generally detect and track multiple non-occluding objects in the face of some clutter in the scene.

Keywords: detection and tracking, real-time, Mixture of Gaussians, Kalman filter, data association.


The rapidly increasing use of digital cameras provides constantly growing amounts of video imagery. In line with these technological advances, the political developments of recent years have (unfortunately) led to a dramatic rise in interest in surveillance, specifically to monitor vulnerable public places such as train stations, airports, and shopping centres. In view of the plethora of digital data being accumulated, an interesting and challenging problem is its algorithmic interpretation, from fault detection in factories to the recognition of abnormal behaviour. For surveillance and other applications, the sheer amount of data is becoming difficult to use and process manually. Not surprisingly, automated methods are seen as a promising solution to this dilemma, by highlighting salient events in a sequence. One of the central tasks towards this goal is motion tracking: given a sequence of images from a camera (for example one monitoring a street scene), it is desired to detect and track moving foreground objects in such a way that the objects' motion trajectories become evident. Motion tracking is thus aimed at determining the visual identity of objects at different points in time. Once such information is available, it can be used in later processing steps to reveal further information about the objects, their actions, their goals and their relation to the environment.


This 4th-year project describes the implementation and analysis of object detection and tracking in MATLAB; the system is then evaluated on different video and image sequences for its efficiency. Following the initial work by Stauffer and Grimson, the task is limited to object tracking from a static camera and decomposed into two parts, as below:

Problem Statement

The first part is that the method used for object detection should be capable of dealing with changes in scene geometry; for example, cars which park for a prolonged time should become part of the background after sufficient time has passed. Furthermore, repetitive motion in the background should not be identified as foreground motion. Moreover, for analysing data from outdoor scenes, an ability to adapt to illumination changes, both sudden and gradual, is a requirement for qualitatively good foreground segmentation.

The second part is how to associate observations with the true objects that caused them. When tracking multiple objects in a video, the object identities must be maintained across frames. Besides that, there is uncertainty about the true state, due to inaccurate observations from the background subtraction process (which only approximates an object's true position, extracts its rough shape, etc.).


The goals of this project were to implement the proposed object detection method as described by the authors and to gain insight into the update equations and the associated parameters. The second aim was to successfully track multiple objects in simple situations without occlusions, robustly combining all noisy observations into a more accurate state estimate in a process called filtering.

Related Research

There is a large body of literature on object detection, classification and tracking [13,1,49]. The survey presented here covers only work in the same context as this project. For a more comprehensive picture, some related information is also given on techniques similar to the task at hand which are not covered in this study.

Moving Object Detection

Adaptive Background Subtraction

A few different approaches exist to this basic scheme of background subtraction in terms of foreground region detection, background maintenance and post-processing. Heikkila and Silven use a simple version of this scheme, where a pixel at location x in the current image I_t is marked as foreground if the following condition is met:

|I_t(x) - B_t(x)| > τ

where τ is a predefined threshold and B_t is the background image. The background image is updated by the use of an Infinite Impulse Response (IIR) filter as follows:

B_{t+1}(x) = α I_t(x) + (1 - α) B_t(x)
Then, the foreground pixel map is cleaned up with morphological operations, such as dilation and erosion, and small-sized regions are eliminated.

Although this technique is robust at extracting most pixels of moving regions, even after they stop, it fails when dealing with dynamic changes, for instance when stationary objects uncover the background (e.g., a parked car moves out of the parking lot) or when sudden illumination changes occur.
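The per-pixel rule and IIR background update described above can be sketched as follows. This is a Python sketch rather than the original implementation, and the threshold τ and learning rate α values are illustrative assumptions:

```python
import numpy as np

def subtract_background(frame, background, tau=25.0, alpha=0.05):
    """One step of simple adaptive background subtraction.

    A pixel is foreground when |I_t(x) - B_t(x)| > tau, and the
    background is refreshed with the IIR update
    B_{t+1} = alpha * I_t + (1 - alpha) * B_t.
    """
    frame = frame.astype(np.float64)
    foreground = np.abs(frame - background) > tau      # boolean foreground mask
    new_background = alpha * frame + (1.0 - alpha) * background
    return foreground, new_background

# Toy sequence: a flat 4x4 scene with one bright moving "object" pixel.
background = np.full((4, 4), 100.0)
frame = background.copy()
frame[1, 2] = 200.0                                    # moving object

mask, background = subtract_background(frame, background)
print(mask.sum())                  # one foreground pixel detected
print(background[1, 2])            # background drifts toward the new value
```

Note how the IIR update slowly absorbs the object into the background, which is exactly the behaviour that makes the method fail on stopped objects.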

Statistical Methods

The above challenges can be overcome by more advanced methods that use the statistical characteristics of individual pixels. These methods are inspired by background subtraction in that they keep and dynamically update statistics of the pixels that belong to the background process, and identify foreground pixels by comparing each pixel's statistics with the corresponding background model. Compared to the previous technique, this approach is more popular due to its reliability in scenes containing noise, illumination changes and shadows [2]. Another example that illustrates the statistical methods is the paper by Stauffer and Grimson [4,5]. The authors describe an adaptive background mixture model for real-time tracking in which every pixel is separately modelled by a mixture of Gaussians, updated online by incoming image data, in order to decide whether a pixel belongs to the foreground or background process. An implementation of this model is explained in a later section.

Temporal Differencing

Temporal differencing is another method that can be used to identify background and foreground pixels. This approach attempts to detect moving regions by making use of the pixel-by-pixel difference of consecutive frames (more than two frames must be provided to the system) in a video sequence. Its advantage over the previous methods is that it is highly adaptive to dynamic scene changes; however, it fails to detect all the relevant pixels of some types of moving objects, and it cannot detect objects that have stopped in the scene. Fortunately, the approach can be complemented with higher-level processing to overcome these limitations. Lipton et al. proposed a two-frame differencing scheme where the pixels satisfying the following equation are marked as foreground [6,8,9]:

|I_n(x) - I_{n-1}(x)| > τ
Sometimes, three-frame differencing can be used to overcome the shortcomings of two-frame differencing [7,10]. For instance, Collins et al. developed a hybrid model that combines three-frame differencing with an adaptive background subtraction model for their project [11]. The hybrid algorithm successfully segments moving regions in video without the defects of temporal differencing or background subtraction.

Shadow and Light Change Detection

The algorithms described above perform robustly in indoor and outdoor environments and have been used for real-time surveillance for years. However, they are susceptible to shadows, highlights and global illumination changes (e.g., a sudden change of weather). Shadows can make object classification inaccurate, or make it fail altogether because object segmentation fails. The solutions proposed for this problem most commonly use chromaticity [12,13,14,15,18] or stereo [17] information to cope with shadows and sudden changes in illumination.

Horprasert et al. proposed a novel method for this challenge. In their work, each pixel is represented by a colour model that separates the brightness from the chromaticity component. When a pixel is observed, it is classified into one of four categories (background, shaded background or shadow, highlighted background, and moving foreground object) by computing the brightness and chromaticity distortion between the background and the current image pixel, similar to [16]. McKenna et al. [19] use both gradient and chromaticity information to cope with shadows, relying on the gradient information to resolve ambiguous cases. In some systems, a global light change is detected by counting the number of foreground pixels: if the total number exceeds some threshold (e.g., 60% of the image size), the system is reset to adapt to the sudden illumination change [20,21].

Object Classification

The effectiveness of background subtraction directly affects the result of object classification, because the moving regions detected in video may correspond to different real-world objects such as vehicles, pedestrians, plants, etc. Hence, object classification is an essential task in object detection and tracking, needed to recognize the type of each detected object. There are currently two main approaches to this classification: motion-based and shape-based. Motion-based methods use temporally tracked features of objects, whereas shape-based methods use the objects' 2D spatial information.

Motion-based classification

Motion-based classification is robust for distinguishing non-rigid objects from rigid objects. The method proposed by [23,24] is based on the temporal self-similarity of a moving object: as the object moves, its self-similarity is evaluated to detect periodic motion.

Shape-based classification

This method is more common than the previous approach. As the name implies, the classification is based on shape features (e.g., the bounding rectangle, area, silhouette or gradient of detected objects). Sometimes, dispersedness is used as the classification metric; it is defined in terms of the object's area and contour length (perimeter) by the following equation:

dispersedness = perimeter² / area
Classification is performed at each frame and tracking results are used to improve temporal classification consistency.

Saptharishi et al. propose a classification scheme which uses a logistic linear neural network trained with Differential Learning to recognize two classes: vehicle and person [25]. Papageorgiou et al. present a method that uses a Support Vector Machine classifier trained on wavelet-transformed object features (edges) in video images from a sample pedestrian database [26]. This method is used to recognize moving regions that correspond to humans.

Another classification method, proposed by Brodsky et al. [27], uses a Radial Basis Function (RBF) classifier which has an architecture similar to a three-layer back-propagation network. The input to the classifier is the normalized gradient image of the detected object regions.

Object Tracking

Tracking normally follows detection: once an object has been detected, tracking follows its path across the frames. Tracking is a challenging problem that attracts wide interest among computer vision researchers. There are two common approaches: one is based on correspondence matching, and the other is carried out using position prediction or motion estimation. In addition, techniques that track parts of objects employ model-based schemes to locate and track body parts. Example models are the stick figure, the cardboard model [27], 2D contours and 3D volumetric models.

Amer [28] presents a non-linear, voting-based scheme for tracking objects as a whole. It uses object features such as size, centre of mass and motion in a voting step that decides the final matching, or correspondence, for each object in the frame. This method can handle problems such as occlusion and object splitting.

Stauffer and Grimson [23] propose a linearly predictive multiple-hypothesis tracking algorithm. It resembles a hybrid method, in which the size and position of objects are incorporated to seed and maintain a set of Kalman filters for motion estimation. The Extended Kalman filter, a generalisation of the Kalman filter to non-linear models, has also been used to predict trajectories and handle occlusions.

PFinder [24] uses a multi-class statistical model of colour and shape to track people in real time.


Object Detection and Tracking

Figure 1.0 below shows an overview of the real-time video object detection, classification, and tracking system. The proposed system robustly distinguishes foreground from background objects, and also distinguishes stationary and stopped foreground objects from the static background model in dynamic situations. Furthermore, the system is able to classify humans and vehicles, and to track only humans or pedestrians in real time.

The system is assumed to work in a real-time environment. Hence, the selection of techniques for the various challenges is driven by their computational run-time performance as well as their quality. Besides that, the system is able to deal with both stationary cameras and PTZ (Pan/Tilt/Zoom) cameras, where the view frustum may change arbitrarily.

Figure 1.0: The system block diagram

The system starts by fetching video from a camera (PTZ or static) monitoring a site; it can handle both monochrome and full-colour video imagery. It then runs the foreground and background detection algorithm, where the background model is initialised from the first few frames. To achieve this, a hybrid combination of algorithms is used: adaptive background subtraction with thresholding, a Mixture of Gaussians, and temporal differencing. Each algorithm compensates for the limitations of the others. The combined algorithm creates a foreground pixel map using region-based classification.

The object classification algorithm uses the foreground pixels belonging to each individual connected region to create a silhouette for the object. The silhouette and the centre of mass of an object are used to generate a distance signal. This signal is scaled, normalized and compared with pre-labelled signals in a template database to decide on the type of the object. The output of the tracking step is used to attain temporal consistency in the classification step.

The object tracking algorithm, in turn, uses the features extracted by the detection algorithms together with a correspondence matching scheme to track objects from frame to frame.

Object Detection

Figure 1.1: The object detection system diagram

In the object detection system, foreground detection combines a background model with some low-level image post-processing. These algorithms produce a foreground pixel map, from which the features of every detected object are extracted in each video frame. For robust detection the background model must be well maintained; it undergoes both initialization and update. The details are described in the following sections:

Background model

Adaptive Background-Subtraction model

Figure 1.2: MoG- Foreground Detection (adapted from dissertation)

Given a multimodal distribution over data points, the Mixture of Gaussians is a useful density model that decomposes the distribution into a number of distinct Gaussian densities. While a single Gaussian density is unimodal, the Mixture of Gaussians achieves multimodality by mixing (i.e., weighting) several different Gaussians together. The resulting distribution is able to represent multiple high-density areas as a compact extension of the single Gaussian. Formally, the multivariate Mixture of Gaussians with K components is defined as follows:

p(x | θ) = Σ_{k=1}^{K} π_k N(x; μ_k, Σ_k)

where θ is taken to be the concatenation of all mean vectors and covariance matrices into one parameter. The parameters π_k are non-negative mixing proportions which sum to one (Σ_k π_k = 1). Each proportion π_k can be seen as the prior of mixture component k. The mixture components are multivariate Gaussian densities, parameterised by their mean vectors μ_k and covariance matrices Σ_k, and given by:

N(x; μ_k, Σ_k) = (2π)^{-d/2} |Σ_k|^{-1/2} exp(-½ (x - μ_k)^T Σ_k^{-1} (x - μ_k))

A Mixture of Gaussians is a discrete latent variable model that explains the data generation process: given a mixture, new data can be generated by randomly selecting a component k according to the probabilities π_k and then drawing a sample from that component's Gaussian density.
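The way such a per-pixel mixture can be maintained online, in the spirit of Stauffer and Grimson's method, can be sketched as below. This is a simplified grayscale sketch in Python, not the paper's exact formulation: matching the first component within 2.5 standard deviations, the simplified learning rate ρ = α/w_k, and the replacement variance and weight are illustrative assumptions.

```python
import numpy as np

def update_mog_pixel(x, w, mu, sigma2, alpha=0.05, match_thresh=2.5):
    """One online mixture update for a single grayscale pixel.

    w, mu, sigma2 are length-K arrays (weights, means, variances).
    Returns True if x matched an existing component (candidate
    background), False if the weakest component was replaced
    (the pixel is likely foreground).
    """
    dist = np.abs(x - mu) / np.sqrt(sigma2)
    matches = dist < match_thresh
    if matches.any():
        k = int(np.argmax(matches))          # first matching component
        w *= (1.0 - alpha)                   # decay all weights
        w[k] += alpha                        # reinforce the matched one
        rho = alpha / w[k]                   # simplified learning rate
        delta = x - mu[k]
        mu[k] += rho * delta
        sigma2[k] += rho * (delta ** 2 - sigma2[k])
        matched = True
    else:
        k = int(np.argmin(w))                # replace the weakest component
        mu[k], sigma2[k], w[k] = x, 900.0, 0.05
        matched = False
    w /= w.sum()                             # renormalise mixing proportions
    return matched

w = np.array([0.7, 0.3])
mu = np.array([100.0, 180.0])
sigma2 = np.array([25.0, 25.0])
m1 = update_mog_pixel(101.0, w, mu, sigma2)  # close to component 0: matched
m2 = update_mog_pixel(10.0, w, mu, sigma2)   # matches nothing: replacement
print(m1, m2)
```

Pixels whose values keep matching a heavy component are treated as background, while values that match no component trigger a replacement, which is the mechanism that lets the model absorb scene changes over time.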

Figure 1.3: Two different views of a sample pixel process (in blue) and corresponding Gaussian distributions shown as alpha-blended red spheres

Temporal Differencing

This method uses the pixel-wise difference between two or three consecutive frames of video imagery to extract moving regions. It is a highly adaptive approach to dynamic scene changes; however, it fails to extract all the relevant pixels of a foreground object, especially when the object has uniform texture or moves slowly. Moreover, once an object stops moving, temporal differencing fails to detect a change between consecutive frames and loses the object. Special supportive algorithms are needed to handle these situations.

In this project, two-frame temporal differencing is implemented. Let I_n(x) be the gray-level intensity value, in the range [0, 255], at pixel position x of the n-th frame. The two-frame temporal differencing scheme marks a pixel as moving if it satisfies the following condition:

|I_n(x) - I_{n-1}(x)| > τ

The implementation of two-frame differencing can be completed by exploiting the background subtraction model's update parameters: if the learning rates α and ρ are set to zero, the background holds the previous image and the background subtraction scheme becomes identical to two-frame differencing.
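The two-frame differencing rule above can be sketched as follows (a Python sketch; the threshold value is an illustrative assumption):

```python
import numpy as np

def temporal_difference(prev_frame, frame, tau=20):
    """Two-frame temporal differencing: a pixel is 'moving' when
    |I_n(x) - I_{n-1}(x)| > tau."""
    # cast to a signed type so the subtraction cannot wrap around
    diff = np.abs(frame.astype(np.int32) - prev_frame.astype(np.int32))
    return diff > tau

prev_frame = np.zeros((3, 3), dtype=np.uint8)
frame = prev_frame.copy()
frame[0, 1] = 200                       # one pixel changed between frames
moving = temporal_difference(prev_frame, frame)
print(moving.sum())                     # -> 1
```

The cast to a signed integer type matters: subtracting `uint8` arrays directly would wrap around and hide large negative differences.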

Post-level Processing

Even after background and foreground objects have been distinguished successfully, the raw output is not yet suitable for further processing, and the system may give inaccurate results if no post-processing is applied. A few factors affect the quality of foreground detection, for instance camera noise, reflectance noise, noise from background-coloured objects, shadows and sudden illumination changes. This noise would degrade the final result. To overcome the problem, several techniques are introduced at the post-processing stage, including morphological operations such as erosion and dilation. Combining erosion and dilation reduces the noise caused by the camera, reflectance and background-coloured objects.

Erosion erodes the one-unit-thick boundary pixels of foreground regions, whereas dilation is the dual of erosion: as its name implies, it expands the foreground region boundaries by one-unit-thick pixels.
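A minimal sketch of these morphological operations on a binary foreground mask, in pure NumPy with a 3x3 structuring element; the opening sequence (erode then dilate) shown here is one common way to remove single-pixel noise:

```python
import numpy as np

def dilate(mask):
    """Binary dilation with a 3x3 structuring element (pure NumPy)."""
    padded = np.pad(mask, 1)
    out = np.zeros_like(mask)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            # OR together all eight shifted copies plus the original
            out |= padded[1 + dy : 1 + dy + mask.shape[0],
                          1 + dx : 1 + dx + mask.shape[1]]
    return out

def erode(mask):
    """Binary erosion is dilation of the complement, complemented."""
    return ~dilate(~mask)

def open_mask(mask):
    """Opening (erode then dilate) removes isolated single-pixel noise."""
    return dilate(erode(mask))

mask = np.zeros((6, 6), dtype=bool)
mask[1:4, 1:4] = True          # a genuine 3x3 foreground blob
mask[5, 5] = True              # single-pixel camera noise
cleaned = open_mask(mask)
print(cleaned.sum())           # the blob survives, the noise pixel is gone
```

The 3x3 blob survives the opening intact while the lone noise pixel is removed, which is the behaviour the post-processing stage relies on.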

Figure 1.4: Pixel level noise removal sample. (a) Estimated background image. (b) Current image. (c) Detected foreground regions before noise removal. (d) Foreground regions after noise removal.

For shadows and sudden illumination changes, the algorithm proposed here exploits the fact that the RGB colour vector of a pixel in a shadow region is almost parallel to the RGB colour vector of the corresponding background pixel, with only a small amount of deviation, and that the brightness of a shadow pixel is normally less than that of the corresponding background pixel. The expression below illustrates how shadow pixels are separated from the foreground: a pixel x with colour vector I(x) is classified as shadow over background colour B(x) if

(I(x) · B(x)) / (|I(x)| |B(x)|) > τ_s and |I(x)| < |B(x)|

where τ_s is a predefined threshold close to 1. The normalized dot product tests whether I(x) and B(x) point in the same direction: if it is close to one, the two vectors are almost parallel, with only a small deviation.
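The chromaticity test above can be sketched as follows. This is a Python sketch of the rule as described, and the threshold values (including the lower darkness bound, which prevents very dark foreground pixels from being labelled shadow) are assumptions:

```python
import numpy as np

def is_shadow(pixel, bg_pixel, tau_dir=0.99, tau_dark=0.6):
    """Classify an RGB pixel as shadow over its background pixel.

    Assumed rule: the two RGB vectors point in almost the same
    direction (normalised dot product close to 1) and the pixel is
    darker than the background, but not darker than tau_dark of it.
    """
    pixel = np.asarray(pixel, dtype=np.float64)
    bg_pixel = np.asarray(bg_pixel, dtype=np.float64)
    p_norm = np.linalg.norm(pixel)
    b_norm = np.linalg.norm(bg_pixel)
    if p_norm == 0 or b_norm == 0:
        return False
    cos_angle = float(pixel @ bg_pixel) / (p_norm * b_norm)
    return cos_angle > tau_dir and tau_dark * b_norm < p_norm < b_norm

bg = [120, 110, 100]
print(is_shadow([84, 77, 70], bg))     # same hue at 70% brightness
print(is_shadow([40, 110, 100], bg))   # hue shifted: true foreground
```

The first pixel is exactly the background colour scaled to 70% brightness, so it passes both tests; the second changes hue, so the direction test rejects it.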

Figure 1.5: shadow removal sample. (a) Estimated background. (b) Current image. (c) Detected foreground pixels (shown as red) and shadow pixel (shown as green). (d) Foreground pixels after shadow pixels are removed.

Handling sudden illumination changes is also an essential requirement for effective object detection and tracking. It can be addressed by observing that under a pure light change the topology of the object edges in the scene does not change much, and the boundaries of the detected foreground regions do not correspond to actual edges in the scene, whereas in the case of genuine object motion the boundaries of the detected foreground regions do correspond to actual edges in the image. To check this boundary correspondence, the gradient is used: the gradient difference between the current image and the background reference is obtained, the edges are extracted, and the edge map is smoothed by morphological operations.

Figure 1.6: Detecting true light change. (a) Background reference. (b) Background's gradient. (c) Current image. (d) Gradient's difference.

Extracting Object Features

The object features are used in object tracking; they include the centre of mass, size, and silhouette contour of the object's blob.

For the calculation of the centre of mass c = (c_x, c_y) of an object region O, the following equation is used:

c_x = (1/n) Σ_{(x,y)∈O} x,    c_y = (1/n) Σ_{(x,y)∈O} y

where n is the number of pixels in O.
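The centre-of-mass computation is a simple average of the blob's pixel coordinates, as in this sketch:

```python
import numpy as np

def centre_of_mass(blob_pixels):
    """Centre of mass of a blob: the average of its pixel
    coordinates, c = (1/n) * sum over (x_i, y_i)."""
    pts = np.asarray(blob_pixels, dtype=np.float64)
    return pts.mean(axis=0)

blob = [(2, 3), (4, 3), (2, 5), (4, 5)]   # corners of a small square blob
cm = centre_of_mass(blob)
print(cm)                                 # -> [3. 4.]
```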

Object tracking

The aim of tracking is to efficiently combine a series of corrupted "snapshot" observations, derived from the foreground blobs of an object, into a single, coherent motion trajectory that summarises the position of the object throughout the sequence. The tracking method developed here uses the object features found previously to match objects in consecutive frames. By analysing the object information, the tracking system is able to detect left and removed objects as well. The tracking system diagram is shown below:

Figure 1.7: The object tracking system diagram

Figure 1.8: The correspondence-based system diagram

Correspondence-based Object Matching

The tracking system starts with correspondence matching, i.e., matching the objects in the previous image to the objects detected in the current image. The matching is stored in memory as a bipartite graph. To perform the matching, the list of previous objects is compared against the new blobs to evaluate their correspondence: for each previous object, the system checks whether a new object is close to it. The criterion used here is the centre of mass, which measures the closeness of an object in the previous frame to an object in the current frame. Two objects with centres of mass c_p and c_n are considered close if they satisfy the following equation:

Dist(c_p, c_n) < τ_d

where τ_d is a predefined distance threshold.
After that, the next step is to check the similarity between the two objects to improve the matching. The criterion used here is the size ratio of the objects, since an object cannot grow or shrink too much between frames. With s_p and s_n the sizes of the previous and current objects, the size ratio must satisfy:

τ_min < s_n / s_p < τ_max

where τ_min and τ_max are predefined threshold values. Checking the object sizes is especially useful if an object in the previous frame splits into a large and a very small region due to inaccurate segmentation; this check eliminates the chance of matching a big region to a small one.
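The distance and size-ratio checks can be combined into a greedy matching loop, sketched below. The object representation (a dict with a `centre` and a `size`) and the threshold values are hypothetical, and a greedy nearest-first strategy is used rather than an optimal bipartite assignment:

```python
import math

def match_objects(prev_objects, new_objects, max_dist=30.0,
                  min_ratio=0.5, max_ratio=2.0):
    """Greedy correspondence matching between frames.

    A previous object matches a new one when their centres are closer
    than max_dist and the size ratio new/prev lies within
    [min_ratio, max_ratio]. Each new object is claimed at most once.
    """
    matches = {}
    used = set()
    for pid, prev in prev_objects.items():
        best, best_d = None, max_dist
        for nid, new in new_objects.items():
            if nid in used:
                continue
            d = math.dist(prev["centre"], new["centre"])
            ratio = new["size"] / prev["size"]
            if d < best_d and min_ratio <= ratio <= max_ratio:
                best, best_d = nid, d
        if best is not None:
            matches[pid] = best
            used.add(best)
    return matches

prev_objects = {1: {"centre": (10, 10), "size": 100},
                2: {"centre": (50, 50), "size": 80}}
new_objects = {"a": {"centre": (12, 11), "size": 95},   # object 1, moved a little
               "b": {"centre": (52, 49), "size": 5},    # tiny split fragment
               "c": {"centre": (55, 52), "size": 85}}   # object 2, moved
result = match_objects(prev_objects, new_objects)
print(result)                                           # -> {1: 'a', 2: 'c'}
```

Note how blob "b", although closest to object 2, is rejected by the size-ratio test, which is exactly the segmentation-split case the check is meant to handle.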

Object Classification

The ultimate goal of effective object detection and tracking is to extract the proper objects in response to what the user needs. In this project, the objects of interest are humans and vehicles. Hence, I developed a novel video object classification method based on object shape similarity as part of our visual surveillance system.

The Classification Metric

Our object classification metric is based on the similarity of object shapes. There are numerous methods in the literature for comparing shapes [4, 7, 2, 3, 22]; the reader is especially referred to the surveys presented in [7, 3] for good discussions of different techniques. The important requirements of a shape comparison metric are scale, translation and rotation invariance; our method satisfies all three of these properties.

Figure 1.8: Sample object silhouette and its corresponding original and scaled distance signals. (a) Object silhouette (b) Distance signal (c) Scaled distance signal.

Scale invariance: Since we use a fixed length for the distance signals of object shapes, the normalized-and-scaled distance signal will almost be the same for two different representations (in different scales) of the same pose of an object.

Translation invariance: The distance signal is independent of the geometric position of the object shape since the distance signal is calculated with respect to the centre of mass of the object shape. Due to the fact that the translation of the object shape will not change the relative position of the centre of mass point's position with respect to the object, the comparison metric will not be affected by translation.

Rotation invariance: We do not use the rotation invariance property of our classification metric since we want to distinguish even the different poses of a single object for later steps in the surveillance system. However, by choosing a different starting point p on the silhouette of the object in contour tracing step, we could calculate distance signals of the object for different rotational transformations for each starting point p.
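The distance-signal construction described above can be sketched as follows. This is a Python sketch under stated assumptions: the fixed signal length, the resampling by evenly spaced indices, and normalisation by the mean are simplifications of the scaling and normalisation steps:

```python
import math

def distance_signal(contour, n_samples=64):
    """Scale- and translation-invariant distance signal of a silhouette.

    The contour is a list of (x, y) boundary points in tracing order.
    Distances are measured from the centre of mass (translation
    invariance), resampled to a fixed length and divided by their
    mean (scale invariance).
    """
    cx = sum(p[0] for p in contour) / len(contour)
    cy = sum(p[1] for p in contour) / len(contour)
    dists = [math.hypot(x - cx, y - cy) for x, y in contour]
    # resample to a fixed length by evenly spaced indices
    step = len(dists) / n_samples
    sampled = [dists[int(i * step)] for i in range(n_samples)]
    mean = sum(sampled) / n_samples
    return [d / mean for d in sampled]

# A square contour and the same square scaled by 3: signals should agree.
square = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1)]
big_square = [(3 * x, 3 * y) for x, y in square]
s1 = distance_signal(square, n_samples=8)
s2 = distance_signal(big_square, n_samples=8)
print(max(abs(a - b) for a, b in zip(s1, s2)))   # ~0: scale invariant
```

Because every distance is divided by the signal's own mean, uniformly scaling the silhouette leaves the signal unchanged, which is the scale-invariance property claimed above.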


I implemented a video player application (vPlayer) to test our algorithms. The video player can play video clips stored in compressed and uncompressed AVI format. The player application both displays the video data on the screen and at the same time feeds the images to our video analyser algorithms, such as the object tracker. The architecture of the player application is made flexible in order to load different types of video clips and use different video analysers. Thus, we created two APIs: VideoDecompressorAPI to load video images and VideoAnalyzerAPI to analyse video images by using several algorithms. The application is implemented using MATLAB R2010b. All of the tests in the next sections are performed by using the player application, vPlayer, on the Microsoft Windows 7 Ultimate x64 operating system on a computer with an Intel Core 2 Duo 2.2 GHz CPU and 2.0 GB of RAM.

Object Detection and tracking

The system generally computes good foreground masks during daytime (images 1-4), but occasionally fails to distinguish fine colour tones from the background (image 2). Changing sunlight appears to require components to transition to new pixel modes and thus increases the variance in significant portions of the image (image 2). This could be avoided by lowering the standard deviation threshold L to encourage replacement. Image 5 was recorded when the flood lights were turned on and illustrates the effects of sudden global lighting changes and recovery thereafter (images 6 and 7). The model is able to adapt to changes in scene geometry and includes the parking car in image 8 into the background after two minutes (image 9). During night time, the waving tree branches in the lower left corner cause the variance to be particularly high there (images 7-10). Also, during these times well-illuminated areas will generally have lower variance. Global lighting change is visible again in image 11 and adaptation to sunrise evident in the hours after (images 12-14). Although variances are overall increased during night-time, it can recover more confident mixtures in the morning.

Table 1.0: The parameters which were used for the live web-cam experiment. α and ρ were set high to account for the low frame-rate of approximately 1fps.

Throughout these phases, the algorithm is often able to recover useful foreground masks. It is interesting to note that in the mean plots we occasionally observed patchy areas which did not seem to recover a reasonable background estimate even after considerable time (images 7-9 on the left). It is suspected this happens when a component has very small variance (usually when pixels were saturated at a value of (255, 255, 255)^T for some time), and the replacement components (which are supposed to capture new input) are repeatedly deleted because of the increased noise level. This suggests we should either set L higher during night-time, or even use individual thresholds instead, as suggested by Stauffer and Grimson. Another improvement could be to constrain variances to remain above a certain constant (for example by thresholding them to that constant if they would drop below it). This technique was employed, for example, by François and Medioni [10]. The robustness evaluation of the system showed a reasonable ability to adapt to significant changes in lighting and to recover from extended periods of high noise levels. The significant claims of Stauffer and Grimson appear to be confirmed. The parameters used for this experiment are given in Table 1.0. Note the relatively high settings of α and ρ due to the low frame rate. The parameters (notably α, ρ and L) were not finely tuned for this purpose, and promise improved performance if given more careful attention.

Object Classification

In order to test the object classification algorithm we first created a sample object template database by using an application to extract and label object silhouettes. The GUI of this application is shown in Figure 1.9. We used four sample video clips that contain humans, human groups and vehicles. The number of objects of each type extracted from each movie clip is shown in Table 1.3. We used the sample object database to classify objects in several movie clips containing humans, human groups and vehicles. We prepared a confusion matrix, shown in Table 1.4, to measure the performance of our object classification algorithm.

Table 1.1: Occlusion handling results for sample clips

Figure 1.9: Sample video frames before and after occlusions

Table 1.3: Number of object types in the sample object template database

Table 1.4: Confusion matrix for object classification

Object Tracking

Table 1.5: Tracking objects in the PETS2000 sequence. From top left to bottom right: tracked objects are highlighted by bounding boxes and their associated labels in representative snapshots. The foreground masks for each frame are given below the corresponding tracking snapshot. Shown are only connected components of size greater than 10 pixels.

The overall system, from foreground detection to object tracking, was implemented in 4000 lines of MATLAB code and is able to process approximately four frames of size 160×120 pixels per second on the Intel machine, using no graphical output. The rate drops noticeably when more than a few objects are being tracked, as the Kalman filter updates were not optimised for this purpose. The system thus currently processes fewer frames than the 11 to 13 frames (of the same size) per second that Stauffer and Grimson claimed. Note that the code was not yet compiled to a native executable, so a significant increase in speed can still be expected. The system is able to successfully detect and track non-occluding objects in a number of different sequences under varying conditions. The following examples show snapshot images of the state of the tracker at different points in various sequences. As a first example, the results of tracking are shown for the PETS2000 sequence. Table 1.5 shows the tracking output, as well as the computed background segmentation, for representative frames in the sequence as time progresses from top left to bottom right. The state representation used for Kalman tracking includes position, velocity and blob size. The association algorithm solved the linear assignment problem for matrices based on the negative log-likelihood cost function in the equation above. In Table 1.5, the bounding boxes of matched foreground blobs and the numeric IDs of the associated Kalman filters are highlighted in green; the solid green circles mark the centroid predictions of those Kalman filters which received an observation during this frame. In this instance, all filters receive appropriate observations. The slightly degraded image quality in this and the other examples is due to the frame size of 120×160 pixels that was used for all experiments, and highlights that even limited amounts of data can be used in tracking applications.
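The constant-velocity Kalman recursion applied to each tracked blob can be sketched as below. This is a simplified Python sketch rather than the project's MATLAB implementation: the noise levels q and r are assumptions, and the blob-size component of the state is omitted, tracking only the 2D centroid and its velocity:

```python
import numpy as np

# Constant-velocity Kalman filter for a 2D centroid: state [x, y, vx, vy].
dt, q, r = 1.0, 1e-2, 1.0
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)   # constant-velocity dynamics
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)   # we observe only the position
Q = q * np.eye(4)                           # process noise
R = r * np.eye(2)                           # measurement noise on the centroid

def kf_step(x, P, z):
    """One predict-update cycle given a blob centroid observation z."""
    # predict
    x = F @ x
    P = F @ P @ F.T + Q
    # update
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(4) - K @ H) @ P
    return x, P

# Track a blob moving one pixel right per frame with noiseless observations.
x = np.array([0.0, 0.0, 0.0, 0.0])
P = 10.0 * np.eye(4)                        # large initial uncertainty
for t in range(1, 8):
    x, P = kf_step(x, P, np.array([float(t), 0.0]))
print(np.round(x[:2], 1))   # position estimate near the last observation
```

After a few frames the velocity estimate converges to roughly one pixel per frame, so the prediction step can keep anticipating the blob's next centroid, which is what the data association stage relies on.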

The sequence proceeds as follows: in frame 540 a car, labelled as object 53, is driving on the road. In the background, a person is barely visible, walking on a small path towards the parked cars. The person is not yet tracked as its connected component is of size less than 10 pixels. In frame 650, the car is parked at the side of the road, and the person has advanced further into the scene. Frame 930 shows that the parked car is no longer tracked. A person who has left the car is labelled as object 68, and the person entering from the background is now tracked as object 69. A white car, which has quickly crossed the scene, is tracked as object 64 and is still visible in the top left of this frame. Frame 1020 shows the two people (objects 68 and 69) walking along the street. By frame 1080, object 68 has left the scene and only object 69 is still tracked. The final frame 1190 shows a new person in the scene, who is labelled as object 71 and tracked together with the first person (object 69) from frame 930.

This sequence illustrates a number of features of the tracking algorithm. For one, background segmentation produces adequate foreground blobs for use in the tracking algorithm as the various foreground masks indicate. In the foreground mask associated with frame 930 it is visible that object 53 has now been absorbed into the background. This confirms the ability to adapt to changing scene geometry, as claimed by Stauffer and Grimson. Furthermore, tracking of objects 53, 68 and 69 is evident from the snapshots. The creation of new hypotheses (64, 68, 69, and 71) as well as the deletion of objects (53, 64 and 68) show that the pool of hypotheses is adapted dynamically, depending on the foreground blobs that are available.
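The dynamic adaptation of the hypothesis pool noted above (new tracks spawned for unmatched blobs, stale tracks deleted when they go unobserved) can be sketched as follows. The Track class, the max_missed threshold and all names are illustrative assumptions, not taken from the actual implementation.

```python
# Sketch of a dynamic hypothesis pool: unmatched blobs spawn new tracks,
# and tracks that receive no observation for too long are deleted.
# All names and the max_missed threshold are illustrative assumptions.

class Track:
    def __init__(self, track_id, centroid):
        self.id = track_id
        self.centroid = centroid
        self.missed = 0          # consecutive frames without an observation

def update_pool(tracks, blobs, matches, next_id, max_missed=5):
    matched_tracks = {t for t, b in matches}
    matched_blobs = {b for t, b in matches}
    for t, b in matches:         # matched tracks absorb their observation
        tracks[t].centroid = blobs[b]
        tracks[t].missed = 0
    for i, tr in enumerate(tracks):
        if i not in matched_tracks:
            tr.missed += 1
    # delete stale hypotheses, then create new ones for unmatched blobs
    tracks = [tr for tr in tracks if tr.missed <= max_missed]
    for j, blob in enumerate(blobs):
        if j not in matched_blobs:
            tracks.append(Track(next_id, blob))
            next_id += 1
    return tracks, next_id
```

In the PETS2000 run above this corresponds to, for example, object 53 being deleted once the parked car is absorbed into the background while objects 68 and 69 are spawned from newly appearing blobs.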


In this paper we presented a set of methods and tools for a "smart" visual surveillance system.

We implemented three different object detection algorithms and compared their detection quality and runtime performance. The adaptive background subtraction scheme gives the most promising results, in terms of both detection quality and computational complexity, for use in a real-time surveillance system with more than a dozen cameras. However, no object detection algorithm is perfect, and neither is ours: it still needs improvements in handling dark shadows, sudden illumination changes and object occlusions. Higher-level semantic extraction steps could be used to support the object detection step, enhancing its results and eliminating inaccurate segmentations.
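A minimal sketch of adaptive background subtraction in the spirit of the scheme above: a running-average background model with per-pixel thresholding. The learning rate alpha and the threshold are illustrative values, not those used in our system, which models each pixel more richly (as a mixture of Gaussians).

```python
import numpy as np

# Minimal adaptive background subtraction: a running-average background model
# with per-pixel thresholding. alpha and thresh are illustrative values.

def adaptive_bg_subtract(frames, alpha=0.05, thresh=25.0):
    bg = frames[0].astype(float)             # initialise from the first frame
    masks = []
    for frame in frames[1:]:
        f = frame.astype(float)
        mask = np.abs(f - bg) > thresh       # foreground where change is large
        bg = (1 - alpha) * bg + alpha * f    # slowly absorb scene changes
        masks.append(mask)
    return masks
```

The same adaptation mechanism is what lets stationary objects, such as the parked car in the PETS2000 sequence, eventually be absorbed into the background model.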

The proposed whole-body object tracking algorithm successfully tracks objects across consecutive frames. Our tests in sample applications show that a nearest neighbour matching scheme gives promising results and that no more complicated methods are necessary for whole-body tracking of objects. Also, in handling simple object occlusions, our histogram-based correspondence matching approach successfully recovers the identities of objects that entered an occlusion once they split again. However, due to the nature of the heuristic we use, our occlusion handling algorithm will fail to distinguish occluding objects if they are of the same size and colour. Also, in crowded scenes, handling occlusions becomes infeasible with such an approach; a pixel-based method such as optical flow is then required to identify object segments accurately.
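The histogram-based correspondence matching idea can be sketched as follows: each object stores a normalised colour histogram before a merge, and after the split each blob is assigned the identity of the best-matching stored histogram. Histogram intersection is used here as one plausible similarity measure; the exact measure, colour space and bin counts of our implementation may differ.

```python
import numpy as np

# Sketch of histogram-based identity matching after an occlusion split.
# Histogram intersection is one common similarity; parameters are illustrative.

def colour_histogram(pixels, bins=16):
    h, _ = np.histogram(pixels, bins=bins, range=(0, 256))
    return h / max(h.sum(), 1)                   # normalise to sum to 1

def match_identities(stored_hists, blob_hists):
    """Assign each split blob the identity of the closest stored histogram."""
    assignments = []
    for bh in blob_hists:
        sims = [np.minimum(bh, sh).sum() for sh in stored_hists]  # intersection
        assignments.append(int(np.argmax(sims)))
    return assignments
```

Because the comparison relies purely on appearance statistics, two objects with near-identical histograms yield near-identical similarities, which is precisely the failure case noted above.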

We proposed a novel object classification algorithm based on object shape similarity. The method is generic and can be applied to other classification problems as well. Although this algorithm gives promising results in categorizing object types, it has two drawbacks: (a) the method requires effort to create a labelled template object database, and (b) the method is view dependent. If (b) could be eliminated, (a) would automatically disappear, since a single global template database would suffice to classify objects. One way to achieve this may be to generate a template database covering all possible silhouettes of the different classes. This would increase the computational time, but may help to overcome the need for creating a separate template database for each camera position.
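A hedged sketch of template-based shape classification: each silhouette contour is reduced to a fixed-length, scale-normalised distance signal from its centroid, and an object is labelled by its nearest template in the labelled database. The distance-signal descriptor and all parameter choices here are illustrative, not necessarily those of our method.

```python
import numpy as np

# Illustrative template-based shape classification: contours are reduced to
# fixed-length centroid-to-contour distance signals and matched by L1 distance.

def distance_signal(contour, n=64):
    c = contour.mean(axis=0)                         # silhouette centroid
    d = np.linalg.norm(contour - c, axis=1)          # distance to each point
    idx = np.linspace(0, len(d) - 1, n).astype(int)  # resample to fixed length
    sig = d[idx]
    return sig / max(sig.max(), 1e-9)                # scale invariance

def classify(contour, templates):
    """templates: list of (label, signal) pairs from the labelled database."""
    sig = distance_signal(contour)
    dists = [np.abs(sig - t).sum() for _, t in templates]
    return templates[int(np.argmin(dists))][0]
```

Normalising by the maximum distance makes the descriptor scale invariant, but it remains view dependent, which is drawback (b) discussed above.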

In short, the methods we presented for "smart" visual surveillance show promising results and can be used as part of a real-time surveillance system or serve as a basis for more advanced research, such as activity analysis in video.