Interest Point Based Local Image Feature Detection Methods


This paper is a brief survey of interest point based methods for feature detection and extraction from images. In video content analysis, captured video frames need to be converted into images for object detection, so the methods discussed here can also be used in video content analysis. Some of these methods were successfully implemented in our project 'Video Content Analysis in Video Surveillance System' and are discussed in brief. One may find this paper a handy reference for interest point based local image feature detection methods.

A video surveillance system involves many pattern recognition operations to identify specific features in acquired video or video frames. Frames are converted into images, reducing the complexity of video content analysis to that of image content analysis. Low-level features, like color, texture and shape, are extracted or detected from these key-frames (images) to support indexing and retrieval [1]. In pattern recognition, features are numeric or symbolic units of information constructed from sensor measurements or from mathematical models like Gabor filters [9]. This information, termed image content or image features, can represent small parts of the image (local or internal features) or the whole image itself (global features).

Global features include intensity level, histograms, and other outputs obtained by considering the whole image as input; they do not give any information about local features. Local features are integral parts of the image, such as a particular color, a particular small region, or a particular texture. The local features that can be detected or extracted are very large in number, virtually infinite, and the subset that matters varies with the perspective of the operations carried out over the image(s). For example, a digital third umpire in sports accesses local features of a sequence of images to give decisions such as a player being out, a foul, or a penalty.

Local image features are represented using descriptors, termed local feature descriptors. A descriptor stands for a process, mechanism or algorithm by which one can obtain the local features and their spatial (or other) relationships with each other and with the whole image. Thus, a local image descriptor is a numeric feature computed from an image patch, while a local image feature is a more mature representation built on top of the local image descriptor.


It is neither convenient nor feasible to process an entire image for object detection and recognition, which is why state-of-the-art object detection and recognition systems use a divide-and-conquer approach. This approach is also known as the parts-and-structure method. The system divides the image or object into smaller parts, and then defines appearance models and spatial relationships of those parts to detect or recognize the object partially or completely. The divided parts are termed segments of an image or object [9][13]. This method has the following advantages over detecting the whole object:

Describing smaller local parts of an image or object is much simpler than describing the whole image or object.

A part of the object hidden or overlapped by another object (generally termed occlusion) can be handled more easily than when considering the whole object or image.

The accuracy of operation is higher, as partitioning the object yields less critical data to handle.


The objects to be detected are part of the foreground of the image(s), which is virtually never constant (a part may change itself with viewing angle or other factors). The background is the constant part of the image, always overlapped by the foreground. The parts-and-structure method first decides the objects of interest in the foreground of training images and then finds the points of interest in those objects, for which descriptions are created. The descriptions of each point of interest help one learn a model of the object class, by which the object gets detected or recognized. A conceptual flow diagram for this is shown in Figure 2. The complexity of the process varies with how intelligently the algorithm is implemented and with the level of supervision, manual or automatic: more supervision leads to interest points of assured quality, making model learning easier, while less supervision makes model learning and detection more complex, as interest points of assured quality may become rare.

Figure 1. An example of object class detection with the "parts-and-structure" model. The same parts - tires, motor, and handles - of two motorcycles are marked with green circles, while their spatial relationships are shown with blue lines. [13]

Figure 2. Conceptual flow diagram of learning stages for object class detection with "parts and structure" model.

Supervision involves segmentation of objects as well as automatic or manual marking of interest points. Based on the supervision level, current object detection methods can be roughly classified into the following four types.

Unsupervised methods: learning object classes from a set of unlabeled images. This is practically difficult to achieve, as almost no supervision is possible here.

Semi-supervised methods: learning object classes from a set of labeled images. Some of these methods use a set of image features without a structure model; the AdaBoost method by Opelt et al. [21], Bayesian learning of image features by Carbonetto et al. [3], and Jurie and Schmid's [11] shape-based region detectors and descriptors are examples. Other methods use a structure model, for example Agrawal and Roth's [2] vocabulary of object parts used along with information on their spatial relationships, and Fei-Fei et al.'s [7] one-shot learning of object categories.

Supervised methods: learning object classes from labeled and segmented images. Examples include Harris-Laplace and SIFT [3-5] interest point detection; SIFT description uses GMM and SVM classifiers without a spatial model.

Strongly supervised methods: use labeled training images with manually selected interest points or areas. Examples include Mohan et al.'s [19] detection of humans in a sub-window by detecting the head, legs and other body parts separately using Haar wavelets and an SVM. The SIFT key detector [3-5], Gaussian mixture models, and multiresolution Gabor filters [9] use manually marked interest points.

These methods can be used to detect whether an object is present in an image without giving the object's precise location, or even a guess of it. Such applications of object detection methods are very useful in video content analysis for video surveillance systems. The term "object detection" is used for methods that detect an object's presence in an image; the term "localization" means accurately locating where in the image the object resides. Localization methods are still far from accurate due to social and technical hindrances. Social hindrances stem from the fear of privacy violations, most feared in video surveillance systems. Among technical hindrances, the most important issue is false positive results: for example, a human detection system may detect the Statue of Liberty in an image as a human. This happens due to either unintelligent software (poorly designed algorithms) or the hardware's inability to perform the intended operations with high precision. For this reason, even if a method could localize the object in addition to detecting its presence, it is commonplace to measure only whether the presence was correctly detected, not how precisely the object was localized.

The same methods, with slight modifications, can be used for object instance detection or object matching. In object instance detection the same object must be detected in different images. If an object detection method becomes too selective in object learning, it may fail under intra-class variations (the differences between objects belonging to the same class) while still capturing inter-class variations (distinguishing objects of different classes from one another). Object instance detection methods, however, must learn details specific to the object so that they can distinguish and identify the object of interest from all others in the image, image sequence, or video. This also forces object-matching methods to be highly robust to viewpoint changes. The Scale Invariant Feature Transform (SIFT) by Lowe [3-5] and Maximally Stable Extremal Regions (MSER) by Matas et al. [17] are examples of such methods.


Interest points are also known as distinguished regions [16], affine regions [19], salient regions [4-6], or regions of interest. To be worthwhile, a detection method has to be invariant to scale, rotation, noise, illumination and other affine changes, so that the same point can be found again; no such change should affect the overall efficiency of the designed method.

A. Harris Corner Detector: Harris and Stephens [10] implemented this combined corner and edge detector. Its main motivation is motion analysis from an image sequence created with a moving camera or from captured sequential frames (frames turned into images by discarding the audio and other non-visual components of the video). The detector is based on the local auto-correlation of the signal: the local auto-correlation measures how a small patch of the image changes when it is shifted slightly. The change of intensity E(u,v) of image I(x,y) for a shift (u,v) can be given as

E(u,v) = Σx,y w(x,y) [ I(x+u, y+v) − I(x,y) ]²

where w(x,y) is a windowing function such as a Gaussian. If the shift is small, the approximation

E(u,v) ≈ [u v] M [u v]ᵀ

can be used,

where M is a symmetric 2x2 matrix computed from image derivatives,

M = Σx,y w(x,y) [ Ix²   IxIy
                  IxIy  Iy² ]

and Ia denotes the derivative calculated in direction a.

The eigenvalues λ1 and λ2 of the matrix M are then solved for. If both are small, the image is flat at that point; if both are large, there is a corner; and if one is small and the other is large, there is an edge. The corner response can thus be calculated without explicit eigenvalue decomposition as

R = det(M) − k (trace(M))²

where k is an empirical constant, 0 < k < 1.0. A small |R| indicates a flat point, R > 0 a corner point, and R < 0 an edge point. The actual selected corner points are the local maxima of R, so only one point per corner is selected. The local minima yield edge points, but these are not as useful as corner points, since corner points are much more stable than edge points under small variations in an image. The Harris corner detector is invariant to rotation but not to scale and affine changes.

It is partially invariant to intensity change: too low a contrast in the corner area reduces R towards 0, so the point may get classified as flat, while too high a contrast yields ambiguous results. Lindeberg [14] and Mikolajczyk et al. [18] put forth an improved Harris detector called the Harris-Laplace detector. Harris-Laplace achieves scale invariance by computing a multi-scale representation for the Harris detector and then selecting points which have a local maximum of the normalized image derivatives across scales. A threshold on |R| is used to remove non-distinctive corner points, as they are not stable under changes. An iterative algorithm is then used for each point found to detect the scale and location of the interest point, since different scales may change the exact location. The scale of the interest point is detected by finding the maximum of the Laplacian-of-Gaussian. Mikolajczyk et al. [18] further improved the Harris detector for affine changes and termed it the Harris-Affine detector.
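The corner response above can be sketched in a few lines of plain Python. This is a minimal sketch, assuming a fixed 3x3 summation window with unit weights w(x,y) = 1 and central-difference gradients; a real implementation would use Gaussian weighting and optimized array operations.

```python
def harris_response(img, k=0.05):
    # Harris corner response R = det(M) - k * trace(M)^2, with M summed
    # over a 3x3 window around each pixel (w(x,y) = 1 for simplicity)
    h, w = len(img), len(img[0])
    Ix = [[0.0] * w for _ in range(h)]
    Iy = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # central-difference image gradients
            Ix[y][x] = (img[y][x + 1] - img[y][x - 1]) / 2.0
            Iy[y][x] = (img[y + 1][x] - img[y - 1][x]) / 2.0
    R = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            a = b = c = 0.0  # entries of M: [[a, b], [b, c]]
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    gx, gy = Ix[y + dy][x + dx], Iy[y + dy][x + dx]
                    a += gx * gx
                    b += gx * gy
                    c += gy * gy
            det, tr = a * c - b * b, a + c
            R[y][x] = det - k * tr * tr
    return R
```

On a synthetic image containing one bright quadrant, R is positive at the quadrant's corner, negative along its straight edges, and zero in flat regions, matching the classification rule above.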

B. SIFT Detector: Lowe [4-6] designed the Scale Invariant Feature Transform (SIFT), which includes both an interest point detector and a local image descriptor. The SIFT detector works in the following four main stages:

Scale-space extrema detection: Potential interest points are searched for and identified with a difference-of-Gaussians function over all scales and locations within the image(s). [4-6]

Key-point localization: A model is used to determine the location and scale of each interest point. Interest points found to be unstable are deleted or simply not considered further. Manual supervision is generally used to initiate the process. [4-6]

Orientation assignment: After key-point localization, local image gradients are used to assign one or more orientations to each key-point. [4-6]

Key-point descriptor: Finally, the descriptor(s) for the key-point(s) are created. These can be shown with a cross, circle, or any small identification mark.

To detect interest points, a continuous scale-space function L(x,y,σ) is used. The function L(x,y,σ) is the convolution of a variable-scale Gaussian G(x,y,σ) with the input image I(x,y):

L(x,y,σ) = G(x,y,σ) * I(x,y)

The function G(x,y,σ) can be stated as:

G(x,y,σ) = (1 / (2πσ²)) exp( −(x² + y²) / (2σ²) )

Lowe [4-6] proposed to use the extrema of the difference-of-Gaussians function as interest points, which can be detected efficiently. This difference of two Gaussians at nearby scales, separated by a constant factor k, is defined as:

D(x,y,σ) = ( G(x,y,kσ) − G(x,y,σ) ) * I(x,y)

where L(x,y,σ) is the image smoothed by a Gaussian, so the equation above can be simply rewritten as:

D(x,y,σ) = L(x,y,kσ) − L(x,y,σ)

The SIFT detector works as follows. First, the image is smoothed by Gaussians separated by a constant factor k in scale space. From these images L(x,y,σ), which form the image stack, adjacent images are subtracted from each other to produce difference-of-Gaussians images D(x,y,σ). Each octave, a doubling of σ, is handled separately, and the image is downscaled for every octave to save computation time. The local extrema of the stack of difference-of-Gaussians images are used to locate interest points: a point is an interest point if it is the smallest or largest of the 3x3x3 block of pixels surrounding it at the same scale level and at the scale levels above and below. By fitting a 3D quadratic function to the local image points, the exact location of the interest point is measured. This calculation also reveals interest points in low-contrast areas, which are not useful and hence are discarded. The difference-of-Gaussians has a strong response at edges, but such points are not stable interest points, so, much as in the Harris detector [10], the principal curvature is estimated for each interest point from the eigenvalues of its Hessian matrix. The interest point is accepted only if the ratio between the eigenvalues is small enough, and the actual calculation of the eigenvalues is thus avoided.
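The scale-space construction can be illustrated in one dimension. This is an illustrative sketch only, not Lowe's implementation: the 1-D signal, the discrete Gaussian kernels truncated at about 3σ, border clamping, and k = 1.6 are all assumptions made here for brevity.

```python
import math

def gaussian_kernel(sigma):
    # normalized 1-D Gaussian sampled out to roughly 3 sigma
    r = max(1, int(3 * sigma))
    k = [math.exp(-(i * i) / (2.0 * sigma * sigma)) for i in range(-r, r + 1)]
    s = sum(k)
    return [v / s for v in k]

def smooth(signal, sigma):
    # convolution with border clamping
    ker = gaussian_kernel(sigma)
    r = len(ker) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for j, wgt in enumerate(ker):
            idx = min(max(i + j - r, 0), n - 1)
            acc += wgt * signal[idx]
        out.append(acc)
    return out

def dog_stack(signal, sigma0=1.0, k=1.6, levels=4):
    # smoothed stack L(sigma0 * k^i), then D = L(k*sigma) - L(sigma)
    L = [smooth(signal, sigma0 * k ** i) for i in range(levels)]
    return [[hi - lo for hi, lo in zip(L[i + 1], L[i])]
            for i in range(levels - 1)]
```

An isolated bright point produces a scale-space extremum of D at its location, which is the kind of structure the 3x3x3 neighborhood test above picks out in two dimensions.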

A key-point's orientation is determined by computing an orientation histogram over the interest point's neighborhood. The highest peak of the histogram, and any other peaks higher than 80% of the highest peak, are used to determine the orientation of the interest point. Thus one interest point can have many orientations (and is split into several interest points) if there are several dominant directions (histogram peaks) in the orientation histogram.

C. Entropy Based Detector: Kadir et al. [12] developed an interest point detector based on an information-theoretic approach: the entropy of local image regions. The method accommodates intra-class variations. It works as follows.

First, calculate the entropy of the gray-level (or color) histogram of local image areas at several scales (circles of varying sizes are used for this purpose). A flat image area generally has a histogram with low entropy and one strong peak; if there is more variation in the image area, the histogram may have several peaks.

Second, select the scales (circles) at which the entropy peaks.

Third, use an inter-scale unpredictability measure to weight the entropy values. Image areas where a specific scale has a strong peak are weighted higher than areas where the peak is weak compared to nearby scales. If the area is noisy, there may not be one specific scale with a strong peak.

For instance, given an image with a bright circle (or any other figure) on a black background, if the detector's area contains some black area around the bright figure such that there are approximately equal numbers of white and black pixels, its entropy will be high. Entropy is small if the area of interest lies entirely within the figure, and it becomes smaller still as the black pixels start to dominate. Entropy based detection is naturally invariant to rotation, translation and small affine transforms if the detection window is circular. With any other shape of detection window, an entropy based detector becomes more complex to handle and may not give good results if not designed with proper care.
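The first step, histogram entropy over a circular window, can be sketched as follows (a minimal sketch: the circular mask test and base-2 entropy are illustrative choices, and a real implementation would bin gray levels rather than use raw values):

```python
import math
from collections import Counter

def patch_entropy(img, cx, cy, r):
    # Shannon entropy of the gray-level histogram of the pixels inside
    # a circle of radius r centered at (cx, cy)
    vals = []
    for y in range(len(img)):
        for x in range(len(img[0])):
            if (x - cx) ** 2 + (y - cy) ** 2 <= r * r:
                vals.append(img[y][x])
    n = len(vals)
    counts = Counter(vals)
    return -sum(c / n * math.log2(c / n) for c in counts.values())
```

A window lying entirely inside a flat region has zero entropy, while a window straddling the boundary between bright and dark regions has positive entropy, as in the bright-figure example above.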

D. Maximally Stable Extremal Regions (MSER) Detector: Matas et al. [17] introduced a method based on thresholding, where an extremal region is a connected area in a thresholded image, and named it MSER. In this method, regions are found by thresholding the image with all possible thresholds. Maximally stable extremal regions are extremal regions which virtually do not change (the change is very small and negligible) as the threshold changes. Regions found using MSER are invariant to all adjacency-preserving transformations (such as scale, rotation, and affine transforms) if the stable region lies on a planar object. MSER is invariant to shifts in image intensity but not to contrast changes.
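The stability criterion can be sketched as below. This is a deliberately simplified, illustrative version: it tracks only the thresholded region containing a single seed pixel via flood fill, rather than the full component tree that Matas et al. maintain.

```python
def region_size(img, seed, t):
    # flood-fill size of the region >= t containing the seed (4-connectivity)
    h, w = len(img), len(img[0])
    sy, sx = seed
    if img[sy][sx] < t:
        return 0
    seen = {(sy, sx)}
    stack = [(sy, sx)]
    while stack:
        y, x = stack.pop()
        for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
            if 0 <= ny < h and 0 <= nx < w and (ny, nx) not in seen \
                    and img[ny][nx] >= t:
                seen.add((ny, nx))
                stack.append((ny, nx))
    return len(seen)

def stability(img, seed, thresholds, delta=1):
    # relative area change across neighboring thresholds; small = stable
    out = {}
    for i in range(delta, len(thresholds) - delta):
        a = region_size(img, seed, thresholds[i])
        if a:
            a_lo = region_size(img, seed, thresholds[i - delta])
            a_hi = region_size(img, seed, thresholds[i + delta])
            out[thresholds[i]] = (a_lo - a_hi) / a
    return out
```

A sharply delimited bright blob keeps the same area over a wide range of thresholds, so its stability measure is zero there; regions with soft boundaries change area quickly and score worse.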

Different Image Descriptors

Image descriptors are very important players in object detection; using them with detectors adds intelligence and efficiency to object detection and identification algorithms. When descriptors are used with interest point detectors, the descriptors need not be scale or rotation invariant, as the image patch can be normalized before creating the local description. Descriptors used in object detection should not be sensitive to small variations, as they could not then reliably represent an object class.

A. Local Image Description by Pixel Values: This is a very simple and straightforward method of image description, where a part of the image (the area of interest) around the point of interest is taken and the gray-level pixel values of that part are used directly as the descriptor. For any transformation of the image, the corresponding change has to be applied to this descriptor using the same transform. Two major problems with this descriptor are, first, its high dimensionality (for example, a 20x20 area gives a descriptor of length 400, i.e., the 20x20 matrix itself) and, second, its poor invariance to small changes in the image: a small amount of noise is sufficient to produce a false positive. These problems can be mitigated using Principal Component Analysis (PCA), which reduces the dimensionality and improves robustness to small perturbations of the images. [8]
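A minimal sketch of this descriptor and a sum-of-squared-differences (SSD) comparison follows; the square patch of radius r and the SSD distance are illustrative choices (before the PCA step mentioned above):

```python
def patch_descriptor(img, cx, cy, r):
    # flatten the (2r+1) x (2r+1) patch around (cx, cy) into one vector
    return [img[y][x]
            for y in range(cy - r, cy + r + 1)
            for x in range(cx - r, cx + r + 1)]

def ssd(d1, d2):
    # sum of squared differences between two descriptors of equal length
    return sum((a - b) ** 2 for a, b in zip(d1, d2))
```

Even this toy version shows the dimensionality problem: a radius-1 patch already yields a 9-dimensional vector, and a 20x20 patch a 400-dimensional one.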

B. SIFT Descriptor:

SIFT [4-6] includes a local image descriptor based on local image gradients. The SIFT descriptor is scale and rotation invariant. Fig. 3 shows an example of a SIFT descriptor for an 8x8 grid of point gradients. Gradient magnitudes are weighted by a Gaussian so that they become gradually smaller as the distance to the center point increases; this weighting avoids large changes in the descriptor. The weighted gradients are then divided into 4x4 sub-regions, interpolated into 8 primary directions, and summed up. Thus, the SIFT descriptor is the vector of directional gradient sums from all sub-regions. PCA-SIFT [4-6] is a variant of SIFT combining the advantages of PCA and SIFT.
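The building block of each sub-region is an orientation histogram of gradient magnitudes. A minimal sketch, omitting the Gaussian weighting and the interpolation between bins that SIFT applies:

```python
import math

def orientation_histogram(gx, gy, bins=8):
    # gx, gy: gradient components of the pixels in one sub-region;
    # each pixel votes for its gradient direction, weighted by magnitude
    hist = [0.0] * bins
    for x, y in zip(gx, gy):
        mag = math.hypot(x, y)
        ang = math.atan2(y, x) % (2 * math.pi)
        hist[int(ang / (2 * math.pi) * bins) % bins] += mag
    return hist
```

Concatenating such 8-bin histograms from the 4x4 sub-regions yields the 128-dimensional SIFT descriptor vector.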

C. Local Binary Patterns (LBP):

The LBP [20] feature is calculated by comparing the value of a center pixel to the other pixels in a 3x3 area; the resulting binary number is the output of the LBP operator. A 256-bin histogram of LBP values computed over a larger area can be used as a texture descriptor. The LBP operator can also operate on other neighborhoods: LBP(P,R) refers to the LBP operator which considers P neighbors at distance R and produces 2^P output values (the length of the histogram). Uniform patterns in images contain more information than non-uniform ones, and separating the non-uniform patterns from the uniform ones and bundling them together controls the effect of P on the histogram length. Uniform patterns contain only a limited number of bitwise transitions from 0 to 1 or vice versa; the uniform LBP operator, which bundles all patterns with more than two transitions into a single bin, is denoted LBP(P,R) with superscript u2.

LBP is a texture descriptor, and LBP histograms give no information about how the texture changes spatially. LBP features are of no direct use as a local image descriptor: used that way, they create very long feature vectors, making the classifier inefficient.


Fig. 3. An example of SIFT local descriptor creation of length 32 (8 primary directions for 4 sub-regions): (a) image gradients, (b) weighted sum of gradients, (c) complete SIFT descriptor. [4]

Fig. 4. An example of LBP calculation. Pixels surrounding the central pixel are thresholded based on the value of the central pixel, and a binary feature is formed.

D. Steerable Pyramid:

A steerable pyramid is a linear decomposition of an image into scale and orientation sub-bands; it is jointly shiftable in both orientation and scale. The decomposition transform can be formed by translations, dilations and rotations of a single filter, and the transform is constructed as a recursive pyramid. The basis functions are directional derivative operators; the order of the derivative defines the number of orientations, with Nth-order derivatives having N+1 orientations. The pyramid is formed by convolving the input signal with a set of oriented band-pass and low-pass kernels.
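The steerability property itself is simple to state in code for the first-order case: the derivative response at any angle is a linear combination of the x- and y-derivative basis responses. This is a minimal sketch of the steering equation only, not of the full pyramid construction.

```python
import math

def steer(gx, gy, theta):
    # first-order steerability: the directional-derivative response at
    # angle theta is cos(theta) * Gx + sin(theta) * Gy, where Gx and Gy
    # are the responses of the two basis (x- and y-derivative) filters
    return math.cos(theta) * gx + math.sin(theta) * gy
```

Thus only the two basis responses need to be computed and stored; responses at all other orientations are synthesized from them, which is what makes the decomposition "steerable."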

Fig. 6. Third-order (4 orientations) steerable filters: (a) spatial domain, (b) frequency domain.

Aliasing is avoided by not sub-sampling the band-pass portion, while the low-pass portion is sub-sampled by a factor of 2. The low-pass portion is used to compute the next level of the pyramid, and each band-pass sub-band is stored so that the image (or original signal) can be reconstructed. Fig. 7 shows an example of image decomposition using steerable pyramids. This method is used in object detection and in text detection and recognition. [1]

Fig. 7. Example of image decomposition using steerable pyramids: (a) original image, (b) pyramid-level decomposition with 3rd-order steerable filters (4 directions), (c) high-pass residual sub-band.

Object Detection Methods

Here, an object is any part of an image, or any subpart of the content of an image.

A. Feature based affine invariant detection and localization of faces: This method uses a separate local image feature detection phase to detect and localize facial parts, and another phase to combine them into a complete face localization using a constellation model. During detection, K possible candidates for each facial feature are detected (K varies with the type of feature selector used; usually K > 200 is sufficient for correct detection). A constellation model is used to select which of the found candidates form a face, by comparing the found candidates with a face model formed from the training set. If the required transformation was not seen in the training set, the found parts are probably false positives and do not belong to a face.

The method can be modified to detect objects other than faces by forming different models for different objects from a training dataset. An appearance model is used to verify whether a real face (or object of interest) has been found. The appearance model works at the image level and generally uses template matching: an image patch is extracted and classified with the help of a support vector machine (SVM). The classifier is trained from training data in which patches are manually marked, and bootstrapping (or a related technique) is used to generate negative examples.

B. Distinctive Image Features from Scale-Invariant Keypoints: The SIFT interest point detector and SIFT descriptors, or similar methods, can be used for object recognition. For each keypoint, the closest match (smallest Euclidean distance) in a database is searched for. Many of the found interest points arise from cluttered background or unknown objects (noise), so there may not always be a correct match. For correctness, a carefully chosen threshold on the ratio between the closest match and the second closest match in the database needs to be used. In spite of many false positive matches, the method can recognize the trained objects in highly varying poses and when they are heavily occluded, and several different objects can be detected at the same time.
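The nearest-neighbor matching with a distance-ratio threshold can be sketched as follows. The 0.8 default ratio is Lowe's commonly cited value, and the brute-force search here is illustrative; real systems use approximate nearest-neighbor structures over databases with many entries.

```python
def match_ratio_test(desc, database, threshold=0.8):
    # accept the nearest neighbor only if it is clearly closer than the
    # second nearest; otherwise the match is ambiguous and rejected
    # (database must contain at least two descriptors)
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(desc, d)) ** 0.5, i)
        for i, d in enumerate(database)
    )
    best, second = dists[0], dists[1]
    if best[0] < threshold * second[0]:
        return best[1]  # index of the matched database descriptor
    return None  # ambiguous: likely background clutter or noise
```

Descriptors arising from background clutter tend to have several database entries at similar distances, so the ratio test rejects them, which is how the many false candidate points mentioned above are filtered out.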

C. Object Class Recognition by Unsupervised Scale-Invariant Learning: The method is not completely unsupervised, as it assumes that every training set image contains an instance of the object class; i.e., when training a detector for motorcycles, all training set images must contain a motorcycle, and when training a detector for footballs (or eyes, or digits, or any other object), all training set images must contain a football (or the object of concern). The object model consists of parts, where for each part the appearance, relative scale and mutual position with respect to the other parts are known. Some parts may be occluded, and the parts are modeled with probability density functions. During learning, interest points and their scales are first searched for; from the appearance, scale, and mutual positions, a model is learned so that it gives the maximum likelihood description. Recognition is done by detecting interest points and their scales in query images and evaluating the found regions using a Bayesian belief network [16]. First, N interest points are found, with locations X, scales S and appearances A. The decision is based on the likelihood ratio for object presence, modeled as

R = p(Object | X, S, A) / p(No object | X, S, A)

where the likelihoods are obtained by summing over all hypotheses h.

Here h is a hypothesis vector of length P which enumerates which of the detected N interest points belong to which object part. H represents all valid allocations of features to parts, which is of order O(N^P). The complexity of the approach results in a low number of detected interest points and an even lower number of detected actual object parts.

D. Rapid Object Detection Using a Boosted Cascade of Simple Features:

The Viola and Jones [22] method uses simple features based on integral images, which are extremely efficient to compute. These features are then combined by an AdaBoost classifier to create an efficient object detector, and the classifiers are used in a cascade. The method is supervised and requires training with segmented images of the training class (images of the objects to be distinguished from background) and background images (images not containing any object of concern). Detection proceeds in a windowed fashion: the image is divided into patches and the detector is applied separately to each patch. The method uses simple rectangular features (one could define other features in place of rectangular ones). The value of a feature is computed by subtracting the sum of pixel values in the white parts (high luminance) of the filter from the sum of pixel values in the gray parts (low luminance). The number of features varies with the base size M x N used for the detector; for example, with a base size of 24 x 24, the complete set of features numbers over one hundred thousand. Thus, efficient computation of feature values is important, and is achieved using an intermediate representation of the image, the integral image.
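The cascade idea, rejecting most windows with the cheapest stages first, can be sketched generically as below; the stage functions and thresholds in the usage example are placeholders, not trained AdaBoost classifiers.

```python
def cascade_classify(window, stages):
    # stages: (score_fn, threshold) pairs, cheapest first; a window is
    # rejected as soon as any stage's score falls below its threshold
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False  # early rejection: most windows exit here
    return True  # survived every stage: candidate detection
```

Because the vast majority of scanned windows contain background, early rejection by the first few cheap stages is what makes the overall detector fast.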

Fig. 8. Four rectangular features, computed such that the sum of pixel values in the white parts of the rectangle is subtracted from the sum of pixel values in the gray parts.

The integral image contains, at each pixel, the sum of the pixel values above and to the left of that pixel in the original image. It is computed as

Ii(x,y) = Σ_{x'≤x, y'≤y} I(x', y')

where Ii(x,y) is the integral image and I(x,y) is the original image. The integral image can be computed in one pass using the recurrences

s(x,y) = s(x, y−1) + I(x,y)
Ii(x,y) = Ii(x−1, y) + s(x,y)

where s(x,y) is the cumulative row sum and values at negative indexes equal zero. With the integral image, the sum over any rectangle, and hence any single feature value, is computed in constant time; it is the sheer size of the feature set that makes exhaustive evaluation slow.
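The recurrences above and the constant-time rectangle sum can be sketched as follows. The two-rectangle feature at the end is an illustrative composition of rectangle sums, not one of the trained Viola-Jones features.

```python
def integral_image(img):
    # ii[y][x] holds the sum of img over all pixels above and to the
    # left (inclusive), built in one pass with a cumulative row sum s
    h, w = len(img), len(img[0])
    ii = [[0.0] * w for _ in range(h)]
    for y in range(h):
        s = 0.0  # cumulative row sum s(x, y)
        for x in range(w):
            s += img[y][x]
            ii[y][x] = s + (ii[y - 1][x] if y > 0 else 0.0)
    return ii

def rect_sum(ii, x0, y0, x1, y1):
    # sum over the inclusive rectangle [x0..x1] x [y0..y1]: four lookups
    a = ii[y1][x1]
    b = ii[y0 - 1][x1] if y0 > 0 else 0.0
    c = ii[y1][x0 - 1] if x0 > 0 else 0.0
    d = ii[y0 - 1][x0 - 1] if y0 > 0 and x0 > 0 else 0.0
    return a - b - c + d

def two_rect_feature(ii, x, y, w, h):
    # right (gray) half minus left (white) half of a 2w x h window
    return rect_sum(ii, x + w, y, x + 2 * w - 1, y + h - 1) - \
           rect_sum(ii, x, y, x + w - 1, y + h - 1)
```

Once the integral image is built, every rectangle sum costs exactly four array lookups regardless of the rectangle's size, which is what makes evaluating rectangular features at many positions and scales tractable.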

Conclusion And Summary

This paper discussed interest point based object detection and local image feature detection methods in brief, with their mathematical models and/or logical explanations. Some of the discussed methods were implemented as part of the authors' project and found satisfactory for image feature detection, with a detection rate higher than 70% in occluded environments. An accurate comparative study of these methods would need to check them against at least 100 or more sample inputs, which is a genuinely difficult task.