Features Used For Representation Of Video Frame Computer Science Essay


In the literature different techniques have been proposed for shot boundary detection and video retrieval. This chapter reviews some of the existing techniques for SBD and CBVR. Some research issues arising out of the review of existing methods are identified, and addressed in this thesis.

The chapter is organized as follows: Section 2.1 explains the features used for the representation of a video frame. A review of various shot boundary detection techniques previously proposed is presented in Section 2.2. The various techniques available for video retrieval are discussed in Section 2.3. In Section 2.4, the objectives of SBD and CBVR are presented.


Almost all shot change detection algorithms reduce the large dimensionality of the video domain by extracting a small number of features, either from the entire frame or from small regions of the frame known as Regions Of Interest (ROIs) (KrishnaMohan 2007). Such features include the following:

1) Luminance/Color: The simplest feature that can be used to characterize a ROI is its average grayscale luminance. This, however, is susceptible to changes in illumination. A better choice is to use some statistics of the values in some color space.

2) Luminance/Color histogram: A richer feature for a ROI is the grayscale or color histogram. Its advantages are that it is discriminative, easy to compute, and mostly insensitive to translational, rotational, and zooming camera motions. For these reasons, it is widely used, as in Zhang and Kankanhalli (1993) and Cernekova et al (2003). However, it does not represent the spatial distribution of color in the ROI.

3) Image edges: Edge information of the ROI is one of the better choices of features, as discussed by Nam et al (2005) and Zabih et al (1999). The advantage of this feature is that it is invariant to illumination changes and motion, and it is related to the human visual perception of the image. Its main disadvantages are computational cost, noise sensitivity and high dimensionality.

4) Features in the transform domain: DCT, DFT and wavelet coefficients can also be used to characterize the region or image information. The problem with these features is that they are not invariant to camera motion.

Other features, such as the color anglogram and motion, are also used. Motion can be combined with other features for better results.
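The histogram feature described above can be made concrete with a minimal sketch; the helper names here are hypothetical, and grayscale frames are assumed to be 2-D lists of 0-255 intensity values:

```python
def gray_histogram(frame, bins=64):
    """Normalized gray-level histogram of a frame (2-D list of 0-255 values)."""
    hist = [0] * bins
    total = 0
    for row in frame:
        for pixel in row:
            hist[pixel * bins // 256] += 1  # map intensity to a bin
            total += 1
    return [count / total for count in hist]

def histogram_distance(frame_a, frame_b, bins=64):
    """Bin-wise L1 distance between the histograms of two frames."""
    ha, hb = gray_histogram(frame_a, bins), gray_histogram(frame_b, bins)
    return sum(abs(a - b) for a, b in zip(ha, hb))
```

Because only the global distribution of intensities is compared, this measure is largely unaffected by translation or rotation of objects within the frame, which is exactly the insensitivity noted above.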


The size of the region from which individual features are extracted plays an important role in the overall performance of the algorithms of shot change detection. A small region tends to reduce detection invariance with respect to motion, while a large region tends to miss transitions between similar shots. The possible choices of size of the regions are listed here:

1) Single pixel in a frame: Some algorithms extract features, such as luminance and edge strength, from each and every pixel in the frame (Nam et al 2005). This process results in a very large feature vector and is very sensitive to motion.

2) Rectangular block: Another method is to segment each frame into equal-sized blocks and extract a set of features (e.g., mean intensity) from each block (Lelescu et al 2003) and (Hanjalic et al 2002). This method is more robust to object motion than pixel-wise comparison, and by computing the block motion it is possible to enhance the motion invariance further.

3) Arbitrarily shaped region: Feature extraction can also be applied to arbitrarily shaped and sized regions in a frame, derived by spatial segmentation algorithms. This enables the derivation of features based on the most homogeneous regions, thus facilitating a better detection of temporal discontinuities. The main disadvantages are the high computational complexity and the instability of the region segmentation algorithms involved.

4) Whole frame: The algorithms that extract features (e.g., edges, histograms) from the whole frame (Cernekova et al 2003), (Yu et al 2001) and Lienhart (2001) have the advantage of being robust to motion, but tend to perform poorly at detecting the change between two similar shots.
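The rectangular-block option above can be sketched as follows (hypothetical helper, assuming frame dimensions that are multiples of the block size); the per-block means form a compact feature vector that preserves coarse spatial layout:

```python
def block_means(frame, block_size=2):
    """Mean intensity of each non-overlapping block_size x block_size block.
    frame: 2-D list whose dimensions are multiples of block_size."""
    h, w = len(frame), len(frame[0])
    means = []
    for by in range(0, h, block_size):        # block rows, top to bottom
        for bx in range(0, w, block_size):    # block columns, left to right
            block = [frame[y][x]
                     for y in range(by, by + block_size)
                     for x in range(bx, bx + block_size)]
            means.append(sum(block) / len(block))
    return means
```

Comparing such vectors between frames is cheaper than pixel-wise comparison and less sensitive to small object motion, since movement inside a block leaves its mean largely unchanged.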


Another important aspect of shot boundary detection algorithms is the temporal window that is used to perform shot change detection. In general, the objective is to select a temporal window that contains a representative amount of video activity. The following cases are typically used:

1) Two frames: The simplest way to detect a discontinuity is to look for a high value of the discontinuity metric between two successive frames (Yu et al 2001), Hanjalic (2002), (Li et al 2002) and (Li et al 2003). However, such an approach can fail to discriminate between shot transitions and changes within a shot when there is significant variation in activity among different parts of the video, or when certain shots contain events that cause brief discontinuities (e.g., photographic flashes). It also has difficulty in detecting gradual transitions.

2) N-frame window: A common technique to overcome the above problems is to detect the discontinuity by using the features of all frames within a temporal window, either by computing a dynamic threshold against which a frame-by-frame discontinuity metric is compared, or by computing the discontinuity metric directly on the window, as discussed by Hanjalic (2002), (Nam and Tewfik 2005), Lienhart (2001) and (Boccignone et al 2005).

3) Entire current shot: Another method for detecting a shot boundary is to compute one or more statistics for the entire shot and to check whether the next frame is consistent with them. The problem with such approaches is that statistics computed for an entire shot may not be representative of its end, because of the great variation within shots.

4) Entire video: The characteristics of the whole video can also be taken for SBD. The problem with this approach is again the great variability within and between shots.
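The N-frame window idea can be sketched with a dynamic threshold: a frame-to-frame discontinuity value is flagged only when it dominates its temporal window. This is an illustrative sketch, not any particular cited algorithm; the names and the mean-plus-k-standard-deviations rule are assumptions:

```python
def window_boundaries(distances, window=5, k=3.0):
    """Flag index i as a boundary when distances[i] exceeds the mean plus
    k standard deviations of the other values in its temporal window.
    distances[i] is the discontinuity metric between frames i and i+1."""
    boundaries = []
    half = window // 2
    for i, d in enumerate(distances):
        lo, hi = max(0, i - half), min(len(distances), i + half + 1)
        others = [distances[j] for j in range(lo, hi) if j != i]
        mean = sum(others) / len(others)
        var = sum((x - mean) ** 2 for x in others) / len(others)
        # require the value to be both statistically and locally dominant
        if d > mean + k * var ** 0.5 and d == max(distances[lo:hi]):
            boundaries.append(i)
    return boundaries
```

Unlike a single static threshold, this rule adapts to the activity level of each part of the video, which is the motivation for windowed detection given above.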


Having defined a feature (or a set of features) computed from each frame and a similarity metric, a shot change detection algorithm needs to detect where they exhibit discontinuity. This can be done in the following ways:

1) Static thresholding: A constant threshold is used to compare against the computed discontinuity value of adjacent frames. This performs well only if the video content exhibits similar characteristics over time, and the threshold has to be tuned for each video.

2) Adaptive thresholding: The threshold is varied depending on the average discontinuity within a temporal window (Yu and Srinath 2001) and (Boccignone et al 2005).

3) Probabilistic detection: For a given type of shot transition, the probability density function of the similarity/dissimilarity metric is estimated a priori, using several examples of that type of transition. Optimal shot change estimation is then performed.

4) Trained classifier: Another method for detecting shot changes is to formulate the problem as a classification task with two classes, namely "shot change" and "no shot change". The classifier is trained to differentiate the two classes.

5) User Interaction: If automatic procedures fail, cut detection in ambiguous cases can be resolved by user input.
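The trained-classifier formulation above can be illustrated with a minimal nearest-centroid rule over hypothetical discontinuity feature vectors; real systems use more capable classifiers such as SVMs, so this is only a sketch of the two-class setup:

```python
def train_centroids(features, labels):
    """Nearest-centroid 'trained classifier' for shot-change detection:
    average the feature vectors of each class ('change' / 'no change')."""
    grouped = {}
    for vec, lab in zip(features, labels):
        grouped.setdefault(lab, []).append(vec)
    return {lab: [sum(col) / len(col) for col in zip(*vecs)]
            for lab, vecs in grouped.items()}

def classify(centroids, feature):
    """Assign the label whose centroid is closest in squared Euclidean distance."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda lab: sqdist(centroids[lab], feature))
```

Training amounts to averaging labelled examples of boundary and non-boundary frames; classification of a new frame then requires no hand-set threshold, which is the main appeal of this family of methods.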


Previous SBD algorithms vary from simple pixel-based techniques to complex methodologies, as surveyed in Koprinska and Carrato (2001) and Geetha and Vasumathi Narayanan (2008).

Pixel comparison is the most basic SBD method, as described by Boreczky and Rowe (1996). The simplest way to compute the dissimilarity between two frames is to count the number of pixels that change in value by more than some threshold; this total is compared against a second threshold to determine whether there is a shot boundary. A problem with this approach is the sensitivity of the discontinuity values to camera and object motion. To reduce the influence of motion, a modification of this technique is presented by Zhang and Kankanhalli (1993), where a 3x3 averaging filter is applied to the frames before performing the pixel comparison. Although the method is somewhat slow, selecting a threshold tailored to the input sequence yields good results; however, manually adjusting the threshold is difficult.
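The pixel comparison with a 3x3 averaging prefilter can be sketched as follows (an illustrative Python version with hypothetical helper names; frames are 2-D lists of intensities, and the border handling is an assumption):

```python
def smooth3x3(frame):
    """3x3 averaging filter (border pixels average the available neighbours)."""
    h, w = len(frame), len(frame[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [frame[j][i]
                    for j in range(max(0, y - 1), min(h, y + 2))
                    for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = sum(vals) / len(vals)
    return out

def pixel_change_fraction(frame_a, frame_b, pixel_threshold=10):
    """Fraction of pixels whose smoothed values differ by more than
    pixel_threshold; this value is then compared against a second,
    global threshold to declare a shot boundary."""
    a, b = smooth3x3(frame_a), smooth3x3(frame_b)
    changed = sum(1 for ra, rb in zip(a, b)
                  for pa, pb in zip(ra, rb) if abs(pa - pb) > pixel_threshold)
    return changed / (len(frame_a) * len(frame_a[0]))
```

The prefilter blurs away small spatial displacements, so slight camera or object motion changes far fewer smoothed pixels than a genuine cut does.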

Ueda et al (1991) proposed a histogram-based SBD method; histograms are a popular alternative to pixel-based approaches. Consecutive frames within a shot, containing similar global visual material, show little difference in their histograms compared to frames on either side of a shot boundary. In this method the gray-level or color histograms of two frames are computed, and if the bin-wise difference between the two histograms is above a threshold, a shot boundary is declared. Although frames with completely different visual contents can still have similar histograms, the probability of such a case is small.

Kasturi and Jain (1991) have used a statistical method, expanding the idea of pixel differences by breaking the images into regions and comparing statistical measures of the pixels in those regions, for example a measure based on the mean and standard deviation of the gray levels in each region. This method is insensitive to noise, but because statistical formulae are involved in the computation it is slow. It also generates many false positives, that is, detected changes not caused by a shot boundary.

O'Toole et al (1999) have presented a detailed histogram-based shot cut detector. It was found that shot boundary detection using a fixed similarity threshold is difficult in broadcast video, which led to an adaptive threshold for handling this large variation in characteristics. The technique compares successive frames based on three 64-bin histograms, one of luminance and two of chrominance. The three histograms are combined to form a single N-dimensional vector, where N is the total number of bins in all the histograms. The histograms of adjacent frames are then compared using the cosine measure; a low cosine value indicates dissimilarity and hence a possible cut.
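The cosine measure used to compare the combined histogram vectors of adjacent frames can be sketched as follows (hypothetical helper name; histograms are plain lists of bin counts):

```python
import math

def cosine_similarity(hist_a, hist_b):
    """Cosine of the angle between two histogram vectors; values near 1
    indicate similar frames, while low values suggest a possible cut."""
    dot = sum(a * b for a, b in zip(hist_a, hist_b))
    norm = (math.sqrt(sum(a * a for a in hist_a))
            * math.sqrt(sum(b * b for b in hist_b)))
    return dot / norm if norm else 0.0
```

Because the measure is normalized by the vector magnitudes, it is insensitive to overall brightness scaling of the histograms and depends only on their shape.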

Zabih et al (1995) have aligned consecutive frames to reduce the effects of camera motion and compared the number and position of edges in the edge-detected images. They compute the percentage of edges that enter and exit between the two frames and look for large edge change percentages. Dissolves and fades are identified by looking at the relative values of the entering and exiting edge percentages. This method is more accurate at detecting cuts than histograms and much less sensitive to motion than chromatic scaling.

Ueda et al (1991) and Zhang et al (1993) have used motion vectors determined from block matching to detect whether or not a shot is a zoom or a pan. The motion vectors extracted as part of the region-based pixel difference computation have been used to decide whether a large amount of camera or object motion is present in a shot. Because shots with camera motion can be incorrectly classified as gradual transitions, detecting zooms and pans increases the accuracy of a shot boundary detection algorithm.

Bing Han et al (2005) have described a technique for video shot boundary detection using rough fuzzy sets. Twelve candidate features, classified into five types, are used. The first type is the RGB color model, from which the changes of the three color components during a shot transition can be measured; the second is the HSV model, whose components capture the changes of hue, saturation and value between adjacent frames. The mean of every component of each frame in the RGB or HSV model is computed. The histogram features are categorized into two types, gray histogram and color histogram, which form the third and fourth feature types. Finally, statistical features form the fifth type: the mean, variance and skewness of the lightness component of each frame are computed. The method handles cuts, fades and dissolves, as well as zooms, pans and other camera and object motions. Two types of false detections occur in videos: one results from irregular camera operations during gradual transitions, the other from frequent flash effects within a shot. Missed detections are mainly due to small content changes between the frame pairs at some shot boundaries. The experiments show that false detections in the coarse detection stage affect the subsequent feature extraction and shot boundary detection, whereas missed detections have less effect, because the rough-fuzzy operator weakens the mistakes made in the coarse detection stage. The dissimilarity function for shot boundary detection is derived from the rough fuzzy set by weighting the important features according to their proportion in the whole feature set, making it better suited to a variety of videos. The results show that the proposed method is effective, reducing the data dimensionality while largely preserving the information of the original video.

Wujie Zheng et al (2005) have first applied a second-order difference method to obtain candidate cuts, followed by a post-processing procedure to eliminate false positives in cut detection. The twin-comparison approach is employed to detect short gradual transitions lasting fewer than six frames, while for long gradual transitions an improved twin-comparison algorithm is designed; a monochrome frame detector is proposed for FOI detection. Three limitations are mentioned in this work:

1) The global thresholds used are heuristically determined.

2) Only a simple flash detector is used.

3) Only a simple motion feature is used.

Cernekova et al (2003) have detected shot boundaries in video sequences using Singular Value Decomposition (SVD). The method relies on performing singular value decomposition on a matrix A created from 3D histograms of single frames. The SVD is noted for its capability to derive a low-dimensional refined feature space from a high-dimensional raw feature space, in which pattern similarity can easily be detected. The method can detect cuts and gradual transitions, such as dissolves and fades, which cannot be detected easily by entropy measures.
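A toy version of this SVD projection can be sketched with NumPy. Here the rows of the matrix are simplified one-dimensional frame histograms (Cernekova et al use 3-D color histograms); frames from the same shot end up close together in the low-dimensional space:

```python
import numpy as np

# Each row of A is a frame histogram (simplified to 8 bins); the first
# three frames belong to one shot, the last three to another.
A = np.array([
    [.40, .30, .20, .10, 0,   0,   0,   0],
    [.38, .32, .20, .10, 0,   0,   0,   0],
    [.41, .29, .19, .11, 0,   0,   0,   0],
    [0,   0,   0,   .10, .20, .30, .40, 0],
    [0,   0,   0,   .11, .19, .31, .39, 0],
    [0,   0,   0,   .09, .21, .29, .41, 0],
])
U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                      # dimension of the refined feature space
frames = U[:, :k] * S[:k]  # each row: one frame in the k-dim SVD space

within = np.linalg.norm(frames[0] - frames[1])   # same shot: small
across = np.linalg.norm(frames[2] - frames[3])   # across the cut: large
```

A discontinuity in this low-dimensional trajectory marks a candidate boundary; the rank-k projection discards histogram noise while keeping the dominant shot structure.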

Nam and Tewfik (2005) have presented a technique for detecting the presence of a gradual transition in video sequences and automatically identifying its type. The scheme focuses on analyzing the characteristics of the underlying special edit effects and estimates actual transitions by polynomial data interpolation. In particular, a B-spline interpolation curve fitting technique is used. It is able to recover the original transition behavior of an edit effect even if it is distorted by various post-processing stages.

Zhou and Zhang (2005) have discussed a method for shot boundary detection using Independent Component Analysis (ICA). Each video frame is represented by a two dimensional compact feature vector by projecting video frames from illumination invariant raw feature space into low dimensional ICA subspace. In the low dimensional ICA subspace, a dynamic clustering algorithm based on adaptive thresholding is developed to detect shot boundaries.

Boccignone et al (2005) have approached the problem of shot boundary detection using the attentional paradigm for human vision. The algorithm computes for every frame, a set (called a trace) of points of focus of attention in decreasing order of saliency. It then compares nearby frames by evaluating the consistency of their traces. Shot boundaries are hypothesized when the above similarity is below a dynamic threshold.

Cernekova et al (2003) proposed a method for detecting shot boundaries in video sequences using metrics based on information theory. The method relies on the mutual information and the joint entropy between consecutive frames and can detect cuts, fade-ins and fade-outs. The mutual information is a measure of the information transported from one frame to another, and it is used for detecting abrupt cuts.
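The mutual information between consecutive frames can be estimated from their joint gray-level histogram; the sketch below (hypothetical helper, with a coarse 4-level quantization for brevity) yields a high value for strongly related frames and drops sharply at an abrupt cut:

```python
import math

def mutual_information(frame_a, frame_b, levels=4):
    """Mutual information (in bits) between the quantized gray levels of two
    frames, estimated from their joint histogram."""
    joint = [[0] * levels for _ in range(levels)]
    n = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for pa, pb in zip(row_a, row_b):
            joint[pa * levels // 256][pb * levels // 256] += 1
            n += 1
    # marginal distributions of each frame
    p_a = [sum(row) / n for row in joint]
    p_b = [sum(joint[i][j] for i in range(levels)) / n for j in range(levels)]
    mi = 0.0
    for i in range(levels):
        for j in range(levels):
            p_ij = joint[i][j] / n
            if p_ij > 0:
                mi += p_ij * math.log2(p_ij / (p_a[i] * p_b[j]))
    return mi
```

When the second frame is unrelated to the first (e.g., a uniform frame after a cut), the joint distribution factorizes and the mutual information collapses toward zero.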

Han and Yoon (2000) have described a technique to detect shot boundaries in a low-pass filtered histogram space. Twin comparison was developed to detect shot boundaries among cuts and fades/dissolves using two thresholds (Zhang et al 1993). Bilge and Tekalp (1998) proposed a one-threshold method, using the Otsu method to find the threshold automatically; however, that system was presented only for the detection of cut-type shot boundaries. In model-based methods, edit effects showing gradual changes (fades, dissolves, etc.) exhibit an edit-invariant property that is used in classifying shot boundaries. Han and Yoon's method accentuates these edit constancy effects by applying low-pass filtering to the histogram differences between frames, while suppressing the motion effects that cause false alarms. After applying window convolution to the original histogram differences, the edit constancy effects appear as rectangular shapes for cuts and triangular shapes for fades/dissolves in the filtered histogram differences. The low-pass filter thus reduces false alarms caused by camera and object motion. Because this method uses only color histograms as feature data, the edit constancy effects are usually distorted in real images; new features producing edit constancy effects closer to the ideal ones are left for future work.

Boreczky and Lynn (1998) have applied color, audio and motion features for segmenting video. Three types of features are used. The first is a standard histogram difference based on luminance: pixels are distributed into 64 bins according to their luminance level, and the histogram feature is the bin-wise difference of the histograms of adjacent frames, measuring the distance between adjacent frames based on their luminance distributions. The second is an audio distance measure: the audio is first converted into a sequence of cepstral vectors computed at 20 ms intervals; likelihood measures are computed separately over two adjacent intervals and then over their concatenation, and the ratio between the two values gives the likelihood ratio for testing the hypothesis that the same sound type is represented by both intervals. The third is an estimate of object motion between two adjacent frames, computed from nine motion vectors; the magnitudes of these vectors and the magnitude of their average help in detecting zooms and pans. In the HMM framework, a sequence of features is given to the Viterbi algorithm, which produces the sequence of states most likely to have generated those features.

Shan Li and Moon-Chuen Lee (2005) have described a shot change detection method based on a sliding window. The conventional sliding window (CSW) method used in window segmentation has a high rate of false alarms and missed cuts. Their method uses adaptive threshold techniques: a hard cut is detected based on its current feature value and its local neighborhood in the sliding window. Compared to other methods, good performance can be achieved by combining the sliding window technique with color histogram differences (Gargi et al 2000, Yeo et al 1995). An improved sliding window method can be obtained by employing multiple adaptive thresholds during a three-step process: global prefiltering, sliding window filtering, and scene-activeness investigation of the frame-by-frame discontinuity values.

Cernekova et al (2006) have described a technique for gradual transitions such as dissolves and wipes, which, compared to abrupt cuts, are difficult to detect since they spread over a number of frames. The method compares more than two consecutive frames within a temporal window. A graph is created for the window sequence using the mutual information between multiple pairs of frames: the frames are the nodes, and the similarity measure gives the weights of the edges. The weak connections between nodes are then removed, forming separate subgraphs, where each subgraph corresponds to a shot. Because the algorithm utilizes information from multiple frames within a temporal window, it detects gradual transitions effectively, as noted in (Gargi et al 2000) and (Yeo and Liu 1995).

Yu Meng et al (2009) have proposed a new shot boundary detection algorithm based on a Particle Swarm Optimization (PSO) classifier. The method first takes the difference curves of U-component histograms as the characteristics of the differences between video frames, then applies a sliding-window mean filter to the difference curves and a KNN classifier tuned by PSO to detect and classify the shot transitions. The method has three advantages: it is more sensitive to gradual transitions; each curve with remarkable characteristics corresponds to a shot transition; and cuts and gradual transitions can be detected in the same step.

Chan and Wong (2011) have proposed a method for shot boundary detection via an optimization of traditional scoring based metrics using a genetic algorithm search heuristic. The advantage of this approach is that it allows for the detection of shots without requiring the direct use of thresholds. The methodology is described using the edge-change ratio metric.

Shujuan Shen and Jianchun Cao (2011) have discussed a fuzzy clustering neural network model which synthesizes an unsupervised fuzzy competitive learning algorithm and a self-organizing competitive network. Based on this model, an algorithm for abrupt video shot boundary detection is presented, performing a two-stage clustering on a linear feature space.

Hua Zhang et al (2011) have proposed a new method of shot boundary detection based on color feature aiming to obtain accurate detection with inexpensive cost. The technique is able to detect cut shot boundaries through the analysis of color histogram differences and an adaptive threshold that is based on a sliding window. For gradual transitions such as fades and dissolves, a preprocessing has been introduced, and local histogram differences are quantified to binary values by selecting a threshold automatically with reference to the variation of histogram differences.

Jinhui Yuan et al (2005) have proposed a method to detect both abrupt transitions (cuts) and gradual transitions (GTs, excluding fade-out/in) in a unified way, by incorporating temporal multi-resolution analysis into the model. Furthermore, instead of an ad hoc thresholding scheme, a novel kind of feature is used to characterize shot transitions, and a Support Vector Machine (SVM) with an active learning strategy is employed to classify boundaries and non-boundaries. A limitation of this approach is that the detection of FOIs is not incorporated into the framework. With multi-resolution analysis, methods are still needed to effectively reduce the disturbances of motion, and how to effectively make use of information across different resolutions remains an important open problem.

Bescos et al (2005) have proposed a method which is based on mapping the interframe distance values on to a multidimensional space, while preserving the temporal sequence (or frame ordering information). It is shown that detection of boundaries is less sensitive to the choice of threshold in the multidimensional space.

Cernekova et al (2002) have proposed information-theoretic measures for detecting shot boundaries. The mutual information and joint entropy between two successive frames are calculated for each of the RGB components, for the detection of cuts, fade-ins and fade-outs.

Feng et al (2005) have used different types of transition to observe different temporal resolutions. Temporal multi-resolution analysis is applied on the video stream, and video frames within a sliding window are classified into groups such as normal frames, gradual transition frames and cut frames. Then the classified frames are clustered into different shot categories.

Tran Quang Anh et al (2012) have used histogram and SIFT features with graph-based image segmentation, while Lili et al (2009) and John (1998) used Hidden Markov Models for video retrieval and segmentation respectively. Kalpana Thakre et al (2010) used multiple features, such as a quantized Lab color histogram, texture and motion features, for video retrieval.

Onur Kucuktunc et al (2010) have presented a fuzzy color histogram-based shot-boundary detection algorithm specialized for content-based copy detection applications. The proposed method aims to detect both cuts and gradual transitions (fades, dissolves) effectively in videos where heavy transformations occur. Along with the color histogram generated with the fuzzy linking method in the L*a*b* color space, the system extracts a mask for still regions and the window of the picture-in-picture transformation for each detected shot, which is useful in a content-based copy detection system.

Padmakala and Anandhamala (2010) have presented an image denoising strategy based on an enhanced sparse representation in the transform domain, which is utilized in denoising the noisy frames. The Expectation Maximization (EM) algorithm is utilized in the estimation of the background. The approach presented for foreground object segmentation makes use of the biorthogonal wavelet transform and the L2-norm distance measure.

Su C et al (2005) have presented a novel dissolve-type transition detection algorithm that can correctly distinguish dissolves from disturbances caused by motion. The dissolve is modeled based on its nature, and the model is then used to filter out possible confusion caused by the effect of motion.

Hun-Woo Yoo (2006) has proposed a new algorithm for gradual shot boundary detection, based on the fact that most gradual transition curves can be characterized by the variance distribution of edge information in the frame sequences. An average edge frame sequence is obtained by performing Sobel edge detection. Features are extracted by comparing the variance with those of local blocks in the average edge frames, and these features are further processed by the opening operation to obtain smoothed variance curves. The lowest variance in the local frame sequence is chosen as a gradual detection point.
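The Sobel edge magnitudes and their variance, on which this method relies, can be sketched as follows (an illustrative Python version with hypothetical helper names; the published algorithm additionally averages edge frames and applies morphological opening):

```python
def sobel_magnitude(frame):
    """Approximate gradient magnitude of interior pixels via Sobel kernels."""
    gx_k = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
    gy_k = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
    h, w = len(frame), len(frame[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(gx_k[j][i] * frame[y - 1 + j][x - 1 + i]
                     for j in range(3) for i in range(3))
            gy = sum(gy_k[j][i] * frame[y - 1 + j][x - 1 + i]
                     for j in range(3) for i in range(3))
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out

def edge_variance(frame):
    """Variance of the edge magnitudes; a low value within a local frame
    sequence suggests a gradual transition point."""
    vals = [v for row in sobel_magnitude(frame) for v in row]
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)
```

During a dissolve the two blended images wash out each other's edges, so the edge variance dips and the minimum of the local variance curve marks the transition.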


From the literature it is understood that CBVR is an active field of research with many open problems. In the following section the existing methods used for CBVR are discussed.

Alan F. Smeaton (2007) divides the traditional VR approaches into five categories. Each approach has its own advantages and disadvantages.

Using Metadata and Browsing Keyframes: In this technique metadata are used to search the video. Metadata include characteristics such as video title, date, actor(s), video genre, running time and file size, video format, reviews by users and user ratings, copyright and ownership information, and so on. Each of these metadata fields is searchable, and most systems are coupled with keyframes which allow users to preview the video content itself visually. The movies section of the Internet Archive and the Open Video Project are two good examples of video retrieval systems based solely on metadata. Metadata-based video navigation is quite limited in terms of supporting a user's information seeking and searching requirements, but it is easy to implement, requires little analysis of the video content, and has led to a rapid deployment of video libraries.

Using text for video searching: This type of system uses the spoken dialogue in the video for assistance. If the user is searching video that contains spoken commentary, for example nature documentaries or TV news broadcasts, then this method can be used for retrieval, since the spoken dialogue may reflect the contents of the video itself; in such cases, text search through the video can be considered a good video search. Spoken dialogue can be obtained by Automatic Speech Recognition (ASR), and on-screen text by Optical Character Recognition (OCR). There can even be link traversal, where the text associated with each story is used to automatically generate a link from one news story to the next most related story; the combination of text search and link traversal is often used on the Internet by search engines. The main disadvantages of text-based video retrieval are that not all video content has associated text, and not all information needs can be expressed as a text query.

Key frame matching: In this method the representative frame of a shot, known as the keyframe, is used for retrieval. This type of video retrieval can be called Content Based Image Retrieval (CBIR). The technique requires efficient selection of a set of images or video key frames. The image to be searched for is used as a query and is compared against the video key frames from the video library. Key frame based matching is good for video searching, but the user needs to be very precise about the visual component. It can be combined with metadata and text based retrieval.
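Key frame matching can be sketched as a simple histogram-based ranking of library key frames against a query image (hypothetical helper; real CBIR systems use richer features and indexing structures):

```python
def retrieve(query_hist, keyframe_hists, top_k=2):
    """Rank library key frames by L1 histogram distance to the query and
    return the indices of the top_k best matches."""
    def dist(hist):
        return sum(abs(a - b) for a, b in zip(query_hist, hist))
    ranked = sorted(range(len(keyframe_hists)),
                    key=lambda i: dist(keyframe_hists[i]))
    return ranked[:top_k]
```

The query image is reduced to the same feature as the stored key frames, so retrieval is a nearest-neighbour search in feature space, which is why the user's query must match the visual component precisely.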

Semantic features for video retrieval: This type of system uses semantic features for search. Semantic features refer to high-level or mid-level features that convey semantic content, such as indoor, outdoor or moving car; extracting such features is itself a challenging task, whereas the automatic extraction of low-level features like color and texture is comparatively easy. When there are more than a few semantic features, it can be noticed that they are related; the structure that relates them is an ontology, with examples such as person, man and woman. A set of semantic concept tags is developed for the video collection, and these concept tags are then used in feature-based retrieval to filter the collection. The use of semantic features has progressed considerably, and the detection of a feature and retrieval by that feature are often treated as independent of each other; in reality this is not the case when there is a large set of features. Therefore, instead of using each feature as an independent filter, a feature-based retrieval system should use the features together with their utilities. In practice, automatic feature detection does not provide a yes/no assignment; instead, each feature has a confidence value for its presence in a given shot.

Object based video retrieval: Here the retrieval is based on objects. Retrieving video based on objects is theoretically straightforward, but practically it is not so simple. If a motor car is considered as an object, its characteristics include four wheels and smooth texture on the bodywork; yet when an object is viewed from different angles it can appear with different shapes, colors and textures due to differing lighting conditions and shadow. A further difficulty is that objects are occluded by other objects as they move during a shot, as the camera moves, or as their shape changes over time. If the user wants to find video shots containing an object identifiable by a distinct shape, color or texture, then retrieval can be based on matching the object against a query example; boats and cars are typical examples of such queries. Research in object based shot retrieval has so far been limited, with few experimental results to support it; the technique is comparatively recent, so much more work can be done in this field.


Since the usage of multimedia databases is increasing rapidly, an efficient way to access and manipulate the information in a vast database has become a challenging and timely issue.

Flickner et al (1995) have developed the QBIC (Query by Image Content) system to explore content-based retrieval methods. QBIC allows queries on large image and video databases based on example images, user-constructed sketches, drawings, selected color and texture patterns, camera and object motion, and other graphical information. Two key properties of QBIC are its use of image and video content (computable properties such as the color, texture, shape and motion of images, videos and their objects) in the queries, and its graphical query language, in which queries are posed by drawing, selecting and other graphical means.

Hampapur et al (1997) have developed the Virage Video Engine (VVE), whose default set of primitives provides the necessary framework and basic tools for content-based video retrieval. The video engine is a flexible, platform-independent architecture that supports the processing of multiple synchronized data streams such as image sequences, audio and closed captions. The architecture performs multi-modal indexing and retrieval of video through the use of media-specific primitives.

Chang et al (1998) have proposed an interactive system on the Web, based on the visual paradigm, in which spatio-temporal attributes play a key role in video retrieval. The resulting system, VideoQ, is the first on-line video search engine supporting automatic object-based indexing and spatio-temporal queries.

Smith et al (1996) have described a highly functional prototype system for searching an image database by visual features. In the VisualSEEk system, the user forms queries by diagramming spatial arrangements of color regions, and the system finds the images that contain the most similar arrangements of similar regions. Prior to querying, the system automatically extracts and indexes salient color regions from the images. By using efficient indexing techniques for color information, region sizes, and absolute and relative spatial locations, a wide variety of complex joint color/spatial queries can be computed.

Pentland et al (1996) have described the Photobook system, a set of interactive tools for browsing and searching images and image sequences. These query tools differ from those used in standard image databases in that they make direct use of the image content rather than relying on text annotations. Direct search on image content is made possible by semantics-preserving image compression, which reduces images to a small set of perceptually significant coefficients. Photobook provides three types of description: the first allows search based on appearance, the second uses 2-D shape, and the third uses textural properties. These image content descriptions can be combined with each other and with text-based descriptions to provide a sophisticated browsing and search capability.

Wen-Nung Lie and Wei-Chuan Hsiao (2002) have proposed a content-based video retrieval system based on object motion trajectory. Their algorithm tracks moving objects in the MPEG-compressed domain by first linking individual macroblocks (MBs) in the temporal domain and then pruning and merging the resulting paths according to the spatial adjacency of MBs. In this way, the difficult spatial segmentation problem of traditional methods is avoided and multiple deformable objects can be tracked. The system can also eliminate global motion, so camera motion is allowed. The extracted object motion trajectory is then converted into a form conforming to the MPEG-7 motion descriptor. Both query-by-example and query-by-sketch interfaces are provided, and problems in descriptor matching (e.g., mismatch in keypoint interval and video time duration) are solved to achieve robustness and a high recall rate.

Hua-Tsung Chen et al (2003) have proposed a similarity retrieval mechanism in which the similarity between video sequences is measured from the spatio-temporal variation across consecutive frames. To bridge the semantic gap between low-level features and the high-level features users wish to capture, video shots are analyzed and characterized by the high-level feature of motion activity in the compressed domain. The extracted motion activity features are further described by a 2-D histogram that is sensitive to the spatio-temporal variation of moving objects. To reduce the dimensionality of the feature vector space used in sequence matching, the Discrete Cosine Transform (DCT) maps the semantic features of consecutive frames to the frequency domain while retaining the discriminatory information and preserving the Euclidean distance between feature vectors.
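The dimensionality reduction step can be illustrated as follows. An orthonormal DCT-II preserves the Euclidean norm of a feature vector (Parseval's relation), so truncating to the low-frequency coefficients yields a compact descriptor whose distances approximate those of the originals. This is a generic pure-Python sketch, not the authors' implementation:

```python
import math

def dct(vector):
    """Orthonormal 1-D DCT-II of a per-frame feature sequence (pure-Python sketch)."""
    n = len(vector)
    coeffs = []
    for k in range(n):
        s = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                for i, x in enumerate(vector))
        # Orthonormal scaling makes the transform norm-preserving.
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * s)
    return coeffs

def truncate(coeffs, m):
    """Keep only the first m low-frequency coefficients as the compact descriptor."""
    return coeffs[:m]
```

Because the transform is orthonormal, Euclidean distances between truncated coefficient vectors lower-bound the distances between the original feature vectors, which is what makes the truncated descriptors usable for matching.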

Bohm et al (2007) have described ProVeR, probabilistic video retrieval by the Gauss tree. The technique models objects using probability density functions (PDFs), which represent complex objects in the database in a compact and descriptive way. ProVeR is a prototype search engine for content-based video retrieval that represents each video as a set of Gaussians, managed by the Gauss tree. The system helps even non-expert users to efficiently retrieve videos containing similar shots and scenes.
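The idea of modeling video content with probability density functions can be sketched in one dimension as follows; the Gauss tree indexing itself is not reproduced here, and the sketch is illustrative only:

```python
import math

def fit_gaussian(samples):
    """Fit the mean and variance of a 1-D feature distribution."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    return mean, var

def gaussian_pdf(x, mean, var):
    """Evaluate the fitted Gaussian density at a query feature value."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
```

A shot whose feature samples yield a higher density at the query value is considered a better match; in the real system, many such Gaussians per video are organized in the Gauss tree for efficient search.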

Arasanathan Anjulan et al (2007) have proposed object-based video retrieval based on local region tracking. First, local invariant regions (LIRs) are tracked and the stable ones are extracted. Once a shot boundary is detected, newly starting LIR tracks within the shot are considered; an LIR track is newly starting if it begins at the current frame and is not within a threshold distance of an existing track. A track is a continuous match of an LIR across a number of frames: it starts from an LIR of the first frame, or from an LIR of an intermediate frame that has no match in the previous frame, and it terminates when the corresponding LIR has no match in the next frame. The MSER algorithm (Matas et al 2002) is used to extract LIRs, together with an ellipse-fitting algorithm that fits an ellipse to each region using parameters such as centroid, radii and orientation. SIFT (Lowe 2004) is then used to extract a feature vector from each extracted region.

Arslan Basharat et al (2007) have presented a framework for matching video sequences using the spatio-temporal segmentation of videos. Instead of using appearance features for region correspondence across frames, interest point trajectories are used to generate video volumes. Point trajectories, generated using the SIFT operator, are clustered into motion segments by analyzing their motion and spatial properties. The temporal correspondence between the estimated motion segments is then established from the most common SIFT correspondences, and a two-pass correspondence algorithm handles splitting and merging regions. Spatio-temporal volumes are extracted from the consistently tracked motion segments, and a set of features including color, texture, motion and SIFT descriptors is extracted to represent each volume. An Earth Mover's Distance (EMD) based approach is used to compare volume features. Given two videos, a bipartite graph is constructed by modeling the volumes as vertices and their similarities as edge weights; maximum matching of this graph produces volume correspondences between the videos, and the volume matching scores are used to compute the final video matching score.
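The final volume-correspondence step can be illustrated with a simplified matching sketch. The following uses Kuhn's augmenting-path algorithm on an unweighted graph obtained by thresholding the similarity matrix; the thresholding is an assumption introduced for illustration, whereas the paper matches on the similarity weights themselves:

```python
def max_bipartite_matching(similarity, threshold):
    """Maximum matching between the volumes of two videos.

    similarity[i][j] is the similarity of volume i (video A) and
    volume j (video B); pairs below threshold get no edge.
    """
    n_left = len(similarity)
    n_right = len(similarity[0]) if similarity else 0
    match_right = [-1] * n_right  # match_right[j] = left vertex matched to j

    def try_augment(u, seen):
        # Try to give left vertex u a partner, rerouting earlier matches if needed.
        for v in range(n_right):
            if similarity[u][v] >= threshold and not seen[v]:
                seen[v] = True
                if match_right[v] == -1 or try_augment(match_right[v], seen):
                    match_right[v] = u
                    return True
        return False

    matched = sum(try_augment(u, [False] * n_right) for u in range(n_left))
    return matched, match_right
```

The resulting volume pairs play the role of the correspondences from which the final video matching score is computed.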

Dyana et al (2009) have proposed a CBVR system based on the shape and motion features of the video object. The algorithm uses the curvature scale space for shape representation and polynomial curve fitting for trajectory representation and retrieval. The shape representation is invariant to translation, rotation and scaling, and robust with respect to noise. Trajectory matching incorporates visual distance, velocity dissimilarity and size dissimilarity for retrieval. To retrieve similar video shots, the cost of matching two video objects is based on their shape and motion features.

Ma and Xianheng (2010) have observed that non-textual object information in videos is stored as grids of numbers in the image frames, which makes it hard to retrieve objects from videos with classical methods. They propose a robust color-feature model of moving video objects that converts RGB pixels to a color circle of hue, and they develop a framework for video object retrieval based on this color feature model.
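The hue-based color model can be sketched as follows, mapping RGB pixels onto the hue circle and accumulating them into a coarse circular histogram; the bin count is an assumed parameter, not taken from the paper:

```python
import colorsys

def pixel_hue(r, g, b):
    """Map an RGB pixel (0-255 channels) to its hue angle in degrees."""
    h, _s, _v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return h * 360.0

def hue_histogram(pixels, bins=36):
    """Accumulate pixel hues into a circular histogram (10-degree bins by default)."""
    hist = [0] * bins
    for r, g, b in pixels:
        hist[int(pixel_hue(r, g, b) / 360.0 * bins) % bins] += 1
    return hist
```

Because hue discards intensity, such a model is comparatively robust to illumination changes, which is the motivation for the color-circle representation.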

Padmakala et al (2011) have proposed an algorithm in which, to retrieve video for a given query, the raw video data is represented by two schemes based on its visual contents: Video Segment Representation (VSR) and Optimal Key Frame Representation (OFR). First, the input raw video is segmented using a video object segmentation algorithm to obtain the objects present in it. Feature vectors are then computed from the VSR using texture analysis and color moments. The OFR is extracted by considering the probability of occurrence of the pixel intensity values with respect to pixel location across all frames of the raw video. Finally, the texture, color and optimal-key-frame features of a video are combined into a feature set and stored in a feature library. For a query video clip, the same features are extracted and compared with those in the feature library using a feature-weighted distance measure, and the similar videos are retrieved from the collection.

Smith and Khotanzad (2004) have proposed a new method for retrieving video segments from a digital video database. A unique signature for each video segment is generated from the analysis and tracking of interesting objects between frames of the video region. MPEG-7 descriptors based on (Day and Martinez 2001), consisting of the edge histogram, homogeneous texture, dominant colors and color structure (188 features in all), are extracted from corresponding objects in successive frames for the duration of the considered sequence. The average of the MPEG-7 descriptors is computed for each object across all frames in the region, resulting in a unique signature for the region. The stored signature that best matches this new signature, based on the Earth Mover's Distance (Rubner et al 1997) between signatures, is used to retrieve the video segment.

Che-Yen Wen et al (2007) have utilized moving-object tracking technology. Moving pixels are detected using background subtraction, which overcomes the shadow problem; the noise is then eradicated and the moving pixels are refined by means of connected-components labeling and morphological operations. The target's image and information for content-based video retrieval are extracted into the database with the aid of color histograms, color similarity and motion vectors. Using the proposed technique, image frame retrieval is performed in single-CCD (Charge Coupled Device) or multi-CCD surveillance systems. During multi-camera retrieval, detection and retrieval errors occur due to sudden environment changes (such as lighting), CCD shift, and viewing angle and position (which produce the same object at different sizes).

Sivic and Zisserman (2003) have described an approach to video object retrieval that retrieves all shots containing an object specified as the query. The object is specified by the user outlining it in an image, and it is then delineated in the retrieved shots. The method is based on three components: (i) an image representation of the object by a set of viewpoint-invariant region descriptors, so that recognition can proceed despite changes in viewpoint, illumination and partial occlusion; (ii) the use of contiguous frames within a shot to improve the estimation of the descriptors and to motion-group the object's visual aspects; and (iii) vector quantization of the descriptors, so that the technology of text retrieval, such as inverted file systems, can be employed at run time.
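The third component, quantizing descriptors into "visual words" so that inverted-file techniques from text retrieval apply, can be sketched as follows; the shot identifiers and word indices are hypothetical:

```python
from collections import defaultdict

def build_inverted_index(shot_words):
    """Map each quantized descriptor ('visual word') to the shots containing it,
    mirroring the inverted-file structure of text retrieval."""
    index = defaultdict(set)
    for shot_id, words in shot_words.items():
        for w in words:
            index[w].add(shot_id)
    return index

def query(index, query_words):
    """Rank shots by how many query visual words they share."""
    votes = defaultdict(int)
    for w in query_words:
        for shot_id in index.get(w, ()):
            votes[shot_id] += 1
    return sorted(votes.items(), key=lambda kv: kv[1], reverse=True)
```

As in text search, only shots containing at least one query word are touched at query time, which is what makes run-time retrieval fast over large video collections.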

Kin-Wai Sze et al (2005) and Padmakala et al (2011) have represented the entire video as a single significant frame, created from the spatial information of the pixel values along with the probability of occurrence of the pixels.

Kong Juan and Han Cuiying (2010) have investigated significant video retrieval technologies. Based on their studies, a content-based video retrieval system has been constructed according to the system design requirements, and the functions carried out by each module are explained briefly. Superior video analysis and retrieval capabilities are achieved by splitting the system into video preprocessing and video query subsystems.

Hu and John Collomosse (2010) have presented a fast technique for retrieving video clips using free-hand sketched queries. Visual keypoints within each video are detected and tracked to form short trajectories, which are clustered into a set of space-time tokens summarising the video content. A Viterbi process matches a space-time graph of tokens against a description of color and motion extracted from the query sketch, and inaccuracies in the sketched query are ameliorated by computing the path cost using a Levenshtein distance. Sports footage datasets have been used for evaluation.
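The Levenshtein distance used to absorb sketch inaccuracies is the classic dynamic-programming edit distance; a minimal implementation over two token sequences is:

```python
def levenshtein(a, b):
    """Edit distance between two token sequences, using a rolling DP row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]
```

Using an edit distance rather than exact matching means that a sketched path missing a token, or containing a spurious one, still scores close to the intended video trajectory.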

Shanmugam and Priya (2009) have presented a video data model that supports the integrated use of various approaches. The system splits the video into a sequence of elementary shots, extracts a small number of representative frames from each shot, and then calculates frame descriptors based on motion, edge, color and texture features. The video shots are segmented using a 2-D correlation coefficient technique. The motion, edge histogram, color histogram and texture features of the elementary video shots are extracted using, respectively, the Fast Fourier Transform with the L2-norm distance function, a statistical approach, HSV color space conversion, and Gabor wavelets via the Fast Fourier Transform. These features are stored in a feature library, and videos are retrieved on the basis of a query clip: the color, edge, texture and motion features of the query video clip are extracted and evaluated against the features in the feature library using the Kullback-Leibler distance similarity measure, and similar videos are retrieved from the collection according to the calculated distance.
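The Kullback-Leibler distance used in the comparison step can be sketched as follows for two normalized feature histograms; the epsilon smoothing of empty bins is a common practical choice assumed here, not taken from the paper:

```python
import math

def kl_divergence(p, q, eps=1e-10):
    """Kullback-Leibler divergence between two normalized histograms.
    eps guards against zero-valued bins (assumed smoothing choice)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
```

The divergence is zero only when the two histograms coincide and grows as they diverge, so library entries can be ranked by increasing divergence from the query histogram.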

Sav et al (2006) have described a system to support object-based video retrieval in which the user selects example video objects as part of the query. During the search the user builds up a set of such objects, which are matched against objects previously segmented from a video library. The match is based on the MPEG-7 Dominant Colour, Shape Compaction and Texture Browsing descriptors. A user-driven, semi-automated segmentation process has been used to segment the video archive.

Su et al (2005) have used motion vectors embedded in MPEG bitstreams to generate so-called "motion flows", which are used to perform quick video retrieval. They simply link the local motion vectors across consecutive video frames to form motion flows, which are then annotated and stored in a video database. In the retrieval phase, a coarse-to-fine strategy executes the video retrieval task, and motions that do not belong to the mainstream motion flows are filtered out. The retrieval process can be triggered by query-by-sketch (QBS) or query-by-example (QBE).

Zhou et al (2005) have presented a novel technique for automatic identification of digital video. The algorithm is based on dynamic programming and fully exploits the temporal dimension to measure the similarity between two video sequences. A normalized chromaticity histogram, which is illumination-invariant, is used as the feature, and dynamic programming is applied at the shot level to find the optimal nonlinear mapping between video sequences. Two new normalized distance measures are presented for video sequence matching: one is based on normalizing the optimal path found by dynamic programming, and the other combines the visual features with the temporal information.
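The shot-level dynamic programming can be illustrated with the following warping-path sketch, in which dist[i][j] is the chromaticity-histogram distance between shot i of one video and shot j of the other. This is a generic dynamic-time-warping style formulation assumed for illustration, not the authors' exact recurrence:

```python
def align_cost(dist):
    """Optimal nonlinear alignment cost between two shot sequences.
    dist is an n x m matrix of pairwise shot distances."""
    n, m = len(dist), len(dist[0])
    cost = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            # Best predecessor among the three allowed path moves.
            best = min(
                cost[i - 1][j] if i else float("inf"),
                cost[i][j - 1] if j else float("inf"),
                cost[i - 1][j - 1] if i and j else float("inf"),
            )
            cost[i][j] = dist[i][j] + (0.0 if i == 0 and j == 0 else best)
    return cost[-1][-1]
```

Normalizing this path cost by the path length gives a distance measure that is comparable across sequence pairs of different lengths, which is the spirit of the first measure described above.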

The major difference among the retrieval systems is the way they extract the features and manage the features in the video retrieval process.


It is observed from the literature that selecting a single feature is not enough to achieve reasonable accuracy in SBD, since any one feature is typically suited to detecting only one kind of shot boundary accurately. Moreover, the accuracy of many algorithms depends on the selection of the threshold and the size of the sliding window used in the process. Selecting the threshold manually every time is neither practical nor feasible, so automatic threshold selection methods are important in the area of SBD.
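As a minimal illustration of automatic threshold selection, a global threshold can be derived from the statistics of the frame-difference signal itself; the constant k below is an assumed parameter, not taken from any of the reviewed methods:

```python
def adaptive_threshold(diffs, k=3.0):
    """Automatic threshold as mean + k * std of the frame-difference values,
    a simple alternative to manual tuning (k is an assumed constant)."""
    n = len(diffs)
    mean = sum(diffs) / n
    std = (sum((d - mean) ** 2 for d in diffs) / n) ** 0.5
    return mean + k * std

def detect_cuts(diffs, k=3.0):
    """Flag frame indices whose difference exceeds the adaptive threshold."""
    t = adaptive_threshold(diffs, k)
    return [i for i, d in enumerate(diffs) if d > t]
```

More elaborate schemes compute the threshold over a sliding window so that it adapts to local activity, but even this global version removes the need to hand-tune a value per video.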

Though many efforts have been made in the field of CBVR to improve performance, some limitations still exist. The main problem with the existing methods is the selection and combination of feature sets: the feature sets should be selected so that they capture the characteristics of every part of the frame, giving particular attention to the spatial correlation of pixels in the image, in order to attain maximum performance. Conventional systems make little use of spatio-temporal features, which reduces them to something resembling CBIR and leads to lower Precision and Recall values; a detector that considers multiple spatial and temporal (spatio-temporal) features is therefore necessary to improve the Precision and Recall of the existing CBVR methods.

Another problem with the existing systems is the computational cost of the comparison strategy and the detection accuracy. They spend considerable time on comparison, either performing a full-length search of the database with the query or comparing the key frames of each shot with the query; key frames provide no guarantee of effectively representing the features of all the frames in a shot. A better comparison strategy is required to reduce the comparison time and eliminate the use of key frames.

Very little effort has been made in the field of OBVR, so there is substantial room for performance improvement in this area.


An enhancement of the existing LEB method has to be made with a view to improving its performance, using phase angle variation to overcome the drawbacks of intensity-based features such as sensitivity to camera and object motion. A system named Enhanced LEB (ELEB) has to be proposed to obtain higher detection accuracy in SBD.

A Combined Bitplane Triangular Template Mesh and Edge Features (CBEF) system has to be proposed to eliminate traditional feature-set extraction from the segmented regions of the image and to give more attention to the spatial correlation of pixels, by introducing the new BTTM feature, which extracts features from slices of the image rather than from segmented regions or sub-blocks. Genetic Algorithm (GA) based threshold selection has to be introduced in the proposed CBEF system to eliminate heuristic or manual threshold selection for edge detection.

An Integrated Lab color based Delaunay Triangles, Edge and Motion Features (ILEM) system has to be proposed to introduce a new set of spatio-temporal features, in order to improve the performance of the system in terms of Precision and Recall and to provide a good balance between computational cost and detection accuracy, by arranging the representative features of the shots of a video in a structural form that eliminates the use of key frames.

An Object Based Video Retrieval (OBVR) system has to be proposed to improve the performance of video retrieval when an object is used as the query.