Video Annotation and Retrieval Based on Ontology


When accessing large collections of digitized videos, it is often difficult to find both the appropriate video file and the portion of the video that is of interest. This paper presents a novel approach to knowledge-assisted semantic analysis and annotation of video content, based on an ontology and the WordNet infrastructure. Video analysis and semantic processing are run as separate processes: semantic concepts are extracted from the videos by performing annotation as a post-processing activity, in conjunction with a domain ontology and WordNet, which keeps the process simple and effective. An optimized semantic corpus structure stores the annotated data and the videos so that videos can be searched and retrieved more accurately.


Keywords: Annotation, Retrieval, Ontology

1. Introduction

Owing to advances in multimedia, storage devices, and related technologies, the volume of digital video produced grows day by day. This large amount of video data creates new needs for effective and efficient manipulation, i.e. browsing and searching for a specific video of interest, which is an extremely tedious task for a human. When searching for a significant event, video retrieval becomes indispensable, because a fast retrieval system helps to identify the relevant events quickly. It is therefore necessary to equip video systems with efficient and accurate retrieval functions so that users can search for specific videos of interest within a reasonable time and with a minimal amount of human intervention. An alternative is to automate the annotation process; this is, however, an extraordinarily challenging task because there is no commonly agreed method for analyzing digital content. The current interim solution to the video retrieval problem is to perform an exhaustive search of the whole video archive. This is inherently time-consuming because each video frame has to be decoded and its content analyzed to see whether a particular video segment satisfies the user-specified query.

The root of the retrieval problem is that there is currently no standard way to decompose a sequence of images into semantically describable entities [2-4]. This problem is usually addressed first by segmentation [5-7], which tries to decompose each video image into a set of regions, followed by pattern recognition, which tries to recognize the groups of regions extracted and classify them into different classes of objects [8-10], such as humans, animals, and furniture. Currently, there is no generally accepted methodology for extracting this kind of semantic description from videos, which is why applications related to video retrieval are progressing so slowly.

Achieving this requires an intense effort towards the development of intelligent systems capable of automatically locating, organizing, accessing, and presenting such huge and heterogeneous amounts of multimedia information in an intuitive way, while attempting to understand the underlying semantics of the multimedia content [1].

In this paper, an ontology-driven approach for the semantic analysis and retrieval of video is proposed. The approach builds on an ontology and is accompanied by video processing and analysis techniques. The proposed system supports the decomposition of the visual information in a video and the detection of the defined concepts, in combination with WordNet for expanding the concepts into synsets, thus resulting in a higher-level semantic representation of the video content.

2. Proposed System

The proposed system consists of multiple steps: the video is first preprocessed and then analyzed, and the analysis emits a video description. The output of the video analysis is then processed semantically to extract semantics from the video. The overall model of the proposed system is based on two separate components: semantic annotation and semantic retrieval. During semantic annotation, the input video is processed to extract semantic concepts using video analytics and other techniques; Figure 1 shows the entire semantic annotation process. Semantic retrieval handles the retrieval of videos related to the concepts; Figure 3 shows the semantic retrieval model of the proposed system.

2.1. Preprocessing

Before actually trying to identify the key properties of a video, it can be useful to gain additional information about it, which is done in a preprocessing step. In this step the video is segmented into shots. A shot is the smallest semantic unit within a video and comprises an ordered set of frames. Two shots are separated by a transition. Three main types of transition are considered for shot boundary detection.

Hard Cut Detection

A hard cut joins two directly concatenated shots without any sort of transition in between. Color histograms are useful for detecting hard-cut shot boundaries [11].
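As a minimal sketch of histogram-based hard cut detection, assume that each frame is given as a flat list of 8-bit gray levels and that a fixed threshold on the histogram difference is adequate. Both are illustrative assumptions; the names `histogram` and `detect_hard_cuts`, the bin count, and the threshold value are hypothetical and are not taken from [11].

```python
def histogram(frame, bins=8, max_value=256):
    """Return a normalized histogram of a frame given as a flat list of pixel values."""
    counts = [0] * bins
    for value in frame:
        counts[value * bins // max_value] += 1
    total = len(frame)
    return [count / total for count in counts]


def detect_hard_cuts(frames, threshold=0.5):
    """Return each index i at which a cut is assumed between frame i-1 and frame i."""
    cuts = []
    previous = histogram(frames[0])
    for i in range(1, len(frames)):
        current = histogram(frames[i])
        # Half the L1 distance between two normalized histograms lies in [0, 1].
        difference = 0.5 * sum(abs(a - b) for a, b in zip(previous, current))
        if difference > threshold:
            cuts.append(i)
        previous = current
    return cuts


# Two dark frames followed by two bright frames (hypothetical data).
dark = [10] * 16
bright = [240] * 16
print(detect_hard_cuts([dark, dark, bright, bright]))  # prints [2]
```

A real detector would work on color histograms per channel and would usually adapt the threshold to the footage; the fixed value here only keeps the sketch short.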

Fade Detection

Fades are gradual changes of a scene to or from a monotone (e.g. completely black) image. If the scene is disappearing, the transition is called a fade out; otherwise it is called a fade in. Fades can be detected by analyzing the standard deviation of the pixel colors, or by edge detection in cases where edges slowly disappear.
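The standard-deviation criterion can be sketched as follows, under the assumption that a monotone frame has a pixel standard deviation close to zero and that the deviation changes monotonically during a fade. The function name `classify_fade` and the `max_std` tolerance are hypothetical, and the edge-based variant is not covered.

```python
from statistics import pstdev


def classify_fade(frames, max_std=2.0):
    """Classify a short sequence of frames as 'fade out', 'fade in', or None.

    Each frame is a flat list of gray levels. A frame whose standard
    deviation is at most max_std is treated as monotone (an assumption).
    """
    deviations = [pstdev(frame) for frame in frames]
    pairs = list(zip(deviations, deviations[1:]))
    non_increasing = all(a >= b for a, b in pairs)
    non_decreasing = all(a <= b for a, b in pairs)
    if non_increasing and deviations[-1] <= max_std < deviations[0]:
        return "fade out"  # detail vanishes into a monotone frame
    if non_decreasing and deviations[0] <= max_std < deviations[-1]:
        return "fade in"   # detail emerges from a monotone frame
    return None


# A hypothetical scene dimmed to black in three steps.
scene = [0, 100, 200, 50]
fade_out = [[int(pixel * factor) for pixel in scene] for factor in (1.0, 0.5, 0.0)]
print(classify_fade(fade_out))        # prints fade out
print(classify_fade(fade_out[::-1]))  # prints fade in
```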

Dissolve Detection

A dissolve is defined as a blending over from one shot to another, where the first shot fades out while the second shot fades in. Color-histogram and edge-based methods are useful for detecting dissolve effects.

Key Frame Selection

Video or motion pictures consist of a series of still images. Many applications extract one or more of these still images, termed keyframes, as useful graphical representations of the video data [14]. Keyframe extraction makes it possible to process only keyframes instead of all frames, without losing too much discriminative information. On a shot level, it has been shown that using keyframes instead of either regularly sampled frames or the first frame of a shot improves performance [12].

The Bhattacharyya distance [9] between color-based histograms will be combined with a temporal distance measure to quantify inter-frame dissimilarity, and keyframes will be selected using dynamic programming.
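The inter-frame dissimilarity can be sketched as below. The Bhattacharyya distance is computed in its standard form, the negative logarithm of the Bhattacharyya coefficient. The text does not say how the temporal distance is combined with it, so the weighted sum in `dissimilarity` and its `weight` parameter are purely illustrative assumptions, and the dynamic-programming selection step is not shown.

```python
import math


def bhattacharyya_distance(p, q):
    """Bhattacharyya distance between two normalized histograms p and q."""
    coefficient = sum(math.sqrt(a * b) for a, b in zip(p, q))
    coefficient = min(coefficient, 1.0)  # guard against rounding slightly above 1
    if coefficient == 0.0:
        return math.inf  # the histograms do not overlap at all
    return -math.log(coefficient)


def dissimilarity(hist_a, hist_b, index_a, index_b, weight=0.01):
    """Hypothetical combination: color distance plus a weighted frame gap."""
    return bhattacharyya_distance(hist_a, hist_b) + weight * abs(index_a - index_b)


# Two hypothetical two-bin color histograms, ten frames apart.
p = [0.5, 0.5]
q = [0.9, 0.1]
print(bhattacharyya_distance(p, q))
print(dissimilarity(p, q, index_a=3, index_b=13))
```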

Figure 1. Semantic video analysis

Video Analytics

Video analytics is a technology used to analyze video for specific data, behavior, objects, or attitude. It is the initial information extraction process applied to the video. The main objectives of this component are object detection and tracking within the shot, and scene understanding. The main emphasis is on keyframe/image segmentation. The ontology is provided to support the labeling process during segmentation.

Keyframe/Image Segmentation

Segmentation means partitioning the image into a number of arbitrarily shaped regions, each of them typically being assumed to constitute a meaningful part of the image, i.e. to correspond to one of the objects depicted in it or to a part of one such object [13].

The region growing segmentation method starts from the first pixel of the image, without a seed, and assigns it to a region; it then checks all neighboring pixels and compares them with it. If a neighboring pixel is too different, a new region is created for it, and the process continues until all pixels belong to a region. Merging and splitting techniques are then applied to make the regions more meaningful: similar regions are merged if they are of the same size and adjacent to each other. The process stops when no more merging or splitting is possible. Figure 2 shows the region-based segmentation.
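The growing step described above can be sketched as follows for a small grayscale image. This is a simplified illustration under stated assumptions: pixels are compared with the first pixel of their region using a fixed `tolerance`, only 4-connected neighbors are considered, and the subsequent merging and splitting pass is omitted. The function name and parameter are hypothetical.

```python
from collections import deque


def region_growing(image, tolerance=10):
    """Label a 2-D grayscale image (a list of rows) with region ids.

    Scanning starts at the first pixel; every pixel that is not yet
    labeled when the scan reaches it starts a new region.
    """
    height, width = len(image), len(image[0])
    labels = [[-1] * width for _ in range(height)]
    next_label = 0
    for row in range(height):
        for col in range(width):
            if labels[row][col] != -1:
                continue
            reference = image[row][col]  # first pixel of the new region
            labels[row][col] = next_label
            queue = deque([(row, col)])
            while queue:
                r, c = queue.popleft()
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    inside = 0 <= nr < height and 0 <= nc < width
                    if inside and labels[nr][nc] == -1:
                        # Similar neighbors join the region; others are left
                        # for a later region.
                        if abs(image[nr][nc] - reference) <= tolerance:
                            labels[nr][nc] = next_label
                            queue.append((nr, nc))
            next_label += 1
    return labels


# Hypothetical 3x3 image with a dark block, a bright column, and a mid-gray strip.
image = [
    [10, 12, 200],
    [11, 13, 205],
    [90, 95, 210],
]
for label_row in region_growing(image):
    print(label_row)
```

With these values the sketch yields three regions: the dark 2x2 block, the bright right-hand column, and the two mid-gray pixels at the bottom left.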

The segmented data are then processed to label the regions with the help of the ontology and human intervention, and the labeled data are stored in a file for further processing. The LSCOM ontology structure will be adapted to build a small-scale ontology for the analysis of natural scenes in the video.

Figure 2. Keyframe/image after segmentation

Semantic Processing

Semantic processing is the effort to minimize the semantic gap, which is the most common problem in video annotation and retrieval. The two components of this module, concept extraction and concept expansion, jointly extract semantic concepts from the data supplied by the previous module. The working mechanisms of these two components are discussed below.

Concept Extraction

This component extracts concepts, in conjunction with the domain ontology, from the input data supplied by the previous module. The purpose of the domain ontology is to assist the process of semantic concept extraction. The domain ontology will be built with a supporting rule-based inference engine that uses IF-THEN and IF-THEN-ELSE structures to simplify semantic concept extraction. The output of this module is a statement, called the "Original Concept", that describes the entire scene of the selected keyframe. The original concepts are then passed to the next component, concept expansion.
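A rule-based extraction step of this kind might look like the sketch below. The rules, the region labels, and the function name `extract_concept` are hypothetical examples chosen to match the concepts used later in the paper; the actual rule set of the domain ontology is not given in the text.

```python
# Hypothetical rules: IF all labels in the condition are present THEN emit the concept.
RULES = [
    ({"ship", "water"}, "SHIP, SEA"),
    ({"car", "asphalt"}, "CAR, ROAD"),
]


def extract_concept(region_labels, rules=RULES, default=None):
    """Return the original concept for a keyframe from its labeled regions."""
    labels = {label.lower() for label in region_labels}
    for condition, concept in rules:
        if condition <= labels:  # IF every required label is present
            return concept       # THEN this rule fires
    return default               # ELSE no rule matched


print(extract_concept(["Ship", "Water", "Sky"]))  # prints SHIP, SEA
print(extract_concept(["Tree", "Grass"]))         # prints None
```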

Concept Expansion

This component expands the concepts into more synsets, with the aim of producing an annotated corpus useful enough to support every type of user query. For example, a statement like "SHIP, SEA" has a larger concept space, in which a semantic intensity specifies the level of relevance of each expanded concept to the original concept, as shown in Table 1.

Table 1. Concept expansion of the original concept.

Original Concept | Concept Expansion-1 (Semantic Intensity) | Concept Expansion-2 (Semantic Intensity)
Ship, Sea | Embark(1), Transport(1), Travel(0.8), vessel(0.95), Carry(0.5), watercraft(0.95) | River(0.73), Ocean(1), Lake(0.8)
Car, Road | Auto(1), automobile(1), motorcar(0.9), machine(1), railcar(1), railway car(0.87), railroad car(0.87), cable car(0.89), gondola(1), elevator car(0.89) | Route, Path(0.86), for travel(0.64)

Semantic Corpus

Semantically annotated corpora are intended to capture semantic relations among elements. This module stores the annotated data (metadata) and provides access to the videos stored in the video bank. It comprises two components. The first is the semantic corpus, a structured schema that stores the output of the semantic processing module in a structured format. The second is the video bank, where all processed videos are stored. The semantic corpus is the heart of the whole system and helps users to search for the videos of interest to them, because it stores not only the original concepts but also the expanded concepts, together with their semantic intensity, in a relational table format. The purpose of this organization is to make it simple to obtain as much information as possible from the stored data.
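One possible relational layout for the semantic corpus is sketched below using an in-memory SQLite database. The table names, column names, and the video file path are assumptions made for illustration; the paper does not specify the actual schema. The sample rows reuse values from Table 1.

```python
import sqlite3


def build_corpus():
    """Create an empty, in-memory semantic corpus (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE original_concept (
            id INTEGER PRIMARY KEY,
            concept TEXT NOT NULL,
            video_file TEXT NOT NULL  -- location of the video in the video bank
        );
        CREATE TABLE concept_expansion (
            original_id INTEGER NOT NULL REFERENCES original_concept(id),
            expanded TEXT NOT NULL,
            intensity REAL NOT NULL   -- semantic intensity
        );
    """)
    return conn


def add_annotation(conn, concept, video_file, expansions):
    """Store one original concept together with its expanded concepts."""
    cursor = conn.execute(
        "INSERT INTO original_concept (concept, video_file) VALUES (?, ?)",
        (concept, video_file),
    )
    conn.executemany(
        "INSERT INTO concept_expansion VALUES (?, ?, ?)",
        [(cursor.lastrowid, word, intensity) for word, intensity in expansions],
    )
    return cursor.lastrowid


conn = build_corpus()
ship_id = add_annotation(
    conn, "Ship, Sea", "videos/ship.mp4",  # hypothetical path
    [("vessel", 0.95), ("ocean", 1.0), ("lake", 0.8)],
)
print(conn.execute("SELECT COUNT(*) FROM concept_expansion").fetchone()[0])  # prints 3
```

Keeping the expansions in their own table lets one original concept carry any number of expanded concepts without changing the schema.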

Semantic Query Retrieval

This module will be used to assist the users in searching and retrieving videos from the corpus. The structure of the semantic query retrieval module is shown in Figure 3.

Figure 3. Semantic Retrieval

This module takes a text query from the user and applies the query interpreter, which checks the query's grammar and spelling. After passing this phase, the query goes through part-of-speech tagging and lemmatization, which prepare it for further execution. The next module, the query data extractor, extracts data from the semantic corpus. It applies each noun of the input query to the semantic corpus and, on success, receives the relevant concepts and their semantic intensity along with the video files. It then arranges all the data in ascending order of semantic intensity for display to the user.

Query interpreter

The query interpreter converts the query into a more meaningful form. First, the query is checked for grammatical and spelling errors. After successfully passing this phase, the query is tagged with parts of speech using the Brill tagger [15], and the base form of each word is then obtained using a lemmatizer. The query interpreter selects only the nouns and verbal nouns, so the output of this component is a list of words, constructed from the input query, that is ready for data extraction.
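The flow of the query interpreter can be illustrated with the toy sketch below. It does not use the Brill tagger or a real lemmatizer: a small hand-written noun lexicon and a plural-stripping rule stand in for both, purely so that the example is self-contained. The lexicon, the function names, and the omission of the grammar and spelling check are all assumptions of this sketch.

```python
import re

# Hypothetical noun lexicon standing in for a real part-of-speech tagger.
NOUN_LEXICON = {"ship", "sea", "car", "road", "river"}


def lemmatize(word):
    """Toy lemmatizer: strip a plural 's' when the result is a known noun."""
    if word.endswith("s") and word[:-1] in NOUN_LEXICON:
        return word[:-1]
    return word


def interpret_query(query):
    """Return the list of noun lemmas found in a text query."""
    tokens = re.findall(r"[a-z]+", query.lower())
    lemmas = [lemmatize(token) for token in tokens]
    return [lemma for lemma in lemmas if lemma in NOUN_LEXICON]


print(interpret_query("Show me ships sailing on the sea"))  # prints ['ship', 'sea']
```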

Query Data Extractor

The purpose of this module is to extract data from the semantic corpus, along with the video files, and to arrange the data for display to the user according to semantic intensity. The basic idea is to apply the tagged nouns one by one to the concept expansion columns. On a match, the module selects not only the matching concept and its semantic intensity, but also the original concept from the structured schema and the video from the video bank. The output is then displayed in the user interface in order of semantic intensity.
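The matching and ordering step can be sketched as follows, with a small in-memory list standing in for the semantic corpus. The row layout, the file paths, and the function name `extract_results` are hypothetical; the intensities come from Table 1. Results are sorted in ascending order of semantic intensity, as the retrieval module is described earlier; passing `reverse=True` to `sorted` would list the most relevant match first instead.

```python
# Hypothetical corpus rows:
# (expanded concept, semantic intensity, original concept, video file)
CORPUS = [
    ("vessel", 0.95, "Ship, Sea", "videos/ship.mp4"),
    ("ocean", 1.0, "Ship, Sea", "videos/ship.mp4"),
    ("lake", 0.8, "Ship, Sea", "videos/ship.mp4"),
    ("automobile", 1.0, "Car, Road", "videos/car.mp4"),
]


def extract_results(nouns, corpus=CORPUS):
    """Match each query noun against the expanded concepts and order the hits."""
    wanted = {noun.lower() for noun in nouns}
    matches = [row for row in corpus if row[0] in wanted]
    # Ascending order of semantic intensity (index 1 of each row).
    return sorted(matches, key=lambda row: row[1])


for row in extract_results(["ocean", "lake"]):
    print(row)
```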

Conclusion and Future Work

In this paper, we have presented a novel structure for video annotation and retrieval with the help of an ontology. The main aim is to avoid semantic processing during annotation and to perform it after the initial data have been taken from the videos. With the help of WordNet and the domain ontology, the initially extracted concepts are expanded into more relevant concepts.

In future, the initial concept extraction process will be made more intelligent, and the OpenCYC ontology will be applied along with the existing ontology to enlarge the concept expansion.