Multimedia Information Retrieval Current Trends Computer Science Essay


Human knowledge is by far the richest multimedia storage system. Language and other communication mechanisms, e.g., facial expressions, can only express a small part of one's experiences and knowledge [ ]. Vision and hearing, the senses most used during communication, carry a great part of the experience or knowledge that we wish to share. Information captured by these two human senses can also be effectively and efficiently recorded, stored and processed by computers - everyone has collections of holiday pictures, karaoke songs, videos, etc. For these recorded experiences to be shared, some mechanism must be able to interpret human queries and retrieve the closest match. For example, users searching their collection with a keyword or a phrase such as "door" or "door bell" will expect the computer to return all relevant items. However, in most cases their search ends in disappointment. Today, after the revolutionary deployment of Bush-inspired hyperlink technology on the World Wide Web, one of the key research issues of information systems is the development of technology to facilitate the indexing and creation of metadata for the ever-growing surge of digital media.

Content-based retrieval (CBR) research strives to create a retrieval system that utilises digital content in the indexing process in a way that is ultimately independent of manual work. CBR is an umbrella term for content-based multimedia retrieval (CBMR), content-based visual information retrieval (CBVIR), content-based image retrieval (CBIR), content-based video retrieval (CBVR) and content-based audio retrieval (CBAR). CBR may also be referred to as multimedia information retrieval (MIR). CBR originates from the fields of pictorial databases and visual information management [ ], where the first systems were textual database management systems with manually created metadata and a query language for pictures.

In addition to the early pictorial database development, another branch of research emerged with a focus on time-dependent information. The first studies to influence the content-based analysis of temporal data came from the fields of knowledge representation and artificial intelligence. The articles by McDermott [ ] and Allen [ ] introduced models for describing relations between temporal events. Based on Allen's temporal logic, Abe et al. [ ] built a scene retrieval method for video database applications using temporal condition changes. During the 1990s, CBR research rapidly gained momentum in the understanding and indexing of audio, images and video [ ]. To accelerate the adoption of content-based technology, the Moving Picture Experts Group (MPEG) initiated the standardisation of the multimedia content description interface, MPEG-7, in October 1996; MPEG-7 was approved as a standard in 2002 [ ]. CBR is closely related to information retrieval (IR), which studies models for text document storage and retrieval. IR distinguishes itself from simple exact-match techniques and data retrieval in that it focuses on information that is relevant to the search task rather than on the data itself. It also operates with incomplete queries, partial relevance and natural query language [ ].

Information retrieval presents a commonly adopted model for the search processes, illustrated in Fig. 1.

Fig. 1. Basic information retrieval processes [ ]

The figure shows that the retrieved document set is the result of a computational comparison between a human-generated query and an indexed document representation. Since the retrieved documents can have varying degrees of relevance, a retrieval system attempts to maximise the relevance of all retrieved documents using ordered lists. The challenge in maximising relevance lies in the uncertainty of the representations generated by both the human and the system. A person's information problem may not be well articulated when the search process starts; it can simply be a vague idea associated with some retrieval need. Even if the person had a concise definition of her information need, she might find it difficult to formulate in the system's query syntax. Due to this gap between user intentions and query formulation, retrieval becomes inaccurate with mere exact matching of parameters. Therefore, there has been a strong focus on creating ranking algorithms that sort results by their calculated relevance. According to Dimitrova et al. [ ], user groups have a need for content-based technology, but professional users are better motivated to learn new practices even before the technology has fully matured. They mention example applications in professional and educational areas such as automated authoring of content for the World Wide Web, searching and browsing of large video archives, easy access to educational material, indexing and archiving of multimedia presentations, and indexing and archiving of multimedia collaborative sessions. In the consumer domain they describe applications for video overview and access, video content filtering and enhanced access to broadcast video.
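The difference between exact matching and relevance ranking described above can be sketched in a few lines of code. This is a minimal, purely illustrative example: the toy collection, the phrase query and the term-frequency scoring are all invented here, and real retrieval engines use far more sophisticated ranking functions (TF-IDF, BM25, and so on).

```python
# A toy document collection: each "document" is just a list of terms.
collection = {
    "d1": "the door bell rang at the door".split(),
    "d2": "a wooden door".split(),
    "d3": "holiday pictures from the beach".split(),
}

query = "door bell".split()

def score(query_terms, doc_terms):
    """Count how often the query terms occur in the document."""
    return sum(doc_terms.count(t) for t in query_terms)

# Exact matching of the whole query against whole documents finds nothing...
exact = [d for d, terms in collection.items() if terms == query]

# ...while ranked retrieval orders every document by its partial relevance.
ranked = sorted(collection, key=lambda d: score(query, collection[d]),
                reverse=True)
```

Here `exact` is empty, whereas `ranked` places the partially relevant documents first, which is exactly the behaviour the ordered result list is meant to provide.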


Due to the advance of modern imaging techniques, rich media, e.g., images and videos, overflow in everyday life. For example, online media data is growing explosively. The photo sharing website Flickr hosts over 5 billion images, with over 3,000 new images uploaded per minute, and the media sharing website YouTube receives more than 24 hours of uploaded video per minute. Besides the impact on the average person's everyday life, modern imaging techniques also provide an unprecedented ability to record phenomena in the micro-world for scientific research, such as microscopic imaging for biomolecular research. There is an emerging need to search and retrieve relevant content from such massive visual databases. Widely used commercial search engines, like Google and Bing, rely heavily on keyword matching techniques, and their search performance is often unsatisfactory due to erroneous textual tag information. Content-based image retrieval (CBIR) has attracted substantial attention over the past decade [ ]. Among current CBIR research, supervised methods, such as concept detection, have been extensively explored and applied for visual search [ ]. Briefly speaking, these methods first define some concept categories, including objects, scenes, events and so on, and then train a classifier for each category. These trained classifiers are used to classify and index query images as well as the images in the databases to generate search results in response to user queries.
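The concept-detection pipeline just described can be sketched with a deliberately simple stand-in classifier. Everything below is invented for illustration: the two concept categories, the 2-D "image features" and the nearest-centroid rule; a real system would use learned visual features and much stronger classifiers such as SVMs or neural networks.

```python
# Labelled training features for two invented concept categories.
training = {
    "Outdoor": [(0.9, 0.1), (0.8, 0.2)],
    "Indoor":  [(0.1, 0.9), (0.2, 0.8)],
}

def centroid(vectors):
    """Mean vector of a list of equal-length feature tuples."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n
                 for i in range(len(vectors[0])))

# "Training" one model per concept: here, just the class centroid.
models = {concept: centroid(feats) for concept, feats in training.items()}

def detect(feature):
    """Assign the concept whose centroid is closest to the image feature."""
    def dist(concept):
        return sum((a - b) ** 2 for a, b in zip(feature, models[concept]))
    return min(models, key=dist)

label = detect((0.85, 0.15))  # classified against both concept models
```

The same `detect` step would be applied both to database images at indexing time and to query images at search time, so that results can be matched on concept labels rather than on error-prone textual tags.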


Audio is everywhere. Audio is for everyone. Audio is more than just pure acoustic perception; audio is a pop-cultural phenomenon - maybe even the most traditional and most persistent in human history. It takes a central role in most people's lives, whether they act as producers or consumers, and has the power to amplify or change its listener's emotional state. Moreover, for many people, their musical preferences serve as a display of their personality. In short, if we deal with audio, we must be aware that many factors have to be considered, more or less all of them far beyond the technical definition of sound as a sensation of the ear stimulated by an oscillation of pressure [ ]. Given its cultural importance, it seems no wonder that music was the first type of media to undergo the so-called digital revolution. Building on technological advancements in the encoding and compression of audio signals (most notably the invention of the MP3 standard), together with the establishment of the Internet as a mainstream communication medium and distribution channel and, in rapid succession, the development of high-capacity portable music players in the late 1990s, digital music has not only stirred up the IT industry, but also initiated a profound change in the way people "use" music. Today, far more people are listening to far more music in far more situations than ever before. Music has become a commodity that is naturally traded electronically, exchanged, shared (legally or not), and even used as a means of social communication. Despite all these changes in the way music is used, the way music collections are organised on computers and music players and the way people search for music within these structures have basically remained the same.

Currently, the majority of systems for accessing music collections - irrespective of whether they comprise thousands (private collections) or millions of tracks (digital music resellers) - make use of arbitrarily assigned and subjective meta-information like genre or style, in combination with (nearly) objective metadata like artist name, album name, track name, record label, or year of release, to index the underlying music collection. Often, the hierarchical scheme genre - artist - album - track is then used to allow browsing within the collection. While this may be sufficient for small private collections, in cases where most of the contained pieces are not known a priori, the unmanageable number of pieces may easily overwhelm the user and impede the discovery of desired music. Thus, a person searching for music, e.g., a potential customer, must already have a very precise conception of the expected result. Obviously, the intrinsic problem of these indexing approaches is the limitation to a rather small set of metadata, whereas the musical, or more generally the cultural, context of music pieces is not captured. This results in inadequate representations and makes retrieval of desired pieces impractical and unintuitive. In response to these shortcomings of interfaces to music collections, the still-growing but already well-established research field known as "Music Information Retrieval" (MIR) is - among other things - developing methods that aim to extract musical descriptors directly from the audio signal. Representations built upon these descriptors allow, for instance, applications that autonomously analyse and structure music collections according to some of their acoustic properties, or systems that recommend similar-sounding music to listeners based on the music they already own.
While these signal-based approaches open up many opportunities for alternative music interfaces based directly on the audio content, they are unable to capture the non-acoustic, i.e., contextual, aspects of music. Furthermore, as the descriptors derived from the audio signal usually consist of low-level features of the signal (as opposed to common high-level concepts, such as melody or rhythm) that are optimised to perform well in their application area, the obtained representations used to index the music collection often have no significance for humans.
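To make the notion of a low-level signal descriptor concrete, the sketch below computes one classic example, the zero-crossing rate, on two synthetic sine "signals" standing in for decoded audio samples. The descriptor choice and the test signals are illustrative assumptions; MIR systems typically extract many such features (spectral, timbral, rhythmic) and combine them.

```python
import math

def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ.

    A low-level descriptor that loosely correlates with the perceived
    brightness/noisiness of a sound, but carries no meaning for humans
    on its own - exactly the limitation discussed above.
    """
    crossings = sum(1 for a, b in zip(samples, samples[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(samples) - 1)

n = 1000
low_tone  = [math.sin(2 * math.pi * 5  * t / n) for t in range(n)]  # 5 cycles
high_tone = [math.sin(2 * math.pi * 50 * t / n) for t in range(n)]  # 50 cycles

zcr_low = zero_crossing_rate(low_tone)
zcr_high = zero_crossing_rate(high_tone)
```

The higher-frequency tone yields a clearly larger zero-crossing rate, so the number does discriminate between signals, yet a value like "0.1 crossings per sample" says nothing to a listener about melody or rhythm.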


More and more of our lives are captured in digital multimedia documents, such as audio recordings, pictures or videos. For example, many children have a digital second life in the form of thousands of photos and endless hours of video footage captured from the very moment of their birth. At the other extreme, patients suffering from amnesia can be helped through an external memory, automatically recorded by a camera taking more than 2,000 photos per day [ ]. Furthermore, in the professional domain, multimedia documents are a necessity. For example, press agencies store digital images and videos of almost every event of public interest [ ], and cultural heritage archives digitise their multimedia assets for preservation and improved accessibility [ ]. There are three main explanations for this trend. First, since the mid-1990s the production and storage of new content, as well as the digitisation of existing content, has become steadily easier and cheaper. Second, some information types, for example learning material, can be absorbed faster via multimedia documents than via text [ ]. Finally, for many people multimedia content is more attractive than text - "A picture is worth a thousand words". As a result, multimedia collections grow rapidly, both in number and in volume. This growth and the wealth of information in the collections make an automated search facility (called a retrieval engine), which fulfils a user's information need, indispensable. The research discipline aiming to improve this search is called multimedia retrieval and is derived from the more general field of information retrieval. In order to find documents which fulfil an information need, retrieval engines base their search on document representations. Today, most multimedia retrieval engines use document representations consisting of manually created textual metadata, such as assigned keywords (tags) [ ].
Ranking multimedia documents using textual document representations often returns good results, since well-performing text retrieval engines can be re-used. However, the use of manually created metadata also has serious limitations. First, the metadata is time-consuming to create. Second, if the metadata is created by laymen it is subjective and ad hoc ("How did I name this picture again?"), while employing professionals to create metadata is expensive [ ]. Finally, because of the amount of metadata required, it is practically infeasible to allow users to search for particular segments inside a video.

Concept-based multimedia retrieval, which is based on document representations consisting of automatically detected concept occurrences, was proposed to overcome the limitations of manually created metadata; see Naphade and Smith [ ] for an overview of this emerging research discipline. For this introduction, the reader can think of a concept as a label attached to (a part of) a multimedia document where all users agree that this label is appropriate. For example, a concept could be a Flower, a Car or a scene being Outdoor. Here, we refer to concepts by English terms. However, these terms are just references to the concept, which itself is language independent and could be referred to in other languages or by computer codes. For example, the concept Flower could also be referred to as Fleur (French for Flower) or #F1 (a reference to this concept in a computer). Furthermore, a concept is modality independent. For example, the concept Singing Bird can occur in the visual modality as well as in the audio modality. Note that there are other research areas in information retrieval which use concepts, for example in the biomedical domain (Trieschnigg et al., 2009) or for the description of web pages [ ]. However, in this work we will focus on secured intelligent video data retrieval using data mining.
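The language and modality independence of concepts can be illustrated with a tiny data structure: the index stores only an internal identifier, while human-readable labels are mere localisations of it. The identifiers, labels and modality sets below reuse the examples from the text (#F1/Flower/Fleur, Singing Bird); the dictionary layout itself is an invented sketch, not an actual indexing format.

```python
# A concept is identified by an internal code; language-specific labels
# and the modalities it can occur in are attached as attributes.
concepts = {
    "#F1": {"labels": {"en": "Flower", "fr": "Fleur"},
            "modalities": {"visual"}},
    "#S1": {"labels": {"en": "Singing Bird", "fr": "Oiseau chanteur"},
            "modalities": {"visual", "audio"}},
}

def label_of(concept_id, language="en"):
    """Resolve the language-dependent label of a language-independent concept."""
    return concepts[concept_id]["labels"][language]

def occurs_in(concept_id, modality):
    """Check whether a concept can occur in a given modality."""
    return modality in concepts[concept_id]["modalities"]
```

Because retrieval operates on the identifiers, a query phrased as "Fleur" and one phrased as "Flower" can resolve to the same concept occurrences in the index.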

The Basic Components of a Retrieval Engine

This section introduces the basic components commonly used by retrieval engines. These components are motivated by the root challenge of information retrieval, described by Spärck Jones and Willett [ ] as follows: "The root challenge in retrieval is that (information-) user need and document content are both unobservable, and so is the relevance relation between them". Figure 1.1 shows the basic components of a retrieval engine, inspired by the conceptual model for information retrieval by Fuhr [ ]; the following discusses these components. The three topmost components in Figure 1.1 - information need, document content and relevance - are the central objects in information retrieval. They are, according to Spärck Jones and Willett [ ], unobservable, which means that the computer cannot comprehend their meaning; in essence, it cannot represent their content. For example, a retrieval engine will never be able to capture all aspects of a painting by van Gogh or an information need corresponding to "exciting times", not least because these will differ from user to user. As a result, the relevance of a document to an information need is also unobservable. In order to represent a document, the content analysis process extracts features from each document (see the right part of Figure 1.1). The output of the content analysis process is called the analysis result and consists of all supplied features produced by the content analysis process.

The match process iterates over all documents of a collection and applies the score function to the document representation, resulting in a ranking score value for each document. The documents are then sorted in descending order by the ranking score value to produce the answer to the query, a ranked list of documents.
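The match process just described can be sketched as a small function that takes the score function as a pluggable component, mirroring the separation of components in Figure 1.1. The bag-of-words document representations and the overlap-based score function below are invented placeholders for whatever representations and scoring model a concrete engine uses.

```python
def match(query_rep, doc_reps, score):
    """Iterate over all documents, score each one, and return the
    document identifiers sorted by ranking score in descending order."""
    scored = [(score(query_rep, rep), doc_id)
              for doc_id, rep in doc_reps.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored]

def overlap_score(query_rep, doc_rep):
    """Placeholder score function: size of the term overlap."""
    return len(query_rep & doc_rep)

# Hypothetical document representations (sets of index terms).
doc_reps = {
    "painting": {"van", "gogh", "painting", "sunflowers"},
    "report":   {"quarterly", "report"},
    "photo":    {"sunflowers", "field", "photo"},
}

ranking = match({"van", "gogh", "sunflowers"}, doc_reps, overlap_score)
```

Swapping `overlap_score` for a different score function changes the ranking behaviour without touching the match process itself, which is precisely why the two are modelled as separate components.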