
Data Multimedia Images

Disclaimer: This dissertation has been submitted by a student. This is not an example of the work written by our professional dissertation writers.

Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of UK Essays.

Chapter I


1.1 What is meant by Multimedia Data?

A number of data types can be characterized as multimedia data types. They are the basic building blocks of core multimedia environments, platforms and integration tools. The basic types are text, images, audio, video and graphic objects, each of which is described below.


Text

Text can be stored in a variety of forms. In addition to American Standard Code for Information Interchange (ASCII) based files, text is commonly stored in spreadsheets, annotations, word processor files, databases and common multimedia objects. Text storage is becoming more complex because of the abundance of Graphical User Interfaces (GUIs) and text fonts, which permit effects such as text color and text shading.


Images

Digitized images are sequences of pixels that represent a region of the user's graphical display. The storage required for still images varies greatly with quality and dimensions: the space overhead depends on the complexity, size, resolution and compression format of the image. Commonly used image formats (file extensions) include bmp, jpeg, tiff and png.


Audio

Audio, another frequently used data type, is relatively space intensive. A minute of sound can take up to 3 megabytes (MB) of space. Numerous methods can be used to compress audio into suitable formats.


Video

Digitized video is the data type that consumes the most space. Videos are normally stored as a series of frames, the size of which depends on the resolution. A single video frame can take up to 1 MB of space. A continuous transfer rate is needed for reasonable video playback, together with proper transmission, compression and decompression.

Graphic Objects

This data type consists of data structures that define 2D and 3D shapes, which in turn help define multimedia objects. Different formats are in use today for image applications and video-editing applications. Computer Aided Design (CAD) and Computer Aided Manufacturing (CAM) are examples of applications built on graphic objects.

1.2 How is Multimedia Data Different?

In theory, multimedia data could be treated like any regular data with types such as numbers, dates and characters. However, multimedia raises a few challenges, as described in [2]:

  • Multimedia data is usually captured using a variety of techniques, such as image processing, that may be unreliable. A multimedia system therefore needs to handle the various available methods of capturing content, both automated and manual.
  • In a multimedia database, the queries created by the user rarely return a textual answer. Rather, the answer to a user query is a compound multimedia presentation that the user can browse at leisure.
  • The large size of multimedia data affects not only storage and retrieval but also transmission.
  • Retrieval time may be critical when accessing video and audio databases, for example in Video on Demand.
  • Automatic feature extraction and indexing: In conventional databases, the user explicitly submits the attribute values of the objects inserted into the database. In contrast, multimedia databases need advanced tools, such as image processing and pattern recognition tools for images, to extract the various features and content of multimedia objects. Special data structures for storage and indexing are needed because of the large size of the data.

1.3 Basic Approaches for Data Retrieval

Data management has been practised for a long time, and many approaches have been devised to manage and query various types of data in computer systems. The commonly used approaches comprise conventional database systems, information retrieval systems, content-based retrieval systems and graph/tree pattern matching. Each is described below:

Conventional database system

It is the most widely used approach for managing and querying structured data. Data in a database system must conform to predefined structures and constraints (schemas). To formulate a database query, the user specifies the data objects to be retrieved, the tables from which the data is to be extracted, and the predicate on which the retrieval is based. SQL, a query language with a restricted syntax and vocabulary, can be used for such databases.
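
As a minimal sketch of such a query, the following uses Python's built-in sqlite3 module. The table `images` and its columns are hypothetical and chosen only for illustration; they do not come from the text.

```python
import sqlite3

# In-memory database with a hypothetical schema (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE images (id INTEGER PRIMARY KEY, title TEXT, width INTEGER, format TEXT)"
)
conn.executemany(
    "INSERT INTO images (title, width, format) VALUES (?, ?, ?)",
    [("sunset", 800, "jpeg"), ("logo", 64, "png"), ("map", 1200, "tiff")],
)

# The query names the data objects (columns), the table, and the predicate.
rows = conn.execute(
    "SELECT title, format FROM images WHERE width >= ? ORDER BY title",
    (800,),
).fetchall()
print(rows)
conn.close()
```

The data must fit the declared schema before it can be queried, which is the restriction the paragraph above describes.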

Information retrieval (IR) system

This system is mainly used to search enormous text collections, in which the content of the data (text) is described by an indexer using keywords or a textual summary. Queries are expressed in terms of keywords or natural language. For instance, to search for an image or video, the user has to describe it in words, and the system needs a means of storing a large amount of metadata in textual form.

Content based retrieval (CBR) system

This approach facilitates the retrieval of multimedia objects from an enormous collection. Retrieval is based on features such as color, texture and shape, which can be extracted automatically from the objects. Although a keyword can be regarded as a feature of textual data, conventional information retrieval performs better than content-based retrieval.

This is because keywords have a demonstrated ability to characterize semantics, whereas no other feature has shown convincing semantic descriptive power. A key disadvantage of this approach is its lack of accuracy.

Graph or tree pattern matching

This particular approach seeks the retrieval of object sub-graphs from an object graph as per several designated patterns.

Chapter II

Data Structures for Multimedia Storage

Many modern database applications deal with large amounts of multidimensional data. Multimedia content-based retrieval is one of the examples. Access Methods are essential in order to deal with multidimensional data efficiently. They are used to access selective data from a big collection.

2.1 Importance of Access Methods

Efficient support for spatial selection is the key purpose of access methods. Such selections include range queries and nearest neighbour queries on spatial objects. The significance of these access methods, and how they take into account both clustering techniques and spatial indexing, is described by Peter Van Oosterom [3]. In the absence of a spatial index, every object in the database has to be checked to see whether it meets the selection criteria. Clustering is required to group the objects that are often requested together; otherwise, many different disk pages have to be fetched, resulting in a very slow response.

For spatial selection, clustering implies storing objects that are not only close in reality but also close in computer memory instead of being scattered all over the whole memory.

In conventional database systems, sorting the data is the basis for efficient searching. Higher dimensional data cannot be sorted in an obvious manner, as is possible for text strings, numbers, or dates. In principle, computer memory is one-dimensional, whereas spatial data is 2D, 3D or even higher and must somehow be organized in memory. An intuitive solution is to organize the data using a regular grid, just as on a paper map. Each grid cell has a unique name, e.g. 'A1', 'C2', or 'E5'. The cells are stored in some order in memory and can each contain a fixed number of object references. A reference to an object is stored in a grid cell whenever the object overlaps the cell. However, this is not very efficient, because spatial data is irregularly distributed: many cells will be empty while many others will be overfull. Therefore, more advanced techniques have been developed.
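
The regular grid can be sketched as follows. This is an illustrative sketch only: the class name `GridIndex`, the `(xmin, ymin, xmax, ymax)` box convention and the object names are assumptions, and for brevity the cells hold unbounded lists rather than a fixed number of references.

```python
from collections import defaultdict


class GridIndex:
    """Regular grid over 2D space; each cell lists the objects overlapping it."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)  # (column, row) -> list of object ids

    def _cells_for(self, xmin, ymin, xmax, ymax):
        # All grid cells touched by the box.
        c0, r0 = int(xmin // self.cell_size), int(ymin // self.cell_size)
        c1, r1 = int(xmax // self.cell_size), int(ymax // self.cell_size)
        return [(c, r) for c in range(c0, c1 + 1) for r in range(r0, r1 + 1)]

    def insert(self, obj_id, box):
        # Store a reference in every cell the object's box overlaps.
        for cell in self._cells_for(*box):
            self.cells[cell].append(obj_id)

    def query(self, box):
        # Candidate objects: everything referenced from the cells the query touches.
        found = set()
        for cell in self._cells_for(*box):
            found.update(self.cells.get(cell, []))
        return found


grid = GridIndex(cell_size=10)
grid.insert("lake", (2, 2, 14, 8))
grid.insert("road", (25, 3, 28, 30))
print(grid.query((12, 0, 13, 5)))
```

A query only inspects the cells it touches, but the result is a candidate set: objects that share a cell with the query box still need an exact overlap test.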

2.2 kd Trees

A kd-tree, or k-dimensional tree, is a space-partitioning data structure used for organizing points in a k-dimensional space. kd-trees are useful for several applications, such as searches involving a multidimensional search key, including range searches and nearest neighbour searches. kd-trees are a special case of Binary Space Partitioning (BSP) trees.

A kd-tree only uses splitting planes that are perpendicular to one of the coordinate axes. This is different from BSP trees, in which arbitrary splitting planes can be used. In addition to this, every node of a kd-tree, from the root to the leaves, stores a point. Whereas in BSP trees, leaves are typically the only nodes that contain points. As a consequence, each splitting plane must go through one of the points in the kd-tree. [4]

2.2.1 Addition of elements to kd trees

A new point is added to a kd-tree in the same way as an element is added to any other search tree. First, traverse the tree, starting from the root and moving to either the left or the right child depending on whether the point to be inserted is on the left or right side of the splitting plane. Once you reach a leaf node, add the new point as either the left or right child of the leaf node, again depending on which side of the node's splitting plane contains the new point.
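
The insertion procedure can be sketched as follows. The sketch assumes that the splitting axis cycles with depth (x, y, x, ... in 2D) and that points equal to the node along the splitting axis go to the right; the names `KDNode` and `kd_insert` are illustrative.

```python
class KDNode:
    def __init__(self, point):
        self.point = point
        self.left = None
        self.right = None


def kd_insert(node, point, depth=0):
    """Insert point into the subtree rooted at node and return the subtree root."""
    if node is None:
        return KDNode(point)
    axis = depth % len(point)  # assumed rule: splitting axis cycles with depth
    if point[axis] < node.point[axis]:
        node.left = kd_insert(node.left, point, depth + 1)
    else:
        node.right = kd_insert(node.right, point, depth + 1)
    return node


root = None
for p in [(7, 2), (5, 4), (9, 6), (2, 3), (4, 7), (8, 1)]:
    root = kd_insert(root, p)
```

Because points arrive one at a time and are never rebalanced, the shape of the tree depends on the insertion order.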

2.2.2 Deleting from kd trees

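Deletion works much as in a Binary Search Tree (BST), but is slightly harder.

Step 1: Find the node to be deleted.

Step 2: Two cases must be handled:

(a) No children: replace the pointer to the node with NULL.

(b) Has children: replace the node with the minimum node in its right subtree, where the minimum is taken along the deleted node's splitting dimension, and then delete that minimum node from the right subtree. If no right subtree exists, first move the left subtree to become the right subtree. [1]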

2.3 Quad-trees

Each node of a quad-tree is associated with a rectangular region of space. The top node is associated with the entire target space. Each non-leaf node divides its region into four equal-sized quadrants; correspondingly, each such node has four child nodes, one per quadrant, and so on. Leaf nodes hold between zero and some fixed maximum number of points.

2.3.1 Simple definition of node structure of a point quad-tree

    qtnodetype = record
        INFO: infotype;
        XVAL: real;
        YVAL: real;
        NW, SW, NE, SE: *qtnodetype
    end


Here, INFO is some additional information regarding that point.

XVAL, YVAL are coordinates of that point.

NW, SW, NE, SE are pointers to regions obtained by dividing given region. [1]
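
The node structure can be mirrored in Python roughly as follows. The insertion and search routines are an illustrative sketch, and the tie-breaking rule (points on a dividing line go north or east) is an assumption rather than part of the definition above.

```python
class QTNode:
    def __init__(self, x, y, info=None):
        self.info = info          # INFO
        self.x, self.y = x, y     # XVAL, YVAL
        self.children = {"NW": None, "SW": None, "NE": None, "SE": None}


def quadrant(node, x, y):
    """Name of the quadrant of node's point that (x, y) falls into."""
    north_south = "N" if y >= node.y else "S"
    east_west = "E" if x >= node.x else "W"
    return north_south + east_west


def qt_insert(root, x, y, info=None):
    """Insert a point and return the root of the tree."""
    if root is None:
        return QTNode(x, y, info)
    node = root
    while True:
        key = quadrant(node, x, y)
        if node.children[key] is None:
            node.children[key] = QTNode(x, y, info)
            return root
        node = node.children[key]


def qt_search(root, x, y):
    """Return the node storing (x, y), or None if the point is absent."""
    node = root
    while node is not None:
        if (node.x, node.y) == (x, y):
            return node
        node = node.children[quadrant(node, x, y)]
    return None


root = None
for x, y, name in [(50, 50, "A"), (20, 80, "B"), (70, 10, "C"), (60, 90, "D")]:
    root = qt_insert(root, x, y, name)
```

Each step compares both coordinates at once, so a search descends exactly one of the four children per level.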

2.3.2 Common uses of Quad-trees

  1. Image Representation
  2. Spatial Indexing
  3. Efficient collision detection in two dimensions
  4. Storing sparse data, such as formatting information for a spreadsheet or for some matrix calculations.

2.3.3 Representing Image Using Quad-tree: [7]

Let us suppose we divide the picture area into 4 sections. Those 4 sections are then further divided into 4 subsections. We continue this process, repeatedly dividing a square region by 4. We must impose a limit to the levels of division otherwise we could go on dividing the picture forever. Generally, this limit is imposed due to storage considerations or to limit processing time or due to the resolution of the output device. A pixel is the smallest subsection of the quad tree.

To summarize, a square or quadrant in the picture is either:

  1. entirely one color
  2. composed of 4 smaller sub-squares

To represent a picture using a quad tree, each leaf must represent a uniform area of the picture. If the picture is black and white, we only need one bit to represent the colour in each leaf; for example, 0 could mean black and 1 could mean white. A picture is defined as a two-dimensional array whose elements are colored points. Now consider the following image:

Figure 2.3: First three levels of quad-tree

Figure 2.4: Given Image

This is how the above image could be stored in quad-tree.

Figure 2.5: 8x8 pixel picture represented in a quad-tree

Figure 2.6: The quad tree of the above example picture. The quadrants are shown in counterclockwise order from the top-right quadrant. The root is the top node. (The 2nd and 3rd quadrants are not shown.)
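
The recursive subdivision can be sketched as follows. The 4x4 bitmap is a made-up example, not the picture from the figures, and the sketch assumes a square image whose side is a power of two. Quadrants are listed counterclockwise from the top-right, as in Figure 2.6.

```python
def build_quadtree(image, top=0, left=0, size=None):
    """Return 0 or 1 for a uniform square, else a list of four subtrees."""
    if size is None:
        size = len(image)  # assumes a square image with a power-of-two side
    first = image[top][left]
    uniform = all(
        image[r][c] == first
        for r in range(top, top + size)
        for c in range(left, left + size)
    )
    if uniform:
        return first  # leaf: the whole square has one colour
    half = size // 2
    return [
        build_quadtree(image, top, left + half, half),         # top-right
        build_quadtree(image, top, left, half),                # top-left
        build_quadtree(image, top + half, left, half),         # bottom-left
        build_quadtree(image, top + half, left + half, half),  # bottom-right
    ]


# Hypothetical bitmap: 0 = black, 1 = white.
image = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 1],
]
tree = build_quadtree(image)
print(tree)
```

Three of the four top-level quadrants are uniform and collapse into single leaves; only the bottom-right quadrant is subdivided down to individual pixels.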

2.3.4 Advantages of Quad-trees:

  1. They can be manipulated and accessed much quicker than other models.
  2. Erasing an image takes only one step. All that is required is to set the root node to neutral.
  3. Zooming to a particular quadrant in the tree is also a one step operation.
  4. To reduce the complexity of the image, it suffices to remove the final level of nodes.
  5. Accessing particular regions of the image is a very fast operation. This is useful for updating certain regions of an image, perhaps for an environment with multiple windows.

The main disadvantage is that it takes up a lot of space.

2.4 R-trees

R-trees are an N-dimensional extension of B-trees, used as spatial access methods, i.e., for indexing multi-dimensional information. They are supported in many modern database systems, along with variants like R+-trees and R*-trees. The data structure splits space with hierarchically nested, and possibly overlapping, minimum bounding rectangles. [4]

A rectangular bounding box is associated with each tree node. [5] 

  • Bounding box of a leaf node is a minimum sized rectangle that contains all the rectangles/polygons associated with the leaf node.
  • Bounding box associated with a non-leaf node contains the bounding box associated with all its children.
  • Bounding box of a node serves as its key in its parent node (if any)
  • Bounding boxes of children of a node are allowed to overlap.

2.4.1 Structure of an R-tree node

    rtnodetype = record
        Rec1, ..., Reck: rectangle;
        P1, ..., Pk: *rtnodetype
    end


A polygon is stored in one node, and the bounding box of the node must contain the polygon. Since a polygon is stored only once, the storage efficiency of R-trees is better than that of k-d trees or quad-trees.

The insertion and deletion algorithms use the bounding boxes from the nodes to ensure that close by elements are placed in the same leaf node. Each entry within a leaf node stores two-pieces of information; a way of identifying the actual data element and the bounding box of the data element.

2.4.2 Inserting a node

1. Find a leaf to store it, and add it to the leaf.

  • To find the leaf, follow a child (if any) whose bounding box contains the bounding box of the data item; otherwise follow the child whose overlap with the data item's bounding box is largest.

2. Handle overflows by splits. We may need to divide entries of an overfull node into two sets such that the bounding boxes have minimum total area.
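
The rule for choosing which child to follow in step 1 can be sketched as follows. Boxes are assumed to be `(xmin, ymin, xmax, ymax)` tuples, and the function names are illustrative; node splitting (step 2) is not shown.

```python
def contains(outer, inner):
    """True if box outer fully contains box inner."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def overlap_area(a, b):
    """Area of the intersection of boxes a and b (0 if they do not overlap)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return width * height if width > 0 and height > 0 else 0


def choose_child(child_boxes, item_box):
    """Index of the child to descend into when inserting item_box."""
    for i, box in enumerate(child_boxes):
        if contains(box, item_box):
            return i
    # No child contains the item: take the child with the largest overlap.
    return max(range(len(child_boxes)),
               key=lambda i: overlap_area(child_boxes[i], item_box))


children = [(0, 0, 10, 10), (8, 8, 20, 20)]
print(choose_child(children, (2, 2, 3, 3)))    # contained in the first child
print(choose_child(children, (5, 5, 12, 12)))  # contained in neither child
```

Applying `choose_child` repeatedly from the root leads to the leaf that receives the new entry.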

2.4.3 Deleting a node

1. Find the leaf and delete object; determine new MBR.

2. If the node is too empty:

  • Delete the node recursively at its parent
  • Insert all entries of the deleted node into the R-tree

2.4.4 Searching R-trees

Similarly, the search algorithm uses bounding boxes to decide whether or not to search inside a child node: only children whose minimal bounding rectangle overlaps the query are visited. In this way, most of the nodes in the tree are never touched during a search.

  1. If the node is a leaf node, output the data items whose keys intersect the given query point/region
  2. Else, for each child of the current node whose bounding box overlaps the query point/region, recursively search the child.
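
The two search steps can be sketched as follows on a small hand-built tree. The class `RTNode`, the `(xmin, ymin, xmax, ymax)` box convention and the object names are assumptions made for this illustration; insertion and splitting are not shown.

```python
class RTNode:
    def __init__(self, entries, leaf):
        # entries: list of (box, payload). In a leaf the payload identifies a
        # data item; in an internal node it is a child RTNode.
        self.entries = entries
        self.leaf = leaf


def intersects(a, b):
    """True if boxes a and b overlap (touching edges count as overlap)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def mbr(boxes):
    """Minimum bounding rectangle of a list of boxes."""
    xmins, ymins, xmaxs, ymaxs = zip(*boxes)
    return (min(xmins), min(ymins), max(xmaxs), max(ymaxs))


def rt_search(node, query):
    """Return the data items whose boxes intersect the query box."""
    results = []
    for box, payload in node.entries:
        if not intersects(box, query):
            continue  # the whole subtree is skipped
        if node.leaf:
            results.append(payload)
        else:
            results.extend(rt_search(payload, query))
    return results


leaf1 = RTNode([((0, 0, 2, 2), "a"), ((3, 1, 5, 4), "b")], leaf=True)
leaf2 = RTNode([((10, 10, 12, 12), "c"), ((11, 8, 14, 11), "d")], leaf=True)
root = RTNode(
    [(mbr([b for b, _ in leaf.entries]), leaf) for leaf in (leaf1, leaf2)],
    leaf=False,
)
print(rt_search(root, (4, 3, 11, 9)))
```

Because the bounding boxes of sibling nodes may overlap, a single query can descend into more than one child, as it does here.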

2.5 Comparison of Different Data Structures [1]

  • k-d trees are very easy to implement. However, in general a k-d tree containing k nodes may have height k, which makes both insertion and search expensive. In practice, path lengths (root to leaf) in k-d trees tend to be longer than those in point quad-trees because k-d trees are binary.
  • R-trees potentially store a large number of rectangles in each node. This makes them appropriate for disk-based storage: the height of the tree is reduced, leading to fewer disk accesses.
  • The disadvantage of R-trees is that the bounding rectangles associated with different nodes may overlap. Thus when searching an R-tree, instead of following one path (as in the case of a quad-tree), we might follow multiple paths down the tree. This difference grows even more acute when range searches and neighbour searches are considered.
  • In point quad-trees, each step of a search or insertion requires comparisons on two coordinates. Deletion in point quad-trees is difficult because finding a candidate replacement node for the node being deleted is not easy.

Chapter III


Metadata is data about data. Any data that is used to describe the content, condition, quality and other aspects of data for humans or machines to locate, access and understand the data is known as Metadata. Metadata helps the users to get an overview of the data.

3.1 Need of Metadata

The main functions of metadata can be listed as follows: [8]


  • To describe and identify data sources. These descriptions help create catalogs, indexes, etc., thereby improving access to them.
  • To support the formulation of queries.
  • To provide information to help manage and administrate a data source, such as when and how it was created, and who can legally access it.
  • To facilitate data archival and preservation, such as data refreshing and migration.
  • To indicate how a system functions or metadata behaves, such as data formats, compression ratios, scaling routines, encryption keys, and security.
  • To indicate the level and type of use of data sources, such as multiversioning and user tracking.

3.2 Metadata in the Life Cycle of Multimedia Objects

A multimedia object undergoes a life cycle consisting of production, organization, searching, utilization, preservation, and disposition. Metadata passes through similar stages as an integral part of these multimedia objects [8]:


Production

Objects of different media types are created, often generating data on how they were produced (e.g., the EXIF files produced by digital cameras), and stored in an information retrieval system. Associated metadata is generated accordingly for administrating and describing the objects.


Organization

Multimedia objects may be composed of several components. Metadata is created to specify how these compound objects are put together.

Searching and retrieval

Created and stored multimedia objects are subject to search and retrieval by users. Metadata provides aids through catalog and index to enable efficient query formulation and resource localization.


Utilization

Retrieved multimedia objects can be further utilized, reproduced, and modified. Metadata related to digital rights management, version control, etc. may be created.

Preservation and disposition

Multimedia objects may undergo modification, refreshing, and migration to ensure their availability. Objects that are out-of-date or corrupted may be discarded. Such preservation and disposition activities can be documented by the associated metadata.

3.3 Classification of Metadata

Metadata directly affects the way in which objects of different media types are used. Classifying metadata can facilitate the handling of different media types in a multimedia information retrieval system. Based on its (in)dependence on media contents, metadata can be classified into two kinds, namely content independent and content-dependent metadata [8]:

  • Content-independent metadata provides information that is derived independently of the content of the original data. Examples of content-independent metadata are the date of creation and location of a text document, the type of camera used to record a video fragment, and so on. These metadata are called descriptive data.
  • Content-dependent metadata depends on the content of the original data. A special case of content-dependent metadata is content-dependent descriptive metadata , which cannot be extracted automatically from the content but is created manually: annotation is a well-known example. In contrast, content-dependent non-descriptive metadata is based directly on the contents of data.

3.4 Image metadata

Some of the image files containing metadata include Exchangeable image file format (EXIF) and Tagged Image File Format (TIFF).

Having metadata about images embedded in TIFF or EXIF files is one way of acquiring additional data about an image. Image metadata are attained through tags. Tagging pictures with subjects, related emotions, and other descriptive phrases helps Internet users find pictures easily rather than having to search through entire image collections.

A prime example of an image tagging service is Flickr, where users upload images and then describe the contents. Other patrons of the site can then search for those tags. Flickr uses a folksonomy: a free-text keyword system in which the community defines the vocabulary through use rather than through a controlled vocabulary.

Digital photography is increasingly making use of metadata tags. Photographers shooting Camera RAW file formats can use applications such as Adobe Bridge or Apple Computer's Aperture to work with camera metadata for post-processing. Users can also tag photos for organization purposes using Adobe's Extensible Metadata Platform (XMP) language, for example. [4]

3.5 Document metadata

Most programs that create documents, including Microsoft PowerPoint, Microsoft Word and other Microsoft Office products, save metadata with the document files. These metadata can contain the name of the person who created the file, the name of the person who last edited the file, how many times the file has been printed, and even how many revisions have been made to the file. Other saved material, such as document comments, is also referred to as metadata.

Document metadata is particularly important in legal environments, where litigation can request this sensitive information, which may include many elements of private and potentially detrimental data. Such data has been linked to multiple lawsuits that have landed corporations in legal complications. [4]

3.6 Digital library metadata

There are three variants of metadata that are commonly used to describe objects in a digital library:

  • descriptive - Information describing the intellectual content of the object, such as cataloguing records, finding aids or similar schemes. It is typically used for bibliographic purposes and for search and retrieval.
  • structural - Information that ties each object to others to make up logical units e.g., information that relates individual images of pages from a book to the others that make up the book.
  • administrative - Information used to manage the object or control access to it. This may include information on how it was scanned, its storage format, copyright and licensing information, and information necessary for the long-term preservation of the digital objects. [4]

Chapter IV

Text Databases

Basic text consists of alphanumeric characters. Optical character recognition (OCR) techniques are used to convert analog text to digital text. The most common digital representation of characters is the ASCII code, which requires seven bits per character (eight bits may be used, in which case the eighth bit is reserved for a special purpose). The storage space required for a text document is therefore roughly one byte per character. For instance, a 15-page text document with about 4000 characters per page consumes about 60 kilobytes.

Nowadays, structured text documents have become extremely popular. They comprise titles, chapters, sections, paragraphs, and so forth. A title can be presented to the user in a different format than a paragraph or a sentence. Different standards, such as HTML (HyperText Markup Language) and XML (Extensible Markup Language), are used to encode structured information.

There are different approaches like Huffman and Arithmetic Coding, which can be used for text compression, but as the storage requirements are not too high, these approaches are not as important for text as they are for multimedia data. [10]

4.1 Text Documents

A text document consists of an identification and is considered to be a list of words. A book is thus considered to be a document, and so is a paper in the proceedings of a conference or a Web page. The identification may be an ISBN for a book, the title of the paper together with the ISBN of the conference proceedings for a paper, or a URL for a Web page.

Retrieval of text documents does not normally entail the presentation of the entire document, as it consumes a large amount of space as well as time. Instead, the system presents the identifications of the chosen documents mainly along with a brief description and/or rankings of the document.

4.2 Indexing

Indexing refers to the derivation of metadata from documents and its storage in an index. In a way, the index describes the content of the documents. For text documents, the content can be described by terms such as "social" or "political". The system also uses the index to determine the output during retrieval.

The index can be filled up in two ways, manually as well as automatically. Assigned terms can be added to documents as a kind of annotation by professional users such as librarians. These terms can be selected often from a prescribed set of terms, the catalog. A catalog describes a certain scientific field and is composed by specialists. One of the main advantages of this technique is that the professional users are aware of the acceptable terms that can be used in query formulation. A major drawback of this technique is the amount of work that has to be performed for the manual indexing process.

Document content description can also be derived automatically, resulting in what are termed derived terms. One of the steps required is to identify the words in the English text with an algorithm and convert them to lower case. Other steps use basic tools such as stop word removal and stemming. Stop words are words that carry little meaning, such as "the" and "it"; they are removed from the document. Stemming conflates the words in the document to their stems. For example, a stemmer can conflate the words "computer", "compute" and "computation" to the stem "comput".
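
The derivation steps can be sketched as follows. The stop word list is a tiny illustrative sample, and the stemmer is a naive suffix stripper written for this example; it is not the Porter stemmer or any other established algorithm.

```python
import re

# Tiny illustrative stop word list (an assumption, not a standard list).
STOP_WORDS = {"the", "it", "a", "an", "of", "and", "is", "to"}

# Naive suffix list, checked in order (an assumption for this sketch).
SUFFIXES = ("ations", "ation", "ers", "er", "ing", "es", "e", "s")


def stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def derive_terms(text):
    """Identify words, lower-case them, remove stop words, and stem the rest."""
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(word) for word in words if word not in STOP_WORDS]


print(derive_terms("The computer and the computation: it computes."))
```

A production indexer would use a much larger stop word list and a linguistically motivated stemmer; the point here is only the order of the steps.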

4.3 Query Formulation

Query formulation refers to the method of representing the information need. The resultant formal representation of information is the query. In a wider perspective, query formulation denotes the comprehensive interactive dialogue between the system and the user, leading to both a suitable query and also a better understanding by the user of the information need. It also denotes the query formulation when there are no previously retrieved documents to direct the search, thus, the formulation of the preliminary query.

It is essential to differentiate between the expert searcher and the casual end user. The expert searcher knows the document collection and the assigned terms. He or she uses Boolean operators to create the query and is able to rephrase it adequately according to the output of the system. If the result is too small, the expert searcher must broaden the query; if the result is too large, he or she must make the query more restrictive.

The end user is interested in communicating the information need to the system in natural language. Such a statement of the information need is termed a request. Automatic query formulation consists of receiving the request and generating a preliminary query by applying the same algorithms that were used for the derivation of terms. In general, the query consists of a list of query terms. The system accepts this list and composes a result set. Based on relevance feedback from the user, the system can then formulate a successive query.

4.4 Matching

The matching algorithm is arguably the most important part of an information retrieval system. This algorithm compares the query against the document representations in the index. In an exact matching algorithm, a Boolean query, formulated by an expert searcher, defines precisely the set of documents that satisfy the query. The system generates a yes or no decision for each document.

In the case of an inexact matching algorithm, the system delivers a ranked list of documents. Users can traverse this list to search for the information they need. Ranked retrieval puts the relevant documents at the top of the ranked list, thus saving the time the user has to invest in reading documents. Simple but effective ranking algorithms make use of the frequency distribution of terms over documents. Ranking algorithms based on statistical approaches can halve the time the user has to spend reading documents.
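
The difference between exact and inexact matching can be sketched as follows. The three documents and their terms are invented for the example, and the ranking score (the summed frequency of the query terms) is only one simple choice among many.

```python
from collections import Counter

# Hypothetical indexed documents: document id -> list of index terms.
docs = {
    "d1": ["image", "retrieval", "color", "image"],
    "d2": ["text", "retrieval", "index"],
    "d3": ["image", "compression"],
}
# Index: document id -> term frequencies.
index = {doc_id: Counter(terms) for doc_id, terms in docs.items()}


def boolean_and(query_terms):
    """Exact match: a yes/no decision per document (all terms must occur)."""
    return sorted(
        doc_id for doc_id, tf in index.items()
        if all(term in tf for term in query_terms)
    )


def ranked(query_terms):
    """Inexact match: score documents by summed term frequency, best first."""
    scores = {
        doc_id: sum(tf[term] for term in query_terms)
        for doc_id, tf in index.items()
    }
    hits = [(doc_id, score) for doc_id, score in scores.items() if score > 0]
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))


query = ["image", "retrieval"]
print(boolean_and(query))
print(ranked(query))
```

The Boolean query returns only the document containing both terms, whereas the ranked query also returns partial matches, ordered so that the best candidate comes first.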

Chapter V

Image Databases

Digital images can be defined as electronic snapshots taken of a scene or scanned from documents, for example printed texts, photographs, manuscripts, and various artworks.

A digital image is modeled as a grid of dots called pixels (picture elements). Each pixel is assigned a tonal value, which can be black, white, a shade of gray, or a color. The pixel itself is represented in binary code as zeros and ones. The computer stores the binary digits (bits) for each pixel in sequence, often reducing them to a more compact mathematical representation through compression. The bits are then read and interpreted by the computer to produce an analog output for display or printing.

Figure 5.1: As shown in this bitonal image, each pixel is assigned a tonal value, in this example 0 for black and 1 for white.

A grayscale pixel is typically stored in one byte, that is, eight bits. A color pixel needs three color components of one byte each: red, green and blue. So, for a rectangular screen, the amount of data required for the image can be computed using the formula:

A = xyb

Where A is the number of bytes needed,

x is the number of pixels per horizontal line,

y is the number of horizontal lines, and

b is the number of bytes per pixel.

Using this formula for a screen with x = 800, y = 600 and b = 3 gives A = 800 × 600 × 3 = 1,440,000 bytes, that is, A = 1.44 Mbyte.
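
The same arithmetic can be checked with a few lines of Python. The function name `image_bytes` is illustrative, and 1 Mbyte is taken as 1,000,000 bytes, which is what the figure of 1.44 implies.

```python
def image_bytes(x, y, b):
    """Uncompressed image size in bytes: A = x * y * b."""
    return x * y * b


colour = image_bytes(800, 600, 3)  # three bytes per pixel (red, green, blue)
gray = image_bytes(800, 600, 1)    # one byte per grayscale pixel
print(colour, "bytes =", colour / 1_000_000, "Mbyte")
print(gray, "bytes =", gray / 1_000_000, "Mbyte")
```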

Such a significant amount of data calls for compression. Image compression exploits redundancy in images and properties of human perception. Pixels in a given area tend to be similar; this similarity is called spatial redundancy. Human viewers tolerate some information error or loss, which means that the compressed image does not need to represent the original image exactly. A compressed image with some error may still allow effective communication. [8]

5.1 Image Compression Algorithms [14]

Lossless and Lossy are the two major types of image file compression algorithms being used.

Lossless compression algorithms reduce the size of a file with no loss of image quality. However, they usually do not compress an image into as small a file as a lossy method does. Lossless algorithms are used when image quality matters more than file size.

Lossy compression algorithms, on the other hand, take advantage of the natural limitations of the human eye and discard information that cannot be seen. Most lossy compression algorithms allow variable levels of quality. As the level of compression increases, the file size is reduced. At the highest levels of compression, the deterioration in image quality becomes quite noticeable. This deterioration is known as compression artifacting.

Listed below are some of the most commonly used compression algorithms for image data:

5.1.1 Run Length Encoding (RLE)

RLE is the simplest of the compression techniques in use. RLE algorithms are lossless and generally work by searching for runs of bits, bytes, or pixels of the same value, and by encoding the length as well as the value of each run. RLE achieves its best results with images that include large areas of contiguous colour, particularly monochrome images. For complex color images, such as photographs, RLE algorithms often do not compress well, and in some cases RLE can even increase the size of the image file.

For instance, consider a screen consisting of plain black text on a solid white background. There will be several long runs of white pixels in the blank space, and several short runs of black pixels within the text. To elaborate, take a hypothetical example of a single scan line, where B represents a black pixel and W represents a white pixel:

If one applies the RLE data compression algorithm to the above hypothetical scan line, the result will be as follows:


Interpret this as twelve W's, one B, twelve W's, three B's, etc.
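A minimal Python sketch of this kind of run-length coding is given below. The scan line used here is an assumption: it is built to begin with twelve W's, one B, twelve W's and three B's as described, and the remaining runs are purely illustrative. The helper names `rle_encode` and `rle_decode` are likewise hypothetical.

```python
from itertools import groupby


def rle_encode(line: str) -> str:
    """Encode each run of identical symbols as <run length><symbol>."""
    return "".join(f"{len(list(run))}{symbol}" for symbol, run in groupby(line))


def rle_decode(encoded: str) -> str:
    """Rebuild the original line; assumes symbols are non-digit characters."""
    out, count = [], ""
    for ch in encoded:
        if ch.isdigit():
            count += ch          # accumulate a multi-digit run length
        else:
            out.append(ch * int(count))
            count = ""
    return "".join(out)


# Assumed scan line: only the first four runs follow the description in the text.
scan_line = "W" * 12 + "B" + "W" * 12 + "B" * 3 + "W" * 24 + "B" + "W" * 14
encoded = rle_encode(scan_line)
print(encoded)                       # 12W1B12W3B24W1B14W
print(len(scan_line), len(encoded))  # 67 18
```

Encoding a line with no long runs, such as `rle_encode("WBWB")`, yields `1W1B1W1B`, which is longer than the input. This illustrates how RLE can increase file size for complex images.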

A number of RLE variants are in common use; they are encountered in the Tagged Image File Format (TIFF), PC Paintbrush Exchange (PCX) and Bitmap (BMP) graphic formats.

5.1.2 Lempel-Ziv-Welch (LZW)

Terry Welch developed the LZW compression algorithm in 1984 as a modification to the LZ78 compressor. It is a lossless technique that can be applied to any data type, but is most commonly used for image compression. LZW compression is useful for images that consist of color depths from 1-bit (monochrome) to 24-bit (True Colour).

LZW compression is used in various common graphics file formats including Tagged Image File Format (TIFF) and Graphics Interchange Format (GIF).

5.1.3 Huffman Encoding

David Huffman developed Huffman encoding in 1952. It is one of the oldest and most recognized compression algorithms. It is a lossless algorithm and is used to provide a final compression stage in many modern compression schemes, such as JPEG.

Huffman coding provides a useful way to compress data by determining the frequency of occurrence for each character. The idea behind the method is to assign bit codes of varying lengths to characters where more common characters receive a short code and less common characters receive a longer one. It is best used on images which have large amounts of data repetition. [15]
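The idea of giving frequent symbols shorter codes can be sketched in Python as follows. This is an illustrative construction using a priority queue, not the exact procedure of any particular image format; the function name `huffman_codes` and the sample string are assumptions.

```python
import heapq
from collections import Counter


def huffman_codes(text: str) -> dict[str, str]:
    """Build a prefix-free bit code in which frequent symbols get shorter codes."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:                       # a single symbol still needs one bit
        return {next(iter(freq)): "0"}
    # Heap entries: (frequency, unique tie-breaker, {symbol: code built so far})
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # the two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next_id, merged))
        next_id += 1
    return heap[0][2]


text = "aaaaaaaabbbc"                        # 'a' is common, 'c' is rare
codes = huffman_codes(text)
encoded_bits = sum(len(codes[ch]) for ch in text)
print(encoded_bits)                          # 16, versus 96 bits at one byte per character
```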

5.1.4 JPEG

The JPEG compression algorithm was introduced as a compression technique for the transmission of color and grayscale images. It was developed in 1990 by the Joint Photographic Experts Group of the International Organization for Standardization (ISO) and the International Telegraph and Telephone Consultative Committee (CCITT). JPEG is a lossy technique which provides the best compression rates with complex 24-bit (True Colour) images. It functions by using the Discrete Cosine Transform (DCT) to discard image data that is unnoticeable to the human eye. It then applies Huffman encoding to achieve further compression.

JPEG compression is used in the JPEG File Interchange Format (JFIF), Still Picture Interchange File Format (SPIFF) and TIFF.

5.1.5 Fractal Compression

Fractal compression uses the mathematical principles of fractal geometry to identify redundant repeating patterns within images. These matching patterns may be identified through performing geometrical transformations, such as scaling and rotating, on elements of the image. Once identified, a repeating pattern need only be stored once, together with the information on its locations within the image and the required transformations in each case.

Fractal compression is extremely computationally intensive, although decompression is much faster. It is a lossy technique, which can achieve large compression rates. Unlike other lossy methods, higher compression does not result in pixelation of the image and, although information is still lost, this tends to be less noticeable. Fractal compression works best with complex images and high colour depths.

5.2 Common File Types [11],[19],[4]

  • JPEG (Joint Photographic Experts Group) files are a lossy format. The DOS filename extension is JPG, although other operating systems may use JPEG. Nearly all digital cameras have the option to save images in JPEG format. The JPEG format supports 8 bits per color – red, green, and blue, for 24-bit total – and produces relatively small file sizes.
  • TIFF (Tagged Image File Format) is a flexible image format that normally saves 8 or 16 bits per color – red, green and blue – for a total of 24 or 48 bits, and uses a filename extension of TIFF or TIF. TIFF can be lossy or lossless.
  • RAW refers to a family of raw image formats that are options available on some digital cameras. These formats usually use a lossless or nearly lossless compression, and produce file sizes much smaller than the TIFF formats of full-size processed images from the same cameras.
  • PNG (Portable Network Graphics) file format is regarded, and was made, as the free and open-source successor to the GIF file format. The PNG file format supports true color (16 million colors) whereas the GIF file format only allows 256 colors.
  • GIF (Graphics Interchange Format) is limited to an 8-bit palette, or 256 colors. This makes the GIF format suitable for storing graphics with relatively few colors such as simple diagrams, shapes, logos and cartoon style images. It also uses a lossless compression that is more effective when large areas have a single color, and ineffective for detailed images or dithered images.
  • BMP file format (Windows bitmap) is used internally in the Microsoft Windows operating system to handle graphics images. These files are typically not compressed, resulting in large files. The main advantage of BMP files is their wide acceptance, simplicity, and use in Windows programs.

5.3 Advantages of Digital Images

There are a number of advantages of storing two-dimensional materials in digital formats. [13]

  • Digital images do not deteriorate physically over time whereas the originals can deteriorate.
  • Digital images allow identical reproduction quality from copy to copy.
  • Digital images may be manipulated far more easily than by photographic means.
  • Digital images can easily be linked to textual descriptions and catalog records.
  • Access is greatly improved, using standard Internet technologies.

5.4 Content based Image Retrieval [16],[17],[18],[20]

Content based image retrieval (CBIR) is the application of computer vision to the image retrieval problem, i.e., the problem of searching for digital images in large databases.

"Content-based" means that the search will analyze the actual contents of the image. The term 'content' in this context might refer to colors, shapes, textures, or any other information that can be derived from the image itself. Without the ability to examine image content, searches must rely on metadata such as captions or keywords, which may be laborious or expensive to produce.

5.4.1 Query Techniques

Different implementations of CBIR make use of different types of user queries.

Query by example

Query by example is a query technique that involves providing the CBIR system with an example image that it will then base its search upon. The underlying search algorithms may vary depending on the application, but result images should all share common elements with the provided example.

Ways for providing sample images to the system include:

  • The user may choose from a random set or a pre-existing image may be supplied.
  • The user may draw a rough approximation of the image he/she is looking for, for example with blobs of color or general shapes.

This query technique removes the difficulties that arise when trying to describe images with words.

Other query methods

Other methods include specifying the proportions of colors desired (e.g. "80% red, 20% blue") and searching for images that contain an object given in a query image.

CBIR systems can also make use of relevance feedback, where the user progressively refines the search results by marking images in the results as "relevant", "not relevant", or "neutral" to the search query, then repeating the search with the new information.

5.4.2 Content Comparison Techniques

Described below are some common methods for extracting content from images so that they can be easily compared. The methods outlined are not specific to any particular application domain.

Color

Retrieving images based on color similarity is achieved by computing a color histogram for each image that identifies the proportion of pixels within an image holding specific values (that humans express as colors). Current research is attempting to segment color proportion by region and by spatial relationship among several color regions. Examining images based on the colors they contain is one of the most widely used techniques because it does not depend on image size or orientation. Color searches will usually involve comparing color histograms, though this is not the only technique in practice.

Texture

Texture measures look for visual patterns in images and how they are spatially defined. Textures are represented by texels (texture pixels), which are then placed into a number of sets, depending on how many textures are detected in the image. These sets not only define the texture, but also where in the image the texture is located. Texture is a difficult concept to represent.

The identification of specific textures in an image is achieved primarily by modeling texture as a two-dimensional gray level variation. The relative brightness of pairs of pixels is computed such that degree of contrast, regularity, coarseness and directionality may be estimated. However, the problem is in identifying patterns of co-pixel variation and associating them with particular classes of textures such as "silky" or "rough".

Shape

Shape does not refer to the shape of an image but to the shape of a particular region that is being sought out. Shapes will often be determined by first applying segmentation or edge detection to an image. In some cases accurate shape detection will require human intervention because methods like segmentation are very difficult to completely automate.
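Returning to the color comparison described earlier in this section, the following Python sketch shows one way a normalized color histogram could be computed and compared. The bin count (4 per channel), the histogram-intersection similarity measure and the function names are all assumptions made for illustration; real CBIR systems may use other color spaces and distance measures.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Proportion of pixels in each coarse RGB bin; pixels is a non-empty list of (r, g, b)."""
    step = 256 // bins_per_channel
    hist = [0] * bins_per_channel ** 3
    for r, g, b in pixels:
        index = ((r // step) * bins_per_channel + g // step) * bins_per_channel + b // step
        hist[index] += 1
    total = len(pixels)
    return [count / total for count in hist]   # proportions, so image size drops out


def histogram_intersection(h1, h2):
    """Similarity between 0.0 (no shared colors) and 1.0 (identical proportions)."""
    return sum(min(a, b) for a, b in zip(h1, h2))


# Two "images" of different size and pixel order but the same color proportions.
image_a = [(255, 0, 0)] * 80 + [(0, 0, 255)] * 20
image_b = [(0, 0, 255)] * 5 + [(250, 10, 5)] * 20
image_c = [(0, 255, 0)] * 50                   # entirely green

print(histogram_intersection(color_histogram(image_a), color_histogram(image_b)))
print(histogram_intersection(color_histogram(image_a), color_histogram(image_c)))
```

Because the histograms hold proportions rather than counts, the first comparison scores close to 1.0 even though the two images differ in size and pixel order, while the second scores 0.0.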

5.4.3 Potential uses of CBIR

  • Art collections
  • Photograph archives
  • Retail catalogs
  • Medical diagnosis
  • Crime prevention
  • The military
  • Intellectual property
  • Architectural and engineering design
  • Geographical information and remote sensing systems

Chapter VI

Audio Databases

Audio is caused by air pressure waves having a frequency and amplitude. When the frequency of the waves is between 20 and 20,000 Hertz, a human hears a sound. A low amplitude causes the sound to be soft.

6.1 How to digitize these pressure waveforms?

First, the air wave is transformed into an electrical signal (by a microphone). This signal is converted into discrete values by processes called sampling and quantization. Sampling causes the continuous time axis to be divided into small, fixed intervals, see Fig 6.1(b). The number of intervals per second is called the sampling rate. The determination of the amplitude of the audio signal at the beginning of a time interval is called quantization.

So the continuous audio signal is approximated by a sequence of values, see Fig 6.1(c). If the sampling rate is high enough and the quantization is precise enough, the human ear will not notice any difference between the analog and digital audio signal. The process just described is called analog-to-digital conversion (ADC); the other way around is called digital-to-analog conversion (DAC). [8]

Figure 6.1: Analog-to-digital conversion. (a) Original Analog signal; (b) Sampling pulses; (c) quantization; (d) digitized values.
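The sampling and quantization steps can be sketched in Python as below. The 440 Hz test tone, the 8,000 Hz sampling rate, the 8-bit resolution and the function name `sample_and_quantize` are all illustrative assumptions, and the input signal is assumed to stay within the range -1.0 to 1.0.

```python
import math


def sample_and_quantize(signal, duration_s, sampling_rate, bits):
    """Sample signal(t) at fixed intervals and map each amplitude to an integer level."""
    levels = 2 ** bits
    count = round(duration_s * sampling_rate)
    digitized = []
    for i in range(count):
        t = i / sampling_rate                  # start of the i-th time interval
        amplitude = signal(t)                  # assumed to lie in [-1.0, 1.0]
        level = int((amplitude + 1.0) / 2.0 * levels)
        digitized.append(min(levels - 1, level))   # clamp the +1.0 edge case
    return digitized


def tone(t):
    """Hypothetical analog input: a 440 Hz sine wave."""
    return math.sin(2 * math.pi * 440 * t)


digitized = sample_and_quantize(tone, duration_s=0.001, sampling_rate=8000, bits=8)
print(len(digitized))   # 8 samples for one millisecond at 8,000 Hz
print(digitized[0])     # 128, the mid level for a zero amplitude
```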

6.2 Compression

Since audio data occupies a lot of space, there has long been a driving force to compress it. Compression techniques are of two basic types: lossless and lossy. A lossless compression technique is one that yields a compressed signal from which the original signal can be reconstructed perfectly. No information is lost as a result of the compression. A lossy compression technique is one that discards information. The original signal cannot be reconstructed perfectly from a signal compressed by a lossy method. Some of the compression methods are listed below. [9],[23],[25]

6.2.1 VOC File Compression

This is the simplest compression technique that simply removes any silence from the entire sample. This form of compression was introduced by Creative Labs. This method analyzes the whole sample and then codes the silence into the sample using byte codes. It is similar to run-length coding.

6.2.2 Linear Predictive Coding (LPC) and Code Excited Linear Predictor (CELP)

This was an early development in audio compression that was used primarily for speech. A Linear Predictive Coding (LPC) encoder compares speech to an analytical model of the vocal tract, then discards the speech and stores the parameters of the best-fit model. The output quality was poor and was often compared to computer speech and thus is not used much today.

A later development, Code Excited Linear Predictor (CELP), increased the complexity of the speech model further, while allowing for greater compression due to faster computers, and produced much better results. Sound quality improved, while the compression ratio increased. The algorithm compares speech with an analytical model of the vocal tract and computes the errors between the original speech and the model. It transmits both model parameters and a very compressed representation of the errors.

6.2.3 Adaptive Differential Pulse Code Modulation (ADPCM)

This process is a simple conversion based on the notion that the changes between samples will not be very large. The first sample value is stored in full, and each successive value then records only how the wave changes, within a range of about +/- 8 levels, which takes only 4 bits instead of 16. Hence, a 4:1 compression ratio is achieved, with less loss as the sampling frequency increases. Due to its simplicity, wide acceptance, and high level of compression, this method is widely used.
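The differential idea can be illustrated with the simplified Python sketch below. Note that it is plain, non-adaptive differential coding: real ADPCM also adapts the step size as the signal changes, which is omitted here. The function names and the signed 4-bit delta range of -8 to 7 are assumptions.

```python
def dpcm_encode(samples):
    """Store the first sample in full, then one clamped 4-bit delta per sample."""
    first = samples[0]
    deltas = []
    reconstructed = first
    for sample in samples[1:]:
        delta = max(-8, min(7, sample - reconstructed))  # signed 4-bit range
        deltas.append(delta)
        reconstructed += delta      # track what the decoder will actually rebuild
    return first, deltas


def dpcm_decode(first, deltas):
    """Rebuild the waveform by accumulating the deltas."""
    out = [first]
    for delta in deltas:
        out.append(out[-1] + delta)
    return out


smooth = [100, 103, 108, 110, 105, 101]
first, deltas = dpcm_encode(smooth)
print(deltas)                       # [3, 5, 2, -5, -4]
print(dpcm_decode(first, deltas))   # [100, 103, 108, 110, 105, 101]
```

A slowly changing signal such as `smooth` is recovered exactly. A sudden jump larger than the delta range is clipped and only caught up over later samples, which is where the loss comes from.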

6.2.4 MPEG for Audio [21],[22]

The Moving Picture Experts Group (MPEG) audio compression algorithm is an International Organization for Standardization (ISO) standard for high fidelity audio compression. It is one part of a three-part compression standard, the other two being video and system. MPEG compression is lossy, but it can nevertheless achieve perceptually lossless compression.

MPEG compression is based on psychoacoustic theory. The principle behind this is: if the listener cannot hear the sound, then it need not be coded. Human hearing is quite sensitive, but making out differences in a large collection of sounds is difficult. The phenomenon where a strong signal covers the sound of the softer signal so that the human ear cannot hear the softer one is known as masking. MPEG compression uses masking as the basis for compressing the audio data.

In addition to encoding a single signal, the MPEG compression supports one or two audio channels in one of four modes:

  1. Monophonic
  2. Dual Monophonic -- two independent channels
  3. Stereo -- for stereo channels that share bits, but not using joint-stereo coding
  4. Joint stereo -- takes advantage of the correlations between stereo channels

The MPEG method allows for a compression ratio of up to 6:1. Under optimal listening conditions, expert listeners could not distinguish the coded and original audio clips. Thus, although this technique is lossy, it still produces accurate representations of the original audio signal.

6.3 Common File Types [4]

  • Wav - standard audio file container format used mainly in Windows PCs. Commonly used for storing uncompressed, CD quality sound files, which means that they can be large in size. Wave files can also contain data encoded with a variety of codecs to reduce the file size (for example the GSM or mp3 codecs).
  • Ogg - a free, open source container format supporting a variety of codecs, the most popular of which is the audio codec Vorbis. Vorbis offers better compression than MP3 but is less popular.
  • Raw - a raw file can contain audio in any codec but is usually used with PCM audio data. It is rarely used except for technical tests.
  • Au - the standard audio file format used by Sun, Unix and Java. The audio in au files can be PCM or compressed with the μ-law, A-law or G729 codecs.
  • Aac - the Advanced Audio Coding format is based on the MPEG-2 and MPEG-4 standards.
  • Mp4/M4a - MPEG-4 audio; most often AAC but sometimes MP3
  • Mp3 - the MPEG Layer-3 format is the most popular format for downloading and storing music. By eliminating portions of the audio file that are essentially inaudible, mp3 files are compressed to roughly one-tenth the size of an equivalent PCM file while maintaining good audio quality.
  • Wma - the popular Windows Media Audio format owned by Microsoft. Designed with Digital Rights Management (DRM) abilities for copy protection.
  • Ra - a Real Audio format designed for streaming audio over the Internet. The .ra format allows files to be stored in a self-contained fashion on a computer, with all of the audio data contained inside the file itself.

6.4 Content based Audio Retrieval

As compared with the content-based image and video retrieval, content-based audio retrieval provides a special challenge because raw digital audio data is a featureless collection of bytes with the most elementary fields attached such as name, file format, sampling rate, which does not readily allow content-based retrieval.

Current content-based audio-retrieval methods are based on content-based image retrieval methods. Major procedures are: [1]

  • A feature vector is constructed by extracting acoustic and subjective features from the audio in the database.
  • The same features are extracted from the queries.
  • The relevant audio in the database is ranked according to the feature match between the query and the database.
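The three steps above can be sketched in Python as follows. The two features used here (root-mean-square loudness and zero-crossing rate) are simple stand-ins chosen for illustration and are not taken from the cited work; the Euclidean distance measure, the synthetic clips and all function names are likewise assumptions.

```python
import math


def feature_vector(samples):
    """Two stand-in acoustic features: RMS loudness and zero-crossing rate."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))
    return (rms, crossings / (len(samples) - 1))


def rank(query, database):
    """Return clip names ordered from best to worst feature match with the query."""
    q = feature_vector(query)
    scored = [(math.dist(q, feature_vector(clip)), name) for name, clip in database.items()]
    return [name for _, name in sorted(scored)]


def make_tone(freq, amplitude, count=200, rate=8000):
    """Hypothetical test clip: a sine tone given as a list of samples."""
    return [amplitude * math.sin(2 * math.pi * freq * i / rate) for i in range(count)]


database = {"soft_low": make_tone(100, 0.2), "loud_high": make_tone(2000, 0.9)}
query = make_tone(120, 0.25)          # a quiet, low-pitched query clip
print(rank(query, database))          # ['soft_low', 'loud_high']
```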

6.4.1 Audio Feature Extraction

There are two categories used to characterize the audio signal. [23],[26]

  • Acoustic Features
  • Subjective/Semantic Features

Acoustic Features

Acoustic features describe audio in terms of commonly understood acoustical characteristics, and can be computed directly from the audio file. Major acoustic features include:

  • Loudness
  • Spectrum Powers
  • Brightness
  • Bandwidth
  • Pitch

Subjective/Semantic Features

Subjective features describe sounds using personal descriptive language. The system must be trained to understand the meaning of these descriptive terms.

Semantic features are high-level features that are summarized from the low-level features. Compared with low-level features, they reflect the characteristics of audio content more accurately.

Major Subjective/Semantic Features

  • Timbre
  • Rhythm
  • Events
  • Instruments

6.4.2 Content based Audio Segmentation

  • It is important to segment an audio stream into different semantic parts, such as speech, music, silence, and environment sounds.
  • Segmentation is achieved by extracting features from each segment of the audio stream and applying classification methods to obtain the audio scene.

Chapter VII

Video Databases

A digital video consists of a sequence of frames or images that have to be presented at a fixed rate. Digital videos can be obtained by digitizing analog videos or directly by digital cameras. Playing a video at a rate of 25 frames per second gives the user the illusion of a continuous view. It takes a huge amount of data to represent a video. So compression is a must in the case of videos.

7.1 Need for Digital Video [23]

  • Ease of manipulation - The difference between analog and digital is like comparing a typewriter with a word processor. Just as the cut and paste function is much easier and faster with a word processor, editing is easier and faster with a digital video. Also, many effects that were exclusive to specialized postproduction houses are now easily achieved by bringing in files from Photoshop, Flash, and Sound Edit as components in a video mix. In addition, the ability to separate sound from image enables editing one without affecting the other.
  • Preservation of data - It is not true that digital video is better simply because it is digital. Big screen films are not digital and are still highly esteemed as quality images. However, it is easier to maintain the quality of a digital video. Traditional tapes are subject to wear and tear more so than DVD or hard drive disks. Also, once done, a digital video can be copied over and over without losing its original information. Analog signals can be easily distorted and will lose much of the original data after a few transfers.
  • Internet - A digital video can be sent via the Internet to countless end users without having to make a copy for every viewer. It is easy to store, retrieve, and publish.

7.2 Digital Video Compression Algorithms

There are two types of compression, “lossless” and “lossy”. The lossless compression retains the original data so that the individual image sequences remain the same. It saves space by removing image areas that use the same color. The compression rate is usually no better than 3:1. The low rate makes most lossless compression less desirable. The “lossy” compression methods remove image and sound information that is unlikely to be noticed by the viewer. Some information is lost, but since it is not differentiated by the human perception, the quality perceived is still the same, while the volume is dramatically decreased.

At its most basic level, compression is performed when an input video stream is analyzed and information that is indiscernible to the viewer is discarded. Each event is then assigned a code: commonly occurring events are assigned few bits and rare events are assigned more bits. These steps are commonly called signal analysis, quantization and variable length encoding respectively. There are four methods for compression: discrete cosine transform, vector quantization, fractal compression, and discrete wavelet transform. [1], [24]

7.2.1 Discrete Cosine Transform (DCT)

Discrete cosine transform is a lossy compression algorithm that samples an image at regular intervals, analyzes the frequency components present in the sample, and discards those frequencies which do not affect the image as the human eye perceives it. DCT is the basis of standards such as JPEG, MPEG, H.261, and H.263.
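As a rough illustration of the principle, the Python sketch below applies a one-dimensional DCT to a block of eight sample values, discards the four highest-frequency coefficients and reconstructs an approximation. The sample values, the choice of keeping four coefficients and the function names are assumptions; JPEG and MPEG apply the transform to two-dimensional blocks and quantize coefficients rather than simply zeroing them.

```python
import math


def dct(block):
    """Orthonormal DCT-II: express the block as a sum of cosine frequency components."""
    n = len(block)
    coeffs = []
    for k in range(n):
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        total = sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(block))
        coeffs.append(scale * total)
    return coeffs


def idct(coeffs):
    """Inverse of dct(): rebuild the sample values from the coefficients."""
    n = len(coeffs)
    block = []
    for i in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
            total += scale * c * math.cos(math.pi * (i + 0.5) * k / n)
        block.append(total)
    return block


block = [52, 55, 61, 66, 70, 61, 64, 73]      # assumed pixel intensities
coeffs = dct(block)
kept = coeffs[:4] + [0.0] * 4                 # discard the high-frequency half
approximation = idct(kept)
print([round(v) for v in approximation])
```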

7.2.2 Vector Quantization (VQ)

Vector quantization is a lossy compression that looks at an array of data, instead of individual values. It can then generalize what it sees, compressing redundant data, while at the same time retaining the desired object or data stream's original intent.
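A minimal Python sketch of the idea is shown below: each input vector is replaced by the index of its nearest entry in a small codebook, so only the indices need to be stored. The fixed three-entry codebook and the function names are assumptions for illustration; in practice the codebook is usually learned from the data.

```python
def nearest(vector, codebook):
    """Index of the codebook entry with the smallest squared distance to vector."""
    return min(
        range(len(codebook)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(vector, codebook[j])),
    )


def vq_encode(vectors, codebook):
    """Replace every vector by a single codebook index."""
    return [nearest(v, codebook) for v in vectors]


def vq_decode(indices, codebook):
    """Look the indices up again; the result only approximates the input."""
    return [codebook[j] for j in indices]


codebook = [(0, 0), (255, 255), (128, 128)]          # assumed, hand-picked entries
vectors = [(3, 1), (250, 251), (120, 130), (0, 10)]  # pairs of neighbouring pixel values
indices = vq_encode(vectors, codebook)
print(indices)                        # [0, 1, 2, 0]
print(vq_decode(indices, codebook))   # [(0, 0), (255, 255), (128, 128), (0, 0)]
```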

7.2.3 Fractal Compression

Fractal compression is a form of VQ and is also a lossy compression. Compression is performed by locating self-similar sections of an image, then using a fractal algorithm to generate the sections.

7.2.4 Discrete Wavelet Transform (DWT)

Like DCT, discrete wavelet transform mathematically transforms an image into frequency components. The process is performed on the entire image, which differs from the other methods (DCT) that work on smaller pieces of the desired data. The result is a hierarchical representation of an image, where each layer represents a frequency band.
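The hierarchical, layer-by-layer result can be illustrated with the Haar wavelet, the simplest wavelet, applied to a one-dimensional signal. The choice of the Haar wavelet, the averaging convention, the sample values and the function names are assumptions; the sketch also assumes the signal length is a power of two.

```python
def haar_step(signal):
    """Split a signal into pairwise averages (coarse part) and differences (detail part)."""
    pairs = list(zip(signal[0::2], signal[1::2]))
    averages = [(a + b) / 2 for a, b in pairs]
    details = [(a - b) / 2 for a, b in pairs]
    return averages, details


def haar_decompose(signal):
    """Repeat haar_step on the averages; each detail layer is one frequency band."""
    layers = []
    current = list(signal)
    while len(current) > 1:
        current, details = haar_step(current)
        layers.append(details)          # finest band first, coarsest band last
    return current[0], layers


overall_average, layers = haar_decompose([9, 7, 3, 5])
print(overall_average)   # 6.0
print(layers)            # [[1.0, -1.0], [2.0]]
```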

7.3 Compression Standards [21], [24]

7.3.1 MPEG

Moving Picture Experts Group or MPEG is an ISO/IEC working group whose job is to develop audio and video encoding standards. As of now, four MPEG standards are being used and one is under development. Every standard has been designed for a specific bit rate and application. See Appendix A for details.

7.3.2 AVI

AVI stands for Audio Video Interleave. It is one of the oldest formats. It was created by Microsoft to go with Windows 3.1 and its “Video for Windows” application. Even though it is widely used due to the number of editing systems and software that use AVI by default, this format has many restrictions, especially regarding compatibility with operating systems and other interface boards.

7.3.3 MOV

MOV format, created by Apple for the Macintosh, is the proprietary format of the QuickTime application. It can also run on PCs. Being able to store both video and sound simultaneously, the format was once superior to AVI. The latest version of QuickTime also has streaming capabilities for Internet video. However, with the new MPEG-2 format, the MOV format started to lose its popularity, until it was decided that MPEG-4 would use the QuickTime format as the basis of its standards.

7.3.4 DivX

DivX is software that uses the MPEG-4 standard to compress digital video, so that it can be downloaded over a DSL/cable modem connection in a relatively short time without reduced visual quality. The latest version of the codec, DivX 4.0, is being developed jointly by DivXNetworks and the open source community. DivX works on Windows 98, ME, 2000, CE, Mac and Linux.

7.4 Content based video indexing and retrieval

There are four main processes involved in content-based video indexing and retrieval: video content analysis, video structure parsing, summarization or abstraction, and indexing. Each process poses many challenges. [8],[27],[28]

7.4.1 Video Content Analysis

The main problem in video content analysis is that we cannot easily map extractable visual features (such as color, texture, shape, structure, layout, and motion) into semantic concepts (such as indoor and outdoor, people, or car-racing scenes). Although visual content is a major source of information in a video, valuable information is also carried in other media components, such as text (superimposed on the images, or included as closed captions), audio, and speech that accompany the pictorial component. A combined and cooperative analysis of these components would be far more effective in characterizing the video for both consumer and professional applications.

7.4.2 Video Structure Parsing

An important step in the process of video structure parsing is that of segmenting the video into individual scenes. From a narrative point of view, a scene consists of a series of consecutive shots grouped together because they were filmed in the same location or because they share some thematic content. The process of detecting these video scenes is analogous to paragraphing in text document parsing, but it requires a higher level of content analysis. In contrast, shots are actual physical basic layers in video, whose boundaries are determined by editing points or where the camera switches on or off.

Fortunately, analogous to words or sentences in text documents, shots are a good choice as the basic unit for video content indexing, and they provide the basis for constructing a table of contents for video. Shot boundary detection algorithms that rely only on visual information contained in the video frames can segment the video into frames with similar visual contents. Grouping the shots into semantically meaningful segments such as stories, however, usually is not possible without incorporating information from the other components of the video. Multimodal processing algorithms involving the processing of not only the video frames, but also the text, audio, and speech components that accompany them have proven effective in achieving this goal.
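One simple way to detect shot boundaries from visual information alone is to compare the brightness histograms of consecutive frames and report a boundary where they differ sharply. The Python sketch below follows that approach; it is only one possible technique, and the eight-bin histogram, the 0.5 threshold, the tiny four-pixel "frames" and the function names are all assumptions.

```python
def gray_histogram(frame, bins=8):
    """Proportion of pixels per brightness bin; frame is a list of 0-255 gray values."""
    hist = [0] * bins
    for value in frame:
        hist[value * bins // 256] += 1
    return [count / len(frame) for count in hist]


def shot_boundaries(frames, threshold=0.5):
    """Indices of frames whose histogram differs sharply from the previous frame."""
    boundaries = []
    previous = gray_histogram(frames[0])
    for index in range(1, len(frames)):
        current = gray_histogram(frames[index])
        # Half the summed absolute difference lies between 0.0 and 1.0.
        difference = sum(abs(a - b) for a, b in zip(previous, current)) / 2
        if difference > threshold:
            boundaries.append(index)
        previous = current
    return boundaries


dark = [10, 20, 15, 12]           # hypothetical frame from a dark shot
bright = [240, 250, 245, 230]     # hypothetical frame from a bright shot
frames = [dark, dark, bright, bright, dark]
print(shot_boundaries(frames))    # [2, 4]
```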

7.4.3 Video Summarization

Video summarization is the process of creating a presentation of visual information about the structure of video, which should be much shorter than the original video. This abstraction process is similar to extraction of keywords or summaries in text document processing. That is, we need to extract a subset of video data from the original video such as key frames or highlights as entries for shots, scenes, or stories. Abstraction is especially important given the vast amount of data even for a video of a few minutes’ duration. The result forms the basis not only for video content representation but also for content-based video browsing. Combining the structure information extracted from video parsing and the key frames extracted in video abstraction, we can build a visual table of contents for a video.

7.4.4 Video Indexing

The structural and content attributes found in content analysis, video parsing, and abstraction processes, or the attributes that are entered manually, are often referred to as metadata. Based on these attributes, we can build video indices and the table of contents through, for instance, a clustering process that classifies sequences or shots into different visual categories or an indexing structure. As in many other information systems, we need schemes and tools to use the indices and content metadata to query, search, and browse large video databases. Researchers have developed numerous schemes and tools for video indexing and query. However, robust and effective tools tested by thorough experimental evaluation with large data sets are still lacking. Therefore, in the majority of cases, retrieving or searching video databases by keywords or phrases will be the mode of operation.

Chapter VIII

Multimedia Databases

8.1 Introduction

Multimedia data basically means digital audio, video, images, animations and graphics together with text data. In the recent past, the acquisition, generation, storage and processing of multimedia data in computers and its transmission ov
