Retrieval Of Multimedia Documents From Web Source Computer Science Essay



Multimedia document retrieval is a topic of growing relevance. Significant contributions have been made to the design and development of multimedia information retrieval systems that rely on computers and the internet. However, computer-based information access restricts user mobility and thus the accessibility of information.

The cell phone is a ubiquitous device, more accessible and portable than a computer. Retrieving multimedia information through spoken queries on a cell phone therefore allows users to access multimedia from the web anytime and anywhere.

The purpose of this research is to conceptualize a novel system that retrieves multimedia documents from web sources using a cell phone. The cell phone serves as a pervasive device for accessing multimedia information through voice user interface technology. Moreover, the spoken query interface helps people with disabilities to interact with web sources for multimedia information access.


The World Wide Web is a collection of electronic documents consisting of text documents as well as multimedia documents. Most research on retrieving user-relevant information from the WWW concerns text documents; comparatively little work has been done on the retrieval of multimedia documents [1]. In the developing world, multimedia retrieval systems that depend on computers and internet access restrict user mobility and thus limit the accessibility of information. Using cell phones to retrieve multimedia documents from the web not only permits user mobility but also offers access to information anytime and anywhere.

However, some characteristics of cell phones make internet access difficult. Firstly, many cell phones do not have General Packet Radio Service (GPRS). Secondly, a cell phone's keypad is small and inconvenient for typing a text query. Thirdly, a cell phone's display is small, which is inconvenient for web users [33].

In this paper, an approach for retrieving multimedia documents from web sources using a cell phone is proposed. The approach lets a user speak a query, converts the query into textual form through speech recognition, extracts keywords from it, and searches the web for documents relevant to those keywords. It then sifts out the multimedia documents, sorts them, plays the list of documents (in audio form) on the cell phone, obtains the user's preference, retrieves the chosen document from the web, and returns it in spoken form on the cell phone.

We use voice calls for accessing web resources instead of GPRS. Prior research has used voice calls to browse websites and to retrieve text files in spoken form; we found no work that uses voice calls to retrieve multimedia documents on cell phones.

The proposed system is developed using VoiceXML (VXML) [3], the Apache open source search engine (Apache Solr) [4], and the Apache open source multimedia metadata extractor (Apache Tika) [5]. This work extends the VXML interpreter from a simple voice browser to a spoken multimedia document search engine, where a user passes a query through the built-in microphone and obtains the results in audio form. Our results show that, although speech recognition suffers from environmental noise and out-of-vocabulary (OOV) words, retrieval of multimedia documents from web sources using cell phones is achievable. The approach allows access to information with wide mobility, anytime and anywhere, and addresses the problems of the small display and small keypad by accepting spoken queries and returning spoken results. The solution is also useful for people with disabilities who cannot use a keypad easily or who have visual impairments.

The paper is organized as follows. Section 2 summarizes related research. The proposed solution is detailed in Section 3. Section 4 presents and discusses the results of a prototype system developed to test the proposed solution. Section 5 concludes the research and Section 6 presents future directions.


The number of mobile handsets is increasing fast, and next-generation information retrieval methodologies aim to use mobile devices to support retrieval on an anytime, anywhere basis. However, the small size of these devices, their cramped keypads, and their small displays make them ill-suited to information retrieval from the web. Spoken input can enhance their capabilities.

Gu et al. constructed a system in which a client, accessible from mobile devices and personal computers connected to the system, records a spoken query. The spoken query is transmitted to the server and converted to a text query through continuous speech recognition, and documents matching the text query are searched and returned. Their system consists of four main components: the client, the server, the speech recognition component, and the indexer. The voice search client developed for passing the spoken query combines HTML, JavaScript, and an ActiveX component written in VC++. Their work demonstrates the feasibility of spoken queries for retrieving information from the web, but the system lacks accuracy in recognizing the spoken query; they therefore suggested that users supply more query words to improve retrieval results [33].

Zhuoran Chen et al. presented a system that allows users to search for information on mobile devices using spoken queries in natural language. Their work evaluates spoken-query information retrieval on a commonly available, well-researched text database. For mobile devices with high-quality microphones, spoken-query retrieval based on existing technologies yields retrieval precision close to that of perfect text input. However, their work does not address the retrieval of multimedia documents [35].

Juin Ching et al. developed a system offering users a multimodal interface for accessing the web, i.e. a visual interface and an audio interface. To access the web by voice, a telephone or cell phone is used as the pervasive device, and the results of a user's search can be accessed either visually or audibly. The system includes a multimodal interaction mechanism, a text-to-speech (TTS) synthesizer, and VoiceXML technology. Their work increases the accessibility and mobility of any web page and provides a friendly environment for people with disabilities. A limitation is that the system works only for websites that support voice interaction over the phone; moreover, it supports only website browsing, not web search [2].

Text annotations, metadata information, and transcriptions of audio streams are the major sources for indexing and retrieving multimedia documents from the web [1, 24, 26-30].

M. J. Swain developed a multimedia search engine that indexes image, audio, and video documents using text extracted from web pages and from multimedia file headers [20]. In this search engine, indexing is performed offline, and indexing data is extracted from audio and video files through speech recognition rather than content analysis and feature extraction [20].

More details on searching for information through mobile devices using spoken queries can be found in [2, 33, 34, 35, 37, 38]. However, that work concerns web browsing and textual information retrieval; to our knowledge, no prior work retrieves audio and video documents on cell phones.

Proposed Solution

To retrieve multimedia documents from web sources using a cell phone, our proposed architecture consists of the following modules:

Voice Query reception from Cell Phone

Conversion of voice query into text query

Keywords Extraction

Search Engine

Multimedia Filter

Documents Sorting

Playing the Retrieved Document List to Get User Preference

Play Document

The architecture of the proposed solution is shown in Figure 1 below.

Figure 1: Architecture of the System

The functionality of each module is given below.

Step 1: Voice Query Reception


The user dials the number mapped to the application running on the VXML server and waits for the prompt message; if there is no response from the server, the user retries the connection. After hearing the prompt "speak your query," the user speaks the query into the cell phone. If the query is successfully received, it is forwarded to the next module. The process of voice query reception is shown in Figure 2.

Figure 2: Voice Query Reception

Step 2: Conversion of Voice Query into Text Query

The VXML interpreter allows the system to accept and record voice input [3]. After the voice input is captured, the spoken words are recognized by the VXML speech recognition engine against our defined grammar, and the recognized words are combined into a text query, which is returned for further action.

Figure 3: Conversion of Voice Query into Text Query
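
Conceptually, the grammar restricts recognition to a finite vocabulary. The following is a minimal sketch of that word-matching idea; the vocabulary and function name are illustrative assumptions, and the actual recognition is performed by the VXML engine, not by code like this.

```python
# Illustrative sketch of grammar-constrained recognition: only words
# present in the application-specific grammar survive into the text query.
GRAMMAR_WORDS = {"computer", "memory", "personal", "pronouns", "subject",
                 "sentence", "input", "device", "random", "access",
                 "read", "only"}

def recognize(spoken_words):
    """Keep only words found in the grammar, combined into a text query."""
    recognized = [w.lower() for w in spoken_words if w.lower() in GRAMMAR_WORDS]
    return " ".join(recognized)
```

For example, `recognize(["Computer", "Memory"])` yields the text query `"computer memory"`; out-of-vocabulary words are simply dropped, which mirrors why OOV input is a limitation of the system.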

Step 3: Keywords Extraction

Process Name: GetKeywords(Text_Query)

To search the web, keywords are extracted from the recognized text query. The query parser parses the query, extracts useful keywords and key phrases, and discards other terms such as articles (the, a, an) and conjunctions (but, or, and, etc.). The module generates a keyword vector and forwards it to the search engine for further action.

Figure 4: Keywords Extraction
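
A minimal sketch of the keyword extraction described above follows; the stopword list is abbreviated and the function name is illustrative, not the actual implementation.

```python
# Sketch of keyword extraction: drop articles and conjunctions, keep the
# remaining words, and add the full multi-word key phrase to the vector.
STOPWORDS = {"the", "a", "an", "but", "or", "and"}

def get_keywords(text_query):
    words = [w for w in text_query.lower().split() if w not in STOPWORDS]
    keywords = list(words)
    if len(words) > 1:
        keywords.append(" ".join(words))  # key phrase, e.g. "computer memory"
    return keywords
```

For the query "Computer Memory" this sketch produces the keyword vector `['computer', 'memory', 'computer memory']`, matching the keywords reported for query q1 in the experiments.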

Step 4: Search Engine

The search engine searches web sources for documents matching the given keywords and returns a list of relevant documents, which may include text, image, audio, or video files. The list of retrieved documents is returned from this module for further processing.

Figure 5: Search Engine
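
Apache Solr exposes a standard HTTP select interface, so a query for the extracted keywords could be issued as in the following sketch; the host, port, and core name are assumptions about the deployment, not values from the actual system.

```python
from urllib.parse import urlencode

def solr_select_url(keywords, host="http://localhost:8983", core="docs"):
    """Build a Solr /select URL that ORs the extracted keywords."""
    query = " OR ".join('"%s"' % k for k in keywords)
    params = urlencode({"q": query, "wt": "json", "rows": 20})
    return "%s/solr/%s/select?%s" % (host, core, params)
```

The returned URL can then be fetched over HTTP and Solr's JSON response parsed into the list of relevant documents.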

Step 5: Multimedia Filter

This module filters the multimedia documents out of the retrieved list. Filtering verifies file-name extensions: each document is picked, its file extension is checked, and the document is kept if the extension denotes a multimedia file (e.g. wmv, wav) and discarded otherwise. The resulting collection of relevant multimedia documents is forwarded to the sorting module for further processing.

Figure 6: Multimedia Filter
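
The extension check can be sketched as follows; the extension set is illustrative, and the deployed filter may recognize more formats.

```python
import os

# Illustrative multimedia extensions; the actual filter may cover more.
MULTIMEDIA_EXTS = {".wav", ".mp3", ".wmv", ".avi", ".mp4"}

def filter_multimedia(documents):
    """Keep only documents whose file extension marks them as multimedia."""
    return [d for d in documents
            if os.path.splitext(d)[1].lower() in MULTIMEDIA_EXTS]
```

For example, filtering `["rom.wav", "notes.txt", "clip.wmv"]` keeps only the two multimedia files.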

Step 6: Document Sorting

The input to this module is the filtered list of multimedia documents. The module sorts the documents by relevance score and returns them as a sorted array.

Figure 7: Documents Sorting
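
The sorting step reduces to ordering documents by their relevance score, as in this sketch; representing each document as a (name, score) pair is an assumption for illustration.

```python
def sort_by_relevance(documents):
    """documents: list of (name, relevance_score) pairs.
    Returns the documents as a sorted array, most relevant first."""
    return sorted(documents, key=lambda d: d[1], reverse=True)
```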

Step 7: Play Document List to Get User Preference

This module takes the documents one by one from the sorted array and plays each document's description (name, size, duration, and subject) to the user on the cell phone as audio. The user, in turn, responds with the number of the required document.

Figure 8: Play Document List for User preference
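
The spoken listing can be composed as in the following sketch; the dictionary fields mirror the description items named above, and all names here are illustrative rather than taken from the implementation.

```python
def build_menu_prompt(documents):
    """Compose the audio listing of document descriptions, numbered so
    the user can answer with the required document number."""
    lines = []
    for number, doc in enumerate(documents, start=1):
        lines.append("Document %d: %s, %s, %s, subject %s." % (
            number, doc["name"], doc["size"], doc["duration"], doc["subject"]))
    return " ".join(lines)
```

The resulting string would be handed to the TTS engine and spoken to the caller.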

Step 8: Play Required Document

Process Name: PlayMMDocument(User_Preference)

After receiving the user's preference, this module retrieves the chosen document from the web and plays it for the user as audio on the cell phone. The user may listen to the entire document or stop it at any time, and may then request another document, pass a new query, or disconnect.

Figure 9: Play Document
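
Mapping the user's keypad response back to a document can be sketched as follows; representing each entry as a (name, URI) pair is an assumption for illustration.

```python
def pick_document(sorted_docs, pressed_digit):
    """sorted_docs: list of (name, uri) in relevance order.
    Returns the URI for the document number the user pressed, or None."""
    index = int(pressed_digit) - 1  # keypad numbering starts at 1
    if 0 <= index < len(sorted_docs):
        return sorted_docs[index][1]
    return None
```

The returned URI is what the server fetches and streams back to the caller as audio.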

Results and Discussion

The proposed solution has been implemented by writing the query reception module on a VXML platform. This module receives the query, converts it into textual form, and works together with a web server (running the Apache Solr search engine) to retrieve documents from web sources.

Experimental Evaluation

For experimental purposes, about 32 audio lectures on different topics were recorded (PCM, 22.050 kHz, 16-bit, mono) and uploaded to the web [39] to create a web resource. Apache Tika [5] was used to extract metadata from the web documents, and manual transcription was performed to generate indexing terms corresponding to each document's metadata. An indexed database was then built by storing the metadata together with the indexing terms. Apache Solr [4] was used to search the indexed database with the key terms of the text query extracted from the spoken query. The web server ran on a Pentium 4 system with 1 GB of RAM.
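
Apache Tika's command-line application can print a document's metadata via its `--metadata` option; building that invocation can be sketched as below. The jar path is an assumption about the local installation.

```python
def tika_metadata_command(jar_path, document_path):
    """Build the Tika CLI invocation that prints a document's metadata."""
    return ["java", "-jar", jar_path, "--metadata", document_path]

# Running it requires Java and the tika-app jar, e.g.:
# subprocess.run(tika_metadata_command("tika-app.jar", "rom.wav"))
```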

The VXML query reception module was uploaded to a VXML hosting site [40] and assigned a telephone number, which was used to connect to the system and make a query from the cell phone.

To test the system, a query set (one-, two-, and three-word queries) was defined by picking words from the transcriptions of the audio files, and a small application-specific grammar was defined for recognition of the spoken query words. The following table shows the set of queries defined for this experiment.

Sr. No    Query
1         Computer Memory
2         Subject of Sentence
3         Personal Pronouns
4         Personal Computers
5         Input Device
6         Random Access Memory
7         Read Only Memory

Figure 10: Queries defined for experimental purposes

The experiment was performed by making a voice call to the query reception module from a cell phone (a Telenor SIM in a Nokia 6312). On successful connection, a query from the query set was spoken. The query conversion module converted it into a textual query using the VXML voice recognition engine and our application-specific grammar, and the keyword extraction module extracted keywords from the text query. Apache Solr then searched its indexed database for documents relevant to the extracted keywords. The filtering module sifted out the multimedia documents, and the sorting module ordered them by relevance score. The document names from this list were played on the cell phone as audio, the preferred document number was sent back from the cell phone to the server, and the required document was retrieved from the web, played as audio, and listened to on the cell phone.

The module-wise working of the proposed system is explained below.

In Module 1, the VXML interpreter was uploaded to the web [39] and assigned a phone number (0014089077328). This number was dialed from the cell phone (a Telenor SIM in a Nokia 6312). After the connection was established, the PIN code (1234) and user ID (6251279) were entered on the cell phone keypad. The query reception module prompted for a query, and the query "Computer Memory" (call it q1) was spoken into the cell phone's microphone. The query reception module received the query and forwarded it to Module 2, which converted it into a textual query using the VXML voice recognition engine and our application-specific grammar; note that the number of words in the grammar is finite. The textual query was sent to the web server for further processing.

In Module 3, the web server extracted keywords from the text query and passed them to the search engine. The keywords extracted from query q1 are shown below.

computer, memory, computer memory

In Module 4, the search engine searched for relevant documents in its indexed database, built by manual indexing of documents from the web sources. The search engine used here was Apache Solr [4], an open source search server based on the Apache Lucene search library. Metadata and text information extracted from the web documents with Apache Tika [5] was used to index the web documents in the database.

Figure 11 shows the results of Modules 4, 5, and 6 for the query q1 made earlier.

In Module 7, each document name from the result set was played to the cell phone user, who in turn specified the required document number using the cell phone keypad. The result for query q1 shows that the first document is the most relevant, so '1' was pressed on the cell phone keypad (shown in Figure 2).

Figure 11: Results against the query q1

In Module 8, the web server retrieved the desired document (the first document, "rom.wav") from the web using its URI and played it in audio form on the cell phone.


Our experiments demonstrate the feasibility of retrieving multimedia documents from web sources using cell phones. For the experiment, we indexed a collection of web documents in the server's database, listened to each lecture, and defined the index terms; the query set was then defined by selecting a subset from the union of the lectures' index-term sets.

The following table represents the Interpolated Precision at 11 recall levels for query q1.

Figure 12: Interpolated Precision at 11 recall levels for query q1
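
The 11-point interpolated precision reported in these tables can be computed as in the following sketch; this is the standard formulation, not the exact evaluation script used in the experiment.

```python
def interpolated_precision_11pt(relevance, total_relevant):
    """relevance: ranked binary judgments (1 = relevant) for a query.
    Returns precision at recall levels 0.0, 0.1, ..., 1.0, where the
    value at each level is the maximum precision observed at any
    recall greater than or equal to that level."""
    points = []  # (recall, precision) at each rank where a hit occurs
    hits = 0
    for rank, rel in enumerate(relevance, start=1):
        hits += rel
        if rel:
            points.append((hits / total_relevant, hits / rank))
    interpolated = []
    for level in range(11):
        threshold = level / 10
        candidates = [prec for rec, prec in points if rec >= threshold]
        interpolated.append(max(candidates) if candidates else 0.0)
    return interpolated
```

For instance, a ranking with relevant documents at positions 1 and 3 (out of 2 relevant in total) yields interpolated precision 1.0 at recall levels up to 0.5 and 2/3 thereafter.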

The graph below represents the Interpolated Precision at 11 recall levels for query q1.

Figure 13: Interpolated Precision graph for query q1

The main contribution of our work is the facility to retrieve multimedia documents from web sources using cell phones. We have demonstrated the working of the proposed system; Figure 14 shows the results of the queries made during the experimentation.

Figure 14: Interpolated Precision at 11 recall levels for 8 different queries

The graph below represents the Average Interpolated Precision at 11 recall levels for 8 different queries.

Figure 15: Average Interpolated Precision at 11 recall levels for 8 queries

The results show that the spoken queries performed correctly, returning every document in which the queried word is present. However, the system is at an immature stage and still needs to be tested on out-of-vocabulary words and under noisy conditions.

The experiments show that audio file retrieval is satisfactory, whereas retrieval of complete video files is not practicable in the current setup; only the audio stream of a video file is retrievable, since the delivery medium is voice.

Retrieval of the third multimedia type, images, is not supported; the voice medium makes image files inaccessible in the current setup.


The small display and undersized keypad make retrieval of multimedia documents using cell phones a difficult job. In this work, voice calls have been used to overcome this problem.

This work uses the VoiceXML platform to retrieve multimedia documents using a cell phone. The system links a search engine with the VXML interpreter, enhancing its capabilities so that it works as a multimedia search engine.

Since many cell phone models lack GPRS (particularly models used in Asia [41]), our system makes the internet and multimedia documents from the web accessible on such handsets through voice calls.

The proposed solution allows mobile phone users to access multimedia documents on the move, any time and anywhere.

The use of spoken queries provides an easy-to-use interface and a faster input speed than a keypad.

Moreover, the spoken query format used in the system makes multimedia information retrieval available to people with disabilities.

Future Directions

This work currently uses metadata and manual annotation for indexing. To avoid manual transcription, the indexing procedure of [24], which uses speech recognition technology, may be incorporated. Moreover, the occurrences of query words in an audio stream (timing information) should be indexed and returned to the user, so that more relevant and accurate information is delivered; this would let the user listen from the exact location of the occurrence instead of listening to the whole document.

The developed prototype system works well in a controlled environment; however, necessary measures need to be taken before using it in an open environment, where noise affects the user's query and hampers the voice recognition process.