Search and Retrieval of Audio Conversation


Abstract- We are entering an era in which it is easier than ever to record, disseminate, and browse vast amounts of audio-visual material. Conventional text-based retrieval engines cannot process this kind of data, however, so searching through audio can be very tedious. Manual annotation of these data is possible, but expensive. What is needed are automated processing methods that give structure to this growing data type and help people navigate it. Speech is a complex phenomenon, and people rarely understand how it is produced and perceived; it is always useful to open a recording in a sound editor, inspect it, and listen to it. Information Retrieval (IR) is the art and science of searching for information in documents, searching for documents themselves, searching for metadata that describe documents, or searching within databases, whether relational stand-alone databases or hypertext networked databases such as the Internet or intranets, for text, sound, images or data. For non-textual data, however, little work has been done on storage and retrieval. Our proposed system provides a user-friendly interface and an efficient search mechanism for retrieving audio data.

Keywords: Retrieve, Search, Audio, Speech to text, Indexing


Organizations are increasingly finding themselves subject to discovery requests, either for compliance or for litigation. Many organizations employ legal firms to review all of the information related to a particular request and then decide which pieces are relevant. To reduce this effort, many business units have come up with software products that quickly search and retrieve the relevant data (documents). However, there has been little visible improvement in the storage and retrieval of non-text data such as audio and video. In the past few years, we have entered an era in which it is easier than ever to record, disseminate, and browse vast amounts of audio-visual material. Manual annotation of these data is possible, but expensive. What is needed are automated processing methods that provide structure to help people navigate through this growing body of data.

Because the corporate world must conform to the stringent operating norms laid down by the Government, organizations are also required to record the telephone conversations that take place within them. Sometimes an organization needs to search for particular keywords in large volumes of stored data. Currently, organizations hire people to listen to audio conversations manually and identify those involving the keywords, which is cumbersome, time-consuming and costly. This paper therefore proposes a solution that automates the search and retrieval process.

Section II describes previously proposed tools. Section III presents the new approach for storing and retrieving audio conversations, which builds on those tools, and includes the system architecture. Section IV describes the design of the proposed system and how it works. Section V concludes the paper and outlines future work.


In the past few years, many systems have been developed that perform storage, search and retrieval of enterprise data. One of them is described below, along with a few "speech to text" tools.

Text Document Retrieval System: Eyebrowse

Eyebrowse is an open source Java-based tool for cataloguing and browsing mailing lists. A core requirement for Eyebrowse is flexible message search and retrieval capability. It demands an indexing and search component that would efficiently update the index base as new messages arrived, allow multiple users to search and update the index base concurrently, and scale to archives containing millions of messages. Eyebrowse differs from other popular archive browsers in that it does not require mailing lists to be exploded into individual HTML files, and that the HTML rendering is done at serving time, rather than at the time the message is received, to allow for easy customization of the message display. The Eyebrowse project provides a way to browse email mailing list archives in a way that is:

Scalable to millions of messages.

Easily customizable, especially in its look and feel.

Reliable in actual use, including protecting against malicious or malformed HTML code.

However, the above system cannot search and retrieve non-textual data such as speech. Our project deals with speech data (audio conversations) and provides an effective search and retrieval mechanism for it. To achieve that goal, we plan to use automatic speech-to-text conversion tools, which are discussed below.

Speech Recognition:

The traditional approach to speech recognition system design has been to create an entire system optimized around a particular methodology. As evidenced by past research systems such as Dragon [8], Harpy [9], Sphinx and others, this approach has proved to be quite valuable in that the resulting systems have provided foundational methods for speech recognition research.

In the same light, however, each of these systems was largely dedicated to exploring a single specific groundbreaking area of speech recognition. For example, Baker introduced hidden Markov models (HMMs) with his Dragon system [8], [10], and earlier predecessors of Sphinx explored variants of HMMs such as discrete HMMs [4], semi-continuous HMMs [5], and continuous HMMs [11]. Other systems explored specialized search strategies such as using lex tree searches for large N-Gram models [12].

Fig1: Sphinx-4 decoder Framework


Sphinx-4 is a flexible, modular and pluggable framework to help foster new innovations in the core research of hidden Markov model (HMM) recognition systems. The design of Sphinx-4 is based on patterns that have emerged from the design of past systems as well as new requirements based on areas that researchers currently want to explore. Sphinx-4 is a speech recognition system written entirely in the Java programming language. It is a hidden Markov model (HMM) based speech recognizer and supports dictionaries in the CMU dictionary format. The framework and the implementations are all freely available as open source.

How Sphinx Works [13]

This section describes how Sphinx converts an audio file to text.

Recognition process

The common way to recognize speech is the following: we take the waveform, split it into utterances at silences, and then try to recognize what is being said in each utterance. To do that, we want to take all possible combinations of words and try to match them with the audio, choosing the best-matching combination. A few concepts are important in this matching.

The first is the concept of features. Because a raw waveform contains a very large number of parameters, we reduce it to a smaller set of numbers, usually computed by dividing the speech into frames. For each frame, typically 10 milliseconds long, we extract 39 numbers that represent the speech; this set of numbers is called a feature vector. How best to generate these numbers is a subject of active investigation, but in the simple case they are derived from the spectrum.
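The framing step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the 16 kHz sample rate and the synthetic sine wave are hypothetical, the frames do not overlap, and a single toy feature (average energy) stands in for the 39-number feature vector that a real front end would compute from the spectrum.

```python
import math


def split_into_frames(samples, sample_rate, frame_ms=10):
    """Split a list of samples into consecutive, non-overlapping frames."""
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]


def frame_energy(frame):
    """Toy one-number feature: the average squared amplitude of a frame."""
    return sum(s * s for s in frame) / len(frame)


# One second of synthetic audio (a 440 Hz tone) at an assumed 16 kHz rate.
sample_rate = 16000
samples = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]

frames = split_into_frames(samples, sample_rate)
features = [frame_energy(f) for f in frames]
print(len(frames), len(frames[0]))
```

With these assumptions, one second of audio yields 100 frames of 160 samples each, so the recognizer works on 100 feature values instead of 16,000 raw samples.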

The second is the concept of the model. A model is a mathematical object that gathers the common attributes of a spoken word. In practice, the acoustic model of a senone is a Gaussian mixture over its three states; put simply, it is the most probable feature vector.

The third is the matching process itself. Because comparing all feature vectors with all models would take longer than the age of the universe, the search is optimized with many tricks. At each point we maintain the best-matching variants and extend them as time goes on, producing the best-matching variants for the next frame.
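The idea of keeping only the best-matching variants and extending them frame by frame is commonly called beam search. The following is a minimal sketch of that idea, not the search implemented in Sphinx: the two labels, their mean values, the switching penalty and the scoring function are all made up for illustration.

```python
import heapq


def beam_search(frames, labels, score, beam_width=2):
    """Extend partial hypotheses frame by frame, keeping only the best few."""
    beam = [(0.0, ())]  # (total score, labels chosen so far)
    for frame in frames:
        extended = []
        for total, path in beam:
            prev = path[-1] if path else None
            for label in labels:
                extended.append((total + score(label, frame, prev), path + (label,)))
        # Prune: only the beam_width best hypotheses survive to the next frame.
        beam = heapq.nlargest(beam_width, extended, key=lambda hyp: hyp[0])
    return beam[0]


# Hypothetical one-dimensional "models": each label has a most probable value.
MEANS = {"low": 0.0, "high": 1.0}


def toy_score(label, frame, prev):
    """Higher is better: closeness to the label mean, minus a switching penalty."""
    penalty = 0.1 if prev is not None and prev != label else 0.0
    return -(frame - MEANS[label]) ** 2 - penalty


best_score, best_path = beam_search([0.1, 0.2, 0.9, 1.1], list(MEANS), toy_score)
print(best_path)
```

With a beam width of 2, only two hypotheses are extended at each frame instead of every possible label sequence, which is what keeps the search tractable as the number of frames grows.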

Fig 2: STT tool - Sphinx


According to the structure of speech, three models are used in speech recognition to do the match:

The acoustic model contains acoustic properties for each senone. There are context-independent models, which contain properties (the most probable feature vectors for each phone), and context-dependent ones (built from senones with context).

The phonetic dictionary contains a mapping from words to phones. This mapping is not very effective (for example, only two or three pronunciation variants are noted for each word), but it is practical enough most of the time. A dictionary is not the only way to map words to phones; the mapping could also be done with a complex function learned by a machine learning algorithm.

The language model is used to restrict the word search. It defines which words can follow previously recognized words (remember that matching is a sequential process) and significantly restricts the matching process by stripping out words that are not probable. The most common language models are n-gram language models, which contain statistics of word sequences, and finite state language models, which define speech sequences by a finite state automaton, sometimes with weights. To reach a good accuracy rate, the language model must be very successful at restricting the search space, which means it must be very good at predicting the probability of the next word. A language model usually restricts the vocabulary considered to the words it contains. That is an issue for name recognition; to deal with it, a language model can contain smaller chunks such as subwords or even phones. Note that search space restriction in this case is usually worse, and the corresponding recognition accuracy lower, than with a word-based language model.
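As an illustration of how an n-gram language model holds statistics of word sequences, the sketch below builds a bigram (2-gram) model from a tiny made-up corpus and estimates the probability of the next word. The corpus, the function names and the `<s>` and `</s>` boundary markers are assumptions chosen for the example; no smoothing is applied, so unseen word pairs get probability zero.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probability(counts, prev, nxt):
    """Estimate P(nxt | prev) from raw counts (no smoothing)."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    return followers[nxt] / total if total else 0.0


# Hypothetical training sentences.
corpus = [
    "please transfer the funds",
    "please call the client",
    "transfer the funds today",
]
model = train_bigram_model(corpus)

print(next_word_probability(model, "the", "funds"))
print(next_word_probability(model, "the", "zebra"))
```

In this toy corpus "funds" follows "the" in two of three cases, while "zebra" never does, so a recognizer guided by the model would not consider "the zebra" at all. That is the search space restriction described above, and also why words outside the model's vocabulary, such as names, are a problem.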


Lucene is a free, open-source information retrieval API originally implemented in Java by Doug Cutting.

Lucene is a highly scalable, high-performance IR library. It has a small memory footprint (only 1 MB of heap), and its index size is roughly 20-30% of the size of the indexed text. It ranks search results; supports several query types, such as wildcard and proximity queries; offers field searching and date-range searching; can search multiple indexes and return a merged result; and allows simultaneous updating and searching.

How Lucene Works [6]

Although Lucene is simple to use, indexing a document involves several steps. Lucene indexes text only, so text content must be extracted from files before indexing, which is the responsibility of the Importer module. From here on we assume that the data is already in textual format.

Lucene Phase I: Analysis

Lucene first analyzes the data to make it more suitable for indexing. To do so, it splits the textual data into chunks, or tokens, and performs a number of optional operations on them. For instance, the tokens could be lowercased before indexing, to make searches case-insensitive. Typically it is also desirable to remove all frequent but meaningless tokens from the input, such as stop words (a, an, the, in, on, and so on) in English text. Similarly, it is common to analyze input tokens and reduce them to their roots.
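The analysis steps can be illustrated with a short Python sketch. It does not use Lucene's analyzer classes; the stop word list, the function names and the very crude suffix-stripping rule are assumptions made only to show tokenizing, lowercasing, stop word removal and root reduction in sequence.

```python
import re

# Hypothetical stop word list; real analyzers ship with longer ones.
STOP_WORDS = {"a", "an", "the", "in", "on", "and", "of", "to", "is"}


def naive_stem(token):
    """Crude root reduction: strip one common English suffix, if any."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token


def analyze(text):
    """Tokenize, lowercase, drop stop words, and reduce tokens to rough roots."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


print(analyze("The client called on Monday and discussed payments"))
# prints ['client', 'call', 'monday', 'discuss', 'payment']
```

Because the same analysis is applied to both the indexed text and the search keyword, a query for "Payments" and a transcript containing "payment" end up as the same token.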

Lucene Phase II: Create Indexes

After the input has been analyzed, Lucene stores it in an inverted index data structure. This data structure makes efficient use of disk space while allowing quick keyword lookups. What makes this structure inverted is that it uses tokens extracted from input documents as lookup keys instead of treating documents as the central entities. In other words, instead of trying to answer the question "what words are contained in this document?" this structure is optimized for providing quick answers to "which documents contain word X?"
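A minimal in-memory sketch of an inverted index is shown below. It is not Lucene's on-disk format, and the file names and transcript text are hypothetical, but it shows tokens serving as lookup keys that point to the documents containing them.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


# Hypothetical transcripts keyed by file name.
documents = {
    "call1.txt": "please transfer the funds",
    "call2.txt": "the client called back",
    "call3.txt": "transfer completed",
}
index = build_inverted_index(documents)

# "Which documents contain the word transfer?"
print(sorted(index["transfer"]))
# prints ['call1.txt', 'call3.txt']
```

Answering the query is a single dictionary lookup, regardless of how many documents were indexed; scanning every document for the word is avoided entirely.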

The indexer is a module that is used to create the indexes from the list of text files, which are generated from the STT module. The indexes will be helpful in searching the text files for the input keywords and thereby searching the audio conversation files.

Fig 3: Indexing Tool - Lucene Java

The Search Engine uses the indexes to find the matching text files and maps them to their audio files, thus retrieving and presenting the actual audio files to the user.


To retrieve the appropriate audio conversations for a keyword, we first convert the audio files into corresponding text files and then apply an indexing algorithm to those text files. For converting audio to text we use the open source tool Sphinx, and for indexing the text files we use the open source tool Lucene Java, both discussed above.

Database Implementation:

We need to handle only two kinds of databases, namely the .wav file database and the .txt file database. The .wav files are given as input to the STT tool, and the corresponding .txt files are generated alongside them. The .wav files are stored in the WAVFILES directory and the .txt files in the TEXTFILE directory.

For example, a file abc.wav is stored in the WAVFILES directory and given as input to the STT tool; the corresponding generated file, abc.txt, is stored in the TEXTFILE directory.
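The naming convention above means that the audio file and its transcript can be found from each other by swapping the directory and the extension. A small sketch of that mapping follows; the helper function names are hypothetical, while the directory names are the ones given in the text.

```python
from pathlib import Path

WAV_DIR = Path("WAVFILES")    # directory holding the recorded .wav files
TEXT_DIR = Path("TEXTFILE")   # directory holding the generated .txt files


def transcript_path_for(wav_path):
    """Return where the transcript of a given .wav file is stored."""
    return TEXT_DIR / (Path(wav_path).stem + ".txt")


def audio_path_for(txt_path):
    """Return the .wav file that a given transcript was generated from."""
    return WAV_DIR / (Path(txt_path).stem + ".wav")


print(transcript_path_for(WAV_DIR / "abc.wav"))
print(audio_path_for(TEXT_DIR / "abc.txt"))
```

Since the mapping depends only on the file name, no separate lookup table is needed to get from a matching transcript back to its recording.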

Fig 4: Database Implementation

System Architecture:

The system architecture describes the system in a layered format. The three layers are as follows:

User Interface

Business Logic

Database Layer

User Interface:

It provides a user-friendly environment for the user.

Business Logic:

It is responsible for creating indexes for the text files using Lucene, for searching, and for converting each .wav file into its respective .txt file using the Sphinx voice-to-text converter.

Database Layer:

It handles the location of the .wav and .txt files, that is, where the wav and text repositories are located on the machine.

Fig 4: System Architecture


The user gives input in the form of text or audio.

Audio files are converted to text using the STT (Speech to Text) tool (Sphinx).

Each text file is given as input to the indexing mechanism, which scans it for keywords and creates the indexes accordingly.

All files containing the same keyword are listed under the same index entry.

When a person searches for a particular keyword, the search engine looks up that keyword in the indexes created by the indexing mechanism.

User input in the form of audio is first converted to text using the STT tool, and that text is then given as input to the searching block.

Fig 5: Block diagram of System

The search operation returns the collection of audio files corresponding to the text files retrieved from the index. The files are ranked by the number of occurrences of the given keyword, in descending order.
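The ranking described above can be sketched as follows. The transcripts are hypothetical, and for brevity the sketch counts occurrences by scanning each transcript directly rather than consulting a Lucene index, so it illustrates only the ordering of results and the mapping from text files back to audio files.

```python
from collections import Counter
from pathlib import Path


def search(transcripts, keyword):
    """Return (audio file name, occurrences) pairs, most occurrences first."""
    keyword = keyword.lower()
    hits = []
    for txt_name, text in transcripts.items():
        occurrences = Counter(text.lower().split())[keyword]
        if occurrences > 0:
            # Map the matching transcript back to its audio file.
            hits.append((Path(txt_name).stem + ".wav", occurrences))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical transcripts produced by the STT tool.
transcripts = {
    "a.txt": "refund refund please",
    "b.txt": "no match here",
    "c.txt": "refund issued refund approved refund sent",
}

print(search(transcripts, "refund"))
# prints [('c.wav', 3), ('a.wav', 2)]
```

Files that do not contain the keyword are left out, and the conversation mentioning it most often is presented first.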


The Audio Transcription system extracts the information relevant to the problem and combines vital search and retrieval functionality in a single package. It can be used by any client without prior in-depth knowledge of searching and indexing, and it makes it easy to keep an eye on the activities going on across the entire audio transcription system.

The Audio Transcription system helps in keeping an eye on the activities going on in the organization and in guarding against information leakage. In future we will try to improve the performance of the STT tool.