Overview Of Information Retrieval Computer Science Essay

This chapter presents an overview of information retrieval. It briefly reviews information retrieval, its evaluation and its use in web search, and discusses the user model for retrieval. The chapter also describes a number of analytical approaches to information retrieval in terms of algorithm and application models, including approaches proposed for efficient and intelligent retrieval. It further focuses on document organization as an important part of retrieval, with special attention to the different methods applied to the organization of documents. Semantic techniques that use categorization, collaborative filtering and the naive Bayes algorithm are also discussed. In particular, this thesis addresses solutions for effective retrieval at the level of models/algorithms and systems/applications. The experimental observations are explored with particular emphasis on the requirements of information retrieval.

Paper based on these studies:

1. Challenges in Web Information Retrieval

(Monika Arora, Uma Kanjilal, and Dinesh Varshney)

2009, Innovation in Computer Science and Software Engineering [Springer Science, U.S.A], pp 141-145

2. Performances Evaluation in Information Retrieval System

(Monika Arora, Uma Kanjilal, and Dinesh Varshney)

2010, Innovations and Advances in Computer sciences and Engineering ISSN NO: 978-0230-32978-2

(Macmillan Publisher, India), pp. 456-464

By learning to discover and value our ordinariness, we nurture friendliness toward ourselves and the world that is the essence of a healthy soul.

Thomas Moore

1.1 Motivation

Information Retrieval (IR) is based on the principle of finding the relevant data in a corpus of data (Christopher et al., 2008). Information retrieval is not new: the idea behind today's search engines is as old as data storage and data retrieval themselves. Today, businesses all over the world are based on the web, everyone handles electronic data, and that data changes frequently. The most critical factor is the data that is stored and retrieved. For the retrieval process, we have to focus not just on the data but also on the users linked with it. Changed or updated data should be maintained and kept available for retrieval: if the stored data changes, the repository has to be updated as well. Because data on the World Wide Web changes constantly, web communities need to update their documents and refresh their data continually. Most research today therefore addresses several different levels: storage, maintenance and retrieval. Information retrieval deals with the representation, storage and organization of, and access to, information items such as documents. The representation and organization of the information items should provide the user with easy access to the information in which he is interested.

The retrieval process starts with a query and looks for its keywords in all the pages (documents) of the document collection hosted on the World Wide Web. To be returned as relevant, a page must be up to date, so new and changed documents have to be indexed and ranked regularly. The full description of the user information need cannot be used directly to request information through the current interfaces of Web search engines. Instead, the user must first translate this information need into a query that can be processed by the search engine (or IR system). In its most common form, this translation yields a set of keywords (or index terms) that summarizes the description of the user information need. Given the user query, the key goal of an IR system is to retrieve information that might be useful or relevant to the user. The emphasis is on the retrieval of information as opposed to the retrieval of data.
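The keyword lookup described above is commonly implemented with an inverted index. The following is a minimal sketch under stated assumptions: the three-document collection, the whitespace tokenization and the function names `build_inverted_index` and `keyword_search` are hypothetical and chosen only for illustration; they are not taken from the cited studies.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def keyword_search(index, query):
    """Return the ids of documents that contain every keyword of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical toy collection (assumption, for illustration only).
docs = {
    1: "web search engines index pages",
    2: "users translate information need into keywords",
    3: "search engines rank pages for keywords",
}
index = build_inverted_index(docs)
hits = keyword_search(index, "search pages")
print(hits)
```

When a document is added or changed, only its own terms need to be re-indexed, which is one reason this structure suits collections that are updated regularly.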

The Web is becoming a universal repository of human knowledge and culture for information. Its success is based on the conception of a standard user interface, which is always the same no matter what computational environment is used to run the interface. As a result, the user is shielded from details of communication protocols, machine location, and operating systems. Further, any user can create web documents and make them point to any other Web documents without restrictions.

1.2 Retrieval Mechanism

The study focuses on either data or documents. In the case of data, the database is structured and its fields have clear semantics. Database queries are also well defined, and their results can be reproduced exactly. The matching criteria are exact, so the results are always exact. Information retrieval, on the other hand, deals with unstructured data that has no fields, only text, and with free-text queries (Sanderson & Zobel, 2005). Because the results are not exact matches, matching is imprecise and its effectiveness has to be measured.

Data retrieval determines which documents of a collection contain the keywords in the user query, which is frequently not enough to satisfy the user need. In fact, the user of an IR system is concerned more with retrieving information about a subject than with retrieving data that satisfies a given query. A data retrieval language aims at retrieving all objects that satisfy clearly defined conditions, such as those in a regular expression or in a relational algebra expression. Thus, for a data retrieval system, a single erroneous object among a thousand retrieved objects means total failure. For an information retrieval system, however, the retrieved objects might be inaccurate, and small errors are likely to go unnoticed. The main reason for this difference is that information retrieval usually deals with natural language text, which is not always well structured and could be semantically ambiguous. On the other hand, a data retrieval system (such as a relational database) deals with data that has a well-defined structure and semantics.

Data retrieval, while providing a solution to the user of a database system, does not solve the problem of retrieving information about a subject or topic. To be effective in its attempt to satisfy the user information need, the IR system interprets the contents of the information items (documents) in a collection and ranks them according to a degree of relevance to the user query. This interpretation of document content involves extracting syntactic and semantic information from the document text and using this information to match the user need. The difficulty is not only knowing how to extract this information but also knowing how to use it to decide relevance. Thus, the notion of relevance is at the center of information retrieval. In fact, the primary goal of an IR system is to retrieve all the documents that are relevant to a user query while retrieving as few non-relevant documents as possible.

Over the years, the area of information retrieval has grown well beyond its primary goals of indexing text and searching for useful documents in a collection. Nowadays, research in IR includes modeling (Ponte & Croft, 1998), document classification and categorization, systems architecture, user interfaces, data visualization, filtering, languages, etc. Despite its maturity, until recently IR was seen as a narrow area of interest mainly to librarians and information experts. Such a tendentious vision prevailed for many years, despite the rapid dissemination, among users of modern personal computers, of IR tools for multimedia and hypertext applications. At the beginning of the nineties, a single fact changed these perceptions once and for all: the introduction of the World Wide Web. Great emphasis is therefore placed on integrating the different areas that are closely related to the information retrieval problem and thus should be treated together.

The effective retrieval of relevant information is directly affected both by the user task and by the logical view of the documents adopted by the retrieval system. Regarding the user task, the user of a retrieval system has to translate his information need into a query in the language provided by the system. With an information retrieval system, this normally implies specifying a set of words that convey the semantics of the information need (Beitzel et al., 2004). With a data retrieval system, a query expression (such as a regular expression) is used to convey the constraints that must be satisfied by objects in the answer set. In both cases, we say that the user searches for useful information by executing a retrieval task.

Figure 1.1: Interaction of the user with the retrieval system

Consider now a user whose interest is either poorly defined or inherently broad. For instance, the user might be interested in documents about car racing in general. In this situation, we say that the user is browsing the documents in the collection, not searching. It is still a process of retrieving information, but one whose main objectives are not clearly defined in the beginning and whose purpose might change during the interaction with the system.

There is a clear distinction between the different tasks the user of a retrieval system might be engaged in. The tasks fall into two distinct types: information or data retrieval, and browsing. Classic information retrieval systems normally allow information or data retrieval. Hypertext systems (Brin & Page, 1998) are usually tuned for quick browsing. Modern digital library and Web interfaces might attempt to combine these tasks to provide improved retrieval capabilities. The user interacts with the system through these tasks, which identify the user's interest. Information and data retrieval are usually provided by most modern information retrieval systems (such as Web interfaces). Both retrieval and browsing are, in the language of the World Wide Web, pulling actions: the user requests the information in an interactive manner. An alternative is to do retrieval in an automatic and permanent fashion using software agents, which push the information towards the user. For instance, information useful to a user could be extracted periodically from a news service. In this case, we say that the IR system is executing a particular retrieval task, which consists of filtering relevant information for later inspection by the user.

Regarding the logical view of the documents, the documents in a collection are frequently represented through a set of index terms or keywords. Such keywords might be extracted directly from the text of the document or might be specified by a human subject. Whether the keywords are derived automatically or generated by a specialist, they provide a logical view of the document.

Modern computers are making it possible to represent a document by its full set of words. In this case, we say that the retrieval system adopts a full text logical view of the documents. With very large collections, however, even modern computers might have to reduce the set of representative keywords. This can be accomplished through the elimination of stop words (such as articles and connectives), the use of stemming (which reduces distinct words to their common grammatical root), and the identification of noun groups (which eliminates adjectives, adverbs, and verbs). The full text is clearly the most complete logical view of a document, but its usage usually implies higher computational costs. A small set of categories (generated by a human specialist) provides the most concise logical view of a document, but its usage might lead to retrieval of poor quality. Several intermediate logical views of a document might be adopted by an information retrieval system. Besides adopting any of the intermediate representations, the retrieval system might also recognize the internal structure normally present in a document (e.g., chapters, sections, subsections, etc.).
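The reduction from full text to a smaller set of index terms can be sketched as follows. This is an illustration only: the stop list is deliberately tiny, and `crude_stem` is a hypothetical suffix stripper standing in for a real stemming algorithm, so its output will differ from that of a production stemmer.

```python
import re

# Illustrative stop list (assumption); real systems use much longer lists.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "in", "to", "is", "are"}


def crude_stem(word):
    """Strip a few common English suffixes (toy stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Tokenize, drop stop words, stem, and return the distinct index terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sorted({crude_stem(t) for t in tokens if t not in STOP_WORDS})


terms = index_terms("The engines are indexing and ranking the indexed documents")
print(terms)
```

Nine words of text collapse to four index terms here, which shows the trade-off discussed above: a more concise logical view at the price of losing some of the original wording.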

1.3 General Model of Information Retrieval

Information retrieval aims to satisfy the user's information needs. The word "document" in document searching covers not only text documents but also other objects such as multimedia objects. Figure 2.1 provides a general overview of the information retrieval process, which has been adapted from Lancaster and Warner (1993). Users have to express their information need in a form that can be understood by the retrieval mechanism. There are a number of steps involved in this translation process. Similarly, the contents of large document collections need to be described in a form that allows the retrieval mechanism to recognize the relevant documents. In both cases, information may be lost in the transformation process leading to a computer-usable representation. Hence, the matching process is inherently imperfect.

Information seeking is also a form of problem solving [Marcus 1994, Marchionini 1992]. It proceeds through the interaction of eight subprocesses: problem recognition and acceptance, problem definition, search system selection, query formulation, query execution, examination of results (including relevance feedback), information extraction, and reflection/iteration/termination. To perform effective searches, users have to develop the following expertise: knowledge about various sources of information such as the WWW, skills in defining search problems (the "keywords") and applying search strategies, and competence in using electronic search tools.

The information need, shown at the peak of the pyramid, is expressed as a conceptual query; this is the part visible to the user (see Figure 2.1). The conceptual query captures the key concepts and the relationships among them. It is the result of a conceptual analysis that operates on the information need, which may be well or unclearly defined in the user's mind. This analysis is challenging, because users face the "vocabulary problem" when they try to translate their information need into a conceptual query. The problem refers to the fact that a single word can have more than one meaning and, conversely, the same concept can be described by surprisingly many different words (Furnas, Landauer, Gomez and Dumais 1983). Further, the concepts used to represent the documents may differ from the concepts used by the user. The conceptual query can take the form of a natural language statement, a list of concepts that can have degrees of importance assigned to them, or a statement that coordinates the concepts using Boolean operators. Finally, the conceptual query has to be translated into a query surrogate that can be understood by the retrieval system.

Figure 2.1: A general model of the information retrieval process, in which both the user's information need and the document collection have to be translated into surrogates to enable the matching process to be performed. This figure has been adapted from Lancaster and Warner (1993).

A text surrogate can consist of a set of index terms or descriptors. It can also consist of multiple fields, such as the title, abstract and descriptor fields, which capture the meaning of a document at different levels or focus on its characteristic aspects. Either the user is satisfied by the retrieved information, or he evaluates the retrieved documents and modifies the query to initiate a further search. The process of query modification based on user evaluation of the retrieved documents is known as relevance feedback [Lancaster and Warner 1993]. Information retrieval is an inherently interactive process, and users can change direction by modifying the query surrogate, the conceptual query or their understanding of their information need.
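One well-known way to automate relevance feedback is the Rocchio method, which the text above does not name; it is used here only as an illustrative stand-in. The sketch assumes that queries and documents are represented as term-weight dictionaries, and the coefficients `alpha`, `beta` and `gamma` are conventional illustrative values, not values from the cited sources.

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a modified query vector (dict mapping term to weight).

    The query moves towards the centroid of the documents judged relevant
    and away from the centroid of the documents judged non-relevant.
    """
    terms = set(query)
    for doc in relevant + non_relevant:
        terms |= set(doc)

    new_query = {}
    for t in terms:
        pos = sum(d.get(t, 0.0) for d in relevant) / len(relevant) if relevant else 0.0
        neg = sum(d.get(t, 0.0) for d in non_relevant) / len(non_relevant) if non_relevant else 0.0
        weight = alpha * query.get(t, 0.0) + beta * pos - gamma * neg
        if weight > 0:  # negative weights are dropped, a common simplification
            new_query[t] = weight
    return new_query


# Hypothetical judgments: the user wanted the car, not the animal.
query = {"jaguar": 1.0}
relevant = [{"jaguar": 1.0, "car": 1.0}]
non_relevant = [{"jaguar": 1.0, "cat": 1.0}]
new_query = rocchio(query, relevant, non_relevant)
print(new_query)
```

The modified query gains the term "car" from the relevant document and never acquires "cat", so a second search is steered towards the user's intended meaning.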

Studies investigating the information-seeking process describe information retrieval in terms of the cognitive and affective symptoms commonly experienced by a library user. The findings (Kuhlthau et al. 1990) indicate that thoughts about the information need become clearer and more focused as users move through the search process. Likewise, the uncertainty, confusion and frustration that are nearly universal in the early stages decrease as the search process progresses, while feelings of being confident, satisfied, sure and relieved increase. These studies also indicate cognitive attributes that may affect the search process. Users' expectations of the information system and the search process may influence the way they approach searching and therefore affect their intellectual access to information.

Analytical search strategies rely on the formulation of specific, well-structured queries and a systematic, iterative search for information. Browsing involves broad query terms and the scanning of larger sets of information in unstructured documents. Studies of information retrieval in hypertext systems found that the predominant search strategy is "browsing" rather than "analytical search" (Campagnoni et al. 1989). Furthermore, the research showed that search strategy is one dimension of effective information retrieval. With a browsing style of interaction, many search objectives are formulated only as the search starts to return useful results.

Figure 1.4 Retrieval model and application

1.4 Models / Application

The retrieved objects can be evaluated and kept ready for retrieval at the browser. The approaches can be divided along two dimensions: models/algorithms and applications/systems. The retrieval mechanism plays an important role in both categories.

1.4.1 Models/Algorithm of Information Retrieval

The models that have been developed to retrieve information are the Boolean model; the statistical models, which include the vector space and the probabilistic retrieval models; and the linguistic and knowledge-based models. The first model is often referred to as the "exact match" model, the latter ones as "best match" models [Belkin and Croft 1992; Frakes and Baeza-Yates 1992].
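The difference between "exact match" and "best match" can be sketched by running the same query through a Boolean AND test and through a vector space ranking. The sketch assumes raw term frequencies as weights and cosine similarity as the ranking function; the three documents and the function names are hypothetical.

```python
import math
from collections import Counter


def boolean_and(query_terms, doc_terms):
    """Exact match: the document qualifies only if it contains every query term."""
    return set(query_terms) <= set(doc_terms)


def cosine(query_terms, doc_terms):
    """Best match: cosine of the angle between raw term-frequency vectors."""
    q, d = Counter(query_terms), Counter(doc_terms)
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in d.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical toy collection (assumption, for illustration only).
docs = {
    "d1": "information retrieval ranks documents".split(),
    "d2": "data retrieval uses exact match".split(),
    "d3": "cooking recipes".split(),
}
query = "information retrieval".split()

exact = [name for name, terms in docs.items() if boolean_and(query, terms)]
ranked = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
print(exact, ranked)
```

The Boolean test returns only d1, whereas the vector space ranking still places the partially matching d2 ahead of the unrelated d3.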

Queries are generally less than perfect in two respects: first, they retrieve some irrelevant documents; second, they do not retrieve all the relevant documents. Two complementary measures capture these shortcomings and are usually used to evaluate the effectiveness of a retrieval method. The first, the precision rate, is the proportion of the retrieved documents that are actually relevant. The second, the recall rate, is the proportion of all relevant documents that are actually retrieved. Searchers who want to raise precision have to narrow their queries; searchers who want to raise recall have to broaden them. In general, precision and recall are inversely related. Users need help to become knowledgeable in managing precision and recall for their particular information need [Marcus, 1991].
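The two measures follow directly from their definitions: precision is the number of relevant retrieved documents divided by the number of retrieved documents, and recall is the same count divided by the number of relevant documents. The sketch below uses hypothetical document identifiers to contrast a narrow and a broad query.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical relevance judgments (assumption, for illustration only).
relevant = {"d1", "d2", "d3", "d4"}

narrow = precision_recall({"d1", "d2"}, relevant)
broad = precision_recall({"d1", "d2", "d3", "d7", "d8", "d9"}, relevant)
print(narrow, broad)
```

In this example the narrow query retrieves only relevant documents but misses half of them, while the broad query finds more of the relevant documents at the cost of lower precision.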

In Fuzzy Set theory, an element has a varying degree of membership to a set instead of the traditional binary membership choice. The weight of an index term for a given document reflects the degree to which this term describes the content of a document. Hence, this weight reflects the degree of membership of the document in the fuzzy set associated with the term in question.
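A minimal sketch of fuzzy retrieval follows. It assumes the standard fuzzy set operators, minimum for AND and maximum for OR, which the text above does not spell out; the index-term weights are hypothetical.

```python
def fuzzy_and(doc_weights, terms):
    """Degree of membership in the intersection of the term sets (minimum)."""
    return min(doc_weights.get(t, 0.0) for t in terms)


def fuzzy_or(doc_weights, terms):
    """Degree of membership in the union of the term sets (maximum)."""
    return max(doc_weights.get(t, 0.0) for t in terms)


# Hypothetical index-term weights of one document, each in [0, 1].
doc = {"retrieval": 0.8, "fuzzy": 0.3}

and_score = fuzzy_and(doc, ["retrieval", "fuzzy"])
or_score = fuzzy_or(doc, ["retrieval", "fuzzy"])
print(and_score, or_score)
```

Unlike the binary Boolean model, the document receives a graded score for each query, so documents can be ranked rather than merely accepted or rejected.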

The P-norm method (Fox, 1983) allows query and document terms to have weights, which are computed from term frequency statistics using proper normalization procedures. These normalized weights are used to rank the documents in order of decreasing distance from the point (0, 0, ..., 0) for an OR query, and in order of increasing distance from the point (1, 1, ..., 1) for an AND query.
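The distance-based ranking can be sketched as follows. For simplicity the sketch assumes that all query terms carry equal weight, so only the normalized document weights appear; the full method also weights the query terms. The function names and the example weights are hypothetical.

```python
def pnorm_or(weights, p):
    """OR similarity: normalized distance from the point (0, ..., 0)."""
    n = len(weights)
    return (sum(w ** p for w in weights) / n) ** (1.0 / p)


def pnorm_and(weights, p):
    """AND similarity: one minus the normalized distance from (1, ..., 1)."""
    n = len(weights)
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / n) ** (1.0 / p)


# Hypothetical document that fully matches one of two query terms.
weights = [1.0, 0.0]

for p in (1, 2, 10):
    print(p, round(pnorm_or(weights, p), 4), round(pnorm_and(weights, p), 4))
```

With p = 1 the two operators give the same score, and as p grows the OR score rises towards the maximum weight while the AND score falls towards the minimum, so p controls how strictly the Boolean operators are interpreted.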

Several statistical and AI techniques have been used in association with domain semantics to extend the vector space model to help overcome some of the retrieval problems described above, such as the "dependence problem" or the "vocabulary problem". One such method is Latent Semantic Indexing (LSI). In LSI the associations among terms and documents are calculated and exploited in the retrieval process.

Algorithmic based Retrieval System

Ontology, WordNet, Indri and Lemur tools

1.4.2 Systems/Application of Information Retrieval

Social networking, semi-nets and conceptual graphs

Web mining is a knowledge discovery process applied to Web data. A large amount of the information available on the Web lacks structure, and this is where web mining may be useful. Web mining refers to the discovery and analysis of useful information over the World Wide Web. The Web mining field encompasses a wide array of issues, primarily aimed at deriving actionable knowledge from the Web, and includes researchers from information retrieval, database technologies, and artificial intelligence [9].

The term ontology can be defined in many different ways. Genesereth and Nilsson defined an ontology as an explicit specification of a set of objects, concepts, and other entities that are presumed to exist in some area of interest and the relationships that hold among them [4]. Usually, ontologies are defined to consist of abstract concepts and relationships (or properties) only. In some rare cases, ontologies are defined also to include instances of concepts and relationships [5]. Various algorithms may be proposed for extracting information from collections of web pages across different sites.

Social Network Analysis (SNA) is a research area that tries to analyze and model the behavior of an actor based on his or her connections or relations to other members of a group. For further reference see [WF99]. An actor is thus seen as restricted or empowered by his or her connection to others. The basis of this structural approach is given by models of group interaction. The first research questions were posed to define the roles of actors in a given social context; leadership of a group, for example, is such a role. There are also models of the power to manipulate. A person in such a context may be called relevant, or central, if he or she is positioned in the group's network in such a way that all information exchanged between any two actors has to pass through this 'central' actor, who can thus manipulate the group. The question of who is relevant within a group is therefore one of the research questions of SNA. Based on graph theory, this can be analyzed by using different so-called centrality indices. Some of them are intuitive, such as degree centrality; others are more elaborate, such as betweenness centrality or eigenvector centrality. But the question is always: given a clearly defined context, who within a group is relevant, who is not, how are the actors in the group connected and what, if any, predictions can be made for the future development of the group structure? Thus, this analysis approach can be used to find the 'relevant' people or websites needed to enhance the information found by text retrieval.
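Two of the centrality indices mentioned above can be sketched for a small group. The sketch assumes an unweighted, undirected network stored as an adjacency list and computes betweenness with a Brandes-style breadth-first accumulation; the four actors and their connections are hypothetical.

```python
from collections import deque


def degree_centrality(graph):
    """Number of direct connections of each actor."""
    return {node: len(neighbours) for node, neighbours in graph.items()}


def betweenness_centrality(graph):
    """Shortest-path betweenness for an unweighted, undirected graph."""
    score = {node: 0.0 for node in graph}
    for source in graph:
        # Breadth-first search that counts shortest paths from the source.
        order = []
        preds = {node: [] for node in graph}
        paths = {node: 0 for node in graph}
        dist = {node: -1 for node in graph}
        paths[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        # Accumulate path dependencies in reverse breadth-first order.
        delta = {node: 0.0 for node in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += paths[v] / paths[w] * (1.0 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {node: value / 2.0 for node, value in score.items()}


# Hypothetical star-shaped group: every exchange passes through "ann".
group = {
    "ann": ["bob", "cat", "dan"],
    "bob": ["ann"],
    "cat": ["ann"],
    "dan": ["ann"],
}
degree = degree_centrality(group)
between = betweenness_centrality(group)
print(degree, between)
```

In this star-shaped group, "ann" lies on the shortest path between every pair of other actors, which corresponds to the 'central' actor described above.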

Conceptual search, i.e. search based on meaning rather than just character strings, has been the motivation of a large body of research in the IR field since long before the Semantic Web vision emerged [1], [35]. Following the classification shown in Figure 1, we provide a brief description of the studied works from both the IR and the Semantic Web fields that have attempted to solve the problem of conceptual search. As a conclusion of this section, we will point out the main distinctive aspects of our approach with respect to the ones described next.

Table 1 - Classification of semantic search systems

The IR field was the first to take a step towards conceptual search. This drive can be found in widely explored areas such as Latent Semantic Indexing [15], [43], linguistic conceptualisation approaches [28], [48], or the use of thesauri and taxonomies to improve retrieval [7], [26], to name a few. Such proposals are commonly based on shallow and sparse conceptualisations, usually considering very few different types of relations between concepts, and low information specificity levels. In the last few years the Semantic Web has contributed novel ontology-based proposals that consider a much more detailed and densely populated conceptual space in the form of an ontology-based KB. An obvious, immediate trade-off of these approaches is that such a rich conceptual space is more difficult and expensive to obtain, but this is one of the major targets addressed by the Semantic Web research community, which is already providing significant results and dependable grounds to build upon [16], [56].

The application of conceptual search has been undertaken in different environments such as the Web, controlled repositories or even the desktop. While restricting the environment has no effect on the conceptual search approaches coming from the IR field, it has an important effect on the Semantic Web approaches. In these models, conceptualizations are not expressed by means of mere thesauri or taxonomies, but by much richer structures such as ontologies and KBs, whose generation is a difficult and costly task. Among the environments cited above, the Web stands out for its difficulty [24], [51]. The Web is an open space where the information is distributed across millions of computers; where content evolves and grows extremely fast; which extends across multiple different domains; and to which millions of users with different characteristics and purposes turn to satisfy the most diverse information needs every day. Obtaining conceptualizations able to amply cover the meanings involved in all web content is still an unresolved problem in general. Restricting themselves to more enclosed environments, many works have been undertaken and tested over controlled repositories [46], [76], where the available information can be covered by one or more domain ontologies and KBs. However, as pointed out before, extracting conceptual meanings and formally representing them within ontologies and KBs is a difficult and costly task that requires major engineering efforts. To get around this problem, another important environment has been considered in the literature: the desktop [6]. In this environment, the semantic information can be easily extracted from semi-structured contents such as e-mails, folders, etc.; there is no such diversity of users, and the interaction with them is much more explicit, with feedback acquired continuously.

Another relevant aspect that characterises semantic search models is the way the user expresses her requirements. Four different approaches may be identified in the state of the art, characterised by a gradual increase in their level of formality. In the first level, queries are expressed by means of keywords [31]. For instance, a request for information about movies where Brad Pitt plays the leading role could be expressed by a set of keywords like "Brad Pitt movies". This is the most traditional way of consultation, but also the least expressive one, since the information need is represented as a set of terms without any explicit relation between them. The next level involves a natural language representation of the information need [46]. In this case, the previously mentioned example could be expressed as a full (interrogative) sentence, such as "In what movies does Brad Pitt play the leading role?" This kind of query provides much more information than the keyword approach, since a linguistic analysis can be performed to extract syntactic information such as the subject, predicate, object and other details of the sentence. The next level of formality is represented by systems where the query is expressed by adding tags that represent properties, values or objects within the consultation [9]. Following the previous example, the query could be expressed as "s: Actor p: name v: Brad Pitt p: leading-role s: film". This kind of query is easier to process and to map to the corresponding classes, properties and values of a schema or ontology underlying the search space, thus facilitating the acquisition of the semantically related information. Finally, the most formal search systems are based on ontology query languages [76] such as RDQL [63], SPARQL [57], etc. In this approach, the previous example could be expressed as "select ?f where (?a, <name>, 'Brad Pitt'), (?a, <leading-role>, ?f)". The full expressive power of this kind of query allows the system to retrieve automatically, and in a highly precise way, the information that satisfies the information need.
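The way such a formal query is answered can be sketched with a tiny in-memory triple store. This is an illustration only, not an implementation of RDQL or of any real query engine: the functions `match` and `select`, the identifiers `actor-1`, `film-1` and so on, and the stored facts are all hypothetical placeholders.

```python
def match(pattern, triple, binding):
    """Try to extend a variable binding so that the pattern equals the triple."""
    binding = dict(binding)
    for part, value in zip(pattern, triple):
        if part.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if binding.setdefault(part, value) != value:
                return None
        elif part != value:
            return None
    return binding


def select(variable, patterns, triples):
    """Return the values of `variable` in every binding that satisfies all patterns."""
    bindings = [{}]
    for pattern in patterns:
        extended = []
        for b in bindings:
            for t in triples:
                new_b = match(pattern, t, b)
                if new_b is not None:
                    extended.append(new_b)
        bindings = extended
    return sorted({b[variable] for b in bindings})


# Hypothetical knowledge base with placeholder instance names.
triples = [
    ("actor-1", "name", "Brad Pitt"),
    ("actor-1", "leading-role", "film-1"),
    ("actor-1", "leading-role", "film-2"),
    ("actor-2", "name", "Another Actor"),
    ("actor-2", "leading-role", "film-3"),
]

films = select(
    "?f",
    [("?a", "name", "Brad Pitt"), ("?a", "leading-role", "?f")],
    triples,
)
print(films)
```

Because the variable `?a` must take the same value in both patterns, only the films linked to the matching actor instance are returned, which is the precision that the formal query level offers over plain keywords.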

Conceptual search approaches can be characterised by whether they aim at data retrieval or information retrieval (IR). While the majority of IR approaches always return documents in response to a user request, and therefore should be classified as information retrieval models, a large number of ontology-based approaches return ontology instances rather than documents, and therefore should be classified as data retrieval models. For example, in response to the query "films where Brad Pitt plays the leading role", a data retrieval system will retrieve a list of movie instances, while an IR system will retrieve a list of documents containing information about such movies. Semantic Portals [3], [4], [10], [47] and query-answering systems [46] typically provide simple search functionalities that may be better characterised as semantic data retrieval than as information retrieval, since generally no ranking method is provided. In some systems, links to documents that reference the instances are added in the user interface, next to each returned instance in the query answer [10], but neither the instances nor the documents are ranked.

This feature can be refined by considering the kind of information the system retrieves in response to user queries. In approaches that aim at information retrieval, a distinction can be observed between systems that retrieve textual information [76] and systems that retrieve multimedia content [40], [75]. In data retrieval approaches, the expressive power of the provided formal language adds a further distinction. In our state of the art analysis we shall observe whether the systems retrieve XML documents [9] or proper ontological pieces of knowledge [31], [46]. As we pointed out in the introduction, the lack of ranking methodologies that take advantage of semantically related information is still a key drawback of semantic search models. Some approaches do not provide any ranking at all, other models base their ranking functionality on traditional keyword-based approaches [31], and a few attempt to take advantage of semantic information to generate the final ranking of conceptual data [71] or documents [76]. Within this analysis we may highlight three main trends, characterized by the type of semantic information and the way it is used:

Latent semantic analysis approaches: these models do not use human-based language understanding methodologies, but statistical models that identify groups of words that commonly appear together and are therefore assumed to describe the same reality. These approaches are the farthest from the semantic search paradigm, where the conceptual understanding should be performed at the level of languages.

Linguistic conceptualization approaches: these approaches are the first to take a step towards real semantic search, where machines attempt to understand concepts in the same way as humans do. To do so, they make use of thesauri and taxonomies. Even though this constitutes a big step, the conceptualizations used are shallow and sparse, and therefore limit further progress towards the semantic search paradigm.

Ontology-based approaches: these approaches use much more detailed semantic information in the form of ontologies and KBs. However, although they appear to have the appropriate tools to achieve the main objectives traditionally pursued by the semantic search paradigm, they are still far from this goal. One of the main identified limitations of these approaches is the lack of appropriate ranking models needed for scaling up to massive information sources. In this work, we propose a novel ontology-based IR model that attempts to address this main drawback.

The applications include social networking and conceptual graph repositories in library and information sciences. The web application explores the use of semi-nets for data navigation and integration. The algorithmic approach includes data mining for selecting the relevant data, statistical methods for grouping algorithms, and natural language processing for handling the semantics and syntax of objects.

1.4 Text management and its applications

User feedback is very important in the retrieval mechanism. From the perspective of retrieval, the management and applications of text play an important role in the organization of documents. The collection of documents is a repository for web documents. These documents can be organized: they can be structured, annotated or added to specific collections for better retrieval and understanding. They can be mined using different models to create knowledge, and they can be accessed to select information. The access, mining and organization of documents can work together to achieve efficient retrieval (Strohman et al., 2004).

Natural language content analysis supports the organization of text, content and information through retrieval applications and mining applications. The retrieval application provides the functions used for information access: summarization, search, filtering and categorization. The mining application provides the functions used for knowledge acquisition: visualization, mining extraction and clustering (Diez et al., 2004).

Figure 1.2 Information Organizations

Semantic analysis is complemented by syntactic analysis, the study of the combinatorics of the units of a language (without reference to their meaning), and by pragmatics, the study of the relationships between the symbols of a language, their meaning, and the users of the language (Turtle & Croft, 1991). The sentence "The dog is chasing a boy on the playground" contains the objects dog (d1), boy (b1) and playground (p1). The verb "chasing" defines the relationship between the dog, the boy and the playground, i.e. chasing(d1, b1, p1). It also supports inferred relations such as "scared" or "frightened". Semantic analysis thus extracts entities and relations from combinations of words.
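The dog-and-boy example can be written down as a small data structure. The following Python sketch is only an illustration: the analysis of the sentence is built by hand rather than produced by a parser, and the names Entity, Relation, infer and render, as well as the single inference rule, are assumptions introduced here.

```python
from typing import NamedTuple, Tuple


class Entity(NamedTuple):
    ident: str  # identifier used in the text, e.g. "d1"
    kind: str   # the word the entity stands for, e.g. "dog"


class Relation(NamedTuple):
    predicate: str
    args: Tuple[Entity, ...]


# Hand-built analysis of "The dog is chasing a boy on the playground".
d1 = Entity("d1", "dog")
b1 = Entity("b1", "boy")
p1 = Entity("p1", "playground")
chasing = Relation("chasing", (d1, b1, p1))


def infer(rel):
    """Hypothetical rule: if x chases y, then y may be scared."""
    if rel.predicate == "chasing" and len(rel.args) >= 2:
        return [Relation("scared", (rel.args[1],))]
    return []


def render(rel):
    """Format a relation in predicate notation."""
    return f"{rel.predicate}({', '.join(a.ident for a in rel.args)})"


print(render(chasing))
for inferred in infer(chasing):
    print(render(inferred))
```

A real semantic analyzer would derive the entities and the relation from the parsed sentence; here only the target representation is shown.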

1.5 Information Retrieval Process and Search System

When the user enters a keyword in the search box, the search system internally runs a robot application. It refines the query and then sends it to the retrieval system. The retrieval system initiates two important processes: first, it segregates the huge set of documents, and second, it separates the relevant documents from the non-relevant ones. The retrieval system is the heart of the search process. It filters the documents according to what is required or asked for. It also involves the process of information filtering. Information filtering is a process that removes redundant or unwanted information from an information stream using (semi-)automated or computerized methods prior to presentation to a human user. Its main goal is the management of information overload and the increment of the semantic network. The user's profile is compared to some reference characteristics. These characteristics may originate from the information item (the content-based approach) or from the user's social environment (the collaborative filtering approach). This setting can be characterized by a stable, long-term interest and a dynamic information source. The information retrieval search system must make a delivery decision as soon as a document arrives in the corpus, so every document has to be filtered immediately on arrival.
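The content-based approach described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the user profile and the documents are plain bags of words, similarity is measured with the cosine of term-count vectors, and the threshold of 0.2 as well as the function names term_vector, cosine and deliver are hypothetical choices made for this sketch.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts for a piece of text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def deliver(profile, document, threshold=0.2):
    """Decide immediately whether an arriving document is passed to the user.

    The threshold is an assumed value chosen for illustration only.
    """
    return cosine(profile, term_vector(document)) >= threshold


# Stable, long-term interest of the user (assumed example profile).
profile = term_vector("information retrieval search ranking")

# Dynamic information source: documents arrive one at a time.
stream = [
    "ranking functions for web search",
    "recipes for summer salads",
]
delivered = [doc for doc in stream if deliver(profile, doc)]
print(delivered)
```

A collaborative filtering system would replace the profile comparison with the judgements of other users who have similar interests; that variant is not shown here.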

For efficient and intelligent retrieval, the categorizing system uses collaborative filtering and maintains the pre-given categories and labeled documents in the form of a hierarchy. It also classifies new documents. The robots or agents apply a standard supervised learning algorithm. Clustering is a process that discovers the natural structure of the data, for example based on likes and dislikes (Fung et al., 2003). It groups similar objects together. An object can be a document, term, passage, image, audio item, etc. The information retrieval process can be studied in two broad areas to achieve efficient and intelligent retrieval of documents (Joachims, 2003): one is model/algorithm and the other is application/system. The whole study is based on these two categories (Strohman et al., 2004). A statistical language model assigns a probability to a sequence of m words by means of a probability distribution.

Figure 1.3 Statistical language modeling: Naïve Bayes Algorithm
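The combination of a statistical language model with the Naïve Bayes algorithm can be illustrated as follows. The sketch assumes a unigram language model per category with add-one (Laplace) smoothing, which is a common textbook choice and not necessarily the configuration shown in the figure; the class name, the two categories and the training sentences are invented for this example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over unigram language models (illustrative)."""

    def __init__(self):
        self.class_docs = Counter()              # documents seen per category
        self.term_counts = defaultdict(Counter)  # term counts per category
        self.vocab = set()

    def train(self, text, label):
        tokens = text.lower().split()
        self.class_docs[label] += 1
        self.term_counts[label].update(tokens)
        self.vocab.update(tokens)

    def log_score(self, text, label):
        """log P(label) + sum of log P(word | label), add-one smoothed."""
        total_docs = sum(self.class_docs.values())
        score = math.log(self.class_docs[label] / total_docs)
        class_total = sum(self.term_counts[label].values())
        for tok in text.lower().split():
            count = self.term_counts[label][tok]
            score += math.log((count + 1) / (class_total + len(self.vocab)))
        return score

    def classify(self, text):
        return max(self.class_docs, key=lambda label: self.log_score(text, label))


# Assumed toy training data with two pre-given categories.
model = NaiveBayes()
model.train("search engine ranks web pages", "retrieval")
model.train("query terms match indexed documents", "retrieval")
model.train("the dog chased the boy", "other")
model.train("children play on the playground", "other")

print(model.classify("indexed web documents"))
print(model.classify("dog on the playground"))
```

Each category's unigram model assigns a probability to the word sequence of a new document, and the document is placed in the category whose model, weighted by the category prior, gives the highest score.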

The required information has to be represented in the form in which the queries are handled. Similarly, the text objects are represented as index objects, which serve as the repositories for the retrieval objects. These data items are used after the comparison. Every newly retrieved object must be indexed.
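One common way to represent text objects as index objects is an inverted index, which maps each term to the set of documents containing it. The sketch below assumes this structure, whitespace tokenization and conjunctive lookup; the class and method names are hypothetical and the source does not prescribe a particular index.

```python
from collections import defaultdict


class InvertedIndex:
    """Maps each term to the identifiers of the documents that contain it."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.documents = {}

    def add(self, doc_id, text):
        """Index a newly arrived document."""
        self.documents[doc_id] = text
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def lookup(self, query):
        """Return the documents that contain every term of the query."""
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


index = InvertedIndex()
index.add(1, "information retrieval systems")
index.add(2, "retrieval of web documents")

print(index.lookup("retrieval"))
print(index.lookup("web retrieval"))
```

The query is represented in the same form as the documents (a list of terms), so the comparison reduces to intersecting posting sets.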

2.4 Conclusion

There is a growing discrepancy between the retrieval approach used by existing commercial retrieval systems and the approaches investigated and promoted by a large segment of the information retrieval research community. The former is based on the Boolean or Exact Matching retrieval model, whereas the latter subscribe to statistical and linguistic approaches, also referred to as the Partial Matching approaches. First, the major criticism leveled against the Boolean approach is that its queries are difficult to formulate. Second, the Boolean approach makes it possible to represent structural and contextual information that would be very difficult to represent using the statistical approaches. Third, the Partial Matching approaches provide users with a ranked output, but these ranked lists obscure valuable information. Fourth, recent retrieval experiments have shown that the Exact and Partial Matching approaches are complementary and should therefore be combined [Belkin et al. 1993].

Table 2.6: Some of the key problems in the field of information retrieval and possible solutions.

In Table 2.6 we summarize some of the key problems in the field of information retrieval and possible solutions to them. We will attempt to show in this thesis: 1) how visualization can offer ways to address these problems; 2) how to formulate and modify a query; 3) how to deal with large sets of retrieved documents, commonly referred to as the information overload problem. In particular, this thesis overcomes one of the major "bottlenecks" of the Boolean approach by showing how Boolean coordination and its diverse narrowing and broadening techniques can be visualized, thereby making it more user-friendly without limiting its expressive power. Further, this thesis shows how both the Exact and Partial Matching approaches can be visualized in the same visual framework to enable users to make effective use of their respective strengths.
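The contrast between the Exact and Partial Matching approaches discussed in this section can be made concrete with a small Python sketch. It is an illustration under simplifying assumptions: Exact Matching is reduced to a Boolean AND over query terms, Partial Matching is reduced to ranking by the number of shared terms, and the three documents and the function names are invented for this example.

```python
docs = {
    "d1": "boolean retrieval uses exact matching",
    "d2": "statistical retrieval ranks documents by partial matching",
    "d3": "visualization of retrieved documents",
}


def tokens(text):
    return set(text.lower().split())


def boolean_and(query_terms, collection):
    """Exact matching: a document is returned only if it has all terms."""
    wanted = set(query_terms)
    return sorted(d for d, text in collection.items() if wanted <= tokens(text))


def partial_match(query_terms, collection):
    """Partial matching: rank documents by the number of shared terms."""
    wanted = set(query_terms)
    scored = []
    for d, text in collection.items():
        overlap = len(wanted & tokens(text))
        if overlap:
            scored.append((overlap, d))
    # Highest overlap first; ties are broken by document identifier.
    return [d for overlap, d in sorted(scored, key=lambda p: (-p[0], p[1]))]


query = ["retrieval", "documents", "matching"]
print(boolean_and(query, docs))
print(partial_match(query, docs))
```

The Boolean query returns an unranked set and silently drops documents that miss a single term, whereas the ranked list keeps them but hides which terms each document actually matched. This is one way to see why the two approaches are considered complementary.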