In an age of information overload, text documents, digital files, images, audio and many other kinds of information flood our daily lives. When people search for information, however, the textual retrieval system is still the mainstream tool. Why is the textual retrieval system still important? Because most of the growing volume of electronic information is derived from documents or newspapers. This paper focuses on the working principles and characteristics of textual retrieval systems, on users' needs and on current problems. The characteristics are identified from descriptions of various IR models: the Boolean model, the Vector model, the Probabilistic model and the Connectionist model. Rather than asking whether the data can be modelled more precisely, we look for a similarly simple model that improves results without reducing retrieval efficiency. By analyzing the principles and models of textual retrieval systems, we attempt to find a better information retrieval strategy and a better way of improving such systems.
An information retrieval system is a system for storing, retrieving and maintaining information. It is designed to serve a specific function and is made up of a set of interacting components that are integrated to achieve a goal. The objective of an information retrieval system is to minimize the overhead a user incurs in locating needed information. The area spans a variety of sub-fields, including information retrieval, text categorization, information filtering and question answering. Information retrieval is what users of internet search engines or digital libraries perform. Text categorization labels text documents with one or more predefined categories, possibly organized in a hierarchy. Information filtering, or routing, matches incoming documents against users' interest profiles. Question answering aims to extract specific and preferably short answers rather than return the full documents containing them.
Information retrieval is often treated as synonymous with document or text retrieval. Text retrieval implies that the task of the system is to retrieve documents or texts whose information content is relevant to a user's information need. A central problem of text retrieval is translating the user's need into queries for the retrieval system. Improving text retrieval systems improves people's access to the information available to them. In this paper, we present the characteristics of textual retrieval systems and describe various models.
Textual Retrieval System
The problem of text retrieval is searching for a specific piece of information on a specific topic in large document repositories. In practice, the user should be able to retrieve the relevant information by supplying a natural language query. A standard text retrieval system builds indexes of the documents and lets the user search by formulating queries against this index. Queries are usually formulated in natural language and express the concept the user wishes to retrieve. The system should then compare the concept expressed in the query with all the documents, rank the documents in order of relevance and return the most relevant ones to the user. The difficulty is that the same content can be described in many different ways using different words.
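The index described above is often realized as an inverted index, which maps each term to the documents that contain it. The following is only a minimal sketch under simple assumptions: whitespace tokenization, lowercasing, and a made-up three-document collection. The function name `build_inverted_index` is hypothetical and not taken from any particular system.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each lowercased term to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Assumption: terms are separated by whitespace; no stemming or stop words.
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# Illustrative documents only.
documents = {
    1: "text retrieval systems rank documents",
    2: "image retrieval uses visual features",
    3: "users formulate queries in natural language",
}

index = build_inverted_index(documents)
print(sorted(index["retrieval"]))  # [1, 2]
```

Looking up a query term in such an index avoids scanning every document at query time, which is why the index is built in advance.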
Information Retrieval Models
Various models for retrieving documents from huge collections will now be presented. The models described in the literature include the Boolean model, the Vector model, the Probabilistic model, the Connectionist model and Latent Semantic Indexing; the most common are the Boolean, vector space and probabilistic models. A Boolean model views each document as a Boolean statement built with operators such as AND, OR and NOT, so that resolving a query becomes the problem of finding which documents make the query true. A vector space model places queries and documents in a Euclidean space, which reduces searching to finding the documents closest to a query. A probabilistic model views each document as having been created by sampling from a probability distribution. The probability of certain features occurring in a document can then be described, and documents are retrieved according to their probability of relevance.
Boolean Model
The Boolean model is the simplest method of information retrieval; it uses Boolean operators to combine search terms. Queries are Boolean expressions of keywords connected by AND, OR and NOT, with nested parentheses for complex queries. The model is widely used in commercial systems, particularly for textual repositories that have already been manually indexed with keywords. It can be extended to include ranking, and it can be implemented reasonably efficiently for normal queries, typically with an inverted file organization or text signatures; proximity operators can also be supported. Boolean retrieval is very powerful for users who know how to specify exactly what they want. Users can express structural and conceptual constraints on important linguistic features. If a query requires a comprehensive and unambiguous selection, Boolean retrieval is very effective, and it offers many techniques to narrow or broaden a query.
The Boolean model has some drawbacks, because natural language is far more complex than Boolean expressions. To search efficiently, the user needs some knowledge of the topic. It is difficult to control the number of documents retrieved and to rank the output. Even when the user identifies a document as relevant or irrelevant, relevance feedback is difficult to perform. The result of Boolean retrieval is a set of documents without any ranking, so if the result set is very large, presenting the relevant documents to the user becomes a problem. The many ways in which a query can be modified or structured can overwhelm users. When the AND operator combines more than three or four criteria, few or no documents are retrieved. Users therefore often face either null output or information overload. Finally, the degree of uncertainty or error caused by the vocabulary problem is not represented.
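As a rough illustration of Boolean retrieval over an inverted file, the operators AND, OR and NOT can be mapped onto set intersection, union and difference. The tiny index, the document ids and the helper names below are hypothetical and chosen only for this sketch.

```python
# Hypothetical pre-built inverted index: term -> set of document ids.
index = {
    "boolean": {1, 2},
    "retrieval": {1, 2, 3},
    "vector": {3, 4},
}
all_docs = {1, 2, 3, 4}

def postings(term):
    """Return the set of documents containing the term (empty if unknown)."""
    return index.get(term, set())

def boolean_and(a, b):
    return a & b          # documents satisfying both operands

def boolean_or(a, b):
    return a | b          # documents satisfying at least one operand

def boolean_not(a):
    return all_docs - a   # documents that do not satisfy the operand

# retrieval AND NOT vector
narrow = boolean_and(postings("retrieval"), boolean_not(postings("vector")))
print(sorted(narrow))  # [1, 2]

# (boolean OR vector) AND retrieval, the nesting plays the role of parentheses
nested = boolean_and(boolean_or(postings("boolean"), postings("vector")),
                     postings("retrieval"))
print(sorted(nested))  # [1, 2, 3]
```

The result is an unordered set: every returned document satisfies the query equally, so no ranking is available without extending the model.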
Vector Space Retrieval Model
The vector space model is a classical non-Boolean model and is used in many search engines. It is more complicated to implement than the Boolean model, but easier for users.
The vector space model creates a space in which both documents and queries are represented as vectors of identifiers such as index terms. The terms are listed in a fixed order, for example alphabetically from A to Z. A vector similarity function is then used to compute the similarity between a document and a query, which is how documents are compared with queries.
Each dimension corresponds to a separate term. If a term does not occur in a document, its value in the document's vector is zero. Various methods of computing the non-zero values, the term weights, have been developed. How terms are defined depends on the application: typically they are single words, keywords or longer phrases. If words are chosen as terms, the dimensionality of the vector equals the number of words in the vocabulary.
The vector space model procedure can be divided into three stages. The first is document indexing, where terms are extracted from the document text. The second is the weighting of the indexed terms, to enhance the retrieval of documents relevant to the user. The last is ranking the documents with respect to the query according to a similarity measure.
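The three stages above (indexing, term weighting, ranking by similarity) could be sketched as follows. The essay does not fix a weighting scheme or similarity function, so this sketch assumes TF-IDF weights with a smoothed IDF and cosine similarity, which are common choices; the function names and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Index documents by whitespace tokens and weight each term by TF-IDF."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Assumed weighting: log(N / df) + 1, so no term gets a zero weight.
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vectors = [{t: c * idf[t] for t, c in Counter(tokens).items()}
               for tokens in tokenized]
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(query, docs):
    """Return (score, document index) pairs, most similar first."""
    vectors, idf = tf_idf_vectors(docs)
    q = {t: c * idf[t]
         for t, c in Counter(query.lower().split()).items() if t in idf}
    return sorted(((cosine(q, v), i) for i, v in enumerate(vectors)),
                  reverse=True)

docs = [
    "boolean retrieval model",
    "vector space retrieval model",
    "probabilistic ranking of documents",
]
ranking = rank("vector space model", docs)
print([i for _, i in ranking])  # [1, 0, 2]
```

Document 0 shares only one term with the query and still receives a non-zero score, which illustrates partial matching; word order plays no role in the result.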
There are some benefits and limitations in this model as follows:
Some advantages of this model are:
Based on linear algebra (simple model)
Uses term weights (not binary)
Allows continuous computation of similarity between queries and documents
Allows partial matching
Some disadvantages of this model are:
Very long documents with many terms make similarity measures difficult
The order of terms in the document is lost in the vector space representation
Weighting is not very formal
Compared with the Boolean model, users do not need much experience to formulate queries in this model. Although the system is difficult to implement, it is user friendly and very simple to use.
Probabilistic Retrieval Model
Probability is a numerical measure of the chance that a specific outcome will occur. In retrieval, exact probabilities cannot be computed, so estimates have to be used. The probabilistic model is used when information retrieval needs to handle uncertain information. Documents are ranked by an estimate of the probability of their relevance to the query.
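To make ranking by estimated relevance more concrete, here is a minimal sketch in the style of the binary independence model. The essay gives no formula, so the term weight used here, log((N - n + 0.5) / (n + 0.5)) with N documents in the collection and n documents containing the term, is an assumption: it is the usual estimate when no relevance information is available. The collection and the function name `bim_score` are hypothetical.

```python
import math

def bim_score(query_terms, doc_terms, doc_freq, num_docs):
    """Sum the estimated relevance weights of query terms present in the document."""
    score = 0.0
    for term in set(query_terms):
        if term in doc_terms:
            n = doc_freq.get(term, 0)
            # Assumed weight with 0.5 smoothing; rarer terms count for more.
            score += math.log((num_docs - n + 0.5) / (n + 0.5))
    return score

# Illustrative collection: each document is reduced to its set of terms.
docs = [
    {"probabilistic", "retrieval", "model"},
    {"boolean", "retrieval", "model"},
    {"vector", "space", "model"},
    {"user", "needs", "analysis"},
    {"image", "search", "engine"},
]

# Document frequency of every term in the collection.
doc_freq = {}
for terms in docs:
    for term in terms:
        doc_freq[term] = doc_freq.get(term, 0) + 1

query = ["probabilistic", "retrieval"]
scores = [bim_score(query, terms, doc_freq, len(docs)) for terms in docs]
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranked[:2])  # [0, 1]
```

The sketch treats terms as independent and starts from a guess about relevance, two simplifications typical of this family of models. In very small collections, a term that occurs in more than half of the documents receives a negative weight under this estimate.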
There are some benefits and limitations in this model as follows:
Some advantages of this model are:
Probabilities are a familiar concept
A number of useful tools are available
Based on a firm theoretical foundation
Theoretically justified optimal ranking scheme
Some disadvantages of this model are:
An initial guess of relevance has to be made
Terms are assumed to be independent
Sometimes costly to compute
Has never worked convincingly better in practice
Probabilistic models are among the oldest methods in information retrieval, yet they remain one of the currently active topics in the field.
An information retrieval system has to meet a variety of users' needs, so it is necessary to analyze those needs in order to provide better services to users.
Types of users' needs
Users' needs can be divided into those of individual users, enterprise users and university users.
For individual users, obtaining information is the primary need. Publishing, exchanging information and consulting are also core needs of individual users.
Enterprise users may focus on information about technology trends, markets, environmental scanning and competitive intelligence.
The needs of university users can be divided into three kinds according to what gives rise to them: professional needs, research needs and other needs.
The substance of users' needs
Most users, that is individual and enterprise users as opposed to university users, use textual retrieval systems to get information such as news on current affairs, entertainment, business and economics, society, technology and so on. The needs of university users are more stable: they usually use textual retrieval systems to carry out research or professional work through databases of journals and monographs.
Characteristics of users' needs
Diversity of the subjects of users' needs
Above, users' needs were divided into three types. In terms of the subjects of those needs, users can more generally be divided into two groups: ordinary users (individual and enterprise users), who focus on gathering and acquiring knowledge, and professional users, whose aim is to serve their working purposes.
However, it is often difficult to distinguish strictly between these two types, since professional users normally also acquire new knowledge and improve their skills in the course of their professional work.
In all, with the spread of the internet and computers, it is no longer only computer or information professionals who use information retrieval systems. Users are becoming more diverse and more numerous.
Wide range of users' needs
With the modern information environment and the development of technology, users are no longer limited to the simple use of information retrieval systems. In the past, they may simply have looked up the information and sources they wanted, whereas in the recent decade the demands of work and knowledge accumulation have increased sharply. Users are eager to acquire complete working information, in a diversity of types and forms, from all kinds of information retrieval systems, including textual retrieval systems. Systems, in turn, are expected to provide a comprehensive information service to meet these needs.
Trend towards professionalization
As information sources increase, users have shifted their demands from breadth to depth. Users focus on the professional level of the information they get. However, information overload has produced masses of useless information alongside blank areas, so screening for useful information costs users twice the time. Hence, in order to obtain high-value information sources, users have become more conscious of seeking professional sources, and their needs have gradually become professional as well.
Trend towards personalization
Because information sources are digitized, it is easier than ever for individuals to search for information. With the development of computers and the internet, some users tend to explore personalized settings and needs. Their needs have become more targeted: they follow the latest developments and trends in order to complete their own resources, which shows that users' needs have become personalized.
Besides the classics that users require, the newest news, journals and monographs are also in great demand. Because of the high pace of development, knowledge and literature are replaced ever more quickly. Users hope to get fast, brand-new information from information retrieval systems. The newer, the better: users no longer think "the more, the better" but have shifted their attention to novelty and timeliness.
Firstly, textual retrieval systems often return irrelevant results, which greatly reduces the availability of useful information sources. Not all textual retrieval systems have a consolidated thesaurus and classification standard; search scopes are narrow, databases may not be updated very often, and cross-category searches easily miss information, which leads to inefficient searching.
Secondly, in full-text search engines, automatic indexing is imperfect, even though frequent updating gives the systems strong keyword search functions. The relevance of the retrieved information is difficult to control, and users get repeated links, which leads to low precision.
Thirdly, the pages indexed by a search engine contain some useless information, which not only slows down indexing but also wastes network communication resources. In addition, some textual retrieval systems do not update their webpages frequently, so timeliness cannot be guaranteed.
Finally, different retrieval systems use different retrieval rules. For instance, the database ScienceDirect Online uses AND, OR and AND NOT as Boolean operators, while JSTOR uses AND, OR and NOT. This also affects users.
In sum, from the users' point of view, the Boolean model is better than the other two models.
The Boolean model is easy for all users to understand, use and operate. Whether or not users have an IT background, and whether they are students, elderly people or people without much computer searching skill, they can learn to use it quickly and easily. Once users have had some training in the Boolean model, getting the results they want becomes far more convenient and fast, since they can write a Boolean search formula without a hitch. In addition, the Boolean model plays an important role in modern search engines.
However, the Boolean model still has the imperfections mentioned previously. Developers of search engines need to continue to refine the existing models and develop new ones. For users, it is also crucial to improve their own skills if they want to get more information and sources.