The World Wide Web dates from 1989, when Tim Berners-Lee proposed its creation. The original aim was to create a tool for easier communication between scientists and institutions over the Internet.
By late 2012 the Internet had grown massively: around seven billion web pages existed, all accessible via the World Wide Web. The Web is the largest, fastest and most easily accessible source of knowledge, and Web information retrieval has therefore become increasingly important.
The rapid growth of Internet services and the huge amount of information on the World Wide Web have led to the pressing problem of information overload (Michael Chau, Cho Hung Wong, 2009). The difficulty of finding the exact information a user wants will worsen as the amount of online information grows (Sergey Brin, 1998). This created innovative challenges for information retrieval, such as the development of tools called search engines, like Yahoo and Google, which became very popular among users and widely used.
'Alexa' is a Web information company which, through its website www.alexa.com, displays the hundred most visited sites on the Internet. According to www.alexa.com, the top site in November 2012 was Google. Among the most visited sites are a large number of web pages whose primary goal is to help users navigate to the information they seek. Without a doubt, Internet search tools are very important and popular.
This literature review explores current web search methodologies and information retrieval techniques. This study analyses in detail how a search engine works and the whole process that is carried out to retrieve the data the user wants. It also examines some of the main search techniques and search algorithms used to make search faster and to return the most relevant results.
Finally, the author critically evaluates this research, which helps to identify its strengths and weaknesses, the usefulness of the report and its limitations. The methods and algorithms analysed are then compared, and the author decides which of them will be used in the development of the project.
2. Search Engines on the Web:
Search engines are designed to make it as easy as possible for web users to find what they are searching for on the Internet. Their use brings multiple benefits, not only for the individual Internet user but also for companies that have a website. For the user, search engines make finding information on the Internet an extremely easy process, regardless of whether the topic of interest is specialized or not. Users do not need to wander from one page to another and from one link to the next to retrieve the information they want. For a company, the benefits are equally important: given the huge number of users who use search engines in their everyday lives, the appearance of the company's website in a search engine's result list automatically means more users and clients who will visit the website or even learn of its existence.
Search engines are online services that allow users to look into the contents of the Internet to discover sites, documents, pictures or specific information they want to use (Alfred and Emily Glossbrenner, 2001). The user inputs a search term, and the search engine attempts to find keywords in its catalogue that match the user's search, finally generating a list of results that satisfy the search criteria. These are presented in some order of relevancy, with short descriptions and hyperlinks that take the user there. (Fielden Ned L. and Kuntz Lucy, 2002)
Search engines store information about millions of World Wide Web pages in a huge database. From the gathered files, an index is created based on their titles, full text, size, addresses and so on. The user can search this database by entering keywords. The search engine is software that finds and ranks the results according to the relevance of their content to the search terms. Each search engine uses its own algorithm for presenting the most relevant results, and this is the main point of difference between search engines.
Fig. 1. The general architecture of a typical search engine. (Michael Chau, Cho Hung Wong, 2010)
A conventional search engine is a synthesis of several functions and consists of six steps. These stages are the basic procedures essential to a search engine; each of the six steps is analysed in more detail below.
2.1. Crawling (Web spiders / agents)
Generally defined, an agent is a program that can operate autonomously and complete specific tasks without direct human supervision (Chen Hsinchun, Chung Yi-Ming, Marshal Ramsey, Christopher C. Yang, 1998). Spiders have been used as a support system for the client in gathering information (Henry Chan, 2008). Search engines do their data gathering by deploying robot programs called spiders or crawlers. These programs are designed to locate web pages, follow the links they contain, and store any new accessible web information they come across in a local database or index. (A. McCallum, K. Seymore, K. Nigam, J. Rennie, 1999)
A crawler cannot start its work without a starting point. Crawlers begin with a main page and download the documents on it. They then follow the hyperlinks of this page to other pages, take those links and follow them further, and keep up the process until the required number of documents has been downloaded. This Web page gathering process is called 'crawling'. (Michael Chau, Jialun Qin, Yilu Zhou, Chunju Tseng, Hsinchun Chen, 2008)
Each crawler keeps about 300 connections open at a time, which is required to retrieve Web pages at a rapid pace. A crawler runs on a single machine and simply sends document requests to other machines on the Web, exactly as a web browser does when a user clicks on a link. What a crawler actually does is automate the process of following links, taking advantage of shortcuts for speed. (Ned L. Fielden and Lucy Kuntz, 2002)
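The crawling process described above can be sketched as a breadth-first traversal of the link graph. The sketch below is illustrative only: `fetch_links` stands in for real HTTP fetching and hyperlink extraction, simulated here with an in-memory link graph.

```python
from collections import deque

# A toy link graph standing in for the Web; a real crawler would
# fetch each URL over HTTP and parse the hyperlinks out of the HTML.
LINK_GRAPH = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html", "d.html"],
    "c.html": [],
    "d.html": ["a.html"],
}

def fetch_links(url):
    """Stand-in for downloading a page and extracting its hyperlinks."""
    return LINK_GRAPH.get(url, [])

def crawl(seed, max_pages=100):
    """Crawl breadth-first from a seed URL until max_pages are collected."""
    frontier = deque([seed])   # URLs waiting to be visited
    visited = []               # pages stored in the local index
    seen = {seed}              # avoid requesting the same URL twice
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

print(crawl("a.html"))  # ['a.html', 'b.html', 'c.html', 'd.html']
```

The `seen` set mirrors a real crawler's duplicate-URL check, and `max_pages` corresponds to stopping once the required number of documents has been downloaded.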
To remedy the shortcomings of spiders, three variations were introduced:
Specialized spiders: The number of pages searched is reduced to improve the spider's results. For example, the choice of pages to search may be based on the subject of the site, the country where the server resides, or the language. As the catalogue of pages is not too long and is well maintained, this performs better than a normal website with categories and better than a common search engine. (Mark A.C.J. Overmeer, 1999)
Meta-Spiders: Improve results by combining the results from a small number of spiders.
Meta-spiders combine meta-search and categorization in a comprehensive way, helping users interface with multiple spiders to get an overview of retrieved documents quickly and identify useful information (Hsinchun Chen, Haiyan Fan, Michael Chau, and Daniel Zeng, 2001). The meta-spider passes the user's request to a number of spiders, which run the request against their databases and return the suitable results. Finally, the number of spiders returning each page is counted, and the pages are ranked accordingly and returned to the user. (Mark A.C.J. Overmeer, 1999)
Meta-Information: Adds the gathered information to pages' HTML, to be used by spiders to build their indexes.
2.2. Indexing
At this point, all accessible web pages are stored locally, and their content must be indexed to allow search-based retrieval. Indexing in a search engine serves to capture the content of crawled documents so that search terms can be matched against it. Index terms are used to find corresponding query terms. Although the amount of crawled web content is growing at a tremendous pace, indexing is the most practical way to answer queries in a reasonable amount of time. (Anwar A. Alhenshiri, 2007)
Automatic indexing algorithms have been widely used to extract essential concepts from text data. It has now been shown that automatic indexing is as effective as human indexing, and many tried-and-tested techniques have been developed (Hsinchun Chen, Daniel Zeng, Michael Chau, 2001). The purpose of web indexing is to optimize the speed and performance of finding documents relevant to a query; otherwise the search engine would have to scan every document in the database, which requires far more time and computing power. For example, an index of 10,000 documents can be searched in a few milliseconds, while a sequential search of every word in 10,000 documents could take hours.
2.2.1. The Inverted Index
At present the inverted index is regarded as the most appropriate indexing technique for web data. The inverted index, highly popular in typical IR systems, is a word-based technique for building a text index that increases the speed of search (Chiyoung Seo, Sang-Won Lee, Hyoung-Joo Kim, 2002). In its simplest form, the index can only determine whether a word exists in a particular document, since it stores no information about the frequency or location of the word; it is therefore considered a Boolean index, which determines which documents match a query but does not rank them.
In its basic form, for every term t there is an inverted list that contains postings
< f(d,t), d >
where f(d,t) is the frequency of term t in document d. (Falk Scholer, Hugh E. Williams, John Yiannis, Justin Zobel, 2002)
Some versions of the inverted index include additional information, such as the frequency of each word in each document or the positions of a word within each document. Position information allows the search algorithm to determine the proximity of words, which makes phrase search possible. In that case, inverted list postings take the form:
< f(d,t), d, [o(1) ... o(f(d,t))] >
The extra information is the list of offsets o: for each of the f(d,t) occurrences of term t in document d, one offset is stored. (Falk Scholer, Hugh E. Williams, John Yiannis, Justin Zobel, 2002)
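The positional postings above can be reproduced in a few lines of Python. This is a minimal sketch: each term maps to a list of (document id, frequency, offsets) entries, mirroring the < f(d,t), d, [offsets] > form, with whitespace tokenization standing in for real text processing.

```python
from collections import defaultdict

def build_index(docs):
    """Build a positional inverted index: term -> list of (doc_id, freq, offsets)."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        positions = defaultdict(list)
        for offset, word in enumerate(text.lower().split()):
            positions[word].append(offset)
        for word, offsets in positions.items():
            # One posting per document: frequency plus word offsets.
            index[word].append((doc_id, len(offsets), offsets))
    return dict(index)

docs = {
    1: "web search engines index the web",
    2: "the web is large",
}
index = build_index(docs)
print(index["web"])  # [(1, 2, [0, 5]), (2, 1, [1])]
```

With the offsets available, phrase and proximity queries can be answered by checking whether matching terms occur at adjacent (or nearby) positions in the same document.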
Metadata is data about data; it provides essential information such as the author of a work, the creation date, links to related works, a document description, and descriptive words that indicate the document's characteristics and features (Paul Miller, 1996). All of this happens 'under the counter': metadata is not displayed to the client visiting the webpage. (Ned L. Fielden and Lucy Kuntz, 2002)
Metadata attempts to make identifying, understanding, describing and retrieving online information sources and their content easier. First and foremost, it provides a useful mechanism for describing and identifying data relevant to a particular client. Search engines use the metadata of web pages to extract keywords and other significant information about the pages and use them as index terms in their databases (Jin Zhang, Alexandra Dimitroff, 2005). Recently, metadata has been referred to as 'structured substitutes', and it is gaining widespread acceptance as an essential component for managing unstructured data and for finding and explaining content on the Internet. (K. Lang, M. Burnett, 2000)
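Extracting such metadata from a page's HTML can be done with the standard library alone. A minimal sketch, assuming metadata is stored in the usual `<meta name="..." content="...">` form (the sample page and its values are invented for illustration):

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from an HTML page."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            if "name" in attrs and "content" in attrs:
                self.meta[attrs["name"]] = attrs["content"]

page = """<html><head>
<meta name="author" content="A. Writer">
<meta name="keywords" content="search, indexing, crawling">
</head><body>Hello</body></html>"""

parser = MetaExtractor()
parser.feed(page)
print(parser.meta["keywords"])  # search, indexing, crawling
```

A search engine would feed values such as the keywords list into its index as additional index terms for the page.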
Once crawling and indexing of the web content are finished, the search engine must answer users' queries. A query is a set of one or more search terms and may include advanced features such as logical operators; it is the user's request, a formulation of keywords reflecting what the user wants to find. (Amanda Spink, Dietmar Wolfram, Major B. J. Jansen, Tefko Saracevic, 2001)
With a query, the user can provide the search engine with a number of criteria that narrow the enormous universe of potential results. A query is usually an approximation of the user's intention, and the same query can express many dissimilar intentions. (James Grimmelmann, 2007)
a) Phrase-based: The exact phrase is tested against the index terms.
b) Proximity: The same as phrase-based, but in a more relaxed form.
Simple Queries: A single-word query is the most basic form of query. A simple query has no structure and is extremely simple to search; it also makes ranking much faster. On the other hand, the results of such queries can be problematic: search engines produce an extremely large number of hits for them, many of which are unwanted. (Anwar A. Alhenshiri, 2007)
Boolean Queries: Operators (AND, OR, NOT) form the structure of the query. This model combines words and phrases into search queries in order to retrieve documents from databases. The three main functions of a Boolean system are the operators AND, OR and NOT. (Ned L. Fielden and Lucy Kuntz, 2002)
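Boolean operators map directly onto set operations over the inverted index's postings. A minimal sketch, assuming a simple term-to-document-set index (the index contents are invented for illustration):

```python
# A toy Boolean index: each term maps to the set of documents containing it.
index = {
    "web":     {1, 2, 3},
    "search":  {1, 3},
    "crawler": {2},
}
all_docs = {1, 2, 3}

def AND(a, b):
    return a & b          # documents containing both terms

def OR(a, b):
    return a | b          # documents containing either term

def NOT(a):
    return all_docs - a   # documents not containing the term

# "web AND search"
print(AND(index["web"], index["search"]))        # {1, 3}
# "web AND NOT crawler"
print(AND(index["web"], NOT(index["crawler"])))  # {1, 3}
```

Because each operator is a plain set operation, compound queries are evaluated by composing these functions; this is also why a Boolean index can say which documents match but not how well they match.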
Searching consists of the following three stages:
1. Parsing the query, which isolates the search terms and executes a search for each individual term.
2. Retrieval of the occurrences of the query terms in the document collection.
3. Management of those occurrences to resolve the search.
During ranking, documents are ordered by their similarity to the user's query terms. Ranking is the task applied when documents are returned from the search operation to the user, and it depends on relevance to the words and concepts in the query and on overall link popularity. There are many ranking techniques used by search engines, including Boolean spread, vector space, most-cited and PageRank.
A) Boolean spread: The rank of a page depends on the number of query terms found in that page.
B) Vector Space: The term frequency (TF) and inverse document frequency (IDF) are used in calculating the page's rank.
C) Most-Cited: This algorithm exploits the information in hyperlinks between web pages. Each website is assigned a score that is the total number of query words contained in other websites citing it or having a hyperlink referring to it. (Budi Yuwono, Dik L. Lee, 1996)
D) PageRank: This technique is used by Google. PageRank is an excellent technique for prioritizing the results of web keyword searches. The algorithm calculates the rank of each traversed page using the following formula:
PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
where:
A is the page being ranked;
T1 ... Tn are the pages pointing to page A;
C(T1) ... C(Tn) are the numbers of outgoing links of pages T1 ... Tn;
d is the damping factor, normally set to 0.85. (Sergey Brin, Lawrence Page, 1998)
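The PageRank formula above can be computed by repeatedly applying it until the ranks settle. A minimal sketch over a toy link graph (the graph and the fixed iteration count are illustrative; production systems iterate until convergence):

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively apply PR(A) = (1-d) + d * sum(PR(T)/C(T)) over all pages T linking to A."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}  # initial rank for every page
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum contributions from every page T that links to `page`,
            # each divided by T's number of outgoing links C(T).
            incoming = sum(pr[t] / len(links[t]) for t in pages if page in links[t])
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Toy link graph: A -> B, A -> C, B -> C, C -> A
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # C
```

In this graph C collects rank from both A and B, so it ends up with the highest score, illustrating how PageRank rewards pages with many strong incoming links rather than pages that merely contain the query terms.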
The most important measure of a search engine is the quality of its search results. The last stage is browsing, in which the results are given back to the user. Tools for better browsing can be used to improve the performance of searching; in most search engines the results are provided in text format.
3. Search Techniques
3.1. Meta Search
A meta-search engine is an agent consisting of a number of search engines. Each search engine retains its individual index and gives the meta-search engine some descriptive information about it. The meta-search engine stores that information and uses it to assess the suitability and importance of the search engines when a query is received. To conserve resources, only the most pertinent search engines are invoked to process the query. The selection of the most appropriate search engines is called ranking; a good ranking method can successfully identify the most relevant search servers and therefore the most relevant set of documents. (Yipeng Shen, Dik Lun Lee, 2001)
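The merging step of a meta-search engine can be sketched as follows. Each backend 'engine' is simulated here by a function returning a ranked list (the engines and page names are invented for illustration), and pages are re-ranked by how many engines return them, the voting scheme described for meta-spiders earlier:

```python
from collections import Counter

def engine_a(query):  # stand-in for a real search engine backend
    return ["page1", "page2", "page3"]

def engine_b(query):
    return ["page2", "page3", "page4"]

def engine_c(query):
    return ["page2", "page5"]

def meta_search(query, engines):
    """Forward the query to every engine and rank pages by vote count."""
    votes = Counter()
    for engine in engines:
        for page in engine(query):
            votes[page] += 1
    # Pages returned by more engines rank higher.
    return [page for page, _ in votes.most_common()]

results = meta_search("web retrieval", [engine_a, engine_b, engine_c])
print(results[0])  # page2 is returned by all three engines
```

A real meta-search engine would additionally use the stored descriptive information to decide which backends to query at all, rather than broadcasting to every engine as this sketch does.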
3.2. Dynamic Search
A dynamic search engine is intended to search the web dynamically at query time, when filtering, matching and ranking are all executed. Dynamic search is fundamentally proposed as an alternative to the index-based methodology. It can improve user interaction in searching the web, but it can never cover all possible results: given that search engines take days to crawl all accessible websites, a dynamic search will never provide complete results, since query time is of limited duration and cannot extend across the entire web. (Anwar A. Alhenshiri, 2007)
4. Critical Evaluation:
This review has described how search engines helped solve the problem of web information retrieval. It explained how search engines operate and the six steps that take place before results are returned to the user; each of the six steps was analysed separately and illustrated with several examples. Finally, two search techniques chosen by the author, meta-search and dynamic search, were analysed. To conclude, the aim of this research is to truly understand the need for search engines, to help the author choose among many different techniques the one to use, and to give him the knowledge required to complete the project requested by his client.