Searching the World Wide Web is a difficult task. Machine learning is applied for this task, namely ranking indexed pages by their estimated relevance with respect to user queries. This task is crucial because it influences the effectiveness of a search engine. It is a common user behavior to look at only the first few matches of a search result among the millions of matches hence it makes the precision achieved by the ranking algorithm of utmost importance. In the past, search engines in the past ranked pages based on their lexical similarity to the query. The strategy was to develop the best weighting algorithm to represent web pages and queries in a vector space. Recently the hypertext links structure has been recognized as a new source of evidence for web semantics. [Pant, 2005]
Crawler programs are designed to retrieve web pages. This retrieval of web pages is essential for many of the key applications served by the web. Crawlers exploit the web's hyperlinked structure to retrieve new pages by traversing links from previously retrieved ones.
Get your grade
or your money back
using our Essay Writing Service!
However, all this research addresses only half of the problem. No matter how sophisticated the ranking algorithm is, the results can only be as good as the pages indexed by the search engine, i.e., a page cannot be retrieved if it has not been indexed. This brings out other aspect of web searching, namely crawling the web in search of pages to be indexed. The retrieval of information on the web with vast number of "static" pages is a challenging task.
With increasingly growing web, the hardware and bandwidth resources required by the search engines also grows, in such cases the search engines cannot keep up with this growth of the web. This retrieval challenge is further complicated by the fact that web pages change frequently.
Despite the attempts of search engines to index the whole web, it is expected that the indexing will continue to grow. The search engines offer the solution as: the capacity of a crawler to answer queries from any user is limited. Hence it's not a surprise for the topical crawler algorithms that they have not received significant attention. [Menczer, 2001].
Topical crawlers, which are also known as focused crawlers work by responding to the particular information needed and expressed by topical queries or interest profiles. This could be needed by individual user or a community of users having shared interests. Topical crawlers support a more scalable approach where the crawling process is decentralized.
In the past, researchers have explored many alternative methods for assessing the quality of web pages and for summarizing performance of the crawler [Menczer, 2001]. Few crawlers show emphasis on contextual knowledge for the topic including the knowledge received through relevance feedback [Najork, 2001]. A complex problem is the creative phase of the design of topical crawlers which is accompanied by research on evaluation of such crawlers. For example a web crawler specific challenge is that the magnitude of results retrieved poses a limitation on the availability of relevance judgments based on user.
Hence the goal of this article is to examine the algorithmic aspects of topical crawlers to find the best crawling algorithm for data retrieval over the web.
Chapter 2. Literature Review
A lot of research has been done in the past in order to find the best crawling algorithm. The work done by various people in order to accomplish the objective of finding a crawling algorithm that is best and also adaptive are studied.
The evaluation framework has been studied and analyzed from "A General Evaluation Framework for Topical Crawlers" by P. Srinivasan, F. Menczer and G. Pant [Srinivasan, 2005]. The system is designed so that all the logic about any specific crawling algorithm is encapsulated in a crawler module that can be easily plugged into the system through a standard interface.All crawling modules share public data structures and utilities to optimize efficiency without affecting the fairness of any evaluations. This evaluation framework is used in our research to study the algorithmic issues such as the role of exploration versus exploitation and the role of adaptation versus static approaches. This paper also describes the various evaluation metrics which are used in our research.
Always on Time
Marked to Standard
In a research done by Filippo Menczer, Gautam Pant and Padmini Srinivasan, on "Evaluating Topic-Driven Web Crawlers" [Menczer, 2001] the authors have used topics instead of queries for the crawling purpose. Each topic is represented by a collection of seed URLs. It simplifies issues by moving from queries to topics. It is assumed that if a page is on topic then it is a "good" page. While these criteria of topics are not underrated, given the reasons for using topics stated above it is chosen to focus only on topicality as an indicator of relevance for the extent of this research. It was concluded by Menczer et.al. that InfoSpiders web crawler could complement search engines to locate highly relevant pages that are too recent and/or distant to have been indexed by other crawlers. They have assessed various types of crawlers such as Breadth-First, Best-First, PageRank and InfoSpiders in order to come to such conclusion. But the drawback of their research was that the results assume that each crawler module has the same time complexity in determining the next page to be visited. This is not the case since different crawling algorithms have different computational requirements and hence different time complexity.
These algorithms were analyzed for dynamic performance, scalability and efficiency taking into consideration the various resources utilized by these algorithms. Hence, an attempt to extend the research done by Menczer et.al. is made. This is done by considering the dynamic nature of the web crawlers and taking into consideration the computational requirements of various algorithms which makes them adaptive in nature.
The concept of PageRank and methods used to measure the PageRank have been analyzed from the article "Measuring Index Quality using Random Walks on the Web" written by Monika R. Henzinger, Allan Heydon, Michael Mitzenmacher and Marc Najork [Henzinger, 1999]. It was stated in this paper that in order to have more important pages indexed first, estimates of the PageRank of a page are used to guide a crawler. This demonstrates that the PageRank measures the page quality well, and hence it is ideal weight function for measuring index quality. Based on the approaches taken by Henzinger et.al., the PageRank algorithm is assessed for performance in this research using the weight function defined by Henzinger.
The Breadth-first crawling algorithm is analyzed from the article "Breadth-First Search Crawling Yields High-Quality Pages" work done by Marc Najork, Janet L Wiener [Najork, 2001]. Here the Breadth-first crawling algorithm is read in detail and its performance analyzed. This algorithm also helps us analyze the PageRank of a page as the PageRank is used to judge the quality of the page in the Breadth-first algorithm. It has been stated that a crawler which downloads pages in breadth-first order discovers the highest quality pages in the early stages of the crawl. This is desirable for web search engines like AltaVista or Google, given the condition none of the search engine is able to crawl and index more than a part of the web. Hence the article by Najork et.al. is helps to both find the PageRank and to assess the breadth-first algorithm.
The works of Menczer et. al., Henzinger et. al., Srinivasan et. al. And Najork et. al. are used to evaluate the algorithms like Best-First, Breadth-First, PageRank and InfoSpiders. After analyzing these algorithms we proceed with the evaluation of various crawling algorithms and then we develop and evaluate of adaptive algorithms based on the previously analyzed algorithms.
Chapter 3. Research Objective
The primary objective of this paper is to find a crawling algorithm that is most appropriate in the World Wide Web to find the pages relevant to the search query of the user. While searching the web using a search engine, the users often visit only the first few links on the web. Hence the search engines have to be effective in retrieving the most relevant pages first. In order to do so the task of retrieving the data is done by taking topics into consideration.
In the past, search engines in the past ranked pages based on their lexical similarity(a measure of the degree to which the word sets of two givenÂ languagesÂ are similar) to the query. The strategy was to develop the best weighting algorithm to represent Web pages and queries in a vector space.
This Essay is
a Student's Work
This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.Examples of our work
No matter how sophisticated the ranking algorithm is, the results can only be as good as the pages indexed by the search engine that is a page cannot be retrieved if it has not been indexed. This brings out other aspect of Web searching, namely crawling the Web in search of pages to be indexed.
This research aims at comparing the already existing algorithms and finds the best of all. The previous researchers have compared crawling algorithms based on the assumption that the computational resources used by all the algorithms are same. This research aims at finding a crawling algorithm which is dynamic in nature. It aims at finding the generalized crawling algorithms and study the tradeoff between exploration and exploitation.
The objective of this research is to find the adaptive crawlers that take into consideration the resources utilized by various algorithms individually without making any assumptions of previous researchers that the resources utilized by all algorithms to be same.
Chapter 4. Research Methodology
In order to find the best crawling algorithm a framework will be developed that fairly evaluates topical crawling algorithms using a number of performance metrics. This will be a kind of framework that was employed in order to evaluate various algorithms that were proven to be highly competitive among those proposed in the literature and in previous research. In particular the tradeoff between exploration and exploitation of the topics available to a crawler, and on adaptive crawlers using machine learning techniques to guide search will be focused. For a fair comparison of diverse crawling algorithms, there are two measures:
We track the CPU time taken by each crawler for each page and each topic, ignoring the time taken by the fetch module, which is common to all the crawlers.
We limit the memory available to each crawler by constraining the size of its buffer.
The performance of the crawling algorithm can be measured using performance metrics such as those explored in previous research. These include methods for assessing page relevance using linear classifiers, and using similarity computations.
The algorithms are used to retrieve the results for a particular topic and comparing the metrics to judge the performance.
The data that would be collect will be such as the time taken by each algorithm to retrieve the results for the same topic. The accuracy of the results for a search. The time and accuracy of the algorithms proportional to the query size to find the scalability of the algorithms.
Using all these data collected graphs are plotted for each metric and using these graphs we would try to conclude which crawling algorithm is effective.
These methods compare the algorithms based on the resources used hence no assumptions are made.
These methods cannot be used to compare the algorithms keeping in mind the fact that the web pages keep changes frequently. This makes it difficult to tell that the time taken and efficiency of the algorithms is always as calculated as the web page may change in the while when conducting the test on various algorithms.Chapter5. Time Table
Pilot study on the research topics
Selection of a research topic
Survey paper preparation
Selection of research question
Calculate the results based on survey