Focused Crawler Ontology Based Focused Crawler Accomplishment Computer Science Essay


Search engines of past years gathered web documents using their own web crawlers. These documents were indexed to answer the queries submitted to the search engine. Such crawlers were termed "General Crawlers" or simply "Crawlers". Contemporary domain-specific search engines, on the other hand, can respond to queries that demand very precise information. The crawlers used by such search engines are termed "Focused Crawlers". A focused crawler examines its crawl boundary so that it limits crawling to the links that are most relevant, and thereby avoids irrelevant neighborhoods of the web. This approach not only saves significant hardware and network resources but also helps keep the crawled database up to date. In this paper we explain the role of the focused crawler in a domain-specific search engine, briefly describe various focused crawlers, and finally compare various ontology-based focused crawlers.

Index Terms: Search Engine; Crawler; Focused Crawler


Popular portals such as Yahoo and Alta Vista are generally used for gathering information on the WWW. However, with the explosive growth of the Web, fetching information about a specific topic is becoming an increasingly difficult task. Moreover, the Web has more than 12 billion pages and continues to grow rapidly, at a million pages per day [1]. Such growth and flux impose basic limits of scale on today's generic search engines. Thus, much relevant information may not have been gathered, and some information may not be up to date. Because of these problems, there is a growing awareness that, for serious Web users, focused portals are more useful than generic portals.

A focused crawler is an automated mechanism for efficiently finding pages on the web that are relevant to a topic. Focused crawlers were proposed to traverse and retrieve only the part of the web that is relevant to a particular topic, starting from a set of pages usually referred to as the seed set. This makes efficient use of network bandwidth and storage capacity. Focused crawling also provides a viable mechanism for frequently updating search engine indexes. Focused crawlers have been useful for other applications as well, such as distributed processing of the complete web, with each crawler assigned to a small part of the web. They have also been used to provide customized alerts, personalized or community search engines, web databases and business intelligence. One of the first focused Web crawlers is discussed in [2]. Experiences with a focused crawler implementation were described by Chakrabarti in [3].

Focused crawlers contain two types of algorithms to keep the crawling scope within the desired domain: (1) Web analysis algorithms are used to judge the relevance and quality of the Web pages pointed to by target URLs; and (2) Web search algorithms determine the optimal order in which the target URLs are visited [4].

Role Of Focused Crawler In Web Directory

Focused crawlers are mainly used in domain-specific search engines. Figure 1 shows their role in such search engines. Here, the crawler component combined with the classifier acts as a focused crawler; without the classifier, it works as a general crawler of the kind used in general search engines.

Fig 1. Role of Focused Crawler in web directory (diagram components: Web directory (front end), Index Barrel, Focused Crawler (back end), connected by numbered flow arrows)

The following steps describe the flow of a domain-specific search engine, where the crawler is generally termed a focused crawler.

The crawler is given a set of seed URLs as input to be crawled.

These URLs are then fetched from the WWW by the crawler.

It then parses each document using the parser and separates the text from the HTML tags.

It then passes the content to the index barrel to build an index based on the available words.

It further crawls the links extracted from the seed URLs if the classifier endorses them according to the technique it uses.
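The flow above can be sketched as a small loop. This is only an illustrative sketch under stated assumptions, not the implementation of any crawler discussed in this paper: the in-memory FAKE_WEB dictionary stands in for fetching pages from the WWW, and is_relevant is a hypothetical keyword classifier. All names are made up for the example.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://a.example/": '<p>ontology crawler survey</p>'
                         '<a href="http://b.example/">b</a>'
                         '<a href="http://c.example/">c</a>',
    "http://b.example/": '<p>focused crawler with ontology</p>',
    "http://c.example/": '<p>cooking recipes</p><a href="http://d.example/">d</a>',
    "http://d.example/": '<p>ontology of pasta</p>',
}


class PageParser(HTMLParser):
    """Separates the text of a page from the links in its <a href=...> tags."""

    def __init__(self):
        super().__init__()
        self.words, self.links = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

    def handle_data(self, data):
        self.words.extend(data.lower().split())


def is_relevant(words, topic_terms):
    """Toy classifier (assumption): a page is relevant if any topic term occurs."""
    return any(w in topic_terms for w in words)


def focused_crawl(seeds, topic_terms):
    frontier, seen, index = deque(seeds), set(seeds), {}
    while frontier:
        url = frontier.popleft()
        html = FAKE_WEB.get(url)              # fetch the URL
        if html is None:
            continue
        parser = PageParser()
        parser.feed(html)                     # separate text from HTML tags
        for word in parser.words:             # build a word -> URLs index
            index.setdefault(word, set()).add(url)
        if is_relevant(parser.words, topic_terms):
            for link in parser.links:         # follow links only if endorsed
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)
    return index


index = focused_crawl(["http://a.example/"], {"ontology"})
```

In this toy run the page at c.example is indexed but judged irrelevant, so its outgoing link to d.example is never followed. Dropping the is_relevant check would turn the same loop into a general crawler.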

Various Focused Crawling Techniques

Fish Search

Some early work on the focused collection of data from the Web was done in [2] in the context of client-based search engines. Web crawling was simulated by a "group of fish" migrating on the web. In the so-called "fish search", each URL corresponds to a fish whose survivability depends on the relevance of the visited page and the speed of the remote server. Page relevance is estimated using a binary classification (a page can only be relevant or irrelevant) by means of a simple keyword or regular expression match. Fish die off only after traversing a specified number of irrelevant pages; that way, information that is not directly available in one 'hop' can still be found. On every document the fish produce offspring, whose number depends on the page relevance and the number of extracted links. The school of fish consequently 'migrates' in the general direction of relevant pages, which are then presented as results. The starting point is specified by the user, who provides 'seed' pages that are used to gather the initial URLs. URLs are added to the beginning of the crawl list, which makes this a kind of depth-first search.
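A heavily simplified sketch of this idea is given below. It keeps only the binary keyword match, the limit on irrelevant pages and the depth-first crawl list; server speed and the number of offspring are left out. The link graph, the page texts and the parameter name max_irrelevant are assumptions made for illustration, not details taken from [2].

```python
# Hypothetical link graph and page texts (assumptions for illustration).
LINKS = {
    "seed": ["a", "x"],
    "a": ["b"],
    "b": [],
    "x": ["y"],
    "y": ["z"],
    "z": [],
}
TEXT = {
    "seed": "focused crawler",
    "a": "news",
    "b": "crawler design",
    "x": "sports",
    "y": "weather",
    "z": "crawler tips",
}


def fish_search(seed, keyword, max_irrelevant=2):
    """Simplified fish search: binary relevance, depth-first crawl list."""
    crawl_list = [(seed, max_irrelevant)]        # (url, remaining "life")
    visited, results = set(), []
    while crawl_list:
        url, life = crawl_list.pop(0)
        if url in visited:
            continue
        visited.add(url)
        relevant = keyword in TEXT[url].split()  # binary keyword match
        if relevant:
            results.append(url)
            child_life = max_irrelevant          # relevant page: full life again
        else:
            child_life = life - 1                # irrelevant page: weaker offspring
        if child_life > 0:
            children = [(c, child_life) for c in LINKS[url]]
            crawl_list = children + crawl_list   # front of the list: depth first
    return results


found = fish_search("seed", "crawler")
found_deeper = fish_search("seed", "crawler", max_irrelevant=3)
```

With the default limit the fish on the path through x and y die before reaching z; with a limit of 3 they survive long enough to find it.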

Shark Search

The fish algorithm is extended into "shark-search" in [5]. The URLs of pages to be downloaded are prioritized using a linear combination of the source page relevance, the anchor text, the neighborhood (of a predefined size) of the link on the source page, and an inherited relevance score. The inherited relevance score is the parent page's relevance score multiplied by a specified decay factor. Unlike in [2], page relevance is calculated as the similarity between the document and the query in the vector space model and can be any real number between 0 and 1. Anchor text and anchor context scores are also calculated as similarity to the query.
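One possible way to form such a linear combination is sketched below. The similarity is a plain cosine over raw term counts, and the weights decay, beta and gamma are illustrative values chosen for the example, not parameters reported in [5].

```python
import math
from collections import Counter


def cosine(text, query):
    """Cosine similarity of raw term-count vectors; the value lies in [0, 1]."""
    a, b = Counter(text.lower().split()), Counter(query.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def link_score(query, parent_text, anchor_text, context_text,
               decay=0.5, beta=0.8, gamma=0.5):
    """Priority of a link; decay, beta and gamma are assumed weights."""
    # Inherited score: parent relevance multiplied by a decay factor.
    inherited = decay * cosine(parent_text, query)
    # Neighborhood score: anchor text combined with the text around the link.
    neighborhood = (beta * cosine(anchor_text, query)
                    + (1 - beta) * cosine(context_text, query))
    return gamma * inherited + (1 - gamma) * neighborhood


query = "focused crawler"
parent = "focused crawler survey"
good = link_score(query, parent, "focused crawler", "read about focused crawler here")
bad = link_score(query, parent, "click here", "click here for weather")
```

Both links inherit the same score from the parent page, so the difference between good and bad comes entirely from the anchor text and its context.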

Accelerated Focused Crawling

An improved version was proposed in [6], which extends the baseline focused crawler to prioritize the URLs within a document. The relevance of a crawled page is obtained from the document taxonomy using the 'baseline' classifier. The URLs within the document are given priority based on the local neighborhood of each URL in the document. An apprentice classifier learns to prioritize the links in the document. The apprentice is trained once a sufficient number of source pages, and of the target pages pointed to by the URLs in those source pages, have been downloaded and labeled as relevant or irrelevant. A representation of each URL is constructed from the target page relevance, the source page relevance, the Document Object Model (DOM) structure, co-citation and other local source page information. The apprentice is trained online to predict the relevance of the target page pointed to by a URL in the source page. The relevance so calculated is used to prioritize the URLs. The apprentice is periodically retrained to improve performance. Both these methods depend on a high-quality document taxonomy for good performance. This dependence on the document taxonomy makes them inapplicable where the topic is too specific.

The classifier determines page relevance (according to the taxonomy), which in turn determines future link expansion. Two different rules for link expansion are presented. The hard focus rule allows expansion of links only if the class to which the source page most probably belongs is in the 'interesting' subset. The soft focus rule uses the sum of the probabilities that the page belongs to one of the relevant classes to decide the visit priority for its children; no page is eliminated a priori. Periodically, the distiller subsystem identifies hub pages (using a modified hubs and authorities algorithm). Top hubs are then marked for revisiting.
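The difference between the two rules can be shown with a minimal sketch. The class names and probabilities below are invented for illustration; in the described system they would come from the taxonomy classifier.

```python
def hard_focus(class_probs, good_classes):
    """Expand links only if the most probable class is an 'interesting' one."""
    best = max(class_probs, key=class_probs.get)
    return best in good_classes


def soft_focus(class_probs, good_classes):
    """Visit priority for child links: total probability of the relevant classes."""
    return sum(p for c, p in class_probs.items() if c in good_classes)


# Hypothetical classifier output for one page.
probs = {"crawlers": 0.35, "databases": 0.25, "sports": 0.40}
good = {"crawlers", "databases"}

expand = hard_focus(probs, good)    # False: the top class is "sports"
priority = soft_focus(probs, good)  # about 0.60: links are kept, with this priority
```

For this page the hard focus rule discards the outgoing links, while the soft focus rule keeps them with a fairly high priority, because the relevant classes together are more probable than the single top class.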

Intelligent Crawling

Intelligent crawling, proposed in [7], allows users to specify arbitrary predicates. It suggests the use of arbitrary implementable predicates that draw on four sets of document statistics in the classifier, namely source page content, URL tokens, linkage locality and sibling locality, to calculate the relevance of a document. The source page content allows different URLs to be prioritized differently. URL tokens help in obtaining approximate semantics about the page. Linkage locality is based on the assumption that web pages on a given topic are more likely to link to pages on the same topic. Sibling locality is based on the assumption that if a web page points to pages on a given topic, then it is more likely to point to other pages on the same topic.

Focused Crawling Using Combination of Link Structure and Content Similarity

The article [9] introduces a crawler that uses a combination of link structure and content to do focused crawling. Implementing it requires maintaining the link structure of pages and introducing a metric for measuring the similarity of a page to a domain.

Using Context Graph

According to the article [10], relevant pages can be found by knowing what kinds of off-topic pages link to them. For each seed document, a graph several layers deep is constructed from the pages pointing to that seed page. Because that information is not directly available from the web, a search engine is used to provide the backward links. The graphs for all seed pages are then merged, and a classifier is trained to recognize a specific layer. Those predictions are then used to assign a priority to the page.

Ontology Based Focused Crawling

Generally speaking, ontology-based focused crawlers are a family of crawlers that use ontologies to link the fetched web documents with ontological concepts (topics), with the purpose of organizing and categorizing web documents or filtering out web pages that are irrelevant to the topics. The harvest rate is improved compared to the baseline focused crawler, but has not been compared with other types of focused crawlers.

Study on ontology based focused crawler

Ehrig and Maedche proposed an ontology-focused crawler [11] [12]. The crawling framework has two cycles. In the first cycle, users define a crawling target by instantiating a domain-specific ontology and limit the crawling scope by providing the URLs of the websites to be crawled. Based on the ontology and crawling scope, the focused crawler retrieves data from those websites and computes the relevance between the ontological concepts and the crawled data by means of the TF-IDF algorithm.
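The text above states only that TF-IDF is used; how the weights are combined into a relevance value is not described. The sketch below therefore makes an assumption: a page's relevance to a concept is the sum of the TF-IDF weights of the concept's terms in that page. The pages and concept labels are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [d.lower().split() for d in docs]
    df = Counter(t for tokens in tokenized for t in set(tokens))
    n = len(docs)
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: (c / len(tokens)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return weights


def concept_relevance(doc_weights, concept_terms):
    """Assumed scoring: sum the TF-IDF weights of the concept terms in a page."""
    return sum(doc_weights.get(t, 0.0) for t in concept_terms)


pages = [
    "hotel booking and hotel prices in berlin",
    "football scores and league tables",
    "berlin weather forecast",
]
concepts = {"hotel", "booking"}   # hypothetical ontology concept labels
scores = [concept_relevance(w, concepts) for w in tf_idf(pages)]
```

Only the first page mentions the concept terms, so it is the only one with a non-zero score. A term that occurs in every page gets a weight of zero under this formula, which is the usual TF-IDF behaviour.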

Ardo introduced a focused crawler working for ALVIS [13]. The focused crawler is used to retrieve, cluster and store relevant web pages by linking them to topics. Each topic is defined by an ontology of terms [14].

Chen and Soo designed an ontology-based information gathering agent that searches and integrates knowledge based on users' queries. An ontology is defined as the agent's domain knowledge. Users can instantiate the ontology by adding partial values to a concept of interest, in order to form a query for retrieving the values of the instance's blank fields. Four basic operations are involved in the gathering process: planning, search, information extraction and integration [15].

Tane et al. proposed a new ontology management system named Courseware Watchdog. One important component of the system is an ontology-based focused crawler. With this crawler, a user can specify his or her preferences by assigning weights to the concepts of an ontology. The weights of the other concepts can then be calculated by means of the interrelations between concepts within the ontology [16].
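How the weights of the other concepts are derived is not described here. As one hypothetical scheme, the user-assigned weights could be spread along the relations of the ontology and damped at every hop. The relation graph, the damping value and the function name below are all assumptions made for illustration, not the method of [16].

```python
from collections import deque


def propagate_weights(relations, user_weights, damping=0.5):
    """Spread user-assigned weights to related concepts, damped per hop (assumption)."""
    weights = dict(user_weights)
    queue = deque(user_weights)
    while queue:
        concept = queue.popleft()
        passed_on = weights[concept] * damping
        for neighbor in relations.get(concept, ()):
            # Keep the strongest weight that reaches each concept.
            if passed_on > weights.get(neighbor, 0.0):
                weights[neighbor] = passed_on
                queue.append(neighbor)
    return weights


# Hypothetical ontology relations between concepts.
relations = {
    "course": ["lecture", "exam"],
    "lecture": ["course", "slides"],
    "exam": ["course"],
    "slides": ["lecture"],
}
weights = propagate_weights(relations, {"course": 1.0})
```

Here the user weights only the concept "course"; directly related concepts receive half of that weight and concepts two hops away a quarter.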

Comparison Of Various Ontology Based Focused Crawlers

A comparison of various ontology-based focused crawlers is given in Table I. As shown in the table, the survey is based on five perspectives: working environment, domain, function, technology used and evaluation metrics.

From the comparison table, we can see that none of the crawlers is domain-specific. These crawlers can be used in any domain for any crawling topic. This multi-domain adaptability could be beneficial for the future development of these crawlers. As for working environments, some of the crawlers are encapsulated in larger systems, while others are designed as separate tools. As for functions, in most crawlers the weights of ontological concepts on query topics can be customized to highlight users' specific preferences. One crawler can also integrate the crawled knowledge according to domain-specific heuristics and rules, which could be useful to enhance the precision and reduce the recall. Another crawler can flexibly evolve the weights between concepts and topics through an ontology learning model. This could help to solve the problem that predefined ontologies sometimes cannot completely match the crawling topics. As for technologies, besides the commonly used ontology technology, these crawlers use various technologies to satisfy different functional requirements. In addition, the TF-IDF and PageRank algorithms are adopted for ranking the retrieved web documents. Although most crawlers do not provide evaluation methods, we still find that harvest rate is the primary metric for measuring the crawlers' performance.


This paper introduced the role of the focused crawler in a web directory and discussed various focused crawling techniques. Some research has been done in the area of ontology-based focused crawlers, so we have also provided a comparison of various ontology-based focused crawlers from several perspectives.


I would like to sincerely thank Prof. B.V.Buddhdev for encouraging me whenever needed. I would also like to thank my HOD, Prof. J.S.Shah, for his guidance, and my friend Rashmi for her support.