Deep Web Crawler Using Genetic Technique Computer Science Essay

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

The Web has become one of the largest and most readily accessible repositories of human knowledge. The traditional search engines index only surface Web whose pages are easily found. The focus has now been moved to invisible Web or hidden Web, which consists of a large warehouse of useful data such as images, sounds, presentations and many other types of media. To use such data, there is a need for specialized technique to locate those sites as we do with search engines. This paper focuses on a effective design of a Hidden Web Crawler that can automatically discover pages from the Hidden Web by employing multi- agent Web mining system. A framework is also suggested to investigate the resource discovery problem and the empirical results suggest substantial improvement in the crawling strategy and harvest rate.

Index Terms- hidden Web crawler, information retrieval, reinforcement learning, multi-agents, Web mining.

1. Introduction

The Web has become one of the largest and most readily accessible repositories of human knowledge. The large volume of Web pages, users increasingly use search engines to find specific information. These search engines index only static Web pages. Recent literature shows that large part of the Web is available behind search interfaces and is reachable only when users fill up those forms with set of keywords or queries [2] [4]. These pages are often referred to as the Hidden Web [3] or Deep Web [2]. The data in digital libraries, various government organizations, companies is available through search forms. Formally, a deep Web site is a Web server that provides information maintained in one or more back-end Web databases, each of which is searchable through one or more HTML forms as its query interfaces [4]. The hidden Web is qualitatively different from surface Web in the sense that it contains real data on different subjects. The size of the Hidden Web rapidly increases as more and more of organizations put their valuable content online through an easy-to-use Web interface [2] and is estimated as more than 550 times larger than the surface Web in 2001[1]. The volume of information grows, there is a need for tools and techniques to utilize the information. It has become very important to automatically discover hidden Web sites through an easy to use interface. The traditional Web crawler automatically traverse, retrieve pages and build a large repository of Web pages.

The shortcomings of those crawlers with respect to hidden Web are quite significant. nal crawlers have low precision and recall .Lack of personalization. Inability to incorporate user feedbacks on a specific topic. They do not suit to locate deep Web sites This scenario leads to the development of a new Web crawler that autonomously discovers Web forms and should have the following properties should automatically locate hidden Web resources

should accurately describe schema of the relevant forms.

should perform a broad search .

search process must be efficient in terms of precision and recall .

should avoid visiting large unproductive portions of the Web

Retrieving data from hidden Web sites has two tasks: resource discovery and content extraction [5].

The first task deals with automatically finding the relevant Web sites containing the hidden information.The second task deals with obtaining the information from those sites by filling out forms with relevant keywords.It deals locating relevant forms that serve as the entry points to the hidden Web data using a multi-agent based Web mining crawler.Finding searchable forms is useful in the following fields [6]:

Entry points for deep Web

Derive source descriptions in the databases

Form matching to find correspondences among attributes

The architectural framework described the incorporates multi-agents that learn the environment online through reinforcement learning. The agent has been used increasingly with information technology in recent years. The voluminous and readily available information on the web has given rise to developing agents as the computational entities for accessing, discovering and retrieving information on Web. Multi-agent system is deployed when there are troubles in storing large amount of data in the database indices. Single agent approach will be inefficient and impractical for the large scale Web environment. The proposed system has populations of information agents interacting with each other to collaboratively learn about their environment so that they can retrieve desired information effectively. The empirical results show that the crawler is performing better with respect to harvest rate. The results also indicate that the automatic resource discovery improves as the agent learns progressively.

2. Related Work

Conventionally, agent based Web mining systems can be classified under three categories [14]:

intelligent search agents.

information filtering / categorization.

personalized Web agents.

Several intelligent Web agents like Harvest[15], ShopBot[16], iJADE Web miner[17], have been developed to search for relevant information using domain characteristics and user profiles to organize and interpret the discovered information.

An adaptive Web search system based on reactive architecture has been presented in [18]. The system is based on ant searching behaviour and proved to be robust against environment alternations and adaptive to user's need. An adaptive meta search engine was developed based on neural network based agent [19] which improves search results by computing user's relevance feedback.

The pioneering work on hidden Web crawler design [5] focused on extracting content from searchable electronic databases. They introduced an operational model of HiWE(Hidden Web crawler). The paper discusses on different directions on research in designing the crawler for content extraction. In [4], useful observations and implications are discussed about hidden Web. Their survey reports on locating entry points to hidden Web, coverage of deep Web directories and so on. They also give a clear observation that the crawler strategy for deep Web are likely to be different from surface Web crawlers. In [6], form focused crawler for hidden Web is described which utilizes various classifiers to extract relevant forms. An adaptive strategy based crawler design is discussed in [11].

3. Architecture

Form distribution in a particular domain is of sparse nature [6]. This nature has given the indication for the crawler to crawl only in the particular domain and thus avoid visiting unproductive paths by learning feature of links. The simplified architecture of the crawler is given in Fig.1. The form focused crawler employs multi-agent system. Agents can acquire better search strategies by collecting and analyzing feedback information from previous search experiences. Two types of agents are employed:

One is to crawl the Web to retrieve relevant pages.Second is to coordinate results from crawling agents.

Each crawling agent has three components. The crawler component visits and downloads the Web forms. The downloaded forms are given to parser which analyzes it and extracts the hyperlinks. The extracted hyperlinks are fed to the classifier module. This module has three functionalities: to classify a Web form as relevant to topic, to categorize a hyperlink which gives immediate and delayed benefit, to classify the forms as searchable or non searchable. The relevant forms are then stored in the forms database for further explorations. Coordinating agent collects the results from the multiple crawling agents and displays it to the users.

Fig.1 Framework for Deep Web Crawler

3.1. Crawler

Initially the crawler is given a set of URLs to crawl called seed URLs. The pages that are retrieved are given to the parser module. The classifier component gives out list of URLs that are identified as relevant. The links that are identified as promising links are placed in the frontier queue. A specified number of agents are spawned and each agent is positioned at one of the links and given an initial amount of energy to crawl.

3.2. Parser

This module gets a page from the crawler, analyzes for its relevancy to the specific subject of retrieval. This task uses similarity measure to calculate the relevance of the page on a particular topic.

∑f k d

f k p


rel(d,p) =

k∈ p∩d



min (∑fk d

)(∑f k p



k∈ p

where d is the set of keywords on a particular domain, p is the page under investigation, and f kr is the frequency of search terms in the page. The links are extracted and sent to classifier module along with set of keywords, anchor text, and text around the hyperlink.

3.3 Classifier

This module deals with identifying a link that leads to searchable forms. There may be links whose single click immediately direct to a form. There may be links which give delayed benefit. To identify the links with immediate and delayed benefit reinforcement learning is employed. The authors of [12] have proved that the reinforcement learning based multi-agents provide good results in terms of harvest rate and freshness.

The feature space considered for computation of link relevance is URL, anchor text, and text around the hyperlink. Primarily the frequency of the search terms in the feature space is computed. While considering the URL as feature space, the frequency is computed as follows: if any of the search terms are found as substring in the URL, then the frequency of that search term is incremented. The computation of link relevance described above will identify the pages with immediate benefit. That is, following the link will provide the search interface belonging to that particular domain. Another objective of the classifier is to identify the links which give delayed benefit i.e. the links that eventually lead to searchable interfaces. Learning component of the classifier is involved in classifying the link with delayed benefit.

The learner learns using the same feature set described above and assigns an integer value as score which is the normalized distance of the relevant form from the link to be followed. The learner in this prototype does not need any training sets. It learns from the progressive crawling and modifies its current state using the feedback given from the crawler module as depicted in the Fig[1]. The learning methodology is described in the subsequent sub section.

The final objective of the classifier is to categorize the forms as searchable and non searchable forms. Non searchable forms are the interfaces for login, subscription, registration and so on. This module uses decision tree classifier which is proved to have lowest error rate [6]. The features considered for classifying are: checkboxes, password tags, submit tags, textboxes, number of items in select, submission method, and search keyword in the form.

4. Proposed Approach

Figure 1 shows the flow of a basic sequential crawler. The crawler maintains a list of unvisited URLs called the frontier. The list is initialized with seed URLs which may be pro-vided by a user or another program. Each crawling loop involves picking the next URL to crawl from the frontier, fetching the page corresponding to the URL through HTTP, parsing the retrieved page to extract the URLs and application specifc information, and finally adding the unvisited URLs to the frontier. Before the URLs are added to the frontier they may be assigned a score that represents the estimated benefit of visiting the page corresponding to the URL. The crawling process may be terminated when a certain number of pages have been crawled. If the crawler is ready to crawl another page and the frontier is empty, the situation signals a dead-end for the crawler. The crawler has no new page to fetch and hence it stops.

Initialize URL list With starting URLs


Add URL to URL list

Parse page

Pick URL from URL List


[Not done]

Loop [No More URL]


Fig1:Single Threaded crawler

Crawling can be viewed as a graph search problem. The Web is seen as a large graph with pages at its nodes and hyperlinks as its edges. A crawler starts at a few of the nodes (seeds) and then follows the edges to reach other nodes. The process of fetching a page and extracting the links within it is analogous to expanding a node in graph search. A topical crawler tries to follow edges that are expected to lead to portions of the graph that are relevant to a topic.

4.1 Multi threaded crawler:

A sequential crawling loop spends a large amount of time in which either the CPU is idle (during network/disk access) or the network interface is idle (during CPU operations).Multi-threading, where each thread follows a crawling Figure 2 shows a multi-threaded version of the basic crawler in Figure 1. Note that each thread starts by locking the frontier to pick the next URL to crawl. After picking a URL it unlocks the frontier allowing other threads to access it. The frontier is again locked when new URLs are added to it. The locking steps are necessary in order to synchronize the use of the frontier that is now shared among many crawling loops (threads). The model of multi-threaded crawler in Figure 2 follows a standard parallel computing model .Note that a typical crawler would also maintain a shared history data structure for a fast lookup of URLs that have been crawled. Hence, in addition to the frontier it would also need to synchronize access to the history. The multi-threaded crawler model needs to deal with an empty frontier just like a sequential crawler. However, the issue is less simple now. If a thread finds the frontier empty, it does not automatically mean that the crawler as a whole has reached a dead-end. It is possible that other threads are fetching pages and may add new URLs in the near future. One way to deal with the situation is by sending a thread to a sleep state when it sees an empty frontier. When the thread wakes up, it checks again for URLs. A global monitor keeps track of the number of threads currently sleeping. Only when all the threads are in the sleep state does the crawling process stop. More optimizations can be performed on the multi-threaded model described here, as for instance to decrease contentions between the threads and to streamline network access.

This section has described the general components of a crawler. The common infrastructure supports at one extreme a very simple breadth-first crawler and at the other end crawler algorithms that may involve very complex URL selection mechanisms. Factors such as frontier size, page parsing strategy, crawler history and page repository have been identified as interesting and important dimensions to crawler definitions.

Lock URL List

Load URL List

Parse Page

Fetch Page

Unlock URL List

Pick URL from List

Check for information

Check for information

Lock URL List

Load URL List

Parse Page

Fetch Page

Unlock URL List

Pick URL from List

Get Add Get Add


End End

Fig2: Multi threaded crawler

4.2 Gcrawler:

The problem which exists in the traditional focused crawler URL analysis model described previously is that the local optimal solution is often easily gotten in the process of searching the relevant pages according to the predetermined theme, namely only crawling around the related web pages, which results in some related web pages which are linked together through hyperlinks with lower degree of relevance are not crawled, then the effective coverage of the focused crawler reduces. The genetic algorithm is a global random search algorithm that based on the evolutionism and molecular genetics, whose prominent feature is the implicit parallelism and the capacity to effective use of the global information, and it can effectively find the global optimal solution jumping local optimum, which is the focused crawler URL analysis model needs.

But the genetic algorithm also has some disadvantages, for example, it cannot use the feedback in the system and lots of unnecessary redundancy iterations come out when the solutions reach a certain extent; and the capacity of local search is weak, also may not get the optimal solution. For the crawling strategy of the current common focused crawler, the content of the web pages is generally provided by the editors, which results in some information irrelevant to the Predetermined theme involved in the web pages. The whole web page documents will be often used in the genetic process, when the genetic algorithm is used in the focused crawler in the past, which results in that the theme drift easily comes out in the process.

For the problems mentioned above, this paper improves the genetic algorithm from the followed aspects:

a. The Fitness Function

The purpose of genetic algorithm is to find the best keywords that reflect the filtering requirements, whether or not reflecting the requirements of the filter are determined mainly by the theme vector, so the correlation is used as the fitness function.The average of the correlations between the theme vector

and some correlated documents is used as the fitness function.


in which P is the user templates, Di is the ith document in the pseudo relevance feedback, n is the number of the documents in the pseudo relevance feedback.

1)In the field of information retrieval, precision is the fraction of retrieved documents that are relevant to the search.

Precision=(relevant document)∩(retrieved document)

retrieved document

2) Recall in information retrieval is the fraction of the documents that are relevant to the query that are successfully retrieved.

Recall = (relevant document) ∩(retrieved document)

relevant document

3)A measure that combines precision and recall is the harmonic mean of precision and recall, the traditional F-measure or balanced F-score:

F= 2. Precision. Recall

Precision + Recall

b. Construct the Virtual Document

User query is the description way of the theme that the user is interested in, the traditional description way of the theme is provided entirely by the editors, which will always be a certain deviation from user demand to a certain degree. This paper introduces the user query to construct virtual documents for the web pages, uses the virtual document to participate in the genetic process instead of the web page document, and then the optimal solution finally obtained is used to modify the description of the theme. This will not only avoid the interference to the genetic process caused by the contents of web documents irrelevant to the theme, but also can make the topic description closer to the user needs. The virtual document is based on the vector space model which uses the search engine query log in the user log to construct a virtual document for the query, the virtual document is a text vector composed of the keywords in the query.

5 Experimental Results:

As shown in the given table 1 improved genetic algorithm will increase the




PRECISION % : 88.0

PRECISION % : 80.0

PRECISION % : 90.0

RECALL % : 22.83

RECALL % : 37.31

RECALL % : 37.31

SEC : 42

SEC : 1

F1 : 0.5275

Table 1: Results

6 Conclusion:

Focused crawler is the key technology of vertical search engine, and the relevance analysis of URL topic is the problem faced by focused crawler which must be solved firstly. After analysis the existing URL topic relevance analysis algorithm and their deficiency, This paper proposes a new focused crawler analysis model based on improved genetic algorithm, introducing user query constructed page virtual document into genetic procession to correct theme description, optimizing the crossover operator, selection operator and mutation operator, using vector space model to calculate the similarity degree of anchor text and topic description.

The experiment shows that the focused crawler URL analysis model based on improved genetic algorithm proposed in this paper can improve accuracy rate, recall rate effectively, and avoid getting into the local optimal solution.Table 2 shows the comparision between crawlers.

Single Thread Crawler

Multi Thread Crawler

G Crawler






It maintains frontier gives required information and also added unvisited URLs to the frontier.

It maintains a fast lookup of URLs that have been crawled. Optimizations can be performed on the multi-threaded model.

The algorithm of GCrawler that based on the evolutionism and molecular genetics, it is an implicit parallelism and the capacity to effective use of the global information, and it can effectively find the global optimal solution jumping local optimization.


A sequential crawling loop spends a large amount of time in which either the CPU is idle or the network interface is idle.

The infrastructure supports at one extreme a very simple breadth-first crawler and at the other end crawler algorithms that may involve very complex URL selection mechanisms.

The capacity of local search is weak, also may not get the optimal solution.

Future work

Precision can be improved.

Precision can be improved.

Time can be reduced.

Table 2: Comparision between Crawlers