Deepweb And Web Crawlers Computer Science Essay


The World Wide Web (WWW) is the largest source of information in the current era and a major medium of information sharing between communities. This information is available in different forms: text, video, audio, tables, etc. [1, 2]. People use this information according to their requirements and retrieve it via web browsers. A web browser is an interface for retrieving web pages. Search engines are commonly used to find desired information among the bulk of information available on the internet [3, 4].

A search engine is a program that matches a requested query against a database of HTML documents gathered by a web crawler, finds the relevant URLs, and lists them on a page. Search engines maintain their database of URLs by following one URL to another: when indexing a page, if the crawler encounters a new URL, it adds that URL to the search engine's database. In this manner, a huge number of web pages are indexed by search engines [1, 3, 4].

A web crawler is an essential part of every search engine. A web crawler works like a spider, following one hypertext link to another [3]. It automatically discovers and collects resources such as web pages, images, tables, and multimedia content from the internet. It extracts URLs from HTML documents, indexes them in the search engine's indexer, and updates the index to keep the information fresh for future use. As the internet grows day by day, the size of the search engine's index grows with it [3, 4, 7].

A typical web crawler works by following links, a technique that is effective for crawling the surface web but ineffective for deep web content. Deep web content refers to World Wide Web content that is not part of the surface web [10]. According to a rough survey, the deep web is 500 times larger than the surface web [6]. Very important information sometimes cannot be found simply because search engines are unable to crawl the deep web: the surface web is indexed by standard search engines, but the deep or hidden web is not [11]. Indexing the deep web requires extra effort and different techniques. Two deep web surfacing techniques are known: virtual integration and surfacing. Virtual integration suits domain-specific or vertical search engines, while surfacing suits standard search engines [5]. Both techniques are unable to surface web pages that are dynamically generated by JavaScript or AJAX [5]. AJAX is widely used in today's web applications to retrieve information from the server asynchronously, in the background, without changing the display and behavior of the current page [12]. Gmail, Yahoo Mail, and Google Maps are well-known AJAX applications. The major goal of AJAX is to run client code in the browser instead of on the server to enhance the performance of the web page; another goal is to reduce network traffic. An AJAX application is a dynamic web application at a given URL, usually based on JavaScript, that changes its content without changing the URL or refreshing the page.

Traditional search engines cannot crawl AJAX applications because several issues prevent AJAX applications from being exposed to the crawler:

No pre-crawling: Traditional search engines pre-cache a web site and crawl it locally. AJAX applications are event-based, so events cannot be cached.

Redundant states: Several events may lead to the same state because the same underlying JavaScript function provides the content. Identifying such states is essential for optimizing and decreasing the size of the application view.

Infinite event invocation: Since AJAX applications are event-based, the application view may lead to infinite event invocation [14].

Data behind AJAX forms: A form may require user interaction to be filled out before moving to the next page. A crawler cannot understand such a form and is unable to discover the data behind it.
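The redundant-states issue above can be sketched in a few lines. This is a minimal illustration, not part of any cited system: it assumes a crawler can obtain the rendered content of each state and collapses states by hashing that content, so two events backed by the same JavaScript function map to one state.

```python
import hashlib

def state_id(dom_content):
    """Identify an application state by a hash of its rendered content,
    so different events leading to the same DOM collapse into one state."""
    return hashlib.sha256(dom_content.encode("utf-8")).hexdigest()

# Two hypothetical events that invoke the same underlying JavaScript
# function and therefore produce identical content:
state_after_click = state_id("<div>Product list page 2</div>")
state_after_scroll = state_id("<div>Product list page 2</div>")

seen_states = set()
for state in (state_after_click, state_after_scroll):
    if state not in seen_states:
        seen_states.add(state)

print(len(seen_states))  # only one distinct state is kept
```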

These issues need to be solved to make AJAX applications crawlable. Different techniques are available to resolve them:

Two versions of the same web site: Special web pages can be set up to include an alternate view of the dynamic content. Such pages are designed specifically to make dynamic content crawlable, although the alternate version of a dynamic page contains less information. This is not an acceptable solution to the problem because it is not generic.

Vertical search engine: Some applications, such as YouTube, have their own search engine, which has no access to other dynamic content. This is also a special case, not a generic solution; it is limited to a single application, like a car search engine.

Exposing data to the search engine: Some application developers expose their data to traditional search engines based on an agreement, so that their information can be crawled and indexed.

The existing solutions are neither generic nor precise. This research proposes a generic solution.


The deep web is the portion of the web hidden behind HTML and AJAX forms, or in web pages dynamically generated by JavaScript [5, 8]. Techniques such as virtual integration and surfacing are available to surface the deep web, but they are unable to surface pages behind AJAX forms [5]. Some dynamic forms populate the next form item according to the user's input in a previous item. For example, when a user selects Pakistan from the country combo box, the next combo box shows the cities of Pakistan, and the cities combo box keeps changing for different countries according to the user's selection without reloading the whole page. This is done with AJAX. The problem, then, is: "How can we surface web pages that are behind AJAX-based forms, as well as content behind URLs dynamically downloaded from web servers via AJAX functions?"
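The country/city example above suggests what surfacing such a form involves. The following sketch, using entirely hypothetical data, stands in for the process: `fetch_cities` plays the role of the AJAX endpoint the page would call (in a real crawler this would be an HTTP request to the URL found in the page's JavaScript), and the crawler enumerates every submission the dependent form can produce.

```python
COUNTRY_OPTIONS = ["Pakistan", "Turkey"]  # hypothetical first combo box

def fetch_cities(country):
    # Stand-in for the AJAX call that populates the dependent combo box;
    # the data here is invented for illustration.
    data = {
        "Pakistan": ["Lahore", "Karachi", "Islamabad"],
        "Turkey": ["Ankara", "Istanbul"],
    }
    return data.get(country, [])

def surface_form():
    """Enumerate every (country, city) combination the form can submit,
    so each resulting page can be crawled and indexed."""
    submissions = []
    for country in COUNTRY_OPTIONS:
        for city in fetch_cities(country):
            submissions.append((country, city))
    return submissions

print(surface_form())
```

The key point is that the crawler must invoke the same data source the page's JavaScript would, rather than parse static HTML, because the city options never appear in the page source.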


AJAX works with JavaScript functions: it sends a request to another page, gets the response, and shows it on the same page. By following the URL in the JavaScript function, we can get the data needed to surface the complete form through a web crawler. The objective of this research is to devise a procedure to crawl and index the data and URLs behind forms powered by AJAX. The major objectives are:

To review the relevant literature on existing techniques

To design a methodology for getting the contents of a page that contains AJAX-powered forms and functions

To index the web content behind AJAX forms

To provide a mechanism for search engines to take the user directly to the data behind AJAX-driven forms

To verify and validate the proposed solution using simulations
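The idea of following the URL inside a JavaScript function can be sketched simply. The script below is a hypothetical fragment of the kind a crawler might encounter; a regular expression pulls the request URL out of the `xhr.open(...)` call so the crawler can fetch the same data the AJAX function would. Real pages vary widely, so this is an illustration of the approach, not a general extractor.

```python
import re

# Hypothetical inline script taken from a crawled page
page_script = """
function loadCities(c) {
  var xhr = new XMLHttpRequest();
  xhr.open("GET", "/ajax/cities?country=" + c, true);
  xhr.send();
}
"""

# Pull candidate request URLs out of xhr.open(...) calls.
AJAX_URL_RE = re.compile(r'xhr\.open\(\s*"(?:GET|POST)"\s*,\s*"([^"]+)"')

urls = AJAX_URL_RE.findall(page_script)
print(urls)  # ['/ajax/cities?country=']
```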


The world is getting closer because of the World Wide Web, and information sharing has become very easy [1, 2]. Search engines are used to extract desired information from a pool of heterogeneous information. They use crawlers to crawl HTML documents and pages, storing information about these pages in a database [3]. When a user submits a query, the search engine scans its database and displays a list of relevant matches on a web page. Standard search engine crawlers crawl only the surface web and are unable to crawl the hidden or deep web [5, 10]. The deep web is the part of the World Wide Web that lies behind HTML forms, AJAX-powered forms, or web pages dynamically generated by JavaScript or AJAX [10]. The user has to perform a form submission with valid input to view such pages; this is why a web crawler cannot surface them with typical crawling techniques. Techniques such as virtual integration and surfacing are used to surface the deep web [5, 8], but they are unable to surface the web behind AJAX forms. The scope of this research is to derive a procedure to surface hidden web content just like surface web content. This research will present a solution to surface web pages that lie behind AJAX forms.

This research is restricted to finding a solution to surface web pages that are only accessible through links produced by JavaScript, as well as content dynamically invoked by AJAX functions. Other scripted content, such as ActionScript, is beyond the scope of this research, as are web databases that are not accessible through conventional search engines. A traditional web browser sends an HTTP request with the inputs and their values from a form using one of two methods: GET or POST. With GET, the parameters are appended to the URL itself; with POST, they are attached to the body of the HTTP request. Since search engines identify web pages based on their URLs, result pages from POST are indistinguishable and hence not directly indexable. A form has two types of inputs: binding inputs and free inputs. Binding inputs are those required to submit the form; free inputs may relate to the presentation of the data returned by the database or to something that has no impact on the data, so free inputs are useless to index. The need, then, is to guess input values for the binding inputs and to check whether the data retrieved by each query template is useful [5]. This work follows these restrictions and adds a guessing technique for inputs powered by AJAX. Developing a search engine that surfaces the dynamic web is beyond the scope of this research; instead, it provides a solution that will be helpful for search engine developers.
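The GET/POST distinction above is the crux of indexability, and can be demonstrated directly. The URL and parameter names below are hypothetical; the point is only that GET submissions yield distinct URLs while POST submissions all share one.

```python
from urllib.parse import urlencode

# With GET, form inputs become part of the URL, so each distinct
# submission yields a distinct, indexable URL.
base = "http://example.com/search"  # hypothetical form action
params = {"make": "Toyota", "model": "Corolla"}
get_url = base + "?" + urlencode(params)
print(get_url)  # http://example.com/search?make=Toyota&model=Corolla

# With POST, the parameters travel in the request body, so every
# submission shares the same URL; result pages are indistinguishable
# to a URL-based index.
post_url = base
print(get_url != post_url)  # True: only GET submissions differ by URL
```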


The rest of this thesis is organized as follows: Chapter 2 contains the literature review and the current status of the domain. Chapter 3 describes the methodology used to derive the solution to the problem, Chapter 4 contains the experimental results, and Chapter 5 contains the conclusion and future work.



The World Wide Web has grown from a few thousand web pages in 1993 to almost 2 billion web pages at present. It is a big source of information sharing, and this information is available in different forms: text, images, audio, video, tables, etc. People use this information via web browsers. A web browser is an application for browsing the web on the internet. Search engines are used to find specific data in the pool of heterogeneous information [1]. The rest of this chapter explains how people can search for relevant information, how a search engine works, what a crawler is and how it works, and what the literature says about this particular problem.


A search engine is a program that searches for information on the internet. The results for a search query given by the user are presented as a list on a web page. Each result is a link to a web page that contains information relevant to the given query; the information can be a web page, an audio or video file, or a multimedia document. Web search engines work by storing information in a database, collected by crawling each link on a given web site. Google is considered the most powerful and most heavily used search engine these days. It is a large-scale, general-purpose search engine that can crawl and index millions of web pages every day [7]. It provides a good starting point for information retrieval but may be insufficient for complex information inquiries that require extra knowledge.


A web crawler is a computer program used to browse the World Wide Web in an automatic and systematic manner. It browses the web and saves the visited data in a database for future use. Search engines use crawlers to crawl and index the web to make information retrieval easy and efficient [4].

A conventional web crawler can only retrieve the surface web; crawling and indexing the hidden or deep web requires extra effort. The surface web is the portion of the web that can be indexed by a conventional search engine [11], while the deep or hidden web is the portion that cannot [10].


The deep web is the part of the web that is not part of the surface web and lies behind HTML forms or dynamic pages [10]. Deep web content can be classified into the following forms:

Dynamic content: Web content accessed by submitting input values in a form. Such content requires domain knowledge; without that knowledge, navigation is very hard.

Unlinked content: Pages that are not linked from other pages, which may prevent them from being crawled by a search engine.

Private web: Sites that require registration and login information.

Contextual web: Pages whose content varies for different access contexts.

Limited-access content: Sites that limit access to their pages.

Scripted content: The portion of the web that is only accessible through links produced by JavaScript, as well as content dynamically invoked by AJAX functions.

Non-HTML/text content: Textual content encoded in images or multimedia files, which cannot be handled by search engines [6].

All of these create a problem for search engines and for the public, because a lot of information is invisible, and a common search engine user may not even realize that the most important information is inaccessible to him or her because of the above properties of web applications. The deep web is also believed to be a big source of structured data on the web, and retrieving it is a big challenge for the data management community. In fact, the belief that the deep web consists only of structured data is a myth: the deep web is a significant source of data, much of which is structured, but not all of it [8].

Researchers are trying to find ways to crawl deep web content, and they have succeeded to some extent, but many research problems remain open. One way to search deep web content is a domain-specific or vertical search engine; such search tools provide links to national and international scientific databases or portals [7]. The literature describes two other techniques to crawl deep web content: virtual integration and surfacing. Virtual integration is used in vertical search engines for specific domains like cars, books, or research work. In this technique, a mediator form is created for each domain, along with semantic mappings between the individual data sources and the mediator form. This technique is not suitable for a standard search engine: first, creating mediator forms and mappings costs too much; second, identifying the queries relevant to each domain is a big challenge; and third, information on the web is about everything, so domain boundaries cannot be clearly defined. Surfacing pre-computes the most relevant input values for all appealing HTML forms. The URLs resulting from these form submissions are produced offline and indexed like normal URLs. When a user queries for a page that is in fact deep web content, the search engine automatically fills the form and shows the link to the user. Google uses this technique to crawl deep web content, but it is unable to surface scripted content [5]. Today most web applications are AJAX-based because AJAX reduces the user's surfing effort and the network traffic [12, 14]. Gmail, Yahoo Mail, Hotmail, and Google Maps are famous AJAX applications. The major goal of AJAX-based applications is to enhance the user experience by running client code in the browser instead of refreshing the page from the server; the second goal is to reduce network traffic. This is achieved by refreshing only part of the page from the server [14]. AJAX has its own limitations.
AJAX applications refresh their content without changing the URL, which is a problem for crawlers because they are unable to identify the new state; the application behaves like a single-page web site. It is therefore essential to find a mechanism that makes AJAX crawlable. To surface content that is only accessible through JavaScript, as well as content behind URLs dynamically downloaded from the web server via AJAX functions [5], several hurdles that prevent the web from being exposed to crawlers must be overcome:

Search engines pre-cache a web site and crawl it locally, but AJAX applications are event-based, so events cannot be cached.

Because AJAX applications are event-based, several events may lead to the same state when the same underlying JavaScript function provides the content. It is necessary to identify redundant states to optimize the crawling results [14].

The entry point to the deep web is a form. When a crawler finds a form, it needs to guess the data to fill it out [15, 16]. In this situation the crawler needs to act like a human.

There are many solutions to these problems, but all have their limitations. Some application developers provide a custom search engine, or they expose web content to traditional search engines based on an agreement; this is a manual solution and requires extra contribution from the application developers [9]. Some web developers provide a vertical search engine on their web site, used to search for information specific to that site. Many companies maintain two interfaces to their web site: a dynamic interface for the user's convenience and an alternate static view for crawlers. These solutions only discover the states and events of AJAX-based web content and ignore the content behind AJAX forms. This research proposes a solution to discover the web content behind AJAX-based forms. Google has proposed a solution, but that project is still ongoing [9].

The process of crawling the web behind AJAX applications becomes more complicated when a form is encountered and the crawler needs to identify the form's domain in order to fill in data and crawl the page. Another problem is that no two forms have the same structure: a user looking for a car finds a different kind of form than a user looking for a book. These differing form schemas make reading and understanding forms more complicated. To make forms readable and understandable by crawlers, the whole web would have to be classified into small categories, each belonging to a different domain with a common form schema, which is not possible. Another approach is the focused crawler: focused crawlers try to retrieve only the subset of pages that contains the most relevant information on a particular topic. This approach leads to better indexing and more efficient searching than the first approach [17], but it does not work when a form depends on a parent form. For example, a student filling out a registration form enters a country name in one field, and the next combo box dynamically loads the city names of that particular country. To crawl the web behind AJAX forms, a crawler needs special functionality.


Traditional web crawlers discover new web pages by starting from known pages in a web directory. The crawler examines a web page, extracts new links (URLs), and then follows these links to discover new pages. In other words, the whole web is a directed graph, and a crawler traverses this graph with a traversal algorithm [7]. As mentioned above, an AJAX-based web application is like a single-page application, so crawlers are unable to crawl the AJAX-based web completely. AJAX applications have a series of events and states: each event acts as an edge and each state as a node. Crawling states has already been done in [14, 18], but that research left out the portion of the web behind AJAX forms. The focus of this thesis is to crawl the web behind AJAX forms.
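The traversal described above can be sketched as a breadth-first walk over the link graph. The pages and links below are invented for illustration; a real crawler would fetch each page over HTTP and parse out its links instead of reading a dictionary.

```python
from collections import deque

# The web as a directed graph: page -> outgoing links (hypothetical URLs)
WEB = {
    "/index.html": ["/a.html", "/b.html"],
    "/a.html": ["/b.html", "/c.html"],
    "/b.html": ["/index.html"],
    "/c.html": [],
}

def crawl(seed):
    """Breadth-first traversal of the link graph, the way a traditional
    crawler discovers new pages starting from a known seed."""
    visited = [seed]
    frontier = deque([seed])
    while frontier:
        page = frontier.popleft()
        for link in WEB.get(page, []):
            if link not in visited:
                visited.append(link)
                frontier.append(link)
    return visited

print(crawl("/index.html"))
```

An AJAX application defeats exactly this loop: its "edges" are events rather than anchor tags, so they never show up in the extracted links.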


Indexing means creating and managing an index of documents to make searching for and accessing desired data easy and fast. Web indexing is about creating indexes for different web sites and HTML documents; these indexes are used by search engines to make searching fast and efficient [19]. A major goal of any search engine is to build a database of large indexes. Indexes are based on organized information, such as topics and names, that serves as an entry point to go directly to the desired information within a corpus of documents [20]. If the web crawler's index has space for only so many web pages, those pages should be the ones most relevant to the particular topic. A good web index can be maintained by extracting all relevant web pages from as many different servers as possible. A traditional web crawler takes the following approach: it uses a modified breadth-first algorithm to ensure that every server has at least one web page represented in the index. Each time the crawler encounters a new web page on a new server, it retrieves that server's pages and indexes them with the relevant information for future use [7, 21].
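One simple reading of the modified breadth-first idea above is to reorder the crawl frontier so that the first page seen from each server is fetched before any server's second page. The URLs are hypothetical, and this is only a sketch of the prioritization step, not of the full algorithm in [7, 21].

```python
from urllib.parse import urlparse

# Hypothetical discovered URLs, in discovery order
discovered = [
    "http://a.example/page1",
    "http://a.example/page2",
    "http://b.example/home",
    "http://c.example/docs",
    "http://b.example/about",
]

def prioritize_new_servers(urls):
    """Reorder the frontier so every server gets at least one page into
    the index early, before any server contributes a second page."""
    seen_hosts = set()
    first_per_host, rest = [], []
    for url in urls:
        host = urlparse(url).netloc
        if host not in seen_hosts:
            seen_hosts.add(host)
            first_per_host.append(url)
        else:
            rest.append(url)
    return first_per_host + rest

print(prioritize_new_servers(discovered))
```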