Deep Web Content Extraction Approaches Computer Science Essay

The World Wide Web contains a large number of web databases, and these web databases can be searched through their web query interfaces. The pages that search engines can return without accessing web databases form the surface web, whereas the deep web can be accessed only through website interfaces and is therefore inaccessible to search engines. Deep web pages have a complex structure, so extracting data from them is a critical problem, and solutions to this problem are typically dependent on the web-page programming language. This paper studies several deep web data extraction techniques. A different way to extract deep web data, which overcomes the limitations of previous work, is the visual approach: visual features of deep web pages are the primary means of extracting their contents, covering both data record extraction and data item extraction. A visual wrapper is generated for the web database to which a given deep web page belongs.

Keywords: Deep web, Web mining, Visual Block Tree, Web data extraction, Wrapper Generation.


Introduction

The World Wide Web is an evolving medium through which users access the contents of the web. It contains a large number of web databases, which can be searched through their web query interfaces. The pages returned directly are said to be the surface web, which search engines can access without querying web databases; the surface web consists of static pages linked to other pages, while the deep web refers to pages that are not indexed by general search engines. The deep web can be accessed only through website interfaces, so it is inaccessible to search engines. The Web is a complex entity that contains information from a variety of source types and includes an evolving mix of different file types and media; it is much more than static, self-contained web pages. The deep web comprises all information that resides in autonomous databases behind portals and information providers' web front-ends. Pages in the deep web are dynamically generated in response to a query through a web site's search form and often contain rich content. Even web sites with some static links that are crawlable by a search engine often have much more information available only through a query interface. Unlocking the contents of deep web pages is therefore a big challenge.

Deep web pages have a complex structure, so extracting data from them is a critical problem. Web pages are designed in HTML, and HTML is frequently evolving to newer versions. To improve presentation, web page designers embed more and more presentation techniques into their pages, which makes the structure of web pages increasingly complex. Previous systems for deep web data extraction have limitations such as dependence on the web-page programming language. First, they are HTML-dependent because they are based on analyzing the HTML source code of deep web pages. Second, they are not capable of handling the ever-increasing complexity of that source code. This motivates seeking a different way to extract deep web data and to overcome the limitations of previous work by using a visual approach. Visual features of web pages can be used for deep web data extraction: the vision-based system obtains the visual representation of a given deep web page and converts it into a Visual Block tree. This tree helps identify the data region that contains the useful information to be extracted. After noise blocks are removed, the filtered data region is further processed to extract data records and data items.


Existing Approaches

There are many approaches for extracting contents from deep web pages. These approaches are categorized into manual, semi-automatic, and automatic approaches.

Manual Approach

In this approach, users program a wrapper for each web site by hand, using general programming languages such as Perl or specially designed languages. These tools require the user to have a substantial computing and programming background, so the approach is expensive. It relies on various tools, some of which are Minerva [10], Web-OQL [1], and TSIMMIS [5].


Minerva

This tool uses a grammar in EBNF style; a set of productions is defined for each document. It attempts to combine the advantages of a declarative, grammar-based approach with features typical of procedural programming languages by incorporating an explicit exception-handling mechanism inside the grammar.
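
As a rough illustration of grammar-driven extraction with an explicit exception branch (not Minerva's actual EBNF syntax), the sketch below models each production as a regular expression with an optional fallback pattern that plays the role of the exception handler. All field names, patterns, and the sample page are hypothetical.

```python
import re

# Each "production" maps a field to a primary pattern plus an optional
# fallback pattern tried when the primary one fails (the exception branch).
RULES = {
    "title": (r"<h1>(.*?)</h1>", r"<title>(.*?)</title>"),
    "price": (r"Price:\s*\$(\d+\.\d{2})", None),
}

def extract(document, rules):
    """Apply each production; on failure, try the rule's exception branch."""
    record = {}
    for field, (pattern, fallback) in rules.items():
        match = re.search(pattern, document)
        if match is None and fallback is not None:
            match = re.search(fallback, document)  # exception handling
        record[field] = match.group(1) if match else None
    return record

page = "<title>Used Books</title><p>Price: $12.50</p>"
print(extract(page, RULES))  # title is recovered via the fallback production
```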


Web-OQL

This tool is a declarative query language capable of locating selected pieces of data in HTML pages. It was originally aimed at performing SQL-like queries over the web.


TSIMMIS

This tool includes wrappers that can be configured through specification files written by the user. A specification file is composed of a sequence of commands that define extraction steps; an extractor based on the specification file parses an HTML page to locate the data of interest and extract it.

TSIMMIS provides two important operators: split and case. The split operator divides the input list element into individual elements, and the case operator allows the user to handle irregularities in the structure of the input pages.
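
The two operators can be illustrated with a minimal Python sketch (TSIMMIS specification files have their own command syntax; the delimiter, patterns, and sample page below are hypothetical):

```python
import re

def split_op(text, delimiter):
    """split: divide the input element into individual elements."""
    return [chunk for chunk in text.split(delimiter) if chunk.strip()]

def case_op(text, patterns):
    """case: try each (name, regex) alternative; the first that matches
    wins, which absorbs irregularities across records."""
    for name, pattern in patterns:
        match = re.search(pattern, text)
        if match:
            return name, match.group(1)
    return None

page = "<li>Alice, 30</li><li>Bob</li>"
records = split_op(page, "</li>")
parsed = [case_op(r, [("full", r">(\w+), (\d+)"), ("name_only", r">(\w+)")])
          for r in records]
print(parsed)  # the second record lacks an age, so the case falls through
```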

Semi-Automatic Approach

This approach uses HTML-aware tools. Semi-automatic techniques are broadly classified into text-based and sequence-based techniques. They rely on the inherent structural features of HTML documents to accomplish data extraction and grouping. Documents are turned into parse trees before processing. Representative tools of this approach are W4F [9] and XWrap [4].

These are briefly summarized as follows:

World Wide Web Wrapper Factory (W4F)

This is a Java toolkit for the construction of wrappers. The wrapper development process consists of three independent layers: the retrieval layer, the extraction layer, and the mapping layer. In the retrieval layer, a to-be-processed document is retrieved from the Web through the HTTP protocol, cleaned, and then fed to an HTML parser that constructs a parse tree following the Document Object Model (DOM).
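
A toy version of that retrieval-layer output can be built with the standard html.parser module, turning an already retrieved and cleaned document into a nested DOM-like tree (W4F itself is a Java toolkit; this Python sketch is only an illustration):

```python
from html.parser import HTMLParser

class TreeBuilder(HTMLParser):
    """Build a simple nested-dict parse tree from HTML events."""
    def __init__(self):
        super().__init__()
        self.root = {"tag": "#document", "children": []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "children": []}
        self.stack[-1]["children"].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():  # ignore whitespace-only text nodes
            self.stack[-1]["children"].append({"tag": "#text", "text": data})

builder = TreeBuilder()
builder.feed("<html><body><p>Hello</p></body></html>")
doc = builder.root["children"][0]
print(doc["tag"])  # root element of the parse tree
```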


XWrap

XWrap is another important HTML-aware tool for semi-automatic construction of wrappers. It features a component library that provides basic building blocks for wrappers and a user-friendly interface to ease the task of wrapper development. The tool divides the wrapper generation process into two phases: structure analysis and source-specific XML generation.

Automatic Approach

Automatic approaches are primarily text-based and tag-structure-based. The tools in this category each perform their functions separately; they do not combine their processes to produce a whole result, and each process is independent. Although this approach is automatic, it still has some limitations.

The tools used by this approach are RoadRunner [3], IEPAD [7], and DEPTA [8].


RoadRunner

This tool explores the inherent features of HTML documents to automatically generate wrappers. By comparing the HTML structures of web pages of the same "page class," it generates a schema for the data contained in the pages. Its unique feature is that no user intervention is required.
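
The core idea, matching pages of the same page class and treating mismatches as data fields, can be sketched as follows (real RoadRunner also infers optionals and repetitions; this toy version assumes token sequences of equal length, and the sample tokens are hypothetical):

```python
def infer_template(page_a, page_b):
    """Positions where two same-class pages agree are template text;
    positions where they differ are marked as data slots."""
    template = []
    for tok_a, tok_b in zip(page_a, page_b):
        template.append(tok_a if tok_a == tok_b else "#DATA")
    return template

a = ["<b>", "Title:", "</b>", "DB Primer", "<i>", "$9", "</i>"]
b = ["<b>", "Title:", "</b>", "XML at Work", "<i>", "$14", "</i>"]
print(infer_template(a, b))
# → ['<b>', 'Title:', '</b>', '#DATA', '<i>', '#DATA', '</i>']
```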


IEPAD

This tool generalizes extraction patterns from unlabelled web pages. If a web page contains multiple homogeneous data records to be extracted, they are rendered using the same template, which gives them a regular visual appearance. The center star algorithm is applied to align the multiple strings.
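
A minimal sketch of the center star selection step: the string with the smallest total edit distance to all the others is chosen as the center, and the remaining strings are then aligned pairwise against it. The encoded record strings here are hypothetical.

```python
def edit_distance(s, t):
    """Levenshtein distance by dynamic programming, one row at a time."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]

def center_star(strings):
    """Pick the string minimizing the sum of distances to all others."""
    return min(strings,
               key=lambda s: sum(edit_distance(s, t) for t in strings))

records = ["TATT", "TTAT", "TAT"]  # e.g. encoded tag strings of records
print(center_star(records))
```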


DEPTA

Like IEPAD, DEPTA is applicable only to web pages that contain two or more data records in a data region. However, instead of discovering repeated substrings based on suffix trees, which compare all suffixes of the HTML tag strings (the encoded token strings described for IEPAD), it compares only adjacent substrings whose starting tags have the same parent in the HTML tag tree (similar to the HTML DOM tree, but considering only tags).
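
The adjacent-comparison idea can be sketched by measuring the similarity of neighboring child tag strings under one parent; the string similarity measure and threshold below are illustrative assumptions, not DEPTA's actual tree-matching metric.

```python
import difflib

# Tag strings of sibling subtrees under one parent (hypothetical example).
children = ["tr td td", "tr td td", "tr td td td", "table tr"]

def similar(a, b, threshold=0.6):
    """Treat two tag strings as similar above an assumed ratio threshold."""
    return difflib.SequenceMatcher(None, a, b).ratio() >= threshold

# Only adjacent siblings are compared, never arbitrary substring pairs.
pairs = [(i, i + 1) for i in range(len(children) - 1)
         if similar(children[i], children[i + 1])]
print(pairs)  # indices of adjacent, similar subtrees (candidate records)
```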


Vision-Based Approach

This approach [11] employs a four-step strategy. First, given a sample deep web page from a web database, obtain its visual representation and transform it into a Visual Block tree, which is introduced below. Second, extract data records from the Visual Block tree. Third, partition the extracted data records into data items and align data items of the same semantic together. Fourth, generate visual wrappers (sets of visual extraction rules) for the web database from the sample deep web pages, so that both data record extraction and data item extraction for new deep web pages from the same web database can be carried out more efficiently using the visual wrappers. The strategy is shown in Figure 1.

Figure 1. Four-step strategy for data extraction using the visual approach

Visual Features of Deep Web Pages

This approach extracts the contents of deep web pages based on their visual features. Since web pages consist mostly of text and images, the page layout and fonts are treated as visual information. Fonts are determined by their size, face, color, frame, and so on; these visual features are important for identifying special information in the pages. The features used are position, layout, appearance, and content.

Visual Features

The main visual features to be considered before implementing the segmentation process are given below:

(1) Position features (PFs) - These features indicate the location of the data region on a deep Web page.

PF1: Data regions are always centered horizontally.

PF2: The size of the data region is usually large relative to the area size of the whole page.

(2) Layout features - These features indicate how the data records in the data region are typically arranged.

LF1: The data records are usually aligned flush left in the data region.

LF2: All data records are adjoining.

LF3: Adjoining data records do not overlap, and the space between any two adjoining records is the same.

(3) Appearance features - These features capture the visual features within data records.

AF1: Data records are very similar in their appearances, and the similarity includes the sizes of the images they contain and the fonts they use.

AF2: The data items of the same semantic in different data records have similar presentations with respect to position, size (image data item), and font (text data item).

AF3: The neighboring text data items of different semantics often (not always) use distinguishable fonts.

(4) Content features - These features hint at the regularity of the contents in data records.

CF1: The first data item in each data record is always of a mandatory type.

CF2: The presentation of data items in data records follows a fixed order.

CF3: There are often some fixed static texts in data records, which are not from the underlying Web databases.
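
To make the position features concrete, the sketch below filters candidate blocks using PF1 and PF2: the data region should be roughly centered horizontally and cover a large share of the page area. The rectangles, the centering tolerance, and the 0.4 area threshold are all illustrative assumptions.

```python
def locate_data_region(blocks, page_width, page_height, min_area_ratio=0.4):
    """Return the largest block that is horizontally centered (PF1)
    and large relative to the whole page (PF2), or None."""
    page_center = page_width / 2
    candidates = []
    for x, y, w, h in blocks:
        centered = abs((x + w / 2) - page_center) < 0.1 * page_width   # PF1
        large = (w * h) / (page_width * page_height) >= min_area_ratio  # PF2
        if centered and large:
            candidates.append((x, y, w, h))
    return max(candidates, key=lambda b: b[2] * b[3]) if candidates else None

blocks = [(0, 0, 1000, 80),     # header
          (0, 80, 180, 600),    # left navigation bar
          (200, 80, 780, 600),  # result listings
          (0, 680, 1000, 40)]   # footer
print(locate_data_region(blocks, 1000, 720))  # → (200, 80, 780, 600)
```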


Visual Block Tree

To transform a deep web page into a Visual Block tree, the VIPS algorithm is used. The Visual Block tree is the result of the segmentation process: it is a segmentation of the web page in which the root block represents the whole page and each block corresponds to a rectangular region on the page. Leaf blocks, which cannot be segmented further, represent the semantic units.
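
A minimal sketch of such a tree: each block owns the rectangle it renders into, and the leaves are the semantic units. VIPS itself derives these blocks from the rendered page; here the tree is built by hand purely for illustration.

```python
class Block:
    """One node of a Visual Block tree."""
    def __init__(self, rect, children=()):
        self.rect = rect                  # (x, y, width, height)
        self.children = list(children)

    def leaves(self):
        """Collect the leaf blocks, i.e. the semantic units."""
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]

page = Block((0, 0, 1000, 800), [          # root block: the whole page
    Block((0, 0, 1000, 100)),              # header
    Block((100, 100, 800, 600), [          # data region
        Block((100, 100, 800, 200)),       # record 1
        Block((100, 300, 800, 200)),       # record 2
    ]),
])
print(len(page.leaves()))  # header + 2 records
```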

Data Record Extraction

Data record extraction discovers the boundaries of data records and extracts them from deep web pages. Instead of extracting data records from the deep web page directly, the data region is located first, and data records are then extracted from it. The data region corresponds to a block in the Visual Block tree.

Extraction can be achieved in the following three phases:

Phase 1: Filter out some noise blocks.

Phase 2: Cluster the remaining blocks by computing their appearance similarity.

Phase 3: Discover data record boundary by regrouping blocks.
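
Phase 2 can be sketched as a greedy clustering of blocks by appearance similarity, where each block is summarized by a small feature tuple (here font size and image count) and similarity is a simple distance test. The feature choice and the threshold are illustrative assumptions.

```python
def cluster_blocks(blocks, max_dist=2.0):
    """Greedily group blocks whose appearance features are close."""
    clusters = []
    for name, features in blocks:
        for cluster in clusters:
            rep = cluster[0][1]  # first member acts as representative
            if sum(abs(a - b) for a, b in zip(features, rep)) <= max_dist:
                cluster.append((name, features))
                break
        else:
            clusters.append([(name, features)])
    return clusters

# (block name, (font size, image count)) — hypothetical values
blocks = [("record1", (12, 1)), ("ad", (20, 3)),
          ("record2", (12, 1)), ("record3", (13, 1))]
clusters = cluster_blocks(blocks)
print([[name for name, _ in c] for c in clusters])
# → [['record1', 'record2', 'record3'], ['ad']]
```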

Extraction of Data Items

The data item extraction process focuses on the leaf nodes of the Visual Block tree. It is carried out in two phases: data record segmentation and data item alignment.
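
The alignment phase can be sketched as follows: items within each segmented record carry a left coordinate, and items that start at nearly the same offset across records are assumed to share the same semantic (feature AF2). The 10-pixel tolerance and the sample records are assumptions.

```python
def align_items(records, tolerance=10):
    """Group data items from different records into semantic columns
    by their left offsets."""
    columns = []  # each column: (reference offset, list of item texts)
    for record in records:
        for offset, text in record:
            for column in columns:
                if abs(column[0] - offset) <= tolerance:
                    column[1].append(text)
                    break
            else:
                columns.append((offset, [text]))
    return [items for _, items in sorted(columns)]

records = [[(0, "DB Primer"), (300, "$9")],
           [(2, "XML at Work"), (302, "$14")]]
print(align_items(records))
# → [['DB Primer', 'XML at Work'], ['$9', '$14']]
```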

Generation of Visual Wrappers

Visual wrappers are sets of extraction rules generated from the extracted data records and data items. They are programs that perform data record and data item extraction using a set of parameters obtained from the sample web pages. The visual information is used to generate the visual wrappers.
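
A sketch of reusing such a wrapper: the rule set (here just the learned data-region rectangle and per-item font sizes) is applied directly to a new page from the same web database, skipping the full record-discovery analysis. All values and the rule format are illustrative assumptions.

```python
# Hypothetical rules learned from sample pages of one web database.
wrapper = {"region": (200, 80, 780, 600),
           "item_fonts": {12: "title", 10: "price"}}

def apply_wrapper(wrapper, blocks):
    """Label items on a new page using the stored visual rules."""
    rx, ry, rw, rh = wrapper["region"]
    results = []
    for x, y, font, text in blocks:
        inside = rx <= x <= rx + rw and ry <= y <= ry + rh
        label = wrapper["item_fonts"].get(font)
        if inside and label:
            results.append((label, text))
    return results

# (x, y, font size, text) for blocks on a new page — hypothetical values
new_page = [(50, 90, 12, "navigation"), (220, 100, 12, "DB Primer"),
            (220, 130, 10, "$9"), (220, 700, 10, "footer")]
print(apply_wrapper(wrapper, new_page))
# → [('title', 'DB Primer'), ('price', '$9')]
```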


Conclusion

This discussion focuses on the deep web data extraction problem, including data record extraction and data item extraction. First, we surveyed previous work on web data extraction and its inherent limitations. A new visual approach is then introduced to achieve deep web data extraction. This vision-based approach is intended to solve the HTML-dependence problem; it extracts structured data using visual features, providing better efficiency. The primary steps of the approach are building the Visual Block tree, extracting data records and data items, and constructing the visual wrappers.