Implementation Of Automatic Wrapper Adaptation Computer Science Essay


Abstract - Extracting precise information from Web sites is a useful way to obtain structured data from unstructured or semi-structured sources, and the resulting data supports further intelligent processing. Wrappers are common information extraction systems that transform largely unstructured information into structured data. The method in this paper is meant for extracting Web data. Some existing techniques require manually prepared training data, while others need no manual intervention. A wrapper generated for one site cannot be applied directly to a new site, even within the same domain. Moreover, some methods extract only the data attributes specified in the wrapper, while unseen Web pages may contain additional attributes that need to be identified. Automatically adapting the information extraction knowledge to a new, unseen site while discovering previously unseen attributes is a challenging task. The system learns information extraction knowledge for a new Web site automatically, and new attributes are discovered as well.

Keywords - DOM Tree, Wrapper Adaptation, Wrapper Learning, Web Mining.


I. Introduction

Information extraction systems aim at automatically extracting exact data from documents; they can also transform largely unstructured information into structured data. A common information extraction technique for semi-structured documents such as Web pages is the wrapper. A wrapper consists of a set of extraction rules. Early techniques required manually preparing a set of rules to construct a wrapper. Semi-automatic techniques require training a wrapper manually first and then using the same wrapper to extract information automatically from the remaining Web pages of the same site. One restriction of a learned wrapper is that it cannot be applied to previously unseen Web sites, even in the same domain. To construct a wrapper for an unseen site, a separate human effort is required to prepare training examples.

An information extraction system should reduce the manual effort required to prepare training examples through wrapper adaptation, which aims at automatically adapting a wrapper previously learned from one Web site, known as the source Web site, to new, unseen sites in the same domain.

Another shortcoming of existing wrapper learning techniques is that the attributes extracted by a learned wrapper are limited to those defined during training; at best, such wrappers extract pre-specified attributes only. A new, unseen site may contain additional attributes that are not present in the source Web site. We survey the problem of new attribute discovery, which aims at extracting these unspecified attributes from new, unseen sites. New attribute discovery can deliver more useful information to users.


II. Related Work

Cohen and Fan [2] proposed a method that alleviates the problem of manually preparing training data by investigating wrapper adaptation. Rules are learned from a number of Web sites and then used for data extraction. One disadvantage of this method is that training examples from several Web sites must be collected to learn such heuristic rules. Golgher and Silva [3] proposed a bootstrapping method that tries to solve the wrapper adaptation problem. A bootstrapping data repository, called the source repository, is assumed to contain a set of objects belonging to the same domain. This approach assumes that the attributes in the source repository match the attributes in the new Web site; however, exact matching is not always possible. Lerman, Gazen, Minton, and Knoblock [4] suggested a method called ADEL, which can extract records from Web sites and semantically label the attributes in new, unseen sites. The training stage consists of background knowledge acquisition, in which data is collected in a particular domain and a structural description of the data is learned. Data from a new site is then extracted on the basis of the learned rules. The extracted data are organized in a table, and each column is labelled by matching its entries against the patterns learned from the source site. The method provides only a single attribute label for an entire column, which may contain inconsistent or incorrectly extracted data; such incorrect entries are assigned a wrong attribute label. Liu, Grossman, and Zhai [5] proposed MDR, a method to mine data records in a Web page automatically. A generalized node of length r consists of r nodes in the HTML tag tree with the following two properties:

1) The nodes all have the same parent.

2) The nodes are adjacent.

A data region is a collection of two or more generalized nodes.

This method works as follows:

Step 1: Build an HTML tag tree of the page.

Step 2: Mine data regions in the page using the tag tree and string comparison.

Step 3: Identify data records from each data region.

This method suffers from a major drawback: it cannot differentiate the type and meaning of the extracted information. Hence, the extracted items require human effort to interpret.

Blei, Bagnell, and McCallum [7] proposed a probabilistic model that assumes future data will exhibit the same regularities as the training data. Many data sets contain scope-limited features that predict only a certain subset of the data; for example, in information extraction from Web pages, word formatting differs from page to page. The difficulty with such features lies in capturing and exploiting the new regularities encountered in previously unseen data. They proposed a hierarchical probabilistic model that uses both local, scope-limited features such as word formatting and global features such as word content. A random parameter is estimated and used to perform classification with both local and global features.

Freitag and McCallum [8] proposed a method based on hidden Markov models (HMMs), a powerful probabilistic tool for modeling data that has been applied to data extraction tasks. HMM state transition probabilities are learned from labeled training data, and in many approaches a lack of sufficient labeled training data hinders the reliability of the model. A statistical technique called "shrinkage", which significantly improves the estimation of HMM probabilities in the face of sparse training data, is used. An HMM is a finite state automaton with state transitions and symbol emissions; transition and emission probabilities are learned from training data. Given a model and all its parameters, information extraction is performed by determining the sequence of states most likely to have generated the entire document and extracting the symbols associated with designated target states.

Turmo, Ageno, and Catala [9] surveyed many information extraction techniques, describing adaptive approaches that use machine learning to automatically acquire the knowledge needed when building an information extraction system. It is difficult to decide which technique is best, because a system's behavior changes as the domain changes, and many parameters must be considered in making that decision. Riloff and Jones [10] proposed a multi-level bootstrapping method. Information extraction requires two kinds of dictionaries: a semantic lexicon and a dictionary of extraction patterns for a particular domain. Unannotated training texts and a few seed words for a category are the input. A mutual bootstrapping technique selects the best extraction pattern for the category and bootstraps its extractions into the semantic lexicon, which in turn is the basis for selecting the next extraction pattern. To make this approach more robust, they added a second level of bootstrapping that retains only the most reliable lexicon entries produced by mutual bootstrapping and restarts the process.

Kristjansson, Culotta, Viola, and McCallum [11] proposed an interactive information extraction method that assists the user in filling database fields. The user is given an interactive interface that allows errors to be corrected; when there are many errors, the system takes the user's corrections into account and automatically corrects other fields as well. Irmak and Suel [12] proposed a semi-automatic wrapper generation method in which the wrapper is trained on different data sets in a simple interactive manner, minimizing the user effort required for training through the interface. Crescenzi and Mecca [13] proposed an automatic information extraction system. They defined a class of regular languages, called prefix mark-up languages, which abstracts the structures usually found in HTML pages. Unsupervised algorithms are defined for this class, and the prefix mark-up languages together with the associated algorithms can be used for information extraction.

Etzioni, Cafarella, Downey, Popescu, Shaked, Soderland, Weld, and Yates [14] proposed KNOWITALL, which introduces a novel generate-and-test architecture that extracts information in two stages. KNOWITALL utilizes a set of eight domain-independent extraction patterns to generate candidate facts, then automatically tests these candidates using pointwise mutual information (PMI). Based on the PMI statistics, it associates a probability with every fact it extracts, enabling it to manage the tradeoff between precision and recall automatically.

Banko, Cafarella, Soderland, Broadhead, and Etzioni [15] proposed an open information extraction system that makes a single data-driven pass over its data set and extracts a large set of relational tuples without requiring any human input; the relations of interest are extracted and stored. Probst, Ghani, Krema, and Fano [16] proposed an approach to extract attribute-value pairs from product descriptions. A semi-supervised expectation-maximization algorithm is used along with Naïve Bayes, and the extraction system requires little initial user supervision. Liu, Pu, and Han [17] proposed the XWrap system, which uses the formatting information in a Web page to reveal the page's semantic structure; the extraction knowledge is encoded in a rule-based language. Wrapper generation is a two-phase process: in the first phase a tree-like structure is generated by cleaning up the page, and in the second phase an XML template file is generated. Manual intervention is needed here. Califf and Mooney [18] proposed RAPIER, which begins with more specific rules and then replaces them with more general ones. It uses syntactic and semantic information, including a part-of-speech tagger, and considers a pre-filler pattern, the actual slot filler, and a post-filler pattern, where the pre-filler pattern is the text immediately preceding the filler and the post-filler pattern is the text immediately following it.

Kushmerick, Weld, and Doorenbos [19] proposed WIEN. They identified a family of six wrapper classes: LR, HLRT, OCLR, HOCLRT, N-LR, and N-HLRT. Four of these classes handle semi-structured documents, and the remaining two handle hierarchically nested documents. Since WIEN assumes ordered attributes in a data record, missing attributes and permutations of attributes cannot be handled.

Chang and Lui [20] proposed IEPAD. This method exploits the fact that when a Web page contains multiple records, they are presented in the same template for good visualization, so the page contains repetitive patterns of data records. Learning a wrapper can therefore be reduced to discovering repetitive patterns. IEPAD uses a PAT tree data structure to discover repetitive patterns in a Web page; after the patterns are found, the user is required to choose the relevant data.

Wang and Lochovsky [21], [22] proposed DeLa, which removes user interaction from extraction rule generalization and deals with nested object extraction. A Data-rich Section Extraction (DSE) algorithm extracts data-rich sections from Web pages by comparing the DOM trees of two pages and discarding nodes with identical sub-trees. A pattern extractor then discovers continuously repeated patterns using suffix trees; each occurrence of the resulting regular expression represents one data object. The data objects are transformed into a relational table, where multiple values of one attribute are distributed over multiple rows, and finally labels are assigned to the columns of the table.

We have surveyed several techniques for data extraction and wrapper adaptation. A wrapper created for one site cannot be applied directly to a new, unseen site. Some methods have the drawback of requiring human effort; others reduce that effort, and some are fully automatic. Some methods involve the tedious task of creating training examples, while others ease this task by being unsupervised.


III. Problem Definition

The problem of extracting data from Web pages has been addressed by many researchers. The task is domain specific, and information extraction systems are often called wrappers. When the source from which data is to be extracted changes, the wrapper no longer works for the new source, because the new source has different features from the previous one. In other words, a wrapper created for one Web site cannot be used directly to extract data from another Web site, even in the same domain: as the site changes, the patterns of the new site differ from those of the previous site, and new rules must be generated for the new site.

Correct labelling of data also plays an important part in an information extraction system; data values can sometimes be retrieved and placed in the wrong column. Data regions of the new Web site may contain extra attributes that were not present in the previous site, so a new or adapted wrapper must also be able to locate these new attributes. The problem of extracting information from Web sources has three variants, namely manual, semi-automatic, and fully automatic, depending on whether the construction of the wrapper always requires manual intervention, whether manual intervention is needed only at training time with the remaining pages extracted automatically, or whether the wrapper is adapted to a new site with no manual intervention at all.

Consider a domain D, for example the book domain, which contains a number of pages P = {p1, p2, p3, ...}. A page contains a number of records R = {r1, r2, r3, ...}, and a particular record contains a number of attributes A = {a1, a2, a3, ...}. For example, a book-domain site contains Web pages that consist of book records, and a record consists of attributes such as title, author, and price.
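This domain model can be sketched with a few data classes; the class and attribute names below are illustrative, not from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class Record:
    # A record maps attribute names to extracted values,
    # e.g. {"title": ..., "author": ..., "price": ...}.
    attributes: dict

@dataclass
class Page:
    records: list = field(default_factory=list)

@dataclass
class Domain:
    name: str
    pages: list = field(default_factory=list)

# A book-domain site: one page holding one book record.
r = Record({"title": "Thinking in Java",
            "author": "Bruce Eckel",
            "price": "$39.99"})
book_domain = Domain("books", [Page([r])])
print(book_domain.pages[0].records[0].attributes["title"])   # Thinking in Java
```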

Wrapper learning:

A wrapper is the common system used to extract information from a Web site. Given a set of Web pages P, the goal of the wrapper is to extract the records contained in these pages. Let Wrap(w1) denote the wrapper for Web site w1; to extract records from w1, Wrap(w1) must be learned from training examples taken from w1.

Wrapper adaptation:

A wrapper created for one Web site cannot be used directly to extract information from another Web site, even in the same domain. Wrapper adaptation aims at automatically learning a wrapper Wrap(w2) for the Web site w2 without any training examples from w2, such that the adapted wrapper Wrap(w2) can extract the text fragments belonging to the pages of w2.

New attribute discovery:

New attribute discovery aims at automatically identifying attributes that were not present in Web site w1. For instance, suppose we have a wrapper that can extract the attributes title, author, and price of the book records in the Web site shown in fig 1.

New attribute discovery can identify text fragments referring to previously unseen attributes such as ISBN and publisher, as shown in fig 2.


IV. Proposed Method

To adapt an information extraction wrapper to a new site, we take sample Web pages of that site for training. The Web pages of a site are divided into two sets: the training set contains two Web pages and is used for training, and the testing set contains the remaining pages of the same site and is used for testing.


Selecting training data

We provide two Web pages of a site for training. For example, in the book domain, select one page that contains records of "java" books and a second page that contains records of "C programming" books. These Web pages form the training set for the wrapper.

Useful text fragments identification

To identify useful text fragments in a Web page, the page can be represented as a DOM tree [6], [23], [24]. The internal nodes of this tree are HTML tags, and the leaf nodes are the text fragments displayed in the browser. Each text fragment is associated with a root-to-leaf path, which is the concatenation of HTML tags, as shown in fig 4. Suppose we have two Web pages of the same site containing different records. The text fragments related to the attributes of a record are likely to differ between the pages, while fragments related to irrelevant information such as advertisements, listings, or copyright statements are likely to be similar on both. In the DOM tree representation, all anchor tags from both pages are considered: anchor tags holding book titles are likely to differ, whereas anchor tags holding other information, such as category listings and advertisements, are likely to be the same. We therefore delete every anchor tag whose content is identical on both pages [1]; the remaining anchors are the useful text fragments. Not all of these fragments are related to book records, however; some are still unrelated to any attribute of a record.
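The anchor-comparison step can be sketched with Python's standard html.parser module. This is a minimal sketch: the sample pages and the slash-separated path format are illustrative assumptions.

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect the text of every <a> element together with its
    root-to-leaf tag path in the page's DOM tree."""
    def __init__(self):
        super().__init__()
        self.path = []        # stack of currently open tags
        self.in_anchor = 0
        self.anchors = []     # (root-to-leaf path, anchor text) pairs

    def handle_starttag(self, tag, attrs):
        self.path.append(tag)
        if tag == "a":
            self.in_anchor += 1
            self.anchors.append(("/".join(self.path), ""))

    def handle_endtag(self, tag):
        if tag == "a" and self.in_anchor:
            self.in_anchor -= 1
        if self.path and self.path[-1] == tag:
            self.path.pop()

    def handle_data(self, data):
        if self.in_anchor and self.anchors:
            path, text = self.anchors[-1]
            self.anchors[-1] = (path, text + data.strip())

def useful_anchors(page1_html, page2_html):
    """Discard anchors whose text is identical on both pages
    (navigation, ads, category listings); keep the rest."""
    a1, a2 = AnchorCollector(), AnchorCollector()
    a1.feed(page1_html)
    a2.feed(page2_html)
    shared = {t for _, t in a1.anchors} & {t for _, t in a2.anchors}
    return [(p, t) for p, t in a1.anchors if t not in shared]

page1 = '<html><body><a href="#">Home</a><h3><a href="#">Java Basics</a></h3></body></html>'
page2 = '<html><body><a href="#">Home</a><h3><a href="#">C Primer</a></h3></body></html>'
print(useful_anchors(page1, page2))   # [('html/body/h3/a', 'Java Basics')]
```

The shared "Home" navigation link is dropped because it appears on both pages, while the differing title anchors survive as useful text fragments.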

Processing useful text fragments

We now have all anchor tags that differ between the two pages. In the book domain, book titles are generally presented in anchor tags, so we try to find the anchor tags related to book titles and process the data they contain.

a) Remove stop words: Stop words such as "a", "an", and "and" must first be deleted from the useful text fragments, because the next step of this method is a frequency count. Stop words can occur many times on a Web page, so if they were kept, their frequency counts would exceed those of the genuinely useful words. Some of the stop words are registered for deletion with the stop.add() method:

stop.add("in");
stop.add("an");
stop.add("for");
stop.add("the");
stop.add("a");

b) Frequency count: After removing the stop words from the useful text fragments, count the frequency of each word in the remaining fragments. For example, suppose the Web page used for training contains 100 records of "java" books; each record will contain the word "java" in its title attribute. Take the word with the maximum frequency count; in our case, "java" will be the most frequent word.
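Steps a) and b) together can be sketched as follows; the stop-word list and sample titles are illustrative:

```python
from collections import Counter

# Illustrative stop-word list (the paper's full list is longer).
STOP_WORDS = {"in", "an", "for", "the", "a", "and"}

def most_frequent_word(fragments):
    """Count word frequencies over the useful text fragments,
    ignoring stop words, and return the most common word."""
    counts = Counter(
        word
        for fragment in fragments
        for word in fragment.lower().split()
        if word not in STOP_WORDS
    )
    word, _ = counts.most_common(1)[0]
    return word

# Title fragments from a page listing Java books:
titles = ["Thinking in Java", "Java for Beginners", "Effective Java"]
print(most_frequent_word(titles))   # java
```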

Locate the path:

The word with the maximum frequency identifies the attribute title. The anchor tag containing the title of a book is a leaf of the DOM tree representation. To find the root-to-leaf path for the title, walk upward in the DOM tree, taking the parent of each tag until the root is reached. This path gives the tag tree for the attribute title.
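The upward walk can be sketched with a minimal node structure carrying parent pointers (a hypothetical structure; in practice the tree would come from an HTML parser):

```python
class Node:
    """Minimal DOM node with a parent pointer."""
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent

def root_to_leaf_path(leaf):
    """Walk upward from a leaf node to the root and return the
    root-to-leaf tag path used as an extraction rule."""
    path = []
    node = leaf
    while node is not None:
        path.append(node.tag)
        node = node.parent
    return list(reversed(path))

# Build the title path of the running example: html down to the <a> leaf.
tags = ["html", "body", "div", "table", "tbody", "tr", "td", "ol", "li", "h3", "a"]
node = None
for t in tags:
    node = Node(t, node)      # each node's parent is the previous node

print(root_to_leaf_path(node))
```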

The other attributes of a record appear between two titles. To locate the paths of these attributes, we consider the following features:

Each word of a book title begins with a capital letter.

The author name appears immediately after the title and may be preceded by the keyword "by".

The author name may be in italic or bold, or may carry the semantic label "author".

The price of a book may contain symbols such as $ or Rs.; prices are numeric values and are generally bold.

The ISBN of a book carries the semantic label "ISBN" with a numeric value and is always in capitals.

In this way, by considering the features of the various attributes, we can locate all the attributes in the Web page. The tags are identified first, and then the root-to-leaf path in the DOM tree is found for each attribute. For example, from fig. x we can locate the root-to-leaf paths for the attributes title, author, and price.
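The feature list above can be turned into a rough heuristic classifier. The regular expressions and their ordering here are illustrative assumptions, not the paper's exact rules:

```python
import re

def classify_fragment(text):
    """Assign an attribute label to a text fragment using the
    heuristic features listed above (illustrative sketch)."""
    # ISBN check first: an ISBN label is all-caps and would
    # otherwise be misread as a title.
    if re.search(r"\bISBN\b", text):
        return "isbn"
    # Prices carry a currency symbol followed by digits.
    if re.search(r"(\$|Rs\.?)\s*\d", text):
        return "price"
    # Author names follow a "by" keyword or an "author" label.
    if text.lower().startswith("by ") or "author" in text.lower():
        return "author"
    # Titles capitalize the first letter of each word.
    words = text.split()
    if words and all(w[0].isupper() for w in words if w[0].isalpha()):
        return "title"
    return "unknown"

print(classify_fragment("Effective Java"))       # title
print(classify_fragment("by Bruce Eckel"))       # author
print(classify_fragment("$39.99"))               # price
print(classify_fragment("ISBN 9780131872486"))   # isbn
```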

Title: html → body → div → table → tbody → tr → td → ol → li → h3 → a

Author: html → body → div → table → tbody → tr → td → ol → li → div → div → cite

Price: html → body → div → table → tbody → tr → td → ol → li → div → div → big

For example, the following is the nested root-to-leaf path for the attribute title:

<html>
 <body>
  <div>
   <table>
    <tbody>
     <tr>
      <td>
       <ol>
        <li>
         <h3>
          <a>
These paths are used to train the wrapper. The wrapper is learned from these paths and then applied to the remaining Web pages of the site, our testing set. Using these rules (paths), the wrapper can easily extract records from the testing pages.
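Applying the learned root-to-leaf paths to a testing page can be sketched as follows, again with Python's standard html.parser; the shallow rule path and sample page are illustrative assumptions:

```python
from html.parser import HTMLParser

class PathExtractor(HTMLParser):
    """Extract the text of every element whose root-to-leaf tag path
    matches a learned extraction rule (a sketch of wrapper application)."""
    def __init__(self, rules):
        super().__init__()
        self.rules = rules                          # {attribute: path}
        self.path = []                              # stack of open tags
        self.results = {name: [] for name in rules}

    def handle_starttag(self, tag, attrs):
        self.path.append(tag)

    def handle_endtag(self, tag):
        if self.path and self.path[-1] == tag:
            self.path.pop()

    def handle_data(self, data):
        current = "/".join(self.path)
        for name, rule in self.rules.items():
            if current == rule and data.strip():
                self.results[name].append(data.strip())

# A learned rule for the title attribute (illustrative path).
rules = {"title": "html/body/h3/a"}
page = ('<html><body><h3><a href="#">Thinking in Java</a></h3>'
        '<h3><a href="#">Effective Java</a></h3></body></html>')

extractor = PathExtractor(rules)
extractor.feed(page)
print(extractor.results)   # {'title': ['Thinking in Java', 'Effective Java']}
```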


V. Experimental Results

We conducted experiments on 8 real-world Web sites collected from two domains, namely the book domain and the electronic appliances domain, to evaluate the performance of the framework. Table 1 lists the Web sites used in the experiments: B1, B2, B3, and B4 are from the book domain, and E1, E2, E3, and E4 are from the electronic appliances domain.

To compare results, we used the tool Automation Anywhere 6.6. Data was extracted from all the sites listed in Table 1 using both Automation Anywhere and our method. Extraction performance is evaluated by two commonly used metrics, namely precision and recall.

Precision is defined as the number of items the system correctly identified divided by the total number of items it extracted. Recall is defined as the number of items the system correctly identified divided by the total number of actual items. The results indicate that, after applying our full wrapper adaptation approach, a wrapper learned from a particular Web site can be adapted to other sites.
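These two metrics can be computed directly; the counts in the example are made-up illustrative numbers, not results from the paper:

```python
def precision_recall(correct, extracted, actual):
    """Precision = correct / items extracted;
    recall    = correct / actual items on the site."""
    precision = correct / extracted
    recall = correct / actual
    return precision, recall

# e.g. 90 correctly extracted items out of 100 extracted,
# with 120 actual items present on the site:
p, r = precision_recall(90, 100, 120)
print(p, r)   # 0.9 0.75
```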

Our wrapper adaptation approach achieves better performance than Automation Anywhere. Table 2 and Table 3 compare the results for the book domain and the electronic appliances domain, respectively. Table 4, Table 5, and Table 6 show the extraction performance for book titles, authors, and prices, respectively. The graph shows the precision and recall for both domains: P1 and P2 are the precisions of the data extracted by Automation Anywhere and by our approach, respectively, and R1 and R2 are the corresponding recalls.

VI. Conclusion

We have presented a system for adapting information extraction wrappers with new attribute discovery. Our approach can automatically adapt information extraction patterns to new, unseen sites and, at the same time, discover new attributes.

A DOM tree technique with path identification is employed in our framework to tackle the wrapper adaptation and new attribute discovery tasks. The DOM tree representation yields the useful text fragments related to the attributes of the records, and we then find the root-to-leaf paths of those attributes. Experiments were conducted on real-world Web sites in different domains, and the results demonstrate that the method achieves very promising performance.