Effective Travelling Guide Using Web Data Extraction



Today the World Wide Web (WWW) has become the main source of information and data. It holds an enormous amount of information in the form of Web pages, which makes it hard to find the required information efficiently and effectively. With the fast growth of the WWW, it has become clear that the normal keyword-based searching mechanism is inefficient. A user has to go through many Web pages, trying several different queries, in order to get the required knowledge. Commercial search engines are mainly concerned with response delay, which they try to keep under two seconds, and studies have shown that available search engines have low precision because of the unrelated Web pages in their search results [1], [2].

Apart from the problems of general searching mentioned above, there are a few specific scenarios in which Web search can be improved. First, if we need a summary of the available information on a subject in a particular area, instead of a bunch of hyperlinks, normal search engines cannot provide it. Second, if we need all the information on one subject that is related to another subject in a single search, without searching separately for each subject, normal search engines again cannot provide it. For example, when searching for travelling information it is not possible to get tourist attractions, accommodation, possible routes and transportation details in one search once the locations are given. In situations like these the user has to take on the work of manipulating the normal search engine results.

There are several reasons for these problems. One is the huge and still increasing size of the Web. Today the WWW has approximately 20 billion pages or more, and search engines are able to index only a fraction of them. Finding the most relevant information in this huge amount of data is a challenging task.

The second reason is that machines cannot understand the meaning of Web content, since most Web pages are not well structured or organized as databases are; that is, they are not in a machine-readable format. Moreover, the way pages are presented and structured differs significantly from one page to another. Therefore, automatic discovery and summarization of information is a challenging research problem.

The third reason is the lack of relationships among the data on the Web. Most of the data on the Web does not express any relationship to other data, even though Web pages are connected through links. Technologies such as RDF [3], which has recently become popular, do represent data and relationships in a machine-readable way, but there is still no significant amount of RDF documents on the Web.

In this paper we introduce a solution to the two specific scenarios mentioned above, using the travelling domain in Sri Lanka, which will in turn improve the efficiency and effectiveness of Web searching. Our proposed search system provides well-organized, precise and concise information on tourist attractions, accommodation (e.g. hotels, holiday resorts) and transportation (e.g. railways, timetables) on a single page in a single search once the starting location and destination are given, by extracting and summarizing the relevant information available on the Web on the fly. Here the search query is a string containing the starting location and the destination.

This system has several advantages. Unlike normal search engines, it gives a precise, concise and useful result, all in one place and in a single search. It therefore saves time, because the user does not need to go through many pages trying different queries. The system also presents only the required main information, without the advertisements, unrelated links and lotteries that we typically find in Web pages.

Much research has been carried out to solve similar problems using various methods. We can identify web data extraction as the main research area relevant to this project. Some of the research in this area, its strengths and drawbacks, and the way it relates to our project are discussed in Section 2.

2. Related Works

Web data extraction is a type of information extraction that automatically extracts structured information from unstructured or semi-structured Web data sources. The piece of software that extracts and delivers the required data is called a wrapper. Its importance is that, after extraction, the data can be handled like data from a database. There is much research in the area of web data extraction, or information extraction, using various methods such as machine learning [4], [5], natural language processing [6], [7], [8] and ontology design [9], [10], [11]. A few of these are discussed below.

Early approaches to information extraction used manual techniques for generating the wrapper [12], [13], [14]. For example, the TSIMMIS project, which provides tools to assist humans in integrating information from heterogeneous sources with structured or unstructured data, developed hard-coded wrappers for its sources [12], [15], [16]. Another example is Minerva [13], which introduces a formalism for writing wrappers that combines the qualities of a declarative, grammar-based approach with those of procedural programming. It does this by incorporating an explicit exception-handling mechanism inside a grammar parser. Exception-handling procedures in Minerva are written in a special language called Editor, and these exception handlers can deal with the irregularities found in Web data. WebOQL [14] is another system based on a declarative query language. It locates selected pieces of data in HTML by producing an abstract HTML syntax tree, called a hypertree, that represents the document. Using the language it is possible to navigate, query and restructure this hypertree and then output the data.

However, all of the methods mentioned above require code to be written manually. The user must inspect the document and find the HTML tags that delimit the data of interest in order to write the code, which is tiresome and error-prone. These manually created systems are also domain dependent and unable to cope with changes in the source pages, resulting in a high maintenance cost. Their languages do provide some features not available in general-purpose languages, which eases the task to some extent. Nevertheless, these manual methods are not suitable for our purpose: in our project we have to deal with Web pages with many different structures and formats, and manually coding a wrapper for each and every page is not feasible.

Since the manual techniques were not very effective, researchers developed semi-automatic and automatic web data extraction methods, which proved more beneficial. Most semi-automatic wrapper generators make use of support tools and demonstration-oriented interfaces that help design the wrapper by letting users show what information to extract, as in [4] and [17]. This approach does not require expert knowledge of wrapper coding, and it is also less error-prone than coding. However, for each new site and each changed site it must be demonstrated how the data should be extracted, as these systems cannot themselves induce the structure of a site.

For example, XWRAP [17] is a semi-automatic wrapper construction tool with a user interface through which the developer identifies source-specific metadata in a sample page in order to generate information extraction rules. In the first phase XWRAP fetches and cleans up the HTML page and generates a tree-like structure for it. The user then identifies regions and semantic tokens of interest, and XWRAP generates extraction rules and a wrapper based on them. After that the user can debug the wrapper by running the XWRAP automated testing module, in which the system is given alternative pages from the same Web source, checks for new extraction rules and updates the wrapper. The extraction rules are mainly based on DOM-tree path addressing. However, although XWRAP obtained 100% accuracy in its authors' evaluations with pages of slightly different structure, the wrapper has to be refined when the pages differ significantly from the example pages. This method is therefore not suitable for our problem, since it is not practical to generate wrappers from examples in our searching scenario.
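To make the idea of DOM-tree path addressing concrete, the following is a minimal sketch and not XWRAP's actual rule format. It assumes the page has already been cleaned into well-formed markup, and the sample page, the paths and the function name `extract_by_path` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page, assumed to be already cleaned into well-formed markup.
PAGE = """
<html><body>
  <table>
    <tr><td>Hotel Lanka</td><td>Kandy</td></tr>
    <tr><td>Sea View Inn</td><td>Galle</td></tr>
  </table>
</body></html>
"""

# A path-based extraction rule: where the records are, and where each
# field sits relative to a record node (positions are 1-based).
RECORD_PATH = "./body/table/tr"
FIELD_PATHS = {"name": "./td[1]", "city": "./td[2]"}


def extract_by_path(page: str) -> list[dict[str, str]]:
    """Apply the path rule to a page and return one dict per record."""
    root = ET.fromstring(page.strip())
    records = []
    for node in root.findall(RECORD_PATH):
        record = {}
        for field, path in FIELD_PATHS.items():
            cell = node.find(path)
            has_text = cell is not None and cell.text
            record[field] = cell.text.strip() if has_text else ""
        records.append(record)
    return records


print(extract_by_path(PAGE))
```

A rule of this kind keeps working only while the tree shape stays the same: if the table moves or a column is inserted, the paths silently point at the wrong cells, which is why such wrappers need refinement for pages that differ from the examples.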

Another semi-automatic web data extraction tool is STALKER [4], [5], which uses a machine learning approach to learn extraction rules from a small number of examples. It can adapt automatically to changes in the source Web pages and repair the wrapper automatically. It does so by verifying the extracted data: it learns patterns in the extracted data, and from the statistical distribution of those patterns it identifies significant changes in the extracted information that are due to changes in the source Web pages. It then automatically launches the wrapper repair process, in which the wrapper induction system is re-run on automatically re-labelled examples. In addition, using an active learning approach (co-testing), it can identify highly informative examples and present the user with examples that do not match the learned rules, so that it can learn highly accurate rules. Because it uses a hierarchical approach, this method can also extract data from Web pages with complicated formatting layouts, such as lists embedded in other lists.

However, the authors assume that the source Web pages have a high degree of regular structure, which may not always be true. They state that their method can generate extraction rules from a small number of examples, fewer than ten and in many cases only two. But this is because their source Web pages follow a fixed template with little variation. In our problem we need to extract data from Web pages that are very different in structure, so the user would have to provide and label examples for every structure. Co-testing reduces some of this manual work by automatically identifying the source Web pages that differ from the initial training set, but the user still has to label the identified pages.
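The verification step described above can be illustrated with a small sketch. It is not STALKER's actual algorithm: the coarse pattern signature, the use of total variation distance and the threshold value are all simplifying assumptions made here for illustration, and every function name is hypothetical.

```python
import re
from collections import Counter


def signature(value: str) -> str:
    """Collapse a value into a coarse pattern of letter and digit runs."""
    value = re.sub(r"[A-Z]+", "A", value)
    value = re.sub(r"[a-z]+", "a", value)
    return re.sub(r"\d+", "9", value)


def distribution(values: list[str]) -> dict[str, float]:
    """Relative frequency of each pattern signature in a set of values."""
    if not values:
        return {}
    counts = Counter(signature(v) for v in values)
    total = sum(counts.values())
    return {sig: n / total for sig, n in counts.items()}


def drift(train: dict[str, float], new: dict[str, float]) -> float:
    """Total variation distance between two pattern distributions (0 to 1)."""
    keys = set(train) | set(new)
    return 0.5 * sum(abs(train.get(k, 0.0) - new.get(k, 0.0)) for k in keys)


def needs_repair(trained: list[str], extracted: list[str],
                 threshold: float = 0.5) -> bool:
    """Flag the wrapper when the extracted data no longer looks like the training data."""
    return drift(distribution(trained), distribution(extracted)) > threshold


# Hypothetical price field: values seen in training vs. two later extractions.
trained = ["Rs. 4,500", "Rs. 12,000", "Rs. 900"]
print(needs_repair(trained, ["Rs. 7,250", "Rs. 800"]))
print(needs_repair(trained, ["Book now", "Contact us"]))
```

When the second call flags a change, a system of this kind would start the repair process; here the sketch only reports that a repair is needed.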

In contrast to the methods mentioned above, RoadRunner [18] is a completely automatic web data extraction system. RoadRunner generates the wrapper without any prior knowledge of, or user-specified examples about, the organization of the source pages, and without any human intervention. It generates the wrapper by comparing HTML pages of the same class, identifying similarities and mismatches, and generating regular expressions. RoadRunner treats the site generation process as an encoding of database content into strings of HTML code, so data extraction is treated as a decoding process. The authors therefore state that generating a wrapper for a set of HTML pages corresponds to inferring a grammar for the HTML code. When RoadRunner is run on a collection of HTML pages, it takes any page as the initial wrapper and applies the matching algorithm iteratively to generate a common wrapper for the pages. Where the pages cannot all be described by a single wrapper, the algorithm produces more than one wrapper. However, according to the authors' results, the system failed to extract any data from Web pages that form a non-regular language and from Web pages that have no repeating patterns in their structure.
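The core idea of inferring a wrapper by comparing pages can be sketched as follows. This is a deliberately simplified stand-in, not RoadRunner's matching algorithm: it uses Python's `difflib.SequenceMatcher` to align two token sequences, treats every mismatch as a data field, and ignores the optional and repeated patterns that the real system handles. The sample pages and function names are hypothetical.

```python
import re
from difflib import SequenceMatcher

TOKEN = re.compile(r"<[^>]+>|[^<]+")


def tokenize(html: str) -> list[str]:
    """Split a page into tag tokens and text tokens, dropping blank ones."""
    return [t.strip() for t in TOKEN.findall(html) if t.strip()]


def infer_wrapper(page_a: str, page_b: str) -> str:
    """Build a regular expression from two pages of the same class.

    Tokens shared by both pages become literals; any mismatch becomes one
    non-empty capture group. Optional or repeated parts are not handled.
    """
    a, b = tokenize(page_a), tokenize(page_b)
    parts = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for op, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if op == "equal":
            parts.extend(re.escape(tok) for tok in a[i1:i2])
        else:
            parts.append("(.+?)")
    return r"\s*".join(parts)


# Two hypothetical pages generated from the same template.
PAGE_A = "<html><b>Name:</b>Hotel Lanka<br><b>City:</b>Kandy</html>"
PAGE_B = "<html><b>Name:</b>Sea View Inn<br><b>City:</b>Galle</html>"
WRAPPER = infer_wrapper(PAGE_A, PAGE_B)

# The inferred wrapper is then applied to a third, unseen page.
PAGE_C = "<html><b>Name:</b>Hill Rest<br><b>City:</b>Ella</html>"
print(re.search(WRAPPER, PAGE_C).groups())
```

The sketch also shows the limitation noted above: the wrapper only works for pages that share the template of the pages it was inferred from.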

All of the methods mentioned above rely on the structure of the data presentation within the Web page to generate rules and patterns for extraction. Those wrappers therefore fail if the formatting features of the source pages change, or if the source pages come from many different Web sources with different formatting and templates. In the problem we address in this project, we have to deal with various types of Web sources whose structure and formatting differ and are unknown at the beginning of the searching process. Providing examples, labelling them, and creating and repairing wrappers for each and every different source page is therefore not possible, so the approaches mentioned above are not suitable for solving our problem.

However, data extraction can also rely directly on the data. Given a particular domain, an ontology can be used to locate constants present in the page and to construct objects from them. Ontology-based wrappers are therefore flexible, adapt to changes in the source pages, and can handle many different sources belonging to the same application domain. The main drawbacks of ontology-based approaches are that the ontology must be constructed carefully and manually by an expert in order to work properly, and that the application becomes domain dependent. But if the ontology is representative enough, the extraction is fully automated regardless of whether the page is structured or unstructured [9], [10].

For example, [9] and [10] present an ontology-based data extractor in which wrapper generation is fully automated. In that system the application ontology is an independent input. The authors therefore state that, when changing domains, only the ontology needs to be changed and all the other components remain the same. Their ontology provides the relationships among the objects of interest, the cardinality constraints for these relationships, a description of the possible strings that can populate the various sets of objects, and possible context keywords expected to help match values with object sets. After creating the ontology they parse it to generate a database scheme and to generate rules for matching constants and keywords. The matching rules generated by the parser are then used to extract the objects and relationships from the pages. Finally, they populate the generated database scheme using heuristics that link extracted keywords with extracted constants, together with the cardinality constraints in the ontology. Once the data is extracted, one can query the structure using a standard database query language.
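The matching of constants and context keywords can be sketched in a few lines. This is only an illustration of the idea, not the system of [9] and [10]: the miniature ontology, the regular expressions, the character window used as a linking heuristic and the function name are all hypothetical.

```python
import re

# Hypothetical miniature ontology for the accommodation domain: each object
# set has a pattern for its constants and context keywords expected nearby.
ONTOLOGY = {
    "price": {
        "constant": r"(?:Rs\.?|LKR)\s?\d[\d,]*",
        "keywords": ["rate", "price", "per night"],
    },
    "phone": {
        "constant": r"\+?\d{2,3}[- ]\d{2,3}[- ]\d{4,7}",
        "keywords": ["tel", "phone", "call"],
    },
}

WINDOW = 40  # assumed heuristic: characters around a constant to search for keywords


def extract_with_ontology(text: str) -> dict[str, list[str]]:
    """Keep a constant only if one of its context keywords appears near it."""
    found = {name: [] for name in ONTOLOGY}
    lowered = text.lower()
    for name, spec in ONTOLOGY.items():
        for match in re.finditer(spec["constant"], text):
            start = max(0, match.start() - WINDOW)
            context = lowered[start:match.end() + WINDOW]
            if any(keyword in context for keyword in spec["keywords"]):
                found[name].append(match.group())
    return found


TEXT = "Sea View Inn, Galle. Rates from Rs. 4,500 per night. Tel: 091 222 3456."
print(extract_with_ontology(TEXT))
```

Nothing in the sketch depends on HTML tags or page layout, which is the property that makes this family of methods attractive when the source pages differ widely in structure. Changing the domain means replacing only the `ONTOLOGY` dictionary.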

Another ontology-based approach to data extraction is introduced in [11], where an ontology is used to model the data to be extracted. The data in a Web page is first converted into XML and then mapped to the data model. The definition of the data model and the mapping are done manually, after which the extraction task is carried out automatically. The final result is an XML document that contains a standardized data set. The main weakness of this system is that it is applicable only to Web sources that have a fixed structure. This method is therefore not suitable for our problem, since we have to deal with various types of Web pages.
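The mapping step of such an approach might look like the following sketch. It is not the system of [11]: the converted source XML, the manually written mapping and the element names of the data model are hypothetical, chosen only to show how a fixed mapping produces a standardized XML record.

```python
import xml.etree.ElementTree as ET

# Hypothetical result of converting one Web page into XML.
SOURCE_XML = (
    '<page><h1>Sea View Inn</h1>'
    '<div class="loc">Galle</div><span>Rs. 4,500</span></page>'
)

# Manually defined mapping: data-model field -> path in the converted XML.
MAPPING = {"name": "./h1", "city": "./div[@class='loc']", "price": "./span"}


def standardize(source_xml: str) -> str:
    """Map the converted page onto the data model and return standardized XML."""
    source = ET.fromstring(source_xml)
    record = ET.Element("accommodation")
    for field, path in MAPPING.items():
        node = source.find(path)
        has_text = node is not None and node.text
        ET.SubElement(record, field).text = node.text.strip() if has_text else ""
    return ET.tostring(record, encoding="unicode")


print(standardize(SOURCE_XML))
```

Because the mapping is tied to fixed paths in the source, a page with a different layout yields empty fields, which reflects the weakness noted above.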

Another approach to Web data extraction is natural language processing. RAPIER [6], SRV [7] and WHISK [8] are some of the systems in the literature that use natural language processing. RAPIER uses both delimiters and content descriptions, exploiting syntactic and semantic information for the extraction. A part-of-speech tagger is used to obtain the syntactic information, while a lexicon of semantic classes is used to obtain the semantic information. The tagger takes sentences as input and labels each word as a noun, verb, adjective and so on [6].

However, natural language processing techniques are not well suited to web data extraction, as Web sources often do not have a rich grammatical structure. In addition, natural language processing techniques tend to be slow. These techniques are therefore not suitable for our problem, since the number of documents on the Web is large and in our problem the extraction is expected to be performed on the fly.

Considering the methods discussed above, we conclude that an ontology-based method is suitable for solving our problem, since such methods are independent of the structure and formatting of the source pages. Even though these methods are domain dependent and require the ontology to be built manually, they are best suited to working with distinct Web sources.