Data Extraction For Multiple Web Data Computer Science Essay

A large amount of information for almost every application is now available on the web, and it grows with each passing day. There is no dearth of information for the end user; the difficult task is finding the information that is relevant. A considerable body of research has addressed the extraction and alignment of web information using a variety of methods. This work focuses on combining appropriate techniques to improve relevance and user friendliness. The end result should help any researcher pick out suitable, properly aligned information pertaining to a given area of research.


In the past few years, several works in the literature have addressed the problem of data extraction from Web pages. The significance of this problem derives from the fact that, once extracted, the data can be handled in a way similar to instances of other data repositories. The approaches described in the literature use techniques borrowed from areas such as natural language processing, languages and grammars, machine learning, information retrieval, databases, and ontologies. As a consequence, they present very distinct features and capabilities, which makes a direct comparison difficult. In this paper, we propose a taxonomy for characterizing Web data extraction techniques, briefly survey major Web data extraction tools described in the literature, and provide a qualitative analysis of them. Hopefully, this work will stimulate other studies aimed at a more comprehensive analysis of data extraction approaches and tools for Web data.


1. Mining Web Informative Structures and Contents Based on Entropy Analysis

Hung-Yu Kao and Shian-Hua Lin address the mining of information from news websites, since news is an important category of web content. The concept they describe is an information discoverer algorithm, that is, a way to extract data from a data source. A data source is a repository of data that contains the files relating to a particular body of information. The information discoverer algorithm is used to eliminate redundancy among text contents, which the earlier HITS algorithm handles poorly; HITS mainly calculates the hub scores of the given web pages. LAMIS uses an eigenvector calculation to obtain the required result. The common features are identified by intersecting the desired features with the discovered features. Furthermore, it is an automated method rather than one based on hand-written code.
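The hub calculation attributed to HITS above can be illustrated with a small sketch. This is plain HITS by power iteration (which converges to the principal eigenvector), not the LAMIS or information discoverer method itself; the function name `hits` and the toy link graph are assumptions made purely for illustration.

```python
from math import sqrt

def hits(links, iterations=50):
    """Compute hub and authority scores by power iteration.

    links maps each page to the list of pages it points to.
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of a page: sum of the hub scores of pages linking to it.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub score of a page: sum of the authority scores of pages it links to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalise both vectors so the scores stay bounded.
        for scores in (auth, hub):
            norm = sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical news site: an index page linking to three articles.
toy_links = {
    "index": ["news1", "news2", "news3"],
    "news1": ["index"],
    "news2": ["index"],
    "news3": [],
}
hub, auth = hits(toy_links)
print("best hub:", max(hub, key=hub.get))
```

On this toy graph the index page ends up as the strongest hub, which matches the intuition that a table-of-contents page mostly points to content rather than carrying it.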

2. QA-Pagelet: Data Preparation Techniques for Large-Scale Data Analysis of the Deep Web

James Caverlee and Ling Liu identify four main steps in collecting deep web pages. The first step is to collect the web pages with relevant information, and the second is to segment the information that has been collected. The third step is to identify the data based on a page ranking algorithm, and the fourth is the separation and localisation of the data used.

The highest-ranked pages are identified using the TF-IDF similarity calculation, which provides better precision and recall. The TF-IDF approach uses k-means clustering to identify the top-ranked list. When a query is submitted, it is searched in the corresponding web databases and the returned information is cleaned; tag extraction and weighting are then carried out. The top-ranked clusters are passed to single-page filtering and cross-page filtering, followed by query-answer pagelet (QA-Pagelet) selection. The selected pagelets are then passed on to deep web analysis tasks such as data mining and indexing.
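The similarity calculation mentioned above, read here as TF-IDF with cosine similarity, can be sketched as follows. This is a generic textbook formulation under stated assumptions, not the exact weighting used by Caverlee and Liu; the helper names and the three toy pages are hypothetical, and a clustering step such as k-means would consume these vectors afterwards.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight) per document."""
    tokenised = [doc.lower().split() for doc in docs]
    n = len(tokenised)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        vectors.append(
            {t: (c / len(tokens)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical pages: two query-result pages and one "no match" page.
pages = [
    "price title author price",
    "title author price isbn",
    "error no results found",
]
vecs = tfidf_vectors(pages)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

The two result pages share weighted terms and so score above zero, while the "no match" page shares none and scores zero, which is the kind of separation a clustering step can then exploit.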

3. STAVIES: A System for Information Extraction from Unknown Web Data Sources through Automatic Web Wrapper Generation Using Clustering Techniques

This work is based on a robotic (fully automatic) approach to the extraction of online data. The format of the data is first understood and analysed, avoiding reliance on raw tags as well as on semi-automated approaches. The web pages are provided as input and the structural tokens are obtained. The preparation phase mainly deals with converting the HyperText Markup Language (HTML) into a tree structure so that the terminal nodes can be found effectively. Information extraction is then carried out: in the segmentation phase, boundary selection and node comparison take place. The result is then evaluated to deliver the particular tokens to the end user. The STAVIES method is compared with the mining data record concept, one of the first approaches to extracting data from web pages, for cases where the tokens cannot be extracted.
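The preparation phase described above, turning HTML into a tree and locating its terminal nodes, can be sketched with Python's standard `html.parser` module. This is only an illustration of that one step, not the STAVIES system; the class name `LeafCollector` and the sample page are assumptions.

```python
from html.parser import HTMLParser

class LeafCollector(HTMLParser):
    """Track the open-tag path and record each text leaf with its path."""

    def __init__(self):
        super().__init__()
        self.path = []    # stack of currently open tags
        self.leaves = []  # (tag path, text) pairs for terminal nodes

    def handle_starttag(self, tag, attrs):
        self.path.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed inner tags.
        if tag in self.path:
            while self.path.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.leaves.append(("/".join(self.path), text))

# Hypothetical input page.
page = "<html><body><ul><li>Item A</li><li>Item B</li></ul></body></html>"
collector = LeafCollector()
collector.feed(page)
for path, text in collector.leaves:
    print(path, "->", text)
```

Terminal nodes that share the same tag path, such as the two list items here, are natural candidates for the later node comparison step.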

4. Structured Data Extraction from the Web Based on Partial Tree Alignment

Yanhong Zhai and Bing Liu use the Document Object Model (DOM) together with visual information and tree matching techniques. The nested tag structures are used to build the DOM tree, relying on the parent and sibling relationships between nodes. The DEPTA and RoadRunner approaches are used here. DEPTA can miss data when close neighbours are adjacent to it, and it also requires all the nodes to contain data for easier extraction. The RoadRunner technique parses the data to identify any mismatches in the strings, and uses recursion to find the matching strings. The wrappers that are generated focus mainly on the positive results rather than the negative ones, and the user should have some prior knowledge of the extraction techniques.
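As an illustration of the tree matching idea mentioned above, the following is a minimal sketch of the classic Simple Tree Matching algorithm, which counts the largest set of matching node pairs between two ordered trees. It is offered as an assumption about the kind of matching involved, not as the authors' partial tree alignment procedure; the `(tag, children)` tuple representation and the sample records are hypothetical.

```python
def simple_tree_matching(a, b):
    """Return the size of the maximum matching between two ordered trees.

    Each tree is a (tag, children) tuple, where children is a list of trees.
    """
    tag_a, kids_a = a
    tag_b, kids_b = b
    if tag_a != tag_b:
        return 0  # different roots cannot be matched at all
    m, n = len(kids_a), len(kids_b)
    # dp[i][j]: best matching using the first i children of a and first j of b.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = max(
                dp[i - 1][j],
                dp[i][j - 1],
                dp[i - 1][j - 1] + simple_tree_matching(kids_a[i - 1], kids_b[j - 1]),
            )
    return dp[m][n] + 1  # +1 for the matched roots

# Two hypothetical table rows; the second lacks the last cell.
record_1 = ("tr", [("td", []), ("td", [("a", [])]), ("td", [])])
record_2 = ("tr", [("td", []), ("td", [("a", [])])])
print(simple_tree_matching(record_1, record_2))
```

A high score relative to the tree sizes suggests that the two subtrees are instances of the same record template, even when one of them has an optional field missing.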

5. ViDE: A Vision-Based Approach for Deep Web Data Extraction

Wei Liu, Xiaofeng Meng and Weiyi Meng try to reduce the human intervention needed when searching websites on the web. A new measure, revision, is used to capture the amount of human effort required to perform the extraction. Several kinds of information can be used for extraction, such as visual, content, layout and appearance features. The noise blocks that are present are filtered out, and the remaining blocks are arranged according to their visual similarity. The major limitation of this scheme is that it can process only one data region at a time. It is also based on a particular application programming interface (API); in general, a new set of APIs would have to be developed.

6. Incremental Information Extraction Using Relational Databases

Luis Tari, Phan Huy Tu, Jörg Hakenberg, Yi Chen, Tran Cao Son, Graciela Gonzalez and Chitta Baral note that the extraction goals are specified by the user. When a new extraction goal is identified, extraction would otherwise have to be redone from the beginning. The extraction is expressed in the parse tree query language (PTQL).

The five phases include splitting the text into sentences in order, tokenising each sentence, partial parsing based on grammatical structures, and pattern matching based on relations. A PTQL query usually contains four components: the tree pattern, the link condition, the proximity condition and the return expression. The link condition contains predicate expressions. Query evaluation is done using a relational database management system (RDBMS). Here the sentences are split according to the nouns and adverbs present in them.
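The idea of storing processed text once in a relational database and expressing each new extraction goal as a query can be sketched with SQLite. This is not PTQL and not the authors' schema: the `token` table, the helper names and the sample text are assumptions, and the query only imitates a simple proximity condition in plain SQL.

```python
import re
import sqlite3

def load_sentences(conn, text):
    """Split text into sentences and tokens and store them once."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS token (sent INTEGER, pos INTEGER, word TEXT)"
    )
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    for sent_id, sentence in enumerate(sentences):
        for pos, word in enumerate(re.findall(r"\w+", sentence.lower())):
            conn.execute("INSERT INTO token VALUES (?, ?, ?)", (sent_id, pos, word))

def cooccurring(conn, left, right, max_gap):
    """Find sentences where `right` follows `left` within max_gap tokens."""
    rows = conn.execute(
        "SELECT a.sent, a.pos, b.pos FROM token a "
        "JOIN token b ON a.sent = b.sent "
        "WHERE a.word = ? AND b.word = ? "
        "AND b.pos > a.pos AND b.pos - a.pos <= ? "
        "ORDER BY a.sent, a.pos, b.pos",
        (left, right, max_gap),
    )
    return rows.fetchall()

# Hypothetical text; GeneA and GeneB are placeholder names.
conn = sqlite3.connect(":memory:")
load_sentences(
    conn,
    "GeneA activates GeneB. "
    "GeneA is mentioned long before anyone gets around to naming GeneB.",
)
print(cooccurring(conn, "genea", "geneb", 3))
```

Because the tokens are stored only once, changing the extraction goal means issuing a different query rather than reprocessing the text, which is the incremental property the section describes.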

7. Combining Tag and Value Similarity for Data Extraction and Alignment

Weifeng Su, Jiying Wang, Frederick H. Lochovsky and Yi Liu propose an approach that works only when there are at least two query result records. The result web page generated for the user query is provided as input to tag tree construction. The resulting tree enters data region identification, where it is split into data regions. These pass through record segmentation, where the individual records are identified, and the corresponding data regions are then merged. Query result section identification follows, and finally the query result records are identified.


The present work assumes that all user local instance repositories have content-based descriptors referring to the subjects; however, a large volume of the documents existing on the web may not have such content-based descriptors. For this problem, strategies such as ontology mapping and text classification or clustering have been suggested. These strategies will be investigated in future work. That investigation will extend the applicability of the ontology model to the majority of existing web documents and increase the contribution and significance of the present work.