Invisible Web Intelligent Integration Computer Science Essay

For the purpose of this study, the following features are considered essential to Invisible Web intelligent integration and information retrieval. A huge number of documents in the hidden Web, including pages hidden behind search forms, specialized databases, and dynamically generated Web pages, are not accessible to general-purpose Web mining applications. Researchers have carried out a great deal of work in these fields, and this part of the study briefly reviews some of it, drawing on books, papers, articles, and journals. We gratefully acknowledge the researchers whose work is referred to in this research. Some of them are mentioned in this article; the rest are listed in the reference section of this document. We study the paper "Research on Knowledge-base and its Constructing for the Invisible Web Information Processing" by Wenhong Guo.

We have gone through many IEEE papers and journals for the purpose of our study, and we are thankful to their authors. Only a few can be mentioned here in the literature survey, but all of them have inspired this direction of research. A powerful web data and link extractor utility extracts URLs, meta tags (title, description, and keywords), body text, email addresses, and phone and fax numbers from web sites, search results, or lists of URLs [Baum, 2008]. It offers high-speed, multi-threaded, accurate extraction and saves data directly to a disk file. The program has numerous filters to restrict a session, such as URL, date modified, and file size. It allows user-selectable recursion levels, retrieval threads, timeout, proxy support, and many other options [Baum, 2008]. The Invisible Web (or Hidden Web) comprises all information that resides in autonomous databases behind portals and information providers' web front-ends. Web pages in the Invisible Web are dynamically generated in response to a query through a web site's search form and often contain rich content [Jussi, 2003]. A recent study has estimated the size of the Invisible Web to be more than 500 billion pages, whereas the size of the "crawlable" web is only 1% of the Invisible Web. The problem of automatically discovering and interacting with Deep Web search forms is examined in [16]. Other researchers [10] have proposed methods to categorize Deep Web sites into searchable hierarchies. More recently, there have been efforts to improve source selection for metasearching Deep Web sources [3] and to match Deep Web query interfaces [25], [23], [22].

As a first step toward supporting Deep Web data analysis, several hand-tuned directories have been built (e.g., www.invisibleweb.com), and several domain-specific information integration portal services have emerged, such as the NCBI bioportal. These integration services offer uniform access to heterogeneous Deep Web collections using wrappers, that is, programs that encode semantic knowledge about specific content structures of Web sites and use such knowledge to aid information extraction from those sites. The current generation of wrappers consists mainly of semiautomatic wrapper generation systems (e.g., [1], [12]) that encode programmers' understanding of the specific content structure of a set of pages to guide the data extraction process. As a result, most data analysis services based on wrappers have serious limitations in their breadth and depth of coverage. Most closely related to Thor are a number of data extraction approaches that emphasize identifying and extracting certain portions of Web data. For example, the WHIRL system [8] uses tag patterns and textual similarity of items stored in a deductive database to extract simple lists or lists of hyperlinks. The system relies on previously acquired information in its database to recognize data in target pages. For data extraction across heterogeneous collections of Deep Web databases, this approach is infeasible. Similarly, Arasu and Garcia-Molina [2] have developed an extraction algorithm that models page templates and uses equivalence classes for data segmentation. In both cases, there is an assumption that all pages have been generated by the same underlying template, whereas Thor automatically partitions a set of diverse pages into control-flow dependent groupings. Thor builds on previous work in a related publication [8]. Bar-Yossef and Rajagopalan [4] call the functionally distinct portions of a page pagelets. We use this formulation to guide template discovery, which is ancillary to the data extraction problem.

Zhang Z., He B., Chang K.C. [29] propose a method based on grammar analysis to extract the query interface schema. The method hypothesizes the existence of a hidden syntax for query interfaces, which enables principled solutions for declaratively representing common patterns by a derived grammar.

He H, Meng WY, Lu YY et al [30] propose a schema model for representing complex search interfaces and then present a layout-expression-based approach to automatically extract the logical attributes from search interfaces.

Yoo Jung An, James Geller, Yi-Ta Wu, Soon Ae Chun [31] introduce the semantic Deep Web, utilizing an ontology to determine the relevance of query interface attributes for accessing the Deep Web. The ontology enriches the candidate query attributes by providing synonyms and by supporting the attributes used by designers and users.

Hexiang Xu et al [32]: The classification of deep Web sources is an important area in large-scale deep Web integration, which is still at an early stage. They present a deep Web model and a machine-learning-based classification model. The experimental results show good performance with a small number of training samples for each domain, and as the number of training samples increases, the performance remains stable.

Liu Jing et al [33] proposed a deep Web crawling approach based on an ordinal regression model. They divide pages into three levels and treat the feedback of the page classifier as an ordinal regression problem. They also take link delay into account; the related links are limited to three layers or fewer. The experimental results demonstrate that the feedback-based crawling strategy can effectively improve crawling speed and accuracy.

Guangyue Xu et al [34]: Online databases maintain a collection of structured domain-specific documents that are dynamically generated in response to users' queries instead of being accessed by static URLs. Categorizing deep web sources according to their object domains is a critical step in integrating such sources. While existing methods focus on supervised or post-query methodologies, they propose a more practical pre-query algorithm operating in an unsupervised manner. Given the number of domains, their two-phase approach first investigates the hidden domain distribution for each query form using topic models, so that each query form's object domain can be identified preliminarily. In this phase, they construct their training set from the query forms deemed to have already been categorized correctly and also select the deep web sources that need to be reclassified. In the second phase, they train a classifier with String Kernel methods to reclassify the uncertain deep web sources and improve the overall performance. The advantage of their algorithm over previous ones is that it captures the semantic structure of each query form. Based on the two-phase architecture, their framework works in an unsupervised manner and achieves satisfactory results. Experiments on the TEL-8 dataset from the UIUC Web integration repository show the effectiveness and efficiency of their algorithm.

Ritu Khare et al [35]: This paper presents a survey on the major approaches to search interface understanding. The Deep Web consists of data that exist on the Web but are inaccessible via text search engines. The traditional way to access these data, i.e., by manually filling-up HTML forms on search interfaces, is not scalable given the growing size of Deep Web. Automatic access to these data requires an automatic understanding of search interfaces. While it is easy for a human to perceive an interface, machine processing of an interface is challenging. During the last decade, several works addressed the automatic interface understanding problem while employing a variety of understanding strategies. This paper presents a survey conducted on the key works. This is the first survey in the field of search interface understanding. Through an exhaustive analysis, they organized the works on a 2-D graph based on the underlying database information extracted and based on the technique employed.

Tim Furche et al [36]: Forms are our gates to the web. They enable us to access the deep content of web sites. Automatic form understanding unlocks this content for applications ranging from crawlers to meta-search engines and is essential for improving usability and accessibility of the web. Form understanding has received surprisingly little attention other than as a component in specific applications such as crawlers. No comprehensive approach to form understanding exists, and previous works disagree even on the definition of the problem. In this paper, they present OPAL, the first comprehensive approach to form understanding. They identify form labeling and form interpretation as the two main tasks involved in form understanding. On both problems OPAL pushes the state of the art. For form labeling, it combines signals from the text, structure, and visual rendering of a web page, yielding robust characterisations of common design patterns. In extensive experiments on the ICQ and TEL-8 benchmarks and a set of 200 modern web forms, OPAL outperforms previous approaches by a significant margin. For form interpretation, they introduce a template language to describe frequent form patterns. These two parts of OPAL combined yield form understanding with near perfect accuracy (> 98%).

However, experimental results show that some of the methods above have low extraction accuracy, while others have high schema extraction complexity. They do not meet practical standards and are therefore not well suited to extracting the schema of a query interface automatically. How to extract meaningful information from query interfaces and merge it into semantic attributes therefore plays an important role in interface integration in the deep web.

Umara Noor et al [37]: As part of the vision for the next generation of the web, deep web technologies have gained greater attention in the last few years. A prominent feature of the next generation of the web is the automation of tasks. A large part of the Deep Web comprises online structured domain-specific databases that are accessed using web query interfaces. The information contained in these databases is related to a particular domain. This highly relevant information is more suitable for satisfying the information needs of users and for large-scale deep web integration. To make this extraction and integration process easier, it is necessary to classify the deep web databases into standard/non-standard category domains. There are two main types of classification techniques, manual and automatic. As the size of the deep web is increasing at an exponential rate over time, it has become nearly impossible to classify these deep web search sources manually into their respective domains. For this purpose, several automatic deep web classification techniques have been proposed in the literature. In this paper, they propose a framework for the analysis of automatic classification techniques for the deep web. The framework provides a baseline for analysing the rudiments of automatic classification techniques based on parameters such as structured or unstructured, simple/advanced query forms, content representative extraction methodology, level of classification, and performance evaluation criteria and results.

Jayant Madhavan et al [38]: The Deep Web, i.e., content hidden behind HTML forms, has long been acknowledged as a significant gap in search engine coverage. Since it represents a large portion of the structured data on the Web, accessing Deep-Web content has been a long-standing challenge for the database community. This paper describes a system for surfacing Deep-Web content, i.e., pre-computing submissions for each HTML form and adding the resulting HTML pages into a search engine index. The results of their surfacing have been incorporated into the Google search engine and today drive more than a thousand queries per second to Deep-Web content. Surfacing the Deep Web poses several challenges. First, their goal is to index the content behind many millions of HTML forms that span many languages and hundreds of domains. This necessitates an approach that is completely automatic, highly scalable, and very efficient. Second, a large number of forms have text inputs and require valid input values to be submitted. They present an algorithm for selecting input values for text search inputs that accept keywords and an algorithm for identifying inputs which accept only values of a specific type. Third, HTML forms often have more than one input, and hence a naive strategy of enumerating the entire Cartesian product of all possible inputs can result in a very large number of URLs being generated. They present an algorithm that efficiently navigates the search space of possible input combinations to identify only those that generate URLs suitable for inclusion in their web search index. They present an extensive experimental evaluation validating the effectiveness of their algorithms.
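
To illustrate why the naive strategy mentioned above scales poorly, the following is a minimal sketch of full Cartesian-product enumeration over a form's inputs. It is not the algorithm of [38]; the form, its input names, the candidate values, and the base URL are all hypothetical and chosen only for illustration.

```python
from itertools import product
from math import prod
from urllib.parse import urlencode

# Hypothetical form: each input name maps to its candidate values.
form_inputs = {
    "make": ["ford", "honda", "toyota"],
    "year": ["2008", "2009"],
    "state": ["ca", "ny", "tx", "wa"],
}


def count_submissions(inputs):
    """Number of URLs that a naive full enumeration would generate."""
    return prod(len(values) for values in inputs.values())


def enumerate_urls(base_url, inputs):
    """Yield one GET URL per combination of input values (naive strategy)."""
    names = list(inputs)
    for combo in product(*(inputs[name] for name in names)):
        yield base_url + "?" + urlencode(dict(zip(names, combo)))


# Hypothetical base URL; three small inputs already give 3 * 2 * 4 URLs.
urls = list(enumerate_urls("http://example.com/search", form_inputs))
```

Because the count is the product of the number of values per input, each additional input multiplies the number of generated URLs, which is why a selective strategy is needed at the scale of millions of forms.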

Jmuntunemuine et al [39]: With the rapid development of the World Wide Web, there are more and more Web databases available for users to access. This rapid development of the World Wide Web has dramatically changed the way in which information is managed and accessed, and information on the Web now covers all aspects of human activity. The Web can be divided into the Surface Web and the Deep Web. The Surface Web refers to Web pages that are static and linked to other pages, while the Deep Web refers to Web pages created dynamically as the result of a specific search. This research focuses on querying the Deep Web, that is, the databases accessible through query interfaces on the World Wide Web. A Deep Web query system presents users with a single interface for querying multiple Web databases in a domain such as airline booking, extracts the relevant information from the different Web database sources, and then returns the results to users.

Luciano Barbosa, Juliana Freire et al: In this paper they address the problem of organizing hidden-Web databases. Given a heterogeneous set of Web forms that serve as entry points to hidden-Web databases, their goal is to cluster the forms according to the database domains to which they belong. They propose a new clustering approach that models Web forms as a set of hyperlinked objects and considers visible information in the form context, both within and in the neighborhood of forms, as the basis for similarity comparison. Since the clustering is performed over features that can be automatically extracted, the process is scalable. In addition, because it uses a rich set of metadata, their approach is able to handle a wide range of forms, including content-rich forms that contain multiple attributes, as well as simple keyword-based search interfaces. An experimental evaluation over real Web data shows that their strategy generates high-quality clusters, measured both in terms of entropy and F-measure. This indicates that their approach provides an effective and general solution to the problem of organizing hidden-Web databases.
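
Entropy as a cluster-quality measure can be sketched as follows. This is a generic illustration of the metric under our own assumptions, not the authors' evaluation code: each cluster is represented as a list of the true domain labels of its forms, and the labels and clusters shown are hypothetical.

```python
import math
from collections import Counter


def cluster_entropy(labels):
    """Entropy (in bits) of the true-domain labels inside one cluster.

    0.0 means the cluster is pure; higher values mean more mixing.
    """
    total = len(labels)
    return sum(
        (count / total) * math.log2(total / count)
        for count in Counter(labels).values()
    )


def weighted_entropy(clusters):
    """Overall entropy: per-cluster entropy weighted by cluster size."""
    n = sum(len(cluster) for cluster in clusters)
    return sum(len(cluster) / n * cluster_entropy(cluster) for cluster in clusters)


# Hypothetical clustering of six forms into two clusters.
clusters = [
    ["airfare", "airfare", "airfare", "hotel"],  # one misplaced form
    ["book", "book"],                            # pure cluster
]
overall = weighted_entropy(clusters)
```

Lower overall entropy indicates that each cluster is dominated by forms from a single domain.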

Carlos R. Rivero, Rafael Z. Frantz et al: The actual value of the Deep Web comes from integrating the data its applications provide. Such applications offer human-oriented search forms as their entry points, and there exist a number of tools that are used to fill them in and retrieve the resulting pages programmatically. Solutions that rely on these tools are usually costly, which motivated a number of researchers to work on virtual integration, also known as metasearch. Virtual integration abstracts away from actual search forms by providing a unified search form, i.e., a programmer fills it in and the virtual integration system translates it into the application search forms. They argue that virtual integration costs might be reduced further if another abstraction level is provided by issuing structured queries in high-level languages such as SQL, XQuery or SPARQL; this helps abstract away from search forms. As far as they know, no existing proposal addresses this problem. In this paper, they propose a reference framework called IntegraWeb to solve the problems of using high-level structured queries to perform deep-web data integration. Furthermore, they provide a comprehensive report on existing proposals from the database integration and Deep Web research fields, which can be used in combination to address their problem within the proposed reference framework.

Wanli ZUO, Ying WANG et al [41]: The Deep Web contains a significant amount of information; to make full use of this information, it needs to be organized by domain. Therefore, it is imperative that Deep Web databases be classified by domain automatically. In this paper, a new Deep Web database classification framework is proposed, which adds semantic information to the feature vectors and the centroid vector by extracting the synsets of terms from WordNet and replacing the terms with the corresponding synsets, thereby reducing the dimensionality of the vectors. Lastly, the semantic feature vectors are highlighted by the semantic centroid vector, and the highlighted semantic feature vectors are classified by a classification algorithm. Experiments show that experiment 3, which combines experiment 1 and experiment 2, can effectively improve the classification accuracy of Deep Web databases.

Ling Song, Dongmei Zhang et al [42]: Deep Web database clustering is a key operation in organizing Deep Web resources. Traditionally, cosine similarity in the Vector Space Model (VSM) is used as the similarity computation. However, it cannot capture the semantic similarity between the contents of two databases. This paper discusses how to cluster Deep Web databases semantically. First, a fuzzy semantic measure, which integrates ontology and fuzzy set theory to compute the semantic similarity between the visible features of two Deep Web forms, is proposed, and then a hybrid Particle Swarm Optimization (PSO) algorithm is provided for Deep Web database clustering. Finally, the clustering results are evaluated according to Average Similarity of Document to the Cluster Centroid (ASDC) and Rand Index (RI). Experiments show that: 1) the hybrid PSO approach has higher ASDC values than those based on the PSO and K-Means approaches, which means the hybrid PSO approach has higher intra-cluster similarity and lower inter-cluster similarity; 2) the clustering results based on fuzzy semantic similarity have higher ASDC values and higher RI values than those based on cosine similarity. This supports the conclusion that the fuzzy semantic similarity approach can explore latent semantics.
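
The limitation of plain cosine similarity noted above can be seen in a small sketch. It uses raw term counts as the vector weights, which is our simplifying assumption, and the two form descriptions are hypothetical; it does not implement the fuzzy semantic measure of [42].

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between two texts using raw term-count vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical visible labels of two book-search forms.
form_a = "author title publisher"
form_b = "writer name publisher"
similarity = cosine_similarity(form_a, form_b)
```

Although "author"/"writer" and "title"/"name" are close in meaning, only the shared term "publisher" contributes to the score, so two forms from the same domain can appear dissimilar.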

Zhendong Qu, Derong Shen, Ge Yu et al [43]: The Deep Web data integration has become more and more important due to the large amount of deep web data sources. Nevertheless, how to select the most relevant data sources on deep web is still a challenging issue. However, the existing strategies only focus on the data sources interfaces, which are not enough to select the best-effort data sources in the same domain. To solve this problem, an integrative Data Sources Selection Model named as DSSM is proposed in this paper, in which, the interface schema, the search mode, the contents in background databases, as well as the quality of data sources are considered together. So the model has the ability to select the best-effort data sources satisfying user queries. After carrying out a series of experiments on real-world sources, they demonstrate the effectiveness of the DSSM model.

Tantan Liu, Gagan Agrawal et al [44]: This paper focuses on the problem of clustering data from a hidden or deep web data source. A key characteristic of deep web data sources is that data can only be accessed through the limited query interface they support. Because the underlying data set cannot be accessed directly, data mining must be performed based on sampling of the datasets. The samples, in turn, can only be obtained by querying the deep web databases with specific inputs. They have developed a new stratified clustering method addressing this problem for a deep web data source. Specifically, they have developed a stratified k-means clustering method. In their approach, the space of input attributes of a deep web data source is stratified for capturing the relationship between the input and the output attributes. The space of output attributes of a deep web data source is partitioned into sub-spaces. Three representative sampling methods are developed in this paper, with the goal of achieving a good estimation of the statistics, including proportions and centers, within the sub-spaces of the output attributes. They have evaluated their methods using two synthetic and two real datasets. Their comparison shows significant gains in estimation accuracy from both novel aspects of their work, i.e., the use of stratification (5%-55%) and their representative sampling methods (up to 54%).

Zilu Cui, Yuchen Fu et al [45]: As the volume of information in the Deep Web grows, a Deep Web data source classification algorithm based on query interface context is presented. Two methods are combined to obtain the search interface similarity. One is based on the vector space model: classical TF-IDF statistics are used to compute the similarity between search interfaces. The other computes the semantic similarity of two pages using HowNet. Based on the K-NN algorithm, a WDB classification algorithm is presented. Experimental results show that this algorithm generates high-quality clusters, measured both in terms of entropy and F-measure, which indicates its practical value.

Boutros R. El-Gamil, Werner Winiwarter et al [46]: The problem of extracting data that resides in the deep Web has become the center of many research efforts in recent years. The challenges in this research area range from online database discovery and form extraction from query interfaces to receiving structured queries from the user, submitting them automatically, and returning accurate results to the user. Therefore, the main task is to build an integrated system that connects this variety of missions. In this paper they give an overview of this area of research. They start by surveying previous deep Web systems. After that they define the basic components of a typical deep Web integrated system. Finally, they highlight the current challenges along with possible future research directions.

Fan Shi, Yuliang Lu, Guozheng Yang, Jun Huang et al [47]: To solve the problem of automatic Deep Web query generation, this paper first proposes a query interface criterion for the Deep Web called S-LAV, which ascertains predecessor and successor relations between controls and uses them in control grouping, constructing the RIT model with a corresponding algorithm description. Then, the process of control grouping is divided into two phases: control pre-clustering based on heuristic rules and semantic annotation based on HowNet, which aims to improve the precision and relativity of the query word libraries. Finally, in order to build query word libraries, a method of query word library generation and a method of query word expansion using relative feedback are proposed. An experiment on a practical Deep Web site shows that the scale of query word libraries with high relativity is reduced.

Fan Wang, Gagan Agrawal et al [48]: A large part of the data on the World Wide Web resides in the deep web. Most deep web data sources only support simple text interfaces for querying them, which are easy to use but have limited expressive power. Therefore, processing complex structured queries over the deep web currently involves a large amount of manual work. Their work focuses on addressing the existing gap between users' need to express and execute complex structured queries over the deep web and the simple and limited input interfaces of the existing deep web data sources. This paper presents a query planning problem formulation, with novel algorithms and optimizations, for enabling a high-level and highly expressive query language to be supported over deep web data sources. They particularly target three types of complex queries: select-project-join queries, aggregation queries, and nested queries. They have developed query planning algorithms to generate query plans for each of these, and propose several optimization techniques to further speed up query plan execution. In their experiments, they show that their algorithm has good scalability and, furthermore, that for over 90% of the experimental queries, the execution time and result quality of the query plans generated by their algorithms are very close to those of the optimal plans generated by an exhaustive search algorithm. Their optimization techniques also outperform an existing optimization method in terms of both reduction in transmitted data records and query execution speedups.

Raju Balakrishnan, Subbarao Kambhampati et al [49]: One immediate challenge in searching the deep web databases is source selection, i.e., selecting the most relevant web databases for answering a given query. The existing database selection methods (both text and relational) assess the source quality based on query-similarity-based relevance assessment. When applied to the deep web, these methods have two deficiencies. First, the methods are agnostic to the correctness (trustworthiness) of the sources. Second, the query-based relevance does not consider the importance of the results. These two considerations are essential for open collections like the deep web. Since a number of sources provide answers to any query, they conjecture that the agreements between these answers are likely to be helpful in assessing the importance and the trustworthiness of the sources. They compute the agreement between the sources as the agreement of the answers returned. While computing the agreement, they also measure and compensate for possible collusion between the sources. This adjusted agreement is modeled as a graph with sources at the vertices. On this agreement graph, a quality score of a source, which they call SourceRank, is calculated as the stationary visit probability of a random walk. They evaluate SourceRank in multiple domains, including sources in Google Base, with sizes up to 675 sources. They demonstrate that SourceRank tracks source corruption. Further, their relevance evaluations show that SourceRank improves precision by 22-60% over Google Base and the other baseline methods. SourceRank has been implemented in a system called Factal.
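
The idea of scoring sources by the stationary visit probability of a random walk can be sketched with a simple power iteration. This is an illustrative sketch rather than the SourceRank implementation of [49]: the agreement weights are made up, and the uniform-jump smoothing with a damping parameter is our own assumption, borrowed from PageRank-style walks to keep the walk well defined.

```python
def stationary_distribution(weights, damping=0.85, iterations=100):
    """Approximate the stationary visit probabilities of a random walk.

    weights[i][j] is the (assumed) agreement-based weight of edge i -> j.
    """
    n = len(weights)
    # Row-normalize weights into transition probabilities; a source with
    # no outgoing weight jumps to every source uniformly.
    transition = []
    for row in weights:
        total = sum(row)
        if total == 0:
            transition.append([1.0 / n] * n)
        else:
            transition.append([w / total for w in row])

    rank = [1.0 / n] * n
    for _ in range(iterations):
        rank = [
            (1 - damping) / n
            + damping * sum(rank[i] * transition[i][j] for i in range(n))
            for j in range(n)
        ]
    return rank


# Hypothetical agreement graph: sources 0 and 1 agree strongly with each
# other, while few answers agree with source 2.
agreement = [
    [0.0, 0.9, 0.1],
    [0.9, 0.0, 0.1],
    [0.5, 0.5, 0.0],
]
scores = stationary_distribution(agreement)
```

In this toy graph the two mutually agreeing sources receive equal and higher scores, while the source that little weight points to is ranked lowest.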

Anuradha, A.K. Sharma et al [50]: Ontologies act like a bridge between user expressions and raw data. Hence, they can play an important role in assisting users in their search for Web pages. Different users use different queries according to their knowledge and intuition to find results. The number of relevant Web pages returned to users differs depending on the terms entered into the search box of traditional search engines. Many Web pages returned to users may be completely irrelevant, and it takes too long for users to identify the relevant Web pages by going through too many results. It is necessary to develop a methodology such that the number of returned Web pages becomes smaller while the overall number of relevant Web pages becomes bigger. This paper proposes a novel approach that combines Deep Web information, which consists of dynamically generated Web pages and cannot be indexed by the existing automated Web crawlers, with ontologies built from the knowledge extracted from Deep Web sources. Here, ontology-based search is divided into different modules. The first module constructs the attribute-value ontology. The second module constructs the attribute-attribute ontology. The third module formulates the user query, fills in the search interface using the domain ontology, and extracts results by looking into the index database.

Wei Liu, Xiaofeng Meng et al [51]: The proliferation of the deep Web offers users a great opportunity to search for high-quality information on the Web. As a necessary step in deep Web data integration, the goal of duplicate entity identification is to discover the duplicate records from the integrated Web databases for further applications (e.g. price-comparison services). However, most existing work addresses this issue only between two data sources, which is not practical for deep Web data integration systems. That is, one duplicate entity matcher trained over two specific Web databases cannot be applied to other Web databases. In addition, the cost of preparing the training set for n Web databases is C_n^2 times higher than that for two Web databases. In this paper, they propose a holistic solution to address the new challenges posed by the deep Web, whose goal is to build one duplicate entity matcher over multiple Web databases. The extensive experiments on two domains show that the proposed solution is highly effective for deep Web data integration.
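
The C_n^2 factor above is the number of unordered pairs among n Web databases, n(n-1)/2. A short sketch makes the growth concrete; the database names are hypothetical, and the function name is ours.

```python
from itertools import combinations
from math import comb


def pairwise_matchers_needed(num_databases):
    """Number of database pairs, i.e. C(n, 2) = n * (n - 1) / 2."""
    return comb(num_databases, 2)


# Hypothetical set of four Web databases in one domain.
databases = ["db_a", "db_b", "db_c", "db_d"]
pairs = list(combinations(databases, 2))
```

With four databases there are already six pairs to prepare training data for, and with ten there are 45, which is the motivation for a single matcher over multiple databases.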

Baohua Qiang, Chunming Wu, Long Zhang et al [52]: With the rapid development and extensive application of the internet, a large number of duplicated entities on the Web, especially on the Deep Web, need to be eliminated and integrated effectively. Identifying the corresponding entities on the Deep Web is therefore critical. Because the query interface on the HTML page represents the schema of the Web database, they first try to obtain the schema of the entities on the Deep Web by extracting the schema of the query interface in order to improve the accuracy of entity matching. Then an entity identification approach for the Deep Web using a neural network is proposed. The experimental results show the effectiveness of their proposed algorithm.

Jing Shan, Derong Shen, Tiezheng Nie, Yue Kou, Ge Yu et al [53]: Because the amount of information contained in the Deep Web is much larger than that of the surface web, how to use it well has become a popular research problem. When a query is sent to a deep web resource and the data sources return few results or even no results, a proper query relaxation solution should be adopted to give users more satisfactory results. In this paper, such a query relaxation solution is presented. First, it solves the problem of relaxing attributes that contain multiple keywords by value. That is, such attributes are not simply removed in the relaxation; instead, the query values of the attributes are modified. Second, when a data source returns many result pages, instead of fetching all the pages, it evaluates the quality of the results in the current page to decide whether to send another query to fetch the next page. Thus, the number of queries is reduced. Finally, the experimental results demonstrate that both the result quality and the query efficiency are improved.

Huilan Zhao et al [54]: Searching on the Internet today can be compared to dragging a net across the surface of the ocean. While a great deal may be caught in the net, there is still a wealth of information that is deep and therefore missed. Deep Web sources store their content in searchable databases that only produce results dynamically in response to a direct request. In this paper, they propose an automatic classification algorithm for Deep Web sources based on a hierarchical clustering method in order to help users browse this valuable information.

Huilan Zhao et al [55]: In the deep web, a significant amount of information can only be accessed through query interfaces. In this paper, a classification algorithm based on a constructive neural network is used to determine automatically whether a web page is a query interface.

Andrea Calì, Davide Martinenghi et al [56]: Data stored outside Web pages and accessible from the Web, typically through HTML forms, constitute the so-called Deep Web. Such data are of great value, but difficult to query and search. They survey techniques to optimize query processing on the Deep Web, in a setting where data are represented in the relational model. They illustrate optimizations both at query plan generation time and at runtime, highlighting the role of integrity constraints. They discuss several prototype systems that address the query processing problem.


Rong Luo, Chunguang Li, Yuxi Gong et al [57]: To use the mass of information in Deep Web databases quickly and accurately, and in view of the characteristics of the Deep Web (large scale, dynamic nature and heterogeneity), a new query method based on document divergence is put forward for the first time. Experiments on TREC test data show that the method effectively avoids repeated documents in query results, especially when a large number of documents are returned from a Deep Web database. The method improves the effectiveness of queries.

Hao Liang, Fei Ren, Wanli Zuo, Fengling He, Junhua Wang et al [58]: There is a myriad of high-quality information in the Deep Web, and the feasible way to access it is through Deep Web query interfaces. It is therefore necessary to extract rich attributes and semantic relation descriptions from the query interface. Automatically extracting attributes from the query interface and automatically translating queries is a feasible way of addressing the current limitations in accessing Deep Web data sources. They design a framework that automatically extracts the attributes and instances from the query interface, using WordNet as a kind of ontology to enrich the semantic description of the attributes. Each attribute is extended into a candidate attribute set in the form of a hierarchy tree. At the same time, the hierarchy tree generated from the ontology describes the semantic relations among the attributes of the same query interface.

Cui Xiao-Jun, Peng Zhi-Yong, Wang Hui et al [59]: A large number of web pages returned by filling in search forms are not indexed by most search engines today. The set of such web pages is referred to as the Deep Web. Since results returned by web databases seldom have proper annotations, it is necessary to assign meaningful labels to the results. This paper presents an automatic annotation framework that uses multiple annotators to annotate results from different aspects. In particular, the search-engine-based annotator extends question-answering techniques commonly used in the AI community, constructing validation queries and posing them to the search engine. It finds the most appropriate terms to annotate the data units by calculating the similarities between terms and instances. Information for annotation can be acquired automatically without the support of a domain ontology. Experiments over four real-world domains indicate that the proposed approach is highly effective.

Jianguo Lu, Yan Wang, Jie Liang, Jessica Chen, Jiming Liu et al [60]: Crawling the deep web is the process of collecting data from search interfaces by issuing queries. With the wide availability of programmable interfaces encoded in web services, deep web crawling has found a wide variety of applications. One of the major challenges in crawling the deep web is selecting the queries so that most of the data can be retrieved at a low cost. They propose a general method in this regard. To minimize the duplicates retrieved, they reduce the problem of selecting an optimal set of queries from a sample of the data source to the well-known set-covering problem and adopt a classical algorithm to solve it. To verify that the queries selected from a sample also produce a good result for the entire data source, they carried out a set of experiments on large corpora including Wikipedia and Reuters. They show empirically that their sampling-based method is effective: 1) the queries selected from samples can harvest most of the data in the original database; 2) queries with a low overlapping rate in samples also result in a low overlapping rate in the original database; and 3) the size of the sample and the size of the pool of terms from which the queries are selected do not need to be very large.
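
The reduction to set covering described above can be sketched in a few lines of Python. The summary does not say which classical algorithm the authors adopt; the sketch below uses the standard greedy heuristic for set cover as an assumed stand-in, and the function name `select_queries` and the shape of the `sample` mapping are hypothetical.

```python
from typing import Dict, List, Set


def select_queries(sample: Dict[str, Set[int]]) -> List[str]:
    """Greedy set cover: choose query terms until every sampled document is matched.

    `sample` maps a candidate query term to the ids of the sample documents
    that the term retrieves.
    """
    uncovered: Set[int] = set().union(*sample.values())
    chosen: List[str] = []
    while uncovered:
        # Pick the term that retrieves the most documents not yet covered
        # (ties go to the alphabetically first term).
        best = max(sorted(sample), key=lambda term: len(sample[term] & uncovered))
        chosen.append(best)
        uncovered -= sample[best]
    return chosen
```

For example, with `sample = {"web": {1, 2, 3}, "deep": {3, 4}, "data": {4, 5}, "form": {1}}` the sketch selects `["web", "data"]`: two queries cover all five sample documents, and the overlapping terms "deep" and "form" are never issued.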

Ying Wang, Wanli Zuo, Tao Peng, Fengling He et al [61]: The web has been rapidly deepened by a myriad of searchable online databases, where data are hidden behind query interfaces. However, users often have difficulty finding the right sources among these databases and then querying them. To solve this problem, this paper presents a new method that applies focused crawling technology to discover deep web sources automatically. First, it locates web sites for domain-specific data sources by focused crawling. Second, it determines whether a web site contains a deep web query interface within its first three levels of depth. Last, it determines whether the deep web query interface is relevant to a given topic. Applying focused crawling confines the identification of deep web query interfaces to a specific domain and captures pages relevant to a given topic instead of pursuing high coverage ratios. This method dramatically reduces the number of pages the crawler must examine to identify deep web query interfaces.
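
The second and third steps of this method can be illustrated with a small Python sketch. It is not the implementation of [61]: the in-memory `Site` structure, the convention that the home page is depth 0 and links are followed up to depth 3, and the label-overlap relevance test with its threshold are assumptions chosen only to make the idea concrete. The first step, locating candidate sites by focused crawling, is taken as given.

```python
from collections import deque
from typing import Dict, List, Optional, Set

# Hypothetical in-memory site: page -> {"links": linked pages, "form": labels of a search form, or []}
Site = Dict[str, Dict[str, List[str]]]


def find_query_interface(site: Site, home: str, max_depth: int = 3) -> Optional[str]:
    """Breadth-first search from `home`; return the first page with a form within `max_depth`."""
    queue = deque([(home, 0)])
    seen: Set[str] = {home}
    while queue:
        page, depth = queue.popleft()
        if site[page]["form"]:
            return page
        if depth < max_depth:  # do not follow links beyond the depth limit
            for link in site[page]["links"]:
                if link in site and link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return None


def is_relevant(form_labels: List[str], topic_terms: Set[str], threshold: float = 0.5) -> bool:
    """Treat a form as on-topic when enough of its labels are topic terms."""
    if not form_labels:
        return False
    hits = sum(1 for label in form_labels if label.lower() in topic_terms)
    return hits / len(form_labels) >= threshold
```

Limiting the search depth is what keeps the number of examined pages small: pages deeper than the limit are never visited, even if they contain a form.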

Bao-hua Qiang, Jian-qing Xi, Ling Chen et al [62]: The Deep Web is becoming a very important information resource. Unlike in traditional Web information retrieval, the contents of the Deep Web are only accessible through source query interfaces. However, for any domain of interest, there may be many query interfaces that users need to access in order to get the desired information, which is time-consuming and calls for an integrated query interface over the sources. The first important task towards this goal is schema extraction from source query interfaces. In this paper, they present a novel pre-clustering algorithm with proper grouping patterns to obtain a partial clustering of attributes. Their approach avoids producing incorrect subsets when grouping attributes. The experimental results show that their approach is highly effective for schema extraction from source query interfaces on the Deep Web.

Dheerendranath Mundluru, Xiongwu Xia et al [63]: Local search engines allow geographically constrained searching of businesses and their products or services. Some local search engines use crawlers to index Web page contents. These crawlers mostly index Web pages that are accessible through hyperlinks and that include desirable location information. It is extremely important for local search engines also to crawl additional high-quality "local" content (e.g., user reviews) that is available in the Deep Web. Much of this content is hidden behind search forms and takes the form of structured data, the volume of which is increasing very rapidly. In this paper, they present their experiences in crawling and extracting a wide variety of local structured data from a large number of Deep Web resources. They discuss the challenges in crawling such sources and, based on their experience, offer some effective principles to address them. Their experimental results on several Deep Web sources with local content show that the techniques discussed are highly effective.

Alfredo Alba, Tyrone Grandison, Varun Bhagwan et al [64]: Prevailing wisdom assumes that there are well-defined, effective and efficient methods for accessing Deep Web content. Unfortunately, a host of technical and non-technical factors may call this assumption into question. In this paper, they present the findings from work on a software system commissioned by the British Broadcasting Corporation (BBC). The system requires stable and periodic extraction of Deep Web content from a number of online data sources. The insight from the project brings an important issue to the forefront and underscores the need for further research into access technology for the Deep Web.

Soon Ae Chun, Janice Warner et al [65]: In this paper, they recognize the shortcomings of current search engines, which do not index and search the Deep Web. They present the requirements of a Deep Web Service (DWS) search engine that leads to the query objects in deep data sources. To realize the DWS search engine, they propose semantic metadata and annotation of DWSs, a reasoning component that assesses the relevance of a DWS for searching Deep Web content using the likelihood of occurrence of data sources that contain the query terms, and a method for ranking DWSs. The DWS annotation covers not only the service description, as for any Web service, but also frequency distribution, clustering and semantic prediction functions.

Yue Kou, Derong Shen, Ge Yu, Tiezheng Nie et al [66]: With the rapid growth of Web databases, it is necessary to extract and integrate the large-scale data available in the Deep Web automatically. However, current Web search engines conduct page-level ranking, which is becoming inadequate for entity-oriented vertical search. In this paper, they present an entity-level ranking mechanism called LG-ERM for Deep Web queries, based on local scoring and global aggregation. Unlike traditional approaches, LG-ERM considers more rank-influencing factors, including the uncertainty of entity extraction, the style information of entities, the importance of Web sources and the relationships between entities. By combining local scoring and global aggregation in ranking, the query result can be more accurate and effective in meeting users' needs. The experiments demonstrate the feasibility and effectiveness of the key techniques of LG-ERM.

Peiguang Lin, Yibing Du, Xiaohua Tan, Chao Lv et al [67]: In recent years the Web has "deepened" rapidly, and users have to browse large numbers of web sites to access the Web databases of a specific domain. Building a unified query interface that integrates the query interfaces of a domain, so that various Web databases can be accessed at the same time, has therefore become a very important issue. In this paper, the schema characteristics of query interfaces and the common attributes within a domain are first analyzed and a new representation of query interfaces is given. The definitions of "Form term" and "Function term" are then proposed, together with a new similarity computing algorithm based on these two definitions: literal and semantic based similarity computing (LSSC). Secondly, a clustering algorithm for Deep Web query interfaces, LSSC-NQ, is given by combining LSSC with the NQ algorithm. Finally, experiments show that this algorithm computes similarity accurately and clusters query interfaces efficiently, reliably and quickly.

Anna C. Cavender, Craig M. Prince, Jeffrey P. Bigham, Ryan S. Kaminsky and Tyler S. Robison et al [68]: A wealth of structured, publicly available information exists in the deep web but is only accessible by querying web forms. As a result, users are restricted by the interfaces provided and lack a convenient mechanism to express novel and independent extractions and queries on the underlying data. Transcendence enables personalized access to the deep web by allowing users to partially reconstruct web databases in order to perform new types of queries. From just a few examples, Transcendence helps users produce a large number of values for form input fields by using unsupervised information extraction and collaborative filtering of user suggestions. Structural and semantic analysis of returned pages finds individual results and identifies relevant fields. Users may revise automated decisions, balancing the power of automation with the errors it can introduce. In a user evaluation, both programmers and non-programmers found Transcendence to be a powerful way to explore deep web resources and wanted to use it in the future.

Jufeng Yang, Guangshun Shi, Yan Zheng, Qingren Wang et al [69]: In this paper, they propose a novel model to extract data from Deep Web pages. The model has four layers, among which the access schedule, extraction layer and data cleaner are based on rules of structure, logic and application. In the experiment section, they apply the new model to three intelligent systems: scientific paper retrieval, electronic ticket ordering and resume searching. The results show that the proposed method is robust and feasible.

Yiyao Lu, Hai He, Hongkun Zhao, Weiyi Meng et al [70]: An increasing number of databases have become Web accessible through HTML form-based search interfaces. The data units returned from the underlying database are usually encoded into the result pages dynamically for human browsing. For the encoded data units to be machine processable, which is essential for many applications such as deep Web data collection and comparison shopping, they need to be extracted out and assigned meaningful labels. In this paper, they present a multi-annotator approach that first aligns the data units into different groups such that the data in the same group have the same semantics. Then for each group, they annotate it from different aspects and aggregate the different annotations to predict a final annotation label. An annotation wrapper for the search site is automatically constructed and can be used to annotate new result pages from the same site. Their experiments indicate that the proposed approach is highly effective.
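
The aggregation step, in which several annotators each suggest a label for a group of aligned data units and the suggestions are combined into one final label, can be sketched as follows. This is not the combination model of [70]: summing per-annotator confidences and taking the highest total is an assumed, simplified voting rule, and the function name `aggregate_labels` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def aggregate_labels(votes: List[Tuple[str, float]]) -> Optional[str]:
    """Combine (label, confidence) votes from several annotators for one group of data units.

    Confidences for the same label are summed and the label with the highest
    total wins; ties are broken alphabetically so the result is deterministic.
    """
    totals: Dict[str, float] = defaultdict(float)
    for label, confidence in votes:
        totals[label] += confidence
    if not totals:
        return None  # no annotator could label this group
    return min(totals, key=lambda label: (-totals[label], label))
```

Under this rule, two moderately confident annotators that agree on "Title" outweigh a single annotator that suggests "Author", which mirrors the intuition that annotations from different aspects reinforce each other.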

Hexiang Xu, Chenghong Zhang, Xiulan Hao, Yunfa Hu et al [71]: The classification of deep Web sources is an important area in large-scale deep Web integration, which is still at an early stage. Many deep Web sources are structured, providing structured query interfaces and results. Classifying such structured sources into domains is one of the critical steps toward the integration of heterogeneous Web sources. To date, existing work on classification mainly focuses on classifying texts or Web documents, and there is little work on the deep Web. In this paper, they present a deep Web model and a classification model based on machine learning. The experimental results show that they can achieve good performance with a small number of training samples for each domain and that, as the number of training samples increases, the performance remains stable.

Xin Zhong, Yuchen Fu, Quan Liu, Xinghong Lin, Zhiming Cui et al [72]: Schema matching is a critical problem in the Deep Web integration process. This paper introduces a holistic approach that matches many schemas at the same time and finds all matchings at once. They mainly analyze and compare two existing archetypal systems, MGS and DCM. Furthermore, they propose a new algorithm, named Correlated-clustering, based on the advantages of the two existing systems. This algorithm first mines group attributes from positively correlated attributes, then clusters the concepts by calculating the similarity of each pair of concepts, and finally applies a strategy to select matchings. The experimental results show the effectiveness and completeness of their algorithm, which demonstrates the promise of holistic schema matching.

The rapid growth of web data and the variety of web technologies require researchers to develop usable web tools for classifying, searching, querying, retrieving, extracting, and characterizing web information. This literature review tries to cover some works devoted to these problems. The description of related research is classified into five parts:

• Invisible Web: A formidable part of the Web, known as the invisible Web, is not "crawlable" by traditional search engines [25]. Web pages in the invisible Web are dynamically generated in response to queries submitted via search interfaces to web databases. The invisible Web provides more relevant and higher-quality information than the "crawlable" part of the Web.

• Querying search interfaces: This part gives a brief review of web query languages that provide mechanisms for navigating web search forms in queries. Several architectures of form query systems are also considered, and the limitations of these systems are pointed out.

• Extraction from data-intensive websites: Pages in data-intensive sites are automatically generated: data are stored in a back-end DBMS and HTML pages are produced using server-side web scripts from the content of the database. The RoadRunner system [35], which has been specifically designed to automate the data extraction process, is described.

• Invisible Web characterization: Two key works devoted to the characterization of the invisible Web are described in this section. Special attention is paid to the methods for estimating the principal parameters of the invisible Web.

• Finding invisible web resources: Finding search interfaces to web databases is a challenging problem. Existing approaches are discussed, together with their role in the efficient location of entry points to invisible web resources [19].