Extraction For Information Retrieval Computer Science Essay



The amount of useful semi-structured data on the Web continues to grow at a stunning pace with the introduction of Web 2.0. Interesting Web data often reside not in database systems but in HTML pages, XML pages, or text files. Web data extraction is considered one of the growing technologies in Web mining. Much of the information available on the World Wide Web is published within presentation-oriented, semi-structured HTML pages, making it difficult for machines to access the content. Data extraction from the Web is performed using information retrieval tools according to given parameters, and the results are stored in a local database or in files. Information retrieval calls for accurate extraction of Web page data: to enhance retrieval precision, irrelevant content such as navigation bars and advertisements should be identified and removed prior to indexing. The system aims to promote effective use of Web-based information retrieval tools in the user community. With a full implementation, monitoring of various Web-based products becomes straightforward, making the system a useful tool for data analysis and intelligence in the new age.

(Keywords: Web Data Extraction, Information Extraction, Web Mining, Information Retrieval)


The Web provides a new medium for storing, presenting, gathering, sharing, processing and using information. It brings us to a new information age. Given the rapid growth and success of public information sources on the World Wide Web (WWW), it is increasingly attractive to extract data from these sources and make it available for further processing by end users and application programs. Unfortunately, this growth has not been matched by significant improvement in the mechanisms for accessing and manipulating the data. It is still accessed by browsing Web pages, entering information into query forms, and reading the results that Web sites present. No convenient mechanism exists that would give users more power over the data on the Web by, for example, allowing them to define custom queries to Web sites or to extract the returned data from HTML pages and use it in external applications.

Data extracted from Web sites can serve as the springboard for a variety of tasks, including information retrieval (e.g. business intelligence), event monitoring (news and stock market), and electronic commerce (comparison shopping). In addition, with the transformation of the Web into the primary tool for electronic commerce, it is imperative for organizations and companies, who have invested millions in Internet and Intranet technologies, to track and analyze user access patterns. These factors give rise to the necessity of creating server-side and client-side intelligent systems that can effectively mine for knowledge both across the Internet and in particular Web localities.

A variety of methods for schema discovery and data extraction from HTML documents have been proposed. This work presents a model-driven crawler and extractor for gathering sample topologies from content websites. Like a general-purpose Web crawler, which gathers Web pages and their interconnecting hyperlinks, our system aims to create structured data describing entities and their inter-relationships in a given online environment, by employing a descriptive model of how to navigate through and extract information from that environment.

Extracting structured data from Web sites is not a trivial task. Most of the information in the Web today is in the form of Hypertext Markup Language (HTML) documents which are viewed by humans with a browser. HTML documents are sometimes written by hand, sometimes with the aid of HTML tools. Given that the format of HTML documents is designed for presentation purposes, not automated extraction, and the fact that most of the HTML content on the Web is ill-formed ("broken"), extracting data from such documents can be compared to the task of extracting structure from unstructured documents.


Web Mining:

Web mining is the data mining technique that automatically discovers or extracts information from Web documents. It is the extraction of interesting and potentially useful patterns and implicit information from artifacts or activity related to the World Wide Web.

Web Mining Process

The various steps are explained as follows:-

Resource finding: It is the task of retrieving intended web documents.

Information selection and pre-processing: Automatically selecting and pre-processing specific information from the retrieved Web resources.


Generalization: Automatically discovering general patterns at individual Web sites as well as across multiple sites.

Analysis: Validation and interpretation of the mined patterns.

Web Mining Categories

Web mining research overlaps substantially with other areas, including data mining, text mining, information retrieval, and Web retrieval. The classification is based on two aspects: the purpose and the data sources. Retrieval research focuses on retrieving relevant, existing data or documents from a large database or document repository, while mining research focuses on discovering new information or knowledge in the data. On this basis, Web mining can be classified into web structure mining, web content mining, and web usage mining.

Web content mining: - Web content mining is the process of extracting useful information from the contents of Web documents. Content data correspond to the collection of facts a Web page was designed to convey to its users. Web content mining is related to, but different from, data mining and text mining. It differs from data mining because Web data are mainly semi-structured or unstructured; it differs from text mining because of the semi-structured nature of the Web, whereas text mining focuses on unstructured texts. The technologies normally used in web content mining are NLP (natural language processing) and IR (information retrieval).

Web Structure Mining: - Web structure mining is the process of discovering the model underlying the link structure of Web pages. Its goal is to generate a structured summary of a website and its pages, and it tries to discover the structure of hyperlinks at the inter-document level. The other kind of web structure mining is mining the document structure itself, using the tree-like structure of HTML (Hyper Text Markup Language) or XML (Extensible Markup Language) to analyze and describe a page.

Web Usage Mining: - Web usage mining is the process of identifying browsing patterns by analyzing the navigational behavior of users. It focuses on techniques that can predict user behavior while the user interacts with the Web, and it uses secondary data on the Web. This activity involves the automatic discovery of user access patterns from one or more Web servers; through this mining technique we can ascertain what users are looking for on the Internet. It consists of three phases: preprocessing, pattern discovery, and pattern analysis. Web servers, proxies, and client applications can quite easily capture data about Web usage.
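The three phases of Web usage mining can be illustrated with a minimal sketch. The log lines, the log format, and the frequency-counting policy below are illustrative assumptions, not part of the system described in this essay:

```python
import re
from collections import Counter

# A simplified Common Log Format line: host, timestamp, request, status, size.
# The log lines further below are fabricated for illustration.
LOG_PATTERN = re.compile(r'(\S+) - - \[(.*?)\] "GET (\S+) HTTP/1\.\d" (\d{3}) \d+')

def discover_patterns(log_lines):
    """Preprocessing and pattern discovery: keep successful requests,
    then count how often each page was accessed."""
    pages = []
    for line in log_lines:
        m = LOG_PATTERN.match(line)
        if m and m.group(4) == "200":      # preprocessing: drop failed requests
            pages.append(m.group(3))
    return Counter(pages)                   # pattern discovery: access frequency

logs = [
    '10.0.0.1 - - [01/Jan/2024:10:00:00] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2024:10:00:05] "GET /products.html HTTP/1.1" 200 1024',
    '10.0.0.1 - - [01/Jan/2024:10:00:09] "GET /products.html HTTP/1.1" 200 1024',
    '10.0.0.3 - - [01/Jan/2024:10:00:12] "GET /missing.html HTTP/1.1" 404 128',
]
# Pattern analysis: inspect the most frequently accessed page.
print(discover_patterns(logs).most_common(1))
```

A real system would read server log files instead of an in-memory list, but the phase structure stays the same.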


The system consists of 4 main components:

Authentication.

Crawler.

Data Extraction.

Information Retrieval.

The system overview visualizes the system architecture and the interactions between the components. By designing loosely-coupled components, the system becomes rather flexible and extensible. Basic descriptions of the individual components are given below.

Figure1: Architectural Design


Authentication is the act of confirming the truth of an attribute of a datum or entity. This involves confirming the identity of a person or user, ensuring that the user is who they claim to be.

The authentication module provides security mechanisms and permissions for accessing the application. Its operation is simple, requiring only a user name and a password stored in the database. Passwords stored in the tables are hashed with the MD5 hash function. Registration of new users is also permitted through the user registration form; only registered users have access.
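A minimal sketch of this registration and login flow follows. The in-memory `users` table and the function names are illustrative; MD5 is used here only because the essay specifies it (salted schemes such as bcrypt are preferred in practice):

```python
import hashlib

# Hypothetical user table kept in memory for illustration; the essay's
# system stores these rows in a database instead.
users = {}

def register(username, password):
    # The essay specifies the MD5 hash function; stronger, salted schemes
    # are recommended for real deployments.
    users[username] = hashlib.md5(password.encode("utf-8")).hexdigest()

def authenticate(username, password):
    # Hash the supplied password and compare against the stored digest.
    digest = hashlib.md5(password.encode("utf-8")).hexdigest()
    return users.get(username) == digest

register("alice", "s3cret")
print(authenticate("alice", "s3cret"))   # True
print(authenticate("alice", "wrong"))    # False
```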


A crawler is a Java-based application that extracts data from on-line sources. Web sites consist of two types of HTML pages: target HTML pages that contain the data we want to extract and navigational HTML pages that contain hyper-links pointing to target pages or other navigational pages.

The crawler starts the navigation by retrieving the seed page (or pages) from the Web site and, based on its URL, determines whether the page is a target page or a navigational page. If it is a target page, it is forwarded to the data extractor for subsequent processing. Hyperlinks from both types of pages are analyzed and a decision to follow them is made on a link-by-link basis. A crawling depth parameter determines how many links away from the seed page the crawler can move.
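The crawler itself is described as a Java application; the following Python sketch only illustrates the navigation logic above. The site map, the `/item/` URL rule for recognizing target pages, and the depth policy are all illustrative assumptions:

```python
# Hypothetical site structure as {url: (out-links, ignored)}. A real crawler
# would fetch pages over HTTP and parse hrefs out of the HTML.
SITE = {
    "http://example.com/": (["http://example.com/list"], None),
    "http://example.com/list": (["http://example.com/item/1",
                                 "http://example.com/item/2"], None),
    "http://example.com/item/1": ([], None),
    "http://example.com/item/2": ([], None),
}

def is_target(url):
    # Assumed rule: target pages (the ones holding data) live under /item/;
    # everything else is navigational.
    return "/item/" in url

def crawl(seed, max_depth):
    """Breadth-first crawl up to max_depth links away from the seed page.
    Target pages are collected for the data extractor."""
    targets, seen = [], {seed}
    frontier = [(seed, 0)]
    while frontier:
        url, depth = frontier.pop(0)
        if is_target(url):
            targets.append(url)           # would be forwarded to the extractor
        if depth < max_depth:             # crawling depth parameter
            for link in SITE.get(url, ([], None))[0]:
                if link not in seen:      # decide link-by-link, visit once
                    seen.add(link)
                    frontier.append((link, depth + 1))
    return targets
```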

A Web search engine periodically downloads and indexes a subset of Web pages (off-line operation). This index is used for searching and ranking in response to user queries (on-line operation). The search engine is an interface between users and the World Wide Web.

Figure 2: HTML Parsing Process.








During parsing, links are extracted to build a Web graph; they can later be analyzed to generate link-based scores that are stored along with the rest of the metadata. The following steps occur.

(1) Pages are parsed and links are extracted.

(2) Partial indices are written on disk when main memory is exhausted.

(3) Indices are merged into a complete text index.

(4) Off-line link analysis can be used to calculate static link-based scores.
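Steps (1)-(3) can be sketched as follows. The tag-stripping regex, the fixed batch size standing in for "main memory is exhausted", and the in-memory merge are all simplifying assumptions:

```python
import re
from collections import defaultdict

def parse(page_html):
    """Step 1: parse a page into its text terms and its out-links."""
    links = re.findall(r'href="([^"]+)"', page_html)
    text = re.sub(r"<[^>]+>", " ", page_html)   # crude tag removal
    return text.lower().split(), links

def build_index(pages, batch_size=2):
    """Steps 2-3: build partial inverted indices in fixed-size batches
    (standing in for exhausted main memory), then merge them."""
    partials, current = [], defaultdict(set)
    for doc_id, html in enumerate(pages):
        terms, _links = parse(html)             # _links would feed the Web graph
        for term in terms:
            current[term].add(doc_id)
        if (doc_id + 1) % batch_size == 0:      # step 2: write out a partial index
            partials.append(current)
            current = defaultdict(set)
    if current:
        partials.append(current)
    merged = defaultdict(set)                   # step 3: merge into one text index
    for partial in partials:
        for term, docs in partial.items():
            merged[term] |= docs
    return merged

pages = ['<a href="b.html">web mining</a>', "<p>web extraction</p>", "<p>data</p>"]
index = build_index(pages)
```

A production indexer would write partials to disk and store postings compactly, but the parse/partial/merge shape is the same.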

Generation of a Data Extractor


A data extractor is a procedure that extracts unstructured information from a source and transforms it into structured data. A Web data extraction system must implement support for data extractor execution. The steps between extraction and delivery are called data transformation; during these phases, which include data cleaning and conflict resolution, users reach the target of obtaining homogeneous information under a unique resulting structure.
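The cleaning and conflict-resolution phases can be sketched as below. The record fields, the price normalization, and the keep-the-lower-price policy are illustrative assumptions, not the essay's actual rules:

```python
def clean(record):
    """Data cleaning: trim whitespace and normalise the price field."""
    return {
        "title": record["title"].strip(),
        "price": float(record["price"].replace("$", "").strip()),
    }

def resolve(records):
    """Conflict resolution: when two sources report the same title with
    different prices, keep the lower one (an assumed policy)."""
    merged = {}
    for r in map(clean, records):
        key = r["title"]
        if key not in merged or r["price"] < merged[key]["price"]:
            merged[key] = r
    return list(merged.values())

# Raw extractor output: heterogeneous strings from two hypothetical sources.
raw = [
    {"title": " Web Mining ", "price": "$42.00"},
    {"title": "Web Mining",   "price": "$39.50"},
]
print(resolve(raw))  # one homogeneous record per title
```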

Two important characteristics of the Web make Web crawling very difficult: its large volume and its rate of change, as a huge number of pages is added, changed and removed every day. Moreover, network speed has improved less than current processing speeds and storage capacities. The large volume implies that the crawler can only download a fraction of the Web pages within a given time, so it needs to prioritize its downloads. The high rate of change implies that by the time the crawler is downloading the last pages from a site, it is very likely that new pages have been added to the site, or that pages have already been updated or even deleted.

Crawling the Web, in a certain way, resembles watching the sky on a clear night: what is seen reflects the state of the stars at different times, as their light has traveled different distances. What a Web crawler gets is not a "snapshot" of the Web, because it does not represent the Web at any single instant of time. The last pages crawled are probably represented very accurately, but the first pages downloaded have a high probability of having been changed.

Information Retrieval System

Finding relevant information within a large collection of documents is one of the first challenges information systems had to face. The process of identifying, searching and acquiring the documents that may meet a user's information needs is called the retrieval process. All of the retrieved documents aim at satisfying information needs expressed in natural language.

Information retrieval deals with the specific search applied to the system so as to obtain a specific result from the data extraction process; this involves basic keyword search. Information retrieval (IR) efforts on the Web are concerned with capturing useful and precise content on demand and in real time, in order to adapt to such variety, change and dynamism. These efforts face a number of challenges, and various IR methods are employed to deal with them.

The process of information extraction is twofold: first, precise and robust access to particular data needs to be established; second, the gathered data are structured and stored automatically in a database. The complexity of the methods employed for information extraction depends on the characteristics of the source texts. The method can be rather simple and straightforward if the source is well structured; if the source is less structured, or even plain natural language, the complexity of the extraction method becomes high, as it involves natural language recognition and similar processes. Information extraction (IE) can be viewed as a method in between information retrieval and natural language processing, as the goal of IE is to extract concrete facts contained within a written document and represent them in some useful form (such as a record in a database). The main difference between the two approaches is that for information extraction the relevant facts of interest are specified in advance, while information retrieval tries to discover documents that may contain facts of interest the user is not yet aware of. Information extraction is primarily based on pattern-matching algorithms, so it relies on the structure of the information source. In terms of structure, the content of a written document can be:

Fully structured content

Fully structured content includes a detailed description of every piece of data contained within the document. This meta-data can be used by a program routine to easily access any fact in the document. The most typical documents with fully structured content are XML documents and their variations. Some natural language texts can also be considered fully structured if the content satisfies additional formal rules, as is the case with CSV (comma-separated values); such texts are therefore limited to repositories of facts and raw data. Information extraction from fully structured content is accurate and complete if there are no faults in the document structure. Extraction algorithms can be extended with error detection and correction routines, but this makes the IE process expensive and less efficient.
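Extraction from fully structured content reduces to direct lookups, as this sketch shows; the two tiny documents and the record layout are fabricated for illustration:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Fully structured sources: every value is described by markup (XML)
# or by position and header (CSV), so extraction is a direct lookup.
XML_DOC = "<books><book><title>Web Mining</title><price>42</price></book></books>"
CSV_DOC = "title,price\nWeb Mining,42\n"

def from_xml(doc):
    # The meta-data (tag names) leads straight to each fact.
    root = ET.fromstring(doc)
    return [{"title": b.findtext("title"), "price": b.findtext("price")}
            for b in root.iter("book")]

def from_csv(doc):
    # Column headers play the same role as tags in XML.
    return list(csv.DictReader(io.StringIO(doc)))

print(from_xml(XML_DOC))
print(from_csv(CSV_DOC))
```

Both sources yield the same structured records, which is exactly what makes fully structured content the easy case for IE.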

Semi-structured content

Semi-structured content includes only a partial description of some of the data contained within the document. This type of content can also contain explicit or implicit information about where different pieces of data begin, but the data themselves are uncoupled and expressed in natural language. Natural language texts that obey the same grammatical rules of the language can be viewed as semi-structured from an information extraction point of view. In this type of content there is a great possibility of different terms being used for the same types of information, so the use of ontologies or repositories of synonyms is advised. Information extraction from semi-structured content is as accurate and as complete as the ontologies used are complete and precise.

Unstructured content

If the content used for information extraction has no meta-data about the facts within the document, i.e. if the content is natural language text, information extraction is limited to usually quite exhaustive pattern-matching algorithms and advanced natural language processing algorithms with partial success rates. Natural language algorithms include text mining techniques such as artificial neural networks and advanced linguistic analysis. Information extraction from unstructured content guarantees neither accuracy nor completeness of the data obtained, and a portion of the data can remain undetected. One of the most common methods used in information extraction today is pattern matching based on a formal description of the structure of the text that contains explicit pieces of data; for this purpose a dedicated formal pattern-matching language, the regular expression (RegEx), is used.
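A small example of regular-expression-based extraction follows; the sentence and the pattern are fabricated, and the pattern encodes assumed prior knowledge of how the facts are phrased:

```python
import re

# A hypothetical unstructured sentence containing the facts of interest.
text = "The book 'Web Mining' was published in 2010 and costs 42 dollars."

# The pattern is a formal description of how those facts appear in the text.
pattern = re.compile(
    r"'(?P<title>[^']+)' was published in (?P<year>\d{4})"
    r" and costs (?P<price>\d+) dollars"
)

match = pattern.search(text)
if match:
    record = match.groupdict()     # a structured record extracted from free text
    print(record)
```

Any rewording of the sentence breaks the pattern, which illustrates why extraction from unstructured content has only partial success rates.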

Algorithm Pseudo-code for template extraction

Algorithm: Template table text chunk removal.

Input: clustered Web pages.

Output: textMap: the discovered template table text chunks, stored as {table text chunk, document frequency} pairs in a hash map.

Variables: Integer threshold: document frequency threshold for template discovery; Buffer textChunk: temporary table text chunk.

For each Web page g in the cluster
    p <- HTML parser of g
    While p has more HTML nodes            * an HTML node can be any tag, string element, etc.
        n <- next HTML node extracted by p
        Update the current tag path
        If n is not a table node
            textChunk <- textChunk + extracted text of n
        Else                               * n is a table node
            If textChunk is already in textMap
                If textChunk was found in a different document
                    Increment the document frequency for textChunk
            Else
                Put textChunk in textMap with document frequency 1
            Clear the textChunk buffer
    End While
End For

While textMap has more {textChunk, document frequency} pairs
    h <- next item in textMap
    If the document frequency of h ≥ threshold
        Print the textChunk of item h
End While
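One possible Python rendering of this algorithm is sketched below. The `ChunkCollector` class, the choice of `<table>` tags as chunk boundaries, and the per-page deduplication are implementation assumptions the pseudo-code leaves open:

```python
from collections import defaultdict
from html.parser import HTMLParser

class ChunkCollector(HTMLParser):
    """Accumulates text into chunks, flushing the buffer whenever a
    table node is met, mirroring the pseudo-code's per-node loop."""
    def __init__(self):
        super().__init__()
        self.chunks, self.buffer = [], []

    def _flush(self):
        chunk = " ".join(self.buffer).strip()
        if chunk:
            self.chunks.append(chunk)
        self.buffer = []                      # clear the textChunk buffer

    def handle_starttag(self, tag, attrs):
        if tag == "table":                    # n is a table node
            self._flush()

    def handle_endtag(self, tag):
        if tag == "table":
            self._flush()

    def handle_data(self, data):              # n is not a table node
        self.buffer.append(data.strip())

    def close(self):
        super().close()
        self._flush()

def template_chunks(pages, threshold):
    """Chunks appearing in at least `threshold` distinct pages are
    reported as template (boilerplate) content."""
    freq = defaultdict(set)
    for doc_id, html in enumerate(pages):
        parser = ChunkCollector()
        parser.feed(html)
        parser.close()
        for chunk in set(parser.chunks):      # count each document once
            freq[chunk].add(doc_id)
    return {c for c, docs in freq.items() if len(docs) >= threshold}

# Two hypothetical pages from one cluster sharing a navigation chunk.
pages = ['<table>Home | About</table><p>Story one</p>',
         '<table>Home | About</table><p>Story two</p>']
print(template_chunks(pages, threshold=2))
```

The shared navigation text is flagged as template because its document frequency reaches the threshold, while the page-specific stories are not.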


When the extraction task is complete and the acquired data are packaged in the needed format, the information is ready to be used; the last step is to deliver the package, now represented as structured data, to a managing system (e.g. a native XML DBMS, an RDBMS, a data warehouse, etc.).

Visualization provides the visual output of the whole process. The output is displayed in a grid format laid out as a list, with a formatted layout of the pictures collected and a description of what was mined from the web page.


In conclusion, this project proposes a novel approach to identify Web page templates and extract unstructured data. The algorithm is fine-tuned for accuracy and efficiency. It formulates the page generation model using an encoding scheme based on tree templates and schemata, which organize data by their parent nodes in the DOM trees. WEDIRM comprises two phases: in phase I, the input DOM trees are merged to construct the fixed/variant pattern tree; in phase II, the schema and template are detected from the pattern tree.

According to the page generation model, data instances of the same type have the same path in the DOM trees of the input pages. Thus, the alignment of input DOM trees can be implemented by string alignment at each internal node. The project designs a new algorithm for multiple string alignment which takes optional- and set-type data into consideration. The advantage is that nodes with the same tag name can be better differentiated by the sub-trees they contain; meanwhile, the result of the alignment makes pattern mining more accurate. With the constructed fixed/variant pattern tree, deducing the schema and template for the input Web pages is straightforward. The extracted data can then be imported into other applications and further analyzed and processed.

Several challenges were encountered during the project. Many HTML Web pages incorporate dynamic techniques such as JavaScript and PHP scripts. For instance, the sub-menus of navigation bars may be rendered by the "document.write" method of JavaScript on a mouse-over event, and image maps are also implemented with JavaScript code. Inclusion of any template navigation bar in the indexing phase is unnecessary; in future work, handling these scripts will be important to avoid the risk of missing critical information.

Another issue is handling images with descriptive functions in Web pages. It is not good to simply remove every image from a page, as a picture is worth a thousand words. For example, a fashion site like "fashion.com" prefers image text rather than plain text. Image text certainly facilitates readers' understanding; on the other hand, it hinders data extraction. The "ALT" attribute of an image link sometimes also describes the image content, if edited by a human editor. Image names and "ALT" values are included in the Web data extraction process, while worthless values such as "img9", "spacer" or "graphic" are discarded. Evaluating the effect of including informative image names and image-name parsing rules should be considered in future work. Other limitations are network performance issues, involving network availability and high bandwidth requirements, and legal issues, since some sites restrict the extraction of data from their pages.

Future enhancements

The system was applicable to only a single book sales store site; in the future, more sites can be included for extraction.

The system is implemented for the local machine environment; it can be enhanced to provide more advanced features over the Internet.

Data privacy control in dealing with legal issues.