Nodes And The Values In A Web Page Computer Science Essay


Chapter 1


The World Wide Web, abbreviated as WWW and commonly known as the Web, is a system of interlinked hypertext documents accessed via the Internet. With a web browser, one can view web pages that may contain text, images, videos, and other multimedia, and navigate between them by using hyperlinks.

The Web acts as a portal to a vast body of information that is available at all times, and a suitable system can be built to access that information in an organized manner. Once a server delivers a page, the browser progressively renders it on screen as specified by HTML [9], CSS [10], and other web languages. Images and other resources are combined to produce the on-screen page that the user views.

Most sites contain links to other pages and resources. Such a collection of useful resources, interconnected via hypertext links, is what was termed a "web" of information.

Because HTML [9] was simple but not sufficient for retrieving data in an organized form, XML (the Extensible Markup Language) was introduced. With XML, data can readily be expressed as nodes.

Parent and child nodes are expressed in the document's markup language. The Web is full of hypertext documents written in HTML and XHTML, and CSS [10] is used to style these pages.
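As an illustration of parent and child nodes, the following minimal sketch walks a small XML document and lists each node with its depth and value. The sample document, its element names, and the walk function are assumptions made for this example; they are not taken from the project.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element names are illustrative only.
SAMPLE_XML = """
<university>
  <name>Example University</name>
  <department>
    <title>Computer Science</title>
  </department>
</university>
"""


def walk(node, depth=0):
    """Return (depth, tag, text) tuples for a node and all of its descendants."""
    text = (node.text or "").strip()
    rows = [(depth, node.tag, text)]
    for child in node:  # iterating an Element yields its direct children
        rows.extend(walk(child, depth + 1))
    return rows


root = ET.fromstring(SAMPLE_XML)
for depth, tag, text in walk(root):
    # Indent each node by its depth so the parent/child nesting is visible.
    print("  " * depth + tag + (": " + text if text else ""))
```

Each tuple pairs a node with its value, which is the kind of node/value mapping the project sets out to extract.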

1.1 Objectives

The main objective of this project is to extract the nodes and their values from a web page and map them to a proper format, so that users can easily search for particular data whenever they want. Scraping the data and indexing the nodes are the first two steps. The scraped data is then stored in a database, and code is written to connect to the database and retrieve data according to the user's search query. In effect, a search engine is to be created.

1.2 Motivation

This project will help in learning about data mining and about searching when a large amount of data is stored in a database. The widespread use of the World Wide Web as a worldwide information system has produced an enormous amount of data and information. This growth has generated a need for new strategies and automated tools that can help transform vast amounts of data into useful information and knowledge.

Data mining is a field that draws on:

Database technology.

Statistics and machine learning.

Pattern recognition.

Neural networks.

High-performance computing.

Data visualization. [11]

1.3 Data Mining

Data mining refers to the process of extracting useful patterns from the vast amounts of data stored in databases and on the Web. It is now an important tool for transforming data into information and is used in a range of profiling practices, such as marketing, surveillance, fraud detection, and scientific discovery. [12] Because what is extracted is knowledge rather than the data itself, data mining would be more accurately named "knowledge mining". The term "mining" characterizes a process that finds a small set of valuable items in a large body of files and information. [12]





Fig 1.1 Block diagram of the design component (surviving diagram label: MySQL Database).

1.4 Methodology

Phase 1

Broad research on mechanisms for handling web-based documents.

Phase 2

Scraping the data for searching and indexing using a web extractor

Phase 3

Storing the extracted data in a MySQL database, by way of .txt files that use tabs as delimiters.

Phase 4

Maintenance of the files containing the indexed data.

Phase 5

Analyzing the problems. Troubleshooting the hardware and software resources.

Phase 6

Designing the PHP code that connects to the MySQL database and retrieves data according to the user's query. Creating the HTML code that displays the data on a web page.

Phase 7

Fine tuning of the report, conclusion and submission of the report.

Chapter 2

Related Work

2.1 Mining Techniques

Data mining is a powerful technology that helps companies focus on the most important information in their data warehouses. Data mining tools can predict trends and behaviors, allowing businesses to make proactive decisions. [13]

Within Web mining, Web content mining is analogous to data mining techniques for relational databases, since it makes it possible to find similar types of knowledge in the unstructured data residing on the Web. [14]

Web content mining can be viewed from two perspectives:

Information Retrieval View.

Database View.

In the information retrieval view, for semi-structured data, all tasks use the HTML structure inside the documents, and some also use the link structure between documents, to represent the documents. In the database view, which aims at better information management and querying on the Web, mining tries to infer the structure of a Web site so that the site can be transformed into a database. [14]

Most work on information extraction has concentrated on narrow tasks, such as limiting extraction to a particular domain (for example, academic departments) or applying extractors to specific web sites. Such techniques cannot be applied to the Web as a whole. The work in [1] describes three kinds of extraction methods that aim to be domain independent and therefore applicable to the entire Web, each targeting a different kind of resource (text, meta tags, and HTML pages). Extraction aimed at a unified Web knowledge base can also teach important lessons about extraction in general [1].

Fig 1.2. Working diagram of the project.

The text content available in web pages can be searched effectively by keyword, but query-based searches require much more detailed and fine-grained techniques. Information that appears online in natural-language form must first be transformed into a structured, formatted database form. Information extraction does exactly this: it is the process of arranging the fields of a web page, taken from unstructured or loosely formatted text, into a proper database format. Thus (as shown in Fig 1.2), information extraction creates a database from unstructured or loosely structured text; data mining then explores patterns in that database [2].
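To make the idea of information extraction concrete, the sketch below turns a few loosely formatted "Label: value" lines into a structured record. The sample text, the field labels, and the extract_fields helper are assumptions for illustration; real web text would need considerably more robust handling.

```python
import re

# Hypothetical loosely formatted announcement text (illustrative only).
RAW_TEXT = """
Conference: Data Mining Workshop
Location : Example City
Date:  12 March 2010
"""

# One "Label: value" pair per line; spacing around the colon may vary.
FIELD_PATTERN = re.compile(
    r"^[ \t]*(\w[\w ]*?)[ \t]*:[ \t]*(.+?)[ \t]*$", re.MULTILINE
)


def extract_fields(text):
    """Turn 'Label: value' lines into a dictionary keyed by lower-case label."""
    return {label.lower(): value for label, value in FIELD_PATTERN.findall(text)}


record = extract_fields(RAW_TEXT)
for field, value in record.items():
    print(field, "->", value)
```

The resulting dictionary corresponds to one database row, with each key playing the role of a column name.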





2.2 Text Mining

The term "text mining" describes the application of data mining techniques to the automated discovery of useful knowledge from unstructured text. Many techniques have been proposed for text mining, including conceptual structures, association rule mining, decision trees, and rule induction. Information retrieval techniques have been widely used for document matching, ranking, and clustering. [3]

2.3 Extracting various types of data

Structured data Extraction

Structured data is comparatively easy to extract from a web page. Such data takes the form of lists, trees, and tables, and can be extracted with the help of wrapper generation.

Unstructured data Extraction

Data in the form of free text documents is unstructured. Extracting it is closely related to text mining.

Semi structured data extraction

Semi-structured data consists of text that is not fully grammatical. It has a hierarchical structure but no predefined schema. Ways of extracting semi-structured data include NLP techniques, wrapper generation, and ontologies. [4]

2.4 Web Crawlers

External crawler

This crawler handles previously unknown websites. It invokes a depth crawl on the first page of such a site and then decides which site to pass on next for an internal crawl. It starts from user-specified sites and expands its statistics as new websites are discovered.

Internal crawler

This crawler examines a known site to identify its purpose and downloads the meta tags from a few of its pages. The pages it processes are those supplied by the external crawler. A depth crawl starts at the home page and uses breadth-first search to traverse the site, fetching links to expand the crawl and find relevant pages. If relevant pages are not found, the process continues until enough links have been collected for traversal. [4]
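The breadth-first traversal used by a depth crawl can be sketched as follows. To keep the example runnable without network access, an in-memory link graph stands in for real pages; the page names, the LINKS dictionary, and the max_pages limit are all assumptions for illustration.

```python
from collections import deque

# Hypothetical link graph standing in for real pages: page -> links found on it.
LINKS = {
    "home": ["about", "courses"],
    "about": ["home", "staff"],
    "courses": ["staff", "timetable"],
    "staff": [],
    "timetable": ["home"],
}


def breadth_first_crawl(start, links, max_pages=10):
    """Visit pages level by level from `start`, returning them in visit order."""
    visited = [start]
    seen = {start}
    queue = deque([start])
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        for target in links.get(page, []):
            # Skip pages already queued and stop once the page budget is used.
            if target not in seen and len(visited) < max_pages:
                seen.add(target)
                visited.append(target)
                queue.append(target)
    return visited


print(breadth_first_crawl("home", LINKS))
```

Pages one link away from the home page are visited before pages two links away, which is what distinguishes breadth-first from depth-first traversal.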

2.5 Web Content Mining

Current work focuses on unstructured data such as free text, semi-structured data such as HTML documents, and structured data such as tables. The goal of this mining is to model and integrate the data on the Web so that keyword-based queries can be carried out. [5]

2.6 Search Engines in Web Data Mining

A search engine architecture can be built for the Web to gather information on a large scale and improve retrieval accuracy. High-level techniques are needed to design the structure of the engine. It can be based on keyword matching, classification by page size, and depth retrieval that follows links using a search algorithm. [6]

Chapter 3

System Configuration

3.1 Introduction

System configuration is the process of setting up the hardware devices and assigning resources to them properly so that they work without trouble. Proper configuration avoids resource conflicts and the hard-to-diagnose errors they cause.

3.2 Hardware and Software Resources

Hardware Requirements:

Windows 95/98/2000/NT/ME/XP/Vista.

1.99 GB RAM, Up to 1 GB Hard Disk Space, Internet Connection (LAN/DSL/Cable).

These are the basic requirements for running the software used in this project. The programs also run on Linux-based operating systems such as Ubuntu [9], but need a different configuration because they use different file sharing programs.

Software Requirements:

WebDataExtractor Version 7.3[15]

This is Windows-based software that extracts meta tags, email addresses, fax numbers, phone numbers, and URLs from website content. It mines the data for a given input URL. The extraction process can be started and stopped at any time. Once the process is done, the selected records can be saved as a session, and the session file can be exported as a .txt, .csv, or .xml file as required.

Altova XMLSpy Enterprise Edition v2010 rel. 2[16]

This is an XML editor that can display an XML file in a grid view, a text view, or a tree view. It also reports syntax errors in any node or tag, such as an unclosed bracket.

WampServer 2.0i (includes Apache 2.2.11, MySQL 5.1.36, PHP 5.3.0) [17]

This is a server stack that runs on localhost and lets server-side programs run on the local machine through the Apache web server, the MySQL database server, and PHP. In this project, MySQL is required for storing the extracted data. The configuration is user friendly.

NetBeansIDE Version 6.8 [18]

Java: 1.6.0_18; Java HotSpot(TM) Client VM 16.0-b13.

System: Windows XP version 5.1 running on x86.

This software runs on Windows, Linux, Mac OS X and Solaris. NetBeans IDE is open-source and free.

3.3 Summary

Working with the Windows XP configuration and this software proved straightforward. Some issues needed troubleshooting and were resolved with the help of forums and PHP experts. The extracted data was stored in text files with tab delimiters, and these files were imported into the MySQL database from the MySQL command line prompt. Indexes were created through commands, and with their help it was possible to track down minute details in the database.

Chapter 4


4.1 Extracting

Many HTML pages are generated by software that queries databases, so meta information about the underlying data structure is not present in the page, and an automated program cannot readily process the data. The attributes of the records can be extracted using a method based on multiple string alignment. This technique was tested on a large number of real web sources and obtained high precision values [9].

Since this approach works on the DOM tree, the WebDataExtractor software is used to extract all the meta tags in the web page content.
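The project relies on WebDataExtractor for this step. Purely as an illustration of what extracting meta tags involves, the sketch below uses Python's standard html.parser module on a hypothetical page. It is not how WebDataExtractor works internally; the sample HTML and the MetaTagParser class are assumptions.

```python
from html.parser import HTMLParser

# Hypothetical page source (illustrative only).
SAMPLE_HTML = """<html><head>
<title>Example University</title>
<meta name="description" content="Courses and research">
<meta name="Keywords" content="university, research">
</head><body></body></html>"""


class MetaTagParser(HTMLParser):
    """Collect the <title> text and name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        name = attributes.get("name")
        if tag == "meta" and name:
            # Store meta names in lower case so lookups are case-insensitive.
            self.meta[name.lower()] = attributes.get("content") or ""
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_meta(html):
    """Return (title, meta dictionary) for one HTML document."""
    parser = MetaTagParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.meta


title, meta = extract_meta(SAMPLE_HTML)
print(title)
print(meta)
```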

Fig1.3. Screenshot of WebDataExtractor extracting the metatags from a URL.

4.2 Database Storage

Database tables are stored using index structures; the most commonly used are B+ trees and ISAM. [19]

After extraction, the data is converted into text files. These files are loaded into the MySQL database, which uses the MyISAM [19] storage engine, the default storage engine of the MySQL release used here (5.1.36).

Fig1.4. Importing the saved records as .txt files with tab space delimiters.

The meta tag fields to be extracted can also be selected. If only the URL, page size, page last modified, and date fields are needed, the other fields must be unchecked in the dialog box.

4.3 Searching

In this project, PHP [20] code, with the help of a little AJAX [21], handles the search queries. PHP [20] is used to connect to the MySQL database, and AJAX [21] is used to make the database queries interactive. The steps used in creating and executing the code are as follows:

Connecting to the MySQL database.

Mentioning the hostname, username and password that connect to the database.

Selecting the database.

Selecting the table.

Issuing a query that accepts any request from the user and matches it against the title field of the table.

Checking the query for errors.

Returning results in a flexible pattern.

Returning error results when nothing is entered in the search box.
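The steps above can be sketched as follows. The project itself uses PHP and MySQL; this example uses Python's built-in sqlite3 module as a stand-in so that it runs without a database server. The sig table and title field come from the project, while the url column, the sample rows, and the search_titles function are assumptions for illustration.

```python
import sqlite3


def search_titles(conn, query):
    """Return rows whose title contains `query`; reject an empty query."""
    query = query.strip()
    if not query:
        # Corresponds to returning an error when the search box is empty.
        raise ValueError("Please enter a search term.")
    # Escape LIKE wildcards so that user input is matched literally.
    escaped = query.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    cursor = conn.execute(
        "SELECT title, url FROM sig WHERE title LIKE ? ESCAPE '\\' ORDER BY title",
        ("%" + escaped + "%",),  # bound parameter, never string concatenation
    )
    return cursor.fetchall()


# In-memory stand-in database with hypothetical sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sig (title TEXT, url TEXT)")
conn.executemany(
    "INSERT INTO sig VALUES (?, ?)",
    [
        ("Data Mining Group", "http://example.org/dm"),
        ("Robotics Society", "http://example.org/robots"),
    ],
)

print(search_titles(conn, "mining"))
```

Passing the search term as a bound parameter rather than pasting it into the SQL string avoids SQL injection, which matters for any form that accepts arbitrary user input.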

PHP has helped in building an automated search engine that retrieves data from the MySQL database. The search engine code uses CSS for styling the webpage for display and uses HTML for creating the forms for the search query.

4.4 Displaying

The search engine checks the query and matches the characters entered by the user with the fields in the database. If any garbage value is entered, it will check that and return an error result. Three different tables are created for University information, Special Interest Groups and Conference alerts.

Since there are three tables, each holding one category of information, three separate forms for querying and displaying were created. The processing pages, which contain the PHP [20] code that connects to MySQL, are linked to the HTML page containing the code for the forms.

On entering the search portal, the user can submit a query, which is run against the respective table and displayed in the respective form. To make this easier, the search portal offers a choice of sections at the start, and the user proceeds to whichever section the query concerns.

HTML and CSS make it easier to display the search results in a viewable format to the users.

Chapter 5

Building a User Interface for web data extractor

The user needs an automated application that returns results as soon as a query is entered into the search box. At first, AJAX code was tried for simpler connectivity, since it works well for interactive database access.

However, it was decided to write PHP code to connect to the MySQL database.

Fig1.5 Query "a" displaying 6 results.

Fig1.6 Query "Abac Clubs" returned 1 result.

Chapter 6



Step 1: Start WampServer and enter the credentials for logging into MySQL Database

Step 2: Use PHP code to connect to MySQL Database

Step 3: Load Text files containing the extracted data (Meta tags) into the database

Step 4: If an error occurs while loading, check the credentials and go to Step 2

Step 5: If there is no error, go to Step 3 and enter the load command into the client for the next file

Step 6: Text files loaded into MySQL database.


Step 1: Enter query inside the search box

Step 2: The PHP code matches the query with the fields in the database

Step 3: If no records match the query, no results are returned; go to Step 1

Step 4: If no error, it will return all the rows corresponding to the field matched.

Total execution time: 2 seconds

Chapter 7

Experimental Setup

This project focused on three areas of higher education:

Conference Alerts.

Special Interest Groups.

University Information.

A few seed URLs were chosen for each of the above areas and supplied to WebDataExtractor, as shown in Fig 1.3. After the results were exported to text files, they looked like the following:

Fig1.7 Special Interest Group Information from a URL stored in a text file with tab spaces as delimiters.

Similarly, the other files were also collected and stored as text files.

These files were loaded into the MySQL database. This requires downloading MySQL, which is available as source code, as an msi package, and as part of bundled packages such as XAMPP or WampServer 2.0i. In this case, WampServer was downloaded.

Fig1.8. Starting the Wamp Server.

Fig1.9 After starting the Wamp Server, MySQL console is started.

7.1 Test Case Scenario

Fig1.10 Database and table created. Text file is loaded into the database table.

Create database test: This command creates a database named "test" on the MySQL server; its tables use the MyISAM engine.

Use test: This command selects the database "test" from among the databases stored on the MySQL server.

Load Data Infile: This command bulk-loads a text file into a database table, which is useful when a large file of data has to be imported. The path given must exactly match the file's location on disk.
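In MySQL the load step is a single LOAD DATA INFILE statement, for which tab is the default field delimiter. As a rough, runnable stand-in that needs no MySQL server, the sketch below reads a tab-delimited text file with Python's csv module and inserts the rows into an in-memory SQLite table. The file contents, the url column, the index name, and the load_tab_file helper are assumptions for illustration.

```python
import csv
import os
import sqlite3
import tempfile


def load_tab_file(conn, path):
    """Insert each two-field, tab-separated row of `path` into `sig`; return the count."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t")
        # Skip blank or malformed lines rather than failing the whole load.
        rows = [tuple(row) for row in reader if len(row) == 2]
    conn.executemany("INSERT INTO sig (title, url) VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


# Stand-in database: a table plus an index on the searched column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sig (title TEXT, url TEXT)")
conn.execute("CREATE INDEX idx_sig_title ON sig (title)")

# Write a small hypothetical tab-delimited file to load.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("Data Mining Group\thttp://example.org/dm\n")
    tmp.write("Robotics Society\thttp://example.org/robots\n")
    sample_path = tmp.name

loaded = load_tab_file(conn, sample_path)
os.remove(sample_path)
print(loaded, "rows loaded")
```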

7.2 MySQL and phpMyAdmin

phpMyAdmin is a free software tool written in PHP, intended to handle the administration of MySQL over the Web.

In this case, phpMyAdmin was used to check whether the database and tables were being loaded correctly.

Fig1.11 phpMyAdmin showing the sig table.

Fig1.12 Databases can be viewed in phpMyAdmin.

The MySQL commands correspond to the databases created in phpMyAdmin. The software itself has tools to create/drop/modify a database. It can also create tables and insert node values.

Fig1.13 Insert values in phpMyAdmin.

The structure of the database can also be viewed under the "structure" tab.

Fig1.14 Structure of the database "test".

If a particular field is selected, the SQL query can be given to get the output.

Fig1.15 SQL Query giving output.

7.3 PHP and HTML

The search portal is built in PHP so that users can easily search the database for information on academic websites.

Fig1.16 PHP code that created the search portal showing options.

Fig1.17 Clicking on the options will lead to this page.

Fig1.18 When the user selects conference alerts, this form is shown; it displays search results for conference alerts across the universities covered.

Chapter 8

Applications of Web Mining

The main application of web mining is discovering patterns on the Web. Web usage mining is the application of data mining techniques to discover usage patterns from web data, in order to understand and better serve the needs of web-based applications. Web usage mining consists of three phases:

Preprocessing.

Pattern discovery.

Pattern analysis.

Fig1.19 Web usage mining. [7]

Preprocessing: It consists of converting the usage, content and structure information contained in the available data sources into the abstractions necessary. [7]

Pattern Discovery: It draws upon methods and algorithms from statistics, data mining, machine learning and pattern recognition. [7]

Pattern Analysis: It is the last step in the overall web usage mining process. The main objective of pattern analysis is to filter out uninteresting rules or patterns from the set found in the pattern discovery phase. [7]
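The three phases can be illustrated with a deliberately small sketch. The log format, the sample log lines, the choice of page pairs as the pattern type, and the min_support threshold are all assumptions made for this example; real web usage mining works on server logs and uses far richer methods.

```python
from collections import Counter
from itertools import combinations

# Hypothetical access-log lines in the form "<visitor id> <requested page>".
LOG_LINES = [
    "u1 /home", "u1 /courses", "u1 /staff",
    "u2 /home", "u2 /courses",
    "u3 /home", "u3 /staff",
    "u3 /favicon.ico",
]


def preprocess(lines):
    """Preprocessing: group page requests by visitor, dropping icon requests."""
    sessions = {}
    for line in lines:
        visitor, page = line.split()
        if page.endswith(".ico"):
            continue
        sessions.setdefault(visitor, set()).add(page)
    return sessions


def discover_patterns(sessions):
    """Pattern discovery: count how many visitors requested each pair of pages."""
    counts = Counter()
    for pages in sessions.values():
        counts.update(combinations(sorted(pages), 2))
    return counts


def analyse_patterns(counts, min_support):
    """Pattern analysis: keep only pairs seen in at least `min_support` sessions."""
    return {pair: n for pair, n in counts.items() if n >= min_support}


sessions = preprocess(LOG_LINES)
frequent = analyse_patterns(discover_patterns(sessions), min_support=2)
print(frequent)
```

Here the analysis phase discards the page pair that only one visitor requested, mirroring the filtering role described above.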

Chapter 9


In this project, many issues were faced while setting up a local server at home. WampServer 2.0i was used; it is easy to configure and bundles Apache, MySQL, and PHP.

The problem starts when MySQL stops working because of a particular file called mysqld. This mostly happens after re-installing WampServer. If WampServer keeps causing trouble, one can always switch to XAMPP, which offers a control panel with features very similar to those of Wamp.

To resolve the errors involving the mysqld file, the document "" had to be edited.

Installing Wamp into C:\wamp on the hard drive in Windows XP without a Firewall

First WampServer is downloaded online.

The installation file is run with the default settings. It prompts for the browser to be used, which means pointing it to the browser's executable file.

Wamp is installed and then run, so that its icon appears in the quick launch area. When the server is running, the little icon should be fully white. If it is not, left click to see the menu and choose the option Start All Services. If no errors appear, left click and choose Localhost. This opens the browser, which should show the Wamp page.

To configure Apache, left click on the Wamp icon and go to Config Files. Open the httpd.conf file in Notepad and go to Edit/Find.

Choose Apache/Restart Service after saving the changes.

The best way would be to manually configure Apache with PHP to get the PHP codes running on a local server.

Chapter 10


The continued growth of the World Wide Web increases the need for new techniques that improve its features. This project analyzed ways of extracting web data and importing it into a simple MySQL database. Next, an attempt was made to understand how to store the data as XML files and check their validity using W3C tools and SAX parsers. Detailed screenshots of the work in this area are shown, and a general architecture of a system for Web mining is presented.

10.1 Future Directions

Web mining is an important future direction for data mining and for making the World Wide Web easier to use. Mining methods make it possible to sort through the core of the Web quickly. Some issues still need to be kept in check: organizations need to develop large-scale tools that raise the efficiency of the whole system, for example by detecting and correcting errors quickly. This project should be useful across a broad spectrum, since most projects in this field concentrate on the current state of the art. The topic does not grow old and will always leave room for further enhancement.