Biggest Problem Of Web Data Mining Computer Science Essay



With the advancement of information technology, the World Wide Web has become a popular source for information retrieval. The web revolution has changed the way people search for and find information, and the web has become an important tool for communicating ideas, conducting business and entertainment. Millions of pages are added every day, and millions of others are deleted or modified. The web is an open medium. Its strength is that one can find information on just about anything, even if the quality of that information varies; its weakness is the sheer abundance of information. Users rely on a number of search engines for retrieval, but because of the volume of information they cannot quickly and efficiently retrieve what meets their needs. Data on the web is displayed using HTML, which determines only how the data is presented in the browser; HTML does not handle unstructured data well and is not a means of storing data. The biggest problem of web data mining is that HTML can describe neither the meaning of data nor its structure, which makes it difficult to mine data on the web efficiently. One workaround is for developers to learn a special-purpose query language, but learning it takes time and the language cannot be used in any other situation. The emergence of XML-based web data mining provides an effective way to solve the problem of unstructured data, as XML presents data in a structured format and also allows data to be stored on the server. XML provides powerful functionality and flexibility to web-based application software, which brings great advantages to both developers and users. XML can describe data in a simple, open and extensible way. In XML-based web data mining, the client can process and select data according to its needs.


Literature review

Data Mining

Data mining is defined as the search for relationships and global patterns that exist in large databases but are hidden among huge amounts of information. It is the non-trivial extraction of implicit, previously unknown and potentially useful information from data. The data is often voluminous but, as it stands, of low value because no direct use can be made of it; it is the hidden information that is valuable and useful. Data mining is a complex process that requires a variety of steps before useful results are obtained. It is neither a simple nor an inexpensive process that anyone with a database can carry out.[1]

Data mining techniques

Association rules: The goal of association rule mining is to determine which items are frequently purchased together, so that they may be grouped together on store shelves or the information may be used for cross-selling. Association rule mining has many applications beyond market basket analysis, including marketing, customer segmentation, medicine, electronic commerce, classification, clustering, web mining and finance.

Classification: It is defined as the process of learning a function that maps a data item into one of several predefined classes. Examples include classifying trends in financial markets and the automated identification of objects of interest in large image databases.

Clustering: The focus of clustering is to find groups that are very different from each other in a collection of data. The clusters may be mutually exclusive and exhaustive, or may consist of a richer representation such as hierarchical or overlapping categories. In this technique the user is often required to specify the number of groups expected.
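
The association rule idea described above can be sketched in a few lines of Python. This is only a minimal illustration restricted to pairs of items: the transactions, the function name pair_rules and the two thresholds are assumptions made for the example and do not come from the essay or its sources.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket transactions (illustrative data only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]


def pair_rules(baskets, min_support=0.4, min_confidence=0.7):
    """Return (antecedent, consequent, support, confidence) rules for item pairs."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            # Confidence: of the baskets containing lhs, how many also contain rhs.
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


rules = pair_rules(transactions)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

With this sample data only the bread and butter pair passes both thresholds, which is the kind of result a retailer might use for shelf placement or cross-selling.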

The knowledge discovery process involves the following steps:

Data cleaning: Eliminating noise and inconsistent data.

Data integration: Combining multiple data sources.

Data selection: Retrieving the data relevant to the analysis task.

Data transformation: Transforming data into a form that is suitable for mining, using summary or aggregation operations.

Data mining: Using intelligent methods to extract data patterns.

Model evaluation: Identifying the truly interesting patterns.

Knowledge representation: Using visualization and knowledge representation techniques to present the mined knowledge to users.[2]
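
The steps above can be read as a pipeline in which each stage feeds the next. The following Python sketch is one possible illustration under stated assumptions: the two data sources, the field names and every function name are hypothetical, and the "mining" step is deliberately trivial (ranking items by revenue).

```python
from collections import defaultdict

# Two hypothetical data sources (illustrative records only).
store_sales = [
    {"customer": "c1", "item": "book", "amount": 12.0},
    {"customer": "c2", "item": "pen", "amount": None},  # noisy record
    {"customer": "c1", "item": "pen", "amount": 3.0},
]
web_sales = [
    {"customer": "c2", "item": "book", "amount": 15.0},
    {"customer": "c3", "item": "book", "amount": 11.0},
]


def clean(records):
    """Data cleaning: drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]


def select(records, fields):
    """Data selection: keep only the fields relevant to the task."""
    return [{f: r[f] for f in fields} for r in records]


def transform(records):
    """Data transformation: aggregate the amount spent per item."""
    totals = defaultdict(float)
    for r in records:
        totals[r["item"]] += r["amount"]
    return dict(totals)


def mine(totals):
    """Data mining: rank items by total revenue (a trivial stand-in pattern)."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


def evaluate(patterns, min_total):
    """Evaluation: keep only patterns above an interest threshold."""
    return [p for p in patterns if p[1] >= min_total]


records = integrate(clean(store_sales), clean(web_sales))
totals = transform(select(records, ["item", "amount"]))
interesting = evaluate(mine(totals), min_total=10.0)

# Knowledge representation: present the result to the user.
for item, total in interesting:
    print(f"{item}: {total:.2f}")
```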

Web data Mining

Web data mining is an interdisciplinary technology that draws on the web, data mining, information science and other fields. It can be defined as the analysis of the relationships among document content and the use of available resources in order to find knowledge that is valid, potentially valuable and ultimately understandable; it is a non-trivial process that yields patterns, rules, regularities, constraints and visualizations.[3] Web data mining uses data mining technologies to extract information from the web. It combines data mining with web technology: it is an integrated technology for extracting, from WWW resources, information that is implicit, previously unknown and of interest.[4] Because web data mining uses data mining technology to identify and extract information from web documents and services, the various forms of documents and the user access information on the web constitute the objects of web data mining.[5]

The major differences between conventional text search and searching on the web

Hyperlinks: Text documents do not have hyperlinks, whereas links play a vital role in web documents. Web hyperlinks provide important information to users.

Type of information: Web pages consist of frames, animated objects, multimedia objects, text and images, whereas text documents consist mainly of text and have few other objects, such as diagrams, figures, tables and images.

Dynamics: Millions of web pages are added to the web every day, whereas text documents do not change frequently. Finding a previous version of a web page is almost impossible, and links pointing to a page may work today but not tomorrow.

Quality: The quality of text documents is usually high because they pass through a quality control process, whereas web data is often of low quality.

Huge size: A few libraries are undoubtedly very large, but the web is much larger than any library of printed books.

Document use: The way web documents are used differs considerably from the way conventional documents are used.

Four basic reasons for web data mining


When using web data extraction software, businesses eliminate the delays that used to accompany the manual collection of information. Sick leave and traffic jams are no longer causes for nervous breakdowns, especially for tasks that are essential to daily business operations and that require special attention.


To err is human, and mistakes are inevitable even if the web data extraction task is assigned to the most attentive, intelligent and meticulous employees. Software-based web data extraction, by contrast, leaves little room for such mistakes.


Unlike people, computer programs can easily be re-programmed to meet a company's changing web data extraction needs.


Using software for web data extraction is much cheaper than doing it manually. Consider the labour that goes into doing the work by hand rather than with software.

Biggest problem facing the research of data mining on web

The data on the web is irregular and semi-structured, and it lacks a unified, fixed pattern. From the database point of view, each site on the web is a highly complex data source, and no two sites organize their information in the same way. The web as a whole is therefore a large, heterogeneous data environment that is difficult for a user to handle. Most of the information on the current web is still described in HTML, which can only be displayed in the browser: it does not describe the meaning or structure of the data and cannot be stored as data, which makes it difficult to mine data from the web efficiently. The situation can be handled by adopting a special query language and then saving the extracted information into a database. This requires developers to spend time learning a separate query language that cannot be used in any other situation, and even a simple change to a page requires the extraction code to be re-mapped, which makes the approach less efficient. Web pages are highly dynamic, changing almost daily. The large number of web pages that disappear every day creates enormous problems. The web is also increasingly becoming multilingual.
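
The contrast between markup that describes presentation and markup that describes meaning can be illustrated with a small Python sketch. The two snippets and the TextCollector class are hypothetical examples written for this illustration; they only show that HTML tags such as b and i say how text looks, while XML tags such as title and price say what the text is.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical snippets describing the same book (illustrative only).
html_snippet = "<p><b>Data Mining</b> <i>45.00</i></p>"
xml_snippet = "<book><title>Data Mining</title><price>45.00</price></book>"


class TextCollector(HTMLParser):
    """Collect (enclosing tag, text) pairs from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip() and self.stack:
            self.pairs.append((self.stack[-1], data.strip()))


collector = TextCollector()
collector.feed(html_snippet)
collector.close()
html_pairs = collector.pairs  # tags only say "bold" and "italic"

book = ET.fromstring(xml_snippet)
xml_record = {child.tag: child.text for child in book}  # tags name the fields

print(html_pairs)
print(xml_record)
```

A program reading the HTML version has to guess that the bold text is a title and the italic text is a price, whereas the XML version states this directly.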

Figure 2.1 Classification of web data mining

Web content mining:

Web content mining refers to the process of mining the content of web pages, or descriptions of them, and extracting knowledge from it. There are two kinds of web content mining, distinguished by the objects mined. Text document mining covers plain text, HTML-tagged or XML-tagged semi-structured data, and unstructured free-format text. Multimedia document mining covers images, audio, video and other media types. Web content mining may also draw on the hyperlinks found in a page's structure and on the relationships between pages. Text summarization extracts key information from documents and summarizes and explains their content in a concise form, so that users do not need to browse the full text. Its purpose is to condense the text information and give a compact description. Text classification is the core of text mining. Automatic text classification uses a large number of texts with class labels to train classification rules or model parameters, and then uses the trained result to identify the class of texts whose type is unknown. It not only allows users to browse documents easily, but also makes searching for documents more convenient by limiting the search scope.
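
Automatic text classification as described above (train on labelled texts, then label unseen text) can be sketched with a small naive Bayes classifier. Naive Bayes is chosen here only as a familiar example; the essay does not name a specific algorithm, and the training texts, labels and function names are all hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labelled training texts (illustrative only).
training = [
    ("sports", "the team won the match"),
    ("sports", "the player scored a goal"),
    ("finance", "the bank raised interest rates"),
    ("finance", "the market price of shares fell"),
]


def train(examples):
    """Count words per class and documents per class."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for label, text in examples:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocabulary


def classify(text, model):
    """Return the class with the highest log-probability (Laplace smoothing)."""
    word_counts, class_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            smoothed = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(training)
print(classify("the player won a goal", model))
print(classify("interest rates fell", model))
```

The predicted label can then be used to restrict a search to one category, which is the narrowing of search scope the paragraph refers to.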

Web structure mining:

Web structure mining derives knowledge from the organizational structure of the World Wide Web and the relationships between links. Because documents are interconnected, the World Wide Web can provide useful information beyond the content of the documents themselves. Using this information, we can rank the pages and find the most important ones among them. Web structure mining covers not only the hyperlink structure between documents but also the internal structure of documents and the directory path structure in URLs. Its aim is to discover the link structure that is assumed to underlie the web.
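
Ranking pages by their link structure can be illustrated with a simplified PageRank iteration. PageRank is used here as one well-known link-analysis technique; the essay does not say which algorithm is meant, and the link graph, the damping factor of 0.85 and the iteration count are assumptions for the sketch.

```python
# Hypothetical link graph: page -> pages it links to (illustrative only).
# Every link target is assumed to appear as a key in the graph.
links = {
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["home", "about"],
    "blog": ["home"],
}


def pagerank(graph, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of rank along links."""
    pages = sorted(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in graph.items():
            if not outgoing:
                # A page without outgoing links shares its rank with all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
        rank = new_rank
    return rank


ranks = pagerank(links)
ranked_pages = sorted(ranks, key=ranks.get, reverse=True)
for page in ranked_pages:
    print(f"{page}: {ranks[page]:.3f}")
```

In this graph the page that receives the most links ranks first and the page that receives none ranks last, even though the calculation never looks at page content.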

Web usage mining:

Web usage mining extracts information from the access logs left on servers when users visit the web. That is, it mines the access patterns of visited web sites in order to find the browsing patterns of users and the frequency with which pages are visited. There are two kinds of tracking in the analysis of users' browsing patterns: the first is general access pattern tracking for user groups, and the second is personalized usage tracking for a single user. The objects mined reside on the server and include logs such as server log data.

There are two kinds of method for discovering usage information. The first analyzes log files, in one of two ways:

1) Pre-treatment: the log data is mapped into relational tables, and the corresponding data mining technology is then applied to the log data.

2) Accessing the log data directly to obtain the user's navigation information.

The second kind discovers users' navigation behaviour through the collection and analysis of their click events.[6]
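
The pre-treatment route described above, in which raw log lines are mapped into records before analysis, can be sketched as follows. The log lines are hypothetical and assume the Common Log Format; treating each client address as one visitor is a simplifying assumption, and the names LOG_PATTERN and parse are made up for the example.

```python
import re
from collections import Counter, defaultdict

# Hypothetical access-log lines in Common Log Format (illustrative only).
log_lines = [
    '10.0.0.1 - - [01/Jan/2024:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 1024',
    '10.0.0.1 - - [01/Jan/2024:10:00:05 +0000] "GET /products.html HTTP/1.1" 200 2048',
    '10.0.0.2 - - [01/Jan/2024:10:01:00 +0000] "GET /index.html HTTP/1.1" 200 1024',
    '10.0.0.2 - - [01/Jan/2024:10:01:30 +0000] "GET /missing.html HTTP/1.1" 404 0',
    "malformed line",
]

LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)


def parse(lines):
    """Pre-treatment: map raw log lines onto records, skipping malformed lines."""
    return [m.groupdict() for m in map(LOG_PATTERN.match, lines) if m]


records = parse(log_lines)
successful = [r for r in records if r["status"] == "200"]

# Frequency of visits per page (general access pattern).
page_frequency = Counter(r["path"] for r in successful)

# Sequence of pages per visitor (personalized usage record).
paths_by_visitor = defaultdict(list)
for r in successful:
    paths_by_visitor[r["host"]].append(r["path"])

print(page_frequency.most_common())
print(dict(paths_by_visitor))
```

The two results correspond to the two kinds of tracking mentioned earlier: visit frequencies across all users and the navigation path of each single user.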

Log data analysis has been investigated using the techniques listed below:

Using association rules

Using composite association rules

Using cluster analysis

Related Work

The process of extracting data in data mining

Li Lan [18] introduces the concept and characteristics of web-based data mining and proposes general methods for it. XML is used to transform semi-structured data into well-structured data.

Figure 2.2 The process of extracting data in data mining

The appearance of XML has made this more convenient: XML is used to transform semi-structured data into well-structured data. XSL is essentially a formatting or text parsing language. Formatting refers to applying consistent templates to XML data so that it can be displayed in a consistent manner. For example, a set of rows from a relational database table stored as an XML document can be displayed very easily by applying the same template to each row. Practical applicability at Oxford University.
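
The idea of applying the same template to each row can be imitated in Python. This is not XSL itself: the Python standard library has no XSLT processor, so the sketch below uses xml.etree.ElementTree and string.Template to mimic the one-template-per-row approach. The row data, the template and the render function are hypothetical, and field values are not escaped.

```python
import xml.etree.ElementTree as ET
from string import Template

# Hypothetical rows from a relational table, stored as XML (illustrative only).
table_xml = """
<rows>
  <row><name>Alice</name><city>Leeds</city></row>
  <row><name>Bob</name><city>York</city></row>
</rows>
"""

# One template, applied identically to every row.
row_template = Template("<tr><td>$name</td><td>$city</td></tr>")


def render(xml_text, template):
    """Apply the same template to each <row> and wrap the result in a table."""
    root = ET.fromstring(xml_text.strip())
    rendered = [
        template.substitute({field.tag: field.text for field in row})
        for row in root.findall("row")
    ]
    return "<table>" + "".join(rendered) + "</table>"


html_table = render(table_xml, row_template)
print(html_table)
```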


Advantages of this model

Well-structured data representation.


Disadvantages of this model

XML data is static.

Existing research.

How to improve the efficiency of data mining methods.

Mining of dynamic data and knowledge.

Data mining problems in networked and distributed environments.

Web mining framework based on XML

Cheng Zheng [19] describes the implementation process of a specific web mining system and puts forward an improved scheme for parsing XML documents with VTD, which addresses the difficulty of mining the web caused by the largely unstructured nature of its information. The paper's main emphasis is on web content mining using XML technology; its focus is how to extract data structures from web pages based on XML technology.

Figure 2.3 Web mining framework based on XML

The paper states that XML helps to normalize network information, so that developers and computers can easily recognize web information and create open data that does not depend on platforms or languages and is not limited in format. Technologies such as CSS, XSL and XSLT can be used to display the same XML document in many different interfaces, meeting the display requirements of a variety of web access devices such as PDAs and cell phones.

Advantages of this model

Improved efficiency: following the principle of template matching, a style sheet named test.xsl is defined and then applied to the document merge.xml. This method improves speed.

JTidy can automatically carry out necessary changes to make code consistent with the requirements of XHTML.

A file is created to display an error message.


Disadvantages of this model

Data is static.

XSL is not used.

JTidy can only deal with English pages. This problem is due to non-uniform conversion of the byte stream.

When a page is very large, the corresponding HTML file is very complex, so there are problems in the format of the XML output.

XML based web data mining model graph

Pengwei [20] addresses the difficulty that the inhomogeneous and dynamically updated semi-structured data in web pages poses for web data mining. To solve this problem, the paper presents an XML-based web data mining model and describes in detail, with a worked instance, how to implement the model with XML and Java technologies.

Figure 2.4 XML based web data mining model graph

The model implementation steps are as follows:

To obtain the data source pages.

To map the HTML documents into XHTML documents; XHTML is an application of XML.

To retrieve the data reference point.

To map the data into XML documents.

To merge the results, then process and display the data.
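
The later steps listed above can be sketched in Python rather than the Java used in the paper. This is an illustration under stated assumptions, not the paper's implementation: the two pages are hypothetical and are assumed to be already well-formed XHTML (the HTML-to-XHTML step is not performed here), the table with id 'prices' is an invented reference point, and all function names are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML pages, assumed to be the output of an HTML-to-XHTML
# clean-up step that is not shown here.
pages = [
    "<html><body><h1>Shop A</h1><table id='prices'>"
    "<tr><td>pen</td><td>1.20</td></tr><tr><td>ink</td><td>3.50</td></tr>"
    "</table></body></html>",
    "<html><body><table id='prices'>"
    "<tr><td>paper</td><td>4.00</td></tr></table></body></html>",
]


def find_reference_point(xhtml_root):
    """Retrieve the data reference point: here, the table with id 'prices'."""
    return xhtml_root.find(".//table[@id='prices']")


def map_to_xml(table):
    """Map each table row into an <item> element with named children."""
    items = ET.Element("items")
    for row in table.findall("tr"):
        name, price = (cell.text for cell in row.findall("td"))
        item = ET.SubElement(items, "item")
        ET.SubElement(item, "name").text = name
        ET.SubElement(item, "price").text = price
    return items


def merge(results):
    """Merge the per-page results into a single XML document."""
    merged = ET.Element("items")
    for result in results:
        merged.extend(list(result))
    return merged


merged = merge(map_to_xml(find_reference_point(ET.fromstring(p))) for p in pages)
merged_xml = ET.tostring(merged, encoding="unicode")
print(merged_xml)
```

The merged document names each value (name, price), so later processing no longer depends on the layout of the original pages.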

Tidy, the tool used in this paper, is freely available software released on the W3C website; it corrects common errors in an HTML document and produces well-formed output such as XHTML.

XSL is an XML-based language. It provides an effective transformation mechanism for displaying XML documents and helps to separate the XML data content from the presentation format.


Advantages of this model

The approach is flexible and extensible.

XSL is used effectively for transformation.

XQL checks, converts, constructs and integrates XML documents and extracts the required information from one or more data sources.

Web extraction at low maintenance cost.


Disadvantages of this model

Relies on context matching based on XQL and path expressions, so it works only when the structure and content of the web pages change little.

Data is dynamic.

The reference point must be found in the XML tree every time data is extracted.

The path expression is too absolute, which might lead to failure.

Time complexity.

Does not have a database.
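
The concern about path expressions being too absolute can be illustrated with a short Python sketch. The two pages, the path strings and the extract function are hypothetical; the example uses the limited XPath subset supported by xml.etree.ElementTree rather than XQL.

```python
import xml.etree.ElementTree as ET

# Hypothetical page before a layout change (illustrative only).
old_page = ET.fromstring(
    "<html><body><div><span class='price'>9.99</span></div></body></html>"
)
# The same data after a small change: one extra wrapper element.
new_page = ET.fromstring(
    "<html><body><div><p><span class='price'>9.99</span></p></div></body></html>"
)

ABSOLUTE_PATH = "./body/div/span"          # every step from the root is fixed
RELATIVE_PATH = ".//span[@class='price']"  # anchored on the element itself


def extract(page, path):
    """Return the text at the given path, or None if the path matches nothing."""
    node = page.find(path)
    return node.text if node is not None else None


print(extract(old_page, ABSOLUTE_PATH))
print(extract(new_page, ABSOLUTE_PATH))
print(extract(new_page, RELATIVE_PATH))
```

The absolute path stops matching as soon as one wrapper element is added, while the path anchored on the element's own attribute still finds the value.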