Exploratory Data Analysis And Data Driven Discovery Computer Science Essay



The term 'data mining' simply means finding hidden information in a database. Hence it is also called:

Exploratory data analysis

Data Driven Discovery

Deductive Learning

It refers loosely to the process of semi-automatically analyzing large databases to find useful patterns. Like knowledge discovery in artificial intelligence (machine learning) or statistical analysis, data mining attempts to discover rules and patterns from data, but it differs from them in that it deals with large volumes of data, stored primarily on disk. Hence data mining deals with "knowledge discovery in databases". Our goal in this study is to give an overview of past and current data mining research activities and of data mining applications such as Web mining. Interest in data mining has been motivated by changes in the business environment, as customers have become more demanding and markets have become saturated. Data mining works with huge databases, ranging from gigabytes to terabytes, that are growing at an unprecedented rate, so decisions must be made rapidly and with maximum knowledge.

Here we elaborate on one of the developing applications of data mining: "Web mining". Web mining is the application of data mining techniques to discover patterns from the Web. It is a very active research topic that combines two active research areas: data mining and the World Wide Web. Web mining research relates to several research communities, such as databases, information retrieval and artificial intelligence.

This paper is a survey based on recently published research papers. Besides providing an overall view of Web mining, it focuses on Web usage mining. User privacy is another important issue in this paper. Finally, along with some other interesting research issues, a brief overview of current research work in the area of Web usage mining is included.


Web mining can be considered the application of general data mining techniques to the Web. The Web changes the ways of doing business, providing and receiving education, managing organizations, and so on. Its most direct effect is the complete change in how information is collected, conveyed, and exchanged. Today, the Web has become the largest information source available on the planet. It is a huge, explosive, diverse, dynamic and mostly unstructured data repository, which supplies an incredible amount of information and also raises the complexity of dealing with that information from different points of view: users, Web service providers, and business analysts. Users want effective search tools to find relevant information easily and precisely. Web service providers want to predict users' behavior and personalize information in order to reduce the traffic load and design Web sites suited to different groups of users. Business analysts want tools to learn the needs of users and consumers. All of them expect tools or techniques that help them satisfy their demands and solve the problems encountered on the Web. Therefore, Web mining has become an active and popular research field.

Mining the Web poses several difficulties. Firstly, even though the Web contains a huge volume of data, that data is distributed across the Internet, so before mining we need to gather the Web documents together. Secondly, Web pages are semi-structured, so for easy processing documents should be extracted and represented in some common format. Thirdly, Web information tends to be diverse in meaning, so the training or testing data set should be large enough. Despite these difficulties, the Web also provides other ways to support mining.

Although Web mining is deeply rooted in data mining, it is not equivalent to data mining. The unstructured nature of Web data makes Web mining more complex. Web mining research is actually a converging area of several research communities, such as databases, information retrieval and artificial intelligence, as well as psychology and statistics.

Besides the challenge of finding relevant information, users can also encounter other difficulties when interacting with the Web, such as judging the quality of the information found, creating new knowledge out of the information available on the Web, personalizing the information found, and learning about other users.


Web mining is the discovery of useful information from the World Wide Web and its usage patterns. The data may be actually present in Web pages or may be data related to Web activity. The Web can be viewed as the largest database available, and it presents a challenging task for effective design and access. This process seems very easy, but how can it be implemented?

Determining the size of the World Wide Web is extremely difficult. Google recently announced that it indexes 3 billion Web documents.

Although there is some confusion about Web mining, the most widely recognized approach is to categorize it into three areas:

Web Usage Mining

Web Content Mining

Web Structural Mining

It is believed that Oren Etzioni first proposed the term Web mining in his 1996 paper. In that paper, he described Web mining as the use of data mining techniques to automatically discover and extract information from World Wide Web documents and services. Many later researchers cited this definition in their work. In the same paper, Etzioni raised the question of whether effective Web mining is feasible in practice. Today, with the tremendous growth of the data sources available on the Web and the dramatic popularity of e-commerce in the business community, Web mining has become the focus of quite a few research projects and papers. Some commercial considerations have also been put on the agenda.

Researchers have suggested a similar way to decompose Web mining into the following subtasks:

a. Resource Discovery: The task of retrieving the intended information from the Web.

b. Information Extraction: Automatically selecting and pre-processing specific information from the retrieved Web resources.

c. Generalization: Automatically discovering general patterns both at individual Web sites and across multiple sites.

d. Analysis: Analyzing the mined pattern.
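The four subtasks above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of any cited paper: resource discovery is replaced by a hypothetical in-memory dictionary RETRIEVED of already fetched pages, information extraction keeps the visible words of each page, generalization counts in how many pages each word occurs, and analysis keeps the words shared by every page. All names (RETRIEVED, TextExtractor, extract_terms, generalize, shared_terms) are invented for this example.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# a. Resource discovery (assumed already done): hypothetical pages a crawler retrieved.
RETRIEVED = {
    "http://site-a.example/1": "<html><title>Data mining</title>"
                               "<body><p>Mining patterns from data.</p></body></html>",
    "http://site-b.example/2": "<html><title>Web logs</title>"
                               "<body><p>Patterns in Web server logs.</p>"
                               "<script>var x = 1;</script></body></html>",
}


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def extract_terms(html):
    """b. Information extraction: the set of lowercase words visible on a page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return set(re.findall(r"[a-z]+", " ".join(parser.chunks).lower()))


def generalize(pages):
    """c. Generalization: count in how many pages each term occurs."""
    document_frequency = Counter()
    for html in pages.values():
        document_frequency.update(extract_terms(html))
    return document_frequency


def shared_terms(document_frequency, n_pages):
    """d. Analysis: keep only the terms that occur in every page."""
    return sorted(t for t, c in document_frequency.items() if c == n_pages)
```

Calling shared_terms(generalize(RETRIEVED), len(RETRIEVED)) on the sample pages returns the terms common to both sites. A real system would replace the dictionary with a crawler and the word sets with a richer document representation.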


Web usage mining tries to discover useful information from the secondary data derived from users' interactions while surfing the Web. It collects data from Web log records to discover user access patterns of Web pages. There are several available research projects and commercial products that analyze those patterns for different purposes. The applications generated from this analysis can be classified as personalization, system improvement, site modification, business intelligence and usage characterization. The challenges involved in Web usage mining can be divided into three phases:

1. Pre-processing. The purpose of the mining is to produce results that can be used in design tasks such as Web site design, Web server design and navigation through a Web site. However, before applying a data mining algorithm, we must perform data preparation to convert the raw data into the data abstractions necessary for further processing. The data can be collected at the server side, at the client side, at proxy servers, or obtained from a database. For each type of data collection, the difference lies not only in the location but also in the available data types, the segment of the population from which the data was collected, and the method of implementation. The information sources available to mine include Web usage logs, Web page descriptions, Web site topology, user registries, and questionnaires.

2. Pattern discovery. This is the key component of Web mining. Several different methods and algorithms from statistics, data mining, machine learning and pattern recognition can be applied to identify user patterns.

3. Pattern analysis. Pattern analysis is the final stage of Web usage mining. The goal of this process is to eliminate irrelevant rules or patterns and to extract the interesting ones from the output of the pattern discovery process. The output of Web mining algorithms is often not in a form suitable for direct human consumption, and thus needs to be transformed into a format that can be assimilated easily. The two most common approaches to pattern analysis are:

Knowledge query mechanism - SQL

Multidimensional data cube
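As a rough illustration of the three phases, the sketch below runs them on a handful of invented server log lines. Several choices here are assumptions made for the example rather than anything prescribed by the survey: the log lines follow the Common Log Format, a user is identified by host address, requests for images and style sheets are dropped as noise, a session ends after 30 minutes of inactivity, and the "pattern" is simply a page-to-page transition with its support (the fraction of sessions containing it). The pattern analysis step uses the SQL query mechanism through Python's built-in sqlite3 module.

```python
import re
import sqlite3
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical server-side log lines in Common Log Format.
LOG_LINES = [
    '10.0.0.1 - - [01/Jan/2024:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 1024',
    '10.0.0.1 - - [01/Jan/2024:10:01:00 +0000] "GET /products.html HTTP/1.1" 200 2048',
    '10.0.0.2 - - [01/Jan/2024:10:02:00 +0000] "GET /index.html HTTP/1.1" 200 1024',
    '10.0.0.1 - - [01/Jan/2024:11:30:00 +0000] "GET /index.html HTTP/1.1" 200 1024',
    '10.0.0.2 - - [01/Jan/2024:10:03:00 +0000] "GET /logo.png HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2024:10:04:00 +0000] "GET /products.html HTTP/1.1" 200 2048',
]

LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)
SESSION_GAP = timedelta(minutes=30)  # assumed inactivity timeout
NOISE_SUFFIXES = (".png", ".jpg", ".gif", ".css")  # assumed non-page requests


def preprocess(lines):
    """Phase 1: parse raw lines, drop noise, and split requests into sessions."""
    records = []
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m is None or m["path"].endswith(NOISE_SUFFIXES):
            continue
        ts = datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z")
        records.append((m["host"], ts, m["path"]))
    records.sort()  # by host, then by time
    sessions = []
    last_seen = {}  # host -> (last timestamp, index of its current session)
    for host, ts, path in records:
        prev = last_seen.get(host)
        if prev is None or ts - prev[0] > SESSION_GAP:
            sessions.append([])
            idx = len(sessions) - 1
        else:
            idx = prev[1]
        sessions[idx].append(path)
        last_seen[host] = (ts, idx)
    return sessions


def discover_patterns(sessions):
    """Phase 2: count the sessions that contain each page-to-page transition."""
    counts = Counter()
    for pages in sessions:
        counts.update(set(zip(pages, pages[1:])))
    return counts


def analyze_patterns(counts, n_sessions, min_support=0.5):
    """Phase 3: store the patterns in a table and filter them with an SQL query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pattern (src TEXT, dst TEXT, support REAL)")
    conn.executemany(
        "INSERT INTO pattern VALUES (?, ?, ?)",
        [(src, dst, c / n_sessions) for (src, dst), c in counts.items()],
    )
    rows = conn.execute(
        "SELECT src, dst, support FROM pattern "
        "WHERE support >= ? ORDER BY support DESC",
        (min_support,),
    ).fetchall()
    conn.close()
    return rows
```

On the sample lines, preprocess yields three sessions, two of which move from /index.html to /products.html, so that transition is the only pattern that passes the default support threshold. Real logs would also need user identification that is more robust than the host address, which proxies and shared machines make unreliable.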


The challenge for Web structure mining is to deal with the structure of the hyperlinks within the Web itself. Link analysis is an old area of research. However, with the growing interest in Web mining, research on structure analysis has increased, and these efforts have resulted in a newly emerging research area called link mining. The goal of Web structure mining is to generate a structural summary of Web sites and Web pages. Web structure mining can be used to reveal the structure (schema) of Web pages, which is useful for navigation and makes it possible to compare and integrate Web page schemas. This type of structure mining facilitates the introduction of database techniques for accessing information in Web pages by providing a reference schema.

The structural information generated by Web structure mining includes the following:

The information measuring the frequency of the local links in the Web tuples in a Web table.

The information measuring the frequency of Web tuples in a Web table containing links that are interior and the links that are within the same document.

The information measuring the frequency of Web tuples in a Web table that contains links that are global and the links that span different Web sites.

The information measuring the frequency of identical Web tuples that appear in a Web table or among the Web tables.
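The link frequencies listed above can be illustrated with a small sketch that classifies the links of one page. The reading of the three link types used here is partly an assumption: "interior" means a link to the same document (as stated above), "global" means a link to a different Web site (as stated above), and "local" is taken to mean a link to another document on the same site, which the text does not define. The page URL, the HTML fragment and all function names are hypothetical, and a single page stands in for the Web tuples of a Web table.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def classify_link(page_url, href):
    """Return 'interior', 'local' or 'global' for one link on page_url."""
    target = urljoin(page_url, href)  # resolve relative links
    if urldefrag(target).url == urldefrag(page_url).url:
        return "interior"  # same document, e.g. an in-page anchor
    if urlparse(target).netloc == urlparse(page_url).netloc:
        return "local"  # assumed meaning: same site, different document
    return "global"  # different Web site


def link_frequencies(page_url, html):
    """Count the links of each type found in the page."""
    collector = LinkCollector()
    collector.feed(html)
    return Counter(classify_link(page_url, h) for h in collector.hrefs)


# Hypothetical page used only for illustration.
PAGE_URL = "http://example.com/docs/index.html"
PAGE_HTML = """
<a href="#top">Top</a>
<a href="guide.html">Guide</a>
<a href="/about.html">About</a>
<a href="http://other.example.org/">Elsewhere</a>
"""
```

For the sample page, link_frequencies(PAGE_URL, PAGE_HTML) counts one interior link, two local links and one global link. Comparing host names is a simplification, since one site may be served from several hosts.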


To discover the nature of the hierarchy or network of hyperlinks in the Web sites of a particular domain.

It helps to generalize the flow of information in Web sites that may represent some particular domain, so that query processing becomes easier and more efficient.

Web structure mining has a natural relation to Web content mining, since Web documents are very likely to contain links, and both use the real or primary data on the Web. It is quite common to combine these two mining tasks in an application.



Web content mining targets knowledge discovery in which the main objects are the traditional collections of text documents and, more recently, collections of multimedia documents such as images, video and audio, which are embedded in or linked to Web pages. Web content mining can be differentiated from two points of view: the agent-based approach and the database approach.

The first approach aims at improving information finding and filtering, and can be placed into the following three categories:

Intelligent Search Agents

These agents search for relevant information using domain characteristics and user profiles to organize and interpret the discovered information.

Information Filtering/ Categorization

These agents use information retrieval techniques and characteristics of open hypertext Web documents to automatically retrieve, filter, and categorize them.

Personalized Web Agents

These agents learn user preferences and discover Web information based on these preferences, and preferences of other users with similar interest.
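To make the idea of a personalized Web agent more concrete, the sketch below learns a simple user profile from pages the user liked and ranks new pages by their similarity to that profile. The technique (bag-of-words term vectors compared with cosine similarity) is a common information retrieval device chosen for illustration, not one named in the surveyed work, and all function names and sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Represent a text as a bag of lowercase words with their counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_pages(liked_texts, candidates):
    """Build a profile from liked texts and rank candidate pages against it."""
    profile = Counter()
    for text in liked_texts:
        profile.update(term_vector(text))
    scored = [(cosine(profile, term_vector(text)), name)
              for name, text in candidates.items()]
    return [name for _, name in sorted(scored, reverse=True)]


# Hypothetical data used only for illustration.
LIKED = [
    "data mining finds patterns in large databases",
    "web mining applies data mining to web logs",
]
CANDIDATES = {
    "a": "mining web usage logs for patterns",
    "b": "recipes for baking bread at home",
}
```

With the sample data, rank_pages(LIKED, CANDIDATES) places page "a" ahead of page "b", because only "a" shares terms with the profile. An agent that also uses the preferences of other users with similar interests would compare profiles between users in the same way.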

The second approach aims at modeling the data on the Web in a more structured form, so that standard database querying mechanisms and data mining applications can be used to analyze it.

Multimedia data mining is the part of content mining that extracts high-level information and knowledge from large online multimedia sources. Multimedia data mining on the Web has recently attracted the attention of many researchers. Working towards a unifying framework for representation, problem solving, and learning from multimedia is a real challenge. This research area is still in its infancy, and much work remains to be done.


In this paper we have surveyed the research area of Web mining, focusing on its three categories: Web usage mining, Web structure mining and Web content mining. We have discussed the various phases involved in Web usage mining. We have also described the structural information generated by Web structure mining, and its tasks.

Finally, we wrapped up this paper with Web content mining. We reviewed the different approaches to Web content mining and briefly discussed multimedia data mining, which is a part of Web content mining.