Looking At Data Mining Computer Science Essay


Nowadays, digital information is relatively easy to capture and fairly inexpensive to store. The digital revolution has seen collections of data grow in size and the complexity of the data therein increase. A question commonly arising from this state of affairs is: having gathered such quantities of data, what do we actually do with it? It is often the case that large collections of data, however well structured, conceal implicit patterns of information that cannot be readily detected by conventional analysis techniques. Such information may often be usefully analyzed using a set of techniques referred to as knowledge discovery or data mining. These techniques essentially seek to build a better understanding of data and, by building characterizations of data that can be used as a basis for further analysis, extract value from volume.

1.1.1 Definition of Data Mining

Data mining refers to extracting or "mining" knowledge from large amounts of data [1].

Jiawei Han, Micheline Kamber

The process of extracting previously unknown, comprehensible and actionable information from large databases and using it to make crucial business decisions (Simoudis, 1996) [2].

The nontrivial extraction of implicit, previously unknown, and potentially useful information from data.

William J. Frawley, Gregory Piatetsky-Shapiro and Christopher J. Matheus

Data mining is the nontrivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data [3].

Arun K. Pujari

Researchers' Views: Different researchers have offered differing views of data mining [3].

Data mining is the nontrivial extraction of implicit, previously unknown and potentially useful information from data. This encompasses a number of technical approaches, such as clustering, data summarization, classification, finding dependency networks, analyzing changes and detecting anomalies.

Data mining is the search for relationships and global patterns that exist in large databases but are hidden among vast amounts of data, such as the relationship between patient data and medical diagnoses. These relationships represent valuable knowledge about the database and the objects in it, provided the database is a faithful mirror of the real world it registers.

Data mining refers to using a variety of techniques to identify nuggets of information or decision-making knowledge in the database and extracting these in such a way that they can be put to use in areas such as decision support, prediction, forecasting and estimation. The data is often voluminous, but it has low value and no direct use can be made of it. It is the hidden information in the data that is useful.

Discovering relations that connect variables in a database is the subject of data mining. The data mining system learns from the previous history of the investigated system, formulating and testing hypotheses about the rules that the system obeys. When concise and valuable knowledge about the system of interest is discovered, it can be incorporated into a decision support system, which helps the manager make wise and informed business decisions.

Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition techniques as well as statistical and mathematical techniques.

1.1.2 Knowledge Discovery Process

It consists of an iterative sequence of the following steps [1]:

Data Cleaning: Removes noise and inconsistencies present in the data.

Data Integration: Combines multiple data sources.

Data Selection: Retrieves the data relevant to the analysis task from the database.

Data Transformation: Consolidates the data into forms appropriate for mining, for example by performing summary or aggregation operations.

Data Mining: The essential process in which intelligent methods are applied in order to extract data patterns.

Pattern Evaluation: Evaluates the extracted patterns on the basis of interestingness measures.

Knowledge Presentation: Presents the mined knowledge to the user using visualization and knowledge representation techniques such as reports.

[Figure: stages from data sources to output are Databases / Flat Files, Cleaning and Integration, Selection and Transformation, Data Mining, Evaluation and Presentation]

Figure 1.1: Architecture of Knowledge Discovery Process
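The sequence of steps in the knowledge discovery process can be sketched as a small chain of functions. This is only an illustrative sketch: the record layout, the function names and the spending threshold are assumptions made for the example, and the "mining" step is a toy average rather than a real mining algorithm.

```python
from statistics import mean

# Hypothetical raw records from two sources; None marks a missing value.
source_a = [{"customer": "c1", "amount": 20.0}, {"customer": "c2", "amount": None}]
source_b = [{"customer": "c1", "amount": 35.0}, {"customer": "c3", "amount": 5.0}]


def clean(records):
    """Data cleaning: drop records whose amount is missing."""
    return [r for r in records if r["amount"] is not None]


def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]


def select(records, min_amount):
    """Data selection: keep only records relevant to the analysis task."""
    return [r for r in records if r["amount"] >= min_amount]


def transform(records):
    """Data transformation: group the amounts by customer."""
    grouped = {}
    for r in records:
        grouped.setdefault(r["customer"], []).append(r["amount"])
    return grouped


def mine(grouped):
    """Data mining (toy stand-in): compute the average spend per customer."""
    return {customer: mean(amounts) for customer, amounts in grouped.items()}


def evaluate(patterns, threshold):
    """Pattern evaluation: keep only patterns above an interestingness threshold."""
    return {c: avg for c, avg in patterns.items() if avg > threshold}


def present(patterns):
    """Knowledge presentation: render the result as a short report."""
    return [f"{c}: average spend {avg:.2f}" for c, avg in sorted(patterns.items())]


integrated = integrate(clean(source_a), clean(source_b))
selected = select(integrated, min_amount=1.0)
report = present(evaluate(mine(transform(selected)), threshold=10.0))
print(report)
```

In practice the process is iterative, so a disappointing evaluation step would typically send the analyst back to an earlier stage with different selection or transformation choices.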

1.1.3 Architecture of Data Mining

A typical data mining system may have the following major components [1]:

Database, Data Warehouse, World Wide Web or other information repository: This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.

Database or Data Warehouse server: The database or data warehouse server is responsible for fetching the relevant data, based on the user's mining request.

Knowledge Base: This is the domain knowledge used to guide the search or to evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction. Other examples of domain knowledge are additional interestingness constraints or thresholds, and metadata.

Data Mining Engine: This is essential to the data mining system and consists of a set of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis and evolution analysis.

Pattern Evaluation Module: This component employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns. It may use interestingness thresholds to filter out discovered patterns. This module may be integrated with the mining module, depending on the implementation of the data mining method used. For efficient data mining, it is highly recommended to push the evaluation of pattern interestingness as deep as possible into the mining process, so as to confine the search to only the interesting patterns.

User Interface: This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on intermediate data mining results.





[Figure: components shown are Other Info Repositories; Data cleaning, integration and selection; Database or Data Warehouse Server; Data Mining Engine; Pattern Evaluation; Knowledge Base; User Interface]

Figure 1.2: Architecture of Data Mining System
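The role of the pattern evaluation module, filtering discovered patterns against interestingness thresholds, can be illustrated with a minimal sketch. Support and confidence are used here as the interestingness measures purely as an assumption for the example; the `Pattern` class, the threshold values and the sample patterns are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    description: str
    support: float      # assumed measure: fraction of records the pattern covers
    confidence: float   # assumed measure: how often the rule holds when it applies


def filter_interesting(patterns, min_support=0.1, min_confidence=0.6):
    """Keep only the patterns that meet both interestingness thresholds."""
    return [
        p for p in patterns
        if p.support >= min_support and p.confidence >= min_confidence
    ]


# Hypothetical output of a mining engine.
candidates = [
    Pattern("bread -> butter", support=0.30, confidence=0.75),
    Pattern("milk -> batteries", support=0.02, confidence=0.90),
    Pattern("tea -> sugar", support=0.15, confidence=0.40),
]

interesting = filter_interesting(candidates)
print([p.description for p in interesting])
```

Applying the same thresholds inside the mining loop, rather than afterwards as done here, is what the text means by pushing the evaluation deep into the mining process.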

1.1.4 Goals of Data Mining

Data mining helps in achieving the following goals or tasks [4].

Prediction: Data mining can show how certain attributes within the data will behave in the future. Examples of predictive data mining in the business context include the analysis of buying transactions to predict what consumers will buy under certain discounts and how much sales volume a store will generate in a given period. In a scientific context, certain seismic wave patterns may predict an earthquake with high probability.

Identification: Data patterns can be used to identify the existence of an item, an event or an activity. For example, in biological applications, the existence of a gene may be identified by certain sequences of nucleotide symbols in the DNA sequence. Identification also covers authentication, where it is ascertained whether a user is indeed a specific user or one from an authorized class; this involves a comparison of parameters, images or signals.

Classification: Data mining can partition the data so that different classes or categories can be identified based on combinations of parameters. For example, customers in a supermarket can be categorized into discount-seeking shoppers, shoppers in a rush, loyal regular shoppers and infrequent shoppers. This classification may be used in further analyses of customer buying transactions as a post-mining activity.

Optimization: One eventual goal of data mining is to optimize the use of limited resources such as time, space, money or materials and to maximize output variables such as sales or profits under a given set of constraints.

These goals are realized with the help of different approaches, such as discovery of sequential patterns, discovery of patterns in time series, discovery of classification rules, regression, neural networks, genetic algorithms, clustering and segmentation.

1.1.5 Applications of Data Mining

Data mining applications are continuously being developed in various industries to uncover hidden knowledge that helps increase business efficiency and grow businesses [3].

Data Mining Applications in Sales/Marketing: Data mining enables businesses to understand the patterns hidden inside past purchase transactions, thus helping them plan and launch new marketing campaigns in a prompt and cost-effective way. The following are several data mining applications in sales and marketing.

Data mining is used for market basket analysis to provide insight into which product combinations were purchased, when they were bought and in what sequence. This information helps businesses promote their most profitable products to maximize profit. In addition, it encourages customers to purchase related products that they may have missed or overlooked.

Retail companies use data mining to identify customers' buying behavior patterns.
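Market basket analysis, as mentioned above, looks at which products are bought together. A minimal sketch of the idea follows, counting how often pairs of items co-occur and how often one item accompanies another. The transactions are made-up sample data, and support and confidence are standard measures chosen here for illustration rather than anything prescribed by the text.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase transactions, one set of items per basket.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "jam"},
]


def pair_support(baskets):
    """Return the fraction of baskets that contain each pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    total = len(baskets)
    return {pair: count / total for pair, count in counts.items()}


def confidence(baskets, antecedent, consequent):
    """Of the baskets containing `antecedent`, the fraction that also contain `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)


support = pair_support(transactions)
print(support[("bread", "butter")])
print(confidence(transactions, "bread", "butter"))
```

Pairs with high support and confidence are the natural candidates for joint promotion or adjacent shelf placement.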

Data Mining Applications in Banking / Finance: Several data mining techniques, such as distributed data mining, have been researched, modeled and developed to help detect credit card fraud.

Data mining is used to identify customer loyalty by analyzing customers' purchasing activities, such as the frequency of purchases in a period of time, the total monetary value of all purchases and the date of the last purchase. After analyzing those dimensions, a relative measure is generated for each customer. The higher the score, the more loyal the customer is relative to others.

Data mining is used to help banks retain credit card customers. By analyzing past data, data mining can help banks predict which customers are likely to change their credit card affiliation, so that they can plan and launch special offers to retain those customers.

Credit card spending by customer groups can be identified using data mining.

Hidden correlations between different financial indicators can be discovered using data mining.

From historical market data, data mining makes it possible to identify stock trading rules.
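The loyalty measure described in this list combines purchase frequency, total monetary value and the recency of the last purchase. The sketch below shows one way such a score might be computed. The weights, the one-year recency window and the sample customers are hypothetical choices made only for illustration; a real scoring scheme would be calibrated on actual data.

```python
from datetime import date


def loyalty_score(purchases, today, max_recency_days=365):
    """Combine frequency, monetary value and recency into one relative score.

    `purchases` is a list of (purchase_date, amount) tuples.
    """
    if not purchases:
        return 0.0
    frequency = len(purchases)
    monetary = sum(amount for _, amount in purchases)
    days_since_last = (today - max(day for day, _ in purchases)).days
    # Recency runs from 1.0 (bought today) down to 0.0 (a year or more ago).
    recency = max(0.0, 1.0 - days_since_last / max_recency_days)
    # Hypothetical weights, chosen only to put the three dimensions on a similar scale.
    return frequency * 1.0 + monetary / 100.0 + recency * 5.0


today = date(2024, 3, 11)
regular = [(date(2024, 1, 10), 50.0), (date(2024, 2, 10), 70.0), (date(2024, 3, 1), 80.0)]
occasional = [(date(2023, 6, 1), 30.0)]

print(loyalty_score(regular, today) > loyalty_score(occasional, today))
```

Only the ordering of the scores matters here: the score is a relative measure, so customers are compared with each other rather than against an absolute scale.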

Data Mining Applications in Health Care and Insurance: The growth of the insurance industry depends entirely on the ability to convert data into knowledge, information or intelligence about customers, competitors and markets. Data mining has been applied in the insurance industry only recently, but it has brought tremendous competitive advantages to the companies that have implemented it successfully. The data mining applications in the insurance industry are listed below:

Data mining is applied in claims analysis, such as identifying which medical procedures are claimed together.

Data mining makes it possible to forecast which customers will potentially purchase new policies.

Data mining allows insurance companies to detect risky customer behavior patterns.

Data mining helps detect fraudulent behavior.

Data Mining Applications in Transportation

Data mining helps determine the distribution schedules among warehouses and outlets and analyze loading patterns.

Data Mining Applications in Medicine

Data mining makes it possible to characterize patient activities in order to anticipate upcoming office visits.

Data mining helps identify the patterns of successful medical therapies for different illnesses.

1.1.6 Advantages of Data Mining

Data mining offers various advantages, as follows [5]:

Marketing / Retail

Data mining helps marketing companies build models based on historical data to predict who will respond to new marketing campaigns such as direct mail or online marketing campaigns. Through this prediction, marketers can take an appropriate approach to selling profitable products to targeted customers. Data mining benefits retail companies in the same way. Through market basket analysis, a store can arrange its products so that customers can conveniently buy frequently purchased products together. In addition, it helps the retail company offer discounts on particular products that will attract customers.

Finance / Banking

Data mining gives financial institutions information about loans and credit reporting. By building a model from previous customers' data with common characteristics, banks and financial institutions can estimate which loans are good or bad and their risk level. In addition, data mining can help banks detect fraudulent credit card transactions and so help card owners avoid losses.


Manufacturing

By applying data mining to operational engineering data, manufacturers can detect faulty equipment and determine optimal control parameters. For example, semiconductor manufacturers have faced the challenge that, even when the conditions of the manufacturing environments at different wafer production plants are similar, the quality of the wafers is not the same, and some wafers contain defects for unknown reasons.

Data mining has been applied to determine the ranges of control parameters that lead to the production of the golden wafer. Those optimal control parameters are then used to manufacture wafers of the desired quality.

Governments

Data mining helps government agencies by digging through and analyzing records of financial transactions to build patterns that can detect money laundering or criminal activity.

1.1.7 Disadvantages of Data Mining

The disadvantages of using data mining include the following [6]:

Privacy Issues

Concerns about personal privacy have increased enormously in recent years, especially as the internet booms with social networks, e-commerce, forums and blogs. Because of privacy issues, people are afraid that their personal information will be collected and used in unethical ways that could cause them a lot of trouble. Businesses collect information about their customers in many ways in order to understand their purchasing behavior and trends. However, businesses do not last forever; some day they may be acquired by others or go out of business. At that point, the personal information they own may be sold to others or leaked.

Security Issues

Security is a big issue. Businesses hold information about their employees and customers, including social security numbers, birthdays and payroll data. However, how well this information is protected is still in question. There have been many cases in which hackers accessed and stole large amounts of customer data from big corporations such as Ford Motor Credit Company and Sony. With so much personal and financial information available, credit card theft and identity theft have become a big problem.

Misuse of information/inaccurate information

Information collected through data mining for marketing or other ethical purposes can be misused. It may be exploited by unethical people or businesses to take advantage of vulnerable people or to discriminate against a group of people.

In addition, data mining techniques are not perfectly accurate; if inaccurate information is used for decision-making, it can have serious consequences.

1.1.8 Issues and Challenges in Data Mining

Data mining applications rely on databases to supply the raw data for input. Problems in the databases or the data (e.g. volatility, incompleteness, noise and volume) are compounded by the time the data reaches the data mining task. Other problems arise from the adequacy and relevance of the information stored [7].

Limited Information

A database is often designed for purposes other than data mining, and sometimes the properties or attributes that would simplify the learning task are not present, nor can they be requested from the real world. Inconclusive data causes problems because, if some attributes essential to knowledge about the application domain are not present in the data, it may be impossible to discover significant knowledge about that domain. For example, one cannot diagnose malaria from a patient database if that database does not contain the patients' red blood cell counts.

Noise and missing values

Databases are usually contaminated by errors, so it cannot be assumed that the data they contain is entirely correct.

Attributes that rely on subjective or measurement judgments can give rise to errors such that some examples may even be misclassified. Errors in either the values of attributes or class information are known as noise. Where possible, it is desirable to eliminate noise from the classification information, as it affects the overall accuracy of the generated rules. Missing data can be treated by discovery systems in a number of ways, such as:

Simply disregard missing values

Omit the corresponding records

Infer missing values from known values

Treat missing data as a special value to be included additionally in the attribute domain

Average over the missing values using Bayesian techniques
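Three of the treatments listed above (omitting records, inferring values from known values, and treating missing data as a special value) can be sketched in a few lines. The records and field names are invented for the example, and inferring a value by taking the mean of the known values is just one simple assumption about how inference might be done.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
records = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 45, "income": None},
    {"age": 29, "income": 48000},
]


def omit_incomplete(rows):
    """Omit every record that has at least one missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def impute_mean(rows, field):
    """Infer missing values of `field` from the mean of the known values."""
    fill = mean(r[field] for r in rows if r[field] is not None)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]


def mark_missing(rows, field, marker="unknown"):
    """Treat a missing value as a special value in the attribute domain."""
    return [{**r, field: marker if r[field] is None else r[field]} for r in rows]


complete = omit_incomplete(records)
with_ages = impute_mean(records, "age")
with_marker = mark_missing(records, "income")
print(len(complete), with_ages[1]["age"], with_marker[2]["income"])
```

Each function returns new records and leaves the input untouched, so the strategies can be compared side by side on the same data.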

Noisy data in the sense of being imprecise is characteristic of all data collection and typically fits a regular statistical distribution such as the Gaussian, while wrong values are data entry errors. Statistical methods can treat problems of noisy data and separate different types of noise.


Uncertainty refers to the severity of the error and the degree of noise in the data. Data precision is an important consideration in a discovery system.

Size, updates, and irrelevant fields

Databases tend to be large and dynamic, in that their contents are ever-changing as information is added, modified or removed. The problem with this from the data mining perspective is how to ensure that the rules are up to date and consistent with the most current information. The learning system also has to be time-sensitive, as some data values vary over time and the discovery system is affected by the "timeliness" of the data.

Another issue is the relevance or irrelevance of the fields in the database to the current focus of discovery. For example, post codes are fundamental to any study trying to establish a geographical connection to an item of interest, such as the sales of a product.


Web data mining, or Web mining, is the application of data mining techniques to extract knowledge from Web data. Web mining has been explored extensively, and different techniques have been proposed for a variety of applications, including Web search, classification and personalization. Web data mining can be defined as the discovery and analysis of useful information from WWW data. The Web involves three types of data: data on the WWW, Web log data regarding the users who browsed the Web pages, and Web structure data [8].

Research in this area has the objectives of helping e-commerce businesses in their decision making, assisting in the design of good Web sites and assisting the user when navigating the Web.


The ongoing increase in the amount of Web data has led to the explosive growth of Web data repositories. Web pages and their contents are accessed and provided by a wide variety of applications and they are added and deleted every day. Moreover, the Web does not provide its users with a standard coherent page structure across Web sites. These facts make it very difficult to analyze the content of Web pages by automated tools.

Therefore, there arises a need for Web data mining techniques. Data mining involves the study of data-driven techniques to discover and model hidden patterns in large volumes of raw data. The application of data mining techniques to Web data is referred to as Web data mining. Web data mining can be divided into three distinct areas: Web content mining, Web structure mining and Web usage mining. Web content mining involves efficiently extracting useful and relevant information from millions of Web sites and databases. Web structure mining involves the techniques used to study the schema of Web pages and the collection of hyperlinks. Web usage mining, on the other hand, involves the analysis and discovery of user access patterns from Web servers in order to better serve the users' needs.

1.2.1 Types of Web Data

The World Wide Web contains various information sources in different formats [9]. As stated above, the World Wide Web involves three types of data; the categorization is given in Figure 1.3.

[Figure: Web data divided into Content Data (Free Texts, HTML Files, XML Files, Dynamic Content), Structure Data (Static Links, Dynamic Links) and Usage Data]

Figure 1.3: Types of Web Data

Web Content Data

Web content data is the data that web pages are designed to present to users. It consists of free text, semi-structured data such as HTML pages, and more structured data such as automatically generated HTML pages, XML files or data in tables related to web content. Textual, image, audio and video data types fall into this category. The most common web content data is the HTML pages on the web.

HTML (Hypertext Markup Language)

HTML is designed to determine the logical organization of documents with hypertext extensions. HTML was first implemented by Tim Berners-Lee at CERN and was popularized by the Mosaic browser developed at NCSA. In the 1990s it became widespread with the growth of the Web. Since then, HTML has been extended in various ways.

The WWW depends on web page authors and vendors sharing the same HTML conventions. Different browsers may render an HTML document in different ways.

To illustrate, one browser may indent the beginning of a paragraph, while another may only leave a blank line. However, the base structure remains the same and the organization of the document is constant.

HTML instructions divide the text of a web page into sub-blocks called elements. HTML elements fall into two categories: those that define how the body of the document is to be displayed by the browser, and those that define information about the document, such as the title or relationships to other documents.

Another common kind of web content data is XML documents.

XML

XML is a markup language for documents containing structured information. Structured information contains both the content and information about what the content includes and stands for. Almost all documents have some structure. XML has been accepted as a markup language, which is a mechanism for identifying structures in a document. The XML specification defines a standard way to add markup to documents. XML does not specify semantics or a tag set; in fact, it is a meta-language for describing markup. It provides mechanisms to define tags and the structural relationships between them. All of the semantics of an XML document are defined either by the applications that process it or by style sheets.

Dynamic Server Pages

Dynamic server pages are also an important part of web content data. Dynamic content is any web content that is processed or compiled by the web server before the results are sent to the web browser. Static content, on the other hand, is content that is sent to the browser without modification. Common forms of dynamic content are Active Server Pages (ASP), PHP (Hypertext Preprocessor) pages and JavaServer Pages (JSP). Today, several web servers support more than one type of dynamic server page.

Web Structure Data

Web structure data describes the organization of the content. Intra-page structure information includes the arrangement of various HTML or XML tags within a given page. Inter-page structure information consists of the hyperlinks connecting one page to another. The web graph is constructed from the hyperlink information in web pages and has been widely adopted as the core description of web structure. It is the most widely accepted way of representing web structure related to web page connectivity (dynamic and static links). The web graph is a representation of the WWW at a given time. It stores the link structure and connectivity between the HTML documents in the WWW. Each node in the graph corresponds to a unique web page or document. An edge represents an HTML link from one page to another.

The general properties of web graphs are given below:

Directed, very large and sparse

Highly dynamic

- Nodes and edges are added and deleted very often

- Content of the existing nodes is also subject to change

- Pages and hyperlinks are created on the fly

Apart from the primary connected component, there are also smaller disconnected components

The size of the web graph varies from one domain to another.








Figure 1.4: Web Graph for a Particular Web Domain
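A web graph of this kind is a directed graph in which pages are nodes and hyperlinks are edges. The following minimal sketch stores both link directions per page and supports adding and removing pages, reflecting the dynamic nature of the graph. The `WebGraph` class and the page paths are hypothetical and serve only to illustrate the structure.

```python
class WebGraph:
    """A small directed graph: nodes are pages, edges are hyperlinks."""

    def __init__(self):
        self.out_links = {}  # page -> set of pages it links to
        self.in_links = {}   # page -> set of pages that link to it

    def add_page(self, url):
        self.out_links.setdefault(url, set())
        self.in_links.setdefault(url, set())

    def add_link(self, source, target):
        self.add_page(source)
        self.add_page(target)
        self.out_links[source].add(target)
        self.in_links[target].add(source)

    def remove_page(self, url):
        """Delete a page together with every link into or out of it."""
        for target in self.out_links.pop(url, set()):
            self.in_links[target].discard(url)
        for source in self.in_links.pop(url, set()):
            self.out_links[source].discard(url)


# Hypothetical pages within one web domain.
graph = WebGraph()
graph.add_link("/index", "/about")
graph.add_link("/index", "/products")
graph.add_link("/about", "/index")
graph.add_page("/orphan")  # a disconnected component with no links

print(sorted(graph.out_links["/index"]))
print(sorted(graph.in_links["/index"]))
```

Keeping incoming links alongside outgoing ones costs extra memory, but it makes questions such as "which pages point here?" cheap to answer, which matters for link-based analyses.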

The edges of the web graph have the following semantics: outgoing arcs stand for the hypertext links contained in the corresponding page, and incoming arcs represent the hypertext links through which the corresponding page is reached. The web graph is used in applications such as web indexing, detection of web communities and web searching. The web graph as a whole grows at an amazing rate.

Web Log Data

Web usage data includes web log data from web server access logs, proxy server logs, browser logs, registration data, cookies and any other data generated as the result of web users' interactions with web servers. Web log data is created on the web server. Every web server has a unique IP address and a domain name. When a user enters a URL in a browser, the request is sent to the web server. A web server log, containing web server data, is created as a result of the httpd process that runs on web servers. All types of server activity, such as successes, errors and lack of response, are logged into a server log file. Web servers dynamically produce and update four types of "usage" log files: access log, agent log, error log and referrer log. Web access logs have fields containing web server data, including the date, time, user's IP address, user action, request method and requested data. Error logs include data about specific events such as "file not found," "document contains no data," or configuration errors, giving the server administrator information on "problematic and erroneous" links on the server. Another type of data recorded in the error log is aborted transmissions. Agent logs provide data about the browser, browser version and operating system of the requesting user.

User Profile Data

User profile data provide information about the users of a Web site. A user profile contains demographic information for each user of a Web site, as well as information about users' interests and preferences. Such information is acquired through registration forms or questionnaires, or can be inferred by analyzing Web usage logs.

1.2.2 Types of Web Data Mining

World Wide Web data mining focuses on three issues: Web structure mining, Web content mining and Web usage mining [10].

Web Content Mining

Web content mining is the process of extracting useful information from the contents of Web documents. Content data corresponds to the collection of facts a Web page was designed to convey to its users. It may consist of text, images, audio, video, or structured records such as lists and tables. Web content mining focuses on various techniques that assist in searching the Web for documents whose content meets a certain goal. Those documents, once found, are used to build a knowledge base. The emphasis here is on analyzing Internet hypertext material. The Internet data that is available in digital form has to be prepared for analysis.

A large body of research has been conducted in this area in the past few years. For instance, Zaiane & Han (2000) [11] focused on resource recovery on the Web. The authors made use of a multi-layered database model to transform the unstructured data on the Web into a form acceptable to database technology. Research activities in this field also involve using techniques from other disciplines such as information retrieval (IR) and natural language processing (NLP).

Web Structure Mining

Web structure mining is the process of using graph theory to analyze the node and connection structure of a web site. According to the type of web structural data, web structure mining can be divided into two kinds:

Extracting patterns from hyperlinks in the web: a hyperlink is a structural component that connects the web page to a different location.

Mining the document structure: analysis of the tree-like structure of page structures to describe HTML or XML tag usage.
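Both kinds of structural information, hyperlinks and tag usage within a page, can be pulled out of an HTML document with Python's standard `html.parser` module. The sketch below counts start tags and collects link targets. The `StructureParser` class and the sample page are invented for illustration, and a real system would go on to analyze the full tag tree rather than simple counts.

```python
from collections import Counter
from html.parser import HTMLParser


class StructureParser(HTMLParser):
    """Collect tag usage and hyperlink targets from one HTML page."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()  # document structure: how often each tag occurs
        self.links = []              # hyperlink structure: targets of <a href=...>

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# Hypothetical page content.
page = (
    "<html><head><title>Shop</title></head><body>"
    "<h1>Shop</h1>"
    "<p>See <a href='/products'>products</a> and <a href='/about'>about</a>.</p>"
    "</body></html>"
)

parser = StructureParser()
parser.feed(page)
print(parser.links)
print(parser.tag_counts["a"], parser.tag_counts["p"])
```

The collected links are exactly the outgoing edges this page would contribute to a web graph, while the tag counts give a crude summary of its internal structure.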

Web structure mining aims at generating a structured summary of web sites and web pages in order to identify relevant documents. The focus here is on link information, which is an important aspect of Web data. Web structure mining can be used to reveal the structure or schema of Web pages, which would facilitate Web document classification and clustering on the basis of structure (Spertus, 1997) [12]. Web structure mining is very useful in generating information such as visible Web documents, luminous Web documents and luminous paths, a luminous path being the path common to most of the results returned.

Web Usage Mining

Web usage mining is the process of extracting useful information from server logs, i.e. from users' history. It is the process of finding out what users are looking for on the Internet. Some users might be looking only at textual data, whereas others might be interested in multimedia data. Web usage mining involves the automatic discovery and analysis of patterns in the data generated by users' interactions with one or more Web sites. It focuses on tools and techniques used to study and understand users' navigation preferences and behavior by discovering their Web access patterns.

The goal of Web usage mining is to capture, model and analyze users' behavioral patterns. It therefore involves three phases: preprocessing of Web data, pattern discovery and pattern analysis (Srivastava et al., 2000) [13]. Of these, only the last phase is performed in real time. The discovered patterns are represented as collections of pages that are frequently accessed by groups of users with similar interests within the same Web site.

[Figure: Web Data Mining divided into Web Content Mining (Page Content Mining, Search Result Mining), Web Structure Mining and Web Usage Mining (Pattern Tracking, Customized Tracking)]

Figure 1.5: Web Data Mining Architecture

1.2.3 Architecture of Web Usage Mining

Data Collection

The first step in the Web usage mining process consists of gathering the relevant Web data [14], which will be analyzed to provide useful information about the users' behavior. There are two main sources of data for Web usage mining: data on the Web server side and data on the client side. Additionally, when intermediaries are introduced into the client-server communication, they can also become sources of usage data, e.g. proxy servers and packet sniffers. Each of these sources is examined in the following subsections.

Server Side Data

There are basically two types of server side data, as follows:

Server Log Files: Server side data are collected at the Web server(s) of a site. They consist primarily of various types of logs generated by the Web server. These logs record the Web pages accessed by the visitors of the site. Most Web servers support, as a default option, the Common Log File Format, which includes information about the IP address of the client making the request, the hostname and user name if available, the time stamp of the request, the name of the requested file and the file size. The Extended Log Format, which is supported by Web servers such as Apache, Netscape and Microsoft Internet Information Server, includes additional information such as the address of the referring URL, i.e. the Web page that brought the visitor to the site, the name and version of the browser used by the visitor, and the operating system of the host machine.
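As an illustration of the fields described above, the sketch below parses one line in the Common Log Format using a regular expression. The sample line uses a documentation-range IP address and a made-up user name, and the `parse_clf_line` helper is hypothetical. Real logs contain malformed lines and server-specific variations that this sketch does not handle.

```python
import re
from datetime import datetime

# host ident user [timestamp] "request" status size
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)$'
)


def parse_clf_line(line):
    """Parse one Common Log Format line into a dict, or return None if it does not match."""
    match = CLF_PATTERN.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    parts = fields["request"].split()
    return {
        "host": fields["host"],
        "user": None if fields["user"] == "-" else fields["user"],
        # Assumes English month abbreviations (the default C locale).
        "time": datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z"),
        "method": parts[0] if len(parts) > 0 else None,
        "path": parts[1] if len(parts) > 1 else None,
        "status": int(fields["status"]),
        "size": None if fields["size"] == "-" else int(fields["size"]),
    }


sample = '192.0.2.10 - alice [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
entry = parse_clf_line(sample)
print(entry["host"], entry["path"], entry["status"], entry["size"])
```

A dash in the user or size position means the value was not available, which is why both are mapped to `None` rather than treated as data.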

The problem that arises here is data reliability. The two major sources of data unreliability are Web caching and IP address misinterpretation.

The Web cache is a mechanism for reducing latency and traffic on the Web. A Web cache keeps track of Web pages that are requested and saves a copy of these pages for a certain period of time. Thus, if there is a request for the same Web page, the cached copy is used instead of making a new request to the Web server. Web caches can be configured either at the users' local browsers or at intermediate proxy servers. The problem is that, if the requested Web page is cached, the client's request does not reach the corresponding Web server holding the page. As a result, the server is not aware of the action and the page access is not recorded in the log files. One solution that has been proposed is cache-busting, i.e., the use of special HTTP headers, defined either in Web servers or in Web pages, in order to control the way that those pages are handled by caches. These headers are known as Cache-Control response headers and include directives that define which objects should be cached, how long they should be cached, etc. However, this approach works against the main motivation for using caches, i.e., the reduction of Web latency.
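The effect of caching on the server log can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions: the class name ProxyCache, the fixed time-to-live and the integer clock are all hypothetical, and real caches follow much richer rules.

```python
class ProxyCache:
    """Toy cache with a fixed time-to-live, placed in front of an origin server."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.stored_at = {}   # url -> time at which the copy was cached
        self.server_log = []  # requests that actually reached the server

    def request(self, url, now):
        cached_time = self.stored_at.get(url)
        if cached_time is not None and now - cached_time < self.ttl:
            return "cache"  # served from the cache; the server never sees it
        self.server_log.append((now, url))  # cache miss: the server logs the access
        self.stored_at[url] = now
        return "server"

cache = ProxyCache(ttl_seconds=60)
for t in (0, 10, 20, 70):
    cache.request("/index.html", now=t)
print(len(cache.server_log))  # 2 of the 4 page views are logged
```

Setting ttl_seconds to 0 mimics cache-busting in this sketch: every request reaches the server and is logged, but none of them benefits from the cache.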

The second problem, IP misinterpretation in the log files, occurs for two main reasons. The first reason is the use of intermediate proxy servers, which assign the same IP address to all users. As a result, all requests from various host machines that pass through the proxy server are recorded in the Web server log as requests from a single IP address. This can cause misinterpretation of the usage data. The same problem occurs when the same host is used by many users. The opposite problem occurs when one user is assigned many different IP addresses, e.g. due to the dynamic IP allocation that ISPs use for dial-up users. A variety of heuristics have been employed in order to alleviate the problem of IP misinterpretation. Finally, information recorded in the Web servers' log files may pose a privacy threat to Internet users.

Cookies: In addition to the use of log files, another technique that is often used in the collection of data is the dispensation and tracking of cookies. Cookies are short strings dispensed by the Web server and held by the client's browser for future use. They are mainly used to track browser visits and pages viewed. Through the use of cookies, the Web server can store its own information about the visitor in a cookie log on the client's machine. Usually this information is a unique ID that is created by the Web server, so that the next time the user visits the site, this information can be sent back to the Web server, which in turn can use it to identify the user. Cookies can also store other kinds of information such as pages visited, products purchased, etc., although the maximum size of a cookie is limited to a few Kbytes, and thus it can hold only a small amount of such information. The use of cookies causes some problems.

One problem is that many different cookies may be assigned to a single user if the user connects from different machines, while multiple users sharing the same machine also share the same cookies. In addition, users may choose to disable the browser option for accepting cookies, due to privacy and security concerns. This option is specified in the HTTP State Management Mechanism, which is an attempt by the Internet Engineering Task Force to set some cookie standards. Even when they accept cookies, users can selectively delete some of them. Cookies are also limited in number: only 20 cookies are allowed per domain, and no more than 300 cookies are allowed on the client machine. If the number of cookies exceeds these values, the least recently used will be discarded.
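The limits above can be sketched as a small cookie store that discards the least recently used cookie once a limit is exceeded. The class CookieJar and its methods are hypothetical names for illustration only; real browsers apply their own, more detailed eviction policies.

```python
from collections import OrderedDict

class CookieJar:
    """Toy cookie store with per-domain and total limits and LRU eviction."""

    def __init__(self, per_domain=20, total=300):
        self.per_domain = per_domain
        self.total = total
        self.cookies = OrderedDict()  # (domain, name) -> value, least recently used first

    def set(self, domain, name, value):
        key = (domain, name)
        self.cookies.pop(key, None)
        self.cookies[key] = value  # the most recently used cookie goes to the end
        same_domain = [k for k in self.cookies if k[0] == domain]
        if len(same_domain) > self.per_domain:
            del self.cookies[same_domain[0]]  # least recently used cookie of this domain
        if len(self.cookies) > self.total:
            self.cookies.popitem(last=False)  # least recently used cookie overall

    def get(self, domain, name):
        key = (domain, name)
        if key not in self.cookies:
            return None
        self.cookies.move_to_end(key)  # reading a cookie counts as using it
        return self.cookies[key]

# Small limits are used here only to keep the demonstration short.
jar = CookieJar(per_domain=2, total=3)
jar.set("example.com", "a", "1")
jar.set("example.com", "b", "2")
jar.get("example.com", "a")       # "a" is now more recently used than "b"
jar.set("example.com", "c", "3")  # exceeds the per-domain limit, so "b" is discarded
print(sorted(name for _, name in jar.cookies))
```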

Explicit User Input: Various user data supplied directly by the user, when accessing the site, can also be useful for personalization. User data can be collected through registration forms and can provide important personal and demographic information, as well as explicit user preferences. However, this method increases the load on the user.

Client Side Data

Client side data are collected from the host that is accessing the Web site. One of the most common techniques for acquiring client side data is to dispatch a remote agent, implemented in Java or JavaScript. These agents are embedded in Web pages, for example as Java applets, and are used to collect information directly from the client, such as the time at which the user accesses and leaves the Web site, a list of sites visited before and after the current site, i.e., the user's navigation history, etc. Client side data are more reliable than server side data, since they overcome caching and IP misinterpretation problems. However, the use of client side data acquisition methods is also problematic. One problem is that the various agents collecting information affect the client's system performance, introducing additional overhead when a user tries to access a Web site. Furthermore, these methods require the cooperation of users.

Intermediary Data

Proxy Servers: A proxy server is a software system that is usually employed by an enterprise connected to the Internet and acts as an intermediary between an internal host and the Internet so that the enterprise can ensure security, administrative control and caching services. Despite the problems that they cause, which were mentioned above, proxy servers can also be a valuable source of usage data.

Proxy servers also use access logs, with similar format to the logs of Web servers, in order to record Web page requests and responses from the server. The advantage of using these logs is that they allow the collection of information about users operating behind the proxy server, since they record requests from multiple hosts to multiple Web servers.

Packet Sniffers: A packet sniffer is a piece of software, or sometimes even a hardware device, that monitors network traffic, i.e., TCP/IP packets directed to a Web server, and extracts data from them.

One advantage of packet sniffing over analyzing raw log files is that the data can be collected and analyzed in real time. Another important advantage is the collection of network level information that is not present in the log files. This information includes detailed timestamps of the request that has taken place, such as the issue time of the request and the response time.

On the other hand, the use of packet sniffers also has important disadvantages compared to log files. Since the data are collected in real time and are not logged, they may be lost forever if something goes wrong either with the packet sniffer or with the data transmission. For example, the connection may be lost.

Data Preprocessing

Web data collected in the first stage of data mining are usually diverse and vast in volume. These data must be assembled into a consistent, integrated and comprehensive view, in order to be used for pattern discovery. As in most applications of data mining, data preprocessing involves removing and filtering redundant and irrelevant data, predicting and filling in missing values, removing noise, transforming and encoding data, as well as resolving any inconsistencies. The task of data transformation and encoding is particularly important for the success of data mining. In Web usage mining, this stage includes the identification of users and user sessions, which are to be used as the basic building blocks for pattern discovery.
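Two of the preprocessing steps named above, filtering irrelevant requests and identifying user sessions, can be sketched as follows. This is a minimal illustration under stated assumptions: the list of irrelevant file suffixes, the 30-minute session timeout, the use of the IP address as the user identifier and the sample log are all hypothetical choices.

```python
from datetime import datetime, timedelta

IRRELEVANT_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")  # assumed list
SESSION_TIMEOUT = timedelta(minutes=30)                        # assumed threshold

def filter_requests(entries):
    """Drop failed requests and requests for embedded resources."""
    return [
        e for e in entries
        if e["status"] == 200 and not e["path"].lower().endswith(IRRELEVANT_SUFFIXES)
    ]

def sessionize(entries):
    """Group requests by IP address; a long gap starts a new session."""
    sessions = []
    last_seen = {}  # ip -> (time of last request, index of the open session)
    for e in sorted(entries, key=lambda entry: entry["time"]):
        previous = last_seen.get(e["ip"])
        if previous is None or e["time"] - previous[0] > SESSION_TIMEOUT:
            sessions.append({"ip": e["ip"], "pages": []})
            index = len(sessions) - 1
        else:
            index = previous[1]
        sessions[index]["pages"].append(e["path"])
        last_seen[e["ip"]] = (e["time"], index)
    return sessions

t0 = datetime(2000, 1, 1, 12, 0)
log = [
    {"ip": "A", "time": t0, "path": "/index.html", "status": 200},
    {"ip": "A", "time": t0 + timedelta(seconds=1), "path": "/logo.gif", "status": 200},
    {"ip": "A", "time": t0 + timedelta(minutes=5), "path": "/products.html", "status": 200},
    {"ip": "A", "time": t0 + timedelta(hours=2), "path": "/index.html", "status": 200},
    {"ip": "B", "time": t0 + timedelta(minutes=1), "path": "/missing.html", "status": 404},
    {"ip": "B", "time": t0 + timedelta(minutes=2), "path": "/index.html", "status": 200},
]
sessions = sessionize(filter_requests(log))
for s in sessions:
    print(s["ip"], s["pages"])
```

In this made-up log, host A produces two sessions because its last request arrives two hours after the previous one. Identifying users by IP address alone is subject to the misinterpretation problems discussed earlier.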

Data Filtering: The very first step in data preprocessing is to clean the raw Web data. During this step the available data are examined and irrelevant or redundant items are removed from the dataset. This problem mainly concerns log data collected by Web servers and proxies, which can be particularly noisy, as they record all user interactions. For these reasons, we concentrate here on the treatment of Web log data. Data generated by client-side agents are clean, as they are explicitly collected by the system without the intervention of the user. On the other hand, user-supplied data such as registration form information need to be verified, corrected and normalized in order to assist in the discovery of useful patterns.

Pattern Discovery

In this stage, machine learning and statistical methods are used to extract patterns of usage from the preprocessed Web data. A variety of machine learning methods have been used for pattern discovery in Web usage mining.


The large majority of methods that have been used for pattern discovery from Web data are clustering methods. Clustering aims to divide a data set into groups of similar items. Clustering methods fall into the following categories:

Partitioning methods, which create k groups of a given data set, where each group represents a cluster

Hierarchical methods, which decompose a given data set, creating a hierarchical structure of clusters

Model-based methods, which find the best fit between a given data set and a mathematical model

Clustering has been used for grouping users with common browsing behavior, as well as grouping Web pages with similar content.
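As an illustration of a partitioning method, the sketch below applies plain k-means to user sessions encoded as binary page-visit vectors. The page names, the sample sessions, the choice of k-means and the use of the first k points as initial centroids are all assumptions made for the example, not a method prescribed by the text.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    return [sum(column) / len(points) for column in zip(*points)]

def k_means(points, k, iterations=10):
    """Plain k-means; the first k points serve as the initial centroids."""
    centroids = [list(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean(members)
    return assignment, centroids

# Columns: /sports, /scores, /recipes, /cooking (1 = page visited in the session)
user_sessions = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
]
assignment, centroids = k_means(user_sessions, k=2)
print(assignment)
```

With this made-up data, the sessions dominated by sports pages end up in one group and those dominated by cooking pages in the other.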


In contrast to clustering, the goal of classification is to identify the distinguishing characteristics of predefined classes, based on a set of instances, e.g. users, of each class. This information can be used both for understanding the existing data and for predicting how new instances will behave. Classification is a supervised learning process, because learning is driven by the assignment of instances to the classes in the training data.

Knowledge Post Processing

Finding patterns is not sufficient unless users can make use of them. Users can only use what they can easily view, so the discovered patterns should be presented in an understandable format such as graphical presentations, visualizations and reports, so that users can readily apply the knowledge, for example to increase profits. Visualization is an effective method for presenting comprehensive information to humans.

Figure 1.6: Web Usage Mining Architecture (diagram omitted; its stages are labeled Data Collection, Data Processing, Pattern Discovery and Knowledge Post Processing, and its data sources are labeled Server side Data, Client side Data and Intermediary Data)

1.2.4 Personalization on Web

Web personalization is a strategy, a marketing tool, and an art. Personalization requires implicitly or explicitly collecting visitor information and leveraging that knowledge in your content delivery framework to manipulate what information you present to your users and how you present it [8]. Correctly executed, personalization of the visitor's experience makes his time on your site, or in your application, more productive and engaging. Personalization can also be valuable to you and your organization, because it drives desired business results such as increasing visitor response or promoting customer retention. Unfortunately, personalization for its own sake has the potential to increase the complexity of your site interface and drive inefficiency into your architecture. It might even compromise the effectiveness of your marketing message or, worse, impair the user's experience. Few businesses are willing to sacrifice their core message for the sake of a few trick web pages.

Web personalization can be seen as an interdisciplinary field that spans several research domains, from user modeling, social networks, Web data mining and human-machine interaction to Web usage mining. Web usage mining, for example, analyzes log files containing information on user navigation in order to classify users. Other information retrieval techniques are based on the selection of document categories. Extraction of contextual information on the user and/or the materials (for adaptation systems) is a fairly common technique. Some approaches also include, in addition to user contextual information, contextual information about real-time interactions with the Web. One proposed multi-agent system is based on three layers: a user layer containing users' profiles and a personalization module, an information layer and an intermediate layer. It performs an information filtering process that reorganizes Web documents.

Other work proposes query reformulation by adding implicit user information. This helps to remove any ambiguity that may exist in the query: when a user asks for the term "conception", the query should be different if he is an architect or a computer science designer. Requests can also be enriched with predefined terms derived from the user's profile; a similar approach is based on user categories and profile inference. User profiles can also be used to enrich queries and to sort results at the user interface level. Other approaches consider social-based filtering and collaborative filtering. These techniques are based on relationships inferred from users' profiles. Implicit filtering is a method that observes the user's behavior and activities in order to categorize classes of profile.
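The idea of enriching a query with terms derived from the user's profile can be sketched in a few lines. The table PROFILE_TERMS, its contents and the function reformulate are hypothetical; the text does not specify how the profile terms are obtained.

```python
# Hypothetical profile vocabulary: profession -> terms that disambiguate a query.
PROFILE_TERMS = {
    "architect": ["building", "architecture"],
    "computer science designer": ["software", "design"],
}

def reformulate(query, profession):
    """Append profile-derived terms that the query does not already contain."""
    words = query.lower().split()
    extra = [term for term in PROFILE_TERMS.get(profession, []) if term not in words]
    return " ".join([query] + extra)

print(reformulate("conception", "architect"))
print(reformulate("conception", "computer science designer"))
```

The same query term thus yields two different enriched queries depending on the profile, which is the disambiguation effect described above.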

1.2.5 Personalization Strategies

Personalization falls into four basic categories, ordered from the simplest to the most advanced [8]:

Memorization

In this simplest and most widespread form of personalization, user information such as name and browsing history is stored (e.g. using cookies), to be later used to recognize and greet the returning user. It is usually implemented on the Web server. This mode depends more on Web technology than on any kind of adaptive or intelligent learning. It can also jeopardize user privacy.

Customization

This form of personalization takes as input a user's preferences from registration forms in order to customize the content and structure of a web page. This process tends to be static and manual, or at best semi-automatic. It is usually implemented on the Web server. Typical examples include personalized web portals such as My Yahoo and Google.

Guidance or Recommender Systems

A guidance based system tries to automatically recommend hyperlinks that are deemed to be relevant to the user's interests, in order to facilitate access to the needed information on a large website. It is usually implemented on the Web server, and relies on data that reflects the user's interest implicitly (browsing history as recorded in Web server logs) or explicitly (user profile as entered through a registration form or questionnaire). This approach will form the focus of our overview of Web personalization.

Task Performance Support

In these client-side personalization systems, a personal assistant executes actions on behalf of the user, in order to facilitate access to relevant information. This approach requires heavy involvement on the part of the user, including access, installation, and maintenance of the personal assistant software. It also has very limited scope in the sense that it cannot use information about other users with similar interests.

1.2.6 Personalization Process

The Web personalization process can be divided into four distinct phases as follows:

Collection of Web Data

Implicit data includes past activities/click streams as recorded in Web server logs and/or via cookies or session tracking modules [14]. Explicit data usually comes from registration forms and rating questionnaires. Additional data such as demographic and application data (for example, e-commerce transactions) can also be used. In some cases, Web content, structure, and application data can be added as additional sources of data, to shed more light on the next stages.

Preprocessing of Web Data

Data is frequently pre-processed to put it into a format that is compatible with the analysis technique to be used in the next step. Preprocessing may include cleaning data of inconsistencies, filtering out irrelevant information according to the goal of analysis (for example, automatically generated requests to embedded graphics will be recorded in web server logs, even though they add little information about user interests), and completing the missing links (due to caching) in incomplete click through paths. Most importantly, unique sessions need to be identified from the different requests, based on a heuristic, such as requests originating from an identical IP address within a given time period.

Analysis of Web Data

This step applies machine learning or Data Mining techniques to discover interesting usage patterns and statistical correlations between web pages and user groups. This step frequently results in automatic user profiling, and is typically applied offline, so that it does not add a burden on the web server.

Decision making/Final Recommendation Phase

The last phase in personalization makes use of the results of the previous analysis step to deliver recommendations to the user. The recommendation process typically involves generating dynamic Web content on the fly, such as adding hyperlinks to the last web page requested by the user. This can be accomplished using a variety of Web technology options such as CGI programming.
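A minimal sketch of this final phase is given below. It assumes, purely for illustration, that the usage patterns take the form of page co-occurrence counts computed offline from past sessions, and that recommendations are rendered as hyperlinks to be added to the requested page. The function names and the sample sessions are hypothetical.

```python
from collections import Counter
from itertools import permutations

def build_cooccurrence(sessions):
    """Offline step: count how often two pages appear in the same session."""
    counts = {}
    for pages in sessions:
        for a, b in permutations(set(pages), 2):
            counts.setdefault(a, Counter())[b] += 1
    return counts

def recommend(active_session, cooccurrence, top_n=2):
    """Online step: score unseen pages by co-occurrence with visited pages."""
    scores = Counter()
    for page in active_session:
        for other, n in cooccurrence.get(page, {}).items():
            if other not in active_session:
                scores[other] += n
    # Highest score first; ties are broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [page for page, _ in ranked[:top_n]]

def render_links(pages):
    """Turn the recommended pages into hyperlinks for the dynamic page."""
    return "\n".join(f'<a href="{p}">{p}</a>' for p in pages)

past_sessions = [
    ["/home", "/laptops", "/bags"],
    ["/home", "/laptops", "/mice"],
    ["/laptops", "/bags"],
    ["/home", "/phones"],
]
model = build_cooccurrence(past_sessions)
links = recommend(["/laptops"], model)
print(render_links(links))
```

Keeping the counting step separate from the recommendation step mirrors the split between offline analysis and on-the-fly content generation described above.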

Figure 1.7: Personalization Architecture (diagram omitted; its boxes are labeled Log Files, Web Server, Web Usage Mining, Usage Patterns and Recommendation Engine)

1.2.7 Advantages

The advantages of Web Mining are as follows -

Eliminating/ Combining low visit pages

Shortening Paths of high visit pages

Redesigning pages to help user navigation

Redesigning pages for search engine optimization

Helping to evaluate the effectiveness of advertising campaigns

1.2.8 Disadvantages

The most criticized ethical issue involving web usage mining is the invasion of privacy. Privacy is considered lost when information concerning an individual is obtained, used, or disseminated, especially if this occurs without their knowledge or consent.

1.2.9 Applications of Web Data Mining

The main motivation behind this dissertation is the correlation between Web usage mining and Web personalization. The work on Web usage mining can be a source of ideas and solutions towards realizing Web personalization. The ultimate goal of Web personalization is to provide Web users with the next page they will access in a browsing session. This is achieved by analyzing their browsing patterns and comparing the discovered patterns to similar patterns in history. Traditionally, this has been used to support the decision making process of Web site operators, in order to gain a better understanding of their visitors, to create a more efficient structure for their Web sites and to perform more effective marketing. Applications include the following:

Guiding Web site users by recommending a set of hyperlinks related to their interests and preferences, which improves their navigational experience, and providing them with personalized and customized page layout, hyperlinks and content

Performing some actions on behalf of users, such as sending e-mail, downloading items, completing or enhancing the users' queries, or even participating in Web auctions

Learning and predicting user clicks in Web based search facilities, Zhou et al. (2007) [15]. This offers an automated explanation of Web user activity. Also, the measurement of the likelihood of clicks can infer a user's judgment of search results and improve Web page ranking

Minimizing the latency of viewing pages, especially image files, by pre-fetching Web pages or by pre-sending documents that a user will visit next, Yang et al. (2003) [16]. Web pre-fetching goes one step further by anticipating the Web users' future requests and pre-loading the predicted pages into a cache. This is a major method of reducing Web latency, which can be measured as the difference between the time when a user makes a request and the time when the user receives the response. Web latency is particularly important to Web surfers and e-commerce Web sites

Customizing Web site interfaces by predicting the next relevant pages or products and overcoming the information overload by providing multiple short-cut links relevant to the items of interest in a page

Improving site topology as well as market segmentation

Improving the Web advertisement area where a substantial amount of money is paid for placing the correct advertisements on Web sites. Using Web page access prediction, the right ad will be predicted according to the users' browsing patterns
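Several of the applications above rest on predicting the next page a user will access. One simple way to sketch this is a first-order transition model estimated from past sessions; this particular model, the function names and the sample history are assumptions made for illustration and are not claimed to be the method of the cited works.

```python
from collections import Counter, defaultdict

def train_transitions(sessions):
    """Count page-to-page transitions observed in past sessions."""
    transitions = defaultdict(Counter)
    for pages in sessions:
        for current, following in zip(pages, pages[1:]):
            transitions[current][following] += 1
    return transitions

def predict_next(page, transitions):
    """Return (most likely next page, estimated probability), or None if unseen."""
    followers = transitions.get(page)
    if not followers:
        return None
    best, count = max(followers.items(), key=lambda item: (item[1], item[0]))
    return best, count / sum(followers.values())

history = [
    ["/home", "/catalog", "/cart"],
    ["/home", "/catalog", "/product"],
    ["/home", "/search", "/product"],
    ["/catalog", "/cart", "/checkout"],
]
model = train_transitions(history)
print(predict_next("/home", model))
print(predict_next("/catalog", model))
```

A predicted page could then be pre-fetched into a cache, or used to select a short-cut link or an advertisement, depending on the application.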