Nowadays, digital information is relatively easy to capture and fairly inexpensive to store. The digital revolution has seen collections of data grow in size, and the complexity of the data therein increase. A question commonly arising from this state of affairs is: having gathered such quantities of data, what do we actually do with it? It is often the case that large collections of data, however well structured, conceal implicit patterns of information that cannot be readily detected by conventional analysis techniques. Such information may often be usefully analyzed using a set of techniques referred to as knowledge discovery or data mining. These techniques essentially seek to build a better understanding of data and, in building characterizations of data that can be used as a basis for further analysis, extract value from volume.
1.1.1 Definition of Data Mining
"Data mining refers to extracting or mining knowledge from large amounts of data."
- Jiawei Han and Micheline Kamber
"The process of extracting previously unknown, comprehensible and actionable information from large databases and using it to make crucial business decisions."
- Simoudis (1996)
"The non-trivial extraction of implicit, previously unknown, and potentially useful information from data."
- William J. Frawley, Gregory Piatetsky-Shapiro and Christopher J. Matheus
"Data mining is the non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data."
- Arun K. Pujari
Researchers' View - Various researchers give differing opinions about data mining.
Data mining is the non-trivial extraction of implicit, previously unknown and potentially useful information from data. It encompasses a number of technical approaches, such as clustering, data summarization, classification, finding dependency networks, analyzing changes and detecting anomalies.
Data mining is the search for relationships and global patterns that exist in large databases but are hidden among vast amounts of data, such as relationships between patient data and their medical diagnoses. These relationships represent valuable knowledge about the database and the objects in it, provided the database is a faithful mirror of the real world it registers.
Data mining refers to using a variety of techniques to identify nuggets of information or decision-making knowledge in a database and extracting these in such a way that they can be put to use in areas such as decision support, prediction, forecasting and estimation. The data is often voluminous but of low value in itself, and no direct use can be made of it; it is the hidden information in the data that is useful.
Discovering the relations that connect variables in a database is the subject of data mining. A data mining system learns from the previous history of the integrated system, formulating and testing hypotheses about the rules the system obeys. When concise and valuable knowledge about the system of interest is discovered, it can be incorporated into a decision support system, which helps the manager make wise and informed business decisions.
Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition techniques as well as statistical and mathematical techniques.
1.1.2 Knowledge Discovery Process
It consists of an iterative sequence of the following steps:
Data Cleaning: removes noise and inconsistencies present in the data
Data Integration: combines data from multiple sources
Data Selection: retrieves data relevant to the analysis task from the database
Data Transformation: consolidates the data into forms appropriate for mining by performing summary or aggregation operations
Data Mining: the essential step, in which intelligent methods are applied to extract data patterns
Pattern Evaluation: evaluates the extracted patterns on the basis of interestingness measures
Knowledge Presentation: presents the mined knowledge to the user through visualization and knowledge representation techniques such as reports
Figure 1.1: Architecture of Knowledge Discovery Process
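As an illustrative sketch, the iterative steps above can be expressed in a few lines of Python. The records, threshold and function names here are invented for demonstration only and do not correspond to any standard tool.

```python
# Minimal sketch of the knowledge discovery steps on toy sales records.
# All data and thresholds here are illustrative assumptions.

records_a = [{"item": "milk", "qty": 2}, {"item": "bread", "qty": None}]
records_b = [{"item": "milk", "qty": 1}, {"item": "eggs", "qty": 12}]

# Data cleaning: drop records with missing (noisy/inconsistent) values.
def clean(records):
    return [r for r in records if r["qty"] is not None]

# Data integration: combine multiple sources.
def integrate(*sources):
    return [r for src in sources for r in src]

# Data selection + transformation: aggregate quantities per item.
def transform(records):
    totals = {}
    for r in records:
        totals[r["item"]] = totals.get(r["item"], 0) + r["qty"]
    return totals

# Data mining: extract a simple pattern (best-selling items).
# Pattern evaluation: keep only items above an interestingness threshold.
def mine(totals, threshold=2):
    return {item: qty for item, qty in totals.items() if qty >= threshold}

patterns = mine(transform(integrate(clean(records_a), clean(records_b))))
print(patterns)  # knowledge presentation: report the surviving patterns
```

Each function stands in for one phase of the process; in practice each phase is a substantial subsystem in its own right.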
1.1.3 Architecture of Data Mining
A typical data mining system may have the following major components:
Database, Data Warehouse, World Wide Web or other information repository: This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.
Database or Data Warehouse Server: The database or data warehouse server is responsible for fetching the relevant data based on the user's mining request.
Knowledge Base: This is the domain knowledge used to guide the search or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction. Other forms of domain knowledge are additional interestingness constraints or thresholds, and metadata.
Data Mining Engine: This is essential to the data mining system and consists of a set of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis and evolution analysis.
Pattern Evaluation Module: This component employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns. It may use interestingness thresholds to filter out discovered patterns. This module may be integrated with the mining module, depending on the implementation of the data mining method used. For efficient data mining, it is highly recommended to push the evaluation of pattern interestingness as deep as possible into the mining process, so as to confine the search to only the interesting patterns.
User Interface: This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on intermediate data mining results.
Figure 1.2: Architecture of Data Mining System
1.1.4 Goals of Data Mining
Data mining helps in achieving the following goals or tasks:
Prediction: Data mining can show how certain attributes within the data will behave in the future. Examples of predictive data mining in a business context include the analysis of buying transactions to predict what consumers will buy under certain discounts, and how much sales volume a store will generate in a given period. In a scientific context, certain seismic wave patterns may predict an earthquake with high probability.
Identification: Data patterns can be used to identify the existence of an item, an event or an activity. For example, in biological applications, the existence of a gene may be identified by certain sequences of nucleotide symbols in the DNA. Identification also involves authentication, where it is ascertained whether a user is indeed a specific user or one from an authorized class; this involves a comparison of parameters, images or signals.
Classification: Data mining can partition the data so that different classes or categories can be identified based on combinations of parameters. For example, customers in a supermarket can be categorized into discount-seeking shoppers, shoppers in a rush, loyal regular shoppers and infrequent shoppers. This classification may be used in different analyses of customer buying transactions as a post-mining activity.
Optimization: One eventual goal of data mining activity is to optimize the use of limited resources such as time, space, money, or materials, and to maximize output variables such as sales or profits under a given set of constraints. These goals are realized with the help of different approaches, such as the discovery of sequential patterns, discovery of patterns in time series, discovery of classification rules, regression, neural networks, genetic algorithms, and clustering and segmentation.
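The classification goal can be illustrated with a deliberately simple rule-based sketch that assigns supermarket customers to the categories mentioned above. The two input parameters, the thresholds and the customer data are hypothetical assumptions; a real system would learn such rules from transaction data rather than hard-code them.

```python
# Hypothetical sketch of the classification goal: partitioning supermarket
# customers into categories from two illustrative parameters.
# Category names, thresholds and data are assumptions for demonstration only.

def classify(visits_per_month, coupon_rate):
    """Assign a customer to a shopper category."""
    if visits_per_month < 2:
        return "infrequent shopper"
    if coupon_rate > 0.5:
        return "discount-seeking shopper"
    if visits_per_month >= 8:
        return "loyal regular shopper"
    return "shopper in a rush"

customers = {"alice": (10, 0.1), "bob": (1, 0.0), "carol": (4, 0.8)}
for name, (visits, coupons) in customers.items():
    print(name, "->", classify(visits, coupons))
```

In a genuine data mining setting, the interesting part is precisely that such rules are discovered from the data rather than specified by hand.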
1.1.5 Applications of Data Mining
Data mining applications are continuously being developed in various industries to uncover more hidden knowledge, enabling businesses to increase efficiency and grow.
Data Mining Applications in Sales/Marketing: Data mining enables businesses to understand the patterns hidden inside past purchase transactions, helping them plan and launch new marketing campaigns in a prompt and cost-effective way. The following illustrates several data mining applications in sales and marketing.
Data mining is used for market basket analysis to provide insight into what product combinations were purchased, when they were bought and in what sequence. This information helps businesses promote their most profitable products to maximize profit. In addition, it encourages customers to purchase related products that they may have missed or overlooked
Retail companies use data mining to identify customers' buying behavior patterns
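The market basket analysis mentioned above can be sketched by counting which product pairs co-occur across transactions. The baskets and the 50% support threshold are made-up examples; production systems use dedicated algorithms such as Apriori or FP-growth over far larger data.

```python
# A minimal market basket sketch: counting which product pairs are bought
# together across transactions. The transactions are invented examples.
from itertools import combinations
from collections import Counter

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

pair_counts = Counter()
for basket in transactions:
    # count every unordered pair of items in the basket
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# keep pairs purchased together in at least half of the transactions
frequent = [p for p, c in pair_counts.items() if c / len(transactions) >= 0.5]
print(frequent)  # ('bread', 'butter') appears in 3 of 4 baskets
```

Frequent pairs of this kind are the raw material for the cross-selling and shelf-arrangement decisions described in the text.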
Data Mining Applications in Banking/Finance: Several data mining techniques, such as distributed data mining, have been researched, modeled and developed to help detect credit card fraud.
Data mining is used to identify customer loyalty by analyzing data on customers' purchasing activities, such as the frequency of purchases in a period of time, the total monetary value of all purchases, and the date of the last purchase. After analyzing these dimensions, a relative measure is generated for each customer; the higher the score, the more loyal the customer
Data mining is also used to help banks retain credit card customers. By analyzing past data, data mining can help banks predict which customers are likely to change their credit card affiliation, so that special offers can be planned and launched to retain those customers
Credit card spending by customer groups can be identified by using data mining
The hidden correlations between different financial indicators can be discovered using data mining
From historical market data, data mining enables the identification of stock trading rules
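The loyalty scoring described above, over recency, frequency and monetary value, is often called RFM analysis. The sketch below combines the three dimensions into a single score; the weights, normalization caps and customer data are illustrative assumptions, not an industry-standard formula.

```python
# Sketch of the loyalty scoring described above: each customer is scored on
# recency, frequency and monetary value (RFM). Weights, caps and data are
# illustrative assumptions only.
from datetime import date

today = date(2024, 1, 31)

# per customer: (last purchase date, purchases in period, total spend)
customers = {
    "alice": (date(2024, 1, 28), 12, 540.0),
    "bob":   (date(2023, 11, 2),  2,  80.0),
}

def rfm_score(last_purchase, frequency, monetary):
    recency_days = (today - last_purchase).days
    # higher is better on every dimension; recency decays over 90 days
    recency = max(0, 90 - recency_days) / 90
    return round(0.4 * recency + 0.3 * min(frequency / 12, 1)
                 + 0.3 * min(monetary / 500, 1), 2)

for name, fields in sorted(customers.items()):
    print(name, rfm_score(*fields))
```

A recent, frequent, high-spending customer scores near 1, while a lapsed occasional buyer scores near 0, matching the intuition in the text that a higher score marks a more loyal customer.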
Data Mining Applications in Health Care and Insurance: The growth of the insurance industry depends entirely on the ability to convert data into knowledge, information or intelligence about customers, competitors and markets. Data mining has been applied in the insurance industry only recently, but it has brought tremendous competitive advantages to the companies that have implemented it successfully. Data mining applications in the insurance industry are listed below:
Data mining is applied in claims analysis such as identifying which medical procedures are claimed together
Data mining enables forecasting of which customers will potentially purchase new policies
Data mining allows insurance companies to detect risky customers' behavior patterns
Data mining helps detect fraudulent behavior
Data Mining Applications in Transportation
Data mining helps to determine the distribution schedules among warehouses and outlets and analyze loading patterns
Data Mining Applications in Medicine
Data mining enables the characterization of patient activities to anticipate upcoming office visits
Data mining helps identify the patterns of successful medical therapies for different illnesses
1.1.6 Advantages of Data Mining
Data mining has various advantages, as follows:
Marketing / Retail
Data mining helps marketing companies build models based on historical data to predict who will respond to new marketing campaigns such as direct mail and online advertising. Through such prediction, marketers can take an appropriate approach to selling profitable products to targeted customers. Data mining brings many benefits to retail companies in the same way as in marketing. Through market basket analysis, a store can arrange its products so that customers can conveniently buy frequently purchased products together. In addition, it helps the retail company offer discounts on particular products that will attract customers.
Finance / Banking
Data mining gives financial institutions information about loans and credit reporting. By building a model from previous customers' data with common characteristics, banks and financial institutions can estimate which loans are good or bad and their risk levels. In addition, data mining can help banks detect fraudulent credit card transactions, helping credit card owners prevent losses.
Manufacturing
By applying data mining to operational engineering data, manufacturers can detect faulty equipment and determine optimal control parameters. For example, semiconductor manufacturers faced the challenge that, even though the conditions of manufacturing environments at different wafer production plants were similar, the quality of the wafers was not the same, and some, for unknown reasons, even contained defects.
Data mining has been applied to determine the ranges of control parameters that lead to the production of golden wafers. Those optimal control parameters are then used to manufacture wafers of the desired quality.
Governments
Data mining helps government agencies by digging into and analyzing records of financial transactions to build patterns that can detect money laundering or criminal activity.
1.1.7 Disadvantages of Data Mining
The disadvantages of using data mining are as follows:
Privacy issues
Concerns about personal privacy have been increasing enormously recently, especially as the internet booms with social networks, e-commerce, forums and blogs. Because of privacy issues, people are afraid that their personal information will be collected and used in unethical ways, potentially causing them a lot of trouble. Businesses collect information about their customers in many ways in order to understand their purchasing behavior. However, businesses do not last forever; some day they may be acquired or cease to exist, at which point the personal information they own may be sold on or leaked.
Security issues
Security is a big issue. Businesses own information about their employees and customers, including social security numbers, birthdays and payroll details. However, how well this information is protected is still in question. There have been many cases in which hackers accessed and stole large amounts of customer data from big corporations such as Ford Motor Credit Company and Sony. With so much personal and financial information available, credit card theft and identity theft have become a big problem.
Misuse of information/inaccurate information
Information collected through data mining for marketing or other legitimate purposes can be misused. Such information may be exploited by unethical people or businesses to take advantage of vulnerable people or to discriminate against a group of people.
In addition, data mining techniques are not perfectly accurate; if inaccurate information is used for decision making, it can cause serious consequences.
1.1.8 Issues and Challenges in Data Mining
Data mining applications rely on databases to supply the raw data for input. Issues in the databases and data (e.g. volatility, incompleteness, noise, and volume) compound by the time the data reaches the data mining task. Other problems arise as a result of the adequacy and relevance of the information stored.
A database is often designed for purposes different from data mining, and sometimes the properties or attributes that would simplify the learning task are not present, nor can they be requested from the real world. Inconclusive data causes problems because, if some attributes essential to knowledge about the application domain are not present in the data, it may be impossible to discover significant knowledge about that domain. For example, one cannot diagnose malaria from a patient database if that database does not contain patients' red blood cell counts.
Noise and missing values
Databases are usually contaminated by errors, so it cannot be assumed that the data they contain is entirely correct. Attributes that rely on subjective or measurement judgments can give rise to errors, such that some examples may even be misclassified. Errors in either the values of attributes or class information are known as noise. Obviously, where possible, it is desirable to eliminate noise from the classification information, as it affects the overall accuracy of the generated rules. Missing data can be treated by discovery systems in a number of ways, such as:
simply disregard missing values
omit the corresponding records
infer missing values from known values
treat missing data as a special value to be included additionally in the attribute domain
average over the missing values using Bayesian techniques
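The first four strategies listed above can be sketched directly on a toy attribute. The values below stand in for a red blood cell count, with None marking a missing entry; the numbers are invented, and mean imputation is used here as a simple stand-in for more careful inference.

```python
# The missing-value strategies listed above, sketched on a toy attribute.
# None marks a missing value; the numbers are invented.
values = [4.7, None, 5.1, None, 4.4]

# disregard missing values / omit the corresponding records
complete = [v for v in values if v is not None]

# infer missing values from known values (here: mean imputation,
# a simple stand-in for more careful inference)
mean = sum(complete) / len(complete)
imputed = [v if v is not None else round(mean, 2) for v in values]

# treat missing data as a special value added to the attribute domain
MISSING = "?"
flagged = [v if v is not None else MISSING for v in values]

print(complete, imputed, flagged, sep="\n")
```

Which strategy is appropriate depends on why the values are missing; dropping records is safe only when missingness is unrelated to the quantity being studied.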
Noisy data, in the sense of being imprecise, is characteristic of all data collection and typically fits a regular statistical distribution such as a Gaussian, while wrong values are data entry errors. Statistical methods can treat problems of noisy data and separate different types of noise.
Uncertainty refers to the severity of the error and the degree of noise in the data. Data precision is an important consideration in a discovery system.
Size, updates, and irrelevant fields
Databases tend to be large and dynamic, in that their contents are ever-changing as information is added, modified or removed. The problem with this, from the data mining perspective, is how to ensure that the rules are up to date and consistent with the most current information. The learning system also has to be time-sensitive, as some data values vary over time and the discovery system is affected by the 'timeliness' of the data
Another issue is the relevance or irrelevance of the fields in the database to the current focus of discovery; for example, post codes are fundamental to any study trying to establish a geographical connection to an item of interest, such as the sales of a product.
1.2 WEB DATA MINING
Web data mining is the application of data mining techniques to extract knowledge from Web data. Web mining has been explored to a vast degree, and different techniques have been proposed for a variety of applications, including Web search, classification and personalization. Web data mining can be defined as the discovery and analysis of useful information from WWW data. The Web involves three types of data: data on the WWW, web log data regarding the users who browsed the web pages, and web structure data.
Research in this area has the objectives of helping e-commerce businesses in their decision making, assisting in the design of good Web sites and assisting the user when navigating the Web.
The ongoing increase in the amount of Web data has led to the explosive growth of Web data repositories. Web pages and their contents are accessed and provided by a wide variety of applications and they are added and deleted every day. Moreover, the Web does not provide its users with a standard coherent page structure across Web sites. These facts make it very difficult to analyze the content of Web pages by automated tools.
Therefore, there arises a need for Web data mining techniques. Data mining involves the study of data-driven techniques to discover and model hidden patterns in large volumes of raw data. The application of data mining techniques to Web data is referred to as Web data mining. Web data mining can be divided into three distinct areas: Web content mining, Web structure mining and Web usage mining. Web content mining involves efficiently extracting useful and relevant information from millions of Web sites and databases. Web structure mining comprises the techniques used to study the schema of Web pages and the collections of hyperlinks connecting them. Web usage mining, on the other hand, involves the analysis and discovery of user access patterns from Web servers in order to better serve users' needs.
1.2.1 Types of Web Data: The World Wide Web contains various information sources in different formats. As stated above, the Web involves three types of data; the categorization is given in Figure 1.3
Figure 1.3: Types of Web Data
1.2.1.1 Web Content Data
It is the data that web pages are designed to present to users. Web content data consists of free text, semi-structured data such as HTML pages, and more structured data such as automatically generated HTML pages, XML files or data in tables related to web content. Textual, image, audio and video data types fall into this category. The most common web content data on the web is HTML pages.
1.2.1.1.1 HTML (Hypertext Markup Language)
It is designed to determine the logical organization of documents, with hypertext extensions. HTML was first implemented by Tim Berners-Lee at CERN and became popular through the Mosaic browser developed at NCSA. During the 1990s it became widespread with the growth of the Web. Since then, HTML has been extended in various ways.
The WWW depends on web page authors and vendors sharing the same conventions of HTML. Different browsers may render an HTML document in different ways.
To illustrate, one browser may indent the beginning of a paragraph, while another may only leave a blank line. However, the base structure remains the same and the organization of the document is constant. HTML instructions divide the text of a web page into blocks called elements. HTML elements can be examined in two categories: those that define how the body of the document is to be displayed by the browser, and those that define information about the document, such as the title or relationships to other documents.
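The two element categories described above can be sketched with Python's standard html.parser, separating head elements (information about the document) from body elements (content to be displayed). The sample page is invented for illustration.

```python
# Separating the two HTML element categories described above using the
# standard library parser. The sample page is invented.
from html.parser import HTMLParser

class ElementCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_head = False
        self.head_tags, self.body_tags = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "body":
            self.in_head = False
        elif tag != "html":
            # head elements describe the document; body elements display it
            (self.head_tags if self.in_head else self.body_tags).append(tag)

page = "<html><head><title>Demo</title></head>" \
       "<body><h1>Hi</h1><p>Text</p></body></html>"
parser = ElementCollector()
parser.feed(page)
print(parser.head_tags, parser.body_tags)
```

Parsers of this kind are the first step in web content mining, since the tag structure must be recovered before the content can be analyzed.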
1.2.1.1.2 XML (Extensible Markup Language)
Another common type of web content data is the XML document. XML is a markup language for documents containing structured information. Structured information contains both the content and information about what that content includes and stands for. Almost all documents have some structure. XML has been accepted as a markup language, that is, a mechanism for identifying structures in a document. The XML specification determines a standard way to add markup to documents. XML does not specify semantics or a tag set; in fact, it is a meta-language for describing markup. It provides a mechanism to define tags and the structural relationships between them. All of the semantics of an XML document are defined either by the applications that process it or by style sheets.
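The point that XML supplies structure while the application supplies semantics can be sketched with the standard xml.etree.ElementTree module. The document and tag names below are invented; it is the consuming code, not XML itself, that decides what a tag such as rbc means.

```python
# XML supplies structure; the application supplies the semantics.
# The document and tag names are invented for illustration.
import xml.etree.ElementTree as ET

doc = """<patients>
  <patient id="p1"><name>Ann</name><rbc>4.7</rbc></patient>
  <patient id="p2"><name>Raj</name><rbc>5.1</rbc></patient>
</patients>"""

root = ET.fromstring(doc)
# this application decides that <rbc> means red blood cell count
counts = {p.get("id"): float(p.find("rbc").text)
          for p in root.findall("patient")}
print(counts)
```

The parser recovers the tree of tags and attributes generically; only the dictionary comprehension encodes what the data is taken to mean.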
1.2.1.1.3 Dynamic Server Pages
They are also an important part of web content data. Dynamic content is any web content that is processed or compiled by the web server before the results are sent to the web browser; static content, by contrast, is sent to the browser without modification. Common forms of dynamic content are Active Server Pages (ASP), PHP (PHP: Hypertext Preprocessor) pages and Java Server Pages (JSP). Today, several web servers support more than one type of active server page.
1.2.1.2 Web Structure Data
It describes the organization of the content. Intra-page structure information includes the arrangement of various HTML or XML tags within a given page, while inter-page structure information consists of the hyperlinks connecting one page to another. A web graph is constructed from the hyperlink information in web pages. The web graph has been widely adopted as the core description of the web structure; it is the most widely accepted way of representing web structure in terms of web page connectivity (dynamic and static links). The web graph is a representation of the WWW at a given time. It stores the link structure and connectivity between the HTML documents on the WWW. Each node in the graph corresponds to a unique web page or document, and an edge represents an HTML link from one page to another.
The general properties of web graphs are given below:
- Directed, very large and sparse
- Nodes and edges are added/deleted very often
- The content of existing nodes is also subject to change
- Pages and hyperlinks are created on the fly
- Apart from the primary connected component, there are also smaller disconnected components
The size of the web graph varies from one domain to another.
Figure 1.4: Web Graph for a Particular Web Domain
The edges of the web graph have the following semantics: outgoing arcs stand for the hypertext links contained in the corresponding page, and incoming arcs represent the hypertext links through which the corresponding page is reached. The web graph is used in applications such as web indexing, detection of web communities and web searching. The whole web graph grows at an amazing rate.
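A web graph of the kind described above can be sketched as a directed adjacency list. The four pages and their links are hypothetical; out-degree counts the links a page contains (outgoing arcs), and in-degree counts the links through which it is reached (incoming arcs).

```python
# The web graph described above as a directed adjacency list over four
# hypothetical pages, with the in/out-degree semantics from the text.
links = {
    "A": ["B", "C"],   # page A links to B and C
    "B": ["C"],
    "C": ["A"],
    "D": [],           # a small disconnected component
}

out_degree = {page: len(targets) for page, targets in links.items()}
in_degree = {page: 0 for page in links}
for targets in links.values():
    for t in targets:
        in_degree[t] += 1

print(out_degree)
print(in_degree)
```

Page D illustrates the smaller disconnected components mentioned in the list of web graph properties.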
1.2.1.3 Web Log Data
Web usage data includes web log data from web server access logs, proxy server logs, browser logs, registration data, cookies and any other data generated as a result of web users' interactions with web servers. Web log data is created on the web server. Every web server has a unique IP address and a domain name. When a user enters a URL in a browser, the request is sent to the web server. A web server log, containing web server data, is created as a result of the httpd process that runs on web servers. All types of server activity, such as successes, errors, and lack of response, are logged in a server log file. Web servers dynamically produce and update four types of "usage" log files: access log, agent log, error log, and referrer log. Web access logs have fields containing web server data, including the date, time, user's IP address, user action, request method and requested data. Error logs include data about specific events such as "file not found," "document contains no data," or configuration errors, providing the server administrator with information on "problematic and erroneous" links on the server. Another type of data recorded in the error log is aborted transmissions. Agent logs provide data about the browser, browser version, and operating system of the requesting user.
1.2.1.4 User Profile Data
User profile data provide information about the users of a Web site. A user profile contains demographic information for each user of a Web site, as well as information about users' interests and preferences. Such information is acquired through registration forms or questionnaires, or can be inferred by analyzing Web usage logs.
1.2.2 Types of Web Data Mining
Web data mining focuses on three issues: Web structure mining, Web content mining and Web usage mining.
1.2.2.1 Web Content Mining
Web content mining is the process of extracting useful information from the contents of Web documents. Content data corresponds to the collection of facts a Web page was designed to convey to its users; it may consist of text, images, audio, video, or structured records such as lists and tables. Web content mining focuses on techniques that assist in searching the Web for documents whose content meets a certain goal. Those documents, once found, are used to build a knowledge base. The emphasis here is on analyzing the Internet's hypertext material; Internet data that is available in digital form has to be prepared for analysis.
A large number of research studies have been conducted in this area in the past few years. For instance, Zaiane & Han (2000) focused on resource discovery on the Web. The authors made use of a multi-layered database model to transform the unstructured data on the Web into a form acceptable to database technology. Research activities in this field also involve using techniques from other disciplines, such as Information Retrieval (IR) and Natural Language Processing (NLP).
1.2.2.2 Web Structure Mining
Web structure mining is the process of using graph theory to analyze the node and connection structure of a web site. According to the type of web structural data, web structure mining can be divided into two kinds:
Extracting patterns from hyperlinks in the web: a hyperlink is a structural component that connects a web page to a different location.
Mining the document structure: analysis of the tree-like structure of pages to describe HTML or XML tag usage.
It aims at generating a structured summary of web sites and web pages in order to identify relevant documents. The focus here is on link information, which is an important aspect of Web data. Web structure mining can be used to reveal the structure or schema of Web pages, which facilitates Web document classification and clustering on the basis of structure (Spertus, 1997). Web structure mining is very useful in generating information such as visible Web documents, luminous Web documents and the luminous path, i.e. the path common to most of the results returned.
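Link information of the kind exploited by web structure mining underlies link-analysis algorithms such as PageRank. The following is a minimal iterative sketch on a three-page toy graph, with an assumed damping factor of 0.85; it is an illustration of the idea, not the production algorithm.

```python
# A minimal iterative PageRank-style sketch over a toy link graph.
# Damping factor and graph are illustrative assumptions.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages = list(links)
damping = 0.85
rank = {p: 1 / len(pages) for p in pages}

for _ in range(50):  # iterate until the ranks settle
    new = {}
    for p in pages:
        # each page passes its rank evenly along its outgoing links
        incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
        new[p] = (1 - damping) / len(pages) + damping * incoming
    rank = new

print({p: round(r, 3) for p, r in rank.items()})
```

Page C, which is reached by links from both A and B, ends up with a higher score than B, illustrating how incoming-arc structure alone can rank documents.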
1.2.2.3 Web Usage Mining
Web usage mining is the process of extracting useful information from server logs, i.e., users' browsing history; it seeks to find out what users are looking for on the Internet. Some users might be looking only at textual data, whereas others might be interested in multimedia data. Web usage mining involves the automatic discovery and analysis of patterns in the data resulting from users' interactions with one or more Web sites. It focuses on tools and techniques used to study and understand users' navigation preferences and behavior by discovering their Web access patterns.
The goal of Web usage mining is to capture, model and analyze users' behavioral patterns. It therefore involves three phases: preprocessing of Web data, pattern discovery and pattern analysis (Srivastava et al., 2000). Of these, only the last phase is performed in real time. The discovered patterns are represented as collections of pages that are frequently accessed by groups of users with similar interests within the same Web site.
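The preprocessing phase mentioned above typically groups raw log hits into user sessions before any patterns can be discovered. A common heuristic is sketched below: hits from the same IP address separated by less than a timeout belong to one session. The timestamps, pages and the 30-minute cutoff are assumptions for illustration.

```python
# Sessionization sketch for the preprocessing phase: hits from the same IP
# closer together than a timeout form one session. Data and the 30-minute
# cutoff are illustrative assumptions.
SESSION_TIMEOUT = 30 * 60  # seconds

# (ip, unix timestamp, page) tuples, already sorted by time per IP
hits = [
    ("1.2.3.4", 0,    "/home"),
    ("1.2.3.4", 600,  "/products"),
    ("1.2.3.4", 4000, "/home"),      # gap > 30 minutes: a new session
    ("5.6.7.8", 100,  "/home"),
]

sessions = {}    # ip -> list of sessions (each session is a list of pages)
last_seen = {}   # ip -> timestamp of the previous hit
for ip, ts, page in hits:
    if ip not in sessions or ts - last_seen[ip] > SESSION_TIMEOUT:
        sessions.setdefault(ip, []).append([])   # open a new session
    sessions[ip][-1].append(page)
    last_seen[ip] = ts

print(sessions)
```

The resulting per-user page sequences are the input to the pattern discovery phase, e.g. finding pages frequently accessed together by similar users.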
Figure 1.5: Web Data Mining Architecture
1.2.3 Architecture of Web Usage Mining
1.2.3.1 Data Collection
The first step in the Web usage mining process consists of gathering the relevant Web data, which will be analyzed to provide useful information about users' behavior. There are two main sources of data for Web usage mining: data on the Web server side and data on the client side. Additionally, when intermediaries are introduced into the client-server communication, they too can become sources of usage data, e.g. proxy servers and packet sniffers. Each of these sources is examined in the following subsections.
1.2.3.1.1 Server Side Data
There are basically two types of server side data, as follows:
Server Log Files: Server side data are collected at the Web server(s) of a site. They consist primarily of various types of logs generated by the Web server, which record the Web pages accessed by visitors of the site. Most Web servers support, as a default option, the Common Log File Format, which includes information about the IP address of the client making the request, the hostname and user name (if available), the time stamp of the request, the name of the requested file, and the file size. The Extended Log Format, supported by Web servers such as Apache, Netscape and Microsoft Internet Information Server, includes additional information such as the referring URL (i.e., the Web page that brought the visitor to the site), the name and version of the browser used by the visitor, and the operating system of the host machine.
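The Common Log File Format fields described above can be read with a regular expression, as sketched below. The log line is fabricated for illustration.

```python
# Reading the Common Log Format fields described above with a regular
# expression. The log line is fabricated for illustration.
import re

line = '192.0.2.1 - frank [10/Oct/2000:13:55:36 -0700] ' \
       '"GET /index.html HTTP/1.0" 200 2326'

CLF = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
)

m = CLF.match(line)
entry = m.groupdict()  # one parsed access-log record
print(entry["host"], entry["time"], entry["request"], entry["size"])
```

Parsed records of this form are the raw input to the cleaning and sessionization steps of Web usage mining.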
A problem that arises here is data reliability. The two major sources of data unreliability are Web caching and IP address misinterpretation.
The Web cache is a mechanism for reducing latency and traffic on the Web. A Web cache keeps track of Web pages that are requested and saves a copy of these pages for a certain period of time. Thus, if there is a request for the same Web page, the cached copy is used instead of making a new request to the Web server. Web caches can be configured either at the users' local browsers or at intermediate proxy servers. The problem that occurs here is that, if the requested Web page is cached, the client's request does not reach the corresponding Web server holding the page. As a result, the server is not aware of the action and the page access is not recorded in the log files. One solution that has been proposed is cache-busting, i.e., the use of special HTTP headers defined either in Web servers or Web pages, in order to control the way that those pages are handled by caches. These headers are known as Cache-Control response headers and include directives to define which objects should be cached, how long they should be cached, etc. However, this approach works against the main motivation for using caches, i.e., the reduction of Web latency.
The second problem, IP misinterpretation in the log files, occurs for two main reasons. The first reason is the use of intermediate proxy servers, which assign the same IP address to all users. As a result, all requests from the various host machines that pass through the proxy server are recorded in the Web server log as requests from a single IP address. This can cause misinterpretation of the usage data. The same problem occurs when the same host is used by many users. The opposite problem occurs when one user is assigned many different IP addresses, e.g. due to the dynamic IP allocation that is used for dial-up users by ISPs. A variety of heuristics have been employed in order to alleviate the problem of IP misinterpretation. Finally, information recorded at the Web servers' log files may pose a privacy threat to Internet users.
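One commonly described heuristic treats requests that share an IP address but carry different browser/operating-system strings as coming from different users, since a proxy makes many hosts appear under one IP. The sketch below illustrates this idea; the function name and the dictionary shape of the log entries are assumptions for illustration:

```python
def distinguish_users(log_entries):
    """Heuristic user identification: the same IP address combined with a
    different browser/OS string is treated as a different visitor, since
    proxies make many hosts appear under a single IP address."""
    users = {}  # (ip, user_agent) -> pages requested by that probable user
    for entry in log_entries:
        key = (entry["ip"], entry["agent"])
        users.setdefault(key, []).append(entry["url"])
    return users

log = [
    {"ip": "10.0.0.1", "agent": "Mozilla/5.0 (Windows)", "url": "/a.html"},
    {"ip": "10.0.0.1", "agent": "Mozilla/5.0 (Linux)", "url": "/b.html"},
    {"ip": "10.0.0.1", "agent": "Mozilla/5.0 (Windows)", "url": "/c.html"},
]
users = distinguish_users(log)
```

Here a single IP address yields two probable users, because two different user-agent strings appear behind it. The heuristic is of course imperfect: two users with identical browsers behind the same proxy remain indistinguishable.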
Cookies: Cookies are small pieces of information that are generated by the Web server and stored at the client machine, to be sent back with subsequent requests, so that returning visitors can be identified. Only 20 cookies are allowed per domain, and no more than 300 cookies are allowed in the client machine. If the number of cookies exceeds these values, the least recently used will be discarded.
1.2.3.1.2 Client Side Data
Explicit User Input: Various user data supplied directly by the user, when accessing the site, can also be useful for personalization. User data can be collected through registration forms and can provide important personal and demographic information, as well as explicit user preferences. However, this method increases the load on the user.
1.2.3.1.3 Intermediary Data
Proxy Servers: A proxy server is a software system that is usually employed by an enterprise connected to the Internet and acts as an intermediary between an internal host and the Internet so that the enterprise can ensure security, administrative control and caching services. Despite the problems that they cause, which were mentioned above, proxy servers can also be a valuable source of usage data.
Proxy servers also use access logs, with similar format to the logs of Web servers, in order to record Web page requests and responses from the server. The advantage of using these logs is that they allow the collection of information about users operating behind the proxy server, since they record requests from multiple hosts to multiple Web servers.
Packet Sniffers: A packet sniffer is a piece of software, or sometimes even a hardware device, that monitors network traffic, i.e., TCP/IP packets directed to a Web server, and extracts data from them.
One advantage of packet sniffing over analyzing raw log files is that the data can be collected and analyzed in real time. Another important advantage is the collection of
network level information that is not present in the log files. This information includes detailed timestamps of the request that has taken place, like the issue time of the request, and the response time.
On the other hand, the use of packet sniffers also has important disadvantages compared to log files. Since the data are collected in real time and are not logged, they may be lost forever if something goes wrong either with the packet sniffer or with the data transmission. For example, the connection may be lost.
1.2.3.2 Data Preprocessing
Web data collected in the first stage of data mining are usually diverse and vast in volume. These data must be assembled into a consistent, integrated and comprehensive view, in order to be used for pattern discovery. As in most applications of data mining, data preprocessing involves removing and filtering redundant and irrelevant data, predicting and filling in missing values, removing noise, transforming and encoding data, as well as resolving any inconsistencies. The task of data transformation and encoding is particularly important for the success of data mining. In Web usage mining, this stage includes the identification of users and user sessions, which are to be used as the basic building blocks for pattern discovery.
Data Filtering: The very first step in data preprocessing is to clean the raw Web data. During this step the available data are examined and irrelevant or redundant items are removed from the dataset. This problem mainly concerns log data collected by Web servers and proxies, which can be particularly noisy, as they record all user interactions. Due to these reasons, we concentrate here on the treatment of Web log data. Data generated by client-side agents are clean as they are explicitly collected by the system, without the intervention of the user. On the other hand, user supplied data like registration form information need to be verified, corrected and normalized, in order to assist in the discovery of useful patterns.
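A minimal sketch of such a cleaning step is shown below; the suffix list, status-code range and robot test are illustrative choices rather than a fixed standard:

```python
# Suffixes that usually denote embedded objects rather than explicit page requests
IRRELEVANT_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")

def clean_log(entries):
    """Keep only successful, human-issued page requests."""
    cleaned = []
    for e in entries:
        if e["url"].lower().endswith(IRRELEVANT_SUFFIXES):
            continue  # embedded image/style/script request
        if not 200 <= e["status"] < 300:
            continue  # failed request or redirect
        if "bot" in e["agent"].lower():
            continue  # crawler traffic, not a real visitor
        cleaned.append(e)
    return cleaned

raw = [
    {"url": "/index.html", "status": 200, "agent": "Mozilla/5.0"},
    {"url": "/logo.gif", "status": 200, "agent": "Mozilla/5.0"},
    {"url": "/missing.html", "status": 404, "agent": "Mozilla/5.0"},
    {"url": "/index.html", "status": 200, "agent": "Googlebot/2.1"},
]
kept = clean_log(raw)
```

Of the four raw entries, only the first survives: the image request, the failed request and the crawler request are all filtered out before pattern discovery.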
1.2.3.3 Pattern Discovery
In this stage, machine learning and statistical methods are used to extract patterns of usage from the preprocessed Web data. A variety of machine learning methods have been used for pattern discovery in Web usage mining.
The large majority of methods that have been used for pattern discovery from Web data are clustering methods. Clustering aims to divide a data set into groups of similar items, and clustering methods fall into the following categories:
Partitioning methods, which create k groups of a given data set, where each group represents a cluster
Hierarchical methods, which decompose a given data set, creating a hierarchical structure of clusters
Model-based methods, which find the best fit between a given data set and a mathematical model
Clustering has been used for grouping users with common browsing behavior, as well as grouping Web pages with similar content.
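As a sketch of the partitioning approach, the following plain k-means implementation groups binary page-visit vectors into k clusters of users with similar access patterns. It is a toy version for illustration, not an industrial clustering routine; the data and function names are invented:

```python
import random

def kmeans(vectors, k, iterations=20, seed=0):
    """Plain k-means: a partitioning method that divides the session
    vectors into k clusters around iteratively refined centroids."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        for i, v in enumerate(vectors):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for i, v in enumerate(vectors) if assignment[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Binary vectors: one component per page, 1 if the user visited it.
sessions = [
    [1, 1, 0, 0],  # users interested in pages 0 and 1
    [1, 1, 0, 0],
    [0, 0, 1, 1],  # users interested in pages 2 and 3
    [0, 0, 1, 1],
]
labels = kmeans(sessions, 2)
```

On this toy data the two groups of users end up in different clusters, which is exactly the kind of grouping of users with common browsing behavior described above.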
In contrast to clustering, the goal of classification is to identify the distinguishing characteristics of predefined classes, based on a set of instances, e.g. users, of each class. This information can be used both for understanding the existing data and for predicting how new instances will behave. Classification is a supervised learning process, because learning is driven by the assignment of instances to classes in the training data.
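A minimal supervised sketch is given below, using a nearest-centroid rule as a stand-in for more elaborate classifiers; the class labels and session vectors are invented for illustration:

```python
def train_centroids(instances, labels):
    """Compute one centroid (mean vector) per predefined class."""
    by_class = {}
    for vec, label in zip(instances, labels):
        by_class.setdefault(label, []).append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in by_class.items()
    }

def classify(vec, centroids):
    """Assign a new instance to the class whose centroid is nearest."""
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(vec, centroids[label])),
    )

# Training instances: page-visit vectors of users with known interests.
instances = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
labels = ["sports", "sports", "news", "news"]
centroids = train_centroids(instances, labels)
prediction = classify([1, 1, 0], centroids)
```

Unlike the clustering sketch, learning here is driven by the labels supplied with the training instances: the classes exist before the model is built.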
1.2.3.4 Knowledge Post Processing
Discovered patterns are not sufficient by themselves; they must also be usable. Users can only make use of what is easily viewable to them, so the patterns should be converted or presented in an understandable format, such as graphical presentations, visualizations and reports, so that the knowledge can easily be applied, for example to increase profits. Visualization is a particularly effective method for presenting comprehensive information to humans.
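As a trivial illustration of turning discovered statistics into a readable report, the following sketch formats page-visit counts; the function name and report layout are assumptions:

```python
from collections import Counter

def usage_report(page_requests, top=3):
    """Format page-visit counts as a small human-readable report,
    so the analyst sees the most visited pages at a glance."""
    counts = Counter(page_requests)
    lines = ["Most visited pages:"]
    for page, n in counts.most_common(top):
        lines.append(f"  {page}: {n} visits")
    return "\n".join(lines)

report = usage_report(["/home", "/products", "/home", "/cart", "/home", "/products"])
```

A real post-processing stage would go further, e.g. charts of navigation paths, but the principle is the same: raw mined numbers become a presentation a decision maker can act on.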
Figure 1.6: Web Usage Mining Architecture (server side and client side data are collected, preprocessed and mined, followed by knowledge post processing)
1.2.4 Personalization on Web
Web personalization is a strategy, a marketing tool, and an art. Personalization requires implicitly or explicitly collecting visitor information and leveraging that knowledge in your content delivery framework to manipulate what information you present to your users and how you present it. Correctly executed, personalization makes the visitor's time on your site, or in your application, more productive and engaging. Personalization can also be valuable to you and your organization, because it drives desired business results such as increasing visitor response or promoting customer retention. Unfortunately, personalization for its own sake has the potential to increase the complexity of your site interface and drive inefficiency into your architecture. It might even compromise the effectiveness of your marketing message or, worse, impair the user's experience. Few businesses are willing to sacrifice their core message for the sake of a few gimmicky Web pages.
Web personalization can be seen as an interdisciplinary field that draws on several research domains, from user modeling, social networks, Web data mining and human-machine interaction to Web usage mining. Web usage mining, for example, extracts information on user navigation from log files in order to classify users. Other information retrieval techniques are based on the selection of document categories. Another fairly common technique is the extraction of contextual information about the user and/or the material (for adaptation systems); some systems also include, in addition to user contextual information, contextual information about real-time interactions with the Web. One proposed approach is a multi-agent system based on three layers: a user layer containing users' profiles and a personalization module, an information layer and an intermediate layer, which together perform an information filtering process that reorganizes Web documents. Another approach is query reformulation, which adds implicit user information to a query. This helps to remove any ambiguity that may exist in the query: when a user asks for the term "conception", the results should differ depending on whether he is an architect or a computer science designer. Requests can also be enriched with predefined terms derived from the user's profile; a similar approach is based on user categories and profile inference. User profiles can also be used to enrich queries and to sort results at the user interface level. Other approaches consider social-based filtering and collaborative filtering. These techniques are based on relationships inferred from users' profiles. Implicit filtering is a method that observes the user's behavior and activities in order to identify classes of profiles.
1.2.5 Personalization Strategies
Personalization falls into four basic categories, ordered from the simplest to the most advanced:
1.2.5.1 Memorization
In this simplest and most widespread form of personalization, user information such as name and browsing history is stored (e.g. using cookies), to be used later to recognize and greet the returning user. It is usually implemented on the Web server. This mode depends more on Web technology than on any kind of adaptive or intelligent learning. It can also jeopardize user privacy.
1.2.5.2 Customization
This form of personalization takes as input a user's preferences from registration forms in order to customize the content and structure of a Web page. This process tends to be static and manual, or at best semi-automatic. It is usually implemented on the Web server. Typical examples include personalized Web portals such as My Yahoo and Google.
1.2.5.3 Guidance or Recommender Systems
A guidance based system tries to automatically recommend hyperlinks that are deemed to be relevant to the user's interests, in order to facilitate access to the needed information on a large website. It is usually implemented on the Web server, and relies on data that reflects the user's interest implicitly (browsing history as recorded in Web server logs) or explicitly (user profile as entered through a registration form or questionnaire). This approach will form the focus of our overview of Web personalization.
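A very simple recommender of this kind can be sketched by counting page co-occurrences within user sessions and suggesting the pages most often co-visited with the current one; the session data and function names below are illustrative:

```python
from collections import defaultdict

def build_cooccurrence(sessions):
    """Count how often each pair of pages occurs in the same session."""
    co = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        pages = set(session)
        for a in pages:
            for b in pages:
                if a != b:
                    co[a][b] += 1
    return co

def recommend(page, co, top=2):
    """Suggest the hyperlinks most often co-visited with the given page."""
    ranked = sorted(co[page].items(), key=lambda kv: -kv[1])
    return [p for p, _ in ranked[:top]]

sessions = [
    ["/home", "/laptops", "/checkout"],
    ["/home", "/laptops", "/support"],
    ["/home", "/phones"],
]
recs = recommend("/laptops", build_cooccurrence(sessions))
```

The browsing history here plays the role of the implicit interest data mentioned above: no registration form is needed, only the server's record of which pages were visited together.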
1.2.5.4 Task Performance Support
In these client-side personalization systems, a personal assistant executes actions on behalf of the user, in order to facilitate access to relevant information. This approach requires heavy involvement on the part of the user, including access, installation, and maintenance of the personal assistant software. It also has very limited scope in the sense that it cannot use information about other users with similar interests.
1.2.6 Personalization Process
The Web personalization process can be divided into four distinct phases as follows -
1.2.6.1 Collection of Web Data
Implicit data includes past activities/click streams as recorded in Web server logs and/or via cookies or session tracking modules. Explicit data usually comes from registration forms and rating questionnaires. Additional data such as demographic and application data (for example, e-commerce transactions) can also be used. In some cases, Web content, structure, and application data can be added as additional sources of data, to shed more light on the next stages.
1.2.6.2 Preprocessing of Web Data
Data is frequently pre-processed to put it into a format that is compatible with the analysis technique so that it can be used in the next step. Preprocessing may include cleaning data of inconsistencies, filtering out irrelevant information according to the goal of analysis (example: automatically generated requests to embedded graphics will be recorded in web server logs, even though they add little information about user interests), and completing the missing links (due to caching) in incomplete click through paths. Most importantly, unique sessions need to be identified from the different requests, based on a heuristic, such as requests originating from an identical IP address within a given time period.
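The session-identification heuristic mentioned above can be sketched as follows, using the commonly cited 30-minute inactivity threshold; the exact threshold and the data layout are assumptions for illustration:

```python
SESSION_TIMEOUT = 30 * 60  # seconds of inactivity that end a session

def sessionize(requests, timeout=SESSION_TIMEOUT):
    """Split one visitor's time-ordered requests into sessions.
    `requests` is a list of (epoch_seconds, url) pairs; a silence longer
    than `timeout` between two requests starts a new session."""
    sessions = []
    current = []
    last_time = None
    for timestamp, url in requests:
        if last_time is not None and timestamp - last_time > timeout:
            sessions.append(current)  # close the previous session
            current = []
        current.append(url)
        last_time = timestamp
    if current:
        sessions.append(current)
    return sessions

visits = [(0, "/home"), (120, "/products"), (5000, "/home"), (5100, "/cart")]
found = sessionize(visits)
```

The gap of more than 30 minutes between the second and third request splits this visitor's activity into two sessions, which then serve as the units of analysis in the next step.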
1.2.6.3 Analysis of Web Data
This step applies machine learning or Data Mining techniques to discover interesting usage patterns and statistical correlations between web pages and user groups. This step frequently results in automatic user profiling, and is typically applied offline, so that it does not add a burden on the web server.
1.2.6.4 Decision Making/Final Recommendation Phase
The last phase in personalization makes use of the results of the previous analysis step to deliver recommendations to the user. The recommendation process typically involves generating dynamic Web content on the fly, such as adding hyperlinks to the last web page requested by the user. This can be accomplished using a variety of Web technology options such as CGI programming.
Figure 1.7: Personalization Architecture (built around Web usage mining)
1.2.7 Advantages of Web Mining
The advantages of Web Mining are as follows -
Eliminating/ Combining low visit pages
Shortening Paths of high visit pages
Redesigning pages to help user navigation
Redesigning pages for search engine optimization
Help evaluating effectiveness of advertising campaigns
1.2.8 Privacy Issues
The most criticized ethical issue involving Web usage mining is the invasion of privacy. Privacy is considered lost when information concerning an individual is obtained, used, or disseminated, especially if this occurs without their knowledge or consent.
1.2.9 Applications of Web Data Mining
The main motivation behind this dissertation is the correlation between Web usage mining and Web personalization. The work on Web usage mining can be a source of ideas and solutions towards realizing Web personalization. The ultimate goal of Web personalization is to provide Web users with the next page they will access in a browsing session. This is achieved by analyzing their browsing patterns and comparing the discovered patterns to similar patterns in history. Traditionally, this has been used to support the decision making process of Web site operators, in order to gain a better understanding of their visitors, to create a more efficient structure of their Web sites and to perform more effective marketing.
Guiding Web site users by providing them with recommendations of a set of hyperlinks that are related to their interests and preferences, thereby improving their navigational experience, and providing users with personalized and customized page layout, hyperlinks and content depending on their interests and preferences
Performing some actions on behalf of users, such as sending e-mail, downloading items, completing or enhancing the users' queries, or even participating in Web auctions on behalf of Web users
Learning and predicting user clicks in Web based search facilities Zhou et al. (2007). This offers an automated explanation of Web user activity. Also, the measurement of the likelihood of clicks can infer a user's judgment of search results and improve Web page ranking
Minimizing the latency of viewing pages, especially image files, by pre-fetching Web pages or by pre-sending documents that a user will visit next Yang et al. (2003). Web pre-fetching goes one step further by anticipating the Web users' future requests and pre-loading the predicted pages into a cache. This is a major method of reducing Web latency, which can be measured as the difference between the time when a user makes a request and the time when the user receives the response. Web latency is particularly important to Web surfers and e-commerce Web sites
Customizing Web site interfaces by predicting the next relevant pages or products and overcoming the information overload by providing multiple short-cut links relevant to the items of interest in a page
Improving site topology as well as market segmentation
Improving the Web advertisement area where a substantial amount of money is paid for placing the correct advertisements on Web sites. Using Web page access prediction, the right ad will be predicted according to the users' browsing patterns
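The pre-fetching application listed above can be sketched with first-order transition counts between pages: the page most often requested after the current one is the candidate to pre-load into the cache. The history data and function names below are illustrative:

```python
from collections import defaultdict

def transition_counts(sessions):
    """Count first-order transitions page -> following page over sessions."""
    counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        for current, nxt in zip(session, session[1:]):
            counts[current][nxt] += 1
    return counts

def predict_next(page, counts):
    """Return the page most often requested after `page`, i.e. the
    pre-fetching candidate, or None if the page was never followed."""
    followers = counts.get(page)
    if not followers:
        return None
    return max(followers, key=followers.get)

history = [
    ["/home", "/gallery", "/gallery/photo1.jpg"],
    ["/home", "/gallery", "/gallery/photo2.jpg"],
    ["/home", "/about"],
]
counts = transition_counts(history)
```

While a user reads /home, the server (or a proxy) could already push /gallery into the cache, since it is the most frequent successor in the observed sessions; the same counts could rank candidate advertisements or shortcut links.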