Data Preprocessing Usage Data Computer Science Essay


Web mining is the process of discovering potentially useful and previously unknown information from Web data. It uses many data mining techniques, but it is not simply an application of traditional data mining, because of the heterogeneity and the semi-structured or unstructured nature of Web data. The Web mining process is similar to the data mining process [1]; the difference usually lies in data collection. In traditional data mining, the data is often already collected and stored in a data warehouse. For Web mining, data collection can be a substantial task, especially for Web structure and content mining, which involve crawling a large number of target Web pages. The process comprises the same four major tasks: data collection, data preprocessing, pattern discovery, and pattern analysis. However, the techniques used in each step can be quite different from those used in traditional data mining.

Web mining is the application of data mining techniques to extract knowledge from Web data, i.e. Web content, Web structure, and Web usage data. Web content mining extracts or mines useful information or knowledge from Web page contents. Web document text mining and resource discovery based on concept indexing or agent-based technology may also fall into this category. Web structure mining aims to discover useful knowledge from hyperlinks, which represent the structure of the Web; a hyperlink is a link in a Web page that refers to another region of the same page or to another Web page. Finally, Web usage mining, also known as Web log mining, aims to capture and model the behavioural patterns and profiles of users who interact with a Web site.

Figure: 1 Taxonomy of Web Mining

General Process of Web Usage Mining

Web usage mining (WUM) is an emerging area of Web mining that has attracted considerable attention from both academia and industry in recent years. Web servers collect large volumes of data from Web site usage, which is stored in Web access log files. Together with the Web access log files, other data can be used in WUM, such as Web structure information, user profiles, and Web page content.

We divide the WUM in three main steps: preprocessing, pattern discovery and pattern analysis. The Web site structure information could be used in the preprocessing task, for example to generalize Web pages (i.e. replace multiple pages with a higher level index page). Moreover, when analyzing the patterns discovered, the site structure could be used to highlight "unexpected" patterns, i.e. patterns having high link distances between their Web pages.


Data Preprocessing

Data preprocessing is generally used as the groundwork of the data mining process. The preprocessing task within the WUM process involves cleaning and structuring the data to prepare it for the pattern discovery task.

Pattern Discovery

In this step, WUM uncovers patterns in the server logs; pattern discovery is often carried out only on samples of the data.

Interpretation and evaluation of the results can likewise be done on samples of the data. The main pattern discovery methods are statistical analysis, association rules, clustering, classification, sequential patterns, and dependency modeling.

Pattern Analysis

The need behind pattern analysis is to filter out uninteresting rules or patterns from the set found in the pattern discovery stage. The most common form of pattern analysis uses a knowledge query mechanism such as SQL. Content and structure information can be used to filter out patterns containing pages of a certain usage type or content type, or pages that match a certain hyperlink structure.

Web Log Format

Log files are a standard tool for computer systems developers and administrators. They record the "what happened when, and by whom" of the system. Log files contain information such as the user name, IP address, timestamp, access request, number of bytes transferred, result status, referring URL, and user agent [5]. The log files are maintained by the Web servers, and analysing them gives a clear picture of user behaviour. A Web log is a file to which the Web server writes information each time a user requests a resource from that particular server. A log file can be located in three different places:

• Web Servers

• Web proxy Servers

• Client browsers

A web server log file contains the requests made to the web server, recorded in chronological order. The most popular log file formats are the Common Log Format (CLF) and the Extended CLF [6][11]. A common log format file is created by the web server to keep track of the requests that occur on a web site. A standard log file has the following format [12][13]:
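The sample layout from [12][13] is not reproduced here. As a hedged illustration, a single CLF entry carries the standard fields host, identity, user, timestamp, request line, status, and bytes, and can be parsed with a short Python sketch; the regex group names are descriptive choices, not part of the standard:

```python
import re

# Regex for one Common Log Format entry; the field order
# (host, identity, user, timestamp, request, status, bytes) is standard CLF.
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<identity>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_clf(line):
    """Return a dict of CLF fields, or None if the line does not match."""
    match = CLF_PATTERN.match(line)
    return match.groupdict() if match else None

entry = parse_clf(
    '192.168.1.10 - alice [10/Oct/2023:13:55:36 +0000] '
    '"GET /index.html HTTP/1.0" 200 2326'
)
print(entry["status"])   # 200
print(entry["request"])  # GET /index.html HTTP/1.0
```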




Preprocessing Techniques

Data preprocessing describes any type of processing performed on raw data to prepare it for a further processing procedure. Commonly used as a preliminary data mining practice, it transforms the data into a format that can be processed more easily and effectively for the user's purpose, for example by a neural network [3][7]. A number of different tools and methods are used for preprocessing, including: sampling, which selects a representative subset from a large population of data; transformation, which manipulates raw data to produce a single input; denoising, which removes noise from the data; normalization, which organizes data for more efficient access; and feature extraction, which pulls out specified data that is significant in some particular context [10].

Figure: 2 Methodology for Preprocessing in WUM

Data Fusion

At the beginning of data preprocessing, we have the log containing the Web server log files collected by several Web servers, as well as the Web site maps. First we join the log files, and then we anonymize the resulting joint log file for privacy reasons.

Joining Log Files

First, we join the different log files from the log, gathering the requests from all files into a single joint log file. Generally, the requests in the log files do not include the name of the server. However, we need the Web server name to distinguish between requests made to different Web servers; therefore we add this information to each request (before the file path). Moreover, we have to take into account the synchronization of the Web server clocks, including time zone differences.
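The joining step described above can be sketched in Python. The `join_logs` function and its (timestamp, path) entry layout are illustrative assumptions, not the actual implementation used here:

```python
from datetime import datetime, timedelta, timezone

def join_logs(server_logs):
    """Merge per-server logs into one chronologically ordered log.

    server_logs: dict mapping server name -> list of (timestamp, path)
    tuples, where timestamps are timezone-aware datetimes.  The server
    name is prepended to each path so that requests to different servers
    stay distinguishable, and timestamps are normalized to UTC to
    account for clock and time-zone differences.
    """
    joint = []
    for server, entries in server_logs.items():
        for ts, path in entries:
            joint.append((ts.astimezone(timezone.utc), f"{server}{path}"))
    joint.sort(key=lambda entry: entry[0])
    return joint

logs = {
    "www1.example.com": [
        (datetime(2023, 10, 10, 14, 0, tzinfo=timezone(timedelta(hours=2))), "/a.html"),
    ],
    "www2.example.com": [
        (datetime(2023, 10, 10, 12, 30, tzinfo=timezone.utc), "/b.html"),
    ],
}
merged = join_logs(logs)
# The www1 entry is 12:00 UTC, so it comes first despite its 14:00 local time.
print([path for _, path in merged])
```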

Anonymizing Log Files

When sharing log files or publishing results, for privacy reasons, we need to remove the host names or the IP addresses. Therefore, we replace the original host name with an identifier that keeps the information about the domain extension.
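This replacement can be sketched as follows; the `anonymize_hosts` function and its identifier scheme are illustrative assumptions:

```python
def anonymize_hosts(hosts):
    """Replace each distinct host with an opaque identifier that keeps
    only the domain extension (e.g. '.edu', '.com')."""
    mapping = {}
    anonymized = []
    for host in hosts:
        if host not in mapping:
            extension = host.rsplit(".", 1)[-1]
            mapping[host] = f"host{len(mapping) + 1}.{extension}"
        anonymized.append(mapping[host])
    return anonymized

print(anonymize_hosts(["pc1.univ.edu", "pc1.univ.edu", "laptop.corp.com"]))
# ['host1.edu', 'host1.edu', 'host2.com']
```

Because the same host always maps to the same identifier, user and session identification later in the pipeline still works on the anonymized file.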

Data Cleaning

The data cleaning process removes irrelevant data from the log file. This data can be requests from non-analyzed sources, records with missing attributes, or attributes that are not needed for the project goal. This step greatly reduces the size of the data, and the reduction also helps remove false associations that such data could have created. When a request for a web page is made, many resources are loaded along with it, including the image and graphics files referenced by the HTML tags. Since we are interested only in the data explicitly requested by the user, and not in any system-generated data, we need to make sure that only user-requested data remains in the server logs; any system-generated entries should be removed from the log files.

The following algorithm [21] can be used for data cleaning:

Algorithm DataCleaning (InputLog: Web log file; OutputLog: Web log file)

While not eof (InputLog) Do

LogRecord = Read (InputLog)

If ((LogRecord.Cs-uri-stem does not end in .gif, .jpeg, .jpg, .css, .js)

AND (LogRecord.Cs-method = 'GET')

AND (LogRecord.Sc-status = 200)

AND (LogRecord.User-agent does not contain 'Crawler' or 'Spider'))

Then Write (OutputLog, LogRecord)

End If

End While
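For reference, the same cleaning rules can be written as a runnable Python sketch. The dictionary field names mirror the log attributes used in the pseudocode, but the function names and record layout are illustrative assumptions:

```python
IRRELEVANT_SUFFIXES = (".gif", ".jpeg", ".jpg", ".css", ".js")
ROBOT_AGENTS = ("crawler", "spider")

def is_relevant(record):
    """Keep only successful GET requests for pages made by human users."""
    url = record["cs-uri-stem"].lower()
    agent = record["user-agent"].lower()
    return (
        not url.endswith(IRRELEVANT_SUFFIXES)      # drop embedded resources
        and record["cs-method"] == "GET"           # keep only GET requests
        and record["sc-status"] == 200             # keep only successes
        and not any(bot in agent for bot in ROBOT_AGENTS)  # drop robots
    )

def clean_log(records):
    return [r for r in records if is_relevant(r)]

records = [
    {"cs-uri-stem": "/index.html", "cs-method": "GET", "sc-status": 200,
     "user-agent": "Mozilla/5.0"},
    {"cs-uri-stem": "/logo.gif", "cs-method": "GET", "sc-status": 200,
     "user-agent": "Mozilla/5.0"},
    {"cs-uri-stem": "/index.html", "cs-method": "GET", "sc-status": 200,
     "user-agent": "Googlebot spider"},
]
print(len(clean_log(records)))  # 1
```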


Figure: 3 Sample report of web log

Data Structuration

This step groups the unstructured requests of a log file by user, user session and page view. At the end of this step, the log file will be a set of transactions.

User Identification

In most cases, the log file provides only the computer address (name or IP) and the user agent. For Web sites requiring user registration, the log file also contains the user login (as the third field of a log entry), and in this case we use that information for user identification. Otherwise, identification is ambiguous: two users sharing the same IP address and the same browser agent appear to be a single user, while a single user connecting from different IP addresses or different browsers appears to be several users.
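The (IP address, user agent) heuristic can be sketched as follows; the entry layout and function name are assumptions for illustration:

```python
from collections import defaultdict

def identify_users(entries):
    """Group log entries into pseudo-users keyed by (IP, user agent).

    This heuristic is imperfect, as noted above: two people behind one
    proxy with the same browser collapse into one user, and one person
    switching IP or browser splits into two.
    """
    users = defaultdict(list)
    for entry in entries:
        users[(entry["ip"], entry["agent"])].append(entry["url"])
    return dict(users)

entries = [
    {"ip": "10.0.0.1", "agent": "Firefox", "url": "/a"},
    {"ip": "10.0.0.1", "agent": "Firefox", "url": "/b"},
    {"ip": "10.0.0.1", "agent": "Chrome", "url": "/c"},
]
print(len(identify_users(entries)))  # 2
```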

User Session Identification

As the name suggests, session identification splits the sequence of a user's accesses into separate visits. We use a timeout mechanism: if the time between two consecutive page requests exceeds a limit, usually 30 minutes, a new session is assumed to start. Therefore, if a user's activity contains a gap of more than 30 minutes, it is divided into more than one session. This approach lets us develop user statistics and helps us identify when the user is no longer accessing the requested page.
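The 30-minute timeout heuristic can be sketched as follows; the function name and input format are illustrative:

```python
SESSION_TIMEOUT = 30 * 60  # seconds

def sessionize(timestamps):
    """Split a user's sorted request timestamps (in seconds) into
    sessions: a gap of more than 30 minutes starts a new session."""
    sessions = []
    current = []
    for ts in timestamps:
        if current and ts - current[-1] > SESSION_TIMEOUT:
            sessions.append(current)
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions

# Requests at t=0s, 10min, 50min: the 40-minute gap splits two sessions.
print(len(sessionize([0, 600, 3000])))  # 2
```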

Figure: 4 A frequency chart of the most frequently visited sessions.

Page View Identification

The requests are grouped by page views with the following algorithm:

A) When the request for the page view pi is in the log file, we remove the log entries corresponding to the embedded resources of pi and keep only the request for pi.

B) When the request for pi is absent (due to the browser or proxy cache), but some entries for its embedded resources ri are present and these entries have pi in the referrer field, we replace the entries corresponding to the resources with a single request for pi and set the time of this request to ti = min{time(li)}, where li is the log entry corresponding to resource ri.

C) A third solution consists of using a statistical or data mining approach to identify Web pages that are usually requested together within a short period of time.

D) Another is to use sequential patterns with high confidence obtained from the user sessions or visits.

After the page view identification, the log file will contain, normally, only one request for each user action.

Data Summarization

This is one of the advanced data preprocessing tasks, performed after all of the above. In this step, the data is inserted into a relational database system for further generalization and computation. We design a table in the relational database for each object identified in classical preprocessing. Generalization transforms the set of URLs, syntactically or semantically, to consistently reduce their number. Aggregated data computation builds new parameters from the existing data and adds them; these parameters are statistical values that characterize the analyzed object. For instance, if the analyzed object is a user session, we can compute the:

• Number of visits for that session

• Session length in time (the difference between the last visit's date and the first visit's date) and in the total number of page views

• Number of visits per considered period, which can be a day, week, or month

• Percentage of requests made to each Web server

If the analyzed object is a visit, we can compute the

• Visit length in terms of time and page views

• Percentage of successful requests

• Average time spent on a page
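The visit-level parameters listed above can be computed with a small sketch; the record layout of (timestamp, status) pairs and the function name are assumptions for illustration:

```python
def visit_statistics(visit):
    """Compute aggregate parameters for one visit, given a list of
    (timestamp_seconds, status_code) page-view records in time order."""
    times = [t for t, _ in visit]
    length = times[-1] - times[0]                      # visit length in time
    successful = sum(1 for _, status in visit if status == 200)
    pct_success = 100.0 * successful / len(visit)      # % successful requests
    # Time on a page = gap to the next request; the last page's time is unknown.
    gaps = [b - a for a, b in zip(times, times[1:])]
    avg_time = sum(gaps) / len(gaps) if gaps else 0.0  # average time per page
    return {"length": length, "pct_success": pct_success,
            "avg_time_per_page": avg_time}

stats = visit_statistics([(0, 200), (60, 200), (180, 404)])
# stats["length"] == 180, stats["avg_time_per_page"] == 90.0
print(stats)
```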

Frequent Pattern Algorithm

Association rule mining is one of the major techniques of data mining and the most common form of local frequent-pattern discovery in unsupervised learning systems. It is a useful tool for finding correlations between items in large databases. The terms used in these rules are:

Support: The support supp(X) of an itemset X is defined as the proportion of transactions in the data set which contain the itemset.

supp(X) = no. of transactions which contain the itemset X / total no. of transactions

Confidence: a measure of the certainty associated with each discovered pattern. The confidence of an association rule X -> Y is the ratio of the number of transactions that contain X ∪ Y to the number of transactions that contain X:

conf(X -> Y) = supp(X ∪ Y) / supp(X)

Large Item Set: A large item set is an item set whose number of occurrences is above a threshold or support.
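These definitions translate directly into code. As a minimal sketch, treating user sessions as transactions (the session data below is invented for illustration):

```python
def support(itemset, transactions):
    """supp(X): fraction of transactions that contain every item of X."""
    count = sum(1 for t in transactions if itemset <= t)
    return count / len(transactions)

def confidence(x, y, transactions):
    """conf(X -> Y) = supp(X ∪ Y) / supp(X)."""
    return support(x | y, transactions) / support(x, transactions)

# Each session is a set of visited pages, acting as one transaction.
sessions = [
    {"/home", "/products", "/cart"},
    {"/home", "/products"},
    {"/home", "/about"},
    {"/products", "/cart"},
]
print(support({"/home", "/products"}, sessions))       # 0.5
print(confidence({"/products"}, {"/cart"}, sessions))  # ≈ 0.667
```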

Association rule mining is the technique most used in Web Usage Mining. It is generally applied to databases of transactions, where each transaction consists of a set of items. In Web Usage Mining, association rules are used to find associations among web pages that frequently appear together in users' sessions. The most common approach to finding association rules breaks the problem into two parts:

1. Finding all frequent itemsets.

2. Generating strong association rules from the frequent itemsets, i.e. rules satisfying minimum support and minimum confidence.

Many algorithms have been proposed to solve the problem of detecting frequent itemsets in a transaction database. In this paper we explain the Apriori algorithm.

Apriori represents the candidate generation approach. It is a breadth-first search (BFS) algorithm that generates candidate (k+1)-itemsets from frequent k-itemsets. The key idea of the Apriori algorithm is to make multiple passes over the database.

The advantages of using the Apriori algorithm are:

• It uses the large itemset property.

• It is easily parallelized.

• It is easy to implement.

Apriori Algorithm

Apriori is a seminal algorithm for finding frequent itemsets using candidate generation. It is a level-wise, complete search algorithm that exploits the anti-monotonicity of itemsets: if an itemset is not frequent, none of its supersets can be frequent. Apriori assumes that the items in a transaction or itemset are sorted in lexicographic order.

Let Lk be the set of frequent itemsets of size k and Nk their candidates. Apriori first scans the database and finds the frequent itemsets of size 1 by accumulating the count of each item and keeping those that satisfy the minimum support requirement. It then iterates the following three steps to extract all the frequent itemsets:

1. Generate Nk+1, candidates of frequent itemsets of size k +1, from the frequent itemsets of size k.

2. Scan the database and calculate the support of each candidate of frequent itemsets.

3. Add those itemsets that satisfy the minimum support requirement to Lk+1.

The Apriori algorithm is shown below. The function apriori-gen in line 3 generates Nk+1 from Lk in the following two-step process:

1. Join step: generate Rk+1, the initial candidates of frequent itemsets of size k + 1, by taking the union of two frequent itemsets of size k, Pk and Qk, that have their first k−1 elements in common:

Rk+1 = Pk ∪ Qk = {item1, item2, . . . , itemk−1, itemk, itemk'}

Pk = {item1, item2, . . . , itemk−1, itemk}

Qk = {item1, item2, . . . , itemk−1, itemk'}

where item1 < item2 < · · · < itemk < itemk'.

2. Prune step: check whether all the size-k subsets of each itemset in Rk+1 are frequent, and generate Nk+1 by removing from Rk+1 those that fail this test. If any size-k subset of a candidate is not frequent, the candidate cannot be a frequent itemset of size k + 1.

It is evident that Apriori scans the database at most kmax + 1 times, where kmax is the maximum size of a frequent itemset.


L1 = {frequent itemsets of cardinality 1};

for (k = 1; Lk ≠ ∅; k++) do begin

Nk+1 = apriori-gen(Lk); // new candidates

for all transactions t ∈ Database do begin

N't = subset(Nk+1, t); // candidates in Nk+1 that are contained in t

for all candidates n ∈ N't do

n.count++;

end

Lk+1 = {candidates in Nk+1 with count ≥ min_support};

end

return Uk Lk;
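As a hedged, self-contained sketch (using Python sets rather than the sorted-list representation the algorithm assumes, so the join and prune steps are expressed with set operations), the level-wise candidate generation, pruning, and counting can be implemented as:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise Apriori: frequent k-itemsets seed (k+1)-candidates."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent(candidates):
        # Keep candidates whose support meets the minimum threshold.
        return {c for c in candidates
                if sum(1 for t in transactions if c <= t) / n >= min_support}

    # L1: frequent 1-itemsets.
    items = {frozenset([i]) for t in transactions for i in t}
    level = frequent(items)
    all_frequent = set(level)
    k = 1
    while level:
        # Join step: union pairs of frequent k-itemsets sharing k-1 items.
        candidates = {a | b for a in level for b in level
                      if len(a | b) == k + 1}
        # Prune step: every size-k subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level
                             for s in combinations(c, k))}
        level = frequent(candidates)
        all_frequent |= level
        k += 1
    return all_frequent

# Sessions as transactions of visited pages (invented for illustration).
sessions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
result = apriori(sessions, min_support=0.5)
print(frozenset({"a", "b"}) in result)  # True
```

With min_support = 0.5 every pair occurs in exactly two of the four sessions and is frequent, while {a, b, c} occurs only once and is correctly rejected.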


In this paper we surveyed some data preprocessing activities, such as data cleaning and data summarization. It is important to note that before applying data mining techniques to discover user access patterns from Web logs, the data must be preprocessed, because the quality of the results depends on the data being mined. Data preprocessing is a significant and prerequisite phase in Web mining. Various heuristics are employed in each step to remove irrelevant items and to identify users and sessions along with the browsing information.

In addition to the above mentioned preprocessing and formatting tasks, the future work involves various data transformation tasks that are likely to influence the quality of the discovered patterns resulting from the mining algorithms.


The authors wish to thank their guides for giving their valuable time and resources, which helped us carry out this research. The research was carried out as part of post-graduate dissertation work.