Overview Of Preprocessing In Web Computer Science Essay


The World Wide Web is growing rapidly, with a huge number of user interactions. Web usage mining is the process of extracting navigation behavior patterns that allow users' interactions with Web-based environments to be analyzed. Users' accesses to the web are stored in server logs. The raw server log files do not present an accurate picture of users' access to the website. This paper gives an overview of the preprocessing phase in web usage mining. Preprocessing of the Web log data is an essential prerequisite phase before the data can be used for pattern discovery or mining tasks. The preprocessed web data is then suitable for the discovery and analysis of useful information.

The World Wide Web can be regarded as a huge library. The number of websites and their usage are increasing rapidly. The World Wide Web consists of documents, images and other resources that are interconnected by links and referenced by Uniform Resource Identifiers (URIs). URIs identify documents, files, service providers, and servers' services. The main access protocol of the WWW is the HyperText Transfer Protocol (HTTP), one of hundreds of communication protocols used on the Internet.

D. Uma Maheswari, Assistant Professor, Department of Computer Science, Bharathiar University Arts and Science College, Valparai, Coimbatore, Tamil Nadu. ([email protected]).

Dr. A. Marimuthu, Associate Professor, Department of Computer Science, Government Arts College, Coimbatore. ([email protected])

Through web services, different applications can communicate and share information and services. Web services give a business more opportunity to connect with partners and to expose more services, which can increase revenue. Browser software such as Opera, Apple's Safari, Google Chrome, Mozilla Firefox, Netscape Navigator, Mosaic, and Internet Explorer 9 is used to navigate from one page to another through hyperlinks embedded in the pages. These pages may contain a combination of graphics (2D, 3D and animated), audio, video, text, etc. These pages respond automatically while the user interacts with them. Using keywords, users can retrieve relevant information through search engines such as Google, Yahoo!, Bing, etc. Through the web, ideas can be shared with an online audience, which reduces expense and time. Many free services on the web also support building web page applications, web sites, and blogs. Web mining is an application of data mining. It can be defined as extracting knowledge from web data, including web documents, logs of websites, etc. Web mining is divided into three categories: Web content mining, Web structure mining and Web usage mining. Web usage mining is itself divided into four phases: data collection, preprocessing, pattern discovery and pattern analysis. The data are collected from three main sources: web servers, proxy servers, and web clients. This paper focuses on the following: i) preprocessing of data from logs, resulting in a user session file; ii) formatting the user session file into a form suitable for the mining task.


Various research papers address the data preprocessing phase. C.P. Sumathi [1] discussed the data preprocessing steps in detail. Aye, T.T., et al. [2] discussed how removing noisy data with a cleaning algorithm can be useful for discovering patterns. V. Chitraa et al. [3] carried out experiments on session identification. Yhan [4] presented the path completion technique used in web usage mining. Federico Michele Facca et al. [5] surveyed recent developments in web usage mining research. J. Srivatsava et al. [6] described the discovery of web usage patterns in web usage mining.


In web usage mining, the data originate from three kinds of sources: server data, client data and intermediate data (proxy server data and packet sniffing). Web servers store information on disk and make it available on the Internet. Web servers are the richest source, since they collect large amounts of data in their log files. These log files usually contain basic information about each request, so when many users visit a web site, their behavior on that site can be captured. The web server log file is considered the most reliable and accurate source for the WUM process. The log information is stored in a standard format such as the Common Log File Format (NCSA), the Extended Log Format (W3C), or the IIS Log Format (Microsoft).

Sample log format

- -

[08/Dec/2011:11:17:55 -0400]

"GET / HTTP/1.1"

200 10802

"http://www.yahoo.com/search?q=log+analyzer&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a"

"Mozilla/9.0.1 (Windows; U; Windows xp 5.2; en-US; rv: Gecko/20070914 Firefox/9.0.1"

IP address: ""

The IP address of the machine that contacted our website.

Remote log name: "-"

This is a dash unless the identity check is enabled on the web server.

Authenticated user name: "-"

This field is filled in only when the content is password protected by the web server; it can be used for authentication purposes.

Timestamp: [08/Dec/2011:11:17:55 -0400]

The time of the visit as seen by the web server. -0400 is the time zone designator.

Access request: "GET / HTTP/1.1"

The access request is made with a "GET" request, i.e. "show me the page", for the file "/" (the homepage), using the "HTTP/1.1" protocol.

Result status code: "200"

If the status code is 200, the request was successful; other codes may indicate an error. The status code can therefore be used to identify failed requests, such as 401 "Unauthorized", 404 "File or Page Not Found", 500 "Internal Server Error", and 503 "Service Unavailable".

Bytes transferred: "10802"

The number of bytes transferred to the user. The homepage file is 10802 bytes, or about 10.5 KB. Adding up this value across requests gives the total bandwidth used on the web site, which can be broken down per file and per visitor.

Referrer URL : "http://www.yahoo.com/search?q=log+analyzer&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a"

This is the referring URL: the page on which the visitor clicked a link to come to this page. The referring page may be on this site or on another site. When the user types the address directly into the web browser, the referrer appears in the log as "-". From users' visits, user profiles can be identified and the webmaster can optimize the web site.

User Agent: "Mozilla/9.0.1 (Windows; U; Windows xp 5.2; en-US; rv: Gecko/20070914 Firefox/9.0.1"

This is the "User Agent" identifier of the software that the visitor used to access the web site. It is normally a web browser, but it may equally be a web robot, a hyperlink checker, or an FTP client that stores and retrieves information.

The user agent string identifies the browser and provides some details about it to the server hosting the website. "Mozilla/9.0.1" means a Mozilla-compatible web browser interacting with the server, "Windows XP" indicates the operating system, and "en-US" implies that this is the US English version. "Firefox/9.0.1" means version 9.0.1 of Firefox. The first hit in the log is caused by the Google spider.
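The fields described above can be extracted from a log line with a regular expression. The following is a minimal sketch in Python, assuming the combined log layout of the sample. The IP address 192.0.2.10 is a documentation placeholder and the user agent string is simplified, since neither appears in full in the sample; the names LOG_PATTERN and parse_line are illustrative.

```python
import re
from datetime import datetime

# Pattern for one line in combined log format (assumed layout): host, log name,
# user, timestamp, request, status, bytes, referrer, user agent.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<logname>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Convert the timestamp, status code and byte count to native types.
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    # Split the request into method, URL and protocol when it is well formed.
    parts = entry["request"].split()
    entry["method"], entry["url"], entry["protocol"] = (
        parts if len(parts) == 3 else (None, None, None)
    )
    return entry


# Placeholder IP address and simplified user agent (illustrative only).
sample = (
    '192.0.2.10 - - [08/Dec/2011:11:17:55 -0400] "GET / HTTP/1.1" 200 10802 '
    '"http://www.yahoo.com/search?q=log+analyzer" "Mozilla/5.0 (Windows) Firefox/9.0.1"'
)
entry = parse_line(sample)
print(entry["status"], entry["bytes"], entry["url"])
```

Lines that do not match the pattern return None, so malformed records can be counted or discarded before the later preprocessing steps.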


Data preprocessing is an important step in the data mining process. Raw data are often deficient, containing out-of-range values, impossible data combinations, missing values, inconsistencies, noise and irrelevant records, which make mining more difficult. Data preparation and filtering steps are applied to such data. Preprocessing is necessary to transform these data so that the resulting database is integrated and consistent, that is, a purified database. Data preprocessing mainly includes data cleaning, user identification, session identification and path completion.

Fig.1 Details of preprocessing phase in web usage mining


Users may be served by multiple web servers: multiple servers with redundant content are used to reduce the load on any particular server. Data fusion refers to the merging of log files from several Web and application servers. The referrer field in server logs, along with various sessionization and user identification methods, can be used to perform the merging.
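As a rough illustration of data fusion, the sketch below merges entries from two servers into one time-ordered stream. It is a simplification: it orders records by timestamp only and does not implement the referrer-based merging mentioned above. The function name fuse_logs and the sample entries are hypothetical, and each per-server list is assumed to be already sorted by time.

```python
import heapq
from datetime import datetime


def fuse_logs(*server_logs):
    """Merge per-server entry lists (each sorted by time) into one ordered list."""
    return list(heapq.merge(*server_logs, key=lambda entry: entry["time"]))


# Hypothetical entries from two servers hosting the same site.
server_a = [
    {"time": datetime(2011, 12, 8, 11, 17, 55), "server": "a", "url": "/"},
    {"time": datetime(2011, 12, 8, 11, 19, 10), "server": "a", "url": "/products"},
]
server_b = [
    {"time": datetime(2011, 12, 8, 11, 18, 30), "server": "b", "url": "/about"},
]

merged = fuse_logs(server_a, server_b)
print([entry["url"] for entry in merged])
```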

Data cleaning is the process of removing outliers and inappropriate data, including references to style files, graphics, or sound files that may not be important for the purpose of analysis. Requests fail for various reasons, and the resulting erroneous log records can be identified by their HTTP status code. Records with a failed HTTP status code are also eliminated from the logs. Cleaning the data reduces the number of log entries, which is important for finding valid access patterns.
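The cleaning rules above can be sketched as a simple filter. The list of ignored file extensions and the choice to keep only 2xx status codes are assumptions for illustration; a real site would tune both. The names IGNORED_EXTENSIONS, is_relevant and clean are hypothetical.

```python
from urllib.parse import urlparse

# Extensions treated as irrelevant for analysis (an assumed list; adjust per site).
IGNORED_EXTENSIONS = (
    ".css", ".js", ".gif", ".jpg", ".jpeg", ".png", ".ico", ".mp3", ".wav",
)


def is_relevant(entry):
    """Keep successful (2xx) requests for pages rather than embedded resources."""
    if not 200 <= entry["status"] < 300:
        return False
    # Ignore any query string when checking the file extension.
    path = urlparse(entry["url"]).path.lower()
    return not path.endswith(IGNORED_EXTENSIONS)


def clean(entries):
    """Return only the entries that pass the relevance check."""
    return [entry for entry in entries if is_relevant(entry)]


# Hypothetical parsed log entries.
raw = [
    {"url": "/index.html", "status": 200},
    {"url": "/style.css", "status": 200},
    {"url": "/logo.gif?v=2", "status": 200},
    {"url": "/missing.html", "status": 404},
    {"url": "/products.html", "status": 200},
]
cleaned = clean(raw)
print([entry["url"] for entry in cleaned])
```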


After the data cleaning phase, the next important and complex step is unique user identification. Web usage mining analysis does not require knowledge about a user's history. In the absence of authentication mechanisms, unique visitors are usually distinguished by the use of cookies. However, cookies are sometimes disabled by users. An IP address alone is not sufficient for mapping log entries to the set of unique visitors. This is due to proxy servers, which share their IP addresses among the clients that browse the Web through them. It is not rare to find many log entries corresponding to a limited number of proxy server IP addresses. Another solution is user registration, but most users neglect to give their information, so most records contain no information in the user ID and authentication fields. The fields that are useful for finding unique users and sessions are the IP address, user agent, time, and referrer URL in the log files.

Fig.2 Building user profiles for user identification


User identification is necessary for the discovery of access patterns, and distinguishing among different users is its central task. When a client sends a request to the server, it also sends its user agent; the server responds to the client and records the request in its log file. A user may visit more than one web site. If two records have different IP addresses, they are treated as two different users; if both IP addresses are the same, the User Agent field is checked. The combination of user IP and user agent is thus used to identify users. If the web browser or the operating system differs between two records, they are considered different users. Users are identified using the IP address, user agent, time, and URL in the log files.
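The IP plus user agent rule can be sketched as follows. The function name identify_users, the placeholder IP addresses and the shortened user agent strings are hypothetical; the sketch treats each distinct (IP, user agent) pair as one user and does not use the time or URL fields.

```python
def identify_users(entries):
    """Group entries by user, where a user is a distinct (IP, user agent) pair."""
    user_ids = {}  # (ip, agent) -> user number, in order of first appearance
    users = {}     # user number -> list of that user's entries
    for entry in entries:
        key = (entry["ip"], entry["agent"])
        if key not in user_ids:
            user_ids[key] = len(user_ids) + 1
        users.setdefault(user_ids[key], []).append(entry)
    return users


# Hypothetical entries: two browsers behind one IP address, plus a second IP.
log = [
    {"ip": "192.0.2.10", "agent": "Firefox/9.0.1 (Windows)", "url": "/"},
    {"ip": "192.0.2.10", "agent": "Chrome/16.0 (Linux)", "url": "/"},
    {"ip": "192.0.2.10", "agent": "Firefox/9.0.1 (Windows)", "url": "/products"},
    {"ip": "192.0.2.20", "agent": "Firefox/9.0.1 (Windows)", "url": "/about"},
]
users = identify_users(log)
for number, visits in users.items():
    print(number, [visit["url"] for visit in visits])
```

With this rule, the two browsers sharing 192.0.2.10 are counted as separate users, which is the behavior described above for requests arriving through the same proxy.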

Fig.3 Log File

Fig.4 User 1

Fig.5 User 2


After user identification, the next step is the identification of sessions. A session is the activity of an internet user on a website, made up of HTTP requests and responses. The user can visit more than one web site. Closing the web browser and then reopening it and visiting the website again generates a new session. The aim of session identification is to split the page accesses of each user into one or more sessions. In general, a default timeout of 30 minutes is used to split the pages accessed by a user into different sessions. If the time spent by the user on the web site exceeds this default limit, the subsequent requests are considered a new session.

The following example shows an HTTP request incorporating a session identifier.

GET / HTTP/1.0

Accept: text/plain

Accept: text/html

Session-Id: SID:ANON:w3.org:j6oAOxCWZh/CD723LGeXlf-01:034

User-Agent: libwww/4.1
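When no session identifier is available in the log, the 30-minute timeout rule described earlier can be applied. The sketch below is one possible reading of that rule: it starts a new session whenever the gap between two consecutive requests of the same user exceeds the timeout. The names SESSION_TIMEOUT and split_sessions and the sample visits are hypothetical.

```python
from datetime import datetime, timedelta

# Default timeout of 30 minutes, as stated in the text.
SESSION_TIMEOUT = timedelta(minutes=30)


def split_sessions(entries, timeout=SESSION_TIMEOUT):
    """Split one user's entries into sessions on gaps longer than the timeout."""
    sessions = []
    previous_time = None
    for entry in sorted(entries, key=lambda e: e["time"]):
        if previous_time is None or entry["time"] - previous_time > timeout:
            sessions.append([])  # start a new session
        sessions[-1].append(entry)
        previous_time = entry["time"]
    return sessions


# Hypothetical page accesses of a single user.
visits = [
    {"time": datetime(2011, 12, 8, 11, 0), "url": "/"},
    {"time": datetime(2011, 12, 8, 11, 10), "url": "/products"},
    {"time": datetime(2011, 12, 8, 11, 50), "url": "/"},
    {"time": datetime(2011, 12, 8, 12, 5), "url": "/contact"},
]
sessions = split_sessions(visits)
print([[visit["url"] for visit in session] for session in sessions])
```

An alternative reading limits the total duration of a session rather than the gap between requests; that variant would compare each request time with the first request of the current session instead.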


In web usage mining, the identification of pageviews depends on the structure of the web site. A pageview is the collection of web objects, files and resources that represents a user action, such as clicking a hyperlink on the website, viewing a product, or adding a product to the shopping cart. Each pageview activity can be recorded to provide a flexible framework for data mining.


Path completion is also a preprocessing task, carried out after session identification. The log may not record the full sequence of pages a user accessed: because of client-side caching, pages served from the cache are missing from the server-side log. Path completion techniques are applied to infer these missing pages. They mostly depend on the Referrer and URL fields in the server log file. In path completion, the missing page references are added from pages that exist in the log. If the Referrer URL of a record is not equal to the URL of the previous record, the URL in the Referrer field of the current record is inserted into the session, and thus the path is completed. During path completion the reference length can be modified to account for the new pages, and then the length is determined.
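The referrer rule above can be sketched as follows. The sketch assumes that referrers have already been normalized to site-relative paths, and it adds one restriction not stated in the text: a page is inserted only if it was already visited earlier in the same session. The function name complete_path and the "inferred" flag are hypothetical.

```python
def complete_path(session):
    """Insert the referrer page when it differs from the previously visited URL."""
    completed = []
    visited = set()
    for entry in session:
        referrer = entry["referrer"]
        # The user probably went back to a cached page, so no request was logged.
        if completed and referrer in visited and referrer != completed[-1]["url"]:
            completed.append({"url": referrer, "referrer": None, "inferred": True})
        completed.append({**entry, "inferred": False})
        visited.add(entry["url"])
    return completed


# Hypothetical session: the user opens /a, follows a link to /b, presses the
# back button (served from the browser cache), then follows a link to /c.
session = [
    {"url": "/a", "referrer": "-"},
    {"url": "/b", "referrer": "/a"},
    {"url": "/c", "referrer": "/a"},
]
path = complete_path(session)
print([(step["url"], step["inferred"]) for step in path])
```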


Data preprocessing is a significant prerequisite phase in web mining. In this paper the data were collected from the web server log and its associated fields. The log file held on the web server does not directly reflect the exact users and sessions; because of this, the log file must be preprocessed before mining tasks can be undertaken. This paper has presented an overview of the preprocessing tasks that are necessary for performing web usage mining to discover patterns. Different methods are involved in each step to remove irrelevant items and to identify users and user sessions with their browsing information. This paper focused on the primary tasks of preprocessing; in future work, various transformations can be applied using mining algorithms. The discovered patterns can be used for various applications such as system performance improvement, web prefetching, link prediction, site reorganization and web personalization.