Approaches For Identification Of Distinct Users Computer Science Essay


The World Wide Web is an enormous repository of web pages and links, and a massive amount of data is generated on a daily basis. Much of this information is available in the form of log files, which record the client-server interaction: IP address, time stamp, number of bytes transferred, and so on. This paper focuses on three parts: understanding the format of the access log file, the pre-processing phase, and finally identifying distinct users. Identifying distinct users from the log is a challenging task. The paper identifies distinct users based on parameters such as user ID, session, referrer, and user agent (browser). It also classifies the data into interested and non-interested users by applying an existing decision tree algorithm. The analysis of the log file reveals user navigation behavior that can be used for personalization of the system.

Keywords: Log Data, Preprocessing, Decision Tree, Access Log, Common Log Format

1. Introduction

Log files are files that record the activities that arise from the interaction between the client and the server. Log files reside on the web server. Computers that serve the requests made by clients are known as web servers.

∗ Bina Kotiyal

1,2 Graphic Era University, Dehradun.

Preprint submitted to ICCN-2013/ICDMW-2013/ICISP-2013 September 18, 2012

The server stores all of the files needed to display the Web pages on the user's computer. Individual web pages are combined to form a complete Web site. These files can be in different formats, such as image/graphics files and scripts. The client uses a browser to request data from the Web server over the HTTP protocol, and the server delivers the data back to the browser that made the request. The server can send files to many client computers at the same time, allowing multiple clients to view the same page concurrently.

1.1. Problem Statement

The problem is to identify the interesting users and to find the surfing behavior of users. User identification is the process of determining who accesses a Web site and which pages are accessed frequently. This process could be simplified if users provided their login information, but for security reasons clients are reluctant to reveal such information. Moreover, a number of users may access websites through the same computer, agent, or browser, while a single user may use several different browsers. These obstacles make finding unique users a complicated and challenging task. Information from cookies can be used to track user behavior, but again, for security reasons, many users do not allow them. To identify users who share the same computer or agent, several papers have proposed solutions: [2] presents a method called navigation pattern to identify users automatically. However, such methods are not fully accurate, as they rely on features that influence the method of user identification. Similarly, [3] uses a heuristic method to solve the problem of identifying users on the same computer or agent: if a requested page is not directly linked from any page visited by the user, the heuristic assumes a new user with the same IP address.

2. Web Log Format

The web usage data [4] comprises data from web server logs, browser logs, proxy servers, and user profiles. Usage data can be divided on the basis of the collection source. Web server logs are simple ASCII text files that are independent of the server platform. Traditionally four types of server logs exist, with some distinctions in the software. At present, three formats are available to record log activities:

World Wide Web Consortium (W3C) Log File Format

Microsoft IIS Log File Format

NCSA Common Log File Format

All the above-mentioned formats are ASCII text formats. The NCSA and the W3C Extended formats record logging data in a four-digit year format. [5]

The server log file contains requests made to the web server, recorded in sequential order. The most popular log file formats used to store information about the client-server interaction are the Common Log Format (CLF) and the extended CLF. Web servers create log files to keep track of user requests; this information can later be used to design websites suited to users' preferences, for example by adding links to the pages they visit most frequently.

3. Log File Contents

The log file contains basic information about each request made. Log files are generated when the client interacts with the system. [1] They contain the user name, visiting path, path traversed, time stamp, last page visited, user agent, URL, protocol used, and the request type.

Last Visited Page: The page accessed by the user before leaving the web site.

Path Traversed: The track taken by the user within the web site, following its various links.

Request type: Request type is the process used for information transfer (GET, POST).

Time Stamp: The actual time spent by the user on each web page while navigating the web site. This is identified as the session.

User Name: This identifies who visited the web site. User identification is mostly done by the IP address assigned by the Internet Service Provider (ISP). However, this may not be a permanent address, so unique identification of the user is still not guaranteed. Sometimes user identification can be achieved by collecting a user profile and permitting the user to access the website only after providing this information.

Visiting Path: Visiting path is the path followed by the user while accessing the web site.

User Agent: The browser from which the user makes the request to the server. This describes the type and version of the agent software being used.

URL: The resource accessed by the user. It could be a CGI program, a script, or an HTML page.

The above-mentioned items are the contents of the log file; this information is used by the web usage mining process.

Web usage mining, an application of data mining techniques, mines the most frequently accessed pages from the raw file.

4. Access Log

The content and the location of the log file are controlled by the CustomLog directive, which directs logged requests to the server's log files. The access log file therefore contains records of the requests processed by the server. To simplify the selection of the contents of logs, the LogFormat directive can be used. The format of the access log file is highly configurable: it is specified using a format string that looks much like a C-style printf(1) format string. Three log formats are considered for access log entries in the case of Apache HTTP Server Version 1.1. A brief discussion of each format is given below:

4.1 Common Log and Combined Log Format

Table 1 shows the configuration that writes the log in the Common Log Format:

LogFormat "%h %l %u %t \"%r\" %>s %b" common
CustomLog logs/access_log common

Table 2 shows an entry generated in the Common Log Format (CLF):

- - [11/Mar/2012:10:59:31 +0530] "GET /backup/restorefile.php?contextid=2 HTTP/1.1" 200 103520

The CLF entry above gives detailed information about the client who requested the page. (%h) - The client IP address (remote host) that made the request to the server. If HostnameLookups is set to On, the server tries to resolve the host name and records it in place of the IP address; however, this slows down the server.

(%l) - A hyphen in the log entry next to the IP address indicates that the information is not available. This field holds identity information about the user and is highly unreliable; Apache does not attempt to determine it unless IdentityCheck is set to On.

Frank (%u) - The user id of the person requesting the document, as determined by HTTP authentication. If the status code for the request is 401, this value should not be trusted, as the user was not authenticated.

[11/Mar/2012:10:59:31 +0530] (%t) - The time at which the server finished processing the request; the format can be changed by selecting a format string. It resembles [dd/Mon/yyyy:hh:mm:ss zone].

"GET /theme/index.php HTTP/1.1" (\"%r\") - The request line from the client. It contains the following information:

Method used by the client (GET/ POST/ HEAD)

Requested resource by the client

Protocol Version (HTTP/ 1.1)

200 (%>s) - The status code sent by the server. Codes beginning with 2 indicate a successful response, 3 a redirection, and so on.

103520 (%b) - The size of the object returned to the client, not including the response headers. If no content was returned, this shows a "-".

Table 3 shows the configuration for the Combined Log Format:

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"" combined

An example of the Combined Log Format is shown in Table 4:

- - [11/Mar/2012:11:02:42] "GET /lib/yui/3.4.1/build/yui/yui-min.js HTTP/1.1" 200 68018 "http://admin/index.php?lang=en" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727)"

It is similar to the Common Log Format, but it has two additional fields:

http://admin/index.php?lang=en (\"%{Referer}i\") - The site from which the client was referred, i.e., the page containing the link through which the client reached /lib/yui/3.4.1/build/yui/yui-min.js.

Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727) (\"%{User-agent}i\") - Information about the user agent, or browser, used by the client.
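The Combined Log Format fields discussed above can be extracted with a short sketch in Python. The regular expression below mirrors the %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\" directive; the IP address and the shortened agent string in the example call are illustrative placeholders, not entries from the log analyzed in this paper.

```python
import re

# Regex for one Combined Log Format line; group names follow the
# %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i" directive.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

entry = parse_line(
    '10.0.0.1 - - [11/Mar/2012:11:02:42 +0530] '
    '"GET /lib/yui/3.4.1/build/yui/yui-min.js HTTP/1.1" 200 68018 '
    '"http://admin/index.php?lang=en" "Mozilla/4.0 (compatible; MSIE 8.0)"'
)
```

Because the referrer and user-agent group is optional, the same pattern also accepts plain Common Log Format entries, in which case those two fields come back as None.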

5. Preprocessing Technique

One of the intensive steps in web usage mining is data pre-treatment, also known as pre-processing of the data. It is the most time-consuming and computationally demanding phase.

This procedure encompasses pre-processing the actual data, integrating data from multiple sources, and finally converting the integrated data into a form suitable as input for the specific data mining operations. The complete process is known as data preparation [6]. Identifying a user in a log file is a tedious task. To overcome this problem, we take the session as input. A session is the set of pages visited during a user-specific time threshold. It gives information about the client, such as who accessed the Web site, which pages were requested and in what order, and how long each page was viewed. However, the information stored in the web server log does not give an accurate picture of user access without pre-processing. The steps in data pretreatment involve data cleaning, user identification, session identification, etc.

6. Steps Performed in Web Usage Mining

6.1. Data Source

Usage data is collected from different sources. It takes the form of web logs collected at the server side, which requires recording the access patterns into an ASCII text file. The three sources of data are:

Server Log File

Proxy Server Log File

Client File

6.2. Data Cleaning

The purpose of data cleaning is to remove extraneous items; this process is very important for any data mining task, since irrelevant information makes it difficult to find the behavior of the user. Therefore, during the cleaning process the irrelevant data is removed from the web log file. [7] Since web usage mining aims to find the pages users visit most frequently, not all records are required for the mining process. Removed items include image, video, and audio files with extensions such as CSS, GIF, and JPEG. Entries with status codes from 200 to 299 are retained, whereas those outside this range can be removed.
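The cleaning step described above can be sketched as a simple filter. This sketch assumes each log entry has already been parsed into a dict with 'request' and 'status' fields (hypothetical field names), and the suffix list is illustrative rather than exhaustive.

```python
# Drop requests for static resources (images, stylesheets, scripts,
# media) and keep only entries with a 2xx status code.
IRRELEVANT_SUFFIXES = ('.css', '.gif', '.jpeg', '.jpg', '.png', '.js',
                       '.mp3', '.mp4', '.avi', '.ico')

def is_relevant(entry):
    """entry is a dict with at least 'request' and 'status' fields."""
    parts = entry['request'].split()
    url = parts[1] if len(parts) > 1 else ''
    if url.lower().endswith(IRRELEVANT_SUFFIXES):
        return False                      # static resource: discard
    return 200 <= int(entry['status']) <= 299  # keep only 2xx responses

records = [
    {'request': 'GET /theme/index.php HTTP/1.1', 'status': '200'},
    {'request': 'GET /pix/logo.gif HTTP/1.1', 'status': '200'},
    {'request': 'GET /install.php HTTP/1.1', 'status': '302'},
]
cleaned = [r for r in records if is_relevant(r)]  # only index.php survives
```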

6.3. User Identification

User identification is a tedious task in which the unique users are identified from the log file. This information can be extracted in various ways, such as using cookies, IP addresses, or direct authentication; this paper focuses on identifying users from the server log file. It is therefore the process of finding the different user sessions from the original access log file.

User identification is basically about who has accessed the website and which pages were accessed in sequence. The objective of session identification is to split the page accesses of each user into separate sessions; a session is the sequence of web pages that a user surfs in a single access. The process becomes more difficult when proxy servers are used: different users may access web pages through the same IP address and are therefore registered under a single IP address in the log. A referrer-based method has also been proposed to solve the problem of identifying users from log data. The rules adopted for distinguishing users are as follows:

Different IP addresses distinguish different users

If IP addresses are identical, then different browsers or user agents and operating systems indicate different users
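The two rules above can be sketched as follows. Approximating a user by the (IP address, user agent) pair is a simplification, since the agent string carries both browser and operating-system information; the field names and sample entries are hypothetical.

```python
def identify_users(entries):
    """Map each distinct (host, agent) pair to a numeric user id."""
    users = {}
    for e in entries:
        key = (e['host'], e['agent'])   # rule 1: IP; rule 2: agent/OS
        if key not in users:
            users[key] = len(users) + 1
    return users

entries = [
    {'host': '10.0.0.1', 'agent': 'MSIE 8.0; Windows NT 6.1'},
    {'host': '10.0.0.1', 'agent': 'Firefox/10.0; Linux x86_64'},  # same IP, new agent
    {'host': '10.0.0.2', 'agent': 'MSIE 8.0; Windows NT 6.1'},    # new IP
]
users = identify_users(entries)  # three distinct users under these rules
```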

6.4. Session Identification

Identifying sessions from raw log data is a complex process because the server does not record all the information needed to identify unique users. A session is a sequence of activities performed by a user while surfing a given web site. As web server logs do not contain enough information to rebuild user sessions, two kinds of heuristic can be taken into account: time-oriented or structure-oriented. [4]

Session Identification can be done on following parameters:

If the browser, operating system, and IP address are the same, then the referrer information should be taken into consideration for identifying a new session. In this step the Referer URI field is checked, and a new user session is identified.

If the URL in the Referer URI field has not been retrieved previously, or there is a large interval (usually more than 15 seconds) between the retrieval time of this record and the previous one, a new session is identified.

A new session can also be identified if the Referer field is empty.

The three heuristics used to find the different user sessions are:

Time Oriented Heuristic

Session for the time is considered as 30 minutes

Single Page Stay

Navigation Oriented Heuristic

One of the easiest procedures for identifying the different sessions in the log is the time-oriented method, which is based on a user-defined threshold. The second method is based on single-page stay time.

The first method groups the pages accessed by a unique user within a specific time window, assumed to be 30 minutes by [9]. In the second method, the page stay time is estimated as the difference between two timestamps; if it exceeds 10 minutes, the next entry is assumed to begin a new session. However, this time factor cannot yield reliable results, because users may engage in other activities after opening a web page. The third method [2] is navigation-oriented and uses the web topology in graph form: it assumes that the web pages visited are connected to one another through links, which is not always the case.
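The time-oriented heuristic with the 30-minute threshold of [9] can be sketched as below. The timestamps in the usage example are illustrative, not records from the experimental log.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # threshold assumed by [9]

def split_sessions(timestamps):
    """Split one user's ordered page-access times into sessions:
    a gap longer than the timeout starts a new session."""
    sessions, current = [], []
    for t in timestamps:
        if current and t - current[-1] > SESSION_TIMEOUT:
            sessions.append(current)     # close the previous session
            current = []
        current.append(t)
    if current:
        sessions.append(current)
    return sessions

fmt = '%d/%b/%Y:%H:%M:%S'
times = [datetime.strptime(s, fmt) for s in
         ['11/Mar/2012:10:59:31', '11/Mar/2012:11:02:42',
          '11/Mar/2012:11:40:00']]       # 37-minute gap before the last access
sessions = split_sessions(times)         # two sessions under the heuristic
```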

The time-oriented heuristic is used to divide the web accesses into different user sessions, as an identified session may comprise more than one visit by the same client at different times. To obtain the complete path of the user, the records from the log file are grouped into sessions and then path completion algorithms are applied; these algorithms help in finding the complete path of each user.

Identifying the correct users from the access log is a challenging task since the protocol used is HTTP [8], which is stateless and connectionless, meaning that it holds no information about earlier stages. Once the users are identified, the next task is to identify the different user sessions. A session, as the name suggests, is an occurrence of the events made by an individual user during a visit to the web site. These sessions are used as input to our method, after which different mining techniques are applied, such as classification, clustering into groups, pattern analysis, and prediction. The navigation-based method is less used for identifying unique users in the log file because of its complexity, the massive amount of data generated, and the complexity of the topology. In this paper, for identifying unique users, we take the session as input.

The users are identified on the following basis:

User Identification = Session, User ID, Browser, Index Page, {URL, (Time, Date) with Referrer}, Missing Referrer

The paper focuses on identifying the users based on different parameters. This information can be further used for applying the data mining techniques and identifying the potential users from the log data.

6.5. User Agent

The browser, or user agent, plays a vital role in identifying users from log data. It is the browser used by the user for accessing the different web pages. A change in the browser or the operating system under the same IP address empirically represents a different user. Through the agent we can identify new users in two ways:

User Agent/ Browser

Operating System

6.6. Referring URL

The IP address alone is not sufficient for identifying unique users, for the following reasons:

A proxy server may assign the same IP address to multiple users.

Same user may be assigned multiple IP addresses by the proxy.

The navigation path helps in determining the user. For this, the HTTP request procedure is checked: if the requested page is not linked to the previous page, the user is considered a new user under the same IP address.

6.7. Path Completion

It is essential to detect important accesses that are not recorded in the access log file. Path completion refers to the insertion of these missing page accesses, which are absent from the access log due to browser and proxy server caching.

Here the heuristic assumes that:

If a page is requested that is not directly connected to the previous page accessed by the same user, the referrer log can be consulted to see from which page the request was made.

If that page is in the user's recent click stream, it is assumed that the user browsed back with the "back" button, using cached versions of the web pages. Each session therefore reflects the full path, including the pages that were backtracked.

After this phase, the user session file yields completed paths, each consisting of a collection of page references that counts repeated page accesses made by a client/user.
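The back-button heuristic above can be sketched as follows. This is a minimal sketch: the page names A-D are purely illustrative, and each access is assumed to be a (url, referrer) pair taken from the cleaned log.

```python
def complete_path(accesses):
    """accesses: list of (url, referrer) pairs in session order.
    When a request's referrer is an earlier page in the path rather
    than the last page, the user is assumed to have pressed "back";
    the backtracked pages are re-inserted into the path."""
    path = []
    for url, ref in accesses:
        if path and ref is not None and ref != path[-1] and ref in path:
            # replay pages from the last page back to the referrer
            i = len(path) - 1
            while path[i] != ref:
                i -= 1
                path.append(path[i])
        path.append(url)
    return path

# A -> B -> C, then D with referrer A: the user pressed "back" twice,
# so the completed path is A, B, C, B, A, D.
trail = complete_path([('A', None), ('B', 'A'), ('C', 'B'), ('D', 'A')])
```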

6.8. Formatting

The output file is not in a format suitable for the mining tasks. Therefore it is formatted into an appropriate form that can be understood by the system.

7. Proposed Architecture

The proposed architecture is divided into the two phases:

7.1. Offline Phase - Data Preprocessing/Pretreatment

7.2. Online Phase - Recommendation of the navigated links

Based on its functionality, the system is partitioned into two parts, one online and one offline. These two phases do not overlap but work jointly.

The offline phase is also known as the back end, and the online phase as the front end.

The proposed system architecture is given in Figure 1 and the steps are listed in Table 5:

Access Log - Collect the data from the log file.

Data Cleaning

Filtering out images, themes, video, audios etc.

Removing the redundant data

Remove the requests created by robots.txt or spider crawlers [10]

User or Session Identification

Navigation Patterns Mining

Identifying Potential users through Classification

Prediction Engine

The prediction engine's objective is to classify user navigation patterns and, based on them, predict the user's next request.

Client's next request

8. Classification Concept

Classification is an important data mining technique with a wide range of applications. [11] It classifies records of various types: the classification method assigns each item in a data set to one of a predefined set of classes. The classification process makes use of established techniques such as statistics, decision trees, neural networks, and linear programming.

In this paper the decision tree is used for the classification.

Access Log

Raw Data Cleaning

Session Identification

Interested Users

Prediction Engine

User Next Request

Online Phase

Offline Phase



Figure. 1 Proposed Architecture

8.1 Decision Tree

Decision trees are the most frequently used classifiers because of their ease of implementation and because they are more easily understood than other classification algorithms. A decision tree algorithm can be implemented in a serial or parallel fashion depending on the size of the data, the required scalability, and the memory available on the computing resource. [12]

A decision tree algorithm recursively partitions a data set of records using either of two types of approach:

Depth first greedy approach

Breadth first greedy approach

until all the items are classified. [13] The structure of a decision tree consists of a root node, internal nodes, and leaf nodes. The leaves of the tree hold the class labels into which the data items have been grouped.

9. Experiment on Learning Management System

This experiment was conducted on the learning management system of a college, which generates a large amount of data. The experiment uses the log data of 11-Mar-2012 to 9-Mar-2012. The size of the log file is 286,191 KB and it contains 1048576 records. The file is in text (ASCII) form.

Table 6. Raw/ Access Log Data

- - [11/Mar/2012:10:59:31 +0530] "GET / HTTP/1.1" 302 - "-" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727)"

- - [11/Mar/2012:10:59:31 +0530] "GET /install.php HTTP/1.1" 200 8982 "-" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727)"

- - [11/Mar/2012:10:59:33 +0530] "GET /install/css.php HTTP/1.1" 200 70037 "" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727)"

Table 7 shows the various attributes of the log file, since filtering cannot be performed directly on the raw access log file.

Table 7. All the Attributes of Log File

Preprocessing is performed on the raw data; after this phase only 20% of the data remains for applying the decision tree algorithm. Figure 2 shows the decision tree used to separate interested users from non-interested users in the filtered log data, based on the rules generated.



Figure 2. Structure of Decision Tree

Rules for Identifying Interested Users (IU) or Non-Interested Users (NIU):

if (Session > 30)
    then New User
if (Session <= 30)
    if (Browser == Mozilla or Opera)
        if (Pages < 5)
            if (Method == POST)
                then NIU
if (Session <= 30)
    if (Browser == Mozilla or Opera)
        if (Pages > 5)
            if (Method == GET)
                if (Referrer == Yes)
                    then NIU

10. Conclusion

One of the imperative processes of web usage mining is preprocessing, which consumes the most time in identifying interested users from web log data; the other challenging task is finding the distinct users in that data. This paper focuses on three phases: preprocessing, identifying distinct users by considering different parameters, and finally identifying potential users by applying a classification algorithm. An experiment was performed on the Learning Management System of a college, and the potential users were found. This mined information can result in improved web personalization.


The authors gratefully thank Mr. Akhil, who provided the log information; without it the preprocessing step could not have been carried out.