Web Usage Mining Using Ant Nestmate Approach Computer Science Essay



Abstract- Web usage mining is the application of data mining techniques to discover usage patterns from web data. Web personalization uses web usage mining techniques to customize a site, and customization requires knowledge acquired by analyzing each user's navigational behavior. A user who goes online wants to be offered the links that suit his or her requirements on the website being visited, so the next business requirement in the online industry will be personalizing web pages to meet each individual's needs. Personalizing a web page involves clustering web pages that share a common usage pattern. As clusters grow, because the number of users increases or their interests broaden, optimizing the clusters becomes unavoidable. This paper proposes a cluster optimization methodology based on the nestmate recognition ability of ants, which is used to eliminate the data redundancies that may remain after clustering by web usage mining methods. An ART1 neural network based approach is used for clustering. The "AntNestmate approach for cluster optimization" is presented to personalize web page clusters for target users.

Index Terms-Web usage mining, user profiles, pheromones, neural network, adaptive resonance theory, Ant Colony.


The online information management and service business is continuously challenged to find new methods of design and development by the exponential growth in the size and use of the World Wide Web. Meeting these challenges requires identifying and understanding the individual interests and navigational behavior of each user. Web personalization involves knowledge acquisition through analysis of the user's navigational behavior. A user who goes online wants to be offered the links that suit his or her requirements on the website being visited. The next business requirement in the online industry will be personalizing web pages to meet each individual's needs. Web personalization implies a service that provides users with what they want or need without their asking for it.

Anna Alphy, Assistant Professor, Dept. of Computer Science and Engineering, Kakkanad, Kerala, and Research Scholar, Dept. of Computer Science and Engineering, SRM University, Chennai.

E-mail: anna.urumbath@gmail.com

S. Prabhakaran, Professor, Dept. of Computer Science and Engineering, SRM University, Chennai.

E-mail: prabakaran.mani@gmail.com

To attract customers, their needs and interests must be understood. To do that, data that give an understanding of the customers' needs and demands must be collected from within or outside the organization. On the basis of the knowledge gained, user profiles can be created. A user profile describes the needs, interests, hobbies, and so on of an individual user. A web page can be personalized for each user based on these profiles. Competition in the market is growing exponentially, and a competitor's service is just a click away for the user, so customer preferences and needs must be satisfied constantly.

Web usage mining uses data mining techniques to discover interesting usage patterns from web log data [4]. Web personalization uses web usage mining techniques to customize a web page for a specific user. This includes extracting user sessions from the web log files, which record the sessions established when a user uses the internet. A user session is a sequence of web accesses by a user. The user sessions shape a user profile that characterizes each user. That is, the log information can be used to derive profile information, which in turn can be used to personalize the website for an individual user. This personalization boosts the development of the business, because its target market (a user or a group of users) is given easy access to the information they are interested in every time they log in.

Different clustering methods are currently available for web page development, but data redundancy and performance issues are high in these methodologies. This paper proposes an optimization methodology for eliminating the data redundancies that may remain after clustering by the web personalization methods mentioned above. For clustering, the ART1 version of adaptive resonance theory is used. ART1 is an unsupervised clustering approach that adapts to changes in users' navigation patterns over time without losing earlier information. The input to ART1 is usually a set of binary vectors. In the ART1-based algorithm, a prototype vector represents each user cluster by generalizing the URLs most frequently accessed by all cluster members. This paper proposes the AntClusterTrack algorithm for cluster optimization, which uses the nestmate identification technique of ants. Social insects such as ants and honey bees have multiple levels of recognition, and hierarchical relationships exist within groups. These complex behaviors are illustrated by the fact that ants can distinguish between nestmates and non-nestmates. Interaction and cooperation among ants of different colonies is nearly nil, which protects the colony from exploitation by outsiders.

Related Works

Web usage mining is the application of data mining techniques to discover usage patterns from web data. It consists of three steps: preprocessing, pattern discovery, and pattern analysis. Preprocessing converts the usage, content, and structure information available in the data sources into the data abstractions necessary for pattern discovery [1,3,4,15]. In [5], the Web Utilization Miner (WUM), a mining system for the discovery of interesting navigation patterns, is presented. The interestingness criteria for navigation patterns are dynamically specified by a human expert using WUM's mining language, MINT. In [6], an overall design for the preprocessing, clustering, and dynamic link suggestion tasks is given, and these concepts are used to segment user sessions into clusters or profiles that can later form the basis for personalization. In [7], adaptive web sites that improve themselves by learning from user access patterns are described. Adaptive sites can make popular pages more accessible, highlight interesting links, connect related pages, and cluster similar documents together. The work in [8] uses an N-gram model, which assumes that the last N pages browsed affect the probability of the next page to be visited. The model is based on the theory of probabilistic grammars, which provides it with a sound theoretical foundation for future enhancements. The work in [9] uses a knowledge discovery tool, WebLogMiner, for mining web server log files. A data cube is constructed using the available dimensions, and online analytical processing (OLAP) is used to drill down, roll up, and slice the web log data cube. Finally, data mining techniques are applied to the data cube to predict, classify, and discover interesting correlations. In [17], a new scalable clustering methodology that draws inspiration from the natural immune system is proposed. In it, the web server plays the role of the human body, and incoming requests play the role of foreign antigens, bacteria, or viruses that need to be detected by the proposed immune based clustering technique. The clustering algorithm acts as a cognitive agent whose goal is to continuously perform an intelligent organization of the incoming noisy data into clusters.

In [10], decision trees were trained to learn how an individual's meetings can be scheduled in a personalized calendar. A time window was used to confine and adapt the training samples for learning changing user preferences. News Dude [11] is an intelligent agent built to adapt to changing users' interests by learning two separate user models that represent short-term and long-term interests. The short-term model is learned from the most recent observations only, whereas the long-term (default) model represents the user's general preferences. In [12], a user profiling system was developed based on monitoring the user's Web browsing and e-mail habits. Relying only on Web usage data for user modeling or for personalization can be inefficient, either when there is insufficient usage data for the purpose of mining certain patterns or when new pages are added and thus do not accumulate sufficient usage data at first. The lack of usage data in these cases can be compensated by adding other information such as the content of Web pages [13] or the structure of a Web site [2], [3]. In [13], the keywords that appear in Web pages are used to generate document vectors, which are later clustered in the document space to further augment user profiles. In [14], Web usage logs were enriched with semantics derived from the content of the Web site's pages. Content keywords were first mapped to the categories of a manually constructed domain-specific taxonomy through the use of a thesaurus, and then the Web documents were clustered based on the taxonomy categories. The enhanced Web logs, called C-Logs, were then used as input to Web usage mining.

Pattern Recognition Based on Web Usage Mining

Web usage mining deals with the automatic discovery of user access patterns from web log files. This includes the following steps.

1. Preprocess the Web log file to extract user sessions.

2. Discover patterns using data mining methods [1,3,4,15].

This paper uses the ART1 neural network based clustering technique to group the user sessions. The clusters obtained are optimized using the proposed AntClusterTrack algorithm.

3. Generate user profiles from the clusters.

4. Track evolving user profiles [15].


Figure 1 describes the steps involved in web usage mining.


Fig. 1. Steps involved in web usage mining (blocks: web log; ART1 neural network based clustering; AntClusterTrack ant nestmate identification; post processing and tracking evolving user profiles).

3.1 Preprocessing the Web Log File to Extract User sessions

Web usage mining is the automatic discovery of user access patterns from web servers. The access log of a Web server is a record of all files (URLs) accessed by users on a Web site. Each log entry consists of the access time, the IP address, and the URL viewed. Preprocessing of Web data means filtering out crawler requests and requests for graphics, and identifying unique sessions [15]. The click-stream of the visitors to an online web portal is recorded completely. This gives a detailed record of every action taken by the user, which improves the precision of the pattern discovery process.

The first preprocessing task is data cleaning, which eliminates irrelevant items from the web log so that they are not involved in further analysis. The discovered patterns or generated statistical values are useful only if the data in the server log give an accurate picture of the user accesses to the Web site. Irrelevant items can reasonably be eliminated by checking the suffix of the URL name. For instance, all log entries with filename suffixes such as gif, jpeg, GIF, JPEG, jpg, JPG [1] and map can be removed.
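The suffix check described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `(time, ip, url)` tuple layout and the names `is_relevant` and `clean_log` are assumptions made for the example, and the suffix list contains only the suffixes named in the text.

```python
from urllib.parse import urlsplit

# Suffixes named in the text; comparison is case-insensitive,
# so "GIF" and "gif" are handled by the same entry.
IRRELEVANT_SUFFIXES = ("gif", "jpeg", "jpg", "map")


def is_relevant(url: str) -> bool:
    """Return False when the URL path ends in one of the filtered suffixes."""
    path = urlsplit(url).path.lower()
    suffix = path.rsplit(".", 1)[-1] if "." in path else ""
    return suffix not in IRRELEVANT_SUFFIXES


def clean_log(entries):
    """Keep only relevant entries; each entry is assumed to be (time, ip, url)."""
    return [entry for entry in entries if is_relevant(entry[2])]
```

Filtering crawler requests would need an additional rule (for example on the user agent), which is not shown because the text does not specify one.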

After the irrelevant items are removed, the next preprocessing step maps the Ns URLs of the Web site to distinct indices. Requests from the same user ID within a predefined time period are considered one user session. Each URL in the site is assigned a unique identifier j ∈ {1, ..., Ns}, where Ns is the total number of valid URLs. The ith user session is then encoded as an Ns-dimensional binary attribute vector U(i) with the following property:

$$U_j^{(i)} = \begin{cases} 1 & \text{if user } i \text{ accessed URL } j \\ 0 & \text{otherwise} \end{cases}$$
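As a small illustration of this encoding, the sketch below maps URLs to indices and turns one session into its binary vector. The function names are hypothetical, and indices start at 0 rather than 1 as is usual in Python; splitting the log into sessions by user ID and time window is assumed to have been done already.

```python
def build_url_index(urls):
    """Map each distinct URL to an index 0..Ns-1 (the text numbers them 1..Ns)."""
    return {url: j for j, url in enumerate(sorted(set(urls)))}


def encode_session(session_urls, url_index):
    """Return the Ns-dimensional binary vector U for one user session."""
    vector = [0] * len(url_index)
    for url in session_urls:
        j = url_index.get(url)
        if j is not None:  # URLs outside the index are ignored
            vector[j] = 1
    return vector
```

Repeated visits to the same URL within a session leave the corresponding entry at 1, which matches the binary definition above.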

3.2 ART1 Neural Network Based Clustering Approach

The ART1 algorithm that we adopted for user clustering uses the concepts of competitive learning and interactive activation. Figure 2 illustrates the architecture of our ART1-based neural network for clustering user communities. It consists of Nu nodes in the L1 layer, with each node presented with a binary value of 0 or 1.

The L1 layer presents the pattern vector Uj(i), which represents the access pattern of each user. The L2 layer consists of a variable number of nodes, one for each cluster formed. The L1 and L2 layers are fully interconnected; that is, the activation of each L1 unit is fed into all L2 units and vice versa. This interlayer feedback structure is used to facilitate 'resonance' when a match between the encoded pattern and the input pattern occurs. The following steps explain the working of the ART1 [18] approach.

Input: The pattern vector Uj(i) that represents the navigation patterns of each user

Output: The prototype vector formed for each cluster

Step 2: Select ε and ρ and initialize the interlayer connections as follows:

tij = 1 and bij = 1/(1+n)

where ρ is the vigilance parameter that determines the degree of error to be tolerated, tij are the top-down weights, and bij are the bottom-up weights.

Step 3: Present the binary input pattern vector Uj(i) to the L1 layer.

Step 4: Using bij, determine the activations of the L2 layer.

Step 5: Determine the node with maximum activation in the L2 layer.

Step 6: The top-down verification begins. The result for the winner unit found in step 5 is fed back to L1 via the top-down weights tij. This is an attempted confirmation of the winning node found in step 5.

Step 7: If the test in step 6 succeeds, the bij and tij interconnections are updated to accommodate the input Uj(i) using the discrete version of the slow learning dynamics. If the test fails, this node is ruled out and step 5 is repeated until a winner is found or there are no remaining candidates.

Step 8: Stop

When the algorithm stabilizes, the nodes at layer L2 represent clusters with a generalized representation of the URLs most frequently requested by all members (hosts) of that cluster.
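The loop of activation, winner selection, vigilance test, and weight update can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses the common fast-learning ART1 update (prototype = prototype AND input) instead of the slow-learning update mentioned in step 7, it creates a new L2 node when no existing node passes the vigilance test, and the parameter `L` and the function name `art1_cluster` are introduced only for this example.

```python
def art1_cluster(patterns, rho=0.5, L=2.0):
    """Cluster binary vectors with a fast-learning ART1 sketch.

    rho is the vigilance parameter; L is an assumed constant (> 1)
    that scales the bottom-up weights.
    """
    prototypes = []  # top-down weights, one binary list per L2 node
    bottom_up = []   # bottom-up weights, one list per L2 node
    labels = []
    for x in patterns:
        norm_x = sum(x)
        # Rank existing L2 nodes by bottom-up activation, highest first.
        order = sorted(
            range(len(prototypes)),
            key=lambda j: sum(b * xi for b, xi in zip(bottom_up[j], x)),
            reverse=True,
        )
        winner = None
        for j in order:
            overlap = [t & xi for t, xi in zip(prototypes[j], x)]
            # Vigilance test: fraction of the input matched by the prototype.
            if norm_x > 0 and sum(overlap) / norm_x >= rho:
                winner = j
                prototypes[j] = overlap  # fast learning: keep shared 1s only
                break
        if winner is None:  # no candidate resonates: commit a new node
            prototypes.append(list(x))
            bottom_up.append(None)
            winner = len(prototypes) - 1
        size = sum(prototypes[winner])
        bottom_up[winner] = [L * t / (L - 1 + size) for t in prototypes[winner]]
        labels.append(winner)
    return labels, prototypes
```

With this sketch, a higher `rho` yields more, tighter clusters, because fewer inputs pass the vigilance test against an existing prototype. Empty (all-zero) sessions are assumed to have been removed during preprocessing.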















Figure 2. ART1-based clustering architecture (example input pattern Uj(i) = 1 0 1 0).

Ant Nestmate Based Cluster Optimization by Ant ClusterTrack Algorithm

In recent years, swarm intelligence, the study of the ability of ants, bees, fish, and birds to cooperate, self-organize, and communicate in the absence of any a priori knowledge of the environment, has gained the attention of researchers. Swarm intelligence concepts are mainly used by researchers for clustering. Deneubourg was the first to simulate phenomena such as automatic classification and the piling of ant corpses, and put forward the basic clustering model (Basic Model, BM) [16]. Ant colony clustering algorithms do not require prior knowledge of the data partition.

The complex social behavior of ants and other social insects requires multiple levels of recognition. Ants can distinguish nestmates from non-nestmates, which allows them to limit altruism and cooperation to members of their own colony and to protect their colony from exploitation by outsiders. Ants that have the same odor belong to the same nest.

This paper proposes the AntClusterTrack algorithm, a cluster optimization algorithm that takes its input from a neural network based training process, ART1. The clusters obtained at the L2 layer of ART1 are fed into an ant based clustering approach that checks the similarity of the pheromone values of the artificial ants. This relies on the fact that ants belonging to the same nest have a similar odor. In this algorithm, clusters are treated as ant nests and the url combinations in each cluster are treated as the artificial ants.

Pheromone(U, V) is the Jaccard similarity coefficient between the url vectors U and V:

$$\mathrm{Pheromone}(U, V) = \frac{M_{11}}{M_{01} + M_{10} + M_{11}}$$

M11 represents the total number of attributes where U and V both have a value of 1.

M01 represents the total number of attributes where the attribute of U is 0 and the attribute of V is 1.

M10 represents the total number of attributes where the attribute of U is 1 and the attribute of V is 0.

M00 represents the total number of attributes where U and V both have a value of 0.

FNest(Ci) is the mean vector that contains the arithmetic averages of the attributes of the elements of cluster Ci.

T(A, B) is the Tanimoto coefficient, which can be used to measure similarity:

$$T(A, B) = \frac{A \cdot B}{\lVert A \rVert^2 + \lVert B \rVert^2 - A \cdot B}$$
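The three quantities defined above translate directly into code. The sketch below is illustrative only: the function names are chosen for this example, vectors are plain Python lists, and returning 0.0 when a denominator is zero is an assumed convention that the text does not specify.

```python
def pheromone(u, v):
    """Jaccard coefficient M11 / (M01 + M10 + M11) of two binary vectors."""
    m11 = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    m10 = sum(1 for a, b in zip(u, v) if a == 1 and b == 0)
    m01 = sum(1 for a, b in zip(u, v) if a == 0 and b == 1)
    denom = m01 + m10 + m11
    return m11 / denom if denom else 0.0  # assumed value for two all-zero vectors


def fnest(cluster):
    """Mean vector of a non-empty cluster given as a list of equal-length vectors."""
    n = len(cluster)
    return [sum(column) / n for column in zip(*cluster)]


def tanimoto(a, b):
    """Tanimoto coefficient A.B / (|A|^2 + |B|^2 - A.B) of two real vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return dot / denom if denom else 0.0  # assumed value for two zero vectors
```

For binary vectors the Tanimoto coefficient reduces to the Jaccard coefficient; the separate definition matters here because FNest vectors contain averages rather than 0/1 values.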

Ant ClusterTrack algorithm

Input : N clusters which gives a generalized representation of the URLs most frequently requested by all members of that cluster

Output: M<<N clusters that maximize inter-cluster distance and minimize intra-cluster distance

Step 1: Start

Step 2: Initialize the value of i as 1.

Step 3: Repeat steps 4 to 8 until i is equal to N, where N is the number of clusters specified as input

Step 4. Perform step 5 to 7 for each url j present in cluster i

Step 5: For each cluster x = i+1 to N

Step 5.1 Check if Pheromone(Uij, Uxz) > α

// where x = i+1 to N and z = 1 to n', where n' is the number of urls in cluster x

Step 5.1.1 Check if FNest(Ci) > FNest(Cx) then

Cx = Cx - Uxz

Step 5.1.2 Else


Step 6: Check if T(FNest(Ci), FNest(Cx)) >= β

Step 6.1 Check if FNest(Ci) > FNest(Cx)

Step 6.1.1 Ci = Ci + Uxz

// where z varies from 1 to n, where n is the number of urls in cluster x, and x varies from i+1 to N

Step 6.1.2 Delete cluster Cx

Step 6.2 Else

Step 6.2.1 Cx = Cx + Uiz

// where z varies from 1 to n', where n' is the number of urls in cluster i

Step 6.2.2 Delete cluster Ci

Step 7 End For

Step 8 End For

In this experiment the value of α =0.8 and β=0.58.
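The merging part of the algorithm (step 6) can be illustrated with a short sketch. It is a simplification under explicit assumptions and not the authors' implementation: the redundancy removal of step 5 is omitted because its Else branch is not given in the text, the surviving cluster is always the lower-indexed one (the membership of the merged cluster is the same whichever cluster absorbs the other), and the helper names are hypothetical.

```python
def mean_vector(cluster):
    """FNest: arithmetic mean of the vectors in a non-empty cluster."""
    n = len(cluster)
    return [sum(column) / n for column in zip(*cluster)]


def tanimoto(a, b):
    """Tanimoto coefficient of two real-valued vectors (0.0 if both are zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return dot / denom if denom else 0.0


def merge_similar_clusters(clusters, beta=0.58):
    """Merge every pair of clusters whose nest vectors have Tanimoto >= beta.

    Assumption: the lower-indexed cluster absorbs the other one, which is
    then deleted, mirroring steps 6.1.1 and 6.1.2.
    """
    clusters = [list(c) for c in clusters]  # work on a copy
    i = 0
    while i < len(clusters):
        x = i + 1
        while x < len(clusters):
            nest_i = mean_vector(clusters[i])
            nest_x = mean_vector(clusters[x])
            if tanimoto(nest_i, nest_x) >= beta:
                clusters[i].extend(clusters.pop(x))  # Ci = Ci + Cx, delete Cx
            else:
                x += 1
        i += 1
    return clusters
```

The nest vector of a cluster is recomputed after each merge, so a cluster that has grown is compared with the remaining clusters using its updated average.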

Once all the iterations are completed, we obtain M clusters, fewer than the initial N clusters (M<N). Some clusters will have higher densities and some will be vaporized.

Post Processing and Tracking Evolving User Profiles

Post processing [15] includes creating user profiles as sets of relevant urls from the clusters obtained. Tracking evolving user profiles [15] includes comparing profile events across different time periods to gain a better understanding of the evolution of user access patterns and of seasonality.

V. Performance Analysis

Performance can be analyzed based on the temporal dynamics shown by web logs that have undergone web usage mining in different time periods. The temporal events identified are profile birth, persistence, death, and atavism (rebirth) [15]. These can be visualized by plotting a graph whose x-axis gives the periods in which Web log batches undergo Web usage mining and whose y-axis gives the profile number. In Fig. 3, a symbol is added whenever a profile y appears in period x. Four different graphs are drawn that show web site user trends: birth, death, persistence, and atavism. The right period length should be determined based on the application.

Figure 3. Profile Evolution Events

A systematic approach to profile evolution validation [15] in information acquisition context is in terms of precision and coverage.

Precision: A summary profile's items are all correct or included in the original input data; that is, they include only the true data items:

$$\mathrm{Prec}_{ij} = \frac{|s_j \cap p_i|}{|p_i|}$$ [15]

Coverage/Recall: A summary profile's items are complete compared to the data that is summarized; that is, they include all the data items:

$$\mathrm{Cov}_{ij} = \frac{|s_j \cap p_i|}{|s_j|}$$

where sj is a summary of input sessions and pi represents a discovered mass profile [15].
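Treating a profile and a session summary as sets of URLs, the two measures can be computed as in the sketch below. The function names and the use of Python sets are assumptions made for the illustration; returning 0.0 for an empty set is also an assumed convention.

```python
def precision(profile, summary):
    """Share of the profile's items that also occur in the session summary."""
    p, s = set(profile), set(summary)
    return len(p & s) / len(p) if p else 0.0


def coverage(profile, summary):
    """Share of the session summary's items that the profile contains."""
    p, s = set(profile), set(summary)
    return len(p & s) / len(s) if s else 0.0
```

For example, a profile {/a, /b, /c, /d} compared with a summary {/a, /b, /e} shares two items, giving a precision of 2/4 and a coverage of 2/3. Enlarging the profile can only raise coverage but tends to lower precision, which is the tension between the two measures.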

Precision and coverage pull in opposite directions: precision favors the smallest profiles that faithfully and accurately represent the user input, whereas coverage favors the largest or most complete profiles possible. The efficiency of an algorithm depends on how well it can balance precision and coverage.


Figure 4. Quality Measure

For each session, the precision obtained was low compared with the coverage in the ART1 based neural network approach. As an enhancement, the output of the ART1 based approach is given as the input to the AntClusterTrack optimization algorithm. Its quality measure is shown in Figure 5.

Figure 5. Quality Measure

This increases both precision and coverage, so the accuracy and completeness of the user profiles increase. Hence a better result is obtained by this cluster optimization approach.


This paper deals with a cluster optimization technique. The web log is accessed and cleaned to remove crawler requests and requests for graphics. The cleaned web log is used for pattern analysis. This paper uses clustering to discover interesting usage patterns, and clustering is done on user sessions. ART1, a neural network based clustering technique, is used: the input patterns are given to an adaptive resonance network, and users with the same browsing pattern fall into the same cluster. The clusters obtained are then fed into the AntClusterTrack algorithm, a cluster optimization algorithm that uses the nestmate identification technique of ants. User profiles are then extracted from each cluster as sets of relevant urls. The user profiles discovered in one period of time are compared with the user profiles discovered in a later period, which allows us to discover and track evolving user profiles. Based on the user profiles, the web page is personalized. As a future enhancement, scalability can be taken into consideration.